iMet Collection 2019 - EDA & Keras

From: https://www.kaggle.com/dimitreoliveira/imet-collection-2019-eda-keras

Author: DimitreOliveira

Score: 0.09

iMet Collection 2019 - FGVC6

Recognize artwork attributes from The Metropolitan Museum of Art

In this competition we are challenged to build models that add fine-grained attributes to aid the visual understanding of the museum objects. Of the roughly 1.5M objects in the collection, about 200k have been digitized and are provided here.

In this notebook I will be using a basic convolutional neural network to create a baseline.

Dependencies

In [1]:
import os
import cv2
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from keras import optimizers
from keras.models import Sequential
from keras.preprocessing.image import ImageDataGenerator
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D, Activation, BatchNormalization

%matplotlib inline
sns.set(style="whitegrid")
warnings.filterwarnings("ignore")

# Set seeds to make the experiment more reproducible.
from tensorflow import set_random_seed
from numpy.random import seed
set_random_seed(0)
seed(0)
Using TensorFlow backend.

Load data

In [2]:
train = pd.read_csv('../input/train.csv')
labels = pd.read_csv('../input/labels.csv')
test = pd.read_csv('../input/sample_submission.csv')

print('Number of train samples: ', train.shape[0])
print('Number of test samples: ', test.shape[0])
print('Number of labels: ', labels.shape[0])
display(train.head())
display(labels.head())
Number of train samples:  109237
Number of test samples:  7443
Number of labels:  1103
   id                attribute_ids
0  1000483014d91860  147 616 813
1  1000fe2e667721fe  51 616 734 813
2  1001614cb89646ee  776
3  10041eb49b297c08  51 671 698 813 1092
4  100501c227f8beea  13 404 492 903 1093

   attribute_id  attribute_name
0  0             culture::abruzzi
1  1             culture::achaemenid
2  2             culture::aegean
3  3             culture::afghan
4  4             culture::after british

Top 30 most frequent attributes

  • First, let's see which of the 1103 attributes are the most frequent.
In [3]:
attribute_ids = train['attribute_ids'].values
attributes = []
for item_attributes in [x.split(' ') for x in attribute_ids]:
    for attribute in item_attributes:
        attributes.append(int(attribute))
        
att_pd = pd.DataFrame(attributes, columns=['attribute_id'])
att_pd = att_pd.merge(labels)
top30 = att_pd['attribute_name'].value_counts()[:30].to_frame()
N_unique_att = att_pd['attribute_id'].nunique()
print('Number of unique attributes: ', N_unique_att)
f, ax = plt.subplots(figsize=(12, 8))
ax = sns.barplot(y=top30.index, x="attribute_name", data=top30, palette="rocket", order=reversed(top30.index))
ax.set_ylabel("Surface type")
ax.set_xlabel("Count")
sns.despine()
plt.show()
Number of unique attributes:  1103
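If you prefer a text view of the same ranking used in the bar chart, the counts can be printed directly from the top30 frame (an optional snippet; note that, since the frame comes from value_counts, the attribute names live in its index and the counts in a column that is itself named 'attribute_name'):

# Text view of the most frequent attributes (same data as the bar chart).
print(top30['attribute_name'].head(10))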
In [4]:
att_pd['tag'] = att_pd['attribute_name'].apply(lambda x:x.split('::')[0])
gp_att = att_pd.groupby('tag').count()

print('Number of attributes groups: ', gp_att.shape[0])
f, ax = plt.subplots(figsize=(12, 8))
ax = sns.barplot(y=gp_att.index, x="attribute_name", data=gp_att, palette="rocket")
ax.set_ylabel("Attribute group")
ax.set_xlabel("Count")
sns.despine()
plt.show()
Number of attributes groups:  2
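The grouping comes from the prefix before the '::' in each attribute name. To see which two groups those are and how many distinct attributes each one contains, here is a quick optional check on the same att_pd frame:

# List the attribute groups (prefix before '::') and the number of distinct attributes in each.
print(att_pd.groupby('tag')['attribute_id'].nunique())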

Number of tags per item

  • We saw in the training set that some items have more than one attribute tag; let's look at how many tags each item has.
In [5]:
train['Number of Tags'] = train['attribute_ids'].apply(lambda x: len(x.split(' ')))
f, ax = plt.subplots(figsize=(12, 8))
ax = sns.countplot(x="Number of Tags", data=train, palette="GnBu_d")
ax.set_ylabel("Surface type")
sns.despine()
plt.show()

Now let's see some of the items

In [6]:
sns.set_style("white")
count = 1
plt.figure(figsize=[20,20])
for img_name in os.listdir("../input/train/")[:20]:
    img = cv2.imread("../input/train/%s" % img_name)[...,[2, 1, 0]]
    plt.subplot(5, 5, count)
    plt.imshow(img)
    plt.title("Item %s" % count)
    count += 1
    
plt.show()
In [7]:
train["id"] = train["id"].apply(lambda x:x+".png")
test["id"] = test["id"].apply(lambda x:x+".png")
train["attribute_ids"] = train["attribute_ids"].apply(lambda x:x.split(" "))

Model

In [8]:
# Model parameters
BATCH_SIZE = 128
EPOCHS = 30
LEARNING_RATE = 0.0001
HEIGHT = 64
WIDTH = 64
CANAL = 3
N_CLASSES = N_unique_att
classes = list(map(str, range(N_CLASSES)))
In [9]:
model = Sequential()

model.add(Conv2D(filters=32, kernel_size=(5,5),padding='Same', input_shape=(HEIGHT, WIDTH, CANAL)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Conv2D(filters=32, kernel_size=(5,5),padding='Same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Dropout(0.5))

model.add(Conv2D(filters=64, kernel_size=(4,4),padding='Same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Conv2D(filters=64, kernel_size=(4,4),padding='Same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))
model.add(Dropout(0.5))

model.add(Flatten())

model.add(Dense(1024))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(N_CLASSES, activation="sigmoid"))
model.summary()

optimizer = optimizers.Adam(lr=LEARNING_RATE)
model.compile(optimizer=optimizer , loss="binary_crossentropy", metrics=["accuracy"])
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 64, 64, 32)        2432      
_________________________________________________________________
batch_normalization_1 (Batch (None, 64, 64, 32)        128       
_________________________________________________________________
activation_1 (Activation)    (None, 64, 64, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 64, 64, 32)        25632     
_________________________________________________________________
batch_normalization_2 (Batch (None, 64, 64, 32)        128       
_________________________________________________________________
activation_2 (Activation)    (None, 64, 64, 32)        0         
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 32, 32, 32)        0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 32, 32, 32)        0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 32, 32, 64)        32832     
_________________________________________________________________
batch_normalization_3 (Batch (None, 32, 32, 64)        256       
_________________________________________________________________
activation_3 (Activation)    (None, 32, 32, 64)        0         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 32, 32, 64)        65600     
_________________________________________________________________
batch_normalization_4 (Batch (None, 32, 32, 64)        256       
_________________________________________________________________
activation_4 (Activation)    (None, 32, 32, 64)        0         
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 16, 16, 64)        0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 16, 16, 64)        0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 16384)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 1024)              16778240  
_________________________________________________________________
batch_normalization_5 (Batch (None, 1024)              4096      
_________________________________________________________________
activation_5 (Activation)    (None, 1024)              0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 1024)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 1103)              1130575   
=================================================================
Total params: 18,040,175
Trainable params: 18,037,743
Non-trainable params: 2,432
_________________________________________________________________
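Because each item can carry several of the 1103 attributes at once, the last layer uses one sigmoid per class together with binary cross-entropy, so every attribute becomes an independent yes/no decision against a multi-hot target vector. Just to illustrate (this snippet is not part of the pipeline), the target for the first training item, which is tagged 147 616 813, would look like this:

# Illustrative only: multi-hot target for an item tagged with attributes 147, 616 and 813.
# Each of the N_CLASSES sigmoid outputs is compared against one entry of this vector
# by the binary cross-entropy loss.
example_target = np.zeros(N_CLASSES)
example_target[[147, 616, 813]] = 1
print(int(example_target.sum()))  # 3 positive labels out of 1103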
In [10]:
train_datagen=ImageDataGenerator(rescale=1./255, validation_split=0.25)
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator=train_datagen.flow_from_dataframe(
    dataframe=train,
    directory="../input/train",
    x_col="id",
    y_col="attribute_ids",
    batch_size=BATCH_SIZE,
    shuffle=True,
    class_mode="categorical",
    classes=classes,
    target_size=(HEIGHT, WIDTH),
    subset='training')

valid_generator=train_datagen.flow_from_dataframe(
    dataframe=train,
    directory="../input/train",
    x_col="id",
    y_col="attribute_ids",
    batch_size=BATCH_SIZE,
    shuffle=True,
    class_mode="categorical",    
    classes=classes,
    target_size=(HEIGHT, WIDTH),
    subset='validation')

test_generator = test_datagen.flow_from_dataframe(  
        dataframe=test,
        directory = "../input/test",    
        x_col="id",
        target_size = (HEIGHT, WIDTH),
        batch_size = 1,
        shuffle = False,
        class_mode = None)
Found 81928 images belonging to 1103 classes.
Found 27309 images belonging to 1103 classes.
Found 7443 images.
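With class_mode="categorical" and the attribute ids passed as lists of strings, flow_from_dataframe encodes each list into a multi-hot vector, so a batch should contain images of shape (128, 64, 64, 3) and labels of shape (128, 1103). If you want to double check, this optional snippet consumes one batch from the training generator:

# Optional sanity check: pull one batch and confirm the multi-label encoding.
x_batch, y_batch = next(train_generator)
print(x_batch.shape, y_batch.shape)  # expected: (128, 64, 64, 3) (128, 1103)
print(y_batch[0].sum())              # number of attributes on the first item of the batch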
In [11]:
STEP_SIZE_TRAIN = train_generator.n // train_generator.batch_size
STEP_SIZE_VAL = valid_generator.n // valid_generator.batch_size

history = model.fit_generator(generator=train_generator,
                    steps_per_epoch=STEP_SIZE_TRAIN,
                    validation_data=valid_generator,
                    validation_steps=STEP_SIZE_VAL,
                    epochs=EPOCHS,
                    verbose=2)
Epoch 1/30
 - 751s - loss: 0.0716 - acc: 0.9811 - val_loss: 0.1665 - val_acc: 0.9965
Epoch 2/30
 - 731s - loss: 0.0161 - acc: 0.9971 - val_loss: 0.0978 - val_acc: 0.9970
Epoch 3/30
 - 700s - loss: 0.0149 - acc: 0.9971 - val_loss: 0.0535 - val_acc: 0.9970
Epoch 4/30
 - 698s - loss: 0.0142 - acc: 0.9971 - val_loss: 0.0305 - val_acc: 0.9971
Epoch 5/30
 - 696s - loss: 0.0138 - acc: 0.9971 - val_loss: 0.0202 - val_acc: 0.9972
Epoch 6/30
 - 692s - loss: 0.0134 - acc: 0.9972 - val_loss: 0.0138 - val_acc: 0.9972
Epoch 7/30
 - 687s - loss: 0.0130 - acc: 0.9972 - val_loss: 0.0128 - val_acc: 0.9972
Epoch 8/30
 - 687s - loss: 0.0127 - acc: 0.9972 - val_loss: 0.0123 - val_acc: 0.9972
Epoch 9/30
 - 677s - loss: 0.0125 - acc: 0.9972 - val_loss: 0.0120 - val_acc: 0.9973
Epoch 10/30
 - 669s - loss: 0.0122 - acc: 0.9972 - val_loss: 0.0118 - val_acc: 0.9973
Epoch 11/30
 - 672s - loss: 0.0120 - acc: 0.9973 - val_loss: 0.0117 - val_acc: 0.9973
Epoch 12/30
 - 676s - loss: 0.0118 - acc: 0.9973 - val_loss: 0.0116 - val_acc: 0.9973
Epoch 13/30
 - 674s - loss: 0.0116 - acc: 0.9973 - val_loss: 0.0115 - val_acc: 0.9973
Epoch 14/30
 - 676s - loss: 0.0114 - acc: 0.9973 - val_loss: 0.0114 - val_acc: 0.9973
Epoch 15/30
 - 675s - loss: 0.0113 - acc: 0.9973 - val_loss: 0.0114 - val_acc: 0.9973
Epoch 16/30
 - 679s - loss: 0.0112 - acc: 0.9973 - val_loss: 0.0111 - val_acc: 0.9974
Epoch 17/30
 - 672s - loss: 0.0110 - acc: 0.9973 - val_loss: 0.0111 - val_acc: 0.9974
Epoch 18/30
 - 676s - loss: 0.0109 - acc: 0.9973 - val_loss: 0.0110 - val_acc: 0.9974
Epoch 19/30
 - 674s - loss: 0.0108 - acc: 0.9973 - val_loss: 0.0108 - val_acc: 0.9974
Epoch 20/30
 - 676s - loss: 0.0106 - acc: 0.9974 - val_loss: 0.0109 - val_acc: 0.9974
Epoch 21/30
 - 685s - loss: 0.0106 - acc: 0.9974 - val_loss: 0.0110 - val_acc: 0.9974
Epoch 22/30
 - 687s - loss: 0.0104 - acc: 0.9974 - val_loss: 0.0113 - val_acc: 0.9973
Epoch 23/30
 - 686s - loss: 0.0103 - acc: 0.9974 - val_loss: 0.0108 - val_acc: 0.9974
Epoch 24/30
 - 678s - loss: 0.0103 - acc: 0.9974 - val_loss: 0.0108 - val_acc: 0.9974
Epoch 25/30
 - 680s - loss: 0.0101 - acc: 0.9974 - val_loss: 0.0107 - val_acc: 0.9974
Epoch 26/30
 - 674s - loss: 0.0101 - acc: 0.9974 - val_loss: 0.0107 - val_acc: 0.9974
Epoch 27/30
 - 682s - loss: 0.0100 - acc: 0.9974 - val_loss: 0.0106 - val_acc: 0.9974
Epoch 28/30
 - 678s - loss: 0.0099 - acc: 0.9974 - val_loss: 0.0108 - val_acc: 0.9974
Epoch 29/30
 - 682s - loss: 0.0098 - acc: 0.9974 - val_loss: 0.0107 - val_acc: 0.9974
Epoch 30/30
 - 671s - loss: 0.0097 - acc: 0.9974 - val_loss: 0.0105 - val_acc: 0.9975

Model graph loss

In [12]:
sns.set_style("whitegrid")
fig, (ax1, ax2) = plt.subplots(1, 2, sharex='col', figsize=(20,7))

ax1.plot(history.history['acc'], label='Train Accuracy')
ax1.plot(history.history['val_acc'], label='Validation accuracy')
ax1.legend(loc='best')
ax1.set_title('Accuracy')

ax2.plot(history.history['loss'], label='Train loss')
ax2.plot(history.history['val_loss'], label='Validation loss')
ax2.legend(loc='best')
ax2.set_title('Loss')

plt.xlabel('Epochs')
sns.despine()
plt.show()
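Accuracy looks extremely high mostly because the label matrix is sparse (for any given item, almost all of the 1103 attributes are 0), so it is also worth checking a micro-averaged F2 score on the validation split, which is much closer to how the competition is evaluated. Below is a rough sketch of how that could be done, reusing the objects defined above and the same 0.3 threshold used for the submission; it is not part of the original run.

from sklearn.metrics import fbeta_score

# Rebuild the validation split without shuffling so predictions and labels stay aligned.
eval_generator = train_datagen.flow_from_dataframe(
    dataframe=train,
    directory="../input/train",
    x_col="id",
    y_col="attribute_ids",
    batch_size=BATCH_SIZE,
    shuffle=False,
    class_mode="categorical",
    classes=classes,
    target_size=(HEIGHT, WIDTH),
    subset='validation')

val_steps = int(np.ceil(eval_generator.n / eval_generator.batch_size))
val_probs, val_true = [], []
for _ in range(val_steps):
    x_batch, y_batch = next(eval_generator)
    val_probs.append(model.predict(x_batch))
    val_true.append(y_batch)
val_probs = np.concatenate(val_probs)
val_true = np.concatenate(val_true)

val_pred = (val_probs > 0.3).astype(int)
print('Validation micro F2: %.4f' % fbeta_score(val_true, val_pred, beta=2, average='micro'))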

Apply model to test set and output predictions

In [13]:
test_generator.reset()
n_steps = len(test_generator.filenames)
preds = model.predict_generator(test_generator, steps = n_steps)
In [14]:
predictions = []
for pred_ar in preds:
    valid = ''
    for idx, pred in enumerate(pred_ar):
        if pred > 0.3:  # Using 0.3 as threshold
            if len(valid) == 0:
                valid += str(idx)
            else:
                valid += (' %s' % idx)
    if len(valid) == 0:
        valid = str(np.argmax(pred_ar))
    predictions.append(valid)
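The loop above can also be written a bit more compactly with NumPy; this is just an equivalent sketch, not part of the original run:

# Equivalent, more compact version of the loop above: keep every attribute whose
# probability exceeds 0.3, falling back to the single most likely attribute
# when nothing clears the threshold.
predictions = []
for pred_ar in preds:
    idxs = np.nonzero(pred_ar > 0.3)[0]
    if idxs.size == 0:
        idxs = [np.argmax(pred_ar)]
    predictions.append(' '.join(str(i) for i in idxs))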
In [15]:
filenames=test_generator.filenames
results=pd.DataFrame({'id':filenames, 'attribute_ids':predictions})
results['id'] = results['id'].map(lambda x: str(x)[:-4])
results.to_csv('submission.csv',index=False)
results.head(10)
Out[15]:
   id                attribute_ids
0  10023b2cc4ed5f68  403
1  100fbe75ed8fd887  46
2  101b627524a04f19  546
3  10234480c41284c6  560 855 897
4  1023b0e2636dcea8  739
5  1039cd6cf85845c   105 138 444 988 997
6  103a5b3f83fbe88   105 209 897
7  10413aaae8d6a9a2  105 157 897
8  10423822b93a65ab  157
9  1052bf702cb099f7  105