iMet EDA

From: https://www.kaggle.com/yfyangd/imet-eda

Author: Yang Yuan Fu

The Metropolitan Museum of Art in New York, also known as The Met, is the largest art museum in the United States, with 6,953,927 visitors in 2018. I was drawn to it as well: I visited in May 2019 with my wife and three kids. My children are very interested in Egyptian artifacts and spent a long time in that exhibition area.

Its permanent collection contains over two million works of which over 200K have been digitized with imagery.

The online cataloguing information is generated by Subject Matter Experts (SME) and includes a wide range of data. The SME annotations can also be indirect in describing finer-grained attributes as a museum-goer would understand them. Adding fine-grained attributes to aid the visual understanding of the museum objects will make it possible to search for visually related objects.

In this study, we extract simple color features from the images and analyze how they relate to the attributes. A simple method (Random Forest) is used, and I hope it is useful for machine learning practitioners.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cv2
import os

Data Import

In [2]:
train = pd.read_csv('../input/train.csv')
train.head()
Out[2]:
id attribute_ids
0 1000483014d91860 147 616 813
1 1000fe2e667721fe 51 616 734 813
2 1001614cb89646ee 776
3 10041eb49b297c08 51 671 698 813 1092
4 100501c227f8beea 13 404 492 903 1093

Data Exploration

We view a sample image in three color representations: BGR, RGB, and HSV.

In [3]:
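# Load one training image; cv2.imread returns channels in BGR order
img_path='../input/train/'+train.id[5]+".png"
image=cv2.imread(img_path)
image_rgb=cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
hsv_image = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
# Mean hue, saturation and value over all pixels
h,s,v=np.average(hsv_image,axis=(0,1))

# imshow assumes RGB order, so the BGR and HSV panels appear in false color
plt.subplot(131),plt.imshow(image),plt.title('BGR')
plt.subplot(132),plt.imshow(image_rgb),plt.title('RGB')
plt.subplot(133),plt.imshow(hsv_image),plt.title('HSV')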
Out[3]:
(<matplotlib.axes._subplots.AxesSubplot at 0x7fe418f77240>,
 <matplotlib.image.AxesImage at 0x7fe418f3b198>,
 Text(0.5, 1.0, 'HSV'))
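
The difference between the BGR and RGB panels comes down to channel order: cv2.imread returns the channels as blue, green, red, while plt.imshow expects red, green, blue. The following is a minimal NumPy-only sketch on a synthetic two-pixel image (the pixel values are made up for illustration). It shows that reversing the channel axis produces the same channel order as the cv2.COLOR_BGR2RGB conversion used above.

```python
import numpy as np

# Synthetic 1x2 image in BGR order (hypothetical values):
# a pure-blue pixel followed by a pure-red pixel
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the last axis swaps the first and third channels (BGR -> RGB)
rgb = bgr[..., ::-1]

print(rgb[0, 0])  # blue pixel, now written in RGB order
print(rgb[0, 1])  # red pixel, now written in RGB order
```

Displaying the BGR array directly with imshow therefore shows blue objects as red and red objects as blue.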

Data Preprocessing

We try to remove the background of each image so that the object stands out. First, we convert the BGR image to HSV, where a color range is easier to express than in the RGB color space and can therefore be used to isolate a colored region. In our application, we threshold on a color range to separate the object from the background. The method is:

  • Convert the image from BGR to the HSV color space
  • Threshold the HSV image on a range of colors
  • Extract the object alone; any further processing can then be done on that image
In [4]:
image=cv2.imread(img_path)
img=cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# Convert to HSV. Note: img is already RGB, so COLOR_BGR2HSV reads its red
# channel as blue; cv2.COLOR_RGB2HSV would give the true hue values.
hsv_image = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
# Define the lower and upper HSV bounds of the color range to keep
lower_blue=np.array([20,25,15])
upper_blue=np.array([130,255,255])
# Threshold the HSV image: mask is 255 where the pixel is in range, else 0
mask=cv2.inRange(hsv_image,lower_blue,upper_blue)
# Bitwise-AND the mask with the original image to keep only in-range pixels
res = cv2.bitwise_and(img,img,mask=mask)
plt.subplot(131),plt.imshow(img),plt.title('ORIGINAL')
plt.subplot(132),plt.imshow(mask),plt.title('Mask')
plt.subplot(133),plt.imshow(res),plt.title('Res')
Out[4]:
(<matplotlib.axes._subplots.AxesSubplot at 0x7fe418e06ef0>,
 <matplotlib.image.AxesImage at 0x7fe418dc4e48>,
 Text(0.5, 1.0, 'Res'))
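
To make the thresholding and masking steps concrete, here is a minimal NumPy-only sketch of what they compute. The helper names in_range and apply_mask are hypothetical stand-ins for cv2.inRange and cv2.bitwise_and with a mask, and the 2x2 HSV array is synthetic. For brevity the mask is applied back to the HSV array itself, whereas the cell above applies it to the RGB image.

```python
import numpy as np

def in_range(hsv, lower, upper):
    """Return a uint8 mask: 255 where every channel lies in [lower, upper], else 0."""
    inside = np.all((hsv >= lower) & (hsv <= upper), axis=-1)
    return inside.astype(np.uint8) * 255

def apply_mask(img, mask):
    """Keep pixels where the mask is non-zero and set all other pixels to 0."""
    return img * (mask[..., None] > 0)

# Synthetic 2x2 HSV image (hypothetical values)
hsv = np.array([[[100, 200, 200], [10, 200, 200]],
                [[60, 10, 200], [125, 30, 20]]], dtype=np.uint8)

# Same bounds as in the notebook cell above
lower = np.array([20, 25, 15])
upper = np.array([130, 255, 255])

mask = in_range(hsv, lower, upper)
res = apply_mask(hsv, mask)
print(mask)
```

A pixel is kept only when hue, saturation and value all fall inside the bounds, so the second pixel (hue too low) and the third pixel (saturation too low) are zeroed out.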

Let's do the same for five images and check how well the object is extracted:

In [5]:
def mask(img_path):
    # Same steps as above, wrapped in a function: load, convert, threshold, mask
    image=cv2.imread(img_path)
    img=cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # img is already RGB here; cv2.COLOR_RGB2HSV would give the true hue values
    hsv_image = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    lower_blue=np.array([20,25,15])
    upper_blue=np.array([130,255,255])
    mask=cv2.inRange(hsv_image,lower_blue,upper_blue)
    res = cv2.bitwise_and(img,img,mask=mask)
    plt.subplot(131),plt.imshow(img),plt.title('ORIGINAL')
    plt.subplot(132),plt.imshow(mask),plt.title('Mask')
    plt.subplot(133),plt.imshow(res),plt.title('Res')
    plt.show()
    # res is an RGB image, so these are the mean R, G, B values of the masked image
    r,g,b=np.average(res,axis=(0,1))
    print(r, g, b)

# Show the result for the first five training images
for i in range(5):
    img_path='../input/train/'+train.id[i]+".png"
    mask(img_path)
165.70724680432645 155.00131760078662 143.93794493608652
22.46585500394011 20.96533490937746 19.68991331757289
122.93879452054794 104.72286757990868 95.91358904109589
160.94501862197393 145.44646182495345 133.1907355679702
135.76325126262626 115.3904356060606 94.31912247474747

Thanks to DHTT's kernel (https://www.kaggle.com/d5195295/hsv-analysis). After background removal, let's check the distribution of the features:

In [6]:
def image_feature_extracion(img_path):
    image=cv2.imread(img_path)
    img=cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # img is already RGB here; cv2.COLOR_RGB2HSV would give the true hue values
    hsv_image = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    lower_blue=np.array([20,25,15])
    upper_blue=np.array([130,255,255])
    mask=cv2.inRange(hsv_image,lower_blue,upper_blue)
    res = cv2.bitwise_and(img,img,mask=mask)
    # The averages are taken over the full HSV image; mask and res are not used
    h,s,v=np.average(hsv_image,axis=(0,1))
    return h,s,v

# Extract the features of the first 1000 training images
read_len=1000
hsv_list=[]
for i in range(read_len):
    img_path='../input/train/'+train.id[i]+".png"
    hsv_list.append(image_feature_extracion(img_path))

# Column "y" holds the mean saturation
df = pd.DataFrame(hsv_list, columns=["Hue", "y",'Brightness(Values)'])
sns.jointplot(x="Hue", y="Brightness(Values)", data=df)
Out[6]:
<seaborn.axisgrid.JointGrid at 0x7fe4180d95c0>
In [7]:
df.head()
Out[7]:
Hue y Brightness(Values)
0 104.608732 32.105320 193.992311
1 106.479961 20.198613 211.166572
2 110.689059 54.051269 149.338986
3 103.815540 43.876862 191.748492
4 105.284375 75.636641 155.093504
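
The feature values above are averaged over every pixel of the HSV image, background included. If the goal is to describe only the extracted object, one option is to average over the pixels kept by the mask. The sketch below illustrates this with NumPy on a synthetic 2x2 example; masked_channel_means is a hypothetical helper and not part of the notebook.

```python
import numpy as np

def masked_channel_means(img, mask):
    """Mean of each channel over the pixels where mask is non-zero."""
    keep = mask > 0
    if not keep.any():
        # No pixel selected: return zeros instead of dividing by zero
        return np.zeros(img.shape[-1])
    return img[keep].mean(axis=0)

# Synthetic HSV image: left column is the "object", right column is bright background
hsv = np.array([[[100, 200, 200], [10, 0, 255]],
                [[110, 100, 100], [0, 0, 255]]], dtype=np.uint8)
mask = np.array([[255, 0], [255, 0]], dtype=np.uint8)

full_means = hsv.reshape(-1, 3).mean(axis=0)     # all pixels, as in the cell above
object_means = masked_channel_means(hsv, mask)   # masked pixels only
print(full_means, object_means)
```

In this toy example the bright background pulls the full-image brightness up to 202.5, while the object alone averages 150.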

We extract the features from 10,000 images; processing all of the images would take a long time.

In [8]:
def image_feature_extracion(img_path,ID):
    image=cv2.imread(img_path)
    img=cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # img is already RGB here; cv2.COLOR_RGB2HSV would give the true hue values
    hsv_image = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    lower_blue=np.array([20,25,15])
    upper_blue=np.array([130,255,255])
    mask=cv2.inRange(hsv_image,lower_blue,upper_blue)
    res = cv2.bitwise_and(img,img,mask=mask)
    # The averages are taken over the full HSV image; mask and res are not used
    h,s,v=np.average(hsv_image,axis=(0,1))
    # Return the image ID together with its features
    return ID,h,s,v

read_len=10000 # 109237 for the full training set
hsv_list=[]
for i in range(read_len):
    img_path='../input/train/'+train.id[i]+".png"
    ID=train.id[i]
    hsv_list.append(image_feature_extracion(img_path,ID))

# Column "y" holds the mean saturation
df = pd.DataFrame(hsv_list, columns=["ID","Hue", "y",'Brightness(Values)'])
df.head()
Out[8]:
ID Hue y Brightness(Values)
0 1000483014d91860 104.608732 32.105320 193.992311
1 1000fe2e667721fe 106.479961 20.198613 211.166572
2 1001614cb89646ee 110.689059 54.051269 149.338986
3 10041eb49b297c08 103.815540 43.876862 191.748492
4 100501c227f8beea 105.284375 75.636641 155.093504
In [9]:
df.shape
Out[9]:
(10000, 4)

Data Preprocessing for Labels

In [10]:
train['attribute_ids'].head()
Out[10]:
0            147 616 813
1         51 616 734 813
2                    776
3    51 671 698 813 1092
4    13 404 492 903 1093
Name: attribute_ids, dtype: object
In [11]:
train["attribute_ids"] = train["attribute_ids"].apply(lambda x:x.split(" "))
train['attribute_ids'].head()
Out[11]:
0              [147, 616, 813]
1          [51, 616, 734, 813]
2                        [776]
3    [51, 671, 698, 813, 1092]
4    [13, 404, 492, 903, 1093]
Name: attribute_ids, dtype: object
In [12]:
labels = pd.read_csv('../input/labels.csv')
labels.shape
Out[12]:
(1103, 2)
In [13]:
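# Build one multi-hot vector per image: 1 at each attribute id, 0 elsewhere
train_labels = []
for label in train['attribute_ids'][:10000].values:
    zeros = np.zeros(labels.shape[0])
    for label_i in label:
        # Attribute ids are strings after the split above
        zeros[int(label_i)] = 1
    train_labels.append(zeros)

train_labels = np.asarray(train_labels)
train_labels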
Out[13]:
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
In [14]:
train_labels.shape
Out[14]:
(10000, 1103)
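
Because the printed array shows only zeros at its corners, it can be worth checking the encoding with a round trip. The sketch below is a minimal, self-contained version using the first three label lists from the Out[11] output above; multi_hot and decode are hypothetical helper names and not part of the notebook.

```python
import numpy as np

def multi_hot(label_lists, num_classes):
    """Encode lists of string attribute ids as rows of 0/1 values."""
    encoded = np.zeros((len(label_lists), num_classes))
    for row, ids in enumerate(label_lists):
        encoded[row, [int(i) for i in ids]] = 1
    return encoded

def decode(row):
    """Recover the attribute ids (as strings, ascending) from one encoded row."""
    return [str(i) for i in np.flatnonzero(row)]

# First three rows of train['attribute_ids'] after the split
sample = [["147", "616", "813"], ["51", "616", "734", "813"], ["776"]]
encoded = multi_hot(sample, num_classes=1103)

print(encoded.shape, encoded.sum(axis=1))
print(decode(encoded[0]))
```

Each row sum equals the number of attributes of that image, and decoding a row returns the original ids.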
In [15]:
# Targets: multi-hot labels; inputs: the three color features ("y" is saturation)
Y = train_labels
features = ['Hue','y','Brightness(Values)']
X = df[features]
print(Y.shape,X.shape)
(10000, 1103) (10000, 3)

Feature Analysis

In [16]:
from sklearn.ensemble import RandomForestClassifier
# Fit a random forest on the three color features to predict the multi-hot labels
model = RandomForestClassifier()
model.fit(X, Y)
sns.set(style="darkgrid")
fig, ax = plt.subplots(figsize=(6,6))
y_pos = np.arange(len(features))
# Horizontal bar chart of the impurity-based importance of each feature
plt.barh(y_pos, model.feature_importances_, align='center', alpha=0.4)
plt.yticks(y_pos, features)
plt.xlabel('importance')
plt.title('feature_importances')
plt.show()
/opt/conda/lib/python3.6/site-packages/sklearn/ensemble/forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

We analyze the features with a Random Forest. Brightness is the most important feature, followed by Saturation (the "y" column) and Hue. The results suggest that differences in the brightness of an object are associated with differences in how observers describe it.
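
To show how feature_importances_ behaves, here is a small sketch on synthetic data, not on the iMet features. The data, the two labels and the settings n_estimators=100 and random_state=0 are assumptions chosen for illustration. Only the third column carries any signal, so it should receive most of the importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the three color features (hypothetical data)
rng = np.random.default_rng(0)
X = rng.uniform(0, 255, size=(500, 3))

# Two binary labels that depend only on the third column
Y = np.column_stack([X[:, 2] > 128, X[:, 2] > 64]).astype(int)

# Fixing n_estimators and random_state makes the run reproducible
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, Y)

importances = model.feature_importances_
top = int(np.argmax(importances))
print(importances, top)
```

The importances are normalized to sum to 1, so they rank the features relative to each other and do not measure how well the model predicts the labels.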