EDA on iMet Collection 2019

From: https://www.kaggle.com/go1dfish/eda-on-imet-collection-2019

Author: GoldFish

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_palette("husl")
import os
print(os.listdir("../input/"))
import warnings
warnings.filterwarnings('ignore')
import gc
from pathlib import Path
from PIL import Image
from IPython.display import clear_output
from tqdm import tqdm_notebook as tqdm
['test', 'train', 'train.csv', 'labels.csv', 'sample_submission.csv']

Let's recognize artwork attributes from The Metropolitan Museum of Art

Note

This is a Kernels-only competition
Submissions to this competition must be made through Kernels. In order for the "Submit to Competition" button to be active after a commit, the following conditions must be met:

  • 9 hour runtime limit (including GPU Kernels)
  • No internet access enabled
  • Only whitelisted data is allowed
  • No custom packages
  • Submission file must be named "submission.csv"
    Please see the Kernels-only FAQ for more information on how to submit.

Files

The filename of each image is its id.

  • train.csv gives the attribute_ids for the train images in /train
  • /test contains the test images. You must predict the attribute_ids for these images.
  • sample_submission.csv contains a submission in the correct format
  • labels.csv provides descriptions of the attributes
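
Since the submission file must be named "submission.csv", a common pattern is to start from sample_submission.csv and overwrite its predictions. A minimal sketch, not part of this kernel (it assumes the sample file has "id" and "attribute_ids" columns, the usual iMet format):

# Sketch: build a correctly named submission from the provided sample file.
sub = pd.read_csv("../input/sample_submission.csv")
sub["attribute_ids"] = "0"  # placeholder prediction (a single attribute id per image)
sub.to_csv("submission.csv", index=False)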

labels.csv

  • labels.csv provides descriptions of the attributes
In [2]:
labels_df = pd.read_csv("../input/labels.csv")
In [3]:
labels_df.head()
Out[3]:
attribute_id attribute_name
0 0 culture::abruzzi
1 1 culture::achaemenid
2 2 culture::aegean
3 3 culture::afghan
4 4 culture::after british
In [4]:
labels_df.tail()
Out[4]:
attribute_id attribute_name
1098 1098 tag::writing implements
1099 1099 tag::writing systems
1100 1100 tag::zeus
1101 1101 tag::zigzag pattern
1102 1102 tag::zodiac
In [5]:
print(f"labels.csv have {labels_df.shape[0]} attributes_name.")
labels.csv have 1103 attributes_name.

Let's check the number of culture and tag attributes.

In [6]:
kind_dict = {}
for i in range(len(labels_df)):
    kind, name = labels_df.attribute_name[i].split("::")
    if(kind in kind_dict.keys()):
        kind_dict[kind] += 1
    else:
        kind_dict[kind] = 1
for key, val in kind_dict.items():
    print("The number of {} is {}({:.2%})".format(key, val, val/len(labels_df)))
The number of culture is 398(36.08%)
The number of tag is 705(63.92%)
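
The same breakdown can be obtained more concisely with pandas string methods. A sketch, not part of the original loop above (it relies on the imports in the first cell):

# Sketch: count attribute kinds ("culture" vs "tag") with pandas.
kind_counts = labels_df.attribute_name.str.split("::").str[0].value_counts()
print(kind_counts)
print((kind_counts / len(labels_df)).round(4))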
In [7]:
label_dict = labels_df.attribute_name.to_dict()

train.csv

  • train.csv gives the attribute_ids for the train images in /train
In [8]:
train_df = pd.read_csv("../input/train.csv")
train_df.head()
Out[8]:
id attribute_ids
0 1000483014d91860 147 616 813
1 1000fe2e667721fe 51 616 734 813
2 1001614cb89646ee 776
3 10041eb49b297c08 51 671 698 813 1092
4 100501c227f8beea 13 404 492 903 1093

Check the amount of train/test data!

In [9]:
test_path = Path("../input/test/")
test_num = len(list(test_path.glob("*.png")))
train_num = len(train_df)
fig, ax = plt.subplots()
sns.barplot(y=["train", "test"], x=[train_num, test_num])
ax.set_title("The amount of data")
clear_output()

The test set is small because this is a 2-stage competition.
The Data Description says "The second-stage test set is approximately five times the size of the first."
So please watch your memory usage in the kernel, for example by releasing large intermediates as sketched below.
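
A minimal sketch of releasing memory with the gc module imported in the first cell (big_array is a hypothetical name for any large intermediate you no longer need):

# Sketch: drop a large intermediate object and force garbage collection.
big_array = np.zeros((10000, 10000), dtype=np.float32)  # hypothetical large object
del big_array
gc.collect()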

In [10]:
id_len_dict = {}
id_num_dict = {}
for i in range(train_df.shape[0]):
    ids = list(map(int, train_df.attribute_ids[i].split()))
    id_len = len(ids)
    if(id_len in id_len_dict.keys()):
        id_len_dict[id_len] += 1
    else:
        id_len_dict[id_len] = 1
    for num in ids:
        if(num in id_num_dict.keys()):
            id_num_dict[num] += 1
        else:
            id_num_dict[num] = 1
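
The same counts can be cross-checked with pandas string methods, assuming a pandas version that supports Series.explode. A sketch, not part of the original kernel:

# Sketch: attributes per image and per-attribute frequency via pandas.
ids_per_image = train_df.attribute_ids.str.split().str.len()
print(ids_per_image.value_counts().sort_index())

attr_freq = train_df.attribute_ids.str.split().explode().astype(int).value_counts()
print(attr_freq.head())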

Check the number of attribute_ids per image and the appearance frequency of each attribute!

In [11]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
sns.barplot(x=list(id_len_dict.keys()), y=list(id_len_dict.values()), ax=ax1)
ax1.set_title("The number of attribute_id per image")
ax2.bar(list(id_num_dict.keys()), list(id_num_dict.values()))
ax2.set_title("Appearance frequency of attribute")
ax2.set_xticks(np.linspace(0, max(id_num_dict.keys()), 10, dtype='int'))
clear_output()
In [12]:
id_len_list = sorted(id_len_dict.items(), key=lambda x: -x[1])
print("The number of attribute_id per image\n")
print("{0:9s}{1:20s}".format("label num".rjust(9), "amount".rjust(20)))
for i in id_len_list:
    print("{0:9d}{1:20d}".format(i[0], i[1]))
The number of attribute_id per image

label num              amount
        2               37356
        3               29200
        4               20208
        5               10946
        6                6157
        1                4324
        7                 920
        8                 103
        9                  17
       10                   5
       11                   1
In [13]:
id_num_list = sorted(id_num_dict.items(), key=lambda x: -x[1])
print("Top 10 high appearance frequency attitude\n")
print("{0:4s}{1:15s}{2:30s}".format("rank".rjust(4), "num".rjust(15), "attitude_name".rjust(30)))
for i in range(10):
    print("{0:3d}.{1:15d}{2:30s}".format(i+1, id_num_list[i][1], (label_dict[id_num_list[i][0]]).rjust(30)))
Top 10 high appearance frequency attributes

rank            num                attribute_name
  1.          19970                      tag::men
  2.          14281                    tag::women
  3.          13522               culture::french
  4.          10375              culture::italian
  5.           9151             culture::american
  6.           8419                  tag::flowers
  7.           7615              culture::british
  8.           7394                culture::japan
  9.           6564      tag::utilitarian objects
 10.           6542             culture::egyptian
In [14]:
id_num_list = sorted(id_num_dict.items(), key=lambda x: x[1])
print("Top 16 low appearance frequency attitude\n")
print("{0:4s}{1:15s}{2:30s}".format("rank".rjust(4), "num".rjust(15), "attitude_name".rjust(50)))
for i in range(16):
    print("{0:3d}.{1:15d}{2:50s}".format(i+1, id_num_list[i][1], (label_dict[id_num_list[i][0]]).rjust(50)))
Top 16 low appearance frequency attributes

rank            num                                    attribute_name
  1.              1                     culture::freiburg im breisgau
  2.              1                                culture::tsimshian
  3.              1                                     culture::dyak
  4.              1                               culture::kholmogory
  5.              1                                 culture::algerian
  6.              1                                  culture::palermo
  7.              1                                   culture::skyros
  8.              1                                    culture::nimes
  9.              1                                  tag::mark antony
 10.              1                        culture::mennecy or sceaux
 11.              1                               culture::macedonian
 12.              1         culture::chinese with european decoration
 13.              1                                culture::populonia
 14.              1                              culture::zoroastrian
 15.              1                                    culture::dehua
 16.              2                        culture::central highlands
  • The number of attribute_ids per image
    Most images have 1-6 labels, and only a few have 7-11 labels.
    Only one image has 11 labels!
  • Appearance frequency of attributes
    The two most frequent tags are "men" and "women".
    The three most frequent cultures are "french", "italian" and "american".
    There are many attributes that appear only once or twice.
    I think we need to handle these rare attributes carefully, e.g. by flagging them as sketched below.
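
A simple way to flag the rare attributes is a frequency threshold. A sketch using the dictionaries built above (the threshold of 10 is an arbitrary assumption, not from this kernel):

# Sketch: collect attributes that appear fewer than min_count times in train.
min_count = 10  # arbitrary threshold
rare_ids = [attr_id for attr_id, cnt in id_num_dict.items() if cnt < min_count]
print(f"{len(rare_ids)} attributes appear fewer than {min_count} times")
print([label_dict[i] for i in rare_ids[:5]])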

train/test images

Let's show training images.

In [15]:
train_path = Path("../input/train/")
fig, ax = plt.subplots(3, figsize=(10, 20))
for i, index in enumerate(np.random.randint(0, len(train_df), 3)):
    path = (train_path / (train_df.id[index] + ".png"))
    img = np.asarray(Image.open(str(path)))
    ax[i].imshow(img)
    ids = list(map(int, train_df.attribute_ids[index].split()))
    for num, attribute_id in enumerate(ids):
        x_pos = img.shape[1] + 100
        y_pos = (img.shape[0] - 100) / len(ids) * num + 100
        ax[i].text(x_pos, y_pos, label_dict[attribute_id], fontsize=20)

Let's show test images.

In [16]:
test_path = Path("../input/test/")
test_img_paths = list(test_path.glob("*.png"))
fig, ax = plt.subplots(3, figsize=(10, 20))
for i, path in enumerate(np.random.choice(test_img_paths, 3)):
    img = np.asarray(Image.open(str(path)))
    ax[i].imshow(img)

Check train/test image area and size.

In [17]:
def check_area_size(folder_path):
    # Collect every image's area and track the images with extreme width/height.
    area_list = []
    max_width = None
    min_width = None
    max_height = None
    min_height = None
    img_paths = list(folder_path.glob("*.png"))
    for path in tqdm(img_paths):
        img = np.asarray(Image.open(str(path)))
        shape = img.shape
        area_list.append(shape[0]*shape[1])
        if(max_width is None):
            max_width = (shape[1], path)
            min_width = (shape[1], path)
            max_height = (shape[0], path)
            min_height = (shape[0], path)
        else:
            if(max_width[0] < shape[1]):
                max_width = (shape[1], path)
            elif(min_width[0] > shape[1]):
                min_width = (shape[1], path)
            if(max_height[0] < shape[0]):
                max_height = (shape[0], path)
            elif(min_height[0] > shape[0]):
                min_height = (shape[0], path)
    return area_list, max_width, min_width, max_height, min_height
In [18]:
train_area_list, train_max_width, train_min_width, train_max_height, train_min_height\
    = check_area_size(train_path)
clear_output()
In [19]:
print("test max area size is {}".format(max(train_area_list)))
print("test min area size is {}".format(min(train_area_list)))
print("Max area is {:.2f} times min area".format(max(train_area_list)/ min(train_area_list)))
test max area size is 2259300
test min area size is 90000
Max area is 25.10 times min area
In [20]:
print(f"max train image width is {train_max_width[0]}")
img = np.asarray(Image.open(str(train_max_width[1])))
plt.imshow(img)
plt.show()
max train image width is 5314

WTF!? what is this...

In [21]:
print(f"min train image width is {train_min_width[0]}")
img = np.asarray(Image.open(str(train_min_width[1])))
plt.imshow(img)
plt.show()
min train image width is 300
In [22]:
print(f"max train image height is {train_max_height[0]}")
img = np.asarray(Image.open(str(train_max_height[1])))
plt.imshow(img)
plt.show()
max train image height is 7531
In [23]:
print(f"min train image height is {train_min_height[0]}")
img = np.asarray(Image.open(str(train_min_height[1])))
plt.imshow(img)
plt.show()
min train image height is 300
In [24]:
test_area_list, test_max_width, test_min_width, test_max_height, test_min_height\
    = check_area_size(test_path)
clear_output()
In [25]:
print("test max area size is {}".format(max(test_area_list)))
print("test min area size is {}".format(min(test_area_list)))
print("Max area is {:.2f} times min area".format(max(test_area_list)/ min(test_area_list)))
test max area size is 1133100
test min area size is 90000
Max area is 12.59 times min area
In [26]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6), sharex=True)
sns.distplot(train_area_list, kde=False, ax=ax1)
ax1.set_title("Distribution of the train image area")
sns.distplot(test_area_list, kde=False, ax=ax2)
ax2.set_title("Distribution of the test image area")
plt.show()
In [27]:
print(f"max test image width is {test_max_width[0]}")
img = np.asarray(Image.open(str(test_max_width[1])))
plt.imshow(img)
plt.show()
max test image width is 3777
In [28]:
print(f"min train image width is {test_min_width[0]}")
img = np.asarray(Image.open(str(test_min_width[1])))
plt.imshow(img)
plt.show()
min test image width is 300
In [29]:
print(f"max train image height is {test_max_height[0]}")
img = np.asarray(Image.open(str(test_max_height[1])))
plt.imshow(img)
plt.show()
max test image height is 2886
In [30]:
print(f"min train image height is {test_min_height[0]}")
img = np.asarray(Image.open(str(test_min_height[1])))
plt.imshow(img)
plt.show()
min test image height is 300

Dataset images vary widely in size (the largest area is more than 10 times the smallest).
There are also images with an extreme aspect ratio (for example, roughly 280 x 5314).
I think we need to be careful about how we resize the images, as sketched below.
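
One option is to pad each image to a square before resizing, so extreme aspect ratios are not distorted. A minimal sketch with PIL (the target size of 300 and the black fill are arbitrary assumptions):

def pad_to_square_and_resize(img, size=300, fill=(0, 0, 0)):
    # Sketch: paste the image onto a square canvas, then resize.
    w, h = img.size
    side = max(w, h)
    canvas = Image.new("RGB", (side, side), fill)
    canvas.paste(img, ((side - w) // 2, (side - h) // 2))
    return canvas.resize((size, size), Image.BILINEAR)

# Example usage with one training image:
sample = Image.open(str(train_path / (train_df.id[0] + ".png"))).convert("RGB")
plt.imshow(pad_to_square_and_resize(sample))
plt.show()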

Thank you for reading!

I hope this helps.
Please tell me if I have made a mistake.