A step by step EDA

From: https://www.kaggle.com/kiurtis/a-step-by-step-eda

Author: Kiurtis

In this step by step EDA, you will find:

  • 0. Data dummification before further preprocessing A way to dummify data in order to get them useable

  • I. Number and type of labels A brief look at the number of labels by image and their type

  • II. What are the most frequent labels? A brief look at the labels repartition

  • III. Label coexistence
    A brief study on labels correlation

  • IV. Random image displayer A simple function randomly displaying an image of a given label together with its shape and all its tags

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datetime
import re
from collections import Counter

# Ploting
import matplotlib.pyplot as plt
import seaborn as sns 

plt.rcParams['figure.figsize'] = (30,30)
%matplotlib inline

import os
print(os.listdir("../input"))

def append_ext(fn):
    return fn+".png"

def remove_ext(fn):
    return fn[:-4]
['train', 'labels.csv', 'train.csv', 'sample_submission.csv', 'test']
In [2]:
labels_list = pd.read_csv('../input/labels.csv')
labels = pd.read_csv("../input/train.csv")
test_submission = pd.read_csv('../input/sample_submission.csv')
labels['attribute_ids'] = labels['attribute_ids'].str.split(" ")
labels['id'] = labels['id'].apply(append_ext)
test_submission['id'] =test_submission['id'].apply(append_ext)
labels_list.head()
labels = labels

0. Data dummification before further preprocessing

In [3]:
start = datetime.datetime.now()

labels_dummified = pd.DataFrame(columns=labels_list['attribute_id'])
d_list = []
for index, row in labels.iterrows():
    for value in row['attribute_ids']:
        d_list.append({'name':row['id'], 
                       'value':value})
labels_dummified = labels_dummified.append(d_list, ignore_index=True)
labels_dummified = labels_dummified.groupby('name')['value'].value_counts()
labels_dummified = labels_dummified.unstack(level=-1).fillna(0)
labels_dummified = labels_dummified[[str(y) for y in sorted([int(x) for x in labels_dummified.columns])]]
labels_dummified.columns = labels_list['attribute_name']
end = datetime.datetime.now()
print("Elapsed time:",end-start)
labels_dummified.head()
Elapsed time: 0:00:43.970356
Out[3]:
attribute_name culture::abruzzi culture::achaemenid culture::aegean culture::afghan culture::after british culture::after german culture::after german original culture::after italian culture::after russian original culture::akkadian culture::alexandria-hadra culture::algerian culture::alsace culture::american culture::american or european culture::amsterdam culture::ansbach culture::antwerp culture::apulian culture::arabian culture::aragon culture::arica culture::asia minor culture::assyrian culture::atlantic watershed culture::attic culture::augsburg culture::augsburg decoration culture::augsburg original culture::austrian culture::avignon culture::avon culture::aztec culture::babylonian culture::babylonian or kassite culture::bactria-margiana archaeological complex culture::balinese culture::bavaria culture::bayreuth culture::beautiran ... tag::vegetables tag::venus tag::vestments tag::vests tag::victory tag::villages tag::vines tag::violas tag::violins tag::virgin mary tag::vishnu tag::volcanoes tag::vulcan tag::wagons tag::walking tag::wars tag::washing tag::watches tag::waterfalls tag::watermills tag::waves tag::weapons tag::weights and measures tag::wells tag::wind tag::windmills tag::windows tag::wine tag::winter tag::women tag::working tag::world war i tag::worshiping tag::wreaths tag::writing tag::writing implements tag::writing systems tag::zeus tag::zigzag pattern tag::zodiac
name
1000483014d91860.png 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1000fe2e667721fe.png 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1001614cb89646ee.png 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10041eb49b297c08.png 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
100501c227f8beea.png 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

I. Number and type of labels

How many labels do the images have?

In [4]:
fig, ax = plt.subplots(figsize=(20,10))
ax.hist(labels_dummified.sum(axis=1),bins=10)
plt.show()

All the object have at least one label, and usually two and more. Having more than 7 labels is pretty rare.

What are the types of labels and how frequent are they?

In [5]:
n_labels = labels_dummified.sum()
Counter([re.match('([a-z]+)::\w+',x)[1] for x in n_labels.index])
Out[5]:
Counter({'culture': 398, 'tag': 705})

There are only 2 type of labels:

  • the culture of the object
  • and the tag, ie its content

2/3 of the labels are related to tags, and 398 different cultures are present. Now it's interesting to see how often images do have a tag and/or a culture label.

How frequently images have culture & tag labels?

In [6]:
culture_columns = [x for x in labels_dummified.columns if x.startswith('culture')]
tag_columns = [x for x in labels_dummified.columns if x.startswith('tag')]

n_culture_labels = labels_dummified[culture_columns].sum(axis=1)
n_tag_labels = labels_dummified[tag_columns].sum(axis=1)


fig, ax = plt.subplots(figsize=(20,10))
ax.hist(n_culture_labels,bins=4,alpha=0.7)
ax.axvline(n_culture_labels.mean())

ax.hist(n_tag_labels,bins=9,alpha=0.7)
ax.axvline(n_tag_labels.mean(),color='orange')

print("Number of images with 0 culture label:",(n_culture_labels == 0 ).sum())

print("Number of images with 0 tag label:",(n_tag_labels == 0 ).sum())

plt.show()
Number of images with 0 culture label: 11872
Number of images with 0 tag label: 220

Tag and culture labels have very different behaviour:

  • There is almost always at least one tag labels. They are also largely non-exclusive, as the average number of tag labels per image is more than 2.
  • On the contrary, the culture label are often exclusive, with about 90k images having exactly one culture. It's also pretty common to have an object without identified culture in the dataset: more than 10k image have not culture label.

Those differences in behaviour will probably allows specific strategies to take them into account.

II. What are the most frequent labels?

In [7]:
labels_count = labels_dummified.sum().reset_index().sort_values(ascending=False,by=0)
labels_count
Out[7]:
attribute_name 0
813 tag::men 19970.0
1092 tag::women 14281.0
147 culture::french 13522.0
189 culture::italian 10375.0
13 culture::american 9151.0
671 tag::flowers 8419.0
51 culture::british 7615.0
194 culture::japan 7394.0
1059 tag::utilitarian objects 6564.0
121 culture::egyptian 6542.0
896 tag::portraits 5955.0
1046 tag::trees 5591.0
79 culture::china 5382.0
780 tag::leaves 5259.0
156 culture::german 5163.0
369 culture::turkish or venice 4416.0
744 tag::inscriptions 3890.0
477 tag::birds 3692.0
738 tag::human figures 3665.0
1034 tag::textile fragments 3570.0
188 culture::islamic 3500.0
835 tag::mythical creatures 3005.0
903 tag::profiles 2552.0
420 tag::animals 2548.0
1099 tag::writing systems 2327.0
552 tag::clothing and accessories 2180.0
485 tag::books 2097.0
776 tag::landscapes 2075.0
161 culture::greek 2050.0
489 tag::bowls 2045.0
... ... ...
3 culture::afghan 3.0
250 culture::nailsea 3.0
727 tag::hindu religious figures 3.0
100 culture::cypriot or phoenician 3.0
201 culture::konigsberg 2.0
108 culture::devonshire 2.0
312 culture::san sabastian 2.0
389 culture::vulci 2.0
240 culture::moche-wari 2.0
142 culture::for russian market 2.0
71 culture::central highlands 2.0
904 tag::prostitutes 2.0
271 culture::northwest china/eastern central asia 2.0
987 tag::slavery 2.0
187 culture::isin-larsaold babylonian 2.0
396 culture::zoroastrian 1.0
81 culture::chinese with european decoration 1.0
366 culture::tsimshian 1.0
328 culture::skyros 1.0
221 culture::macedonian 1.0
262 culture::nimes 1.0
293 culture::populonia 1.0
104 culture::dehua 1.0
112 culture::dyak 1.0
11 culture::algerian 1.0
230 culture::mennecy or sceaux 1.0
805 tag::mark antony 1.0
199 culture::kholmogory 1.0
281 culture::palermo 1.0
146 culture::freiburg im breisgau 1.0

1103 rows × 2 columns

In [8]:
labels_count.loc[labels_count['attribute_name'].str.startswith('culture')]
Out[8]:
attribute_name 0
147 culture::french 13522.0
189 culture::italian 10375.0
13 culture::american 9151.0
51 culture::british 7615.0
194 culture::japan 7394.0
121 culture::egyptian 6542.0
79 culture::china 5382.0
156 culture::german 5163.0
369 culture::turkish or venice 4416.0
188 culture::islamic 3500.0
161 culture::greek 2050.0
304 culture::roman 1881.0
111 culture::dutch 1762.0
335 culture::spanish 1403.0
99 culture::cypriot 1327.0
259 culture::netherlandish 1302.0
212 culture::london 838.0
70 culture::central european 831.0
283 culture::paris 810.0
131 culture::flemish 787.0
25 culture::attic 676.0
231 culture::mexican 595.0
180 culture::india 593.0
182 culture::indonesia 580.0
184 culture::iran 545.0
29 culture::austrian 530.0
378 culture::venice 523.0
45 culture::bohemian 515.0
127 culture::european 499.0
125 culture::etruscan 494.0
... ... ...
376 culture::urbino with gubbio luster 3.0
329 culture::smyrna 3.0
372 culture::united states 3.0
160 culture::gonia 3.0
3 culture::afghan 3.0
250 culture::nailsea 3.0
100 culture::cypriot or phoenician 3.0
201 culture::konigsberg 2.0
108 culture::devonshire 2.0
312 culture::san sabastian 2.0
389 culture::vulci 2.0
240 culture::moche-wari 2.0
142 culture::for russian market 2.0
71 culture::central highlands 2.0
271 culture::northwest china/eastern central asia 2.0
187 culture::isin-larsaold babylonian 2.0
396 culture::zoroastrian 1.0
81 culture::chinese with european decoration 1.0
366 culture::tsimshian 1.0
328 culture::skyros 1.0
221 culture::macedonian 1.0
262 culture::nimes 1.0
293 culture::populonia 1.0
104 culture::dehua 1.0
112 culture::dyak 1.0
11 culture::algerian 1.0
230 culture::mennecy or sceaux 1.0
199 culture::kholmogory 1.0
281 culture::palermo 1.0
146 culture::freiburg im breisgau 1.0

398 rows × 2 columns

In [9]:
labels_count.loc[labels_count['attribute_name'].str.startswith('tag')]
Out[9]:
attribute_name 0
813 tag::men 19970.0
1092 tag::women 14281.0
671 tag::flowers 8419.0
1059 tag::utilitarian objects 6564.0
896 tag::portraits 5955.0
1046 tag::trees 5591.0
780 tag::leaves 5259.0
744 tag::inscriptions 3890.0
477 tag::birds 3692.0
738 tag::human figures 3665.0
1034 tag::textile fragments 3570.0
835 tag::mythical creatures 3005.0
903 tag::profiles 2552.0
420 tag::animals 2548.0
1099 tag::writing systems 2327.0
552 tag::clothing and accessories 2180.0
485 tag::books 2097.0
776 tag::landscapes 2075.0
489 tag::bowls 2045.0
1039 tag::tools and equipment 2001.0
733 tag::horse riding 1895.0
612 tag::dishes 1789.0
962 tag::seals 1744.0
487 tag::bottles 1685.0
501 tag::buildings 1667.0
1062 tag::vases 1540.0
961 tag::sculpture 1526.0
541 tag::christ 1492.0
734 tag::horses 1480.0
483 tag::bodies of water 1472.0
... ... ...
752 tag::jason 10.0
854 tag::old testament 9.0
476 tag::billiards 9.0
643 tag::esther 9.0
798 tag::magicians 9.0
452 tag::bakers 9.0
523 tag::cathedrals 9.0
652 tag::fairies 8.0
599 tag::deities 8.0
561 tag::concerts 8.0
460 tag::bathsheba 8.0
812 tag::medea 8.0
917 tag::rectangles 7.0
845 tag::new testament 7.0
544 tag::christmas 7.0
635 tag::easter 7.0
892 tag::polka-dot pattern 7.0
527 tag::celestial bodies 6.0
843 tag::nero 6.0
883 tag::pinecones 6.0
431 tag::architects 5.0
1060 tag::vajrapani 5.0
873 tag::pentecost 5.0
787 tag::living rooms 4.0
855 tag::olive trees 4.0
1017 tag::sunflowers 3.0
727 tag::hindu religious figures 3.0
904 tag::prostitutes 2.0
987 tag::slavery 2.0
805 tag::mark antony 1.0

705 rows × 2 columns

The most frequent labels are either very common contents (men, women, flowers etc.)/type of artwork (portrait, inscription etc.) or cultures famous in art history and cultural production (French, Italian, British, American, Japanese, Chinese, Egyptian etc.).

Interestingly, almost all of the less frequent labels are culture ones (subculture, less famous culture, combination of cultures etc.). It may be interesting to see if some culture should not basically be merged, especially when the description seems so accurate that I am not sure that we will find them in the dataset.

III. Label coexistence

First, let's have a look at the coexistence between the 100 most frequent labels

In [10]:
most_frequent_labels = labels_dummified[labels_dummified.sum().sort_values(ascending=False)[:100].index]
labels_corr = most_frequent_labels.corr()

fig_dims = (30, 16)
fig, ax = plt.subplots(figsize=fig_dims)
sns.heatmap(labels_corr,ax=ax)
plt.plot()
Out[10]:
[]

To have a better look at the correlation, let's now take the 200 most frequent tags, and let's see what are the 40 most important correlation between them.

In [11]:
most_frequent_200_labels = labels_dummified[labels_dummified.sum().sort_values(ascending=False)[:200].index] # We take the most frequent labels
labels_corr_200 = most_frequent_200_labels.corr() # We then look at their correlations
largest_corr_200 = pd.DataFrame(np.sort(abs(labels_corr_200).values)[:,-2:-1], columns=['2nd-largest'],index=labels_corr_200.index) # And order by the second largest (absolute) correlation 
largest_corr_40 = largest_corr_200.sort_values(by='2nd-largest',ascending=False).iloc[:40] # And then just take the 40 first.

fig_dims = (30, 16)
fig, ax = plt.subplots(figsize=fig_dims)
sns.heatmap(most_frequent_200_labels.loc[:,largest_corr_40.index].corr(),ax=ax, vmax=0.7)
plt.plot()
Out[11]:
[]

Looking at the most frequent labels, we can see some main topics:

  • Plants ( tag::flowers, tag::leaves)
  • American (holywoodian/star?) modern culture (culture::american, tag::actress, tag::portrait, tag::women)
  • Ancient Greece (culture::greek, culture::attic, culture::south_italia)
  • Ancient Egypt (culture::egyptian, tag::scarabs,tag::hyeroglyph)
  • Ancient Mesopotamia (tag::cuneiform, tag::tablets, culture::babylonian)
  • Water scene (tag::bodies of water, tag::boat)
  • British culture (culture::british, tag::london)
  • Christian religion (tag:: virgin mary, tag::christ, tag::christian imagery)
  • Landscape (tag:: landscape, tag::tree, tag::houses, tag::mountain)
  • French culture (culture::french, culture::paris)
  • etc.
In [12]:
labels_count.loc[labels_count['attribute_name'].str.startswith('culture')]
Out[12]:
attribute_name 0
147 culture::french 13522.0
189 culture::italian 10375.0
13 culture::american 9151.0
51 culture::british 7615.0
194 culture::japan 7394.0
121 culture::egyptian 6542.0
79 culture::china 5382.0
156 culture::german 5163.0
369 culture::turkish or venice 4416.0
188 culture::islamic 3500.0
161 culture::greek 2050.0
304 culture::roman 1881.0
111 culture::dutch 1762.0
335 culture::spanish 1403.0
99 culture::cypriot 1327.0
259 culture::netherlandish 1302.0
212 culture::london 838.0
70 culture::central european 831.0
283 culture::paris 810.0
131 culture::flemish 787.0
25 culture::attic 676.0
231 culture::mexican 595.0
180 culture::india 593.0
182 culture::indonesia 580.0
184 culture::iran 545.0
29 culture::austrian 530.0
378 culture::venice 523.0
45 culture::bohemian 515.0
127 culture::european 499.0
125 culture::etruscan 494.0
... ... ...
376 culture::urbino with gubbio luster 3.0
329 culture::smyrna 3.0
372 culture::united states 3.0
160 culture::gonia 3.0
3 culture::afghan 3.0
250 culture::nailsea 3.0
100 culture::cypriot or phoenician 3.0
201 culture::konigsberg 2.0
108 culture::devonshire 2.0
312 culture::san sabastian 2.0
389 culture::vulci 2.0
240 culture::moche-wari 2.0
142 culture::for russian market 2.0
71 culture::central highlands 2.0
271 culture::northwest china/eastern central asia 2.0
187 culture::isin-larsaold babylonian 2.0
396 culture::zoroastrian 1.0
81 culture::chinese with european decoration 1.0
366 culture::tsimshian 1.0
328 culture::skyros 1.0
221 culture::macedonian 1.0
262 culture::nimes 1.0
293 culture::populonia 1.0
104 culture::dehua 1.0
112 culture::dyak 1.0
11 culture::algerian 1.0
230 culture::mennecy or sceaux 1.0
199 culture::kholmogory 1.0
281 culture::palermo 1.0
146 culture::freiburg im breisgau 1.0

398 rows × 2 columns

Let's now look at the correlation between the culture labels only

In [13]:
culture_labels_dummified = labels_dummified[labels_count.loc[labels_count['attribute_name'].str.startswith('culture')]['attribute_name']]
most_frequent_400_culture_labels = culture_labels_dummified[culture_labels_dummified.sum().sort_values(ascending=False)[:400].index] # We take the 400 most frequent labels
culture_labels_corr = most_frequent_400_culture_labels.corr() #
culture_corr = pd.DataFrame(np.sort(abs(culture_labels_corr).values)[:,-2:-1], columns=['2nd-largest'],index=culture_labels_corr.index)
culture_corr = culture_corr.sort_values(by='2nd-largest',ascending=False).iloc[:40]

fig_dims = (30, 16)
fig, ax = plt.subplots(figsize=fig_dims)
sns.heatmap(culture_labels_dummified.loc[:,culture_corr.index].corr(),ax=ax, vmax=0.7)
plt.plot()
Out[13]:
[]

Some culture labels are very similar, almost synonyms (london original/after british, augsburg original/after german). Notice that those strong correlations are also probably caused by a relative scarcity of the labels, as we see below.

In [14]:
pd.merge(culture_corr,labels_count,left_index=True,right_on='attribute_name') # To see the number of images for those tags
Out[14]:
2nd-largest attribute_name 0
213 1.000000 culture::london original 13.0
4 1.000000 culture::after british 13.0
5 0.840149 culture::after german 17.0
28 0.840149 culture::augsburg original 12.0
166 0.816489 culture::gurkha 4.0
181 0.816489 culture::indian or nepalese 6.0
27 0.807182 culture::augsburg decoration 13.0
228 0.807182 culture::meissen with german 17.0
254 0.784447 culture::naxos 8.0
97 0.784447 culture::cyclades 13.0
114 0.713100 culture::east greek/sardis 58.0
217 0.713100 culture::lydian 114.0
140 0.707097 culture::for iberian market 3.0
219 0.707097 culture::macao 6.0
357 0.707055 culture::thessaly 16.0
257 0.707055 culture::neolithic 32.0
110 0.672785 culture::dublin 45.0
185 0.672785 culture::irish 95.0
10 0.659341 culture::alexandria-hadra 10.0
120 0.659341 culture::egypt 23.0
90 0.645373 culture::copenhagen 30.0
102 0.645373 culture::danish 72.0
331 0.610194 culture::south italian 293.0
18 0.610194 culture::apulian 121.0
25 0.570598 culture::attic 676.0
161 0.570598 culture::greek 2050.0
29 0.516766 culture::austrian 530.0
383 0.516766 culture::vienna 148.0
61 0.484815 culture::campanian 71.0
162 0.448669 culture::greek islands 55.0
154 0.441192 culture::geneva 64.0
348 0.441192 culture::swiss 328.0
352 0.427787 culture::teano 13.0
308 0.413180 culture::russian 368.0
338 0.413180 culture::st. petersburg 63.0
351 0.399161 culture::tarentine 60.0
170 0.384715 culture::helladic 21.0
248 0.384715 culture::mycenaean 63.0
22 0.369242 culture::asia minor 22.0
329 0.369242 culture::smyrna 3.0

Let's now look at the correlation between the tag labels only

In [15]:
tag_labels_dummified = labels_dummified[labels_count.loc[labels_count['attribute_name'].str.startswith('tag')]['attribute_name']]
most_frequent_400_tag_labels = tag_labels_dummified[tag_labels_dummified.sum().sort_values(ascending=False)[:400].index] # We take the 300 most frequent labels
tag_labels_corr = most_frequent_400_tag_labels.corr() #
tag_corr = pd.DataFrame(np.sort(abs(tag_labels_corr).values)[:,-2:-1], columns=['2nd-largest'],index=tag_labels_corr.index)
tag_corr = tag_corr.sort_values(by='2nd-largest',ascending=False).iloc[:40]

fig_dims = (30, 16)
fig, ax = plt.subplots(figsize=fig_dims)
sns.heatmap(tag_labels_dummified.loc[:,tag_corr.index].corr(),ax=ax, vmax=0.7)
plt.plot()
Out[15]:
[]
There is less strong correlations for tags than for culture. However, the number of images here are more important - and thuse more useful.
In [16]:
pd.merge(tag_corr,labels_count,left_index=True,right_on='attribute_name') # To see the number ofimages for those tags
Out[16]:
2nd-largest attribute_name 0
582 0.858320 tag::cuneiform 387.0
1023 0.858320 tag::tablets 454.0
645 0.835230 tag::eve 93.0
406 0.835230 tag::adam 89.0
445 0.585224 tag::athletes 219.0
457 0.585224 tag::baseball 81.0
498 0.524181 tag::buddhism 766.0
497 0.524181 tag::buddha 302.0
780 0.443135 tag::leaves 5259.0
671 0.443135 tag::flowers 8419.0
482 0.443011 tag::bodhisattva 221.0
896 0.421642 tag::portraits 5955.0
405 0.421642 tag::actresses 1457.0
724 0.369256 tag::hieroglyphs 1083.0
432 0.369256 tag::architectural elements 254.0
1061 0.347089 tag::vase fragments 744.0
563 0.347089 tag::coptic 216.0
942 0.346239 tag::saint joseph 79.0
731 0.346239 tag::holy family 160.0
480 0.344148 tag::boats 1414.0
483 0.344148 tag::bodies of water 1472.0
1064 0.341496 tag::venus 236.0
583 0.341496 tag::cupid 427.0
581 0.308725 tag::crucifixion 332.0
541 0.308725 tag::christ 1492.0
955 0.307619 tag::scarabs 682.0
507 0.302957 tag::buttons 123.0
464 0.302957 tag::beads 859.0
858 0.298582 tag::ornament 593.0
709 0.298582 tag::grotesques 114.0
1092 0.296257 tag::women 14281.0
1072 0.292064 tag::virgin mary 816.0
542 0.290969 tag::christian imagery 866.0
566 0.266618 tag::correspondence 114.0
888 0.266158 tag::playing cards 141.0
688 0.266158 tag::games 167.0
776 0.264994 tag::landscapes 2075.0
1046 0.264994 tag::trees 5591.0
973 0.263161 tag::shepherds 159.0
971 0.263161 tag::sheep 216.0

There is somehow less strong tags coexistence than for culture labels. However, some thema are clearly delineated: adam & eve, sport, bouddhism etc.

IV. Random image displayer

In [17]:
def print_next_image(attribute_name="culture::abruzzi",dataset='train'):
    ''' This function generate images having "attribute_name" as label'''
    file = next(att_gen[attribute_name])
    img=plt.imread('../input/'+dataset+'/'+next(att_gen[attribute_name]))
    plt.imshow(img,aspect='auto')
    print("File:",file)
    print("Image shape:",img.shape)
    if dataset == 'train':
        idx_to_name = labels_list.set_index('attribute_id').to_dict()['attribute_name']
        labs = labels.loc[labels['id'] == file,'attribute_ids'].iloc[0]
        print('labels:', labs)
        print('labels names:', [idx_to_name[int(x)] for x in labs])
    plt.show()
In [18]:
att_gen= {}
for att in labels_list['attribute_name'][:10]:
    att_gen[att] = (x for x in labels_dummified.loc[labels_dummified[att] > 0].index)
In [19]:
print_next_image('culture::akkadian')
File: 1c08d4968e97ee8d.png
Image shape: (300, 815, 3)
labels: ['9', '819', '1099']
labels names: ['culture::akkadian', 'tag::military equipment', 'tag::writing systems']