EDA iMet

From: https://www.kaggle.com/nubatama/eda-imet

Author: nubatama

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

# basic libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# image libraries
import cv2
import matplotlib.pyplot as plt

# for splitting the data into train and test sets
from sklearn.model_selection import train_test_split

# for model
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Dropout, BatchNormalization, Flatten
from tensorflow.keras.layers import Add, Concatenate, GlobalAvgPool2D
from tensorflow.keras.layers import MaxPooling2D, SeparableConv2D 
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.
['test', 'train', 'train.csv', 'labels.csv', 'sample_submission.csv']

First of all...

English is not my first language, so my descriptions may be hard to read and understand.
Thank you for your patience.

Confirm Input Data

Read 'labels.csv' and check its contents.
There are 1,103 attributes.

In [2]:
# labels.csv: read both columns as strings (object dtype) so attribute ids can later be used as column names
labels_ds = pd.read_csv(filepath_or_buffer='../input/labels.csv', dtype={'attribute_id': object, 'attribute_name': object})
print(labels_ds.head())
print(labels_ds.tail())
print("")
print(labels_ds.info())
  attribute_id          attribute_name
0            0        culture::abruzzi
1            1     culture::achaemenid
2            2         culture::aegean
3            3         culture::afghan
4            4  culture::after british
     attribute_id           attribute_name
1098         1098  tag::writing implements
1099         1099     tag::writing systems
1100         1100                tag::zeus
1101         1101      tag::zigzag pattern
1102         1102              tag::zodiac

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1103 entries, 0 to 1102
Data columns (total 2 columns):
attribute_id      1103 non-null object
attribute_name    1103 non-null object
dtypes: object(2)
memory usage: 17.3+ KB
None

Read train.csv

Read train.csv into a pandas DataFrame.
train.csv contains no missing (N/A) values.
Each attribute_ids entry holds several space-separated values, so it needs to be split.

In [3]:
# train.csv
train_ds = pd.read_csv(filepath_or_buffer='../input/train.csv')
print(train_ds.head())
print("")
print(train_ds.info())
print("")
print(train_ds.head())
                 id        attribute_ids
0  1000483014d91860          147 616 813
1  1000fe2e667721fe       51 616 734 813
2  1001614cb89646ee                  776
3  10041eb49b297c08  51 671 698 813 1092
4  100501c227f8beea  13 404 492 903 1093

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109237 entries, 0 to 109236
Data columns (total 2 columns):
id               109237 non-null object
attribute_ids    109237 non-null object
dtypes: object(2)
memory usage: 1.7+ MB
None

                 id        attribute_ids
0  1000483014d91860          147 616 813
1  1000fe2e667721fe       51 616 734 813
2  1001614cb89646ee                  776
3  10041eb49b297c08  51 671 698 813 1092
4  100501c227f8beea  13 404 492 903 1093
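
As a minimal sketch of the splitting step mentioned above, the space-separated attribute_ids string can be turned into a list per image. The frame `toy_train` and the columns `attribute_list` and `n_attributes` are illustrative names that do not appear in the notebook; the two rows are copied from the preview above.

```python
import pandas as pd

# Toy rows copied from the train.csv preview above (toy_train is a hypothetical name).
toy_train = pd.DataFrame({
    "id": ["1000483014d91860", "1001614cb89646ee"],
    "attribute_ids": ["147 616 813", "776"],
})

# str.split() with no argument splits on any run of whitespace.
toy_train["attribute_list"] = toy_train["attribute_ids"].str.split()

# Number of attributes attached to each image.
toy_train["n_attributes"] = toy_train["attribute_list"].str.len()

print(toy_train[["id", "attribute_list", "n_attributes"]])
```

The ids stay strings here, which matches how labels.csv is read in this notebook.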

Check image files

The image files are in the '../input/train/' folder.
Each file name is the image id followed by '.png'.

In [4]:
print(os.listdir("../input/train/")[0:12])
['e232597c213332cd.png', '4c3e9596dafb4d13.png', '4712bc2351789604.png', 'be77f39c438a3448.png', '2ffa9ece3a622644.png', '9caa967cba20461c.png', 'a4a099220bcafb7.png', '5c6675ae34aa5307.png', '371705db8277fa72.png', '6c934d937be3d500.png', '2f305db4e14e9246.png', '635e5a6ef19476c.png']

Show image files and image attributes

Show the first 12 images with their height, width, and associated attributes.

In [5]:
# Image data
# Show the first 12 images listed in train.csv, with their size and attributes

fig = plt.figure(figsize=(10, 15))
for image_index in range(12):
    image_file_name = train_ds.iloc[image_index, 0]
    image_np = cv2.imread("../input/train/" + image_file_name + ".png")

    image_label = "{}\n height:{} width:{}\nattr:{}".format(
        image_file_name, image_np.shape[0], image_np.shape[1], train_ds.iloc[image_index, 1]
    )
    image_area = fig.add_subplot(4,3,image_index + 1, title=image_label)
    image_area.imshow(cv2.cvtColor(image_np, cv2.COLOR_BGR2RGB))
    
fig.tight_layout()
fig.show()

One-hot encoding image attributes

Encode each image's attributes from a list of id numbers into binary (0/1) columns. A simple hand-written function is used here because MultiLabelBinarizer did not work as expected.

In [6]:
# One hot encoding for multi labels.
def OneHotEncoding(rec):
    attribute_id_list = rec["attribute_ids"].split()
    for attribute_id in attribute_id_list:
        rec[attribute_id] = 1
    
    return rec

# Append new columns from list
def AppendColumns(df, columnList):
    for newColumn in columnList:
        df[newColumn] = 0
    
    return df
In [7]:
# Add one zero-filled column per attribute id in labels.csv, then set each row's own attribute columns to 1
train_ds_encoded = AppendColumns(train_ds, labels_ds['attribute_id'])
train_ds_encoded = train_ds_encoded.apply(OneHotEncoding, axis=1)
train_ds_encoded.head()
Out[7]:
id attribute_ids 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 ... 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102
0 1000483014d91860 147 616 813 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1000fe2e667721fe 51 616 734 813 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1001614cb89646ee 776 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 10041eb49b297c08 51 671 698 813 1092 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
4 100501c227f8beea 13 404 492 903 1093 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
In [8]:
# Append filename column
train_ds_encoded["filename"] = train_ds_encoded["id"] + ".png" 
train_ds_encoded.head()
Out[8]:
id attribute_ids 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 ... 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 filename
0 1000483014d91860 147 616 813 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1000483014d91860.png
1 1000fe2e667721fe 51 616 734 813 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1000fe2e667721fe.png
2 1001614cb89646ee 776 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1001614cb89646ee.png
3 10041eb49b297c08 51 671 698 813 1092 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 10041eb49b297c08.png
4 100501c227f8beea 13 404 492 903 1093 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 100501c227f8beea.png
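
For comparison, here is a hedged sketch of how the same encoding could be done with scikit-learn's MultiLabelBinarizer. The notebook does not say why it failed for the author. One common pitfall, which is only an assumption here, is passing the raw strings instead of split lists, in which case every character is treated as a label. The names `toy_train`, `all_ids`, and `toy_encoded` are illustrative, and `all_ids` stands in for labels_ds['attribute_id'].

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data (hypothetical): two images and four possible attribute ids.
toy_train = pd.DataFrame({
    "id": ["a", "b"],
    "attribute_ids": ["0 2", "1"],
})
all_ids = ["0", "1", "2", "3"]  # stands in for labels_ds['attribute_id']

# Passing classes= fixes the column order and keeps ids that never occur.
mlb = MultiLabelBinarizer(classes=all_ids)

# Each sample must be an iterable of labels, so split the strings first.
encoded = mlb.fit_transform(toy_train["attribute_ids"].str.split())

encoded_df = pd.DataFrame(encoded, columns=mlb.classes_, index=toy_train.index)
toy_encoded = pd.concat([toy_train, encoded_df], axis=1)
print(toy_encoded)
```

This avoids the row-by-row apply, which may matter with about 109,000 rows and 1,103 columns.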

Check frequency

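Check how often each attribute occurs.
81 attributes (about 7% of all attributes) occur fewer than 5 times in the training data, so the images that carry these rare attributes need to be oversampled. Note that the code below selects attributes with a count of 5 or fewer.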

In [9]:
summary_df = pd.DataFrame(data={'id':labels_ds['attribute_id'], 'attibute':labels_ds['attribute_name'], 'count':np.array(train_ds_encoded.iloc[:, 2:].sum(numeric_only=True))})
summary_df = summary_df.sort_values(by='count')
print(summary_df.head())
print(summary_df.tail(20))
      id                                   attibute  count
199  199                        culture::kholmogory      1
81    81  culture::chinese with european decoration      1
221  221                        culture::macedonian      1
230  230                 culture::mennecy or sceaux      1
366  366                         culture::tsimshian      1
        id                    attibute  count
1034  1034      tag::textile fragments   3570
738    738          tag::human figures   3665
477    477                  tag::birds   3692
744    744           tag::inscriptions   3890
369    369  culture::turkish or venice   4416
156    156             culture::german   5163
780    780                 tag::leaves   5259
79      79              culture::china   5382
1046  1046                  tag::trees   5591
896    896              tag::portraits   5955
121    121           culture::egyptian   6542
1059  1059    tag::utilitarian objects   6564
194    194              culture::japan   7394
51      51            culture::british   7615
671    671                tag::flowers   8419
13      13           culture::american   9151
189    189            culture::italian  10375
147    147             culture::french  13522
1092  1092                  tag::women  14281
813    813                    tag::men  19970
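
The frequency table above is built from the one-hot columns. As an alternative sketch, the same counts can be taken directly from the attribute_ids strings, without the wide matrix. The names `toy_train`, `counts`, and `rare_ids` are illustrative, and the threshold of 1 is chosen only to suit the toy data.

```python
import pandas as pd

# Toy data (hypothetical): three images with space-separated attribute ids.
toy_train = pd.DataFrame({
    "id": ["a", "b", "c"],
    "attribute_ids": ["147 813", "813", "51 813 147"],
})

# Split each string into a list, put one id per row, then count occurrences.
counts = (
    toy_train["attribute_ids"]
    .str.split()
    .explode()
    .value_counts()
)
print(counts)

# Ids at or below a chosen threshold are treated as rare.
rare_ids = counts[counts <= 1].index.tolist()
print(rare_ids)
```

Attributes that never occur in the training data do not show up in `counts` at all, unlike in the one-hot sum, where they would appear with a count of 0.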
In [10]:
# Attributes that occur 5 times or fewer, and the training rows that carry at least one of them
rare_attr_df = summary_df.sort_values(by='count').loc[summary_df['count'] <= 5]
rare_ids = set(rare_attr_df['id'])
rare_data_df = train_ds_encoded.loc[train_ds_encoded['attribute_ids'].apply(lambda ids: not rare_ids.isdisjoint(ids.split()))]
rare_data_df
Out[10]:
id attribute_ids 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 ... 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 filename
25 1014ac8807369589 103 180 573 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1014ac8807369589.png
199 107ea49bc5e84c1a 147 203 554 612 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 107ea49bc5e84c1a.png
451 110f113afcbd53e1 160 257 1061 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 110f113afcbd53e1.png
661 1178c36f22819170 160 257 1061 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1178c36f22819170.png
805 11c7143510270a1f 43 51 6 584 813 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 11c7143510270a1f.png
1220 12afadc4c66ffa6a 100 744 922 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 12afadc4c66ffa6a.png
1515 1344b255698c5067 71 616 1059 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1344b255698c5067.png
1673 13a0960b5a90e861 30 147 758 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13a0960b5a90e861.png
2062 1491e4ded34a118d 268 437 462 767 369 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1491e4ded34a118d.png
2621 15ca08f805bebbff 146 156 584 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15ca08f805bebbff.png
2692 15f5523afc30f7d7 189 367 650 671 889 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15f5523afc30f7d7.png
4039 18d1e3f331733ffb 115 671 835 974 1092 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 18d1e3f331733ffb.png
5189 1b5fd7de9a5bc566 189 376 489 663 707 1046 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1b5fd7de9a5bc566.png
5437 1be7d768f8f51d8e 189 291 671 780 965 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1be7d768f8f51d8e.png
6563 1e5aa2fc8e2a3354 189 813 873 1072 1092 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1e5aa2fc8e2a3354.png
6604 1e6f1124008edcd3 125 389 819 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1e6f1124008edcd3.png
7629 20d968544eaabcab 147 203 554 612 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 20d968544eaabcab.png
7658 20e86e64a21b9be1 497 802 970 1060 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 20e86e64a21b9be1.png
7754 211fef49b5818838 13 372 813 1092 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 211fef49b5818838.png
8230 2243a28894db074b 166 181 766 1059 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2243a28894db074b.png
8637 2331e1bcf89444b2 30 147 554 647 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2331e1bcf89444b2.png
9314 249ec5c107eac539 51 298 586 784 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 249ec5c107eac539.png
9749 259721f22682e928 290 489 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 259721f22682e928.png
9771 25a47be4f14c5c9d 532 787 813 959 1022 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25a47be4f14c5c9d.png
10234 26a3b66286e117a1 132 515 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 26a3b66286e117a1.png
10323 26d10c930034146a 51 250 487 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 26d10c930034146a.png
10664 279a2688aa310f75 22 161 329 961 1092 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 279a2688aa310f75.png
11183 28b609b5f9dae6fb 94 606 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 28b609b5f9dae6fb.png
11847 2a4163797b2cae6d 366 590 738 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2a4163797b2cae6d.png
12178 2b1335e1921ca1e1 51 211 758 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2b1335e1921ca1e1.png
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
94762 e0d73d9adce047d8 31 147 451 813 961 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 e0d73d9adce047d8.png
94915 e125fe98aebdd948 271 1047 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 e125fe98aebdd948.png
95257 e1e67d3002515808 160 257 1061 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 e1e67d3002515808.png
95981 e37fa596ee3606de 125 293 813 974 994 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 e37fa596ee3606de.png
96127 e3cf7cab469afda8 156 477 813 873 1092 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 e3cf7cab469afda8.png
96139 e3d32393a44b9041 271 671 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 e3d32393a44b9041.png
96255 e4160a5ff90af6a4 189 434 532 787 1005 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 e4160a5ff90af6a4.png
96716 e51b2e3711e8f4e5 30 147 671 758 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 e51b2e3711e8f4e5.png
96797 e5477223c0e4775 147 431 615 835 858 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 e5477223c0e4775.png
97276 e64edd8deaf16f05 108 671 1035 369 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 e64edd8deaf16f05.png
97386 e68ea3ae5d263fc2 121 433 713 855 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 e68ea3ae5d263fc2.png
99224 eaa059991a514a62 22 161 329 642 961 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 eaa059991a514a62.png
100042 ec698b900e458e6 20 335 485 671 682 939 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ec698b900e458e6.png
101395 ef3dec5a4d926396 396 1034 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ef3dec5a4d926396.png
101529 ef8dc67b08ebbd3f 198 734 1059 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ef8dc67b08ebbd3f.png
101886 f05f0902369ad92a 21 1053 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 f05f0902369ad92a.png
101963 f08e87da37214e88 268 600 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 f08e87da37214e88.png
102054 f0b8b90cfcbd92d8 190 418 420 786 813 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 f0b8b90cfcbd92d8.png
102218 f1178f42f3b09f4e 21 552 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 f1178f42f3b09f4e.png
102379 f177e76cad1a216f 147 203 586 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 f177e76cad1a216f.png
104022 f4f7edca73cf1f26 372 1039 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 f4f7edca73cf1f26.png
105024 f7140a3d2721277a 290 496 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 f7140a3d2721277a.png
105175 f76afefa9fa14d05 103 587 1059 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 f76afefa9fa14d05.png
105972 f90699aa8a6218be 125 129 477 741 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 f90699aa8a6218be.png
106914 fb005d7e0e642a7d 312 1092 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 fb005d7e0e642a7d.png
106983 fb1bf1ac14ac4db3 92 138 477 671 780 369 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 fb1bf1ac14ac4db3.png
107066 fb48196e4be2807b 258 482 498 703 1060 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 fb48196e4be2807b.png
107586 fc6968c3cee2a2aa 104 138 616 369 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 fc6968c3cee2a2aa.png
108314 fdfbd92127981390 359 447 482 803 1060 1099 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 fdfbd92127981390.png
108557 fe79cb978d582542 701 855 1059 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 fe79cb978d582542.png

251 rows × 1106 columns

In [11]:
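# Oversample: add 10 extra copies of every row that carries a rare attribute.
# pd.concat is used because DataFrame.append is not available in pandas 2.0 and later.
train_df_2 = pd.concat([train_ds_encoded] + [rare_data_df] * 10)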

train_df_2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 111747 entries, 0 to 108557
Columns: 1106 entries, id to filename
dtypes: int64(1103), object(3)
memory usage: 943.8+ MB
In [12]:
# Separate data and label
train_df_X = train_df_2.iloc[:, 0]
train_df_y = train_df_2.iloc[:, 2:]

# Split train and test
X_train, X_test, y_train, y_test = train_test_split(train_df_X, train_df_y, test_size=0.10, random_state=42)
In [13]:
train_df_train = train_df_2.sample(frac=0.9, random_state=42)
train_df_test = train_df_2.drop(train_df_train.index)
print("{} {} {}".format(len(train_df_2), len(train_df_train), len(train_df_test)))
111747 100572 10899
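
The three counts printed above do not add up: 100572 + 10899 = 111471, which is 276 fewer than 111747. A likely cause, stated here as an assumption rather than something the notebook confirms, is that the oversampled rows keep their original index labels, so drop() removes every row that shares a label with a sampled row. The toy sketch below reproduces the effect and shows one possible remedy, reset_index; all names in it are illustrative.

```python
import pandas as pd

# Toy frame with a duplicated index label, as happens when rows are appended
# without resetting the index (all names here are hypothetical).
base = pd.DataFrame({"value": [10, 20, 30]})   # index labels 0, 1, 2
toy = pd.concat([base, base.loc[[1]]])          # index labels 0, 1, 2, 1

# Suppose the training part holds only one of the two rows labeled 1.
train_part = toy.iloc[[0, 1]]                   # labels 0 and 1

# drop() works by label, so BOTH rows labeled 1 are removed from the rest.
test_by_label = toy.drop(train_part.index)
print(len(toy), len(train_part), len(test_by_label))  # the parts do not add up

# Giving every row a unique label first makes the split a true partition.
toy_unique = toy.reset_index(drop=True)
train_unique = toy_unique.sample(frac=0.5, random_state=42)
test_unique = toy_unique.drop(train_unique.index)
print(len(toy_unique), len(train_unique), len(test_unique))
```

Even with unique labels, copies of the same oversampled image can land in both parts, so the held-out score for rare attributes may be optimistic.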

Check Labels

In [14]:
# Split each label into its main category and sub category
splitted_attr = labels_ds['attribute_name'].str.split('::', expand = True)
splitted_attr.columns = ['main', 'sub']
splitted_attr
Out[14]:
main sub
0 culture abruzzi
1 culture achaemenid
2 culture aegean
3 culture afghan
4 culture after british
5 culture after german
6 culture after german original
7 culture after italian
8 culture after russian original
9 culture akkadian
10 culture alexandria-hadra
11 culture algerian
12 culture alsace
13 culture american
14 culture american or european
15 culture amsterdam
16 culture ansbach
17 culture antwerp
18 culture apulian
19 culture arabian
20 culture aragon
21 culture arica
22 culture asia minor
23 culture assyrian
24 culture atlantic watershed
25 culture attic
26 culture augsburg
27 culture augsburg decoration
28 culture augsburg original
29 culture austrian
... ... ...
1073 tag vishnu
1074 tag volcanoes
1075 tag vulcan
1076 tag wagons
1077 tag walking
1078 tag wars
1079 tag washing
1080 tag watches
1081 tag waterfalls
1082 tag watermills
1083 tag waves
1084 tag weapons
1085 tag weights and measures
1086 tag wells
1087 tag wind
1088 tag windmills
1089 tag windows
1090 tag wine
1091 tag winter
1092 tag women
1093 tag working
1094 tag world war i
1095 tag worshiping
1096 tag wreaths
1097 tag writing
1098 tag writing implements
1099 tag writing systems
1100 tag zeus
1101 tag zigzag pattern
1102 tag zodiac

1103 rows × 2 columns

In [15]:
print(splitted_attr['main'].drop_duplicates())
print('culture : {}; tag : {}'.format(len(splitted_attr.loc[splitted_attr.main == 'culture']), len(splitted_attr.loc[splitted_attr.main == 'tag'])))
0      culture
398        tag
Name: main, dtype: object
culture : 398; tag : 705
In [16]:
print(splitted_attr['sub'].drop_duplicates())
0                      abruzzi
1                   achaemenid
2                       aegean
3                       afghan
4                after british
5                 after german
6        after german original
7                after italian
8       after russian original
9                     akkadian
10            alexandria-hadra
11                    algerian
12                      alsace
13                    american
14        american or european
15                   amsterdam
16                     ansbach
17                     antwerp
18                     apulian
19                     arabian
20                      aragon
21                       arica
22                  asia minor
23                    assyrian
24          atlantic watershed
25                       attic
26                    augsburg
27         augsburg decoration
28           augsburg original
29                    austrian
                 ...          
1073                    vishnu
1074                 volcanoes
1075                    vulcan
1076                    wagons
1077                   walking
1078                      wars
1079                   washing
1080                   watches
1081                waterfalls
1082                watermills
1083                     waves
1084                   weapons
1085      weights and measures
1086                     wells
1087                      wind
1088                 windmills
1089                   windows
1090                      wine
1091                    winter
1092                     women
1093                   working
1094               world war i
1095                worshiping
1096                   wreaths
1097                   writing
1098        writing implements
1099           writing systems
1100                      zeus
1101            zigzag pattern
1102                    zodiac
Name: sub, Length: 1103, dtype: object

There are 2 main categories, 'culture' and 'tag'. There are 1,103 sub categories, with no duplicates.
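
The same two checks can be sketched more directly with value_counts and is_unique. The series `toy_labels` is an illustrative stand-in for labels_ds['attribute_name'], using three names taken from the listing above.

```python
import pandas as pd

# Toy stand-in for labels_ds['attribute_name'] (hypothetical variable name).
toy_labels = pd.Series(
    ["culture::abruzzi", "culture::aegean", "tag::zeus"],
    name="attribute_name",
)

# Split on the first '::' only, in case a sub name ever contains '::' itself.
parts = toy_labels.str.split("::", n=1, expand=True)
parts.columns = ["main", "sub"]

# Number of attributes per main category.
main_counts = parts["main"].value_counts()
print(main_counts)

# True when no sub category name is repeated.
sub_is_unique = parts["sub"].is_unique
print(sub_is_unique)
```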

Check correlation

In [17]:
train_ds_encoded.head()
Out[17]:
id attribute_ids 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 ... 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 filename
0 1000483014d91860 147 616 813 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1000483014d91860.png
1 1000fe2e667721fe 51 616 734 813 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1000fe2e667721fe.png
2 1001614cb89646ee 776 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1001614cb89646ee.png
3 10041eb49b297c08 51 671 698 813 1092 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 10041eb49b297c08.png
4 100501c227f8beea 13 404 492 903 1093 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 100501c227f8beea.png
In [18]:
corr_df = train_ds_encoded.iloc[:, 2:-1].corr()
corr_df.head()
Out[18]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 ... 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102
0 1.000000 -0.000389 -0.000145 -0.000067 -0.000140 -0.000160 -0.000078 -0.000067 -0.000145 -0.000206 -0.000123 -0.000039 -0.000145 -0.003882 -0.000786 -0.000227 -0.000135 -0.000174 -0.000427 -0.000103 -0.000078 -0.000087 -0.000182 -0.000607 -0.000249 -0.001013 -0.000517 -0.000140 -0.000135 -0.000896 -0.000087 -0.000087 -0.000330 -0.000665 -0.000078 -0.000424 -0.000067 -0.000103 -0.000123 -0.000067 ... -0.000269 -0.000597 -0.000408 -0.000389 -0.000182 -0.000426 -0.000465 -0.000227 -0.000413 -0.001114 -0.000283 -0.000246 -0.000227 -0.000145 -0.000343 -0.000280 -0.000194 -0.000373 -0.000465 -0.000223 -0.000415 -0.001209 -0.000587 -0.000269 -0.000155 -0.000306 -0.000616 -0.000160 -0.000330 -0.004979 -0.000971 -0.000135 -0.000306 -0.000455 -0.000489 -0.000859 -0.001894 -0.000213 -0.000358 -0.000216
1 -0.000389 1.000000 -0.000343 -0.000159 -0.000330 -0.000378 -0.000183 -0.000159 -0.000343 -0.000485 -0.000290 -0.000092 -0.000343 -0.009153 -0.001853 -0.000534 -0.000317 -0.000410 -0.001008 -0.000242 -0.000183 -0.000205 -0.000430 -0.001432 -0.000587 -0.002389 -0.001219 -0.000330 -0.000317 -0.002114 -0.000205 -0.000205 -0.000777 -0.001567 -0.000183 -0.001000 -0.000159 -0.000242 -0.000290 -0.000159 ... -0.000635 -0.001408 -0.000961 -0.000916 -0.000430 -0.001004 -0.001096 -0.000534 -0.000974 -0.002626 -0.000667 -0.000579 -0.000534 -0.000343 -0.000809 -0.000661 -0.000458 -0.000879 -0.001096 -0.000526 -0.000978 0.000390 -0.001384 -0.000635 -0.000366 -0.000721 -0.001453 -0.000378 -0.000777 -0.011739 -0.002291 -0.000317 -0.000721 -0.001073 -0.001152 -0.002026 -0.002369 -0.000502 -0.000845 -0.000510
2 -0.000145 -0.000343 1.000000 -0.000059 -0.000124 -0.000141 -0.000069 -0.000059 -0.000128 -0.000181 -0.000108 -0.000034 -0.000128 -0.003423 -0.000693 -0.000200 -0.000119 -0.000153 -0.000377 -0.000091 -0.000069 -0.000077 -0.000161 -0.000536 -0.000219 -0.000893 -0.000456 -0.000124 -0.000119 -0.000791 -0.000077 -0.000077 -0.000291 -0.000586 -0.000069 -0.000374 -0.000059 -0.000091 -0.000108 -0.000059 ... -0.000237 -0.000527 -0.000359 -0.000343 -0.000161 -0.000375 -0.000410 -0.000200 -0.000364 -0.000982 -0.000249 -0.000217 -0.000200 -0.000128 -0.000303 -0.000247 -0.000171 -0.000329 -0.000410 -0.000197 -0.000366 -0.001067 -0.000518 -0.000237 -0.000137 -0.000270 -0.000543 -0.000141 -0.000291 -0.004391 -0.000857 -0.000119 -0.000270 -0.000401 -0.000431 -0.000758 -0.001670 -0.000188 -0.000316 -0.000191
3 -0.000067 -0.000159 -0.000059 1.000000 -0.000057 -0.000065 -0.000032 -0.000027 -0.000059 -0.000084 -0.000050 -0.000016 -0.000059 -0.001585 -0.000321 -0.000092 -0.000055 -0.000071 -0.000175 -0.000042 -0.000032 -0.000035 -0.000074 -0.000248 -0.000102 -0.000414 -0.000211 -0.000057 -0.000055 -0.000366 -0.000035 -0.000035 -0.000135 -0.000271 -0.000032 -0.000173 -0.000027 -0.000042 -0.000050 -0.000027 ... -0.000110 -0.000244 -0.000166 -0.000159 -0.000074 -0.000174 -0.000190 -0.000092 -0.000169 -0.000455 -0.000115 -0.000100 -0.000092 -0.000059 -0.000140 -0.000114 -0.000079 -0.000152 -0.000190 -0.000091 -0.000169 -0.000494 -0.000240 -0.000110 -0.000063 -0.000125 -0.000251 -0.000065 -0.000135 -0.002032 -0.000397 -0.000055 -0.000125 -0.000186 -0.000199 -0.000351 -0.000773 -0.000087 -0.000146 -0.000088
4 -0.000140 -0.000330 -0.000124 -0.000057 1.000000 -0.000136 -0.000066 -0.000057 -0.000124 -0.000175 -0.000104 -0.000033 -0.000124 -0.003299 -0.000668 -0.000193 -0.000114 -0.000148 -0.000363 -0.000087 -0.000066 -0.000074 -0.000155 -0.000516 -0.000211 -0.000861 -0.000440 -0.000119 -0.000114 -0.000762 -0.000074 -0.000074 -0.000280 -0.000565 -0.000066 -0.000360 -0.000057 -0.000087 -0.000104 -0.000057 ... -0.000229 -0.000508 -0.000346 -0.000330 -0.000155 -0.000362 -0.000395 -0.000193 -0.000351 -0.000946 -0.000240 -0.000209 -0.000193 -0.000124 -0.000292 -0.000238 -0.000165 -0.000317 -0.000395 -0.000190 -0.000353 0.016946 -0.000499 -0.000229 -0.000132 -0.000260 -0.000524 -0.000136 -0.000280 -0.004231 -0.000826 -0.000114 -0.000260 -0.000387 -0.000415 -0.000730 -0.001610 -0.000181 -0.000304 -0.000184
In [19]:
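# Replace correlations of exactly 1 (the diagonal self-correlations) with 0 so they
# cannot win the max search below, then take absolute values.
corr_df2 = corr_df.replace(1, 0).abs()
# Keep each row's attribute id as a regular column for later lookups.
corr_df2['id'] = corr_df2.index
corr_df2.head()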
Out[19]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 ... 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 id
0 0.000000 0.000389 0.000145 0.000067 0.000140 0.000160 0.000078 0.000067 0.000145 0.000206 0.000123 0.000039 0.000145 0.003882 0.000786 0.000227 0.000135 0.000174 0.000427 0.000103 0.000078 0.000087 0.000182 0.000607 0.000249 0.001013 0.000517 0.000140 0.000135 0.000896 0.000087 0.000087 0.000330 0.000665 0.000078 0.000424 0.000067 0.000103 0.000123 0.000067 ... 0.000597 0.000408 0.000389 0.000182 0.000426 0.000465 0.000227 0.000413 0.001114 0.000283 0.000246 0.000227 0.000145 0.000343 0.000280 0.000194 0.000373 0.000465 0.000223 0.000415 0.001209 0.000587 0.000269 0.000155 0.000306 0.000616 0.000160 0.000330 0.004979 0.000971 0.000135 0.000306 0.000455 0.000489 0.000859 0.001894 0.000213 0.000358 0.000216 0
1 0.000389 0.000000 0.000343 0.000159 0.000330 0.000378 0.000183 0.000159 0.000343 0.000485 0.000290 0.000092 0.000343 0.009153 0.001853 0.000534 0.000317 0.000410 0.001008 0.000242 0.000183 0.000205 0.000430 0.001432 0.000587 0.002389 0.001219 0.000330 0.000317 0.002114 0.000205 0.000205 0.000777 0.001567 0.000183 0.001000 0.000159 0.000242 0.000290 0.000159 ... 0.001408 0.000961 0.000916 0.000430 0.001004 0.001096 0.000534 0.000974 0.002626 0.000667 0.000579 0.000534 0.000343 0.000809 0.000661 0.000458 0.000879 0.001096 0.000526 0.000978 0.000390 0.001384 0.000635 0.000366 0.000721 0.001453 0.000378 0.000777 0.011739 0.002291 0.000317 0.000721 0.001073 0.001152 0.002026 0.002369 0.000502 0.000845 0.000510 1
2 0.000145 0.000343 0.000000 0.000059 0.000124 0.000141 0.000069 0.000059 0.000128 0.000181 0.000108 0.000034 0.000128 0.003423 0.000693 0.000200 0.000119 0.000153 0.000377 0.000091 0.000069 0.000077 0.000161 0.000536 0.000219 0.000893 0.000456 0.000124 0.000119 0.000791 0.000077 0.000077 0.000291 0.000586 0.000069 0.000374 0.000059 0.000091 0.000108 0.000059 ... 0.000527 0.000359 0.000343 0.000161 0.000375 0.000410 0.000200 0.000364 0.000982 0.000249 0.000217 0.000200 0.000128 0.000303 0.000247 0.000171 0.000329 0.000410 0.000197 0.000366 0.001067 0.000518 0.000237 0.000137 0.000270 0.000543 0.000141 0.000291 0.004391 0.000857 0.000119 0.000270 0.000401 0.000431 0.000758 0.001670 0.000188 0.000316 0.000191 2
3 0.000067 0.000159 0.000059 0.000000 0.000057 0.000065 0.000032 0.000027 0.000059 0.000084 0.000050 0.000016 0.000059 0.001585 0.000321 0.000092 0.000055 0.000071 0.000175 0.000042 0.000032 0.000035 0.000074 0.000248 0.000102 0.000414 0.000211 0.000057 0.000055 0.000366 0.000035 0.000035 0.000135 0.000271 0.000032 0.000173 0.000027 0.000042 0.000050 0.000027 ... 0.000244 0.000166 0.000159 0.000074 0.000174 0.000190 0.000092 0.000169 0.000455 0.000115 0.000100 0.000092 0.000059 0.000140 0.000114 0.000079 0.000152 0.000190 0.000091 0.000169 0.000494 0.000240 0.000110 0.000063 0.000125 0.000251 0.000065 0.000135 0.002032 0.000397 0.000055 0.000125 0.000186 0.000199 0.000351 0.000773 0.000087 0.000146 0.000088 3
4 0.000140 0.000330 0.000124 0.000057 0.000000 0.000136 0.000066 0.000057 0.000124 0.000175 0.000104 0.000033 0.000124 0.003299 0.000668 0.000193 0.000114 0.000148 0.000363 0.000087 0.000066 0.000074 0.000155 0.000516 0.000211 0.000861 0.000440 0.000119 0.000114 0.000762 0.000074 0.000074 0.000280 0.000565 0.000066 0.000360 0.000057 0.000087 0.000104 0.000057 ... 0.000508 0.000346 0.000330 0.000155 0.000362 0.000395 0.000193 0.000351 0.000946 0.000240 0.000209 0.000193 0.000124 0.000292 0.000238 0.000165 0.000317 0.000395 0.000190 0.000353 0.016946 0.000499 0.000229 0.000132 0.000260 0.000524 0.000136 0.000280 0.004231 0.000826 0.000114 0.000260 0.000387 0.000415 0.000730 0.001610 0.000181 0.000304 0.000184 4
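One caveat about `replace(1, 0)`: it zeroes every entry equal to 1, not just the diagonal, so two attributes that always occur together would lose their correlation as well. The sketch below shows the difference on a small hand-written matrix. The matrix, its labels `a`, `b`, `c`, and the variable names are hypothetical and are not taken from the iMet data; `np.fill_diagonal` is offered as one possible alternative, not as what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical correlation matrix: 'a' and 'b' are perfectly correlated.
names = ['a', 'b', 'c']
toy_corr = pd.DataFrame(
    [[1.0, 1.0, 0.2],
     [1.0, 1.0, 0.2],
     [0.2, 0.2, 1.0]],
    index=names, columns=names)

# replace(1, 0) wipes the diagonal AND the genuine a/b correlation.
via_replace = toy_corr.replace(1, 0).abs()

# Zeroing only the diagonal keeps the a/b correlation intact.
values = toy_corr.abs().to_numpy().copy()
np.fill_diagonal(values, 0.0)
via_diagonal = pd.DataFrame(values, index=names, columns=names)

print(via_replace.loc['a', 'b'])   # correlation lost
print(via_diagonal.loc['a', 'b'])  # correlation kept
```

Whether this matters for the iMet labels depends on whether any two attributes are perfectly correlated, which the output above does not show.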
In [20]:
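# Keep attributes (rows) whose strongest absolute correlation exceeds 0.4.
# The last column is the 'id' helper, so exclude it from the row-wise max.
corr_df3 = corr_df2.loc[lambda x: x.iloc[:, 0:-1].max(axis=1) > 0.4]
# For each remaining attribute, record the strongest correlation and the partner that produced it.
max_values = corr_df3.iloc[:, 0:-1].max(axis=1)
max_index1 = corr_df3.iloc[:, 0:-1].idxmax(axis=1)
max_index2 = corr_df3['id']
# np.stack mixes floats and id strings, so the resulting columns have object dtype.
corr_df4 = pd.DataFrame(np.stack((max_values, max_index1, max_index2), axis=-1), columns=['value', 'id1', 'id2'])
# Attach the attribute names for both ids, then drop the redundant id columns.
corr_df4 = corr_df4.merge(labels_ds, left_on = 'id1', right_on = 'attribute_id')
corr_df4 = corr_df4.merge(labels_ds, left_on = 'id2', right_on = 'attribute_id', suffixes=('_1', '_2'))
corr_df4 = corr_df4.drop(columns=['attribute_id_1', 'attribute_id_2'])
corr_df4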
Out[20]:
    value     id1   id2   attribute_name_1              attribute_name_2
0   0.840149  28    5     culture::augsburg original    culture::after german
1   0.659341  120   10    culture::egypt                culture::alexandria-hadra
2   0.610194  331   18    culture::south italian        culture::apulian
3   0.484815  331   61    culture::south italian        culture::campanian
4   0.570598  161   25    culture::greek                culture::attic
5   0.807182  228   27    culture::meissen with german  culture::augsburg decoration
6   0.840149  5     28    culture::after german         culture::augsburg original
7   0.516766  383   29    culture::vienna               culture::austrian
8   0.426615  582   33    tag::cuneiform                culture::babylonian
9   0.85832   582   1023  tag::cuneiform                tag::tablets
10  0.645373  102   90    culture::danish               culture::copenhagen
11  0.784447  254   97    culture::naxos                culture::cyclades
12  0.645373  90    102   culture::copenhagen           culture::danish
13  0.672785  185   110   culture::irish                culture::dublin
14  0.7131    217   114   culture::lydian               culture::east greek/sardis
15  0.659341  10    120   culture::alexandria-hadra     culture::egypt
16  0.707097  219   140   culture::macao                culture::for iberian market
17  0.441192  348   154   culture::swiss                culture::geneva
18  0.570598  25    161   culture::attic                culture::greek
19  0.448669  97    162   culture::cyclades             culture::greek islands
20  0.784447  97    254   culture::cyclades             culture::naxos
21  0.816489  181   166   culture::indian or nepalese   culture::gurkha
22  0.42974   750   180   tag::jainism                  culture::india
23  0.816489  166   181   culture::gurkha               culture::indian or nepalese
24  0.672785  110   185   culture::dublin               culture::irish
25  0.7131    114   217   culture::east greek/sardis    culture::lydian
26  0.707097  140   219   culture::for iberian market   culture::macao
27  0.807182  27    228   culture::augsburg decoration  culture::meissen with german
28  0.707055  357   257   culture::thessaly             culture::neolithic
29  0.41318   338   308   culture::st. petersburg       culture::russian
30  0.610194  18    331   culture::apulian              culture::south italian
31  0.41318   308   338   culture::russian              culture::st. petersburg
32  0.441192  154   348   culture::geneva               culture::swiss
33  0.427787  61    352   culture::campanian            culture::teano
34  0.707055  257   357   culture::neolithic            culture::thessaly
35  0.516766  29    383   culture::austrian             culture::vienna
36  0.421642  896   405   tag::portraits                tag::actresses
37  0.83523   645   406   tag::eve                      tag::adam
38  0.585224  457   445   tag::baseball                 tag::athletes
39  0.585224  445   457   tag::athletes                 tag::baseball
40  0.443011  498   482   tag::buddhism                 tag::bodhisattva
41  0.524181  498   497   tag::buddhism                 tag::buddha
42  0.524181  497   498   tag::buddha                   tag::buddhism
43  0.85832   1023  582   tag::tablets                  tag::cuneiform
44  0.83523   406   645   tag::adam                     tag::eve
45  0.443135  780   671   tag::leaves                   tag::flowers
46  0.770603  757   730   tag::judith                   tag::holofernes
47  0.476425  748   735   tag::isis                     tag::horus
48  0.476425  735   748   tag::horus                    tag::isis
49  0.42974   180   750   culture::india                tag::jainism
50  0.770603  730   757   tag::holofernes               tag::judith
51  0.443135  671   780   tag::flowers                  tag::leaves
52  0.421642  405   896   tag::actresses                tag::portraits
53  0.404764  1043  913   tag::trains                   tag::railways
54  0.577258  1011  957   tag::students                 tag::schools
55  0.714231  1011  1029  tag::students                 tag::teachers
56  0.714231  1029  1011  tag::teachers                 tag::students
57  0.404764  913   1043  tag::railways                 tag::trains