# 7\. Data Exploration

Chinese proverb

**A journey of a thousand miles begins with a single step.** – idiom, from Laozi

I wouldn't say that understanding your dataset is the most difficult thing in data science, but it is really important and time-consuming. Data exploration is about describing the data by means of statistical and visualization techniques. We explore the data in order to understand its features and bring important features into our models.

## 7.1\. Univariate Analysis

In mathematics, univariate refers to an expression, equation, function or polynomial of only one variable. "Uni" means "one", so in other words your data has only one variable. You therefore do not need to deal with causes or relationships in this step. Univariate analysis takes the data, summarizes the variables (attributes) one by one and finds patterns in the data.

The patterns found in univariate data can be described in many ways, including central tendency (mean, mode and median) and dispersion: range, variance, maximum, minimum, quartiles (including the interquartile range), coefficient of variation and standard deviation. You also have several options for visualizing and describing univariate data, such as `frequency distribution tables`, `bar charts`, `histograms`, `frequency polygons` and `pie charts`.

A variable can be either categorical or numerical, and I will demonstrate the different statistical and visualization techniques used to investigate each type of variable.

* The Jupyter notebook can be downloaded from [Data Exploration](_static/Data_exploration.ipynb).
* The data can be downloaded from [German Credit](_static/german_credit.csv).

### 7.1.1\. Numerical Variables

* Describe

The `describe` function in `pandas` and `spark` gives us most of the statistical results, such as `min`, `median`, `max`, `quartiles` and `standard deviation`. With the help of a user-defined function, you can get even more statistical results.

```py
# selected variables for the demonstration
num_cols = ['Account Balance','No of dependents']
df.select(num_cols).describe().show()
```

```py
+-------+------------------+-------------------+
|summary|   Account Balance|   No of dependents|
+-------+------------------+-------------------+
...
```

You may find out that the default function in PySpark does not include the quartiles.
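If all you need are the quartiles, one built-in workaround is `DataFrame.approxQuantile`, which computes (approximate) percentiles directly in Spark. A minimal sketch, assuming the same `df` and column name as above:

```py
# quartiles of a single column; relativeError=0.0 requests exact values
# (at a higher computational cost on large data)
q1, median, q3 = df.approxQuantile('Account Balance', [0.25, 0.5, 0.75], 0.0)
print(q1, median, q3)
```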
The following function will help you to get the same results as in Pandas:

```py
def describe_pd(df_in, columns, deciles=False):
    '''
    Function to union the basic stats results and deciles
    '''
    ...
```

```py
describe_pd(df,num_cols)
```

```py
+-------+------------------+-----------------+
|summary|   Account Balance| No of dependents|
+-------+------------------+-----------------+
...
```

Sometimes, because of data confidentiality issues, you cannot deliver the real data and your clients may ask for more statistical results, such as `deciles`. You can apply the following call to achieve it.

```py
describe_pd(df,num_cols,deciles=True)
```

```py
+-------+------------------+-----------------+
|summary|   Account Balance| No of dependents|
+-------+------------------+-----------------+
...
```
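Only the signature of `describe_pd` is shown above. As a rough illustration of how such a helper could be put together from `describe()` and `approxQuantile` — a sketch, not the original implementation; the name `describe_with_quantiles` and the 1% relative error are made up:

```py
import pandas as pd

def describe_with_quantiles(df_in, columns, deciles=False):
    """Union the describe() output with (approximate) percentiles.
    Illustrative sketch only -- not the original describe_pd."""
    # basic count/mean/stddev/min/max from Spark, as a pandas DataFrame
    base = df_in.select(columns).describe().toPandas()
    # deciles (10%..90%) or quartiles (25%, 50%, 75%)
    probs = [i / 10.0 for i in range(1, 10)] if deciles else [0.25, 0.5, 0.75]
    rows = []
    for p in probs:
        vals = [df_in.approxQuantile(c, [p], 0.01)[0] for c in columns]
        rows.append(['{:.0f}%'.format(p * 100)] + vals)
    extra = pd.DataFrame(rows, columns=base.columns)
    return pd.concat([base, extra], ignore_index=True)
```

Called as `describe_with_quantiles(df, num_cols, deciles=True)`, it returns a single pandas table with the basic statistics followed by the requested percentiles.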
* Skewness and Kurtosis

  This subsection comes from Wikipedia [Skewness](https://en.wikipedia.org/wiki/Skewness).

  In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive or negative, or undefined. For a unimodal distribution, negative skew commonly indicates that the tail is on the left side of the distribution, and positive skew indicates that the tail is on the right.

  Consider the two distributions in the figure just below. Within each graph, the values on the right side of the distribution taper differently from the values on the left side. These tapering sides are called tails, and they provide a visual means to determine which of the two kinds of skewness a distribution has:

  1. negative skew: the left tail is longer; the mass of the distribution is concentrated on the right of the figure. The distribution is said to be left-skewed, left-tailed, or skewed to the left, despite the fact that the curve itself appears to be skewed or leaning to the right; left instead refers to the left tail being drawn out and, often, the mean being skewed to the left of a typical center of the data. A left-skewed distribution usually appears as a right-leaning curve.
  2. positive skew: the right tail is longer; the mass of the distribution is concentrated on the left of the figure. The distribution is said to be right-skewed, right-tailed, or skewed to the right, despite the fact that the curve itself appears to be skewed or leaning to the left; right instead refers to the right tail being drawn out and, often, the mean being skewed to the right of a typical center of the data. A right-skewed distribution usually appears as a left-leaning curve.

  This subsection comes from Wikipedia [Kurtosis](https://en.wikipedia.org/wiki/Kurtosis).

  In probability theory and statistics, kurtosis (from the Greek kyrtos or kurtos, meaning "curved, arching") is a measure of the "tailedness" of the probability distribution of a real-valued random variable. In a similar way to the concept of skewness, kurtosis is a descriptor of the shape of a probability distribution and, just as for skewness, there are different ways of quantifying it for a theoretical distribution and corresponding ways of estimating it from a sample of a population.

![https://runawayhorse001.github.io/LearningApacheSpark/_images/skewed.png](img/6eb508bad184c89094f5045a5bf2e31c.jpg)

```py
from pyspark.sql.functions import col, skewness, kurtosis

var = 'Age (years)'
df.select(skewness(var),kurtosis(var)).show()
```

```py
+---------------------+---------------------+
|skewness(Age (years))|kurtosis(Age (years))|
+---------------------+---------------------+
...
```

Warning

**Sometimes the statistics can be misleading!**

F. J. Anscombe once said to make both calculations and graphs: both sorts of output should be studied, and each will contribute to understanding. The 13 datasets in Figure [Same Stats, Different Graphs](#fig-misleading) (the Datasaurus, plus 12 others) each have the same summary statistics (x/y mean, x/y standard deviation, and Pearson's correlation) to two decimal places, while being drastically different in appearance. That work describes the technique the authors developed to create this dataset, and others like it. More details and interesting results can be found in [Same Stats Different Graphs](https://www.autodeskresearch.com/publications/samestats).

![https://runawayhorse001.github.io/LearningApacheSpark/_images/misleading.png](img/4fb175e4e5682ef75a156dfba37beeea.jpg)

Same Stats, Different Graphs

* Histogram

Warning

**Histograms are often confused with bar graphs!**

The fundamental difference between a histogram and a bar graph, which will help you identify the two easily, is that a bar graph has gaps between the bars, whereas in a histogram the bars are adjacent to each other. The interested reader is referred to [Difference Between Histogram and Bar Graph](https://keydifferences.com/difference-between-histogram-and-bar-graph.html).

```py
var = 'Age (years)'
x = data1[var]
bins = np.arange(0, 100, 5.0)

plt.figure(figsize=(10,8))
# the histogram of the data
plt.hist(x, bins, alpha=0.8, histtype='bar', color='gold',
         ec='black', weights=np.zeros_like(x) + 100. / x.size)
...
fig.savefig(var+".pdf", bbox_inches='tight')
```

![https://runawayhorse001.github.io/LearningApacheSpark/_images/his_s.png](img/0539212d2d3e4c28b27805e3c8783cab.jpg)
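With a very large table, collecting the raw column into pandas just to bin it may not be practical. The bin counts can instead be computed on the cluster with the RDD `histogram` method and only the aggregated counts plotted locally. A minimal sketch, assuming the same `df`, `var` and 5-year bins as above:

```py
import numpy as np
import matplotlib.pyplot as plt

bins = [float(b) for b in np.arange(0, 100, 5.0)]
# RDD.histogram returns (bucket_boundaries, counts); the counts are computed distributed
counts = df.select(var).rdd.map(lambda row: float(row[0])).histogram(bins)[1]

plt.figure(figsize=(10, 8))
plt.bar(bins[:-1], counts, width=5.0, align='edge', color='gold', ec='black')
plt.xlabel(var)
plt.ylabel('count')
plt.show()
```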
```py
var = 'Age (years)'
x = data1[var]
bins = np.arange(0, 100, 5.0)

########################################################################
hist, bin_edges = np.histogram(x, bins,
                               weights=np.zeros_like(x) + 100. / x.size)
# make the histogram
fig = plt.figure(figsize=(20, 8))
ax = fig.add_subplot(1, 2, 1)

# Plot the histogram heights against integers on the x axis
ax.bar(range(len(hist)),hist,width=1,alpha=0.8,ec ='black', color='gold')

# Set the ticks to the middle of the bars
ax.set_xticks([0.5+i for i,j in enumerate(hist)])

# Set the xticklabels to a string that tells us what the bin edges were
labels =['{}'.format(int(bins[i+1])) for i,j in enumerate(hist)]
labels.insert(0,'0')
ax.set_xticklabels(labels)
...
plt.ylabel('percentage')

########################################################################

hist, bin_edges = np.histogram(x, bins)  # make the histogram

ax = fig.add_subplot(1, 2, 2)
# Plot the histogram heights against integers on the x axis
ax.bar(range(len(hist)),hist,width=1,alpha=0.8,ec ='black', color='gold')

# Set the ticks to the middle of the bars
ax.set_xticks([0.5+i for i,j in enumerate(hist)])

# Set the xticklabels to a string that tells us what the bin edges were
labels =['{}'.format(int(bins[i+1])) for i,j in enumerate(hist)]
labels.insert(0,'0')
ax.set_xticklabels(labels)
...
fig.savefig(var+".pdf", bbox_inches='tight')
```

![https://runawayhorse001.github.io/LearningApacheSpark/_images/his_d.png](img/2a4a130bcfb223ced98c0de613bd076a.jpg)

Sometimes people will ask you to plot bars of unequal width (not a standard argument for a histogram). You can still achieve it with the following trick.

```py
var = 'Credit Amount'
plot_data = df.select(var).toPandas()
x = plot_data[var]
...
hist, bin_edges = np.histogram(x, bins,
                               weights=np.zeros_like(x) + 100. / x.size)

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(1, 1, 1)

# Plot the histogram heights against integers on the x axis
ax.bar(range(len(hist)),hist,width=1,alpha=0.8,ec ='black',color = 'gold')

# Set the ticks to the middle of the bars
ax.set_xticks([0.5+i for i,j in enumerate(hist)])

# Set the xticklabels to a string that tells us what the bin edges were
#labels =['{}k'.format(int(bins[i+1]/1000)) for i,j in enumerate(hist)]
labels =['{}'.format(bins[i+1]) for i,j in enumerate(hist)]
labels.insert(0,'0')
ax.set_xticklabels(labels)
...
plt.show()
```

![https://runawayhorse001.github.io/LearningApacheSpark/_images/unequal.png](img/cb63c877ea3af266bb0f5ad6ba5e0b1d.jpg)
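If the column is too large to collect with `toPandas()`, the counts for such custom bins can first be computed on the cluster. A sketch using `pyspark.ml.feature.Bucketizer`, assuming `Credit Amount` is numeric — the split points and the `Credit_bucket` output name below are placeholders, not the exact bin edges used above:

```py
from pyspark.ml.feature import Bucketizer

# hypothetical unequal-width bin edges; the splits must cover the full range,
# so the last edge is set to infinity
splits = [0.0, 500.0, 1000.0, 2000.0, 5000.0, 10000.0, float('inf')]
bucketizer = Bucketizer(splits=splits, inputCol='Credit Amount', outputCol='Credit_bucket')

bucket_counts = (bucketizer.transform(df)
                 .groupBy('Credit_bucket')
                 .count()
                 .orderBy('Credit_bucket'))
bucket_counts.show()
```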
* Box plot and violin plot

Note that although violin plots are closely related to Tukey's (1977) box plots, a violin plot can show more information than a box plot. When we perform an exploratory analysis, nothing is known about the samples, so the distribution of the samples cannot be assumed to be normal, and with big data a box plot will usually show quite a few outliers.

However, violin plots can be misleading for smaller sample sizes, where the density plot may appear to show interesting features (and group differences) even when produced for standard normal data. Some authors suggest the sample size should be larger than about 250 (ideally even larger); at such sizes the kernel density plot provides a reasonably accurate representation of the distribution, potentially showing nuances such as bimodality or other forms of non-normality that would be invisible or less clear in a box plot. More details can be found in [A simple comparison of box plots and violin plots](https://figshare.com/articles/A_simple_comparison_of_box_plots_and_violin_plots/1544525).

```py
x = df.select(var).toPandas()

fig = plt.figure(figsize=(20, 8))
ax = fig.add_subplot(1, 2, 1)
...
ax = sns.violinplot(data=x)
```

![https://runawayhorse001.github.io/LearningApacheSpark/_images/box_vio.png](img/0eb5759f21246505752043bb890ab6bf.jpg)

### 7.1.2\. Categorical Variables

Compared with numerical variables, categorical variables are much easier to explore.

* Frequency table

```py
from pyspark.sql import functions as F
from pyspark.sql.functions import rank,sum,col
from pyspark.sql import Window

tab = df.select(['age_class','Credit Amount']).\
    ...
```

```py
+---------+----------+------------------+----------+----------+-------+
|age_class|Credit_num|        Credit_avg|Credit_min|Credit_max|Percent|
+---------+----------+------------------+----------+----------+-------+
...
```

* Pie plot

```py
# Data to plot
labels = plot_data.age_class
sizes = plot_data.Percent
colors = ['gold', 'yellowgreen', 'lightcoral','blue', 'lightskyblue','green','red']
explode = (0, 0.1, 0, 0,0,0)  # explode 1st slice

# Plot
plt.figure(figsize=(10,8))
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)
...
plt.show()
```

![https://runawayhorse001.github.io/LearningApacheSpark/_images/pie.png](img/38cff4d0c27588f71d4ed00223dcc4a2.jpg)

* Bar plot

```py
labels = plot_data.age_class
missing = plot_data.Percent
ind = [x for x, _ in enumerate(labels)]
...
plt.show()
```

![https://runawayhorse001.github.io/LearningApacheSpark/_images/bar.png](img/7f8b8ddc9f821d1c5a27849bc02e355f.jpg)

```py
labels = ['missing', '<25', '25-34', '35-44', '45-54','55-64','65+']
missing = np.array([0.000095, 0.024830, 0.028665, 0.029477, 0.031918,0.037073,0.026699])
man = np.array([0.000147, 0.036311, 0.038684, 0.044761, 0.051269, 0.059542, 0.054259])
...
plt.show()
```

![https://runawayhorse001.github.io/LearningApacheSpark/_images/stacked.png](img/aa2fbf6676b8fd4f67229d35f1c7c537.jpg)
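For a quick look at raw category frequencies, without the manual aggregation above, seaborn's `countplot` can draw the bar chart directly from a pandas column. A minimal sketch, using `Occupation` as an example column from this dataset and assuming it fits in memory:

```py
import seaborn as sns
import matplotlib.pyplot as plt

# frequencies of one categorical column, plotted directly from pandas
occ = df.select('Occupation').toPandas()

plt.figure(figsize=(10, 8))
sns.countplot(x='Occupation', data=occ, color='gold')
plt.ylabel('count')
plt.show()
```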
## 7.2\. Multivariate Analysis

In this section I will only demonstrate bivariate analysis, since multivariate analysis is a generalization of the bivariate case.

### 7.2.1\. Numerical V.S. Numerical

* Correlation matrix

```py
from pyspark.mllib.stat import Statistics
import pandas as pd
...
print(corr_df.to_string())
```

```py
+--------------------+--------------------+
|     Account Balance|    No of dependents|
+--------------------+--------------------+
...
```

* Scatter Plot

```py
import seaborn as sns
sns.set(style="ticks")
...
plt.show()
```

![https://runawayhorse001.github.io/LearningApacheSpark/_images/pairplot.png](img/1428271961e4c95f6508f59083d5a645.jpg)

### 7.2.2\. Categorical V.S. Categorical

* Pearson's Chi-squared test

Warning

**`pyspark.ml.stat` is only available in Spark 2.4.0 and above.**

```py
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import ChiSquareTest
...
print("statistics: " + str(r.statistics))
```

```py
pValues: [0.687289278791,0.682270330336]
degreesOfFreedom: [2, 3]
statistics: [0.75,1.5]
```

* Cross table

```py
df.stat.crosstab("age_class", "Occupation").show()
```

```py
+--------------------+---+---+---+---+
|age_class_Occupation|  1|  2|  3|  4|
+--------------------+---+---+---+---+
...
```

* Stacked plot

```py
labels = ['missing', '<25', '25-34', '35-44', '45-54','55-64','65+']
missing = np.array([0.000095, 0.024830, 0.028665, 0.029477, 0.031918,0.037073,0.026699])
man = np.array([0.000147, 0.036311, 0.038684, 0.044761, 0.051269, 0.059542, 0.054259])
...
plt.show()
```

![https://runawayhorse001.github.io/LearningApacheSpark/_images/stacked.png](img/aa2fbf6676b8fd4f67229d35f1c7c537.jpg)
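The proportions in the stacked plot above are hard-coded; they can also be derived directly from a crosstab and plotted with pandas. A minimal sketch, reusing the `age_class`/`Occupation` crosstab from this section (column names as in the output above):

```py
import matplotlib.pyplot as plt

# contingency table as pandas, with the age classes as the row index
ct = df.stat.crosstab("age_class", "Occupation").toPandas()
ct = ct.set_index("age_class_Occupation")

# convert the counts to within-row proportions and draw a stacked bar chart
props = ct.div(ct.sum(axis=1), axis=0)
props.plot(kind='bar', stacked=True, figsize=(10, 8))
plt.ylabel('proportion')
plt.show()
```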