Translator | TRHX  Author | Selva Prabhakaran  Original | "Top 50 matplotlib Visualizations – The Master Plots (with full python code)"
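★ The original author wrote the examples in this article in Jupyter Notebook.
★ The translator tested them in PyCharm and found some of the original code inaccurate; parts of it have been corrected, with comments marking the changes.
★ Running the code requires visualization libraries such as Matplotlib and Seaborn; any additional helper libraries are noted in the relevant code.
★ The example data is stored in the author's GitHub repository, https://github.com/selva86/datasets , so running the programs may require a proxy if GitHub is not directly reachable.
★ The translator's English is limited; where the translation is unclear, please refer to the original.
★ All 50 examples are packaged as .py files for download: https://download.csdn.net/download/qq_36759224/12507219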
This translation was first published on CSDN. Author: Selva Prabhakaran; translator: TRHX. Article link: https://itrhx.blog.csdn.net/article/details/106558131 Original link: https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/
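【1x00】Introduction
This article collects the 50 most commonly used and most valuable Matplotlib charts in data analysis and visualization. They show how to use Python's Matplotlib and Seaborn libraries appropriately for different visualization goals.
The charts are grouped by 7 different visualization objectives. For example, to visualize the relationship between two variables, see the charts in the "Correlation" section; to show how a value changes over time, see the "Change" section, and so on.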
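Key characteristics of an effective chart:
It conveys the right and necessary information without distorting facts;
It is simple in design, so it takes little effort to understand;
Its aesthetics support the information rather than overshadow it;
It is not overloaded with information.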
【2x00】Setup
Run the basic setup below before running the code. Individual charts may, of course, redefine some display settings.
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings; warnings.filterwarnings(action='once')

large = 22; med = 16; small = 12
params = {'legend.fontsize': med,
          'figure.figsize': (16, 10),
          'axes.labelsize': med,
          'axes.titlesize': med,  # the original dict set this key twice (large, then med); the later value wins
          'xtick.labelsize': med,
          'ytick.labelsize': med,
          'figure.titlesize': large}
plt.rcParams.update(params)
plt.style.use('seaborn-whitegrid')  # renamed to 'seaborn-v0_8-whitegrid' in Matplotlib 3.6
sns.set_style("white")
# %matplotlib inline  # Jupyter Notebook only; keep commented out in a plain .py script

# Version
print(mpl.__version__)
print(sns.__version__)
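Because Matplotlib 3.6 renamed the bundled seaborn styles with a `seaborn-v0_8-` prefix, the style name used in the setup may not exist on every install. The sketch below shows one way to fall back gracefully. The helper name `pick_style` is hypothetical (it is not part of Matplotlib), and the hard-coded lists stand in for `plt.style.available`.

```python
def pick_style(available, candidates):
    """Return the first candidate style name found in `available`, else None."""
    for name in candidates:
        if name in available:
            return name
    return None

# Stand-ins for plt.style.available on a newer and an older install (assumed values).
newer = ["classic", "seaborn-v0_8-whitegrid"]
older = ["classic", "seaborn-whitegrid"]
wanted = ["seaborn-whitegrid", "seaborn-v0_8-whitegrid"]

print(pick_style(newer, wanted))
print(pick_style(older, wanted))
```

In a real script you would call `pick_style(plt.style.available, wanted)` and pass the result to `plt.style.use()` only when it is not `None`.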
【3x00】Correlation
Correlation plots visualize the relationship between two or more variables, that is, how one variable changes with respect to another.
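【01】Scatter plot
The scatter plot is the classic, fundamental plot for studying the relationship between two variables. If the data contains multiple groups, you may want to show each group in a different color. In Matplotlib, you can do this conveniently with plt.scatter().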
# Import dataset
midwest = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/midwest_filter.csv")

# Create as many colors as there are unique categories
categories = np.unique(midwest['category'])
colors = [plt.cm.tab10(i / float(len(categories) - 1)) for i in range(len(categories))]

# Draw one scatter call per category
plt.figure(figsize=(16, 10), dpi=80, facecolor='w', edgecolor='k')

for i, category in enumerate(categories):
    # color= takes a single RGBA tuple; cmap= expects a colormap, not a color
    plt.scatter('area', 'poptotal',
                data=midwest.loc[midwest.category == category, :],
                s=20, color=colors[i], label=str(category))

# Decorations
plt.gca().set(xlim=(0.0, 0.1), ylim=(0, 90000), xlabel='Area', ylabel='Population')
plt.xticks(fontsize=12); plt.yticks(fontsize=12)
plt.title("Scatterplot of Midwest Area vs Population", fontsize=22)
plt.legend(fontsize=12)
plt.show()
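【02】Bubble plot with Encircling
Sometimes you want to show a group of points within a boundary to emphasize their importance. In this example, you select the records to be encircled from the data and pass them to the encircle() function defined in the code below.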
from matplotlib import patches
from scipy.spatial import ConvexHull
import warnings; warnings.simplefilter('ignore')
sns.set_style("white")

# Step 1: Prepare data
midwest = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/midwest_filter.csv")

# Create as many colors as there are unique categories
categories = np.unique(midwest['category'])
colors = [plt.cm.tab10(i / float(len(categories) - 1)) for i in range(len(categories))]

# Step 2: Draw a scatter plot with a unique color for each category
fig = plt.figure(figsize=(16, 10), dpi=80, facecolor='w', edgecolor='k')

for i, category in enumerate(categories):
    # color= takes a single RGBA tuple; cmap= expects a colormap, not a color
    plt.scatter('area', 'poptotal', data=midwest.loc[midwest.category == category, :],
                s='dot_size', color=colors[i], label=str(category),
                edgecolors='black', linewidths=.5)

# Step 3: Encircle the selected points with their convex hull
def encircle(x, y, ax=None, **kw):
    if not ax:
        ax = plt.gca()
    p = np.c_[x, y]
    hull = ConvexHull(p)
    poly = plt.Polygon(p[hull.vertices, :], **kw)
    ax.add_patch(poly)

# Select the data to be encircled
midwest_encircle_data = midwest.loc[midwest.state == 'IN', :]

# Draw the polygon twice: a translucent fill, then an outline
encircle(midwest_encircle_data.area, midwest_encircle_data.poptotal, ec="k", fc="gold", alpha=0.1)
encircle(midwest_encircle_data.area, midwest_encircle_data.poptotal, ec="firebrick", fc="none", linewidth=1.5)

# Step 4: Decorations
plt.gca().set(xlim=(0.0, 0.1), ylim=(0, 90000), xlabel='Area', ylabel='Population')
plt.xticks(fontsize=12); plt.yticks(fontsize=12)
plt.title("Bubble Plot with Encircling", fontsize=22)
plt.legend(fontsize=12)
plt.show()
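The encircle() helper relies on a convex hull: the smallest convex polygon that contains all the selected points. To illustrate what such a hull is, here is a small pure-Python sketch using Andrew's monotone chain algorithm. This is a swapped-in technique for illustration, not what scipy's ConvexHull does internally, and the function names and sample points are made up.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Return hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop points that would make a clockwise (or straight) turn
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each half is the first point of the other half
    return lower[:-1] + upper[:-1]

# Four corners of a square plus one interior point (illustrative data)
square_with_centre = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
hull = convex_hull(square_with_centre)
print(hull)
```

The interior point is excluded from the result, which is why the polygon drawn around the selected points touches only the outermost ones.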
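【03】Scatter plot with linear regression line of best fit
If you want to understand how two variables change with respect to each other, a line of best fit is the usual approach. The plot below shows how the best-fit line differs between groups in the data. To disable the grouping and draw a single best-fit line for the entire dataset, remove the hue="cyl" argument from the sns.lmplot() call.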
# Import data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")
df_select = df.loc[df.cyl.isin([4, 8]), :]

# Plot
sns.set_style("white")
gridobj = sns.lmplot(x="displ", y="hwy", hue="cyl", data=df_select,
                     height=7, aspect=1.6, robust=True, palette='tab10',
                     scatter_kws=dict(s=60, linewidths=.7, edgecolors='black'))

# Decorations
gridobj.set(xlim=(0.5, 7.5), ylim=(0, 50))
plt.title("Scatterplot with line of best fit grouped by number of cylinders", fontsize=20)
plt.show()
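To make "line of best fit" concrete, the sketch below computes an ordinary least-squares slope and intercept by hand. Note the assumption: the lmplot() call above passes robust=True, which fits a robust regression that down-weights outliers, so this plain least-squares version is a simplified stand-in rather than the exact method seaborn applies. The function name and the sample data are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Points that lie exactly on y = 2x + 1 (made-up data)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```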
To draw each regression line in its own column instead, set the col=groupingcolumn argument in sns.lmplot(), as follows:
# Import data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")
df_select = df.loc[df.cyl.isin([4, 8]), :]

# Each line in its own column
sns.set_style("white")
gridobj = sns.lmplot(x="displ", y="hwy", data=df_select,
                     height=7, robust=True, palette='Set1', col="cyl",
                     scatter_kws=dict(s=60, linewidths=.7, edgecolors='black'))

# Decorations
gridobj.set(xlim=(0.5, 7.5), ylim=(0, 50))
plt.show()
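【04】Jittering with stripplot
Often, several data points have exactly the same X and Y values, so they are drawn on top of one another and hidden. To avoid this, jitter the points slightly so that each one is visible. The stripplot() function in seaborn makes this easy.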
# Import data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")

# Draw the stripplot
fig, ax = plt.subplots(figsize=(16, 10), dpi=80)
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.stripplot(x=df.cty, y=df.hwy, jitter=0.25, size=8, ax=ax, linewidth=.5)

# Decorations
plt.title('Use jittered plots to avoid overlapping of points', fontsize=22)
plt.show()
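The idea behind jittering can be sketched without any plotting library: add a small random offset to each coordinate so that identical values no longer coincide. This is an illustrative sketch only; the helper name `jitter` and the sample values are made up, and seaborn's own implementation may differ in detail.

```python
import random

def jitter(values, width, seed=0):
    """Shift each value by a uniform random offset in [-width, width]."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    return [v + rng.uniform(-width, width) for v in values]

# Three identical values followed by two identical values (made-up data)
cty = [18, 18, 18, 21, 21]
jittered = jitter(cty, width=0.25)
print(jittered)
```

Each jittered value stays within 0.25 of its original position, so the overall pattern is preserved while the overlapping points separate.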
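【05】Counts Plot
Another way to avoid overlapping points is to scale the size of each dot by the number of points at that position. The larger the dot, the more points are concentrated there.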
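# Import data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")
df_counts = df.groupby(['hwy', 'cty']).size().reset_index(name='counts')

# Draw the stripplot
fig, ax = plt.subplots(figsize=(16, 10), dpi=80)
# Recent seaborn versions require x and y to be passed as keyword arguments.
# Passing an array to size= relies on older seaborn behavior; if it fails,
# ax.scatter(df_counts.cty, df_counts.hwy, s=...) is a possible alternative.
sns.stripplot(x=df_counts.cty, y=df_counts.hwy, size=df_counts.counts*2, ax=ax)

# Decorations
plt.title('Counts Plot - Size of circle is bigger as more points overlap', fontsize=22)
plt.show()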
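The grouping step above, `groupby(['hwy', 'cty']).size()`, simply counts how many rows share each coordinate pair. A minimal stdlib sketch of the same idea, using made-up points and the same "count times two" sizing as the example:

```python
from collections import Counter

# (cty, hwy) pairs; several are exact duplicates (made-up data)
points = [(18, 29), (18, 29), (21, 29), (18, 29), (21, 29), (20, 31)]

# How many observations fall on each distinct position
counts = Counter(points)

# Marker size grows with the number of overlapping points
sizes = {point: n * 2 for point, n in counts.items()}
print(sizes)
```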
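【06】Marginal Histogram
A marginal histogram adds histograms of the X and Y variables along the axes. It visualizes the relationship between X and Y together with the univariate distribution of each. This plot is often used in exploratory data analysis (EDA).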
# Import data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")

# Create the figure and a 4x4 grid
fig = plt.figure(figsize=(16, 10), dpi=80)
grid = plt.GridSpec(4, 4, hspace=0.5, wspace=0.2)

# Define the axes
ax_main = fig.add_subplot(grid[:-1, :-1])
ax_right = fig.add_subplot(grid[:-1, -1], xticklabels=[], yticklabels=[])
ax_bottom = fig.add_subplot(grid[-1, 0:-1], xticklabels=[], yticklabels=[])

# Scatter plot on the main axes
ax_main.scatter('displ', 'hwy', s=df.cty*4, c=df.manufacturer.astype('category').cat.codes,
                alpha=.9, data=df, cmap="tab10", edgecolors='gray', linewidths=.5)

# Histogram at the bottom
ax_bottom.hist(df.displ, 40, histtype='stepfilled', orientation='vertical', color='deeppink')
ax_bottom.invert_yaxis()

# Histogram on the right
ax_right.hist(df.hwy, 40, histtype='stepfilled', orientation='horizontal', color='deeppink')

# Decorations
ax_main.set(title='Scatterplot with Histograms \n displ vs hwy', xlabel='displ', ylabel='hwy')
ax_main.title.set_fontsize(20)
for item in ([ax_main.xaxis.label, ax_main.yaxis.label] + ax_main.get_xticklabels() + ax_main.get_yticklabels()):
    item.set_fontsize(14)

xlabels = ax_main.get_xticks().tolist()
ax_main.set_xticklabels(xlabels)
plt.show()
【07】Marginal Boxplot
A marginal boxplot serves a similar purpose to the marginal histogram. However, the boxplot helps pinpoint the median and the 25th and 75th percentiles of X and Y.
# Import data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")

# Create the figure and a 4x4 grid
fig = plt.figure(figsize=(16, 10), dpi=80)
grid = plt.GridSpec(4, 4, hspace=0.5, wspace=0.2)

# Define the axes
ax_main = fig.add_subplot(grid[:-1, :-1])
ax_right = fig.add_subplot(grid[:-1, -1], xticklabels=[], yticklabels=[])
ax_bottom = fig.add_subplot(grid[-1, 0:-1], xticklabels=[], yticklabels=[])

# Scatter plot on the main axes
ax_main.scatter('displ', 'hwy', s=df.cty*5, c=df.manufacturer.astype('category').cat.codes,
                alpha=.9, data=df, cmap="Set1", edgecolors='black', linewidths=.5)

# Box plots on the right and at the bottom
# (the variables are passed by keyword, as recent seaborn versions expect)
sns.boxplot(y=df.hwy, ax=ax_right, orient="v")
sns.boxplot(x=df.displ, ax=ax_bottom, orient="h")

# Remove the axis names from the box plots
ax_bottom.set(xlabel='')
ax_right.set(ylabel='')

# Decorations (the original title said "Histograms", copied from the previous example)
ax_main.set(title='Scatterplot with Boxplots \n displ vs hwy', xlabel='displ', ylabel='hwy')
ax_main.title.set_fontsize(20)
for item in ([ax_main.xaxis.label, ax_main.yaxis.label] + ax_main.get_xticklabels() + ax_main.get_yticklabels()):
    item.set_fontsize(14)

plt.show()
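The three values a box plot highlights (the 25th percentile, the median and the 75th percentile) can be computed directly. The sketch below uses Python's `statistics.quantiles` with the "inclusive" method on made-up data; seaborn and Matplotlib compute their quartiles internally, so treat this only as an illustration of the quantities involved.

```python
import statistics

# Made-up sample, already sorted for readability
hwy = [1, 2, 3, 4, 5]

# n=4 splits the data into quarters and returns the three cut points
q1, median, q3 = statistics.quantiles(hwy, n=4, method="inclusive")

# The interquartile range is the height of the box
iqr = q3 - q1
print(q1, median, q3, iqr)
```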
【08】Correlogram
A correlogram is used to see at a glance the correlation between every possible pair of numeric variables in a given DataFrame (or 2D array).
# Import dataset
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")

# Correlate the numeric columns only, since the file also contains text columns
# (numeric_only requires pandas 1.5 or later)
corr = df.corr(numeric_only=True)

# Plot
plt.figure(figsize=(12, 10), dpi=80)
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns,
            cmap='RdYlGn', center=0, annot=True)

# Decorations
plt.title('Correlogram of mtcars', fontsize=22)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
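Each cell of the correlogram holds a Pearson correlation coefficient, which is the default method of pandas' `corr()`. As a hedged illustration of what that number measures, here is a hand-written version on made-up data; the function name `pearson` is hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Perfectly increasing and perfectly decreasing relationships (made-up data)
r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson([1, 2, 3, 4], [8, 6, 4, 2])
print(r_up, r_down)
```

Values near +1 (green in the heatmap above) mean the two variables rise together; values near -1 (red) mean one falls as the other rises.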
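【09】Pairwise Plot
The pairwise plot is a favorite in exploratory analysis for understanding the relationships between all possible pairs of numeric variables. It is an essential tool for bivariate analysis.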
# Load dataset
df = sns.load_dataset('iris')

# Plot
plt.figure(figsize=(10, 8), dpi=80)
sns.pairplot(df, kind="scatter", hue="species", plot_kws=dict(s=80, edgecolor="white", linewidth=2.5))
plt.show()

# Load dataset
df = sns.load_dataset('iris')

# Plot, this time with a regression line in each panel
plt.figure(figsize=(10, 8), dpi=80)
sns.pairplot(df, kind="reg", hue="species")
plt.show()
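【4x00】Deviation
【10】Diverging Bars
If you want to see how items vary on a single metric, and visualize the order and size of that variation, diverging bars are a great tool. They help you quickly differentiate the performance of groups in the data, are quite intuitive, and get the point across immediately.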
# Prepare data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
x = df.loc[:, ['mpg']]
df['mpg_z'] = (x - x.mean()) / x.std()
df['colors'] = ['red' if x < 0 else 'green' for x in df['mpg_z']]
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)

# Draw plot
plt.figure(figsize=(14, 10), dpi=80)
plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z, color=df.colors, alpha=0.4, linewidth=5)

# Decorations
plt.gca().set(ylabel='$Model$', xlabel='$Mileage$')
plt.yticks(df.index, df.cars, fontsize=12)
plt.title('Diverging Bars of Car Mileage', fontdict={'size': 20})
plt.grid(linestyle='--', alpha=0.5)
plt.show()
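The bars diverge around zero because the mileage is first standardized: each value has the mean subtracted and is divided by the standard deviation. A minimal sketch of that step on made-up numbers follows. It uses the sample standard deviation (n - 1 in the denominator), which is also the default of pandas' `.std()`; the function name is illustrative.

```python
import statistics

def z_scores(values):
    """Standardize values to zero mean and unit sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / std for v in values]

# Made-up mileage figures
mpg = [10.0, 20.0, 30.0]
z = z_scores(mpg)

# Below-average items are drawn in red, the rest in green
colors = ['red' if v < 0 else 'green' for v in z]
print(z, colors)
```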
【11】Diverging Texts
Diverging texts are similar to diverging bars. Use them when you want to show the value of each item on the chart in a neat, presentable way.
# Prepare data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
x = df.loc[:, ['mpg']]
df['mpg_z'] = (x - x.mean()) / x.std()
df['colors'] = ['red' if x < 0 else 'green' for x in df['mpg_z']]
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)

# Draw plot
plt.figure(figsize=(14, 14), dpi=80)
plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z)
for x, y, tex in zip(df.mpg_z, df.index, df.mpg_z):
    t = plt.text(x, y, round(tex, 2), horizontalalignment='right' if x < 0 else 'left',
                 verticalalignment='center', fontdict={'color': 'red' if x < 0 else 'green', 'size': 14})

# Decorations
plt.yticks(df.index, df.cars, fontsize=12)
plt.title('Diverging Text Bars of Car Mileage', fontdict={'size': 20})
plt.grid(linestyle='--', alpha=0.5)
plt.xlim(-2.5, 2.5)
plt.show()
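【12】Diverging Dot Plot
The diverging dot plot is similar to diverging bars. However, compared with diverging bars, the absence of bars reduces the contrast and disparity between groups.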
# Prepare data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
x = df.loc[:, ['mpg']]
df['mpg_z'] = (x - x.mean()) / x.std()
df['colors'] = ['red' if x < 0 else 'darkgreen' for x in df['mpg_z']]
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)

# Draw plot
plt.figure(figsize=(14, 16), dpi=80)
plt.scatter(df.mpg_z, df.index, s=450, alpha=.6, color=df.colors)
for x, y, tex in zip(df.mpg_z, df.index, df.mpg_z):
    t = plt.text(x, y, round(tex, 1), horizontalalignment='center',
                 verticalalignment='center', fontdict={'color': 'white'})

# Decorations
# Lighten the borders
plt.gca().spines["top"].set_alpha(.3)
plt.gca().spines["bottom"].set_alpha(.3)
plt.gca().spines["right"].set_alpha(.3)
plt.gca().spines["left"].set_alpha(.3)

plt.yticks(df.index, df.cars)
plt.title('Diverging Dotplot of Car Mileage', fontdict={'size': 20})
plt.xlabel('$Mileage$')
plt.grid(linestyle='--', alpha=0.5)
plt.xlim(-2.5, 2.5)
plt.show()
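【13】Diverging Lollipop Chart with Markers
A lollipop chart with markers gives you a flexible way to emphasize any significant data points you want to draw attention to, and to annotate the chart with the reasoning.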
# Prepare data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
x = df.loc[:, ['mpg']]
df['mpg_z'] = (x - x.mean()) / x.std()
df['colors'] = 'black'

# Color the Fiat differently
df.loc[df.cars == 'Fiat X1-9', 'colors'] = 'darkorange'
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)

# Draw plot
import matplotlib.patches as patches

plt.figure(figsize=(14, 16), dpi=80)
plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z, color=df.colors, alpha=0.4, linewidth=1)
plt.scatter(df.mpg_z, df.index, color=df.colors, s=[600 if x == 'Fiat X1-9' else 300 for x in df.cars], alpha=0.6)
plt.yticks(df.index, df.cars)
plt.xticks(fontsize=12)

# Annotate
plt.annotate('Mercedes Models', xy=(0.0, 11.0), xytext=(1.0, 11), xycoords='data',
             fontsize=15, ha='center', va='center',
             bbox=dict(boxstyle='square', fc='firebrick'),
             arrowprops=dict(arrowstyle='-[, widthB=2.0, lengthB=1.5', lw=2.0, color='steelblue'), color='white')

# Add patches
p1 = patches.Rectangle((-2.0, -1), width=.3, height=3, alpha=.2, facecolor='red')
p2 = patches.Rectangle((1.5, 27), width=.8, height=5, alpha=.2, facecolor='green')
plt.gca().add_patch(p1)
plt.gca().add_patch(p2)

# Decorations
plt.title('Diverging Bars of Car Mileage', fontdict={'size': 20})
plt.grid(linestyle='--', alpha=0.5)
plt.show()
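【14】Area Chart
By coloring the area between the axis and the line, an area chart emphasizes not only the peaks and troughs but also how long they last. The longer a high persists, the larger the area under the line.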
import numpy as np
import pandas as pd

# Prepare data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/economics.csv", parse_dates=['date']).head(100)
x = np.arange(df.shape[0])
y_returns = (df.psavert.diff().fillna(0) / df.psavert.shift(1)).fillna(0) * 100

# Plot
plt.figure(figsize=(16, 10), dpi=80)
plt.fill_between(x[1:], y_returns[1:], 0, where=y_returns[1:] >= 0, facecolor='green', interpolate=True, alpha=0.7)
plt.fill_between(x[1:], y_returns[1:], 0, where=y_returns[1:] <= 0, facecolor='red', interpolate=True, alpha=0.7)

# Annotate
plt.annotate('Peak \n1975', xy=(94.0, 21.0), xytext=(88.0, 28),
             bbox=dict(boxstyle='square', fc='firebrick'),
             arrowprops=dict(facecolor='steelblue', shrink=0.05), fontsize=15, color='white')

# Decorations
xtickvals = [str(m)[:3].upper() + "-" + str(y) for y, m in zip(df.date.dt.year, df.date.dt.month_name())]
plt.gca().set_xticks(x[::6])
plt.gca().set_xticklabels(xtickvals[::6], rotation=90,
                          fontdict={'horizontalalignment': 'center', 'verticalalignment': 'center_baseline'})
plt.ylim(-35, 35)
plt.xlim(1, 100)
plt.title("Month Economics Return %", fontsize=22)
plt.ylabel('Monthly returns %')
plt.grid(alpha=0.5)
plt.show()
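The `y_returns` expression above is a month-over-month percentage change: the difference from the previous value divided by the previous value, times 100, with the first entry set to 0. The same arithmetic on a short made-up series, as a sketch (the function name is illustrative):

```python
def pct_change(values):
    """Percentage change from each value to the next; the first entry is 0."""
    changes = [0.0]
    for previous, current in zip(values, values[1:]):
        changes.append((current - previous) / previous * 100)
    return changes

# Made-up savings-rate series
psavert = [10.0, 12.0, 9.0]
returns = pct_change(psavert)
print(returns)
```

Positive entries correspond to the green regions of the area chart and negative entries to the red regions.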
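【5x00】Ranking
【15】Ordered Bar Chart
An ordered bar chart conveys the rank order of items effectively. Adding the metric's value above each bar lets the reader get precise information from the chart itself.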
# Prepare data
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
# groupby(...).mean() replaces apply(lambda x: x.mean()), which fails on the
# text column in recent pandas versions
df = df_raw[['cty', 'manufacturer']].groupby('manufacturer').mean()
df.sort_values('cty', inplace=True)
df.reset_index(inplace=True)

# Draw plot
import matplotlib.patches as patches

fig, ax = plt.subplots(figsize=(16, 10), facecolor='white', dpi=80)
ax.vlines(x=df.index, ymin=0, ymax=df.cty, color='firebrick', alpha=0.7, linewidth=20)

# Annotate each bar with its value
for i, cty in enumerate(df.cty):
    ax.text(i, cty + 0.5, round(cty, 1), horizontalalignment='center')

# Title, labels and ticks
ax.set_title('Bar Chart for Highway Mileage', fontdict={'size': 22})
ax.set(ylabel='Miles Per Gallon', ylim=(0, 30))
plt.xticks(df.index, df.manufacturer.str.upper(), rotation=60, horizontalalignment='right', fontsize=12)

# Add patches to color the x-axis labels
p1 = patches.Rectangle((.57, -0.005), width=.33, height=.13, alpha=.1, facecolor='green', transform=fig.transFigure)
p2 = patches.Rectangle((.124, -0.005), width=.446, height=.13, alpha=.1, facecolor='red', transform=fig.transFigure)
fig.add_artist(p1)
fig.add_artist(p2)
plt.show()
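The data preparation for the ranking charts boils down to two steps: average the metric within each group, then sort the groups by that average. A stdlib sketch of the same logic with made-up rows:

```python
from collections import defaultdict

# (manufacturer, city mileage) rows; made-up data
rows = [("audi", 18), ("audi", 20), ("jeep", 13), ("honda", 24), ("honda", 26)]

# Step 1: collect the values of each group
groups = defaultdict(list)
for make, cty in rows:
    groups[make].append(cty)

# Step 2: average each group
means = {make: sum(values) / len(values) for make, values in groups.items()}

# Step 3: rank the groups from lowest to highest mean
ranked = sorted(means.items(), key=lambda item: item[1])
print(ranked)
```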
【16】Lollipop Chart
A lollipop chart serves a similar purpose to an ordered bar chart in a visually pleasing way.
# Prepare data
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
# groupby(...).mean() replaces apply(lambda x: x.mean()), which fails on the
# text column in recent pandas versions
df = df_raw[['cty', 'manufacturer']].groupby('manufacturer').mean()
df.sort_values('cty', inplace=True)
df.reset_index(inplace=True)

# Draw plot
fig, ax = plt.subplots(figsize=(16, 10), dpi=80)
ax.vlines(x=df.index, ymin=0, ymax=df.cty, color='firebrick', alpha=0.7, linewidth=2)
ax.scatter(x=df.index, y=df.cty, s=75, color='firebrick', alpha=0.7)

# Title, labels, ticks and ylim
ax.set_title('Lollipop Chart for Highway Mileage', fontdict={'size': 22})
ax.set_ylabel('Miles Per Gallon')
ax.set_xticks(df.index)
ax.set_xticklabels(df.manufacturer.str.upper(), rotation=60, fontdict={'horizontalalignment': 'right', 'size': 12})
ax.set_ylim(0, 30)

# Annotate
for row in df.itertuples():
    ax.text(row.Index, row.cty + .5, s=round(row.cty, 2),
            horizontalalignment='center', verticalalignment='bottom', fontsize=14)

plt.show()
【17】Dot Plot
A dot plot conveys the rank order of items. Because the dots are aligned along the horizontal axis, the distances between them are easier to see.
# Prepare data
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
# groupby(...).mean() replaces apply(lambda x: x.mean()), which fails on the
# text column in recent pandas versions
df = df_raw[['cty', 'manufacturer']].groupby('manufacturer').mean()
df.sort_values('cty', inplace=True)
df.reset_index(inplace=True)

# Draw plot
fig, ax = plt.subplots(figsize=(16, 10), dpi=80)
ax.hlines(y=df.index, xmin=11, xmax=26, color='gray', alpha=0.7, linewidth=1, linestyles='dashdot')
ax.scatter(y=df.index, x=df.cty, s=75, color='firebrick', alpha=0.7)

# Title, labels, ticks and xlim
ax.set_title('Dot Plot for Highway Mileage', fontdict={'size': 22})
ax.set_xlabel('Miles Per Gallon')
ax.set_yticks(df.index)
ax.set_yticklabels(df.manufacturer.str.title(), fontdict={'horizontalalignment': 'right'})
ax.set_xlim(10, 27)
plt.show()
【18】Slope Chart
A slope chart is best suited to comparing the "before" and "after" positions of a given person or item.
import matplotlib.lines as mlines

# Import data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/gdppercap.csv")

left_label = [str(c) + ', ' + str(round(y)) for c, y in zip(df.continent, df['1952'])]
right_label = [str(c) + ', ' + str(round(y)) for c, y in zip(df.continent, df['1957'])]
klass = ['red' if (y1 - y2) < 0 else 'green' for y1, y2 in zip(df['1952'], df['1957'])]

# Draw a line between two points: red if the value falls, green otherwise
def newline(p1, p2, color='black'):
    ax = plt.gca()
    l = mlines.Line2D([p1[0], p2[0]], [p1[1], p2[1]],
                      color='red' if p1[1] - p2[1] > 0 else 'green', marker='o', markersize=6)
    ax.add_line(l)
    return l

fig, ax = plt.subplots(1, 1, figsize=(14, 14), dpi=80)

# Vertical lines
ax.vlines(x=1, ymin=500, ymax=13000, color='black', alpha=0.7, linewidth=1, linestyles='dotted')
ax.vlines(x=3, ymin=500, ymax=13000, color='black', alpha=0.7, linewidth=1, linestyles='dotted')

# Points
ax.scatter(y=df['1952'], x=np.repeat(1, df.shape[0]), s=10, color='black', alpha=0.7)
ax.scatter(y=df['1957'], x=np.repeat(3, df.shape[0]), s=10, color='black', alpha=0.7)

# Line segments and annotations
for p1, p2, c in zip(df['1952'], df['1957'], df['continent']):
    newline([1, p1], [3, p2])
    ax.text(1 - 0.05, p1, c + ', ' + str(round(p1)), horizontalalignment='right',
            verticalalignment='center', fontdict={'size': 14})
    ax.text(3 + 0.05, p2, c + ', ' + str(round(p2)), horizontalalignment='left',
            verticalalignment='center', fontdict={'size': 14})

# 'Before' and 'After' annotations
ax.text(1 - 0.05, 13000, 'BEFORE', horizontalalignment='right',
        verticalalignment='center', fontdict={'size': 18, 'weight': 700})
ax.text(3 + 0.05, 13000, 'AFTER', horizontalalignment='left',
        verticalalignment='center', fontdict={'size': 18, 'weight': 700})

# Decorations
ax.set_title("Slopechart: Comparing GDP Per Capita between 1952 vs 1957", fontdict={'size': 22})
ax.set(xlim=(0, 4), ylim=(0, 14000), ylabel='Mean GDP Per Capita')
ax.set_xticks([1, 3])
ax.set_xticklabels(["1952", "1957"])
plt.yticks(np.arange(500, 13000, 2000), fontsize=12)

# Hide the borders
plt.gca().spines["top"].set_alpha(.0)
plt.gca().spines["bottom"].set_alpha(.0)
plt.gca().spines["right"].set_alpha(.0)
plt.gca().spines["left"].set_alpha(.0)
plt.show()
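【19】Dumbbell Plot
A dumbbell plot conveys the "before" and "after" positions of various items along with their rank order. It is very useful for visualizing the effect of a particular project or initiative on different objects.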
import matplotlib.lines as mlines

# Import data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/health.csv")
df.sort_values('pct_2014', inplace=True)
df.reset_index(inplace=True)

# Draw a line segment between two points
def newline(p1, p2, color='black'):
    ax = plt.gca()
    l = mlines.Line2D([p1[0], p2[0]], [p1[1], p2[1]], color='skyblue')
    ax.add_line(l)
    return l

# Figure and axes
fig, ax = plt.subplots(1, 1, figsize=(14, 14), facecolor='#f7f7f7', dpi=80)

# Vertical guide lines
for x_guide in (.05, .10, .15, .20):
    ax.vlines(x=x_guide, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')

# Points
ax.scatter(y=df['index'], x=df['pct_2013'], s=50, color='#0e668b', alpha=0.7)
ax.scatter(y=df['index'], x=df['pct_2014'], s=50, color='#a3c4dc', alpha=0.7)

# Line segments
for i, p1, p2 in zip(df['index'], df['pct_2013'], df['pct_2014']):
    newline([p1, i], [p2, i])

# Decorations
ax.set_facecolor('#f7f7f7')
ax.set_title("Dumbbell Chart: Pct Change - 2013 vs 2014", fontdict={'size': 22})
ax.set(xlim=(0, .25), ylim=(-1, 27), ylabel='Mean GDP Per Capita')
ax.set_xticks([.05, .1, .15, .20])
# Labels now match the tick positions; the original listed '5%', '15%', '20%', '25%' twice
ax.set_xticklabels(['5%', '10%', '15%', '20%'])
plt.show()
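【6x00】Distribution
【20】Histogram for Continuous Variable
A histogram of a continuous variable shows the frequency distribution of that variable. The chart below groups the frequency bars by a categorical variable, giving more insight into both the continuous variable and the categorical variable.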
# Import data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Prepare data
x_var = 'displ'
groupby_var = 'class'
df_agg = df.loc[:, [x_var, groupby_var]].groupby(groupby_var)
vals = [df[x_var].values.tolist() for i, df in df_agg]

# Draw
plt.figure(figsize=(16, 9), dpi=80)
colors = [plt.cm.Spectral(i / float(len(vals) - 1)) for i in range(len(vals))]
n, bins, patches = plt.hist(vals, 30, stacked=True, density=False, color=colors[:len(vals)])

# Decorations
plt.legend({group: col for group, col in zip(np.unique(df[groupby_var]).tolist(), colors[:len(vals)])})
plt.title(f"Stacked Histogram of ${x_var}$ colored by ${groupby_var}$", fontsize=22)
plt.xlabel(x_var)
plt.ylabel("Frequency")
plt.ylim(0, 25)
plt.xticks(ticks=bins[::3], labels=[round(b, 1) for b in bins[::3]])
plt.show()
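A histogram splits the value range into equal-width bins and counts how many values fall into each one. The sketch below does this by hand on made-up data. It is a simplified illustration rather than Matplotlib's implementation, and the function name is hypothetical. Note that `n_bins` bins are bounded by `n_bins + 1` edges, which matters when using the returned edges as tick positions.

```python
def histogram(values, n_bins):
    """Count values in n_bins equal-width bins; return (counts, edges)."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value belongs to the last bin rather than a new one
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(n_bins + 1)]
    return counts, edges

# Made-up engine displacement values
displ = [1, 2, 2, 3, 4, 5]
counts, edges = histogram(displ, n_bins=4)
print(counts, edges)
```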
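【21】Histogram for Categorical Variable
A histogram of a categorical variable shows the frequency distribution of that variable. By coloring the bars, you can relate the distribution to another categorical variable represented by the colors.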
# Import data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Prepare data
x_var = 'manufacturer'
groupby_var = 'class'
df_agg = df.loc[:, [x_var, groupby_var]].groupby(groupby_var)
vals = [df[x_var].values.tolist() for i, df in df_agg]

# Draw
plt.figure(figsize=(16, 9), dpi=80)
colors = [plt.cm.Spectral(i / float(len(vals) - 1)) for i in range(len(vals))]
n, bins, patches = plt.hist(vals, len(df[x_var].unique()), stacked=True, density=False, color=colors[:len(vals)])

# Decorations
plt.legend({group: col for group, col in zip(np.unique(df[groupby_var]).tolist(), colors[:len(vals)])})
plt.title(f"Stacked Histogram of ${x_var}$ colored by ${groupby_var}$", fontsize=22)
plt.xlabel(x_var)
plt.ylabel("Frequency")
plt.ylim(0, 40)
# bins holds one more edge than there are categories, so drop the last edge
# to keep the number of ticks equal to the number of labels
plt.xticks(ticks=bins[:-1], labels=np.unique(df[x_var]).tolist(), rotation=90, horizontalalignment='left')
plt.show()
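【22】Density Plot
Density plots are a common tool for visualizing the distribution of a continuous variable. By grouping them by the "response" variable, you can inspect the relationship between X and Y. The example below illustrates how the distribution of city mileage varies with the number of cylinders.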
# Import data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Draw plot
# fill=True replaces shade=True, which is deprecated in recent seaborn versions
plt.figure(figsize=(16, 10), dpi=80)
sns.kdeplot(df.loc[df['cyl'] == 4, "cty"], fill=True, color="g", label="Cyl=4", alpha=.7)
sns.kdeplot(df.loc[df['cyl'] == 5, "cty"], fill=True, color="deeppink", label="Cyl=5", alpha=.7)
sns.kdeplot(df.loc[df['cyl'] == 6, "cty"], fill=True, color="dodgerblue", label="Cyl=6", alpha=.7)
sns.kdeplot(df.loc[df['cyl'] == 8, "cty"], fill=True, color="orange", label="Cyl=8", alpha=.7)

# Decorations
plt.title('Density Plot of City Mileage by n_Cylinders', fontsize=22)
plt.legend()
plt.show()
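A density curve of this kind is a kernel density estimate: a small Gaussian bump is placed on every observation and the bumps are averaged. The sketch below evaluates such an estimate at a single point. The bandwidth is fixed by hand here as an assumption, whereas seaborn chooses one automatically, and the function name is illustrative.

```python
import math

def gaussian_kde(x, data, bandwidth):
    """Gaussian kernel density estimate of `data`, evaluated at point x."""
    n = len(data)
    total = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
    return total / (n * bandwidth * math.sqrt(2 * math.pi))

# One observation at 0 with bandwidth 1 gives the standard normal density
peak = gaussian_kde(0.0, [0.0], bandwidth=1.0)

# Two observations: the estimate is symmetric around their midpoint
left = gaussian_kde(0.5, [0.0, 2.0], bandwidth=1.0)
right = gaussian_kde(1.5, [0.0, 2.0], bandwidth=1.0)
print(peak, left, right)
```

A smaller bandwidth makes the curve follow individual points more closely; a larger one smooths it out.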
【23】Density Curves with Histogram
A density curve with a histogram brings together the information conveyed by both plots, so you can have them in a single figure instead of two.
# Import data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Draw plot
# Note: distplot is deprecated in recent seaborn releases; histplot(..., kde=True)
# is the suggested replacement
plt.figure(figsize=(13, 10), dpi=80)
sns.distplot(df.loc[df['class'] == 'compact', "cty"], color="dodgerblue", label="Compact",
             hist_kws={'alpha': .7}, kde_kws={'linewidth': 3})
sns.distplot(df.loc[df['class'] == 'suv', "cty"], color="orange", label="SUV",
             hist_kws={'alpha': .7}, kde_kws={'linewidth': 3})
sns.distplot(df.loc[df['class'] == 'minivan', "cty"], color="g", label="minivan",
             hist_kws={'alpha': .7}, kde_kws={'linewidth': 3})
plt.ylim(0, 0.35)

# Decorations
plt.title('Density Plot of City Mileage by Vehicle Type', fontsize=22)
plt.legend()
plt.show()
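【24】Joy Plot
A Joy Plot lets the density curves of different groups overlap. It is a great way to visualize the distribution of a large number of groups in relation to each other. It is easy on the eye and conveys the right information clearly. It can be built easily with the joypy package, which is based on matplotlib.
【Translator TRHX's note: a Joy Plot looks like overlapping mountain ridges, rising and falling in distinct layers, but it is named "Joy", which makes it hard to translate. Install the joypy library before using it.】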
import joypy

# Import data
mpg = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Draw plot
plt.figure(figsize=(16, 10), dpi=80)
fig, axes = joypy.joyplot(mpg, column=['hwy', 'cty'], by="class", ylim='own', figsize=(14, 10))

# Decorations
plt.title('Joy Plot of City and Highway Mileage by Class', fontsize=22)
plt.show()
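【25】Distributed Dot Plot
A distributed dot plot shows the univariate distribution of points segmented by group. The darker the points, the higher the concentration of data points in that region. Coloring the median differently makes the true position of each group immediately apparent.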
import matplotlib.patches as mpatches

# Prepare data
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
cyl_colors = {4: 'tab:red', 5: 'tab:green', 6: 'tab:blue', 8: 'tab:orange'}
df_raw['cyl_color'] = df_raw.cyl.map(cyl_colors)

# Mean and median city mileage by make
# (groupby(...).mean()/.median() replace apply(lambda ...), which fails on the
# text column in recent pandas versions)
df = df_raw[['cty', 'manufacturer']].groupby('manufacturer').mean()
df.sort_values('cty', ascending=False, inplace=True)
df.reset_index(inplace=True)
df_median = df_raw[['cty', 'manufacturer']].groupby('manufacturer').median()

# Draw horizontal lines
fig, ax = plt.subplots(figsize=(16, 10), dpi=80)
ax.hlines(y=df.index, xmin=0, xmax=40, color='gray', alpha=0.5, linewidth=.5, linestyles='dashdot')

# Draw the dots
for i, make in enumerate(df.manufacturer):
    df_make = df_raw.loc[df_raw.manufacturer == make, :]
    ax.scatter(y=np.repeat(i, df_make.shape[0]), x='cty', data=df_make, s=75, edgecolors='gray', c='w', alpha=0.5)
    ax.scatter(y=i, x='cty', data=df_median.loc[df_median.index == make, :], s=75, c='firebrick')

# Annotate (raw string, so the mathtext backslashes are not treated as escapes)
ax.text(33, 13, r"$red \; dots \; are \; the \: median$", fontdict={'size': 12}, color='firebrick')

# Decorations
red_patch = plt.plot([], [], marker="o", ms=10, ls="", mec=None, color='firebrick', label="Median")
plt.legend(handles=red_patch)
ax.set_title('Distribution of City Mileage by Make', fontdict={'size': 22})
ax.set_xlabel('Miles Per Gallon (City)', alpha=0.7)
ax.set_yticks(df.index)
ax.set_yticklabels(df.manufacturer.str.title(), fontdict={'horizontalalignment': 'right'}, alpha=0.7)
ax.set_xlim(1, 40)
plt.xticks(alpha=0.7)
plt.gca().spines["top"].set_visible(False)
plt.gca().spines["bottom"].set_visible(False)
plt.gca().spines["right"].set_visible(False)
plt.gca().spines["left"].set_visible(False)
plt.grid(axis='both', alpha=.4, linewidth=.1)
plt.show()
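【26】Box Plot
Box plots are a great way to visualize a distribution while keeping the median, the 25th and 75th percentiles, and the outliers in view. However, be careful when interpreting the size of a box, which can misrepresent the number of points in that group. Manually adding the number of observations to each box helps overcome this drawback.
For example, two boxes can look the same size even though one group has only 5 observations and another has 47, so it is worth writing the number of observations on each box.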
# Import data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Draw plot
plt.figure(figsize=(13, 10), dpi=80)
sns.boxplot(x='class', y='hwy', data=df, notch=False)

# Add the number of observations inside each box (optional)
def add_n_obs(df, group_col, y):
    medians = df.groupby(group_col)[y].median()
    n_obs = df.groupby(group_col)[y].size()
    xticklabels = [x.get_text() for x in plt.gca().get_xticklabels()]
    for x, xticklabel in enumerate(xticklabels):
        # Look the count up by label so it matches the order of the boxes on the axis
        plt.text(x, medians[xticklabel] * 1.01, "#obs : " + str(n_obs[xticklabel]),
                 horizontalalignment='center', fontdict={'size': 14}, color='white')

add_n_obs(df, group_col='class', y='hwy')

# Decorations
plt.title('Box Plot of Highway Mileage by Vehicle Class', fontsize=22)
plt.ylim(10, 40)
plt.show()
【27】Dot + Box Plot
A dot + box plot conveys similar information to a box plot split by group. In addition, the dots give a sense of how many data points lie in each group.
# Import data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Draw plot
plt.figure(figsize=(13, 10), dpi=80)
sns.boxplot(x='class', y='hwy', data=df, hue='cyl')
sns.stripplot(x='class', y='hwy', data=df, color='black', size=3, jitter=1)

# Vertical separators between classes
for i in range(len(df['class'].unique()) - 1):
    plt.vlines(i + .5, 10, 45, linestyles='solid', colors='gray', alpha=0.2)

# Decorations
plt.title('Box Plot of Highway Mileage by Vehicle Class', fontsize=22)
plt.legend(title='Cylinders')
plt.show()
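【28】Violin Plot
A violin plot is a visually pleasing alternative to a box plot. The shape or area of the violin depends on the number of observations it holds. However, violin plots can be harder to read and are less commonly used in professional settings.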
# Import data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Draw plot
# Note: newer seaborn releases name the scale= parameter density_norm=
plt.figure(figsize=(13, 10), dpi=80)
sns.violinplot(x='class', y='hwy', data=df, scale='width', inner='quartile')

# Decorations
plt.title('Violin Plot of Highway Mileage by Vehicle Class', fontsize=22)
plt.show()
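【29】Population Pyramid
A population pyramid can be used to show the distribution of groups ordered by volume. It can also be used to show the stage-by-stage filtering of a population, as it is here to show how many people pass through each stage of a marketing funnel.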
# Read data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/email_campaign_funnel.csv")

# Draw plot
plt.figure(figsize=(13, 10), dpi=80)
group_col = 'Gender'
order_of_bars = df.Stage.unique()[::-1]
colors = [plt.cm.Spectral(i / float(len(df[group_col].unique()) - 1)) for i in range(len(df[group_col].unique()))]

for c, group in zip(colors, df[group_col].unique()):
    sns.barplot(x='Users', y='Stage', data=df.loc[df[group_col] == group, :],
                order=order_of_bars, color=c, label=group)

# Decorations
plt.xlabel("$Users$")
plt.ylabel("Stage of Purchase")
plt.yticks(fontsize=12)
plt.title("Population Pyramid of the Marketing Funnel", fontsize=22)
plt.legend()
plt.show()
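【30】Categorical Plots
The categorical plots provided by the seaborn library can be used to visualize the count distribution of two or more categorical variables in relation to each other.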
# Load dataset
titanic = sns.load_dataset("titanic")

# Plot (x is passed by keyword, as recent seaborn versions require)
g = sns.catplot(x="alive", col="deck", col_wrap=4,
                data=titanic[titanic.deck.notnull()],
                kind="count", height=3.5, aspect=.8,
                palette='tab20')
plt.show()

# Load dataset
titanic = sns.load_dataset("titanic")

# Plot
sns.catplot(x="age", y="embark_town",
            hue="sex", col="class",
            data=titanic[titanic.embark_town.notnull()],
            orient="h", height=5, aspect=1, palette="tab10",
            kind="violin", dodge=True, cut=0, bw=.2)
plt.show()
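【7x00】 Composition
【31】 Waffle Chart
A waffle chart can be created with the pywaffle package and is used to show the composition of groups within a larger population.
【Translator's note (TRHX): install the pywaffle library before using this method.】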
from pywaffle import Waffle

df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Number of vehicles per class
df = df_raw.groupby('class').size().reset_index(name='counts')
n_categories = df.shape[0]
colors = [plt.cm.inferno_r(i / float(n_categories)) for i in range(n_categories)]

fig = plt.figure(
    FigureClass=Waffle,
    plots={
        '111': {
            'values': df['counts'],
            # index=False so that n[0] is the class name and n[1] the count
            # (with the default index=True, n[0] would be the row index)
            'labels': ["{0} ({1})".format(n[0], n[1])
                       for n in df[['class', 'counts']].itertuples(index=False)],
            'legend': {'loc': 'upper left', 'bbox_to_anchor': (1.05, 1), 'fontsize': 12},
            'title': {'label': '# Vehicles by Class', 'loc': 'center', 'fontsize': 18}
        },
    },
    rows=7,
    colors=colors,
    figsize=(16, 9)
)
plt.show()
from pywaffle import Waffle

df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# By class
df_class = df_raw.groupby('class').size().reset_index(name='counts_class')
n_categories = df_class.shape[0]
colors_class = [plt.cm.Set3(i / float(n_categories)) for i in range(n_categories)]

# By cylinders
df_cyl = df_raw.groupby('cyl').size().reset_index(name='counts_cyl')
n_categories = df_cyl.shape[0]
colors_cyl = [plt.cm.Spectral(i / float(n_categories)) for i in range(n_categories)]

# By manufacturer
df_make = df_raw.groupby('manufacturer').size().reset_index(name='counts_make')
n_categories = df_make.shape[0]
colors_make = [plt.cm.tab20b(i / float(n_categories)) for i in range(n_categories)]

fig = plt.figure(
    FigureClass=Waffle,
    plots={
        '311': {
            'values': df_class['counts_class'],
            'labels': ["{1}".format(n[0], n[1]) for n in df_class[['class', 'counts_class']].itertuples()],
            'legend': {'loc': 'upper left', 'bbox_to_anchor': (1.05, 1), 'fontsize': 12, 'title': 'Class'},
            'title': {'label': '# Vehicles by Class', 'loc': 'center', 'fontsize': 18},
            'colors': colors_class
        },
        '312': {
            'values': df_cyl['counts_cyl'],
            'labels': ["{1}".format(n[0], n[1]) for n in df_cyl[['cyl', 'counts_cyl']].itertuples()],
            'legend': {'loc': 'upper left', 'bbox_to_anchor': (1.05, 1), 'fontsize': 12, 'title': 'Cyl'},
            'title': {'label': '# Vehicles by Cyl', 'loc': 'center', 'fontsize': 18},
            'colors': colors_cyl
        },
        '313': {
            'values': df_make['counts_make'],
            'labels': ["{1}".format(n[0], n[1]) for n in df_make[['manufacturer', 'counts_make']].itertuples()],
            'legend': {'loc': 'upper left', 'bbox_to_anchor': (1.05, 1), 'fontsize': 12, 'title': 'Manufacturer'},
            'title': {'label': '# Vehicles by Make', 'loc': 'center', 'fontsize': 18},
            'colors': colors_make
        }
    },
    rows=9,
    figsize=(16, 14)
)
plt.show()
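【32】 Pie Chart
A pie chart is a classic way to show the composition of groups. However, it is generally not advisable nowadays, because the area of the slices can be misleading. If you do use a pie chart, it is strongly recommended to write down the percentage or count for each slice explicitly.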
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

df = df_raw.groupby('class').size()

# Original code: df.plot(kind='pie', subplots=True, figsize=(8, 8), dpi=80)
# Translator TRHX removed dpi=80
df.plot(kind='pie', subplots=True, figsize=(8, 8))
plt.title("Pie Chart of Vehicle Class - Bad")
plt.ylabel("")
plt.show()
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

df = df_raw.groupby('class').size().reset_index(name='counts')

fig, ax = plt.subplots(figsize=(12, 7), subplot_kw=dict(aspect="equal"), dpi=80)

data = df['counts']
categories = df['class']
explode = [0, 0, 0, 0, 0, 0.1, 0]


def func(pct, allvals):
    # Round rather than truncate so floating-point error cannot drop the count by one
    absolute = int(round(pct / 100. * np.sum(allvals)))
    return "{:.1f}% ({:d} )".format(pct, absolute)


wedges, texts, autotexts = ax.pie(data,
                                  autopct=lambda pct: func(pct, data),
                                  textprops=dict(color="w"),
                                  colors=plt.cm.Dark2.colors,
                                  startangle=140,
                                  explode=explode)

ax.legend(wedges, categories, title="Vehicle Class", loc="center left", bbox_to_anchor=(1, 0, 0.5, 1))
plt.setp(autotexts, size=10, weight=700)
ax.set_title("Class of Vehicles: Pie Chart")
plt.show()
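The advice to print the percentage and the count on every slice comes down to a small piece of arithmetic. The sketch below shows one way to build such labels directly from raw counts, without going through matplotlib's autopct callback. The helper name slice_labels and the sample counts are hypothetical and are not part of the original article.

```python
def slice_labels(counts):
    """Return one 'percent (count)' label per slice, computed from raw counts."""
    total = sum(counts)
    return ["{:.1f}% ({:d})".format(100.0 * c / total, c) for c in counts]


# Hypothetical counts for illustration only (not the mpg dataset)
example_counts = [1, 1, 2]
print(slice_labels(example_counts))
```

Labels built this way could be passed to ax.pie through its labels argument. Because the count never goes through a percentage and back, no rounding step is needed.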
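【33】 Treemap
A treemap is similar to a pie chart, but it does the job better without misleading the reader about each group's contribution.
【Translator's note (TRHX): install the squarify library before using this method.】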
import squarify

df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

df = df_raw.groupby('class').size().reset_index(name='counts')
# Index by column name; positional x[0] / x[1] on a row is deprecated in recent pandas
labels = df.apply(lambda x: str(x['class']) + "\n (" + str(x['counts']) + ")", axis=1)
sizes = df['counts'].values.tolist()
colors = [plt.cm.Spectral(i / float(len(labels))) for i in range(len(labels))]

plt.figure(figsize=(12, 8), dpi=80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)

plt.title('Treemap of Vehicle Class')
plt.axis('off')
plt.show()
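【34】 Bar Chart
A bar chart is a classic way of visualizing items based on counts or any given metric. In the chart below I have used a different color for each item, but you would typically pick one color for all items unless you are coloring them by group. The color names are stored in all_colors in the code below. You can change the color of the bars by setting the color parameter in plt.bar().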
import random

df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

df = df_raw.groupby('manufacturer').size().reset_index(name='counts')
n = len(df['manufacturer'].unique()) + 1
all_colors = list(plt.cm.colors.cnames.keys())
random.seed(100)
c = random.choices(all_colors, k=n)

plt.figure(figsize=(16, 10), dpi=80)
plt.bar(df['manufacturer'], df['counts'], color=c, width=.5)
# Print the count above each bar
for i, val in enumerate(df['counts'].values):
    plt.text(i, val, float(val), horizontalalignment='center',
             verticalalignment='bottom', fontdict={'fontweight': 500, 'size': 12})

plt.gca().set_xticklabels(df['manufacturer'], rotation=60, horizontalalignment='right')
plt.title("Number of Vehicles by Manufacturers", fontsize=22)
plt.ylabel('# Vehicles')
plt.ylim(0, 45)
plt.show()
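【8x00】 Change
【35】 Time Series Plot
A time series plot is used to visualize how a given metric changes over time. Here you can see how air passenger traffic changed between 1949 and 1960, the period covered by the AirPassengers dataset.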
df = pd.read_csv('https://github.com/selva86/datasets/raw/master/AirPassengers.csv')

plt.figure(figsize=(16, 10), dpi=80)
plt.plot('date', 'traffic', data=df, color='tab:red')

plt.ylim(50, 750)
# One tick per year: every 12th monthly row, labelled with the last four characters of the date
xtick_location = df.index.tolist()[::12]
xtick_labels = [x[-4:] for x in df.date.tolist()[::12]]
plt.xticks(ticks=xtick_location, labels=xtick_labels, rotation=0, fontsize=12,
           horizontalalignment='center', alpha=.7)
plt.yticks(fontsize=12, alpha=.7)
plt.title("Air Passengers Traffic (1949 - 1960)", fontsize=22)
plt.grid(axis='both', alpha=.3)

plt.gca().spines["top"].set_alpha(0.0)
plt.gca().spines["bottom"].set_alpha(0.3)
plt.gca().spines["right"].set_alpha(0.0)
plt.gca().spines["left"].set_alpha(0.3)
plt.show()
【36】 Time Series with Peaks and Troughs Annotated
The time series below plots all the peaks and troughs and annotates the occurrence of selected special events.
df = pd.read_csv('https://github.com/selva86/datasets/raw/master/AirPassengers.csv')

# Peaks: the slope sign flips from +1 to -1, so its difference equals -2
data = df['traffic'].values
doublediff = np.diff(np.sign(np.diff(data)))
peak_locations = np.where(doublediff == -2)[0] + 1

# Troughs: the same test applied to the negated series
doublediff2 = np.diff(np.sign(np.diff(-1 * data)))
trough_locations = np.where(doublediff2 == -2)[0] + 1

plt.figure(figsize=(16, 10), dpi=80)
plt.plot('date', 'traffic', data=df, color='tab:blue', label='Air Traffic')
plt.scatter(df.date[peak_locations], df.traffic[peak_locations],
            marker=mpl.markers.CARETUPBASE, color='tab:green', s=100, label='Peaks')
plt.scatter(df.date[trough_locations], df.traffic[trough_locations],
            marker=mpl.markers.CARETDOWNBASE, color='tab:red', s=100, label='Troughs')

# Annotate a subset of the peaks and troughs with their dates
for t, p in zip(trough_locations[1::5], peak_locations[::3]):
    plt.text(df.date[p], df.traffic[p] + 15, df.date[p], horizontalalignment='center', color='darkgreen')
    plt.text(df.date[t], df.traffic[t] - 35, df.date[t], horizontalalignment='center', color='darkred')

plt.ylim(50, 750)
xtick_location = df.index.tolist()[::6]
xtick_labels = df.date.tolist()[::6]
plt.xticks(ticks=xtick_location, labels=xtick_labels, rotation=90, fontsize=12, alpha=.7)
plt.title("Peak and Troughs of Air Passengers Traffic (1949 - 1960)", fontsize=22)
plt.yticks(fontsize=12, alpha=.7)

plt.gca().spines["top"].set_alpha(.0)
plt.gca().spines["bottom"].set_alpha(.3)
plt.gca().spines["right"].set_alpha(.0)
plt.gca().spines["left"].set_alpha(.3)

plt.legend(loc='upper left')
plt.grid(axis='y', alpha=.3)
plt.show()
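The peak and trough detection in this example relies on a compact NumPy idiom that is easier to follow on a tiny series. The sketch below isolates that idiom. The function name find_turning_points and the toy values are hypothetical, and it tests for a difference of +2 to find troughs instead of negating the data, which should be equivalent.

```python
import numpy as np


def find_turning_points(values):
    """Return (peak_indices, trough_indices) of strict local extrema."""
    values = np.asarray(values, dtype=float)
    # Sign of the slope between neighbours: +1 rising, -1 falling, 0 flat
    slope_sign = np.sign(np.diff(values))
    # A jump from +1 to -1 (difference -2) marks a peak,
    # a jump from -1 to +1 (difference +2) marks a trough
    turn = np.diff(slope_sign)
    peaks = np.where(turn == -2)[0] + 1
    troughs = np.where(turn == 2)[0] + 1
    return peaks, troughs


# Hypothetical toy series for illustration
toy = [1, 3, 2, 5, 4, 6, 1]
peaks, troughs = find_turning_points(toy)
print(peaks.tolist(), troughs.tolist())
```

One limitation worth keeping in mind: a flat top such as 3, 5, 5, 3 produces slope signs +1, 0, -1, so neither difference equals -2 and the plateau is not reported as a peak.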
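【37】 Autocorrelation (ACF) and Partial Autocorrelation (PACF) Plot
The ACF plot shows the correlation of the time series with its own lags. Each vertical line (on the autocorrelation plot) represents the correlation between the series and one of its lags, starting from lag 0. The blue shaded region in the plot is the significance level. Lags that extend beyond the blue region are the significant lags.
So how do you interpret this?
For AirPassengers, we see that more than 14 lags have crossed the blue line and are therefore significant. Because the data are monthly, this means that traffic from up to 14 months back has an influence on the traffic seen today.
The PACF, on the other hand, shows the autocorrelation of any given lag (of the time series) against the current series, but with the contributions of the lags in between removed.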
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

df = pd.read_csv('https://github.com/selva86/datasets/raw/master/AirPassengers.csv')

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6), dpi=80)
plot_acf(df.traffic.tolist(), ax=ax1, lags=50)
plot_pacf(df.traffic.tolist(), ax=ax2, lags=20)

ax1.spines["top"].set_alpha(.3); ax2.spines["top"].set_alpha(.3)
ax1.spines["bottom"].set_alpha(.3); ax2.spines["bottom"].set_alpha(.3)
ax1.spines["right"].set_alpha(.3); ax2.spines["right"].set_alpha(.3)
ax1.spines["left"].set_alpha(.3); ax2.spines["left"].set_alpha(.3)

ax1.tick_params(axis='both', labelsize=12)
ax2.tick_params(axis='both', labelsize=12)
plt.show()
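To make the idea of "correlation of the series with its own lags" concrete, here is a minimal NumPy sketch of the sample autocorrelation, applied to a synthetic series with a 12-step cycle. The function name sample_acf and the synthetic data are assumptions for illustration. The band of 1.96 / sqrt(n) is the simple white-noise approximation and is not necessarily the band that plot_acf shades, so treat it as a rough guide only.

```python
import numpy as np


def sample_acf(series, nlags):
    """Sample autocorrelation for lags 0..nlags (biased estimator)."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.sum(x * x)
    acf = [1.0]
    for k in range(1, nlags + 1):
        # Pair every observation with the one k steps earlier
        acf.append(np.sum(x[k:] * x[:-k]) / denom)
    return np.array(acf)


# Synthetic series with a 12-step cycle (hypothetical, not the AirPassengers data)
t = np.arange(120)
seasonal = np.sin(2 * np.pi * t / 12)

acf = sample_acf(seasonal, nlags=12)
band = 1.96 / np.sqrt(len(seasonal))  # approximate 95% band under a white-noise assumption
significant = [k for k in range(1, 13) if abs(acf[k]) > band]
print(significant)
```

On a series like this, the autocorrelation is strongly positive at the full cycle length and strongly negative at half the cycle, which is the same pattern of seasonal spikes one would look for in a real ACF plot.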
【38】 Cross Correlation Plot
A cross correlation plot shows how two time series correlate with each other at different lags.
import statsmodels.tsa.stattools as stattools

df = pd.read_csv('https://github.com/selva86/datasets/raw/master/mortality.csv')
x = df['mdeaths']
y = df['fdeaths']

# Cross correlations and an approximate significance threshold
ccs = stattools.ccf(x, y)[:100]
nlags = len(ccs)
conf_level = 2 / np.sqrt(nlags)

plt.figure(figsize=(12, 7), dpi=80)
plt.hlines(0, xmin=0, xmax=100, color='gray')
plt.hlines(conf_level, xmin=0, xmax=100, color='gray')
plt.hlines(-conf_level, xmin=0, xmax=100, color='gray')
plt.bar(x=np.arange(len(ccs)), height=ccs, width=.3)

# Raw string so that the mathtext spacing command \; is not treated as a string escape
plt.title(r'$Cross\; Correlation\; Plot:\; mdeaths\; vs\; fdeaths$', fontsize=22)
plt.xlim(0, len(ccs))
plt.show()
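A small synthetic case may help in reading such a plot: if one series is simply a delayed copy of another, the cross correlation should peak at the delay. The sketch below uses its own definition of lagged correlation in plain NumPy; the function name cross_corr and the random data are hypothetical, and it is not claimed to reproduce the statsmodels implementation exactly.

```python
import numpy as np


def cross_corr(a, b, lag):
    """Correlation between a[t + lag] and b[t] for lag >= 0."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = len(a)
    a0 = a - a.mean()
    b0 = b - b.mean()
    cov = np.sum(a0[lag:] * b0[:n - lag]) / n
    return cov / (a.std() * b.std())


# Hypothetical data: 'follower' repeats 'leader' two steps later
rng = np.random.default_rng(0)
leader = rng.normal(size=200)
follower = np.roll(leader, 2)  # follower[t] == leader[t - 2], wrapping at the start

values = [cross_corr(follower, leader, k) for k in range(10)]
best_lag = int(np.argmax(values))
print(best_lag)
```

The bar that stands far above the others identifies the delay between the two series, while the remaining bars stay close to zero because the underlying data is random noise.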
【39】 Time Series Decomposition Plot
A time series decomposition plot breaks a time series down into trend, seasonal, and residual components.
from statsmodels.tsa.seasonal import seasonal_decompose
from dateutil.parser import parse

df = pd.read_csv('https://github.com/selva86/datasets/raw/master/AirPassengers.csv')
dates = pd.DatetimeIndex([parse(d).strftime('%Y-%m-01') for d in df['date']])
df.set_index(dates, inplace=True)

result = seasonal_decompose(df['traffic'], model='multiplicative')

plt.rcParams.update({'figure.figsize': (10, 10)})
result.plot().suptitle('Time Series Decomposition of Air Passengers')
plt.show()
【40】 Multiple Time Series
You can plot multiple time series that measure the same quantity on the same chart, as shown below.
df = pd.read_csv('https://github.com/selva86/datasets/raw/master/mortality.csv')

y_LL = 100
y_UL = int(df.iloc[:, 1:].max().max() * 1.1)
y_interval = 400
mycolors = ['tab:red', 'tab:blue', 'tab:green', 'tab:orange']

fig, ax = plt.subplots(1, 1, figsize=(16, 9), dpi=80)

columns = df.columns[1:]
for i, column in enumerate(columns):
    plt.plot(df.date.values, df[column].values, lw=1.5, color=mycolors[i])
    plt.text(df.shape[0] + 1, df[column].values[-1], column, fontsize=14, color=mycolors[i])

for y in range(y_LL, y_UL, y_interval):
    plt.hlines(y, xmin=0, xmax=71, colors='black', alpha=0.3, linestyles="--", lw=0.5)

plt.tick_params(axis="both", which="both", bottom=False, top=False,
                labelbottom=True, left=False, right=False, labelleft=True)

plt.gca().spines["top"].set_alpha(.3)
plt.gca().spines["bottom"].set_alpha(.3)
plt.gca().spines["right"].set_alpha(.3)
plt.gca().spines["left"].set_alpha(.3)

plt.title('Number of Deaths from Lung Diseases in the UK (1974-1979)', fontsize=22)
plt.yticks(range(y_LL, y_UL, y_interval), [str(y) for y in range(y_LL, y_UL, y_interval)], fontsize=12)
plt.xticks(range(0, df.shape[0], 12), df.date.values[::12], horizontalalignment='left', fontsize=12)
plt.ylim(y_LL, y_UL)
plt.xlim(-2, 80)
plt.show()
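【41】 Plotting with Different Scales Using a Secondary Y Axis
If you want to show two time series that measure two different quantities at the same points in time, you can plot the second series against a secondary Y axis on the right.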
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/economics.csv")

x = df['date']
y1 = df['psavert']
y2 = df['unemploy']

# Line 1 on the left Y axis
fig, ax1 = plt.subplots(1, 1, figsize=(16, 9), dpi=80)
ax1.plot(x, y1, color='tab:red')

# Line 2 on a second Y axis that shares the same X axis
ax2 = ax1.twinx()
ax2.plot(x, y2, color='tab:blue')

ax1.set_xlabel('Year', fontsize=20)
ax1.tick_params(axis='x', rotation=0, labelsize=12)
ax1.set_ylabel('Personal Savings Rate', color='tab:red', fontsize=20)
ax1.tick_params(axis='y', rotation=0, labelcolor='tab:red')
ax1.grid(alpha=.4)

ax2.set_ylabel("# Unemployed (1000's)", color='tab:blue', fontsize=20)
ax2.tick_params(axis='y', labelcolor='tab:blue')
ax2.set_xticks(np.arange(0, len(x), 60))
ax2.set_xticklabels(x[::60], rotation=90, fontdict={'fontsize': 10})
ax2.set_title("Personal Savings Rate vs Unemployed: Plotting in Secondary Y Axis", fontsize=22)
fig.tight_layout()
plt.show()
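【42】 Time Series with Error Bands
If you have a time series dataset with multiple observations for each time point (date/timestamp), you can construct a time series with error bands. Below you can see one example based on orders arriving at different times of the day, and another on the number of orders arriving over a period of 45 days.
In this approach, the mean number of orders is drawn as a white line, and a 95% confidence interval is computed and drawn around the mean.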
from scipy.stats import sem

df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/user_orders_hourofday.csv")
df_mean = df.groupby('order_hour_of_day').quantity.mean()
df_se = df.groupby('order_hour_of_day').quantity.apply(sem).mul(1.96)

plt.figure(figsize=(16, 10), dpi=80)
plt.ylabel("# Orders", fontsize=16)
x = df_mean.index
plt.plot(x, df_mean, color="white", lw=2)
plt.fill_between(x, df_mean - df_se, df_mean + df_se, color="#3F5D7D")

plt.gca().spines["top"].set_alpha(0)
plt.gca().spines["bottom"].set_alpha(1)
plt.gca().spines["right"].set_alpha(0)
plt.gca().spines["left"].set_alpha(1)
plt.xticks(x[::2], [str(d) for d in x[::2]], fontsize=12)
plt.title("User Orders by Hour of Day (95% confidence)", fontsize=22)
plt.xlabel("Hour of Day")

s, e = plt.gca().get_xlim()
plt.xlim(s, e)

for y in range(8, 20, 2):
    plt.hlines(y, xmin=s, xmax=e, colors='black', alpha=0.5, linestyles="--", lw=0.5)

plt.show()
# Data Source: https://www.kaggle.com/olistbr/brazilian-ecommerce#olist_orders_dataset.csv
from dateutil.parser import parse
from scipy.stats import sem

df_raw = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/orders_45d.csv',
                     parse_dates=['purchase_time', 'purchase_date'])

df_mean = df_raw.groupby('purchase_date').quantity.mean()
df_se = df_raw.groupby('purchase_date').quantity.apply(sem).mul(1.96)

plt.figure(figsize=(16, 10), dpi=80)
plt.ylabel("# Daily Orders", fontsize=16)
x = [d.date().strftime('%Y-%m-%d') for d in df_mean.index]
plt.plot(x, df_mean, color="white", lw=2)
plt.fill_between(x, df_mean - df_se, df_mean + df_se, color="#3F5D7D")

plt.gca().spines["top"].set_alpha(0)
plt.gca().spines["bottom"].set_alpha(1)
plt.gca().spines["right"].set_alpha(0)
plt.gca().spines["left"].set_alpha(1)
plt.xticks(x[::6], [str(d) for d in x[::6]], fontsize=12)
plt.title("Daily Order Quantity of Brazilian Retail with Error Bands (95% confidence)", fontsize=20)

s, e = plt.gca().get_xlim()
plt.xlim(s, e - 2)
plt.ylim(4, 10)

for y in range(5, 10, 1):
    plt.hlines(y, xmin=s, xmax=e, colors='black', alpha=0.5, linestyles="--", lw=0.5)

plt.show()
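The band in both charts is the group mean plus or minus 1.96 times the standard error of the mean. As a worked illustration of that arithmetic, here is a small NumPy sketch; the function name mean_with_band and the order quantities are hypothetical, and the standard error uses the sample standard deviation (ddof=1), which is also the default of scipy.stats.sem.

```python
import numpy as np


def mean_with_band(groups, z=1.96):
    """Map each key to (mean, half_width), where half_width = z * SEM."""
    result = {}
    for key, values in groups.items():
        v = np.asarray(values, dtype=float)
        # Standard error of the mean, using the sample standard deviation (ddof=1)
        sem_value = v.std(ddof=1) / np.sqrt(len(v))
        result[key] = (v.mean(), z * sem_value)
    return result


# Hypothetical order quantities observed at two hours of the day
orders = {9: [2, 4, 6], 10: [5, 5, 5]}
bands = mean_with_band(orders)
for hour, (mean, half_width) in bands.items():
    print(hour, mean - half_width, mean + half_width)
```

A group whose observations are all identical gets a band of zero width, and the band narrows as more observations per time point become available, since the standard error shrinks with the square root of the group size.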
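【43】 Stacked Area Chart
A stacked area chart gives a visual representation of how much each of several time series contributes, so that they are easy to compare with one another.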
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/nightvisitors.csv')

mycolors = ['tab:red', 'tab:blue', 'tab:green', 'tab:orange', 'tab:brown', 'tab:grey', 'tab:pink', 'tab:olive']

fig, ax = plt.subplots(1, 1, figsize=(16, 9), dpi=80)
columns = df.columns[1:]
x = df['yearmon'].values.tolist()

# Stacking order of the columns; the labels follow the same order so the legend matches the areas
order = [0, 2, 4, 6, 7, 5, 1, 3]
y = np.vstack([df[columns[i]].values.tolist() for i in order])
labs = [columns[i] for i in order]

ax = plt.gca()
ax.stackplot(x, y, labels=labs, colors=mycolors, alpha=0.8)

ax.set_title('Night Visitors in Australian Regions', fontsize=18)
ax.set(ylim=[0, 100000])
ax.legend(fontsize=10, ncol=4)
plt.xticks(x[::5], fontsize=10, horizontalalignment='center')
plt.yticks(np.arange(10000, 100000, 20000), fontsize=10)
plt.xlim(x[0], x[-1])

plt.gca().spines["top"].set_alpha(0)
plt.gca().spines["bottom"].set_alpha(.3)
plt.gca().spines["right"].set_alpha(0)
plt.gca().spines["left"].set_alpha(.3)
plt.show()
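What makes the chart "stacked" is that each series is drawn on top of the running total of the series before it. The following sketch, on hypothetical numbers, computes those boundaries by hand with a cumulative sum, assuming the default zero baseline; the variable names are illustrative only.

```python
import numpy as np

# Three hypothetical series (rows) observed at three time points (columns)
series = np.array([[1, 2, 3],
                   [4, 4, 4],
                   [2, 0, 1]])

# Upper edge of each band: running total down the rows
upper = np.cumsum(series, axis=0)
# Lower edge of each band: zero for the first series, otherwise the previous upper edge
lower = np.vstack([np.zeros(series.shape[1]), upper[:-1]])

for lo, hi in zip(lower, upper):
    print(lo.tolist(), hi.tolist())
```

The height of each band (upper minus lower) equals the original series, and the top edge of the last band is the total across all series. This is why only the bottom series and the overall total can be read directly off the Y axis.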
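【44】 Area Chart Unstacked
An unstacked area chart is used to visualize the progress (ups and downs) of two or more series relative to each other. In the chart below, you can clearly see how the personal savings rate falls as the median duration of unemployment rises. The unstacked area chart brings out this phenomenon nicely.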
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/economics.csv")

x = df['date'].values.tolist()
y1 = df['psavert'].values.tolist()
y2 = df['uempmed'].values.tolist()
mycolors = ['tab:red', 'tab:blue', 'tab:green', 'tab:orange', 'tab:brown', 'tab:grey', 'tab:pink', 'tab:olive']
columns = ['psavert', 'uempmed']

fig, ax = plt.subplots(1, 1, figsize=(16, 9), dpi=80)
# y1 holds psavert (columns[0]) and y2 holds uempmed (columns[1]);
# the labels are paired accordingly so the legend is not swapped
ax.fill_between(x, y1=y1, y2=0, label=columns[0], alpha=0.5, color=mycolors[1], linewidth=2)
ax.fill_between(x, y1=y2, y2=0, label=columns[1], alpha=0.5, color=mycolors[0], linewidth=2)

ax.set_title('Personal Savings Rate vs Median Duration of Unemployment', fontsize=18)
ax.set(ylim=[0, 30])
ax.legend(loc='best', fontsize=12)
plt.xticks(x[::50], fontsize=10, horizontalalignment='center')
plt.yticks(np.arange(2.5, 30.0, 2.5), fontsize=10)
plt.xlim(-10, x[-1])

for y in np.arange(2.5, 30.0, 2.5):
    plt.hlines(y, xmin=0, xmax=len(x), colors='black', alpha=0.3, linestyles="--", lw=0.5)

plt.gca().spines["top"].set_alpha(0)
plt.gca().spines["bottom"].set_alpha(.3)
plt.gca().spines["right"].set_alpha(0)
plt.gca().spines["left"].set_alpha(.3)
plt.show()
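【45】 Calendar Heat Map
Compared with a time series plot, a calendar heat map is an alternative and less preferred way to visualize time-based data. Although it is visually appealing, the numeric values are not very evident. It is, however, effective at picturing extreme values and holiday effects.
【Translator's note (TRHX): install the calmap library before using this method.】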
import matplotlib as mpl
import calmap

df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/yahoo.csv", parse_dates=['date'])
df.set_index('date', inplace=True)

plt.figure(figsize=(16, 10), dpi=80)
# Select the 2014 rows with .loc; row selection via df['2014'] is not supported by recent pandas
calmap.calendarplot(df.loc['2014', 'VIX.Close'],
                    fig_kws={'figsize': (16, 10)},
                    yearlabel_kws={'color': 'black', 'fontsize': 14},
                    subplot_kws={'title': 'Yahoo Stock Prices'})
plt.show()
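【46】 Seasonal Plot
A seasonal plot can be used to compare how the time series performed at the same point (day, week, month, and so on) in previous seasons.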
from dateutil.parser import parse

df = pd.read_csv('https://github.com/selva86/datasets/raw/master/AirPassengers.csv')

# Split the date into year and month so that each year can be drawn as its own line
df['year'] = [parse(d).year for d in df.date]
df['month'] = [parse(d).strftime('%b') for d in df.date]
years = df['year'].unique()

df.rename(columns={'value': 'traffic'}, inplace=True)

mycolors = ['tab:red', 'tab:blue', 'tab:green', 'tab:orange', 'tab:brown', 'tab:grey', 'tab:pink', 'tab:olive',
            'deeppink', 'steelblue', 'firebrick', 'mediumseagreen']
plt.figure(figsize=(16, 10), dpi=80)

for i, y in enumerate(years):
    plt.plot('month', 'traffic', data=df.loc[df.year == y, :], color=mycolors[i], label=y)
    plt.text(df.loc[df.year == y, :].shape[0] - .9, df.loc[df.year == y, 'traffic'][-1:].values[0], y,
             fontsize=12, color=mycolors[i])

plt.ylim(50, 750)
plt.xlim(-0.3, 11)
plt.ylabel('$Air Traffic$')
plt.yticks(fontsize=12, alpha=.7)
plt.title("Monthly Seasonal Plot: Air Passengers Traffic (1949 - 1960)", fontsize=22)
plt.grid(axis='y', alpha=.3)

plt.gca().spines["top"].set_alpha(0.0)
plt.gca().spines["bottom"].set_alpha(0.5)
plt.gca().spines["right"].set_alpha(0.0)
plt.gca().spines["left"].set_alpha(0.5)
plt.show()
【9x00】 Groups
【47】 Dendrogram
A dendrogram groups similar points together based on a given distance metric and organizes them into tree-like links according to their similarity.
import scipy.cluster.hierarchy as shc

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/USArrests.csv')

plt.figure(figsize=(16, 10), dpi=80)
plt.title("USArrests Dendrogram", fontsize=22)
dend = shc.dendrogram(shc.linkage(df[['Murder', 'Assault', 'UrbanPop', 'Rape']], method='ward'),
                      labels=df.State.values, color_threshold=100)
plt.xticks(fontsize=12)
plt.show()
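【48】 Cluster Plot
A cluster plot can be used to demarcate points that belong to the same cluster. Below is a representative example that groups the US states into 5 groups based on the USArrests dataset. This cluster plot uses the 'Murder' and 'Assault' columns as the X and Y axes. Alternatively, you can use the first two principal components as the X and Y axes.
【Translator's note (TRHX): install the sklearn (scikit-learn) library before using this method.】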
from sklearn.cluster import AgglomerativeClustering
from scipy.spatial import ConvexHull

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/USArrests.csv')

# Ward linkage works on Euclidean distance, which is the default, so the
# affinity='euclidean' argument (deprecated in newer scikit-learn) is omitted
cluster = AgglomerativeClustering(n_clusters=5, linkage='ward')
cluster.fit_predict(df[['Murder', 'Assault', 'UrbanPop', 'Rape']])

plt.figure(figsize=(14, 10), dpi=80)
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=cluster.labels_, cmap='tab10')


def encircle(x, y, ax=None, **kw):
    """Draw the convex hull of the points (x, y) as a polygon patch."""
    if not ax:
        ax = plt.gca()
    p = np.c_[x, y]
    hull = ConvexHull(p)
    poly = plt.Polygon(p[hull.vertices, :], **kw)
    ax.add_patch(poly)


# Shade the convex hull of each of the 5 clusters
hull_colors = ["gold", "tab:blue", "tab:red", "tab:green", "tab:orange"]
for label, fc in enumerate(hull_colors):
    mask = cluster.labels_ == label
    encircle(df.loc[mask, 'Murder'], df.loc[mask, 'Assault'], ec="k", fc=fc, alpha=0.2, linewidth=0)

plt.xlabel('Murder'); plt.xticks(fontsize=12)
plt.ylabel('Assault'); plt.yticks(fontsize=12)
plt.title('Agglomerative Clustering of USArrests (5 Groups)', fontsize=22)
plt.show()
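The section mentions principal components as an alternative choice of axes. As a rough sketch of what that could look like, the code below projects a standardized feature table onto its first two principal components using NumPy's SVD. The function name first_two_components and the random stand-in data are assumptions for illustration; standardizing the columns first is a design choice of this sketch, not something the original article prescribes.

```python
import numpy as np


def first_two_components(features):
    """Project rows onto the first two principal components of the standardized data."""
    X = np.asarray(features, dtype=float)
    # Standardize each column so that features on large scales do not dominate
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ vt[:2].T


# Hypothetical stand-in for a four-column numeric table (not the USArrests data)
rng = np.random.default_rng(1)
fake = rng.normal(size=(50, 4)) * [4.0, 80.0, 14.0, 9.0]

scores = first_two_components(fake)
print(scores.shape)
```

The two columns of scores could then be used as the X and Y coordinates of the scatter plot and of the convex hulls, in place of the 'Murder' and 'Assault' columns.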
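【49】 Andrews Curve
An Andrews curve helps visualize whether the numeric features naturally separate the observations according to a given grouping. If the features (columns in the dataset) do not help discriminate the groups (cyl), the lines will not be well segregated the way they are in the plot below.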
from pandas.plotting import andrews_curves

df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
df.drop(['cars', 'carname'], axis=1, inplace=True)

plt.figure(figsize=(12, 9), dpi=80)
andrews_curves(df, 'cyl', colormap='Set1')

plt.gca().spines["top"].set_alpha(0)
plt.gca().spines["bottom"].set_alpha(.3)
plt.gca().spines["right"].set_alpha(0)
plt.gca().spines["left"].set_alpha(.3)

plt.title('Andrews Curves of mtcars', fontsize=22)
plt.xlim(-3, 3)
plt.grid(alpha=0.3)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
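Each line in an Andrews plot is a single row of the dataset turned into a finite Fourier series, f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ..., usually evaluated for t between -pi and pi. The sketch below evaluates that formula for one row with NumPy; the function name andrews_curve and the sample row are hypothetical and only meant to show how feature values become a curve.

```python
import numpy as np


def andrews_curve(row, t):
    """Evaluate the Andrews curve of one observation at the points t."""
    row = np.asarray(row, dtype=float)
    t = np.asarray(t, dtype=float)
    # The first feature contributes a constant offset
    result = np.full_like(t, row[0] / np.sqrt(2.0))
    for i, coeff in enumerate(row[1:], start=1):
        harmonic = (i + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            result = result + coeff * np.sin(harmonic * t)
        else:
            result = result + coeff * np.cos(harmonic * t)
    return result


# Hypothetical observation with four numeric features
t = np.linspace(-np.pi, np.pi, 5)
print(andrews_curve([1.0, 0.5, -0.2, 0.3], t))
```

Rows with similar feature values produce similar curves, which is why well-separated groups show up as visually distinct bundles of lines.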
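【50】 Parallel Coordinates
Parallel coordinates help visualize whether a feature segregates the groups effectively. If a feature does separate the groups, it is likely to be very useful in predicting that group.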
from pandas.plotting import parallel_coordinates

df_final = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/diamonds_filter.csv")

plt.figure(figsize=(12, 9), dpi=80)
parallel_coordinates(df_final, 'cut', colormap='Dark2')

plt.gca().spines["top"].set_alpha(0)
plt.gca().spines["bottom"].set_alpha(.3)
plt.gca().spines["right"].set_alpha(0)
plt.gca().spines["left"].set_alpha(.3)

plt.title('Parallel Coordinates of Diamonds', fontsize=22)
plt.grid(alpha=0.3)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()