diff --git a/docs/6.md b/docs/6.md
index 6446ca1131055396f934bb89a1746c0373df98bc..302d93c3e1ef2cf0b89c8ee14f5af988821a7e17 100644
--- a/docs/6.md
+++ b/docs/6.md
@@ -1,115 +1,113 @@
-# 6\. Statistics and Linear Algebra Preliminaries
+# 6\. 统计与线性代数预备
 
-Chinese proverb
+**知彼知己,百战不殆;不知彼而知己,一胜一负;不知彼,不知己,每战必殆。** – 《孙子兵法》
 
-**If you only know yourself, but not your opponent, you may win or may lose. If you know neither yourself nor your enemy, you will always endanger yourself** – idiom, from Sunzi’s Art of War
+## 6.1\. 表示法
 
-## 6.1\. Notations
+* m:样本数
+* n:特征数
+* ![y_i](img/8f58cf98a539286a53e41582f194fbed.jpg):第`i`个标签
+* ![\hat{y}_i](img/585d98b9749f0661bc9077e01f28eb15.jpg):第`i`个预测标签
+* ![{\displaystyle {\bar {\y}}} = {\frac {1}{m}}\sum _{i=1}^{m}y_{i}](img/791424a3e5f6e2f4372471d96e5b4676.jpg):![\y](img/afa87c5126806e604709f243ab72848b.jpg) 的均值
+* ![\y](img/afa87c5126806e604709f243ab72848b.jpg):标签向量
+* ![\hat{\y}](img/bab25b7785bf747bc1caa1442874df74.jpg):预测标签向量
 
-* m : the number of the samples
-* n : the number of the features
-* ![y_i](img/8f58cf98a539286a53e41582f194fbed.jpg) : i-th label
-* ![\hat{y}_i](img/585d98b9749f0661bc9077e01f28eb15.jpg) : i-th predicted label
-* ![{\displaystyle {\bar {\y}}} = {\frac {1}{m}}\sum _{i=1}^{m}y_{i}](img/791424a3e5f6e2f4372471d96e5b4676.jpg) : the mean of ![\y](img/afa87c5126806e604709f243ab72848b.jpg).
-* ![\y](img/afa87c5126806e604709f243ab72848b.jpg) : the label vector.
-* ![\hat{\y}](img/bab25b7785bf747bc1caa1442874df74.jpg) : the predicted label vector.
+## 6.2\. 线性代数预备
 
-## 6.2\. Linear Algebra Preliminaries
-
-Since I have documented the Linear Algebra Preliminaries in my Prelim Exam note for Numerical Analysis, the interested reader is referred to [[Feng2014]](reference.html#feng2014) for more details (Figure. [Linear Algebra Preliminaries](#fig-linear-algebra)).
+由于我在我的数值分析考试笔记中记录了线性代数预备,有兴趣的读者可以参考 [[Feng2014]](reference.html#feng2014)了解更多细节。
 
 ![https://runawayhorse001.github.io/LearningApacheSpark/_images/linear_algebra.png](img/c089ca6ef2f36b0394d7bcf41db78030.jpg)
 
-Linear Algebra Preliminaries
+线性代数预备
 
-## 6.3\. Measurement Formula
+## 6.3\. 测量公式
 
-### 6.3.1\. Mean absolute error
+### 6.3.1\. 平均绝对误差
 
-In statistics, **MAE** ([Mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error)) is a measure of difference between two continuous variables. The Mean Absolute Error is given by:
+在统计学中,**MAE**([平均绝对误差](https://en.wikipedia.org/wiki/Mean_absolute_error))衡量两个连续变量间的差异。 平均绝对误差由下式给出:
 
 ![{\displaystyle \mathrm {MAE} ={\frac{1}{m} {\sum _{i=1}^{m}\left|\hat{y}_i-y_i\right|}}.}](img/61bccf1d55cc6636fce9585573c9981a.jpg)
 
-### 6.3.2\. Mean squared error
+### 6.3.2\. 均方误差
 
-In statistics, the **MSE** ([Mean Squared Error](https://en.wikipedia.org/wiki/Mean_squared_error)) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors or deviations—that is, the difference between the estimator and what is estimated.
+在统计中,估计器(估计未观测量的过程)的 **MSE**([均方误差](https://en.wikipedia.org/wiki/Mean_squared_error))测量了误差或偏差的平方的平均值 - 即估计器与被估计值之间的差异。
 
 ![\text{MSE}=\frac{1}{m}\sum_{i=1}^m\left( \hat{y}_i-y_i\right)^2](img/3152173a8fd696819c7a2c2b8c6ef005.jpg)
 
-### 6.3.3\. Root Mean squared error
+### 6.3.3\. 均方根误差
 
 ![\text{RMSE} = \sqrt{\text{MSE}}=\sqrt{\frac{1}{m}\sum_{i=1}^m\left( \hat{y}_i-y_i\right)^2}](img/c8a2ccec457f128649ad30a2ba066a48.jpg)
 
-### 6.3.4\. Total sum of squares
+### 6.3.4\. 总体平方和
 
-In statistical data analysis the **TSS** ([Total Sum of Squares](https://en.wikipedia.org/wiki/Total_sum_of_squares)) is a quantity that appears as part of a standard way of presenting results of such analyses. It is defined as being the sum, over all observations, of the squared differences of each observation from the overall mean.
+在统计数据分析中,**TSS**([总体平方和](https://en.wikipedia.org/wiki/Total_sum_of_squares))是呈现此类分析结果的标准方式中出现的一个量。 它定义为所有观测值与总体均值之差的平方的总和。
 
 ![\text{TSS} = \sum_{i=1}^m\left( y_i-\bar{\y}\right)^2](img/16fd7a4c078cf22fee09b636dc10d55c.jpg)
 
-### 6.3.5\. Explained Sum of Squares
+### 6.3.5\. 解释平方和
 
-In statistics, the **ESS** ([Explained sum of squares](https://en.wikipedia.org/wiki/Explained_sum_of_squares)), alternatively known as the model sum of squares or sum of squares due to regression.
+在统计学中,**ESS**([解释平方和](https://en.wikipedia.org/wiki/Explained_sum_of_squares)),或者称为模型平方和或回归平方和。
 
-The ESS is the sum of the squares of the differences of the predicted values and the mean value of the response variable which is given by:
+ESS 是预测值和响应变量的均值的差的平方和,由下式给出:
 
 ![\text{ESS}= \sum_{i=1}^m\left( \hat{y}_i-\bar{\y}\right)^2](img/8dc8e70e19ec4318b12b16f1c5bdb879.jpg)
 
-### 6.3.6\. Residual Sum of Squares
+### 6.3.6\. 残差平方和
 
-In statistics, **RSS** ([Residual sum of squares](https://en.wikipedia.org/wiki/Residual_sum_of_squares)), also known as the sum of squared residuals (SSR) or the sum of squared errors of prediction (SSE), is the sum of the squares of residuals which is given by:
+在统计中,**RSS/SSR**([残差平方和](https://en.wikipedia.org/wiki/Residual_sum_of_squares)),也称为预测误差平方和(SSE),由下式给出:
 
 ![\text{RSS}= \sum_{i=1}^m\left( \hat{y}_i-y_i\right)^2](img/95594348fc6d49d2819be3d412a27e55.jpg)
 
-### 6.3.7\. Coefficient of determination ![R^2](img/1ac835166928f502b55a31636602602a.jpg)
+### 6.3.7\. 判定系数 ![R^2](img/1ac835166928f502b55a31636602602a.jpg)
 
 ![R^{2} := \frac{ESS}{TSS} = 1-{\text{RSS} \over \text{TSS}}.\,](img/fef76f108c095f250d8e9efb4cfcb710.jpg)
 
-Note
-
-In general (![\y^{T}{\bar {\y}}={\hat {\y}}^{T}{\bar {\y}}](img/b288f19072faa2f8f373d5a8910c080b.jpg)), total sum of squares = explained sum of squares + residual sum of squares, i.e.:
+> 注意
+>
+> 一般来说,(![\y^{T}{\bar {\y}}={\hat {\y}}^{T}{\bar {\y}}](img/b288f19072faa2f8f373d5a8910c080b.jpg)),总体平方和,等于解释平方和加上残差平方和,也就是:
 
 ![\text{TSS} = \text{ESS} + \text{RSS} \text{ if and only if } {\displaystyle \y^{T}{\bar {\y}}={\hat {\y}}^{T}{\bar {\y}}}.](img/4a1a112aa8490f7c8410b710845e8c7a.jpg)
 
-More details can be found at [Partitioning in the general ordinary least squares model](https://en.wikipedia.org/wiki/Explained_sum_of_squares).
+更多细节可以在[普通最小二乘模型中的分解](https://en.wikipedia.org/wiki/Explained_sum_of_squares)中找到。
 
-## 6.4\. Confusion Matrix
+## 6.4\. 混淆矩阵
 
 ![https://runawayhorse001.github.io/LearningApacheSpark/_images/confusion_matrix.png](img/c789e9bbaa3506dc90047b5cd487a42a.jpg)
 
-Confusion Matrix
+混淆矩阵
 
-### 6.4.1\. Recall
+### 6.4.1\. 召回率
 
 ![\text{Recall}=\frac{\text{TP}}{\text{TP+FN}}](img/3f26c9365c0603f014f3bba403ed27fb.jpg)
 
-### 6.4.2\. Precision
+### 6.4.2\. 精确率
 
 ![\text{Precision}=\frac{\text{TP}}{\text{TP+FP}}](img/1a8a8647a66b744ccd5c9137adb66255.jpg)
 
-### 6.4.3\. Accuracy
+### 6.4.3\. 准确率
 
 ![\text{Accuracy }=\frac{\text{TP+TN}}{\text{Total}}](img/5a13655c0030372e1b06cd77ff1e53e0.jpg)
 
-### 6.4.4\. ![F_1](img/baa636adac3ad30302c0a36fc2f58751.jpg)-score
+### 6.4.4\. F1 得分
 
 ![\text{F}_1=\frac{2*\text{Recall}*\text{Precision}}{\text{Recall}+ \text{Precision}}](img/1cef776388e6c2cba3cf00cab2199e3d.jpg)
 
-## 6.5\. Statistical Tests
+## 6.5\. 统计检验
 
-### 6.5.1\. Correlational Test
+### 6.5.1\. 相关性检验
 
-* Pearson correlation: Tests for the strength of the association between two continuous variables.
-* Spearman correlation: Tests for the strength of the association between two ordinal variables (does not rely on the assumption of normal distributed data).
-* Chi-square: Tests for the strength of the association between two categorical variables.
+* Pearson 相关: 检验两个连续变量之间的相关度。
+* Spearman 相关: 检验两个序数变量之间的相关度(不依赖于正态分布数据的假设)。
+* 卡方: 检验两个类别变量之间的相关度。
 
-### 6.5.2\. Comparison of Means test
+### 6.5.2\. 均值比较检验
 
-* Paired T-test: Tests for difference between two related variables.
-* Independent T-test: Tests for difference between two independent variables.
-* ANOVA: Tests the difference between group means after any other variance in the outcome variable is accounted for.
+* 配对 T 检验: 检验两个相关变量之间的差异。
+* 独立 T 检验: 检验两个独立变量之间的差异。
+* ANOVA: 在考虑结果变量中的任何其他变化之后,检验组均值之间的差异。
 
-### 6.5.3\. Non-parametric Test
+### 6.5.3\. 非参数检验
 
-* Wilcoxon rank-sum test: Tests for difference between two independent variables - takes into account magnitude and direction of difference.
-* Wilcoxon sign-rank test: Tests for difference between two related variables - takes into account magnitude and direction of difference.
-* Sign test: Tests if two related variables are different – ignores magnitude of change, only takes into account direction.
\ No newline at end of file
+* Wilcoxon 秩和检验: 检验两个独立变量之间的差异 - 考虑差异的大小和方向。
+* Wilcoxon 符号秩检验: 检验两个相关变量之间的差异 - 考虑差异的大小和方向。
+* 符号检验: 检验两个相关变量是否不同 - 忽略变化大小,仅考虑方向。
\ No newline at end of file
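
Reviewer note: the formulas carried by this patch (sections 6.3 and 6.4 of `docs/6.md`) can be sanity-checked numerically with plain Python. What follows is only a minimal sketch under stated assumptions: the function names `regression_metrics` and `classification_metrics` and the toy data are hypothetical and do not appear in the patched file.

```python
import math


def regression_metrics(y, y_hat):
    """Compute the section 6.3 quantities for labels y and predictions y_hat.

    Hypothetical helper for illustration; not part of docs/6.md.
    """
    m = len(y)
    y_bar = sum(y) / m                                   # mean of the labels
    mae = sum(abs(p - t) for t, p in zip(y, y_hat)) / m  # mean absolute error
    rss = sum((p - t) ** 2 for t, p in zip(y, y_hat))    # residual sum of squares
    tss = sum((t - y_bar) ** 2 for t in y)               # total sum of squares
    ess = sum((p - y_bar) ** 2 for p in y_hat)           # explained sum of squares
    mse = rss / m
    return {
        "MAE": mae,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "TSS": tss,
        "ESS": ess,
        "RSS": rss,
        "R2": 1 - rss / tss,
    }


def classification_metrics(tp, fp, fn, tn):
    """Compute the section 6.4 quantities from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * recall * precision / (recall + precision)
    return {"Recall": recall, "Precision": precision,
            "Accuracy": accuracy, "F1": f1}


# Toy data (assumption): y_hat is the least-squares line 1.5 + 0.5 * x
# fitted to x = [0, 1, 2], y = [1, 3, 2].
reg = regression_metrics([1, 3, 2], [1.5, 2.0, 2.5])
clf = classification_metrics(tp=8, fp=2, fn=4, tn=6)
print(reg)
print(clf)
```

Because the toy predictions come from an ordinary least squares fit with an intercept, the decomposition TSS = ESS + RSS holds for them and `R2` agrees with ESS/TSS; for arbitrary predictions the two expressions for R² in section 6.3.7 can differ.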