# 8\. Regression
Chinese proverb
> **A journey of a thousand miles begins with a single step.** – old Chinese proverb
In statistical modeling, regression analysis focuses on investigating the relationship between a dependent variable and one or more independent variables. [Wikipedia Regression analysis](https://en.wikipedia.org/wiki/Regression_analysis)
In data mining, regression is a model that represents the relationship between the value of a label (or target, which is a numerical variable) and one or more features (or predictors, which can be numerical or categorical variables).
## 8.1\. Linear Regression
### 8.1.1\. Introduction
Given a data set ![{\displaystyle \{\,x_{i1},\ldots ,x_{in},y_{i}\}_{i=1}^{m}}](img/4b454255e179a3626e205ce324184acf.jpg), which contains `n` features (variables) and `m` samples (data points), the simple linear regression model for the ![{\displaystyle m}](img/2649ef98f720c129d663f5d82add4129.jpg) data points with independent variables ![{\displaystyle x_{ij}}](img/91d663abfef497e13ec41f9300a5c354.jpg) (indexed by ![j](img/aec897e37f71d43694de4db49ed3be3e.jpg)) is given by:
> ![y_i = \beta_0 + \beta_j x_{ij}, \text{where}, i= 1, \cdots m, j= 1, \cdots n.](img/59ebd939c24bf4d59d82b0daf4874daf.jpg)
In matrix notation, the data set is written as ![\X = [\x_1,\cdots, \x_n]](img/80a25ad6329d3836f4e625a1c93e7898.jpg) with ![\x_j = {\displaystyle \{x_{ij}\}_{i=1}^{m}}](img/c4660874124a448ac14209f4a59e367a.jpg), ![\y = {\displaystyle \{y_{i}\}_{i=1}^{m}}](img/82a22af158d760e46ae93ba1663a6487.jpg) (see Fig. [Feature matrix and label](#fig-fm)) and ![\Bbeta^\top = {\displaystyle \{\beta_{j}\}_{j=1}^{n}}](img/fad9e18cebad821450ed0f34abdb3988.jpg). Then the matrix form of the equation is written as
> (1)![\y = \X \Bbeta.](img/2d776487e1a2ee4683c3c6f51fca7e48.jpg)
![https://runawayhorse001.github.io/LearningApacheSpark/_images/fm.png](img/3b99ee07cd783026d41b65651ee5d293.jpg)
Feature matrix and label
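
To make the matrix notation above concrete, here is a minimal NumPy sketch (not from the book) that assembles a small feature matrix, a coefficient vector, and the corresponding label vector; the leading column of ones that absorbs the intercept into the feature matrix is an assumption about how the matrix is built:

```python
import numpy as np

# 3 samples (m = 3) and 2 features (n = 2); the leading column of ones
# absorbs the intercept beta_0 so that y = X @ beta holds exactly.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 5.0],
              [1.0, 6.0, 7.0]])     # feature matrix, shape (m, n + 1)
beta = np.array([0.5, 1.0, -2.0])   # coefficients [beta_0, beta_1, beta_2]

y = X @ beta                        # label vector, shape (m,)
print(y)                            # [-3.5 -5.5 -7.5]
```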
### 8.1.2\. How to solve it?
1. Direct Methods (For more information please refer to my [Prelim Notes for Numerical Analysis](http://web.utk.edu/~wfeng1/doc/PrelimNum.pdf))
   * For square or rectangular matrices
     * Singular Value Decomposition
     * Gram-Schmidt orthogonalization
     * QR Decomposition
   * For square matrices
     * LU Decomposition
     * Cholesky Decomposition
     * Regular Splittings
2. Iterative Methods (a direct vs. iterative solve is sketched after this list)
   * Stationary iterative methods
     * Jacobi Method
     * Gauss-Seidel Method
     * Richardson Method
     * Successive Over-Relaxation (SOR) Method
   * Dynamic iterative methods
     * Chebyshev Iterative Method
     * Minimal Residual Method
     * Minimal Correction Iterative Method
     * Steepest Descent Method
     * Conjugate Gradient Method
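
As a brief illustration of the two families above, the sketch below solves the same least-squares problem once with a direct method (QR decomposition) and once with an iterative method (conjugate gradient on the normal equations). The synthetic data and tolerances are illustrative assumptions, not code from the book:

```python
import numpy as np
from scipy.sparse.linalg import cg

# Synthetic overdetermined system: 100 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)

# Direct method: QR decomposition of X, then a triangular solve.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

# Iterative method: conjugate gradient on the SPD normal equations X^T X beta = X^T y.
beta_cg, info = cg(X.T @ X, X.T @ y, atol=1e-10)

print(beta_qr)
print(beta_cg)   # both agree to numerical precision (info == 0 on convergence)
```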
### 8.1.3\. Ordinary Least Squares
In mathematics, [(1)](#equation-eq-ax) is an overdetermined system. The method of ordinary least squares can be used to find an approximate solution to overdetermined systems. For the overdetermined system [(1)](#equation-eq-ax), the least squares formula is obtained from the problem
(2)![{\displaystyle \min _{\Bbeta} ||\X \Bbeta-\y||} ,](img/b8bf446d4a625497f28f2347b7ca0c92.jpg)
the solution of which can be written with the normal equations:
(3)![\Bbeta = (\X^T\X)^{-1}\X^T\y](img/d2f9799d371fde446e6dc8292ba07393.jpg)
where ![{\displaystyle {\mathrm {T} }}](img/d09c46ec94d638e4ddcecfbba1c11ea8.jpg) indicates a matrix transpose, provided ![{\displaystyle (\X^{\mathrm {T} }\X)^{-1}}](img/d003fed20e7f2d040ccc24412cb854d1.jpg) exists (that is, provided ![\X](img/501025688da0cf9e2b3937cd7da9580d.jpg) has full column rank).
Note
Actually, [(3)](#equation-eq-solax) can be derived in the following way: multiply ![\X^T](img/d142da9aae51c6d3c3c736fc82252862.jpg) on both sides of [(1)](#equation-eq-ax) and then multiply ![(\X^T\X)^{-1}](img/16dd8d60ea9b042c3ce0652c9f0571e8.jpg) on both sides of the former result. You may also apply the `Extreme Value Theorem` to [(2)](#equation-eq-minax) and find the solution [(3)](#equation-eq-solax).
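
As a quick numerical check of [(3)](#equation-eq-solax), the following sketch (with made-up data, not from the book) computes the coefficients both from the normal equations directly and with NumPy's least-squares routine:

```python
import numpy as np

# Made-up overdetermined system: m = 5 samples, intercept column plus one feature.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0],
              [1.0, 5.0]])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Normal equations: beta = (X^T X)^{-1} X^T y  (requires X to have full column rank).
beta_normal = np.linalg.inv(X.T @ X) @ (X.T @ y)

# The same solution via the numerically preferred least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)   # approximately [0.15, 1.95]
print(beta_lstsq)    # matches the normal-equation solution
```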
### 8.1.4\. Gradient Descent
Let’s use the following hypothesis:
![h_\Bbeta = \beta_0 + \beta_j \x_{j}, \text{where}, j= 1, \cdots n.](img/a5fda7453d5707d5e8985434c789ba48.jpg)
Then, solving [(2)](#equation-eq-minax) is equivalent to minimizing the following `cost function`:
### 8.1.5\. Cost Function
(4)![J(\Bbeta) = \frac{1}{2m}\sum_{i=1}^m \left( h_\Bbeta(x^{(i)})-y^{(i)}) \right)^2](img/77c47cf9cfec8ec740c5a18dc4386670.jpg)
Note
The reason why we prefer to solve [(4)](#equation-eq-lreg-cost) rather than [(2)](#equation-eq-minax) is that [(4)](#equation-eq-lreg-cost) is convex and has some nice properties, such as being uniquely solvable and energy stable for a small enough learning rate. The reader who has great interest in the non-convex cost function (energy) case is referred to [[Feng2016PSD]](reference.html#feng2016psd) for more details.
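
A minimal NumPy sketch of the cost function [(4)](#equation-eq-lreg-cost), assuming the hypothesis is the linear model with the intercept column already included in the feature matrix (names and data are illustrative, not from the book):

```python
import numpy as np

def cost(beta, X, y):
    """Least-squares cost J(beta) = 1/(2m) * sum_i (h_beta(x^(i)) - y^(i))^2."""
    m = len(y)
    residual = X @ beta - y          # h_beta(x^(i)) - y^(i) for every sample
    return residual @ residual / (2.0 * m)

# Tiny example: the cost is zero only when beta reproduces y exactly.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])        # generated by beta = [1, 2]
print(cost(np.array([1.0, 2.0]), X, y))   # 0.0
print(cost(np.array([0.0, 0.0]), X, y))   # positive
```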
![https://runawayhorse001.github.io/LearningApacheSpark/_images/gradient1d.png](img/875e532ac3b299876d209507d595df14.jpg)
Gradient Descent in 1D
![https://runawayhorse001.github.io/LearningApacheSpark/_images/gradient2d.png](img/d4b34834b440d5d60f25912180e7e130.jpg)
Gradient Descent in 2D
### 8.1.6\. Batch Gradient Descent
Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. It searches in the direction of steepest descent, which is defined by the `negative of the gradient` (see Fig. [Gradient Descent in 1D](#fig-gd1d) and Fig. [Gradient Descent in 2D](#fig-gd2d) for 1D and 2D, respectively), with learning rate (search step) ![\alpha](img/aef64ee73dc1b1a03a152855f685113e.jpg).
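
The sketch below applies this idea to the cost function [(4)](#equation-eq-lreg-cost): at every iteration the whole batch of samples is used to form the gradient, and the coefficients move one step of size alpha in the negative gradient direction. The learning rate, iteration count, and data are illustrative assumptions:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.05, n_iters=2000):
    """Minimize J(beta) with full-batch gradient descent."""
    m, n = X.shape
    beta = np.zeros(n)
    for _ in range(n_iters):
        gradient = X.T @ (X @ beta - y) / m   # gradient of J(beta) over the whole batch
        beta -= alpha * gradient              # step in the negative gradient direction
    return beta

# Recover beta = [1, 2] from the tiny data set used above.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])
print(batch_gradient_descent(X, y))   # close to [1. 2.]
```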
### 8.1.7\. Stochastic Gradient Descent
### 8.1.8\. Mini-batch Gradient Descent
### 8.1.9\. Demo
* The Jupyter notebook can be downloaded from [Linear Regression](_static/LinearRegression.ipynb), which was implemented without using Pipeline.
* The Jupyter notebook can be downloaded from [Linear Regression with Pipeline](_static/LinearRegressionWpipeline.ipynb), which was implemented using Pipeline.
* I will only present the code in pipeline style in the following.
* For more details about the parameters, please visit [Linear Regression API](http://takwatanabe.me/pyspark/generated/generated/ml.regression.LinearRegression.html).
Set up spark context and SparkSession
......
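
The code is elided in this excerpt; a minimal sketch of the setup and the pipeline-style fit, following the usual PySpark pattern, might look like the following (the application name, file path, and column names are placeholder assumptions, not the book's actual values):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Set up the SparkSession (the entry point that also carries the SparkContext).
spark = SparkSession.builder \
    .appName("Python Spark regression example") \
    .getOrCreate()

# Load the data (placeholder path and column names).
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

# Assemble the feature columns into a single vector column and fit the model in a Pipeline.
assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)

# Evaluate on the held-out split.
predictions = model.transform(test)
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
print("RMSE:", evaluator.evaluate(predictions))
```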