<!--Copyright © Microsoft Corporation. All rights reserved.
Licensed under the [License](https://github.com/Microsoft/ai-edu/blob/master/LICENSE.md)-->
# The Second Step - Linear Regression
## Abstract
Linear regression is a very good starting point for learning neural networks because the problem itself is easy to understand. Building on it and gradually adding new knowledge produces a relatively gentle learning curve; it is the first small step towards a neural network.
A single-layer neural network is actually a single neuron, which can perform linear tasks such as fitting a straight line. When the neuron receives only one input, it performs univariate linear regression, which can be visualized on a two-dimensional plane. When it receives multiple inputs, it performs multivariate linear regression, which is hard to visualize; we usually illustrate it with pairs of variables.
When there is more than one variable, their units and value ranges may differ greatly. In this case we usually need to normalize the sample feature data before feeding it to the neural network for training; otherwise the network will suffer from "indigestion".
# Chapter 4 Single Layer Neural Network with Single Input and Single Output - Univariate Linear Regression
## 4.0 Univariate linear regression problem
### 4.0.1 Ask a question
In the early days of Internet infrastructure, one problem the major operators had to solve was keeping the temperature of the computer room at about 23 degrees Celsius all year round. In a new computer room where we plan to deploy 346 servers, what maximum air-conditioning power should we provision?
This problem can be solved with thermodynamics, but there is always some error. Therefore, people usually install a thermostat in the computer room to switch the air conditioner on and off or to adjust the fan speed and cooling capacity, of which the maximum cooling capacity is the key value. A more advanced approach is to build the computer room under the sea and use an isolated seawater circulation loop for cooling.
Using some statistics (called sample data), we get Table 4-1.
Table 4-1 Sample data
|Sample No.|Number of servers (thousands) $X$|Air conditioning power (kW) $Y$|
|---|---|---|
|1|0.928|4.824|
|2|0.469|2.950|
|3|0.855|4.643|
|...|...|...|
In the above sample, we generally call the independent variable $X$ the sample feature value, and the dependent variable $Y$ the sample label value.
This data is two-dimensional, so we can display it in a visual way. The abscissa is the number of servers, and the ordinate is the air conditioner power, as shown in Figure 4-1.
<img src="https://aiedugithub4a2.blob.core.windows.net/a2-images/Images/4/Data.png"/>
Figure 4-1 Visualization of sample data
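As a sketch of how such a scatter plot might be produced (assuming the samples of Table 4-1 have been loaded into numpy arrays; the actual data file and reader used in the repo's ch04 code may differ):
```Python
import numpy as np
import matplotlib.pyplot as plt

# Assumed arrays holding the sample data of Table 4-1:
# X = number of servers (thousands), Y = air-conditioning power (kW)
X = np.array([0.928, 0.469, 0.855])   # placeholder values; the real data set is larger
Y = np.array([4.824, 2.950, 4.643])

plt.scatter(X, Y)
plt.xlabel("Number of servers (thousands)")
plt.ylabel("Air conditioning power (kW)")
plt.show()
```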
By observing the above figure, we can judge that this is a linear regression problem, and the simplest kind: univariate linear regression. We have thus turned a thermodynamics problem into a statistics problem, because it is impossible to calculate precisely how much heat each circuit board or each machine generates.
A clever reader may think of a shortcut: find a sample in the data whose number of servers is very close to 346 and use it as a reference to pick an appropriate air-conditioning power.
Admittedly, this is completely scientific and reasonable; it is exactly the idea behind linear regression: use existing values to predict unknown values. In other words, the reader has inadvertently used a linear regression model. This example is very simple, with only one independent variable and one dependent variable, so the problem can be solved in such a straightforward way. However, the approach may fail when there are multiple independent variables: with three independent variables, it is very likely that no sample in the data comes close to a given combination of all three, and we must then resort to a more systematic approach.
### 4.0.2 Unary linear regression model
Regression analysis is a kind of mathematical model. When the relationship between the dependent variable and the independent variables is linear, it is a special kind of linear model.
The simplest case is unary (univariate) linear regression, which consists of one independent variable and one dependent variable with a roughly linear relationship. The model is:
$$Y=a+bX+\varepsilon \tag{1}$$
$X$ is the independent variable, $Y$ is the dependent variable, $\varepsilon$ is the random error, and $a$ and $b$ are the parameters. In the linear regression model, $a,b$ are what we need to learn through algorithms.
What is a model? When you first encounter this concept, it may feel a bit abstract. Generally speaking, a model is a description of an objective thing, expressed in a concrete or virtual form through subjective understanding; the description is usually an abstract expression with a certain logical or mathematical meaning.
For example, a model of a car might describe it as a four-wheeled metal shell driven by an engine. A model of the concept of energy is Einstein's famous formula from special relativity: $E=mc^2$.
To model data is to find one or several formulas that describe how the data is generated or how its quantities relate to one another. For example, if a group of data roughly satisfies the formula $y=3x+2$, then that formula is the model. Why "roughly"? Because in the real world there is usually noise (error), so the data cannot satisfy the formula exactly; as long as the points lie near both sides of the line, they are considered to satisfy it.
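As a minimal sketch of what "roughly satisfies $y=3x+2$" looks like (the noise level and sample size are arbitrary choices for illustration):
```Python
import numpy as np

# The true relationship y = 3x + 2 plus some random noise (the error term epsilon)
np.random.seed(0)
x = np.linspace(0, 1, 10)
noise = np.random.normal(loc=0.0, scale=0.2, size=x.shape)
y = 3 * x + 2 + noise

# Every point lies near, but not exactly on, the line y = 3x + 2
print(np.c_[x, y])
```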
For linear regression models, there are some concepts to understand:
- It is usually assumed that the random error $\varepsilon$ has mean $0$ and variance $\sigma^2$ ($\sigma^2>0$; $\sigma^2$ does not depend on the value of $X$)
- If the random error is further assumed to follow a normal distribution, the model is called a normal linear model
- In general, if there are $k$ independent variables and $1$ dependent variable (i.e. $Y$ in Formula 1), the value of the dependent variable consists of two parts: one part is determined by the independent variables, i.e. it is expressed as a function of them whose form is known but which contains unknown parameters; the other part comes from unconsidered factors and randomness, i.e. the random error
- When the function is a linear function with unknown parameters, it is called a linear regression analysis model
- When the function is a nonlinear function with unknown parameters, it is called a nonlinear regression analysis model
- When the number of independent variables is greater than $1$, it is called multiple regression
- When the number of dependent variables is greater than $1$, it is called multivariate regression
By observing the data, we can generally believe that it meets the conditions of the linear regression model, which is why Formula 1 was written down. Leaving the random error aside, our task is to find appropriate values of $a,b$; this is the task of linear regression.
<img src="https://aiedugithub4a2.blob.core.windows.net/a2-images/Images/4/regression.png" />
Figure 4-2 The difference between linear regression and nonlinear regression
As shown in Figure 4-2, the linear model is on the left. You can see that the straight line crosses the center line of the area formed by a set of triangles. This straight line does not need to pass through every triangle. On the right is the non-linear model, with a curve passing through the center line of the area formed by a group of rectangles. In this chapter, we first learn how to solve the linear regression problem on the left.
We will use several methods to solve this problem next:
1. Least squares;
2. Gradient descent;
3. Simple neural network method;
4. More general neural network algorithms.
### 4.0.3 Formula form
Here is an explanation of the order of $W$ and $X$ in the linear formula. In many textbooks, we can see the following formula:
$$Y = W^{\top}X+B \tag{1}$$
or:
$$Y = W \cdot X + B \tag{2}$$
However, we use this formula in this book:
$$Y = X \cdot W + B \tag{3}$$
The main difference between the three is the shape of the sample data $X$, which in turn determines the shape of $W$. For example, if $X$ has three feature values, then $W$ must have three corresponding weight values:
#### Matrix form of formula 1
$X$ is a column vector:
$$
X=
\begin{pmatrix}
x_{1} \\\\
x_{2} \\\\
x_{3}
\end{pmatrix}
$$
$W$ is also a column vector:
$$
W=
\begin{pmatrix}
w_{1} \\\\ w_{2} \\\\ w_{3}
\end{pmatrix}
$$
$$
Y=W^{\top}X+B=
\begin{pmatrix}
w_1 & w_2 & w_3
\end{pmatrix}
\begin{pmatrix}
x_{1} \\\\
x_{2} \\\\
x_{3}
\end{pmatrix}
+b
$$
$$
=w_1 \cdot x_1 + w_2 \cdot x_2 + w_3 \cdot x_3 + b \tag{4}
$$
Both $W$ and $X$ are column vectors, so $W$ must be transposed before it can be matrix-multiplied with $X$.
#### Matrix form of formula 2
The difference between formula 2 and formula 1 is the shape of $W$. In formula 2, $W$ is a row vector:
$$
W=
\begin{pmatrix}
w_{1} & w_{2} & w_{3}
\end{pmatrix}
$$
And the shape of $X$ is still a column vector:
$$
X=
\begin{pmatrix}
x_{1} \\\\
x_{2} \\\\
x_{3}
\end{pmatrix}
$$
So there is no need to transpose the matrix before multiplying:
$$
Y=W \cdot X+B=
\begin{pmatrix}
w_1 & w_2 & w_3
\end{pmatrix}
\begin{pmatrix}
x_{1} \\\\
x_{2} \\\\
x_{3}
\end{pmatrix}
+b
$$
$$
=w_1 \cdot x_1 + w_2 \cdot x_2 + w_3 \cdot x_3 + b \tag{5}
$$
#### Matrix form of formula 3
$X$ is a row vector:
$$
X=
\begin{pmatrix}
x_{1} & x_{2} & x_{3}
\end{pmatrix}
$$
$W$ is a column vector:
$$
W=
\begin{pmatrix}
w_{1} \\\\ w_{2} \\\\ w_{3}
\end{pmatrix}
$$
So $X$ comes first and $W$ comes last:
$$
Y=X \cdot W+B=
\begin{pmatrix}
x_1 & x_2 & x_3
\end{pmatrix}
\begin{pmatrix}
w_{1} \\\\
w_{2} \\\\
w_{3}
\end{pmatrix}
+b
$$
$$
=x_1 \cdot w_1 + x_2 \cdot w_2 + x_3 \cdot w_3 + b \tag{6}
$$
Comparing these three formulas, their results are the same.
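This equivalence is easy to verify numerically. Below is a minimal sketch (using numpy, with arbitrarily chosen feature and weight values) that evaluates all three arrangements for a single sample:
```Python
import numpy as np

# One sample with three features, and three weights (arbitrary values)
x_col = np.array([[1.0], [2.0], [3.0]])      # X as a column vector, shape (3,1)
w_col = np.array([[0.5], [0.6], [0.7]])      # W as a column vector, shape (3,1)
b = 0.1

y1 = np.dot(w_col.T, x_col) + b              # formula 4: Y = W^T X + B
y2 = np.dot(w_col.reshape(1, 3), x_col) + b  # formula 5: W as a row vector, Y = W·X + B
y3 = np.dot(x_col.reshape(1, 3), w_col) + b  # formula 6: X as a row vector, Y = X·W + B

print(y1, y2, y3)   # all three give the same value
```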
Let's analyze the $X$ matrix in the first two forms. Since $X$ is a column vector, the features of a single sample are stacked vertically; when two samples participate in the calculation at the same time, $X$ must add a column, one per sample, and becomes the following form:
$$
X=
\begin{pmatrix}
x_{11} & x_{21} \\\\
x_{12} & x_{22} \\\\
x_{13} & x_{23}
\end{pmatrix}
$$
The first subscript $i$ of $x_{ij}$ is the sample number and the second subscript $j$ is the feature number, so $x_{21}$ is the first feature of the second sample. Notice how awkward the position of $x_{21}$ is: by convention we read row first and column second, yet $x_{21}$ sits in the first row and second column, exactly the opposite of that habit.
In the third form, the $X$ matrix of the two samples is:
$$
X=
\begin{pmatrix}
x_{11} & x_{12} & x_{13} \\\\
x_{21} & x_{22} & x_{23}
\end{pmatrix}
$$
The first row is the 3 features of the first sample, and the second row is the 3 features of the second sample. The second feature of the first sample is in the first row and second column of the matrix, which is exactly consistent with common reading habits. Therefore, we use the third form to describe linear equations throughout this book.
Another reason is that in many deep learning libraries, $X$ is placed before $W$ in the matrix operation, so the shapes can be read from left to right. For example, if the input on the left is 2 samples with 3 features each (a $2\times 3$ matrix) and we want a one-dimensional output on the right, then the shape of $W$ must be $3\times 1$, and the matrix operation gives $(2 \times 3) \times (3 \times 1)=(2 \times 1)$. Otherwise you would have to read backwards: the shape of $W$ becomes $1\times 3$ and $X$ becomes $3\times 2$, which is very awkward.
As for $B$, it always has 1 row, and its number of columns equals the number of columns of $W$. For example, if $W$ is a $3\times 1$ matrix, then $B$ is a $1\times 1$ matrix. If $W$ is a $3\times 2$ matrix, which means that three features feed into two neurons, then $B$ is a $1\times 2$ matrix, one bias per neuron.
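A minimal numpy sketch of these shape rules (the values are arbitrary; only the shapes matter):
```Python
import numpy as np

# 2 samples, 3 features each: X is 2x3 (one row per sample, one column per feature)
X = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])
# 3 features feed into 1 neuron: W is 3x1
W = np.array([[1.0], [2.0], [3.0]])
# B has 1 row and as many columns as W: here 1x1
B = np.array([[0.5]])

Z = np.dot(X, W) + B   # (2x3)·(3x1) + (1x1) -> broadcast to 2x1, one output per sample
print(Z.shape)         # (2, 1)
```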
## 4.1 Least Square Method
### 4.1.1 History
The least squares method is a procedure for finding the best functional match to the data by minimizing the sum of squared errors. With it, unknown parameters can be estimated easily such that the sum of squared errors between the fitted values and the actual data is as small as possible. The least squares method can also be used for curve fitting, and some other optimization problems can be expressed in least-squares form as well.
In 1801, the Italian astronomer Giuseppe Piazzi discovered the first asteroid, Ceres. After 40 days of tracking, Piazzi lost sight of Ceres as it moved behind the sun. Scientists all over the world then tried to relocate Ceres using Piazzi's observational data, but most of them failed. Gauss, then 24 years old, also calculated the orbit of Ceres, and the German astronomer Heinrich Olbers rediscovered Ceres based on the orbit calculated by Gauss.
The least squares method used by Gauss was published in 1809 in his book "Theoria Motus Corporum Coelestium in sectionibus conicis solem ambientium". The French scientist Legendre had independently invented the least squares method in 1806, but his work was not widely known at the time. Legendre and Gauss later disputed who had first established the least squares method.
In 1829, Gauss proved that, in a certain sense, the least squares method is optimal compared with other methods; this result is known as the Gauss-Markov theorem.
### 4.1.2 Mathematical Principles
Linear regression tries to learn:
$$z_i=w \cdot x_i+b \tag{1}$$
To make:
$$z_i \simeq y_i \tag{2}$$
where $x_i$ is the sample feature value, $y_i$ is the sample label value, and $z_i$ is the model's predicted value.
How do we learn $w$ and $b$? The mean squared error (MSE) is the loss commonly used in regression tasks:
$$
J = \frac{1}{2m}\sum_{i=1}^m(z_i-y_i)^2 = \frac{1}{2m}\sum_{i=1}^m(y_i-wx_i-b)^2 \tag{3}
$$
$J$ is the loss function. The goal is to find the straight line that minimizes the sum of squared residuals from all samples to the line.
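As a minimal sketch, the loss in formula 3 can be computed directly with numpy (the array names here are illustrative):
```Python
import numpy as np

def mse_loss(Z, Y):
    # J = 1/(2m) * sum((z_i - y_i)^2), formula 3
    m = Y.shape[0]
    return ((Z - Y) ** 2).sum() / (2 * m)
```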
<img src="https://aiedugithub4a2.blob.core.windows.net/a2-images/Images/4/mse.png" />
Figure 4-3 Evaluation principle of the mean square error function
In Figure 4-3, the circular points are the samples and the straight line is the current fitting result. In the approach shown on the left, we would compute the perpendicular distance from each sample point to the line; this requires locating the foot of the perpendicular using the line's slope before the distance can be computed, which is slow. In engineering, we usually use the method on the right instead: the vertical distance between the sample point and the line, which is just a subtraction and very convenient to compute.
Suppose our preliminary result is the dashed line. Is this line appropriate? We can calculate the distance from each point in the figure to this line and add up these distances (all positive, so they cannot cancel each other out) to obtain an overall error.
Because the points in the figure are not collinear, no single straight line can pass through all of them. Therefore, we can only keep adjusting the angle and position of the line to make the overall error as small as possible (it can never be $0$), meaning the overall deviation is smallest. The resulting line is the answer we want.
To minimize the error, we can take the derivatives of $J$ with respect to $w$ and $b$ and set them to $0$ (the condition for an extreme value); solving these equations gives the optimal $w$ and $b$.
The derivation process is as follows:
$$
\begin{aligned}
\frac{\partial{J}}{\partial{w}} &=\frac{\partial{(\frac{1}{2m}\sum_{i=1}^m(y_i-wx_i-b)^2)}}{\partial{w}} \\\\
&= \frac{1}{m}\sum_{i=1}^m(y_i-wx_i-b)(-x_i)
\end{aligned}
\tag{4}
$$
Let formula 4 be $0$:
$$
\sum_{i=1}^m(y_i-wx_i-b)x_i=0 \tag{5}
$$
$$
\begin{aligned}
\frac{\partial{J}}{\partial{b}} &=\frac{\partial{(\frac{1}{2m}\sum_{i=1}^m(y_i-wx_i-b)^2)}}{\partial{b}} \\\\
&=\frac{1}{m}\sum_{i=1}^m(y_i-wx_i-b)(-1)
\end{aligned}
\tag{6}
$$
Let formula 6 be $0$:
$$
\sum_{i=1}^m(y_i-wx_i-b)=0 \tag{7}
$$
From formula 7, we can get (Suppose there are $m$ samples):
$$
\sum_{i=1}^m b = m \cdot b = \sum_{i=1}^m{y_i} - w\sum_{i=1}^m{x_i} \tag{8}
$$
Divide both sides by $m$:
$$
b = \frac{1}{m}\left(\sum_{i=1}^m{y_i} - w\sum_{i=1}^m{x_i}\right)=\bar y-w \bar x \tag{9}
$$
Among them:
$$
\bar y = \frac{1}{m}\sum_{i=1}^m y_i, \bar x=\frac{1}{m}\sum_{i=1}^m x_i \tag{10}
$$
Substitute formula 10 into formula 5:
$$
\sum_{i=1}^m(y_i-wx_i-\bar y + w \bar x)x_i=0
$$
$$
\sum_{i=1}^m(x_i y_i-wx^2_i-x_i \bar y + w \bar x x_i)=0
$$
$$
\sum_{i=1}^m(x_iy_i-x_i \bar y)-w\sum_{i=1}^m(x^2_i - \bar x x_i) = 0
$$
$$
w = \frac{\sum_{i=1}^m(x_iy_i-x_i \bar y)}{\sum_{i=1}^m(x^2_i - \bar x x_i)} \tag{11}
$$
Substitute formula 10 into formula 11:
$$
w = \frac{\sum_{i=1}^m (x_i \cdot y_i) - \sum_{i=1}^m x_i \cdot \frac{1}{m} \sum_{i=1}^m y_i}{\sum_{i=1}^m x^2_i - \sum_{i=1}^m x_i \cdot \frac{1}{m}\sum_{i=1}^m x_i} \tag{12}
$$
Multiply both the numerator and denominator by $m$:
$$
w = \frac{m\sum_{i=1}^m x_i y_i - \sum_{i=1}^m x_i \sum_{i=1}^m y_i}{m\sum_{i=1}^m x^2_i - (\sum_{i=1}^m x_i)^2} \tag{13}
$$
$$
b= \frac{1}{m} \sum_{i=1}^m(y_i-wx_i) \tag{14}
$$
In fact, formula 13 has many variants; you will see different versions in different articles, which can be confusing. For example, the following two formulas are also correct solutions:
$$
w = \frac{\sum_{i=1}^m y_i(x_i-\bar x)}{\sum_{i=1}^m x^2_i - (\sum_{i=1}^m x_i)^2/m} \tag{15}
$$
$$
w = \frac{\sum_{i=1}^m x_i(y_i-\bar y)}{\sum_{i=1}^m x^2_i - \bar x \sum_{i=1}^m x_i} \tag{16}
$$
If you substitute formula 10 into either of the two formulas above, you should arrive at the same answer as formula 13, but it takes a bit of algebraic skill. For example, many people are unaware of this handy identity:
$$
\begin{aligned}
\sum_{i=1}^m (x_i \bar y) &= \bar y \sum_{i=1}^m x_i =\frac{1}{m}(\sum_{i=1}^m y_i) (\sum_{i=1}^m x_i) \\\\
&=\frac{1}{m}(\sum_{i=1}^m x_i) (\sum_{i=1}^m y_i)= \bar x \sum_{i=1}^m y_i \\\\
&=\sum_{i=1}^m (y_i \bar x)
\end{aligned}
\tag{17}
$$
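A quick numerical check of identity 17, as a minimal sketch with arbitrary sample values:
```Python
import numpy as np

# Arbitrary sample data, just to check identity 17 numerically
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 5.0, 7.0, 11.0])
m = len(x)

lhs = np.sum(x * y.mean())       # sum_i (x_i * y_bar)
mid = (x.sum() * y.sum()) / m    # (1/m) * (sum x_i) * (sum y_i)
rhs = np.sum(y * x.mean())       # sum_i (y_i * x_bar)

print(lhs, mid, rhs)             # all three values are equal
```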
### 4.1.3 Code
Let's use `Python` to implement the above calculation process:
#### Calculate the value of $w$
```Python
# X, Y: numpy arrays of sample features and labels; m: number of samples

# formula 15
def method1(X,Y,m):
    x_mean = X.mean()
    p = sum(Y*(X-x_mean))
    q = sum(X*X) - sum(X)*sum(X)/m
    w = p/q
    return w

# formula 16
def method2(X,Y,m):
    x_mean = X.mean()
    y_mean = Y.mean()
    p = sum(X*(Y-y_mean))
    q = sum(X*X) - x_mean*sum(X)
    w = p/q
    return w

# formula 13
def method3(X,Y,m):
    p = m*sum(X*Y) - sum(X)*sum(Y)
    q = m*sum(X*X) - sum(X)*sum(X)
    w = p/q
    return w
```
With the help of the function library, we don't need to implement basic operations like `sum()` and `mean()` by hand.
#### Calculate the value of $b$
```Python
# formula 14
def calculate_b_1(X,Y,w,m):
    b = sum(Y-w*X)/m
    return b

# formula 9
def calculate_b_2(X,Y,w):
    b = Y.mean() - w * X.mean()
    return b
```
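A minimal driver to tie the pieces together. The data-loading step is an assumption here (the actual sample file and reader live in the repo's ch04 Level1 code); any way of getting the samples of Table 4-1 into numpy arrays `X` and `Y` will do:
```Python
import numpy as np

# Hypothetical data: X = number of servers (thousands), Y = air-conditioning power (kW).
# Only the first three rows of Table 4-1 are shown; the real data set is larger.
X = np.array([0.928, 0.469, 0.855])
Y = np.array([4.824, 2.950, 4.643])
m = X.shape[0]

w1 = method1(X, Y, m)
w2 = method2(X, Y, m)
w3 = method3(X, Y, m)
b1 = calculate_b_1(X, Y, w1, m)
b2 = calculate_b_2(X, Y, w2)
print(w1, w2, w3, b1, b2)   # all w values (and both b values) agree for any data set
```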
### 4.1.4 Calculation result
Running the methods above on the full sample set, the final results are consistent, which serves as a cross-check:
```
w1=2.056827, b1=2.965434
w2=2.056827, b2=2.965434
w3=2.056827, b3=2.965434
```
### Code location
ch04, Level1