In machine learning literature, the terms error, bias, cost and loss all carry very similar meanings. In this book we use the term "loss function", denote it by the symbol $J$, and refer to the error value it computes as $loss$.
The loss is the sum of the errors of all samples ($m$ is the number of samples):
$$J = \sum_{i=1}^m loss_i$$
Strictly speaking, in the black-box example it is inaccurate to speak of the "loss of a sample"; since samples are evaluated independently, we can only speak of the "error of a sample". If we adjusted the network's parameters so that one particular sample's error became exactly $0$, the errors of the other samples would usually increase, and the total loss would grow. Therefore, we treat the loss function as a whole: the weights are adjusted according to the error of every training example, and the overall loss value tells us whether the network has been sufficiently trained.
#### Purpose of the Loss function
The loss function quantifies the error between the predicted values produced by the forward pass and the real values, so that the weights can be updated in the correct direction.
To compute the loss function:
1. Initialize the parameters of the forward pass with random values;
2. Feed a training example through the network to compute the predicted output;
3. Use the loss function to compute the error between the predicted value and the label value (the real value);
4. Compute the derivative of the loss function, propagate the error backwards through the network along the negative gradient direction, and update the weights of the forward pass;
5. Repeat steps 2-4 until the loss reaches a satisfactory value.
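The five steps above can be sketched with a toy one-parameter model. The model $a = w \cdot x$, the single data point, and the learning rate below are illustrative assumptions, not the book's network:

```Python
import numpy as np

# Toy model: prediction a = w * x, per-sample squared loss J = (a - y)^2.
x, y = 2.0, 6.0          # one training example; the ideal w is 3
w = np.random.uniform()  # step 1: initialize the parameter randomly
lr = 0.1                 # assumed learning rate

for step in range(100):
    a = w * x                # step 2: forward pass, compute the prediction
    loss = (a - y) ** 2      # step 3: error between prediction and label
    grad = 2 * (a - y) * x   # step 4: derivative of the loss w.r.t. w ...
    w = w - lr * grad        # ... move against the gradient
    if loss < 1e-6:          # step 5: stop once the loss is small enough
        break

print(round(w, 3))
```

The printed value ends up very close to the ideal parameter $w = 3$, which is exactly what the loop is meant to achieve.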
### 3.0.2 Common Loss functions in machine learning
Notation: $a$ is the predicted output, $y$ is the label value of the example, and $loss$ is the value of the loss function.
- Gold Standard Loss, also called 0-1 Loss
$$
loss=\begin{cases}
0 & a=y \\
1 & a \ne y
\end{cases}
$$
- Absolute Error function
$$
loss = |y-a|
$$
- Hinge Loss, primarily used in SVM classifiers, where the target values are in the set $\{-1, 1\}$.
$$
loss=\max(0,1-y \cdot a) \qquad y=\pm 1
$$
- Log Loss, also called the Cross-entropy error
$$
loss = -[y \cdot \ln (a) + (1-y) \cdot \ln (1-a)] \qquad y \in \{ 0,1 \}
$$
- Squared Loss
$$
loss=(a-y)^2
$$
- Exponential Loss
$$
loss = e^{-(y \cdot a)}
$$
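Each of the six losses above can be written as a one-line function. The sketch below simply mirrors the formulas with scalar inputs:

```Python
import numpy as np

def gold_standard(a, y): return 0.0 if a == y else 1.0   # 0-1 loss
def absolute(a, y):      return abs(y - a)
def hinge(a, y):         return max(0.0, 1 - y * a)      # y in {-1, +1}
def log_loss(a, y):      return -(y * np.log(a) + (1 - y) * np.log(1 - a))  # y in {0, 1}
def squared(a, y):       return (a - y) ** 2
def exponential(a, y):   return np.exp(-y * a)

print(round(hinge(0.8, 1), 2))     # 0.2: correct side of the boundary, but inside the margin
print(round(log_loss(0.9, 1), 4))  # 0.1054, i.e. -ln(0.9)
```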
### 3.0.3 Understanding the Loss function image
#### Use a two-dimensional function image to understand the effect of a single variable on the loss function
In Fig. 3-1, the y-axis is the loss value and the x-axis is the variable. Changing the value of the variable causes the loss value to rise or fall. The gradient descent algorithm moves us in the direction in which the loss function value decreases.
1. Suppose we start at point $A$, where $x=x_0$ and the loss value (the y-coordinate) is large; this error is propagated back through the network for training;
2. After one iteration we move to point $B$, where $x=x_1$ and the loss value is reduced accordingly; we propagate it back and train again;
3. Edging ever closer to the minimum, the loss passes through $x_2,x_3,x_4,x_5$;
4. When the loss value reaches an acceptable level, for example at $x_5$, training stops.
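The walk from $x_0$ toward the minimum can be reproduced with plain gradient descent on an illustrative one-variable loss; the curve $J(x)=(x-2)^2$ and the step size are assumptions for the sketch, not the curve of Fig. 3-1:

```Python
def J(x):  return (x - 2) ** 2   # assumed loss curve with its minimum at x = 2
def dJ(x): return 2 * (x - 2)    # its derivative

x, lr = -1.0, 0.3                # start at point A (x0) with step size 0.3
for step in range(6):            # produce the iterates x1 ... x5 and beyond
    print(step, round(x, 4), round(J(x), 4))   # loss shrinks every line
    x -= lr * dJ(x)              # move against the gradient
```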
#### Understanding the effect of two variables on loss function with the contour map
In Fig. 3-2, the x-axis and y-axis represent the two variables $w$ and $b$. Each combination of the two variables yields a loss value that corresponds to a unique point in the diagram. Sweeping over different values of $w$ and $b$ produces a matrix of loss values; connecting the points with the same (or similar) loss value forms an irregular ellipse. The loss is $0$ at the center, which is the position we are approaching.
The ellipses represent a depression in the contour plot: the center is lower than the edges. Computing the derivative of the loss function lets us descend across the contour lines, edging ever closer to the minimum.
### 3.0.4 Common Loss functions in neural networks
- The mean square error function, primarily used for regression problems;
- The cross-entropy function, mainly used for classification problems.
Both are non-negative functions with a minimum at the bottom, so they can be minimized by the gradient descent method.
### 3.1.1 Mean square error function
The mean square error is the most intuitive loss function: it measures the squared Euclidean distance between the predicted and the real observations, and it decreases as the predicted values approach the actual values. It is often used for function fitting in linear regression. With $loss_i = \frac{1}{2}(a_i-y_i)^2$ for a single sample, the formula for $m$ samples is:
$$J = \frac{1}{2} \sum_{i=1}^m (a_i-y_i)^2$$
(The factor $\frac{1}{2}$ is a convention that cancels when taking the derivative.)
The simplest measure of the difference between the predicted value $a_i$ and the real value $y_i$ is $Error=a_i-y_i$.
This works for a single example, but when multiple samples are accumulated, $a_i-y_i$ may be positive or negative, and the terms cancel out during summation. The absolute difference, $Error=|a_i-y_i|$, seems a simple and ideal solution, so why introduce the mean square error function? A comparison of the two loss functions is shown in Table 3-1.
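The cancellation problem is easy to check numerically; the per-sample errors below are invented for illustration:

```Python
errors = [2.0, -2.0, 1.0, -1.0]           # per-sample errors a_i - y_i

plain_sum = sum(errors)                    # positives and negatives cancel
abs_sum   = sum(abs(e) for e in errors)    # absolute error keeps the magnitudes
mse_sum   = sum(e ** 2 for e in errors)    # squared error keeps and magnifies them

print(plain_sum, abs_sum, mse_sum)         # 0.0 6.0 10.0
```

The plain sum reports zero error for a badly fitting model, while the absolute and squared sums do not.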
Table 3-1 A comparison of absolute error function and mean square error function
|Label of example|Predicted value of example|Absolute error loss|MSE loss|
|---|---|---|---|
|$[1,1,1]$|$[1,2,3]$|$(0+1+2)=3$|$(0+1+4)=5$|
|$[1,1,1]$|$[1,3,3]$|$(0+2+2)=4$|$(0+4+4)=8$|
As shown above, the MSE loss of 5 is higher than the absolute error of 3, and 8 is twice as large as 4: compared with 5, the MSE loss of 8 also magnifies the influence of a single sample's local error on the overall training. In technical terms, MSE is "sensitive to samples with large deviation", which draws enough attention to those samples during training to drive the back-propagated error.
### 3.1.2 Practical cases
Assuming a set of data distributed as in Figure 3-3, we are looking for a fitting line.
Figure 3-3 The scatter plot of sample data distribution
The first three images in Figure 3-4 show the gradual process of finding the line of best fit.
- In the first plot, calculated with the mean square error function, $loss = 0.53$;
- In the second plot, the fitting line has shifted upward, and the error is calculated as $loss = 0.16$, much lower than in the first plot;
- In the third plot, the line has shifted further upward, and the error is calculated as $loss = 0.048$. You could then try either shifting the line (adjusting the value of $b$) or rotating it (changing the value of $w$) to get an even lower loss value;
- In the fourth plot, the line deviates past the optimal position, and the error value is $loss = 0.18$. In this case, the algorithm will attempt to shift it back downward.
Fig. 3-4 The relationship between the loss value and the position of the fitting line
The third diagram shows the case where the loss is minimal. Comparing the loss values of the second and fourth plots: how do we decide whether to shift upward or downward when both MSE losses are positive?
It is actually unnecessary to compute the loss value itself during training, because the needed information is reflected in the back-propagation process. Let's look at the derivative of the mean square error function:
$$
\frac{\partial{J}}{\partial{a_i}} = a_i-y_i
$$
Although $(a_i-y_i)^2$ is always a positive number, $a_i-y_i$ can be either positive (when the line is below the points) or negative (when the line is above the points). This positive or negative number is propagated back to the previous calculation, which will guide the training process in the right direction.
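A quick numerical check of the sign: for points generated by $y = 2x + 3$, place a candidate line below them ($b=2$) and above them ($b=4$) and look at the mean of the per-sample derivatives $a_i - y_i$. The three sample points are invented for the sketch:

```Python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = 2 * x + 3                       # sample points lying on y = 2x + 3

grads = {}
for b in (2.0, 4.0):                # line below (b=2) and above (b=4) the points
    a = 2 * x + b                   # predictions of the candidate line
    grads[b] = float(np.mean(a - y))   # mean per-sample derivative a_i - y_i

print(grads)  # {2.0: -1.0, 4.0: 1.0}: negative below, positive above
```

The sign tells the training process which way to move $b$, even though both configurations have the same positive MSE.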
In the example above we have two variables, $w$ and $b$, whose changes both affect the final loss value.
Assume the equation of the fitting line is $y = 2x + 3$. When we adjust the value of $b$ from $2$ to $4$ while keeping $w = 2$ fixed, the change in the loss is shown in Figure 3-5.
Figure 3-5 Changes in the loss value when $W$ is fixed and $b$ is changed
Again assume the fitting line is $y = 2x + 3$. When we adjust $w$ from $1$ to $3$ while keeping $b = 3$ fixed, the change in the loss is shown in Figure 3-6.
Figure 3-6 Changes in the loss value when $b$ is fixed and $w$ is changed
### 3.1.3 Loss Function Visualization
#### 3D representation of loss value
The x-axis is $w$ and the y-axis is $b$. For each combination $(w, b)$, a loss value is computed and represented by the height of the three-dimensional surface. The bottom of the surface is not a plane but a slightly concave curved surface with low curvature, as shown in Figure 3-7.
Figure 3-7 The 3D loss surface formed by the simultaneous variation of $w$ and $b$
#### 2D representation of loss value
We often use the contour line to represent altitude in the plane map. The figure below is the projection of Figure 3-7 on a plane, which is the contour map of loss value, as shown in figure 3-8.
If that doesn’t make sense, we’ll draw a diagram using the following code:
```Python
import numpy as np

def CostFunction(x, y, z, m):
    # assumed mean-square-error cost; the excerpt calls it without defining it
    return ((z - y) ** 2).sum() / (2 * m)

# w, b are the current parameters; x, y the training samples; m = len(x)
s = 200
W = np.linspace(w - 2, w + 2, s)
B = np.linspace(b - 2, b + 2, s)
LOSS = np.zeros((s, s))
for i in range(len(W)):
    for j in range(len(B)):
        z = W[i] * x + B[j]              # prediction with this (W, B) pair
        loss = CostFunction(x, y, z, m)
        LOSS[i, j] = round(loss, 2)      # keep two decimal places
```
The code above calculates a loss value for each combination of $(W, B)$ and stores it in the `LOSS` matrix, rounded to two decimal places. The matrix is shown as follows:
```
[[4.69 4.63 4.57 ... 0.72 0.74 0.76]
[4.66 4.6 4.54 ... 0.73 0.75 0.77]
[4.62 4.56 4.5 ... 0.73 0.75 0.77]
...
[0.7 0.68 0.66 ... 4.57 4.63 4.69]
[0.69 0.67 0.65 ... 4.6 4.66 4.72]
[0.68 0.66 0.64 ... 4.63 4.69 4.75]]
```
We traverse the loss values in the matrix and plot points with the same value in the same colour. For example, points with a value of 0.72 are drawn in red and points with a value of 0.75 in blue, and so on. The resulting diagram is shown in Figure 3-9.
This diagram is equivalent to the contour plot, but since the contour plot is relatively concise and clear, we will use the contour plot to illustrate the problem.
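If you want to draw the contour plot yourself, a sketch with matplotlib follows; the grids and the bowl-shaped loss here are stand-ins for the real `W`, `B` and `LOSS` computed above:

```Python
import numpy as np
import matplotlib
matplotlib.use("Agg")              # headless backend; drop this to show a window
import matplotlib.pyplot as plt

s = 200
W = np.linspace(0, 4, s)           # stand-in grid; reuse your own W
B = np.linspace(1, 5, s)           # stand-in grid; reuse your own B
WW, BB = np.meshgrid(W, B, indexing="ij")
LOSS = (WW - 2) ** 2 + (BB - 3) ** 2   # illustrative bowl-shaped loss surface

plt.contour(WW, BB, LOSS, levels=20)   # each line connects equal loss values
plt.xlabel("w")
plt.ylabel("b")
plt.savefig("contour.png")
```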
Cross-entropy is an essential concept in Shannon's information theory, mainly used to measure the difference between two probability distributions $p$ and $q$, where $p$ is the real distribution and $q$ is the predicted distribution. The cross-entropy $H(p,q)$ is defined as:
$$H(p,q) = -\sum_j p(x_j) \ln q(x_j)$$
The cross-entropy can be used as a loss function in neural networks, where $p$ represents the distribution of the real labels and $q$ the distribution of the labels predicted by the trained model; the cross-entropy then measures the similarity between $p$ and $q$.
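A tiny numerical example of $H(p,q)$, with invented distributions: the closer $q$ is to $p$, the smaller the cross-entropy.

```Python
import numpy as np

def cross_entropy(p, q):
    # H(p, q) = -sum_j p_j * ln(q_j)
    return float(-np.sum(p * np.log(q)))

p       = np.array([0.7, 0.2, 0.1])   # the "real" distribution
q_close = np.array([0.6, 0.3, 0.1])   # prediction close to p
q_far   = np.array([0.1, 0.1, 0.8])   # prediction far from p

print(round(cross_entropy(p, q_close), 4))
print(round(cross_entropy(p, q_far), 4))   # larger: q_far is a worse match
```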
**The cross-entropy function is often used for logistic regression or classification.**
### 3.2.1 The origin of cross-entropy
#### Quantity of information
In information theory, the quantity of information is represented by:
$$I(x_j) = -\ln (p(x_j)) \tag{2}$$
$x_j$: an event;
$p(x_j)$: the probability that $x_j$ occurs;
$I(x_j)$: the quantity of information. The less likely $x_j$ is to happen, the more information it carries once it does happen.
Suppose there are three possible outcomes of learning the principles of neural networks, as shown in Table 3-2.
Table 3-2 Overview and quantity of information of three events
|Event number|Event|Probability $p$|Quantity of information $I$|
|---|---|---|---|
|$x_1$|A|$p=0.7$|$I=-\ln(0.7)=0.36$|
|$x_2$|Pass|$p=0.2$|$I=-\ln(0.2)=1.61$|
|$x_3$|Fail|$p=0.1$|$I=-\ln(0.1)=2.30$|
Wow, someone failed! That's a lot of information! In contrast, an "A" event carries much less information.
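The values in Table 3-2 are one line of numpy each:

```Python
import numpy as np

# Verify Table 3-2: quantity of information I(x) = -ln p(x)
for event, p in {"A": 0.7, "Pass": 0.2, "Fail": 0.1}.items():
    print(event, round(float(-np.log(p)), 2))   # A 0.36, Pass 1.61, Fail 2.3
```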
Relative entropy is also called KL divergence (Kullback-Leibler divergence). Given two different probability distributions $p(x)$ and $q(x)$ over the same random variable $x$, the KL divergence measures the difference between the two distributions; it plays a role in information theory analogous to the mean square error in regression. It is defined as:
$$D_{KL}(p \| q) = \sum_j p(x_j) \ln \frac{p(x_j)}{q(x_j)}$$
In machine learning we need to evaluate the difference between the label value $y$ and the predicted value $a$, for which the KL divergence $D_{KL}(y \| a)$ is appropriate. The KL divergence decomposes as $D_{KL}(p \| q) = H(p,q) - H(p)$; since the entropy $H(y)$ of the labels remains unchanged during optimization, we only need to focus on the cross-entropy term. Therefore, the cross-entropy is used directly as the loss function to evaluate models in machine learning.
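The relation relied on here, that the KL divergence equals the cross-entropy minus the (fixed) entropy of the label distribution, can be checked numerically with invented distributions:

```Python
import numpy as np

p = np.array([0.7, 0.2, 0.1])      # label distribution
q = np.array([0.5, 0.3, 0.2])      # predicted distribution

H_p  = -np.sum(p * np.log(p))      # entropy of p: fixed during training
H_pq = -np.sum(p * np.log(q))      # cross-entropy H(p, q)
kl   = np.sum(p * np.log(p / q))   # KL divergence D_KL(p || q)

print(bool(np.isclose(kl, H_pq - H_p)))  # True: minimizing H(p,q) minimizes KL
```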
$$loss =- \sum_{j=1}^n y_j \ln a_j \tag{7}$$
Equation 7 covers a single sample, where $n$ is not the number of examples but the number of classes. The cross-entropy for a batch of samples is therefore:
$$J = -\sum_{i=1}^m \sum_{j=1}^n y_{ij} \ln a_{ij}$$
where $m$ is the number of samples and $n$ is the number of classes.
There is a special type of problem in which an event has only two outcomes, such as "Learned" and "Unlearned", known as $0/1$ classification or binary classification. For such problems $n=2$, $y_1=1-y_2$ and $a_1=1-a_2$, and the cross-entropy reduces to:
$$loss = -[y \ln a + (1-y) \ln (1-a)] \tag{10}$$
### 3.2.2 Cross-entropy of binary classification problems
Break Equation 10 into two cases. When $y = 1$, the label value is $1$ and the sample is a positive example; the term after the plus sign becomes $0$:
$$loss = -\ln(a) \tag{11}$$
The x-axis is the predicted output and the y-axis is the loss value. $y = 1$ means the current sample has label value 1. The closer the predicted output is to 1, the lower the loss and the more accurate the training result; the closer the predicted output is to 0, the higher the loss and the poorer the result.
When $y=0$, the label value is 0 and the sample is a negative example; the term before the plus sign becomes $0$:
$$loss = -\ln(1-a) \tag{12}$$
Figure 3-10 Binary classification cross-entropy loss diagram
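Both branches come from the same formula; a quick check with an invented prediction $a=0.9$:

```Python
import numpy as np

def binary_ce(a, y):
    # loss = -[y ln a + (1 - y) ln(1 - a)]
    return float(-(y * np.log(a) + (1 - y) * np.log(1 - a)))

print(round(binary_ce(0.9, 1), 4))  # y=1 branch: equals -ln(0.9), small loss
print(round(binary_ce(0.9, 0), 4))  # y=0 branch: equals -ln(0.1), much larger
```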
Suppose the label value for learning the course well is 1 and for not learning it is 0. We want to build a system that predicts the probability that a particular student will learn the course well, based on attendance rate, class performance, homework, learning ability and other features.
For student A, the probability of learning the course well was predicted as 0.6, while the student actually passed the exam (label value 1). The cross-entropy loss for student A is therefore:
$$loss_1 = -[1 \times \ln(0.6) + (1-1) \times \ln(1-0.6)] = 0.51$$
For student B, the probability of learning the course well was predicted as 0.7, and the student also passed the exam. The cross-entropy loss for student B is:
$$loss_2 = -[1 \times \ln(0.7) + (1-1) \times \ln(1-0.7)] = 0.36$$
Since 0.7 is closer to 1, the prediction is more accurate; $loss_2$ is smaller than $loss_1$, and its backpropagation strength is correspondingly weaker.
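The two students' losses are a one-line computation each:

```Python
import numpy as np

# y = 1 for both students (both passed); only the predictions differ
loss1 = float(-np.log(0.6))   # student A, predicted probability 0.6
loss2 = float(-np.log(0.7))   # student B, predicted probability 0.7

print(round(loss1, 2), round(loss2, 2))  # 0.51 0.36
```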
### 3.2.3 Cross-entropy of multi-classification problems
When there are more than two possible outcomes, we have a multi-classification problem. Suppose there are three outcomes for the final exam:
1. A, the OneHot encoding of the label is $[1,0,0]$;
2. Pass, the OneHot encoding of the label is $[0,1,0]$;
3. Fail, the OneHot encoding of the label is $[0,0,1]$.
Suppose we predict the probabilities that student C will get an "A", a "Pass" and a "Fail" as $[0.2,0.5,0.3]$, while in reality the student fails. The cross-entropy is then:
$$loss_1 = -(0 \times \ln 0.2 + 0 \times \ln 0.5 + 1 \times \ln 0.3) = 1.2$$
Suppose we predict the probabilities that student D will get an "A", a "Pass" and a "Fail" as $[0.2,0.2,0.6]$, while student D also fails. The cross-entropy is then:
$$loss_2 = -(0 \times \ln 0.2 + 0 \times \ln 0.2 + 1 \times \ln 0.6) = 0.51$$
The value $loss_2 = 0.51$ is much lower than $loss_1 = 1.2$, which indicates that student D's prediction is closer to the actual label (a probability of 0.6 versus 0.3 for the true class). The strength of backpropagation declines as the cross-entropy gets smaller.
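With one-hot labels, the multi-class cross-entropy of Equation 7 picks out the log-probability of the true class; the two student losses above follow directly:

```Python
import numpy as np

y = np.array([0, 0, 1])                  # one-hot label: "Fail"

a_C = np.array([0.2, 0.5, 0.3])          # student C's predicted distribution
a_D = np.array([0.2, 0.2, 0.6])          # student D's predicted distribution

loss1 = float(-np.sum(y * np.log(a_C)))  # = -ln(0.3)
loss2 = float(-np.sum(y * np.log(a_D)))  # = -ln(0.6)
print(round(loss1, 2), round(loss2, 2))  # 1.2 0.51
```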
### 3.2.4 Why can’t we use mean square error as the loss function for classification problems?
1. Regression problems usually use the mean square error function, which keeps the loss convex so that the optimal solution can be obtained.
However, when the mean square error is used for classification the loss is no longer convex, and it is difficult to reach the optimal solution. The cross-entropy function, by contrast, guarantees monotonicity on the interval.
2. The last layer of a classification network uses a classification function such as Sigmoid or Softmax. If the mean square error function is attached to it, the derivative is complicated and requires massive calculation.
With the cross-entropy function the computation is straightforward: the backward error is obtained by a simple subtraction.
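Point 2 can be verified numerically: with a Sigmoid output, the derivative of the cross-entropy with respect to the pre-activation reduces to the simple subtraction $a - y$. The sketch below checks the analytic value against a finite difference at an arbitrarily chosen input:

```Python
import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))
def ce(a, y):   return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y = 0.5, 1.0
a = sigmoid(z)

analytic = a - y                                  # the "simple subtraction"
numeric  = (ce(sigmoid(z + 1e-6), y)              # central finite difference
            - ce(sigmoid(z - 1e-6), y)) / 2e-6

print(bool(np.isclose(analytic, numeric)))  # True
```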