# mirrors / microsoft / ai-edu (synced successfully about 21 hours ago)

### Proofreads3 (#638)

* line edits

* Update 06.4-Visualization of linear binary classification results.md

* Update 06.5-Implemenrarion logic AND,NOR,NOT gate.md

* Update 06.6-Binary classification function with hyperbolic tangent function.md

* Update 07.0 Multi-input and multi-output single-layer neural nets - Linear multiclassification.md

* Update 07.1-Multi-classification function.md

* Marking unclear passage

** The marked passage does not read well. The original Chinese text does not seem to read well either.

 ... ... @@ -3,9 +3,9 @@ ## 6.5 Implementing logic AND OR NOT gates Single-layer neural networks, also called perceptron, can easily implement logic AND, OR, and NOT gates. As the logic AND and OR gates need to have two variable inputs, while the logic NOT gates have only one variable input. However, their common feature is that the input is 0 or 1, which can be considered two categories of positive and negative. Single-layer neural networks, also called perceptrons, can easily implement logic AND, OR, and NOT gates. The logic AND and OR gates need to have two variable inputs, while the logic NOT gates have only one variable input. However, their common feature is that their input is 0 or 1, which can represent the two categories of positive or negative. Therefore, after learning the binary classification, we can use the idea of classification to implement the following 5 logic gates: Therefore, after learning about binary classification, we can use the idea of classification to implement the following 5 logic gates: - AND gate - NAND gate ... ... @@ -13,13 +13,13 @@ Therefore, after learning the binary classification, we can use the idea of clas - NOR gate - NOT gate Taking the logic AND as an example, the four dots in Figure 6-12 represent the four sample data. The blue dots indicate the negative cases ($y=0$), and the red triangles represent the positive cases ($y=1$). Taking the logic AND as an example, the four dots in Figure 6-12 represent four sample data points. The blue dots indicate the negative cases ($y=0$), and the red triangles represent the positive cases ($y=1$). Figure 6-12 Multiple dividing lines that can solve logic AND problems Suppose we use the classification idea, according to what we learned earlier, we should draw a dividing line between the red and blue points, which can precisely separate the positive cases from the negative ones. 
Since the sample data is sparse, the angle and position of this split line can be relatively arbitrary, such as the three straight lines in the figure, which can all be the solution to this problem. Let's see if the neural network can give us a surprise. Suppose we use the classification idea, according to what we learned earlier, we should draw a dividing line between the red and blue points, which can precisely separate the positive cases from the negative ones. Since the sample data is sparse, the angle and position of this split line can be relatively arbitrary, as the three straight lines in the figure can all be the solution to this problem. Let's see what the neural network can do for us. ### 6.5.1 Implementing logic NOT gate ... ... @@ -34,13 +34,13 @@ Thus, there is a neuronal structure as shown in Figure 6-13. Figure 6-13 Neuron implementation of incorrect logic NOT gate However, this becomes a fitting problem rather than a classification problem. For example, let $x=0.5$, substituting into the formula: However, this becomes a fitting problem rather than a classification problem. For example, let $x=0.5$ and substitute it into the formula below: $$y=wx+b = -1 \times 0.5 + 1 = 0.5$$ That is, when $x=0.5$, $y=0.5$, and the values of $x$ and $y$ do not have the meaning of "NOT". Therefore, the neuron shown in Figure 6-14 should be defined to solve the problem, and the sample data is also straightforward, as shown in Table 6-6, with only two rows of data. That is, when $x=0.5$, $y=0.5$, the values of $x$ and $y$ do not have the meaning of "NOT". Therefore, the neuron shown in Figure 6-14 should be defined to solve the problem. The sample data is also straightforward, as shown in Table 6-6, with only two rows of data. ... ... @@ -91,13 +91,13 @@ $$y = -12.468x + 6.031$$ The result shows that this line is not perpendicular to the $x$-axis, but slightly "skewed". 
This reflects the limitations of the capabilities of the neural network, which only "simulates" a result but cannot accurately obtain a perfect mathematical formula. The precise mathematical formula for this problem is a vertical line, equivalent to $w=\infty$, which is impossible to train. The result shows that this line is not perpendicular to the $x$-axis, but slightly "skewed". This reflects the neural network's limitations, which only "simulates" a result but cannot accurately obtain a perfect mathematical formula. The precise mathematical formula for this problem is a vertical line, equivalent to $w=\infty$, which is impossible to train. ### 6.5.2 Implementing logic AND,OR gates #### The neuron model The neuron model from Section 6.2 is still used, as in Figure 6-16. Still use the neuron model from Section 6.2, as in Figure 6-16. ... ... @@ -156,7 +156,7 @@ def Test(net, reader): return False  We know that a neural network can only give approximate solutions, but the extent to which this "approximation" can be made is something that we need to specify during training. For example, if the input is $(1, 1)$, the result of AND is $1$. However, the neural network can only give a probability value of $0.721$, which is not sufficient for the accuracy requirement, and the error must be less than 1e-2 for all 4 samples. We know that a neural network can only give approximate solutions, but the extent to which this "approximation" approaches accuracy is something that we need to specify during training. For example, if the input is $(1, 1)$, the result of AND is $1$. However, the neural network can only give a probability value of $0.721$, which is not sufficient for the accuracy requirement, as the error must be less than 1e-2 for all 4 samples. #### The Train function ... ... @@ -172,11 +172,11 @@ def train(reader, title): print(Test(net, reader)) ......  
A maximum of epoch of 10,000 times, a learning rate of 0.5, and a stopping condition when the value of the loss function is as low as 2e-3 are specified in the hyperparameter. At the end of the training, the test function must be called first, and True needs to be returned to meet the requirements. The classification results are displayed graphically. A maximum of epoch of 10,000 times, a learning rate of 0.5, and a stopping condition when the value of the loss function is as low as 2e-3 are specified in the hyperparameter. At the end of training, the test function must be called first, and True needs to be returned to meet the requirements. The classification results are displayed graphically. #### Compiling results The printout of the results of the logic AND is as follows: The printout of the logic AND result is as follows:  ...... ... ... @@ -191,7 +191,7 @@ B= [[-17.80473354]] [2.35882891e-03]] True  The precision $loss<1e-2$ is achieved after 4236 iterations. When four combinations of $(1,1), (1,0), (0,1), and (0,0)$ are input, all the outputs meet the accuracy requirements. The precision $loss<1e-2$ is achieved after 4236 iterations. When the four combinations of $(1,1), (1,0), (0,1), and (0,0)$ are input, all the outputs meet the accuracy requirements. ### 6.5.3 Results Comparison ... ... @@ -211,16 +211,16 @@ Table 6-8 Results comparison of five logic gates We can draw two conclusions from the values and graphs: 1. the values of W1 and W2 are essentially the same and have the same sign, indicating that the partition line must have a slope of 135° 2. the higher the accuracy, the closer the starting and ending points of the dividing line are to the midpoint of the four sides at 0.5 2. the higher the accuracy, the closer the starting and ending points of the dividing line are to the midpoints of the four sides at 0.5 The above two points show that the neural network is intelligent and will find the dividing line as gracefully and robustly as possible. 
The above two points show that the neural network is intelligent and will find the dividing line as gracefully and resolutely as possible. ### Code Location ch06, Level4 ### Thinking and Exercises ### Thinking Exercises 1. Decrease the value of max_epoch and observe the training result of the neural network. 2. Why do the logic OR and NOR use only about 2000 epochs to achieve the same accuracy, while the logic AND and NAND need more than 4000 epochs? 2. Why do the logic OR and NOR use only about 2000 epochs while the logic AND and NAND need more than 4000 epochs to achieve the same accuracy?
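The logic-gate experiment described in this file can be approximated with a short standalone sketch. It is not the repository's `NeuralNet` class: `train_gate` and `passes_test` are hypothetical names, zero initialization and per-sample updates in a fixed order are assumptions, and only the hyperparameters come from the text (learning rate 0.5, at most 10,000 epochs, stop when the loss drops below 2e-3, and a test tolerance of 1e-2 on all four samples).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_gate(X, Y, eta=0.5, max_epoch=10000, eps=2e-3):
    """Train one logistic neuron on a truth table (hypothetical helper)."""
    w = np.zeros(X.shape[1])   # assumption: start from zero weights
    b = 0.0
    for epoch in range(max_epoch):
        # assumption: one update per sample, in a fixed order
        for x, y in zip(X, Y):
            a = sigmoid(x @ w + b)
            dz = a - y         # error term of logistic + cross-entropy
            w -= eta * dz * x
            b -= eta * dz
        # binary cross-entropy averaged over the four samples
        A = sigmoid(X @ w + b)
        loss = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
        if loss < eps:
            break
    return w, b, epoch + 1

def passes_test(X, Y, w, b, tol=1e-2):
    """True if every output is within tol of its label."""
    return bool(np.all(np.abs(sigmoid(X @ w + b) - Y) < tol))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y_AND = np.array([0, 0, 0, 1], dtype=float)
Y_OR = np.array([0, 1, 1, 1], dtype=float)

w_and, b_and, epochs_and = train_gate(X, Y_AND)
w_or, b_or, epochs_or = train_gate(X, Y_OR)
print("AND:", w_and, b_and, epochs_and, passes_test(X, Y_AND, w_and, b_and))
print("OR: ", w_or, b_or, epochs_or, passes_test(X, Y_OR, w_or, b_or))
```

Stopping at a mean loss below 2e-3 over four samples bounds each individual sample loss by 8e-3, which keeps every output within 1e-2 of its label. That is why the two thresholds in the text fit together.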
 ... ... @@ -3,7 +3,7 @@ ## 6.6 Binary classification function with hyperbolic tangent function This section is an extended reading. Through a series of modifications to the source code, we can finally achieve the purpose of using the hyperbolic tangent function as the classification function. Although this "requirement" is fictitious, we can deepen our understanding of the basic concepts of classification function, loss function, backpropagation, etc., by practicing this process. Through a series of modifications to the source code, we can finally unlock the purpose of using the hyperbolic tangent function as the classification function. Although this process is not strictly required, we can deepen our understanding of the basic concepts of classification function, loss function, backpropagation, etc., by practicing it. ### 6.6.1 Raising Questions ... ... @@ -13,7 +13,7 @@ $$a_i=Logistic(z_i) = \frac{1}{1 + e^{-z_i}} \tag{1}$$ $$loss_i(w,b)=-[y_i \ln a_i + (1-y_i) \ln (1-a_i)] \tag{2}$$ There is also a function that looks very similar to the logistic function, namely the hyperbolic tangent function (Tanh Function), with the following equation: There is also a function that looks very similar to the logistic function--the hyperbolic tangent function (Tanh Function)--with the following equation: $$Tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = \frac{2}{1 + e^{-2z}} - 1 \tag{3}$$ ... ... @@ -28,13 +28,13 @@ Table 6-9 Comparison of the logistic function and Tanh function ||| |Positive-negative boundary:$a=0.5$|Positive-negative boundary:$a=0$| For the logistic function, $0.5$ is generally used as the boundary between positive and negative classes. It is natural to think that $0$ can be used to separate the positive and negative classes for the Tanh function. In the logistic function, $0.5$ is generally used as the boundary between positive and negative classes. 
It is natural to think that $0$ can be used to separate the positive and negative classes in the Tanh function. The term dividing line is actually just a way for people to understand neural networks to do binary classification. For neural networks, there is actually no such concept as a dividing line. All it has to do is push the positive examples upwards and the negative examples downwards as much as possible through a linear transformation. The term dividing line is actually just a way for people to understand how neural networks practice binary classification. For neural networks, there is actually no such concept as a dividing line. All it does is push the positive examples upwards and the negative examples downwards as much as possible through a linear transformation. ### 6.6.2 Modify the feed-forward calculation and backpropagation functions Let's get down to business now. The first step is to replace the logistic function with a Tanh function and modify the feed-forward calculation while not forgetting to revise the code for backpropagation. Let's get down to business now. The first step is to replace the logistic function with a Tanh function and modify the feed-forward calculation while remembering to revise the code for backpropagation. #### Add the Tanh function ... ... @@ -46,7 +46,7 @@ def Tanh(z): #### Modify the feed-forward calculation function An important principle in software development is the open and closed principle: open to "add" and closed to "modify". The primary purpose is to prevent introducing bugs. In order not to modify the existing code of the NeuralNet class, we derive a subclass and add new functions to the subclass to override the functions of the parent class, and still use the logic of the parent class for the rest of the code; for this example, use the logic of the subclass. An important principle in software development is the open and closed principle: open to extension but closed for modification. 
The primary purpose of this is to prevent introducing bugs. In order not to modify the existing code of the NeuralNet class, we derive a subclass and add new functions to the subclass to override the functions of the parent class, and still use the logic of the parent class for the rest of the code; for this example, use the logic of the subclass. Python class TanhNeuralNet(NeuralNet): ... ... @@ -75,9 +75,9 @@ class NetType(Enum): #### Modify the backpropagation function The feed-forward calculation is easy to modify, but the backpropagation needs to deduce the formula ourselves! The feed-forward calculation is easy to modify, but we need to deduce the backpropagation formula ourselves! For the derivation of the cross-entropy function of Eq. 2, for convenience, we write the derivation process by using a single sample approach: For the derivation of the cross-entropy function of Eq. 2, we write the derivation process by using a single sample approach for convenience: $$\frac{\partial{loss_i}}{\partial{a_i}}= \frac{a_i-y_i}{a_i(1-a_i)} \tag{4} ... ... @@ -110,7 +110,7 @@ class TanhNeuralNet(NeuralNet): return dW, dB  This implementation makes use of the results of Equation 6. After carefully deducing the formula again and confirming that it is correct, we can try to run the program: This implementation uses the results of Equation 6. After carefully deducing the formula again and confirming that it is correct, we can try to run the program:  epoch=0 ... ... @@ -126,12 +126,12 @@ Level4_TanhAsBinaryClassifier.py:29: RuntimeWarning: invalid value encountered i Unsurprisingly, there is an error! The first error should be that the divisor is 0, which means that the value of batch_a is 0. Why hasn't such an exception been thrown when using the pair rate function? There are two reasons: 1. with the logistic function, the output value range is (0,1), so the value of a will always be greater than 0; it cannot be 0. 
While the output value range of the Tanh function is (-1,1), it is possible to be 0. 1. with the logistic function, the output value range is (0,1), so the value of a will always be greater than 0; it cannot be 0. Whereas the output value range of the Tanh function is (-1,1), it is possible to be 0. 2. the previous error term, dZ = batch_a - batch_y, does not have a division term. We cannot solve the first reason since that is a characteristic of the function itself. The derivative of the Tanh function is a fixed form (1+A)(1-A) and cannot be modified. If it is modified, it is not a Tanh function anymore. We cannot solve the first reason since that is a characteristic inherent to the function. The derivative of the Tanh function is a fixed form (1+A)(1-A) and cannot be modified. If it is modified, it is not a Tanh function anymore. Then let's consider the second reason, can we remove batch_a from dZ? That is, let the derivative of the cross-entropy function contain a (1-a)(1+a) term in the denominator, thereby allowing the derivatives of the Tanh function to cancel each other out? We modify the cross-entropy function according to this concept, with a simplified way to facilitate the derivation. Then, let's consider the second reason, can we remove batch_a from dZ? That is, let the derivative of the cross-entropy function contain a (1-a)(1+a) term in the denominator, thereby allowing the derivatives of the Tanh function to cancel each other out? We modify the cross-entropy function according to this concept, with a simplified way to facilitate the derivation. ### 6.6.3 Modify the loss function ... ... @@ -187,9 +187,9 @@ class NeuralNet(object): dZ = 2 * (batch_a - batch_y) ......  Note that we commented out the code for step1 and used the result of equation 9, replacing it with the code for step2. Note that we commented out the code for step1, and using the result from equation 9, we replaced the code for step2. 
In the second run, the result only runs for one round and then stopped. Looking at the printout and the loss function value, the loss function is actually a negative number! In the second run, the result only runs for one round and then it stops. Looking at the printout and the loss function value, we see that the loss function is actually a negative number!  epoch=0 ... ... @@ -220,7 +220,7 @@ After changing to 1+a:$$loss_i=-[(1+y_i) \ln (1+a_i) + (1-y_i) \ln (1-a_i)] \tag{7}$$The Tanh function outputs the value a as (-1,1) such that 1+a \in (0,2) and 1-a \in (0,2). When in the (1,2) interval, the values of ln(1+a) and ln(1-a) are greater than 0, which eventually leads to a negative loss. If we still use the cross-entropy function, it must conform to its original design concept of having both 1+a and 1-a in the (0,1) domain! The Tanh function outputs the value a as (-1,1) such that 1+a \in (0,2) and 1-a \in (0,2). When in the (1,2) interval, the values of ln(1+a) and ln(1-a) are greater than 0, which eventually leads to a negative loss. If we still use the cross-entropy function, it must conform to its original design concept of having both 1+a and 1-a in the (0,1) domain. ### 6.6.4 Re-modify the code of loss function ... ... @@ -258,7 +258,7 @@ This shift reminds us of comparing the logistic function and the Tanh function a ### 6.6.5 Modify sample data label values Considering the original data, its label value is either 0 or 1, indicating positive and negative classes, which is consistent with the output value domain of the logistic function. The Tanh requires the labels of positive and negative classes to be -1 and 1, so we need to alter the label values. The original data's label value is either 0 or 1, indicating positive and negative classes, which is consistent with the output value domain of the logistic function. The Tanh requires the labels of positive and negative classes to be -1 and 1, so we need to alter the label values. 
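Two points from this part of the walkthrough can be checked numerically. The sketch below is illustrative only and is not the repository's `SimpleDataReader_tanh` code: `logistic`, `loss_eq7`, and `to_minus_one_one` are hypothetical names, and the mapping $y' = 2y - 1$ is assumed as the simplest way to turn 0/1 labels into -1/1 labels.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_eq7(a, y):
    """Eq. 7 as written in the text: a is simply replaced by 1+a."""
    return -((1 + y) * np.log(1 + a) + (1 - y) * np.log(1 - a))

def to_minus_one_one(y01):
    """Map 0/1 labels to -1/1 labels (hypothetical helper)."""
    return 2 * np.asarray(y01) - 1

# Eq. 3: Tanh(z) = 2 * Logistic(2z) - 1, i.e. a stretched and shifted logistic
z = np.linspace(-3.0, 3.0, 7)
assert np.allclose(np.tanh(z), 2 * logistic(2 * z) - 1)

# With a in (0, 1), 1+a lies in (1, 2), so ln(1+a) > 0 and Eq. 7 goes negative
negative_example = loss_eq7(a=0.5, y=1)
print(negative_example)      # a negative number

# Label conversion needed before training with Tanh
labels = np.array([0, 1, 1, 0])
tanh_labels = to_minus_one_one(labels)
print(tanh_labels)           # [-1  1  1 -1]
```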
Derive a subclass SimpleDataReader_tanh on the SimpleDataReader class and add a ToZeroOne() method, the purpose is to change the original [0/1] label into a [-1/1] label. ... ... @@ -315,7 +315,7 @@ Table 6-10 Comparing the differences of the cross-entropy functions using differ ||| |output value domain a between (-1,1), the dividing line is a=0, the label value is y=-1/1|y=-1 for negative cases and y=1 for positive cases, the input value domain a is between (-1,1), which is consistent with the output value domain of the Tanh function| It can be graphically summarized that when the Tanh function is used, it is equivalent to stretch the range of the Logistic output domain to 2 times, and the lower boundary is changed from 0 to -1. While the corresponding cross-entropy function, which stretches the input value domain to 2 times, and the left boundary is changed from 0 to -1, matches the classification function exactly. It can be graphically summarized that when the Tanh function is used, it is equivalent to stretching the range of the Logistic output domain by a factor of 2, and the lower boundary is changed from 0 to -1; whereas the corresponding cross-entropy function, which stretches the input value domain by a factor of 2, and changes the left boundary from 0 to -1, which matches the classification function exactly. ### Code Location ... ...  ... ... @@ -7,9 +7,9 @@ ### 7.0.1 Raising Questions We have solved the Chu-Han Contention problem in BC and now look at the Three Kingdoms problem around 220 AD. We have solved the Chu-Han Contention problem from around 200 BCE and now let us look at the Three Kingdoms problem four hundred years later. There are 140 sample data in the dataset, as shown in Table 7-1. There are 140 sample data points in the dataset, as shown in Table 7-1. Table 7-1 Sample Data Sampling ... ... @@ -33,10 +33,10 @@ Figure 7-1 Sample data visualization Questions： 1. 
which country does it belong to when the relative latitude and longitude values are (5,1)? 2. Which country is it when the relative latitude and longitude values are (7,6)? 3. which country does it belong to when the relative latitude and longitude values are (5,6)? 4. Which country is it when the relative latitude and longitude values are (2,7)? 1. Under which country's territory do the relative latitude and longitude values (5,1) fall? 2. Under which country's territory do the relative latitude and longitude values (7,6) fall? 3. Under which country's territory do the relative latitude and longitude values (5,6) fall? 4. Under which country's territory do the relative latitude and longitude values (2,7) fall? ### 7.0.2 Multi-classification learning strategy ... ... @@ -48,25 +48,25 @@ Figure 7-2 shows the difference between linear and non-linear multi-classificati Figure 7-2 Intuitive understanding of the difference between linear and non-linear multi-classification The left side is linear multi-classification, and the right side is non-linear multi-classification. The difference between them is whether the sample points of different categories can be separated by a straight line. For neural networks, linear multiclassification can be solved using a single-layer structure, while non-linear multi-classification requires a two-layer structure. The left side is an example of linear multi-classification, and the right side is an example of non-linear multi-classification. The difference between them is whether or not the sample points of different categories can be separated by a straight line. For neural networks, linear multiclassification can be solved using a single-layer structure, while non-linear multi-classification requires a two-layer structure. #### The relationship between binary classification and multi-classification We have learned about using neural networks to do binary classification, which does not work for multi-classification. 
In traditional machine learning, some binary classification algorithms can be directly generalized to multi-classification, but more often than not, we will use binary classification learners to solve multi-classification problems based on some basic strategies. We have learned how to use neural networks for binary classification, which does not work for multi-classification. In traditional machine learning, some binary classification algorithms can be directly generalized for multi-classification, but more often than not, we will use basic strategies of binary classification to solve multi-classification problems. There are three ways to solve the multi-classification problem. There are three ways to solve multi-classification problems. 1. one-to-one approach Train one classifier by keeping only two categories of data at a time. If there are N categories, then C^2_N classifiers need to be trained. For example, if N=3, we need to train A|B, B|C, A|C classifiers. Train one classifier by keeping only two categories of data at a time. If there are N categories, then C^2_N classifiers need to be trained. For example, if N=3, we need to train the A|B, B|C, A|C classifiers. Figure 7-3 One-to-one approach As shown on the far left of Figure 7-3, this two classifier only cares about classifying blue and green samples, regardless of the red samples, which means that only blue and green samples are fed into the network during training. As shown on the far left of Figure 7-3, this binary classifier only cares about classifying blue and green samples, regardless of the red samples, which means that only blue and green samples are fed into the network during training. When the (A|B) classifier tells you that it is class A, you need to go to the (A|C) classifier and try again, and if it is also class A, then it is class A. If (A|C) tells you it's class C, it's basically class C. It can't be class B. If you don't believe me, you can go to the (B|C) classifier and test it again. 
When the (A|B) classifier tells you that something is class A, you need to put it through the (A|C) classifier and try again, and if it is still class A, then it is class A. If (A|C) tells you it is class C, it is class C. It can't be class B. If you don't believe me, you can go to the (B|C) classifier and test it again. 2. One-to-many approach ... ... @@ -76,22 +76,22 @@ As in Figure 7-4, when dealing with one category, all other categories are tempo Figure 7-4 One-to-many approach As in the leftmost figure, the red samples are treated as one class, the blue and green samples are mixed together as another category during training. As in the leftmost figure, the red samples are treated as one class, and the blue and green samples are mixed together as another category during training. Three classifiers are called simultaneously, and the three results are combined to give the actual result. For example, if the first classifier tells you it's a "red class", then it is indeed a red class; if it tells you it is a non-red class, you need to look at the result of the second classifier, green class or non-green class; and so on. 3. Many-to-many approach Suppose there are 4 categories ABCD, we can count AB as one class and CD as one class, and train a classifier 1; then count AC as one class and BD as one class, and train a classifier 2. Suppose there are 4 categories: ABCD. We can count AB as one class and CD as one class, and train a classifier 1; then count AC as one class and BD as one class, and train a classifier 2. The first classifier tells you class AB, and the second classifier tells you class BD, then do the "AND" operation, which is class B. If the first classifier tells you class AB, and the second classifier tells you class BD, then perform the "AND" operation, which means it is class B. #### Multi-classification and multi-label In multi-classification learning, although there are multiple categories, each sample belongs to only one category. 
For example, if a picture has blue sky and white clouds and flowers and trees, there are two ways to label this picture. For example, if a picture has a blue sky and white clouds and flowers and trees, there are two ways to label this picture. - A picture labelled as "landscape" instead of "people" is a landscape picture, which is called classification. - The picture is labeled as "landscape" instead of "portrait." This is classification, and the picture has been classified as a landscape. - The picture is labeled as "blue sky", "white clouds", "flowers", "trees", etc. Such a task is not called multi-classification learning but multi-label learning, which we do not address here. - The picture is labeled as "blue sky," "white clouds," "flowers," "trees," etc. This kind of task is not called multi-classification learning but multi-label learning, which we do not address here.  ... ... @@ -5,13 +5,13 @@ This function works for both linear and non-linear multi-classifications. Recall that for the binary classification problem, a logistic function is used to calculate the probability value of the sample after the linear calculation, thus dividing the sample into positive and negative categories. What method should be used to calculate the probability values of the samples belonging to each category for the multiclassification problem? And how does it work into the backpropagation process? We focus on this question in this section. Recall that for the binary classification problem, a logistic function is used to calculate the probability value of the sample after the linear calculation, thus dividing the sample into positive and negative categories. What method should we use to calculate the probability values of the samples belonging to each category in a multiclassification problem? And how does the backpropagation process work? We focus on these questions in this section. 
### 7.1.1 Definition of multi-classification functions - Softmax #### How to get the probability of classification results for multi-classification problems？ Logistic functions can yield binary probability values such as 0.8, 0.3, the former close to 1 and the latter close to 0. How can similar probability values be obtained for multi-classification problems? Logistic functions can yield binary probability values such as 0.8 and 0.3, the former being close to 1 and the latter being close to 0. How can similar probability values be obtained for multi-classification problems? We still assume that the classification value for a sample is obtained using this linear formula. ... ... @@ -19,17 +19,17 @@$$ z = x \cdot w + b $$However, we require that z is not a scalar but a vector. If it is a triple classification problem, we would need z to be a three-dimensional vector, and the element values of each cell in the vector represent the value of that sample belonging to each of the three categories, wouldn't that work? However, we require that z be not a scalar but a vector. If it is a triple classification problem, we would need z to be a three-dimensional vector, and the element values of each cell in the vector represent the value of that sample belonging to each of the three categories, wouldn't that work? Specifically, suppose x is a (1x2) vector, and design w as a (2x3) vector and b as a (1x3) vector, then z is a (1x3) vector. We assume that z is calculated as [3,1,-3], which represents the values of sample x in each of the three categories, and we convert it into probability values below. Some readers may wonder: can't we train the neural network to make its z-values directly into probability form? The answer is no, because z-values are obtained by linear computation, and linear computation has limited power to effectively turn them into probability values directly. 
Some readers may wonder: can't we train the neural network to make its z-values directly into probability form? The answer is no, because z-values are obtained by linear computation, and linear computation cannot efficiently turn z-values into probability values directly. #### Take the max value The z-value is [3,1,-3], which will become [1,0,0] if the max operation is taken, which meets our classification needs, that is, the sum of the three is 1, and the sample is considered to belong to the first class. But there are two shortcomings. The z-value is [3,1,-3], which will become [1,0,0] if the max operation is taken, which meets our classification needs. That is, the sum of the three values is 1, and the sample is considered to belong to the first class. But there are two shortcomings. 1. Classification result is [1,0,0], which only retains information of either 0 or 1, without how much the elements differ from each other. This can be interpreted as "Hard Max". 1. The classification result is [1,0,0], which only retains 0 and 1, without information of how much the elements differ from each other. This can be interpreted as "Hard Max". 2. The max operation itself is not differentiable and cannot be used in backpropagation. #### Introducing Softmax ... ... @@ -69,47 +69,47 @@ Table 7-2 Differences between MAX operation and Softmax When there are (at least) three categories, by calculating their outputs using the Softmax formula and comparing the relative sizes, it is concluded that the sample belongs to the first category, since the value of 0.879 for the first category is the largest among the three. Note that this is the value calculated for one sample, not three samples, i.e. Softmax gives the probability that a given sample belongs to each of the three categories. It has two characteristics. There are two characteristics of this formula. 1. the probabilities of the three categories add up to 1 2. 
the probability of each class is greater than 0 #### Working Principle of Back Propagation of Softmax We still assume that the predicted data output from the network is z=[3,1,-3] and the label value is y=[1,0,0]. When doing backpropagation, based on the previous experience, we will use z-y and get: We still assume that the predicted data output from the network is z=[3,1,-3] and the label value is y=[1,0,0]. When performing backpropagation, based on previous experience, we will use z-y and get:$$z-y=[2,1,-3]$$This information is strange. - The first item is 2, we have predicted accurately that this sample belongs to the first category, but the value of the reverse error is 2, i.e. the penalty value is 2 - The second term is 1, the penalty value is 1, the prediction is correct, there is still a penalty value - The third item is -3, the penalty value is -3, which means the reward value is 3. Obviously, the prediction is wrong, but the reward is given - The first item is 2. We have predicted accurately that this sample belongs to the first category, but the value of the reverse error is 2, i.e. the penalty value is 2 - The second term is 1. The penalty value is 1. The prediction is correct--there is still a penalty value - The third item is -3. The penalty value is -3, which means the reward value is 3. Obviously, the prediction is wrong, but the reward is given So, if we do not use a mechanism like Softmax, there is a problem: So, if we do not use a mechanism like Softmax, there is the following problem: - The z-value and y-value, i.e., the predicted value and the labelled value, are not comparable. For example, z_0=3 cannot be compared with y_0=1. - The three elements in the z-value are comparable, but they can only be compared in magnitude, not by difference, e.g. z_0>z_1>z_2, but the difference between 3 and 1 is 2, and the difference between 1 and -3 is 4, and these differences are meaningless. 
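The mismatch between raw z-values and 0/1 labels described above can be checked numerically. The following is a minimal sketch, assuming NumPy and a hypothetical helper named `softmax` (not taken from the original code); it reuses the example values z=[3,1,-3] and y=[1,0,0] and contrasts the raw difference with the difference obtained after normalizing z into probabilities.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D score vector (hypothetical helper for illustration)."""
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

z = np.array([3.0, 1.0, -3.0])   # raw linear outputs from the example
y = np.array([1.0, 0.0, 0.0])    # one-hot label: the first category

# Raw scores are not on the same scale as the 0/1 labels,
# so this difference is hard to interpret as an error signal.
raw_error = z - y

# After Softmax the outputs lie in (0, 1) and sum to 1,
# so they can be compared with the labels directly.
a = softmax(z)
softmax_error = a - y

print(raw_error)                   # [ 2.  1. -3.]
print(np.round(a, 3))              # [0.879 0.119 0.002]
print(np.round(softmax_error, 3))  # [-0.121  0.119  0.002]
```

The sign of each entry in `softmax_error` now matches the intuition of reward (negative) for the correct category and penalty (positive) for the others.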
After using Softmax, we get the value a=[0.879,0.119,0.002], using a-y: After using Softmax, we get the value a=[0.879,0.119,0.002]. Using a-y:$$a-y=[-0.121, 0.119, 0.002]$$Let's analyze this information again: - The first term, -0.121, is a reward for giving that category 0.121 because it got it right, but it could make that probability value larger, preferably 1 - The first term, -0.121, rewards that category with 0.121 because the network got it correct, but it could make that probability value larger, preferably 1 - The second term, 0.119, is a penalty because it tries to give the second category a probability of 0.119, so it needs this probability value to be smaller, preferably 0 - The third term, 0.002, is a penalty because it tries to give a probability of 0.002 to the third category, so it needs this probability value to be smaller, preferably 0 This information is totally correct and can be used for backpropagation. Softmax first does a normalization to normalize the output value to between [0,1] to be compared with the label value of 0 or 1 and learn the magnitude of the penalty or reward. This information is totally correct and can be used for backpropagation. Softmax first uses normalization to normalize the output value to be between [0,1] so it can be compared with the label value of 0 or 1 and the magnitude of the penalty or reward can be learned. From the inheritance relation point of view, the Softmax function can be seen as an extension of the Logistic function, such as a binary classification problem: In terms of inheritance relation, the Softmax function can be seen as an extension of the Logistic function, such as in a binary classification problem:$$ a_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{z_2 - z_1}} $$Is it very similar to the form of Logistic function? 
In fact, the logistic function also gives a probability value for the current sample, except that it relies on a bias close to 0 or close to 1 to determine whether it belongs to the positive or negative class. Isn't it very similar to the form of Logistic function? In fact, the logistic function also gives a probability value for the current sample, except that it relies on a bias close to 0 or close to 1 to determine whether it belongs to the positive or negative class. ### 7.1.2 Feed-forward propagation ... ... @@ -239,7 +239,7 @@$$ Since Softmax involves summation, there are two cases: - The derivative of the output term $a_1$ with respect to the input term $z_1$, where: $j=1, i=1, i=j$, which can be extended to any equal value of $i,j$ - Find the derivative of the output term $a_1$ with respect to the input term $z_1$, where: $j=1, i=1, i=j$, which can be extended to any equal value of $i,j$ - Find the derivative of the output $a_2$ or $a_3$ with respect to the input $z_1$, where $j$ is $2$ or $3$, $i=1,i \neq j$, and can be extended to any unequal value of $i,j$ The numerator of Softmax function: Since $a_j$ is calculated, the numerator is $e^{z_j}$. ... ... @@ -262,7 +262,7 @@ $$\tag{21}$$ - When $i \neq j$ (e.g., the derivative of the output categorical value $a_1$ to $z_2$, $j=1,i=2$), the derivative of $a_j$ to $z_i$, with $z_j$ on the numerator unrelated to $i$, the derivative is 0. The summation term in the denominator, $e^{z_i}$, is to be involved in the derivative. Again, Equation 33, since the derivative of the numerator $e^{z_j}$ with respect to $e^{z_i}$ results in 0: - When $i \neq j$ (e.g., the derivative of the output categorical value $a_1$ to $z_2$, $j=1,i=2$), the derivative of $a_j$ to $z_i$, with $z_j$ on the numerator unrelated to $i$, the derivative is 0. The summation term in the denominator, $e^{z_i}$, is to be involved in the derivative. 
Again, with respect to Equation 33, since the derivative of the numerator $e^{z_j}$ with respect to $e^{z_i}$ results in 0: $$\frac{\partial{a_j}}{\partial{z_i}}=\frac{-(E)'e^{z_j}}{E^2}$$ ... ... @@ -279,15 +279,15 @@ $$\tag{22}$$ 2. the overall backpropagation formula combining the loss function 2. the overall backpropagation formula combined with the loss function Looking at the figure above, we require the partial derivative of the Loss value with respect to Z1. Unlike the previous logistic function, that function is a z corresponding to an a, so the backward relationship is also one-to-one. And here, the calculation of a1 is involved in z1,z2,z3; the calculation of a2 is also participated in z1,z2,z3, i.e., all a's are calculated with respect to z in the previous layer, so it is also more complicated when considering the backward. Looking at the figure above, we require the partial derivative of the Loss value with respect to Z1. Unlike the previous logistic function, that function is a z corresponding to an a, so the backward relationship is also one-to-one. And here, the calculation of a1 is involved in z1,z2,z3; the calculation of a2 is also involved in z1,z2,z3, i.e., all a's are calculated with respect to z in the previous layer, so it is also more complicated when considering the backward direction. First, from the Loss formula, $loss=-(y_1lna_1+y_2lna_2+y_3lna_3)$, a1 is definitely related to z1, so are a2,a3 related to z1? First, from the Loss formula, $loss=-(y_1lna_1+y_2lna_2+y_3lna_3)$, a1 is definitely related to z1. So, are a2,a3 related to z1? Looking again at the form of the Softmax function: Both a1, a2, a3, are related to z1, not a one-to-one relationship, so, to find the partial derivative of Loss to Z1, we must add up the results of Loss->A1->Z1, Loss->A2->Z1, Loss->A3->Z1，all three ways. Thus, the following equation is obtained: Both a1, a2, a3, are related to z1. This is not a one-to-one relationship. 
So, to find the partial derivative of Loss to Z1, we must add up the results of Loss->A1->Z1, Loss->A2->Z1, Loss->A3->Z1，all three ways. Thus, the following equation is obtained: \begin{aligned} ... ... @@ -298,7 +298,7 @@ When $i=1,j=3$ in the above equation, it is fully consistent with our assumptions and does not lose generality. As mentioned before, since Softmax involves the summing of terms, whether the classification result of A and the classification of label values of Y are consistent, it needs to be discussed by case: As mentioned before, since Softmax involves the summing of terms, regardless of whether the classification result of A and the classification of label values of Y are consistent, it needs to be discussed by case: $$\frac{\partial{a_j}}{\partial{z_i}} = \begin{cases} a_j(1-a_j), & i = j \\\\ -a_ia_j, & i \neq j \end{cases} ... ... @@ -338,9 +338,9 @@$$ \end{aligned} \tag{25}$$Since y_j takes the values [1,0,0] or [0,1,0] or [0,0,1], these three add up to [1,1,1], and multiplying by [1,1,1] in the matrix multiplication operation is equivalent to doing nothing, it is equal to the original value. Since y_j takes the values [1,0,0] or [0,1,0] or [0,0,1], these three add up to [1,1,1], and multiplying by [1,1,1] in the matrix multiplication operation is equivalent to doing nothing: it is equal to the original value. We are surprised to find that the final backward calculation process is:$$a_i - y_i$$, assuming$$a_i=[0.879, 0.119, 0.002]$$for the current sample and$$y_i=[0, 1, 0]$$, then. 
We are surprised to find that the final backward calculation process is:$$a_i - y_i$$, assuming$$a_i=[0.879, 0.119, 0.002]$$for the current sample and$$y_i=[0, 1, 0]$$, then:$$a_i - y_i = [0.879, 0.119, 0.002]-[0,1,0]=[0.879,-0.881,0.002]$$The implication is that the sample predicts the first category, but it is actually the second category, so a penalty value of 0.879 is given to the first category, a reward of 0.881 to the second category, and a penalty of 0.002 to the third category, and back propagates to the neural network. ... ... @@ -376,13 +376,13 @@ The results of the two implementations are consistent: [0.95033021 0.04731416 0.00235563]  Why is it the same? The code looks so much different! Let's prove it: Why is it the same? The code looks so different! Let's prove it: Suppose there are 3 values a, b, c, and a is the largest of the three, then the Softmax weight of b should be written as follows: Suppose there are 3 values, a, b, c, and a is the largest of the three. Then, the Softmax weight of b should be written as follows:$$P(b)=\frac{e^b}{e^a+e^b+e^c}$$If subtracting the maximum becomes a-a, b-a, c-a, then the weight of Softmax accounted for by b' should be written as follows: If by subtracting the maximum the terms become a-a, b-a, c-a, then the weight of Softmax accounted for by b' should be written as follows:$$ \begin{aligned} ... ... @@ -398,7 +398,7 @@  The way Softmax2 is written is acceptable for a one-dimensional vector or array, but there is a problem if you encounter Z as a $M \times N$-dimensional (M, N>1) matrix because the function np.sum(exp_Z) adds all the elements in the $M \times N$ matrix together to get a scalar value, instead of adding the relevant column elements together. So it should be written like this: So it should be written like this instead: Python class Softmax(object): ... ... 
@@ -410,7 +410,7 @@ class Softmax(object):  The parameter axis=1 is essential because if the input Z is a single-sample prediction, it should be an array of $3\times 1$ if the input Z is divided into three categories, and if: **The parameter axis=1 is essential because if the input Z is a single-sample prediction, it should be an array of $3\times 1$ if the input Z is divided into three categories, and if: - $z = [3,1,-3]$ - $a = [0.879,0.119,0.002]$ ... ... @@ -439,4 +439,4 @@ A still contains two samples, but it becomes a sum of all 6 elements of the two ### Thinking and Exercises 1. Is it possible that two or more classified values of a sample are the same among the three classified values, such as $[0.3,0.3,0.4]$? 2. How do you plan to solve this problem? 2. How would you solve a problem like this?
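The two implementation points in this section, subtracting the maximum and summing with axis=1, can be checked with a short sketch. This is not the elided `Softmax` class from the diff; the function names `softmax_rows` and `softmax_global_sum` and the second sample row are assumptions made for illustration, and the layout assumed here is one sample per row.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise Softmax for an (M, N) array, one sample per row (assumed layout)."""
    shifted = Z - np.max(Z, axis=1, keepdims=True)   # subtract each row's max
    exp_Z = np.exp(shifted)
    return exp_Z / np.sum(exp_Z, axis=1, keepdims=True)

def softmax_global_sum(Z):
    """Deliberately flawed variant: np.sum without axis adds all M*N elements."""
    exp_Z = np.exp(Z - np.max(Z))
    return exp_Z / np.sum(exp_Z)

Z = np.array([[3.0, 1.0, -3.0],
              [1.0, -3.0, 3.0]])   # second row is made up for illustration

A_good = softmax_rows(Z)
A_bad = softmax_global_sum(Z)

print(A_good.sum(axis=1))   # each row sums to 1
print(A_bad.sum(axis=1))    # rows no longer sum to 1; only the whole matrix does
```

Adding a constant to every element of a row leaves `softmax_rows` unchanged, which is the property the max-subtraction proof relies on.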