Commit 282823b8 authored by Yifei Feng, committed by GitHub

Merge pull request #5503 from yifeif/r0.11

Reformat markdown.
@@ -436,35 +436,35 @@ you a desirable model size.
Finally, let's take a minute to talk about what the Logistic Regression model
actually looks like in case you're not already familiar with it. We'll denote
the label as \\(Y\\), and the set of observed features as a feature vector
\\(\mathbf{x}=[x_1, x_2, ..., x_d]\\). We define \\(Y=1\\) if an individual earned >
50,000 dollars and \\(Y=0\\) otherwise. In Logistic Regression, the probability of
the label being positive (\\(Y=1\\)) given the features \\(\mathbf{x}\\) is given
as:

$$ P(Y=1|\mathbf{x}) = \frac{1}{1+\exp(-(\mathbf{w}^T\mathbf{x}+b))}$$

where \\(\mathbf{w}=[w_1, w_2, ..., w_d]\\) are the model weights for the features
\\(\mathbf{x}=[x_1, x_2, ..., x_d]\\). \\(b\\) is a constant that is often called
the **bias** of the model. The equation consists of two parts: a linear model and
a logistic function:

* **Linear Model**: First, we can see that \\(\mathbf{w}^T\mathbf{x}+b = b +
  w_1x_1 + ... + w_dx_d\\) is a linear model where the output is a linear
  function of the input features \\(\mathbf{x}\\). The bias \\(b\\) is the
  prediction one would make without observing any features. The model weight
  \\(w_i\\) reflects how the feature \\(x_i\\) is correlated with the positive
  label. If \\(x_i\\) is positively correlated with the positive label, the
  weight \\(w_i\\) increases, and the probability \\(P(Y=1|\mathbf{x})\\) will be
  closer to 1. On the other hand, if \\(x_i\\) is negatively correlated with the
  positive label, then the weight \\(w_i\\) decreases and the probability
  \\(P(Y=1|\mathbf{x})\\) will be closer to 0.
* **Logistic Function**: Second, we can see that there's a logistic function
  (also known as the sigmoid function) \\(S(t) = 1/(1+\exp(-t))\\) being applied
  to the linear model. The logistic function is used to convert the output of
  the linear model \\(\mathbf{w}^T\mathbf{x}+b\\) from any real number into the
  range of \\([0, 1]\\), which can be interpreted as a probability.
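
To make the two pieces above concrete, here is a minimal sketch in plain NumPy
rather than TensorFlow (the weights, bias, and feature values are made-up
numbers, not values from this tutorial) of how the linear model and the
logistic function combine into a probability:

```python
import numpy as np

def predict_proba(x, w, b):
    """P(Y=1 | x): apply the logistic (sigmoid) function to the linear model."""
    linear = np.dot(w, x) + b              # linear model: w^T x + b
    return 1.0 / (1.0 + np.exp(-linear))   # sigmoid: S(t) = 1 / (1 + exp(-t))

# Made-up weights, bias, and feature vector, purely for illustration.
w = np.array([0.8, -1.2, 0.05])
b = -0.3
x = np.array([1.0, 0.0, 2.5])
print(predict_proba(x, w, b))  # a value in [0, 1], interpreted as P(Y=1 | x)
```
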
Model training is an optimization problem: The goal is to find a set of model
weights (i.e. model parameters) to minimize a **loss function** defined over the
......
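
For reference, a loss commonly used for logistic regression (a standard choice,
not necessarily the exact form this tutorial uses) is the logistic loss over the
training data, where \\(y_i \in \{0, 1\}\\) is the label of the \\(i\\)-th of
\\(N\\) training examples (notation introduced here only for this formula):

$$ L(\mathbf{w}, b) = -\sum_{i=1}^{N} \left[ y_i \log P(Y=1|\mathbf{x}_i) + (1-y_i) \log\left(1 - P(Y=1|\mathbf{x}_i)\right) \right] $$
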
@@ -157,8 +157,8 @@ The higher the `dimension` of the embedding is, the more degrees of freedom the
model will have to learn the representations of the features. For simplicity, we
set the dimension to 8 for all feature columns here. Empirically, a more
informed decision for the number of dimensions is to start with a value on the
order of \\(k\log_2(n)\\) or \\(k\sqrt[4]n\\), where \\(n\\) is the number of unique
features in a feature column and \\(k\\) is a small constant (usually smaller than
10).
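
As a rough illustration of that rule of thumb (the function name, the choice
\\(k=4\\), and the example vocabulary size of 1,000 are made up for this sketch):

```python
import math

def suggested_embedding_dims(n_unique_features, k=4):
    """Heuristic starting points for an embedding dimension: k*log2(n) and k*n**0.25."""
    return (int(round(k * math.log2(n_unique_features))),
            int(round(k * n_unique_features ** 0.25)))

# A categorical column with 1,000 unique values and k = 4 (made-up numbers).
print(suggested_embedding_dims(1000))  # -> (40, 22): two candidate starting values
```
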
Through dense embeddings, deep models can generalize better and make predictions
......