1.5 Stochastic Gradient Descent (0.21.3)

987c01b0 · Hanmin Qin · 3731741f · 987c01b0

Showing 1 changed file with 11 additions and 9 deletions

docs/0.21.3/6.md +11 -9

--- a/docs/0.21.3/6.md
+++ b/docs/0.21.3/6.md
@@ -4,6 +4,7 @@
         [@A](https://github.com/apachecn/scikit-learn-doc-zh)
         [@HelloSilicat](https://github.com/HelloSilicat)
        [@Loopy](https://github.com/loopyme)
+        [@qinhanmin2014](https://github.com/qinhanmin2014)
 翻译者:
         [@L](https://github.com/apachecn/scikit-learn-doc-zh)

@@ -25,9 +26,9 @@ Stochastic Gradient Descent （随机梯度下降法）的劣势:

 >**警告**:
 >
->在拟合模型前，确保你重新排列了（打乱）)你的训练数据，或者在每次迭代后用 `shuffle=True` 来打乱。
+>在拟合模型前，确保你重新排列了（打乱）你的训练数据，或者使用 `shuffle=True` 在每次迭代后打乱训练数据。

-[`SGDClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier) 类实现了一个简单的随机梯度下降学习例程, 支持不同的 loss functions（损失函数）和 penalties for classification（分类处罚）。
+[`SGDClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier) 类实现了一个简单的随机梯度下降学习例程, 支持分类问题不同的损失函数和正则化方法。

 [![http://sklearn.apachecn.org/cn/0.19.0/_images/sphx_glr_plot_sgd_separating_hyperplane_0011.png](img/b3206aa7b52a9c0918727730873d1363.jpg)](https://scikit-learn.org/stable/auto_examples/linear_model/plot_sgd_separating_hyperplane.html)

@@ -39,11 +40,12 @@ Stochastic Gradient Descent （随机梯度下降法）的劣势:
 >>> y = [0, 1]
 >>> clf = SGDClassifier(loss="hinge", penalty="l2")
 >>> clf.fit(X, y)
-SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
- eta0=0.0, fit_intercept=True, l1_ratio=0.15,
- learning_rate='optimal', loss='hinge', max_iter=5, n_iter=None,
- n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
- shuffle=True, tol=None, verbose=0, warm_start=False)
+SGDClassifier(alpha=0.0001, average=False, class_weight=None,
+              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
+              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
+              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
+              power_t=0.5, random_state=None, shuffle=True, tol=0.001,
+              validation_fraction=0.1, verbose=0, warm_start=False)

 ```

@@ -88,7 +90,7 @@ array([ 29.6...])
 *   `loss="log"`: logistic regression （logistic 回归），
 *   and all regression losses below（以及所有的回归损失）。

-前两个 loss functions（损失函数）是懒惰的，如果一个例子违反了 margin constraint（边界约束），它们仅更新模型的参数, 这使得训练非常有效率,即使使用了 L2 penalty（惩罚）我们仍然可能得到稀疏的模型结果。
+前两个 loss functions（损失函数）是懒惰的，只有一个例子违反了 margin constraint（边界约束），它们才更新模型的参数, 这使得训练非常有效率,即使使用了 L2 penalty（惩罚）我们仍然可能得到稀疏的模型结果。

 使用 `loss="log"` 或者 `loss="modified_huber"` 来启用 `predict_proba` 方法, 其给出每个样本 ![x](img/5c82dbae35dc43d2f556f9f284d9d184.jpg) 的概率估计 ![P(y|x)](img/3cca81fd08a4732dc7061cd246b323ed.jpg) 的一个向量：

@@ -182,7 +184,7 @@ X_test = scaler.transform(X_test) # apply same transformation to test data
 假如你的 attributes （属性）有一个固有尺度（例如 word frequencies （词频）或 indicator features（指标特征））就不需要缩放。
 *   最好使用 `GridSearchCV` 找到一个合理的 regularization term （正则化项） ![\alpha](img/d8b3d5242d513369a44f8bf0c6112744.jpg) ， 它的范围通常在 `10.0**-np.arange(1,7)` 。       
 *   经验表明，SGD 在处理约 10^6 训练样本后基本收敛。因此，对于迭代次数第一个合理的猜想是 `n_iter = np.ceil(10**6 / n)`，其中 `n` 是训练集的大小。  
-*   假如将 SGD 应用于使用 PCA 做特征提取，我们发现通过某个常数 c 来缩放特征值是明智的，比如使训练数据的 L2 norm 平均值为 1。
+*   假如将 SGD 应用于使用 PCA 提取的特征，我们发现通过某个常数 c 来缩放特征值是明智的，比如使训练数据的 L2 norm 平均值为 1。
 *   我们发现，当特征很多或 eta0 很大时， ASGD（平均随机梯度下降） 效果更好。

 > **参考资料**:
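
Review note: the last hunk keeps two rules of thumb as context, the `10.0**-np.arange(1,7)` search range for alpha and the `np.ceil(10**6 / n)` first guess for the number of passes. A minimal sketch of both in plain Python follows, so the numbers can be checked by hand. The helper names `suggested_epochs` and `alpha_grid` are hypothetical and appear in neither the docs nor scikit-learn. The updated repr in this diff no longer lists `n_iter`, so in 0.21.3 the iteration guess would presumably be passed as `max_iter`.

```python
import math


def suggested_epochs(n_samples, target_updates=10**6):
    """First guess for the number of passes over the data.

    Follows the rule of thumb quoted in the docs: SGD tends to converge
    after seeing roughly 10**6 training samples, so the number of passes
    is ceil(10**6 / n_samples). The helper name is hypothetical.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return math.ceil(target_updates / n_samples)


def alpha_grid(low_exp=1, high_exp=6):
    """Candidate regularization strengths 10**-1 ... 10**-6.

    Plain-Python equivalent of 10.0**-np.arange(1, 7); the list could be
    handed to a grid search as the values to try for alpha.
    """
    return [10.0 ** -k for k in range(low_exp, high_exp + 1)]


if __name__ == "__main__":
    # 50,000 samples -> about 20 passes to reach 10**6 sample updates.
    print(suggested_epochs(50_000))
    print(alpha_grid())
```

Both values are starting points only; the docs recommend tuning alpha with `GridSearchCV` rather than fixing it up front.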