ch15.

aa919479 · wizardforcel · 58d465ac · aa919479

Showing 197 additions and 0 deletions

15.md 15.md +197 -0

--- a/15.md
+++ b/15.md
@@ -921,3 +921,200 @@ evaluate_accuracy(training_set, test_set, 5)
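 As a footnote, you might have noticed that Brittany Wenger did even better. What techniques did she use? One key innovation is that she incorporated a confidence score into her results: her algorithm had a way to determine when it could not make a confident prediction, and for those patients it did not even try to predict a diagnosis. Her algorithm was 99% accurate on the patients for whom it made a prediction, so that extension seemed to help quite a bit.
+## Multiple Regression
+Now that we have explored ways to use multiple attributes to predict a categorical variable, let us return to predicting a quantitative variable. Predicting a numerical quantity is called regression, and a commonly used method for regression with multiple attributes is called multiple linear regression.
+### Home Prices
+The following dataset of house prices and attributes was collected over several years for the city of Ames, Iowa. A description of the dataset appears online. We will focus on only a subset of the columns and try to predict the sale price column from the other columns.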
+```py
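+# Assumed setup (not shown in the original): the datascience library, NumPy,
+# and pyplot under the name `plots` used later. `correlation` comes from an earlier section.
+from datascience import *
+import numpy as np
+import matplotlib.pyplot as plots
+all_sales = Table.read_table('house.csv')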
+sales = all_sales.where('Bldg Type', '1Fam').where('Sale Condition', 'Normal').select(
+    'SalePrice', '1st Flr SF', '2nd Flr SF', 
+    'Total Bsmt SF', 'Garage Area', 
+    'Wood Deck SF', 'Open Porch SF', 'Lot Area', 
+    'Year Built', 'Yr Sold')
+sales.sort('SalePrice')
+```
+| SalePrice | 1st Flr SF | 2nd Flr SF | Total Bsmt SF | Garage Area | Wood Deck SF | Open Porch SF | Lot Area | Year Built | Yr Sold |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| 35000 | 498 | 0 | 498 | 216 | 0 | 0 | 8088 | 1922 | 2006 |
+| 39300 | 334 | 0 | 0 | 0 | 0 | 0 | 5000 | 1946 | 2007 |
+| 40000 | 649 | 668 | 649 | 250 | 0 | 54 | 8500 | 1920 | 2008 |
+| 45000 | 612 | 0 | 0 | 308 | 0 | 0 | 5925 | 1940 | 2009 |
+| 52000 | 729 | 0 | 270 | 0 | 0 | 0 | 4130 | 1935 | 2008 |
+| 52500 | 693 | 0 | 693 | 0 | 0 | 20 | 4118 | 1941 | 2006 |
+| 55000 | 723 | 363 | 723 | 400 | 0 | 24 | 11340 | 1920 | 2008 |
+| 55000 | 796 | 0 | 796 | 0 | 0 | 0 | 3636 | 1922 | 2008 |
+| 57625 | 810 | 0 | 0 | 280 | 119 | 24 | 21780 | 1910 | 2009 |
+| 58500 | 864 | 0 | 864 | 200 | 0 | 0 | 8212 | 1914 | 2010 |
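+... (1992 rows omitted)
+A histogram of sale prices shows a large amount of variability and a distribution that is clearly not normal. A long tail to the right contains a few houses with very high prices. The short left tail does not contain any houses that sold for less than $35,000.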
+```py
+sales.hist('SalePrice', bins=32, unit='$')
+```
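+### Correlation
+No single attribute is sufficient to predict the sale price. For example, the area of the first floor, measured in square feet, correlates with sale price but explains only some of its variability.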
+```py
+sales.scatter('1st Flr SF', 'SalePrice')
+correlation(sales, 'SalePrice', '1st Flr SF')
+0.64246625410302249
+```
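The `correlation` function used here is defined earlier in the book and is not repeated in this section. As a rough reminder of what it computes, here is a minimal NumPy sketch that works on plain arrays rather than a `Table`. The names `standard_units` and `correlation_arrays`, and the sample numbers, are assumptions made for illustration.

```python
import numpy as np

def standard_units(values):
    """Convert an array to standard units: subtract the mean, divide by the SD."""
    values = np.asarray(values, dtype=float)
    return (values - np.mean(values)) / np.std(values)

def correlation_arrays(x, y):
    """Correlation coefficient r: the mean of the products of standard units."""
    return np.mean(standard_units(x) * standard_units(y))

# Made-up numbers for illustration only (not taken from the Ames data).
area = np.array([600, 800, 1000, 1200, 1500])
price = np.array([60000, 90000, 95000, 140000, 150000])
r = correlation_arrays(area, price)
```

The value `r` always lies between -1 and 1, and values near 1 indicate a strong positive linear association.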
+In fact, none of the individual attributes has a correlation with sale price above 0.7 (except for the sale price itself).
+```py
+for label in sales.labels:
+    print('Correlation of', label, 'and SalePrice:\t', correlation(sales, label, 'SalePrice'))
+Correlation of SalePrice and SalePrice:     1.0
+Correlation of 1st Flr SF and SalePrice:     0.642466254103
+Correlation of 2nd Flr SF and SalePrice:     0.35752189428
+Correlation of Total Bsmt SF and SalePrice:     0.652978626757
+Correlation of Garage Area and SalePrice:     0.638594485252
+Correlation of Wood Deck SF and SalePrice:     0.352698666195
+Correlation of Open Porch SF and SalePrice:     0.336909417026
+Correlation of Lot Area and SalePrice:     0.290823455116
+Correlation of Year Built and SalePrice:     0.565164753714
+Correlation of Yr Sold and SalePrice:     0.0259485790807
+```
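+However, combining attributes can provide higher correlation. In particular, if we sum the first-floor and second-floor areas, the result has a higher correlation with sale price than any single attribute alone.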
+```py
+both_floors = sales.column(1) + sales.column(2)
+correlation(sales.with_column('Both Floors', both_floors), 'SalePrice', 'Both Floors')
+0.7821920556134877
+```
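+This high correlation indicates that we should try to use more than one attribute to predict the sale price. In a dataset with multiple observed attributes and a single numerical value to be predicted (the sale price in this case), multiple linear regression can be an effective technique.
+## Multiple Linear Regression
+In multiple linear regression, a numerical output is predicted from numerical input attributes by multiplying each attribute value by a different slope, then summing the results. In this example, the slope for `1st Flr SF` represents the dollars per square foot of first-floor area that should be used in our prediction.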
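In symbols, and assuming (as the description above suggests) that there is no separate intercept term, the prediction for a house with attribute values $x_1, \dots, x_k$ and slopes $s_1, \dots, s_k$ can be written as follows. The symbols $x_j$, $s_j$, and $k$ are notation introduced here for illustration.

$$
\widehat{\text{price}} = s_1 x_1 + s_2 x_2 + \cdots + s_k x_k = \sum_{j=1}^{k} s_j x_j
$$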
+Before we begin prediction, we split our data randomly into a training set and a test set of equal size.
+```py
+train, test = sales.split(1001)
+print(train.num_rows, 'training and', test.num_rows, 'test instances.')
+1001 training and 1001 test instances.
+```
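As a rough picture of what a random split does, the sketch below shuffles row indices with NumPy and cuts them at a chosen size. It is only an illustration under stated assumptions: `split_indices` is a hypothetical helper, not the `Table.split` method used above, and the seed is fixed only so that the sketch is repeatable.

```python
import numpy as np

def split_indices(num_rows, first_size, seed=0):
    """Randomly partition range(num_rows) into two disjoint index arrays.

    Hypothetical helper for illustration; not part of the datascience library.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(num_rows)   # every row index exactly once, in random order
    return shuffled[:first_size], shuffled[first_size:]

# 2002 rows in total, matching the 1001 + 1001 split shown above.
train_idx, test_idx = split_indices(2002, 1001)
```

Because every index appears in exactly one of the two arrays, no house is used for both fitting and evaluation.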
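+The slopes in multiple regression form an array with one slope value for each attribute in an example. Predicting the sale price involves multiplying each attribute by its slope and summing the results.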
+```py
+def predict(slopes, row):
+    return sum(slopes * np.array(row))
+example_row = test.drop('SalePrice').row(0)
+print('Predicting sale price for:', example_row)
+example_slopes = np.random.normal(10, 1, len(example_row))
+print('Using slopes:', example_slopes)
+print('Result:', predict(example_slopes, example_row))
+Predicting sale price for: Row(1st Flr SF=1092, 2nd Flr SF=1020, Total Bsmt SF=952.0, Garage Area=576.0, Wood Deck SF=280, Open Porch SF=0, Lot Area=11075, Year Built=1969, Yr Sold=2008)
+Using slopes: [  9.99777721   9.019661    11.13178317   9.40645585  11.07998556
+  11.03830075  10.26908341  10.42534332  11.00103437]
+Result: 195583.275784
+```
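+The result is an estimated sale price, which can be compared to the actual sale price to assess whether the slopes provide accurate predictions. Since the `example_slopes` above were chosen at random, we should not expect them to provide accurate predictions.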
+```py
+print('Actual sale price:', test.column('SalePrice').item(0))
+print('Predicted sale price using random slopes:', predict(example_slopes, example_row))
+Actual sale price: 206900
+Predicted sale price using random slopes: 195583.275784
+```
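+### Least Squares Regression
+The next step in performing multiple regression is to define the least squares objective. We make a prediction for each row in the training set, and then compute the root mean squared error (RMSE) of the predictions against the actual prices.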
+```py
+train_prices = train.column(0)
+train_attributes = train.drop(0)
+def rmse(slopes, attributes, prices):
+    errors = []
+    for i in np.arange(len(prices)):
+        predicted = predict(slopes, attributes.row(i))
+        actual = prices.item(i)
+        errors.append((predicted - actual) ** 2)
+    return np.mean(errors) ** 0.5
+def rmse_train(slopes):
+    return rmse(slopes, train_attributes, train_prices)
+print('RMSE of all training examples using random slopes:', rmse_train(example_slopes))
+RMSE of all training examples using random slopes: 69653.9880638
+```
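The loop in `rmse` makes one prediction per row, which is easy to read but slow when the function is called many times. A hedged alternative, assuming the attributes are first copied into a 2-D NumPy array with one row per house, is to compute every prediction with a single matrix product. The names `rmse_vectorized`, `X`, and `y`, and the tiny made-up dataset, are illustrative and do not come from the original.

```python
import numpy as np

def rmse_vectorized(slopes, X, y):
    """Root mean squared error of the predictions X @ slopes against y.

    X: 2-D array, one row per example and one column per attribute.
    y: 1-D array of actual values.
    """
    predicted = X @ slopes                 # one prediction per row
    return np.sqrt(np.mean((predicted - y) ** 2))

# Made-up data for illustration: three houses, two attributes each.
X = np.array([[1000.0, 0.0],
              [800.0, 600.0],
              [1200.0, 400.0]])
y = np.array([100000.0, 130000.0, 150000.0])
slopes = np.array([100.0, 50.0])

error = rmse_vectorized(slopes, X, y)
```

This computes the same quantity as the row-by-row loop; only the way the predictions are produced changes.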
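+Finally, we use the `minimize` function to find the slopes with the lowest RMSE. Since the function we want to minimize, `rmse_train`, takes an array instead of a number, we must pass the `array=True` argument to `minimize`. When this argument is used, `minimize` also requires an initial guess of the slopes so that it knows the dimension of the input array. To speed up optimization, we also pass `smooth=True`, which indicates that `rmse_train` is a smooth function. Computing the best slopes may take several minutes.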
+```py
+best_slopes = minimize(rmse_train, start=example_slopes, smooth=True, array=True)
+print('The best slopes for the training set:')
+Table(train_attributes.labels).with_row(list(best_slopes)).show()
+print('RMSE of all training examples using the best slopes:', rmse_train(best_slopes))
+The best slopes for the training set:
+```
+| 1st Flr SF | 2nd Flr SF | Total Bsmt SF | Garage Area | Wood Deck SF | Open Porch SF | Lot Area | Year Built | Yr Sold |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| 73.7779 | 72.3057 | 51.8885 | 46.5581 | 39.3267 | 11.996 | 0.451265 | 538.243 | -534.634 |
+```py
+RMSE of all training examples using the best slopes: 31146.4442711
+```
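Minimizing RMSE gives the same slopes as minimizing the sum of squared errors, which is the classical least squares problem, so a linear algebra routine can solve it directly instead of by iterative search. The sketch below swaps in NumPy's `np.linalg.lstsq` for `minimize`. This is a different technique from the one used above, shown on made-up data, and the names `X`, `y`, `true_slopes`, and `rmse_of` are assumptions for illustration. Like `predict`, it fits no intercept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 200 examples with 3 attributes, built from known slopes plus noise.
true_slopes = np.array([70.0, 50.0, 0.5])
X = rng.uniform(0, 2000, size=(200, 3))
y = X @ true_slopes + rng.normal(0, 1000, size=200)

# Least squares solution: the slopes minimizing the sum of squared errors
# (and therefore the RMSE) of X @ slopes against y.
fitted_slopes, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

def rmse_of(slopes):
    """RMSE of the predictions X @ slopes against y on the made-up data."""
    return np.sqrt(np.mean((X @ slopes - y) ** 2))
```

On this synthetic data, `rmse_of(fitted_slopes)` is no larger than the RMSE of any other slope array, including `true_slopes`, because the noise shifts the best-fitting slopes slightly away from the ones used to generate the data.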
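+### Interpreting Multiple Regression
+Let us interpret these results. The best slopes give us a method for estimating the price of a house from its attributes. A square foot of area on the first floor is worth about $75 (the first slope), while a square foot on the second floor is worth about $70 (the second slope). The final negative value describes the market: prices in later years were lower on average.
+The RMSE of around $30,000 means that our best linear prediction of the sale price based on all of the attributes is off by about $30,000 on the training set, on average. We find a similar error when predicting prices on the test set, which indicates that our prediction method will generalize to other samples from the same population.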
+```py
+test_prices = test.column(0)
+test_attributes = test.drop(0)
+def rmse_test(slopes):
+    return rmse(slopes, test_attributes, test_prices)
+rmse_linear = rmse_test(best_slopes)
+print('Test set RMSE for multiple linear regression:', rmse_linear)
+Test set RMSE for multiple linear regression: 31105.4799398
+```
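+If the predictions were perfect, then a scatter plot of the predicted and actual values would be a straight line with slope 1. We see that most points fall near that line, but there is some error in the predictions.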
+```py
+def fit(row):
+    return sum(best_slopes * np.array(row))
+test.with_column('Fitted', test.drop(0).apply(fit)).scatter('Fitted', 0)
+plots.plot([0, 5e5], [0, 5e5]);
+```
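+A residual plot for multiple regression typically compares the errors (residuals) to the actual values of the predicted variable. We see in the residual plot below that we have systematically underestimated the value of expensive houses, shown by the many positive residual values on the right side of the graph.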
+```py
+test.with_column('Residual', test_prices-test.drop(0).apply(fit)).scatter(0, 'Residual')
+plots.plot([0, 7e5], [0, 0]);
+```
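The sign convention matters when reading this plot. In the code above a residual is the actual price minus the fitted price, so a positive residual means the prediction was too low. A tiny sketch with made-up numbers (not taken from the Ames data) shows the convention:

```python
import numpy as np

# Made-up actual and fitted prices for three houses.
actual = np.array([150000, 300000, 600000])
fitted = np.array([160000, 290000, 500000])

residuals = actual - fitted            # same convention as the plot: actual minus fitted
underestimated = residuals > 0         # True where the prediction was too low
```

Here the most expensive house has the largest positive residual, which is the pattern described for the right side of the plot.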
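+Just as with simple linear regression, interpreting the result of a predictor is at least as important as making predictions. There are many lessons about interpreting multiple regression that are not included in this textbook. A natural next step after completing this text would be to study linear modeling and regression in further depth.
+## Nearest Neighbors for Regression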