ch11.

5d8ecf48 · wizardforcel · 77190615 · 5d8ecf48

Showing 157 additions and 2 deletions

11.md 11.md +157 -2

--- a/11.md
+++ b/11.md
@@ -577,7 +577,7 @@ ratios.sort('Ratio BW/GD', descending=True).take(0)
 | --- | --- |
 | 116 | 148 | 0.783784 |

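-The median gives a sense of the typical ratio because it is unaffected by very large or very small ratios. The median of the sample (of proportions) is about 0.429.
+The median gives a sense of the typical ratio because it is unaffected by very large or very small ratios. The median of the sample (of ratios) is about 0.429.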

 ```py
 np.median(ratios.column(2))
@@ -588,7 +588,7 @@ np.median(ratios.column(2))

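 Our method will be exactly the same as in the previous section. We will bootstrap the sample 5,000 times, which gives 5,000 estimates of the median. Our 95% confidence interval will be the "middle 95%" of all of our estimates.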

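-Recall the function `bootstrap_median` defined in the previous section. We will call this function and construct a 95% confidence interval for the population median (of proportions). Remember that the table `ratios` contains the relevant data from our original sample.
+Recall the function `bootstrap_median` defined in the previous section. We will call this function and construct a 95% confidence interval for the population median (of ratios). Remember that the table `ratios` contains the relevant data from our original sample.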

 ```py
 def bootstrap_median(original_sample, label, replications):
@@ -639,3 +639,158 @@ plots.plot(make_array(left, right), make_array(0, 0), color='yellow', lw=8);

 ### Confidence Interval for a Population Mean: Bootstrap Percentile Method

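+What we have done for medians can be done for means as well. Suppose we want to estimate the average age of the mothers in the population. A natural estimate is the average age of the mothers in the sample. Here is the distribution of their ages; their average age is about 27.2 years.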
+
+```py
+baby.select('Maternal Age').hist()
+```
+
+```py
+np.mean(baby.column('Maternal Age'))
+27.228279386712096
+```
+
+What is the average age of the mothers in the population? We don't know the value of this parameter.
+
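+Let's estimate the unknown parameter by the bootstrap method. To do this, we will edit the code for `bootstrap_median` to define the function `bootstrap_mean` instead. The code is the same, except that the statistic is the mean instead of the median, and the results are collected in an array called `means` instead of `medians`.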
+
+```py
+def bootstrap_mean(original_sample, label, replications):
+
+    """Returns an array of bootstrapped sample means:
+    original_sample: table containing the original sample
+    label: label of column containing the variable
+    replications: number of bootstrap samples
+    """
+
+    just_one_column = original_sample.select(label)
+    means = make_array()
+    for i in np.arange(replications):
+        bootstrap_sample = just_one_column.sample()
+        resampled_mean = np.mean(bootstrap_sample.column(0))
+        means = np.append(means, resampled_mean)
+
+    return means
+
+# Generate the means from 5000 bootstrap samples
+bstrap_means = bootstrap_mean(baby, 'Maternal Age', 5000)
+
+# Get the endpoints of the 95% confidence interval
+left = percentile(2.5, bstrap_means)
+right = percentile(97.5, bstrap_means)
+
+make_array(left, right)
+array([ 26.89778535,  27.55962521])
+```
+
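+The 95% confidence interval goes from about 26.9 years to about 27.6 years. That is, we estimate that the average age of the mothers in the population is somewhere between 26.9 years and 27.6 years.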
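The percentile method above can also be sketched with NumPy alone. This is a minimal illustration, not the chapter's code: the names `bootstrap_statistic` and `percentile_interval` are hypothetical, the `ages` array is synthetic stand-in data (the `baby` table is not available here), and `np.percentile` interpolates between values, so its endpoints may differ slightly from those of the `percentile` function used in the text.

```python
import numpy as np

def bootstrap_statistic(values, statistic, replications, rng):
    """Return an array holding `statistic` computed on each bootstrap resample."""
    values = np.asarray(values)
    n = len(values)
    results = np.empty(replications)
    for i in range(replications):
        # Resample with replacement, same size as the original sample
        resample = rng.choice(values, size=n, replace=True)
        results[i] = statistic(resample)
    return results

def percentile_interval(estimates, level):
    """Endpoints of the middle `level` percent of the estimates."""
    tail = (100 - level) / 2
    left, right = np.percentile(estimates, [tail, 100 - tail])
    return left, right

rng = np.random.default_rng(0)
# Hypothetical ages: synthetic integers, NOT the real maternal ages
ages = rng.integers(15, 46, size=1174)

boot_means = bootstrap_statistic(ages, np.mean, 2000, rng)
left, right = percentile_interval(boot_means, 95)
print("sample mean:", ages.mean())
print("95% interval:", left, right)
```

Passing the statistic as an argument means the same sketch covers the median (`np.median`) without editing the function body.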
+
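+Notice how close the two ends are to the average of about 27.2 years in the original sample. The sample size is very large (1,174 mothers), so the sample averages don't vary much. We will explore this observation further in the next chapter.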
+
+The empirical histogram of the 5,000 bootstrapped means is shown below, along with the 95% confidence interval for the population mean.
+
+```py
+resampled_means = Table().with_column(
+    'Bootstrap Sample Mean', bstrap_means
+)
+resampled_means.hist(bins=15)
+plots.plot(make_array(left, right), make_array(0, 0), color='yellow', lw=8);
+```
+
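+Once again, the average of the original sample (27.23 years) is close to the center of the interval. That is not surprising, because each bootstrapped sample is drawn from that same original sample. The averages of the bootstrapped samples are distributed roughly symmetrically on either side of the average of the sample from which they were drawn.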
+
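+Notice also that the empirical histogram of the resampled means has a roughly symmetric bell shape, even though the histogram of the sampled ages is not symmetric at all: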
+
+```py
+baby.select('Maternal Age').hist()
+```
+
+This is a consequence of the Central Limit Theorem of probability and statistics. In later sections, we will see what the theorem says.
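One way to see this effect numerically, rather than from histograms, is to compare a skewness measure for the data and for the resampled means. The sketch below is illustrative only: the `skewness` helper is hypothetical, and the skewed sample is synthetic exponential data rather than the maternal ages.

```python
import numpy as np

def skewness(x):
    """Mean cubed z-score; close to 0 for a symmetric distribution."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

rng = np.random.default_rng(1)
# Hypothetical, strongly right-skewed sample (NOT the data from the text)
sample = rng.exponential(scale=5.0, size=1000)

# 2000 bootstrap resamples at once: one row per resample, then row means
resamples = rng.choice(sample, size=(2000, len(sample)), replace=True)
boot_means = resamples.mean(axis=1)

sample_skew = skewness(sample)
boot_skew = skewness(boot_means)
print("skewness of the sample:", sample_skew)
print("skewness of the bootstrap means:", boot_skew)
```

The sample itself is far from symmetric, while the resampled means have skewness near zero, which matches the bell shape described above.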
+
+### An 80% Confidence Interval
+
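+You can use the bootstrap to construct a confidence interval at any level. For example, to construct an 80% confidence interval for the mean age in the population, you would take the "middle 80%" of the resampled means. You would therefore want 10% of the distribution in each of the two tails, so the endpoints are the 10th and 90th percentiles of the resampled means.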
+
+```py
+left_80 = percentile(10, bstrap_means)
+right_80 = percentile(90, bstrap_means)
+make_array(left_80, right_80)
+array([ 27.01192504,  27.439523  ])
+resampled_means.hist(bins=15)
+plots.plot(make_array(left_80, right_80), make_array(0, 0), color='yellow', lw=8);
+```
+
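+This 80% confidence interval is much shorter than the 95% confidence interval. It only goes from about 27.0 years to about 27.4 years. While that is a narrower range of estimates, you know that this process produces a good interval only about 80% of the time.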
+
+The earlier process produced a wider interval, but we had more confidence in the process that generated it.
+
+To get a narrow confidence interval at a high level of confidence, you have to start with a larger sample. We will see why in the next chapter.
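The trade-off between confidence level, interval width, and sample size can be checked with a small NumPy sketch. Everything here is an assumption made for illustration: the helper names are hypothetical, and the population is synthetic normal data rather than the mothers in the text.

```python
import numpy as np

def bootstrap_means(values, replications, rng):
    """Means of `replications` bootstrap resamples, computed in one array operation."""
    values = np.asarray(values)
    resamples = rng.choice(values, size=(replications, len(values)), replace=True)
    return resamples.mean(axis=1)

def percentile_interval(estimates, level):
    """Endpoints of the middle `level` percent of the estimates."""
    tail = (100 - level) / 2
    left, right = np.percentile(estimates, [tail, 100 - tail])
    return left, right

rng = np.random.default_rng(2)
# Hypothetical population and two samples of different sizes
population = rng.normal(27, 6, size=100_000)
small = rng.choice(population, size=300, replace=False)
large = rng.choice(population, size=1200, replace=False)

means_small = bootstrap_means(small, 2000, rng)
means_large = bootstrap_means(large, 2000, rng)

# Same bootstrap distribution, two confidence levels
left_95, right_95 = percentile_interval(means_large, 95)
left_80, right_80 = percentile_interval(means_large, 80)

# Same confidence level, two sample sizes
small_left, small_right = percentile_interval(means_small, 95)
width_small_95 = small_right - small_left
width_large_95 = right_95 - left_95

print("80% interval:", left_80, right_80)
print("95% interval:", left_95, right_95)
print("95% widths (n=300, n=1200):", width_small_95, width_large_95)
```

The 80% interval sits inside the 95% interval built from the same resampled means, and the larger sample gives the narrower 95% interval.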
+
+### Confidence Interval for a Population Proportion: Bootstrap Percentile Method
+
+In the sample, 39% of the mothers smoked during pregnancy.
+
+```py
+baby.where('Maternal Smoker', are.equal_to(True)).num_rows/baby.num_rows
+0.3909710391822828
+```
+
+For what follows, it is useful to observe that this proportion can also be calculated by an array operation:
+
+```py
+smoking = baby.column('Maternal Smoker')
+np.count_nonzero(smoking)/len(smoking)
+0.3909710391822828
+```
+
+> Translator's note:
+
+> `np.count_nonzero(arr)` is equivalent to `np.sum(arr != 0)`.
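The equivalence in the note can be checked on a tiny Boolean array. The array below is a hypothetical five-element example, not data from the text; the third line adds the observation that the mean of a Boolean array gives the same proportion, because `True` counts as 1 and `False` as 0.

```python
import numpy as np

# Hypothetical Boolean column: 2 of the 5 entries are True
smoker = np.array([True, False, False, True, False])

by_count_nonzero = np.count_nonzero(smoker) / len(smoker)
by_sum = np.sum(smoker != 0) / len(smoker)
by_mean = np.mean(smoker)   # True is treated as 1, False as 0

print(by_count_nonzero, by_sum, by_mean)
```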
+
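+What percent of mothers in the population smoked during pregnancy? This is an unknown parameter, which we can estimate with a bootstrap confidence interval. The steps in the process are analogous to those we took to estimate the population mean and median.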
+
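+We will start by defining a function `bootstrap_proportion` that returns an array of bootstrapped sample proportions. Once again, we will do this by editing our definition of `bootstrap_median`. The only change in the computation is that the median of the resample is replaced by the proportion of smokers in it. The code assumes that the column of data consists of Boolean values. The other changes are only to the names of arrays, to help us read and understand our code.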
+
+```py
+def bootstrap_proportion(original_sample, label, replications):
+
+    """Returns an array of bootstrapped sample proportions:
+    original_sample: table containing the original sample
+    label: label of column containing the Boolean variable
+    replications: number of bootstrap samples
+    """
+
+    just_one_column = original_sample.select(label)
+    proportions = make_array()
+    for i in np.arange(replications):
+        bootstrap_sample = just_one_column.sample()
+        resample_array = bootstrap_sample.column(0)
+        resampled_proportion = np.count_nonzero(resample_array)/len(resample_array)
+        proportions = np.append(proportions, resampled_proportion)
+
+    return proportions
+```
+
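+Let us use `bootstrap_proportion` to construct a 95% confidence interval for the percent of smokers among the mothers in the population. The code is analogous to the corresponding code for the mean and median.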
+
+```py
+# Generate the proportions from 5000 bootstrap samples
+bstrap_props = bootstrap_proportion(baby, 'Maternal Smoker', 5000)
+
+# Get the endpoints of the 95% confidence interval
+left = percentile(2.5, bstrap_props)
+right = percentile(97.5, bstrap_props)
+
+make_array(left, right)
+array([ 0.36286201,  0.41908007])
+```
+
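+The confidence interval goes from about 36% to about 42%. The original sample percent of 39% is very close to the center of the interval, as you can see below: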
+
+```py
+resampled_proportions = Table().with_column(
+    'Bootstrap Sample Proportion', bstrap_props
+)
+resampled_proportions.hist(bins=15)
+plots.plot(make_array(left, right), make_array(0, 0), color='yellow', lw=8);
+```
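For a Boolean column, a bootstrap resample depends only on how many values are `True` and how many are `False`, so the interval above can be approximated without the `baby` table. The sketch below rests on an inference, not on a figure stated in the text: a sample proportion of 0.3909710391822828 among 1,174 mothers corresponds to 459 smokers. It also uses `np.percentile`, which interpolates, so the endpoints may differ slightly from those produced by `percentile` in the text, and results vary with the random seed.

```python
import numpy as np

n = 1174                    # sample size given in the text
smokers = 459               # ASSUMPTION: inferred from 0.3909710391822828 * 1174
smoker = np.array([True] * smokers + [False] * (n - smokers))

rng = np.random.default_rng(3)
replications = 5000

# One row per bootstrap resample; the row mean is the resampled proportion
resamples = rng.choice(smoker, size=(replications, n), replace=True)
boot_props = resamples.mean(axis=1)

left, right = np.percentile(boot_props, [2.5, 97.5])
sample_prop = np.count_nonzero(smoker) / len(smoker)
print("sample proportion:", sample_prop)
print("95% interval:", left, right)
```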
+