3.3

5ee92412 · wizardforcel · f4c826a7 · 5ee92412

Showing 43 additions and 112 deletions

docs/3.3_data.md +43 -112

--- a/docs/3.3_data.md
+++ b/docs/3.3_data.md
-
 # Manipulating and visualizing data

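 Now that we have learned the basics of [loading files](files.ipynb), it is time to reorganize the loaded data into the commonly used data structures from [NumPy](http://www.numpy.org/) and [Pandas](http://pandas.pydata.org/). To motivate the various data structures, we will feed them to [matplotlib](https://matplotlib.org/) for visualization. This lab therefore covers steps 2, 3, and 5 of our generic data science program template: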
@@ -538,8 +537,7 @@ Name: Customer Name, dtype: object
 '''
 ```

-Accessing rows via `sales[0]` then doesn't work because Pandas wants to use array indexing notation for getting columns. Instead, we have to use slightly more awkward notation:
-
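+Accessing rows via `sales[0]` doesn't work because Pandas reserves array-indexing notation for selecting columns. Instead, we have to use the slightly more awkward `iloc` notation: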

 ```python
 sales.iloc[0]  # get first row of data
@@ -556,14 +554,7 @@ Name: 0, dtype: object
 '''
 ```

-
-
-
-
-
-
-To get individual elements, we can use regular list of lists Python notation after the `loc`:
-
+To get individual elements, we can use regular Python list-of-lists notation after the `iloc`:

 ```python
 print sales.iloc[0][0], sales.iloc[0][1], sales.iloc[0][2]
@@ -571,8 +562,7 @@ print sales.iloc[0][0], sales.iloc[0][1], sales.iloc[0][2]
 # 10/13/10 6 38.94
 ```
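
As a minimal sketch of the row and element access described above, the snippet below builds a tiny stand-in for `sales` (the column names and values are made up for illustration, not taken from the real data file). It also shows the single-call form `iloc[row, column]`, which avoids the chained `iloc[0][1]` lookup that newer pandas releases may warn about.

```python
import pandas as pd

# Hypothetical stand-in for the `sales` frame; names and values are illustrative only
sales = pd.DataFrame({
    "Date": ["10/13/10", "10/1/12"],
    "Quantity": [6, 49],
    "Sales": [38.94, 208.16],
})

first_row = sales.iloc[0]               # Series holding row 0, indexed by column name
quantity = sales.iloc[0, 1]             # row 0, column 1 in one positional lookup
same_quantity = sales.iloc[0].iloc[1]   # chained positional form of the same lookup

print(first_row["Date"], quantity, same_quantity)
```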

-During construction and debugging of software, I often like the explicit printing of the column names as is the default shown above. On the other hand, if we need the elements as a plain old Python list, we can do that with `list()`:
-
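+During construction and debugging of software, I often like to have the column names printed explicitly, as in the default output shown above. On the other hand, if we need the elements as a plain Python list, we can get that with `list()`: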

 ```python
 row = list(sales.iloc[0])
@@ -581,10 +571,9 @@ print row
 # ['10/13/10', 6, 38.939999999999998, 35.0, 'Muhammed MacIntyre', 'Office Supplies', 'Eldon Base for stackable storage shelf, platinum']
 ```

-**Exercise**: Convert all rows of `sales` to a list of lists. Hint: use the map pattern and `list()`.
-
-The task of that exercise is common enough that Pandas provides a conversion mechanism directly:
+**Exercise**: Convert all rows of `sales` to a list of lists. Hint: use the map pattern and `list()`.

+The task in that exercise is common enough that Pandas provides a conversion mechanism directly:

 ```python
 m = sales.as_matrix()
@@ -598,9 +587,7 @@ Type is <type 'numpy.ndarray'>
 '''
 ```

-
-We can still get the columns individually using the wildcard notation we saw before:
-
+We can still get the columns individually using the slice notation we saw before:

 ```python
 m[:,0] # get first column
@@ -611,13 +598,6 @@ array(['10/13/10', '10/1/12', '10/1/12', ..., '11/8/10', '10/21/12',
 '''
 ```

-
-
-
-
-
-
-
 ```python
 m[:,4] # get fifth column

@@ -628,17 +608,11 @@ array(['Muhammed MacIntyre', 'Barry French', 'Barry French', ...,
 ```
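
Note that `as_matrix()` has been removed from newer pandas releases; `to_numpy()` is the replacement. The sketch below repeats the conversion and column slicing on a small made-up frame (the data is an assumption for illustration only).

```python
import pandas as pd

# Hypothetical stand-in for the `sales` frame; names and values are illustrative only
sales = pd.DataFrame({
    "Date": ["10/13/10", "10/1/12"],
    "Quantity": [6, 49],
    "Customer Name": ["Muhammed MacIntyre", "Barry French"],
})

m = sales.to_numpy()      # modern replacement for as_matrix()
print(type(m), m.shape)

first_col = m[:, 0]       # every row, column 0
third_col = m[:, 2]       # every row, column 2
print(first_col, third_col)
```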


+### Pulling data frames apart

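+For machine learning, we often want to separate out one of the columns as the dependent variable, keeping the others as a group of independent variables. The notation we typically use is `X -> Y`, meaning that the set of observations in `X` predicts or classifies the results in `Y`.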

-
-
-
-### Pulling data frames apart
-
-For machine learning, we often want to separate out one of the columns as the dependent variable, keeping the others as a group of independent variables.  Notation we typically use is X -> Y, meaning the set of observations in X predict or classify results in Y. 
-
-For example, let's say we wanted to predict engine size given the efficiency, number of cylinders, and overall car weight. We need to separate out the engine size as Y and combining the other columns into X. Using Pandas, we can easily separate the variables and keep the variables names:
-
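+For example, let's say we want to predict engine size given the efficiency, number of cylinders, and overall car weight. We need to separate out the engine size as `Y` and combine the other columns into `X`. Using Pandas, we can easily separate the variables and keep the variable names: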

 ```python
 cars = pandas.read_csv('data/cars.csv')
@@ -671,9 +645,7 @@ Name: ENG, dtype: float64
 '''
 ```

-
-Converting to a NumPy array strips away the column names but let us treat it as a matrix, which is handy in a lot of cases (e.g., matrix addition). Separating columns from NumPy arrays is a bit more cumbersome, However:
-
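+Converting to a NumPy array strips away the column names but lets us treat the data as a matrix, which is handy in a lot of cases (e.g., matrix addition). Separating columns from NumPy arrays is a bit more cumbersome, however: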

 ```python
 m = cars.as_matrix()
@@ -696,9 +668,7 @@ print Y
 '''
 ```
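
To see both ways of splitting side by side, here is a rough sketch on a miniature, made-up version of the cars table. Only the `ENG` column name appears in the output above; the other column names (`MPG`, `CYL`, `WGT`) and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature of data/cars.csv; only ENG is a name confirmed by the text
cars = pd.DataFrame({
    "MPG": [18.0, 15.0, 16.0],
    "CYL": [8, 8, 6],
    "ENG": [307.0, 350.0, 250.0],
    "WGT": [3504, 3693, 3436],
})

# Pandas: the column names survive the split
X = cars.drop(columns="ENG")   # independent variables
y = cars["ENG"]                # dependent variable

# NumPy: names are gone, so we pick columns by position instead
m = cars.to_numpy()
X_m = m[:, [0, 1, 3]]          # columns 0, 1, and 3
y_m = m[:, 2]                  # column 2 (engine size)

print(list(X.columns), y.name, X_m.shape, y_m)
```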

-
-While NumPy arrays are more cumbersome when pulling apart tables, accessing the elements without `loc` is usually more convenient:
-
+While NumPy arrays are more cumbersome when pulling apart tables, accessing elements without `iloc` is usually more convenient:

 ```python
 print cars.iloc[0][1]
@@ -711,11 +681,11 @@ print m[0,1]
 ```


-## Mixed, missing data
+## Mixed and missing data

-Using tips from [Jeremy Howard](https://www.usfca.edu/data-institute/about-us/researchers) here on real-world data clean up.
+This section uses tips from [Jeremy Howard](https://www.usfca.edu/data-institute/about-us/researchers) on real-world data cleanup.

-### Load and parse dates
+### Load and parse dates


 ```python
@@ -734,9 +704,7 @@ df
 | 5 | 2017-06-24 | get off my lawn | 99.0 | 8.90 | Sci |


-
-If you ever need to convert dates to the elapsed time, you can convert the date timestamp to UNIX time, the number of second since 1970:
-
+If you ever need to convert dates to elapsed time, you can convert the date timestamp to UNIX time, the number of seconds since the start of 1970:

 ```python
 d = df['Date']
@@ -754,20 +722,12 @@ Name: Date, dtype: timedelta64[ns]
 '''
 ```

-
-
-
-
-
-
-
 ```python
 df3 = df.copy()
 df3['Date'] = delta.dt.total_seconds()
 df3
 ```

-
 |  | Date | Description | Size | Price | Topic | Size_na |
 | --- | --- | --- | --- | --- | --- | --- |
 | 0 | 1.497917e+09 | NaN | 92.0 | 1.50 | 2 | False |
@@ -777,8 +737,7 @@ df3
 | 4 | 1.498176e+09 | foo&bar | 1.0 | 10.00 | 1 | False |
 | 5 | 1.498262e+09 | get off my lawn | 99.0 | 8.90 | 4 | False |

-Or, you can convert the timestamp into the number of days since 1970:
-
+Or, you can convert the timestamp into the number of days since 1970:

 ```python
 delta.dt.days
@@ -794,16 +753,9 @@ Name: Date, dtype: int64
 '''
 ```
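
The two conversions can be reproduced end to end with a short sketch. It assumes the elapsed time is computed by subtracting the epoch timestamp `1970-01-01` from the parsed dates; the two dates are the first and last ones from the table above.

```python
import pandas as pd

# Two of the dates from the example table, parsed into datetime values
dates = pd.to_datetime(pd.Series(["2017-06-20", "2017-06-24"], name="Date"))

# Subtracting the epoch gives a Series of time differences
delta = dates - pd.Timestamp("1970-01-01")

seconds = delta.dt.total_seconds()   # elapsed seconds since 1970 (float)
days = delta.dt.days                 # elapsed whole days since 1970 (int)

print(seconds.tolist(), days.tolist())
```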

+### String to categorical variable

-
-
-
-
-
-### String to categorical variable
-
-Here is how we convert a column to a categorical variable:
-
+Here is how we convert a column to a categorical variable:

 ```python
 df['Topic'] = df['Topic'].astype('category')
@@ -825,8 +777,6 @@ print df['Topic'].cat.categories  # .cat field gives us access to categories stu
 # Index([u'News', u'Politics', u'Sci', u'Sports'], dtype='object')
 ```

-
-
 ```python
 print df['Topic'].cat.codes

@@ -858,10 +808,9 @@ Categories (4, object): [News < Politics < Sci < Sports]
 ```
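
As a small self-contained sketch of the conversion, the snippet below uses a made-up `Topic` series with one missing entry (the values and their order are assumptions, not the actual data). It shows that the categories come out sorted and that a missing value receives the code `-1`.

```python
import pandas as pd

# Hypothetical Topic column with one missing entry
topic = pd.Series(["Politics", "News", None, "Sports", "News", "Sci"], name="Topic")

cat = topic.astype("category")
categories = list(cat.cat.categories)   # distinct labels, in sorted order
codes = cat.cat.codes                   # integer code per row; missing values get -1

print(categories)
print(codes.tolist())
```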


-### String to ordinal
-
-We can convert that category to an integer if we like:
+### String to ordinal

+We can convert that category to an integer if we like:

 ```python
 # make sure you convert to categorical first
@@ -879,15 +828,14 @@ df
 | 4 | 2017-06-23 | foo&bar | 1.0 | 10.00 | 1 | False |
 | 5 | 2017-06-24 | get off my lawn | 99.0 | 8.90 | 4 | False |

-### Missing data
-
-Our data has a missing description, which we can ignore, but also has a missing size (numeric) and topic (categorical) entry. 
+### Missing data

-* If the element is numeric, we replace the missing value with the column median and add a column to indicate 0 or 1 as to whether the value is missing.
-* If the element is categorical, Pandas can handle the missing value automatically when we use parameter `dummy_na=True` on `get_dummies()` (see next section).
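+Our data has a missing description, which we can ignore, but it also has a missing size (numeric) entry and a missing topic (categorical) entry.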

-Let's convert the missing numeric data:
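+* If the element is numeric, we replace the missing value with the column median and add a column that indicates, with 0 or 1, whether the value was missing.
+* If the element is categorical, Pandas can handle the missing value automatically when we pass `dummy_na=True` to `get_dummies()` (see the next section).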

+Let's convert the missing numeric data:

 ```python
 pandas.isnull(df['Size'])
@@ -903,13 +851,6 @@ Name: Size, dtype: bool
 '''
 ```

-
-
-
-
-
-
-
 ```python
 df['Size_na'] = pandas.isnull(df['Size'])
 df
@@ -930,7 +871,6 @@ df['Size'] = szcol.fillna(szcol.median())
 df
 ```

-
 |  | Date | Description | Size | Price | Topic | Size_na |
 | --- | --- | --- | --- | --- | --- | --- |
 | 0 | 2017-06-20 | NaN | 92.0 | 1.50 | 1 | False |
@@ -940,12 +880,11 @@ df
 | 4 | 2017-06-23 | foo&bar | 1.0 | 10.00 | 0 | False |
 | 5 | 2017-06-24 | get off my lawn | 99.0 | 8.90 | 3 | False |
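
The numeric recipe (flag first, then fill with the median) can be sketched on a tiny made-up `Size` column. The values below are assumptions for illustration; the order of the two steps matters, because the flag has to be recorded before the gap is filled.

```python
import pandas as pd

# Hypothetical Size column with one missing value
df = pd.DataFrame({"Size": [92.0, 3.0, None, 4.0]})

# Step 1: remember which rows were missing before we overwrite them
df["Size_na"] = df["Size"].isnull()

# Step 2: fill the gap with the column median (median() skips missing values)
szcol = df["Size"]
df["Size"] = szcol.fillna(szcol.median())

print(df)
```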

-### Dummy variables
-
-Instead, we can convert the categorical variable to dummy variables, also called "*one hot encoding*."
+### Dummy variables

-If were lazy, we can just convert everything to dummies but the `Description` field is not something that we need to convert as it is mostly just information we are carrying along.
+Instead, we can convert the categorical variable to dummy variables, a technique also called "one-hot encoding."

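+If we're lazy, we can just convert everything to dummies, but the `Description` field is not something that we need to convert, as it is mostly just information we are carrying along.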

 ```python
 pandas.get_dummies(df) # convert all categorical to dummies
@@ -988,8 +927,7 @@ pandas.get_dummies(df['Topic'], dummy_na=True) # Add an "na" column
 | 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
 | 5 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |

-We can manually pack these new columns into the old data frame and delete the old column:
-
+We can manually pack these new columns into the old data frame and delete the old column:

 ```python
 df2 = pandas.concat([df,pandas.get_dummies(df['Topic'], dummy_na=True)], axis=1)
@@ -997,7 +935,6 @@ df2.drop('Topic', axis=1, inplace=True) # Considered better than del df2['Topic'
 df2
 ```

-
 |  | Date | Description | Size | Price | Size_na | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | nan |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | 0 | 2017-06-20 | NaN | 92.0 | 1.50 | False | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
@@ -1007,15 +944,12 @@ df2
 | 4 | 2017-06-23 | foo&bar | 1.0 | 10.00 | False | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
 | 5 | 2017-06-24 | get off my lawn | 99.0 | 8.90 | False | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |

-Or, we can do it the easy way by just specifying the columns to convert:
-
+Or, we can do it the easy way by just specifying the columns to convert:

 ```python
 pandas.get_dummies(df, columns=['Topic'], dummy_na=True) # The easy way
 ```

-
-
 |  | Date | Description | Size | Price | Size_na | Topic_0.0 | Topic_1.0 | Topic_2.0 | Topic_3.0 | Topic_4.0 | Topic_nan |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | 0 | 2017-06-20 | NaN | 92.0 | 1.50 | False | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
@@ -1025,8 +959,7 @@ pandas.get_dummies(df, columns=['Topic'], dummy_na=True) # The easy way
 | 4 | 2017-06-23 | foo&bar | 1.0 | 10.00 | False | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
 | 5 | 2017-06-24 | get off my lawn | 99.0 | 8.90 | False | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
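
For a quick look at what `dummy_na=True` adds, here is a minimal sketch on a made-up frame that keeps `Topic` as strings rather than integer codes (an assumption for readability; the data is illustrative). Each row ends up with exactly one indicator set, and the missing topic lands in the extra `Topic_nan` column.

```python
import pandas as pd

# Hypothetical frame with one missing topic
df = pd.DataFrame({
    "Price": [1.50, 9.99, 3.00],
    "Topic": ["News", None, "Sci"],
})

# Expand only the Topic column; other columns pass through unchanged
dummies = pd.get_dummies(df, columns=["Topic"], dummy_na=True)

indicator_cols = ["Topic_News", "Topic_Sci", "Topic_nan"]
print(list(dummies.columns))
print(dummies[indicator_cols].sum(axis=1).tolist())   # one indicator per row
```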

-If you ever need to walk the columns in your data frame, you can do that with a for each loop:
-
+If you ever need to walk the columns in your data frame, you can do that with a `for` loop:

 ```python
 # walk columns; col is the actual series
@@ -1044,7 +977,6 @@ Size_na
 ```
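
As a small illustration of walking columns, the sketch below uses a made-up two-column frame (the data is an assumption). Iterating over `df.columns` yields the names, while `df.items()` yields name and series pairs.

```python
import pandas as pd

# Hypothetical frame used only to demonstrate the loop
df = pd.DataFrame({"Size": [92.0, 3.0], "Price": [1.50, 9.99]})

# Walk the column names, then look up each series by name
for name in df.columns:
    col = df[name]
    print(name, col.dtype)

# Walk (name, series) pairs directly
names = [name for name, col in df.items()]
print(names)
```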


-
 ```python
 df.describe() # useful stats about columns
 ```
@@ -1062,20 +994,19 @@ df.describe() # useful stats about columns
 8 rows × 3 columns


+## Summary

-## Summary
-
-In this lecture, you've learned the basics of loading and manipulating data:
+In this lecture, you've learned the basics of loading and manipulating data:

-* Loading data into Pandas data frames using `read_csv()`
-* Converting data frames to NumPy arrays using `as_matrix()`
-* Extracting columns with *dataframe*`.`*columnname* or *matrix*`[:,`*columnindex*`]`
-* Accessing elements via *dataframe*`.iloc[`rowindex`][`*columnindex*`]` or *matrix*`[`*rowindex*`,`*columnindex*`]`
-* Getting unique elements with `set(`*mylist*`)`
+* Loading data into Pandas data frames using `read_csv()`
+* Converting data frames to NumPy arrays using `as_matrix()`
+* Extracting columns with `dataframe.columnname` or `matrix[:, columnindex]`
+* Accessing elements via `dataframe.iloc[rowindex][columnindex]` or `matrix[rowindex, columnindex]`
+* Getting unique elements with `set(mylist)`

-And, you've learned how to visualize:
+And, you've learned how to visualize:

-* Time series data
-* Functions over a given range
-* The relationship between variables using a scatterplot
-* Histograms approximating density function
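+* Time series data
+* Functions over a given range
+* The relationship between variables using a scatterplot
+* Histograms approximating a density function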