Commit 5ee92412, authored by wizardforcel

3.3

Parent f4c826a7
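# Manipulating and visualizing data
Now that we've learned the basics of [loading files](files.ipynb), it's time to reorganize the loaded data into commonly used data structures from [NumPy](http://www.numpy.org/) and [Pandas](http://pandas.pydata.org/). To motivate the various data structures, we'll feed them to [matplotlib](https://matplotlib.org/) for visualization. This lab then rounds out our treatment of steps 2, 3, and 5 of our generic data science program template: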
......@@ -538,8 +537,7 @@ Name: Customer Name, dtype: object
'''
```
Accessing rows via `sales[0]` doesn't work because Pandas reserves array indexing notation for getting columns. Instead, we have to use slightly more awkward notation:
```python
sales.iloc[0] # get first row of data
......@@ -556,14 +554,7 @@ Name: 0, dtype: object
'''
```
To get individual elements, we can use regular list-of-lists Python notation after the `iloc`:
```python
print sales.iloc[0][0], sales.iloc[0][1], sales.iloc[0][2]
......@@ -571,8 +562,7 @@ print sales.iloc[0][0], sales.iloc[0][1], sales.iloc[0][2]
# 10/13/10 6 38.94
```
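As a minimal sketch of the same lookups, the example below builds a tiny stand-in for `sales` (the frame and its second row are made up for illustration) and reads one cell both by position and by column name. It assumes a reasonably recent Pandas where `iloc[row, col]` and `loc[row, name]` are available.

```python
import pandas as pd

# Hypothetical stand-in for the `sales` table; the values are illustrative only.
sales = pd.DataFrame({
    "Order Date": ["10/13/10", "10/1/12"],
    "Quantity": [6, 49],
    "Customer Name": ["Muhammed MacIntyre", "Barry French"],
})

first_row = sales.iloc[0]             # a Series holding the first row
by_position = sales.iloc[0, 1]        # row 0, column 1 in a single lookup
by_label = sales.loc[0, "Quantity"]   # the same cell, addressed by column name

print(first_row["Customer Name"], by_position, by_label)
```

Passing both indexes to `iloc` in one pair of brackets avoids building the intermediate row object that `iloc[0][1]` creates.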
During construction and debugging of software, I often like the explicit printing of the column names, which is the default shown above. On the other hand, if we need the elements as a plain old Python list, we can do that with `list()`:
```python
row = list(sales.iloc[0])
......@@ -581,10 +571,9 @@ print row
# ['10/13/10', 6, 38.939999999999998, 35.0, 'Muhammed MacIntyre', 'Office Supplies', 'Eldon Base for stackable storage shelf, platinum']
```
**Exercise**: Convert all rows of `sales` to a list of lists. Hint: use the map pattern and `list()`.
The task in that exercise is common enough that Pandas provides a conversion mechanism directly:
```python
m = sales.as_matrix()
......@@ -598,9 +587,7 @@ Type is <type 'numpy.ndarray'>
'''
```
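One possible approach to the exercise, next to the built-in conversion, is sketched here. The small `sales` frame is a made-up stand-in, and the sketch assumes Pandas 0.24 or later, where `to_numpy()` is available (it takes over the role of `as_matrix()`, which newer Pandas releases no longer provide).

```python
import pandas as pd

# Hypothetical two-row stand-in for the `sales` table.
sales = pd.DataFrame({"Order Date": ["10/13/10", "10/1/12"], "Quantity": [6, 49]})

# Map pattern: apply list() to every row of the frame.
rows = [list(sales.iloc[i]) for i in range(len(sales))]

# Built-in conversion to a NumPy array (assumes a Pandas version with to_numpy()).
m = sales.to_numpy()

print(rows)
print(type(m), m.shape)
```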
We can still get the columns individually using the wildcard (slice) notation we saw before:
```python
m[:,0] # get first column
......@@ -611,13 +598,6 @@ array(['10/13/10', '10/1/12', '10/1/12', ..., '11/8/10', '10/21/12',
'''
```
```python
m[:,4] # get fifth column
......@@ -628,17 +608,11 @@ array(['Muhammed MacIntyre', 'Barry French', 'Barry French', ...,
```
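The slice notation generalizes beyond a single column. The sketch below uses a small hand-built array (the values are illustrative, not the real `sales` data) to show a single column, the last column, and a range of columns.

```python
import numpy as np

# Hypothetical 2x3 array of mixed values, so dtype=object keeps them as-is.
m = np.array([["10/13/10", 6, "Muhammed MacIntyre"],
              ["10/1/12", 49, "Barry French"]], dtype=object)

first_col = m[:, 0]     # every row, column 0
last_col = m[:, -1]     # negative indexes count from the end
first_two = m[:, 0:2]   # every row, columns 0 and 1 (a 2-D result)

print(first_col, last_col, first_two.shape)
```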
### Pulling data frames apart
For machine learning, we often want to separate out one of the columns as the dependent variable, keeping the others as a group of independent variables. The notation we typically use is `X -> Y`, meaning the set of observations in `X` predicts or classifies the results in `Y`.
For example, let's say we wanted to predict engine size given the efficiency, number of cylinders, and overall car weight. We need to separate out the engine size as `Y` and combine the other columns into `X`. Using Pandas, we can easily separate the variables and keep the variable names:
```python
cars = pandas.read_csv('data/cars.csv')
......@@ -671,9 +645,7 @@ Name: ENG, dtype: float64
'''
```
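A minimal sketch of the `X -> Y` split on a made-up frame follows. Only the `ENG` column name appears in the output above; the other column names and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical cars table: efficiency, cylinders, engine size, weight.
cars = pd.DataFrame({
    "MPG": [18.0, 15.0, 36.0],
    "CYL": [8, 8, 4],
    "ENG": [307.0, 350.0, 79.0],
    "WGT": [3504, 3693, 1825],
})

Y = cars["ENG"]                 # dependent variable as a Series
X = cars.drop(columns="ENG")    # everything else, column names preserved

print(list(X.columns), Y.name)
```

Because `drop()` returns a new frame, `cars` itself is left untouched and can be split differently later.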
Converting to a NumPy array strips away the column names but lets us treat the data as a matrix, which is handy in a lot of cases (e.g., matrix addition). Separating columns from NumPy arrays is a bit more cumbersome, however:
```python
m = cars.as_matrix()
......@@ -696,9 +668,7 @@ print Y
'''
```
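With a plain array there are no names to drop, so the split has to go by column position. The sketch below shows one way to do it with `np.delete()`; the array values and the position of the engine-size column are assumptions for illustration.

```python
import numpy as np

# Hypothetical numeric matrix: columns are efficiency, cylinders, engine size, weight.
m = np.array([[18.0, 8, 307.0, 3504],
              [15.0, 8, 350.0, 3693],
              [36.0, 4, 79.0, 1825]])

eng = 2                          # assumed index of the engine-size column
Y = m[:, eng]                    # dependent variable
X = np.delete(m, eng, axis=1)    # copy of m without that column

print(X.shape, Y)
```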
While NumPy arrays are more cumbersome when pulling apart tables, accessing the elements without `iloc` is usually more convenient:
```python
print cars.iloc[0][1]
......@@ -711,11 +681,11 @@ print m[0,1]
```
## Mixed, missing data
This section uses tips from [Jeremy Howard](https://www.usfca.edu/data-institute/about-us/researchers) on real-world data cleanup.
### Load and parse dates
```python
......@@ -734,9 +704,7 @@ df
| 5 | 2017-06-24 | get off my lawn | 99.0 | 8.90 | Sci |
If you ever need to convert dates to elapsed time, you can convert the date timestamp to UNIX time, the number of seconds since the start of 1970:
```python
d = df['Date']
......@@ -754,20 +722,12 @@ Name: Date, dtype: timedelta64[ns]
'''
```
```python
df3 = df.copy()
df3['Date'] = delta.dt.total_seconds()
df3
```
| | Date | Description | Size | Price | Topic | Size_na |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.497917e+09 | NaN | 92.0 | 1.50 | 2 | False |
......@@ -777,8 +737,7 @@ df3
| 4 | 1.498176e+09 | foo&bar | 1.0 | 10.00 | 1 | False |
| 5 | 1.498262e+09 | get off my lawn | 99.0 | 8.90 | 4 | False |
Or, you can convert the timestamp into the number of days since 1970:
```python
delta.dt.days
......@@ -794,16 +753,9 @@ Name: Date, dtype: int64
'''
```
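The two conversions above can be sketched end to end on a pair of dates taken from the table. This is an illustrative sketch, not the notebook's exact code: the variable names are assumptions, and the epoch is written out explicitly as a `Timestamp`.

```python
import pandas as pd

# Two of the dates shown in the table above.
dates = pd.Series(pd.to_datetime(["2017-06-20", "2017-06-24"]))

epoch = pd.Timestamp("1970-01-01")
delta = dates - epoch                 # a Series of timedeltas

seconds = delta.dt.total_seconds()    # UNIX time as floats
days = delta.dt.days                  # whole days since the epoch

print(seconds.tolist(), days.tolist())
```

The seconds for these two dates, 1497916800 and 1498262400, agree with the `1.497917e+09` and `1.498262e+09` entries in the table.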
### String to categorical variable
Here is how we convert a column to a categorical variable:
```python
df['Topic'] = df['Topic'].astype('category')
......@@ -825,8 +777,6 @@ print df['Topic'].cat.categories # .cat field gives us access to categories stu
# Index([u'News', u'Politics', u'Sci', u'Sports'], dtype='object')
```
```python
print df['Topic'].cat.codes
......@@ -858,10 +808,9 @@ Categories (4, object): [News < Politics < Sci < Sports]
```
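To see how categories and codes relate, here is a small sketch on a hand-made series (the values are illustrative and include one missing entry). It relies only on `astype('category')` and the `.cat` accessor used above.

```python
import pandas as pd

# Hypothetical topic column with one missing value.
topic = pd.Series(["Sports", "Politics", None, "News", "Sci"])

cat = topic.astype("category")
categories = list(cat.cat.categories)   # sorted unique labels
codes = cat.cat.codes.tolist()          # position of each label; -1 marks missing

print(categories)
print(codes)
```

Each code is the index of the value's label within `categories`, which is why a missing value, having no label, gets `-1`.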
### String to ordinal
We can convert that category to an integer if we like:
```python
# make sure you convert to categorical first
......@@ -879,15 +828,14 @@ df
| 4 | 2017-06-23 | foo&bar | 1.0 | 10.00 | 1 | False |
| 5 | 2017-06-24 | get off my lawn | 99.0 | 8.90 | 4 | False |
### Missing data
Our data has a missing description, which we can ignore, but also has a missing size (numeric) and topic (categorical) entry.

* If the element is numeric, we replace the missing value with the column median and add a column of 0 or 1 values indicating whether the value was missing.
* If the element is categorical, Pandas can handle the missing value automatically when we use parameter `dummy_na=True` on `get_dummies()` (see next section).

Let's convert the missing numeric data:
```python
pandas.isnull(df['Size'])
......@@ -903,13 +851,6 @@ Name: Size, dtype: bool
'''
```
```python
df['Size_na'] = pandas.isnull(df['Size'])
df
......@@ -930,7 +871,6 @@ df['Size'] = szcol.fillna(szcol.median())
df
```
| | Date | Description | Size | Price | Topic | Size_na |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 2017-06-20 | NaN | 92.0 | 1.50 | 1 | False |
......@@ -940,12 +880,11 @@ df
| 4 | 2017-06-23 | foo&bar | 1.0 | 10.00 | 0 | False |
| 5 | 2017-06-24 | get off my lawn | 99.0 | 8.90 | 3 | False |
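The numeric recipe (flag the gap, then fill it with the median) can be sketched in isolation. The `Size` values below are made up for illustration; only the `Size` and `Size_na` column names come from the tables above.

```python
import numpy as np
import pandas as pd

# Hypothetical Size column with one missing value.
df = pd.DataFrame({"Size": [92.0, np.nan, 100.0, 1.0, 99.0]})

# Record which rows were missing *before* filling them in.
df["Size_na"] = df["Size"].isnull()

# median() skips NaN by default, so it is computed from the four known values.
df["Size"] = df["Size"].fillna(df["Size"].median())

print(df)
```

The order matters: computing `Size_na` after `fillna()` would mark every row as present.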
### Dummy variables
Instead, we can convert the categorical variable to dummy variables, also called "*one-hot encoding*."
If we're lazy, we can just convert everything to dummies, but the `Description` field is not something that we need to convert, as it is mostly just information we are carrying along.
```python
pandas.get_dummies(df) # convert all categorical to dummies
......@@ -988,8 +927,7 @@ pandas.get_dummies(df['Topic'], dummy_na=True) # Add an "na" column
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
We can manually pack these new columns into the old data frame and delete the old column:
```python
df2 = pandas.concat([df,pandas.get_dummies(df['Topic'], dummy_na=True)], axis=1)
......@@ -997,7 +935,6 @@ df2.drop('Topic', axis=1, inplace=True) # Considered better than del df2['Topic'
df2
```
| | Date | Description | Size | Price | Size_na | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | nan |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2017-06-20 | NaN | 92.0 | 1.50 | False | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
......@@ -1007,15 +944,12 @@ df2
| 4 | 2017-06-23 | foo&bar | 1.0 | 10.00 | False | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 2017-06-24 | get off my lawn | 99.0 | 8.90 | False | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
Or, we can do it the easy way by just specifying the columns to convert:
```python
pandas.get_dummies(df, columns=['Topic'], dummy_na=True) # The easy way
```
| | Date | Description | Size | Price | Size_na | Topic_0.0 | Topic_1.0 | Topic_2.0 | Topic_3.0 | Topic_4.0 | Topic_nan |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2017-06-20 | NaN | 92.0 | 1.50 | False | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
......@@ -1025,8 +959,7 @@ pandas.get_dummies(df, columns=['Topic'], dummy_na=True) # The easy way
| 4 | 2017-06-23 | foo&bar | 1.0 | 10.00 | False | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 2017-06-24 | get off my lawn | 99.0 | 8.90 | False | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
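For a smaller picture of what `columns=` and `dummy_na=True` do together, the sketch below one-hot encodes a string column with a missing entry. The frame is made up for illustration; the exact dtype of the dummy columns (floats, integers, or booleans) depends on the Pandas version, so the sketch casts to `int` before printing.

```python
import pandas as pd

# Hypothetical frame: one numeric column, one categorical column with a gap.
df = pd.DataFrame({
    "Price": [1.5, 2.0, 3.0],
    "Topic": ["News", None, "Sci"],
})

# Only Topic is expanded; Price passes through unchanged.
dummies = pd.get_dummies(df, columns=["Topic"], dummy_na=True)

print(list(dummies.columns))
print(dummies["Topic_nan"].astype(int).tolist())
```

Each original row has exactly one of the `Topic_*` columns switched on, and the row with the missing topic is the one flagged in `Topic_nan`.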
If you ever need to walk the columns in your data frame, you can do that with a for-each loop:
```python
# walk columns; col is the actual series
......@@ -1044,7 +977,6 @@ Size_na
```
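A variation on that loop, sketched under the assumption that you want both the column name and the series at once, uses `DataFrame.items()`. The frame and the "pick out numeric columns" task are illustrative only.

```python
import pandas as pd

# Hypothetical frame with one numeric and one string column.
df = pd.DataFrame({"Size": [92.0, 1.0], "Topic": ["News", "Sci"]})

names = []
numeric = []
for name, col in df.items():      # yields (column name, Series) pairs
    names.append(name)
    if pd.api.types.is_numeric_dtype(col):
        numeric.append(name)

print(names, numeric)
```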
```python
df.describe() # useful stats about columns
```
......@@ -1062,20 +994,19 @@ df.describe() # useful stats about columns
8 rows × 3 columns
## Summary
In this lecture, you've learned the basics of loading and manipulating data:

* Loading data into Pandas data frames using `read_csv()`
* Converting data frames to NumPy arrays using `as_matrix()`
* Extracting columns with *dataframe*`.`*columnname* or *matrix*`[:,`*columnindex*`]`
* Accessing elements via *dataframe*`.iloc[`*rowindex*`][`*columnindex*`]` or *matrix*`[`*rowindex*`,`*columnindex*`]`
* Getting unique elements with `set(`*mylist*`)`

And, you've learned how to visualize:

* Time series data
* Functions over a given range
* The relationship between variables using a scatterplot
* Histograms approximating density functions