Accessing rows via `sales[0]` doesn't work, because Pandas reserves array-indexing notation for selecting columns. Instead, we have to use slightly more awkward notation:
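As a minimal sketch of that notation, assume a small stand-in for the `sales` table (the column names and values here are made up for illustration). Positional row access goes through `iloc`, while plain brackets look up a column label:

```python
import pandas as pd

# Hypothetical stand-in for `sales`; column names and values are assumptions.
sales = pd.DataFrame({
    "Date": ["2017-06-22", "2017-06-23"],
    "Quantity": [3, 5],
})

# Plain brackets look for a *column* labeled 0, which does not exist here.
try:
    sales[0]
    bracket_failed = False
except KeyError:
    bracket_failed = True

first_row = sales.iloc[0]    # row at integer position 0, returned as a Series
element = sales.iloc[0, 1]   # single element: row 0, column 1
print(first_row)
print(element)
```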
While building and debugging software, I often like the explicit printing of the column names, which is the default shown above. On the other hand, if we need the elements as a plain old Python list, we can get one with `list()`:
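Here is a small sketch of the conversion, again using a made-up `sales` table (its columns are assumptions, not the original data). Note that `list()` applied to the whole data frame yields the column names rather than the rows, so we apply it to a single row:

```python
import pandas as pd

# Hypothetical stand-in for `sales`; column names and values are assumptions.
sales = pd.DataFrame({
    "Date": ["2017-06-22", "2017-06-23"],
    "Quantity": [3, 5],
})

row = sales.iloc[0]
row_values = list(row)       # the row's elements as a plain Python list
row_native = row.tolist()    # same elements, converted to native Python scalars
column_names = list(sales)   # iterating a DataFrame yields its column labels
print(row_values)
print(column_names)
```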
For machine learning, we often want to separate out one of the columns as the dependent variable, keeping the others as a group of independent variables. The notation we typically use is X -> Y, meaning that the set of observations in X predicts or classifies the results in Y.
For example, let's say we wanted to predict engine size given the efficiency, number of cylinders, and overall car weight. We need to separate out the engine size as Y and combine the other columns into X. Using Pandas, we can easily separate the variables and keep the variable names:
Converting to a NumPy array strips away the column names but lets us treat it as a matrix, which is handy in a lot of cases (e.g., matrix addition). Separating columns from NumPy arrays is a bit more cumbersome, however:
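One way to sketch this split is shown below. The `cars` frame is a tiny made-up stand-in, and the column names `MPG`, `CYL`, `ENG`, and `WGT` are assumptions chosen for illustration:

```python
import pandas as pd

# Hypothetical stand-in for `cars`; column names and values are assumptions.
cars = pd.DataFrame({
    "MPG": [18.0, 15.0, 36.0],
    "CYL": [8, 8, 4],
    "ENG": [307.0, 350.0, 79.0],
    "WGT": [3504, 3693, 1825],
})

X = cars.drop(columns="ENG")   # independent variables, names preserved
Y = cars["ENG"]                # dependent variable as a named Series
print(X.columns.tolist())
print(Y.name)
```

Dropping the target column by name, rather than listing every predictor, keeps the split correct if more predictor columns are added later.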
```python
m = cars.values  # NumPy matrix; as_matrix() was removed in pandas 1.0
...
print(Y)
'''
```
While NumPy arrays are more cumbersome when pulling apart tables, accessing individual elements is usually more convenient because we don't need `iloc`:
```python
print(cars.iloc[0][1])  # Pandas: row 0, then element 1 of that row
...
print(m[0,1])           # NumPy: row 0, column 1 directly
```
## Mixed, missing data
This section uses tips from [Jeremy Howard](https://www.usfca.edu/data-institute/about-us/researchers) on real-world data cleanup.
| 5 | 2017-06-24 | get off my lawn | 99.0 | 8.90 | 4 | False |
### Missing data
Our data has a missing description, which we can ignore, but also has a missing size (numeric) and topic (categorical) entry.
* If the element is numeric, we replace the missing value with the column median and add a 0/1 indicator column recording whether the value was missing.
* If the element is categorical, Pandas can handle the missing value automatically when we use parameter `dummy_na=True` on `get_dummies()` (see next section).
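The numeric case can be sketched as follows. The frame, the column name `Size`, and the indicator name `Size_na` are all assumptions made up for this example; the important detail is to record which values were missing before filling them in:

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column name `Size` and its values are assumptions.
df = pd.DataFrame({"Size": [99.0, np.nan, 12.0, 50.0]})

# Record missingness first, since it is lost once the gaps are filled.
df["Size_na"] = df["Size"].isna().astype(int)

# median() skips NaN by default, so it is computed from the observed values only.
df["Size"] = df["Size"].fillna(df["Size"].median())
print(df)
```

The indicator column lets a model distinguish a genuine median-sized value from one that was imputed.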
| 5 | 2017-06-24 | get off my lawn | 99.0 | 8.90 | 3 | False |
### Dummy variables
Instead, we can convert the categorical variable to dummy variables, also called "*one hot encoding*."
If we're lazy, we can just convert everything to dummies, but the `Description` field is not something that we need to convert, as it is mostly just information we are carrying along.
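A minimal sketch of converting only the categorical column, while carrying `Description` along untouched, might look like this. The frame and the column name `Topic` are assumptions for illustration; `columns` and `dummy_na` are parameters of `pd.get_dummies()`:

```python
import numpy as np
import pandas as pd

# Hypothetical data; the `Topic` column name and all values are assumptions.
df = pd.DataFrame({
    "Description": ["free stuff", None, "get off my lawn"],
    "Topic": ["sports", np.nan, "politics"],
})

# Encode only Topic; dummy_na=True adds an extra column flagging missing topics.
encoded = pd.get_dummies(df, columns=["Topic"], dummy_na=True)
print(encoded.columns.tolist())
```

Passing `columns` explicitly avoids exploding free-text fields such as `Description` into one dummy column per distinct string.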