.. _datasets:

.. currentmodule:: statsmodels.datasets

.. ipython:: python
   :suppress:

   import numpy as np
   np.set_printoptions(suppress=True)

The Datasets Package
====================

``statsmodels`` provides data sets (i.e. data *and* meta-data) for use in
examples, tutorials, model testing, etc.

Using Datasets from Stata
-------------------------

.. autosummary::
   :toctree: ./

   webuse

Using Datasets from R
---------------------

The `Rdatasets project <https://vincentarelbundock.github.io/Rdatasets/>`__ gives access to the datasets available in R's core datasets package and in many other common R packages. All of these datasets are available to statsmodels through the :func:`get_rdataset` function. The data itself is accessible through the ``data`` attribute. For example:

.. ipython:: python

   import statsmodels.api as sm
   duncan_prestige = sm.datasets.get_rdataset("Duncan", "carData")
   print(duncan_prestige.__doc__)
   duncan_prestige.data.head(5)


R Datasets Function Reference
-----------------------------


.. autosummary::
   :toctree: ./

   get_rdataset
   get_data_home
   clear_data_home
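
The two cache helpers can be tried without any network access. The following
is a minimal sketch, assuming that ``get_data_home`` creates the directory it
is given when that directory does not yet exist and that ``clear_data_home``
deletes it again. The temporary location and the ``sm_cache`` name are
illustrative only.

.. code-block:: python

   import os
   import tempfile

   from statsmodels.datasets import clear_data_home, get_data_home

   # Work in a throwaway location so the real cache is left alone.
   with tempfile.TemporaryDirectory() as tmp:
       cache_dir = os.path.join(tmp, "sm_cache")  # hypothetical directory name

       # Returns the cache path, creating the directory if needed.
       path = get_data_home(cache_dir)
       created = os.path.isdir(path)

       # Deletes the cache directory and everything in it.
       clear_data_home(cache_dir)
       removed = not os.path.exists(cache_dir)

   print(created, removed)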


Available Datasets
------------------

.. toctree::
   :maxdepth: 1
   :glob:

   generated/*

Usage
-----

Load a dataset:

.. ipython:: python

   import statsmodels.api as sm
   data = sm.datasets.longley.load(as_pandas=False)

The `Dataset` object follows the bunch pattern explained in the :ref:`proposal <dataset_proposal>`. The full dataset is available in the ``data`` attribute.

.. ipython:: python

   data.data

Most datasets hold convenient representations of the data in the attributes `endog` and `exog`:

.. ipython:: python

   data.endog[:5]
   data.exog[:5, :]

Univariate datasets, however, do not have an `exog` attribute.
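
As a sketch of the univariate case, the example below assumes the bundled
``sunspots`` dataset, which holds a single series. It reads ``exog`` through
``getattr`` with a default, so it works whether a missing ``exog`` shows up as
an absent attribute or as ``None``.

.. code-block:: python

   import statsmodels.api as sm

   sunspots = sm.datasets.sunspots.load_pandas()

   # A univariate dataset only carries the series itself.
   has_endog = sunspots.endog is not None

   # Fall back to None instead of raising if ``exog`` is not defined.
   exog = getattr(sunspots, "exog", None)
   print(has_endog, exog is None)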

Variable names can be obtained by typing:

.. ipython:: python

   data.endog_name
   data.exog_name

If the dataset does not have a clear interpretation of what should be an
`endog` and `exog`, then you can always access the `data` or `raw_data`
attributes. This is the case for the `macrodata` dataset, which is a collection
of US macroeconomic data rather than a dataset with a specific example in mind.
The `data` attribute contains a record array of the full dataset and the
`raw_data` attribute contains an ndarray with the names of the columns given
by the `names` attribute.

.. ipython:: python

   type(data.data)
   type(data.raw_data)
   data.names
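
For a dataset such as ``macrodata``, the variables have to be picked by hand.
The following is a minimal sketch, assuming the ``realgdp``, ``realcons`` and
``realinv`` column names of the bundled data. It uses the pandas loader
described later in this section, and the choice of response and regressors is
purely illustrative.

.. code-block:: python

   import statsmodels.api as sm

   macro = sm.datasets.macrodata.load_pandas()

   # No endog/exog split is predefined, so select columns from ``data``.
   df = macro.data
   y = df["realgdp"]                # illustrative choice of response
   X = df[["realcons", "realinv"]]  # illustrative choice of regressors
   print(y.shape, X.shape)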

Loading data as pandas objects
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For many users it may be preferable to get the datasets as a pandas DataFrame or
Series object. Each of the dataset modules is equipped with a ``load_pandas``
function, which returns a ``Dataset`` instance with the data readily available as pandas objects:

.. ipython:: python

   data = sm.datasets.longley.load_pandas()
   data.exog
   data.endog

The full DataFrame is available in the ``data`` attribute of the Dataset object:

.. ipython:: python

   data.data
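
To check what ``load_pandas`` returned, the sketch below inspects the types
and labels of the Longley data. It relies only on the attributes shown above;
the variable names are illustrative.

.. code-block:: python

   import pandas as pd
   import statsmodels.api as sm

   longley = sm.datasets.longley.load_pandas()

   # endog is a Series; exog is a DataFrame.
   is_series = isinstance(longley.endog, pd.Series)
   is_frame = isinstance(longley.exog, pd.DataFrame)

   # The pandas labels agree with endog_name and exog_name.
   names_match = (
       longley.endog.name == longley.endog_name
       and list(longley.exog.columns) == list(longley.exog_name)
   )
   print(is_series, is_frame, names_match)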


Because the estimation classes integrate with pandas, metadata such as the
variable names is attached to the model results:

.. ipython:: python
   :okwarning:

   y, x = data.endog, data.exog
   res = sm.OLS(y, x).fit()
   res.params
   res.summary()
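
Note that ``OLS`` does not add an intercept on its own, so the fit above has
none. The following is a minimal sketch of the same regression with an
intercept, using ``sm.add_constant`` to prepend a ``const`` column to the
regressors.

.. code-block:: python

   import statsmodels.api as sm

   data = sm.datasets.longley.load_pandas()

   # Prepend a column of ones named "const" to the regressors.
   x = sm.add_constant(data.exog)
   res = sm.OLS(data.endog, x).fit()

   # Parameters are labelled with the DataFrame column names.
   print(res.params.index.tolist())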

Extra Information
^^^^^^^^^^^^^^^^^

If you want to know more about the dataset itself, you can access the
following, again using the Longley dataset as an example ::

    >>> dir(sm.datasets.longley)[:6]
    ['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']
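
These module-level names hold plain strings. As a sketch, two of the
attributes listed above can be read and printed directly:

.. code-block:: python

   import statsmodels.api as sm

   longley = sm.datasets.longley

   # Title and source citation stored alongside the data.
   title = longley.TITLE
   source = longley.SOURCE
   print(title)
   print(source)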

Additional information
----------------------

* The idea for a datasets package was originally proposed by David Cournapeau. The proposal, with updates by Skipper Seabold, can be found :ref:`here <dataset_proposal>`.
* To add datasets, see the :ref:`notes on adding a dataset <add_data>`.