{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Formulas: Fitting models using R-style formulas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since version 0.5.0, ``statsmodels`` allows users to fit statistical models using R-style formulas. Internally, ``statsmodels`` uses the [patsy](http://patsy.readthedocs.org/) package to convert formulas and data to the matrices that are used in model fitting. The formula framework is quite powerful; this tutorial only scratches the surface. A full description of the formula language can be found in the ``patsy`` docs: \n", "\n", "* [Patsy formula language description](http://patsy.readthedocs.org/)\n", "\n", "## Loading modules and functions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from __future__ import print_function\n", "import numpy as np\n", "import statsmodels.api as sm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Import convention" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can import explicitly from `statsmodels.formula.api`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from statsmodels.formula.api import ols" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, you can just use the `formula` namespace of the main `statsmodels.api`."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sm.formula.ols" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or you can use the following convention:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import statsmodels.formula.api as smf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These names are just a convenient way to get access to each model's `from_formula` classmethod. See, for instance:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sm.OLS.from_formula" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All of the lower case models accept ``formula`` and ``data`` arguments, whereas upper case ones take ``endog`` and ``exog`` design matrices. ``formula`` accepts a string which describes the model in terms of a ``patsy`` formula. ``data`` takes a [pandas](http://pandas.pydata.org/) data frame or any other data structure that defines a ``__getitem__`` for variable names, such as a structured array or a dictionary of variables. \n", "\n", "``dir(sm.formula)`` returns a list of the available models. \n", "\n", "Formula-compatible models have the following generic call signature: ``(formula, data, subset=None, *args, **kwargs)``" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## OLS regression using formulas\n", "\n", "To begin, we fit the linear model described on the [Getting Started](gettingstarted.html) page.
Download the data, subset the columns, and use list-wise deletion to remove missing observations:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "dta = sm.datasets.get_rdataset(\"Guerry\", \"HistData\", cache=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = dta.data[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fit the model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "mod = ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)\n", "res = mod.fit()\n", "print(res.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Categorical variables\n", "\n", "Looking at the summary printed above, notice that ``patsy`` determined that elements of *Region* were text strings, so it treated *Region* as a categorical variable. `patsy`'s default is also to include an intercept, so one of the *Region* categories was automatically dropped to serve as the reference level.\n", "\n", "If *Region* had been an integer variable that we wanted to treat explicitly as categorical, we could have done so by using the ``C()`` operator: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "res = ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()\n", "print(res.params)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Patsy's more advanced features for categorical variables are discussed in: [Patsy: Contrast Coding Systems for categorical variables](contrasts.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Operators\n", "\n", "We have already seen that \"~\" separates the left-hand side of the model from the right-hand side, and that \"+\" adds new columns to the design matrix.
\n", "\n", "### Removing variables\n", "\n", "The \"-\" sign can be used to remove columns/variables. For instance, we can remove the intercept from a model by: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "res = ols(formula='Lottery ~ Literacy + Wealth + C(Region) -1 ', data=df).fit()\n", "print(res.params)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multiplicative interactions\n", "\n", "\":\" adds a new column to the design matrix with the interaction of the other two columns. \"*\" will also include the individual columns that were multiplied together:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "res1 = ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()\n", "res2 = ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()\n", "print(res1.params, '\\n')\n", "print(res2.params)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many other things are possible with operators. Please consult the [patsy docs](https://patsy.readthedocs.org/en/latest/formulas.html) to learn more." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Functions\n", "\n", "You can apply vectorized functions to the variables in your model: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "res = smf.ols(formula='Lottery ~ np.log(Literacy)', data=df).fit()\n", "print(res.params)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define a custom function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def log_plus_1(x):\n", "    return np.log(x) + 1.\n", "res = smf.ols(formula='Lottery ~ log_plus_1(Literacy)', data=df).fit()\n", "print(res.params)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Any function that is in the calling namespace is available to the formula." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using formulas with models that do not (yet) support them\n", "\n", "Even if a given `statsmodels` function does not support formulas, you can still use `patsy`'s formula language to produce design matrices. Those matrices \n", "can then be fed to the fitting function as `endog` and `exog` arguments.
\n", "\n", "To generate ``numpy`` arrays: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import patsy\n", "f = 'Lottery ~ Literacy * Wealth'\n", "y,X = patsy.dmatrices(f, df, return_type='matrix')\n", "print(y[:5])\n", "print(X[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To generate pandas data frames: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "f = 'Lottery ~ Literacy * Wealth'\n", "y,X = patsy.dmatrices(f, df, return_type='dataframe')\n", "print(y[:5])\n", "print(X[:5])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(sm.OLS(y, X).fit().summary())" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }