{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Formulas: Fitting models using R-style formulas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since version 0.5.0, ``statsmodels`` allows users to fit statistical models using R-style formulas. Internally, ``statsmodels`` uses the [patsy](http://patsy.readthedocs.org/) package to convert formulas and data to the matrices that are used in model fitting. The formula framework is quite powerful; this tutorial only scratches the surface. A full description of the formula language can be found in the ``patsy`` docs: \n",
    "\n",
    "* [Patsy formula language description](http://patsy.readthedocs.org/)\n",
    "\n",
    "## Loading modules and functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from __future__ import print_function\n",
    "import numpy as np\n",
    "import statsmodels.api as sm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Import convention"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can import explicitly from statsmodels.formula.api"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from statsmodels.formula.api import ols"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alternatively, you can just use the `formula` namespace of the main `statsmodels.api`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "sm.formula.ols"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or you can use the following conventioin"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import statsmodels.formula.api as smf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These names are just a convenient way to get access to each model's `from_formula` classmethod. See, for instance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "sm.OLS.from_formula"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All of the lower case models accept ``formula`` and ``data`` arguments, whereas upper case ones take ``endog`` and ``exog`` design matrices. ``formula`` accepts a string which describes the model in terms of a ``patsy`` formula. ``data`` takes a [pandas](http://pandas.pydata.org/) data frame or any other data structure that defines a ``__getitem__`` for variable names like a structured array or a dictionary of variables. \n",
    "\n",
    "``dir(sm.formula)`` will print a list of available models. \n",
    "\n",
    "Formula-compatible models have the following generic call signature: ``(formula, data, subset=None, *args, **kwargs)``"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## OLS regression using formulas\n",
    "\n",
    "To begin, we fit the linear model described on the [Getting Started](gettingstarted.html) page. Download the data, subset columns, and list-wise delete to remove missing observations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "dta = sm.datasets.get_rdataset(\"Guerry\", \"HistData\", cache=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df = dta.data[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fit the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "mod = ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)\n",
    "res = mod.fit()\n",
    "print(res.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Categorical variables\n",
    "\n",
    "Looking at the summary printed above, notice that ``patsy`` determined that elements of *Region* were text strings, so it treated *Region* as a categorical variable. `patsy`'s default is also to include an intercept, so we automatically dropped one of the *Region* categories.\n",
    "\n",
    "If *Region* had been an integer variable that we wanted to treat explicitly as categorical, we could have done so by using the ``C()`` operator: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "res = ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()\n",
    "print(res.params)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Patsy's mode advanced features for categorical variables are discussed in: [Patsy: Contrast Coding Systems for categorical variables](contrasts.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Operators\n",
    "\n",
    "We have already seen that \"~\" separates the left-hand side of the model from the right-hand side, and that \"+\" adds new columns to the design matrix. \n",
    "\n",
    "### Removing variables\n",
    "\n",
    "The \"-\" sign can be used to remove columns/variables. For instance, we can remove the intercept from a model by: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "res = ols(formula='Lottery ~ Literacy + Wealth + C(Region) -1 ', data=df).fit()\n",
    "print(res.params)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Multiplicative interactions\n",
    "\n",
    "\":\" adds a new column to the design matrix with the interaction of the other two columns. \"*\" will also include the individual columns that were multiplied together:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "res1 = ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()\n",
    "res2 = ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()\n",
    "print(res1.params, '\\n')\n",
    "print(res2.params)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many other things are possible with operators. Please consult the [patsy docs](https://patsy.readthedocs.org/en/latest/formulas.html) to learn more."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Functions\n",
    "\n",
    "You can apply vectorized functions to the variables in your model: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "res = smf.ols(formula='Lottery ~ np.log(Literacy)', data=df).fit()\n",
    "print(res.params)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define a custom function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def log_plus_1(x):\n",
    "    return np.log(x) + 1.\n",
    "res = smf.ols(formula='Lottery ~ log_plus_1(Literacy)', data=df).fit()\n",
    "print(res.params)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Any function that is in the calling namespace is available to the formula."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using formulas with models that do not (yet) support them\n",
    "\n",
    "Even if a given `statsmodels` function does not support formulas, you can still use `patsy`'s formula language to produce design matrices. Those matrices \n",
    "can then be fed to the fitting function as `endog` and `exog` arguments. \n",
    "\n",
    "To generate ``numpy`` arrays: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import patsy\n",
    "f = 'Lottery ~ Literacy * Wealth'\n",
    "y,X = patsy.dmatrices(f, df, return_type='matrix')\n",
    "print(y[:5])\n",
    "print(X[:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To generate pandas data frames: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "f = 'Lottery ~ Literacy * Wealth'\n",
    "y,X = patsy.dmatrices(f, df, return_type='dataframe')\n",
    "print(y[:5])\n",
    "print(X[:5])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(sm.OLS(y, X).fit().summary())"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}