{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Generalized Linear Models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "from __future__ import print_function\n",
    "import numpy as np\n",
    "import statsmodels.api as sm\n",
    "from scipy import stats\n",
    "from matplotlib import pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## GLM: Binomial response data\n",
    "\n",
    "### Load data\n",
    "\n",
    " In this example, we use the Star98 dataset which was taken with permission\n",
    " from Jeff Gill (2000) Generalized linear models: A unified approach. Codebook\n",
    " information can be obtained by typing: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(sm.datasets.star98.NOTE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load the data and add a constant to the exogenous (independent) variables:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "data = sm.datasets.star98.load()\n",
    "data.exog = sm.add_constant(data.exog, prepend=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " The dependent variable is N by 2 (Success: NABOVE, Failure: NBELOW): "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(data.endog[:5,:])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " The independent variables include all the other variables described above, as\n",
    " well as the interaction terms:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(data.exog[:2,:])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Fit and summary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "glm_binom = sm.GLM(data.endog, data.exog, family=sm.families.Binomial())\n",
    "res = glm_binom.fit()\n",
    "print(res.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Quantities of interest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print('Total number of trials:',  data.endog[0].sum())\n",
    "print('Parameters: ', res.params)\n",
    "print('T-values: ', res.tvalues)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First differences: We hold all explanatory variables constant at their means and manipulate the percentage of low income households to assess its impact on the response variables: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "means = data.exog.mean(axis=0)\n",
    "means25 = means.copy()\n",
    "means25[0] = stats.scoreatpercentile(data.exog[:,0], 25)\n",
    "means75 = means.copy()\n",
    "means75[0] = lowinc_75per = stats.scoreatpercentile(data.exog[:,0], 75)\n",
    "resp_25 = res.predict(means25)\n",
    "resp_75 = res.predict(means75)\n",
    "diff = resp_75 - resp_25"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The interquartile first difference for the percentage of low income households in a school district is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(\"%2.4f%%\" % (diff*100))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Plots\n",
    "\n",
    " We extract information that will be used to draw some interesting plots: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "nobs = res.nobs\n",
    "y = data.endog[:,0]/data.endog.sum(1)\n",
    "yhat = res.mu"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot yhat vs y:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from statsmodels.graphics.api import abline_plot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()\n",
    "ax.scatter(yhat, y)\n",
    "line_fit = sm.OLS(y, sm.add_constant(yhat, prepend=True)).fit()\n",
    "abline_plot(model_results=line_fit, ax=ax)\n",
    "\n",
    "\n",
    "ax.set_title('Model Fit Plot')\n",
    "ax.set_ylabel('Observed values')\n",
    "ax.set_xlabel('Fitted values');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot yhat vs. Pearson residuals:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()\n",
    "\n",
    "ax.scatter(yhat, res.resid_pearson)\n",
    "ax.hlines(0, 0, 1)\n",
    "ax.set_xlim(0, 1)\n",
    "ax.set_title('Residual Dependence Plot')\n",
    "ax.set_ylabel('Pearson Residuals')\n",
    "ax.set_xlabel('Fitted values')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Histogram of standardized deviance residuals:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from scipy import stats\n",
    "\n",
    "fig, ax = plt.subplots()\n",
    "\n",
    "resid = res.resid_deviance.copy()\n",
    "resid_std = stats.zscore(resid)\n",
    "ax.hist(resid_std, bins=25)\n",
    "ax.set_title('Histogram of standardized deviance residuals');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "QQ Plot of Deviance Residuals:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from statsmodels import graphics\n",
    "graphics.gofplots.qqplot(resid, line='r')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## GLM: Gamma for proportional count response\n",
    "\n",
    "### Load data\n",
    "\n",
    " In the example above, we printed the ``NOTE`` attribute to learn about the\n",
    " Star98 dataset. Statsmodels datasets ships with other useful information. For\n",
    " example: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(sm.datasets.scotland.DESCRLONG)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " Load the data and add a constant to the exogenous variables:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "data2 = sm.datasets.scotland.load()\n",
    "data2.exog = sm.add_constant(data2.exog, prepend=False)\n",
    "print(data2.exog[:5,:])\n",
    "print(data2.endog[:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Fit and summary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "glm_gamma = sm.GLM(data2.endog, data2.exog, family=sm.families.Gamma())\n",
    "glm_results = glm_gamma.fit()\n",
    "print(glm_results.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## GLM: Gaussian distribution with a noncanonical link\n",
    "\n",
    "### Artificial data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "nobs2 = 100\n",
    "x = np.arange(nobs2)\n",
    "np.random.seed(54321)\n",
    "X = np.column_stack((x,x**2))\n",
    "X = sm.add_constant(X, prepend=False)\n",
    "lny = np.exp(-(.03*x + .0001*x**2 - 1.0)) + .001 * np.random.rand(nobs2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Fit and summary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "gauss_log = sm.GLM(lny, X, family=sm.families.Gaussian(sm.families.links.log))\n",
    "gauss_log_results = gauss_log.fit()\n",
    "print(gauss_log_results.summary())"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}