{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Generalized Linear Models" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "from __future__ import print_function\n", "import numpy as np\n", "import statsmodels.api as sm\n", "from scipy import stats\n", "from matplotlib import pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GLM: Binomial response data\n", "\n", "### Load data\n", "\n", " In this example, we use the Star98 dataset which was taken with permission\n", " from Jeff Gill (2000) Generalized linear models: A unified approach. Codebook\n", " information can be obtained by typing: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(sm.datasets.star98.NOTE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the data and add a constant to the exogenous (independent) variables:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data = sm.datasets.star98.load()\n", "data.exog = sm.add_constant(data.exog, prepend=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " The dependent variable is N by 2 (Success: NABOVE, Failure: NBELOW): " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(data.endog[:5,:])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " The independent variables include all the other variables described above, as\n", " well as the interaction terms:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(data.exog[:2,:])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fit and summary" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "glm_binom = sm.GLM(data.endog, data.exog, family=sm.families.Binomial())\n", "res = glm_binom.fit()\n", "print(res.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Quantities of interest" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print('Total number of trials:', data.endog[0].sum())\n", "print('Parameters: ', res.params)\n", "print('T-values: ', res.tvalues)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First differences: We hold all explanatory variables constant at their means and manipulate the percentage of low income households to assess its impact on the response variables: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "means = data.exog.mean(axis=0)\n", "means25 = means.copy()\n", "means25[0] = stats.scoreatpercentile(data.exog[:,0], 25)\n", "means75 = means.copy()\n", "means75[0] = lowinc_75per = stats.scoreatpercentile(data.exog[:,0], 75)\n", "resp_25 = res.predict(means25)\n", "resp_75 = res.predict(means75)\n", "diff = resp_75 - resp_25" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The interquartile first difference for the percentage of low income households in a school district is:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(\"%2.4f%%\" % (diff*100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plots\n", "\n", " We extract information that will be used to draw some interesting plots: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "nobs = res.nobs\n", "y = data.endog[:,0]/data.endog.sum(1)\n", "yhat = res.mu" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot yhat vs y:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from statsmodels.graphics.api import abline_plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "ax.scatter(yhat, y)\n", "line_fit = sm.OLS(y, sm.add_constant(yhat, prepend=True)).fit()\n", "abline_plot(model_results=line_fit, ax=ax)\n", "\n", "\n", "ax.set_title('Model Fit Plot')\n", "ax.set_ylabel('Observed values')\n", "ax.set_xlabel('Fitted values');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot yhat vs. Pearson residuals:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "\n", "ax.scatter(yhat, res.resid_pearson)\n", "ax.hlines(0, 0, 1)\n", "ax.set_xlim(0, 1)\n", "ax.set_title('Residual Dependence Plot')\n", "ax.set_ylabel('Pearson Residuals')\n", "ax.set_xlabel('Fitted values')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Histogram of standardized deviance residuals:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from scipy import stats\n", "\n", "fig, ax = plt.subplots()\n", "\n", "resid = res.resid_deviance.copy()\n", "resid_std = stats.zscore(resid)\n", "ax.hist(resid_std, bins=25)\n", "ax.set_title('Histogram of standardized deviance residuals');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "QQ Plot of Deviance Residuals:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from statsmodels import graphics\n", "graphics.gofplots.qqplot(resid, line='r')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GLM: Gamma for proportional count response\n", "\n", "### Load data\n", "\n", " In the example above, we printed the ``NOTE`` attribute to learn about the\n", " Star98 dataset. Statsmodels datasets ships with other useful information. For\n", " example: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(sm.datasets.scotland.DESCRLONG)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Load the data and add a constant to the exogenous variables:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data2 = sm.datasets.scotland.load()\n", "data2.exog = sm.add_constant(data2.exog, prepend=False)\n", "print(data2.exog[:5,:])\n", "print(data2.endog[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fit and summary" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "glm_gamma = sm.GLM(data2.endog, data2.exog, family=sm.families.Gamma())\n", "glm_results = glm_gamma.fit()\n", "print(glm_results.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GLM: Gaussian distribution with a noncanonical link\n", "\n", "### Artificial data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "nobs2 = 100\n", "x = np.arange(nobs2)\n", "np.random.seed(54321)\n", "X = np.column_stack((x,x**2))\n", "X = sm.add_constant(X, prepend=False)\n", "lny = np.exp(-(.03*x + .0001*x**2 - 1.0)) + .001 * np.random.rand(nobs2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fit and summary" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "gauss_log = sm.GLM(lny, X, family=sm.families.Gaussian(sm.families.links.log))\n", "gauss_log_results = gauss_log.fit()\n", "print(gauss_log_results.summary())" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }