influence_glm_logit.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Influence Measures for GLM Logit\n",
    "\n",
    "\n",
    "Based on draft version for GLMInfluence, which will also apply to discrete Logit, Probit and Poisson, and eventually be extended to cover most models outside of time series analysis.\n",
    "\n",
    "The example for logistic regression was used by Pregibon (1981) \"Logistic Regression diagnostics\" and is based on data by Finney (1947).\n",
    "\n",
    "GLMInfluence includes the basic influence measures but still misses some measures described in Pregibon (1981), for example those related to deviance and effects on confidence intervals."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import os.path\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "from statsmodels.genmod.generalized_linear_model import GLM\n",
    "from statsmodels.genmod import families"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import statsmodels.stats.tests.test_influence\n",
    "test_module = statsmodels.stats.tests.test_influence.__file__\n",
    "cur_dir = cur_dir = os.path.abspath(os.path.dirname(test_module))\n",
    "\n",
    "file_name = 'binary_constrict.csv'\n",
    "file_path = os.path.join(cur_dir, 'results', file_name)\n",
    "df = pd.read_csv(file_path, index_col=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "res = GLM(df['constrict'], df[['const', 'log_rate', 'log_volumne']],\n",
    "          family=families.Binomial()).fit(attach_wls=True, atol=1e-10)\n",
    "print(res.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## get the influence measures\n",
    "\n",
    "GLMResults has a `get_influence` method similar to OLSResults, that returns and instance of the GLMInfluence class. This class has methods and (cached) attributes to inspect influence and outlier measures.\n",
    "\n",
    "This measures are based on a one-step approximation to the the results for deleting one observation. One-step approximations are usually accurate for small changes but underestimate the magnitude of large changes. Event though large changes are underestimated, they still show clearly the effect of influential observations\n",
    "\n",
    "In this example observation 4 and 18 have a large standardized residual and large Cook's distance, but not a large leverage. Observation 13 has the largest leverage but only small Cook's distance and not a large studentized residual.\n",
    "\n",
    "Only the two observations 4 and 18 have a large impact on the parameter estimates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "infl = res.get_influence(observed=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "summ_df = infl.summary_frame()\n",
    "summ_df.sort_values('cooks_d', ascending=False)[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "infl.plot_influence()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "infl.plot_index(y_var='cooks', threshold=2 * infl.cooks_distance[0].mean())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "infl.plot_index(y_var='resid', threshold=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "infl.plot_index(y_var='dfbeta', idx=1, threshold=0.5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "infl.plot_index(y_var='dfbeta', idx=2, threshold=0.5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "infl.plot_index(y_var='dfbeta', idx=0, threshold=0.5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}