influence_glm_logit.ipynb 5.0 KB
Notebook
Newer Older
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Influence Measures for GLM Logit\n",
    "\n",
    "\n",
    "Based on draft version for GLMInfluence, which will also apply to discrete Logit, Probit and Poisson, and eventually be extended to cover most models outside of time series analysis.\n",
    "\n",
    "The example for logistic regression was used by Pregibon (1981) \"Logistic Regression diagnostics\" and is based on data by Finney (1947).\n",
    "\n",
    "GLMInfluence includes the basic influence measures but still misses some measures described in Pregibon (1981), for example those related to deviance and effects on confidence intervals."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import os.path\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "from statsmodels.genmod.generalized_linear_model import GLM\n",
    "from statsmodels.genmod import families"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import statsmodels.stats.tests.test_influence\n",
    "test_module = statsmodels.stats.tests.test_influence.__file__\n",
    "cur_dir = cur_dir = os.path.abspath(os.path.dirname(test_module))\n",
    "\n",
    "file_name = 'binary_constrict.csv'\n",
    "file_path = os.path.join(cur_dir, 'results', file_name)\n",
    "df = pd.read_csv(file_path, index_col=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "res = GLM(df['constrict'], df[['const', 'log_rate', 'log_volumne']],\n",
    "          family=families.Binomial()).fit(attach_wls=True, atol=1e-10)\n",
    "print(res.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## get the influence measures\n",
    "\n",
    "GLMResults has a `get_influence` method similar to OLSResults, that returns and instance of the GLMInfluence class. This class has methods and (cached) attributes to inspect influence and outlier measures.\n",
    "\n",
    "This measures are based on a one-step approximation to the the results for deleting one observation. One-step approximations are usually accurate for small changes but underestimate the magnitude of large changes. Event though large changes are underestimated, they still show clearly the effect of influential observations\n",
    "\n",
    "In this example observation 4 and 18 have a large standardized residual and large Cook's distance, but not a large leverage. Observation 13 has the largest leverage but only small Cook's distance and not a large studentized residual.\n",
    "\n",
    "Only the two observations 4 and 18 have a large impact on the parameter estimates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "infl = res.get_influence(observed=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "summ_df = infl.summary_frame()\n",
    "summ_df.sort_values('cooks_d', ascending=False)[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "infl.plot_influence()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "infl.plot_index(y_var='cooks', threshold=2 * infl.cooks_distance[0].mean())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "infl.plot_index(y_var='resid', threshold=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "infl.plot_index(y_var='dfbeta', idx=1, threshold=0.5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "infl.plot_index(y_var='dfbeta', idx=2, threshold=0.5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "infl.plot_index(y_var='dfbeta', idx=0, threshold=0.5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}