{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Regression diagnostics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This example file shows how to use a few of the ``statsmodels`` regression diagnostic tests in a real-life context. You can learn about more tests and find out more information abou the tests here on the [Regression Diagnostics page.](https://www.statsmodels.org/stable/diagnostic.html) \n",
    "\n",
    "Note that most of the tests described here only return a tuple of numbers, without any annotation. A full description of outputs is always included in the docstring and in the online ``statsmodels`` documentation. For presentation purposes, we use the ``zip(name,test)`` construct to pretty-print short descriptions in the examples below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Estimate a regression model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "from __future__ import print_function\n",
    "from statsmodels.compat import lzip\n",
    "import statsmodels\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import statsmodels.formula.api as smf\n",
    "import statsmodels.stats.api as sms\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Load data\n",
    "url = 'https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/HistData/Guerry.csv'\n",
    "dat = pd.read_csv(url)\n",
    "\n",
    "# Fit regression model (using the natural log of one of the regressors)\n",
    "results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit()\n",
    "\n",
    "# Inspect the results\n",
    "print(results.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Normality of the residuals"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Jarque-Bera test:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "name = ['Jarque-Bera', 'Chi^2 two-tail prob.', 'Skew', 'Kurtosis']\n",
    "test = sms.jarque_bera(results.resid)\n",
    "lzip(name, test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Omni test:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "name = ['Chi^2', 'Two-tail probability']\n",
    "test = sms.omni_normtest(results.resid)\n",
    "lzip(name, test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Influence tests\n",
    "\n",
    "Once created, an object of class ``OLSInfluence`` holds attributes and methods that allow users to assess the influence of each observation. For example, we can compute and extract the first few rows of DFbetas by:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from statsmodels.stats.outliers_influence import OLSInfluence\n",
    "test_class = OLSInfluence(results)\n",
    "test_class.dfbetas[:5,:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Explore other options by typing ``dir(influence_test)``\n",
    "\n",
    "Useful information on leverage can also be plotted:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from statsmodels.graphics.regressionplots import plot_leverage_resid2\n",
    "fig, ax = plt.subplots(figsize=(8,6))\n",
    "fig = plot_leverage_resid2(results, ax = ax)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Other plotting options can be found on the [Graphics page.](https://www.statsmodels.org/stable/graphics.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Multicollinearity\n",
    "\n",
    "Condition number:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.linalg.cond(results.model.exog)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Heteroskedasticity tests\n",
    "\n",
    "Breush-Pagan test:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "name = ['Lagrange multiplier statistic', 'p-value', \n",
    "        'f-value', 'f p-value']\n",
    "test = sms.het_breushpagan(results.resid, results.model.exog)\n",
    "lzip(name, test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Goldfeld-Quandt test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "name = ['F statistic', 'p-value']\n",
    "test = sms.het_goldfeldquandt(results.resid, results.model.exog)\n",
    "lzip(name, test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Linearity\n",
    "\n",
    "Harvey-Collier multiplier test for Null hypothesis that the linear specification is correct:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "name = ['t value', 'p value']\n",
    "test = sms.linear_harvey_collier(results)\n",
    "lzip(name, test)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}