glm.rst 8.6 KB
Newer Older
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224
.. currentmodule:: statsmodels.genmod.generalized_linear_model

.. _glm:

Generalized Linear Models
=========================

Generalized linear models currently supports estimation using the one-parameter
exponential families.

See `Module Reference`_ for commands and arguments.

Examples
--------

.. ipython:: python
    :okwarning:

    # Load modules and data
    import statsmodels.api as sm
    data = sm.datasets.scotland.load(as_pandas=False)
    data.exog = sm.add_constant(data.exog)

    # Instantiate a gamma family model with the default link function.
    gamma_model = sm.GLM(data.endog, data.exog, family=sm.families.Gamma())
    gamma_results = gamma_model.fit()
    print(gamma_results.summary())

Detailed examples can be found here:

* `GLM <examples/notebooks/generated/glm.html>`__
* `Formula <examples/notebooks/generated/glm_formula.html>`__

Technical Documentation
-----------------------

..   ..glm_techn1
..   ..glm_techn2

The statistical model for each observation :math:`i` is assumed to be

 :math:`Y_i \sim F_{EDM}(\cdot|\theta,\phi,w_i)` and
 :math:`\mu_i = E[Y_i|x_i] = g^{-1}(x_i^\prime\beta)`.

where :math:`g` is the link function and :math:`F_{EDM}(\cdot|\theta,\phi,w)`
is a distribution of the family of exponential dispersion models (EDM) with
natural parameter :math:`\theta`, scale parameter :math:`\phi` and weight
:math:`w`.
Its density is given by

 :math:`f_{EDM}(y|\theta,\phi,w) = c(y,\phi,w)
 \exp\left(\frac{y\theta-b(\theta)}{\phi}w\right)\,.`

It follows that :math:`\mu = b'(\theta)` and
:math:`Var[Y|x]=\frac{\phi}{w}b''(\theta)`. The inverse of the first equation
gives the natural parameter as a function of the expected value
:math:`\theta(\mu)` such that

 :math:`Var[Y_i|x_i] = \frac{\phi}{w_i} v(\mu_i)`

with :math:`v(\mu) = b''(\theta(\mu))`. Therefore it is said that a GLM is
determined by link function :math:`g` and variance function :math:`v(\mu)`
alone (and :math:`x` of course).

Note that while :math:`\phi` is the same for every observation :math:`y_i`
and therefore does not influence the estimation of :math:`\beta`,
the weights :math:`w_i` might be different for every :math:`y_i` such that the
estimation of :math:`\beta` depends on them.

================================================= ============================== ============================== ======================================== =========================================== ============================================================================ =====================
Distribution                                      Domain                         :math:`\mu=E[Y|x]`             :math:`v(\mu)`                           :math:`\theta(\mu)`                         :math:`b(\theta)`                                                            :math:`\phi`
================================================= ============================== ============================== ======================================== =========================================== ============================================================================ =====================
Binomial :math:`B(n,p)`                           :math:`0,1,\ldots,n`           :math:`np`                     :math:`\mu-\frac{\mu^2}{n}`              :math:`\log\frac{p}{1-p}`                   :math:`n\log(1+e^\theta)`                                                    1
Poisson :math:`P(\mu)`                            :math:`0,1,\ldots,\infty`      :math:`\mu`                    :math:`\mu`                              :math:`\log(\mu)`                           :math:`e^\theta`                                                             1
Neg. Binom. :math:`NB(\mu,\alpha)`                :math:`0,1,\ldots,\infty`      :math:`\mu`                    :math:`\mu+\alpha\mu^2`                  :math:`\log(\frac{\alpha\mu}{1+\alpha\mu})` :math:`-\frac{1}{\alpha}\log(1-\alpha e^\theta)`                             1
Gaussian/Normal :math:`N(\mu,\sigma^2)`           :math:`(-\infty,\infty)`       :math:`\mu`                    :math:`1`                                :math:`\mu`                                 :math:`\frac{1}{2}\theta^2`                                                  :math:`\sigma^2`
Gamma :math:`N(\mu,\nu)`                          :math:`(0,\infty)`             :math:`\mu`                    :math:`\mu^2`                            :math:`-\frac{1}{\mu}`                      :math:`-\log(-\theta)`                                                       :math:`\frac{1}{\nu}`
Inv. Gauss. :math:`IG(\mu,\sigma^2)`              :math:`(0,\infty)`             :math:`\mu`                    :math:`\mu^3`                            :math:`-\frac{1}{2\mu^2}`                   :math:`-\sqrt{-2\theta}`                                                     :math:`\sigma^2`
Tweedie :math:`p\geq 1`                           depends on :math:`p`           :math:`\mu`                    :math:`\mu^p`                            :math:`\frac{\mu^{1-p}}{1-p}`               :math:`\frac{\alpha-1}{\alpha}\left(\frac{\theta}{\alpha-1}\right)^{\alpha}` :math:`\phi`
================================================= ============================== ============================== ======================================== =========================================== ============================================================================ =====================

The Tweedie distribution has special cases for :math:`p=0,1,2` not listed in the
table and uses :math:`\alpha=\frac{p-2}{p-1}`.

Correspondence of mathematical variables to code:

* :math:`Y` and :math:`y` are coded as ``endog``, the variable one wants to
  model
* :math:`x` is coded as ``exog``, the covariates alias explanatory variables
* :math:`\beta` is coded as ``params``, the parameters one wants to estimate
* :math:`\mu` is coded as ``mu``, the expectation (conditional on :math:`x`)
  of :math:`Y`
* :math:`g` is coded as ``link`` argument to the ``class Family``
* :math:`\phi` is coded as ``scale``, the dispersion parameter of the EDM
* :math:`w` is not yet supported (i.e. :math:`w=1`), in the future it might be
  ``var_weights``
* :math:`p` is coded as ``var_power`` for the power of the variance function
  :math:`v(\mu)` of the Tweedie distribution, see table
* :math:`\alpha` is either

  * Negative Binomial: the ancillary parameter ``alpha``, see table
  * Tweedie: an abbreviation for :math:`\frac{p-2}{p-1}` of the power :math:`p`
    of the variance function, see table


References
^^^^^^^^^^

* Gill, Jeff. 2000. Generalized Linear Models: A Unified Approach. SAGE QASS Series.
* Green, PJ. 1984. “Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives.” Journal of the Royal Statistical Society, Series B, 46, 149-192.
* Hardin, J.W. and Hilbe, J.M. 2007. “Generalized Linear Models and Extensions.” 2nd ed. Stata Press, College Station, TX.
* McCullagh, P. and Nelder, J.A. 1989. “Generalized Linear Models.” 2nd ed. Chapman & Hall, Boca Rotan.

Module Reference
----------------

.. module:: statsmodels.genmod.generalized_linear_model
   :synopsis: Generalized Linear Models (GLM)

Model Class
^^^^^^^^^^^

.. autosummary::
   :toctree: generated/

   GLM

Results Class
^^^^^^^^^^^^^

.. autosummary::
   :toctree: generated/

   GLMResults
   PredictionResults

.. _families:

Families
^^^^^^^^

The distribution families currently implemented are

.. module:: statsmodels.genmod.families.family
.. currentmodule:: statsmodels.genmod.families.family

.. autosummary::
   :toctree: generated/
   :template: autosummary/glmfamilies.rst

   Family
   Binomial
   Gamma
   Gaussian
   InverseGaussian
   NegativeBinomial
   Poisson
   Tweedie


.. _links:

Link Functions
^^^^^^^^^^^^^^

The link functions currently implemented are the following. Not all link
functions are available for each distribution family. The list of
available link functions can be obtained by

::

    >>> sm.families.family.<familyname>.links

.. module:: statsmodels.genmod.families.links
.. currentmodule:: statsmodels.genmod.families.links

.. autosummary::
   :toctree: generated/

   Link
   CDFLink
   CLogLog
   Log
   Logit
   NegativeBinomial
   Power
   cauchy
   cloglog
   identity
   inverse_power
   inverse_squared
   log
   logit
   nbinom
   probit

.. _varfuncs:

Variance Functions
^^^^^^^^^^^^^^^^^^

Each of the families has an associated variance function. You can access
the variance functions here:

::

    >>> sm.families.<familyname>.variance

.. module:: statsmodels.genmod.families.varfuncs
.. currentmodule:: statsmodels.genmod.families.varfuncs

.. autosummary::
   :toctree: generated/

  VarianceFunction
  constant
  Power
  mu
  mu_squared
  mu_cubed
  Binomial
  binary
  NegativeBinomial
  nbinom