:orphan:

===========
0.9 Release
===========

Release 0.9.0
=============

Release summary
---------------

The statsmodels documentation is hosted on GitHub and is available at
https://www.statsmodels.org/stable for the latest release, and
https://www.statsmodels.org/devel/ for the development version.


**Warning**

API stability is not guaranteed for new features, although even in
this case changes will be made in a backwards compatible way if
possible. The stability of a new feature depends on how much time it
was already in statsmodels master and how much usage it has already
seen.  If there are specific known problems or limitations, then they
are mentioned in the docstrings.


The list of pull requests for this release can be found on GitHub at
https://github.com/statsmodels/statsmodels/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Amerged+milestone%3A0.9
(The list does not include some pull requests that were merged before
the 0.8 release but were not included in 0.8.)


The Highlights
--------------

- statespace refactoring, Markov Switching Kim smoother
- 3 Google Summer of Code (GSOC) projects merged

  - distributed estimation
  - VECM and enhancements to VAR (including cointegration test)
  - new count models: GeneralizedPoisson, zero inflated models

- Bayesian mixed GLM
- Gaussian Imputation
- new multivariate methods: factor analysis, MANOVA, repeated measures
  within ANOVA
- GLM var_weights in addition to freq_weights
- Holt-Winters and Exponential Smoothing


What's new - an overview
------------------------

The following lists the main new features of statsmodels 0.9. In addition,
release 0.9 includes bug fixes, refactorings and improvements in many areas.

**base**
 - distributed estimation #3396  (Leland Bybee GSOC, Kerby Shedden)
 - optimization option scipy minimize #3193 (Roman Ring)
 - Box-Cox #3477 (Niels Wouda)
 - t_test_pairwise #4365 (Josef Perktold)

**discrete**
 - new count models (Evgeny Zhurko GSOC, Josef Perktold)
    - NegativeBinomialP #3832 merged in #3874
    - GeneralizedPoisson #3727 merged in  #3795
    - zero-inflated count models #3755 merged in #3908

 - discrete optimization improvements #3921, #3928 (Josef Perktold)
 - extend discrete margin when extra params, NegativeBinomial #3811
   (Josef Perktold)

**duration**
 - dependent censoring in survival/duration #3090 (Kerby Shedden)
 - entry times for Kaplan-Meier #3126 (Kerby Shedden)

**genmod**
 - Bayesian GLMM #4189, #4540 (Kerby Shedden)
 - GLM add var_weights #3692 (Peter Quackenbush)
 - GLM: EIM in optimization #3646 (Peter Quackenbush)
 - GLM correction to scale handling, loglike #3856 (Peter Quackenbush)

**graphics**
 - graphics HDR functional boxplot #3876 merged in #4049 (Pamphile ROY)
 - graphics Bland-Altman or Tukey mean difference plot
   #4112 merged in #4200 (Joses W. Ho)
 - bandwidth options in violinplots #4510 (Jim Correia)

**imputation**
 - multiple imputation via Gaussian model #4394, #4520 (Kerby Shedden)
 - regularized fitting in MICE #4319 (Kerby Shedden)

**iolib**
 - improvements of summary_coll #3702 merged #4064 (Natasha Watkins,
   Kevin Sheppard)

**multivariate**
 - multivariate: MANOVA, CanCorr #3327 (Yichuan Liu)
 - Factor Analysis #4161, #4156, #4167, #4214 (Yichuan Liu, Kerby Shedden,
   Josef Perktold)
 - statsmodels now includes the rotation code by ....

**regression**
 - fit_regularized for WLS #3581 (Kerby Shedden)

**stats**
 - Knockoff FDR #3204 (Kerby Shedden)
 - Repeated measures ANOVA #3303 merged in #3663, #3838 (Yichuan Liu, Richard
   Höchenberger)
 - lilliefors test for exponential distribution #3837 merged in #3936 (Jacob
   Kimmel, Josef Perktold)

**tools**
 - quasi-random, Halton sequences #4104 (Pamphile ROY)

**tsa**
 - VECM #3246 (Aleksandar Karakas GSOC, Josef Perktold)
 - exog support in VAR, incomplete for extra results, part of VECM
   #3246, #4538 (Aleksandar Karakas GSOC, Josef Perktold)
 - Markov switching, Kim smoother #3141 (Chad Fulton)
 - Holt-Winters #3817 merged in #4176 (tvanzyl)
 - seasonal_decompose: trend extrapolation and vectorized 2-D #3031
   (kernc, Josef Perktold)
 - add frequency domain seasonal components to UnobservedComponents #4250
   (Jordan Yoder)
 - refactoring of date handling in tsa #3276, #4457 (Chad Fulton)
 - SARIMAX without AR, MA #3383  (Chad Fulton)

**maintenance**
 - switch to pytest #3804 plus several other PRs (Kevin Sheppard)
 - general compatibility fixes for recent versions of numpy, scipy and pandas


`bug-wrong`
~~~~~~~~~~~

A new issue label `type-bug-wrong` marks bugs that cause incorrect numbers
to be returned without a warning.
(Regular bugs are mostly usability bugs or bugs that raise an exception for
unsupported use cases.)
See https://github.com/statsmodels/statsmodels/issues?q=is%3Aissue+label%3Atype-bug-wrong+is%3Aclosed+milestone%3A0.9

- scale in GLM fit_constrained, #4193 fixed in #4195
  cov_params and bse were incorrect if scale is estimated as in Gaussian.
  (This did not affect families with scale=1 such as Poisson)
- incorrect `pearson_chi2` with binomial counts, #3612 fixed as part of #3692
- null_deviance and llnull in GLMResults were wrong if exposure was used and
  when offset was used with Binomial counts.
- GLM Binomial in the non-binary count case used an incorrect endog when
  recreating the model, as done by fit_regularized and fit_constrained, #4599.
- GLM observed hessian was incorrectly computed if non-canonical link is used,
  fixed in #4620
  This fix improves convergence with gradient optimization and removes a usually
  numerically small error in cov_params.
- discrete predict with offset or exposure, #3569 fixed in #3696
  If either offset or exposure are not None but exog is None, then offset and
  exposure arguments in predict were ignored.
- discrete margins had wrong dummy and count effect if constant is prepended,
  #3695 fixed in #3696
- OLS outlier test, wrong index if order is True, #3971 fixed in #4385
- tsa coint ignored the autolag keyword, #3966 fixed in #4492
  This is a backwards incompatible change in the default: instead of a fixed
  maxlag, coint now defaults to 'aic' lag selection, which is the same as the
  adfuller default.
- wrong confidence interval in contingency table summary, #3822 fixed in #3830
  This only affected the summary and not the corresponding attribute.
- incorrect results in summary_col if regressor_order is used,
  #3767 fixed in #4271


Description of selected new features
------------------------------------

The following provides more information about a selected set of new features.

Vector Error Correction Model (VECM)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The VECM framework developed during GSOC 2016 by Aleksandar Karakas adds support
for non-stationary cointegrated VAR processes to statsmodels.
Currently, the following topics are implemented:

* Parameter estimation for cointegrated VAR
* forecasting
* testing for Granger-causality and instantaneous causality
* testing for cointegrating rank
* lag order selection.

New methods have also been added to the existing VAR model, and VAR now has
limited support for user-provided explanatory variables.


New Count Models
~~~~~~~~~~~~~~~~

New count models have been added as part of GSOC 2017 by Evgeny Zhurko.
Additional models that are not yet finished will be added for the next release.

The new models are:

* NegativeBinomialP (NBP): This is a generalization of NegativeBinomial that
  allows the variance power parameter to be specified in the range between 1
  and 2. The existing NegativeBinomial supports NB1 and NB2, which are two
  special cases of NBP.
* GeneralizedPoisson (GPP): Similar to NBP, this allows a wide range of
  dispersion specifications. GPP also allows some amount of underdispersion.
* ZeroInflated Models: Based on a generic base class, zero-inflated models
  are now available for Poisson, GeneralizedPoisson and NegativeBinomialP.
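
A minimal sketch of how two of these models might be fit is given below. The
simulated counts, the coefficient values and the 30% share of extra zeros are
assumptions chosen for the example. The optimizer may emit convergence
warnings on other data.

```python
import numpy as np
from statsmodels.discrete.count_model import ZeroInflatedPoisson
from statsmodels.discrete.discrete_model import NegativeBinomialP

rng = np.random.RandomState(0)
nobs = 500
x = rng.standard_normal(nobs)
exog = np.column_stack([np.ones(nobs), x])
mu = np.exp(0.5 + 0.3 * x)

# Overdispersed counts from a gamma-Poisson mixture (illustrative data).
y_nb = rng.poisson(mu * rng.gamma(shape=2.0, scale=0.5, size=nobs))

# p=2 corresponds to the NB2 variance function, p=1 to NB1.
res_nbp = NegativeBinomialP(y_nb, exog, p=2).fit(disp=0)
print(res_nbp.params)  # two mean coefficients followed by the dispersion

# Poisson counts where roughly 30% of observations are forced to zero.
extra_zero = rng.uniform(size=nobs) < 0.3
y_zi = np.where(extra_zero, 0, rng.poisson(mu))

# By default the inflation part contains only a constant (logit link).
res_zip = ZeroInflatedPoisson(y_zi, exog).fit(disp=0)
print(res_zip.params)  # inflation constant followed by the mean coefficients
```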

Generalized linear mixed models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Limited support for GLIMMIX models is now included in the genmod
module.  Binomial and Poisson models with independent random effects
can be fit using Bayesian methods (Laplace and mean field
approximations to the posterior).
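
The sketch below shows one way a binomial model with independent random
intercepts might be set up. The group structure, the true parameter values and
the variable names are assumptions for illustration. `fit_vb` uses the mean
field (variational Bayes) approximation and `fit_map` the Laplace
approximation.

```python
import numpy as np
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

rng = np.random.RandomState(0)
n_groups, group_size = 20, 25
groups = np.repeat(np.arange(n_groups), group_size)
nobs = groups.size

# Fixed effects design: intercept and one covariate (illustrative data).
exog = np.column_stack([np.ones(nobs), rng.standard_normal(nobs)])

# Random effects design: one indicator column per group.
exog_vc = (groups[:, None] == np.arange(n_groups)).astype(float)

# All random intercepts share a single variance parameter (label 0).
ident = np.zeros(n_groups, dtype=int)

# Simulate binary outcomes with group-level random intercepts.
group_effects = 0.7 * rng.standard_normal(n_groups)
linpred = -0.2 + 0.8 * exog[:, 1] + group_effects[groups]
endog = (rng.uniform(size=nobs) < 1 / (1 + np.exp(-linpred))).astype(float)

model = BinomialBayesMixedGLM(endog, exog, exog_vc, ident)

res_vb = model.fit_vb()    # mean field approximation
res_map = model.fit_map()  # Laplace approximation around the posterior mode

# Posterior means of the fixed effects under each approximation.
print(res_vb.fe_mean)
print(res_map.fe_mean)
```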

Multiple imputation
~~~~~~~~~~~~~~~~~~~

Multiple imputation using a multivariate Gaussian model is now
included in the imputation module.  The model is fit via Gibbs
sampling from the joint posterior of the mean vector, covariance
matrix, and missing data values.  A convenience function for fitting a
model to the multiply imputed data sets and combining the results is
provided.  This is an alternative to the existing MICE (Multiple
Imputation via Chained Equations) procedures.
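
The Gibbs sampling step can be sketched as follows. This example only draws
imputed data sets and does not use the convenience function for combining
model fits. The simulated data, the missingness pattern and the burn-in and
thinning lengths are assumptions for illustration.

```python
import numpy as np
from statsmodels.imputation.bayes_mi import BayesGaussMI

rng = np.random.RandomState(0)

# Correlated bivariate Gaussian data (illustrative).
cov_true = np.array([[1.0, 0.6], [0.6, 1.0]])
data = rng.multivariate_normal([0.0, 1.0], cov_true, size=300)

# Knock out about 10% of each column, never both values in the same row.
u = rng.uniform(size=300)
data[u < 0.1, 0] = np.nan
data[u > 0.9, 1] = np.nan

# Pass a copy because the imputer fills in missing values in its data array.
imp = BayesGaussMI(data.copy())

# Burn-in: each update draws the missing values, the mean and the covariance.
for _ in range(50):
    imp.update()

# Collect a few imputed data sets, thinning the chain between draws.
imputed_sets = []
for _ in range(5):
    for _ in range(10):
        imp.update()
    imputed_sets.append(imp.data.copy())

print(imp.mean)  # current draw of the mean vector
print(imp.cov)   # current draw of the covariance matrix
```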

Exponential smoothing models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Exponential smoothing models are now available (introduced in #4176 by
Terence L van Zyl). These models are conceptually simple, decomposing a time
series into level, trend, and seasonal components that are constructed from
weighted averages of past observations. Nonetheless, they produce forecasts
that are competitive with more advanced models and which may be easier to
interpret.

Available models include:

- Simple exponential smoothing
- Holt's method
- Holt-Winters exponential smoothing

Improved time series index support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Handling of indexes for time series models has been overhauled (#3272) to
take advantage of recent improvements in Pandas and to shift to Pandas much of
the special case handling (especially for date indexes) that had previously been
done in Statsmodels. Benefits include more consistent behavior, a reduced
number of bugs from corner cases, and a reduction in the maintenance burden.

Although an effort was made to maintain backwards compatibility with this
change, it is possible that some undocumented corner cases that previously
worked will now raise warnings or exceptions.

State space models
~~~~~~~~~~~~~~~~~~

The state space model infrastructure has been rewritten and improved (#2845).
New features include:

- Kalman smoother rewritten in Cython for substantial performance improvements
- Simulation smoother (Durbin and Koopman, 2002)
- Fast simulation of time series for any state space model
- Univariate Kalman filtering and smoothing (Koopman and Durbin, 2000)
- Collapsed Kalman filtering and smoothing (Jungbacker and Koopman, 2014)
- Optional computation of the lag-one state autocovariance
- Use of the Scipy BLAS functions for Cython interface if available
  (`scipy.linalg.cython_blas` for Scipy >= 0.16)

These additions provide new functionality and improve performance for the
existing state space models (`SARIMAX`, `UnobservedComponents`,
`DynamicFactor`, and `VARMAX`), and they also make Bayesian estimation by
Gibbs-sampling possible.

**Warning**: this will be the last version that includes the original state
space code and supports Scipy < 0.16. The next release will only include the
new state space code.

Unobserved components models: frequency-domain seasonals
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Unobserved components models now support modeling seasonal factors from a
frequency-domain perspective with user-specified period and harmonics
(introduced in #4250 by Jordan Yoder). This not only allows for multiple
seasonal effects, but also allows the representation of seasonal components
with fewer unobserved states. This can improve computational performance and,
since it allows for a more parsimonious model, may also improve the
out-of-sample performance of the model.


Major Bugs fixed
----------------

* see github issues for a list of bug fixes included in this release
  https://github.com/statsmodels/statsmodels/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Amerged+milestone%3A0.9+label%3Atype-bug
  https://github.com/statsmodels/statsmodels/pulls?q=is%3Apr+is%3Amerged+milestone%3A0.9+label%3Atype-bug-wrong

* Refitting elastic net regularized models using the `refit=True`
  option now returns the unregularized parameters for the coefficients
  selected by the regularized fitter, as documented. #4213

* In MixedLM, a bug that produced exceptions when calling
  `random_effects_cov` on models with variance components has been
  fixed.


Backwards incompatible changes and deprecations
-----------------------------------------------

* DynamicVAR and DynamicPanelVAR are deprecated and will be removed in
  a future version. They used rolling OLS from pandas, which has since
  been removed from pandas.

* In MixedLM, names for the random effects variance and covariance
  parameters have changed, e.g. from `G RE` to `G Var` or `G x F Cov`.  This
  impacts summary output, and also may require modifications to user
  code that extracted these parameters from the fitted results object
  by name.

* In MixedLM, the names for the random effects realizations for
  variance components have been changed.  When using formulas, the
  random effect realizations are named using the column names produced
  by Patsy when parsing the formula.


Development summary and credits
-------------------------------

Besides receiving contributions for new and improved features and for bugfixes,
important contributions to general maintenance for this release came from

* Kevin Sheppard
* Peter Quackenbush
* Brock Mendel

and the general maintainer and code reviewer

* Josef Perktold

Additionally, many users contributed by participation in github issues and
providing feedback.

Thanks to all of the contributors for the 0.9 release (based on git log):

.. note::

    * Aleksandar Karakas
    * Alex Fortin
    * Alexander Belopolsky
    * Brock Mendel
    * Chad Fulton
    * ChadFulton
    * Christian Lorentzen
    * Dave Willmer
    * Dror Atariah
    * Evgeny Zhurko
    * Gerard Brunick
    * Greg Mosby
    * Jacob Kimmel
    * Jamie Morton
    * Jarvis Miller
    * Jasmine Mou
    * Jeroen Van Goey
    * Jim Correia
    * Joon Ro
    * Jordan Yoder
    * Jorge C. Leitao
    * Josef Perktold
    * Joses W. Ho
    * José Lopez
    * Joshua Engelman
    * Juan Escamilla
    * Justin Bois
    * Kerby Shedden
    * Kernc
    * Kevin Sheppard
    * Leland Bybee
    * Maxim Uvarov
    * Michael Kaminsky
    * Mosky Liu
    * Natasha Watkins
    * Nick DeRobertis
    * Niels Wouda
    * Pamphile ROY
    * Peter Quackenbush
    * Quentin Andre
    * Richard Höchenberger
    * Rob Klooster
    * Roman Ring
    * Scott Tsai
    * Soren Fuglede Jorgensen
    * Tom Augspurger
    * Tommy Odland
    * Tony Jiang
    * Yichuan Liu
    * ftemme
    * hugovk
    * kiwirob
    * malickf
    * tvanzyl
    * weizhongg
    * zveryansky

These lists of names are automatically generated based on git log, and may not
be complete.