Manuel Gebetsberger, Jakob W. Messner, Georg J. Mayr, Achim Zeileis (2018). “Estimation Methods for Nonhomogeneous Regression Models: Minimum Continuous Ranked Probability Score versus Maximum Likelihood.” Monthly Weather Review. 146(12), 4323-4338. doi:10.1175/MWR-D-17-0364.1
Nonhomogeneous regression models are widely used to statistically postprocess numerical ensemble weather prediction models. Such regression models are capable of forecasting full probability distributions and correcting for ensemble errors in the mean and variance. To estimate the corresponding regression coefficients, minimization of the continuous ranked probability score (CRPS) has widely been used in meteorological postprocessing studies and has often been found to yield more calibrated forecasts compared to maximum likelihood estimation. From a theoretical perspective, both estimators are consistent and should lead to similar results, provided the correct distribution assumption about empirical data. Differences between the estimated values indicate a wrong specification of the regression model. This study compares the two estimators for probabilistic temperature forecasting with nonhomogeneous regression, where results show discrepancies for the classical Gaussian assumption. The heavy-tailed logistic and Student's t distributions can improve forecast performance in terms of sharpness and calibration, and lead to only minor differences between the estimators employed. Finally, a simulation study confirms the importance of appropriate distribution assumptions and shows that for a correctly specified model the maximum likelihood estimator is slightly more efficient than the CRPS estimator.
https://CRAN.R-project.org/package=crch
The function crch() provides heteroscedastic (or nonhomogeneous) regression models of "gaussian" (i.e., normally distributed), "logistic", or "student" (i.e., t-distributed) response variables. Additionally, responses may be censored or truncated. Estimation methods include maximum likelihood (type = "ml", default) and minimum CRPS (type = "crps"). Boosting can also be employed for model fitting (instead of full optimization). CRPS computations leverage the excellent scoringRules package.
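To make the two estimation principles concrete, the following base R sketch fits a Gaussian location-scale regression to simulated data twice: once by maximum likelihood and once by minimizing the mean CRPS, using the closed-form CRPS of the normal distribution. It does not call crch(); the data-generating process, the helper names (predict_par, crps_gauss, neg_loglik, mean_crps), and the use of optim() are assumptions made purely for illustration.

```r
set.seed(1)
n <- 2000
x <- rnorm(n)  # regressor for the location
z <- rnorm(n)  # regressor for the (log) scale
y <- rnorm(n, mean = 1 + 2 * x, sd = exp(-0.5 + 0.5 * z))

# Map coefficients to the distribution parameters of each observation
predict_par <- function(par) {
  list(mu = par[1] + par[2] * x, sigma = exp(par[3] + par[4] * z))
}

# Closed-form CRPS of a normal distribution N(mu, sigma^2) at observation y
crps_gauss <- function(y, mu, sigma) {
  s <- (y - mu) / sigma
  sigma * (s * (2 * pnorm(s) - 1) + 2 * dnorm(s) - 1 / sqrt(pi))
}

# Objective functions: mean negative log-likelihood and mean CRPS
neg_loglik <- function(par) {
  p <- predict_par(par)
  -mean(dnorm(y, mean = p$mu, sd = p$sigma, log = TRUE))
}
mean_crps <- function(par) {
  p <- predict_par(par)
  mean(crps_gauss(y, p$mu, p$sigma))
}

start <- c(mean(y), 0, log(sd(y)), 0)
fit_ml <- optim(start, neg_loglik, method = "BFGS")
fit_crps <- optim(start, mean_crps, method = "BFGS")

est <- rbind(ml = fit_ml$par, crps = fit_crps$par)
colnames(est) <- c("(Intercept)", "x", "(scale)_(Intercept)", "(scale)_z")
round(est, 3)
```

Because the simulated response really is Gaussian, both rows should be close to the true coefficients (1, 2, -0.5, 0.5) and to each other. Replacing rnorm() by a heavier-tailed generator while keeping the Gaussian objectives is one way to provoke the kind of discrepancy discussed here.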
The plots below show histograms of the PIT (probability integral transform) for various nonhomogeneous regression models yielding probabilistic 1-day-ahead temperature forecasts at an Alpine site (Innsbruck). When the probabilistic forecasts are perfectly calibrated to the actual observations, the PIT histograms should form a straight line at density 1. The gray area illustrates the 95% consistency interval around perfect calibration, and binning is based on 5% intervals.
When a normally distributed or Gaussian response is assumed (left panel), it is shown that the maximum-likelihood model (solid line) is not well calibrated as the tails are not heavy enough. (The legend denotes this “LS” because maximizing the likelihood is equivalent to minimizing the so-called log-score.) In contrast, the minimum-CRPS model is reasonably well calibrated.
When assuming a Student-t response (right panel) there is little deviation between both estimation techniques and both are well-calibrated.
Thus, the differences between CRPS- and ML-based estimation with a Gaussian response stem from assuming a distribution whose tails are not heavy enough. In this situation, minimum-CRPS yields the somewhat more robust model fit, while both estimation techniques lead to very similar results if a more suitable response distribution is adopted. In the latter case ML is slightly more efficient than minimum-CRPS.
Heidi Seibold, Torsten Hothorn, Achim Zeileis (2018). “Generalised Linear Model Trees with Global Additive Effects.” Advances in Data Analysis and Classification. Forthcoming. doi:10.1007/s11634-018-0342-1 arXiv
Model-based trees are used to find subgroups in data which differ with respect to model parameters. In some applications it is natural to keep some parameters fixed globally for all observations while asking if and how other parameters vary across subgroups. Existing implementations of model-based trees can only deal with the scenario where all parameters depend on the subgroups. We propose partially additive linear model trees (PALM trees) as an extension of (generalised) linear model trees (LM and GLM trees, respectively), in which the model parameters are specified a priori to be estimated either globally from all observations or locally from the observations within the subgroups determined by the tree. Simulations show that the method has high power for detecting subgroups in the presence of global effects and reliably recovers the true parameters. Furthermore, treatment-subgroup differences are detected in an empirical application of the method to data from a mathematics exam: the PALM tree is able to detect a small subgroup of students that had a disadvantage in an exam with two versions while adjusting for overall ability effects.
https://CRAN.R-project.org/package=palmtree
PALM trees are employed to investigate treatment differences in a mathematics 101 exam (for first-year business and economics students) at Universität Innsbruck. Due to limited availability of seats in the exam room, students could self-select into one of two exam tracks that were conducted back to back with slightly different questions on the same topics. The question is whether this “treatment” of splitting the students into two tracks was fair in the sense that it is on average equally difficult for the two groups. To investigate the question, the data are loaded from the psychotools package, points are scaled to achieved percent in [0, 100], and the subset of variables for the analysis is selected:
data("MathExam14W", package = "psychotools")
MathExam14W$tests <- 100 * MathExam14W$tests/26
MathExam14W$pcorrect <- 100 * MathExam14W$nsolved/13
MathExam <- MathExam14W[ , c("pcorrect", "group", "tests", "study",
"attempt", "semester", "gender")]
A naive check could be whether the percentage of correct points (pcorrect) differs between the two groups:
ci <- function(object) cbind("Coefficient" = coef(object), confint(object))
ci(lm(pcorrect ~ group, data = MathExam))
## Coefficient 2.5 % 97.5 %
## (Intercept) 57.60 55.1 60.08
## group2 -2.33 -5.7 1.03
This shows that the second group achieved on average 2.33 percentage points less than the first group. But the corresponding confidence interval conveys that this difference is not significant.
However, it is conceivable that stronger (or weaker) students selected themselves more into one of the two groups. If the assignment had been random, the “treatment effect” might then have been larger or smaller. Luckily, an independent measure of the students’ ability is available, namely the percentage of points achieved in the online tests conducted during the semester prior to the exam. Adjusting for that increases the treatment effect to a decrease of 4.37 percentage points, which is now significant at the 5% level because the confidence interval excludes zero. The increase arises because students with somewhat better online test results self-selected into the second group. Moreover, the tests coefficient signals that 1 more percentage point from the online tests leads on average to 0.855 more percentage points in the written exam.
ci(lm(pcorrect ~ group + tests, data = MathExam))
## Coefficient 2.5 % 97.5 %
## (Intercept) -5.846 -13.521 1.828
## group2 -4.366 -7.231 -1.502
## tests 0.855 0.756 0.955
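The change from -2.33 to -4.37 can be traced with the usual omitted-variable identity for least squares: the unadjusted group coefficient equals the adjusted one plus the tests coefficient times the group difference in mean tests. The sketch below first plugs in the rounded coefficients from the two outputs and then checks the identity on simulated data; the simulated data set and all object names are hypothetical and not part of the original analysis.

```r
# Rounded coefficients taken from the two regression outputs
b_unadjusted <- -2.33   # group2 in pcorrect ~ group
b_adjusted   <- -4.366  # group2 in pcorrect ~ group + tests
b_tests      <-  0.855  # tests  in pcorrect ~ group + tests

# Implied difference in mean online test percentage (group 2 minus group 1)
implied_gap <- (b_unadjusted - b_adjusted) / b_tests
round(implied_gap, 2)

# Check of the identity on simulated (hypothetical) data
set.seed(42)
d <- data.frame(group = factor(rep(1:2, each = 200)))
d$tests <- 70 + 3 * (d$group == "2") + rnorm(400, sd = 15)
d$pcorrect <- -5 + 0.85 * d$tests - 4 * (d$group == "2") + rnorm(400, sd = 18)

short <- coef(lm(pcorrect ~ group, data = d))[["group2"]]
long  <- coef(lm(pcorrect ~ group + tests, data = d))
gap   <- coef(lm(tests ~ group, data = d))[["group2"]]

# The identity holds exactly for least squares, up to floating point error
all.equal(short, long[["group2"]] + long[["tests"]] * gap)
```

The sign of the implied gap indicates which group did better in the online tests on average: with a positive tests coefficient, adjustment makes the group effect more negative exactly when the second group has the higher mean.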
Finally, PALM trees are used to assess whether there are subgroups of differential group treatment effects when adjusting for a global additive tests effect. Potential subgroups can be formed from the covariates tests, type of study (three-year bachelor vs. four-year diploma), the number of times the students attempted the exam (attempt), number of semesters (semester), and gender. Using palmtree this can be easily carried out:
library("palmtree")
palmtree_math <- palmtree(pcorrect ~ group | tests | tests +
study + attempt + semester + gender, data = MathExam)
print(palmtree_math)
## Partially additive linear model tree
##
## Model formula:
## pcorrect ~ group | tests + study + attempt + semester + gender
##
## Fitted party:
## [1] root
## | [2] attempt <= 1
## | | [3] tests <= 92.3: n = 352
## | | (Intercept) group2
## | | -7.09 -3.00
## | | [4] tests > 92.3: n = 79
## | | (Intercept) group2
## | | 14.0 -14.5
## | [5] attempt > 1: n = 298
## | (Intercept) group2
## | 2.33 -1.70
##
## Number of inner nodes: 2
## Number of terminal nodes: 3
## Number of parameters per node: 2
## Objective function (residual sum of squares): 253218
##
## Linear fixed effects (from palm model):
## tests
## 0.787
A somewhat enhanced version of plot(palmtree_math) is shown below:
This indicates that for most students the group treatment effect is indeed negligible. However, for the subgroup of “good” students (with a high percentage correct in the online tests) in the first attempt, the exam in the second group was indeed more difficult. On average the students in the second group obtained 14.5 percentage points less than in the first group.
ci(palmtree_math$palm)
## Coefficient 2.5 % 97.5 %
## (Intercept) -7.088 -16.148 1.971
## .tree4 21.069 13.348 28.791
## .tree5 9.421 5.168 13.673
## tests 0.787 0.671 0.903
## .tree3:group2 -2.997 -6.971 0.976
## .tree4:group2 -14.494 -22.921 -6.068
## .tree5:group2 -1.704 -5.965 2.557
The absolute size of this group difference is still moderate, though, corresponding to just under two exercises out of 13 (14.5% of 13 is about 1.9).
In addition to the empirical case study the manuscript also provides an extensive simulation study comparing the performance of PALM trees in treatment-subgroup scenarios to standard linear model (LM) trees, optimal treatment regime (OTR) trees (following Zhang et al. 2012), and the STIMA algorithm (simultaneous threshold interaction modeling algorithm). The study evaluates the methods with respect to (1) finding the correct subgroups, (2) not splitting when there are no subgroups, (3) finding the optimal treatment regime, and (4) correctly estimating the treatment effect.
Here we just briefly highlight the results for question (1): Are the correct subgroups found? The figure below shows the mean number of subgroups (over 150 simulated data sets) and the mean adjusted Rand index (ARI) for increasing treatment effect differences Δ_{β} and number of observations n.
This shows that PALM trees perform increasingly well and somewhat better with respect to these metrics than the competitors. More details on the different scenarios and corresponding evaluations can be found in the manuscript. More replication materials are provided along with the manuscript on the publisher’s web page.
Thorsten Simon, Peter Fabsic, Georg J. Mayr, Nikolaus Umlauf, Achim Zeileis (2018). “Probabilistic Forecasting of Thunderstorms in the Eastern Alps.” Monthly Weather Review. 146(9), 2999-3009. doi:10.1175/MWR-D-17-0366.1
A probabilistic forecasting method to predict thunderstorms in the European eastern Alps is developed. A statistical model links lightning occurrence from the ground-based Austrian Lightning Detection and Information System (ALDIS) detection network to a large set of direct and derived variables from a numerical weather prediction (NWP) system. The NWP system is the high-resolution run (HRES) of the European Centre for Medium-Range Weather Forecasts (ECMWF) with a grid spacing of 16 km. The statistical model is a generalized additive model (GAM) framework, which is estimated by Markov chain Monte Carlo (MCMC) simulation. Gradient boosting with stability selection serves as a tool for selecting a stable set of potentially nonlinear terms. Three grids from 64 x 64 to 16 x 16 km^{2} and five forecast horizons from 5 days to 1 day ahead are investigated to predict thunderstorms during afternoons (1200–1800 UTC). Frequently selected covariates for the nonlinear terms are variants of convective precipitation, convective potential available energy, relative humidity, and temperature in the midlayers of the troposphere, among others. All models, even for a lead time of 5 days, outperform a forecast based on climatology in an out-of-sample comparison. An example case illustrates that coarse spatial patterns are already successfully forecast 5 days ahead.
https://CRAN.R-project.org/package=bamlss
Predicting thunderstorms in complex terrain (like the Austrian Alps) is a challenging task since one of the main forecasting tools, NWP systems, cannot fully resolve convective processes or circulations and exchange processes over complex topography. However, using a boosted binary GAM based on a broad range of NWP outputs, useful forecasts can be obtained up to 5 days ahead. As an illustration, lightning activity for the afternoon of 2015-07-22 is shown in the top-left panel below, indicating thunderstorms in many areas in the west but not the east. While the corresponding baseline climatology (top middle) has a low probability of thunderstorms for the entire region, the NWP-based probabilistic forecasts (bottom row) highlight increased probabilities already 5 days ahead, becoming much more clear cut when moving to 3 days and 1 day ahead.
More precisely, the probability of thunderstorms is predicted based on a binary logit GAM that allows for potentially nonlinear smooth effects in all NWP variables considered. It selects the relevant variables by gradient boosting coupled with stability selection. Effects and 95% credible intervals of the model for day 1 are estimated via MCMC sampling and shown below (on the logit scale). The number in the bottom-right corner of each panel indicates the absolute range of the effect. The x-axes are cropped at the 1% and 99% quantiles of the respective covariate to enhance graphical representation.
(Note: As the data cannot be shared freely, the customary replication materials unfortunately cannot be provided.)
Florian Wickelmaier, Achim Zeileis (2018). “Using Recursive Partitioning to Account for Parameter Heterogeneity in Multinomial Processing Tree Models.” Behavior Research Methods, 50(3), 1217-1233. doi:10.3758/s13428-017-0937-z
In multinomial processing tree (MPT) models, individual differences between the participants in a study can lead to heterogeneity of the model parameters. While subject covariates may explain these differences, it is often unknown in advance how the parameters depend on the available covariates, that is, which variables play a role at all, interact, or have a nonlinear influence, etc. Therefore, a new approach for capturing parameter heterogeneity in MPT models is proposed based on the machine learning method MOB for model-based recursive partitioning. This procedure recursively partitions the covariate space, leading to an MPT tree with subgroups that are directly interpretable in terms of effects and interactions of the covariates. The pros and cons of MPT trees as a means of analyzing the effects of covariates in MPT model parameters are discussed based on simulation experiments as well as on two empirical applications from memory research. Software that implements MPT trees is provided via the mpttree function in the psychotree package in R.
https://CRAN.R-project.org/package=psychotree
To highlight how MPT trees can capture the influence of covariates on the parameters in MPT models, data are analyzed from a source monitoring experiment that was conducted at the Department of Psychology, University of Tübingen.
Study: Participants were presented with items from two different sources (labeled A vs. B) and afterwards, in a memory test, were shown old and new items intermixed and asked to classify them as either A, B, or new (N). In the experiment the two sources were controlled such that half of the respondents had to read the presented items either quietly (A = think) or aloud (B = say). The other half wrote them down (A = write) or read them aloud (B = say). Items were presented on a computer screen at a self-paced rate. In the final memory test, the studied items and distractor items had to be classified as either A, B, or new (N) by pressing a button on the screen.
Model: To infer the cognitive processes a well-known MPT model is employed that was established by the late Bill Batchelder (who passed away earlier this month) and David Riefer for the source monitoring paradigm:
Explanation: Consider the paths from the root to an A response for a Source A item (left). With probability D1, a respondent detects an item as old. If, in a second step, he/she is able to discriminate the item from a Source B item (d1), then the response will correctly be A; else, if discrimination fails (1 - d1), a correct A response can only be guessed with probability a. If the item was not detected as old in the first place (1 - D1), the response will be A only if there are both a response bias for “old” (b) and a guess for the item being Source A (g). The remaining paths in the left tree lead to classification errors (B, N). The trees for Source B and new items work analogously. Moreover, a = g is assumed for identifiability and discriminability is assumed to be equal for both sources (d1 = d2) as in a similar example in Batchelder and Riefer (1990).
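The path logic in the explanation can be written out as a small R function. The expression for an A response follows the three paths described in the text; the expressions for B and N responses are filled in here as the complementary branches of the same tree, which is an assumption because the tree figure itself is not reproduced. The function name source_a_probs and the parameter values are hypothetical.

```r
# Response probabilities for a Source A item in the source monitoring model
source_a_probs <- function(D1, d1, a, b, g) {
  c(
    # detected and discriminated; detected, not discriminated, guessed A;
    # not detected, biased towards "old", guessed A
    A = D1 * d1 + D1 * (1 - d1) * a + (1 - D1) * b * g,
    # complementary guessing branches leading to a wrong B response
    B = D1 * (1 - d1) * (1 - a) + (1 - D1) * b * (1 - g),
    # not detected and no response bias for "old"
    N = (1 - D1) * (1 - b)
  )
}

# Hypothetical parameter values, with a = g as assumed for identifiability
p <- source_a_probs(D1 = 0.6, d1 = 0.5, a = 0.5, b = 0.2, g = 0.5)
p
sum(p)  # the three response probabilities add up to 1
```

Writing the category probabilities this way makes it easy to see how a change in a single parameter, for example a lower D1 in one subgroup, shifts probability mass from correct A responses towards N responses.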
Question: Do these probabilities in the source monitoring (D1, D2, d, b, g) depend on the source condition (think-say vs. write-say), or gender or age of the participants?
Answer: The MPT-based model tree (MOB) finds a highly significant difference between the think-say and write-say source condition. Furthermore, there is an age difference in the think-say condition that is significant at a Bonferroni-corrected 5% level. Gender is not found to play a significant role.
Probabilities: For the think-say sources (Nodes 3 and 4), probability D2 exceeds D1 indicating an advantage of say items over think items with respect to detectability. For the write-say sources (Node 5), D2 and D1 are about the same indicating that for these sources no such advantage exists. The think-say subgroup is further split by age with the older participants having lower values on D1 and d, which suggests lower detectability of think items and lower discriminability as compared to the younger participants. This age effect seems to depend on the type of sources as there is no such effect for the write-say sources. In addition, there are only small effects for the bias parameters b and g, which are psychologically less interesting. Some of the differences in the probabilities across groups/nodes can be brought out even more clearly by parameter estimates and corresponding 95% Wald confidence intervals:
Most of the improvements and new features pertain to clustered covariances which had been introduced to the sandwich package last year in version 2.4-0. For this my PhD student Susanne Berger and I (Achim Zeileis) teamed up with Nathaniel Graham, the maintainer of the multiwayvcov package. With the new version 2.5-0 almost all features from multiwayvcov have been ported to sandwich, mostly implemented from scratch along with generalizations, extensions, speed-ups, etc.
The full list of changes can be seen in the NEWS file. The most important changes are:
The manuscript vignette("sandwich-CL", package = "sandwich") has been significantly improved based on very helpful and constructive reviewer feedback. See also below.
The cluster argument for the vcov*() functions can now be a formula, simplifying its usage (see below). NA handling has been added as well.
Clustered bootstrap covariances have been reimplemented and extended in vcovBS(). A dedicated method for lm objects is considerably faster now and also includes various wild bootstraps.
Convenient parallelization for bootstrap covariances is now available.
Bugs reported by James Pustejovsky and Brian Tsay, respectively, have been fixed.
Susanne Berger, Nathaniel Graham, Achim Zeileis: Various Versatile Variances: An Object-Oriented Implementation of Clustered Covariances in R
Clustered covariances or clustered standard errors are very widely used to account for correlated or clustered data, especially in economics, political sciences, or other social sciences. They are employed to adjust the inference following estimation of a standard least-squares regression or generalized linear model estimated by maximum likelihood. Although many publications just refer to “the” clustered standard errors, there is a surprisingly wide variety of clustered covariances, particularly due to different flavors of bias corrections. Furthermore, while the linear regression model is certainly the most important application case, the same strategies can be employed in more general models (e.g. for zero-inflated, censored, or limited responses).
In R, functions for covariances in clustered or panel models have been somewhat scattered or available only for certain modeling functions, notably the (generalized) linear regression model. In contrast, an object-oriented approach to “robust” covariance matrix estimation - applicable beyond lm() and glm() - is available in the sandwich package but has been limited to the case of cross-section or time series data. Now, this shortcoming has been corrected in sandwich (starting from version 2.4-0): Based on methods for two generic functions (estfun() and bread()), clustered and panel covariances are now provided in vcovCL(), vcovPL(), and vcovPC(). Moreover, clustered bootstrap covariances, based on update() for models on bootstrap samples of the data, are provided in vcovBS(). These are directly applicable to models from many packages, e.g., MASS, pscl, countreg, and betareg, among others. Some empirical illustrations are provided as well as an assessment of the methods’ performance in a simulation study.
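To illustrate the idea behind building clustered covariances from score contributions and a "bread" matrix, here is a minimal base R sketch for a linear model. It deliberately does not use the sandwich package: the scores are computed by hand, no finite-sample or cluster adjustments are applied (so the numbers need not match vcovCL() with its default settings), and the simulated data and the helper name cluster_vcov are hypothetical.

```r
set.seed(1)
n_cl <- 50   # number of clusters
m <- 5       # observations per cluster
cl <- rep(seq_len(n_cl), each = m)

# Regressor and error both contain a cluster-level component
x <- rnorm(n_cl * m) + rep(rnorm(n_cl), each = m)
y <- 1 + 0.5 * x + rep(rnorm(n_cl), each = m) + rnorm(n_cl * m)
fit <- lm(y ~ x)

X <- model.matrix(fit)
u <- residuals(fit)
ef <- X * u                       # score contributions, one row per observation
bread_inv <- solve(crossprod(X))  # (X'X)^{-1}

# Sandwich: bread * meat * bread, with scores summed within clusters
cluster_vcov <- function(ef, cluster) {
  ef_cl <- rowsum(ef, cluster)
  bread_inv %*% crossprod(ef_cl) %*% bread_inv
}

vc_cl  <- cluster_vcov(ef, cl)            # clustered, no adjustment
vc_hc0 <- cluster_vcov(ef, seq_along(y))  # every observation its own cluster

cbind(
  classical = sqrt(diag(vcov(fit))),
  HC0       = sqrt(diag(vc_hc0)),
  clustered = sqrt(diag(vc_cl))
)
```

Treating every observation as its own cluster reproduces the basic HC0 sandwich estimator, which shows that the clustered version differs only in how the scores are aggregated before forming the "meat".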
To show how easily the clustered covariances from sandwich can be applied in practice, two short illustrations from the manuscript/vignette are used. In addition to the sandwich package the lmtest package is employed to easily obtain Wald tests of all coefficients:
library("sandwich")
library("lmtest")
options(digits = 4)
First, a Poisson model with clustered standard errors from Aghion et al. (2013, American Economic Review) is replicated. To investigate the effect of institutional ownership on innovation (as captured by citation-weighted patent counts) they employ a (pseudo-)Poisson model with industry/year fixed effects and standard errors clustered by company, see their Table I(3):
data("InstInnovation", package = "sandwich")
ii <- glm(cites ~ institutions + log(capital/employment) + log(sales) + industry + year,
data = InstInnovation, family = poisson)
coeftest(ii, vcov = vcovCL, cluster = ~ company)[2:4, ]
## Estimate Std. Error z value Pr(>|z|)
## institutions 0.009687 0.002406 4.026 5.682e-05
## log(capital/employment) 0.482884 0.135953 3.552 3.826e-04
## log(sales) 0.820318 0.041523 19.756 7.187e-87
Second, a simple linear regression model with double-clustered standard errors is replicated using the well-known Petersen data from Petersen (2009, Review of Financial Studies):
data("PetersenCL", package = "sandwich")
p <- lm(y ~ x, data = PetersenCL)
coeftest(p, vcov = vcovCL, cluster = ~ firm + year)
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0297 0.0651 0.46 0.65
## x 1.0348 0.0536 19.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In addition to the description of the methods and the software, the manuscript/vignette also contains a simulation study that investigates the properties of clustered covariances. In particular, this assesses how well the methods perform in models beyond linear regression but also compares different types of bias adjustments (HC0-HC3) and alternative estimation techniques (generalized estimating equations, mixed effects).
The detailed results are presented in the manuscript - here we just show the results from one of the simulation experiments: The empirical coverage of 95% Wald confidence intervals is depicted for a beta regression, zero-inflated Poisson, and zero-truncated Poisson model. With increasing correlation within the clusters the conventional “standard” errors and “basic” robust sandwich standard errors become too small thus leading to a drop in empirical coverage. However, both clustered HC0 standard errors (CL-0) and clustered bootstrap standard errors (BS) perform reasonably well, leading to empirical coverages close to the nominal 0.95.
Details: Data sets were simulated with 100 clusters of 5 observations each. The cluster correlation (on the x-axis) was generated with a Gaussian copula. The only regressor had a correlation of 0.25 with the clustering variable. Empirical coverages were computed from 10,000 replications.
Last week France won the 2018 FIFA World Cup in a match against Croatia in Russia, thus delivering an entertaining final to a sportful tournament. Many perceived the course of the tournament as very unexpected and surprising because many of the “usual” favorites like Brazil, Germany, Spain, or Argentina did not even make it to the semi-finals. And in contrast, teams like host Russia and finalist Croatia proceeded further than expected. However, does this really mean that expectations of experts and fans were so wrong? Or, how surprising was the result given pre-tournament predictions?
Therefore, we want to take a critical look back at our own Probabilistic Forecast for the 2018 FIFA World Cup based on the bookmaker consensus model that aggregated the expert judgments of 26 bookmakers and betting exchanges. A set of presentation slides (in PDF format) with explanations of the model and its evaluation are available to accompany this blog post: slides.pdf
Despite some surprises in the tournament, the probabilistic bookmaker consensus forecast fitted reasonably well. Although it is hard to evaluate probabilistic forecasts with only one realization of the tournament, by and large most outcomes do not deviate systematically from the probabilities assigned to them.
However, there is one notable exception: Expectations about defending champion Germany were clearly wrong. “Die Mannschaft” was predicted to advance from the group stage to the round of 16 with probability 89.1% - and they not only failed to do so but instead came in last in their group.
Other events that were perceived as surprising were not so unlikely to begin with, e.g., for Argentina it was more likely to get eliminated before the quarter finals (predicted probability: 51%) than to proceed further. Or they were not unlikely conditional on previous tournament events. Examples for the latter are the pre-tournament prediction for Belgium beating Brazil in a match (40%) or Russia beating Spain (33%). Of course, another outcome of those matches was more likely but compared with these predictions the results were maybe not as surprising as perceived by many. Finally, the pre-tournament prediction of Croatia making it to the final was only 6% but conditional on the events from the round of 16 (especially with Spain being eliminated) this increased to 27% (only surpassed by England with 36%).
The animated GIF below shows the pre-tournament predictions for each team winning the 2018 FIFA World Cup. In the animation the teams that “survived” over the course of the tournament are highlighted. This clearly shows that the elimination of Germany (winning probability: 15.8%) was the big surprise in the group stage but otherwise almost all of the teams expected to proceed also did so. Afterwards, two of the other main favorites, Brazil (16.6%) and Spain (12.5%), dropped out but eventually the fourth team with double-digit winning probability (France, 12.1%) prevailed.
Compared to other rankings of the teams in the tournament, the bookmaker consensus model did quite well. To illustrate this we compute the Spearman rank correlation of observed partial tournament ranking (1 FRA, 2 CRO, 3 BEL, 4 ENG, 6.5 URU, 6.5 BRA, …) with the bookmaker consensus model as well as Elo and FIFA rating.
Method | Correlation |
---|---|
Bookmaker consensus | 0.704 |
Elo rating | 0.592 |
FIFA rating | 0.411 |
As there is no good way to assess the predicted probabilities of winning the title with only one realization of the tournament, we at least (roughly) assess the quality of the predicted probabilities for the individual matches. To do so, we split the 63 matches into three groups, depending on the winning probability of the stronger team.
This gives us matches that were predicted to be almost even (50-58%), had moderate advantages for the stronger team (58-72%), or clear advantages for the stronger team (72-85%). It turns out that in the latter two groups the average predicted probabilities (dashed red line) match the actual observed proportions quite well. Only in the “almost even” group, the stronger teams won slightly more often than expected.
As already mentioned above, there was only one big surprise in the group stage - with Germany being eliminated. As the tables below show, most other results from the group rankings conformed quite well with the predicted probabilities to “survive” the group stage.
Group A
Rank | Team | Prob. (in %) |
---|---|---|
1 | URU | 68.1 |
2 | RUS | 64.2 |
3 | KSA | 19.2 |
4 | EGY | 39.3 |
Group B
Rank | Team | Prob. (in %) |
---|---|---|
1 | ESP | 85.9 |
2 | POR | 66.3 |
3 | IRN | 26.5 |
4 | MAR | 27.3 |
Group C
Rank | Team | Prob. (in %) |
---|---|---|
1 | FRA | 87.0 |
2 | DEN | 46.7 |
3 | PER | 31.7 |
4 | AUS | 25.2 |
Group D
Rank | Team | Prob. (in %) |
---|---|---|
1 | CRO | 58.7 |
2 | ARG | 78.7 |
3 | NGA | 41.2 |
4 | ISL | 30.9 |
Group E
Rank | Team | Prob. (in %) |
---|---|---|
1 | BRA | 89.9 |
2 | SUI | 45.4 |
3 | SRB | 39.0 |
4 | CRC | 22.6 |
Group F
Rank | Team | Prob. (in %) |
---|---|---|
1 | SWE | 44.5 |
2 | MEX | 45.2 |
3 | KOR | 26.8 |
4 | GER | 89.1 |
Group G
Rank | Team | Prob. (in %) |
---|---|---|
1 | BEL | 81.7 |
2 | ENG | 75.6 |
3 | TUN | 23.5 |
4 | PAN | 23.2 |
Group H
Rank | Team | Prob. (in %) |
---|---|---|
1 | COL | 64.6 |
2 | JPN | 36.3 |
3 | SEN | 37.9 |
4 | POL | 57.9 |
Two weeks ago we published our Probabilistic Forecast for the 2018 FIFA World Cup: By adjusting quoted bookmakers’ odds for the profit margins of the bookmakers (also known as overrounds), transforming and averaging them, a predicted winning probability for each team was obtained. By employing millions of tournament simulations in combination with a model for pairwise comparisons (or matches) we could also obtain forecasted probabilities for each team to progress through the tournament. In our original study, we visualized these by “survival” curves. See the working paper for more details and references.
Here, we present another display that highlights the likely flow of all teams through the tournament simultaneously. Click on the image to obtain an interactive full-width version of this Sankey diagram produced by Plotly.
Compared to the survival curves shown in our original study this visualization brings out more clearly at which stages of the tournament the strong teams are most likely to meet. Its usage was inspired by the nice working paper On Elo based prediction models for the FIFA Worldcup 2018 by Lorenz A. Gilch and Sebastian Müller.
In a few days we will start learning which of these paths will actually come true. Enjoy the 2018 FIFA World Cup!
The model is the so-called bookmaker consensus model which has been proposed by Leitner, Hornik, and Zeileis (2010, International Journal of Forecasting, https://doi.org/10.1016/j.ijforecast.2009.10.001) and successfully applied in previous football tournaments, e.g., correctly predicting the winner of the 2010 FIFA World Cup and three out of four semifinalists at the 2014 FIFA World Cup. This time the forecast shows that Brazil is the favorite with a forecasted winning probability of 16.6%, closely followed by the defending World Champion and 2017 FIFA Confederations Cup winner Germany with a winning probability of 15.8%. Two other teams also have double-digit winning probabilities: Spain and France with 12.5% and 12.1%, respectively. More details are displayed in the following barchart.
These probabilistic forecasts have been obtained by model-based averaging of the quoted winning odds for all teams across bookmakers. More precisely, the odds are first adjusted for the bookmakers’ profit margins (“overrounds”, on average 15.2%), averaged on the log-odds scale to a consensus rating, and then transformed back to winning probabilities.
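As a rough illustration of the three steps just described (margin adjustment, averaging on the log-odds scale, back-transformation), here is a minimal Python sketch. The odds are made up, the function names are hypothetical, and the overround is removed by simple proportional normalization, which is an assumption and may differ from the adjustment used in the working paper.

```python
import math

def implied_probabilities(decimal_odds):
    """Turn one bookmaker's decimal odds into probabilities summing to 1.

    Assumption: the overround is removed by proportional normalization.
    Returns the adjusted probabilities and the bookmaker's overround.
    """
    raw = [1.0 / o for o in decimal_odds]
    total = sum(raw)              # exceeds 1 when the bookmaker keeps a margin
    overround = total - 1.0
    return [r / total for r in raw], overround

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def consensus(odds_by_bookmaker):
    """Average adjusted probabilities across bookmakers on the log-odds scale."""
    adjusted = [implied_probabilities(odds)[0] for odds in odds_by_bookmaker]
    n_books = len(adjusted)
    n_teams = len(adjusted[0])
    result = []
    for t in range(n_teams):
        mean_logit = sum(logit(book[t]) for book in adjusted) / n_books
        result.append(inv_logit(mean_logit))
    return result

# Hypothetical decimal odds from two bookmakers for four outcomes.
odds = [
    [5.0, 5.5, 7.0, 1.6],
    [5.5, 5.0, 7.5, 1.55],
]
probs = consensus(odds)
print([round(p, 3) for p in probs])
```

Because the averaging happens on the log-odds scale, the back-transformed consensus probabilities need not sum to exactly one; a final renormalization could be added if that matters for the application.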
A more detailed description of the model as well as its results for the 2018 FIFA World Cup are available in a new working paper. The raw bookmakers’ odds as well as the forecasts for all teams are also available in machine-readable form in fifa2018.csv.
Although forecasting the winning probabilities for the 2018 FIFA World Cup is probably of most interest, the bookmaker consensus forecasts can also be employed to infer team-specific abilities using an “inverse” tournament simulation:
Using this idea, abilities in step 1 can be chosen such that the simulated winning probabilities in step 3 closely match those from the bookmaker consensus shown above.
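The idea of tuning abilities until the simulated winning probabilities match a target can be sketched on a toy example. The Python code below is only an illustration under strong assumptions: a four-team knockout bracket instead of the full World Cup, pairwise probabilities of the form ability_A / (ability_A + ability_B), exact winning probabilities instead of Monte Carlo simulation, and a simple damped multiplicative update that is not claimed to be the authors' algorithm. All names and numbers are hypothetical.

```python
import math

def win_probs(abilities):
    """Exact title probabilities in a 4-team knockout: (0 v 1), (2 v 3), final."""
    def beat(i, j):
        return abilities[i] / (abilities[i] + abilities[j])

    semifinal_opponent = {0: 1, 1: 0, 2: 3, 3: 2}
    probs = []
    for i in range(4):
        opp = semifinal_opponent[i]
        k, l = [j for j in range(4) if j not in (i, opp)]
        # Probability of winning the final, averaged over the possible opponents.
        final = beat(k, l) * beat(i, k) + beat(l, k) * beat(i, l)
        probs.append(beat(i, opp) * final)
    return probs

def fit_abilities(target, n_iter=500, step=0.5):
    """Adjust log-abilities until win_probs() reproduces the target probabilities."""
    log_a = [0.0] * 4                       # start with equal abilities
    for _ in range(n_iter):
        p = win_probs([math.exp(x) for x in log_a])
        # Raise the ability of teams that win too rarely, lower it otherwise.
        log_a = [x + step * (math.log(t) - math.log(q))
                 for x, t, q in zip(log_a, target, p)]
        mean = sum(log_a) / 4.0             # abilities are only defined up to scale
        log_a = [x - mean for x in log_a]
    return [math.exp(x) for x in log_a]

# Hypothetical target, generated from known abilities for illustration.
target = win_probs([4.0, 3.0, 2.0, 1.0])
fitted = fit_abilities(target)
print([round(p, 4) for p in win_probs(fitted)])
```

In the real application the target would be the bookmaker consensus probabilities and the inner function would be a full tournament simulation, so each iteration is far more expensive and subject to simulation noise.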
A classical approach to obtain winning probabilities in pairwise comparisons (i.e., matches between teams/players) is the Bradley-Terry model, which is similar to the Elo rating, popular in sports. The Bradley-Terry approach models the probability that a Team A beats a Team B by their associated abilities (or strengths):
$\mathrm{Pr}(A \text{ beats } B) = \frac{\mathrm{ability}_{A}}{\mathrm{ability}_{A} + \mathrm{ability}_{B}}.$
Coupled with the “inverse” simulation of the tournament, as described in steps 1-3 above, this yields pairwise probabilities for each possible match. The following heatmap shows the probabilistic forecasts for each match with light gray signalling approximately equal chances and green vs. pink signalling advantages for Team A or B, respectively.
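For readers who prefer code to formulas, the Bradley-Terry probability can be written as a one-line function. This is a minimal Python sketch; the function name and the ability values are hypothetical and chosen only to show the arithmetic.

```python
def bradley_terry(ability_a, ability_b):
    """Probability that team A beats team B given positive abilities."""
    return ability_a / (ability_a + ability_b)

# Hypothetical abilities: A is three times as strong as B.
p_a = bradley_terry(3.0, 1.0)
p_b = bradley_terry(1.0, 3.0)
print(p_a, p_b)  # 0.75 0.25
```

Only the ratio of the two abilities matters, so multiplying all abilities by the same constant leaves every pairwise probability unchanged.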
As every single match can be simulated with the pairwise probabilities above, it is also straightforward to simulate the entire tournament (here: 1,000,000 times) providing “survival” probabilities for each team across the different stages.
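The simulation principle can be sketched in a few lines of Python. The example is deliberately simplified: it covers only a single-elimination bracket (no group stage, no draws), uses four hypothetical teams with made-up abilities, and runs far fewer replications than the 1,000,000 mentioned above.

```python
import random

def beats(ability_a, ability_b):
    """Bradley-Terry probability that A beats B."""
    return ability_a / (ability_a + ability_b)

def play_knockout(teams, abilities, rng):
    """Play one single-elimination bracket; teams are listed in bracket order."""
    survivors = list(teams)          # length is assumed to be a power of two
    while len(survivors) > 1:
        next_round = []
        for i in range(0, len(survivors), 2):
            a, b = survivors[i], survivors[i + 1]
            p = beats(abilities[a], abilities[b])
            next_round.append(a if rng.random() < p else b)
        survivors = next_round
    return survivors[0]

def simulate(teams, abilities, n_runs=20000, seed=1):
    """Relative frequency of each team winning the bracket."""
    rng = random.Random(seed)
    wins = {t: 0 for t in teams}
    for _ in range(n_runs):
        wins[play_knockout(teams, abilities, rng)] += 1
    return {t: w / n_runs for t, w in wins.items()}

# Hypothetical teams and abilities.
abilities = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0}
freq = simulate(["A", "B", "C", "D"], abilities)
print(freq)
```

Recording the round in which each team is eliminated, rather than only the winner, would give the stage-wise “survival” probabilities in the same way.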
This also shows that indeed the most likely final is a match of the top favorites Brazil and Germany (with a probability of 5.5%) where Brazil has the chance to compensate for the dramatic semifinal in Belo Horizonte four years ago. However, given that it comes to this final, the chances are almost even (50.6% for Brazil vs. 49.4% for Germany). For the semifinals it is most likely (with a probability of 9.4%) that Brazil and France meet in the first semifinal (with chances slightly in favor of Brazil in such a match, 53.5%) while Germany and Spain most likely (with 9.2%) play the second semifinal (with chances slightly in favor of Germany with 53.1%).
The bookmaker consensus model has performed well in previous tournaments, often predicting winners or finalists correctly. However, all forecasts are probabilistic, clearly below 100%, and thus by no means certain. This showed prominently at the UEFA Euro 2016:
This illustrates that small things can often make the decisive difference in football, which is why predictions with high probabilities cannot be made. Moreover, it is in the very nature of predictions that they can be wrong, otherwise football tournaments would be very boring. The only forecast that can be made with certainty is that the World Cup will be an exciting tournament that football fans worldwide look forward to.
In addition to this forecast, other interesting approaches will surely also be published in the coming days, e.g., using the ideas of Groll, Schauberger, and Tutz (2016). Also, Claus Ekstrøm will evaluate and compare predictions for the 2018 FIFA World Cup, see his slides, video, code.
As a final remark: Betting on the outcome based on the results presented here is not recommended. Not only because the winning probabilities are clearly far below 100% but, more importantly, because the bookmakers have a sizeable profit margin of about 15.2%, which ensures that the best chances of making money based on sports betting lie with them!
Zeileis A, Leitner C, Hornik K (2018). “Probabilistic Forecasts for the 2018 FIFA World Cup Based on the Bookmaker Consensus Model”, Working Paper 2018-09, Working Papers in Economics and Statistics, Research Platform Empirical and Experimental Economics, Universität Innsbruck. http://EconPapers.RePEc.org/RePEc:inn:wpaper:2018-09
]]>Lisa Schlosser, Torsten Hothorn, Reto Stauffer, Achim Zeileis (2018). “Distributional Regression Forests for Probabilistic Precipitation Forecasting in Complex Terrain.” arXiv.org E-Print Archive arXiv:1804.02921 [stat.ME]. https://arxiv.org/abs/1804.02921
To obtain a probabilistic model for a dependent variable based on some set of explanatory variables, a distributional approach is often adopted where the parameters of the distribution are linked to regressors. In many classical models this only captures the location of the distribution but over the last decade there has been increasing interest in distributional regression approaches modeling all parameters including location, scale, and shape. Notably, so-called non-homogeneous Gaussian regression (NGR) models both mean and variance of a Gaussian response and is particularly popular in weather forecasting. More generally, the GAMLSS framework allows establishing generalized additive models for location, scale, and shape with smooth linear or nonlinear effects. However, when variable selection is required and/or there are non-smooth dependencies or interactions (especially unknown or of high order), it is challenging to establish a good GAMLSS. A natural alternative in these situations would be the application of regression trees or random forests but, so far, no general distributional framework is available for these. Therefore, a framework for distributional regression trees and forests is proposed that blends regression trees and random forests with classical distributions from the GAMLSS framework as well as their censored or truncated counterparts. To illustrate these novel approaches in practice, they are employed to obtain probabilistic precipitation forecasts at numerous sites in a mountainous region (Tyrol, Austria) based on a large number of numerical weather prediction quantities. It is shown that the novel distributional regression forests automatically select variables and interactions, performing on par or often even better than GAMLSS specified either through prior meteorological knowledge or a computationally more demanding boosting approach.
R package disttree at https://R-Forge.R-project.org/R/?group_id=261
Distributional trees as part of the parametric and recursive partitioning modeling toolbox.
Total precipitation predictions by a distributional forest at station Axams for July 24 in 2009, 2010, 2011 and 2012 learned on data from 1985-2008. Observations are left-censored at 0.
Map of Tyrol coding the best-performing model for each station (type of symbol). The color codes whether the distributional forest had higher (green) or lower (red) CRPS compared to the best of the other three models. Station Axams is highlighted in bold.
]]>Nikolaus Umlauf, Nadja Klein, Achim Zeileis (2018). “BAMLSS: Bayesian Additive Models for Location, Scale and Shape (and Beyond).” Journal of Computational and Graphical Statistics. Forthcoming. doi:10.1080/10618600.2017.1407325 [ pdf ]
Bayesian analysis provides a convenient setting for the estimation of complex generalized additive regression models (GAMs). Since computational power has tremendously increased in the past decade it is now possible to tackle complicated inferential problems, e.g., with Markov chain Monte Carlo simulation, on virtually any modern computer. This is one of the reasons why Bayesian methods have become increasingly popular, leading to a number of highly specialized and optimized estimation engines and with attention shifting from conditional mean models to probabilistic distributional models capturing location, scale, shape (and other aspects) of the response distribution. In order to embed many different approaches suggested in literature and software, a unified modeling architecture for distributional GAMs is established that exploits distributions, estimation techniques (posterior mode or posterior mean), and model terms (fixed, random, smooth, spatial, …). It is shown that within this framework implementing algorithms for complex regression problems, as well as the integration of already existing software, is relatively straightforward. The usefulness is emphasized with two complex and computationally demanding application case studies: a large daily precipitation climatology, as well as a Cox model for continuous time with space-time interactions.
https://CRAN.R-project.org/package=bamlss
Censored heteroscedastic precipitation climatology, with spatially-varying seasonal effects, spatial main effects, and predicted average precipitation for target date.
]]>