Umlauf N, Klein N, Simon T, Zeileis A (2019). “bamlss: A Lego Toolbox for Flexible Bayesian Regression (and Beyond).” arXiv:1909.11784, arXiv.org E-Print Archive. https://arxiv.org/abs/1909.11784
Over the last decades, the challenges in applied regression and in predictive modeling have been changing considerably: (1) More flexible model specifications are needed as big(ger) data become available, facilitated by more powerful computing infrastructure. (2) Full probabilistic modeling rather than predicting just means or expectations is crucial in many applications. (3) Interest in Bayesian inference has been increasing both as an appealing framework for regularizing or penalizing model estimation as well as a natural alternative to classical frequentist inference. However, while there has been a lot of research in all three areas, also leading to associated software packages, a modular software implementation that allows one to easily combine all three aspects has not yet been available. To fill this gap, the R package bamlss is introduced for Bayesian additive models for location, scale, and shape (and beyond). At the core of the package are algorithms for highly efficient Bayesian estimation and inference that can be applied to generalized additive models (GAMs) or generalized additive models for location, scale, and shape (GAMLSS), also known as distributional regression. However, its building blocks are designed as “Lego bricks” encompassing various distributions (exponential family, Cox, joint models, …), regression terms (linear, splines, random effects, tensor products, spatial fields, …), and estimators (MCMC, backfitting, gradient boosting, lasso, …). It is demonstrated how these can be easily recombined to make classical models more flexible or create new custom models for specific modeling challenges.
CRAN package: https://CRAN.R-project.org/package=bamlss
Replication script: bamlss.R
Project web page: http://www.bamlss.org/
To illustrate that the bamlss package follows the same familiar workflow as other regression packages, such as the basic stats package or the well-established mgcv and gamlss packages, two quick examples are provided: a Bayesian logit model and a location-scale model where both mean and variance of a normal response depend on a smooth term.
The logit model is a basic labor force participation model, a standard application in microeconometrics. Here, the data are loaded from the AER package and the same model formula is specified that would also be used for glm() (as shown on ?SwissLabor).
data("SwissLabor", package = "AER")
f <- participation ~ income + age + education + youngkids + oldkids + foreign + I(age^2)
Then, the model can be estimated with bamlss() using essentially the same look-and-feel as for glm(). The default is to use Markov chain Monte Carlo after obtaining initial parameters via backfitting.
library("bamlss")
set.seed(123)
b <- bamlss(f, family = "binomial", data = SwissLabor)
summary(b)
## Call:
## bamlss(formula = f, family = "binomial", data = SwissLabor)
## 
## Family: binomial
## Link function: pi = logit
## *
## Formula pi:
## 
## participation ~ income + age + education + youngkids + oldkids +
## foreign + I(age^2)
## 
## Parametric coefficients:
##                 Mean     2.5%      50%    97.5% parameters
## (Intercept)  6.15503  1.55586  5.99204 11.11051      6.196
## income      -1.10565 -1.56986 -1.10784 -0.68652     -1.104
## age          3.45703  2.05897  3.44567  4.79139      3.437
## education    0.03354 -0.02175  0.03284  0.09223      0.033
## youngkids   -1.17906 -1.51099 -1.17683 -0.83047     -1.186
## oldkids     -0.24122 -0.41231 -0.24099 -0.08054     -0.241
## foreignyes   1.16749  0.76276  1.17035  1.55624      1.168
## I(age^2)    -0.48990 -0.65660 -0.49205 -0.31968     -0.488
## alpha        0.87585  0.32301  0.99408  1.00000         NA
## 
## Sampler summary:
## 
## DIC = 1033.325 logLik = -512.7258 pd = 7.8734
## runtime = 1.417
## 
## Optimizer summary:
## 
## AICc = 1033.737 converged = 1 edf = 8
## logLik = -508.7851 logPost = -571.3986 nobs = 872
## runtime = 0.012
## 
The summary is based on the MCMC samples, which suggest “significant” effects for all covariates except education, whose 95% credible interval contains zero. In addition, the acceptance probabilities alpha are reported and indicate proper behavior of the MCMC algorithm. The column parameters shows the corresponding posterior mode estimates of the regression coefficients, as calculated by the upstream backfitting algorithm.
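As a quick plausibility check, the posterior mean coefficients of the fitted logit model can be plugged into the inverse logit by hand. The covariate profile below is made up for illustration; note that in SwissLabor, income is the log of nonlabor income and age is measured in decades.

```r
## Posterior mean coefficients (intercept, income, age, education,
## youngkids, oldkids, foreignyes, age^2), signs as in the fitted model.
beta <- c(6.15503, -1.10565, 3.45703, 0.03354, -1.17906, -0.24122,
  1.16749, -0.48990)

## Hypothetical profile: log-income 10.5, age 40 (= 4 decades),
## 9 years of education, no young kids, one old kid, Swiss citizen.
x <- c(1, 10.5, 4.0, 9, 0, 1, 0, 4.0^2)

eta  <- sum(beta * x)  # linear predictor
prob <- plogis(eta)    # inverse logit: participation probability
```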
To show a more flexible regression model, we fit a distributional location-scale model to the well-known simulated motorcycle accident data, provided as mcycle in the MASS package. Here, the relationship between head acceleration and time after impact is captured by smooth functions in both mean and variance. See also ?gaulss in the mgcv package for the same type of model estimated with REML rather than MCMC. Below, we load the data, set up a list of two formulas with smooth terms (with an increased number of knots k for more flexibility), fit the model almost as usual, and then visualize the fitted terms along with 95% credible intervals.
data("mcycle", package = "MASS")
f <- list(accel ~ s(times, k = 20), sigma ~ s(times, k = 20))
set.seed(456)
b <- bamlss(f, data = mcycle, family = "gaussian")
plot(b, model = c("mu", "sigma"))
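With a fitted location-scale model, pointwise 95% intervals of the response distribution are simply mu ± 1.96 · sigma. A minimal base-R sketch with made-up fitted values; with a real fit, mu and sigma would be obtained via predict(b, type = "parameter"):

```r
## Made-up fitted parameters at a few time points, for illustration only.
times <- c(10, 20, 30, 40, 50)
mu    <- c(0, -60, -100, 10, 0)   # hypothetical fitted mean acceleration
sigma <- c(2, 15, 25, 30, 10)     # hypothetical fitted standard deviation

## Pointwise 95% interval of the normal response distribution.
lower <- mu + qnorm(0.025) * sigma
upper <- mu + qnorm(0.975) * sigma
```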
Finally, we show a more challenging case study. Here, the emphasis is on illustrating the workflow; for more details on the background of the data and the interpretation of the model, see Section 5 in the full paper linked above. The goal is to establish a probabilistic model linking positive counts of cloud-to-ground lightning discharges in the European Eastern Alps to atmospheric quantities from a reanalysis dataset.
The lightning measurements form the response variable, and the regressors are taken from atmospheric quantities in ECMWF’s ERA5 reanalysis data. Both have a temporal resolution of 1 hour for the years 2010-2018 and a spatial mesh size of approximately 32 km. The subset of the data analyzed, along with the fitted bamlss model, is provided in the FlashAustria package on R-Forge, which can be installed by
install.packages("FlashAustria", repos = "http://R-Forge.R-project.org")
To model the lightning counts in cells with at least one discharge, we employ a negative binomial count distribution truncated at zero. The data can be loaded and the regression formula set up as follows:
data("FlashAustria", package = "FlashAustria")
f <- list(
counts ~ s(d2m, bs = "ps") + s(q_prof_PC1, bs = "ps") +
s(cswc_prof_PC4, bs = "ps") + s(t_prof_PC1, bs = "ps") +
s(v_prof_PC2, bs = "ps") + s(sqrt_cape, bs = "ps"),
theta ~ s(sqrt_lsp, bs = "ps")
)
The expectation mu of the underlying untruncated negative binomial model is modeled by various smooth terms for the atmospheric variables, while the overdispersion parameter theta depends on only one smooth regressor. To fit this challenging model, gradient boosting is employed in a first step to obtain initial values for the subsequent MCMC sampler. Running the model takes about 30 minutes on a well-equipped standard PC. In order to move quickly through the example, we load the precomputed model from the FlashAustria package:
data("FlashAustriaModel", package = "FlashAustria")
b <- FlashAustriaModel
But, of course, the model can also be refitted:
set.seed(111)
b <- bamlss(f, family = "ztnbinom", data = FlashAustriaTrain,
optimizer = boost, maxit = 1000, ## Boosting arguments.
thin = 5, burnin = 1000, n.iter = 6000) ## Sampler arguments.
To explore this model in some more detail, we show a couple of visualizations. First, the contribution of the individual terms to the log-likelihood during gradient boosting is depicted.
pathplot(b, which = "loglik.contrib", intercept = FALSE)
Subsequently, we show traceplots of the MCMC samples (left) along with autocorrelations (right) for two spline coefficients of the term s(sqrt_cape) in the model for mu.
plot(b, model = "mu", term = "s(sqrt_cape)", which = "samples")
Next, the effects of the terms s(sqrt_cape)
and s(q_prof_PC1)
from the model for mu
and term s(sqrt_lsp)
from the model for theta
are shown along with 95% credible intervals derived from the MCMC samples.
plot(b, term = c("s(sqrt_cape)", "s(q_prof_PC1)", "s(sqrt_lsp)"),
rug = TRUE, col.rug = "#39393919")
Finally, estimated probabilities for observing 10 or more lightning counts (within one grid box) are computed and visualized. The reconstructions for four time points on September 15-16, 2001 are shown.
fit < predict(b, newdata = FlashAustriaCase, type = "parameter")
fam < family(b)
FlashAustriaCase$P10 <- 1 - fam$p(9, fit)
world < rnaturalearth::ne_countries(scale = "medium", returnclass = "sf")
library("ggplot2")
ggplot() + geom_sf(aes(fill = P10), data = FlashAustriaCase) +
colorspace::scale_fill_continuous_sequential("Oslo", rev = TRUE) +
geom_sf(data = world, col = "white", fill = NA) +
coord_sf(xlim = c(7.95, 17), ylim = c(45.45, 50), expand = FALSE) +
facet_wrap(~time, nrow = 2) + theme_minimal() +
theme(plot.margin = margin(t = 0, r = 0, b = 0, l = 0))
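The family object’s p() used above is the zero-truncated negative binomial CDF. For reference, it can be written out in base R by truncating the untruncated CDF at zero; the mu and theta values below are made up:

```r
## Zero-truncated negative binomial CDF: P(Y <= q | Y >= 1).
p_ztnbinom <- function(q, mu, theta) {
  p0 <- dnbinom(0, mu = mu, size = theta)  # untruncated P(Y = 0)
  (pnbinom(q, mu = mu, size = theta) - p0) / (1 - p0)
}

## Probability of 10 or more counts, analogous to P10 above
## (hypothetical parameter values).
P10 <- 1 - p_ztnbinom(9, mu = 4, theta = 1)
```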
Over the last week a big controversy over Hurricane Dorian emerged after US President Donald Trump tweeted on September 1 that Alabama (and other states) “will most likely be hit (much) harder than anticipated”. And after the Birmingham, Alabama, office of the National Weather Service contradicted Trump on Twitter, the US president defended his tweet claiming that earlier forecasts showed a high probability of Alabama being hit. The various pieces of “evidence” for this included a map, manually modified by a marker, leading to the hashtag #sharpiegate trending on Twitter.
Here, we won’t comment further on the controversy as it is undisputed among scientists that on September 1 the forecast path did not include Alabama. However, we will look into the maps that Trump claimed his tweet was based on and investigate whether poor color choice may have contributed to a misinterpretation of the maps. Specifically, on September 5 Trump tweeted:
Just as I said, Alabama was originally projected to be hit. The Fake News denies it! pic.twitter.com/elJ7ROfm2p
— Donald J. Trump (@realDonaldTrump) September 5, 2019
These maps convey the impression that there is an increased risk for Alabama, and especially the three maps with the color coding are rather suggestive. A closer look, though, reveals that the maps are from August 30, have a 5-day forecasting horizon, and pertain to probabilities for tropical-storm-force winds (i.e., not the cone of the hurricane!), with Southeast Alabama only having a 5-20% probability.
Although the information in the maps can be correctly decoded using their titles and legends, it can be argued that this may require some expertise or experience and that there is some potential for misinterpretations. For example, data visualization expert Alberto Cairo writes on Twitter: “I just want to give him the benefit of the doubt, honestly. These maps are difficult to understand. For me the bad thing isn’t misinterpreting. It’s not apologizing […]”
And one aspect that makes the maps prone to misinterpretations is the color choice for coding the probabilities. This is a so-called “rainbow color map” going from dark green over bright yellow to red and dark purple. Such color maps are still widely used although it has long been recognized that they have a number of disadvantages. In the following, Reto Stauffer and I illustrate in detail what the specific problems of the top right map are and suggest a better alternative color choice.
On the left is the original map that was included in Trump’s tweet and on the right is our version with alternative colors. The main problem with the original colors is that the entire area with more than 5% probability is shaded with highly saturated colors. Some would argue that the traffic light system (green-yellow-red) signals that the green areas are relatively “low risk”. However, we argue that the bright colors and the abrupt transition from “no color” (less than 5%) to “dark green” (for 5-10%) convey a substantially increased risk for the entire shaded area.
One way to avoid this misinterpretation is to choose colors that go from light (low risk) to dark and colorful (high risk). This is what we have done in the map on the right, while preserving the hues from green over yellow and red to purple. The probabilities represented in the map are exactly the same, but the alternative color choice conveys much more intuitively which areas are affected by increased probabilities beyond 50% or 60% (which do not include Alabama).
In summary, the information in the map certainly does not represent strong evidence for Alabama being likely “hit hard” by Hurricane Dorian. However, the poor color choice facilitates such misinterpretations and better, more intuitive color alternatives are easily available.
Further problems with the original colors can be brought out by converting both maps to grayscale. This shows that not only the transition from below to above 5% is emphasized too much but also the discontinuous transitions between dark and light are very counterintuitive. In contrast, our alternative colors are much more intuitive because they become darker with increasing risk.
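The grayscale argument can be checked numerically without any color package: approximate the perceived brightness of each color by a luma value and test whether it changes monotonically along the palette. The Rec. 601 weights below are only a crude stand-in for the HCL luminance discussed in this post.

```r
## Approximate perceived brightness (luma) of colors, Rec. 601 weights.
luma <- function(cols) {
  rgb <- col2rgb(cols) / 255
  drop(c(0.299, 0.587, 0.114) %*% rgb)
}

ramp <- gray(seq(0.9, 0.1, by = -0.2))  # light-to-dark ramp: monotone luma
rain <- rainbow(5)                      # rainbow ramp: non-monotone luma
```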
Another related problem can be demonstrated by emulating green-deficient vision (deuteranopia), which also reveals discontinuities in the original colors.
Finally, we briefly comment on some technical details for constructing the alternative color map. We have used our R software package colorspace, which facilitates choosing color palettes using the HCL color model, capturing the perceptual dimensions “hue” (type of color, dominant wavelength), “chroma” (colorfulness), and “luminance” (brightness). In the two plots below we show the HCL spectrum of both sets of colors.
For the original colors on the left we see that luminance (blue line) is nonmonotonic, chroma (green line) is high throughout, and hue (red line) goes from green to purple. For our alternative colors we have used essentially the same hues. However, luminance covers a similar range as in the original colors but in a monotonic fashion. And chroma is low for colors associated with low risk.
The R code snippet below shows how the alternative colors can be computed using our colorspace
package:
colorspace::sequential_hcl(10, palette = "Purple-Yellow", rev = TRUE,
  c1 = 70, cmax = 100, l2 = 80, h2 = 500)
The starting point is the sequential Purple-Yellow palette that we have used previously for risk maps. However, we modify the low-risk hue from yellow to green (hue = 140) and go in the opposite direction through the color wheel (hence hue = 500 = 140 + 360 is used). Moreover, we increase chroma for the high-risk colors and decrease luminance somewhat for the low-risk colors (to be of similar brightness as the gray map in the background). Further similar illustrations of problems with rainbow color maps, along with more details and explanations, are available on our web site http://colorspace.R-Forge.R-project.org/articles/endrainbow.html.
(Authors: Achim Zeileis, Jason C. Fisher, Kurt Hornik, Ross Ihaka, Claire D. McWhite, Paul Murrell, Reto Stauffer, Claus O. Wilke)
The R package “colorspace” (http://colorspace.R-Forge.R-project.org/) provides a flexible toolbox for selecting individual colors or color palettes, manipulating these colors, and employing them in statistical graphics and data visualizations. In particular, the package provides a broad range of color palettes based on the HCL (Hue-Chroma-Luminance) color space. The three HCL dimensions have been shown to match those of the human visual system very well, thus facilitating intuitive selection of color palettes through trajectories in this space.
Namely, general strategies for three types of palettes are provided: (1) Qualitative for coding categorical information, i.e., where no particular ordering of categories is available and every color should receive the same perceptual weight. (2) Sequential for coding ordered/numeric information, i.e., going from high to low (or vice versa). (3) Diverging for coding ordered/numeric information around a central neutral value, i.e., where colors diverge from neutral to two extremes.
To aid selection and application of these palettes, the package provides scales for use with ggplot2; shiny (and tcltk) apps for interactive exploration (see also http://hclwizard.org/); visualizations of palette properties; accompanying manipulation utilities (like desaturation and lighten/darken); and emulation of color vision deficiencies.
Links to: PDF slides, YouTube video, R code, arXiv working paper.
Furthermore, replication code for the introductory example (influenza risk map) was already provided in the recent endrainbow blog post.
Jones PJ, Mair P, Simon T, Zeileis A (2019). “Network Model Trees”, OSF ha4cw, OSF Preprints. doi:10.31219/osf.io/ha4cw
In many areas of psychology, correlation-based network approaches (i.e., psychometric networks) have become a popular tool. In this paper we define a statistical model for correlation-based networks and propose an approach that recursively splits the sample based on covariates in order to detect significant differences in the network structure. We adapt model-based recursive partitioning and conditional inference tree approaches for finding covariate splits in a recursive manner. This approach is implemented in the networktree R package. The empirical power of these approaches is studied in several simulation conditions. Examples are given using real-life data from personality and clinical research.
CRAN package: https://CRAN.R-project.org/package=networktree
OSF project: https://osf.io/ykq2a/
Network model trees are illustrated using data from the Open Source Psychometrics Project:
The TIPI network is partitioned using MOB based on three covariates: engnat (English as native language), gender, and education. Generally, the structure of the network is characterized by strong negative relationships between the normal and reverse measurements of each domain with complex relationships between separate domains. When partitioning the network interesting differences are revealed. For example, native English speakers without a university degree showed a negative relationship between agreeableness and agreeablenessreversed that was significantly weakened in nonnative speakers and in native speakers with a university degree. Among native English speakers with a university degree, males and other genders showed a stronger relationship between conscientiousness and neuroticismreversed compared to females.
In the network plots edge thicknesses are determined by the strength of regularized partial correlations between nodes. Node labels correspond to the first letter of each Big Five personality domain, with the character “r” indicating items that measure the domain in reverse.
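For reference, unregularized partial correlations can be computed from the inverse covariance (precision) matrix via pcor_ij = -k_ij / sqrt(k_ii k_jj). The paper’s networks use regularized estimates, so this base-R sketch only illustrates the quantity being plotted, on made-up data:

```r
## Partial correlation matrix from the precision matrix K = Sigma^{-1}.
partial_cor <- function(x) {
  k <- solve(cov(x))
  p <- -k / sqrt(diag(k) %o% diag(k))
  diag(p) <- 1
  p
}

set.seed(1)
x  <- matrix(rnorm(200 * 3), ncol = 3)  # toy data: 200 cases, 3 "items"
pc <- partial_cor(x)
```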
The DASS network is partitioned using MOB based on a larger variety of covariates in a highly exploratory scenario: engnat (English as native language), gender, marital status, sexual orientation, and race. Again, the primary split occurred between native and non-native English speakers. Among native English speakers, two further splits were found with the race variable. Among the non-native English speakers, a split was found by gender. These results indicate various sources of potential heterogeneity in network structure. For example, among non-native English speaking males, the connection between worthlife (I felt that life wasn’t worthwhile) and nohope (I could see nothing in the future to be hopeful about) was stronger compared to females and other genders. In native English speaking Asians, the connection between getgoing (I just couldn’t seem to get going) and lookforward (I felt that I had nothing to look forward to) was stronger compared to all other racial groups.
Schlosser L, Hothorn T, Zeileis A (2019). “The Power of Unbiased Recursive Partitioning: A Unifying View of CTree, MOB, and GUIDE”, arXiv:1906.10179, arXiv.org E-Print Archive. https://arXiv.org/abs/1906.10179
A core step of every algorithm for learning regression trees is the selection of the best splitting variable from the available covariates and the corresponding split point. Early tree algorithms (e.g., AID, CART) employed greedy search strategies, directly comparing all possible split points in all available covariates. However, subsequent research showed that this is biased towards selecting covariates with more potential split points. Therefore, unbiased recursive partitioning algorithms have been suggested (e.g., QUEST, GUIDE, CTree, MOB) that first select the covariate based on statistical inference using p-values that are adjusted for the possible split points. In a second step a split point optimizing some objective function is selected in the chosen split variable. However, different unbiased tree algorithms obtain these p-values from different inference frameworks and their relative advantages or disadvantages are not yet well understood. Therefore, three different popular approaches are considered here: classical categorical association tests (as in GUIDE), conditional inference (as in CTree), and parameter instability tests (as in MOB). First, these are embedded into a common inference framework encompassing parametric model trees, in particular linear model trees. Second, it is assessed how different building blocks from this common framework affect the power of the algorithms to select the appropriate covariates for splitting: observation-wise goodness-of-fit measure (residuals vs. model scores), dichotomization of residuals/scores at zero, and binning of possible split variables. This shows that specifically the goodness-of-fit measure is crucial for the power of the procedures, with model scores without dichotomization performing much better in many scenarios.
CRAN package: https://CRAN.R-project.org/package=partykit
Development version with some extensions enabled: partykit_1.24.2.tar.gz
Replication materials: simulation.zip
The manuscript compares three so-called unbiased recursive partitioning algorithms that employ statistical inference to adjust for the number of possible splits in a split variable: GUIDE (Loh 2002), CTree (Hothorn et al. 2006), and MOB (Zeileis et al. 2008).
First, it is pointed out what the similarities and the differences in the algorithms are, specifically with respect to the split variable selection through statistical tests. Second, the power of these tests is studied for a “stump”, i.e., a single split only. Third, the capability of the entire algorithm (including a pruning strategy) to recover the correct partition in a “tree” with two splits is investigated.
In all cases, the three algorithms are employed to learn model-based trees where in each leaf of the tree a linear regression model is fitted with intercept β_{0} and slope β_{1}. The simulations then vary whether only the intercept β_{0}, only the slope β_{1}, or both differ in the data.
All three algorithms proceed by first fitting the model (here: linear regression by OLS) in a given subgroup (or node) of the tree. Then they extract some kind of goodness-of-fit measure (either residuals or full model scores) and test whether this measure is associated with any of the split variables. The variable with the highest association (i.e., lowest p-value) is employed for splitting and then the procedure is repeated recursively in the resulting subgroups.
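This shared scheme can be sketched in a few lines of base R for a linear model: fit OLS in the node, form observation-wise score contributions (residual times regressor), and check their association with a candidate split variable. The toy data and the unadjusted correlation check below are made up; the actual algorithms use proper tests with p-values adjusted for the possible splits.

```r
## Toy node data: the slope changes with the split variable z.
set.seed(42)
n <- 200
z <- runif(n)                                   # candidate split variable
x <- runif(n)
y <- 1 + ifelse(z > 0.5, 2, 0) * x + rnorm(n)

m <- lm(y ~ x)

## Observation-wise score contributions of the OLS fit (n x 2 matrix,
## one column per coefficient); their column sums are zero by construction.
scores <- residuals(m) * model.matrix(m)

## Crude association check between scores and z (stand-in for the tests).
assoc <- abs(cor(scores, z))
```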
For “pruning” the tree to the right size one can either first grow a larger tree and then prune those splits that are not relevant enough (post-pruning), or the algorithm can stop splitting when the association test is no longer significant (pre-pruning).
The default combinations of fitted model type, test type, and pruning strategy for the three algorithms are given in the following table.
Algorithm   Fit             Test                          Pruning
CTree       Nonparametric   Conditional inference         Pre
MOB         Parametric      Score-based fluctuation       Pre (or post with AIC/BIC)
GUIDE       Parametric      Residual-based chi-squared    Post (cost-complexity pruning)
Thus, the main difference is the testing strategy, but the pruning is also relevant. While at first sight the tests come from very different motivations, they are actually not that different. When assessing the association with the split variable, the following three properties are most relevant: (1) the observation-wise goodness-of-fit measure (residuals vs. model scores), (2) whether the residuals/scores are dichotomized at zero, and (3) whether the possible split variables are binned into categories. An overview of the corresponding settings for the three algorithms is given in the following table. Additionally, the tests differ somewhat in how they aggregate across the possible splits considered: either in a sum-of-squares statistic or a maximally-selected statistic.
Algorithm   Scores         Dichotomization   Categorization   Statistic
CTree       Model scores   –                 –                Sum of squares
MOB         Model scores   –                 –                Maximally selected
GUIDE       Residuals      X                 X                Sum of squares
Subsequently, these algorithms are compared in two simulation studies. More details and more simulation studies can be found in the manuscript. In addition to the three default algorithms, a modified GUIDE algorithm using model scores instead of residuals (GUIDE+scores) is considered.
Clearly, the different choices made in the construction influence the inference properties of the significance tests. Hence, in a first step we investigate the power properties of the tests when there is only one split in one of the split variables (among further noise variables). The split can pertain either to the intercept β_{0} only or the slope β_{1} only or both.
The plot below shows the probability of selecting the true split variable (Z_{1}) with the minimal p-value against the magnitude of the difference in the regression coefficients (δ). For a split in the middle of the data (50%) pertaining only to the intercept β_{0} (top left panel) all tests perform almost equivalently. However, if the split only affects the slope β_{1} (middle column) it is much better to use score-based tests rather than residual-based tests (as in GUIDE), which cannot pick up changes that do not affect the conditional mean. Moreover, if the split occurs not in the middle (50% quantile, top row) but in the tails (90% quantile, bottom row) it is better to use a maximally-selected statistic (as in MOB) rather than a sum-of-squares statistic.
One could argue that the power properties of the tests may be crucial when pre-pruning (based on statistical significance) is used. However, when combined with cost-complexity post-pruning it may not be so important to have particularly high power. As long as the power for the true split variables is higher than for the noise variables, it might be sufficient to select the correct split variable.
This is assessed in a simulation for a tree with two splits, both depending on differences of magnitude δ in the two regression coefficients, respectively. The adjusted Rand index is used to assess how well the partition found by the tree conforms with the true partition. The columns of the display below are for splits that occur in the middle of the data vs. later in the sample (left to right).
And indeed it can be shown that post-pruning (bottom row) mitigates many of the power deficits of the testing strategies compared to significance-based pre-pruning (top row). However, it is still clearly better to use a score-based test (as in CTree, MOB, and GUIDE+scores) than a residual-based test (as in GUIDE). Also, pre-pruning may even lead to slightly better results than post-pruning when based on a powerful test.
Using several simulation setups we have shown that in many circumstances CTree, MOB, and GUIDE perform very similarly for recursive partitioning based on linear regression models. However, in some settings score-based tests clearly outperform residual-based tests (the latter may even lack power altogether). To some extent cost-complexity post-pruning can mitigate power deficits of the testing strategy, but pre-pruning typically works just as well as long as the significance test works well.
Furthermore, other simulations in the manuscript show that dichotomization of residuals/scores should be avoided as it reduces the power of the tests. Note that avoiding this is very easy in GUIDE: instead of chi-squared tests one can simply use one-way ANOVA tests. Finally, in the appendix of the manuscript it is shown that maximally-selected statistics (as in MOB) work better for abrupt splits late in the sample while the sum-of-squares statistics (from CTree and GUIDE) work better for smooth(er) transitions.
The forecast is based on a hybrid random forest learner that combines three main sources of information: an ability estimate for every team based on historic matches; an ability estimate for every team based on odds from 18 bookmakers; and further team covariates (e.g., age, team structure) and country-specific socioeconomic factors (population, GDP). The random forest is learned using the FIFA Women’s World Cups in 2011 and 2015 as training data and then applied to current information to obtain a forecast for the 2019 FIFA Women’s World Cup. The random forest actually provides the predicted number of goals for each team in all possible matches in the tournament so that a bivariate Poisson distribution can be used to compute the probabilities for a win, draw, or loss in such a match. Based on these match probabilities the entire tournament can be simulated 100,000 times, yielding winning probabilities for each team. The results show that defending champions United States are the clear favorite with a winning probability of 28.1%, followed by host France with a winning probability of 14.3%, England with 13.3%, and Germany with 12.9%. The winning probabilities for all teams are shown in the barchart below, with more information linked in the interactive full-width version.
Interactive full-width graphic
The full study is available in a recent working paper which has been conducted by an international team of researchers: Andreas Groll, Christophe Ley, Gunther Schauberger, Hans Van Eetvelde, Achim Zeileis. It actually provides a hybrid approach that combines three state-of-the-art forecasting methods:
Historic match abilities:
An ability estimate is obtained for every team based on “retrospective” data, namely 3418 historic matches of 167 international women’s teams over the last 8 years. A bivariate Poisson model with team-specific fixed effects is fitted to the number of goals scored by both teams in each match. However, rather than equally weighting all matches to obtain average team abilities (or team strengths) over the entire history period, an exponential weighting scheme is employed. This assigns more weight to more recent results and thus yields an estimate of current team abilities. More details can be found in Ley, Van de Wiele, Van Eetvelde (2019).
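The exponential weighting scheme boils down to a half-life: a match played a certain number of days ago receives weight 0.5^(days / halflife). A minimal sketch; the half-life value below is made up (Ley et al. estimate it from the data):

```r
## Exponential down-weighting of historic matches by their age in days;
## the weight halves every 'halflife' days (hypothetical value here).
match_weight <- function(days, halflife = 500) 0.5^(days / halflife)

w <- match_weight(c(0, 500, 1000, 2000))
```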
Bookmaker consensus abilities:
Another ability estimate for every team is obtained based on “prospective” data, namely the odds of 18 international bookmakers that reflect their expert expectations for the tournament. Using the bookmaker consensus model of Leitner, Zeileis, Hornik (2010) the bookmaker odds are first adjusted for the bookmakers’ profit margins (“overround”) and then averaged (on a logit scale) to obtain a consensus for the winning probability of each team. To adjust for the effects of the tournament draw (that might have led to easier or harder groups for some teams), an “inverse” simulation approach is used to infer which team abilities are most likely to lead up to these winning probabilities.
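The basic mechanics can be sketched as follows: invert one bookmaker’s decimal odds, remove the overround, and average winning probabilities across bookmakers on the logit scale. The odds below are made up, and the simple proportional overround adjustment is a simplification of the model’s actual margin correction:

```r
## Winning probabilities implied by one bookmaker's decimal odds.
odds_to_prob <- function(odds) {
  p <- 1 / odds  # inverse odds; sum exceeds 1 due to the profit margin
  p / sum(p)     # simple proportional overround adjustment
}

## Hypothetical decimal odds for four teams from two bookmakers.
book1 <- odds_to_prob(c(3.5, 6.0, 7.0, 7.5))
book2 <- odds_to_prob(c(3.6, 5.5, 7.5, 7.0))

## Consensus: average on the logit scale, then renormalize.
consensus <- plogis(rowMeans(cbind(qlogis(book1), qlogis(book2))))
consensus <- consensus / sum(consensus)
```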
Hybrid random forest:
Finally, machine learning is used to combine the two ability estimates above along with a broad range of further relevant covariates, yielding refined probabilistic forecasts for each match. Specifically, the hybrid random forest approach of Groll, Ley, Schauberger, Van Eetvelde (2019) is used to combine the two highly informative ability estimates with further team-specific information that may or may not be relevant to the team’s performance. The covariates considered comprise team-specific details (e.g., FIFA rank, average age, confederation, team structure, …) as well as country-specific socioeconomic factors (population and GDP per capita). By learning a large ensemble of 5,000 regression trees, the relative importances of all the covariates can be inferred automatically. The resulting predicted number of goals for each team (averaged over all trees) can then finally be used to simulate the entire tournament 100,000 times.
Using the hybrid random forest, an expected number of goals is obtained for both teams in each possible match. The covariate information used for this is the difference between the two teams in each of the variables listed above, i.e., the difference in historic match abilities (on a log scale), the difference in bookmaker consensus abilities (on a log scale), the difference in mean age of the teams, etc. Assuming a bivariate Poisson distribution with the expected numbers of goals for both teams, we can compute the probability that a certain match ends in a win, a draw, or a loss.
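For illustration, win/draw/loss probabilities can be computed from the expected goals. The sketch below uses two independent Poisson distributions as a simplification; the actual model is a bivariate Poisson, which additionally allows for correlation between the two teams' goals:

```r
## Sketch: win/draw/loss probabilities from expected goals, assuming
## (as a simplification) independent Poisson goal counts for both teams.
wdl_prob <- function(lambda1, lambda2, maxgoals = 15) {
  g <- 0:maxgoals
  joint <- outer(dpois(g, lambda1), dpois(g, lambda2))  # P(score = (i, j))
  c(win  = sum(joint[lower.tri(joint)]),  # team 1 scores more goals
    draw = sum(diag(joint)),              # equal number of goals
    loss = sum(joint[upper.tri(joint)]))  # team 2 scores more goals
}

## toy example with 1.8 vs. 1.1 expected goals
wdl_prob(1.8, 1.1)
```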
The following heatmap shows the win probabilities in each possible match between a pair of teams with green vs. pink signalling probabilities above vs. below 50%, respectively. The corresponding loss probability is displayed when changing the roles of the teams (i.e., switching rows and columns in the matrix below). The tooltips for each match in the interactive version of the graphic also print the three win, draw, and loss probabilities.
[Interactive full-width graphic]
As every single match can be simulated with the pairwise probabilities above, it is also straightforward to simulate the entire tournament (here: 100,000 times), providing “survival” probabilities for each team across the different stages.
[Interactive full-width graphic]
All our forecasts are probabilistic, clearly below 100%, and thus by no means certain, even if the favorite United States clearly has the highest winning probability of all participating teams. However, recall that a single poor performance in the playoffs is sufficient to drop out of the tournament. For example, this happened to host and clear favorite Germany in 2011 (with a winning probability of almost 40% according to the bookmakers) when they lost 0-1 to Japan in extra time in the quarterfinals. Japan then went on to become FIFA Women’s World Champion for the first time.
Another interesting observation is that the bookmakers see both the United States and France almost on par with bookmaker consensus probabilities of 18.1% and 18.7%, respectively. Clearly, the bookmakers (and presumably their customers) expect that France’s home advantage will play an important role. In contrast, our hybrid random forest does not find the home advantage to be an important factor and hence forecasts a much higher winning probability for the United States (28.1%) than for France (14.3%). This is due to the home advantage not having played an important role in our learning data: Germany in 2011 and Canada in 2015 both dropped out in the quarterfinals.
Finally, when considering the bookmaker consensus, it is also worth pointing out that the bookmakers seem to be less confident about their odds for the Women’s World Cup than for the Men’s World Cup. This is reflected by the increased overround that assures the bookmakers’ profit margins. While for men’s tournaments this overround is typically around 15% (which the bookmakers keep and do not pay out), for the FIFA Women’s World Cup 2019 it is a sizeable 25% on average and thus ten percentage points higher.
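The overround itself is easy to compute from quoted decimal odds: it is the amount by which the summed inverse odds exceed one. A minimal sketch (the overround function and the toy odds are our own, not from a package):

```r
## Overround: how much the inverse decimal odds, summed over all
## outcomes of a bet (e.g., home/draw/away), exceed a fair total of 1.
overround <- function(odds) sum(1 / odds) - 1

## toy example with decimal odds for home win, draw, and away win
overround(c(2.4, 3.3, 2.8))
```

Fair odds would give an overround of exactly zero; the larger the overround, the larger the margin the bookmaker retains.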
This overround is also the main reason why we recommend against betting based on the results presented here: it assures that the best chances of making money from sports betting lie with the bookmakers. Instead, we recommend betting only privately among friends and colleagues, or simply enjoying the exciting matches we are surely about to see in France!
Groll A, Ley C, Schauberger G, Van Eetvelde H, Zeileis A (2019). “Hybrid Machine Learning Forecasts for the FIFA Women’s World Cup 2019.” arXiv:1906.01131, arXiv.org E-Print Archive. https://arXiv.org/abs/1906.01131
Heidi Seibold, Achim Zeileis, Torsten Hothorn (2019). “model4you: An R Package for Personalised Treatment Effect Estimation.” Journal of Open Research Software, 7(17), 1-6. doi:10.5334/jors.219
Typical models estimating treatment effects assume that the treatment effect is the same for all individuals. Model-based recursive partitioning allows one to relax this assumption and to estimate stratified treatment effects (model-based trees) or even personalised treatment effects (model-based forests). With model-based trees one can compute treatment effects for different strata of individuals. The strata are found in a data-driven fashion and depend on characteristics of the individuals. Model-based random forests allow for a similarity estimation between individuals in terms of model parameters (e.g., intercept and treatment effect). The similarity measure can then be used to estimate personalised models. The R package model4you implements these stratified and personalised models in the setting with two randomly assigned treatments, with a focus on ease of use and interpretability: clinicians and other users can take the model they usually use for estimating the average treatment effect and, with a few lines of code, obtain a visualisation that is easy to understand and interpret.
https://CRAN.R-project.org/package=model4you
The correlation between exam group and exam performance in an introductory mathematics exam (for business and economics students) is investigated using treebased stratified and personalized treatment effects. Group 1 took the exam in the morning and group 2 started the exam with slightly different exercises after the first group finished. Potential sources of heterogeneity in the group effect include gender, field of study, whether the exam was taken (and failed) previously, and prior performance in online “tests” earlier in the semester. Performance in both the written exam and the online tests is captured by percentage of correctly solved exercises.
Overall, it seems that the split into two different exam groups was fair: The second group had only a slightly lower performance, by around 2 or 3 percentage points, suggesting that the exam in the second group was only very slightly more difficult. However, when investigating the heterogeneity of this group effect with a model-based tree, it turns out that the tree distinguishes the students by their performance in the online tests. The largest difference between the two exam groups is in the students who did very well in the online tests (more than 92.3 percent correct), where the second-group students performed worse by 13.3 percentage points. So the split into the two exam groups seems not to have been fully fair for those very good students.
To refine the assessment further, a model-based forest can be estimated. This reveals that the dependence of the group effect on the performance in the online tests is even more pronounced. This is shown in the dependence plots and beeswarm plots below, with the group treatment effect on the y-axis and the performance in the online tests on the x-axis.
To fit the simple linear base model in R, lm() can be used. The subsequent tree based on this model can be obtained with pmtree() from model4you and the forest with pmforest(). Example code is shown below; the full replication code for the entire analysis and graphics is included in the manuscript.
bmod_math <- lm(pcorrect ~ group, data = MathExam)
tr_math <- pmtree(bmod_math, control = ctree_control(maxdepth = 2))
forest_math <- pmforest(bmod_math)
The go-to palette in many software packages is, or used to be until rather recently, the so-called rainbow: a palette created by changing the hue in highly saturated RGB colors. This has been widely recognized as having a number of disadvantages, including abrupt shifts in brightness, misleading appearance for viewers with color vision deficiencies, and colors too flashy to look at for a longer time. As part of our R software project colorspace we therefore started collecting typical (ab)uses of the RGB rainbow palette on our web site http://colorspace.R-Forge.R-project.org/articles/endrainbow.html and suggest better HCL-based color palettes.
Here, we present the most recent addition to that example collection, a map of influenza severity in Germany, published by the influenza working group of the Robert Koch-Institut. Along with the original map and its poor choice of colors, we show a better HCL-based alternative as well as desaturated and color-vision-deficiency-emulated versions of both palettes.
The shaded map below was taken from the web site of the Robert Koch-Institut (Arbeitsgemeinschaft Influenza) and shows the severity of influenza in Germany in week 8, 2019. The original color palette (left) is the classic rainbow, ranging from “normal” (blue) to “strongly increased” (red). As all colors in the palette are very flashy and highly saturated, it is hard to grasp intuitively which areas are most affected by influenza. Also, the least interesting “normal” areas stand out as blue is the darkest color in the palette.
As an alternative, a proper multi-hue sequential HCL palette is used on the right. This has smooth gradients and the overall message can be grasped quickly, giving focus to the high-risk regions depicted in dark/colorful colors. However, the extremely sharp transitions between “normal” and “strongly increased” areas (e.g., in the North and the East) might indicate some overfitting in the underlying smoothing for the map.
Converting all colors to grayscale brings out even more clearly why the overall picture is so hard to grasp with the original palette: The gradients are discontinuous, switching several times between bright and dark. Thus, it is hard to identify the high-risk regions, while this is more natural and straightforward with the HCL-based sequential palette.
Emulating green-deficient vision (deuteranopia) emphasizes the same problems as the desaturated version above but reveals even more problems with the original palette: The wrong areas in the map “pop out”, making the map extremely hard to use for viewers with red-green deficiency. The HCL-based palette, on the other hand, is equally accessible for color-deficient viewers as for those with full color vision.
The desaturated and deuteranope versions of the original image influenzarainbow.png (a screenshot of the RKI web page) are relatively easy to produce using the colorspace function cvd_emulator("influenzarainbow.png"). Internally, this reads the RGB colors for all pixels in the PNG, converts them with the colorspace functions desaturate() and deutan(), respectively, and saves the PNG again. Below we also do this “by hand”.
What is more complicated is the replacement of the original rainbow palette with a properly balanced HCL palette (without access to the underlying data). Luckily, the image contains a legend from which the original palette can be extracted. Subsequently, it is possible to index all colors in the image, replace them, and write out the PNG again.
As a first step we read the original PNG image using the R package png, returning a height x width x 4 array containing the three RGB (red/green/blue) channels plus a channel for alpha transparency. Then, this is turned into a height x width matrix containing color hex codes using the base rgb() function:
img <- png::readPNG("influenzarainbow.png")
img <- matrix(
  rgb(img[,,1], img[,,2], img[,,3]),
  nrow = nrow(img), ncol = ncol(img)
)
Using a manual search we find a column of pixels from the palette legend (column 630) and thin it to obtain only 99 colors:
pal_rain <- img[96:699, 630]
pal_rain <- pal_rain[seq(1, length(pal_rain), length.out = 99)]
For replacement we use a slightly adapted sequential_hcl() palette that was suggested by Stauffer et al. (2015) for a precipitation warning map. The "Purple-Yellow" palette is currently only in version 1.4-1 of the package on R-Forge but other sequential HCL palettes could also be used here.
library("colorspace")
pal_hcl <- sequential_hcl(99, "Purple-Yellow", p1 = 1.3, c2 = 20)
Now for replacing the RGB rainbow colors with the sequential colors, the following approach is taken: The original image is indexed by matching the color of each pixel to the closest of the 99 colors from the rainbow palette. Furthermore, to preserve the black borders and the gray shadows, 50 shades of gray are also offered for the indexing. To match pixel colors to palette colors a simple Manhattan distance (sum of absolute distances) is used in the CIELUV color space:
## 50 shades of gray
pal_gray <- gray(0:50/50)
## LUV coordinates for image and palette
img_luv <- coords(as(hex2RGB(as.vector(img)), "LUV"))
pal_luv <- coords(as(hex2RGB(c(pal_rain, pal_gray)), "LUV"))
## Manhattan distance matrix
dm <- matrix(NA, nrow = nrow(img_luv), ncol = nrow(pal_luv))
for(i in 1:nrow(pal_luv)) dm[, i] <- rowSums(abs(t(t(img_luv) - pal_luv[i,])))
idx <- apply(dm, 1, which.min)
Now each element of the img hex color matrix can easily be replaced by indexing a new palette with 99 colors (plus 50 shades of gray) using the idx vector. This is what the pal_to_png() function below does, writing the resulting matrix to a PNG file. The function is somewhat quick and dirty, makes no sanity checks, and assumes img and idx are in the calling environment.
pal_to_png <- function(pal = pal_hcl, file = "influenza.png", rev = FALSE) {
  ret <- img
  pal <- if(rev) c(rev(pal), rev(pal_gray)) else c(pal, pal_gray)
  ret[] <- pal[idx]
  ret <- coords(hex2RGB(ret))
  dim(ret) <- c(dim(img), 3)
  png::writePNG(ret, target = file)
}
With this function, we can easily produce the PNG graphic with the desaturated palette and the deuteranope version:
pal_to_png(desaturate(pal_rain), "influenzarainbowgray.png")
pal_to_png( deutan(pal_rain), "influenzarainbowdeutan.png")
The analogous graphics for the HCL-based "Purple-Yellow" palette are generated by:
pal_to_png( pal_hcl, "influenzapurpleyellow.png")
pal_to_png(desaturate(pal_hcl), "influenzapurpleyellowgray.png")
pal_to_png( deutan(pal_hcl), "influenzapurpleyellowdeutan.png")
Given that we have now extracted the pal_rain palette and set up the pal_hcl alternative, we can also use the colorspace function specplot() to understand how the perceptual properties of the colors change across the two palettes. For the HCL-based palette, hue/chroma/luminance change smoothly from a dark/colorful purple to a light yellow. In contrast, in the original RGB rainbow chroma and, more importantly, luminance change non-monotonically and rather abruptly:
specplot(pal_rain)
specplot(pal_hcl)
Given that the colors in the image are now indexed and the gray shades are in a separate subvector, we can easily reverse the order in both subvectors. This yields a black background with white letters, and we can use the "Inferno" palette that works well on dark backgrounds:
pal_to_png(sequential_hcl(99, "Inferno"), "influenzainferno.png", rev = TRUE)
For more details on the limitations of the rainbow palette and further pointers see “The End of the Rainbow” by Hawkins et al. (2014) or “Somewhere over the Rainbow: How to Make Effective Use of Colors in Meteorological Visualizations” by Stauffer et al. (2015) as well as the #endrainbow hashtag on Twitter.
The web page http://hclwizard.org/ had originally been started to accompany the manuscript “Somewhere over the Rainbow: How to Make Effective Use of Colors in Meteorological Visualizations” by Stauffer et al. (2015, Bulletin of the American Meteorological Society), to facilitate the adoption of color palettes based on the HCL (Hue-Chroma-Luminance) color model. It was realized using the R package colorspace in combination with shiny.
After the major update of the colorspace package, http://hclwizard.org/ has also just been relaunched, now hosting all three shiny color apps from the package:
This app allows you to design new palettes interactively: qualitative palettes, sequential palettes with single or multiple hues, and diverging palettes (composed of two single-hue sequential palettes). The underlying HCL coordinates can be modified, starting out from a wide range of predefined palettes. The resulting palette can be assessed in various kinds of displays and exported in different formats.
This app allows you to assess how well the colors in an uploaded graphics file (png/jpg/jpeg) work for viewers with color vision deficiencies. Different kinds of color blindness can be emulated: deuteranope (green deficient), protanope (red deficient), tritanope (blue deficient), monochrome (grayscale).
In addition to the palette creator app described above, this app provides a more traditional color picker. Sets of individual colors can be selected (and exported) by navigating different views of the HCL color space.
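All three apps can also be launched locally from recent versions of the colorspace package instead of via the web page. A minimal sketch (the `if (interactive())` guard is our addition, since the apps block the session while running):

```r
library("colorspace")

## launch the shiny apps locally (only in an interactive session)
if (interactive()) {
  hclwizard()          # palette creator app
  cvd_emulator()       # color vision deficiency emulator
  hcl_color_picker()   # HCL color picker
}
```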
Martin Wagner, Achim Zeileis (2019). “Heterogeneity and Spatial Dependence of Regional Growth in the EU: A Recursive Partitioning Approach.” German Economic Review, 20(1), 67-82. doi:10.1111/geer.12146
We use model-based recursive partitioning to assess heterogeneity of growth and convergence processes based on economic growth regressions for 255 European Union NUTS2 regions from 1995 to 2005. Spatial dependencies are taken into account by augmenting the model-based regression tree with a spatial lag. The starting point of the analysis is a human-capital-augmented Solow-type growth equation similar in spirit to Mankiw et al. (1992, The Quarterly Journal of Economics, 107, 407-437). Initial GDP and the share of highly educated in the working age population are found to be important for explaining economic growth, whereas the investment share in physical capital is only significant for coastal regions in the PIIGS countries. For all considered spatial weight matrices, recursive partitioning leads to a regression tree with four terminal nodes, with partitioning according to (i) capital regions, (ii) non-capital regions in or outside the so-called PIIGS countries, and (iii) inside the respective PIIGS regions furthermore between coastal and non-coastal regions. The choice of the spatial weight matrix clearly influences the spatial lag parameter, while the estimated slope parameters are very robust to it. This indicates that accounting for heterogeneity is an important aspect of modeling regional economic growth and convergence.
https://CRAN.R-project.org/package=lagsarlmtree
The growth model to be assessed for heterogeneity is a linear regression model for the average growth rate of real GDP per capita (ggdpcap) as the dependent variable with the following regressors:
Thus, a human-capital-augmented version of the Solow model is employed, inspired by the by now classical work of Mankiw et al. (1992). The well-known data sets from Sala-i-Martin et al. (2004) and Fernandez et al. (2001) are used below for estimation.
To assess whether a single growth regression model with stable parameters across all EU regions is sufficient, splitting the data by the following partitioning variables is considered:
To adjust for spatial dependencies a spatial lag term with inverse distance weights is considered here. Other weight specifications lead to very similar estimated tree structures and regression coefficients, though.
library("lagsarlmtree")
data("GrowthNUTS2", package = "lagsarlmtree")
data("WeightsNUTS2", package = "lagsarlmtree")
tr <- lagsarlmtree(ggdpcap ~ gdpcap0 + shgfcf + shsh + shsm |
  gdpcap0 + accessrail + accessroad + capital + regboarder + regcoast + regobj1 + cee + piigs,
  data = GrowthNUTS2, listw = WeightsNUTS2$invw, minsize = 12, alpha = 0.05)
print(tr)
## Spatial lag model tree
##
## Model formula:
## ggdpcap ~ gdpcap0 + shgfcf + shsh + shsm | gdpcap0 + accessrail +
##     accessroad + capital + regboarder + regcoast + regobj1 +
##     cee + piigs
##
## Fitted party:
## [1] root
## |   [2] capital in no
## |   |   [3] piigs in no: n = 176
## |   |       (Intercept)     gdpcap0      shgfcf        shsh        shsm
## |   |           0.11055    -0.01171    -0.00208     0.02195     0.00179
## |   |   [4] piigs in yes
## |   |   |   [5] regcoast in no: n = 13
## |   |   |       (Intercept)     gdpcap0      shgfcf        shsh        shsm
## |   |   |            0.1606     -0.0159     -0.0469      0.0789     -0.0234
## |   |   |   [6] regcoast in yes: n = 39
## |   |   |       (Intercept)     gdpcap0      shgfcf        shsh        shsm
## |   |   |           0.07348    -0.01106     0.09156     0.11668     0.00942
## |   [7] capital in yes: n = 27
## |       (Intercept)     gdpcap0      shgfcf        shsh        shsm
## |            0.2056     -0.0223     -0.0075      0.0411      0.0528
##
## Number of inner nodes:    3
## Number of terminal nodes: 4
## Number of parameters per node: 5
## Objective function (residual sum of squares): 0.0155
##
## Rho (from lagsarlm model):
##   rho
## 0.837
The resulting linear regression tree can be visualized with p-values from the parameter stability tests displayed in the inner nodes and a scatter plot of GDP per capita growth (ggdpcap) vs. (log) initial real GDP per capita (gdpcap0) in the terminal nodes:
plot(tr, tp_args = list(which = 1))
It is most striking that the speed of β-convergence is much higher for the 27 capital regions. More details about differences in the other regressors are shown in the table below. Finally, it is of interest which variables were not selected for splitting in the tree, i.e., are not associated with significant parameter instabilities: initial income, the border dummy, and Objective 1 regions, among others.
Estimated coefficients by terminal node (standard errors in parentheses); capital, piigs, and regcoast are the partitioning variables, the remaining columns the regressor variables:

Node  n    capital  piigs  regcoast  (Const.)       gdpcap0           shgfcf            shsh           shsm
3     176  no       no     –         0.111 (0.016)  –0.0117 (0.0016)  –0.0021 (0.0077)  0.022 (0.011)  0.0018 (0.0068)
5     13   no       yes    no        0.161 (0.128)  –0.0159 (0.0135)  –0.0469 (0.0815)  0.079 (0.059)  –0.0234 (0.0660)
6     39   no       yes    yes       0.073 (0.056)  –0.0111 (0.0059)  0.0916 (0.0420)   0.117 (0.029)  0.0094 (0.0218)
7     27   yes      –      –         0.206 (0.031)  –0.0223 (0.0029)  –0.0075 (0.0259)  0.041 (0.020)  0.0528 (0.0117)
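The difference in β-convergence speeds can be sketched with a back-of-the-envelope calculation. This is our own illustration using the standard approximation from the growth literature, relating the coefficient b on (log) initial income over a period of T years to the annual convergence speed β via b = -(1 - exp(-βT))/T:

```r
## Annual speed of convergence implied by the initial-income coefficient
## b in a T-year average-growth regression: beta = -log(1 + b * T) / T.
conv_speed <- function(b, T = 10) -log(1 + b * T) / T

conv_speed(-0.0117)  # non-capital, non-PIIGS regions (node 3)
conv_speed(-0.0223)  # capital regions (node 7)
```

With T = 10 years (1995 to 2005), the capital regions converge at roughly twice the annual rate of the non-capital, non-PIIGS regions.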
For more details see the full manuscript. Replication materials for the entire analysis from the manuscript are available as a demo in the package:
demo("GrowthNUTS2", package = "lagsarlmtree")