Here is a link to a PowerPoint presentation of very interesting, informative and completely independent research on RELR produced by Thomas Ball. There is one typo where it states that LASSO/LARS did better on the Brier score than RELR when they actually did worse, but the correct directional interpretation of the Brier score is clear in his description: for example, the LARS and LASSO Brier scores were .181 and .182 on the small variable set, whereas RELR's Brier score was .17, and a lower score is better because the Brier score is simply the average squared error. Thomas Ball is a very experienced analytics and research consultant who works in New York City as a very senior consultant for large educational institutions and large insurance companies, amongst others. Thomas responded to a public challenge that we put out on LinkedIn in the summer of 2011, which gave away a long extended free MyRELR license and invited people to test RELR and provide a public report on their research if they were interested in comparing their methods to RELR. This is his report (right click and open in a new tab to view, or save it to your computer; you will not be able to see it by directly clicking on the link in some browsers).
Modeling First Year College Achievement Using RELR.pdf
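As background for the Brier score comparisons discussed above, here is a minimal Python sketch (not part of Ball's study; the outcomes and probabilities below are made up) of how the Brier score is computed as an average squared error, with lower values being better:

```python
import numpy as np

# Brier score: mean squared error between predicted probabilities and
# binary outcomes; lower is better. In the report, RELR scored .17 vs.
# .181/.182 for LARS/LASSO on the small variable set.
def brier_score(y_true, p_pred):
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return float(np.mean((p_pred - y_true) ** 2))

# Tiny illustrative example (made-up probabilities, not the study's data):
y = [1, 0, 1, 1, 0]
p_good = [0.9, 0.2, 0.8, 0.7, 0.1]   # well-calibrated predictions
p_poor = [0.6, 0.5, 0.5, 0.5, 0.4]   # uninformative predictions
print(brier_score(y, p_good))  # smaller -> better
print(brier_score(y, p_poor))
```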
Below are comments by Dan Rice on this same research that also discuss some results that were not presented in the slides (please also see the new research in the links at the bottom of this page).
Thanks to Tom Ball for providing some very interesting preliminary data and results from his research on RELR. This was independent research, so I do not agree with him on everything, but that is the nature of research. First, I have a couple of comments on the methods, just to provide a few more details than appear in the excellent high-level report. The RELR method that was run was always Parsed RELR, the variable selection method that gives parsimonious "explanatory" models and that can be interpreted as the most probable model resulting from the maximum joint probability of all observed outcome events and inferred variable error events. Also, I was told that Proc GLMSelect was used with the default parameters to produce LASSO and LARS. I assume that the binary dependent variable was coded as -1 and 1 to allow logistic regression to be run in Proc GLMSelect for LASSO and LARS in a way that the Hastie et al. 2009 Elements of Statistical Learning book suggests is possible.
Other than that, all of my comments concern the results. They are:
RELR was almost uniformly the most accurate algorithm in these relatively tall datasets compared to LASSO, LARS, Random Forests Logistic Regression, Stepwise Logistic Regression and Bayesian Nets. With the exception of a couple of comparisons involving the smallest Brier score (average squared error) differences, RELR outperformed in all other accuracy comparisons on both measures: average squared error and classification accuracy. Statistical significance tests are not provided for this roughly balanced validation sample of 3300 observations (roughly 13,000 observations were used in training). Yet, in this balanced validation sample of 3300 observations, it is likely that RELR's improved classification accuracy, averaging 2-2.5% compared to LASSO, LARS, Stepwise and Random Forest Logistic Regression, would be highly statistically significant. That is, the 95% confidence bounds on these classification accuracy percentages are all less than +/- 0.65%, so the dependent-sample McNemar proportions comparison should be quite significant. Also, it is likely that RELR's improved average squared error performance compared to LARS and LASSO of approximately 0.011-0.012 units would also be statistically significant, whereas the much smaller differences of 0.001-0.003 in Brier scores that do not favor RELR are probably either both not statistically significant, or possibly the .003 difference would be only marginally significant. What is not shown is an earlier result that RELR also had a classification accuracy benefit relative to Bayesian Networks of roughly 4% in the smaller dataset; Bayes Networks was dropped from the report to simplify it. Whether a classification accuracy improvement of 2-4% on average is practically significant in this present educational application, where the predictive model may not be directly applied, is a legitimate question.
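The dependent-sample McNemar comparison suggested above can be sketched as follows; the discordant-pair counts here are hypothetical, merely sized to resemble a validation sample of a few thousand observations:

```python
import math

# McNemar's test compares two classifiers on the SAME cases using only the
# discordant pairs: b = cases model A got right and model B got wrong,
# c = the reverse. Continuity-corrected chi-square statistic with 1 df;
# the upper-tail probability for chi-square(1) is erfc(sqrt(stat/2)).
def mcnemar_p(b, c):
    """p-value of the continuity-corrected McNemar test for counts b, c."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical counts (not from the study):
print(mcnemar_p(450, 375))  # a sizable split of discordant pairs
print(mcnemar_p(100, 100))  # an even split: no evidence of a difference
```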
Yet, there are many business applications where predictive models can be directly applied where even a 2% classification accuracy improvement could mean a profit improvement of millions of dollars.
The datasets used in the present study, with 13,000 training observations, are nothing like tall big data datasets that contain millions of training observations, but they do indicate that RELR can outperform other algorithms in classification accuracy in more than just the very wide datasets where RELR has previously been reported to outperform. That was surprising to me, as we even stated in the MyRELR software manual that RELR's principal classification accuracy advantage should be with wide data encompassing many variables in relation to the sample size, because that is what we had observed in our earliest research, which always concerned almost balanced samples. For example, in our 2008 JSM paper with balanced sample data, RELR only showed improvements in average squared error and not in classification accuracy in the taller dataset (which was not really tall compared to the present datasets), but an intercept was not directly computed in that 2008 model. However, starting in 2009, we implemented a software change so that intercepts are now always computed directly in the MyRELR software; before this, for almost balanced binary models like those in the present study, one would often employ a more involved procedure in which intercepts were not computed directly but were instead found through an adjustment process that tried to minimize the error. When intercepts are computed directly, RELR's classification accuracy can improve. This change in the MyRELR software may be a primary reason that RELR can now show classification accuracy improvements relative to other algorithms even with such tall data, where RELR's accuracy advantages should still be least apparent.
All algorithms produced stable models across these overlapping samples with 13,000 training observations. Five overlapping samples, as used in this study, may not be the best way to assess stability because there is a ceiling effect: all models will be stable with largely overlapping samples, such as the 80% overlap in the 5 samples of the current study. Additionally, the overlapped samples will tend to oversample certain observations and undersample others, so the different samples will not necessarily be representative of the population. Independent samples would be the preferred way to assess stability, as suggested by the Rice 2008 JSM Proceedings paper. However, overlapping samples, as in bootstrapped samples, have also been employed in the literature to assess stability (see Austin and Tu, Journal of Clinical Epidemiology, 2004), although that work used 1000 samples compared to 5 in the current study. But there were limited resources available, so a more comprehensive study was simply not feasible here. In any case, a coefficient stability score in the range of 0.8 would be found if two models each had one different variable selected out of 10 and almost the same relative magnitudes for all regression coefficients of the same selected variables. Because the range of stability correlations is roughly 0.8-0.99 across these large overlapping samples, all algorithms had mostly similar selected variables and regression coefficients and therefore reasonable stability. Yet, any differences in correlation stability between the other algorithms and RELR are unlikely to be statistically reliable with only 10 independent observations (variables selected) used. I am not sure how to interpret the other "stability" measure, because it is recommended for large sample clustering quality and has not previously been applied to small numbers of selected variables in stability studies.
In addition, it seems to be close to zero in the large data case, where there must still be pretty good stability, as indicated by the high correlations in the other stability measure.
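To make the coefficient stability score concrete, here is a rough Python sketch (hypothetical variable names and coefficients, not the study's models) of one way such a score could be computed: align two models' coefficients over the union of their selected variables and correlate them:

```python
import numpy as np

# Coefficient stability sketch: unselected variables count as zero, and the
# Pearson correlation of the aligned coefficient vectors is the score.
def stability_corr(coefs_a, coefs_b):
    names = sorted(set(coefs_a) | set(coefs_b))
    a = np.array([coefs_a.get(n, 0.0) for n in names])
    b = np.array([coefs_b.get(n, 0.0) for n in names])
    return float(np.corrcoef(a, b)[0, 1])

# Two hypothetical 10-variable models that agree on 9 selected variables
# and their coefficients but differ on one selection:
m1 = {f"x{i}": w for i, w in enumerate([1.2, -0.8, 0.5, 0.9, -0.4,
                                        0.7, -1.1, 0.3, 0.6, -0.5])}
m2 = dict(m1)
del m2["x0"]        # one variable dropped...
m2["x10"] = 1.2     # ...and a different one selected in its place
print(stability_corr(m1, m2))  # well below 1.0, but still high
```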
A large sample size was not necessary for an accurate RELR model, and overlapping samples were not necessary for stability in RELR either. That is, with much smaller independent training samples of 3300 observations, RELR had roughly the same Brier score and classification accuracy as it had with training samples of roughly 13,000, and only a slight drop in stability. As mentioned on the background page of the report, an independent sample comparison was conducted for RELR models in which each independent training and validation sample had roughly 3300 observations. Those results are deemed academic in the report because they do not change much from the larger overlapping samples, but it was reported to me that RELR's correlation stability measure dropped slightly, down to roughly .77. Another interesting finding was that the RELR classification and Brier score validation accuracy produced from these small 3300-observation training samples were almost identical to those produced with the large 13,000-observation training samples shown in the report. These results indicate that RELR is still reasonably stable on this highly multicollinear dataset with much smaller independent stability samples, and that RELR can still have extremely good accuracy with very small training samples. Both of these results have been reported before, as this essentially replicates the finding that RELR works very well with smaller training samples. Note that many of the independent variables in this study, like GPA, class standing, income and socioeconomic status, were extremely correlated. For this reason, the sample size at which RELR's variable selection shows close to perfect stability is likely to be greater in this dataset than in datasets without such extreme multicollinearity.
This study does not report how the other algorithms would perform in accuracy and stability problems with these same SMALLER INDEPENDENT training samples of 3300 used with RELR. A fairly large number of academic research articles suggest that there may be problems with highly correlated variables in LASSO and LARS. That is, LASSO and LARS seem to arbitrarily select only one variable from each correlated group (see Zou and Hastie, Journal of the Royal Statistical Society, 2005) and this might lead to unstable variable selection across independent samples and/or less accurate models. These LASSO and LARS problems would be expected to be worse when smaller independent samples are employed to assess stability and accuracy. Similar academic studies suggest that instability and accuracy problems are seen in Random Forests methods and Stepwise Regression in smaller independent samples with multicollinear variables. Unfortunately, this present study's analysis only looked at RELR in SMALLER INDEPENDENT training samples of 3300 and did not look at the other algorithms, as all other algorithms were only viewed with LARGER OVERLAPPING training samples where stability and accuracy problems would be expected to happen much less.
A related issue is that the LASSO and LARS method in SAS that was used in the present study does not allow control over the LASSO regularization parameter. This parameter is essentially set automatically through the algorithm deployed in SAS. The effect of too much LASSO regularization can mean less accuracy, but more parsimonious selection, which may be more stable with a large enough sample and with data that are conducive to parsimonious models. Too little LASSO regularization can mean greater accuracy, but also much less parsimony and stability. In RELR, one does not worry about such an unknown arbitrary regularization parameter, as its error modeling replaces regularization and is automatic. Future studies that compare to RELR may examine how these regularization settings in LASSO and LARS would affect these stability vs. accuracy vs. parsimony tradeoffs. These studies could also look at an algorithm like Elastic Net which has been developed to overcome this tendency for LASSO and LARS to arbitrarily select only one variable from each group of highly correlated variables with the obvious cost of less parsimony.
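For illustration only, the accuracy-vs-parsimony tradeoff described above can be seen with scikit-learn's Lasso on synthetic data; the data and alpha values below are assumptions of this sketch, not SAS Proc GLMSelect settings:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic regression with 3 true signals among 10 candidate variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
beta = np.array([2.0, -1.5, 1.0, 0, 0, 0, 0, 0, 0, 0])  # true coefficients
y = X @ beta + rng.normal(scale=0.5, size=200)

# Sweep the regularization strength by hand: heavier regularization gives
# fewer selected variables (more parsimony) but a worse in-sample fit.
results = {}
for alpha in (0.01, 0.1, 1.0):
    m = Lasso(alpha=alpha).fit(X, y)
    n_selected = int(np.sum(m.coef_ != 0))
    results[alpha] = (n_selected, round(m.score(X, y), 3))
    print(alpha, results[alpha])  # (variables selected, R-squared)
```

RELR, as discussed above, has no analogous free parameter to sweep.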
This LASSO/LARS/Elastic Net paradigm can give good models when the regularization assumption is correct, but the problem is that this paradigm will miss fundamental information when that assumption is incorrect, as pointed out in the Zou and Hastie example concerning highly correlated variables. Thus, given the laws of probability, there should be instances in empirical studies where these LASSO/LARS/Elastic Net models perform comparably to or even better than Parsed RELR because one has guessed correctly about the regularization model. Yet, the real advantage of RELR is that one does not have to guess about the type of regularization or the regularization parameter settings, because RELR's error model does not have any arbitrary parameters and simply generates the most probable model given the data. The most probable model is not always the most accurate, but it should be the most accurate more often than not, assuming a minimal sample size and the very same candidate variables/interactions/nonlinear and missing data effects in each comparison model. Note that one of the advantages of ensemble modeling is that it may incorporate a diverse and overlapping basis of candidate effects (Seni and Elder, Ensemble Methods in Data Mining, 2010), which makes it less susceptible to unexpected missing data because of the great overlap in the large number of selected variables in these models. Ensemble modeling may thus often be expected to outperform Parsed RELR in real world accuracy, but definitely not in parsimony or interpretation. Similarly, the Best Full RELR model also often seems to outperform Parsed RELR slightly in accuracy in real world scoring samples. This is probably because the many redundant variables in fuller RELR models are also less sensitive to unrepresentative missing data effects than the parsimonious Parsed RELR models.
Overall though, the results of the present study support the idea that Parsed RELR generates the most probable model, but this study missed another chance to falsify or support this contention by not comparing the other algorithms to RELR in smaller independent samples.
The problems reported in previous academic literature that Stepwise Regression has with wrong signed regression coefficients and with selecting too many variables in multicollinear datasets are clearly seen in this study. More than 20% of Stepwise-selected variables had regression coefficients with signs opposite to the pairwise correlations between those independent variables and the dependent variable. None of the other algorithms had this problem. Stepwise also tended to select 2-3 times as many variables as the other algorithms. The fact that these two problems co-occur suggests that Stepwise may be putting in extra variables to offset the selected variables with wrong signs, or vice versa. In that case, any fix based on dropping the wrong signed variables or reversing their signs may be problematic and lead to inaccurate and high variance real world performance of models. While some of the problems related to multicollinearity would be expected to be muted in larger samples, wrong signed coefficients may be one problem that is not easily muted, at least in training samples in the range of tens of thousands with highly multicollinear variables.
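One simple way to operationalize the wrong-sign check described above is sketched below; the data and the coefficient vectors are hypothetical, purely for illustration:

```python
import numpy as np

# Flag every selected variable whose regression coefficient sign disagrees
# with the sign of its pairwise correlation with the dependent variable,
# and report the fraction flagged (the study reported over 20% for Stepwise).
def wrong_sign_fraction(X, y, coefs):
    flags = []
    for j, b in enumerate(coefs):
        if b == 0:
            continue  # skip unselected variables
        r = np.corrcoef(X[:, j], y)[0, 1]
        flags.append(np.sign(b) != np.sign(r))
    return float(np.mean(flags))

# Synthetic data in which both predictors correlate positively with y:
rng = np.random.default_rng(2)
x1, x2 = rng.normal(size=500), rng.normal(size=500)
y = x1 + x2 + rng.normal(scale=0.5, size=500)
X = np.column_stack([x1, x2])

# A hypothetical fit that gave x2 a negative coefficient would be flagged,
# since x2 correlates positively with y:
print(wrong_sign_fraction(X, y, np.array([1.0, -0.4])))  # flags 1 of 2
print(wrong_sign_fraction(X, y, np.array([1.0, 0.9])))   # flags none
```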
Daniel M. Rice
St. Louis, MO (USA)
August 6, 2011
Update (May 30, 2016): Here are links to new work by Dan Rice that shows good replication of RELR's parsimonious feature selection (now called Explicit RELR) and predictions across randomly split subsamples, which departs from the resampling performed in this Ball research. The Rice design using independent randomly split subsamples is similar to that used in other work showing good replication of RELR's feature selection and predictions, like that presented in the 2008 JSM Proceedings and the 2014 book Calculus of Thought. Replication can only be reliably measured across such completely independent samples, as even observations generated from white noise can have similar feature selection and predictions in models built across resampled data with sufficient overlap. Note that highly correlated substitute features (r > .95) were selected in the first link below, so any measure of stability that did not evaluate substitutes, like the one in this Ball work, will miss them as markers of replication. Also, note that Explicit RELR (previously called Parsed RELR) and Implicit RELR (previously called Best Full RELR) both show very high correlations between predictions in models built from independent samples. This "stability of predictions" is a much better measure than a stability measure of feature selection that ignores highly correlated substitutes, as used in the Ball work reviewed here. Future work on stability will be much more insightful if it reports the degree of correlation between predictions built from completely independent samples, as in the second link below, since "stability of predictions" is measured in this way and "stable predictions" are the ultimate goal in predictive modeling.
http://www.skyrelr.com/interpretingdeepexplicitrelr/
http://www.skyrelr.com/replicatingpredictionsusingpublicucirvinedataset/
