Rice Analytics

Automated Reduced Error Predictive Analytics


Rice Analytics Issued Fundamental Patent on RELR Method

This Patent Covers RELR Error Modeling and Related Dimension Reduction

St. Louis, MO (USA), October 4, 2011 – Rice Analytics, the pioneers in automated reduced error regression, announced today that the US Patent Office has issued it a patent covering fundamental aspects of its Reduced Error Logistic Regression (RELR) technology. This patent covers important error modeling and dimension reduction aspects of RELR. Dan Rice, the inventor of RELR and President of Rice Analytics, described the significance of this RELR patent as follows:

“While large numbers of patents are important in many technology applications, it is also clear that just one fundamental patent can lead to the breakthrough commercialization of an entire industry.  The MRI patent in the early 1970’s had such an effect and by the 1990’s had resulted in billions of dollars in licensing fees and enormous practical applications in medicine.  We believe that this RELR patent could have a similar effect in the field of Big Data analytics because RELR completely avoids the problematic and risky issues related to error and arbitrary model building choices that plague all other Big Data high dimensional regression algorithms.  RELR finally allows Big Data machine learning to be completely automated and interpretable. Just as the MRI allowed the physician to work at a much higher level and avoid arbitrary diagnostic choices where two physicians would come to completely different and inaccurate diagnoses, RELR allows analytic professionals to work at a much higher level and completely avoid arbitrary guesses in model building.  Thus, different modelers will no longer either build completely different models with the very same data or have to rely upon pre-filled parameters that are the arbitrary choices of others. Most modelers would spend significant time testing arbitrary parameters because they are worried about the large risk associated with such parameters, but then it is very hard for them to find the time to be creative. The complete automation that is the basis of RELR frees analytic professionals to work at a much higher and creative level, so they can pose better modeling problems and develop insightful model interpretations. Most importantly, unlike parsimonious variable selection in all other algorithms, RELR’s Parsed variable selection models actually can be interpreted because these models are not built with arbitrary choices and because they are consistent with maximum probability statistical theory.”

This US patent, number 8,032,473, describes a method of modeling and reducing error in logistic regression that can be applied quite generally in machine learning applications. Logistic regression is one of the more general advanced analytics methods because it can be used to model the probability of outcomes in all classic regression problems without regard to the form of the dependent variable. The most common application of logistic regression is in modeling categorical outcomes, such as binary or ordinal outcomes. Yet any continuous dependent variable can also be categorized into intervals and modeled with logistic regression, as in forecasting and survival analysis problems. Logistic regression remains one of the most widely used advanced analytics methods in business, government, medicine, and science applications. It is popular because it offers insight into the key putative drivers of the predicted regression outcome, but problems related to error and dimensionality are major limiting factors and prevent such insight with non-experimental data. This patented RELR method overcomes these problems.
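As a rough illustration of how a continuous dependent variable can be categorized into intervals before logistic modeling, the following minimal Python sketch bins a value at chosen cut points and derives one cumulative binary target per cut point. The function names, the cut points, and the sample durations are hypothetical and chosen only for illustration; this is a generic preprocessing step, not the RELR method described in the patent.

```python
from bisect import bisect_right

def to_interval(value, cut_points):
    """Return the index of the interval holding value.

    cut_points must be sorted ascending; interval k covers
    [cut_points[k-1], cut_points[k]), with open ends at both extremes.
    """
    return bisect_right(cut_points, value)

def cumulative_targets(value, cut_points):
    """Return one binary target per cut point: 1 if value >= cut, else 0."""
    return [1 if value >= cut else 0 for cut in cut_points]

# Hypothetical cut points (for example, months of survival) and observations.
cuts = [12.0, 24.0, 36.0]
durations = [5.0, 12.0, 30.5, 48.0]

# Ordinal category for each observation.
categories = [to_interval(d, cuts) for d in durations]

# Binary targets that separate binary logistic models could be fit against.
targets = [cumulative_targets(d, cuts) for d in durations]
```

Each column of `targets` is an ordinary binary outcome, so any binary logistic regression routine could in principle be applied to it; the ordinal category is simply the number of cut points the value has reached.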

Various regularization and variable selection approaches, such as Ridge, LARS, LASSO or Stepwise methods, have been proposed to handle the regression error or dimensionality problems or both. These methods 1) require assumptions that are often far from realistic, 2) either require an analyst to test and tune various arbitrary parameters manually or rely upon pre-filled arbitrary parameters, and 3) often require a preliminary stage to reduce the dimensionality that uses another algorithm, such as principal components analysis or decision trees, that is also fraught with arbitrary choices. A further problem is that multicollinearity and related overfitting error, due to highly correlated variables, often persist with high dimensional data even after these traditional methods to deal with error and dimensionality are applied. As a result, building a logistic regression model with high dimensional data is an extremely difficult and time-consuming challenge that often requires a large labor effort from analysts and almost always requires arbitrary choices that will differ across analysts and data samples. Because these traditional logistic regression methods require such arbitrary choices, their solutions can also differ widely across analysts. Such variability has a large effect on the quality of the solution and its accuracy. Today’s era of “Big Data” is the era of high dimensional data, but traditional methods to deal with error and dimensionality in logistic regression do not overcome these problems. Logistic regression is not alone in having these problems, as all other potentially interpretable predictive analytics methods have significant problems with high dimensional data.

Because RELR substantially avoids error and dimensionality problems in logistic regression, large sample sizes are not necessary for an accurate and stable RELR model. However, RELR still clearly benefits from Big Data in terms of high dimensional data, as RELR can get dramatically more accurate with more possible variables even though its Parsed RELR variable selection usually selects only very few variables. The RELR error modeling and its related dimension reduction method are described in the patent. This patent also describes what is now called the Fullbest or “best Full RELR” method, which produces accurate but not parsimonious models. RELR’s highly effective parsimonious Parsed variable selection optimization method is not described in this patent, as it was discovered after the patent submission. The difference between Fullbest and Parsed RELR models is somewhat akin to the difference between a Full Regression model (after initial dimension reduction) and a Stepwise Regression model, although the Fullbest RELR model is usually substantially more accurate than an arbitrary Full model. The primary application for Fullbest RELR is purely predictive modeling with extremely small training samples, such as fewer than several hundred observations, where parsimonious Parsed variable selection would underfit the model. Unlike Parsed RELR, Fullbest RELR does not produce models that can be interpreted as consistent with causal explanations, due to the redundancy of its variable selection. All evidence now suggests that Parsed RELR performs substantially better in parsimony, accuracy, stability, and interpretability than widely used Stepwise Regression methods, along with other widely used methods (see white paper review of previous studies). The significance of Parsed RELR is that it arises as the solution that is the maximum joint probability of all observed dependent variable events and inferred error model events. With non-experimental data where all observations are assumed to be independent, this is readily interpretable as the maximum probability solution.

At present, the optimization procedure that is the basis of Parsed RELR’s maximum probability solution has only been partially disclosed publicly (see Rice, JSM Proceedings, 2008). Yet, the methods described in this patent are fundamental to all effective RELR modeling applications including parsimonious Parsed RELR variable selection. For example, effective implementation of Parsed RELR requires balanced binary dependent variable model builds based upon oversampling rare and/or undersampling frequent events because only then would any possible t statistic that could be used to compute the error model not be arbitrary (see Section 12 and Appendix 4 of MyRELR Manual for a fuller discussion of the statistical theory of the RELR error modeling and why this is not arbitrary). The intercepts of these balanced binary RELR models are then corrected in scoring. Thus, with this non-arbitrary balanced binary model build, the error modeling solution described in the patent would also necessarily apply to effective Parsed RELR binary modeling. It would also apply to all ordinal and interval categorized dependent variables in Parsed RELR models. While the patent is general enough to cover multinomial dependent variables, implementations of both Parsed and Fullbest RELR have handled multinomial dependent variables by building separate binary models, as this avoids the arbitrary choice of a multinomial reference condition. In reference to Parsed RELR and how the patent influences its maximum probability optimization trade secrets, Dan Rice suggests:

“The effect of this patent not only allows us to license our currently implemented proprietary MyRELR SAS language macro that includes Parsed RELR in open SAS code, but it also would allow us to license the Parsed RELR optimization trade secrets to those who may also wish to develop our RELR technology in popular Big Data frameworks such as for Hadoop Map-Reduce or firm software (firmware) implementations for in-database processing in massively parallel data warehouse appliances. With open code and trade secret licensing, we will not have concerns about the lack of more fundamental patent protection. This patent also means that we may eventually publish remaining details to our Parsed RELR optimization method prior to when the patent period ends and still enjoy US patent protection for the error modeling that is necessary for Parsed RELR.”
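The release notes that balanced binary models are built by oversampling rare and/or undersampling frequent events, and that their intercepts are then corrected in scoring. The exact correction used in MyRELR is not given here, so the following Python sketch shows only the standard log-odds (prior) correction commonly applied to logistic models fit on resampled data. The function names, coefficients, and event rates are hypothetical, and this should be read as an assumption about how such a correction could look rather than as the patented method.

```python
import math

def corrected_intercept(intercept, sample_rate, population_rate):
    """Shift an intercept fit on a resampled set back to the population base rate.

    sample_rate: event proportion in the (balanced) training sample.
    population_rate: event proportion in the population being scored.
    """
    sample_log_odds = math.log(sample_rate / (1.0 - sample_rate))
    population_log_odds = math.log(population_rate / (1.0 - population_rate))
    return intercept - (sample_log_odds - population_log_odds)

def score(intercept, coefficients, features):
    """Return the logistic probability for one observation."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical model fit on a 50/50 balanced sample of a 2% event.
balanced_intercept = 0.0
coefficients = [0.8, -0.5]
adjusted = corrected_intercept(balanced_intercept, sample_rate=0.5, population_rate=0.02)

# Score a baseline observation (all features zero) before and after correction.
baseline = [0.0, 0.0]
p_balanced = score(balanced_intercept, coefficients, baseline)
p_corrected = score(adjusted, coefficients, baseline)
```

Only the intercept changes under this correction; the slope coefficients are left as estimated, so the rank ordering of scored observations is unaffected.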

 

MyRELR is a trademark of Rice Analytics.  SAS® is a registered trademark of SAS Institute. Map Reduce is a patented invention of Google, Inc. (US Patent 7,650,331).

Copyright, 2011 Rice Analytics, All rights reserved.
