Rice Analytics Issued Fundamental Patent on RELR Method
This Patent Covers RELR Error Modeling and Related Dimension Reduction
St. Louis, MO (USA), October 4, 2011 – Rice Analytics, the
pioneers in automated reduced error regression, announced today that the US Patent
Office has issued it a patent covering fundamental aspects of its Reduced
Error Logistic Regression (RELR) technology. This patent covers important error modeling
and dimension reduction aspects of RELR. Dan Rice, the
inventor of RELR and President of Rice Analytics, described the significance of this
RELR patent as follows:
“While large
numbers of patents are important in many technology applications, it
is also clear that just one fundamental patent can lead to the breakthrough commercialization
of an entire industry. The MRI patent in
the early 1970’s had such an effect and by the 1990’s had resulted in billions
of dollars in licensing fees and enormous practical applications in
medicine. We believe that this RELR
patent could have a similar effect in the field of Big Data analytics because
RELR completely avoids the problematic and risky issues related to error and
arbitrary model building choices that plague all other Big Data high
dimensional regression algorithms. RELR
finally allows Big Data machine learning to be completely automated and
interpretable. Just as the MRI allowed the physician to work at a much higher
level and avoid arbitrary diagnostic choices where two physicians would come to
completely different and inaccurate diagnoses, RELR allows analytic professionals to
work at a much higher level and completely avoid arbitrary guesses in model
building. Thus, different modelers will no longer either build completely different
models with the very same data or have to rely upon pre-filled parameters that
are the arbitrary choices of others. Most modelers would spend
significant time testing arbitrary parameters because they are worried about the large risk associated with such parameters, but then it is very hard for
them to find the time to be creative. The complete automation that is the basis
of RELR frees analytic professionals to work at a much higher and creative
level, so they can pose better modeling problems and develop insightful model interpretations.
Most importantly, unlike parsimonious variable selection in all other
algorithms, RELR’s Parsed variable selection models actually can be interpreted
because these models are not built with arbitrary choices and because they are
consistent with maximum probability statistical theory.”
This US patent, number 8,032,473, describes a method of
modeling and reducing error in logistic regression that can be applied quite
generally in machine learning applications. Logistic regression is one of the
more general advanced analytics methods because it can be used to model the
probability of outcomes in all classic regression problems without regard to
the form of the dependent variable. The most common application of logistic
regression is in modeling categorical outcomes, such as binary or ordinal outcomes.
Yet, any continuous dependent variable can be categorized into intervals and
also modeled with logistic regression, such as in forecasting and survival
analysis problems. Logistic regression
remains one of the most widely used advanced analytics methods in business,
government, medicine, and science applications. The reason for the popularity of logistic regression is that it allows
the possibility of insight into the key putative drivers of the predicted
regression outcome, but problems related to error and dimensionality are major
limiting factors and prevent such insight with non-experimental data. This patented RELR method overcomes these
problems.
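The idea of categorizing a continuous dependent variable into intervals can be illustrated with a minimal Python sketch. This is not the RELR method and is not taken from the patent; it only shows one common way to turn a continuous outcome into ordinal intervals and then into a set of binary targets, each of which could be modeled with binary logistic regression. The function names, the survival times, and the cut points are all hypothetical.

```python
from bisect import bisect_right

def categorize(values, cut_points):
    """Map each continuous value to the index of the interval it falls in."""
    cuts = sorted(cut_points)
    # bisect_right counts how many cut points are <= the value
    return [bisect_right(cuts, v) for v in values]

def cumulative_binary_targets(categories, n_cuts):
    """For each cut k, build a 0/1 target that is 1 when the category lies above cut k."""
    return [[1 if c > k else 0 for c in categories] for k in range(n_cuts)]

# Hypothetical survival times in months and hypothetical interval boundaries
survival_months = [2.5, 14.0, 7.2, 30.1, 11.9]
cut_points = [6.0, 12.0, 24.0]

categories = categorize(survival_months, cut_points)
targets = cumulative_binary_targets(categories, len(cut_points))

print(categories)  # [0, 2, 1, 3, 1]
for k, target in enumerate(targets):
    print(f"outcome above {cut_points[k]} months:", target)
```

Each list in `targets` is an ordinary binary outcome, so the same binary modeling procedure can be reused for every interval boundary.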
Various
regularization and variable selection approaches, such as Ridge, LARS, LASSO or
Stepwise methods, have been proposed to handle the regression error
or dimensionality problems or both. These methods 1) require assumptions that are often far from realistic,
2) either require an analyst to test and tune various arbitrary parameters
manually or rely upon pre-filled arbitrary parameters, and 3) often require a preliminary
stage to reduce the dimensionality using another algorithm, such as principal
components analysis or decision trees, that is also fraught with arbitrary
choices. A further problem is that
multicollinearity and related overfitting error, due to highly correlated variables,
often persist with high dimensional data even after these traditional
methods to deal with error and dimensionality are applied. As a result, building a logistic regression
model with high dimensional data is an extremely difficult and time-consuming
challenge that often requires a large labor effort from analysts and almost
always requires arbitrary choices that will differ across analysts and data
samples. Because these traditional
logistic regression methods require such arbitrary choices, their solutions can
also differ widely across analysts. Such variability will have a large effect on the quality of the solution
and its accuracy. Today’s era of “Big Data” is the era of high
dimensional data, but traditional methods to deal with error and dimensionality
in logistic regression do not overcome these problems. Logistic regression is not alone in having
these problems, as all other potentially interpretable predictive analytics methods have significant problems with high dimensional data.
Because RELR
substantially avoids error and dimensionality problems in logistic regression, large
sample sizes are not necessary for an accurate and stable RELR model. However, RELR still clearly benefits from Big
Data in terms of high dimensional data, as RELR can get dramatically more
accurate with more possible variables even though its Parsed RELR variable
selection usually only selects very few variables. The RELR error modeling and its related
dimension reduction method are described in the patent. This patent also describes what is now called
the Fullbest or “best Full RELR” method which produces accurate, but not
parsimonious models. RELR’s highly effective parsimonious Parsed variable
selection optimization method is not described in this patent, as it was
discovered after the patent submission. The difference between Fullbest and
Parsed RELR models is somewhat akin to the difference between a Full Regression
model (after initial dimension reduction) and a Stepwise Regression model,
although the Fullbest RELR model is usually substantially more accurate than an
arbitrary Full model. The primary application for Fullbest RELR is purely predictive modeling with extremely small training samples, such as fewer than several hundred observations, where parsimonious Parsed variable selection would underfit the model. Unlike Parsed RELR, Fullbest RELR does not produce models that can be interpreted as consistent with causal explanations, due to the redundancy of its variable selection. All evidence now
suggests that Parsed RELR performs substantially better in parsimony, accuracy,
stability, and interpretability than widely used Stepwise Regression methods, along with other widely used methods (see white paper review of previous studies). The
significance of Parsed RELR is that it arises as the solution that is the
maximum joint probability of all observed dependent variable events and
inferred error model events. With non-experimental data where all observations are assumed to be independent, this is readily interpretable as the maximum probability solution.
At present, the optimization
procedure that is the basis of Parsed RELR’s maximum probability solution has only
been partially disclosed publicly (see Rice, JSM Proceedings, 2008). Yet, the
methods described in this patent are fundamental to all effective RELR modeling
applications including parsimonious Parsed RELR variable selection. For example, effective implementation of
Parsed RELR requires balanced binary dependent variable model builds, based upon oversampling rare events and/or undersampling frequent events, because only with balanced builds is any t statistic that could be used to compute the error model non-arbitrary (see Section 12 and Appendix 4 of the MyRELR Manual for a fuller discussion
of the statistical theory of the RELR error modeling and why this is not arbitrary). The intercepts of these balanced binary RELR models are then corrected in scoring. Thus, with this non-arbitrary balanced binary model build, the error modeling solution described in the patent
would also necessarily apply to effective Parsed RELR binary modeling. It would also apply to all ordinal and interval categorized dependent variables in Parsed RELR models. While the patent is general enough to cover multinomial dependent variables, implementations of both Parsed and Fullbest RELR have handled multinomial dependent variables by building separate binary models, as this avoids the arbitrary choice of a multinomial reference condition. In reference to Parsed RELR and how the
patent influences its maximum probability optimization trade secrets, Dan Rice
suggests:
“The effect of this patent not only allows us to license our currently
implemented proprietary MyRELR SAS language macro that includes Parsed RELR in
open SAS code, but it also would allow us to license the Parsed RELR optimization trade
secrets to those who may also wish to develop our RELR technology in
popular Big Data frameworks such as for Hadoop Map-Reduce or firm software (firmware) implementations for in-database processing in massively parallel data warehouse appliances. With open code and trade secret licensing, we will not have concerns about the lack of more
fundamental patent protection. This patent also means that we may eventually
publish remaining details to our Parsed RELR optimization method prior to when
the patent period ends and still enjoy US patent protection for the error modeling that is necessary for Parsed RELR.”
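The balanced binary model build and the intercept correction in scoring mentioned earlier can be illustrated with a minimal Python sketch. It does not reproduce the RELR or MyRELR implementation, whose details are not given here. It assumes simple random undersampling of the frequent class and the standard prior correction for logistic regression intercepts fitted on resampled data; the function names and all numbers are hypothetical.

```python
import math
import random

def undersample_balanced(rows, labels, seed=0):
    """Randomly undersample the more frequent class so both classes are equally sized."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(pos), len(neg))
    keep = sorted(rng.sample(pos, n) + rng.sample(neg, n))
    return [rows[i] for i in keep], [labels[i] for i in keep]

def corrected_intercept(intercept, sample_rate, population_rate):
    """Shift an intercept fitted on resampled data back to the population event rate.

    Assumption: the standard prior correction, not necessarily what RELR uses.
    """
    odds_ratio = ((1 - population_rate) / population_rate) * (
        sample_rate / (1 - sample_rate)
    )
    return intercept - math.log(odds_ratio)

# Hypothetical data: 2 rare events among 10 observations
rows = list(range(10))
labels = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
balanced_rows, balanced_labels = undersample_balanced(rows, labels)

# Hypothetical intercept of 0.0 from a balanced build (event rate 0.5),
# scored against a population where the event rate is 2%
b0 = corrected_intercept(0.0, sample_rate=0.5, population_rate=0.02)
baseline_probability = 1 / (1 + math.exp(-b0))
print(len(balanced_rows), sum(balanced_labels), round(baseline_probability, 4))
```

With all predictors at zero, the corrected intercept returns the population event rate as the baseline probability, which is the purpose of correcting the intercept at scoring time.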
MyRELR is a
trademark of Rice Analytics. SAS® is a
registered trademark of SAS Institute. Map Reduce is a patented invention of Google, Inc. (US
Patent 7,650,331).
Copyright, 2011 Rice Analytics, All rights reserved.