Is the AUC the Best Measure?
Daniel M. Rice
Rice Analytics, St. Louis, MO (USA), www.riceanalytics.com
Sept 7, 2010
Copyright © 2010 Rice Analytics. All Rights Reserved.
The area under the receiver operating characteristic (ROC) curve (AUC), which
relates the hit rate to the false alarm rate, has become a standard measure in
tests of predictive modeling accuracy. The AUC is an estimate of the probability
that a classifier will rank a randomly chosen positive instance higher than a
randomly chosen negative instance. For this reason, the AUC is widely thought to
be a better measure than a classification error rate based upon a single prior
probability or KS statistic threshold.
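The ranking interpretation above can be made concrete with a minimal sketch. It assumes binary labels coded 0/1 and counts a tied pair as half a correct ranking (a common convention, but an assumption here). The function name `auc_pairwise` and the toy scores are illustrative only.

```python
from itertools import product


def auc_pairwise(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Assumes labels are 0/1; a tied pair counts as half a correct ranking.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative instance")
    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: model scores and true class labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(auc_pairwise(scores, labels))
```

Note that no threshold appears anywhere in this calculation: only the ordering of the scores matters.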
When deployed in a real-world application, many models will still use a single
threshold based upon the KS statistic or a prior probability rather than the
range of thresholds in the AUC calculation. Hence, the AUC may not reflect the
expected classification accuracy at this single threshold when these models are
put to real-world use. Another external validity problem is that the AUC does not
assess the extent to which the model output is well calibrated to the target
variable, because the AUC does not estimate the accuracy of the probabilities in
the model output. In contrast, average squared error directly reflects the error
of the output probabilities. Indeed, average squared error has a long and
successful track record for assessing probability accuracy in areas of science
such as weather forecasting, going back to Brier (1950). In many applications
today, even a 1% reduction in classification error or average squared error
would mean tens of millions of dollars or more of ROI. Yet the AUC may miss
these effects because of this lack of external validity.
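A small sketch may help illustrate the calibration point. It assumes 0/1 labels, a fixed decision threshold of 0.5, and a hypothetical second model whose output is simply the square of the first model's probabilities (a rank-preserving distortion). All function names and data are illustrative.

```python
def auc_pairwise(scores, labels):
    """Share of (positive, negative) pairs ranked correctly; ties count as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_squared_error(probs, labels):
    """Average squared error (Brier score) of predicted probabilities."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def error_rate(probs, labels, threshold=0.5):
    """Classification error rate at one fixed threshold (0.5 is an assumption)."""
    wrong = sum((p >= threshold) != (y == 1) for p, y in zip(probs, labels))
    return wrong / len(labels)


# Hypothetical toy data.
labels = [1, 1, 0, 1, 0, 0]
probs = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
# Squaring keeps the ranking intact but changes the probability values.
distorted = [p ** 2 for p in probs]

for name, output in (("original", probs), ("distorted", distorted)):
    print(name,
          auc_pairwise(output, labels),
          average_squared_error(output, labels),
          error_rate(output, labels))
```

In this toy data the AUC is identical for both outputs, while the average squared error and the error rate at the fixed threshold both change. The AUC cannot see a change in the probabilities that leaves the ranking untouched.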
What may be most troubling is that more published simulation data now show that
AUC estimates of error are less accurate than straight classification error
estimates. The earliest simulations were in the Huang and Ling (2005) work,
based upon up to 20 observations. Huang and Ling did report that AUC had better
accuracy than classification rate. However, more recent simulations with much
more data come to the opposite conclusion. Hanczar et al. (2010) report
simulations at various sample sizes up to 1000 observations. They find that
straight classification error is a better measure of actual error because the
AUC predicted error can have much greater dispersion and is therefore less
precise. These AUC inaccuracies were most apparent in imbalanced samples and
smaller samples. Based upon these data, Hanczar et al. (2010) urge caution in
the use of AUC measures unless the sample size is very large. Unfortunately,
they also point out that while “it would be nice to have a simple rule of thumb
to determine if a sample is sufficiently large ... no simple solution is
possible” (p. 829).
What does this mean for the everyday practitioner? A more comprehensive study
now suggests that the AUC may be noisier than previously thought (Hanczar et al.
2010). Other studies are needed, but this recent evidence does not support the
superiority of the AUC as a measure of accuracy. Another problem is that a valid
confidence interval for the AUC is not simple to compute, so a valid repeated
measures statistical test for AUC differences between two models built from the
same data would not be simple either. In any event, given the apparently greater
noise in the AUC, any practice of simply “eyeballing” AUC results might be
equivalent to flipping a coin to determine the reliability of differences
between models. In contrast to the AUC, well established and simple repeated
measures statistical tests can be used to assess straight classification error
rate differences or average squared error differences (see Rice 2008, as an
example). Thus, instead of picking a model winner in what could be a random AUC
lottery, the apparently more accurate measures (straight classification error
rate and average squared error), with much better statistical and external
validity, should probably now be considered.
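As a rough illustration of what such repeated measures tests could look like, the sketch below pairs two models' results on the same observations. The specific tests chosen here are assumptions for illustration and are not necessarily those used in Rice (2008): an exact McNemar test for classification error differences, and a paired t statistic for per-observation squared error differences. Function names and data are hypothetical.

```python
import math
from statistics import mean, stdev


def mcnemar_exact_p(correct_a, correct_b):
    """Two-sided exact McNemar p-value from paired correct/incorrect outcomes.

    Only discordant pairs (one model right, the other wrong) carry information.
    """
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the two models never disagree
    k = min(only_a, only_b)
    # Binomial(n, 1/2) lower tail, doubled for a two-sided test.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_t_statistic(sq_err_a, sq_err_b):
    """Paired t statistic and degrees of freedom for squared error differences."""
    diffs = [a - b for a, b in zip(sq_err_a, sq_err_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical paired results for two models on the same 10 observations.
correct_a = [True] * 8 + [False] * 2
correct_b = [False] * 8 + [True] * 2
print(mcnemar_exact_p(correct_a, correct_b))

# Hypothetical per-observation squared errors for the same two models.
t_stat, df = paired_t_statistic([1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 1.0, 2.0])
print(t_stat, df)
```

The t statistic would still need to be referred to a t distribution with the returned degrees of freedom to obtain a p-value. That step is omitted here to keep the sketch within the standard library.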
Postscript:
A very nice critique that makes one of the same arguments made here regarding
external validity can be found in Lobo et al. (2008). Professor David Hand, who
has been doing research on the AUC for a very long time (see Hand and Till,
2001), was kind enough to send us his new articles on this subject on Sept 9,
2010. In the Hand (2009) article that we now reference below, he comes to the
same conclusion as we and Lobo et al. do, namely that the AUC is fundamentally
flawed, although his remedy is different. In any event, Professor Hand's current
position is that the AUC is only rarely an appropriate measure of classification
performance.
REFERENCES
Brier, G.W. (1950). Verification of forecasts expressed in terms of probability.
Monthly Weather Review 78: 1-3.
Hanczar, B., Hua, J., Sima, C., Weinstein, J., Bittner, M. and Dougherty, E.R.
(2010). Small-sample precision of ROC-related estimates. Bioinformatics 26 (6):
822-830.
Hand, D.J. and Till, R.J. (2001). A simple generalization of the area under the
ROC curve to multiple class classification problems. Machine Learning, 45:
171-186.
Hand, D.J. (2009). Measuring classifier performance: A coherent alternative to the area
under the ROC curve. Machine Learning, 77: 103-123.
Huang, J. and Ling, C.X. (2005). Using AUC and Accuracy in Evaluating Learning
Algorithms. IEEE Trans. Knowl. Data Eng. 17(3): 299-310.
Lobo, J.M., Jiménez-Valverde, A. and Real, R. (2008). AUC: a misleading measure
of the performance of predictive distribution models. Global Ecology and
Biogeography, 17: 145-151.
doi: 10.1111/j.1466-8238.2007.00358.
Rice, D.M. (2008). Generalized Reduced Error Logistic Regression Machine.
Section on Statistical Computing - JSM Proceedings 2008, pp. 3855-3862.