Saturday, December 02, 2006

The Interpretation Of Statistical Tests

by Albert Frank

In this article, we adopt the following hypothesis: if the reliability of a dichotomous test is f, then the probability that it gives a wrong result is 1-f.

The following question arises: below what reliability does a positive test result have a probability of less than 0.5 of being correct?

Let P be the number of elements in the population, a the known probability that an element of this population has a given feature K, and f the reliability of the test. The number of K-elements correctly detected by the test equals a f P. The number of non-K elements wrongly detected is (1-a)(1-f)P. The probability that an element flagged by the test is actually a K-element is therefore a f P / [a f P + (1-a)(1-f)P]. This equals 0.5 when a f P = (1-a)(1-f)P, which reduces to f = 1-a. So, as soon as f ≤ 1-a, a positive result is more likely to be wrong than right, and the test becomes nonsense.
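As a minimal sketch in Python (the function name is ours, chosen for illustration), the probability that a detection is valid follows directly from a and f:

    # Probability that a positive result is a true positive,
    # given base rate a and test reliability f.
    def prob_detection_valid(a, f):
        true_positives = a * f                  # K-elements correctly flagged
        false_positives = (1 - a) * (1 - f)    # non-K elements wrongly flagged
        return true_positives / (true_positives + false_positives)

    # At f = 1 - a the probability is exactly 0.5:
    print(prob_detection_valid(0.01, 0.99))  # -> 0.5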

The rarer the feature a test attempts to detect, the more reliable the test must be.

This simple fact is very often neglected.

Let's take an example: the alcohol test. We assume that one driver in 100 is at "0.8 or more" (the European norm for a serious offence is a blood alcohol level in excess of 0.8 g/l). In the following table, we examine, for several reliabilities of the test, the probability that somebody with a positive test is actually positive. We take a population of 100,000 persons, of whom 1,000 are assumed to be "at 0.8 or more."

Reliability of the test | Valid detections | Invalid detections | Probability a "detection" is valid
0.9999                  | 1,000            | 10                 | 0.99
0.999                   | 999              | 99                 | 0.91
0.99                    | 990              | 990                | 0.50
0.95                    | 950              | 4,950              | 0.16
0.9                     | 900              | 9,900              | 0.08
0.8                     | 800              | 19,800             | 0.04
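The table can be reproduced with a few lines of Python (a sketch using the assumptions above: a population of 100,000 and a base rate of 1 in 100):

    population = 100_000
    a = 0.01  # assumed: one driver in 100 at 0.8 g/l or more

    for f in (0.9999, 0.999, 0.99, 0.95, 0.9, 0.8):
        valid = a * f * population                 # true positives
        invalid = (1 - a) * (1 - f) * population   # false positives
        print(f, round(valid), round(invalid), round(valid / (valid + invalid), 2))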

We can imagine the dangers of such misinterpretation of tests in, for example, the medical field.


1 comment:

Renaissance said...

It's not just medical tests that can result in incorrect interpretations.

The same applies to tests meant to identify someone as a terrorist, for example airport passenger screening, data-mining of citizen call records, or random roving wiretaps. In each case, the base rate P(terrorist) is known to be small ("optimistically" 1,000 out of 300 million, or about 0.00033%, but more likely smaller) and the reliability f(test) is either uncertain or not very high (optimistically on the order of 80-90%, but probably worse). The false positives swamp any true positives the test could ever detect. Worse, false positives further degrade f(test) because of the "boy who cried wolf" effect.
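Plugging these figures into the article's formula makes the point concrete (a Python sketch; the 1,000-in-300-million base rate and 90% reliability are the optimistic assumptions above):

    a = 1_000 / 300_000_000  # assumed base rate of terrorists in the population
    f = 0.90                 # assumed reliability of the screening test

    true_pos = a * f
    false_pos = (1 - a) * (1 - f)
    print(true_pos / (true_pos + false_pos))  # ~0.00003: roughly 1 in 33,000 positives is real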

As the cost of imposing such tests is demonstrably high (GDP, liberties, etc.), it raises the question: why do we use them, and how deluded are we in believing in their efficacy? The cost-benefit case is dubious at best.

This obvious result is what leads some to wonder whether the target isn't terrorists but rather anyone who doesn't agree with the Administration and/or GOP ideology. In that case, the base rate P is large enough to put such tests into the range of effectiveness.