Statistical Fallacies

Selecting a truly RANDOM Sample

If you survey every person who passes you on the sidewalk, you do not get a random sample, since you miss the people travelling in vehicles. Likewise, if you stop every car on the highway, your sample misses the people who do not have cars.

A truly RANDOM sample of the entire population is necessary, though hard to produce. In Statistics, RANDOM does not mean haphazard or carelessly chosen. In fact, it means quite the opposite: a sample drawn by a deliberate procedure in which every member of the parent population has the same chance of being selected, so that the sample tends to reflect that population.
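To make the sidewalk example concrete, here is a minimal sketch. The population (70% drivers, 30% pedestrians), the sample size of 500 and the helper names are all made up for illustration; they are assumptions, not figures from any real survey.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k members so that every member has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, k)

def share_of_drivers(sample):
    """Fraction of the sample made up of drivers."""
    return sample.count("driver") / len(sample)

# Hypothetical population: 7,000 drivers and 3,000 pedestrians.
population = ["driver"] * 7000 + ["pedestrian"] * 3000

# "Sidewalk" sample: only pedestrians can ever be picked.
sidewalk_sample = [p for p in population if p == "pedestrian"][:500]

# Simple random sample drawn from the whole population.
random_sample = simple_random_sample(population, 500, seed=42)

print(share_of_drivers(sidewalk_sample))  # 0.0, drivers are missed entirely
print(share_of_drivers(random_sample))    # close to the true share of 0.7
```

The sidewalk sample is not merely noisy; it is wrong in a fixed direction, and taking a bigger sidewalk sample would not fix that.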

The Size of the Sample

As discussed above, a truly random sample helps make the information gathered, or the test conducted, reliable. Of course, if you only have TWO items in your random sample, that will not work. So, in addition to randomness, you also need a good SIZED sample. What’s a good size? Shouldn’t that depend on the size of the population being examined? For example, for a million strong population, shouldn’t the sample size be LARGER than for a 1000 strong population?

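Turns out, surprisingly, that the size needed for a reliable sample depends very little on the total population size, as long as the population is much larger than the sample. In other words, a 1000 size sample will serve almost equally well whether examining a 5000 strong parent population or a million strong parent population. Huh? How is that possible? The uncertainty in an estimate is driven mainly by the number of observations in the sample, not by the fraction of the population they represent. The population size only enters through a small correction that matters when the sample is a sizeable share of the population.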
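A rough way to check this is to compute the margin of error for an estimated proportion. The sketch below uses the standard normal-approximation formula with the finite population correction. The function name, the 95% confidence level (z = 1.96) and the worst-case proportion p = 0.5 are my assumptions for illustration.

```python
import math

def margin_of_error(n, population_size=None, p=0.5, z=1.96):
    """Approximate margin of error for a proportion estimated from n samples.

    If population_size is given, apply the finite population correction;
    otherwise treat the population as effectively infinite.
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    if population_size is not None:
        standard_error *= math.sqrt((population_size - n) / (population_size - 1))
    return z * standard_error

# Same sample of 1000, very different population sizes.
for population_size in (5_000, 1_000_000, None):
    print(population_size, round(margin_of_error(1000, population_size), 4))
```

With these assumptions the margin of error comes out at roughly 2.8 percentage points for the 5000 strong population and roughly 3.1 points for both the million strong and the unlimited population. The smaller population actually gets a slightly tighter estimate, and going from a million to infinity changes almost nothing.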

Sampling for rarely occurring events

Say you need to provide an estimate for the number of blades of grass in a patch of desert, or for the amount of rainwater that is part of the total water in the ocean. Sounds pretty impossible, right?

As long as you know a previously established AVERAGE for these measures, you can actually work out how likely each possible outcome is. You can state the probability of finding 0, 1 or 2 blades of grass in a given area, provided you have the average from a previous year and the blades occur independently of one another. This is the Poisson distribution.
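For a Poisson distribution with average m, the probability of seeing exactly k occurrences is m^k * e^(-m) / k!. A small sketch of that formula follows. The average of 1.5 blades per patch is a made-up number for illustration, not a measurement.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k events when the long-run average is `mean`."""
    return mean ** k * math.exp(-mean) / math.factorial(k)

# Hypothetical average from a previous year: 1.5 blades of grass per patch.
mean_blades = 1.5

for k in range(4):
    print(k, round(poisson_pmf(k, mean_blades), 3))
```

Note that the output is a set of probabilities, not a single guaranteed count: the distribution tells you how often to expect 0, 1, 2 or 3 blades, not which one you will find in a particular patch.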

Medical Test Success Rates: what sounds like a 95% success rate is actually just a 9% success rate

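Suppose a medical test was found for detecting a disease (say Alzheimer’s), which was 95% reliable, meaning it gives the right answer 95% of the time for both sick and healthy people. If this test was applied to a group of people where approximately 0.5% actually had Alzheimer’s, what do you think is the probability that a patient with a positive test result ACTUALLY has Alzheimer’s? If you said 95%, you would be off. The actual probability that a positive test corresponds to an actual Alzheimer’s patient is about 9%! That is, out of 100 people who test positive, only about 9 will actually have the disease. How is that, you ask? Take 10,000 people. About 50 have the disease, and the test correctly flags about 47 or 48 of them. The other 9,950 are healthy, but the test wrongly flags 5% of them, which is about 497 people. So of the roughly 545 positive results, only about 47 or 48 are real, which is just under 9%.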
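The 9% figure follows from Bayes' rule. Here is a minimal sketch of the calculation, assuming that "95% reliable" means both a 95% chance of a positive result for someone who has the disease and a 95% chance of a negative result for someone who does not. The function and parameter names are mine.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(has disease | positive test), via Bayes' rule."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)

# Assumed reading of "95% reliable": sensitivity = specificity = 0.95.
ppv = positive_predictive_value(prevalence=0.005, sensitivity=0.95, specificity=0.95)
print(f"{ppv:.1%}")  # 8.7%
```

The result is so low because the disease is rare: the small error rate applied to the huge healthy group produces far more false positives than there are true positives.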

The Gambler’s Fallacy: I am due to win!

At a roulette table, a gambler notices that the last 19 spins have all landed on RED. He reasons that BLACK is due on the next spin, since the probability of 20 consecutive RED spins is (0.5) ^ 20 (approximately one in a million, treating RED and BLACK as equally likely and ignoring the green zero). So, he puts all his money on BLACK.

Of course, his reasoning is flawed. The wheel has no memory: each spin is independent of the ones before it, so the chance of RED or BLACK on any single spin is still 50% (taking the two colours as equally likely, as he does). And he is betting on the outcome of a SINGLE spin. Had he been betting, at the very start, on the chances of 20 consecutive RED spins, he might have had a point. That chance would be about 1 in a million. But now that 19 spins have already taken place, all he is doing is betting on the next spin, a SINGLE spin. And that chance is still 50%.
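You can check this with a quick simulation. The sketch below assumes a simplified wheel with only RED and BLACK at 50% each (no green zero), and it looks at streaks of 5 rather than 19, since 19-long streaks are too rare to show up often in a short run. The function name and parameters are my own.

```python
import random

def red_rate_after_streak(streak_len, n_spins=200_000, seed=0):
    """Fraction of spins landing RED among spins that directly follow
    at least `streak_len` consecutive REDs (simplified 50/50 wheel)."""
    rng = random.Random(seed)
    current_run = 0   # consecutive REDs seen just before this spin
    followups = 0     # spins that came right after a long enough streak
    reds = 0          # how many of those spins were RED
    for _ in range(n_spins):
        is_red = rng.random() < 0.5
        if current_run >= streak_len:
            followups += 1
            reds += is_red
        current_run = current_run + 1 if is_red else 0
    return reds / followups

print(round(red_rate_after_streak(5), 3))
```

The printed rate should sit close to 0.5. A streak of REDs does not make BLACK any more likely on the next spin.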

Anuj holds professional certifications in Google Cloud, AWS as well as certifications in Docker and App Performance Tools such as New Relic. He specializes in Cloud Security, Data Encryption and Container Technologies.

Anuj Varma – who has written 1209 posts on Anuj Varma, Hands-On Technology Architect, Clean Air Activist.