Misleading Statistics

Why Statistics are Valuable

While there are many ways to be fooled by statistics, we should be aware of how valuable they are. If we want to know whether we can plant our flowers without too much danger of them being killed by a late frost, it is better to have some statistics on when the last frost occurs in our part of the country than to depend on what we can remember of last year's weather. If a researcher wants to know if one method for treating a disease is better than another, it is important that she has some statistics to show what percentage of the time each method worked on comparable groups of patients. If I want to know whether I should wear a seat belt, I'd like to compare the percentage of people who survive crashes when wearing seat belts to the percentage who survive when they don't.

The alternative is to make guesses from the knowledge of individual cases, and this is likely to be very misleading. The trouble is that we are likely to know of only a few cases personally, and they might not be very representative of the overall situation. For example, someone might know one person who was killed in a car crash despite wearing a seat belt and nobody who was killed without one, and therefore falsely assume the seat belt was detrimental. However, if we look at a large number of traffic accidents, statistics show that seat belts make driving substantially safer, and this information is a much better guide to what is likely to happen to us.

A second problem is that individual cases we know about may represent a biased sampling of what actually happens. We are much more likely to hear about people who are released from prison if they go on to murder someone than if they don't. Newspapers are not likely to have headlines that say "Released Prisoner Doesn't Commit Murder" even if that happens for hundreds of released prisoners, but they are likely to report it when a released prisoner does murder someone. Based on what we see reported, our mental image of the percentage of released prisoners that actually murder someone is likely to be much greater than what happens in reality.

Statistics give us a way to look at the big picture and understand what is going on in the world much more accurately than we could from individual observations.

Problems With Statistics

While statistics are extremely valuable, they are also notorious as a means of making false and misleading arguments.

Faulty Statistics

"86% of statistics are made up on the spot, you know - the remaining 24% are mathematically flawed." - from an internet message board

An obvious problem with statistics is that they can simply be fabricated. Of course, this could be true of any claim, but because statistics use specific numbers, they have a quality of authority about them, and we may be a little less suspicious that a statistical claim is false than we would be for a more descriptive argument. Saying "83% of high school students admit cheating on tests" just sounds more authoritative than "most high school students admit they cheat on tests."

Other problems arise because of the way statistics are gathered. Some common problems are bad sampling (see Unfair Sampling) and biases introduced into polls and surveys.

Bad sampling

Most often, statistics are obtained by taking a sample from a larger group and assuming the whole group has the same characteristics as the sample. For example, if we ask 100 people who they are going to vote for in the next election, and 55 of them say they will vote for Murphy, we might assume that about 55% of all the voters will vote for Murphy. This is very useful, since we can't possibly ask all the voters, but it has some important limitations.

First, it would not be at all surprising to find that only 45% are really for Murphy, but just by luck we happened to talk to an unusually large percentage of Murphy supporters. This is the problem of sample size. The smaller the sample, the greater the influence of luck on the results we get.
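The effect of sample size can be roughed out with the usual normal-approximation formula for the margin of error of a sample proportion. This is only a sketch: it assumes a simple random sample, which a real poll may not be, and the function name margin_of_error is made up for this illustration.

```python
from math import sqrt

def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion.

    Uses the normal approximation and assumes a simple random sample.
    """
    return z * sqrt(p_hat * (1 - p_hat) / n)

# 55% of the sample says "Murphy"; see how the uncertainty shrinks as n grows.
for n in (100, 1000, 10000):
    moe = margin_of_error(0.55, n)
    print(f"n={n:>6}: 55% plus or minus {moe * 100:.1f} points")
```

With 100 people the margin comes out close to ten points in either direction, so a true level of support near 45% is within reach of a 55% result. With 1,000 people the margin is roughly three points.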

The other problem is that the way the people in the sample were picked might be biased toward a certain result. If Murphy supports spending lots of money for sports teams, and we sample people attending a football game, we might find an abnormally high percentage of Murphy supporters. On the other hand, if we sample people at the mall when most sports fans are at home watching a big game on television, we might find an abnormally low percentage of Murphy supporters. In either case, our results will be misleading.

One kind of biased sample results when the people being sampled get to decide whether to respond. A television show might ask people to call in and vote on some issue. Not only might the people who watch that particular show be atypical of the overall population, but the people who are motivated to make the phone call might also be more or less likely to vote "yes" than people who don't want to call. This is often called "voluntary response bias" or "self-selection bias" - people who choose to respond might on average have different opinions than the people who don't.
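A small calculation shows how far a call-in vote can drift from the real split of opinion. All the numbers here are invented for illustration: suppose 30% of viewers would say "yes", and that "yes" viewers are five times as likely to pick up the phone as "no" viewers.

```python
def share_yes_among_callers(true_yes_share, yes_call_rate, no_call_rate):
    """Fraction of callers voting "yes" when the two sides call in at different rates."""
    yes_callers = true_yes_share * yes_call_rate
    no_callers = (1 - true_yes_share) * no_call_rate
    return yes_callers / (yes_callers + no_callers)

# Hypothetical figures: 30% true "yes"; 10% of "yes" viewers call, 2% of "no" viewers call.
poll_result = share_yes_among_callers(0.30, 0.10, 0.02)
print(f"true 'yes' share: 30%, call-in poll shows: {poll_result:.0%}")
```

Under these assumed rates, a clear minority opinion shows up as a two-thirds majority among the callers, even though nobody lied and every call was counted correctly.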

Unfair poll questions

Statistics based on polls can be faulty if the poll is constructed in such a way as to encourage a particular answer. If a question is worded "Do you feel you should be taxed so some people can get paid for staying home and doing nothing?" it is likely to get a lot of "no" responses. On the other hand, the question "Do you think the government should help people who are unable to find work?" is likely to get a lot more positive responses. Both questions could be about the same policy of providing unemployment assistance. Polls can easily be rigged to get a desired answer by the way the questions are phrased. Another way of rigging a poll is to have a series of questions designed to highlight the arguments for one side of an issue before presenting the question about how the poll responder feels about that issue. If it is important to know whether the results of some poll are reliable, one should try to find out exactly what was asked in the poll.

Statistics that are true but misleading

Even when statistics are technically accurate, particular statistical facts can be very misleading. I once heard a statistic that the rate of teenage pregnancy in a conservative religious group was higher than the national average. This seemed surprising until it became apparent that the reason wasn't a high percentage of unwed mothers - it was a high percentage of women who got married while still in their teens.

I recall hearing apparently conflicting claims about employment during a presidential election campaign a number of years ago. The challenger claimed that unemployment was up during the President's term in office. The President's campaigners said that employment was up! Both were true. The population had increased, so the number of people who were employed and the number of people who were unemployed had both increased.

When someone wants to use statistics to make a point, there are many choices of just what numbers to use. Suppose we want to dramatize how much the price of candy bars has gone up. We might have the following data:

January $ .76
February $ .54
March $ .51
April $ .63
May $ .80
June $ .91
July $ .76

We could correctly say that the price jumped from 51 cents to 91 cents in only three months (March to June), an increase of more than 78%! On the other hand, we can see it didn't change at all from January to July, which we might avoid mentioning if we wanted to impress people with the price increase. Choosing the starting and ending points for data used is an easy way to deliberately manipulate statistics.
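The effect of picking endpoints is easy to check directly. The sketch below takes the monthly prices from the table (in cents) and computes the percentage change between any two months. The helper name percent_change is just a label chosen for this example.

```python
# Candy bar prices in cents, taken from the table above.
prices = {
    "January": 76,
    "February": 54,
    "March": 51,
    "April": 63,
    "May": 80,
    "June": 91,
    "July": 76,
}

def percent_change(start_month, end_month):
    """Percentage change in price from start_month to end_month."""
    start = prices[start_month]
    end = prices[end_month]
    return (end - start) / start * 100

# Three "true" stories from the same data, depending on the endpoints chosen.
for start, end in [("March", "June"), ("January", "July"), ("January", "March")]:
    print(f"{start} to {end}: {percent_change(start, end):+.1f}%")
```

The same seven numbers support a rise of about 78% (March to June), no change at all (January to July), or a drop of about a third (January to March).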

However, we can't always assume that dates which seem arbitrary have been deliberately manipulated. If someone reports that the share of people believing in astrology went from 46% in 1985 to 52% in 1999, they might have picked those dates simply because those were the only years for which they could find a poll asking about belief in astrology.

Ranking Statistics

Often we hear statistics in the form of rankings: "He is ranked fifth among hitters for most career home runs" or "this is the third leading cause of accidents in the home." Since these are based on comparisons with other quantities rather than specifying specific amounts, there are special problems we need to be aware of.

A problem with ranking is that it is not always clear what the categories are that are being ranked. Is carelessness a cause of accidents in the home? How about tripping or doing home repairs or leaving cleaning products where they can be reached by children? There is not necessarily a standard way to divide up all causes of accidents, so the ranking of a particular cause will depend on how the other causes are divided up. Diabetes might be said to be the third leading cause of death in the United States, but its rank could change depending on whether cancer is considered one disease or many (lung cancer, breast cancer, colon cancer, etc.). Diabetes might be more common than any particular type of cancer, but less common than cancer in general.

Part of the problem with ranking is that it does not tell us much about the actual amount involved. The most popular restaurant in the city might only do one one-thousandth of the business in the city, while the most popular brand of soup might have 70% of the sales, so simply being ranked number one doesn't tell us much about the actual percentage or amount of business.

Qualifiers on statistics

One way that statistics can be made to sound more impressive is by putting qualifications on them that might not seem important, but really are. One example is the statement that "The brown bear is the largest land predator in the world." I presume this is true, but it's safe to assume that the words "land" and "predator" wouldn't have been included if they didn't rule out other animals. The word "predator" rules out elephants, which are bigger but aren't predators, while "land" rules out various kinds of whales which are bigger predators but don't live on land. The two qualifiers together create a category in which the brown bear comes out first.

Sports announcers always want to inject as much excitement as possible into the games they announce, so they will find any way they can to make what happens into some kind of a record. We often hear things like "That gives him the team record for most yards gained from scrimmage by a running back in the first quarter." Players on other teams may have gained more, players who weren't running backs might have gained more, players may have gained more in other quarters, and players who weren't starting from scrimmage (as when returning kicks) may have gained more. Other players presumably have the records for all of these. With so many qualifiers available, sportscasters can concoct some impressive facts for almost any game we watch.

Percentages

Sometimes statistics are given in absolute terms and other times they are given in percentages. We might hear that Blanko Corp. laid off 32 people or we might hear that they laid off 25% of their workforce. Typically a news source will try to make the number sound as dramatic as it can, so if Blanko is a huge company - say it has 200,000 employees - the source might find it more impressive to say it laid off 20,000 people rather than 10% of the workforce. If Blanko is small, say 100 employees, it sounds more impressive to say they laid off 10% rather than just 10 people. Which figure we should prefer as responsible thinkers depends on why we care about the information. If we are worried about the effect on the community or the country, then perhaps we should figure out the percentage of the population affected, rather than the absolute number or the percentage of company employees. If Blanko cuts 500 people from a town of 10,000, that is a huge effect, while if they are in a city of two million it may not be too important. If I had stock in Blanko, I'd be more interested in how the cut compared to their overall workforce.
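The choice of denominator can be made concrete with the figures from the paragraph above. This sketch expresses the same 500 layoffs as a share of two different populations. The helper name percent_of is simply a label for this example.

```python
def percent_of(part, whole):
    """Express part as a percentage of whole."""
    return part / whole * 100

layoffs = 500

# Same absolute number, very different share of the community.
for place, population in (("town of 10,000", 10_000), ("city of two million", 2_000_000)):
    print(f"{layoffs} layoffs in a {place}: {percent_of(layoffs, population):.3f}% of residents")

# The large-company case from the text: 20,000 out of 200,000 employees.
print(f"share of workforce: {percent_of(20_000, 200_000):.0f}%")
```

The town loses 5% of its residents' jobs at a stroke, while the city loses a few hundredths of a percent, which is why the absolute number alone says little about the local impact.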

When dealing with facts on a large scale, we should prefer to look at numbers as percentages or in proportion to other quantities since absolute values don't have direct meaning to us (see Large and Small Quantities). When we see there were 43,220 highway fatalities in the U.S. in 2003, it is hard to know what that implies. We might be better off knowing that it is more than in 2002, but about the same in fatalities per mile driven (1.5 deaths per 100 million vehicle miles traveled). If we are trying to decide whether to drive or fly somewhere, we'd like to know that driving has far more fatalities per mile than air travel.

Sometimes, however, we may want to use absolute numbers. If we know that a gas station 4 miles away has a price 5 cents a gallon lower than the one we are near, do we want to drive to it? If we plan to buy about ten gallons, it will save us 50 cents, minus the expense of the extra gas we use to get there. The percentage price difference from the nearby gas is of no particular interest to us.
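To see how the absolute numbers settle the gas station question, here is a minimal sketch. Only the ten gallons, the 5-cent difference, and the 4-mile distance come from the text. The round trip, the fuel economy, and the pump price are assumptions invented for illustration, and the function name net_savings is hypothetical.

```python
def net_savings(gallons, price_diff, extra_miles, mpg, price_per_gallon):
    """Dollars saved on the fill-up minus the cost of fuel burned on the detour."""
    gross_saving = gallons * price_diff
    detour_cost = extra_miles / mpg * price_per_gallon
    return gross_saving - detour_cost

# From the text: ten gallons, 5 cents a gallon cheaper, station 4 miles away.
# Assumed for illustration: an 8-mile round trip, 25 miles per gallon, $2.95 a gallon.
saving = net_savings(gallons=10, price_diff=0.05, extra_miles=8, mpg=25, price_per_gallon=2.95)
print(f"net saving: ${saving:.2f}")
```

Under these assumptions the detour costs more in fuel than the 50 cents it saves. A larger tank, a shorter detour, or a more efficient car would change the answer, but in every case the decision rests on dollars and cents rather than on the percentage difference in price.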

Making sense of statistical claims

There are too many different statistical situations to consider them all, so we will often have to analyze the situation for ourselves in order to make good judgments.

One simple strategy is to temporarily ignore the statistic that someone has presented and ask ourselves what statistic we would actually want in order to make a judgment about the issue involved. For example, when a political ad says that more people are employed, ask yourself what would be the normal way of measuring employment. Most often we hear about the unemployment rate, which is the percentage of people who want to be employed who don't have jobs. Then see how this differs from the statistic presented to us to see if it could be misleading. For this issue you might ask yourself whether it is possible for jobs to be increasing and yet have the unemployment rate also be going up. Yes, it is, if the number of people who want jobs is going up faster than the number of jobs.
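A quick calculation confirms that jobs and the unemployment rate can rise together. The figures below are invented for illustration; the rate is computed as described above, as the unemployed divided by everyone who wants a job.

```python
def unemployment_rate(employed, unemployed):
    """Unemployed as a percentage of everyone who wants a job."""
    return unemployed / (employed + unemployed) * 100

# Hypothetical figures: (employed, unemployed) before and after a period of growth.
before = (950_000, 50_000)
after = (1_034_000, 66_000)

print(f"jobs added:        {after[0] - before[0]:,}")
print(f"unemployed added:  {after[1] - before[1]:,}")
print(f"rate before/after: {unemployment_rate(*before):.1f}% / {unemployment_rate(*after):.1f}%")
```

In this made-up case 84,000 more people have jobs, yet the unemployment rate climbs from 5% to 6%, so an ad boasting about new jobs and an ad complaining about unemployment could both be quoting accurate numbers.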

Unless we have good access to the data and know how it was obtained, we should always recognize that statistics can misrepresent what is going on. If we get our information from a group that has a strong political or philosophical agenda, we can almost take for granted that their statistics (as well as the rest of their arguments) have been carefully chosen to promote their point of view.

Nevertheless, recognizing that the statistics people present to us are frequently flawed doesn't imply that we can depend on anecdotes about individual cases or a few of our own experiences. These are likely to be atypical of what happens in the world at large. Instead, we should withhold judgment until we can get more reliable information about what is really going on.