Statistics Primer. Part 1: Samples

Don't run away. This is going to be gentle and soft and not as hard as you expect. Honest. I'm going to teach you some statistics and you don't have to pay the sorts of fees I would usually get for this. Something for nothing! Well, not quite. You still have to be willing to work just a little. But by the end of this post you will be so smart and bullet-proofed against a lot of lying with statistics.

So let us begin, you and I, even though we are not T.S. Eliot or in a poem. Rather, imagine that we are in a kitchen, a kitchen with a gigantic pot of really wonderful-smelling soup in it, and imagine that we are responsible for deciding if the soup needs its seasonings corrected.

How would you go about doing that? Yep, you would take a spoon or a ladle and taste the soup. That is pretty much what statisticians do when they take a sample. A sample is a ladleful of information from the population which is the whole soup. The reason for studying a sample is also fairly close to the same reason we only taste a ladleful of the soup to check the seasoning. If we ate all the soup there would be none left and we'd have to make more which would be time-consuming and expensive. Likewise, studying the whole population would be time-consuming and expensive, and in some cases also destructive (imagine testing how long light bulbs work, say).

The soup-and-ladle analogy works pretty well for explaining how sampling works. Think about a soup that has not been well stirred, which has lumps of carrots in one area and all the onions in another area. If you dip a ladle into that soup and then taste the contents of the ladle you may get a very different idea of the overall taste depending on where the ladle happened to enter the soup.

The solution to correct that problem is to stir the soup first. That way we make it random. But we can't really stir populations, and so the solution in sampling is a little different. For example, we might skim the ladle across the surface of the soup or dip it a little into several different places in the soup to get an idea of the totality of the soup. These solutions and others similar to those are ways of trying to guarantee that we get what statisticians call a random sample. In the simplest case a random sample gives each unit in the population the same chances of being selected for the sample. More generally, random sampling tries to avoid bias. A biased sampling process is one where different elements in the population have different likelihoods of being entered into the sample. A biased sample might overrepresent the carrots, say, and have too few onions, because the carrots are volunteering for the sample and the onions are refusing to participate.

More generally, samples where people choose to participate, such as those you often see on the internet, are biased samples. They omit the opinions of all those people who don't go on the net or who don't feel strongly enough or have enough time to click the vote-button. The results, then, tell us little about what people in general might think about the question the poll posed. Another example of a biased sample would be to carry out a general health assessment study by using only hospital records to pick subjects. People who have not been in a hospital in the recent past will not have any chance of being included in the sample, and the results are probably going to be biased towards greater apparent ill-health. In short, we don't want to let people decide themselves if they want to be in the sample and we don't want to exclude some people altogether by picking a sampling frame (here hospital records and more generally the source we use to find the sample) that doesn't include them.
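To see how self-selection can distort a result, here is a minimal Python sketch. Everything in it is made up for illustration: the population of 10,000, the 30% true share of "yes" opinions, and the response rates of 0.20 and 0.05 are assumptions, not figures from any real poll.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 people, 30% of whom hold opinion "yes" (1).
population = [1] * 3000 + [0] * 7000

# Simple random sample: every person has the same chance of being picked.
random_sample = rng.sample(population, 500)

# Self-selected sample: assume, purely for illustration, that "yes" holders
# click the vote button with probability 0.20 and everyone else with 0.05.
volunteers = [x for x in population if rng.random() < (0.20 if x == 1 else 0.05)]

true_share = sum(population) / len(population)
random_share = sum(random_sample) / len(random_sample)
volunteer_share = sum(volunteers) / len(volunteers)

print(f"true share of yes:          {true_share:.2f}")
print(f"random sample estimate:     {random_share:.2f}")
print(f"self-selected 'poll' share: {volunteer_share:.2f}")
```

The random sample should land near the true share, while the volunteer sample overstates it, because the people who feel strongly are the carrots volunteering for the ladle.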

A good sample is not based on convenience sampling, either. An example of the latter would be when a reporter goes out to the local mall to ask people about their opinions on some hot-button issue. This is convenient for the reporter, but unless we are interested in the population of people in malls it is not a way of getting a representative range of opinion. It excludes all those who don't visit malls (the bedridden, for one group) and, depending on the time of the day, it might also exclude all people at work. And these excluded groups might have quite different average opinions.

So polls usually employ random sampling to get the group that then is questioned. What sampling frame should they use in this? The most common one today consists of telephone numbers for landlines. But you can see how this might become a poor sampling frame as just owning cell phones becomes more common, especially among the younger individuals.

Most polls don't use simple random sampling of the kind I described, the kind that would be close to putting all names in a large hat and then stirring the names and randomly picking some. The reasons for not using this are three: First, simple random sampling could be incredibly expensive. Imagine that you are doing a study and that you need to interview 2,000 people in person. If you pick the names for these people randomly all across the United States, you might end up having to travel to two thousand different localities. To avoid this, many studies first randomly draw a smaller number of geographical localities and then randomly pick a certain number of respondents within each of these localities. In my example this could be picking twenty random places and then picking a hundred respondents randomly in each place.
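The two-stage idea (twenty random places, then a hundred random respondents in each) can be sketched in a few lines of Python. The sampling frame below is hypothetical: 500 localities of 1,000 residents each, with invented ID strings, just to have something to draw from.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical sampling frame: 500 localities, each with 1,000 resident IDs.
localities = {
    loc: [f"loc{loc}-person{i}" for i in range(1000)]
    for loc in range(500)
}

# Stage 1: randomly pick 20 localities.
chosen_localities = rng.sample(sorted(localities), 20)

# Stage 2: randomly pick 100 respondents within each chosen locality.
respondents = []
for loc in chosen_localities:
    respondents.extend(rng.sample(localities[loc], 100))

print(len(chosen_localities), "localities,", len(respondents), "respondents")
```

The interviewers now travel to twenty places instead of up to two thousand, at the price of a sample that is somewhat more clumped than a simple random sample would be.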

Second, a simple random sample of all Americans would need to be enormous to include a meaningful number of people who belong to the less common minority groups, American Indians, for example. This is because it is likely that the sample would consist of all groups in their proportions in the target population, and so even a large simple random sample might include just one American Indian. If pollsters wish to understand the opinions of these smaller groups they would base all the evidence on one person's opinions. Not very sensible. The solution is to oversample the rarer groups, so that the study includes enough variety within the subgroup, and then to shrink back the share of this group in the overall results by weighting it down to the relative population share of the group.
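Here is a small Python sketch of oversampling followed by weighting. All the numbers are invented for illustration: a small group that is 1% of the population but 10% of the sample, and made-up counts of "yes" answers in each group.

```python
# Hypothetical numbers, for illustration only.
pop_share = {"small_group": 0.01, "everyone_else": 0.99}  # population shares
sample_n = {"small_group": 200, "everyone_else": 1800}    # oversampled design
said_yes = {"small_group": 150, "everyone_else": 720}     # "yes" answers

# Support within each group, estimated from that group's own respondents.
group_rate = {g: said_yes[g] / sample_n[g] for g in sample_n}

# Unweighted: treats the small group as if it were 10% of the population.
unweighted = sum(said_yes.values()) / sum(sample_n.values())

# Weighted: each group's rate counts in proportion to its population share.
weighted = sum(pop_share[g] * group_rate[g] for g in pop_share)

print(f"group rates:          {group_rate}")
print(f"unweighted overall:   {unweighted:.4f}")
print(f"weighted overall:     {weighted:.4f}")
```

The 200 small-group respondents give a usable estimate for that group on its own, while the weighting step keeps them from dominating the overall figure.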

Third (though in some ways this isn't completely separate from the second reason explained above), sometimes the question that is studied suggests obvious subgroups which are very similar inside the subgroup but very different from other subgroups. It might make more sense in a setting like this to randomly sample some respondents within each subgroup, especially if what we are interested in are the very differences between the subgroups. An example would be to poll a certain number of individuals with each possible religious affiliation on the question of how these individuals view government sanctioned torture, or to poll anti-choice and pro-choice voters on their views on other political topics than reproductive choice.

This might be a good time to leave the kitchen and to remind all of us about the basic problem we have: There is this population (the soup) and we don't know its characteristics (what it tastes like). It's too expensive and time-consuming to study the whole population (drink all the soup) to find out, so we take a sample (a ladleful) and we try to make sure that our sample is representative of the population, like a microcosm of the population macrocosm. So we use a method of random sampling. This lets us exclude bias.

Our sample might still not reflect the population, just because we might have bad luck in the sampling (such as happening to get all the bay leaf in your ladle when tasting a soup), but statisticians have a way of figuring out what the risk of this happening might be. (This will be the topic of my next post on statistics.)

Note also that if we sampled a very large soup with a very tiny spoon we'd be unlikely to get a very good idea of the taste of the soup. I recently read about a study used to justify single-sex schooling where the sample studied consisted of fewer than twenty teenagers and where the whole result touted in the media was based on two teenage boys' responses. Now this is a very tiny spoon, especially to use in an attempt to overturn the whole education system.

More generally, samples in statistics must be of a certain size to be meaningful representations of populations. How large, depends partly on the population we are looking at. If it's very diverse we need a larger sample to capture that diversity. The size of the sample also depends on the precision we are seeking and on the kinds of questions we are asking.

But it's clear that asking one person in a telephone poll isn't enough to get an idea about the general views in the United States. What isn't quite as clear is the question of how many people we should pick for the sample to get a representative sample. Remember that the bigger the sample the more it will cost to interview or to study. This means that statisticians must weigh the needs for a larger sample against the costs of acquiring one, always remembering that one of the costs of a too-small sample is that it will have a greater chance of being unrepresentative. More about that later on, too.

Posted by: echidne / 10/04/2006 03:17:00 PM




Statistics Primer. Part 2: Probability

Statistics is not the same as probability theory, but the latter is used in statistics and a small detour into the wonderful world of probabilities is necessary here. Let's start by grabbing the concept of probability by its horns: What is this thing?

It's a way of quantifying the likelihood of some event. Call the event that Echidne will tomorrow wake up all cheerful the event A (just to call it something short and sweet). Denote the probability of this event A with the shorthand p(A) (this is said "p of A"). How could we quantify this wonderful likelihood?

We can do it by defining an impossible event and a sure thing. Let's fix these two extreme values as follows:

p(A)=0 if A is an impossibility


p(A)=1 if A is certain to happen.

Given these two fixed values all other probability values would fall in the range from zero to one.

This is fun. If I tell you that p(A)=0.14 for the event of me waking up chirpy as a bird you can now tell that I don't think it's very likely to happen. But there is an even funner aspect of this, for we can always define a second event, notA, being the event in which Echidne will not wake up cheerful. If p(A)=0.14, then the probability of notA, or the complement of A, will be...what?

It makes sense that it would be 1-p(A) = 0.86. Because something is going to happen and if it's not a cheerful Echidne, then it must be a grumpy or neutral Echidne that rises from that divine bed tomorrow morning, assuming that she does rise. As long as we are not counting some of my mental states in both events A and notA, this will work.

So what was funner about this? The fact that uncertainty actually increases as we move from probabilities close to zero towards the middle values and that it also increases as we move from probabilities close to one towards the middle values. So we have most uncertainty when the probability hovers around 0.50. When we are closer to the endpoints of zero or one, either the event or its complement is almost sure to happen, so we have less uncertainty. Plus the fact that probability theory can be used to make all sorts of formulas which will let us find the probabilities of combined events happening and so on. I'm not going that way but you may wish to do so if you find this entertaining.

You might say that all this is well and good, but what is the anchor that settles this whole probability thing? Where do we get those values you made up here? There are three possible answers to this question. The first is the classical definition of probability and is best explained by thinking about games of chance in which the rules of events happening are simple to follow and where we can find very small events which clearly are equally likely to occur. For instance, think of the following game: You toss two fair dice at the same time. What is the probability that the dots on the top sides of the two dice add up to seven?

The solution consists of counting the number of events (here an event is the way the two dice fall) in which the dots add up to seven and then counting the total number of events, whatever the number of dots might add up to. The probability of the event we are interested in (the dot sum is 7) is the ratio between the two counts. The way to find these count values is by....counting!

First, I can count the total number of events by noting that the first die can take any value from one to six and so can the second die. This means that for any value of the first die the second one could take any one of six possible values. Given that the first die can also take six values, the total number of events is 6 times 6 or 36. Second, I can look at all these 36 events and add the dots on the two dice for each of them. When I do that, I find that in exactly six cases (1,6), (2,5), (3,4), (4,3), (5,2) and (6,1) the sum comes to 7. (Note that in the pairs I've given here the first die value always comes first and the second die value second and that (1,6) and (6,1) are two separate events). Third, we make up the probability ratio. Here it is 6/36 or 1/6 or 0.167.
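The counting can also be handed to a computer. This short Python sketch lists every (first die, second die) pair, keeps the ones that sum to seven, and forms the ratio; the variable names are my own.

```python
from fractions import Fraction
from itertools import product

# Every way two dice can fall, as (first die, second die) pairs.
outcomes = list(product(range(1, 7), repeat=2))

# The pairs whose dots add up to seven.
sevens = [pair for pair in outcomes if sum(pair) == 7]

# Classical probability: favorable count over total count.
probability = Fraction(len(sevens), len(outcomes))

print("total events:", len(outcomes))
print("events summing to 7:", sevens)
print("probability:", probability, "or about", round(float(probability), 3))
```

Changing the 7 in the filter to any other total gives the probability of that sum by the same counting argument.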

The second way of anchoring the probability concept is more important in actual statistical studies, and that is to link probabilities of future events to what happened in the past. This makes sense as long as whatever affects these events hasn't changed in the meantime. This definition is called the objective definition of probability (to distinguish it from the third definition still to come) and also the long-run relative frequency definition of probability. The latter name hints at the way the probabilities are derived: By using long enough strings of information about actual events and by assuming that the events will replicate at the same frequencies in the future. The word "relative" is added because we standardize the probability measure to the scale from zero to one.

An example of this approach would be taking a coin that is known to be unfair (so that heads and tails are not going to be equally likely in tossing it) and finding out what the probability of heads is by tossing the coin again and again and by writing down whether heads or tails turned up on each toss. Suppose you toss this coin a million times and find out that heads came out 400,000 times. Then the probability of heads using this coin would be 400,000/1,000,000 = 0.4.
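Nobody wants to toss a coin a million times by hand, so here is a simulated version in Python. The sketch assumes a coin whose true chance of heads is 0.4 (the constant P_HEADS is my invention; in real life that number is exactly what we don't know) and shows the relative frequency for longer and longer runs of tosses.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable
P_HEADS = 0.4            # assumed "true" bias, unknown to the person tossing

def relative_frequency(n_tosses):
    """Toss the simulated coin n_tosses times; return the share of heads."""
    heads = sum(rng.random() < P_HEADS for _ in range(n_tosses))
    return heads / n_tosses

for n in (10, 100, 10_000, 100_000):
    print(f"{n:>7} tosses: relative frequency of heads = {relative_frequency(n):.4f}")
```

Short runs can wander quite far from 0.4, while long runs tend to settle close to it, which is the "long-run" part of the definition.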

Sadly, the easiest teaching examples on probability tend to be stuff like that. But the same principle applies to studies about voting activity or opinions in general.

The third definition of probability is the subjective one, also called the Bayesian definition. This differs from the other definitions in that a Bayesian statistician could ask a question such as this: What is the likelihood that Echidne is grinding her teeth right now? A strict objectivist would not ask such a question, because either I am grinding my teeth now or I am not; it's just that others can't observe which it might be. The subjective definition of probability has to do with our beliefs about events and is not strictly limited to predicting future probabilities. It can handle the way learning more facts changes our beliefs and other interesting questions like that.

The probability concepts most used in statistical studies are the long-run relative frequency view (which is used in the studies themselves) and the subjective view (which is used in the way we interpret the margins of error and similar concepts). My next post will talk a little more on the concept of probability distributions, needed for understanding sampling distributions.

What I hope you got from this post was the rough feeling that is conveyed by something like "candidate X has the probability of 0.7 of winning next month's elections", and that you'd also want to ask what the basis for this prediction is. It might be totally subjective or perhaps there was a poll in which 70% of those surveyed stated that they were going to vote for X. This relative frequency (70% is 0.7 in relative frequency terms) is then used as the probability of X getting elected, which naturally assumes that people will act according to their stated intentions in the survey and that the survey was representative of actual voters.

Posted by: echidne / 10/05/2006 01:33:00 AM




Statistics Primer. Part 3: Sample Statistics

Here is where writing this primer gets difficult. An introductory course to statistics is at least one semester long for a good reason. To do something much shorter requires a certain amount of omissions and a couple of rough approximations or almost-lies. So if you know a lot of statistics, go gently on me in the comments section. Ok?

Suppose, then, that we have some information we have gathered by a proper random sampling process. It could be the yearly incomes of one hundred people, ranging from zero dollars to, let's say, two hundred thousand, and we want to do something interesting with these sample data. For one thing, nobody is going to think we are great if we just print out the hundred numbers on a piece of paper and distribute it. Human beings are not good at seeing general patterns in numbers like that. So what can we do to summarize the information?

Two things come to mind right away. We could try to condense the information into just a few numbers or we could try to make a mental picture of it (a topic I might or might not cover in this primer, depending on whether it seems needed). Let's begin by trying to condense the information into just a few numbers, called sample statistics. These statistics are numbers, to be distinguished from the science of statistics in general. If you could only give one sample statistic to represent all the information in the sample of one hundred incomes, what would it be?

Probably some measure of central tendency, meaning that the number we pick should somehow represent the average, or the common or the representative in the sample. There are three candidates for this measure: the mean (or the arithmetic average), the mode (or the most common value) and the median (or the middle value). Most of us are familiar with the mean, and it turns out to be the overall winner for reasons that have more to do with its statistical usefulness than its ability to otherwise beat the competition. But the mode and the median are also handy to know about.

In our income example, the mode would be the income value which appears most often in our sample. The median income of the sample would be found as follows: Arrange the hundred income numbers in an increasing order. Give each individual an ordinal number corresponding to his or her place in the line-up. Then the median income is the income of the individual who is standing smack in the middle of the line-up. Oops, you say now. There is nobody standing there! True, because my sample has an even number of observations. The trick in this case is to use the arithmetic average of the two incomes belonging to the two people on both sides of the missing central person.

An example might come in useful here, and one with fewer than a hundred numbers. Suppose that we have some data on yearly incomes for five people only, and the incomes are as follows:

0, 45,000, 45,000, 70,000, 100,000

I have put them into an increasing order for your convenience. The arithmetic average for this sample is 52,000. The mode is 45,000 (it occurs twice and no other figure occurs more than once) and the median is the middle income in this ordered array or 45,000. Note that the three measures of central tendency may or may not be the same and that each of them might be useful for different purposes, including different political manipulations. For example, see what happens when I add one more observation to the sample:

0, 45,000, 45,000, 70,000, 100,000, 700,000

The arithmetic mean is now 160,000! The mode is still 45,000, but the median is now the income half-way between the second 45,000 and the 70,000 figure following it or 57,500.

You might want to play with this a little more. For example, it's possible to have more modes than one. Take out the 700,000 I added and replace it with a second 100,000 figure. But there is always only one mean and one median.
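If you'd rather let a computer do the playing, Python's standard `statistics` module has all three measures built in. A minimal sketch using the income figures from above (the variable names are mine):

```python
import statistics

incomes = [0, 45_000, 45_000, 70_000, 100_000]
with_outlier = incomes + [700_000]
two_modes = incomes + [100_000]  # the 700,000 swapped for a second 100,000

for label, data in [("five incomes", incomes), ("with 700,000 added", with_outlier)]:
    print(label)
    print("  mean:  ", statistics.mean(data))
    print("  median:", statistics.median(data))
    print("  mode:  ", statistics.mode(data))

# multimode returns every value tied for most common (Python 3.8 or later).
print("modes after the swap:", statistics.multimode(two_modes))
```

Notice how the one very large income drags the mean far more than it moves the median, and leaves the mode alone.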

As I mentioned above, the mean is the workhorse among these measures of central tendency, even though it may not always be the most representative single number in a sample. What we use it for, ultimately, is to estimate the same single number in the population. For example, if we found that the average income of the one hundred people in my original sample is 67,000, then we could use that as a point estimate of the average income of all people in the population I drew the sample from. But this sounds a little dangerous, doesn't it? Because I might have gotten the same sample mean from a sample of only ten people and because clearly the mean itself isn't a very good guess if the sample incomes varied widely all over the place.

What about that varying wildly all over the place? Let's take a different imaginary set of three samples:

Sample A:

7, 8, 10, 12, 13

Sample B:

7, 9, 10, 11, 13

Sample C:

10, 10, 10, 10, 10

All these samples have the same mean, 10. (Note that the mean doesn't have to be one of the numbers in the initial sample, it just happened to be in this case.) But the samples are clearly showing very different stuff otherwise, and if we only reported the mean to someone we'd be omitting important information. Sample C is just the same number five times. Sample B has the three numbers in the center closer to each other than is the case in sample A. So A has the most variation of the three. How could we express that in one single number?

Statisticians came up with a way of doing it. To understand the thinking behind the favorite selected for the job it might be useful to discard a few other candidates first.

The starting point would be to note that we need to fix the measure of scatter to something and the mean is already there as a good candidate for that. What if we measured the general variation in the sample by looking at the distance of the various sample values from the mean? The further these values fall from the mean, on average, the more scattered is the sample, after all. Suppose that we calculated all these distances. To get just one number to reflect the dispersion we could use the average of the distances.

Let's try it for sample A. The first distance is 7-10 = -3, the second 8-10 = -2 the third is 10-10 = 0, the fourth 12-10 = 2 and the fifth 13-10 = 3. To make these into one overall measure of the scatter or dispersion we could add them up and then divide by the sample size, five. Except that what we get as the sum of the distances from the mean is zero.

That's why this one was rejected. The problem has to do with the negative and positive values canceling each other out. So a slightly different approach would be to use the absolute values of the distances in these calculations. This would work, but it turns out to be cumbersome later on in various statistical uses the measure has. Still, the idea of getting rid of the negative signs in the distances or deviations around the mean is a good one. Is there any other way we could get this trick to work?

Yes, and that is by squaring the deviations around the mean before we add them up and then average them. To get back to the original units we used we then take a square root of the result. This number is called the standard deviation. The number we have before we take the square root is the variance.

The values for the variance for the three samples A, B and C are 5.20, 4.00 and 0 respectively, and the standard deviations 2.28, 2.00 and 0. For those who know some statistics and want to know more about how to average the sum of deviations around the mean correctly, see the footnote preceded by the asterisk.
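For anyone who wants to check those figures, here is a short Python sketch that follows the recipe step by step for samples A, B and C: deviations from the mean, their sum (always zero), the average squared deviation, and its square root. It divides by the sample size, as in the text, and cross-checks against `statistics.pvariance`, which uses the same denominator.

```python
import statistics

samples = {
    "A": [7, 8, 10, 12, 13],
    "B": [7, 9, 10, 11, 13],
    "C": [10, 10, 10, 10, 10],
}

results = {}
for name, values in samples.items():
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]         # distances from the mean
    variance = sum(d ** 2 for d in deviations) / n  # average squared deviation
    std_dev = variance ** 0.5                       # back to the original units
    results[name] = (sum(deviations), variance, std_dev)
    print(f"Sample {name}: sum of deviations = {sum(deviations)}, "
          f"variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
    # Cross-check against the library version with the same denominator.
    assert abs(variance - statistics.pvariance(values)) < 1e-12
```

The first number printed for each sample is the rejected candidate, the plain sum of deviations, which is zero no matter how scattered the sample is.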

To recap the conversation so far: We have two formulas, one of which is the average value in the sample, the mean, and the other one of which is the average squared deviation around the mean (for the variance) or the square root of that (for the standard deviation). I'm not sure if you can see how these could start a pattern for more formulas. For example, suppose that we calculated the average cubed deviation around the mean and so on. What might we get? It turns out that we'd get measures for finding out how lopsided our sample might be and other interesting things like that. Lopsided distributions aren't going to be central in what I'm covering here, but they can be quite fascinating.

You may be silently complaining that none of this seems to have much to do with opinion polls where the data we get tends to consist of verbal answers to questions. Data like that are qualitative, not quantitative, and we can't just storm ahead to calculate means and variances for them. But there is a way around that problem, and that is by counting.

Suppose that we have asked five people whether they prefer Smith or Jones to be their state senator in the next elections, and suppose that four people say they prefer Jones and one person says that she or he prefers Smith. It turns out that all the work we have done can serve here, too, if we make one additional change: We are going to count each expressed preference for Jones as 1 and each expressed preference for Smith as 0. The data will then look like this:

1, 1, 1, 1, 0

and the mean of these numbers will be 4/5 or 0.80. Using the earlier form for variance gives us the figure 0.16, which produces 0.40 for the standard deviation.

The only snag here is that we might have as well counted the votes for Smith as 1, and then the mean would have been 0.20. But the variance and standard deviation are unchanged (though usually we have a shortcut formula for counting these rather than doing what I made you do here for learning purposes).
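A quick Python check of the Smith and Jones numbers. The sketch computes the variance the long way and also with one common shortcut for 0/1 data, the proportion times one minus the proportion; the text doesn't say which shortcut it has in mind, so treat that choice as my assumption.

```python
def proportion_and_variance(votes):
    """Mean and average squared deviation (denominator n) of 0/1 data."""
    p = sum(votes) / len(votes)
    variance = sum((v - p) ** 2 for v in votes) / len(votes)
    return p, variance

jones_as_one = [1, 1, 1, 1, 0]                # Jones counted as 1, Smith as 0
smith_as_one = [1 - v for v in jones_as_one]  # the opposite coding

p_jones, var_jones = proportion_and_variance(jones_as_one)
p_smith, var_smith = proportion_and_variance(smith_as_one)
shortcut = p_jones * (1 - p_jones)            # assumed shortcut: p times (1 - p)

print(f"Jones coding: proportion {p_jones:.2f}, variance {var_jones:.2f}, "
      f"standard deviation {var_jones ** 0.5:.2f}")
print(f"Smith coding: proportion {p_smith:.2f}, variance {var_smith:.2f}")
print(f"shortcut p(1-p): {shortcut:.2f}")
```

Flipping the coding changes the proportion from 0.80 to 0.20 but leaves the variance and the standard deviation where they were.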

The mean for binary data like this is called the sample proportion rather than the sample mean, and we need to decide which of the two alternatives we are going to focus on. But nothing is lost as you can clearly see if we multiply the proportions by 100 to get percentages. If Jones gets 80% in the poll then Smith must get 20% (assuming everybody expressed preference for one or the other or that we took out the undecideds before the calculations started).

That's probably enough for one post. Note that I gave you no actual formulas and neither did I give you the Greek letters usually employed to denote the population mean, proportion and variance or the letters used to denote the sample equivalents. We'll see how far I get before I have to do something like that. But you should now know what people mean by a sample mean, a sample proportion or a sample standard deviation, and that the mysterious population lends itself to calculating corresponding measures should we ever have enough time and money to do that.


*You may be aware that the formulas for sample variances and standard deviations usually don't employ the sample size, n, as the denominator, but n-1. I omitted the necessary explanation for that here, because my goals are more modest for this series. But someone asked why the formulas usually employ n-1 here rather than n.

The answer has to do with the final uses of these formulas, which is to estimate population equivalents to the sample concepts we have talked about here. There is no really easy and juicy way of explaining this (at least I haven't found one), but perhaps the best explanation has to do with the concept called degrees of freedom (d.o.f.). Roughly, the degrees of freedom is the number of independent sample observations we have for calculating a formula such as the sample mean or the sample standard deviation. Note that when we calculated the mean we could use all the sample data freely in that work. But when we calculated the variance we were using the previously calculated mean, with the added restriction that the sum of the sample observations divided by the sample size must equal that value. So we lost some independence there and this is reflected in the use of n-1 when we figure out the average dispersion in the sample. Sigh. I'm afraid that this wasn't very helpful unless you already knew all about it.

More generally, the degrees of freedom is the sample size minus the number of population parameters already estimated from the sample. Here we have only one such already estimated parameter which would be the population mean, estimated here by the sample mean.
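One way to see what n-1 buys us is to try every possible sample from a tiny population and average the results. The Python sketch below uses a made-up population of just three values (1, 2, 3) and all nine ordered samples of size two drawn with replacement; both the population and the with-replacement design are my assumptions, chosen so that everything can be listed exactly.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3]  # tiny hypothetical population
n = 2                   # sample size

def variance(values, denominator):
    """Sum of squared deviations around the mean, divided by `denominator`."""
    mean = Fraction(sum(values), len(values))
    return sum((x - mean) ** 2 for x in values) / denominator

# The true population variance (denominator: population size).
pop_var = variance(population, len(population))

# Every possible ordered sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

avg_with_n = sum(variance(s, n) for s in samples) / len(samples)
avg_with_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

print("population variance:           ", pop_var)
print("average sample variance (n):   ", avg_with_n)
print("average sample variance (n-1): ", avg_with_n_minus_1)
```

Dividing by n gives sample variances that are too small on average, while dividing by n-1 hits the population variance on average. In Python's `statistics` module the two versions are `pvariance` (n) and `variance` (n-1).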

Posted by: echidne / 10/09/2006 02:12:00 AM



Statistics Primer. Part 4: Sampling Distributions

The time has come to make a clearer link between the messages of my previous posts. One of them talked about the concept of probability and one about the sample mean, sample proportion and sample standard deviation. We are now going to build on those two posts to bring them together into a fruitful marriage of sorts.

To begin with, cast your mind back on the two disciplines: statistics and probability theory. How do they relate to each other? Think of this example: I have a deck of 52 cards, half of which are red and half black. If I randomly draw five cards from this deck, how likely is it that all of them are red? Now that is a question in probability theory. We know what the population (the deck of cards) looks like, and we wish to learn what the sample (the hand of five cards) might look like. Statistics reverses this way of thinking. For example, suppose that you have a deck of 52 cards, but you have manipulated the deck so that it no longer has exactly half red cards. You give me a randomly drawn hand of five cards and my job is to try to figure out from the hand I have what the deck proportions of red and black might be.

In short, statistics uses probability theory "in reverse".
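The forward, probability-theory version of the card question can be answered by the same kind of counting used for the dice: count the five-card hands made up only of red cards and divide by the count of all five-card hands. A minimal Python sketch, assuming the ordinary deck with 26 red cards and a hand dealt without putting cards back:

```python
from math import comb

ways_all_red = comb(26, 5)   # hands drawn only from the 26 red cards
ways_any_hand = comb(52, 5)  # all possible five-card hands

p_all_red = ways_all_red / ways_any_hand

print("all-red hands:", ways_all_red)
print("all hands:    ", ways_any_hand)
print(f"probability of five red cards: {p_all_red:.4f}")
```

The reverse, statistical question (what is the deck like, given the hand?) has no such one-line answer, which is why the rest of this post is needed.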

We need one additional pair of terms to get going, and that is the pair of "a variable" and "a constant". A constant is exactly what it sounds like: a number which has a specified constant value. For example, if we calculate the sample proportion of people who love chocolate to be 0.95, then that is a constant. A variable is then not a constant. If we have a sample of annual incomes for Americans, and these incomes vary from zero to, say, 200,000 dollars, then "annual income" is a variable with many different values depending on which sample observation we select.

Variables can be quantitative, as in the previous example, or they can be qualitative. An example of the latter would be the religion of a person in a poll about voting. It can take many different values (Catholic, Evangelical Christian, Jewish, Muslim, Buddhist, Other, Atheist). These are qualities, not quantities. But remember, we can count proportions to get quantitative data on them.

Quantitative variables are of two main types: continuous and discrete. A continuous variable is one where the following question will be answered in the affirmative: If you take any two values of the variable, is there always a third possible value between the two? Heights and weights are continuous variables. Check this: If one piece of chocolate weighs 2 oz. and another 3 oz. it's clearly possible to have a third piece which weighs something between 2 and 3 oz. The number of visitors to a store selling chocolate on one weekday, on the other hand, is a discrete variable. Check this: If the store had 20 visitors one Monday and 21 on another Monday, is it possible that it had, say, 20.75 visitors on some other Monday? The answer is negative (unless we are averaging over the visitor numbers but then we are talking about something different).

The next stage in our adventure is a magic trick. We are going to take the sample statistics such as the mean, the variance and the standard deviation, all constants, and we are going to transform them into variables! Well, not really, but something a little like that. How is this trick carried out?

Let's go back to the example of a deck of 52 playing cards. This deck has been manipulated by someone so that it doesn't necessarily have 26 red cards and 26 black cards as the usual decks do, and we don't know the true proportion of, say, red cards. Earlier I suggested that one hand of five playing cards has been randomly drawn from this deck. This sample could be used to make a point estimate of the proportion of red cards in the deck. Suppose the hand contains three red cards and two black ones. Then the proportion of red cards in the sample is 0.6. This is a constant, right?

Suppose now the evil person who messed up the deck takes these cards back into the deck, reshuffles it, and deals out another five cards. The new sample has two red cards, so the sample proportion is 0.4. The cards are returned to the deck again, and yet another hand of five cards is dealt out, with three red cards (a sample proportion of 0.6 again). And so on. What is going on here?

Note that we now have a variable, the sample proportion of red cards. It can take more than one value in the experiment. Imagine the dealing-out of five cards continued for, say, a hundred times. We'd then have a large number of sample proportions, and we could use those numbers to figure out the average of all these proportions! We could even calculate a variance and a standard deviation for the sample proportion! Layers upon layers, you might say. First we had one single sample proportion and one single sample standard deviation. Now we have a whole distribution of sample proportions and we have a standard deviation for the sample proportion itself!

The title of this post has to do with sampling distributions. This is a fancy name for showing us all the possible values and their probabilities that, say, a sample mean could take in this sort of repeated sampling. Or the sample proportion. Or the sample standard deviation. So a sampling distribution is a probability distribution for a sample statistic. If we bring in the heavy artillery of mathematics, it turns out that we can derive explicit formulas for the measures of central tendency and dispersion of these probability distributions. Quite wonderfully, it turns out that the average (or the expected value, as it is properly called) of all sample proportions is....voila! the population proportion! The very thing we wish to estimate! The same is true for the sampling distribution of the sample mean. It averages (in the expected value sense) to the population mean in this sort of repeated sampling (dealing out the five cards over and over from the deck).
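If you'd rather watch this happen than take my word for it, here is a rough simulation sketch in Python. Everything specific in it is an assumption of mine for illustration: the rigged deck is given 30 red cards, the hand is dealt 10,000 times, and the function name and the random seed are arbitrary.

```python
import random
from statistics import mean, pstdev

def simulate_sample_proportions(red_cards=30, deck_size=52, hand_size=5,
                                repetitions=10_000, seed=1):
    """Deal many hands from a rigged deck; record each hand's proportion of red cards."""
    rng = random.Random(seed)
    deck = ["red"] * red_cards + ["black"] * (deck_size - red_cards)
    proportions = []
    for _ in range(repetitions):
        # Deal a hand; the deck itself is untouched, so the cards are in effect returned.
        hand = rng.sample(deck, hand_size)
        proportions.append(hand.count("red") / hand_size)
    return proportions

proportions = simulate_sample_proportions()
print("true proportion of red cards:    ", 30 / 52)
print("average of the sample proportions:", mean(proportions))
print("their standard deviation:         ", pstdev(proportions))
```

The average of the simulated proportions should land close to the true deck proportion, while any single hand can be quite far off.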

The formulas for the variance and the standard deviation of the sample proportion and the sample mean are not quite as intuitive, but they make sense after some thought. Take the sample variance and divide it by the sample size. You get the variance of the sample mean/proportion. If you need the standard deviation, take a square root of the whole thing. So these measures of dispersion get smaller the larger the sample we use. Makes sense. Think about the deck example, but with ten cards dealt out every time. The proportion of red cards in these bigger samples is not going to vary as much from the true unknown proportion in the deck as it will in the smaller samples.
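A quick sketch of that recipe for the sample proportion, using the 0.6 from the three-red-cards hand. I'm assuming the usual shortcut where the variance of a yes/no variable is the proportion times its complement, and I'm ignoring the small correction that applies when cards are drawn without replacement from a small deck. The function name is hypothetical.

```python
from math import sqrt

def sd_of_sample_proportion(p, n):
    """Approximate standard deviation of a sample proportion for samples of size n."""
    variance = p * (1 - p) / n  # variance of a 0/1 variable, divided by the sample size
    return sqrt(variance)

sd_five = sd_of_sample_proportion(0.6, 5)
sd_ten = sd_of_sample_proportion(0.6, 10)
print(round(sd_five, 3), round(sd_ten, 3))  # the ten-card value is the smaller one
```

Doubling the sample size shrinks the standard deviation by a factor of the square root of two, not by half.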

We are almost ready to start playing with real examples of polls. There's only one missing link, and that has to do with the question of probabilities in the sampling distributions. How do we know how likely each possible sample mean or proportion is in the repeated sampling? That is the topic of the next post in this series.

The crucial point of this post in the series is the following: When we take a sample and use it to make estimates about the unknown population characteristics of, say, voter preferences between candidates, we think of this sample as coming from a distribution of many possible samples. This means that the sample proportion we get is a variable. It has a mean and a variance of its own, and the variance, in particular, affects the trust we can place in our estimates.


This is a wonderful toy to play with. It lets you see the sampling distribution appear.

You can read the earlier parts of this series here: Part 1, Part 2 and Part 3.

Posted by: echidne / 10/16/2006 12:09:00 AM




Statistics Primer. Part 5: Constructing a Confidence Interval for the Sample Proportion

At this point it's probably not possible just to jump in and read the posts separately unless you've had statistics already. If you are a statistical virgin, go and read the four earlier posts in this series first. The links to them are at the bottom of this post.

As a reminder, our goal is to learn to understand confidence intervals.* For example, political polls might say that Bush's approval is 36%, give or take 3%. Or a poll might tell us that 52% of people prefer candidate A over B, give or take 6%. Where do these numbers come from?

The previous post started the necessary explanations. The 52% or 36% figure is simply the proportion of people in the poll (the sample) who expressed a certain opinion. The other stuff, about giving or taking three or six percent, has to do with the sampling distribution of this sample proportion, and its size depends on the sample size (how many people were asked) and the inherent variability in the population (and in the sample).

Put another way, the 36% Bush approval rate, give or take 3%, is really an interval ranging from a low value of 33% to a high value of 39%. It's centered on the sample proportion and extends a certain distance in each direction from it. Like this, visually:

I------sample proportion------I

Or like this, for a smaller interval:

I---sample proportion---I

We call this interval a confidence interval when its length is derived from the sampling distribution for the sample proportion. Think of it this way: You are a blind-folded archer shooting funny kinds of arrows at a dartboard. The tip of each arrow looks like a staple or like the confidence interval I have drawn above, and you are to shoot lots of these staples to the dartboard. After you finish, you can take your blindfold off and go and check how many of your staples actually cover the bull's eye on the board. If you used really, really wide staples (say, a mile wide), you will cover the bull's eye every time, but you haven't really shown any precision at all. On the other hand, if you use very narrow staples you are going to miss the bull's eye a lot. The very wide staples allow us to be very confident in the knowledge that we have hit the bull's eye. The narrow staples would give us a lot more precision but less confidence. Some sort of a compromise is needed to get both.

The compromise statisticians use is like tying the staple length to the size of the dartboard, very roughly. In our case the dartboard is the sampling distribution for the sample proportion, and our knowledge of that sampling distribution allows us to specify a level of confidence first and then to make a staple of the necessary length to get that level of confidence.

The task is made much simpler by one remarkable result in statistics, called the Central Limit Theorem (CLT). This theorem shows that the sampling distribution of the sample mean, and the sample proportion, too, has one particular form if the sample we use is large enough**. Thus, if we learn the probabilities of this one distribution we can apply them to a whole lot of problems. Not all problems, mind you. But a large chunk of the more common problems.

This probability distribution is called the normal distribution or the Gaussian distribution (by its inventor) or the Bell Curve. The last name comes from its shape, as the distribution looks like a jingle bell viewed from the side. It has all sorts of neat characteristics.*** For example, the mean, the mode and the median are all identical and fall right below the peak of the bell. The two sides of the distribution are mirror images of each other. Here is a picture of the normal distribution****:


Ignore the blackened area for the time being. It's not part of the picture but a way of looking at probability calculations.

Note that there is an infinite number of these distributions, one for each possible value of the mean and the standard deviation, so some will be further to the right or to the left, and some will be fatter than others. To make things easier, statisticians usually show one specific normal distribution, called the standard normal distribution (as I have done above), and calculate the probabilities from that. The standard normal distribution has zero as its mean and one as the length of its standard deviation. So the horizontal axis here can be viewed as measuring the variable (say the sample proportion) in units equal to the length of the standard deviation of the sampling distribution. We can apply the standard normal distribution probabilities to any normally distributed variable by using linear transformations, but it's not necessary to go there right now.

The normal distribution is a probability distribution. For every possible value of the variable on the horizontal axis (say, the sample mean or the sample proportion) the area under the curve shows the corresponding probabilities. That's what the blackened area in the above picture demonstrates: the probability that this variable has a value between the mean and half a standard deviation above it. Note that the total area under the curve but above the horizontal axis equals one or 100% because it's certain that something happens and the values here cover all feasible ones.
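The size of that blackened area can be checked with a couple of lines of Python. This is only a sketch; it leans on `NormalDist` from Python's standard `statistics` module (Python 3.8 or later), and the variable names are mine.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area under the curve between the mean (0) and half a standard deviation above it.
shaded_area = standard_normal.cdf(0.5) - standard_normal.cdf(0.0)
print(round(shaded_area, 4))  # about 0.1915
```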

Now, the normal distribution has certain very nice characteristics, shown in the picture below:


(This picture uses the general Greek letters mu and sigma for the mean and standard deviation of whatever distribution we are looking at. Mu is that thing which looks like the letter u with a long tail in the front. Sigma is the letter which looks like number six. If this bothers you, just pretend that the mu is zero and the sigma one)

The percentage areas marked in the picture are probabilities. For any normally distributed variable the following is true: If we move two standard deviations up from the mean and two standard deviations down from the mean the interval we have created corresponds to the probability of 0.954 (add up all the areas under the curve between the two chosen values on the horizontal axis). Likewise, if we move three standard deviations up from the mean and three standard deviations down from the mean the interval we have created corresponds to the probability of 0.997. (What about trying to find an interval corresponding to probability 1 or certainty? Here we meet a slight snag as the normal distribution has tails which go on to infinity, so such an interval would range from minus infinity to plus infinity.)
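Here is a small sketch that adds up those areas for one, two and three standard deviations, again using `NormalDist` from Python's standard library. The function name and the second mean and standard deviation (100 and 15) are made up for illustration.

```python
from statistics import NormalDist

def prob_within(k, mu=0.0, sigma=1.0):
    """Probability that a normal variable falls within k standard deviations of its mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))

# The answer does not depend on which normal distribution we pick.
print(round(prob_within(2, mu=100, sigma=15), 4))
```

The three probabilities come out at roughly 0.683, 0.954 and 0.997.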

See what I'm getting to here? Suppose that I wanted to find an interval length in the sampling distribution of a sample proportion which corresponds to the probability of 0.95. How many standard deviations would I need to go out from the mean towards each tail to get an area of that size under the curve? We already know that two standard deviations gives us a fairly good rough approximation of it, but if I want to find the exact value I can get it from precalculated tables in statistics books and even on the net. The value we need is 1.96 standard deviations. In a similar manner the number of standard deviations each side of the mean that would give us the probability of 0.99 is 2.576.
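Instead of a printed table you can ask Python for these values. A minimal sketch, assuming `NormalDist.inv_cdf` from the standard `statistics` module; the helper name is mine.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """Standard deviations needed on each side of the mean to cover `level` of the area."""
    tail = (1 - level) / 2  # probability left over in each tail
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_for_confidence(level), 3))
```

This gives about 1.645, 1.96 and 2.576 for the three levels.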

Ok. I'm skipping quite a lot of mathematics here and some statistics, too, but I hope that the basic idea is clear, and that idea is that we can create the confidence interval (the staple in the arrow example) using this way of thinking:

First, calculate the sample proportion.

Second, calculate the standard deviation of the sample proportion. It is found as follows: Multiply the sample proportion by its complement and divide this product by the sample size. Then take the square root of the result.*****

Third, multiply this standard deviation of the sample proportion by the numerical value corresponding to a given confidence level in the standard normal distribution. For example, if we use 0.95 as the confidence level, then we multiply the standard deviation of the sample proportion by 1.96.

Fourth, add the product from the third step to the sample proportion to get the upper limit of the interval. Subtract it from the sample proportion to get the lower limit of the interval.
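The four steps can be sketched as one short Python function. The poll numbers in the usage line (360 approvals out of 1000 answers) are invented to resemble the 36%-give-or-take-3% example, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes, sample_size, level=0.95):
    """Normal-approximation confidence interval for a proportion."""
    # Step 1: the sample proportion.
    p_hat = successes / sample_size
    # Step 2: the standard deviation of the sample proportion.
    sd = sqrt(p_hat * (1 - p_hat) / sample_size)
    # Step 3: multiply by the standard normal value for the chosen confidence level.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * sd
    # Step 4: subtract and add to get the lower and upper limits.
    return p_hat - margin, p_hat + margin

low, high = proportion_confidence_interval(360, 1000)  # made-up poll
print(round(low, 3), round(high, 3))  # roughly 0.33 to 0.39
```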

The traditional choices for confidence levels are 0.90, 0.95 or 0.99, or the same in percentages (90%, 95% or 99%). What do these mean? Return to the arrow-shooting example. Suppose that you are blind-folded and shoot a staple-tipped arrow a hundred times to the dartboard. You then take the blindfold off and check the results. If you used a 0.95 staple, you should find, on average, about five staples totally missing the bull's eye. If you used a 0.99 staple, about one out of the hundred arrows should have missed the bull's eye.

Translated into statistics, what this means in the case of a 0.95 confidence level is that if the same study design was used a hundred times to draw samples of identical size, and if a confidence interval was created from each sample proportion, about five of these confidence intervals, on average, would not include the true unknown population proportion.
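This repeated-polls idea is easy to try out in a simulation. The sketch below is only an illustration under assumptions I've picked myself: a true population proportion of 0.4, polls of 500 people, 1000 repeated polls, and a fixed random seed so the run can be repeated.

```python
import random
from math import sqrt

def coverage_rate(true_p=0.4, sample_size=500, repetitions=1000, z=1.96, seed=7):
    """Share of simulated 95% confidence intervals that contain the true proportion."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        # One simulated poll: each respondent says "yes" with probability true_p.
        yes = sum(rng.random() < true_p for _ in range(sample_size))
        p_hat = yes / sample_size
        margin = z * sqrt(p_hat * (1 - p_hat) / sample_size)
        if p_hat - margin <= true_p <= p_hat + margin:
            hits += 1
    return hits / repetitions

rate = coverage_rate()
print(rate)  # typically close to 0.95, though rarely exactly 0.95
```

The share of intervals that cover the truth wobbles around 95% from run to run, which is the sense in which the confidence level is a long-run average rather than a guarantee.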

Note that the interval is wider if we want to be more confident. It is also wider if the standard deviation of the sample itself is large or if the sample size is small. All this makes sense, as we should have a wider interval when there is more variation in the sample and/or when the sample is smaller.

We are going to finish this post with an example. But before that I need to talk about a concept which you often see when polls are discussed: The margin of error or MOE. For example, a poll might tell you that the margin of error is plus/minus three percent. If the poll consisted of only one single question with yes-no type answers, the MOE would equal half the width of the confidence interval as we have calculated it here, i.e., it would give us the amount to add to and subtract from the sample proportion.

But most polls have several questions. Clearly, the confidence intervals can't all be the exact same length. So what is the MOE in these cases?

It tends to be the longest possible confidence interval we could get in a particular study. Usually the studies use the 0.95 level of confidence for getting this. Think about the calculations I outlined above. If we have fixed the confidence level at 0.95 and if we have a sample of a given size, what could we change to make the interval as long as possible? Suppose that we ignore the actual value of the sample proportion we get in any particular question and just replace it by 0.5. It turns out that this value makes the standard deviation of the sample proportion as large as it ever could be. So using this trick allows pollsters to give us just one margin of error for the whole study, when in reality each question has a different confidence interval. But this MOE is most likely an overstatement. Viewed from a different angle, the only real information it gives us is the sample size (as both the confidence level and the proportion used in the calculations are set to equal certain constants). The sample size is usually provided separately in any case.
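A small sketch of the pollsters' trick, assuming the 0.95 level (1.96 standard deviations) and plugging in 0.5 for the proportion. The sample size of 1006 is the one from the poll used as the example in this post; the function name is mine.

```python
from math import sqrt

def max_margin_of_error(sample_size, z=1.96):
    """Largest possible margin of error: the sample proportion is replaced by 0.5."""
    return z * sqrt(0.5 * 0.5 / sample_size)

print(round(max_margin_of_error(1006), 3))  # about 0.031, i.e. 3.1%

# Any other proportion gives a smaller product p * (1 - p) than 0.5 does.
print(0.5 * 0.5, 0.3 * 0.7, 0.1 * 0.9)
```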

Time for an example. It's from a recent poll, to be found here. The MOE for the whole study is given as plus or minus 3.1%. Let's pick one specific question from this poll, the one where the 1006 respondents are asked if they approve or disapprove of the Congress. The percentage numbers are as follows:

Approve 16%

Disapprove 75%

Not sure 9%

I started by taking out the people who were not sure. What polls do about this lot varies, but it's usually best to omit them from the analyses. That left me with 915 answers and the approval rate within this sample of 915 of 17.6%. The disapproval percentage is now 82.4%. I want to make a confidence interval for the disapproval rate among those respondents who express an opinion. So the first step is to calculate the sample proportion, which is 0.824. The second stage is to calculate the standard deviation of this sample proportion. We find it by first multiplying 0.824 by 0.176, then dividing the product by 915 and finally by taking the square root of the whole thing. This gives me 0.0126. Next, I find how much to add to and subtract from the sample proportion to get an interval by multiplying 0.0126 by 1.96, the number of standard deviations that is associated with the 0.95 level of confidence. The result is 0.0247, or 2.47%. The 0.95 level confidence interval is then found by subtracting this from 0.824 and then by adding it to 0.824. We get, after rounding, an interval from 0.80 to 0.85.
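For those who want to double-check the arithmetic, here is the same calculation as a Python sketch. The inputs are the poll figures quoted above; the variable names are mine, and the rounding of the 915 is an assumption about how the "not sure" answers were removed.

```python
from math import sqrt

respondents = 1006
approve_share, disapprove_share = 0.16, 0.75

# Drop the "not sure" answers and recompute the shares among those who remain.
decided = round(respondents * (approve_share + disapprove_share))
p_disapprove = disapprove_share / (approve_share + disapprove_share)

sd = sqrt(p_disapprove * (1 - p_disapprove) / decided)
margin = 1.96 * sd
interval = (p_disapprove - margin, p_disapprove + margin)

print(decided, round(p_disapprove, 3), round(sd, 4), round(margin, 4))
print(tuple(round(limit, 2) for limit in interval))
```

The numbers agree with the hand calculation: 915 answers, a proportion of 0.824, a standard deviation of about 0.0126 and an interval of roughly 0.80 to 0.85.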

A couple of comments: First, I did a lot of the calculating in my head so double-check. Second, note the difference between the 2.47% here and the MOE for the whole study which is given as 3.1%. The difference comes from the fact that the sample proportion here is quite far from 0.5, and so the standard deviation of the sample proportion is quite a bit smaller than the maximum possible value for this. Third, I claim no special significance for the choice I made about those who are not sure. You could experiment by putting them into the "approve" group or the "disapprove" group and see what happens to the results.
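The experiment suggested in the third comment might look something like the sketch below. It keeps all 1006 respondents and assumes the 9% who were not sure are moved wholesale into one group or the other; the labels and the helper name are mine.

```python
from math import sqrt

def margin_95(p, n):
    """Amount to add and subtract for a 0.95-level interval around proportion p."""
    return 1.96 * sqrt(p * (1 - p) / n)

n = 1006
scenarios = {
    '"not sure" counted as approve': 0.75,            # disapproval stays at 75%
    '"not sure" counted as disapprove': 0.75 + 0.09,  # disapproval rises to 84%
}
for label, p in scenarios.items():
    m = margin_95(p, n)
    print(f"{label}: disapproval {p:.2f}, interval {p - m:.3f} to {p + m:.3f}")
```

Either way the interval stays narrower than the study-wide 3.1%, because both proportions are well away from 0.5.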


* I talk about finding confidence intervals for the sample proportion in this post, but everything applies to the sample mean just handily, except for the exact formula of the standard deviation, which for the sample mean is just the sample standard deviation divided by the square root of the sample size.

**The CLT says something more than this. Google it if you are interested. Usually a "large enough" sample is quite small, though prudent people want the sample size to be at least thirty. For binary data of the type we get with polls the requirement is that there are at least five answers counted as 1 and five answers counted as 0.

***It is a continuous distribution, though, and that means that the sampling distribution of the sample proportion is only approximately normal, because we are using a continuous distribution to approximate a discrete one. It's not very hard to find the exact sampling distribution for the sample proportion and to use that to find the probabilities needed. This is what is done with small sample studies anyway. But the extra work is unnecessary as the probabilities start equaling those we get from the normal distribution when the sample gets bigger.

****The vertical axis is marked probability density in the picture, because continuous probability functions are called probability density functions in geeky company.

*****This is the shorthand formula for the standard deviation of the sample proportion. In an earlier post I used a pedagogical aid to get the value, but this formula gives it much faster.


You can find the earlier posts in this series here: Part1, Part2, Part3 and Part4.

Posted by: echidne / 10/20/2006 04:31:00 PM



Statistics Primer. Part 6: Wrapping it All Up In A Nice Package

My series on a primer for statistics is pretty much done. I could go on a little bit longer about how confidence intervals can be improved when a poll uses cluster sampling, for example, but you can google "bootstrap confidence intervals" yourself. And it's worth reminding all you erudite and nice readers that no poll is better than the basic plan for collecting the data. If the researchers did not use random sampling and/or if the sampling frame was selected poorly, the results mean nothing outside the actual group that is being questioned. The crucial question to answer is whether the findings can be generalized to the population that we are interested in, for example, to all Americans, and a poor polling plan makes the answer always negative.

Even a random sample can always be erroneous because of the effect of pure luck, but the sampling error can be quantified as discussed in the last installment to this series. But when a sample is non-randomly drawn, it can also be biased, and we cannot quantify the bias. For instance, a study which tries to find out the health habits of Americans by only asking people at the emergency clinics of hospitals is going to be biased. That is a bad way of trying to get a random sample, because it will oversample people who are sick right now and it is also likely to oversample poor people who have no regular general practitioner to go to. A similar problem may be created in polls which use land-line telephones, if the individuals who only have a cell phone are different from those who have a land-line phone. It's possible that the group these polls reach is older, on average, and more likely to be at home. If being older and spending more time at home answering the phone makes one's opinions different from those of people who flit about with a cell phone in their belt, then the polls could be biased.

A further selection bias in polls may come from non-response bias. Lots of people refuse to answer polls altogether. If the people who refuse have different opinions, on average, than those who eagerly chat with pollsters, then the results are likely to be biased, i.e., not generalizable to the population we want to learn about, say, likely voters.

Then there is the way questions are framed in polls. It is well known that a certain answer can be elicited by just changing the question. Introducing clear judgemental components is one nifty trick to achieve this. Or one could use versions of the old "Have you started brushing your teeth yet?" One reason why polls on what to do about abortions get such different results is in the way the questions are framed.

Statistics is a very large field, and what I've touched in this short series of posts is just a very small square inch of the field. I linked to some internet courses on statistics earlier on and I encourage you to pursue more study on your own. Or you can ask me questions either by e-mail or in the comments threads, and if I know the answer I will let you know what it is. Statistics can be fun! Honest!

To finish off these meanderings, I want to talk about something that pisses me off: the way the term "statistically significant" is misused all over the place. First, this term is a technical one, and the "significant" part does NOT have its everyday meaning. If a finding is statistically significant it could be totally unimportant in everyday utility, not earth-shaking at all, even trivial and frivolous. And a statistically nonsignificant finding does NOT mean that the study found nothing of importance. Let me clarify all this a little.

We have talked about confidence intervals in this series. If you have taken a statistics intro course you may remember that the topic of confidence intervals is usually followed by the topic of hypothesis testing, and that is the application where statistical significance is commonly introduced. "Hypothesis" is just a guess or a theory you have about the likely magnitude of some number or relationship between numbers, and "testing" means that we use data we have collected to see how your guess fares.

The way this testing goes is by setting the theory you DON'T support as the one you try to disprove. The theory you don't support is usually the conventional knowledge you try to prove wrong or the idea that some new policy or treatment has no effect and so on, and it's called "the null hypothesis". The theory you secretly want to prove is then called the "alternative hypothesis". And yes, statisticians are terrible wordsmiths.

So your hypothesis testing tries to prove the null hypothesis wrong. If you can prove it wrong then the alternative hypothesis must be right. Of course for this to work you must frame the two hypotheses so that nothing falls outside them. An example might be useful here:

Suppose that you have a new treatment for the symptoms of some chronic disease. You run a study where you give the new treatment to some patients randomly and the old treatment to a similar group of patients, also randomly selected. You then measure the reduction in unpleasant symptoms in the two groups, and you use these data to determine if the new treatment is worthwhile. Now, the null hypothesis here could be that the new treatment is the same as the old treatment. The alternative hypothesis would then be that the new treatment is different; it could be either better or worse than the old treatment. If you decide to test this pair of hypotheses, you are said to do a two-sided test of hypotheses, because both large and small values in your experimental group might be evidence that the new treatment is different from the old one. It could be better or it could be worse.

It is more likely that you are interested in finding if the new treatment is better than the old one. This would be a one-sided test of hypotheses, and you would write the null hypothesis differently. It would be that the new treatment is either the same as the old treatment or worse. Then the alternative hypothesis would be that the new treatment is better than the old treatment. If fewer bad symptoms is what the test measures then only low values in the experimental group would support the idea that the new treatment works better.

Given that you are using sample data to do all this, you add something like the staple of the confidence intervals to your testing procedure, and you report the results in a form which tells the informed reader how likely the disproving of the null hypothesis is to work in the population rather than in the sample. The staple we use in the two-sided test of hypotheses is exactly the same as we used for confidence interval construction, except that we construct the interval for the way the world looks if the null hypothesis is true! Remember the 95% confidence interval? In a two-sided test of hypothesis, using this level of confidence translates into a 5% significance level of the results, meaning that the interval we have created around the possible mean under the null hypothesis is so long that it only omits the utmost 2.5% of the possible distribution of sample means at each end of the distribution.

Now suppose that the made-up study I have described here finds that the sample mean of bad symptoms in the experimental group is so low that the probability of such a value drawn from a population actually centering on the average symptoms from the old treatment is at most 2.5%. Then statisticians using the 5% level of significance would argue that the study disproves the null hypothesis that the new treatment is no different from the old one.

If the study had used a one-sided test of hypothesis, the story changes slightly. It's as if we are only going to look at the staple arrows missing the bull's eye when they do it on one side of the dartboard (read the earlier posts for this metaphor), and we are going to hone down the staple width appropriately to do that. Thus, a 90% confidence interval based on the null hypothesis would leave 5% in each end of the distribution uncovered, and in my example we'd only look at the lower end of the distribution to see if the experimental results fall into that area or not. If they do, we reject the null hypothesis at the same 5% level of significance (or at 0.05 level if you go all decimal on me). If the results fall elsewhere on the distribution, we keep the null hypothesis and find the results not statistically significant.
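Here is a sketch of how the two kinds of tests can disagree at the same 5% level. All the study numbers are invented for illustration: an average symptom score of 10 under the old treatment, a standard deviation of the sample mean of 1, and an observed mean of 8.2 in the experimental group.

```python
from statistics import NormalDist

z = NormalDist()
two_sided_cutoff = z.inv_cdf(1 - 0.05 / 2)  # about 1.96: 2.5% left in each tail
one_sided_cutoff = z.inv_cdf(1 - 0.05)      # about 1.645: all 5% in one tail

# Hypothetical study results (made up for this sketch).
null_mean = 10.0       # average symptom score if the null hypothesis is true
standard_error = 1.0   # standard deviation of the sample mean
observed_mean = 8.2    # what the experimental group actually showed

z_score = (observed_mean - null_mean) / standard_error  # about -1.8

reject_two_sided = abs(z_score) > two_sided_cutoff  # looks at both ends
reject_one_sided = z_score < -one_sided_cutoff      # looks only at the low end

print(round(z_score, 2), reject_two_sided, reject_one_sided)
```

With these made-up numbers the one-sided test rejects the null hypothesis and the two-sided test does not, which is why it matters which test a study reports.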

But of course finding that the new treatment is no better than the old one IS significant! And finding that something is "statistically significant" just means that the null hypothesis was rejected at the 0.05 or 5% level, that the researchers used either a 95% or a 90% confidence interval for the null hypothesis data, depending on whether the test was two-sided or one-sided.

Likewise, I could make up a silly study about something quite silly and find the results statistically significant without saying anything at all about their real-world relevance. Note also that we could find something to be statistically significant and that something could be such a minor effect in reality that it would hardly matter at all.

The convention is to call the results "statistically significant" if the null hypothesis is rejected at the 0.05 or 5% level. If the researchers used 0.01 or 1% level in their calculations, then any rejecting of the null hypothesis that takes place is "statistically very significant". That's all these terms mean.

Many studies now dispense with the terms altogether and instead report something called the p-values. These are the actual probabilities of getting the experimental sample result or one more extreme if the null hypothesis in fact was the correct one. The smaller the p-value is the less likely the null hypothesis looks. You can always compare the p-values in your head to the 0.05 and 0.01 conventional values if you so wish.
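A p-value is quick to compute once the result is expressed in standard deviations from the null-hypothesis mean. The sketch below assumes a normal sampling distribution and a hypothetical result 1.8 standard deviations below that mean; the function name is mine.

```python
from statistics import NormalDist

def p_values(z_score):
    """One-sided (lower tail) and two-sided p-values for a standardized result."""
    standard_normal = NormalDist()
    one_sided = standard_normal.cdf(z_score)            # only low values count as evidence
    two_sided = 2 * standard_normal.cdf(-abs(z_score))  # extreme values at either end count
    return one_sided, two_sided

one_sided, two_sided = p_values(-1.8)  # hypothetical result
print(round(one_sided, 3), round(two_sided, 3))  # about 0.036 and 0.072
```

Compared with the conventional 0.05, the one-sided value falls below it and the two-sided value does not.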

The end of this particular road. I hope that you have enjoyed the series and that you will now go out and learn the zillions of additional things in statistics.



Comments: Statistics Primer. Part 1: Samples

Hmm, does random sampling give us a better result than requested sampling?

I avoid phone calls (answering machine screens everything). I avoid store-front polls and petitioners.

...maybe I'm just a cranky old man (at 46).

Darryl Pearce | Homepage | 10.04.06 - 4:29 pm | #

Darryl Pearce, you are talking about a serious problem in sampling, and that is respondent refusal. Such refusal is getting more and more common, these days, and it has at least two negative consequences for the study. First, and less importantly, it will make studies more expensive because they have to contact more people to get a sample of a certain size. Second, and more worrying, is the possibility that those who refuse are different from those who don't refuse. They might be younger or busier or more educated or less educated and so on. Which means that they might have different opinions on the asked questions. If so, the sample will not be representative.

Echidne of the snakes | Homepage | 10.04.06 - 4:39 pm | #

No stats for you!

Echidne, will you be talking about the discrepancy between poll numbers and election numbers? Not exit polls so much, but all the pre-election polls, which always seem to either overestimate the Democrats' strength, or else send ambiguous signals that analysts gleefully overinterpret (if an incumbent is under 50% they're in trouble because undecideds always break for the challenger!).

Basically, I'm just wondering if there's something structural about polling methodology that makes them inadequate election predictors - are Democrats more likely to respond to phone polls, or are polls simply not able to account for differences in voter turnout?

I'm always disappointed on election night, and I'd like to know why the polls never seem to prepare us for the reality.

Eli | Homepage | 10.04.06 - 4:42 pm | #

Echidne, will you be talking about the discrepancy between poll numbers and election numbers?

I hope to get there. I tried in a couple of posts right after the elections but it was clear one needed more statistics than I offered in those. There are explanations for what you described but none of them are satisfactory to me.

But first I want to talk about sampling distributions. It's really important to understand them in order to understand what goes on in margins of error and statistical significance and so on.

Echidne of the snakes | Homepage | 10.04.06 - 4:46 pm | #

Gravatar Now I'm all hungry.

Very nice exposition. If you had been my Stats instructor in undergrad, I would not have spent the entire class making color patterns in my graph paper.

I still managed to swing a C in the class, though. And I can figure a mean average.

left rev. | 10.04.06 - 5:51 pm | #

Gravatar B-bu-but, I thought the mean was different from the avergage.



...gee, I hope that's a

Darryl Pearce | Homepage | 10.04.06 - 6:21 pm | #

Gravatar D'oh! ...dang spelling errors!

Darryl Pearce | Homepage | 10.04.06 - 6:22 pm | #

Nice intro. And I'm pleased, I already know about this so far (so I'm not as bad as I thought). Keep posting...

Anonymous | 10.04.06 - 6:23 pm | #

Gravatar Ah, the mean, the mode and the median. And the arithmetic average which is the same as the mean. These are to do with descriptive statistics and I'm not sure how much of that to talk about. Probably some, as the standard deviation is a useful thing to talk about.

Echidne of the snakes | Homepage | 10.04.06 - 6:36 pm | #

Gravatar Darryl, bad boy, no cookie for screwing / skewing the sample.

Nice post, Echidne. Disclosure: I provide data to pollsters every day. It is almost impossible to exclude bias entirely. If you're doing a phone poll (and who doesn't), your sample is skewed towards those who actually have phones, who may be older, more stable, etc.

Bruce | 10.04.06 - 6:38 pm | #

Gravatar An excellent example of a biased poll is the Truman/Dewey election. Telephone polls showed Dewey with a big lead, so newspapers gave the election to Dewey before the votes were counted.

However, in 1948 telephones were relatively expensive, so telephone polls skewed Republican. Hence the famous photo of a smiling Truman the next morning holding up a newspaper with the headline "Dewey Defeats Truman".

maidhc | 10.04.06 - 6:53 pm | #

Gravatar Is it possible to do a pilot study to check respondent refusal - for example, to call people in different areas, using locality as a proxy for demographic characteristics?

Alon Levy | Homepage | 10.04.06 - 8:47 pm | #

Gravatar Is it possible to do a pilot study to check respondent refusal - for example, to call people in different areas, using locality as a proxy for demographic characteristics?

I think so, Alon Levy, though I'm not sure if people do these as a pilot study. In most studies there is an attempt to find out what the refusers are like, compared to those who don't refuse. For example, sometimes it's possible to use other data people have to figure out if the refusers are, say, younger or healthier or less healthy or poorer and so on. But this gets harder the more refusers there are.

Some studies with important goals actually keep on trying to contact the refusers and those who dropped out for some unknown reasons, just to make sure that their disappearance isn't causing bias. An example is the long follow-up health studies on heart disease (such as the Framingham study). If the people who drop out are the ones who got sick the results could be totally ruined.

But something like a political poll doesn't have the money and time to really follow this topic.

Echidne of the snakes | Homepage | 10.04.06 - 9:01 pm | #

Gravatar Very good explanation - should I ever be forced to teach statistics, I'll steal the soup analogy. One little quibble - I was under the impression that for sufficiently large and homogeneous populations at least, what matters is the absolute size of the spoon, so to speak, not its relative size compared to the total population (i.e., the errors you get by sampling 2000 respondents in the US are going to be very close to those you get by sampling 2000 people in, say, China).

christian h. | 10.05.06 - 7:35 pm | #
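
The quibble above can be checked with a few lines of arithmetic. This is only a sketch: it assumes simple random sampling, a true proportion of 0.5, and rough round-number population sizes chosen for illustration (none of these figures come from the thread). The function name `standard_error` is made up for this example.

```python
import math

def standard_error(p, n, population):
    """Standard error of a sample proportion, including the
    finite population correction for sampling without replacement."""
    fpc = math.sqrt((population - n) / (population - 1))
    return math.sqrt(p * (1 - p) / n) * fpc

# Illustrative, rounded population sizes (assumptions, not data from the thread).
se_us = standard_error(0.5, 2000, 300_000_000)
se_china = standard_error(0.5, 2000, 1_300_000_000)
se_town = standard_error(0.5, 2000, 10_000)

print(f"US-sized population:    {se_us:.6f}")
print(f"China-sized population: {se_china:.6f}")
print(f"Town of 10,000:         {se_town:.6f}")
```

Under these assumptions the two large populations give practically the same standard error, and the population size only starts to matter when the sample is a sizeable fraction of it, as in the small-town case.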

Gravatar Thank you. I really enjoy your soup analogy and am looking forward to more!

Kelly | Homepage | 10.06.06 - 4:59 pm | #

Commenting by HaloScan: Statistics Primer. Part 1: Samples


Comments: Statistics Primer. Part 2: Probability

Gravatar Bless your heart, that's exactly what I needed to be reminded of (or perhaps told in the first place).

I was breaking up some green peppers for chili yesterday, and I'd noticed that the 2nd pepper I'd picked up had a little label on it (a little paper pepper label) that needed to be removed. I hadn't noticed a label on the 1st pepper, but I did see labels on the 3rd and 4th peppers. So I wondered, what are the chances there was a label on the 1st pepper?

And I realized that, although I had two semesters of statistics, a sociology course which was mostly statistics and a year-long experimental psych course, I didn't know how to answer that question!

It could be creeping decrepitude, but I knew enough to know that considered objectively, once I'd gone back and looked at the pieces (assuming I did a perfect search for the label) the probability of there having been a label was either one or zero.

But I also remember that experience teaches us something, and that (from Bayes' Rule) you can predict the probabilities of certain types of error from knowledge of antecedent probabilities. In general, it's going to be really damn hard to construct an instrument that diagnoses a very rare event (e.g., a terrorist on a plane) better than going eenie meenie miney mo.

But I hadn't figured out what you can tell about antecedent conditions from experimental results. Well, I mean, you've got your sample of 4 peppers (only 3 of which turned out to actually have a label, btw) and the population of peppers that left the pepper picking and packing plant with labels, and you can use the T test or Chi square or F test or... this is where my head starts to spin, but the idea is to figure out the probability of the next 100 sample peppers having 0, 50, 75, or 100 labels.

So that's the realm of statistics and you've been talking about the realm of probability, and I'm pretty sure the only link I ever saw between them was "OK, we talked about probability for our last exam, now we'll go on to statistics."

So I hereby demand you make that link!

In the meantime, many thanks for waking up a path or two in the old grey matter.

Rev. Bob | Homepage | 10.05.06 - 4:45 am | #
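
The remark in the comment above about Bayes' Rule and very rare events can be made concrete with a small calculation. The numbers here (a one-in-a-million prior, a detector that is right 99% of the time in both directions) are invented purely for illustration, and `posterior_positive` is a hypothetical helper name.

```python
def posterior_positive(prior, sensitivity, false_positive_rate):
    """Bayes' Rule: probability the event is real given a positive reading."""
    true_pos = prior * sensitivity
    false_pos = (1 - prior) * false_positive_rate
    return true_pos / (true_pos + false_pos)

# Assumed, made-up figures for a very rare event and a seemingly good detector.
prior = 1e-6
p = posterior_positive(prior, sensitivity=0.99, false_positive_rate=0.01)

print(f"P(event | alarm) = {p:.6f}")
```

With these assumed inputs, almost every alarm is a false alarm, because the false positives from the huge innocent majority swamp the true positives from the tiny guilty minority.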

Gravatar Whoops! Sorry. The thing I really was wondering about was whether, knowing that 3 of the 4 peppers had labels on them, making a prediction about the chance of finding a label on the 1st pepper (the one I hadn't looked at) was in any sense a meaningful question.

Sorry, got sidetracked there.

Rev. Bob | Homepage | 10.05.06 - 4:53 am | #

Gravatar Heh, Rev. Bob. There is a connection between probability theory and statistics, and it is exactly the way you speculate.

One example I've used is this: In probability theory we know that the deck of 52 playing cards has four aces, twelve picture cards and so on, and we ask what the probability is, for example, of drawing at least two queens in a randomly dealt hand of five cards. So we know the deck but not the cards we are going to draw yet.

In statistics, we have a deck of cards but we don't know what it is like. Instead, we know what is in the hand of cards we have been randomly dealt from that deck and we try to deduce some characteristic of the deck from that hand. In this case we know the cards in the hand but not the deck contents.

Echidne of the snakes | Homepage | 10.05.06 - 4:54 am | #
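
The probability-theory half of the card example above (known deck, unknown hand) can be worked out by counting. A minimal sketch, assuming a standard 52-card deck with four queens and a five-card hand dealt without replacement; the function name is invented for this example.

```python
from math import comb

def prob_at_least_two_queens(hand=5, deck=52, queens=4):
    """P(at least two queens) = 1 - P(no queens) - P(exactly one queen)."""
    total = comb(deck, hand)
    zero = comb(deck - queens, hand)
    one = queens * comb(deck - queens, hand - 1)
    return 1 - (zero + one) / total

p_queens = prob_at_least_two_queens()
print(f"P(at least two queens in five cards) = {p_queens:.4f}")
```

The statistics half runs the other way: given the hand, we would try to infer how many queens the unknown deck holds.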

Gravatar OW! OW! OW!


sorry for screaming...but ANY KINDA MATH STUFF HURTS SOOOO MUCH.

20 years ago I took an adult remedial algebra course at Towson State and was sitting there, totally befuddled in the second or third class, as the teacher patiently and laboriously worked to explain a quadratic equation.

Then it hit me, like a two by four between a donkey's eyes, I blurted out, "holy shit that's just like a racing form from the horse track!"

And given my poor record at the track, I gave up trying to understand anything beyond balancing my checkbook.

Turns out my talent is in reading people, you learn that as a semi-pro beggar, and gave up the track for the poker table.

After years of research and expensive study I have learned this, "Don't worry about your bad cards, count on MY bad luck!"

Now, whatta ya know about powerball and chaos theory?

Your devoted supplicant, fartsinsleep

fartsinsleep | 10.05.06 - 6:17 am | #

Gravatar Now, whatta ya know about powerball and chaos theory?

Not enough to pontificate on them.

Echidne of the snakes | Homepage | 10.05.06 - 7:06 am | #

Gravatar Or, in the dice example, one may observe that for each possible number on the first die there is one and only one number on the second die that allows for a total of seven, and therefore it is only necessary to consider the probability of a particular number coming up on the second die. In other words, one out of six, but much simpler.

If you're interested in the probability of any other number coming up from a roll of two dice it's more complicated since there's numbers on the first die that preclude making the target total at all, but seven is simple.

Ledasmom | 10.05.06 - 7:39 am | #
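
Both approaches mentioned above (counting every outcome, and the shortcut for seven) can be checked by brute force. A short sketch, assuming two fair six-sided dice:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 equally likely rolls give each total.
totals = Counter(a + b for a, b in product(range(1, 7), repeat=2))
prob = {total: Fraction(count, 36) for total, count in sorted(totals.items())}

for total, p in prob.items():
    print(total, p)
```

Seven is the only total that every value of the first die can still reach, which is why its probability reduces to one out of six; the other totals have fewer than six matching pairs.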

Gravatar Good of you, Ledasmom. Yes, there are sometimes ways of doing the counting quicker than in the process I described, though the latter will always work.

Echidne of the snakes | Homepage | 10.05.06 - 8:01 am | #

Gravatar Dear Echidne,

I love reading you so much!

You remind me that there are wise and kind teachers in the world (someday I'd like to join their ranks, too). You are insightful and always captivating, and without a trace of pedantry.

I've been meaning to drop you a fan note, but let this suffice. I admire you - and I thank you for having made my fun reading so very fun.

PS The stats refresher is delightful, as well.

sk | 10.05.06 - 9:35 am | #

Gravatar OK, wait a minute.

There is a connection between probability theory and statistics, and it is exactly the way you speculate.

What way I speculate? I seriously don't know the answer to that question. I think I'm the guy when they got to that part in class, everybody nodded, mumbled "obviously" and went on while I was looking out the window.

Is it scientifically(?) mathematically(?) sadistically(?) respectable to guess about whether the 4th pepper will have a label on it?

Rev. Bob | Homepage | 10.05.06 - 11:33 am | #

Gravatar any one done any research on 911

info wars | 10.05.06 - 11:51 am | #

I work with phylogenetics, so I deal with likelihood and Bayes a bit. This is how I understand it. Normally, you look at probability of an outcome based on the data. With likelihood you look at the probability of seeing your data based on the outcome. With Bayes, you take into account prior assumptions about your data. Likelihood and Bayes are very similar, but Bayes involves priors.

Nymphalidae | 10.05.06 - 12:12 pm | #

Gravatar Oooh, that brings up some interesting experiments that the psychology department could do on the mathematics and business departments.

..but me, I'm an English major!

Nevertheless, the mistaken notions one might make in assuming the nature of card decks (or dice) are what give sleight of hand artists their uncommon advantage.

Darryl Pearce | Homepage | 10.05.06 - 1:22 pm | #

Gravatar Echidne--

A small point of clarification.

I may be pedantically(?) overcomplicating what you're trying to do, but I think your last paragraph about 70% chance of X winning based on 70% of surveyed individuals said they would vote for X is confusing. For a reasonable survey size, if 70% of the survey so far say they will vote for X, the probability that the next person asked will vote for X is .7, but the probability of X getting more than 50% of the total vote is >>70%.

The difference is, of course, related to Rev. Bob's problem above: confidence (or credibility) intervals about the estimated probability. If 3 out of 3 chilis polled vote for sticker, the point estimate of the probability of chilis having stickers is 1.0, but the confidence interval of that estimate is pretty broad. If half of all green chilis have stickers, there was a 1 in 8 chance that all 3 of the inspected chilis would have stickers due to chance.

Thanks for putting up what looks to become a great resource for understanding sampling, surveys, polling, etc.!


ps: As an empirical Bayesian, I'd give 3:1 odds that one of your next 2 installments will be about margins of error and cover the tail probability that 70% polling can come from a population with true p < 50%.

tp | 10.05.06 - 1:33 pm | #
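
The two tail probabilities in the comment above can be computed exactly from the binomial distribution. This is a sketch only: the poll size of 100 is an assumption made for illustration (the comment does not give one), and `binom_tail` is a helper written for this example.

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Three stickers out of three peppers when half of all peppers have stickers.
three_of_three = binom_tail(3, 3, 0.5)

# 70 or more of 100 respondents backing X when the true support is only 50%.
# The sample size of 100 is an assumed value.
poll_tail = binom_tail(100, 70, 0.5)

print(f"P(3 of 3 | p = 0.5) = {three_of_three}")
print(f"P(>= 70 of 100 | p = 0.5) = {poll_tail:.2e}")
```

The first number is the 1 in 8 quoted in the comment. The second shows why, for a reasonably sized sample, a 70% poll result makes a true support level at or below 50% very unlikely.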

Gravatar It's peculiar the connections our brains make but I think the comparison of "magic" assumption and statistics bears a little exploration.

An anecdote: I was wondering how an internet card trick was done, the end result being "my card" was somehow missing from the hand.

I was wondering out loud about how that was done and my spouse said, "Don't choose any card!"

And then the "trick" was easy to see: all new cards had been dealt.

Yet..., we know there are lies, damned lies, and statistics. Given the right "nudge" people could look at a given number from any angle.

Darryl Pearce | Homepage | 10.05.06 - 1:34 pm | #

Gravatar Thanks for this Echidne!! This is an awesome refresher course. I'm hoping that maybe you can shed some clarity as you go forward as to why, oh why? n-1? why not just n?

sailaway | 10.05.06 - 2:02 pm | #

Gravatar tp, I'm not talking about confidence intervals yet. You have to wait for that a little.

sailaway, you mean the formulas for sample standard deviation and sample variance, most likely. The reason why we divide by n-1 and not by n there is that if we didn't the sample statistic would not be an unbiased estimator of the population parameter we try to estimate. I will try to think of a better way of explaining this.

Echidne of the snakes | Homepage | 10.05.06 - 3:52 pm | #
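
One way to see the n-1 point without equations is to simulate it. The sketch below assumes a normal population with variance 1 and samples of size 5 (both arbitrary choices for illustration); `simulate` is a hypothetical helper name.

```python
import random

def simulate(n=5, trials=20000, seed=0):
    """Average the divide-by-n and divide-by-(n-1) variance estimates
    over many samples drawn from a population whose variance is 1."""
    rng = random.Random(seed)
    total_n = 0.0
    total_n_minus_1 = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        m = sum(xs) / n
        squared_devs = sum((x - m) ** 2 for x in xs)
        total_n += squared_devs / n
        total_n_minus_1 += squared_devs / (n - 1)
    return total_n / trials, total_n_minus_1 / trials

avg_div_n, avg_div_n_minus_1 = simulate()
print(f"dividing by n:     {avg_div_n:.3f}")
print(f"dividing by n - 1: {avg_div_n_minus_1:.3f}")
```

In repeated sampling the divide-by-n version settles noticeably below the true variance of 1, while the n-1 version settles close to it, which is what "unbiased" means here.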

Gravatar I may be pedantically(?) overcomplicating what you're trying to do, but I think your last paragraph about 70% chance of X winning based on 70% of surveyed individuals said they would vote for X is confusing. For a reasonable survey size, if 70% of the survey so far say they will vote for X, the probability that the next person asked will vote for X is .7, but the probability of X getting more than 50% of the total vote is >>70%.

You are quite right that 0.7 is the probability of a randomly drawn individual (from the surveyed population) voting for X. The problem in writing a primer on statistics is that the introductory courses are a semester long for a good reason. I'm omitting and simplifying a lot.

Echidne of the snakes | Homepage | 10.05.06 - 7:36 pm | #

Gravatar can i make a suggestion? (yeah, that's a rhetorical question. i'm gonna make a suggestion no matter what you say)

can you put the permalink for each of these stats posts on the front page sidebar of your blog? it's really a great series of posts. i'm sure i'm gonna wanna cite it in some argument at some point. but in a couple of weeks, i'll have a hard time finding it. and i'm too lazy to dig around your archives

upyernoz | Homepage | 10.05.06 - 9:35 pm | #

Gravatar upyernoz, I'm planning to put them on my other website, the one mentioned at the top of the blog. It already has my gender gap series. But I could put them in the sidebar, too.

Echidne of the snakes | Homepage | 10.05.06 - 10:17 pm | #

Gravatar The probability of having a label is for all intents and purposes unknown. You know that there are two green peppers with labels and one without. Strictly speaking, the probability should be 2/3. But almost any probability from 0 to 1 could produce the 2-to-1 result within the 95% confidence interval.

If the probability of having a label is p, and p is small, then the probability of having at least two labels is p^3 + 3(1-p)p^2 = 3p^2 - 2p^3. The lower bound for the 95% confidence interval is the solution to 3p^2 - 2p^3 = 0.05, which is approximately 0.13535035. If p is large, then the probability of having at most two labels is 1 - p^3, and the 95% confidence interval's upper bound is the solution to p^3 = 0.95 --> p = 0.98304757.

So all you can say is that with 95% certainty, the probability is between 13.5% and 98.3%.

Alon Levy | Homepage | 10.06.06 - 2:53 am | #
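
The two bounds in the comment above can be reproduced numerically. The sketch below solves the same two equations by bisection; `bisect` is a small helper written for this example, not a library call. One hedge worth stating: each bound puts 5% in a single tail, so taken together the pair behaves like a 90% two-sided interval, and a conventional 95% interval would use 2.5% per tail. The sketch keeps the comment's choice.

```python
def bisect(f, lo, hi, tol=1e-10):
    """Root of an increasing function f on [lo, hi], found by bisection."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Lower bound: P(at least two labels among three peppers) = 0.05.
lower = bisect(lambda p: 3 * p**2 - 2 * p**3 - 0.05, 0.0, 1.0)

# Upper bound: P(at most two labels among three) = 1 - p^3 = 0.05.
upper = bisect(lambda p: p**3 - 0.95, 0.0, 1.0)

print(f"lower = {lower:.6f}, upper = {upper:.6f}")
```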

Gravatar How to find out if there was a sticker on the first pepper chopped:

If there's a funny texture while you're eating, it had a sticker.

Samantha Vimes | 10.06.06 - 8:17 am | #

Gravatar Heh, Samantha Vimes. Alon Levy, I think you are correct, except that n=4 in Rev. Bob's example and 3 peppers were known to have stickers. The question was whether the fourth one, already chopped, had a label or not. This would be a fairly straightforward binomial probability question if we knew pi, the population proportion of labels. As we don't know that one, all we can do is define possible values for different pis the way you did.

Echidne of the snakes | Homepage | 10.06.06 - 4:03 pm | #

Gravatar Andrew, thank you so much for that answer. Don't worry about the number of peppers. I think I'm blogging this because of some nagging uncertainty, and in my examples I threw a pepper out so your answer would make sense. Oh. Can I use your answer on my blog too, please? It isn't nearly as respectable a blog as Echidne's.

Thank you too, Samantha. Certainly the pragmatic test collapses the wave function. And Echidne, I'm meditating on your answer and reading your subsequent posts, and hopefully enlightenment will strike. "Binomial problem" is certainly familiar. I'm pretty sure I handled all this arithmetic before, but I suspect I used it without understanding it.

Rev. Bob | Homepage | 10.14.06 - 8:47 am | #

End Commenting by HaloScan: Statistics Primer. Part 2: Probability


Comments: Statistics Primer. Part 3: Sample Statistics

Gravatar Thanks for these posts on statistics. I'm finding them useful and interesting. (I enjoy your other posts, as well.) As for what a one-semester introductory course can do: well, I was less than impressed with my introductory course (many years ago), and I've subsequently come to think of the variance and so forth simply as values that an Excel spreadsheet can calculate for you after you have put in some data. Anyway, your three posts so far compare very favorably to my one-semester introduction, and I hope you keep the series going. I would be particularly interested in why the absolute-value approach to variance calculations becomes cumbersome and in more about the degrees-of-freedom concept. At the least you have motivated me to try to look it up myself.

I'm a bit surprised that a goddess dabbles in probability and statistics.

Rick | Homepage | 10.09.06 - 1:24 pm | #

Gravatar OK, this almost made my brain hurt.

But in a good way.

"Your average statistics student would just say you are mean. And that most types of statistics are just the prefered mode of torture. But you keep going until I'm caught in the middle and finally scream me dyin."-an actual study aid presented to my lab group by a TA from Ireland. I didn't get it.

left rev. | 10.09.06 - 6:04 pm | #

Gravatar Rick, one reason why we don't use the absolutes is shown here: Mea...nDeviation.html

Echidne of the snakes | Homepage | 10.09.06 - 6:09 pm | #

Gravatar Another way of explaining why we don't divide by n in the formulas for sample variance is that we will end up with a biased estimator for the population variance if we don't correct the formula by replacing n with n-1. It would take too long to show this here, but think of the formula for the population variance. It would be the average of the squared deviations of all the observations in the population around the population mean. The sample equivalent doesn't use the population mean, it uses the sample mean instead, and this is what can introduce bias into the variance formula if it is used as a point estimator for the population variance.

Note that if we have a large sample size, n is going to be quite close to n-1 and the bias from the simplified formula in the post would be small.

Echidne of the snakes | Homepage | 10.09.06 - 6:17 pm | #
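
For readers who do want the equations that the comment above leaves out, here is one standard way to write the argument. It is a sketch under the usual textbook assumptions, which the thread does not spell out: the observations x_1, ..., x_n are independent draws from a population with mean \mu and variance \sigma^2, and \bar{x} is the sample mean.

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} (x_i - \mu)^2 - n(\bar{x} - \mu)^2 \\
E\!\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]
  &= n\sigma^2 - n \cdot \frac{\sigma^2}{n} = (n-1)\sigma^2 \\
E\!\left[\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2\right]
  &= \frac{n-1}{n}\,\sigma^2,
\qquad
E\!\left[\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2\right] = \sigma^2
\end{align*}
```

The factor (n-1)/n is the bias from using the sample mean in place of the population mean, and it approaches 1 as n grows, which matches the remark about large samples.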

Gravatar Goddess E--

OK, let me be your harshest (least gentle) critic.

This post & series are _quite_ good; your last comment explaining the unbiasedness of n-1 was outstanding. I wouldn't have done it without equations.

I just hope you have enough energy & time to carry this through to unequal-probability sampling (stratified & such).

tp | 10.09.06 - 11:28 pm | #

Gravatar I braced myself there, tp.

We'll see how far we get, but I've decided to jump straight into sampling distributions in the next post so perhaps we get to stratified sampling, too.

Echidne of the snakes | Homepage | 10.10.06 - 12:09 am | #

Gravatar TP, I think Echidne's right to do it without equations. She's trying to explain statistics in a social scientific context, so teaching things like what sampling is, how to avoid biased estimators, and what 95% confidence means is more important than writing an intro to probability and statistics textbook.

Alon Levy | Homepage | 10.11.06 - 8:16 pm | #

Gravatar Yes, I'm trying to keep things light and concrete, because the series is also intended for those who fear statistics. We'll see if it works.

Echidne of the snakes | Homepage | 10.12.06 - 3:44 am | #

End Commenting by HaloScan: Statistics Primer. Part 3: Sample Statistics







Mean, Median, and Mode Discussion


Student: When do we use mean and when do we use median?

Mentor: It is up to the researcher to decide. The important thing is to make sure you tell which method you use. Unfortunately, too often people call mean, median and mode by the same name: average.

Student: What is mode?

Mentor: The easiest way to look at modes is on histograms. Let us imagine a histogram with the smallest possible class intervals (see also Increase or Decrease? Discussion ).

Student: Then every different piece of data contributes to only one bin in the histogram.

Mentor: Now let us consider the value that repeats most often. It will look like the highest peak on our histogram. This value is called the mode. If there are several modes, data is called multimodal. Can you make an example of trimodal data?

Student: Data with three modes? Sure. Say, if somebody counted numbers of eggs in 20 tree creepers' nests, they could get these numbers: 4, 3, 1, 2, 6, 3, 4, 5, 2, 6, 4, 3, 3, 3, 6, 4, 6, 4, 2, 6. I can make a histogram:

[Figure: histogram of the number of eggs per nest]

Mentor: There are three values that appear most often: 3, 4, and 6, so all these values are modes. Modes are often used for so-called qualitative data, that is, data that describes qualities rather than quantities.

Student: What about median?

Mentor: Median is simply the middle piece of data, after you have sorted data from the smallest to the largest. In your nest example, you sort the numbers first: 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 6, 6, 6, 6, 6 eggs. There is an even number of values, so the middle (or median) is between the first and second 4. Because they are the same, we can easily say that the median is four, but if they were different, say if the median was between a 3 and a 4, we would do (3+4)/2=3.5.

Student: So, if there is an even number of values, the median is equal to the sum of the two middle values divided by two.

Mentor: If no birds had nests with only one egg, we would have values of 2, 3, 4, 5, and 6. In this case, the middle number or the median would be the second 4, and we would not need to add or divide because there were an odd number of values.

Student: The last type of averages I would like to know about is mean.

Mentor: Sometimes it is called arithmetic mean, because there are other things in math that are called mean. For example, there is a geometric mean and a harmonic mean. The arithmetic mean of a set of values is a sum of all values, divided by their number. In your nest example,

mean = (4+3+1+2+6+3+4+5+2+6+4+3+3+3+6+4+6+4+2+6)/20 = 77/20 = 3.85
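
The three measures for the nest data can also be computed with Python's standard `statistics` module. A minimal sketch, using the 20 egg counts listed by the student earlier in this discussion (`multimode` needs Python 3.8 or later):

```python
from statistics import mean, median, multimode

# Egg counts from the 20 tree creepers' nests in the example.
eggs = [4, 3, 1, 2, 6, 3, 4, 5, 2, 6, 4, 3, 3, 3, 6, 4, 6, 4, 2, 6]

egg_mean = mean(eggs)
egg_median = median(eggs)
egg_modes = sorted(multimode(eggs))  # every value tied for most frequent

print("mean:", egg_mean)
print("median:", egg_median)
print("modes:", egg_modes)
```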

Student: Which one is better: mean, median or mode?

Mentor: It depends on your goals. I can give you some examples to show you why. Consider a company that has nine employees with salaries of 35,000 a year, and their supervisor makes 150,000 a year. If you want to describe the typical salary in the company, which statistics will you use?

Student: I will use mode (35,000), because it tells what salary most people get.

Mentor: What if you are a recruiting officer for the company that wants to make a good impression on a prospective employee?

Student: The mean is (35,000*9 + 150,000)/10 = 46,500. I would probably say: "The average salary in our company is 46,500" using mean.

Mentor: In each case, you have to decide for yourself which statistics to use.

Student: It also helps to know which ones other people are using!
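
The salary example can be checked the same way. A short sketch with the ten salaries from the dialogue, showing how far the mean sits from the other two measures when one value is extreme:

```python
from statistics import mean, median, mode

# Nine employees at 35,000 and one supervisor at 150,000.
salaries = [35_000] * 9 + [150_000]

salary_mean = mean(salaries)
salary_median = median(salaries)
salary_mode = mode(salaries)

print("mean:", salary_mean)
print("median:", salary_median)
print("mode:", salary_mode)
```

Only the mean is pulled up by the supervisor's salary, which is why the recruiter and the prospective employee might each prefer a different "average".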

a resource from CSERD, a pathway portal of NSDL

© 1994-2006 Shodor | Website Feedback





HyperStat Online Contents

Degrees of Freedom

Estimates of parameters can be based upon different amounts of information. The number of independent pieces of information that go into the estimate of a parameter is called the degrees of freedom (df). In general, the degrees of freedom of an estimate is equal to the number of independent scores that go into the estimate minus the number of parameters estimated as intermediate steps in the estimation of the parameter itself. For example, if the variance, σ², is to be estimated from a random sample of N independent scores, then the degrees of freedom is equal to the number of independent scores (N) minus the number of parameters estimated as intermediate steps (one, μ estimated by M) and is therefore equal to N-1.
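
A small numerical illustration of why one degree of freedom is lost: once the sample mean M has been computed, the deviations from it must sum to zero, so the last deviation is fixed by the other N-1. The five scores below are made up for this sketch.

```python
from statistics import variance

# Made-up scores, used only for illustration.
scores = [2.0, 4.0, 4.0, 5.0, 10.0]
n = len(scores)
m = sum(scores) / n  # sample mean M, estimated as an intermediate step

deviations = [x - m for x in scores]

# The last deviation carries no new information: it is determined by the rest.
recovered_last = -sum(deviations[:-1])

# Sample variance: squared deviations divided by the degrees of freedom, N - 1.
sample_variance = sum(d * d for d in deviations) / (n - 1)

print("deviations:", deviations)
print("last deviation recovered from the others:", recovered_last)
print("sample variance:", sample_variance)
```

The result agrees with `statistics.variance`, which also divides by N-1.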




Comments: Statistics Primer: Part 4: Sampling Distributions

Gravatar Beautiful. I feel like I'm getting a refresher course in basic stats! I am now motivated to dig out my old textbook to review. You must be a good teacher, Echidne, because my professor at uni took much longer to pound this stuff into my thick skull!

Gentlewoman | 10.16.06 - 4:54 pm | #

Gravatar Are you going to deal with confidence intervals anytime soon? I'm getting noise from anti-Lancet conservatives who say very wrong things about what the Lancet's large confidence interval implies.

Alon Levy | Homepage | 10.16.06 - 6:57 pm | #

Gravatar Alon Levy, I'm almost ready to do confidence intervals. First I'm going to do a post on the normal distribution and how to use it for probabilities. Then I'm ready to do the confidence intervals and that's the last post. So one more to go first.

Echidne of the snakes | Homepage | 10.16.06 - 7:02 pm | #

Gravatar Oh, great. Thanks.

Alon Levy | Homepage | 10.16.06 - 8:50 pm | #

Gravatar I have a request for you. Please gather these excellent explanations together when you have them all done and make them accessible through a permanent link.

It would be a valuable resource for students and other people who would like to learn more about statistics.

maidhc | 10.20.06 - 3:24 am | #

Commenting by HaloScan: Statistics Primer: Part 4: Sampling Distributions


Comments: Statistics Primer. Part 5: Constructing a Confidence Interval for the Sample Proportion

Gravatar Thanks for posting this, Echidne. I didn't know how polls got their standard deviations - I'd never have guessed they just used the binomial standard deviation.

Alon Levy | Homepage | 10.20.06 - 10:18 pm | #
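
The binomial standard deviation mentioned above leads directly to the familiar poll margin of error. A minimal sketch, assuming simple random sampling, the normal approximation, and a 95% confidence level (z = 1.96); the sample size of 1000 and the function name `margin_of_error` are illustrative assumptions.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Half-width of an approximate confidence interval for a proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Assumed poll: 1000 respondents, sample proportion 0.5 (the widest case).
moe = margin_of_error(0.5, 1000)
print(f"margin of error: +/- {moe:.3f}")
```

This is where the "plus or minus about three points" attached to many polls of roughly a thousand people comes from.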

Gravatar It depends on the exact form of the question, of course. If the answers are quantitative then the normal distribution probably works exactly when making up confidence intervals for the sample mean. And note that if we are interested in the confidence interval for the difference in two sample proportions (say, the percentage liking candidate A minus the percentage liking candidate B) the standard deviation for this difference in proportions is a different-looking formula. I might talk a little about that in the last post. But the basic trick is to find the sampling distribution for whatever statistic you want to analyze and to use that sampling distribution to find the interval length which corresponds to a given confidence level, usually 0.95.

Echidne of the snakes | Homepage | 10.20.06 - 10:29 pm | #

Gravatar Correct me if I'm wrong, but isn't the standard deviation for the difference just twice the regular standard deviation, assuming only two options? If s is the proportion of voters voting for candidate A, then the proportion voting for A minus that voting for B is just s - (1 - s) = 2s-1, which has standard deviation twice that of s.

Alon Levy | Homepage | 10.20.06 - 11:22 pm | #
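
The claim above can be verified exactly from the binomial distribution, for the two-option case it describes. The sample size of 50 and the support level of 0.55 are arbitrary values picked for this sketch, and `sd` is a helper defined here.

```python
from math import comb, sqrt

# Assumed values: 50 respondents, true support for A of 0.55.
n, p = 50, 0.55

# Exact distribution of k, the number of respondents choosing A.
pmf = [(k, comb(n, k) * p**k * (1 - p) ** (n - k)) for k in range(n + 1)]

def sd(transform):
    """Standard deviation of transform(k) under the binomial distribution."""
    mean = sum(transform(k) * w for k, w in pmf)
    return sqrt(sum((transform(k) - mean) ** 2 * w for k, w in pmf))

sd_share = sd(lambda k: k / n)          # s, the share for A
sd_lead = sd(lambda k: 2 * k / n - 1)   # 2s - 1, A's share minus B's share

print(f"sd of s:      {sd_share:.5f}")
print(f"sd of 2s - 1: {sd_lead:.5f}")
```

This only covers the case where every respondent picks exactly one of the two candidates.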

Gravatar You are correct, Alon. But I had in mind a slightly different situation; one where the survey asks in one question if the respondent likes candidate A and then in another question if the respondent likes candidate B. Suppose we then use this information to make up a new variable, the difference in the proportions liking each candidate. Then the formula becomes more complicated because the two sample proportions (viewed as variables in repeated sampling now) can be correlated with each other. Intuitively, for example, someone liking a conservative would not like a liberal and vice versa. This correlation needs to be taken into account in the calculation of the standard deviation for the difference in the sample proportions, if it exists. There's more about this in the link to Wikipedia I give in the post for the margin of error topic. Scroll down to "derivation".

Echidne of the snakes | Homepage | 10.21.06 - 12:32 am | #
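
The "more complicated" formula described above follows from the general rule for the variance of a difference. In this sketch \hat{p}_A and \hat{p}_B are notation introduced here for the two sample proportions; the thread itself gives no symbols for them.

```latex
\begin{align*}
\operatorname{Var}(\hat{p}_A - \hat{p}_B)
  &= \operatorname{Var}(\hat{p}_A) + \operatorname{Var}(\hat{p}_B)
     - 2\operatorname{Cov}(\hat{p}_A, \hat{p}_B) \\
\text{If } \hat{p}_B = 1 - \hat{p}_A:\quad
\operatorname{Cov}(\hat{p}_A, \hat{p}_B) &= -\operatorname{Var}(\hat{p}_A)
\;\Rightarrow\;
\operatorname{Var}(\hat{p}_A - \hat{p}_B) = 4\operatorname{Var}(\hat{p}_A)
\end{align*}
```

The second line recovers the two-option case from the earlier comment, where the standard deviation simply doubles. When the two questions are asked separately, the covariance term has to be estimated from the data, and a negative covariance makes the difference more variable than either proportion alone.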

Echidne, any advice on what to study as a text? Your posts convinced me to drag out my old book and start from the beginning again. But that book is 30 years out of print.

My problems with opinion polling go a lot deeper than this, by the way. It begins in the question of what those being polled base their 'opinion' on, continues with how the poll is conducted and what that might tell us about the motives of the pollsters and finally, opinion isn't news and should never be used as a substitute to fill in for producing the facts. If the people being polled had more facts their opinion might be entirely different. Opinion polls have no place in news. None.

olvlzl | Homepage | 10.21.06 - 7:33 am | #

Gravatar olvlzl, in the post above this one I link to one course that is available on the net so that you can study it at your own convenience. Some other links in the earlier posts are also parts of net courses on basic statistics.

You could use your old book as the spine of the course. If you Google the keywords in a chapter you will get animation programs and simple lessons on the net. The basic parametric statistics hasn't changed that much. What's perhaps new is the large number of nonparametric methods that have been added to the toolkit. The word "parametric" refers to the fact that the kind of stuff I talk about here links the sample statistics to the probability distribution of the same measures in the population, called parameters. Like the population mean, for example. The nonparametric methods are not based on those, so they need fewer assumptions and less information about the population. On the other hand, they tend to be a bit cruder for that reason, too.

Echidne of the snakes | Homepage | 10.21.06 - 4:03 pm | #

you will get animation programs and simple lessons on the net

I don't know. That might be too racy for someone my age.

Thank you for the advice. It's such an important subject for understanding published material.

olvlzl | Homepage | 10.21.06 - 6:25 pm | #

Commenting by HaloScan: Statistics Primer. Part 5: Constructing a Confidence Interval for the Sample Proportion


Comments (6) | Trackback (0)

Posted by: echidne / 11/15/2006 07:30:00 PM

Gravatar I'm mainly interested in probability theory, but thanks anyway.

Slothrop | 11.15.06 - 7:41 pm | #

Gravatar Slothrop, probability theory is not statistics, though they go hand in hand. I haven't Googled for internet courses in probability theory, but I bet there are some.

Though most statistics courses do give an overview of probability theory, too.

Echidne of the snakes | Homepage | 11.15.06 - 7:52 pm | #

You just exactly covered the last chapter I read in my stats textbook. Thanks for the review.

Hee | 11.15.06 - 10:49 pm | #

Gravatar You're welcome, Hee. And I did it without consulting a single textbook, too... Drat. How long do you think before my brain is cleansed out?

Echidne of the snakes | Homepage | 11.15.06 - 11:02 pm | #

You should combine these into a permanent link, like Orcinus does with his tomes on fascism. The part about the limits of polling is so beautiful, it brought tears to my eyes.

olvlzl The Heretic | Homepage | 11.16.06 - 8:46 am | #

Gravatar Slothrop, what about probability theory interests you?

Alon Levy | Homepage | 11.16.06 - 11:01 pm | #

Commenting by HaloScan: Statistics Primer. Part 6: Wrapping it All Up In A Nice Package