It is one of the most common warnings when someone deals with statistics for the first time: correlation does not imply causation. That is, just because some variables appear to depend on each other, it does not mean that they really do. There are many examples of such so-called "spurious relationships" between variables, but one of the most famous is the claim that babies are delivered by storks, together with the statistical "proof" for this statement.

Here is a list of several countries with their number of storks and their birth rate:

Country        Storks   Birthrate
Albania           100          83
Belgium             1          87
Bulgaria         5000         117
Denmark             9          59
Germany          3300         901
France            140         774
Greece           2500         106
Netherlands         4         188
Italy               5         551
Austria           300          87
Poland          30000         610
Portugal         1500         120
Rumania          5000          23
Spain            8000         439
Switzerland       150          82
Turkey          25000        1576
Hungary          5000         124

If we now want to know how the birth rate and the number of storks are related, we may first calculate the correlation coefficient between them. The correlation coefficient is a value between -1 and +1 that tells us whether there is a linear relationship between two variables. A value of -1 means there is a perfect negative linear correlation, and a value of +1 means there is a perfect positive linear correlation. A value of 0 means there is no linear correlation at all, although the variables may still be related in a non-linear way. The following picture gives an overview of two variables X and Y and their correlation coefficient:

If we plot the number of storks on the X axis and the birth rate of a country on the Y axis and add a regression line, we get the following picture:

The interesting thing here is that the correlation between these values is 0.6, which is neither very high nor very low. But maybe this correlation is just a coincidence and would also appear if we used random values. To check this, we can use the p-value. The p-value tells us: if our initial assumption that there is no correlation between the birth rate and the number of storks is true, then there is only a probability of 0.8% that we would see a result at least as extreme as the one we have just received (a correlation of 0.6). In other words: the smaller the p-value, the harder it is to reconcile the data with our assumption that there is no connection between the birth rate and the number of storks. Put differently again: the relationship between the birth rate and the number of storks is statistically significant, which could lead to the conclusion that babies are delivered by storks.
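The two numbers discussed above can be approximated with a few lines of Python. This is only a sketch: the data are copied from the table above, the helper names pearson_r and permutation_p_value are made up for this example, and the p-value is estimated with a permutation test (shuffling the birth rates and counting how often a correlation at least as strong appears by chance). That is not necessarily how the 0.8% figure was obtained, so the estimate need not match it exactly.

```python
import math
import random

# (storks, birth rate) per country, copied from the table above.
DATA = {
    "Albania": (100, 83), "Belgium": (1, 87), "Bulgaria": (5000, 117),
    "Denmark": (9, 59), "Germany": (3300, 901), "France": (140, 774),
    "Greece": (2500, 106), "Netherlands": (4, 188), "Italy": (5, 551),
    "Austria": (300, 87), "Poland": (30000, 610), "Portugal": (1500, 120),
    "Rumania": (5000, 23), "Spain": (8000, 439), "Switzerland": (150, 82),
    "Turkey": (25000, 1576), "Hungary": (5000, 124),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(xs, ys, rounds=10_000, seed=0):
    """Share of random re-pairings whose |r| is at least the observed |r|."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(shuffled)  # breaks any real link between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return hits / rounds


storks = [s for s, _ in DATA.values()]
births = [b for _, b in DATA.values()]

r = pearson_r(storks, births)
p = permutation_p_value(storks, births)
print(f"r = {r:.2f}, estimated p-value = {p:.4f}")
```

The permutation approach mirrors the question asked in the text: if storks and births were paired purely at random, how often would a correlation this strong show up?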

But since we know that this is not true, what is the explanation behind this phenomenon? There are several possible explanations. First of all, there could be a "hidden" variable that influences both the birth rate and the number of storks and thereby causes the apparent connection between them. In our case, the explanation is quite simple: the more untouched the nature of a country is, the more storks tend to live there. Countries with intact nature and a lot of living space for animals tend to be poorer than industrialized countries. And poorer countries tend to have a higher birth rate.
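The effect of such a hidden variable can be reproduced with simulated data. The following sketch uses invented numbers only: a hypothetical variable called rural drives both a simulated stork count and a simulated birth rate, while the two have no direct influence on each other. The coefficients 2.0 and 1.5 and the helper names are arbitrary assumptions for illustration. Once the effect of the hidden variable is subtracted from both (by correlating the residuals of a simple least-squares fit), the apparent correlation should largely disappear.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def residuals(ys, zs):
    """What remains of ys after subtracting its least-squares fit on zs."""
    n = len(ys)
    mean_y = sum(ys) / n
    mean_z = sum(zs) / n
    slope = (sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys))
             / sum((z - mean_z) ** 2 for z in zs))
    return [(y - mean_y) - slope * (z - mean_z) for y, z in zip(ys, zs)]


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 2000

# Hypothetical hidden variable, e.g. how rural a country is (arbitrary units).
rural = [rng.gauss(0, 1) for _ in range(n)]

# Both observed variables depend on the hidden one, but not on each other.
sim_storks = [2.0 * z + rng.gauss(0, 1) for z in rural]
sim_births = [1.5 * z + rng.gauss(0, 1) for z in rural]

raw = pearson_r(sim_storks, sim_births)
adjusted = pearson_r(residuals(sim_storks, rural), residuals(sim_births, rural))
print(f"raw correlation:            {raw:.2f}")
print(f"after removing hidden part: {adjusted:.2f}")
```

In real data the hidden variable is usually not measured this cleanly, so the adjustment is rarely as tidy as in this simulation.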

With the same setup, we can also find a high correlation between shoe size and reading ability: the bigger the shoe size, the better the reading ability. Of course, this is also a spurious relationship, and the hidden variable here is age. Older children read better than younger children, and adults read better than children. Since the shoe size of adults is bigger than that of children, we see a positive correlation between shoe size and reading ability, even though the assumption that people with bigger feet can read better is plainly false.

Reference: http://www3.math.uni-paderborn.de/~agbiehler/sis/sisonline/struktur/jahrgang21-2001/heft2/Langfassungen/2001-2_Matth.pdf