Sampling Distribution of the Mean

Experiment 1

Objective: To estimate the mean height of a population using a sample.

Assume we want to know the average height of all males aged 20 to 30 in a particular city. The city has more than 800 males in that age group, but we only have the resources to measure the height of 40 of them.

Can we use the height of these 40 people to somehow estimate what the average height of the 800 people could be? This is the question that inferential statistics tries to answer.

A “sample” is a subset of data from a larger population. A population in statistics is a large, defined (but sometimes theoretical or imaginary) set of data.

When we sample data and try to infer the properties of a population, we are faced with two key considerations:

Who do we sample? Can we ask the first 40 people we meet? Should we use some sort of criteria? How do we find the “right” 40 people to ask? To answer this we need to look at “Sampling Methods”. We need the correct sampling method so that our sample is representative of the population.
Once we have measured the height of 40 people, how does the mean of this sample compare with the true mean of the whole population? How can we use the height of 40 people to estimate the average height of 800? To answer this we need to look at Inferential Statistics.

Random Sampling is a process in which each available member of the population has an equal chance of being chosen for the sample at each draw. The sample that results is called a “simple random sample”. Alternatively, the population can be divided into “strata” based on common characteristics, with a random sample then drawn from each stratum. This is called “Stratified Random Sampling”.
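
The two sampling methods can be sketched with Python's standard library. This is only an illustration: the population of 800 ID numbers, the "district" strata, the helper function names and the fixed seed are all hypothetical and do not come from a real dataset.

```python
import random

random.seed(42)  # assumption: fixed seed so the illustration is repeatable

# Hypothetical population: 800 ID numbers, each tagged with a made-up stratum.
population = [
    {"id": f"P-{i:03d}", "district": "north" if i <= 500 else "south"}
    for i in range(1, 801)
]

def simple_random_sample(people, k):
    """Draw k people without replacement; every person is equally likely."""
    return random.sample(people, k)

def stratified_random_sample(people, k, key):
    """Split people into strata by `key`, then draw from each stratum
    in proportion to its share of the population."""
    strata = {}
    for person in people:
        strata.setdefault(person[key], []).append(person)
    sample = []
    for members in strata.values():
        # Rounding can make the total differ slightly from k in general.
        share = round(k * len(members) / len(people))
        sample.extend(random.sample(members, share))
    return sample

srs = simple_random_sample(population, 40)
strat = stratified_random_sample(population, 40, "district")

print("Simple random sample size:", len(srs))
print("Stratified sample size:   ", len(strat))
```

In the stratified version each stratum contributes in proportion to its size, so a small group cannot be missed by chance. Other allocation rules exist; proportional allocation is simply the easiest to show.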

Data Quality can matter more than Data Quantity when making an estimate based on a sample. Data Quality refers to the completeness, consistency of format, cleanliness, accuracy and representativeness of the data. A sample has to be representative of the population it comes from. Only then can any inferences drawn from the sample be useful.

ID_Number	Height_cm
P-001	183.0
P-002	180.2
P-003	194.9
P-004	182.3
P-005	181.8
P-006	175.3
P-007	174.9
P-008	185.6
P-009	171.8
P-010	171.1
P-011	169.3
P-012	173.9
P-013	169.1
P-014	177.2
P-015	175.1
P-016	177.2
P-017	181.3
P-018	169.7
P-019	180.6
P-020	182.5
P-021	185.2
P-022	184.2
P-023	181.8
P-024	169.8
P-025	165.0
P-026	167.3
P-027	171.0
P-028	179.9
P-029	168.9
P-030	174.8
P-031	176.0
P-032	175.8
P-033	182.2
P-034	176.6
P-035	191.7
P-036	180.0
P-037	167.1
P-038	171.2
P-039	178.1
P-040	156.0

Let us try to calculate the average height in the group.

Sample Mean height: 176.49 cm

Sample SD: 7.46 cm
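
The calculation can be reproduced with a few lines of Python. The heights are copied from the table above. Whether the reported SD uses the n - 1 denominator is an assumption on our part; `statistics.stdev` does use it.

```python
import statistics

# Heights (cm) of the 40 people measured in Experiment 1, copied from the table.
heights_cm = [
    183.0, 180.2, 194.9, 182.3, 181.8, 175.3, 174.9, 185.6, 171.8, 171.1,
    169.3, 173.9, 169.1, 177.2, 175.1, 177.2, 181.3, 169.7, 180.6, 182.5,
    185.2, 184.2, 181.8, 169.8, 165.0, 167.3, 171.0, 179.9, 168.9, 174.8,
    176.0, 175.8, 182.2, 176.6, 191.7, 180.0, 167.1, 171.2, 178.1, 156.0,
]

# Arithmetic mean: sum of the values divided by how many there are.
sample_mean = statistics.mean(heights_cm)

# Sample standard deviation (n - 1 denominator, an assumption about the
# convention used for the figure reported in the text).
sample_sd = statistics.stdev(heights_cm)

print(f"n = {len(heights_cm)}")
print(f"Sample mean height: {sample_mean:.2f} cm")
print(f"Sample SD: {sample_sd:.2f} cm")
```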

We can now see that the average height of the people in this sample is 176.49 cm.

Are we sure that the true mean in the population (the value we really want to know) will be exactly this number? Unlikely. Can it be close to this number? How do we resolve this?

Experiment 2

Repeat the same experiment.
Take another 40 people.
Calculate the mean of the second sample.

These are the results you find:

ID_Number	Height_cm
P-041	183.0
P-042	180.2
P-043	194.9
P-044	182.3
P-045	181.8
P-046	175.3
P-047	174.9
P-048	185.6
P-049	171.8
P-050	171.1
P-051	169.3
P-052	173.9
P-053	169.1
P-054	177.2
P-055	175.1
P-056	177.2
P-057	181.3
P-058	169.7
P-059	180.6
P-060	182.5
P-061	185.2
P-062	184.2
P-063	181.8
P-064	169.8
P-065	165.0
P-066	167.3
P-067	171.0
P-068	179.9
P-069	168.9
P-070	174.8
P-071	176.0
P-072	175.8
P-073	182.2
P-074	176.6
P-075	191.7
P-076	180.0
P-077	167.1
P-078	171.2
P-079	178.1
P-080	156.0

Sample Mean height of second set of values: 173.71 cm

Sample SD: 7.46 cm

When we repeat the experiment we get a mean of 173.71 cm, which is different from the 176.49 cm of the first experiment. How do we interpret this? Is the mean of the population likely to be higher or lower? Let us repeat one more time.

Experiment 3

Repeat the same experiment a third time.
Measure the height of another 40 people.
Calculate the mean of the third sample.

ID_Number	Height_cm
P-081	180.4
P-082	185.7
P-083	178.2
P-084	188.8
P-085	170.0
P-086	171.6
P-087	181.0
P-088	166.5
P-089	169.4
P-090	189.0
P-091	173.6
P-092	176.3
P-093	170.8
P-094	177.8
P-095	176.7
P-096	171.4
P-097	184.6
P-098	171.8
P-099	178.0
P-100	183.9
P-101	162.0
P-102	183.5
P-103	174.5
P-104	164.9
P-105	180.1
P-106	155.7
P-107	177.6
P-108	162.8
P-109	186.9
P-110	183.4
P-111	168.4
P-112	175.7
P-113	176.9
P-114	163.2
P-115	170.5
P-116	184.0
P-117	183.5
P-118	171.2
P-119	177.1
P-120	165.9

Sample Mean height of third set of values: 175.33 cm

Sample SD: 8.04 cm

We have now repeated the experiment a third time and obtained a third mean. It is similar to the first two but not identical. What can we do to resolve this?

The solution we often apply in statistics is to create a new dataset that contains the mean of each of these experiments. It would look something like this.

Table containing the mean from each repetition of the experiment.

Experiment	Mean Height
Experiment 1	176.49
Experiment 2	173.71
Experiment 3	175.33
………..	………….
Experiment “n”	173.55

A genuine drawback of this method is a practical one: we usually do not have the money and resources to repeat the experiment many times.

However, if one were to repeat the experiment “n” times and plot the mean of each experiment as a histogram, one would likely get a figure like this.

This brings us to an important concept of the SAMPLING DISTRIBUTION.

Note that the Sampling Distribution is very different from the “Sample Distribution”. It does not refer to the data in a single sample. It refers to the distribution of the means obtained if the same experiment were repeated n times. The centre of that distribution is a “Mean of Means”. This is an important distinction to make.

What is the Sampling Distribution of the Mean?

Sampling Distribution of the Mean
The distribution obtained when the same experiment is (hypothetically) repeated n times, the mean is collected from each experiment, and the means are plotted as a histogram.
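
The hypothetical repetition described here is cheap to simulate. The sketch below assumes an invented population of 800 heights drawn from a normal distribution centred on 175 cm with a spread of 7.5 cm; those numbers, the seed, the function name and the 1000 repetitions are assumptions for illustration, not values from the experiments above.

```python
import random
import statistics

random.seed(0)  # assumption: fixed seed so the sketch is repeatable

# Hypothetical population of 800 heights (cm). Centre and spread are invented.
population = [random.gauss(175, 7.5) for _ in range(800)]
population_mean = statistics.mean(population)

def run_experiment(people, sample_size=40):
    """One 'experiment': measure a simple random sample and return its mean."""
    return statistics.mean(random.sample(people, sample_size))

# Repeat the experiment many times and keep only the mean from each run.
n_experiments = 1000
sample_means = [run_experiment(population) for _ in range(n_experiments)]

# Crude text histogram of the sample means, in 1 cm bins.
bins = {}
for m in sample_means:
    bins[int(m)] = bins.get(int(m), 0) + 1
for lower in sorted(bins):
    print(f"{lower}-{lower + 1} cm | {'#' * (bins[lower] // 10)}")

print(f"Population mean:      {population_mean:.2f} cm")
print(f"Mean of sample means: {statistics.mean(sample_means):.2f} cm")
```

The list `sample_means` plays the role of the table of experiment means, and the printed histogram is a rough stand-in for the figure.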

We now have to make a very important distinction from another similar-sounding entity, the “Sample Distribution”.

Sample Distribution

The distribution of the individual data points present in a single sample taken from the population. Please keep in mind that the “Sampling Distribution of the Mean” is a very different entity from a “Sample Distribution”.

In Data Science we generally take the mean of the sampling distribution to be the true mean of the population from which the samples were taken, that is, the Population Mean.

Some Observations from the above figure

Notice how the value of our first experiment is near the population mean but not exactly equal to it.
Notice that the distribution of the means (the sampling distribution) is itself approximately normally distributed. This tendency of the sampling distribution of the mean (given the right conditions) to approach a normal distribution as the sample size increases, even if the underlying data are not normally distributed, is called the “Central Limit Theorem”. A common rule of thumb is that it applies when each sample contains more than 30 observations.
Notice the lines analogous to the standard deviation in this sampling distribution. While the dispersion of the data in a sample distribution is measured by the standard deviation, the corresponding measure for the Sampling Distribution is called the Standard Error.
The range from -2 to +2 Standard Errors around the mean is sometimes referred to as a 95% Confidence Interval.
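
The last two observations can be checked numerically. This is a sketch under assumptions: the population of 800 heights is invented (normal, centred on 175 cm, spread 7.5 cm), and the standard error is measured directly as the standard deviation of 2000 simulated sample means rather than estimated from a single sample.

```python
import random
import statistics

random.seed(1)  # assumption: fixed seed so the sketch is repeatable

# Hypothetical population of 800 heights (cm); the parameters are invented.
population = [random.gauss(175, 7.5) for _ in range(800)]

# Means of 2000 hypothetical repetitions of the 40-person experiment.
sample_means = [
    statistics.mean(random.sample(population, 40)) for _ in range(2000)
]

# Centre and spread of the sampling distribution.
centre = statistics.mean(sample_means)
standard_error = statistics.stdev(sample_means)  # SD of the sample means

# The band from -2 to +2 standard errors around the centre.
lower = centre - 2 * standard_error
upper = centre + 2 * standard_error

# Fraction of the simulated sample means that fall inside that band.
inside = sum(lower <= m <= upper for m in sample_means)
coverage = inside / len(sample_means)

print(f"Standard error: {standard_error:.2f} cm")
print(f"-2 SE to +2 SE: {lower:.2f} cm to {upper:.2f} cm")
print(f"Share of sample means inside the band: {coverage:.1%}")
```

If the sampling distribution is close to normal, the printed share should be close to 95%, which is where the name of the interval comes from.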

Some Caveats: It is often assumed that data points in large samples are “normally distributed”. This is in fact not true. The data points in a sample are not necessarily normally distributed; skewed and heavy-tailed distributions are common in practice. It is in fact the Sampling Distribution that is often (but not always) normally distributed. The Central Limit Theorem rests on several assumptions that are often violated in the real world, so its applicability is not universal. We can discuss this further in future tutorials.
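
To see the difference between the two distributions, the sketch below uses a deliberately skewed population. The exponential population, its size, the seed and the simple moment-based `skewness` helper are all assumptions chosen for illustration. A skewness near 0 indicates a roughly symmetric shape.

```python
import random
import statistics

random.seed(2)  # assumption: fixed seed so the sketch is repeatable

# Hypothetical right-skewed population (exponential, mean of about 10).
population = [random.expovariate(1 / 10) for _ in range(10_000)]

def skewness(values):
    """Moment-based skewness: about 0 for a symmetric distribution."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Means of 2000 simple random samples of 40 observations each.
sample_means = [
    statistics.mean(random.sample(population, 40)) for _ in range(2000)
]

print(f"Skewness of the raw data:     {skewness(population):.2f}")
print(f"Skewness of the sample means: {skewness(sample_means):.2f}")
```

In this well-behaved case the sample means are far more symmetric than the raw data. It does not show that the same holds for every population, in line with the caveats above.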

The above scenario is obviously hypothetical. In the real world we cannot afford to repeat our experiment many times to derive the population mean, standard error and confidence intervals. In the next tutorial we will see how we can estimate these values using just one sample and the power of statistics.