Statistical inference

In statistics, a population refers to the entire group of individuals or items that we are interested in studying. However, collecting data from the whole population is often impractical due to size, accessibility, cost, or ethical concerns. Instead, we gather data from a sample, which is a smaller, manageable subset of the population.

The process of using sample data to draw conclusions about the entire population is called statistical inference. For this process to be reliable, the sample must accurately represent the population it comes from.

Random samples

A random sample is one in which every member of the population has an equal chance of being selected. Random sampling reduces bias and makes it more likely that the sample will reflect the characteristics of the population. When a sample is random, each possible sample of a given size is equally likely, and the proportions in the sample tend to be similar to those in the population, allowing for valid and generalisable conclusions.

To describe and compare populations and samples, we use specific measures:

Population proportion: \(p=\dfrac{\text{number in population with attribute}}{\text{population size }(P)}\)
Sample proportion: \(\hat{p}=\dfrac{\text{number in sample with attribute}}{\text{sample size }(n)}\)

Here, \(p\) is a parameter, a fixed (but often unknown) value that describes the population, while \(\hat{p}\) is a statistic, a value calculated from sample data used to estimate the parameter.

To understand how we make inferences from a sample, it's important to recognise that sample statistics, such as the sample proportion or sample mean, are themselves random variables that vary from sample to sample.

Just as we define a population proportion \(p\) and a sample proportion \(\hat{p}\), we can also compare the population and sample means:

Population mean: \(\mu=\dfrac{\text{sum of the data values in the population}}{\text{population size }(P)}\)
Sample mean: \(\bar{x}=\dfrac{\text{sum of the data values in the sample}}{\text{sample size }(n)}\)

Since samples differ depending on which individuals are selected, both \(\hat{p}\)  and \(\bar{x}\) can vary. This variability is the foundation of statistical inference and leads to the idea of a sampling distribution.
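The sample-to-sample variability described above can be seen in a short simulation. This is only a sketch: the population of 1000 individuals with 600 having the attribute, the sample size of 50, and the helper name `sample_proportion` are all assumptions made for illustration, not values from the text.

```python
import random

def sample_proportion(population, n, rng):
    """Draw a simple random sample of size n (without replacement) and
    return the proportion of sampled items that have the attribute (True)."""
    sample = rng.sample(population, n)
    return sum(sample) / n

# Hypothetical population: 1000 individuals, 600 of whom have the attribute.
population = [True] * 600 + [False] * 400
p = sum(population) / len(population)   # population proportion (the parameter)

rng = random.Random(1)                  # fixed seed so the run is repeatable
p_hats = [sample_proportion(population, 50, rng) for _ in range(5)]

print("population proportion p =", p)
print("sample proportions:", p_hats)
```

Each run of five samples gives five different values of \(\hat{p}\) scattered around the fixed value of \(p\), which is the behaviour a sampling distribution describes.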

Worked Example

A bag contains \(10\) marbles, \(6\) green and \(4\) blue. If \(4\) marbles are randomly selected from the bag without replacement, generate a sampling distribution for the number of green marbles in the sample.

The number of green marbles in the sample is a random variable that can take values from \(0\) to \(4\), depending on the outcome.

We can also express this using the sample proportion \(\hat{p}\), which represents the proportion of green marbles in the sample. Since the sample size is \(4\), the possible values of \(\hat{p}\) are:

\(\hat{p}=0, \frac{1}{4}, \frac{1}{2}, \frac{3}{4}, 1\)

To build the sampling distribution of \(\hat{p}\), we calculate the probability of each possible outcome using combinations. For example, for \(2\) green marbles, we count the number of ways of choosing \(2\) green marbles from \(6\) and \(2\) blue marbles from \(4\), and divide by the total number of possible samples of \(4\) marbles from \(10\):

\(\Pr\left(\hat{P}=\frac{1}{2}\right)= \dfrac{\binom{6}{2}\times\binom{4}{2}}{\binom{10}{4}}=\dfrac{15\times6}{210}=\dfrac{90}{210}\)

The table below summarises the sampling distribution of \(\hat{p}\):

\[\begin{array}{l|ccccc}
\text{Number of green marbles in the sample} & 0 & 1 & 2 & 3 & 4 \\ \hline
\text{Proportion of green marbles in the sample, } \hat{p} & 0 & \dfrac{1}{4} & \dfrac{1}{2} & \dfrac{3}{4} & 1 \\
\Pr\left(\hat{P}=\hat{p}\right) & \dfrac{1}{210} & \dfrac{24}{210} & \dfrac{90}{210} & \dfrac{80}{210} & \dfrac{15}{210}
\end{array}\]
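The counting argument used in this worked example can be checked with a short script. This is a minimal sketch using Python's standard `math.comb`; the function name `green_sampling_distribution` is a hypothetical helper introduced for illustration.

```python
from math import comb

def green_sampling_distribution(green, blue, n):
    """Return rows (k, p_hat, ways, total) for drawing n marbles without
    replacement, where k is the number of green marbles in the sample."""
    total = comb(green + blue, n)          # number of possible samples
    rows = []
    for k in range(n + 1):
        # choose k of the green marbles and n - k of the blue marbles
        ways = comb(green, k) * comb(blue, n - k)
        rows.append((k, k / n, ways, total))
    return rows

rows = green_sampling_distribution(green=6, blue=4, n=4)
for k, p_hat, ways, total in rows:
    print(f"{k} green  p_hat={p_hat:.2f}  Pr={ways}/{total}")
```

The numerators produced are 1, 24, 90, 80 and 15 over 210, and they sum to 210, so the probabilities add to 1.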

The mean and standard deviation of the sample proportion

To describe the behaviour of the sample proportion, we can calculate its mean and standard deviation. These values give insight into the long-run average of \(\hat{p}\) and how much it is expected to vary from sample to sample.

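When sampling from a small population, the sampling distribution of \(\hat{p}\) can be calculated exactly. In this case, we use the expected value formula for the mean, and a two-step process for the standard deviation: first find \(\mathrm{E}\left(\hat{P}^2\right)\), then subtract the square of the mean and take the square root: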

\[\begin{align}\mathrm{E}\left(\hat{P}\right)&=\sum \hat{p}\cdot \Pr{\left(\hat{P}=\hat{p}\right)}\\
\mathrm{sd}\left(\hat{P}\right)&=\sqrt{\mathrm{E}\left({\hat{P}}^2\right)-\left(\mathrm{E}\left(\hat{P}\right)\right)^2}\end{align}\]

For example, the mean and standard deviation of the following small-population sampling distribution are calculated below:

\[\begin{array}{l|ccccc}
\hat{p} & 0 & \frac{1}{4} & \frac{1}{2} & \frac{3}{4} & 1 \\ \hline
\Pr\left(\hat{P}=\hat{p}\right) & 0.031 & 0.172 & 0.356 & 0.328 & 0.113
\end{array}\]

\[\begin{align} \mathrm{E}\left(\hat{P}\right)&=0\times0.031+\frac{1}{4}\times0.172+\frac{1}{2}\times0.356+\frac{3}{4}\times0.328+1\times0.113 \\
&=0.58\\
\mathrm{sd}\left(\hat{P}\right)&=\sqrt{\mathrm{E}\left({\hat{P}}^2\right)-\left(\mathrm{E}\left(\hat{P}\right)\right)^2} \\
&=\sqrt{0\times0.031+\frac{1}{16}\times0.172+\frac{1}{4}\times0.356+\frac{9}{16}\times0.328+1\times0.113-{0.58}^2} \\
&=0.247
\end{align} \]
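The same two-step calculation can be written as a short function. This is a sketch only; `mean_and_sd` is a hypothetical helper name, and the values and probabilities are copied from the small-population example above.

```python
from math import sqrt

def mean_and_sd(values, probs):
    """Mean and standard deviation of a discrete distribution, using
    sd = sqrt(E(X^2) - (E(X))^2)."""
    mean = sum(v * q for v, q in zip(values, probs))
    mean_of_squares = sum(v ** 2 * q for v, q in zip(values, probs))
    return mean, sqrt(mean_of_squares - mean ** 2)

p_hat_values = [0, 0.25, 0.5, 0.75, 1]
probabilities = [0.031, 0.172, 0.356, 0.328, 0.113]

mean, sd = mean_and_sd(p_hat_values, probabilities)
print(round(mean, 3), round(sd, 3))
```

Rounded to three decimal places, this reproduces the values 0.58 and 0.247 found by hand.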

When sampling from a large population, we can approximate the sampling distribution using a formula based on the binomial model:

\[\begin{align}\mathrm{E}\left(\hat{P}\right)&=p\\
\mathrm{sd}\left(\hat{P}\right)&=\sqrt{\frac{p\left(1-p\right)}{n}}\end{align}\]

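These formulas assume random sampling and a large enough population that the selections are approximately independent. For example, for samples of size \(n=5\) with \(p=0.7\), where \(x\) is the number in the sample with the attribute: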

\[\begin{array}{l|cccccc}
x & 0 & 1 & 2 & 3 & 4 & 5 \\ \hline
\hat{p} & 0 & \frac{1}{5} & \frac{2}{5} & \frac{3}{5} & \frac{4}{5} & 1 \\
\Pr\left(\hat{P}=\hat{p}\right) & 0.002 & 0.028 & 0.132 & 0.309 & 0.360 & 0.168
\end{array}\]

\[\begin{align} \mathrm{E}\left(\hat{P}\right)&=p=0.7 \\ \mathrm{sd}\left(\hat{P}\right)&=\sqrt{\frac{p\left(1-p\right)}{n}}=\sqrt{\frac{0.7\times0.3}{5}}\\ &=0.205 \end{align} \]
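As a check on the formulas, the sketch below builds the binomial-based distribution of \(\hat{P}\) for \(n=5\) and \(p=0.7\) and compares the mean and standard deviation computed directly from the distribution with the formula values. The function name `proportion_distribution` is an assumption made for illustration.

```python
from math import comb, sqrt

def proportion_distribution(n, p):
    """List of (p_hat, probability) pairs, assuming the number of
    successes in the sample follows a binomial(n, p) distribution."""
    return [(x / n, comb(n, x) * p ** x * (1 - p) ** (n - x))
            for x in range(n + 1)]

n, p = 5, 0.7
dist = proportion_distribution(n, p)

# Mean and sd computed directly from the distribution
mean = sum(p_hat * prob for p_hat, prob in dist)
sd_direct = sqrt(sum(p_hat ** 2 * prob for p_hat, prob in dist) - mean ** 2)

# sd from the formula sqrt(p(1 - p)/n)
sd_formula = sqrt(p * (1 - p) / n)

print([round(prob, 3) for _, prob in dist])
print(round(mean, 3), round(sd_direct, 3), round(sd_formula, 3))
```

The two standard deviations agree, which is why the formula can replace the longer table-based calculation whenever the binomial model applies.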

Approximating the distribution of the sample proportion

When the sample size \(n\) is large, the sample proportion \(\hat{P}\) has an approximately normal distribution with mean \(\mu=p\) and standard deviation \( \sigma=\sqrt{\frac{p\left(1-p\right)}{n}}\). This allows us to apply techniques from normal probability to analyse sample proportions, even though they originate from a binomial setting.

To use the normal approximation reliably, the sample size must be large enough to ensure the binomial distribution is not too skewed. This is checked using the conditions:

\(n\times p>5 \text{ and } n\left(1-p\right)>5\)

If both of these conditions are satisfied, the distribution of \(\hat{P}\) is approximately normal, and we can use the normal model to estimate probabilities or construct confidence intervals.
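The condition check and a normal-model probability can be sketched with Python's standard `statistics.NormalDist`. The values \(n=100\), \(p=0.4\) and the cut-off \(0.45\) are assumptions chosen for illustration, as are the two function names.

```python
from math import sqrt
from statistics import NormalDist

def normal_approx_ok(n, p):
    """Check the conditions n*p > 5 and n*(1 - p) > 5."""
    return n * p > 5 and n * (1 - p) > 5

def prob_sample_proportion_at_most(n, p, x):
    """Approximate Pr(P_hat <= x) with a normal distribution that has
    mean p and standard deviation sqrt(p(1 - p)/n)."""
    if not normal_approx_ok(n, p):
        raise ValueError("normal approximation conditions are not met")
    sd = sqrt(p * (1 - p) / n)
    return NormalDist(mu=p, sigma=sd).cdf(x)

# Hypothetical example: n = 100, p = 0.4, so n*p = 40 and n*(1 - p) = 60
n, p = 100, 0.4
conditions_met = normal_approx_ok(n, p)
prob = prob_sample_proportion_at_most(n, p, 0.45)

print(conditions_met, round(prob, 3))
```

Raising an error when the conditions fail is a design choice in this sketch; it prevents the normal model being used silently where the binomial distribution is too skewed.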

Confidence intervals

A confidence interval provides a range of values within which the true population proportion \(p\) is likely to lie. It is based on a sample proportion \(\hat{p}\), and is used when we want to estimate an unknown population parameter with a stated level of confidence. For example, a 95% confidence interval is constructed by a method that captures the true value of \(p\) in about 95% of repeated random samples.

The general form of a confidence interval for a population proportion is:

\[\left(\hat{p}-z\sqrt{\frac{\hat{p}\left(1-\hat{p}\right)}{n}},\ \hat{p}+z\sqrt{\frac{\hat{p}\left(1-\hat{p}\right)}{n}}\right)\]
Where:

  • \(\hat{p}\) is the sample proportion,
  • \(n\) is the sample size,
  • \(z\) is the \(z\)-score corresponding to the desired confidence level (the area under the normal curve).

The confidence interval can be made narrower by increasing the sample size \(n\), because the term \(\sqrt{\frac{\hat{p}\left(1-\hat{p}\right)}{n}}\) decreases as \(n\) increases.
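The interval formula can be applied as in the sketch below. The sample of 120 successes out of 400 is a made-up example, `confidence_interval` is a hypothetical helper, and the rounded \(z\)-scores 1.65, 1.96 and 2.58 are the commonly quoted values for 90%, 95% and 99% confidence.

```python
from math import sqrt

# Rounded z-scores for common confidence levels (per cent)
Z_SCORES = {90: 1.65, 95: 1.96, 99: 2.58}

def confidence_interval(successes, n, level=95):
    """Approximate confidence interval for a population proportion:
    p_hat -/+ z * sqrt(p_hat(1 - p_hat)/n)."""
    p_hat = successes / n
    z = Z_SCORES[level]
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Hypothetical sample: 120 of 400 sampled individuals have the attribute
lower, upper = confidence_interval(120, 400, level=95)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

For this sample \(\hat{p}=0.3\), and the 95% interval works out to roughly \((0.255, 0.345)\). Calling the function with `level=99` gives a wider interval, and `level=90` a narrower one.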

[Figure: Three side-by-side shaded normal distribution curves, each with a vertical dashed line indicating the mean. From left to right, they show \(z=2.58\) for a 99% confidence interval, \(z=1.96\) for a 95% confidence interval, and \(z=1.65\) for a 90% confidence interval.]

Margin of error

The margin of error (\(M\)) is a measure of the precision of a sample estimate. It represents the maximum expected difference between the sample proportion \(\hat{p}\) and the true population proportion \(p\), within a given level of confidence.

In the context of confidence intervals, the margin of error is the distance from the sample proportion to each endpoint of the interval. It determines how wide the confidence interval is and reflects the uncertainty due to sampling.

The margin of error is calculated using the formula:

\[M=z\sqrt{\frac{\hat{p}\left(1-\hat{p}\right)}{n}}\]

Where:

  • \(\hat{p}\) is the sample proportion,
  • \(n\) is the sample size,
  • \(z\) is the \(z\)-score corresponding to the chosen confidence level.
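To see how the margin of error responds to the sample size, the sketch below evaluates the formula for several values of \(n\). The values \(\hat{p}=0.3\) and \(z=1.96\) are assumptions chosen for illustration, and `margin_of_error` is a hypothetical helper name.

```python
from math import sqrt

def margin_of_error(p_hat, n, z):
    """M = z * sqrt(p_hat(1 - p_hat)/n)."""
    return z * sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical sample proportion and 95% z-score
p_hat, z = 0.3, 1.96

for n in (100, 400, 1600):
    print(n, round(margin_of_error(p_hat, n, z), 4))
```

Because \(n\) sits under a square root, multiplying the sample size by 4 halves the margin of error, so extra precision becomes increasingly expensive in terms of sample size.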