
# Data analysis

## Mean

Information in large data sets can be summarised by statistical indicators, such as the mean, the median and the standard deviation.

The mean is the most common indicator. The mean can only be calculated for numerical values.

You can compute the mean of the height of your friends but not of their favourite colour.

The mean is the sum of the values divided by their number.

$$\Torange{\text{Mean}} = \frac{\Tblue{\text{Sum of values}}}{\Tred{\text{Number of values}}}$$

The mean for the data $\Tblue{1}$, $\Tblue{2}$, $\Tblue{2}$, $\Tblue{2}$, $\Tblue{3}$, $\Tblue{4}$, $\Tblue{5}$, $\Tblue{5}$ is $$\Torange{\text{Mean}} = \frac{\Tblue{1} + \Tblue{2} +\Tblue{2} + \Tblue{2} + \Tblue{3} + \Tblue{4} + \Tblue{5} + \Tblue{5}}{\Tred{8}} = \frac{\Tblue{24}}{\Tred{8}} = \Torange{3}$$

The symbol for the mean of data $x$ is $\Torange{\bar{x}}$.

Using the frequency $\Tred{f}$ of the data, the formula becomes $$\Torange{\bar{x}} = \frac{\sum \Tred{f}\Tblue{x}}{\sum \Tred{f}}.$$

The table of frequency for the data above is as follows:

| Data | Frequency |
|---|---|
| $\Tblue{1}$ | $\Tred{1}$ |
| $\Tblue{2}$ | $\Tred{3}$ |
| $\Tblue{3}$ | $\Tred{1}$ |
| $\Tblue{4}$ | $\Tred{1}$ |
| $\Tblue{5}$ | $\Tred{2}$ |

The computation of the mean becomes $$\Torange{\text{Mean}} = \frac{\Tred{1}\times\Tblue{1} + \Tred{3}\times \Tblue{2} + \Tred{1}\times\Tblue{3}+ \Tred{1}\times\Tblue{4}+ \Tred{2}\times\Tblue{5}}{\Tred{1}+\Tred{3}+\Tred{1}+\Tred{1}+\Tred{2}} = \frac{\Tblue{24}}{\Tred{8}} = \Torange{3}$$
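As an illustration, here is a minimal Python sketch of both computations for the data above. The function names `mean` and `mean_from_frequencies` are chosen for this example only; they are not part of the notes.

```python
from collections import Counter

data = [1, 2, 2, 2, 3, 4, 5, 5]

def mean(values):
    # Sum of the values divided by their number.
    return sum(values) / len(values)

def mean_from_frequencies(freq):
    # freq maps each value x to its frequency f.
    total = sum(f * x for x, f in freq.items())  # sum of f * x
    count = sum(freq.values())                   # sum of f
    return total / count

frequencies = Counter(data)  # builds the frequency table from the raw data

print(mean(data))                          # 3.0
print(mean_from_frequencies(frequencies))  # 3.0
```

Both approaches give the same result because the frequency table is only a compact way of writing the same list of values.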

The mean is the average value of a data set.

## Median and mode

The mean is the most common indicator of central tendency for a dataset. But other indicators are useful too.

The median is the value that splits an ordered data set in half. Half of the values in a data set are higher than the median and half of the values are lower.

• The median is the middle value of an ordered data set when the number of values is odd.

The median for the data set $\Tblue{1}$, $\Tblue{2}$, $\Tblue{5}$ is $\Tbrown{2}$.

• The median is the mean of the two values in the middle of an ordered data set when the number of values in the data set is even.

The median for the data set $\Tblue{1}$, $\Tblue{2}$, $\Tblue{3}$, $\Tblue{5}$ is $\Tbrown{2.5} = (\Tblue{2} + \Tblue{3})/2$.

Like the mean, the median is defined only for numerical data.

In 2011, the median US household income was $\Tbrown{\$50{,}054}$ per annum. Half of the households had an income lower than $\Tbrown{\$50{,}054}$ and half had an income that was higher.

The mode is the most frequent item. It can apply to non-numerical data. A data set can have several modes. When the data are grouped into classes, the mode is called the modal class.

• The mode for the data set $\Tblue{1}$, $\Tblue{1}$, $\Tblue{2}$, $\Tblue{5}$ is $\Tpink{1}$.
• The modes for the data set $\Tblue{1}$, $\Tblue{2}$, $\Tblue{2}$, $\Tblue{5}$, $\Tblue{5}$ are $\Tpink{2}$ and $\Tpink{5}$.
• The mode for the non-numerical data set cat, cat, dog, dog, dog, hamster, parrot is dog.
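The examples above can be checked with Python's standard `statistics` module. This is only a sketch: `median` and `multimode` are standard-library functions (`multimode` needs Python 3.8 or later), and the variable names are chosen for illustration.

```python
from statistics import median, multimode

# Median: middle value (odd count) or mean of the two middle values (even count).
odd_median = median([1, 2, 5])
even_median = median([1, 2, 3, 5])

# Mode: multimode returns every most-frequent item, in order of first appearance.
single_mode = multimode([1, 1, 2, 5])
two_modes = multimode([1, 2, 2, 5, 5])
pet_mode = multimode(["cat", "cat", "dog", "dog", "dog", "hamster", "parrot"])

print(odd_median)   # 2
print(even_median)  # 2.5
print(single_mode)  # [1]
print(two_modes)    # [2, 5]
print(pet_mode)     # ['dog']
```

`multimode` returns a list because a data set can have several modes.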

The mean, the median and the mode

## Standard deviation

The standard deviation measures the dispersion of the data around the mean. It is given by the formula \begin{align*} \Tviolet{\sigma} &= \sqrt{\frac{\vphantom{|}\text{Sum of squared differences to mean}}{\Tred{\text{Number of data}}}}\\ &= \sqrt{\frac{\sum (\Tblue{x} - \Torange{\bar{x}})^2}{\Tred{n}}} = \sqrt{\frac{\sum\Tblue{x}^2}{\Tred{n}} - \Torange{\bar{x}}^2} \end{align*}

The symbol $\Tviolet{\sigma}$ for the standard deviation is the Greek letter sigma.

You can use either of the last two formulas for $\Tviolet{\sigma}$.

Consider the data $\Tblue{1}$, $\Tblue{2}$, $\Tblue{2}$, $\Tblue{5}$. The number $\Tred{n}$ of data points is $\Tred{4}$ and the average is $\Torange{\bar{x}} = \Tblue{10}/\Tred{4} = \Torange{2.5}$. The standard deviation is \begin{align*} \Tviolet{\sigma} &= \sqrt{\frac{ (\Tblue{1}-\Torange{2.5})^2 + (\Tblue{2}-\Torange{2.5})^2 + (\Tblue{2}-\Torange{2.5})^2 + (\Tblue{5}-\Torange{2.5})^2}{\Tred{4}}}\\ &= \sqrt{\frac{ (-1.5)^2 + (-0.5)^2 + (-0.5)^2 + (2.5)^2}{\Tred{4}}}\\ &= \sqrt{\frac{9}{\Tred{4}}} =\frac{3}{2} = \Tviolet{1.5} \end{align*} We used the first formula. The second formula reads \begin{align*} \Tviolet{\sigma} &= \sqrt{\frac{ 1^2 + 2^2 + 2^2 + 5^2}{\Tred{4}} - 2.5^2}\\ &= \sqrt{\frac{1+4+4+25 - 25}{\Tred{4}}}= \sqrt{\frac{9}{\Tred{4}}} =\frac{3}{2} = \Tviolet{1.5} \end{align*}

Using the frequency $\Tred{f}$ of the data, the formula becomes $$\Tviolet{\sigma} = \sqrt{\frac{\sum \Tred{f}(\Tblue{x} - \Torange{\bar{x}})^2}{\sum \Tred{f}}} = \sqrt{\frac{\sum \Tred{f}\Tblue{x}^2}{\sum \Tred{f}} - \Torange{\bar{x}}^2}$$
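The following Python sketch checks that the formulas agree on the data $1$, $2$, $2$, $5$. The helper names are hypothetical, chosen for this example; `pstdev` is the standard library's population standard deviation, which divides by $n$ as the formulas here do.

```python
from math import sqrt
from statistics import pstdev

data = [1, 2, 2, 5]

def sigma_from_deviations(values):
    # First formula: root of the mean squared difference to the mean.
    n = len(values)
    m = sum(values) / n
    return sqrt(sum((x - m) ** 2 for x in values) / n)

def sigma_from_squares(values):
    # Second formula: root of (mean of the squares minus square of the mean).
    n = len(values)
    m = sum(values) / n
    return sqrt(sum(x * x for x in values) / n - m ** 2)

def sigma_from_frequencies(freq):
    # Frequency version: freq maps each value x to its frequency f.
    n = sum(freq.values())
    m = sum(f * x for x, f in freq.items()) / n
    return sqrt(sum(f * (x - m) ** 2 for x, f in freq.items()) / n)

print(sigma_from_deviations(data))                  # 1.5
print(sigma_from_squares(data))                     # 1.5
print(sigma_from_frequencies({1: 1, 2: 2, 5: 1}))   # 1.5
print(pstdev(data))                                 # 1.5
```

With floating-point data the second formula can lose precision when the mean is large compared with the spread, so the first formula is usually the safer choice in code.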

Another measure of the dispersion of the data is the range. It is the difference between the highest and the lowest values in the data.

The range for the data $\Tblue{3}$, $\Tblue{1}$, $\Tblue{5}$, $\Tblue{2}$ is $$\Tlightgreen{\text{Range}} = \Tblue{5} - \Tblue{1} =\Tlightgreen{4}$$

The standard deviation and the range measure the dispersion of the data

## Quartiles and percentiles

Quartiles are an extension of the median. We split an ordered data set into four groups with an equal number of values.

• The lower (or first) quartile is the value that is greater than $25\%$ of the values. It is the median of the values below the median. (When the full data set has an odd number of points, we exclude the middle point.)
• The second quartile is the median.
• The upper (or third) quartile is the value that is greater than $75\%$ of the values. It is the median of the values above the median.

The interquartile range is the difference between the upper and the lower quartile. Like the standard deviation, it is a measure of the dispersion of the values.

Take the data set $\Tblue{1}$, $\Tblue{1}$, $\Tblue{3}$, $\Tblue{5}$, $\Tblue{5}$, $\Tblue{6}$, $\Tblue{8}$. The median is $\Tbrown{5}$. The lower quartile is $\Tbrown{1}$ because it is the median of $\Tblue{1}$, $\Tblue{1}$, $\Tblue{3}$. The upper quartile is $\Tbrown{6}$. The interquartile range is $\Tbrown{6} - \Tbrown{1} = \Tgreen{5}$.
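The median-of-halves rule used in this example can be sketched in Python as follows. The function `quartiles` is written for this illustration and assumes at least two values. Other conventions exist (for instance, the standard library's `statistics.quantiles` interpolates between values), so results from other tools may differ slightly.

```python
from statistics import median

def quartiles(values):
    # Quartiles by the median-of-halves rule described above.
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # With an odd number of points, the middle point is excluded from both halves.
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower_half), median(ordered), median(upper_half)

q1, q2, q3 = quartiles([1, 1, 3, 5, 5, 6, 8])
iqr = q3 - q1

print(q1, q2, q3)  # 1 5 6
print(iqr)         # 5
```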

Percentiles are similar to quartiles, but the ordered data set is split into $100$ groups with an equal number of values instead of four.

The $90^\text{th}$ percentile is the value that is greater than $90\%$ of the data and smaller than $10\%$.

Quartiles and percentiles are quantiles.

The three quartiles for different frequency diagrams

## Box-and-whisker plots

A box-and-whisker plot is a summary representation of a data set. It gives the median, the lower and upper quartiles and the lower and upper extremes (the lowest and the highest values).

From a box-and-whisker plot, one can compute the interquartile range and the range. However, one cannot get the mean or the standard deviation of the data.

The elements of a box-and-whisker plot

Take the data set $\Tblue{0}$, $\Tblue{1}$, $\Tblue{3}$, $\Tblue{5}$, $\Tblue{5}$, $\Tblue{6}$, $\Tblue{8}$. The lower quartile is $\Tbrown{1}$; the median is $\Tbrown{5}$; the upper quartile is $\Tbrown{6}$. The lower extreme is $\Tbrown{0}$ and the upper extreme is $\Tbrown{8}$. This gives the following box-and-whisker plot:

Box-and-whisker plot (example)
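The five values that a box-and-whisker plot displays can be computed with a short Python sketch. The function `five_number_summary` is a name chosen for this example, and it assumes the median-of-halves rule for the quartiles and at least two values.

```python
from statistics import median

def five_number_summary(values):
    # Lower extreme, lower quartile, median, upper quartile, upper extreme.
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # With an odd number of points, the middle point is excluded from both halves.
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])

low, q1, med, q3, high = five_number_summary([0, 1, 3, 5, 5, 6, 8])

print(low, q1, med, q3, high)  # 0 1 5 6 8
print(high - low)              # range: 8
print(q3 - q1)                 # interquartile range: 5
```

The range and the interquartile range follow directly from these five values, which is why both can be read off the plot.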

## Cumulative frequency

The cumulative frequency of a value in a data set is the sum of all the frequencies of the data that have the same or a lower value. It can be represented in a table or in a graph.

Take the following data set of $20$ values, each recording a number of siblings.

$\Tblue{0}$, $\Tblue{2}$, $\Tblue{1}$, $\Tblue{3}$, $\Tblue{1}$, $\Tblue{4}$, $\Tblue{4}$, $\Tblue{1}$, $\Tblue{1}$, $\Tblue{1}$, $\Tblue{2}$, $\Tblue{2}$, $\Tblue{0}$, $\Tblue{2}$, $\Tblue{4}$, $\Tblue{1}$, $\Tblue{0}$, $\Tblue{1}$, $\Tblue{0}$, $\Tblue{0}$

The cumulative frequency of a value is obtained by adding the frequencies of all the values from the lowest to the current value.

| Number of siblings | Frequency | Cumulative frequency |
|---|---|---|
| $\Tblue{0}$ | $\Tred{5}$ | $\Torange{5}$ |
| $\Tblue{1}$ | $\Tred{7}$ | $\Torange{12}$ |
| $\Tblue{2}$ | $\Tred{4}$ | $\Torange{16}$ |
| $\Tblue{3}$ | $\Tred{1}$ | $\Torange{17}$ |
| $\Tblue{4}$ | $\Tred{3}$ | $\Torange{20}$ |
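The table above can be rebuilt from the raw data with a few lines of Python. This is a sketch using the standard-library helpers `Counter` and `accumulate`; the variable names are chosen for this example.

```python
from collections import Counter
from itertools import accumulate

siblings = [0, 2, 1, 3, 1, 4, 4, 1, 1, 1,
            2, 2, 0, 2, 4, 1, 0, 1, 0, 0]

counts = Counter(siblings)                   # frequency of each value
values = sorted(counts)                      # values from lowest to highest
frequencies = [counts[v] for v in values]
cumulative = list(accumulate(frequencies))   # running total of the frequencies

for v, f, c in zip(values, frequencies, cumulative):
    print(v, f, c)
# 0 5 5
# 1 7 12
# 2 4 16
# 3 1 17
# 4 3 20
```

The last cumulative frequency always equals the total number of data points, here $20$.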

The cumulative frequency curve is the curve that connects all the cumulative frequencies. The cumulative frequency diagram shows the same values with bars instead of lines.

Cumulative frequency curve (red) and diagram (orange)

It is easy to read quartiles and percentiles from the cumulative frequency curve. The $b^\text{th}$ percentile is the $x$-coordinate $a$ of the point on the curve whose $y$-coordinate equals $\frac{b}{100}\times n$ (where $n$ is the number of data points). Indeed, $b\%$ of the data are then smaller than or equal to $a$, which is exactly what the cumulative frequency at $a$ counts.
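For discrete data, the same reading can be done on the cumulative frequency table rather than on the curve. The Python sketch below, with the hypothetical helper `percentile_from_table`, returns the smallest value whose cumulative frequency reaches $b\%$ of $n$. This is one simple convention and an assumption of this example: it does not interpolate between points as a reading from a smooth curve would, so it can differ slightly from the median-of-halves rule.

```python
def percentile_from_table(values, cumulative, b):
    # values: data values in increasing order
    # cumulative: matching cumulative frequencies
    # b: percentile to read, between 0 and 100
    n = cumulative[-1]  # total number of data points
    for value, cum in zip(values, cumulative):
        # Integer comparison of cum / n >= b / 100, avoiding rounding issues.
        if cum * 100 >= b * n:
            return value

values = [0, 1, 2, 3, 4]
cumulative = [5, 12, 16, 17, 20]

lower_quartile = percentile_from_table(values, cumulative, 25)
median_value = percentile_from_table(values, cumulative, 50)
upper_quartile = percentile_from_table(values, cumulative, 75)
ninetieth = percentile_from_table(values, cumulative, 90)

print(lower_quartile, median_value, upper_quartile, ninetieth)  # 0 1 2 4
```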

Quartiles and cumulative frequency curve