# Introduction to Statistics

Statistics is the science of analyzing data.

When we have created a model for prediction, we must assess the prediction’s reliability.

After all, what is a prediction worth, if we cannot rely on it?

## Descriptive Statistics

We will first cover some basic descriptive statistics.

Descriptive statistics summarize important features of a data set, such as:

- Count
- Sum
- Standard deviation
- Percentile
- Average

They are a good starting point for becoming familiar with the data.
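
As a rough illustration, each of these measures can also be computed one at a time with pandas. The sketch below uses a small made-up series named `durations` (an assumption for this example, not the tutorial's data set):

```python
import pandas as pd

# Hypothetical toy data: durations of five training sessions, in minutes
durations = pd.Series([30, 45, 45, 60, 120], name="Duration")

count = durations.count()          # number of observations
total = durations.sum()            # sum of all values
average = durations.mean()         # arithmetic mean
median = durations.quantile(0.5)   # 50th percentile
spread = durations.std()           # sample standard deviation (pandas default, ddof=1)

print(count, total, average, median)
print(round(spread, 2))
```

Computing the measures separately makes it easier to see what each one captures before relying on a combined summary.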

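We can use the `describe()` method of a pandas DataFrame to summarize the data:

```python
import pandas as pd

# Show every column and row instead of a truncated view
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# full_health_data is assumed to be a DataFrame that has already been loaded
print(full_health_data.describe())
```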

```
         Duration  Average_Pulse   Max_Pulse  Calorie_Burnage  Hours_Work  \
count  163.000000     163.000000  163.000000       163.000000  163.000000
mean    64.263804     107.723926  134.226994       382.368098    4.386503
std     42.994520      14.625062   16.403967       274.227106    3.923772
min     15.000000      80.000000  100.000000        50.000000    0.000000
25%     45.000000     100.000000  124.000000       256.500000    0.000000
50%     60.000000     105.000000  131.000000       320.000000    5.000000
75%     60.000000     111.000000  141.000000       388.500000    8.000000
max    300.000000     159.000000  184.000000      1860.000000   11.000000

       Hours_Sleep
count   163.000000
mean      7.680982
std       0.663934
min       5.000000
25%       7.500000
50%       8.000000
75%       8.000000
max      12.000000
```
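
Because `describe()` returns a DataFrame, individual statistics can be looked up by row and column label. A minimal sketch, assuming a small made-up DataFrame named `toy` in place of `full_health_data`:

```python
import pandas as pd

# Hypothetical stand-in for full_health_data with two of its columns
toy = pd.DataFrame({
    "Duration": [30, 45, 45, 60, 120],
    "Average_Pulse": [95, 100, 105, 110, 130],
})

summary = toy.describe()   # rows are statistics, columns are variables

mean_duration = summary.loc["mean", "Duration"]
median_pulse = summary.loc["50%", "Average_Pulse"]

print(mean_duration)   # mean of the Duration column
print(median_pulse)    # 50th percentile of Average_Pulse
```

The row labels (`count`, `mean`, `std`, `min`, `25%`, `50%`, `75%`, `max`) match the ones in the printed summary above.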

# 25%, 50% and 75% – Percentiles

A percentile is a value at or below which a given percentage of the observations fall.

Let us explain this with some examples, using Average_Pulse.

- The 25th percentile of Average_Pulse is 100. This means that 25% of all the training sessions have an average pulse of 100 beats per minute or lower. If we flip the statement, 75% of all the training sessions have an average pulse of 100 beats per minute or higher.
- The 75th percentile of Average_Pulse is 111. This means that 75% of all the training sessions have an average pulse of 111 beats per minute or lower. If we flip the statement, 25% of all the training sessions have an average pulse of 111 beats per minute or higher.
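
This reading of a percentile can be checked directly by counting. The sketch below assumes eight made-up pulse values (not taken from the tutorial's data) and measures the share of sessions on each side of the 25th percentile:

```python
import numpy as np

# Hypothetical average-pulse values for eight sessions (illustration only)
pulse = np.array([90, 95, 100, 100, 105, 110, 115, 130])

# 25th percentile, using NumPy's default linear interpolation
p25 = np.percentile(pulse, 25)

# Share of sessions at or below, and at or above, that value
share_at_or_below = np.mean(pulse <= p25)
share_at_or_above = np.mean(pulse >= p25)

print(p25, share_at_or_below, share_at_or_above)
```

With small samples or many tied values, the shares only approximate 25% and 75%, because several observations can sit exactly on the percentile.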

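## Find the 10th Percentile for Max_Pulse

The following example shows how to do it in Python:

```python
import pandas as pd
import numpy as np

# Isolate the Max_Pulse column (full_health_data is assumed to be loaded already)
Max_Pulse = full_health_data["Max_Pulse"]
# Compute the 10th percentile of Max_Pulse
percentile10 = np.percentile(Max_Pulse, 10)

print(percentile10)
```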

`120.0`

- `Max_Pulse = full_health_data["Max_Pulse"]` isolates the variable Max_Pulse from the full health data set.
- `np.percentile()` computes the 10th percentile of Max_Pulse.
- The 10th percentile of Max_Pulse is 120. This means that 10% of all the training sessions have a Max_Pulse of 120 or lower.
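
The same percentile can also be obtained from pandas itself. A small sketch, assuming a made-up series named `max_pulse_toy`, compares `np.percentile()` with the `quantile()` method of a pandas Series; note that `quantile()` takes a fraction (0.10) rather than a percentage (10):

```python
import numpy as np
import pandas as pd

# Hypothetical max-pulse values for eleven sessions (illustration only)
max_pulse_toy = pd.Series([100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200])

from_numpy = np.percentile(max_pulse_toy, 10)   # percentage scale: 0 to 100
from_pandas = max_pulse_toy.quantile(0.10)      # fraction scale: 0 to 1

print(from_numpy, from_pandas)
```

Both calls use linear interpolation by default, so they agree on this data.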

# Standard Deviation

Standard deviation is a number that describes how spread out the observations are.

A mathematical function will have difficulty predicting precise values
if the observations are widely spread. In that sense, standard deviation is a measure of uncertainty.

A low standard deviation means that most of the numbers are close to the mean (average) value.
A high standard deviation means that the values are spread out over a wider range.

Standard deviation is often represented by the lowercase Greek letter sigma: σ
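
To make "close to the mean" and "spread out" concrete, the sketch below computes the standard deviation by hand for two made-up samples that share the same mean. The helper `population_std` and both arrays are assumptions for this illustration:

```python
import numpy as np

# Two hypothetical samples with the same mean (100) but different spread
tight = np.array([98, 99, 100, 101, 102])
wide = np.array([60, 80, 100, 120, 140])

def population_std(values):
    """Square root of the mean squared distance from the mean."""
    mean = values.mean()
    return np.sqrt(np.mean((values - mean) ** 2))

std_tight = population_std(tight)
std_wide = population_std(wide)

print(round(std_tight, 3), round(std_wide, 3))
```

The hand-written version matches `np.std()` with its default settings, and the wider sample yields the larger value.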

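We can use the `std()` function from NumPy to find the standard deviation of each variable:

```python
import pandas as pd
import numpy as np

# np.std() uses ddof=0 (population standard deviation) by default.
# axis=0 makes the per-column reduction explicit.
# full_health_data is assumed to be loaded already.
std = np.std(full_health_data, axis=0)

print(std)
```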

```
Duration            42.862432
Average_Pulse       14.580131
Max_Pulse           16.353571
Calorie_Burnage    273.384624
Hours_Work           3.911718
Hours_Sleep          0.661895
dtype: float64
```
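
These values are slightly smaller than the `std` row reported by `describe()` earlier (for example, 42.862432 versus 42.994520 for Duration). That is consistent with the two defaults: `np.std()` divides by n (`ddof=0`), while pandas divides by n - 1 (`ddof=1`). A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical durations for five sessions (illustration only)
values = pd.Series([30, 45, 45, 60, 120])

population = np.std(values.to_numpy())        # divides by n (ddof=0)
sample = np.std(values.to_numpy(), ddof=1)    # divides by n - 1 (ddof=1)
pandas_default = values.std()                 # pandas default is ddof=1

print(round(population, 3), round(sample, 3), round(pandas_default, 3))
```

The gap shrinks as the number of observations grows, which is why the two results are close for 163 sessions.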

What do these numbers mean?

# Coefficient of Variation

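The coefficient of variation is used to get an idea of how large the standard deviation is relative to the mean.

Mathematically, the coefficient of variation is defined as:

Coefficient of Variation = Standard Deviation / Mean

We can compute this in Python with the following code:

```python
import pandas as pd
import numpy as np

# axis=0 reduces each column separately; without it, recent pandas
# versions may reduce over the whole table instead.
# full_health_data is assumed to be loaded already.
cv = np.std(full_health_data, axis=0) / np.mean(full_health_data, axis=0)

print(cv)
```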

```
Duration           0.666976
Average_Pulse      0.135347
Max_Pulse          0.121835
Calorie_Burnage    0.714978
Hours_Work         0.891762
Hours_Sleep        0.086173
dtype: float64
```

We see that the variables Duration, Calorie_Burnage and Hours_Work have a high standard deviation relative to their mean,
compared to Max_Pulse, Average_Pulse and Hours_Sleep.
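
One reason to divide by the mean is that the result no longer depends on the unit of measurement. The sketch below assumes made-up session lengths and a helper named `coefficient_of_variation`, both introduced only for this illustration:

```python
import numpy as np

# Hypothetical session lengths, expressed in minutes and in seconds
minutes = np.array([30.0, 45.0, 60.0, 75.0, 90.0])
seconds = minutes * 60

def coefficient_of_variation(values):
    """Standard deviation divided by the mean."""
    return np.std(values) / np.mean(values)

cv_minutes = coefficient_of_variation(minutes)
cv_seconds = coefficient_of_variation(seconds)

print(np.std(minutes), np.std(seconds))   # differ by a factor of 60
print(cv_minutes, cv_seconds)             # the same up to rounding
```

This is what makes it reasonable to compare the spread of variables such as Calorie_Burnage and Hours_Sleep, which are measured on very different scales.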

# Variance

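## Use Python to Find the Variance of health_data

We can use the `var()` function from NumPy to find the variance
(remember that we now use the first data set with 10 observations):

```python
import pandas as pd
import numpy as np

# health_data is assumed to be the 10-observation DataFrame loaded earlier.
# axis=0 computes one variance per column.
var = np.var(health_data, axis=0)

print(var)
```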

```
Duration           236.25
Average_Pulse      206.25
Max_Pulse          116.00
Calorie_Burnage    825.00
Hours_Work          11.84
Hours_Sleep          0.25
dtype: float64
```
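
Variance and standard deviation are closely linked: the variance is the standard deviation squared. A short sketch on made-up values (the array `values` is an assumption for this example) shows the relationship:

```python
import numpy as np

# Hypothetical observations (illustration only)
values = np.array([10, 20, 30, 40, 50])

variance = np.var(values)   # mean squared distance from the mean (ddof=0)
std_dev = np.std(values)    # square root of the variance

print(variance, std_dev, std_dev ** 2)
```

Because the variance is in squared units, the standard deviation is usually easier to interpret alongside the original data.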

## Use Python to Find the Variance of Full Data Set

Here we calculate the variance for each column for the full data set:

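```python
import pandas as pd
import numpy as np

# Assumed completion, following the health_data example above
print(np.var(full_health_data, axis=0))
```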

```Duration            1837.188076