# Introduction to Statistics

Statistics is the science of analyzing data.

When we have created a model for prediction, we must assess the prediction’s reliability.

After all, what is a prediction worth if we cannot rely on it?

## Descriptive Statistics

We will first cover some basic descriptive statistics.

Descriptive statistics summarize important features of a data set, such as:

- Count
- Sum
- Standard deviation
- Percentile
- Average

Descriptive statistics are a good starting point for becoming familiar with the data.

We can use the describe() method of a pandas DataFrame to summarize the data:

    import pandas as pd

    # Load the full health data set.
    full_health_data = pd.read_csv("Documents/Data Science/data.csv", header=0, sep=",")

    # Do not truncate columns or rows when printing.
    pd.set_option("display.max_columns", None)
    pd.set_option("display.max_rows", None)

    print(full_health_data.describe())

             Duration  Average_Pulse   Max_Pulse  Calorie_Burnage  Hours_Work  \
    count  163.000000     163.000000  163.000000       163.000000  163.000000
    mean    64.263804     107.723926  134.226994       382.368098    4.386503
    std     42.994520      14.625062   16.403967       274.227106    3.923772
    min     15.000000      80.000000  100.000000        50.000000    0.000000
    25%     45.000000     100.000000  124.000000       256.500000    0.000000
    50%     60.000000     105.000000  131.000000       320.000000    5.000000
    75%     60.000000     111.000000  141.000000       388.500000    8.000000
    max    300.000000     159.000000  184.000000      1860.000000   11.000000

           Hours_Sleep
    count   163.000000
    mean      7.680982
    std       0.663934
    min       5.000000
    25%       7.500000
    50%       8.000000
    75%       8.000000
    max      12.000000
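The individual statistics reported by describe() can also be computed one at a time. The following is a minimal sketch on a small made-up data set: the `sessions` DataFrame and all of its values are invented for illustration and are not taken from data.csv.

```python
import pandas as pd

# Hypothetical sample: five training sessions (invented values).
sessions = pd.DataFrame({
    "Duration": [30, 45, 45, 60, 60],
    "Average_Pulse": [80, 85, 90, 95, 100],
})

count = sessions["Duration"].count()          # number of non-missing observations
total = sessions["Duration"].sum()            # sum of all values
average = sessions["Duration"].mean()         # arithmetic mean
median = sessions["Duration"].quantile(0.50)  # the 50% percentile

print(count, total, average, median)

# The same numbers appear in the summary table produced by describe().
summary = sessions.describe()
print(summary.loc["mean", "Duration"], summary.loc["50%", "Duration"])
```

Each value should match the corresponding row that describe() reports for the same column.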

# 25%, 50% and 75% – Percentiles

A percentile is a number that describes the value below which a given percentage of the observations fall.


Let us explain this with some examples, using Average_Pulse.

The 25% percentile of Average_Pulse means that 25% of all of the training sessions have an average pulse of 100 beats per minute or lower. If we flip the statement, it means that 75% of all of the training sessions have an average pulse of 100 beats per minute or higher.

The 75% percentile of Average_Pulse means that 75% of all of the training sessions have an average pulse of 111 beats per minute or lower. If we flip the statement, it means that 25% of all of the training sessions have an average pulse of 111 beats per minute or higher.
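To see what a percentile means in practice, we can count how many observations lie on each side of it. This is a small sketch on made-up pulse readings (the array `average_pulse` is hypothetical). With real data that contains many tied values, the shares will only approximately equal 25% and 75%.

```python
import numpy as np

# Hypothetical average-pulse readings for eight sessions (invented values).
average_pulse = np.array([90, 95, 100, 100, 105, 110, 115, 120])

# The 25% percentile, using NumPy's default linear interpolation.
p25 = np.percentile(average_pulse, 25)

# Share of sessions at or below, and at or above, that value.
share_at_or_below = np.mean(average_pulse <= p25)
share_at_or_above = np.mean(average_pulse >= p25)

print(p25, share_at_or_below, share_at_or_above)
```

Here a quarter of the readings fall at or below the 25% percentile and three quarters fall at or above it, which is the "flipped" statement described above.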

Find the 10% percentile for Max_Pulse

The following example shows how to do it in Python:

    import pandas as pd
    import numpy as np

    full_health_data = pd.read_csv("Documents/Data Science/data.csv", header=0, sep=",")

    # Isolate the Max_Pulse column, then compute its 10% percentile.
    Max_Pulse = full_health_data["Max_Pulse"]
    percentile10 = np.percentile(Max_Pulse, 10)

    print(percentile10)

120.0

Max_Pulse = full_health_data["Max_Pulse"] isolates the variable Max_Pulse from the full health data set.

np.percentile(Max_Pulse, 10) returns the 10% percentile of Max_Pulse.

The 10% percentile of Max_Pulse is 120. This means that 10% of all the training sessions have a Max_Pulse of 120 or lower.
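pandas offers an equivalent method, quantile(), which expects a fraction between 0 and 1 instead of a percentage between 0 and 100. The sketch below compares the two on a made-up Series; the values are invented and are not taken from data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical Max_Pulse readings for eleven sessions (invented values).
max_pulse = pd.Series([110, 120, 125, 130, 135, 140, 145, 150, 160, 175, 180])

p10_numpy = np.percentile(max_pulse, 10)  # percentage on a 0-100 scale
p10_pandas = max_pulse.quantile(0.10)     # fraction on a 0-1 scale

print(p10_numpy, p10_pandas)
```

Both calls use linear interpolation by default, so they should agree. With eleven sorted values, the 10% percentile falls exactly on the second smallest value.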

# Standard Deviation

Standard deviation is a number that describes how spread out the observations are.


A mathematical function will have difficulty predicting precise values if the observations are widely spread. In that sense, standard deviation is a measure of uncertainty.

A low standard deviation means that most of the numbers are close to the mean (average) value.

A high standard deviation means that the values are spread out over a wider range.

Standard deviation is often represented by the Greek letter sigma: σ
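As an illustration, the standard deviation can be computed by hand as the square root of the average squared distance from the mean. The sketch below assumes this population form of the formula (dividing by the number of observations, which is also NumPy's default) and uses two made-up samples with the same mean but different spread. The helper `population_std` is a hypothetical name introduced only for this example.

```python
import numpy as np

# Two hypothetical samples with the same mean (100) but different spread.
low_spread = np.array([98, 99, 100, 101, 102])
high_spread = np.array([60, 80, 100, 120, 140])

def population_std(values):
    """Square root of the mean squared deviation from the mean."""
    deviations = values - values.mean()
    return np.sqrt(np.mean(deviations ** 2))

sigma_low = population_std(low_spread)
sigma_high = population_std(high_spread)

print(sigma_low, sigma_high)

# The hand-written formula agrees with NumPy's built-in function.
print(np.isclose(sigma_low, np.std(low_spread)))
```

The sample whose values lie close to the mean gets a small σ, and the sample whose values are spread over a wider range gets a large σ.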

We can use the std() function from NumPy to find the standard deviation of each variable:

    import pandas as pd
    import numpy as np

    full_health_data = pd.read_csv("Documents/Data Science/data.csv", header=0, sep=",")

    # axis=0 gives one standard deviation per column.
    std = np.std(full_health_data, axis=0)

    print(std)

    Duration            42.862432
    Average_Pulse       14.580131
    Max_Pulse           16.353571
    Calorie_Burnage    273.384624
    Hours_Work           3.911718
    Hours_Sleep          0.661895
    dtype: float64
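These values are slightly smaller than the std row printed by describe() earlier. The reason is that np.std() divides by the number of observations n by default (ddof=0), while pandas divides by n - 1 by default (ddof=1). The sketch below shows the difference on a made-up Series (the values are invented) and then rescales the Duration figure from the describe() output.

```python
import numpy as np
import pandas as pd

# Hypothetical durations in minutes (invented values, not from data.csv).
duration = pd.Series([30, 45, 45, 60, 60])

population_std = np.std(duration.to_numpy())      # divides by n (ddof=0)
sample_std = np.std(duration.to_numpy(), ddof=1)  # divides by n - 1
pandas_std = duration.std()                       # pandas defaults to ddof=1

print(population_std, sample_std, pandas_std)

# Rescaling the Duration std reported by describe() (163 observations)
# gives approximately the np.std() value for Duration shown above.
rescaled = 42.994520 * np.sqrt(162 / 163)
print(round(rescaled, 4))
```

With 163 observations the two conventions differ only in the third significant digit, but with small samples the gap is larger.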

What do these numbers mean?

# Coefficient of Variation

The coefficient of variation is used to get an idea of how large the standard deviation is relative to the mean.

Mathematically, the coefficient of variation is defined as:

Coefficient of Variation = Standard Deviation / Mean

We can compute this in Python with the following code:

    import pandas as pd
    import numpy as np

    full_health_data = pd.read_csv("Documents/Data Science/data.csv", header=0, sep=",")

    # axis=0 computes the standard deviation and the mean per column.
    cv = np.std(full_health_data, axis=0) / np.mean(full_health_data, axis=0)

    print(cv)

    Duration           0.666976
    Average_Pulse      0.135347
    Max_Pulse          0.121835
    Calorie_Burnage    0.714978
    Hours_Work         0.891762
    Hours_Sleep        0.086173
    dtype: float64

We see that the variables Duration, Calorie_Burnage and Hours_Work have a high standard deviation relative to their mean, compared to Max_Pulse, Average_Pulse and Hours_Sleep.
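The coefficient of variation is useful because it has no unit, so variables measured on very different scales can be compared directly. The sketch below illustrates this with two made-up arrays (the values are invented, and `coefficient_of_variation` is a hypothetical helper), followed by a check against the Duration figures printed earlier.

```python
import numpy as np

# Hypothetical measurements on very different scales (invented values).
calories = np.array([200.0, 300.0, 400.0, 500.0, 600.0])
hours_sleep = np.array([7.0, 7.5, 8.0, 8.5, 9.0])

def coefficient_of_variation(values):
    """Standard deviation divided by the mean."""
    return np.std(values) / np.mean(values)

cv_calories = coefficient_of_variation(calories)
cv_sleep = coefficient_of_variation(hours_sleep)

print(cv_calories, cv_sleep)

# Duration: np.std() value divided by the mean from describe().
cv_duration = 42.862432 / 64.263804
print(round(cv_duration, 4))
```

The calorie values vary far more relative to their mean than the sleep values do, even though comparing the raw standard deviations alone would mix kilocalories with hours.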

# Variance

## Use Python to Find the Variance of health_data

We can use the var() function from NumPy to find the variance (remember that we now use the first data set, with 10 observations):

    import pandas as pd
    import numpy as np

    health_data = pd.read_csv("Documents/Data Science/data2.csv", header=0, sep=",")

    # axis=0 gives one variance per column.
    var = np.var(health_data, axis=0)

    print(var)

    Duration           236.25
    Average_Pulse      206.25
    Max_Pulse          116.00
    Calorie_Burnage    825.00
    Hours_Work          11.84
    Hours_Sleep          0.25
    dtype: float64

## Use Python to Find the Variance of Full Data Set

Here we calculate the variance for each column for the full data set:

    import pandas as pd
    import numpy as np

    full_health_data = pd.read_csv("Documents/Data Science/data.csv", header=0, sep=",")

    # axis=0 gives one variance per column.
    var = np.var(full_health_data, axis=0)

    print(var)

    Duration            1837.188076
    Average_Pulse        212.580225
    Max_Pulse            267.439271
    Calorie_Burnage    74739.152847
    Hours_Work            15.301536
    Hours_Sleep            0.438105
    dtype: float64
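Variance and standard deviation are closely related: the variance is the square of the standard deviation. The following sketch checks this on a made-up array (the values are invented) and on the Duration figures printed earlier.

```python
import numpy as np

# Hypothetical durations in minutes (invented values, not from data.csv).
duration = np.array([30.0, 45.0, 45.0, 60.0, 60.0])

variance = np.var(duration)  # mean squared deviation from the mean (ddof=0)
std = np.std(duration)       # square root of the variance

print(variance, std ** 2)

# Duration in the full data set: squaring the np.std() value printed earlier
# gives approximately the variance reported for Duration.
duration_variance = 42.862432 ** 2
print(round(duration_variance, 3))
```

Because the variance is expressed in squared units (for example, minutes squared), the standard deviation is usually the easier of the two to interpret.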