DS – Introduction to Statistics

Statistics is the science of analyzing data.

When we have created a model for prediction, we must assess the prediction’s reliability.

After all, what is a prediction worth, if we cannot rely on it?

Descriptive Statistics
We will first cover some basic descriptive statistics.

Descriptive statistics summarizes important features of a data set such as:

Count
Sum
Standard Deviation
Percentile
Average
Etc.

Descriptive statistics are a good starting point for becoming familiar with the data.

We can use the describe() method of a pandas DataFrame to summarize the data:

import pandas as pd

full_health_data = pd.read_csv("Documents/Data Science/data.csv", header=0, sep=",")

# Show every column and row instead of a truncated view
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

print(full_health_data.describe())

         Duration  Average_Pulse   Max_Pulse  Calorie_Burnage  Hours_Work  \
count  163.000000     163.000000  163.000000       163.000000  163.000000   
mean    64.263804     107.723926  134.226994       382.368098    4.386503   
std     42.994520      14.625062   16.403967       274.227106    3.923772   
min     15.000000      80.000000  100.000000        50.000000    0.000000   
25%     45.000000     100.000000  124.000000       256.500000    0.000000   
50%     60.000000     105.000000  131.000000       320.000000    5.000000   
75%     60.000000     111.000000  141.000000       388.500000    8.000000   
max    300.000000     159.000000  184.000000      1860.000000   11.000000   

       Hours_Sleep  
count   163.000000  
mean      7.680982  
std       0.663934  
min       5.000000  
25%       7.500000  
50%       8.000000  
75%       8.000000  
max      12.000000  
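
Each row of the describe() output can also be read out individually. The following is a minimal sketch, assuming a small made-up DataFrame (the name sessions and its values are illustrative and do not come from data.csv):

```python
import pandas as pd

# Hypothetical training sessions, invented for illustration only
sessions = pd.DataFrame({
    "Duration": [30, 45, 60, 60],
    "Average_Pulse": [95, 100, 105, 120],
})

summary = sessions.describe()

# describe() returns a DataFrame, so single statistics can be selected with .loc
mean_duration = summary.loc["mean", "Duration"]
median_pulse = summary.loc["50%", "Average_Pulse"]

print(mean_duration)
print(median_pulse)
```

The row labels (count, mean, std, min, 25%, 50%, 75%, max) are the same ones shown in the output above.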

25%, 50% and 75% – Percentiles

A percentile is the value below which a given percentage of the observations fall.

Percentiles
Let us try to explain it by some examples, using Average_Pulse.

The 25% percentile of Average_Pulse is 100. This means that 25% of all training sessions
have an average pulse of 100 beats per minute or lower. Conversely,
75% of all training sessions have an average pulse of 100 beats per minute or higher.
The 75% percentile of Average_Pulse is 111. This means that 75% of all training sessions
have an average pulse of 111 beats per minute or lower. Conversely, 25% of all training sessions
have an average pulse of 111 beats per minute or higher.
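
The interpretation above can be checked on a small example. This is a sketch under the assumption of a made-up list of pulse readings (not the values in data.csv): it computes the 25% percentile and then counts which share of the readings lies at or below it, and at or above it.

```python
import numpy as np

# Hypothetical average-pulse readings, invented for illustration only
pulse = np.array([80, 90, 95, 100, 100, 105, 110, 111, 120, 130, 140, 159])

p25 = np.percentile(pulse, 25)

# Comparing against p25 gives True/False per reading; the mean of those
# booleans is the share of readings that satisfy the comparison
share_at_or_below = np.mean(pulse <= p25)
share_at_or_above = np.mean(pulse >= p25)

print(p25)
print(share_at_or_below, share_at_or_above)
```

With real data the shares are only approximately 25% and 75%, because several observations can share the same value and NumPy interpolates between neighbouring observations by default.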

Find the 10% percentile for Max_Pulse
The following example shows how to do it in Python:

import pandas as pd
import numpy as np

full_health_data = pd.read_csv("Documents/Data Science/data.csv", header=0, sep=",")

Max_Pulse = full_health_data["Max_Pulse"]
percentile10 = np.percentile(Max_Pulse, 10)

print(percentile10)

120.0

Max_Pulse = full_health_data["Max_Pulse"] isolates the variable Max_Pulse from the full health data set.
np.percentile(Max_Pulse, 10) returns the 10% percentile of Max_Pulse.
The 10% percentile of Max_Pulse is 120. This means that 10% of all the training sessions have a Max_Pulse of 120 or lower.
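
Several percentiles can be requested in one call. The sketch below assumes a made-up series of Max_Pulse readings (not the contents of data.csv) and also shows the pandas quantile() method, which takes a fraction between 0 and 1 instead of a percentage:

```python
import numpy as np
import pandas as pd

# Hypothetical Max_Pulse readings, invented for illustration only
max_pulse = pd.Series([100, 110, 120, 124, 128, 131, 135, 141, 150, 160, 184])

# A list of percentages returns one percentile per entry
p10, p25, p50, p75 = np.percentile(max_pulse, [10, 25, 50, 75])

# pandas expresses the same request as a fraction between 0 and 1
p10_pandas = max_pulse.quantile(0.10)

print(p10, p25, p50, p75)
print(p10_pandas)
```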

Standard Deviation

Standard deviation is a number that describes how spread out the observations are.

A mathematical function will have difficulty predicting precise values
if the observations are widely spread. In that sense, standard deviation is a measure of uncertainty.

A low standard deviation means that most of the numbers are close to the mean (average) value.
A high standard deviation means that the values are spread out over a wider range.
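
To make the difference concrete, here is a small sketch with two made-up data sets (the names tight and wide are illustrative). Both have the same mean, but their values lie at very different distances from it:

```python
import numpy as np

# Two hypothetical data sets with the same mean (100) but different spread
tight = np.array([98, 99, 100, 101, 102])
wide = np.array([60, 80, 100, 120, 140])

tight_std = np.std(tight)
wide_std = np.std(wide)

print(np.mean(tight), tight_std)
print(np.mean(wide), wide_std)
```

The second data set yields a standard deviation roughly twenty times larger, even though the average is identical.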

Standard Deviation is often represented by the symbol Sigma: σ

We can use the std() function from NumPy to find the standard deviation of a variable:

import pandas as pd
import numpy as np

full_health_data = pd.read_csv("Documents/Data Science/data.csv", header=0, sep=",")

# axis=0 gives one value per column; np.std divides by N (ddof=0) by default
std = np.std(full_health_data, axis=0)

print(std)

Duration            42.862432
Average_Pulse       14.580131
Max_Pulse           16.353571
Calorie_Burnage    273.384624
Hours_Work           3.911718
Hours_Sleep          0.661895
dtype: float64
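
These values are slightly smaller than the std row printed by describe() earlier (for example 42.862432 versus 42.994520 for Duration). The reason is that np.std() divides by the number of observations N by default (ddof=0), while the pandas std() method, which describe() reports, divides by N - 1 (ddof=1). The sketch below illustrates this on a made-up series, not on data.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical durations, invented for illustration only
values = pd.Series([45.0, 60.0, 60.0, 75.0, 300.0])
n = len(values)

population_std = np.std(values.to_numpy())       # divides by N (ddof=0)
sample_std = values.std()                        # divides by N - 1 (ddof=1)
numpy_sample_std = np.std(values.to_numpy(), ddof=1)

# The two definitions differ by the factor sqrt(N / (N - 1))
rescaled = population_std * np.sqrt(n / (n - 1))

print(population_std, sample_std)
print(numpy_sample_std, rescaled)
```

For the full data set with 163 observations the factor is sqrt(163 / 162), which is about 1.003 and accounts for the small difference between the two outputs.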

What do these numbers mean?

Coefficient of Variation

The coefficient of variation is used to get an idea of how large the standard deviation is relative to the mean.

Mathematically, the coefficient of variation is defined as:

Coefficient of Variation = Standard Deviation / Mean

We can compute this in Python with the following code:

import pandas as pd
import numpy as np

full_health_data = pd.read_csv("Documents/Data Science/data.csv", header=0, sep=",")

# axis=0 keeps both calculations per column
cv = np.std(full_health_data, axis=0) / np.mean(full_health_data, axis=0)

print(cv)

Duration           0.666976
Average_Pulse      0.135347
Max_Pulse          0.121835
Calorie_Burnage    0.714978
Hours_Work         0.891762
Hours_Sleep        0.086173
dtype: float64

We see that the variables Duration, Calorie_Burnage and Hours_Work have a high standard deviation
relative to their mean, compared to Max_Pulse, Average_Pulse and Hours_Sleep.
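
Because the coefficient of variation divides by the mean, it makes variables with different scales comparable. Here is a minimal sketch on a made-up DataFrame (the values are illustrative, not from data.csv). It uses the DataFrame methods std() and mean() directly, with ddof=0 set explicitly so the result matches NumPy's default:

```python
import pandas as pd

# Hypothetical sessions, invented for illustration only
sessions = pd.DataFrame({
    "Duration": [30, 45, 60, 120],
    "Hours_Sleep": [7.0, 7.5, 8.0, 8.5],
})

# Coefficient of Variation = Standard Deviation / Mean, computed per column
cv = sessions.std(ddof=0) / sessions.mean()

print(cv)
```

In this example Duration varies far more relative to its mean than Hours_Sleep does.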

Variance

Use Python to Find the Variance of health_data
We can use the var() function from NumPy to find the variance
(remember that we now use the first data set with 10 observations):

import pandas as pd
import numpy as np

health_data = pd.read_csv("Documents/Data Science/data2.csv", header=0, sep=",")

# axis=0 gives one variance per column
var = np.var(health_data, axis=0)

print(var)

Duration           236.25
Average_Pulse      206.25
Max_Pulse          116.00
Calorie_Burnage    825.00
Hours_Work          11.84
Hours_Sleep          0.25
dtype: float64

Use Python to Find the Variance of Full Data Set
Here we calculate the variance for each column for the full data set:

import pandas as pd
import numpy as np

full_health_data = pd.read_csv("Documents/Data Science/data.csv", header=0, sep=",")

# axis=0 gives one variance per column
var = np.var(full_health_data, axis=0)

print(var)

Duration            1837.188076
Average_Pulse        212.580225
Max_Pulse            267.439271
Calorie_Burnage    74739.152847
Hours_Work            15.301536
Hours_Sleep            0.438105
dtype: float64
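
The variance is the square of the standard deviation, which can be checked against the outputs above: 42.862432 squared is approximately 1837.19, the variance of Duration. The sketch below shows the same relationship on a made-up array (the values are illustrative, not from data.csv):

```python
import numpy as np

# Hypothetical durations, invented for illustration only
duration = np.array([30.0, 45.0, 60.0, 60.0, 120.0])

variance = np.var(duration)
std = np.std(duration)

print(variance)
print(std ** 2)
```

Since the variance is expressed in squared units, the standard deviation is usually easier to interpret alongside the original values.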