Introduction to Statistics
Statistics is the science of analyzing data.
When we have created a model for prediction, we must assess the prediction’s reliability.
After all, what is a prediction worth, if we cannot rely on it?
Descriptive Statistics
We will first cover some basic descriptive statistics.
Descriptive statistics summarizes important features of a data set such as:
Count
Sum
Standard Deviation
Percentile
Average
Etc.
It is a good starting point to become familiar with the data.
We can use the describe() method of a pandas DataFrame to summarize the data:
import pandas as pd
full_health_data = pd.read_csv("Documents/Data Science/data.csv", header=0, sep=",")
# Show every column and row instead of a truncated view
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
print(full_health_data.describe())
         Duration  Average_Pulse   Max_Pulse  Calorie_Burnage  Hours_Work  \
count  163.000000     163.000000  163.000000       163.000000  163.000000
mean    64.263804     107.723926  134.226994       382.368098    4.386503
std     42.994520      14.625062   16.403967       274.227106    3.923772
min     15.000000      80.000000  100.000000        50.000000    0.000000
25%     45.000000     100.000000  124.000000       256.500000    0.000000
50%     60.000000     105.000000  131.000000       320.000000    5.000000
75%     60.000000     111.000000  141.000000       388.500000    8.000000
max    300.000000     159.000000  184.000000      1860.000000   11.000000

       Hours_Sleep
count   163.000000
mean      7.680982
std       0.663934
min       5.000000
25%       7.500000
50%       8.000000
75%       8.000000
max      12.000000
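Each row of this summary can also be computed on its own. The sketch below uses a small, made-up column of durations (the values are hypothetical, not taken from data.csv) and standard pandas Series methods, as a rough illustration of what describe() reports:

```python
import pandas as pd

# Hypothetical durations in minutes; illustrative only, not from data.csv.
duration = pd.Series([30, 45, 45, 60, 60, 60, 75, 90])

count = duration.count()          # number of non-missing observations
total = duration.sum()            # sum of all observations
average = duration.mean()         # arithmetic mean
std_dev = duration.std()          # sample standard deviation (ddof=1), as in describe()
median = duration.quantile(0.50)  # 50th percentile

print(count, total, average, median)
print(round(std_dev, 3))
```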
25%, 50% and 75% – Percentiles
Percentiles are used in statistics to give you a value
that a given percentage of the observations fall at or below.
Let us explain it with some examples, using Average_Pulse.
The 25th percentile of Average_Pulse means that 25% of all the training sessions
have an average pulse of 100 beats per minute or lower. If we flip the statement,
it means that 75% of all the training sessions have an average pulse of 100 beats per minute or higher.
The 75th percentile of Average_Pulse means that 75% of all the training sessions
have an average pulse of 111 beats per minute or lower. If we flip the statement, it means that 25% of all the training sessions
have an average pulse of 111 beats per minute or higher.
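The two readings of a percentile can be checked directly. The following is a minimal sketch on made-up pulse values (the array is hypothetical, not taken from data.csv): it computes the 25th percentile and then counts the share of observations on each side of it.

```python
import numpy as np

# Hypothetical average-pulse values; illustrative only, not from data.csv.
average_pulse = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135])

p25 = np.percentile(average_pulse, 25)

# Share of sessions at or below / at or above the 25th percentile
share_at_or_below = np.mean(average_pulse <= p25)
share_at_or_above = np.mean(average_pulse >= p25)

print(p25, share_at_or_below, share_at_or_above)
```

With real data, repeated values and interpolation between observations mean that the two shares are only approximately 25% and 75%.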
Find the 10th Percentile for Max_Pulse
The following example shows how to do it in Python:
import pandas as pd
import numpy as np
full_health_data = pd.read_csv("Documents/Data Science/data.csv", header=0, sep=",")
# Isolate the Max_Pulse column
Max_Pulse = full_health_data["Max_Pulse"]
# Value that 10% of the observations fall at or below
percentile10 = np.percentile(Max_Pulse, 10)
print(percentile10)
120.0
Max_Pulse = full_health_data["Max_Pulse"] – Isolate the variable Max_Pulse from the full health data set.
np.percentile(Max_Pulse, 10) – Compute the 10th percentile of Max_Pulse.
The 10th percentile of Max_Pulse is 120. This means that 10% of all the training sessions have a Max_Pulse of 120 or lower.
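pandas offers an equivalent through Series.quantile(), which takes a fraction between 0 and 1 rather than a percentage. A minimal sketch on made-up values (the Series below is hypothetical), showing that both calls agree under their default linear interpolation:

```python
import numpy as np
import pandas as pd

# Hypothetical maximum-pulse values; illustrative only, not from data.csv.
max_pulse = pd.Series([118, 120, 124, 127, 131, 135, 141, 150, 162, 184])

with_numpy = np.percentile(max_pulse, 10)  # argument is a percentage: 10
with_pandas = max_pulse.quantile(0.10)     # argument is a fraction: 0.10

print(with_numpy, with_pandas)
```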
Standard Deviation
Standard deviation is a number that describes how spread out the observations are.
A mathematical function will have difficulty predicting precise values
if the observations are widely spread. In that sense, standard deviation is a measure of uncertainty.
A low standard deviation means that most of the numbers are close to the mean (average) value.
A high standard deviation means that the values are spread out over a wider range.
Standard Deviation is often represented by the symbol Sigma: σ
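To make the idea of spread concrete, the sketch below computes the standard deviation step by step for two made-up samples that share the same mean, and compares the result with np.std(). The sample values and the helper name population_std are assumptions for illustration only.

```python
import numpy as np

def population_std(values):
    """Square root of the mean squared deviation from the mean."""
    values = np.asarray(values, dtype=float)
    deviations = values - values.mean()
    return np.sqrt(np.mean(deviations ** 2))

# Hypothetical pulse samples, both with a mean of 100.
tight = [98, 99, 100, 101, 102]
wide = [60, 80, 100, 120, 140]

print(population_std(tight), np.std(tight))
print(population_std(wide), np.std(wide))
```

The sample whose values lie close to 100 yields the smaller number, and the sample spread between 60 and 140 yields the larger one.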
We can use the std() function from NumPy to find the standard deviation of each variable:
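import pandas as pd
import numpy as np
full_health_data = pd.read_csv("Documents/Data Science/data.csv", header=0, sep=",")
# axis=0: one value per column (without it, recent pandas versions may reduce over the whole table)
std = np.std(full_health_data, axis=0)
print(std)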
Duration            42.862432
Average_Pulse       14.580131
Max_Pulse           16.353571
Calorie_Burnage    273.384624
Hours_Work           3.911718
Hours_Sleep          0.661895
dtype: float64
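These values are slightly smaller than the std row reported by describe() (for Duration, 42.862432 against 42.994520). The reason is that np.std() divides by n by default (ddof=0), while describe() divides by n - 1 (ddof=1). A minimal sketch of the difference, using a made-up column (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical durations; illustrative only, not from data.csv.
df = pd.DataFrame({"Duration": [30, 45, 45, 60, 60, 60, 75, 90]})

population_std = np.std(df, axis=0)  # divides by n (ddof=0)
sample_std = df.std()                # divides by n - 1 (ddof=1), as describe() does

n = len(df)
# The two differ only by the factor sqrt((n - 1) / n)
rescaled = sample_std * np.sqrt((n - 1) / n)

print(population_std["Duration"], sample_std["Duration"], rescaled["Duration"])
```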
What do these standard deviation values mean?
Coefficient of Variation
The coefficient of variation expresses how large the standard deviation is relative to the mean, which makes variables measured on different scales comparable.
Mathematically, the coefficient of variation is defined as:
Coefficient of Variation = Standard Deviation / Mean
We can calculate this in Python with the following code:
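import pandas as pd
import numpy as np
full_health_data = pd.read_csv("Documents/Data Science/data.csv", header=0, sep=",")
# axis=0: one value per column (without it, recent pandas versions may reduce over the whole table)
cv = np.std(full_health_data, axis=0) / np.mean(full_health_data, axis=0)
print(cv)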
Duration           0.666976
Average_Pulse      0.135347
Max_Pulse          0.121835
Calorie_Burnage    0.714978
Hours_Work         0.891762
Hours_Sleep        0.086173
dtype: float64
We see that the variables Duration, Calorie_Burnage and Hours_Work have a high standard deviation relative to their mean,
compared to Max_Pulse, Average_Pulse and Hours_Sleep.
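Because the coefficient of variation divides out the scale, it lets us compare variables measured in different units. A minimal sketch with a made-up two-column table (the column values are hypothetical, not taken from data.csv):

```python
import pandas as pd

# Hypothetical data; illustrative only, not from data.csv.
df = pd.DataFrame({
    "Calorie_Burnage": [200, 300, 400, 500, 600],
    "Hours_Sleep": [7.0, 7.5, 8.0, 8.5, 9.0],
})

# Population standard deviation (ddof=0) divided by the mean, per column
cv = df.std(ddof=0) / df.mean()
print(cv)
```

In this made-up table, Calorie_Burnage varies far more relative to its mean than Hours_Sleep does, even though the raw standard deviations are not directly comparable.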
Variance
Use Python to Find the Variance of health_data
We can use the var() function from NumPy to find the variance
(remember that we now use the first data set with 10 observations):
import pandas as pd
import numpy as np
health_data = pd.read_csv("Documents/Data Science/data2.csv", header=0, sep=",")
# axis=0: one variance per column
var = np.var(health_data, axis=0)
print(var)
Duration           236.25
Average_Pulse      206.25
Max_Pulse          116.00
Calorie_Burnage    825.00
Hours_Work          11.84
Hours_Sleep          0.25
dtype: float64
Use Python to Find the Variance of Full Data Set
Here we calculate the variance for each column for the full data set:
import pandas as pd
import numpy as np
full_health_data = pd.read_csv("Documents/Data Science/data.csv", header=0, sep=",")
# axis=0: one variance per column
var = np.var(full_health_data, axis=0)
print(var)
Duration            1837.188076
Average_Pulse        212.580225
Max_Pulse            267.439271
Calorie_Burnage    74739.152847
Hours_Work            15.301536
Hours_Sleep            0.438105
dtype: float64
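Variance is the square of the standard deviation, so these results can be cross-checked against the np.std() output for the same data set: for Duration, 42.862432 squared is approximately 1837.19. A minimal sketch of the same check on made-up values (the array is hypothetical):

```python
import numpy as np

# Hypothetical durations; illustrative only, not from data.csv.
duration = np.array([30, 45, 45, 60, 60, 60, 75, 90])

variance = np.var(duration)  # mean squared deviation from the mean (ddof=0)
std_dev = np.std(duration)   # square root of the variance

print(variance, std_dev ** 2)
```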