DS – Correlation

Correlation

Correlation measures the relationship between two variables.

We mentioned that the purpose of a function is to predict a value
by converting input (x) to output (f(x)). We can also say that a function uses the relationship between two variables for prediction.

Correlation Coefficient
The correlation coefficient measures the strength and direction of the linear relationship between two variables.

The correlation coefficient can never be less than -1 or higher than 1.

1 = there is a perfect positive linear relationship between the variables (like Average_Pulse against Calorie_Burnage in the small data set)
0 = there is no linear relationship between the variables
-1 = there is a perfect negative linear relationship between the variables (e.g. fewer hours worked leads to higher calorie burnage during a training session)
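The three reference values above can be checked numerically. The sketch below assumes the Pearson coefficient (the default of the pandas corr() method) and uses made-up numbers; pearson_r is a helper name introduced here for illustration, not part of any library.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square roots of the summed squared deviations.
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical values, chosen only to illustrate the three cases.
x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [10, 20, 30, 40, 50])  # y rises along a straight line
r_negative = pearson_r(x, [50, 40, 30, 20, 10])  # y falls along a straight line
r_none = pearson_r(x, [3, 1, 4, 1, 3])           # no linear trend in y

print(round(r_positive, 2), round(r_negative, 2), round(r_none, 2))
```

Dividing by the two spreads is what keeps the result between -1 and 1, whatever units the variables are measured in.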
Example of a Perfect Linear Relationship (Correlation Coefficient = 1)
We will use a scatter plot to visualize the relationship between
Average_Pulse and Calorie_Burnage (we use the small data set from the sports watch, with 10 observations).

Because we want a scatter plot, we set kind to "scatter":

One line to draw plots inline in a Jupyter notebook (omit it in a plain Python script):

%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt

health_data = pd.read_csv("Documents/Data Science/data2.csv", header=0, sep=",")

health_data.plot(x='Average_Pulse', y='Calorie_Burnage', kind='scatter')

plt.show()

As we saw earlier, there is a perfect linear relationship between Average_Pulse and Calorie_Burnage.

Example of a Perfect Negative Linear Relationship (Correlation Coefficient = -1)
We have plotted fictional data here. The x-axis represents the number of hours worked at our job before a training session.
The y-axis is Calorie_Burnage.

If we work longer hours, we tend to have lower calorie burnage because we are exhausted before the training session.

The correlation coefficient here is -1.

One line to draw plots inline in a Jupyter notebook (omit it in a plain Python script):

%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt

negative_corr = {'Hours_Work_Before_Training': [10,9,8,7,6,5,4,3,2,1],
'Calorie_Burnage': [220,240,260,280,300,320,340,360,380,400]}
negative_corr = pd.DataFrame(data=negative_corr)

negative_corr.plot(x='Hours_Work_Before_Training', y='Calorie_Burnage', kind='scatter')
plt.show()
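The claim that this fictional data has a coefficient of -1 can be verified directly. A minimal sketch, rebuilding the same data frame and using the pandas Series.corr() method, which computes the Pearson coefficient by default; the variable name r is introduced here for illustration.

```python
import pandas as pd

# Same fictional data as in the plot above.
negative_corr = pd.DataFrame({
    "Hours_Work_Before_Training": [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
    "Calorie_Burnage": [220, 240, 260, 280, 300, 320, 340, 360, 380, 400],
})

# Correlation between the two columns (Pearson by default).
r = negative_corr["Hours_Work_Before_Training"].corr(negative_corr["Calorie_Burnage"])

print(round(r, 2))
```

Every extra hour of work lowers Calorie_Burnage by exactly 20 in this data, which is why the points fall on one straight line.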

Example of No Linear Relationship (Correlation coefficient = 0)
Here, we have plotted Max_Pulse against Duration from the full_health_data set.

As you can see, there is no linear relationship between the two variables.
This means that a longer training session does not lead to a higher Max_Pulse.

The correlation coefficient here is 0.

One line to draw plots inline in a Jupyter notebook (omit it in a plain Python script):

%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt

full_health_data = pd.read_csv("Documents/Data Science/data.csv", header=0, sep=",")

full_health_data.plot(x='Duration', y='Max_Pulse', kind='scatter')

plt.show()

Correlation Matrix

A matrix is an array of numbers arranged in rows and columns.

A correlation matrix is simply a table showing the correlation coefficients between variables.

The variables are listed in both the first row and the first column, and each cell holds the correlation coefficient for one pair of variables.

Observations:

We observe that Duration and Calorie_Burnage are closely related, with a correlation coefficient of 0.89. This makes sense, as the longer we train, the more calories we burn.
We observe that there is almost no linear relationship between Average_Pulse and Calorie_Burnage (correlation coefficient of 0.02).
Can we conclude that Average_Pulse does not affect Calorie_Burnage? No. We will come back to this question later!

Correlation Matrix in Python

We can use the corr() method of a pandas DataFrame to create a correlation matrix. We also use the round() function
to round the output to two decimals:

import pandas as pd

full_health_data = pd.read_csv("Documents/Data Science/data.csv", header=0, sep=",")
Corr_Matrix = round(full_health_data.corr(),2)

print(Corr_Matrix)

                 Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  \
Duration             1.00          -0.17       0.00             0.89   
Average_Pulse       -0.17           1.00       0.79             0.02   
Max_Pulse            0.00           0.79       1.00             0.20   
Calorie_Burnage      0.89           0.02       0.20             1.00   
Hours_Work          -0.12          -0.28      -0.27            -0.14   
Hours_Sleep          0.07           0.03       0.09             0.08   

                 Hours_Work  Hours_Sleep  
Duration              -0.12         0.07  
Average_Pulse         -0.28         0.03  
Max_Pulse             -0.27         0.09  
Calorie_Burnage       -0.14         0.08  
Hours_Work             1.00        -0.14  
Hours_Sleep           -0.14         1.00  
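Once the matrix exists, individual coefficients can be read from it by label. The following is a minimal sketch that assumes a small invented DataFrame named sample in place of full_health_data, since the file itself is not reproduced here; the column names follow the data set, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for full_health_data (values invented for illustration).
sample = pd.DataFrame({
    "Duration": [30, 45, 60, 60, 90, 120],
    "Average_Pulse": [110, 105, 100, 115, 98, 102],
    "Calorie_Burnage": [200, 280, 350, 380, 520, 700],
})

corr_matrix = sample.corr().round(2)

# Look up one pair by row label and column label.
duration_vs_calories = corr_matrix.loc["Duration", "Calorie_Burnage"]

# Rank the other variables by their correlation with Calorie_Burnage.
ranked = (
    corr_matrix["Calorie_Burnage"]
    .drop("Calorie_Burnage")
    .sort_values(ascending=False)
)

print(duration_vs_calories)
print(ranked)
```

The matrix is symmetric and has 1.00 on the diagonal, so each pair only needs to be read once.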

Using a Heatmap

We can use a heatmap to visualize the correlation between variables.

The closer the correlation coefficient is to 1, the greener the squares get.

The closer the correlation coefficient is to -1, the browner the squares get.

Use Seaborn to Create a Heatmap
We can use the Seaborn library to create a correlation heat map (Seaborn is a visualization library based on matplotlib):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

full_health_data = pd.read_csv("Documents/Data Science/data.csv", header=0, sep=",")
correlation_full_health = full_health_data.corr()

axis_corr = sns.heatmap(
    correlation_full_health,
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(50, 500, n=500),
    square=True
)

plt.show()

Example Explained:
Import the library seaborn as sns.
Use the full_health_data set.
Use sns.heatmap() to tell Python that we want a heatmap to visualize the correlation matrix.
Pass in the correlation matrix. Define the minimal and maximal values of the heatmap with vmin and vmax. Define that 0 is the center.
Define the colors with sns.diverging_palette. n=500 means that the palette contains 500 colors.
square=True means that each cell is drawn as a square.

Correlation vs. Causality

Correlation Does Not Imply Causality
Correlation measures the numerical relationship between two variables.

A high correlation coefficient (close to 1) does not, on its own, let us conclude
that one variable actually affects the other.

A classic example:

During the summer, the sale of ice cream at a beach increases
At the same time, drowning accidents also increase
Does this mean that the increase in ice cream sales is a direct cause of the increase in drowning accidents?

The Beach Example in Python
Here, we constructed a fictional data set for you to try:

One line to draw plots inline in a Jupyter notebook (omit it in a plain Python script):

%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt

Drowning_Accident = [20,40,60,80,100,120,140,160,180,200]
Ice_Cream_Sale = [20,40,60,80,100,120,140,160,180,200]
Drowning = {"Drowning_Accident": Drowning_Accident,
"Ice_Cream_Sale": Ice_Cream_Sale}
Drowning = pd.DataFrame(data=Drowning)

Drowning.plot(x="Ice_Cream_Sale", y="Drowning_Accident", kind="scatter")
plt.show()

correlation_beach = Drowning.corr()
print(correlation_beach)

                   Drowning_Accident  Ice_Cream_Sale
Drowning_Accident                1.0             1.0
Ice_Cream_Sale                   1.0             1.0

Correlation vs Causality – The Beach Example
In other words: can we conclude that ice cream sales cause drowning accidents?

The answer is: probably not.

It is likely that these two variables are correlated without one causing the other. Both simply rise during the summer.
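One way such a correlation can arise is a shared driver. The sketch below is an illustration under stated assumptions, not an analysis of real data: it assumes a hypothetical temperature variable that raises both ice cream sales and drowning accidents, with invented coefficients and noise levels, and a helper named residuals introduced here. It then compares the raw correlation with the correlation that remains after temperature is accounted for.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n = 1000

# Assumed data-generating process: temperature drives both variables,
# and neither variable has any effect on the other.
temperature = rng.uniform(15, 35, n)
ice_cream_sale = 10 * temperature + rng.normal(0, 20, n)
drowning_accident = 0.5 * temperature + rng.normal(0, 2, n)

# Raw correlation between the two outcome variables.
r_raw = np.corrcoef(ice_cream_sale, drowning_accident)[0, 1]

def residuals(y, x):
    """What is left of y after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation of what remains once temperature is accounted for.
r_adjusted = np.corrcoef(
    residuals(ice_cream_sale, temperature),
    residuals(drowning_accident, temperature),
)[0, 1]

print(round(r_raw, 2), round(r_adjusted, 2))
```

In this simulated setting the raw correlation is strong while the adjusted one is close to zero, which is consistent with neither variable affecting the other.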

What causes drowning then?

Unskilled swimmers
Waves
Cramp
Seizure disorders
Lack of supervision
Alcohol (mis)use
etc.
Let us reverse the argument:

Does a low correlation coefficient (close to zero) mean that change in x does not affect y?

Back to the question:

Can we conclude that Average_Pulse does not affect Calorie_Burnage because of a low correlation coefficient?
The answer is no.
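A small example shows why. The correlation coefficient only captures linear relationships, so a variable can fully determine another and still have a coefficient of zero. The values below are hypothetical and chosen so that y depends entirely on x, but not along a straight line.

```python
import numpy as np

# y is completely determined by x, but the relationship is a curve.
x = np.arange(-5, 6)   # -5, -4, ..., 5
y = x ** 2

# Pearson correlation between x and y.
r = np.corrcoef(x, y)[0, 1]

print(round(abs(r), 2))
```

Changing x always changes y here, yet the rising and falling halves of the curve cancel out in the linear measure.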

There is an important difference between correlation and causality:

Correlation is a number that measures how closely the data are related.
Causality is the conclusion that x causes y.
It is therefore important to reflect critically on the concept of causality when we make predictions!