DS – Linear Regression Case Study

Linear Regression Case Study

Case: Use Duration + Average_Pulse to Predict Calorie_Burnage
Create a Linear Regression Table with Average_Pulse and Duration as Explanatory Variables:

import pandas as pd
import statsmodels.formula.api as smf

full_health_data = pd.read_csv(“Documents/Data Science/data.csv”, header=0, sep=”,”)

model = smf.ols(‘Calorie_Burnage ~ Average_Pulse + Duration’, data = full_health_data)
results = model.fit()

                            OLS Regression Results                            
Dep. Variable:        Calorie_Burnage   R-squared:                       0.816
Model:                            OLS   Adj. R-squared:                  0.814
Method:                 Least Squares   F-statistic:                     355.8
Date:                Tue, 20 Jul 2021   Prob (F-statistic):           1.27e-59
Time:                        17:29:36   Log-Likelihood:                -1007.7
No. Observations:                 163   AIC:                             2021.
Df Residuals:                     160   BIC:                             2031.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
Intercept      -334.5194     73.616     -4.544      0.000    -479.904    -189.135
Average_Pulse     3.1695      0.644      4.922      0.000       1.898       4.441
Duration          5.8424      0.219     26.671      0.000       5.410       6.275
Omnibus:                      160.167   Durbin-Watson:                   2.339
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5096.292
Skew:                           3.383   Prob(JB):                         0.00
Kurtosis:                      29.544   Cond. No.                     1.02e+03

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.02e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Example Explained:
Import the library statsmodels.formula.api as smf. Statsmodels is a statistical library in Python.
Use the full_health_data set.
Create a model based on Ordinary Least Squares with smf.ols().
Notice that the explanatory variable must be written first in the parenthesis. Use the full_health_data data set.
By calling .fit(), you obtain the variable results. This holds a lot of information about the regression model.
Call summary() to get the table with the results of linear regression.

The linear regression function can be rewritten mathematically as:

Calorie_Burnage = Average_Pulse * 3.1695 + Duration * 5.8424 – 334.5194
Rounded to two decimals:

Calorie_Burnage = Average_Pulse * 3.17 + Duration * 5.84 – 334.52

Define the Linear Regression Function in Python
Define the linear regression function in Python to perform predictions.

What is Calorie_Burnage if:

Average pulse is 110 and duration of the training session is 60 minutes?
Average pulse is 140 and duration of the training session is 45 minutes?
Average pulse is 175 and duration of the training session is 20 minutes?

def Predict_Calorie_Burnage(Average_Pulse, Duration):
return(3.1695 * Average_Pulse + 5.8434 * Duration – 334.5194)



The Answers:

Average pulse is 110 and duration of the training session is 60 minutes = 365 Calories
Average pulse is 140 and duration of the training session is 45 minutes = 372 Calories
Average pulse is 175 and duration of the training session is 20 minutes = 337 Calories

Access the Coefficients

Look at the coefficients:

Calorie_Burnage increases with 3.17 if Average_Pulse increases by one.
Calorie_Burnage increases with 5.84 if Duration increases by one.

Access the P-Value

Look at the P-value for each coefficient.

P-value is 0.00 for Average_Pulse, Duration and the Intercept.
The P-value is statistically significant for all of the variables, as it is less than 0.05.
So here we can conclude that Average_Pulse and Duration has a relationship with Calorie_Burnage.

Adjusted R-Squared

There is a problem with R-squared if we have more than one explanatory variable.

R-squared will almost always increase if we add more variables, and will never decrease.

This is because we are adding more data points around the linear regression function.

If we add random variables that does not affect Calorie_Burnage,
we risk to falsely conclude that the linear regression function is a good fit. Adjusted R-squared adjusts for this problem.

It is therefore better to look at the adjusted R-squared value if we have more than one explanatory variable.

The Adjusted R-squared is 0.814.

The value of R-Squared is always between 0 to 1 (0% to 100%).

A high R-Squared value means that many data points are close to the linear regression function line.
A low R-Squared value means that the linear regression function line does not fit the data well.
Conclusion: The model fits the data point well!