DS – Regression Table and Properties

Regression Table

The output from linear regression can be summarized in a regression table.

The content of the table includes:

Information about the model
Coefficients of the linear regression function
Regression statistics
Statistics of the coefficients from the linear regression function
Other information that we will not cover in this module
Regression Table with Average_Pulse as Explanatory Variable

Create a Linear Regression Table in Python

Here is how to create a linear regression table in Python:

import pandas as pd
import statsmodels.formula.api as smf

full_health_data = pd.read_csv(“Documents/Data Science/data.csv”, header=0, sep=”,”)

model = smf.ols(‘Calorie_Burnage ~ Average_Pulse’, data = full_health_data)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:        Calorie_Burnage   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.006
Method:                 Least Squares   F-statistic:                   0.04975
Date:                Tue, 20 Jul 2021   Prob (F-statistic):              0.824
Time:                        16:38:23   Log-Likelihood:                -1145.8
No. Observations:                 163   AIC:                             2296.
Df Residuals:                     161   BIC:                             2302.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept       346.8662    160.615      2.160      0.032      29.682     664.050
Average_Pulse     0.3296      1.478      0.223      0.824      -2.588       3.247
==============================================================================
Omnibus:                      124.542   Durbin-Watson:                   1.620
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              938.541
Skew:                           2.927   Prob(JB):                    1.58e-204
Kurtosis:                      13.195   Cond. No.                         811.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Example Explained:
Import the library statsmodels.formula.api as smf. Statsmodels is a statistical library in Python.
Use the full_health_data set.
Create a model based on Ordinary Least Squares with smf.ols(). Notice that the explanatory variable must be written first in the parenthesis. Use the full_health_data data set.
By calling .fit(), you obtain the variable results. This holds a lot of information about the regression model.
Call summary() to get the table with the results of linear regression.

Regression Table – Info

The “Information Part” in Regression Table

Dep. Variable: is short for “Dependent Variable”. Calorie_Burnage is here the dependent variable. The Dependent variable is here assumed to be explained by Average_Pulse.
Model: OLS is short for Ordinary Least Squares. This is a type of model that uses the Least Square method.
Date: and Time: shows the date and time the output was calculated in Python.

import pandas as pd
import statsmodels.formula.api as smf

full_health_data = pd.read_csv(“Documents/Data Science/data.csv”, header=0, sep=”,”)

model = smf.ols(‘Calorie_Burnage ~ Average_Pulse’, data = full_health_data)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:        Calorie_Burnage   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.006
Method:                 Least Squares   F-statistic:                   0.04975
Date:                Tue, 20 Jul 2021   Prob (F-statistic):              0.824
Time:                        16:43:57   Log-Likelihood:                -1145.8
No. Observations:                 163   AIC:                             2296.
Df Residuals:                     161   BIC:                             2302.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept       346.8662    160.615      2.160      0.032      29.682     664.050
Average_Pulse     0.3296      1.478      0.223      0.824      -2.588       3.247
==============================================================================
Omnibus:                      124.542   Durbin-Watson:                   1.620
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              938.541
Skew:                           2.927   Prob(JB):                    1.58e-204
Kurtosis:                      13.195   Cond. No.                         811.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The Information Part in Regression Table

Dep. Variable: Calorie_Burnage
Model: OLS
Method: Least Squares
Date: Tue, 20 Jul 2021
Time: 16:43:57
No. Observations: 163

The “Coefficients Part” in Regression Table

coef

Intercept 346.8662
Average_Pulse 0.3296

Coef is short for coefficient. It is the output of the linear regression function.
The linear regression function can be rewritten mathematically as:

Calorie_Burnage = 0.3296 * Average_Pulse + 346.8662
These numbers means:

If Average_Pulse increases by 1, Calorie_Burnage increases by 0.3296 (or 0,3 rounded)
If Average_Pulse = 0, the Calorie_Burnage is equal to 346.8662 (or 346.9 rounded).
Remember that the intercept is used to adjust the model’s precision of predicting!
Do you think that this is a good model?

Define the Linear Regression Function in Python

Define the linear regression function in Python to perform predictions.

What is Calorie_Burnage if Average_Pulse is: 120, 130, 150, 180?

def Predict_Calorie_Burnage(Average_Pulse):
return(0.3296 * Average_Pulse + 346.8662)

Try some different values:

print(Predict_Calorie_Burnage(120))
print(Predict_Calorie_Burnage(130))
print(Predict_Calorie_Burnage(150))
print(Predict_Calorie_Burnage(180))

386.4182
389.7142
396.3062
406.1942

Regression Table: P – Value

The “Statistics of the Coefficients Part” in Regression Table

=================================================================================

std err t P>|t| [0.025 0.975]

                   160.615      2.160      0.032      29.682     664.050
                     1.478      0.223      0.824      -2.588       3.247

Now, we want to test if the coefficients from the linear regression function has a
significant impact on the dependent variable (Calorie_Burnage).

This means that we want to prove that it exists a relationship
between Average_Pulse and Calorie_Burnage, using statistical tests.

There are four components that explains the statistics of the coefficients:

std err stands for Standard Error
t is the “t-value” of the coefficients
P>|t| is called the “P-value”
[0.025 0.975] represents the confidence interval of the coefficients
We will focus on understanding the “P-value” in this module.

The P-value
The P-value is a statistical number to conclude if there is a relationship between Average_Pulse and Calorie_Burnage.

We test if the true value of the coefficient is equal to zero (no relationship).
The statistical test for this is called Hypothesis testing.

A low P-value (< 0.05) means that the coefficient is likely not to equal zero. A high P-value (> 0.05) means that we cannot conclude that the explanatory variable
affects the dependent variable (here: if Average_Pulse affects Calorie_Burnage).
A high P-value is also called an insignificant P-value.
Hypothesis Testing
Hypothesis testing is a statistical procedure to test if your results are valid.

In our example, we are testing if the true coefficient of Average_Pulse and the intercept is equal to zero.

Hypothesis test has two statements. The null hypothesis and the alternative hypothesis.

The null hypothesis can be shortly written as H0
The alternative hypothesis can be shortly written as HA
Mathematically written:

H0: Average_Pulse = 0
HA: Average_Pulse ≠ 0
H0: Intercept = 0
HA: Intercept ≠ 0
The sign ≠ means “not equal to”

Hypothesis Testing and P-value
The null hypothesis can either be rejected or not.

If we reject the null hypothesis, we conclude that it exist a relationship between Average_Pulse and Calorie_Burnage.
The P-value is used for this conclusion.

A common threshold of the P-value is 0.05.

Note: A P-value of 0.05 means that 5% of the times, we will falsely reject the null hypothesis.
It means that we accept that 5% of the times, we might falsely have concluded a relationship.

If the P-value is lower than 0.05, we can reject the null hypothesis and conclude that
it exist a relationship between the variables.

However, the P-value of Average_Pulse is 0.824. So, we cannot conclude a relationship
between Average_Pulse and Calorie_Burnage.

It means that there is a 82.4% chance that the true coefficient of Average_Pulse is zero.

The intercept is used to adjust the regression function’s ability to predict more precisely.
It is therefore uncommon to interpret the P-value of the intercept.

R – Squared

R-Squared and Adjusted R-Squared describes how well the linear regression model fits the data points:

R-squared: 0.000
Adj. R-squared: -0.006

The value of R-Squared is always between 0 to 1 (0% to 100%).

A high R-Squared value means that many data points are close to the linear regression function line.
A low R-Squared value means that the linear regression function line does not fit the data well.
Visual Example of a Low R – Squared Value (0.00)
Our regression model shows a R-Squared value of zero,
which means that the linear regression function line does not fit the data well.

This can be visualized when we plot the linear regression
function through the data points of Average_Pulse and Calorie_Burnage.

Linear Regression Using One Explanatory Variable

In this example, we will try to predict Calorie_Burnage with Average_Pulse using Linear Regression:

Three lines to make our compiler able to draw:

import sys
import matplotlib
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

full_health_data = pd.read_csv(“Documents/Data Science/data.csv”, header=0, sep=”,”)

x = full_health_data[“Average_Pulse”]
y = full_health_data[“Calorie_Burnage”]

slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
return slope * x + intercept

mymodel = list(map(myfunc, x))

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.ylim(ymin=0, ymax=2000)
plt.xlim(xmin=0, xmax=200)
plt.xlabel(“Average_Pulse”)
plt.ylabel (“Calorie_Burnage”)
plt.show()

Visual Example of a High R – Squared Value (0.79)
However, if we plot Duration and Calorie_Burnage, the R-Squared increases. Here, we see that the data points
are close to the linear regression function line:

Linear Regression Using One Explanatory Variable

In this example, we will try to predict Calorie_Burnage with Duration using Linear Regression:

Three lines to make our compiler able to draw:

import sys
import matplotlib
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

full_health_data = pd.read_csv(“Documents/Data Science/data.csv”, header=0, sep=”,”)

x = full_health_data[“Duration”]
y = full_health_data[“Calorie_Burnage”]

slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
return slope * x + intercept

mymodel = list(map(myfunc, x))

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.ylim(ymin=0, ymax=2000)
plt.xlim(xmin=0, xmax=200)
plt.xlabel(“Duration”)
plt.ylabel (“Calorie_Burnage”)
plt.show()

Summary – Predicting Calorie_Burnage with Average_Pulse
How can we summarize the linear regression function with Average_Pulse as explanatory variable?

Coefficient of 0.3296, which means that Average_Pulse has a very small effect on Calorie_Burnage.
High P-value (0.824), which means that we cannot conclude a relationship between Average_Pulse and Calorie_Burnage.
R-Squared value of 0, which means that the linear regression function line does not fit the data well.