ML – Scale

Scale Features

When your data has values in very different ranges, or even in different measurement units, it can be difficult to compare them.
How do kilograms compare to meters? Or altitude to time?

The answer to this problem is scaling. We can scale data into new values that are easier to compare.

Take a look at the table below. It is the same data set that we used in the multiple regression chapter,
but this time the Volume column contains values in liters
instead of cm³ (1.0 instead of 1000).

Load cars2.csv into a DataFrame:

import pandas as pd

df = pd.read_csv('Documents/Machine Learning/cars2.csv')

print(df.to_string())

           Car       Model  Volume  Weight  CO2
0       Toyoty        Aygo     1.0     790   99
1   Mitsubishi  Space Star     1.2    1160   95
2        Skoda      Citigo     1.0     929   95
3         Fiat         500     0.9     865   90
4         Mini      Cooper     1.5    1140  105
5           VW         Up!     1.0     929  105
6        Skoda       Fabia     1.4    1109   90
7     Mercedes     A-Class     1.5    1365   92
8         Ford      Fiesta     1.5    1112   98
9         Audi          A1     1.6    1150   99
10     Hyundai         I20     1.1     980   99
11      Suzuki       Swift     1.3     990  101
12        Ford      Fiesta     1.0    1112   99
13       Honda       Civic     1.6    1252   94
14      Hundai         I30     1.6    1326   97
15        Opel       Astra     1.6    1330   97
16         BMW           1     1.6    1365   99
17       Mazda           3     2.2    1280  104
18       Skoda       Rapid     1.6    1119  104
19        Ford       Focus     2.0    1328  105
20        Ford      Mondeo     1.6    1584   94
21        Opel    Insignia     2.0    1428   99
22    Mercedes     C-Class     2.1    1365   99
23       Skoda     Octavia     1.6    1415   99
24       Volvo         S60     2.0    1415   99
25    Mercedes         CLA     1.5    1465  102
26        Audi          A4     2.0    1490  104
27        Audi          A6     2.0    1725  114
28       Volvo         V70     1.6    1523  109
29         BMW           5     2.0    1705  114
30    Mercedes     E-Class     2.1    1605  115
31       Volvo        XC70     2.0    1746  117
32        Ford       B-Max     1.6    1235  104
33         BMW         216     1.6    1390  108
34        Opel      Zafira     1.6    1405  109
35    Mercedes         SLK     2.5    1395  120

It can be difficult to compare the volume 1.0 with the weight 790,
but if we scale them both into comparable values, we can see how far each value lies from the average of its own column.

There are different methods for scaling data. In this tutorial we will use a method called standardization.
The standardization method uses this formula:

z = (x - u) / s

Where z is the new value, x is the original value, u is the mean of the column and s is the standard deviation of the column.

If you take the weight column from the data set above, the first value is 790, and the scaled value will be:

(790 - 1292.28) / 238.74 = -2.1

If you take the volume column from the data set above, the first value is 1.0, and the scaled value will be:

(1.0 - 1.6111) / 0.3835 = -1.59

Now you can compare -2.1 with -1.59 instead of comparing 790 with 1.0.
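
The arithmetic above can be reproduced with the Python standard library. The following is a minimal sketch, not part of the original example: the helper name standardize is made up for illustration, and the two lists are the Weight and Volume columns copied from the table. It uses the population standard deviation (pstdev), which is what StandardScaler also uses.

```python
from statistics import fmean, pstdev

# Weight and Volume columns copied from the table above.
weight = [790, 1160, 929, 865, 1140, 929, 1109, 1365, 1112, 1150, 980, 990,
          1112, 1252, 1326, 1330, 1365, 1280, 1119, 1328, 1584, 1428, 1365,
          1415, 1415, 1465, 1490, 1725, 1523, 1705, 1605, 1746, 1235, 1390,
          1405, 1395]
volume = [1.0, 1.2, 1.0, 0.9, 1.5, 1.0, 1.4, 1.5, 1.5, 1.6, 1.1, 1.3,
          1.0, 1.6, 1.6, 1.6, 1.6, 2.2, 1.6, 2.0, 1.6, 2.0, 2.1,
          1.6, 2.0, 1.5, 2.0, 2.0, 1.6, 2.0, 2.1, 2.0, 1.6, 1.6,
          1.6, 2.5]


def standardize(x, values):
    """Hypothetical helper: return z = (x - u) / s for one value x."""
    u = fmean(values)   # mean of the column
    s = pstdev(values)  # population standard deviation of the column
    return (x - u) / s


z_weight = standardize(790, weight)
z_volume = standardize(1.0, volume)

# Rounded to two decimals, these should match the hand calculation above.
print(round(z_weight, 2), round(z_volume, 2))
```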

You do not have to do this manually. The Python sklearn module has a class called StandardScaler.
Calling StandardScaler() creates a scaler object with methods for transforming data sets.

Scale all values in the Weight and Volume columns:

import pandas
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()

df = pandas.read_csv("Documents/Machine Learning/cars2.csv")

X = df[['Weight', 'Volume']]

scaledX = scale.fit_transform(X)

print(scaledX)

[[-2.10389253 -1.59336644]
 [-0.55407235 -1.07190106]
 [-1.52166278 -1.59336644]
 [-1.78973979 -1.85409913]
 [-0.63784641 -0.28970299]
 [-1.52166278 -1.59336644]
 [-0.76769621 -0.55043568]
 [ 0.3046118  -0.28970299]
 [-0.7551301  -0.28970299]
 [-0.59595938 -0.0289703 ]
 [-1.30803892 -1.33263375]
 [-1.26615189 -0.81116837]
 [-0.7551301  -1.59336644]
 [-0.16871166 -0.0289703 ]
 [ 0.14125238 -0.0289703 ]
 [ 0.15800719 -0.0289703 ]
 [ 0.3046118  -0.0289703 ]
 [-0.05142797  1.53542584]
 [-0.72580918 -0.0289703 ]
 [ 0.14962979  1.01396046]
 [ 1.2219378  -0.0289703 ]
 [ 0.5685001   1.01396046]
 [ 0.3046118   1.27469315]
 [ 0.51404696 -0.0289703 ]
 [ 0.51404696  1.01396046]
 [ 0.72348212 -0.28970299]
 [ 0.8281997   1.01396046]
 [ 1.81254495  1.01396046]
 [ 0.96642691 -0.0289703 ]
 [ 1.72877089  1.01396046]
 [ 1.30990057  1.27469315]
 [ 1.90050772  1.01396046]
 [-0.23991961 -0.0289703 ]
 [ 0.40932938 -0.0289703 ]
 [ 0.47215993 -0.0289703 ]
 [ 0.4302729   2.31762392]]

Note that the first two values are -2.1 and -1.59, which correspond to our calculations.
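
To see what the scaler has learned, you can inspect its mean_ and scale_ attributes after fitting. The sketch below is an illustration only and assumes NumPy is installed. It uses just the first five rows of the table, so its numbers will differ from the full output above. It checks that fit_transform() gives the same result as applying z = (x - u) / s by hand, and that each scaled column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five rows of the Weight and Volume columns from the table above.
X = np.array([
    [790, 1.0],
    [1160, 1.2],
    [929, 1.0],
    [865, 0.9],
    [1140, 1.5],
])

scaler = StandardScaler()
scaled = scaler.fit_transform(X)

# mean_ holds u and scale_ holds s, one entry per column.
manual = (X - scaler.mean_) / scaler.scale_

print(np.allclose(scaled, manual))  # same result as the formula
print(scaled.mean(axis=0))          # close to 0 for each column
print(scaled.std(axis=0))           # close to 1 for each column

# inverse_transform() maps scaled values back to the original units.
restored = scaler.inverse_transform(scaled)
```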

Predict CO2 Values

The task in the Multiple Regression chapter was to predict the CO2 emission from a car when you only knew its weight and volume.

When the model is trained on scaled data, you must scale any new values with the same scaler before you predict from them:

Example
Predict the CO2 emission from a 1.3 liter car that weighs 2300 kilograms:

import pandas
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()

df = pandas.read_csv("Documents/Machine Learning/cars2.csv")

X = df[['Weight', 'Volume']]
y = df['CO2']

scaledX = scale.fit_transform(X)

regr = linear_model.LinearRegression()
regr.fit(scaledX, y)

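# Use transform(), not fit_transform(), so the new row is scaled with the
# mean and standard deviation learned from the training data.
# The column order must match X: Weight first, then Volume.
scaled = scale.transform([[2300, 1.3]])

predictedCO2 = regr.predict(scaled)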
print(predictedCO2)

[107.2087328]
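
If you would rather not call transform() yourself before every prediction, one option is to bundle the scaler and the regression into a scikit-learn pipeline. This is a sketch of an alternative, not the method used in the example above. To keep it self-contained it reads the Weight, Volume and CO2 columns from lists copied from the table instead of from cars2.csv, and it should give the same prediction as the example.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Weight, Volume and CO2 columns copied from the table above.
weight = [790, 1160, 929, 865, 1140, 929, 1109, 1365, 1112, 1150, 980, 990,
          1112, 1252, 1326, 1330, 1365, 1280, 1119, 1328, 1584, 1428, 1365,
          1415, 1415, 1465, 1490, 1725, 1523, 1705, 1605, 1746, 1235, 1390,
          1405, 1395]
volume = [1.0, 1.2, 1.0, 0.9, 1.5, 1.0, 1.4, 1.5, 1.5, 1.6, 1.1, 1.3,
          1.0, 1.6, 1.6, 1.6, 1.6, 2.2, 1.6, 2.0, 1.6, 2.0, 2.1,
          1.6, 2.0, 1.5, 2.0, 2.0, 1.6, 2.0, 2.1, 2.0, 1.6, 1.6,
          1.6, 2.5]
co2 = [99, 95, 95, 90, 105, 105, 90, 92, 98, 99, 99, 101,
       99, 94, 97, 97, 99, 104, 104, 105, 94, 99, 99,
       99, 99, 102, 104, 114, 109, 114, 115, 117, 104, 108,
       109, 120]

# One row per car: [Weight, Volume].
X = [[w, v] for w, v in zip(weight, volume)]

# fit() fits the scaler on X, then fits the regression on the scaled X.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X, co2)

# New data is passed in its original units. The pipeline scales it with the
# mean and standard deviation learned during fit() before predicting.
predicted = model.predict([[2300, 1.3]])
print(predicted)
```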