Scale Features
When the columns of your data have very different ranges of values, or even different units of measurement, it can be difficult to compare them.
How do kilograms compare to meters? Or altitude to time?
The answer to this problem is scaling. We can scale data into new values that are easier to compare.
Take a look at the table below. It is the same data set that we used in the multiple regression chapter,
but this time the Volume column contains values in liters
instead of cm³ (1.0 instead of 1000).
Load cars2.csv into a DataFrame:
import pandas as pd
df = pd.read_csv('Documents/Machine Learning/cars2.csv')
print(df.to_string())
    Car         Model       Volume  Weight  CO2
0   Toyoty      Aygo           1.0     790   99
1   Mitsubishi  Space Star     1.2    1160   95
2   Skoda       Citigo         1.0     929   95
3   Fiat        500            0.9     865   90
4   Mini        Cooper         1.5    1140  105
5   VW          Up!            1.0     929  105
6   Skoda       Fabia          1.4    1109   90
7   Mercedes    A-Class        1.5    1365   92
8   Ford        Fiesta         1.5    1112   98
9   Audi        A1             1.6    1150   99
10  Hyundai     I20            1.1     980   99
11  Suzuki      Swift          1.3     990  101
12  Ford        Fiesta         1.0    1112   99
13  Honda       Civic          1.6    1252   94
14  Hundai      I30            1.6    1326   97
15  Opel        Astra          1.6    1330   97
16  BMW         1              1.6    1365   99
17  Mazda       3              2.2    1280  104
18  Skoda       Rapid          1.6    1119  104
19  Ford        Focus          2.0    1328  105
20  Ford        Mondeo         1.6    1584   94
21  Opel        Insignia       2.0    1428   99
22  Mercedes    C-Class        2.1    1365   99
23  Skoda       Octavia        1.6    1415   99
24  Volvo       S60            2.0    1415   99
25  Mercedes    CLA            1.5    1465  102
26  Audi        A4             2.0    1490  104
27  Audi        A6             2.0    1725  114
28  Volvo       V70            1.6    1523  109
29  BMW         5              2.0    1705  114
30  Mercedes    E-Class        2.1    1605  115
31  Volvo       XC70           2.0    1746  117
32  Ford        B-Max          1.6    1235  104
33  BMW         216            1.6    1390  108
34  Opel        Zafira         1.6    1405  109
35  Mercedes    SLK            2.5    1395  120
It can be difficult to compare the volume 1.0 with the weight 790,
but if we scale them both into comparable values, we can easily see how the two values relate to each other.
There are different methods for scaling data. In this tutorial we will use a method called standardization.
The standardization method uses this formula:
z = (x - u) / s
Where z is the new value, x is the original value, u is the mean and s is the standard deviation.
If you take the weight column from the data set above, the first value is 790, and the scaled value will be:
(790 - 1292.28) / 238.74 = -2.1
If you take the volume column from the data set above, the first value is 1.0, and the scaled value will be:
(1.0 - 1.611) / 0.3835 = -1.59
Now you can compare -2.1 with -1.59 instead of comparing 790 with 1.0.
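As a rough check of the arithmetic above, the following sketch standardizes both columns by hand using only the Python standard library. The two lists are copied from the table above, and the helper name standardize is made up for this illustration. It assumes the population standard deviation (dividing by n), which is the convention StandardScaler follows; pandas' own std() divides by n - 1 by default and would give slightly different numbers.

```python
from statistics import fmean, pstdev

# Weight and Volume columns copied from the cars2.csv table above
weight = [790, 1160, 929, 865, 1140, 929, 1109, 1365, 1112, 1150, 980, 990,
          1112, 1252, 1326, 1330, 1365, 1280, 1119, 1328, 1584, 1428, 1365,
          1415, 1415, 1465, 1490, 1725, 1523, 1705, 1605, 1746, 1235, 1390,
          1405, 1395]
volume = [1.0, 1.2, 1.0, 0.9, 1.5, 1.0, 1.4, 1.5, 1.5, 1.6, 1.1, 1.3,
          1.0, 1.6, 1.6, 1.6, 1.6, 2.2, 1.6, 2.0, 1.6, 2.0, 2.1, 1.6,
          2.0, 1.5, 2.0, 2.0, 1.6, 2.0, 2.1, 2.0, 1.6, 1.6, 1.6, 2.5]


def standardize(values):
    """Hypothetical helper: return z = (x - u) / s for every x in values."""
    u = fmean(values)   # mean of the column
    s = pstdev(values)  # population standard deviation (divide by n)
    return [(x - u) / s for x in values]


z_weight = standardize(weight)
z_volume = standardize(volume)

# Scaled values for the first car in the table
print(round(z_weight[0], 2), round(z_volume[0], 2))  # -2.1 -1.59
```

After standardization each column has a mean of 0 and a standard deviation of 1, which is what makes the two columns comparable.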
You do not have to do this manually. The Python sklearn module provides a class called StandardScaler,
which creates a scaler object with methods for transforming data sets.
Scale all values in the Weight and Volume columns:
import pandas
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
df = pandas.read_csv("Documents/Machine Learning/cars2.csv")
# Select the two feature columns to scale
X = df[['Weight', 'Volume']]
# Learn each column's mean and standard deviation, then standardize
scaledX = scale.fit_transform(X)
print(scaledX)
[[-2.10389253 -1.59336644]
 [-0.55407235 -1.07190106]
 [-1.52166278 -1.59336644]
 [-1.78973979 -1.85409913]
 [-0.63784641 -0.28970299]
 [-1.52166278 -1.59336644]
 [-0.76769621 -0.55043568]
 [ 0.3046118  -0.28970299]
 [-0.7551301  -0.28970299]
 [-0.59595938 -0.0289703 ]
 [-1.30803892 -1.33263375]
 [-1.26615189 -0.81116837]
 [-0.7551301  -1.59336644]
 [-0.16871166 -0.0289703 ]
 [ 0.14125238 -0.0289703 ]
 [ 0.15800719 -0.0289703 ]
 [ 0.3046118  -0.0289703 ]
 [-0.05142797  1.53542584]
 [-0.72580918 -0.0289703 ]
 [ 0.14962979  1.01396046]
 [ 1.2219378  -0.0289703 ]
 [ 0.5685001   1.01396046]
 [ 0.3046118   1.27469315]
 [ 0.51404696 -0.0289703 ]
 [ 0.51404696  1.01396046]
 [ 0.72348212 -0.28970299]
 [ 0.8281997   1.01396046]
 [ 1.81254495  1.01396046]
 [ 0.96642691 -0.0289703 ]
 [ 1.72877089  1.01396046]
 [ 1.30990057  1.27469315]
 [ 1.90050772  1.01396046]
 [-0.23991961 -0.0289703 ]
 [ 0.40932938 -0.0289703 ]
 [ 0.47215993 -0.0289703 ]
 [ 0.4302729   2.31762392]]
Note that the first two values are -2.1 and -1.59, which correspond to our calculations.
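If you want to see what the scaler learned, a fitted StandardScaler keeps the per-column statistics in its mean_ and scale_ attributes. The sketch below is only an illustration: it uses the first five rows of the table as inline data instead of reading cars2.csv, so its statistics differ from those of the full data set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five rows of the table above (Weight, Volume); a subset only,
# so the mean and standard deviation differ from the full 36-row data set.
X = np.array([[790, 1.0], [1160, 1.2], [929, 1.0], [865, 0.9], [1140, 1.5]])

scaler = StandardScaler().fit(X)
print(scaler.mean_)   # per-column mean learned from X
print(scaler.scale_)  # per-column population standard deviation

# transform() applies (x - mean_) / scale_ column by column
manual = (X - scaler.mean_) / scaler.scale_
print(np.allclose(manual, scaler.transform(X)))
```

The last line compares the hand-applied formula with the output of transform(), which is a quick way to confirm that the scaler implements z = (x - u) / s.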
Predict CO2 Values
The task in the Multiple Regression chapter was to predict the CO2 emission from a car when you only knew its weight and volume.
Because the model is trained on scaled data, you must scale new input values with the same fitted scaler before you predict:
Example
Predict the CO2 emission from a 1.3 liter car that weighs 2300 kilograms:
import pandas
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
df = pandas.read_csv("Documents/Machine Learning/cars2.csv")
X = df[['Weight', 'Volume']]
y = df['CO2']
# Fit the scaler on the training features and standardize them
scaledX = scale.fit_transform(X)
# Train the regression model on the scaled features
regr = linear_model.LinearRegression()
regr.fit(scaledX, y)
# Scale the new car with the same fitted scaler (column order: Weight, Volume)
scaled = scale.transform([[2300, 1.3]])
predictedCO2 = regr.predict(scaled)
print(predictedCO2)
[107.2087328]
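To see what the model actually receives for the new car, here is a small sketch that applies the standardization formula with the training statistics. The constants are rounded values computed from the table above, not output quoted from the tutorial, and the helper name scale_car is made up for this illustration.

```python
# Rounded training statistics for the 36 cars (assumed, computed from the table)
WEIGHT_MEAN, WEIGHT_STD = 1292.28, 238.74
VOLUME_MEAN, VOLUME_STD = 1.611, 0.3835


def scale_car(weight, volume):
    """Hypothetical helper mirroring what scale.transform does for one car."""
    return ((weight - WEIGHT_MEAN) / WEIGHT_STD,
            (volume - VOLUME_MEAN) / VOLUME_STD)


z_weight, z_volume = scale_car(2300, 1.3)
print(round(z_weight, 2), round(z_volume, 2))  # 4.22 -0.81
```

A weight about 4.2 standard deviations above the mean is heavier than every car in the table (the heaviest is 1746 kg), so this prediction is an extrapolation beyond the training data.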