DS – Data Preparation and Cleansing

Data Preparation

Before analyzing data, a Data Scientist must extract the data, and make it clean and valuable.

Extract and Read Data With Pandas Before data can be analyzed, it must be imported/extracted.

In the example below, we show you how to import data using Pandas in Python.

We use the read_csv() function to import a CSV file for the health data:

Dataset of Full Health Data

import pandas as pd

full_health_data = pd.read_csv(“Documents/Data Science/data.csv”, header=0, sep=”,”)

print(full_health_data.to_string())

     Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
0          60            110        130              409         0.0          8.0
1          60            117        145              479         0.0          8.0
2          60            103        135              340         8.0          7.5
3          45            109        175              282         8.0          8.0
4          45            117        148              406         0.0          6.5
5          60            102        127              300         0.0          7.5
6          60            110        136              374         0.0          7.5
7          45            104        134              253         0.0          9.0
8          30            109        133              195         8.0          8.0
9          60             98        124              269         7.5          8.0
10         60            103        147              329         8.0          7.0
11         60            100        120              250         0.0          5.0
12         60            106        128              345         8.5          7.5
13         60            104        132              379         8.5          7.5
14         60             98        123              275         9.0          7.0
15         60             98        120              215         6.5          7.0
16         60            100        120              300         9.0          8.0
17         45             90        112              180         8.0          8.0
18         60            103        123              323         8.0          8.0
19         45             97        125              243         8.0          7.0
20         60            108        131              364         8.0          7.0
21         45            100        119              282         0.0          7.0
22         60            130        101              300         8.0          7.0
23         45            105        132              246         0.0          8.0
24         60            102        126              334         8.0          8.0
25         60            100        120              250         8.0          7.0
26         60             92        118              241         7.0          6.0
27         60            103        132              353         9.0          8.5
28         60            100        132              280         3.0          8.0
29         60            102        129              380         3.0          8.0
30         60             92        115              243         7.5          8.0
31         60            101        124              299         8.5          7.5
32         60             93        113              223         8.0          7.0
33         60            107        136              361         2.0          8.0
34         60            114        140              415         9.0          8.5
35         60            100        120              300         8.0          7.0
36         60            100        120              300         7.0          6.0
37         45            104        129              266         8.0          7.0
38         60             98        126              286         7.5          7.0
39         60            100        122              329         8.0          7.0
40         60            111        138              400         8.0          8.0
41         60            111        131              397         0.0          8.0
42         60             99        119              273         8.0          7.0
43         60            109        153              387         8.0          7.0
44         45            111        136              300        11.0          7.0
45         45            108        129              298         5.0          7.0
46         60            111        139              397         0.0          8.0
47         60            107        136              380         8.5          8.0
48         80            123        146              643         8.5          8.0
49         60            106        130              263         8.0          8.0
50         60            118        151              486         7.0          7.0
51         30            136        175              238         0.0          7.5
52         60            121        146              450         5.0          8.0
53         60            118        121              413         8.0          7.0
54         45            115        144              305         5.0          7.0
55         20            153        172              226         0.0          8.0
56         45            123        152              321         5.0          8.0
57        210            108        160             1376         8.0          8.0
58        160            110        137             1034         8.0          8.0
59        160            109        135              853         8.0          8.0
60         45            118        141              341         8.0          8.0
61         20            110        130              131         0.0          8.0
62        180             90        130              800         0.0          8.0
63        150            105        135              873         8.0          8.0
64        150            107        130              816         3.0          8.0
65         20            106        136              110         3.0          8.0
66        300            108        143             1500         0.0          8.0
67        150             97        129             1115         0.0          8.0
68         90            100        127              700         0.0          8.0
69        150             97        127              953         0.0          8.0
70         45            114        146              304         0.0          8.0
71         90             98        125              563         0.0          8.0
72         45            105        134              251         0.0          8.0
73         45            110        141              300         7.0          7.5
74        120            100        130              500         7.0          7.5
75        270            100        131             1729         0.0          7.0
76         30            159        182              319         5.0          7.5
77         45            149        169              344         0.0          8.0
78         30            103        139              151         4.0          8.0
79        120            100        130              500         0.0          8.0
80         45            100        120              225         8.5          8.0
81         30            151        170              300         0.0          7.0
82         45            102        136              234         6.5          8.0
83        120            100        157             1000         0.0          8.0
84         45            129        103              242         5.0          8.0
85         20             83        107               50         9.5          8.0
86        180            101        127              600         0.0          7.5
87         45            107        137             1200         8.0          8.0
88         30             90        107              105         7.0          8.0
89         15             80        100               50         9.0          8.0
90         20            150        171              127         0.0          7.5
91         20            151        168              229         0.0          7.0
92         30             95        128              128         8.0          7.5
93         25            152        168              244         0.0          8.0
94         30            109        131              188         8.0          7.5
95         90             93        124              604         8.0          8.0
96         20             95        112               77         8.0          8.0
97         90             90        110              500         8.0          8.0
98         90             90        100              500         0.0          8.0
99         90             90        100              500         5.0          8.0
100        30             92        108               92         8.0          8.0
101        30             93        128              124         8.0          8.0
102       180             90        120              800         8.0          8.0
103        30             90        120               86         7.0          8.0
104        90             90        120              500         8.0          8.0
105       210            137        184             1860         0.0          7.5
106        60            102        124              325         0.0          8.0
107        45            107        124              275         0.0          8.0
108        15            124        139              124         5.0          6.0
109        60            108        131              367         0.0          8.0
110        60            108        151              351         0.0          8.0
111        60            116        141              443         0.0          8.0
112        60             97        122              277         0.0          8.0
113        60            105        125              353         0.0          8.0
114        60            103        124              332         0.0          8.0
115        30            112        137              193         0.0          8.0
116        45            100        120              100         0.0          7.5
117        60            119        169              336         0.0          8.0
118        60            107        127              344         0.0          8.0
119        60            111        151              368         0.0          8.0
120        60             98        122              271         0.0          8.0
121        60             97        124              275         0.0          8.0
122        60            109        127              382         0.0          6.0
123        90             99        125              466         0.0          7.5
124        60            114        151              384         0.0         12.0
125        60            104        134              342         0.0          7.5
126        60            107        138              357         0.0          7.5
127        60            103        133              335         0.0          7.5
128        60            106        132              327         0.0          7.5
129        60            103        136              339         0.0          7.5
130        20            136        156              189         8.5          7.5
131        45            117        143              317         0.0          8.0
132        45            115        137              318         0.0          7.5
133        45            113        138              308         0.0          7.0
134        20            141        162              222         0.0          7.5
135        60            108        135              390         0.0          8.0
136        60             97        127              288         0.0          8.0
137        45            100        120              250         0.0          8.0
138        45            122        149              335         0.0          7.5
139        60            136        170              470         8.0          8.0
140        45            106        126              270         8.0          8.0
141        60            107        136              400         9.5          7.0
142        60            112        146              361         9.0          7.5
143        30            103        127              185         9.0          7.5
144        60            110        150              409         8.0          7.0
145        60            106        134              343         8.0          8.5
146        60            109        129              353         8.0          8.0
147        60            109        138              374         8.0          6.0
148        30            150        167              275         0.0          9.0
149        60            105        128              328         9.0          8.0
150        60             97        131              270         9.0          8.0
151        60            100        120              270         8.0          8.0
152        60            114        150              382         0.0          8.0
153        30             80        120              240        10.0          7.0
154        30             85        120              250        10.0          7.0
155        45             90        130              260         8.0          7.0
156        45             95        130              270         8.0          7.0
157        45            100        140              280         0.0          7.0
158        60            105        140              290         7.0          8.0
159        60            110        145              300         7.0          8.0
160        60            115        145              310         8.0          8.0
161        75            120        150              320         0.0          8.0
162        75            125        150              330         8.0          8.0

Example Explained:
Import the Pandas library
Name the data frame as full_health_data.
header=0 means that the headers for the variable names are to be
found in the first row (note that 0 means the first row in Python)
sep=”,” means that “,” is used as the separator between the values.
This is because we are using the file type .csv (comma separated values)
Tip: If you have a large CSV file, you can use the head() function to only show the top 5 rows:

Data Cleaning

Look at the imported data to find “dirty” with wrongly or unregistered values:

In Case of –
Blank Rows or non-numeric values

Solution: We can remove the rows with missing observations to fix this problem.
When we load a data set using Pandas, all blank cells are automatically converted into “NaN” values.
So, removing the NaN cells gives us a clean data set that can be analyzed.
We can use the dropna() function to remove the NaNs. axis=0 means that we want to remove all rows that have a NaN value:

use dropna() to drop NaN values

import pandas as pd

full_health_data = pd.read_csv(“Documents/Data Science/data.csv”, header=0, sep=”,”)
full_health_data.dropna(axis=0,inplace=True)

print(full_health_data)

     Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work  \
0          60            110        130              409         0.0   
1          60            117        145              479         0.0   
2          60            103        135              340         8.0   
3          45            109        175              282         8.0   
4          45            117        148              406         0.0   
..        ...            ...        ...              ...         ...   
158        60            105        140              290         7.0   
159        60            110        145              300         7.0   
160        60            115        145              310         8.0   
161        75            120        150              320         0.0   
162        75            125        150              330         8.0   

     Hours_Sleep  
0            8.0  
1            8.0  
2            7.5  
3            8.0  
4            6.5  
..           ...  
158          8.0  
159          8.0  
160          8.0  
161          8.0  
162          8.0  

[163 rows x 6 columns]

Data Categories

To analyze data, we also need to know the types of data we are dealing with.

Data can be split into three main categories:

Numerical – Contains numerical values. Can be divided into two categories:
Discrete: Numbers are counted as “whole”. Example: You cannot have trained 2.5 sessions, it is either 2 or 3
Continuous: Numbers can be of infinite precision. For example, you can sleep for 7 hours, 30 minutes and 20 seconds, or 7.533 hours
Categorical – Contains values that cannot be measured up against each other. Example: A color or a type of training
Ordinal – Contains categorical data that can be measured up against each other. Example: School grades where A is better than B and so on
By knowing the type of your data, you will be able to know what technique to use when analyzing them.

Data Types

We can use the info() function to list the data types within our data set:

import pandas as pd

full_health_data = pd.read_csv(“Documents/Data Science/data.csv”, header=0, sep=”,”)

print(full_health_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163 entries, 0 to 162
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Duration         163 non-null    int64  
 1   Average_Pulse    163 non-null    int64  
 2   Max_Pulse        163 non-null    int64  
 3   Calorie_Burnage  163 non-null    int64  
 4   Hours_Work       163 non-null    float64
 5   Hours_Sleep      163 non-null    float64
dtypes: float64(2), int64(4)
memory usage: 7.8 KB
None

In case we see that this data set has an object type of data:

Float64
Object
We cannot use objects to calculate and perform analysis here. We must convert the type object to float64
(float64 is a number with a decimal in Python).

We can use the astype() function to convert the data into float64.

Example
import pandas as pd

health_data = pd.read_csv(“Documents/Data Science/data.csv”, header=0, sep=”,”)
health_data.dropna(axis=0,inplace=True)

health_data[“Average_Pulse”] = health_data[‘Average_Pulse’].astype(float)
health_data[“Max_Pulse”] = health_data[“Max_Pulse”].astype(float)

print(health_data.info())

In our case, we don’t have any object type

Analyze the Data

When we have cleaned the data set, we can start analyzing the data.

We can use the describe() function in Python to summarize data:

import pandas as pd

full_health_data = pd.read_csv(“Documents/Data Science/data.csv”, header=0, sep=”,”)
pd.set_option(‘display.max_columns’,None)

print(full_health_data.describe())

         Duration  Average_Pulse   Max_Pulse  Calorie_Burnage  Hours_Work  \
count  163.000000     163.000000  163.000000       163.000000  163.000000   
mean    64.263804     107.723926  134.226994       382.368098    4.386503   
std     42.994520      14.625062   16.403967       274.227106    3.923772   
min     15.000000      80.000000  100.000000        50.000000    0.000000   
25%     45.000000     100.000000  124.000000       256.500000    0.000000   
50%     60.000000     105.000000  131.000000       320.000000    5.000000   
75%     60.000000     111.000000  141.000000       388.500000    8.000000   
max    300.000000     159.000000  184.000000      1860.000000   11.000000   

       Hours_Sleep  
count   163.000000  
mean      7.680982  
std       0.663934  
min       5.000000  
25%       7.500000  
50%       8.000000  
75%       8.000000  
max      12.000000  

Conclusion

The above results gives us the stats of the data

Count – Counts the number of observations
Mean – The average value
Std – Standard deviation
Min – The lowest value
25%, 50% and 75% are percentiles
Max – The highest value

Advertisement