Data Preparation
Before analyzing data, a Data Scientist must extract the data, and make it clean and valuable.
Extract and Read Data With Pandas Before data can be analyzed, it must be imported/extracted.
In the example below, we show you how to import data using Pandas in Python.
We use the read_csv() function to import a CSV file for the health data:
Dataset of Full Health Data
import pandas as pd
full_health_data = pd.read_csv(“Documents/Data Science/data.csv”, header=0, sep=”,”)
print(full_health_data.to_string())
Duration Average_Pulse Max_Pulse Calorie_Burnage Hours_Work Hours_Sleep 0 60 110 130 409 0.0 8.0 1 60 117 145 479 0.0 8.0 2 60 103 135 340 8.0 7.5 3 45 109 175 282 8.0 8.0 4 45 117 148 406 0.0 6.5 5 60 102 127 300 0.0 7.5 6 60 110 136 374 0.0 7.5 7 45 104 134 253 0.0 9.0 8 30 109 133 195 8.0 8.0 9 60 98 124 269 7.5 8.0 10 60 103 147 329 8.0 7.0 11 60 100 120 250 0.0 5.0 12 60 106 128 345 8.5 7.5 13 60 104 132 379 8.5 7.5 14 60 98 123 275 9.0 7.0 15 60 98 120 215 6.5 7.0 16 60 100 120 300 9.0 8.0 17 45 90 112 180 8.0 8.0 18 60 103 123 323 8.0 8.0 19 45 97 125 243 8.0 7.0 20 60 108 131 364 8.0 7.0 21 45 100 119 282 0.0 7.0 22 60 130 101 300 8.0 7.0 23 45 105 132 246 0.0 8.0 24 60 102 126 334 8.0 8.0 25 60 100 120 250 8.0 7.0 26 60 92 118 241 7.0 6.0 27 60 103 132 353 9.0 8.5 28 60 100 132 280 3.0 8.0 29 60 102 129 380 3.0 8.0 30 60 92 115 243 7.5 8.0 31 60 101 124 299 8.5 7.5 32 60 93 113 223 8.0 7.0 33 60 107 136 361 2.0 8.0 34 60 114 140 415 9.0 8.5 35 60 100 120 300 8.0 7.0 36 60 100 120 300 7.0 6.0 37 45 104 129 266 8.0 7.0 38 60 98 126 286 7.5 7.0 39 60 100 122 329 8.0 7.0 40 60 111 138 400 8.0 8.0 41 60 111 131 397 0.0 8.0 42 60 99 119 273 8.0 7.0 43 60 109 153 387 8.0 7.0 44 45 111 136 300 11.0 7.0 45 45 108 129 298 5.0 7.0 46 60 111 139 397 0.0 8.0 47 60 107 136 380 8.5 8.0 48 80 123 146 643 8.5 8.0 49 60 106 130 263 8.0 8.0 50 60 118 151 486 7.0 7.0 51 30 136 175 238 0.0 7.5 52 60 121 146 450 5.0 8.0 53 60 118 121 413 8.0 7.0 54 45 115 144 305 5.0 7.0 55 20 153 172 226 0.0 8.0 56 45 123 152 321 5.0 8.0 57 210 108 160 1376 8.0 8.0 58 160 110 137 1034 8.0 8.0 59 160 109 135 853 8.0 8.0 60 45 118 141 341 8.0 8.0 61 20 110 130 131 0.0 8.0 62 180 90 130 800 0.0 8.0 63 150 105 135 873 8.0 8.0 64 150 107 130 816 3.0 8.0 65 20 106 136 110 3.0 8.0 66 300 108 143 1500 0.0 8.0 67 150 97 129 1115 0.0 8.0 68 90 100 127 700 0.0 8.0 69 150 97 127 953 0.0 8.0 70 45 114 146 304 0.0 8.0 71 90 98 125 563 0.0 8.0 72 45 105 134 251 0.0 8.0 73 45 110 141 300 7.0 7.5 74 120 100 130 500 7.0 7.5 75 270 100 131 1729 0.0 7.0 76 30 159 182 319 5.0 7.5 77 45 149 169 344 0.0 8.0 78 30 103 139 151 4.0 8.0 79 120 100 130 500 0.0 8.0 80 45 100 120 225 8.5 8.0 81 30 151 170 300 0.0 7.0 82 45 102 136 234 6.5 8.0 83 120 100 157 1000 0.0 8.0 84 45 129 103 242 5.0 8.0 85 20 83 107 50 9.5 8.0 86 180 101 127 600 0.0 7.5 87 45 107 137 1200 8.0 8.0 88 30 90 107 105 7.0 8.0 89 15 80 100 50 9.0 8.0 90 20 150 171 127 0.0 7.5 91 20 151 168 229 0.0 7.0 92 30 95 128 128 8.0 7.5 93 25 152 168 244 0.0 8.0 94 30 109 131 188 8.0 7.5 95 90 93 124 604 8.0 8.0 96 20 95 112 77 8.0 8.0 97 90 90 110 500 8.0 8.0 98 90 90 100 500 0.0 8.0 99 90 90 100 500 5.0 8.0 100 30 92 108 92 8.0 8.0 101 30 93 128 124 8.0 8.0 102 180 90 120 800 8.0 8.0 103 30 90 120 86 7.0 8.0 104 90 90 120 500 8.0 8.0 105 210 137 184 1860 0.0 7.5 106 60 102 124 325 0.0 8.0 107 45 107 124 275 0.0 8.0 108 15 124 139 124 5.0 6.0 109 60 108 131 367 0.0 8.0 110 60 108 151 351 0.0 8.0 111 60 116 141 443 0.0 8.0 112 60 97 122 277 0.0 8.0 113 60 105 125 353 0.0 8.0 114 60 103 124 332 0.0 8.0 115 30 112 137 193 0.0 8.0 116 45 100 120 100 0.0 7.5 117 60 119 169 336 0.0 8.0 118 60 107 127 344 0.0 8.0 119 60 111 151 368 0.0 8.0 120 60 98 122 271 0.0 8.0 121 60 97 124 275 0.0 8.0 122 60 109 127 382 0.0 6.0 123 90 99 125 466 0.0 7.5 124 60 114 151 384 0.0 12.0 125 60 104 134 342 0.0 7.5 126 60 107 138 357 0.0 7.5 127 60 103 133 335 0.0 7.5 128 60 106 132 327 0.0 7.5 129 60 103 136 339 0.0 7.5 130 20 136 156 189 8.5 7.5 131 45 117 143 317 0.0 8.0 132 45 115 137 318 0.0 7.5 133 45 113 138 308 0.0 7.0 134 20 141 162 222 0.0 7.5 135 60 108 135 390 0.0 8.0 136 60 97 127 288 0.0 8.0 137 45 100 120 250 0.0 8.0 138 45 122 149 335 0.0 7.5 139 60 136 170 470 8.0 8.0 140 45 106 126 270 8.0 8.0 141 60 107 136 400 9.5 7.0 142 60 112 146 361 9.0 7.5 143 30 103 127 185 9.0 7.5 144 60 110 150 409 8.0 7.0 145 60 106 134 343 8.0 8.5 146 60 109 129 353 8.0 8.0 147 60 109 138 374 8.0 6.0 148 30 150 167 275 0.0 9.0 149 60 105 128 328 9.0 8.0 150 60 97 131 270 9.0 8.0 151 60 100 120 270 8.0 8.0 152 60 114 150 382 0.0 8.0 153 30 80 120 240 10.0 7.0 154 30 85 120 250 10.0 7.0 155 45 90 130 260 8.0 7.0 156 45 95 130 270 8.0 7.0 157 45 100 140 280 0.0 7.0 158 60 105 140 290 7.0 8.0 159 60 110 145 300 7.0 8.0 160 60 115 145 310 8.0 8.0 161 75 120 150 320 0.0 8.0 162 75 125 150 330 8.0 8.0
Example Explained:
Import the Pandas library
Name the data frame as full_health_data.
header=0 means that the headers for the variable names are to be
found in the first row (note that 0 means the first row in Python)
sep=”,” means that “,” is used as the separator between the values.
This is because we are using the file type .csv (comma separated values)
Tip: If you have a large CSV file, you can use the head() function to only show the top 5 rows:
Data Cleaning
Look at the imported data to find “dirty” with wrongly or unregistered values:
In Case of –
Blank Rows or non-numeric values
Solution: We can remove the rows with missing observations to fix this problem.
When we load a data set using Pandas, all blank cells are automatically converted into “NaN” values.
So, removing the NaN cells gives us a clean data set that can be analyzed.
We can use the dropna() function to remove the NaNs. axis=0 means that we want to remove all rows that have a NaN value:
use dropna() to drop NaN values
import pandas as pd
full_health_data = pd.read_csv(“Documents/Data Science/data.csv”, header=0, sep=”,”)
full_health_data.dropna(axis=0,inplace=True)
print(full_health_data)
Duration Average_Pulse Max_Pulse Calorie_Burnage Hours_Work \ 0 60 110 130 409 0.0 1 60 117 145 479 0.0 2 60 103 135 340 8.0 3 45 109 175 282 8.0 4 45 117 148 406 0.0 .. ... ... ... ... ... 158 60 105 140 290 7.0 159 60 110 145 300 7.0 160 60 115 145 310 8.0 161 75 120 150 320 0.0 162 75 125 150 330 8.0 Hours_Sleep 0 8.0 1 8.0 2 7.5 3 8.0 4 6.5 .. ... 158 8.0 159 8.0 160 8.0 161 8.0 162 8.0 [163 rows x 6 columns]
Data Categories
To analyze data, we also need to know the types of data we are dealing with.
Data can be split into three main categories:
Numerical – Contains numerical values. Can be divided into two categories:
Discrete: Numbers are counted as “whole”. Example: You cannot have trained 2.5 sessions, it is either 2 or 3
Continuous: Numbers can be of infinite precision. For example, you can sleep for 7 hours, 30 minutes and 20 seconds, or 7.533 hours
Categorical – Contains values that cannot be measured up against each other. Example: A color or a type of training
Ordinal – Contains categorical data that can be measured up against each other. Example: School grades where A is better than B and so on
By knowing the type of your data, you will be able to know what technique to use when analyzing them.
Data Types
We can use the info() function to list the data types within our data set:
import pandas as pd
full_health_data = pd.read_csv(“Documents/Data Science/data.csv”, header=0, sep=”,”)
print(full_health_data.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 163 entries, 0 to 162 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Duration 163 non-null int64 1 Average_Pulse 163 non-null int64 2 Max_Pulse 163 non-null int64 3 Calorie_Burnage 163 non-null int64 4 Hours_Work 163 non-null float64 5 Hours_Sleep 163 non-null float64 dtypes: float64(2), int64(4) memory usage: 7.8 KB None
In case we see that this data set has an object type of data:
Float64
Object
We cannot use objects to calculate and perform analysis here. We must convert the type object to float64
(float64 is a number with a decimal in Python).
We can use the astype() function to convert the data into float64.
Example
import pandas as pd
health_data = pd.read_csv(“Documents/Data Science/data.csv”, header=0, sep=”,”)
health_data.dropna(axis=0,inplace=True)
health_data[“Average_Pulse”] = health_data[‘Average_Pulse’].astype(float)
health_data[“Max_Pulse”] = health_data[“Max_Pulse”].astype(float)
print(health_data.info())
In our case, we don’t have any object type
Analyze the Data
When we have cleaned the data set, we can start analyzing the data.
We can use the describe() function in Python to summarize data:
import pandas as pd
full_health_data = pd.read_csv(“Documents/Data Science/data.csv”, header=0, sep=”,”)
pd.set_option(‘display.max_columns’,None)
print(full_health_data.describe())
Duration Average_Pulse Max_Pulse Calorie_Burnage Hours_Work \ count 163.000000 163.000000 163.000000 163.000000 163.000000 mean 64.263804 107.723926 134.226994 382.368098 4.386503 std 42.994520 14.625062 16.403967 274.227106 3.923772 min 15.000000 80.000000 100.000000 50.000000 0.000000 25% 45.000000 100.000000 124.000000 256.500000 0.000000 50% 60.000000 105.000000 131.000000 320.000000 5.000000 75% 60.000000 111.000000 141.000000 388.500000 8.000000 max 300.000000 159.000000 184.000000 1860.000000 11.000000 Hours_Sleep count 163.000000 mean 7.680982 std 0.663934 min 5.000000 25% 7.500000 50% 8.000000 75% 8.000000 max 12.000000
Conclusion
The above results gives us the stats of the data
Count – Counts the number of observations
Mean – The average value
Std – Standard deviation
Min – The lowest value
25%, 50% and 75% are percentiles
Max – The highest value