What is Data?
Data is a collection of information.
One purpose of Data Science is to structure data, making it interpretable and easy to work with.
Data can be categorized into two groups:
Structured data
Unstructured data
How to Structure Data?
We can use an array or a database table to structure or present data.
Python Libraries
Python has libraries with large collections of mathematical functions and analytical tools.
We will use the following libraries:
Pandas – This library is used for structured data operations, like import CSV files, create dataframes, and data preparation
Numpy – This is a mathematical library. Has a powerful N-dimensional array object, linear algebra, Fourier transform, etc.
Matplotlib – This library is used for visualization of data.
SciPy – This library has linear algebra modules
Create a DataFrame with Pandas
A data frame is a structured representation of data.
Let’s define a data frame with 3 columns and 5 rows with fictional numbers:
import pandas as pd
d = {‘col1’: [1, 2, 3, 4, 7], ‘col2’: [4, 5, 6, 9, 5], ‘col3’: [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)
print(df)
col1 col2 col3 0 1 4 7 1 2 5 8 2 3 6 12 3 4 9 1 4 7 5 11
Count the number of columns: use df.shape[1] for column
import pandas as pd
d = {‘col1’: [1, 2, 3, 4, 7], ‘col2’: [4, 5, 6, 9, 5], ‘col3’: [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)
count_column = df.shape[1]
print(“Number of columns:”)
print(count_column)
Number of columns: 3
Count the number of rows: use df.shape[0] for row
import pandas as pd
d = {‘col1’: [1, 2, 3, 4, 7], ‘col2’: [4, 5, 6, 9, 5], ‘col3’: [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)
count_row = df.shape[0]
print(“Number of rows:”)
print(count_row)
Number of rows: 5
Dataset of Full Health Data
import pandas as pd
full_health_data = pd.read_csv(“Documents/Data Science/data.csv”, header=0, sep=”,”)
print(full_health_data.to_string())
Duration Average_Pulse Max_Pulse Calorie_Burnage Hours_Work Hours_Sleep 0 60 110 130 409 0.0 8.0 1 60 117 145 479 0.0 8.0 2 60 103 135 340 8.0 7.5 3 45 109 175 282 8.0 8.0 4 45 117 148 406 0.0 6.5 5 60 102 127 300 0.0 7.5 6 60 110 136 374 0.0 7.5 7 45 104 134 253 0.0 9.0 8 30 109 133 195 8.0 8.0 9 60 98 124 269 7.5 8.0 10 60 103 147 329 8.0 7.0 11 60 100 120 250 0.0 5.0 12 60 106 128 345 8.5 7.5 13 60 104 132 379 8.5 7.5 14 60 98 123 275 9.0 7.0 15 60 98 120 215 6.5 7.0 16 60 100 120 300 9.0 8.0 17 45 90 112 180 8.0 8.0 18 60 103 123 323 8.0 8.0 19 45 97 125 243 8.0 7.0 20 60 108 131 364 8.0 7.0 21 45 100 119 282 0.0 7.0 22 60 130 101 300 8.0 7.0 23 45 105 132 246 0.0 8.0 24 60 102 126 334 8.0 8.0 25 60 100 120 250 8.0 7.0 26 60 92 118 241 7.0 6.0 27 60 103 132 353 9.0 8.5 28 60 100 132 280 3.0 8.0 29 60 102 129 380 3.0 8.0 30 60 92 115 243 7.5 8.0 31 60 101 124 299 8.5 7.5 32 60 93 113 223 8.0 7.0 33 60 107 136 361 2.0 8.0 34 60 114 140 415 9.0 8.5 35 60 100 120 300 8.0 7.0 36 60 100 120 300 7.0 6.0 37 45 104 129 266 8.0 7.0 38 60 98 126 286 7.5 7.0 39 60 100 122 329 8.0 7.0 40 60 111 138 400 8.0 8.0 41 60 111 131 397 0.0 8.0 42 60 99 119 273 8.0 7.0 43 60 109 153 387 8.0 7.0 44 45 111 136 300 11.0 7.0 45 45 108 129 298 5.0 7.0 46 60 111 139 397 0.0 8.0 47 60 107 136 380 8.5 8.0 48 80 123 146 643 8.5 8.0 49 60 106 130 263 8.0 8.0 50 60 118 151 486 7.0 7.0 51 30 136 175 238 0.0 7.5 52 60 121 146 450 5.0 8.0 53 60 118 121 413 8.0 7.0 54 45 115 144 305 5.0 7.0 55 20 153 172 226 0.0 8.0 56 45 123 152 321 5.0 8.0 57 210 108 160 1376 8.0 8.0 58 160 110 137 1034 8.0 8.0 59 160 109 135 853 8.0 8.0 60 45 118 141 341 8.0 8.0 61 20 110 130 131 0.0 8.0 62 180 90 130 800 0.0 8.0 63 150 105 135 873 8.0 8.0 64 150 107 130 816 3.0 8.0 65 20 106 136 110 3.0 8.0 66 300 108 143 1500 0.0 8.0 67 150 97 129 1115 0.0 8.0 68 90 100 127 700 0.0 8.0 69 150 97 127 953 0.0 8.0 70 45 114 146 304 0.0 8.0 71 90 98 125 563 0.0 8.0 72 45 105 134 251 0.0 8.0 73 45 110 141 300 7.0 7.5 74 120 100 130 500 7.0 7.5 75 270 100 131 1729 0.0 7.0 76 30 159 182 319 5.0 7.5 77 45 149 169 344 0.0 8.0 78 30 103 139 151 4.0 8.0 79 120 100 130 500 0.0 8.0 80 45 100 120 225 8.5 8.0 81 30 151 170 300 0.0 7.0 82 45 102 136 234 6.5 8.0 83 120 100 157 1000 0.0 8.0 84 45 129 103 242 5.0 8.0 85 20 83 107 50 9.5 8.0 86 180 101 127 600 0.0 7.5 87 45 107 137 1200 8.0 8.0 88 30 90 107 105 7.0 8.0 89 15 80 100 50 9.0 8.0 90 20 150 171 127 0.0 7.5 91 20 151 168 229 0.0 7.0 92 30 95 128 128 8.0 7.5 93 25 152 168 244 0.0 8.0 94 30 109 131 188 8.0 7.5 95 90 93 124 604 8.0 8.0 96 20 95 112 77 8.0 8.0 97 90 90 110 500 8.0 8.0 98 90 90 100 500 0.0 8.0 99 90 90 100 500 5.0 8.0 100 30 92 108 92 8.0 8.0 101 30 93 128 124 8.0 8.0 102 180 90 120 800 8.0 8.0 103 30 90 120 86 7.0 8.0 104 90 90 120 500 8.0 8.0 105 210 137 184 1860 0.0 7.5 106 60 102 124 325 0.0 8.0 107 45 107 124 275 0.0 8.0 108 15 124 139 124 5.0 6.0 109 60 108 131 367 0.0 8.0 110 60 108 151 351 0.0 8.0 111 60 116 141 443 0.0 8.0 112 60 97 122 277 0.0 8.0 113 60 105 125 353 0.0 8.0 114 60 103 124 332 0.0 8.0 115 30 112 137 193 0.0 8.0 116 45 100 120 100 0.0 7.5 117 60 119 169 336 0.0 8.0 118 60 107 127 344 0.0 8.0 119 60 111 151 368 0.0 8.0 120 60 98 122 271 0.0 8.0 121 60 97 124 275 0.0 8.0 122 60 109 127 382 0.0 6.0 123 90 99 125 466 0.0 7.5 124 60 114 151 384 0.0 12.0 125 60 104 134 342 0.0 7.5 126 60 107 138 357 0.0 7.5 127 60 103 133 335 0.0 7.5 128 60 106 132 327 0.0 7.5 129 60 103 136 339 0.0 7.5 130 20 136 156 189 8.5 7.5 131 45 117 143 317 0.0 8.0 132 45 115 137 318 0.0 7.5 133 45 113 138 308 0.0 7.0 134 20 141 162 222 0.0 7.5 135 60 108 135 390 0.0 8.0 136 60 97 127 288 0.0 8.0 137 45 100 120 250 0.0 8.0 138 45 122 149 335 0.0 7.5 139 60 136 170 470 8.0 8.0 140 45 106 126 270 8.0 8.0 141 60 107 136 400 9.5 7.0 142 60 112 146 361 9.0 7.5 143 30 103 127 185 9.0 7.5 144 60 110 150 409 8.0 7.0 145 60 106 134 343 8.0 8.5 146 60 109 129 353 8.0 8.0 147 60 109 138 374 8.0 6.0 148 30 150 167 275 0.0 9.0 149 60 105 128 328 9.0 8.0 150 60 97 131 270 9.0 8.0 151 60 100 120 270 8.0 8.0 152 60 114 150 382 0.0 8.0 153 30 80 120 240 10.0 7.0 154 30 85 120 250 10.0 7.0 155 45 90 130 260 8.0 7.0 156 45 95 130 270 8.0 7.0 157 45 100 140 280 0.0 7.0 158 60 105 140 290 7.0 8.0 159 60 110 145 300 7.0 8.0 160 60 115 145 310 8.0 8.0 161 75 120 150 320 0.0 8.0 162 75 125 150 330 8.0 8.0
Performing Functions
How to calculate Stats from an input CSV file using Pandas
mean value for a specific column within a csv file
import pandas as pd
df = pd.read_csv(“Documents/Data Science/data.csv”)
mean1 = df[‘Calorie_Burnage’].mean()
print (‘Mean Calorie_Burnage: ‘ + str(mean1))
Mean Calorie_Burnage: 382.3680981595092
max value for a specific column within a csv file
import pandas as pd
df = pd.read_csv(“Documents/Data Science/data.csv”)
max1 = df[‘Calorie_Burnage’].max()
print (‘Max Calorie_Burnage: ‘ + str(max1))
Max Calorie_Burnage: 1860
min value for a specific column within a csv file
import pandas as pd
df = pd.read_csv(“Documents/Data Science/data.csv”)
min1 = df[‘Calorie_Burnage’].min()
print (‘Min Calorie_Burnage: ‘ + str(min1))
Min Calorie_Burnage: 50
sum value for a specific column within a csv file
import pandas as pd
df = pd.read_csv(“Documents/Data Science/data.csv”)
sum1 = df[‘Calorie_Burnage’].sum()
print (‘Sum Calorie_Burnage: ‘ + str(sum1))
Sum Calorie_Burnage: 62326
count value for a specific column within a csv file
import pandas as pd
df = pd.read_csv(“Documents/Data Science/data.csv”)
count1 = df[‘Calorie_Burnage’].count()
print (‘Count Calorie_Burnage: ‘ + str(count1))
Count Calorie_Burnage: 163
median value for a specific column within a csv file
import pandas as pd
df = pd.read_csv(“Documents/Data Science/data.csv”)
median1 = df[‘Calorie_Burnage’].median()
print (‘Median Calorie_Burnage: ‘ + str(median1))
Median Calorie_Burnage: 320.0
mode value for a specific column within a csv file
import pandas as pd
df = pd.read_csv(“Documents/Data Science/data.csv”)
mode1 = df[‘Calorie_Burnage’].mode()
print (‘Mode Calorie_Burnage: ‘ + str(mode1))
Mode Calorie_Burnage: 0 300 dtype: int64
standard deviation (std) value for a specific column within a csv file
import pandas as pd
df = pd.read_csv(“Documents/Data Science/data.csv”)
std1 = df[‘Calorie_Burnage’].std()
print (‘Std Calorie_Burnage: ‘ + str(std1))
Std Calorie_Burnage: 274.2271059590154
variance (var) value for a specific column within a csv file
import pandas as pd
df = pd.read_csv(“Documents/Data Science/data.csv”)
var1 = df[‘Calorie_Burnage’].var()
print (‘Var Calorie_Burnage: ‘ + str(var1))
Var Calorie_Burnage: 75200.50564265704
grouping sum by Hours_Sleep
import pandas as pd
df = pd.read_csv(“Documents/Data Science/data.csv”)
groupby_sum1 = df.groupby([‘Hours_Sleep’]).sum()
print (‘Sum’ + str(groupby_sum1))
Sum Duration Average_Pulse Max_Pulse Calorie_Burnage Hours_Work Hours_Sleep 5.0 60 100 120 250 0.0 6.0 255 534 642 1421 27.0 6.5 45 117 148 406 0.0 7.0 1790 3289 4082 10799 200.5 7.5 1710 3174 3946 10173 95.0 8.0 6300 9654 12083 37254 366.5 8.5 180 323 406 1111 26.0 9.0 75 254 301 528 0.0 12.0 60 114 151 384 0.0
grouping count by Hours_Sleep
import pandas as pd
df = pd.read_csv(“Documents/Data Science/data.csv”)
groupby_count1 = df.groupby([‘Hours_Sleep’]).count()
print (‘Count’ + str(groupby_count1))
Count Duration Average_Pulse Max_Pulse Calorie_Burnage Hours_Work Hours_Sleep 5.0 1 1 1 1 1 6.0 5 5 5 5 5 6.5 1 1 1 1 1 7.0 31 31 31 31 31 7.5 28 28 28 28 28 8.0 91 91 91 91 91 8.5 3 3 3 3 3 9.0 2 2 2 2 2 12.0 1 1 1 1 1
Three lines to make our compiler able to draw:
import sys
import matplotlib
%matplotlib inline #use this instead of natplotlib.use(‘Agg’), this will render plt.show() to work
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
full_health_data = pd.read_csv(“Documents/Data Science/data.csv”, header=0, sep=”,”)
x = full_health_data[“Average_Pulse”]
y = full_health_data[“Calorie_Burnage”]
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
return slope * x + intercept
mymodel = list(map(myfunc, x))
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.ylim(ymin=0, ymax=2000)
plt.xlim(xmin=0, xmax=200)
plt.xlabel(“Average_Pulse”)
plt.ylabel (“Calorie_Burnage”)
plt.show()

You must be logged in to post a comment.