ML – Creating Machine Learning Pipelines

Creating Machine Learning Pipeline

from IPython.display import Image
Image(filename=’C:\Users\Valerie MASODA FOTSO\Documents\Machine Learning\ML-Pipeline01.jpg’)

Filtering warnings

import warnings
warnings.filterwarnings(‘ignore’)

Importing Modules (Packages)

Importing Modules

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

Importing Data

Importing Diabetes Data

DiabetesData = pd.read_csv(‘Documents\Machine Learning\DiabetesData.csv’)

Viewing Data

DiabetesData.head()

Dividing Data into Train and Test

X_train, X_test, y_train, y_test = train_test_split(DiabetesData.iloc[:,[0,1,2,3,4,5,6,7]],
DiabetesData.iloc[:,[8]], test_size=0.2, random_state=1)

Creating Pipelines

Creating pipelines for Logistic Regression Decision Tree and Random Forest Models

Pipeline steps will include

1. Data Preprocessing using MinMaxScaler

2. Reducing Dimensionality using PCA

3. Training respective models

Logistic Regression Pipeline

LogisticRegressionPipeline = Pipeline([(‘myscaler’, MinMaxScaler()),
(‘mypca’, PCA(n_components=3)),
(‘logistic_classifier’, LogisticRegression())])

Decision Tree Pipeline

DecisionTreePipeline = Pipeline([(‘myscaler’, MinMaxScaler()),
(‘mypca’, PCA(n_components=3)),
(‘decisiontree_classifier’, DecisionTreeClassifier())])

Random Forest Pipeline

RandomForestPipeline = Pipeline([(‘myscaler’, MinMaxScaler()),
(‘mypca’, PCA(n_components=3)),
(‘randomforest_classifier’, RandomForestClassifier())])

Model Training and Validation

Defining the pipelines in a list

mypipeline=[LogisticRegressionPipeline, DecisionTreePipeline, RandomForestPipeline]

Defining Variables for choosing best model

accuracy = 0.0
classifier = 0
pipeline = “”

Creating dictionary of pipelines and training models

PipelineDict = {0: ‘Logistic Regression’, 1: ‘Decision Tree’, 2: ‘Random Forest’}

Fit the Pipelines

for mypipe in mypipeline: mypipe.fit(X_train, y_train)

Getting test accuracy for all classifiers

for i, model in enumerate(mypipeline):
print (“{} Test Accuracy: {}”.format(PipelineDict[i], model.score(X_test, y_test)))

Logistic Regression Test Accuracy: 0.0
Decision Tree Test Accuracy: 0.5
Random Forest Test Accuracy: 1.0

Choosing best model for the given data

for i, model in enumerate(mypipeline):
if model.score(X_test, y_test) > accuracy:
accuracy=model.score(X_test, y_test)
pipeline = model
classifier = 1
print(‘Classifier with best accuracy: {}’.format(PipelineDict[classifier]))

Classifier with best accuracy: Decision Tree