Creating Machine Learning Pipeline
from IPython.display import Image
Image(filename=’C:\Users\Valerie MASODA FOTSO\Documents\Machine Learning\ML-Pipeline01.jpg’)

Filtering warnings
import warnings
warnings.filterwarnings(‘ignore’)
Importing Modules (Packages)
Importing Modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
Importing Data
Importing Diabetes Data
DiabetesData = pd.read_csv(‘Documents\Machine Learning\DiabetesData.csv’)
Viewing Data
DiabetesData.head()
Dividing Data into Train and Test
X_train, X_test, y_train, y_test = train_test_split(DiabetesData.iloc[:,[0,1,2,3,4,5,6,7]],
DiabetesData.iloc[:,[8]], test_size=0.2, random_state=1)
Creating Pipelines
Creating pipelines for Logistic Regression Decision Tree and Random Forest Models
Pipeline steps will include
1. Data Preprocessing using MinMaxScaler
2. Reducing Dimensionality using PCA
3. Training respective models
Logistic Regression Pipeline
LogisticRegressionPipeline = Pipeline([(‘myscaler’, MinMaxScaler()),
(‘mypca’, PCA(n_components=3)),
(‘logistic_classifier’, LogisticRegression())])
Decision Tree Pipeline
DecisionTreePipeline = Pipeline([(‘myscaler’, MinMaxScaler()),
(‘mypca’, PCA(n_components=3)),
(‘decisiontree_classifier’, DecisionTreeClassifier())])
Random Forest Pipeline
RandomForestPipeline = Pipeline([(‘myscaler’, MinMaxScaler()),
(‘mypca’, PCA(n_components=3)),
(‘randomforest_classifier’, RandomForestClassifier())])
Model Training and Validation
Defining the pipelines in a list
mypipeline=[LogisticRegressionPipeline, DecisionTreePipeline, RandomForestPipeline]
Defining Variables for choosing best model
accuracy = 0.0
classifier = 0
pipeline = “”
Creating dictionary of pipelines and training models
PipelineDict = {0: ‘Logistic Regression’, 1: ‘Decision Tree’, 2: ‘Random Forest’}
Fit the Pipelines
for mypipe in mypipeline: mypipe.fit(X_train, y_train)
Getting test accuracy for all classifiers
for i, model in enumerate(mypipeline):
print (“{} Test Accuracy: {}”.format(PipelineDict[i], model.score(X_test, y_test)))
Logistic Regression Test Accuracy: 0.0 Decision Tree Test Accuracy: 0.5 Random Forest Test Accuracy: 1.0
Choosing best model for the given data
for i, model in enumerate(mypipeline):
if model.score(X_test, y_test) > accuracy:
accuracy=model.score(X_test, y_test)
pipeline = model
classifier = 1
print(‘Classifier with best accuracy: {}’.format(PipelineDict[classifier]))
Classifier with best accuracy: Decision Tree