DL – Loading in your own Dataset

import numpy as np
import matplotlib.pyplot as plt
import os
import cv2
from tqdm import tqdm
DATADIR = "C:/Users/Valerie MASODA FOTSO/Documents/Deep Learning/kagglecatsanddogs_3367a/PetImages" 
# Dataset downloaded from the Microsoft Kaggle Cats and Dogs Dataset
CATEGORIES = ["Dog", "Cat"]
for category in CATEGORIES:  # do dogs and cats
    path = os.path.join(DATADIR, category)  # create path to dogs and cats
    for img in os.listdir(path):  # iterate over each image per dogs and cats
        img_array = cv2.imread(os.path.join(path, img), cv2.IMREAD_GRAYSCALE)  # read the image as a grayscale array
        plt.imshow(img_array, cmap='gray')  # graph it
        plt.show()  # display!

        break  # we just want one image for now, so break
    break  # ...and only the first category
print(img_array)
[[117 117 119 ... 133 132 132]
 [118 117 119 ... 135 134 134]
 [119 118 120 ... 137 136 136]
 ...
 [ 79  74  73 ...  80  76  73]
 [ 78  72  69 ...  72  73  74]
 [ 74  71  70 ...  75  73  71]]
print(img_array.shape)
(375, 500)

So that’s a 375-tall, 500-wide image. There is no third dimension for channels because we read it with cv2.IMREAD_GRAYSCALE, so it’s a single-channel grayscale array (a color read would give a shape like (375, 500, 3)). We definitely don’t want the images that big, and the images also come in all different shapes, which is a problem in itself.

IMG_SIZE = 50

new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
plt.imshow(new_array, cmap='gray')
plt.show()

That 50x50 version is pretty blurry, and we may be throwing away useful detail. Let’s try 100x100 instead:

IMG_SIZE = 100

new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
plt.imshow(new_array, cmap='gray')
plt.show()

Better. Let’s go with 100x100. Next, we’re going to want to create training data and all that, but, first, we should set aside some images for final testing. I am going to manually create a folder called Testing and then create two folders inside of it, one for Dog and one for Cat. From there, I am going to move the first 15 images (0 through 14) from both Dog and Cat into their respective Testing folders. Make sure you move them, not copy. We will use these for our final tests. If you’d rather script the move, see the sketch below.
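A minimal sketch of that move follows. TESTDIR is a hypothetical destination path, and it assumes the images are named 0.jpg, 1.jpg, and so on, as they are in the Microsoft download (os is already imported above):

import shutil

TESTDIR = "C:/Users/Valerie MASODA FOTSO/Documents/Deep Learning/Testing"  # hypothetical destination path

for category in CATEGORIES:
    dst_path = os.path.join(TESTDIR, category)
    os.makedirs(dst_path, exist_ok=True)  # create Testing/Dog and Testing/Cat
    for i in range(15):  # the first 15 images: 0.jpg through 14.jpg
        fname = f"{i}.jpg"
        src = os.path.join(DATADIR, category, fname)
        if os.path.exists(src):
            shutil.move(src, os.path.join(dst_path, fname))  # move, not copy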

Now, we want to begin building our training data!

training_data = []

def create_training_data():
    for category in CATEGORIES:  # do dogs and cats

        path = os.path.join(DATADIR,category)  # create path to dogs and cats
        class_num = CATEGORIES.index(category)  # get the classification (0 or 1): 0=Dog, 1=Cat

        for img in tqdm(os.listdir(path)):  # iterate over each image per dogs and cats
            try:
                img_array = cv2.imread(os.path.join(path, img), cv2.IMREAD_GRAYSCALE)  # read as a grayscale array
                new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))  # resize to normalize data size
                training_data.append([new_array, class_num])  # add this to our training_data
            except Exception as e:  # in the interest of keeping the output clean...
                pass
            #except OSError as e:
            #    print("OSError, bad img most likely", e, os.path.join(path, img))
            #except Exception as e:
            #    print("general exception", e, os.path.join(path, img))

create_training_data()

print(len(training_data))            
100%|██████████| 12486/12486 [01:04<00:00, 193.69it/s]
100%|██████████| 12486/12486 [00:49<00:00, 251.48it/s]
24916

Great, we have almost 25K samples! That’s awesome. (The total, 24,916, is slightly lower than the 2 × 12,486 files tqdm iterated over, because the try/except silently skipped the handful of corrupt or unreadable files in the dataset.)

One thing we want to do is make sure our data is balanced. In the case of this dataset, I can see that the dataset started off as being balanced. By balanced, I mean there are the same number of examples for each class (same number of dogs and cats). If not balanced, you either want to pass the class weights to the model, so that it can measure error appropriately, or balance your samples by trimming the larger set to be the same size as the smaller set.

If you do not balance, the model will initially learn that the best thing to do is predict only one class, whichever is the most common. Then, it will often get stuck here. In our case though, this data is already balanced, so that’s easy enough. Maybe later we’ll have a dataset that isn’t balanced so nicely.
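As a quick sanity check, you can count the labels yourself. And if a dataset ever isn’t balanced, one common heuristic (not the only option) is to weight each class inversely to its frequency and pass the resulting dict to Keras via the class_weight argument of model.fit. A minimal sketch:

from collections import Counter

counts = Counter(label for _, label in training_data)
print(counts)  # should show roughly equal counts for class 0 (Dog) and class 1 (Cat)

# If it weren't balanced: weight each class inversely to its frequency,
# then pass class_weight=class_weight to model.fit() later on.
total = sum(counts.values())
class_weight = {cls: total / (len(counts) * n) for cls, n in counts.items()}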

Also, if you have a dataset that is too large to fit into your ram, you can batch-load in your data. There are many ways to do this, some outside of TensorFlow and some built in. We may discuss this further, but, for now, we’re mainly trying to cover how your data should look, be shaped, and fed into the models.
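For example, on newer TensorFlow 2.x versions, the built-in tf.keras.utils.image_dataset_from_directory streams batches off disk instead of holding everything in RAM. This is just a sketch of that route, not what we’ll use here:

import tensorflow as tf

# Streams (image, label) batches from disk; labels are inferred from the folder names.
dataset = tf.keras.utils.image_dataset_from_directory(
    DATADIR,
    color_mode="grayscale",
    image_size=(IMG_SIZE, IMG_SIZE),
    batch_size=32,
)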

Next, we want to shuffle the data. Right now our data is just all dogs, then all cats. This usually winds up causing trouble too: initially, the classifier will learn to just always predict dogs. Then it will shift to oh, just predict all cats! Bouncing back and forth like this is no good either.

import random

random.shuffle(training_data)  #shuffle data

Our training_data is a list, which is mutable, so random.shuffle shuffled it in place and it’s now nicely mixed. We can confirm this by iterating over the first few samples and printing out the class.

for sample in training_data[:10]:
    print(sample[1])
1
0
1
0
0
1
0
0
1
1
# Let's pack the shuffled data into what we'll feed the model: features (X) and labels (y)
X = []
y = []

for features,label in training_data:
    X.append(features)
    y.append(label)

print(X[0].reshape(-1, IMG_SIZE, IMG_SIZE, 1))  # preview the first sample in its final shape

X = np.array(X).reshape(-1, IMG_SIZE, IMG_SIZE, 1)  # -1 infers the sample count; 1 is the single grayscale channel
[[[[ 28]
   [ 27]
   [ 28]
   ...
   [ 85]
   [ 72]
   [ 79]]

  [[ 17]
   [ 19]
   [ 22]
   ...
   [ 80]
   [ 68]
   [ 70]]

  [[  4]
   [  3]
   [  5]
   ...
   [ 71]
   [ 63]
   [ 74]]

  ...

  [[151]
   [149]
   [151]
   ...
   [ 83]
   [103]
   [ 87]]

  [[179]
   [156]
   [186]
   ...
   [103]
   [100]
   [ 76]]

  [[216]
   [209]
   [200]
   ...
   [116]
   [ 94]
   [ 69]]]]
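It’s worth confirming the final shape, which should now be (samples, height, width, channels):

print(X.shape)
(24916, 100, 100, 1)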

Let’s save this data, so that we don’t need to keep calculating it every time we want to play with the neural network model:

#Let's save our data so we can come back to it anytime
import pickle

pickle_out = open("X.pickle","wb")
pickle.dump(X, pickle_out)
pickle_out.close()

pickle_out = open("y.pickle","wb")
pickle.dump(y, pickle_out)
pickle_out.close()
#We can always load it back into our current script, or a totally new one, like so:

pickle_in = open("X.pickle","rb")
X = pickle.load(pickle_in)

pickle_in = open("y.pickle","rb")
y = pickle.load(pickle_in)
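One caveat: newer versions of Keras expect labels as a NumPy array rather than a plain Python list, so after loading it’s worth converting y before feeding it to a model:

y = np.array(y)  # newer Keras versions want labels as an array, not a list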