Learning Machine Learning and Artificial Intelligence with Blast
Article Three - Building a Binary Classification Model with Neural Networks in Keras
Hello World! Welcome and thank you for being a part of the journey so far…
We started with prerequisites for becoming a Machine Learning Engineer or an Artificial Intelligence Researcher (article zero). Then we moved to the history of Artificial Intelligence and an introduction to the concept of Deep Learning (article one). Then in the previous article, we saw how to set up a Deep Learning workspace and built a Linear Classifier in pure TensorFlow (article two).
We’ve come quite a way, and I’m proud of the steady progress, but we still have much more to cover and learn before we can settle and say our journey is done. In this article, we’ll be building a Binary Classification Model with Neural Networks in Keras. Yay!
We’ll be using Google Colab which I introduced in the last article. The dataset we’ll be using comes prepackaged with Keras, and by the end of this article just like all the others, you’ll be a couple steps closer to becoming a Machine Learning Engineer or an Artificial Intelligence Researcher.
In this article, we’ll go through the Machine Learning workflow end to end: you’ll be introduced to data preprocessing, basic model architecture principles, and model evaluation. By the end, we’ll have implemented a model that classifies movie reviews as positive or negative.
Are you ready?? Let’s gooooo!
A Binary Classification task in Machine Learning is one where each input sample should be categorized into one of two exclusive categories. By the end of this article, you’ll be able to use Neural Networks to handle simple Binary Classification tasks.
Two-class classification, or Binary Classification, is one of the most common kinds of Machine Learning problems. In this article, you’ll learn to build a model that can classify movie reviews as positive or negative, based on the text content of the reviews.
We’ll be working with the IMDB dataset: a set of 50,000 highly polarized reviews from the Internet Movie Database. They’re split into 25,000 reviews for training and 25,000 reviews for testing, each set consisting of 50% negative and 50% positive reviews.
The IMDB dataset comes prepackaged with Keras. It has already been preprocessed: the reviews (sequences of words) have been turned into sequences of integers, where each integer stands for a specific word in a dictionary.
To load the dataset (about 80 MB of data will be downloaded the first time you run it), enter the snippet below on Colab.
from tensorflow.keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
The argument num_words=10000 means we’re keeping only the 10,000 most frequently occurring words in the training data; rarer words will be discarded. This allows us to work with vector data of a manageable size. If we didn’t set this limit, we’d be working with 88,585 unique words in the training data, which is unnecessarily large. Many of these words occur in only a single sample and thus can’t be meaningfully used for classification.
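If you’d like to verify the cap, you can check the maximum word index across the training reviews; with num_words=10000 it should come out to 9999:
max_index = max(max(sequence) for sequence in train_data) # highest word index in any training review
print(max_index) # should print 9999 with num_words=10000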
The variables train_data and test_data contain lists of reviews; each review is a list of word indices (encoding a sequence of words). train_labels and test_labels are lists of 0s and 1s, where 0 stands for negative and 1 stands for positive.
You can check the data out by simply running:
print(train_data[0])
print(train_labels[0])
Also, you can decode one of the reviews back to English words by running:
word_index = imdb.get_word_index() # word_index is a dictionary mapping words to an integer index
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()]) # reverse it, mapping integer indices to words
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]]) # indices are offset by 3 because 0, 1, and 2 are reserved for "padding", "start of sequence", and "unknown"
print(decoded_review)
Alright, so we have sufficient data to train our model, which is an important first step. The reviews have already been encoded from sequences of words into sequences of integers, which is great because Machine Learning Models require numerical input to handle data effectively.
We’re on a good track, but there’s a little problem: we can’t directly feed our lists of integers into a Neural Network. The lists all have different lengths, but a Neural Network expects to process contiguous batches of data of the same shape. So we have to turn the lists of integers into tensors.
There are multiple ways to turn our review lists into tensors, but for this article we’ll go with multi-hot encoding, which turns the contents of our lists into vectors of 0s and 1s. This means, for instance, turning the list [8, 5] into a 10,000-dimensional vector that is all 0s except for indices 8 and 5, which are 1s.
To multi-hot encode the review integer lists, run:
import numpy as np
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension)) # an all-zero matrix of shape (len(sequences), dimension)
    for i, sequence in enumerate(sequences):
        for j in sequence:
            results[i, j] = 1. # set the indices that appear in this review to 1
    return results
x_train = vectorize_sequences(train_data) # Vectorized training data
x_test = vectorize_sequences(test_data) # Vectorized test data
To see what the samples look like now, you can run:
print(x_train[0])
We should also vectorize our labels, which is straightforward:
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')
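As a quick sanity check before moving on, you can confirm the shapes and dtypes of the prepared arrays (the values in the comments are what you should expect with the settings above):
print(x_train.shape) # (25000, 10000): 25,000 reviews, each a 10,000-dimensional vector
print(y_train.shape) # (25000,)
print(y_train.dtype) # float32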
Now the data is ready to be fed into a neural network. Great job so far!
Short break? Okay let’s continue.
Our input data consists of vectors, and the labels are scalars (1s and 0s). A type of model that performs well on such a problem is a plain stack of densely connected (Dense) layers with relu activations.
For our model architecture, we’ll be answering two questions:
How many layers to use?
How many units to choose for each layer?
The answer for this task: two intermediate layers with 16 units each, and a third layer that will output the scalar prediction regarding the sentiment of the current review.
Visually, it looks something like:
Input (vectorized text) → Dense (units=16) → Dense (units=16) → Dense (units=1) → Output (probability)
In code, it looks like:
from tensorflow import keras
from tensorflow.keras import layers
model = keras.Sequential([
    layers.Dense(16, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid")
])
The first argument being passed to each Dense layer is the number of units in the layer, i.e., the dimensionality of the layer’s representation space. You can intuitively understand the dimensionality of your representation space as “how much freedom you’re allowing the model to have when learning internal representations.” Having more units (a higher-dimensional representation space) allows your model to learn more complex representations, but it makes the model more computationally expensive and may lead to learning unwanted patterns (patterns that improve performance on the training data but not on the test data).
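If you’re curious how these unit counts translate into trainable parameters, you can build the model with our input dimension of 10,000 and print a summary. This step is optional; since we didn’t give the Sequential model an input shape, it has to be built explicitly before summary() will work:
model.build(input_shape=(None, 10000)) # 10,000 matches the dimension of our vectorized reviews
model.summary() # prints each layer's output shape and parameter count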
The intermediate layers use relu as their activation function, and the final layer uses a sigmoid activation so as to output a probability (a score between 0 and 1 indicating how likely the review is to be positive, i.e., to have the target 1).
A relu (rectified linear unit) is a function that zeroes out negative values, whereas a sigmoid squashes arbitrary values into the [0, 1] interval, outputting something that can be interpreted as a probability.
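To make that concrete, here’s a minimal NumPy sketch of the two functions, purely for intuition; Keras applies the real implementations inside the Dense layers for you:
import numpy as np

def relu(x):
    return np.maximum(x, 0.) # zeroes out negative values

def sigmoid(x):
    return 1. / (1. + np.exp(-x)) # squashes values into the (0, 1) interval

print(relu(np.array([-2., 0., 3.]))) # [0. 0. 3.]
print(sigmoid(np.array([-2., 0., 3.]))) # approximately [0.12 0.5 0.95]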
We’ll need to do further research into choosing the right architecture for our models, but aside from that, we have our model architecture for this task and everything’s moving along nicely.
Finally, to complete the model, we need to choose a loss function and an optimizer. The loss function provides a way to measure the model’s performance during training, while the optimizer updates the model’s weights to minimize that loss.
Because we’re facing a binary classification problem and the output of our model is a probability (ending the model with a single-unit layer with a sigmoid activation), it’s best to use the binary_crossentropy loss function. As for the choice of optimizer we’ll go with rmsprop, which is usually a good default choice for virtually any problem.
Next, we compile the model with the code below:
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
# note that we're monitoring accuracy during the training
A Deep Learning model should never be evaluated on its training data—it’s standard practice to use a validation set to monitor the accuracy of the model during training. So we’ll create a validation set by setting apart 10,000 samples from the original training data.
We can do that by running:
x_val = x_train[:10000] # first 10,000 samples for validation
partial_x_train = x_train[10000:] # remaining 15,000 samples for training
y_val = y_train[:10000]
partial_y_train = y_train[10000:]
With the validation set separated from the training data, we’re finally ready to train our model.
We’ll be training the model for 20 epochs (20 iterations over all samples in the training data) in mini-batches of 512 samples. At the same time, we will monitor loss and accuracy on the 10,000 samples that we set apart. We do so by passing the validation data as the validation_data argument.
Here’s the code to train the model:
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))
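As an aside, instead of splitting the data manually, Keras can hold out a validation set for you via the validation_split argument to fit(). Note that it takes the last fraction of the samples rather than the first 10,000 like we did, so the numbers may differ slightly; the sketch below is commented out so you don’t accidentally train twice:
# Alternative to the manual split (not used in this article):
# history = model.fit(x_train, y_train, epochs=20, batch_size=512,
#                     validation_split=0.4) # holds out the last 40% (10,000 samples)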
Running all the code so far on Colab will train our model on the train dataset and validate the model on the validation dataset. Now let’s use matplotlib to plot the training and validation loss side by side, as well as the training and validation accuracy.
After your model completes training for all 20 epochs, you can use the code below to see the loss and accuracy data as graphs.
import matplotlib.pyplot as plt
# Loss
history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
epochs = range(1, len(loss_values) + 1)
plt.plot(epochs, loss_values, 'bo', label='Training loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
# Accuracy
plt.clf() # clears the previous graph
acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
From the generated graphs, you should see that the training loss decreases with every epoch and the training accuracy increases with every epoch. That’s what we would expect, but it isn’t the case for the validation loss and accuracy: they seem to peak around the fourth epoch.
This is an example of a problem: the model keeps getting better on the training data while getting worse on data it hasn’t seen before, which is exactly why the validation set is important. This problem is called overfitting: after the fourth epoch, we’re overoptimizing on the training data, and the model ends up learning representations that are specific to the training data and don’t generalize to data outside of the training set.
To prevent overfitting for our model, a simple solution is to stop training after four epochs.
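As a side note, instead of hard-coding the epoch count, Keras offers an EarlyStopping callback that halts training once the validation loss stops improving. Here’s a minimal sketch of how you could use it (the patience value of 2 is an arbitrary choice for illustration):
from tensorflow import keras

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", # watch the validation loss
    patience=2, # allow up to 2 epochs without improvement before stopping
    restore_best_weights=True) # roll the model back to its best epoch

# Then pass it to fit() via the callbacks argument:
# history = model.fit(partial_x_train, partial_y_train, epochs=20,
#                     batch_size=512, validation_data=(x_val, y_val),
#                     callbacks=[early_stop])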
For this article, we’ll stick with the simple approach. We’ll write the code for our model again from scratch, but this time we’ll train for only 4 epochs and then evaluate the model on the test data.
The code for our model is thus:
model = keras.Sequential([
    layers.Dense(16, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid")
])
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=4, batch_size=512)
results = model.evaluate(x_test, y_test)
print(results)
# [0.2801417410373688, 0.8884400129318237]
# printed results should look like the above. The first number, 0.28, is the test loss, and the second number, 0.88, is the test accuracy.
And there you have it: you’ve just trained a Binary Classification model that achieves an accuracy of about 88%. To use it in a practical setting, you can generate the likelihood of reviews being positive by using the predict method as seen below:
model.predict(x_test)
From the results, you’ll see the model is confident about some samples (0.99 or higher, or 0.01 or lower) but less confident about others (around 0.6 or 0.4).
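If you want hard class labels rather than probabilities, you can threshold the predictions at 0.5 (a common default; other thresholds are possible depending on which kind of error you’d rather avoid):
predictions = model.predict(x_test)
predicted_labels = (predictions > 0.5).astype("int32") # 1 = positive, 0 = negative
print(predicted_labels[:10])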
And that’s it. I’ll be wrapping up here. I hope you’ve enjoyed this article. Importantly, you should do more research of your own, and also consult our reference book, Deep Learning with Python by Francois Chollet, for a deeper understanding.
Try different architectures to see whether they improve, worsen, or have no effect on the model. You can try using three representation layers instead of two, or layers with more or fewer units, e.g., 32 units or 64 units; a starting point is sketched below.
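Here’s one such variant you could try, with three 64-unit intermediate layers instead of two 16-unit ones; whether it helps, hurts, or changes nothing is exactly what the exercise asks you to find out:
variant = keras.Sequential([
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid")
])
variant.compile(optimizer="rmsprop",
                loss="binary_crossentropy",
                metrics=["accuracy"])
variant.fit(x_train, y_train, epochs=4, batch_size=512)
print(variant.evaluate(x_test, y_test))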
A couple key points before I say bye:
Raw data needs to be preprocessed into tensors before it can be fed into a Neural Network; in our case, from words to integers to vectors.
Stacks of dense layers with relu activations can solve a wide range of problems.
In a Binary Classification problem (two output classes), your model should end with a Dense layer with one unit and a sigmoid activation.
With a scalar sigmoid output on a binary classification problem, the loss function you should use is binary_crossentropy.
The rmsprop optimizer is generally a good enough choice, whatever the problem.
As they get better on their training data, Neural Networks eventually start to overfit, ending up with increasingly worse results on data they’ve never seen before.
And that’s a wrap. This article was quite comprehensive, but as always, I implore you to do further research. If you need me, just reach out. In the next article, we’ll be tackling a Multiclass Classification problem.
Till then, bye!