Building a Matrix-based Neural Network From the Ground Up

Earlier in the summer I began my quest to learn more about machine learning and had originally planned to just dive into the deep end with TensorFlow. I very quickly realized the error in this approach and changed course: I would instead build a neural network from the ground up. I wanted not only to better understand the fundamental math, but also to familiarize myself with the mathematics libraries needed to work effectively with the mountain of data used to train a network.

Machine learning by xkcd

Looking for guidance in my endeavor, I naturally turned to Google. Glancing over the search results, I found that almost all of them could be described by a Venn diagram with two categories: “Mathematically rigorous” and “Computationally well-designed.” Sadly, this Venn diagram features very little overlap. Wanting to have my cake and eat it too, I found the best of each category and spent what little free time I had this past summer melding them together (because c’mon, who doesn’t like Cake? They’re probably my favorite '90s alt-rock band).

In the “Mathematically rigorous” corner, Michael Nielsen’s neuralnetworksanddeeplearning.com is one of the best resources for anyone looking to get their feet wet in artificial neural networks, and at the can’t-be-beat price of free (but seriously, if you do read this book, please kick the author a few bucks; if a broke college student can do it, you can too)! Having some calculus and linear algebra under your belt certainly helps, but even if you don’t, Nielsen’s explanations of the underlying mathematical principles are very readable. The code in the book, on the other hand, is exactly what you would expect research code to look like. That isn’t inherently bad, but PEP 8 isn’t too complicated and it greatly improves readability, a central tenet of Python. Indentation and whitespace aside, vanilla Python’s memory usage leaves more and more to be desired as the amount of data it has to store grows, and its runtime tells a similar story. Thankfully Nielsen addresses this early on, challenging the reader to re-implement the code using a more efficient library, such as NumPy.
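To get a rough sense of that gap, here is a quick, unscientific sketch comparing a pure-Python dot product against NumPy’s np.dot on a million-element vector. The exact timings depend on your machine, but the vectorized version should win handily.

import time
import numpy as np

# Two million-element vectors (arbitrary values; only the timing matters)
a = [float(i) for i in range(1000000)]
b = [float(i) for i in range(1000000)]

start = time.time()
total = 0.0
for x, y in zip(a, b):  # dot product in vanilla Python
    total += x * y
print("Pure Python: {0:.4f} s".format(time.time() - start))

a_np, b_np = np.array(a), np.array(b)
start = time.time()
total = np.dot(a_np, b_np)  # the same dot product, vectorized by NumPy
print("NumPy:       {0:.4f} s".format(time.time() - start))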

Great! What’s NumPy? This led me to our top contender in the “Computationally well-designed” corner, a series of articles by machine learning researcher Brian Dolhansky. His code is beautiful: it makes full use of NumPy, and there isn’t a God object to be found! …nor, however, an explanation of why a softmax activation function is applied in the output layer while the rest are sigmoid. It’s assumed either that you already know the basics or that you can learn them later. That’s perfectly fine, as Nielsen’s book gives an excellent explanation of softmax neurons, as well as other more advanced techniques like weight regularization. Reverse-engineering Dolhansky’s code gave me a framework for implementing neural-network-based algorithms in a computationally efficient way.
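As a quick illustration of why softmax belongs in the output layer (the numbers here are made up): a sigmoid squashes each output independently, while softmax turns the whole row of scores into a probability distribution over the classes.

import numpy as np

s = np.array([[2.0, 1.0, 0.1]])  # raw scores from a hypothetical output layer

sig = 1.0 / (1.0 + np.exp(-s))
soft = np.exp(s) / np.sum(np.exp(s), axis=1, keepdims=True)

print(sig)           # independent squashes; no reason to sum to 1
print(soft)          # a probability distribution over the three classes
print(np.sum(soft))  # -> 1.0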

So without further ado, here is my attempt at a ground-up implementation of a neural network.

import numpy as np
import activations as sigma
import costs as cost


class Layer:
    def __init__(self, size, minibatch_size, input_layer=False,
                 output_layer=False, activation=sigma.prelu):
        self.size = size
        self.input_layer = input_layer
        self.output_layer = output_layer
        self.activation = activation
        self.z = np.zeros((minibatch_size, size[0]))
        if not input_layer:
            self.s = np.zeros((minibatch_size, size[1]))
            self.delta = np.zeros((minibatch_size, size[1]))
            # Initialize weights from a Gaussian distribution whose standard
            # deviation is scaled by the number of inputs to each neuron
            self.weights = np.random.normal(size=size,
                                            scale=1/np.sqrt(size[0]))
            self.del_w = np.zeros_like(self.weights)
        if not input_layer and not output_layer:
            self.f_prime = np.zeros((minibatch_size, size[1]))
            self.biases = np.zeros((1, size[1]))
            # Initialize neurons with no bias, allowing biases to be learned
            # through backpropagation
            self.del_b = np.zeros_like(self.biases)

    def forward_propagate(self):
        if self.input_layer:  # No activation function applied to input layer
            return self.z
        if not self.input_layer and not self.output_layer:
            self.s = np.dot(self.s, self.weights)
            self.s = np.add(self.s, self.biases)
            self.f_prime = self.activation(self.s, deriv=True)
        else:
            self.s = np.dot(self.s, self.weights)
        self.z = self.activation(self.s)
        return self.z


class Network:
    def __init__(self, sizes, minibatch_size, cost=cost.cross_entropy):
        self.sizes = sizes
        self.num_layers = len(sizes)
        self.layers = np.empty(self.num_layers, dtype=object)
        self.minibatch_size = minibatch_size
        self.cost = cost

        print "Initializing network..."
        for i in range(self.num_layers-1):
            if i == 0:
                print "\tInitializing input layer of size {0}.".format(
                    sizes[i])
                self.layers[i] = Layer([sizes[i]], minibatch_size,
                                       input_layer=True)
            else:
                print "\tInitializing hidden layer of size {0}.".format(
                    sizes[i])
                self.layers[i] = Layer([sizes[i-1], sizes[i]], minibatch_size)
        print "\tInitializing output layer of size {0}.".format(sizes[-1])
        self.layers[-1] = Layer([sizes[-2], sizes[-1]], minibatch_size,
                                output_layer=True, activation=sigma.softmax)
        print "Done!"

    def forward_propagate(self, input_data):
        self.layers[0].z = input_data
        for i in range(self.num_layers-1):
            self.layers[i+1].s = self.layers[i].forward_propagate()
        return self.layers[-1].forward_propagate()

    def backpropagate(self, y_hat, label):
        # Calculate derivative of cost function
        self.layers[-1].delta = (self.cost).error(y_hat, label)
        for i in range(self.num_layers-2, 0, -1):
            self.layers[i].delta = np.dot(self.layers[i+1].delta,
                                          self.layers[i+1].weights.T) * \
                self.layers[i].f_prime

    def update_weights(self, n_examples, eta, lmbd):
        for i in range(1, self.num_layers):
            if i < self.num_layers-1:
                self.layers[i].del_b = \
                    np.dot(np.ones((1, self.minibatch_size)),
                           self.layers[i].delta)
                self.layers[i].biases = self.layers[i].biases - \
                    (eta / self.minibatch_size) * self.layers[i].del_b
            self.layers[i].del_w = np.dot(self.layers[i-1].z.T,
                                          self.layers[i].delta)
            # Apply L2 regularization to weight updates with regularization
            # rate lambda
            self.layers[i].weights = (1 - eta * (lmbd / n_examples)) * \
                self.layers[i].weights - (eta / self.minibatch_size) * \
                self.layers[i].del_w

    def train(self, data, n_examples, eta, lmbd):
        inputs, labels = data
        output = self.forward_propagate(inputs)
        self.backpropagate(output, labels)
        self.update_weights(n_examples, eta=eta, lmbd=lmbd)
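Before pointing the network at MNIST, it’s worth a quick sanity check with random data to make sure the shapes line up. This sketch assumes the code above is saved as matrix_network.py alongside the activations and costs modules, as in the script at the end of this post.

import numpy as np
import matrix_network

# A tiny network and a random minibatch, just to confirm the plumbing works
net = matrix_network.Network([784, 30, 10], minibatch_size=5)
inputs = np.random.randn(5, 784)
labels = np.eye(10)[np.random.randint(0, 10, 5)]  # random one-hot labels

output = net.forward_propagate(inputs)
print(output.shape)  # (5, 10): one softmax distribution per example

net.train((inputs, labels), n_examples=5, eta=0.1, lmbd=0.0)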

The activations module contains a few different activation functions to play around with.

# Module of activation functions for use in matrix-based neural networks
import numpy as np


def sigmoid(s, deriv=False):
    if not deriv:
        return 1.0 / (1.0 + np.exp(-s))
    else:
        return sigmoid(s) * (1.0 - sigmoid(s))


def softmax(s):
    z = np.sum(np.exp(s), axis=1)
    z = z.reshape(z.shape[0], 1)
    return np.exp(s) / z


def relu(s, deriv=False):
    # Return a new array rather than modifying s in place, since the caller
    # reuses s for both the activation and its derivative
    if not deriv:
        return np.where(s > 0, s, 0.0)
    else:
        return np.where(s > 0, 1.0, 0.0)


def prelu(s, alpha=0.1, deriv=False):
    # "Leaky" variant of relu: negative inputs are scaled by alpha instead
    # of being zeroed out
    if not deriv:
        return np.where(s > 0, s, alpha * s)
    else:
        return np.where(s > 0, 1.0, alpha)
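Each activation takes a matrix of weighted inputs and an optional deriv flag (softmax only ever appears in the output layer, where its derivative gets folded into the cost, so it doesn’t need one). A quick example:

import numpy as np
import activations as sigma

s = np.array([[-1.0, 0.5, 2.0]])

print(sigma.sigmoid(s))              # activations
print(sigma.sigmoid(s, deriv=True))  # element-wise derivatives
print(sigma.prelu(s))                # negative inputs scaled by alpha=0.1
print(sigma.softmax(s))              # row sums to 1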

Likewise, the cost module includes a couple of different cost functions for use with different activation functions.

# Module of cost functions for use in matrix-based neural networks
import numpy as np


class quadratic(object):
    @staticmethod
    def func(y_hat, label):
        # Return cost of actual output "y_hat" with respect to expected
        # output "label"
        return 0.5*np.linalg.norm(y_hat - label)**2

    @staticmethod
    def error(y_hat, label):
        # Return the error signal "delta" from the output layer
        return (y_hat - label)


class cross_entropy(object):
    @staticmethod
    def func(y_hat, label):
        # Return cost of actual output "y_hat" with respect to expected
        # output "label"
        return np.sum(np.nan_to_num(-label * np.log(y_hat) - (1 - label) *
                                    np.log(1 - y_hat)))

    @staticmethod
    def error(y_hat, label):
        # Return the error signal "delta" from the output layer
        return (y_hat - label)
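A quick usage example. Note that cross_entropy.error() is simply y_hat - label, which is exactly the output-layer delta you get when a softmax output is paired with a cross-entropy cost.

import numpy as np
import costs as cost

y_hat = np.array([[0.7, 0.2, 0.1]])  # network output, e.g. from softmax
label = np.array([[1.0, 0.0, 0.0]])  # one-hot target

print(cost.cross_entropy.func(y_hat, label))   # scalar cost for this example
print(cost.cross_entropy.error(y_hat, label))  # delta for the output layer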

Although I’m not working with the raw MNIST dataset, the pickled data still needs to be loaded and requires some additional pre-processing before training.

import cPickle
import gzip
import numpy as np


def load_data():
    data = gzip.open('../data/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(data)
    data.close()

    training_inputs, training_labels = training_data
    vectorized_labels = np.zeros((len(training_labels), 10))
    for i in range(len(training_labels)):
        vectorized_labels[i] = vectorize_label(training_labels[i])

    training_data = zip(training_inputs, vectorized_labels)
    validation_data = zip(*validation_data)
    test_data = zip(*test_data)
    return training_data, validation_data, test_data


def vectorize_label(i):
    vectorized_label = np.zeros((10))
    vectorized_label[i] = 1
    return vectorized_label
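Assuming the mnist.pkl.gz file (the pickled MNIST split that Nielsen’s book uses) is sitting in ../data/ as hard-coded above, the loader can be sanity-checked like so:

import mnist_loader

training_data, validation_data, test_data = mnist_loader.load_data()

inputs, label = training_data[0]
print(len(training_data))  # 50,000 training pairs
print(inputs.shape)        # (784,): a flattened 28x28 image
print(label)               # 10-element one-hot vector for the digit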

The oh-so-cleverly named evaluator module handles the training and evaluation of the network, reporting the results of each epoch.

# Evaluates neural networks with given hyper-parameters and reports results
import numpy as np


def evaluate(network, epochs, eta, lmbd, training_data,
             validation_data=None, test_data=None):
    n_training = len(training_data)
    print "Training for {0} epochs...".format(epochs)
    for t in range(epochs):
        np.random.shuffle(training_data)
        out_str = "\tEpoch {0:2d}:".format(t+1)

        for i in range(n_training/network.minibatch_size):
            data = create_minibatch(training_data, i, network.minibatch_size,
                                    training_data=True)
            network.train(data, n_training, eta, lmbd)

        if validation_data:
            n_validation = len(validation_data)
            n_correct = 0
            for i in range(n_validation/network.minibatch_size):
                inputs, labels = create_minibatch(validation_data, i,
                                                  network.minibatch_size)
                output = network.forward_propagate(inputs)
                y_hat = np.argmax(output, axis=1)
                n_correct += np.sum(y_hat == labels)
            out_str = "{0} Validation accuracy: {1:.2f}%".format(
                out_str, float(n_correct)/n_validation * 100)

        if test_data:
            n_test = len(test_data)
            n_correct = 0
            for i in range(n_test/network.minibatch_size):
                inputs, labels = create_minibatch(test_data, i,
                                                  network.minibatch_size)
                output = network.forward_propagate(inputs)
                y_hat = np.argmax(output, axis=1)
                n_correct += np.sum(y_hat == labels)
            out_str = "{0} Test accuracy: {1:.2f}%".format(
                out_str, float(n_correct)/n_test * 100)
        print out_str


def create_minibatch(data, i, minibatch_size, training_data=False):
    inputs, labels = zip(*data)
    n = np.size(inputs[0], axis=0)
    # Offset into the (shuffled) data for the i-th minibatch
    offset = i * minibatch_size
    minibatch_inputs = np.zeros((minibatch_size, n))
    if training_data:
        minibatch_labels = np.empty((minibatch_size, 10))
        for j in range(minibatch_size):
            minibatch_inputs[j, :] = inputs[offset+j]
            minibatch_labels[j, :] = labels[offset+j]
    else:
        minibatch_labels = np.empty(minibatch_size)
        for j in range(minibatch_size):
            minibatch_inputs[j, :] = inputs[offset+j]
            minibatch_labels[j] = int(labels[offset+j])
    return minibatch_inputs, minibatch_labels

Finally, I wrote a quick script to initialize networks and evaluate them on the MNIST dataset with given hyper-parameters, greatly reducing the amount of repetitive typing, though sadly not eliminating it entirely.

import mnist_loader
import matrix_network
import evaluator

training_data, validation_data, test_data = mnist_loader.load_data()
network = matrix_network.Network([784, 100, 10], 100)
evaluator.evaluate(network, 30, 1.0, 0.1, training_data,
                   validation_data, test_data)

There’s quite a bit going on in the code overall, despite the relatively simple algorithms underpinning it. The power of these networks lies in the scale of data those algorithms can leverage when implemented with modern matrix libraries. In the next few weeks, I’ll attempt to shed some light on what’s happening under the hood.