Building a Matrix-based Neural Network From the Ground Up
24 Aug 2017
Earlier in the summer I began my quest to learn more about machine learning, originally planning to just dive into the deep end with TensorFlow. I very quickly realized the error of that approach and changed course: I would instead build a neural network from the ground up. Not only did I want to better understand the fundamental math, I also wanted to familiarize myself with the math libraries needed to work effectively with the mountain of data used to train a network.
Looking for guidance in my endeavor, I naturally turned to Google. Glancing over the search results, I found that almost all of them could be described by a Venn diagram with two circles: “Mathematically rigorous” and “Computationally well-designed.” Sadly, that Venn diagram features very little overlap. Wanting to have my cake and eat it too, I found the best of each category and spent what little free time I had this past summer melding the two together (Because c’mon, who doesn’t like Cake? They’re probably my favorite '90s alt-rock band).
In the “Mathematically rigorous” corner, Michael Nielsen’s neuralnetworksanddeeplearning.com is one of the best resources for anyone looking to get their feet wet in artificial neural networks, and at the can’t-be-beat price of free (But seriously, if you do read this book, please kick the author a few bucks. If a broke college student can do it, you can too)! Having some calculus and linear algebra under your belt certainly helps, but even if you don’t, Nielsen’s explanations of the underlying mathematical principles are very readable. The code in the book, on the other hand, is exactly what you would expect research code to look like. That isn’t inherently bad, but PEP 8 isn’t too complicated and it greatly improves readability, a central tenet of Python. Indentation and whitespace aside, vanilla Python’s memory usage and running time both leave more and more to be desired as the amount of data you throw at it grows. Thankfully, Nielsen addresses this early on, challenging the reader to reimplement the code using a more efficient library, such as NumPy.
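To make that challenge concrete, the gap is roughly the difference between looping over every weight yourself and handing the whole matrix product to a compiled library in one call. Here's a rough sketch of the two approaches (illustrative only, with made-up names, and a bit of a spoiler for the next paragraph):

# Illustrative sketch: the same weighted sums computed with plain Python loops
# and with a single NumPy matrix product (all names here are made up)
import numpy as np

inputs = np.random.rand(100, 784)    # a minibatch of 100 flattened images
weights = np.random.rand(784, 30)    # weights into a 30-neuron layer

# Vanilla Python: nested loops over examples, neurons, and inputs
slow = [[sum(inputs[i][k] * weights[k][j] for k in range(784))
         for j in range(30)]
        for i in range(100)]

# NumPy: one call that dispatches the whole thing to optimized native code
fast = np.dot(inputs, weights)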
Great! What’s NumPy? That question led me to the top contender in the “Computationally well-designed” corner, a series of articles by machine learning researcher Brian Dolhansky. His code is beautiful, making full use of NumPy with not a God object to be found! …nor an explanation of why a softmax activation function is applied in the output layer while the rest are sigmoid. It’s simply assumed that you already know the basics or can pick them up later. That’s perfectly fine, as Nielsen’s book gives an excellent explanation of softmax neurons, as well as of more advanced techniques like weight regularization. Reverse-engineering Dolhansky’s code gave me a framework for implementing neural-network algorithms in a computationally efficient way.
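For anyone else left wondering about that softmax choice, the short version is that softmax squashes the output layer's raw scores into a probability distribution over the classes, which is exactly what you want when the network's job is to pick one. A standalone sketch (not part of the modules below):

# Standalone sketch: softmax turns raw output scores into a probability
# distribution, which is why it sits in the output layer
import numpy as np

scores = np.array([[2.0, 1.0, 0.1]])   # raw weighted inputs to the output layer
probs = np.exp(scores) / np.sum(np.exp(scores), axis=1).reshape(-1, 1)

print probs           # ~[[0.66  0.24  0.10]]
print np.sum(probs)   # 1.0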
So without further ado, here is my attempt at a ground-up implementation of a neural network.
import numpy as np

import activations as sigma
import costs as cost


class Layer:
    def __init__(self, size, minibatch_size, input_layer=False,
                 output_layer=False, activation=sigma.prelu):
        self.size = size
        self.input_layer = input_layer
        self.output_layer = output_layer
        self.activation = activation
        self.z = np.zeros((minibatch_size, size[0]))

        if not input_layer:
            self.s = np.zeros((minibatch_size, size[1]))
            self.delta = np.zeros((minibatch_size, size[1]))
            # Initialize weights with a Gaussian distribution scaled by the
            # number of inputs to each neuron
            self.weights = np.random.normal(size=size,
                                            scale=1/np.sqrt(size[0]))
            self.del_w = np.zeros_like(self.weights)

        if not input_layer and not output_layer:
            self.f_prime = np.zeros((minibatch_size, size[1]))
            # Initialize neurons with no bias, allowing biases to be learned
            # through backpropagation
            self.biases = np.zeros((1, size[1]))
            self.del_b = np.zeros_like(self.biases)

    def forward_propagate(self):
        if self.input_layer:
            # No activation function applied to input layer
            return self.z

        if not self.input_layer and not self.output_layer:
            self.s = np.dot(self.s, self.weights)
            self.s = np.add(self.s, self.biases)
            self.f_prime = self.activation(self.s, deriv=True)
        else:
            self.s = np.dot(self.s, self.weights)

        self.z = self.activation(self.s)
        return self.z


class Network:
    def __init__(self, sizes, minibatch_size, cost=cost.cross_entropy):
        self.sizes = sizes
        self.num_layers = len(sizes)
        self.layers = np.empty(self.num_layers, dtype=object)
        self.minibatch_size = minibatch_size
        self.cost = cost

        print "Initializing network..."
        for i in range(self.num_layers-1):
            if i == 0:
                print "\tInitializing input layer of size {0}.".format(
                    sizes[i])
                self.layers[i] = Layer([sizes[i]], minibatch_size,
                                       input_layer=True)
            else:
                print "\tInitializing hidden layer of size {0}.".format(
                    sizes[i])
                self.layers[i] = Layer([sizes[i-1], sizes[i]],
                                       minibatch_size)
        print "\tInitializing output layer of size {0}.".format(sizes[-1])
        self.layers[-1] = Layer([sizes[-2], sizes[-1]], minibatch_size,
                                output_layer=True, activation=sigma.softmax)
        print "Done!"

    def forward_propagate(self, input_data):
        self.layers[0].z = input_data
        for i in range(self.num_layers-1):
            self.layers[i+1].s = self.layers[i].forward_propagate()
        return self.layers[-1].forward_propagate()

    def backpropagate(self, y_hat, label):
        # Calculate derivative of cost function
        self.layers[-1].delta = (self.cost).error(y_hat, label)
        for i in range(self.num_layers-2, 0, -1):
            self.layers[i].delta = np.dot(self.layers[i+1].delta,
                                          self.layers[i+1].weights.T) * \
                self.layers[i].f_prime

    def update_weights(self, n_examples, eta, lmbd):
        for i in range(1, self.num_layers):
            if i < self.num_layers-1:
                self.layers[i].del_b = \
                    np.dot(np.ones((1, self.minibatch_size)),
                           self.layers[i].delta)
                self.layers[i].biases = self.layers[i].biases - \
                    (eta / self.minibatch_size) * self.layers[i].del_b

            self.layers[i].del_w = np.dot(self.layers[i-1].z.T,
                                          self.layers[i].delta)
            # Apply L2 regularization to weight updates with regularization
            # rate lambda
            self.layers[i].weights = (1 - eta * (lmbd / n_examples)) * \
                self.layers[i].weights - (eta / self.minibatch_size) * \
                self.layers[i].del_w

    def train(self, data, n_examples, eta, lmbd):
        inputs, labels = data
        output = self.forward_propagate(inputs)
        self.backpropagate(output, labels)
        self.update_weights(n_examples, eta=eta, lmbd=lmbd)
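Before wiring it up to real data, I find it helps to sanity-check the shapes with a throwaway script like the one below (purely illustrative, and it assumes the classes above are saved as matrix_network.py next to the activations and costs modules):

# Hypothetical smoke test: one forward and backward pass on random data,
# assuming the classes above live in matrix_network.py
import numpy as np
import matrix_network

minibatch_size = 10
network = matrix_network.Network([784, 30, 10], minibatch_size)

# Fake minibatch: 10 random "images" with random one-hot "labels"
inputs = np.random.rand(minibatch_size, 784)
labels = np.zeros((minibatch_size, 10))
labels[np.arange(minibatch_size),
       np.random.randint(0, 10, minibatch_size)] = 1

output = network.forward_propagate(inputs)
print output.shape                             # (10, 10)
print np.allclose(np.sum(output, axis=1), 1)   # True: softmax rows sum to 1

# One training step on the fake minibatch
network.train((inputs, labels), n_examples=100, eta=1.0, lmbd=0.1)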
The activations module contains a few different activation functions to play around with.
# Module of activation functions for use in matrix-based neural networks
import numpy as np


def sigmoid(s, deriv=False):
    if not deriv:
        return 1.0 / (1.0 + np.exp(-s))
    else:
        return sigmoid(s) * (1.0 - sigmoid(s))


def softmax(s):
    z = np.sum(np.exp(s), axis=1)
    z = z.reshape(z.shape[0], 1)
    return np.exp(s) / z


def relu(s, deriv=False):
    # np.where returns a new array, leaving the weighted inputs "s" untouched
    # so the Layer can reuse them
    if not deriv:
        return np.where(s > 0, s, 0.0)
    else:
        return np.where(s > 0, 1.0, 0.0)


def prelu(s, alpha=0.1, deriv=False):
    # Leaky variant of relu: negative inputs are scaled by alpha rather than
    # clamped to zero
    if not deriv:
        return np.where(s > 0, s, alpha * s)
    else:
        return np.where(s > 0, 1.0, alpha)
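To get a feel for what each of these does to a matrix of weighted inputs, here's a tiny, purely illustrative example (the values in the comments are approximate):

# Illustrative example of the activation functions on a tiny matrix
import numpy as np
import activations as sigma

s = np.array([[-2.0, 0.0, 3.0]])

print sigma.sigmoid(s)             # ~[[0.119  0.5  0.953]]
print sigma.relu(s)                # [[ 0.  0.  3.]]
print sigma.prelu(s)               # leaky: ~[[-0.2  0.  3.]]
print sigma.prelu(s, deriv=True)   # slopes: [[ 0.1  0.1  1. ]]
print sigma.softmax(np.array([[1.0, 2.0, 3.0]]))  # a row that sums to 1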
Likewise, the costs module includes a couple of different cost functions to pair with different activation functions.
# Module of cost functions for use in matrix-based neural networks
import numpy as np


class quadratic(object):
    @staticmethod
    def func(y_hat, label):
        # Return cost of actual output "y_hat" with respect to expected
        # output "label"
        return 0.5 * np.linalg.norm(y_hat - label)**2

    @staticmethod
    def error(y_hat, label):
        # Return the error signal "delta" from the output layer
        return (y_hat - label)


class cross_entropy(object):
    @staticmethod
    def func(y_hat, label):
        # Return cost of actual output "y_hat" with respect to expected
        # output "label"
        return np.sum(np.nan_to_num(-label * np.log(y_hat) -
                                    (1 - label) * np.log(1 - y_hat)))

    @staticmethod
    def error(y_hat, label):
        # Return the error signal "delta" from the output layer
        return (y_hat - label)
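One detail worth calling out: both cost classes return the same error signal, y_hat - label. That's no accident for cross-entropy: when it's paired with sigmoid or softmax outputs, the derivative of the activation cancels out of the output-layer delta, which is a big part of why the combination is so popular (Nielsen walks through the derivation). A toy example, illustrative only:

# Toy example (illustrative): the two cost functions on a single prediction
# that puts 70% of its mass on the correct class
import numpy as np
import costs as cost

y_hat = np.array([[0.7, 0.2, 0.1]])   # network output (e.g. from softmax)
label = np.array([[1.0, 0.0, 0.0]])   # one-hot expected output

print cost.quadratic.func(y_hat, label)       # ~0.07
print cost.cross_entropy.func(y_hat, label)   # ~0.69
print cost.cross_entropy.error(y_hat, label)  # [[-0.3  0.2  0.1]]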
Although I’m not working with the raw MNIST dataset, the pickled version still needs to be loaded, and the training labels need some additional pre-processing, namely conversion into one-hot vectors.
import cPickle
import gzip

import numpy as np


def load_data():
    data = gzip.open('../data/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(data)
    data.close()

    training_inputs, training_labels = training_data
    vectorized_labels = np.zeros((len(training_labels), 10))
    for i in range(len(training_labels)):
        vectorized_labels[i] = vectorize_label(training_labels[i])

    training_data = zip(training_inputs, vectorized_labels)
    validation_data = zip(*validation_data)
    test_data = zip(*test_data)
    return training_data, validation_data, test_data


def vectorize_label(i):
    vectorized_label = np.zeros((10))
    vectorized_label[i] = 1
    return vectorized_label
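If the pickled dataset is sitting where the loader expects it ('../data/mnist.pkl.gz'), a quick and entirely optional sanity check looks something like this:

# Quick sanity check of the loader output (assumes ../data/mnist.pkl.gz exists)
import mnist_loader

training_data, validation_data, test_data = mnist_loader.load_data()

# With the standard MNIST pickle this prints 50000 10000 10000
print len(training_data), len(validation_data), len(test_data)

first_input, first_label = training_data[0]
print first_input.shape   # (784,) -- one flattened 28x28 image
print first_label         # its label as a one-hot vector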
The oh-so-cleverly named evaluator module handles the training and evaluation of the network, printing the results of each epoch.
# Evaluates neural networks with given hyper-parameters and returns results
import numpy as np


def evaluate(network, epochs, eta, lmbd, training_data,
             validation_data=None, test_data=None):
    n_training = len(training_data)
    print "Training for {0} epochs...".format(epochs)
    for t in range(epochs):
        np.random.shuffle(training_data)
        out_str = "\tEpoch {0:2d}:".format(t+1)

        for i in range(n_training/network.minibatch_size):
            data = create_minibatch(training_data, i,
                                    network.minibatch_size,
                                    training_data=True)
            network.train(data, n_training, eta, lmbd)

        if validation_data:
            n_validation = len(validation_data)
            n_correct = 0
            for i in range(n_validation/network.minibatch_size):
                inputs, labels = create_minibatch(validation_data, i,
                                                  network.minibatch_size)
                output = network.forward_propagate(inputs)
                y_hat = np.argmax(output, axis=1)
                n_correct += np.sum(y_hat == labels)
            out_str = "{0} Validation accuracy: {1:.2f}%".format(
                out_str, float(n_correct)/n_validation * 100)

        if test_data:
            n_test = len(test_data)
            n_correct = 0
            for i in range(n_test/network.minibatch_size):
                inputs, labels = create_minibatch(test_data, i,
                                                  network.minibatch_size)
                output = network.forward_propagate(inputs)
                y_hat = np.argmax(output, axis=1)
                n_correct += np.sum(y_hat == labels)
            out_str = "{0} Test accuracy: {1:.2f}%".format(
                out_str, float(n_correct)/n_test * 100)

        print out_str


def create_minibatch(data, i, minibatch_size, training_data=False):
    inputs, labels = zip(*data)
    n = np.size(inputs[0], axis=0)
    minibatch_inputs = np.zeros((minibatch_size, n))

    # Offset of the i-th minibatch into the (shuffled) data
    offset = i * minibatch_size

    if training_data:
        minibatch_labels = np.empty((minibatch_size, 10))
        for j in range(minibatch_size):
            minibatch_inputs[j, :] = inputs[offset + j]
            minibatch_labels[j, :] = labels[offset + j]
    else:
        minibatch_labels = np.empty(minibatch_size)
        for j in range(minibatch_size):
            minibatch_inputs[j, :] = inputs[offset + j]
            minibatch_labels[j] = int(labels[offset + j])

    return minibatch_inputs, minibatch_labels
Finally, I wrote a quick script to initialize networks and evaluate them on the MNIST dataset with given hyper-parameters, greatly reducing the amount of repetitive typing, though sadly not eliminating it entirely.
import mnist_loader
import matrix_network
import evaluator

training_data, validation_data, test_data = mnist_loader.load_data()

network = matrix_network.Network([784, 100, 10], 100)
evaluator.evaluate(network, 30, 1.0, 0.1, training_data,
                   validation_data, test_data)
There’s quite a bit going on in the code overall, despite the relatively simple algorithms underpinning it. The power of these networks lies in the scale of data those simple algorithms can leverage when implemented with modern matrix libraries. In the next few weeks, I’ll attempt to shed some light on what’s happening under the hood.