Building a Matrix-based Neural Network From the Ground Up

Earlier in the summer I began my quest to learn more about machine learning and had originally planned to just dive into the deep end with TensorFlow. I very quickly realized the error in this approach and changed course: I would instead build a neural network from the ground up. I wanted not only to better understand the fundamental math, but also to familiarize myself with the mathematics libraries needed to work effectively with the mountain of data used to train a network.

Machine learning by xkcd

Looking for guidance in my endeavor, I naturally turned to Google. Glancing over the search results, I found that almost all of them could be described by a Venn diagram with two categories: “Mathematically rigorous” and “Computationally well-designed.” Sadly, this Venn diagram features very little overlap. Wanting to have my cake and eat it too, I found the best of each category and spent what little free time I had this past summer melding them together (because c’mon, who doesn’t like Cake? They’re probably my favorite '90s alt-rock band).

In the “Mathematically rigorous” corner, Michael Nielsen’s neuralnetworksanddeeplearning.com is one of the best resources for anyone looking to get their feet wet in artificial neural networks, and at the can’t-be-beat price of free (but seriously, if you do read this book, please kick the author a few bucks; if a broke college student can do it, you can too)! Having some calculus and linear algebra under your belt certainly helps, but even if you don’t, Nielsen’s explanations of the underlying mathematical principles are very readable. The code in the book, on the other hand, is exactly what you would expect research code to look like. That isn’t inherently bad, but PEP 8 isn’t too complicated and it greatly improves readability, a central tenet of Python. Indentation and whitespace aside, vanilla Python’s memory usage leaves more and more to be desired as the amount of data it has to store grows, and its runtime tells a similar story. Thankfully Nielsen addresses this early on, challenging the reader to re-implement the code using a more efficient library, such as NumPy.
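To get a rough sense of that gap, here is a quick, unscientific sketch comparing a pure-Python dot product against NumPy’s np.dot on a million-element vector. The exact timings depend on your machine, but the vectorized version should win handily.

import time
import numpy as np

# Two million-element vectors (arbitrary values; only the timing matters)
a = [float(i) for i in range(1000000)]
b = [float(i) for i in range(1000000)]

start = time.time()
total = 0.0
for x, y in zip(a, b):  # dot product in vanilla Python
    total += x * y
print("Pure Python: {0:.4f} s".format(time.time() - start))

a_np, b_np = np.array(a), np.array(b)
start = time.time()
total = np.dot(a_np, b_np)  # the same dot product, vectorized by NumPy
print("NumPy:       {0:.4f} s".format(time.time() - start))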

Great! What’s NumPy? This led me to our top contender in the “Computationally well-designed” corner, a series of articles by machine learning researcher Brian Dolhansky. His code is beautiful: it makes full use of NumPy, and there isn’t a God object to be found! …nor, however, an explanation of why a softmax activation function is applied in the output layer while the rest are sigmoid. It’s assumed either that you already know the basics or that you can learn them later. That’s perfectly fine, as Nielsen’s book gives an excellent explanation of softmax neurons, as well as other more advanced techniques like weight regularization. Reverse-engineering Dolhansky’s code gave me a framework for implementing neural-network-based algorithms in a computationally efficient way.
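As a quick illustration of why softmax belongs in the output layer (the numbers here are made up): a sigmoid squashes each output independently, while softmax turns the whole row of scores into a probability distribution over the classes.

import numpy as np

s = np.array([[2.0, 1.0, 0.1]])  # raw scores from a hypothetical output layer

sig = 1.0 / (1.0 + np.exp(-s))
soft = np.exp(s) / np.sum(np.exp(s), axis=1, keepdims=True)

print(sig)           # independent squashes; no reason to sum to 1
print(soft)          # a probability distribution over the three classes
print(np.sum(soft))  # -> 1.0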

So without further ado, here is my attempt at a ground-up implementation of a neural network.

import numpy as np
import activations as sigma
import costs as cost


class Layer:
    def __init__(self, size, minibatch_size, input_layer=False,
                 output_layer=False, activation=sigma.prelu):
        self.size = size
        self.input_layer = input_layer
        self.output_layer = output_layer
        self.activation = activation
        self.z = np.zeros((minibatch_size, size[0]))
        if not input_layer:
            self.s = np.zeros((minibatch_size, size[1]))
            self.delta = np.zeros((minibatch_size, size[1]))
            # Initialize weights from a Gaussian distribution whose standard
            # deviation is scaled by the number of inputs to each neuron
            self.weights = np.random.normal(size=size,
                                            scale=1/np.sqrt(size[0]))
            self.del_w = np.zeros_like(self.weights)
        if not input_layer and not output_layer:
            self.f_prime = np.zeros((minibatch_size, size[1]))
            self.biases = np.zeros((1, size[1]))
            # Initialize neurons with no bias, allowing biases to be learned
            # through backpropagation
            self.del_b = np.zeros_like(self.biases)

    def forward_propagate(self):
        if self.input_layer:  # No activation function applied to input layer
            return self.z
        if not self.input_layer and not self.output_layer:
            self.s = np.dot(self.s, self.weights)
            self.s = np.add(self.s, self.biases)
            self.f_prime = self.activation(self.s, deriv=True)
        else:
            self.s = np.dot(self.s, self.weights)
        self.z = self.activation(self.s)
        return self.z


class Network:
    def __init__(self, sizes, minibatch_size, cost=cost.cross_entropy):
        self.sizes = sizes
        self.num_layers = len(sizes)
        self.layers = np.empty(self.num_layers, dtype=object)
        self.minibatch_size = minibatch_size
        self.cost = cost

        print "Initializing network..."
        for i in range(self.num_layers-1):
            if i == 0:
                print "\tInitializing input layer of size {0}.".format(
                    sizes[i])
                self.layers[i] = Layer([sizes[i]], minibatch_size,
                                       input_layer=True)
            else:
                print "\tInitializing hidden layer of size {0}.".format(
                    sizes[i])
                self.layers[i] = Layer([sizes[i-1], sizes[i]], minibatch_size)
        print "\tInitializing output layer of size {0}.".format(sizes[-1])
        self.layers[-1] = Layer([sizes[-2], sizes[-1]], minibatch_size,
                                output_layer=True, activation=sigma.softmax)
        print "Done!"

    def forward_propagate(self, input_data):
        self.layers[0].z = input_data
        for i in range(self.num_layers-1):
            self.layers[i+1].s = self.layers[i].forward_propagate()
        return self.layers[-1].forward_propagate()

    def backpropagate(self, y_hat, label):
        # Calculate derivative of cost function
        self.layers[-1].delta = (self.cost).error(y_hat, label)
        for i in range(self.num_layers-2, 0, -1):
            self.layers[i].delta = np.dot(self.layers[i+1].delta,
                                          self.layers[i+1].weights.T) * \
                self.layers[i].f_prime

    def update_weights(self, n_examples, eta, lmbd):
        for i in range(1, self.num_layers):
            if i < self.num_layers-1:
                self.layers[i].del_b = \
                    np.dot(np.ones((1, self.minibatch_size)),
                           self.layers[i].delta)
                self.layers[i].biases = self.layers[i].biases - \
                    (eta / self.minibatch_size) * self.layers[i].del_b
            self.layers[i].del_w = np.dot(self.layers[i-1].z.T,
                                          self.layers[i].delta)
            # Apply L2 regularization to weight updates with regularization
            # rate lambda
            self.layers[i].weights = (1 - eta * (lmbd / n_examples)) * \
                self.layers[i].weights - (eta / self.minibatch_size) * \
                self.layers[i].del_w

    def train(self, data, n_examples, eta, lmbd):
        inputs, labels = data
        output = self.forward_propagate(inputs)
        self.backpropagate(output, labels)
        self.update_weights(n_examples, eta=eta, lmbd=lmbd)
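Before pointing the network at MNIST, it’s worth a quick sanity check with random data to make sure the shapes line up. This sketch assumes the code above is saved as matrix_network.py alongside the activations and costs modules, as in the script at the end of this post.

import numpy as np
import matrix_network

# A tiny network and a random minibatch, just to confirm the plumbing works
net = matrix_network.Network([784, 30, 10], minibatch_size=5)
inputs = np.random.randn(5, 784)
labels = np.eye(10)[np.random.randint(0, 10, 5)]  # random one-hot labels

output = net.forward_propagate(inputs)
print(output.shape)  # (5, 10): one softmax distribution per example

net.train((inputs, labels), n_examples=5, eta=0.1, lmbd=0.0)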

The activations module contains a few different activation functions to play around with.

# Module of activation functions for use in matrix-based neural networks
import numpy as np


def sigmoid(s, deriv=False):
    if not deriv:
        return 1.0 / (1.0 + np.exp(-s))
    else:
        return sigmoid(s) * (1.0 - sigmoid(s))


def softmax(s):
    z = np.sum(np.exp(s), axis=1)
    z = z.reshape(z.shape[0], 1)
    return np.exp(s) / z


def relu(s, deriv=False):
    # Return a new array rather than modifying s in place, since the caller
    # reuses s for both the activation and its derivative
    if not deriv:
        return np.where(s > 0, s, 0.0)
    else:
        return np.where(s > 0, 1.0, 0.0)


def prelu(s, alpha=0.1, deriv=False):
    # "Leaky" variant of relu: negative inputs are scaled by alpha instead
    # of being zeroed out
    if not deriv:
        return np.where(s > 0, s, alpha * s)
    else:
        return np.where(s > 0, 1.0, alpha)
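Each activation takes a matrix of weighted inputs and an optional deriv flag (softmax only ever appears in the output layer, where its derivative gets folded into the cost, so it doesn’t need one). A quick example:

import numpy as np
import activations as sigma

s = np.array([[-1.0, 0.5, 2.0]])

print(sigma.sigmoid(s))              # activations
print(sigma.sigmoid(s, deriv=True))  # element-wise derivatives
print(sigma.prelu(s))                # negative inputs scaled by alpha=0.1
print(sigma.softmax(s))              # row sums to 1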

Likewise, the cost module includes a couple of different cost functions for use with different activation functions.

# Module of cost functions for use in matrix-based neural networks
import numpy as np


class quadratic(object):
    @staticmethod
    def func(y_hat, label):
        # Return cost of actual output "y_hat" with respect to expected
        # output "label"
        return 0.5*np.linalg.norm(y_hat - label)**2

    @staticmethod
    def error(y_hat, label):
        # Return the error signal "delta" from the output layer
        return (y_hat - label)


class cross_entropy(object):
    @staticmethod
    def func(y_hat, label):
        # Return cost of actual output "y_hat" with respect to expected
        # output "label"
        return np.sum(np.nan_to_num(-label * np.log(y_hat) - (1 - label) *
                                    np.log(1 - y_hat)))

    @staticmethod
    def error(y_hat, label):
        # Return the error signal "delta" from the output layer
        return (y_hat - label)
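A quick usage example. Note that cross_entropy.error() is simply y_hat - label, which is exactly the output-layer delta you get when a softmax output is paired with a cross-entropy cost.

import numpy as np
import costs as cost

y_hat = np.array([[0.7, 0.2, 0.1]])  # network output, e.g. from softmax
label = np.array([[1.0, 0.0, 0.0]])  # one-hot target

print(cost.cross_entropy.func(y_hat, label))   # scalar cost for this example
print(cost.cross_entropy.error(y_hat, label))  # delta for the output layer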

Although I’m not working with the raw MNIST dataset, the pickled data still needs to be loaded and requires some additional pre-processing before training.

import cPickle
import gzip
import numpy as np


def load_data():
    data = gzip.open('../data/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(data)
    data.close()

    training_inputs, training_labels = training_data
    vectorized_labels = np.zeros((len(training_labels), 10))
    for i in range(len(training_labels)):
        vectorized_labels[i] = vectorize_label(training_labels[i])

    training_data = zip(training_inputs, vectorized_labels)
    validation_data = zip(*validation_data)
    test_data = zip(*test_data)
    return training_data, validation_data, test_data


def vectorize_label(i):
    vectorized_label = np.zeros((10))
    vectorized_label[i] = 1
    return vectorized_label
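Assuming the mnist.pkl.gz file (the pickled MNIST split that Nielsen’s book uses) is sitting in ../data/ as hard-coded above, the loader can be sanity-checked like so:

import mnist_loader

training_data, validation_data, test_data = mnist_loader.load_data()

inputs, label = training_data[0]
print(len(training_data))  # 50,000 training pairs
print(inputs.shape)        # (784,): a flattened 28x28 image
print(label)               # 10-element one-hot vector for the digit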

The oh-so-cleverly named evaluator module handles the training and evaluation of the network, reporting the results of each epoch.

# Evaluates neural networks with given hyper-parameters and reports results
import numpy as np


def evaluate(network, epochs, eta, lmbd, training_data,
             validation_data=None, test_data=None):
    n_training = len(training_data)
    print "Training for {0} epochs...".format(epochs)
    for t in range(epochs):
        np.random.shuffle(training_data)
        out_str = "\tEpoch {0:2d}:".format(t+1)

        for i in range(n_training/network.minibatch_size):
            data = create_minibatch(training_data, i, network.minibatch_size,
                                    training_data=True)
            network.train(data, n_training, eta, lmbd)

        if validation_data:
            n_validation = len(validation_data)
            n_correct = 0
            for i in range(n_validation/network.minibatch_size):
                inputs, labels = create_minibatch(validation_data, i,
                                                  network.minibatch_size)
                output = network.forward_propagate(inputs)
                y_hat = np.argmax(output, axis=1)
                n_correct += np.sum(y_hat == labels)
            out_str = "{0} Validation accuracy: {1:.2f}%".format(
                out_str, float(n_correct)/n_validation * 100)

        if test_data:
            n_test = len(test_data)
            n_correct = 0
            for i in range(n_test/network.minibatch_size):
                inputs, labels = create_minibatch(test_data, i,
                                                  network.minibatch_size)
                output = network.forward_propagate(inputs)
                y_hat = np.argmax(output, axis=1)
                n_correct += np.sum(y_hat == labels)
            out_str = "{0} Test accuracy: {1:.2f}%".format(
                out_str, float(n_correct)/n_test * 100)
        print out_str


def create_minibatch(data, i, minibatch_size, training_data=False):
    inputs, labels = zip(*data)
    n = np.size(inputs[0], axis=0)
    # Offset into the (shuffled) data for the i-th minibatch
    offset = i * minibatch_size
    minibatch_inputs = np.zeros((minibatch_size, n))
    if training_data:
        minibatch_labels = np.empty((minibatch_size, 10))
        for j in range(minibatch_size):
            minibatch_inputs[j, :] = inputs[offset+j]
            minibatch_labels[j, :] = labels[offset+j]
    else:
        minibatch_labels = np.empty(minibatch_size)
        for j in range(minibatch_size):
            minibatch_inputs[j, :] = inputs[offset+j]
            minibatch_labels[j] = int(labels[offset+j])
    return minibatch_inputs, minibatch_labels

Finally, I wrote a quick script to initialize networks and evaluate them on the MNIST dataset with given hyper-parameters, greatly reducing the amount of repetitive typing, though sadly not eliminating it entirely.

import mnist_loader
import matrix_network
import evaluator

training_data, validation_data, test_data = mnist_loader.load_data()
network = matrix_network.Network([784, 100, 10], 100)
evaluator.evaluate(network, 30, 1.0, 0.1, training_data,
                   validation_data, test_data)

There’s quite a bit going on in the code overall, despite the relatively simple algorithms underpinning it. The power of these networks lies in the scale of data those algorithms can leverage when implemented with modern matrix libraries. In the next few weeks, I’ll attempt to shed some light on what’s happening under the hood.