
Building a Matrix-based Neural Network From the Ground Up

Earlier in the summer I began my quest to learn more about machine learning, originally planning to just dive into the deep end with TensorFlow. I very quickly realized the error in this approach and changed course: I would instead build a neural network from the ground up. Not only did I want to better understand the fundamental math, I also wanted to familiarize myself with the mathematics libraries needed to work effectively with the mountain of data used to train a network.

Machine learning by xkcd

Looking for guidance in my endeavor, I naturally turned to Google. Glancing over the search results, I found that almost all of them could be described by a Venn diagram with two categories: “Mathematically rigorous” and “Computationally well-designed.” Sadly, this Venn diagram features very little overlap. Wanting to have my cake and eat it too, I found the best of each category and spent what little free time I had this past summer melding them together (because c’mon, who doesn’t like Cake? They’re probably my favorite '90s alt-rock band).

In the “Mathematically rigorous” corner, Michael Nielsen’s neuralnetworksanddeeplearning.com is one of the best resources for anyone looking to get their feet wet in artificial neural networks, and for the can’t-be-beat price of free (But seriously, if you do read this book, please kick the author a few bucks. If a broke college student can do it, you can too)! Obviously, having some calculus and linear algebra under your belt is nice, but even if you don’t, Nielsen’s explanations of the underlying mathematical principles are very readable. The code in the book, on the other hand, is exactly what you would expect research code to look like. That isn’t inherently bad, but PEP 8 isn’t too complicated and it greatly improves readability, a central tenet of Python. Indentation and whitespace aside, vanilla Python’s memory management leaves more and more to be desired as the amount of data it has to store grows, and its time complexity is a similar story. Thankfully Nielsen addresses this early on, challenging the reader to implement the code using a more efficient library, such as NumPy.

Great! What’s NumPy? This led me to our top contender in the “Computationally well-designed” corner, a series of articles by machine learning researcher Brian Dolhansky. His code is beautiful: fully vectorized with NumPy and not a God object to be found! …nor, however, an explanation of why a softmax activation function is applied in the output layer while the rest are sigmoid. It’s assumed either that you already know the basics or that you can learn them later. This is perfectly fine, as Nielsen’s book gives an excellent explanation of softmax neurons, as well as other more advanced techniques like weight regularization. Reverse-engineering Dolhansky’s code gave me an understanding of the framework needed to implement neural-network-based algorithms in a computationally efficient way.
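
Just how big a difference does that make? Here’s a rough, unscientific sketch comparing a pure-Python triple-loop matrix multiplication against a single call to np.dot (the exact timings will depend on your machine; it’s the order-of-magnitude gap that matters):

import time
import numpy as np

a = np.random.rand(200, 200)
b = np.random.rand(200, 200)


def matmul_pure_python(a, b):
    # Naive triple loop over rows, columns, and the shared inner dimension
    n, m, p = a.shape[0], a.shape[1], b.shape[1]
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i, k] * b[k, j]
    return c


start = time.time()
matmul_pure_python(a, b)
print("Pure Python: {0:.3f} seconds".format(time.time() - start))

start = time.time()
np.dot(a, b)
print("NumPy:       {0:.5f} seconds".format(time.time() - start))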

So without further ado, here is my attempt at a ground-up implementation of a neural network.

import numpy as np
import activations as sigma
import costs as cost


class Layer:
    def __init__(self, size, minibatch_size, input_layer=False,
                 output_layer=False, activation=sigma.prelu):
        self.size = size
        self.input_layer = input_layer
        self.output_layer = output_layer
        self.activation = activation
        self.z = np.zeros((minibatch_size, size[0]))
        if not input_layer:
            self.s = np.zeros((minibatch_size, size[1]))
            self.delta = np.zeros((minibatch_size, size[1]))
            # Initialize weights from a Gaussian distribution scaled by the
            # number of inputs to each neuron (1/sqrt(n_in))
            self.weights = np.random.normal(size=size,
                                            scale=1/np.sqrt(size[0]))
            self.del_w = np.zeros_like(self.weights)
        if not input_layer and not output_layer:
            self.f_prime = np.zeros((minibatch_size, size[1]))
            # Initialize neurons with no bias, allowing biases to be learned
            # through backpropagation
            self.biases = np.zeros((1, size[1]))
            self.del_b = np.zeros_like(self.biases)

    def forward_propagate(self):
        if self.input_layer:  # No activation function applied to input layer
            return self.z
        if not self.input_layer and not self.output_layer:
            self.s = np.dot(self.s, self.weights)
            self.s = np.add(self.s, self.biases)
            self.f_prime = self.activation(self.s, deriv=True)
        else:
            self.s = np.dot(self.s, self.weights)
        self.z = self.activation(self.s)
        return self.z


class Network:
    def __init__(self, sizes, minibatch_size, cost=cost.cross_entropy):
        self.sizes = sizes
        self.num_layers = len(sizes)
        self.layers = np.empty(self.num_layers, dtype=object)
        self.minibatch_size = minibatch_size
        self.cost = cost
        print "Initializing network..."
        for i in range(self.num_layers-1):
            if i == 0:
                print "\tInitializing input layer of size {0}.".format(
                    sizes[i])
                self.layers[i] = Layer([sizes[i]], minibatch_size,
                                       input_layer=True)
            else:
                print "\tInitializing hidden layer of size {0}.".format(
                    sizes[i])
                self.layers[i] = Layer([sizes[i-1], sizes[i]], minibatch_size)
        print "\tInitializing output layer of size {0}.".format(sizes[-1])
        self.layers[-1] = Layer([sizes[-2], sizes[-1]], minibatch_size,
                                output_layer=True, activation=sigma.softmax)
        print "Done!"

    def forward_propagate(self, input_data):
        self.layers[0].z = input_data
        for i in range(self.num_layers-1):
            # Each layer's weighted input is the previous layer's activation
            self.layers[i+1].s = self.layers[i].forward_propagate()
        return self.layers[-1].forward_propagate()

    def backpropagate(self, y_hat, label):
        # Calculate the output layer's error signal from the derivative of
        # the cost function
        self.layers[-1].delta = (self.cost).error(y_hat, label)
        # Propagate the error backward through the hidden layers
        for i in range(self.num_layers-2, 0, -1):
            self.layers[i].delta = np.dot(self.layers[i+1].delta,
                                          self.layers[i+1].weights.T) * \
                self.layers[i].f_prime

    def update_weights(self, n_examples, eta, lmbd):
        for i in range(1, self.num_layers):
            if i < self.num_layers-1:
                self.layers[i].del_b = \
                    np.dot(np.ones((1, self.minibatch_size)),
                           self.layers[i].delta)
                self.layers[i].biases = self.layers[i].biases - \
                    (eta / self.minibatch_size) * self.layers[i].del_b
            self.layers[i].del_w = np.dot(self.layers[i-1].z.T,
                                          self.layers[i].delta)
            # Apply L2 regularization to weight updates with regularization
            # rate lambda
            self.layers[i].weights = (1 - eta * (lmbd / n_examples)) * \
                self.layers[i].weights - (eta / self.minibatch_size) * \
                self.layers[i].del_w

    def train(self, data, n_examples, eta, lmbd):
        inputs, labels = data
        output = self.forward_propagate(inputs)
        self.backpropagate(output, labels)
        self.update_weights(n_examples, eta=eta, lmbd=lmbd)
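
Before wiring up any real data, a quick smoke test of the forward pass is reassuring. This is just a sketch, and it assumes the code above is saved as matrix_network.py with the activations and costs modules shown below sitting alongside it:

import numpy as np
import matrix_network

# A tiny network fed a random minibatch of 5 "images"; the softmax output
# layer should produce a (5, 10) array of class probabilities
net = matrix_network.Network([784, 30, 10], 5)
dummy_input = np.random.rand(5, 784)
output = net.forward_propagate(dummy_input)
print(output.shape)            # (5, 10)
print(np.sum(output, axis=1))  # each row sums to (approximately) 1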

The activations module contains a few different activation functions to play around with.

# Module of activation functions for use in matrix-based neural networks
import numpy as np


def sigmoid(s, deriv=False):
    if not deriv:
        return 1.0 / (1.0 + np.exp(-s))
    else:
        return sigmoid(s) * (1.0 - sigmoid(s))


def softmax(s):
    # Normalize each row so the outputs form a probability distribution
    z = np.sum(np.exp(s), axis=1)
    z = z.reshape(z.shape[0], 1)
    return np.exp(s) / z


def relu(s, deriv=False):
    # Vectorized, so the input array is left untouched
    if not deriv:
        return np.where(s <= 0, 0.0, s)
    else:
        return np.where(s <= 0, 0.0, 1.0)


def prelu(s, alpha=0.1, deriv=False):
    # "Leaky" variant of relu that scales negative inputs by alpha instead
    # of zeroing them out
    if not deriv:
        return np.where(s <= 0, alpha * s, s)
    else:
        return np.where(s <= 0, alpha, 1.0)
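
Here’s a quick way to play around with them (assuming the module is saved as activations.py, as the network code above expects):

import numpy as np
import activations as sigma

s = np.array([[-2.0, 0.5, 3.0]])
print(sigma.sigmoid(s))            # every value squashed into (0, 1)
print(sigma.prelu(s))              # negative entries scaled by alpha = 0.1
print(sigma.prelu(s, deriv=True))  # derivative: alpha where s <= 0, 1 elsewhere
print(np.sum(sigma.softmax(s)))    # the softmax outputs sum to 1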

Likewise, the cost module includes a couple of different cost functions for use with different activation functions.

# Module of cost functions for use in matrix-based neural networks
import numpy as np


class quadratic(object):
    @staticmethod
    def func(y_hat, label):
        # Return cost of actual output "y_hat" with respect to expected
        # output "label"
        return 0.5*np.linalg.norm(y_hat - label)**2

    @staticmethod
    def error(y_hat, label):
        # Return the error signal "delta" from the output layer
        return (y_hat - label)


class cross_entropy(object):
    @staticmethod
    def func(y_hat, label):
        # Return cost of actual output "y_hat" with respect to expected
        # output "label"
        return np.sum(np.nan_to_num(-label * np.log(y_hat) - (1 - label) *
                                    np.log(1 - y_hat)))

    @staticmethod
    def error(y_hat, label):
        # Return the error signal "delta" from the output layer
        return (y_hat - label)
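
A similar sanity check shows the cross-entropy cost doing its job, staying small for a confident correct prediction and blowing up for a confident wrong one (assuming the module is saved as costs.py):

import numpy as np
import costs as cost

label = np.array([[0.0, 1.0]])             # the desired one-hot output
confident_right = np.array([[0.1, 0.9]])
confident_wrong = np.array([[0.9, 0.1]])

print(cost.cross_entropy.func(confident_right, label))  # ~0.21
print(cost.cross_entropy.func(confident_wrong, label))  # ~4.61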

Although I’m not working with the raw MNIST dataset, the pickled data still needs to be loaded and requires some additional pre-processing before training.

import cPickle
import gzip
import numpy as np


def load_data():
    data = gzip.open('../data/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(data)
    data.close()
    training_inputs, training_labels = training_data
    # Convert each training label from a digit into a one-hot vector
    vectorized_labels = np.zeros((len(training_labels), 10))
    for i in range(len(training_labels)):
        vectorized_labels[i] = vectorize_label(training_labels[i])
    training_data = zip(training_inputs, vectorized_labels)
    validation_data = zip(*validation_data)
    test_data = zip(*test_data)
    return training_data, validation_data, test_data


def vectorize_label(i):
    vectorized_label = np.zeros((10))
    vectorized_label[i] = 1
    return vectorized_label
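
Assuming mnist.pkl.gz is sitting in ../data/ as the loader expects, here’s a peek at what comes back:

import mnist_loader

training_data, validation_data, test_data = mnist_loader.load_data()
first_image, first_label = training_data[0]
print(len(training_data))  # number of training examples
print(first_image.shape)   # (784,) -- a flattened 28x28 pixel image
print(first_label)         # the label as a one-hot vector of length 10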

The oh-so-cleverly named evaluator module handles the training and evaluation of the network, reporting the results of each epoch.

# Evaluates neural networks with given hyper-parameters and reports results
import numpy as np


def evaluate(network, epochs, eta, lmbd, training_data,
             validation_data=None, test_data=None):
    n_training = len(training_data)
    print "Training for {0} epochs...".format(epochs)
    for t in range(epochs):
        np.random.shuffle(training_data)
        out_str = "\tEpoch {0:2d}:".format(t+1)
        for i in range(n_training/network.minibatch_size):
            data = create_minibatch(training_data, i, network.minibatch_size,
                                    training_data=True)
            network.train(data, n_training, eta, lmbd)
        if validation_data:
            n_validation = len(validation_data)
            n_correct = 0
            for i in range(n_validation/network.minibatch_size):
                inputs, labels = create_minibatch(validation_data, i,
                                                  network.minibatch_size)
                output = network.forward_propagate(inputs)
                # The prediction is the output neuron with the highest
                # activation
                y_hat = np.argmax(output, axis=1)
                n_correct += np.sum(y_hat == labels)
            out_str = "{0} Validation accuracy: {1:.2f}%".format(
                out_str, float(n_correct)/n_validation * 100)
        if test_data:
            n_test = len(test_data)
            n_correct = 0
            for i in range(n_test/network.minibatch_size):
                inputs, labels = create_minibatch(test_data, i,
                                                  network.minibatch_size)
                output = network.forward_propagate(inputs)
                y_hat = np.argmax(output, axis=1)
                n_correct += np.sum(y_hat == labels)
            out_str = "{0} Test accuracy: {1:.2f}%".format(
                out_str, float(n_correct)/n_test * 100)
        print out_str


def create_minibatch(data, i, minibatch_size, training_data=False):
    inputs, labels = zip(*data)
    n = np.size(inputs[0], axis=0)
    minibatch_inputs = np.zeros((minibatch_size, n))
    if training_data:
        # Training labels are one-hot vectors; validation and test labels
        # are plain digits
        minibatch_labels = np.empty((minibatch_size, 10))
        for j in range(minibatch_size):
            minibatch_inputs[j, :] = inputs[i*minibatch_size + j]
            minibatch_labels[j, :] = labels[i*minibatch_size + j]
    else:
        minibatch_labels = np.empty(minibatch_size)
        for j in range(minibatch_size):
            minibatch_inputs[j, :] = inputs[i*minibatch_size + j]
            minibatch_labels[j] = int(labels[i*minibatch_size + j])
    return minibatch_inputs, minibatch_labels

Finally, I wrote a quick script to initialize networks and evaluate them on the MNIST dataset with given hyper-parameters, greatly reducing the amount of repetitive typing, though sadly not eliminating it entirely.

import mnist_loader
import matrix_network
import evaluator

training_data, validation_data, test_data = mnist_loader.load_data()
network = matrix_network.Network([784, 100, 10], 100)
evaluator.evaluate(network, 30, 1.0, 0.1, training_data,
                   validation_data, test_data)

There’s quite a bit going on in the code overall, despite the relatively simple algorithms underpinning it. The power of these networks lies in the scale of data that can be leveraged by these algorithms when implemented using modern matrix libraries. In the next few weeks, I’ll attempt to shed some light on what’s happening under the hood.

❋❋❋

Roll Your Own Python IDE

I’ve found myself using Python significantly more in the past week and quickly grew tired of working in a shell in Terminal, so I started looking around for a good IDE that supports Python. Since I already use Xcode for iOS development, I tried it first and was surprised to discover that you can specify whatever build tool you’d like for a project; however, wrestling with auto-indentation appearing in all the wrong places became frustrating, to say the least. Going back to the drawing board, I started scouring python.org’s wiki of various IDEs, trying just about every macOS-supported IDE, extension, and plugin listed. My wishlist of requirements and features might be a little long, but hey, almost all of my time coding in Python will be spent within this integrated development environment (hence the name), so I think I’m allowed to be a bit picky. Here it is:

  • Source code editor
  • Syntax-highlighting
  • Linter
  • Autocomplete
  • Compiler
  • Debugger
  • Autosave
  • Version control (Preferably through Git)
  • Free and open-source
  • Native application

Ok, so those last two aren’t deal-breakers, but they’d be greatly appreciated, especially that last one, given how atrocious Java app UIs tend to be (Looking at you, CrashPlan).

I had all but given up hope of finding an IDE and started looking into text editors that could fulfill at least part of my wishlist, which is when I found Atom, “A hackable text editor for the 21st century.” Now, I had heard of Atom before as one of the contenders in the modern incarnation of the “vim vs emacs” war, facing off against Sublime. As a result, I had always written it off as another glorified iteration of TextEdit, until I found this post about configuring it for Python. Configuring Atom with just the packages mentioned in the post crossed off quite a few of my wishlist requirements, even the optional points regarding a native, open-source application. Seeing the plethora of other packages available, I tried taking things a step further and rolled my own Python IDE in Atom.


Atom, a hackable text editor for the 21st century.

Installing Packages

There are two ways to install packages in Atom: you can either search for packages published on atom.io from the Install tab in Preferences, or use apm in Terminal. For example, to install Linter for Atom from Terminal, just enter:

$ apm install linter

Atom needs to be restarted after packages are installed for changes to take effect.

Configuring Atom

If you haven’t already, install Linter for Atom, mentioned above. Additionally, we’re going to need a Python source code checker, flake8, which we can get through pip:

$ pip install flake8

We’ll also need an interface between linter and flake8, linter-flake8. Again, you can use the Install tab in Preferences, or:

$ apm install linter-flake8

For autocomplete, you can install the autocomplete-python package, while the script and python-debugger packages provide compiling and debugging functionality, respectively. There are a few dependency packages that Atom might ask you to install after restarting, such as intentions and busy-signal. Finally, autosave is a pre-installed package in Atom but is disabled by default. To enable it, go to the Packages tab in Preferences and search for “autosave,” which will appear under Core Packages; in its settings, select “Enabled.” But what about version control? I almost forgot! Atom comes with Git support and core APIs already baked in.
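
As with the earlier packages, the ones mentioned above can also be grabbed from Terminal if you prefer (or through the Install tab in Preferences):

$ apm install autocomplete-python
$ apm install script
$ apm install python-debugger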

Atom is incredibly customizable thanks to the wealth of packages available, and this process could easily be applied to other languages. I look forward to seeing how far this text editor can go, as I explore and tinker with everything Atom has to offer.

❋❋❋

Machine Learning 101

Computers have become ubiquitous in this day and age, permeating every facet of our lives. Considered to be Turing complete, they are capable of carrying out any set of given instructions, unwaveringly. While this is perfect for tasks involving exact mathematical processes, say, calculating pi to over 22 trillion digits, they can’t perform simple tasks a child could, such as distinguishing a cat from a bird or reading handwriting. Machine learning is the field of computer science that aims to achieve these feats by giving computers the ability to operate without explicit instruction.

There are a variety of different approaches and algorithms being studied, from decision trees to inductive logic programming. One technique in particular, artificial neural networks, tries to emulate what some might say is the most advanced computer ever known, the human brain. Neural networks, as the name implies, are simply networks of interconnected nodes known as neurons. Neurons process multiple inputs, and decide whether or not to pass a signal on to the next one. The simplicity of individual neurons allows them to be interconnected into vast networks, amplifying their processing power with each layer.


A simple three layer neural network, courtesy Wikimedia Commons.

As ever-larger artificial neural networks are built with more and more layers, their hidden layers grow deeper and deeper; this is where the term “Deep Learning” comes from.

From Biology to Bits

Biological neurons communicate through chemicals called neurotransmitters. These molecules bind with receptors on the cell’s dendrites and alter the electric potential across its membrane by allowing ions to flow in and out of the cell. When a large enough electric potential is produced, it propagates through the neuron and down its axon, where the neuron releases neurotransmitters across a synapse to the dendrites of the next neuron.

All of that is a little convoluted to try to simulate, but we can reduce the process to something mathematically model-able. Inputs can be anything from a dimension of a dataset to a simple binary yes or no. An artificial neuron multiplies each input by a variable weight, then sums the results. Input weights are adjusted through a process called backpropagation (we’ll get more into this later), which allows the network to evolve and produce “better” results (by whatever metric that’s measured) over time. To determine whether or not the neuron will “fire,” and what signal to pass to the next, the sum of the weighted inputs is passed through an activation function. A biological neuron’s activation function is simply a threshold that must be surpassed, but an artificial neuron’s can be more complicated. One activation function commonly employed is the sigmoid function.


The sigmoid function, courtesy Wikimedia Commons.

The sigmoid function is an “S”-shaped curve that maps the weighted input sum to a value between 0 and 1 (which helps simplify calculations with normalized data).
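
To make that concrete, here’s a minimal sketch of a single artificial neuron in Python (the weights and inputs are made up purely for illustration):

import numpy as np


def sigmoid(s):
    # Squash any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-s))


# Made-up weights and inputs for a neuron with three inputs
weights = np.array([0.8, -0.4, 0.3])
inputs = np.array([1.0, 0.5, 2.0])

# Multiply each input by its weight, sum the results, then pass the sum
# through the activation function to see how strongly the neuron "fires"
output = sigmoid(np.dot(weights, inputs))
print(output)  # roughly 0.77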

Making Mistakes

Much like how no one is born knowing how to ride a bike, artificial neural networks must be taught and trained how to process inputs correctly. There are two schools of thought regarding learning: supervised and unsupervised. Backpropagation is a common training method that can be used in both settings, though it is generally considered a supervised learning technique. A two-part process, it begins with error propagation. The network’s output is compared to a desired output, and the error in the output is calculated by a loss function. This error is propagated backward through the network to determine each neuron’s contribution, and these error values are then used to calculate the gradient of the loss function. The gradient is the slope or “steepness” of the loss function, and the lowest point of the loss function is where the error is smallest. The goal of the second part of the process is to find the input weights that lead to this minimum. Optimization methods, such as gradient descent, are used to calculate new input weights for the network. The input weights are updated with these new, “optimized” values, and the process is repeated.
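
For a stripped-down illustration of that loop, here’s a single sigmoid neuron learning one made-up training example by gradient descent (the inputs, target, starting weights, and learning rate are all invented for demonstration):

import numpy as np


def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))


# One invented training example: three inputs and the output we want to see
inputs = np.array([1.0, 0.5, 2.0])
target = 1.0

weights = np.array([0.1, -0.2, 0.05])  # arbitrary starting weights
eta = 0.5                              # learning rate

for step in range(101):
    # Forward pass: weighted sum passed through the activation function
    output = sigmoid(np.dot(weights, inputs))
    # Loss function: squared error between actual and desired output
    loss = 0.5 * (output - target) ** 2
    if step % 25 == 0:
        print("step {0:3d}: output {1:.3f}, loss {2:.4f}".format(
            step, output, loss))
    # Backward pass: the gradient of the loss with respect to each weight,
    # via the chain rule (d loss/d output * d output/d sum * d sum/d weight)
    gradient = (output - target) * output * (1.0 - output) * inputs
    # Gradient descent: step the weights "downhill" and repeat
    weights -= eta * gradient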


3-D plot of a neuron's error with two inputs, courtesy Wikimedia Commons.

For a step-by-step example of backpropagation, check out this tutorial.

The Tip of the Iceberg

Machine learning and artificial neural networks are exciting tools that are already being integrated into the technology we use every day, and the scope of their use and abilities is only growing. While there are still many unique and perplexing challenges to overcome, one day our interactions with them may become as casual and universal as smartphones are today.