Building PyTorch from Source with CUDA on macOS
PyTorch 1.0.0 is here! Unfortunately, if you want GPU support on macOS, you’ll have to get your hands dirty. Here’s how to build PyTorch from source with CUDA 10.0 on macOS High Sierra.
Prerequisites
Xcode
In this tutorial we’ll be building PyTorch with CUDA 10.0. Xcode 9.4 is required to install CUDA, and can be downloaded from Apple. Be sure to download the Command Line Tools for Xcode 9.4 as well. Extract Xcode to /Applications, renaming it in case another version of Xcode already exists:
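A sketch of that step, assuming the archive downloaded as Xcode_9.4.xip into ~/Downloads (adjust the filename to match what you actually downloaded):

cd ~/Downloads
xip --expand Xcode_9.4.xip                                            # unpacks Xcode.app into the current directory
mv Xcode.app /Applications/Xcode_9.4.app                              # rename so it can coexist with another Xcode
sudo xcode-select -s /Applications/Xcode_9.4.app/Contents/Developer   # optional: make it the active toolchain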
Command Line Tools can be installed by following the instructions given in the installer.
CUDA and cuDNN
As previously mentioned, we’ll be building against the latest releases of CUDA and cuDNN, versions 10.0 and 7.4.1, respectively. CUDA can be downloaded directly from NVIDIA, while downloading cuDNN requires a developer account. Installing CUDA is straightforward: just follow the instructions provided by the downloaded installer. cuDNN, on the other hand, must be extracted and copied into the proper directories by hand.
Navigate to the directory containing the downloaded tarball and extract it:
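The exact filename depends on the build you downloaded; assuming it landed in ~/Downloads, something like:

cd ~/Downloads
tar -xzvf cudnn-10.0-osx-x64-v7.4.1.5.tgz    # extracts into a directory named cuda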
Copy the library components into your CUDA installation, and make them executable:
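A sketch, assuming the default /usr/local/cuda symlink created by the CUDA installer and that you’re still in the directory containing the extracted cuda folder:

sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
sudo cp cuda/lib/libcudnn* /usr/local/cuda/lib/
sudo chmod a+x /usr/local/cuda/lib/libcudnn*    # make the libraries executable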
You can delete the extracted directory afterward if you’d like:
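For example, assuming it was extracted under ~/Downloads as above:

rm -rf ~/Downloads/cuda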
Lastly, set an environment variable, DYLD_LIBRARY_PATH, to point to cuDNN’s location:
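For example, add the following to your shell profile (again assuming the default /usr/local/cuda symlink):

export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH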
Of course, it never hurts to check your work. Verify that cuDNN is usable (the check may return some warnings, but there should be no errors).
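One way to do that, assuming the default /usr/local/cuda paths (the /tmp filenames are just examples), is to compile and run a throwaway program that links against libcudnn:

printf '#include <cudnn.h>\nint main(void) { return 0; }\n' > /tmp/cudnn_check.c
nvcc /tmp/cudnn_check.c -o /tmp/cudnn_check -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcudnn
/tmp/cudnn_check && echo "cuDNN looks usable"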
Building PyTorch
Clone the GitHub repository and navigate into it.
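Using the official repository:

git clone https://github.com/pytorch/pytorch.git
cd pytorch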
Let’s check out the correct branch and download the required dependencies.
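Assuming the v1.0.0 release tag is the target, something like:

git checkout v1.0.0
git submodule sync
git submodule update --init --recursive    # pulls in the third-party dependencies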
PyTorch requires a few additional Python dependencies. I’d recommend installing these into a virtual environment for the build, though I’ll leave the implementation details up to the reader.
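At the time of writing, the list looked roughly like this (check the PyTorch README for the current set):

pip install numpy pyyaml setuptools cmake cffi typing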
Add the path to the Python executable you’ll be using as an environment variable for cmake:
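The build reads CMAKE_PREFIX_PATH to find your Python environment; a sketch, assuming the python currently on your PATH is the one you want to build against:

export CMAKE_PREFIX_PATH="$(dirname $(which python))/../"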
Finally, we’re ready to go! If you’d rather build a .whl for installation with pip, replace install in the command below with bdist_wheel, and you’ll find it in a dist directory upon completion.
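The build invocation looks something like the following; the deployment target here is an assumption for High Sierra, so adjust it (or drop it) to match whatever the PyTorch README recommends for your setup:

MACOSX_DEPLOYMENT_TARGET=10.13 CC=clang CXX=clang++ python setup.py install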
Let’s give PyTorch 1.0 a try by training one of the example models on the MNIST dataset.
from __future__ import print_function
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))


def test(args, model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.max(1, keepdim=True)[1]  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))


def main():
    # Training settings
    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                        help='input batch size for testing (default: 1000)')
    parser.add_argument('--epochs', type=int, default=10, metavar='N',
                        help='number of epochs to train (default: 10)')
    parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
                        help='learning rate (default: 0.01)')
    parser.add_argument('--momentum', type=float, default=0.5, metavar='M',
                        help='SGD momentum (default: 0.5)')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                        help='how many batches to wait before logging training status')
    args = parser.parse_args()
    use_cuda = not args.no_cuda and torch.cuda.is_available()

    torch.manual_seed(args.seed)

    device = torch.device("cuda" if use_cuda else "cpu")

    kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=args.batch_size, shuffle=True, **kwargs)
    test_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=False,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=args.test_batch_size, shuffle=True, **kwargs)

    model = Net().to(device)
    optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

    for epoch in range(1, args.epochs + 1):
        train(args, model, device, train_loader, optimizer, epoch)
        test(args, model, device, test_loader)


if __name__ == '__main__':
    main()
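Assuming the script above is saved as main.py (the filename is just an example), a GPU run and a CPU-only run for comparison look like:

python main.py             # trains on the GPU when CUDA is available
python main.py --no-cuda   # CPU-only run for comparison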
Running the example CPU-only and again with a GPU produced models with extremely similar accuracy, 98.41% and 98.38% respectively; GPU training, however, was on average 27% faster than CPU-only training.
It’s unfortunate that first-party support for GPU-accelerated machine learning on macOS leaves much to be desired. Given the circumstances surrounding Apple and NVIDIA, though, you really can’t blame the developers of libraries like PyTorch or TensorFlow for not devoting more engineering resources to a largely front-end platform like macOS. That being said, diversity is better for everyone, and hopefully Apple will continue to prove its renewed commitment to Pro users with hardware capable of leveraging these advanced libraries. Likewise, we should all hope that machine learning libraries eventually break free of the chokehold NVIDIA has on them and begin supporting alternative frameworks such as ROCm, but I digress. I’ll save that discussion for another day.