Building PyTorch from Source with CUDA on macOS
PyTorch 1.0.0 is here! Unfortunately, if you want GPU support on macOS, you’ll have to get your hands dirty. Here’s how to build PyTorch from source with CUDA 10.0 on macOS High Sierra.
Prerequisites
Xcode
In this tutorial we’ll be building PyTorch with CUDA 10.0. Xcode 9.4 is required to install CUDA, and can be downloaded from Apple. Be sure to download the Command Line Tools for Xcode 9.4 as well. Extract Xcode to /Applications, renaming it in case another version of Xcode already exists:
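A sketch of that step, assuming the archive downloaded as Xcode_9.4.xip into ~/Downloads (adjust the filename to match what you actually downloaded):

cd ~/Downloads
xip --expand Xcode_9.4.xip                                            # unpacks Xcode.app into the current directory
mv Xcode.app /Applications/Xcode_9.4.app                              # rename so it can coexist with another Xcode
sudo xcode-select -s /Applications/Xcode_9.4.app/Contents/Developer   # optional: make it the active toolchain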
Command Line Tools can be installed by following the instructions given in the installer.
CUDA and cuDNN
As previously mentioned, we’ll be building against the latest releases of CUDA and cuDNN, versions 10.0 and 7.4.1, respectively. CUDA can be downloaded directly from NVIDIA, while downloading cuDNN requires a developer account. Installing CUDA is straightforward: just follow the instructions provided by the downloaded installer. cuDNN, on the other hand, must be extracted and copied into the proper directories by hand.
Navigate to the directory containing the downloaded tarball and extract it:
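The exact filename depends on the build you downloaded; assuming it landed in ~/Downloads, something like:

cd ~/Downloads
tar -xzvf cudnn-10.0-osx-x64-v7.4.1.5.tgz    # extracts into a directory named cuda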
Copy the library components into your CUDA installation, and make them executable:
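A sketch, assuming the default /usr/local/cuda symlink created by the CUDA installer and that you’re still in the directory containing the extracted cuda folder:

sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
sudo cp cuda/lib/libcudnn* /usr/local/cuda/lib/
sudo chmod a+x /usr/local/cuda/lib/libcudnn*    # make the libraries executable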
You can delete the extracted directory afterward if you’d like:
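For example, assuming it was extracted under ~/Downloads as above:

rm -rf ~/Downloads/cuda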
Lastly, set an environment variable, DYLD_LIBRARY_PATH, to point to cuDNN’s location:
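For example, add the following to your shell profile (again assuming the default /usr/local/cuda symlink):

export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH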
Of course, it never hurts to check your work. Verify that cuDNN is usable (the check may return some warnings, but there should be no errors).
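One way to do that, assuming the default /usr/local/cuda paths (the /tmp filenames are just examples), is to compile and run a throwaway program that links against libcudnn:

printf '#include <cudnn.h>\nint main(void) { return 0; }\n' > /tmp/cudnn_check.c
nvcc /tmp/cudnn_check.c -o /tmp/cudnn_check -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcudnn
/tmp/cudnn_check && echo "cuDNN looks usable"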
Building PyTorch
Clone the GitHub repository and navigate into it.
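Using the official repository:

git clone https://github.com/pytorch/pytorch.git
cd pytorch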
Let’s check out the correct branch and download the required dependencies.
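Assuming the v1.0.0 release tag is the target, something like:

git checkout v1.0.0
git submodule sync
git submodule update --init --recursive    # pulls in the third-party dependencies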
PyTorch requires a few additional Python dependencies. I’d recommend installing these into a virtual environment for the build, though I’ll leave the implementation details up to the reader.
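At the time of writing, the list looked roughly like this (check the PyTorch README for the current set):

pip install numpy pyyaml setuptools cmake cffi typing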
Add the path to the Python executable you’ll be using as an environment variable for cmake:
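The build reads CMAKE_PREFIX_PATH to find your Python environment; a sketch, assuming the python currently on your PATH is the one you want to build against:

export CMAKE_PREFIX_PATH="$(dirname $(which python))/../"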
Finally, we’re ready to go! If you’d rather build a .whl for installation with pip, replace install in the command below with bdist_wheel, and you’ll find it in a dist directory upon completion.
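The build invocation looks something like the following; the deployment target here is an assumption for High Sierra, so adjust it (or drop it) to match whatever the PyTorch README recommends for your setup:

MACOSX_DEPLOYMENT_TARGET=10.13 CC=clang CXX=clang++ python setup.py install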
Let’s give PyTorch 1.0 a try by training one of the example models on the MNIST dataset.
from __future__ import print_function
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))


def test(args, model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.max(1, keepdim=True)[1]  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))


def main():
    # Training settings
    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                        help='input batch size for testing (default: 1000)')
    parser.add_argument('--epochs', type=int, default=10, metavar='N',
                        help='number of epochs to train (default: 10)')
    parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
                        help='learning rate (default: 0.01)')
    parser.add_argument('--momentum', type=float, default=0.5, metavar='M',
                        help='SGD momentum (default: 0.5)')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                        help='how many batches to wait before logging training status')
    args = parser.parse_args()
    use_cuda = not args.no_cuda and torch.cuda.is_available()

    torch.manual_seed(args.seed)

    device = torch.device("cuda" if use_cuda else "cpu")

    kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=args.batch_size, shuffle=True, **kwargs)
    test_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=False,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=args.test_batch_size, shuffle=True, **kwargs)

    model = Net().to(device)
    optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

    for epoch in range(1, args.epochs + 1):
        train(args, model, device, train_loader, optimizer, epoch)
        test(args, model, device, test_loader)


if __name__ == '__main__':
    main()
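Assuming the script above is saved as main.py (the filename is just an example), a GPU run and a CPU-only run for comparison look like:

python main.py             # trains on the GPU when CUDA is available
python main.py --no-cuda   # CPU-only run for comparison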
Running the example CPU-only and again with a GPU produced models with extremely similar accuracy, 98.41% and 98.38% respectively; GPU training, however, was on average 27% faster than CPU-only training.
It’s unfortunate that first-party support for GPU-accelerated machine learning on macOS leaves much to be desired. Given the circumstances surrounding Apple and NVIDIA, though, you really can’t blame the developers of libraries like PyTorch or TensorFlow for not devoting more engineering resources to a largely front-end platform like macOS. That being said, diversity is better for everyone, and hopefully Apple will continue to prove its renewed commitment to Pro users with hardware capable of leveraging these advanced libraries. Likewise, we should all hope that machine learning libraries eventually break free of the chokehold NVIDIA has on them and begin supporting alternative frameworks such as ROCm, but I digress. I’ll save that discussion for another day.