Neural networks are widely used in all kinds of applications today. To let data scientists focus on improving their neural networks, several frameworks (PyTorch, TensorFlow, Caffe, …) have emerged, each with its own strengths and weaknesses. What they have in common is that they model the network as a graph of layers, where each layer performs a specific computational operation on the data. These computations are mapped onto existing BLAS or specialized neural network libraries provided by hardware manufacturers, which promise peak performance on the underlying hardware.
While this approach is convenient and benefits from optimized implementations of compute-intensive operations, it does not take the actual structure of the neural network and its data path into consideration. Especially in memory-bound layers, this causes constant cache thrashing and results in significant performance drops. To address this, the developers of these optimized libraries have started to create specialized implementations for particular sequences of layers. However, this only circumvents the actual problem. The structure of a neural network can be seen as a computer program. While for C/C++ and other languages we have compilers that tune the code to run optimally, no such technology exists for neural networks that takes the structure of the network into account and tries to map it optimally onto the computing hardware.
The mission of the BrainSlug project is to transparently accelerate neural network workloads with as little computing overhead as possible. We integrate BrainSlug into neural network frameworks, where it attaches to the neural network.
“Brain Slugs are a species of space parasite that attaches its jelly-like body to a person’s head and takes control of their brain.” [Futurama Wiki]
Similar to the Brain Slugs in the TV series Futurama, we take over the control of “brains” and reshape their execution process. BrainSlug analyzes the underlying structure and applies a series of optimizations (including operation reordering, loop merging, kernel fusion, etc.) to maximize data reuse in neural network layers and to use caches and other on-chip memories more efficiently. While we alter the computations, we ensure that we do not alter the results. We therefore rely solely on optimizations to instructions, caching, and workflows, and do not employ alternative algorithms, approximations, data types with lower accuracy, or other methods that could influence the results. Our initial prototype already achieves speed-ups of up to 41.1% on CPUs and 35.7% on GPUs for prediction on state-of-the-art neural networks, compared to PyTorch. Please refer to our technical report for more details and results from BrainSlug.
Our approach aims to assist data scientists with their work, not to push them to become high performance computing experts. For this reason we designed BrainSlug to have a very simple API. The following example shows how to initialize a DenseNet 201 using PyTorch, optimize it using BrainSlug, and execute the forward pass. To enable BrainSlug, only the two commented-out lines need to be uncommented in the source code (the example shows the syntax of the newest BrainSlug development version).
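To illustrate why loop merging can improve data reuse without changing results, here is a minimal pure-Python sketch. It is not BrainSlug code: the two layer functions, their constants, and the function names are assumptions chosen only for illustration. The layer-by-layer version sweeps the whole buffer once per layer, while the merged version pushes each element through all layers in a single pass, so intermediate values never have to be written back to a full-size buffer.

```python
# Hypothetical element-wise layers; the constants are made up for illustration.
def scale_shift(x):
    # Stand-in for an inference-time normalization layer.
    return 0.5 * x + 0.1

def relu(x):
    return x if x > 0.0 else 0.0

LAYERS = [scale_shift, relu, scale_shift]

def run_layer_by_layer(data, layers):
    """Each layer sweeps the whole buffer before the next layer starts."""
    buf = list(data)
    for layer in layers:
        buf = [layer(v) for v in buf]  # one full pass over memory per layer
    return buf

def run_fused(data, layers):
    """Merged loops: every element goes through all layers in one pass."""
    out = []
    for v in data:
        for layer in layers:
            v = layer(v)
        out.append(v)
    return out

data = [-2.0, -0.5, 0.0, 0.25, 3.0]
# Same operations in the same order per element, so the results are identical.
assert run_layer_by_layer(data, LAYERS) == run_fused(data, LAYERS)
```

In this toy setting both versions perform exactly the same arithmetic per element; only the order in which memory is traversed differs, which is the property the paragraph above relies on.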
from torch.autograd import Variable
import torch
from torchvision import models
#import brainslug.pytorch as brainslug

model = models.__dict__['densenet201']()
#model = brainslug.optimize(model, [0, 3, 224, 224])
model(Variable(torch.rand(32, 3, 224, 224)))
To support more than a single neural network framework, we designed BrainSlug to operate as middleware that can plug into several different frameworks as frontends and use different kinds of compute devices as backends.
The frontends ensure compatibility between the BrainSlug core and the neural network framework. They interface directly with the framework, read the structure of the neural network, and pass it to the core. They also take care of execution control from inside the framework and supply the data in the framework’s own tensor format.
The optimizer performs the main work in BrainSlug. It analyzes the neural network structure, identifies optimizable layers, and applies various optimizations. It then uses a generic SIMD processor model to fine-tune the code for the underlying hardware and generates device-specific code using the corresponding backends.
The backends are very slim. They only provide recipes for generating device-specific code based on the generic SIMD processor model and handle device-specific API calls at runtime.
The scheduler is a runtime controller that manages the execution of the optimized layers.
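The division of labor between these components can be sketched in a few lines of pure Python. This is a toy model under stated assumptions, not BrainSlug's implementation: every name below (`Layer`, `optimize`, `cpu_backend`, `BACKENDS`, `schedule`) is hypothetical, the `optimizable` flag is assigned by hand, and the `network` list stands in for the structure a frontend would read from the framework.

```python
from dataclasses import dataclass
from typing import Callable, List

# All names in this sketch are hypothetical; none of them is BrainSlug's API.

@dataclass
class Layer:
    name: str
    fn: Callable[[float], float]
    optimizable: bool  # assumption: set by hand for this toy example

def optimize(layers: List[Layer]) -> List[List[Layer]]:
    """Optimizer: group runs of consecutive optimizable layers into stacks."""
    stacks: List[List[Layer]] = []
    for layer in layers:
        if layer.optimizable and stacks and stacks[-1][0].optimizable:
            stacks[-1].append(layer)   # extend the current optimizable run
        else:
            stacks.append([layer])     # start a new stack
    return stacks

def cpu_backend(stack: List[Layer]):
    """Backend: turn one stack into a single callable for the device."""
    def kernel(values):
        out = []
        for v in values:
            for layer in stack:        # all layers of the stack in one pass
                v = layer.fn(v)
            out.append(v)
        return out
    return kernel

BACKENDS = {"cpu": cpu_backend}

def schedule(stacks, device, values):
    """Scheduler: run the generated kernels in order on the chosen device."""
    compile_stack = BACKENDS[device]
    for stack in stacks:
        values = compile_stack(stack)(values)
    return values

# What a frontend might hand to the core (toy scalar "layers").
network = [
    Layer("conv", lambda v: 2.0 * v, optimizable=False),
    Layer("batchnorm", lambda v: v - 1.0, optimizable=True),
    Layer("relu", lambda v: max(v, 0.0), optimizable=True),
    Layer("linear", lambda v: v + 3.0, optimizable=False),
]
stacks = optimize(network)
print([[layer.name for layer in s] for s in stacks])
# [['conv'], ['batchnorm', 'relu'], ['linear']]
print(schedule(stacks, "cpu", [0.0, 2.0]))
# [3.0, 6.0]
```

Keeping the backend behind a small lookup table mirrors the idea that backends stay slim: adding a device in this sketch would only mean registering another function that turns a stack into a kernel.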
BrainSlug: Transparent Acceleration of Deep Learning Through Depth-First Parallelism
Nicolas Weber (email@example.com), Florian Schmidt (firstname.lastname@example.org), Mathias Niepert (email@example.com) and Felipe Huici (firstname.lastname@example.org)
Technical Report, arXiv, 2018
Extended Abstract, DeepMobile, 2018
BrainSlug is at an early stage. For now we support PyTorch v0.3.* as frontend and Intel CPUs and NVIDIA GPUs as compute devices. It can only optimize the inference/prediction pass for CNN-based layers. In the coming months we want to extend the number of optimizable functions, implement a frontend for TensorFlow, add support for the NEC Aurora vector processor, and also enable optimization of neural network training.