Transparent Neural Network Acceleration

Neural networks are widely used in all kinds of applications today. To help data scientists focus on their work of improving neural network computations, several neural network frameworks (PyTorch, TensorFlow, MXNet, CNTK, …) have emerged, each with its own strengths and weaknesses. What they have in common is that they model the network as a graph of layers, where each layer performs a specific computational operation on the data. These operations are implemented by mapping them onto existing BLAS or specialized neural network libraries, provided by hardware manufacturers, that promise peak performance on the underlying hardware.

While this approach is convenient and makes use of optimized implementations for compute-intensive operations, it does not take the actual structure of the neural network and the data path into consideration. Especially in memory-bound layers, this causes constant cache thrashing, resulting in significant performance drops. To address this, the developers of these optimized libraries have started to create specialized implementations for sequences of layers. However, this only circumvents the actual problem. The structure of a neural network can be seen as a computer program. While compilers exist for C/C++ and other languages that tune code to run optimally, no comparable technology exists for neural networks that takes the structure of the network into account and tries to map it optimally onto the computing hardware.

The SOL Project

The mission of the SOL project is to transparently accelerate neural network workloads with as little computing overhead as possible. We integrate SOL into neural network frameworks, where it attaches to the neural network.

SOL takes over control of neural networks and reshapes their execution. It analyzes the underlying structure and applies a series of optimizations (including operation reordering, loop merging and kernel fusion) to maximize data reuse in neural network layers, to use caches and other on-chip memories more efficiently, and to tune external library parameters for optimal performance. While we alter the computations, we ensure that we do not alter the results. We therefore rely solely on optimizations to instructions, caching and workflows, and do not employ alternative algorithms, approximations, lower-accuracy data types or other methods that could influence the results.
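To make the idea of loop merging concrete, the following is a minimal sketch in plain Python. It is not SOL's implementation; the function names and the three-step scale, bias and ReLU sequence are hypothetical and chosen only for illustration. The layer-by-layer version writes a full intermediate list after every step, while the merged version touches each element once and performs the same arithmetic in the same order, so the results are identical.

```python
def run_layer_by_layer(x, scale, bias):
    """Three separate passes, each materializing a full intermediate list."""
    scaled = [v * scale for v in x]          # pass 1: scale
    shifted = [v + bias for v in scaled]     # pass 2: add bias
    return [max(v, 0.0) for v in shifted]    # pass 3: ReLU


def run_fused(x, scale, bias):
    """One merged pass: no intermediate lists are created."""
    return [max(v * scale + bias, 0.0) for v in x]


data = [-1.5, -0.25, 0.0, 0.75, 2.0]
unfused = run_layer_by_layer(data, scale=2.0, bias=0.5)
fused = run_fused(data, scale=2.0, bias=0.5)

# Same operations in the same order per element, so the outputs match exactly.
assert unfused == fused
```

On real hardware the benefit comes from keeping each element in cache or registers between steps instead of writing it back to memory; the Python version only mirrors the structure of that transformation.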

Our approach aims to assist data scientists with their work rather than push them to become high-performance computing experts. For this, we designed SOL to have a very simple API, which is demonstrated in Code 1. Figure 1 shows performance measurements on CPU and GPU comparing PyTorch 1.2 and SOL.

Code 1: This code example shows how to initialize a DenseNet 201 using PyTorch, optimize it using SOL and execute the forward pass.

import torch                                      # load PyTorch
from torchvision import models                    # load PyTorch model zoo
import sol.pytorch as sol                         # load SOL library

py_model  = models.__dict__['densenet201']()      # initialize DenseNet 201
input     = torch.rand(32, 3, 224, 224)           # initialize random input data
sol_model = sol.optimize(py_model, input.size())  # optimize py_model using SOL
sol_model.load_state_dict(py_model.state_dict())  # load parameters of py_model into sol_model
output    = sol_model(input)                      # run sol_model

Non-native device support

SOL is more than just a performance optimizer. It also makes it possible to run workloads on hardware that is not natively supported by the framework, e.g., NEC's SX-Aurora Tsubasa. With SOL you only need to add the line sol.device.set(, 0) to run all workloads on the SX-Aurora, as shown in Code 2. If you are interested in SOL, you can apply for our SOL4VE closed beta program.

Code 2: This code example shows how to run any workload on the NEC SX-Aurora Tsubasa.

...                                               # initialize model as before
sol.device.set(, 0)                  # 0 indicates the device index
output    = sol_model(input)                      # sol_model gets executed on the NEC SX-Aurora Tsubasa

Figure 1: Performance measurements for running inference on ResNet 50, DenseNet 121, MobileNet2, MNasNet 0.5 and ShuffleNet V2 0.5 on an Intel I7 9600K CPU and an NVIDIA RTX 2080 GPU. ResNet and DenseNet are well-established networks that use conservative layer configurations, which are highly optimized in AI frameworks. Still, SOL achieves speedups of up to 2.57x on CPU and 1.70x on GPU. MobileNet2, MNasNet and ShuffleNet V2 are more modern architectures that make heavy use of more specialized layers, e.g., grouped convolutions and permutations, to achieve similar accuracy levels with far fewer model parameters. These architectures benefit hugely from code specialization, which enables SOL to achieve speedups of up to 7.51x on CPU and 4.41x on GPU.
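
A speedup such as those in Figure 1 is the ratio of the baseline runtime to the optimized runtime. The sketch below shows one simple way such a ratio could be measured with the Python standard library. It is an assumption for illustration, not the procedure used to produce Figure 1, and the helper names median_runtime and speedup are hypothetical. For GPU workloads in PyTorch, the timed function would additionally need to call torch.cuda.synchronize() so that asynchronous kernels are finished before the clock is read.

```python
import statistics
import time


def median_runtime(fn, *args, warmup=3, repeats=10):
    """Return the median wall-clock time of fn(*args) in seconds."""
    for _ in range(warmup):                  # untimed warm-up runs
        fn(*args)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_seconds, optimized_seconds):
    """Speedup factor: values above 1.0 mean the optimized run is faster."""
    return baseline_seconds / optimized_seconds
```

With the models from Code 1, a comparison could then look like speedup(median_runtime(py_model, input), median_runtime(sol_model, input)).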