Neural networks are widely used in all kinds of applications today. To let data scientists focus on improving their neural networks rather than on low-level implementation details, several neural network frameworks (PyTorch, TensorFlow, MXNet, CNTK, …) have emerged, each with its own strengths and weaknesses. What they have in common is that they model the network as a graph of layers, where each layer performs a specific computational operation on the data. This is implemented by mapping these computations onto existing BLAS or specialized neural network libraries, provided by hardware manufacturers, which promise to achieve peak performance on the underlying hardware.
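To make the layer-to-library mapping concrete, here is a minimal NumPy sketch of how a single fully connected layer reduces to one BLAS GEMM call plus a bias add; the function name and shapes are illustrative and not part of any framework's API:

```python
import numpy as np

def dense_layer(x, weights, bias):
    # A fully connected layer is a matrix multiplication plus a bias add.
    # Frameworks dispatch the matmul to an optimized vendor BLAS (GEMM).
    return x @ weights + bias

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 128))   # batch of 32 input vectors
w = rng.standard_normal((128, 64))   # layer weights
b = rng.standard_normal(64)          # layer bias
y = dense_layer(x, w, b)
print(y.shape)                       # one output vector of size 64 per sample
```

Each layer in the graph becomes one or more such library calls, which is exactly the granularity at which vendor libraries are optimized.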
While this approach is convenient to use and leverages optimized implementations for compute-intensive operations, it does not take the actual structure of the neural network and its data path into consideration. Especially in memory-bound layers, this causes constant cache thrashing, resulting in significant performance drops. To counter this, the developers of these optimized libraries have started to create specialized implementations for sequences of layers, trying to solve these problems. However, this merely works around the actual problem. The structure of a neural network can be seen as a computer program. While for C/C++ and other languages we have compilers that tune the code to run optimally, no such technology exists for neural networks, one that takes the structure of the network into account and tries to map it optimally onto the computing hardware.
The SOL Project
The mission of the SOL project is to transparently accelerate neural network workloads with as little computational overhead as possible. We integrate SOL into neural network frameworks, where it attaches directly to the neural network.
SOL takes control of the neural network and reshapes its execution. It analyzes the underlying structure and applies a series of optimizations (including operation reordering, loop merging, and kernel fusion) to maximize data reuse within neural network layers, to utilize caches and other on-chip memories more efficiently, and to tune external library parameters for optimal performance. While we alter the computations, we ensure that we do not alter the results. Therefore we rely solely on optimizations to instructions, caching, and workflows, and do not employ alternative algorithms, approximations, lower-precision data types, or other methods that could influence the results.
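To illustrate what kernel fusion buys in memory-bound code, here is a hypothetical NumPy sketch (not SOL's actual implementation): the unfused version materializes a full temporary array after every operation, while the fused version performs all operations in a single pass, keeping intermediates in registers, and produces bit-identical results:

```python
import numpy as np

def unfused(x):
    # Three separate "kernels": each step writes a full temporary array
    # to memory before the next step reads it back.
    t1 = x * 2.0
    t2 = t1 + 1.0
    return np.maximum(t2, 0.0)   # ReLU

def fused(x):
    # One fused kernel: a single pass over the data, no temporaries.
    out = np.empty_like(x)
    for i, v in enumerate(x.flat):
        out.flat[i] = max(v * 2.0 + 1.0, 0.0)
    return out

x = np.linspace(-2.0, 2.0, 8)
assert np.allclose(unfused(x), fused(x))   # same results, fewer memory passes
```

The computation is rearranged, but the numerical result is unchanged, which is the invariant SOL maintains across all of its optimizations.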
Our approach aims to assist data scientists in their work, not to push them to become high performance computing experts. For this reason we designed SOL with a very simple API, demonstrated in Code 1. Figure 1 shows performance measurements for CPU and GPU, comparing PyTorch 1.2 with and without SOL.
Code 1: This code example shows how to initialize a DenseNet 201 using PyTorch, optimize it using SOL, and execute the forward pass.
import torch                                      # load PyTorch
from torchvision import models                    # load PyTorch model zoo
import sol.pytorch as sol                         # load SOL library

py_model = models.__dict__['densenet201']()       # initialize DenseNet 201
input = torch.rand(32, 3, 224, 224)               # initialize random input data
sol_model = sol.optimize(py_model, input.size())  # optimize py_model using SOL
sol_model.load_state_dict(py_model.state_dict())  # load parameters of py_model into sol_model
output = sol_model(input)                         # run sol_model
Non-native device support
SOL is more than just a performance optimizer. It also enables running workloads on hardware that is not natively supported by the framework, e.g., NEC's SX-Aurora Tsubasa. With SOL you only need to add the line sol.device.set(sol.device.ve, 0) to run all workloads on the SX-Aurora, as shown in Code 2. If you are interested in SOL, you can apply for our SOL4VE closed beta program.
Code 2: This code example shows how to run any workload on the NEC SX-Aurora Tsubasa.
...                                # initialize model as before
sol.device.set(sol.device.ve, 0)   # 0 indicates the device index
output = sol_model(input)          # sol_model gets executed on the NEC SX-Aurora Tsubasa