SINGA's software stack includes two major levels, the low level backend classes and the Python interface level. Figure 1 illustrates them together with the hardware. The backend components provides basic data structures for deep learning models, hardware abstractions for scheduling and executing operations, and communication components for distributed training. The Python interface wraps some CPP data structures and provides additional high-level classes for neural network training, which makes it convenient to implement complex neural network models. Next, we introduce the software stack in a bottom-up manner.
Figure 1 - SINGA V3 software stack.
Device instance, i.e., a device, is created against one hardware device,
e.g. a GPU or a CPU.
Device manages the memory of the data structures, and
schedules the operations for executing, e.g., on CUDA streams or CPU threads.
Depending on the hardware and its programming language, SINGA have implemented
the following specific device classes:
- CudaGPU represents an Nvidia GPU card. The execution units are the CUDA streams.
- CppCPU represents a normal CPU. The execution units are the CPU threads.
- OpenclGPU represents normal GPU card from both Nvidia and AMD. The execution units are the CommandQueues. Given that OpenCL is compatible with many hardware devices, e.g. FPGA and ARM, the OpenclGPU has the potential to be extended for other devices.
Tensor class represents a multi-dimensional array, which stores model
variables, e.g., the input images and feature maps from the convolution layer.
Tensor instance (i.e. a tensor) is allocated on a a device, which manages
the memory of the tensor and schedules the (computation) operations against
tensors. Most machine learning algorithms could be expressed using (dense or
sparse) the tensor abstraction and its operations. Therefore, SINGA would be
able to run a wide range of models, including deep learning models and other
traditional machine learning models.
There are two types of operators against tensors, linear algebra operators like
matrix multiplication, and neural network specific operators like convolution
and pooling. The linear algebra operators are provided as
Tensor functions and
are implemented separately for different hardware devices
- CppMath (tensor_math_cpp.h) implements the tensor operations using Cpp for CppCPU
- CudaMath (tensor_math_cuda.h) implements the tensor operations using CUDA for CudaGPU
- OpenclMath (tensor_math_opencl.h) implements the tensor operations using OpenCL for OpenclGPU
The neural network specific operators are also implemented separately, e.g.,
- GpuConvFoward (convolution.h) implements the forward function of convolution via CuDNN on Nvidia GPU.
- CpuConvForward (convolution.h) implements the forward function of convolution using CPP on CPU.
Typically, users create a
Device instance and use it to create multiple
Tensor instances. When users call the Tensor functions or neural network
operations, the corresponding implementation for the resident device will be
invoked In other words, the implementation of operators is transparent to users.
The Tensor and Device abstractions are extensible to support a wide range of hardware device using different programming languages. A new hardware device would be supported by adding a new Device subclass and the corresponding implementation of the operators.
Optimizations in terms of speed and memory are done by the
MemPool of the
Device. For example, the
Scheduler creates a
computational graph according to the dependency of the operators.
Then it can optimize the execution order of the operators for parallelism and
Communicator is to support distributed training. It implements
the communication protocols using sockets, MPI and NCCL. Typically users only
need to call the high-level APIs like
get() for sending and
receiving tensors. Communication optimization for the topology, message size,
etc. is done internally.
All the backend components are exposed as Python modules via SWIG. In addition, the following classes are added to support the implementation of complex neural networks.
Opt and its subclasses implement the methods (such as SGD) for updating model
parameter values using parameter gradients. A subclass DistOpt
synchronizes the gradients across the workers for distributed training by
calling methods from
Operator wraps multiple functions implemented using the Tensor or neural
network operators from the backend. For example, the forward function and
ReLU compose the
Layer and its subclasses wraps the operators with parameters. For instance,
convolution and linear operators
have weight and bias parameters. The parameters are maintained by the corresponding
Autograd implements the
reverse-mode automatic differentiation
by recording the execution of the forward functions of the operators calling the
backward functions automatically in the reverse order. All functions can be
buffered by the
Scheduler to create a computational graph for
efficiency and memory optimization.
Module provides an easy interface to implement new network models. You just
need to inherit
Module and define the forward propagation of the model by
creating and calling the layers or operators.
Module will do autograd and
update the parameters via
Opt automatically when training data is fed into it.
To support ONNX, SINGA implmenets a sonnx module, which includes
- SingaFrontend for saving SINGA model into onnx format.
- SingaBackend for loading onnx format model into SINGA for training and inference.