Deployment servers

Note: the order does not indicate popularity.

1. Ray

Ray is an open-source unified framework for scaling AI and Python applications such as machine learning workloads. It provides the compute layer for parallel processing so that you don't need to be a distributed systems expert.

2. Nvidia Triton

NVIDIA Triton Inference Server is open-source inference-serving software that standardizes AI model deployment and execution, delivering fast and scalable AI in production.
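Triton serves models out of a model repository, where each model carries a `config.pbtxt` describing its backend and tensor signatures. A hypothetical example (the model and tensor names here are made up; the field names follow Triton's model-configuration format):

```text
name: "resnet50_onnx"         # hypothetical model name
platform: "onnxruntime_onnx"  # backend that executes the model
max_batch_size: 8
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
```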

3. Truss

Truss is an open-source framework that aims to be the simplest way to package and serve AI/ML models in production.
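Truss packages a model as a Python class exposing `load` and `predict` methods, which the server calls at startup and per request. A sketch of that shape, where the uppercasing "model" is just a stand-in for real weights:

```python
# model/model.py — the class Truss discovers and serves.
class Model:
    def __init__(self, **kwargs):
        self._model = None

    def load(self):
        # Called once at startup: load weights, tokenizers, etc.
        # Here a trivial callable stands in for a real model.
        self._model = lambda text: text.upper()

    def predict(self, model_input):
        # Called per request with the deserialized JSON payload.
        return {"output": self._model(model_input["text"])}
```

Because the class is plain Python, it can be unit-tested locally before being pushed to a serving environment.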

Model conversion or coding languages

1. TensorRT

NVIDIA TensorRT is an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that deliver low latency and high throughput for inference applications.
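As a sketch, TensorRT's bundled `trtexec` tool can convert an ONNX model into a serialized engine without writing any code (file names are placeholders; this requires a TensorRT installation and a compatible NVIDIA GPU):

```shell
# Build a TensorRT engine from an ONNX model, enabling FP16 kernels.
trtexec --onnx=model.onnx --saveEngine=model.engine --fp16
```

The resulting engine file is then loaded by the TensorRT runtime (or served via Triton's TensorRT backend).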

2. AITemplate

AITemplate(AIT) is a Python framework that transforms deep neural networks into CUDA (NVIDIA GPU) / HIP (AMD GPU) C++ code for lightning-fast inference serving.

3. TorchScript

TorchScript is an intermediate representation of a PyTorch model (subclass of nn.Module) that can then be run in a high-performance environment such as C++.

An introductory TorchScript tutorial and full documentation are available on the PyTorch website.
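A minimal sketch, assuming PyTorch is installed; the `Doubler` module is illustrative:

```python
import torch

class Doubler(torch.nn.Module):
    def forward(self, x):
        # TorchScript compiles this method into its intermediate representation.
        return x * 2 + 1

scripted = torch.jit.script(Doubler())  # compile the module to TorchScript
scripted.save("doubler.pt")             # serialized; loadable from C++ via torch::jit::load
out = scripted(torch.ones(3))           # tensor([3., 3., 3.])
```

The saved file carries both the code and the weights, which is what lets a C++ process run the model without a Python interpreter.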

4. Tensor Comprehensions

Tensor Comprehensions (TC) is a notation based on generalized Einstein notation for computing on multi-dimensional arrays. TC greatly simplifies ML framework implementations by providing a concise and powerful syntax which can be efficiently translated to high-performance computation kernels, automatically.
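For example, matrix multiplication in TC notation (the canonical example from the project), where the `+=!` operator means "initialize the accumulator, then sum over the unbound index k":

```text
def matmul(float(M, K) A, float(K, N) B) -> (C) {
    C(m, n) +=! A(m, k) * B(k, n)
}
```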

5. Apache TVM

Apache TVM is an End to End Machine Learning Compiler Framework for CPUs, GPUs and accelerators. It aims to enable machine learning engineers to optimize and run computations efficiently on any hardware backend.
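A sketch using TVM's `tvmc` command-line driver (file names are placeholders; requires a TVM installation):

```shell
# Compile an ONNX model for a generic CPU target.
tvmc compile --target "llvm" --output model.tar model.onnx

# Run the compiled package on sample inputs.
tvmc run --inputs inputs.npz --output predictions.npz model.tar
```

Swapping `--target` (e.g. to a CUDA or embedded target) retargets the same model to different hardware backends.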

6. OpenAI Triton

Triton is an open-source Python-like programming language which enables researchers with no CUDA experience to write highly efficient GPU code—most of the time on par with what an expert would be able to produce.
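A hedged sketch of a vector-addition kernel in Triton (requires the `triton` package and a CUDA GPU; sizes and names are illustrative):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # each program handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard out-of-bounds lanes
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

# Launch: one program per block of 256 elements.
x = torch.rand(1024, device="cuda")
y = torch.rand(1024, device="cuda")
out = torch.empty_like(x)
add_kernel[(triton.cdiv(1024, 256),)](x, y, out, 1024, BLOCK_SIZE=256)
```

Note the block-level programming model: you reason about tiles of data rather than individual CUDA threads, which is where much of the ease-of-use comes from.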

7. ONNX

ONNX is an open format built to represent machine learning models. ONNX defines a common set of operators - the building blocks of machine learning and deep learning models - and a common file format to enable AI developers to use models with a variety of frameworks, tools, runtimes, and compilers.

8. PyTorch Lightning

Lightning is a hyper-minimalistic framework used to build machine learning components that can plug into existing ML workflows.

9. torch.compile

torch.compile makes PyTorch code run faster by JIT-compiling it into optimized kernels, while requiring minimal code changes.
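A minimal sketch, assuming PyTorch 2.x; `gelu_like` is an illustrative pointwise function, and `backend="eager"` is chosen only so the example runs without GPU codegen (the default "inductor" backend generates the optimized kernels):

```python
import torch

def gelu_like(x):
    # A small pointwise function; torch.compile can fuse such ops into one kernel.
    return 0.5 * x * (1.0 + torch.tanh(x))

# Wrapping the function is the only code change needed.
compiled = torch.compile(gelu_like, backend="eager")

x = torch.randn(16)
assert torch.allclose(compiled(x), gelu_like(x))  # same results as eager PyTorch
```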

Profiling tools:

1. NVIDIA Nsight Systems

NVIDIA Nsight™ Systems is a system-wide performance analysis tool designed to visualize an application’s algorithms, help you identify the largest opportunities to optimize, and tune to scale efficiently across any quantity or size of CPUs and GPUs, from large servers to the smallest systems on a chip (SoC).
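A sketch of typical `nsys` usage (`train.py` is a placeholder script; requires an Nsight Systems installation):

```shell
# Profile a training script end to end; writes a report openable in the Nsight GUI.
nsys profile -o report python train.py

# Summarize CUDA kernel and API timings on the command line.
nsys stats report.nsys-rep
```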