
Create Inference Service Using Triton Framework

The AI Lab currently offers Triton and vLLM as inference frameworks. Users can quickly start a high-performance inference service with simple configurations.

Danger

Deploying models through Triton's vLLM backend has been deprecated. Use the platform's native vLLM support instead to deploy your large language models.

Introduction to Triton

Triton is an open-source inference server developed by NVIDIA, designed to simplify the deployment and inference of machine learning models. It supports a variety of deep learning frameworks, including TensorFlow and PyTorch, enabling users to easily manage and deploy different types of models.

Prerequisites

Prepare model data: manage the model files in dataset management and make sure the data has been preloaded successfully. The example below uses a PyTorch model for MNIST handwritten digit recognition.

Note

The model to be inferred must adhere to the following directory structure within the dataset:

  <model-repository-name>
  └── <model-name>
     └── <version>
        └── <model-definition-file>

The directory structure in this example is as follows:

    model-repo
    └── mnist-cnn
        └── 1
            └── model.pt
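
For reference, a model.pt file like the one in the structure above is typically a TorchScript export of a trained PyTorch model. The sketch below is a minimal illustration, assuming a simple, untrained placeholder network; substitute your own trained MNIST model and adjust the save path to match your dataset layout.

  import torch
  import torch.nn as nn

  # Placeholder network; substitute your own trained MNIST model.
  class MnistCnn(nn.Module):
      def __init__(self):
          super().__init__()
          self.conv = nn.Sequential(
              nn.Conv2d(1, 16, kernel_size=3, padding=1),
              nn.ReLU(),
              nn.MaxPool2d(2),
          )
          self.fc = nn.Linear(16 * 16 * 16, 10)

      def forward(self, x):
          x = self.conv(x)
          return self.fc(x.flatten(1))

  model = MnistCnn().eval()

  # Trace with an example input of shape [batch, channel, height, width]
  # and save to the path expected by the dataset directory structure above.
  example = torch.randn(1, 1, 32, 32)
  traced = torch.jit.trace(model, example)
  traced.save("model-repo/mnist-cnn/1/model.pt")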

Create Inference Service

Currently, form-based creation is supported; the interface provides field prompts to guide you through creating a service.

Click Create.

Configure Model Path

The model path model-repo/mnist-cnn/1/model.pt must be consistent with the directory structure of the dataset.

Model Configuration


Configure Input and Output Parameters

Note

The first dimension of the input and output parameters defaults to the batch size. Setting it to -1 allows the batch size to be calculated automatically from the input inference data. The remaining dimensions and the data type must match the model's input.
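
As an illustration of how these fields line up with real data, the sketch below builds a small batch of MNIST-sized inputs with NumPy; the shape and data type shown are the illustrative values used elsewhere on this page, not values mandated by the platform.

  import numpy as np

  # Two 32x32 grayscale images -> shape (2, 1, 32, 32).
  # The leading dimension is the batch size; with the first dimension
  # set to -1 in the form, any batch size is accepted at inference time.
  batch = np.random.rand(2, 1, 32, 32).astype(np.float32)

  print(batch.shape)  # (2, 1, 32, 32) -- remaining dims must match the model input
  print(batch.dtype)  # float32        -- must match the configured data type (FP32)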

Configure Environment

You can import the environment created in Manage Python Environment Dependencies to serve as the runtime environment for inference.

Advanced Settings


Configure Authentication Policy

Supports API key-based request authentication. Users can customize and add authentication parameters.
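
If an authentication policy is enabled, requests must carry the configured credential. The following is a minimal sketch, assuming the credential is passed as an HTTP header named Apikey; the actual header name and value depend on the authentication parameters you add.

  import requests

  # Hypothetical header name and key: replace them with the authentication
  # parameters you actually configured for the inference service.
  headers = {
      "Content-Type": "application/json",
      "Apikey": "<your-api-key>",
  }

  # The request body format is described in the API Access section below.
  payload = {"inputs": []}

  resp = requests.post(
      "http://<ip>:<port>/v2/models/<inference-name>/infer",
      headers=headers,
      json=payload,
  )
  print(resp.status_code)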

Affinity Scheduling

Supports automated affinity scheduling based on GPU resources and other node configurations. It also allows users to customize scheduling policies.

Access


API Access

  • Triton provides a REST-based API, allowing clients to perform model inference via HTTP POST requests.
  • Clients can send requests with JSON-formatted bodies containing input data and related metadata.

HTTP Access

  1. Send HTTP POST Request: Use tools like curl or HTTP client libraries (e.g., Python's requests library) to send POST requests to the Triton Server.

  2. Set HTTP Headers: The header configuration is generated automatically based on your settings and includes metadata about the model inputs and outputs.

  3. Construct Request Body: The request body usually contains the input data for inference and model-specific metadata.

Example curl Command
  curl -X POST "http://<ip>:<port>/v2/models/<inference-name>/infer" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      {
        "name": "model_input",            
        "shape": [1, 1, 32, 32],          
        "datatype": "FP32",               
        "data": [
          [0.1234, 0.5678, 0.9101, ... ]  
        ]
      }
    ]
  }'
  • <ip> is the host address where the Triton Inference Server is running.
  • <port> is the port where the Triton Inference Server is running.
  • <inference-name> is the name of the inference service that has been created.
  • "name" must match the name of the input parameter in the model configuration.
  • "shape" must match the dims of the input parameter in the model configuration.
  • "datatype" must match the Data Type of the input parameter in the model configuration.
  • "data" should be replaced with the actual inference data.

Please note that the above example code needs to be adjusted according to your specific model and environment. The format and content of the input data must also comply with the model's requirements.
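
As a rough Python equivalent of the curl command above, the following sketch sends the same request with the requests library; the input name, shape, and data values are the same illustrative placeholders and must be replaced to match your model and environment.

  import requests

  payload = {
      "inputs": [
          {
              "name": "model_input",      # must match the configured input name
              "shape": [1, 1, 32, 32],    # must match the configured dims
              "datatype": "FP32",         # must match the configured data type
              # Replace with the full image data (1 * 1 * 32 * 32 values).
              "data": [[0.1234, 0.5678, 0.9101]],
          }
      ]
  }

  resp = requests.post(
      "http://<ip>:<port>/v2/models/<inference-name>/infer",
      json=payload,
  )
  resp.raise_for_status()
  print(resp.json())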