Leveraging the Latest NVIDIA Data Center GPUs for AI Inferencing

Technical Blog
September 14, 2023

Introduction

Hey there! At SmartCow, we're an edge AI engineering company that focuses on AIoT embedded edge nodes like the NVIDIA Jetson platform. Our AI analytics are tailored to run efficiently on these embedded systems. The AI lifecycle typically includes data collection, feature engineering, algorithm development, and of course, benchmarking.

We often benchmark our AI models on NVIDIA’s comprehensive lineup of Jetson system-on-modules to evaluate performance metrics such as FPS, throughput, latency, and hardware-related metrics like GPU and RAM utilization.

Recently, our partners at PNY provided us with access to a server boasting six NVIDIA data center GPUs: the NVIDIA A2, A10, A40, and A100 from the NVIDIA Ampere architecture series; the NVIDIA L40, based on the new NVIDIA Ada Lovelace architecture; and the NVIDIA H100 Tensor Core GPU, based on the Hopper architecture. The server also had 2x AMD EPYC 7742 CPUs, 16x 64 GB DDR4 3200 MHz Hynix SDRAM, and up to 4x U.2 NVMe SSDs.

We put this powerhouse to the test by benchmarking three NVIDIA pre-trained AI models (PeopleNet, FaceDetect, and Face-Mask classification) used in our SmartSpaces solution. Curious about the tools we used and the results we got? Keep reading; we cover them in the sections below!

Experiment Insights

We used the NVIDIA DeepStream SDK, a streaming analytics toolkit for multisensor processing. It runs on discrete GPUs such as the NVIDIA T4 and NVIDIA Ampere architecture GPUs, as well as on system-on-chip platforms such as the NVIDIA Jetson family of devices.

We tested a 1920x1080 video at 30 FPS with H.264 encoding. The results do not capture every factor that influences GPU performance, since hardware configuration and video content both play a role. We used footage of a busy location as a typical scenario, but other variables also affect the numbers. This blog is intended to give a rough idea of GPU performance in real-world applications.

On the top, we have the original video that we used for all of our tests. On the left, we have the annotated video, and on the right, we have an example of the multi-stream output.

Pipeline Overview

DeepStream Pipeline Created by SmartCow

The diagram above illustrates a simplified view of the DeepStream pipeline that was used for the experiments.

  • Uridecodebin: Reads the video stream from a URI (on disk, network, or online).
  • Nvstreammux: Generates frame batches from input streams for model inferencing.
  • Nvinfer: The plugin that deploys the model and performs inference on the frames or crops. There are three nvinfer plugins in the pipeline, one for each model: people detection, face detection, and mask classification.
  • Nvtracker: Assigns persistent IDs to detected objects, in this case people, and tracks them over time in multiple frames. 
  • The final component takes the pipeline metadata, such as bounding box locations, draws it on the frame, encodes the video, and displays it on screen.
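
For readers who want to build something similar, here is a minimal sketch of this topology using the DeepStream GStreamer elements from Python. It is illustrative only: the input URI, nvinfer config file names, and tracker library path are placeholders rather than the exact files used in our benchmarks, and the face and mask stages would normally be configured as secondary inference engines through their config files.

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

# uridecodebin -> nvstreammux -> nvinfer (people) -> nvtracker ->
# nvinfer (face) -> nvinfer (mask) -> nvdsosd -> display sink
pipeline = Gst.parse_launch(
    "uridecodebin uri=file:///path/to/input.mp4 ! m.sink_0 "
    "nvstreammux name=m batch-size=1 width=1920 height=1080 ! "
    "nvinfer config-file-path=peoplenet_config.txt ! "
    "nvtracker ll-lib-file=/opt/nvidia/deepstream/deepstream/lib/"
    "libnvds_nvmultiobjecttracker.so ! "
    "nvinfer config-file-path=facedetect_config.txt ! "
    "nvinfer config-file-path=mask_classifier_config.txt ! "
    "nvvideoconvert ! nvdsosd name=osd ! "
    "nveglglessink"          # swap for fakesink on a headless server
)

pipeline.set_state(Gst.State.PLAYING)
loop = GLib.MainLoop()
try:
    loop.run()               # run until interrupted or the stream ends
except KeyboardInterrupt:
    pass
finally:
    pipeline.set_state(Gst.State.NULL)
```

For multi-stream runs, additional uridecodebin instances are linked to m.sink_1, m.sink_2, and so on, and the nvstreammux batch-size is raised to match the number of inputs.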

NVIDIA DeepStream SDK

DeepStream simplifies building streaming pipelines for AI-driven video, audio, and image analytics. It supports development in C or Python, offering flexibility, and because it is built on GStreamer, its hardware-accelerated plug-ins can be used from either language.

DeepStream is also an integral part of NVIDIA Metropolis, the platform for building end-to-end services and solutions that transform pixel and sensor data to actionable insights. (Resource: NVIDIA)

While PyTorch and TensorFlow are popular open source machine learning frameworks, they aren't ideal for running on NVIDIA Jetson modules directly. They offer training and basic inference options, whereas DeepStream provides a complete GPU-accelerated pipeline with decoding, preprocessing, inference, postprocessing, and tracking. Comparing them is like comparing apples to oranges, as they serve different roles within the AI pipeline.

GStreamer is used in NVIDIA DeepStream to handle multimedia data, and NVIDIA has extended it with additional plugins and APIs for video analytics and deep learning inference. Developers can use GStreamer's existing plugins or create custom ones to extend DeepStream's functionality.

The NVIDIA Video Codec SDK provides the video processing capabilities that NVIDIA GPUs expose through GStreamer plugins: NVDEC and NVENC deliver hardware-accelerated decoding and encoding of video streams, reducing CPU usage. In the experiments that follow, we will see how NVDEC and NVENC contribute to our benchmarks.
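
As a quick, hedged illustration (assuming an H.264 input file, a GPU that includes an NVENC encoder, and the standard nvv4l2decoder/nvv4l2h264enc element names), the sketch below transcodes a clip entirely on the GPU. It is a handy sanity check that decoding and encoding land on NVDEC and NVENC rather than the CPU before any inference stages are added; watching encoder/decoder utilization with a tool such as nvidia-smi dmon while it runs makes the offload visible.

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

# Hardware decode (NVDEC) -> GPU color conversion -> hardware encode (NVENC)
transcode = Gst.parse_launch(
    "filesrc location=/path/to/input.mp4 ! qtdemux ! h264parse ! "
    "nvv4l2decoder ! nvvideoconvert ! nvv4l2h264enc ! "
    "h264parse ! qtmux ! filesink location=/tmp/output.mp4"
)

transcode.set_state(Gst.State.PLAYING)
bus = transcode.get_bus()
# Block until the clip has been fully transcoded or an error occurs
bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                       Gst.MessageType.EOS | Gst.MessageType.ERROR)
transcode.set_state(Gst.State.NULL)
```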

Tracker Benchmarks

The nvtracker NVIDIA DeepStream plugin comes packaged with three trackers:

  • Maximum Performance (max_perf): Favors speed over accuracy and tends to be the most lightweight.
  • Performance (Perf): A balanced tracker that trades off accuracy against performance.
  • Accuracy: Favors accuracy over performance; it requires the most processing power of the three trackers but delivers the most accurate tracking.
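
To make this concrete, here is a small, hedged sketch of how one of these tracker flavors is typically selected: the nvtracker element is pointed at one of the sample NvDCF configuration files shipped with the DeepStream SDK. The library path, config file names, and tracker resolution below follow the usual DeepStream install layout and samples, but they may differ between releases, so treat them as assumptions.

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

tracker = Gst.ElementFactory.make("nvtracker", "tracker")

# Low-level multi-object tracker library bundled with DeepStream (path assumed)
tracker.set_property(
    "ll-lib-file",
    "/opt/nvidia/deepstream/deepstream/lib/libnvds_nvmultiobjecttracker.so")

# Choose one of the sample configs: config_tracker_NvDCF_max_perf.yml,
# config_tracker_NvDCF_perf.yml, or config_tracker_NvDCF_accuracy.yml
tracker.set_property(
    "ll-config-file",
    "/opt/nvidia/deepstream/deepstream/samples/configs/deepstream-app/"
    "config_tracker_NvDCF_perf.yml")

# Internal processing resolution of the tracker (illustrative values)
tracker.set_property("tracker-width", 640)
tracker.set_property("tracker-height", 384)
```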

In our experiments, we aimed to evaluate the impact of different trackers on pipeline performance. To accomplish this, we ran the people detection model within the pipeline while using the nvtracker component to identify and maintain the detected people across all frames. The nvtracker component was placed after the first nvinfer element, as shown in the previous figure.
When using the tracker element in NVIDIA DeepStream, the interval setting in the nvinfer plugin configuration is a crucial parameter that affects performance. It determines how many frames (or batches of frames) to skip during inferencing. Setting it to a value greater than zero improves pipeline performance by freeing up resources. However, because inference is then not performed on every frame, a tracker must be incorporated into the pipeline to keep following detected objects as they move across frames over time.
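
As an illustration, the interval can be set either under [property] in the nvinfer config file or directly as an element property. The snippet below is a placeholder sketch (the element name and config file are assumptions, following the earlier pipeline sketch): it skips three batches between inferences and relies on the tracker to carry detections across the skipped frames.

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

pgie = Gst.ElementFactory.make("nvinfer", "people-detector")
pgie.set_property("config-file-path", "peoplenet_config.txt")  # placeholder config

# interval = number of consecutive batches skipped between inference runs;
# with interval=3, detection runs on every 4th batch and the tracker fills the gaps
pgie.set_property("interval", 3)
```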

In this experiment, we measured the frames per second (FPS) that the GPUs could process using the different trackers mentioned above, combined with four distinct interval values (0, 1, 3, and 10) and two different numbers of input streams (16 and 25). We also included four different GPUs for comparison: the A40, the A100, the newly released L4, and its predecessor, the T4. The idea was to compare a mid-range server-grade GPU against a high-end one, while also comparing the new L4 to its predecessor.
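
One common way to collect average FPS per stream in a DeepStream pipeline is to attach a buffer probe that walks the batch metadata and counts frames per source. The sketch below is a simplified, illustrative version of that idea, not the exact harness behind the numbers in this post; it assumes the DeepStream Python bindings (pyds) are installed, and the commented attachment lines refer to the pipeline and osd names from the earlier sketch.

```python
import time

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

import pyds                       # DeepStream Python bindings

stream_counters = {}              # source_id -> [frame_count, window_start]

def fps_probe(pad, info, _user_data):
    gst_buffer = info.get_buffer()
    if not gst_buffer:
        return Gst.PadProbeReturn.OK

    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(gst_buffer))
    l_frame = batch_meta.frame_meta_list
    while l_frame is not None:
        frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
        counter = stream_counters.setdefault(frame_meta.source_id,
                                             [0, time.time()])
        counter[0] += 1
        elapsed = time.time() - counter[1]
        if elapsed >= 5.0:        # print a rolling average every ~5 seconds
            print(f"stream {frame_meta.source_id}: "
                  f"{counter[0] / elapsed:.1f} FPS")
            counter[0], counter[1] = 0, time.time()
        try:
            l_frame = l_frame.next
        except StopIteration:
            break
    return Gst.PadProbeReturn.OK

# Attach downstream of the last nvinfer, e.g. on the OSD's sink pad:
# osd = pipeline.get_by_name("osd")
# osd.get_static_pad("sink").add_probe(Gst.PadProbeType.BUFFER, fps_probe, None)
```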

In the first experiment, we evaluated the pipeline's performance using 16 inputs.

Average FPS per video stream

Average FPS per video stream

The graphs indicate minimal FPS variation between Max Perf and Perf trackers. Increasing the interval enhances pipeline performance, but compromises accuracy. Hence, achieving a balance between performance and accuracy by choosing suitable values is vital during pipeline design.


We replicated the same tests with 25 input video streams, and the same Max Perf and Perf tracker pattern persisted in the graphs. This suggests a preference for the Perf tracker, given its slightly better tracking accuracy; however, resource availability on the device should also be considered.

Average FPS per video stream
Average FPS per video stream

Interval Benchmarks

In our second experiment, we wanted to assess the overall performance of our video inference pipeline and determine the impact of interval values on FPS for different GPUs. We used the complete pipeline with all three models and fixed the number of input streams to 25. Based on the results of our previous experiment, we opted to use the Perf tracker, which provides a good balance of performance and accuracy.

Average FPS per video stream at different interval values

We found that with all GPU types, the performance of the pipeline increased as the interval value was raised, which was expected. However, this increase in performance comes at the cost of detection and tracking accuracy. While increasing the interval can help offload some of the GPU workload and improve pipeline performance, particularly in a multi-model pipeline, the gain in performance is not always linear. So, it's essential to consider the cost of detection accuracy when skipping a large number of frames.

Benchmarks for the Number of Input Streams

Our final experiment aimed to assess the impact of the number of input sources on the pipeline's performance and to determine the maximum number of input sources each GPU can handle with our sample pipeline. To set a baseline, we used the video's native frame rate of 30 FPS; any per-stream throughput below this value was considered degraded.
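
Put another way, with a purely illustrative helper (not part of our benchmark code, and using hypothetical numbers): a GPU is counted as handling N streams only while the measured average per-stream FPS stays at or above the 30 FPS source rate.

```python
REALTIME_FPS = 30.0   # frame rate of the source videos, used as the baseline

def max_realtime_streams(avg_fps_by_stream_count):
    """Map of {number of input streams: measured average FPS per stream} -> max real-time streams."""
    sustained = [n for n, fps in avg_fps_by_stream_count.items()
                 if fps >= REALTIME_FPS]
    return max(sustained) if sustained else 0

# Hypothetical measurements, for illustration only:
print(max_realtime_streams({16: 30.0, 25: 30.0, 32: 27.4}))   # -> 25
```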

Theoretically, the number of input streams a GPU can handle would depend on the number of hardware decoders it has, with each decoder processing a specific number of streams at a given resolution. However, in our experiment, we found that this was not always the case as AI inference was also conducted in parallel with video decoding. The graph above shows that in most cases, the decoders are not the bottleneck for the pipeline.

NVIDIA L4 versus NVIDIA T4

The new NVIDIA L4 GPU delivers a remarkable leap in acceleration, offering approximately twice the performance of its predecessor, the T4. The data presented in this blog post clearly demonstrates this superiority across various experiments, particularly the tracker experiments. Irrespective of the type of tracker used, the L4 GPU consistently achieved twice the FPS performance compared to the T4.

Average FPS per video stream: NVIDIA L4 versus NVIDIA T4 (tracker experiments)

Now, turning to the interval experiment, we can observe a comparable trend where the L4 GPU's performance is around twice that of the T4.

In terms of hardware specifications, the T4 is equipped with 1 encoder and 2 decoders, whereas the L4 comes with 2 encoders and 4 decoders. This difference did not have a significant impact on the results of the maximum number of streams experiments, primarily because the decoders were not the major bottleneck in the pipeline. However, when we shift our focus to inference, the L4 consistently outperformed the T4, delivering performance at twice the speed, as clearly demonstrated in the following graph:

A Perspective from PNY, a Specialized GPU Distributor

It's important to note that the previous benchmarks looked only at the inference capabilities of NVIDIA GPUs and did not reflect how fast they can perform during training. The NVIDIA H100, in particular, provides the most advanced compute performance ever reached, delivering 2.5x more performance compared to the NVIDIA Ampere architecture A100 GPU.

The H100 recently dominated the MLCommons training results after also dominating the inference benchmarks on large AI models, whose size has grown five orders of magnitude since 2018. We have compiled a table below that shows all the models and datasets tested on the H100 in the MLCommons training benchmarks.

Overall, the exceptional performance, high number of Tensor Cores, large memory capacity, high-bandwidth memory, thermal capacity, AI frameworks support, and enterprise-grade reliability and security features make NVIDIA H100 superior to all other GPUs for AI training.

Conclusion

This blog offers some interesting insights into the practical limitations of GPU performance when deploying AI models. We dive into how different trackers and the number of input streams can affect the performance of a video inference pipeline that uses multiple models. Plus, we also discuss how hardware decoders can limit the capabilities of these pipelines.

When it comes to designing a pipeline, it's all about striking the right balance between performance and accuracy. This can be tricky, and sometimes sacrifices may need to be made based on the specific application and hardware constraints. But it all comes down to finding the sweet spot that works best for you! Learn more by contacting the SmartCow and PNY teams to find the best fit for your solution.

About the Authors

Pooja Venkatesh

Pooja is a Senior Product Manager at SmartCow. She manages end-to-end product development and comes with development experience in deep learning and computer vision. Pooja also worked previously at NVIDIA as part of the Metropolis team and carries an in-depth understanding of the IVA industry.

Adrian Apap

At SmartCow AI Technologies, Adrian works as an AI/ML Specialist and Solutions Architect. His responsibilities include designing AI pipelines and optimizing deep learning and computer vision models for production.

Nitin Rai

Nitin is an AI Application Developer at SmartCow. His focus area is rapid prototyping of computer vision applications.

Cédric Ceola

Cédric is a Senior Solution Architect Manager leading the EMEA technical and support team at PNY. He comes with extensive experience in GPUs, networking, and Linux systems. He supports large server cluster deployments on the technical side and helps customers build the right solution for their NVIDIA data center needs.

Our technical blogs are hosted on Medium. Click the link below to go there now.