
Project Proposal

Title

Authors: Akira van de Groenendaal, Tianyou Zhang

URL: https://ak33ra.github.io/15-418-final-project/

Summary

We will experimentally evaluate how multiple deep learning models running concurrently on a single GPU impact each other’s performance. Using real NVIDIA GPUs and inference workloads, we will measure throughput, latency, slowdown, fairness, and GPU utilization when different models—varying in size and compute vs. bandwidth bottlenecks—share the device simultaneously. Our goal is to characterize bottlenecks and identify patterns that inform better GPU multi-tenancy strategies.
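To make the metrics above concrete, here is a minimal sketch of how per-model slowdown and a fairness score could be derived from latency measurements. The function names and the numbers are illustrative placeholders, not measured data; we use Jain's fairness index as one reasonable choice of fairness metric.

```python
# Hypothetical post-processing of latency measurements: given each model's
# mean latency when running alone (isolated) and when co-located (shared),
# derive per-model slowdown and a Jain-style fairness index.

def slowdown(shared_latency, isolated_latency):
    """Per-model slowdown: > 1.0 means the model got slower when sharing."""
    return shared_latency / isolated_latency

def jain_fairness(slowdowns):
    """Jain's fairness index over per-model slowdowns (1.0 = perfectly fair)."""
    n = len(slowdowns)
    return sum(slowdowns) ** 2 / (n * sum(s * s for s in slowdowns))

# Made-up example: model A slows down 1.2x under sharing, model B 2.4x.
s = [slowdown(12.0, 10.0), slowdown(12.0, 5.0)]
fairness = jain_fairness(s)  # well below 1.0: sharing is unfair to model B
```

A fairness index like this gives a single number to compare across co-location experiments, complementing the raw throughput and latency curves.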

Background

Modern AI systems frequently need to co-locate multiple models on the same GPU to improve device utilization and to make better use of limited computing resources. Ideally, co-located models would run independently, providing aggregate throughput proportional to the number of models that fit on a GPU. In practice, however, GPU kernels from separate models often interfere with each other as they compete for shared resources. While tools like CUDA Streams and MPS enable concurrency, they offer only coarse resource partitioning, and programmers have little or no control over the exact allocation of memory, SMs, and kernel scheduling.

Different models also stress different resources: some are compute-bound while others are bandwidth-bound. In addition, different model architectures are dominated by different types of operations (e.g., large matrix multiplies vs. memory-bound elementwise ops).
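One rough way to label a kernel or layer as compute- or bandwidth-bound is a roofline-style comparison of its arithmetic intensity (FLOPs per byte moved) against the GPU's machine balance. The sketch below uses illustrative, assumed numbers (roughly V100-like peaks), not measured values.

```python
# Roofline-style classification sketch. The peak FLOP/s and bandwidth
# figures below are assumptions for illustration, not measurements.

def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

def bound_by(intensity, peak_flops, peak_bandwidth):
    """Compare intensity against machine balance (FLOPs per byte)."""
    machine_balance = peak_flops / peak_bandwidth
    return "compute" if intensity > machine_balance else "bandwidth"

# A 4096x4096 fp32 matmul (2*N^3 FLOPs, ~3 matrices of traffic) vs. a
# single elementwise op over the same matrix (1 FLOP per element).
matmul = arithmetic_intensity(2 * 4096**3, 3 * 4096**2 * 4)
elementwise = arithmetic_intensity(4096**2, 2 * 4096**2 * 4)
# With ~14 TFLOP/s and ~900 GB/s, the matmul lands compute-bound and the
# elementwise op bandwidth-bound.
```

Co-locating a compute-bound model with a bandwidth-bound one is the kind of pairing we expect to behave very differently from two models contending for the same resource.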

When these workloads execute together, we expect more complex contention patterns to emerge.

Studying this problem is interesting in its own right, and characterizing the contention precisely can help us understand how to use a fixed amount of compute effectively, as well as suggest directions for future improvement.

The Challenge

One significant challenge is the open-endedness of this project: there isn't a concrete algorithm to implement and benchmark against a known sequential version. Additionally, the nature of this work is new to us, and part of the learning will be building efficient pipelines for running experiments, collecting data, and plotting it in order to extract meaningful conclusions. We'll also have to deepen our understanding of GPUs and of the models we'll be running. In particular, we need to learn how various types of models actually execute on a GPU, and how changing various parameters affects their compute/memory patterns.

We want comprehensive metrics and analysis that provide useful data for the research community; the problem is multi-dimensional because many factors can affect model performance.

Goals and Deliverables

Plan to Achieve

Hope to Achieve

Platform

Tools: C++, Python, NVIDIA GeForce RTX 2080, NVIDIA V100, other research GPUs such as the A100, Nsight Systems.

Models: GPT-2, BERT, LLAMA, DeepSeek, YOLO

Most models can be run from both C++ and Python, which have many libraries for gathering metrics. We will also use a variety of GPUs to explore how multi-tenancy impacts devices with varying power, and we will run models of varying sizes to see how parameter count and model size affect multi-tenancy results. The models we chose vary in size and use case, which allows for richer analysis and experiments reflecting real-world deployments.

Using GPUs is the natural choice since most models in deployment run on GPUs and utilize resources like Tensor Cores and SMs. Additionally, Nvidia’s tooling provides good visibility into kernel-level interactions.

Schedule

Week of Nov 17th: Explore research in this area and understand how to run and benchmark the different models. Learn how to use MPS and how to run multiple models with batched requests. Set up the infrastructure/environment for running multiple models concurrently, controlling batch sizes, timing, and logging.
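A minimal sketch of the timing-and-logging harness we have in mind: run two placeholder "model" workloads concurrently on worker threads, record per-request wall-clock latencies, and report summary statistics. The `run_model` stand-in and its parameters are hypothetical; real runs would replace the `time.sleep` with actual GPU inference calls issued on separate streams/processes.

```python
# Sketch of a concurrent measurement harness (placeholder workloads).
import statistics
import threading
import time

def run_model(name, n_requests, results, work_s=0.001):
    """Placeholder inference loop; records per-request wall-clock latency."""
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        time.sleep(work_s)  # stand-in for a forward pass
        latencies.append(time.perf_counter() - start)
    results[name] = latencies

results = {}
threads = [threading.Thread(target=run_model, args=(name, 20, results))
           for name in ("model_a", "model_b")]
for t in threads:
    t.start()
for t in threads:
    t.join()

for name, lat in results.items():
    print(f"{name}: mean={statistics.mean(lat) * 1e3:.2f} ms, "
          f"max={max(lat) * 1e3:.2f} ms")
```

Keeping per-request latencies (rather than only means) lets us later compute tail latency and the slowdown/fairness metrics from the same logs.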

Week of Nov 24th: Obtain benchmarks for the models. Start with isolated baselines (each model running alone) before investigating different combinations and variables. Collect Nsight data (if available).

Week of Dec 1st: Analyze the data and find possible areas of optimization. Possibly prototype a scheduler or propose scheduling ideas. Write up insights and patterns.