GPU Management
Problem
This project was born out of problems I faced during my time as a student researcher at IIT Bombay. The institute had a limited set of GPUs, and students constantly fought over priority access to them. Some form of GPU management was required.
Targets for a V1 solution
- If a model is training, it should occupy a single GPU rather than being spread across several
- A deployment script that can
- automatically assign a particular GPU to a queued model
- checkpoint a model mid-training and remove it from its GPU
- redeploy a model on a different GPU
- Should not require adding code to the model; ONNX preferred
Notes: Design of a V1 Solution
First version limitations
- Will not be model-specific; all models are treated the same
- This should not matter according to our algorithm
- Have to check how far off our algorithm is from the most efficient split for transformers
- Will not be multi-node
- Probably will be easy to add later, since all IITB servers are connected
- Overhead to be measured
- Assuming the largest layer will fit in a single GPU
- Otherwise we will have to resort to tensor parallelism
- Assuming all layers together will fit in the GPUs available in the node
- Otherwise we will have to resort to multi-node
Parallelization approaches
- Data parallel
- The ZeRO implementation by DeepSpeed (see the config sketch after this list)
- The params, gradients, optimizer states, and activations stored in memory are split across multiple GPUs
- When computation reaches the layers owned by one GPU, that GPU sends the other GPUs its params and activations in both the forward and backward passes, and collects the gradients accumulated for them in the backward pass
- Pain points
- How to determine the correct split of the three states (params, gradients, optimizer states)
- Maybe split only one of them if all three cannot be split
- Compute the lowest possible split and each section's memory occupancy
- Is this an effective strategy?
- What to do if any of the GPUs are not free
- Calculate how the splits coalesce onto the free GPUs
- Thus the split will be discrete
- What to do if part of a GPU is suddenly occupied or freed up
- Every interval t, rerun the algorithm and determine an effective split
- Have to keep in mind the GPU-to-GPU transfer cost (the same overhead occurs when transferring the params, just with more memory, but that may be irrelevant)
- Model parallel and pipeline parallel
- The model is split sequentially across the GPUs
- Data is divided into micro-batches
- Each micro-batch passes through the first split and then moves on to the next
- The model's parameters are not shared with any other GPU
- Tensor parallelism
- All GPUs see all of the data, but the model parameters are split horizontally within each layer (see the sketches after this list)
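To make the ZeRO point concrete: a minimal sketch of how partitioning is enabled in DeepSpeed, where it is mostly a config choice rather than model code. The model, batch size, and learning rate below are placeholders, and the script is assumed to be run under the `deepspeed` launcher:

```python
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # stand-in for a real model

# ZeRO stage 3 partitions params, gradients, and optimizer states
# across the data-parallel GPUs; stages 1 and 2 partition fewer of them.
ds_config = {
    "train_batch_size": 16,                                 # placeholder
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},  # placeholder
    "zero_optimization": {"stage": 3},
}

# deepspeed.initialize wraps the model in an engine that handles the
# gather/scatter of partitioned states during forward and backward.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```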
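And a toy illustration of the tensor-parallel split, simulated here with CPU tensors (the two weight shards stand in for two GPUs; the shapes are arbitrary):

```python
import torch

torch.manual_seed(0)
W = torch.randn(8, 4)        # one linear layer's full weight
W0, W1 = W.chunk(2, dim=0)   # each "GPU" owns half the output rows

x = torch.randn(2, 4)        # every GPU sees the full input batch

# Each shard computes its slice of the output independently...
y0 = x @ W0.t()              # (2, 4): first half of the outputs
y1 = x @ W1.t()              # (2, 4): second half of the outputs

# ...and an all-gather (here: a concat) rebuilds the full activation.
y = torch.cat([y0, y1], dim=1)
assert torch.allclose(y, x @ W.t())
```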
System design
- Assertions
- The largest layer will fit in a single GPU
- All layers together will fit in the GPUs available in the node
- Input variables
- GPU memory
- GPU utilization
Algorithm
- Step 1: Determine the memory occupied by the model (a sketch follows this list)
- Parameters
- Gradients
- Optimizer states
- Input data
- Step 2: Determine the splits for the model
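A back-of-the-envelope sketch of Step 1, assuming fp32 training with Adam (so gradients match the parameter memory and the optimizer keeps two moment buffers per parameter); activations are deliberately ignored since they are model-specific, and the model and batch below are placeholders:

```python
import torch

def estimate_training_memory_bytes(model: torch.nn.Module,
                                    sample_input: torch.Tensor) -> int:
    """Rough lower bound on training memory for fp32 + Adam."""
    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    grad_bytes = param_bytes        # one gradient per parameter
    optim_bytes = 2 * param_bytes   # Adam keeps exp_avg and exp_avg_sq
    input_bytes = sample_input.numel() * sample_input.element_size()
    return param_bytes + grad_bytes + optim_bytes + input_bytes

model = torch.nn.Linear(4096, 4096)   # placeholder model
batch = torch.randn(32, 4096)         # placeholder batch
print(f"{estimate_training_memory_bytes(model, batch) / 2**20:.1f} MiB")
```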
Version 1
- Given a single model occupying memory M, the system has to discover the best placement for it and deploy it there
- Input
- GPU memory
- GPU utilization
- The first version would be a simple model deployer: no ZeRO configuration, no multiple nodes
- Identify, from among all the GPUs,
- where the model can fit
- where GPU utilization is lowest
- Deploy the model to any of the best GPUs (see the placement sketch below)
- Through ONNX
- Won't involve changes to code
- Through DeepSpeed
- Will involve changes to code
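A minimal sketch of the placement rule using NVML via the `pynvml` package; the fit-then-least-utilized tie-break and the `model_bytes` argument are my reading of the notes, not a fixed design:

```python
import pynvml

def pick_best_gpu(model_bytes: int) -> int:
    """Return the index of a GPU where the model fits and utilization is lowest."""
    pynvml.nvmlInit()
    try:
        candidates = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            if mem.free >= model_bytes:            # the model must fit
                candidates.append((util.gpu, i))   # lower utilization wins
        if not candidates:
            raise RuntimeError("no GPU has enough free memory")
        return min(candidates)[1]
    finally:
        pynvml.nvmlShutdown()

# e.g. place a ~2 GiB model:
# gpu_index = pick_best_gpu(2 * 2**30)
```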
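And a sketch of the ONNX route, which is what lets deployment avoid touching the training code. The model is a stand-in, and the ONNX Runtime lines are commented out since they need a CUDA build of `onnxruntime`:

```python
import torch

model = torch.nn.Linear(4096, 4096)   # stand-in model
dummy = torch.randn(1, 4096)          # example input for tracing

# Exporting to ONNX decouples deployment from the training script.
torch.onnx.export(model, dummy, "model.onnx")

# The exported model can then be pinned to the GPU chosen above:
# import onnxruntime as ort
# session = ort.InferenceSession(
#     "model.onnx",
#     providers=[("CUDAExecutionProvider", {"device_id": gpu_index})],
# )
```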
References
- https://huggingface.co/docs/transformers/v4.15.0/en/parallelism
- https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/
- https://github.com/microsoft/DeepSpeed
- https://towardsdatascience.com/pytorch-lightning-vs-deepspeed-vs-fsdp-vs-ffcv-vs-e0d6b2a95719