GPU Management
Problem
This project was born out of problems I faced during my time as a student researcher at IIT Bombay. The institute had a limited set of GPUs, and students constantly fought over priority access to them. Some form of GPU management was required.
Targets for a V1 solution
- If a model is training, it should occupy a single GPU rather than being spread across several
- A deployment script that can
- automatically assign a particular GPU to a queued model
- checkpoint a model mid-training and remove it from its GPU
- redeploy a model on a different GPU
- Should not require adding code to the model; ONNX preferred
Notes: Design of a V1 Solution
First version limitations
- Will not be model-specific; all models are treated the same
- This should not matter according to our algorithm
- Have to check how far off our algorithm is from the most efficient split for transformers
- Will not be multi-node
- Probably will be easy to add later, since all IITB servers are connected
- Overhead to be measured
- Assuming the largest layer will fit in a single GPU
- Otherwise we will have to resort to tensor parallelism
- Assuming all layers together will fit in the GPUs available in the node
- Otherwise we will have to resort to multi-node
Parallelization approaches
- Data parallel
- The ZeRO implementation by DeepSpeed (see the config sketch after this list)
- The params, gradients, optimizer states, and activations stored in memory are split across multiple GPUs
- When computation reaches the layers owned by one GPU, that GPU sends the other GPUs its params and activations in both the forward and backward passes, and collects the gradients accumulated for them in the backward pass
- Pain points
- How to determine the correct split of the three states (params, gradients, optimizer states)
- Maybe split only one of them if all three cannot be split
- Compute the lowest possible split and each section's memory occupancy
- Is this an effective strategy?
- What to do if any of the GPUs are not free
- Calculate how the splits coalesce onto the free GPUs
- Thus the split will be discrete
- What to do if part of a GPU is suddenly occupied or freed up
- Every interval t, rerun the algorithm and determine an effective split
- Have to keep in mind the GPU-to-GPU transfer cost (the same overhead occurs when transferring the params, just with more memory, but that may be irrelevant)
- Model parallel and pipeline parallel
- The model is split sequentially across the GPUs
- Data is divided into micro-batches
- Each micro-batch passes through the first split and then moves on to the next
- The model's parameters are not shared with any other GPU
- Tensor parallelism
- All GPUs see all of the data, but the model parameters are split horizontally within each layer (see the sketches after this list)
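To make the ZeRO point concrete: a minimal sketch of how partitioning is enabled in DeepSpeed, where it is mostly a config choice rather than model code. The model, batch size, and learning rate below are placeholders, and the script is assumed to be run under the `deepspeed` launcher:

```python
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # stand-in for a real model

# ZeRO stage 3 partitions params, gradients, and optimizer states
# across the data-parallel GPUs; stages 1 and 2 partition fewer of them.
ds_config = {
    "train_batch_size": 16,                                 # placeholder
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},  # placeholder
    "zero_optimization": {"stage": 3},
}

# deepspeed.initialize wraps the model in an engine that handles the
# gather/scatter of partitioned states during forward and backward.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```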
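And a toy illustration of the tensor-parallel split, simulated here with CPU tensors (the two weight shards stand in for two GPUs; the shapes are arbitrary):

```python
import torch

torch.manual_seed(0)
W = torch.randn(8, 4)        # one linear layer's full weight
W0, W1 = W.chunk(2, dim=0)   # each "GPU" owns half the output rows

x = torch.randn(2, 4)        # every GPU sees the full input batch

# Each shard computes its slice of the output independently...
y0 = x @ W0.t()              # (2, 4): first half of the outputs
y1 = x @ W1.t()              # (2, 4): second half of the outputs

# ...and an all-gather (here: a concat) rebuilds the full activation.
y = torch.cat([y0, y1], dim=1)
assert torch.allclose(y, x @ W.t())
```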
System design
- Assertions
- The largest layer will fit in a single GPU
- All layers together will fit in the GPUs available in the node
- Input variables
- GPU memory
- GPU utilization
Algorithm
- Step 1: Determine the memory occupied by the model (a sketch follows this list)
- Parameters
- Gradients
- Optimizer states
- Input data
- Step 2: Determine the splits for the model
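A back-of-the-envelope sketch of Step 1, assuming fp32 training with Adam (so gradients match the parameter memory and the optimizer keeps two moment buffers per parameter); activations are deliberately ignored since they are model-specific, and the model and batch below are placeholders:

```python
import torch

def estimate_training_memory_bytes(model: torch.nn.Module,
                                    sample_input: torch.Tensor) -> int:
    """Rough lower bound on training memory for fp32 + Adam."""
    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    grad_bytes = param_bytes        # one gradient per parameter
    optim_bytes = 2 * param_bytes   # Adam keeps exp_avg and exp_avg_sq
    input_bytes = sample_input.numel() * sample_input.element_size()
    return param_bytes + grad_bytes + optim_bytes + input_bytes

model = torch.nn.Linear(4096, 4096)   # placeholder model
batch = torch.randn(32, 4096)         # placeholder batch
print(f"{estimate_training_memory_bytes(model, batch) / 2**20:.1f} MiB")
```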
Version 1
- Given a single model occupying memory M, the system has to discover the best placement for it and deploy it there
- Input
- GPU memory
- GPU utilization
- The first version would be a simple model deployer: no ZeRO configuration, no multiple nodes
- Identify, from among all the GPUs,
- where the model can fit
- where GPU utilization is lowest
- Deploy the model to any of the best GPUs (see the placement sketch below)
- Through ONNX
- Won't involve changes to code
- Through DeepSpeed
- Will involve changes to code
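A minimal sketch of the placement rule using NVML via the `pynvml` package; the fit-then-least-utilized tie-break and the `model_bytes` argument are my reading of the notes, not a fixed design:

```python
import pynvml

def pick_best_gpu(model_bytes: int) -> int:
    """Return the index of a GPU where the model fits and utilization is lowest."""
    pynvml.nvmlInit()
    try:
        candidates = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            if mem.free >= model_bytes:            # the model must fit
                candidates.append((util.gpu, i))   # lower utilization wins
        if not candidates:
            raise RuntimeError("no GPU has enough free memory")
        return min(candidates)[1]
    finally:
        pynvml.nvmlShutdown()

# e.g. place a ~2 GiB model:
# gpu_index = pick_best_gpu(2 * 2**30)
```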
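And a sketch of the ONNX route, which is what lets deployment avoid touching the training code. The model is a stand-in, and the ONNX Runtime lines are commented out since they need a CUDA build of `onnxruntime`:

```python
import torch

model = torch.nn.Linear(4096, 4096)   # stand-in model
dummy = torch.randn(1, 4096)          # example input for tracing

# Exporting to ONNX decouples deployment from the training script.
torch.onnx.export(model, dummy, "model.onnx")

# The exported model can then be pinned to the GPU chosen above:
# import onnxruntime as ort
# session = ort.InferenceSession(
#     "model.onnx",
#     providers=[("CUDAExecutionProvider", {"device_id": gpu_index})],
# )
```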
References
- https://huggingface.co/docs/transformers/v4.15.0/en/parallelism
- https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/
- https://github.com/microsoft/DeepSpeed
- https://towardsdatascience.com/pytorch-lightning-vs-deepspeed-vs-fsdp-vs-ffcv-vs-e0d6b2a95719