Microsoft revealed that it operates a globally distributed scheduling service for AI workloads that it modestly dubbed “Singularity”.
Described in a preprint paper [PDF] co-authored by 26 Microsoft employees, Singularity aims to help the software giant control costs by driving high utilization of deep learning workloads.
Singularity achieves this goal with what the paper describes as a “novel workload-aware scheduler that can transparently preempt and elastically scale deep learning workloads to drive high utilization without impacting their correctness or performance, across a global fleet of AI accelerators (e.g. GPUs, FPGAs).”
The document spends more time on the scheduler than on Singularity itself, but offers some numbers describing the system’s architecture. The paper’s performance analysis mentions a test run on Nvidia DGX-2 servers, each with two 20-core Xeon Platinum 8168 sockets, eight V100 GPUs, 692 GB of RAM, and InfiniBand networking. With hundreds of thousands of GPUs in the Singularity fleet, plus FPGAs and possibly other accelerators, Microsoft has at least tens of thousands of such servers!
The architecture of Singularity
The paper focuses on Singularity’s scaling technology and schedulers, which it presents as the system’s secret sauce because they reduce costs and improve reliability.
The software transparently decouples jobs from accelerator resources, meaning that when a job scales up or down, “we simply change the number of devices the workers are mapped to: this is completely transparent to the user, as the world size (i.e. total number of workers) of the job remains the same regardless of the number of physical devices running the job.”
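The idea can be sketched as follows. This is a hypothetical illustration, not Singularity's actual implementation: the function name and the round-robin placement policy are assumptions, chosen only to show how a job's world size can stay fixed while its device count changes.

```python
# Sketch: a job's world size (total logical workers) stays constant
# while the number of physical devices it maps onto changes.
# Round-robin placement is an assumed policy for illustration.

def map_workers_to_devices(world_size, devices):
    """Map each logical worker to a device, round-robin."""
    return {worker: devices[worker % len(devices)]
            for worker in range(world_size)}

# A job with 8 workers spread across 4 GPUs...
mapping = map_workers_to_devices(8, ["gpu0", "gpu1", "gpu2", "gpu3"])

# ...shrunk onto 2 GPUs: the world size is still 8, so the job's
# semantics are unchanged; only the physical mapping differs.
shrunk = map_workers_to_devices(8, ["gpu0", "gpu1"])

assert len(mapping) == len(shrunk) == 8       # world size unchanged
assert len(set(shrunk.values())) == 2         # now sharing 2 devices
```

When the mapping shrinks below one device per worker, several workers must share a device, which is where the splicing technique below comes in.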
This is possible thanks to “a novel technique called replica splicing that makes it possible to time-slice multiple workers on the same device with negligible overhead, while allowing each worker to use the entire device memory.”
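A minimal sketch of that time-slicing behavior, under stated assumptions: the class and function names are invented, and real replica splicing swaps GPU memory state with far more sophistication; this only shows several workers taking turns on one device, each owning all of its memory while scheduled.

```python
# Sketch: time-slice multiple workers on one device. While a worker
# is swapped in, it has the whole device to itself. All names here
# are illustrative assumptions, not Singularity's API.

from collections import deque

class Device:
    def __init__(self, name):
        self.name = name
        self.resident = None  # at most one worker's state on-device

    def swap_in(self, worker):
        # A real system would spill the resident worker's state to
        # host memory and load the incoming worker's state.
        self.resident = worker

def run_time_sliced(device, workers, steps_per_slice, total_steps):
    """Round-robin workers over the device until all finish."""
    queue = deque(workers)
    progress = {w: 0 for w in workers}
    while any(progress[w] < total_steps for w in workers):
        worker = queue.popleft()
        device.swap_in(worker)         # worker now owns device memory
        for _ in range(steps_per_slice):
            if progress[worker] < total_steps:
                progress[worker] += 1  # one training mini-step
        queue.append(worker)
    return progress

progress = run_time_sliced(Device("gpu0"), ["w0", "w1", "w2"], 2, 6)
assert all(steps == 6 for steps in progress.values())
```

The paper's claim is that the swap-in/swap-out step carries negligible overhead, which is what makes such fine-grained sharing practical.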
This requires what the authors call a “device proxy”, which “runs in its own address space and has a one-to-one correspondence with a physical accelerator device. When a worker invokes device APIs, they are intercepted and sent over shared memory to the device proxy process, which runs in a separate address space and whose lifetime is decoupled from the lifetime of the worker process.”
The above allows more jobs to be scheduled, more efficiently, keeping those thousands of servers busy for longer. It also allows jobs to scale up or down quickly, without interruption.
“Singularity achieves a significant breakthrough in scheduling deep learning workloads, converting niche features such as elasticity into mainstream, always-on features that the scheduler can rely on to enforce strict SLAs,” the paper concludes.
Unfortunately, the paper makes no mention of Microsoft openly sharing its research or techniques, instead serving to highlight the scale of the company’s AI operations. ®