DL Scheduler

Note

  1. Are the NeurComm, spatial, and temporal policies mutually dependent?
  2. High system resource utilization does not necessarily mean a heavy task load; are the metrics noisy?
  3. RL regret
  4. Biased random walk
  5. Intrinsic cost and extrinsic cost

ASTRAEA: A Fair Deep Learning Scheduler for Multi-Tenant GPU Clusters (TPDS 2022)

Placement Sensitivity.

If this job is allocated across two nodes, the inter-node communication could become the performance bottleneck [3].

Description of a DL scheduler:

A scheduler is indispensable in a cluster to manage the computing resources and schedule jobs. At runtime, the scheduler continuously receives jobs submitted by users with the explicit resource demands. Then it decides when and where to run by selecting the proper jobs to satisfy the scheduling objectives and placing them to the appropriate servers for execution [24]. Different scheduling systems may pursue various scheduling objectives, e.g., improving resource utilization [5], [20], [25], optimizing the performance of workloads [26], [27], maximizing scheduling efficiency [14], [28], guaranteeing service and user experience [29], [30], etc.

Unpredictable Training Time.

In Venus (a production cluster at SenseTime), 52.2% of jobs are successfully completed, 27.9% are canceled, and 19.9% fail. Due to the high cancellation/failure ratio, it is difficult to predict the running time of DLT jobs based only on the remaining number of iterations.

Issues of DLT Clusters Without Fairness

FIFO (First-In-First-Out)

Long-running jobs may monopolize the whole cluster or VC (virtual cluster)

Blocked DLT jobs (especially short-term ones) may suffer a large execution slowdown ratio (>5), defined as (running_time + pending_time) / running_time.
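
A minimal sketch of how FIFO lets one long job block short ones and inflate their slowdown ratio; the job durations are hypothetical, only to illustrate the definition above:

```python
# Minimal FIFO simulation on a single-resource "cluster" (hypothetical numbers),
# illustrating slowdown = (running_time + pending_time) / running_time.
jobs = [("long", 1000), ("short-1", 10), ("short-2", 10)]  # (name, running_time)

clock = 0
for name, running in jobs:        # FIFO: serve strictly in submission order
    pending = clock               # all jobs submitted at t = 0
    slowdown = (running + pending) / running
    print(f"{name}: pending={pending}, running={running}, slowdown={slowdown:.1f}")
    clock += running              # short jobs behind the long one get slowdown >> 5
```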


Inductive-bias-driven Reinforcement Learning for Efficient Schedules in Heterogeneous Clusters (ICML 2020)

Intro

Limitations:

  1. Current system schedulers rely on application/system-specific heuristics that have to be built on a case-by-case basis.

  2. ML techniques that automate the heuristic search use black-box approaches, which require significant training data and time

Contributions:

  1. a domain-driven (the prior comes from the system architecture) Bayesian reinforcement learning (RL) model for scheduling, which inherently models the resource dependencies identified from the system architecture;
    • our inductive bias is a set of statistical relationships between measurements from microarchitectural monitors
    • For example, PCIe contention can slow GPU-GPU communication by up to 1.8x; a black-box method would have to learn such effects on its own, so this domain knowledge is injected as an inductive bias
  2. a sampling-based technique to compute the gradients of a Bayesian model without performing full probabilistic inference.

Model

The model integrates real-time performance measurements, prior knowledge about workloads, and system architecture to (i) dynamically infer system state (i.e., resource utilization), and (ii) automatically schedule tasks on a heterogeneous processing fabric.

image-20220718160811763

In the figure above, $\hat{b}_t$ is the estimated resource utilization. Because of the many layers of abstraction in a computer system, the metrics that can be collected do not accurately reflect the true system state; quantities that cannot be measured directly are treated as hidden resources.

Performance Counters. PCs are generally relied upon to conduct low-level performance analysis or tuning of performance bottlenecks in applications.

Purpose of using a Bayesian network:

  • model aleatoric uncertainty in measurements
    • Measurements made from PCs have some inherent noise
    • if a single scheduling agent is controlling a cluster of machines (which is common in data centers), measurements made on different machines will not be in sync and will often be delayed by network latency (i.e., missing data caused by network latency, jitter, and packet loss)
    • measured value $m_c$ can be modeled in terms of the true value $v_c$ plus measurement noise $e_c$, i.e., $m_c= v_c+ e_c$.
    • The noise $e_c \sim \mathcal{N}(0, \sigma)$ is zero-mean. That is a valid assumption because the only source of systematic errors would be hardware or software bugs; it also follows from prior work based on extensive measurement studies (Weaver & McKee, 2008). A small sketch of this noise model follows the list.
  • encode our knowledge about system architecture in terms of invariants or statistical relationships between the measurements.
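
A minimal numpy sketch of the noise model $m_c = v_c + e_c$: with zero-mean Gaussian noise and a Gaussian prior on the true value, the posterior mean is a precision-weighted average of the prior and the measurements. The counter values and prior parameters below are illustrative, not from the paper:

```python
import numpy as np

# m_c = v_c + e_c, with e_c ~ N(0, sigma): combine repeated noisy readings of a
# performance counter with a Gaussian prior belief about its true value.
rng = np.random.default_rng(0)
v_true, sigma = 4.2e9, 0.3e9            # illustrative "true" counter rate and noise std
m = v_true + rng.normal(0.0, sigma, 5)  # five delayed / noisy measurements

mu0, tau0 = 4.0e9, 1.0e9                # prior belief about v_c (mean, std)
# Conjugate Gaussian update: precision-weighted average of prior and measurements.
post_prec = 1 / tau0**2 + len(m) / sigma**2
post_mean = (mu0 / tau0**2 + m.sum() / sigma**2) / post_prec
print(f"posterior estimate of v_c: {post_mean:.3e}  (sample mean {m.mean():.3e})")
```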

image-20220718165624794

The system architecture is modeled as a Bayesian network (constructed from processor manuals and the Linux kernel source code that ships as part of the perf package).

We do not explicitly model GPU performance counters as low-level scheduling decisions (e.g., warp-level scheduling) in GPUs are obfuscated by the NVIDIA runtime/driver.
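
A toy sketch of encoding one such architectural invariant as a linear-Gaussian dependency between two counters; the edge, coefficients, and numbers are made up for illustration (the paper derives its structure from processor manuals and perf sources):

```python
import numpy as np

# Toy linear-Gaussian edge of a counter Bayesian network:
# LLC_misses -> DRAM_bandwidth, i.e. bandwidth ~ N(a * misses + b, sigma_bw).
a, b, sigma_bw = 64.0, 1.0e8, 2.0e10   # made-up coefficients: bytes/miss, offset, noise std

def bandwidth_likelihood(bw_meas, llc_misses):
    """Likelihood of an observed bandwidth reading given the LLC-miss reading."""
    mean = a * llc_misses + b
    return np.exp(-0.5 * ((bw_meas - mean) / sigma_bw) ** 2) / (sigma_bw * np.sqrt(2 * np.pi))

# A bandwidth reading consistent with the miss counter gets a much higher likelihood
# than one that is not; this is how such a network can flag noisy or stale measurements.
print(bandwidth_likelihood(bw_meas=2.7e11, llc_misses=4.0e9))
print(bandwidth_likelihood(bw_meas=9.0e11, llc_misses=4.0e9))
```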


Deep Learning-based Job Placement in Distributed Machine Learning Clusters (INFOCOM 2019)

Intro

Interference-aware job placement, plus an auxiliary reward-prediction model.

Mainly targets the parameter server (PS) architecture: The workers and PSs may well be distributed onto different physical servers, when they cannot be completely hosted on one server, or to maximize resource fragment utilization on servers.

Low resource utilization: over-subscription of resources

Contention over resources (CPU caches, disk I/O, network I/O, and buses such as QPI and PCIe): different levels of interference (i.e., resource contention) occur when different types of ML jobs are co-located

Resource contention degrades performance; concretely: we observe a 33% performance degradation on average, and nearly 2x slowdown for training ResNeXt when the jobs are co-located.

Some job combinations suffer little contention when co-located: ResNet and WLM are less affected when trained together, as are ResNet and Seq2Seq.

Some prior work already addresses resource contention: these studies build an explicit interference model of the target performance based on certain observations/assumptions and rely on hand-crafted heuristics for incorporating interference in scheduling. However, they require profiling and threshold tuning, i.e., they are white-box models, so this paper uses RL to build a black-box model instead.

Challenge of applying RL: the action space is huge. Even with 6 jobs, each with 3 workers plus PSs, and 6 servers, there are more than 100 million different ways of placement.

Model

Before submitting a job, the user must provide:

  1. Resource demands of each worker and PS
  2. Number of workers and PSs
  3. Number of training epochs
image-20220721222331069

RL state representation (a rough array sketch follows the list):

  1. $x$ ($N\times L$): N is the number of concurrent jobs, both newly arrived and already running; L is the maximum number of jobs that can be deployed in the cluster
  2. $r$ ($N\times 2(1+K)$): worker/PS resource demands of every job, where K is the number of resource types
  3. Number of workers
  4. Status of the K resource types on each of the M physical servers
  5. Current placement
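
A rough numpy sketch of assembling these state components (shapes only; the field names and sizes are mine, and the paper's actual encoding has more detail):

```python
import numpy as np

N, L, K, M = 6, 10, 3, 6     # illustrative sizes: jobs, job slots, resource types, servers

state = {
    "x": np.zeros((N, L)),            # job matrix (newly arrived + running jobs)
    "r": np.zeros((N, 2 * (1 + K))),  # worker/PS resource demands per job
    "w": np.zeros((N,)),              # number of workers per job
    "avail": np.zeros((M, K)),        # status of K resources on M physical servers
    "placement": np.zeros((N, M)),    # current placement of each job's tasks
}
flat = np.concatenate([v.ravel() for v in state.values()])  # flattened NN input
print(flat.shape)
```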

RL action: a tuple <N, p, M>, i.e., place p workers of job N on server M. A single inference may not place everything, so multiple inferences are allowed; the state is observed again only after all actions have been taken.

RL reward: the objective is to minimize JCT, but:

  1. The reward is only known once a job finishes: job completion time would be a natural reward to observe, but it is only known when a job is finished, which may well be hundreds of scheduling intervals later;
  2. JCT is determined not only by the current placement but also by jobs co-located in the future: the completion time of a job is decided not only by the current job placement state, but also by future deployment of upcoming jobs

So a per-interval reward is designed:

$r=\frac{c}{e}$, where c is the number of epochs trained in the interval and e is the job's total number of epochs

Reward model training: the input is job information plus placement information and the output is the reward, effectively a supervised critic? Fully connected layers are used and were found to work better than a CNN; the model is trained from traces.
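
A minimal PyTorch sketch of such a supervised reward predictor trained from traces; the architecture, feature dimension, and training data below are guesses for illustration, not the paper's exact design:

```python
import torch
import torch.nn as nn

# Supervised "critic": map (job info, placement) features to the per-interval reward.
class RewardModel(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

model = RewardModel(in_dim=256)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Features and rewards would come from historical traces; random placeholders here.
feats, rewards = torch.randn(64, 256), torch.rand(64)   # rewards in [0, 1), like c/e
for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(feats), rewards)
    loss.backward()
    opt.step()
```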

During RL training, maximum-entropy exploration alone is not enough, so:

  1. With probability $1-\epsilon$, use the neural network's output for placement
  2. With probability $\epsilon$, pick randomly from multi-resource bin packing (least free resources first, to avoid resource fragmentation, as in Tetris) and load balancing (most free resources first) policies

The rationale behind is to enable NN to effectively explore the tradeoff between resource utilization (bin packing is best for) and workload interference (that load balancing avoids). In this way, we improve exploration quality to guide the NN training to converge to a good policy.
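
A sketch of this ε-mixed exploration; the Server/Job classes and the one-dimensional "free" resource are simplifications of mine, and nn_policy stands in for the trained network:

```python
import random
from dataclasses import dataclass

@dataclass
class Server:
    free: float      # remaining capacity, collapsed to one scalar for simplicity

@dataclass
class Job:
    demand: float

def bin_packing(job, servers):
    # Tetris-style: pick the feasible server with the LEAST free capacity,
    # to reduce resource fragmentation.
    feasible = [s for s in servers if s.free >= job.demand]
    return min(feasible, key=lambda s: s.free)

def load_balancing(job, servers):
    # Pick the server with the MOST free capacity, spreading load to limit interference.
    return max(servers, key=lambda s: s.free)

def choose_placement(nn_policy, job, servers, epsilon=0.1):
    """epsilon-mixed exploration: mostly follow the NN, sometimes a heuristic."""
    if random.random() > epsilon:
        return nn_policy(job, servers)
    return random.choice([bin_packing, load_balancing])(job, servers)

# Toy usage with a placeholder policy standing in for the trained NN.
servers = [Server(free=4.0), Server(free=1.5), Server(free=8.0)]
greedy_nn = lambda j, ss: load_balancing(j, ss)
print(choose_placement(greedy_nn, Job(demand=1.0), servers, epsilon=0.3))
```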


Research on Resource Scheduling Mechanisms for Distributed Machine Learning Clusters (Thesis 2020)

Intro

Most existing work builds white-box models of the resource-versus-completion-time relationship based on the researchers' understanding of specific ML frameworks and workloads, and then solves them with heuristic algorithms. The accuracy of the white-box model greatly affects scheduling performance, and such models do not generalize.

The thesis compares Bayesian optimization sub-algorithms built from 4 combinations of probabilistic surrogate models and acquisition functions; the combination of a Gaussian process surrogate with the EI (Expected Improvement) acquisition function performs best in all scenarios, as sketched below.
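
A small sketch of the GP + EI combination, using scikit-learn's GP regressor; the 1-D toy objective and search space are stand-ins of mine, not the thesis's resource-allocation problem:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):                        # toy objective standing in for "job performance"
    return -(x - 0.6) ** 2

X = np.array([[0.1], [0.5], [0.9]])      # initial observations
y = objective(X).ravel()
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

cand = np.linspace(0, 1, 200).reshape(-1, 1)
for _ in range(10):
    gp.fit(X, y)                                       # GP surrogate of the objective
    mu, sigma = gp.predict(cand, return_std=True)
    best = y.max()
    # Expected Improvement (maximization form): EI = (mu - best)*Phi(z) + sigma*phi(z)
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = cand[np.argmax(ei)]                       # next point to evaluate
    X = np.vstack([X, [x_next]])
    y = np.append(y, objective(x_next))

print("best x found:", X[np.argmax(y)], "value:", y.max())
```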

image-20220723161422998

Resource scheduling taxonomy: static vs. dynamic scheduling; preemptive vs. non-preemptive scheduling

Challenges:

  1. The inherent complexity and heterogeneity of distributed machine learning jobs
  2. Even for a fixed distributed ML job, determining the relationship between resources and completion time is very hard, because speedup is not linear in resources. Some recent work tries to build elaborate mathematical models of completion time: [10] models the completion time of parameter-server jobs, and [34] models AllReduce jobs, but only for convolutional neural networks.
  3. Placement sensitivity
  4. Resource contention

Convergence-curve approximation (k is the iteration count):

First-order optimization (GD):

image-20220723161915012

Second-order optimization (Newton's method, BFGS):

image-20220723162006488
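
The two screenshots above hold the thesis's exact formulas, which are not reproduced here. As a rough, hedged illustration of the idea, first-order methods are often approximated with an O(1/k) curve fitted to the observed training loss; the functional form, data, and target loss below are assumptions of mine, not the thesis's model:

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_curve(k, a, b, c):
    # Assumed O(1/k) approximation for first-order methods: loss ≈ 1/(a*k + b) + c
    return 1.0 / (a * k + b) + c

# Synthetic "observed" loss values standing in for a real training curve.
k = np.arange(1, 200)
observed = 1.0 / (0.05 * k + 1.0) + 0.1 + np.random.default_rng(0).normal(0, 0.005, k.size)

(a, b, c), _ = curve_fit(loss_curve, k, observed, p0=(0.01, 1.0, 0.0))
k_target = (1.0 / (0.15 - c) - b) / a   # predicted iterations to reach loss 0.15
print(f"fit: a={a:.3f}, b={b:.3f}, c={c:.3f}; iterations to reach loss 0.15 ≈ {k_target:.0f}")
```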

Model

For Bayesian optimization, see Chapter 3 of the thesis, which is very well written.

The Metis algorithm works in fixed time slices; each time slot is one scheduling interval (e.g., 30 minutes).

The agent selects an action based on the state. After the action is taken, the neural network's input state changes accordingly; the network then produces a new action for the new state, until there are no unallocated nodes left in the cluster or the agent outputs the action ∅.
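
A schematic sketch of that per-slot decision loop; the state dictionary, node/job names, and the toy policy are placeholders of mine standing in for the thesis's NN and state encoding:

```python
def run_time_slot(agent_act, state, unallocated):
    """One scheduling slot: query the policy repeatedly until no unallocated nodes
    remain or it outputs the empty action (None); the reward for all of these
    actions is only observed when the slot ends."""
    actions = []
    while unallocated:
        action = agent_act(state, unallocated)    # NN maps current state to an action
        if action is None:                        # the "∅" action: stop allocating
            break
        node, job = action
        state[job] = state.get(job, []) + [node]  # placement changes the next input state
        unallocated.remove(node)
        actions.append(action)
    return actions                                # credited together at the slot's end

# Toy usage: a stand-in policy that assigns every free node to job "j1".
toy_policy = lambda s, free: (next(iter(free)), "j1") if free else None
print(run_time_slot(toy_policy, {}, {"n1", "n2", "n3"}))
```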

image-20220723165507601

Unlike standard reinforcement learning, where every action receives an immediate reward, here the neural network makes multiple decisions within one time slot, producing multiple actions, and the reward of these actions is only known when the time slot ends.

The scheduling objective of this chapter is to maximize the overall performance of the machine learning models in the cluster. However, a model's final performance is only known after training completes, which may take hundreds or even thousands of time slots. For online reinforcement learning, such a large reward feedback delay is unacceptable, because delayed rewards provide little guidance for improving early decisions.