RL for Resource Management

Resource Management with Deep Reinforcement Learning (HotNets 2016)

intro

A look at recent research in the field shows a typical design flow:

  • come up with clever heuristics for a simplified model of the problem;

  • painstakingly test and tune the heuristics for good performance in practice.

This process often has to be repeated if some aspect of the problem such as the workload or the metric of interest changes.

DeepRM does not require any prior knowledge of the system’s behavior to learn such management strategies.

model


Objective: minimize average job slowdown.
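The reward structure behind this objective, as described in the paper: at each timestep the agent receives −1/T_j for every job j currently in the system (running or waiting), where T_j is the job's ideal duration, so with no discounting the cumulative reward equals the negative sum of job slowdowns. A minimal sketch, with illustrative names:

```python
# Per-step reward from the DeepRM objective: -1/T_j summed over all jobs
# currently in the system. `jobs_in_system` and `ideal_duration` are
# illustrative names for the scheduler's bookkeeping.
def slowdown_reward(jobs_in_system):
    """jobs_in_system: iterable of objects with an `ideal_duration` field."""
    return sum(-1.0 / job.ideal_duration for job in jobs_in_system)
```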

discuss

  1. Assumes the machines' resources form one large pool and does not consider data locality.
  2. Does not consider jobs that consist of multiple stages; a job's resource profile may be unobservable, making this a partially observed MDP. (Further, the resource profile of a job may not be known in advance (e.g., for non-recurring jobs), and the scheduler might only get an accurate view as the job runs. The RL paradigm can in principle deal with such situations of partial observability by casting the decision problem as a POMDP.)
  3. The time horizon is finite; a time-independent value network could be designed to handle this.

A New Approach for Resource Scheduling with Deep Reinforcement Learning (arXiv 2018)

intro

Due to the many factors that need to be considered, the scheduling problem is in most cases NP-hard or NP-complete.

contributions:

  1. imitation learning is used to obtain an initial policy (see the sketch after this list)
  2. a CNN-based network architecture
  3. a redefined notion of cluster capacity
  4. online and offline versions of DeepRM
    • Online: jobs arrive following a Poisson distribution
    • Offline: all jobs arrive at one time
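A minimal sketch of using imitation learning to seed the policy: pretrain the policy network with a cross-entropy loss against actions chosen by a heuristic scheduler, then continue with RL. The network interface, the heuristic labels, and all names here are assumptions for illustration, not the paper's exact setup:

```python
import torch
import torch.nn as nn

def pretrain_with_heuristic(policy_net, states, heuristic_actions, epochs=10):
    """Behavior cloning: fit policy_net's action logits to the actions a
    heuristic scheduler (e.g., shortest-job-first) took in recorded states.
    states: (N, state_dim) float tensor; heuristic_actions: (N,) long tensor.
    """
    opt = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        logits = policy_net(states)            # (N, num_actions)
        loss = loss_fn(logits, heuristic_actions)
        loss.backward()
        opt.step()
    return policy_net
```

The pretrained network then serves as the starting point for policy-gradient training instead of a random initialization.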

discuss

  1. Multi-cluster scheduling: because jobs do not perfectly fill all the cluster space, each node will have some resource fragmentation that is difficult to use.

Intelligent Cloud Resource Management with Deep Reinforcement Learning (IEEE Cloud Computing 2017)

intro

issues in traditional resource management:

  1. Coordinating the cloud environment, physical resources, and virtual resources is required, which is a vast undertaking.
  2. Cloud systems are configured manually and rely heavily on relevant knowledge and experience.

model

State: a configuration of multiple resources, such as the amounts of CPU and memory.

Action: a one-unit adjustment of a single resource, such as a one-unit increase in CPU.

Reward: a two-dimensional vector covering performance and cost.
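A minimal tabular sketch of this MDP under stated assumptions: states are discrete resource configurations, actions are one-unit adjustments, and the two-dimensional (performance, cost) reward is scalarized with an assumed weight w. The environment interface `env_step` and the weighting are illustrative, not the paper's:

```python
import random
from collections import defaultdict

# Actions: one-unit increase/decrease of a single resource (CPU or memory).
ACTIONS = [(+1, 0), (-1, 0), (0, +1), (0, -1)]

def q_learning_step(Q, state, env_step, w=0.5, alpha=0.1, gamma=0.95, eps=0.1):
    """One epsilon-greedy Q-learning step. `env_step(state, action)` is an
    assumed interface returning (next_state, (performance, cost))."""
    if random.random() < eps:
        a = random.randrange(len(ACTIONS))
    else:
        a = max(range(len(ACTIONS)), key=lambda i: Q[state][i])
    next_state, (perf, cost) = env_step(state, ACTIONS[a])
    r = w * perf - (1 - w) * cost            # scalarization (assumption)
    Q[state][a] += alpha * (r + gamma * max(Q[next_state]) - Q[state][a])
    return next_state

# State: a tuple such as (num_cpus, mem_units); Q-table defaults to zeros.
Q = defaultdict(lambda: [0.0] * len(ACTIONS))
```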


discuss

  1. The process does not necessarily follow an MDP.
  2. Hierarchical RL could be applied to the resulting semi-MDP.

Data Centers Job Scheduling with Deep Reinforcement Learning (PAKDD 2020)

Uses A2C (advantage actor-critic) for scheduling; a sketch of the update follows.
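For reference, a compact sketch of the standard A2C update (not the paper's exact network or hyperparameters; all names are placeholders):

```python
import torch
import torch.nn.functional as F

def a2c_update(policy_net, value_net, optimizer,
               states, actions, returns, entropy_coef=0.01):
    """One A2C update on a batch of transitions. policy_net(states) returns
    action logits; value_net(states) returns state-value estimates."""
    logits = policy_net(states)                  # (N, num_actions)
    values = value_net(states).squeeze(-1)       # (N,)
    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    advantages = returns - values.detach()       # advantage estimate
    policy_loss = -(chosen * advantages).mean()
    value_loss = F.mse_loss(values, returns)
    entropy = -(log_probs.exp() * log_probs).sum(dim=1).mean()
    loss = policy_loss + 0.5 * value_loss - entropy_coef * entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```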

A Reinforcement Learning Based Resource Management Approach for Time-critical Workloads in Distributed Computing Environment (BigData 2018)

intro

Considers time-critical and non-time-critical applications together as hybrid workloads.

Handling multiple jobs: feed each job into the network as input, obtain one vector per job, and concatenate these vectors together? (see the sketch below)
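One way to realize that reading, sketched under assumptions (a shared per-job encoder and zero-padding to a fixed job count; all names are illustrative):

```python
import torch
import torch.nn as nn

class JobSetEncoder(nn.Module):
    """Encode each job with a shared MLP and concatenate the embeddings
    (zero-padded up to max_jobs) into one fixed-size state vector."""
    def __init__(self, job_dim, emb_dim, max_jobs):
        super().__init__()
        self.max_jobs, self.emb_dim = max_jobs, emb_dim
        self.encoder = nn.Sequential(nn.Linear(job_dim, emb_dim), nn.ReLU())

    def forward(self, jobs):                  # jobs: (num_jobs, job_dim)
        jobs = jobs[: self.max_jobs]          # truncate overflow
        flat = self.encoder(jobs).flatten()   # (num_jobs * emb_dim,)
        pad = self.max_jobs * self.emb_dim - flat.numel()
        return torch.cat([flat, flat.new_zeros(pad)])
```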

The cluster state is randomized to improve generalization.

Deep Reinforcement Learning based Elasticity-compatible Heterogeneous Resource Management for Time-critical Computing (ICPP 2020)

intro

Targets both time-critical and general applications.

All computing-environment features, such as multi-cluster operation, elasticity, and heterogeneity (akin to job-machine affinity), together with job features such as job type and timeliness, add up to the complexity of producing a satisfactory resource management approach that can make good use of the multi-cluster environment.

Clusters are allowed to have elasticity; that is, their computing capability in terms of executor numbers can be temporarily expanded (up to an upper bound) when necessary to absorb workload pressure.
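A toy model of that elasticity notion (field names are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class ElasticCluster:
    base_executors: int      # normal capacity
    max_executors: int       # hard upper bound on temporary expansion
    in_use: int = 0

    def request(self, n: int) -> bool:
        """Grant n executors, elastically exceeding base capacity
        but never the upper bound."""
        if self.in_use + n <= self.max_executors:
            self.in_use += n
            return True
        return False

    def release(self, n: int) -> None:
        self.in_use = max(0, self.in_use - n)
```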

Jobs:

  1. streaming jobs: repetitive executions over data batches, time-critical;
  2. non-streaming time-critical jobs;
  3. other general non-time-critical jobs.

An LSTM is used to encode the state (sketch below).
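A sketch of an LSTM state encoder under assumed dimensions: the final hidden state of the observation sequence serves as the RL state vector:

```python
import torch
import torch.nn as nn

class LSTMStateEncoder(nn.Module):
    """Encode a variable-length sequence of cluster/job observations;
    the last hidden state becomes the policy's state input."""
    def __init__(self, obs_dim=16, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)

    def forward(self, obs_seq):               # (batch, seq_len, obs_dim)
        _, (h_n, _) = self.lstm(obs_seq)
        return h_n[-1]                         # (batch, hidden_dim)

state = LSTMStateEncoder()(torch.randn(1, 10, 16))  # example usage
```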

Multi-objective RL is used.

A novel multi agent reinforcement learning approach for job scheduling in Grid computing (2011)

Decentralized

Resource Allocation in the Grid Using Reinforcement Learning

multi-agent

Online Resource Allocation Using Decompositional Reinforcement Learning (AAAI 2005)

Online Planner Selection with Graph Neural Networks and Adaptive Scheduling (AAAI 2020)