User Allocation In MEC: A DRL Approach (Paper Notes)
Paper: ICWS 2021, "User Allocation in Mobile Edge Computing: A Deep Reinforcement Learning Approach"
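Method: The paper proposes an on-device deep reinforcement learning (DRL) framework for the edge user allocation (EUA) problem. The agent gradually learns a suitable resource allocation from experience and interaction with the MEC system. Under a service latency threshold, the DRL agent learns how many users can be served on a given edge server. By observing the system parameters simultaneously, it learns the non-linear dependencies directly from the edge servers.
II. MOTIVATION-A. Observations Validating the Hypothesis
The service execution time of the object detection application YOLO processing images was measured with and without a GPU:
• Variation across time: even on the same machine with the same configuration, execution time varies significantly. It is influenced by several hidden parameters such as the service invocation pattern and temperature.
• Variation across services: different services show different execution time patterns. YOLO's execution time grows approximately linearly with the number of users, whereas MobileNet's is non-linear, which complicates the modelling task.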
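II. MOTIVATION-A Motivating Example
There are seven mobile users U1, U2, ..., U7 and two edge servers e1 and e2. Each user requests one of the two services s1 and s2 available on the edge servers: users U1, U2, U4 and U6 request s1, and the remaining users request s2. Each edge server is described by a resource vector, a 4-tuple (available RAM, number of cores, CPU background workload %, GPU utilization %).
For example, on edge server e1 a single user request for s1 takes 3.12 s to execute, so linear interpolation predicts 12.48 s for four users. The actual time, however, is 3.468 s.
Assume a latency threshold of 6.58 s. A deterministic method that only considers the execution time of a single request assigns U1, U2 and U3 to e1: it gives s1 only 2 users (3.12 s each) and s2 only 1 user (6.32 s).
The deterministic method allocates 3 users in total.
The execution time for four users running YOLO is below the 6.55 s latency threshold. In practice, users U1, U2 and U4 can be allocated to e1 (taking only 3.35 s), and e2 can host U5 and U7 (taking only 6.12 s).
The data-driven approach allocates 5 users in total (2 more than the deterministic method) because it models resource utilization more accurately.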
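The arithmetic of this example can be checked in a few lines. The sketch below uses only the numbers quoted above (3.12 s per s1 request, 6.32 s per s2 request, a 6.58 s threshold, 3.468 s measured for four s1 users). The helper name `deterministic_capacity` is made up for illustration and does not come from the paper or its code.

```python
import math

single_s1 = 3.12          # execution time of one s1 request on e1 (s)
single_s2 = 6.32          # execution time of one s2 request (s)
threshold = 6.58          # latency threshold (s)
measured_four_s1 = 3.468  # measured time for four s1 users (s)

# Linear extrapolation from a single request to four users
predicted_four_s1 = 4 * single_s1

def deterministic_capacity(single_time, limit):
    """Users that fit if execution time is assumed to grow linearly."""
    return math.floor(limit / single_time)

cap_s1 = deterministic_capacity(single_s1, threshold)
cap_s2 = deterministic_capacity(single_s2, threshold)

print(round(predicted_four_s1, 2), cap_s1, cap_s2)  # 12.48 2 1
print(measured_four_s1 <= threshold)                # True
```

The linear model rejects four s1 users (12.48 s is above the threshold) even though the measured time of 3.468 s fits comfortably, which is the gap the data-driven approach exploits.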
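In the MEC environment, each edge server has a coverage radius. Mobile users within the coverage radius of an edge server can request the services hosted on that server. Each edge server has a set of available resources (RAM, cores, CPU background workload %, GPU utilization %).
(a) A new user enters the coverage area of an edge server; (b) a user moves out of the coverage area of an edge server; (c) a user's service request changes; (d) an edge server or a mobile user goes offline.
The agent in an RL framework learns the environment by exploring it and receiving feedback on its actions, so that it can choose better actions. RL can learn the underlying environment without requiring large amounts of labelled data.
In this RL framework, the agent continuously interacts with the edge server: it takes an action, executes multiple service requests, and receives a reward based on the execution footprint.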
# compute latency (excerpt of the environment's reward method)
def get_reward(self, state, action):
    # Decode the action index into two user counts u1 and u2 (each in 1..5)
    u1 = action//5 + 1
    u2 = (action+1) - (u1-1)*5
    # sample time from dataframe
    gram = state[0]
    gcores = state[1]
    gwl_c = state[2]
    gwl_g = state[3]
    gs1 = u1*100
    gs2 = u2*100
    fetch_state = self.df.loc[ (self.df['ram'] == gram) & (self.df['cores']== gcores) & (self.df['workload_cpu']==gwl_c) & (self.df['workload_gpu']==gwl_g) & (self.df['users_yolo']==gs1) & (self.df['users_mnet']==gs2)]
    if fetch_state.empty:  # no matching measurement: return a large negative reward to mark this as a bad action
        return -20
    # Sample the execution times (the two samples are drawn independently from the matching rows)
    time1 = fetch_state.sample().iloc[0]['time_yolo']
    time2 = fetch_state.sample().iloc[0]['time_mnet']
    tm = max(time1, time2)  # the larger of the two times is compared against the latency threshold
    # add total latencies due to network based on number of u1 and u2
    if (tm <= latency_threshold):  # reward the change in user counts plus a base reward of u1 + u2
        return 0.01*(gs1 - state[4]) + 0.01*(gs2 - state[5]) + u1 + u2
    return -5 - u1 - u2
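To make the reward shape easier to see, here is a minimal standalone sketch of the same arithmetic with the dataframe lookup removed. The function names `decode_action` and `reward`, and the threshold value 6.0 used in the example calls, are assumptions for illustration; the notes do not state the value of `latency_threshold`.

```python
def decode_action(action, levels=5):
    """Map an action index 0..24 to user-count levels (u1, u2), each in 1..5."""
    u1 = action // levels + 1
    u2 = (action + 1) - (u1 - 1) * levels
    return u1, u2

def reward(action, prev_s1, prev_s2, tm, latency_threshold):
    """Reward as in get_reward, given an already sampled latency tm."""
    u1, u2 = decode_action(action)
    gs1, gs2 = u1 * 100, u2 * 100
    if tm <= latency_threshold:
        # change in user counts plus a base reward of u1 + u2
        return 0.01 * (gs1 - prev_s1) + 0.01 * (gs2 - prev_s2) + u1 + u2
    # threshold violated: the penalty grows with the number of users
    return -5 - u1 - u2

print(decode_action(7))                  # (2, 3)
print(reward(7, 100, 100, 5.0, 6.0))     # 8.0
print(reward(7, 100, 100, 7.5, 6.0))     # -10
```

Larger user counts earn more when the threshold is met and lose more when it is violated, so the agent is pushed towards the largest feasible allocation.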
Training the RL Agent
import random
import numpy as np
import pandas as pd
import gym
from gym import spaces
from gym.utils import seeding

# latency_threshold is read as a global in get_reward; it must be defined before training

class yolosystem(gym.Env):
    metadata = {'render.modes': ['human']}

    def __init__(self, n_actions, filename):
        super(yolosystem, self).__init__()
        self.n_actions = n_actions  # total number of actions after ranging [10, 20, 30 ...]
        self.action_space = spaces.Discrete(self.n_actions)  # discrete actions 0 .. n_actions-1
        # observation = <ram, cores, workload_cpu, workload_gpu, users_yolo, users_mnet>
        self.observation_space = spaces.Box(low=np.array([0,0,0,0,0,0]), high=np.array([11000]*6), shape=(6, ), dtype=np.int32)
        self.current_obs = np.array( [3000, 2, 40, 2, 100, 100] )  # initial observation
        # Load dataset
        self.df = pd.read_csv(filename)
        # bucket the GPU workload: divide by 80 and round to an integer
        self.df['workload_gpu'] = self.df['workload_gpu'].multiply(1/80).round(0).astype(int)
        # get unique values in the dataset
        self.ram = self.df.ram.unique()
        self.cores = self.df.cores.unique()
        self.workload_cpu = self.df.workload_cpu.unique()
        print(self.df)  # print dataset

    def seed(self, seed=1010):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]

    def step(self, action):
        assert self.action_space.contains(action)  # action should be in action space
        state = self.current_obs
        done = True  # episode ends after each action
        # compute latency from the number of users
        reward = self.get_reward(state, action)
        # print(action, reward)
        self.current_obs = self.get_random_state()  # go to a random state
        # print(self.current_obs)
        return self.current_obs, reward, done, {}  # next observation, reward, episode-done, no info

    def reset(self):
        self.current_obs = self.get_random_state()
        return self.current_obs  # a randomly drawn system state

    def render(self, mode='human', close=False):
        print(f"Current State:<{self.current_obs}>")

    # compute latency
    def get_reward(self, state, action):
        # Decode the action index into two user counts u1 and u2 (each in 1..5)
        u1 = action//5 + 1
        u2 = (action+1) - (u1-1)*5
        # sample time from dataframe
        gram = state[0]
        gcores = state[1]
        gwl_c = state[2]
        gwl_g = state[3]
        gs1 = u1*100
        gs2 = u2*100
        fetch_state = self.df.loc[ (self.df['ram'] == gram) & (self.df['cores']== gcores) & (self.df['workload_cpu']==gwl_c) & (self.df['workload_gpu']==gwl_g) & (self.df['users_yolo']==gs1) & (self.df['users_mnet']==gs2)]
        if fetch_state.empty:  # no matching measurement: return a large negative reward to mark this as a bad action
            return -20
        # Sample the execution times (the two samples are drawn independently from the matching rows)
        time1 = fetch_state.sample().iloc[0]['time_yolo']
        time2 = fetch_state.sample().iloc[0]['time_mnet']
        tm = max(time1, time2)  # the larger of the two times is compared against the latency threshold
        # add total latencies due to network based on number of u1 and u2
        if (tm <= latency_threshold):  # reward the change in user counts plus a base reward of u1 + u2
            return 0.01*(gs1 - state[4]) + 0.01*(gs2 - state[5]) + u1 + u2
        return -5 - u1 - u2

    # get to some random state after taking an action
    def get_random_state(self):
        # generate state randomly
        gram = np.random.choice(self.ram, 1)[0]
        gcores = np.random.choice(self.cores, 1)[0]
        gwl_c = np.random.choice(self.workload_cpu, 1)[0]
        # fetch the rows for this state
        fetch_state = self.df.loc[ (self.df['ram'] == gram) & (self.df['cores']== gcores) & (self.df['workload_cpu']==gwl_c) ]
        gwl_g = fetch_state.sample().iloc[0]['workload_gpu']  # fetch GPU workload randomly
        gs1 = random.randrange(50, 550, 50)
        gs2 = random.randrange(50, 550, 50)
        return np.array( [gram, gcores, gwl_c, gwl_g, gs1, gs2] )
import os
import time
from stable_baselines3.common.monitor import Monitor
# Create log dir
log_dir = './agent_tensorboard/'
os.makedirs(log_dir, exist_ok=True)
# env must already be a yolosystem instance here; its construction is not shown in these notes
env = Monitor(env, log_dir)
from stable_baselines3 import DQN
from stable_baselines3.dqn import MlpPolicy
from stable_baselines3.common.vec_env import DummyVecEnv
# wrap it: convert the non-vectorized environment env into a vectorized one
env = DummyVecEnv([lambda: env])
model = DQN(MlpPolicy, env, verbose=0, tensorboard_log = log_dir, exploration_fraction=0.4, learning_starts=150000, train_freq=30, target_update_interval=30000, exploration_final_eps=0.07)
begin = time.time()
model.learn(total_timesteps=500000)  # reset_num_timesteps=False
end = time.time()
training_time = end-begin
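The exploration settings above can be read as a schedule. Stable-Baselines3's DQN decays epsilon linearly from `exploration_initial_eps` (1.0 by default) to `exploration_final_eps` over the first `exploration_fraction` of training, then holds it constant. The function below is a hand-written re-implementation of that schedule for illustration only; `epsilon_at` is not a library function.

```python
def epsilon_at(step, total_timesteps=500_000, exploration_fraction=0.4,
               initial_eps=1.0, final_eps=0.07):
    """Epsilon after `step` environment steps under a linear decay schedule."""
    progress = step / total_timesteps
    if progress > exploration_fraction:
        return final_eps
    return initial_eps + progress * (final_eps - initial_eps) / exploration_fraction

for step in (0, 100_000, 200_000, 400_000):
    print(step, round(epsilon_at(step), 3))
```

With these values epsilon reaches 0.07 after 200,000 steps. Since `learning_starts=150000`, the agent only collects transitions (without gradient updates) for the first 150,000 steps, most of which are taken at a high exploration rate.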
RL Allocation Algorithm code
# model_rl is the trained DQN agent, loaded beforehand
# Relies on globals: server_state, ngb, service, U, N, S
def rl_algo():
    server_capacity = np.zeros((N, S))
    for server_id in range(N):
        state = server_state[server_id]
        # if model_type == 'lin':
        action = model_rl.predict(np.array(state), deterministic=True)
        # if model_type == 'exp':
        #     action = model_exp.predict(np.array(state), deterministic=True)
        # Decode the action into the predicted capacity for the two services (u1 and u2)
        u1 = action[0]//5 + 1
        u2 = (action[0]+1) - (u1-1)*5
        server_capacity[server_id][0] = u1*100  # model output
        server_capacity[server_id][1] = u2*100  # model output
    col1 = np.array([np.sum(ngb,axis=1)])
    col2 = np.array([np.arange(U)])
    sorted_ngb = np.concatenate((ngb, col1.T, col2.T), axis=1)  # add row-sum and index columns
    sorted_ngb = sorted_ngb[np.argsort(sorted_ngb[:, N])]  # sort the rows by the row-sum column
    # run allocation algorithm
    rl_allocation = []
    # For each user, look at the servers that cover it and the service it requests, and pick the
    # server with the largest predicted capacity. If capacity remains, decrement it and record the pair.
    for i in range(U):
        server_list = np.where(sorted_ngb[i, :N] == 1)[0]  # servers that cover this user
        if len(server_list) == 0:  # skip users not covered by any server
            continue
        ser = int(service[i])  # which service the user is requesting
        choosen_server = server_list[np.argmax(server_capacity[server_list, ser])]  # ID of the chosen server
        if server_capacity[choosen_server][ser] > 0:  # allocate the user to choosen_server
            server_capacity[choosen_server][ser] -= 1  # reduce the server capacity
            rl_allocation.append( (int(sorted_ngb[i, N+1]), choosen_server) )  # (user, server) alloc pair
    print('RL Num of allocation: {}'.format(len(rl_allocation)))
    return rl_allocation
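The allocation step of rl_algo() can be tried on a toy instance without a trained agent. The sketch below is a simplified stand-in: the capacities are hard-coded instead of predicted by the model, the function name `allocate_by_capacity` and all data are made up, and a stable sort is used so that the result is reproducible (the original relies on NumPy's default sort).

```python
import numpy as np

def allocate_by_capacity(ngb, service, server_capacity):
    """Assign each user to the covering server with the most remaining capacity."""
    capacity = server_capacity.astype(float).copy()
    # users covered by fewer servers are handled first
    order = np.argsort(ngb.sum(axis=1), kind="stable")
    allocation = []
    for user in order:
        servers = np.where(ngb[user] == 1)[0]
        if len(servers) == 0:        # user is outside every coverage area
            continue
        ser = int(service[user])     # requested service
        chosen = servers[np.argmax(capacity[servers, ser])]
        if capacity[chosen, ser] > 0:
            capacity[chosen, ser] -= 1
            allocation.append((int(user), int(chosen)))
    return allocation

# 4 users x 2 servers coverage matrix (hypothetical data)
ngb = np.array([[1, 0], [1, 1], [0, 0], [1, 1]])
service = np.array([0, 0, 1, 0])
capacity = np.array([[1, 0], [1, 2]])   # rows: servers, columns: services
print(allocate_by_capacity(ngb, service, capacity))  # [(0, 0), (1, 1)]
```

User 2 is not covered by any server and user 3 finds both servers exhausted for service 0, so only two users are allocated.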
The execution time of a service on an edge server is determined from the mean of the historical service execution data, which in turn determines how many users can be allocated to that server. The corresponding code is in allocation.ipynb, def generate_server_state(num_server):
# Take the rows matching the selected GPU workload and compute the mean YOLO and MNet times
time_yolo = fetch_time['time_yolo'].mean()  # average of time for particular state
time_mnet = fetch_time['time_mnet'].mean()
# Build the state of each server from the service requests it receives
gs1 = server_service[s_id][0]
gs2 = server_service[s_id][1]
server_state.append( [gram, gcores, gwl_c, gwl_g, gs1, gs2] )
# append the gamma value of each server
gamma.append((time_yolo, time_mnet))
ILP Algorithm: Integer Linear Programming (ILP) code
from mip import Model, xsum, MAXIMIZE, BINARY, CBC

# Relies on globals: U, N, ngb, service, gamma, latency_threshold, network_latency
def ilp_algo():
    ## ===================================ILP with python mip
    # >> solver_name=GRB
    # >> Currently using CBC
    I = range(U)  # users
    J = range(N)  # servers
    alloc = Model(sense=MAXIMIZE, name="alloc", solver_name=CBC)
    alloc.verbose = 0

    def coverage(user_ix, server_ix):
        if ngb[user_ix][server_ix]==1:
            return 1
        return 0

    # U: num of users, N: num of servers
    # Binary variable matrix x, where x[i][j] says whether user i is allocated to server j
    x = [[ alloc.add_var(f"x_{i}_{j}", var_type=BINARY) for j in J] for i in I]
    # Objective: maximize the number of allocated users
    alloc.objective = xsum( x[i][j] for i in I for j in J )
    # 1. Coverage constraint
    for i in I:
        for j in J:
            if not coverage(i,j):
                alloc += x[i][j] == 0
    # 2. Each user is allocated to at most one server
    for i in I:
        alloc += xsum( x[i][j] for j in J ) <=1
    # 3. Latency constraint
    for j in J:
        alloc += xsum( gamma[j][int(service[i])]*x[i][j] for i in I ) <=latency_threshold-network_latency[j]
    #===========Start Optimization=========
    alloc.optimize(max_seconds=25)  # solve the model
    #==========ILP Ends here
    ilp_allocation = [ (i,j) for i in I for j in J if x[i][j].x >= 0.99]  # read back the allocation
    #print(f"Number of Solutions:{qoe.num_solutions}")
    #print(f"Objective Value:{qoe.objective_value}")
    allocated_num_users = len(ilp_allocation)
    print("ILP Allocated Num of Users: {}".format(allocated_num_users))  # number of allocated users
    # selected.sort()
    return ilp_allocation
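For a very small instance, the same objective and the three constraints can be checked by brute-force enumeration instead of a solver. This is only a sanity-check sketch, not the ILP itself: `best_allocation`, the toy data, and the `budget` list (standing in for latency_threshold minus network_latency per server) are all assumptions.

```python
from itertools import product

def best_allocation(ngb, service, gamma, budget):
    """Try every assignment (user -> server or None) and keep the largest feasible one."""
    U, N = len(ngb), len(ngb[0])
    best = []
    for choice in product([None] + list(range(N)), repeat=U):
        # 1. coverage: a user may only use a server that covers it
        if any(j is not None and ngb[i][j] != 1 for i, j in enumerate(choice)):
            continue
        # 2. at most one server per user holds by construction of `choice`
        # 3. latency: summed per-user service times must fit each server's budget
        load = [0.0] * N
        for i, j in enumerate(choice):
            if j is not None:
                load[j] += gamma[j][service[i]]
        if any(load[j] > budget[j] for j in range(N)):
            continue
        alloc = [(i, j) for i, j in enumerate(choice) if j is not None]
        if len(alloc) > len(best):
            best = alloc
    return best

ngb = [[1, 1], [1, 0], [0, 1]]        # 3 users x 2 servers
service = [0, 0, 1]                   # requested service per user
gamma = [(3.0, 6.0), (2.0, 4.0)]      # per-server (time_s1, time_s2)
budget = [6.5, 5.0]                   # threshold minus network latency
print(best_allocation(ngb, service, gamma, budget))  # [(0, 0), (1, 0), (2, 1)]
```

Enumeration grows as (N+1)^U, so it is only usable for a handful of users; that is why the notes rely on a solver with a time limit.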
Nearest Neighbourhood Greedy Algorithm code
def greedy_algo():
    server_capacity = np.zeros(N)  # accumulated latency per server
    rl_allocation = []
    for user_id in range(U):
        server_ngb_list = np.where(ngb[user_id, :N] == 1)[0]  # get the list of servers to which the user is connected
        if len(server_ngb_list) == 0:  # ignore the users which are not under any server
            continue
        # Compute the distance from the user to each covering server and sort
        dist_list = np.array([ server_ngb_list, [server.iloc[i]['geometry'].centroid.distance(user.iloc[user_id]['geometry']) for i in server_ngb_list] ])
        # sorted list of servers based on the distance from the user
        sorted_distance_list = dist_list[ :, dist_list[1].argsort()]
        # get the list of servers arranged from least to max distance
        server_list = sorted_distance_list[0].astype(int)
        # Allocation
        lat = 0
        for server_id in server_list:
            lat = gamma[server_id][int(service[user_id])]  # service latency for the requested service on this server
            if server_capacity[server_id]+lat <= latency_threshold-network_latency[server_id]:
                server_capacity[server_id] += lat  # increment the server_capacity of server
                rl_allocation.append( (user_id, server_id) )  # (user, server) alloc pair
                break  # a user is allocated to at most one server
    print('Greedy-Ngb Num of allocation: {}'.format(len(rl_allocation)))
    return rl_allocation
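The greedy rule (nearest covering server first, accept if the accumulated latency still fits) can be exercised on toy data without geopandas. In the sketch below, `greedy_nearest`, the distance dictionaries and the `budget` values (threshold minus network latency per server) are hypothetical.

```python
def greedy_nearest(distances, service, gamma, budget):
    """distances[u] maps server_id -> distance for the servers covering user u."""
    used = [0.0] * len(gamma)      # latency already committed on each server
    allocation = []
    for u, covering in enumerate(distances):
        # try covering servers from nearest to farthest
        for s in sorted(covering, key=covering.get):
            lat = gamma[s][service[u]]
            if used[s] + lat <= budget[s]:
                used[s] += lat
                allocation.append((u, s))
                break              # one server per user
    return allocation

distances = [{1: 10.0, 0: 40.0}, {0: 5.0}, {1: 20.0}, {}]
service = [0, 0, 1, 0]
gamma = [(3.0, 6.0), (2.0, 4.0)]   # per-server (time_s1, time_s2)
budget = [6.5, 5.0]
print(greedy_nearest(distances, service, gamma, budget))  # [(0, 1), (1, 0)]
```

User 0 takes its nearest server (server 1), which leaves too little budget there for user 2. This shows how a distance-first rule can allocate fewer users than the instance allows.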
Next, generate_server_state(num_server) builds the server states (server_state) and computes the gamma values of each server; the experiment loop then runs the three algorithms:
import timeit

if alloc_type == 'server':  # servers fixed, number of users varies
    for U in range(100, 600, 100):  # 100 to 500 users
        for epoch in range(50):
            print("User:", U, 'Server:', N, 'Epoch:', epoch)
            # generate servers and users from the EUA data and build the neighbourhood matrix
            ngb, user, server, service, server_service, network_latency = ngb_matrix(U, N, S)
            server_state, gamma = generate_server_state(N)  # assign a state and gamma values to each server
            #=======ILP starts
            start = timeit.default_timer()
            ilp_aloc = ilp_algo()  # call ILP algorithm
            stop = timeit.default_timer()
            execution_time_ilp = stop - start
            #========ILP ends
            #=======Greedy starts
            start = timeit.default_timer()
            greedy_aloc = greedy_algo()  # call greedy algorithm
            stop = timeit.default_timer()
            execution_time_greedy = stop - start
            #========Greedy ends
            #=======RL_linear starts
            start = timeit.default_timer()
            rl_aloc = rl_algo()  # call RL algorithm
            stop = timeit.default_timer()
            execution_time_rl = stop - start
            #========RL_linear ends
            #========Store results to file
            to_append = [U, N,
                         len(ilp_aloc), execution_time_ilp,
                         len(greedy_aloc), execution_time_greedy,
                         len(rl_aloc), execution_time_rl]
            dseries = pd.Series(to_append, index = result_user.columns)
            # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
            result_user = pd.concat([result_user, dseries.to_frame().T], ignore_index=True)
        print("epoch:", epoch)
    result_user.to_csv(result_file, index=False)
1. Generating the server states and computing each server's gamma values
# Relies on globals: filename_base, server_service
def generate_server_state(num_server):
    df = pd.read_csv(filename_base)
    # Scale and round the GPU workload values
    # df['ram'] = df['ram'].div(1000).round(0).astype(int)
    # df['workload_cpu'] = df['workload_cpu'].div(10).round(0).astype(int)
    df['workload_gpu'] = df['workload_gpu'].multiply(1/80).round(0).astype(int)  # round gpu workload
    # df['users_yolo'] = df['users_yolo'].div(100).round(0).astype(int)
    # df['users_mnet'] = df['users_mnet'].div(100).round(0).astype(int)
    # get the unique RAM, core count and CPU workload values in the dataset
    ram = df.ram.unique()
    cores = df.cores.unique()
    workload_cpu = df.workload_cpu.unique()
    server_state = []  # server states
    gamma = []
    for s_id in range(num_server):
        # for each server, randomly pick a RAM, core count and CPU workload value
        gram = np.random.choice(ram, 1)[0]
        gcores = np.random.choice(cores, 1)[0]
        gwl_c = np.random.choice(workload_cpu, 1)[0]
        # fetch the rows for this state
        fetch_state = df.loc[ (df['ram'] == gram) & (df['cores']== gcores) & (df['workload_cpu']==gwl_c) ]
        # randomly pick a GPU workload value among the matching rows
        gwl_g = fetch_state.sample().iloc[0]['workload_gpu']
        # rows matching the selected GPU workload (mask built on fetch_state itself)
        fetch_time = fetch_state.loc[ fetch_state['workload_gpu'] == gwl_g ]
        # mean YOLO and MNet times for this particular state
        time_yolo = fetch_time['time_yolo'].mean()
        time_mnet = fetch_time['time_mnet'].mean()
        # build the state of each server from the service requests it receives
        gs1 = server_service[s_id][0]
        gs2 = server_service[s_id][1]
        server_state.append( [gram, gcores, gwl_c, gwl_g, gs1, gs2] )
        # append the gamma value of each server
        gamma.append((time_yolo, time_mnet))
    return server_state, gamma
#================neighbourhood Computing
import random

def ngb_matrix(U, N, S):
    # Build the user-server neighbourhood matrix and compute the network latency
    # U: number of users
    # N: number of servers
    # S: number of services
    # U X N matrix
    user = load_users(U)
    server = load_servers(N)
    neighbourhood = np.zeros([U, N])  # neighbourhood matrix between users and servers
    network_latency = np.zeros(N)  # network latency of each server
    latency_data = load_planetlab()  # load the PlanetLab data as a matrix indexed by distance (bin size 150)
    # Check whether each user lies inside a server's buffer and compute the network latency
    for u in range(0, U):
        for n in range(0, N):
            # is the user inside the server's buffer? (geometric contains test)
            if server.iloc[n].geometry.contains(user.iloc[u].geometry):
                neighbourhood[u,n]=1  # mark the pair in the neighbourhood matrix
                # compute the distance and assign a latency
                distance = server.iloc[n].geometry.centroid.distance(user.iloc[u].geometry)
                rep_lat = fetch_network_lat(int(distance), latency_data)  # look up a network latency for this distance
                if network_latency[n] < rep_lat:  # keep the largest latency seen for this server
                    network_latency[n] = rep_lat
    service = np.zeros(U)
    for u in range(0, U):  # give each user a random service request in 0 .. S-1
        service[u] = random.randrange(0, S, 1)
    server_service = np.zeros((N, S))  # number of requests per (server, service)
    for n in range(0, N):
        for u in range(0, U):
            if neighbourhood[u][n] == 1:
                server_service[n][int(service[u])] += 1
    return neighbourhood, user, server, service, server_service, network_latency
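The coverage test and the per-server request counts can be illustrated with plain NumPy. The sketch below replaces the geopandas buffer/contains test with an exact Euclidean distance check, which is an approximation of what ngb_matrix does; the function names and coordinates are made up.

```python
import numpy as np

def coverage_matrix(user_xy, server_xy, radius):
    """Return the U x N 0/1 coverage matrix and the U x N distance matrix."""
    diff = user_xy[:, None, :] - server_xy[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return (dist <= radius).astype(int), dist

def requests_per_server(ngb, service, num_services):
    """Count, for each server, how many covered users request each service."""
    counts = np.zeros((ngb.shape[1], num_services))
    for u, n in zip(*np.nonzero(ngb)):
        counts[n, int(service[u])] += 1
    return counts

users = np.array([[0.0, 0.0], [100.0, 0.0], [400.0, 0.0]])   # projected metres
servers = np.array([[0.0, 0.0], [200.0, 0.0]])
ngb, dist = coverage_matrix(users, servers, radius=150)
print(ngb.tolist())                                          # [[1, 0], [1, 1], [0, 0]]
print(requests_per_server(ngb, [0, 1, 0], 2).tolist())       # [[1.0, 1.0], [0.0, 1.0]]
```

A user covered by several servers (user 1 here) is counted in the request totals of each of them, exactly as in server_service.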
#================Load Planet Lab data
# Load the PlanetLab data and convert it to a matrix
def load_planetlab():
    # Convert to triangle
    ldata = np.loadtxt('eua/PlanetLabData_1')[np.tril_indices(490)]
    ldata = ldata[ ldata != 0]  # keep the non-zero values of the lower triangle
    ldata = np.unique(ldata)  # deduplicate (and sort), then resize to fit a 150-row matrix
    length = ldata.shape[0]
    latency_row = 150
    latency_col = (length//latency_row)  # Global Data used
    ldata = np.resize(ldata, latency_col*latency_row)
    latency = ldata.reshape(latency_row,-1)
    return latency
#=================Fetch Network latency
# Look up a network latency for a given distance
def fetch_network_lat(distance, latency_data):
    rep_lat = np.random.choice(latency_data[distance])  # pick one latency value at random from the row for this distance
    return rep_lat/1000  # convert the latency to seconds
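The effect of the unique/resize/reshape sequence is easier to see on a small matrix. The sketch below applies the same steps to a made-up 4 x 4 symmetric latency matrix with 3 rows instead of 150; `to_latency_bins` is an illustrative name.

```python
import numpy as np

def to_latency_bins(matrix, rows):
    """Lower triangle -> non-zero -> sorted unique -> (rows x cols) matrix."""
    values = matrix[np.tril_indices(matrix.shape[0])]
    values = values[values != 0]
    values = np.unique(values)             # np.unique also sorts ascending
    cols = values.shape[0] // rows
    values = np.resize(values, cols * rows)  # drops any leftover values
    return values.reshape(rows, -1)

m = np.array([[0, 5, 9, 2],
              [5, 0, 7, 3],
              [9, 7, 0, 8],
              [2, 3, 8, 0]])
bins = to_latency_bins(m, rows=3)
print(bins.tolist())  # [[2, 3], [5, 7], [8, 9]]
```

Because the values are sorted before reshaping, row d holds the d-th band of latencies, so indexing the result by integer distance yields larger latencies for more distant users.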
#===============User Data
import geopandas

# Load the user data and convert it to a geographic data format
def load_users(num_of_users):
    user_raw = pd.read_csv("eua/users.csv")
    user_raw = user_raw.rename_axis("UID")  # name the index axis "UID", the unique user identifier
    df = user_raw.sample(num_of_users)  # randomly sample the requested number of users
    # Create a GeoDataFrame with point geometries built from the Longitude and Latitude columns (WGS84)
    gdf = geopandas.GeoDataFrame(df, geometry = geopandas.points_from_xy(df.Longitude, df.Latitude), crs = 'epsg:4326')
    user = gdf [['geometry']]  # keep the geometry column
    user = user.to_crs(epsg=28355)  # reproject to EPSG:28355 (GDA94 / MGA zone 55, metres)
    # Insert additional data to dataframe
    # user = user.apply(add_data, axis=1)
    return user
#================Server Data
def load_servers(num_of_servers):
    # Load the server data and convert it to a geographic data format
    server_raw = pd.read_csv("eua/servers.csv")
    server_raw = server_raw.rename_axis("SID")  # name the index axis "SID", the unique server identifier
    df = server_raw.sample(num_of_servers)  # randomly sample the requested number of servers
    gdf = geopandas.GeoDataFrame(df, geometry = geopandas.points_from_xy(df.LONGITUDE, df.LATITUDE), crs = 'epsg:4326')
    server = gdf [['geometry']]  # keep the geometry column
    server = server.to_crs(epsg=28355)  # convert to the Australian CRS EPSG:28355

    def add_radius(series):
        # radius = random.randrange(150, 250, 10)
        # Give every server a buffer with a fixed radius of 150
        radius = 150
        series.geometry = series.geometry.buffer(radius)
        series['radius'] = radius
        # series['resource'] = tcomp
        return series

    server = server.apply(add_radius, axis = 1)
    return server
import matplotlib.pyplot as plt

def plot_data(user, server):
    # The two lines below are IPython magics and only work inside a Jupyter notebook
    %config InlineBackend.figure_format='retina'
    %matplotlib inline
    cbd = geopandas.read_file('eua/maps', crs = {'init': 'epsg=28355'} )  # read cbd-australia location data
    fig, ax = plt.subplots(1, 1, figsize=(15,10))
    ax.set_xlim(319400, 322100)
    ax.set_ylim(5811900, 5813700)
    user.plot(ax=ax, marker='o', color='red', markersize=20, zorder=3, label="users")
    server.plot(ax =ax, linestyle='dashed', edgecolor='green', linewidth=1, facecolor="none", zorder=1)
    server.centroid.plot(ax=ax, marker='s', color='blue', markersize=50, zorder=2, label="server")
    cbd.plot(ax=ax, color='grey', zorder=0, alpha = 0.3);
    ax.set_title("MEC Environment(EUA): CBD Melbourne(Australia)")
    ax.legend(bbox_to_anchor=(1, 0), loc='lower left')
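1. RL algorithm with different numbers of training episodes:
The two agents compared are an RL agent trained for 30,000 episodes and one trained for 150,000 episodes; the number of allocations each produces is reported.
Quantization size of the action space = 2
model_und = DQN.load("trained_agents/edge_agent_under_train")
model_prop = DQN.load("trained_agents/edge_agent_proper_train")
# Load model (excerpt of rl_algo_prop; only the prediction and decoding lines are shown)
def rl_algo_prop():
    action = model_prop.predict(np.array(state), deterministic=True)
    print('Actionprop: {}'.format(action))
    u1 = (action[0]//10)*2 + 1
    u2 = (action[0]%10)*2 + 1
    server_capacity[server_id][0] = u1  # model output
    server_capacity[server_id][1] = u2  # model output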
2. RL algorithm with different quantization factors:
rl_algo_act() (the code states a value of 5):
action = model_act.predict(np.array(state), deterministic=True)
print('Actionact: {}'.format(action))
u1 = (action[0]//5)*4 + 1  # 25 action space
u2 = (action[0]%5)*4 + 1
server_capacity[server_id][0] = u1  # model output
server_capacity[server_id][1] = u2  # model output
# rl_algo_thres10 (excerpt)
action = model_thres10.predict(np.array(state), deterministic=True)
print('Actionthres10: {}'.format(action))
u1 = action[0]//5 + 1
u2 = (action[0]+1) - (u1-1)*5
server_capacity[server_id][0] = u1*100  # model output
server_capacity[server_id][1] = u2*100  # model output
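My personal understanding of the action-space quantization size = 2 in rl_algo_prop (the variant trained for different numbers of episodes); this is not necessarily correct:
After the agent predicts an action, the number of service requests on the two services s1 and s2 (the action) is recovered from it, and the variants recover it in different ways. The mapping for a size of 2 is shown below, printed per action:
for action in range(25):  # rl_algo_prop()
    u1 = (action//10)*2 + 1
    u2 = (action%10)*2 + 1
    print(f"Action: {action}, u1: {u1}, u2: {u2}")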
Action: 0, u1: 1, u2: 1
Action: 1, u1: 1, u2: 3
Action: 2, u1: 1, u2: 5
Action: 3, u1: 1, u2: 7
Action: 4, u1: 1, u2: 9
Action: 5, u1: 1, u2: 11
Action: 6, u1: 1, u2: 13
Action: 7, u1: 1, u2: 15
Action: 8, u1: 1, u2: 17
Action: 9, u1: 1, u2: 19
Action: 10, u1: 3, u2: 1
Action: 11, u1: 3, u2: 3
Action: 12, u1: 3, u2: 5
Action: 13, u1: 3, u2: 7
Action: 14, u1: 3, u2: 9
Action: 15, u1: 3, u2: 11
Action: 16, u1: 3, u2: 13
Action: 17, u1: 3, u2: 15
Action: 18, u1: 3, u2: 17
Action: 19, u1: 3, u2: 19
Action: 20, u1: 5, u2: 1
Action: 21, u1: 5, u2: 3
Action: 22, u1: 5, u2: 5
Action: 23, u1: 5, u2: 7
Action: 24, u1: 5, u2: 9
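The paper mentions that quantization with a given size is used to reduce the cardinality of the action space: with a size of 10, the new action tuple (2, 2) represents all actions in the range (11-20, 11-20) of the old action space.
So for now I take a size of 2 to mean that one action in the quantized action space stands for two actions in the original one. That is, u1: 1 in the first action is chosen to represent the original u1: 1 and u1: 2, and likewise u2: 1 represents u2: 1 and u2: 2. The next action then starts from 3.
for action in range(25):  # rl_algo_act()
    u1 = (action//5)*4 + 1
    u2 = (action%5)*4 + 1
    print(f"Action: {action}, u1: {u1}, u2: {u2}")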
Action: 0, u1: 1, u2: 1
Action: 1, u1: 1, u2: 5
Action: 2, u1: 1, u2: 9
Action: 3, u1: 1, u2: 13
Action: 4, u1: 1, u2: 17
Action: 5, u1: 5, u2: 1
Action: 6, u1: 5, u2: 5
Action: 7, u1: 5, u2: 9
Action: 8, u1: 5, u2: 13
Action: 9, u1: 5, u2: 17
Action: 10, u1: 9, u2: 1
Action: 11, u1: 9, u2: 5
Action: 12, u1: 9, u2: 9
Action: 13, u1: 9, u2: 13
Action: 14, u1: 9, u2: 17
Action: 15, u1: 13, u2: 1
Action: 16, u1: 13, u2: 5
Action: 17, u1: 13, u2: 9
Action: 18, u1: 13, u2: 13
Action: 19, u1: 13, u2: 17
Action: 20, u1: 17, u2: 1
Action: 21, u1: 17, u2: 5
Action: 22, u1: 17, u2: 9
Action: 23, u1: 17, u2: 13
Action: 24, u1: 17, u2: 17
for action in range(25):  # rl_algo_thres10
    u1 = action//5 + 1
    u2 = (action+1) - (u1-1)*5
    print(f"Action: {action+1}, u1: {u1}, u2: {u2}")
Action: 1, u1: 1, u2: 1
Action: 2, u1: 1, u2: 2
Action: 3, u1: 1, u2: 3
Action: 4, u1: 1, u2: 4
Action: 5, u1: 1, u2: 5
Action: 6, u1: 2, u2: 1
Action: 7, u1: 2, u2: 2
Action: 8, u1: 2, u2: 3
Action: 9, u1: 2, u2: 4
Action: 10, u1: 2, u2: 5
Action: 11, u1: 3, u2: 1
Action: 12, u1: 3, u2: 2
Action: 13, u1: 3, u2: 3
Action: 14, u1: 3, u2: 4
Action: 15, u1: 3, u2: 5
Action: 16, u1: 4, u2: 1
Action: 17, u1: 4, u2: 2
Action: 18, u1: 4, u2: 3
Action: 19, u1: 4, u2: 4
Action: 20, u1: 4, u2: 5
Action: 21, u1: 5, u2: 1
Action: 22, u1: 5, u2: 2
Action: 23, u1: 5, u2: 3
Action: 24, u1: 5, u2: 4
Action: 25, u1: 5, u2: 5
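The three mappings printed above differ only in how many levels each service has and in the step between levels. The sketch below writes them as one parameterized function; `decode_quantized` and the parameter names `per_axis` and `step` are my own and do not appear in the original code. The reading of `step` as the quantization size follows the personal interpretation given in these notes, which may not match the paper.

```python
def decode_quantized(action, per_axis, step):
    """Map a flat action index to (u1, u2) on a per_axis x per_axis grid.

    Each level is the first value of a bucket of `step` consecutive counts.
    """
    u1 = (action // per_axis) * step + 1
    u2 = (action % per_axis) * step + 1
    return u1, u2

# rl_algo_prop: 10 levels per service, step 2
print(decode_quantized(24, per_axis=10, step=2))  # (5, 9)
# rl_algo_act: 5 levels per service, step 4
print(decode_quantized(24, per_axis=5, step=4))   # (17, 17)
# rl_algo_thres10: 5 levels per service, step 1 (the code then multiplies by 100)
print(decode_quantized(7, per_axis=5, step=1))    # (2, 3)
```

Note that the rl_algo_thres10 listing prints `action+1`, so its row "Action: 8" corresponds to index 7 here.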
B. Experimental Results