A Task Scheduling Method for Weather Routing Based on Preference Reinforcement Learning
Affiliation:

1. School of Artificial Intelligence and Computer Science, North China University of Technology; 2. North China University of Technology

    Abstract:

    Weather routing services face intensive computational demands under sudden sea conditions such as typhoon changes. This study addresses the strict time-window constraints and multi-dimensional heterogeneous resource contention that arise in cloud-edge collaborative scheduling, seeking a dynamic trade-off among server-side task timeliness, server node energy consumption, and load balancing. The event-driven scheduling process is modeled as a semi-Markov decision process (SMDP), and a preference-based multi-objective hierarchical reinforcement learning framework is proposed. First, to tackle the large and sparse joint action space, a Preference-Based Multi-Objective Hierarchical Reinforcement Learning (PBMO-HRL) algorithm for weather routing task scheduling decouples each decision into a two-layer policy: "task selection" and "node allocation". Second, action masking is combined with explicit expectation computation, giving an expectation-form policy gradient estimate that reduces variance in the discrete action space. Third, because event-driven scheduling makes the interval between decisions variable, a time-aware discount factor and an idle-state transition merging mechanism are introduced to cut redundant decision steps and mitigate value overestimation and estimation bias. Finally, a dynamic preference manager operating in logit space smoothly adjusts preferences according to online congestion and energy consumption metrics. In a discrete-event simulation environment, a multi-preference evaluation protocol based on a simplex grid shows that the framework converges on Pareto metrics, including hypervolume (HV) and expected utility metric (EUM). With service level agreement (SLA) compliance judged by whether a task's deadline window is met, PBMO-HRL reduces energy consumption by approximately 16.6% and 5.6% relative to the Earliest Deadline First (EDF) and energy-greedy (Energy) baselines, respectively, and lowers SLA violation rates by about 20.5% and 28.8%. The proposed framework handles bursty workloads effectively, safeguarding the SLA of core weather routing tasks while achieving adaptive multi-objective optimization in non-stationary environments.
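
    The abstract names action masking, explicit expectation computation, and a time-aware discount without giving formulas. The sketch below illustrates the generic form of each idea and is not the authors' implementation. The function names, the choice of gamma raised to the elapsed time as the discount, and the example numbers are all assumptions made for illustration; the reward is assumed to be already accumulated over the decision interval.

```python
import math


def masked_softmax(logits, mask):
    """Softmax over the actions where mask[i] is True; masked actions get probability 0."""
    valid = [l for l, m in zip(logits, mask) if m]
    if not valid:
        raise ValueError("at least one action must be valid")
    peak = max(valid)  # subtract the max valid logit for numerical stability
    exps = [math.exp(l - peak) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def expected_value(probs, q_values):
    """Exact expectation sum_a pi(a) * Q(a), used in place of a single sampled action."""
    return sum(p * q for p, q in zip(probs, q_values))


def smdp_target(reward, next_value, elapsed, gamma=0.99):
    """One-step SMDP target: the discount is gamma raised to the elapsed time (assumed form)."""
    return reward + (gamma ** elapsed) * next_value


# Hypothetical example: three candidate nodes, the second one is infeasible.
probs = masked_softmax([1.0, 5.0, 1.0], [True, False, True])
value = expected_value(probs, [2.0, 100.0, 4.0])
target = smdp_target(reward=1.0, next_value=10.0, elapsed=2.0, gamma=0.5)
```

    Because the infeasible node receives probability zero, its large value estimate does not leak into the expectation, and summing over all valid actions avoids the sampling noise of a single-action estimate. With a fixed step length of 1 the target reduces to the ordinary discounted one-step target.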

History
  • Received: 2026-02-28
  • Last revised: 2026-04-22
  • Accepted: 2026-04-23