
Acta Armamentarii ›› 2023, Vol. 44 ›› Issue (S2): 101-113. doi: 10.12382/bgxb.2023.0881

Special Topic: Swarm Coordination and Autonomous Technology



Multi-agent Coverage Path Planning Based on Safe Reinforcement Learning

LI Song1, MA Zhuangzhuang1, ZHANG Yunlin1, SHAO Jinliang1,2,3,*

  1 School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu 611731, Sichuan, China
    2 Research Center on Crowd Spectrum Intelligence, Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen 518054, Guangdong, China
    3 Laboratory of Electromagnetic Space Cognition and Intelligent Control, Beijing 100089, China
  • Received: 2023-09-06  Online: 2024-01-10
  • Supported by: National Natural Science Foundation of China (62273077); Shenzhen Institute of Artificial Intelligence and Robotics for Society Project (AC01202201002)


Abstract:

The purpose of coverage path planning is to find a safe trajectory for each agent, one that not only covers the task area effectively but also avoids obstacles and neighboring agents. Large, complex task areas are unavoidable in coverage missions, so it is worth exploring how to strengthen cooperation among agents while guaranteeing their safety, and thereby remedy the low task efficiency and limited capability of the swarm. To this end, a discrete mathematical model of coverage path planning is established on a grid map, and a safe multi-agent reinforcement learning algorithm based on the value decomposition network is proposed, with its soundness demonstrated theoretically. By decomposing the group value function, the algorithm prevents agents from receiving spurious rewards, which strengthens the learning of cooperative coverage strategies and accelerates convergence. A shield is introduced during training to correct unsafe behaviors such as moving out of bounds and colliding, guaranteeing the safety of the agents throughout the entire task. Simulation and semi-physical experimental results show that the proposed algorithm not only ensures coverage efficiency but also effectively maintains the safety of the agents.
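To make the two mechanisms in the abstract concrete, the following minimal sketch (in PyTorch; it is not the authors' code, and the network shapes, action set, and unsafe-action mask are illustrative assumptions) shows a value-decomposition-style joint value Q_tot(s, a) = Σ_i Q_i(o_i, a_i) combined with a shield that masks out-of-bounds and colliding actions before the greedy choice:

    import torch
    import torch.nn as nn

    N_AGENTS, OBS_DIM, N_ACTIONS = 3, 16, 5   # assumed action set: stay/up/down/left/right

    class AgentQNet(nn.Module):
        """Per-agent utility network Q_i(o_i, a_i)."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

        def forward(self, obs):                 # obs: (batch, OBS_DIM)
            return self.net(obs)                # Q_i for every action

    def joint_q(per_agent_q, actions):
        """VDN decomposition: Q_tot = sum_i Q_i(o_i, a_i)."""
        chosen = [q.gather(1, a.unsqueeze(1)) for q, a in zip(per_agent_q, actions)]
        return torch.stack(chosen).sum(dim=0)   # (batch, 1)

    def shield(q_values, unsafe_mask):
        """Override unsafe choices: actions that would leave the grid or
        cause a collision are masked out, then the agent acts greedily
        on the remaining safe actions."""
        safe_q = q_values.masked_fill(unsafe_mask, float('-inf'))
        return safe_q.argmax(dim=1)

    # Hypothetical usage: three agents, each with the same unsafe-action mask.
    nets = [AgentQNet() for _ in range(N_AGENTS)]
    obs = [torch.randn(1, OBS_DIM) for _ in range(N_AGENTS)]
    qs = [net(o) for net, o in zip(nets, obs)]
    unsafe = torch.tensor([[False, True, False, False, True]])  # e.g. up/right blocked
    actions = [shield(q, unsafe) for q in qs]
    q_tot = joint_q(qs, actions)

Summing per-agent utilities credits each agent only for its own contribution, which is how the decomposition avoids the spurious rewards mentioned above; the shield acts at action-selection time, so safety holds during training as well as execution.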

Key words: multi-agent system, coverage path planning, safe reinforcement learning, value decomposition network
