
Acta Armamentarii (兵工学报) ›› 2025, Vol. 46 ›› Issue (8): 240978. doi: 10.12382/bgxb.2024.0978



Hierarchical Decision-making for UAV Air Combat Based on DDQN-D3PG

WANG Yu1,*, LI Yuanpeng1, GUO Zhongyu1, LI Shuo1, REN Tianjun2

  1. School of Automation, Shenyang Aerospace University, Shenyang 110136, Liaoning, China
    2. Xi'an Kewei Industrial Development Co., Ltd., Xi'an 710000, Shaanxi, China
  • Received: 2024-10-21  Online: 2025-08-28
  • Supported by: National Natural Science Foundation of China (61906125, 62373261); Fundamental Research Funds for Universities of Liaoning Province (LJ232410143020, LJ212410143047)


Abstract:

The application of reinforcement learning to unmanned aerial vehicle (UAV) air combat faces two challenges: rigid reward functions, and the difficulty a single model has in handling complex tasks in high-dimensional continuous state spaces. These limitations severely restrict the generalization capability of decision-making in dynamic, rapidly changing situations. To address these issues, an autonomous decision-making framework integrating the double deep Q-network (DDQN) and deep deterministic policy gradient (DDPG) algorithms is proposed, combining the strengths of hierarchical and distributed architectures. Based on the advantage differences between the opposing sides in various situations, a series of DDPG models with different reward-function weight combinations is designed to form a bottom-level distributed deep deterministic policy gradient (D3PG) decision-making network. The DDQN algorithm, which excels at handling discrete action spaces, is introduced to build a top-level decision-making network that autonomously selects and switches to the most suitable bottom-level policy model according to real-time situational changes, enabling immediate adjustment and optimization of decisions. To further enhance the realism and difficulty of the close-range air combat environment between the red and blue UAVs, a self-play mechanism is introduced into DDPG training to construct a highly intelligent enemy decision-making model. Experimental results show that the UAV equipped with the proposed algorithm achieves a win rate of up to 96% against intelligent opponents, more than 20% higher than that of baseline algorithms such as D3PG, and it consistently defeats the opponent under various initial situations, fully verifying the effectiveness and superiority of the proposed method.
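The hierarchical loop described in the abstract can be illustrated with a short sketch: a top-level DDQN scores the available bottom-level policies from the current situation, and the selected DDPG actor then produces the continuous maneuver command. This is not the authors' implementation; the PyTorch networks, the state and action dimensions, and the three-policy ensemble (e.g., reward variants weighted toward attack, evasion, or pursuit) are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code) of the DDQN-over-D3PG
# hierarchy: a discrete top-level selector over an ensemble of continuous
# bottom-level DDPG actors, each trained with a different reward weighting.
import torch
import torch.nn as nn


class QNet(nn.Module):
    """Top-level DDQN: maps the situation state to one score per DDPG policy."""
    def __init__(self, state_dim: int, n_policies: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, n_policies),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)


class Actor(nn.Module):
    """Bottom-level DDPG actor: maps the state to a continuous control action."""
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim), nn.Tanh(),  # commands scaled to [-1, 1]
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)


STATE_DIM, ACTION_DIM, N_POLICIES = 12, 3, 3  # assumed dimensions

top_level = QNet(STATE_DIM, N_POLICIES)
# One DDPG actor per reward-weight combination in the D3PG ensemble.
d3pg_ensemble = [Actor(STATE_DIM, ACTION_DIM) for _ in range(N_POLICIES)]


def act(state: torch.Tensor) -> torch.Tensor:
    """One hierarchical decision step: DDQN picks a policy, that policy acts."""
    with torch.no_grad():
        k = top_level(state).argmax(dim=-1).item()  # discrete policy choice
        return d3pg_ensemble[k](state)              # continuous maneuver command


if __name__ == "__main__":
    print(act(torch.randn(STATE_DIM)))
```

In training, the top-level network would be updated with the double-Q target (the online network selects the greedy policy index, the target network evaluates it), while each DDPG actor is trained separately under its own reward weighting, with self-play supplying the opposing model, as the abstract describes.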

Key words: UAV air combat, reinforcement learning, hierarchical decision-making, double deep Q-network, distributed deep deterministic policy gradient

CLC Number: