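Dynamic Decision-making Method of Unmanned Platform Chaff Jamming for Terminal Defense Based on Multi-agent Deep Reinforcement Learning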

doi:10.12382/bgxb.2024.0251

Abstract

Chaff centroid jamming by unmanned platforms is an important means of terminal defense against missiles, and the ability to make intelligent decisions about platform maneuvering and chaff launching largely determines whether strategic assets can be protected. Existing decision-making methods, such as computational analysis based on mechanism models and search-space exploration based on heuristic algorithms, suffer from a low degree of intelligence, poor adaptability and slow decision-making. To address these problems, a dynamic decision-making method of chaff jamming for terminal defense based on multi-agent deep reinforcement learning is proposed. First, the problem of cooperative multi-platform chaff jamming for terminal defense is defined and a simulation environment is constructed, comprising a missile guidance and fuze model, an unmanned jamming platform maneuvering model, a chaff diffusion model and a centroid jamming model. Second, the centroid jamming decision problem is formulated as a Markov decision problem: a decision-making agent is constructed, the state and action spaces are defined, and a reward function is set. Third, the decision-making agent is trained with the multi-agent proximal policy optimization (MAPPO) algorithm. Simulation results show that, compared with the multi-agent deep deterministic policy gradient (MADDPG) algorithm, the proposed method reduces the training time by 85.5% and increases the asset protection success rate by 3.84 times; compared with the genetic algorithm (GA), it reduces the decision time by 99.96% and increases the asset protection success rate by 1.12 times.

Key words: unmanned platform, centroid jamming, chaff jamming, terminal defense, multi-agent reinforcement learning, electronic countermeasure

LI Chuanhao, MING Zhenjun, WANG Guoxin, YAN Yan, DING Wei, WAN Silai, DING Tao. Dynamic Decision-making Method of Unmanned Platform Chaff Jamming for Terminal Defense Based on Multi-agent Deep Reinforcement Learning[J]. Acta Armamentarii, 2025, 46(3): 240251-.

Figures/Tables 18

Fig.1 Schematic diagram of multi-platform chaff terminal defense

Fig.2 Composition of the multi-platform chaff terminal defense model

Fig.3 Relative motion of missile and target

Fig.4 Operation flow of the chaff jamming terminal defense agent

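Table 1 Representation of jamming decision state space

State type	State name	Symbol	Dimension	Value range
Threat information	Missile position/m	$o_{i,t}^{mp}$	$N_{mis}\times 2$	$[X_{\min},X_{\max}]\times[Y_{\min},Y_{\max}]$
Threat information	Missile heading/rad	$o_{i,t}^{ma}$	$N_{mis}$	$[-\pi,\pi)$
Threat information	Missile speed/(m·s⁻¹)	$o_{i,t}^{mv}$	$N_{mis}$	$[0,v_{\max}^{mis}]$
Threat information	Missile tracking duration/s	$o_{i,t}^{mt}$	$N_{mis}$	$[0,T_{\max})$
Munition information	Chaff state matrix/m	$o_{t}^{cs}$	$N_{\max}^{chaff}\times 3$	$[X_{\min},X_{\max}]\times[Y_{\min},Y_{\max}]\times[0,k]$
Munition information	Remaining chaff rounds	$o_{t}^{cl}$	$N_{plat}$	$\{0,1,\dots,N_{\max}^{chaff}\}$
Unmanned jamming platform information	Position/m	$o_{t}^{cp}$	2	$[X_{\min},X_{\max}]\times[Y_{\min},Y_{\max}]$
Unmanned jamming platform information	Heading/rad	$o_{t}^{ca}$	1	$[-\pi,\pi)$
Unmanned jamming platform information	Speed/(m·s⁻¹)	$o_{t}^{cv}$	1	$[0,v_{\max}^{car}]$
Task information	Asset position/m	$o_{t}^{ap}$	$N_{mis}\times 2$	$[X_{\min},X_{\max}]\times[Y_{\min},Y_{\max}]$
Task information	Asset threat distance/m	$o_{t}^{al}$	$N_{mis}$	$[0,\sqrt{(X_{\max}-X_{\min})^{2}+(Y_{\max}-Y_{\min})^{2}}]$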
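
As a quick check on Table 1, the per-entry dimensions can be summed to get the length of one agent's observation if all entries are concatenated into a flat vector. The sketch below is only an illustration: the function name `observation_dim` is hypothetical, flat concatenation is an assumption, and the example values $N_{mis}=3$ and $N_{plat}=3$ are guesses based on the three headings per test environment in Table 5 ($N_{\max}^{chaff}=5$ is taken from Table 3).

```python
def observation_dim(n_mis: int, n_max_chaff: int, n_plat: int) -> int:
    """Sum the 'Dimension' column of Table 1 (assumes flat concatenation)."""
    # Threat information: position (N_mis x 2), heading, speed, tracking duration
    threat = n_mis * 2 + n_mis + n_mis + n_mis
    # Munition information: chaff state matrix (N_max_chaff x 3), remaining rounds (N_plat)
    munition = n_max_chaff * 3 + n_plat
    # Own platform: position (2), heading (1), speed (1)
    platform = 2 + 1 + 1
    # Task information: asset position (N_mis x 2), asset threat distance (N_mis)
    task = n_mis * 2 + n_mis
    return threat + munition + platform + task


# Assumed example sizes: 3 missiles, 5 chaff rounds, 3 platforms
example_dim = observation_dim(n_mis=3, n_max_chaff=5, n_plat=3)
print(example_dim)  # 46
```

Under these assumed sizes the flat observation has 46 entries; other values of $N_{mis}$, $N_{\max}^{chaff}$ and $N_{plat}$ change the total accordingly.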

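Table 2 Representation of jamming decision action space

Action type	Action name	Symbol	Value range
Munition action	Whether to launch chaff	$a_{i,t}^{ic}$	$\{0,1\}$
Munition action	Chaff launch direction/rad	$a_{i,t}^{ca}$	$[-\pi,\pi)$
Munition action	Chaff launch distance/m	$a_{i,t}^{cd}$	$[0,L_{\max}^{chaff}]$
Unmanned jamming platform information	Platform movement direction/rad	$a_{i,t}^{pd}$	$[-\pi,\pi)$
Unmanned jamming platform information	Platform movement speed/(m·s⁻¹)	$a_{i,t}^{ps}$	$[0,v_{\max}^{car}]$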
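
One common way to realize a mixed action space like Table 2 is to let the policy emit five numbers in $[-1,1]$ and rescale them to the listed ranges. The sketch below assumes that convention; it is not taken from the paper. The names `Action` and `decode_action`, the zero threshold for the launch flag, and the use of the Table 3 values (10 m launch distance, 20 m·s⁻¹ platform speed, taken here to equal $v_{\max}^{car}$) as bounds are all assumptions.

```python
import math
from dataclasses import dataclass

L_MAX_CHAFF = 10.0  # maximum chaff launch distance/m (Table 3)
V_MAX_PLAT = 20.0   # maximum platform speed/(m/s) (Table 3), assumed equal to v_max^car


@dataclass
class Action:
    launch: int        # 0 or 1
    chaff_dir: float   # rad, in [-pi, pi)
    chaff_dist: float  # m, in [0, L_MAX_CHAFF]
    move_dir: float    # rad, in [-pi, pi)
    move_speed: float  # m/s, in [0, V_MAX_PLAT]


def _clip(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def _to_angle(u: float) -> float:
    """Map u in [-1, 1] to [-pi, pi); the upper end wraps to -pi."""
    a = _clip(u, -1.0, 1.0) * math.pi
    return -math.pi if a >= math.pi else a


def _to_range(u: float, hi: float) -> float:
    """Map u in [-1, 1] linearly to [0, hi]."""
    return (_clip(u, -1.0, 1.0) + 1.0) / 2.0 * hi


def decode_action(raw) -> Action:
    """Convert five raw policy outputs in [-1, 1] to a Table 2 action."""
    if len(raw) != 5:
        raise ValueError("expected 5 raw outputs")
    return Action(
        launch=1 if raw[0] > 0.0 else 0,  # hypothetical threshold at zero
        chaff_dir=_to_angle(raw[1]),
        chaff_dist=_to_range(raw[2], L_MAX_CHAFF),
        move_dir=_to_angle(raw[3]),
        move_speed=_to_range(raw[4], V_MAX_PLAT),
    )


print(decode_action([0.3, 0.0, 0.0, 0.5, 1.0]))
```

Clipping before rescaling keeps every decoded value inside the Table 2 ranges even when the raw outputs overshoot $[-1,1]$.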

Fig.5 MAPPO algorithm training framework

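Table 3 Performance parameters of red and blue force equipment

Side	Performance parameter	Symbol	Value
Blue	Missile proximity-fuze distance/m	$L_{fuse}$	60
Blue	Missile proportional navigation coefficient	$K$	5
Blue	Missile speed/(m·s⁻¹)	$v^{mis}$	150
Blue	Missile maximum angular velocity/(rad·s⁻¹)	$\omega_{\max}^{mis}$	$\pi/6$
Blue	Asset radar cross section/m²	$\sigma^{a}$	5000
Red	Safe distance between asset and detonation point/m	$L_{asset\_sf}^{safe}$	60
Red	Unmanned jamming platform maximum speed/(m·s⁻¹)	$v_{\max}^{plat}$	20
Red	Unmanned jamming platform maximum angular velocity/(rad·s⁻¹)	$\omega_{\max}^{plat}$	$\pi/3$
Red	Unmanned jamming platform maximum acceleration/(m·s⁻²)	$a_{\max}^{plat}$	5
Red	Chaff cloud duration/s	$T_{alive}^{chaff}$	2
Red	Chaff maximum launch distance/m	$L_{\max}^{chaff}$	10
Red	Chaff weapon load	$N_{\max}^{chaff}$	5
Red	Chaff maximum radar cross section/m²	$k$	1000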
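
To give a feel for how the Table 3 values interact, the sketch below uses two textbook models, which may differ from the models in the paper: a proportional navigation step in which the commanded turn rate is $K$ times the line-of-sight rate, saturated at $\omega_{\max}^{mis}$, and a radar aim point taken as the RCS-weighted centroid of all scatterers in the resolution cell. The function names, the finite-difference line-of-sight rate and the time step are assumptions.

```python
import math


def wrap_angle(a: float) -> float:
    """Wrap an angle to [-pi, pi)."""
    return (a + math.pi) % (2.0 * math.pi) - math.pi


def pn_heading_step(heading, los_prev, los_now, dt,
                    k_nav=5.0, omega_max=math.pi / 6):
    """One proportional navigation update (textbook form, assumed here).

    Turn rate = k_nav * line-of-sight rate, saturated at +/- omega_max.
    """
    los_rate = wrap_angle(los_now - los_prev) / dt
    omega = max(-omega_max, min(omega_max, k_nav * los_rate))
    return wrap_angle(heading + omega * dt)


def rcs_centroid(scatterers):
    """RCS-weighted centroid of [((x, y), rcs), ...] (textbook centroid model)."""
    total = sum(rcs for _, rcs in scatterers)
    if total <= 0.0:
        raise ValueError("total RCS must be positive")
    cx = sum(pos[0] * rcs for pos, rcs in scatterers) / total
    cy = sum(pos[1] * rcs for pos, rcs in scatterers) / total
    return cx, cy


# Asset (5000 m^2) at the origin, one fully bloomed chaff cloud (1000 m^2) 60 m away
one_cloud = rcs_centroid([((0.0, 0.0), 5000.0), ((60.0, 0.0), 1000.0)])
# Same geometry with five clouds at the same point
five_clouds = rcs_centroid([((0.0, 0.0), 5000.0)] + [((60.0, 0.0), 1000.0)] * 5)
print(one_cloud, five_clouds)  # (10.0, 0.0) (30.0, 0.0)
```

Under this simplified model a single cloud at peak RCS shifts the aim point by only one sixth of the asset-to-cloud separation, which suggests why the number, placement and timing of launches matter relative to the 60 m fuze and safety distances.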

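Table 4 Values of parameters in the MAPPO algorithm

Parameter	Symbol	Value
Policy network learning rate	$\alpha$	$5\times10^{-4}$
Value network learning rate	$\beta$	$5\times10^{-4}$
Maximum number of iteration steps	$step_{\max}$	$3\times10^{5}$
Data buffer capacity	$B$	5
Evaluation interval	$T_{ite}$	50
Maximum steps per episode	$T_{\max}$	30
Number of consecutive updates	$K$	5
Clipping parameter	$\varepsilon$	0.2
Policy entropy coefficient	$\varphi$	0.01
Discount factor	$\gamma$	0.999
GAE parameter	$\lambda$	0.95
Number of policy/value network layers	$N_{layer}$	4
Number of hidden-layer neurons	$N_{hidden}$	64
Number of test environments	$N_{env}$	100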
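
Three of the Table 4 values ($\gamma$, $\lambda$ and $\varepsilon$) enter the standard PPO computations directly. The sketch below shows generalized advantage estimation and the clipped surrogate term for a single sample in plain Python. It follows the standard PPO formulation rather than the paper's code, and it assumes one uninterrupted trajectory with no termination flags in the middle.

```python
def gae(rewards, values, last_value, gamma=0.999, lam=0.95):
    """Generalized advantage estimation over one trajectory.

    rewards[t], values[t] for t = 0..T-1; last_value bootstraps the state after step T-1.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of residuals
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped objective for one sample (to be maximized)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


adv = gae(rewards=[1.0, 1.0], values=[0.0, 0.0], last_value=0.0)
print(adv)                          # second entry is 1.0, first is 1 + 0.999 * 0.95
print(clipped_surrogate(1.5, 1.0))  # 1.2: gain capped at 1 + eps
print(clipped_surrogate(0.5, -1.0)) # -0.8: clipped term is the smaller one
```

With $T_{\max}=30$ steps per episode, a discount factor of 0.999 leaves late rewards almost undiscounted, so the decay of the advantage estimate is governed mainly by $\lambda$.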

Table 5 Test environment parameters

Environment No.	Asset positions/m	Vehicle headings/rad	Missile positions/m
1	[3345.9, 4790.8, 4354.1, 3048.8, 4933.4, 4892.2]	[2.59, -0.32, 2.88]	[482.6, 3185.1, 2917.6, 5889.4, 2616.3, 2769.1]
2	[3204.0, 4596.9, 3386.2, 3040.0, 4990.6, 3491.9]	[1.22, 0.27, 2.84]	[5536.7, 6601.3, 5096.9, 5639.4, 7012.2, 5877.6]
⋮	⋮	⋮	⋮
100	[3193.3, 4789.2, 4423.6, 3249.6, 4903.9, 4866.0]	[0.01, -0.97, -2.95]	[149.4, 5490.2, 1595.4, 1478.2, 3777.5, 2043.0]

Fig.6 Average cumulative reward curve for each training cycle

Fig.7 Average protection success rate curve for each training cycle

Fig.8 Multi-agent terminal defense simulation process after training

Fig.9 Change in asset threat distance in a single simulation

Table 6 Comparison results of terminal defense performances

Algorithm	Training time/s	Mean decision time per round/s	Mean success rate, asset 1	Mean success rate, asset 2	Mean success rate, asset 3	Mean minimum asset threat distance/m, asset 1	Mean minimum asset threat distance/m, asset 2	Mean minimum asset threat distance/m, asset 3
MAPPO	2763.07	0.063277	0.75	0.85	0.77	85.27	108.75	91.63
MADDPG (1000)	19102.6	0.022671	0.18	0.22	0.01	34.14	33.18	13.20
MADDPG (10000)	35916.8	0.020543	0.09	0.20	0.15	26.36	32.23	26.33
MADDPG (100000)	28220.1	0.017842	0.07	0.16	0.26	29.35	37.57	39.71
GA		148.68	0.49	0.49	0.14	71.98	67.45	27.08
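
The percentages quoted in the abstract can be recomputed from Table 6. The sketch below does this arithmetic. Which MADDPG row each figure refers to is not stated on this page, so the pairing used here is an inference: the 85.5% training-time reduction reproduces against MADDPG (1000), and the 3.84-fold success-rate increase reproduces against MADDPG (100000) when the success rate is averaged over the three assets. The helper names are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)


def reduction(new, old):
    """Fractional reduction of `new` relative to `old`."""
    return 1.0 - new / old


def fold_increase(new, old):
    """Increase of `new` over `old`, expressed in multiples of `old`."""
    return new / old - 1.0


# Values copied from Table 6
mappo_success = mean([0.75, 0.85, 0.77])          # 0.79
maddpg_100000_success = mean([0.07, 0.16, 0.26])
ga_success = mean([0.49, 0.49, 0.14])

train_cut = reduction(2763.07, 19102.6)            # MAPPO vs MADDPG (1000), assumed pairing
success_vs_maddpg = fold_increase(mappo_success, maddpg_100000_success)
decide_cut = reduction(0.063277, 148.68)           # MAPPO vs GA
success_vs_ga = fold_increase(mappo_success, ga_success)

print(round(100 * train_cut, 1))    # 85.5
print(round(success_vs_maddpg, 2))  # 3.84
print(round(100 * decide_cut, 2))   # 99.96
print(round(success_vs_ga, 2))      # 1.12
```

Read this way, "increased by 3.84 times" means the MAPPO success rate is about 4.84 times the MADDPG (100000) rate, and "increased by 1.12 times" means it is about 2.12 times the GA rate.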

Fig.10 Changes in the asset protection success rates of agents trained by the MAPPO and MADDPG algorithms

Fig.11 Variation of GA performance and decision time with the number of optimization generations

Fig.12 Influence of the agent reward setting on the convergence of the algorithm
