A Cooperative Guidance Law Based on Meta-learning and Reinforcement Learning for Multiple Aerial Vehicles

doi:10.12382/bgxb.2024.0568

Abstract

For the cooperative guidance problem in which hypersonic re-entry gliding vehicles must hit a target simultaneously at a specified angle in a complex environment, a cooperative guidance law based on meta-learning and reinforcement learning is proposed. Taking into account the disturbances of a complex combat environment, a Markov decision model of the cooperative guidance problem is established, with the vehicles' motion states as the state space and the proportional navigation coefficients as the action space. The reward function is designed by jointly considering the vehicle-target relative distance, the difference in remaining flight time, and the overload of the multiple vehicles attacking the target. Based on meta-learning theory and reinforcement learning, the proximal policy optimization algorithm is combined with gated recurrent units to learn the common features of similar cooperative guidance tasks. This improves the hit accuracy of the cooperative guidance policy under complex disturbances, satisfies the attack angle and attack time constraints, and improves the adaptability of the policy to different combat scenarios. Simulation results show that the proposed cooperative guidance law enables multiple vehicles to attack a target simultaneously at a specified attack angle in a complex battlefield environment, adapts quickly to new cooperative guidance tasks, and maintains good performance when the cooperative combat scenario changes.

Key words: hypersonic re-entry gliding vehicle, cooperative guidance, meta-learning, reinforcement learning, proximal policy optimization
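The abstract states that the action is the proportional navigation coefficient. As a minimal sketch of what that could mean, the snippet below maps a bounded policy output to a coefficient and applies the classical planar proportional navigation law a = N·V·(dλ/dt). The coefficient range [2, 6], the function names, and the planar simplification are illustrative assumptions and are not taken from the paper, whose three-dimensional guidance law is not shown in this excerpt.

```python
import math

def scale_action(u, n_min=2.0, n_max=6.0):
    """Map a policy output u in [-1, 1] to a navigation coefficient.

    The range [n_min, n_max] is an assumed example, not a value from the paper.
    """
    u = max(-1.0, min(1.0, u))  # guard against outputs outside [-1, 1]
    return n_min + 0.5 * (u + 1.0) * (n_max - n_min)

def pn_acceleration(nav_coeff, speed, los_rate):
    """Classical planar proportional navigation: a = N * V * d(lambda)/dt.

    speed in m/s, los_rate in rad/s, result in m/s^2.
    """
    return nav_coeff * speed * los_rate

# Example: a neutral policy output and a line-of-sight rate of 0.5 deg/s.
nav_coeff = scale_action(0.0)
a_cmd = pn_acceleration(nav_coeff, 1200.0, math.radians(0.5))
```

Letting the policy choose the coefficient rather than the acceleration itself keeps the familiar proportional navigation structure in the loop, which is presumably the motivation for this action space.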

CLC number: V249.31

WANG Cuncan, WANG Xiaofang, LIN Hai. A Cooperative Guidance Law Based on Meta-learning and Reinforcement Learning for Multiple Aerial Vehicles[J]. Acta Armamentarii, 2025, 46(7): 240568-.

Figures/Tables 39

Fig.1 Missile-target relative motion diagram

Fig.2 Cooperative guidance interaction process

Fig.3 Structural diagram of PPO algorithm network

Fig.4 Basic structure of GRU network

Fig.5 Structural diagram of GPPO algorithm

Fig.6 Basic structure of Actor and Critic networks

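Table 1 Initial parameters of the 3 missiles and the target

Parameter	Initial value
Target initial position/m	(45000,0,0)
M₁ initial position/m	(0,30000,0)
M₁ velocity/(m·s^-1)	1200
M₁ initial trajectory inclination angle/(°)	0
M₁ initial trajectory deviation angle/(°)	-5
M₂ initial position/m	(0,30000,1000)
M₂ velocity/(m·s^-1)	1200
M₂ initial trajectory inclination angle/(°)	0
M₂ initial trajectory deviation angle/(°)	-20
M₃ initial position/m	(0,30000,-1000)
M₃ velocity/(m·s^-1)	1200
M₃ initial trajectory inclination angle/(°)	0
M₃ initial trajectory deviation angle/(°)	15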
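As a quick sanity check on the geometry in Table 1, the sketch below computes each missile's initial straight-line distance to the target and the time a straight flight at the listed 1200 m/s would take. Constant speed and a straight path are simplifying assumptions made here only to obtain a lower bound; the paper does not claim either.

```python
import math

# Values copied from Table 1 (positions in m, speed in m/s).
target = (45000.0, 0.0, 0.0)
missiles = {
    "M1": (0.0, 30000.0, 0.0),
    "M2": (0.0, 30000.0, 1000.0),
    "M3": (0.0, 30000.0, -1000.0),
}
speed = 1200.0

# Euclidean missile-target distance at the initial instant.
ranges = {name: math.dist(pos, target) for name, pos in missiles.items()}

# Lower bound on flight time: straight line at constant speed (an assumption).
straight_line_time = {name: r / speed for name, r in ranges.items()}

for name in missiles:
    print(f"{name}: range = {ranges[name]:.1f} m, "
          f"straight-line time = {straight_line_time[name]:.2f} s")
```

All three ranges come out close to 54.1 km, so the straight-line bound is about 45 s. If the attack times in Table 6 (about 63.45 s) refer to this same engagement, the gap is consistent with curved trajectories shaped to meet the attack angle constraint.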

Table 2 Reward function parameters

Parameter	Value	Parameter	Value
a	100	b₄	-10
b₁	-0.5	b₅	-50
b₂	30	c₁	0.1
b₃	10	c₂	-0.1

Table 3 Network structure parameters

Parameter	Actor network	Critic network
Input layer	10	10
Hidden layer	64	64
Activation function	Tanh	Tanh
GRU layer	64	64
Activation function	Tanh	Tanh
Output layer	3	1
Activation function	Tanh	-
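The layer sizes in Table 3 can be read as the following forward pass, sketched here in plain Python for the Actor network. The layer order (dense, GRU, output), the omission of biases, the random weights, and the particular GRU update convention are assumptions made for illustration; the paper's exact architecture is given in its Fig. 6, which is not reproduced in this excerpt. In PPO the Actor output is usually treated as the mean of an action distribution, which this sketch does not model.

```python
import math
import random

random.seed(0)  # fixed seed so the illustrative weights are reproducible

def rand_matrix(rows, cols, scale=0.1):
    return [[random.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def add(a, b):
    return [ai + bi for ai, bi in zip(a, b)]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def tanh(v):
    return [math.tanh(x) for x in v]

class GRUCell:
    """Standard GRU cell, biases omitted for brevity."""

    def __init__(self, n_in, n_hidden):
        self.Wz, self.Uz = rand_matrix(n_hidden, n_in), rand_matrix(n_hidden, n_hidden)
        self.Wr, self.Ur = rand_matrix(n_hidden, n_in), rand_matrix(n_hidden, n_hidden)
        self.Wh, self.Uh = rand_matrix(n_hidden, n_in), rand_matrix(n_hidden, n_hidden)

    def step(self, x, h):
        z = sigmoid(add(matvec(self.Wz, x), matvec(self.Uz, h)))  # update gate
        r = sigmoid(add(matvec(self.Wr, x), matvec(self.Ur, h)))  # reset gate
        gated_h = [ri * hi for ri, hi in zip(r, h)]
        cand = tanh(add(matvec(self.Wh, x), matvec(self.Uh, gated_h)))  # candidate state
        # One common convention: blend old state and candidate with the update gate.
        return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

class Actor:
    """10 inputs -> 64 tanh units -> GRU(64) -> 3 tanh outputs (sizes from Table 3)."""

    def __init__(self, n_state=10, n_hidden=64, n_action=3):
        self.W_in = rand_matrix(n_hidden, n_state)
        self.gru = GRUCell(n_hidden, n_hidden)
        self.W_out = rand_matrix(n_action, n_hidden)
        self.h = [0.0] * n_hidden  # recurrent memory carried across time steps

    def act(self, state):
        x = tanh(matvec(self.W_in, state))
        self.h = self.gru.step(x, self.h)
        return tanh(matvec(self.W_out, self.h))

actor = Actor()
state = [0.1] * 10           # placeholder state vector, not real flight data
first = actor.act(state)
second = actor.act(state)    # same input, different output: the GRU state has changed
```

The second call returns a different action for the same input because the hidden state carries information from earlier steps, which is the property that lets a recurrent policy infer task features from its interaction history.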

Table 4 Algorithm training hyperparameters

Parameter	Value
Learning rate	0.0003
Reward discount factor γ	0.99
GAE coefficient λ	0.95
Clipping factor ε	0.2
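To show where the hyperparameters in Table 4 enter, here is a minimal sketch of generalized advantage estimation (GAE) and the PPO clipped surrogate objective in their standard textbook forms, using γ = 0.99, λ = 0.95, and ε = 0.2 as defaults. The function names and the toy inputs are illustrative, and the paper's training loop may differ in details such as advantage normalization or value and entropy loss terms.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory.

    `values` must hold len(rewards) + 1 entries; the last one is the
    bootstrap value of the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over the batch."""
    terms = []
    for r, a in zip(ratios, advantages):
        clipped = max(1.0 - eps, min(1.0 + eps, r))
        terms.append(min(r * a, clipped * a))
    return sum(terms) / len(terms)

# Toy example: two steps with reward 1 and a value estimate of 0 everywhere.
adv = gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0])
objective = ppo_clip_objective([1.5, 0.5], [1.0, -1.0])
```

With ε = 0.2, a probability ratio of 1.5 on a positive advantage is credited only as 1.2, so a single update cannot move the policy arbitrarily far from the one that collected the data.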

Table 5 Interference experienced by the missile

Interference	Magnitude
$d_{1i}$ (i=1,2,3)/((°)·s^-1)	0.1N(0,1)
$d_{2i}$ (i=1,2,3)/((°)·s^-1)	0.1N(0,1)
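Reading 0.1N(0,1) in Table 5 as a standard normal sample scaled by 0.1 (zero mean, standard deviation 0.1 (°)/s), the disturbances for the three missiles could be drawn as in the sketch below. The function name and the per-call sampling are assumptions; how often the disturbance is resampled and how it enters the dynamics are not shown in this excerpt.

```python
import random

def sample_disturbances(n_missiles=3, scale=0.1, rng=None):
    """Draw (d1_i, d2_i) for each missile as scale * N(0, 1), in (deg)/s."""
    rng = rng or random.Random()
    return [
        (scale * rng.gauss(0.0, 1.0), scale * rng.gauss(0.0, 1.0))
        for _ in range(n_missiles)
    ]

# A seeded generator makes a simulation run repeatable.
disturbances = sample_disturbances(rng=random.Random(0))
```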

Fig.7 Reward function curve of PPO guidance network

Fig.8 Missile trajectory

Fig.9 Missile-target relative distance over time

Fig.10 Remaining flight time t_go over time

Fig.11 Missile velocity V over time

Fig.12 Missile trajectory inclination angle θ over time

Fig.13 Missile trajectory deviation angle ψ_V over time

Fig.14 Guidance coefficient curves under PPO

Fig.15 Missile longitudinal acceleration curve

Fig.16 Missile lateral acceleration curve

Table 6 Missile miss distance, attack time, and attack angle

Guidance law	Missile	Miss distance/m	Attack time/s	Attack angle/(°)
PPO policy	M₁	0.74	63.45	-65.00
	M₂	3.42	63.46	-64.98
	M₃	2.69	63.45	-65.00
PITCG	M₁	20.85	63.47	-64.98
	M₂	30.21	65.74	-65.05
	M₃	23.57	64.17	-65.04
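Simultaneous arrival can be summarized by the spread between the earliest and latest attack times. The sketch below computes that spread and the worst miss distance for each guidance law from the values in Table 6; the dictionary layout and the helper name are illustrative.

```python
# Values copied from Table 6, ordered M1, M2, M3.
results = {
    "PPO": {
        "miss_m": [0.74, 3.42, 2.69],
        "time_s": [63.45, 63.46, 63.45],
        "angle_deg": [-65.00, -64.98, -65.00],
    },
    "PITCG": {
        "miss_m": [20.85, 30.21, 23.57],
        "time_s": [63.47, 65.74, 64.17],
        "angle_deg": [-64.98, -65.05, -65.04],
    },
}

def spread(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

summary = {
    law: {
        "time_spread_s": spread(data["time_s"]),
        "worst_miss_m": max(data["miss_m"]),
    }
    for law, data in results.items()
}

for law, stats in summary.items():
    print(f"{law}: time spread = {stats['time_spread_s']:.2f} s, "
          f"worst miss = {stats['worst_miss_m']:.2f} m")
```

The arrival-time spread is 0.01 s for the PPO policy against 2.27 s for PITCG, and the worst miss distance is 3.42 m against 30.21 m.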

Fig.17 Schematic diagram of target position

Fig.18 Reward function curve of GPPO guidance network

Table 7 Online performance comparison of guidance laws in Case 1

Guidance law	Miss distance/m		Attack time error/s		Attack angle error/(°)
	Mean	Standard deviation	Mean	Standard deviation	Mean	Standard deviation
PPO	6.79	1.88	0.42	0.76	0.07	0.17
GPPO	5.46	0.49	0.17	0.18	0.03	0.08

Table 8 Online performance comparison of guidance laws in Case 2

Guidance law	Miss distance/m		Attack time error/s		Attack angle error/(°)		Average number of training episodes
	Mean	Standard deviation	Mean	Standard deviation	Mean	Standard deviation	
PPO	4.89	0.89	0.07	0.04	0.03	0.06	85
GPPO	2.57	0.61	0.02	0.02	0.01	0.02	40
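The mean values in Tables 7 and 8 can be turned into relative reductions of GPPO with respect to PPO, as in the sketch below. Only the tabulated means are used; the helper name and the dictionary keys are illustrative.

```python
def relative_reduction(baseline, improved):
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline

# (PPO mean, GPPO mean) pairs copied from Tables 7 and 8.
case1 = {
    "miss distance": (6.79, 5.46),
    "attack time error": (0.42, 0.17),
    "attack angle error": (0.07, 0.03),
}
case2 = {
    "miss distance": (4.89, 2.57),
    "attack time error": (0.07, 0.02),
    "attack angle error": (0.03, 0.01),
    "training episodes": (85, 40),
}

reductions = {
    label: {name: relative_reduction(ppo, gppo) for name, (ppo, gppo) in table.items()}
    for label, table in (("Case 1", case1), ("Case 2", case2))
}

for label, table in reductions.items():
    for name, value in table.items():
        print(f"{label}, {name}: {100.0 * value:.1f}% lower with GPPO")
```

In Case 2, for example, the mean miss distance is about 47% lower and the average number of training episodes about 53% lower with GPPO. The error means are given to only two decimals in the tables, so the percentages for the time and angle errors are coarse.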

Fig.19 Miss distance in Case 1

Fig.20 Attack time error in Case 1

Fig.21 Miss distance in Case 2

Fig.22 Attack time error in Case 2

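Table 9 Parameters of Scenario 1

Parameter	Value
Target position/m	(42064,0,3535)
Missile 1 trajectory deviation angle/(°)	2
Missile 2 trajectory deviation angle/(°)	-17
Missile 3 trajectory deviation angle/(°)	5
Desired attack angle/(°)	-65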

Fig.23 Reward curve of the cooperative guidance network in Scenario 1

Fig.24 Missile trajectory in Scenario 1

Fig.25 Variation curves of missile guidance coefficients in Scenario 1

Table 10 Performance parameters of PPO2 guidance law

Parameter	Mean	Standard deviation
Miss distance/m	4.28	0.78
Attack time error/s	0.05	0.03
Attack angle/(°)	-65.02	0.07
Number of training episodes	59	-

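Table 11 Parameters of Scenario 2

Parameter	Value
Target position/m	(42000,0,3500)
Missile 1 trajectory deviation angle/(°)	2
Missile 2 trajectory deviation angle/(°)	-15
Missile 3 trajectory deviation angle/(°)	10
Desired attack angle/(°)	-70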

Fig.26 Missile trajectory in Scenario 2

Fig.27 Missile trajectory inclination angle θ over time in Scenario 2

Fig.28 Missile trajectory deviation angle ψ_V over time in Scenario 2
