Intelligent air defense task assignment based on hierarchical reinforcement learning

Introduction

Modern air defense operations are becoming more complex with the rapid trend toward long-range, multi-element, and intelligent warfare. Rationally planning interception schemes for incoming air targets so as to maximize operational effectiveness has become a major challenge for defenders in modern air defense operations (Yang et al., 2019). Task assignment replaces the fire unit-target model of weapon target assignment (WTA) with a task-target assignment model. This improves the ability to coordinate the various components and makes the assignment scheme more flexible, providing a fundamental guarantee of maximum operational effectiveness (Wang et al., 2019). With the continuous adoption of new technologies on both sides of the battlefield, the combat process is becoming increasingly complex and involves many elements. The battlefield environment and the adversary's strategy change rapidly and are difficult to quantify. Human judgment and decision-making alone can no longer keep pace with fast, high-intensity confrontation, and traditional analytical models cannot cope with complex and changing scenarios. Reinforcement learning (RL) does not require an accurate mathematical model of the environment and the task and is less dependent on external guidance information. Therefore, some scholars have investigated the task assignment problem with intelligent methods such as single-agent reinforcement learning, multi-agent reinforcement learning (MARL), and deep reinforcement learning (DRL). Zhang et al. (2020) proposed an imitation-augmented deep reinforcement learning (IADRL) model that enables unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs) to form a complementary, cooperative alliance to accomplish tasks that neither can do alone. Wu et al. (2022) proposed a dynamic multi-UAV task assignment algorithm based on reinforcement learning and a deep neural network, which effectively addresses poor mission execution quality in complex dynamic environments. Zhao et al. (2019) proposed a Q-learning-based fast task assignment (FTA) algorithm for the task assignment problem of heterogeneous UAVs.

In modern air defense operations, the threat to the defense can be either a large-scale air attack or a small-scale contingency, so task assignment methods must balance effectiveness and dynamism. A centralized assignment solution is not fast enough, while a fully distributed assignment method does not respond effectively to unexpected events (Lee et al., 2012). The one-general agent with multiple narrow agents (OGMN) architecture proposed in the literature (Liu J. Y. et al., 2022), which divides agents into general and narrow agents, improves computational speed and coordination ability. However, the narrow agents in OGMN are entirely rule-driven and lack autonomy, so they cannot fully adapt to the complex and changing battlefield environment. Therefore, this paper proposes the hierarchical reinforcement learning for ground-to-air confrontation (HRL-GC) architecture based on OGMN, which layers the agents into a scheduling level and an execution level. The scheduling agent is responsible for assigning targets to the execution agents, which make the final decisions based on their own states. Both types of agents are data-driven. Considering the inefficiency of the initial phase of agent training, this paper also proposes a model-based model predictive control with proximal policy optimization (MPC-PPO) algorithm to train the execution agents and reduce inefficient exploration. Finally, HRL-GC is compared with two other architectures in a large-scale air defense scenario, and the effectiveness of the MPC-PPO algorithm is verified. Experimental results show that the HRL-GC architecture and the MPC-PPO algorithm are suitable for large-scale air defense problems and effectively balance the effectiveness and dynamism of task assignment.

Related work

Deep reinforcement learning

Reinforcement learning was first introduced in the 1950s (Minsky, 1954). Its central idea is to let an agent learn in its environment and continuously refine its behavioral strategy through constant interaction with the environment and trial-and-error exploration (Moos et al., 2022). With the continuous development of RL, algorithms such as Q-learning (Watkins and Dayan, 1992) and SARSA (Chen et al., 2008) have been proposed. However, when faced with large-scale, high-dimensional decision-making environments, the computation and storage space required by traditional RL methods grow rapidly.

Deep reinforcement learning is the combination of RL and deep learning (DL). DL extends reinforcement learning to previously intractable decision problems and has produced significant results in areas such as drone surveys (Zhang et al., 2022), recommender and search systems (Shen et al., 2021), and natural language processing (Li et al., 2022), particularly in continuous end-to-end control (Zhao J. et al., 2021). In the problem studied in this paper, the decisions that DRL shapes for the agents must be temporally correlated, enabling the air defense task assignment strategy to maximize future gains and seize the initiative on the battlefield.

Hierarchical reinforcement learning

Hierarchical reinforcement learning (HRL) was proposed to address the curse of dimensionality in reinforcement learning. The idea is to decompose a whole task into multi-level subtasks by introducing mechanisms such as state space decomposition (Takahashi, 2001), state abstraction (Abel, 2019), and temporal abstraction (Bacon and Precup, 2018), so that each subtask can be solved in a small-scale state space, thereby speeding up the solution of the whole task. To model these abstraction mechanisms, researchers introduced the semi-Markov Decision Process (SMDP) (Ascione and Cuomo, 2022) to handle actions that span multiple time steps. The state space decomposition approach partitions the state space into different subsets and adopts a divide-and-conquer strategy, so that each subproblem is solved in a smaller subspace. Based on this idea, this paper divides the task assignment problem into two levels, scheduling and execution, and proposes the HRL-GC architecture to effectively combine the advantages of centralized and distributed assignment.

Model-based reinforcement learning

Model-free RL does not require environmental models (e.g., state transition probability models and reward function models) but trains directly to obtain high-performance policies (Abouheaf et al., 2015). Model-based RL, on the other hand, first learns a model during the learning process and then searches for an optimized policy based on that model knowledge (Zhao T. et al., 2021). Model-free RL is less computationally intensive at each iteration because it does not need to learn model knowledge, but it suffers from excessive invalid exploration, which makes the agent's learning inefficient. Model-based RL methods can learn complex behaviors from a minimal number of samples by using the collected data to learn the model. The model is then used to generate a large amount of simulated data for learning a state-action value function, reducing the interaction between the system and the environment and improving sampling efficiency (Gu et al., 2016). In air defense scenarios, sampling is costly and large numbers of data samples are difficult to collect. Therefore, this paper uses a model-based RL approach to build a neural network model from a small amount of collected sample data. The agent interacts with the model to obtain data, thereby reducing the sampling cost and improving training efficiency.

Model predictive control

Model predictive control (MPC) is a branch of optimal control (Liu S. et al., 2022), and the idea of MPC is widely used in model-based RL algorithms because of its efficiency in unconstrained planning problems. Its specific idea is to train a model with the collected data and obtain an optimal action sequence by solving an unconstrained optimization problem (Yang and Lucia, 2021), as shown in Eq. 1.

$a_t^{*}, a_{t+1}^{*}, \ldots, a_{t+H}^{*} = \mathop{\arg\max}\limits_{a_t, a_{t+1}, \ldots, a_{t+H}} \sum_{k=0}^{H} r(s_{t+k}, a_{t+k}) \quad \text{s.t.} \quad s_{t+k+1} = \hat{f}(s_{t+k}, a_{t+k}), \; k = 0, 1, \ldots, H$    (1)

Where $\hat{f}(\cdot)$ is the learned model, typically a parametric neural network whose inputs are the current action $a_t$ and the current state $s_t$ and whose output is the predicted next state $\hat{s}_{t+1}$; the loss function of the neural network can be constructed as (Yaqi, 2021)

$\varepsilon(\theta) = \frac{1}{|\mathcal{D}|} \sum_{(s_t, a_t, s_{t+1}) \in \mathcal{D}} \frac{1}{2} \left\| (s_{t+1} - s_t) - \hat{f}_{\theta}(s_t, a_t) \right\|^{2}$    (2)
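For illustration, a minimal PyTorch sketch of this loss is given below; the network architecture, layer sizes, and variable names are assumptions for the example rather than the configuration used in this paper.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Parametric model f_theta(s, a) that predicts the state change s_{t+1} - s_t."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s, a], dim=-1))

def model_loss(model: DynamicsModel, s, a, s_next) -> torch.Tensor:
    """Eq. 2: mean over the dataset D of 0.5 * ||(s_{t+1} - s_t) - f_theta(s_t, a_t)||^2."""
    pred_delta = model(s, a)
    return (0.5 * ((s_next - s) - pred_delta).pow(2).sum(dim=-1)).mean()
```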

Where $\mathcal{D}$ is the collected demonstration dataset. It is obtained by first generating random action sequences that interact with the model, calculating the cumulative reward of each sequence, and selecting the action sequence with the highest cumulative reward. The first action of this sequence is then executed in the environment to obtain a new state, the resulting transition is added to the demonstration dataset $\mathcal{D}$, and the same procedure is repeated to obtain the next action. The model is trained using Eq. 2 and the dataset is continuously enriched, repeating the process until both the model and the demonstration dataset achieve good performance. In this way, model errors and external disturbances can be effectively suppressed and robustness improved (Nagabandi et al., 2018). Based on this idea, the MPC-PPO algorithm is proposed: the model is trained by the MPC method, and the model is then used to pre-train the PPO network, improving pre-training efficiency.
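The procedure above can be summarized, under simplifying assumptions, as the following random-shooting planner sketch; `model_step` and `reward_fn` stand in for the learned model $\hat{f}$ and the task reward, and the uniform action bounds are illustrative.

```python
import numpy as np

def mpc_random_shooting(model_step, reward_fn, s_t, action_dim,
                        horizon=10, n_candidates=1000, rng=None):
    """Random-shooting planner for Eq. 1: sample candidate action sequences,
    roll them out through the learned model f_hat, and return the first action
    of the sequence with the highest cumulative predicted reward."""
    if rng is None:
        rng = np.random.default_rng()
    # Candidate sequences a_t, ..., a_{t+H}; uniform bounds in [-1, 1] are an assumption.
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)
    states = np.repeat(s_t[None, :], n_candidates, axis=0)
    for k in range(horizon):
        actions = candidates[:, k, :]
        returns += reward_fn(states, actions)   # accumulate r(s_{t+k}, a_{t+k})
        states = model_step(states, actions)    # s_{t+k+1} = f_hat(s_{t+k}, a_{t+k})
    best = int(np.argmax(returns))
    return candidates[best, 0, :]               # only the first action is executed
```

In the MPC-PPO scheme described above, the transition produced by executing this first action would be appended to $\mathcal{D}$, the model retrained with Eq. 2, and the resulting model used to pre-train the PPO policy network.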

Problem modeling

Problem formulation

Modern large-scale air defense missions are no longer one-to-one confrontations of a single interceptor against a single incoming target. Instead, they are one-to-many and many-to-one confrontations accomplished through efficient organizational synergy in the form of tactical coordination, in response to saturated long-range attacks by cruise missiles and multi-directional, multi-dimensional suppression attacks by mixed formations of human-crewed and uncrewed aircraft. However, this one-to-many and many-to-one confrontation assignment is not fixed; during air defense confrontations, the air attack offensive posture changes in real time, and the confrontation assignment needs to be highly dynamic to respond to changes in the posture of the air attack threat (Rosier, 2009). The critical issue in this paper is how to effectively integrate combat resources according to the characteristics of different weapon systems and dynamically change the strategy according to the situation, so that they achieve a "1 + 1 > 2" combat effectiveness.

To reduce complexity while preserving dynamism, this paper divides the air defense operations process into two parts, resource scheduling and task execution, based on the idea of HRL. The complexity of the high-dimensional state-action space is reduced by decomposing the entire process into multiple smaller subproblems and then integrating their solutions into a solution to the overall task assignment problem.

Markov Decision Process modeling of execution agents

In this paper, we study the air defense task assignment problem in a red-blue confrontation scenario, where the red side is the ground defender, and the blue side is the air attacker. We define a sensor and several interceptors around it as an interception unit. We use an independent learning framework to build the same MDP model for each interception unit.

State space: (1) state information of the defender's defended objects; (2) resource assignment of the unit and the states of its sensor and interceptors; (3) state information of the attacker's targets within the unit's own tracking and interception range; (4) state information of the attacker's incoming targets that are assigned to the unit.

Action space: (1) when to track the target; (2) which interceptor to use against the target; (3) how many resources to commit to the interception; and (4) when to intercept.
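As a concrete, purely illustrative encoding, the state and action spaces above could be expressed with Gymnasium spaces as sketched below; all dimensions and category counts are assumptions, not values from this paper.

```python
import numpy as np
from gymnasium import spaces

MAX_TARGETS = 20        # assumed cap on simultaneously tracked attacker targets
N_INTERCEPTORS = 4      # assumed number of interceptors in one interception unit

# State space items (1)-(4): defended objects, own resources/sensor/interceptor states,
# targets within tracking and interception range, and targets assigned to this unit.
observation_space = spaces.Dict({
    "defended_objects": spaces.Box(-np.inf, np.inf, shape=(5, 4), dtype=np.float32),
    "unit_resources":   spaces.Box(0.0, 1.0, shape=(N_INTERCEPTORS + 1,), dtype=np.float32),
    "targets_in_range": spaces.Box(-np.inf, np.inf, shape=(MAX_TARGETS, 6), dtype=np.float32),
    "assigned_targets": spaces.MultiBinary(MAX_TARGETS),
})

# Action space items (1)-(4): tracking timing, interceptor choice, resources committed, firing timing.
action_space = spaces.MultiDiscrete([
    3,               # (1) when to start tracking (e.g., immediately / delayed / hold)
    N_INTERCEPTORS,  # (2) which interceptor engages the target
    3,               # (3) how many interceptor resources to commit
    3,               # (4) when to fire (e.g., far boundary / optimal point / near boundary)
])
```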

Reward function: To balance the agent's exploration and learning efficiency while progressively guiding the agent toward winning, this paper designs the reward function on the principle of least resources.

$R = 5m + 2n - 5i + j$    (3)

Where m is the number of human-crewed aircraft intercepted, n is the number of high-threat targets intercepted, j is the number of missiles intercepted, and i is the number of times our unit has been attacked as a result of a failed interception. Five points are awarded for each human-crewed aircraft intercepted, two points for each high-threat target intercepted, and one point for each missile intercepted, while five points are deducted each time our unit is attacked due to a failed interception.
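A minimal sketch of this reward term, assuming the per-episode counts are available from the simulation environment:

```python
def execution_reward(m: int, n: int, i: int, j: int) -> int:
    """Eq. 3: R = 5m + 2n - 5i + j.

    m: human-crewed aircraft intercepted                  (+5 each)
    n: high-threat targets intercepted                    (+2 each)
    j: missiles intercepted                               (+1 each)
    i: times the unit was hit after a failed interception (-5 each)
    """
    return 5 * m + 2 * n - 5 * i + j
```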

Markov Decision Process modeling of scheduling agents

The task of the scheduling agent is to assign tracking and interception tasks to the interception units based on the global situation, with the state space, action space, and reward function designed as follows:

State space: (1) state information of the defender's defended objects; (2) state information of the defender's interception units, including resource assignment, sensor and interceptor states, and state information of the attacker's targets within each unit's interception range; (3) state information of the attacker's incoming targets; and (4) state information of the attacker's units that can be attacked.

Action space: (1) select the target to be tracked; (2) select the target to be intercepted; (3) select the interception unit.

Reward function: The merit of the task assignment strategy depends on the final result of task execution, so the reward of the scheduling agent is the sum of the rewards of all the execution agents at the lower level plus a base reward, as shown in Eq. 4.

$R = \sum_{i=1}^{N} R_i + R_{base}$    (4)

(1) Guidance constraints

$\theta_T \in \bigcup_{i} \theta_{guide,i}, \quad \sigma_T \ge \sigma_{\min}, \quad S_i \ge S_T$    (11)

Where n denotes the number of missiles to be guided, $\theta_T$ represents the set of flight airspace angles of the target missile, $\theta_{guide}$ denotes the operating range of the sensor, $\sigma_T$ denotes the guidance accuracy, $\sigma_{\min}$ denotes the minimum guidance accuracy requirement, $S_i$ denotes the guidance distance of sensor i, and $S_T$ represents the distance of the missile. That is, the constraints of minimum guidance accuracy and maximum guidance distance must be satisfied during cooperative guidance.
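Assuming the reconstructed form of Eq. 11, a feasibility check for these cooperative-guidance constraints could be sketched as follows; all function and variable names are illustrative.

```python
def guidance_feasible(theta_T, theta_guide_ranges, sigma_T, sigma_min, S_sensors, S_T):
    """Check the cooperative-guidance constraints described above (cf. Eq. 11):
    every flight airspace angle of the target lies in the union of the sensors'
    operating ranges, guidance accuracy meets the minimum requirement, and at
    least one sensor's guidance distance covers the missile's distance."""
    covered = all(any(lo <= ang <= hi for (lo, hi) in theta_guide_ranges) for ang in theta_T)
    accurate = sigma_T >= sigma_min
    in_range = any(S_i >= S_T for S_i in S_sensors)
    return covered and accurate and in_range
```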

(2) Time constraints

Due to the highly real-time nature of air defense tasks, task assignment is strongly time-constrained. For the execution agent, the factors associated with the time constraint are mainly reflected in the following.

1. Timing of interceptions

Longest interception distance:

$D_{LI} = \sqrt{D_{LS}^{2} + (v_m t_L)^{2} + 2 v_m t_L \sqrt{D_{LS}^{2} - (H^{2} + P^{2})}}$    (12)

Nearest interception distance:

$D_{NI} = \sqrt{D_{NS}^{2} + (v_m t_N)^{2} + 2 v_m t_N \sqrt{D_{NS}^{2} - (H^{2} + P^{2})}}$    (13)

Where $v_m$ is the speed of the target, H is the altitude of the target, P is the shortcut of the target's flight path, $D_{LS}$ and $D_{NS}$ are the far and near boundaries of the kill zone in the oncoming direction, and $t_L$ and $t_N$ are the times for the target to fly to the far and near boundaries of the oncoming kill zone, respectively.
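Under the square-root form inferred for Eqs. 12 and 13, both boundary distances can be computed with one helper; the function and argument names are illustrative.

```python
import math

def interception_distance(D_boundary: float, v_m: float, t: float, H: float, P: float) -> float:
    """Eqs. 12-13: interception distance for meeting the target at a kill-zone boundary.
    D_boundary: oncoming kill-zone boundary (D_LS for the far case, D_NS for the near case)
    v_m: target speed, t: flight time to that boundary (t_L or t_N)
    H: target altitude, P: flight-path shortcut."""
    horizontal = math.sqrt(D_boundary ** 2 - (H ** 2 + P ** 2))
    return math.sqrt(D_boundary ** 2 + (v_m * t) ** 2 + 2.0 * v_m * t * horizontal)

# Usage (values are placeholders):
# D_LI = interception_distance(D_LS, v_m, t_L, H, P)   # longest interception distance, Eq. 12
# D_NI = interception_distance(D_NS, v_m, t_N, H, P)   # nearest interception distance, Eq. 13
```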

2. Timing of sensor switch-on:

Sensor detection of the target is a prerequisite for intercepting it. In combat, a certain amount of time, called the pre-interception preparation time $t_P$, elapses between the sensor detecting the target and the interceptor engaging it.

The required sensor detection distance $D_S$ is determined by the target's distance at the farthest encounter point.

$D_{S} = \sqrt{D_{NS}^{2} + v_m^{2}(t_L + t_P)^{2} + 2 v_m (t_L + t_P)\sqrt{D_{NS}^{2} - (H^{2} + P^{2})}}$    (14)
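A corresponding sketch for Eq. 14, again assuming the inferred square-root form; names are illustrative.

```python
import math

def required_sensor_distance(D_NS: float, v_m: float, t_L: float, t_P: float,
                             H: float, P: float) -> float:
    """Eq. 14: distance D_S at which the sensor must detect the target so that the
    pre-interception preparation time t_P still allows the planned engagement."""
    horizontal = math.sqrt(D_NS ** 2 - (H ** 2 + P ** 2))
    t = t_L + t_P
    return math.sqrt(D_NS ** 2 + (v_m * t) ** 2 + 2.0 * v_m * t * horizontal)
```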

We define the state that satisfies the security constraint as S, which gives us Eq. 15.
