Mark Stephenson, University of Colorado Boulder; Hanspeter Schaub, University of Colorado Boulder
Keywords: inspection, reinforcement learning, multi-agent, autonomy, rpo
Abstract:
The close-proximity inspection of objects in low Earth orbit (LEO) is important to operations such as rendezvous, debris removal, servicing, and resident space object (RSO) characterization, all of which are of increasing interest to commercial and government organizations. Complex relative motion dynamics in LEO make path planning for autonomous multi-agent inspection challenging. Agents must fully inspect an object subject to illumination constraints while avoiding collision with the RSO or, in the multi-agent case, each other. In this paper, autonomous satellite inspection with impulsive maneuvers is considered by learning a policy on a multi-agent semi-Markov decision process formulation of the inspection task while ensuring safety via an optimization-based shield for collision avoidance based on analytical equations of relative motion. This work demonstrates closed-loop, autonomous, safe multi-agent inspection of an RSO with shielded deep reinforcement learning (RL) for any LEO orbit.
The problem is expressed as a multi-agent asynchronous semi-Markov decision process (semi-MDP), a framework for sequential decision-making problems with varying-duration decision intervals. Each agent acts by selecting an impulsive thrust direction and magnitude and a passive drift duration. To inspect the RSO, which has a set of inspection points and surface normals, the inspector must be within a maximum inspection range, the instrument boresight must be within a maximum angle of the surface normal, and the facet of the RSO must be sufficiently illuminated. The inspectors have total inspection time and fuel allocation constraints for the task and must not collide with each other or the RSO. Depending on the motion and geometry of the RSO, complete illuminated inspection may not be possible. This scenario is modeled in the high-fidelity spacecraft decision-making framework BSK-RL, yielding a realistic simulation environment for the task.
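As a concrete illustration of the three inspection conditions, the following sketch checks whether a single facet is inspected from a given inspector position. The function name, vector conventions, and threshold values are illustrative assumptions for this abstract, not the BSK-RL implementation or the thresholds used in this work.

```python
import numpy as np

def facet_inspected(r_inspector, r_point, n_hat, sun_hat,
                    max_range=100.0,
                    max_view_angle=np.radians(60.0),
                    max_sun_angle=np.radians(75.0)):
    """Check the three inspection conditions for one facet of the RSO.

    r_inspector : inspector position relative to the RSO center of mass
    r_point     : inspection point position relative to the RSO center of mass
    n_hat       : unit outward normal of the facet
    sun_hat     : unit vector from the RSO toward the Sun
    All thresholds are placeholder values.
    """
    to_inspector = r_inspector - r_point
    dist = np.linalg.norm(to_inspector)
    los_hat = to_inspector / dist

    in_range = dist <= max_range                             # within instrument range
    viewing_ok = n_hat @ los_hat >= np.cos(max_view_angle)   # boresight near the normal
    illuminated = n_hat @ sun_hat >= np.cos(max_sun_angle)   # facet lit by the Sun

    return in_range and viewing_ok and illuminated
```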
Given the asynchronous semi-MDP, a closed-loop policy can be found using state-of-the-art deep RL algorithms, modified to properly account for variable-duration and asynchronous decisions. The algorithm learns a policy for each agent, or a single meta-policy for all agents, that optimizes the long-term collection of rewards: in this case, positive rewards for inspection and negative rewards for fuel usage and safety violations. Using RL for this problem yields distinct advantages: because the agent(s) respond to the current state of the environment, sources of external information or uncertainty are implicitly accounted for. For example, if an external algorithm determines that some parts of the RSO were not sufficiently inspected, it can re-flag those regions as uninspected, and the policy will respond accordingly. Likewise, if the RSO performs an attitude or orbital maneuver during the inspection task, the policy selects actions based on the new state.
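One standard modification for variable-duration decisions is to discount each interval by its elapsed time rather than by a fixed per-step factor. The sketch below shows this semi-MDP return computation under that assumption; the paper's exact algorithm and hyperparameters may differ.

```python
import numpy as np

def smdp_returns(rewards, durations, gamma=0.999):
    """Discounted returns when each decision spans a variable duration.

    rewards   : reward accumulated over each decision interval
    durations : length of each interval (e.g., drift time in minutes)
    gamma     : per-unit-time discount, so an interval of length dt is
                discounted by gamma**dt instead of a single factor of gamma
    """
    G = 0.0
    returns = np.empty(len(rewards))
    for k in reversed(range(len(rewards))):
        G = rewards[k] + gamma ** durations[k] * G
        returns[k] = G
    return returns

# Identical rewards, but the long middle interval pushes later rewards
# further into the future than a fixed-step MDP would.
print(smdp_returns([1.0, 1.0, 1.0], [5.0, 30.0, 5.0]))
```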
To ensure the safety of an RL-based policy, a shield can be introduced to the system. In general, shields offer a probabilistic or analytical guarantee of safety by disallowing actions that could lead to unsafe states and offering alternative actions that guarantee remaining in the safe domain. In the multi-agent inspection problem, the primary safety concern is collision between agents and the RSO. Analytical equations of relative motion for circular and eccentric orbits can be used to pose an optimization-based shield that minimizes the deviation from the policy's desired action while guaranteeing that a minimum keepout distance from the other agents and the RSO is maintained, and that the long-term passive trajectory is likewise safe (either periodic or dominated by a secularly increasing drift term). Because agents make decisions asynchronously, the challenges of multi-agent path planning are avoided within the shield: each action need only be safe relative to the current passive trajectories of all other objects.
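A minimal single-obstacle sketch of such a shield is given below, assuming a circular chief orbit and Clohessy-Wiltshire dynamics. The names shield_impulse and cw_positions, the keepout distance, and the horizon are illustrative assumptions; the paper's shield additionally constrains the distances to other agents, handles eccentric orbits, and enforces the long-term trajectory condition.

```python
import numpy as np
from scipy.optimize import minimize

def cw_positions(r0, v0, n, times):
    """Closed-form Clohessy-Wiltshire relative positions (circular chief orbit).

    r0, v0 : post-burn relative position/velocity in the Hill frame [m, m/s]
    n      : chief mean motion [rad/s]
    times  : sample times over the safety horizon [s]
    """
    s, c = np.sin(n * times), np.cos(n * times)
    x = (4 - 3 * c) * r0[0] + s / n * v0[0] + 2 * (1 - c) / n * v0[1]
    y = 6 * (s - n * times) * r0[0] + r0[1] \
        + 2 * (c - 1) / n * v0[0] + (4 * s - 3 * n * times) / n * v0[1]
    z = c * r0[2] + s / n * v0[2]
    return np.stack([x, y, z], axis=1)

def shield_impulse(dv_desired, r0, v0, n, keepout=25.0, horizon=5700.0):
    """Return the impulse closest to the policy's desired impulse whose
    post-burn passive drift keeps the keepout distance from the RSO
    (fixed at the Hill-frame origin) over the horizon."""
    times = np.linspace(1.0, horizon, 200)

    def separation_margin(dv):
        r = cw_positions(r0, v0 + np.asarray(dv), n, times)
        return np.linalg.norm(r, axis=1) - keepout  # one constraint per sample

    result = minimize(lambda dv: np.sum((dv - dv_desired) ** 2),
                      x0=np.asarray(dv_desired),
                      constraints=[{"type": "ineq", "fun": separation_margin}],
                      method="SLSQP")
    return result.x
```

Sampling the keepout constraint at discrete times yields one smooth inequality per sample, which gradient-based solvers handle well; however, it only certifies safety at the sampled instants, so a practical implementation would need a denser sampling or an analytical minimum-distance bound.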
Three primary results are shown, investigating the performance and flexibility of the resulting policies. First, the Pareto front between inspection time and fuel use is examined, showing a limited trade space between speed of inspection and fuel efficiency. Second, the impact of the shield on the policy's performance is examined: despite the strong guarantees it provides, even in domains with many potential collisions, its impact is relatively minimal, and the safety guarantees are shown to hold in practice. Third, the use of single-agent-trained policies in a multi-agent setting is compared to per-agent policies trained in a multi-agent setting. It is shown that the increased coordination of the multi-agent policies improves the diversity of trajectories taken in the distributed system, though in the cases considered, a single agent is sufficient to effectively complete the task. As a whole, the results present a safe and efficient method for autonomous multi-agent inspection of space objects.
Date of Conference: September 16-19, 2025
Track: Machine Learning for SDA Applications