Tyler Becker, University of Colorado Boulder; Zachary Sunberg, University of Colorado Boulder
Keywords: Game theory, adversarial, custody maintenance, AlphaZero, deep learning, reinforcement learning, Monte Carlo tree search, pursuit-evasion
Abstract:
Custody maintenance of resident space objects (RSOs) is a fundamental challenge in space domain awareness (SDA), particularly when tracking adversarial satellites that deliberately attempt to evade observation and break custody. This challenge will only intensify as satellites become more numerous and maneuverable. Traditional methods for optimizing sensor tasking schedules rely on deterministic or probabilistic models of target behavior. While these approaches are effective for tracking targets subject to natural and non-adversarial disturbances, they fail in adversarial settings where an intelligent agent actively attempts to avoid detection.
For these possibly adversarial cases, SDA must move beyond traditional optimization paradigms and incorporate strategic reasoning about adversaries in a game-theoretic manner. Consider an observer satellite tracking a non-cooperative target. If the target follows a predictable strategy fully modeled by the tracking team, traditional state estimation techniques paired with single-agent optimization are sufficient. However, when the target strategically adapts its policy to counter the sensing strategy, these optimization frameworks are vulnerable to exploitation. To address these challenges, we formulate the RSO tracking problem as a competitive game between observer and evader, ensuring the robustness of the tasking strategy to any evasive strategy the evader may employ.
We model the problem of maintaining custody of an adversarial RSO as a zero-sum Markov game between an observer satellite and an evader satellite. In a Markov game, players occupy a joint state and take actions that determine state transitions, each receiving rewards based on the joint state and action. The state space for our formulation consists of the observer’s and target’s 3D positions and velocities under orbital dynamics. Both the observer and target satellites are given the option to execute discrete impulse burns to modify their trajectories. The observer’s objective is to maximize the signal-to-noise ratio (SNR) of a camera sensor, incorporating sensor constraints such as pixel resolution, aperture size, lighting conditions, and line-of-sight visibility. Conversely, the evader’s objective is to minimize the SNR of the camera sensor and evade detection.
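For concreteness, a minimal sketch of this zero-sum Markov game interface is given below. All names, the burn magnitudes, the Euler two-body propagation, and the range-based SNR placeholder are illustrative assumptions standing in for the paper's full orbital dynamics and camera sensor model:

# Illustrative sketch of the zero-sum Markov game described above.
# The propagation and SNR functions are placeholders, not the paper's models.
from dataclasses import dataclass
import numpy as np

# Discrete impulsive burn options (km/s): no burn, or +/- 1 m/s along each axis.
BURNS = [np.zeros(3)] + [dv * e for dv in (0.001, -0.001) for e in np.eye(3)]

@dataclass
class MarkovGameState:
    r_obs: np.ndarray   # observer position (km)
    v_obs: np.ndarray   # observer velocity (km/s)
    r_tgt: np.ndarray   # target position (km)
    v_tgt: np.ndarray   # target velocity (km/s)

def propagate(r, v, dt):
    """Placeholder Euler two-body step; the paper uses full orbital dynamics."""
    mu = 398600.4418  # Earth's gravitational parameter (km^3/s^2)
    a = -mu * r / np.linalg.norm(r) ** 3
    return r + v * dt, v + a * dt

def snr_reward(s):
    """Placeholder for the camera SNR model (pixel resolution, aperture,
    lighting, line of sight); here it simply decays with observer-target range."""
    rng = np.linalg.norm(s.r_tgt - s.r_obs)
    return 1.0 / (1.0 + rng)

def step(s, a_obs, a_tgt, dt=60.0):
    """Simultaneous joint transition: each player applies a burn chosen from
    BURNS, dynamics propagate, and the observer receives +SNR while the
    evader receives -SNR (zero-sum)."""
    r_o, v_o = propagate(s.r_obs, s.v_obs + a_obs, dt)
    r_t, v_t = propagate(s.r_tgt, s.v_tgt + a_tgt, dt)
    s_next = MarkovGameState(r_o, v_o, r_t, v_t)
    reward = snr_reward(s_next)
    return s_next, reward, -reward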
Unlike Markov decision processes, where optimality is defined by maximizing the expected discounted sum of rewards, Markov games require a different solution concept because each player's reward depends on the other player's strategy. We therefore adopt the Nash equilibrium as our solution concept: a strategy profile in which no player can unilaterally improve its reward by deviating to a different strategy. Since Nash equilibria are fixed points in the joint strategy space, solving for the observer's optimal tasking strategy inherently also yields the target's optimal evasion strategy. The Nash equilibrium is particularly appealing in the zero-sum setting because it guarantees that, once the observer's strategy is deployed, the target can at best match the strategy's expected utility but cannot exceed it. Any deviation by the target from the prescribed equilibrium strategy is expected to reduce the target's utility, thereby improving the effectiveness of the observer's strategy.
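As a point of reference, with notation introduced here for illustration (not taken from the paper), write $U(\pi_o, \pi_e)$ for the observer's expected discounted utility under observer strategy $\pi_o$ and evader strategy $\pi_e$. A strategy profile $(\pi_o^*, \pi_e^*)$ is a Nash equilibrium of the zero-sum game when

\[
U(\pi_o, \pi_e^*) \;\le\; U(\pi_o^*, \pi_e^*) \;\le\; U(\pi_o^*, \pi_e) \qquad \text{for all } \pi_o,\ \pi_e,
\]

in which case $U(\pi_o^*, \pi_e^*) = \max_{\pi_o}\min_{\pi_e} U(\pi_o, \pi_e) = \min_{\pi_e}\max_{\pi_o} U(\pi_o, \pi_e)$ is the value of the game; equilibrium strategies may in general be mixed rather than deterministic.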
Previous game-theoretic approaches to SDA have typically either relied on full differentiability of dynamics and rewards or used an imperfect-information framework. Pursuit-evasion games are well studied for computing optimal intercept courses, but they assume a fully differentiable problem structure, which precludes discrete state components as well as most high-fidelity sensor models. Additionally, these methods often assume a deterministic (pure-strategy) Nash equilibrium, which does not always exist. Our Markov game formulation, by contrast, accommodates arbitrary dynamics and reward functions without requiring differentiability. Imperfect-information games offer a more general framework in which players do not necessarily know each other's actions or observations, but the computational difficulty of solving them severely limits their scalability. Whereas most imperfect-information game solutions are limited to offline computation, the full state observability of the Markov game allows us to leverage efficient online policy search.
Our approach to solving this custody maintenance Markov game adapts AlphaZero, a deep reinforcement learning framework originally designed for board games, to simultaneous-action Markov games. DeepMind's Alpha series, including AlphaGo and AlphaZero, demonstrated the power of reinforcement learning and self-play in solving complex, sequential decision-making problems in Go, chess, and shogi. AlphaZero is a hybrid online-offline solution method that combines deep reinforcement learning with Monte Carlo tree search (MCTS). Traditional deep reinforcement learning methods are purely offline due to the computational cost of training. Conversely, MCTS is an online method that makes decisions by refining a default policy with iterative tree growth. AlphaZero bridges this gap by iteratively using MCTS to improve a neural network policy, which in turn guides future searches. At deployment, the trained network provides a strong prior policy, requiring only shallow online searches to achieve competitive performance. This balance enables efficient, high-quality decision-making in adversarial satellite tracking. However, conventional AlphaZero assumes turn-based decision-making and is not directly applicable to Markov games in which players act simultaneously. To address this, we introduce a modified search strategy that accounts for simultaneous decision-making by both players.
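One standard way to extend AlphaZero-style search to simultaneous actions (not necessarily the modification used in this work) is decoupled selection, in which each player chooses its action from its own visit counts, value estimates, and network priors at a shared node. A minimal sketch under that assumption, with hypothetical node attributes, follows:

# Sketch of decoupled PUCT selection at a simultaneous-move node.
# This illustrates one common adaptation, not the paper's specific method.
import numpy as np

def decoupled_puct(prior, visit_counts, q_values, total_visits, c_puct=1.5):
    """Select an action index for ONE player from its own decoupled statistics.

    prior        -- policy-network probabilities over this player's actions
    visit_counts -- per-action visit counts for this player at this node
    q_values     -- per-action mean values from this player's perspective
    """
    u = c_puct * prior * np.sqrt(total_visits + 1) / (1 + visit_counts)
    return int(np.argmax(q_values + u))

def select_joint_action(node):
    """Each player selects independently; the pair indexes the child node.

    `node` is assumed (hypothetically) to hold decoupled statistics for both
    players, e.g. node.prior[p], node.n[p], node.q[p] for p in
    {0: observer, 1: evader}, with the evader's values negated so that both
    players maximize their own q.
    """
    a_obs = decoupled_puct(node.prior[0], node.n[0], node.q[0], node.total_n)
    a_evd = decoupled_puct(node.prior[1], node.n[1], node.q[1], node.total_n)
    return a_obs, a_evd

Decoupling avoids enumerating the joint action space at each node, keeping the per-node branching factor linear in each player's number of actions rather than quadratic in the joint actions.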
To evaluate our approach, we compute the exploitability of a synthesized strategy: the difference between its expected utility and the utility it actually achieves against a maximally exploitative, best-responding opponent. Using this exploitability metric, we show that we can synthesize strong tasking strategies that are robust to any policy an evader might employ.
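With the same illustrative notation introduced earlier, the exploitability of a synthesized observer strategy $\pi_o$ can be written as

\[
\mathrm{expl}(\pi_o) \;=\; U(\pi_o, \pi_e) \;-\; \min_{\pi_e'} U(\pi_o, \pi_e'),
\]

where $U(\pi_o, \pi_e)$ is the expected utility under the synthesized strategy profile and the minimization corresponds to a best-responding evader; an exact Nash equilibrium strategy has zero exploitability.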
Date of Conference: September 16-19, 2025
Track: Space Domain Awareness