On-board, Autonomous, Hybrid Spacecraft Subsystem Fault and Anomaly Detection, Diagnosis, and Recovery

Richard Stottler, Stottler Henke Associates, Inc.; Sowmya Ramachandran, Stottler Henke Associates, Inc.; Christian Belardi, Stottler Henke Associates, Inc.; Rocky Mandayam, Stottler Henke Associates, Inc.

Keywords: Fault Detection Isolation and Recovery (FDIR), Machine Learning, Model-Based Reasoning, Hybrid

Abstract:

     Future conflicts may involve attacks on space-based assets, physically and/or by interfering with their communications. In such situations it is imperative that the satellite already have onboard an autonomous ability to detect faults and other anomalies, determine the components involved and the root cause, determine the best method to recover mission capabilities, schedule the required recovery plan, and then adaptively execute it. Traditionally, Fault Detection, Isolation, and Recovery (FDIR) systems have utilized Model-Based Reasoning (MBR), which requires knowledge of the subsystem design and the behavior of components down to the desired level of diagnosis. To the degree this information is readily available, it is important to make good use of it. However, the field of machine learning (ML) has shown that systems can also learn, off-line, the normal behavior of complex systems in many different environments and states, and then detect abnormal behavior in real time. These systems can also be trained with known abnormal states and recognize these more specifically when they occur.
     This paper will describe NASA-funded applications of MBR and ML systems to several different spacecraft systems, as well as additional techniques associated with automatic intelligent planning and scheduling and adaptive execution, to create a complete high-level closed-loop approach to FDIR. This closed loop begins with the onboard monitoring of subsystem sensor data and associated commands. In the hybrid approach, this monitoring occurs in two ways. In the MBR portion of the monitor, a model of the spacecraft’s subsystems, components, and their interconnections is used to compare the actual sensor values against what would be expected based on the current space environment, spacecraft state, and the commands. Significant deviations of the sensor values, beyond an allowance for noise, are noted. Meanwhile, the ML portion uses a pre-trained Self-Organizing Map (SOM)-based architecture to produce high-resolution clusters of nominal system behavior that can distinguish between nominal and anomalous activity during online system monitoring. Regardless of how an anomaly is detected, the system begins a diagnosis process. As a human engineer would, the system usually first validates the anomalous sensor values. Often, sensors nearby (physically or in the connection diagram) can be used to confirm or refute the anomalous values. Refutation implies a sensor problem rather than a more significant fault. The next step is to reason upstream from the designated sensors to determine what components could potentially be at fault. If the sensors are adequately dense, the process may determine the faulty component or components. Often, however, there will be a set of possible candidate components, only some of which may be at fault.
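     The model-based deviation check and the upstream reasoning step can be illustrated with a minimal Python sketch. It is not the system described in this paper: the component names, the `FEEDS` connection graph, the noise tolerance, and both function names are assumptions made purely for illustration.

```python
from collections import deque

# Hypothetical connection graph: each component maps to the components it feeds.
FEEDS = {
    "solar_array": ["regulator"],
    "battery": ["regulator"],
    "regulator": ["bus_a"],
    "bus_a": ["bus_voltage_sensor"],
}


def deviates(expected, actual, noise_tolerance):
    """Return True when a reading differs from the model's expectation by more than the noise allowance."""
    return abs(actual - expected) > noise_tolerance


def upstream_candidates(feeds, sensor):
    """Collect every component that lies upstream of the given sensor."""
    # Invert the graph so each node lists the components that feed it.
    parents = {}
    for source, targets in feeds.items():
        for target in targets:
            parents.setdefault(target, []).append(source)

    candidates = set()
    queue = deque([sensor])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in candidates:
                candidates.add(parent)
                queue.append(parent)
    return candidates


if deviates(expected=28.0, actual=24.0, noise_tolerance=0.5):
    print(sorted(upstream_candidates(FEEDS, "bus_voltage_sensor")))
```

     With a single sensor at the end of the chain, every upstream component remains a candidate, which mirrors the case where the sensors are not dense enough to single out one component.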
     Most spacecraft systems have some degree of redundancy. Depending on the severity of the fault and the urgency with which capabilities must be restored, this redundancy can be used either to route around all of the possibly faulty components or to further refine the diagnosis by reconfiguring to isolate each candidate component in turn and so determine which is at fault. These options can be determined automatically from the graph of which components are connected to which others.
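     As a hedged illustration of deriving a reconfiguration from the connection graph, the sketch below uses a breadth-first search to find a path that avoids every candidate component. The `LINKS` topology, the component names, and `find_route` are hypothetical and are not taken from the systems described here.

```python
from collections import deque

# Hypothetical redundant power topology: each component maps to what it can feed.
LINKS = {
    "battery": ["regulator_a", "regulator_b"],
    "regulator_a": ["bus_a"],
    "regulator_b": ["bus_b"],
    "bus_a": ["payload"],
    "bus_b": ["payload"],
}


def find_route(links, source, load, excluded):
    """Breadth-first search for a path from source to load that avoids excluded components."""
    if source in excluded or load in excluded:
        return None
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == load:
            return path
        for neighbor in links.get(node, []):
            if neighbor not in visited and neighbor not in excluded:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # No route exists without using an excluded component.


# Route around both candidates on the A side.
print(find_route(LINKS, "battery", "payload", {"regulator_a", "bus_a"}))
```

     The same function could support the isolation strategy: excluding one candidate at a time and checking whether the anomaly persists on the resulting route narrows down which component is at fault.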
     Once the at-fault component or components are determined, the next step is to determine whether full capabilities can be restored or whether decreased capabilities will have to be accommodated. Especially in the latter case, replanning and rescheduling will likely be required. Usually a dynamic priority-based scheme is used to trade off different mission objectives based on the current environment and tactical situation. Optimizing constraint satisfaction is then used to meet as much of the desired tasking and as many of the mission objectives as possible within the dictated time frames and the new capability limits. The new task schedule is then passed to an adaptive execution system, often utilizing rules and/or finite state machines, which executes the commands required to enact the schedule and reacts to unexpected spacecraft subsystem responses. As the commands are executed, the sensors report new values resulting from the commands and/or from additional faults, and the entire cycle repeats, as necessary, to close the high-level loop.
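     To make the rescheduling trade-off concrete, the following sketch selects tasks under a reduced power budget. It substitutes a simple greedy, priority-ordered pass for the optimizing constraint satisfaction mentioned above, so it is only a rough stand-in. The task names, priorities, power figures, and `select_tasks` are all assumptions for illustration.

```python
# Hypothetical tasks as (name, priority, power_required_in_watts); higher priority wins.
TASKS = [
    ("imaging", 5, 40.0),
    ("downlink", 8, 30.0),
    ("calibration", 2, 20.0),
    ("housekeeping", 10, 10.0),
]


def select_tasks(tasks, power_budget):
    """Greedy pass: consider tasks from highest to lowest priority and keep each one that still fits."""
    chosen = []
    remaining = power_budget
    for name, _priority, power in sorted(tasks, key=lambda task: -task[1]):
        if power <= remaining:
            chosen.append(name)
            remaining -= power
    return chosen


# Full capability versus a degraded budget after a fault.
print(select_tasks(TASKS, 100.0))
print(select_tasks(TASKS, 60.0))
```

     A greedy pass can miss better combinations that a true optimizer would find, and a dynamic scheme would recompute the priorities from the current environment before each pass rather than fixing them in a table.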

Date of Conference: September 15-18, 2020

Track: Machine Learning Applications of SSA
