Human-Aligned Decision Transformers for Satellite Anomaly Response Operations, with Ethical Auditability Baked In
A Personal Learning Journey: From Academic Curiosity to Orbital Imperatives
My journey into this niche began not with satellites, but with a board game. While exploring reinforcement learning (RL) for a personal project—teaching an AI to play a complex strategy game—I kept hitting the same wall. The agent would learn to win, often spectacularly, but its decision-making process was a black box. It would make moves that were technically optimal according to its reward function but were utterly inexplicable, violating unspoken rules and long-term strategic principles. This wasn't just an academic annoyance; it was a fundamental flaw. If I couldn't understand why it chose a path, how could I ever trust it with something important?
This realization sent me down a rabbit hole of research into explainable AI (XAI) and offline reinforcement learning. I devoured papers on Decision Transformers, fascinated by their sequence-modeling approach to decision-making. Then, a chance conversation with a friend in aerospace engineering lit the spark. He described the agonizingly slow, human-intensive process of responding to satellite anomalies—a solar panel fails to deploy, a thruster misfires, a sensor goes noisy. Teams of experts would spend days analyzing telemetry, running simulations, and deliberating on corrective actions, all while the multi-million dollar asset drifted, its mission compromised.
The connection clicked instantly. What if we could use the trajectory-based reasoning of Decision Transformers to suggest immediate response actions? And what if we could bake in the ability to audit every decision against a framework of human-aligned ethics and operational constraints? The challenge was no longer a board game; it was a real-world problem where transparency wasn't a nice-to-have, but a non-negotiable requirement for safety, accountability, and trust. This article is the culmination of my subsequent months of research, experimentation, and prototype development at the intersection of these fields.
Technical Background: Decision Transformers and The Alignment Problem
Traditional RL agents learn a policy (π) that maps states (s) to actions (a) by maximizing a reward (r). The "reward is enough" hypothesis falls apart in high-stakes environments. An agent could learn to stabilize a satellite's tumble by firing all thrusters at once, achieving "stability" while exhausting precious fuel and dooming the mission. This is a misalignment between the proxy reward (reduce angular velocity) and the true human objective (preserve mission lifetime).
Decision Transformers (DTs), introduced by Chen et al., reframe RL as a conditional sequence modeling problem. Instead of learning from rewards, they learn from trajectories of states, actions, and returns-to-go (RTG). The RTG is the sum of future rewards from that point in the trajectory. The model, typically a GPT-style transformer, is trained to predict actions autoregressively, conditioned on past states, actions, and the desired RTG.
Trajectory τ = (s_1, a_1, R_1, s_2, a_2, R_2, ..., s_T, a_T, R_T)
where R_t = Σ_{k=t}^{T} r_k (the return-to-go from step t)
During my experimentation, I found this paradigm shift profound. By conditioning on a target RTG, you can guide the agent's behavior at inference time. Want a conservative, fuel-saving policy? Input a moderate target RTG. Need an aggressive stabilization maneuver? Input a high target RTG. This gives a direct dial for influencing agent behavior.
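To make the RTG concrete: it is just a reverse cumulative sum over the reward sequence, computable in one line. A minimal sketch, independent of any satellite specifics:

```python
import numpy as np

def returns_to_go(rewards):
    """Compute R_t = sum_{k=t}^{T} r_k for every step t."""
    # Reverse the sequence, cumulatively sum, then reverse back.
    return np.cumsum(rewards[::-1])[::-1]

rewards = np.array([1.0, 0.0, 2.0, 1.0])
print(returns_to_go(rewards))  # [4. 3. 3. 1.]
```

At inference time, the operator picks a target value for R_t directly; the model then generates actions consistent with trajectories that achieved that return.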
However, the core problem remained: alignment and auditability. The model learns correlations from historical data, which may contain biases, suboptimal human decisions, or edge cases not covered by ethical guidelines (e.g., "never point an imaging satellite at a densely populated area during a test maneuver"). Baking in auditability means designing the system so that every decision can be traced back to:
- The data it was derived from.
- The explicit constraints it was subjected to.
- The quantifiable trade-offs it made.
Architectural Blueprint: Baking in Ethics and Auditability
The solution I converged upon through iterative prototyping is a multi-component architecture. The key insight from my research was that alignment cannot be an afterthought; it must be embedded in the data representation, the model's conditioning, and the inference loop.
1. The Ethically-Augmented Trajectory Representation
The first step is to enrich the standard DT trajectory. We add two critical elements: Operational Constraints (C) and Ethical State Embeddings (E).
import numpy as np
import torch

class EthicalTrajectoryDataset(torch.utils.data.Dataset):
    """
    A dataset for sequences of ethically-augmented satellite states.
    """
    def __init__(self, trajectories, context_len=30):
        self.trajectories = trajectories  # List of dicts
        self.context_len = context_len

    def __len__(self):
        return len(self.trajectories)

    def __getitem__(self, idx):
        traj = self.trajectories[idx]
        # Standard DT components
        states = torch.tensor(traj['states'], dtype=torch.float32)    # e.g., [pos, vel, temp, power]
        actions = torch.tensor(traj['actions'], dtype=torch.float32)  # e.g., [thrust_x, torque_y]
        rtg = torch.tensor(traj['rtg'], dtype=torch.float32)          # return-to-go

        # Augmented components for alignment
        constraints = torch.tensor(traj['constraints'], dtype=torch.float32)
        # e.g., [fuel_remaining, max_thrust, forbidden_zone_flag]
        ethical_embed = torch.tensor(traj['ethical_embed'], dtype=torch.float32)
        # Pre-computed vector encoding: [priv_violation_risk, debris_risk, treaty_compliance]

        # Stack into a single sequence token per timestep.
        # This structure is what enables auditability.
        tokens = torch.cat([states, actions, rtg.unsqueeze(-1),
                            constraints, ethical_embed], dim=-1)

        # Sample a context window
        start_idx = np.random.randint(0, max(1, len(tokens) - self.context_len))
        context_tokens = tokens[start_idx:start_idx + self.context_len]

        # For training: predict the action given past context.
        # x: all tokens up to the action position.
        # y: the action components of the next token.
        state_dim = states.shape[-1]
        action_dim = actions.shape[-1]
        x = context_tokens[:-1].flatten()  # Context
        y = context_tokens[1:, state_dim:state_dim + action_dim].flatten()  # Next actions
        return x, y
This data structure is the foundation of auditability. Every predicted action is intrinsically linked to the constraints and ethical state that preceded it.
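The token layout can be sanity-checked without the full dataset class: concatenate the five components and confirm that each slice is recoverable by fixed offsets, which is exactly what the training-target indexing relies on. The dimensions below are illustrative, not from any real mission:

```python
import numpy as np

# Illustrative dims: 4 state, 2 action, 1 rtg, 3 constraint, 3 ethical.
T = 10
states = np.random.randn(T, 4)
actions = np.random.randn(T, 2)
rtg = np.random.randn(T)
constraints = np.random.randn(T, 3)
ethical = np.random.randn(T, 3)

# One token per timestep: [state | action | rtg | constraints | ethical]
tokens = np.concatenate([states, actions, rtg[:, None], constraints, ethical], axis=-1)
print(tokens.shape)  # (10, 13)

# The action slice sits at a fixed offset, so every logged token
# can later be decomposed exactly during an audit.
assert np.array_equal(tokens[:, 4:6], actions)
```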
2. The Human-Aligned Decision Transformer (HADT) Model
The model itself is a modified transformer decoder. The critical design choice, validated through ablation studies in my experiments, is to use separate conditioning heads for the RTG, constraints, and ethical embeddings. This lets us cleanly intervene on each of these inputs at inference time.
import torch
import torch.nn as nn

class MultiConditionalAttentionBlock(nn.Module):
    """A transformer block with distinct conditioning pathways."""
    def __init__(self, embed_dim, num_heads, cond_keys=('rtg', 'constraint', 'ethical')):
        super().__init__()
        self.embed_dim = embed_dim
        self.ln1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Separate projection networks for each condition type.
        # Conditions arrive already embedded to embed_dim (see the model below).
        self.cond_projs = nn.ModuleDict({
            key: nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.GELU())
            for key in cond_keys
        })
        self.ln2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim)
        )

    def forward(self, x, conditions):
        # x: sequence of state+action embeddings
        # conditions: dict of condition sequences, each (batch, seq, embed_dim)
        attn_input = self.ln1(x)
        # Add each condition's influence
        for key, cond_seq in conditions.items():
            if cond_seq is not None:
                attn_input = attn_input + self.cond_projs[key](cond_seq)
        # Causal self-attention: each step may only attend to the past
        seq_len = x.shape[1]
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        attn_out, _ = self.attn(attn_input, attn_input, attn_input, attn_mask=causal_mask)
        x = x + attn_out
        # FFN
        x = x + self.mlp(self.ln2(x))
        return x

class HumanAlignedDecisionTransformer(nn.Module):
    def __init__(self, state_dim, act_dim, embed_dim, num_layers, num_heads):
        super().__init__()
        self.state_embed = nn.Linear(state_dim, embed_dim)
        self.action_embed = nn.Linear(act_dim, embed_dim)
        # Separate embeddings for conditions (allows zeroing out)
        self.rtg_embed = nn.Linear(1, embed_dim)
        self.constraint_embed = nn.Linear(3, embed_dim)  # example dims
        self.ethical_embed = nn.Linear(3, embed_dim)
        self.blocks = nn.ModuleList([
            MultiConditionalAttentionBlock(embed_dim, num_heads)
            for _ in range(num_layers)
        ])
        self.ln_f = nn.LayerNorm(embed_dim)
        self.action_head = nn.Linear(embed_dim, act_dim)
        # Learnable positional embeddings
        self.pos_embed = nn.Parameter(torch.zeros(1, 1024, embed_dim))

    def forward(self, states, actions, rtg, constraints, ethical):
        # states, actions: sequences up to time t-1
        # rtg (batch, seq, 1), constraints, ethical: sequences up to time t
        batch_size, seq_len = states.shape[0], states.shape[1]
        # Embeddings
        state_emb = self.state_embed(states)
        action_emb = self.action_embed(actions) if actions is not None else 0
        # Combine state and previous-action embeddings into one token per step
        token_emb = state_emb + action_emb  # Simplified combination
        # Add positional embedding
        token_emb = token_emb + self.pos_embed[:, :seq_len, :]
        # Conditions are embedded once here; each block applies its own projection
        conditions = {
            'rtg': self.rtg_embed(rtg),
            'constraint': self.constraint_embed(constraints),
            'ethical': self.ethical_embed(ethical)
        }
        # Forward through blocks
        x = token_emb
        for block in self.blocks:
            x = block(x, conditions)
        x = self.ln_f(x)
        action_pred = self.action_head(x)
        return action_pred
This architecture makes the influence of each ethical and operational factor explicit and separable. During an audit, we can replay a decision and observe, for example, how the output changes if the "forbidden_zone_flag" constraint is toggled.
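The replay harness itself can be model-agnostic: wrap any forward function and diff its outputs under a toggled condition. A minimal sketch with a toy stand-in policy (the helper name `counterfactual_delta` and the toy are my own, not part of the model above):

```python
import numpy as np

def counterfactual_delta(forward_fn, inputs, key, toggled_value):
    """Re-run a logged decision with one condition flipped; return
    (original_action, counterfactual_action, difference)."""
    base = forward_fn(inputs)
    cf_inputs = dict(inputs)
    cf_inputs[key] = toggled_value
    cf = forward_fn(cf_inputs)
    return base, cf, cf - base

# Toy stand-in for the HADT forward pass: the action shrinks
# when the forbidden-zone flag is set.
def toy_policy(inp):
    scale = 0.1 if inp["forbidden_zone_flag"] else 1.0
    return scale * np.array([1.0, 2.0])

logged = {"forbidden_zone_flag": False}
base, cf, delta = counterfactual_delta(toy_policy, logged, "forbidden_zone_flag", True)
print(base, cf, delta)  # [1. 2.] [0.1 0.2] [-0.9 -1.8]
```

During a real audit the toy policy would be replaced by the trained model's forward pass, fed with the logged inputs.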
3. The Real-Time Audit Logger
Auditability requires logging not just the decision, but the context of the decision. I implemented a lightweight logger that captures the full input context for every inference call.
import json
import os
import hashlib
from datetime import datetime

class EthicalAuditLogger:
    def __init__(self, log_dir="./audit_logs"):
        self.log_dir = log_dir
        os.makedirs(log_dir, exist_ok=True)

    def log_decision(self, satellite_id, timestamp, model_input, model_output,
                     human_override=None, override_reason=""):
        """
        Logs a complete decision context for future audit.
        """
        # Create a deterministic hash of the input for quick lookup/grouping
        input_hash = hashlib.sha256(
            str(model_input).encode() + satellite_id.encode()
        ).hexdigest()[:16]
        log_entry = {
            "decision_id": f"{satellite_id}_{timestamp.isoformat()}_{input_hash}",
            "satellite_id": satellite_id,
            "timestamp": timestamp.isoformat(),
            "model_input": {
                "states": model_input['states'].tolist() if hasattr(model_input['states'], 'tolist') else model_input['states'],
                "target_rtg": float(model_input['target_rtg']),
                "constraints": model_input['constraints'].tolist() if hasattr(model_input['constraints'], 'tolist') else model_input['constraints'],
                "ethical_state": model_input['ethical_state'].tolist() if hasattr(model_input['ethical_state'], 'tolist') else model_input['ethical_state']
            },
            "model_output": {
                "recommended_action": model_output.tolist() if hasattr(model_output, 'tolist') else model_output,
                "action_confidence": float(model_output.std())  # example metric
            },
            "human_intervention": {
                "overridden": human_override is not None,
                "final_action": human_override.tolist() if human_override is not None else None,
                "reason": override_reason
            },
            "audit_trail": []  # For post-hoc annotations by engineers
        }
        # Save to a date-partitioned JSONL file
        date_str = timestamp.strftime("%Y-%m-%d")
        filename = f"{self.log_dir}/{satellite_id}_{date_str}.jsonl"
        with open(filename, 'a') as f:
            f.write(json.dumps(log_entry) + '\n')
        return log_entry["decision_id"]
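Reading the logs back for an audit is then just a scan over the date-partitioned JSONL files. A small helper along these lines (the `load_decisions` name is hypothetical, not part of the logger above), exercised here with a hand-written toy entry:

```python
import json
import os
import tempfile

def load_decisions(log_dir, satellite_id, date_str):
    """Return all decision records for one satellite on one day."""
    path = os.path.join(log_dir, f"{satellite_id}_{date_str}.jsonl")
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Round-trip check with a toy entry.
with tempfile.TemporaryDirectory() as d:
    entry = {"decision_id": "SAT-1_2024-01-01T00:00:00_abc", "satellite_id": "SAT-1"}
    with open(os.path.join(d, "SAT-1_2024-01-01.jsonl"), "a") as f:
        f.write(json.dumps(entry) + "\n")
    records = load_decisions(d, "SAT-1", "2024-01-01")
    print(len(records))  # 1
```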
Implementation in Action: Simulating an Anomaly Response
Let's walk through a simplified scenario. The satellite "Voyager-6" experiences a sudden attitude disturbance (a tumble). The ground system detects the anomaly.
Step 1: Context Assembly. The system gathers the last 30 minutes of telemetry (states), calculates the current operational constraints (fuel < 30%, in a crowded orbital slot), and computes the ethical state vector (low population risk, medium debris risk, in compliance).
Step 2: Target RTG Selection. Here, human alignment is direct. A human operator or a meta-policy sets the target_rtg. A high value might prioritize immediate stabilization at all costs. A moderate value, aligned with a "conserve resources" doctrine, would balance stabilization with fuel preservation. My experimentation showed that letting a separate, simple policy network learn to set the target RTG based on mission phase further improved alignment.
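Before graduating to a learned meta-policy, the doctrine-to-RTG mapping can start as a simple lookup. A sketch with illustrative phases, thresholds, and values, not the actual policy from my prototype:

```python
def select_target_rtg(mission_phase, fuel_fraction):
    """Map mission phase and fuel state to a target return-to-go.
    Values are illustrative; a small learned policy replaced this later."""
    base = {"commissioning": 0.9, "nominal_ops": 0.6, "end_of_life": 0.3}[mission_phase]
    # Scale ambition down as fuel gets scarce ("conserve resources" doctrine):
    # full target above 30% fuel, proportionally less below it.
    return base * min(1.0, fuel_fraction / 0.3)

print(select_target_rtg("nominal_ops", 0.5))   # 0.6 (fuel comfortable)
print(select_target_rtg("nominal_ops", 0.15))  # 0.3 (fuel-constrained)
```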
Step 3: Constrained Inference. The model does not output raw actions. It outputs suggestions that are immediately passed through a hard-coded constraint filter. This is a critical safety layer I implemented after early tests showed the model could occasionally suggest physically impossible maneuvers.
import numpy as np

class ConstraintActionFilter:
    def __init__(self, satellite_dynamics_model):
        self.dynamics = satellite_dynamics_model

    def filter(self, suggested_action, current_constraints):
        filtered_action = suggested_action.copy()
        # 1. Fuel Budget Hard Constraint
        max_fuel_use = current_constraints['fuel_remaining'] * 0.05  # Use at most 5% of remaining fuel
        total_impulse = np.linalg.norm(filtered_action[:3])
        if total_impulse > max_fuel_use:
            scale_factor = max_fuel_use / total_impulse
            filtered_action[:3] *= scale_factor
        # 2. Forbidden Pointing Constraint (e.g., avoid imaging populated areas)
        if current_constraints['forbidden_zone_flag']:
            # Zero out torque axes that would point the instrument in a forbidden direction
            filtered_action[3:] *= np.array([1.0, 0.0, 1.0])  # Example: nullify y-axis torque
        # 3. Dynamic Feasibility Check (simplified)
        if not self.dynamics.is_maneuver_feasible(filtered_action):
            # Fall back to a minimal, safe damping maneuver
            filtered_action = np.array([0.0, 0.0, 0.0, -0.1, -0.1, -0.1])
        return filtered_action
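The fuel clamp is worth seeing in isolation: it rescales the translational impulse while leaving the torque components untouched. A standalone restatement of the filter's first rule, with illustrative numbers:

```python
import numpy as np

def clamp_impulse(action, fuel_remaining, budget_frac=0.05):
    """Scale the translational impulse (first 3 components) so it
    never exceeds budget_frac of remaining fuel; torques pass through."""
    max_fuel_use = fuel_remaining * budget_frac
    impulse = np.linalg.norm(action[:3])
    out = action.copy()
    if impulse > max_fuel_use:
        out[:3] *= max_fuel_use / impulse
    return out

a = np.array([3.0, 0.0, 4.0, 0.1, 0.0, 0.0])    # impulse norm = 5.0
clamped = clamp_impulse(a, fuel_remaining=20.0)  # budget = 20.0 * 0.05 = 1.0
print(np.linalg.norm(clamped[:3]))  # 1.0
```

Because the rescaling is proportional, the direction of the commanded burn is preserved; only its magnitude is reduced to fit the budget.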
Step 4: Human-in-the-Loop Review & Audit. The filtered recommended action is presented to the human operator with its full audit context: the target RTG used, the dominant constraints that modified it, and the ethical risk scores. The operator can accept, modify, or override it. The EthicalAuditLogger captures everything.
Challenges, Solutions, and Future Directions
My exploration was not without significant hurdles.
Challenge 1: The Sim-to-Real Gap. Training requires vast amounts of anomaly data, which is thankfully rare in reality. My solution was to use high-fidelity simulation environments like NASA's GMAT or AGI's STK, and then employ adversarial anomaly generation to create a robust training dataset of "what-if" scenarios.
Challenge 2: Quantifying the "Ethical State." How do you turn a principle like "avoid creating space debris" into a number? Through research, I landed on a multi-faceted approach: pre-computed risk scores from external models (e.g., debris collision probability models) combined with rule-based flags (e.g., treaty-defined restricted zones).
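Concretely, the 3-dimensional ethical state vector from the dataset section can be assembled by normalizing external model scores and attaching rule-based flags. The scale factors and thresholds below are placeholders, not calibrated values:

```python
def build_ethical_state(overflight_pop_density, debris_prob, in_restricted_zone):
    """Pack heterogeneous signals into the ethical embedding:
    [priv_violation_risk, debris_risk, treaty_compliance]."""
    # Placeholder normalization: saturate at 10,000 people/km^2.
    priv_risk = min(1.0, overflight_pop_density / 10_000.0)
    # Placeholder scaling of a collision-probability model's output.
    debris_risk = min(1.0, debris_prob * 100.0)
    # Rule-based flag: 1.0 = compliant, 0.0 = inside a restricted zone.
    treaty_compliance = 0.0 if in_restricted_zone else 1.0
    return [priv_risk, debris_risk, treaty_compliance]

print(build_ethical_state(2500.0, 0.002, False))
```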
Challenge 3: The Performance vs. Auditability Trade-off. Adding multiple conditioning vectors and logging every inference has a computational cost. My optimization involved using cached ethical embeddings for nominal states and only recalculating them when the anomaly detection threshold was crossed.
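The caching trick amounts to memoizing the embedding while telemetry stays nominal and recomputing only when the anomaly detector fires. A sketch (the class name and the anomaly-flag interface are my own):

```python
class CachedEthicalEmbedder:
    """Serve a cached ethical embedding during nominal operation;
    recompute only when an anomaly is flagged."""
    def __init__(self, compute_fn):
        self.compute_fn = compute_fn
        self.cache = None
        self.recomputes = 0

    def get(self, telemetry, anomaly_flag):
        if self.cache is None or anomaly_flag:
            self.cache = self.compute_fn(telemetry)
            self.recomputes += 1
        return self.cache

embedder = CachedEthicalEmbedder(lambda t: [t["risk"], 0.0, 1.0])
embedder.get({"risk": 0.1}, anomaly_flag=False)       # first call computes
embedder.get({"risk": 0.9}, anomaly_flag=False)       # served from cache
out = embedder.get({"risk": 0.9}, anomaly_flag=True)  # anomaly -> recompute
print(embedder.recomputes, out)  # 2 [0.9, 0.0, 1.0]
```

The cost of this shortcut is that the ethical state can go stale between anomalies, which is acceptable only because every inference still logs the embedding it actually used.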
Looking further ahead, my research points to several exciting frontiers:
- **Quantum-Enhanced Constraint