I Trained an AI to Beat Final Fight… Here’s What Happened [P]

Reddit r/MachineLearning / 5/4/2026


Key Points

  • The author trained an AI agent for the arcade game Final Fight using behavior cloning from demonstrations, then evaluated its progress in the first stage.
  • They encountered several practical training and evaluation issues, including action-space remapping from a MultiBinary representation to emulator inputs, trajectory alignment/offset bugs, and different LSTM behavior between evaluation and manual rollouts.
  • The agent shows some ability to make progress but struggles with consistency and survival, indicating limits of pure imitation learning in this setting.
  • The author plans to extend the approach with GAIL plus PPO to improve performance beyond imitation and is seeking community advice on limited-data BC, transitioning from BC to PPO, and handling partial observability.
  • Code and results are shared via a GitHub repository so others can review the full experimental pipeline.

Hey everyone,

I’ve been experimenting with Behavior Cloning on a classic arcade game (Final Fight), and I wanted to share the results and get some feedback from the community.

The setup is fairly simple: I trained an agent purely from demonstrations (no reward shaping initially), then evaluated how far it could go in the first stage. I also plan to extend this with GAIL + PPO to see how much performance improves beyond imitation.
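
For context, the BC phase boils down to supervised learning over (observation, action) pairs from the demos. Here's a simplified sketch of the idea in PyTorch; the shapes, names, and architecture are illustrative, not the exact code from the repo:

```python
import torch
import torch.nn as nn

# Illustrative shapes: 4 stacked 84x84 grayscale frames,
# 12 emulator buttons (MultiBinary action space).
N_BUTTONS = 12

class BCPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),
            nn.Linear(512, N_BUTTONS),  # one logit per button
        )

    def forward(self, obs):
        return self.net(obs)

policy = BCPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
# Multi-label loss: each button is an independent on/off decision.
loss_fn = nn.BCEWithLogitsLoss()

def bc_step(obs_batch, act_batch):
    """One supervised update on demonstration pairs (obs_t, act_t)."""
    logits = policy(obs_batch)
    loss = loss_fn(logits, act_batch.float())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

The detail that matters for a beat 'em up is the multi-label loss (BCEWithLogitsLoss) instead of plain cross-entropy, since MultiBinary actions let several buttons be held at once (e.g., walking while attacking).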

A few interesting challenges came up (quick sketches of each follow the list):

  • Action space remapping (MultiBinary → emulator input)
  • Trajectory alignment issues (obs/action offset bugs 😅)
  • LSTM policy behaving differently under evaluation vs manual rollout
  • Managing rollouts efficiently without loading everything into memory
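
On the remapping point: the MultiBinary vector needs a stable index → button mapping that matches the order the emulator expects. A simplified sketch (the button layout below is illustrative; the actual emulator ordering may differ):

```python
import numpy as np

# Illustrative button order -- the real emulator layout may differ.
BUTTONS = ["B", "A", "MODE", "START", "UP", "DOWN",
           "LEFT", "RIGHT", "C", "Y", "X", "Z"]

def action_to_buttons(action: np.ndarray) -> dict:
    """Map a MultiBinary(12) action vector to named emulator inputs."""
    assert action.shape == (len(BUTTONS),)
    return {name: bool(bit) for name, bit in zip(BUTTONS, action)}
```

Getting this ordering wrong fails silently: the policy trains fine but presses the wrong buttons at playback.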
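
The offset bugs are usually the classic off-by-one: pairing obs[t] with the action that *produced* it (action[t-1]) instead of the action taken in response to it. A sanity check along these lines helps (this assumes the recorder logs one trailing observation; adapt to your own logging format):

```python
def align(obs_list, act_list):
    """Pair each observation with the action taken *after* seeing it.

    If transitions were stored as (action, resulting_obs), the whole
    dataset is shifted by one frame and BC quietly learns to react
    one step late.
    """
    assert len(obs_list) == len(act_list) + 1  # one trailing obs
    return list(zip(obs_list[:-1], act_list))
```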
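
For the LSTM discrepancy, a common culprit is hidden-state handling: whether the recurrent state is carried between steps and reset on episode boundaries the same way in both code paths. A manual rollout sketch with explicit state threading (Gymnasium-style API; the policy signature here is illustrative):

```python
import torch

@torch.no_grad()
def rollout(env, policy, max_steps=5000):
    """Step the env while threading the LSTM state explicitly.

    Forgetting to carry `hidden` between steps, or to reset it on
    episode boundaries, makes manual rollouts diverge from the
    batched evaluation path.
    """
    obs, _ = env.reset()
    hidden = None  # most LSTM modules treat None as a zero state
    for _ in range(max_steps):
        x = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
        logits, hidden = policy(x, hidden)  # carry state forward
        # Threshold per-button logits into a MultiBinary action.
        action = (torch.sigmoid(logits) > 0.5).squeeze(0).numpy()
        obs, reward, terminated, truncated, _ = env.step(action)
        if terminated or truncated:
            obs, _ = env.reset()
            hidden = None  # reset memory between episodes
```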
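
And for memory, one approach that scales is keeping the demos on disk and memory-mapping them, so batches are read lazily instead of loading every trajectory up front. A sketch with numpy memmaps and a PyTorch Dataset (the file layout is hypothetical):

```python
import numpy as np
from torch.utils.data import Dataset, DataLoader

class DemoDataset(Dataset):
    """Lazily reads (obs, action) pairs from memory-mapped .npy files."""

    def __init__(self, obs_path, act_path):
        # mmap_mode="r" keeps the arrays on disk; only the slices
        # touched by __getitem__ get paged into RAM.
        self.obs = np.load(obs_path, mmap_mode="r")
        self.acts = np.load(act_path, mmap_mode="r")

    def __len__(self):
        return len(self.acts)

    def __getitem__(self, i):
        # Copy so the returned arrays own their memory.
        return self.obs[i].copy(), self.acts[i].copy()

loader = DataLoader(DemoDataset("obs.npy", "acts.npy"),
                    batch_size=64, shuffle=True, num_workers=2)
```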

The agent can already make some progress, but still struggles with consistency and survival.

I’d love to hear thoughts on:

  • Improving BC performance with limited trajectories
  • Best practices for transitioning BC → PPO (one warm-start sketch follows this list)
  • Handling partial observability in these environments
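
On the BC → PPO question specifically, the pattern I keep seeing recommended is warm-starting: initialize the PPO policy network from the BC weights so RL fine-tunes the imitated behavior instead of starting from scratch. A rough sketch with sb3-contrib's RecurrentPPO, which also covers the partial-observability angle via its recurrent policy (it assumes the BC net and the PPO policy share an architecture and parameter names, which in practice needs a small key-mapping step):

```python
from sb3_contrib import RecurrentPPO

# `env` is the emulator environment; `bc_state_dict` holds the
# behavior-cloning weights. Matching parameter names/shapes between
# the BC net and the PPO policy is the fiddly part in practice.
model = RecurrentPPO("CnnLstmPolicy", env,
                     learning_rate=2.5e-4, verbose=1)
model.policy.load_state_dict(bc_state_dict, strict=False)

# Fine-tune with RL; a modest learning rate early on helps avoid
# catastrophically forgetting the cloned behavior.
model.learn(total_timesteps=1_000_000)
```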

Here’s the code if you want to see the full process and results:
notebooks-rl/final_fight at main · paulo101977/notebooks-rl: https://github.com/paulo101977/notebooks-rl/tree/main/final_fight

Any feedback is very welcome!

submitted by /u/AgeOfEmpires4AOE4
