Offline RL Simulation

Need to simulate an RL agent on logged states. The policy is known, but there is no feedback mechanism: nothing returns rewards or next states for the actions the policy would take.

Option 1 - Learn the world
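A minimal sketch of what "learn the world" could mean here: fit a one-step dynamics model on logged transitions, then roll the known policy out inside that model instead of the real environment. Everything below is illustrative (the toy linear dynamics, the placeholder policy, and the least-squares model are all assumptions, not anything from the papers).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy logged transitions: assumed true dynamics s' = 0.9*s + 0.1*a + noise.
S = rng.normal(size=(500, 1))
A = rng.normal(size=(500, 1))
S_next = 0.9 * S + 0.1 * A + 0.01 * rng.normal(size=(500, 1))

# Least-squares fit of s' from [s, a] -- a stand-in for a learned world model.
X = np.hstack([S, A])
W, *_ = np.linalg.lstsq(X, S_next, rcond=None)

def policy(s):
    # The "known policy" we want to simulate (placeholder: proportional control).
    return -0.5 * s

# Roll the policy out in the learned model instead of the real environment.
s = np.array([1.0])
trajectory = [s]
for _ in range(20):
    s = np.hstack([s, policy(s)]) @ W
    trajectory.append(s)
```

The point of the sketch is the loop at the end: once the model is fit, simulation never touches logged data again, which is exactly where model error compounds over long rollouts.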

Option 2 - Cluster the observations, use the nearest one
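A sketch of the nearest-observation idea, under assumptions: raw observations are compared with Euclidean distance (in practice they would first pass through a learned state encoder, as discussed below), and the logged reward of the nearest neighbor is reused as the simulated feedback. All names and the toy data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Logged observations with associated rewards (toy data).
logged_obs = rng.normal(size=(1000, 4))
logged_reward = logged_obs.sum(axis=1)  # placeholder reward signal

def nearest_logged(query):
    """Index of the logged observation closest to `query` (brute force)."""
    dists = np.linalg.norm(logged_obs - query, axis=1)
    return int(np.argmin(dists))

# Simulate feedback for an unseen state near a logged one.
query = logged_obs[42] + 0.001
idx = nearest_logged(query)
simulated_reward = logged_reward[idx]
```

Brute-force search is O(n) per query; a KD-tree or approximate nearest-neighbor index would be the obvious swap at scale.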

Bib

Currently exploring offline RL for work; it's pretty interesting.

Models for Option 2 involve some crazy math: learning "kinematic inseparability" to create a state encoder for clustering. Learning a representation of states in a general RL problem is hard because it is inextricably tied to exploration.

The papers below are a really good read, and also a good intro to RL research for newcomers.

[1] MSFT offline-rl-sim
[2] HOMER