How can we formulate the exploration-exploitation trade-off better than all the hacks layered on top of the Bellman equation?
We can, first of all, simply estimate the advantage of exploration by Monte Carlo in a swarm setting: pitting fully exploitative agents against fully exploitative agents that have the benefit of recent exploration. This can be done easily with lagged policy models.
Of course the advantage of exploration needs to be divided by the cost of exploration, which is linear in the number of agents used in the swarm to explore at a particular state.
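A minimal sketch of that Monte-Carlo estimate, assuming a hypothetical `env.step(state, action)` interface and a `policy.act` method (illustrative names, not any real library's API): the return of the exploration-informed policy is compared against its lagged twin from the same state, and the gap is normalized by the linear swarm cost.

```python
import numpy as np

def rollout_return(env, policy, state, horizon=200, gamma=0.99):
    """Run one greedy (fully exploitative) rollout and return the discounted return."""
    total, discount, s = 0.0, 1.0, state
    for _ in range(horizon):
        a = policy.act(s, greedy=True)        # exploitative action only
        s, r, done = env.step(s, a)           # hypothetical env interface
        total += discount * r
        discount *= gamma
        if done:
            break
    return total

def exploration_advantage(env, state, policy_new, policy_lagged,
                          n_agents=32, cost_per_agent=0.01):
    """Monte-Carlo estimate of the net advantage of recent exploration at `state`.

    policy_new    : policy trained WITH the most recent exploration data
    policy_lagged : the same policy frozen BEFORE that exploration happened
    The raw advantage is divided by the (linear) cost of the exploring swarm.
    """
    returns_new = [rollout_return(env, policy_new, state) for _ in range(n_agents)]
    returns_old = [rollout_return(env, policy_lagged, state) for _ in range(n_agents)]
    raw_advantage = np.mean(returns_new) - np.mean(returns_old)
    exploration_cost = cost_per_agent * n_agents   # cost grows linearly with swarm size
    return raw_advantage / exploration_cost
```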
Note that the advantage of exploration depends on the state of the agent, so we might want to define an explorative critic to estimate this.
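Such an explorative critic could be fit by plain regression on those Monte-Carlo targets. A rough PyTorch sketch, with the network shape and the `train_step` helper purely illustrative:

```python
import torch
import torch.nn as nn

class ExplorativeCritic(nn.Module):
    """Regresses the (cost-normalized) advantage of exploration from the state."""
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)

def train_step(critic, optimizer, states, measured_advantages):
    """One supervised step on Monte-Carlo exploration-advantage estimates from the swarm."""
    pred = critic(states)
    loss = nn.functional.mse_loss(pred, measured_advantages)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```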
What's beautiful about this formulation is that we can incorporate autoregressive #WorldModels naturally: the exploitative agents learn only from rewards, while the explorative agents choose their actions so as to maximize the improvement of the autoregressive world model.
It brings these two concepts together as two sides of the same coin.
Exploitation is reward-guided action; exploration is action guided by improving the autoregressive state-transition model.
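One way to make "world-model-improvement-guided action" concrete, sketched here with ensemble disagreement as a stand-in proxy for expected model improvement (the ensemble, the `predict_next` method, and the candidate-action set are all assumptions, not anything prescribed above):

```python
import torch

def exploration_score(world_model_ensemble, state, action):
    """Proxy for expected world-model improvement: disagreement among an
    ensemble of autoregressive transition models about the next state."""
    with torch.no_grad():
        preds = torch.stack([m.predict_next(state, action)
                             for m in world_model_ensemble])   # (K, state_dim)
    return preds.var(dim=0).mean().item()                      # high variance = much left to learn

def explorative_action(world_model_ensemble, state, candidate_actions):
    """The explorative agent picks the action expected to improve the world model the most."""
    scores = [exploration_score(world_model_ensemble, state, a)
              for a in candidate_actions]
    return candidate_actions[int(torch.tensor(scores).argmax())]
```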
Balancing the two is a swarm dynamic that encourages branching where exploration has positive expected value in reward terms. This can be estimated by computing the advantage of exploitative agents that utilize recent exploration over agents that do not, and crediting this advantage back to the points where the two diverge.
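A toy sketch of that credit assignment, assuming discretized (hashable) states and greedy rollouts recorded as lists of (state, action) pairs; `credit_divergence` and `branch_value` are hypothetical names:

```python
from collections import defaultdict

# branch_value[s] accumulates the estimated value of having explored from state s
branch_value = defaultdict(float)

def credit_divergence(traj_informed, traj_lagged, advantage):
    """Return the measured advantage to the first point where the
    exploration-informed trajectory departs from the lagged one.

    Each trajectory is a list of (state, action) pairs from a greedy rollout
    started at the same initial state; states are assumed hashable.
    """
    for (s_new, a_new), (s_old, a_old) in zip(traj_informed, traj_lagged):
        if a_new != a_old:                  # first decision that differs
            branch_value[s_new] += advantage
            return s_new
    return None                             # no divergence: nothing to credit
```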