
#rl #job #whining

I made a list of pros and cons for different career paths I could take if I get fired from my current one. It ended up being a depressing exercise. The math just doesn't work if I try striking out on my own. The stuff I want to sell is mostly in the $150-200 range. I'd have to sell about 5 a week to come close to what I'm making now (and that's ignoring overhead, tax, etc.), which means *making* that many a week too.

I'm so damn risk-averse. Terrified to make a wrong step.

Self-Improving Reasoners.

Both expert human problem solvers and successful language models employ four key cognitive behaviors:

1. verification (systematic error-checking),

2. backtracking (abandoning failing approaches),

3. subgoal setting (decomposing problems into manageable steps), and

4. backward chaining (reasoning from desired outcomes to initial inputs).

Some language models naturally exhibit these reasoning behaviors and show substantial gains under RL, while others don't and quickly plateau.

The presence of reasoning behaviors, not the correctness of answers, is the critical factor: models primed with incorrect solutions that contain proper reasoning patterns achieve performance comparable to models trained on correct solutions.

It seems that the presence of cognitive behaviors enables self-improvement through RL.
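
To make the four behaviors concrete, here is a toy, keyword-based tagger for spotting them in a reasoning trace; the cue phrases are my own illustrative assumptions, not the paper's classification method.

```python
# Toy sketch: tag the four cognitive behaviors in a chain-of-thought trace with
# simple cue phrases. The phrases are illustrative assumptions, not the paper's method.

import re

BEHAVIOR_CUES = {
    "verification":      [r"let me check", r"verify", r"double[- ]check"],
    "backtracking":      [r"that doesn't work", r"try a different", r"going back"],
    "subgoal_setting":   [r"first,? (?:i|we)", r"step \d", r"break (?:this|it) down"],
    "backward_chaining": [r"work(?:ing)? backwards?", r"to end up with", r"start(?:ing)? from the goal"],
}

def tag_behaviors(trace: str) -> dict:
    """Count naive cue-phrase hits for each behavior in a reasoning trace."""
    text = trace.lower()
    return {
        behavior: sum(len(re.findall(pattern, text)) for pattern in patterns)
        for behavior, patterns in BEHAVIOR_CUES.items()
    }

# Example: tag_behaviors("First, I break it down... let me check... that doesn't work, going back.")
```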

Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs
arxiv.org/abs/2503.01307

#reinforcementlearning #RL
#AI #DL #LLM

arXiv.org: Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

Test-time inference has emerged as a powerful paradigm for enabling language models to "think" longer and more carefully about complex challenges, much like skilled human experts. While reinforcement learning (RL) can drive self-improvement in language models on verifiable tasks, some models exhibit substantial gains while others quickly plateau. For instance, we find that Qwen-2.5-3B far exceeds Llama-3.2-3B under identical RL training for the game of Countdown. This discrepancy raises a critical question: what intrinsic properties enable effective self-improvement? We introduce a framework to investigate this question by analyzing four key cognitive behaviors -- verification, backtracking, subgoal setting, and backward chaining -- that both expert human problem solvers and successful language models employ. Our study reveals that Qwen naturally exhibits these reasoning behaviors, whereas Llama initially lacks them. In systematic experimentation with controlled behavioral datasets, we find that priming Llama with examples containing these reasoning behaviors enables substantial improvements during RL, matching or exceeding Qwen's performance. Importantly, the presence of reasoning behaviors, rather than correctness of answers, proves to be the critical factor -- models primed with incorrect solutions containing proper reasoning patterns achieve comparable performance to those trained on correct solutions. Finally, leveraging continued pretraining with OpenWebMath data, filtered to amplify reasoning behaviors, enables the Llama model to match Qwen's self-improvement trajectory. Our findings establish a fundamental relationship between initial reasoning behaviors and the capacity for improvement, explaining why some language models effectively utilize additional computation while others plateau.

Hear me out: I think applying RL to #LLMs and LMMs is misguided, and we can do much better.

Those #RL algorithms are unsuitable for this: for example, they cannot learn how their decisions affect the eventual rewards; they are just optimized to make decisions via Bellman optimization.

Instead we can simply condition the LLMs with the rewards. The rewards become the inputs to the model, not something external to it, so the model will learn the proper reward dynamics, instead of only being externally forced towards the rewards. The model can itself do the credit assignment optimally without fancy mathematical heuristics!

This isn't a new idea; it comes from goal-conditioned RL and decision transformers.

We can simply run the reasoning trajectories, judge the outcomes, and then prepend the outcome tokens to those trajectories before training the model on them in a batch.
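
A minimal sketch of that "outcome tokens first" data preparation, assuming a hypothetical judge_outcome scoring function and a made-up reward-token format; it is illustrative, not any particular library's API.

```python
# Minimal sketch of reward-conditioned ("outcome tokens first") training data prep.
# `judge_outcome` and the reward-token format are illustrative assumptions.

def build_reward_conditioned_examples(trajectories, judge_outcome):
    """trajectories: list of (prompt, completion) strings from sampled reasoning runs.
    The judged outcome is prepended as a control token, so the model learns
    p(trajectory | reward) instead of being pushed toward rewards from outside."""
    examples = []
    for prompt, completion in trajectories:
        reward = judge_outcome(prompt, completion)      # e.g. 1.0 if correct, 0.0 otherwise
        reward_token = f"<|reward={reward:.1f}|>"       # outcome token placed first
        examples.append(f"{reward_token}{prompt}{completion}")
    return examples

# Train on `examples` with ordinary next-token prediction; at inference time,
# condition on the reward you want by prefixing the prompt with "<|reward=1.0|>".
```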

arxiv.org/abs/2211.15657

arXiv.org: Is Conditional Generative Modeling all you need for Decision-Making?

Recent improvements in conditional generative modeling have made it possible to generate high-quality images from language descriptions alone. We investigate whether these methods can directly address the problem of sequential decision-making. We view decision-making not through the lens of reinforcement learning (RL), but rather through conditional generative modeling. To our surprise, we find that our formulation leads to policies that can outperform existing offline RL approaches across standard benchmarks. By modeling a policy as a return-conditional diffusion model, we illustrate how we may circumvent the need for dynamic programming and subsequently eliminate many of the complexities that come with traditional offline RL. We further demonstrate the advantages of modeling policies as conditional diffusion models by considering two other conditioning variables: constraints and skills. Conditioning on a single constraint or skill during training leads to behaviors at test-time that can satisfy several constraints together or demonstrate a composition of skills. Our results illustrate that conditional generative modeling is a powerful tool for decision-making.

How can we formulate the exploration-exploitation trade-off better than all the hacks on top of the Bellman equation?

We can, first of all, estimate the advantage of exploration by Monte Carlo in a swarm setting: pit fully exploitative agents against fully exploitative agents that have the benefit of recent exploration. This is easily done with lagging policy models.

Of course, the advantage of exploration needs to be divided by the cost of exploration, which is linear in the number of agents the swarm uses to explore at a particular state.
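
A hedged sketch of that Monte-Carlo estimate, assuming we can reset environment copies to a given state and hold two frozen policy snapshots (one updated on recent exploration, one lagged from before it); all names and signatures here are illustrative.

```python
# Hedged sketch: compare a policy updated on recent exploration against a lagged
# snapshot from before it, starting from the same state, and normalize by the
# (assumed linear) cost of that exploration. All names are illustrative.

from statistics import mean

def exploration_advantage(env_factory, state, policy_new, policy_lagged,
                          n_rollouts=32, n_explorers=8):
    """Estimated value of recent exploration at `state`, per unit of exploration cost."""
    def mean_return(policy):
        returns = []
        for _ in range(n_rollouts):
            env = env_factory(state)             # fresh environment copy reset to `state`
            total, done = 0.0, False
            while not done:
                action = policy.act(env.observation())
                reward, done = env.step(action)
                total += reward
            returns.append(total)
        return mean(returns)

    raw_advantage = mean_return(policy_new) - mean_return(policy_lagged)
    cost = n_explorers                           # exploration cost ~ number of exploring agents
    return raw_advantage / cost
```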

Note that the advantage of exploration depends on the state of the agent, so we might want to define an explorative critic to estimate this.

What's beautiful in this formulation is that we can incorporate autoregressive #WorldModels naturally, as the exploitative agents only learn from rewards, but the explorative agents choose their actions in a way which maximizes the improvement of the auto-regressive World Model.

It brings these two concepts together as two sides of the same coin.

Exploitation is reward-guided action; exploration is action guided by improving the autoregressive state-transition model.

Balancing the two is a swarm dynamic that encourages branching where exploration has positive expected value in reward terms. This can be estimated by computing the advantage of exploitative agents that utilize recent exploration over agents that do not, and returning that advantage to the points where the two diverge.
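
A rough sketch of "returning the advantage to the points of divergence", under the simplifying assumptions that the two agents' trajectories can be lined up step by step and that only the first divergence gets the credit.

```python
# Rough sketch: book the measured advantage of the exploration-informed agent
# against the first (state, action) where it departs from the lagged agent.
# The credit table and the single-divergence simplification are assumptions.

def credit_divergence(traj_with_explore, traj_without, advantage, credit_table):
    """Each trajectory is a list of (state, action) pairs starting from the same state."""
    for (s_a, a_a), (s_b, a_b) in zip(traj_with_explore, traj_without):
        if s_a == s_b and a_a != a_b:
            key = (s_a, a_a)
            credit_table[key] = credit_table.get(key, 0.0) + advantage
            break                                # only the first divergence gets credit here
    return credit_table
```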

Continued thread

Instead of "model is itself the environment" in an #LLM setting, you can take note that in normal #RL, you'd typically have state-action-reward-state-action-reward-... sequences, where the action inflicts itself upon the environment which changes and its new form projects into the next state.

For LLMs, only the outcome comes from the environment: the final reward. Before that, it's just autoregressive action-action-action-...

So standard advantage computation is heavier than necessary; instead, the advantage can simply be returned backwards over the alternatives in the completion tree.
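
A small sketch of returning the outcome backwards over a completion tree, using plain dicts for nodes; the sibling-mean baseline is my assumption about how the alternatives are compared.

```python
# Small sketch: leaves hold judged final rewards; values are backed up as child
# means, and each branch's advantage is its value minus its siblings' mean.
# The dict-based tree and sibling-mean baseline are illustrative assumptions.

from statistics import mean

def backup_values(node):
    """node = {"children": [...], "reward": float or None}; returns the node's value."""
    if not node["children"]:
        node["value"] = node["reward"]           # terminal outcome from the judge
    else:
        node["value"] = mean(backup_values(child) for child in node["children"])
    return node["value"]

def assign_advantages(node):
    """Advantage of each child branch relative to its siblings at the same prefix."""
    children = node["children"]
    if not children:
        return
    baseline = mean(child["value"] for child in children)
    for child in children:
        child["advantage"] = child["value"] - baseline
        assign_advantages(child)

# Usage: backup_values(root); assign_advantages(root)
```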