I'm a little puzzled at the salience being given to the Apple conclusions on #LLM #reasoning when we have lots of prior art. For example: LLMs cannot correctly infer "B is A" if their corpora only contain "A is B". #Paper: https://arxiv.org/abs/2309.12288

arXiv.org: The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"

We expose a surprising failure of generalization in auto-regressive large language models (LLMs). If a model is trained on a sentence of the form "A is B", it will not automatically generalize to the reverse direction "B is A". This is the Reversal Curse. For instance, if a model is trained on "Valentina Tereshkova was the first woman to travel to space", it will not automatically be able to answer the question, "Who was the first woman to travel to space?". Moreover, the likelihood of the correct answer ("Valentina Tereshkova") will not be higher than for a random name. Thus, models do not generalize a prevalent pattern in their training set: if "A is B" occurs, "B is A" is more likely to occur. It is worth noting, however, that if "A is B" appears in-context, models can deduce the reverse relationship.

We provide evidence for the Reversal Curse by finetuning GPT-3 and Llama-1 on fictitious statements such as "Uriah Hawthorne is the composer of Abyssal Melodies" and showing that they fail to correctly answer "Who composed Abyssal Melodies?". The Reversal Curse is robust across model sizes and model families and is not alleviated by data augmentation. We also evaluate ChatGPT (GPT-3.5 and GPT-4) on questions about real-world celebrities, such as "Who is Tom Cruise's mother? [A: Mary Lee Pfeiffer]" and the reverse "Who is Mary Lee Pfeiffer's son?". GPT-4 correctly answers questions like the former 79% of the time, compared to 33% for the latter. Code available at: https://github.com/lukasberglund/reversal_curse
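To make the two-direction probe from the abstract concrete, here is a minimal Python sketch. It is not the paper's harness: `ask` and `FACTS` are hypothetical names, and the `ask` stub is a fake model that just mimics the reported asymmetry; you would swap in a real model or API call to run the probe for real.

```python
# A minimal sketch (not the paper's code) of the two-direction probe:
# ask the same fact forwards and backwards and compare accuracy.

FACTS = [
    # (forward question, forward answer, reverse question, reverse answer)
    ("Who is Tom Cruise's mother?", "Mary Lee Pfeiffer",
     "Who is Mary Lee Pfeiffer's son?", "Tom Cruise"),
]

def ask(question: str) -> str:
    # Fake model: recalls the fact only in the direction it was "trained" on,
    # mimicking the Reversal Curse. Replace with a real model/API call.
    canned = {"Who is Tom Cruise's mother?": "Mary Lee Pfeiffer"}
    return canned.get(question, "I don't know")

def accuracy(items) -> float:
    hits = sum(expected.lower() in ask(q).lower() for q, expected in items)
    return hits / len(items)

forward = [(q, a) for q, a, _, _ in FACTS]
reverse = [(q, a) for _, _, q, a in FACTS]
print(f"forward: {accuracy(forward):.0%}  reverse: {accuracy(reverse):.0%}")
# The paper reports GPT-4 at 79% forward vs 33% reversed on such pairs.
```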

@modulux I think it's just people looking for confirmation, and having the name Apple attached is somehow giving that research more public attention, when it really is just another paper on building better benchmarks to study "reasoning" capabilities. I put reasoning in quotes since it means such different things depending on whether you are in the field or not.

@mnl I tend to think so too. I suppose it shouldn't really surprise me, but I expected a bit more critical engagement.

There's lots of prior evidence on the limits of LLM reasoning, as well as on failures at pretty basic self-reference (e.g. "how many letters l does this statement include?"). And yes, reasoning is being used rather technically here, in the sense of deductive inference.
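The self-reference example is telling because the check is trivial outside the model: LLMs see subword tokens rather than individual characters, so a task that is one line of Python becomes unreliable for them. A quick sketch, using the example sentence from the post:

```python
# Counting letters is trivial in code; LLMs operate on tokens, not
# characters, which is one reason this self-referential check trips them up.
statement = "how many letters l does this statement include?"
print(statement.count("l"))  # -> 3 ("letters", "l", "include")
```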