The big problem with these sites is that you can just tune the model to work really well for these specific tests and that doesn’t necessarily translate into general capabilities. So, a model can do really well on a benchmark and not have comparable performance for real world tasks.
it’s really not a benchmark though. unless i am mistaken this specific graph comes from a blind voting platform. you can absolutely game benchmarks and tests but i do not see how you can meaningfully game a voting platform unless something is wrong with the platform itself.
It’s definitely one of the better approaches for removing bias and making results harder to game, but it still has limitations because of the scope of the problems. A user visits the Arena website and types a prompt into a chat interface, and the system randomly selects two different models from the pool of participants to answer it. The user then selects the response they liked best.
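To make the voting flow concrete, here is a rough sketch of a blind pairwise comparison loop. The Elo-style update is purely an assumption for illustration, since nothing here says how the platform actually turns votes into rankings, and the model names, the starting rating of 1000, and `k=32` are all made up.

```python
import random


def expected_score(r_a, r_b):
    """Probability that A beats B under an Elo-style model (assumed, not the platform's method)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def pick_pair(models, rng):
    """Randomly pick two different models to answer the same prompt."""
    return tuple(rng.sample(models, 2))


def record_vote(ratings, winner, loser, k=32.0):
    """Shift rating points from the loser to the winner; upsets move more points."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical pool of participants, all starting from the same rating.
ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
rng = random.Random(0)

# One round: two models are sampled, and the user prefers the first response.
left, right = pick_pair(list(ratings), rng)
record_vote(ratings, winner=left, loser=right)
```

Note that the only signal going in is which of two single responses a voter preferred, so whatever ranking comes out can only reflect that kind of interaction.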
One of the problems is that the user base is skewed toward tech-savvy, AI-interested users. The evaluation is also limited to a single-turn chat interaction and does not test long conversations or multi-step tasks. So, it’s not really representative of how a model would behave using an agentic tool like opencode.
And that’s where the problem of gaming the test comes up. You can tune the model for the kinds of short questions that it would encounter in LMArena, but that may not translate into real-world scenarios where it has to do things like analyzing codebases or doing step-by-step problem solving.
yeah absolutely especially considering agentic use cases it’s limited but i still think it’s a valid metric that is at least correlated with performance for purely chatbot users.
yeah for what it’s actually testing it’s probably as good as you can make it