• TheSpectreOfGay [hy/hym, she/her]@hexbear.net
      22 days ago

      more money you spend = betterer things are, obviously

      there’s no redundancy about a bunch of private companies trying to make the same thing. clearly all the chinese models being open source is just a weird commie thing

    • asdasd201@lemmygrad.ml
      22 days ago

      Am I justified in looking skeptically at LMArena’s scores? I’ve got no evidence, but I feel like they are funded by American big tech.

      • CarmineCatboy2 [he/him]@hexbear.net
        22 days ago

        Yes, because a lot of AI benchmarking at the moment is something that the companies created for themselves to gauge their own definitions of progress. Which is how OpenAI can spend last year releasing what they think is a massively better model of ChatGPT only to be met with a universal ‘meh, I guess’. From their paying users, even.

        What we can conclude is that the US evaporating all their venture capital, and also the Gulf’s, to build language models does not meaningfully outperform the engineering that some Chinese firms do on the side.

        To what extent this matters is harder to divine, because the language model marketing is overwhelming any practical use for this technology outside of very specific companies doing their own very specialized research. The idea that this is a future 100 trillion dollar industry of actual AGI distracts us from what could be a respectable 50 billion dollar industry of very specialized uses in mineral prospecting, manufacturing or QoL for coding.

      • daniyeg [he/him]@hexbear.net
        22 days ago

        i don’t know anything about LMArena or their specific methodology, but from a cursory glance it is a crowdsourced blind pairwise testing platform, and i’d just assume they are doing it right. even so, you’d still be correct to be skeptical about their rankings. it’s an elo platform, so any result they publish is biased towards their audience’s preferences and not necessarily indicative of actual head-to-head performance in your/any specific use case. that being said, nothing is perfect, and compared to specific benchmarks it is much harder to game. you can game it slightly to fit the preferences of the users of LMArena, but i would assume that their taste is not that different from the general population’s. it is at the very least correlated with actual performance.

        my take on the graph is that the gap is not conclusive enough to show an actual performance lead. either a. people have a bias towards the models they were exposed to first (a company iterating on their models will likely produce new models with the same type/tone of response) or b. US companies are spending an ungodly amount of money on fine-tuning for marginal gains that are honestly not worth it and have to be redone for every new model.

        • ☆ Yσɠƚԋσʂ ☆@lemmygrad.mlOP
          22 days ago

          The big problem with these sites is that you can just tune the model to work really well for these specific tests and that doesn’t necessarily translate into general capabilities. So, a model can do really well on a benchmark and not have comparable performance for real world tasks.

          • daniyeg [he/him]@hexbear.net
            22 days ago

            it’s really not a benchmark though. unless i am mistaken this specific graph comes from a blind voting platform. you can absolutely game benchmarks and tests but i do not see how you can meaningfully game a voting platform unless something is wrong with the platform itself.

            • ☆ Yσɠƚԋσʂ ☆@lemmygrad.mlOP
              22 days ago

              It’s definitely one of the better approaches for removing bias and making results harder to game, but it still has limitations because of the scope of the problems. A user visits the Arena website and types a prompt into a chat interface, and the system randomly selects two different models from the pool of participants to respond. The user then selects the response they liked best.
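To make the pairing-and-vote loop concrete, here is a minimal sketch of how blind pairwise votes can be turned into an Elo-style ranking. It is an illustration only: the thread calls LMArena an elo platform, but its actual rating computation may differ. The model names, the `true_pref` weights (a stand-in for voter taste), the step size `K`, and the `simulate` helper are all hypothetical.

```python
import random

K = 4  # update step size; an arbitrary choice for this sketch, not LMArena's setting


def expected_score(r_a, r_b):
    """Probability that a model rated r_a beats one rated r_b under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update(ratings, winner, loser, k=K):
    """Move rating points from loser to winner, scaled by how surprising the result was."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


def simulate(true_pref, n_votes=5000, seed=0):
    """Simulate blind pairwise votes.

    true_pref maps each (hypothetical) model to a positive weight describing
    how much this particular audience likes its answers.
    """
    rng = random.Random(seed)
    models = list(true_pref)
    ratings = {m: 1000.0 for m in models}
    for _ in range(n_votes):
        a, b = rng.sample(models, 2)  # two distinct models, chosen at random
        p_a = true_pref[a] / (true_pref[a] + true_pref[b])  # chance the voter picks a
        if rng.random() < p_a:
            update(ratings, a, b)
        else:
            update(ratings, b, a)
    return ratings


ratings = simulate({"model_x": 3.0, "model_y": 2.0, "model_z": 1.0})
for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.0f}")
```

Note that the final ranking only recovers `true_pref`, i.e. what this pool of voters prefers on the prompts they chose to type. Nothing in the loop measures performance on tasks the voters never tried.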

              One of the problems is that the user base is skewed toward tech-savvy and AI-interested users. The evaluation is limited to a single-turn chat interaction and does not test long conversations or multi-step tasks. So, it’s not really representative of how a model would behave using an agentic tool like opencode.

              And that’s where the problem with gaming the test surfaces. You can tune the model for the kinds of short questions that it would encounter in LMArena, but that may not translate into real world scenarios where it has to do things like analyzing codebases or doing step by step problem solving.

              • daniyeg [he/him]@hexbear.net
                22 days ago

                yeah absolutely, especially considering agentic use cases it’s limited, but i still think it’s a valid metric that is at least correlated with performance for purely chatbot users.

    • ☆ Yσɠƚԋσʂ ☆@lemmygrad.mlOP
      22 days ago

      To be fair, the US did have a lead in this tech initially. DeepSeek was actually bootstrapped off the Llama model that was initially open sourced by Meta because they were basically throwing in the towel when they realized they were far behind OpenAI and Google. So, a lot of the initial research did come from the US.

      • Hohsia [any]@hexbear.net
        22 days ago

        Oh damn that’s news to me. But if I understand correctly, it does make sense because China seems to be more focused on AI’s application to the real world

        • ☆ Yσɠƚԋσʂ ☆@lemmygrad.mlOP
          22 days ago

          Oh yeah absolutely, Chinese companies are basically treating LLMs like shared infrastructure, and are focusing on building things on top of it. It’s similar to the way companies use Linux today, where nobody is actually trying to monetize it directly, but people build useful platforms on top of it. Another huge difference with China is that it has a massive industrial sector, so there are more niches to apply this tech to as a result. You can try doing things like factory automation, robotics, infrastructure monitoring, etc. If you don’t have these types of industries or mass infrastructure, then you can’t really apply this tech to these domains. So, AI research in China is inherently more grounded than it is in the US.