I’ve read some of Ed Zitron’s long posts on why the AI industry is a bubble that will never be profitable (and will take down a lot of companies and investors with it). One of his recurring themes is that the AI companies are chasing market share in an industry where their marginal profits are still negative: every increase in revenue necessarily increases the cost of providing their services.

But some of the comments in various Hacker News threads are dismissive, arguing that each new generation of models lowers the cost of inference, so that with sufficient customer volume the companies running the models can make enough profit on inference to recoup the staggering up-front capital expenditures of building out data centers, training their models, and so on.

It’s all pretty confusing to me. So for those of you who are familiar with the industry, I have several questions:

  1. Is the cost of running a given pretrained model going down? That is, are there hardware and software improvements that make the same, unchanged model cheaper to run?
  2. Is the cost of performing a particular task at a particular quality level going down across model releases (i.e., a smaller model of the current generation performing on par with a bigger model of the previous generation, so the same task is now cheaper)?
  3. Is the cost of running the largest flagship frontier models going down for any given task? Or do the cutting-edge, show-off tasks keep getting more expensive, with the companies arguing that the improved performance justifies the higher cost?

I suspect the discussion around this is so muddled online because the answer differs depending on which of these three questions you mean by “is running an AI model getting cheaper over time?” And the data isn’t easy to synthesize, because each model has different token prices and uses a different number of tokens per query; the sketch below shows the kind of normalization you’d need.
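
To illustrate what I mean, here’s a minimal sketch of the only comparison that seems fair, cost per completed task. Every price and token count in it is a made-up placeholder, not a real figure from any provider.

```python
# Hypothetical illustration: normalizing "cost per task" across models.
# All prices and token counts are made-up placeholders, not real figures.

PRICES_PER_MTOK = {            # (input, output) in USD per million tokens
    "model_a": (3.00, 15.00),  # pricier per token
    "model_b": (0.50, 2.00),   # cheaper per token
}

def cost_per_task(model: str, tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of one query: tokens in/out times per-token prices."""
    p_in, p_out = PRICES_PER_MTOK[model]
    return (tokens_in * p_in + tokens_out * p_out) / 1_000_000

# A model that is cheaper per token can still be pricier per task if it
# needs far more tokens (e.g. long reasoning traces) for the same quality.
print(cost_per_task("model_a", 2_000, 500))    # 0.0135
print(cost_per_task("model_b", 2_000, 8_000))  # 0.017
```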

But I wanted to hear from people who are knowledgeable about these topics.

  • Scrubbles@poptalk.scrubbles.tech

    I think we’re seeing a lot of optimization right now. The most exciting one I’ve seen is TurboQuant. Short version: every message you send to a model carries context (the entire conversation you’ve had, instructions, skills, everything). Holding that context eats an enormous amount of memory, since the KV cache grows with every token of context, and it’s a big part of what’s driving the VRAM/RAM crunch. TurboQuant (and the copycats appearing now) claims it can cut the context’s VRAM usage by 20x. That’s absolutely huge; that’s “1M-token context models running on consumer hardware” huge. The back-of-envelope sketch below shows why.
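
    To put rough numbers on that claim: here’s a back-of-envelope KV-cache calculator. The model dimensions are illustrative assumptions (loosely a 70B-class model with grouped-query attention), not the specs of any model TurboQuant was actually evaluated on.

    ```python
    # Back-of-envelope KV-cache sizing. All dimensions below are assumptions
    # (roughly a 70B-class model with grouped-query attention), chosen only
    # to illustrate the scale of savings a ~20x quantizer would deliver.

    def kv_cache_bytes(seq_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                       head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
        """Keys + values cached for every layer, head, and token position."""
        return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

    GIB = 1024 ** 3
    fp16 = kv_cache_bytes(1_000_000)                         # 16-bit cache
    tiny = kv_cache_bytes(1_000_000, bytes_per_elem=2 / 20)  # ~20x smaller

    print(f"fp16 KV cache @ 1M tokens: {fp16 / GIB:,.0f} GiB")  # ~305 GiB
    print(f"after a ~20x reduction:    {tiny / GIB:,.0f} GiB")  # ~15 GiB
    ```

    At fp16 the cache alone runs to hundreds of GiB at 1M tokens; a 20x cut brings it into the range of a single high-end consumer GPU, which is why the claim matters.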

    DeepSeek v4 also makes some large claims: they say they have a model that does better than Anthropic’s or OpenAI’s while being a tenth the size. That would also be a huge reduction in compute and VRAM, but I’ll be looking for the proof.

    We’ve seen other improvements too, in how models are served and how quickly results are streamed, but to me TurboQuant is the most exciting.

    I think it’s good that they’re finally looking at optimization. Yes, their costs have been power and compute. Nvidia is more than happy to keep things inefficient, because inefficiency sells GPUs. Software companies are now doing the opposite, cutting compute overhead to save money, which they desperately need to do if this is going to continue. New technology has always been horribly inefficient; it’s only once more people use it that it starts to get optimized.

    I think this kind of optimization is what it will take to finally get past the horrible economics of the AI companies, and they need to do it quickly.

    • Zikeji@programming.dev

      There’s also speculative decoding and adjacent techniques gaining traction, which squeeze more throughput out of the same models on the same hardware (toy sketch below).
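
      In case it’s unfamiliar: a small “draft” model guesses a few tokens ahead, and the big model only has to verify the guesses. A minimal sketch of the rejection-sampling version, with stand-in distributions rather than real models:

      ```python
      # Toy speculative decoding (rejection-sampling style). The "models"
      # here are stand-in random distributions, NOT real LLMs; the point
      # is the draft-then-verify bookkeeping.
      import numpy as np

      rng = np.random.default_rng(0)
      VOCAB = 32  # toy vocabulary size

      def fake_model(ctx: tuple, temperature: float) -> np.ndarray:
          """Stand-in for a model's next-token distribution over the vocab."""
          g = np.random.default_rng(hash(ctx) % (2**32))
          logits = g.normal(size=VOCAB) / temperature
          e = np.exp(logits - logits.max())
          return e / e.sum()

      def target(ctx): return fake_model(ctx, 1.0)  # big, slow, accurate
      def draft(ctx):  return fake_model(ctx, 1.3)  # small, fast guesser

      def speculative_step(ctx: tuple, gamma: int = 4) -> list:
          """Draft gamma tokens cheaply, then verify against the target.

          Accepted tokens are distributed as if the target had sampled them
          itself; the speedup comes from verifying gamma drafts in one
          (here simulated) batched target pass instead of gamma passes.
          """
          proposed, c = [], ctx
          for _ in range(gamma):            # cheap autoregressive drafting
              q = draft(c)
              t = int(rng.choice(VOCAB, p=q))
              proposed.append((t, q))
              c += (t,)

          out, c = [], ctx
          for t, q in proposed:             # verification pass
              p = target(c)
              if rng.random() < min(1.0, p[t] / q[t]):
                  out.append(t)             # accept with prob min(1, p/q)
                  c += (t,)
              else:                         # reject: resample from max(p-q, 0)
                  resid = np.maximum(p - q, 0.0)
                  resid = resid / resid.sum() if resid.sum() > 0 else p
                  out.append(int(rng.choice(VOCAB, p=resid)))
                  break
          # (the full algorithm also samples a bonus token if all drafts pass)
          return out

      print(speculative_step((1, 2, 3)))
      ```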

    • GamingChairModel@lemmy.worldOP

      “New technology has always been horribly inefficient; it’s only once more people use it that it starts to get optimized.”

      Well, I wonder if the frontier ends up looking like supersonic commercial flight: so expensive that there was never a big enough consumer market at the actual cost of providing the service. The technology continues to exist but never really gets used, because the alternatives that aren’t as good are still much, much cheaper.

  • Zarxrax@lemmy.world

    It’s easy to think of it as similar to computer hardware or game consoles. There is always newer and better hardware coming out, and the newer stuff is always more efficient (performance per watt) than the old stuff. But users’ expectations increase as well, so new hardware doesn’t just aim to be more efficient, it aims to be more powerful. That then sets a new baseline for expectations.

    A lot of these LLMs and other kinds of models are very much like that. The newer models definitely bring improvements in efficiency and performance, but no one wants to sit still; they have to keep pushing the envelope to make the models better and more powerful.

  • BlameThePeacock@lemmy.ca

    For an equivalent prompt and a similar-quality answer, yes, inference prices are dropping.

    However, higher-quality answers (and more complex prompt handling) are currently getting more expensive to serve.

    The fun part will be once quality hits a point where the average user (or even business) doesn’t care about the incremental quality change any more. Then it’s going to be a race to the bottom for performance per dollar.

    Who cares if not all companies or investors make money? They can make their bets; some will win and some will lose. I just want better tech at cheaper prices.

    • GamingChairModel@lemmy.worldOP

      “Who cares if not all companies or investors make money?”

      I care about the downstream effects on everyone else: who else gets hurt in a crash.

      • BlameThePeacock@lemmy.ca

        That has nothing to do with the technology. The last crash was caused by a global virus, and the one before that by the banking system…