I’ve read some of Ed Zitron’s long posts on why the AI industry is a bubble that will never be profitable (and will take down a lot of companies and investors with it). One of his recurring themes is that the AI companies are chasing market share in an industry where their marginal profits are still negative: every increase in revenue necessarily increases the cost of providing their services.

But some of the comments in various Hacker News threads are dismissive, arguing that each new generation of models lowers the cost of inference, so that with sufficient customer volume the companies running the models can make enough profit on inference to recoup the staggering up-front capital expenditures of building out data centers, training their models, and so on.

It’s all pretty confusing to me. So for those of you who are familiar with the industry, I have several questions:

  1. Is the cost of running a given pretrained model going down? That is, are there hardware and software improvements that make the same, unchanged model cheaper to run?
  2. Is the cost of performing a particular task at a particular quality level going down across model releases (i.e., a smaller model of the current generation performing on par with a bigger model of the previous generation, so the same task is now cheaper)?
  3. Is the cost of running the largest flagship frontier models going down for any given task? Or do the cutting-edge, show-off tasks keep getting more expensive, with the companies arguing that the improved performance justifies the higher cost?

I suspect the discussion around this is so muddled online because the answer differs depending on which of these three questions you mean by “is running an AI model getting cheaper over time?” And the data isn’t easy to synthesize, because each model has different token prices and uses a different number of tokens per query; the sketch below shows the kind of normalization you’d need.
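
To illustrate what I mean, here’s a minimal sketch of the only comparison that seems fair, cost per completed task. Every price and token count in it is a made-up placeholder, not a real figure from any provider.

```python
# Hypothetical illustration: normalizing "cost per task" across models.
# All prices and token counts are made-up placeholders, not real figures.

PRICES_PER_MTOK = {            # (input, output) in USD per million tokens
    "model_a": (3.00, 15.00),  # pricier per token
    "model_b": (0.50, 2.00),   # cheaper per token
}

def cost_per_task(model: str, tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of one query: tokens in/out times per-token prices."""
    p_in, p_out = PRICES_PER_MTOK[model]
    return (tokens_in * p_in + tokens_out * p_out) / 1_000_000

# A model that is cheaper per token can still be pricier per task if it
# needs far more tokens (e.g. long reasoning traces) for the same quality.
print(cost_per_task("model_a", 2_000, 500))    # 0.0135
print(cost_per_task("model_b", 2_000, 8_000))  # 0.017
```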

But I wanted to hear from people who are knowledgeable about these topics.

  • Scrubbles@poptalk.scrubbles.tech

    I think we’re seeing a lot of optimization right now. The most exciting one I’ve seen is TurboQuant. Short version: every message you send to a model carries context (the entire conversation you’ve had, instructions, skills, everything). Holding that context eats an enormous amount of memory, since the KV cache grows with every token of context, and it’s a big part of what’s driving the VRAM/RAM crunch. TurboQuant (and the copycats appearing now) claims it can cut the context’s VRAM usage by 20x. That’s absolutely huge; that’s “1M-token context models running on consumer hardware” huge. The back-of-envelope sketch below shows why.
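
    To put rough numbers on that claim: here’s a back-of-envelope KV-cache calculator. The model dimensions are illustrative assumptions (loosely a 70B-class model with grouped-query attention), not the specs of any model TurboQuant was actually evaluated on.

    ```python
    # Back-of-envelope KV-cache sizing. All dimensions below are assumptions
    # (roughly a 70B-class model with grouped-query attention), chosen only
    # to illustrate the scale of savings a ~20x quantizer would deliver.

    def kv_cache_bytes(seq_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                       head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
        """Keys + values cached for every layer, head, and token position."""
        return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

    GIB = 1024 ** 3
    fp16 = kv_cache_bytes(1_000_000)                         # 16-bit cache
    tiny = kv_cache_bytes(1_000_000, bytes_per_elem=2 / 20)  # ~20x smaller

    print(f"fp16 KV cache @ 1M tokens: {fp16 / GIB:,.0f} GiB")  # ~305 GiB
    print(f"after a ~20x reduction:    {tiny / GIB:,.0f} GiB")  # ~15 GiB
    ```

    At fp16 the cache alone runs to hundreds of GiB at 1M tokens; a 20x cut brings it into the range of a single high-end consumer GPU, which is why the claim matters.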

    DeepSeek v4 also makes some large claims: they say they have a model that does better than Anthropic’s or OpenAI’s while being a tenth the size. That would also be a huge reduction in compute and VRAM, but I’ll be looking for the proof.

    We’ve seen other improvements too, in how models are served and how quickly results are streamed, but to me TurboQuant is the most exciting.

    I think it’s good that they’re finally looking at optimization. Yes, their costs have been power and compute. Nvidia is more than happy to keep things inefficient, because inefficiency sells GPUs. Software companies are now doing the opposite, cutting compute overhead to save money, which they desperately need to do if this is going to continue. New technology has always been horribly inefficient; it’s only once more people use it that it starts to get optimized.

    I think this kind of optimization is what it will take to finally get past the horrible economics of the AI companies, and they need to do it quickly.

    • Zikeji@programming.dev

      There’s also speculative decoding and adjacent techniques gaining traction, which squeeze more throughput out of the same models on the same hardware (toy sketch below).
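
      In case it’s unfamiliar: a small “draft” model guesses a few tokens ahead, and the big model only has to verify the guesses. A minimal sketch of the rejection-sampling version, with stand-in distributions rather than real models:

      ```python
      # Toy speculative decoding (rejection-sampling style). The "models"
      # here are stand-in random distributions, NOT real LLMs; the point
      # is the draft-then-verify bookkeeping.
      import numpy as np

      rng = np.random.default_rng(0)
      VOCAB = 32  # toy vocabulary size

      def fake_model(ctx: tuple, temperature: float) -> np.ndarray:
          """Stand-in for a model's next-token distribution over the vocab."""
          g = np.random.default_rng(hash(ctx) % (2**32))
          logits = g.normal(size=VOCAB) / temperature
          e = np.exp(logits - logits.max())
          return e / e.sum()

      def target(ctx): return fake_model(ctx, 1.0)  # big, slow, accurate
      def draft(ctx):  return fake_model(ctx, 1.3)  # small, fast guesser

      def speculative_step(ctx: tuple, gamma: int = 4) -> list:
          """Draft gamma tokens cheaply, then verify against the target.

          Accepted tokens are distributed as if the target had sampled them
          itself; the speedup comes from verifying gamma drafts in one
          (here simulated) batched target pass instead of gamma passes.
          """
          proposed, c = [], ctx
          for _ in range(gamma):            # cheap autoregressive drafting
              q = draft(c)
              t = int(rng.choice(VOCAB, p=q))
              proposed.append((t, q))
              c += (t,)

          out, c = [], ctx
          for t, q in proposed:             # verification pass
              p = target(c)
              if rng.random() < min(1.0, p[t] / q[t]):
                  out.append(t)             # accept with prob min(1, p/q)
                  c += (t,)
              else:                         # reject: resample from max(p-q, 0)
                  resid = np.maximum(p - q, 0.0)
                  resid = resid / resid.sum() if resid.sum() > 0 else p
                  out.append(int(rng.choice(VOCAB, p=resid)))
                  break
          # (the full algorithm also samples a bonus token if all drafts pass)
          return out

      print(speculative_step((1, 2, 3)))
      ```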

    • GamingChairModel@lemmy.worldOP

      “New technology has always been horribly inefficient; it’s only once more people use it that it starts to get optimized.”

      Well, I wonder if the frontier ends up looking like supersonic commercial flight: so expensive that there was never a big enough consumer market at the actual cost of providing the service. The technology continues to exist but never really gets used, because the alternatives that aren’t as good are still much, much cheaper.

  • Zarxrax@lemmy.world

    It’s easy to think of it as similar to computer hardware or game consoles. There is always newer and better hardware coming out, and the newer stuff is always more efficient (performance per watt) than the old stuff. But users’ expectations increase as well, so new hardware doesn’t just aim to be more efficient, it aims to be more powerful. That then sets a new baseline for expectations.

    A lot of these LLMs and other kinds of models are very much like that. The newer models definitely bring improvements in efficiency and performance, but no one wants to sit still; they have to keep pushing the envelope to make the models better and more powerful.

  • BlameThePeacock@lemmy.ca

    For an equivalent prompt and a similar-quality answer, yes, inference prices are dropping.

    However, higher-quality answers (and more complex prompt handling) are currently getting more expensive to serve.

    The fun part will be once quality hits a point where the average user (or even business) doesn’t care about the incremental quality change any more. Then it’s going to be a race to the bottom for performance per dollar.

    Who cares if not all companies or investors make money? They can make their bets; some will win and some will lose. I just want better tech at cheaper prices.

    • GamingChairModel@lemmy.worldOP

      “Who cares if not all companies or investors make money?”

      I care about the downstream effects on everyone else: who else gets hurt in a crash.

      • BlameThePeacock@lemmy.ca

        That has nothing to do with the technology. The last crash was caused by a global virus, and the one before that by the banking system…