The Chinese AI labs are really trying to pop the bubble, too.
How?
Well lemme ask you this. What if models 80-90% as good as Claude, with weights just thrown out there for any provider (or homelab) to host, flood the market? What if they’re so dirt cheap to run, they’re almost free, and don’t even need Nvidia GPUs? What they need fewer resources to run with each update, instead of more?
…What if this already happened, and Big Tech is maddly lobbying to ban/censor them before people realize it, and that the “infinite scaling” thing is a big fat lie?
Well a funny thing about off-shoring your economy is that it really just means exploiting people in countries that can’t stand up to your imperial might. So it inevitably creates enemies. Now you have no economy at home AND the rest of the world hates you! Double stupidity!
Yep, the Chinese models are already up 10 times cheaper and now that Anthropic, OpenAI, and Google, all are increasing prices up to 10 more for models like Opus, it will make Chinese models anywhere from 50 to 100 times cheaper.
American corps. are betting that since people have their workflow already established they won’t switch to other providers, but that’s not the case. There’s already a mass move to Chinese models.
safe?
local models in a sandbox without phone home sounds pretty safe, but are we there yet?
to my limited understanding the tokens are open source… not the model itself
Chinese models are really good. How you use them depends on what your goals are. If you want something on par with Claude or ChatGPT, you need to go to Deepseek or Qwen’s website. You can also find these models on openrouter. If you want a local/small model, then get ollama and find a model in the repository.
You could also get anythingllm or LM Studio and get models from within the app. There’s also huggingface.
Since you asked about safety, deepseek on the official website does collect info and there was a time some of that info was leaked but you can get around that using something like openrouter. Note similar things happened to ChatGPT and Meta AI. There is also the potential for bias (there was a time people were spamming their Deepseek Tiananmen Square responses – personally, it just would not process my query) but Grok has that same issue.
Look into zen.ai which is opencode’s sister company that provides llm access. “At cost”
You can see just how cheap they are. I use Augment Code at work and they have kimi 2.6. It’s really solid. Opus/GPT are still better, but for many tasks, kimi works great and doesn’t make me cringe at the price.
Qwen 3.6 is supposed to be really good too. I haven’t used it that much.
ollama or llama.cpp to self host if you have a good mac or good video card. this is perfectly safe.
there are a bazillion hosted inference providers to choose from https://huggingface.co/docs/inference-providers/en/index be aware that you are sending your code to fuck knows who and they are sending back fuck knows what. ymmv, yolo.
hook one of them up to opencode.ai or pi.dev or one of the bazillion other ‘harneses’ or whatever we are calling it this week and try not to rm -r anything important.
for a good time try and get a chinese models to say something about tibet, or taiwan… its like having your own virtual tankie tamagochi!
inference providers could be anyone from anywhere, there are even proxy resellers. some are harvesting and reselling your data.
if you send your code to claude/openai/google there is certainly a much higher degree of confidence in who you are sending your data to. yes they to harvest your data and can send you malicious commands (esp if you have a promp injection attack).
its like buying a cheap vps, if the stakes are low its fine, if it important then you need to consider about the consequences of your actions.
The most famous is Deepseek. It’s not even made by “AI” company, it was a side hustle from stock trading company. They released it for free just to flex.
I hear people use minimax as replacement for sonnet and deepseek as replacement for opus, both can be used directly in Claude code instead of Anthropic models
Deepseek was big because not only did they publish the full model for everyone to use, but the MoE structure significantly brought down the hardware requirements in terms of processing power. As long as you have enough VRAM, you can run it on older hardware with no need for the latest Nvidia stuff.
Now they got v4 which many have found to be within a 10% margin of Claude and ChatGPT.
On top of that, China has cheapo VRAM GPUs available or soon to be released, like the MTT S80. Yeah it sucks as a Graphics card because the chip is behind, but you get 16Gb of GDDR6 for much cheaper than anything else.
But its not a conspiracy to fight China. The infinite scaling was just Nvidia solidifying themselves as the monopoly because they want all AI infrastructure to be dependent on them, which is why they still illegally export to China, despite an export ban attempting to reduce their potential competition.
Moore Threads (MTT) already has their own CUDA like system called MUSA, and I’m sure they’ll be happy to put in proper hardware support for new stuff like Bf16 and FP8/4. It’ll take a few years, but eventually China will catch up to the point where Nvidia gets shanked by cheaper hardware.
MTT is just a pipe dream, last I checked. But Deepseek is actively being served, in mixed FP8/FP4, on racks of Huawei accelerators.
I believe Baidu trained a model on them, too. But most training (like Deepseek’s) is still done on CUDA.
…Also, be careful equating this stuff with any kind of “consumer friendly” hardware you or I could buy. That’s less likely. The Huawei accelerators (and other local Chinese hardware experiments) are geared towards huge servers serving requests in parallel.
Wasn’t there development of a linux translation layer for CUDA workloads to run on AMD GPUs? I haven’t heard about it in a while, but I’d imagine that’d help the situation.
You mean ZLUDA? AMD gave up on it and released the project on github.
Personally I dream for CUDA API implementation in open source GPU drivers on Linux. That would absolutely level the playing field for the whole industry
Hyper scaling was always about cornering the computer market, It was never about providing us some vastly new and superior service.
Exactly, its a method of taking tens of billions of dollars in capital and buying a near monopoly. No other providers can compete if the hyperscalers buy all of the hardware, driving up the prices while also selling the service at a loss.
Nobody working out of their garage with a cool idea for a better service can compete if they can’t get hardware and have to charge double what the hyperscalers are charging because they can’t burn capital for years.
It’s a practice that should be considered illegal market manipulation, because that’s what it i
Agreed. I am not longer paying token fees as I am running QWEN 3.6 27B MTP on my 4090 GPU and it is as good and as fast as the frontier models for agentic coding.
Same. I’m running Qwen3.6-35B-A3B-FP8 (Qwen3.6-35B-A3B-UD-IQ4_XS.gguf) via the turboquant fork of llama.cpp with a few tweaked memory settings, and I get like 40 tokens / second – nothing that required special insight on my part just following the instructions I saw on a youtube video I found via !LocalLLaMA@sh.itjust.works and asking claude to help me through the installation.
AI has no economic moat. There’s nothing stopping anyone from running LLMs locally.
I just updated my setup from LMStudio to llama.cpp with the new QWEN 3.6 27B MTP model and I am getting 80-112 tokens/second, 90 average which is just shocking to me. I am on a 4090 with a context
Window of 64k. It hardly use cloud AI anymore as I rarely need more than 64k if I ensure my first prompt is written like a design document. Multiple prompts are not great so I often just figure out where my initial prompt went wrong, adjust and try again in a fresh session. Way faster this way too. It has really worked out well for me as I am getting just as much done locally for free as I was with hundreds of dollar a month on cloud AI. I am still shocked and grateful it flowed this way.
I am using llamma.cpp with QWEN 3.6 27B MTP, with a 64k context window on a 4090 that OpenCode talks to and then it in term talks to the Unity Game engine via MCP. Getting 80/112 tokens/second work 90 average which is shocking to me as it really does feel as fast as cloud AI (well faster for me as I am in Vietnam and round trips to US data centers really adds up in a session). The only really issue is you pretty much have to one shot prompts as follow up prompts will easily go over the context window size. If I cannot one shot prompts them use cloud AI both that is very rare for my use case. Maybe 1 in 50 or so and only when the tasks touches a lot of large scripts and scenes.
The state of things is what if, that’s true. It has not happened. :)
At some point, it should happen though. Still not going to put a dent in the datacenter / dystopia rally though, since they will pick Nvidia and known brands.
The Chinese AI labs are really trying to pop the bubble, too.
How?
Well lemme ask you this. What if models 80-90% as good as Claude, with weights just thrown out there for any provider (or homelab) to host, flood the market? What if they’re so dirt cheap to run, they’re almost free, and don’t even need Nvidia GPUs? What they need fewer resources to run with each update, instead of more?
…What if this already happened, and Big Tech is maddly lobbying to ban/censor them before people realize it, and that the “infinite scaling” thing is a big fat lie?
That’s the state of things.
It turns out that off-shoring your economy to a political rival is a really dumb thing for a capitalist to do.
If you offshore your economy, what prevents the recipient from becoming a rival?
We’ve offshored to all the BRICS countries and they’ve all become our rivals was a result
Well a funny thing about off-shoring your economy is that it really just means exploiting people in countries that can’t stand up to your imperial might. So it inevitably creates enemies. Now you have no economy at home AND the rest of the world hates you! Double stupidity!
but also necessary and that’s the beautiful contradiction of capitalism that will cause it’s inevitable downfall
(Exact source unkown)
Or:
“Just keep making and selling them more rope, eventually they’ll hang themselves with it.”
But, but, this quarter profits.
I wish, I wish we would bring out the guillatine for these greedy treasonous capitalist fucks.
We’ve lost so much because of them
Yep, the Chinese models are already up 10 times cheaper and now that Anthropic, OpenAI, and Google, all are increasing prices up to 10 more for models like Opus, it will make Chinese models anywhere from 50 to 100 times cheaper.
American corps. are betting that since people have their workflow already established they won’t switch to other providers, but that’s not the case. There’s already a mass move to Chinese models.
People keep talking about Chinese models, where are they? How do I used them instead of Claude? Are they safe?
safe?
local models in a sandbox without phone home sounds pretty safe, but are we there yet?
to my limited understanding the tokens are open source… not the model itself
Chinese models are really good. How you use them depends on what your goals are. If you want something on par with Claude or ChatGPT, you need to go to Deepseek or Qwen’s website. You can also find these models on openrouter. If you want a local/small model, then get ollama and find a model in the repository. You could also get anythingllm or LM Studio and get models from within the app. There’s also huggingface.
Since you asked about safety, deepseek on the official website does collect info and there was a time some of that info was leaked but you can get around that using something like openrouter. Note similar things happened to ChatGPT and Meta AI. There is also the potential for bias (there was a time people were spamming their Deepseek Tiananmen Square responses – personally, it just would not process my query) but Grok has that same issue.
Look into zen.ai which is opencode’s sister company that provides llm access. “At cost”
You can see just how cheap they are. I use Augment Code at work and they have kimi 2.6. It’s really solid. Opus/GPT are still better, but for many tasks, kimi works great and doesn’t make me cringe at the price.
Qwen 3.6 is supposed to be really good too. I haven’t used it that much.
ollama or llama.cpp to self host if you have a good mac or good video card. this is perfectly safe.
there are a bazillion hosted inference providers to choose from https://huggingface.co/docs/inference-providers/en/index be aware that you are sending your code to fuck knows who and they are sending back fuck knows what. ymmv, yolo.
hook one of them up to opencode.ai or pi.dev or one of the bazillion other ‘harneses’ or whatever we are calling it this week and try not to rm -r anything important.
for a good time try and get a chinese models to say something about tibet, or taiwan… its like having your own virtual tankie tamagochi!
So literally the same as Western-made AI?
inference providers could be anyone from anywhere, there are even proxy resellers. some are harvesting and reselling your data.
if you send your code to claude/openai/google there is certainly a much higher degree of confidence in who you are sending your data to. yes they to harvest your data and can send you malicious commands (esp if you have a promp injection attack).
its like buying a cheap vps, if the stakes are low its fine, if it important then you need to consider about the consequences of your actions.
nb: i am no expert, just fucking around.
Yeah only the Chinese government is currently far better at working behind the scenes with companies than any other government in the world?
Incompetence is a feature of governments at times.
I trust the Chinese government more than American tech corporations. One side is socialist, the other side is fascist.
The most famous is Deepseek. It’s not even made by “AI” company, it was a side hustle from stock trading company. They released it for free just to flex.
I hear people use minimax as replacement for sonnet and deepseek as replacement for opus, both can be used directly in Claude code instead of Anthropic models
Liu Wen tends to be in China
Check Ollama dot com
In a way it has actually.
Deepseek was big because not only did they publish the full model for everyone to use, but the MoE structure significantly brought down the hardware requirements in terms of processing power. As long as you have enough VRAM, you can run it on older hardware with no need for the latest Nvidia stuff.
Now they got v4 which many have found to be within a 10% margin of Claude and ChatGPT.
On top of that, China has cheapo VRAM GPUs available or soon to be released, like the MTT S80. Yeah it sucks as a Graphics card because the chip is behind, but you get 16Gb of GDDR6 for much cheaper than anything else.
But its not a conspiracy to fight China. The infinite scaling was just Nvidia solidifying themselves as the monopoly because they want all AI infrastructure to be dependent on them, which is why they still illegally export to China, despite an export ban attempting to reduce their potential competition.
Moore Threads (MTT) already has their own CUDA like system called MUSA, and I’m sure they’ll be happy to put in proper hardware support for new stuff like Bf16 and FP8/4. It’ll take a few years, but eventually China will catch up to the point where Nvidia gets shanked by cheaper hardware.
MTT is just a pipe dream, last I checked. But Deepseek is actively being served, in mixed FP8/FP4, on racks of Huawei accelerators.
I believe Baidu trained a model on them, too. But most training (like Deepseek’s) is still done on CUDA.
…Also, be careful equating this stuff with any kind of “consumer friendly” hardware you or I could buy. That’s less likely. The Huawei accelerators (and other local Chinese hardware experiments) are geared towards huge servers serving requests in parallel.
Wasn’t there development of a linux translation layer for CUDA workloads to run on AMD GPUs? I haven’t heard about it in a while, but I’d imagine that’d help the situation.
You mean ZLUDA? AMD gave up on it and released the project on github.
Personally I dream for CUDA API implementation in open source GPU drivers on Linux. That would absolutely level the playing field for the whole industry
Hyper scaling was always about cornering the computer market, It was never about providing us some vastly new and superior service.
They should be strung up. And middle management needs to return to fucking school.
It’s like Kyle Kulinski said “I’m starting to understand re-education camps now”
Exactly, its a method of taking tens of billions of dollars in capital and buying a near monopoly. No other providers can compete if the hyperscalers buy all of the hardware, driving up the prices while also selling the service at a loss.
Nobody working out of their garage with a cool idea for a better service can compete if they can’t get hardware and have to charge double what the hyperscalers are charging because they can’t burn capital for years.
It’s a practice that should be considered illegal market manipulation, because that’s what it i
e: extraneous ‘completely’
‘Dumping’ is considered anti-competitive behaviour in a lot of places. This sounds a lot like that.
Agreed. I am not longer paying token fees as I am running QWEN 3.6 27B MTP on my 4090 GPU and it is as good and as fast as the frontier models for agentic coding.
Same. I’m running Qwen3.6-35B-A3B-FP8 (Qwen3.6-35B-A3B-UD-IQ4_XS.gguf) via the turboquant fork of llama.cpp with a few tweaked memory settings, and I get like 40 tokens / second – nothing that required special insight on my part just following the instructions I saw on a youtube video I found via !LocalLLaMA@sh.itjust.works and asking claude to help me through the installation.
AI has no economic moat. There’s nothing stopping anyone from running LLMs locally.
I just updated my setup from LMStudio to llama.cpp with the new QWEN 3.6 27B MTP model and I am getting 80-112 tokens/second, 90 average which is just shocking to me. I am on a 4090 with a context Window of 64k. It hardly use cloud AI anymore as I rarely need more than 64k if I ensure my first prompt is written like a design document. Multiple prompts are not great so I often just figure out where my initial prompt went wrong, adjust and try again in a fresh session. Way faster this way too. It has really worked out well for me as I am getting just as much done locally for free as I was with hundreds of dollar a month on cloud AI. I am still shocked and grateful it flowed this way.
What do you run it on?
https://www.amazon.com/dp/B0BV8H8HVD with linux mint installed.
What’s the rest of your stack look like?
I am using llamma.cpp with QWEN 3.6 27B MTP, with a 64k context window on a 4090 that OpenCode talks to and then it in term talks to the Unity Game engine via MCP. Getting 80/112 tokens/second work 90 average which is shocking to me as it really does feel as fast as cloud AI (well faster for me as I am in Vietnam and round trips to US data centers really adds up in a session). The only really issue is you pretty much have to one shot prompts as follow up prompts will easily go over the context window size. If I cannot one shot prompts them use cloud AI both that is very rare for my use case. Maybe 1 in 50 or so and only when the tasks touches a lot of large scripts and scenes.
The state of things is what if, that’s true. It has not happened. :)
At some point, it should happen though. Still not going to put a dent in the datacenter / dystopia rally though, since they will pick Nvidia and known brands.