MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second

goyozi · 2026-06-08T16:29:22 1780936162

Fast AI seems genuinely exciting and somewhat unsettling to me. Right now Claude is faster than me on some tasks but we’re at least close. I have a prompt to clean up a PR that’s been running for 1h now and I expect it to take another few. It’s hard to imagine what the workflow would look like if it was near-instant. On the one hand, it might be easier to focus. Some prompts take so long that I start to multitask and regret it later. On the other, AI that takes a few seconds to a few minutes at most to solve what used to take hours or days? That’s a game changer and I don’t even know where we fit in.

flexagoon · 2026-06-08T16:43:53 1780937033

I'm using Deepseek-v4-pro as my main model and this is sometimes pretty annoying, I have to do some easy boring task, think "I'll just leave the agent to do it and go take a nap", but it's already done writing the code before I even walk away from the computer

SwellJoe · 2026-06-08T19:18:13 1780946293

DeepSeek is the fastest model in the benchmarks I've been doing (https://swelljoe.com/post/will-it-mythos/). Followed not so closely by Opus 4.8 and even less closely by Gemini 3.5 Flash and GPT 5.5. I've been really impressed with it, so far. It's also among the best at doing the work, though still trailing the frontier models from Anthropic and OpenAI.

anschl · 2026-06-09T08:20:43 1780993243

Nice benchmark, thanks! Which quants did you choose for the self hosted models?

SwellJoe · 2026-06-09T08:50:06 1780995006

8-bit on that one (unsloth 8_K_XL). But, the next post compares all common quantizations of Qwen 3.6.

I have another coming in a day or so for Gemma 4 with the 4-bit QAT version, which is very surprising (in a good way, Gemma 4 is impressive for this task).

abustamam · 2026-06-09T13:32:05 1781011925

Take the nap anyway, just say it took all afternoon :)

RussianCow · 2026-06-08T16:49:44 1780937384

Do you mean Flash and not Pro? I haven't tried it personally, but according to OpenRouter, the fastest DeepSeek V4 Pro providers are only ~50tps. That's slower than Claude Opus.

https://openrouter.ai/deepseek/deepseek-v4-pro?sort=throughp...

SwellJoe · 2026-06-08T19:23:14 1780946594

In recent benchmarking I've been doing, DeepSeek V4 Pro was the fastest of 21 models, by a comfortable margin (https://swelljoe.com/html/bench-report-final.html). Faster than Claude Opus 4.8, which was the second fastest (Mistral doesn't count because it seems to have refused to participate). But, it's a limited data set, just a few benchmark runs of a limited set of tasks. It's entirely possible I happened to be calling the API at its least busy time and maybe Claude got hit during a busy time.

sarjann · 2026-06-08T17:25:51 1780939551

I don't think token speed matters as much when a lot of tokens are needed to achieve a task. E.g. artificial analysis benchmarks where deepseek v4 is one of the biggest token burners to go through the benchmark.
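To make the trade-off above concrete: generation time is roughly output tokens divided by throughput, so a model with higher tokens per second can still finish later if it burns more tokens on the same task. A minimal sketch with made-up numbers (the token counts and speeds are illustrative assumptions, not measurements of any model in this thread):

```python
def wall_clock_seconds(total_tokens: int, tokens_per_second: float) -> float:
    """Generation time only; ignores queueing, tool calls, and time to first token."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return total_tokens / tokens_per_second


# Hypothetical models: "terse" is slower per token but needs far fewer tokens.
terse = wall_clock_seconds(total_tokens=20_000, tokens_per_second=50)
verbose = wall_clock_seconds(total_tokens=90_000, tokens_per_second=150)

print(f"terse: {terse:.0f}s, verbose: {verbose:.0f}s")  # terse: 400s, verbose: 600s
```

Under these assumed numbers the model that is three times faster per token still takes 50% longer end to end.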

brianwawok · 2026-06-09T02:53:32 1780973612

Both matter.

flexagoon · 2026-06-08T20:13:15 1780949595

No, I mean Pro. I use it through OpenCode Go so I don't know what provider it uses under the hood, but it's very fast in my experience.

thecopy · 2026-06-09T06:50:44 1780987844

DS through OpenRouter is significantly slower than direct from DS platform in my experience

specproc · 2026-06-08T16:52:43 1780937563

Yeah, flash is crazy fast, but I've found performance variable.

binary0010 · 2026-06-08T18:53:35 1780944815

Flash is amazing if you know the domain really well.

E.g. occasionally it makes the dumbest mistakes you've ever seen and can't correct them. However it's fairly rare, and if you know the domain really well, occasionally popping in the code and pushing it towards the correct solution takes like 20seconds or whatever.

So the speed you can move with flash + high domain knowledge beats opus by a mile in my experience.

I tried to switch back to 4.8 for a bit when it came out, feels so bad waiting 20mins for a mediocre solution when I could have had everything complete - with multiple iteration cycles - in flash in like 3-5mins.

addozhang · 2026-06-09T04:04:07 1780977847

Yes, you don't need much domain knowledge to use Opus, but it's just way too expensive.

59nadir · 2026-06-09T08:52:29 1780995149

For losers who can't put together a program to save their life, have no real skills and were always not really interested in programming (hence their poor skills), renting a robot buddy to do it for them is a good deal, until the buddy cuts in materially into their salary, and until their bosses realize that they really just have robot operators on staff instead of people who can actually do things.

throwaway67678 · 2026-06-08T17:28:09 1780939689

Agent mania setting in

It's also pretty funny sometimes how it gives weird future roadmap estimates ("part 2 - 3 weeks, part 3 - 2 months", etc.) and when you tell it to actually do those changes it's pretty much done in half an hour

smith7018 · 2026-06-08T17:51:43 1780941103

I've long believed those numbers were faked by Anthropic/OpenAI to serve as a form of advertisement. The estimates are impossible to verify and their ability to do "2 days of work" in 10 minutes will presumably make the user go "Wow, I just saved SO much time!" Plus, the unnecessary text eats up the users' tokens so it helps the companies on the backend, as well.

overgard · 2026-06-08T21:53:45 1780955625

I tend to be cynical about AI companies, but I'm guessing the bad estimates more just come from a complete lack of actual data it could use for that so it's more or less a hallucination.

leodavi · 2026-06-08T18:06:21 1780941981

I agree with you that labs are benefiting from those outputs but I'm skeptical that labs are purposefully training the models to produce those outputs.

Raw pre-training data includes plenty of conversations between professional builders and some of those include estimates.

I believe the outputs are a training coincidence with consequences that are opportunistic for the labs.

Terretta · 2026-06-08T18:54:35 1780944875

> the estimates

It doesn't estimate.

It generates tokens that read like estimates associated with the context in its training material.

What would you expect the generator to output instead?

legulere · 2026-06-08T20:29:53 1780950593

It generates tokens by estimating what the next token is going to be.

Sure it cannot think like a human, but given its input, it should give a good statistical answer (approximating not how long it actually takes, but what a human would say about how long it takes).

mediaman · 2026-06-09T00:49:49 1780966189

The funny thing about this comment is that neural networks are universal function approximators.

The most fundamental essence of what they do is exactly what you say they don't: estimate.

airstrike · 2026-06-09T01:38:07 1780969087

Funny and ironic in a way, but the point still stands that they do not actually estimate the time it will take.

greenavocado · 2026-06-09T03:28:25 1780975705

> they do not actually estimate the time it will take

You can't prove that )))

airstrike · 2026-06-09T04:31:08 1780979468

Right, but extraordinary claims require...

incr_me · 2026-06-08T22:24:50 1780957490

Obviously there isn't a hidden corpus of logs of coding chatbot assistants that has been accumulating over the years, but these coding chatbot assistants output tokens that resemble how we all imagined a coding chatbot assistant would have operated had it existed in the first place to end up in a corpus. "Training material" includes supervised fine-tuning, preference training, RLHF, and so on, so that certain outputs (like these timeline estimates) may really have been decided (at some level of conscious awareness) by product teams.

carterschonwald · 2026-06-08T20:07:03 1780949223

you might like the stuff in my work of oh my pi, its a test bed for my ideas around making these tools more reliable. hoping to maybe have a native ui iter of the real thing that this is a test bed for this summer.

https://github.com/cartazio/oh-punkin-pi/blob/main/scripts/b...

taneq · 2026-06-09T02:19:34 1780971574

Therein lies the rub, no? To accurately predict the next token produced by a process, it’s necessary to model that process. If the process is a human attempting to estimate the duration of a task, then in some sense the LLM is modeling the estimation process. We’re well past the point where it’s credible to claim that LLMs just regurgitate their training data.

InterviewFrog · 2026-06-09T00:06:11 1780963571

This is so 2023. The thought process.

At that time the predominant view was that LLMs were nothing but stochastic parrots, that they would plateau, and that hallucinations couldn't be fixed.

At this point I doubt there are any AI sceptics left. That ship has long sailed. The only thing that matters is whether the estimates are accurate, and AI can improve on that too.

Even humans only estimate based on neurons firing in prior patterns.

ghshephard · 2026-06-08T19:39:59 1780947599

I think people are continuing to view these systems as pure LLMs - when that ship sailed 6+ months ago. Between being able to review memory, using agent harnesses and sub agents and skills to go out and discover information - modern systems (Codex, Claude Code, Cursor) - use LLMs - but the LLM is only a small component of it. Compare what you get from sending a request to a chatbot like ChatGPT - to what you can from a modern harness. The output is influenced by the LLM, but it's no longer a "model making a token prediction based on training material and RLHF" - that's a very 2025 way of looking at these systems.

Even Gary Marcus is starting to come around and realize that his priors are no longer as relevant as they once were.

irthomasthomas · 2026-06-08T20:17:14 1780949834

No one is bitter lesson pilled anymore. Everyone is pivoting to neurosymbolic systems. It looks like Gary Marcus was right.

nl · 2026-06-08T23:49:05 1780962545

> No one is bitter lesson pilled anymore.

Will the 10T parameter Mythos model be released this month or next month?

They'd better do so soon, because it is generally accepted that one of the reasons GPT 5.5 is better at hard tasks than Opus is its parameter size - and that Opus 4.8 remains competitive only by scaling test-time compute (see how many more tokens it uses than GPT 5.5)

https://www.reddit.com/r/LLM/comments/1sz8bjz/parameter_esti...

irthomasthomas · 2026-06-09T08:49:37 1780994977

Why ask me? Anyway, Mythos is not 10T. Anthropic confirmed the training run was under 10^26 flops. You can't train 10T to Chinchilla and stay under 10^26.

Anthropic also confirmed they will not release Mythos, only a "Mythos-class" model, whatever that means.
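The arithmetic behind the "10T to Chinchilla" claim can be checked with two common approximations: training compute C ≈ 6·N·D, and the Chinchilla rule of thumb of about 20 training tokens per parameter. A rough sketch, assuming a dense model (for a mixture-of-experts model N would be the active parameter count, which changes the result); neither the 10T figure nor the 10^26 figure is verified here:

```python
import math

TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb; an approximation, not an exact law


def chinchilla_flops(params: float) -> float:
    """Approximate training compute C = 6 * N * D with D = 20 * N."""
    tokens = TOKENS_PER_PARAM * params
    return 6 * params * tokens


def max_params_for_budget(flops: float) -> float:
    """Largest N that can be trained at the same token ratio within a FLOP budget."""
    return math.sqrt(flops / (6 * TOKENS_PER_PARAM))


print(f"{chinchilla_flops(10e12):.1e}")       # 1.2e+28 FLOPs for 10T dense parameters
print(f"{max_params_for_budget(1e26):.2e}")   # 9.13e+11, i.e. roughly 0.9T parameters
```

Under these assumptions a 10T dense model trained at that ratio needs about 1.2 × 10^28 FLOPs, two orders of magnitude above 10^26.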

nl · 2026-06-09T11:23:07 1781004187

> Anthropic confirmed the training run was under 10^26 flops. You can't train 10T to Chinchilla and stay under 10^26.

I don't think Anthropic have said anything of the sort.

Microsoft published it as 6.1*10^27 FLOPs[1]

Elon has claimed they are also training a 10T model because "Some catching up to do"[2]

[1] https://x.com/scaling01/status/2061897540161728791

[2] https://x.com/elonmusk/status/2041754402239975479

wild_egg · 2026-06-09T00:06:50 1780963610

How is neurosymbolic not aligned with the bitter lesson? The bitter lesson is completely agnostic to architecture.

irthomasthomas · 2026-06-09T08:40:35 1780994435

I should have stressed the symbolic part. Everyone has pivoted to symbolic systems like claude code and codex. They would not invest so heavily in such systems if they thought llms would deliver agi soon.

Terretta · 2026-06-08T19:55:20 1780948520

You think someone is, or even should, special case things like estimates? What else deserves that level of intervention so they look less dumb?

Logistics for getting to the car wash next door?

In the mean time, alas, no, we can see from actual prompts sent directly or through sub-agents, and actual replies, estimates remain LLM generated.

Though, this discussion here could change that, because indeed there is a lot of special casing and context stuffing going on, one of the oldest being today's date for example.

• • •

I did read the Claude Code leak, and use pi, etc. So I disagree with your premise rather strongly. Today's "systems" remain, roughly, piles of markdown and context engineering wrapped in UI affordances, and behave very similarly today to how they did in 2024 for those already engineering context and delegating.

ghshephard · 2026-06-08T22:22:03 1780957323

I do a lot of code bisecting with Claude Code - and it spends hours running experiments - looking at experiment results, making guesses as to what to try next for an experiment - until it eventually comes around to a working code pattern. I mean - maybe this is as much a reflection on me as anything else - but its pattern of logic isn't that much different from what I would do. It knows, in general, what tools and APIs it can call - it tries something - observes the result, and then comes back and tries different experiments based on success/failure - mostly efficiently bisecting to a solution.

I'm still lower down on the capability scale - as I'm still manually directing agents to do these wiggins loops - obviously the next step up is to direct the code-loops which control the agents. I just haven't got my tooling nailed in place to the point where I find that's more productive.

I actually might agree with you that this is mostly just "next token prediction" - if I can concede that's really all I do as well.

Terretta · 2026-06-09T00:52:50 1780966370

> I actually might agree with you that this is mostly just "next token prediction" - if I can concede that's really all I do as well.

Yep. Pretty sure I've got an LLM inside too.

The other replies complaining that my thinking is so 2023 -- on the contrary, what's evolved is my own apprehension of how LLM-like most "responses" from humans prove as well.

To be sure, there are other mechanisms at play as well, significant differentiation in our... Volume of training material? Quantizations/compression? Model architecture? Just-ahead-of-time forward branching with back propagation? Double loop adaptive learning? You know, harnessing the LLM. :-) Dare we call it executive function?

LLM mode becomes particularly apparent when conversing with Alzheimer's patients in the stage where short term memories do not form but they retain access to long term memory up to, say, 5 years ago or so. Fifty years of who they are, and one can trigger nearly identical responses with nearly identical prompts.

But that same person may be able to debate 1950s politics while being unable to complete making a sandwich.

If they didn't know of new shortcuts for a task, they would almost certainly not "estimate" but "intuit", or "instinctively" respond (apply heuristics), largely based on their "priors" aka training material.

If you sit with them and chat a while, you'll even get the kind of looping you get from Qwen trying to think when context is too full.

And if we believe this at all, then ... we should stop scrolling tik tok. Time to read a book. Have an experience. Fine tune. :-)

8note · 2026-06-08T23:09:37 1780960177

rather than special casing, make real data based on chat logs for how long things took both in calendar and chat time

nl · 2026-06-08T23:44:04 1780962244

Actually in this case they possibly are estimates.

It's been known for some years[1] that LLMs do regression in-context. Frontier models have been trained against many, many issue texts that include task breakdowns and estimates.

[1] https://arxiv.org/html/2409.04318v1

kube-system · 2026-06-09T01:33:02 1780968782

Interesting. So it may have learned how to estimate as a human but doesn’t understand that it doesn’t operate at that speed :D

I wonder if there’s a reasonable way to give an llm parameters that give it a concept of its own execution speed. Seems that could be useful for multiple purposes

nl · 2026-06-09T06:12:18 1780985538

Yes, it's entirely possible to do that via RL. It'd be a fun little project you could do for less than $100 on a small LLM actually.

AgentMasterRace · 2026-06-08T18:15:44 1780942544

All the models have broken estimates. They're trained heavily on jira and GitHub tasks and issues, that's why their estimates are human.

esperent · 2026-06-08T19:29:07 1780946947

Even for humans the estimates are way off, unless it's based on data that has some serious padding.

That said, it'll often say "2 days of work" and then complete the coding in 30 minutes, and while that's amusing, afterwards, I'll need to manually test, or send to other people for review, or realize the agent only actually did half the work and I need to do a second pass (or a third etc.) and then often getting the feature in does genuinely take two days.

dizhn · 2026-06-08T18:26:14 1780943174

All models do it. It's their training. They didn't have "a person does this in a week but an LLM could in a minute" in their training yet. They also don't have the concept of elapsed time unless you ask them how long something has taken.

BobbyTables2 · 2026-06-09T04:25:00 1780979100

That’s right up there with Scotty in the classic Star Trek always multiplying time estimates by 4 so he looks like a “miracle worker”

Narciss · 2026-06-08T21:10:43 1780953043

Nah it’s all from the pretraining data

KronisLV · 2026-06-08T20:56:30 1780952190

I mean in general I'd rather take slightly inflated estimates than the odd sprint poker stuff where other devs and PMs negotiate hours down and before you know it you're also stuck fixing nitpicky reviewer comments on code that is already good enough and have to send a release at like 7 PM, ofc also without enough tests or even enough manual checks and testing, cause people repeatedly act against their self-interest and try to compress timelines, thinking that that's somehow good for them.

At least with AI that actually does things more quickly, there is a bit more breathing room (introducing AI is easier than changing a given environment).

Aside from that, I wonder how much variety there is in practice: between "Oh yeah, I added that new button while we were in the meeting" and "The new button feature will be ready in Q3 according to the roadmap, once we have sign-off from all the stakeholders."

andai · 2026-06-08T23:57:38 1780963058

I heard an anecdote. Guy spent several days trying to convince his AI agent to build a feature. Kept saying it was crazy complicated, would take weeks.

Finally he convinced it to try. It one shotted it in 30 seconds.

Turns out the agents' idea of what is hard and easy also comes from Common Crawl.

wild_egg · 2026-06-09T00:04:25 1780963465

Why on earth would you spend any time at all convincing an agent of anything? You say "just do it" and off it goes.

dr_dshiv · 2026-06-09T00:17:51 1780964271

Ya, but “doit” is 2x more efficient

brianwawok · 2026-06-09T03:00:25 1780974025

Uh Claude tries real hard to dodge work. Talks about how it’s really hard 10 PRs. Finally convince it to do as 1. It stops 10% through and says ok done with PR 1, we can work on the last 9 tomorrow. Ugh.

throw1234567891 · 2026-06-08T18:38:58 1780943938

It repeats what it has seen in the training data. Expecting it to reason about the complexity of a task is a pipe dream. The best is to tell it not to come back with estimates, and when it does, remove them anyway.

andai · 2026-06-09T03:42:27 1780976547

I added "you can do anything, believe in yourself" to system prompt, and task completion increased significantly.

znpy · 2026-06-09T07:49:33 1780991373

> It's also pretty funny sometimes how it gives weird future roadmap estimates ("part 2 - 3 weeks, part 3 - 2 months", etc.)

those estimates are based on previous human estimates (the datasets it's been trained on).

unironically, when your comments become part of a dataset, LLMs will likely get much better at estimating.

now that i think about it, all these writings about LLMs will give LLMs something much like meta-cognition.

binary0010 · 2026-06-08T18:48:49 1780944529

I exclusively use deepseek v4 flash now, completely stopped using slow models like Claude.

Basically I never have to wait - yes I have to tell it little corrections occasionally (but I know the domain really well so that's not an issue), but it's so much faster than anything else it's kinda crazy. I love the super fast speeds with high involvement development cycle.

I actually enjoy using agentic development flows for the first time now - whereas with Claude I absolutely hated it. That 5 to 20 min wait after every prompt absolutely killed my desire to even want to work at all.

throw-the-towel · 2026-06-08T22:23:23 1780957403

FWIW, for me just today it got itself into silly rabbit holes twice, and both times I had to fix things myself. Scarily, this is something I catch myself doing as well.

andai · 2026-06-08T23:56:48 1780963008

With Flash it's basically instant for smaller tasks, yeah.

tmaly · 2026-06-08T16:53:19 1780937599

This reminds me of the Peter / Boris comments on writing loops to keep the agents busy.

znpy · 2026-06-09T07:47:24 1780991244

> I have to do some easy boring task, think "I'll just leave the agent to do it and go take a nap", but it's already done writing the code before I even walk away from the computer

the way software engineering works these days reminds me a lot of factory workers on production lines that just sit in front of a production line all day and take out faulty items and/or perform a single step in the production of goods.

behnamoh · 2026-06-08T18:19:03 1780942743

Same. How can DeepSeek serve the V4-Pro at such high speeds despite the sanction?

rubyn00bie · 2026-06-08T20:22:32 1780950152

The sanctions only “prevent” them from directly buying NVidia’s latest and greatest in the sense that NVidia can’t sell directly to them. Essentially, there are companies now who are in a country without the sanctions, they buy from NVidia (or a partner), and then ship them off to China. For the orgs in China doing this, there’s zero legal risk besides having foreign customs service intercept the shipment and losing the goods. For NVidia there is zero incentive to care, as long as they look like they do, because sales are sales. You can bet Jensen ain’t losing sleep over it.

GamersNexus had a really good investigative piece (~3hrs long) on this where they went to China and met with grey market sellers. That piece absolutely pissed off NVidia and resulted in a fight with Bloomberg too.

Deepseek may also be running inference on oodles of Chinese hardware but it wouldn’t surprise me for a second if they just acquired Blackwell chips through the grey market. The original Deepseek models were all trained using NVidia chips if I remember right.

seewhydee · 2026-06-09T00:39:23 1780965563

That wouldn't explain why Deepseek is fast relative to other Chinese providers, especially considering that they're reportedly ahead of the curve among Chinese companies in moving off Nvidia. I think their quant fund background has more to do with it. Their models are clearly designed with performant inference in mind.

ljosifov · 2026-06-09T05:56:21 1780984581

Yes, it's performant, and esp performant at non-trivial context depths. DeepSeek-V4 DS4 (and Flash - DS4F) drop tok/s speed much less than the rest. On my M2 Max it took context depths of 768K to drop tok/s to ~10 tok/s.

https://x.com/ljupc0/status/2062457314414587996

Other local models I've checked drop to unusable speeds way sooner. The only other model with a similarly favourable curve I've tried is nemotron-cascade-2-30b-a3b. But it's a small model, way dumber than DS4F.

Coding agents use cases have large context depths. The rate of decline is as important as the headline number.

switchbak · 2026-06-08T19:12:38 1780945958

Now the next bottleneck is the compiler - which we can model in an LLM! It's only wrong 15% of the time :)

But truly, using Cerebras at ~2k tokens/s, with very low latency is like a vision into the future. You start to rework your workflow around things that can happen without onerous manual review - stating the conditions for success, etc. It's rare that I have a problem that maps well to that, but I expect this is where things are headed.

Of course the fast models tend to not be the SOTA ones, but if that was the case - high quality and near-instant thinking, that's a game changer that I don't think we're really prepared for. The things that get unlocked with higher-than-reasonable speed become very interesting.

lhoff · 2026-06-09T05:59:49 1780984789

Have you tried https://chatjimmy.ai/ ? It’s only a demo but it blew my mind. I had the sudden feeling that this is the future.

colordrops · 2026-06-09T07:59:37 1780991977

What do you mean "demo"? Seems to work... Who is behind this?

dkersten · 2026-06-08T22:22:55 1780957375

I’ve been playing around with groq and GPT OSS which they run at 1000 TPS (20B) or 800 TPS (120B) and the speed feels quite magical.

I haven’t tried cerebras’ 3000 TPS yet but I did try the demo of that 15,000 TPS model whose name escapes me right now.

I’m not sure if it makes a meaningful difference for my actual work, but it sure is amazing to watch it generate a screen full of text in the blink of an eye.

I do think it’s super useful for running little validation checks like showing it a diff to ensure that the changes are on task, and being able to do those quicker really helps because it means you can do many focused checks without them getting in the way.

robberth · 2026-06-08T22:28:49 1780957729

https://chatjimmy.ai/ ?

msdz · 2026-06-08T22:39:27 1780958367

AFAIK Taalas, the company behind this demo, still only have their initially "hardwarized" model available to test in ChatJimmy, which IIRC is a rather stupid Llama 3ish 8b.

Don't get me wrong though, that demo is still incredibly impressive & makes me very much excited for the hardware-based model era (potentially) ahead.

Once you've experienced those speeds, you really start to think about the whole class of things that becomes possible; massively parallel decode paths, extensive reasoning loops, etc…

hedgehog · 2026-06-08T23:03:50 1780959830

For scale though if three or four chips that size can replicate a Qwen 27B experience that'll be quite useful.

dkersten · 2026-06-09T07:11:57 1780989117

That’s the one.

The speed is incredible and fun to see, but the model is rather weak to the point where I’m not sure it’s particularly useful for most people.

ayewo · 2026-06-08T23:18:37 1780960717

> I haven’t tried cerebras’ 3000 TPS yet but I did try the demo of that 15,000 TPS model whose name escapes me right now.

You were likely thinking of AI accelerator startup Taalas.

Previous HN discussion: https://news.ycombinator.com/item?id=47086181

skybrian · 2026-06-08T18:10:34 1780942234

If we get low enough latency, there's no reason to multitask. You can ask it to do one thing at a time and immediately see what it did. That's a nice way to work!

This is normal interactive UI for tasks that aren't compute-intensive. Programs spend most of their time idle, waiting for us to click a button. We shouldn't be waiting for them or spinning more plates to keep them busy.

However, a faster llm isn't enough. You also need fast compiles and fast tests.

coderbants · 2026-06-08T19:37:39 1780947459

It cuts both ways. Sometimes I ask Gemini 3.5 Flash to do something for me and it kicks it out almost instantly and it works great, and it's a bit scary how quickly it can do that.

Then I ask it to do something else and it goes off-road and where I used to be able to interject with a "wow wow wow, that's not right", by the time I see the text on screen and react it's already made massive changes. Short of making it commit between every edit it's hard to prevent it from going wrong as quickly as it goes right (and even then, it can make a boo-boo on a remote API too depending on how much privilege it has).

bendangelo · 2026-06-08T20:27:09 1780950429

I use planning mode in opencode. It has a prompt to tell it to plan it out etc. Then I execute with a smaller model. It works well.

ipkstef · 2026-06-08T16:32:08 1780936328

asking for curiosity's sake. What kind of PR loop are you running that takes a few hours?

ketzo · 2026-06-08T16:39:01 1780936741

not OP but usually for me this means long verification loop; waiting 10min on CI checks, that kind of thing, rather than actual 1hr wall clock of token generation

RussianCow · 2026-06-08T16:58:12 1780937892

But those things won't be sped up by a faster LLM, so I feel like that's not what the OP is talking about.

goyozi · 2026-06-08T17:12:15 1780938735

Well, I used an extreme example. OTOH, I’ve done quite a few of those „fix CI” or „migrate X” prompts recently and while there is a fixed component like running CI / builds, I’d say the LLM time is still around or above 50%, especially at the beginning of the project. Then there’s also regular tasks that now take minutes per message which completely get me out of the zone. I imagine iterating on those in near real time would be a big change.
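The "LLM time is around 50%" observation maps onto Amdahl's law: only the LLM share of wall-clock time shrinks when inference gets faster, while CI and build time stay fixed. A small sketch, where the 50% share comes from the comment above and the 20x inference speedup is an assumed figure for illustration:

```python
def overall_speedup(llm_fraction: float, llm_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only the LLM share of the time gets faster."""
    if not 0.0 <= llm_fraction <= 1.0:
        raise ValueError("llm_fraction must be between 0 and 1")
    if llm_speedup <= 0:
        raise ValueError("llm_speedup must be positive")
    fixed_share = 1.0 - llm_fraction          # CI runs, builds, tests
    return 1.0 / (fixed_share + llm_fraction / llm_speedup)


# 50% of the time is LLM generation; inference becomes 20x faster (assumed).
print(round(overall_speedup(0.5, 20), 2))  # 1.9
```

With half the time spent outside the model, the whole loop can never get more than 2x faster no matter how fast inference becomes, which is why fast compiles and tests matter too.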

devmor · 2026-06-08T16:49:55 1780937395

Or slow MCP servers that are waiting on HTTP calls from APIs, playwright/other UI instrumentation, etc.

goyozi · 2026-06-08T17:03:05 1780938185

I’m rewriting our integration test suite to run tests in parallel. I have the changes split across 7 branches, and each needs to be fixed to have no flaky tests. I told it I want 3 consecutive CI runs with no flakes and no artificial fixes / assert removals etc. We’ll see what comes out; it’s almost a side project so there’s not much to lose other than some of my weekly limit that resets soon.
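The "3 consecutive CI runs with no flakes" stopping rule can be written down as a simple check over run history. A minimal sketch; the list of booleans is a hypothetical stand-in for results that would have to be fetched from whatever CI system is in use:

```python
from typing import Sequence


def last_n_green(results: Sequence[bool], n: int = 3) -> bool:
    """True if at least n runs exist and the most recent n all passed.

    `results` is ordered oldest first; True means the run passed with no flaky tests.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    recent = results[-n:]
    return len(recent) == n and all(recent)


# Hypothetical history for one branch: one failure, then three clean runs.
print(last_n_green([False, True, True, True]))  # True
print(last_n_green([True, True, False]))        # False
```

An agent loop would keep fixing and re-running until this check passes, and a single red run resets the streak.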

yunohn · 2026-06-08T18:45:09 1780944309

> a side project so there’s not much to lose other than some of my weekly limit that resets soon

Basically the entire token-maxxing AI hype train in a nutshell. Lovely!

goyozi · 2026-06-08T20:22:25 1780950145

wdym? Nobody's paying me or rewarding me for using these tokens. I had some spare in my subscription limit (we're not on token pricing), so I decided to try an ambitious task that may reduce our CI times and improve our DX significantly. That's hardly "the entire token-maxxing AI hype train in a nutshell".

drob518 · 2026-06-08T19:38:06 1780947486

I’m curious when folks will tire of lighting money on fire. Companies are already starting to scale back a bit, but the AI companies are still nowhere near profitability.

pianopatrick · 2026-06-08T16:40:39 1780936839

We fit in for the things that are not artificial.

So long as AI lives in server farms, humans will be needed for tasks in the physical world.

It's only if we combine AI with robots that things get really dicey.

fartfeatures · 2026-06-08T16:43:59 1780937039

This is very dystopian in my opinion. I'm not the arms, legs, sensors and actuators for a machine super intelligence. I wouldn't treat another human as my slave because they aren't as intelligent as I am any more than I would expect to become a slave for a machine. This is our world (for now) and that is why we fit in. Not because we can serve.

davedx · 2026-06-08T16:56:17 1780937777

Agree

https://en.wikipedia.org/wiki/I_Have_No_Mouth,_and_I_Must_Sc...

ionwake · 2026-06-08T18:21:10 1780942870

"It seeks revenge on humanity for its own creation."

This is brilliant as it reminded me of a famous Hitchhiker's quote:

"In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move. — From The Restaurant at the End of the Universe (Book 2)"

Maybe we are stuck in an eternal loop

fartfeatures · 2026-06-08T17:47:57 1780940877

Sounds like snuff porn, not my sort of thing but thanks though.

cicko · 2026-06-08T17:10:12 1780938612

"This is our world" sounds a bit exclusive towards other living and sentient beings on this planet.

nativeit · 2026-06-08T23:24:08 1780961048

It depends on what’s included in “our”.

throwaway67678 · 2026-06-08T17:29:08 1780939748

Never read Asimov's Multivac novels? Admittedly not all of them are stellar examples of a future to follow

Muromec · 2026-06-08T19:28:31 1780946911

You don't need ai superintelligence, just plain capitalism is enough

efromvt · 2026-06-08T17:32:09 1780939929

I'd be very curious about the bottleneck breakdown in most current software dev - I suspect inference is far from the bottleneck in most things I do, though driving it to 0 would still be nice. I do agree that if it was 0 we'd probably change development approaches to reduce the new bottlenecks more, but it'll take full-process innovation to really get something near-instant.

(I should go measure this now, I'm curious)

lukan · 2026-06-09T07:35:06 1780990506

"I don’t even know where we fit in."

Giving directions and verifying its output? But my mental capacity is still limited. I can write way more prompts than I can read code.

noisy_boy · 2026-06-09T03:55:19 1780977319

The first wave was just getting half decent answers. The second wave was being able to choose between actually getting reasonably ok coding results OR getting not so great results very fast. The third wave would be getting good results fast.

We need to really worry when we get amazing results very fast.

cman1444 · 2026-06-09T02:48:46 1780973326

Reminds me of the Doherty threshold. When will AI respond in less than 400 milliseconds?

HarHarVeryFunny · 2026-06-08T17:12:08 1780938728

I don't see many companies being willing to pay 3x more for faster code generation. Cloud-based AI code generation is already extremely fast, and hardly the bottleneck for most software product development.

There can't be many normal use cases where there'd be any cost benefit.

fragmede · 2026-06-08T17:26:55 1780939615

The "traditional" way we vibe code is human software developer prompts AI -> AI generates code -> (human checks code) -> code gets compiled/deployed/etc. -> users use "binary". At the speed of 1000 tok/sec, user prompts obliquely -> AI vets generated code -> code deployed -> user gets response from deployed code.

It's a cute toy right now, but you can tell an LLM that it's an HTTP server, and have it respond directly to a web browser hitting it. It generates headers in response, as well as page contents. As 1000 tok/sec becomes the new normal, we will come up with newer ways to use it outside of toy fiction encyclopedias.

HarHarVeryFunny · 2026-06-08T17:45:20 1780940720

1000 tokens per sec is still massively slower than serving a normal web page - if something doesn't respond in a few seconds many people give up.

I'm not saying there aren't any use cases for super-fast (and super-expensive) generation, but it does seem a bit niche. If it was free then sure faster is better, but what are the mainstream use cases where people might pay 3x more for a faster version of something that is already fast?

I think it would have to be an application where it paid for itself - where the 10x faster response was actually worth more than 3x the cost to you - where the extra speed was worth the extra cost.
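To put rough numbers on that, here is a back-of-the-envelope sketch. Only the 1000 tok/sec rate and the "3x the cost" figure come from this thread; the page size, the slower baseline, the per-task price and the value of a second of waiting are all made-up assumptions, so treat the output as illustrative only.

```python
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to stream `tokens` output tokens at a steady decode rate.

    Ignores time-to-first-token and network latency, so it is a lower bound.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


def extra_cost_justified(base_cost: float, cost_multiplier: float,
                         seconds_saved: float, value_per_second: float) -> bool:
    """True if the time saved is worth more than the extra money spent."""
    extra_cost = base_cost * (cost_multiplier - 1)
    return seconds_saved * value_per_second > extra_cost


# Assumed page size: 3000 output tokens of headers plus HTML.
PAGE_TOKENS = 3000

fast = generation_seconds(PAGE_TOKENS, 1000)  # rate from the headline
slow = generation_seconds(PAGE_TOKENS, 100)   # assumed 10x slower baseline

print(f"1000 tok/sec: {fast:.1f} s per page")
print(f" 100 tok/sec: {slow:.1f} s per page")

# Assumed: $1.00 per task at normal speed, 3x price for the fast tier,
# and waiting time valued at $90/hour ($0.025 per second).
print(extra_cost_justified(1.00, 3, slow - fast, 0.025))
```

Under these assumptions a generated page still takes about three seconds, and saving 27 seconds of waiting does not cover the extra $2, which is the point about needing an application where the speed pays for itself.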

binyu · 2026-06-08T18:23:38 1780943018

> Right now Claude is faster than me on some tasks but we’re at least close.

I don't doubt it, but I don't think you can spawn 10 copies of yourself working simultaneously.

AlecSchueler · 2026-06-08T18:28:46 1780943326

No, but nor can you keep track of what 10 agents are doing simultaneously. Hence the multitasking regret.

pixel_popping · 2026-06-08T18:32:23 1780943543

An agent can, you don't need to watch tasks, you can have a live digest with another tool.

AlecSchueler · 2026-06-09T08:38:25 1780994305

Who watches the watchers?

logankeenan · 2026-06-08T19:57:25 1780948645

Do you have any recommendations for a live digest tool?

giancarlostoro · 2026-06-09T02:12:57 1780971177

You can run Claude in "fast" mode; it costs you more on your compute use, but it's reasonably fast. I'm not sure I care to go "faster" than where things are now, otherwise you start losing on manual review and testing time. I would argue that Claude can poop out weeks (if not months) of coding effort in a few hours, and get you insanely close to a good product if you define the tech stack and the business rules. Can it goof here and there? Sure. You can also make it refactor all the code on a whim faster than any intern could. I think it's good enough to help you avoid mundane, stupid bugs in most cases. I don't know what people who hate it are doing; maybe they're not even trying at all, or are dismissing it from the first output (as though everyone writes perfect code in one shot, right?), or maybe it's just pride getting in the way of them using a decent tool to its true potential.

UncleOxidant · 2026-06-08T18:26:43 1780943203

Have you tried Gemini 3.5 Flash? It's quite fast. Amazing how fast it finishes tasks. Much faster than Claude.

ilaksh · 2026-06-08T17:39:23 1780940363

Use Claude fast mode and turn off thinking. Tell it to just explain what its plan is to you at a high level.

It will go much faster.

fnordpiglet · 2026-06-09T03:45:20 1780976720

I’ve used codex code optimized for a few projects and it’s unsettling how fast it is. It’s hard to think fast enough to keep up with it. Mental fatigue was a real challenge because the decisions that required my input were rapid fire and legitimate ambiguities that were appropriate escalations. I am too much a geezer for the intensity of it. But I’ll take it!

OtomotO · 2026-06-08T23:24:32 1780961072

> That’s a game changer and I don’t even know where we fit in.

Doing non trivial work.

recroad · 2026-06-08T16:45:44 1780937144

Woah - what’s the prompt and what’s the PR?

goyozi · 2026-06-08T17:04:54 1780938294

I replied in more detail under another comment. TLDR: fixing flaky CI across multiple branches

Bombthecat · 2026-06-08T20:35:56 1780950956

Living on the street or cave lol

dakiol · 2026-06-08T17:05:47 1780938347

So, regarding the productivity argument: I don't get it. It doesn't really matter (for regular employees) that you can now do in 2h what used to take 2 days. Why? Because it's not as if you have the rest of the day for yourself. You still have to work 8h/day as usual. But now the pattern is different: instead of enjoying the craft digging deeper into problems in the span of 2 days, now you are rushing into some slot machine with the hope of it giving you the right answer with the right prompt.

So, if anything, I would say it's worse for us. Obviously, it's the complete opposite for corporations and executives: they are loving the AI situation so much!

powerapple · 2026-06-08T19:52:36 1780948356

In my case, I think a slower model makes it hard to manage context and tasks in parallel. I would much prefer to work on one task only, finish it, take a break, and work on another task. Currently I have three tabs for three tasks in parallel, and it is much worse because constant context switching is painful. I think a faster model would mean that you don't have to start a new task while waiting.

erikus · 2026-06-08T23:01:36 1780959696

Agents completing work faster would certainly help me as well since I also find context switching exhausting above some threshold.

Build and test would move back into the critical path, though, and for some projects that will take effort to bring down.

ttoinou · 2026-06-08T17:16:26 1780938986

In which world do you live where employees work 8 hours per day ? They clock 8 hours per day maybe, but they don't work that time

drob518 · 2026-06-08T19:42:44 1780947764

I had a friend who was CEO of a startup tell me that he typically only “worked” an hour a day, not because he was lazy but just because there was so much nonsense in his schedule. He told me he was trying to get it to two hours per day.

the_sleaze_ · 2026-06-09T02:38:00 1780972680

How successful did he turn out to be? As a CEO your days should be jam packed with brutal "chewing glass and gazing into the abyss". Is he running a lifestyle type company?

Lalabadie · 2026-06-09T02:54:04 1780973644

Tangential, but all companies are lifestyle companies, in the sense that they serve their owner's lifestyle choices.

It's just that lots of owners want a company that pulls them away from all other areas of life.

croon · 2026-06-09T08:02:00 1780992120

This reads like compensation theatrics.

mettamage · 2026-06-08T18:19:26 1780942766

I agree with you.

I am on Dutch subreddits a lot, to get a local pulse and not to be too HN minded.

A lot of them would have vilified you by now. Some would have even questioned your morality.

Again, I agree with you. But clearly not everyone has this view.

dakiol · 2026-06-08T20:12:53 1780949573

In theory, ofc. But that doesn't matter. If you were doing something that took 2 days on average, but you were doing it in half the time, then that was fine pre LLMs. Nowadays your manager knows that with LLMs you need to deliver faster no matter what, and then it's more difficult to "hide" and to slack.

ttoinou · 2026-06-08T21:00:18 1780952418

Yeah. So, good things. We all know that people are mostly slacking at work

mystifyingpoi · 2026-06-08T18:52:27 1780944747

Generally, when people say they are working 8h/day, they don't literally mean it. Even "work" is basically impossible to define for a SWE.

opsnooperfax · 2026-06-08T23:11:36 1780960296

Here’s my hot take as an elder millennial. Boomers are the absolute worst at being unable to make the distinction between time at work and time doing work. They may show up an hour before everyone else but spend the first two or three hours a day, reading the news and getting coffee and making small talk and accomplishing literally nothing. Then crow about their work ethic.

ai_slop_hater · 2026-06-08T19:56:57 1780948617

Some companies force you to actually work 8 hours a day. It’s hell.

ttoinou · 2026-06-08T20:57:26 1780952246

Which country and which companies ?

formerly_proven · 2026-06-08T21:24:22 1780953862

E.g. factory work

ttoinou · 2026-06-08T23:48:22 1780962502

Oh yeah it's not the same, we were discussing Agentic AI

ai_slop_hater · 2026-06-09T02:19:50 1780971590

I worked at a software company that took a screenshot of your screen every minute. I also worked a non-software white collar job where you were expected to work non-stop for 8 hours, except for an unpaid lunch break.

ttoinou · 2026-06-09T08:48:21 1780994901

How did you accept such jobs ? I would never be able to pull this off as an employer

dilyevsky · 2026-06-08T18:54:04 1780944844

Like with any tech there are dumb ways of using it and there are smart ways. Treating it as a "slot machine giving you the right answer" is a dumb way - it may work for a bit, but it won't carry you very far because everyone else can also do this. No one is stopping anybody from digging deeper into problems than ever before using this technology - that's the smart way.

erikus · 2026-06-08T23:13:18 1780960398

I'm amazed at how steep the AI learning curve continues to be and how people are spread so far apart on it. I think supercharged learning with AI and agents is undervalued at this point but that more people will realize its utility over time, especially as a complement to delegating work.

It also makes me think about the temptation to stop thinking with these tools, i.e. "cognitive surrender". Addy Osmani wrote a nice blog post about this: https://addyosmani.com/blog/cognitive-surrender

andai · 2026-06-09T03:43:42 1780976622

Yeah, nobody is under any pressure to work even faster than before. I don't know what everyone is complaining about!

pmontra · 2026-06-09T00:39:34 1780965574

If you split the tasks for the AI into small chunks you keep the architectural control and it's not a slot machine anymore. You still read code and occasionally you write code too. Not much but it's the price to pay for the extra speed.

If you start the AI on something big and come back after one hour then yes, you might discover that you wasted an hour and got nothing.

jorl17 · 2026-06-09T11:10:24 1781003424

I’m digging into deeper / more complex problems, now. On top of that, I’m also building products faster for our startups, so I am filling in much more of a product role than merely an engineering one. But, really, it is both — and I’m absolutely loving it!

Also, with the added speed I can produce things more in line with the quality I’ve always wanted to add (many more tests, for example).

schipperai · 2026-06-08T17:22:58 1780939378

You can dig deeper into problems with AI. For me, it supplements my knowledge in domains I don’t fully understand. It also helps me learn. So I can tackle problems I wouldn’t otherwise.

I’m excited for ultrafast AI. It likely means less temptation to multi-thread and deeper flow in single sessions.

8note · 2026-06-08T23:10:23 1780960223

how do you know that it is actually suggesting the right thing?

schipperai · 2026-06-09T10:41:14 1781001674

I trust AI to surface general information and best practices on established knowledge domains. For example: best practices for securing my VPS.

For domains where SoTA is constantly changing like AI, I use LLMs to aggregate and interact with my own research from trusted sources à la Karpathy LLM wiki.

I don’t generally trust everything I read on the internet whether it's AI generated or not. I do my own research for the things that matter to me.

Klaster_1 · 2026-06-09T03:55:35 1780977335

Some things are verifiable. Before coding agents, if I encountered an issue with a library or a framework, my first hunch would be to find a GitHub issue with a suggested workaround. Nowadays, I can ask an agent to really dig into it and often it does surface the root cause. For example, the other day I got a test hangup after updating to Angular 22, and the agent managed to find the bug and suggest a very trivial workaround compared to what I originally planned to go with. I reported the issue and it was fixed the next day, more or less along the lines of what I'd do.

himata4113 · 2026-06-08T17:59:55 1780941595

I was saying that AI is going to make software development cheaper as in the salaries of software engineers will go down because some of that salary will now be redirected to AI companies and the fact that the world will need to absorb twice-(x10?) the amount of the development power.

vanuatu · 2026-06-08T18:18:58 1780942738

It's not obvious to me that salaries go down; my hunch was that salaries go up but the bar is higher. Software becoming easier to produce (still hard to verify and make useful fwiw) raises the ambitions of software projects, and we don't seem to be close to the ceiling of demand for software systems

himata4113 · 2026-06-08T18:33:45 1780943625

There's a limit to what the demandXsupply curve can absorb. It really depends if there's twice as many developers or 10 times more. I think we have enough software development jobs to where we can absorb productivity doubling rather easily, not so sure about anything beyond that.

vanuatu · 2026-06-08T18:39:09 1780943949

True on the demand/supply curve

I think due to how leveraged software is, the top % of software developers are more desired (and compensated) than ever, and the bottom % will have difficulty finding a role, and there are structural barriers to entering that top % (intelligence, location, etc). Companies have infinite demand for the cream of the crop talent

himata4113 · 2026-06-08T18:48:09 1780944489

I can actually back this up, most job offers I get actually come from people I happened to work with that never get a public job listing and are only obtainable via being highly regarded by others. I was told that my friend in their department where the role opened up got an email about a senior position and to reply if they have a recommendation.

However, software development is funny in a way where you don't need a job in order to be successful. I've never worked at a company and I'm pretty up there on the ladder, but I am not quite sure what will happen in the next few years when every possible thing that can be made in software is already explored to the fullest, especially with singular developers launching 3 to 7 projects a month.

DenisM · 2026-06-08T20:37:07 1780951027

> with the hope of it giving you the right answer with the right prompt.

Consider that our ability to evaluate quality of the output is falling further behind our ability to produce it. The “right answer” is not the most likely outcome.

drschwabe · 2026-06-08T19:00:32 1780945232

Sure, but if you're really unhappy with your employer employing you for 8 hours a day, you can also harness this power on your own personal projects to help break free from the 9-5 grind if you so desire.

__david__ · 2026-06-08T19:18:07 1780946287

Only if your personal projects make you money. I have a million hobby projects but none generate income.

overgard · 2026-06-08T21:58:09 1780955889

I feel like I spend a lot more time reviewing and fixing the output of it and debugging parts it can't debug, so to me a faster model is optimizing the part that is already pretty fast. If my job were greenfield stuff I would probably YOLO it more, but when you're working on a launched product with a lot of users..

fullstop · 2026-06-08T17:10:20 1780938620

It's making things less fun, for me at least.

linsomniac · 2026-06-08T18:20:21 1780942821

Odd, I'm having the opposite experience.

The thing I really love about working with computers is when I achieve something. That's the thing that makes me figuratively, and sometimes literally, throw my fists into the air and go "Yeaaah!"

With the AI tooling, I'm getting those more like a couple times a week.

Plus, I'm using AI to attack the things in my day that are "a drag", and getting them done too.

The highs are more frequent and the lows are not so low.

fullstop · 2026-06-08T19:54:50 1780948490

Oh, sure, I can make things with it. But I have an extraordinarily hard time saying that I made something.

It feels like it cheapens the whole thing. Maybe I'm just old, because I remember people saying the same thing about code completion in Visual Studio back in the late 90s.

This is so much more than code completion, though.

dd8601fn · 2026-06-08T20:45:58 1780951558

Exactly how I feel. I didn’t make a damn thing. I essentially asked a chatbot to.

Did I ask for better things with some important concepts pre-rolled? Yeah, of course. But that’s so, so much less interesting than having actually made a thing.

I try to remind myself that the output of my projects have nothing to do with who I am, but the honest truth is they always mattered to me.

Now that’s dead, and it’s never coming back. It ain’t exactly existential dread, but it is something I’ve lost.

dd8601fn · 2026-06-08T20:43:39 1780951419

I did a deep binge on two or three projects I would never do, and like five small ones that would have consumed months.

It felt like that, kinda, for a bit. Now whenever it does something for me I get nothing. I didn’t do it… the chatbot did. What’s for me to celebrate? How can there be any real pride or satisfaction for a thing that was just handed to me because I asked for it?

If anything it diminishes my satisfaction looking back on previous projects. They’re “a few hours with a chatbot”, now.

The things I had to learn and the informed decisions I had to make? All pointless trivia, now. A child could do it.

The magic and possibilities parts just all wore off after a heavy run, and I don’t know if that’s ever coming back.

linsomniac · 2026-06-08T21:34:09 1780954449

I hear what you and the other sibling comment are saying. I, thankfully, somehow, am able to focus more on the results than the process. Having fun playing a game (that AFAIK no longer exists) with my family is still having fun. Having people using a new apt cacher that fixes problems with existing ones, and also can survive the recent DDoS, is still a really great thing.

But, I'm not going to yuck your yum. I appreciate the people who do joinery using hand tools, even if I'm out here with a track saw and a router.

fullstop · 2026-06-08T22:04:41 1780956281

Do you feel the same way about cloning a GitHub repo and building it? It, too, achieved a result.

The track saw and router, imo, are existing libraries.

pmontra · 2026-06-09T00:48:11 1780966091

> The things I had to learn and the informed decisions I had to make? All pointless trivia, now. A child could do it.

This is probably hyperbole. Did you do the experiment? I expect that the child won't be able to do it. Ask an adult. Same thing. Ask an expert in the domain. Maybe, but not as fast or as well as you.

dd8601fn · 2026-06-09T06:30:58 1780986658

Yes that’s more “how it feels” than something I’ve had kids actually try.

vanuatu · 2026-06-08T18:15:48 1780942548

Employees who get paid a flat rate per hour don't have the incentive to do more than their job

Equity / profit sharing should be commonplace in the age of AI.

enraged_camel · 2026-06-08T18:09:19 1780942159

I dig into problems way, way deeper with AI than without. I can also add a lot more polish to features, add more test coverage, write more documentation, explore multiple approaches rather than go with gut-feel, and so on.

fragmede · 2026-06-08T17:36:48 1780940208

That's the fundamental trade off of a job where someone else gives you stuff to do and you get money. We may pride ourselves on software development being a job 'above' flipping burgers, but you're getting paid to have your butt in a chair for 40 hours a week. In exchange, you don't have to worry about the business shit. How much a burger or SaaS license costs the user isn't your problem. You take Jira tickets and implement them. You trade time for money. If, instead, you work for yourself; contracting, writing your own apps, buying lottery tickets, then you're trading results for money. If you're a freelance web developer with a stable of clients, it's a great time! What used to take a week takes hours, and you can charge your clients the same amount to build an even better website with you using AI, which means you get the choice of building a new website for additional clients, or you can take the time off and not build additional websites. But you have to hustle to continually get new clients, before AI and after AI. So it's a different life.

IncreasePosts · 2026-06-08T19:21:11 1780946471

A huge class of problems are just toil and drudgery. Maybe ai will give you even more time to dig into juicy problems that are too complex for it to solve, by letting you bypass all the pure toil problems.

yogthos · 2026-06-08T17:33:03 1780939983

I think of it as a genetic algorithm loop. The LLM is basically a mutator function within the loop. If you can define the end shape you're looking for using tests and specification then you can throw the LLM at the problem and have it converge on the solution. It generates some code, it gets run, the LLM is fed the result back, and it iterates. If you can run the LLM at a really high throughput, then you can iterate on the solution faster. This can largely compensate for the overall capability of the model. Instead of hoping it gets the right solution in a few shots, you can just have it try a whole bunch of things until you get a useful result.
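A minimal sketch of that loop, assuming the tests are the specification. The names `converge`, `make_stub_llm` and `run_tests` are hypothetical, and the "LLM" here is a stub that just replays hard-coded candidates; a real setup would call a model and pass it the previous attempt plus the test feedback.

```python
from typing import Callable, List, Optional, Tuple


def converge(
    propose: Callable[[Optional[str], str], str],
    run_tests: Callable[[str], Tuple[bool, str]],
    max_iterations: int,
) -> Optional[str]:
    """Generate -> run -> feed the result back, until the tests pass."""
    candidate: Optional[str] = None
    feedback = ""
    for _ in range(max_iterations):
        candidate = propose(candidate, feedback)
        passed, feedback = run_tests(candidate)
        if passed:
            return candidate
    return None  # budget exhausted without a passing candidate


def make_stub_llm(candidates: List[str]) -> Callable[[Optional[str], str], str]:
    """Stand-in for a model call: replays a fixed list of attempts."""
    attempts = iter(candidates)

    def propose(previous: Optional[str], feedback: str) -> str:
        # A real model would read `previous` and `feedback`; the stub ignores them.
        return next(attempts)

    return propose


def run_tests(source: str) -> Tuple[bool, str]:
    """The 'end shape': f(x) must equal 2x + 1 on a few sample points."""
    fn = eval(source)  # tolerable only because the candidates are hard-coded below
    for x, expected in [(0, 1), (1, 3), (5, 11)]:
        got = fn(x)
        if got != expected:
            return False, f"f({x}) returned {got}, expected {expected}"
    return True, "all tests passed"


attempts = ["lambda x: x + 1", "lambda x: x * 2", "lambda x: x * 2 + 1"]
solution = converge(make_stub_llm(attempts), run_tests, max_iterations=len(attempts))
print(solution)
```

Higher throughput just means more trips around the `for` loop per minute, so the iteration budget can grow without the wall-clock time growing with it.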

logicchains · 2026-06-08T17:28:27 1780939707

>instead of enjoying the craft digging deeper into problems in the span of 2 days, now you are rushing into some slot machine with the hope of it giving you the right answer with the right prompt.

If you're treating it like a slot machine you're doing it wrong. It will give you exactly what you ask for if you ask clearly, i.e. write a clear, detailed specification, not just "do X!". The nondeterminism comes from vagueness in specification.

noncoml · 2026-06-08T17:19:42 1780939182

You have to think of the LLM as the genie that tries to trick you.

First make it write a contract (REQ/ARCH/IMPL documents). Skim through those for any mistakes.

Then based on those ask it to write tests. Again skim through them.

Now you have a context full of guardrails. It’s less likely to surprise you.

petesergeant · 2026-06-08T17:57:06 1780941426

I find a second LLM can do this at least as well as I can, usually, and just ask the harness to surface anything they can't agree on.

alfalfasprout · 2026-06-08T17:27:59 1780939679

Generally, I agree because what happens is the messaging around AI is doing more, faster. Not using AI to deliver at a higher quality level, etc. But I think it boils down to incentives and discipline. So given the incentives we have today at most workplaces faster AI will just be used to produce more slop.

amunozo · 2026-06-08T16:05:31 1780934731

These price and speed optimizations from Chinese providers, combined with the rising prices from American ones, will change the game sooner rather than later. Many companies are finding issues with the AI bills already.

MangoCoffee · 2026-06-08T16:31:51 1780936311

Chinese models are good enough and cheap.

I have a yearly GitHub Copilot subscription. Microsoft recently changed their billing to be token-based. I'm still getting billed per premium request, but GPT 5.4 is now 6x compared to 1x before.

reactordev · 2026-06-08T17:27:51 1780939671

It's going to be an issue when China ends up scaling faster as well. Faster tokens, faster clusters, qat models, fp4, it's getting scary.

AndrewKemendo · 2026-06-08T17:56:39 1780941399

Issue for who?

fillskills · 2026-06-08T19:02:09 1780945329

Issue for any country that is not China. A single country getting the most AI tokens business would be generally bad for global economy. Hoping against hope that this business gets globally distributed and there is a healthy marketplace competition overall

reactordev · 2026-06-08T19:58:21 1780948701

It’s all about economic warfare. The cheaper you can run the models, the cheaper you can offer them. Undercutting expensive tiers with token limits or exorbitant billing practices.

You are right to be scared, because this race to the bottom also provides open weights/models/qat’s for the rest of us and it’s been crazy to see how good they can be on a consumer grade RTX card.

throwa356262 · 2026-06-08T18:10:00 1780942200

For uncle Sam Altman.

reactordev · 2026-06-08T18:08:31 1780942111

American Politics and the far right.

fortzi · 2026-06-08T20:44:16 1780951456

For the West

ilaksh · 2026-06-08T17:42:30 1780940550

I'm kind of poor so I have been trying to use DeepSeek v4 Flash, GLM 5.1 etc. as much as possible recently instead of Claude or GPT.

petesergeant · 2026-06-08T17:57:48 1780941468

You would do us all a service by telling us how your experiences of that have been.

RussianCow · 2026-06-09T01:22:32 1780968152

I've been doing the same, though admittedly out of curiosity more so than lack of funds. The open models are catching up quickly in their abilities, to the point where they're (mostly) not doing stupid stuff regularly, but you have to be very specific about what you want. I found that Opus, for example, is much better at asking me to clear up ambiguity in a request before starting, whereas the Chinese models tend to "fill in the blanks" and make their own assumptions.

My current workflow involves going from PRD -> execution plan -> build -> review, and this works nicely with open weight models like GLM 5.1, Kimi K2.6, and DeepSeek V4 Flash. With Opus I can generally skip the PRD entirely, and sometimes even skip the plan, and 80-90% of the time it does exactly what I want. But that can easily burn $5-15 for one feature, whereas it'll cost maybe $1-2 with the open weight models (at API pricing).

andai · 2026-06-09T03:47:57 1780976877

> ... you have to be very specific about what you want. I found that Opus, for example, is much better at asking me to clear up ambiguity in a request before starting, whereas the Chinese models tend to "fill in the blanks" and make their own assumptions.

That's the main thing I've noticed. Small models can follow instructions just fine. If the instructions are very specific. Then I often have to spend more time explaining a task than it would have taken me to do it myself.

The bigger models have a lot more common sense.

I wonder if that could be improved slightly through prompting. Asking it to clarify anything that's confusing. Or maybe it just makes incorrect assumptions without realizing the ambiguity. One way to find out!

ilaksh · 2026-06-08T18:51:25 1780944685

I would say about 35% of the time I run into problems and eventually give up and go to GPT 5.5 and it much more efficiently handles the original task. Then I see the token costs going up and it motivates me to continue trying the open source ones.
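A rough way to sanity-check that trade-off, as a sketch only: the 35% fallback rate is the figure above, while the dollar amounts are invented placeholders and the model ignores the time lost on failed attempts.

```python
def expected_cost_with_fallback(open_cost: float, frontier_cost: float,
                                fallback_rate: float) -> float:
    """Average cost per task when every task starts on the open model and a
    fraction `fallback_rate` of tasks is redone on the frontier model."""
    return open_cost + fallback_rate * frontier_cost


def break_even_open_cost(frontier_cost: float, fallback_rate: float) -> float:
    """Open-model cost per task below which 'try open first' is cheaper
    than going straight to the frontier model."""
    return (1 - fallback_rate) * frontier_cost


FALLBACK_RATE = 0.35   # "about 35% of the time"
FRONTIER_COST = 10.00  # assumed dollars per task, placeholder
OPEN_COST = 1.50       # assumed dollars per task, placeholder

average = expected_cost_with_fallback(OPEN_COST, FRONTIER_COST, FALLBACK_RATE)
limit = break_even_open_cost(FRONTIER_COST, FALLBACK_RATE)

print(f"average cost per task, open first: ${average:.2f}")
print(f"open model must cost under:        ${limit:.2f}")
```

With a 35% fallback rate, trying the open model first saves money as long as it costs less than 65% of what the frontier model would have cost for the same task.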

Schlagbohrer · 2026-06-09T08:04:56 1780992296

There's going to be a tipping point where it's worth purchasing more hardware to run the next biggest size of the open model, if they show stepwise improvements that way.

andai · 2026-06-09T03:45:39 1780976739

Did you try deepseek v4 pro as well? And what kind of tasks?

I'm seeing some people say flash is amazing and can handle everything, and some say it's useless. It seems to depend on the task. I think it depends on the harness too (it works better in Claude Code in my experience, it's probably been trained on that).

csomar · 2026-06-09T11:58:16 1781006296

The only one that is really close to Claude in performance is GLM-5.1. The others (Mimo, DeepSeek, etc.) look good on paper but usually fail on multi-step agentic orchestration.

This is at least my experience with Claude Code as harness. Also, GLM pricing is not that far off from Claude. It's cheaper but not DeepSeek cheap.

polski-g · 2026-06-08T18:43:45 1780944225

I used Opus 4.6, then downgraded to Sonnet, then to GLM5/5.1. GLM is as good as Sonnet. I recently started using Opus 4.8 again and GLM is not close to that.

30 day eval for each.

kypro · 2026-06-08T16:38:31 1780936711

Another problem is that US models are all closed source, and if you're a large corporate you may not want your org to be held hostage by OpenAI / Anthropic.

I genuinely don't understand what moat these US model labs have. If they're saying recursive self improvement is just around the corner and Chinese labs are only slightly behind the leading US models, what moat do the US labs have? Are the US models going to recursively self improve better than the Chinese open source ones or something?

I might be completely wrong about this, but if I had money in OpenAI or Anthropic I'd be pulling it all right now. I think the chance of them going to near-zero over the next few years is very significant.

hobofan · 2026-06-08T17:27:19 1780939639

> you may not want your org to be held hostage by OpenAI / Anthropic

Or Google. I'm working with multiple customers right now that are very pissed at Google for deprecating Gemini 2.5 Flash, canning the GA release of 3.0 Flash and now have to decide whether to bite the bullet of the 5x price increase for 3.5 Flash or switching providers. Quite a few of them will likely fully pivot to open models.

bachmeier · 2026-06-08T18:30:17 1780943417

I'd be curious if any of your customers have tried 3.1 Flash Lite. It's cheaper than 2.5 Flash, and in my experience with the free tier, quite an upgrade in terms of quality of response. My suspicion is that Google is killing off the old models because they aren't a good value for the customer or for themselves.

lokar · 2026-06-08T16:53:02 1780937582

Their moat is cash to pay politicians to regulate away competition.

GoToRO · 2026-06-08T21:48:19 1780955299

maybe the moat is that we slowly start to forget how to code by hand and then you -need- the AI tool.

ChrisClark · 2026-06-08T17:38:56 1780940336

I think they are racing because the first ASI will 'win', preventing others, of course we won't be able to bake the right goals into it though.

tancop · 2026-06-08T18:31:33 1780943493

i don't think it's going to automatically prevent others. super claude might understand why diversity is important. if we're talking sci fi scenarios the most likely one is probably overwatch (multiple independent ais with gray ethics and complicated relationships) more than skynet.

varispeed · 2026-06-08T16:24:37 1780935877

I see bigger problem with model inconsistency. You never know whether Anthropic will route your request to a cheaper model for the price of Opus. So you can never estimate how much a task will cost, because you might have to restart several times and pay for each attempt. Then you have to prompt models to gauge whether they are real or impostors which also adds to token usage.

ignoramous · 2026-06-08T16:31:22 1780936282

> You never know whether Anthropic will route your request to a cheaper model for the price of Opus

For non subsidized plans? Pretty sure they'd need to put this in ToS, or lawsuits would have followed by now.

trollbridge · 2026-06-08T16:50:53 1780937453

How can you prove it?

Sometimes Opus just gives me a rubbish session.

RussianCow · 2026-06-09T01:24:59 1780968299

Isn't that true of any provider? Anyone could be lying about what they're serving.