Disorder is 100x Faster
Speeding up AI training by accepting disorder
Merry Christmas! In this series, we review scientific papers in an approachable way. Our main tool is NotebookLM, which hacks into our human desire for narratives (improving recall and comprehension). To supplement, I provide brief commentary drawn from my experience as a Senior Research Scientist at Nvidia.
Today’s paper:
The paper was uploaded to arXiv on October 31, 2025.
Commentary
First, this seems like a relaxation in the spirit of the CAP Theorem, the core tenet of distributed computing that says you have to trade off consistency, availability, and partition tolerance. Second, it seems like they do away with one sort of consistency layer, strict ordering, in exchange for performance. Third, this Transfer Engine approach from Perplexity shows how much they’re engineering to the hardware in order to enable new advances in reinforcement learning, and that’s what’s driving their performance. They’re doing it on an open standard, and this will run contrary to any closed ecosystems.
NotebookLM podcast and transcript (AI generated, lightly edited)
“So our mission in this deep dive is to shortcut you straight to the solution. It’s a new system called Transfer Engine.
Transfer.
Yep. And it promises this flexible, high-speed, point-to-point communication that you absolutely need for this kind of extreme scaling. And it does it by tackling maybe the most infuriating problem in the cloud today: hardware vendor lock-in.
I think that’s where we have to start because if you’re building models that are fundamentally dynamic where the communication path is always changing and only certain parts of the model are even active at any given moment, why can’t you just use the old reliable networking systems we’ve had for decades?
Well, the old reliable system is something called collective communication. You could think of it like a highly choreographed freight train.
Okay. A freight train. I like that analogy.
Libraries like NCCL, they’re designed for these static dense workloads. Think traditional tensor parallelism where every single GPU is doing the exact same thing at the exact same time.
Everything has to be perfectly synchronized.
Perfectly synchronized. But now imagine trying to use that freight train for say a local delivery service that uses dynamic routes. You just need one small package to go from server A to server C right now and server B needs a totally different package later.
The whole system falls apart.
It just falls apart. Collectives have these severe constraints.
Yeah.
Fixed membership for one, which kills any hope of dynamic elastic scaling.
Synchronized initialization adds huge overhead, and they force you to communicate uniformly even when your data, like in an MoE model, is sparse.
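To make that contrast concrete, here is a minimal sketch using torch.distributed (my own illustration, not code from the paper): the collective requires every rank to join a fixed group and call in lockstep, while the point-to-point send and recv involve only the two endpoints.

```python
# Minimal sketch (not the paper's code): a collective needs fixed membership and
# synchronized participation; point-to-point only involves the two endpoints.
# Run with: torchrun --nproc_per_node=2 this_script.py
import torch
import torch.distributed as dist


def main() -> None:
    # Collectives require every member to join one group up front
    # (fixed membership, synchronized initialization).
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()

    # Collective: every rank must call this, with the same shape, at the same time.
    x = torch.ones(4) * rank
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    # Point-to-point: only the sender and receiver participate; no global sync.
    msg = torch.full((4,), float(rank))
    if rank == 0:
        dist.send(msg, dst=1)   # rank 0 pushes one small message right now
    elif rank == 1:
        dist.recv(msg, src=0)   # rank 1 receives it; other ranks stay untouched

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```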
So we desperately need that network flexibility, the basic point-to-point stuff, the send, recv, and write operations that high performance computing has used forever. If those primitives exist, why haven’t they just become the standard for LLMs across, you know, all the different cloud providers?
Because of fragmentation. The core tech for low-latency networking is RDMA, remote direct memory access. But the second you try to take RDMA out of a single uniform supercomputer and put it into the messy real world of the multi-vendor cloud,
you hit a wall.
You hit an invisible wall. And that wall is an incompatibility that’s rooted in how different network cards handle basic reliability. And this is the crucial part: message ordering.
Okay, let’s dig into that architectural divide. Who are the two main players here, and how are they so fundamentally different?
So on one side you’ve got systems using NVIDIA’s ConnectX NICs. These use a protocol called Reliable Connection, or RC.
RC. Got it.
And RC has always been reliable but historically it enforces in order delivery. Every single message has to arrive in the exact sequence it was sent.
That sounds really restrictive. If the sequence is sacred that limits what you can do.
It does. Now, on the other side, you have cloud providers like AWS, and they use their own thing, the Elastic Fabric Adapter, or EFA.
EFA.
EFA uses a protocol called Scalable Reliable Datagram, or SRD. SRD is also reliable, so you know the message will get there, but it is explicitly out of order.
out of order.
It prioritizes speed and flexibility over maintaining a strict sequence.
Wow. Okay. That is a fundamental conflict. So if your library is written assuming the network enforces ordering
it’s instantly incompatible with AWS EFA, and that is the textbook definition of vendor lock-in.
And you see this everywhere, right?
Oh, everywhere. These specialized high-performance libraries, like DeepEP for example, require ConnectX’s GPU-initiated RDMA. Try running NVSHMEM on EFA and performance just plummets. A lot of the new cutting-edge libraries just don’t have stable EFA support at all, because bridging that in-order versus out-of-order gap while keeping performance was just seen as too hard.
So there was basically no viable cross provider solution.
Zero.
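As a toy illustration of that divide (plain Python, no RDMA involved): a receiver written against in-order delivery silently scrambles its data on a reliable-but-unordered fabric, while one that keys off sequence tags plus a completion count works on either.

```python
# Toy model, not RDMA code: the "network" delivers every message exactly once,
# just in a shuffled order (reliable but unordered, like EFA's SRD).
import random


def deliver_unordered(messages):
    shuffled = list(messages)
    random.shuffle(shuffled)  # nothing is lost; only the order changes
    return shuffled


def inorder_receiver(delivered):
    # Assumes message i is the i-th arrival -- only safe on an ordered fabric.
    return [payload for _, payload in delivered]


def tagged_receiver(delivered, total):
    # Places each payload by its sequence tag and uses only the count for
    # completion -- safe on ordered and unordered fabrics alike.
    slots = [None] * total
    for seq, payload in delivered:
        slots[seq] = payload
    return slots


msgs = [(i, f"page-{i}") for i in range(5)]
arrived = deliver_unordered(msgs)
print("in-order assumption:", inorder_receiver(arrived))    # likely scrambled
print("tag plus count     :", tagged_receiver(arrived, 5))  # always correct
```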
So Transfer Engine walks into this room of divided networking standards. How did they bridge this divide? I mean, did they try to force EFA to respect ordering, or ConnectX to relax it? Did they build some kind of expensive, slow translation layer?
No. And this is the brilliant part. The key insight was actually incredibly simple. They just asked, “What if we stop fighting the out of order nature?
We know both protocols are reliable. They won’t lose your data.” So Transfer Engine decided to just treat both ConnectX RC and EFA SRD as reliable and fundamentally unordered. They figured out how to relax the ordering requirement on ConnectX, and then they just built an abstraction layer that works seamlessly over both.
Wait, that’s it? That’s so simple. I feel like we’ve spent years in network engineering fighting for these strict ordering guarantees and the fix was just deciding we didn’t need it as long as we had a reliable way to count things.
Precisely. Transfer Engine just layers this uniform interface on top. It exposes standard operations like send, recv, and one-sided writes with an immediate count. But the real magic, the core of it, is this crucial primitive they built called the ImmCounter.
Okay, tell me more about this counter. This sounds like the secret sauce.
The ImmCounter is just for completion notification. Think of it like a ticket-stub counter at a coat check.
Okay,
You drop off your coats, your data transfers, in whatever order is convenient. You don’t need the attendant to hand them back in the exact same order. You just need a receipt that confirms, say, you’ve got five tickets. The ImmCounter increments on the receiving side when a payload is complete, no matter what sequence the transfers arrived in. So the receiving GPU can just know it’s received all the pages it was expecting without needing them to arrive 1, 2, 3.
Exactly. And that just completely eliminates the architectural difference between RC and SRD instantly. It’s so elegant.
That’s incredibly elegant
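Here is a minimal sketch of the counter idea in plain Python threads. The names are mine, not the paper’s API, but the mechanics match the description: each completed payload bumps a counter, and the receiver waits only for the count it expects, never for a particular order.

```python
# Sketch of a completion counter (illustrative names, not the library's API).
import random
import threading
import time


class CompletionCounter:
    def __init__(self) -> None:
        self._count = 0
        self._cond = threading.Condition()

    def increment(self) -> None:
        # Called once per fully received payload, in whatever order it lands.
        with self._cond:
            self._count += 1
            self._cond.notify_all()

    def wait_for(self, expected: int) -> None:
        # The receiver blocks until `expected` payloads have completed.
        with self._cond:
            self._cond.wait_for(lambda: self._count >= expected)


def fake_rdma_write(counter: CompletionCounter) -> None:
    time.sleep(random.uniform(0.0, 0.05))  # payloads complete out of order
    counter.increment()


counter = CompletionCounter()
pages = 8
writers = [threading.Thread(target=fake_rdma_write, args=(counter,)) for _ in range(pages)]
for w in writers:
    w.start()
counter.wait_for(pages)  # "I expected 8 pages" -- arrival order never mattered
print(f"all {pages} pages landed; the receiver can proceed")
for w in writers:
    w.join()
```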
And it was necessary. On top of solving the ordering problem, they also had to tackle another big cloud reality, which is bandwidth aggregation. To hit those massive 400 gigabit-per-second speeds, you often have to bind multiple NICs to a single GPU, especially on EFA.
right?
Transfer Engine just transparently manages all that sharding and load balancing across the NICs for you. It’s seamless.
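A hedged sketch of what that aggregation amounts to (the names, chunk size, and NIC count are illustrative, not the library’s interface): split the region into chunks and spread them round-robin across the available NICs so every adapter stays busy.

```python
# Illustrative sharding plan: spread one large transfer across several NICs.
from typing import List, Tuple


def shard_across_nics(total_bytes: int, num_nics: int,
                      chunk_bytes: int) -> List[List[Tuple[int, int]]]:
    """Return, per NIC, the (offset, length) slices it should transfer."""
    plans: List[List[Tuple[int, int]]] = [[] for _ in range(num_nics)]
    offset, i = 0, 0
    while offset < total_bytes:
        length = min(chunk_bytes, total_bytes - offset)
        plans[i % num_nics].append((offset, length))  # round-robin load balance
        offset += length
        i += 1
    return plans


# Example: a 1 GiB region spread over 4 NICs in 32 MiB chunks.
plans = shard_across_nics(1 << 30, num_nics=4, chunk_bytes=32 << 20)
for nic, slices in enumerate(plans):
    moved = sum(length for _, length in slices)
    print(f"NIC {nic}: {len(slices)} chunks, {moved / (1 << 20):.0f} MiB")
```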
And the performance numbers confirmed they didn’t have to sacrifice speed for this portability, right? They actually hit that peak throughput.
Absolutely. Transfer Engine demonstrated the full advertised 400 gigabits per second on both NVIDIA ConnectX-7 and AWS EFA. They got cross-vendor, cross-cloud portability without dropping a single bit of bandwidth.
Okay, let’s shift from the architecture to the real world impact. We need to see this speed in action. Let’s start with disaggregated inference. This new architecture where you separate the heavy prompt prefill stage from the fast token decode stage.
Disaggregated inference is just so key for large-scale serving. You can dedicate these massive, dense GPU resources to the prefill, which, you know, processes the big initial prompt, and then you pass the output to smaller, faster, latency-optimized servers for the actual decoding loop. But the problem is that transfer in the middle.
The transfer in the middle. Once the prefiller calculates all that context, the entire massive KV cache, all these context pages, has to be transferred to a decoder, and it has to be done dynamically.
And if they’re different physical servers, that transfer has to be lightning fast and totally unsynchronized.
Exactly. The old collective methods just can’t do this. The size of the KV cache changes with every prompt. The batch size is different. The membership, which decoder gets the next job, is always updating. Transfer Engine enables this elastic system with a feature for submitting paged writes,
for transferring those context pages layer by layer.
Yep. Layer by layer.
So how does the decoder know when the transfer is done if it’s not waiting on some synchronized collective signal?
It uses the other side of that ImmCounter mechanism we talked about: a call where the receiver registers the count it expects.
The decoder knows exactly how many layers or pages it’s supposed to get. It doesn’t need some explicit “I’m done” message from the prefiller.
It just tells Transfer Engine, “Hey, I’m expecting 40 pages.”
That’s it. And the ImmCounter notifies it the second that 40th page arrives, again, regardless of the arrival order. And the decoder can immediately start generating tokens. This is the kind of true dynamic, elastic scaling that collectives could never, ever support.
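Here is an illustrative flow of that prefill-to-decode handoff, with invented names and asyncio standing in for the network: pages land in whatever order they finish, and the decoder starts the instant its expected count is reached.

```python
# Conceptual flow only: a prefill worker pushes KV-cache pages layer by layer,
# the decoder declares how many pages it expects, and it begins decoding as
# soon as the count is hit -- no collective, no explicit "I'm done" message.
import asyncio
import random


async def prefill_and_send(pages: asyncio.Queue, num_layers: int) -> None:
    layers = list(range(num_layers))
    random.shuffle(layers)                  # pages may complete out of order
    for layer in layers:
        await asyncio.sleep(random.uniform(0.0, 0.02))
        await pages.put(layer)              # stand-in for a one-sided paged write


async def decode_when_ready(pages: asyncio.Queue, expected: int) -> None:
    received = 0
    while received < expected:              # stand-in for waiting on the counter
        await pages.get()
        received += 1
    print(f"decoder: all {expected} pages present, generating tokens")


async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    expected_pages = 40                     # "Hey, I'm expecting 40 pages"
    await asyncio.gather(
        prefill_and_send(queue, expected_pages),
        decode_when_ready(queue, expected_pages),
    )


asyncio.run(main())
```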
That’s a huge shift from total rigidity to this flexible dynamic resource assignment. Okay, next up, let’s talk about the asynchronous training side, specifically reinforcement learning or RL fine-tuning. This process needs these constant super rapid updates of trillion parameter weights. Why was this such a huge bottleneck before?
Oh, this was where the limits of collectives became just painfully obvious. The old RL frameworks, they had to form this one big global collective world. And when the weights needed updating, they’d be gathered to a single chosen rank, say a training GPU.
Wait, one GPU?
One single GPU. Then that one GPU had to broadcast the updated weights back out to rank zero of every single inference subgroup.
So, one poor network card was basically forced to carry the entire traffic load for all the weight updates across this massive cluster.
Precisely. It was the ultimate highway-interchange bottleneck. It just throttled performance, because even if you have 400 gigabits of capacity per NIC, if you only use one pipe for a task meant for 32 pipes, you’re limited by that single choke point. The weight updates became the biggest latency sink in the entire RL loop.
So, how did Transfer Engine fix this? How did they exploit the full bandwidth?
They moved to a pure point-to-point approach, one-sided RDMA writes. So instead of having rank zero broadcast everything, each training GPU, which already has a piece of the weights, sends its specific sharded weights directly to the inference GPUs that need them.
Ah, so all the NICs in the cluster are firing at the same time.
All at once. You’re using the full aggregated cluster bandwidth.
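Some back-of-the-envelope arithmetic (the cluster sizes and model size below are toy numbers I picked, not figures from the paper) shows why this matters: the quantity that changes is how many bytes the single busiest NIC has to move.

```python
# Compare "gather to one rank and broadcast" with direct sharded point-to-point
# writes, measured by the bytes the busiest NIC must carry. Toy numbers only.

def broadcast_scheme(model_bytes: int, num_inference_replicas: int) -> int:
    # One chosen rank pushes the full weights to every replica:
    # its single NIC carries all of the traffic.
    return model_bytes * num_inference_replicas


def point_to_point_scheme(model_bytes: int, num_trainers: int,
                          num_inference_replicas: int) -> int:
    # Each trainer ships only its own shard to every replica,
    # so the load is spread over all trainer NICs.
    return (model_bytes // num_trainers) * num_inference_replicas


model_bytes = 2 * 10**12   # ~2 TB of weights, roughly a 1T-parameter model in BF16
trainers, replicas = 32, 16

busiest_old = broadcast_scheme(model_bytes, replicas)
busiest_new = point_to_point_scheme(model_bytes, trainers, replicas)
print(f"busiest NIC, broadcast      : {busiest_old / 1e12:.1f} TB")
print(f"busiest NIC, point-to-point : {busiest_new / 1e12:.2f} TB")
print(f"reduction on the hot NIC    : {busiest_old / busiest_new:.0f}x")
```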
And I remember reading the speedup was just staggering here, something like 100x, which suggests the old method was just fundamentally broken at scale.
A 100x speedup is a paradigm shift. It’s not an optimization; it’s a new capability. Transfer Engine got down to 1.3-second cross-machine parameter updates for models like Kimi K2 and DeepSeek V3. And if you’ve ever had to manage one of these weight-transfer bottlenecks, you know that updating a trillion-parameter model in under two seconds is, well, basically magic.
It changes what’s even possible in online RL
Completely. And they squeezed even more efficiency out by pipelining everything. They broke the task into stages: the host-to-device copy, weight prep like fusing or quantization, and the actual RDMA transfer, and they just ran these stages for different weight groups at the same time.
So communication and computation are perfectly overlapped.
Maximum GPU and network utilization, all the time.
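A small sketch of that pipelining idea (the stage timings are invented, and Python threads stand in for CUDA streams and the NIC): with one worker per stage, different weight groups occupy the copy, prep, and transfer stages at the same time, so the stages overlap instead of running back to back.

```python
# Illustrative three-stage pipeline for weight groups: host-to-device copy,
# weight prep (e.g. fusing/quantization), then the RDMA transfer.
import time
from concurrent.futures import ThreadPoolExecutor


def h2d_copy(group: str) -> str:
    time.sleep(0.01)                 # pretend DMA from host to GPU memory
    return f"{group}:on-device"


def prep(weights: str) -> str:
    time.sleep(0.01)                 # pretend fusing / quantization
    return f"{weights}:quantized"


def rdma_send(payload: str) -> str:
    time.sleep(0.01)                 # pretend the RDMA write onto the wire
    return f"{payload}:sent"


groups = [f"group{i}" for i in range(8)]

# Serial baseline: each group finishes copy -> prep -> send before the next starts.
start = time.time()
for g in groups:
    rdma_send(prep(h2d_copy(g)))
serial = time.time() - start

# Pipelined: one worker per stage, so group i can be on the wire while group
# i+1 is being prepped and group i+2 is being copied.
start = time.time()
with ThreadPoolExecutor(1) as copy_pool, ThreadPoolExecutor(1) as prep_pool, ThreadPoolExecutor(1) as send_pool:
    copied = [copy_pool.submit(h2d_copy, g) for g in groups]
    prepped = [prep_pool.submit(lambda f=f: prep(f.result())) for f in copied]
    sent = [send_pool.submit(lambda f=f: rdma_send(f.result())) for f in prepped]
    _ = [f.result() for f in sent]
pipelined = time.time() - start

print(f"serial: {serial * 1000:.0f} ms, pipelined: {pipelined * 1000:.0f} ms")
```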
Okay, finally, let’s look at the ultimate test of sparse, dynamic communication: mixture of experts, MoE. This whole architecture is defined by scatter-gather operations. You have to send a token to one of maybe 16 remote experts and get a result back, like, instantly. How do they get state-of-the-art performance here, and also bring this to EFA for the first time?
MoE dispatch and combine needs incredible finesse. Transfer Engine uses specialized scatter and barrier operations. The coordination actually relies on a small host proxy thread, which normally you’d think would introduce latency. But the key to their performance was figuring out how to minimize and hide that CPU overhead.
So how do you hide CPU latency?
You get speculative and you pipeline. So first the kernels exchange the routing info really fast. Then they immediately use these small speculative transfers to dispatch just a few initial tokens into private buffers on the remote expert GPUs.
A little pre-transfer.
Exactly. This tiny, fast transfer makes sure the network is immediately hot and the bulk transfer paths are getting set up. So while the big transfer is moving, the network is already warm, and you’ve effectively hidden the setup latency of the CPU coordination.
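A conceptual toy of that warm-up trick (every constant below is invented, and sleeps stand in for real work): the speculative transfer pays the setup cost concurrently with the host-side assembly of the bulk dispatch, so by the time the bulk is ready, only wire time remains.

```python
# Toy timing model of speculative warm-up for MoE dispatch. Not real code paths.
import threading
import time

SETUP = 0.010   # pretend per-destination setup / CPU-proxy coordination cost
PREP = 0.012    # pretend time to assemble the bulk dispatch on the host
BULK = 0.020    # pretend wire time for the bulk token transfer


def cold_dispatch() -> None:
    time.sleep(PREP)           # assemble the bulk dispatch
    time.sleep(SETUP + BULK)   # cold path: pay setup, then send everything


def speculative_dispatch() -> None:
    # Kick a few tokens out immediately; that pays SETUP and warms the path
    # while the host is still assembling the bulk transfer.
    warmup = threading.Thread(target=time.sleep, args=(SETUP,))
    warmup.start()
    time.sleep(PREP)           # bulk assembly overlaps with the warm-up send
    warmup.join()
    time.sleep(BULK)           # path is already warm: only wire time remains


for name, dispatch in [("cold path", cold_dispatch),
                       ("speculative warm-up", speculative_dispatch)]:
    t0 = time.time()
    dispatch()
    print(f"{name:20s}: {(time.time() - t0) * 1000:.0f} ms")
```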
That’s really smart. You use a small burst to warm up the pipes for the big one. How did this stack up against DeepEP, which is kind of the gold standard for low latency on ConnectX?
It’s a great question. So, DeepEP, because it uses direct GPU-initiated RDMA, often has slightly lower latency for the very first token. But Transfer Engine’s superior bulk transfer and that efficient pipelining we talked about allowed it to actually surpass DeepEP in overall inter-node performance on both dispatch and combine for bigger systems, like 16 and 32 ranks.
And a portability win, the EFA part?
This was huge. They provided the first viable high-performance MoE implementation on EFA, period. Right now it trails ConnectX-7 by about 30% on decode latency, but that’s a trade-off most people would make in a heartbeat for the flexibility of running the same MoE architecture across different clouds. They proved it could be done.
That was a phenomenal survey. So we’ve seen that point-to-point isn’t just a nice-to-have; it’s an absolute requirement for these modern LLM patterns. And there was a fascinating detail in the research about message sizing. What did we learn about how much data you actually need to send to saturate a 400 gigabit link?
Yeah, that was really interesting. It shows that your efficiency depends completely on the primitive you use. If you used a single traditional write operation, you needed these surprisingly large messages. We’re talking at least 16 megabytes just to saturate the link.
16 megabytes for a single write just to hit top speed.
I know. But here’s the optimization. When Transfer Engine used its specialized paged writes, like for the KV cache transfer, they only needed much, much smaller messages, around 32 kilobytes, to get that same 400 gigabit saturation.
Wow.
It just highlights how critical it is to customize the RDMA primitive for the specific LLM workload.
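Some rough arithmetic on why that is. The 16 MB and 32 KB figures come from the discussion above; the 2-microsecond per-message overhead below is purely an assumption for illustration. With any fixed per-message cost, a lone small write cannot keep a 50 GB/s link busy, which is why the paged path has to amortize that cost across many pages in flight.

```python
# Rough single-message model: throughput = size / (fixed overhead + wire time).
LINK_BYTES_PER_S = 400e9 / 8   # 400 Gbps expressed as 50 GB/s
OVERHEAD_S = 2e-6              # assumed fixed per-message cost (2 us), illustrative


def single_write_gbps(msg_bytes: float) -> float:
    wire_s = msg_bytes / LINK_BYTES_PER_S
    return msg_bytes / (OVERHEAD_S + wire_s) * 8 / 1e9


for size in (32 * 1024, 1 << 20, 16 << 20):
    print(f"{size / 1024:>9.0f} KiB single write -> {single_write_gbps(size):6.1f} Gbps")

# Takeaway: under this model a lone 32 KiB write tops out around a quarter of
# the link, while a 16 MiB write nearly saturates it. A paged-write path can
# recover full bandwidth at 32 KiB by keeping many pages in flight, so the
# per-message cost is amortized.
```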
So, bringing this all back to the big picture, what’s the ultimate conclusion for you, the listener who might be managing or designing these massive systems?
I think the central lesson is that Transfer Engine made these modern, complex LLM patterns, like disaggregated inference, actually succeed in the real world of heterogeneous cloud deployments, and it did it by finding that essential common ground between the vendor-specific RDMA protocols. It proved, really, that portable point-to-point communication is now mandatory for scaling.
And it complements the old stuff. It doesn’t replace it.
It complements, it doesn’t conflict with, the older collective libraries. You need both. But that point-to-point layer is where all the innovation is going to happen for the next decade.
You know, we often assume that scaling, especially in networking, demands stricter and stricter guarantees, things like mandatory in-order delivery.
And that’s the final thought we want you to really mull over. Transfer Engine’s success didn’t come from some brilliant new complexity. It came from an elegant simplification. The whole thing hinged on that ImmCounter and just accepting that reliable but unordered delivery was the necessary path to get portability between ConnectX and EFA.
Right?
And this raises a crucial question for anyone designing systems and facing your own scaling challenges. How often are you being constrained by legacy protocols that prioritize strict guarantees, like in-order delivery, when what you really need is just maximum reliability and throughput? What other systems could you drastically accelerate if you just consciously relaxed those ordering assumptions?
Something to explore the next time you look under the hood of your own distributed system. Thank you for joining us for this deep dive.”
Book a one hour call ($500) with me to solve your problem. In the past, VCs and investors have used this as technical due diligence and executives have used it for setting technical strategy.
Leonidas Tam, PhD is the co-founder of Amicus AI Advisors, LLC, a California investment advisor that manages a quantitative long/short strategy and provides other investment-related services.

Brilliantly clear breakdown of how Transfer Engine solves the ConnectX/EFA incompatibility problem. The insight about relaxing ordering guarantees while keeping reliability is genuinely clever, kinda reminds me of when we stopped trying to force strict consistency in distributed caches and things actually got faster. I dunno why more teams don't question these legacy assumptions that everyone takes for granted. The 100x speedup on RL weight updates is wild.