published 2/17/2024
Never* use Datagrams
Click-bait title, but hear me out.
TCP vs UDP
So you’re reading this blog over the internet. I would wager you do a lot of things over the internet.
If you’ve built an application on the internet, you’ve undoubtedly had to decide whether to use TCP or UDP. Maybe you’re trying to make, oh I dunno, a live video protocol or something. There are more choices than just those two but let’s pretend like we’re a networking textbook from the 90s.
The common wisdom is:
- use TCP if you want reliable delivery
- use UDP if you want unreliable delivery
What the fuck does that mean? Who wants unreliability?
- You don’t want a hard-drive that fails 5% of writes.
- You don’t want something with random holes in the middle (unless it’s cheese).
- You don’t want a service that is randomly unavailable because ¯\_(ツ)_/¯.
Nobody* wants memory corruption or deadzones or artifacts or cosmic rays. Unreliability is a consequence, not a goal.
Properties
So what do we actually want?
If you go low enough level, you can use electrical impulses to do neat stuff like:
- Power on LEDs in a desired configuration.
- Spin magnets at ludicrous speeds.
- Make objects tingle and shake.
- etc., you get the idea.
But we don’t want to deal with electrical impulses. We want higher level functionality.
Fortunately, software engineering is all about standing on the shoulders of others. There are layers on top of layers on top of layers of abstraction. Each layer provides properties so you don’t have to reinvent the personal computer every time.
Our job as developers is to decide which shoulders we want to stand on. But some shoulders are awful, so we have to be selective. Over-abstraction is bad but so is under-abstraction.
What user experience are we trying to build, and how can we leverage the properties of existing layers to achieve that?
”Unreliable”
There was a recent MoQ interim in Denver. For those unaware, it’s basically a meetup of masochistic super nerds who want to design a live video protocol. We spent hours debating the semantic differences between FETCH and SUBSCRIBE among other riveting topics.
A few times, it was stated that SUBSCRIBE should be unreliable. The room cringed, and I cringed hard enough to write this blog post.
What I actually want is timeliness. If the internet can choose between delivering two pieces of data, I want it to deliver the newest one.
In the live video scenario, this is the difference between buffering and skipping ahead. If you’re trying to have a conversation with someone on the internet, there can’t be a delay. You don’t want a buffering spinner on top of their face, nor do you want to hear what they said 5 seconds ago.
To accomplish timeliness, the live video industry often uses UDP datagrams instead of TCP streams. As does the video game industry apparently. But why?
Datagrams
A datagram, aka an IP packet, is an envelope of 0s and 1s that gets sent from a source address to a destination address. Each device has a different maximum size allowed, which is super annoying, but 1200 bytes is generally safe. And of course, they can be silently lost or even arrive out of order.
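To make the size limit concrete, here is a minimal sketch of what an application has to do by hand once a payload is bigger than one datagram. The function name and the 3000-byte payload are made up for illustration; only the 1200-byte figure comes from the text above.

```python
MAX_DATAGRAM_PAYLOAD = 1200  # the "generally safe" size mentioned above


def split_into_datagrams(payload: bytes, max_size: int = MAX_DATAGRAM_PAYLOAD) -> list[bytes]:
    """Cut a payload into chunks that each fit in a single datagram."""
    return [payload[i:i + max_size] for i in range(0, len(payload), max_size)]


# Hypothetical 3000-byte video frame: it needs three datagrams.
chunks = split_into_datagrams(bytes(3000))
chunk_sizes = [len(chunk) for chunk in chunks]
```

Note that this only splits the data. Any chunk can still be lost or reordered, so the receiver also needs sequence numbers and a plan for missing pieces, which is exactly the kind of work that piles up later in this post.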
But the physical world doesn’t work in discrete packets; it’s yet another layer of abstraction. I’m not a scientist-man, but the data is converted to analog signals and sent through some medium. It all gets serialized and deserialized and buffered and queued and retransmitted and dropped and corrupted and delayed and reordered and duplicated and lost and all sorts of other things.
So why does this abstraction exist?
Internet of Queues
It’s pretty simple actually: something’s got to give.
When there’s too much data sent over the network, the network has to decide what to do. In theory it could drop random bits but oh lord that is a nightmare, as evidenced by over-the-air TV. So instead, a bunch of smart people got together and decided that routers should drop at packet boundaries.
But why drop packets again? Why can’t we just queue and deliver them later? Well yeah, that’s what a lot of routers do these days since RAM is cheap. It’s a phenomenon called bufferbloat and my coworkers can attest that it’s my favorite thing to talk about. 🐷
But RAM is a finite resource so the packets will eventually get dropped. Then you finally get the unreliability you wanted all along…
Oh no
Oh shit, I forgot, I actually want timeliness and bufferbloat is the worst possible scenario. Naively, you would expect the internet to deliver packets immediately, with some random packets getting dropped. However bufferbloat causes all packets to get queued, possibly for seconds, ruling out any hope of timely delivery.
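The "possibly for seconds" part is just arithmetic. A rough sketch, assuming a single FIFO queue in front of a fixed-rate link; the buffer size and link speed below are hypothetical numbers, not measurements:

```python
def queue_delay_seconds(queued_bytes: int, link_bits_per_second: float) -> float:
    """Time for a packet at the back of a full FIFO queue to reach the front."""
    return queued_bytes * 8 / link_bits_per_second


# Hypothetical: a 1 MiB router buffer sitting in front of a 5 Mbit/s uplink.
delay = queue_delay_seconds(1 * 1024 * 1024, 5_000_000)
```

With those made-up numbers a full buffer adds well over a second of latency to every packet behind it, even though nothing has been dropped yet.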
How do you avoid this? Basically, the only way to avoid queuing is to detect it, and then send less. The sender uses some feedback from the receiver to determine how long it took a packet to arrive. We can use that signal to infer when routers are queuing packets, and back off to drain any queues.
This is called congestion control and it’s a huge, never ending area of research. I briefly summarized it in the Replacing WebRTC post if you want more CONTENT. But all you need to know is that sending packets at unlimited rate is a recipe for disaster.
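The detect-and-back-off idea can be sketched in a few lines. This is a toy, not BBR or any real algorithm: the class name, the 25 ms threshold, and the increase/decrease factors are all assumptions picked for illustration, and it uses round-trip time samples as the feedback signal.

```python
class DelayBasedRate:
    """Toy sender-side controller: back off when delay rises above the baseline."""

    def __init__(self, rate_bps: float, min_rate_bps: float = 100_000.0):
        self.rate_bps = rate_bps
        self.min_rate_bps = min_rate_bps
        self.base_rtt = None  # lowest RTT seen so far, treated as "no queuing"

    def on_rtt_sample(self, rtt: float, queue_threshold: float = 0.025) -> float:
        """Update the send rate from one RTT sample (seconds) and return it."""
        if self.base_rtt is None or rtt < self.base_rtt:
            self.base_rtt = rtt
        # Anything above the baseline is assumed to be time spent in router queues.
        queuing_delay = rtt - self.base_rtt
        if queuing_delay > queue_threshold:
            # Queues are building: send less so they can drain.
            self.rate_bps = max(self.min_rate_bps, self.rate_bps * 0.85)
        else:
            # No sign of queuing: probe for more bandwidth.
            self.rate_bps *= 1.05
        return self.rate_bps


controller = DelayBasedRate(rate_bps=1_000_000)
controller.on_rtt_sample(0.050)                 # first sample sets the baseline
rate_before = controller.on_rtt_sample(0.052)   # still close to the baseline
rate_after = controller.on_rtt_sample(0.200)    # 150 ms of extra delay: back off
```

A real congestion controller also has to handle packet loss, changing baselines, competing flows, and pacing, which is why it is a research area and not a weekend project.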
You, The Application Developer
Speaking of a recipe for disaster. Let’s say you made the mistake of using UDP directly because you want them datagrams. You’re bound to mess up, and you won’t even realize why.
If you want to build your own transport protocol on top of UDP, you “need” to implement:
And if you want a great protocol, you also need:
- encryption
- RTT estimates
- path validation
- path migration
- pacing
- flow control
- version negotiation
- extensions
- prioritization
- keep-alives
- multiplexing
And if you want an AMAZING protocol, you also need:
Let’s be honest, you don’t even know what half of those are, nor why they are worth implementing. Just use a QUIC library instead.
But if you still insist on UDP, you’re actually in good company with a lot of the video industry. Building a live video protocol on top of UDP is all the rage; for example, WebRTC, SRT, Sye, RIST, etc. With the exception of Google, it’s very easy to make a terrible protocol on top of UDP. Look forward to the upcoming Replacing RTMP but please not with SRT blog post!
Timeliness
But remember, I ultimately want to achieve timeliness. How can we do that with QUIC?
- Avoid bloating the buffers 🐷. Use a delay-based congestion controller like BBR that will detect queueing and back off. There are better ways of doing this, like how WebRTC uses transport-wide-cc, which I’ll personally make sure gets added to QUIC.
- Split data into streams. The bytes within each stream are ordered, reliable, and can be any size; it’s nice and convenient. Each stream could be a video frame, or a game update, or a chat message, or a JSON blob, or really any atomic unit.
- Prioritize the streams. Streams are independent and can arrive in any order. But you can tell the QUIC stack to focus on delivering important streams first. The low priority streams will be starved, and can be closed to avoid wasting bandwidth.
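The prioritization step above can be sketched as a tiny scheduler. This is not a real QUIC API: the class names, the "newest sequence number wins" policy, and the `max_pending` limit are assumptions for illustration. In practice you would set a priority on each stream through whatever your QUIC library exposes and reset the streams you give up on.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class PendingStream:
    sort_key: int                      # negative sequence, so the newest pops first
    data: bytes = field(compare=False)


class NewestFirstScheduler:
    """Toy scheduler: always send the newest stream, give up on starved ones."""

    def __init__(self, max_pending: int = 3):
        self.max_pending = max_pending
        self.heap = []

    def push(self, sequence: int, data: bytes) -> None:
        heapq.heappush(self.heap, PendingStream(-sequence, data))
        if len(self.heap) > self.max_pending:
            # Stand-in for closing starved streams: keep only the newest few.
            self.heap = heapq.nsmallest(self.max_pending, self.heap)
            heapq.heapify(self.heap)

    def pop(self):
        """Return (sequence, data) for the newest pending stream, or None."""
        if not self.heap:
            return None
        stream = heapq.heappop(self.heap)
        return -stream.sort_key, stream.data


scheduler = NewestFirstScheduler(max_pending=2)
for sequence in (1, 2, 3):
    scheduler.push(sequence, f"frame{sequence}".encode())

first = scheduler.pop()    # newest frame goes out first
second = scheduler.pop()
third = scheduler.pop()    # frame 1 was dropped as stale, so nothing is left
```

During congestion the old frames never get sent, which is the "skipping ahead" behavior from earlier. When there is plenty of bandwidth, nothing is starved and every stream arrives in full.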
That’s it. That’s the secret behind Media over QUIC. Now all that’s left is to bikeshed the details.
And guess what? This approach works with higher latency targets too. It turns out that the fire-and-forget nature of datagrams only works when you need real-time latency. For everything else, there’s QUIC streams.
You don’t need datagrams.
In Defense of Datagrams
“Never* use Datagrams” got you to click, but the direction of QUIC and MoQ seems to tell another story:
- QUIC has support for datagrams via an extension.
- WebTransport requires support for datagrams.
- The latest MoQ version adds support for datagrams.
- The next MoQ version will require support for datagrams.
Like all things designed by committee, there’s going to be some compromise. There are some folks who think datagram support is important. And frankly, it’s trivial to support and allow people to experiment. For example, OPUS has FEC support built-in, which is why MoQ supports the ability to send each audio “frame” as a datagram.
But it’s a trap. Designed to lure in developers who don’t know any better. Who wouldn’t give up their precious UDP datagrams otherwise.
If you want some more of my hot-takes:
- The next blog post about FEC in OPUS, and why layers are important.
- The previous blog post gushed over QUIC, except for the datagram extension which is frankly terrible.
Conclusion
There is no conclusion. This is a rant.
Please don’t design your application on top of datagrams. Old protocols like DNS get a pass, but be like DNS over HTTPS instead.
And please, please don’t make yet another video protocol on top of UDP. Get involved with Media over QUIC instead! Join our Discord and tell me how wrong I am.
Written by @kixelated.