published 2/17/2024
Forward? Error? Correction?
So I absolutely dunked on datagrams in the last blog post. Now it’s time to dunk on the last remaining hope for datagrams: Forward Error Correction (FEC).
Opus
Opus is an amazing audio codec. Full disclosure, I haven’t had the opportunity to work with it directly; I was stuck in AAC land at Twitch. But that’s not going to stop me from talking out of my ass.
I want to rant about Opus’ built-in support for FEC. And to be clear, this isn’t a rant specific to Opus. Somebody inevitably asks for FEC in every networking protocol (like MoQ) and you can link them this post now.
The general idea behind FEC is to send redundant data so the receiver can paper over small amounts of packet loss. It’s conceptually similar to RAID, but for packets spread over time instead of hard drives. There are so many possible FEC schemes, many of which are patented, and I would do the subject a disservice if I tried to explain them all.
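To make that concrete, here’s the simplest scheme I actually understand: XOR parity, where one extra packet per group can rebuild any single lost packet. A minimal sketch (my own toy code, nothing production-grade):

```rust
// A minimal sketch of XOR parity, the simplest FEC scheme: one extra
// packet per group lets you rebuild any SINGLE lost packet, at the
// cost of 1/N extra bandwidth.

/// XOR every packet in a group together to produce the parity packet.
/// (Assumes equal-length packets; real schemes handle padding.)
fn make_parity(packets: &[Vec<u8>]) -> Vec<u8> {
    let len = packets.iter().map(|p| p.len()).max().unwrap_or(0);
    let mut parity = vec![0u8; len];
    for p in packets {
        for (i, b) in p.iter().enumerate() {
            parity[i] ^= b;
        }
    }
    parity
}

/// Rebuild the single missing packet by XORing the survivors into the parity.
fn recover(survivors: &[Vec<u8>], parity: &[u8]) -> Vec<u8> {
    let mut missing = parity.to_vec();
    for p in survivors {
        for (i, b) in p.iter().enumerate() {
            missing[i] ^= b;
        }
    }
    missing
}

fn main() {
    let group = vec![
        b"audio frame 1".to_vec(),
        b"audio frame 2".to_vec(),
        b"audio frame 3".to_vec(),
    ];
    let parity = make_parity(&group);

    // Pretend frame 2 was lost in transit; the other two plus the parity survive.
    let survivors = vec![group[0].clone(), group[2].clone()];
    let rebuilt = recover(&survivors, &parity);
    assert_eq!(rebuilt, group[1]);
    println!("recovered: {}", String::from_utf8_lossy(&rebuilt));
}
```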
Conveniently, audio “frames” are so small that they fit into a single datagram. So rather than deal with retransmissions at the disgusting transport layer, the audio encoder can just encode redundancy via FEC. 🪦 RIP packet loss: 1983-2024 🪦
But despite being a great idea on paper, there are so many things wrong with this.
Networks are Complicated
I worked with some very smart people at Twitch. However, I will never forget a presentation maybe 4 years ago where a very smart codec engineer pitched using FEC.
There was a graph that showed the TCP throughput during random packet loss. Wow, TCP sure has a low bitrate at 30% packet loss, it sucks! But look at this other green line! It’s a custom protocol using UDP+FEC and it’s multiple times faster than TCP!
If somebody shows you any results based on simulated, random packet loss, you should politely tell them: no, that’s not how the internet works.
Networking is not quantum mechanics. There are no dice involved and packet loss is not random. It depends on the underlying transport (there’s a rough simulation after this list):
- Sometimes it occurs randomly due to signal interference.
- Sometimes it occurs in bursts due to batching.
- Sometimes it occurs due to congestion.
- Sometimes it occurs because ???
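Here’s why that matters for FEC. This is a rough simulation with my own toy numbers, nobody’s benchmark: the same 10% average loss rate, applied randomly vs. in bursts, fed through a simple 1-parity-per-5-packets scheme that can repair one loss per group.

```rust
// The same 10% average loss, random vs. bursty, fed through a toy FEC
// scheme: groups of 5 packets + 1 XOR parity, which repairs at most
// ONE loss per group. (Assume the parity itself always arrives, to be
// generous to FEC.) All numbers are made up for illustration.

const PACKETS: usize = 100_000;
const GROUP: usize = 5;
const LOSS_RATE: f64 = 0.10;
const BURST_LEN: usize = 5; // bursty mode drops 5 packets in a row

/// Tiny deterministic LCG so the sketch needs no dependencies.
fn rng(state: &mut u64) -> f64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    (*state >> 11) as f64 / (1u64 << 53) as f64
}

/// Fraction of packets still missing after FEC repair.
fn simulate(bursty: bool) -> f64 {
    let mut state = 42u64;
    let mut lost = vec![false; PACKETS];
    for i in 0..PACKETS {
        if bursty {
            // Start a burst with probability LOSS_RATE / BURST_LEN so the
            // average loss rate stays roughly the same as the random case.
            if rng(&mut state) < LOSS_RATE / BURST_LEN as f64 {
                for j in i..(i + BURST_LEN).min(PACKETS) {
                    lost[j] = true;
                }
            }
        } else if rng(&mut state) < LOSS_RATE {
            lost[i] = true;
        }
    }

    let mut unrecovered = 0;
    for chunk in lost.chunks(GROUP) {
        let losses = chunk.iter().filter(|&&l| l).count();
        if losses > 1 {
            unrecovered += losses; // one parity can't repair a multi-loss group
        }
    }
    unrecovered as f64 / PACKETS as f64
}

fn main() {
    println!("random loss after FEC: {:.2}%", simulate(false) * 100.0);
    println!("bursty loss after FEC: {:.2}%", simulate(true) * 100.0);
}
```

Same input loss, wildly different outcomes. A graph built on random loss tells you nothing about a bursty WiFi link.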
Unfortunately, there’s no magic loophole on the internet. There’s no one weird trick that eluded TCP for 40 years only for the UDP geniuses to figure it out. You can’t send 10x the data to mask packet loss.
In fact, if you ever see a number like 30% packet loss in the real world (yikes), it’s likely due to congestion. You’re sending 30% too much and fully saturating a link. The solution is to send less data, not parity bits. 🤯
Fun fact: that’s the fundamental difference between loss-based congestion control (e.g. Reno, CUBIC) and delay-based congestion control (e.g. BBR, COPA). BBRv1 doesn’t even use packet loss as a signal; it’s all about RTT.
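As a caricature (my constants, not any spec’s): loss-based controllers back off when packets vanish, while delay-based controllers back off when the RTT starts creeping above its floor.

```rust
// A caricature of the two families. Real CUBIC and BBR are vastly more
// involved; this only shows WHICH SIGNAL each one reacts to. All
// constants are made up.

/// Loss-based (Reno/CUBIC style): packet loss IS the congestion signal.
fn on_ack_loss_based(cwnd: &mut f64, packet_lost: bool) {
    if packet_lost {
        *cwnd *= 0.7; // multiplicative decrease on loss
    } else {
        *cwnd += 1.0 / *cwnd; // gentle additive increase otherwise
    }
}

/// Delay-based (BBR/COPA style): a rising RTT IS the congestion signal.
/// Note that loss doesn't even appear in the signature.
fn on_ack_delay_based(rate: &mut f64, rtt: f64, min_rtt: f64) {
    if rtt > min_rtt * 1.25 {
        *rate *= 0.9; // RTT is inflating: a queue is building, back off
    } else {
        *rate *= 1.05; // RTT is flat: no queue, probe for more bandwidth
    }
}

fn main() {
    let mut cwnd = 10.0;
    on_ack_loss_based(&mut cwnd, true);
    println!("loss-based cwnd after one loss: {cwnd:.1}"); // backed off to 7.0

    let mut rate = 10.0;
    on_ack_delay_based(&mut rate, 0.020, 0.020); // RTT sitting at its floor
    println!("delay-based rate with a flat RTT: {rate:.2}"); // still probing upward
}
```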
Expertise
These packet loss misconceptions come up surprisingly often in the live video space. The hyperfocus on packet loss is a symptom of a larger problem: media experts suddenly have to become networking experts.
Even modern media protocols are built directly on top of UDP: for example WebRTC, SRT, Sye, and RIST. And for good reason, as the head-of-line blocking of TCP is a non-starter for real-time media. But with great power (UDP) comes great responsibility.
And the same mistakes keep getting repeated. I can’t tell you the number of times I’ve talked to an engineer at a video conference who decries congestion control, and in the next breath claims FEC is the solution to all their problems. Frankly, I’m just jaded at this point.
But it is definitely possible to have both media and networking expertise. The Google engineers who built WebRTC are a testament to that. However, the complexity of WebRTC speaks volumes to the difficulty of the task.
This is one of the many reasons why we need Media over QUIC. Let the network engineers handle the network and the media engineers handle the media.
End-to-End
But my beef with FEC in Opus is more fundamental.
When I speak into a microphone, the audio data is encoded into packets via a codec like OPUS. That packet then traverses multiple hops, potentially going over WiFi, Ethernet, 4G, fiber, satellites, etc. It switches between different cell towers, routers, ISPs, transit providers, business units, and who knows what else. Until finally, finally, the packet reaches ur Mom’s iPhone and my words replay into her ear. Tell her I miss her. 😢
Unfortunately, each of those hops has different properties and packet loss scenarios. Many of them already have FEC built in, or don’t need it at all.
By performing FEC in the application layer, specifically the audio codec, we’re making a decision that’s end-to-end. It’s suboptimal by definition because packet loss is a hop-by-hop property.
Hop-by-Hop
If not the audio codec, where should we perform FEC instead?
In my ideal world, each hop uses a loss recovery mechanism tailored to its own properties (there’s a sketch of this mapping after the list). If a hop expects:
- burst loss: delayed parity.
- random loss: interleaved parity.
- low RTT: retransmit packets.
- congestion: drop packets.
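Sketched as a toy decision table (the enum names are mine; no real transport exposes anything this tidy):

```rust
// The list above, as a decision table.

#[derive(Debug)]
enum Hop {
    BurstLoss,  // e.g. WiFi interference, cell tower handoff
    RandomLoss, // e.g. a noisy radio link
    LowRtt,     // e.g. 20ms to a nearby edge node
    Congested,  // the link is saturated; loss IS the feedback
}

#[derive(Debug)]
enum Recovery {
    DelayedParity,     // parity sent after a gap, so one burst can't take out data AND parity
    InterleavedParity, // parity spread across groups to decorrelate random loss
    Retransmit,        // cheap whenever the RTT fits inside the latency budget
    Drop,              // sending MORE during congestion only digs the hole deeper
}

fn recovery_for(hop: &Hop) -> Recovery {
    match hop {
        Hop::BurstLoss => Recovery::DelayedParity,
        Hop::RandomLoss => Recovery::InterleavedParity,
        Hop::LowRtt => Recovery::Retransmit,
        Hop::Congested => Recovery::Drop,
    }
}

fn main() {
    for hop in [Hop::BurstLoss, Hop::RandomLoss, Hop::LowRtt, Hop::Congested] {
        println!("{:?} -> {:?}", hop, recovery_for(&hop));
    }
}
```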
But at which layer? A protocol like WiFi doesn’t know the contents of each packet, especially when they’re encrypted like every modern protocol’s. It can’t tell a movie download, where throughput matters, from a conference call, where latency matters.
Our time-sensitive audio packets need to have different behavior than other traffic. There are ways to signal QoS in IP packets, but unfortunately, support is limited as is the granularity. All it takes is one router in the chain to ignore your flag and everything falls apart.
That’s why it absolutely makes sense to perform FEC at a higher level. If the transport layer knows the desired properties, then it can make the best decision. Not the audio codec.
QUIC
So I just dunked on FEC in Opus. “Don’t do FEC in the audio codec, do it in QUIC instead.”
Well, QUIC doesn’t support FEC yet. Oops. There are some proposals, but I imagine it will be a long time before anything materializes.
QUIC is primarily designed and used by CDN companies. Their whole purpose is to put edge nodes as close to the user as possible in order to improve the user experience. When your RTT to the Google/Cloudflare/Akamai/Fastly/etc. edge is 20ms, FEC is strictly worse than retransmissions.
FEC can only ever be an improvement when `target_latency < 2*RTT`.
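The napkin math behind that rule: the original packet spends 0.5 RTT in flight, the receiver needs a moment to notice the loss, the report takes 0.5 RTT back, and the retransmission takes another 0.5 RTT. Call it roughly 2*RTT all-in. A toy sketch with made-up latency budgets:

```rust
// Napkin math for "FEC only helps when target_latency < 2*RTT".
// The 2x is rough: 0.5 RTT original flight + 1 RTT to report the loss
// and resend, plus slack for actually detecting the loss.

fn retransmit_recovery_ms(rtt_ms: f64) -> f64 {
    2.0 * rtt_ms
}

fn fec_helps(target_latency_ms: f64, rtt_ms: f64) -> bool {
    target_latency_ms < retransmit_recovery_ms(rtt_ms)
}

fn main() {
    // 20ms to a CDN edge with a 150ms conferencing budget: just retransmit.
    assert!(!fec_helps(150.0, 20.0));

    // 300ms satellite RTT with the same budget: a retransmission can never
    // arrive in time, so FEC is the only option.
    assert!(fec_helps(150.0, 300.0));

    println!("recovery via retransmit at 20ms RTT: ~{}ms", retransmit_recovery_ms(20.0));
}
```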
Additionally, there might not even be a need for FEC in QUIC. WebRTC supports RED, which was added to RTP back in 1997. The idea is to just transmit the same packet multiple times and let the receiver discard the duplicates.
RED actually works natively in QUIC without any extensions. A QUIC library can send redundant STREAM frames and the receiver will transparently discard duplicates. It’s wasteful but it’s simple and might be good enough for some hops.
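Here’s a sketch of why the receiver side comes for free. QUIC STREAM frames are addressed by byte offset, and every receiver already has to discard ranges it has seen, because a retransmission looks identical to a duplicate. The types below are mine, not any real QUIC library’s:

```rust
// Why RED is "free" in QUIC: STREAM frames are addressed by byte
// offset, and a receiver must already discard ranges it has seen.

struct StreamReassembler {
    delivered: u64, // every byte below this offset has been delivered
    output: Vec<u8>,
}

impl StreamReassembler {
    fn new() -> Self {
        Self { delivered: 0, output: Vec::new() }
    }

    /// Accept a STREAM frame as (offset, data). Bytes we've already
    /// delivered are silently dropped, which is exactly what makes
    /// sending everything twice harmless. (A real implementation also
    /// buffers out-of-order frames; skipped here for brevity.)
    fn on_stream_frame(&mut self, offset: u64, data: &[u8]) {
        let end = offset + data.len() as u64;
        if end <= self.delivered {
            return; // pure duplicate: the redundant copy is a no-op
        }
        let skip = self.delivered.saturating_sub(offset) as usize;
        self.output.extend_from_slice(&data[skip..]);
        self.delivered = end;
    }
}

fn main() {
    let mut stream = StreamReassembler::new();

    // The sender transmits every frame twice, RED-style.
    stream.on_stream_frame(0, b"hello ");
    stream.on_stream_frame(0, b"hello "); // dropped
    stream.on_stream_frame(6, b"world");
    stream.on_stream_frame(6, b"world"); // dropped

    assert_eq!(stream.output, b"hello world");
    println!("{}", String::from_utf8_lossy(&stream.output));
}
```

The sender side is just “send it again”; no extension is needed because the receiver can’t even tell RED apart from an ordinary retransmission.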
In Defense of FEC
This is a hot topic and I am quite ignorant. I don’t want to be too dismissive.
There are absolutely scenarios where FEC is the best solution. When you’re sending data over a satellite link, you’re dealing with a high RTT and burst loss. And there are totally scenarios where you won’t have intermediate hops that can perform retransmissions, like a P2P connection. When the RTT gets high enough, you need FEC.
And performing that FEC in Opus gives you an extra property that I haven’t mentioned yet: partial reconstruction. You might not be able to reconstruct the entire audio bitstream, but you can fill in the blanks, so to speak. The fact that Opus can partially decode a bitstream with only a fraction of the packets, regardless of FEC, is frankly amazing.
And most importantly, you might not have control over the lower layers. I’m used to working at a company with a global network and a CDN but that’s not a common reality. If the only thing you can control is the audio codec, then ratchet that FEC up to 11 and see what happens.
My point is that the transport knows best. The audio encoder shouldn’t know that there’s a satellite link in the chain.
Conclusion
Audio is important.
Networks are complicated.
This is not haiku.
FEC should not be in an audio codec, but rather closer to the source of packet loss. But at the end of the day, I’m just shoving blame down the stack. Do what works best for your users at whatever layer you control.
Just please, never show me a graph based on random packet loss again.
Written by @kixelated.