Replacing HLS/DASH

Low-latency, high bitrate, mass fan-out is hard. Who knew?

See Replacing WebRTC for the previous post in this series.

tl;dr

If you’re using HLS/DASH and your main priority is…

cost: wait until there CDN offerings.
latency: you should seriously consider MoQ.
features: it will take a while to implement everything.
vod: it works great, why replace it?

Intro

Thanks for the positive reception on Hacker News! Anyway, I’m back.

I spent the last 9 years working on literally all facets of HLS and Twitch’s extension: LHLS. We hit a latency wall and my task was to find an alternative, originally WebRTC but that eventually pivoted into Media over QUIC.

Hopefully this time I won’t be “dunning-Krugerering off a cliff”. Thanks random Reddit user for that confidence boost.

Why HLS/DASH?

Simple answer: Apple

If your app delivers video over cellular networks, and the video exceeds either 10 minutes duration or 5 MB of data in a five minute period, you are required to use HTTP Live Streaming.

It’s an anti-climactic answer, but Twitch migrated from RTMP to HLS to avoid getting kicked off the App Store. The next sentence gives a hint as to why:

If your app uses HTTP Live Streaming over cellular networks, you are required to provide at least one stream at 64 Kbps or lower bandwidth.

This was back in 2009 when the iPhone 3GS was released and AT&T’s network was struggling to meet the demand. The key feature of HLS was ABR: multiple copies of the same content at different bitrates. This allowed the Apple-controlled HLS player to reduce the bitrate rather than pummel a poor megacorp’s cellular network.

DASH came afterwards in an attempt to standardize HLS minus the controlled by Apple part. There’s definitely some cool features in DASH but the core concepts are the same and they even share the same media container now. So the two get bundled together as HLS/DASH.

But I’ll focus more on HLS since that’s my shit.

The Good Stuff

While we were forced to switch protocols at the tech equivalent of gunpoint, HLS actually has some amazing benefits. The biggest one is that it uses HTTP.

HLS/DASH works by breaking media into “segments”, each containing a few seconds of media. The player will individually request each segment via a HTTP request and seamlessly stitch them together. New segments are constantly being generated and announced to the player via a “playlist”.

carrot — Thanks for the filer image, DALL·E

Because HLS uses HTTP, a service like Twitch can piggyback on the existing infrastructure of the internet. There’s a plethora of optimized CDNs, servers, and clients that all speak HTTP and can be used to transport media. You do have to do some extra work to massage live video into HTTP semantics, but it’s worth it.

The key is utilizing economies of scale to make it cheap to mass distribute live media. Crafting individual IP packets might the correct way to send live media with minimal latency (ie. WebRTC), but it’s not the most cost effective.

The Bad Stuff

I hope you weren’t expecting a fluff piece.

Latency

We were somewhat sad to bid farewell to Flash (gasp). Twitch’s latency went from something like 3 seconds with RTMP to 15 seconds with HLS.

There’s a boatload of latency sources, anywhere from the duration of segments to the frequency of playlist updates. Over the years we were able to slowly able to chip away at the problem, eventually extending HLS to get latency back down to theoretical RTMP levels. I documented our journey if you’re interested in the gritty details.

But one big source of latency remains: T C P

I went into more detail with my previous blog post, but the problem is head-of-line blocking. Once you flush a frame to the TCP socket, it will be delivered reliably and in order. However, when the network is congested, the encoded media bitrate will exceed the network bitrate and queues will grow. Frames will take longer and longer to reach the player until the buffer is depleted and the viewer gets to see their least favorite spinny boye.

A HLS/DASH player can detect queuing and switch to a lower bitrate via ABR. However, it can only do this at infrequent (ex. 2s) segment boundaries, and it can’t renege any frames already flushed to the socket. So if you’re watching 1080p video and your network takes a dump, well you still need to download seconds of unsustainable 1080p video before you can switch down to a reasonable 360p.

You can’t just put the toothpaste back in the tube if you squeeze out too much. You gotta use all of the toothpaste, even if it takes much longer to brush your teeth.

TCP toothpaste — Source. The analogy falls apart but I get to use this image again.

Clients

HLS utilizes “smart” clients and “dumb” servers. The client decides what, when, why, and how to download each media playlist, segment, and frame. Meanwhile the server just sits there and serves HTTP requests.

The problem really depends on your perspective. If you control:

client only: Life is great!
client and server: Life is great! You can even extend the protocol!
server only: Life is pain.

For a service like Twitch, the solution might seem simple: build your own client and server! And we did, including a baremetal live CDN designed exclusively for HLS.

But until quite recently, we have been forced to use the Apple HLS player on iOS for AirPlay or Safari support. And of course TVs, consoles, casting devices, and others have their own HLS players. And if you’re offering your baremetal live CDN to the public, you can’t exactly force customers to use your proprietary player.

So you’re stuck with a dumb server and a bunch of dumb clients. These dumb clients make dumb decisions with no cooperation with the server, based on imperfect information.

Ownership

I love the simplicity of HLS compared to DASH. There’s something so satisfying about a text-based playlist that you can actually read, versus a XML monstrosity designed by committee.

#EXTM3U
#EXT-X-TARGETDURATION:10
#EXT-X-VERSION:3
#EXTINF:9.009,
http://media.example.com/first.ts
#EXTINF:9.009,
http://media.example.com/second.ts
#EXTINF:3.003,
http://media.example.com/third.ts
#EXT-X-ENDLIST

Orgasmic.

But unfortunately Apple controls HLS.

There’s a misalignment of incentives between Apple and the rest of the industry. I’m not even sure how Apple uses HLS, or why they would care about latency, or why they insist on being the sole arbiter of a live streaming protocol. Pantos has done a great and thankless job, but it feels like a stand-off.

For example, LL-HLS originally required HTTP/2 server push and it took nearly the entire industry to convince Apple that this was a bad idea. The upside is that we got a mailing list so they can announce changes to developers first… but don’t expect the ability to propose changes any time soon.

DASH is its own can of worms as it’s controlled by MPEG. The specifications are behind a paywall or require patent licensing? I can’t even tell if I’m going to get sued for parsing a DASH playlist without paying the troll toll.

troll toll — Source. 🎵 You gotta pay the Troll Toll 🎵

What’s next?

You’re given a blank canvas and a brush to paint the greenest of fields, what do you make?

Source. Wow. That’s quite the green field.

TCP

After my previous blog post, I had a few people hit up my DMs and claim they can do real-time latency with TCP. And I’m sure a few more people will too after this post, so you get your own section that muddles the narrative.

Yes, you can do real-time latency with TCP (or WebSockets) under ideal conditions.

However, it just won’t work well enough on poor networks. Congestion and buffer-bloat will absolutely wreck your protocol on poor networks. A lot of my time spent at Twitch was optimizing for the 90th percentile; the shoddy cellular networks in Brazil or India or Australia.

But if you are going to reinvent RTMP, there are some ways to reduce queuing but they are quite limited. This is especially true in a browser environment when limited to HTTP or WebSockets.

See my next blog post about Replacing RTMP.

HTTP

Notably absent thus far has been any mention of LL-HLS and LL-DASH. These two protocols are meant to lower HLS/DASH latency respectively by breaking media segments into smaller chunks.

The chunks might be smaller, but they’re still served sequentially over TCP. The latency floor is lower but the latency ceiling is still just as high, and you’re still going to buffer during congestion.

We’re also approaching the limit of what you can do with HTTP semantics.

LL-HLS has configurable latency at the cost of and exponential number of sequential requests in the critical path. For example, 20 HTTP requests a second per track still only gets you +100ms of latency, which is not even viable for real-time latency.
LL-DASH can be configured down to +0ms added latency, delivering frame-by-frame with chunked-transfer. However it absolutely wrecks client-side ABR algorithms. Twitch hosted a challenge to improve this but I’m convinced it’s impossible without server feedback.

HESP also gets a special shout-out because it’s cool. It works by canceling HTTP requests during congestion and frankensteining the video encoding which is quite ~~hacky~~ clever, but suffers a similar fate.

We’ve hit a wall with HTTP over TCP.

HTTP/3

If you’re an astute hyper text transport protocol aficionado, you might have noticed that I said “HTTP over TCP” above. But HTTP/3 uses QUIC instead of TCP. Problem solved! We can replace any mention of ~~TCP~~ with QUIC!

Well, not quite. To use another complicated topic as a metaphor:

A TCP connection is a single-core CPU.
A QUIC connection is a multi-core CPU.

If you take a single threaded program and run it on a multi-core machine, it will run just as slow, and perhaps even slower. This is the case with HLS/DASH as each segment request is made sequentially. HTTP/3 is not a magic bullet and only has marginal benefits when used with HLS/DASH.

The key to using QUIC is to embrace concurrency.

This means utilizing multiple, independent streams that share a connection. You can prioritize a stream so it gets more bandwidth during congestion, much like you can use nice on Linux to prioritize a process when CPU starved. If a stream is taking too long, you can cancel it much like you can kill a process.

For live media, you want to prioritize new media over old media in order to skip old content. You also want to prioritize audio over video, so you can hear what someone is saying without necessarily seeing their lips move. If you can only transmit half of a media stream in time, make sure it’s the most important half.

To Apple/Pantos’ credit, LL-HLS is exploring prioritization using HTTP/3. It doesn’t go far enough (yet!) and HTTP semantics get in the way, but it’s absolutely the right direction. I’m convinced that somebody will make a HTTP/3 only media protocol at some point.

But of course I’m biased towards…

Media over QUIC

MoQ utilizes WebTransport/QUIC directly to avoid TCP and HTTP. But what about that whole economies of scale stuff?

Well, there are some important differences between Media over QUIC as compared to your standard not invented here protocol:

Reason 0: QUIC

QUIC is the future of the internet. TCP is a relic of the past.

You’re going to see a lot of this logo, although not crudely traced or green.

It’s a bold claim I know. But I struggle to think of a single reason why you would use TCP over QUIC going forward. There are still some corporate firewalls that block UDP (used by QUIC) and hardware offload doesn’t exist yet, but I mean that’s about it.

It will take a few years, but every library, server, load balancer, and NIC will be optimized for QUIC delivery. Media over QUIC offloads as much as possible into this powerful layer. We benefit also from any new features, including proposals such as multi-path, FEC, congestion control, etc. I don’t want network features in my media layer thank you very much (looking at you WebRTC).

It might not be obvious is that HTTP/3 is actually a thin layer on top of QUIC. Likewise MoQ is also meant to be a thin layer on top of QUIC, effectively just providing pub/sub semantics. We get all of the benefits of QUIC without the baggage of HTTP, and yet still achieve web support via WebTransport.

Instead we can focus on the important stuff instead: live media.

Reason 1: Relay Layer

To avoid the mistakes of WebRTC, we need to decouple the application from the transport. If a relay (ie. CDN) knows anything about media encoding, we have failed.

The idea is to break MoQ into layers.

MoqTransport is the base layer and is a typical pub/sub protocol, although catered toward QUIC. The application splits data into “objects”, annotated with a header providing simple instructions on how the relay needs to deliver it. These are generic signals, including stuff like the priority, reliability, grouping, expiration, etc.

MoqTransport is designed to be used for arbitrary applications. Some examples include:

live chat
end-to-end encryption
game state
live playlists
or even a clock!

This is huge draw for CDN vendors. Instead of building a custom WebRTC CDN that targets one specific niche, you can cast a much wider net with MoqTransport. Akamai, Google, and Cloudflare have been involved in the standardization process thus far and CDN support is inevitable.

Reason 2: Media Layer

There will be at least one media layer on top of MoqTransport. We’re focused on the transport right now so there’s no official “adopted” draft yet.

However, my proposal is Warp. It uses CMAF so it’s backwards compatible with HLS/DASH while still capable of real-time latency. I think this is critically important, as any migration has to be done piecewise, client-by-client and user-by-user. The same media segments can be served for a mixed roll-out and for VoD.

This website uses Warp! Try it out! Or watch one of my presentations.

There will absolutely be other mappings and containers; MoQ is not married to CMAF. The important part is that only the encoder/decoder understand this media layer and not any relays in the middle. There’s a lot of cool ideas floating around, such as a live playlist format and a low-overhead container.

Reason 3: IETF

Media over QUIC is an IETF working group.

I crudely traced and recolored this logo too.

If you know nothing about the IETF, just know that it’s the standards body behind favorites such as HTTP, DNS, TLS, QUIC, and even WebRTC. But I think this part is especially important:

There is no membership in the IETF. Anyone can participate by signing up to a working group mailing list (more on that below), or registering for an IETF meeting. All IETF participants are considered volunteers and expected to participate as individuals, including those paid to participate.

It’s not a protocol owned by a company. It’s not a protocol owned by lawyers.

Join the mailing list.

What’s missing?

Okay cool so hopefully I sold you on MoQ. What can’t you use it today to replace HLS/DASH?

It’s not done yet: The IETF is many things, but fast is not one of them.
Cost: QUIC is a new protocol that has yet to be fully optimized to match TCP. It’s possible and apparently Google is near parity.
Support: Your favorite language/library/cdn/cloud/browser might not even provide HTTP/3 support yet, let alone WebTransport or QUIC.
Features: Somebody has to reimplement all of the annoying HLS/DASH features like DRM and server-side advertisements…
VoD: MoQ is currently live only. HLS/DASH work great, why replace it?

We’ll get there eventually.

Feel free to use our Rust or Typescript implementation if you want to experiment. Join the Discord if you want to help!

Written by @kixelated.