Poor man’s simulcast

🎻 July 2021 fiddle

Want to learn more about WebRTC?

Look us up here for the WebRTC Fiddle of the Month, created… once a month.

Or just enroll in one of our excellent WebRTC training courses.

Resources

Transcription

Tsahi: Hi and welcome to WebRTC Fiddle of the Month! With me is Philipp Hancke, and we’re going to talk about poor man’s simulcast. It started from FaceTime Web being announced, Philipp going and checking what exactly they’ve done there, and finding some interesting things, right?

Philipp: Yes, they chose a non-standard approach to simulcast. And I looked at it for webrtcHacks. So you can read the full post there. But I think the technique used is interesting enough that we can take a look and talk about it, OK?

Tsahi: And I think before we start there, it would make sense to explain what simulcast is exactly. So let’s start with that.

What is simulcast

Philipp: Simulcast is one of the most important topics in multiparty WebRTC.

Tsahi: Simulcast adds flexibility. We’ve got a media server, the SFU – Selective Forwarding Unit; and what it needs to do is forward the media that it receives from each one of the participants in the session to the other participants. Since we want the SFU to be lazy and not use CPU to encode or decode actual media, what we do is use simulcast. In simulcast, the actual devices are the ones doing the encoding. Instead of sending a single stream towards the SFU, they can send two or even three separate media streams towards the media server.

And each one is going to be at a different bitrate, which means a different level of quality: resolution, frame rate or just pure quality. Now the SFU receives these three inputs and then it makes a decision which one to send to which of the participants. So in this case, it decided to pick the 1.5Mbps stream, the highest quality stream. At some point in the future, towards the exact same device, it can decide to send a lower bitrate; it can also send different streams to different users, and it can change that dynamically over time.
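For reference, this is roughly what the sending side of standard simulcast looks like in the browser. The rid names, bitrates and scaling factors below are only illustrative, not taken from the fiddle:

```js
// A minimal sketch of a sender offering three simulcast layers with the
// standard WebRTC API. The encoding parameters here are illustrative.
const pc = new RTCPeerConnection();
const stream = await navigator.mediaDevices.getUserMedia({ video: true });

pc.addTransceiver(stream.getVideoTracks()[0], {
  direction: 'sendonly',
  streams: [stream],
  sendEncodings: [
    { rid: 'low', maxBitrate: 150000, scaleResolutionDownBy: 4 },
    { rid: 'mid', maxBitrate: 500000, scaleResolutionDownBy: 2 },
    { rid: 'high', maxBitrate: 1500000 }, // the ~1.5Mbps layer from the example
  ],
});
```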

Philipp: One example is when you’re displayed at a smaller size, so the SFU doesn’t need to send the full 1.5Mbps – that would be wasting a lot of bandwidth.

Tsahi: It’s like seeing us side by side now, versus seeing us small at the top when something else is being displayed. And you’ve got a fiddle or demo or code to show us?

Poor man’s simulcast

Philipp: The important thing here is to remember that the browser can send simulcast, but it cannot receive simulcast. And FaceTime took a different approach here. So let’s look at that.

What we have here is based on the simulcast playground, which is a JavaScript-only way to visualize how simulcast works in the browser. It has one peer connection sending simulcast – sending three tracks – and, on the other end, it splits that up with an SDP hack to visualize it, because the browser cannot receive simulcast, it can only send it. And what we do here is look at how FaceTime does it, because FaceTime sends and receives three different streams, and it seems like it does some kind of switching between the videos in the client application.

Tsahi: OK, so the SFU is still making that selection and sending. But the client, instead of not knowing what happens, needs to understand that there are three separate streams and that the SFU sends data or media on only one of them at a time.

Philipp: Yes, and we’re going to look at two different ways to switch between these videos. The first one is to reattach the srcObject property, which is a MediaStream – I hope we can see that, so I’m going to demonstrate it. We see the low resolution stream now, and we’re switching to the mid resolution, and the high resolution. Sometimes you can see a lot of visual flickering here, though it isn’t happening right now.
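As a rough illustration of this first method (the function and variable names here are made up, not taken from the fiddle):

```js
// Switching by reattaching srcObject: a single <video> element gets pointed
// at whichever remote MediaStream is currently active. This reattachment is
// what can cause the visible flicker mentioned above.
function switchBySrcObject(videoElement, remoteStreams, activeIndex) {
  videoElement.srcObject = remoteStreams[activeIndex];
}
```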

Tsahi: I guess one of the main problems here would be timing, because if I am the SFU and I’m sending three different streams, I’m actually sending one at a time, but the client needs to know which one is being sent and then switch between them – either because I sent it signaling or because it sees packets coming on one and not the other. But that is done at the application level, which is a bit too high, so it might happen later in the process. Whereas here you always have all three video streams available, and you switch between existing video streams that actually have data on them at all times.

Philipp: Yes, and that wasn’t even possible in WebRTC in the past, or at least not in a reliable way. But FaceTime uses the Insertable Streams API to do this kind of detection to know which MediaStream is the active one. So we’re going to look at that a bit later.

The second approach to the switching is to have three different video elements, then hide two of them and just display one of them.

Tsahi: OK. So when you switch, it actually changes which video element is being displayed and the other ones get hidden.

Philipp: Yes. And that flickers a lot less, because it’s just CSS and browsers are incredibly good at CSS. So let’s switch so we can see – there’s almost no flickering. If there is very continuous motion, you can sometimes see things jump backwards slightly, because the different streams have different timestamps, so the synchronization is a bit off.
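A sketch of this second method might look like the following (again with made-up names):

```js
// Switching by visibility: three <video> elements stay attached to their
// streams the whole time, and CSS decides which one is shown.
function switchByVisibility(videoElements, activeIndex) {
  videoElements.forEach((video, i) => {
    video.style.display = i === activeIndex ? 'block' : 'none';
  });
}
```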

Tsahi: And you probably won’t see it in this video, because this video is something that we do inside a WebRTC call, which then gets recorded, edited and uploaded to YouTube and/or Vimeo, and you’re going to play it back from there.

So the easiest way would be to go to the URL below this video and to play with it on your own.

Philipp: Yes. Make sure you look for things like blurring or these switchbacks and small jumps in the video.

Tsahi: OK, and you want to show something in the code?

Philipp: Yes, that was interesting. So Insertable Streams is an API that is mainly used for end-to-end encryption. But what I saw in the FaceTime JavaScript was similar to the code I’ve written here. We have something that works on the decoding end of the connection that gets called for each encoded frame. And we have a mapping – we record the last timestamp that we’ve seen for a specific SSRC, and if we don’t have a timestamp yet, we initialize it to the current RTP timestamp minus 100,000.

And then what we do is look for keyframes, and if we get a keyframe and the timestamp is recent, then we switch to that one. Because then we know we are prepared to switch: if we have a keyframe, we can decode the video and we can switch. And here we have the two variants: one that reattaches the srcObject, which causes the flickering and is very annoying; and a second one that shows only one video and hides the rest.
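A rough sketch of that detection logic, based on the description above, could look like this. It uses Chromium’s Insertable Streams API (createEncodedStreams); the exact comparison and threshold are approximations of what was described, and switchTo is a hypothetical callback that triggers one of the two display-switching methods shown earlier. This is not FaceTime’s or the fiddle’s actual code:

```js
const lastTimestamps = new Map(); // SSRC -> last RTP timestamp seen on that stream

function watchReceiver(receiver, ssrc, switchTo) {
  // Requires the peer connection to be created with
  // { encodedInsertableStreams: true } in Chromium.
  const { readable, writable } = receiver.createEncodedStreams();
  const transform = new TransformStream({
    transform(encodedFrame, controller) {
      // If we have no timestamp for this SSRC yet, initialize it to the
      // current RTP timestamp minus 100,000, as described above.
      const last = lastTimestamps.get(ssrc) ?? encodedFrame.timestamp - 100000;
      // A keyframe with a recent timestamp means the SFU is forwarding this
      // layer now, and it is safe to decode and display it.
      if (encodedFrame.type === 'key' && encodedFrame.timestamp - last <= 100000) {
        switchTo(ssrc);
      }
      lastTimestamps.set(ssrc, encodedFrame.timestamp);
      controller.enqueue(encodedFrame); // always pass the frame on to the decoder
    },
  });
  readable.pipeThrough(transform).pipeTo(writable);
}
```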

Normal vs poor man’s simulcast

Tsahi: Let’s go and look at the differences themselves. So let’s see why: whenever someone does such a thing, it’s interesting to understand why.

So we prepared this small table. In normal simulcast, we have the sender, the receiver and the SFU. The sender in our case has a single MediaStreamTrack that it created, and it generates three different SSRCs, one for each of the three video bitrates that it is sending out – the three layers, in a way.

The receiver on the other end doesn’t know what simulcast is. So it has a single MediaStreamTrack and a single SSRC throughout the whole session. It doesn’t matter if the SFU – the media server – switches from one bitrate to another; there is going to be a single SSRC for the receiver. In order to do that, the media server needs to take the three inputs that it has, decide what to send to the receiver, and then send it over one SSRC, which means it will need to rewrite the SSRC coming from the sender. It will also need to rewrite the timestamps and the picture ID.

So that’s like work that is being done by the SFU in order to support simulcast.

Philipp: Yes, and it’s a pretty complicated thing, actually, particularly if you have things like packet loss – rewriting the timestamps and the sequence numbers gets a bit tricky.

Tsahi: OK, and then we’ve got the poor man’s approach to simulcast, which is what Apple decided to do for FaceTime. What they’re doing there is – the sender has three MediaStreamTracks instead of one, with three different sources. So it’s actually sending three separate, distinct video sources. On the receiving end, we’ve got, again, three MediaStreamTracks and three SSRCs; the receiver knows that it is simulcast and is aware of what is happening. All this is done at the application level.

Because of that, the SFU doesn’t need to do any rewrites. It just needs to pass the media directly from the sender towards the receiver.
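As a hedged sketch of what the sending side of this approach could look like – the three getUserMedia calls and resolutions below are only one way to get three separate sources, and are not taken from the fiddle or from FaceTime:

```js
// Three independent captures at different resolutions, each added to the peer
// connection as its own track, and therefore its own SSRC and m-line.
const pc = new RTCPeerConnection();
for (const height of [720, 360, 180]) {
  const stream = await navigator.mediaDevices.getUserMedia({
    video: { height: { ideal: height } },
  });
  pc.addTrack(stream.getVideoTracks()[0], stream);
}
```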

The estimate is that this was done because the FaceTime team didn’t want to do all the rewrite stuff on the SFU, they deemed it unnecessary or too much work.

Philipp: Yes. Also, this approach plays a bit nicer with end-to-end encryption, which is probably a big topic for them. This is actually not the first time I’ve seen this approach – we did that very early in the life of the Jitsi Videobridge, but it didn’t work so well. So I’m surprised to see it again. But it was interesting to look at.

Tsahi: Mm hmm. OK, but I’d say that our suggestion to developers is to use the normal simulcast approach. That’s tried and true. Everyone uses that, including Google, which means that this will work best on their browser.

Philipp: Yes.

Tsahi: OK, so thank you for joining us for this WebRTC Fiddle of the Month and see you next month.
