Checking audio levels

🎻 September 2020 fiddle

Source: https://jsfiddle.net/fippo/1eL9dm6u/20/

Want to learn more about WebRTC?

Look us up here for the WebRTC Fiddle of the Month, created… once a month.

Or just enroll in one of our excellent WebRTC training courses.

Resources

  • Hark – an alternative based on Web Audio

Transcription

Tsahi: Hi and welcome to the WebRTC monthly fiddle. This time we’re going to talk about checking the audio levels.

I guess the first thing to ask is why you would want to do that and what exactly we’re talking about. So we are in a call, OK? We’re speaking. We want to understand what the audio level is that we’re either sending or receiving in the call. We’re not going to cover the sender part in this fiddle; we leave that to you to do, probably with Web Audio.
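That sender-side measurement with Web Audio could look roughly like the sketch below. This is not part of the fiddle; the element id 'localLevel' and the 200 millisecond interval are hypothetical placeholders.

```js
// A rough sender-side sketch with Web Audio: feed the local microphone
// into an AnalyserNode and compute an RMS level from the time-domain data.
async function showLocalLevel() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const audioContext = new AudioContext();
  const source = audioContext.createMediaStreamSource(stream);
  const analyser = audioContext.createAnalyser();
  analyser.fftSize = 256;
  source.connect(analyser);

  const samples = new Float32Array(analyser.fftSize);
  setInterval(() => {
    analyser.getFloatTimeDomainData(samples);
    // Root mean square of the samples as a rough level estimate.
    const rms = Math.sqrt(samples.reduce((sum, s) => sum + s * s, 0) / samples.length);
    document.getElementById('localLevel').innerText = rms.toFixed(3);
  }, 200);
}
```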

What we want to discuss and deal with is what you do with the incoming channels. With an SFU, media gets routed, so we can have one, two, three, four, 10, 20 different audio streams coming in. And we’d like to understand, you know, who is the one that is speaking, or who has a dog barking in the background when you don’t have the new machine learning noise suppression stuff in the application that you’re using.

There are three ways to do that.

  1. The first one is – we can use Web Audio. If we use Web Audio, we actually need to use CPU processing to understand the audio levels, get them and then deduce from that what to do.
  2. The second approach that we have is getStats() and, you know, we like getStats() and use it all the time. In this case we’re going to dig into getStats() for the audio levels that we need, collect them and do things with them (a sketch of this approach follows the list). That’s nice, but Philipp wants to do it at an interval of every twenty milliseconds, which means too many calls to getStats().
  3. Now the third approach is to use getSynchronizationSources(), which is quite straightforward. And what I do now, for the actual explanation of how you do that, is to hand it over to Philipp.
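For reference, a sketch of the second approach, assuming the browser exposes audioLevel on the 'inbound-rtp' stats of the audio receiver; the fiddle itself uses the third approach.

```js
// Approach 2: read the audio level from getStats(), if the browser reports it.
async function levelFromGetStats(peerConnection) {
  const stats = await peerConnection.getStats();
  let level;
  stats.forEach(report => {
    if (report.type === 'inbound-rtp' && report.kind === 'audio') {
      level = report.audioLevel; // 0.0 .. 1.0, may be undefined
    }
  });
  return level;
}
```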

Philipp: Thank you. So, let’s talk about the jsFiddle.

We have a bit of a larger fiddle this time, and it’s basically split into two or three parts. The first part is initialization of some things, and then negotiating an active peer connection, because we’re going to look at the remote audio level, so we need an active peer connection. And we do the usual thing: we get a stream from getUserMedia().

Then we call addTrack() for each of the tracks of the stream, which is just a single track, but we iterate over all of the tracks anyway.

We create an offer. We call setRemoteDescription() on the other side, setLocalDescription() on the local side, create an answer, call setRemoteDescription() and setLocalDescription().

And then the whole ICE candidate exchange starts, which is wired up here, and then we just negotiate.
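Condensed, the setup Philipp describes looks roughly like this. pc1 and pc2 are hypothetical names for the two in-page peer connections, error handling is omitted, and the fiddle’s own variable names may differ.

```js
const pc1 = new RTCPeerConnection();
const pc2 = new RTCPeerConnection();

// Wire up the ICE candidate exchange between the two sides.
pc1.onicecandidate = e => e.candidate && pc2.addIceCandidate(e.candidate);
pc2.onicecandidate = e => e.candidate && pc1.addIceCandidate(e.candidate);

async function negotiate() {
  // Get a stream from getUserMedia() and call addTrack() for each of its tracks.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  stream.getTracks().forEach(track => pc1.addTrack(track, stream));

  // Create an offer, set it as remote description on the other side and
  // local description on the local side, then do the same with the answer.
  const offer = await pc1.createOffer();
  await pc2.setRemoteDescription(offer);
  await pc1.setLocalDescription(offer);

  const answer = await pc2.createAnswer();
  await pc1.setRemoteDescription(answer);
  await pc2.setLocalDescription(answer);
}
```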

Tsahi: So up until here, we did nothing of what we wanted; we just created the peer connection that can actually receive incoming audio, so we have something to work with.

Philipp: Yes. And then we do a setInterval() call, which calls the function we give it every 200 milliseconds in this case. How often you do this is up to you, and we’re going to argue about that later: how often is the right amount?
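In code that is just a setInterval() call; the pollAudioLevels callback itself is sketched after the next two paragraphs, and its name is a placeholder.

```js
// Poll the remote audio level every 200 milliseconds; how often is up to you.
const levelTimer = setInterval(pollAudioLevels, 200);
// Remember to clearInterval(levelTimer) when the call ends.
```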

So what we do here is we get the audio receiver. In this case we only have a single one, but you might have to iterate over all your peer connections and over all audio receivers on your single peer connection. We check for support for the getSynchronizationSources() method and then we get the synchronization sources.

If we get something from that (and we might not in some cases, because there’s a 10 second timeout on it), we push the audio level to our array of levels. Then we make sure we only keep a backlog of 10 audio levels, which we can do things like averaging on, or whatever you want, and then we update our display element. You can see that here on the right; it’s just running through the numbers.
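A sketch of that polling callback, assuming the receiving peer connection from the earlier sketch (pc2) and a display element with the id 'levels'; both names are hypothetical and the fiddle’s own code may differ.

```js
const levels = [];

function pollAudioLevels() {
  pc2.getReceivers()
    .filter(receiver => receiver.track && receiver.track.kind === 'audio')
    .forEach(receiver => {
      // getSynchronizationSources() is not implemented everywhere, so feature-detect.
      if (typeof receiver.getSynchronizationSources !== 'function') {
        return;
      }
      // Only sources seen in the last 10 seconds are reported, so this may be empty.
      const [source] = receiver.getSynchronizationSources();
      if (!source) {
        return;
      }
      levels.push(source.audioLevel); // instantaneous level, 0.0 .. 1.0
      if (levels.length > 10) {
        levels.shift(); // keep a backlog of the last 10 samples
      }
      document.getElementById('levels').innerText = levels.join(', ');
    });
}
```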

Tsahi: And this is me speaking to you somehow, and the numbers don’t make sense, but this is how you do it.

Philipp: Yes. And we can, for example, toggle the mute button and then we see, oh, no audio received on the remote end anymore.

Tsahi: Oh, it’s not me speaking, it’s actually you, because it’s local.

Philipp: Yes, and the audio doesn’t go from my headphones to my microphone.

Tsahi: Yes, OK, so that’s how you do it. Now, why exactly do we want it that way?

Philipp: Well, I mean, the question is really how you average this. Because getSynchronizationSources() gives you an instantaneous audio level, and the audio level would have been zero for some part of the last few seconds, but not all of that time.

Tsahi: You mean when I try to breathe between words or pause when I speak?

Philipp: Yes. So you should do some kind of averaging over a time window.
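One possible way to do that, smoothing the backlog collected above with a simple moving average (the last 10 samples are roughly two seconds at a 200 millisecond polling interval):

```js
// Average the collected backlog of instantaneous levels.
function averageLevel(levels) {
  if (levels.length === 0) {
    return 0;
  }
  return levels.reduce((sum, level) => sum + level, 0) / levels.length;
}
```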

Tsahi: OK, so let’s say I’ve got 20 people in my call, all of them with open mics. Not that common, but let’s say they all have open mics, and now I’m receiving 20 different incoming audio channels. If I do that very frequently, let’s say every 20 milliseconds, that’s 50 checks per second per incoming audio channel, which means five hundred per second for ten people and a thousand different checks per second in this callback if I’m doing 20 people.

Philipp: Yes. And luckily this getSynchronizationSources() API is lightweight enough that it’s possible, because doing it with getStats() would explode in that case.

Tsahi: OK, and my thought is that I don’t care about possible, I want to be optimal. And if I try to do that, then probably I should check fewer times, or check only one, or start checking frequently only when things change, or find some other heuristic that helps me run fewer of these callbacks instead of more of them.
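A hypothetical sketch of one such heuristic, not from the fiddle: poll each receiver at a slow rate by default and switch to a fast rate only while its level is above a threshold. All names and numbers here are illustrative.

```js
// Hypothetical adaptive polling: slow by default, fast only while a
// receiver's level is above a threshold.
const SLOW_INTERVAL_MS = 1000;
const FAST_INTERVAL_MS = 100;
const ACTIVE_THRESHOLD = 0.05;

function pollReceiverAdaptively(receiver) {
  let level = 0;
  if (typeof receiver.getSynchronizationSources === 'function') {
    const [source] = receiver.getSynchronizationSources();
    if (source) {
      level = source.audioLevel;
    }
  }
  const nextInterval = level > ACTIVE_THRESHOLD ? FAST_INTERVAL_MS : SLOW_INTERVAL_MS;
  setTimeout(() => pollReceiverAdaptively(receiver), nextInterval);
}
```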

Philipp: Yes.

Tsahi: OK. But I guess people here should be smart enough to figure that out on their own.

Philipp: It’s a problem.

Tsahi: OK, thank you and go check your audio levels. See you next month!

Philipp: Bye.
