Do you hear me?

By Olivier Anguenot

Published in dev

December 29, 2022

Changing environment

What are virtual devices?

When can virtual devices cause problems?

Detecting a virtual microphone

Detecting a virtual camera

Detecting virtual speakers

Do I really need to detect virtual devices?

To close the year, I wanted to look at a problem that has affected me for some months: sometimes I connect using virtual devices instead of my default real devices.

I noticed it because either I couldn’t hear my recipients or they couldn’t hear me when I talked.

Most of the time, when someone in a meeting says “Hey, do you hear me?”, the problem is due to an external Bluetooth device misbehaving or simply to the person forgetting to unmute. But this time the cause is different: the use of virtual devices.

How can virtual devices mess up an existing working setup? And how did these devices get installed on your computer?

Changing environment

Most of the time, when I use my browser to connect to an audio and video conference, I start by selecting the devices I want to use. With my laptop, I always connect with my built-in devices (i.e., webcam, microphone and speaker). I don’t have any other choice so, out of habit, I pay less and less attention to this screen.

One day, as I needed to record my screen for a presentation, I installed tools like OBS and ManyCam. The next day, when I joined a video conference, I was surprised by an audio issue: my recipient couldn’t hear me, and when I looked at the devices in use, I discovered new ones that seemed to be related to the tools I had installed…

Something had changed in my environment: my favorite devices had been replaced by virtual devices without any direct action from me.

In fact, this was the consequence of two things going wrong:

The browser selected the new virtual devices by default as the microphone, camera and speakers to use
The application didn’t store the previously used devices

Of course, as the deviceId remains the same for the same browser and site, session after session, an application that does not propose the previously used device could be seen as buggy.
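As a minimal sketch of what "proposing the previously used device" could look like: the helper name pickPreferredDevice and the storage key preferredAudioInput are hypothetical, chosen only for illustration. Only enumerateDevices, kind and deviceId come from the standard API.

```javascript
// Hypothetical helper: choose the device to pre-select for a given kind.
// `devices` is an array shaped like the result of enumerateDevices().
// `storedId` is the deviceId saved during a previous session (or null).
function pickPreferredDevice(devices, kind, storedId) {
  const candidates = devices.filter((device) => device.kind === kind);
  // Reuse the stored device when it is still available...
  const previous = candidates.find((device) => device.deviceId === storedId);
  // ...otherwise fall back to the first device of that kind, if any.
  return previous ?? candidates[0] ?? null;
}

// Browser-only usage, not executed here (the storage key is an assumption).
async function restoreMicrophone() {
  const devices = await navigator.mediaDevices.enumerateDevices();
  const storedId = localStorage.getItem("preferredAudioInput");
  return pickPreferredDevice(devices, "audioinput", storedId);
}
```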

But this does not cover all cases. For example, since Firefox still shows its own device selector, you can select the wrong device there. So I wondered how to avoid this problem:

From a technical point of view: Is there a way to identify devices as virtual?
From a philosophical point of view: Do I really need to identify devices as virtual?
From a usability point of view: What happens if I use the Microsoft Teams virtual microphone or speaker with my Rainbow application?

This article tries to give some answers to these questions.

Keep in mind that when you install software such as a communication or streaming application, or simply your device drivers, additional tools that install virtual devices often come along. Don’t be surprised: these virtual devices are there to provide additional functionality for your hardware.

What are virtual devices?

In videoconferencing and streaming, a virtual device such as a virtual microphone or camera is an acquisition device that is emulated in software and exposed by the operating system of a computer.

The term “virtual” refers to the fact that the operating system detects the presence of a device, for example a camera, that does not physically exist.

For example, once installed through an application, a “virtual camera” can be selected as an image source in any video processing software, including video conferencing clients such as MS Teams, Zoom, Google Meet or Rainbow.

Virtual devices are not limited to acquisition: they can be used as output devices too.

There are multiple use cases for virtual devices:

Virtual audio input: A virtual microphone can be used to add special effects, such as noise reduction or echo cancellation
Virtual video input: A virtual camera can be used to apply a filter, such as green screen, to optimize the grain/contrast of the image or to mix several video sources together to produce a scene.
Virtual audio output: A virtual speaker can be used to add virtual instruments or sound effects to a track, or to preview audio before it is exported or published.

Many software programs create virtual devices, such as OBS, ManyCam and BlackHole. Most of them are well known to streamers. But applications such as Microsoft Teams install virtual devices too (microphone and speaker).

Note: They can be seen as an audio and video pipeline, like the ones you could build using Insertable Streams or the Audio API.

Checking devices at system level

On macOS, I can list and select the audio devices to use in two main places:

In the tabbed Sound preferences panel: Here, I found a list of all input and output devices. Virtual devices are listed, except those that seem to come from applications such as Microsoft Teams or ManyCam.

In the Audio MIDI Setup application provided by Apple: This application lists all the audio input and output devices and allows you to modify certain parameters: number of channels, format, mute state, volume, etc. For example, it is an easy way to check whether a device records in stereo or in mono.

For video, I didn’t find anything built in at OS level to check or select the default camera I want to use. The camera is selected at application level.

Note: These tools can be used to check that devices (virtual or not) are correctly recognized by the browser.

Checking devices at browser level

To do that, I took the default “Choose camera, microphone and speaker” sample from the WebRTC Samples and checked the list obtained.

On my computer, I listed the following devices:

Microphones: 10 devices found, including 7 virtual microphones
Cameras: 2 devices found, including 1 virtual camera
Speakers: 10 devices found, including 5 virtual speakers + 2 ‘special’ ones, which are the speakers from my screens

Oops! More than 50% of my devices are not the ones I want to use at first sight…
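For reference, per-kind counts like the ones above can be derived from the result of enumerateDevices. This is only a sketch: countDevicesByKind is a hypothetical helper, and the three kind values are the ones defined by the standard API.

```javascript
// Count devices per kind from an enumerateDevices()-style array.
function countDevicesByKind(devices) {
  const counts = { audioinput: 0, videoinput: 0, audiooutput: 0 };
  for (const device of devices) {
    counts[device.kind] = (counts[device.kind] ?? 0) + 1;
  }
  return counts;
}

// Browser-only usage, not executed here.
async function logDeviceCounts() {
  const devices = await navigator.mediaDevices.enumerateDevices();
  console.log(countDevicesByKind(devices));
}
```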

When can virtual devices cause problems?

If I mistakenly select a virtual microphone not provided by the application I want to use, then in 99% of cases (if not 100%) my recipient will not hear me.

Conversely, if I select a virtual speaker, I will not hear my recipient.

For video, the problem is different: I will see immediately that the video is not the right one and will therefore change it.

So detecting virtual devices on behalf of users can be useful, for example to display an alert when they select the wrong device.

The question is: “Am I able to detect the use of a virtual device?”

Detecting a virtual microphone

In this section, I try several ways to detect virtual microphones.

Using the device label or track label

From the list of devices returned by the enumerateDevices function, the name of a device can be read from its label field. Don’t forget to ask for permission first: without it, this field is empty.

The following words can be used to detect virtual devices:

Virtual: To detect any virtual devices
Teams: Virtual microphone used by Microsoft Teams. Can be input (microphone) and output (speaker)
Manycam, BlackHole, Rode, …: Virtual microphone coming from other streaming tools or from microphone vendors (Each driver can install a virtual device)

There is no official way to name a virtual device, nor a specific field to filter on to find them. But in most cases, searching for the word “Virtual” seems good enough.

If you have the MediaStreamTrack, an equivalent method is to look at its label field.

Alternatively, from the track, you can get the associated device by calling the getSettings function and then looking at the deviceId field.
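The label check described above could be sketched as follows. The keyword list is taken from this article; the function name isProbablyVirtual is hypothetical, and the match is deliberately naive (a real microphone from a vendor in the list would also match).

```javascript
// Keywords from the list above; extend it for your own environment.
const VIRTUAL_KEYWORDS = ["virtual", "teams", "manycam", "blackhole", "rode"];

// Returns true when the label contains one of the keywords (case-insensitive).
function isProbablyVirtual(label) {
  if (!label) {
    return false; // empty label: permission has probably not been granted yet
  }
  const lowered = label.toLowerCase();
  return VIRTUAL_KEYWORDS.some((keyword) => lowered.includes(keyword));
}
```

The same function can be applied to a device label (device.label) or to a track label (track.label).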

Using the statistics

The getStats API offers a way to check whether the device in use is a virtual microphone.

In that case, the totalAudioEnergy property from the RTCAudioSourceStats report stays at zero.

With a real device, the value of totalAudioEnergy increases even when there is only a little background noise; here, the value doesn’t change.

Note: Be careful if the device is physically muted. totalAudioEnergy will be equal to 0 in that case too.
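A minimal sketch of this check, assuming an existing RTCPeerConnection named peerConnection (hypothetical) that is already sending the microphone track. The helper names are also hypothetical; the "media-source" report type and the totalAudioEnergy property come from the WebRTC statistics API.

```javascript
// Read totalAudioEnergy from the audio "media-source" entry of a stats report.
// `report` is Map-like (an RTCStatsReport in the browser).
function readTotalAudioEnergy(report) {
  for (const stats of report.values()) {
    if (stats.type === "media-source" && stats.kind === "audio") {
      return stats.totalAudioEnergy ?? null;
    }
  }
  return null;
}

// True when the energy did not move between two samples.
function isEnergyFlat(previousEnergy, currentEnergy) {
  return previousEnergy !== null
    && currentEnergy !== null
    && currentEnergy === previousEnergy;
}

// Browser-only usage, not executed here: take two samples one second apart.
async function checkMicrophone(peerConnection) {
  const first = readTotalAudioEnergy(await peerConnection.getStats());
  await new Promise((resolve) => setTimeout(resolve, 1000));
  const second = readTotalAudioEnergy(await peerConnection.getStats());
  return isEnergyFlat(first, second);
}
```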

Using the Audio API

The goal is the same here: checking the audio level. But instead of using WebRTC, the same result can be achieved through the Audio API:

const audioContext = new AudioContext();
const analyzer = audioContext.createAnalyser();
analyzer.fftSize = 512;
analyzer.smoothingTimeConstant = 0.1;

// Create a source node from the audio stream (the microphone MediaStream)
const sourceNode = audioContext.createMediaStreamSource(stream);

// Connect the source node to the analyzer
sourceNode.connect(analyzer);

Once the audio pipeline is in place, the second step is to collect the audio volume:

// Loudest frequency bin, in dB (-Infinity when the input is pure silence)
function getMaxVolume(analyzer) {
  const fftBins = new Float32Array(analyzer.frequencyBinCount);
  analyzer.getFloatFrequencyData(fftBins);
  return Math.max(...fftBins);
}

// Square root of the average byte magnitude across bins (0 for silence)
function getLevel(analyzer) {
  const frequencyData = new Uint8Array(analyzer.frequencyBinCount);
  analyzer.getByteFrequencyData(frequencyData);
  const sum = frequencyData.reduce((p, c) => p + c, 0);
  return Math.sqrt(sum / frequencyData.length);
}

// Sample the volume and level every 200 ms
const looper = () => {
  setTimeout(() => {
    const currentVolume = getMaxVolume(analyzer);
    const currentLevel = getLevel(analyzer);
    // Do something with the volume and level computed
    looper(); // schedule the next sample
  }, 200);
};

looper();

For a virtual microphone, volume is equal to Number.NEGATIVE_INFINITY and level is equal to 0, which is not the case for a real device, where the values change all the time.

Note: Be careful if the device is physically muted. volume and level will then be the same as for virtual devices.
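To turn these observations into a decision, one option is to collect a few samples and check that they all stay at the "silent" values. This is only a sketch: looksSilent is a hypothetical helper, and each sample is assumed to hold the volume and level values computed above.

```javascript
// `samples` is an array of { volume, level } objects collected over time.
// Returns true only when every sample carries the "silent" values.
function looksSilent(samples) {
  return samples.length > 0 && samples.every(
    ({ volume, level }) => volume === Number.NEGATIVE_INFINITY && level === 0
  );
}
```

A single non-silent sample is enough to rule out a silent device, so the longer the observation window, the more reliable the result.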

Despite these APIs, there is currently no official way to distinguish real devices from virtual ones. So the best option remains to use the label.

Detecting a virtual camera

Here are some hints for detecting virtual cameras.

Using the device label or track label

This works the same way for virtual cameras.

Additionally, the name OBS can be added to the short-list of terms to check.

Using the settings

With some virtual cameras, the resolution obtained is unusual, so it could be used to spot a virtual camera.

For example, with ManyCam, the resolution obtained is 1536x864, which looks more like a screen resolution than a camera resolution.

But since some applications let you select the resolution to use, you could obtain the same one as with a real camera.

Generally speaking, settings don’t make it possible to distinguish between real and virtual devices.
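For completeness, the resolution hint could be sketched like this. The list of "common" camera resolutions is my own assumption and is not exhaustive, and hasUnusualResolution is a hypothetical helper. As said above, this is a weak signal at best.

```javascript
// Assumed list of common webcam resolutions; adjust it to your own devices.
const COMMON_CAMERA_RESOLUTIONS = new Set([
  "320x240", "640x360", "640x480", "960x540",
  "1280x720", "1920x1080", "3840x2160",
]);

// `settings` is shaped like the result of track.getSettings().
function hasUnusualResolution(settings) {
  const { width, height } = settings;
  if (!width || !height) {
    return false; // unknown resolution: don't flag anything
  }
  return !COMMON_CAMERA_RESOLUTIONS.has(`${width}x${height}`);
}
```

In the browser, it would be called with hasUnusualResolution(track.getSettings()) on the video track.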

Using the statistics

As for virtual audio devices, there are some dedicated statistics that can be used to try to identify the use of a virtual camera.

But here the problem is different, because virtual cameras can behave in different ways: some send a fixed image (logo or default image), others send images in a loop.

In the case of a fixed image, you could check the properties framesSent, framesPerSecond, packetsSent and bytesSent, which will have very low values compared to those of a real camera.

The more the images sent differ, the more difficult it is to distinguish between real and virtual devices.

Note: As usual, there is always a case not covered, which is when I forget to remove the cover from my real camera. Fortunately, in that case, framesPerSecond stays high.
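Because framesSent and bytesSent are cumulative counters, comparing two samples is more telling than reading a single value. Here is a sketch, assuming two "outbound-rtp" video entries taken from successive getStats calls; computeVideoRates is a hypothetical helper, and what counts as "very low" is left to the application.

```javascript
// Compute frame and bit rates between two "outbound-rtp" video samples.
// `timestamp` is expressed in milliseconds in the stats reports.
function computeVideoRates(previous, current) {
  const seconds = (current.timestamp - previous.timestamp) / 1000;
  if (seconds <= 0) {
    return null; // samples are not ordered in time
  }
  return {
    framesPerSecond: (current.framesSent - previous.framesSent) / seconds,
    bitsPerSecond: ((current.bytesSent - previous.bytesSent) * 8) / seconds,
  };
}
```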

Using Insertable Media Processing

Insertable Media Processing can be used on the video stream in the same way as the Audio API on the audio stream.

Here, the objective is to identify either a black video or a part of the video where the pixels are identical across several frames. Usually, the pixels in the center represent the user and so always change. It’s hard to make it look like a video freeze without moving :-)

The first step is to create the video pipeline. An example can be seen in the article Building your own video pipeline.

Once the pipeline is in place, the second step is to try to detect a black screen. A black screen is not necessarily associated with the use of a virtual device, but detecting it could help with the “cover case”.

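Here is a piece of code that checks whether a frame contains no colored pixels. To speed things up, the frame is downscaled and the function is executed in a worker thread.

// Returns true when every channel value is 0 or 255, i.e. no colored pixel was found.
// The alpha channel of an opaque pixel is always 255, which is why 255 is accepted.
// Note that a pure white frame also passes this test.
const isABlackFrame = (frame, ctx) => {
    const { width, height } = ctx.canvas;
    ctx.drawImage(frame, 0, 0, width, height);
    const data = ctx.getImageData(0, 0, width, height).data;
    return data.every(elt => (elt === 0 || elt === 255));
}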

If the frame is not a black frame, the last step is to check if pixels are the same as those of the previous frame by comparing successive frames.

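There are several ways to do that; here is mine, where I compare the pixel data of two ImageData objects using JSON.stringify.

// Serialize the pixel values: an ImageData object has no enumerable own
// properties, so passing it directly to JSON.stringify would always give "{}".
const toComparable = (imageData) => JSON.stringify(Array.from(imageData.data));

const isSameImageData = (referenceImageData, currentImageData) => {
  if (!referenceImageData) {
    return false;
  }

  const comparableReferenceImg = toComparable(referenceImageData);
  const comparableCurrentImg = toComparable(currentImageData);
  return (comparableReferenceImg === comparableCurrentImg);
}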

This can be optimized by keeping the serialized reference image, to avoid repeating the first JSON operation.

By comparing the center of the image, I can deduce that it is a virtual camera when I obtain several duplicate images in a row.
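As an alternative to serializing whole frames, the center comparison can be done directly on the RGBA pixel arrays. This is a sketch under assumptions: isCenterUnchanged is a hypothetical helper, both arrays are assumed to come from getImageData on frames of identical size, and the default region (the middle half of the frame) is an arbitrary choice.

```javascript
// Compare the central region of two RGBA pixel arrays of identical size.
// `fraction` is the share of the width and height covered by the region.
function isCenterUnchanged(previousData, currentData, width, height, fraction = 0.5) {
  const regionWidth = Math.max(1, Math.floor(width * fraction));
  const regionHeight = Math.max(1, Math.floor(height * fraction));
  const startX = Math.floor((width - regionWidth) / 2);
  const startY = Math.floor((height - regionHeight) / 2);

  for (let y = startY; y < startY + regionHeight; y++) {
    // Each pixel takes 4 consecutive values (R, G, B, A).
    const rowStart = (y * width + startX) * 4;
    const rowEnd = rowStart + regionWidth * 4;
    for (let i = rowStart; i < rowEnd; i++) {
      if (previousData[i] !== currentData[i]) {
        return false; // stop at the first differing value
      }
    }
  }
  return true;
}
```

Stopping at the first difference keeps the check cheap for a real camera, where the center changes on almost every frame.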

The conclusion is the same here: there are APIs, but I prefer to keep it simple and look at the label only.

Detecting virtual speakers

For virtual speakers, the problem is different again: only Chrome gives the list of speakers. For the others, the speaker is selected at system level.

Apart from this point, you can analyze the incoming stream using either the statistics API, to check the audioLevel and totalAudioEnergy properties from the RTCInboundRtpStreamStats report, or the Audio API, to build a pipeline equivalent to the one put in place for the microphone.

This can help to detect when no sound is coming from the recipients, but it can’t help to detect a virtual speaker.
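A sketch of the statistics variant for the incoming stream, with hypothetical helper names. It reads the audio "inbound-rtp" entry of a Map-like stats report and considers the incoming audio silent when audioLevel is 0 and totalAudioEnergy did not grow between two samples.

```javascript
// Extract audioLevel and totalAudioEnergy from the audio "inbound-rtp" entry.
function readInboundAudio(report) {
  for (const stats of report.values()) {
    if (stats.type === "inbound-rtp" && stats.kind === "audio") {
      return {
        audioLevel: stats.audioLevel ?? null,
        totalAudioEnergy: stats.totalAudioEnergy ?? null,
      };
    }
  }
  return null;
}

// True when nothing audible was received between the two samples.
function isIncomingAudioSilent(previous, current) {
  return previous !== null
    && current !== null
    && current.audioLevel === 0
    && current.totalAudioEnergy === previous.totalAudioEnergy;
}
```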

So, as for virtual microphones and cameras, only the label can be used (in Chrome).

Note: Selecting speakers is still under standardization: Audio Output Devices API

Do I really need to detect virtual devices?

From my perspective, virtual or not virtual is not the question :-)

All you need in your application is an easy way for users to select and reuse their preferred devices. You can add a hint when proposing devices, to warn them about the virtual ones.

Apart from this point, during the conference, the objective remains to help users understand that they are not heard or cannot hear, whatever the kind of device.

So the APIs shown here and the pieces of code shared will complement other existing algorithms that already make it possible, for example, to detect that a person is talking while the microphone is off.

“Virtual devices” is more a point to keep in mind when analyzing and debugging issues.