Signaling in WebRTC

By Olivier Anguenot

Published in dev

December 09, 2023



WebRTC is a powerful technology for creating real-time communication applications for the web and beyond. By design, it is a peer-to-peer technology that establishes a connection between two peers. However, WebRTC provides no signaling mechanism to set up this connection, even though this is a crucial step. This is where you can bring your creativity 😀

In this article, I wanted to try out different ways of doing the signaling part, mainly because I have often said "Yes, you can use … for your signaling part" without having really implemented that technology 😅

And since we don't often talk about the signaling part, this is an opportunity to demystify this layer of WebRTC. The signaling layer has often been left to a few open-source libraries that we're happy to use without worrying too much about how they work. Here, I'm going to implement it from scratch.

For those who want to go straight to the code, you can find the code here on GitHub.

What is the signaling layer?

The signaling layer is the mechanism that allows two peers to exchange the information needed to establish a connection. Its role is described by the WebRTC specifications, but the mechanism itself is deliberately left out of them: it is up to developers to implement it.

Saying that it is an external mechanism doesn't mean that you can implement whatever you want. It means that the browser won't do it for you: it provides the necessary information (the session descriptions and the ICE candidates) but leaves the exchange mechanism to you.

The signaling layer is often confused with the negotiation process. However, they are two different things:

The signaling layer is a generic term for the mechanism by which two peers exchange information. It covers the channel used, the way the peers meet each other (♡), and the protocol and messages exchanged with the server.
The negotiation process is the specific process by which the two peers agree on the parameters of the connection. Here, we are talking about the SDP offer/answer exchange defined by JSEP (JavaScript Session Establishment Protocol), which uses the signaling layer to carry this information. This is the part covered by the WebRTC specifications.

The negotiation process is therefore only part of the signaling layer.

Note: Even though WebRTC is a peer-to-peer technology, the connection is not always between two end users. It can be a connection between a peer and a server (a media server, for example). For the signaling layer, this makes no difference: the server acts as a "peer" to exchange information and establish the connection.

The 4 pillars of the signaling layer

The signaling layer is composed of 4 parts:

The signaling channel, which defines the transport mechanism used to connect the peers and exchange the messages. It can be WebSocket, HTTP, MQTT…
The signaling protocol, which defines the format of the messages exchanged between the two peers. It can be SIP, XMPP…
The session establishment protocol, which defines the negotiation process: the SDP offer/answer exchange and the ICE candidates.
The application signaling, which is optional but often used to exchange application-specific messages linked to the communication. It can carry basic information such as the user's name or the room name, or more complex information by overriding or extending the session protocol or the signaling protocol.

Most of the time, this signaling layer is integrated into your application's existing messaging infrastructure. Of these parts, only the session establishment protocol is specific to WebRTC; the others are not.

But there is a hidden pillar behind the signaling layer: the signaling server. This is the server that manages the signaling layer.

Because yes, you need a server to handle the signaling layer: two peers need a server to find each other before they can connect. This may be a dedicated server or one already used by your application.

This server connects the peers and transmits the messages that need to be exchanged between them.

Note: To test WebRTC, you can connect two peers running in the same page directly, without a server. But in real life, you need a server to handle the signaling layer.

How to create a signaling layer?

To make WebRTC calls, as you have seen, you need a signaling layer. How do you create it?

As we saw in the previous section, you will need to implement several things and put them together:

(1) First of all, you need to choose the signaling channel and the signaling protocol. All kinds of technologies can be used to implement them, so they can easily be integrated into your existing application.
(2) Next, you need to implement the session establishment protocol (ICE/SDP exchange). This is the most complex part.
(3) Finally, you need to implement the application signaling if required.

Since a server is required to manage the signaling layer, it is not immediate. To play with WebRTC, you can omit the server and perform the signaling directly in the page. But in a real application, you will need this server. The minimum viable server is one that takes the messages from one peer and sends them to the other peer, just like a chat application.

So, there is no real complexity on the server side. The complexity lies on the client side, as the JSEP specification defines several messages that need to be exchanged in a defined order. And since the peer connection has an internal state machine, you have to process the messages in the right order and at the right time.

The consequence is that you have to rely mainly on asynchronous requests and events: at any moment, you can receive a message from the other peer, and you need to process it.

As you can imagine, this often involves a persistent connection between the client and the server, or a way to reach any user at any time (such as Push Notifications).
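The "right order" is enforced by the signaling state of the RTCPeerConnection (`stable`, `have-local-offer`, `have-remote-offer`, ...): for example, applying a remote answer while in the `stable` state makes `setRemoteDescription()` fail with an InvalidStateError. The simplified model below, which ignores provisional answers and rollback, is only an illustration of that constraint, not the browser's implementation:

```javascript
// Simplified model of the RTCPeerConnection signaling state machine
// (offer/answer only: no pranswer, no rollback). Key: "state|action" -> next state.
const transitions = {
  "stable|setLocal:offer": "have-local-offer",
  "stable|setRemote:offer": "have-remote-offer",
  "have-local-offer|setRemote:answer": "stable",
  "have-remote-offer|setLocal:answer": "stable",
};

// Return the next state, or throw if the action is not allowed in this state
const nextSignalingState = (state, action) => {
  const next = transitions[`${state}|${action}`];
  if (!next) {
    throw new Error(`Invalid transition: ${action} in state ${state}`);
  }
  return next;
};

// Caller side: send an offer, then receive the answer
let callerState = "stable";
callerState = nextSignalingState(callerState, "setLocal:offer");
callerState = nextSignalingState(callerState, "setRemote:answer");
console.log(callerState); // stable

// An answer processed before any offer was sent is rejected
try {
  nextSignalingState("stable", "setRemote:answer");
} catch (err) {
  console.log(err.message); // Invalid transition: setRemote:answer in state stable
}
```

This is why the client has to treat every incoming message as an event that may or may not be valid at that moment.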

Building your own signaling layer

In this part, we will see how to build our own signaling layer.

We will rely on the following stack:

Server: Node.js, which will handle the different channels (WebSocket, SSE, Stream API).
Client: JavaScript, for sure! No dependency or open-source library. Pure vanilla! We will implement three different kinds of channels: WebSocket, SSE and Stream API. JSON will be used for the signaling protocol and JSEP for the session establishment protocol.

Note: As this is a tutorial, there is no application signaling, authentication or security mechanism.

So let’s start with the first implementation using WebSocket.

Building the signaling server

The server part is kept to the minimum. It is a WebSocket server, built with Express and the ws package, that handles the signaling channel. It does not interpret the signaling protocol.

It accepts the client connections and relays each message to the other connected clients.

Here is the code:

const express = require('express');
const http = require('http');
const WebSocket = require('ws');

const app = express();
const server = http.createServer(app);
const wss = new WebSocket.Server({ server });

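wss.on('connection', (ws) => {
  // Handle incoming messages from clients.
  // With ws v8+, `message` is a Buffer, so it is relayed as a binary frame
  // (the browser then receives a Blob).
  ws.on('message', (message) => {
    // Relay the message to every other connected client (no rooms, no routing on "to")
    wss.clients.forEach((client) => {
      if (client !== ws && client.readyState === WebSocket.OPEN) {
        client.send(message);
      }
    });
  });
});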

// Start the server.
const PORT = process.env.PORT || 3001;
server.listen(PORT, () => {
  console.log(`Server is listening on port ${PORT}`);
});

Note: For simplicity, the server does not handle rooms. So don’t try to connect more than two clients.
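To lift this limitation, the server can route each message according to its `to` field instead of broadcasting it. Below is a minimal sketch of such a routing table, assuming each client announces its name when it connects (for example through a hypothetical `?name=alice` query parameter in the WebSocket URL; the ws `connection` event passes the HTTP request as its second argument). `createRouter` and its methods are illustrative names, not part of the sample:

```javascript
// Hypothetical routing table: user name -> connection. Any object with a send()
// method works, for example a WebSocket from the ws package.
const createRouter = () => {
  const peers = new Map();

  return {
    register: (name, connection) => {
      peers.set(name, connection);
    },
    unregister: (name) => {
      peers.delete(name);
    },
    // Forward a raw envelope ({data, to, from}) to its recipient only.
    // Returns false when the recipient is unknown (not connected yet, or gone).
    route: (raw) => {
      const envelope = JSON.parse(raw.toString());
      const target = peers.get(envelope.to);
      if (!target) {
        return false;
      }
      target.send(JSON.stringify(envelope));
      return true;
    },
  };
};

// Usage with a fake connection that records what it receives
const aliceInbox = [];
const router = createRouter();
router.register("alice", {send: (text) => aliceInbox.push(text)});
router.route(JSON.stringify({to: "alice", from: "bob", data: {type: "offer", sdp: "..."}}));
router.route(JSON.stringify({to: "carol", from: "bob", data: {}})); // dropped: carol is not registered
console.log(aliceInbox.length); // 1
```

With ws, `register` would be called in the `connection` handler, `route` in the `message` handler and `unregister` on `close`. Note that forwarding a string sends a text frame, so the browser receives a string rather than a Blob, and the client has to accept both.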

Building the client using a WebSocket transport

Main principles

As the client-side code is more complex, the aim will be to divide each part into a separate block, giving each part a unique (single) responsibility. This will make the code more readable and maintainable.

Here, as it is a very simple example, it is not necessary. But feel free to use this code as a skeleton for your own application.

The proposal is to separate the different concepts:

1) We need the minimum viable HTML page to connect to the server and start a call. In this sample, we will just connect a datachannel, but you can adapt it to connect an audio/video stream.
2) We need to handle the different logical blocks: the transport (to connect to the server), the negotiator (the negotiation process) and the signaling (the aggregator that glues the transport and the negotiator together).
3) We need to decide where to put the PeerConnections. Here, for simplicity, the signaling block will handle them. But it is often better to manage them outside the signaling block (using injection) to keep the media part independent.

Let's have a look at the building blocks first, and then at the HTML page.

The transport block

The transport block is responsible for connecting to the server and for sending and receiving messages.

The interface is pretty simple: We just need some functions to connect, send a message and disconnect.

To keep it agnostic, the transport block interface offers a method to set a callback: each time a message is received, the callback is called with the message as a parameter. This way, the transport block can pass incoming messages back to the signaling block.

So, here is a JavaScript function that creates the transport block using WebSocket:

const buildWebSocketTransport = (name) => {
  let socket = null;
  let callback = null;
  const from = name;

  // Text frames arrive as strings; binary frames (for instance, Buffers relayed
  // by the ws server) arrive as Blobs and must be read first.
  const toText = async (data) => (data instanceof Blob ? data.text() : data);

  return {
    addListener: (cb) => {
      callback = cb;
    },
    connect: (to) => {
      socket = new WebSocket(to);

      // Connection opened
      socket.addEventListener('open', () => {
        console.log(`${name} 'socket' connected`);
      });

      socket.addEventListener('message', async (event) => {
        const message = JSON.parse(await toText(event.data));
        // Send the message back to the signaling layer
        if (callback) {
          callback(message.data, message.from);
        }
      });
    },
    send: (message, to) => {
      // Only send once the socket is open (send() throws while it is still connecting)
      if (socket && socket.readyState === WebSocket.OPEN) {
        socket.send(JSON.stringify({data: message, to, from}));
      }
    },
    disconnect: () => {
      if (socket) {
        socket.close();
        socket = null;
      }
      console.log(`${name} disconnected`);
    },
    name
  };
};

As shown in this sample, each message received is sent back to the signaling block using the callback defined in the addListener function.

Note: This example uses a plain WebSocket and exchanges JSON messages. But you can use libraries such as Socket.IO to enhance this part.
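Because the signaling block only relies on this small interface (`addListener`, `connect`, `send`, `disconnect`, `name`), any object implementing it can replace the WebSocket transport. For instance, a hypothetical in-memory transport (not part of the sample) lets two peers in the same page, or a unit test, exchange signaling messages without any server. A minimal sketch:

```javascript
// Hypothetical in-memory "network": transports created from the same network
// can reach each other by name, with the same interface as buildWebSocketTransport.
const createLoopbackNetwork = () => {
  const members = new Map(); // name -> deliver function

  const buildLoopbackTransport = (name) => {
    let callback = null;
    return {
      name,
      addListener: (cb) => {
        callback = cb;
      },
      // No URL needed: joining the network is enough
      connect: () => {
        members.set(name, (data, from) => {
          if (callback) {
            callback(data, from);
          }
        });
      },
      send: (message, to) => {
        const deliver = members.get(to);
        if (deliver) {
          // Copy through JSON, as a real channel would, and deliver asynchronously
          const copy = JSON.parse(JSON.stringify(message));
          queueMicrotask(() => deliver(copy, name));
        }
      },
      disconnect: () => {
        members.delete(name);
      },
    };
  };

  return {buildLoopbackTransport};
};

// Usage
const {buildLoopbackTransport} = createLoopbackNetwork();
const bob = buildLoopbackTransport("bob");
const alice = buildLoopbackTransport("alice");
bob.connect();
alice.connect();
alice.addListener((message, from) => console.log(from, message.type)); // logs "bob offer" once delivered
bob.send({type: "offer", sdp: "..."}, "alice");
```

Delivering asynchronously keeps the behavior close to a real network, where a message is never received in the same call stack as the one that sent it.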

The negotiator block

The negotiator block is responsible for handling the negotiation process. Here, we need to implement the SDP offer/answer exchange and the ICE candidate exchange.

But in fact, it is simply a state machine: It takes a message, does a negotiation step and if needed, returns a message.

So, here is a JavaScript function that creates the negotiator block:

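const buildNegotiationProcess = () => {
  let callback = null;
  let tmpIces = []; // remote ICE candidates received before the remote description

  // Send a message to the remote peer (through the signaling block), if still listening
  const emit = (message) => {
    if (callback) {
      callback(message);
    }
  };

  // Add the candidates stored while the remote description was missing
  const flushCandidates = async (pc) => {
    const pending = tmpIces;
    tmpIces = [];
    for (const ice of pending) {
      await pc.addIceCandidate(ice);
    }
  };

  return {
    addListener: (cb) => {
      callback = cb;
    },
    process: async (pc, message) => {
      const {type} = message;
      switch (type) {
        case "negotiationneeded":
          await pc.setLocalDescription(await pc.createOffer());
          emit(pc.localDescription.toJSON());
          break;
        case "offer":
          await pc.setRemoteDescription(message);
          await flushCandidates(pc);
          await pc.setLocalDescription(await pc.createAnswer());
          emit(pc.localDescription.toJSON());
          break;
        case "answer":
          await pc.setRemoteDescription(message);
          await flushCandidates(pc);
          break;
        case "candidate":
          // Local candidate gathered: send it, or signal the end of the gathering
          emit(message.candidate ? message.candidate.toJSON() : {type: "endofcandidates"});
          break;
        case "endofcandidates":
          // Remote gathering is complete: add what is still stored, if possible
          if (pc.remoteDescription) {
            await flushCandidates(pc);
          }
          break;
        case "connectionstatechange":
          if (message.state === "failed") {
            pc.restartIce();
          }
          break;
        default:
          // Remote candidate (no "type" field)
          if (message.candidate) {
            if (pc.remoteDescription) {
              await pc.addIceCandidate(message);
            } else {
              tmpIces.push(message);
            }
          }
          break;
      }
    }
  };
};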

Here are some explanations:

The negotiation process starts when the peer connection triggers the negotiationneeded event. This is not a real message, but I have defined one to keep the state machine homogeneous.
When the negotiation process needs to send a message to the other peer, it calls the callback with the message to send.
You may notice two "extra" messages: candidate and endofcandidates. They are necessary to ensure synchronization. For example, if for any reason an ICE candidate is received before the remote description, you can't add it to the peer connection yet: you need to store it and add it later.
Finally, the connectionstatechange message allows the application to restart the ICE process if the connection fails.

This way, the signaling block always calls the negotiator, whatever happens, and the negotiator is responsible for the whole negotiation process. This follows the single responsibility principle.
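The buffering rule for early candidates (the `tmpIces` array) can be exercised outside the browser. Below is a minimal sketch using a hypothetical fake peer connection that only implements `remoteDescription`, `setRemoteDescription()` and `addIceCandidate()`. Like the real RTCPeerConnection, whose `addIceCandidate()` rejects with an InvalidStateError when there is no remote description, it refuses early candidates. The candidate strings are placeholders:

```javascript
// Hypothetical stand-in for RTCPeerConnection, limited to the members used here.
const createFakePeerConnection = () => ({
  remoteDescription: null,
  added: [],
  async setRemoteDescription(description) {
    this.remoteDescription = description;
  },
  async addIceCandidate(candidate) {
    if (!this.remoteDescription) {
      throw new Error("InvalidStateError: no remote description");
    }
    this.added.push(candidate.candidate);
  },
});

// Store remote candidates until the remote description is set, then flush them in order
const createCandidateBuffer = (pc) => {
  let pending = [];
  return {
    add: async (candidate) => {
      if (pc.remoteDescription) {
        await pc.addIceCandidate(candidate);
      } else {
        pending.push(candidate);
      }
    },
    flush: async () => {
      const toAdd = pending;
      pending = [];
      for (const candidate of toAdd) {
        await pc.addIceCandidate(candidate);
      }
    },
    size: () => pending.length,
  };
};

// Usage: the first candidate arrives before the offer, the second one after it
const demo = async () => {
  const pc = createFakePeerConnection();
  const buffer = createCandidateBuffer(pc);
  await buffer.add({candidate: "cand-A"}); // too early: stored
  await pc.setRemoteDescription({type: "offer", sdp: "..."});
  await buffer.flush();                    // cand-A is added now
  await buffer.add({candidate: "cand-B"}); // added directly
  return pc.added;
};

demo().then((added) => console.log(added)); // [ 'cand-A', 'cand-B' ]
```

Flushing right after the remote description is applied, rather than waiting for another message, avoids keeping candidates around longer than necessary.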

The signaling block

The signaling block is the link between the transport block and the negotiator block: it takes messages from one side and sends them to the other side, and vice versa.

It also exposes three functions:

The first two create and release the peer connection. As mentioned above, peer connections can be managed outside the signaling block, but here, for simplicity's sake, we manage them inside it.
The last one allows the application to add a datachannel to the peer connection. Here, we use a datachannel to exchange messages, but you can easily adapt it to connect an audio/video stream.

So, here is a JavaScript function that creates the signaling block:

const buildSignalingLayer = (transport, negotiator) => {
  let pc = null;
  let dt = null;
  let tp = transport;
  let np = negotiator;
  let to = null;
  // Keep the name: datachannel events may still fire after close() released the transport
  const name = transport.name;

  // Run a negotiation step; log errors instead of leaving rejected promises unhandled
  const negotiate = (message) => {
    if (pc && np) {
      np.process(pc, message).catch((err) => console.error(`${name} negotiation error:`, err));
    }
  };

  // Log the lifecycle of a datachannel (created locally or received)
  const watchDataChannel = (channel) => {
    channel.onopen = () => console.log(`${name} 'datachannel' opened`);
    channel.onclose = () => console.log(`${name} 'datachannel' closed`);
  };

  // Handle incoming messages from the transport layer and send them to the negotiator (for the local peer)
  tp.addListener((message, from) => {
    if (!to) {
      to = from;
    }
    console.log("TP <-- message:", message, from);
    negotiate(message);
  });

  // Handle messages from the negotiator and send them to the transport layer (for the remote peer)
  np.addListener((message) => {
    console.log("TP --> message:", message);
    if (tp) {
      tp.send(message, to);
    }
  });

  return {
    createPeerConnection: (config) => {
      pc = new RTCPeerConnection(config || {});
      pc.onnegotiationneeded = () => {
        negotiate({type: "negotiationneeded"});
      };
      pc.onicecandidate = (event) => {
        negotiate({type: "candidate", candidate: event.candidate});
      };
      pc.onconnectionstatechange = (event) => {
        const state = event.target.connectionState;
        console.log(`${name} 'onconnectionstatechange' to ${state}`);
        negotiate({type: "connectionstatechange", state});
      };
      // Callee side: keep a reference to the channel created by the remote peer
      pc.ondatachannel = (event) => {
        console.log(`${name} 'datachannel' received`);
        dt = event.channel;
        watchDataChannel(dt);
      };
    },

    close: () => {
      if (dt) {
        dt.close();
        dt = null;
      }
      if (pc) {
        pc.close();
        pc = null;
      }
      // Remove the listeners and the references to the transport and the negotiator
      if (np) {
        np.addListener(null);
        np = null;
      }
      if (tp) {
        tp.addListener(null);
        tp = null;
      }
    },

    addDataChannel: (label, options, recipient) => {
      to = recipient;
      dt = pc.createDataChannel(label, options);
      watchDataChannel(dt);
    }
  };
};

Some points to notice:

The transport and the negotiator are injected into the signaling block, which keeps the signaling block agnostic.
The signaling block uses the listeners exposed by the transport and the negotiator to pass messages between the two entities.
As the signaling block manages the peer connection, it listens to its events and forwards them to the negotiator.
As soon as the datachannel is added to the peer connection, the negotiationneeded event is fired, and the negotiation process starts.
If the connection fails, the negotiator calls the restartIce() function, which restarts the ICE process by triggering the negotiationneeded event.

The HTML page

The HTML page is pretty simple: some buttons connect and disconnect the users to and from the server using the transport; the others create and release the peer connections and the datachannel.

Here is the HTML part:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>WebRTC Signaling</title>
</head>
<body>
  <h1>WebRTC Signaling</h1>
  <div>
    <button id="connectBobBtn">Connect Bob</button>
    <button id="connectAliceBtn">Connect Alice</button>
    <button id="disconnectBtn">Disconnect both</button>
    <button id="startBtn">Bob call Alice</button>
    <button id="endBtn">End call</button>
  </div>
  <script src="signaling.js"></script>
</body>
</html>

And the JavaScript part:

// signaling.js
'use strict';

// Paste the buildWebSocketTransport, buildNegotiationProcess and buildSignalingLayer
// functions shown above here. (The GitHub sample keeps them in separate ES modules
// and imports them instead, together with a selector to switch between transports.)

// The signaling server above listens on port 3001 by default
const SIGNALING_URL = "ws://localhost:3001";

// Main function called once the page is loaded
const ready = () => {
  // variables
  let transport1 = null; // Bob
  let transport2 = null; // Alice
  let signaling1 = null;
  let signaling2 = null;

  // buttons
  const connectBobBtn = document.getElementById("connectBobBtn");
  const connectAliceBtn = document.getElementById("connectAliceBtn");
  const disconnectBtn = document.getElementById("disconnectBtn");
  const startBtn = document.getElementById("startBtn");
  const endBtn = document.getElementById("endBtn");

  connectBobBtn.addEventListener("click", () => {
    // Connect Bob to the server
    transport1 = buildWebSocketTransport("bob");
    transport1.connect(SIGNALING_URL);
  });

  connectAliceBtn.addEventListener("click", () => {
    // Connect Alice to the server
    transport2 = buildWebSocketTransport("alice");
    transport2.connect(SIGNALING_URL);
  });

  disconnectBtn.addEventListener("click", () => {
    // Disconnect both
    if (transport1) {
      transport1.disconnect();
      transport1 = null;
    }
    if (transport2) {
      transport2.disconnect();
      transport2 = null;
    }
  });

  startBtn.addEventListener("click", () => {
    if (!transport1 || !transport2) {
      console.warn("Connect Bob and Alice first");
      return;
    }
    // create signaling for bob
    signaling1 = buildSignalingLayer(transport1, buildNegotiationProcess());
    signaling1.createPeerConnection();

    // create signaling for alice
    signaling2 = buildSignalingLayer(transport2, buildNegotiationProcess());
    signaling2.createPeerConnection();

    // Bob opens a datachannel to Alice: this triggers the negotiation
    signaling1.addDataChannel("aChannel", {}, "alice");
  });

  endBtn.addEventListener("click", () => {
    if (signaling1) {
      signaling1.close();
      signaling1 = null;
    }
    if (signaling2) {
      signaling2.close();
      signaling2 = null;
    }
  });
};

document.addEventListener("DOMContentLoaded", ready);

Alternative transport: Using SSE

In the previous example, we used WebSocket as the transport. But you can use any other transport. One option is Server-Sent Events (SSE).

SSE is a technology that allows the server to send events to the client. It is a one-way communication channel. It is often used to push notifications to the client.

So here, the server has to implement a REST API to receive the messages from the clients, and handle the SSE connections to send asynchronous messages to the clients.

Compared to WebSocket, SSE is a simpler technology: events are sent over plain HTTP, with no specific protocol. But it is not bidirectional.

Here, Bob can send a message to Alice using the following REST API:

// POST /signaling
{
  "to": "alice",
  "data": {
    "type": "offer",
    "sdp": "..."
  }
}

And the server forwards the event to Alice using SSE:

// SSE to alice
event: signaling
data: {"type": "offer", "sdp": "..."}

Note: Even though SSE offers nice features such as automatic reconnection and event IDs, it only carries text messages. So, it is mainly used to send short pieces of information to the client. But you can use it to send SDP offers/answers and ICE candidates, serialized as JSON (which escapes the line breaks of the SDP).

Concretely, the main difference is on the server side, where you need to put two interfaces in place: the SSE handler and the REST API. In return, it is easier to add more features (such as authentication, authorization, metrics, openness) on top of this REST API.
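On the server side, the SSE handler and the REST API share one piece of state: which SSE stream belongs to which user. Below is a minimal sketch of that relay logic, assuming an in-memory map and the `signaling` event name used in the example above. `createSseRelay` is a hypothetical helper; with Node's http module, `res` would be the response of the SSE endpoint, sent with the `text/event-stream` content type. On the browser side, `EventSource` parses the stream for you; since the events are named, they are received with `source.addEventListener("signaling", ...)` rather than `onmessage`.

```javascript
// Hypothetical in-memory registry of open SSE streams: user name -> response object
// (anything with a write() method, such as Node's http.ServerResponse).
const createSseRelay = () => {
  const streams = new Map();

  // One SSE event. JSON.stringify escapes the CRLF line breaks of the SDP,
  // so the whole payload fits on a single "data:" line.
  const formatEvent = (payload) => `event: signaling\ndata: ${JSON.stringify(payload)}\n\n`;

  return {
    // Called by the SSE handler when a user opens its event stream
    subscribe: (name, res) => {
      streams.set(name, res);
    },
    unsubscribe: (name) => {
      streams.delete(name);
    },
    // Called by the POST /signaling handler with the parsed body {to, data}
    relay: ({to, data}) => {
      const res = streams.get(to);
      if (!res) {
        return false; // the recipient has no open SSE stream
      }
      res.write(formatEvent(data));
      return true;
    },
  };
};

// Usage with a fake response that records what would be written on the wire
const written = [];
const relay = createSseRelay();
relay.subscribe("alice", {write: (chunk) => written.push(chunk)});
relay.relay({to: "alice", data: {type: "offer", sdp: "v=0\r\no=- 0 0 IN IP4 127.0.0.1\r\n"}});
console.log(written[0].split("\n").filter((line) => line.startsWith("data:")).length); // 1
```

Returning `false` for an unknown recipient lets the REST handler answer with an error status instead of silently dropping the message.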

Alternative transport: Using Stream API

If SSE doesn’t meet your needs, another solution is to use the Stream API.

The Stream API (the Streams API, used here to read the body of a fetch() response) lets the client consume data that the server keeps sending over a long-lived HTTP response. Like SSE, it is a one-way communication channel. It is often used to send large amounts of data to the client, or when the server takes time to answer.

The procedure is the same as for SSE: The server has to implement a REST API to receive the messages from the clients and handle the stream connection to send asynchronous messages to the clients.

The main difference is on the client side: the client needs to handle the stream itself, which is a bit more complex than with SSE.

Here is the part of the code that handles the stream:

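// Inside an async function: `response` is the fetch() response of the streaming endpoint.
// As the response from the server is a stream, we need a reader
const reader = response.body.getReader();

// Then, we need to decode the bytes into text
const decoder = new TextDecoder();
let pending = ""; // incomplete message waiting for the rest of its bytes
while (true) {
  const {done, value} = await reader.read();
  if (done) {
    break;
  }
  // {stream: true} keeps multi-byte characters split across chunks intact
  pending += decoder.decode(value, {stream: true});

  // Split the text into messages (one JSON message per line); the last part may be incomplete
  const messages = pending.split("\n");
  pending = messages.pop();
  for (const message of messages) {
    if (message.length > 0 && callback) {
      // Send the message back to the signaling layer
      callback(JSON.parse(message));
    }
  }
}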

As you can see, there is an extra step here because the server can send several signaling messages in the same chunk 😱…
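Chunks are not aligned with messages in either direction: a chunk may also end in the middle of a message, or even in the middle of a multi-byte UTF-8 character. To check these cases without a server, the splitting logic can be isolated in a small function and fed hand-made chunks. A minimal sketch, assuming newline-delimited JSON messages as above; `createNdjsonSplitter` is a hypothetical helper name:

```javascript
// Turn a sequence of byte chunks into newline-delimited JSON messages.
// The incomplete tail is kept until the next chunk, and decoding runs in
// streaming mode so a character split across two chunks is not corrupted.
const createNdjsonSplitter = (onMessage) => {
  const decoder = new TextDecoder();
  let buffer = "";
  return (chunk) => {
    buffer += decoder.decode(chunk, {stream: true});
    const lines = buffer.split("\n");
    buffer = lines.pop(); // "" when the chunk ended exactly on a newline
    for (const line of lines) {
      if (line.trim().length > 0) {
        onMessage(JSON.parse(line));
      }
    }
  };
};

// Two messages cut at arbitrary byte positions: the first cut falls inside
// the "é" (2 bytes in UTF-8), the second one inside the second message.
const bytes = new TextEncoder().encode('{"type":"answer","sdp":"é"}\n{"candidate":"c1"}\n');
const received = [];
const push = createNdjsonSplitter((message) => received.push(message));
push(bytes.slice(0, 25));
push(bytes.slice(25, 40));
push(bytes.slice(40));
console.log(received.length); // 2
```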

Other alternative transports

There's another alternative: you can use MQTT to handle the signaling layer. It is a lightweight publish/subscribe protocol, often used in IoT applications.

Using a broker as the signaling server, you can easily connect two MQTT clients.

I haven't added it to the sample, but perhaps it will be the subject of another article and an enrichment of this sample.

And, of course, there are many other possibilities. This overview is by no means exhaustive.

Alternative protocols

In the previous examples, we used JSON as the signaling protocol. But you can use any other protocol.

The most popular protocols are:

SIP, a protocol used for VoIP, mainly in telephony applications and to link WebRTC with equipment such as SBCs or IP-PBXs.
XMPP, a protocol used for instant messaging, mainly in chat applications. Extensions such as Jingle can be used to handle WebRTC sessions.

As you can see, there are a large number of possibilities here too: It is just a matter of encoding/decoding the messages.

Some years ago, I built a signaling server, JSONgle Server, and a signaling client, JSONgle, which tried to separate the channel layer (any) from the protocol layer (a simplified version of Jingle revisited in JSON). I used them to handle multiple P2P calls in a telehealth application.

Conclusion

In this article, we looked at how to build a signaling layer from scratch. In the end, it is not that complex.

The question now is: Should you build your own signaling layer or use an open-source library?

The good news is that if you want to use an open-source WebRTC server, it often comes with an SDK or a library that handles the signaling layer. So, you don’t need to build it yourself.

The same applies if you use a CPaaS (Communication Platform as a Service) such as Twilio, Rainbow (the one I work for), or any other platform that offers video conferencing or streaming services: the signaling layer is hidden behind the SDK and the API used.

But if you want to do it all yourself, you can use this article as a starting point. You will see that it gradually becomes more complex as you handle new cases (hold/retrieve, mute/unmute, multiple calls). This is because you have to send new messages through your signaling layer to stay in sync with what's happening in the media part.

Any doubts? Just ask!