Calling WebRTC APIs from your Web application is easy; integrating WebRTC into your codebase is not the same thing…
Even if WebRTC is now a standard, widely used in browsers as well as in desktop and mobile applications, it can’t be compared to the other HTML5 standards that emerged in the same period. WebRTC deals with real-time communication, and so with technologies that go far beyond the browser itself, so integrating it into your existing codebase needs to be thought through.
WebRTC uses HTML DOM elements, interacts with your devices, and communicates with different servers, such as your media and signaling servers, while adapting to your network environment. And that’s not all: as WebRTC puts two or more persons in relation, each of them may impact this living ecosystem and inject information that forces your application to react, for example to the appearance of a participant’s video or to the loss of another participant.
As you can see, WebRTC can be everywhere in your Web application, from your graphical interface down to the low-level modules that deal with the connection to your servers or other network equipment. But is it really a good idea to spread that technology everywhere in your existing application? What can be put in place to integrate WebRTC in a way that does not complicate your codebase, so that you remain able to support it?
This article summarizes some points to keep in mind when integrating WebRTC into an existing application; the same points apply when building a new application from scratch.
One of the apparent difficulties with WebRTC is that the whole API is deeply linked to the navigator object and to key DOM elements such as the <video> or <audio> tags.
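As a minimal illustration of that coupling (assuming the page contains a <video id="local" autoplay playsinline> element, which is a hypothetical setup), even the most basic camera capture mixes the navigator API with the DOM:

// Minimal sketch: capture the camera and attach it to a <video> element
async function showLocalCamera() {
  // navigator API on one side...
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });
  // ...DOM element on the other side (assumes <video id="local" autoplay playsinline>)
  document.getElementById("local").srcObject = stream;
}

showLocalCamera().catch((error) => console.error("Unable to access the devices", error));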
The erroneous idea is to put all the WebRTC stuff in your views… By doing that, firstly, you complicate your views, because a lot of logic blocks need to be added, and secondly, if you have dedicated developers who work on the graphical interface only, you force them to cohabit with WebRTC… They will be less happy!
As with every application, the design principle of separation of concerns helps you write a good application architecture.
By moving your WebRTC code into separate modules, you keep that technology apart and avoid having to patch the views each time you change it. You are able to build and test it separately, without having to deal with the whole application.
Imagine that your application needs to know whether there is a video or not, in order to display or hide the video element. This information can be obtained by inspecting the content of the MediaStream object.
But instead of using the WebRTC methods and manipulating these objects directly, it is good practice to provide some convenient methods from your WebRTC modules, such as:
let stream; // A WebRTC stream obtained previously and referenced in your data model

const hasVideo = () => {
  return (stream && !!stream.getVideoTracks().length);
};

const getVideo = () => {
  if (!hasVideo()) {
    return;
  }
  return stream.getVideoTracks()[0];
};
This is a very basic example, but all the WebRTC stuff can be hidden behind high-level functions that are easier to use from the other parts of your application, such as the views. This helps simplify the views’ codebase, makes them easier to understand, and avoids duplicating code.
Basically, views need these helpers or accessors to avoid direct access to WebRTC. How to manipulate the WebRTC APIs is not the view’s problem, so the best approach is to offer high-level functions that keep WebRTC internal.
This can be summarized like this: “My views have no idea what WebRTC is, which is good, and should call all the WebRTC stuff from outside”. So keep all direct WebRTC API calls and all the event subscriptions away from the views. Prefer injecting your helper modules into your views.
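To illustrate, here is a minimal sketch of such a helper module (the module name webrtcClient and the event name videoAvailable are hypothetical); the view consumes only high-level functions and events, never WebRTC objects directly:

// webrtcClient.js: hypothetical helper module that keeps WebRTC internal
const listeners = new Map();
let stream = null;

export function on(event, callback) {
  listeners.set(event, callback);
}

export async function startCamera() {
  stream = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });
  if (stream.getVideoTracks().length && listeners.has("videoAvailable")) {
    listeners.get("videoAvailable")(stream);
  }
}

// In the view: no WebRTC knowledge at all
// import { on, startCamera } from "./webrtcClient.js";
// on("videoAvailable", (mediaStream) => { videoElement.srcObject = mediaStream; });
// startCamera();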
Going deeper down that track allows you to gather all the WebRTC stuff into a module or a group of modules. This is good because you can access and develop that part separately from the rest of the application, and offer mocks as long as these parts are not yet finalized.
Relying on a unique module that deals with WebRTC allows you to follow the Single Responsibility Principle: only one module manages the WebRTC stuff and is responsible for triggering actions and handling events.
This module is the foundation, the pillar, of your WebRTC integration. What does that mean?
It means mainly two points:
It is the only module (or group of modules) in the application that uses the WebRTC APIs, meaning the WebRTC methods and events.
It offers a high-level API to the other modules for accessing data, subscribing and listening to changes, and piloting it.
In a simple application, the associated business logic can be handled by this module too, but if the application grows in complexity, the business logic has to be separated from it. For example, the notion of a call and the states of that call, which are manipulated and displayed by the views, can be handled in a separate module that acts as a business store and abstracts the underlying technology (WebRTC in that case) from the top layers of the application. That way, if we want to replace WebRTC by something else (…), all we have to do is replace one module!
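As a minimal sketch (the name callStore and the state names are hypothetical), such a business store could expose call states to the views while only the WebRTC module feeds it:

// callStore.js: hypothetical business store; views read call states, not WebRTC objects
const CALL_STATES = ["idle", "dialing", "active", "ended"];

let state = "idle";
const subscribers = [];

export function getState() {
  return state;
}

export function subscribe(callback) {
  subscribers.push(callback);
}

// Called by the WebRTC module only; WebRTC could be swapped for another technology
export function setState(nextState) {
  if (!CALL_STATES.includes(nextState)) {
    throw new Error(`Unknown call state: ${nextState}`);
  }
  state = nextState;
  subscribers.forEach((callback) => callback(state));
}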
As usual, over-architecting an application is not good either. The complexity of that module should grow with the complexity of the application.
Once you are ready to develop, you can start from scratch by using the WebRTC APIs directly from your application, or you can rely on third-party libraries.
On NPM and GitHub, you will find a lot of them.
Some of these libraries were developed some years ago because of the differences between the WebRTC specifications and the implementations existing in browsers at that time. The WebRTC API was not consistent among browsers; some methods or events were missing or behaved differently, etc. It was a nightmare to have something that worked one day but not the next, or to propose a homogeneous application across browsers… So some skilled WebRTC adopters shared the helpers and libraries they had written to spare other developers the same issues.
So make sure that the library you want to use is still maintained and embraced by a large community of users. Some libraries were developed years ago, when the need for WebRTC helpers was huge. Now that the majority of browsers have implemented the common specifications, you need fewer libraries if you only implement common WebRTC cases.
The other side of the coin is that the WebRTC library you use can hide the WebRTC part completely. Depending on your needs, this can be good or not. It is good because the WebRTC complexity is completely or partially hidden by that library, and as your application relies on it, fewer WebRTC skills are needed. But what happens when there is an issue? There may not be many possibilities to fix it, and it will be difficult to catch up on the missing understanding in that stressful time.
In most cases, the WebRTC adapter library needs to be used. Then, depending on how you implement your signaling part and on how you manipulate the WebRTC ECMAScript objects, third-party libraries can be used to speed up your development and to rely on an already tested codebase that you don’t need to support yourself.
As said in the previous paragraph, one of the most commonly used WebRTC libraries is WebRTC adapter, or adapter.js.
This library is a shim that makes WebRTC homogeneous across browsers.
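Using it is simple: install the webrtc-adapter package and import it once at application startup, before any WebRTC call; it patches the browser objects in place.

// Import once at startup (after: npm install webrtc-adapter); the shim patches the global WebRTC objects
import adapter from "webrtc-adapter";

// The library also exposes what it detected, which is handy for debugging
console.log(`Browser detected: ${adapter.browserDetails.browser} ${adapter.browserDetails.version}`);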
Some years ago, developing a WebRTC application without that library was a tremendous challenge because of the diversity of the WebRTC implementations among browsers.
Presently, my feeling is more nuanced. If you want to stick with the latest WebRTC features, and so the latest APIs, I think you still need to add it, because it bridges the changes delivered by browser evolutions until they are commonly adopted. If your users are on outdated systems, and so on old browser versions, I think you need it too. Finally, if you want a homogeneous, unique way to use the WebRTC APIs across browsers, you still need that library to work around the implementation gaps that remain in some well-known browsers.
If you are only targeting Chrome, using common WebRTC APIs, and making P2P calls, you can code directly without that library. It will work without extra effort. But this is not really a production case.
Starting without the library, to discover how things work, is a good way to learn and to confront yourself with the current WebRTC “state of the art”, meaning the common complicated things and issues. Then introduce it and see how much better it is… Even in 2021, this library continues to be maintained. At the time of writing, the last version was released in March 2020 (v7.7.1).
This is where the problem starts for a Web developer.
For sure, you can use some WebRTC APIs to access the camera and display the video on your Web page. Adding effects and taking photos is easy to do. But if you want to make a WebRTC call to another browser, you have to implement a way to put the users in relation and to exchange the capabilities of the two WebRTC endpoints (the browsers). This is called signaling.
What? This is not part of WebRTC? For sure not! That part is outside the scope of WebRTC because, in 99% of cases, you already have a way to implement your signaling server: for example, if your application already exchanges instant messages between your users, you can already send the information needed to establish the call from a sender to a recipient.
But if you don’t have such a server, you will need to develop it. If you don’t have backend skills, you will need some, to build for example a Node.js server with WebSocket support.
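To give an idea, here is a minimal sketch of such a server using the ws package; it simply relays every signaling message to the other connected clients (a real server would need rooms, authentication, and error handling):

// signaling-server.js: minimal relay sketch (run with: node signaling-server.js, after: npm install ws)
const WebSocket = require("ws");

const server = new WebSocket.Server({ port: 8080 });

server.on("connection", (socket) => {
  socket.on("message", (message) => {
    // Relay the signaling message (SDP or ICE candidate) to every other peer
    server.clients.forEach((client) => {
      if (client !== socket && client.readyState === WebSocket.OPEN) {
        client.send(message.toString());
      }
    });
  });
});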
Then you have two choices. The first is to rely on an established signaling protocol such as SIP or Jingle: Jingle needs XMPP, so if you don’t have an Ejabberd server to handle the XMPP part, SIP is perhaps easier to integrate because you just need to find a good WebRTC SIP JavaScript library. The second is to build your own signaling protocol based on JSON messages. Basically, this is not so complicated for common cases: you have to send and receive the SDP and the ICE candidates, and for an audio or video call between 2 persons this is enough.
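On the client side, the whole protocol can then be reduced to a couple of JSON message types; here is a minimal sketch (the message types sdp and ice are arbitrary names, and the signaling URL is a placeholder):

const signaling = new WebSocket("wss://example.com/signaling"); // placeholder URL
const pc = new RTCPeerConnection();

// Send our ICE candidates to the remote peer through the signaling channel
pc.onicecandidate = ({ candidate }) => {
  if (candidate) {
    signaling.send(JSON.stringify({ type: "ice", candidate }));
  }
};

// Apply the remote SDP and ICE candidates received from the signaling channel
signaling.onmessage = async ({ data }) => {
  const message = JSON.parse(data);
  if (message.type === "sdp") {
    await pc.setRemoteDescription(message.description);
    if (message.description.type === "offer") {
      await pc.setLocalDescription(await pc.createAnswer());
      signaling.send(JSON.stringify({ type: "sdp", description: pc.localDescription }));
    }
  } else if (message.type === "ice") {
    await pc.addIceCandidate(message.candidate);
  }
};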
On my GitHub, I have developed a signaling library called JSONgle, adapted from Jingle but using JSON messages. The library uses Socket.IO, but that transport layer can be replaced by your own. It can be a starting point if you want to explore that path.
Depending on the features offered by your application, the media part can be limited or complete.
If your application only offers P2P communication (with 2 participants), the minimum viable media part is a way to deal with NAT topologies and a way to relay the media when network equipment blocks the direct establishment of the communication between the 2 participants.
For that, you need a STUN server and a TURN server, which can be combined into a single piece of software.
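From the application point of view, these servers are simply declared in the RTCPeerConnection configuration; a sketch (the URLs and credentials below are placeholders for your own deployment):

const pc = new RTCPeerConnection({
  iceServers: [
    // Placeholder servers: replace with your own STUN/TURN deployment
    { urls: "stun:stun.example.com:3478" },
    {
      urls: "turn:turn.example.com:3478",
      username: "myUser",
      credential: "mySecret",
    },
  ],
});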
Deploying and configuring a STUN and TURN server such as Coturn is not so simple, so having some IT skills is highly recommended.
Then, if your application wants to offer calls with more than 2 participants, you need to add a media server. To be quick: you will not develop that part on your own, except if you have time and money :-) Fortunately, a lot of solutions exist; they are described in the next two paragraphs.
You start discovering, if you haven’t already, that your Web skills will not help you build the media part. You need extra skills, and the missing ones are not easy to grab because they come from IT, network protocols, and network security…
As seen in the previous paragraphs, you need a signaling server, most probably a WebRTC media server, and a media relay.
Free or open-source WebRTC solutions exist and can be integrated into your application. But be careful: in a majority of them, the signaling part and the media part are deeply linked together, so your application needs to integrate the whole solution (application side + server side). If you don’t already have a server part, this can help you develop faster. In the other case, it is better to choose the one that best suits your existing architecture.
Some well-known platforms are Janus, Jitsi (Meet + VideoBridge), MediaSoup, Kurento, BBB, and Medooze. Additionally, we can add Freeswitch and Asterisk.
Once you have integrated one of these solutions, the next big thing is to control your solution: monitoring it and knowing when to scale it…
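On the client side at least, WebRTC gives you a starting point with the getStats() API; a minimal sketch that samples some inbound video counters on an existing RTCPeerConnection could look like this:

// Minimal monitoring sketch: poll getStats() on an existing RTCPeerConnection
async function logInboundVideoStats(pc) {
  const report = await pc.getStats();
  report.forEach((stats) => {
    if (stats.type === "inbound-rtp" && stats.kind === "video") {
      console.log(`Packets received: ${stats.packetsReceived}, lost: ${stats.packetsLost}`);
    }
  });
}

// For example, sample every 5 seconds on an existing connection `pc`
// setInterval(() => logInboundVideoStats(pc), 5000);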
In addition to the open-source libraries, you can find paid solutions. In a majority of cases, they provide the infrastructure components (TURN, STUN) that make your application work in 99.99% of cases, meaning even when the users are not directly connected to the Internet but are behind a corporate firewall or network equipment that blocks the traffic. These PaaS solutions also offer audio and video conferences and all the features that come with such a service.
When using these libraries or SDKs, you will often pay for the traffic and services that go through the infrastructure in place.
Choose carefully the solution you want to use, because once it is integrated, the cost of migrating from one solution to another is high.
Mainly, the complexity that these WebRTC Platform-as-a-Service offerings hide revolves around themes such as scalability, security, and availability. Having a media server that runs locally on a development environment is not the same thing as having a hosted service that is monitored, updated, supported, and that works…
As you can imagine, integrating real-time communication into your Web application requires paying attention not only to your JavaScript codebase but to the complete ecosystem where your application runs.
Having more than Web skills is really needed to overcome that challenge, because you have to understand how the signaling and the media flows circulate, and how to deploy and configure components such as your media relay or your media server. A part of your development time will be dedicated to managing non-Web components.
Having a WebRTC development environment in place is a first step.