Written by Julian Spittka, Chris Owen, Dusan Stevanovic, Raphael Robert
Some of us have been striving to build the perfect conferencing system since the beginning of our careers. But what exactly makes such a “perfect” system? Besides high quality media and low latency, which are the basis for any Internet communication system, there are 3 elements:
End-to-end encryption is state of the art for providing best-in-class security, where no media can be decrypted on an intermediary server for mixing or other modifications.
Traditionally, conferencing was done on a server. All clients would connect to a conference room (i.e. a server), which would mix an audio signal and/or generate a video signal for every participant and forward those to all endpoints.
The obvious disadvantage, which makes this traditional way of doing conferencing unacceptable today, is that media data is only encrypted in transit but completely open on the server. Whoever controls the server has access to the content of every conference that takes place on that server, as the service is NOT end-to-end encrypted. Unfortunately, this practice is still very common in many services today.
In the past, this approach was justified by offering better scalability. Every endpoint only needs one upload and one download media stream, and the number of participants per conference is purely limited by the CPU and network capabilities of the server. Today – with clients becoming more capable – this argument no longer holds.
A few of us had the privilege to work for Skype in the past. Skype tried to solve the above problem by using one of the participants' devices as a “server”. That way, the server was owned by a trusted entity and decoded media data would never be available to anyone outside the trusted circle. This worked for a while, but wasn’t a scalable solution, as devices would frequently be overloaded and calls would drop entirely whenever the hosting device left the call.
When Wire introduced the first version of an end-to-end encrypted conferencing system in early 2017, we tried to overcome the above problem by making every conference participant’s device its own “server”. Devices in a group call essentially set up an encrypted 1:1 call to every remote device and build a full mesh between all participants. As a result, this system was end-to-end encrypted and didn’t depend on a single device’s availability. On the flip side, it turned out to be a challenge to maintain all legs of the full mesh: a 10-participant call, for example, has to maintain 45 1:1 calls in total, one per pair of participants.
Also, as Wire uses standard WebRTC to offer calling inside browsers, the possibilities for optimizing resources were limited. Every 1:1 call to a remote participant would require its own encoder and decoder and allocate its own network resources. As a result, CPU load on devices could quickly become very high, and the network upload bandwidth frequently reached its limits. This solution worked well and reliably for smaller groups, but it became evident that we would run into scalability issues in the future.
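To make the scaling problem concrete, the following minimal sketch counts the legs of a full mesh and the encoders each device has to run. The function names are illustrative only and are not part of any Wire API.

```python
def mesh_call_legs(participants: int) -> int:
    """Total number of 1:1 calls in a full mesh: one per unordered pair."""
    return participants * (participants - 1) // 2


def mesh_encoders_per_device(participants: int) -> int:
    """In a full mesh, each device encodes once for every remote participant."""
    return participants - 1


if __name__ == "__main__":
    for n in (3, 5, 10, 20):
        print(
            f"{n} participants: {mesh_call_legs(n)} legs in total, "
            f"{mesh_encoders_per_device(n)} encoders per device"
        )
```

The total number of legs grows quadratically with the number of participants, while the per-device cost grows linearly. This is why the approach held up for small groups only.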
Therefore, we immediately started working on the next generation of a scalable conferencing platform. Instead of setting up a 1:1 call to every remote participant the new solution reintroduces a central server. Every device in a conference call connects to this server. All media data is only encoded and encrypted once for the group and then sent to the server, instead of being encrypted individually per participant. The server then forwards the received media packets to all remote participants without the need to decrypt them. We needed to look into standard ways of accomplishing this using WebRTC. After some tedious R&D we were able to find a workable solution where the server has no knowledge of any encryption keys. This solves most of our requirements. The solution is:
Despite all those advantages, there were still a few challenges we had to solve:
We want to prevent the calling server from being able to collect metadata on users and their calls. Every Wire user is represented by a unique identifier, the user ID. Since this user ID is long-lived, we avoid using it in the context of calling, where it would make data aggregation and tracking across calls all too easy. Instead, we generate a random value for each call, share it among the devices and use it to generate the ephemeral conversation and user IDs that are sent to the calling server. This makes it difficult for the server to track a user, as they will have a different ID every time a new call is made. We extend the same concept to the identifier of a conversation, the conversation ID: we use a random conversation ID for every call, which makes it difficult to keep track of calls within a conversation.
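The idea of per-call ephemeral identifiers can be sketched as follows. This is an illustration under stated assumptions, not Wire's actual construction: the use of HMAC-SHA256, the `kind` label and the function names are all hypothetical choices made for this example.

```python
import hashlib
import hmac
import secrets


def new_call_secret() -> bytes:
    """Random per-call value, shared among the devices in the call (assumed 32 bytes)."""
    return secrets.token_bytes(32)


def ephemeral_id(call_secret: bytes, kind: str, long_lived_id: str) -> str:
    """Derive an ID for the calling server from the per-call secret.

    HMAC-SHA256 is an assumption for this sketch. Devices that know the
    call secret derive the same value, while the server cannot link it
    back to the long-lived ID or to IDs used in other calls.
    """
    message = f"{kind}:{long_lived_id}".encode("utf-8")
    return hmac.new(call_secret, message, hashlib.sha256).hexdigest()


if __name__ == "__main__":
    user_id = "long-lived-user-id"  # placeholder value
    first_call = new_call_secret()
    second_call = new_call_secret()
    print(ephemeral_id(first_call, "user", user_id))
    print(ephemeral_id(second_call, "user", user_id))  # differs from the first
```

Because every call uses a new random secret, the same user and the same conversation appear under unrelated identifiers each time.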
We also improve privacy between users: Since all call participants connect to the calling server for signalling and routing encrypted media streams, participants do not see each other’s IP addresses.
The key schedule is designed to maintain good security while keeping the number of key messages low. If each client generated its own key and sent it to every other client on joining, larger calls would require many key messages: as many as 90 to set up a 10-person call. Instead, we designate one client to generate the keys on a rolling schedule and distribute them as necessary. When a client joins the call, it is given the current key and can encrypt and decrypt immediately. Every 30 seconds a new key is used for media encryption. If no clients have left the call, this key is derived from the previous key material, so clients already in the call do not need to be sent the new key. This provides Forward Secrecy, meaning that new joiners do not have the key material to decrypt the portion of the call that took place before they joined. If one or more clients have left, a fresh key is generated and sent to all clients now in the call, which excludes previous participants from the current key material and provides Post-Compromise Security. The keys are passed between clients using Proteus messages, which are end-to-end encrypted between clients and travel via the messaging infrastructure, completely avoiding the calling server. Therefore the calling server is never able to decrypt any media data and can be considered a no-trust zone.
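The rolling schedule can be sketched roughly as below. Only the 30-second interval and the derive-or-regenerate rule come from the description above. The derivation function (HMAC-SHA256 with a fixed label), the key size and all names are assumptions for illustration, and the distribution of keys via Proteus messages is left out.

```python
import hashlib
import hmac
import secrets

KEY_ROTATION_SECONDS = 30  # rotation interval stated in the article


def fresh_key() -> bytes:
    """New random media key (32 bytes is an assumption)."""
    return secrets.token_bytes(32)


def ratchet(previous_key: bytes) -> bytes:
    """One-way derivation of the next key from the previous one.

    HMAC-SHA256 with a fixed label is a stand-in for the real derivation.
    Holders of the previous key can compute the next one, but the next
    key does not reveal the previous one.
    """
    return hmac.new(previous_key, b"next-media-key", hashlib.sha256).digest()


class KeyGenerator:
    """The one client designated to generate keys for the call (hypothetical class)."""

    def __init__(self) -> None:
        self.epoch = 0
        self.key = fresh_key()
        self.someone_left = False

    def client_left(self) -> None:
        self.someone_left = True

    def rotate(self) -> bool:
        """Advance one interval; return True if the new key must be distributed."""
        self.epoch += 1
        if self.someone_left:
            # Departed clients must not be able to derive the new key.
            self.key = fresh_key()
            self.someone_left = False
            return True
        # Nobody left: every remaining client ratchets locally, no key message needed.
        self.key = ratchet(self.key)
        return False
```

In this sketch a client that stays in the call simply applies `ratchet` to its own copy of the key at every interval, and only has to receive a key message after a `rotate()` call that returned `True`.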
You can find more detailed information in the Security Whitepaper on https://wire.com/security.
The basic principles described in this article are not limited to audio and video conferencing. Screen sharing and other content types can also be distributed in a fully secure and private way. In particular, the concept can be extended to larger-scale broadcast and streaming services. This approach also paves the way towards combining end-to-end encrypted conference calling with the newly designed Messaging Layer Security (MLS) protocol, which opens up new possibilities for future standardization efforts.
This conferencing service we built is only the beginning. While there are still plenty of improvements and optimizations to be done, we are confident it is a platform that is scalable and can be used as a basis for many new features in the future.