The Internet is widely used for audio communications. Numerous collaboration applications make it trivial to carry on a conversation with almost anyone, worldwide; you likely use them regularly. Even what we think of as traditional phone lines have largely migrated to running over the Internet. So why is performing music any different?
The answer lies in the problem of keeping a common rhythm going between remote musicians. Maintaining a shared beat or sense of pulse becomes difficult when one musician's sound takes too long to reach another's ears, and the consequences can be drastic. The "Happy Birthday" effect is familiar from family video conferences: the time it takes for sound to travel between individuals makes it impossible for the group to sing together. Singers end up in a standoff of "I'm waiting for you and you're waiting for me." Video calls are engineered for turn-taking in conversation; music is different because it involves simultaneous, coordinated activity rather than alternation. Tightly synchronized performance only works over the Internet with extremely low-latency audio applications like JackTrip.
A group’s ability to maintain a steady pulse is heavily impacted by what is known as latency, the time it takes for one performer's sound to reach another's ears. It is typically measured in milliseconds (msec), or 1/1000 of a second. Research has found that performing synchronized rhythms together requires a one-way latency below roughly 25-30 msec. There isn’t a hard-and-fast number because everyone is different and musical situations differ. Particularly important in this regard is the speed, or tempo, of a piece (measured in beats per minute): slower tempi can tolerate relatively longer Internet latencies.
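To see why tempo matters, it helps to express the latency as a fraction of one beat. The sketch below is simple arithmetic, not from the article; the function name and the example tempi are illustrative.

```python
# Back-of-the-envelope check: how large is a 25 msec one-way latency
# relative to one beat at various tempi?

def latency_as_beat_fraction(latency_msec: float, bpm: float) -> float:
    """Return the one-way latency as a fraction of one beat."""
    beat_msec = 60_000 / bpm  # milliseconds per beat
    return latency_msec / beat_msec

for bpm in (60, 120, 180):
    frac = latency_as_beat_fraction(25, bpm)
    print(f"{bpm:3d} BPM: 25 msec is {frac:.1%} of a beat")
```

At 60 BPM the delay is only 2.5% of a beat, but at 180 BPM it grows to 7.5%, which is one way of seeing why faster music is less forgiving of latency.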
To help put this into perspective, sounds traveling through air 25 feet (roughly 8 meters) take about 25 msec. We're comfortable playing or singing together at distances within this range. As a group spreads out, say across a football field, the ability to keep a coordinated rhythm becomes increasingly difficult. This is why minimizing latency is so important.
When performing music over the Internet, the sound you make has to pass through several stages to reach another performer. Each of these stages adds latency.
The farther away one performer is from another, the longer it will take the sound to travel between them. The globe is laced with fiber-optic cables running across land and under the sea; connected together, these form the Internet. Data travels at roughly 70% of the speed of light across these “backbone” network segments and transits from one to another at relay points called “network hops”. The greater the distance and the greater the number of hops, the longer a sound will take to get from source to destination. Good, tight latencies are achievable when musicians are physically located no more than a few hundred miles apart (approximately 1000 kilometers). Figure about 10-12 msec (one way) to traverse a large metropolitan area.
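The propagation part of this delay can be estimated from distance alone. The sketch below assumes the 70% of light-speed figure mentioned above; real routes are longer than straight lines, and each hop adds queueing delay on top.

```python
# Idealized one-way propagation delay over a straight fiber run.

SPEED_OF_LIGHT_KM_PER_MSEC = 299_792.458 / 1000  # ~300 km per msec in vacuum
FIBER_FRACTION = 0.70                            # signal speed in fiber, per the text

def fiber_delay_msec(distance_km: float) -> float:
    """Best-case one-way propagation delay, ignoring hops and routing detours."""
    return distance_km / (SPEED_OF_LIGHT_KM_PER_MSEC * FIBER_FRACTION)

print(f"1000 km: {fiber_delay_msec(1000):.1f} msec")  # ~4.8 msec before hop delays
```

The gap between this ~5 msec ideal and the 10-12 msec metropolitan figure above is a reminder that routing detours and hops, not raw distance, dominate real-world latency.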
Each musician's home or studio has an Internet connection. Latency depends on the type of connection: fiber-to-the-home (FTTH) is fastest (about 2 msec), while cable and DSL are among the slowest (about 10-15 msec). Connection latency is not the same as bandwidth, which specifies how much data can be transmitted over a given period of time; having a high bandwidth rate (even gigabit) does not necessarily correspond to having low latency. The typical tools for measuring a connection's quality of service (QoS) are a speed test and round-trip ping time (which measures how long data takes to echo off a test point elsewhere on the Internet). However, these give only a rough approximation of the QoS required for network music performance. One aspect of QoS with great impact on latency is the smoothness of very fast data flows (audio packet jitter), which is best when connections are uncongested.
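One rough way to quantify that smoothness from ordinary ping measurements is to look at how much consecutive round-trip times vary. This is an illustrative sketch with made-up sample values, not a JackTrip tool; it only approximates the per-packet jitter that actually matters for live audio.

```python
# Estimate jitter as the mean absolute difference between consecutive
# round-trip ping samples (values in msec).

import statistics

def jitter_msec(ping_samples: list) -> float:
    """Mean absolute change between successive ping times, in msec."""
    diffs = [abs(b - a) for a, b in zip(ping_samples, ping_samples[1:])]
    return statistics.mean(diffs)

samples = [12.1, 12.3, 11.9, 14.0, 12.2]  # hypothetical measurements
print(f"estimated jitter: {jitter_msec(samples):.2f} msec")
```

A connection with a good average ping but a large spread like this can still be a poor fit for network music performance.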
Quite simply, do not use WiFi. Even the best WiFi routers add significant latency and an impossible amount of jitter (occasionally stalling audio packet flows for tens of msec). Computers or devices being used for JackTrip need to be connected to the home network with an Ethernet cable (usually plugged into one of the Ethernet ports on the back of the WiFi router). Latency across the wired portions of a home network can be less than 1 msec.
The latency from ADC and DAC processing is determined by the hardware you are using, often referred to as your “sound card” or audio interface. Every sound card is different, and the range is very large: built-in laptop audio can easily have latencies over 100 milliseconds, while even the best USB-based audio interfaces have latencies in the range of 5-15 milliseconds (total, in and out). Low latency is more of a priority for sound card manufacturers targeting music studio applications than for use cases like music players, video players, gaming, and dialog recording. Unfortunately, USB microphones have not yet proven to offer low enough latency, either.
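A large part of interface latency comes from buffering: audio is processed in fixed-size buffers, and each buffer's worth of samples adds delay on the way in and again on the way out. The arithmetic below is illustrative; the 256-frame and 48 kHz figures are common assumed values, not vendor specs.

```python
# Latency contributed by one audio buffer: buffer_frames / sample_rate.

def buffer_latency_msec(buffer_frames: int, sample_rate_hz: int) -> float:
    """One-way latency of a single audio buffer, in msec."""
    return 1000 * buffer_frames / sample_rate_hz

# e.g. a 256-frame buffer at 48 kHz, applied on both input and output:
one_way = buffer_latency_msec(256, 48_000)
print(f"{one_way:.2f} msec per buffer, ~{2 * one_way:.2f} msec in + out")
```

Smaller buffers mean lower latency but more frequent processing, which is why low-latency interfaces and drivers are engineered to run reliably with very small buffers.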
The best way to reduce sound card latency is to use a hardware device which is designed for minimal latency. An economical way we’ve found is what's bundled in our Raspberry Pi JackTrip Device, which includes a HiFiBerry sound card. Latency is exceptionally good at about 1 msec.
Bluetooth devices are out; use analog connections only. Remembering that travel time through air can be significant (about 1 msec per foot, or roughly 3 msec per meter), care should be taken to keep the microphone close to the sound source. Analog headphones are obviously great for avoiding this proximity delay, but loudspeakers can also be used, bearing in mind that they should be nearby (while avoiding the possibility of feedback). Electronic instruments (guitars, keyboards, etc.) can be plugged directly into the analog input of the audio interface; monitor their sound by mixing it into the headphones or loudspeakers via the device's direct monitoring.
When connecting remote musicians together over the Internet, there are two common ways to wire-up multiple sites: the client-server (hub and spoke) method and the peer-to-peer (p2p) method. Let’s explore the differences between these.
In the Client-Server Model, every performer’s computer sends a single copy of their audio input to a central server. The server mixes all the audio streams together and sends a single copy of the mix back to every performer’s computer, which plays it to their audio output. You can visualize this method as a hub and spoke pattern:
The processing and bandwidth requirements for each performer in the client-server model remain constant and low regardless of the number of performers. The server’s processing and bandwidth requirements, however, grow proportionally with the number of performers. Servers are designed for this, so it is not a problem: they can easily be scaled to handle hundreds of simultaneous performer connections. This makes the client-server pattern most suitable for groups larger than a handful of performers.
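The scaling claim can be made concrete with a toy bandwidth model. The 1.5 Mbps per-stream figure below is an assumption for illustration (roughly uncompressed stereo audio), not a JackTrip specification.

```python
# Toy hub-and-spoke bandwidth model: each client sends one stream up and
# receives one mixed stream down; the server handles one in and one out
# per connected performer.

STREAM_MBPS = 1.5  # assumed per-stream bitrate

def client_server_bandwidth(n_performers: int):
    """Return (per-client Mbps, server Mbps) for the client-server model."""
    per_client = STREAM_MBPS * 2             # one stream up, one mix down
    server = STREAM_MBPS * 2 * n_performers  # grows linearly with performers
    return per_client, server

for n in (4, 20, 100):
    client, server = client_server_bandwidth(n)
    print(f"{n:3d} performers: {client:.1f} Mbps per client, {server:.1f} Mbps at server")
```

Under these assumed numbers, each client needs the same modest 3 Mbps whether there are 4 performers or 100; only the server's load grows.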
In the peer-to-peer (or p2p) method, each performer’s computer sends a copy of their audio input directly to every other performer. Each performer’s computer mixes all the incoming audio streams together and plays the result to their audio output. You can visualize this as a mesh pattern:
The processing and bandwidth requirements for each performer’s computer are directly proportional to the number of connected performers. Today's laptops and home Internet connections tend to max out beyond a dozen or so performers. This method is not viable for larger groups.
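Applying the same toy model (with the same assumed 1.5 Mbps per-stream figure, not a JackTrip specification) to the mesh shows why the p2p method hits a ceiling: every performer exchanges a stream with each of the other n-1 performers.

```python
# Toy full-mesh bandwidth model: each performer both sends to and
# receives from every other performer.

STREAM_MBPS = 1.5  # assumed per-stream bitrate

def p2p_bandwidth(n_performers: int) -> float:
    """Total Mbps (upload + download) at each performer's connection."""
    return STREAM_MBPS * 2 * (n_performers - 1)

for n in (4, 12, 20):
    print(f"{n:2d} performers: {p2p_bandwidth(n):.1f} Mbps per performer")
```

Under these assumptions a dozen performers already demands over 30 Mbps in each direction at every home connection, plus the CPU cost of mixing all the incoming streams, which is consistent with the dozen-or-so ceiling noted above.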