Practical insights

Latency tolerances

It is popular to benchmark creative technical solutions in terms of quality of service⁴⁷, which encompasses computer science terms such as *bandwidth*, *jitter*, *packet loss* and *latency*. Audiences, on the other hand, are concerned with the *quality of experience*⁴⁷, encompassing subjective terms such as smoothness, responsiveness, comfort and immersion.

These two do not always align. Solutions that technically fall short in terms of strict quality of service thresholds can sometimes still deliver a satisfying experience if they remain within perceptual tolerance windows. On the other hand, despite meeting quality of service thresholds, something might feel just off about an experience without careful thought to multisensory elements. This distinction helps guide where creative effort and technical effort are best invested.

It is often difficult to ensure that information being presented to different senses arrives at exactly the same time. Fortunately, in integrating sensory information our brains can tolerate small mismatches in timing. This means that absolute zero latency is not always necessary for an engaging, comfortable experience. The graph below illustrates the lags that can be tolerated for different tasks and information types.

[Chart: tolerable lags for different tasks and information types]

These considerations are important in a range of different screen, performance and gaming contexts. The following are some illustrative examples:

Motion-to-Photon Latency in Gaming: When a player moves their character, a traditional view is that there should be a minimal delay before this movement is reflected on screen. This delay is known as motion-to-photon latency. Studies on first-person targeting and competitive FPS play show that reducing local latency improves performance; performance differences are already measurable in the tens-of-milliseconds range⁵⁵,⁵⁶.
Remote Collaborative Performances: When actors or musicians collaborate from different locations, delays in audio and visual transmission can disrupt the natural flow of interaction, which can affect the quality of their performance and its expressive nuance⁵⁷. In networked music performance, 20–30 ms is often cited as a target that feels close to in-person playing, although tolerable delay depends on repertoire, rhythm and performer adaptation⁵⁸. That said, the players in a large orchestra spread over 40 ft (12.2 m) experience delays of 35–36 ms due to the speed of sound alone, which suggests that performers can work around delays of this size⁴⁸. As an aside, there is a striking parallel: the band AC/DC often performed within a few feet of each other, with Malcolm Young (rhythm guitar), Cliff Williams (bass) and Phil Rudd (drums) positioned very close and maintaining their famously precise rhythmic tightness: "They didn't just play in time, they felt the time — the groove — together."⁵⁹
Live Music Streaming: When streaming a live music performance into viewers' homes, it is important to consider how latency affects the experience. A small mismatch between audio and video can be tolerated, but the acceptable limits depend on which signal leads. For broadcast-style audiovisual sync, viewers begin to detect mismatch at roughly 45 ms when audio leads video, and 125 ms when video leads audio.
Arena-scale performances: Sound travels through air at approximately 343 metres per second, meaning it takes about 2.9 milliseconds to travel 1 metre. If you sit 10 metres further from the stage than someone in the front row, you will hear the performer about 29 milliseconds later than they do. Temperature also has a significant effect: the speed of sound increases by around 0.5–0.6 m/s for each 1°C rise in air temperature⁶⁰, with smaller contributions from humidity and air pressure. High frequencies are attenuated more than low frequencies, leading to a loss of brightness over distance. Here's an interesting thought: if a performer's mouth appears about 2 cm wide in your visual field at 1 m, it appears only about 0.2 cm wide at 10 m (apparent size shrinks in proportion to distance), so lip-sync errors may be less perceptually obvious at that distance, but timing differences in sound remain important. This is one reason large venues and festivals strategically distribute speakers (delay towers) to keep sound aligned across the audience area. A known issue is that audience members located between two towers may hear slightly misaligned signals, creating a hollow or echo-like effect. Much of the craft of setting up public address systems involves tuning these time delays and shaping directivity patterns to minimise overlap and interference⁵⁴,⁶¹.
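
The arithmetic behind these examples can be sketched in a few lines of Python. This is a minimal illustration and not a measurement tool: the function and constant names are hypothetical, the speed of sound uses a common linear approximation for dry air (about 331.3 m/s at 0°C plus roughly 0.6 m/s per °C), and the two sync limits are simply the approximate detection thresholds quoted above for broadcast-style audiovisual sync.

```python
def speed_of_sound(temp_c: float = 20.0) -> float:
    """Approximate speed of sound in dry air, in m/s (linear rule of thumb)."""
    return 331.3 + 0.606 * temp_c


def acoustic_delay_ms(distance_m: float, temp_c: float = 20.0) -> float:
    """Time in milliseconds for sound to travel distance_m metres through air."""
    return 1000.0 * distance_m / speed_of_sound(temp_c)


# Approximate detection thresholds taken from the text (assumed, not universal).
AUDIO_LEAD_LIMIT_MS = 45.0   # audio arrives before the matching video
VIDEO_LEAD_LIMIT_MS = 125.0  # video arrives before the matching audio


def av_offset_detectable(audio_minus_video_ms: float) -> bool:
    """Return True if the offset exceeds the relevant threshold.

    Positive values mean the audio arrives after the video (video leads);
    negative values mean the audio arrives first (audio leads).
    """
    if audio_minus_video_ms < 0:
        return -audio_minus_video_ms > AUDIO_LEAD_LIMIT_MS
    return audio_minus_video_ms > VIDEO_LEAD_LIMIT_MS


if __name__ == "__main__":
    print(round(acoustic_delay_ms(10.0), 1))   # seat 10 m from the stage
    print(round(acoustic_delay_ms(12.2), 1))   # across a 40 ft orchestra
    print(av_offset_detectable(acoustic_delay_ms(10.0)))
```

At 20°C the sketch gives roughly 29 ms for 10 m and between 35 and 36 ms for 12.2 m, matching the figures above. Because sound arriving late corresponds to video leading, a 29 ms acoustic lag sits well inside the more forgiving 125 ms window.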

Tolerances also vary substantially from person to person. For example, children typically have a higher tolerance for audiovisual delays than adults⁶², people with Attention Deficit/Hyperactivity Disorder (ADHD) tend to have a lower tolerance⁶³ (narrower windows; note some recent counter evidence exists⁶⁴), and individuals with developmental dyslexia⁶⁵ or with schizophrenia⁶⁶ generally display higher tolerances (broader windows) for the same lags. Because inclusive design should cater for all, including the most sensitive groups, latency budgets ought to be set just below the smallest empirically reported window, even if larger windows appear acceptable for other audiences.

Tolerance of spatial mismatches

When sensory signals appear in roughly the same location, our brains tend to merge them. Interestingly, this principle can hold even if there is a small but noticeable gap. For example, in a ventriloquist's performance the puppeteer's mouth is the true source of sound, yet onlookers often perceive the puppet itself as talking⁶⁷. Even when there is a 30° separation⁶⁸ between the puppet and the ventriloquist's mouth, observers often fuse the two. A variety of factors influence the level of acceptable variation, as outlined in the next section.

Prior experience and the assumption of unity

If we believe that two signals share the same origin, our brains can tolerate more pronounced mismatches in timing or location than usual. Early work⁶⁹ showed that simply expecting signals to come from one source can override some discrepancies in timing or location. This is known as the assumption of unity and it is strongly influenced by prior experience – the brain learns from repeated multisensory encounters, building strong associations that guide whether future signals are integrated.

Black and white photograph showing a section of a cratered planetary surface with several raised, rounded formations resembling faces. The formations stand out due to shadows and light contrasts, highlighting unusual shapes amid a textured, pockmarked terrain.

The Face on Mars captured by NASA's Viking 1 orbiter in 1976. This image is often cited as a classic example of pareidolia: the human tendency to perceive meaningful patterns, such as faces, in random or ambiguous visual data. © NASA/JPL, Viking 1 Orbiter, 1976.

Perceptual 'filling in'

Our brains often fill in missing or messy sensory data, drawing on past knowledge and context to create an experience that is fuller than that of the sensory information provided. An example of this is pareidolia, the tendency to see meaningful patterns in random or ambiguous things. This is why we might see a face on the surface of the moon or a human figure in the shadows on a dark street⁷⁰. This ability to interpret incomplete sensory information is essential for navigating the messy reality of everyday perception. And it also offers creative opportunities. In particular, it points to the conclusion that not every detail needs perfect simulation. Strategic hints can prompt audiences to imagine the rest, easing technical demands without sacrificing immersion.

Multisensory attention

Multisensory cues are excellent at attracting and holding attention. Hearing a loud bang alongside a bright flash is more attention-grabbing than either of these alone. And in a performance or interactive exhibit, coordinating sound, light and even gentle tactile cues can effectively steer the audience's focus.

Attention can also be drawn by sensations which violate our expectations⁷¹. This suggests that deliberate deviations from multisensory realism could be used to capture people's attention in interesting ways. For example, an object may appear to be made of one material but feel unexpectedly different to the touch (e.g., a surface that looks like hard stone but is actually soft), creating surprise and increasing user interest or engagement.

Once attention is focused on a particular event, it has a number of other important effects. For example, attention can enhance multisensory integration, amplifying effects such as the ventriloquist effect discussed earlier⁶⁷. Attention can also selectively amplify some stimuli while suppressing others. The cocktail party effect is a classic example whereby you can often tune out background noise to focus on a single conversation⁷². Interestingly, attending to lip movements in a noisy room has been found to make a speaker's voice perceptually louder, roughly the same as a +5 dB increase in sound level or about 150% of the original loudness⁷². And this amplification effect occurs in the other senses as well⁷³, for example heightening visual contrast⁷⁴, or making flavours seem richer or more pronounced⁷⁵.
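
To put the decibel figure in context, the sketch below converts a change in sound level into an approximate change in perceived loudness. It assumes the widely used rule of thumb that perceived loudness roughly doubles for every 10 dB increase at moderate listening levels; the function name is hypothetical and the rule is an approximation, not a claim about how the cited study derived its numbers.

```python
def loudness_ratio(level_change_db: float) -> float:
    """Approximate perceived-loudness ratio for a change in level (dB).

    Uses the rule of thumb that loudness doubles per +10 dB, which is
    only a rough guide and holds best at moderate listening levels.
    """
    return 2.0 ** (level_change_db / 10.0)


if __name__ == "__main__":
    # Express a +5 dB change as a percentage of the original loudness.
    print(f"{100 * loudness_ratio(5.0):.0f}%")
```

Under this rule a +5 dB change comes out at a little over 140% of the original loudness, which is in the same range as the rounded figure quoted above.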

Attention directed toward emotional stimuli can also heighten affective responses, making emotionally charged experiences more intense⁷⁶. Whether it's focusing on a beautifully plated meal or a dramatic moment in a film, attention can amplify the depth of an emotional experience by prioritising and enhancing relevant sensory cues.

Holistic experiential integration

Audiences absorb far more than just the main show. From their first step into a venue, the lighting, ambient sounds and even the scent of the lobby all shape their overall impression. Subtle touches, like a textured ticket or a gentle aroma, can make a lasting emotional impact. This highlights the importance of carefully designing the onboarding and offboarding phases of an experience (the whole arc³²) to ensure it feels inclusive, enjoyable and comfortable for everyone⁷⁷.