Siri, Apple's personal virtual assistant, debuted on the iPhone and expanded, step by step, to iPad, Mac, and Apple Watch. When it came time to build the HomePod, though, Apple needed a version of Siri tuned not for wrist, hand, lap, or table distances, but for across the room and around the house. That meant optimizing Siri for far-field use cases.

Apple's Machine Learning Journal:

Far-field speech recognition becomes more challenging when another active talker, like a person or a TV, is present in the same room with the target talker. In this scenario, voice trigger detection, speech decoding, and endpointing can be substantially degraded if the voice command isn't separated from the interfering speech components. Traditionally, researchers tackle speech source separation using either unsupervised methods, like independent component analysis and clustering [4], or deep learning [5, 6]. These techniques can improve automatic speech recognition in conferencing applications or on batches of synthetic speech mixtures where each speech signal is extracted and transcribed [6, 7]. Unfortunately, the usability of these batch techniques in far-field voice command-driven interfaces is very limited. Furthermore, the effect of source separation on voice trigger detection, such as that used with "Hey Siri", has never been investigated previously. Finally, it's crucial to separate far-field mixtures of competing signals online to avoid latencies and to select and decode only the target stream containing the voice command.
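To make the "unsupervised methods, like independent component analysis" part of that passage a little more concrete, here is a minimal sketch of classic ICA-based blind source separation on a synthetic two-microphone, two-source mixture. This is purely illustrative and not Apple's HomePod pipeline; the signals, the mixing matrix, and the use of scikit-learn's FastICA are all my own assumptions for the demo.

import numpy as np
from sklearn.decomposition import FastICA

fs = 16_000                               # assumed sample rate in Hz
t = np.arange(fs * 2) / fs                # two seconds of "audio"

# Stand-ins for the target talker and an interfering source (e.g. a TV).
target = np.sign(np.sin(2 * np.pi * 220 * t))   # square-ish wave
interferer = np.sin(2 * np.pi * 350 * t)        # pure tone

sources = np.c_[target, interferer]
mixing = np.array([[1.0, 0.6],            # hypothetical room/microphone gains
                   [0.4, 1.0]])
mics = sources @ mixing.T                 # what the two microphones observe

# ICA recovers statistically independent components from the mixtures,
# up to permutation and scaling. A separate step (like the trigger
# detection the quote describes) would still have to decide which
# separated stream actually contains the "Hey Siri" voice command.
ica = FastICA(n_components=2, random_state=0)
estimated = ica.fit_transform(mics)
print(estimated.shape)                    # (32000, 2): two separated streams

The point of the quoted research is that doing this kind of separation in batches after the fact isn't good enough for a voice assistant; it has to happen online, with low latency, and feed directly into trigger detection.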

Voice assistants are still new by user-interface standards, but they will inarguably play a huge part in the future of human-machine interaction, so I can't get enough of this kind of stuff.

Well worth a read.
