*=Equal Contributors
This paper was accepted at the Efficient Natural Language and Speech Processing workshop at NeurIPS 2023.
Interactions with virtual assistants typically start with a predefined trigger phrase followed by the user command. To make interactions with the assistant more natural, we explore whether it is feasible to drop the requirement that users must begin each command with a trigger phrase. We address this task by combining the decoder signals of an automatic speech recognition (ASR) system with acoustic and lexical representations as input features to a large language model (LLM). We are interested in data- and resource-efficient systems that require only a small amount of training data and can potentially run on devices such as smartphones. For this reason, our model is finetuned on a small amount of multimodal data using low-rank adaptation. We compare the proposed system to unimodal models that rely on either lexical or acoustic information only. The effectiveness of our method is analyzed by finetuning decoder-only LLMs with sizes between 3 billion and 13 billion parameters on training data consisting of 10k to 80k utterances. We show that our best multimodal system yields better results than unimodal baselines while using only a fraction of the training data.
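The two core ingredients of the approach above can be sketched in a few lines: concatenating acoustic, lexical, and ASR-decoder-signal features into a single input vector, and adapting a frozen linear layer with a low-rank update (LoRA). This is a minimal illustrative sketch, not the paper's implementation; the dimensions, the single projection layer, and the feature split are hypothetical choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16   # hypothetical LLM hidden size
r = 2          # LoRA rank
alpha = 4.0    # LoRA scaling factor

# Frozen pretrained weight of one linear layer of the LLM
W = rng.standard_normal((d_model, d_model))

# Trainable low-rank factors: the only parameters updated during finetuning
A = rng.standard_normal((r, d_model)) * 0.01
B = np.zeros((d_model, r))  # zero-init so the adapted layer starts identical to the frozen one

def lora_linear(x):
    # y = x W^T + (alpha/r) * x A^T B^T  (standard LoRA reparameterization)
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

# Multimodal input: concatenate acoustic, lexical, and decoder-signal
# features, then feed the result through the adapted layer.
acoustic = rng.standard_normal(8)     # e.g. pooled audio-encoder features
lexical = rng.standard_normal(6)      # e.g. embedded ASR 1-best hypothesis
decoder_sig = rng.standard_normal(2)  # e.g. ASR decoder confidence signals
features = np.concatenate([acoustic, lexical, decoder_sig])  # shape (16,)

y = lora_linear(features)
```

Because `B` is zero-initialized, the adapted layer reproduces the frozen model exactly at the start of training; only the small matrices `A` and `B` (2 × 16 + 16 × 2 values here, versus 16 × 16 for `W`) receive gradient updates, which is what makes the method data- and resource-efficient.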