This paper was accepted to the Industry Track at NAACL 2024.
With increasingly powerful compute capabilities and resources in today's devices, traditionally compute-intensive automatic speech recognition (ASR) has been moving from the cloud to devices to better protect user privacy. However, it is still challenging to implement on-device ASR on resource-constrained devices, such as smartphones, smart wearables, and other small home automation devices. In this paper, we propose a series of model architecture adaptations, neural network graph transformations, and numerical optimizations to fit an advanced Conformer-based end-to-end streaming ASR system on resource-constrained devices without accuracy degradation. We achieve speech recognition over 5.26 times faster than real time (a real-time factor, RTF, of 0.19) on small wearables while minimizing energy consumption and achieving state-of-the-art accuracy. The proposed methods are widely applicable to other transformer-based server-free AI applications. In addition, we provide a complete theory of optimal pre-normalizers that numerically stabilize layer normalization in any Lp-norm using any floating-point precision.
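To make the relationship between the two reported speed figures concrete, the following is a minimal sketch, assuming the usual definition of the real-time factor as processing time divided by audio duration, so that the speedup over real time is its reciprocal. The function names and the 1.9 s / 10 s measurement are hypothetical and chosen only to reproduce an RTF of 0.19; they are not taken from the paper's experiments.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent recognizing / duration of the audio recognized."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speedup_over_realtime(rtf: float) -> float:
    """How many times faster than real time a given RTF corresponds to."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Hypothetical measurement (an assumption for illustration only):
# 1.9 s of compute to recognize a 10 s utterance.
rtf = real_time_factor(processing_seconds=1.9, audio_seconds=10.0)
speedup = speedup_over_realtime(rtf)
print(f"RTF = {rtf:.2f}, speedup = {speedup:.2f}x")
```

Under this definition, an RTF of 0.19 corresponds to 1 / 0.19 ≈ 5.263, which is consistent with the "over 5.26 times faster than real time" figure.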