Advancements, Opportunities, and Impacts of Automatic Speech Recognition Technology in Various Domains
TL;DR:
This post focuses on the advancements in Automatic Speech Recognition (ASR) technology and their impact on various domains. ASR has become prevalent in several industries, with improved accuracy driven by scaling model size and building larger labeled and unlabeled training datasets.
Looking ahead, ASR technology is expected to continue improving with the scaling of acoustic model size and the enhancement of the internal language model. Moreover, self-supervised and multi-task training techniques will enable low-resource languages to benefit from ASR technology, while multilingual training will boost performance even further, allowing for basic usage such as voice commands in many low-resource languages.
ASR will also play a significant role in Generative AI, as interaction with avatars will happen via an audio/text interface. With the emergence of textless NLP, some end-tasks, such as speech-to-speech translation, may be solved without using any explicit ASR model. Multimodal models that can be prompted with text, audio, or both will be introduced, generating text or synthesizing audio as output.
Moreover, open-ended dialogue systems with voice-based human-machine interfaces will improve robustness to transcription errors and to differences between written and spoken forms. This will provide robustness to challenging accents and children's speech, enabling ASR technology to become an essential tool for many applications.
An end-to-end speech enhancement-ASR-diarization system is expected to be released, enabling the personalization of ASR models and improving performance on overlapped speech and challenging acoustic scenarios. This will be a significant step towards solving ASR technology's challenges in real-world scenarios.
Finally, a wave of speech APIs is expected. Still, there are opportunities for small startups to outperform big tech companies in domains with more legal or regulatory restrictions on technology use and data acquisition, and in populations with low technology adoption rates.
Automatic Speech Recognition (ASR) technology is gaining momentum across various industries such as education, podcasts, social media, telemedicine, call centers, and more. A great example is the growing prevalence of voice-based human-machine interfaces (HMI) in consumer products, such as smart cars, smart homes, smart assistive technology [1], smartphones, and even artificial intelligence (AI) assistants in hotels [2]. To meet the growing demand for fast and accurate responses, low-latency ASR models have been deployed for tasks like keyword spotting [3], endpointing [4], and transcription [5]. Speaker-attributed ASR models [6–7] are also gaining attention, as they enable product personalization, providing greater value to end-users.
Prevalence of Data. Streaming audio and video platforms such as social media and YouTube have made unlabeled audio data easy to acquire [8]. New self-supervised techniques have been introduced to utilize this audio without needing ground truth [9–10]. These techniques improve the performance of ASR systems in the target domain, even without fine-tuning on labeled data for that domain [11]. Another approach gaining attention, due to its ability to utilize this unlabeled data, is self-training using pseudo-labeling [12–13]. The main concept is to automatically transcribe unlabeled audio data using an automatic speech recognition (ASR) system and then use the generated transcription as ground truth for training a different ASR system in a supervised fashion. OpenAI took a different approach, assuming they could find human-generated transcripts at scale online. They generated a high-quality, large-scale (680K hours) training dataset by crawling publicly available audio data with human-generated subtitles. Using this dataset, they trained an ASR model (a.k.a. Whisper) in a fully supervised manner, achieving state-of-the-art (SoTA) results on several benchmarks in zero-shot settings [14].
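In outline, pseudo-label self-training looks like the minimal sketch below. The `teacher`/`student` objects and their `transcribe`/`train` methods are hypothetical stand-ins, and the confidence filter is one common heuristic for discarding noisy labels, not a detail from the cited papers.

```python
# Minimal sketch of self-training with pseudo-labels; interfaces are hypothetical.
from dataclasses import dataclass

@dataclass
class Example:
    audio_path: str
    text: str  # ground truth or pseudo-label

def pseudo_label(teacher, unlabeled_audio, confidence_threshold=0.9):
    """Transcribe unlabeled audio with a teacher ASR model and keep only
    confident hypotheses as pseudo-labels."""
    pseudo = []
    for path in unlabeled_audio:
        hyp, conf = teacher.transcribe(path)  # hypothetical API
        if conf >= confidence_threshold:      # filter out noisy labels
            pseudo.append(Example(path, hyp))
    return pseudo

def self_train(teacher, student, labeled, unlabeled_audio, rounds=3):
    for _ in range(rounds):
        pseudo = pseudo_label(teacher, unlabeled_audio)
        student.train(labeled + pseudo)  # real labels plus pseudo-labels
        teacher = student                # the student seeds the next round
    return student
```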
Losses. Despite end-to-end (E2E) losses dominating SoTA ASR models [15–17], new losses are still being published. A new approach called the hybrid autoregressive transducer (HAT) [18] makes it possible to measure the quality of the internal language model (ILM) by separating the blank and label posteriors. Later work [19] used this factorization to effectively adapt the ILM using only textual data, which improved the overall performance of ASR systems, particularly on the transcription of named entities, slang words, and nouns, which are major pain points for ASR systems. New metrics have also been developed to better align with human perception and overcome the semantic blind spots of word error rate (WER) [20].
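To make the ILM idea concrete, here is a rough sketch of the scoring arithmetic used in shallow fusion with an ILM estimate: the E2E model's internal language prior is discounted so an external, text-only LM can steer decoding. The weights are illustrative tuning knobs, not values from the cited papers.

```python
import math

def fused_score(log_p_e2e: float, log_p_ilm: float, log_p_ext_lm: float,
                ilm_weight: float = 0.3, lm_weight: float = 0.5) -> float:
    """Shallow fusion with ILM subtraction: discount the E2E model's internal
    language prior, then add an external text-only LM score. All inputs are
    log-probabilities; the weights are illustrative."""
    return log_p_e2e - ilm_weight * log_p_ilm + lm_weight * log_p_ext_lm

# Toy usage: a rare named entity that the internal LM dislikes can win once
# the internal prior is discounted and a domain LM vouches for it.
print(fused_score(math.log(0.2), math.log(0.05), math.log(0.6)))
```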
Architecture Choice. Regarding the acoustic model's architectural choices, the Conformer [21] remained the preferred choice for streaming models, while the Transformer [22] is the default architecture for non-streaming models. For the latter, encoder-only (wav2vec2-based [23–24]) and encoder-decoder (Whisper [14]) multilingual models have been released, improving over SoTA results across several benchmarks in zero-shot settings. These models outperform their streaming counterparts due to model size, training data size, and their larger context.
Multilingual AI Announcements from Tech Giants. Google has announced its "1,000 Languages Initiative" to build an AI model that supports the 1,000 most spoken languages [25], while Meta AI has announced its long-term effort to build language and machine translation (MT) tools that cover most of the world's languages [26].
Spoken Language Breakthrough. Multi-modal (speech/text) and multi-task pre-trained seq-2-seq (encoder-decoder) models such as SpeechT5 [27] have been introduced, showing great success on a wide variety of spoken language processing tasks, including ASR, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.
These advancements in ASR technology are expected to drive further innovation and impact a wide range of industries in the years to come.
Despite its challenges, the field of Automatic Speech Recognition (ASR) is expected to make significant advancements across various domains, ranging from acoustic and semantic modeling to conversational and generative AI, and even speaker-attributed ASR. This section provides detailed insights into these areas and shares my predictions for the future of ASR technology.
General Improvements:
ASR systems are expected to advance on both the acoustic and semantic fronts.
On the acoustic model side, larger model and training data sizes are expected to enhance the overall performance of ASR systems, similar to the progress observed in LLMs. Although scaling Transformer encoders, such as wav2vec2 or Conformer, poses a challenge, a breakthrough is expected to enable their scaling, or we will see a shift towards encoder-decoder architectures as in Whisper. However, encoder-decoder architectures have drawbacks that need to be addressed, such as hallucinations. Optimizations such as faster-whisper [28] and NVIDIA's wav2vec2 [29] will reduce training and inference time, lowering the barrier to deploying large ASR models.
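As an illustration of how low the deployment barrier has become, a faster-whisper transcription call looks roughly like this; the model name, device, and compute type depend on your hardware.

```python
from faster_whisper import WhisperModel

# CTranslate2-backed Whisper; pick model size / device / precision for your setup.
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

segments, info = model.transcribe("meeting.wav", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:  # segments are decoded lazily as you iterate
    print(f"[{seg.start:6.2f}s -> {seg.end:6.2f}s] {seg.text}")
```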
On the semantic side, researchers will focus on improving ASR models by incorporating larger acoustic or textual contexts. Injecting large-scale unpaired text into the ILM during E2E training, as in JEIT [30], will also be explored. These efforts will help overcome key challenges such as accurately transcribing named entities, slang words, and nouns.
Although Whisper and Google's Universal Speech Model (USM) [31] have improved ASR performance across several benchmarks, some benchmarks remain unsolved, as the word error rate (WER) stays around 20% [32]. Using speech foundation models, adding more diverse training data, and applying multi-task learning will significantly improve performance in such scenarios, opening up new business opportunities. Moreover, new metrics and benchmarks are expected to emerge to better align with new end-tasks and domains, such as non-lexical conversational sounds [33] in the medical domain and filler word detection and classification [34] in the media-editing and educational domains. Task-specific fine-tuned models may be developed for this purpose. Finally, with the growth of multi-modality, more models, training datasets, and new benchmarks for several tasks are also expected to be released [35–36].
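For context, WER is the edit-distance metric behind these numbers, and a minimal implementation makes its semantic blindness obvious: a harmless substitution and a meaning-destroying one cost exactly the same.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / #ref words,
    computed with the standard edit-distance dynamic program."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2/6 ≈ 0.33
```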
As progress continues, a wave of speech APIs is expected, similar to what happened in natural language processing (NLP). Google's USM, OpenAI's Whisper, and AssemblyAI's Conformer-1 [37] are some of the early examples.
Although it sounds silly, forced alignment is still challenging for many companies. Open-source code for it would help many achieve accurate alignment between audio segments and their corresponding transcripts.
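For intuition, forced alignment can be written as a Viterbi pass over a CTC trellis. The numpy sketch below is a minimal, self-contained version (frame-level log posteriors in, a per-frame token path out); production toolkits add batching, memory optimizations, and word-level grouping.

```python
import numpy as np

def ctc_forced_align(log_probs: np.ndarray, targets: list, blank: int = 0):
    """Viterbi forced alignment over a CTC trellis.

    log_probs: [T, C] frame-level log posteriors from an acoustic model.
    targets:   token ids of the known transcript (non-empty).
    Returns (frame, token_id) pairs; blank frames sit between tokens.
    """
    T = log_probs.shape[0]
    z = [blank]
    for y in targets:
        z += [y, blank]                       # blank-interleaved targets
    S = len(z)
    alpha = np.full((T, S), -np.inf)          # best log-score per (frame, state)
    back = np.zeros((T, S), dtype=int)        # best predecessor state
    alpha[0, 0] = log_probs[0, z[0]]
    alpha[0, 1] = log_probs[0, z[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [(alpha[t - 1, s], s)]                    # stay
            if s >= 1:
                cands.append((alpha[t - 1, s - 1], s - 1))    # advance
            if s >= 2 and z[s] != blank and z[s] != z[s - 2]:
                cands.append((alpha[t - 1, s - 2], s - 2))    # skip a blank
            best, prev = max(cands)
            alpha[t, s] = best + log_probs[t, z[s]]
            back[t, s] = prev
    # End in the final blank or the final token, whichever scores higher.
    s = S - 1 if alpha[-1, S - 1] >= alpha[-1, S - 2] else S - 2
    path = [s]
    for t in range(T - 1, 0, -1):
        s = back[t, s]
        path.append(s)
    path.reverse()
    return [(t, z[s]) for t, s in enumerate(path)]
```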
Low-Resource Languages:
Advances in self-supervised learning, multi-task learning, and multilingual models are expected to significantly improve performance on low-resource and unwritten languages. These methods will achieve acceptable performance by utilizing pre-trained models and fine-tuning on a relatively small number of labeled samples [24]. Another promising approach is dual learning [38], a paradigm for semi-supervised machine learning that seeks to leverage unsupervised data by solving two opposite tasks (text-to-speech (TTS) and ASR in our case) at once. In this method, each model produces pseudo-labels for unlabeled examples, which are used to train the other model, as in the sketch below.
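Schematically, one dual-learning round could look like this; the `asr`/`tts` objects and their `train`/`transcribe`/`synthesize` methods are hypothetical stand-ins.

```python
# Schematic dual-learning round between ASR and TTS; interfaces are hypothetical.
def dual_learning_round(asr, tts, unlabeled_audio, unlabeled_text):
    # ASR pseudo-labels raw audio, producing (audio, text) pairs for TTS.
    tts_pairs = [(audio, asr.transcribe(audio)) for audio in unlabeled_audio]
    # TTS synthesizes raw text, producing (audio, text) pairs for ASR.
    asr_pairs = [(tts.synthesize(text), text) for text in unlabeled_text]
    tts.train(tts_pairs)
    asr.train(asr_pairs)
    # Iterating this round lets each model improve the labels it gives the
    # other, closing the semi-supervised loop.
```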
Furthermore, improving the ILM using unpaired text can increase model robustness, which will be especially advantageous for closed-set challenges such as voice commands. Performance will be acceptable but not flawless in some applications, such as captioning YouTube videos, while in others, such as producing verbatim transcripts in court, it may take more time for models to meet the threshold. We anticipate that companies will gather data based on these models while manually correcting transcripts in 2023, and we'll see significant improvements in low-resource languages after models are fine-tuned on proprietary data in 2024.
Generative AI:
The use of avatars is expected to revolutionize human interaction with digital assets. In the short term, ASR will serve as one of the foundations of Generative AI, as these avatars will communicate through a textual/auditory interface.
But in the future, changes may occur as attention shifts towards new research directions. For example, an emerging technology that is likely to be adopted is textless NLP, which represents a new language-modeling approach to audio generation [39]. This approach uses learnable discrete audio units [40] and auto-regressively generates the next discrete audio unit, one unit at a time, similar to text generation. These discrete units can later be decoded back to the audio domain. So far, this technology has been able to generate syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers, as can be seen in GSLM/AudioLM [39, 41]. The potential of this technology is enormous, as one can skip the ASR component (and its errors) in many tasks. For example, traditional speech-to-speech (S2S) translation methods work as follows: they transcribe the utterance in the source language, then translate the text into the target language using a machine translation model, and finally generate the audio in the target language using a TTS engine. Using textless-NLP technology, S2S translation can be done with a single encoder-decoder architecture that works directly on discrete audio units, without using any explicit ASR model [42]. We predict that future textless NLP models will solve many other tasks without going through explicit transcription, such as question-answering systems. However, the main drawback of this method is backtracking errors and debugging, as things become less intuitive when working in the discrete-unit domain rather than on the transcription.
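The generation loop itself is the familiar autoregressive sampling loop, just over audio-unit ids instead of subwords. A minimal sketch, assuming a hypothetical causal `unit_lm` over unit ids (a GSLM/AudioLM-style model returning [batch, time, vocab] logits):

```python
from typing import Optional

import torch

@torch.no_grad()
def generate_units(unit_lm, prompt_units, max_new: int = 200,
                   temperature: float = 0.9, eos_id: Optional[int] = None):
    """Autoregressive generation over discrete audio units, one unit at a
    time, mirroring text generation; `unit_lm` is a hypothetical stand-in."""
    units = list(prompt_units)
    for _ in range(max_new):
        x = torch.tensor(units).unsqueeze(0)   # [1, T]
        logits = unit_lm(x)[0, -1]             # next-unit logits, shape [V]
        probs = torch.softmax(logits / temperature, dim=-1)
        nxt = int(torch.multinomial(probs, 1).item())
        if eos_id is not None and nxt == eos_id:
            break
        units.append(nxt)
    # A unit vocoder (e.g., a SoundStream/HiFi-GAN-style decoder) would map
    # the generated units back to a waveform.
    return units
```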
T5 [43] and T0 [44] showed great success in NLP by utilizing multi-task training and demonstrating zero-shot task generalization. In 2021, SpeechT5 [27] was published, showing great success on various spoken language processing tasks. Earlier this year, VALL-E [45] and VALL-E X [46] were released. They showed impressive in-context learning capabilities for TTS models by using textless NLP technology, enabling cloning of a speaker's voice from just a few seconds of their audio, without requiring any fine-tuning, even in cross-lingual settings.
By joining the concepts taken from SpeechT5 and VALL-E, we can anticipate the release of T0-like models that can be prompted with either text, audio, or both, and generate text or synthesize audio as output, depending on the task. A new era of models will begin, as in-context learning will enable generalization to new tasks in zero-shot settings. This will allow semantic search over audio, transcribing a target speaker using speaker-attributed ASR or describing them in free text, e.g., "What did the young kid who coughed say?". Moreover, it will enable us to classify or synthesize audio using audio or a textual description, and to solve NLP tasks directly from audio using explicit/implicit ASR.
Conversational AI:
Conversational AI has been adopted mainly through task-oriented dialogue systems, particularly AI personal assistants (PAs) such as Amazon's Alexa and Apple's Siri. These PAs have become popular due to their ability to provide quick access to features and information through voice commands. As big tech companies dominate this technology, new regulations on AI assistants will force them to offer third-party options for voice assistants, opening up competition [47]. As this happens, we can expect interoperability between personal assistants, meaning they will start talking to each other. This will be great, as one could use any device to connect to any conversational agent anywhere in the world [48]. From the ASR perspective, this will pose new challenges, as contextualization will be much broader, and assistants will need robustness to different accents and will possibly have to support multilingualism.
Over the past few years, a great technological leap has occurred in text-based open-ended dialogue systems, e.g., BlenderBot and LaMDA [49–50]. Initially, these dialogue systems were text-based, meaning they were fed text and trained to output text, all in the written-form domain. As ASR performance improved, open-ended dialogue systems were augmented with voice-based HMIs, which resulted in misalignment between modalities due to differences between the spoken and written forms. One of the main challenges is to bridge this gap by overcoming the new types of errors introduced by audio-related processing, e.g., differences between spoken and written forms such as disfluencies and entity resolution, and transcription errors such as pronunciation errors [51–52].
Potential solutions may be derived from improved transcription quality and robust NLP models that can effectively handle transcription and pronunciation errors. A reliable acoustic-model confidence score [53] will serve as a key player in these systems, enabling them to point out speaker errors or serve as another input to the NLP model or decoding logic. Moreover, we anticipate that ASR models will predict non-verbal cues such as sarcasm, enabling agents to understand the conversation more deeply and provide better responses.
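As a toy illustration of confidence-aware dialogue logic (the thresholds and the `nlu` interface are assumptions, not a real API):

```python
# Illustrative sketch: gating dialogue behavior on an ASR confidence score.
def handle_turn(text: str, confidence: float, nlu,
                accept_at: float = 0.8, clarify_at: float = 0.5):
    if confidence >= accept_at:
        return nlu.respond(text)
    if confidence >= clarify_at:
        # Pass the score downstream so the NLP model can hedge or re-rank.
        return nlu.respond(text, asr_confidence=confidence)
    return "Sorry, could you say that again?"  # low confidence: re-prompt
```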
These improvements will make it possible to push dialogue systems with an auditory HMI even further, supporting challenging accents and children's speech, as in Loora [54] and Speaks [55].
Pushing the limits even further, we anticipate the release of an E2E multi-task learning framework for spoken language tasks using joint modeling of the speech and NLP problems, as in MTL-SLT [56]. These models will train in an E2E fashion that reduces the cumulative error between sequential modules, and will tackle tasks such as spoken language understanding, spoken summarization, and spoken question answering, taking speech as input and emitting various outputs such as transcriptions, intents, named entities, summaries, and answers to text queries.
Personalization will be a huge factor for AI assistants and open-ended dialogue systems, leading us to the next point: speaker-attributed ASR.
Speaker-Attributed ASR:
Transcribing distant conversations involving multiple microphones and multiple parties in home environments is still a challenge. Even state-of-the-art (SoTA) systems can only achieve around 35% WER [57].
Early birds of joint ASR and diarization were introduced in 2019 [58]. This year, we can expect the release of an end-to-end speech enhancement-ASR-diarization system, which will improve performance on overlapped speech and enable better performance in challenging acoustic scenarios such as reverberant rooms, far-field settings, and low signal-to-noise ratios (SNR). The improvement will be achieved through joint task optimization, improved pre-training methods (such as WavLM [10]), architectural changes [59], data augmentation, and training on in-domain data during pre-training and fine-tuning [11]. Moreover, we can expect the deployment of speaker-attributed ASR systems for personalized speech recognition. This will further improve the transcription accuracy of the target speaker's voice and bias the transcript towards user-defined words, such as contact names, proper nouns, and other named entities, which are crucial for smart assistants [60]. Additionally, low-latency models will continue to be a significant area of focus to enhance edge devices' overall experience and response time [61–62].
The Role of Startups Compared to Big Tech Companies in the ASR Landscape
Although big tech companies are expected to continue dominating the market with their APIs, small startups can still outperform them in specific domains. These include areas that are underrepresented in big tech's training data due to regulations, such as the medical domain and children's speech, and populations that have not yet adopted technology, such as immigrants with challenging accents or individuals learning English worldwide. In markets where there isn't enough demand for big tech companies to invest in, such as languages that aren't widely spoken, small startups may find opportunities to succeed and generate profit.
To create a win-win situation, big tech companies could provide APIs that offer full access to the output of their acoustic models while allowing others to write the decoding logic (WFST/beam search), instead of merely adding customizable vocabulary or using existing model-adaptation features [63–64]. This approach would enable small startups to excel in their domains by incorporating priming or multiple language models during inference on top of the given acoustic model, rather than having to train the acoustic models themselves, which can be costly in terms of human capital and domain knowledge. In turn, big tech companies would benefit from broader adoption of their paid models.
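A toy sketch of what such custom decoding could look like: a beam search over the acoustic model's per-step posteriors with shallow fusion of a startup's domain LM. All interfaces here are illustrative.

```python
import math

def beam_search(posteriors, ext_lm, beam=4, lm_weight=0.4):
    """Toy beam search over per-step acoustic posteriors with shallow fusion
    of a domain language model; all interfaces are illustrative.

    posteriors: one dict per decoding step mapping token -> log-prob.
    ext_lm(prefix, token) -> log-score of `token` given the prefix.
    """
    beams = [((), 0.0)]  # (token prefix, cumulative fused score)
    for step in posteriors:
        scored = []
        for prefix, score in beams:
            for tok, lp in step.items():
                fused = score + lp + lm_weight * ext_lm(prefix, tok)
                scored.append((prefix + (tok,), fused))
        beams = sorted(scored, key=lambda b: b[1], reverse=True)[:beam]
    return beams[0][0]

# A startup's domain LM can boost terms (e.g., drug names) that a general
# acoustic model underweights.
domain_boost = {"metformin": 3.0}
lm = lambda prefix, tok: domain_boost.get(tok, 0.0)
steps = [{"metformin": math.log(0.3), "met for men": math.log(0.7)}]
print(beam_search(steps, lm))  # -> ('metformin',)
```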
How Does ASR Fit into the Broader Machine Learning Landscape?
On one hand, ASR is on par with computer vision (CV) and NLP in importance when considered as the end task. This is the current situation in low-resource languages and in domains where the transcript is the main business, e.g., court, medical records, movie subtitles, and so on.
On the other hand, ASR is no longer the bottleneck in domains where it has passed a certain usability threshold. In those cases, NLP is the bottleneck, which means that pushing ASR performance towards perfection is not essential for extracting insights for the end task. For example, meeting summarization or action-item extraction can often be achieved with current ASR quality.
The advancements in ASR technology have brought us closer to seamless communication between humans and machines, for example in conversational AI and generative AI. With the continued development of speech enhancement-ASR-diarization systems and the emergence of textless NLP, we are poised to witness exciting breakthroughs in this field. Looking to the future, we can't help but anticipate the endless possibilities that ASR technology will unlock.
Thanks for taking the time to read this post! Your thoughts and feedback on these projections are highly valued and appreciated. Please feel free to share your comments and ideas.
References:
[1] https://www.orcam.com/en/home/
[2] https://voicebot.ai/2022/12/01/hey-disney-custom-alexa-assistant-rolls-out-at-disney-world/
[3] Jose, Christin, et al. "Latency Control for Keyword Spotting." ArXiv, 2022, https://doi.org/10.21437/Interspeech.2022-10608.
[4] Bijwadia, Shaan, et al. "Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems." ArXiv, 2022, https://doi.org/10.1109/SLT54892.2023.10022338.
[5] Yoon, Ji, et al. "HuBERT-EE: Early Exiting HuBERT for Efficient Speech Recognition." ArXiv, 2022, https://doi.org/10.48550/arXiv.2204.06328.
[6] Kanda, Naoyuki, et al. "Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers Using End-to-End Speaker-Attributed ASR." ArXiv, 2021, https://doi.org/10.48550/arXiv.2110.03151.
[7] Kanda, Naoyuki, et al. "Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings." ArXiv, 2022, https://doi.org/10.48550/arXiv.2203.16685.
[8] https://www.fiercevideo.com/video/video-will-account-for-82-all-internet-traffic-by-2022-cisco-says
[9] Chiu, Chung, et al. "Self-Supervised Learning with Random-Projection Quantizer for Speech Recognition." ArXiv, 2022, https://doi.org/10.48550/arXiv.2202.01855.
[10] Chen, Sanyuan, et al. "WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing." ArXiv, 2021, https://doi.org/10.1109/JSTSP.2022.3188113.
[11] Hsu, Wei, et al. "Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training." ArXiv, 2021, https://doi.org/10.48550/arXiv.2104.01027.
[12] Lugosch, Loren, et al. "Pseudo-Labeling for Massively Multilingual Speech Recognition." ArXiv, 2021, https://doi.org/10.48550/arXiv.2111.00161.
[13] Berrebbi, Dan, et al. "Continuous Pseudo-Labeling from the Start." ArXiv, 2022, https://doi.org/10.48550/arXiv.2210.08711.
[14] Radford, Alec, et al. "Robust Speech Recognition via Large-Scale Weak Supervision." ArXiv, 2022, https://doi.org/10.48550/arXiv.2212.04356.
[15] Graves, Alex, et al. "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks." ICML, 2006, https://www.cs.toronto.edu/~graves/icml_2006.pdf
[16] Graves, Alex. "Sequence Transduction with Recurrent Neural Networks." ArXiv, 2012, https://doi.org/10.48550/arXiv.1211.3711.
[17] Chan, William, et al. "Listen, Attend and Spell." ArXiv, 2015, https://doi.org/10.48550/arXiv.1508.01211.
[18] Variani, Ehsan, et al. "Hybrid Autoregressive Transducer (HAT)." ArXiv, 2020, https://doi.org/10.48550/arXiv.2003.07705.
[19] Meng, Zhong, et al. "Modular Hybrid Autoregressive Transducer." ArXiv, 2022, https://doi.org/10.48550/arXiv.2210.17049.
[20] Kim, Suyoun, et al. "Evaluating User Perception of Speech Recognition System Quality with Semantic Distance Metric." ArXiv, 2021, https://doi.org/10.48550/arXiv.2110.05376.
[21] Gulati, Anmol, et al. "Conformer: Convolution-Augmented Transformer for Speech Recognition." ArXiv, 2020, https://doi.org/10.48550/arXiv.2005.08100.
[22] Vaswani, Ashish, et al. "Attention Is All You Need." ArXiv, 2017, https://doi.org/10.48550/arXiv.1706.03762.
[23] Baevski, Alexei, et al. "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations." ArXiv, 2020, https://doi.org/10.48550/arXiv.2006.11477.
[24] Babu, Arun, et al. "XLS-R: Self-Supervised Cross-Lingual Speech Representation Learning at Scale." ArXiv, 2021, https://doi.org/10.48550/arXiv.2111.09296.
[25] https://blog.google/technology/ai/ways-ai-is-scaling-helpful/
[27] Ao, Junyi, et al. "SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing." ArXiv, 2021, https://doi.org/10.48550/arXiv.2110.07205.
[28] https://github.com/guillaumekln/faster-whisper
[29] https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechRecognition/wav2vec2
[30] Meng, Zhong, et al. "JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition." ArXiv, 2023, https://doi.org/10.48550/arXiv.2302.08583.
[31] Zhang, Yu, et al. "Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages." ArXiv, 2023, https://doi.org/10.48550/arXiv.2303.01037.
[32] Kendall, T., and Farrington, C. "The Corpus of Regional African American Language." Version 2021.07. Eugene, OR: The Online Resources for African American Language Project. http://oraal.uoregon.edu/coraal, 2021.
[33] Tran, Brian D., et al. '"Mm-hm," "Uh-uh": Are Non-Lexical Conversational Sounds Deal Breakers for the Ambient Clinical Documentation Technology?' Journal of the American Medical Informatics Association, 2023, https://doi.org/10.1093/jamia/ocad001.
[34] Zhu, Ge, et al. "Filler Word Detection and Classification: A Dataset and Benchmark." ArXiv, 2022, https://doi.org/10.48550/arXiv.2203.15135.
[35] Anwar, Mohamed, et al. "MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation." ArXiv, 2023, https://doi.org/10.48550/arXiv.2303.00628.
[36] Jaegle, Andrew, et al. "Perceiver IO: A General Architecture for Structured Inputs & Outputs." ArXiv, 2021, https://doi.org/10.48550/arXiv.2107.14795.
[37] https://www.assemblyai.com/blog/conformer-1/
[38] Peyser, Cal, et al. "Dual Learning for Large Vocabulary On-Device ASR." ArXiv, 2023, https://doi.org/10.48550/arXiv.2301.04327.
[39] Lakhotia, Kushal, et al. "Generative Spoken Language Modeling from Raw Audio." ArXiv, 2021, https://doi.org/10.48550/arXiv.2102.01192.
[40] Zeghidour, Neil, et al. "SoundStream: An End-to-End Neural Audio Codec." ArXiv, 2021, https://doi.org/10.48550/arXiv.2107.03312.
[41] Borsos, Zalán, et al. "AudioLM: A Language Modeling Approach to Audio Generation." ArXiv, 2022, https://doi.org/10.48550/arXiv.2209.03143.
[42] https://about.fb.com/news/2022/10/hokkien-ai-speech-translation/
[43] Raffel, Colin, et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." ArXiv, 2019, https://doi.org/10.48550/arXiv.1910.10683.
[44] Sanh, Victor, et al. "Multitask Prompted Training Enables Zero-Shot Task Generalization." ArXiv, 2021, https://doi.org/10.48550/arXiv.2110.08207.
[45] Wang, Chengyi, et al. "Neural Codec Language Models Are Zero-Shot Text to Speech Synthesizers." ArXiv, 2023, https://doi.org/10.48550/arXiv.2301.02111.
[46] Zhang, Ziqiang, et al. "Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling." ArXiv, 2023, https://doi.org/10.48550/arXiv.2303.03926.
[47] https://voicebot.ai/2022/07/05/eu-passes-new-regulations-for-voice-ai-and-digital-technology/
[48] https://www.speechtechmag.com/Articles/ReadArticle.aspx?ArticleID=154094
[49] Thoppilan, Romal, et al. "LaMDA: Language Models for Dialog Applications." ArXiv, 2022, https://doi.org/10.48550/arXiv.2201.08239.
[50] Shuster, Kurt, et al. "BlenderBot 3: A Deployed Conversational Agent that Continually Learns to Responsibly Engage." ArXiv, 2022, https://doi.org/10.48550/arXiv.2208.03188.
[51] Zhou, Xiaozhou, et al. "Phonetic Embedding for ASR Robustness in Entity Resolution." Proc. Interspeech 2022, 3268–3272, doi: 10.21437/Interspeech.2022-10956.
[52] Chen, Angelica, et al. "Teaching BERT to Wait: Balancing Accuracy and Latency for Streaming Disfluency Detection." ArXiv, 2022, https://doi.org/10.48550/arXiv.2205.00620.
[53] Li, Qiujia, et al. "Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition." ArXiv, 2021, https://doi.org/10.48550/arXiv.2110.03327.
[54] https://loora.ai/
[56] Huang, Zhiqi, et al. "MTL-SLT: Multi-Task Learning for Spoken Language Tasks." NLP4ConvAI, 2022, https://aclanthology.org/2022.nlp4convai-1.11.
[57] Watanabe, Shinji, et al. "CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings." ArXiv, 2020, https://doi.org/10.48550/arXiv.2004.09249.
[58] Shafey, Laurent, et al. "Joint Speech Recognition and Speaker Diarization via Sequence Transduction." ArXiv, 2019, https://doi.org/10.48550/arXiv.1907.05337.
[59] Kim, Juntae, and Lee, Jeehye. "Generalizing RNN-Transducer to Out-Domain Audio via Sparse Self-Attention Layers." ArXiv, 2021, https://doi.org/10.48550/arXiv.2108.10752.
[60] Sathyendra, Kanthashree, et al. "Contextual Adapters for Personalized Speech Recognition in Neural Transducers." ArXiv, 2022, https://doi.org/10.48550/arXiv.2205.13660.
[61] Tian, Jinchuan, et al. "Bayes Risk CTC: Controllable CTC Alignment in Sequence-to-Sequence Tasks." ArXiv, 2022, https://doi.org/10.48550/arXiv.2210.07499.
[62] Tian, Zhengkun, et al. "Peak-First CTC: Reducing the Peak Latency of CTC Models by Applying Peak-First Regularization." ArXiv, 2022, https://doi.org/10.48550/arXiv.2211.03284.
[63] https://docs.rev.ai/api/custom-vocabulary/
[64] https://cloud.google.com/speech-to-text/docs/adaptation-model