Microsoft Indian Language Speech Corpus Package Offers Test Data for Telugu, Tamil, and Gujarati

By Jagmeet Singh | Updated: 6 September 2018 14:30 IST
Highlights
  • Microsoft has launched the Indian Language Speech Corpus
  • It offers data for Telugu, Tamil, and Gujarati
  • The dataset is provided by the Microsoft Research Open Data initiative

Microsoft's Speech Corpus is touted to be the largest publicly available Indian language speech dataset

Microsoft on Thursday launched the Microsoft Indian Language Speech Corpus package, which brings conversational and phrasal speech training and test data for the Telugu, Tamil, and Gujarati languages. Claimed to be the largest publicly available Indian language speech dataset, the package includes audio and corresponding transcripts. It is aimed at helping researchers and academia build Indian language speech recognition for applications that require speech. The dataset is provided through the Microsoft Research Open Data initiative and is available for free.

Speech has become important for localising experiences in areas such as natural language processing, computer vision, and domain-specific sciences. Microsoft also notes a scarcity of adequate digital data for text, speech, and linguistic resources, particularly for languages that are less dominant than English or Hindi. This creates the need for a speech dataset like the Microsoft Indian Language Speech Corpus.


"We believe India's increasing digital literacy needs to be supported by a multi-lingual digital world," said Sundar Srinivasan, General Manager, Artificial Intelligence and Research, Microsoft India, in a press statement. "Microsoft Indian Language Speech Corpus is an extension of our on-going efforts to reduce language barriers and empower Indians to harness the full potential of the Internet. Using our technology expertise, we want to accelerate innovation in voice-based computing for India by supporting researchers and academia."

The Microsoft Indian Language Speech Corpus is touted to address differences in enunciation, accent, diction, and slang that are common across various regions in India. It also includes audio and corresponding transcripts to help researchers and developers build their speech recognition systems without being linguistic experts in the languages concerned. The package can be accessed for free directly from the Microsoft Research Open Data site.


At Interspeech 2018 in Hyderabad, Microsoft tested its Indian Language Speech Corpus. Participants in a Low Resource Speech Recognition Challenge used data from the package to build their automatic speech recognition (ASR) systems and develop new speech recognition models. They were given a baseline system to use as a starting point and to compare their own systems against.
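
For readers wondering how an ASR system is typically compared against a baseline, a common yardstick is word error rate (WER): the word-level edit distance between a reference transcript and the system's output, divided by the number of reference words. The article does not say which metric the challenge used, so the Python sketch below is only an illustration under that assumption. The function name and the sample sentences are hypothetical placeholders and are not drawn from the corpus.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return word-level edit distance divided by the number of reference words.

    Illustrative only: the metric used in the actual challenge is not
    stated in the article.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up placeholder transcripts, not real corpus data.
    reference = "please switch on the kitchen light"
    baseline_output = "please switch the kitchen night"
    candidate_output = "please switch on the kitchen night"

    print("baseline WER: ", word_error_rate(reference, baseline_output))
    print("candidate WER:", word_error_rate(reference, candidate_output))
```

A lower WER on the same test transcripts would suggest the candidate system improves on the baseline, though a real evaluation would average over the full test set rather than a single sentence.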

Notably, this isn't the first time Microsoft has taken a step to ease the integration of Indian languages into speech recognition applications. The Redmond company is already working on a real-time language translation solution specifically for Indian languages. Likewise, under its global Local Language Program (LLP), the software giant provides various Language Interface Packs for Indian languages. A team of researchers at the Microsoft Research Lab in Bengaluru also helps localise the speech and linguistic resources required to build Deep Neural Network (DNN) based models.
