Developing Voice-Activated Software Applications
Updated Crowdsourced Speech Dataset Now Available: 13,905 Hours of Speech in 76 Languages
Nvidia and Mozilla have recently updated a renowned crowdsourced speech dataset, making it one of the world's largest open speech datasets. The updated dataset, available through the Mozilla Common Voice project, contains 13,905 hours of speech in 76 languages, contributed by 182,000 unique voices.
This extensive dataset includes demographic information such as age, gender, and accent, making it a valuable resource for developing voice-enabled services and AI models in various languages, including less commonly represented ones. The dataset now includes 16 new languages: Basaa, Slovak, Northern Kurdish, Bulgarian, Kazakh, Bashkir, Galician, Uyghur, Armenian, Belarusian, Urdu, Guarani, Serbian, Uzbek, Azerbaijani, and Hausa.
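As a rough illustration of how the demographic fields might be used, the sketch below tallies clips by gender and age and counts distinct speakers. It reads a small in-memory sample rather than a real download. The column names (`client_id`, `age`, `gender`, `accents`) and the sample rows are assumptions made for illustration, so check the header of the metadata file you actually download.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt shaped like a tab-separated metadata file.
# Column names and rows are illustrative assumptions, not real data.
SAMPLE_TSV = (
    "client_id\tpath\tsentence\tage\tgender\taccents\tlocale\n"
    "a1\tclip_001.mp3\tHello there\ttwenties\tfemale\t\tsk\n"
    "b2\tclip_002.mp3\tGood morning\t\t\t\tsk\n"
    "c3\tclip_003.mp3\tThank you\tthirties\tmale\t\tsk\n"
    "a1\tclip_004.mp3\tSee you soon\ttwenties\tfemale\t\tsk\n"
)


def load_rows(tsv_text):
    """Parse tab-separated text into a list of dicts keyed by column name."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


def count_by(rows, field):
    """Count clips per value of a field; blank values become 'unreported'."""
    return Counter((row.get(field) or "unreported") for row in rows)


def unique_voices(rows):
    """Count distinct speakers, assuming one client_id per contributor."""
    return len({row["client_id"] for row in rows})


rows = load_rows(SAMPLE_TSV)
gender_counts = count_by(rows, "gender")
age_counts = count_by(rows, "age")
voices = unique_voices(rows)

print(dict(gender_counts))
print(dict(age_counts))
print(voices)
```

Demographic fields are self-reported and often left blank, which is why the sketch counts missing values explicitly instead of dropping those rows.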
The updated dataset can be downloaded from the Mozilla Common Voice website or repository. It is released under an open license that permits both research and commercial use. Mozilla encourages contributors and developers to help expand and improve the dataset.
Additionally, Mozilla provides toolkits, such as those for transcribing audio using open-source Whisper models, to support working with the data securely and privately. For those looking to collaborate or use the dataset for enterprise purposes, Mozilla can provide details on licensing and data access.
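A minimal sketch of local transcription follows. It assumes the open-source `openai-whisper` Python package as the backend, which may differ from what Mozilla's toolkits actually use. The helper names `whisper_transcriber` and `transcribe_clips` are hypothetical. The model runs on the local machine, so audio files are not sent to an external service.

```python
from pathlib import Path


def whisper_transcriber(model_name="base"):
    """Return a function that transcribes one audio file with a local model.

    Assumes the `openai-whisper` package is installed. It is imported lazily
    so the rest of this sketch can run without it.
    """
    import whisper  # assumption: pip install openai-whisper

    model = whisper.load_model(model_name)
    return lambda path: model.transcribe(str(path))["text"].strip()


def transcribe_clips(paths, transcribe):
    """Map each clip's file name to its transcript.

    `transcribe` is any callable taking a path and returning text, so the
    backend can be swapped or stubbed out for testing.
    """
    return {Path(p).name: transcribe(p) for p in paths}
```

With the package installed, a call might look like `transcribe_clips(["clip_001.mp3"], whisper_transcriber("base"))`, where the file name is a placeholder. Passing the transcriber in as a parameter keeps the batching logic independent of any one speech model.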
Mozilla is also working on an initiative to create a data collective ("marketplace") to facilitate controlled sharing and licensing of curated datasets, which may include this speech dataset in the future.
For more information, visit the Mozilla Common Voice site. Toolkits from Mozilla and EleutherAI, available on platforms such as Mozilla.ai Blueprints, can also be used to access the dataset or to build similar ones.
- The updated Mozilla Common Voice dataset, with 13,905 hours of speech in 76 languages, supports the development of voice-enabled services and AI models.
- The dataset, which covers voices across many demographics and languages, is open for research and commercial use. Contributors are encouraged to enhance and curate it with Mozilla's toolkits, and it may later be offered through Mozilla's planned data collective (marketplace).