About Maya 1
Maya 1 is an open-source speech model built for expressive voice generation with human emotion and precise voice design. Developed by Maya Research and backed by South Park Commons, Maya 1 represents a new approach to text-to-speech technology that puts control in the hands of developers and creators.
What is Maya 1?
Maya 1 is a 3-billion-parameter, decoder-only transformer model that generates emotional, expressive speech from text. Unlike traditional text-to-speech systems that rely on pre-recorded voice libraries, Maya 1 lets you design voices on demand using natural language descriptions. The model uses the SNAC neural codec to generate compact audio tokens, enabling real-time streaming while maintaining high quality.
The system supports over 20 different emotions that can be inserted directly into text using inline tags. These emotions actually modify the audio waveform, changing pitch, breath patterns, and timing to create authentic emotional expression. This makes Maya 1 suitable for applications ranging from game character voices to customer service AI.
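As a rough illustration, a request pairs a voice description with emotion-tagged text. This is a minimal sketch; the field layout here is an assumption, and the official model card defines the exact prompt format:

```python
# Sketch only: the description wording and structure below are
# illustrative assumptions, not the official Maya 1 prompt format.
voice_description = (
    "Female voice in her early 30s, warm and slightly raspy, "
    "American accent, conversational pacing"
)

# Inline tags such as <laugh> and <whisper> go directly in the text;
# the model modifies pitch, breath, and timing at those positions.
text = (
    "We actually pulled it off! <laugh> "
    "Okay... <whisper> but keep it quiet until the announcement."
)
```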
Key Features
- Natural Language Voice Design: Describe voices using simple English instead of selecting from fixed libraries.
- Inline Emotion Control: Add 20+ emotions directly into text with tags like <laugh>, <cry>, and <whisper>.
- Real-Time Streaming: Low-latency generation suitable for live voice assistants and interactive applications.
- Single GPU Deployment: Runs efficiently on a single GPU with 16GB+ VRAM, making it accessible for developers.
- Open Source License: Apache 2.0 license allows commercial use, modification, and distribution.
- Production-Ready: Includes vLLM integration, automatic prefix caching, and WebAudio compatibility.
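As a rough sketch of what the vLLM integration can look like in practice (the model ID, prompt layout, and sampling values here are assumptions based on the Hugging Face model card; verify against the official docs):

```python
from vllm import LLM, SamplingParams

# Assumption: "maya-research/maya1" is the Hugging Face model ID, and the
# <description="..."> prefix is the voice-design convention; check the
# official model card for the exact interface.
llm = LLM(model="maya-research/maya1", enable_prefix_caching=True)

prompt = (
    '<description="Female voice in her early 30s, warm, American accent"> '
    "We actually pulled it off! <laugh>"
)

params = SamplingParams(temperature=0.4, top_p=0.9, max_tokens=2048)
outputs = llm.generate([prompt], params)

# The generated IDs are SNAC audio tokens, not text; they still need to
# be decoded into a waveform (see Technical Architecture below).
snac_token_ids = outputs[0].outputs[0].token_ids
```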
Technical Architecture
Maya 1 uses a decoder-only transformer architecture similar to language models like Llama, but it generates audio tokens instead of text tokens. The model predicts SNAC neural codec tokens at a rate of 7 tokens per audio frame, which are then decoded into 24 kHz audio. This approach dramatically reduces sequence length compared to raw waveform generation, enabling real-time performance.
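For concreteness, here is a decoding sketch using the open-source snac package. SNAC's 24 kHz model uses three hierarchical codebooks (1 coarse + 2 mid + 4 fine codes per frame, matching the 7-token figure); the interleaving order below is an assumption about how Maya 1 packs frames, not a documented layout:

```python
import torch
from snac import SNAC  # pip install snac

snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

def frames_to_audio(frames):
    """Decode a list of 7-token frames into a 24 kHz waveform.

    Assumptions: token IDs are already mapped into SNAC's codebook
    range, and frames interleave as (coarse, mid, fine, fine, mid,
    fine, fine) -- an illustrative layout, not the documented one.
    """
    coarse, mid, fine = [], [], []
    for f in frames:
        coarse.append(f[0])
        mid.extend([f[1], f[4]])
        fine.extend([f[2], f[3], f[5], f[6]])
    codes = [torch.tensor(c).unsqueeze(0) for c in (coarse, mid, fine)]
    with torch.inference_mode():
        return snac_model.decode(codes)  # shape: (1, 1, num_samples)
```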
Training proceeded in two stages. Pretraining on internet-scale English speech data taught the model how real speech flows. Fine-tuning on curated studio recordings with human-verified descriptions, emotion tags, and accent variations then refined the model for production use. All training data underwent strict preprocessing, including resampling, normalization, silence trimming, and deduplication.
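A rough sketch of what such a preprocessing pass can look like (the trim threshold and normalization scheme are illustrative assumptions, not the values Maya Research used):

```python
import hashlib
import librosa
import numpy as np

def preprocess(path, target_sr=24000, top_db=30):
    """Resample, peak-normalize, and silence-trim one recording.

    The 24 kHz target matches the SNAC codec; top_db is an
    illustrative trim threshold, not a documented training setting.
    """
    y, _ = librosa.load(path, sr=target_sr)        # resample on load
    y = y / (np.max(np.abs(y)) + 1e-9)             # peak-normalize to [-1, 1]
    y, _ = librosa.effects.trim(y, top_db=top_db)  # strip edge silence
    return y

def dedup_key(y):
    # Hash raw samples so exact duplicate clips can be dropped.
    return hashlib.sha256(y.astype(np.float32).tobytes()).hexdigest()
```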
Mission
Maya Research builds emotionally intelligent, native voice models that let everyone speak. The company believes voice AI should not be locked behind proprietary APIs charging per-second fees. By open-sourcing Maya 1, the goal is to make production-quality voice AI accessible to developers worldwide, especially those building for languages and accents underserved by mainstream voice AI systems.
Note: This is an unofficial about page for Maya 1. For the most accurate information, please refer to the official documentation and model card on Hugging Face.