Unreal Speech

Unreal Speech - Cheapest Text-to-Speech API with 300ms latency

Launched on Feb 23, 2025

Unreal Speech is a Text-to-Speech API service offering 300ms ultra-low latency streaming and 48 voices across 8 languages. Built on the open-source Kokoro TTS 82M parameter model, it delivers the cheapest pricing in the market—up to 11x cheaper than ElevenLabs. Ideal for developers, content creators, and enterprises building voice applications.

Tags: AI Audio, Freemium, Enterprise, Multi-language, Text to Speech, API Available, Open Source

What is Unreal Speech

Developing high-quality text-to-speech capabilities has long presented significant challenges for developers and businesses alike. Traditional TTS solutions often force a difficult tradeoff between quality, cost, and latency—enterprises requiring natural-sounding voice synthesis historically faced price points starting at $50-100 per month, with response times that made real-time applications impractical. These constraints have limited the adoption of voice technology across content creation, accessibility tools, and interactive applications.

Unreal Speech addresses these pain points by positioning itself as the most cost-effective Text-to-Speech API solution available. The platform delivers audio generation at a cost structure 11 times lower than ElevenLabs, making enterprise-grade voice synthesis accessible to startups, independent developers, and content creators. The service achieves this through its foundation on the open-source Kokoro TTS model, an 82M parameter architecture that balances computational efficiency with output quality.

The platform processes over 7 billion characters monthly, demonstrating production-grade reliability at scale. Enterprise customers like Listening.com have integrated Unreal Speech to handle demanding workloads—processing over 10,000 pages per hour while achieving 75% cost savings compared to previous TTS providers. This combination of ultra-low pricing and proven scalability has made Unreal Speech a preferred choice for applications ranging from podcast production to IVR systems.

Core Highlights
  • 300ms ultra-low latency streaming responses
  • 48 voices across 8 languages
  • Industry's most affordable TTS API
  • Per-word timestamp functionality for subtitle synchronization
  • Built on open-source Kokoro TTS foundation

Core Features of Unreal Speech

Unreal Speech provides a comprehensive API suite designed to handle diverse text-to-speech requirements, from real-time streaming to large-scale batch processing. Each endpoint is optimized for specific use cases, enabling developers to select the most appropriate tool for their application.

Streaming Audio API (/stream) delivers instant voice synthesis for short-form content with latencies as low as 300ms. This endpoint handles texts up to 1,000 characters and is ideal for voice assistants, real-time interactive applications, and any scenario where immediate audio feedback is critical. The synchronous design ensures predictable response times suitable for production deployments.

Standard Speech API (/speech) serves medium-length text conversion needs, processing up to 3,000 characters per request. The endpoint achieves approximately 1 second per 700 characters processing speed and returns both MP3 audio files and JSON URLs containing timestamp data. This feature enables applications requiring precise alignment between spoken text and audio positioning.

Asynchronous Long Audio Tasks (/synthesisTasks) handle large-scale audio generation workloads with texts up to 500,000 characters, equivalent to approximately 10 hours of audio output. The asynchronous architecture returns a TaskId for status polling, making this endpoint well suited to audiobook production, educational content generation, and batch processing workflows. Users have reported generating 6-hour audiobooks in under 4 minutes using this endpoint.
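The TaskId polling pattern described above can be sketched as a small loop. This is a minimal illustration, not the service's client: fetch_status is a hypothetical caller-supplied function that looks up a task, and the "status" field with its "completed" and "failed" values is an assumption; check the API documentation for the real response shape.

```python
import time
from typing import Callable, Dict


def wait_for_task(
    task_id: str,
    fetch_status: Callable[[str], Dict],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll a task until it finishes, fails, or the timeout elapses.

    fetch_status is supplied by the caller (hypothetical); it should return
    a dict describing the task. The "status" key and its values are assumed.
    """
    waited = 0.0
    while True:
        task = fetch_status(task_id)
        status = task.get("status")
        if status == "completed":
            return task
        if status == "failed":
            raise RuntimeError(f"Task {task_id} failed: {task}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Task {task_id} not finished after {waited:.0f}s")
        # Wait before asking again so the API is not hammered with requests.
        sleep(poll_seconds)
        waited += poll_seconds
```

Injecting sleep keeps the loop easy to exercise without real delays; in production the default time.sleep is used.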

Per-word Timestamps represent a distinctive capability in the TTS market. Unlike competitors, Unreal Speech provides precise word-level or sentence-level timestamp data, enabling applications like synchronized highlighting, subtitle generation, and language learning tools. The WebSocket endpoint /streamWithTimestamps delivers timestamp data in real-time during streaming.
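To illustrate how word-level timestamps feed subtitle generation, the sketch below groups words into SRT cues. The input shape (a list of dicts with "word", "start", and "end" in seconds) is an assumption for illustration; the actual timestamp JSON may use different field names.

```python
from typing import Dict, List


def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Dict], words_per_cue: int = 7) -> str:
    """Group consecutive words into numbered SRT cues.

    Each cue starts at its first word's start time and ends at its last
    word's end time. The word dict layout is assumed, not documented.
    """
    cues = []
    for index, offset in enumerate(range(0, len(words), words_per_cue), start=1):
        group = words[offset:offset + words_per_cue]
        text = " ".join(w["word"] for w in group)
        start = format_srt_time(group[0]["start"])
        end = format_srt_time(group[-1]["end"])
        cues.append(f"{index}\n{start} --> {end}\n{text}\n")
    # A blank line separates cues in the SRT format.
    return "\n".join(cues)
```

The same grouping logic would drive synchronized highlighting: instead of writing cues, a player compares the playback position against each word's start and end.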

Multi-language Support encompasses 48 distinct voices across 8 languages: English (American and British variants), French, Hindi, Spanish, Japanese, Chinese, Italian, and Portuguese. Voice options include female voices (Sierra, Scarlett, Hannah, Emily, Ivy, Kaitlyn, Luna, Willow, Lauren) and male voices (Noah, Jasper, Caleb, Ronan, Ethan, Daniel, Zane, Rowan).

Audio Parameter Controls allow fine-tuning across bitrate (16k-320k), speed (-1.0 to 1.0), pitch (0.5 to 1.5), and encoding format (libmp3lame, pcm_mulaw).
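A small validator can catch out-of-range values before a request is sent. The sketch below uses the ranges listed above; the bitrate, speed, and pitch keys follow the request examples in this article, while the "codec" key name is a hypothetical placeholder because the source does not name that field.

```python
ALLOWED_CODECS = {"libmp3lame", "pcm_mulaw"}


def build_audio_params(
    bitrate_kbps: int = 192,
    speed: float = 0.0,
    pitch: float = 1.0,
    codec: str = "libmp3lame",
) -> dict:
    """Validate audio settings against the documented ranges.

    Returns string-valued fields in the style of the request examples.
    The "codec" key is an assumed name, not confirmed by the source.
    """
    if not 16 <= bitrate_kbps <= 320:
        raise ValueError("bitrate must be between 16k and 320k")
    if not -1.0 <= speed <= 1.0:
        raise ValueError("speed must be between -1.0 and 1.0")
    if not 0.5 <= pitch <= 1.5:
        raise ValueError("pitch must be between 0.5 and 1.5")
    if codec not in ALLOWED_CODECS:
        raise ValueError(f"codec must be one of {sorted(ALLOWED_CODECS)}")
    return {
        "bitrate": f"{bitrate_kbps}k",
        "speed": f"{speed:g}",  # 0.0 -> "0", -0.5 -> "-0.5"
        "pitch": f"{pitch:g}",  # 1.0 -> "1"
        "codec": codec,
    }
```

The returned dict can be merged into a request payload alongside the text and voice fields.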

Advantages
  • Ultra-low latency: 300ms streaming vs seconds for traditional TTS
  • Massive scale: 500,000 character limit supports full audiobook generation
  • Unique timestamps: Per-word sync unavailable from competitors
  • Cost efficiency: 11× cheaper than ElevenLabs with comparable quality
  • Open foundation: Kokoro model enables transparency and customization

Limitations
  • No voice cloning: Currently unavailable, though development is underway
  • Limited voice customization: Pre-set voices only, no custom voice training
  • Language count: 8 languages vs broader offerings from major cloud providers

Application Scenarios

Unreal Speech serves a wide spectrum of industries and use cases, with each API endpoint optimized for specific requirements.

Video and Content Creation teams use the API to batch-generate professional-quality voiceovers at a fraction of traditional recording costs. The standard speech API enables rapid turnaround for video production, while multi-language support facilitates efficient localization workflows. Content creators can produce videos in multiple languages without engaging separate voice talent for each version.

Audiobook Production benefits significantly from the asynchronous long audio task endpoint. The 500,000-character capacity handles full-length books, and users have documented generating 6-hour audiobooks in approximately 4 minutes. This dramatically reduces production timelines that traditionally required weeks or months of studio recording.

Gaming and VR Applications require the streaming API's 300ms latency to deliver real-time dialogue generation. Unlike pre-recorded audio files, dynamic voice synthesis enables responsive NPC interactions and adaptive content delivery based on player choices.

Accessibility Tools benefit from natural-sounding voice output across 48 voice options. The variety enables matching voices to content context—educational materials might use different voices than entertainment applications—while the pricing makes deployment economically viable for non-profit accessibility projects.

Voice Assistants and Chatbots require the natural flow of conversational interaction. Streaming responses eliminate the robotic pauses typical of traditional TTS systems, creating more engaging user experiences for customer service applications and personal assistants.

Online Education platforms utilize per-word timestamps to create synchronized subtitle experiences. Students see highlighted text as audio plays, significantly improving comprehension for language learners and students with hearing impairments.

IVR Phone Systems benefit from natural voice output in multiple languages. Organizations with multilingual customer bases can deploy consistent voice experiences across all supported languages without managing separate TTS vendors.

Podcasting and News operations leverage high concurrency capabilities to produce large volumes of content efficiently. The 500+ simultaneous request capacity supports news outlets requiring rapid audio conversion of breaking stories.

Endpoint Selection Guide
  • Real-time applications (< 1s response): Use /stream endpoint
  • Standard content (1-3K chars): Use /speech endpoint
  • Audiobooks and long-form: Use /synthesisTasks endpoint
  • Synchronized highlighting/subtitles: Add timestamp parameters to any endpoint
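The selection guide can be expressed as a small helper. This is a sketch built only on the character limits quoted in this article (1,000 for /stream, 3,000 for /speech, 500,000 for /synthesisTasks); the function name and the realtime flag are illustrative.

```python
STREAM_MAX_CHARS = 1_000
SPEECH_MAX_CHARS = 3_000
TASK_MAX_CHARS = 500_000


def choose_endpoint(text: str, realtime: bool = False) -> str:
    """Pick an endpoint path from text length and latency needs."""
    length = len(text)
    if length == 0:
        raise ValueError("text must not be empty")
    if realtime:
        # Streaming only accepts short inputs; longer text must be split.
        if length > STREAM_MAX_CHARS:
            raise ValueError("split text into chunks of 1,000 characters or fewer for /stream")
        return "/stream"
    if length <= SPEECH_MAX_CHARS:
        return "/speech"
    if length <= TASK_MAX_CHARS:
        return "/synthesisTasks"
    raise ValueError("text exceeds the 500,000 character limit; split it into several tasks")
```

For example, a 2,500-character article summary maps to /speech, while a full book chapter of 40,000 characters maps to /synthesisTasks.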

Quick Start: Integrating Unreal Speech API

Getting started with Unreal Speech requires minimal setup—developers can begin generating audio within minutes using the provided SDKs and straightforward API endpoints.

Prerequisites: Sign up at unrealspeech.com to obtain your API key from the dashboard. The key authenticates all API requests and tracks usage against your plan's character allocation.

Python Integration uses the popular requests library for synchronous calls:

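import requests

api_key = "YOUR_API_KEY"
url = "https://api.v8.unrealspeech.com/speech"

# Confirm in the API docs whether the key needs a "Bearer " prefix
headers = {
    "Authorization": api_key,
    "Content-Type": "application/json"
}

# Speed accepts -1.0 to 1.0 and pitch 0.5 to 1.5, per the parameter ranges above
payload = {
    "text": "Hello, welcome to the future of text-to-speech.",
    "voiceId": "scarlett",
    "bitrate": "192k",
    "speed": "0",
    "pitch": "1"
}

# A timeout keeps the call from hanging indefinitely on network problems
response = requests.post(url, json=payload, headers=headers, timeout=30)
response.raise_for_status()  # Raise an exception on 4xx/5xx responses
# Per the /speech description above, the response provides the MP3 audio
# and a URL to the timestamp JSON; check the docs for the exact shape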

Node.js Implementation follows similar patterns using axios:

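const axios = require('axios');

// Top-level await is not available in CommonJS modules, so wrap the call
async function synthesize() {
  const response = await axios.post('https://api.v8.unrealspeech.com/speech', {
    text: 'Your text here',
    voiceId: 'noah',
    bitrate: '192k'
  }, {
    headers: { 'Authorization': 'YOUR_API_KEY' }
  });
  return response.data;
}

synthesize().then(console.log).catch(console.error);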

React Native developers access the dedicated hook for streamlined integration:

import { useUnrealSpeech } from '@unrealspeech/react-native';

const { generateSpeech, isGenerating } = useUnrealSpeech('YOUR_API_KEY');

const audio = await generateSpeech({
  text: 'Content to convert',
  voiceId: 'ivy'
});

Command Line (Bash) enables quick testing and scripting:

curl -X POST "https://api.v8.unrealspeech.com/speech" \
  -H "Authorization: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text":"Hello world","voiceId":"scarlett"}'

Streaming Endpoint for real-time applications uses similar payload structures but connects to the /stream endpoint for sub-second response delivery.
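A streaming call might look like the following sketch, which uses only the Python standard library and writes audio chunks to disk as they arrive. The /stream URL is assumed to live on the same host as /speech, and the payload keys mirror the /speech examples above; neither is confirmed here, so verify both against the API documentation.

```python
import json
import urllib.request

# Assumption: /stream is served from the same host as /speech.
STREAM_URL = "https://api.v8.unrealspeech.com/stream"


def build_stream_request(text: str, voice_id: str, api_key: str) -> urllib.request.Request:
    """Build a POST request for the streaming endpoint (1,000 character limit)."""
    if len(text) > 1000:
        raise ValueError("/stream accepts at most 1,000 characters")
    body = json.dumps({"text": text, "voiceId": voice_id, "bitrate": "192k"}).encode("utf-8")
    return urllib.request.Request(
        STREAM_URL,
        data=body,
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def stream_to_file(request: urllib.request.Request, path: str, chunk_size: int = 4096) -> int:
    """Write the response to a file chunk by chunk; returns bytes written."""
    written = 0
    with urllib.request.urlopen(request, timeout=30) as response, open(path, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            written += len(chunk)
    return written
```

In a real-time application the chunks would be handed to an audio player instead of a file, so playback can begin before the full response has arrived.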

Optimization Best Practices
  • Match bitrate to use case: 192k for high-quality audio, 128k for voice-only applications
  • Cache frequently used voice configurations
  • Use WebSocket endpoint for applications requiring timestamp data
  • Monitor API response times in dashboard to optimize timeout settings

Complete API documentation is available at docs.v8.unrealspeech.com/ with additional examples and endpoint specifications.


Technical Architecture: Kokoro TTS Performance

Unreal Speech's capabilities stem from its foundation on Kokoro TTS, an 82M parameter open-source model that represents a significant architectural advancement over previous text-to-speech systems.

Model Architecture combines innovations from StyleTTS 2 and iSTFTNet in a hybrid decoder-only design. The transformer decoder processes text input in a single pass, eliminating the multi-stage pipeline required by older architectures like Tacotron 2. This single-pass generation significantly reduces latency while maintaining output quality. The iSTFTNet vocoder converts intermediate representations to final audio with high fidelity.

The decoder-only approach means the model generates complete audio output without iterative refinement processes. This architectural choice directly contributes to the ultra-low latency performance—traditional systems require separate encoder-decoder stages with potential quality bottlenecks at each transition.

Performance Benchmarks demonstrate impressive real-time capabilities:

  • GPU Performance: Up to 210× real-time on RTX 4090 hardware
  • CPU Performance: 3-11× real-time depending on model variant
  • Typical Latency: 40-70ms on GPU-accelerated inference
  • Concurrency: 500+ simultaneous requests with ~2 second response times

These metrics substantially outperform traditional TTS systems that typically achieve 1-5× real-time on equivalent hardware.
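The "N× real-time" figures translate directly into wall-clock time: synthesis time is audio duration divided by the real-time factor. The short sketch below works through that arithmetic with figures quoted in this article; the function names are illustrative.

```python
def synthesis_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Wall-clock seconds needed to generate audio at a given real-time factor."""
    return audio_seconds / realtime_factor


def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Real-time factor implied by producing audio_seconds in wall_seconds."""
    return audio_seconds / wall_seconds


# One hour of audio at the quoted 210x GPU figure takes about 17 seconds.
one_hour_at_210x = synthesis_seconds(3600, 210)

# A 6-hour audiobook generated in 4 minutes implies 90x real-time end to end.
audiobook_factor = realtime_factor(6 * 3600, 4 * 60)
```

The second figure is lower than the raw 210× benchmark, which is plausible for a hosted service where queuing, encoding, and storage add to pure inference time.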

Model Efficiency is particularly notable: the 82M parameter count is approximately 1/6 the size of XTTS v2 and 1/15 the size of MetaVoice. Smaller model size translates to reduced computational requirements, lower infrastructure costs, and faster cold-start times. Training cost reflects the same efficiency: approximately 500 GPU hours on A100 hardware at an estimated cost of $400, which keeps fine-tuning and customization within reach.

Quality Recognition: Kokoro TTS achieved first place in the HuggingFace TTS Spaces Arena single-voice category, validating that the efficiency gains do not compromise output quality. Side-by-side comparisons show Kokoro achieving quality scores of 4.72 on fiction content, significantly outperforming major cloud TTS services.

Advantages
  • Single-pass architecture: Eliminates multi-stage processing bottlenecks
  • Minimal resource requirements: 82M parameters vs 500M+ in competing models
  • Superior speed: 210× real-time vs 1-5× for traditional systems
  • Open transparency: Apache 2.0 license enables review and modification
  • Quality verified: #1 ranking in independent TTS benchmarking

Limitations
  • No custom voice training: Fine-tuning capabilities limited compared to enterprise solutions
  • Single voice per request: Cannot generate multi-voice audio in single API call
  • Model update frequency: Open-source release cadence may not match commercial competitors

Pricing Plans

Unreal Speech offers tiered pricing designed to serve users from individual developers through enterprise deployments. All plans provide access to the complete API functionality, with differences in character limits and usage terms.

Plan       | Monthly Price | Characters  | Audio Duration   | Overage Rate
Free       | $0            | 250,000     | ~6 hours         | $16/million
Basic      | $4.99         | 3,000,000   | ~67 hours        | $16/million
Plus       | $499          | 42,000,000  | ~933 hours       | $12/million
Pro        | $1,499        | 150,000,000 | ~3,000 hours     | $10/million
Enterprise | $4,999        | 625,000,000 | ~14,000 hours    | $8/million
Custom     | Contact Sales | 1 billion+  | Volume discounts | Negotiated

Plan Details:

The Free tier provides 250,000 characters monthly (approximately 6 hours of audio) with attribution required. This tier enables full API exploration and small project development. Unused characters reset on the 1st of each month.

Basic tier at $4.99/month serves individual developers and small projects requiring 3M characters (~67 hours). This plan removes attribution requirements and permits commercial use. Overage charges apply at $16 per million characters beyond the allocation.

Plus tier ($499/month) targets growing businesses with 42M characters (~933 hours). The reduced overage rate of $12/million makes this economical for production applications with predictable usage patterns.

Pro tier ($1,499/month) provides 150M characters (~3,000 hours) for high-volume applications. The $10/million overage rate supports production deployments with some flexibility for traffic variations.

Enterprise tier ($4,999/month) delivers 625M characters (~14,000 hours) with $8/million overage pricing. This tier suits organizations with consistent high-volume requirements and provides the lowest marginal cost.

Custom Enterprise arrangements for requirements exceeding 1B characters include negotiated volume discounts and dedicated support channels.
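To compare plans for a given monthly volume, the listed prices, allocations, and overage rates can be combined into a simple estimate. The sketch below uses only the figures from the table above and assumes overage is billed per character at the plan's per-million rate; it ignores rollover, attribution requirements, and any caps a plan may impose.

```python
# (monthly price in USD, included characters, overage USD per million characters)
PLANS = {
    "Free": (0.00, 250_000, 16.0),
    "Basic": (4.99, 3_000_000, 16.0),
    "Plus": (499.00, 42_000_000, 12.0),
    "Pro": (1499.00, 150_000_000, 10.0),
    "Enterprise": (4999.00, 625_000_000, 8.0),
}


def monthly_cost(plan: str, characters: int) -> float:
    """Estimated monthly bill: base price plus per-character overage."""
    price, included, overage_per_million = PLANS[plan]
    extra_characters = max(0, characters - included)
    return round(price + extra_characters / 1_000_000 * overage_per_million, 2)


def cheapest_plan(characters: int) -> str:
    """Plan name with the lowest estimated bill for this volume."""
    return min(PLANS, key=lambda name: monthly_cost(name, characters))
```

For instance, 5,000,000 characters on Basic is estimated at $4.99 plus 2 million overage characters at $16 per million, or $36.99. Treat the output as a rough planning aid rather than a quote, since billing terms may differ from this simplified model.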

Plan Selection Guide
  • Hobby projects: Free tier with attribution
  • Startups and MVPs: Basic tier ($4.99/month)
  • Production applications: Plus tier ($499/month) for predictable costs
  • High-volume processing: Pro tier with volume discounts
  • Enterprise scale: Contact sales for custom arrangements

Frequently Asked Questions

What languages and voices does Unreal Speech support?

Unreal Speech provides 48 distinct voices across 8 languages: English (American and British variants), French, Hindi, Spanish, Japanese, Chinese, Italian, and Portuguese. Voice options include female voices (Sierra, Scarlett, Hannah, Emily, Ivy, Kaitlyn, Luna, Willow, Lauren) and male voices (Noah, Jasper, Caleb, Ronan, Ethan, Daniel, Zane, Rowan).

Does Unreal Speech support voice cloning?

Voice cloning is not currently supported. The team has indicated that voice cloning functionality is under development. In the meantime, the 48 pre-built voices provide diverse options for most use cases.

How are overage charges calculated when I exceed my monthly limit?

Overage is billed at your plan's per-million-character rate: $16 for Free and Basic, $12 for Plus, $10 for Pro, and $8 for Enterprise. Charges are prorated at your current plan's rate, so you pay only for the characters used beyond your allocation.

Do unused characters roll over to the next month?

For Free tier users, unused characters reset on the 1st of each month. Paid plans (Basic through Enterprise) roll unused characters over to the next billing cycle, ensuring you retain access to prepaid allocations.

Can I use generated audio for commercial purposes?

Yes. All paid plans include commercial usage rights without attribution requirements. Free tier users must provide attribution when using generated audio. The terms permit use in podcasts, videos, applications, and commercial products.

How do I update my payment method?

Navigate to your Dashboard and select "Manage Subscription" to update payment methods, view billing history, or modify plan selections. The dashboard provides full self-service subscription management.

Is there an affiliate or referral program?

Yes, Unreal Speech offers an affiliate program providing 15% recurring commission on referred users' payments. Visit the affiliate portal through your dashboard or the referral link to generate unique tracking links.
