Whitepaper · March 25, 2026 · Liminal Research

Capturing the Full Signal

On the data bottleneck in conversational AI and how we approach it

The Data Problem

Frontier models are increasingly bottlenecked not by compute or architecture, but by the quality and structure of their training data. Conversational AI, in particular, suffers from a scarcity of naturalistic, full-duplex audio data that captures the nuances of human dialogue: interruptions, backchannels, overlapping speech, and paralinguistic cues.

This data problem is compounded by the strict regulations surrounding voice data, which rule out social media scraping and non-consensual recording. As a result, raw audio data companies tend to fall back on voice actors reading scripts or recorded phone calls between volunteers on a pre-chosen topic. In both cases, the conversations come out overly formal and stale, lacking the nuances of truly spontaneous speech: interruptions and overlap (research shows that over 40% of speaker turns in natural conversation involve some degree of overlap), filler words, self-correction and backtracking, prosodic entrainment (the subconscious alignment of two people's tones and rhythms as they talk), backchannels (sounds like "mm-hmm" or "yeah" made in the background that are not actual interruptions), and more. Sourcing audio exclusively from voice actors and volunteers also limits diversity, since most labs cannot recruit a group of speakers broad enough to represent real-world demographic distributions.

Much of voice AI so far has focused on transcription quality, synthesis quality, and the modeling of intonation and individual voices (how people speak). But as models are deployed in production systems that converse with real people, the field needs an increased focus on modeling the natural flow of multi-speaker dialogue. The emergence of full-duplex models trained specifically on multi-speaker conversational speech makes this shift especially visible.

Why Voice Matters

Voice matters as a research frontier because speech carries information that text does not preserve. The words spoken are only one layer. Timing and turn-taking determine when a system should speak, interrupt, wait, or yield. Prosody carries emphasis, emotion, uncertainty, urgency, confidence, sarcasm, and social intent. Backchanneling, hesitation, laughter, and repair are central to human coordination over long conversations. Environmental and action-linked audio can also reveal context about the world around a speaker and, in some domains, physical state that vision alone cannot infer.

This is why we think voice will remain a deep research frontier even as text models continue to improve. The bottleneck is no longer just transcription quality or TTS naturalness. It is building systems that can model full-duplex interaction, align speech generation to human preference, evaluate timing and conversational quality, and learn from spoken feedback and real interaction traces rather than transcripts alone.

We believe the field needs research infrastructure for voice comparable to what text and code already have: rights-complete datasets, better labels for paralinguistics and conversational dynamics, reward models for prosody and emotional appropriateness, richer evaluation suites, and synthetic environments for rapid experimentation. In the near term, that serves pre-training, evaluations, post-training, and RL for voice agents. In the longer term, it supports broader scientific progress on how machines should listen, interpret, and speak in ways that are natural, context-aware, and socially coherent.

Our Approach

Liminal tackles this problem by developing raw and synthetic data pipelines for sourcing, annotating, and generating realistic spontaneous conversations. The name is deliberate: liminal means threshold, and the most important signals in speech live precisely in that space, between words, between speakers, between what is said and what is meant.

Our data is high-quality, multi-channel, transcribed, annotated with emotion and disfluency tags, and diverse, representing real-world distributions in terms of language, accent, ethnicity, occupation, age, and more.

Raw and Synthetic Data

Raw data is the grounding layer of our system. It is not simply a collection of audio files, but rights-complete conversational interaction data that preserves what speech models lose when audio is reduced to text: timing and turn-taking, prosody and emphasis, speaker state, emotion, overlap, backchannels, pauses, self-correction, ambient context, and the acoustic artifacts of real devices and environments. As voice systems shift from cascaded STT, LLM, and TTS pipelines toward native speech-to-speech and full-duplex architectures, these signals become increasingly important. The task is no longer just recognizing words or generating clean audio. It is modeling interaction itself.

We collect this raw data from several high-value sources, including proprietary consumer applications with explicit AI data donation flows, exclusive partnerships with creators and live conversational environments, and privacy-preserving enterprise pipelines that allow companies to contribute conversational traces without exposing sensitive information. The value of these sources is not only scale, but provenance and recurrence: they generate authentic interaction traces that reveal how real people speak across accents, languages, occupations, ages, devices, and settings, and how those patterns change under different tasks and contexts.

We structure this data to be usable for modern model pipelines rather than as flat recordings. Conversations are segmented and aligned at the speaker and turn level, preserved as multi-channel audio where possible, transcribed, and annotated for emotion, disfluency, interruption, backchannels, timing, and other conversational events. This makes the data useful not only for pre-training, but also for evaluations, reward modeling, post-training, and RL, where the target is often not just semantic correctness but naturalness, responsiveness, and human preference.
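To make this concrete, here is a minimal sketch of what a turn-level record in such a structure might look like. The field names and types are illustrative assumptions, not our production schema.

```python
from dataclasses import dataclass, field

@dataclass
class ConversationalEvent:
    """A timed paralinguistic event within a turn (illustrative)."""
    kind: str        # e.g. "backchannel", "interruption", "filler", "pause"
    start_s: float   # onset, in seconds from the start of the conversation
    end_s: float     # offset, in seconds

@dataclass
class Turn:
    """One speaker turn, aligned to multi-channel audio."""
    speaker_id: str
    channel: int              # channel index in the multi-channel recording
    start_s: float
    end_s: float
    transcript: str           # verbatim, with disfluencies preserved
    emotion: str              # coarse emotion tag, e.g. "neutral", "amused"
    events: list[ConversationalEvent] = field(default_factory=list)

    def overlaps(self, other: "Turn") -> bool:
        """True if two turns by different speakers overlap in time."""
        return (self.speaker_id != other.speaker_id
                and self.start_s < other.end_s
                and other.start_s < self.end_s)
```

A representation like this keeps overlap, timing, and paralinguistic events queryable, which is what makes the same corpus reusable across pre-training, evaluation, and reward modeling.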

Raw data alone is not enough. It has its own limitations: it is expensive to collect, constrained by rights and privacy, difficult to annotate at scale, and hard to target toward the exact edge cases a model currently fails on. This is where synthetic data becomes uniquely powerful.

We also have an ambitious vision for synthetic data's unique capabilities, both in augmenting conversational speech datasets and in providing quickly accessible data for autonomous research.

Synthetic data is widely used by ML labs building voice models because it augments limited and expensive raw audio datasets. However, most labs build rudimentary synthetic pipelines: one-shot transcript generation from a flat LLM, followed by audio synthesis from a TTS system. Such pipelines fail to capture both the prosodic nuances of conversation and transcript-level dynamics like interruption, filler words, backchannels, and turn timing.
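To make the critique concrete, here is a minimal sketch of that one-shot pattern; the `llm` and `tts` callables are hypothetical stand-ins for a chat model and a TTS engine, not references to any particular API.

```python
from typing import Callable

def naive_synthetic_dialogue(
    topic: str,
    llm: Callable[[str], str],         # stand-in for a one-shot chat-model call
    tts: Callable[[str, str], bytes],  # stand-in for TTS: (text, voice) -> audio
) -> bytes:
    """One-shot pipeline: flat LLM transcript -> per-line TTS -> concatenation.

    This is the pattern critiqued above: turns alternate strictly, never
    overlap, and carry no timing, fillers, or backchannels.
    """
    transcript = llm(f"Write a dialogue between speakers A and B about {topic}.")
    chunks = []
    for line in transcript.splitlines():
        speaker, sep, text = line.partition(": ")
        if sep:  # keep only lines of the form "Speaker: text"
            chunks.append(tts(text, speaker))
    return b"".join(chunks)  # clean, strictly sequential, unnaturally tidy audio
```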

We do not see synthetic data as a replacement for raw data, but as the control layer calibrated against it. Raw data captures the true distribution of speech and interaction; synthetic data lets us densify rare scenarios, stress specific failure modes, create domain-specific coverage, and materialize new datasets quickly as architectures and research needs change.

By treating synthetic speech data as a priority rather than an afterthought, we are building pipelines that properly model real-world conversation across use cases such as pre-training, evaluations, post-training, and RL. For each aspect of conversational speech we observe (interruptions, overlap, backchannels, entrainment, filler words, pauses, self-correction, and more), we build a robust model that targets it. We can predict these phenomena in real-world transcripts with multiple models, then use speech-conditioned voice models to better capture the sound and rhythm of back-and-forth speech.
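As one illustration of what targeting a specific aspect can mean, here is a minimal sketch of probabilistic filler and backchannel injection into an aligned transcript. The token inventories and rates are illustrative placeholders; in practice they would be calibrated against raw data.

```python
import random
from typing import Optional

# Illustrative inventories and rates; in practice these would be calibrated
# against distributions measured in raw conversational data.
FILLERS = ["um", "uh", "like", "you know"]
BACKCHANNELS = ["mm-hmm", "yeah", "right"]

def inject_disfluencies(
    turns: list[dict],              # [{"speaker", "text", "start_s", "end_s"}, ...]
    filler_rate: float = 0.05,      # per-word probability of a preceding filler
    backchannel_rate: float = 0.3,  # per-turn probability of a listener backchannel
    rng: Optional[random.Random] = None,
) -> list[dict]:
    """Insert fillers into turns plus short overlapping listener backchannels.

    Backchannels are emitted as brief turns by the non-speaking party that
    overlap the main turn, so a downstream voice model can render overlap.
    """
    rng = rng or random.Random()
    out = []
    for turn in turns:
        words = []
        for w in turn["text"].split():
            if rng.random() < filler_rate:
                words.append(rng.choice(FILLERS) + ",")
            words.append(w)
        out.append({**turn, "text": " ".join(words)})
        if rng.random() < backchannel_rate:
            midpoint = (turn["start_s"] + turn["end_s"]) / 2
            out.append({
                "speaker": "B" if turn["speaker"] == "A" else "A",
                "text": rng.choice(BACKCHANNELS),
                "start_s": midpoint,
                "end_s": midpoint + 0.4,  # short, overlapping the main turn
            })
    return out
```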

Furthermore, our synthetic pipelines, from the transcript to the disfluency injections to the voice, are all conditioned on human profiles sampled from real-world distributions specific to the intended audience of the downstream model being trained. By taking global and regional census data, we can sample for things like language, accent, ethnicity, speech impediments, occupation, and more to generate more realistic conversation in both context and style.
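A hedged sketch of that sampling step follows, with made-up marginal distributions standing in for real census data matched to a downstream model's audience.

```python
import random

# Made-up marginal distributions; a real pipeline would use global and
# regional census data matched to the downstream model's target audience.
PROFILE_MARGINALS = {
    "language":   {"en": 0.6, "es": 0.2, "hi": 0.1, "fr": 0.1},
    "accent":     {"general": 0.5, "southern_us": 0.2, "indian": 0.2, "scottish": 0.1},
    "age_band":   {"18-29": 0.25, "30-44": 0.30, "45-64": 0.30, "65+": 0.15},
    "occupation": {"service": 0.3, "office": 0.3, "trades": 0.2, "student": 0.2},
}

def sample_profile(rng: random.Random) -> dict[str, str]:
    """Draw one speaker profile from the marginals above."""
    profile = {}
    for attribute, dist in PROFILE_MARGINALS.items():
        values, weights = zip(*dist.items())
        profile[attribute] = rng.choices(values, weights=weights, k=1)[0]
    return profile

# Condition a generated conversation on two independently sampled speakers.
rng = random.Random(7)
speakers = [sample_profile(rng) for _ in range(2)]
```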

The advantage is therefore not raw or synthetic data in isolation, but the calibration loop between them: collect real interaction data, identify what is missing or underrepresented, and generate targeted synthetic data to close the gap.
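A minimal sketch of how that gap identification might look, comparing scalar feature rates between real and synthetic corpora; a real system would compare full distributions, and the features and numbers here are illustrative.

```python
def coverage_gaps(
    real_stats: dict[str, float],       # feature -> rate measured in raw data
    synthetic_stats: dict[str, float],  # same features in the synthetic corpus
    tolerance: float = 0.2,             # allowed relative deviation
) -> dict[str, float]:
    """Return features where synthetic data drifts from the real distribution.

    Values are relative errors: targets for the next generation round.
    """
    gaps = {}
    for feature, real_rate in real_stats.items():
        if real_rate <= 0:
            continue
        rel_err = abs(synthetic_stats.get(feature, 0.0) - real_rate) / real_rate
        if rel_err > tolerance:
            gaps[feature] = rel_err
    return gaps

# Example: overlap and backchannels are underrepresented, so the next
# synthetic batch would be generated with higher target rates for both.
real  = {"overlap_rate": 0.42, "fillers_per_100_words": 4.1, "backchannel_rate": 0.30}
synth = {"overlap_rate": 0.10, "fillers_per_100_words": 3.8, "backchannel_rate": 0.05}
print(coverage_gaps(real, synth))  # ≈ {'overlap_rate': 0.762, 'backchannel_rate': 0.833}
```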

Future Vision

Liminal is not only a data company. It is an AI research company building the longitudinal data infrastructure that next-generation voice, multimodal, and embodied systems will require.

Today, our focus is on serving a rapidly growing and still underdeveloped voice AI field. But the problems we are working on (modeling spontaneous conversation, capturing paralinguistic structure, calibrating synthetic data against real interaction) are not problems that get solved once and disappear. They are research problems that deepen as voice systems become more capable and as audio becomes central to new domains.

Text and code benefit from decades of precedent in both data and architecture. Voice remains a younger and more open frontier. The field is still working through diverse model designs, from cascaded systems built from STT, LLM, and TTS components, to native speech-to-speech models, to full-duplex systems that have to reason about two speakers in a single interactive loop. As these systems improve, the central challenge is shifting from "can a model recognize words?" to "can a model understand and generate the full structure of human conversation?" As that challenge deepens, the data requirements do not shrink; they expand in complexity, annotation depth, and modality.

One of the most compelling frontiers beyond conversational voice is audio as a world-model input for physical AI. In robotics, sound captures physical properties that are fundamentally invisible to vision: the changing pitch of liquid filling a container reveals fill level that cameras cannot see; contact sounds distinguish materials that look identical; room reverberation reveals spatial geometry that is visually occluded. Recent research has demonstrated audio-driven world models that predict future acoustic states to guide robotic manipulation, contact microphones mounted on grippers that let robots distinguish materials by touch sound and detect object states invisible to cameras, and tri-modal systems combining vision, audio, and touch that significantly outperform any single- or dual-modality approach. Spatial hearing AI is enabling robots to isolate, track, and interpret multiple sound sources in three-dimensional space, giving them acoustic awareness that complements visual SLAM in environments where lighting fails or optical sensors are obstructed. Action-linked audio (contact sounds, manipulation noise, environmental acoustics) is emerging as a high-value data category for embodied AI that barely exists in current datasets. As physical AI scales, the demand for grounded, multimodal audio data tied to physical outcomes will grow substantially, and we intend to be in a position to supply it.

This is why our vision extends beyond any single product category. Voice and audio will matter in assistants, wearables, accessibility systems, enterprise agents, robotics, and automotive interfaces, not because talking is novel, but because voice is one of the richest channels humans use to express intent and coordinate action, and because audio more broadly is one of the most underutilized sensory modalities in machine intelligence. The common thread across all of these is that the most valuable data is not generic speech, but longitudinal, context-rich interaction traces where voice is tied to memory, environment, action, and outcomes.

Our work on raw and synthetic data pipelines has also revealed a broader opportunity: synthetic data can facilitate the shift toward more autonomous forms of ML research. The bottleneck for AI labs tends to be the ability to test ideas quickly and in parallel. Typically, after compiling hundreds of architectural ideas from the literature and from intuition, labs have to isolate the best few for actual building. With AI agents, it has become possible to iterate on and parallelize the building of a hundred experiments, each with its own architecture, code, and validation. The only bottleneck remaining is data. Even if we could generate a hundred architectures and their training code, each trainable model still has unique data needs that must be sourced and annotated. A platform where agents and researchers could immediately generate synthetic sample datasets would let such labs cheaply iterate on hundreds of ideas, determine definitively which models work best, and only then invest in more robust datasets to train the winners.
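To illustrate the shape such a platform might take, here is a hypothetical request interface an agent could use. Everything in it, from the field names to the returned handle, is an assumption for illustration, not an existing API.

```python
from dataclasses import dataclass

@dataclass
class DatasetSpec:
    """Hypothetical request an agent might submit for a sample dataset."""
    task: str               # e.g. "full-duplex turn-taking"
    hours: float            # size of the requested sample
    languages: list[str]
    phenomena: list[str]    # e.g. ["overlap", "backchannel", "repair"]
    annotations: list[str]  # e.g. ["emotion", "timing", "disfluency"]

def request_sample(spec: DatasetSpec) -> str:
    """Stand-in for a platform call that returns a dataset handle.

    In the envisioned workflow, each of a hundred parallel experiments
    submits its own spec and trains against the returned sample before
    the lab invests in a robust dataset for the winners.
    """
    # A real implementation would enqueue generation and return a URI.
    return f"synthetic://{spec.task.replace(' ', '-')}/{spec.hours:g}h"
```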

We are already building toward this idea of autonomous research. With internal tools, we have systems in place that are beginning to autonomously find, corroborate, combine, and synthesize experimental architectures, models, and ideas from the literature and from our own intuitions and knowledge. Rather than being bottlenecked on picking one model to optimize, we can have teams of agents experiment on each idea immediately: building them, optimizing them against premade validation metrics, and extracting what works and what doesn't in order to generate new ideas and improve our current pipelines.

If this vision comes true, we plan to be in a position to loosen the existing bottlenecks not only in voice AI but in AI labs in general.