Applied Research Scientist - Text-to-Speech (TTS)
Job Description
Salient is one of the fastest-growing AI startups in consumer finance. In less than two years, we’ve achieved product-market fit, scaled to 8-figure ARR, and emerged as one of the undisputed leaders in financial voice AI.
A few fast facts:
- Backed by YC and raised the largest Series A for a B2B startup from a16z
- Reached product-market fit in <2 years and scaled to 8-figure ARR
- 19-person team building a speech AI agent that handles millions of real customer calls per day and is fully deployed in production across major financial institutions (not just PoCs)
- We’re on a mission to pass the Turing test for conversational speech in a telephony setting
- In-person office culture in San Francisco, CA
About the Role
We’re looking for an Applied Research Scientist with expertise in Text-to-Speech (TTS) to help us push the boundaries of speech synthesis. You’ll work on developing high-quality, low-latency TTS systems that power real-world applications. The ideal candidate combines deep modeling knowledge with a strong engineering mindset to deliver robust, scalable solutions.
Responsibilities
- Perform engineering tasks related to model training and serving, e.g., data ingestion, data cleaning, and evaluation
- Design and train high-quality, low-latency, state-of-the-art TTS models for real-time agent deployment
- Integrate TTS into cascaded LLM+ASR systems; explore joint optimization and feedback loops
- Lead research efforts on prosody, control, and expressiveness in speech synthesis
- Prototype and evaluate new architectures and training pipelines for high-fidelity voice
- Collaborate with infra and product teams to bring research into production
- Contribute to internal tooling for data processing, model training, and inference benchmarking
Requirements
- Proven track record developing state-of-the-art TTS systems, or an advanced degree in speech synthesis
- Strong modeling skills and experience training deep neural networks for speech synthesis
- Deep understanding of audio modeling, phoneme alignment, vocoders, and real-time inference challenges
- Ability to move from research to working code; this is a hands-on role
- Comfortable working both independently and collaboratively, and defining your own roadmap in an ambiguous, fast-moving environment
- Ability to work 4 days a week from our San Francisco office (open to candidates willing to relocate)
Nice to Have
- Familiarity with multilingual or code-switched TTS
- Experience with voice cloning, style transfer, or emotion conditioning in speech
- Contributions to academic publications or open-source projects in speech
As an early-stage company building at the frontier of AI, we work with high intensity and commitment. While schedules can vary by role and team, many weeks will demand extra focus, flexibility, and time, particularly during major launches and high-impact sprints. We're seeking candidates who are aligned with and able to commit to that expectation, which includes 4 days per week in our San Francisco office.
Compensation Range: $180K - $270K