A remote Data & ML role at Protege. Experience with ASR, TTS, speaker modeling, self-supervised speech models, diarization, or multimodal audio models.
Research audio data quality for machine learning
Investigate how audio quality, signal properties, dataset composition, and localized acoustic issues affect downstream model training, evaluation, and deployment.
Develop new metrics, benchmarks, diagnostics, and evaluation frameworks for measuring audio data quality in ways that are predictive of ML model performance.
Speech dataset characterization and metrics
Analyze and summarize Protege’s audio catalog and maintain clear, up-to-date quality scorecards and metrics for key speech datasets.
Develop methods to measure true acoustic properties directly from the waveform, including effective bandwidth, spectral energy distribution, high-frequency roll-off, noise, clipping, reverberation, distortion, and codec artifacts.
Segment-level quality evaluation
Build workflows that evaluate diarized or segmented speech regions, surfacing localized degradation that file-level averages may miss.
Apply multiple complementary quality metrics to detect bandwidth mismatches, resampling artifacts, clipping, reverberation, codec distortion, and other forms of degradation.
Model and data evaluation
Design and run targeted evaluations connecting audio quality issues to downstream model behavior, including ASR performance, speaker embedding stability, learned speech representations, and synthesis quality.
Test which audio quality metrics meaningfully correlate with model outcomes, identify failure modes of existing metrics, and design better alternatives when current approaches are insufficient.
Deterministic filtering and evaluation infrastructure
Translate research findings into reproducible filtering rules, quality gates, and dataset selection strategies that improve dataset consistency across training runs.
Build scalable tools and pipelines for applying audio quality analyses across large datasets, tracking results over time, and making quality signals accessible to researchers, engineers, and data teams.
Cross-functional collaboration
Work closely with ML researchers, data engineers, data operations, and external partners to define, measure, and communicate the value of Protege’s audio data assets.
Near-term: establish a trustworthy audio-quality baseline
Create a trustworthy view of the quality, consistency, signal fidelity, and training-readiness of Protege’s speech and audio datasets, supported by metrics and scorecards the team can operationalize.
Then use targeted evaluations, ablations, and downstream model analysis to connect audio-quality issues to concrete dataset improvements and clearer prioritization over time.
PhD, or an equivalent Master’s degree plus 4+ years of industry experience, in machine learning, audio signal processing, speech technology, computer science, statistics, engineering, or a related quantitative field.
Proven experience designing and running data evaluations, audio analyses, benchmarks, ablations, or slice-based analyses.
Strong understanding of speech/audio data and signal properties, including sampling rates, codecs, bandwidth, spectrograms, reverberation, clipping, noise, and perceptual quality.
Experience developing or critically evaluating metrics, benchmarks, or measurement frameworks for ML systems, data quality, speech technology, or audio signal analysis.
Ability to connect low-level signal properties to downstream machine learning behavior, including model accuracy, robustness, representation quality, speaker consistency, or synthesis quality.
Comfortable moving between research exploration and production implementation: you can formulate hypotheses, run experiments, analyze results, and turn findings into scalable tools or decision rules.
Excellent written and verbal communicator; able to write concise technical docs and explain empirical results clearly.
High ownership and bias toward action; you independently scope questions, design experiments, and drive them to decisions.
Protege Values
Pass the Loved Ones’ Test
We act with integrity and do the right thing — especially when it’s hard and no one is watching.
Always Find a Way
We are resourceful, resilient builders who solve hard problems and push through obstacles.
Go Fast and Grow Fast
Velocity matters. We move with urgency, learn quickly, and continuously improve as individuals and as a company.
Practice Kindness and Candor
We communicate directly and respectfully, building trust through honest feedback and genuine care for one another.
Deliver Together
We win as one team. Collaboration, accountability, and shared ownership drive our success.
Own the Outcome. Hone the Craft.
We take pride in our work, sweat the details, and continuously raise the bar for excellence.