Original listing text, shown exactly as published by the company.
About the role
We’re rebuilding Bosta’s data platform end-to-end: from MongoDB at the source, through a governed semantic layer that LLM-native tools (NL-to-SQL agents, AI analysts, embedded copilots) can sit on top of safely and cheaply. You will help own that rebuild.
This is a hands-on role. You’ll set patterns, write the foundational code, and ship across the stack, from CDC and ingestion through dbt, the semantic layer, and the interfaces that BI tools and AI agents consume.
Job Responsibilities
- End-to-end pipeline work: MongoDB CDC → ingestion → lakehouse → warehouse → dbt → semantic layer → BI/AI consumers
- Co-ownership of architecture decisions with the Data Engineering Lead
- CDC from production MongoDB without degrading operational DB performance; ingestion patterns that make adding a new source a config change, not a project
- Orchestration that’s observable end-to-end (Airflow, Dagster, or Prefect)
- The dbt project: structure, conventions, tests, contracts, exposures, CI
- The semantic / metrics layer (dbt Semantic Layer, Cube, or equivalent) — one canonical definition per business metric
- LLM-readiness: column-level documentation, PII tagging, query cost guardrails, materialized metric tables, and evals on AI-generated SQL
- Migration of existing logic out of the Tableau and Metabase sprawl into modeled, governed sources
Job Qualifications
- 4+ years across data engineering and/or analytics engineering — you’ve spent meaningful time on both sides
- Range across the stack: comfortable shipping a Debezium connector one week and a dbt mart the next
- Deep dbt and SQL — you’ve owned a non-trivial project, not just contributed to one
- Production CDC experience (Debezium, Kafka Connect, Airbyte, or hand-rolled) against operational databases — bonus if that database was MongoDB
- A cloud warehouse you know deeply (Redshift, Snowflake, BigQuery, or Databricks)
- Strong Python; comfortable in Linux, infra-as-code, and CI/CD
- Working understanding of how LLM tooling (RAG, NL-to-SQL, embedded agents) consumes a data platform — and what breaks when the platform isn’t ready
- Strong opinions on modeling, lightly held; bias toward observability
Bonus
- MongoDB schema evolution at scale
- Production semantic-layer rollouts (dbt Semantic Layer, Cube, LookML, MetricFlow)
- Lakehouse formats (Iceberg, Delta, Hudi) or streaming experience (Kafka, Flink, Kinesis)
- Data catalog / lineage tooling (DataHub, Atlan, Collibra)…