Data Engineer
Location: Remote/Hybrid
Experience: 3 - 6 years
About BuildingBlocks
Whether delivering outstanding client work or helping build our company from the inside, our diverse, talented team shares a single mission: to create beautiful experiences that positively impact people's lives and the businesses we serve.
Our culture is more purpose than job, blending the pursuit of genuine human connection with a meticulous approach to the things we make. All with an optimistic spirit. We're a company young enough to be fun and rebellious, yet bold and driven enough to make a real impact on the industry.
Role Overview
You will own the data pipeline powering our LLM training and fine-tuning. This includes ingestion,
cleaning, deduplication, and building high-quality datasets from structured/unstructured sources.
Responsibilities
- Design ETL pipelines for text, PDFs, and structured data.
- Implement data deduplication, filtering (toxicity, PII), and normalization.
- Train and manage tokenizers (SentencePiece/BPE).
- Build datasets for supervised fine-tuning and evaluation.
- Work closely with domain experts to generate instruction/response pairs.
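To give a flavor of the deduplication and normalization work described above, here is a minimal sketch of exact-match deduplication over a document collection. The function names and normalization rules are illustrative assumptions, not part of our stack; production pipelines typically add near-duplicate detection (e.g. MinHash) on top of this.

```python
import hashlib
import re

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different
    # copies of the same document hash to the same value.
    return re.sub(r"\s+", " ", text.strip().lower())

def dedupe(docs: list[str]) -> list[str]:
    # Keep the first occurrence of each normalized document.
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```

A near-duplicate filter and PII/toxicity classifiers would slot in after this exact-match pass.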
Requirements
- Strong in Python, SQL, and data wrangling frameworks (Pandas, Spark).
- Experience cleaning and preprocessing large text datasets.
- Familiarity with NLP-specific preprocessing (chunking, embeddings).
- Knowledge of cloud data storage (S3/GCS/Blob).
- Bonus: Prior experience in AI/ML pipelines.
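As an example of the NLP-specific preprocessing mentioned above, here is a minimal sketch of word-level chunking with overlap, the kind of step that precedes embedding or tokenization. The parameter names and defaults are hypothetical; real pipelines usually chunk by tokens rather than words.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    # Split text into overlapping windows of at most max_words words.
    # Overlap preserves context across chunk boundaries.
    words = text.split()
    step = max_words - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

Chunk size and overlap are typically tuned to the embedding model's context window.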