Data Engineer
Location: Remote/Hybrid
Experience: 3 - 6 years
About BuildingBlocks
Whether delivering outstanding client work or helping build our company from the inside, our diverse, talented team shares a single mission: to create beautiful experiences that positively impact people's lives and the businesses we serve.
Our culture is more purpose than job, blending the pursuit of genuine human connection with a meticulous approach to the things we make. All with an optimistic spirit. We're a company young enough to be fun and rebellious, yet bold and driven enough to make a real impact on the industry.
Role Overview
You will own the data pipeline powering our LLM training and fine-tuning. This includes ingestion,
cleaning, deduplication, and building high-quality datasets from structured/unstructured sources.
Responsibilities
- Design ETL pipelines for text, PDFs, and structured data.
- Implement data deduplication, filtering (toxicity, PII), and normalization.
- Train and manage tokenizers (SentencePiece/BPE).
- Build datasets for supervised fine-tuning and evaluation.
- Work closely with domain experts to generate instruction/response pairs.
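To give a flavor of the deduplication and normalization work described above, here is a minimal sketch of exact-match deduplication over a document collection. The function names and normalization rules are illustrative assumptions, not part of our stack; production pipelines typically add near-duplicate detection (e.g. MinHash) on top of this.

```python
import hashlib
import re

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different
    # copies of the same document hash to the same value.
    return re.sub(r"\s+", " ", text.strip().lower())

def dedupe(docs: list[str]) -> list[str]:
    # Keep the first occurrence of each normalized document.
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```

A near-duplicate filter and PII/toxicity classifiers would slot in after this exact-match pass.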
Requirements
- Strong in Python, SQL, and data wrangling frameworks (Pandas, Spark).
- Experience cleaning and preprocessing large text datasets.
- Familiarity with NLP-specific preprocessing (chunking, embeddings).
- Knowledge of cloud data storage (S3/GCS/Blob).
- Bonus: Prior experience in AI/ML pipelines.
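As an example of the NLP-specific preprocessing mentioned above, here is a minimal sketch of word-level chunking with overlap, the kind of step that precedes embedding or tokenization. The parameter names and defaults are hypothetical; real pipelines usually chunk by tokens rather than words.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    # Split text into overlapping windows of at most max_words words.
    # Overlap preserves context across chunk boundaries.
    words = text.split()
    step = max_words - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

Chunk size and overlap are typically tuned to the embedding model's context window.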