Data Engineering

HuggingFace Dataset Pipelines

I work with large-scale datasets daily for training diffusion models and LLMs. Here's how I handle data at scale.

How I Use HuggingFace Datasets

Streaming Large Datasets

Process TB-scale datasets without downloading them first. Essential for training on machines with limited storage.
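A minimal sketch of the streaming pattern; the dataset name and the "text" column are placeholders, and the buffer size is illustrative.

```python
from datasets import load_dataset

# Dataset name and column are placeholders; any Hub dataset streams the same way.
ds = load_dataset("org/large-text-corpus", split="train", streaming=True)

# Shuffle with a fixed-size buffer so examples are not read strictly in order.
ds = ds.shuffle(seed=42, buffer_size=10_000)

# Examples are fetched lazily as the loop consumes them; nothing is written to disk.
for example in ds.take(3):
    print(example["text"][:80])
```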

Custom Dataset Preparation

Build training datasets from raw images with my captioning pipeline, then upload to the Hub for versioning.
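A sketch of that flow, assuming each image has a sidecar .txt caption next to it; the directory layout and repo name are placeholders, not the exact pipeline.

```python
from pathlib import Path
from datasets import Dataset, Image

# Pair each image with a sidecar .txt caption (layout is an assumption).
image_paths = sorted(Path("raw_images").glob("*.png"))
captions = [p.with_suffix(".txt").read_text().strip() for p in image_paths]

ds = Dataset.from_dict({"image": [str(p) for p in image_paths], "caption": captions})
ds = ds.cast_column("image", Image())  # treat the path column as decodable image data

# Push to the Hub so the dataset is versioned; the repo name is a placeholder.
ds.push_to_hub("username/captioned-training-set", private=True)
```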

Data Filtering & Cleaning

Parallel map/filter operations to clean datasets. Remove duplicates and filter by quality.
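A sketch of the cleaning pass; the file name, the "caption" column, the 5-word quality threshold, and the worker count are all illustrative.

```python
from datasets import load_dataset

ds = load_dataset("json", data_files="raw_captions.jsonl", split="train")

# Normalize text in parallel across CPU cores.
def clean(batch):
    batch["caption"] = [c.strip().lower() for c in batch["caption"]]
    return batch

ds = ds.map(clean, batched=True, num_proc=8)

# Drop low-quality rows, e.g. very short captions (threshold is illustrative).
ds = ds.filter(lambda ex: len(ex["caption"].split()) >= 5, num_proc=8)

# Exact deduplication via a seen-set; single process so the set stays consistent.
seen = set()
def is_new(example):
    key = example["caption"]
    if key in seen:
        return False
    seen.add(key)
    return True

ds = ds.filter(is_new)
```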

Dataset Mixing

Interleave multiple datasets with custom ratios for diverse training data.
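A minimal sketch with interleave_datasets; the dataset names and the 70/30 ratio are placeholders, and both sources must share the same column schema before interleaving.

```python
from datasets import load_dataset, interleave_datasets

# Placeholder dataset names; both are streamed so nothing is fully downloaded.
ds_a = load_dataset("org/chat_dataset_a", split="train", streaming=True)
ds_b = load_dataset("org/chat_dataset_b", split="train", streaming=True)

mixed = interleave_datasets(
    [ds_a, ds_b],
    probabilities=[0.7, 0.3],            # sampling ratio per source
    seed=42,
    stopping_strategy="all_exhausted",   # keep cycling the smaller set until both are used
)
```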

Datasets I Work With

For LLM Training

OpenOrca, SlimOrca, UltraChat, Capybara, OpenHermes

For Diffusion Training

LAION, JourneyDB, custom HDR datasets, studio captures

Optimization Techniques I Use

Streaming mode, memory mapping, multiprocessing (num_proc), Parquet format, SSD caching, batched tokenization
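A sketch of how a few of these combine in one preprocessing run; the tokenizer, file names, sequence length, and worker count are illustrative placeholders.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model

ds = load_dataset("json", data_files="corpus.jsonl", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

# Batched tokenization spread over multiple processes.
ds = ds.map(tokenize, batched=True, batch_size=1_000, num_proc=8,
            remove_columns=["text"])

# Persist as Parquet so later runs memory-map from (ideally SSD-backed) storage.
ds.to_parquet("corpus_tokenized.parquet")
```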

Technology Stack

HuggingFace Datasets, PyArrow, Parquet, Tokenizers, HuggingFace Hub, Polars

Expertise by Sumit Chatterjee

Industrial Light & Magic, Sydney
