Data Engineering
HuggingFace Dataset Pipelines
I work with large-scale datasets daily for training diffusion models and LLMs. Here's how I handle data at scale.
How I Use HuggingFace Datasets
Streaming Large Datasets
Process TB-scale datasets without downloading them first. Essential when training on machines with limited storage.
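A minimal sketch of what streaming looks like in my pipelines; the dataset name here is purely illustrative:

```python
from datasets import load_dataset

# Streaming mode reads shards lazily over the network instead of downloading
# the full dataset to disk first. "laion/laion2B-en" is just an example name.
ds = load_dataset("laion/laion2B-en", split="train", streaming=True)

# Only the current examples are held in memory; take() limits the iteration.
for example in ds.take(1000):
    pass  # preprocess / tokenize / write out local shards here
```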
Custom Dataset Preparation
I build training datasets from raw images using my captioning pipeline, then upload them to the Hub for versioning.
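A sketch of that flow, assuming a local folder of images and a hypothetical `caption_image()` stand-in for the real captioning step; the Hub repo name is also an example:

```python
from pathlib import Path
from datasets import Dataset, Image

def caption_image(path: Path) -> str:
    """Hypothetical stand-in for the actual captioning pipeline."""
    return f"placeholder caption for {path.name}"

image_paths = sorted(Path("raw_images").glob("*.png"))  # assumed local folder

ds = Dataset.from_dict({
    "image": [str(p) for p in image_paths],
    "caption": [caption_image(p) for p in image_paths],
}).cast_column("image", Image())

# Push to the Hub so each training run can pin a specific dataset revision.
ds.push_to_hub("your-username/raw-image-captions", private=True)
```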
Data Filtering & Cleaning
Parallel map/filter operations to clean datasets: remove duplicates and filter out low-quality samples.
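A rough sketch of the pattern, assuming a dataset with a `text` column (the dataset name and quality heuristic are illustrative):

```python
import hashlib
from datasets import load_dataset

ds = load_dataset("teknium/OpenHermes-2.5", split="train")  # example source

# Quality filter: drop short or boilerplate rows; num_proc fans out across cores.
def good_quality(example):
    text = example.get("text") or ""
    return len(text) > 200 and "lorem ipsum" not in text.lower()

ds = ds.filter(good_quality, num_proc=8)

# Exact-duplicate removal: hash each row, keep the first occurrence of each hash.
ds = ds.map(
    lambda ex: {"hash": hashlib.md5((ex.get("text") or "").encode()).hexdigest()},
    num_proc=8,
)
seen, keep_idx = set(), []
for i, h in enumerate(ds["hash"]):
    if h not in seen:
        seen.add(h)
        keep_idx.append(i)
ds = ds.select(keep_idx)
```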
Dataset Mixing
Interleave multiple datasets with custom ratios for diverse training data.
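A toy example of the mechanics with `interleave_datasets`; the two in-memory sources stand in for real corpora (e.g. UltraChat and SlimOrca) after they have been normalised to a shared schema:

```python
from datasets import Dataset, interleave_datasets

# Two toy sources standing in for preprocessed chat / instruction corpora.
chat = Dataset.from_dict({"text": [f"chat sample {i}" for i in range(100)]})
orca = Dataset.from_dict({"text": [f"orca sample {i}" for i in range(100)]})

# Sample roughly 70% from the first source and 30% from the second.
# "all_exhausted" keeps drawing until every source has been fully seen.
mixed = interleave_datasets(
    [chat, orca],
    probabilities=[0.7, 0.3],
    seed=42,
    stopping_strategy="all_exhausted",
)
print(mixed[0])
```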
Datasets I Work With
For LLM Training
OpenOrca, SlimOrca, UltraChat, Capybara, OpenHermes
For Diffusion Training
LAION, JourneyDB, custom HDR datasets, studio captures
Optimization Techniques I Use
Streaming mode, memory mapping, multiprocessing (num_proc), Parquet format, SSD caching, batched tokenization
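Several of these come together in one call: a sketch of batched tokenization over local Parquet shards, assuming a `text` column, an illustrative GPT-2 tokenizer, and a `data/*.parquet` layout:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative tokenizer choice

# Parquet shards load as memory-mapped Arrow tables, so RAM usage stays low.
ds = load_dataset("parquet", data_files="data/*.parquet", split="train")

# Batched tokenization with multiprocessing: each worker tokenizes whole batches,
# and results are cached back to Arrow files on (ideally SSD) storage.
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

ds = ds.map(tokenize, batched=True, batch_size=1000, num_proc=8,
            remove_columns=ds.column_names)
```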
Technology Stack
HuggingFace Datasets, PyArrow, Parquet, Tokenizers, HuggingFace Hub, Polars
Expertise by Sumit Chatterjee
Industrial Light & Magic, Sydney