Black Forest Labs
Member of Technical Staff - Large Model Data
Black Forest Labs is the team behind groundbreaking generative models like Stable Diffusion and FLUX, and treats high-quality datasets as central to training them. This role builds the scalable data systems and infrastructure that support frontier research: ensuring data quality and optimizing workflows for massive datasets.
Responsibilities
Develop and maintain scalable infrastructure for acquiring massive-scale image and video datasets, the kind where "large" means billions of assets, not millions
Manage and coordinate data transfers from licensing partners, turning heterogeneous sources into training-ready pipelines
Implement and deploy state-of-the-art ML models for data cleaning, processing, and preparation, because at this scale manual curation isn't an option
Build scalable tools to visualize, cluster, and deeply understand what's actually in our datasets (because you can't fix what you can't see)
Optimize and parallelize data processing workflows to handle billion-scale datasets efficiently across both CPUs and GPUs (a minimal sketch of one such pass follows this list)
Ensure data quality, diversity, and proper annotation, including captioning systems that make training datasets actually useful
Transform user preference data and alternative sources into formats that models can learn from
Work directly in the model development loop, updating datasets as training trajectories reveal what we're missing
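As a concrete illustration of the parallelization work above, here is a minimal sketch of a CPU-parallel image-filtering pass over a single shard, assuming local JPEG/PNG files; the shard directory, resolution floor, and blur threshold are illustrative stand-ins, not details from the posting.

    import os
    from multiprocessing import Pool

    import cv2  # OpenCV, one of the libraries named in the posting

    MIN_SIDE = 256          # hypothetical minimum resolution for training
    BLUR_THRESHOLD = 100.0  # hypothetical variance-of-Laplacian cutoff

    def check_image(path):
        """Decode one image and decide keep/drop with cheap heuristics."""
        img = cv2.imread(path, cv2.IMREAD_COLOR)
        if img is None:  # corrupt or unreadable file
            return path, False
        h, w = img.shape[:2]
        if min(h, w) < MIN_SIDE:
            return path, False
        # Variance of the Laplacian is a standard, cheap blur proxy.
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        return path, cv2.Laplacian(gray, cv2.CV_64F).var() >= BLUR_THRESHOLD

    def filter_shard(shard_dir, workers=None):
        """Fan the per-image checks out across CPU cores."""
        paths = [os.path.join(shard_dir, f) for f in os.listdir(shard_dir)
                 if f.lower().endswith((".jpg", ".jpeg", ".png"))]
        with Pool(processes=workers or os.cpu_count()) as pool:
            results = pool.map(check_image, paths, chunksize=64)
        return [p for p, keep in results if keep]

    if __name__ == "__main__":
        kept = filter_shard("shard_00000")  # hypothetical shard directory
        print(f"kept {len(kept)} images")

The same fan-out pattern extends to GPUs by replacing the per-image check with batched model inference over many images at once.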
Qualifications
Required
Strong proficiency in Python and experience with various file systems for data-intensive manipulation and analysis
Hands-on familiarity with cloud platforms (AWS, GCP, or Azure) and Slurm/HPC environments for distributed data processing
Experience with image and video processing libraries (OpenCV, FFmpeg, etc.) and an understanding of their performance characteristics (see the metadata-probing sketch after this list)
Demonstrated ability to optimize and parallelize data workflows across both CPUs and GPUs; at this scale, inefficient code is unusable code
Familiarity with data annotation and captioning processes for ML training datasets
Knowledge of machine learning techniques for data cleaning and preprocessing (because heuristics only get you so far)
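One way the FFmpeg performance point plays out in practice: probing container metadata with ffprobe is orders of magnitude cheaper than decoding frames, so a first triage pass over billions of clips can stay metadata-only. A minimal sketch, assuming ffprobe is on PATH; the resolution and duration floors are illustrative.

    import json
    import subprocess

    def probe_video(path):
        """Return width/height/duration for the first video stream, or None."""
        cmd = [
            "ffprobe", "-v", "error",
            "-select_streams", "v:0",
            "-show_entries", "stream=width,height,duration",
            "-of", "json", path,
        ]
        out = subprocess.run(cmd, capture_output=True, text=True)
        if out.returncode != 0:
            return None  # unreadable container: route to a quarantine list
        streams = json.loads(out.stdout).get("streams", [])
        return streams[0] if streams else None

    def keep(path, min_side=256, min_seconds=2.0):
        """Cheap keep/drop decision without decoding a single frame."""
        meta = probe_video(path)
        if meta is None:
            return False
        if min(int(meta["width"]), int(meta["height"])) < min_side:
            return False
        # Some containers omit per-stream duration; treat missing as 0.
        return float(meta.get("duration", 0.0)) >= min_seconds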
Preferred
Have built or contributed to large-scale data acquisition systems and understand the operational challenges
Bring experience with NLP techniques for image/video captioning
Have implemented data deduplication at billion-record scale and understand the tradeoffs (a PySpark sketch follows this list)
Know your way around big data frameworks like Apache Spark or Hadoop
Have been part of shipping a state-of-the-art model and understand how data decisions impact training outcomes
Think deeply about ethical considerations in data collection and usage
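Tying together the deduplication and Spark items above, here is a minimal PySpark sketch of the cheap first stage: exact dedup on precomputed content hashes. The manifest paths and the sha256 column are hypothetical. Near-duplicates (re-encodes, resizes, crops) need perceptual hashes or embedding similarity on top of this, which is where the real tradeoffs live.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("exact-dedup").getOrCreate()

    # Hypothetical Parquet manifest with one row per asset, including a
    # precomputed sha256 of the file bytes.
    manifest = spark.read.parquet("s3://bucket/manifests/images/")
    before = manifest.count()

    # Keep one row per content hash; byte-identical copies are the easy wins.
    deduped = manifest.dropDuplicates(["sha256"])
    print(f"dropped {before - deduped.count()} exact duplicates")

    deduped.write.mode("overwrite").parquet("s3://bucket/manifests/images_dedup/")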
Company
Black Forest Labs
We're a leading frontier AI research lab, building advanced technology that shapes how the world understands and creates visual media.
Funding
Current Stage
Early Stage