Black Forest Labs
Member of Technical Staff - Large Model Data
Black Forest Labs is the team behind groundbreaking generative models like Stable Diffusion and FLUX, and treats high-quality datasets as central to training them. This role builds the scalable data systems and infrastructure that support frontier research: ensuring data quality and optimizing workflows for massive datasets.
Responsibilities
Develop and maintain scalable infrastructure for acquiring massive-scale image and video datasets, the kind where "large" means billions of assets, not millions
Manage and coordinate data transfers from licensing partners, turning heterogeneous sources into training-ready pipelines
Implement and deploy state-of-the-art ML models for data cleaning, processing, and preparation, because at this scale manual curation isn't an option
Build scalable tools to visualize, cluster, and deeply understand what's actually in our datasets (because you can't fix what you can't see)
Optimize and parallelize data processing workflows to handle billion-scale datasets efficiently across both CPUs and GPUs (a minimal sketch of one such pass follows this list)
Ensure data quality, diversity, and proper annotation, including captioning systems that make training datasets actually useful
Transform user preference data and alternative sources into formats that models can learn from
Work directly in the model development loop, updating datasets as training trajectories reveal what we're missing
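As a concrete illustration of the parallelization work above, here is a minimal sketch of a CPU-parallel image-filtering pass over a single shard, assuming local JPEG/PNG files; the shard directory, resolution floor, and blur threshold are illustrative stand-ins, not details from the posting.

    import os
    from multiprocessing import Pool

    import cv2  # OpenCV, one of the libraries named in the posting

    MIN_SIDE = 256          # hypothetical minimum resolution for training
    BLUR_THRESHOLD = 100.0  # hypothetical variance-of-Laplacian cutoff

    def check_image(path):
        """Decode one image and decide keep/drop with cheap heuristics."""
        img = cv2.imread(path, cv2.IMREAD_COLOR)
        if img is None:  # corrupt or unreadable file
            return path, False
        h, w = img.shape[:2]
        if min(h, w) < MIN_SIDE:
            return path, False
        # Variance of the Laplacian is a standard, cheap blur proxy.
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        return path, cv2.Laplacian(gray, cv2.CV_64F).var() >= BLUR_THRESHOLD

    def filter_shard(shard_dir, workers=None):
        """Fan the per-image checks out across CPU cores."""
        paths = [os.path.join(shard_dir, f) for f in os.listdir(shard_dir)
                 if f.lower().endswith((".jpg", ".jpeg", ".png"))]
        with Pool(processes=workers or os.cpu_count()) as pool:
            results = pool.map(check_image, paths, chunksize=64)
        return [p for p, keep in results if keep]

    if __name__ == "__main__":
        kept = filter_shard("shard_00000")  # hypothetical shard directory
        print(f"kept {len(kept)} images")

The same fan-out pattern extends to GPUs by replacing the per-image check with batched model inference over many images at once.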
Qualifications
Required
Strong proficiency in Python and experience with various file systems for data-intensive manipulation and analysis
Hands-on familiarity with cloud platforms (AWS, GCP, or Azure) and Slurm/HPC environments for distributed data processing
Experience with image and video processing libraries (OpenCV, FFmpeg, etc.) and an understanding of their performance characteristics (see the metadata-probing sketch after this list)
Demonstrated ability to optimize and parallelize data workflows across both CPUs and GPUs; at this scale, inefficient code is unusable code
Familiarity with data annotation and captioning processes for ML training datasets
Knowledge of machine learning techniques for data cleaning and preprocessing (because heuristics only get you so far)
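One way the FFmpeg performance point plays out in practice: probing container metadata with ffprobe is orders of magnitude cheaper than decoding frames, so a first triage pass over billions of clips can stay metadata-only. A minimal sketch, assuming ffprobe is on PATH; the resolution and duration floors are illustrative.

    import json
    import subprocess

    def probe_video(path):
        """Return width/height/duration for the first video stream, or None."""
        cmd = [
            "ffprobe", "-v", "error",
            "-select_streams", "v:0",
            "-show_entries", "stream=width,height,duration",
            "-of", "json", path,
        ]
        out = subprocess.run(cmd, capture_output=True, text=True)
        if out.returncode != 0:
            return None  # unreadable container: route to a quarantine list
        streams = json.loads(out.stdout).get("streams", [])
        return streams[0] if streams else None

    def keep(path, min_side=256, min_seconds=2.0):
        """Cheap keep/drop decision without decoding a single frame."""
        meta = probe_video(path)
        if meta is None:
            return False
        if min(int(meta["width"]), int(meta["height"])) < min_side:
            return False
        # Some containers omit per-stream duration; treat missing as 0.
        return float(meta.get("duration", 0.0)) >= min_seconds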
Preferred
Have built or contributed to large-scale data acquisition systems and understand the operational challenges
Bring experience with NLP techniques for image/video captioning
Have implemented data deduplication at billion-record scale and understand the tradeoffs (a PySpark sketch follows this list)
Know your way around big data frameworks like Apache Spark or Hadoop
Have been part of shipping a state-of-the-art model and understand how data decisions impact training outcomes
Think deeply about ethical considerations in data collection and usage
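Tying together the deduplication and Spark items above, here is a minimal PySpark sketch of the cheap first stage: exact dedup on precomputed content hashes. The manifest paths and the sha256 column are hypothetical. Near-duplicates (re-encodes, resizes, crops) need perceptual hashes or embedding similarity on top of this, which is where the real tradeoffs live.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("exact-dedup").getOrCreate()

    # Hypothetical Parquet manifest with one row per asset, including a
    # precomputed sha256 of the file bytes.
    manifest = spark.read.parquet("s3://bucket/manifests/images/")
    before = manifest.count()

    # Keep one row per content hash; byte-identical copies are the easy wins.
    deduped = manifest.dropDuplicates(["sha256"])
    print(f"dropped {before - deduped.count()} exact duplicates")

    deduped.write.mode("overwrite").parquet("s3://bucket/manifests/images_dedup/")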
Company
Black Forest Labs
We're a leading frontier AI research lab, building advanced technology that shapes how the world understands and creates visual media.
Funding
Current Stage
Early Stage