Member of Engineering (Pre-training / Synthetic Data) jobs in United States

Poolside · 16 hours ago

Member of Engineering (Pre-training / Synthetic Data)

Poolside is a company focused on building a world where AI drives economically valuable work and scientific progress. They are seeking a Member of Engineering to work on their data team, improving the quality of pretraining datasets and generating synthetic data at scale. The role involves collaboration with various teams to define data needs and ensure high-quality datasets for training large models.

AI Infrastructure · Artificial Intelligence (AI) · Developer Platform · Foundational AI · Information Technology · Software
H1B Sponsor Likely

Responsibilities

Follow the latest research on LLMs, and on synthetic data generation in particular, and stay familiar with the most relevant open-source datasets and models
Design and implement complex pipelines that can generate large amounts of data while maintaining high diversity and optimizing the resources available
Work closely with other teams such as Pretraining, Posttraining, Evals and Product to ensure alignment on the quality of the models delivered
Continuously measure and refine the quality of the datasets being generated while validating the final data strategy through quantitative data ablation experiments

Qualifications

Machine Learning · Large Language Models · Data Pipeline Engineering · Python Programming · Synthetic Data Generation · Data Quality Optimization · Distributed Data Pipelines · Prompt Engineering · Research Experience · Collaboration Skills

Required

Strong machine learning and engineering background
Experience with Large Language Models (LLMs), including: understanding of how LLMs learn, data ablations and scaling laws, post-training techniques, and training reasoning and agentic models
Experience implementing cost-efficient, complex pipelines that generate synthetic datasets at scale while optimizing for data quality, correctness, diversity, etc.
Experience with evals that track model capabilities (general knowledge, reasoning, math, coding, long-context, etc.)
Experience building trillion-scale pretraining datasets, and familiarity with concepts like data curation, deduplication, data mixing, tokenization, curriculum, impact of data repetition, etc.
Excellent programming skills in Python
Strong prompt engineering skills
Experience working with large-scale GPU clusters and distributed data pipelines
Strong obsession with data quality

Preferred

Authorship of scientific papers on topics such as applied deep learning, LLMs, or source code generation is a nice-to-have
Can freely discuss the latest papers and go into fine detail
Is reasonably opinionated

Benefits

Fully remote work & flexible hours
37 days/year of vacation & holidays
Health insurance allowance for you and dependents
Company-provided equipment
Wellbeing, always-be-learning and home office allowances
Frequent team get-togethers
Great diverse & inclusive people-first culture

Company

Poolside

Poolside is an artificial intelligence platform that offers foundation concepts and infrastructure for writing software code.

H1B Sponsorship

Poolside has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. (Data powered by the US Department of Labor)
Charts (not reproduced): Distribution of Different Job Fields Receiving Sponsorship; Trends of Total Sponsorships
Total sponsorships by year: 2025 (1)

Funding

Current Stage: Growth Stage
Total Funding: $626M
Key Investors: Bain Capital Ventures, Redpoint
2024-10-02 · Series B · $500M
2023-08-24 · Series A · $100M
2023-05-14 · Seed · $26M

Leadership Team

Eiso Kant · Co-CEO & Co-founder
Jason Warner · Co-CEO & Co-founder
Company data provided by Crunchbase