Founding Data Engineer
Elicit is an AI research assistant that aims to enhance reasoning in the world by helping researchers and decision-makers. The Founding Data Engineer will be responsible for building a comprehensive corpus of academic documents and optimizing data ingestion to support our machine learning systems.
Artificial Intelligence (AI) · Data Center Automation · Database · Information Technology
Responsibilities
Building and optimizing our academic research paper pipeline
You'll architect and implement robust, scalable systems to handle data ingestion while maintaining high performance and quality
You'll work on efficiently deduplicating hundreds of millions of research papers and calculating embeddings (a minimal sketch of this kind of pass follows below)
Your goal will be to make Elicit the most complete and up-to-date database of scholarly sources
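To make the shape of this work concrete, here is a minimal sketch of a dedup-and-embed pass. It assumes a Parquet corpus with paper_id, title, year, and abstract columns; the paths, column names, and embedding model are illustrative assumptions, not Elicit's actual pipeline.

```python
# Minimal sketch of one dedup-and-embed pass over an academic paper corpus.
# All paths, column names, and the model choice are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("paper-dedup-sketch").getOrCreate()
papers = spark.read.parquet("s3://example-bucket/papers/")  # hypothetical path

# Cheap first pass: exact-duplicate removal on a normalized title + year key.
deduped = (
    papers
    .withColumn(
        "dedup_key",
        F.concat_ws(
            "|",
            F.regexp_replace(F.lower("title"), r"[^a-z0-9 ]", ""),
            F.col("year").cast("string"),
        ),
    )
    .dropDuplicates(["dedup_key"])
)

# Embed per partition so each executor loads the model once, not per row.
def embed_partition(rows):
    from sentence_transformers import SentenceTransformer  # assumed dependency
    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model
    rows = list(rows)
    vectors = model.encode([r["abstract"] or "" for r in rows])
    for row, vec in zip(rows, vectors):
        yield (row["paper_id"], vec.tolist())

# RDD of (paper_id, vector) pairs, ready to write to an embedding store.
embeddings = deduped.select("paper_id", "abstract").rdd.mapPartitions(embed_partition)
```

An exact-key pass like this is only the cheap first stage; near-duplicate detection (e.g. MinHash over abstracts) and smarter record merging would build on it.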
Expanding the datasets Elicit works over
Our users want Elicit to work over court documents, SEC filings, … Your job will be to figure out how to ingest and index a rapidly expanding ontology of documents
We also want to support less-structured documents such as spreadsheets and presentations, all the way up to rich media like audio and video
Larger customers often want us to integrate private data into Elicit for their organization to use
We'll look to you to define and build a secure, reliable, fast, and auditable approach to these data connectors (one possible connector contract is sketched below)
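One possible shape for that connector contract, sketched in Python. The interface, class names, and audit mechanism are hypothetical, shown only to illustrate the tenant-isolation and auditability concerns above.

```python
# Hypothetical shape for a private-data connector: every fetch is scoped to an
# organization and leaves an audit trail. Interface and names are illustrative.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterator

@dataclass
class Document:
    org_id: str          # tenant isolation: documents never cross org boundaries
    source_id: str       # stable ID in the upstream system
    content: bytes
    fetched_at: datetime

class DataConnector(ABC):
    """Contract each connector (SharePoint, S3, ...) would implement."""

    @abstractmethod
    def list_changed(self, org_id: str, since: datetime) -> Iterator[str]:
        """Yield source_ids modified since the last sync (incremental, not full)."""

    @abstractmethod
    def fetch(self, org_id: str, source_id: str) -> Document:
        """Fetch one document; implementations must check org-level ACLs."""

    def fetch_audited(self, org_id: str, source_id: str) -> Document:
        doc = self.fetch(org_id, source_id)
        # Stand-in for an append-only audit log entry: who/what/when for
        # every private-data read.
        print(f"AUDIT {datetime.now(timezone.utc).isoformat()} "
              f"org={org_id} source={source_id}")
        return doc
```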
Data for our ML systems
You'll figure out the best way to preprocess all of the data described above to make it useful to our models
We often need datasets for model fine-tuning
You'll work with our ML engineers and evaluation experts to find, gather, version, and apply these datasets in training runs (see the versioning sketch below)
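A small sketch of the dataset-versioning idea: content-hash the serialized examples so a training run can pin exactly the data it saw. File layout and naming are illustrative assumptions.

```python
# Sketch of dataset versioning for fine-tuning runs: content-hash the JSONL
# so a training run can record exactly which dataset it used.
import hashlib
import json
from pathlib import Path

def write_versioned_dataset(examples: list[dict], out_dir: Path) -> Path:
    # Canonical serialization (sorted keys) so the hash is stable.
    payload = "\n".join(json.dumps(ex, sort_keys=True) for ex in examples)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    path = out_dir / f"finetune-{digest}.jsonl"
    path.write_text(payload + "\n", encoding="utf-8")
    return path  # record this path (and digest) in the training run's config

examples = [
    {"prompt": "Summarize the abstract:", "completion": "..."},
    {"prompt": "Extract the sample size:", "completion": "..."},
]
print(write_versioned_dataset(examples, Path(".")))
```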
Start building foundational context
Get to know your team, our stack (including Python, Flyte, and Spark), and the product roadmap
Familiarize yourself with our current data pipeline architecture and identify areas for potential improvement
Make your first contribution to Elicit
Complete your first Linear issue related to our data pipeline or academic paper processing
Have a PR merged into our monorepo, demonstrating your understanding of our development workflow
Gain understanding of our CI/CD pipeline, monitoring, and logging tools specific to our data infrastructure
You'll complete your first multi-issue project
Tackle a significant data pipeline optimization or enhancement project
Collaborate with the team to implement improvements in our academic paper processing workflow
You're actively improving the team
Contribute to regular team meetings and hack days, sharing insights from your data engineering expertise
Add documentation or diagrams explaining our data pipeline architecture and best practices
Suggest improvements to our data processing and storage methodologies
You're flying solo
Independently implement significant enhancements to our data pipeline, improving efficiency and scalability
Make impactful decisions regarding our data architecture and processing strategies
You've developed an area of expertise
Become the go-to resource for questions related to our academic paper processing pipeline and data infrastructure
Lead discussions on optimizing our data storage and retrieval processes for academic literature
You actively research and improve the product
Propose and scope improvements to make Elicit more comprehensive and up-to-date in terms of scholarly sources
Identify and implement technical improvements to surpass competitors like Google Scholar in terms of coverage and data quality
Qualifications
Required
5+ years of experience as a data engineer: owning make-or-break decisions about how to ingest, manage, and use data
Strong proficiency in Python (5+ years experience)
You have created and owned a data platform at rapidly growing startups: gathering needs from colleagues, planning an architecture, deploying the infrastructure, and implementing the tooling
Experience architecting and optimizing large data pipelines, ideally with Spark, and ideally pipelines that directly support user-facing features (rather than internal BI, for example)
Strong SQL skills, including aggregation functions, window functions, UDFs, self-joins, partitioning, and clustering approaches (a window-function example over Parquet follows this list)
Experience with columnar data storage formats like Parquet
Strong opinions, weakly held, about approaches to data quality management
Creative and user-centric problem-solving
You should be excited to play a key role in shipping new features to users—not just building out a data platform!
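As a flavor of the SQL and Parquet skills above, here is an illustrative Spark SQL query that uses a window function to keep only the latest ingested version of each paper. Table, column, and path names are hypothetical.

```python
# Illustrative Spark SQL: a window function picks the latest ingested version
# of each paper from a Parquet table. All names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("window-fn-sketch").getOrCreate()
spark.read.parquet("s3://example-bucket/paper_versions/") \
     .createOrReplaceTempView("paper_versions")

latest = spark.sql("""
    SELECT paper_id, title, ingested_at
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY paper_id
                   ORDER BY ingested_at DESC
               ) AS rn
        FROM paper_versions
    ) AS ranked
    WHERE rn = 1
""")
latest.write.mode("overwrite").parquet("s3://example-bucket/papers_latest/")
```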
Preferred
Experience in developing deduplication processes for large datasets
Hands-on experience with full-text extraction and processing from various document formats (PDF, HTML, XML, etc.; see the extraction sketch after this list)
Familiarity with machine learning concepts and their application in search technologies
Experience with distributed computing frameworks beyond Spark (e.g., Dask, Ray)
Experience in science and academia: familiarity with academic publications, and the ability to accurately model the needs of our users
Hands-on experience with industry-standard tools like Airflow, dbt, or Hadoop
Hands-on experience with standard paradigms like data lake, data warehouse, or lakehouse
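For the full-text extraction point above, a minimal format-dispatched extractor might look like the sketch below; pypdf and BeautifulSoup are assumed dependencies here, not necessarily the tools Elicit uses.

```python
# Minimal sketch of format-dispatched text extraction. pypdf and BeautifulSoup
# are assumed dependencies, chosen only for illustration.
from pathlib import Path

def extract_text(path: Path) -> str:
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        from pypdf import PdfReader
        reader = PdfReader(str(path))
        # Some pages yield no text (scanned PDFs); those would need OCR.
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if suffix in {".html", ".htm", ".xml"}:
        from bs4 import BeautifulSoup
        markup = path.read_text(encoding="utf-8", errors="replace")
        # The "xml" parser needs lxml installed; html.parser ships with Python.
        soup = BeautifulSoup(markup, "xml" if suffix == ".xml" else "html.parser")
        return soup.get_text(separator="\n")
    raise ValueError(f"unsupported format: {suffix}")
```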
Benefits
Fully covered health, dental, vision, and life insurance for you, plus generous coverage for the rest of your family
Flexible vacation policy, with a minimum recommendation of 20 days/year + company holidays
401(k) with a 6% employer match
A new Mac + $1,000 budget to set up your workstation or home office in your first year, then $500 every year thereafter
$1,000 quarterly AI Experimentation & Learning budget, so you can freely experiment with new AI tools to incorporate into your workflow, take courses, purchase educational resources, or attend AI-focused conferences and events
A team administrative assistant who can help you with personal and work tasks
Company
Elicit
Elicit uses language models to help users automate research workflows.
Funding
Current Stage: Early Stage
Total Funding: $31M
Key Investors: Fifty Years
2025-02-26: Series A · $22M
2023-09-25: Seed · $9M
Company data provided by Crunchbase