Steampunk, Inc. · 20 hours ago
Senior AI Evaluation Scientist
Steampunk, Inc. is a Change Agent in the Federal contracting industry, focusing on innovative solutions for clients in various sectors. They are seeking a Senior AI Evaluation Scientist to design and lead evaluation programs for AI systems, ensuring accuracy and alignment with mission outcomes.
Consulting · Information Technology
Responsibilities
Lead the design and implementation of comprehensive evaluation frameworks for generative and predictive AI models, including accuracy, robustness, relevance, trustworthiness, fairness, hallucination rates, and safety
Develop and maintain automated evaluation pipelines that continuously audit model outputs, monitor quality drift, and validate alignment with mission-specific constraints
Create custom benchmark datasets, challenge sets, and adversarial evaluation strategies tailored to client domains and regulatory requirements
Conduct in-depth error analysis, model behavior studies, and sensitivity assessments to inform iterative improvements in prompts, retrieval systems, models, and orchestration frameworks
Partner with AI Product Engineers, LLMOps Engineers, and Data Scientists to drive model improvements through structured experimentation, A/B testing, and scientifically grounded evaluation cycles
Advise teams on measurement methodologies, statistical significance, and best practices for Trustworthy AI evaluation in alignment with NIST AI RMF, MLSecOps, and agency governance requirements
Document evaluation results, risks, and findings for technical and non-technical audiences, including engineering teams, leadership, and government clients
Contribute to the development of standardized tools, reusable templates, and evaluation components to improve repeatability and quality across engagements
Stay informed of advances in LLM assessment, safety science, red-teaming methodologies, and evaluation frameworks emerging from academia and industry
Mentor junior evaluation staff and help grow Steampunk’s AI measurement and evaluation capabilities
Qualifications
Required
Ability to hold a position of public trust with the U.S. government
Bachelor's degree and 8 years of experience
5+ years of experience evaluating machine learning, NLP, or generative AI systems, with strong familiarity with LLMs and retrieval-based architectures
Deep understanding of evaluation metrics, statistical testing, dataset construction, experimental design, and model validation methodologies
Hands-on experience with Python and libraries such as PyTorch, Hugging Face, LangChain, scikit-learn, and evaluation tooling (LLM-as-a-judge, rubric-based evaluators, or custom harnesses)
Proficiency in AI evaluation frameworks such as Ragas
Demonstrated experience designing automated evaluation pipelines and integrating them into CI/CD or LLMOps workflows
Strong understanding of AI governance, responsible AI principles, bias detection, fairness metrics, and risk identification
Experience working with structured and unstructured datasets across multiple modalities (text, tabular, documents)
Familiarity with vector databases, RAG architectures, and multi-step LLM workflows
Familiarity with OWASP LLM Top 10 Risks
Excellent analytical, written, and verbal communication skills, with the ability to translate evaluation insights into clear technical recommendations
Proven ability to collaborate with cross-functional engineering and product teams while independently driving evaluation strategy
Experience working in agile or iterative development environments and documenting scientific processes clearly
Company
Steampunk, Inc.
Steampunk is anchored by a startup culture with a customer-centered delivery approach. We put our Federal government clients at the center of everything we design, develop, and deliver to drive high-quality mission impacts and user experiences at speed.
Funding
Current Stage: Growth Stage
Total Funding: unknown
Key Investors: AcceliCITY powered by Leading Cities
2024-07-31 · Non Equity Assistance
Recent News
Washington Technology
2025-10-01
2024-05-21
Company data provided by Crunchbase