I can help your team create evals
I have created evals for multiple AI products and have taught evals since 2025. I have learned two things: first, evals are what turn an OK AI product into a great one, and second, evals take a lot of time. Most of that time goes into looking at data, creating golden datasets, and iterating on evals.
I can help your team create its first evals or expand your initial eval set.
What People Say
"In just a few sessions, Peter helped us design a more rigorous and efficient evaluation framework that strengthened our prompt iteration process and improved our overall confidence in model performance."
"When Peter teaches evals, listen up. He's one of the most hands-on teachers and builders I've seen in action."
How Does It Work?
Like any good consultant, I have to say "it depends", but generally it looks like this:
We have a kick-off meeting.
We'll go over things like: Do you have evals set up already? Is the product live? What are the scariest things the AI could get wrong? Who will be the main contact? We'll also arrange my access to the data.
We'll jointly identify a few candidate evals to start with.
Good evals are a lot of work, so it's important to set them up for the things that actually matter.
I go away, spend time with the data, and build a tiny golden dataset and a manual eval.
Then we meet again and discuss. By now, we have a weekly rhythm.
I help you set up automated eval systems.
For the most important evals, you'll want automated systems. I can help you build those from scratch, or work with your team to set them up.
Most of my time is spent looking at the data, developing datasets, and setting up evals. Typically ~20 hours/week.
Most of your time is spent reviewing the work and deciding what to evaluate next. Typically ~2 hours/week.
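To make the steps above a little more concrete, here is a minimal Python sketch of what a tiny golden dataset and a first automated check might look like. Everything in it is hypothetical and for illustration only: the `GoldenExample` structure, the example prompts, the substring-based `passes` rule, and the `stub_model` stand-in are assumptions, not a description of any particular engagement or tool. Real evals are usually shaped by what turns up in your data.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenExample:
    """One hand-reviewed example: a prompt plus phrases a good answer must contain."""
    prompt: str
    must_include: list[str]


# Hypothetical golden dataset; in practice these examples come from your own data.
GOLDEN_SET = [
    GoldenExample("What is your refund window?", ["30 days"]),
    GoldenExample("Do you ship to Canada?", ["yes", "canada"]),
]


def passes(example: GoldenExample, answer: str) -> bool:
    """Pass if every required phrase appears in the answer (case-insensitive)."""
    lowered = answer.lower()
    return all(phrase.lower() in lowered for phrase in example.must_include)


def run_eval(model: Callable[[str], str], dataset: list[GoldenExample]) -> float:
    """Return the fraction of golden examples that the model passes."""
    if not dataset:
        raise ValueError("dataset is empty")
    results = [passes(ex, model(ex.prompt)) for ex in dataset]
    return sum(results) / len(results)


def stub_model(prompt: str) -> str:
    """Stand-in for the product's AI call (assumption); replace with your own."""
    canned = {
        "What is your refund window?": "You can return items within 30 days.",
        "Do you ship to Canada?": "We currently ship only within the US.",
    }
    return canned.get(prompt, "")


if __name__ == "__main__":
    print(f"pass rate: {run_eval(stub_model, GOLDEN_SET):.0%}")
```

A substring check like this is deliberately crude. It is often enough to get a first signal, and it can be swapped for a stricter scorer once the manual review shows what "good" actually looks like.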
Evals are emerging as the real moat for AI startups.
Writing evals is going to become a core skill for product managers.
If there is one thing we can teach people, it's that writing evals is probably the most important thing.
Evals are surprisingly often all you need.
FAQ
How long is a typical engagement?
It's flexible. Most teams start with about three weeks, a few days a week, which is usually enough to stand up your first meaningful evals and establish a rhythm. From there we can wind down, continue at a lighter cadence, or expand the eval set as your product grows.
How much does it cost?
Get in touch and we'll scope it to your needs. It's almost always cheaper than hiring for the role, and you get someone who has done this across multiple AI products rather than learning on the job.
Can you help with the technical setup?
Yes. I can help you set up automated eval systems from scratch, or work alongside your engineering team to build them.
We already have Braintrust (or another eval tool) set up.
Great, that makes things faster. The hard part usually isn't the tool, it's deciding what to evaluate and building good golden datasets. I'll help you implement the right evals on top of what you already have.
We don't have any eval tools set up yet.
That's fine, and common. We can start with simple manual evals and a tiny golden dataset, then introduce automated tooling only where it earns its keep. You don't need to buy anything to get started.
Do we need our product to be live already?
No. It helps to have real data, but if you're pre-launch we can work from whatever data you have, including synthetic or early test data, and design evals you'll grow into.
Who on our team needs to be involved?
We'll want one main point of contact, plus access to your data and the people who understand what "good" looks like for your product. Most of your team's time goes into reviewing my work and deciding what actually matters to evaluate.
What kinds of products have you done this for?
A range of AI products since 2025, from RAG and chatbots to document analysis and domain-specific tools. The approach transfers across domains because it starts from your data and your risks.
My CEO has questions. Can you help me sell this internally?
Yes. I'm happy to join a call with your leadership to explain what evals are, why they matter, and what the engagement will and won't deliver. Often it helps to hear it from someone who has done this before, in plain terms, tied to business outcomes rather than hype.
Our legal or compliance team has questions.
Understandable, and a good sign. Evals are one of the best tools for getting ahead of risk: they make it concrete what the AI could get wrong and how often. I can walk your legal or compliance stakeholders through how we identify the scariest failure modes and measure them, and we'll make sure data access is handled appropriately.
Our engineering team has questions.
Good, I love working closely with engineers. I can go as deep as needed on datasets, scoring, automated eval pipelines, and how this fits your existing stack and CI. The goal is to leave your team able to own and extend the evals after I'm gone, not to create a dependency on me.
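For engineers wondering what "fits your CI" could mean in practice, one common pattern is a small script that fails the build when the eval pass rate drops below an agreed threshold. The sketch below assumes eval results arrive as a mapping from case name to pass/fail. The function names, the example case names, and the 0.9 threshold are all hypothetical placeholders, not a prescribed setup.

```python
import sys


def summarize(results: dict[str, bool]) -> tuple[float, list[str]]:
    """Return the pass rate and the sorted names of the failing cases."""
    if not results:
        raise ValueError("no eval results to summarize")
    failures = sorted(name for name, ok in results.items() if not ok)
    passed = len(results) - len(failures)
    return passed / len(results), failures


def gate(pass_rate: float, threshold: float) -> int:
    """Return a process exit code: 0 if the pass rate meets the threshold, else 1."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return 0 if pass_rate >= threshold else 1


if __name__ == "__main__":
    # Hypothetical results; a real pipeline would load these from the eval run.
    results = {
        "refund_window": True,
        "shipping_canada": False,
        "pricing_tiers": True,
        "cancel_flow": True,
    }
    rate, failures = summarize(results)
    print(f"pass rate: {rate:.0%}, failing cases: {failures}")
    # A non-zero exit code is what makes most CI systems mark the step as failed.
    sys.exit(gate(rate, threshold=0.9))
```

Where to set the threshold is a product decision more than a technical one. It usually follows from which failures are the scariest.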