Essay · 2026

Evaluation as the Forward Deployed Engineer's First Skill

The first skill of a Forward Deployed Engineer is not the choice of model, the design of a retrieval system, or the orchestration of an agent. It is the design of the evaluation surface against which all of those other choices are made.

This is a counterintuitive claim, because eval engineering looks, from the outside, like a chore at the end of the pipeline. Build the system, then evaluate it. The order of operations encoded in that sentence is the source of most of the failures I have observed in early-stage AI deployments, both in my own systems and in systems I have read about. The teams that succeed treat eval design as the first activity, not the last, and the teams that fail treat it as a deliverable to be produced after the system is built.

I want to spend this essay arguing for the inverted order, evaluation first and system second, drawing on three engagements I have run.

The first was UIBE in the summer of 2023, where I built a topic-model-based forecasting pipeline for crude oil. The eval question I asked first, before I trained anything, was simple. If my pipeline succeeded, what would I be able to say about the result that I could not currently say about the team's existing keyword baseline? The answer that emerged was that I would be able to defend a Sharpe lift on out-of-sample data with a placebo control. That single sentence, written before any model code, determined the architecture of the entire summer. I trained the model on a window that ended a year before the prediction window. I held out the prediction window from any tuning. And I built a placebo arm where the model trained on permuted headlines, so that I could show the Sharpe collapsed without the actual signal. The placebo arm was not part of the original team request. I built it because the eval question required it, and the team's confidence to deploy capital came from the placebo arm, not from the headline Sharpe.
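The placebo comparison described above can be sketched in a few lines. This is a minimal illustration, not the pipeline from that summer: it shuffles a precomputed daily signal against the returns instead of retraining a topic model on permuted headlines. The function names, the sign-based trading rule, the zero risk-free rate, and the 252-day annualization are all assumptions made for the example.

```python
import math
import random
import statistics


def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of per-period returns (risk-free rate assumed 0)."""
    sd = statistics.stdev(returns)
    if sd == 0:
        return 0.0
    return statistics.mean(returns) / sd * math.sqrt(periods_per_year)


def strategy_returns(signals, asset_returns):
    """Hypothetical rule: long on a positive signal, short on a negative one, flat at zero."""
    positions = [1 if s > 0 else -1 if s < 0 else 0 for s in signals]
    return [pos * r for pos, r in zip(positions, asset_returns)]


def placebo_test(signals, asset_returns, n_permutations=500, seed=0):
    """Compare the real Sharpe with Sharpes obtained from shuffled signals.

    Returns (real_sharpe, mean_placebo_sharpe, p_value), where p_value is the
    share of placebo runs that match or beat the real Sharpe (with the usual
    +1 correction so it is never exactly zero).
    """
    rng = random.Random(seed)
    real = sharpe(strategy_returns(signals, asset_returns))
    placebo = []
    for _ in range(n_permutations):
        shuffled = list(signals)
        rng.shuffle(shuffled)  # breaks the link between signal and next return
        placebo.append(sharpe(strategy_returns(shuffled, asset_returns)))
    beats = sum(1 for p in placebo if p >= real)
    p_value = (1 + beats) / (1 + n_permutations)
    return real, statistics.mean(placebo), p_value
```

If the real Sharpe sits well above the placebo distribution, the lift is attributable to the alignment between signal and returns rather than to the trading rule or the sample period. If it does not, the headline number was never evidence of anything.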

The second engagement was Spandrel, the deal-sourcing system I built for a transferable tax credit syndicator. The eval question there was, in retrospect, the only honest one available: can the head of origination trust the top of the list enough to start her morning with it? Phrased that way, the eval question forced me to abandon precision and recall on a labeled set as primary metrics, since neither of those numbers maps cleanly to her trust. The metric that emerged was precision at twenty, measured against her actual subsequent behavior as polled from her team. The metric is not as clean as a static F1 number on a labeled set. But it is the metric that determines whether the system survives, which is the only thing the eval was ever meant to do.
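Precision at twenty against behavior is simple to compute once the behavior has been collected. The sketch below assumes a ranked list of deal identifiers and a set of identifiers the team later acted on. Both structures and the function name are hypothetical, and what counts as "acted on" is a definition to agree with the customer, not something the code can decide.

```python
def precision_at_k(ranked_ids, acted_on, k=20):
    """Fraction of the top-k ranked items that the team subsequently acted on.

    Divides by k (the standard convention), so a list shorter than k is
    penalized for the slots it failed to fill.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked_ids[:k]
    hits = sum(1 for deal_id in top if deal_id in acted_on)
    return hits / k
```

Because the denominator is fixed at twenty, the number reads directly as "how many of the deals she saw this morning were worth her time", which is closer to the trust question than an F1 score over the full candidate pool.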

The third engagement is OPick, my current production protocol on Base mainnet. The eval design here is where I have been weakest, because I built the system before I built the eval. The classifier was running in production for two months before I had a labeled set against which to measure it. I have been retrofitting evals onto a live system, which puts me in the worst case: the system has already shipped, and I cannot tell which parts of its behavior are correct and which are artifacts of an unmeasured failure mode. I am doing the work now, but the experience has been instructive about the cost of building first and evaluating second.

What follows is what I now believe is the right discipline, in the order I would defend.

The first question to answer in any forward-deployed engagement is what claim the customer needs to make about the system in order to deploy capital, ship a product, or change a workflow. The claim is the eval target. It is rarely a number you can measure directly. It is usually a sentence like "the top of this list is trustworthy" or "the model is correctly identifying X% of the cases I care about". The work of translating that sentence into a measurable metric is the work of eval design.

The second question is what ground truth, if any, is available. In supervised settings, you have labels. In most forward-deployed settings, you do not have labels, and you need to negotiate with the customer for what minimum labeling effort produces a viable eval. Sometimes the answer is fifty labeled examples reviewed in an afternoon. Sometimes the answer is no labels, and the eval has to be constructed against downstream customer behavior, as in Spandrel. Both are acceptable. What is not acceptable is to skip this conversation and build a system whose performance you cannot defend.
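When negotiating labeling effort, it helps to show the customer how much uncertainty a small labeled set leaves. The sketch below computes a Wilson score interval for an observed accuracy. The choice of the Wilson interval and the 95% level (z = 1.96) are my assumptions for illustration, not something the engagements above prescribed.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    successes: number of examples the system got right
    n:         number of labeled examples
    z:         normal quantile (1.96 corresponds to roughly 95% confidence)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width
```

With 45 correct out of 50, the interval runs from roughly 0.79 to 0.96. Fifty labels are enough to rule out a broken system, but not enough to distinguish a 90% system from an 85% one, and that is a useful thing to say out loud before anyone commits to a number.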

The third question is what failure modes you most need to surface in eval that production usage would surface too late. This is the inverse of the standard eval question, which asks what failure modes are most common. The forward-deployed framing asks which failure modes, if they reach production undetected, would cause the engagement to lose customer trust. In Spandrel, the failure mode that would have killed the engagement was confidently surfacing a deal that the team subsequently flagged as obviously not credit-eligible. The eval had to catch that case even if it was rare, because the customer relationship could not survive one false confidence event. The eval design responded by adding an explicit credit-type validator upstream of the scoring layer, with a stricter false-positive threshold than the main classifier.
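The two-stage arrangement described above, a strict eligibility gate in front of a more permissive scorer, might look like the following. Every name and number here is hypothetical: the `Candidate` fields, the two thresholds, and the helper that counts false-confidence events are illustrative assumptions, not Spandrel's actual values or code.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    deal_id: str
    eligibility_prob: float  # assumed validator output: P(deal is credit-eligible)
    score: float             # assumed main-classifier ranking score


# Illustrative thresholds only: the validator is deliberately much stricter
# than the scorer, because one confidently wrong deal costs more than a miss.
VALIDATOR_THRESHOLD = 0.95
SCORE_THRESHOLD = 0.50


def surface(candidates, k=20):
    """Apply the eligibility gate first, then rank the survivors by score."""
    eligible = [c for c in candidates if c.eligibility_prob >= VALIDATOR_THRESHOLD]
    scored = [c for c in eligible if c.score >= SCORE_THRESHOLD]
    return sorted(scored, key=lambda c: c.score, reverse=True)[:k]


def false_confidence_events(surfaced, known_ineligible):
    """Deal ids that were surfaced despite being known to be ineligible."""
    return [c.deal_id for c in surfaced if c.deal_id in known_ineligible]
```

In an eval run, `false_confidence_events` is the count that has to stay at zero on whatever set of known-ineligible deals the customer can supply, even if that costs some recall at the top of the list.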

The fourth question is who reviews the eval and on what cadence. Eval surfaces that are not reviewed decay. Eval surfaces that are reviewed weekly, by the customer or by you with the customer in copy, become the connective tissue of the engagement. They give you a forum to discuss the system's behavior in concrete terms rather than in vibes, and they let the customer see the system improving for themselves rather than taking your word for it.

The discipline I am trying to internalize is to write the eval document before I write the system code. The eval document, even a rough one, contains the claim, the metric, the ground truth strategy, the failure modes to catch, and the review cadence. Once those five things are written, the system design problem collapses into something tractable, because every architectural choice can be tested against whether it makes the eval easier or harder. Choices that make the eval easier almost always also make the system easier to operate. Choices that make the eval harder almost always indicate a hidden cost the customer will eventually pay.
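One lightweight way to hold yourself to this is to give the eval document a fixed shape and refuse to start system code while any part of it is blank. The sketch below is an assumption about how that could be encoded. The class name, field names, and the weekly cadence in the example are mine, and the example values simply restate the Spandrel case from earlier in this essay.

```python
from dataclasses import dataclass, field, fields


@dataclass
class EvalDoc:
    claim: str = ""           # the sentence the customer needs to be able to say
    metric: str = ""          # how that sentence is measured
    ground_truth: str = ""    # labels, downstream behavior, or an explicit "none"
    failure_modes: list = field(default_factory=list)  # what must be caught before production
    review_cadence: str = ""  # who reviews the eval, and how often

    def missing(self):
        """Names of the sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


spandrel = EvalDoc(
    claim="The head of origination can trust the top of the list.",
    metric="Precision at twenty against the team's subsequent behavior.",
    ground_truth="No labels; downstream customer behavior.",
    failure_modes=["Confidently surfacing a deal that is not credit-eligible."],
    review_cadence="Weekly, with the customer in copy.",  # illustrative
)
```

Calling `spandrel.missing()` returns an empty list, while a freshly created `EvalDoc()` reports all five sections as outstanding. The check is trivial on purpose: the value is in being forced to write the five answers, not in the code that verifies they exist.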

Forward Deployed Engineer is a role title that names a discipline. The discipline, distilled, is the willingness to write the document that says how you will know you have succeeded, and then to live inside the discipline of that document. Everything else, the model choice, the framework selection, the orchestration pattern, is downstream.