Ship evidence, not features

There's a failure mode I wanted to avoid while building the marketing OS: spending three months heads-down, emerging with an impressive feature list, and having no idea whether any of it is actually better than what it replaces. So I gave myself a rule that sounds obvious and almost nobody follows: ship evidence, not features.

What that means in practice

Every slice I build has to answer a question, not just exist. The unit of progress isn't "I built the ads-performance skill." It's "here's what the system said about last month's ads, side by side with what actually happened." A feature that can't produce that comparison isn't done — it's just code.

So the repo has a demo-cases/ folder, and the rule is that each capability lands with one: a concrete, dated, side-by-side write-up of the system's read versus the human/agency outcome. Not a screenshot of the feature working — a piece of evidence about whether it's right.
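One way that rule could be checked mechanically is sketched below. It is a minimal illustration, not the repo's actual tooling: the filename convention (a date, then the capability name, then `.md`), the function name `missing_demo_cases`, and the capability names are all assumptions made for the example.

```python
import re
from typing import Iterable

# Assumed (hypothetical) naming convention for files in demo-cases/:
#   YYYY-MM-DD-<capability>.md, e.g. 2024-05-01-ads-performance.md
DEMO_CASE_PATTERN = re.compile(
    r"^\d{4}-\d{2}-\d{2}-(?P<capability>[a-z0-9-]+)\.md$"
)


def missing_demo_cases(
    capabilities: Iterable[str], demo_case_files: Iterable[str]
) -> list[str]:
    """Return the capabilities that have no dated demo case yet."""
    covered = set()
    for name in demo_case_files:
        match = DEMO_CASE_PATTERN.match(name)
        if match:  # files that don't follow the convention are ignored
            covered.add(match.group("capability"))
    return sorted(set(capabilities) - covered)


# Illustrative inputs only; these names are made up for the example.
undone = missing_demo_cases(
    ["ads-performance", "daily-summary"],
    ["2024-05-01-ads-performance.md", "notes.txt"],
)
print(undone)
```

Run as a pre-merge check, a non-empty result would mean a capability has shipped as code but not yet as evidence.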

Why evidence beats features for a tool like this

A marketing tool's only real currency is trust. If I'm going to lean on a daily summary to tell me where spend is leaking, I have to believe it more than I believe my own manual scan. A feature doesn't earn that. A track record does.

Stacking features without evidence has a second, sneakier cost: you can't tell which ones to keep. Build ten skills, trust none, and you've made a more complicated version of not knowing. Build three and prove two of them beat the status quo, and you've made something you'd actually bet on. Evidence is how you prune.

A feature is a claim. A demo case is the test of the claim. Ship the tests, and the features that survive them are the only ones worth keeping.

It also paces the build

Demo-driven development forces you to ship in slices. If every capability owes a demo, you can't disappear for a quarter; you're producing a verdict every couple of weeks. That cadence is the opposite of the heroic invisible build, and it has saved me from my own worst habit: polishing something nobody has pressure-tested.

It dovetails with the phase plan: describe, then recommend, then act. You can't honestly move from "describe" to "recommend" on vibes — you move when the describe layer has a stack of demo cases showing it read the situation better than the manual approach did. The evidence is the gate between phases, not a calendar date.
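The gate can be pictured as a small function over the pile of demo cases. This is a sketch under stated assumptions: the `DemoCase` fields, the `can_advance` name, and both thresholds (`min_cases`, `min_win_rate`) are hypothetical placeholders, not numbers from the actual plan.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class DemoCase:
    capability: str
    phase: str                # "describe", "recommend", or "act"
    system_beat_manual: bool  # verdict recorded in the write-up; misses stay in


def can_advance(
    cases: Sequence[DemoCase],
    current_phase: str,
    min_cases: int = 5,         # assumed threshold, purely illustrative
    min_win_rate: float = 0.6,  # assumed threshold, purely illustrative
) -> bool:
    """Decide whether the evidence for current_phase justifies moving on.

    The decision depends only on recorded demo cases, never on a date.
    """
    relevant = [c for c in cases if c.phase == current_phase]
    if len(relevant) < min_cases:
        return False  # not enough evidence yet, whatever the calendar says
    wins = sum(c.system_beat_manual for c in relevant)
    return wins / len(relevant) >= min_win_rate


# Illustrative data: four wins and one miss in the describe phase.
pile = [DemoCase("ads-performance", "describe", True)] * 4 + [
    DemoCase("ads-performance", "describe", False)
]
print(can_advance(pile, "describe"))
```

The misses count in the denominator, so deleting an unflattering write-up would be the only way to game the gate, which is exactly why the pile has to keep them.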

The uncomfortable part

Shipping evidence means sometimes the evidence is bad. A demo case can show the system's recommendation would have been worse than what actually happened — and then you have to keep that write-up, because the honest pile includes the misses. That stings, but it's the entire point: a tool validated only by its wins is a tool you can't trust on a new decision. The misses are what tell you where the judgment still has to stay human.

It's the same discipline behind everything else I keep coming back to, like the quality fence and the MCP boundaries: structure that keeps the work honest at scale. Features are easy to generate now; anyone can do it. The scarce, compounding thing is proof that what you generated is actually good.