OPick: Onchain Prediction Market Protocol

Prediction Markets · 2026

STATUS: Live on Base since March 2026

STACK: Solidity · Next.js · Anthropic API

SCALE: 1.2M+ X posts classified daily

RESULT: 60 days continuous · 6 settlement cycles

OPick is a live onchain opinion market on Base mainnet. Built solo, end to end. It auto-classifies 1.2M+ X posts a day through the Anthropic API and settles weekly attention-index markets onchain with no manual intervention.

Problem

Polymarket and Kalshi list markets that resolve on discrete events, not on the underlying attention. There was no liquid market for attention itself. The motivation was twofold: to test whether attention could be priced as an asset class, and to open market creation to more people, in the spirit of what Polymarket made possible on the trading side.

How it works

Each market is a CPMM: reserves of YES and NO tokens encode the implied probability through the constant-product invariant. A nightly classifier pipeline pulls public-figure posts and routes them through Claude Haiku, with a Sonnet fallback for ambiguous cases, aggregating into a per-figure weekly attention index written to a Solidity oracle on Base. Stack: Solidity (Foundry), Python orchestration, Next.js on Vercel, Anthropic API.

CPMM bonding curve. User trades shift the pool's reserve ratio, and the reserve ratio determines the implied probability of YES: NO reserve / (YES reserve + NO reserve).
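The pricing rule in the caption can be sketched in a few lines of Python. This is a minimal illustration, not the deployed Solidity contract: the `Pool` class, the fee-free trade, and the mint-then-swap mechanics in `buy_yes` are assumptions borrowed from the common fixed-product market maker design, which the source does not confirm.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """Two-outcome constant-product pool (hypothetical; reserves are token balances)."""
    yes: float
    no: float

    def p_yes(self) -> float:
        # Scarcer YES tokens imply a higher YES price: NO / (YES + NO).
        return self.no / (self.yes + self.no)

    def buy_yes(self, collateral: float) -> float:
        """Spend `collateral` (no fee, by assumption) and return YES tokens received."""
        k = self.yes * self.no        # invariant before the trade
        # Collateral is assumed to mint equal YES and NO; the NO stays in the pool.
        new_no = self.no + collateral
        new_yes = k / new_no          # YES reserve that restores yes * no == k
        out = self.yes + collateral - new_yes
        self.yes, self.no = new_yes, new_no
        return out
```

Under these assumptions, a pool that starts at 100 YES and 100 NO has an implied YES probability of 0.5; buying YES with 10 units of collateral moves it to roughly 0.55 while leaving the product of the reserves unchanged.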

Result

Live on Base since March 2026. About 60 days of continuous daily operation, with six weekly settlement cycles completed across six public-figure tickers. Total inference cost is in the low double-digit dollars per month.

Methodology

The classifier emits a (topic, engagement multiplier) tuple for each post, so the system needs three evals: one for the topic classifier, one for the engagement multiplier, and one for the end-to-end index.

For the topic classifier, an eval set of 100 hand-labeled posts spanning the six public figures the index currently covers gives approximately 87 percent accuracy and approximately 0.84 F1. The dominant failure mode is posts with media attachments where the text alone is ambiguous. After adding a media-description preprocessing step that captions images and video frames before the classifier runs, accuracy moved to approximately 92 percent.
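For reference, the two headline metrics can be computed from parallel lists of gold and predicted labels as sketched below. The source does not say whether the reported F1 is macro- or micro-averaged; macro-averaging is assumed here, and the function names are illustrative.

```python
from collections import Counter


def accuracy(gold: list[str], pred: list[str]) -> float:
    """Fraction of posts whose predicted topic matches the hand label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-topic F1 (macro averaging is an assumption)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # predicted topic gains a false positive
            fn[g] += 1   # gold topic gains a false negative
    scores = []
    for label in set(gold) | set(pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Macro averaging weights every topic equally, which makes a weak minority topic visible even when overall accuracy looks healthy.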

For the engagement multiplier there is not yet a clean eval harness. Ground truth for "engagement-weighted attention" is unobservable, so the current proxy is correlation with downstream market activity in the corresponding CPMM. That proxy is contaminated by reflexive effects: trading responds to the index that the multiplier feeds, so the two are not independent. The next step is a synthetic eval: sample posts from accounts of varying follower counts but identical content type, and check that the multiplier responds in the expected direction.
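The planned synthetic eval could take roughly the following shape. This is a sketch under stated assumptions: "expected direction" is taken to mean that the multiplier should not fall as follower count rises with content type held fixed, and `toy_multiplier` is a hypothetical stand-in, not the production multiplier.

```python
import math
from typing import Callable


def toy_multiplier(followers: int, content_type: str) -> float:
    # Hypothetical stand-in: grows with the log of the account's reach.
    return 1.0 + math.log10(1 + followers)


def direction_violations(
    multiplier: Callable[[int, str], float],
    follower_counts: list[int],
    content_type: str,
) -> list[tuple[int, int]]:
    """Return adjacent follower-count pairs where the multiplier decreases."""
    counts = sorted(follower_counts)
    values = [multiplier(c, content_type) for c in counts]
    violations = []
    for i in range(1, len(counts)):
        if values[i] < values[i - 1]:
            violations.append((counts[i - 1], counts[i]))
    return violations
```

An empty result means the multiplier passed the directional check for that content type; any returned pair pinpoints the follower range where it moved the wrong way.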

The end-to-end eval is the hardest. The honest summary is that I have measured accuracy for the topic classifier, I have only a proxy for the multiplier, and I am still building the eval that compares the index against subsequent media activity.

Stack and languages

Solidity (Foundry) for the contracts on Base. Python orchestration through a single cron-triggered entry point: no queue, no worker pool, no agent framework. Anthropic API as the cognitive substrate: Claude Haiku 4.5 as the primary classifier for cost reasons, with a Sonnet fallback for posts the Haiku classifier flags as ambiguous. Next.js on Vercel for the front end.
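The primary-plus-fallback routing can be expressed without reference to any particular SDK. In the sketch below the two classifiers are injected as plain callables; the `Label` type, its `ambiguous` flag, and the `classify` function are hypothetical names, since the source does not describe how Haiku signals ambiguity.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Label:
    """Hypothetical classifier output: topic, engagement multiplier, ambiguity flag."""
    topic: str
    multiplier: float
    ambiguous: bool = False


Classifier = Callable[[str], Label]


def classify(post: str, primary: Classifier, fallback: Classifier) -> Label:
    """Try the cheaper model first; escalate only when it flags ambiguity."""
    label = primary(post)
    if label.ambiguous:
        return fallback(post)
    return label
```

Keeping the model calls behind callables means the routing rule can be tested with stubs, and the larger model is only invoked for the posts the cheaper one flags.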

Problem solved

The first lesson is that for an AI-native protocol, the choice of underlying determines the architecture, not the other way around. Once topic-weighted engagement was the underlying, the AI classifier became the highest-leverage component and everything else became scaffolding around it. The right entry point for systems like this is the model layer, because that determines the eval surface, which determines what trust looks like.

The second is that the boring orchestration choice almost always beats the sophisticated one. The simplest version (a cron, a single Python entry point, no queue) is what is in production. Every operational pain since launch has come from complexity I added later, not complexity I should have added earlier.

The third is that the eval framework belongs before the protocol, not after. Retrofitting evals onto a production system is the wrong direction.

Notable bugs and debugging

The dominant classifier failure mode was posts with media attachments where the text alone was ambiguous. The mitigation, implemented after the first eval pass, was a media-description preprocessing step that captions images and video frames before the classifier runs. After the change, classifier accuracy moved from approximately 87 percent to approximately 92 percent.