pluck v0.1.0
The web, as a typed API

Pluck the data.
Leave the page.

Hand pluck a URL and a schema. Get back a typed object where every field has been traced to the page it came from. No selectors. No scraping glue. No silent any.

recipe.ts
import { createPluck } from 'pluck'
import { z } from 'zod'

const Recipe = z.object({
  title:       z.string(),
  ingredients: z.array(z.string()),
  minutes:     z.number(),
})

// `router` (any Router implementation) and `url` are assumed to be defined elsewhere.
const client = createPluck({ router })

const res = await client.pluck(url, Recipe)
//        ^? ExtractResult<Recipe>

if (res.ok) {
  res.data.minutes   // number — verified ✓
  res.source          // 'jsonld' | 'llm'
}

One call. Six stages.

FIG. 01 — THE PIPELINE
01 Fetch: Plain HTTP, or your self-hosted Firecrawl for JS-heavy pages.
02 JSON-LD (fast path): Reads schema.org and embedded data. Zero LLM, zero cost.
03 Reduce: Strips the chrome. Clean markdown, not raw HTML.
04 Extract: A router fills your schema. pluck never names the model.
05 Verify: Traces every field back to the source. Sets a ratio.
06 Cache: Keyed on content + schema. Unchanged page, no re-work.

When the page already publishes structured data, pluck takes the fast path and skips the model entirely — most recipe, product, and article pages do.
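A rough sketch of what a JSON-LD fast path can look like. This is not pluck's implementation: `readJsonLd` and `findByType` are hypothetical names, the regex is a simplification of real HTML parsing, and `@graph` containers and array-valued `@type` are not handled.

```typescript
// Collect every parseable <script type="application/ld+json"> payload on a page.
function readJsonLd(html: string): unknown[] {
  const pattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi
  const found: unknown[] = []
  for (const match of html.matchAll(pattern)) {
    try {
      const parsed: unknown = JSON.parse(match[1] ?? '')
      // A block may hold one object or an array of them.
      if (Array.isArray(parsed)) found.push(...parsed)
      else found.push(parsed)
    } catch {
      // Malformed blocks are skipped rather than failing the whole page.
    }
  }
  return found
}

// Return the first item whose "@type" is exactly the requested string.
function findByType(
  items: unknown[],
  type: string,
): Record<string, unknown> | undefined {
  for (const item of items) {
    if (typeof item === 'object' && item !== null) {
      const record = item as Record<string, unknown>
      if (record['@type'] === type) return record
    }
  }
  return undefined
}
```

If `findByType(readJsonLd(html), 'Recipe')` returns an object, the schema can be filled from it directly and the model stage never runs.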

§ 01

Type-safe by contract, not by hope.

The schema is yours, and it's decoupled from the page's markup. A site can re-skin its entire layout — your Recipe type doesn't move, because pluck reads meaning, not CSS selectors.

Every call resolves to a discriminated ExtractResult<T>. Success is typed data; failure is a reason and an optional partial — never an untyped blob to guess at.
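To make the contract concrete, here is a minimal sketch of such a discriminated union. The field names `reason`, `partial`, and `verifiedRatio` follow the prose on this page, but the exact shape is an assumption, not pluck's published type; `summarize` is a hypothetical helper.

```typescript
// Assumed shape, inferred from the description above. The real type may differ.
type Source = 'jsonld' | 'llm'

type ExtractResult<T> =
  | { ok: true; data: T; source: Source; verifiedRatio: number }
  | { ok: false; reason: string; partial?: Partial<T> }

interface Recipe {
  title: string
  ingredients: string[]
  minutes: number
}

function summarize(res: ExtractResult<Recipe>): string {
  if (res.ok) {
    // Success branch: res.data is a fully typed Recipe.
    return `${res.data.title} takes ${res.data.minutes} min (via ${res.source})`
  }
  // Failure branch: only `reason` and the optional `partial` exist here.
  const title = res.partial?.title ?? 'unknown'
  return `failed (${res.reason}); partial title: ${title}`
}
```

Checking `res.ok` is the only way to reach `res.data`, so the compiler forces the failure case to be handled.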

ExtractResult<T> · never any

§ 02

Verified, not guessed.

An LLM will happily invent a price. pluck won't ship one. Every extracted field is traced back to a span in the page's own text and scored — the result carries a verifiedRatio.

Fall below the threshold and the call returns { ok: false } with the partial attached, rather than handing you a confident fabrication. Type-safe is not the same as correct — pluck treats them as two separate jobs.
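One naive way to picture the tracing step: look up each extracted value in the page text, record where it was found, and divide hits by total fields. The sketch below assumes case-insensitive substring matching and a 0.8 threshold. Both are illustrative choices, and `verifyAgainstPage` is a hypothetical name rather than pluck's verifier.

```typescript
interface FieldTrace {
  path: string
  value: string
  // Offsets into the lower-cased page text, or null when the value was not found.
  span: [number, number] | null
}

// Flatten one level of a record: arrays become path[i] entries.
function flattenFields(data: Record<string, unknown>): Array<[string, string]> {
  const out: Array<[string, string]> = []
  for (const [key, value] of Object.entries(data)) {
    if (Array.isArray(value)) {
      value.forEach((item, i) => out.push([`${key}[${i}]`, String(item)]))
    } else {
      out.push([key, String(value)])
    }
  }
  return out
}

function verifyAgainstPage(
  data: Record<string, unknown>,
  pageText: string,
  threshold = 0.8, // assumed default, chosen for illustration
): { ok: boolean; verifiedRatio: number; traces: FieldTrace[] } {
  const haystack = pageText.toLowerCase()
  const traces = flattenFields(data).map(([path, value]): FieldTrace => {
    const start = haystack.indexOf(value.toLowerCase())
    const span: [number, number] | null =
      start === -1 ? null : [start, start + value.length]
    return { path, value, span }
  })
  const verified = traces.filter((t) => t.span !== null).length
  const verifiedRatio = traces.length === 0 ? 0 : verified / traces.length
  return { ok: verifiedRatio >= threshold, verifiedRatio, traces }
}
```

An invented price has no span to point at, so it drags the ratio down and the call fails instead of returning the fabrication.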

provenance per field

§ 02

Cheap on purpose.

The model call is the expensive part, so pluck avoids it whenever it can. Pages that already publish schema.org JSON-LD take the fast path — no tokens spent, no hallucination surface.

What does hit the model is cached on a content + schema hash. Re-run against an unchanged page and you pay nothing the second time. That economy is the whole reason a shared service beats a hand-rolled script.
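A content + schema key can be as simple as one hash over both inputs. The sketch below uses Node's built-in `crypto` module with SHA-256. The hash choice, the separator byte, and the `cacheKey` name are assumptions for illustration; pluck's actual key format is not documented here.

```typescript
import { createHash } from 'node:crypto'

// Derive a cache key from the reduced page content and a serialized schema.
// The NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
function cacheKey(content: string, schemaJson: string): string {
  return createHash('sha256')
    .update(content)
    .update('\0')
    .update(schemaJson)
    .digest('hex')
}
```

Because the key depends on content rather than on the URL, an unchanged page maps to the same entry however often it is fetched, while editing either the page or the schema produces a new key.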

json-ld first · hashed cache

§ 04

Swap any part. Keep the pipeline.

Fetcher, Router, and Cache are plain interfaces. Start on plain fetch; graduate to firecrawlFetcher against your own crawl stack. Mock the model with callbackRouter; wire real policy with swooshRouter.

The in-memory cache implements the same Cache interface a Redis or Postgres store would — so the library you run locally is the service you run hosted, untouched.
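A sketch of why the seam works. The interfaces below are hypothetical stand-ins: `KeyValueCache` plays the role of pluck's Cache, and the method signatures are assumed, not taken from the library. What matters is that `remember` depends only on the interface, so the backing store can change without touching it.

```typescript
// Hypothetical signatures; pluck's real Fetcher and Cache may look different.
interface Fetcher {
  fetch(url: string): Promise<string>
}

interface KeyValueCache {
  get(key: string): Promise<string | undefined>
  set(key: string, value: string): Promise<void>
}

// In-memory store. A Redis or Postgres store would implement the same two methods.
class MemoryCache implements KeyValueCache {
  private readonly entries = new Map<string, string>()
  async get(key: string): Promise<string | undefined> {
    return this.entries.get(key)
  }
  async set(key: string, value: string): Promise<void> {
    this.entries.set(key, value)
  }
}

// A canned fetcher for tests: serves fixed bodies and rejects unknown URLs.
function fixtureFetcher(pages: Record<string, string>): Fetcher {
  return {
    async fetch(url: string): Promise<string> {
      const body = pages[url]
      if (body === undefined) throw new Error(`no fixture for ${url}`)
      return body
    },
  }
}

// Runs `compute` only on a cache miss. It knows nothing about the store behind the interface.
async function remember(
  cache: KeyValueCache,
  key: string,
  compute: () => Promise<string>,
): Promise<string> {
  const hit = await cache.get(key)
  if (hit !== undefined) return hit
  const value = await compute()
  await cache.set(key, value)
  return value
}
```

Swapping `MemoryCache` for a networked store, or `fixtureFetcher` for a real crawler, changes a constructor call and nothing else.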

Fetcher · Router · Cache

pluck owns the crawl, the verify, and the cache. swoosh owns which model, under what policy.

pluck → fetch · json-ld · reduce · verify · cache
swoosh → model selection · budgets · fallback

A clean seam. pluck asks for "a model that does structured output"; swoosh decides which one and what it costs. Neither knows the other's job.
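In code, that seam can be as narrow as one method. The sketch below is an assumption about the shape of the boundary, not the actual Router interface: `StructuredRequest`, `fixedRouter`, and `extractWith` are hypothetical names, and the built-in `callbackRouter` and `swooshRouter` may be wired differently.

```typescript
// Hypothetical request shape: the caller describes the job, never the model.
interface StructuredRequest {
  prompt: string
  schemaJson: string
}

interface Router {
  complete(req: StructuredRequest): Promise<unknown>
}

// A test double that answers from a plain callback instead of calling a model.
function fixedRouter(respond: (req: StructuredRequest) => unknown): Router {
  return {
    async complete(req: StructuredRequest): Promise<unknown> {
      return respond(req)
    },
  }
}

// The extract stage only sees the Router interface. Which model answers,
// and under what budget, is decided entirely on the other side.
async function extractWith(
  router: Router,
  markdown: string,
  schemaJson: string,
): Promise<unknown> {
  return router.complete({
    prompt: `Fill the schema from this page:\n\n${markdown}`,
    schemaJson,
  })
}
```

Note that the request carries no model name, which is what keeps model selection, budgets, and fallback on the router's side of the line.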
Get started

Three lines to typed data.

Install it, define a schema, call pluck. The JSON-LD path runs with no model at all.

terminal
# install
npm install pluck zod

# optional: policy-driven model routing
npm install swoosh-router