Catalog Population & Maintenance Design

Date: 2026-04-18
Status: Approved
Domains: Bikepacking, biking, hiking


Goal

Populate the globalItems catalog at scale using AI agents that crawl manufacturer websites, and establish the data model to support ongoing ingestion. Community submissions and automated update scheduling are deferred to a later phase.


Schema Changes

New: manufacturers table

CREATE TABLE manufacturers (
  id         SERIAL PRIMARY KEY,
  name       TEXT NOT NULL UNIQUE,        -- "Apidura", "Canyon"
  slug       TEXT NOT NULL UNIQUE,        -- "apidura", "canyon"
  website    TEXT NOT NULL,               -- brand homepage
  tier       INTEGER NOT NULL DEFAULT 1,  -- 1 = deep scrape, 2 = aggregator, 3 = RSS only
  active     BOOLEAN NOT NULL DEFAULT true,
  country    TEXT,                        -- "DE", "US", etc.
  created_at TIMESTAMP NOT NULL DEFAULT NOW()
);

tier controls how the brand is handled by the ingestion pipeline:

  • 1 — deep agent crawl of full product catalog
  • 2 — discovered via gear aggregators (Bikepacking.com, Outdoor Gear Lab)
  • 3 — RSS/new-releases only (future)
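
As a minimal sketch of how script code might interpret the tier column, the mapping above can be written as a lookup table. The names Tier, IngestionStrategy, TIER_STRATEGY and strategyForTier are hypothetical and do not come from the codebase.

```typescript
// Hypothetical mirror of the tier column; all names here are illustrative.
type Tier = 1 | 2 | 3;
type IngestionStrategy = "deep-crawl" | "aggregator" | "rss-only";

const TIER_STRATEGY: Record<Tier, IngestionStrategy> = {
  1: "deep-crawl", // agent crawls the full product catalog
  2: "aggregator", // discovered via gear aggregators (deferred)
  3: "rss-only", // RSS/new-releases only (future)
};

// Type guard that narrows the raw INTEGER value to a known tier.
function isTier(value: number): value is Tier {
  return value === 1 || value === 2 || value === 3;
}

// Fails loudly on a tier the pipeline does not know how to handle.
function strategyForTier(tier: number): IngestionStrategy {
  if (!isTier(tier)) {
    throw new Error(`Unknown manufacturer tier: ${tier}`);
  }
  return TIER_STRATEGY[tier];
}
```

Rejecting unknown values at this boundary keeps a mistyped tier in the DB from silently skipping a brand.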

Modified: globalItems table

  • Remove brand TEXT field
  • Add manufacturer_id INTEGER NOT NULL REFERENCES manufacturers(id)
  • Unique constraint changes from (brand, model) to (manufacturer_id, model)

All queries that previously selected brand now join manufacturers to get the name. The brand name is not denormalized onto globalItems.


Ingestion Pipeline

Overview

1. Add manufacturer to DB (name, website, tier, active=true)
2. bun run scripts/crawl-manufacturer.ts --manufacturer=canyon
3. Claude Haiku agent crawls manufacturer website
4. Agent outputs structured JSON array
5. Script calls POST /api/global-items/bulk
6. Items upserted into catalog (no duplicates)

Script: scripts/crawl-manufacturer.ts

Lives in this repo. Accepts a --manufacturer=<slug> flag.

Responsibilities:

  1. Fetch manufacturer record from DB by slug
  2. Build agent prompt with manufacturer context + target schema
  3. Run Claude Haiku agent against the website
  4. Validate and clean agent output
  5. Call POST /api/global-items/bulk with the result
  6. Log success/failure counts
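
The flag handling for the script could look roughly like the sketch below. The function name parseManufacturerFlag and the slug pattern (lowercase words joined by hyphens, inferred from examples such as "apidura" and "revelate-designs") are assumptions, not part of the design.

```typescript
// Extracts the slug from argv entries such as ["--manufacturer=canyon"].
// Hypothetical helper; the slug pattern is an assumption based on the examples in this spec.
function parseManufacturerFlag(argv: string[]): string {
  const prefix = "--manufacturer=";
  const arg = argv.find((a) => a.startsWith(prefix));
  if (arg === undefined) {
    throw new Error("Missing required flag: --manufacturer=<slug>");
  }
  const slug = arg.slice(prefix.length).trim().toLowerCase();
  // Accept lowercase alphanumeric words separated by single hyphens.
  if (!/^[a-z0-9]+(?:-[a-z0-9]+)*$/.test(slug)) {
    throw new Error(`Invalid manufacturer slug: "${slug}"`);
  }
  return slug;
}
```

Under Bun the script would typically pass process.argv.slice(2) to this function before looking up the manufacturer record.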

Agent Target Schema

The agent is instructed to extract each product as:

{
  manufacturerId: number,   // resolved from DB before passing to agent
  model: string,            // product model name (no brand prefix)
  category: string,         // see canonical categories below
  weightGrams: number|null,
  priceCents: number|null,  // MSRP in manufacturer's base currency
  priceCurrency: string,    // "EUR", "USD", etc.
  description: string|null,
  sourceUrl: string,        // direct product page URL
  tags: string[],           // from canonical tag list
}

priceCents maps to globalItems.priceCents (base MSRP). priceCurrency is used by the script to also insert a row into marketPrices (market derived from manufacturer country + currency), so regional pricing is populated from day one.
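
One way the "validate and clean" step could treat this schema is sketched below. The CatalogItem interface restates the fields above; cleanItem, toNonNegativeInt and the specific rules (three-letter uppercase currency code, http(s) source URL, rounding to whole grams and cents) are assumptions made for illustration.

```typescript
interface CatalogItem {
  manufacturerId: number;
  model: string;
  category: string;
  weightGrams: number | null;
  priceCents: number | null;
  priceCurrency: string;
  description: string | null;
  sourceUrl: string;
  tags: string[];
}

// Returns a rounded non-negative number, or null when the value is missing or unusable.
function toNonNegativeInt(value: unknown): number | null {
  return typeof value === "number" && Number.isFinite(value) && value >= 0
    ? Math.round(value)
    : null;
}

// Hypothetical cleaner: returns null when a required field is missing or malformed.
function cleanItem(raw: unknown): CatalogItem | null {
  if (typeof raw !== "object" || raw === null) return null;
  const {
    manufacturerId, model, category, weightGrams, priceCents,
    priceCurrency, description, sourceUrl, tags,
  } = raw as Record<string, unknown>;

  if (typeof manufacturerId !== "number" || !Number.isInteger(manufacturerId)) return null;
  if (typeof model !== "string" || model.trim() === "") return null;
  if (typeof category !== "string") return null;
  if (typeof priceCurrency !== "string" || !/^[A-Z]{3}$/.test(priceCurrency)) return null;
  if (typeof sourceUrl !== "string" || !/^https?:\/\//.test(sourceUrl)) return null;

  const trimmedDescription = typeof description === "string" ? description.trim() : "";
  return {
    manufacturerId,
    model: model.trim(),
    category,
    weightGrams: toNonNegativeInt(weightGrams),
    priceCents: toNonNegativeInt(priceCents),
    priceCurrency,
    description: trimmedDescription === "" ? null : trimmedDescription,
    sourceUrl,
    tags: Array.isArray(tags) ? tags.filter((t): t is string => typeof t === "string") : [],
  };
}
```

Items that come back as null would count toward the failure total in the script's log; checks of category and tags against the canonical lists are left out of this sketch.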

Canonical Categories

Defined in scripts/taxonomy/categories.ts, mirroring the globalItems.category values:

  • bags — bikepacking bags, dry bags, stuff sacks
  • shelters — tents, bivys, tarps, hammocks
  • sleep — sleeping bags, quilts, pads, pillows
  • cooking — stoves, cookware, mugs, utensils
  • lighting — headlamps, bike lights, lanterns
  • water — filters, bottles, bladders
  • electronics — power banks, solar panels, GPS, bike computers
  • tools — multi-tools, pumps, repair kits, locks
  • clothing — jackets, base layers, gloves, shoes
  • navigation — GPS devices, maps, compasses
  • bikes — complete bikes
  • components — drivetrain, brakes, wheels, handlebars, saddles
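
A possible shape for scripts/taxonomy/categories.ts, given as a sketch only: the category keys are taken from the list above, while the names CATEGORIES, Category and isCategory are assumptions.

```typescript
// Canonical category keys, in the order listed in this spec.
const CATEGORIES = [
  "bags",
  "shelters",
  "sleep",
  "cooking",
  "lighting",
  "water",
  "electronics",
  "tools",
  "clothing",
  "navigation",
  "bikes",
  "components",
] as const;

// Union type derived from the array, so the list is defined in one place.
type Category = (typeof CATEGORIES)[number];

// Type guard for checking a category string returned by the agent.
function isCategory(value: string): value is Category {
  return (CATEGORIES as readonly string[]).includes(value);
}
```

Deriving the type from the array means a new category only has to be added once.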

Tag Assignment

The agent assigns tags from the canonical list already seeded in the DB (SEED_TAGS in seed-global-items.ts). The prompt includes the full tag list so the agent can pick appropriate ones per item.
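
Because the agent may still return tags outside the list, the script could filter them before posting. This is a sketch under assumptions: filterTags is a hypothetical helper, and the canonical set is passed in rather than read from SEED_TAGS, whose exact shape is not shown here.

```typescript
// Splits agent-proposed tags into those found in the canonical set and those that are not.
// Whitespace is trimmed and duplicates are collapsed; matching is exact.
function filterTags(
  proposed: string[],
  canonical: ReadonlySet<string>,
): { kept: string[]; dropped: string[] } {
  const kept = new Set<string>();
  const dropped = new Set<string>();
  for (const tag of proposed) {
    const trimmed = tag.trim();
    if (trimmed === "") continue;
    if (canonical.has(trimmed)) {
      kept.add(trimmed);
    } else {
      dropped.add(trimmed);
    }
  }
  return { kept: Array.from(kept), dropped: Array.from(dropped) };
}
```

Returning the dropped tags separately lets the script log them, which may help spot gaps in the canonical list.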


API Changes

POST /api/global-items/bulk already exists and handles upsert on (brand, model). Once the schema migration lands, the upsert key becomes (manufacturer_id, model). The route and service logic change minimally — the schema enforces the constraint.

POST /api/global-items (single upsert) gets the same change.

Both routes require auth (existing behavior).
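
To illustrate what the new upsert key means for repeated crawls, here is an in-memory sketch. It is a stand-in for the database, not the service code: UpsertRow, upsertKey and bulkUpsert are hypothetical names, and the Map plays the role of the unique constraint.

```typescript
interface UpsertRow {
  manufacturerId: number;
  model: string;
  priceCents: number | null;
}

// Composite key equivalent to the (manufacturer_id, model) unique constraint.
function upsertKey(row: { manufacturerId: number; model: string }): string {
  return `${row.manufacturerId}:${row.model}`;
}

// Inserts new rows and overwrites existing ones that share the same key.
function bulkUpsert(
  catalog: Map<string, UpsertRow>,
  incoming: UpsertRow[],
): { inserted: number; updated: number } {
  let inserted = 0;
  let updated = 0;
  for (const row of incoming) {
    const key = upsertKey(row);
    if (catalog.has(key)) {
      updated += 1;
    } else {
      inserted += 1;
    }
    catalog.set(key, row);
  }
  return { inserted, updated };
}
```

The same model name under two different manufacturers yields two rows, while re-crawling one manufacturer updates its rows in place.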


Running the Pipeline

# Add a manufacturer first (via API or direct DB insert)
# Then crawl:
bun run scripts/crawl-manufacturer.ts --manufacturer=canyon
bun run scripts/crawl-manufacturer.ts --manufacturer=apidura
bun run scripts/crawl-manufacturer.ts --manufacturer=revelate-designs

# Crawl all active tier-1 manufacturers:
bun run scripts/crawl-all.ts --tier=1
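
The selection step inside crawl-all.ts might look like the following sketch. ManufacturerRow and selectForCrawl are hypothetical names, and the alphabetical ordering is an assumption made so that runs are reproducible.

```typescript
interface ManufacturerRow {
  slug: string;
  tier: number;
  active: boolean;
}

// Returns the slugs of active manufacturers in the requested tier, sorted alphabetically.
function selectForCrawl(rows: ManufacturerRow[], tier: number): string[] {
  return rows
    .filter((m) => m.active && m.tier === tier)
    .map((m) => m.slug)
    .sort();
}
```

Each returned slug would then be crawled the same way as a single --manufacturer=<slug> run.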

Out of Scope (Deferred)

  • RSS / new-release monitoring — scheduled polling of brand RSS feeds for new product announcements
  • Price updates — periodic refresh of marketPrices from retailer sites
  • Community submissions — user-proposed items with admin approval workflow (Phase 999.6)
  • Separate ingestion repo — pipeline stays in this repo until complexity justifies splitting
  • Aggregator scraping (Tier 2) — Bikepacking.com, Outdoor Gear Lab as discovery sources

Implementation Phases

This design breaks into two sequential phases:

Phase A — Schema & API

  1. Add manufacturers table + migration
  2. Migrate globalItems: replace brand text with manufacturerId FK
  3. Update all queries, services, routes, and MCP tools that reference brand
  4. Seed initial manufacturer list (top bikepacking/biking/hiking brands)

Phase B — Ingestion Script

  1. scripts/crawl-manufacturer.ts — agent runner
  2. scripts/taxonomy/categories.ts — canonical category map
  3. scripts/crawl-all.ts — batch runner by tier
  4. Test against 2-3 real manufacturers (Canyon, Apidura, Revelate Designs)

Agent Execution Model

The crawl script launches a Claude Code headless session (via the Claude Agent SDK) rather than calling the Anthropic API directly. This gives the agent full tool access — WebFetch, browser navigation, file I/O — without needing to re-implement those capabilities. Auth is handled via OAuth rather than a raw API key.

Each manufacturer gets its own agent session. The session receives:

  • The manufacturer record (name, website, tier)
  • The target schema and canonical taxonomy
  • A GearBox API key scoped to write access

The agent browses the manufacturer site, extracts products, and sends them to POST /api/global-items/bulk directly from within the session. No intermediate file serialization is needed.
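
The session inputs listed above could be assembled into a prompt along these lines. This is a sketch only: PromptInput and buildAgentPrompt are hypothetical, the wording of the prompt is invented for illustration, and the call into the Claude Agent SDK is deliberately left out.

```typescript
interface PromptInput {
  name: string;
  website: string;
  manufacturerId: number;
  categories: readonly string[];
  tags: readonly string[];
}

// Builds a plain-text prompt from the manufacturer record and the canonical taxonomy.
// The API key is not included here; it would be supplied to the session separately.
function buildAgentPrompt(input: PromptInput): string {
  return [
    `Crawl the full product catalog of ${input.name} at ${input.website}.`,
    `Return each product as a JSON object with manufacturerId set to ${input.manufacturerId}.`,
    `Use the product model name without the brand prefix.`,
    `Allowed categories: ${input.categories.join(", ")}.`,
    `Allowed tags: ${input.tags.join(", ")}.`,
    `Use null for weightGrams or priceCents when the page does not state them.`,
  ].join("\n");
}
```

Keeping the prompt builder a pure function makes it easy to test without starting an agent session.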