Catalog Population & Maintenance Design
Date: 2026-04-18
Status: Approved
Domains: Bikepacking, biking, hiking
Goal
Populate the globalItems catalog at scale using AI agents that crawl manufacturer websites, and establish the data model to support ongoing ingestion. Community submissions and automated update scheduling are deferred to a later phase.
Schema Changes
New: manufacturers table
CREATE TABLE manufacturers (
  id         SERIAL PRIMARY KEY,
  name       TEXT NOT NULL UNIQUE,          -- "Apidura", "Canyon"
  slug       TEXT NOT NULL UNIQUE,          -- "apidura", "canyon"
  website    TEXT NOT NULL,                 -- brand homepage
  tier       INTEGER NOT NULL DEFAULT 1,    -- 1 = deep scrape, 2 = aggregator, 3 = RSS only
  active     BOOLEAN NOT NULL DEFAULT true,
  country    TEXT,                          -- "DE", "US", etc.
  created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
tier controls how the brand is handled by the ingestion pipeline:
- 1 — deep agent crawl of full product catalog
- 2 — discovered via gear aggregators (Bikepacking.com, Outdoor Gear Lab)
- 3 — RSS/new-releases only (future)
Modified: globalItems table
- Remove brand TEXT field
- Add manufacturer_id INTEGER NOT NULL REFERENCES manufacturers(id)
- Unique constraint changes from (brand, model) → (manufacturer_id, model)
All queries that previously selected brand must now join to manufacturers to get the name. The name is not denormalized onto globalItems.
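Replacing the free-text brand column means each distinct brand string has to be mapped to one manufacturers row. A minimal sketch of that mapping, assuming slugs are lowercase and hyphen-separated as in the examples ("apidura", "revelate-designs"); `slugify` and `distinctManufacturers` are hypothetical helpers, not functions from the existing codebase:

```typescript
// Hypothetical helper: derive a manufacturers.slug from a brand name.
// Assumption: slugs are lowercase ASCII words joined by hyphens.
function slugify(name: string): string {
  return name
    .normalize("NFKD")                // split accented letters into base + combining mark
    .replace(/[\u0300-\u036f]/g, "")  // drop the combining marks
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")      // collapse every other run of characters into one hyphen
    .replace(/^-+|-+$/g, "");         // trim leading and trailing hyphens
}

interface ManufacturerSeed {
  name: string;
  slug: string;
}

// Collapse the brand values found in globalItems into unique manufacturer seeds.
// The first spelling seen for a slug wins; later variants are treated as duplicates.
function distinctManufacturers(brands: string[]): ManufacturerSeed[] {
  const bySlug = new Map<string, ManufacturerSeed>();
  for (const raw of brands) {
    const name = raw.trim();
    const slug = slugify(name);
    if (slug !== "" && !bySlug.has(slug)) {
      bySlug.set(slug, { name, slug });
    }
  }
  return Array.from(bySlug.values());
}
```

Keying on the slug rather than the raw name folds case and whitespace variants of the same brand into a single row before the foreign key is backfilled.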
Ingestion Pipeline
Overview
1. Add manufacturer to DB (name, website, tier, active=true)
2. bun run scripts/crawl-manufacturer.ts --manufacturer=canyon
3. Claude Haiku agent crawls manufacturer website
4. Agent outputs structured JSON array
5. Script calls POST /api/global-items/bulk
6. Items upserted into catalog (no duplicates)
Script: scripts/crawl-manufacturer.ts
Lives in this repo. Accepts --manufacturer=<slug> flag.
Responsibilities:
- Fetch manufacturer record from DB by slug
- Build agent prompt with manufacturer context + target schema
- Run Claude Haiku agent against the website
- Validate and clean agent output
- Call POST /api/global-items/bulk with the result
- Log success/failure counts
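The flag handling at the top of the script could look roughly like the following. This is a sketch only: `parseManufacturerFlag` is a hypothetical name, and the slug pattern assumes the lowercase, hyphen-separated form used in the examples.

```typescript
// Hypothetical argument parser for scripts/crawl-manufacturer.ts.
// Expects exactly the documented form: --manufacturer=<slug>
function parseManufacturerFlag(argv: string[]): string {
  const prefix = "--manufacturer=";
  const arg = argv.find((a) => a.startsWith(prefix));
  if (arg === undefined) {
    throw new Error("Missing required flag --manufacturer=<slug>");
  }
  const slug = arg.slice(prefix.length);
  // Assumption: slugs are lowercase alphanumeric words joined by single hyphens.
  if (!/^[a-z0-9]+(-[a-z0-9]+)*$/.test(slug)) {
    throw new Error(`Invalid manufacturer slug: "${slug}"`);
  }
  return slug;
}

// In the script itself this would typically be called as:
//   const slug = parseManufacturerFlag(process.argv.slice(2));
```

Rejecting a malformed slug before the DB lookup keeps a typo from turning into a confusing "manufacturer not found" error later in the run.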
Agent Target Schema
The agent is instructed to extract each product as:
{
  manufacturerId: number,      // resolved from DB before passing to agent
  model: string,               // product model name (no brand prefix)
  category: string,            // see canonical categories below
  weightGrams: number|null,
  priceCents: number|null,     // MSRP in manufacturer's base currency
  priceCurrency: string,       // "EUR", "USD", etc.
  description: string|null,
  sourceUrl: string,           // direct product page URL
  tags: string[],              // from canonical tag list
}
priceCents maps to globalItems.priceCents (base MSRP). priceCurrency is used by the script to also insert a row into marketPrices (market derived from manufacturer country + currency), so regional pricing is populated from day one.
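The "validate and clean agent output" step can be sketched as a function that accepts untrusted JSON and returns either a well-formed product or null. The specific rules here are assumptions rather than requirements stated in this design: integer cents, non-negative weights, a three-letter uppercase currency code, and http(s) source URLs. `cleanProduct` is a hypothetical name.

```typescript
interface ExtractedProduct {
  manufacturerId: number;
  model: string;
  category: string;
  weightGrams: number | null;
  priceCents: number | null;
  priceCurrency: string;
  description: string | null;
  sourceUrl: string;
  tags: string[];
}

const isNonEmptyString = (v: unknown): v is string =>
  typeof v === "string" && v.trim() !== "";
const isNullableString = (v: unknown): v is string | null =>
  v === null || typeof v === "string";
const isNullableNonNegative = (v: unknown): v is number | null =>
  v === null || (typeof v === "number" && Number.isFinite(v) && v >= 0);

// Returns a cleaned product, or null if the agent output does not fit the schema.
function cleanProduct(raw: unknown): ExtractedProduct | null {
  if (typeof raw !== "object" || raw === null) return null;
  const {
    manufacturerId, model, category, weightGrams, priceCents,
    priceCurrency, description, sourceUrl, tags,
  } = raw as Record<string, unknown>;

  if (typeof manufacturerId !== "number" || !Number.isInteger(manufacturerId)) return null;
  if (!isNonEmptyString(model) || !isNonEmptyString(category)) return null;
  if (!isNullableNonNegative(weightGrams) || !isNullableNonNegative(priceCents)) return null;
  // Assumption: cents are whole numbers.
  if (priceCents !== null && !Number.isInteger(priceCents)) return null;
  // Assumption: currency is a three-letter uppercase code such as "EUR".
  if (typeof priceCurrency !== "string" || !/^[A-Z]{3}$/.test(priceCurrency)) return null;
  if (!isNullableString(description)) return null;
  if (typeof sourceUrl !== "string" || !/^https?:\/\//.test(sourceUrl)) return null;
  if (!Array.isArray(tags) || !tags.every((t) => typeof t === "string")) return null;

  return {
    manufacturerId,
    model: model.trim(),                          // strip stray whitespace from scraped text
    category,
    weightGrams,
    priceCents,
    priceCurrency,
    description,
    sourceUrl,
    tags: Array.from(new Set(tags as string[])),  // drop duplicate tags
  };
}
```

Items that come back as null would count toward the failure total the script logs, so a single malformed product does not abort the whole batch.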
Canonical Categories
Defined in scripts/taxonomy/categories.ts, mirrors globalItems.category values:
- bags: bikepacking bags, dry bags, stuff sacks
- shelters: tents, bivys, tarps, hammocks
- sleep: sleeping bags, quilts, pads, pillows
- cooking: stoves, cookware, mugs, utensils
- lighting: headlamps, bike lights, lanterns
- water: filters, bottles, bladders
- electronics: power banks, solar panels, GPS, bike computers
- tools: multi-tools, pumps, repair kits, locks
- clothing: jackets, base layers, gloves, shoes
- navigation: GPS devices, maps, compasses
- bikes: complete bikes
- components: drivetrain, brakes, wheels, handlebars, saddles
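One possible shape for scripts/taxonomy/categories.ts is a readonly tuple plus a type guard, so the same list drives both the agent prompt and output validation. The export names `CATEGORIES`, `Category`, and `isCategory` are assumptions; only the category values come from this design.

```typescript
// Canonical category values, in the order listed in this design.
export const CATEGORIES = [
  "bags",
  "shelters",
  "sleep",
  "cooking",
  "lighting",
  "water",
  "electronics",
  "tools",
  "clothing",
  "navigation",
  "bikes",
  "components",
] as const;

// Union type derived from the tuple: "bags" | "shelters" | ...
export type Category = (typeof CATEGORIES)[number];

// Type guard used to reject categories the agent invents.
export function isCategory(value: string): value is Category {
  return (CATEGORIES as readonly string[]).includes(value);
}
```

Deriving the type from the array keeps the list in one place: adding a category to the tuple updates both the runtime check and the compile-time union.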
Tag Assignment
The agent assigns tags from the canonical list already seeded in the DB (SEED_TAGS in seed-global-items.ts). The prompt includes the full tag list so the agent can pick appropriate ones per item.
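A rough sketch of how the prompt might embed the manufacturer context and the allowed vocabularies. The wording of the prompt and the name `buildAgentPrompt` are illustrative assumptions; the tag values would come from SEED_TAGS rather than being hard-coded.

```typescript
interface ManufacturerContext {
  name: string;
  website: string;
}

// Hypothetical prompt builder: the exact instructions given to the agent are not
// specified in this design, so the sentences below are placeholders.
function buildAgentPrompt(
  manufacturer: ManufacturerContext,
  categories: readonly string[],
  tags: readonly string[],
): string {
  return [
    `Crawl the product catalog of ${manufacturer.name} at ${manufacturer.website}.`,
    "Return a JSON array with one object per product, matching the target schema.",
    `Allowed categories (pick exactly one): ${categories.join(", ")}`,
    `Allowed tags (use only these): ${tags.join(", ")}`,
  ].join("\n");
}
```

Listing the full vocabularies in the prompt lets the validation step treat any unknown category or tag as an agent error instead of silently growing the taxonomy.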
API Changes
POST /api/global-items/bulk already exists and handles upsert on (brand, model). Once the schema migration lands, the upsert key becomes (manufacturer_id, model). The route and service logic change minimally — the schema enforces the constraint.
POST /api/global-items (single upsert) gets the same key change.
Both routes require auth (existing behavior).
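The effect of the new upsert key can be illustrated with an in-memory model. This is not the service implementation, which relies on the database constraint. `itemKey` and `upsertAll` are hypothetical names, and the sketch assumes model names are compared exactly, as a plain unique constraint would.

```typescript
interface CatalogItem {
  manufacturerId: number;
  model: string;
  priceCents: number | null;
}

// Composite key mirroring the (manufacturer_id, model) unique constraint.
function itemKey(item: Pick<CatalogItem, "manufacturerId" | "model">): string {
  return JSON.stringify([item.manufacturerId, item.model]);
}

// Apply a batch to the catalog: insert new keys, overwrite existing ones.
function upsertAll(
  catalog: Map<string, CatalogItem>,
  incoming: CatalogItem[],
): { inserted: number; updated: number } {
  let inserted = 0;
  let updated = 0;
  for (const item of incoming) {
    const key = itemKey(item);
    if (catalog.has(key)) {
      updated++;
    } else {
      inserted++;
    }
    catalog.set(key, item);
  }
  return { inserted, updated };
}
```

Two manufacturers can ship a product with the same model name without colliding, while re-crawling one manufacturer overwrites its existing rows instead of duplicating them.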
Running the Pipeline
# Add a manufacturer first (via API or direct DB insert)
# Then crawl:
bun run scripts/crawl-manufacturer.ts --manufacturer=canyon
bun run scripts/crawl-manufacturer.ts --manufacturer=apidura
bun run scripts/crawl-manufacturer.ts --manufacturer=revelate-designs
# Crawl all active tier-1 manufacturers:
bun run scripts/crawl-all.ts --tier=1
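The selection step inside the batch runner might look like this. It is a sketch under the assumption that crawl-all.ts skips inactive manufacturers and crawls only the requested tier; `selectForCrawl` is a hypothetical name.

```typescript
interface ManufacturerRow {
  slug: string;
  tier: 1 | 2 | 3;
  active: boolean;
}

// Pick the slugs to crawl for a given tier, skipping inactive manufacturers.
// Sorting makes the run order deterministic, which simplifies comparing logs.
function selectForCrawl(rows: ManufacturerRow[], tier: 1 | 2 | 3): string[] {
  return rows
    .filter((row) => row.active && row.tier === tier)
    .map((row) => row.slug)
    .sort();
}
```

Each returned slug would then be handed to the same code path as a single --manufacturer=<slug> run.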
Out of Scope (Deferred)
- RSS / new-release monitoring: scheduled polling of brand RSS feeds for new product announcements
- Price updates: periodic refresh of marketPrices from retailer sites
- Community submissions: user-proposed items with admin approval workflow (Phase 999.6)
- Separate ingestion repo: pipeline stays in this repo until complexity justifies splitting
- Aggregator scraping (Tier 2): Bikepacking.com, Outdoor Gear Lab as discovery sources
Implementation Phases
This design breaks into two sequential phases:
Phase A — Schema & API
- Add manufacturers table + migration
- Migrate globalItems: replace brand text with manufacturerId FK
- Update all queries, services, routes, and MCP tools that reference brand
- Seed initial manufacturer list (top bikepacking/biking/hiking brands)
Phase B — Ingestion Script
- scripts/crawl-manufacturer.ts: agent runner
- scripts/taxonomy/categories.ts: canonical category map
- scripts/crawl-all.ts: batch runner by tier
- Test against 2-3 real manufacturers (Canyon, Apidura, Revelate Designs)
Agent Execution Model
The crawl script launches a Claude Code headless session (via the Claude Agent SDK) rather than calling the Anthropic API directly. This gives the agent full tool access — WebFetch, browser navigation, file I/O — without needing to re-implement those capabilities. Auth is handled via OAuth rather than a raw API key.
Each manufacturer gets its own agent session. The session receives:
- The manufacturer record (name, website, tier)
- The target schema and canonical taxonomy
- A GearBox API key scoped to write access
The agent browses the manufacturer site, extracts products, and posts to POST /api/global-items/bulk directly from within the session. No intermediate file serialization needed.