diff --git a/docs/superpowers/specs/2026-04-18-catalog-population-design.md b/docs/superpowers/specs/2026-04-18-catalog-population-design.md
new file mode 100644
index 0000000..4f5ac9a
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-18-catalog-population-design.md
@@ -0,0 +1,164 @@
+# Catalog Population & Maintenance Design
+
+**Date**: 2026-04-18
+**Status**: Approved
+**Domains**: Bikepacking, biking, hiking
+
+---
+
+## Goal
+
+Populate the `globalItems` catalog at scale using AI agents that crawl manufacturer websites, and establish the data model to support ongoing ingestion. Community submissions and automated update scheduling are deferred to a later phase.
+
+---
+
+## Schema Changes
+
+### New: `manufacturers` table
+
+```sql
+CREATE TABLE manufacturers (
+  id         SERIAL PRIMARY KEY,
+  name       TEXT NOT NULL UNIQUE,        -- "Apidura", "Canyon"
+  slug       TEXT NOT NULL UNIQUE,        -- "apidura", "canyon"
+  website    TEXT NOT NULL,               -- brand homepage
+  tier       INTEGER NOT NULL DEFAULT 1,  -- 1 = deep scrape, 2 = aggregator, 3 = RSS only
+  active     BOOLEAN NOT NULL DEFAULT true,
+  country    TEXT,                        -- "DE", "US", etc.
+  created_at TIMESTAMP NOT NULL DEFAULT NOW()
+);
+```
+
+`tier` controls how the brand is handled by the ingestion pipeline:
+- **1**: deep agent crawl of full product catalog
+- **2**: discovered via gear aggregators (Bikepacking.com, Outdoor Gear Lab)
+- **3**: RSS/new-releases only (future)
+
+### Modified: `globalItems` table
+
+- **Remove** `brand TEXT` field
+- **Add** `manufacturer_id INTEGER NOT NULL REFERENCES manufacturers(id)`
+- **Unique constraint** changes from `(brand, model)` → `(manufacturer_id, model)`
+
+All queries that previously selected `brand` now join to `manufacturers` to get the name. No denormalization.
+
+---
+
+## Ingestion Pipeline
+
+### Overview
+
+```
+1. Add manufacturer to DB (name, website, tier, active=true)
+2. bun run scripts/crawl-manufacturer.ts --manufacturer=canyon
+3. Claude Haiku agent crawls manufacturer website
+4. Agent outputs structured JSON array
+5. Script calls POST /api/global-items/bulk
+6. Items upserted into catalog (no duplicates)
+```
+
+### Script: `scripts/crawl-manufacturer.ts`
+
+Lives in this repo. Accepts a `--manufacturer=` flag whose value is the manufacturer slug.
+
+Responsibilities:
+1. Fetch manufacturer record from DB by slug
+2. Build agent prompt with manufacturer context + target schema
+3. Run Claude Haiku agent against the website
+4. Validate and clean agent output
+5. Call `POST /api/global-items/bulk` with the result
+6. Log success/failure counts
+
+### Agent Target Schema
+
+The agent is instructed to extract each product as:
+
+```ts
+{
+  manufacturerId: number,       // resolved from DB before passing to agent
+  model: string,                // product model name (no brand prefix)
+  category: string,             // see canonical categories below
+  weightGrams: number | null,
+  priceCents: number | null,    // MSRP in manufacturer's base currency
+  priceCurrency: string,        // "EUR", "USD", etc.
+  description: string | null,
+  sourceUrl: string,            // direct product page URL
+  tags: string[],               // from canonical tag list
+}
+```
+
+`priceCents` maps to `globalItems.priceCents` (base MSRP). The script also uses `priceCurrency` to insert a row into `marketPrices` (with the market derived from the manufacturer's country and currency), so regional pricing is populated from day one.
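The "validate and clean agent output" responsibility could look roughly like the sketch below. It is not the script's actual implementation: the helper names (`validateAgentOutput`, `validateItem`, `stripBrandPrefix`, `CrawlContext`) are hypothetical, and the rejection rules (three-letter currency code, `http(s)` source URL, silently dropping unknown tags) are assumptions chosen for illustration. Only the field names and the category list come from this spec.

```typescript
// Sketch only: helper names and rejection rules are assumptions, not part of the spec.

// Canonical categories, copied from the list in this document.
const CATEGORIES: readonly string[] = [
  "bags", "shelters", "sleep", "cooking", "lighting", "water",
  "electronics", "tools", "clothing", "navigation", "bikes", "components",
];

interface CatalogItem {
  manufacturerId: number;
  model: string;
  category: string;
  weightGrams: number | null;
  priceCents: number | null;
  priceCurrency: string;
  description: string | null;
  sourceUrl: string;
  tags: string[];
}

interface CrawlContext {
  manufacturerId: number;   // resolved from the DB before the agent runs
  manufacturerName: string; // used to strip a leading brand name from `model`
  allowedTags: string[];    // canonical tag list (hypothetically loaded from SEED_TAGS)
}

// Returns a non-negative finite number, or null when the value is missing or unusable.
function nonNegativeOrNull(value: unknown, integerOnly: boolean): number | null {
  if (typeof value !== "number" || !isFinite(value) || value < 0) return null;
  if (integerOnly && Math.floor(value) !== value) return null;
  return value;
}

// Removes a leading brand name, e.g. "Apidura Saddle Pack" -> "Saddle Pack".
function stripBrandPrefix(model: string, brand: string): string {
  const trimmed = model.trim();
  const prefix = brand.trim().toLowerCase() + " ";
  const head = trimmed.slice(0, prefix.length).toLowerCase();
  return head === prefix ? trimmed.slice(prefix.length).trim() : trimmed;
}

// Returns a cleaned item, or null when a required field is missing or invalid.
function validateItem(raw: unknown, ctx: CrawlContext): CatalogItem | null {
  if (typeof raw !== "object" || raw === null) return null;
  const r = raw as Record<string, unknown>;

  const model = typeof r.model === "string" ? stripBrandPrefix(r.model, ctx.manufacturerName) : "";
  const category = typeof r.category === "string" ? r.category : "";
  const sourceUrl = typeof r.sourceUrl === "string" ? r.sourceUrl.trim() : "";
  const currency = typeof r.priceCurrency === "string" ? r.priceCurrency.trim().toUpperCase() : "";

  // Required (non-nullable) schema fields: reject the item if any is unusable.
  if (model === "") return null;
  if (CATEGORIES.indexOf(category) === -1) return null;
  if (!/^https?:\/\/\S+$/.test(sourceUrl)) return null;
  if (!/^[A-Z]{3}$/.test(currency)) return null;

  const description =
    typeof r.description === "string" && r.description.trim() !== "" ? r.description.trim() : null;
  const rawTags: unknown[] = Array.isArray(r.tags) ? r.tags : [];
  // Keep only tags that appear in the canonical list.
  const tags = rawTags.filter(
    (t): t is string => typeof t === "string" && ctx.allowedTags.indexOf(t) !== -1,
  );

  return {
    manufacturerId: ctx.manufacturerId, // always taken from the DB, never from the agent
    model,
    category,
    weightGrams: nonNegativeOrNull(r.weightGrams, false),
    priceCents: nonNegativeOrNull(r.priceCents, true),
    priceCurrency: currency,
    description,
    sourceUrl,
    tags,
  };
}

// Validates the whole agent response and counts rejected entries for logging.
function validateAgentOutput(raw: unknown, ctx: CrawlContext): { items: CatalogItem[]; rejected: number } {
  const list: unknown[] = Array.isArray(raw) ? raw : [];
  const items: CatalogItem[] = [];
  let rejected = 0;
  for (const entry of list) {
    const item = validateItem(entry, ctx);
    if (item === null) rejected += 1;
    else items.push(item);
  }
  return { items, rejected };
}
```

In this sketch, invalid optional fields degrade to `null` rather than rejecting the item, while invalid required fields reject it. The `rejected` count would feed the success/failure log line. Whether the real script should be stricter is a design decision this spec leaves open.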
+
+### Canonical Categories
+
+Defined in `scripts/taxonomy/categories.ts`, mirroring `globalItems.category` values:
+
+- `bags`: bikepacking bags, dry bags, stuff sacks
+- `shelters`: tents, bivys, tarps, hammocks
+- `sleep`: sleeping bags, quilts, pads, pillows
+- `cooking`: stoves, cookware, mugs, utensils
+- `lighting`: headlamps, bike lights, lanterns
+- `water`: filters, bottles, bladders
+- `electronics`: power banks, solar panels, GPS, bike computers
+- `tools`: multi-tools, pumps, repair kits, locks
+- `clothing`: jackets, base layers, gloves, shoes
+- `navigation`: GPS devices, maps, compasses
+- `bikes`: complete bikes
+- `components`: drivetrain, brakes, wheels, handlebars, saddles
+
+### Tag Assignment
+
+The agent assigns tags from the canonical list already seeded in the DB (`SEED_TAGS` in `seed-global-items.ts`). The prompt includes the full tag list so the agent can pick appropriate ones per item.
+
+---
+
+## API Changes
+
+`POST /api/global-items/bulk` already exists and handles upsert on `(brand, model)`. Once the schema migration lands, the upsert key becomes `(manufacturer_id, model)`. The route and service logic change minimally; the schema enforces the constraint.
+
+The same change applies to `POST /api/global-items` (single upsert).
+
+Both routes require auth (existing behavior).
+
+---
+
+## Running the Pipeline
+
+```bash
+# Add a manufacturer first (via API or direct DB insert)
+# Then crawl:
+bun run scripts/crawl-manufacturer.ts --manufacturer=canyon
+bun run scripts/crawl-manufacturer.ts --manufacturer=apidura
+bun run scripts/crawl-manufacturer.ts --manufacturer=revelate-designs
+
+# Crawl all active tier-1 manufacturers:
+bun run scripts/crawl-all.ts --tier=1
+```
+
+---
+
+## Out of Scope (Deferred)
+
+- **RSS / new-release monitoring**: scheduled polling of brand RSS feeds for new product announcements
+- **Price updates**: periodic refresh of `marketPrices` from retailer sites
+- **Community submissions**: user-proposed items with admin approval workflow (Phase 999.6)
+- **Separate ingestion repo**: pipeline stays in this repo until complexity justifies splitting
+- **Aggregator scraping (Tier 2)**: Bikepacking.com, Outdoor Gear Lab as discovery sources
+
+---
+
+## Implementation Phases
+
+This design breaks into two sequential phases:
+
+**Phase A: Schema & API**
+1. Add `manufacturers` table + migration
+2. Migrate `globalItems`: replace `brand` text with `manufacturer_id` FK
+3. Update all queries, services, routes, and MCP tools that reference `brand`
+4. Seed initial manufacturer list (top bikepacking/biking/hiking brands)
+
+**Phase B: Ingestion Script**
+1. `scripts/crawl-manufacturer.ts`: agent runner
+2. `scripts/taxonomy/categories.ts`: canonical category map
+3. `scripts/crawl-all.ts`: batch runner by tier
+4. Test against 2-3 real manufacturers (Canyon, Apidura, Revelate Designs)
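
For the batch runner in Phase B, the target-selection step might be sketched as follows. The function names (`parseTierFlag`, `selectCrawlTargets`) and the `ManufacturerRow` shape are hypothetical. The spec only says that `crawl-all.ts` takes `--tier=1` and crawls active manufacturers of that tier. Rejecting tiers outside 1 to 3 and sorting the slugs are assumptions made here for predictable runs.

```typescript
// Sketch only: names and behavior are illustrative assumptions, not the real crawl-all.ts.

// Subset of the `manufacturers` columns that the batch runner needs.
interface ManufacturerRow {
  slug: string;
  tier: number;
  active: boolean;
}

// Reads `--tier=N` from an argv-style array.
// Returns null when the flag is absent or N is outside the tiers defined in this spec (1 to 3).
function parseTierFlag(argv: string[]): number | null {
  for (const arg of argv) {
    const match = /^--tier=(\d+)$/.exec(arg);
    if (match) {
      const tier = parseInt(match[1], 10);
      return tier >= 1 && tier <= 3 ? tier : null;
    }
  }
  return null;
}

// Picks the slugs to crawl: active manufacturers of the requested tier, in a stable order.
function selectCrawlTargets(rows: ManufacturerRow[], tier: number): string[] {
  return rows
    .filter((row) => row.active && row.tier === tier)
    .map((row) => row.slug)
    .sort();
}
```

Each returned slug would then be handed to the same code path as `crawl-manufacturer.ts --manufacturer=` with that slug as the value, one manufacturer at a time, so that a failure on one brand does not abort the rest of the batch.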