docs: catalog population design spec

2026-04-18 14:11:50 +02:00
parent fd874a3ff2
commit 26e20bd0d2
1 changed files with 164 additions and 0 deletions
--- a/docs/superpowers/specs/2026-04-18-catalog-population-design.md
+++ b/docs/superpowers/specs/2026-04-18-catalog-population-design.md
@@ -0,0 +1,164 @@
+# Catalog Population & Maintenance Design
+
+**Date**: 2026-04-18  
+**Status**: Approved  
+**Domains**: Bikepacking, biking, hiking
+
+---
+
+## Goal
+
+Populate the `globalItems` catalog at scale using AI agents that crawl manufacturer websites, and establish the data model to support ongoing ingestion. Community submissions and automated update scheduling are deferred to a later phase.
+
+---
+
+## Schema Changes
+
+### New: `manufacturers` table
+
+```sql
+CREATE TABLE manufacturers (
+  id         SERIAL PRIMARY KEY,
+  name       TEXT NOT NULL UNIQUE,        -- "Apidura", "Canyon"
+  slug       TEXT NOT NULL UNIQUE,        -- "apidura", "canyon"
+  website    TEXT NOT NULL,               -- brand homepage
+  tier       INTEGER NOT NULL DEFAULT 1, -- 1 = deep scrape, 2 = aggregator, 3 = RSS only
+  active     BOOLEAN NOT NULL DEFAULT true,
+  country    TEXT,                        -- "DE", "US", etc.
+  created_at TIMESTAMP NOT NULL DEFAULT NOW()
+);
+```
+
+`tier` controls how the brand is handled by the ingestion pipeline:
+- **1** — deep agent crawl of full product catalog
+- **2** — discovered via gear aggregators (Bikepacking.com, Outdoor Gear Lab); deferred, see Out of Scope
+- **3** — RSS/new-releases only (future)
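The `slug` column is the value the crawl scripts later receive on the command line, so it helps to derive it from `name` consistently. A minimal sketch, assuming a hypothetical `slugify` helper (the spec does not define one) and assuming slugs are lower-case ASCII words joined by hyphens, as in "apidura" and "revelate-designs":

```typescript
// Hypothetical helper, not part of the spec: derive a URL-safe slug
// from a manufacturer display name.
function slugify(name: string): string {
  return name
    .normalize("NFKD")                // split accented letters into base letter + combining mark
    .replace(/[\u0300-\u036f]/g, "")  // drop the combining marks
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")      // collapse each non-alphanumeric run into one hyphen
    .replace(/^-+|-+$/g, "");         // trim leading and trailing hyphens
}
```

Because `slug` is `UNIQUE`, two names that normalize to the same slug would still need manual resolution at insert time.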
+
+### Modified: `globalItems` table
+
+- **Remove** `brand TEXT` field
+- **Add** `manufacturer_id INTEGER NOT NULL REFERENCES manufacturers(id)`
+- **Unique constraint** changes from `(brand, model)` → `(manufacturer_id, model)`
+
+All queries that previously selected `brand` must now join to `manufacturers` to get the name. The brand name is not denormalized onto `globalItems`.
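Since `manufacturer_id` is `NOT NULL`, every existing `brand` value has to resolve to a `manufacturers` row before the old column can be dropped. One possible shape for that backfill step is sketched here; the row interfaces and the case-insensitive name match are assumptions for illustration, not something the spec prescribes:

```typescript
// Shapes are illustrative; only the column names come from the spec.
interface ManufacturerRow { id: number; name: string }
interface LegacyItem { brand: string; model: string }
interface MigratedItem { manufacturerId: number; model: string }

function backfillManufacturerIds(
  items: LegacyItem[],
  manufacturers: ManufacturerRow[],
): { migrated: MigratedItem[]; unmatchedBrands: string[] } {
  // Index manufacturers by normalized name (trimmed, lower-cased).
  const byName = new Map<string, number>();
  for (const m of manufacturers) byName.set(m.name.trim().toLowerCase(), m.id);

  const migrated: MigratedItem[] = [];
  const unmatched = new Set<string>();
  for (const item of items) {
    const id = byName.get(item.brand.trim().toLowerCase());
    if (id === undefined) {
      unmatched.add(item.brand); // needs a manufacturers row before NOT NULL can hold
    } else {
      migrated.push({ manufacturerId: id, model: item.model });
    }
  }
  return { migrated, unmatchedBrands: Array.from(unmatched) };
}
```

Reporting the unmatched brands, rather than failing on the first one, lets the manufacturer seed list be completed in a single pass before the constraint is added.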
+
+---
+
+## Ingestion Pipeline
+
+### Overview
+
+```
+1. Add manufacturer to DB (name, slug, website, tier, active=true)
+2. bun run scripts/crawl-manufacturer.ts --manufacturer=canyon
+3. Claude Haiku agent crawls manufacturer website
+4. Agent outputs structured JSON array
+5. Script calls POST /api/global-items/bulk
+6. Items upserted into catalog (no duplicates)
+```
+
+### Script: `scripts/crawl-manufacturer.ts`
+
+Lives in this repo. Accepts a `--manufacturer=<slug>` flag, where `<slug>` matches `manufacturers.slug`.
+
+Responsibilities:
+1. Fetch manufacturer record from DB by slug
+2. Build agent prompt with manufacturer context + target schema
+3. Run Claude Haiku agent against the website
+4. Validate and clean agent output
+5. Call `POST /api/global-items/bulk` with the result
+6. Log success/failure counts
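The first thing the script has to do is read the `--manufacturer=<slug>` flag. A small sketch of that step, assuming a hypothetical `parseManufacturerFlag` helper and assuming slugs are lower-case alphanumerics separated by single hyphens (the actual script may well use a CLI library instead):

```typescript
// Hypothetical helper: extract and sanity-check the --manufacturer=<slug> flag.
function parseManufacturerFlag(argv: string[]): string {
  const prefix = "--manufacturer=";
  const arg = argv.find((a) => a.startsWith(prefix));
  if (arg === undefined) {
    throw new Error("Missing required flag --manufacturer=<slug>");
  }
  const slug = arg.slice(prefix.length);
  // Reject anything that cannot be a slug before touching the DB.
  if (!/^[a-z0-9]+(?:-[a-z0-9]+)*$/.test(slug)) {
    throw new Error(`Invalid manufacturer slug: "${slug}"`);
  }
  return slug;
}

// In the script this would typically be called as:
//   const slug = parseManufacturerFlag(process.argv.slice(2));
```

Failing early on a malformed slug keeps a typo from turning into an agent run against the wrong (or no) manufacturer.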
+
+### Agent Target Schema
+
+The agent is instructed to extract each product as:
+
+```ts
+{
+  manufacturerId: number,   // resolved from DB before passing to agent
+  model: string,            // product model name (no brand prefix)
+  category: string,         // see canonical categories below
+  weightGrams: number|null,
+  priceCents: number|null,  // MSRP in manufacturer's base currency
+  priceCurrency: string,    // "EUR", "USD", etc.
+  description: string|null,
+  sourceUrl: string,        // direct product page URL
+  tags: string[],           // from canonical tag list
+}
+```
+
+`priceCents` maps to `globalItems.priceCents` (base MSRP). `priceCurrency` is used by the script to also insert a row into `marketPrices` (market derived from manufacturer country + currency), so regional pricing is populated from day one.
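Agent output is untrusted JSON, so the "validate and clean" step needs a runtime check against the target schema. The sketch below is one way to do it without a schema library; `parseCrawledItem` and `isNullableNonNegative` are hypothetical names, and the specific rules (three-letter upper-case currency code, `http(s)` source URL, integer cents) are assumptions layered on top of the spec:

```typescript
interface CrawledItem {
  manufacturerId: number;
  model: string;
  category: string;
  weightGrams: number | null;
  priceCents: number | null;
  priceCurrency: string;
  description: string | null;
  sourceUrl: string;
  tags: string[];
}

function isNullableNonNegative(v: unknown): v is number | null {
  return v === null || (typeof v === "number" && Number.isFinite(v) && v >= 0);
}

// Returns a cleaned item, or null when the raw value does not match the schema.
function parseCrawledItem(raw: unknown): CrawledItem | null {
  if (typeof raw !== "object" || raw === null) return null;
  const { manufacturerId, model, category, weightGrams, priceCents,
          priceCurrency, description, sourceUrl, tags } = raw as Record<string, unknown>;

  if (typeof manufacturerId !== "number" || !Number.isInteger(manufacturerId)) return null;
  if (typeof model !== "string" || model.trim() === "") return null;
  if (typeof category !== "string") return null;
  if (!isNullableNonNegative(weightGrams)) return null;
  if (!isNullableNonNegative(priceCents)) return null;
  if (priceCents !== null && !Number.isInteger(priceCents)) return null; // cents must be whole
  if (typeof priceCurrency !== "string" || !/^[A-Z]{3}$/.test(priceCurrency)) return null;
  if (typeof sourceUrl !== "string" || !/^https?:\/\/\S+$/.test(sourceUrl)) return null;
  if (!Array.isArray(tags) || !tags.every((t) => typeof t === "string")) return null;

  let cleanDescription: string | null;
  if (description === null) cleanDescription = null;
  else if (typeof description === "string") cleanDescription = description;
  else return null;

  return {
    manufacturerId, model: model.trim(), category, weightGrams, priceCents,
    priceCurrency, description: cleanDescription, sourceUrl, tags,
  };
}
```

Items that fail validation return `null`, which makes it straightforward to count them for the success/failure log.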
+
+### Canonical Categories
+
+Defined in `scripts/taxonomy/categories.ts`; the list mirrors the `globalItems.category` values:
+
+- `bags` — bikepacking bags, dry bags, stuff sacks
+- `shelters` — tents, bivys, tarps, hammocks
+- `sleep` — sleeping bags, quilts, pads, pillows
+- `cooking` — stoves, cookware, mugs, utensils
+- `lighting` — headlamps, bike lights, lanterns
+- `water` — filters, bottles, bladders
+- `electronics` — power banks, solar panels, GPS, bike computers
+- `tools` — multi-tools, pumps, repair kits, locks
+- `clothing` — jackets, base layers, gloves, shoes
+- `navigation` — GPS devices, maps, compasses
+- `bikes` — complete bikes
+- `components` — drivetrain, brakes, wheels, handlebars, saddles
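A possible shape for `scripts/taxonomy/categories.ts`, sketched from the list above. The export names (`CATEGORIES`, `Category`, `isCategory`) are assumptions; only the twelve category values come from the spec:

```typescript
// Canonical categories, in the order listed in the spec.
const CATEGORIES = [
  "bags", "shelters", "sleep", "cooking", "lighting", "water",
  "electronics", "tools", "clothing", "navigation", "bikes", "components",
] as const;

// Union type of the twelve literal category strings.
type Category = (typeof CATEGORIES)[number];

// Runtime guard for agent output, which arrives as a plain string.
function isCategory(value: string): value is Category {
  return (CATEGORIES as readonly string[]).indexOf(value) !== -1;
}
```

Keeping the list as a `const` tuple gives both a compile-time union type and a runtime check from a single source of truth.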
+
+### Tag Assignment
+
+The agent assigns tags from the canonical list already seeded in the DB (`SEED_TAGS` in `seed-global-items.ts`). The prompt includes the full tag list so the agent can pick appropriate ones per item.
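Even with the full tag list in the prompt, the agent may return tags that are not in the canonical set. A minimal sketch of filtering them, assuming a hypothetical `filterTags` helper and assuming canonical tags are stored lower-case (the contents of `SEED_TAGS` are not shown in this spec):

```typescript
// Hypothetical helper: keep only tags that exist in the canonical list.
function filterTags(
  proposed: string[],
  canonical: readonly string[],
): { kept: string[]; rejected: string[] } {
  const allowed = new Set(canonical);
  const kept: string[] = [];
  const rejected: string[] = [];
  for (const raw of proposed) {
    const tag = raw.trim().toLowerCase(); // assumption: canonical tags are lower-case
    if (!allowed.has(tag)) {
      rejected.push(raw);                 // keep the original spelling for logging
    } else if (kept.indexOf(tag) === -1) {
      kept.push(tag);                     // drop duplicates, preserve first-seen order
    }
  }
  return { kept, rejected };
}
```

Logging the rejected tags over a few crawls would show whether the canonical list is missing something the agent keeps reaching for.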
+
+---
+
+## API Changes
+
+`POST /api/global-items/bulk` already exists and upserts on `(brand, model)`. Once the schema migration lands, the upsert key becomes `(manufacturer_id, model)`. The route and service logic need only minimal changes, because the unique constraint in the schema is what prevents duplicates.
+
+`POST /api/global-items` (single upsert) gets the same change.
+
+Both routes require auth (existing behavior).
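A crawl can surface the same product twice (for example via two category pages), which would put two items with the same `(manufacturer_id, model)` key into one bulk request. If the bulk service were implemented as a single multi-row `INSERT ... ON CONFLICT DO UPDATE` on PostgreSQL (both are assumptions; the spec does not say), that statement would fail, because PostgreSQL does not allow one such command to affect the same row twice. Deduplicating on the upsert key in the script before posting sidesteps the question. `dedupeByUpsertKey` is a hypothetical helper, and "last occurrence wins" is an arbitrary choice:

```typescript
interface UpsertItem { manufacturerId: number; model: string }

// Collapse items that share the (manufacturerId, model) upsert key.
function dedupeByUpsertKey<T extends UpsertItem>(items: T[]): T[] {
  const byKey = new Map<string, T>();
  for (const item of items) {
    // manufacturerId is numeric, so the first ":" unambiguously separates the parts.
    byKey.set(`${item.manufacturerId}:${item.model}`, item); // later entries overwrite earlier ones
  }
  return Array.from(byKey.values());
}
```

A `Map` keeps each key at the position where it was first inserted, so the output order follows the first appearance of each product while the values come from its last appearance.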
+
+---
+
+## Running the Pipeline
+
+```bash
+# Add a manufacturer first (via API or direct DB insert)
+# Then crawl:
+bun run scripts/crawl-manufacturer.ts --manufacturer=canyon
+bun run scripts/crawl-manufacturer.ts --manufacturer=apidura
+bun run scripts/crawl-manufacturer.ts --manufacturer=revelate-designs
+
+# Crawl all active tier-1 manufacturers:
+bun run scripts/crawl-all.ts --tier=1
+```
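For the batch runner, the core decision is which manufacturers to hand to the single-manufacturer script. A sketch of that selection, assuming a hypothetical `selectCrawlTargets` helper and a row shape limited to the three columns it needs:

```typescript
// Subset of the manufacturers row that the batch runner cares about.
interface CrawlCandidate { slug: string; tier: number; active: boolean }

// Return the slugs of all active manufacturers in the requested tier,
// sorted so that repeated runs process brands in a stable order.
function selectCrawlTargets(manufacturers: CrawlCandidate[], tier: number): string[] {
  return manufacturers
    .filter((m) => m.active && m.tier === tier)
    .map((m) => m.slug)
    .sort();
}
```

Each returned slug corresponds to one `--manufacturer=<slug>` invocation; whether the batch runner calls them sequentially or with limited concurrency is left open here.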
+
+---
+
+## Out of Scope (Deferred)
+
+- **RSS / new-release monitoring** — scheduled polling of brand RSS feeds for new product announcements
+- **Price updates** — periodic refresh of `marketPrices` from retailer sites
+- **Community submissions** — user-proposed items with admin approval workflow (Phase 999.6)
+- **Separate ingestion repo** — pipeline stays in this repo until complexity justifies splitting
+- **Aggregator scraping (Tier 2)** — Bikepacking.com, Outdoor Gear Lab as discovery sources
+
+---
+
+## Implementation Phases
+
+This design breaks into two sequential phases:
+
+**Phase A — Schema & API**
+1. Add `manufacturers` table + migration
+2. Migrate `globalItems`: replace the `brand` text column with the `manufacturer_id` FK
+3. Update all queries, services, routes, and MCP tools that reference `brand`
+4. Seed initial manufacturer list (top bikepacking/biking/hiking brands)
+
+**Phase B — Ingestion Script**
+1. `scripts/crawl-manufacturer.ts` — agent runner
+2. `scripts/taxonomy/categories.ts` — canonical category map
+3. `scripts/crawl-all.ts` — batch runner by tier
+4. Test against 2-3 real manufacturers (Canyon, Apidura, Revelate Designs)