docs: catalog population design spec
164
docs/superpowers/specs/2026-04-18-catalog-population-design.md
Normal file
@@ -0,0 +1,164 @@
# Catalog Population & Maintenance Design

**Date**: 2026-04-18
**Status**: Approved
**Domains**: Bikepacking, biking, hiking

---

## Goal

Populate the `globalItems` catalog at scale using AI agents that crawl manufacturer websites, and establish the data model to support ongoing ingestion. Community submissions and automated update scheduling are deferred to a later phase.

---

## Schema Changes

### New: `manufacturers` table

```sql
CREATE TABLE manufacturers (
  id         SERIAL PRIMARY KEY,
  name       TEXT NOT NULL UNIQUE,          -- "Apidura", "Canyon"
  slug       TEXT NOT NULL UNIQUE,          -- "apidura", "canyon"
  website    TEXT NOT NULL,                 -- brand homepage
  tier       INTEGER NOT NULL DEFAULT 1,    -- 1 = deep scrape, 2 = aggregator, 3 = RSS only
  active     BOOLEAN NOT NULL DEFAULT true,
  country    TEXT,                          -- "DE", "US", etc.
  created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
```

`tier` controls how the ingestion pipeline handles the brand:
- **1** — deep agent crawl of full product catalog
- **2** — discovered via gear aggregators (Bikepacking.com, Outdoor Gear Lab); deferred, see Out of Scope
- **3** — RSS/new-releases only (future)
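The tier values could be modelled in the script layer roughly as follows. This is a minimal sketch: `Tier`, `TIER_HANDLING`, `isTier`, and `isCrawlableNow` are hypothetical names, and the rule that only tier 1 is crawlable reflects the deferrals listed under Out of Scope.

```typescript
// Hypothetical names; the spec only defines the integer values 1 to 3.
type Tier = 1 | 2 | 3;

// Handling strategy per tier, taken from the list above.
const TIER_HANDLING: Record<Tier, string> = {
  1: "deep agent crawl of the full product catalog",
  2: "discovery via gear aggregators (deferred)",
  3: "RSS / new releases only (future)",
};

// Narrows an arbitrary DB integer to a known tier.
function isTier(value: number): value is Tier {
  return value === 1 || value === 2 || value === 3;
}

// Assumption: in this phase only tier 1 has a working pipeline.
function isCrawlableNow(tier: Tier): boolean {
  return tier === 1;
}
```

Keeping the mapping in one place would let a batch runner reject tiers that have no pipeline yet instead of silently doing nothing.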

### Modified: `globalItems` table

- **Remove** `brand TEXT` field
- **Add** `manufacturer_id INTEGER NOT NULL REFERENCES manufacturers(id)`
- **Unique constraint** changes from `(brand, model)` to `(manufacturer_id, model)`

All queries that previously selected `brand` now join to `manufacturers` to get the name. The brand name is not denormalized onto `globalItems`.
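The backfill implied by this change can be pictured with the in-memory sketch below. The row shapes, `slugify`, and `backfillManufacturers` are illustrative assumptions rather than code from this repo. It only shows how distinct `brand` strings might collapse into `manufacturers` rows that items then reference by id.

```typescript
// Hypothetical row shapes; the real tables carry more columns.
interface LegacyItem { id: number; brand: string; model: string }
interface Manufacturer { id: number; name: string; slug: string }
interface MigratedItem { id: number; manufacturerId: number; model: string }

// Assumed slug rule: lowercase, runs of non-alphanumerics become "-".
function slugify(name: string): string {
  return name
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Groups items by brand slug and rewrites each item to point at its manufacturer.
function backfillManufacturers(items: LegacyItem[]): {
  manufacturers: Manufacturer[];
  migrated: MigratedItem[];
} {
  const bySlug = new Map<string, Manufacturer>();
  const migrated: MigratedItem[] = [];
  for (const item of items) {
    const slug = slugify(item.brand);
    let manufacturer = bySlug.get(slug);
    if (!manufacturer) {
      // Ids are assigned sequentially here; a real migration would use the DB sequence.
      manufacturer = { id: bySlug.size + 1, name: item.brand.trim(), slug };
      bySlug.set(slug, manufacturer);
    }
    migrated.push({ id: item.id, manufacturerId: manufacturer.id, model: item.model });
  }
  return { manufacturers: Array.from(bySlug.values()), migrated };
}
```

Grouping by slug rather than by raw string means spellings such as `"Apidura"` and `"apidura "` end up as one manufacturer. Two legacy rows that differed only in brand spelling would then collide on `(manufacturer_id, model)`, so a real migration would need to resolve those before adding the constraint.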

---

## Ingestion Pipeline

### Overview

```
1. Add manufacturer to DB (name, slug, website, tier, active=true)
2. bun run scripts/crawl-manufacturer.ts --manufacturer=canyon
3. Claude Haiku agent crawls manufacturer website
4. Agent outputs structured JSON array
5. Script calls POST /api/global-items/bulk
6. Items upserted into catalog (no duplicates)
```

### Script: `scripts/crawl-manufacturer.ts`

Lives in this repo and accepts a `--manufacturer=<slug>` flag.

Responsibilities:
1. Fetch manufacturer record from DB by slug
2. Build agent prompt with manufacturer context + target schema
3. Run Claude Haiku agent against the website
4. Validate and clean agent output
5. Call `POST /api/global-items/bulk` with the result
6. Log success/failure counts
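The six responsibilities above can be sketched as one orchestration function with its side effects injected. Everything named here (`CrawlDeps`, `parseSlug`, `crawlManufacturer`, and the placeholder prompt text) is a hypothetical stand-in; the real DB access, agent invocation, and HTTP client are not specified in this document.

```typescript
// Hypothetical dependency surface so the flow can run without a DB, agent, or network.
interface Manufacturer { id: number; name: string; slug: string; website: string }
interface CrawlDeps {
  findManufacturerBySlug(slug: string): Promise<Manufacturer | null>;
  runAgent(prompt: string): Promise<unknown[]>;
  validate(raw: unknown[], manufacturerId: number): { valid: object[]; rejected: number };
  postBulk(items: object[]): Promise<void>;
}

// Extracts the slug from an argv-style array containing `--manufacturer=<slug>`.
function parseSlug(argv: string[]): string | null {
  const prefix = "--manufacturer=";
  const flag = argv.find((arg) => arg.startsWith(prefix));
  const slug = flag ? flag.slice(prefix.length) : "";
  return slug.length > 0 ? slug : null;
}

async function crawlManufacturer(argv: string[], deps: CrawlDeps) {
  const slug = parseSlug(argv);
  if (!slug) throw new Error("usage: --manufacturer=<slug>");

  const manufacturer = await deps.findManufacturerBySlug(slug); // step 1
  if (!manufacturer) throw new Error(`unknown manufacturer: ${slug}`);

  // Step 2: placeholder prompt; the real one also embeds the target schema.
  const prompt = `Extract all products from ${manufacturer.website} for ${manufacturer.name}.`;
  const raw = await deps.runAgent(prompt); // step 3
  const { valid, rejected } = deps.validate(raw, manufacturer.id); // step 4
  if (valid.length > 0) await deps.postBulk(valid); // step 5
  console.log(`${slug}: ${valid.length} sent, ${rejected} rejected`); // step 6
  return { sent: valid.length, rejected };
}
```

Injecting the dependencies is one possible design choice. It would let the flow be tested against canned agent output before any real crawl is run.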

### Agent Target Schema

The agent is instructed to extract each product as:

```ts
{
  manufacturerId: number,       // resolved from DB before passing to agent
  model: string,                // product model name (no brand prefix)
  category: string,             // see canonical categories below
  weightGrams: number | null,
  priceCents: number | null,    // MSRP in manufacturer's base currency
  priceCurrency: string,        // "EUR", "USD", etc.
  description: string | null,
  sourceUrl: string,            // direct product page URL
  tags: string[],               // from canonical tag list
}
```

`priceCents` maps to `globalItems.priceCents` (base MSRP). The script also uses `priceCurrency` to insert a row into `marketPrices`, with the market derived from the manufacturer's country and currency, so regional pricing is populated from day one.
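A possible shape for the "validate and clean" step against the target schema is sketched below. The `AgentItem` name and every concrete rule (three-letter uppercase currency codes, `http(s)` URLs, non-negative numbers) are assumptions for illustration. The spec itself only fixes the field names and types.

```typescript
// Mirrors the target schema; the interface name is hypothetical.
interface AgentItem {
  manufacturerId: number;
  model: string;
  category: string;
  weightGrams: number | null;
  priceCents: number | null;
  priceCurrency: string;
  description: string | null;
  sourceUrl: string;
  tags: string[];
}

// Accepts null or a finite, non-negative number.
function isNullableAmount(value: unknown): value is number | null {
  return value === null || (typeof value === "number" && Number.isFinite(value) && value >= 0);
}

// Returns a cleaned item, or null when the raw agent output does not match the schema.
function parseAgentItem(raw: unknown): AgentItem | null {
  if (typeof raw !== "object" || raw === null) return null;
  const { manufacturerId, model, category, weightGrams, priceCents,
          priceCurrency, description, sourceUrl, tags } = raw as Record<string, unknown>;

  if (typeof manufacturerId !== "number" || !Number.isInteger(manufacturerId)) return null;
  if (typeof model !== "string" || model.trim() === "") return null;
  if (typeof category !== "string") return null;
  if (!isNullableAmount(weightGrams) || !isNullableAmount(priceCents)) return null;
  if (typeof priceCurrency !== "string" || !/^[A-Z]{3}$/.test(priceCurrency)) return null;
  if (description !== null && typeof description !== "string") return null;
  if (typeof sourceUrl !== "string" || !/^https?:\/\//.test(sourceUrl)) return null;
  if (!Array.isArray(tags) || !tags.every((tag) => typeof tag === "string")) return null;

  return {
    manufacturerId,
    model: model.trim(),
    category,
    weightGrams,
    priceCents,
    priceCurrency,
    description: description as string | null,
    sourceUrl,
    tags: tags as string[],
  };
}
```

Items that come back as `null` would count toward the failure total that the script logs.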

### Canonical Categories

Defined in `scripts/taxonomy/categories.ts`, mirroring the `globalItems.category` values:

- `bags` — bikepacking bags, dry bags, stuff sacks
- `shelters` — tents, bivys, tarps, hammocks
- `sleep` — sleeping bags, quilts, pads, pillows
- `cooking` — stoves, cookware, mugs, utensils
- `lighting` — headlamps, bike lights, lanterns
- `water` — filters, bottles, bladders
- `electronics` — power banks, solar panels, GPS, bike computers
- `tools` — multi-tools, pumps, repair kits, locks
- `clothing` — jackets, base layers, gloves, shoes
- `navigation` — GPS devices, maps, compasses
- `bikes` — complete bikes
- `components` — drivetrain, brakes, wheels, handlebars, saddles
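A minimal sketch of what `scripts/taxonomy/categories.ts` might contain follows. The category values come from the list above. The identifiers `CATEGORIES`, `Category`, `isCategory`, and `normalizeCategory` are assumptions, and a real module would export them.

```typescript
// Category values copied from the canonical list; identifier names are hypothetical.
const CATEGORIES = [
  "bags", "shelters", "sleep", "cooking", "lighting", "water",
  "electronics", "tools", "clothing", "navigation", "bikes", "components",
] as const;

type Category = (typeof CATEGORIES)[number];

// Type guard for values coming back from the agent.
function isCategory(value: string): value is Category {
  return (CATEGORIES as readonly string[]).includes(value);
}

// Tolerates casing and whitespace differences; returns null for unknown categories.
function normalizeCategory(raw: string): Category | null {
  const candidate = raw.trim().toLowerCase();
  return isCategory(candidate) ? candidate : null;
}
```

Deriving the `Category` type from the array keeps the runtime check and the compile-time type in sync from a single list.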

### Tag Assignment

The agent assigns tags from the canonical list already seeded in the DB (`SEED_TAGS` in `seed-global-items.ts`). The prompt includes the full tag list so the agent can pick appropriate ones per item.
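Both halves of tag assignment, listing the tags in the prompt and discarding anything the agent invents, might look like the sketch below. `EXAMPLE_TAGS` holds placeholder values standing in for the real `SEED_TAGS`, and both function names and the prompt wording are hypothetical.

```typescript
// Placeholder values only; the real list is SEED_TAGS in seed-global-items.ts.
const EXAMPLE_TAGS = ["ultralight", "waterproof", "bikepacking"];

// Renders the tag list as a prompt section (wording is an assumption).
function tagPromptSection(tags: readonly string[]): string {
  const lines = tags.map((tag) => `- ${tag}`).join("\n");
  return `Assign zero or more tags, chosen only from this list:\n${lines}`;
}

// Keeps only tags present in the canonical list, matched case-insensitively and
// de-duplicated, and returns them in their canonical spelling.
function keepKnownTags(agentTags: readonly string[], canonical: readonly string[]): string[] {
  const canonicalByKey = new Map<string, string>();
  for (const tag of canonical) canonicalByKey.set(tag.toLowerCase(), tag);

  const kept = new Set<string>();
  for (const tag of agentTags) {
    const match = canonicalByKey.get(tag.trim().toLowerCase());
    if (match !== undefined) kept.add(match);
  }
  return Array.from(kept);
}
```

Filtering after the fact guards against an agent returning a tag that is not in the list, even though the prompt asks it not to.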

---

## API Changes

`POST /api/global-items/bulk` already exists and handles upsert on `(brand, model)`. Once the schema migration lands, the upsert key becomes `(manufacturer_id, model)`. The route and service logic change minimally because the schema enforces the constraint.

`POST /api/global-items` (single upsert) gets the same change.

Both routes require auth (existing behavior).
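The effect of the new upsert key can be illustrated with an in-memory stand-in for the table. This is only a sketch of the semantics, not the service code: `Item`, `upsertKey`, and `bulkUpsert` are hypothetical, and in the real system the database constraint does this work.

```typescript
// Hypothetical, trimmed-down item shape.
interface Item { manufacturerId: number; model: string; priceCents: number | null }

// Composite key standing in for the unique constraint on (manufacturer_id, model).
function upsertKey(item: Item): string {
  return JSON.stringify([item.manufacturerId, item.model]);
}

// Inserts new items and overwrites existing ones that share the same key.
function bulkUpsert(catalog: Map<string, Item>, incoming: Item[]): { inserted: number; updated: number } {
  let inserted = 0;
  let updated = 0;
  for (const item of incoming) {
    const key = upsertKey(item);
    if (catalog.has(key)) updated += 1;
    else inserted += 1;
    catalog.set(key, item);
  }
  return { inserted, updated };
}
```

Under this key, re-running a crawl for the same manufacturer updates existing rows instead of creating duplicates, while two manufacturers can still share a model name.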

---

## Running the Pipeline

```bash
# Add a manufacturer first (via API or direct DB insert)
# Then crawl:
bun run scripts/crawl-manufacturer.ts --manufacturer=canyon
bun run scripts/crawl-manufacturer.ts --manufacturer=apidura
bun run scripts/crawl-manufacturer.ts --manufacturer=revelate-designs

# Crawl all active tier-1 manufacturers:
bun run scripts/crawl-all.ts --tier=1
```
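The batch runner's core logic could be as small as the sketch below, which selects active manufacturers of the requested tier and crawls them one at a time. `selectForCrawl`, `crawlAll`, and the sequential, continue-on-error behaviour are assumptions; the spec only states that `crawl-all.ts` runs by tier.

```typescript
// Hypothetical subset of the manufacturers row.
interface Manufacturer { slug: string; tier: number; active: boolean }

// Picks the slugs of active manufacturers in the requested tier.
function selectForCrawl(all: Manufacturer[], tier: number): string[] {
  return all.filter((m) => m.active && m.tier === tier).map((m) => m.slug);
}

// Crawls sequentially and collects failures instead of aborting the whole batch.
async function crawlAll(
  slugs: string[],
  crawlOne: (slug: string) => Promise<void>,
): Promise<string[]> {
  const failed: string[] = [];
  for (const slug of slugs) {
    try {
      await crawlOne(slug);
    } catch {
      failed.push(slug);
    }
  }
  return failed;
}
```

Running sequentially keeps load on each manufacturer's site low and makes logs easier to read. Parallelism could be added later if crawl time becomes a problem.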

---

## Out of Scope (Deferred)

- **RSS / new-release monitoring** — scheduled polling of brand RSS feeds for new product announcements
- **Price updates** — periodic refresh of `marketPrices` from retailer sites
- **Community submissions** — user-proposed items with admin approval workflow (Phase 999.6)
- **Separate ingestion repo** — pipeline stays in this repo until complexity justifies splitting
- **Aggregator scraping (Tier 2)** — Bikepacking.com, Outdoor Gear Lab as discovery sources

---

## Implementation Phases

This design breaks into two sequential phases:

**Phase A — Schema & API**
1. Add `manufacturers` table + migration
2. Migrate `globalItems`: replace `brand` text with `manufacturerId` FK
3. Update all queries, services, routes, and MCP tools that reference `brand`
4. Seed initial manufacturer list (top bikepacking/biking/hiking brands)

**Phase B — Ingestion Script**
1. `scripts/crawl-manufacturer.ts` — agent runner
2. `scripts/taxonomy/categories.ts` — canonical category map
3. `scripts/crawl-all.ts` — batch runner by tier
4. Test against 2-3 real manufacturers (Canyon, Apidura, Revelate Designs)