makiolaj/GearBox

Jean-Luc Makiola fefef38e9b docs: add agent execution model to catalog population spec

2026-04-18 14:39:59 +02:00

Catalog Population & Maintenance Design

Date: 2026-04-18
Status: Approved
Domains: Bikepacking, biking, hiking

Goal

Populate the globalItems catalog at scale using AI agents that crawl manufacturer websites, and establish the data model to support ongoing ingestion. Community submissions and automated update scheduling are deferred to a later phase.

Schema Changes

New: `manufacturers` table

CREATE TABLE manufacturers (
  id         SERIAL PRIMARY KEY,
  name       TEXT NOT NULL UNIQUE,        -- "Apidura", "Canyon"
  slug       TEXT NOT NULL UNIQUE,        -- "apidura", "canyon"
  website    TEXT NOT NULL,               -- brand homepage
  tier       INTEGER NOT NULL DEFAULT 1,  -- 1 = deep scrape, 2 = aggregator, 3 = RSS only
  active     BOOLEAN NOT NULL DEFAULT true,
  country    TEXT,                        -- "DE", "US", etc.
  created_at TIMESTAMP NOT NULL DEFAULT NOW()
);

tier controls how the brand is handled by the ingestion pipeline:

1 — deep agent crawl of full product catalog
2 — discovered via gear aggregators (Bikepacking.com, Outdoor Gear Lab)
3 — RSS/new-releases only (future)

Modified: `globalItems` table

Remove brand TEXT field
Add manufacturer_id INTEGER NOT NULL REFERENCES manufacturers(id)
Unique constraint changes from (brand, model) to (manufacturer_id, model)

Every query that previously selected brand now joins to manufacturers to read the name. The name is not denormalized onto globalItems.
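Replacing the free-text brand with a foreign key implies that existing brand strings have to be mapped to manufacturer rows first. The sketch below shows one way to derive those rows; `slugify` and `planManufacturers` are hypothetical helpers, not functions from this repo, and the slug rule is an assumption inferred from the "apidura" / "revelate-designs" examples.

```typescript
// Hypothetical helper: derive a URL-safe slug from a brand name.
// Assumption: slugs are lowercase ASCII words joined by "-".
// Non-ASCII letters are treated as separators here, which may not be desired.
function slugify(name: string): string {
  return name
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Hypothetical helper: collapse distinct brand strings into one manufacturer
// row per slug. The first spelling seen for a slug wins as the display name.
function planManufacturers(brands: string[]): { name: string; slug: string }[] {
  const bySlug = new Map<string, { name: string; slug: string }>();
  for (const brand of brands) {
    const slug = slugify(brand);
    if (slug !== "" && !bySlug.has(slug)) {
      bySlug.set(slug, { name: brand.trim(), slug });
    }
  }
  return Array.from(bySlug.values());
}
```

Brand strings that differ only in case or surrounding whitespace collapse into a single row, which keeps the UNIQUE constraints on name and slug satisfiable during the backfill.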

Ingestion Pipeline

Overview

1. Add manufacturer to DB (name, website, tier, active=true)
2. bun run scripts/crawl-manufacturer.ts --manufacturer=canyon
3. Claude Haiku agent crawls manufacturer website
4. Agent outputs structured JSON array
5. Script calls POST /api/global-items/bulk
6. Items upserted into catalog (no duplicates)

Script: `scripts/crawl-manufacturer.ts`

Lives in this repo and accepts a --manufacturer=<slug> flag.

Responsibilities:

Fetch manufacturer record from DB by slug
Build agent prompt with manufacturer context + target schema
Run Claude Haiku agent against the website
Validate and clean agent output
Call POST /api/global-items/bulk with the result
Log success/failure counts
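The flag handling at the top of the script could look roughly like this. It is a minimal sketch: `parseManufacturerFlag` is a hypothetical name, and the slug pattern is an assumption based on the example slugs in this document.

```typescript
// Hypothetical helper: extract the slug from argv entries such as
// "--manufacturer=canyon". Throws a usage error when the flag is missing
// or the value does not look like a slug.
function parseManufacturerFlag(argv: string[]): string {
  const prefix = "--manufacturer=";
  const arg = argv.find((a) => a.startsWith(prefix));
  const slug = arg ? arg.slice(prefix.length) : "";
  // Assumption: slugs are lowercase alphanumeric words joined by single hyphens.
  if (!/^[a-z0-9]+(-[a-z0-9]+)*$/.test(slug)) {
    throw new Error(
      "usage: bun run scripts/crawl-manufacturer.ts --manufacturer=<slug>"
    );
  }
  return slug;
}
```

Rejecting malformed values before the DB lookup keeps a typo from turning into a confusing "manufacturer not found" later in the run.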

Agent Target Schema

The agent is instructed to extract each product as:

{
  manufacturerId: number,   // resolved from DB before passing to agent
  model: string,            // product model name (no brand prefix)
  category: string,         // see canonical categories below
  weightGrams: number|null,
  priceCents: number|null,  // MSRP in manufacturer's base currency
  priceCurrency: string,    // "EUR", "USD", etc.
  description: string|null,
  sourceUrl: string,        // direct product page URL
  tags: string[],           // from canonical tag list
}

priceCents maps to globalItems.priceCents (base MSRP). The script also uses priceCurrency to insert a row into marketPrices, with the market derived from the manufacturer's country and the currency, so regional pricing is populated from day one.
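The schema above could be enforced on agent output before the bulk call. This is a minimal sketch with hypothetical names (`CatalogItem`, `ValidationResult`, `validateItem`); the three-uppercase-letter currency check and the http(s) URL check are assumptions inferred from the examples, not rules stated in this spec.

```typescript
// Field list copied from the target schema; the interface name is hypothetical.
interface CatalogItem {
  manufacturerId: number;
  model: string;
  category: string;
  weightGrams: number | null;
  priceCents: number | null;
  priceCurrency: string;
  description: string | null;
  sourceUrl: string;
  tags: string[];
}

type ValidationResult =
  | { ok: true; item: CatalogItem }
  | { ok: false; error: string };

function isNullableNonNegative(v: unknown): v is number | null {
  return v === null || (typeof v === "number" && Number.isFinite(v) && v >= 0);
}

// Checks every field and reports all invalid ones at once.
// Unknown extra fields are passed through unchanged in this sketch.
function validateItem(raw: unknown): ValidationResult {
  if (typeof raw !== "object" || raw === null) {
    return { ok: false, error: "not an object" };
  }
  const {
    manufacturerId, model, category, weightGrams, priceCents,
    priceCurrency, description, sourceUrl, tags,
  } = raw as Record<string, unknown>;
  const bad: string[] = [];
  if (typeof manufacturerId !== "number" || !Number.isInteger(manufacturerId)) bad.push("manufacturerId");
  if (typeof model !== "string" || model.trim() === "") bad.push("model");
  if (typeof category !== "string") bad.push("category");
  if (!isNullableNonNegative(weightGrams)) bad.push("weightGrams");
  if (!isNullableNonNegative(priceCents) || (priceCents !== null && !Number.isInteger(priceCents))) bad.push("priceCents");
  // Assumption: currencies are three uppercase letters, as in "EUR" and "USD".
  if (typeof priceCurrency !== "string" || !/^[A-Z]{3}$/.test(priceCurrency)) bad.push("priceCurrency");
  if (description !== null && typeof description !== "string") bad.push("description");
  if (typeof sourceUrl !== "string" || !/^https?:\/\/\S+$/.test(sourceUrl)) bad.push("sourceUrl");
  if (!Array.isArray(tags) || !tags.every((t: unknown) => typeof t === "string")) bad.push("tags");
  if (bad.length > 0) return { ok: false, error: "invalid fields: " + bad.join(", ") };
  return { ok: true, item: raw as CatalogItem };
}
```

Collecting all invalid field names in one pass, rather than stopping at the first, makes the logged failure counts easier to act on when an agent run produces systematically malformed output.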

Canonical Categories

Defined in scripts/taxonomy/categories.ts; the list mirrors the values of globalItems.category:

bags — bikepacking bags, dry bags, stuff sacks
shelters — tents, bivys, tarps, hammocks
sleep — sleeping bags, quilts, pads, pillows
cooking — stoves, cookware, mugs, utensils
lighting — headlamps, bike lights, lanterns
water — filters, bottles, bladders
electronics — power banks, solar panels, GPS, bike computers
tools — multi-tools, pumps, repair kits, locks
clothing — jackets, base layers, gloves, shoes
navigation — GPS devices, maps, compasses
bikes — complete bikes
components — drivetrain, brakes, wheels, handlebars, saddles

Tag Assignment

The agent assigns tags from the canonical list already seeded in the DB (SEED_TAGS in seed-global-items.ts). The prompt includes the full tag list so the agent can pick appropriate ones per item.
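A sketch of how the taxonomy might be checked and rendered into the prompt follows. The category list is copied from this document; `isCanonicalCategory`, `keepKnownTags`, and `taxonomyPromptSection` are hypothetical names, and the real shape of categories.ts and SEED_TAGS may differ.

```typescript
// Category names copied from the canonical list in this spec.
const CANONICAL_CATEGORIES = [
  "bags", "shelters", "sleep", "cooking", "lighting", "water",
  "electronics", "tools", "clothing", "navigation", "bikes", "components",
] as const;

type Category = (typeof CANONICAL_CATEGORIES)[number];

// Hypothetical guard: true only for one of the twelve canonical names.
function isCanonicalCategory(value: string): value is Category {
  return (CANONICAL_CATEGORIES as readonly string[]).indexOf(value) !== -1;
}

// Hypothetical helper: drop tags the agent invented and remove duplicates,
// keeping the original order of the surviving tags.
function keepKnownTags(tags: string[], canonicalTags: readonly string[]): string[] {
  const known = new Set(canonicalTags);
  return tags.filter((t, i) => known.has(t) && tags.indexOf(t) === i);
}

// Hypothetical helper: the part of the prompt that lists allowed values.
function taxonomyPromptSection(canonicalTags: readonly string[]): string {
  return [
    "Allowed categories: " + CANONICAL_CATEGORIES.join(", "),
    "Allowed tags: " + canonicalTags.join(", "),
  ].join("\n");
}
```

Filtering after the fact complements the prompt: the agent sees the full list, and anything outside it is still dropped before the bulk call.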

API Changes

POST /api/global-items/bulk already exists and handles upsert on (brand, model). Once the schema migration lands, the upsert key becomes (manufacturer_id, model). The route and service logic change minimally — the schema enforces the constraint.

POST /api/global-items (single upsert) gets the same change.

Both routes require auth (existing behavior).
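The effect of the new upsert key can be illustrated with an in-memory model. This is not the service code: `ItemRow`, `upsertKey`, and `upsertAll` are hypothetical, and exact (case-sensitive) matching on model is an assumption.

```typescript
// Reduced row shape for illustration; real rows carry more columns.
interface ItemRow {
  manufacturerId: number;
  model: string;
  weightGrams: number | null;
}

// The key mirrors the (manufacturer_id, model) unique constraint.
// The numeric id comes first, so the first ":" always separates the parts.
function upsertKey(row: ItemRow): string {
  return row.manufacturerId + ":" + row.model;
}

// Rows whose key already exists are replaced; all others are appended.
function upsertAll(
  existing: ItemRow[],
  incoming: ItemRow[]
): { rows: ItemRow[]; inserted: number; updated: number } {
  const byKey = new Map<string, ItemRow>();
  for (const row of existing) byKey.set(upsertKey(row), row);
  let inserted = 0;
  let updated = 0;
  for (const row of incoming) {
    const key = upsertKey(row);
    if (byKey.has(key)) updated++;
    else inserted++;
    byKey.set(key, row);
  }
  return { rows: Array.from(byKey.values()), inserted, updated };
}
```

Two manufacturers can now share a model name without colliding, while a re-crawl of the same manufacturer updates rows in place instead of duplicating them.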

Running the Pipeline

# Add a manufacturer first (via API or direct DB insert)
# Then crawl:
bun run scripts/crawl-manufacturer.ts --manufacturer=canyon
bun run scripts/crawl-manufacturer.ts --manufacturer=apidura
bun run scripts/crawl-manufacturer.ts --manufacturer=revelate-designs

# Crawl all active tier-1 manufacturers:
bun run scripts/crawl-all.ts --tier=1
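The batch runner could be a thin loop over the single-manufacturer crawl. The sketch below assumes hypothetical names (`ManufacturerRow`, `selectTargets`, `crawlAll`) and takes the per-manufacturer crawl as a callback, because how crawl-all.ts actually invokes crawl-manufacturer.ts is not specified here.

```typescript
// Reduced manufacturer shape; only the columns the batch runner needs.
interface ManufacturerRow {
  slug: string;
  tier: number;
  active: boolean;
}

// Slugs of all active manufacturers in the requested tier.
function selectTargets(manufacturers: ManufacturerRow[], tier: number): string[] {
  return manufacturers
    .filter((m) => m.active && m.tier === tier)
    .map((m) => m.slug);
}

// Crawls targets one at a time; a failure is recorded and the loop continues.
async function crawlAll(
  manufacturers: ManufacturerRow[],
  tier: number,
  crawlOne: (slug: string) => Promise<void>
): Promise<{ succeeded: string[]; failed: { slug: string; reason: string }[] }> {
  const succeeded: string[] = [];
  const failed: { slug: string; reason: string }[] = [];
  for (const slug of selectTargets(manufacturers, tier)) {
    try {
      await crawlOne(slug);
      succeeded.push(slug);
    } catch (err) {
      failed.push({ slug, reason: String(err) });
    }
  }
  return { succeeded, failed };
}
```

Running sequentially is the conservative choice for a first version: one agent session at a time keeps cost and logs easy to follow, and parallelism can be added once the single-manufacturer path is stable.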

Out of Scope (Deferred)

RSS / new-release monitoring — scheduled polling of brand RSS feeds for new product announcements
Price updates — periodic refresh of marketPrices from retailer sites
Community submissions — user-proposed items with admin approval workflow (Phase 999.6)
Separate ingestion repo — pipeline stays in this repo until complexity justifies splitting
Aggregator scraping (Tier 2) — Bikepacking.com, Outdoor Gear Lab as discovery sources

Implementation Phases

This design breaks into two sequential phases:

Phase A — Schema & API

Add manufacturers table + migration
Migrate globalItems: replace brand text with manufacturerId FK
Update all queries, services, routes, and MCP tools that reference brand
Seed initial manufacturer list (top bikepacking/biking/hiking brands)

Phase B — Ingestion Script

scripts/crawl-manufacturer.ts — agent runner
scripts/taxonomy/categories.ts — canonical category map
scripts/crawl-all.ts — batch runner by tier
Test against 2-3 real manufacturers (Canyon, Apidura, Revelate Designs)

Agent Execution Model

The crawl script launches a Claude Code headless session (via the Claude Agent SDK) rather than calling the Anthropic API directly. This gives the agent full tool access — WebFetch, browser navigation, file I/O — without needing to re-implement those capabilities. Auth is handled via OAuth rather than a raw API key.

Each manufacturer gets its own agent session. The session receives:

The manufacturer record (name, website, tier)
The target schema and canonical taxonomy
A GearBox API key scoped to write access

The agent browses the manufacturer site, extracts products, and posts to POST /api/global-items/bulk directly from within the session. No intermediate file serialization is needed.
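The three session inputs listed above could be assembled as follows before the session is launched. This sketch deliberately stops short of the SDK call: `SessionInput`, `buildSessionInput`, and the `GEARBOX_API_KEY` variable name are hypothetical, and passing the key through the environment rather than the prompt text is an assumption, not something this spec prescribes.

```typescript
interface ManufacturerRecord {
  id: number;
  name: string;
  website: string;
  tier: number;
}

// Hypothetical container for what one agent session receives.
interface SessionInput {
  prompt: string;
  env: Record<string, string>;
}

function buildSessionInput(
  m: ManufacturerRecord,
  taxonomy: { categories: string[]; tags: string[] },
  apiKey: string
): SessionInput {
  const prompt = [
    "Crawl the product catalog of " + m.name + " at " + m.website + ".",
    "Use manufacturerId " + m.id + " for every extracted item.",
    "Allowed categories: " + taxonomy.categories.join(", "),
    "Allowed tags: " + taxonomy.tags.join(", "),
    "Submit the extracted items to POST /api/global-items/bulk.",
  ].join("\n");
  // Assumption: the write-scoped key is exposed as an environment variable
  // so that it does not appear in the prompt or in session transcripts.
  return { prompt, env: { GEARBOX_API_KEY: apiKey } };
}
```

Building the input as plain data keeps it inspectable and testable independently of how the headless session is eventually started.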
