# Catalog Ingestion Script Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Build a Bun script that takes a manufacturer slug, launches a Claude Haiku agent with web tools, crawls the manufacturer's site, and bulk-upserts the extracted products into the GearBox catalog via the existing API.

**Architecture:** `scripts/crawl-manufacturer.ts` is the entry point. It fetches the manufacturer record from the GearBox API, builds a structured prompt with the target schema and taxonomy, runs a Claude Haiku agent in an agentic tool-use loop (giving it a `fetch_page` tool backed by Bun's fetch), receives a JSON array of products, and posts them to `POST /api/global-items/bulk`. A `scripts/crawl-all.ts` batch runner iterates all active manufacturers of a given tier (tier 1 by default). Taxonomy (categories + tags) is defined in code and injected into the agent prompt.

**Tech Stack:** Bun, `@anthropic-ai/sdk`, Anthropic Claude (Haiku model for cost), GearBox API (local or deployed).

**Prerequisite:** Plan A (catalog-schema-migration) must be complete: the API must accept `manufacturerSlug` in bulk upserts.

---

## File Map

| Action | File | Purpose |
|--------|------|---------|
| Create | `scripts/taxonomy/categories.ts` | canonical category values |
| Create | `scripts/taxonomy/tags.ts` | canonical tag list |
| Create | `scripts/crawl-manufacturer.ts` | core agent runner |
| Create | `scripts/crawl-all.ts` | batch runner by tier |
| Modify | `package.json` | add `db:crawl` and `db:crawl-all` script entries |

---

## Task 1: Taxonomy files

**Files:**
- Create: `scripts/taxonomy/categories.ts`
- Create: `scripts/taxonomy/tags.ts`

- [ ] **Step 1: Create `scripts/taxonomy/categories.ts`**

```typescript
/**
 * Canonical category values for globalItems.category.
 * These are the only valid values the ingestion agent should use.
 */
export const CATEGORIES = [
  "bags",        // bikepacking bags, dry bags, stuff sacks
  "shelters",    // tents, bivys, tarps, hammocks
  "sleep",       // sleeping bags, quilts, pads, pillows
  "cooking",     // stoves, cookware, mugs, utensils
  "lighting",    // headlamps, bike lights, lanterns
  "water",       // filters, bottles, bladders
  "electronics", // power banks, solar panels, GPS, bike computers
  "tools",       // multi-tools, pumps, repair kits, locks
  "clothing",    // jackets, base layers, gloves, shoes
  "navigation",  // GPS devices, maps, compasses
  "bikes",       // complete bikes
  "components",  // drivetrain, brakes, wheels, handlebars, saddles, stems
] as const;

export type Category = (typeof CATEGORIES)[number];
```

- [ ] **Step 2: Create `scripts/taxonomy/tags.ts`**

```typescript
/**
 * Canonical tags for globalItems.
 * Mirrors the seed tags in src/db/seed-global-items.ts.
 * The agent should only use tags from this list.
 */
export const TAGS = [
  // Activity
  "bikepacking", "cycling", "hiking", "backpacking", "camping", "climbing",
  "mountaineering", "road-cycling", "gravel", "running", "trail-running",
  // Bag subtypes
  "handlebar-bag", "framebag", "saddlebag", "top-tube-bag", "stem-bag",
  "fork-bag", "feed-bag", "dry-bag", "stuff-sack", "bike-bag",
  // Shelter subtypes
  "tent", "bivy", "tarp", "hammock",
  // Sleep subtypes
  "sleeping-bag", "sleeping-pad", "quilt", "pillow",
  // Cooking subtypes
  "stove", "cookware", "mug", "utensils",
  // Water subtypes
  "water-filter", "water-bottle",
  // Lighting subtypes
  "headlamp", "bike-light", "lantern",
  // Electronics subtypes
  "gps", "bike-computer", "power-bank", "solar-panel",
  // Tools subtypes
  "multi-tool", "pump", "repair-kit", "lock",
  // Clothing subtypes
  "rain-jacket", "base-layer", "gloves", "shoe",
] as const;

export type Tag = (typeof TAGS)[number];
```

- [ ] **Step 3: Commit**

```bash
git add scripts/taxonomy/
git commit -m "feat: canonical taxonomy — categories and tags for ingestion"
```

---

## Task 2: Core crawl script

**Files:**
- Create:
`scripts/crawl-manufacturer.ts`

- [ ] **Step 1: Verify `@anthropic-ai/sdk` is available**

```bash
bun pm ls | grep anthropic
```

If not listed:

```bash
bun add @anthropic-ai/sdk
```

- [ ] **Step 2: Create `scripts/crawl-manufacturer.ts`**

```typescript
#!/usr/bin/env bun
/**
 * Crawl a manufacturer's website and upsert their products into the GearBox catalog.
 *
 * Usage:
 *   bun run scripts/crawl-manufacturer.ts --manufacturer=apidura
 *   bun run scripts/crawl-manufacturer.ts --manufacturer=canyon --dry-run
 *
 * Env vars required:
 *   ANTHROPIC_API_KEY — Anthropic API key
 *   GEARBOX_URL       — Base URL of the GearBox instance (default: http://localhost:3000)
 *   GEARBOX_API_KEY   — GearBox API key with write access
 */

import Anthropic from "@anthropic-ai/sdk";
import { CATEGORIES } from "./taxonomy/categories.ts";
import { TAGS } from "./taxonomy/tags.ts";

const GEARBOX_URL = process.env.GEARBOX_URL ?? "http://localhost:3000";
const GEARBOX_API_KEY = process.env.GEARBOX_API_KEY ?? "";
const ANTHROPIC_API_KEY = process.env.ANTHROPIC_API_KEY ?? "";
const MODEL = "claude-haiku-4-5-20251001";
const MAX_TOOL_ROUNDS = 30; // safety limit

// ── Parse CLI args ────────────────────────────────────────────────

const args = Object.fromEntries(
  process.argv
    .slice(2)
    .filter((a) => a.startsWith("--"))
    .map((a) => {
      const [k, v] = a.slice(2).split("=");
      return [k, v ?? "true"];
    }),
);

const manufacturerSlug = args["manufacturer"];
const dryRun = args["dry-run"] === "true";

if (!manufacturerSlug) {
  console.error("Usage: bun run scripts/crawl-manufacturer.ts --manufacturer=<slug>");
  process.exit(1);
}
if (!GEARBOX_API_KEY) {
  console.error("GEARBOX_API_KEY env var is required");
  process.exit(1);
}
if (!ANTHROPIC_API_KEY) {
  console.error("ANTHROPIC_API_KEY env var is required");
  process.exit(1);
}

// ── Fetch manufacturer from GearBox ──────────────────────────────

async function fetchManufacturer(slug: string) {
  const res = await fetch(`${GEARBOX_URL}/api/manufacturers/${slug}`);
  if (!res.ok) {
    throw new Error(`Manufacturer not found: ${slug} (HTTP ${res.status})`);
  }
  return res.json() as Promise<{
    id: number;
    name: string;
    slug: string;
    website: string;
    tier: number;
    country: string | null;
  }>;
}

// ── Tool: fetch a web page ────────────────────────────────────────

async function fetchPage(url: string): Promise<string> {
  try {
    const res = await fetch(url, {
      headers: {
        "User-Agent": "Mozilla/5.0 (compatible; GearBox-Catalog-Bot/1.0)",
        Accept: "text/html,application/xhtml+xml",
      },
      signal: AbortSignal.timeout(15_000),
    });
    if (!res.ok) return `HTTP ${res.status} for ${url}`;
    const html = await res.text();
    // Strip scripts, styles, HTML comments, and excessive whitespace for token efficiency
    return html
      .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, "")
      .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, "")
      .replace(/<!--[\s\S]*?-->/g, "")
      .replace(/\s{3,}/g, " ")
      .slice(0, 60_000); // cap at 60k chars to stay within context
  } catch (err) {
    return `Error fetching ${url}: ${(err as Error).message}`;
  }
}

// ── Build system prompt ───────────────────────────────────────────

function buildSystemPrompt(manufacturer: Awaited<ReturnType<typeof fetchManufacturer>>) {
  return `You are a product data extraction agent for GearBox, a gear management app for bikepacking, cycling, and hiking.

Your task: crawl ${manufacturer.name}'s website (${manufacturer.website}) and extract their complete product catalog.
For each product, extract:
- model: string (product name WITHOUT the brand prefix)
- category: one of [${CATEGORIES.join(", ")}]
- weightGrams: number | null (weight in grams — convert if shown in oz/lbs/kg)
- priceCents: number | null (MSRP in cents, base currency)
- priceCurrency: string (ISO currency code — "EUR" for DE brands, "USD" for US, "GBP" for GB, etc.)
- description: string | null (1-3 sentence product description)
- sourceUrl: string (direct product page URL)
- tags: string[] (from this list only: [${TAGS.join(", ")}])

Rules:
- model must NOT include the brand name (e.g., "Terrapin System" not "Revelate Designs Terrapin System")
- Only include outdoor/adventure/cycling products. Skip accessories under €5, clothing if not relevant to the target categories.
- If weight is not listed on a product page, use null — do not guess.
- Assign 2-5 relevant tags per item.
- Extract every product in their catalog, not just featured ones. Navigate to all relevant subcategories.

When done, output a JSON array of product objects as your final message. Do not wrap in markdown — raw JSON only.

Example output:
[
  {
    "model": "Expedition Handlebar Pack",
    "category": "bags",
    "weightGrams": 300,
    "priceCents": 16000,
    "priceCurrency": "GBP",
    "description": "14L waterproof handlebar roll bag with internal dry bag and accessory pocket.",
    "sourceUrl": "https://apidura.com/shop/expedition-handlebar-pack/",
    "tags": ["bikepacking", "handlebar-bag", "bike-bag"]
  }
]`;
}

// ── Agentic tool-use loop ─────────────────────────────────────────

type CatalogItem = {
  model: string;
  category: string;
  weightGrams: number | null;
  priceCents: number | null;
  priceCurrency: string;
  description: string | null;
  sourceUrl: string;
  tags: string[];
};

async function runCrawlAgent(
  manufacturer: Awaited<ReturnType<typeof fetchManufacturer>>,
): Promise<CatalogItem[]> {
  const client = new Anthropic({ apiKey: ANTHROPIC_API_KEY });

  const tools: Anthropic.Tool[] = [
    {
      name: "fetch_page",
      description:
        "Fetch the HTML content of a URL. Use this to explore the manufacturer's website and product pages.",
      input_schema: {
        type: "object" as const,
        properties: {
          url: { type: "string", description: "The URL to fetch" },
        },
        required: ["url"],
      },
    },
  ];

  const messages: Anthropic.MessageParam[] = [
    {
      role: "user",
      content: `Crawl ${manufacturer.name}'s website at ${manufacturer.website} and extract their complete product catalog. Start with the homepage or sitemap, navigate to all product categories, and return the full product list as JSON.`,
    },
  ];

  let rounds = 0;
  while (rounds < MAX_TOOL_ROUNDS) {
    rounds++;
    console.log(`  [round ${rounds}] calling model...`);

    const response = await client.messages.create({
      model: MODEL,
      max_tokens: 8192,
      system: buildSystemPrompt(manufacturer),
      tools,
      messages,
    });

    // Add assistant response to history
    messages.push({ role: "assistant", content: response.content });

    if (response.stop_reason === "end_turn") {
      // Final message — extract JSON from text content
      const textBlock = response.content.find((b) => b.type === "text");
      if (!textBlock || textBlock.type !== "text") {
        throw new Error("Agent finished without text output");
      }
      return parseAgentOutput(textBlock.text);
    }

    if (response.stop_reason !== "tool_use") {
      throw new Error(`Unexpected stop reason: ${response.stop_reason}`);
    }

    // Process tool calls
    const toolResults: Anthropic.ToolResultBlockParam[] = [];
    for (const block of response.content) {
      if (block.type !== "tool_use") continue;
      if (block.name === "fetch_page") {
        const { url } = block.input as { url: string };
        console.log(`  [tool] fetch_page ${url}`);
        const content = await fetchPage(url);
        toolResults.push({
          type: "tool_result",
          tool_use_id: block.id,
          content,
        });
      } else {
        // Every tool_use block needs a matching tool_result, even for an unknown tool
        toolResults.push({
          type: "tool_result",
          tool_use_id: block.id,
          content: `Unknown tool: ${block.name}`,
          is_error: true,
        });
      }
    }

    messages.push({ role: "user", content: toolResults });
  }

  throw new Error(`Agent exceeded ${MAX_TOOL_ROUNDS} tool rounds without finishing`);
}

function parseAgentOutput(text: string): CatalogItem[] {
  // Handle agent wrapping output in markdown code blocks
  const cleaned = text.replace(/^```json\s*/i, "").replace(/\s*```$/i, "").trim();
  const parsed = JSON.parse(cleaned);
  if (!Array.isArray(parsed)) throw new Error("Agent output is not a JSON array");
  return parsed;
}

// ── Upsert to GearBox API ─────────────────────────────────────────

async function upsertItems(
  manufacturerSlug: string,
  items: CatalogItem[],
): Promise<{ created: number; updated: number }> {
  const payload = items.map((item) => ({
    manufacturerSlug,
    model: item.model,
    category: item.category,
    weightGrams: item.weightGrams ?? undefined,
    priceCents: item.priceCents ?? undefined,
    description: item.description ?? undefined,
    sourceUrl: item.sourceUrl,
    tags: item.tags,
  }));

  // Chunk into batches of 100 (API limit)
  let totalCreated = 0;
  let totalUpdated = 0;
  for (let i = 0; i < payload.length; i += 100) {
    const batch = payload.slice(i, i + 100);
    const res = await fetch(`${GEARBOX_URL}/api/global-items/bulk`, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "X-API-Key": GEARBOX_API_KEY,
      },
      body: JSON.stringify({ items: batch }),
    });
    if (!res.ok) {
      const err = await res.text();
      throw new Error(`Bulk upsert failed (HTTP ${res.status}): ${err}`);
    }
    const result = await res.json() as { created: number; updated: number };
    totalCreated += result.created;
    totalUpdated += result.updated;
    console.log(`  batch ${Math.floor(i / 100) + 1}: +${result.created} new, ~${result.updated} updated`);
  }

  return { created: totalCreated, updated: totalUpdated };
}

// ── Main ──────────────────────────────────────────────────────────

async function main() {
  console.log(`\nCrawling manufacturer: ${manufacturerSlug}`);
  if (dryRun) console.log("DRY RUN — products will not be saved\n");

  const manufacturer = await fetchManufacturer(manufacturerSlug);
  console.log(`Found: ${manufacturer.name} (${manufacturer.website})\n`);

  console.log("Starting agent crawl...");
  const items = await runCrawlAgent(manufacturer);
  console.log(`\nAgent extracted ${items.length} products`);

  if (dryRun) {
    console.log("\nDry run output (first 3 items):");
    console.log(JSON.stringify(items.slice(0, 3), null, 2));
    return;
  }

  console.log("\nUpserting to catalog...");
  const { created, updated } = await upsertItems(manufacturerSlug, items);
  console.log(`\nDone: ${created} created, ${updated} updated`);
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

- [ ] **Step 3: Add market prices upsert after the bulk upsert**

After `upsertItems`, add a call that also upserts `marketPrices` for each item that has a price. This requires the item IDs returned from the bulk upsert and each item's `priceCurrency`, from which the market is derived.

Add this helper after `upsertItems`:

```typescript
async function upsertMarketPrices(
  globalItemIds: number[],
  items: CatalogItem[],
): Promise<void> {
  for (let i = 0; i < globalItemIds.length; i++) {
    const item = items[i];
    const globalItemId = globalItemIds[i];
    if (!item?.priceCents || !globalItemId) continue;

    // Derive market from currency
    const market =
      item.priceCurrency === "EUR" ? "EU"
      : item.priceCurrency === "USD" ? "US"
      : item.priceCurrency === "GBP" ? "GB"
      : item.priceCurrency;

    const res = await fetch(`${GEARBOX_URL}/api/global-items/${globalItemId}/market-prices`, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "X-API-Key": GEARBOX_API_KEY,
      },
      body: JSON.stringify({
        market,
        currency: item.priceCurrency,
        priceCents: item.priceCents,
        source: "manufacturer-crawl",
      }),
    });
    // Surface failures instead of silently dropping a price
    if (!res.ok) console.warn(`  market price upsert failed for item ${globalItemId} (HTTP ${res.status})`);
  }
}
```

Call `upsertMarketPrices` in `main()` after the bulk upsert, passing the item IDs from the API response.

Note: The bulk upsert response returns `items[]` with IDs. Store those and pass them here. Update the `upsertItems` function return type to also return `itemIds: number[]`.
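The note above leaves the ID bookkeeping to the implementer. A minimal sketch of one way to do it follows, under stated assumptions: the plan only says the bulk response returns `items[]` with IDs, so the `BulkResponse` shape (a numeric `id` per entry, returned in request order) is assumed, and `mergeBulkResponses` and `pricedPairs` are hypothetical helper names, not part of the plan.

```typescript
// Assumed response shape: the plan only states that the bulk response returns `items[]` with IDs.
type BulkResponse = {
  created: number;
  updated: number;
  items: Array<{ id: number }>; // assumption: each entry carries a numeric `id`
};

type UpsertSummary = { created: number; updated: number; itemIds: number[] };

type PricedItem = { priceCents: number | null; priceCurrency: string };

// Hypothetical helper: fold the per-batch responses into one summary.
function mergeBulkResponses(responses: BulkResponse[]): UpsertSummary {
  const summary: UpsertSummary = { created: 0, updated: 0, itemIds: [] };
  for (const res of responses) {
    summary.created += res.created;
    summary.updated += res.updated;
    // Assumption: IDs come back in request order, so appending batch by batch
    // keeps them index-aligned with the submitted items.
    summary.itemIds.push(...res.items.map((item) => item.id));
  }
  return summary;
}

// Hypothetical helper: pair each ID with its item and keep only items that have a price.
function pricedPairs<T extends PricedItem>(itemIds: number[], items: T[]) {
  if (itemIds.length !== items.length) {
    // A mismatch means index alignment cannot be trusted; fail loudly.
    throw new Error(`ID/item count mismatch: ${itemIds.length} vs ${items.length}`);
  }
  return items
    .map((item, i) => ({ globalItemId: itemIds[i], item }))
    .filter((pair) => pair.item.priceCents !== null);
}
```

If the API does not guarantee response order, matching on a stable key such as `sourceUrl` would be safer than matching by index; the length check here only catches the most obvious misalignment.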

- [ ] **Step 4: Commit**

```bash
git add scripts/crawl-manufacturer.ts
git commit -m "feat: crawl-manufacturer agent script — Haiku tool-use loop + bulk upsert"
```

---

## Task 3: Batch runner

**Files:**
- Create: `scripts/crawl-all.ts`

- [ ] **Step 1: Create `scripts/crawl-all.ts`**

```typescript
#!/usr/bin/env bun
/**
 * Crawl all active manufacturers of a given tier.
 *
 * Usage:
 *   bun run scripts/crawl-all.ts --tier=1
 *   bun run scripts/crawl-all.ts --tier=1 --dry-run
 */

const GEARBOX_URL = process.env.GEARBOX_URL ?? "http://localhost:3000";
const GEARBOX_API_KEY = process.env.GEARBOX_API_KEY ?? "";

const args = Object.fromEntries(
  process.argv
    .slice(2)
    .filter((a) => a.startsWith("--"))
    .map((a) => {
      const [k, v] = a.slice(2).split("=");
      return [k, v ?? "true"];
    }),
);

const tier = args["tier"] ? Number(args["tier"]) : 1;
const dryRun = args["dry-run"] === "true";

async function listActiveManufacturers(targetTier: number) {
  const res = await fetch(`${GEARBOX_URL}/api/manufacturers`);
  if (!res.ok) throw new Error(`Failed to list manufacturers: HTTP ${res.status}`);
  const all = await res.json() as Array<{ slug: string; tier: number; active: boolean; name: string }>;
  return all.filter((m) => m.active && m.tier === targetTier);
}

async function main() {
  if (!GEARBOX_API_KEY) {
    console.error("GEARBOX_API_KEY env var is required");
    process.exit(1);
  }

  const manufacturers = await listActiveManufacturers(tier);
  console.log(`Found ${manufacturers.length} active tier-${tier} manufacturers\n`);

  const results: Array<{ slug: string; status: "ok" | "error"; error?: string }> = [];

  for (const m of manufacturers) {
    console.log(`\n${"─".repeat(50)}`);
    console.log(`Crawling: ${m.name} (${m.slug})`);
    try {
      // Run each crawl in its own process so one failure does not stop the batch
      const extraArgs = dryRun ? ["--dry-run"] : [];
      const proc = Bun.spawn(
        ["bun", "run", "scripts/crawl-manufacturer.ts", `--manufacturer=${m.slug}`, ...extraArgs],
        { stdout: "inherit", stderr: "inherit", env: process.env },
      );
      const exitCode = await proc.exited;
      if (exitCode !== 0) throw new Error(`Exited with code ${exitCode}`);
      results.push({ slug: m.slug, status: "ok" });
    } catch (err) {
      console.error(`  ERROR: ${(err as Error).message}`);
      results.push({ slug: m.slug, status: "error", error: (err as Error).message });
    }
  }

  console.log(`\n${"═".repeat(50)}`);
  console.log("Summary:");
  for (const r of results) {
    const icon = r.status === "ok" ? "✓" : "✗";
    console.log(`  ${icon} ${r.slug}${r.error ? ` — ${r.error}` : ""}`);
  }

  const failed = results.filter((r) => r.status === "error");
  if (failed.length > 0) {
    console.error(`\n${failed.length} manufacturer(s) failed`);
    process.exit(1);
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

- [ ] **Step 2: Commit**

```bash
git add scripts/crawl-all.ts
git commit -m "feat: crawl-all batch runner — iterate active manufacturers by tier"
```

---

## Task 4: Package.json scripts + smoke test

**Files:**
- Modify: `package.json`

- [ ] **Step 1: Add scripts to `package.json`**

In the `"scripts"` section, add:

```json
"db:crawl": "bun run scripts/crawl-manufacturer.ts",
"db:crawl-all": "bun run scripts/crawl-all.ts"
```

- [ ] **Step 2: Commit**

```bash
git add package.json
git commit -m "chore: add db:crawl and db:crawl-all npm scripts"
```

- [ ] **Step 3: Smoke test with dry run**

Ensure GearBox is running (`bun run dev:server` in another terminal) and the manufacturer exists in the DB.

```bash
GEARBOX_API_KEY=<your-gearbox-key> ANTHROPIC_API_KEY=<your-anthropic-key> \
  bun run db:crawl --manufacturer=apidura --dry-run
```

Expected output:

```

Crawling manufacturer: apidura
DRY RUN — products will not be saved

Found: Apidura (https://apidura.com)

Starting agent crawl...
  [round 1] calling model...
  [tool] fetch_page https://apidura.com
  ...

Agent extracted N products

Dry run output (first 3 items):
[
  {
    "model": "...",
    "category": "bags",
    ...
  }
]
```

- [ ] **Step 4: Live run against one manufacturer**

```bash
GEARBOX_API_KEY=<your-gearbox-key> ANTHROPIC_API_KEY=<your-anthropic-key> \
  bun run db:crawl --manufacturer=apidura
```

Expected: products appear in the catalog. Verify by opening the app catalog search or calling `GET /api/global-items?q=apidura`.

- [ ] **Step 5: Commit any adjustments**

If the agent prompt needed tuning (category mapping issues, extra noise in output), update `buildSystemPrompt` in `crawl-manufacturer.ts` and commit:

```bash
git add scripts/crawl-manufacturer.ts
git commit -m "fix: tune agent prompt for cleaner category/tag extraction"
```
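
If prompt tuning alone does not eliminate category mapping issues, one option is a validation pass over the agent output before the upsert. The sketch below is an assumption, not part of the plan: `checkAgainstTaxonomy` is a hypothetical helper, and the short `CATEGORIES` and `TAGS` arrays are stand-ins for the lists that the real script would import from `scripts/taxonomy/`.

```typescript
// Stand-ins for the real lists; the script itself would import these from ./taxonomy/.
const CATEGORIES = ["bags", "shelters", "sleep"] as const;
const TAGS = ["bikepacking", "handlebar-bag", "tent"] as const;

type RawItem = { model: string; category: string; tags: string[] };

// Hypothetical helper: reject items with an unknown category and
// strip tags that are not in the canonical list.
function checkAgainstTaxonomy(items: RawItem[]) {
  const categories = new Set<string>(CATEGORIES);
  const tags = new Set<string>(TAGS);
  const valid: RawItem[] = [];
  const rejected: Array<{ model: string; reason: string }> = [];

  for (const item of items) {
    if (!categories.has(item.category)) {
      rejected.push({ model: item.model, reason: `unknown category "${item.category}"` });
      continue;
    }
    // Keep the item, but only with tags the taxonomy knows about
    valid.push({ ...item, tags: item.tags.filter((t) => tags.has(t)) });
  }
  return { valid, rejected };
}
```

Logging the `rejected` list during a dry run would show which categories the agent tends to invent, which may help decide whether the prompt or the taxonomy needs adjusting.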