Files
GearBox/docs/superpowers/plans/2026-04-18-catalog-ingestion-script.md
Jean-Luc Makiola bea386e7db
All checks were successful
CI / ci (push) Successful in 1m21s
CI / e2e (push) Has been skipped
CI / deploy (push) Successful in 1m15s
style(i18n): fix lint — formatting and import ordering across 21 files
Biome auto-fix for formatting (line length, ternary wrapping) and
import organization in files touched by phase 34 i18n work.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 14:49:10 +02:00

20 KiB

Catalog Ingestion Script Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Build a Bun script that takes a manufacturer slug, launches a Claude Haiku agent with web tools, crawls the manufacturer's site, and bulk-upserts the extracted products into the GearBox catalog via the existing API.

Architecture: scripts/crawl-manufacturer.ts is the entry point — it fetches the manufacturer record from the GearBox API, builds a structured prompt with the target schema and taxonomy, runs a Claude Haiku agent in an agentic tool-use loop (giving it a fetch_page tool backed by Bun's fetch), receives a JSON array of products, and posts them to POST /api/global-items/bulk. A scripts/crawl-all.ts batch runner iterates all active tier-1 manufacturers. Taxonomy (categories + tags) is defined in code and injected into the agent prompt.

Tech Stack: Bun, @anthropic-ai/sdk, Anthropic Claude (Haiku model for cost), GearBox API (local or deployed).

Prerequisite: Plan A (catalog-schema-migration) must be complete — the API must accept manufacturerSlug in bulk upserts.


File Map

Action File
Create scripts/taxonomy/categories.ts — canonical category values
Create scripts/taxonomy/tags.ts — canonical tag list
Create scripts/crawl-manufacturer.ts — core agent runner
Create scripts/crawl-all.ts — batch runner by tier
Modify package.json — add db:crawl and db:crawl-all script entries

Task 1: Taxonomy files

Files:

  • Create: scripts/taxonomy/categories.ts

  • Create: scripts/taxonomy/tags.ts

  • Step 1: Create scripts/taxonomy/categories.ts

/**
 * Canonical category values for globalItems.category.
 * These are the only valid values the ingestion agent should use.
 */
export const CATEGORIES = [
	"bags",         // bikepacking bags, dry bags, stuff sacks
	"shelters",     // tents, bivys, tarps, hammocks
	"sleep",        // sleeping bags, quilts, pads, pillows
	"cooking",      // stoves, cookware, mugs, utensils
	"lighting",     // headlamps, bike lights, lanterns
	"water",        // filters, bottles, bladders
	"electronics",  // power banks, solar panels, GPS, bike computers
	"tools",        // multi-tools, pumps, repair kits, locks
	"clothing",     // jackets, base layers, gloves, shoes
	"navigation",   // GPS devices, maps, compasses
	"bikes",        // complete bikes
	"components",   // drivetrain, brakes, wheels, handlebars, saddles, stems
] as const;

export type Category = (typeof CATEGORIES)[number];
  • Step 2: Create scripts/taxonomy/tags.ts
/**
 * Canonical tags for globalItems.
 * Mirrors the seed tags in src/db/seed-global-items.ts.
 * The agent should only use tags from this list.
 */
export const TAGS = [
	// Activity
	"bikepacking", "cycling", "hiking", "backpacking", "camping", "climbing",
	"mountaineering", "road-cycling", "gravel", "running", "trail-running",
	// Bag subtypes
	"handlebar-bag", "framebag", "saddlebag", "top-tube-bag", "stem-bag",
	"fork-bag", "feed-bag", "dry-bag", "stuff-sack", "bike-bag",
	// Shelter subtypes
	"tent", "bivy", "tarp", "hammock",
	// Sleep subtypes
	"sleeping-bag", "sleeping-pad", "quilt", "pillow",
	// Cooking subtypes
	"stove", "cookware", "mug", "utensils",
	// Water subtypes
	"water-filter", "water-bottle",
	// Lighting subtypes
	"headlamp", "bike-light", "lantern",
	// Electronics subtypes
	"gps", "bike-computer", "power-bank", "solar-panel",
	// Tools subtypes
	"multi-tool", "pump", "repair-kit", "lock",
	// Clothing subtypes
	"rain-jacket", "base-layer", "gloves", "shoe",
] as const;

export type Tag = (typeof TAGS)[number];
  • Step 3: Commit
git add scripts/taxonomy/
git commit -m "feat: canonical taxonomy — categories and tags for ingestion"

Task 2: Core crawl script

Files:

  • Create: scripts/crawl-manufacturer.ts

  • Step 1: Verify @anthropic-ai/sdk is available

bun pm ls | grep anthropic

If not listed:

bun add @anthropic-ai/sdk
  • Step 2: Create scripts/crawl-manufacturer.ts
#!/usr/bin/env bun
/**
 * Crawl a manufacturer's website and upsert their products into the GearBox catalog.
 *
 * Usage:
 *   bun run scripts/crawl-manufacturer.ts --manufacturer=apidura
 *   bun run scripts/crawl-manufacturer.ts --manufacturer=canyon --dry-run
 *
 * Env vars required:
 *   ANTHROPIC_API_KEY  — Anthropic API key
 *   GEARBOX_URL        — Base URL of the GearBox instance (default: http://localhost:3000)
 *   GEARBOX_API_KEY    — GearBox API key with write access
 */

import Anthropic from "@anthropic-ai/sdk";
import { CATEGORIES } from "./taxonomy/categories.ts";
import { TAGS } from "./taxonomy/tags.ts";

const GEARBOX_URL = process.env.GEARBOX_URL ?? "http://localhost:3000";
const GEARBOX_API_KEY = process.env.GEARBOX_API_KEY ?? "";
const ANTHROPIC_API_KEY = process.env.ANTHROPIC_API_KEY ?? "";
const MODEL = "claude-haiku-4-5-20251001";
const MAX_TOOL_ROUNDS = 30; // safety limit

// ── Parse CLI args ────────────────────────────────────────────────

const args = Object.fromEntries(
	process.argv
		.slice(2)
		.filter((a) => a.startsWith("--"))
		.map((a) => {
			const [k, v] = a.slice(2).split("=");
			return [k, v ?? "true"];
		}),
);

const manufacturerSlug = args["manufacturer"];
const dryRun = args["dry-run"] === "true";

if (!manufacturerSlug) {
	console.error("Usage: bun run scripts/crawl-manufacturer.ts --manufacturer=<slug>");
	process.exit(1);
}

if (!GEARBOX_API_KEY) {
	console.error("GEARBOX_API_KEY env var is required");
	process.exit(1);
}

if (!ANTHROPIC_API_KEY) {
	console.error("ANTHROPIC_API_KEY env var is required");
	process.exit(1);
}

// ── Fetch manufacturer from GearBox ──────────────────────────────

async function fetchManufacturer(slug: string) {
	const res = await fetch(`${GEARBOX_URL}/api/manufacturers/${slug}`);
	if (!res.ok) {
		throw new Error(`Manufacturer not found: ${slug} (HTTP ${res.status})`);
	}
	return res.json() as Promise<{
		id: number;
		name: string;
		slug: string;
		website: string;
		tier: number;
		country: string | null;
	}>;
}

// ── Tool: fetch a web page ────────────────────────────────────────

async function fetchPage(url: string): Promise<string> {
	try {
		const res = await fetch(url, {
			headers: {
				"User-Agent": "Mozilla/5.0 (compatible; GearBox-Catalog-Bot/1.0)",
				Accept: "text/html,application/xhtml+xml",
			},
			signal: AbortSignal.timeout(15_000),
		});
		if (!res.ok) return `HTTP ${res.status} for ${url}`;
		const html = await res.text();
		// Strip scripts, styles, and excessive whitespace for token efficiency
		return html
			.replace(/<script[^>]*>[\s\S]*?<\/script>/gi, "")
			.replace(/<style[^>]*>[\s\S]*?<\/style>/gi, "")
			.replace(/<!--[\s\S]*?-->/g, "")
			.replace(/\s{3,}/g, "  ")
			.slice(0, 60_000); // cap at 60k chars to stay within context
	} catch (err) {
		return `Error fetching ${url}: ${(err as Error).message}`;
	}
}

// ── Build system prompt ───────────────────────────────────────────

function buildSystemPrompt(manufacturer: Awaited<ReturnType<typeof fetchManufacturer>>) {
	return `You are a product data extraction agent for GearBox, a gear management app for bikepacking, cycling, and hiking.

Your task: crawl ${manufacturer.name}'s website (${manufacturer.website}) and extract their complete product catalog.

For each product, extract:
- model: string (product name WITHOUT the brand prefix)
- category: one of [${CATEGORIES.join(", ")}]
- weightGrams: number | null (weight in grams — convert if shown in oz/lbs/kg)
- priceCents: number | null (MSRP in cents, base currency)
- priceCurrency: string (ISO currency code — "EUR" for DE brands, "USD" for US, "GBP" for GB, etc.)
- description: string | null (1-3 sentence product description)
- sourceUrl: string (direct product page URL)
- tags: string[] (from this list only: [${TAGS.join(", ")}])

Rules:
- model must NOT include the brand name (e.g., "Terrapin System" not "Revelate Designs Terrapin System")
- Only include outdoor/adventure/cycling products. Skip accessories under €5, clothing if not relevant to the target categories.
- If weight is not listed on a product page, use null — do not guess.
- Assign 2-5 relevant tags per item.
- Extract every product in their catalog, not just featured ones. Navigate to all relevant subcategories.

When done, output a JSON array of product objects as your final message. Do not wrap in markdown — raw JSON only.

Example output:
[
  {
    "model": "Expedition Handlebar Pack",
    "category": "bags",
    "weightGrams": 300,
    "priceCents": 16000,
    "priceCurrency": "GBP",
    "description": "14L waterproof handlebar roll bag with internal dry bag and accessory pocket.",
    "sourceUrl": "https://apidura.com/shop/expedition-handlebar-pack/",
    "tags": ["bikepacking", "handlebar-bag", "bike-bag"]
  }
]`;
}

// ── Agentic tool-use loop ─────────────────────────────────────────

type CatalogItem = {
	model: string;
	category: string;
	weightGrams: number | null;
	priceCents: number | null;
	priceCurrency: string;
	description: string | null;
	sourceUrl: string;
	tags: string[];
};

async function runCrawlAgent(manufacturer: Awaited<ReturnType<typeof fetchManufacturer>>): Promise<CatalogItem[]> {
	const client = new Anthropic({ apiKey: ANTHROPIC_API_KEY });

	const tools: Anthropic.Tool[] = [
		{
			name: "fetch_page",
			description: "Fetch the HTML content of a URL. Use this to explore the manufacturer's website and product pages.",
			input_schema: {
				type: "object" as const,
				properties: {
					url: { type: "string", description: "The URL to fetch" },
				},
				required: ["url"],
			},
		},
	];

	const messages: Anthropic.MessageParam[] = [
		{
			role: "user",
			content: `Crawl ${manufacturer.name}'s website at ${manufacturer.website} and extract their complete product catalog. Start with the homepage or sitemap, navigate to all product categories, and return the full product list as JSON.`,
		},
	];

	let rounds = 0;

	while (rounds < MAX_TOOL_ROUNDS) {
		rounds++;
		console.log(`  [round ${rounds}] calling model...`);

		const response = await client.messages.create({
			model: MODEL,
			max_tokens: 8192,
			system: buildSystemPrompt(manufacturer),
			tools,
			messages,
		});

		// Add assistant response to history
		messages.push({ role: "assistant", content: response.content });

		if (response.stop_reason === "end_turn") {
			// Final message — extract JSON from text content
			const textBlock = response.content.find((b) => b.type === "text");
			if (!textBlock || textBlock.type !== "text") {
				throw new Error("Agent finished without text output");
			}
			return parseAgentOutput(textBlock.text);
		}

		if (response.stop_reason !== "tool_use") {
			throw new Error(`Unexpected stop reason: ${response.stop_reason}`);
		}

		// Process tool calls
		const toolResults: Anthropic.ToolResultBlockParam[] = [];
		for (const block of response.content) {
			if (block.type !== "tool_use") continue;
			if (block.name === "fetch_page") {
				const { url } = block.input as { url: string };
				console.log(`  [tool] fetch_page ${url}`);
				const content = await fetchPage(url);
				toolResults.push({
					type: "tool_result",
					tool_use_id: block.id,
					content,
				});
			}
		}

		messages.push({ role: "user", content: toolResults });
	}

	throw new Error(`Agent exceeded ${MAX_TOOL_ROUNDS} tool rounds without finishing`);
}

function parseAgentOutput(text: string): CatalogItem[] {
	// Handle agent wrapping output in markdown code blocks
	const cleaned = text.replace(/^```json\s*/i, "").replace(/\s*```$/i, "").trim();
	const parsed = JSON.parse(cleaned);
	if (!Array.isArray(parsed)) throw new Error("Agent output is not a JSON array");
	return parsed;
}

// ── Upsert to GearBox API ─────────────────────────────────────────

async function upsertItems(
	manufacturerSlug: string,
	items: CatalogItem[],
): Promise<{ created: number; updated: number }> {
	const payload = items.map((item) => ({
		manufacturerSlug,
		model: item.model,
		category: item.category,
		weightGrams: item.weightGrams ?? undefined,
		priceCents: item.priceCents ?? undefined,
		description: item.description ?? undefined,
		sourceUrl: item.sourceUrl,
		tags: item.tags,
	}));

	// Chunk into batches of 100 (API limit)
	let totalCreated = 0;
	let totalUpdated = 0;

	for (let i = 0; i < payload.length; i += 100) {
		const batch = payload.slice(i, i + 100);
		const res = await fetch(`${GEARBOX_URL}/api/global-items/bulk`, {
			method: "POST",
			headers: {
				"Content-Type": "application/json",
				"X-API-Key": GEARBOX_API_KEY,
			},
			body: JSON.stringify({ items: batch }),
		});

		if (!res.ok) {
			const err = await res.text();
			throw new Error(`Bulk upsert failed (HTTP ${res.status}): ${err}`);
		}

		const result = await res.json() as { created: number; updated: number };
		totalCreated += result.created;
		totalUpdated += result.updated;
		console.log(`  batch ${Math.floor(i / 100) + 1}: +${result.created} new, ~${result.updated} updated`);
	}

	return { created: totalCreated, updated: totalUpdated };
}

// ── Main ──────────────────────────────────────────────────────────

async function main() {
	console.log(`\nCrawling manufacturer: ${manufacturerSlug}`);
	if (dryRun) console.log("DRY RUN — products will not be saved\n");

	const manufacturer = await fetchManufacturer(manufacturerSlug);
	console.log(`Found: ${manufacturer.name} (${manufacturer.website})\n`);

	console.log("Starting agent crawl...");
	const items = await runCrawlAgent(manufacturer);
	console.log(`\nAgent extracted ${items.length} products`);

	if (dryRun) {
		console.log("\nDry run output (first 3 items):");
		console.log(JSON.stringify(items.slice(0, 3), null, 2));
		return;
	}

	console.log("\nUpserting to catalog...");
	const { created, updated } = await upsertItems(manufacturerSlug, items);
	console.log(`\nDone: ${created} created, ${updated} updated`);
}

main().catch((err) => {
	console.error(err);
	process.exit(1);
});
  • Step 3: Add market prices upsert after the bulk upsert

After upsertItems, add a call to also upsert marketPrices for each item that has a price. This requires knowing the item IDs returned from the bulk upsert and the manufacturer's country/currency. Add this helper after upsertItems:

async function upsertMarketPrices(
	globalItemIds: number[],
	items: CatalogItem[],
): Promise<void> {
	for (let i = 0; i < globalItemIds.length; i++) {
		const item = items[i];
		const globalItemId = globalItemIds[i];
		if (!item?.priceCents || !globalItemId) continue;

		// Derive market from currency
		const market = item.priceCurrency === "EUR" ? "EU"
			: item.priceCurrency === "USD" ? "US"
			: item.priceCurrency === "GBP" ? "GB"
			: item.priceCurrency;

		await fetch(`${GEARBOX_URL}/api/global-items/${globalItemId}/market-prices`, {
			method: "POST",
			headers: {
				"Content-Type": "application/json",
				"X-API-Key": GEARBOX_API_KEY,
			},
			body: JSON.stringify({
				market,
				currency: item.priceCurrency,
				priceCents: item.priceCents,
				source: "manufacturer-crawl",
			}),
		});
	}
}

Call upsertMarketPrices in main() after the bulk upsert, passing the item IDs from the API response.

Note: The bulk upsert response returns items[] with IDs. Store those and pass them here. Update the upsertItems function return type to also return itemIds: number[].

  • Step 4: Commit
git add scripts/crawl-manufacturer.ts
git commit -m "feat: crawl-manufacturer agent script — Haiku tool-use loop + bulk upsert"

Task 3: Batch runner

Files:

  • Create: scripts/crawl-all.ts

  • Step 1: Create scripts/crawl-all.ts

#!/usr/bin/env bun
/**
 * Crawl all active manufacturers of a given tier.
 *
 * Usage:
 *   bun run scripts/crawl-all.ts --tier=1
 *   bun run scripts/crawl-all.ts --tier=1 --dry-run
 */

const GEARBOX_URL = process.env.GEARBOX_URL ?? "http://localhost:3000";
const GEARBOX_API_KEY = process.env.GEARBOX_API_KEY ?? "";

const args = Object.fromEntries(
	process.argv
		.slice(2)
		.filter((a) => a.startsWith("--"))
		.map((a) => {
			const [k, v] = a.slice(2).split("=");
			return [k, v ?? "true"];
		}),
);

const tier = args["tier"] ? Number(args["tier"]) : 1;
const dryRun = args["dry-run"] === "true";

async function listActiveManufacturers(targetTier: number) {
	const res = await fetch(`${GEARBOX_URL}/api/manufacturers`);
	if (!res.ok) throw new Error(`Failed to list manufacturers: HTTP ${res.status}`);
	const all = await res.json() as Array<{ slug: string; tier: number; active: boolean; name: string }>;
	return all.filter((m) => m.active && m.tier === targetTier);
}

async function main() {
	if (!GEARBOX_API_KEY) {
		console.error("GEARBOX_API_KEY env var is required");
		process.exit(1);
	}

	const manufacturers = await listActiveManufacturers(tier);
	console.log(`Found ${manufacturers.length} active tier-${tier} manufacturers\n`);

	const results: Array<{ slug: string; status: "ok" | "error"; error?: string }> = [];

	for (const m of manufacturers) {
		console.log(`\n${"─".repeat(50)}`);
		console.log(`Crawling: ${m.name} (${m.slug})`);
		try {
			const extraArgs = dryRun ? ["--dry-run"] : [];
			const proc = Bun.spawn(
				["bun", "run", "scripts/crawl-manufacturer.ts", `--manufacturer=${m.slug}`, ...extraArgs],
				{ stdout: "inherit", stderr: "inherit", env: process.env },
			);
			const exitCode = await proc.exited;
			if (exitCode !== 0) throw new Error(`Exited with code ${exitCode}`);
			results.push({ slug: m.slug, status: "ok" });
		} catch (err) {
			console.error(`  ERROR: ${(err as Error).message}`);
			results.push({ slug: m.slug, status: "error", error: (err as Error).message });
		}
	}

	console.log(`\n${"═".repeat(50)}`);
	console.log("Summary:");
	for (const r of results) {
		const icon = r.status === "ok" ? "✓" : "✗";
		console.log(`  ${icon} ${r.slug}${r.error ? ` — ${r.error}` : ""}`);
	}

	const failed = results.filter((r) => r.status === "error");
	if (failed.length > 0) {
		console.error(`\n${failed.length} manufacturer(s) failed`);
		process.exit(1);
	}
}

main().catch((err) => {
	console.error(err);
	process.exit(1);
});
  • Step 2: Commit
git add scripts/crawl-all.ts
git commit -m "feat: crawl-all batch runner — iterate active manufacturers by tier"

Task 4: Package.json scripts + smoke test

Files:

  • Modify: package.json

  • Step 1: Add scripts to package.json

In the "scripts" section, add:

"db:crawl": "bun run scripts/crawl-manufacturer.ts",
"db:crawl-all": "bun run scripts/crawl-all.ts"
  • Step 2: Commit
git add package.json
git commit -m "chore: add db:crawl and db:crawl-all npm scripts"
  • Step 3: Smoke test with dry run

Ensure GearBox is running (bun run dev:server in another terminal) and the manufacturer exists in the DB.

GEARBOX_API_KEY=<your-api-key> ANTHROPIC_API_KEY=<your-key> \
  bun run db:crawl --manufacturer=apidura --dry-run

Expected output:

Crawling manufacturer: apidura
Found: Apidura (https://apidura.com)

Starting agent crawl...
  [round 1] calling model...
  [tool] fetch_page https://apidura.com
  ...
Agent extracted N products

Dry run output (first 3 items):
[
  {
    "model": "...",
    "category": "bags",
    ...
  }
]
  • Step 4: Live run against one manufacturer
GEARBOX_API_KEY=<your-api-key> ANTHROPIC_API_KEY=<your-key> \
  bun run db:crawl --manufacturer=apidura

Expected: products appear in the catalog. Verify by opening the app catalog search or calling GET /api/global-items?q=apidura.

  • Step 5: Commit any adjustments

If the agent prompt needed tuning (category mapping issues, extra noise in output), update buildSystemPrompt in crawl-manufacturer.ts and commit:

git add scripts/crawl-manufacturer.ts
git commit -m "fix: tune agent prompt for cleaner category/tag extraction"