# Catalog Ingestion Script Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Build a Bun script that takes a manufacturer slug, launches a Claude Haiku agent with web tools, crawls the manufacturer's site, and bulk-upserts the extracted products into the GearBox catalog via the existing API.

**Architecture:** `scripts/crawl-manufacturer.ts` is the entry point. It fetches the manufacturer record from the GearBox API, builds a system prompt containing the target schema and taxonomy, runs a Claude Haiku agent in an agentic tool-use loop (giving it a `fetch_page` tool backed by Bun's `fetch`), receives a JSON array of products, and posts them to `POST /api/global-items/bulk`. A `scripts/crawl-all.ts` batch runner iterates all active manufacturers of a given tier (tier 1 by default). Taxonomy (categories + tags) is defined in code and injected into the agent prompt.

**Tech Stack:** Bun, `@anthropic-ai/sdk`, Anthropic Claude (Haiku model for cost), GearBox API (local or deployed).

**Prerequisite:** Plan A (catalog-schema-migration) must be complete: the API must accept `manufacturerSlug` in bulk upserts.

---

## File Map

| Action | File | Purpose |
|--------|------|---------|
| Create | `scripts/taxonomy/categories.ts` | Canonical category values |
| Create | `scripts/taxonomy/tags.ts` | Canonical tag list |
| Create | `scripts/crawl-manufacturer.ts` | Core agent runner |
| Create | `scripts/crawl-all.ts` | Batch runner by tier |
| Modify | `package.json` | Add `db:crawl` and `db:crawl-all` script entries |

---

## Task 1: Taxonomy files

**Files:**
- Create: `scripts/taxonomy/categories.ts`
- Create: `scripts/taxonomy/tags.ts`

- [ ] **Step 1: Create `scripts/taxonomy/categories.ts`**

```typescript
/**
 * Canonical category values for globalItems.category.
 * These are the only valid values the ingestion agent should use.
 */
export const CATEGORIES = [
  "bags", // bikepacking bags, dry bags, stuff sacks
  "shelters", // tents, bivys, tarps, hammocks
  "sleep", // sleeping bags, quilts, pads, pillows
  "cooking", // stoves, cookware, mugs, utensils
  "lighting", // headlamps, bike lights, lanterns
  "water", // filters, bottles, bladders
  "electronics", // power banks, solar panels, GPS, bike computers
  "tools", // multi-tools, pumps, repair kits, locks
  "clothing", // jackets, base layers, gloves, shoes
  "navigation", // GPS devices, maps, compasses
  "bikes", // complete bikes
  "components", // drivetrain, brakes, wheels, handlebars, saddles, stems
] as const;

export type Category = (typeof CATEGORIES)[number];
```

- [ ] **Step 2: Create `scripts/taxonomy/tags.ts`**

```typescript
/**
 * Canonical tags for globalItems.
 * Mirrors the seed tags in src/db/seed-global-items.ts.
 * The agent should only use tags from this list.
 */
export const TAGS = [
  // Activity
  "bikepacking", "cycling", "hiking", "backpacking", "camping", "climbing",
  "mountaineering", "road-cycling", "gravel", "running", "trail-running",
  // Bag subtypes
  "handlebar-bag", "framebag", "saddlebag", "top-tube-bag", "stem-bag",
  "fork-bag", "feed-bag", "dry-bag", "stuff-sack", "bike-bag",
  // Shelter subtypes
  "tent", "bivy", "tarp", "hammock",
  // Sleep subtypes
  "sleeping-bag", "sleeping-pad", "quilt", "pillow",
  // Cooking subtypes
  "stove", "cookware", "mug", "utensils",
  // Water subtypes
  "water-filter", "water-bottle",
  // Lighting subtypes
  "headlamp", "bike-light", "lantern",
  // Electronics subtypes
  "gps", "bike-computer", "power-bank", "solar-panel",
  // Tools subtypes
  "multi-tool", "pump", "repair-kit", "lock",
  // Clothing subtypes
  "rain-jacket", "base-layer", "gloves", "shoe",
] as const;

export type Tag = (typeof TAGS)[number];
```
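
Because both lists are declared `as const`, they can also serve as runtime validators for whatever the agent returns. The following is a minimal sketch of that idea, not part of the planned files: `isCategory` and `filterTags` are hypothetical helper names, and the lists are abbreviated copies so the snippet runs on its own.

```typescript
// Abbreviated copies of the canonical lists so this sketch is self-contained;
// in the repo they would be imported from scripts/taxonomy/.
const CATEGORIES = ["bags", "shelters", "sleep"] as const;
const TAGS = ["bikepacking", "handlebar-bag", "tent", "quilt"] as const;

type Category = (typeof CATEGORIES)[number];
type Tag = (typeof TAGS)[number];

// Hypothetical type guard: narrows an arbitrary string to Category.
function isCategory(value: string): value is Category {
  return (CATEGORIES as readonly string[]).includes(value);
}

// Hypothetical helper: keeps only canonical tags, preserving input order.
function filterTags(values: string[]): Tag[] {
  const allowed = new Set<string>(TAGS);
  return values.filter((v): v is Tag => allowed.has(v));
}

console.log(isCategory("bags")); // true
console.log(filterTags(["tent", "ultralight"])); // [ "tent" ]
```

A guard like this would let the crawl script drop or flag items whose category or tags fall outside the taxonomy before they reach the API.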

- [ ] **Step 3: Commit**

```bash
git add scripts/taxonomy/
git commit -m "feat: canonical taxonomy — categories and tags for ingestion"
```

---

## Task 2: Core crawl script

**Files:**
- Create: `scripts/crawl-manufacturer.ts`

- [ ] **Step 1: Verify `@anthropic-ai/sdk` is available**

```bash
bun pm ls | grep anthropic
```

If not listed:
```bash
bun add @anthropic-ai/sdk
```

- [ ] **Step 2: Create `scripts/crawl-manufacturer.ts`**

```typescript
#!/usr/bin/env bun
/**
 * Crawl a manufacturer's website and upsert their products into the GearBox catalog.
 *
 * Usage:
 *   bun run scripts/crawl-manufacturer.ts --manufacturer=apidura
 *   bun run scripts/crawl-manufacturer.ts --manufacturer=canyon --dry-run
 *
 * Env vars required:
 *   ANTHROPIC_API_KEY — Anthropic API key
 *   GEARBOX_URL — Base URL of the GearBox instance (default: http://localhost:3000)
 *   GEARBOX_API_KEY — GearBox API key with write access
 */

import Anthropic from "@anthropic-ai/sdk";
import { CATEGORIES } from "./taxonomy/categories.ts";
import { TAGS } from "./taxonomy/tags.ts";

const GEARBOX_URL = process.env.GEARBOX_URL ?? "http://localhost:3000";
const GEARBOX_API_KEY = process.env.GEARBOX_API_KEY ?? "";
const ANTHROPIC_API_KEY = process.env.ANTHROPIC_API_KEY ?? "";
const MODEL = "claude-haiku-4-5-20251001";
const MAX_TOOL_ROUNDS = 30; // safety limit

// ── Parse CLI args ────────────────────────────────────────────────

const args = Object.fromEntries(
  process.argv
    .slice(2)
    .filter((a) => a.startsWith("--"))
    .map((a) => {
      const [k, v] = a.slice(2).split("=");
      return [k, v ?? "true"];
    }),
);

const manufacturerSlug = args["manufacturer"];
const dryRun = args["dry-run"] === "true";

if (!manufacturerSlug) {
  console.error("Usage: bun run scripts/crawl-manufacturer.ts --manufacturer=<slug>");
  process.exit(1);
}

if (!GEARBOX_API_KEY) {
  console.error("GEARBOX_API_KEY env var is required");
  process.exit(1);
}

if (!ANTHROPIC_API_KEY) {
  console.error("ANTHROPIC_API_KEY env var is required");
  process.exit(1);
}

// ── Fetch manufacturer from GearBox ──────────────────────────────

async function fetchManufacturer(slug: string) {
  const res = await fetch(`${GEARBOX_URL}/api/manufacturers/${slug}`);
  if (!res.ok) {
    throw new Error(`Manufacturer not found: ${slug} (HTTP ${res.status})`);
  }
  return res.json() as Promise<{
    id: number;
    name: string;
    slug: string;
    website: string;
    tier: number;
    country: string | null;
  }>;
}

// ── Tool: fetch a web page ────────────────────────────────────────

async function fetchPage(url: string): Promise<string> {
  try {
    const res = await fetch(url, {
      headers: {
        "User-Agent": "Mozilla/5.0 (compatible; GearBox-Catalog-Bot/1.0)",
        Accept: "text/html,application/xhtml+xml",
      },
      signal: AbortSignal.timeout(15_000),
    });
    if (!res.ok) return `HTTP ${res.status} for ${url}`;
    const html = await res.text();
    // Strip scripts, styles, and excessive whitespace for token efficiency
    return html
      .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, "")
      .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, "")
      .replace(/<!--[\s\S]*?-->/g, "")
      .replace(/\s{3,}/g, " ")
      .slice(0, 60_000); // cap at 60k chars to stay within context
  } catch (err) {
    return `Error fetching ${url}: ${(err as Error).message}`;
  }
}

// ── Build system prompt ───────────────────────────────────────────

function buildSystemPrompt(manufacturer: Awaited<ReturnType<typeof fetchManufacturer>>) {
  return `You are a product data extraction agent for GearBox, a gear management app for bikepacking, cycling, and hiking.

Your task: crawl ${manufacturer.name}'s website (${manufacturer.website}) and extract their complete product catalog.

For each product, extract:
- model: string (product name WITHOUT the brand prefix)
- category: one of [${CATEGORIES.join(", ")}]
- weightGrams: number | null (weight in grams — convert if shown in oz/lbs/kg)
- priceCents: number | null (MSRP in cents, base currency)
- priceCurrency: string (ISO currency code — "EUR" for DE brands, "USD" for US, "GBP" for GB, etc.)
- description: string | null (1-3 sentence product description)
- sourceUrl: string (direct product page URL)
- tags: string[] (from this list only: [${TAGS.join(", ")}])

Rules:
- model must NOT include the brand name (e.g., "Terrapin System" not "Revelate Designs Terrapin System")
- Only include outdoor/adventure/cycling products. Skip accessories priced under €5, and skip clothing that does not fit the target categories.
- If weight is not listed on a product page, use null — do not guess.
- Assign 2-5 relevant tags per item.
- Extract every product in their catalog, not just featured ones. Navigate to all relevant subcategories.

When done, output a JSON array of product objects as your final message. Do not wrap in markdown — raw JSON only.

Example output:
[
  {
    "model": "Expedition Handlebar Pack",
    "category": "bags",
    "weightGrams": 300,
    "priceCents": 16000,
    "priceCurrency": "GBP",
    "description": "14L waterproof handlebar roll bag with internal dry bag and accessory pocket.",
    "sourceUrl": "https://apidura.com/shop/expedition-handlebar-pack/",
    "tags": ["bikepacking", "handlebar-bag", "bike-bag"]
  }
]`;
}

// ── Agentic tool-use loop ─────────────────────────────────────────

type CatalogItem = {
  model: string;
  category: string;
  weightGrams: number | null;
  priceCents: number | null;
  priceCurrency: string;
  description: string | null;
  sourceUrl: string;
  tags: string[];
};

async function runCrawlAgent(
  manufacturer: Awaited<ReturnType<typeof fetchManufacturer>>,
): Promise<CatalogItem[]> {
  const client = new Anthropic({ apiKey: ANTHROPIC_API_KEY });

  const tools: Anthropic.Tool[] = [
    {
      name: "fetch_page",
      description:
        "Fetch the HTML content of a URL. Use this to explore the manufacturer's website and product pages.",
      input_schema: {
        type: "object" as const,
        properties: {
          url: { type: "string", description: "The URL to fetch" },
        },
        required: ["url"],
      },
    },
  ];

  const messages: Anthropic.MessageParam[] = [
    {
      role: "user",
      content: `Crawl ${manufacturer.name}'s website at ${manufacturer.website} and extract their complete product catalog. Start with the homepage or sitemap, navigate to all product categories, and return the full product list as JSON.`,
    },
  ];

  // The system prompt does not change between rounds, so build it once
  const system = buildSystemPrompt(manufacturer);

  let rounds = 0;

  while (rounds < MAX_TOOL_ROUNDS) {
    rounds++;
    console.log(`  [round ${rounds}] calling model...`);

    const response = await client.messages.create({
      model: MODEL,
      max_tokens: 8192,
      system,
      tools,
      messages,
    });

    // Add assistant response to history
    messages.push({ role: "assistant", content: response.content });

    if (response.stop_reason === "end_turn") {
      // Final message — extract JSON from text content
      const textBlock = response.content.find((b) => b.type === "text");
      if (!textBlock || textBlock.type !== "text") {
        throw new Error("Agent finished without text output");
      }
      return parseAgentOutput(textBlock.text);
    }

    if (response.stop_reason !== "tool_use") {
      throw new Error(`Unexpected stop reason: ${response.stop_reason}`);
    }

    // Process tool calls
    const toolResults: Anthropic.ToolResultBlockParam[] = [];
    for (const block of response.content) {
      if (block.type !== "tool_use") continue;
      if (block.name !== "fetch_page") {
        // Every tool_use block needs a matching tool_result in the next message
        toolResults.push({
          type: "tool_result",
          tool_use_id: block.id,
          content: `Unknown tool: ${block.name}`,
          is_error: true,
        });
        continue;
      }
      const { url } = block.input as { url: string };
      console.log(`  [tool] fetch_page ${url}`);
      const content = await fetchPage(url);
      toolResults.push({
        type: "tool_result",
        tool_use_id: block.id,
        content,
      });
    }

    messages.push({ role: "user", content: toolResults });
  }

  throw new Error(`Agent exceeded ${MAX_TOOL_ROUNDS} tool rounds without finishing`);
}

function parseAgentOutput(text: string): CatalogItem[] {
  // Handle agent wrapping output in a markdown code fence, with or without a "json" label
  const cleaned = text
    .trim()
    .replace(/^```(?:json)?\s*/i, "")
    .replace(/\s*```$/, "")
    .trim();
  const parsed = JSON.parse(cleaned);
  if (!Array.isArray(parsed)) throw new Error("Agent output is not a JSON array");
  return parsed;
}

// ── Upsert to GearBox API ─────────────────────────────────────────

async function upsertItems(
  manufacturerSlug: string,
  items: CatalogItem[],
): Promise<{ created: number; updated: number }> {
  const payload = items.map((item) => ({
    manufacturerSlug,
    model: item.model,
    category: item.category,
    weightGrams: item.weightGrams ?? undefined,
    priceCents: item.priceCents ?? undefined,
    description: item.description ?? undefined,
    sourceUrl: item.sourceUrl,
    tags: item.tags,
  }));

  // Chunk into batches of 100 (API limit)
  let totalCreated = 0;
  let totalUpdated = 0;

  for (let i = 0; i < payload.length; i += 100) {
    const batch = payload.slice(i, i + 100);
    const res = await fetch(`${GEARBOX_URL}/api/global-items/bulk`, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "X-API-Key": GEARBOX_API_KEY,
      },
      body: JSON.stringify({ items: batch }),
    });

    if (!res.ok) {
      const err = await res.text();
      throw new Error(`Bulk upsert failed (HTTP ${res.status}): ${err}`);
    }

    const result = (await res.json()) as { created: number; updated: number };
    totalCreated += result.created;
    totalUpdated += result.updated;
    console.log(`  batch ${Math.floor(i / 100) + 1}: +${result.created} new, ~${result.updated} updated`);
  }

  return { created: totalCreated, updated: totalUpdated };
}

// ── Main ──────────────────────────────────────────────────────────

async function main() {
  console.log(`\nCrawling manufacturer: ${manufacturerSlug}`);
  if (dryRun) console.log("DRY RUN — products will not be saved\n");

  const manufacturer = await fetchManufacturer(manufacturerSlug);
  console.log(`Found: ${manufacturer.name} (${manufacturer.website})\n`);

  console.log("Starting agent crawl...");
  const items = await runCrawlAgent(manufacturer);
  console.log(`\nAgent extracted ${items.length} products`);

  if (dryRun) {
    console.log("\nDry run output (first 3 items):");
    console.log(JSON.stringify(items.slice(0, 3), null, 2));
    return;
  }

  console.log("\nUpserting to catalog...");
  const { created, updated } = await upsertItems(manufacturerSlug, items);
  console.log(`\nDone: ${created} created, ${updated} updated`);
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```
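
`parseAgentOutput` returns whatever JSON the model produced, typed as `CatalogItem[]` without checking individual items. The sketch below shows one possible per-item validation pass, under the assumption that malformed items should be dropped and reported rather than failing the whole run. `validateItems`, `isNumberOrNull`, and the abbreviated `KNOWN_CATEGORIES` list are hypothetical and not part of the script above.

```typescript
type CatalogItem = {
  model: string;
  category: string;
  weightGrams: number | null;
  priceCents: number | null;
  priceCurrency: string;
  description: string | null;
  sourceUrl: string;
  tags: string[];
};

// Abbreviated for the sketch; the real script would use CATEGORIES from scripts/taxonomy/.
const KNOWN_CATEGORIES: readonly string[] = ["bags", "shelters", "sleep"];

function isNumberOrNull(v: unknown): v is number | null {
  return v === null || (typeof v === "number" && Number.isFinite(v));
}

// Hypothetical helper: splits raw agent output into usable items and rejects.
function validateItems(raw: unknown[]): { valid: CatalogItem[]; rejected: unknown[] } {
  const valid: CatalogItem[] = [];
  const rejected: unknown[] = [];
  for (const entry of raw) {
    const r = entry as Partial<CatalogItem> | null;
    const ok =
      typeof r === "object" && r !== null &&
      typeof r.model === "string" && r.model.trim() !== "" &&
      typeof r.category === "string" && KNOWN_CATEGORIES.includes(r.category) &&
      isNumberOrNull(r.weightGrams) &&
      isNumberOrNull(r.priceCents) &&
      typeof r.priceCurrency === "string" &&
      (r.description === null || typeof r.description === "string") &&
      typeof r.sourceUrl === "string" &&
      Array.isArray(r.tags) && r.tags.every((t) => typeof t === "string");
    if (ok) valid.push(r as CatalogItem);
    else rejected.push(entry);
  }
  return { valid, rejected };
}
```

In `main()`, a pass like this could run between `runCrawlAgent` and `upsertItems`, logging the rejected entries so that a dry run shows where the prompt needs tuning.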

- [ ] **Step 3: Add market prices upsert after the bulk upsert**

After `upsertItems`, also upsert `marketPrices` for each item that has a price. This requires the item IDs returned from the bulk upsert and each item's `priceCurrency`, from which the market is derived. Add this helper after `upsertItems`:

```typescript
async function upsertMarketPrices(
  globalItemIds: number[],
  items: CatalogItem[],
): Promise<void> {
  // IDs and items are paired by index, so both arrays must be in the same order
  for (let i = 0; i < globalItemIds.length; i++) {
    const item = items[i];
    const globalItemId = globalItemIds[i];
    if (!item?.priceCents || !globalItemId) continue;

    // Derive market from currency
    const market =
      item.priceCurrency === "EUR" ? "EU"
      : item.priceCurrency === "USD" ? "US"
      : item.priceCurrency === "GBP" ? "GB"
      : item.priceCurrency;

    const res = await fetch(`${GEARBOX_URL}/api/global-items/${globalItemId}/market-prices`, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "X-API-Key": GEARBOX_API_KEY,
      },
      body: JSON.stringify({
        market,
        currency: item.priceCurrency,
        priceCents: item.priceCents,
        source: "manufacturer-crawl",
      }),
    });

    // Report failures instead of dropping them silently
    if (!res.ok) {
      console.warn(`  market price upsert failed for item ${globalItemId} (HTTP ${res.status})`);
    }
  }
}
```

Call `upsertMarketPrices` in `main()` after the bulk upsert, passing the item IDs from the API response together with the same `items` array in the same order, because the helper pairs IDs with items by index.

Note: The bulk upsert response returns `items[]` with IDs. Store those and pass them here. Update the `upsertItems` return type to also include `itemIds: number[]`.
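
One way the batching loop could collect those IDs is sketched below. The response shape `{ created, updated, items: [{ id }] }` and the guarantee that `items` comes back in request order are assumptions; the plan only states that the response carries `items[]` with IDs. `upsertInBatches` and the injected `postBatch` function are hypothetical names, used so the loop can be exercised without a running API.

```typescript
// Assumed response shape for POST /api/global-items/bulk.
type BulkResult = { created: number; updated: number; items: { id: number }[] };

// Hypothetical transport: sends one batch and resolves to the parsed response body.
type PostBatch = (batch: unknown[]) => Promise<BulkResult>;

async function upsertInBatches(
  payload: unknown[],
  postBatch: PostBatch,
  batchSize = 100,
): Promise<{ created: number; updated: number; itemIds: number[] }> {
  let created = 0;
  let updated = 0;
  const itemIds: number[] = [];
  for (let i = 0; i < payload.length; i += batchSize) {
    const result = await postBatch(payload.slice(i, i + batchSize));
    created += result.created;
    updated += result.updated;
    // Appending batch by batch keeps itemIds in payload order,
    // assuming the API returns items in request order.
    itemIds.push(...result.items.map((item) => item.id));
  }
  return { created, updated, itemIds };
}
```

In the real script, `postBatch` would wrap the existing `fetch` call to `/api/global-items/bulk`, and `main()` would pass the resulting `itemIds` to `upsertMarketPrices` alongside `items`.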

- [ ] **Step 4: Commit**

```bash
git add scripts/crawl-manufacturer.ts
git commit -m "feat: crawl-manufacturer agent script — Haiku tool-use loop + bulk upsert"
```

---

## Task 3: Batch runner

**Files:**
- Create: `scripts/crawl-all.ts`

- [ ] **Step 1: Create `scripts/crawl-all.ts`**

```typescript
#!/usr/bin/env bun
/**
 * Crawl all active manufacturers of a given tier.
 *
 * Usage:
 *   bun run scripts/crawl-all.ts --tier=1
 *   bun run scripts/crawl-all.ts --tier=1 --dry-run
 */

const GEARBOX_URL = process.env.GEARBOX_URL ?? "http://localhost:3000";
const GEARBOX_API_KEY = process.env.GEARBOX_API_KEY ?? "";

const args = Object.fromEntries(
  process.argv
    .slice(2)
    .filter((a) => a.startsWith("--"))
    .map((a) => {
      const [k, v] = a.slice(2).split("=");
      return [k, v ?? "true"];
    }),
);

const tier = args["tier"] ? Number(args["tier"]) : 1;
const dryRun = args["dry-run"] === "true";

async function listActiveManufacturers(targetTier: number) {
  const res = await fetch(`${GEARBOX_URL}/api/manufacturers`);
  if (!res.ok) throw new Error(`Failed to list manufacturers: HTTP ${res.status}`);
  const all = (await res.json()) as Array<{ slug: string; tier: number; active: boolean; name: string }>;
  return all.filter((m) => m.active && m.tier === targetTier);
}

async function main() {
  if (!GEARBOX_API_KEY) {
    console.error("GEARBOX_API_KEY env var is required");
    process.exit(1);
  }

  const manufacturers = await listActiveManufacturers(tier);
  console.log(`Found ${manufacturers.length} active tier-${tier} manufacturers\n`);

  const results: Array<{ slug: string; status: "ok" | "error"; error?: string }> = [];

  // Crawl sequentially; each manufacturer runs in its own child process
  for (const m of manufacturers) {
    console.log(`\n${"─".repeat(50)}`);
    console.log(`Crawling: ${m.name} (${m.slug})`);
    try {
      const extraArgs = dryRun ? ["--dry-run"] : [];
      const proc = Bun.spawn(
        ["bun", "run", "scripts/crawl-manufacturer.ts", `--manufacturer=${m.slug}`, ...extraArgs],
        { stdout: "inherit", stderr: "inherit", env: process.env },
      );
      const exitCode = await proc.exited;
      if (exitCode !== 0) throw new Error(`Exited with code ${exitCode}`);
      results.push({ slug: m.slug, status: "ok" });
    } catch (err) {
      console.error(`  ERROR: ${(err as Error).message}`);
      results.push({ slug: m.slug, status: "error", error: (err as Error).message });
    }
  }

  console.log(`\n${"═".repeat(50)}`);
  console.log("Summary:");
  for (const r of results) {
    const icon = r.status === "ok" ? "✓" : "✗";
    console.log(`  ${icon} ${r.slug}${r.error ? ` — ${r.error}` : ""}`);
  }

  const failed = results.filter((r) => r.status === "error");
  if (failed.length > 0) {
    console.error(`\n${failed.length} manufacturer(s) failed`);
    process.exit(1);
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```
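
In the runner above, `Number(args["tier"])` evaluates to `NaN` for input such as `--tier=one`, and the `m.tier === targetTier` filter then matches nothing without reporting a problem. A small sketch of stricter parsing follows, assuming tiers are positive integers; `parseTier` is a hypothetical helper, not part of the planned script.

```typescript
// Hypothetical helper: parses the --tier value, falling back to tier 1 when absent.
function parseTier(raw: string | undefined, fallback = 1): number {
  if (raw === undefined) return fallback;
  const tier = Number(raw);
  if (!Number.isInteger(tier) || tier < 1) {
    throw new Error(`Invalid --tier value: ${raw} (expected a positive integer)`);
  }
  return tier;
}

console.log(parseTier("2")); // 2
console.log(parseTier(undefined)); // 1
```

Swapping this in would amount to `const tier = parseTier(args["tier"]);`, so a mistyped flag fails fast instead of producing an empty crawl.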

- [ ] **Step 2: Commit**

```bash
git add scripts/crawl-all.ts
git commit -m "feat: crawl-all batch runner — iterate active manufacturers by tier"
```

---

## Task 4: Package.json scripts + smoke test

**Files:**
- Modify: `package.json`

- [ ] **Step 1: Add scripts to `package.json`**

In the `"scripts"` section, add:

```json
"db:crawl": "bun run scripts/crawl-manufacturer.ts",
"db:crawl-all": "bun run scripts/crawl-all.ts"
```

- [ ] **Step 2: Commit**

```bash
git add package.json
git commit -m "chore: add db:crawl and db:crawl-all npm scripts"
```

- [ ] **Step 3: Smoke test with dry run**

Ensure GearBox is running (`bun run dev:server` in another terminal) and the manufacturer exists in the DB.

```bash
GEARBOX_API_KEY=<your-api-key> ANTHROPIC_API_KEY=<your-key> \
  bun run db:crawl --manufacturer=apidura --dry-run
```

Expected output:
```
Crawling manufacturer: apidura
DRY RUN — products will not be saved

Found: Apidura (https://apidura.com)

Starting agent crawl...
  [round 1] calling model...
  [tool] fetch_page https://apidura.com
  ...
Agent extracted N products

Dry run output (first 3 items):
[
  {
    "model": "...",
    "category": "bags",
    ...
  }
]
```

- [ ] **Step 4: Live run against one manufacturer**

```bash
GEARBOX_API_KEY=<your-api-key> ANTHROPIC_API_KEY=<your-key> \
  bun run db:crawl --manufacturer=apidura
```

Expected: products appear in the catalog. Verify by opening the app catalog search or calling `GET /api/global-items?q=apidura`.

- [ ] **Step 5: Commit any adjustments**

If the agent prompt needed tuning (category mapping issues, extra noise in output), update `buildSystemPrompt` in `crawl-manufacturer.ts` and commit:

```bash
git add scripts/crawl-manufacturer.ts
git commit -m "fix: tune agent prompt for cleaner category/tag extraction"
```
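
If extra prose around the JSON persists after prompt tuning, the parser could be made more tolerant instead. The following is a hedged sketch of that alternative, not the plan's `parseAgentOutput`: `extractJsonArray` is a hypothetical helper that takes the outermost `[` ... `]` span of the final message and parses only that. It will still fail if the surrounding prose itself contains square brackets.

```typescript
// Hypothetical tolerant parser: ignores text before the first "[" and after the last "]".
function extractJsonArray(text: string): unknown[] {
  const start = text.indexOf("[");
  const end = text.lastIndexOf("]");
  if (start === -1 || end <= start) {
    throw new Error("No JSON array found in agent output");
  }
  const parsed: unknown = JSON.parse(text.slice(start, end + 1));
  if (!Array.isArray(parsed)) throw new Error("Agent output is not a JSON array");
  return parsed;
}

const sample = 'Here is the catalog:\n[{"model": "Expedition Handlebar Pack"}]\nDone.';
console.log(extractJsonArray(sample).length); // 1
```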