Content Recommendations Content Collector Bulk Load Guide
This guide explains how to seed the Content Recommendations catalogue from an existing CMS export — without waiting for a live CMS webhook or IFX integration to backfill it organically.
The Collector API does not expose a dedicated batch endpoint for content.
“Bulk load” means streaming each item from your export through the single-item
content endpoint, POST /collector/v1/content, with appropriate pacing,
idempotency, and verification.
When to use bulk load vs IFX streaming
| Use bulk load when… | Use IFX streaming when… |
|---|---|
| You are onboarding to Content Recommendations and need to seed your existing back catalogue. | You are wiring the ongoing, live sync from your CMS to Content Recommendations. |
| You are re-importing after a CMS migration, a content model change, or a catalogue cleanup. | You want every publish, update, and unpublish to reach Content Recommendations within seconds, automatically. |
| You have a one-shot export (CSV / JSON / a database dump) and need to push it in. | You are an Arc XP CMS customer and the IFX Recipe is already an option. |
The two paths are complementary, not exclusive. The typical onboarding flow is: bulk-load the back catalogue once, then wire IFX (or your own webhook) for ongoing changes. See the Content Recommendations Onboarding Checklist for the full sequence.
Prerequisites
- Content Recommendations service provisioned.
- A Collector API Delivery API token. Provision it from its own key collection, separate from your Recommendations API token. See Provisioning tokens through Delivery API.
- An export of your published catalogue. Deleted, unpublished, and draft items must not be sent.
Authentication
Pass your Collector API Delivery API token in the `X-Api-Key` header on every request:

    X-Api-Key: <your-collector-delivery-api-token>

Use the Collector API token, not the Recommendations API token. The two are provisioned from separate key collections so a token exposed in one context cannot be used against the other API.
Endpoint summary
| Field | Value |
|---|---|
| Method + path | POST /collector/v1/content |
| Request body | A single JSON UpsertContent payload — one item per request. |
| Success response | 202 Accepted with empty body. |
| Validation error | 422 Unprocessable Entity with HTTPValidationError body. |
| Auth error | 401 Unauthorized if X-Api-Key is missing or unknown. |
| Semantics | action: "publish" is an upsert on (site_id, item_id). Reposting the same payload is safe. |
See the POST /collector/v1/content reference for the full schema, including all optional fields.
Input format and field mapping
Every row in your export maps to one `UpsertContent` payload (action: "publish"). At minimum, each
payload needs:
| Payload field | What it is | Where it comes from in a typical CMS export |
|---|---|---|
| `action` | Always `"publish"` for bulk load. | Constant. |
| `item_id` | Stable identifier for the content item. | The CMS’s primary identifier for the item (e.g. an ArcID). |
| `site_id` | The website / property the item belongs to. | The site or brand the item is published under. |
| `type` | One of `article`, `video`, `podcast`. | The item’s content type. |
| `timestamp` | Publication date, ISO 8601 with explicit offset. | The item’s published-at timestamp, not the export time. |
| `title` | Display title. | The item’s headline / title field. |
| `categories`, `tags`, `author`, `is_premium`, `metadata` | Optional but recommended. | Whatever taxonomy and attribution fields your export carries. |
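
The mapping in the table above can be sketched as a small projection function. This is a minimal illustration, not a reference implementation: the export column names (`id`, `site`, `content_type`, `published_at`, `headline`, `sections`, `byline`, `paywalled`) are hypothetical and will differ per CMS; only the payload field names come from this guide.

```python
from datetime import datetime

ALLOWED_TYPES = {"article", "video", "podcast"}


def to_upsert_payload(row: dict) -> dict:
    """Project one export row (hypothetical column names) onto an UpsertContent payload."""
    content_type = row["content_type"].strip().lower()
    if content_type not in ALLOWED_TYPES:
        raise ValueError(f"unsupported type: {content_type!r}")

    # The timestamp is the item's published-at time and must carry an explicit offset.
    published = datetime.fromisoformat(row["published_at"])
    if published.tzinfo is None:
        raise ValueError("published_at has no UTC offset (naive datetime)")

    payload = {
        "action": "publish",        # constant for bulk load
        "item_id": row["id"],       # sent exactly as stored, no normalisation
        "site_id": row["site"],
        "type": content_type,
        "timestamp": published.isoformat(),
        "title": row["headline"],
    }

    # Optional but recommended fields: include them only when the export carries them.
    if row.get("sections"):
        payload["categories"] = list(row["sections"])
    if row.get("tags"):
        payload["tags"] = list(row["tags"])
    if row.get("byline"):
        payload["author"] = row["byline"]
    if "paywalled" in row:
        payload["is_premium"] = bool(row["paywalled"])
    return payload
```

Rejecting naive datetimes and unknown types on the client keeps rows that would fail validation out of the load and makes mapping bugs visible before the first request is sent.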
item_id rules to settle before you start the load
A bulk load is the moment you commit to your item_id scheme. Every interaction
you eventually record will be keyed against the IDs you choose here, and there
is no retroactive rebind if you change schemes later. Confirm the following
before sending the first request:
- Match what the live path will use. Whichever `item_id` your live CMS webhook (or IFX integration) will send for a given item after the bulk load is exactly what the bulk load must send for the same item now. A mismatch orphans the bulk-loaded record the first time that item updates through the live path.
- Pick one source per content piece and never change it. Treat the `item_id` as a primary key, not a label. Common choices are the CMS’s internal record ID (stable across slug edits, re-categorisation, and SEO rewrites) or the canonical URL slug (human-readable, but fragile if editors rewrite it).
- Comparison is byte-for-byte. `ABC123`, `abc123`, `abc-123`, and `abc_123` are four different items. Decide on a casing and punctuation convention up front and apply it consistently: silently normalising in the bulk load and not in the live webhook (or vice versa) breaks the join.
- Length is enforced. The API rejects an `item_id` shorter than 1 or longer than 256 characters with `422`. Aim for ≤ 64 characters so IDs stay readable in logs and dashboards.
- No PII. Item IDs surface in operational logs. Do not embed personal data, internal user IDs, or any directly identifying value.
- Same identifier on both sides of the join. The `item_id` on every content payload here is the same identifier your behavioral-event producers will send on every `POST /collector/v1/events` / `POST /collector/v1/events/batch`. A mismatch silently breaks personalisation, because events cannot contribute to a catalogue entry the model has not seen.
See the Critical requirements and Content endpoint sections of the Content Recommendations API Developer Guide for the full field-level rules and the complete optional-field list.
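
A pre-flight check over the export can catch some of these problems before the first request. The sketch below is illustrative only: `check_item_id` and `find_near_duplicates` are hypothetical helper names, and the near-duplicate rule (ignore case, `-`, and `_`) is just one reasonable heuristic for spotting IDs that a byte-for-byte comparison would treat as different items.

```python
def check_item_id(item_id: str) -> list[str]:
    """Return a list of problems found with one item_id (empty list means OK)."""
    problems = []
    if not 1 <= len(item_id) <= 256:
        problems.append("length outside 1..256: the API rejects this with 422")
    elif len(item_id) > 64:
        problems.append("longer than the recommended 64 characters")
    return problems


def find_near_duplicates(item_ids) -> list[list[str]]:
    """Group IDs that differ only by letter case or by '-' / '_' punctuation."""
    groups: dict[str, set[str]] = {}
    for item_id in item_ids:
        folded = item_id.lower().replace("-", "").replace("_", "")
        groups.setdefault(folded, set()).add(item_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```

A non-empty result from `find_near_duplicates` does not prove a bug, but it usually means two producers disagree on the casing or punctuation convention and is worth resolving before the load.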
Chunking and submitting a backlog
The content endpoint takes one item per request, so “bulk” is a client-side concern: how many items per second you push, and how many you push in parallel.
- Pace on aggregate throughput, not per-request. Whether you run the load as one process or many parallel workers, Content Recommendations sees the sum of your request rate, and that aggregate is the only number that matters. Per-request `202`s confirm the payload was queued; they do not confirm the downstream pipeline is keeping up. Start conservative (a few tens of requests per second across all workers is a reasonable opener for an initial backfill), watch for `5xx` responses and growing latency, and ramp only when you have evidence of headroom.
- Use modest concurrency. A small number of concurrent workers (for example, 4–16) per export job is usually enough to saturate ingest capacity without overwhelming downstream processing. Increase only if you observe headroom; back off on `5xx`.
- Stable ordering does not matter. Items can be sent in any order. The upsert keys on `(site_id, item_id)`, not on send order.
- Plan for resumability. Track which `item_id`s have been successfully acknowledged so a partially-completed run can resume without re-sending the whole catalogue. Resending is safe (see Idempotency below), but skipping already-sent items shortens recovery time.
- Backfills should run before any user-facing recommendations are exposed. See the Pre-launch phase of the Onboarding Checklist.
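
One way to keep the aggregate rate bounded regardless of worker count is to have every worker draw its send slot from a single shared pacer. The following is a minimal sketch under stated assumptions: `Pacer` and `run_load` are hypothetical names, `send` is a caller-supplied function that POSTs one payload and returns the HTTP status, and the defaults (20 requests per second, 8 workers) simply fall inside the ranges suggested above.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Pacer:
    """Hands out evenly spaced send slots shared by all workers."""

    def __init__(self, rate_per_sec, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / rate_per_sec
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = clock()

    def wait(self):
        # Reserve the next free slot under the lock, then sleep outside it
        # so other workers can reserve their own slots in the meantime.
        with self._lock:
            now = self._clock()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self._interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)


def run_load(payloads, send, rate_per_sec=20, workers=8):
    """Send every payload through `send`, paced on the aggregate rate.

    Returns a dict of item_id -> status as reported by `send`.
    """
    pacer = Pacer(rate_per_sec)

    def paced_send(payload):
        pacer.wait()
        return payload["item_id"], send(payload)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(paced_send, payloads))
```

Because the pacer is shared, raising `workers` adds parallelism without raising the request rate; the rate changes only when `rate_per_sec` does. Note that `pool.map` submits the whole iterable up front, so a very large catalogue is better fed to `run_load` in slices.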
Idempotency
A publish payload is an upsert keyed on (site_id, item_id). Reposting the
same payload simply overwrites the stored item with the same content — there is
no duplicate, no version bump visible to the caller, and no downstream
double-counting. This means:
- Retrying a request after a transport error is safe.
- Re-running a partial bulk load against the same export is safe.
- Re-running the bulk load after editing your export (e.g. adding a missing `categories` value) is the supported way to correct catalogue metadata.
This guarantee only holds when the item_id for a given content piece does
not change between runs. Reissuing the same content with a different
item_id does not update the existing record — it creates a new catalogue
entry with no interaction history, leaves the original behind as a dead entry,
and orphans any events already recorded against the original ID. If your CMS
workflow can mint new IDs on republish, fix that workflow before re-running
the bulk load.
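
The difference between a safe repost and a changed `item_id` can be seen in a toy model. This is only a local illustration of the documented upsert semantics using an in-memory dictionary; it says nothing about how the service stores items internally.

```python
# Toy model: the catalogue as a dict keyed on (site_id, item_id).
catalogue = {}


def publish(payload):
    """Upsert: the same key overwrites, a new key creates a new entry."""
    catalogue[(payload["site_id"], payload["item_id"])] = payload


first = {"site_id": "acme-news", "item_id": "ARTICLE-001", "title": "Election results"}

publish(first)
publish(first)                                               # repost: still one entry
publish({**first, "title": "Election results, day three"})   # correction: same key, overwritten
publish({**first, "item_id": "article-001"})                 # changed ID: a second, separate entry
```

After these four calls the model holds two entries: the corrected `ARTICLE-001` and a separate `article-001` with no link to the original.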
Deletes are soft. Sending action: "delete" marks the item as removed; it is
not destroyed, but it is no longer eligible for recommendations. Do not send
deletes during a bulk load of published content. Once an item has been
soft-deleted, do not reuse its item_id for unrelated content: the model cannot
distinguish “the original came back” from “a different item appeared at the
same key,” and will fold both histories together.
Verification
The endpoint returns 202 Accepted as soon as the payload is durably committed
to the ingestion pipeline — not when the item is fully visible to the model.
To verify that a load completed:
- Track per-item acknowledgements client-side. A successful `202` per item, with a per-item count matching the export’s row count, is the canonical “done” signal for the ingestion step.
- Spot-check via the Recommendations API. After the load completes, call `GET /recommend/v1/recommendations` with a test `user_id` against the same `site_id`. A non-empty `recommendations` array confirms items have reached the model. Empty arrays after a known-good bulk load usually indicate a `site_id` mismatch between ingestion and the read call.
- Verify a sample of `item_id`s round-trip cleanly. Pick a handful of IDs from your export, confirm they show up in `recommendations` responses for diverse synthetic users, and confirm each one resolves in your CMS.
- Watch error rates in your client. A bulk load with even a small percentage of `422` errors usually points to a systematic mapping bug in the export; investigate before declaring the load done.
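
The client-side half of this checklist (acknowledgement counts and error rates) can be reduced to one reconciliation step at the end of the run. A minimal sketch, assuming the client has recorded the last HTTP status seen for each `item_id`; `reconcile` and the shape of its report are hypothetical.

```python
from collections import Counter


def reconcile(export_ids, outcomes):
    """Compare the export against client-side results.

    export_ids: iterable of every item_id in the export.
    outcomes:   dict of item_id -> last HTTP status the client saw for it.
    """
    export_ids = set(export_ids)
    acked = {item_id for item_id, status in outcomes.items() if status == 202}
    by_status = Counter(outcomes.values())
    return {
        "missing": sorted(export_ids - acked),      # never acknowledged with 202
        "unexpected": sorted(acked - export_ids),   # acknowledged but not in the export
        "rate_422": by_status[422] / len(outcomes) if outcomes else 0.0,
        "done": acked == export_ids,
    }
```

A report with `done` set to true covers only the ingestion step; the spot-checks through the Recommendations API remain a separate, manual confirmation.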
Error and retry behavior
| Status | Meaning | Caller action |
|---|---|---|
| `202` | The item was durably accepted for processing. | None. |
| `401` | `X-Api-Key` is missing or unknown. | Stop the bulk load. Verify the token and that it was provisioned against the Collector API key collection. |
| `422` | The payload failed validation (bad enum, naive datetime, missing required field, oversize field). | Do not retry as-is. Fix the offending row in your export and resubmit. |
| `5xx` | Transient server-side failure. | Retry the same request with exponential backoff. Cap attempts so a stuck job does not stall the rest of the load. |
A 422 on one item does not stop the rest of the load; each request is
independent. Surface the failed row to your operator with the response body,
which identifies the offending field via detail[*].loc.
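
The table above translates fairly directly into a per-request decision function. The sketch below is one possible shape, not a prescribed client: `send` is a hypothetical callable that returns `(status, parsed_json_body_or_None)`, the attempt cap and delays are arbitrary starting values, and `offending_fields` assumes each `detail` entry carries `loc` as a list of path segments.

```python
import random
import time


def send_with_retry(send, payload, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Apply the status-code policy to one payload.

    Returns ("accepted", None), ("rejected", body) or ("gave_up", body).
    Raises PermissionError on 401 so the caller can stop the whole load.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = send(payload)
        if status == 202:
            return "accepted", None
        if status == 401:
            raise PermissionError("X-Api-Key missing or unknown: stop the bulk load")
        if status == 422:
            return "rejected", body  # do not retry as-is; fix the export row
        if status >= 500 and attempt < max_attempts:
            # Exponential backoff with a little jitter: ~0.5s, ~1s, ~2s, ...
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay))
            continue
        return "gave_up", body  # attempts exhausted, or a status this guide does not list
    return "gave_up", None


def offending_fields(body):
    """Flatten detail[*].loc from a 422 body into dotted field paths."""
    return [".".join(str(part) for part in item["loc"]) for item in body.get("detail", [])]
```

Returning `rejected` and `gave_up` as values rather than raising lets the load continue past a bad row, while the `401` exception halts it, matching the caller actions in the table.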
Request format and examples
The content endpoint accepts a single JSON UpsertContent object per request,
sent with Content-Type: application/json. It does not accept CSV, NDJSON,
or a JSON array — those are export formats you may have on disk, but the wire
format is always one JSON object per POST /collector/v1/content call.
Submit one payload with curl
Save the UpsertContent object as article-001.json:

    {
      "action": "publish",
      "item_id": "ARTICLE-001",
      "site_id": "acme-news",
      "type": "article",
      "timestamp": "2026-04-08T09:00:00+00:00",
      "title": "Election results, day three",
      "categories": ["Politics"],
      "tags": ["election-2026", "results"],
      "author": "Jane Doe",
      "is_premium": false,
      "metadata": {}
    }

POST it with curl, reading the body from the file via `-d @`:

    curl -sS -X POST \
      "https://{org}-config-prod.api.arc-cdn.net/collector/v1/content" \
      -H "Content-Type: application/json" \
      -H "X-Api-Key: $COLLECTOR_API_TOKEN" \
      -d @article-001.json

A successful submission returns:

    HTTP/1.1 202 Accepted
    Content-Length: 0

For one-off smoke tests, pass the body inline with `-d '{...}'` instead of `-d @file`.
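
For a scripted load, the same request can be issued from Python's standard library. This is a minimal sketch rather than an official client: `build_request` and `post_content` are hypothetical names, and `base_url` stands in for your organisation's host (the `https://{org}-config-prod.api.arc-cdn.net` part of the curl example).

```python
import json
import urllib.error
import urllib.request


def build_request(base_url, token, payload):
    """Build the POST /collector/v1/content request for one UpsertContent payload."""
    return urllib.request.Request(
        url=f"{base_url}/collector/v1/content",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Api-Key": token,  # the Collector API token, not the Recommendations one
        },
    )


def post_content(base_url, token, payload, timeout=10):
    """Send one payload and return the HTTP status code (202 on success)."""
    request = build_request(base_url, token, payload)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; hand the status back so the caller can decide.
        return err.code
```

Separating request construction from sending keeps the first half testable without network access; connection-level failures (`urllib.error.URLError`, timeouts) are deliberately left to propagate so the caller's retry logic can handle them.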
Driving the load from common export formats
Whatever shape your export arrives in, the client-side loop is the same: parse
one record, materialise one UpsertContent JSON object, POST it, record the
202. The only thing that changes per format is the parser at the head of the
pipeline:
- NDJSON (one JSON object per line). Stream the file line by line; each line is already a request body. No transformation is needed if the producer writes `UpsertContent`-shaped objects directly.
- JSON array. Stream-parse with `jq -c '.[]'` to emit one object per line, then treat the output as NDJSON.
- CSV. Map each row to an `UpsertContent` object. List-valued fields (`categories`, `tags`) need a delimiter convention in the CSV (typically `;` or `|`) that the client splits into a JSON array before sending.
- Database dump. Page through the source table and project each row into an `UpsertContent` object the same way.
Repeat per record, applying the chunking and concurrency guidance above. Track
the item_ids that returned 202 so you can resume without re-sending the
whole export if the job is interrupted.
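
The parser stage for the two most common file formats, plus the resume filter, might look like the sketch below. It assumes the CSV header already uses payload field names, that list-valued cells are `;`-delimited, and that `is_premium` is written as the text `true` / `false`; all three are conventions you would adapt to your own export, and the function names are hypothetical.

```python
import csv
import json


def from_ndjson(lines):
    """Yield one record per non-blank line of an NDJSON stream."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


def from_csv(lines, list_fields=("categories", "tags"), sep=";"):
    """Yield one record per CSV row, splitting list-valued cells into arrays."""
    for row in csv.DictReader(lines):
        # Drop empty cells so optional fields are omitted rather than sent blank.
        record = {key: value for key, value in row.items() if key and value}
        for field in list_fields:
            if field in record:
                record[field] = [p.strip() for p in record[field].split(sep) if p.strip()]
        if "is_premium" in record:
            # Assumed CSV encoding of the boolean: the text "true" / "false".
            record["is_premium"] = record["is_premium"].strip().lower() == "true"
        yield record


def skip_acknowledged(records, acked_ids):
    """Filter out records whose item_id already returned 202 in an earlier run."""
    return (record for record in records if record["item_id"] not in acked_ids)
```

Both parsers are generators over an iterable of lines, so an open file object can be passed directly and the export is never held in memory as a whole.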
When not to use this endpoint
- Ongoing publishes, updates, and deletes. Once the back catalogue is loaded, wire the live CMS webhook (or the IFX Recipe for Arc XP CMS customers) so changes propagate within seconds. Re-running the bulk load is not a substitute for the live sync.
- User-behavior events. Use `POST /collector/v1/events` or `POST /collector/v1/events/batch`. See the Content Recommendations Batch API guide.
- Tombstoning items deleted in the source CMS. Bulk load is insert-or-upsert only. To remove items, send `action: "delete"` payloads explicitly.