Content Recommendations Content Collector Bulk Load Guide

This guide explains how to seed the Content Recommendations catalogue from an existing CMS export — without waiting for a live CMS webhook or IFX integration to backfill it organically.

The Collector API does not expose a dedicated batch endpoint for content. “Bulk load” means streaming each item from your export through the single-item content endpoint, POST /collector/v1/content, with appropriate pacing, idempotency, and verification.

When to use bulk load vs IFX streaming

| Use bulk load when… | Use IFX streaming when… |
| --- | --- |
| You are onboarding to Content Recommendations and need to seed your existing back catalogue. | You are wiring the ongoing, live sync from your CMS to Content Recommendations. |
| You are re-importing after a CMS migration, a content model change, or a catalogue cleanup. | You want every publish, update, and unpublish to reach Content Recommendations within seconds, automatically. |
| You have a one-shot export (CSV / JSON / a database dump) and need to push it in. | You are an Arc XP CMS customer and the IFX Recipe is already an option. |

The two paths are complementary, not exclusive. The typical onboarding flow is: bulk-load the back catalogue once, then wire IFX (or your own webhook) for ongoing changes. See the Content Recommendations Onboarding Checklist for the full sequence.

Prerequisites

  • Content Recommendations service provisioned.
  • A Collector API Delivery API token. Provision it from its own key collection, separate from your Recommendations API token. See Provisioning tokens through Delivery API.
  • An export of your published catalogue. Deleted, unpublished, and draft items must not be sent.

Authentication

Pass your Collector API Delivery API token in the X-Api-Key header on every request:

X-Api-Key: <your-collector-delivery-api-token>

Use the Collector API token, not the Recommendations API token. The two are provisioned from separate key collections so a token exposed in one context cannot be used against the other API.

Endpoint summary

| Field | Value |
| --- | --- |
| Method + path | POST /collector/v1/content |
| Request body | A single JSON UpsertContent payload, one item per request. |
| Success response | 202 Accepted with empty body. |
| Validation error | 422 Unprocessable Entity with HTTPValidationError body. |
| Auth error | 401 Unauthorized if X-Api-Key is missing or unknown. |
| Semantics | action: "publish" is an upsert on (site_id, item_id). Reposting the same payload is safe. |

See the POST /collector/v1/content reference for the full schema, including all optional fields.

Input format and field mapping

Every row in your export maps to one UpsertContent payload (action: "publish"). At minimum, each payload needs:

| Payload field | What it is | Where it comes from in a typical CMS export |
| --- | --- | --- |
| action | Always "publish" for bulk load. | Constant. |
| item_id | Stable identifier for the content item. | The CMS’s primary identifier for the item (e.g. an ArcID). |
| site_id | The website / property the item belongs to. | The site or brand the item is published under. |
| type | One of article, video, podcast. | The item’s content type. |
| timestamp | Publication date, ISO 8601 with explicit offset. | The item’s published-at timestamp, not the export time. |
| title | Display title. | The item’s headline / title field. |
| categories, tags, author, is_premium, metadata | Optional but recommended. | Whatever taxonomy and attribution fields your export carries. |
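
The mapping in the table above can be sketched as a small Python function. This is a minimal illustration, not a reference implementation: the export column names (id, site, content_type, published_at, headline), the ";" list delimiter, and the choice to treat offset-less timestamps as UTC are all assumptions about a hypothetical export, so adjust them to what your CMS actually emits.

```python
from datetime import datetime, timezone

ALLOWED_TYPES = {"article", "video", "podcast"}


def row_to_payload(row, list_delimiter=";"):
    """Map one export row (hypothetical column names) to an UpsertContent dict."""
    published = datetime.fromisoformat(row["published_at"])
    if published.tzinfo is None:
        # Assumption: timestamps without an offset in this export are UTC.
        published = published.replace(tzinfo=timezone.utc)

    content_type = row["content_type"].strip().lower()
    if content_type not in ALLOWED_TYPES:
        raise ValueError(f"unsupported type {content_type!r} for item {row['id']!r}")

    payload = {
        "action": "publish",
        "item_id": row["id"],  # passed through untouched: no case or punctuation changes
        "site_id": row["site"],
        "type": content_type,
        "timestamp": published.isoformat(),  # published-at time, not export time
        "title": row["headline"],
    }
    # Optional list-valued fields: split only when the export carries a value.
    for field in ("categories", "tags"):
        raw = row.get(field) or ""
        values = [v.strip() for v in raw.split(list_delimiter) if v.strip()]
        if values:
            payload[field] = values
    return payload
```

Raising on an unknown type locally surfaces a mapping bug before it turns into a stream of 422 responses.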

item_id rules to settle before you start the load

A bulk load is the moment you commit to your item_id scheme. Every interaction you eventually record will be keyed against the IDs you choose here, and there is no retroactive rebind if you change schemes later. Confirm the following before sending the first request:

  • Match what the live path will use. Whichever item_id your live CMS webhook (or IFX integration) will send for a given item after the bulk load is exactly what the bulk load must send for the same item now. A mismatch orphans the bulk-loaded record the first time that item updates through the live path.
  • Pick one source per content piece and never change it. Treat the item_id as a primary key, not a label. Common choices are the CMS’s internal record ID (stable across slug edits, re-categorisation, and SEO rewrites) or the canonical URL slug (human-readable, but fragile if editors rewrite it).
  • Comparison is byte-for-byte. ABC123, abc123, abc-123, and abc_123 are four different items. Decide on a casing and punctuation convention up front and apply it consistently — silently normalising in the bulk load and not in the live webhook (or vice versa) breaks the join.
  • Length is enforced. The API rejects an empty item_id, or one longer than 256 characters, with 422. Aim for ≤ 64 characters so IDs stay readable in logs and dashboards.
  • No PII. Item IDs surface in operational logs. Do not embed personal data, internal user IDs, or any directly identifying value.
  • Same identifier on both sides of the join. The item_id on every content payload here is the same identifier your behavioral-event producers will send on every POST /collector/v1/events / POST /collector/v1/events/batch. A mismatch silently breaks personalisation — events cannot contribute to a catalogue entry the model has not seen.
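
Two of the rules above (length limits and byte-for-byte comparison) can be checked locally before the first request goes out. The sketch below is illustrative only: the function names are hypothetical, len() counts Unicode code points (an assumption about how the API measures length), and the collision key covers only the case and "-"/"_" variants mentioned above.

```python
from collections import defaultdict


def check_item_id(item_id, soft_limit=64):
    """Return a list of problems; an empty list means the ID passes these local checks."""
    problems = []
    if not 1 <= len(item_id) <= 256:
        problems.append("length outside 1-256: the API would reject this with 422")
    elif len(item_id) > soft_limit:
        problems.append(f"longer than {soft_limit} characters: accepted, but hard to read in logs")
    return problems


def find_near_duplicates(item_ids):
    """Group IDs that differ only by case or by '-'/'_' punctuation.

    The API treats every member of a group as a separate item, so a non-empty
    result usually means the export mixes two ID conventions.
    """
    groups = defaultdict(set)
    for item_id in item_ids:
        key = item_id.lower().replace("-", "").replace("_", "")
        groups[key].add(item_id)
    return {key: sorted(ids) for key, ids in groups.items() if len(ids) > 1}
```

Running find_near_duplicates over the whole export, and again over a sample of IDs from the live path, is one inexpensive way to catch a convention mismatch early.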

See the Critical requirements and Content endpoint sections of the Content Recommendations API Developer Guide for the full field-level rules and the complete optional-field list.

Chunking and submitting a backlog

The content endpoint takes one item per request, so “bulk” is a client-side concern: how many items per second you push, and how many you push in parallel.

  • Pace on aggregate throughput, not per-request. Whether you run the load as one process or many parallel workers, Content Recommendations sees the sum of your request rate, and that aggregate is the only number that matters. Per-request 202s confirm the payload was queued — they do not confirm the downstream pipeline is keeping up. Start conservative (a few tens of requests per second across all workers is a reasonable opener for an initial backfill), watch for 5xx responses and growing latency, and ramp only when you have evidence of headroom.
  • Use modest concurrency. A small number of concurrent workers (for example, 4–16) per export job is usually enough to saturate ingest capacity without overwhelming downstream processing. Increase only if you observe headroom; back off on 5xx.
  • Stable ordering does not matter. Items can be sent in any order. The upsert keys on (site_id, item_id), not on send order.
  • Plan for resumability. Track which item_ids have been successfully acknowledged so a partially-completed run can resume without re-sending the whole catalogue. Resending is safe (see Idempotency below), but skipping already-sent items shortens recovery time.
  • Backfills should run before any user-facing recommendations are exposed. See the Pre-launch phase of the Onboarding Checklist.
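
One way to pace on aggregate throughput is to share a single pacer object across all workers, so the combined request rate stays under one ceiling regardless of concurrency. The Pacer class below is a hypothetical helper, sketched under the assumption that workers are threads in one process; a multi-process load would need a shared limiter instead.

```python
import threading
import time


class Pacer:
    """Hand out evenly spaced send slots so all threads together stay under a ceiling."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = clock()

    def wait(self):
        """Block until this caller's slot arrives; call once before each POST."""
        with self._lock:
            now = self._clock()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self._interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)
```

A load with, say, 8 worker threads would create one Pacer(max_per_second=20) and have every worker call pacer.wait() immediately before each request. Those numbers are placeholders within the ranges suggested above, not recommended settings.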

Idempotency

A publish payload is an upsert keyed on (site_id, item_id). Reposting the same payload simply overwrites the stored item with the same content — there is no duplicate, no version bump visible to the caller, and no downstream double-counting. This means:

  • Retrying a request after a transport error is safe.
  • Re-running a partial bulk load against the same export is safe.
  • Re-running the bulk load after editing your export (e.g. adding a missing categories value) is the supported way to correct catalogue metadata.

This guarantee only holds when the item_id for a given content piece does not change between runs. Reissuing the same content with a different item_id does not update the existing record — it creates a new catalogue entry with no interaction history, leaves the original behind as a dead entry, and orphans any events already recorded against the original ID. If your CMS workflow can mint new IDs on republish, fix that workflow before re-running the bulk load.

Deletes are soft. Sending action: "delete" marks the item as removed; the record is not destroyed, but it is no longer eligible for recommendations. Do not send deletes during a bulk load of published content. Once an item has been soft-deleted, do not reuse its item_id for unrelated content: the model cannot distinguish “the original came back” from “a different item appeared at the same key,” and will fold both histories together.

Verification

The endpoint returns 202 Accepted as soon as the payload is durably committed to the ingestion pipeline — not when the item is fully visible to the model. To verify that a load completed:

  • Track per-item acknowledgements client-side. A successful 202 per item, with a per-item count matching the export’s row count, is the canonical “done” signal for the ingestion step.
  • Spot-check via the Recommendations API. After the load completes, call GET /recommend/v1/recommendations with a test user_id against the same site_id. A non-empty recommendations array confirms items have reached the model. Empty arrays after a known-good bulk load usually indicate a site_id mismatch between ingestion and the read call.
  • Verify a sample of item_ids round-trip cleanly. Pick a handful of IDs from your export, confirm they show up in recommendations responses for diverse synthetic users, and confirm each one resolves in your CMS.
  • Watch error rates in your client. A bulk load with even a small percentage of 422 errors usually points to a systematic mapping bug in the export — investigate before declaring the load done.
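
The client-side part of these checks reduces to reconciling recorded statuses against the export's row count. A minimal sketch, assuming the client keeps a mapping from each item to the final HTTP status it received (the function name and the shape of that mapping are illustrative):

```python
from collections import Counter


def summarise_load(export_row_count, final_statuses):
    """Reconcile per-item final statuses against the number of rows in the export."""
    counts = Counter(final_statuses.values())
    accepted = counts.get(202, 0)
    return {
        "accepted": accepted,
        "rejected_422": counts.get(422, 0),
        "never_sent": export_row_count - len(final_statuses),
        "complete": accepted == export_row_count,
    }
```

A result with complete set to False, or any non-zero rejected_422, is the cue to investigate before moving on to the Recommendations API spot-check.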

Error and retry behavior

| Status | Meaning | Caller action |
| --- | --- | --- |
| 202 | The item was durably accepted for processing. | None. |
| 401 | X-Api-Key is missing or unknown. | Stop the bulk load. Verify the token and that it was provisioned against the Collector API key collection. |
| 422 | The payload failed validation (bad enum, naive datetime, missing required field, oversize field). | Do not retry as-is. Fix the offending row in your export and resubmit. |
| 5xx | Transient server-side failure. | Retry the same request with exponential backoff. Cap attempts so a stuck job does not stall the rest of the load. |

A 422 on one item does not stop the rest of the load; each request is independent. Surface the failed row to your operator with the response body, which identifies the offending field via detail[*].loc.
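
The status handling above can be sketched as follows. The send argument is a caller-supplied function returning (status, body), so the retry logic stays independent of any HTTP library. The attempt cap, base delay, and jitter are illustrative values, and offending_fields assumes the 422 body parses to {"detail": [{"loc": [...], ...}]}, which matches the detail[*].loc reference above but should be checked against the API reference.

```python
import random
import time


def classify(status):
    """Translate an HTTP status into the caller action from the table above."""
    if status == 202:
        return "done"
    if status == 401:
        return "abort"      # stop the whole load and check the token
    if status == 422:
        return "fix_row"    # do not retry as-is
    if 500 <= status <= 599:
        return "retry"
    return "unexpected"


def send_with_retry(send, payload, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry 5xx responses with exponential backoff plus jitter, up to max_attempts."""
    for attempt in range(1, max_attempts + 1):
        status, body = send(payload)
        if classify(status) != "retry" or attempt == max_attempts:
            return status, body
        delay = base_delay * 2 ** (attempt - 1)
        sleep(delay + random.uniform(0, delay))
    raise ValueError("max_attempts must be at least 1")


def offending_fields(error_body):
    """Flatten detail[*].loc from a parsed 422 body into dotted field paths."""
    return [".".join(str(part) for part in err.get("loc", []))
            for err in error_body.get("detail", [])]
```

Transport errors that never produce a status (timeouts, connection resets) are not handled here. Because a publish is an upsert, wrapping send so that such errors also trigger a retry is safe.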

Request format and examples

The content endpoint accepts a single JSON UpsertContent object per request, sent with Content-Type: application/json. It does not accept CSV, NDJSON, or a JSON array — those are export formats you may have on disk, but the wire format is always one JSON object per POST /collector/v1/content call.

Submit one payload with curl

Save the UpsertContent object as article-001.json:

{
  "action": "publish",
  "item_id": "ARTICLE-001",
  "site_id": "acme-news",
  "type": "article",
  "timestamp": "2026-04-08T09:00:00+00:00",
  "title": "Election results, day three",
  "categories": ["Politics"],
  "tags": ["election-2026", "results"],
  "author": "Jane Doe",
  "is_premium": false,
  "metadata": {}
}

POST it with curl, reading the body from the file via -d @:

curl -sS -X POST \
  "https://{org}-config-prod.api.arc-cdn.net/collector/v1/content" \
  -H "Content-Type: application/json" \
  -H "X-Api-Key: $COLLECTOR_API_TOKEN" \
  -d @article-001.json

A successful submission returns:

HTTP/1.1 202 Accepted
Content-Length: 0

For one-off smoke tests, pass the body inline with -d '{...}' instead of -d @file.
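
For a scripted load, the same request can be built with Python's standard library. This is a sketch rather than an official client: build_request and post_content are hypothetical helper names, and base_url stands for the https://{org}-config-prod.api.arc-cdn.net host shown in the curl example, with your own organisation substituted.

```python
import json
import urllib.error
import urllib.request


def build_request(base_url, token, payload):
    """Build the POST /collector/v1/content request for one UpsertContent dict."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/collector/v1/content",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json", "X-Api-Key": token},
    )


def post_content(request, timeout=10.0):
    """Send the request and return (status, body bytes), including for 4xx/5xx."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # urllib raises on non-2xx; convert back to a plain status for the caller.
        return err.code, err.read()
```

Returning the status for error responses, instead of letting the exception propagate, keeps 401, 422, and 5xx handling in one place in the calling loop.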

Driving the load from common export formats

Whatever shape your export arrives in, the client-side loop is the same: parse one record, materialise one UpsertContent JSON object, POST it, record the 202. The only thing that changes per format is the parser at the head of the pipeline:

  • NDJSON (one JSON object per line). Stream the file line by line; each line is already a request body. No transformation is needed if the producer writes UpsertContent-shaped objects directly.
  • JSON array. Stream-parse with jq -c '.[]' to emit one object per line, then treat the output as NDJSON.
  • CSV. Map each row to an UpsertContent object. List-valued fields (categories, tags) need a delimiter convention in the CSV — typically ; or | — that the client splits into a JSON array before sending.
  • Database dump. Page through the source table and project each row into an UpsertContent object the same way.

Repeat per record, applying the chunking and concurrency guidance above. Track the item_ids that returned 202 so you can resume without re-sending the whole export if the job is interrupted.
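
The parser stage and the resume filter can be sketched as generators, so the export is never held in memory all at once. The names below are hypothetical. The acked set is keyed on (site_id, item_id) to match the upsert key, and how you persist it between runs (a file, a table) is left open.

```python
import csv
import json


def iter_ndjson(lines):
    """Yield one dict per non-blank line; each line is already a JSON object."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


def iter_csv_rows(lines):
    """Yield one dict per CSV row, keyed by header; rows still need field mapping."""
    yield from csv.DictReader(lines)


def pending(records, acked):
    """Skip records whose (site_id, item_id) already returned 202 in an earlier run."""
    for record in records:
        if (record["site_id"], record["item_id"]) not in acked:
            yield record
```

Both parsers accept any iterable of text lines, including an open file object, so a resumed run might look like pending(iter_ndjson(open_file), acked) feeding the send loop.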

When not to use this endpoint

  • Ongoing publishes, updates, and deletes. Once the back catalogue is loaded, wire the live CMS webhook (or the IFX Recipe for Arc XP CMS customers) so changes propagate within seconds. Re-running the bulk load is not a substitute for the live sync.
  • User-behavior events. Use POST /collector/v1/events or POST /collector/v1/events/batch. See the Content Recommendations Batch API guide.
  • Tombstoning items deleted in the source CMS. Bulk load only upserts; it never removes anything. To remove items, send action: "delete" payloads explicitly.