Text-to-Speech

The Arc XP Audio API can generate spoken audio from text input using text-to-speech (TTS). This is useful for producing narrated versions of articles, accessibility audio, or any content where you need synthesized speech.

Prerequisites

An Arc XP API token (see the Developer Center).
At least one voice configured in your organization’s TTS settings (step 1 lists the current voices; Configure Voices, further down, changes them).

1. List Available Voices

The Audio API comes with some preset voices configured. To see which voices are available to your organization, for example before changing the configuration yourself, list them:

curl -H "Authorization: Bearer YOUR_API_TOKEN" \
  https://api.[org].arcpublishing.com/audiocenter/api/editorial/v1/settings/voices

The response includes metadata for each voice:

{
  "voices": [
    {
      "id": "voice_abc123",
      "name": "Rachel",
      "use_case": "narration",
      "gender": "female",
      "accent": "american",
      "age": "young",
      "description": "A clear, warm voice ideal for news narration.",
      "preview_url": "https://...",
      "supported_languages": ["en", "es", "fr"]
    }
  ]
}

Note the id field; you pass it as voice_id when previewing a voice or generating speech.
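When several voices are configured, it can help to narrow the list before auditioning any of them. The following is a minimal Python sketch that filters a parsed response by language code. It assumes the response has the shape shown above; the helper name voices_for_language is hypothetical and not part of the Audio API.

```python
from __future__ import annotations

from typing import Any


def voices_for_language(response: dict[str, Any], language: str) -> list[dict[str, Any]]:
    """Return the voices whose supported_languages include the given code."""
    return [
        voice
        for voice in response.get("voices", [])
        if language in voice.get("supported_languages", [])
    ]


# Sample data trimmed from the response shown above.
sample = {
    "voices": [
        {
            "id": "voice_abc123",
            "name": "Rachel",
            "use_case": "narration",
            "supported_languages": ["en", "es", "fr"],
        }
    ]
}

spanish_voices = voices_for_language(sample, "es")
print([voice["id"] for voice in spanish_voices])  # ['voice_abc123']
```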

2. Preview a Voice

Before creating a full audio clip, you can generate a short preview (~10 seconds) to audition a voice. No audio clip record is created.

curl -X POST https://api.[org].arcpublishing.com/audiocenter/api/editorial/v1/tts/preview \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "input_text": "This is a short preview of what this voice sounds like.",
    "voice_id": "voice_abc123"
  }'

The response contains a uri pointing to the generated preview audio file:

{
  "uri": "https://..."
}
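For scripts that do not shell out to curl, the same preview request can be assembled with Python's standard library. This is a sketch under stated assumptions: the org value "myorg" is a placeholder for your own [org] segment, build_preview_request is a hypothetical helper name, and the send step is left commented out so the example runs offline.

```python
import json
import urllib.request


def build_preview_request(org: str, token: str, text: str, voice_id: str) -> urllib.request.Request:
    """Build (but do not send) the POST /tts/preview request shown above."""
    url = f"https://api.{org}.arcpublishing.com/audiocenter/api/editorial/v1/tts/preview"
    body = json.dumps({"input_text": text, "voice_id": voice_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


request = build_preview_request(
    org="myorg",  # placeholder, substitute your own organization
    token="YOUR_API_TOKEN",
    text="This is a short preview of what this voice sounds like.",
    voice_id="voice_abc123",
)

# To actually send it (requires network access and a valid token):
# with urllib.request.urlopen(request) as resp:
#     preview_uri = json.load(resp)["uri"]
```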

3. Generate a Full Audio Clip

Once you’ve chosen a voice, create an audio clip with TTS in a single request. This creates the clip record and starts speech synthesis:

curl -X POST https://api.[org].arcpublishing.com/audiocenter/api/editorial/v1/clips/tts \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Article Narration: Breaking News Story",
    "description": "TTS narration of the breaking news article.",
    "tags": ["narration", "news"],
    "input_text": "The full text of the article you want narrated goes here. It can be up to 30,000 characters.",
    "voice_id": "voice_abc123"
  }'

The API returns 202 Accepted with a Location header and the new clip’s ID in the response body:

{
  "content_id": "abc123def456"
}

Use that returned content_id as the {audio_id} placeholder in later clip endpoints, including publish.

Monitoring Progress with SSE

TTS generation is asynchronous. To stream real-time progress updates, add the Accept: text/event-stream header to the same POST /clips/tts request:

curl -X POST https://api.[org].arcpublishing.com/audiocenter/api/editorial/v1/clips/tts \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Accept: text/event-stream" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Article Narration",
    "input_text": "The full text goes here...",
    "voice_id": "voice_abc123"
  }'

The SSE stream emits progress notifications such as tts_started, tts_generating_speech, encoding_started, and encoding_complete (or tts_failed / encoding_failed).
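A client consuming this stream needs to split it into events and decide when to stop waiting. The sketch below parses standard Server-Sent Events framing (event: and data: fields, blank line as separator) and reports an overall status. Two points are assumptions rather than documented behavior: that the notification names arrive in the event: field, and that encoding_complete is the last success notification. The sample stream is invented for illustration; check both assumptions against real output.

```python
from __future__ import annotations

from typing import Iterable, Iterator

# Notification names taken from the text above.
FAILURE_EVENTS = {"tts_failed", "encoding_failed"}
SUCCESS_EVENT = "encoding_complete"  # assumed to be the final success notification


def parse_sse(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (event_name, data) pairs from decoded SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Blank line ends an event. Unlike a browser EventSource, this
            # also yields named events that carry no data.
            if data or event != "message":
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # SSE framing strips one leading space
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


def final_status(events: Iterable[tuple[str, str]]) -> str:
    """Return 'complete', 'failed', or 'pending' for the events seen so far."""
    for name, _ in events:
        if name in FAILURE_EVENTS:
            return "failed"
        if name == SUCCESS_EVENT:
            return "complete"
    return "pending"


# Invented sample stream; real payloads may differ.
stream = [
    "event: tts_started\n", "data: {}\n", "\n",
    "event: tts_generating_speech\n", "data: {}\n", "\n",
    "event: encoding_started\n", "data: {}\n", "\n",
    "event: encoding_complete\n", "data: {}\n", "\n",
]
print(final_status(parse_sse(stream)))  # complete
```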

4. Publish the Clip

Once the clip reaches the READY state, publish it to make it available for delivery. Replace {audio_id} with the content_id returned in step 3:

curl -X POST https://api.[org].arcpublishing.com/audiocenter/api/editorial/v1/clips/{audio_id}/publish \
  -H "Authorization: Bearer YOUR_API_TOKEN"
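To connect step 3 to this request in a script, the publish URL can be derived from the create response. This is a small sketch; publish_url is a hypothetical helper, "myorg" stands in for your [org] segment, and the response body is the sample from step 3.

```python
from urllib.parse import quote

BASE = "https://api.{org}.arcpublishing.com/audiocenter/api/editorial/v1"


def publish_url(org: str, create_response: dict) -> str:
    """Build the publish endpoint from the body returned by POST /clips/tts."""
    audio_id = create_response["content_id"]
    # Percent-encode the ID defensively before placing it in the path.
    return f"{BASE.format(org=org)}/clips/{quote(audio_id, safe='')}/publish"


print(publish_url("myorg", {"content_id": "abc123def456"}))
# https://api.myorg.arcpublishing.com/audiocenter/api/editorial/v1/clips/abc123def456/publish
```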

Pronunciation Dictionaries

If your content includes names, technical terms, or brand names that the TTS engine mispronounces, you can configure a pronunciation dictionary at the organization level.

Pronunciation rules use IPA (International Phonetic Alphabet) phonemes to define how specific words should be pronounced.

Configure Pronunciation Rules

curl -X PATCH https://api.[org].arcpublishing.com/audiocenter/api/editorial/v1/settings/ \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "tts_settings": {
      "pronunciation_dictionary": {
        "rules": [
          {
            "grapheme": ["Arc XP", "ArcXP"],
            "phoneme": "ɑːrk ɛks piː"
          },
          {
            "grapheme": ["GIF"],
            "phoneme": "ɡɪf"
          }
        ]
      }
    }
  }'

Each rule maps one or more grapheme strings (the text as written) to a phoneme (how it should be spoken). The dictionary is applied automatically to all future TTS operations for your organization.
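If pronunciation rules are maintained in a spreadsheet or config file, the PATCH body can be generated rather than written by hand. The sketch below builds the structure shown above from (graphemes, phoneme) pairs. The helper name pronunciation_patch and its rejection of empty entries are assumptions of this example, not documented API behavior.

```python
import json


def pronunciation_patch(entries):
    """Build the settings PATCH body from (graphemes, phoneme) pairs."""
    rules = []
    for graphemes, phoneme in entries:
        cleaned = [g.strip() for g in graphemes if g.strip()]
        if not cleaned or not phoneme.strip():
            raise ValueError("each rule needs at least one grapheme and a phoneme")
        rules.append({"grapheme": cleaned, "phoneme": phoneme.strip()})
    return {"tts_settings": {"pronunciation_dictionary": {"rules": rules}}}


body = pronunciation_patch([
    (["Arc XP", "ArcXP"], "ɑːrk ɛks piː"),
    (["GIF"], "ɡɪf"),
])

# ensure_ascii=False keeps the IPA characters readable; the default
# \u-escaped output is equally valid JSON.
print(json.dumps(body, ensure_ascii=False, indent=2))
```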

Configure Voices

You can also set which voices are available to your organization through the settings endpoint:

curl -X PATCH https://api.[org].arcpublishing.com/audiocenter/api/editorial/v1/settings/ \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "tts_settings": {
      "voices": [
        { "voice_id": "voice_abc123" },
        { "voice_id": "voice_def456" }
      ]
    }
  }'
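The voices body can be generated the same way, for instance from IDs collected in step 1. This is a sketch only: voices_patch is a hypothetical helper, and whether the PATCH replaces or merges the existing voice list is not stated here, so verify that against your own settings before relying on it.

```python
def voices_patch(voice_ids):
    """Build the settings PATCH body for a list of voice IDs."""
    unique_ids = list(dict.fromkeys(voice_ids))  # drop duplicates, keep order
    return {"tts_settings": {"voices": [{"voice_id": vid} for vid in unique_ids]}}


print(voices_patch(["voice_abc123", "voice_def456", "voice_abc123"]))
```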

TTS Fields Reference

Create TTS Clip (`POST /clips/tts`)

Field	Required	Description
`title`	Yes	Clip title (3–500 characters).
`input_text`	Yes	The text to synthesize (up to 30,000 characters).
`voice_id`	Yes	ID of the voice to use (from `/settings/voices`).
`description`	No	Clip description (up to 4,000 characters).
`tags`	No	Tags for organization and filtering.
`circulation`	No	Site ownership details.

Preview Voice (`POST /tts/preview`)

Field	Required	Description
`input_text`	Yes	Sample text to synthesize (up to 500 characters, ~10 seconds).
`voice_id`	Yes	ID of the voice to audition.
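The limits in the two tables above can be checked client-side before a request is sent, which avoids a round trip for obviously invalid payloads. The sketch below is one possible approach with hypothetical function names. It counts characters with Python's len(), which counts Unicode code points; how the API itself counts characters is an assumption.

```python
def validate_clip_payload(payload: dict) -> list:
    """Return a list of problems for a POST /clips/tts body (empty if none)."""
    errors = []
    for field in ("title", "input_text", "voice_id"):
        if not payload.get(field):
            errors.append(f"{field} is required")
    title = payload.get("title") or ""
    if title and not 3 <= len(title) <= 500:
        errors.append("title must be 3-500 characters")
    if len(payload.get("input_text") or "") > 30_000:
        errors.append("input_text must be at most 30,000 characters")
    if len(payload.get("description") or "") > 4_000:
        errors.append("description must be at most 4,000 characters")
    return errors


def validate_preview_payload(payload: dict) -> list:
    """Return a list of problems for a POST /tts/preview body (empty if none)."""
    errors = []
    for field in ("input_text", "voice_id"):
        if not payload.get(field):
            errors.append(f"{field} is required")
    if len(payload.get("input_text") or "") > 500:
        errors.append("input_text must be at most 500 characters")
    return errors


clip = {
    "title": "Article Narration",
    "input_text": "The full text goes here...",
    "voice_id": "voice_abc123",
}
print(validate_clip_payload(clip))  # []
print(validate_clip_payload({**clip, "title": "ab"}))  # ['title must be 3-500 characters']
```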