Large Sitemap Indexes in Outbound Feeds
Outbound Feeds (OBF) has two different sitemap index blocks: sitemap index and sitemap index by day. Both provide a way to serve more than 100 articles by using the Sitemap Index format to link to multiple sitemaps. The sitemap index block is best used with smaller sets of results, fewer than 10,000 records. The sitemap index by day block breaks results into one link per day. This works with small or large sets of results, but it requires more configuration and results in more (but smaller and faster) queries.
Elasticsearch
Content API, the system that provides Arc with its search ability, is built on Elasticsearch (ES). We are migrating all clients to ES version 7, which has a 10,000 record limit for search results. This limit prevents the sitemap index from showing links beyond the first 10,000 records. For example, if you have one million articles and search content-api for all stories, the old version of ES returns a record count of 1,000,000. With ES7, the same query returns a record count of 10,000: once ES7 reaches 10,000 it stops counting and returns, saving time and resources. ES7 also returns an error if you send a request with an offset (?from=10000) greater than or equal to 10,000. Because offsets are zero-based, an offset of 10,000 is the 10,001st record.
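The effect of the limit can be sketched in a few lines of Python. This is an illustration only: ES7_MAX_RESULTS, reported_count, and check_offset are hypothetical names and are not part of Content API, and the real service responds with an error rather than raising a Python exception.

```python
ES7_MAX_RESULTS = 10_000  # record limit described above


def reported_count(actual_matches: int) -> int:
    """Record count a query reports once counting stops at the limit."""
    return min(actual_matches, ES7_MAX_RESULTS)


def check_offset(offset: int) -> int:
    """Reject a zero-based `from` offset that falls outside the limit."""
    if offset < 0 or offset >= ES7_MAX_RESULTS:
        raise ValueError(f"offset {offset} is outside 0..{ES7_MAX_RESULTS - 1}")
    return offset
```

Under these assumptions, a site with one million stories reports 10,000 records, an offset of 9999 (the 10,000th record) is accepted, and an offset of 10000 is rejected.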
Sitemap Index
If your OBF setup was created prior to May 2021 you should have a template named sitemap index. If you do not, you can find instructions on setting up a sitemap index here. The sitemap index uses the record count from the query, which by default covers the last two days of published stories. As long as the record count is less than 10,000, this format will work. For example, if you publish 500 articles a day and use the default query of two days, you should average 1,000 records, well under the 10,000 limit. If you wanted to return 30 days' worth of articles, that would average around 15,000 records. In that case you would have to use sitemap index by day and the sitemap-with-dayX content sources to support your large sitemap index. If your sitemap-index returns 7 days' worth of content or more, we recommend using sitemap index by day, as it is more performant.
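The sizing arithmetic above can be written as a small helper. This is a rough sketch, not an OBF feature: estimated_records and suggested_block are hypothetical names, and the estimate assumes a steady publishing rate.

```python
ES7_MAX_RESULTS = 10_000  # record limit described in the Elasticsearch section


def estimated_records(articles_per_day: int, days: int) -> int:
    """Average record count for a query that covers `days` days."""
    return articles_per_day * days


def suggested_block(articles_per_day: int, days: int) -> str:
    """Apply the two rules of thumb from the text to pick a block."""
    too_many = estimated_records(articles_per_day, days) >= ES7_MAX_RESULTS
    if too_many or days >= 7:
        return "sitemap index by day"
    return "sitemap index"
```

With 500 articles a day, a two-day query averages 1,000 records and stays on sitemap index, while a 30-day query averages 15,000 records and needs sitemap index by day.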
Sitemap Index By Day
If your OBF setup was created after May 2021 you should have a page named sitemap index by day. If you do not, you can find instructions on setting up a sitemap index by day here. The sitemap index by day generates one link per day, starting with the current day and going back the number of days set in the Custom Field Number of days to include, which defaults to 2. If you want it to use a specific end date instead, you can enter a date in YYYY-MM-DD format. The feed does not perform any queries; it just generates links to sitemap/YYYY-MM-DD, starting with the current date and working backwards for the number of days you specified.
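A minimal sketch of that link generation, assuming the end date is inclusive and is the oldest day listed. The function name sitemap_day_paths and its parameters are hypothetical; the actual block is configured through the Custom Field, not called like this.

```python
from datetime import date, timedelta
from typing import List, Union


def sitemap_day_paths(today: date, days_or_end: Union[int, str] = 2) -> List[str]:
    """Return one sitemap/YYYY-MM-DD path per day, newest first.

    `days_or_end` is either a number of days or an end date in
    YYYY-MM-DD format (assumed here to be included in the result).
    """
    if isinstance(days_or_end, int):
        count = days_or_end
    else:
        end = date.fromisoformat(days_or_end)
        count = (today - end).days + 1
    return [
        f"sitemap/{(today - timedelta(days=i)).isoformat()}"
        for i in range(max(count, 0))
    ]
```

No query is involved: the list depends only on the current date and the configured value, which is why this block stays fast for any archive size.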
Days with large results
Each link to sitemap/YYYY-MM-DD will attempt to return all articles for that day. It can handle around 1,500 records before it times out. If a date has a large number of records, for example because of a data migration, you can configure that date using the Dates with large results customField. Dates set up like this will paginate results into groups of 1,000. So if the day 2020-10-28 has 2,347 articles, you would use 3 (2,347 divided by 1,000, rounded up). That will cause the sitemap-index to generate 3 links for that day, each returning up to 1,000 records.
<sitemap>
  <loc>http://demo-prod.origin.arcpublishing.com/arc/outboundfeeds/sitemap/2020-10-28-1?outputType=xml</loc>
</sitemap>
<sitemap>
  <loc>http://demo-prod.origin.arcpublishing.com/arc/outboundfeeds/sitemap2/2020-10-28-2?outputType=xml</loc>
</sitemap>
<sitemap>
  <loc>http://demo-prod.origin.arcpublishing.com/arc/outboundfeeds/sitemap2/2020-10-28-3?outputType=xml</loc>
</sitemap>
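The page count for a large day is the article count divided by the page size, rounded up. The sketch below shows that calculation and the numbered suffixes seen in the links above. The names PAGE_SIZE, pages_needed, and paginated_day_suffixes are hypothetical, and the URL prefix is left out on purpose because it depends on your resolver setup.

```python
import math
from typing import List

PAGE_SIZE = 1000  # results per paginated link, as described above


def pages_needed(article_count: int) -> int:
    """Number of paginated links needed to cover a day's articles."""
    return max(1, math.ceil(article_count / PAGE_SIZE))


def paginated_day_suffixes(day: str, article_count: int) -> List[str]:
    """Build YYYY-MM-DD-N suffixes, numbered from 1."""
    return [f"{day}-{page}" for page in range(1, pages_needed(article_count) + 1)]
```

For 2020-10-28 with 2,347 articles this gives three suffixes, ending in -1, -2, and -3.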
If you usually publish over 1,000 articles a day, you can use the value all instead of a date to generate paginated links for every day. You can also combine all with individual date values.
Caching results
If the sitemap-index returns results from more than a week ago, those results likely aren’t changing very often. To help improve the performance of your website and Outbound Feeds, we use longer cache Time To Live (TTL) values on older sitemaps. This requires configuring three different resolvers, each with its own content source. By default, content from the last 2 days uses the standard 5 minute TTL, content from 3 to 7 days ago uses a 1 hour TTL, and content older than that uses a 1 day TTL (the values start at zero).
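The three cache tiers can be pictured as a lookup from a sitemap's age to a TTL. This is only a sketch of the default tiers: in OBF the TTL is set on each resolver rather than computed, ttl_seconds is a hypothetical name, and the exact boundaries (ages 0 and 1 in the first tier, ages 2 through 7 in the second) are an assumption based on the day count starting at zero.

```python
def ttl_seconds(days_ago: int) -> int:
    """Default cache TTL for a daily sitemap that is `days_ago` days old."""
    if days_ago < 2:
        # Last 2 days (ages 0 and 1): standard 5 minute TTL.
        return 5 * 60
    if days_ago <= 7:
        # Up to a week old (assumed boundary): 1 hour TTL.
        return 60 * 60
    # Older content changes rarely: 1 day TTL.
    return 24 * 60 * 60
```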
Sitemap with DayX Resolvers
The sitemap index by dayX resolver is configured with the feeds-content-api-by-dayX content source. These resolvers require a single date to be passed as a parameter, which limits the query to a size that can be retrieved before the request times out.
If your OBF setup was created after May 2021 you should have three resolvers named:
sitemap with day
sitemap2 with day
sitemap3 with day
If not, you can find instructions on setting up all the sitemap with dayX resolvers here.
All three of these resolvers are configured almost identically; only the regex is different. There need to be three so that each can have its own cache (TTL) value. Each resolver requires you to enter the fields below.
Date Field - one of the five available ANS date fields. It should be the same as the sort field, which defaults to last_updated_date.
created_date
display_date
first_publish_date
last_updated_date
publish_date
Date Range - usually a date from the URL pattern, in the format YYYY-MM-DD.