Setting Up Sitemap Index By Day Feed with Outbound Feeds
The Sitemap Index is used to overcome Content-API’s 100 record per call limit by generating links to multiple days’ worth of sitemaps. This feed replaces the Sitemap Index feed. Even though the Sitemap Index feed displays results in 100 record pages, it uses just one query for all of the results. If your Sitemap Index feed has any links where the from parameter reaches 10,000 or more (they look like ?from=10000), those links will return errors in newer versions of ElasticSearch (ES 7.x).
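One way to check whether an existing Sitemap Index feed is affected is to scan its links for large from values. The sketch below is illustrative only: the function name and the 10,000 threshold parameter are assumptions, and it treats the ?from=10000 example above as already failing.

```python
from urllib.parse import urlparse, parse_qs

def exceeds_result_window(link, limit=10_000):
    """Return True if the link's 'from' query parameter is at or above the limit.

    Hypothetical helper: 'limit' mirrors the 10,000 record limit described above.
    """
    values = parse_qs(urlparse(link).query).get("from", [])
    return any(v.isdecimal() and int(v) >= limit for v in values)

links = [
    "/arc/outboundfeeds/sitemap/?from=9900",
    "/arc/outboundfeeds/sitemap/?from=10000",
]
# Keep only the links that would need the by-day approach.
problem_links = [link for link in links if exceeds_result_window(link)]
```

If problem_links is not empty, the feed is a candidate for the Sitemap Index By Day setup described here.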
If you have not already created a template, see Updating blocks.json and Outbound Feeds Development to configure the Sitemap feed. If you are currently using the Sitemap Index feed, you should remove the existing Sitemap Index template and resolver after adding the Sitemap by Day page. Pages have a higher priority than templates, so a page with the same URI will be used over a template, but having both a page and a template with the same URI is confusing.
The Sitemap Index By Day block returns one sitemap link for each day. The new sitemap links have several different formats. You can also configure it to generate sitemap-news or sitemap-video links.
/arc/outboundfeeds/sitemap/latest/
/arc/outboundfeeds/sitemap/2021-12-31/
/arc/outboundfeeds/sitemap2/2021-12-31/
/arc/outboundfeeds/sitemap3/2021-12-31-4/
You will need to create three new sitemap resolvers using the feeds-source-content-api-by-day-block, feeds-source-content-api-by-day2-block and feeds-source-content-api-by-day3-block content sources. The new content sources will attempt to return all of the content for the specified day, even if it’s greater than 100 records.
Feed Setup
The Sitemap Index By Day feed is a platform-level Arc feed block with configurable and customizable parameters. It generates one link for each day to your sitemap feed. Because Content-API now has a 10,000 record limit, a query can no longer reliably determine how many records are available for large result sets, so the block needs to be told how many days back to go. It does not query Content-API to generate the links; it starts with the current date and works backwards. Because there is no query, it can be set up as a page without a content source.
In summary, to use this new feed type, you’ll need the following:
- Four new blocks:
@wpmedia/feeds-source-content-api-by-day-block
@wpmedia/feeds-source-content-api-by-day2-block
@wpmedia/feeds-source-content-api-by-day3-block
@wpmedia/sitemap-index-by-day-feature-block
- Create a Sitemap Index By Day Feature as a Page
- Create three new resolvers, one for each of the sitemap paths used by the Page
Sitemap-Index-By-Day Page Configuration
These instructions assume you already have a Sitemap Index page. If you do not already have a sitemap-index page, add one by going to Pagebuilder Editor → Pages and creating a new page. Name the new page sitemap-index and set the requestURI to /arc/outboundfeeds/sitemap-index. Next, edit the page. In the curation tab under Features, click “Add feature” and choose “Sitemap Index By Day”. Follow the steps below to configure the sitemap-index.
If you are currently using a Sitemap Index feed and you want to keep using it, you’ll need to create this page using a different URI than /arc/outboundfeeds/sitemap-index. If you want to replace your existing sitemap-index feed, delete both the sitemap-index template and resolver.
Sitemap Index By Day Configuration
1. Number of Days to include
Enter the number of links (days) the sitemap-index should generate. If you want 7 days of sitemaps available, enter 7. If you want to go back to a specific date, you can enter a date instead. It must be in YYYY-MM-DD format. The earliest valid date is 1995-01-01. If the date is in the wrong format or is invalid, the feed will return only 2 links.
default: 2
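The rules for this field can be sketched as a small function. This is not the block's actual code: resolve_day_count is a hypothetical name, and counting the entered date itself as one of the days is an assumption.

```python
import re
from datetime import date

EARLIEST = date(1995, 1, 1)   # earliest valid date, per the text above
DEFAULT_DAYS = 2              # fallback when the value is invalid

def resolve_day_count(value, today):
    """Turn the 'Number of Days to include' value into a number of links."""
    text = str(value).strip()
    if re.fullmatch(r"[0-9]+", text):
        return int(text)
    # Only accept the strict YYYY-MM-DD shape before parsing.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return DEFAULT_DAYS
    try:
        start = date.fromisoformat(text)
    except ValueError:          # e.g. 2021-02-30
        return DEFAULT_DAYS
    if start < EARLIEST or start > today:
        return DEFAULT_DAYS
    # Assumption: the start date is included in the count.
    return (today - start).days + 1
```

For example, with today set to 2021-12-31, both 7 and 2021-12-25 resolve to seven links under these assumptions.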
2. Number of days and sitemap path
This key value pair takes an integer for the maximum number of days and the URL path to use for that sitemap resolver. Each sitemapX has a longer TTL (cache value). These longer TTLs are designed to reduce the chance that requests for older content impact rate limits. This path will be used in each link generated.
default:
0: /arc/outboundfeeds/sitemap/
- Days from 0+ will use /arc/outboundfeeds/sitemap/ with a 5 minute TTL
2: /arc/outboundfeeds/sitemap2/
- Days from 2+ will use /arc/outboundfeeds/sitemap2/ with a 1 hour TTL
7: /arc/outboundfeeds/sitemap3/
- Days from 7+ will use /arc/outboundfeeds/sitemap3/ with a 1 day TTL
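To make the threshold behaviour concrete, here is a minimal sketch of how a day offset could be mapped to a path using the default values. The function names are hypothetical, it assumes today is day 0, and the real block may emit a latest link for the current day rather than a dated one.

```python
from datetime import date, timedelta

# Default "number of days and sitemap path" values listed above.
DEFAULT_PATHS = {
    0: "/arc/outboundfeeds/sitemap/",
    2: "/arc/outboundfeeds/sitemap2/",
    7: "/arc/outboundfeeds/sitemap3/",
}

def path_for_day(days_back, paths=DEFAULT_PATHS):
    """Pick the path with the largest threshold that does not exceed days_back."""
    eligible = [threshold for threshold in paths if threshold <= days_back]
    return paths[max(eligible)]

def daily_links(today, number_of_days, paths=DEFAULT_PATHS):
    """Start at today and work backwards, one link per day."""
    links = []
    for days_back in range(number_of_days):
        day = today - timedelta(days=days_back)
        links.append(f"{path_for_day(days_back, paths)}{day.isoformat()}/")
    return links
```

With these defaults, the two most recent days use the sitemap path, days 2 through 6 use sitemap2, and anything 7 or more days old uses sitemap3.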
3. Sitemap Index Name
The name used when requesting this feed. It must match the value used in the requestURI of the page. This is used to help parse the request URI for additional path values.
default: /arc/outboundfeeds/sitemap-index/
4. URL Extension
Optional. Adds a file extension to the end of the generated URL, for example .xml. If you add an extension, you will need to change your sitemap resolver to match.
default: blank
5. Additional URL Parameters
URL encoded parameters to add to the sitemap URL. If adding multiple values, separate them with &. This will be appended to each URL.
default: blank
6. Dates with large values
Key value pairs of dates in YYYY-MM-DD format and an integer for the number of pagination links. The feeds-source-content-api-by-day-block content source will attempt to return all articles for a specific day. It can handle around 1,500 records before the request times out. If you have specific days with more than 1,500 records, due to a data migration or some special event, you can enter that date and the number of 1000 record links to generate. If you publish over 1,500 articles every day, you can use all instead of a date and every day will be split into that many links.
default: {}
Example:
2021-05-24 (date, format YYYY-MM-DD)
2 (integer)
This will generate 2 links that will each return up to 1000 articles.
/arc/outboundfeeds/sitemap/2021-05-24-1/?outputType=xml
/arc/outboundfeeds/sitemap/2021-05-24-2/?outputType=xml
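The splitting rule can be sketched as follows. This is an illustration only: links_for_date is a hypothetical helper, and the link format is inferred from the example above rather than taken from the block's source.

```python
def links_for_date(day, path, large_dates, params=""):
    """Return the sitemap links for one day, split into pages when configured.

    large_dates maps a YYYY-MM-DD date (or the key 'all') to a page count.
    """
    pages = large_dates.get(day, large_dates.get("all", 1))
    suffix = f"?{params}" if params else ""
    if pages <= 1:
        return [f"{path}{day}/{suffix}"]
    # One link per page, numbered from 1.
    return [f"{path}{day}-{n}/{suffix}" for n in range(1, pages + 1)]

example = links_for_date(
    "2021-05-24",
    "/arc/outboundfeeds/sitemap/",
    {"2021-05-24": 2},
    "outputType=xml",
)
```

Here example holds the same two links shown above, while a date that is not listed in large_dates produces a single unnumbered link.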
Resolver Configuration
Each feed template needs to have a resolver applied. When you create a resolver, you’ll want to configure these fields.
The Sitemap Index by Day block returns one link for each day’s sitemap using a format like sitemap/YYYY-MM-DD. You will need to create three new sitemap resolvers using the three content-api-by-day, content-api-by-day2 and content-api-by-day3 content sources. These three resolvers are set up almost identically. Where a field below lists three values, use them in order: the first value for the first resolver, the second value for the second resolver, and the last value for the third resolver.
You need to set up three different Sitemap By Day resolvers:
sitemap with day
sitemap2 with day
sitemap3 with day
1. Resolver Name
Unique, easily recognized name given to the resolver. Create three resolvers, each with its own name.
sitemap with day
sitemap2 with day
sitemap3 with day
2. Resolver Priority
Number to indicate priority order (1 to 100). Lower numbers are evaluated first. It should have a lower priority than the standard sitemap resolver.
3. Regex Pattern
These values need to match the Sitemap Index By Day “number of days and sitemap path” customField values. For each of the three resolvers, use the regex pattern that matches the resolver name. If you added a file extension, be sure to update your regex to match.
^/arc/outboundfeeds/sitemap/(latest(-\d*)?|\d\d\d\d-\d\d-\d\d(-\d*)?)/?$
^/arc/outboundfeeds/sitemap2/(latest(-\d*)?|\d\d\d\d-\d\d-\d\d(-\d*)?)/?$
^/arc/outboundfeeds/sitemap3/(latest(-\d*)?|\d\d\d\d-\d\d-\d\d(-\d*)?)/?$
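Before saving a resolver, it can help to try its pattern against the links the index will generate. The sketch below uses Python's re module purely for illustration; the resolver's own regex engine may differ in details, and date_range is a hypothetical helper name.

```python
import re

# The first resolver's pattern, copied from the list above.
SITEMAP_PATTERN = re.compile(
    r"^/arc/outboundfeeds/sitemap/(latest(-\d*)?|\d\d\d\d-\d\d-\d\d(-\d*)?)/?$"
)

def date_range(uri):
    """Return the first capture group (the date part) or None if no match."""
    match = SITEMAP_PATTERN.match(uri)
    return match.group(1) if match else None

candidates = [
    "/arc/outboundfeeds/sitemap/latest/",
    "/arc/outboundfeeds/sitemap/2021-12-31/",
    "/arc/outboundfeeds/sitemap/2021-05-24-2",
    "/arc/outboundfeeds/sitemap2/2021-12-31/",     # belongs to the second resolver
    "/arc/outboundfeeds/sitemap/2021-12-31.xml",   # extension not covered by this pattern
]
results = {uri: date_range(uri) for uri in candidates}
```

The last two candidates return None, which shows why each path needs its own resolver and why adding a file extension requires a matching change to the regex.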
4. URL Parameters
No parameters are needed
5. Websites
All
6. Content Source
For the three resolvers, use the content source that matches the resolver’s name. For sitemap with day use feeds-content-api-by-day. For sitemap2 with day use feeds-content-api-by-day2. For sitemap3 with day use feeds-content-api-by-day3.
feeds-content-api-by-day
feeds-content-api-by-day2
feeds-content-api-by-day3
7. Patterns
These positions come from the regex pattern. There are a number of ways to pass a section (or author, keyword or tag) to the content source from the resolver. All of them use a grouping () in the regex.
The image shows an example configuration, and the setup steps are found below.
Date Field - Enter the ANS date field that you want to query on. This should be the same for all three resolvers. Usually this will be the same date used to sort, which by default is last_updated_date. Because last_updated_date is a system generated date, if you migrated 10,000 articles on a single day they will all have the same last_updated_date. In cases like this you might want to use a different date field, such as display_date. This is a required field.
created_date
display_date
first_publish_date
last_updated_date
publish_date
Date Range - Pattern 1 (For content source feeds-content-api-by-day)
^/arc/outboundfeeds/sitemap/(latest(-\d*)?|\d\d\d\d-\d\d-\d\d(-\d*)?)/?$
Date Range - Pattern 1 (For content source feeds-content-api-by-day2)
^/arc/outboundfeeds/sitemap2/(latest(-\d*)?|\d\d\d\d-\d\d-\d\d(-\d*)?)/?$
Date Range - Pattern 1 (For content source feeds-content-api-by-day3)
^/arc/outboundfeeds/sitemap3/(latest(-\d*)?|\d\d\d\d-\d\d-\d\d(-\d*)?)/?$
Include - Terms - If you want to modify the default query used by the content source, enter it here. You can find more details on the query format here. If you modify Include-Terms you need to make the same change to the sitemap.
Exclude - Terms - If you want to modify the default query used by the content source to have NOT terms, enter it here. You can find more details on the query format here. If you modify Exclude-Terms you need to make the same change to the sitemap.
Sort - Static, last_updated_date:desc
If left blank it will default to publish_date:desc. Whichever date field you selected to display for the Last Modified Date is the date you should enter here. The format is date_field + : + sort order (desc or asc). Valid date fields are created_date, display_date, first_publish_date, last_updated_date, or publish_date. If you update the sort order you should make the same change to the sitemap.
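A quick way to sanity-check a Sort value before entering it is to parse it against the format described above. This is a sketch under stated assumptions: parse_sort is a hypothetical helper, not part of the content source.

```python
VALID_DATE_FIELDS = {
    "created_date",
    "display_date",
    "first_publish_date",
    "last_updated_date",
    "publish_date",
}

def parse_sort(value, default="publish_date:desc"):
    """Split 'date_field:order' and validate both parts.

    A blank value falls back to the documented default.
    """
    text = (value or "").strip() or default
    field, _, order = text.partition(":")
    if field not in VALID_DATE_FIELDS or order not in ("asc", "desc"):
        raise ValueError(f"invalid sort value: {text!r}")
    return field, order
```

For instance, parse_sort("last_updated_date:desc") returns the field and order as a pair, while a value such as headline:desc raises ValueError.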
Source-Exclude - A comma separated list of ANS fields to remove from the default list of ANS fields. See the default list here.
Source-Include - A comma separated list of ANS fields to add to the default list of ANS fields. See the default list here.
Include - Distributor - Name
If you only want content from a single distributor, enter the name here. You may only populate one distributor field.
Exclude - Distributor - Name
If you want to exclude content from a single distributor, enter the name here. You may only populate one distributor field.
Include - Distributor - Category
If you only want content from a single distributor category, enter the name here. You may only populate one distributor field.
Exclude - Distributor - Category
If you want to exclude content from a single distributor category, enter the name here. You may only populate one distributor field.
8. Default Template
Select a sitemap feed to use with this resolver.
9. Default output type
Select xml.
10. Content Mapped Template
blank
Sitemap Index URL
To preview your Sitemap Index XML, modify the URLs below with your client org and website name. There are two parts. The first part is the sitemap-index. NOTE: the sitemap-index requires an outputType because it was created as a page.
https://outboundfeeds.CLIENTORG.arcpublishing.com/pf/arc/outboundfeeds/sitemap-index/?outputType=xml&_website=CLIENTWEBSITE
The second part is the new sitemaps that are used in the sitemap-index links:
https://outboundfeeds.CLIENTORG.arcpublishing.com/pf/arc/outboundfeeds/sitemap/YYYY-MM-DD?_website=CLIENTWEBSITE
https://outboundfeeds.CLIENTORG.arcpublishing.com/pf/arc/outboundfeeds/sitemap2/YYYY-MM-DD?_website=CLIENTWEBSITE
https://outboundfeeds.CLIENTORG.arcpublishing.com/pf/arc/outboundfeeds/sitemap3/YYYY-MM-DD?_website=CLIENTWEBSITE
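If you preview these URLs often, filling in the placeholders can be scripted. The sketch below only substitutes values into the URL shapes shown above; preview_urls is a hypothetical helper, and the org, website and date in the usage line are made-up example values.

```python
from urllib.parse import urlencode

def preview_urls(client_org, website, day):
    """Build the sitemap-index URL and the three by-day sitemap URLs."""
    base = f"https://outboundfeeds.{client_org}.arcpublishing.com/pf/arc/outboundfeeds"
    # The index is a page, so it needs outputType as well as _website.
    index_query = urlencode({"outputType": "xml", "_website": website})
    urls = [f"{base}/sitemap-index/?{index_query}"]
    site_query = urlencode({"_website": website})
    for path in ("sitemap", "sitemap2", "sitemap3"):
        urls.append(f"{base}/{path}/{day}?{site_query}")
    return urls

# Example values only; replace with your own org, website and date.
urls = preview_urls("myorg", "mysite", "2021-12-31")
```

The first entry is the sitemap-index preview; the remaining three correspond to the sitemap, sitemap2 and sitemap3 resolvers.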
Additional Information
All sitemaps adhere to the Standard Sitemap Protocol and implement Google’s extensions for both Images and Videos.
Steps To Create And Manage Outbound Feeds.
Optional Content Sources For OBF.
Using Jmespath To Map To CustomFields ANS Values.
More details on Resolvers.
Regex Debugger.