How to Read Engine Logs
When working with Engine, it’s important to know how Engine runs and how it outputs its logs in different environments. At a high level, Engine runs as part of a suite of other Arc products, mainly its companion PageBuilder Editor, and your Engine instance will most likely be using Arc content products and APIs. The key distinction is how Engine runs in Arc infrastructure versus on your local machine. There are no big differences between the Arc environments themselves, but it’s good to refresh our memory on Arc Environments.
This article covers what Engine logs contain and how to analyze them, with a focus on tools for analyzing logs coming from Arc environments. When working with Engine in your local development environment, the tooling is slightly different, since Engine runs locally in a special environment orchestrated by Docker. To analyze Engine logs in local development, we suggest using the Docker log analysis tool Dozzle. We have documented the steps to set up and use this tool in How to Work with Engine Logs in Local Development Environment.
Parts of Engine - Understanding log groups
Let’s keep the CDN in mind as the outermost layer, and consider a reader clicking a link and triggering a page render request from Engine. A few parts of Engine run in separate processes to handle your reader’s request and create the final server-side rendered output of your page.
We’ll go through these processes in the order they are chained in a page render request.
1) Resolver (Lambda) Logs
Resolver is an Arc-hosted and managed process; it is a simple application that runs in its own standalone Lambda function. The resolver’s job is to interpret the requested URL and resolve the following:
- The page or template it is mapped to. Pages are the simplest to map since each page gets a static URL path when you create it. Templates are resolved by the regular expressions you may have set in your resolvers. Keep in mind that resolvers can get quite complicated depending on your resolver configurations: multiple resolver configurations may match the requested URL, and when that happens, the priority value among the matched resolvers determines which one is used. Based on that resolver configuration, the resolver Lambda determines which template to render. Eventually, the resolver process is responsible for figuring out which page configuration (layout, features, custom fields that you configure in Editor) will be used for the rendering process.
- The resolver Lambda also determines the global content source if the requested URL matches a resolver configuration, and it queries that content source to resolve the content to be passed to the page render. In the most common use case, a story-detail page is configured with a resolver and the content source uses a Composer story/content ID.
Currently, resolver logs are not very useful to you because the resolver is an Arc-hosted process and it doesn’t populate debugging information in its logs. But it’s important to know the role of the resolver and what it does before the actual Engine rendering happens. You will also see resolver process logs in your local environment when looking at the combined Engine log stream in your command line.
You can use the resolver debugger under PageBuilder Editor -> Developer menu to see what output the resolver process returns for a given website and URL. These debugger tools are also available in your local PageBuilder admin UI.
2) Engine (Lambda) Logs
Once the resolver figures out the page/template and the content source to use, it passes these as parameters to the next process, the Engine Lambda (the star of our show). The Engine Lambda is the main process that uses your Arc blocks and/or custom code (features, chains, layouts, output types, content sources) to render the page/template and content passed from the resolver. So the resolver and Engine processes are chained together to translate a URL into a page/template and content, then render the page.
Engine Lambda logs are mostly free-form Node.js application output, as if you were running the Node.js application on your command line. This means that whatever you console.log manually, and any part of your code that throws exceptions, will end up in your Engine logs.
This also means you can increase logging in any aspect of your page rendering process to make your code more transparent. It’s always good to remember that Engine runs isomorphically, meaning it renders pages on both the server side and the client side. To learn more about this, see How Rendering Works In Engine and the difference between Isomorphic, Server-Side, And Client-Side Rendering.
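For example, here is a minimal sketch (the content source name, API URL, and fields are hypothetical) of adding a console.log to a content source’s fetch so the resolved parameters show up in the Engine logs:

import axios from 'axios';

export default {
  fetch: async (query) => {
    // This line ends up in the Engine Lambda logs (server side)
    // and in the browser console (client side).
    console.log('my-content-list fetch called with', JSON.stringify(query));
    const response = await axios.get('https://api.example.com/movies', { params: query }); // hypothetical API
    return response.data;
  },
  params: { movie: 'text' },
};

Remember to remove or tone down this kind of logging once you are done debugging, since it runs on every content source call.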
Key information to look for in Engine logs
- Errors, either thrown manually, thrown from your dependencies, or thrown from Engine itself.
- Content source events “arc.fusion.content.*” - we deep dive on these below.
- Render error event “arc.fusion.render.error”: this means your page render failed because of an exception. You can see the exception and stack trace by filtering your logs with this event name.
- (Opt-in) Content source tracing output, when enabled.
- Render speed, using Lambda “REPORT” events when your code runs in Arc environments (see the sample query after this list).
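As a minimal sketch, assuming your forwarded Lambda logs are queried in CloudWatch Logs Insights (which auto-discovers the @type and @duration fields on REPORT lines), a render-speed query could look like:

fields @timestamp, @duration | filter @type = "REPORT" | stats avg(@duration), pct(@duration, 95), max(@duration) by bin(5m)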
3) Compiler Logs
Lastly, we want to talk about a process that runs when you make deployments or when Engine builds your bundle code locally. This process is not part of your page render; it only runs when we compile your code and deploy it to AWS Lambda. The compiler process is especially important when developing locally, because your code gets re-compiled throughout the changes you make in your local environment, and this happens quite frequently.
The compiler process is essentially a multi-step webpack build with a few custom webpack plugins that organize and compile your Engine code. Even though we don’t allow clients to customize the build process with webpack customizations, the compilation is still highly dependent on your code.
In Arc environments, this process is used by the “Deployer” tool you use to upload and promote a bundle to your Arc environments. In your local environment, fusion-cli manages the compiler process, and unlike the Deployer, the local version of the compiler runs continuously and watches your code changes to trigger re-builds automatically. Regardless of how each “compile” gets triggered, the output will be the same.
Key information to look for in compiler logs
- Errors: missing modules, import errors, syntax errors, transpilation errors (Babel/TypeScript)
- (S)CSS compilation errors
- Deprecation and upgrade warnings from the modules you use
Diving deep into Content Source events in the Engine logs
In the Engine logs, there are standardized events for each of the content sources used in a page render. These logs contain key information that can help us understand whether the content sources are successfully returning the data needed or failing. Even for successful responses, the content source result may have been produced in different ways: cached, stale, or stale with backoff when a content source starts failing but cached content can still be served. Here is a sample log line Engine records for a content source call:
2022-04-06T19:39:08.510Z 960f7c20-b5c7-4c39-aa17-8d43bcd51e76 INFO {"name":"arc.fusion.content.fetch.error","source":"my-content-list","query":{"movie":"Matrix 6"},"error":404,"value":1,"environment":"myorg","functionName":"fusion-engine-myorg","fusionVersion":"3.2.6","xRayTraceID":"1-624dec5c-59a6dfea294c07ee681548ab"}
This log contains key information we can use to analyze the overall traffic and health of the bundle. “source” is the name of the content source, “error” is the response code returned from the resolve function, and “query” contains the query parameters used for this content source execution. The most important property, however, is “name”: this is the event type that explains how the content source call was handled. Here is the list of cache events and what they mean:
Event Name | What it means
---|---
arc.fusion.cache.latency | The content source call returned a successful response.
arc.fusion.content.fetch.error | As the name suggests, the content source failed to execute and returned a failure response. This usually means there was no previously cached object with the same cache key, so Engine wasn’t able to return a success for this request. This may cause the page render to fail as well.
arc.fusion.cache.stale | The actual content source execution failed, but we found previously cached data that is within the backoff window, so we served stale.
arc.fusion.cache.stale.backoff | Similar to the stale response, this status means the content source cache expired and we are beyond the backoff window; we still serve the last valid cached response as stale, with incremental backoff retries.
For more information on stale and backoff, see Content Caching In PageBuilder Engine.
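Because these events are structured JSON, log tools that automatically discover JSON fields (CloudWatch Logs Insights does this for Lambda logs, as the queries later in this doc rely on) let you filter and group on them directly. A minimal sketch of such a query:

fields @timestamp, name, source, error | filter name like 'arc.fusion.' | limit 20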
Docker logs when running Engine locally
When you run Engine locally using fusion-cli, we run the collection of processes described above, plus a few others (like PageBuilder Editor and the debugger tools), in separate Docker containers. Docker helps keep each of these processes contained and separately initialized, running similarly to how we run them in Arc environments. The orchestration of these Docker containers simulates a minimal PageBuilder instance in your local environment.
It’s important to differentiate between the Engine processes in the fusion-cli output and know how to read these logs, since all the processes work together. It’s especially easy to mix up these logs in your local environment because they are served as a single, very active log stream in your command line. When running locally, your command-line output shows each container’s logs with a color-coded prefix on each line. In Arc environments, each service’s log stream is organized into a different log group, so they are easier to analyze separately.
Keep in mind that you will see more PageBuilder services’ logs in your local environment. When you access Engine logs in Arc environments, only a few of these services’ logs will be available to you through Log Forwarding.
Also, the types of Engine logs surfaced in local development are not the same as in Arc environments; some of the events documented here may not be available in the local environment.
AWS log forwarding
By default, log streams in Arc environments are not available to you. You can ask your TAM to set up the Log Forwarding service. We set up an AWS Kinesis stream that forwards logs from Arc’s AWS account to your target AWS account. From there you can push your logs to any storage, indexing, and analysis service for further monitoring and troubleshooting.
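As a minimal sketch of consuming that stream, assuming the forwarded records are CloudWatch Logs subscription data (which Kinesis delivers base64-encoded and gzipped), a consumer Lambda in your account could decode them before pushing them to your own storage:

const zlib = require('zlib');

// Hypothetical consumer Lambda attached to the forwarded Kinesis stream
exports.handler = async (event) => {
  for (const record of event.Records) {
    // Kinesis data is base64-encoded; CloudWatch Logs payloads are also gzipped
    const payload = zlib.gunzipSync(Buffer.from(record.kinesis.data, 'base64'));
    const batch = JSON.parse(payload.toString('utf8'));
    for (const logEvent of batch.logEvents || []) {
      // Forward logEvent.message to your own storage/indexing service here
      console.log(logEvent.timestamp, logEvent.message);
    }
  }
};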
Analysis tools
As we mentioned above, Engine logs are essentially raw application output, no different from what you see in your command line when you run Engine locally. That’s why no special tool is required, and we don’t have one strong recommendation.
We often analyze Engine behavior in AWS CloudWatch with simple queries. We have seen clients use Splunk, the Elastic+Kibana stack, or similar tools to analyze and monitor their logs. Most of these tools have local versions that can be set up in minutes in Docker, and any logs can be streamed or manually imported for analysis.
Almost all of the CloudWatch example queries we provided in this doc can be replicated in most log analysis tools perhaps with even more detailed analysis capabilities.
Keep in mind that the way you analyze your logs in your local development environment will be different from how you analyze them in Arc environments. Even though running ElasticSearch+Kibana or Splunk locally is a viable option, it requires a more complex setup to properly ingest and analyze Docker logs. Instead, we suggest lighter command-line tools like Lnav, Up, Angle-Grinder, or similar log trailing and filtering tools; we also cover a generic Docker log analysis tool called Dozzle in How to Work with Engine Logs in Local Development Environment.
AWS CloudWatch example queries
If you decide to set up your AWS Kinesis streams to be stored and available in CloudWatch, you can use CloudWatch Insights and start with these sample queries we often use for our internal debugging.
Find instances of a specific console.log in the log stream:
fields @timestamp, @message | filter @message like "My Custom Exception"
Create a time series chart of the particular error:
This shows whether the errors are increasing or decreasing over a period of time, which is particularly useful to validate a fix you push.
fields @timestamp, @message | filter @message like "My Custom Exception" | stats count() by bin(1m)
When you run this query, by default, the result panel will display the calculated statistical table view. You can switch to the “Visualization” tab to view this time series in chart format.
You can also adjust the time span over which the statistics are calculated to make your chart more granular or grouped into larger intervals (i.e., the same chart in hourly vs. 15-minute vs. 1-minute intervals).
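For example, the same query bucketed hourly instead of per minute:

fields @timestamp, @message | filter @message like "My Custom Exception" | stats count() by bin(1h)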
Get the full render logs for a particular request from start to finish:
Lambda logs in CloudWatch carry a “requestId” property that tags all the output from a specific Lambda execution. This means you can get all the logs for a particular server-side request. Keep in mind that your page renders server-side first; then, on the client side, there will be additional requests to fetch content sources from the same features, unless a feature is marked as static (server-side only).
To get a full page render log, you first need to identify the request you want to debug. We need to catch at least one log line from that particular render session; this can be an exception or an intentional console.log output. Once you find a specific log line and expand the log entry in CloudWatch, it will display the @requestId property. You can now use this property in another search query to get all the logs for that particular request.
fields @timestamp, @message | filter @requestId = "6a47a8a2-acac-492b-85a8-f49ae49262e3" | sort @timestamp asc
Here are two common scenarios that happen with a way to troubleshoot them using Engine logs.
Case Study 1: Using window in a component during server-side render (SSR fails)
Let’s say you have a custom feature that uses browser APIs or a global that is set on the window object. Your feature may look something like this:
const SimpleHelloWorld = () => {
  const agent = window.navigator.userAgent; // This will fail in SSR
  return ( <div className='hello_text'> <h4>Hello from {agent}</h4> </div> );
};

export default SimpleHelloWorld;
Since the window object is only available and defined when your feature code runs client-side in the browser, it will be undefined during server-side render, so this feature’s SSR will fail. Engine isolates each of your features and their SSR individually, so if one feature’s SSR fails, in most cases it won’t break your page altogether; it will just cause that feature not to render, leaving an empty node in its place.
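One way to avoid this particular failure, as a minimal sketch assuming a standard React feature, is to read window only after the component has mounted in the browser:

import React, { useState, useEffect } from 'react';

const SimpleHelloWorld = () => {
  const [agent, setAgent] = useState('server'); // safe default used during SSR

  useEffect(() => {
    // useEffect only runs in the browser, so window is guaranteed to exist here
    setAgent(window.navigator.userAgent);
  }, []);

  return ( <div className='hello_text'> <h4>Hello from {agent}</h4> </div> );
};

export default SimpleHelloWorld;

A typeof window !== 'undefined' check works too, but keep in mind the server-rendered HTML will then differ from the first client render.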
It’s important to test whether your feature’s SSR renders successfully in every possible scenario. We suggest testing your page with JavaScript disabled from the developer tools and checking whether the SSR-generated HTML contains the proper content elements on the initial page load. This also ensures your content is available to the user faster and is crawlable by search engine bots.
More generally, when troubleshooting any SSR failure caused by a common mistake like this one, logs should be our first stop. These exception messages are more visible in Arc environments through the Lambda output; local development environments may not fully surface the actual exception messages and stack traces from the Docker containers into the fusion-cli output. When available, you can capture these render errors using the arc.fusion.render.error event with a CloudWatch query like:
fields @timestamp, @message | filter @message like "arc.fusion.render"
A query like this will surface the log lines that show why your SSR failed.
After finding an instance of the error you are seeing, you can follow the steps from the section above to find the @requestId, then run a query with that requestId to see the full logs of that page render, check whether any other log lines lead to the error, and replicate the error against the specific content source or piece of code where it happens. As a last resort, you can always deploy a new bundle with additional logging to replicate the error and understand your code’s behavior in Arc environments.
Going back to our example, we expect an undefined variable or undefined property exception with a stack trace showing where the exception is thrown, helping us pinpoint where exactly the issue is happening. As mentioned before, depending on the complexity of your features or content sources, sometimes we may need additional information, such as the input parameters for a specific scenario. In these cases, it’s OK to add more logging to your code base and replicate the issue in order to understand how a specific buggy behavior happens and what causes it.
Case Study 2: Seeing a 5xx on an article page caused by an undefined variable (or property) in a content source (content source call fails)
Now, let’s talk about another common problem that may fail your page render completely, causing a 500 response from Engine. This happens when the failing content source is used as the global content source. If the content source is used by one or a few features rather than as the global content source, a failure in it does not cause the full page SSR to fail; it only fails to return data to its features, and depending on how that data is checked and used inside the feature code, in the worst case it fails the rendering of those particular features without affecting the page or other features.
Let’s say you have a content source for a custom feature used on your election pages that shows candidate information, and this content source is used by a resolver for “candidate detail pages”. The content source logic may assume:
- Editors will enter valid candidate names that will always return the expected data.
- The API used in the content source will always return the expected schema with the candidate information requested.
The content source code may look something like this:
import axios from 'axios';

export default {
  fetch: async (customFields) => {
    // example election candidate database API call
    const response = await axios({ ... });
    return response.data.candidate.party;
  },
  params: { name: 'text' },
};
The obvious mistake to avoid here is assuming the returned data will always contain what your code needs. API calls made in a content source like this depend heavily on the configuration values coming from the Editor UI. What if the candidate’s name is invalid or empty? What if no candidate was found? Do we know whether the API we are calling returns a 404 or an empty “candidate” object in these scenarios?
The content source above can easily fail because we read the candidate object from the data property of the axios response even when the call returns successfully, and the same goes for deeper references like the “party” property of the “candidate” object. The example above is very simple and plain, but this type of common issue happens in much more complex content source code as well.
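Here is a minimal sketch of a more defensive version (the API URL is hypothetical, and throwing an error with a statusCode is one way to surface a clean failure value in the fetch.error events described above):

import axios from 'axios';

export default {
  fetch: async ({ name }) => {
    // Hypothetical election candidate API endpoint
    const response = await axios.get('https://api.example.com/candidates', { params: { name } });

    // Guard against missing or unexpected data instead of assuming the schema
    const party = response?.data?.candidate?.party;
    if (!party) {
      // The statusCode ends up as the "error" property in the
      // arc.fusion.content.fetch.error event instead of an opaque TypeError
      const error = new Error(`No candidate data returned for "${name}"`);
      error.statusCode = 404;
      throw error;
    }
    return party;
  },
  params: { name: 'text' },
};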
In more general terms, monitoring your code’s behavior, especially error events like arc.fusion.content.fetch.error in your logs, is important for understanding how your code and your changes (deployments) affect your site. These events also name the actual content sources that are failing, which is instrumental to how you approach development and deployment monitoring. When a change is shipped to production, this is one of the key ways to confirm your traffic is being handled and cached properly.
You can run a CloudWatch query like the one below and set alarms for anomalies (other tools have similar ways to do this):
fields @timestamp, @message | filter @message like 'arc.fusion.content.fetch.error' | stats count(*) as count by source, error | sort count desc | limit 10
A query like this will give you the recent trend of content source errors, if any are happening. It includes the “error” property, which is the HTTP status code taken from the “statusCode” property of the error/exception thrown inside the content source.
The main idea is to monitor this behavior either with automated alerts or manually to validate your deployment integrity when you push changes to live environments. This is also a good place to start when you identify unwanted page render or cache behaviors.
You can always drill down to a specific content source by filtering these cache events on the “source” property, and, if you like, only the instances that failed, with a query like:
fields @timestamp, @message | filter @message like 'arc.fusion.content.fetch.error' | filter source = 'my-content-list'
Or see how the calls to this specific content source are handled in general in a query like this:
fields @timestamp, @message | filter source = 'my-content-list' | stats count(*) as count by name | sort count desc | limit 10
You can also use a query like the one below to visualize your content sources’ overall success/failure rates:
fields @timestamp, @message | filter source = 'my-content-list' | stats count() as all, sum(name='arc.fusion.content.fetch.error') as errors, sum(name='arc.fusion.cache.latency') as success by bin(5m)
Why does this type of error not happen in local or other environments but happens in production?
Editor configuration changes
Developers should always be aware that content source and feature code behave according to the inputs from the Editor configuration. The configuration parameters, and how your content source or feature responds to them, depend on your implementation and its edge cases. Your editors may enter any values into your custom fields or resolver configurations, and your production PageBuilder Editor configuration may be much more volatile (updated more frequently) than your sandbox or local environments. Editor configuration is the most dynamic part of how a page gets rendered using your code.
Covering all possible scenarios from Arc or third-party APIs
We often see response differences from the Arc APIs or third-party APIs that are not handled properly in content sources.
The best way to ensure the integrity of your content sources and features is to write tests and cover as many scenarios for your code as you can.
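As a minimal sketch, assuming a Babel-based Jest setup and the defensive content source shown earlier (the file path is hypothetical), a test covering the missing-candidate scenario could look like:

import axios from 'axios';
import candidateSource from './candidate'; // hypothetical path to the content source

jest.mock('axios');

describe('candidate content source', () => {
  it('rejects with a 404 when the API returns no candidate object', async () => {
    axios.get.mockResolvedValue({ data: {} }); // 200 response with an unexpected (empty) body
    await expect(candidateSource.fetch({ name: 'Unknown Candidate' }))
      .rejects.toMatchObject({ statusCode: 404 });
  });

  it('returns the party when the API returns the expected schema', async () => {
    axios.get.mockResolvedValue({ data: { candidate: { party: 'Independent' } } });
    await expect(candidateSource.fetch({ name: 'Jane Doe' })).resolves.toBe('Independent');
  });
});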