Troubleshooting agent response codes

Last updated 2024-08-28

IMPORTANT

This guide only applies to Next-Gen WAF customers using the Cloud WAF or Core WAF deployment method.

If something abnormal occurs during request processing, the Next-Gen WAF agent returns an error agent response code (such as -2, -1, or 499) that you can use to help resolve the issue.

Troubleshooting -2, -1, and 0 agent response codes

The -2, -1, and 0 agent response codes are error codes applied to requests that weren’t processed correctly. There are a few reasons why this can happen, but they tend to fall into two major categories:

The post or response couldn't be matched to the request
The module timed out waiting for a response from the agent

Request and response mismatch

Error agent response codes can occur when a post or response couldn't be matched to any actual requests. This is typically the result of NGINX redirecting before the request is passed to the Next-Gen WAF module.

Specific server response codes

The following server response codes cause NGINX to skip the phases that normally run. Due to their nature, they cause NGINX to finish processing the request without it being passed to the Next-Gen WAF module:

400 (Bad Request)
405 (Not Allowed)
408 (Request Timeout)
413 (Request Entity Too Large)
414 (Request URI Too Large)
494 (Request Headers Too Large)
499 (Client Closed Request)
500 (Internal Server Error)
501 (Not Implemented)
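
When reviewing access logs, it can help to check quickly whether a server response code is on this list. The following is a minimal Python sketch; the function name is hypothetical and the set simply restates the codes listed above.

```python
# Server response codes that, per the list above, cause NGINX to finish
# processing without passing the request to the Next-Gen WAF module.
SKIPPED_BY_MODULE = {400, 405, 408, 413, 414, 494, 499, 500, 501}


def is_in_skip_list(server_status: int) -> bool:
    """Return True if the server response code is one of the listed codes."""
    return server_status in SKIPPED_BY_MODULE


for status in (200, 413, 494):
    print(status, is_in_skip_list(status))
```

A request that shows an error agent response code together with one of these server response codes is a likely candidate for a request and response mismatch.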

Look for NGINX return directives

Look for custom NGINX configurations or Lua code that could be redirecting the request. This is almost always due to return directives in an NGINX configuration file. There could be return directives used to redirect specific pages to www, https, or a new URL. The return directive stops all processing, causing the request to not be processed by the Next-Gen WAF module. For example:

location /oldurl {
     return 302 https://example.com/newurl/;
}

These would need to be updated to force the request to be processed by our agent first. Calling the rewrite_by_lua_block directly allows you to force the Next-Gen WAF module to run first and then perform the return statement for NGINX:

location /oldurl {
     rewrite_by_lua_block {
          -- Run the Next-Gen WAF module first, then issue the redirect
          sigsci.prerequest()
          return ngx.redirect("https://example.com/newurl/", 302)
     }
     #return 302 https://example.com/newurl/;
}
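
To locate candidate return directives across a large configuration, a small script can help. The following Python sketch is a heuristic rather than a full NGINX parser: it only matches lines that begin with return followed by a three-digit status code, so it misses a return that has no status code or that sits on the same line as its location block. The function name and the sample configuration are illustrative assumptions.

```python
import re

# Matches an NGINX "return <status>" directive at the start of a line.
# Commented-out lines (starting with "#") and Lua statements such as
# "return ngx.redirect(...)" are not matched, because neither begins with
# "return" followed by a three-digit number.
RETURN_DIRECTIVE = re.compile(r"^\s*return\s+(\d{3})\b")


def find_return_directives(config_text):
    """Return (line_number, status_code, line) for each return directive."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        match = RETURN_DIRECTIVE.match(line)
        if match:
            hits.append((number, int(match.group(1)), line.strip()))
    return hits


sample = """\
location /oldurl {
     return 302 https://example.com/newurl/;
}
location /fixed {
     rewrite_by_lua_block {
          sigsci.prerequest()
          return ngx.redirect("https://example.com/newurl/", 302)
     }
     #return 302 https://example.com/newurl/;
}
"""

for number, status, line in find_return_directives(sample):
    print(f"line {number}: status {status}: {line}")
```

Only the first location block is reported. The Lua redirect and the commented-out directive in the second block are skipped.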

Agent restarted

Request and response mismatches can also be due to restarting the agent. If the agent is restarted after the request is processed, but before the response is processed, the agent will not see the response and fail to attribute it to the request, resulting in an error agent response code.

Module timing out

When the module receives a request, it sends it to the agent for processing. The module then waits for a response from the agent (whether or not to block) for a set amount of time (typically 100ms). If the agent doesn’t process the request within that time, the module will time out and default to failing open, allowing the request through. These requests that failed open will have error agent response codes applied to them.

Module timeouts are most commonly due to insufficient resources allocated to the agent. This can be a result of host or agent misconfiguration, such as the agent being limited to too few CPU cores.

This can also be due to a high volume of traffic to the host. If requests are coming in faster than the agent can process them, subsequent requests will be queued for processing. If a queued request reaches the timeout limit, then the module will fail open and allow the request through.

Similarly, certain rules designed specifically for penetration testing can take longer to run than traditional rules. This can result in requests queueing and timing out due to the increased processing time per request.
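
The queueing effect described above can be illustrated with a little arithmetic. The following Python sketch assumes a single worker that processes requests in arrival order with fixed arrival and processing times. These are simplifications for illustration, not a description of how the agent schedules work, and the sketch does not model timed-out requests leaving the queue.

```python
TIMEOUT_MS = 100  # typical module timeout mentioned above


def time_in_system(arrival_interval_ms, service_time_ms, request_count):
    """Time each request spends queued plus being processed, in ms,
    for one worker that handles requests in arrival order."""
    totals = []
    worker_free_at = 0.0
    for i in range(request_count):
        arrived_at = i * arrival_interval_ms
        started_at = max(arrived_at, worker_free_at)
        worker_free_at = started_at + service_time_ms
        totals.append(worker_free_at - arrived_at)
    return totals


# Requests arrive every 10 ms, but each one takes 15 ms to process.
totals = time_in_system(arrival_interval_ms=10, service_time_ms=15, request_count=30)
first_timeout = next(i for i, t in enumerate(totals) if t > TIMEOUT_MS)
print(f"request #{first_timeout} is the first to exceed {TIMEOUT_MS} ms")
# prints: request #18 is the first to exceed 100 ms
```

Each request takes only 15 ms to process, but the backlog grows by 5 ms per request, so requests start exceeding the timeout after a short burst. Slower rules have the same effect as a larger service_time_ms.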

Look at response time

Requests that are timing out will have a high response time, exceeding the default timeout of 100ms.
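
If you export request data for analysis, a filter like the following can separate likely timeouts from likely mismatches. This is a Python sketch under stated assumptions: the record fields (path, agent_code, response_ms) and the sample values are hypothetical and do not reflect an actual export format.

```python
TIMEOUT_MS = 100  # default module timeout mentioned above
ERROR_AGENT_CODES = {-2, -1, 0}


def likely_timed_out(request):
    """True if the request has an error agent response code and a
    response time above the default timeout."""
    return (
        request["agent_code"] in ERROR_AGENT_CODES
        and request["response_ms"] > TIMEOUT_MS
    )


# Hypothetical sample records for illustration only.
requests = [
    {"path": "/login", "agent_code": 200, "response_ms": 12},
    {"path": "/search", "agent_code": -1, "response_ms": 140},
    {"path": "/oldurl", "agent_code": 0, "response_ms": 3},
]

suspects = [r["path"] for r in requests if likely_timed_out(r)]
print(suspects)
```

In this sample, only /search is flagged. The /oldurl request has an error agent response code but a low response time, which points toward a request and response mismatch rather than a timeout.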

Look at agent metrics

From the Agents page in the Next-Gen WAF control panel, you can access metrics for each agent. These metrics can help you diagnose the issue.

Connections dropped

The Connections dropped metric indicates the number of requests that were allowed through without being processed by the agent (that is, "dropped").

CPU usage

The CPU metrics can indicate the host is overloaded, preventing it from processing requests quickly enough.

The Host CPU metric indicates the CPU percentage for all cores together (100% is maximum).
The Agent CPU metric indicates the total CPU percentage for the number of cores in use by the agent. For example, if the agent were using 4 cores, then 400% would be the maximum.
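
Because the two metrics use different scales, it can be useful to normalize the Agent CPU value before comparing it with Host CPU. The following Python sketch shows the arithmetic; the function names are hypothetical.

```python
def agent_cpu_max_percent(cores_in_use):
    """Maximum possible Agent CPU value: 100% per core in use."""
    return cores_in_use * 100


def agent_cpu_utilization(agent_cpu_percent, cores_in_use):
    """Fraction (0.0 to 1.0) of the agent's available CPU that is in use."""
    return agent_cpu_percent / agent_cpu_max_percent(cores_in_use)


# An agent using 4 cores and reporting 300% is at 75% of its maximum.
print(agent_cpu_max_percent(4))       # 400
print(agent_cpu_utilization(300, 4))  # 0.75
```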

CPU allocation and containerization

There are known issues with agents running within containers. It's possible for agents to have insufficient CPU to process requests, due to a low number of CPUs (cores) allocated to the container by the cgroups feature.

We recommend giving the container running the agent at least 1 CPU. If both NGINX and the agent are running in the same container, then we recommend allocating at least 1.5 CPUs.
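
One way to check a container's CPU allocation is to read its cgroup limit. The following Python sketch assumes cgroup v2, where the limit is typically exposed inside the container in /sys/fs/cgroup/cpu.max as "<quota> <period>", or "max <period>" when no limit is set. It only parses that format and compares the result with the minimums recommended above; cgroup v1 uses different files and is not covered. The function names are hypothetical.

```python
def parse_cpu_max(contents):
    """Parse cgroup v2 cpu.max contents ("<quota> <period>") into a CPU
    count, or None when the quota is "max" (no limit)."""
    quota, period = contents.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def meets_recommendation(cpu_limit, shares_container_with_nginx):
    """Compare a CPU limit with the recommended minimum (1 CPU for the
    agent alone, 1.5 CPUs when NGINX runs in the same container)."""
    minimum = 1.5 if shares_container_with_nginx else 1.0
    return cpu_limit is None or cpu_limit >= minimum


limit = parse_cpu_max("50000 100000")  # half a CPU
print(limit, meets_recommendation(limit, shares_container_with_nginx=False))
```

In this example, a quota of 50000 over a period of 100000 works out to 0.5 CPUs, which is below the recommended minimum for an agent-only container.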

Troubleshooting the 499 agent response code and the 504 HTTP status code

If a client is making a request and the Cloud WAF Application Load Balancer (ALB) does not receive the first header byte within 60 seconds of the TCP connection being established, the requesting client will receive a 504, while the Next-Gen WAF agent will respond with a 499. This means the requesting client, if making a long-standing request through a browser, will receive a 504 error in the browser, while the Next-Gen WAF control panel will show a 499 for the request.

The long-standing request will need to be optimized to meet the 60-second threshold. If the request cannot be optimized, reach out to our support team for additional details.

Relevant timeouts in the Cloud WAF architecture

The Cloud WAF agent has 60 seconds to start sending a response to the ALB
The Cloud WAF agent has 10 seconds to negotiate TLS with the upstream
The Cloud WAF agent has 30 seconds to establish an HTTP connection to the upstream
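
When comparing your own timing measurements against these limits, a lookup table keeps the values in one place. The following Python sketch restates the three timeouts listed above; the dictionary keys and the function name are hypothetical labels chosen for illustration.

```python
# Timeouts from the list above, in seconds. Key names are illustrative.
CLOUD_WAF_TIMEOUTS_S = {
    "response_start_to_alb": 60,
    "upstream_tls_negotiation": 10,
    "upstream_http_connect": 30,
}


def exceeded_timeouts(observed_s):
    """Return the names of timeouts whose observed duration (in seconds)
    is over the limit. Missing measurements are treated as within limits."""
    return [
        name
        for name, limit in CLOUD_WAF_TIMEOUTS_S.items()
        if observed_s.get(name, 0) > limit
    ]


print(exceeded_timeouts({"response_start_to_alb": 75, "upstream_tls_negotiation": 2}))
```

A measurement that exceeds response_start_to_alb corresponds to the case described earlier, where the client receives a 504 and the control panel shows a 499.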

Further help

If you're unable to resolve an agent response code issue, generate an agent diagnostic package by running sigsci-agent-diag, which will output a .tar.gz archive with diagnostic information. Then, reach out to our support team for additional details. When you contact us, be sure to provide the diagnostic .tar.gz archive and include control panel links to the requests and agents affected.
