AI Accelerator

AI Accelerator is a caching solution for artificial intelligence services from providers like OpenAI. By caching large language model (LLM) API responses and leveraging the cache for semantically similar queries, AI Accelerator can reduce latency and lower your LLM API usage costs.

Before you begin

Be sure you know how to access the web interface controls.

AI Accelerator can be enabled in the Fastly web interface by anyone assigned the role of superuser. Once enabled, all account users will be able to view the metrics.

Supported LLMs

AI Accelerator currently supports OpenAI, Azure OpenAI, Gemini, and LLMs with OpenAI-compatible APIs.

Enabling AI Accelerator

To enable AI Accelerator, follow these steps:

  1. Log in to the Fastly web interface.
  2. Go to Tools > AI Accelerator.
  3. Click Enable AI Accelerator.
  4. On the Enable AI Accelerator page, click Enable Now.

Configuring your application to use AI Accelerator

After AI Accelerator is enabled, you'll need to create a read-only API token and update your application to use the AI Accelerator endpoint. Refer to the code examples below if you need help updating your application's code.

OpenAI and OpenAI-compatible code examples

The following Python example shows how to configure the OpenAI client:

from openai import OpenAI

client = OpenAI(
    # Set the API endpoint
    base_url="https://ai.fastly.app/api.openai.com/v1",
    # Set default headers
    default_headers={
        "Fastly-Key": "<FASTLY-KEY>",
    },
)
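
Once the client is configured, your application makes requests exactly as before; only the base URL and the Fastly-Key header change. The following is a minimal sketch, assuming your OpenAI API key is available in the OPENAI_API_KEY environment variable; the model name is only an example.

# The OpenAI API key is still read from the OPENAI_API_KEY environment variable as usual.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name; use the model your application already calls
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response.choices[0].message.content)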

For LLMs with OpenAI-compatible APIs, use https://ai.fastly.app/compat/openai/<llm-endpoint> as the base URL.
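
For example, the following is a minimal sketch of configuring the OpenAI client for an OpenAI-compatible provider. Replace <llm-endpoint> with your provider's OpenAI-compatible API endpoint; the provider API key shown is a hypothetical placeholder.

from openai import OpenAI

client = OpenAI(
    # <llm-endpoint> is your provider's OpenAI-compatible API endpoint
    base_url="https://ai.fastly.app/compat/openai/<llm-endpoint>",
    # Hypothetical placeholder for the provider's own API key
    api_key="<PROVIDER-API-KEY>",
    default_headers={
        "Fastly-Key": "<FASTLY-KEY>",
    },
)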

Azure OpenAI code examples

The following Python example shows how to configure the Azure OpenAI client:

from openai.lib.azure import AzureOpenAI

client = AzureOpenAI(
    api_key=azure_key,  # your Azure OpenAI API key
    api_version="2024-06-01",
    azure_deployment="ai-member-4o-chat",
    # Set the API endpoint
    azure_endpoint="https://ai.fastly.app/<AZURE RESOURCE>.openai.azure.com",
    # Set default headers
    default_headers={
        "Fastly-Key": "<FASTLY-KEY>",
    },
)

Gemini code examples

The following Python example shows how to configure the Vertex AI SDK:

import vertexai
from vertexai.generative_models import GenerativeModel

project_region = "<GCP-REGION>"
project_id = "<GCP-PROJECT-ID>"

vertexai.init(
    location=project_region,
    project=project_id,
    # Route Vertex AI requests through AI Accelerator
    api_endpoint=f"ai.fastly.app/{project_region}-aiplatform.googleapis.com",
    api_transport="rest",
    request_metadata=[("fastly-key", "<FASTLY-KEY>")],
)

model = GenerativeModel("gemini-pro")
print(model.generate_content("Why is the sky blue?"))

Setting and checking headers

You can use the following request and response headers to control and monitor how AI Accelerator caches LLM responses.

  • x-semantic-threshold (request header): Controls the similarity threshold for responses from the semantic cache. The default is 0.75. A lower threshold may increase the likelihood of a cached response at the risk of returning a lower quality response.
  • x-semantic-cache-key (request header): A user-provided value used to segment responses in the cache. Only cached responses with a matching x-semantic-cache-key above the similarity threshold will be returned. This header is optional; if not set, the default value of _default_ is used.
  • Cache-Control (request header): Only the max-age directive is currently respected. If a request sets a Cache-Control header with max-age, that value (in seconds) is used as the TTL of the cache entry, up to a maximum of 30 days.
  • x-semantic-cache (response header): Indicates whether the response was served from the semantic cache. Previously named x-cache. Possible values are HIT or MISS.
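
For example, the following is a minimal sketch of setting these headers on a single request and checking whether the response was served from cache, using the OpenAI Python client configured earlier. The threshold, cache key, and model values are only examples.

# Send per-request caching headers and inspect the cache status of the response.
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o-mini",  # example model name
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    extra_headers={
        "x-semantic-threshold": "0.8",       # example: require closer semantic matches
        "x-semantic-cache-key": "faq-bot",   # example: segment the cache for this workload
        "Cache-Control": "max-age=86400",    # example: cache this response for one day
    },
)
print(raw.headers.get("x-semantic-cache"))   # HIT or MISS
completion = raw.parse()
print(completion.choices[0].message.content)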

About the AI Accelerator page

The AI Accelerator page provides metrics related to requests, tokens, and origin latency. The page displays the following charts:

  • Total requests: The total number of requests sent to AI Accelerator.
  • Tokens served from cache: The estimated number of tokens served from cache, based on the responses served from cache. A token is an LLM billing unit; its exact measure varies by vendor and LLM version.
  • Estimated time saved: The estimated amount of time saved in minutes based on responses served from cache.
  • Requests: The total number of AI Accelerator requests aggregated across your account.
  • Tokens: The estimated number of tokens served from cache or origin.
  • Origin Latency Percentiles: The origin latency percentile approximations.

Purging the cache

IMPORTANT

This information is part of a beta release. For additional details, read our product and feature lifecycle descriptions.

You can purge the entire AI Accelerator cache by using the AI Accelerator API endpoint. For example, you can purge the cache using curl in a terminal application.

$ curl -X POST -H "Fastly-Key: YOUR_FASTLY_TOKEN" https://api.fastly.com/ai_accelerator/expire
NOTE

The API token must have purge_all scope.
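
If you prefer to purge from application code rather than the command line, the following is a minimal sketch using the Python requests library, assuming a token with purge_all scope is available in the FASTLY_API_TOKEN environment variable.

import os
import requests

# Purge the entire AI Accelerator cache.
resp = requests.post(
    "https://api.fastly.com/ai_accelerator/expire",
    headers={"Fastly-Key": os.environ["FASTLY_API_TOKEN"]},
)
resp.raise_for_status()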

Disabling AI Accelerator

To disable AI Accelerator, follow these steps:

  1. Update your application code to remove the AI Accelerator integration.
  2. Log in to the Fastly web interface.
  3. Go to Account > Billing > Overview.
  4. Click Options next to AI Accelerator, then click Cancel.
  5. Click Cancel AI Accelerator.