AI Accelerator
Last updated 2024-12-10
AI Accelerator is a caching solution for artificial intelligence services from providers like OpenAI. By caching large language model (LLM) API responses and leveraging the cache for semantically similar queries, AI Accelerator can reduce latency and lower your LLM API usage costs.
Before you begin
Be sure you know how to access the web interface controls.
AI Accelerator can be enabled in the Fastly control panel by anyone assigned the role of superuser. Once enabled, all account users will be able to view the metrics.
Supported LLMs
AI Accelerator currently supports OpenAI, Azure OpenAI, Gemini, and LLMs with OpenAI-compatible APIs.
Enabling AI Accelerator
To enable AI Accelerator, follow these steps:
- Log in to the Fastly web interface.
- Go to Tools > AI Accelerator.
- Click Enable AI Accelerator.
- On the Enable AI Accelerator page, click Enable Now.
Configuring your application to use AI Accelerator
After AI Accelerator is enabled, you'll need to create a read-only API token and update your application to use the AI Accelerator endpoint. Refer to the code examples below if you need help updating your application's code.
OpenAI and OpenAI-compatible code examples
Python:

```python
from openai import OpenAI

client = OpenAI(
    # Set the API endpoint
    base_url="https://ai.fastly.app/api.openai.com/v1",
    # Set default headers
    default_headers={
        "Fastly-Key": "<FASTLY-KEY>",
    },
)
```
For LLMs with OpenAI-compatible APIs, use https://ai.fastly.app/compat/openai/<llm-endpoint> as the base URL.
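For example, a client for a hypothetical OpenAI-compatible provider hosted at llm.example.com/v1 (a placeholder, not a real endpoint) might look like this:

```python
from openai import OpenAI

client = OpenAI(
    # "llm.example.com/v1" is a hypothetical OpenAI-compatible endpoint;
    # substitute your provider's actual host and path.
    base_url="https://ai.fastly.app/compat/openai/llm.example.com/v1",
    default_headers={
        "Fastly-Key": "<FASTLY-KEY>",
    },
)
```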
Azure OpenAI code examples
Python:

```python
from openai.lib.azure import AzureOpenAI

client = AzureOpenAI(
    api_key=azure_key,  # your Azure OpenAI API key
    api_version="2024-06-01",
    azure_deployment="ai-member-4o-chat",
    # Route requests through the AI Accelerator endpoint
    azure_endpoint="https://ai.fastly.app/<AZURE RESOURCE>.openai.azure.com",
    default_headers={
        "Fastly-Key": "<FASTLY-KEY>",
    },
)
```
Gemini code examples
Python:

```python
import vertexai
from vertexai.generative_models import GenerativeModel

project_region = "<GCP-REGION>"
project_id = "<GCP-PROJECT-ID>"

vertexai.init(
    location=project_region,
    project=project_id,
    # Route requests through the AI Accelerator endpoint
    api_endpoint=f"ai.fastly.app/{project_region}-aiplatform.googleapis.com",
    api_transport="rest",
    request_metadata=[("fastly-key", "<FASTLY-KEY>")],
)

model = GenerativeModel("gemini-pro")
print(model.generate_content("Why is the sky blue?"))
```
Setting and checking headers
You can use the following request and response headers to control and monitor how AI Accelerator caches LLM responses.
Header name | Type | Description |
---|---|---|
x-semantic-threshold | Request header | Controls the similarity threshold for responses from the semantic cache. The default is 0.75. A lower threshold may increase the likelihood of a cached response at the risk of returning a lower quality response. |
x-semantic-cache-key | Request header | User-provided value that is used to segment responses in the cache. Only requests with a matching x-semantic-cache-key above the similarity threshold will be returned as a response. Optional; if not set, the default value of _default_ is used. |
Cache-Control | Request header | We currently only respect the max-age cache control directive. If a Cache-Control header with a max-age is set on a request, that value (in seconds) is used as the TTL on the cache entry, up to a maximum TTL of 30 days. |
x-semantic-cache | Response header | Previously x-cache. Possible values are HIT or MISS. |
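As an illustrative sketch, reusing the OpenAI Python client from the earlier example (the model name and header values here are placeholders, not recommendations), you can send the request headers per call and read the response header from the raw HTTP response:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://ai.fastly.app/api.openai.com/v1",
    default_headers={"Fastly-Key": "<FASTLY-KEY>"},
)

# Send per-request caching headers, and capture the raw HTTP response
# so the x-semantic-cache response header can be inspected.
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    extra_headers={
        "x-semantic-threshold": "0.85",    # stricter similarity matching
        "x-semantic-cache-key": "my-app",  # segment this app's cache entries
        "Cache-Control": "max-age=3600",   # cache the response for one hour
    },
)

print(raw.headers.get("x-semantic-cache"))  # "HIT" or "MISS"
completion = raw.parse()  # the usual ChatCompletion object
print(completion.choices[0].message.content)
```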
About the AI Accelerator page
The AI Accelerator page provides metrics related to requests, tokens, and origin latency. The page displays the following charts:
- Total requests: The total number of requests sent to AI Accelerator.
- Tokens served from cache: The estimated number of tokens served from cache based on responses served from cache. A token is an LLM billing unit; exactly what counts as a token varies by LLM vendor and model version.
- Estimated time saved: The estimated amount of time saved in minutes based on responses served from cache.
- Requests: The total number of AI Accelerator requests aggregated across your account.
- Tokens: The estimated number of tokens served from cache or origin.
- Origin Latency Percentiles: The origin latency percentile approximations.
Purging the cache
IMPORTANT
This information is part of a beta release. For additional details, read our product and feature lifecycle descriptions.
You can purge the entire cache by using the AI Accelerator API endpoint. For example, you can purge it using curl in a terminal application:
```sh
$ curl -X POST -H "Fastly-Key: YOUR_FASTLY_TOKEN" https://api.fastly.com/ai_accelerator/expire
```
NOTE
The API token must have the purge_all scope.
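If you prefer to purge programmatically, here is a minimal Python sketch using the requests library, assuming a token with the purge_all scope is stored in a FASTLY_TOKEN environment variable:

```python
import os

import requests

# Reads an API token with the purge_all scope from the environment.
response = requests.post(
    "https://api.fastly.com/ai_accelerator/expire",
    headers={"Fastly-Key": os.environ["FASTLY_TOKEN"]},
)
response.raise_for_status()  # raises if the purge request failed
```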
Disabling AI Accelerator
To disable AI Accelerator, follow these steps:
- Update your application code to remove the AI Accelerator integration (see the sketch after these steps).
- Log in to the Fastly web interface.
- Go to Account > Billing > Overview.
- Click Options next to AI Accelerator, then click Cancel.
- Click Cancel AI Accelerator.
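For reference, removing the integration from the first OpenAI example above typically means dropping the custom base URL and default headers so the client falls back to the provider's default endpoint; a minimal sketch:

```python
from openai import OpenAI

# With no base_url override, the client uses OpenAI's default
# endpoint (https://api.openai.com/v1) instead of AI Accelerator.
client = OpenAI()
```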