AI Accelerator
Last updated 2024-12-10
AI Accelerator is a caching solution for artificial intelligence services from providers like OpenAI. By caching large language model (LLM) API responses and leveraging the cache for semantically similar queries, AI Accelerator can reduce latency and lower your LLM API usage costs.
Before you begin
Be sure you know how to access the web interface controls.
AI Accelerator can be enabled in the Fastly control panel by anyone assigned the role of superuser. Once enabled, all account users will be able to view the metrics.
Supported LLMs
AI Accelerator currently supports OpenAI, Azure OpenAI, Gemini, and LLMs with OpenAI-compatible APIs.
Enabling AI Accelerator
To enable AI Accelerator, follow these steps:
- Log in to the Fastly web interface.
- Go to Tools > AI Accelerator.
- Click Enable AI Accelerator.
- On the Enable AI Accelerator page, click Enable Now.
Configuring your application to use AI Accelerator
After AI Accelerator is enabled, you'll need to create a read-only API token and update your application to use the AI Accelerator endpoint. Refer to the code examples below if you need help updating your application's code.
OpenAI and OpenAI-compatible code examples
Python:

```python
from openai import OpenAI

client = OpenAI(
    # Set the API endpoint
    base_url="https://ai.fastly.app/api.openai.com/v1",
    # Set default headers
    default_headers={
        "Fastly-Key": "<FASTLY-KEY>",
    },
)
```
For LLMs with OpenAI-compatible APIs, use https://ai.fastly.app/compat/openai/<llm-endpoint> as the base URL.
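For example, a client for a hypothetical OpenAI-compatible provider hosted at llm.example.com/v1 (a placeholder, not a real endpoint) might look like this:

```python
from openai import OpenAI

client = OpenAI(
    # "llm.example.com/v1" is a hypothetical OpenAI-compatible endpoint;
    # substitute your provider's actual host and path.
    base_url="https://ai.fastly.app/compat/openai/llm.example.com/v1",
    default_headers={
        "Fastly-Key": "<FASTLY-KEY>",
    },
)
```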
Azure OpenAI code examples
Python:

```python
from openai.lib.azure import AzureOpenAI

client = AzureOpenAI(
    api_key=azure_key,  # your Azure OpenAI API key
    api_version="2024-06-01",
    azure_deployment="ai-member-4o-chat",
    # Route requests through the AI Accelerator endpoint
    azure_endpoint="https://ai.fastly.app/<AZURE RESOURCE>.openai.azure.com",
    default_headers={
        "Fastly-Key": "<FASTLY-KEY>",
    },
)
```
Gemini code examples
Python:

```python
import vertexai
from vertexai.generative_models import GenerativeModel

project_region = "<GCP-REGION>"
project_id = "<GCP-PROJECT-ID>"

vertexai.init(
    location=project_region,
    project=project_id,
    # Route requests through the AI Accelerator endpoint
    api_endpoint=f"ai.fastly.app/{project_region}-aiplatform.googleapis.com",
    api_transport="rest",
    request_metadata=[("fastly-key", "<FASTLY-KEY>")],
)

model = GenerativeModel("gemini-pro")
print(model.generate_content("Why is the sky blue?"))
```
Setting and checking headers
You can use the following request and response headers to control and monitor how AI Accelerator caches LLM responses.
Header name | Type | Description |
---|---|---|
x-semantic-threshold | Request header | Controls the similarity threshold for responses from the semantic cache. The default is 0.75. A lower threshold may increase the likelihood of a cached response at the risk of returning a lower quality response. |
x-semantic-cache-key | Request header | User-provided value that is used to segment responses in the cache. Only requests with a matching x-semantic-cache-key above the similarity threshold will be returned as a response. Optional; if not set, the default value of _default_ is used. |
Cache-Control | Request header | We currently only respect the max-age cache control directive. If a Cache-Control header with a max-age is set on a request, that value (in seconds) is used as the TTL on the cache entry, up to a maximum TTL of 30 days. |
x-semantic-cache | Response header | Previously x-cache. Possible values are HIT or MISS. |
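As an illustrative sketch, reusing the OpenAI Python client from the earlier example (the model name and header values here are placeholders, not recommendations), you can send the request headers per call and read the response header from the raw HTTP response:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://ai.fastly.app/api.openai.com/v1",
    default_headers={"Fastly-Key": "<FASTLY-KEY>"},
)

# Send per-request caching headers, and capture the raw HTTP response
# so the x-semantic-cache response header can be inspected.
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    extra_headers={
        "x-semantic-threshold": "0.85",    # stricter similarity matching
        "x-semantic-cache-key": "my-app",  # segment this app's cache entries
        "Cache-Control": "max-age=3600",   # cache the response for one hour
    },
)

print(raw.headers.get("x-semantic-cache"))  # "HIT" or "MISS"
completion = raw.parse()  # the usual ChatCompletion object
print(completion.choices[0].message.content)
```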
About the AI Accelerator page
The AI Accelerator page provides metrics related to requests, tokens, and origin latency. The page displays the following charts:
- Total requests: The total number of requests sent to AI Accelerator.
- Tokens served from cache: The estimated number of tokens served from cache based on responses served from cache. A token is an LLM billing unit; exactly what counts as a token varies by LLM vendor and model version.
- Estimated time saved: The estimated amount of time saved in minutes based on responses served from cache.
- Requests: The total number of AI Accelerator requests aggregated across your account.
- Tokens: The estimated number of tokens served from cache or origin.
- Origin Latency Percentiles: The origin latency percentile approximations.
Purging the cache
IMPORTANT
This information is part of a beta release. For additional details, read our product and feature lifecycle descriptions.
You can purge the entire cache by using the AI Accelerator API endpoint. For example, you can purge it using curl in a terminal application:
```sh
$ curl -X POST -H "Fastly-Key: YOUR_FASTLY_TOKEN" https://api.fastly.com/ai_accelerator/expire
```
NOTE
The API token must have the purge_all scope.
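If you prefer to purge programmatically, here is a minimal Python sketch using the requests library, assuming a token with the purge_all scope is stored in a FASTLY_TOKEN environment variable:

```python
import os

import requests

# Reads an API token with the purge_all scope from the environment.
response = requests.post(
    "https://api.fastly.com/ai_accelerator/expire",
    headers={"Fastly-Key": os.environ["FASTLY_TOKEN"]},
)
response.raise_for_status()  # raises if the purge request failed
```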
Disabling AI Accelerator
To disable AI Accelerator, follow these steps:
- Update your application code to remove the AI Accelerator integration (see the sketch after these steps).
- Log in to the Fastly web interface.
- Go to Account > Billing > Overview.
- Click Options next to AI Accelerator, then click Cancel.
- Click Cancel AI Accelerator.
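For reference, removing the integration from the first OpenAI example above typically means dropping the custom base URL and default headers so the client falls back to the provider's default endpoint; a minimal sketch:

```python
from openai import OpenAI

# With no base_url override, the client uses OpenAI's default
# endpoint (https://api.openai.com/v1) instead of AI Accelerator.
client = OpenAI()
```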