
Extracting Your Data

Updated Mar 27, 2023

Signal Sciences stores requests that contain attacks and anomalies, with some qualifications; see Privacy and Sampling. If you would like to extract this data in bulk for ingestion into your own systems, we offer a request feed API endpoint that provides a feed of recent request data, suitable for calling from (for example) an hourly cron job.

This functionality is typically used by SOC teams to automatically import data into SIEMs such as Splunk and ELK, as well as other commercial systems.

Data extraction vs searching

We have a separate API endpoint for searching request data. It is intended for finding requests that meet specific criteria, rather than for bulk data extraction:

| Searching | Data Extraction |
|---|---|
| Search using full query syntax | Returns all requests, optionally filtered by signals |
| Limited to 1,000 requests | Returns all requests |
| Window: up to 7 days at a time | Window: past 24 hours |
| Retention: 30 days | Retention: 24 hours |

Time span restrictions

The following restrictions are in effect when using this endpoint:

  • The until parameter can be no more recent than five minutes in the past. This allows our data pipeline sufficient time to process incoming requests; see below.
  • The from parameter can be no older than 24 hours and five minutes in the past.
  • Both the from and until parameters must fall on full minute boundaries.
  • Both the from and until parameters must be Unix timestamps with second-level precision (e.g., 1445437680); see the sketch after this list.
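
For illustration, here is a minimal sketch (using only the Python standard library; the variable names are our own) of computing a from/until pair that satisfies these rules, ending five minutes in the past on a full minute boundary:

import calendar
from datetime import datetime, timedelta

# Latest allowed "until": five minutes ago, truncated to a full minute boundary
until_dt = (datetime.utcnow() - timedelta(minutes=5)).replace(second=0, microsecond=0)
# Pull one hour of data; any window within the past 24 hours and five minutes works
from_dt = until_dt - timedelta(hours=1)

# The endpoint expects Unix timestamps with second-level precision
until_ts = calendar.timegm(until_dt.utctimetuple())
from_ts = calendar.timegm(from_dt.utctimetuple())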

Delayed data

A five-minute delay is enforced to build in time to collect and aggregate data across all of your running agents, and then ingest, analyze, and augment the data in our systems. The five-minute delay is a tradeoff between timeliness and completeness of the data.

Pagination

This endpoint returns data 1,000 requests at a time. If the specified time span contains more than 1,000 requests, a next URL will be provided to retrieve the next batch. Each next URL is valid for one minute from the time it is generated.
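
For reference, each page of results is a JSON object of roughly the following shape (a minimal sketch showing only the fields consumed by the example script below; real entries contain full request records):

# Sketch of the response envelope returned by the feed endpoint
page = {
    "data": [],           # up to 1,000 request objects per page
    "next": {"uri": ""},  # relative URL of the next page; empty string on the last page
}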

Sort order

As a result of our data warehousing implementation, the data you get back from this endpoint will be complete for the time span specified, but it is not guaranteed to be sorted. Once all data for the given time span has been accumulated, it can be sorted using the timestamp field if necessary.
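
For example, if a strict ordering is needed once all pages have been fetched (assuming the request objects have been collected into a list named all_requests; the example script below streams them to stdout instead):

# Sort client-side by the timestamp field after the full time span has been retrieved
all_requests.sort(key=lambda r: r['timestamp'])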

Rate limiting

Limits for concurrent connections to this endpoint:

  • Two per site
  • Five per corp

Example usage

A common way to use this endpoint is to set up a cron job that runs at five minutes past each hour and fetches the previous full hour’s worth of data. In the example below, we calculate the previous full hour’s start and end timestamps and use them to call the API.

Python

import sys, requests, os, calendar, json
from datetime import datetime, timedelta

# Initial setup
api_host = 'https://dashboard.signalsciences.net'
email = os.environ.get('SIGSCI_EMAIL')
password = os.environ.get('SIGSCI_PASSWORD')
corp_name = 'testcorp'
site_name = 'www.example.com'

# Calculate UTC timestamps for the previous full hour
# For example, if now is 9:05 AM UTC, the timestamps will be 8:00 AM and 9:00 AM
until_time = datetime.utcnow().replace(minute=0, second=0, microsecond=0)
from_time = until_time - timedelta(hours=1)
until_time = calendar.timegm(until_time.utctimetuple())
from_time = calendar.timegm(from_time.utctimetuple())

# Authenticate
auth = requests.post(
    api_host + '/api/v0/auth',
    data = {"email": email, "password": password}
)

if auth.status_code == 401:
    print('Invalid login.')
    sys.exit(1)
elif auth.status_code != 200:
    print('Unexpected status: %s response: %s' % (auth.status_code, auth.text))
    sys.exit(1)

parsed_response = auth.json()
token = parsed_response['token']

# Loop across all the data and output it in one big JSON object
headers = {
    'Content-type': 'application/json',
    'Authorization': 'Bearer %s' % token
}
url = api_host + ('/api/v0/corps/%s/sites/%s/feed/requests?from=%s&until=%s' % (corp_name, site_name, from_time, until_time))
first = True

print('{ "data": [')

while True:
    response_raw = requests.get(url, headers=headers)
    response = json.loads(response_raw.text)

    for request in response['data']:
        data = json.dumps(request)
        if first:
            first = False
        else:
            data = ',\n' + data
        sys.stdout.write(data)

    next_url = response['next']['uri']
    if next_url == '':
        break
    url = api_host + next_url

print('\n] }')
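
To schedule the script as described above, a crontab entry along these lines could be used (the script path and output location are hypothetical):

# Run at five minutes past every hour and append the JSON output to a file
5 * * * * /usr/bin/python3 /opt/sigsci/extract_requests.py >> /var/log/sigsci/requests.json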