What external data can be used
With Athena it is possible to add and integrate external data of many different formats, including common formats like JSON, CSV, and parquet, into into investigations queries. The basic process for doing this will be fairly similar regardless of format:
- Determine the schema for your data
- Load the data into S3
- Create the table
- Join the external data with internal data
Example External data
This example is a sample log from a ssh proxy that allows users to login to servers with their real credentials.
session_id, login_date, real_user, host, local_user
Load the data into S3
In order for the Athena to access the data it must be uploaded to S3. Many tools have the ability to output logs to S3 but for this example because of the small file size simply uploading it is sufficient. Data can be uploaded to S3 as one or more files and because of this it is generally a good idea to put them in a folder matching the name of the table you will create in the next step.
Determining the Schema
Based on the data above the data types for this file would be:
Athena Schemas are sql schemas and are created with a CREATE EXTERNAL TABLE statement. For this table the create statement would look like:
CREATE EXTERNAL TABLE proxy_logs (
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",", "quoteChar" = "\"" )
Running this in the Athena or Capsule8 console creates a table proxy_logs. Additionally it tells Athena that the files containing its data is located in [BUCKET_NAME]/proxy logs/ where [BUCKET_NAME] is replaced with the bucket from the pervious step.
Join the external data with internal data
With the table created the data is now exposed in Athena. You can see it with a select statement:
SELECT * FROM proxy_logs ORDER BY session_id DESC LIMIT 100
which will get the last 100 log in sessions. This also means that it is possible to join the new table with investigations data. For example if you wanted to get all of the users who logged into a host in the hour prior to an alert:
FROM proxy_logs pl
LEFT JOIN sensors s
ON s.hostname = pl.host
LEFT JOIN alerts a
ON s.sensor_id = a.sensor_id
WHERE a.incident_id = 'adde60c8-4c72-4662-8266-5d968cca20e4' AND date_diff('day', date_parse(login_date, '%Y-%m-%d %H:%i:%s'), from_unixtime(a.unix_nano_timestamp / 1e9)) BETWEEN -1 AND 0