dataShark Docs
===================
This document describes how to create your own Spark plugin code to tackle a new
use case. All plugin code resides in the `conf` folder of dataShark.
Each plugin consists of a `.conf` file and a `.py` file: the `.conf` file defines a few
necessary flags specific to the plugin, and the `.py` file implements the `load` function.
```text
port = 9200
index_name = logstash-index
doc_type = docs
pkey = sourceip
score = anomaly_score
title = Name for ES Document
debug = false
```
Each key must correspond to a key in the JSON document in the Kafka stream, and the
*value* has to be a regex pattern that matches the content of that key.
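For illustration only, such an entry might look like the following; the key name and the
pattern are hypothetical, not defaults shipped with dataShark:

```text
sourceip = ^192\.168\.\d{1,3}\.\d{1,3}$
```

Here the incoming JSON document is expected to contain a `sourceip` key whose value
matches the given regex.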
The `load` function must return a processed *DStream*. Each record in the DStream's
RDDs should be a tuple of the following format (this format is required by the
output plugins):
`('primary_key', anomaly_score, {"some_metadata": "dictionary here"})`
> 1. *primary_key* is a string. It is the tagging metric by which the data was aggregated for map-reduce and finally scored.
> 2. *anomaly_score* is a float. It is the value used to quantify the deviation from normal behavior.
> 3. *metadata* is a dictionary. It is the extra data that gets inserted into the Elasticsearch document or added to the CSV as extra columns.
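As a concrete illustration of this contract, here is a minimal sketch of a `load`
function; the field names (`sourceip`, `bytes`), the aggregation, and the scoring logic
are purely hypothetical, and the exact signature of `load` may differ in your version of
dataShark:

```python
def load(dstream):
    """Hypothetical sketch: aggregate events by source IP and score them.

    Assumes `dstream` carries parsed JSON events as dictionaries containing
    "sourceip" and "bytes" fields; adapt the field names and the scoring
    logic to your own use case.
    """
    # Group events by the primary key (the source IP) and sum the bytes moved.
    totals = (dstream
              .map(lambda event: (event.get("sourceip", "unknown"),
                                  float(event.get("bytes", 0))))
              .reduceByKey(lambda a, b: a + b))

    # Emit tuples in the required ('primary_key', anomaly_score, metadata) shape.
    return totals.map(lambda kv: (kv[0],                    # primary_key (string)
                                  kv[1] / 1024.0,           # toy anomaly score (float)
                                  {"total_bytes": kv[1]}))  # metadata (dictionary)
```

The returned DStream can then be handed straight to any of the output plugins described
below.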
-------------------
Output Plugins
=============
dataShark provides the following 3 output plugins out-of-the-box for storing data:
> 1. Elasticsearch
> 2. Syslog
> 3. CSV
Each of these plugins requires its own basic set of settings, described below.
The Elasticsearch output plugin allows you to easily push JSON documents to your
Elasticsearch node. This lets users build Kibana visualizations over the processed
data.
```text
output = elasticsearch
[elasticsearch]
host = 127.0.0.1
port = 9200
index_name = usecase
doc_type = spark-driver
pkey = source_ip
score = anomaly_score
title = Use Case
debug = false
```
All settings in the config are optional. Their default values are displayed in the
config above.
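As a rough, hypothetical illustration (the exact document layout is not guaranteed; the
assumption here is that `pkey`, `score`, and `title` name the fields under which the
tuple's primary key, anomaly score, and use-case title are stored), a result tuple such
as `('10.0.0.5', 3.7, {"total_bytes": 3788})` could end up as a document along these
lines:

```text
{
  "source_ip": "10.0.0.5",
  "anomaly_score": 3.7,
  "title": "Use Case",
  "total_bytes": 3788
}
```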
The Syslog output plugin sends JSON documents to the specified Syslog server IP
and port. The following is a sample configuration with the plugin's default settings
(all settings are optional):
```
output = syslog
[syslog]
host = 127.0.0.1
port = 514
pkey = source_ip
score = anomaly_score
title = Use Case Title
debug = false
```
The settings are similar to those of the Elasticsearch plugin.
The CSV output plugin writes and appends output from a Spark use case to a
specified CSV file. The following is a sample configuration with the plugin's default
settings (all settings are optional):
```
output = csv
[csv]
path = UseCase.csv
separator = ,
quote_char = '"'
title = Use Case
debug = false
```
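For illustration only (the column layout depends on the metadata your use case returns,
so this is an assumption rather than the exact format), a result tuple such as
`('10.0.0.5', 3.7, {"total_bytes": 3788})` might be appended with the default separator
and quote character roughly as:

```text
"10.0.0.5",3.7,3788
```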