Learning Bosun
#bosun
[Link]
Table of Contents
About
Remarks
Versions
Examples
Sample Alert
Examples
Overview
The meat of it
Examples
Template Def
Alert Definition
Alert Explanation
Notification Preview
Example Section of [Link] referencing the config for httpunit test cases:
Header Template
Header Template
Template Definition
Alert Definition
Notification Preview
Examples
Chapter 5: lscount
Parameters
Remarks
Deprecation
Caveats
Examples
Chapter 6: lsstat
Parameters
Remarks
Deprecation
Caveats
Examples
Remarks
Examples
Slack Notifications
HipChat
Syntax
Remarks
Examples
Email Notifications
Overview
PagerDuty Notifications
Remarks
Examples
Remarks
Examples
Remarks
Examples
Remarks
Examples
Examples
Squelching a host
Remarks
Examples
Examples
HTTPGetJSON
Syntax
Remarks
Examples
Credits
About
You can share this PDF with anyone you feel could benefit from it. Download the latest version
from: bosun
It is an unofficial and free Bosun ebook created for educational purposes. All the content is
extracted from Stack Overflow Documentation, which is written by many hardworking individuals at
Stack Overflow. It is affiliated with neither Stack Overflow nor the official Bosun project.
The content is released under Creative Commons BY-SA, and the list of contributors to each
chapter are provided in the credits section at the end of this book. Images may be copyright of
their respective owners unless otherwise specified. All trademarks and registered trademarks are
the property of their respective company owners.
Use the content presented in this book at your own risk; it is not guaranteed to be correct or
accurate. Please send your feedback and corrections to info@[Link]
[Link] 1
Chapter 1: Getting started with Bosun
Remarks
Bosun is an open-source, MIT licensed, monitoring and alerting system created by Stack
Overflow. It has an expressive domain specific language for evaluating alerts and creating detailed
notifications. It also lets you test your alerts against historical data for a faster development
experience. More details at [Link]
Bosun uses a config file to store all the system settings, macros, lookups, notifications, templates,
and alert definitions. You specify the config file to use when starting the server, for example
/opt/bosun/bosun -c /opt/bosun/config/[Link]. Changes to the file will not be activated until
bosun is restarted, and it is highly recommended that you store the file in version control.
Versions
0.3.0 2015-06-13
0.4.0 2015-09-18
0.5.0 2016-03-15
Examples
Sample Alert
Bosun alerts are defined in the config file using a custom DSL. They use functions to evaluate time
series data and will generate alerts when the warn or crit expressions are non-zero. Alerts use
templates to include additional information in the notifications, which are usually an email message
and/or HTTP POST request.
template [Link] {
body = `<p>Alert: {{.[Link]}} triggered on {{.[Link]}}
<hr>
<p><strong>Computation</strong>
<table>
{{range .Computations}}
<tr><td><a href="{{$.Expr .Text}}">{{.Text}}</a></td><td>{{.Value}}</td></tr>
{{end}}
</table>
<hr>
{{ .Graph .[Link] }}`
}
notification [Link] {
email = alerts@[Link]
}
alert [Link] {
template = [Link]
$q = avg(q("sum:rate:[Link]{host=*,type=idle}", "1m"))
crit = $q < 40
notification = [Link]
}
The alert would send an email with the subject Critical: [Link] cpu idle at 25% on hostname
for any host whose idle CPU has averaged less than 40% over the last 1 minute. This
example is a "host scoped" alert, but Bosun also supports cluster, datacenter, or globally scoped
alerts (see the fundamentals video series for more details).
tsdbHost = localhost:4242
httpListen = :8070
smtpHost = localhost:25
emailFrom = bosun@[Link]
timeAndDate = 202,75,179,136
ledisDir = ../ledis_data
checkFrequency = 5m
notification [Link] {
email = alerts@[Link]
print = true
}
In this case the config file indicates Bosun should connect to a local OpenTSDB instance on port
4242, listen for requests on port 8070 (on all IP addresses bound to the host), use the localhost
SMTP system for email, display additional time zones, use built in Ledis instead of external Redis
for system state, and default alerts to a 5 minute interval.
The config also defines an [Link] that can be assigned to alerts, which would usually
be included at the end of the config file (see sample alert example).
The quick start guide includes information about using Docker to stand up a Bosun instance.
This will create a new instance of Bosun which you can access by opening a browser to
[Link] The docker image includes HBase/OpenTSDB for storing time series data,
the Bosun server, and Scollector for gathering metrics from inside the bosun container. You can
then point additional scollector instances at the Bosun server and use Grafana to create
dashboards of OpenTSDB or Bosun metrics.
The Stackexchange/Bosun image is designed only for testing. There are no alerts defined in the
config file and the data will be deleted when the docker image is removed, but it is very helpful for
getting a feel for how bosun works. For details on creating a production instance of Bosun see
[Link]
Chapter 2: Alerts: Advanced Scoping
Examples
Understanding the transpose function: t()
Overview
The transpose function is one of Bosun's more powerful functions, but it also takes effort to
understand. It is powerful because it lets us alert at different levels than the tag structure of the
underlying data.
Transpose changes the scope of your alert. This lets you scope things into larger collections. So
for example if you have queries that return a scope of host,cluster and want to alert based on
cluster health and not individual hosts, transpose can be used to do this.
What is scope?
Scope is the list of tag keys that make up your final result. For example:
• If the scope is host, you get per host results in your alerts.
• If your scope is empty (no tag keys) then you could only possibly get one alert.
• If your scope is host,iface you could get alerts for every interface on every host in the result.
So the alerts we get are tied to the tags for the data. The transpose function allows us to alert at
different scopes other than the metric tag structure. So we can query things that result in
host,cluster but alert at a cluster scope.
So transpose takes a numberSet and a scope (a.k.a. group) for the result, and returns a seriesSet.
The results of many functions in bosun are sets, usually a numberSet or seriesSet. The entire set
in the result shares the same tag keys. And each item in the set is unique to the value of each
corresponding key. If the value of each item in the set is a series
(timestamp:value,timestamp:value) then we have a seriesSet. If the value of each item is just a
number, then we have a numberSet.
The meat of it
Transpose takes a numberSet and returns a seriesSet with a larger scope (fewer tag keys). The
resulting seriesSet is a bit strange because the index is not time, as is usual for a
seriesSet. The time value is just an index number and should therefore be ignored.
So we end up transposing set items into values of the resulting set, where the resulting set type (a
seriesSet) can hold multiple values:
# Turn each item in the set into a numberSet by reducing it via average
$avgConnByHostCluster = avg($connByHostCluster)
# Transpose to new scope
$clusterScope = t($avgConnByHostCluster, "cluster")
You can now do neat things with each item that represents a cluster. For example, you could use
sum($clusterScope > 5) (note that $clusterScope is a seriesSet) to get the count of hosts in each
cluster whose rate is above five, and then alert if that count is greater than a certain value.
You could also use len($clusterScope) to get the number of hosts in each cluster, and alert on
the count of hosts above the threshold relative to the number of hosts in the cluster.
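To make the regrouping concrete, here is a minimal Go sketch of the idea behind t(). It is not Bosun's implementation: the item type, the transpose and countAbove helpers, and the sample host and cluster values are all hypothetical, chosen only to mirror $avgConnByHostCluster and $clusterScope from the example above.

```go
package main

import (
	"fmt"
	"sort"
)

// item is a hypothetical stand-in for one entry of a numberSet:
// a tag set plus a single value.
type item struct {
	tags  map[string]string
	value float64
}

// transpose groups the values of a numberSet by the given tag key.
// Each group plays the role of one series in the resulting seriesSet,
// where the position in the slice is only an index, not a timestamp.
func transpose(set []item, key string) map[string][]float64 {
	out := make(map[string][]float64)
	for _, it := range set {
		group := it.tags[key]
		out[group] = append(out[group], it.value)
	}
	return out
}

// countAbove mimics sum($clusterScope > 5): it counts the values in a
// group that exceed the threshold.
func countAbove(values []float64, threshold float64) int {
	n := 0
	for _, v := range values {
		if v > threshold {
			n++
		}
	}
	return n
}

func main() {
	// Hypothetical per-host averages, scoped by host and cluster.
	avgConnByHostCluster := []item{
		{map[string]string{"host": "web01", "cluster": "a"}, 7},
		{map[string]string{"host": "web02", "cluster": "a"}, 3},
		{map[string]string{"host": "web03", "cluster": "b"}, 9},
	}
	clusterScope := transpose(avgConnByHostCluster, "cluster")

	// Sort the cluster names so the output order is stable.
	clusters := make([]string, 0, len(clusterScope))
	for c := range clusterScope {
		clusters = append(clusters, c)
	}
	sort.Strings(clusters)
	for _, c := range clusters {
		vals := clusterScope[c]
		fmt.Printf("cluster=%s hosts=%d above5=%d\n", c, len(vals), countAbove(vals, 5))
	}
}
```

The length of each group corresponds to len($clusterScope), and countAbove corresponds to summing a boolean comparison over the transposed series.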
Chapter 3: Complete Examples
Examples
SSL Certs Expiring
This data is collected by http_unit and scollector. The alert warns when a cert is going to expire
within a certain number of days, and then goes critical if the cert has passed its expiration date.
This follows the recommended default of warn and crit usage in Bosun (warn: something is going
to fail, crit: something has failed).
Template Def
template [Link] {
subject = {{.[Link]}}: SSL Cert Expiring in {{.Eval .[Link] | printf "%.2f"}} Days for {{.Group.url_host}}
body = `
{{ template "header" . }}
<table>
<tr>
<td>Url</td>
<td>{{.Group.url_host}}</td>
</tr>
<tr>
<td>IP Address Used for Test</td>
<td>{{.[Link]}}</td>
</tr>
<tr>
<td>Days Remaining</td>
<td>{{.Eval .[Link] | printf "%.2f"}}</td>
</tr>
<tr>
<td>Expiration Date</td>
<td>{{.[Link] (parseDuration (.Eval .[Link] | printf "%vh")) }}</td>
</tr>
</table>
`
}
Alert Definition
alert [Link] {
template = [Link]
ignoreUnknown = true
$notes = This alert exists to notify us of any SSL certs that will be expiring for hosts monitored by our http unit test cases defined in the scollector configuration file.
$expireEpoch = last(q("min:[Link]{host=ny-bosun01,url_host=*,ip=*}", "1h", ""))
$hoursLeft = ($expireEpoch - epoch()) / d("1h")
$daysLeft = $hoursLeft / 24
warn = $daysLeft <= 50
crit = $daysLeft <= 0
warnNotification = default
critNotification = default
}
Alert Explanation
• q(..) (func doc) queries OpenTSDB, one of Bosun's supported backends. It returns a type
called a seriesSet (which is a set of time series, each identified by tag).
• last() (func doc) takes the last value of each series in the seriesSet and returns a
numberSet.
• The metric, [Link], returns the Unix timestamp of when the cert will expire.
• epoch() (func doc) returns the current Unix timestamp. So subtracting the current Unix timestamp
from the expiration epoch gives us the remaining time.
• d() (func doc) returns the number of seconds represented by the duration string. The duration
string uses the same units as OpenTSDB.
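The arithmetic in the alert can be traced with a short Go sketch. This is only an illustration of the calculation, not Bosun code: daysLeft and status are hypothetical helper names, the thresholds are copied from the warn and crit lines above, and the 30-day expiry in main is made up.

```go
package main

import (
	"fmt"
	"time"
)

// daysLeft mirrors the alert's arithmetic: seconds remaining divided by
// the number of seconds in an hour (what d("1h") returns), then by 24.
func daysLeft(expireEpoch, nowEpoch int64) float64 {
	secondsPerHour := time.Hour.Seconds()
	hoursLeft := float64(expireEpoch-nowEpoch) / secondsPerHour
	return hoursLeft / 24
}

// status applies the warn (<= 50 days) and crit (<= 0 days) thresholds
// from the alert definition. Crit is checked first because it is the
// more severe state.
func status(days float64) string {
	switch {
	case days <= 0:
		return "critical"
	case days <= 50:
		return "warning"
	default:
		return "normal"
	}
}

func main() {
	now := time.Now()
	// Hypothetical cert that expires 30 days from now.
	expire := now.Add(30 * 24 * time.Hour)
	d := daysLeft(expire.Unix(), now.Unix())
	fmt.Printf("%.2f days left: %s\n", d, status(d))
}
```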
Notification Preview
Example Section of [Link] referencing the config for httpunit test cases:
[[HTTPUnit]]
TOML = "/opt/httpunit/data/[Link]"
Header Template
In Bosun, templates can reference other templates. For email notifications, you might have a
header template to show things you want in all alerts.
Header Template
template header {
body = `
<style>
td, th {
padding-right: 10px;
}
</style>
<p style="font-weight: bold; text-decoration: underline;">
<a style="padding-right: 10px;" href="{{.Ack}}">Acknowledge</a>
<a style="padding-right: 10px;" href="{{.Rule}}">View Alert in Bosun's Rule Editor</a>
{{if .[Link]}}
<a style="padding-right: 10px;"
href="[Link]
{{.[Link]}} in Opserver</a>
<a
href="[Link]
15m,mode:quick,to:now))&_a=(columns:!(_source),index:%5Blogstash-
%[Link],interval:auto,query:(query_string:(analyze_wildcard:!t,query:'logsource:{{.[Link]}}')
{{.[Link]}} in Kibana</a>
{{end}}
</p>
<table>
<tr>
<td><strong>Key: </strong></td>
<td>{{printf "%s%s" .[Link] .Group }}</td>
</tr>
<tr>
<td><strong>Incident: </strong></td>
<td><a href="{{.Incident}}">#{{.[Link]}}</a></td>
</tr>
</table>
<br/>
{{if .[Link]}}
<p><strong>Notes:</strong> {{html .[Link]}}</p>
{{end}}
{{if .[Link]}}
<p>
{{if not .[Link]}}
<strong>Notes:</strong>
{{end}}
{{ html .[Link] }}</p>
{{end}}
`
}
Explanations:
• <style>...: Although style blocks are not supported in email, bosun processes style blocks
and then inlines them into the html. So this is shared css for any templates that include this
template.
• The .Ack link takes you to a Bosun view where you can acknowledge the alert. The .Rule link
takes you to Bosun's rule editor setting the template, rule, and time of the alert so you can
modify the alert, or run it at different times.
• {{if .[Link]}}...: .Group is the tagset of the alert. So when the warn or crit expression
has tags like host=*, we know the alert is in reference to a specific host in our environment.
So we then show some links to host specific things.
• The alert name and key are included to ensure that at least the most basic information is in
any alert.
• .[Link] is included so that if someone defines the $notes variable in any alert, it will
be shown in the alert. This encourages people to write notes explaining the purpose of the
alert and how to interpret it.
• .[Link] is there in case we want to define a macro with notes, and then
have instances of that macro with more notes added to the macro notes.
Template Definition
template [Link] {
subject = {{.[Link]}}: {{.Eval .[Link].by_host}} bad bond(s) on {{.[Link]}}
body = `{{template "header" .}}
<h2>Bond Status</h2>
<table>
<tr><th>Bond</th><th>Slave</th><th>Status</th></tr>
{{range $r := .EvalAll .[Link].slave_status}}
{{if eq $.[Link] .[Link]}}
<tr>
<td>{{$[Link]}}</td>
<td>{{$[Link]}}</td>
<td {{if lt $[Link] 1.0}} style="color: red;" {{end}}>{{$[Link]}}</td>
</tr>
{{end}}
{{end}}
</table>
`
}
Alert Definition
alert [Link] {
template = [Link]
macro = host_based
$notes = This alert triggers when a bond only has a single interface, or the status of a slave in the bond is not up
$slave_status = max(q("sum:[Link].is_up{bond=*,host=*,slave=*}", "5m", ""))
$slave_status_by_bond = sum(t($slave_status, "host,bond"))
$slave_count = max(q("sum:[Link]{bond=*,host=*}", "5m", ""))
$no_good = $slave_status_by_bond < $slave_count || $slave_count < 2
$by_host = max(t($no_good, "host"))
warn = $by_host
}
Notification Preview
Chapter 4: Expression Tips and Tricks
Examples
Avoiding Divide by Zero with NumberSet Operations
In order to avoid a divide by zero with a numberSet (what you get after a reduction like avg()) you
can short-circuit the logic:
If the above were just $two / $five then when $five is zero, the result will be +Inf which will cause
an error when used as warn or crit value in an alert expression.
With series operations, things are dropped from the left side if there is no corresponding
timestamp/datapoint in the right side. You can mix this with the dropbool function to avoid divide
by zero:
It is possible that after dropbool there will be an empty set, which would also error. Series
operations are therefore recommended for visualization; for alerting, it is recommended to use
reduction functions earlier in the expression. Alternatively, you could wrap the operation in the
nv func after reduction:
nv(avg($two / dropbool($five, ($five > 0))), 0)
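The underlying problem is ordinary floating-point behaviour, which a small Go sketch can show. This is not Bosun expression syntax: safeRatio is a hypothetical helper, and the variable names two and five only echo the $two and $five variables mentioned above. It illustrates why an unguarded division produces +Inf and how a guard with a default value avoids it.

```go
package main

import (
	"fmt"
	"math"
)

// safeRatio returns numerator/denominator, or fallback when the
// denominator is zero, so the result is never +Inf or NaN because of
// a zero divisor.
func safeRatio(numerator, denominator, fallback float64) float64 {
	if denominator == 0 {
		return fallback
	}
	return numerator / denominator
}

func main() {
	two, five := 2.0, 0.0

	// Dividing floats by zero does not panic in Go; it yields +Inf,
	// which is the value that breaks a warn or crit expression.
	fmt.Println(math.IsInf(two/five, 1))

	// The guarded version returns the chosen default instead.
	fmt.Println(safeRatio(two, five, 0))
}
```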
Chapter 5: lscount
Parameters
Parameter      Details
indexRoot      The root name of the index to hit, the format is expected to be
               [Link]("%s-%s", index_root, [Link]("2006.01.02"))
keyString      Creates groups (like tagsets) and can also filter those groups. It is the
               format of "field:regex,field:regex...". The :regex can be omitted.
filterString   An Elastic regexp query that can be applied to any field. It is in the same
               format as the keystring argument.
startDuration  Set the time window from now - see the OpenTSDB q() function for more
               details.
endDuration    Set the time window from now - see the OpenTSDB q() function for more
               details.
Remarks
Deprecation
The LogStash query functions are deprecated, and only for use with v1.x of ElasticSearch. If
you are running v2 or above of ElasticSearch, then you should refer to the Elastic Query functions.
Caveats
• There is currently no escaping in the keystring, so if your regex needs to have a comma or
double quote you are out of luck.
• The regexes in keystring are applied twice: first as a regexp filter to Elastic, and then as a Go
regexp to the keys of the result. This is because the value could be an array and you will get
groups that should be filtered. This means the regex language is the intersection of the Go
regex spec and the Elastic regex spec. Elastic uses Lucene style regex. This means regexes
are always anchored (see the documentation).
• If the type of the field value in Elastic (aka the mapping) is a number then the regexes won’t
act as a regex. The only thing you can do is an exact match on the number, i.e.
“eventlogid:1234”. It is recommended that anything that is an identifier should be stored as a
string since they are not numbers even if they are made up entirely of numerals.
• Alerts using this information likely want to set ignoreUnknown, since only “groups” that
appear in the time frame are in the results.
Examples
Counting total number of documents in last 5 minutes
lscount returns a time bucketed count of matching documents in the LogStash index, according to
the specified filter.
A trivial use of this would be to check how many documents in total have been received in the
last 5 minutes, and alert if it is below a certain threshold.
alert [Link] {
$notes = This alerts if there haven't been any logstash documents in the past 5 minutes
template = [Link]
$count_by_minute = lscount("logstash", "", "", "5m", "5m", "")
$count_graph = lscount("logstash", "", "", "1m", "60m", "")
$q = avg($count_by_minute)
crit = $q < 1
critNotification = default
}
template [Link] {
body = `{{template "header" .}}
{{.Graph .[Link].count_graph }}
{{template "def" .}}
{{template "computation" .}}`
subject = {{.[Link]}}: Logstash docs per second: {{.Eval .[Link].q | printf "%.2f"}} in the past 5 minutes
}
• $count_by_minute = lscount("logstash", "", "", "5m", "5m", "")
○ This counts the number of documents from the last 5 minutes, in a single 5 minute
bucket. You will get one data point in the returned seriesSet with the total number of
documents from the last 5 minutes, in the latest logstash index.
• $count_graph = lscount("logstash", "", "", "1m", "60m", "")
○ This counts the number of documents from the last hour, in 1 minute buckets. There
will be a total of 60 data points in the seriesSet returned, which in this instance is used
in a graph.
Chapter 6: lsstat
Parameters
Parameter      Details
indexRoot      The root name of the index to hit, the format is expected to be
               [Link]("%s-%s", index_root, [Link]("2006.01.02"))
keyString      Creates groups (like tagsets) and can also filter those groups. It is the
               format of "field:regex,field:regex...". The :regex can be omitted.
filterString   An Elastic regexp query that can be applied to any field. It is in the same
               format as the keystring argument.
rStat          Can be one of avg, min, max, sum, sum_of_squares, variance, std_deviation
startDuration  Set the time window from now - see the OpenTSDB q() function for more
               details.
endDuration    Set the time window from now - see the OpenTSDB q() function for more
               details.
Remarks
Deprecation
The LogStash query functions are deprecated, and only for use with v1.x of ElasticSearch. If
you are running v2 or above of ElasticSearch, then you should refer to the Elastic Query functions.
Caveats
• There is currently no escaping in the keystring, so if your regex needs to have a comma or
double quote you are out of luck.
• The regexes in keystring are applied twice: first as a regexp filter to Elastic, and then as a Go
regexp to the keys of the result. This is because the value could be an array and you will get
groups that should be filtered. This means the regex language is the intersection of the Go
regex spec and the Elastic regex spec. Elastic uses Lucene style regex. This means regexes
are always anchored (see the documentation).
• If the type of the field value in Elastic (aka the mapping) is a number then the regexes won’t
act as a regex. The only thing you can do is an exact match on the number, i.e.
“eventlogid:1234”. It is recommended that anything that is an identifier should be stored as a
string since they are not numbers even if they are made up entirely of numerals.
• Alerts using this information likely want to set ignoreUnknown, since only “groups” that
appear in the time frame are in the results.
Examples
The average value of a field over time
lsstat returns various summary stats per bucket for the specified field. The field must be numeric
in elastic.
rStat can be one of avg, min, max, sum, sum_of_squares, variance, std_deviation.
The rest of the fields behave the same as lscount, except that there is no division based on
bucketDuration (since these are summary stats).
The lsstat in this example queries the logstash indexes, filters on a field env with the value
prod, and gives the max value of querytime for the last hour, in one minute buckets.
Chapter 7: Notifications: Chat Systems
Remarks
In Bosun, notifications are used for both new alert incidents and when an alert is acked/closed/etc.
If you don't want the other events to trigger a notification add runOnActions = false to the
notification definition.
Examples
Slack Notifications
HipChat
Bosun notifications are assigned to alert definitions using warnNotification and critNotification and
indicate where to send the rendered alert template when a new incident occurs. The
${[Link]} syntax can be used to load values from an environment variable.
In order to post alerts to HipChat, start by creating an Integration named "Bosun". The Integration
will provide the URL necessary to post messages (including the token) as seen here:
#Example template
template [Link] {
subject = `{"color":{{if lt (.Eval .[Link]) (.Eval .[Link])}}"red"{{else}} {{if lt (.Eval .[Link]) (.Eval .[Link])}}"yellow"{{else}}"green"{{end}}{{end}},"message":"Server: {{.[Link]}}<br/>Metric: {{.[Link]}}<br/><br/>DL speed: {{.Eval .[Link] | printf "%.2f" }}<br/>DL Warning threshold: {{.[Link]}}<br/>DL Critical threshold: {{.[Link]}}<br/><br/>Notes: {{.[Link]}}<br/><br/>RunBook: <a href={{.[Link]}} >wiki article</a>","notify":false,"message_format":"html"}`
}
#Example notification
notification hipchat {
#Create an Integration in HipChat to generate the POST URL
#Example URL: [Link]
post = ${env.HIPCHAT_ROOM_ABC}
body = {{.}}
contentType = application/json
}
Chapter 8: Notifications: Overview
Syntax
• notification name {
○ email = dev-alerts@[Link], prod-alerts@[Link], ...
○ post = [Link]
○ get = [Link]
○ next = another-notification-definition
○ timeout = 30m
○ runOnActions = false
○ body = {"text": {{.|json}}}
○ contentType = application/json
○ print = true
• }
Remarks
In Bosun, notifications are used for both new alert incidents and when an alert is acked/closed/etc.
If you don't want the other events to trigger a notification add runOnActions = false to the
notification definition.
See also:
Examples
SMS Notifications with plivo
There are two IDs you will need from your plivo account. Replace authid and authtoken in this
snippet with those values. The src value should also be a valid number assigned to your account.
dst can be any number you want, or multiple numbers separated by <.
notification sms {
post = [Link]
body = {"text": {{.|json}}, "dst":"15551234567","src":"15559876543"}
contentType = application/json
runOnActions = false
}
Email Notifications
To send email notifications you need to add the following settings to your config file:
#Using a company SMTP server (note only one can be defined)
smtpHost = [Link]
emailFrom = bosun@[Link]
#Chained notifications will escalate if an incident is not acked before the timeout
notification it {
email = it-alerts@[Link]
next = oncall
timeout = 30m
}
#Could set additional escalations here using any notification type (email/get/post)
#or set next = oncall to send another email after the timeout if alert is still not acked
notification oncall {
email = escalated-alerts@[Link]
}
Overview
Bosun notifications are assigned to alert definitions using warnNotification and critNotification and
indicate where to send the rendered alert template when a new incident occurs. Notifications can
be sent via email or use HTTP GET/POST requests. There is also a Print notification that just
adds information to the Bosun log file.
If you want to keep a URL, password, or API key out of plain text you can use
${[Link]} to load the value from an environment variable (usually exported from the
Bosun init script). Please note that there are no protections on who can access the variables (they
can easily be displayed in a template) but it does prevent them from being displayed directly on
the Rule Editor page or in the .conf file.
notification logfile {
print = true
}
Alert incidents can be sent to other systems using HTTP GET or HTTP POST requests. You can
either send the rendered alert directly (using markdown in the template perhaps) or use body = ...
{{.|json}} ... and contentType to send the alert data over as part of a JSON object. Another
approach is to send only the basic incident information and then have the receiving system pull
additional details from the Bosun API.
notification postjson {
post = ${[Link]}
body = {"text": {{.|json}}, "apiKey": "${[Link]}"}
contentType = application/json
}
Swap out AccountSid, AuthToken, ToPhoneNumber and FromPhoneNumber for your credentials and intended
recipients. If the ToPhoneNumber and FromPhoneNumber have + in them, you need to ensure that they
are URL-encoded (i.e. as %2B).
notification sms {
post = [Link]
01/Accounts/{AccountSid}/[Link]
body = Body={{.}}&To={ToPhoneNumber}&From={FromPhoneNumber}
}
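If you are unsure what the encoded form of a phone number looks like, the following Go sketch prints it. The encodePhone helper and the sample number are hypothetical; url.QueryEscape is the standard library function for encoding a form value.

```go
package main

import (
	"fmt"
	"net/url"
)

// encodePhone percent-encodes a phone number for use in a form body,
// so that a leading "+" becomes "%2B" instead of being read as a space.
func encodePhone(number string) string {
	return url.QueryEscape(number)
}

func main() {
	// Hypothetical number in international format.
	fmt.Println(encodePhone("+15551234567")) // %2B15551234567
}
```

The encoded value is what would replace {ToPhoneNumber} or {FromPhoneNumber} in the notification body.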
PagerDuty Notifications
#Post to [Link]
notification pagerduty {
post = [Link]
contentType = application/json
runOnActions = false
body = `{
"service_key": "myservicekey",
"incident_key": {{.|json}},
"event_type": "trigger",
"description": {{.|json}},
"client": "Bosun",
"client_url": "[Link]
}`
}
In some cases you may want to change which notification you use based on a tag in the alert
keys. You can do this using the Lookup feature. Note: Lookup only works if you are using
OpenTSDB and sending data to Bosun to be indexed. For other backends or non-indexed data
you have to use lookupSeries instead.
notification default {
email = team@[Link]
}
notification JSmith{
email = JSmith@[Link]
}
#This will use the JSmith lookup for any alerts where the host tag starts with ny-jsmith
lookup host_base_contact {
entry host=ny-jsmith* {
main_contact = JSmith
}
entry host=* {
main_contact = default
}
}
alert blah {
...
warn = q(...)
warnNotification = lookup("host_base_contact", "main_contact")
critNotification = lookup("host_base_contact", "main_contact")
}
macro [Link] {
warnNotification = lookup("host_base_contact", "main_contact")
critNotification = lookup("host_base_contact", "main_contact")
}
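The ordering of the entries matters in the lookup above: the more specific host=ny-jsmith* entry comes before the catch-all host=*. The Go sketch below illustrates that behaviour under two assumptions that are not confirmed by the text: that entries are tried in order with the first match winning, and that the * wildcard behaves like a simple glob (approximated here with path.Match). The entry type and lookupContact helper are hypothetical and are not part of Bosun.

```go
package main

import (
	"fmt"
	"path"
)

// entry is a hypothetical stand-in for one lookup entry: a glob pattern
// for the host tag and the notification name it maps to.
type entry struct {
	hostPattern string
	mainContact string
}

// lookupContact returns the contact of the first entry whose pattern
// matches the host. The boolean is false when no entry matches.
func lookupContact(entries []entry, host string) (string, bool) {
	for _, e := range entries {
		if ok, err := path.Match(e.hostPattern, host); err == nil && ok {
			return e.mainContact, true
		}
	}
	return "", false
}

func main() {
	// Same order as the host_base_contact lookup: specific entry first.
	entries := []entry{
		{"ny-jsmith*", "JSmith"},
		{"*", "default"},
	}
	for _, host := range []string{"ny-jsmith01", "ny-web01"} {
		contact, _ := lookupContact(entries, host)
		fmt.Printf("%s -> %s\n", host, contact)
	}
}
```

If the two entries were swapped, the catch-all pattern would match every host first and the JSmith notification would never be selected in this sketch.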
Chapter 9: Packages and Initialization Scripts
Remarks
There currently aren't any installation packages provided for Bosun or Scollector, only binaries on
the Bosun release page. It is up to the end user to find the best way to deploy the files and run
them as a service.
Examples
Scollector init.d script
#!/bin/bash
#
# scollector Startup script for scollector.
#
# chkconfig: 2345 90 60
# description: scollector is a replacement for OpenTSDB's TCollector \
# and can be used to send metrics to a Bosun server
RETVAL=0
PIDFILE=/var/run/[Link]
prog=scollector
exec=/opt/scollector/scollector-linux-amd64
scollector_conf=/opt/scollector/[Link]
scollector_logs=/var/log/scollector
scollector_opts="-conf $scollector_conf -log_dir=$scollector_logs"
lockfile=/var/lock/subsys/$prog
# Source config
if [ -f /etc/sysconfig/$prog ] ; then
. /etc/sysconfig/$prog
fi
start() {
[ -x $exec ] || exit 5
umask 077
echo -n $"Starting scollector: "
daemon --check=$exec --pidfile="$PIDFILE" "{ $exec $scollector_opts & } ; echo \$! >| $PIDFILE"
RETVAL=$?
echo
[ $RETVAL -eq 0 ] && touch $lockfile
return $RETVAL
}
stop() {
echo -n $"Shutting down scollector: "
killproc -p "$PIDFILE" $exec
RETVAL=$?
echo
[ $RETVAL -eq 0 ] && rm -f $lockfile
return $RETVAL
}
rhstatus() {
status -p "$PIDFILE" -l $prog $exec
}
restart() {
stop
start
}
case "$1" in
start)
start
;;
stop)
stop
;;
restart)
restart
;;
status)
rhstatus
;;
*)
echo $"Usage: $0 {start|stop|restart|status}"
exit 3
esac
exit $?
Here is an init.d script for Bosun that includes setting environment variables that can be used to
hide secrets from the raw config. It uses [Link] to run the program
as a daemon.
#!/bin/sh
#
# /etc/rc.d/init.d/bosun
# bosun
#
# chkconfig: - 98 02
# description: bosun
# Source function library.
. /etc/rc.d/init.d/functions
base_dir="/opt/bosun"
exec="/opt/bosun/bosun"
prog="bosun"
config="${base_dir}/config/[Link]"
lockfile=/var/lock/subsys/$prog
pidfile=/var/run/[Link]
logfile=/var/log/$[Link]
#These "secrets" can be used in the [Link] using syntax like ${[Link]} or ${env.API_KEY}
export CHAT=[Link]
export API_KEY=123456789012345678901234567890
check() {
$exec -t -c $config
if [ $? -ne 0 ]; then
echo "Errors found in configuration file, check it with '$exec -t'."
exit 1
fi
}
start() {
[ -x $exec ] || exit 5
[ -f $config ] || exit 6
check
echo -n $"Starting $prog: "
# if not running, start it up here, usually something like "daemon $exec"
ulimit -n 65536
daemon daemonize -a -c $base_dir -e $logfile -o $logfile -p $pidfile -l $lockfile $exec -c $config $OPTS
retval=$?
echo
[ $retval -eq 0 ] && touch $lockfile
return $retval
}
stop() {
echo -n $"Stopping $prog: "
# stop it here, often "killproc $prog"
killproc -p $pidfile -d 5m
retval=$?
echo
[ $retval -eq 0 ] && rm -f $lockfile
return $retval
}
restart() {
check
stop
start
}
reload() {
restart
}
force_reload() {
restart
}
rh_status() {
# run checks to determine if the service is running or use generic status
status $prog
}
rh_status_q() {
rh_status >/dev/null 2>&1
}
case "$1" in
start)
rh_status_q && exit 0
$1
;;
stop)
rh_status_q || exit 0
$1
;;
restart)
$1
;;
reload)
rh_status_q || exit 7
$1
;;
force-reload)
force_reload
;;
status)
rh_status
;;
condrestart|try-restart)
rh_status_q || exit 0
restart
;;
*)
echo $"Usage: $0 {start|stop|status|restart|condrestart|try-restart|reload|force-reload}"
exit 2
esac
[Service]
Type=simple
User=root
ExecStart=/opt/bosun/bosun -c /opt/bosun/config/[Link]
Restart=on-abort
[Install]
WantedBy=[Link]
[Service]
Type=simple
User=root
ExecStart=/opt/scollector/scollector -h [Link]
Restart=on-abort
[Install]
WantedBy=[Link]
TSDBRelay can be used to forward metrics to an OpenTSDB instance, send them to Bosun for
indexing, and relay them to another OpenTSDB-compatible instance for backup/DR/HA. It also has
options to denormalize metrics with high tag cardinality or create Redis/Ledis-backed external
counters.
[Service]
Type=simple
User=root
#Local tsdb/bosun and influxdb opentsdb endpoint at 4243
ExecStart=/opt/tsdbrelay/tsdbrelay -b localhost:8070 -t localhost:4242 -l [Link]:5252 -r localhost:4243
#For external counters add: -redis redishostname:6379 -db 0
#For denormalized metrics: -denormalize=os.cpu__host,[Link].used__host,[Link].bytes__host,[Link].bytes__host,[Link]
Restart=on-abort
[Install]
WantedBy=[Link]
mkdir /opt/scollector
In the /opt/scollector directory, download the latest binary build from the bosun/scollector site, [
[Link]
wget [Link]
ex:
wget [Link]
ln -s /opt/scollector/scollector-linux-amd64 /usr/local/bin/scollector
mkdir /etc/scollector
ex:
# ClientID = ""
# Secret = ""
# Token = ""
ex:
[Unit]
Description=Scollector Service
After=[Link]
[Service]
Type=simple
User=root
ExecStart=/usr/local/bin/scollector -conf=/etc/scollector/[Link]
Restart=on-abort
[Install]
WantedBy=[Link]
Start scollector:
Alternatively, you can view the system message log. You are looking for something like:
Chapter 10: Scollector: External Collectors
Remarks
Scollector supports tcollector style external collectors that can be used to send metrics to Bosun
via custom scripts or executables. External collectors are a great way to get started collecting
data, but when possible it is recommended for applications to send data directly to Bosun or to
update scollector so that it natively supports additional systems.
The ColDir configuration key specifies the external collector directory, which is usually set to
something like /opt/scollector/collectors/ in Linux or C:\Program Files\scollector\collectors\ in
Windows. It should contain numbered directories just like the ones used in OpenTSDB tcollector.
Each directory name sets how often, in seconds, scollector will try to invoke the collectors in
that folder (example: 60 = every 60 seconds). Use a directory named 0 for any executables or
scripts that run continuously and create output on their own schedule. Directories with
non-numeric names are ignored, so lib and etc directories are often used for library and config
data shared by all collectors.
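The directory naming rule above can be sketched in a few lines of Go. This is only an illustration of the rule as described here, not scollector's actual code; the function name schedule is hypothetical, and treating negative numbers as ignored is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// schedule describes how a sub-directory of ColDir would be treated,
// following the rule in the text: numeric names are intervals in seconds,
// 0 means a continuously running collector, anything else is skipped.
func schedule(dirName string) string {
	seconds, err := strconv.Atoi(dirName)
	switch {
	case err != nil || seconds < 0:
		// Non-numeric names such as "lib" or "etc" are skipped.
		// Treating negative numbers as ignored is an assumption.
		return "ignored"
	case seconds == 0:
		return "continuous"
	default:
		return fmt.Sprintf("every %d seconds", seconds)
	}
}

func main() {
	for _, name := range []string{"0", "60", "300", "lib", "etc"} {
		fmt.Printf("%-4s -> %s\n", name, schedule(name))
	}
}
```

A layout with 0, 60, lib and etc directories would therefore have one continuous folder, one folder polled every minute, and two folders that scollector never tries to execute.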
External collectors can use either the simple data output format from tcollector or they can send
JSON data if they want to include metadata.
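As a rough sketch of the two output styles, the Go program below prints the same data point once as a simple "metric timestamp value tag=value" line and once as a JSON object. The field names (metric, timestamp, value, tags) follow the samples in this chapter; the type name dataPoint, the helper names, and the sample metric are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// dataPoint mirrors the JSON fields used by the samples in this chapter.
type dataPoint struct {
	Metric    string            `json:"metric"`
	Timestamp int64             `json:"timestamp"`
	Value     float64           `json:"value"`
	Tags      map[string]string `json:"tags"`
}

// simpleLine renders "metric timestamp value tag=value ..." with tags
// sorted by key so that the output is stable.
func simpleLine(dp dataPoint) string {
	keys := make([]string, 0, len(dp.Tags))
	for k := range dp.Tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := []string{dp.Metric, fmt.Sprint(dp.Timestamp), fmt.Sprint(dp.Value)}
	for _, k := range keys {
		parts = append(parts, k+"="+dp.Tags[k])
	}
	return strings.Join(parts, " ")
}

// jsonLine renders the same data point as a single-line JSON object.
func jsonLine(dp dataPoint) (string, error) {
	b, err := json.Marshal(dp)
	return string(b), err
}

func main() {
	dp := dataPoint{
		Metric:    "example.test.metric", // hypothetical metric name
		Timestamp: 1700000000,
		Value:     42.5,
		Tags:      map[string]string{"host": "web01", "query": "demo"},
	}
	fmt.Println(simpleLine(dp))
	if js, err := jsonLine(dp); err == nil {
		fmt.Println(js)
	}
}
```

An external collector only has to write lines like these to standard output; queuing and sending them to the server is left to scollector.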
Examples
Sample collector written in PowerShell
#Send metadata for each metric once on startup (Scollector will resend to Bosun periodically)
Write-Output ($MetricMetadata -f "$[Link]","rate","gauge") #See [Link]
Write-Output ($MetricMetadata -f "$[Link]","unit","item") #See [Link]
Write-Output ($MetricMetadata -f "$[Link]","desc","A test metric")
Write-Output ($MetricData -f "$[Link]",[datetime]::[Link]($epoch).TotalSeconds,42.123,$tags)
do {
    $delay = Get-Random -Minimum 5 -Maximum 25
    sleep -Seconds $delay
    Write-Output ($MetricData -f "$[Link]",[datetime]::[Link]($epoch).TotalSeconds,$delay,$tags)
} while ($true)
#If a continuous output script ever exits scollector will restart it. If you just want periodic data every 60 seconds you
#can use a /60/ folder instead of /0/ and allow the script to exit when finished sending a batch of metrics.
The following can be saved as [Link]. After you update the EDITME settings and build the
executable it can be used as a continuous external collector.
package main

import (
	"fmt"
	"log"
	"net/url"
	"strconv"
	"time"

	"[Link]/ChimeraCoder/anaconda"
)

func main() {
	[Link]("EDITME")
	[Link]("EDITME")
	api := [Link]("EDITME", "EDITME")
	v := [Link]{}
	// Seed since_id with the newest existing tweet so only later tweets are counted.
	sr, err := [Link]("stackoverflow", nil)
	if err != nil {
		[Link](err)
	}
	var since_id int64 = 0
	for _, tweet := range sr {
		if [Link] > since_id {
			since_id = [Link]
		}
	}
	count := 0
	for {
		now := [Link]().Unix()
		[Link]("result_type", "recent")
		[Link]("since_id", [Link](since_id, 10))
		// Pass v so that result_type and since_id are applied to the search.
		sr, err := [Link]("stackoverflow", v)
		if err != nil {
			[Link](err)
		}
		for _, tweet := range sr {
			if [Link] > since_id {
				count += 1
				since_id = [Link]
			}
		}
		// Emit the running counter in the simple external collector format.
		[Link]("twitter.tweet_count", now, count, "query=stackoverflow")
		[Link]([Link] * 30)
	}
}
This is a continuous collector that uses the hadoop fs -du -s /hbase/* command to get details
about the HDFS disk usage. This metric is very useful for tracking space in an OpenTSDB system.
#!/bin/bash
while true; do
    while read -r bytes raw_bytes path; do
        echo "[Link] $(date +"%s") $bytes path=$path"
        #[Link]change/td-p/27192 KMB 2015-08-24T[Link]Z
        echo "[Link] $(date +"%s") $raw_bytes path=$path"
    done < <(hadoop fs -du -s /hbase/*)
    sleep 30
done
The following Go file can be compiled into a continuous external collector that will query a MSSQL
server database that uses the [Link] schema. It will query multiple
servers/databases for all exceptions since UTC 00:00 to convert the raw entries into a counter. It
also uses the [Link]/metadata package to include metadata for the
[Link] metric.
/*
Exceptional is an scollector external collector for [Link].
*/
package main
import (
"database/sql"
"encoding/json"
"fmt"
"log"
"strings"
"time"
"[Link]/metadata"
"[Link]/opentsdb"
_ "[Link]/denisenkom/go-mssqldb"
)
type Exceptions struct {
GUID string
ApplicationName string
MachineName string
CreationDate [Link]
Type string
IsProtected int
Host string
Url string
HTTPMethod string
IPAddress string
Source string
Message string
Detail string
StatusCode int
SQL string
DeletionDate [Link]
FullJson string
ErrorHash int
DuplicateCount int
}
const (
defaultPassword = "EnterPasswordHere"
defaultPort = "1433"
metric = "[Link]"
descMetric = "The number of exceptions thrown per second by applications and machines. Data is queried from multiple sources. See status instances for details on exceptions."
)
func main() {
mds := [][Link]{
{
Metric: metric,
Name: "rate",
Value: "counter",
},
{
Metric: metric,
Name: "unit",
Value: [Link],
},
{
Metric: metric,
Name: "desc",
Value: descMetric,
},
}
for _, m := range mds {
b, err := [Link](m)
if err != nil {
[Link](err)
}
[Link](string(b))
}
instances := [...]ExceptionsDB{
{"NY_AG", "[Link]", defaultPassword, defaultPort, "NY_Status"},
{"CO-SQL", "[Link]", defaultPassword, defaultPort, "CO_Status"},
{"NY-INTSQL", "[Link]", defaultPassword, defaultPort, "INT_Status"},
}
for _, exdb := range instances {
go run(exdb)
}
select {}
}
func run(exdb ExceptionsDB) {
query := func() {
// Database name is the same as the username
db, err := mssqlConnect([Link], [Link], [Link], [Link], [Link])
if err != nil {
[Link](err)
}
defer [Link]()
var results []ExceptionsCount
sqlQuery := `
SELECT ApplicationName, MachineName, MAX(Count) as Count FROM
(
--New since UTC rollover
SELECT ApplicationName, MachineName, Sum(DuplicateCount) as Count from Exceptions
WHERE CreationDate > CONVERT (date, GETUTCDATE())
GROUP BY MachineName, ApplicationName
UNION --Zero out any app/machine combos that had exceptions in last 24 hours
SELECT DISTINCT [Link], [Link], 0 as Count from Exceptions ex
WHERE [Link] Between Convert(Date, GETUTCDATE()-1) And Convert(Date, GETUTCDATE())
) as T
GROUP By [Link], [Link]`
rows, err := [Link](sqlQuery)
if err != nil {
[Link](err)
return
}
defer [Link]()
for [Link]() {
var r ExceptionsCount
if err := [Link](&[Link], &[Link], &[Link]); err != nil {
[Link](err)
continue
}
[Link] = [Link]
results = append(results, r)
}
if err := [Link](); err != nil {
[Link](err)
}
if len(results) > 0 {
now := [Link]().Unix()
for _, r := range results {
application, err := [Link]([Link])
if err != nil {
[Link](err)
continue
}
db := [Link]{
Metric: metric,
Timestamp: now,
Value: [Link],
Tags: [Link]{
"application": application,
"machine": [Link]([Link]),
"source": [Link],
},
}
b, err := [Link]()
if err != nil {
[Link](err)
continue
}
[Link](string(b))
}
}
}
for {
wait := [Link](interval)
query()
<-wait
}
}
<#
.DESCRIPTION
Writes the metric out in bosun external collector format which is compatible with
scollector external scripts
.PARAMETER metric
Name of the metric (eg : [Link])
.PARAMETER type
Type of metric (counter, gauge, etc)
.PARAMETER unit
Type of unit (connections, operations, etc)
.PARAMETER desc
Description of the metric
.PARAMETER value
The current value for the metric
#>
function Write-Metric
{
param(
[string]$metric,
[string]$type,
[string]$unit,
[string]$desc,
$value
)
$obj = @{
metric = $metric
name = "rate"
value = $type
}
$[Link]="unit"
$[Link]=$unit
$[Link]="desc"
$[Link]=$desc
$output = @{
metric = $metric
timestamp= [int]([datetime]::[Link]($epoch).TotalSeconds)
value=$value
tags= @{
host=$env:[Link]()
}
}
Chapter 11: Scollector: Overview
Remarks
Scollector is a monitoring agent that can be used to send metrics to Bosun or any system that
accepts OpenTSDB style metrics. It is modelled after OpenTSDB's tcollector data collection
framework but is written in Go and compiled into a single binary. One of the design goals is to
auto-detect services so that metrics will be sent with minimal or no configuration needed. You also
can create external collectors that generate metrics using a script or executable and use
Scollector to queue and send the metrics to the server.
You are NOT required to use Scollector when using Bosun, as you can also send metrics directly
to the /api/put route, use another monitoring agent, or use a different backend like Graphite,
InfluxDB, or ElasticSearch.
Examples
Setup with sample [Link] file
Scollector binaries for Windows, Mac, and Linux are available from the Bosun release page and
can be saved to /opt/scollector/ or C:\Program Files\scollector\. The Scollector configuration
file uses TOML v0.2.0 to specify various settings and defaults to being named [Link] in
the same folder as the binary. The configuration file is optional and only required if you need to
override a default value or include settings to activate a specific collector.
#Number of data points to include in each batch. Default is 500, should be set higher if you are sending a lot of metrics.
BatchSize = 5000
You can then either install Scollector as a service or just run it manually via:
Running Scollector as a service
On Windows you can install Scollector as a service using the -winsvc="install" flag. On Mac and
Linux you must manually create a service or init script. For example, here is a basic systemd
unit file:
[Service]
Type=simple
User=root
ExecStart=/opt/scollector/scollector -h [Link]
Restart=on-abort
[Install]
WantedBy=[Link]
Chapter 12: Scollector: Process and Service Monitoring
Remarks
Scollector can be used to monitor processes and services in Windows and Linux. Some processes
like IIS application pools are monitored automatically, but usually you need to specify which
processes and services you want to monitor.
Examples
Linux process and systemd service monitoring
[[Process]]
Command = "/opt/bosun/bosun"
Name = "bosun"
[[Process]]
Command = "ruby"
Name = "puppet-agent"
Args = "puppet"
[[Process]]
Command = "/haproxy$"
Name = "haproxy-t1"
Args = "/etc/haproxy-t1/[Link]"
[[Process]]
Command = '/usr/bin/redis-server \*:16389'
Name = "redis-bosun-dev"
IncludeCount = true
Scollector can also use the D-Bus API to determine the state of services managed by systemd
and specified in the configuration file.
[[SystemdService]]
Name = "^(puppet|redis-.*|keepalived|haproxy-t.*)$"
WatchProc = false
[[SystemdService]]
Name = "^(scollector|memcached)$"
WatchProc = true
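Because the Name values above are regular expressions, it can help to check which service names a pattern would select before deploying the configuration. The Go sketch below does that for the first pattern; the helper name matches is hypothetical, and matching against bare service names (without a .service suffix) is an assumption rather than a statement about scollector's internals.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches reports whether a service name satisfies a Name pattern.
func matches(pattern, service string) bool {
	return regexp.MustCompile(pattern).MatchString(service)
}

func main() {
	const pattern = `^(puppet|redis-.*|keepalived|haproxy-t.*)$`
	for _, svc := range []string{"puppet", "redis-bosun-dev", "haproxy-t1", "haproxy", "memcached"} {
		fmt.Printf("%-16s %v\n", svc, matches(pattern, svc))
	}
}
```

The ^ and $ anchors matter here: without them a pattern such as puppet would also select any service whose name merely contains that string.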
Scollector will monitor any Windows processes or services specified in the configuration file.
[[Process]]
Name = "^scollector"
[[Process]]
Name = "^chrome"
[[Process]]
Name = "^(MSSQLSERVER|SQLSERVERAGENT)$"
Scollector can also monitor any Windows processes using the .NET framework. If no
ProcessDotNet settings are specified it will default to just monitoring the w3wp worker processes
for IIS. You can specify which applications to monitor in the configuration file.
[[ProcessDotNet]]
Name = "^w3wp"
[[ProcessDotNet]]
Name = "LINQPad"
Matching processes will be monitored under the dotnet.* metrics, and if more than one process
matches they will be assigned incrementing id tag values starting at 1. Where possible
the w3wp names will be changed to match the iis_pool-names used for process monitoring.
Scollector has built-in support for using cAdvisor to generate container.* metrics in Bosun for
each Docker container on a host. To get started you will need to start a new container on each
Docker host:
Then, from an external source, poll for metrics using scollector with the Cadvisor configuration
option. If you are using Kubernetes to manage containers you may also want to use the
TagOverride option to override the docker_id tags (shorten to 12 chars), add container_name and
pod_name tags, and remove the docker_name and name tags:
[[Cadvisor]]
URL = "[Link]
[[Cadvisor]]
URL = "[Link]
name = ''
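The effect of such overrides can be sketched in Go. Everything below is an assumption made for illustration, not scollector's implementation: the function name override, the rule semantics (an empty pattern removes the tag, an unnamed capture group replaces the tag value, named groups add new tags), the sample tag values, and the regular expressions themselves.

```go
package main

import (
	"fmt"
	"regexp"
)

// override applies one hypothetical override rule to a tag set:
//   - an empty pattern removes the tag,
//   - an unnamed capture group replaces the tag value,
//   - a named capture group adds a tag with that name.
func override(tags map[string]string, key, pattern string) {
	value, ok := tags[key]
	if !ok {
		return
	}
	if pattern == "" {
		delete(tags, key)
		return
	}
	re := regexp.MustCompile(pattern)
	m := re.FindStringSubmatch(value)
	if m == nil {
		return
	}
	for i, name := range re.SubexpNames() {
		switch {
		case i == 0:
			// Index 0 is the whole match, not a capture group.
		case name == "":
			tags[key] = m[i]
		default:
			tags[name] = m[i]
		}
	}
}

func main() {
	// Sample tag values are made up for this sketch.
	tags := map[string]string{
		"docker_id":   "0123456789abcdef0123456789abcdef",
		"docker_name": "k8s_web.1a2b_frontend-x",
		"name":        "k8s_web.1a2b_frontend-x",
	}
	override(tags, "docker_id", `^(.{12})`)
	override(tags, "docker_name", `k8s_(?P<container_name>[^.]+)\.[0-9a-z]+_(?P<pod_name>[^-]+)`)
	override(tags, "docker_name", "")
	override(tags, "name", "")
	fmt.Println(tags)
}
```

Shortening docker_id and dropping the long name tags keeps the tag values readable on graphs and reduces the number of distinct tag values stored per container.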
You may also want to send the metrics to a test instance of Bosun (for example one running in the
Bosun Docker Container) to verify that they look correct before sending them to a production
Bosun instance, because data is hard to clean up once it has been sent.
Chapter 13: Silencing and Squelching Alerts
Examples
Squelching a host
If one does not want to receive any alert for a specific host or service - at least momentarily - one
can squelch it.
alert [Link] {
macro = [Link]
template = mytemplate
$notes = This alert will...
$metric = "avg:[Link]{host=*,name=...
warn = min( a($metric, ...
squelch = host=sqldev01,flavor=amq
squelch = host=test01
}
This alert won't appear in the dashboard for service amq on host sqldev01, and won't appear at all
for any service running on host test01.
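The matching behaviour described above (every tag listed in a squelch line has to match, and any one squelch line is enough) can be sketched in Go. The type and function names are hypothetical, and the sketch compares values exactly as a simplification; it is not Bosun's implementation.

```go
package main

import "fmt"

// rule maps tag keys to the values that must all be present for one
// squelch line to apply.
type rule map[string]string

// matchesAll reports whether every key in the rule is present in tags
// with the expected value.
func matchesAll(tags map[string]string, r rule) bool {
	for key, want := range r {
		if got, ok := tags[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// squelched reports whether any rule matches the tag set of an alert instance.
func squelched(tags map[string]string, rules []rule) bool {
	for _, r := range rules {
		if matchesAll(tags, r) {
			return true
		}
	}
	return false
}

func main() {
	// The two squelch lines from the alert above.
	rules := []rule{
		{"host": "sqldev01", "flavor": "amq"},
		{"host": "test01"},
	}
	instances := []map[string]string{
		{"host": "sqldev01", "flavor": "amq"},
		{"host": "sqldev01", "flavor": "web"},
		{"host": "test01", "flavor": "web"},
	}
	for _, tags := range instances {
		fmt.Println(tags, "squelched:", squelched(tags, rules))
	}
}
```

Listing more tags on one squelch line narrows what is hidden, while adding more squelch lines widens it.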
Chapter 14: Templates: Graph and GraphAll
Remarks
Bosun Templates can include graphs to provide more information when sending a notification. The
graphs can use variables from the alert and filter based on the tagset for the alert instance, or
use the GraphAll function to graph all series. When viewed on the Dashboard or in an email you can
click on the graph to load it in the Expression page.
You can also create a Generic Template with optional Graphs that can be shared across multiple
alerts.
Examples
Graph using Alert Variable
Using .Graph will filter the results to only include those that match the tagset for the alert. For
instance an alert for [Link]{host=ny-web01} would only include series with the
host=ny-web01 tag. If multiple series match then only the first matching result will be used.
template [Link] {
subject = ...
<strong>Graph</strong>
<div>{{.Graph .[Link]}}</div>
`
}
alert [Link] {
template = [Link]
...
$graph = q("avg:300s-avg:[Link].percent_free{host=$host}", "1d", "")
$graph_unit = Percent Free Memory (Including Buffers and Cache)
...
}
template [Link] {
subject = ...
<strong>GraphAll</strong>
<div>{{.GraphAll .[Link]}}</div>
`
}
alert [Link] {
template = [Link]
...
$graph = q("avg:300s-avg:[Link].percent_free{host=$host}", "1d", "")
$graph_unit = All Systems Percent Free Memory (Including Buffers and Cache)
...
}
Graph queries can be defined inline if you don't want to use an Alert variable.
template [Link] {
subject = ...
`
}
Sometimes you may want to create the query for a graph dynamically in the template itself by
combining one or more variables. For instance a host down alert might want to include the Bosun
known hosts ping metric using the dst_host tag.
template [Link] {
subject = ...
<strong>Graph from multiple variables</strong>
<div>{{printf "q(\"sum:%s{host=%s,anothertag=%s}\", \"8h\", \"\")" "[Link]" .[Link] "anothervalue" | .Graph}}</div>
`
}
When using GraphAll you may still want to filter the results, in which case you can use an Alert
variable with the Filter, Sort, and Limit functions.
template [Link] {
subject = ...
`
}
alert [Link] {
template = [Link]
...
$graph_all = q("avg:300s-avg:[Link].percent_free{host=ny-*}", "1d", "")
$graph_unit = All Systems with Less than 5 Percent Free Memory
$graph_below_5 = filter($graph_all, min($graph_all) < 5)
}
If you want to graph two series on one graph, you can use the Merge function. This can also be
combined with the Series function to manipulate the Y axis (like forcing it to start at zero).
template [Link] {
subject = ...
alert [Link] {
template = [Link]
...
$graph_time = "1d"
$graph_host = q("avg:300s-avg:[Link].percent_free{host=myhost}", $graph_time, "")
$graph_unit = Notice the Y axis always starts at zero now
$graph_series = series("value=zero", epoch()-d($graph_time), 0, epoch(),0)
$graph_merged = merge($graph_host,$graph_series)
...
}
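The $graph_series line above appears to build a two-point series that is 0 at the start and at the end of the graph window, which is what pins the Y axis to zero once it is merged with the real data. The Go sketch below only reproduces that arithmetic; the type and function names are hypothetical and the window is given in seconds (86400 for "1d").

```go
package main

import (
	"fmt"
	"time"
)

// point is one timestamp/value pair of a series.
type point struct {
	Timestamp int64 // Unix seconds
	Value     float64
}

// zeroBaseline returns a two-point series that is 0 at the start and at
// the end of the window ending at nowUnix.
func zeroBaseline(nowUnix, windowSeconds int64) []point {
	return []point{
		{Timestamp: nowUnix - windowSeconds, Value: 0},
		{Timestamp: nowUnix, Value: 0},
	}
}

func main() {
	const oneDay = 24 * 60 * 60 // seconds in the "1d" graph window
	for _, p := range zeroBaseline(time.Now().Unix(), oneDay) {
		fmt.Printf("%d %.0f\n", p.Timestamp, p.Value)
	}
}
```

Because the baseline spans exactly the same window as $graph_host, the merged graph keeps its time range and only the value range is extended down to zero.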
Chapter 15: Templates: HTTPGet and HTTPGetJSON
Examples
HTTPGetJSON
HTTPGetJSON performs an HTTP request to the specified URL and returns a [Link]
object for use in the alert template. Example:
template example {
{{ $ip := [Link] }}
{{ $whoisURL := printf "[Link] $ip }}
{{ $whoisJQ := $.HTTPGetJSON $whoisURL }}
IP {{$ip}} owner from ARIN is {{ $[Link] "net" "orgRef" "@name" }}
}
In this case the $ip address is hard-coded, but in a real alert it would usually come from the alert
tags using something like {{ $ip := .Group.client_ip}} where client_ip is a tag key whose value is
an IP address.
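The chained lookup String "net" "orgRef" "@name" walks nested JSON objects one key at a time. A minimal Go sketch of that idea, using only the standard library rather than the jsonq package, is shown below. The function name lookupString and the sample document are hypothetical; the document only imitates the shape implied by the template above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// lookupString parses raw JSON and follows path through nested objects,
// returning the string found at the end of the path.
func lookupString(raw string, path ...string) (string, error) {
	var cur interface{}
	if err := json.Unmarshal([]byte(raw), &cur); err != nil {
		return "", err
	}
	for _, key := range path {
		obj, ok := cur.(map[string]interface{})
		if !ok {
			return "", fmt.Errorf("cannot descend into %q: not a JSON object", key)
		}
		next, found := obj[key]
		if !found {
			return "", fmt.Errorf("key %q not found", key)
		}
		cur = next
	}
	s, ok := cur.(string)
	if !ok {
		return "", fmt.Errorf("value at %v is not a string", path)
	}
	return s, nil
}

func main() {
	// Made-up response body with the same nesting as the template expects.
	raw := `{"net":{"orgRef":{"@name":"Example Org"}}}`
	name, err := lookupString(raw, "net", "orgRef", "@name")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("owner:", name)
}
```

Returning an error for a missing key, instead of an empty string, makes it easier to notice when the remote service changes its response format.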
The jsonq results are similar to the results generated by the jq JSON processor, so you can test in
a Bash shell using:
Chapter 16: Templates: Overview
Syntax
• #See [Link] for Go Template Action and Function syntax
• expression = alert status {{.[Link]}} and a variable {{.Eval .[Link].q | printf "%.2f"}}
• expression = `Use backticks to span
• multiple lines with line breaks
• in the Bosun config file`
• template name {
○ subject = expression
○ body = expression
• }
Remarks
Bosun templates are based on the Go html/template package and can be shared across multiple
alerts, but a single template is used to render all Bosun Notifications for that alert. Alerts reference
which template to use via the template directive and specify which notifications to use via the
warnNotification and critNotification directives (can have multiple warn/crit notifications defined
for each alert).
The template subject will be displayed as headers on the dashboard, as the subject line of email
notifications, and as the default contents of HTTP POST notifications. The template body will be
displayed when an alert instance is expanded and as the body of email notifications.
Examples
Low Memory Alert and Template
Templates can be previewed and edited using the Rule Editor tab in Bosun. Use the Jump to links
to select the alert you want to edit, then use the template button next to macro to switch
between the alert and template sections of the configuration. If an alert has multiple instances you
can use host=xxx,name=xxx in the Template Group section to specify the tagset for which you want
to see the template rendered.
template [Link] {
subject = {{.[Link]}}: Low Memory: {{.Eval .[Link].q | printf "%.0f"}}% Free Memory on {{.[Link]}} ({{.Eval .[Link] | bytes }} Free of {{.Eval .[Link] | bytes }} Total)
body = `
<p><a href="{{.Ack}}">Acknowledge</a> | <a href="{{.Rule}}">View Alert in Bosun's Rule Editor</a></p>
<p><strong>Alert Key: </strong>{{printf "%s%s" .[Link] .Group }}</p>
<p><strong>Incident: </strong><a href="{{.Incident}}">#{{.[Link]}}</a></p>
<p><strong>Notes: </strong>{{html .[Link]}}</p>
<strong>Graph</strong>
<div>{{.Graph .[Link] .[Link].graph_unit}}</div>
`
}
notification [Link] {
email = alerts@[Link]
}
alert [Link] {
template = [Link]
$notes = Alerts when less than 5% free, or less than 500MB (when total > 2GB). In Linux, Buffers and Cache are considered "Free Memory".
$default_time = "2m"
$host = wildcard(*)
$graph = q("avg:300s-avg:[Link].percent_free{host=$host}", "1d", "")
$graph_unit = Percent Free Memory (Including Buffers and Cache)
$q = avg(q("avg:[Link].percent_free{host=$host}", $default_time, ""))
$total = last(q("sum:[Link]{host=$host}", $default_time, ""))
$free = last(q("sum:[Link]{host=$host}", $default_time, ""))
#Warn when less than 5% free or total > 2GB and free < 500MB
warn = $q < 5 || ($total > 2147483648 && $free < 524288000)
#Crit when less than 0.5% free
crit = $q <= .5
critNotification = [Link]
}
After you test the alert on the Rule Editor page you can use the Results tab to see computations,
Template to see the rendered alert notification, and Timeline to see all alert incidents (only when
From and To dates are specified).
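To make the thresholds in the alert above easier to check, here is a small Go sketch of the same comparisons. It is not how Bosun evaluates expressions; the function name severity is hypothetical, and returning only the most severe state when both conditions hold is an assumption of the sketch.

```go
package main

import "fmt"

const (
	twoGiB         = 2147483648.0 // 2 * 1024^3 bytes, the "2GB" in the alert
	fiveHundredMiB = 524288000.0  // 500 * 1024^2 bytes, the "500MB" in the alert
)

// severity mirrors the warn and crit expressions of the low memory alert.
func severity(percentFree, totalBytes, freeBytes float64) string {
	switch {
	case percentFree <= 0.5:
		return "critical"
	case percentFree < 5 || (totalBytes > twoGiB && freeBytes < fiveHundredMiB):
		return "warning"
	default:
		return "normal"
	}
}

func main() {
	// 8 GiB host with 450 MiB free: about 5.5% free, but below the 500 MiB floor.
	fmt.Println(severity(5.49, 8589934592, 471859200))
	// 1 GiB host with 4% free: caught by the percentage rule only.
	fmt.Println(severity(4, 1073741824, 42949672))
	// 8 GiB host with 40% free: no alert.
	fmt.Println(severity(40, 8589934592, 3435973836))
}
```

The second clause of the warning exists because on large hosts 5% can still be a lot of memory, while on hosts of 2GB or less the percentage rule alone is used.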
Embedded Templates and CSS Styles
You can embed another template body into your template via {{template "mysharedtemplate" .}} to
reuse shared components. Here is an example that creates a header template that can be reused
at the top of all other template bodies. It also uses CSS to stylize the output so that it is easier to
read. Note that any <style>...</style> blocks will be converted to inline CSS on each element so
that email clients like Gmail will render the output correctly.
template header {
body = `
<style>
td, th {
padding-right: 10px;
}
[Link] {
padding-right: 10px;
}
</style>
<p style="font-weight: bold; text-decoration: underline;">
<a class="rightpad" href="{{.Ack}}">Acknowledge</a>
<a class="rightpad" href="{{.Rule}}">View Alert in Bosun's Rule Editor</a>
{{if .[Link]}}
<a class="rightpad"
href="[Link] {{.[Link]}} in
Opserver</a>
<a
href="[Link]
15m,mode:quick,to:now))&_a=(columns:!(_source),index:%5Blogstash-
%[Link],interval:auto,query:(query_string:(analyze_wildcard:!t,query:'logsource:{{.[Link]}}')
{{.[Link]}} in Kibana</a>
{{end}}
</p>
<table>
<tr>
<td><strong>Key: </strong></td>
<td>{{printf "%s%s" .[Link] .Group }}</td>
</tr>
<tr>
<td><strong>Incident: </strong></td>
<td><a href="{{.Incident}}">#{{.[Link]}}</a></td>
</tr>
</table>
<br/>
{{if .[Link]}}
<p><strong>Notes:</strong> {{html .[Link]}}</p>
{{end}}
<p><strong>Tags</strong>
<table>
{{range $k, $v := .Group}}
{{if eq $k "host"}}
<tr><td>{{$k}}</td><td><a href="{{$.HostView $v}}">{{$v}}</a></td></tr>
{{else}}
<tr><td>{{$k}}</td><td>{{$v}}</td></tr>
{{end}}
{{end}}
</table></p>
`
}
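The {{template "header" .}} action is a feature of Go's template packages, so the embedding mechanism can be tried outside Bosun. The sketch below uses html/template with two hypothetical templates and a made-up data struct; it shows only the embedding and HTML escaping, not Bosun's template context or its CSS inlining.

```go
package main

import (
	"bytes"
	"fmt"
	"html/template"
)

// render defines a shared "header" template, embeds it in a "body"
// template, and executes the body with the given values.
func render(key, notes string) (string, error) {
	t, err := template.New("header").Parse(`<p><strong>Key: </strong>{{.Key}}</p>`)
	if err != nil {
		return "", err
	}
	// Templates created with t.New share a namespace, so "body" can call "header".
	if _, err := t.New("body").Parse(`{{template "header" .}}<p><strong>Notes: </strong>{{.Notes}}</p>`); err != nil {
		return "", err
	}
	var buf bytes.Buffer
	data := struct{ Key, Notes string }{Key: key, Notes: notes}
	if err := t.ExecuteTemplate(&buf, "body", data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := render("os.low.memory", "Buffers & Cache count as free")
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Println(out)
}
```

Passing the dot (.) to the embedded template is what gives the header access to the same data as the template that includes it.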
With the header template defined, you can start your other templates with
body = `{{template "header" .}} to get the following output at the top:
It is often faster to use a generic template when first creating a new alert and only specialize the
template when you need to display more information. The following template will display a subject
with a numerical value, custom formatting, and description string and then a body with up to two
graphs. If no graph variables are specified it will instead list the computations used in the alert.
The generic template also uses the name of the alert to generate the subject (replacing dots with
spaces) and checks that variables exist before using them to prevent errors.
#See Embedded Templates and CSS Styles example for header template
template header { ... }
template computation {
body = `
<p><strong>Computation</strong>
<table>
{{range .Computations}}
<tr><td><a href="{{$.Expr .Text}}">{{.Text}}</a></td><td>{{.Value}}</td></tr>
{{end}}
</table></p>`
}
template generic_template {
subject = {{.[Link]}}: {{replace .[Link] "." " " -1}}: {{if .[Link]}}{{if .[Link].value_format}}{{.Eval .[Link] | printf .[Link].value_format}}{{else}}{{.Eval .[Link] | printf "%.1f"}}{{end}}{{end}}{{if .[Link].value_string}}{{.[Link].value_string}}{{end}}{{if .[Link]}} on {{.[Link]}}{{end}}
alert [Link] {
template = generic_template
$timethreshold = 60
$timegraph = 24h
$notes = Checks if puppet has not run in at least ${timethreshold} minutes. Doesn't include hosts which have puppet disabled.
This alert will produce a subject like "warning: puppet last run: It has been 62 minutes since last
run on co-lb04" and include graphs of last_run and disabled for that host. If you want to graph all
results for a query instead of just the matching tagsets you can use $generic_graph_all and
$generic_graph_all2 as the variable names.
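The subject-building tricks used by the generic template (replacing dots in the alert name with spaces, and emitting optional parts only when a value exists) can be tried with Go's text/template package. In the sketch below the struct, its field names, and the registration of a replace function backed by strings.Replace are assumptions made for illustration; Bosun supplies its own context and functions.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
	"text/template"
)

// alertContext is a hypothetical stand-in for the values a Bosun template sees.
type alertContext struct {
	Status string
	Name   string
	Host   string
}

// renderSubject builds a subject line, turning dots in the alert name into
// spaces and adding the host only when one is set.
func renderSubject(ctx alertContext) (string, error) {
	funcs := template.FuncMap{"replace": strings.Replace}
	tmpl, err := template.New("subject").Funcs(funcs).Parse(
		`{{.Status}}: {{replace .Name "." " " -1}}{{if .Host}} on {{.Host}}{{end}}`)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, ctx); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	subject, err := renderSubject(alertContext{Status: "warning", Name: "puppet.last.run", Host: "co-lb04"})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Println(subject)
}
```

Guarding each optional part with {{if ...}} is what lets one template serve alerts that define only some of the variables.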
Credits
S. No  Chapters                                    Contributors
2      Alerts: Advanced Scoping                    Kyle Brandt, Whisk
7      Notifications: Chat Systems                 Andy Kruta, Greg Bray
8      Notifications: Overview                     captncraig, Greg Bray
9      Packages and Initialization Scripts         Greg Bray, Mark V, Vincent Flesouras
10     Scollector: External Collectors             Gary W, Greg Bray
12     Scollector: Process and Service Monitoring  Greg Bray
13     Silencing and Squelching Alerts             Xavier Nicollet
14     Templates: Graph and GraphAll               Greg Bray
15     Templates: HTTPGet and HTTPGetJSON          Greg Bray
After changing a systemd unit file for a service such as Bosun, run 'systemctl daemon-reload' so that systemd rereads the unit definitions. Without it, systemd does not recognize the modifications, and services may keep running with outdated settings, which can lead to misconfiguration or failures.
Scollector is a monitoring agent designed to send system metrics to Bosun or any system that accepts OpenTSDB style metrics. It is modeled after OpenTSDB's tcollector framework and aims to auto-detect services so that metrics are sent with minimal or no configuration. Scollector also supports external collectors via scripts or executables, which allows for custom metric generation and transmission to the server. Scollector is optional: metrics can also be sent directly to the /api/put route or by another agent, and Bosun can use a different backend such as Graphite, InfluxDB, or ElasticSearch.
Templates in Bosun's alert notifications standardize the format and content of alerts, so that alerts are displayed consistently across incidents. They include elements such as alert details, graphs, and links. Using alert-specific variables ensures that notifications convey precise context about an incident. Templates can also embed shared components and CSS for customized styling, which keeps the presentation of alert data organized and readable.
Setting up Scollector on CentOS 7 involves creating a directory for Scollector, downloading the latest binary build, and creating a symbolic link in /usr/local/bin for easy execution. The configuration directory is created at /etc/scollector, where the Scollector configuration file (scollector.toml) is placed. This file specifies server details and collector configuration. A service file is then created in the systemd directory so that Scollector runs as a service, starts automatically, and is restarted on failure, which keeps metric collection and reporting consistent.
The TagOverride option customizes metric tags by modifying existing tags and adding new ones, which is particularly valuable in Kubernetes environments. Shortening the docker_id tags and adding container_name and pod_name tags makes it clearer which container or pod a metric belongs to, which helps with monitoring and analysis.
Avoiding divide by zero errors in Bosun alert expressions is important because they produce undefined results that can break calculations and cause erroneous alerts. Bosun suggests using short-circuit logic: a conditional expression is evaluated first so that the division is skipped when the denominator could be zero. This prevents +Inf outcomes, which are unusable in critical or warning thresholds.
Alert silencing and squelching in Bosun manage alert noise, allowing users to mute alerts temporarily or for certain conditions, such as specific hosts or services. This reduces the number of non-critical alerts so that teams can focus on higher-priority issues, and it mitigates alert fatigue by ensuring that only essential alerts reach the dashboard.
Macros in Bosun alerts simplify complex alert configurations by allowing predefined expressions and common settings to be reused across alerts. They reduce redundancy and errors, and they make it possible to adjust alert parameters globally by changing a macro definition rather than each individual alert.
External collectors expand Scollector's monitoring capabilities by allowing custom metric collection via scripts or executables for systems that the default collectors do not support natively. The collectors are placed in numbered directories that dictate their execution frequency, and they emit metrics in either the simple data output format or JSON. Setting them up involves pointing the ColDir configuration key at the collector directory and making sure the scripts output the desired metrics in the expected format.
Bosun's alerting system uses templates to produce structured notifications during an incident. A notification provides key information such as the alert name, the incident ID, and user-defined notes that explain the purpose of the alert and how to interpret it. Macros and variables can be used in alerts to customize the notification content and to include relevant numbers or graphs that help in understanding the incident. Templates can also include links to the rule editor and to the acknowledgment page, which supports effective incident response.