ELK 2 3 - Logstash Filtering - Structured Data

The document discusses using Logstash filters to manipulate log data before output. It describes taking a CSV log file and indexing it, then demonstrating how the data can be improved by using filters to split fields and normalize data types to make the data more useful for analysis and investigation.

In this lecture we're going to start talking about Logstash filtering, which is
probably the most important thing we're going to talk about in regard to
Logstash, and also the most complex.

There's a lot to be done here, and it's all very valuable to know. If you're going to
be using this to perform investigations, you want your data presented in a
useful way, and filtering is how we're really going to make that happen.

Now if you remember from our discussion earlier, Logstash really has three main
functions it provides, and that's mirrored in the way the configuration files are laid
out as well, as we've seen. We have inputs, filters, and outputs.

We spent a little while talking about inputs and getting data in, but that's really not
the entire battle. There are some things we have to do before we send it to an
output.

Well, we don't have to do them. We can send things straight from an input to an
output, but chances are they're not going to be formatted how we want them to
be.

For instance, when you just take a bunch of data like the data we were just
looking at, like flow data or Windows event data, and you don't manipulate it in
any way or define data types or anything of that nature, it all generally gets
indexed as a string.

And when things are indexed as a string, as we saw earlier when we looked at
ES, we can't always manipulate them in the way that we want. We can't do math
on them, we can't build the types of charts we want if they're IP addresses, and we
can't search for them using CIDR notation.

That's a problem: we need to get things into the right data types. And there are a
whole lot of other actions we need to take as well. Filters are where all that
magic happens.

And quite honestly, when you're building your config files, it's probably where
you're going to spend most of your time. So we're going to spend a little bit
of time on that here as well, talking about the various filter plugins you can
use.

Now you might be asking yourself what specific things the filtering
plugins can do. Really, they can do a lot of things, and we're going to show
you a lot of those here. But just to give you an overview, you can use filters to add
fields to the data you're indexing, maybe to provide some additional context. You
can use them to remove fields (if a field doesn't have value to you as an investigator,
maybe it's more for a sysadmin or something like that, you might as well remove it
and save the disk space and processing required to index it).

You can change field names. This is incredibly important when you want to index
multiple data sources and be able to search across them using the same field
name. For instance, a lot of different sources log IP addresses, but they all call
that field something different. If you normalize that and change the field names to
the same thing, you can search across those a lot more easily.

Another thing you can do is change field data types. This is important, as we saw in
an earlier example. We want to take advantage of indexing a particular type of
data, such as an integer, an IP address, and so on, so we want to change those data
types from the typical string they might otherwise be indexed as.

Of course, filtering can also filter out events. This is really common, and really
expected, for things like Windows event logs, where you have a lot of entries that
don't provide much security value. We want to filter those out and not index
them, since they're not useful to us, and we can do that here.

Of course you can normalize fields. That includes all sorts of things, including
lowercasing field values and changing certain values to be consistent with others;
there are all sorts of options here in terms of normalization, so you get consistent
search results when you're doing your investigations.

And finally, and maybe most importantly, we can bring structure to unstructured
data. A lot of the data we'll get handed, especially in the IR realm, will be
unstructured: just line-based log data that you have to parse out and write parsers
for yourself. We can do that, and we can apply those parsers using the Grok filter,
which we've dedicated an entire lecture to in this section.

As we get into our example and the demo portion of this, I want to paint a picture
for you of a common scenario. This is going to be especially the case if you're
in consulting or IR and you walk into a customer site and you need logs; first of
all, you'll be fortunate if they have logs for you at all.

But hopefully they do, and when they hand you those logs, they're often going to
be in a format that is a little raw. They're going to be plain line-based text logs,
not formatted in any particular way, and one of your goals is going to be to get
those into a centralized location, normalize them, format them the way you want,
massage that data around, and get it to where you can use it and search it in a
meaningful way.

And that's exactly the example I want to work through with you here. So for the rest
of this lecture we're going to take an input, and we're going to work on massaging
it with filter plugins in Logstash.

And so what I've done here is, in the data directory at C logs data, I've created this
app_access.csv file. It's a comma-separated value file, a common output that
you're probably going to get handed from time to time, and it's a very simple log
file, akin to something you might see from a custom web application or a web
access log.

So in this case we have a Unix timestamp, we have a username, we have an IP
address, we have a company name, we have a result (whether the authentication
was a success or a failure), and then we have a customer ID number.
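
Just to make the format concrete, a line in a file like this might look something like the following. This is a hypothetical example I've made up purely for illustration; the lecture only describes the six columns and doesn't dwell on the raw file contents:

    1374190200000,jsmith,203.0.113.45,AcmeCorp,pass,10442

That first value is a Unix timestamp in milliseconds since epoch, which will matter later when we parse it.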

So this isn't an insanely complex log, but it's a log that we need to do more with;
we don't necessarily want to index it just as it is here. So let's now take a look at
the configuration file that I've already got set up. The configuration file,
'accesslog.conf' is what I've named it, is pretty straightforward, very simple. It's
the exact same thing we did earlier with our other example of reading a file-based
log, with one slight change.
You see here we're using the same output to ES, we're using the file input plugin
referencing the path where the CSV file is located, and we're telling it to start at
the beginning, but I do have one additional option in here, and that's the
sincedb_path option.

And just really quickly: one of the things Logstash has the capability to do when
it's reading a log file is to remember where it left off, and it does this by creating
sincedb files. That way, if you restart Logstash, it can keep track of where it was
without reading the entire file in again, so you don't get duplicate logs.

What I've done here is specified the path of the sincedb files and set that to
'/dev/null', which is basically telling Logstash: "hey, just trash those, because what
we're going to be doing here is iterating on our configuration file a lot and
constantly re-indexing this file over and over again, and I don't want to have to
manually delete those every time".
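
To make that concrete, here's a minimal sketch of the kind of starting configuration being described. The exact file path, Elasticsearch host, and index settings aren't shown here, so treat those values as placeholders I've assumed; the options themselves (the file input with start_position and sincedb_path, plus the elasticsearch output) are the ones discussed above:

    input {
      file {
        # Assumed location of the CSV log file
        path => "/labs/data/app_access.csv"
        # Read the file from the top instead of tailing only new lines
        start_position => "beginning"
        # Discard sincedb state so the file is re-read on every restart (lab use only)
        sincedb_path => "/dev/null"
      }
    }

    output {
      elasticsearch {
        # Assumed local ES instance; index naming left at the default
        hosts => ["localhost:9200"]
      }
    }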

So that's the sincedb_path option pointed at '/dev/null', something you're going to
want to do when you're playing around with this on your own in a lab
environment and constantly re-indexing things. It's not something you'd do in
production, but it is something we've done here, and you'll continue to see it
throughout this course as we re-index things.

Now with that said, let's actually take a look at the data that is already indexed
into ES. I’ve already run this config. I’ve already indexed that data file in, so let's
take a look at what that looks like.

So our data is in Kibana here (it's in ES, and we're accessing it via Kibana), and it's
pretty simple. I want you to notice we have metadata fields, and then we have the
data we actually indexed.

So we have this path field, we have this timestamp field, version, host, and so on,
plus type and index. These are metadata fields, things that were added by
Logstash and by ES.
The actual data itself is all contained within one field: it's all within this message
field, and you can see it's all a single line. That poses some problems, because it
makes the data hard to search.

Certainly we can do some free text search, but the way ES utilizes its analyzers,
that's a little non-trivial sometimes, and we also don't really have the ability to do
any aggregations.

You know, over here with some of these other fields, like path or host, we can
click on them and get aggregations. Of course, there's only one host here, so we
only get one.

If these were broken up into separate fields, we could aggregate on all of them. We
could perform simple aggregations on the user or on the IP address or on the
company or any of these things. We could also search them a whole lot more
easily, and we could even do certain types of operations.

So for instance, on the customer ID, we could do mathematical operations if it
was configured as an integer.

So all told, having everything in a single field is not the way we want to go here.
We want to break these things up into their individual fields, and maybe even go a
little bit further and do a couple of other useful things to them so this data is much
easier to search.

Remember, everything we're doing with Logstash is to make the data more
amenable to performing investigations and solving those investigations. So that's
what we're going to work on now: configuring filter plugins so we can make this
data easier to work with.

The thing you need to know here is that Logstash uses plugins for filters, just like
it does for inputs and outputs, which means we have a dedicated filter plugin page
in the help documentation.

Now the thing we're going to use here first is the CSV plugin, because really the
first thing we have to do is break apart that big long line into individual fields. And
once we do that, we can continue to manipulate the data further.
But really the first step is taking this structured data that we have via CSV and
relaying that structure to ES, so that when Logstash indexes it, we have that
structure represented there.

Now the CSV plugin is fairly simple. There are actually no required options; we're
going to use the columns option, and we can see that columns is of type array.
So we're simply going to define a list of column names in the order they appear in
the CSV, as if it were a header line.

Pretty straightforward; this is a pretty easy step, and since the data is structured,
it's very easy. Later on we'll look at unstructured data, which is a lot more
complex, but structured data in a CSV file is fairly easy.

Now the thing we have to do before we get there, of course, is to clear all this data
out of Kibana, or rather out of ES. We're going to re-index it, and if we make a
change to our configuration file, that change isn't going to be reflected in data
that has already been indexed.

So I'm going to go here to Dev Tools and simply hit play, and that's going to
tell me that it's deleted all of our Logstash indices. We can now go in and edit
our config file, restart Logstash, and have those changes go into effect.

So let's go ahead and do that. Now, what I've done for you in this terminal window,
down at the bottom here, is SSH into the same box where our data is. I just want
it on the screen so you can see what we're working with as we're manipulating it.

Now with that said, let's go ahead and edit our conf file, which is accesslog.conf.
It's pretty straightforward right now, but we're going to add a filter statement
using the CSV plugin.

So we're going to create the object for that in the configuration file, and then
we're going to specify the columns option, which is the only one we're using, and
create an array. The array is simple: we're just specifying the header fields in the
order they occur, separated by commas.
We can get that from looking at the structure here below, so I'm going to name
these fields datetime, username, source IP, company, result, and customer ID.
That gets us all six fields that are listed here, in the order they are separated by commas.
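
As a sketch, that csv filter block comes out looking something like this. The 'datetime' and 'srcip' identifiers match what the lecture uses later on; the exact written form of the other column names (for example 'customer_id') is my assumption about how the spoken names would be typed:

    filter {
      csv {
        # Column names in the order they appear in the file, as if this were a header row
        columns => ["datetime", "username", "srcip", "company", "result", "customer_id"]
      }
    }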

Of course, you can name these anything you want; that's just what we're going to
name them for the purposes of this exercise, and those names will be important as
we continue to manipulate these things going forward.

So we'll go ahead and exit out of that, and really all we have to do now, since
we've already deleted the old index, is restart Logstash. I've pasted in the command
to do that, so we'll go ahead and run it. That will take just a second, but
once it's done, we should be able to go back over to Kibana and see the data
properly re-indexed.

So, giving that just a second here, let's go back in and go back to Discover.
It looks like we actually have data now, and sure enough, things are
broken back out into their appropriate field names. Of course, they're not shown
in the order we separated them out into; Kibana lists them in its own order.

So here we see result, we see date time, we see source IP, we see all the fields we
specified, here's username and so on.

So we have all those fields available to us, which means we can now do the things
we wanted to do before.

So I can click on source IP and do aggregations based upon that; it looks like
they're all unique in this file. I can click on result and do aggregations on that; it
looks like we have a few more fails than we do passes. So you can see some of
the power we have here, and of course it means we can also search for these
things based upon individual fields.

Now looking here, you may see something that's a little odd: notice we still
have our message field. That's the field that contains the entire log line
altogether. You might ask yourself, do we really need this if we have these things
broken out into their own fields? In most cases you probably don't. There are some
reasons you might want to keep it, for instance to retain the unnormalized data
alongside the normalized data, and so on.

But in most cases you're going to want to remove that field, so that's what we'll
do next. Let's go ahead and edit our filter configuration to remove this
message field.

To do that, I'm going to use a different plugin. Let's go back here to our plugins
page; we're actually going to use the mutate plugin, which is a fun name, but it
performs mutations on fields, so I guess it's appropriate.

Now, we're going to be using this plugin a lot; it's probably one of the more
common ones we'll use, and it has a lot of different options. Just looking at
these, we can see we can do conversions, copies, joins, lowercases, merges,
renames, replaces, etc. We can do a lot of things in here.

The option that I want is here under common options: we're actually going to use
this remove_field option. So I'm going to click on that so we can see what it looks
like, and it's pretty straightforward.

We're going to specify the plugin name (I think there's an error in the
documentation; it should say mutate right here), but nonetheless we're going to
use the plugin name, which is mutate, then remove_field, and specify the field.

Ours is going to be a little simpler than this, because we're specifying a very
distinct field name. So let's go ahead and do that. Of course, first I need to go in
here and delete the index, since we're going to be re-indexing that data, and then
we can go in and actually make this change.

So let's do that: 'sudo nano conf.d/accesslog.conf'. We're going to go in here
underneath where we've broken things out with the CSV plugin and specify our
mutate plugin, and the option we're going to use here is remove_field.

We'll set that up, and then we'll specify the field we want to remove, which is
message. The message field is the one we're going to remove.
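
In the config file that's just a few more lines inside the same filter block, right after the csv section, roughly like this:

      mutate {
        # Drop the raw log line now that it's been parsed into individual fields
        remove_field => ["message"]
      }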
So I've gone ahead and saved that, and that looks good. Let's go ahead and restart
Logstash and see if the change we made takes effect.

It looks like we've re-indexed the data, and sure enough, I don't see a message field
anymore. We still have all of our data; I see customer ID and company and
username, but I don't see the message field anymore. So it looks like that was
successful, and we're no longer indexing that field. We don't need it anymore, so
that's great.

So what's next? We've already done a lot, and we're certainly much better off
than we were, but we can keep going. There are a lot of other things we can
do here to make this data more reasonable to work with.

One thing I notice immediately is that we have two timestamps. We have
datetime, which is the Unix-formatted timestamp that's been indexed from
the data we're using, but there's also this timestamp field, and this timestamp right
here is the time the data was indexed into ES.

There are reasons you might want to have both of these timestamps, but for
our purposes we really don't. We'd rather use the one that came with the log file,
because that's the one we're going to be using to perform our searches and sync
things up with other network- and host-based events that have occurred.

Timestamps are certainly a very important aspect of everything we're doing in
terms of digital forensics and incident response.

So what I actually want to do here is take this datetime value and have it
populate the timestamp field, so we only have one timestamp. That's what
we're going to do, and to do it, we're going to go back here and use the
'date' plugin.

And it's pretty straightforward. You don't see any required options here, but we're
going to use it to do exactly what it says: the date filter is used for parsing dates
from fields and then using that date or timestamp as the Logstash timestamp for
the event.
There are a couple of examples here. We're going to use one like this, where we
pass in the name of the field that we want to use and then pass in the format of
the timestamp. Of course, our timestamp is formatted like this: it's UNIX_MS,
which parses integer values expressing Unix time in milliseconds since epoch, like
this example, which is exactly what our timestamp looks like.

So that's what we're going to work with here. Let's go ahead and get that
configured; I'm going to purge this index, and let's switch back over to our
terminal window.

I'm going to open up the file here, and to do this we're going to use the 'date'
filter plugin with the 'match' option. For match we're going to open a new array,
and we're going to specify the name of the field, which is 'datetime' (that comes
from what we have up here; it's what we named the column when we broke it out
with the CSV plugin), and then we're going to specify its format, which is UNIX_MS,
and close that array out. That should get us what we need.
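
As a sketch, the date filter being described looks roughly like this, assuming the column really was named 'datetime' in the csv block:

      date {
        # Parse the millisecond Unix epoch value from the datetime column
        # and use it as the event's @timestamp
        match => ["datetime", "UNIX_MS"]
      }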

So let's go ahead and get Logstash restarted here, and we'll see if that worked.
And there we go, it looks like we're good. Of course, we still have both fields,
since we didn't tell it to drop the original datetime field.

But notice now that the actual timestamp matches the datetime field. You can't
really tell, unless you can do Unix epoch time conversions in your head, and if so,
then you probably don't need this course.

But nonetheless, I can just tell you these are different, because if you look at this
timestamp it's July 18th, 2013, and I'm making this course now in 2017.

So clearly this is an older log file which makes sense.

So now our timestamp field is based upon our datetime field, and if we wanted to,
we could go so far as to drop the datetime field, which may be worthwhile to do
since we don't really need two fields showing the same time. So as a matter of
fact, let's just go ahead and do that real quick: I'm going to delete the index,
go in here to our configuration, and add the option to remove
another field, and that is datetime. Go back here, and sure enough, that field is
now gone.
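
The lecture handles this by just adding another field to a remove_field option. One detail worth noting (my addition, not something spelled out on screen): Logstash applies filters in the order they appear, so datetime can only be removed after the date filter has read it. Since remove_field is a common option available on any filter and only runs when that filter succeeds, one clean way to express it is on the date filter itself:

      date {
        match => ["datetime", "UNIX_MS"]
        # Drop the original column once it has been successfully parsed into @timestamp
        remove_field => ["datetime"]
      }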

So we're only left with the single timestamp field, and that's the only timestamp we
have to deal with, so we're no longer indexing useless data that we really just don't
need anymore.

So what can we do next? Well, something I like to do, and I think is valuable here, is
to add some relative context to the information we've indexed. One way to
do that is with GeoIP information, and you might ask: well, how do we do that?
As you might guess, there's a plugin for that.

So let's go ahead and find the geoip plugin, and it's pretty straightforward. It tells
us the geoip filter adds information about the geographical location of an IP
address based upon data from the MaxMind GeoLite2 databases.

So that's what we're going to do. Since we're feeding it external IP addresses, we can
actually apply this filter, and it will geo-map those for us and provide that context
within the data. We often call this decorating the data with additional context.

So that's what we're going to do here. It looks like our main requirement is
the source field. Let's go down here and see what it wants from us. It says
source is a required string: the field containing the IP address or hostname to
map via GeoIP.

Okay, that's pretty good. I think we'll also use the target option here, which
specifies the field into which Logstash should store the GeoIP data.

So that's all we're doing is source and target. That should be pretty
straightforward.

So let's go here and once again delete our index, and let's switch back to
our configuration file. Now we're going to go down here under mutate, under date,
under all these other things we've done, and specify that we're using the geoip filter.
Again we have two configuration parameters: one is source and the
other is target. Now, source is the field this is being generated from. Notice up
here that the field we used to hold our IP address was 'srcip', so that's what we're
going to use here, and then we simply have to supply a target for the field name.

We don't have to supply the target, but I like to, especially since a lot of the logs
you'll deal with have both source and destination IPs. So in this case I'm going to
put it under 'geoip_srcip'. If we had a destination IP, we would have another geoip
block for that field. But we'll start with this, and that should get us what we want
here with this GeoIP decoration.
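
A sketch of that geoip section, using the field names the lecture settles on:

      geoip {
        # Field holding the IP address to look up in the GeoLite2 database
        source => "srcip"
        # Nest all of the GeoIP results under their own field
        target => "geoip_srcip"
      }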

So let's go ahead and get things restarted, and we'll see if we've been successful.
All right, it looks like we were. We know this because we actually have
quite a few fields added, and of course, as you know, there are quite a few things
that go into GeoIP location, and we've got a lot of those here. For instance, we
have latitude here, longitude is here as well, we have the source country name,
the country code, it looks like we have a region name here (which in the case of
the US is a state), we have a zip code, and so on.

So we have all these different fields that we can now use to enrich the data we
have, and of course, you can control which of these fields you do and don't want;
we've got them all here. And notice they're broken out here on the left of the
screen, so we can group by those things.

So I can go down here and sort by region code; it looks like we have a lot of traffic
from California and New Jersey. We can sort these by various things. Here's country
name: most of this is US-based traffic, and we've got some China, India, Canada,
and Taiwan here.

So we've enriched our data, we've decorated it with useful context, and that's pretty
darn neat. That's useful to us, and it's one way you can enrich data using the
plugins that are available for Logstash.

There's one more thing I want to do: normalize the values in the result field, so
that instead of 'pass' the successful logins say 'success', which matches the
terminology other logs tend to use. Now, at first thought you might say: well, let's
just add it to the existing mutate section, since we're already doing some things
with the mutate plugin there. But we can't quite do that, because we're only
replacing certain values of a certain field.

If I put a replace here to change the result field to success, it's going to
replace every value. We don't want to replace every value; we only want to replace
those that say pass with the word success. So we actually need a conditional
here, and I haven't talked about those yet, but that's one of the beauties of the
Logstash configuration syntax: you can actually do conditionals, and it's pretty
slick.

So let's do that; I'm going to show you an example of it. What we want to do
here is: if the result field has a value of pass, we want to replace the word pass
with success. We're going to write this out just as I said it: if the result field
equals pass, we want to take an action, so we open up a new conditional block,
and the action we want to take is to use that mutate plugin.

So we'll open up an object for that as well (and I know my tabs aren't lining up
perfectly, but bear with me on this one). The option we want to use is replace,
which is an array, and we want to replace the result field's value with success.
We're replacing pass with success, and that's pretty much it.
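
As a sketch, the conditional being described looks roughly like this in the config file. The lecture describes the array form of replace shown here; newer Logstash releases document a hash form, replace => { "result" => "success" }, which does the same thing:

      # Only touch events whose result field is exactly "pass"
      if [result] == "pass" {
        mutate {
          # Rewrite the value of the result field to "success"
          replace => ["result", "success"]
        }
      }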

So again, we could have placed this up in the other mutate block, but if we had
done that, it would have replaced every value with success, and we don't want
that. So again, this is a conditional: if the result field says pass, apply the
mutate filter plugin and use it to replace that field's value with success.

So let's go ahead and apply that, and see if it worked. All right, let's look here, and
what do you know, there's one right at the top: result equals success. Once again,
the result used to say pass, and that was what was native in the log, but we wrote
the filter so that anywhere we saw pass, we replaced it with success. Now we can
correlate this better with other logs that actually use the terminology success or
fail. This is actually a common thing you'll probably be doing across a lot of fields,
because, as we know, vendors name the same thing many different ways, not just
across different vendors but even across different products from the same vendor.

So normalizing your field names and values, so you can search them better, is
certainly a common task you'll be performing.

I think that's a pretty good stopping point for now. We've dramatically changed
the data that's coming in. Remember, we didn't touch our source file at all; that
file stays the same, and that's how you want it to be. You want the file to remain
what it is and do all the hard work within Logstash. That's just the easiest way to
get things normalized and take all the actions we wanted to take.
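
Putting it all together, the configuration built up over this lecture comes out looking roughly like the sketch below. It's reconstructed from what was described on screen, so the file path, ES host, and the written form of some column names are assumptions on my part rather than the exact demo file:

    input {
      file {
        path => "/labs/data/app_access.csv"   # assumed path to the CSV log
        start_position => "beginning"
        sincedb_path => "/dev/null"           # lab use only: always re-read the file
      }
    }

    filter {
      csv {
        columns => ["datetime", "username", "srcip", "company", "result", "customer_id"]
      }
      mutate {
        remove_field => ["message"]           # drop the raw line once it's parsed
      }
      date {
        match => ["datetime", "UNIX_MS"]      # use the log's own timestamp as @timestamp
        remove_field => ["datetime"]          # then drop the now-redundant column
      }
      geoip {
        source => "srcip"
        target => "geoip_srcip"               # decorate with GeoIP context
      }
      if [result] == "pass" {
        mutate {
          replace => ["result", "success"]    # normalize "pass" to "success"
        }
      }
    }

    output {
      elasticsearch {
        hosts => ["localhost:9200"]           # assumed local ES instance
      }
    }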

So we've gone from everything being in one field, where we couldn't do a lot with
it, to data that's enriched with additional context from the GeoIP information.

It's broken into different fields, so we can perform aggregations, we've normalized
some things, and there's a lot of great stuff applied here. Of course, along the way
you learned how to use filter plugins within Logstash. There are a whole lot more of
those, and we're going to continue to go through more of them throughout the
course. But for now, I think we'll go ahead and leave this file be and move on to
our next lesson.

In this lesson we learned all about filter plugins and how they can be used to
manipulate and enhance input data, and we went through a real-world example to
demonstrate those concepts. Along the way, you learned about several of the
most important plugins that exist, including the csv plugin, which allows us to
take structured CSV data and break it into separate fields, and the date plugin,
which allows us to take the ingested date and replace the default ES timestamp
with the relevant one; timestamps are very important in forensics and incident
response.

We learned about the mutate plugin, which is a lot more involved than we even
showed here. We're going to continue to use that one; it's really important and
does a lot of really cool things.
We of course enriched our data with the geoip plugin, which is really nice, so we
could have that GeoIP information right in there with our events; that's the
decoration concept we talked about.

And finally, we talked about conditionals, and I showed you how to use a
conditional along with the mutate plugin to selectively apply the changes we
wanted to make based upon field names and values.

So input is important, but the ability to filter that input based on certain parameters
is really where you're going to spend most of your time. Of course, in this example
we had structured data to work with, and that makes things a little bit easier.

In the next lesson we're going to focus on unstructured data and on using the
illustrious Grok plugin to separate that out. So that's going to take things up a
notch1.

But hopefully you've gotten your feet wet and you're starting to play around with
these filter plugins; definitely take some time to play around with them on
your own as well.

1
take it up a notch: to try harder / make more effort / to make something more exciting, intense, or
interesting
