Databricks Cloud How To Log Analysis Example
Getting Started
First, we need to locate the log file. In this example we use a
synthetically generated log stored in the /dbguide/sample_log
file. The command below (typed in the notebook) assigns the log file's
pathname to the DBFS_SAMPLE_LOGS_FOLDER variable, which is
used throughout the rest of this analysis.
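The notebook snippet itself is not reproduced here, but a minimal local sketch of the setup and parsing step might look like the following. It assumes the log uses the Apache Common Log Format; the regex and the parse_apache_log_line helper are illustrative, not the guide's exact code. In the notebook, the file would be read into an RDD with sc.textFile(DBFS_SAMPLE_LOGS_FOLDER).

```python
import re

# Path to the sample log in the Databricks File System (from the guide).
DBFS_SAMPLE_LOGS_FOLDER = "/dbguide/sample_log"

# Illustrative regex for the Apache Common Log Format; the actual
# notebook code may use a different pattern.
APACHE_LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+)'
)

def parse_apache_log_line(line):
    """Parse one access-log line into (host, method, path, code, size)."""
    m = APACHE_LOG_PATTERN.match(line)
    if m is None:
        return None
    return (m.group(1), m.group(5), m.group(6),
            int(m.group(8)), int(m.group(9)))

sample = ('127.0.0.1 - - [01/Aug/2014:10:00:00 -0700] '
          '"GET /index.html HTTP/1.1" 200 1234')
print(parse_apache_log_line(sample))
# In Spark, the equivalent would be applied line by line:
# access_logs = sc.textFile(DBFS_SAMPLE_LOGS_FOLDER).map(parse_apache_log_line)
```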
At the end of the above code snippet, notice that we count the number of
tuples in access_logs, which returns 100,000.
Second, we use the reduce() operator to compute the sum of all content sizes
and then divide it by the total number of tuples to obtain the average:
Figure 5: Computing the average content size with the reduce() operator
The result is 249 bytes. Similarly, we can easily compute the min and max, as
well as other statistics of the content size distribution.
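The sum-then-divide logic Spark applies can be mimicked locally with Python's functools.reduce; this is a sketch, and the content_sizes values below are made up for illustration (in the notebook they come from the content_sizes RDD derived from access_logs).

```python
from functools import reduce

# Hypothetical content sizes, standing in for the content_sizes RDD.
content_sizes = [1200, 340, 0, 515, 249]

# The same lambda shape Spark's reduce() would receive.
total = reduce(lambda a, b: a + b, content_sizes)
average = total / len(content_sizes)

print(total, average)
# min and max are equally straightforward, as the guide notes:
print(min(content_sizes), max(content_sizes))
```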
An important point to note is that both commands above run in parallel.
Each RDD is partitioned across a set of workers, and each operation invoked
on an RDD is shipped and executed in parallel at each worker on the
corresponding RDD partition. For example, the lambda function passed as the
argument of reduce() is executed in parallel at the workers on each partition
of the content_sizes RDD, computing a partial sum for each partition.
These partial sums are then aggregated at the driver to
obtain the total sum. The ability to cache RDDs and process them in parallel
are two of the main features of Spark that allow us to perform large-scale,
sophisticated analysis.
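The two-stage aggregation described above can be sketched locally: each inner list below plays the role of one RDD partition held by a worker, and the final reduce plays the role of the driver combining the partial sums. The partition contents are invented for illustration.

```python
from functools import reduce

# Toy stand-in for an RDD partitioned across workers: each inner list
# represents one partition of content_sizes.
partitions = [[1200, 340], [0, 515], [249]]

# Step 1: each "worker" reduces its own partition to a partial sum.
partial_sums = [reduce(lambda a, b: a + b, p) for p in partitions]

# Step 2: the "driver" aggregates the partial sums into the total.
total = reduce(lambda a, b: a + b, partial_sums)

print(partial_sums, total)
```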
Figure 8: Converting RDD to DataFrame for easy data manipulation and visualization
Now we can plot the count of response codes by simply invoking display()
on our DataFrame.
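The data behind that chart can be sketched locally by tallying response codes with collections.Counter; the codes below are invented, and in the notebook this grouping happens on the DataFrame before display() renders it.

```python
from collections import Counter

# Invented sample of response codes; in the notebook these come from
# the parsed access_logs records.
response_codes = [200, 200, 404, 200, 500, 404, 200]

# One count per distinct code -- the rows the DataFrame would hold.
counts = Counter(response_codes)
print(sorted(counts.items()))
```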
If you want to change the chart type, you can do so interactively by
clicking on the down arrow below the chart and selecting another chart type.
To illustrate this capability, below we show the same data using a pie chart.
Additional Resources
If you'd like to analyze your Apache access logs with Databricks, you can
evaluate Databricks with a trial account now. You can also find the source
code on GitHub.
Other Databricks how-tos can be found at:
The Easiest Way to Run Spark Jobs