Databricks Cloud How To Log Analysis Example
Getting Started
First, we need to locate the log file. In this example we use a
synthetically generated log stored in the /dbguide/sample_log
file. The command below (typed in the notebook) assigns the log file's
pathname to the DBFS_SAMPLE_LOGS_FOLDER variable, which is
used throughout the rest of this analysis.
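The notebook snippet itself is not reproduced here, but a minimal local sketch of the setup and parsing step might look like the following. It assumes the log uses the Apache Common Log Format; the regex and the parse_apache_log_line helper are illustrative, not the guide's exact code. In the notebook, the file would be read into an RDD with sc.textFile(DBFS_SAMPLE_LOGS_FOLDER).

```python
import re

# Path to the sample log in the Databricks File System (from the guide).
DBFS_SAMPLE_LOGS_FOLDER = "/dbguide/sample_log"

# Illustrative regex for the Apache Common Log Format; the actual
# notebook code may use a different pattern.
APACHE_LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+)'
)

def parse_apache_log_line(line):
    """Parse one access-log line into (host, method, path, code, size)."""
    m = APACHE_LOG_PATTERN.match(line)
    if m is None:
        return None
    return (m.group(1), m.group(5), m.group(6),
            int(m.group(8)), int(m.group(9)))

sample = ('127.0.0.1 - - [01/Aug/2014:10:00:00 -0700] '
          '"GET /index.html HTTP/1.1" 200 1234')
print(parse_apache_log_line(sample))
# In Spark, the equivalent would be applied line by line:
# access_logs = sc.textFile(DBFS_SAMPLE_LOGS_FOLDER).map(parse_apache_log_line)
```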
At the end of the above code snippet, notice that we count the number of
tuples in access_logs, which returns 100,000.
Second, we use the reduce() operator to compute the sum of all content sizes
and then divide it by the total number of tuples to obtain the average:
Figure 5: Computing the average content size with the reduce() operator
The result is 249 bytes. Similarly, we can easily compute the min and max, as
well as other statistics of the content size distribution.
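The sum-then-divide logic Spark applies can be mimicked locally with Python's functools.reduce; this is a sketch, and the content_sizes values below are made up for illustration (in the notebook they come from the content_sizes RDD derived from access_logs).

```python
from functools import reduce

# Hypothetical content sizes, standing in for the content_sizes RDD.
content_sizes = [1200, 340, 0, 515, 249]

# The same lambda shape Spark's reduce() would receive.
total = reduce(lambda a, b: a + b, content_sizes)
average = total / len(content_sizes)

print(total, average)
# min and max are equally straightforward, as the guide notes:
print(min(content_sizes), max(content_sizes))
```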
An important point to note is that both commands above run in parallel.
Each RDD is partitioned across a set of workers, and each operation invoked
on an RDD is shipped and executed in parallel at each worker on the
corresponding RDD partition. For example, the lambda function passed as the
argument of reduce() is executed in parallel at the workers on each partition
of the content_sizes RDD, computing a partial sum for each partition.
These partial sums are then aggregated at the driver to
obtain the total sum. The ability to cache RDDs and process them in parallel
are two of the main features of Spark that allow us to perform large-scale,
sophisticated analysis.
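The two-stage aggregation described above can be sketched locally: each inner list below plays the role of one RDD partition held by a worker, and the final reduce plays the role of the driver combining the partial sums. The partition contents are invented for illustration.

```python
from functools import reduce

# Toy stand-in for an RDD partitioned across workers: each inner list
# represents one partition of content_sizes.
partitions = [[1200, 340], [0, 515], [249]]

# Step 1: each "worker" reduces its own partition to a partial sum.
partial_sums = [reduce(lambda a, b: a + b, p) for p in partitions]

# Step 2: the "driver" aggregates the partial sums into the total.
total = reduce(lambda a, b: a + b, partial_sums)

print(partial_sums, total)
```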
Figure 8: Converting RDD to DataFrame for easy data manipulation and visualization
Now we can plot the count of response codes by simply invoking display()
on our DataFrame.
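The data behind that chart can be sketched locally by tallying response codes with collections.Counter; the codes below are invented, and in the notebook this grouping happens on the DataFrame before display() renders it.

```python
from collections import Counter

# Invented sample of response codes; in the notebook these come from
# the parsed access_logs records.
response_codes = [200, 200, 404, 200, 500, 404, 200]

# One count per distinct code -- the rows the DataFrame would hold.
counts = Counter(response_codes)
print(sorted(counts.items()))
```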
If you want to change the chart type, you can do so interactively by
clicking on the down arrow below the chart and selecting another chart type.
To illustrate this capability, below we show the same data using a pie chart.
Additional Resources
If you'd like to analyze your Apache access logs with Databricks, you can
evaluate Databricks with a trial account now. You can also find the source
code on GitHub.
Other Databricks how-tos can be found at:
The Easiest Way to Run Spark Jobs