
ABD00 Notebooks Combined



ABD02 Python refresh


# Python environment
import sys
print(sys.version)

3.9.5 (default, Nov 23 2021, 15:27:38)


[GCC 9.3.0]

Python
int=100; dec=222.222   # note: the name 'int' shadows the built-in type

print(type(int), type(dec))

<class 'int'> <class 'float'>

print("Int = ", int, "\nDecimal= ", dec)

Int = 100
Decimal= 222.222

print("Int = %d Decimal = %.2f" % (int, dec))

Int = 100 Decimal = 222.22

print('single Q')
print("single Q")
print("lot's of")

single Q
single Q
lot's of

dict = {'Lisboa':'1900', 'Porto':'4000'}


print(dict)

{'Lisboa': '1900', 'Porto': '4000'}

#list(dict)
list(dict.items())

Out[8]: [('Lisboa', '1900'), ('Porto', '4000')]

print(dict['Lisboa'])
print(dict['Porto'])
# Much more on lists but not relevant for our Spark examples

1900
4000

Anonymous functions

x = lambda a : a + 2
print(x(3))

f = lambda x, y : x + y
f(1,1)

Out[11]: 2


times2 = lambda var : var*2

print(times2)
type(times2)

<function <lambda> at 0x7fe784341dc0>


Out[13]: function

print(times2(2))

seq = [1,2,3,4,5]
print(seq)

[1, 2, 3, 4, 5]

list(map(times2,seq))

Out[16]: [2, 4, 6, 8, 10]

print(*map(times2,seq))

2 4 6 8 10

list(map(lambda var: var*2 , seq))

Out[18]: [2, 4, 6, 8, 10]

list(filter(lambda item: item % 2 == 0, seq))

Out[19]: [2, 4]

print(*map(lambda var: var*2 , seq))


list(map(lambda var: var*2 , seq))

2 4 6 8 10
Out[20]: [2, 4, 6, 8, 10]

ABD02 Databricks Community Edition


# Comment your notebooks with markdown cells
# Type '%md #' to turn a cell into a heading, like '%md # - ABD Databricks Community Edition'
# Hit 'ctrl enter' or 'shift enter' to run the code in the cell

print("Hi")

Hi

Try some commands

These are Linux commands

%sh pwd #print working directory (shows the driver's current directory)

/databricks/driver

%sh ls -l

total 1304
drwxr-xr-x 2 root root 4096 Jan 1 1970 azure
drwxr-xr-x 2 root root 4096 Jan 1 1970 conf
drwxr-xr-x 3 root root 4096 Jan 5 11:00 eventlogs
-r-xr-xr-x 1 root root 3037 Jan 1 1970 hadoop_accessed_config.lst
drwxr-xr-x 2 root root 4096 Jan 5 11:01 logs


drwxr-xr-x 5 root root 4096 Jan 5 11:05 metastore_db


-r-xr-xr-x 1 root root 1306848 Jan 1 1970 preload_class.lst

%sh ps #ps lists the currently running processes and their PIDs, along with other information depending on the options used

PID TTY TIME CMD


1 ? 00:00:00 systemd
44 ? 00:00:00 systemd-journal
63 ? 00:00:00 networkd-dispat
70 ? 00:00:00 systemd-logind
86 ? 00:00:00 ntpd
88 ? 00:00:00 sshd
91 ? 00:00:00 monit
92 ? 00:00:00 unattended-upgr
154 ? 00:00:00 start_wsfs.sh
163 ? 00:00:00 wsfs
282 ? 00:00:57 java
357 ? 00:00:00 apache2
387 ? 00:00:00 cron
429 ? 00:00:00 bash
475 ? 00:02:13 java
638 ? 00:00:00 ttyd
879 ? 00:00:00 bash
885 ? 00:00:00 R
902 ? 00:00:00 R
1048 ? 00:00:05 python

%sh env

SHELL=/bin/bash
PIP_NO_INPUT=1
SUDO_GID=0
PYTHONHASHSEED=0
DISABLE_LOCAL_FILESYSTEM=false
JAVA_HOME=/usr/lib/jvm/zulu8-ca-amd64/jre/
MLR_PYTHONPATH=/etc/mlr_python_path
MLFLOW_PYTHON_EXECUTABLE=/databricks/spark/scripts/mlflow_python.sh
JAVA_OPTS= -Djava.io.tmpdir=/local_disk0/tmp -XX:-OmitStackTraceInFastThrow -Djava.security.properties=/databricks/sp
ark/dbconf/java/extra.security -XX:-UseContainerSupport -XX:+PrintFlagsFinal -XX:+PrintGCDateStamps -XX:+PrintGCDetai
ls -verbose:gc -Xss4m -Djava.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib:/usr/lib/x86_6
4-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni -Djavax.xml.datatype.DatatypeFactory=co
m.sun.org.apache.xerces.internal.jaxp.datatype.DatatypeFactoryImpl -Djavax.xml.parsers.DocumentBuilderFactory=com.su
n.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl -Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.
xerces.internal.jaxp.SAXParserFactoryImpl -Djavax.xml.validation.SchemaFactory:http://www.w3.org/2001/XMLSchema=com.s
un.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory -Dorg.xml.sax.driver=com.sun.org.apache.xerces.interna
l.parsers.SAXParser -Dorg.w3c.dom.DOMImplementationSourceList=com.sun.org.apache.xerces.internal.dom.DOMXSImplementat
ionSourceImpl -Djavax.net.ssl.sessionCacheSize=10000 -Dscala.reflect.runtime.disable.typetag.cache=true -Dcom.google.
cloud.spark.bigquery.repackaged.io.netty.tryReflectionSetAccessible=true -Dlog4j2.formatMsgNoLookups=true -Ddatabric
ks.serviceName=driver-1 -Xms7254m -Xmx7254m -Dspark.ui.port=40001 -Dspark.executor.extraJavaOptions="-Djava.io.tmpdir
=/local_disk0/tmp -XX:ReservedCodeCacheSize=512m -XX:+UseCodeCacheFlushing -XX:PerMethodRecompilationCutoff=-1 -XX:Pe

%sh free

total used free shared buff/cache available


Mem: 10596352 4441264 4273584 2728 1881504 6155088
Swap: 10485756 0 10485756

%sh df #df displays file system space usage: used space, free space, and which filesystems are mounted

Filesystem
1K-blocks Used Available Use% Mounted on
/var/lib/lxc/base-images/release__12.0.x-snapshot-cpu-ml-scala2.12__databricks-universe__head__93a7752__dde6fe5__jenk
ins__b0f3aff__format-2 153707984 17438296 128388984 12% /
none
492 0 492 0% /dev
/dev/xvdb
153707984 17438296 128388984 12% /mnt/readonly
/dev/mapper/vg-lv
455461216 10514868 421736776 3% /local_disk0
tmpfs
7808672 0 7808672 0% /sys/fs/cgroup
tmpfs
7808672 0 7808672 0% /dev/shm
tmpfs


1561736 112 1561624 1% /run


tmpfs
5120 0 5120 0% /run/lock
workspace
10485760 0 10485760 0% /Workspace

%sh python --version

Python 3.9.5

These are Databricks filesystem (DBFS) commands

#Lists the contents of a directory

%fs ls /FileStore/tables

Table
   
  path name size modificationTime
1 dbfs:/FileStore/tables/Managers.csv Managers.csv 133114 1667413379000
2 dbfs:/FileStore/tables/Teams.csv Teams.csv 524526 1667413401000
3 dbfs:/FileStore/tables/alice_in_wonderland.txt alice_in_wonderland.txt 148311 1663266909000
4 dbfs:/FileStore/tables/d2buy.csv d2buy.csv 407 1667434540000
5 dbfs:/FileStore/tables/linkFile.txt linkFile.txt 72 1663266909000
6 dbfs:/FileStore/tables/movielens.txt movielens.txt 616155 1667413258000
7 dbfs:/FileStore/tables/movielensABD.csv movielensABD.csv 616155 1667412560000
Showing all 13 rows.

Some standard Linux-style filesystem commands for your reference (cp - copy, mv - move, rm - remove)

#%fs cp /FileStore/tables/dataset1.csv /FileStore/tables/dataset2.csv


#dbutils.fs.cp('/FileStore/tables/dataset1.csv', '/FileStore/tables/dataset2.csv')

#dbutils.fs.rm('/FileStore/tables/dataset2.csv')
#%fs rm /FileStore/tables/dataset2.csv

#%fs mv /FileStore/tables/file.csv /FileStore/tables/new_file.csv


#dbutils.fs.mv('/FileStore/tables/file.csv', '/FileStore/tables/new_file.csv')

%fs ls

Table
   
  path name size modificationTime
1 dbfs:/FileStore/ FileStore/ 0 0
2 dbfs:/cp/ cp/ 0 0
3 dbfs:/databricks-datasets/ databricks-datasets/ 0 0
4 dbfs:/databricks-results/ databricks-results/ 0 0
5 dbfs:/delta/ delta/ 0 0
6 dbfs:/local_disk0/ local_disk0/ 0 0
7 dbfs:/tmp/ tmp/ 0 0
Showing all 8 rows.

%fs ls /databricks-datasets/

Table
  
  path name size modificationTim
1 dbfs:/databricks-datasets/ databricks-datasets/ 0 0
2 dbfs:/databricks-datasets/COVID/ COVID/ 0 0
3 dbfs:/databricks-datasets/README.md README.md 976 1532468253000

4 dbfs:/databricks-datasets/Rdatasets/ Rdatasets/ 0 0
5 dbfs:/databricks-datasets/SPARK_README.md SPARK_README.md 3359 1455043490000
6 dbfs:/databricks-datasets/adult/ adult/ 0 0
7 dbfs:/databricks-datasets/airlines/ airlines/ 0 0
Showing all 55 rows.

%fs ls /databricks-datasets/definitive-guide/data/

Table
  
  path name size modifica
1 dbfs:/databricks-datasets/definitive-guide/data/activity-data/ activity-data/ 0 0
2 dbfs:/databricks-datasets/definitive-guide/data/bike-data/ bike-data/ 0 0
3 dbfs:/databricks-datasets/definitive-guide/data/binary-classification/ binary-classification/ 0 0
4 dbfs:/databricks-datasets/definitive-guide/data/clustering/ clustering/ 0 0
5 dbfs:/databricks-datasets/definitive-guide/data/flight-data/ flight-data/ 0 0
6 dbfs:/databricks-datasets/definitive-guide/data/flight-data-hive/ flight-data-hive/ 0 0
7 dbfs:/databricks-datasets/definitive-guide/data/multiclass-classification/ multiclass-classification/ 0 0
Showing all 14 rows.

%fs ls dbfs:/databricks-datasets/samples/

Table
   
  path name size modificationTime
1 dbfs:/databricks-datasets/samples/adam/ adam/ 0 0
2 dbfs:/databricks-datasets/samples/data/ data/ 0 0
3 dbfs:/databricks-datasets/samples/docs/ docs/ 0 0
4 dbfs:/databricks-datasets/samples/lending_club/ lending_club/ 0 0
5 dbfs:/databricks-datasets/samples/newsgroups/ newsgroups/ 0 0
6 dbfs:/databricks-datasets/samples/people/ people/ 0 0
7 dbfs:/databricks-datasets/samples/population-vs-price/ population-vs-price/ 0 0
Showing all 7 rows.

%fs ls dbfs:/databricks-datasets/samples/people

Table
   
  path name size modificationTime
1 dbfs:/databricks-datasets/samples/people/people.json people.json 77 1534435526000

Showing 1 row.

%fs ls dbfs:/databricks-datasets/definitive-guide/data/flight-data/csv

Table
  
  path name size modificationT
1 dbfs:/databricks-datasets/definitive-guide/data/flight-data/csv/2010-summary.csv 2010-summary.csv 7121 152219204900
2 dbfs:/databricks-datasets/definitive-guide/data/flight-data/csv/2011-summary.csv 2011-summary.csv 7069 152219204900
3 dbfs:/databricks-datasets/definitive-guide/data/flight-data/csv/2012-summary.csv 2012-summary.csv 6857 152219204900
4 dbfs:/databricks-datasets/definitive-guide/data/flight-data/csv/2013-summary.csv 2013-summary.csv 7020 152219205000
5 dbfs:/databricks-datasets/definitive-guide/data/flight-data/csv/2014-summary.csv 2014-summary.csv 6729 152219205000
6 dbfs:/databricks-datasets/definitive-guide/data/flight-data/csv/2015-summary.csv 2015-summary.csv 7080 152219205000

Showing all 6 rows.

These are Databricks and notebook commands

dbutils.help()


This module provides various utilities for users to interact with the rest of Databricks.

credentials: DatabricksCredentialUtils -> Utilities for interacting with credentials within notebooks
data: DataUtils -> Utilities for understanding and interacting with datasets (EXPERIMENTAL)
fs: DbfsUtils -> Manipulates the Databricks filesystem (DBFS) from the console
jobs: JobsUtils -> Utilities for leveraging jobs features
library: LibraryUtils -> Utilities for session isolated libraries
meta: MetaUtils -> Methods to hook into the compiler (EXPERIMENTAL)
notebook: NotebookUtils -> Utilities for the control flow of a notebook (EXPERIMENTAL)
preview: Preview -> Utilities under preview category
secrets: SecretUtils -> Provides utilities for leveraging secrets within notebooks
widgets: WidgetsUtils -> Methods to create and get bound value of input widgets inside notebooks

dbutils.fs.help()

dbutils.fs provides utilities for working with FileSystems. Most methods in this package can take either a DBFS path (e.g., "/foo" or "dbfs:/foo"), or another
FileSystem URI. For more info about a method, use dbutils.fs.help("methodName"). In notebooks, you can also use the %fs shorthand to access DBFS.
The %fs shorthand maps straightforwardly onto dbutils calls. For example, "%fs head --maxBytes=10000 /file/path" translates into "dbutils.fs.head("/file/path",
maxBytes = 10000)".

fsutils
cp(from: String, to: String, recurse: boolean = false): boolean -> Copies a file or directory, possibly across FileSystems
head(file: String, maxBytes: int = 65536): String -> Returns up to the first 'maxBytes' bytes of the given file as a String encoded in UTF-8
ls(dir: String): Seq -> Lists the contents of a directory
mkdirs(dir: String): boolean -> Creates the given directory if it does not exist, also creating any necessary parent directories
mv(from: String, to: String, recurse: boolean = false): boolean -> Moves a file or directory, possibly across FileSystems
put(file: String, contents: String, overwrite: boolean = false): boolean -> Writes the given String out to a file, encoded in UTF-8
rm(dir: String, recurse: boolean = false): boolean -> Removes a file or directory

mount
mount(source: String, mountPoint: String, encryptionType: String = "", owner: String = null, extraConfigs: Map = Map.empty[String, String]):
boolean -> Mounts the given source directory into DBFS at the given mount point
mounts: Seq -> Displays information about what is mounted within DBFS
refreshMounts: boolean -> Forces all machines in this cluster to refresh their mount cache, ensuring they receive the most recent information
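
As a quick illustration of the %fs shorthand described above (a minimal sketch, reusing the README path listed earlier under /databricks-datasets), the following two forms are equivalent:

#%fs head --maxBytes=1000 /databricks-datasets/README.md
dbutils.fs.head("/databricks-datasets/README.md", maxBytes=1000)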

dbutils.fs.help('cp')

/**
* Copies a file or directory, possibly across FileSystems..
*
* Example: cp("/mnt/my-folder/a", "s3n://bucket/b")
*
* @param from FileSystem URI of the source file or directory
* @param to FileSystem URI of the destination file or directory
* @param recurse if true, all files and directories will be recursively copied
* @return true if all files were successfully copied
*/
cp(from: java.lang.String, to: java.lang.String, recurse: boolean = false): boolean

#dbutils.fs.cp("dbfs:/FileStore/old_file.txt", "file:/tmp/new/new_file.txt")

dbutils.widgets.help()

dbutils.widgets provides utilities for working with notebook widgets. You can create different types of widgets and get their bound value. For more info about a
method, use dbutils.widgets.help("methodName").

combobox(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a combobox input widget with a given name, default value
and choices
dropdown(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a dropdown input widget a with given name, default value and
choices
get(name: String): String -> Retrieves current value of an input widget
getArgument(name: String, optional: String): String -> (DEPRECATED) Equivalent to get
multiselect(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a multiselect input widget with a given name, default value
and choices

# Same output as command above


display(dbutils.widgets)

dbutils.widgets provides utilities for working with notebook widgets. You can create different types of widgets and get their bound value. For more info about a
method, use dbutils.widgets.help("methodName").

combobox(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a combobox input widget with a given name, default value
and choices
dropdown(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a dropdown input widget a with given name, default value and
choices
get(name: String): String -> Retrieves current value of an input widget
getArgument(name: String, optional: String): String -> (DEPRECATED) Equivalent to get
multiselect(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a multiselect input widget with a given name, default value
and choices
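
A minimal sketch (not part of the original notebook) using two of the widget methods listed above: create a dropdown and read its bound value.

dbutils.widgets.dropdown("city", "Lisboa", ["Lisboa", "Porto", "Faro"], "City")
print(dbutils.widgets.get("city"))   # prints the bound value, 'Lisboa' by default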


1/2.0

Out[41]: 0.5

These are Spark commands

spark

SparkSession - hive

SparkContext

Spark UI

Version
v3.3.1
Master
local[8]
AppName
Databricks Shell

spark.version

Out[43]: '3.3.1'

#spark.sparkContext.appName
spark.conf.get("spark.app.name")

Out[44]: 'Databricks Shell'

spark.sparkContext.getConf().getAll()

Out[45]: [('spark.databricks.preemption.enabled', 'true'),


('spark.sql.hive.metastore.jars', '/databricks/databricks-hive/*'),
('spark.driver.tempDirectory', '/local_disk0/tmp'),
('spark.databricks.clusterUsageTags.driverInstanceId', 'i-0ab347f7afa43c463'),
('spark.sql.warehouse.dir', 'dbfs:/user/hive/warehouse'),
('spark.databricks.managedCatalog.clientClassName',
'com.databricks.managedcatalog.ManagedCatalogClientImpl'),
('spark.databricks.credential.scope.fs.gs.auth.access.tokenProviderClassName',
'com.databricks.backend.daemon.driver.credentials.CredentialScopeGCPTokenProvider'),
('spark.hadoop.fs.fcfs-s3.impl.disable.cache', 'true'),
('spark.hadoop.fs.s3a.retry.limit', '20'),
('spark.sql.streaming.checkpointFileManagerClass',
'com.databricks.spark.sql.streaming.DatabricksCheckpointFileManager'),
('spark.databricks.service.dbutils.repl.backend',
'com.databricks.dbconnect.ReplDBUtils'),
('spark.hadoop.databricks.s3.verifyBucketExists.enabled', 'false'),
('spark.streaming.driver.writeAheadLog.allowBatching', 'true'),
('spark.databricks.clusterSource', 'UI'),
('spark.hadoop.hive.server2.transport.mode', 'http'),
('spark.executor.memory', '8278m'),
('spark.databricks.clusterUsageTags.clusterOwnerOrgId', '1665022229529116'),

#Read a text file


tf=spark.read.text("/databricks-datasets/definitive-guide/README.md")
tf.show()

+--------------------+
| value|
+--------------------+
| # Datasets|
| |
|This folder conta...|
| |
| |
|The datasets are ...|
| |
| ## Flight Data|
| |
|This data comes f...|
| |
| ## Retail Data|


| |
|Daqing Chen, Sai ...|
| |
|The data was down...|
| |

print(tf)

DataFrame[value: string]

type(tf)

Out[48]: pyspark.sql.dataframe.DataFrame

tf.display()
#display(tf)

Table

  value
1 # Datasets
2
3 This folder contains all of the datasets used in The Definitive Guide.
4
5
6 The datasets are as follow.
7
Showing all 26 rows.

tf.dtypes

Out[50]: [('value', 'string')]

tf.schema

Out[51]: StructType([StructField('value', StringType(), True)])

tf.printSchema()

root
|-- value: string (nullable = true)

diamonds=spark.read.csv('/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv', header='True')
diamonds.show()

+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat| cut|color|clarity|depth|table|price| x| y| z|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
| 1| 0.23| Ideal| E| SI2| 61.5| 55| 326|3.95|3.98|2.43|
| 2| 0.21| Premium| E| SI1| 59.8| 61| 326|3.89|3.84|2.31|
| 3| 0.23| Good| E| VS1| 56.9| 65| 327|4.05|4.07|2.31|
| 4| 0.29| Premium| I| VS2| 62.4| 58| 334| 4.2|4.23|2.63|
| 5| 0.31| Good| J| SI2| 63.3| 58| 335|4.34|4.35|2.75|
| 6| 0.24|Very Good| J| VVS2| 62.8| 57| 336|3.94|3.96|2.48|
| 7| 0.24|Very Good| I| VVS1| 62.3| 57| 336|3.95|3.98|2.47|
| 8| 0.26|Very Good| H| SI1| 61.9| 55| 337|4.07|4.11|2.53|
| 9| 0.22| Fair| E| VS2| 65.1| 61| 337|3.87|3.78|2.49|
| 10| 0.23|Very Good| H| VS1| 59.4| 61| 338| 4|4.05|2.39|
| 11| 0.3| Good| J| SI1| 64| 55| 339|4.25|4.28|2.73|
| 12| 0.23| Ideal| J| VS1| 62.8| 56| 340|3.93| 3.9|2.46|
| 13| 0.22| Premium| F| SI1| 60.4| 61| 342|3.88|3.84|2.33|
| 14| 0.31| Ideal| J| SI2| 62.2| 54| 344|4.35|4.37|2.71|
| 15| 0.2| Premium| E| SI2| 60.2| 62| 345|3.79|3.75|2.27|
| 16| 0.32| Premium| E| I1| 60.9| 58| 345|4.38|4.42|2.68|
| 17| 0.3| Ideal| I| SI2| 62| 54| 348|4.31|4.34|2.68|
| 18| 0.3| Good| J| SI1| 63.4| 54| 351|4.23|4.29| 2.7|

diamonds.printSchema()

root
|-- _c0: string (nullable = true)


|-- carat: string (nullable = true)


|-- cut: string (nullable = true)
|-- color: string (nullable = true)
|-- clarity: string (nullable = true)
|-- depth: string (nullable = true)
|-- table: string (nullable = true)
|-- price: string (nullable = true)
|-- x: string (nullable = true)
|-- y: string (nullable = true)
|-- z: string (nullable = true)

spark.read.csv('/databricks-datasets/learning-spark-v2/flights/departuredelays.csv', inferSchema='true',
header='True').show()

+-------+-----+--------+------+-----------+
| date|delay|distance|origin|destination|
+-------+-----+--------+------+-----------+
|1011245| 6| 602| ABE| ATL|
|1020600| -8| 369| ABE| DTW|
|1021245| -2| 602| ABE| ATL|
|1020605| -4| 602| ABE| ATL|
|1031245| -4| 602| ABE| ATL|
|1030605| 0| 602| ABE| ATL|
|1041243| 10| 602| ABE| ATL|
|1040605| 28| 602| ABE| ATL|
|1051245| 88| 602| ABE| ATL|
|1050605| 9| 602| ABE| ATL|
|1061215| -6| 602| ABE| ATL|
|1061725| 69| 602| ABE| ATL|
|1061230| 0| 369| ABE| DTW|
|1060625| -3| 602| ABE| ATL|
|1070600| 0| 369| ABE| DTW|
|1071725| 0| 602| ABE| ATL|
|1071230| 0| 369| ABE| DTW|
|1070625| 0| 602| ABE| ATL|

spark.read.csv('/databricks-datasets/learning-spark-v2/flights/departuredelays.csv', inferSchema='true',
header='True').take(3)

Out[56]: [Row(date=1011245, delay=6, distance=602, origin='ABE', destination='ATL'),


Row(date=1020600, delay=-8, distance=369, origin='ABE', destination='DTW'),
Row(date=1021245, delay=-2, distance=602, origin='ABE', destination='ATL')]

ABD03 Distributed Processing


Check your driver node. Remember Spark doesn't process data here

%sh ls -l #Long listing. Possibly the most used option for ls.

total 1304
drwxr-xr-x 2 root root 4096 Jan 1 1970 azure
drwxr-xr-x 2 root root 4096 Jan 1 1970 conf
drwxr-xr-x 3 root root 4096 Jan 5 11:00 eventlogs
-r-xr-xr-x 1 root root 3037 Jan 1 1970 hadoop_accessed_config.lst
drwxr-xr-x 2 root root 4096 Jan 5 11:01 logs
drwxr-xr-x 5 root root 4096 Jan 5 11:05 metastore_db
-r-xr-xr-x 1 root root 1306848 Jan 1 1970 preload_class.lst

Check the files under /databricks-datasets

%fs ls /databricks-datasets/

Table
  
  path name size modificationTim
1 dbfs:/databricks-datasets/ databricks-datasets/ 0 0
2 dbfs:/databricks-datasets/COVID/ COVID/ 0 0
3 dbfs:/databricks-datasets/README.md README.md 976 1532468253000

4 dbfs:/databricks-datasets/Rdatasets/ Rdatasets/ 0 0
5 dbfs:/databricks-datasets/SPARK_README.md SPARK_README.md 3359 1455043490000
6 dbfs:/databricks-datasets/adult/ adult/ 0 0
7 dbfs:/databricks-datasets/airlines/ airlines/ 0 0
Showing all 55 rows.

dbutils commands
dbutils.help()

This module provides various utilities for users to interact with the rest of Databricks.

credentials: DatabricksCredentialUtils -> Utilities for interacting with credentials within notebooks
data: DataUtils -> Utilities for understanding and interacting with datasets (EXPERIMENTAL)
fs: DbfsUtils -> Manipulates the Databricks filesystem (DBFS) from the console
jobs: JobsUtils -> Utilities for leveraging jobs features
library: LibraryUtils -> Utilities for session isolated libraries
meta: MetaUtils -> Methods to hook into the compiler (EXPERIMENTAL)
notebook: NotebookUtils -> Utilities for the control flow of a notebook (EXPERIMENTAL)
preview: Preview -> Utilities under preview category
secrets: SecretUtils -> Provides utilities for leveraging secrets within notebooks
widgets: WidgetsUtils -> Methods to create and get bound value of input widgets inside notebooks

dbutils.fs.help('cp')

/**
* Copies a file or directory, possibly across FileSystems..
*
* Example: cp("/mnt/my-folder/a", "s3n://bucket/b")
*
* @param from FileSystem URI of the source file or directory
* @param to FileSystem URI of the destination file or directory
* @param recurse if true, all files and directories will be recursively copied
* @return true if all files were successfully copied
*/
cp(from: java.lang.String, to: java.lang.String, recurse: boolean = false): boolean

Working with the file system

Check the loaded files in your cluster (under dbfs:/FileStore/tables as allowed by Databricks CE)

# Check the files you have in "dbfs:/FileStore/tables" (via command or the Databricks interface)
# If this folder was not created automatically, do the purplecow.txt import described below (2 options)
# 1) go to "Data > DBFS (on top) > Load" and import one file (raw). Ex: purplecow.txt (find it in Moodle under ABD Class Tech Resources)
# 2) go to the Databricks main page and use the import option to import one file. Ex: purplecow.txt

#dbutils.fs.ls("dbfs:/FileStore/tables")
#dbutils.fs.mkdirs("/FileStore/tables")
#dbutils.fs.rm("/FileStore/tables/purplecow.txt")

%fs ls dbfs:/FileStore/tables

Table
   
  path name size modificationTime
1 dbfs:/FileStore/tables/Managers.csv Managers.csv 133114 1667413379000
2 dbfs:/FileStore/tables/Teams.csv Teams.csv 524526 1667413401000
3 dbfs:/FileStore/tables/alice_in_wonderland.txt alice_in_wonderland.txt 148311 1663266909000
4 dbfs:/FileStore/tables/d2buy.csv d2buy.csv 407 1667434540000
5 dbfs:/FileStore/tables/linkFile.txt linkFile.txt 72 1663266909000
6 dbfs:/FileStore/tables/movielens.txt movielens.txt 616155 1667413258000
7 dbfs:/FileStore/tables/movielensABD.csv movielensABD.csv 616155 1667412560000
Showing all 13 rows.


# Import the purplecow.txt file available in Moodle to dbfs:/FileStore/tables (using the Databricks interface)
# Then use %fs ls (or the Databricks interface) to see the saved file

%fs ls dbfs:/FileStore/tables/purplecow.txt

Table
   
  path name size modificationTime
1 dbfs:/FileStore/tables/purplecow.txt purplecow.txt 109 1663266910000

Showing 1 row.

#Make a copy of the file if needed

#dbutils.fs.cp("/FileStore/tables/purplecow.txt", "/FileStore/tables/purplecow1.txt")

#Check if there is a purplecow.txt file in your Spark driver node

%sh ls -l

total 1304
drwxr-xr-x 2 root root 4096 Jan 1 1970 azure
drwxr-xr-x 2 root root 4096 Jan 1 1970 conf
drwxr-xr-x 3 root root 4096 Jan 5 11:00 eventlogs
-r-xr-xr-x 1 root root 3037 Jan 1 1970 hadoop_accessed_config.lst
drwxr-xr-x 2 root root 4096 Jan 5 11:01 logs
drwxr-xr-x 5 root root 4096 Jan 5 11:05 metastore_db
-r-xr-xr-x 1 root root 1306848 Jan 1 1970 preload_class.lst

# Copy the purplecow.txt file from your databricks filesystem to your driver node

dbutils.fs.cp("dbfs:/FileStore/tables/purplecow.txt", "file:/purplecow.txt", )

Out[67]: True

%sh ls -l /

total 88
-r-xr-xr-x 1 root root 271 Jan 1 1970 BUILD
drwxrwxrwx 2 root root 4096 Jan 5 11:00 Workspace
lrwxrwxrwx 1 root root 7 Oct 19 16:47 bin -> usr/bin
drwxr-xr-x 2 root root 4096 Apr 15 2020 boot
drwxr-xr-x 1 root root 4096 Jan 5 11:04 databricks
drwxr-xr-x 2 root root 4096 Jan 5 11:00 dbfs
drwxr-xr-x 7 root root 540 Jan 5 11:00 dev
drwxr-xr-x 1 root root 4096 Jan 5 11:00 etc
drwxr-xr-x 1 root root 4096 Nov 23 01:50 home
lrwxrwxrwx 1 root root 7 Oct 19 16:47 lib -> usr/lib
lrwxrwxrwx 1 root root 9 Oct 19 16:47 lib32 -> usr/lib32
lrwxrwxrwx 1 root root 9 Oct 19 16:47 lib64 -> usr/lib64
lrwxrwxrwx 1 root root 10 Oct 19 16:47 libx32 -> usr/libx32
drwxr-xr-x 7 ubuntu ubuntu 4096 Jan 5 11:00 local_disk0
drwxr-xr-x 2 root root 4096 Oct 19 16:47 media
drwxr-xr-x 1 root root 4096 Jan 5 11:00 mnt
drwxr-xr-x 4 root root 4096 Nov 23 01:50 opt
dr-xr-xr-x 225 root root 0 Jan 5 10:59 proc
-rw-r--r-- 1 root root 109 Jan 5 11:14 purplecow.txt
drwxr-xr-x 1 root root 4096 Jan 5 11:05 root

# Delete the purplecow file from your driver (just for housekeeping purposes)
dbutils.fs.rm("file:/purplecow.txt")

Out[69]: True

# Now, create a DataFrame to save to your databricks filesystem


df = spark.createDataFrame([["Porto", 1900], ["Lisboa", 4000]], ["Name", "Zip-Code"])


display(df)

Table
 
  Name Zip-Code
1 Porto 1900
2 Lisboa 4000

Showing all 2 rows.

# Write the DataFrame in parquet format and Delta format


# Check the saved files in the file system

df.write.format("parquet").save("dbfs:/FileStore/tables/df.parquet")

#df.write.format("delta").save("dbfs:/FileStore/tables/df.delta")
df.write.save("dbfs:/FileStore/tables/df.delta")   # on Databricks, Delta is the default write format, so this also writes a Delta table

%fs ls dbfs:/FileStore/tables/

Table
   
  path name size modificationTime
1 dbfs:/FileStore/tables/Managers.csv Managers.csv 133114 1667413379000
2 dbfs:/FileStore/tables/Teams.csv Teams.csv 524526 1667413401000
3 dbfs:/FileStore/tables/alice_in_wonderland.txt alice_in_wonderland.txt 148311 1663266909000
4 dbfs:/FileStore/tables/d2buy.csv d2buy.csv 407 1667434540000
5 dbfs:/FileStore/tables/df.delta/ df.delta/ 0 0
6 dbfs:/FileStore/tables/df.parquet/ df.parquet/ 0 0
7 dbfs:/FileStore/tables/linkFile.txt linkFile.txt 72 1663266909000
Showing all 15 rows.

%fs ls dbfs:/FileStore/tables/df.delta

Table

  path name
1 dbfs:/FileStore/tables/df.delta/_delta_log/ _delta_log/
2 dbfs:/FileStore/tables/df.delta/part-00003-6fc4d7b8-dbb9-482a-a5d3-9180f7e7bc80-c000.snappy.parquet part-00003-6fc4d7b8-dbb9-4
3 dbfs:/FileStore/tables/df.delta/part-00007-f919575c-282a-4a0b-a942-935b519d720f-c000.snappy.parquet part-00007-f919575c-282a-4

Showing all 3 rows.

%fs ls dbfs:/FileStore/tables/df.delta/_delta_log/

Table
  
  path name size modification
1 dbfs:/FileStore/tables/df.delta/_delta_log/.s3-optimization-0 .s3-optimization-0 0 167291727600
2 dbfs:/FileStore/tables/df.delta/_delta_log/.s3-optimization-1 .s3-optimization-1 0 167291727600
3 dbfs:/FileStore/tables/df.delta/_delta_log/.s3-optimization-2 .s3-optimization-2 0 167291727600
4 dbfs:/FileStore/tables/df.delta/_delta_log/00000000000000000000.crc 00000000000000000000.crc 2962 167291729100
5 dbfs:/FileStore/tables/df.delta/_delta_log/00000000000000000000.json 00000000000000000000.json 1969 167291727700

Showing all 5 rows.

%fs ls dbfs:/FileStore/tables/df.parquet

Table

  path name


1 dbfs:/FileStore/tables/df.parquet/_SUCCESS _SUCCESS

2 dbfs:/FileStore/tables/df.parquet/_committed_7290864805188671922 _committed_729086
3 dbfs:/FileStore/tables/df.parquet/_started_7290864805188671922 _started_729086480
dbfs:/FileStore/tables/df.parquet/part-00000-tid-7290864805188671922-1543ab27-2a0f-47b4-98d2-1fcef931c61e-32-1- part-00000-tid-7290
4
c000.snappy.parquet
dbfs:/FileStore/tables/df.parquet/part-00003-tid-7290864805188671922-1543ab27-2a0f-47b4-98d2-1fcef931c61e-35-1- part-00003-tid-7290
5
c000.snappy.parquet
dbfs:/FileStore/tables/df.parquet/part-00007-tid-7290864805188671922-1543ab27-2a0f-47b4-98d2-1fcef931c61e-39-1- part-00007-tid-7290
6
c000.snappy.parquet

Showing all 6 rows.

%fs rm -r dbfs:/FileStore/tables/df.parquet

res15: Boolean = true

%fs rm -r dbfs:/FileStore/tables/df.delta

res16: Boolean = true

ABD04 Spark Basics


# Check the Python environment
import sys
print(sys.version)

3.9.5 (default, Nov 23 2021, 15:27:38)


[GCC 9.3.0]

%sh python --version

Python 3.9.5

Spark is a distributed processing engine that executes parallel processing in a cluster. A Spark cluster is made of one Driver node and many Executor nodes (JVMs - Java Virtual Machines).

Check sc and spark object

sc

SparkContext

Spark UI

Version
v3.3.1
Master
local[8]
AppName
Databricks Shell

sc.appName

Out[81]: 'Databricks Shell'

spark


SparkSession - hive

SparkContext

Spark UI

Version
v3.3.1
Master
local[8]
AppName
Databricks Shell

spark.version

Out[83]: '3.3.1'

#the Spark session inherits the Spark context


spark.sparkContext

SparkContext

Spark UI

Version
v3.3.1
Master
local[8]
AppName
Databricks Shell

spark.sparkContext.appName

Out[85]: 'Databricks Shell'

Check spark configuration parameters

Through spark.conf, you manipulate Spark's runtime configuration parameters

#spark.sparkContext.appName
spark.conf.get("spark.app.name")

Out[86]: 'Databricks Shell'

spark.conf.set("spark.app.name","ABD - Spark Basics")

spark.conf.get("spark.app.name")

Out[88]: 'ABD - Spark Basics'

spark.conf.get("spark.sql.warehouse.dir")

Out[89]: 'dbfs:/user/hive/warehouse'

# spark.default.parallelism is only applicable to RDDs


#spark.conf.get("spark.default.parallelism")

# spark.sql.shuffle.partitions is only applicable to DataFrames


#spark.conf.get("spark.sql.shuffle.partitions")

spark.sparkContext.master

Out[92]: 'local[8]'

spark.sparkContext.getConf().getAll()

Out[93]: [('spark.databricks.preemption.enabled', 'true'),


('spark.sql.hive.metastore.jars', '/databricks/databricks-hive/*'),
('spark.driver.tempDirectory', '/local_disk0/tmp'),
('spark.databricks.clusterUsageTags.driverInstanceId', 'i-0ab347f7afa43c463'),


('spark.sql.warehouse.dir', 'dbfs:/user/hive/warehouse'),
('spark.databricks.managedCatalog.clientClassName',
'com.databricks.managedcatalog.ManagedCatalogClientImpl'),
('spark.databricks.credential.scope.fs.gs.auth.access.tokenProviderClassName',
'com.databricks.backend.daemon.driver.credentials.CredentialScopeGCPTokenProvider'),
('spark.hadoop.fs.fcfs-s3.impl.disable.cache', 'true'),
('spark.hadoop.fs.s3a.retry.limit', '20'),
('spark.sql.streaming.checkpointFileManagerClass',
'com.databricks.spark.sql.streaming.DatabricksCheckpointFileManager'),
('spark.databricks.service.dbutils.repl.backend',
'com.databricks.dbconnect.ReplDBUtils'),
('spark.hadoop.databricks.s3.verifyBucketExists.enabled', 'false'),
('spark.streaming.driver.writeAheadLog.allowBatching', 'true'),
('spark.databricks.clusterSource', 'UI'),
('spark.hadoop.hive.server2.transport.mode', 'http'),
('spark.executor.memory', '8278m'),

#You can create your own variables


spark.conf.set("spark.abd_class.name","ABD04 Spark Basics")

#You can create your own variables


spark.conf.get("spark.abd_class.name")

Out[95]: 'ABD04 Spark Basics'

Use help() to get more info about your object

help(spark)

Help on SparkSession in module pyspark.sql.session object:

class SparkSession(pyspark.sql.pandas.conversion.SparkConversionMixin)
| SparkSession(sparkContext: pyspark.context.SparkContext, jsparkSession: Optional[py4j.java_gateway.JavaObject] =
None, options: Dict[str, Any] = {})
|
| The entry point to programming Spark with the Dataset and DataFrame API.
|
| A SparkSession can be used create :class:`DataFrame`, register :class:`DataFrame` as
| tables, execute SQL over tables, cache tables, and read parquet files.
| To create a :class:`SparkSession`, use the following builder pattern:
|
| .. autoattribute:: builder
| :annotation:
|
| Examples
| --------
| >>> spark = SparkSession.builder \
| ... .master("local") \
| ... .appName("Word Count") \
| ... .config("spark.some.config.option", "some-value") \

Check the file system

Find datasets with example data to work with

%fs ls /databricks-datasets/

Table
  
  path name size modificationTim
1 dbfs:/databricks-datasets/ databricks-datasets/ 0 0
2 dbfs:/databricks-datasets/COVID/ COVID/ 0 0
3 dbfs:/databricks-datasets/README.md README.md 976 1532468253000
4 dbfs:/databricks-datasets/Rdatasets/ Rdatasets/ 0 0
5 dbfs:/databricks-datasets/SPARK_README.md SPARK_README.md 3359 1455043490000
6 dbfs:/databricks-datasets/adult/ adult/ 0 0
7 dbfs:/databricks-datasets/airlines/ airlines/ 0 0
Showing all 55 rows.


%fs ls /FileStore/tables

Table
   
  path name size modificationTime
1 dbfs:/FileStore/tables/Managers.csv Managers.csv 133114 1667413379000
2 dbfs:/FileStore/tables/Teams.csv Teams.csv 524526 1667413401000
3 dbfs:/FileStore/tables/alice_in_wonderland.txt alice_in_wonderland.txt 148311 1663266909000
4 dbfs:/FileStore/tables/d2buy.csv d2buy.csv 407 1667434540000
5 dbfs:/FileStore/tables/linkFile.txt linkFile.txt 72 1663266909000
6 dbfs:/FileStore/tables/movielens.txt movielens.txt 616155 1667413258000
7 dbfs:/FileStore/tables/movielensABD.csv movielensABD.csv 616155 1667412560000
Showing all 13 rows.

%fs head dbfs:/databricks-datasets/sms_spam_collection/README.md

SMS Spam Collection v. 1
===============================

The SMS Spam Collection v.1 is a public set of SMS labeled messages that have been collected for mobile phone spam re
search. It has one collection composed by 5,574 English, real and non-enconded messages, tagged according being legit
imate (ham) or spam.

## Composition

This corpus has been collected from free or free for research sources at the Internet:

- A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. This is a UK forum in w
hich cell phone users make public claims about SMS spam messages, most of them without reporting the very spam messag
e received. The identification of the text of spam messages in the claims is a very hard and time-consuming task, and
it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: http://www.grumbletext.co.uk/.
- A subset of 3,375 SMS randomly chosen ham messages of the NUS SMS Corpus (NSC), which is a dataset of about 10,000
legitimate messages collected for research at the Department of Computer Science at the National University of Singap
ore. The messages largely originate from Singaporeans and mostly from students attending the University. These messag
es were collected from volunteers who were made aware that their contributions were going to be made publicly availab
le. The NUS SMS Corpus is avalaible at: http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/.
- A list of 450 SMS ham messages collected from Caroline Tag's PhD Thesis available at http://etheses.bham.ac.uk/253/

%fs ls dbfs:/databricks-datasets/sms_spam_collection/

Table
   
  path name size modificationTime
1 dbfs:/databricks-datasets/sms_spam_collection/README.md README.md 4344 1448498620000
2 dbfs:/databricks-datasets/sms_spam_collection/data-001/ data-001/ 0 0

Showing all 2 rows.

Create your first RDD (unstructured data)

#Create your RDD


rdd=sc.parallelize(["Lisboa", "Porto", "Faro"])
rdd.collect()

Out[97]: ['Lisboa', 'Porto', 'Faro']

dataW = ["Lisboa", "Porto", "Faro", "Coimbra"]


rddW=sc.parallelize(dataW)
rddW.collect()

Out[98]: ['Lisboa', 'Porto', 'Faro', 'Coimbra']

#To do this you first have to upload the purplecow.txt file to Databricks
myrdd = sc.textFile("dbfs:/FileStore/tables/purplecow.txt")
myrdd.collect()

Out[99]: ['I never saw a purple cow.',


'I never hope to see one.',


'But I can tell you, anyhow,',


"I'd rather see than be one!"]

RDDs support two types of operations: transformations and actions.

Transformations, like map() or filter(), create a new RDD from an existing one, resulting in another immutable RDD. All transformations are lazy: they are not executed until an action is invoked.

Actions, like collect() or count(), return a value (results) to the user. Other actions, like saveAsTextFile(), write the RDD to distributed storage (HDFS, DBFS or S3).

Transformations contribute to a query plan, but nothing is executed until an action is called.
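
A minimal sketch (not from the original notebook) of this laziness: the map() below only records a transformation in the plan, and nothing runs until the count() action is called.

lazy = sc.parallelize(range(4)).map(lambda n: n * 10)   # transformation: no job is launched yet
lazy.count()                                            # action: triggers execution and returns 4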

# Doing the same thing for reading a text file but in one line with "."/methods notation
sc.textFile("dbfs:/FileStore/tables/purplecow.txt").collect()

Out[100]: ['I never saw a purple cow.',


'I never hope to see one.',
'But I can tell you, anyhow,',
"I'd rather see than be one!"]

sc.textFile("dbfs:/FileStore/tables/purplecow.txt").take(2) #same as .head()

Out[101]: ['I never saw a purple cow.', 'I never hope to see one.']

mydata = sc.textFile("/databricks-datasets/samples/docs/README.md")

# Check the number of elements


mydata.count()

Out[103]: 65

mydata.take(5)

Out[104]: ['Welcome to the Spark documentation!',


'',
'This readme will walk you through navigating and building the Spark documentation, which is included',
'here with the Spark source code. You can also find documentation specific to release versions of',
'Spark at http://spark.apache.org/documentation.html.']

# Tip to debug in Spark


help(mydata.toDebugString)

Help on method toDebugString in module pyspark.rdd:

toDebugString() -> Optional[bytes] method of pyspark.rdd.RDD instance


A description of this RDD and its recursive dependencies for debugging.

mydata.toDebugString()

Out[106]: b'(2) /databricks-datasets/samples/docs/README.md MapPartitionsRDD[83] at textFile at NativeMethodAccessorI


mpl.java:0 []\n | /databricks-datasets/samples/docs/README.md HadoopRDD[82] at textFile at NativeMethodAccessorImpl.
java:0 []'

# Check the number of partitions


mydata.getNumPartitions()

Out[107]: 2

Operations with the RDD

Filtering lines

Filterdata = mydata.filter(lambda line: "to" in line)


Filterdata.take(10)


Out[108]: ['Welcome to the Spark documentation!',


'here with the Spark source code. You can also find documentation specific to release versions of',
'Read on to learn more about viewing documentation in plain text (i.e., markdown) or building the',
'documentation yourself. Why build it yourself? So that you have the docs that corresponds to',
'The Spark documentation build uses a number of tools to build HTML docs and API docs in Scala,',
' $ Rscript -e \'install.packages(c("knitr", "devtools"), repos="http://cran.stat.ucla.edu/")\'',
'We include the Spark documentation as part of the source (as opposed to using a hosted wiki, such as',
'the github wiki, as the definitive documentation) to enable the documentation to evolve along with',
'the source code and be captured by revision control (currently git). This way the code automatically',
'In this directory you will find textfiles formatted using Markdown, with an ".md" suffix. You can']

Change the lines of the file to upper case using RDDs / the low-level API

Upperdata = mydata.map(lambda line: line.upper())


Upperdata.take(10)

Out[109]: ['WELCOME TO THE SPARK DOCUMENTATION!',


'',
'THIS README WILL WALK YOU THROUGH NAVIGATING AND BUILDING THE SPARK DOCUMENTATION, WHICH IS INCLUDED',
'HERE WITH THE SPARK SOURCE CODE. YOU CAN ALSO FIND DOCUMENTATION SPECIFIC TO RELEASE VERSIONS OF',
'SPARK AT HTTP://SPARK.APACHE.ORG/DOCUMENTATION.HTML.',
'',
'READ ON TO LEARN MORE ABOUT VIEWING DOCUMENTATION IN PLAIN TEXT (I.E., MARKDOWN) OR BUILDING THE',
'DOCUMENTATION YOURSELF. WHY BUILD IT YOURSELF? SO THAT YOU HAVE THE DOCS THAT CORRESPONDS TO',
'WHICHEVER VERSION OF SPARK YOU CURRENTLY HAVE CHECKED OUT OF REVISION CONTROL.',
'']

Create your first DataFrame (structured data)

More on DataFrames in the class on Spark SQL

df = spark.createDataFrame([("James", 20), ("Anna", 31), ("Michael", 30), ("Charles", 35), ("Brooke", 25)],
["name", "age"])
df.show()

+-------+---+
| name|age|
+-------+---+
| James| 20|
| Anna| 31|
|Michael| 30|
|Charles| 35|
| Brooke| 25|
+-------+---+

diamonds=spark.read.csv('/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv', header='True')
diamonds.display()

Table
       
  _c0 carat cut color clarity depth table price x
1 1 0.23 Ideal E SI2 61.5 55 326 3.95
2 2 0.21 Premium E SI1 59.8 61 326 3.89
3 3 0.23 Good E VS1 56.9 65 327 4.05
4 4 0.29 Premium I VS2 62.4 58 334 4.2
5 5 0.31 Good J SI2 63.3 58 335 4.34
6 6 0.24 Very Good J VVS2 62.8 57 336 3.94
7 7 0.24 Very Good I VVS1 62.3 57 336 3.95
Truncated results, showing first 1,000 rows.

# Creating a DataFrame (and not an RDD) to read a file


mydataframe = spark.read.text("dbfs:/FileStore/tables/purplecow.txt")
mydataframe.show()

+--------------------+
| value|
+--------------------+


|I never saw a pur...|


|I never hope to s...|
|But I can tell yo...|
|I'd rather see th...|
+--------------------+

mydataframe.printSchema()

root
|-- value: string (nullable = true)

#Check the lineage of the Dataframe


mydataframe.explain()

== Physical Plan ==
FileScan text [value#1317] Batched: false, DataFilters: [], Format: Text, Location: InMemoryFileIndex(1 paths)[dbfs:/
FileStore/tables/purplecow.txt], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>

# Tip to debug in Spark


help(mydataframe.rdd.toDebugString)

Help on method toDebugString in module pyspark.rdd:

toDebugString() -> Optional[bytes] method of pyspark.rdd.RDD instance


A description of this RDD and its recursive dependencies for debugging.

mydataframe.rdd.toDebugString()

Out[116]: b'(1) MapPartitionsRDD[111] at javaToPython at NativeMethodAccessorImpl.java:0 []\n | MapPartitionsRDD[11


0] at javaToPython at NativeMethodAccessorImpl.java:0 []\n | SQLExecutionRDD[109] at javaToPython at NativeMethodAcc
essorImpl.java:0 []\n | MapPartitionsRDD[108] at javaToPython at NativeMethodAccessorImpl.java:0 []\n | FileScanRDD
[107] at javaToPython at NativeMethodAccessorImpl.java:0 []'

from pyspark.sql.functions import *


Upperdataframe = mydataframe.select(upper('value'))
Upperdataframe.show()

+--------------------+
| upper(value)|
+--------------------+
|I NEVER SAW A PUR...|
|I NEVER HOPE TO S...|
|BUT I CAN TELL YO...|
|I'D RATHER SEE TH...|
+--------------------+

ABD05 Spark Programming with RDDs Prog1


# hit tab twice to see the available methods from sc
sc

SparkContext

Spark UI

Version
v3.3.1
Master
local[8]
AppName
Databricks Shell

spark.sparkContext


SparkContext

Spark UI

Version
v3.3.1
Master
local[8]
AppName
Databricks Shell

Creating an RDD
rdd1=sc.parallelize(["Lisboa", "Porto", "Faro"])
rdd1.collect()

Out[120]: ['Lisboa', 'Porto', 'Faro']

# Check the type of object created with sc.parallelize


type(rdd1)

Out[121]: pyspark.rdd.RDD

Working with RDDs

Numeric RDDs

Mylist = [50,59.2,59,57.2,53.5,53.2,55.4,51.8,53.6,55.4,54.7]

Myrdd = sc.parallelize(Mylist)

Myrdd.first()

Out[124]: 50

Myrdd.collect()

Out[125]: [50, 59.2, 59, 57.2, 53.5, 53.2, 55.4, 51.8, 53.6, 55.4, 54.7]

Myrdd.take(4)

Out[126]: [50, 59.2, 59, 57.2]

Myrdd.top(4)

Out[127]: [59.2, 59, 57.2, 55.4]

Myrdd.sum()

Out[128]: 603.0

Myrdd.mean()

Out[129]: 54.81818181818182

Myrdd.variance()

Out[130]: 7.383305785123963

Myrdd.stdev()

Out[131]: 2.717223911480974

Convert ºF to ºC


# using map() to do operations with numbers


MyrddC = Myrdd.map(lambda p: (p-32)*5/9)
MyrddC.collect()

Out[132]: [10.0,
15.11111111111111,
15.0,
14.000000000000002,
11.944444444444445,
11.777777777777779,
13.0,
10.999999999999998,
12.0,
13.0,
12.611111111111112]

MyrddC.filter(lambda n: n > 12).collect()

Out[133]: [15.11111111111111, 15.0, 14.000000000000002, 13.0, 13.0, 12.611111111111112]

MyrddC.filter(lambda n: n % 2 == 0).collect()
#MyrddC.filter(lambda n: n % 2 != 0).collect()

Out[134]: [10.0, 12.0]

MyrddC.takeOrdered(num=5, key=lambda n: -n)

Out[135]: [15.11111111111111, 15.0, 14.000000000000002, 13.0, 13.0]

Text RDDs

rdd=sc.parallelize(["Do you know that", "a horse has one stomach", "but a cow has four"])
rdd.collect()

Out[136]: ['Do you know that', 'a horse has one stomach', 'but a cow has four']

rddupper = rdd.map(lambda line: line.upper())


rddupper.collect()

Out[137]: ['DO YOU KNOW THAT', 'A HORSE HAS ONE STOMACH', 'BUT A COW HAS FOUR']

rddfilter = rdd.filter(lambda line: "has" in line)


rddfilter.collect()

Out[138]: ['a horse has one stomach', 'but a cow has four']

rddfilter = rdd.filter(lambda line: line.startswith ('a'))


rddfilter.collect()

Out[139]: ['a horse has one stomach']

Demonstrating flatMap

map() vs flatMap()

rddOut1 = rdd.map(lambda line: line.split())


rddOut1.take(15)

Out[140]: [['Do', 'you', 'know', 'that'],


['a', 'horse', 'has', 'one', 'stomach'],
['but', 'a', 'cow', 'has', 'four']]

rddOut2 = rdd.flatMap(lambda line: line.split())


rddOut2.take(15)

Out[141]: ['Do',
'you',
'know',


'that',
'a',
'horse',
'has',
'one',
'stomach',
'but',
'a',
'cow',
'has',
'four']

Using flatMap() to create a list of unique words

# Creating a list of unique words


rddOut3 = rdd.flatMap(lambda line: line.split()).distinct()
rddOut3.take(15)

Out[142]: ['but',
'you',
'that',
'a',
'has',
'Do',
'four',
'stomach',
'cow',
'horse',
'know',
'one']

Key Value Pair RDDs


rddText1=sc.parallelize(["Do you know that", "a horse has one stomach", "but a cow has four"])

# Creating a key value pair RDD with map()


pairs = rddText1.map(lambda x: (x.split(" ")[0], x))
pairs.collect()

Out[144]: [('Do', 'Do you know that'),


('a', 'a horse has one stomach'),
('but', 'but a cow has four')]

# Creating a key value pair RDD with keyBy


rddText1.keyBy(lambda l: len(l)).collect()

Out[145]: [(16, 'Do you know that'),


(23, 'a horse has one stomach'),
(18, 'but a cow has four')]

Shared Variables

Broadcast Variables

list_parm = sc.broadcast(["par1", "par2", "par3"])

data = list_parm.value
print("Stored data ", data)

Stored data ['par1', 'par2', 'par3']

par2 = list_parm.value[2]   # note: index 2 is the third element of the list, 'par3'
print("Parameter 2 is: ", par2)

Parameter 2 is: par3
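
A minimal sketch (an assumed example, not from the course notebook) of where broadcast variables pay off: reading the broadcast value inside a transformation, so the list is shipped to each executor once instead of with every task.

valid = sc.broadcast(["Lisboa", "Porto"])
cities = sc.parallelize(["Lisboa", "Porto", "Faro", "Coimbra"])
cities.filter(lambda c: c in valid.value).collect()   # ['Lisboa', 'Porto']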


Accumulators
accum = sc.accumulator(0)

myrdd = sc.parallelize([20,30,40,50])

myrdd.foreach(lambda n: accum.add(n))

final = accum.value

print("Accumulated value is:", final)

Accumulated value is: 140

ABD05 Spark Algorithms with RDDs Prog2


spark

SparkSession - hive

SparkContext

Spark UI

Version
v3.3.1
Master
local[8]
AppName
Databricks Shell

%fs ls "/FileStore/tables"

Table
   
  path name size modificationTime
1 dbfs:/FileStore/tables/Managers.csv Managers.csv 133114 1667413379000
2 dbfs:/FileStore/tables/Teams.csv Teams.csv 524526 1667413401000
3 dbfs:/FileStore/tables/alice_in_wonderland.txt alice_in_wonderland.txt 148311 1663266909000
4 dbfs:/FileStore/tables/d2buy.csv d2buy.csv 407 1667434540000
5 dbfs:/FileStore/tables/linkFile.txt linkFile.txt 72 1663266909000
6 dbfs:/FileStore/tables/movielens.txt movielens.txt 616155 1667413258000
7 dbfs:/FileStore/tables/movielensABD.csv movielensABD.csv 616155 1667412560000
Showing all 13 rows.

Word count
rdd = sc.textFile("/FileStore/tables/purplecow.txt")

rdd.collect()

Out[156]: ['I never saw a purple cow.',


'I never hope to see one.',
'But I can tell you, anyhow,',
"I'd rather see than be one!"]

rdd.flatMap(lambda line: line.split()).collect()

Out[157]: ['I',
'never',
'saw',
'a',
'purple',


'cow.',
'I',
'never',
'hope',
'to',
'see',
'one.',
'But',
'I',
'can',
'tell',
'you,',
'anyhow,',
"I'd",
'rather',

rdd.flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1)).collect()

Out[158]: [('I', 1),


('never', 1),
('saw', 1),
('a', 1),
('purple', 1),
('cow.', 1),
('I', 1),
('never', 1),
('hope', 1),
('to', 1),
('see', 1),
('one.', 1),
('But', 1),
('I', 1),
('can', 1),
('tell', 1),
('you,', 1),
('anyhow,', 1),
("I'd", 1),
('rather', 1),
('see', 1),

rdd.flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a,b: a+b).collect()

Out[159]: [('never', 2),


('cow.', 1),
('But', 1),
('tell', 1),
('anyhow,', 1),
('rather', 1),
('than', 1),
('one!', 1),
('I', 3),
('saw', 1),
('a', 1),
('purple', 1),
('hope', 1),
('to', 1),
('see', 2),
('one.', 1),
('can', 1),
('you,', 1),
("I'd", 1),
('be', 1)]

# The word count algorithm


count = rdd.flatMap(lambda line: line.split()) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a,b: a+b)
count.collect()

Out[160]: [('never', 2),


('cow.', 1),
('But', 1),
('tell', 1),


('anyhow,', 1),
('rather', 1),
('than', 1),
('one!', 1),
('I', 3),
('saw', 1),
('a', 1),
('purple', 1),
('hope', 1),
('to', 1),
('see', 2),
('one.', 1),
('can', 1),
('you,', 1),
("I'd", 1),
('be', 1)]
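
A small follow-up sketch (not in the original notebook): rank the (word, count) pairs by count to list the most frequent words first.

count.takeOrdered(5, key=lambda kv: -kv[1])   # e.g. [('I', 3), ('never', 2), ('see', 2), ...]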

# Using the Python add operator in word count


from operator import add
file = sc.textFile("/databricks-datasets/samples/docs//README.md")
wcount = file.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
wcount.collect()
#wcount.saveAsTextFile("/work/wcount-out")

Out[161]: [('Welcome', 1),


('Spark', 9),
('documentation!', 1),
('', 71),
('readme', 1),
('walk', 1),
('navigating', 1),
('is', 3),
('source', 3),
('code.', 1),
('documentation', 7),
('specific', 1),
('versions', 1),
('of', 10),
('at', 1),
('http://spark.apache.org/documentation.html.', 1),
('more', 1),
('viewing', 1),
('in', 4),
('yourself?', 1),
('have', 3),

#dbutils.fs.ls("/work/rddOut.txt")
#dbutils.fs.rm("/work/wcount", recurse=True)

Pi estimation
import random
n_samples = 10000

# Just testing the output of the function "inside"


# Run it more than once and check how often the result is < 1
x, y = random.random(), random.random()
print( x*x + y*y )

0.7530264908969033

def inside(p):
    # p is the element from range(); it is unused - the point is sampled at random
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, n_samples)) \
    .filter(inside).count()


print("Pi is roughly %f" % (4.0 * count / n_samples))


# run the function again with a bigger number of samples
# and check the new Pi value

Pi is roughly 3.154400
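
A sketch of the suggested re-run (the sample size below is illustrative): with more samples the estimate tends to move closer to the true value of pi.

n_samples = 1000000
count = sc.parallelize(range(0, n_samples)) \
    .filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / n_samples))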

PageRank
Remember to load the linkFile from Moodle into the Databricks environment

file = "dbfs:/FileStore/tables/linkFile.txt"
iterations = 10

w = sc.textFile(file)
w.collect()

Out[168]: ['page1 page3',


'page2 page1',
'page4 page1',
'page3 page1',
'page4 page2',
'page3 page4']

def computeContribs(neighbors, rank):
    # distribute a page's rank evenly over its outgoing links
    for neighbor in neighbors:
        yield (neighbor, rank/len(neighbors))

links = sc.textFile(file) \
.map(lambda line: line.split()) \
.map(lambda pages: (pages[0], pages[1])) \
.groupByKey() \
.persist()

links.collect()

Out[170]: [('page1', <pyspark.resultiterable.ResultIterable at 0x7fe77838fac0>),


('page2', <pyspark.resultiterable.ResultIterable at 0x7fe77838f880>),
('page4', <pyspark.resultiterable.ResultIterable at 0x7fe77838fe50>),
('page3', <pyspark.resultiterable.ResultIterable at 0x7fe77838f250>)]

ranks = links.map(lambda link: (link[0], 1.0))


ranks.collect()

Out[171]: [('page1', 1.0), ('page2', 1.0), ('page4', 1.0), ('page3', 1.0)]

for x in range(iterations):
    contribs = links.join(ranks) \
        .flatMap(lambda neighborRanks: computeContribs(neighborRanks[1][0], \
                                                       neighborRanks[1][1]))
    ranks = contribs.reduceByKey(lambda v1,v2: v1+v2) \
        .map(lambda pageContribs: (pageContribs[0], \
                                   pageContribs[1] * 0.85 + 0.15))

for rank in ranks.sortByKey().collect() : print(rank)

('page1', 1.4313779845858583)
('page2', 0.4633039012638519)
('page3', 1.3758228705372553)
('page4', 0.7294952436130331)


# The full code now


file = "dbfs:/FileStore/tables/linkFile.txt"
iterations = 10

def computeContribs(neighbors, rank):
    for neighbor in neighbors:
        yield(neighbor, rank/len(neighbors))

links = sc.textFile(file) \
.map(lambda line: line.split()) \
.map(lambda pages: (pages[0], pages[1])) \
.groupByKey() \
.persist()

ranks = links.map(lambda link: (link[0], 1.0))

for x in range(iterations):
    contribs = links.join(ranks) \
        .flatMap(lambda neighborRanks: computeContribs(neighborRanks[1][0], \
                                                       neighborRanks[1][1]))
    ranks = contribs.reduceByKey(lambda v1,v2: v1+v2) \
        .map(lambda pageContribs: (pageContribs[0], \
                                   pageContribs[1] * 0.85 + 0.15))

for rank in ranks.sortByKey().collect() : print(rank)

('page1', 1.4313779845858583)
('page2', 0.4633039012638519)
('page3', 1.3758228705372553)
('page4', 0.7294952436130331)

ABD06 Data Frames and Spark SQL

Checking the SQL environment

spark

SparkSession - hive

SparkContext

Spark UI

Version
v3.3.1
Master
local[8]
AppName
Databricks Shell

# hit tab after the dot to see the available methods of the spark session object
spark

SparkSession - hive

SparkContext

Spark UI

Version
v3.3.1
Master
local[8]
AppName
Databricks Shell

Through the SparkSession.catalog field you can access all the Catalog metadata about your tables, databases, UDFs, etc.

Note: spark.catalog methods return Datasets that can be displayed as tables


# Check the spark.catalog object


spark.catalog

Out[462]: <pyspark.sql.catalog.Catalog at 0x7fe7783cf0d0>

# List the databases in your Catalog


spark.catalog.listDatabases()

Out[463]: [Database(name='abd2022_v2', catalog='spark_catalog', description='', locationUri='dbfs:/user/hive/warehous


e/abd2022_v2.db'),
Database(name='default', catalog='spark_catalog', description='Default Hive database', locationUri='dbfs:/user/hive/
warehouse')]

# More readable output


display(spark.catalog.listDatabases())

Table
   
  name catalog description locationUri
1 abd2022_v2 spark_catalog dbfs:/user/hive/warehouse/abd2022_v2.db
2 default spark_catalog Default Hive database dbfs:/user/hive/warehouse

Showing all 2 rows.

# With a Spark SQL statment programmatically


spark.sql('show databases').display()

Table

  databaseName
1 abd2022_v2
2 default

Showing all 2 rows.

%sql
show databases;

Table

  databaseName
1 abd2022_v2
2 default

Showing all 2 rows.

# Check the current Database


spark.catalog.currentDatabase()

Out[467]: 'default'

# Create a new Database


spark.sql('create database abd2022')

Out[470]: DataFrame[]

spark.sql('show databases').display()

Table

  databaseName
1 abd2022
2 abd2022_v2
3 default

Showing all 3 rows.


spark.sql("USE database abd2022")

Out[472]: DataFrame[]

spark.catalog.currentDatabase()

Out[473]: 'abd2022'

spark.sql("USE database default")

Out[474]: DataFrame[]

# Check if you have a Table defined


spark.catalog.listTables()

Out[475]: [Table(name='counts', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=Tru


e),
Table(name='display_query_1', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=Tru
e),
Table(name='managers', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
Table(name='movies', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
Table(name='static_counts', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
Table(name='teams', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
Table(name='TempTable', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True)]

# Let's create a Table (just to see it in our Database) using an SQL Statement
# Remember: a Table is not the same thing as a Dataframe (more on this later)
# A Dataframe is data in memory
# A Table is data registered on the Spark Catalog

%sql
DROP TABLE IF EXISTS diamonds;
CREATE TABLE diamonds
USING csv
OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true")

OK

spark.catalog.listTables()

Out[193]: [Table(name='diamonds', catalog='spark_catalog', namespace=['default'], description=None, tableType='EXTERN


AL', isTemporary=False)]

#spark.sql("DROP TABLE IF EXISTS diamonds")

%sql
DROP TABLE IF EXISTS diamonds;

OK

# Check if you have Tables defined again


spark.catalog.listTables()

Out[196]: []

Dataframes and Tables

# Let's create a DataFrame with a Spark statement


dep_delays=spark.read.csv('/databricks-datasets/learning-spark-v2/flights/departuredelays.csv', inferSchema='true',
header='True')

Check the type of object you created

type(dep_delays)

Out[198]: pyspark.sql.dataframe.DataFrame

dep_delays.schema


Out[199]: StructType([StructField('date', IntegerType(), True), StructField('delay', IntegerType(), True), StructFiel


d('distance', IntegerType(), True), StructField('origin', StringType(), True), StructField('destination', StringType
(), True)])

dep_delays.summary().display()

Table
     
  summary date delay distance origin destination
1 count 1391578 1391578 1391578 1391578 1391578
2 mean 2180446.584000322 12.079802928761449 690.5508264718184 null null
3 stddev 838031.1536741006 38.8077337498565 513.6628153663316 null null
4 min 1010005 -112 21 ABE ABE
5 25% 1240630 -4 316 null null
6 50% 2161410 0 548 null null
7 75% 3101505 12 893 null null
Showing all 8 rows.

dbutils.data.summarize(dep_delays)

profiles generated in approximate mode

Numeric Features (3)
feature count missing mean std dev zeros min median max
date 1.39M 0% 2.18M 838k 0% 1.01M 2.16M 3.31M (data type: int)
delay 1.39M 0% 12.08 38.81 9.42% -112 0 1,642 (data type: int)
distance 1.39M 0% 690.55 513.66 0% 21 548 4,330 (data type: int)

Categorical Features (2)
feature count missing unique top freq top avg len
origin 1.39M 0% 264 ATL 91.5k 3 (data type: string)
destination 1.39M 0% 313 ATL 90.4k 3 (data type: string)

(The interactive charts of the data profile are not reproduced here.)

spark.catalog.listTables()
# Note that there is no dep_delays on the Spark catalog

Out[202]: []


Let's save our DataFrame in Spark Catalog as a Table and check it

%sql
DROP TABLE IF EXISTS Dep_delays;

OK

dep_delays.write.saveAsTable("Dep_delays")

spark.catalog.listTables()

Out[205]: [Table(name='dep_delays', catalog='spark_catalog', namespace=['default'], description=None, tableType='MANA


GED', isTemporary=False)]

%sql
DROP TABLE IF EXISTS Dep_delays;

OK

Working with Dataframes

DataFrames (like RDDs) support two types of operations: transformations and actions.

Transformations, like select() or filter() create a new DataFrame from an existing one, resulting into another immutable
DataFrame. All transformations are lazy. That is, they are not executed until an action is invoked or performed.

Actions, like show() or count(), return a value with results to the user. Other actions like save() write the DataFrame to distributed
storage (like HDFS, DBFS or S3).

Transformations contribute to a query plan, but nothing is executed until an action is called.
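
As a small illustration of this laziness (a sketch, not part of the original notebook): the filter() call below only adds to the query plan and returns immediately; work happens only when the count() action is invoked.

# Transformation: builds a query plan, nothing is executed yet
lazy_df = spark.range(1000000).filter("id % 2 = 0")

# Action: triggers execution of the whole plan and returns a value
print(lazy_df.count())   # 500000

# explain() shows the physical plan assembled by the transformations
lazy_df.explain()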

Creating a DataFrame

Let's create a DataFrame programmatically

df = spark.range(5)
display(df)

Table

  id
1 0
2 1
3 2
4 3
5 4

Showing all 5 rows.

# A DataFrame is a collection of Row objects


df.collect()

Out[442]: [Row(id=0), Row(id=1), Row(id=2), Row(id=3), Row(id=4)]

df.show()

+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|


| 4|
+---+

df.display()

Table

  id
1 0
2 1
3 2
4 3
5 4

Showing all 5 rows.

df.take(3)

Out[445]: [Row(id=0), Row(id=1), Row(id=2)]

#display(df.describe())
df.describe()

Out[446]: DataFrame[summary: string, id: string]

# Change the name of the column with .toDF("col1_name","col2_name", ...)


df.toDF("num").show()

+---+
|num|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+

df.show()

+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+

# Change the name of the column with withColumnRenamed("col1_old","col2_new")


df.withColumnRenamed("id", "num").show()

+---+
|num|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+

# Doing an operation with a column using selectExpr


# Take note that this command also allows for column name change
df2 = df.selectExpr("(id * 2) as value")
display(df2)


Table

  value
1 0
2 2
3 4
4 6
5 8

Showing all 5 rows.

df2.schema

Out[451]: StructType([StructField('value', LongType(), False)])

df2.printSchema

Out[452]: <bound method DataFrame.printSchema of DataFrame[value: bigint]>

df2.explain()

== Physical Plan ==
*(1) Project [(id#30861L * 2) AS value#30949L]
+- *(1) Range (0, 5, step=1, splits=8)

# spark.createDataFrame(rdd, schema)

df3 = spark.createDataFrame([["Porto", 1900], ["Lisboa", 4000]], ["Name", "Zip-Code"])


df3.show()
df3.printSchema()

+------+--------+
| Name|Zip-Code|
+------+--------+
| Porto| 1900|
|Lisboa| 4000|
+------+--------+

root
|-- Name: string (nullable = true)
|-- Zip-Code: long (nullable = true)

from pyspark.sql import Row


list = [('Mark',25),('Tom',22),('Mary',20),('Sofia',26)]
rdd = sc.parallelize(list)

people = rdd.map(lambda x: Row(name=x[0], age=x[1]))

df_people = spark.createDataFrame(people)
# You can also use rdd.toDF()
# df_people = people.toDF()

display(df_people)

Table
 
  name age
1 Mark 25
2 Tom 22
3 Mary 20
4 Sofia 26

Showing all 4 rows.


df_people.printSchema()

root
|-- name: string (nullable = true)
|-- age: long (nullable = true)

from pyspark.sql.types import StringType, IntegerType, StructType, StructField


data = [['Porto'], ['Lisboa'], ['Faro']]
schema = StructType([StructField('City', StringType(), True)])

df_city0 = spark.createDataFrame(data, schema)


df_city0.show()

+------+
| City|
+------+
| Porto|
|Lisboa|
| Faro|
+------+

# Use .toDF to change column names again


df_city1 = spark.createDataFrame(["Porto", "Lisboa", "Faro"], "string")
df_city1.show()

df_city2 = spark.createDataFrame(["Porto", "Lisboa", "Faro"], "string").toDF("city")


df_city2.show()

+------+
| value|
+------+
| Porto|
|Lisboa|
| Faro|
+------+

+------+
| city|
+------+
| Porto|
|Lisboa|
| Faro|
+------+

Creating a DataFrame from a JSON file

# Deleting json file (just in case it exists to avoid creation error)


dbutils.fs.rm("/tmp/test.json")

Out[234]: True

# Create the JSON File


dbutils.fs.put("/tmp/test.json", """
{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}
{"string":"string3","int":3,"array":[3,6,9],"dict": {"key": "value3"}}
""")
JsonDF = spark.read.json("/tmp/test.json")

Wrote 214 bytes.

JsonDF.show()

+---------+--------+---+-------+
| array| dict|int| string|
+---------+--------+---+-------+
|[1, 2, 3]|{value1}| 1|string1|
|[2, 4, 6]|{value2}| 2|string2|
|[3, 6, 9]|{value3}| 3|string3|


+---------+--------+---+-------+

%fs ls /databricks-datasets/samples/people

Table
   
  path name size modificationTime
1 dbfs:/databricks-datasets/samples/people/people.json people.json 77 1534435526000

Showing 1 row.

dfpeople = spark.read.json("/databricks-datasets/samples/people/people.json")

display(dfpeople)

Table
 
  age name
1 40 Jane
2 30 Andy
3 50 Justin

Showing all 3 rows.

Creating a DataFrame from a CSV file

# Reading the Diamonds CSV to a DataFrame


diamonds0 = spark.read.csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header="true",
inferSchema="true")

diamonds0.display()

Table
       
  _c0 carat cut color clarity depth table price x
1 1 0.23 Ideal E SI2 61.5 55 326 3.95
2 2 0.21 Premium E SI1 59.8 61 326 3.89
3 3 0.23 Good E VS1 56.9 65 327 4.05
4 4 0.29 Premium I VS2 62.4 58 334 4.2
5 5 0.31 Good J SI2 63.3 58 335 4.34
6 6 0.24 Very Good J VVS2 62.8 57 336 3.94
7 7 0.24 Very Good I VVS1 62.3 57 336 3.95
Truncated results, showing first 1,000 rows.

# hit tab after the . to see the available methods


diamonds0.columns

Out[243]: ['_c0',
'carat',
'cut',
'color',
'clarity',
'depth',
'table',
'price',
'x',
'y',
'z']


# Reading with spark.read.format


dataPath = "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv"
diamonds1 = spark.read.format("com.databricks.spark.csv")\
.option("header","true")\
.option("inferSchema", "true")\
.load(dataPath)

# inferSchema means we will automatically figure out column types # at a cost of reading the data more than once
display(diamonds1)

Table
       
  _c0 carat cut color clarity depth table price x
1 1 0.23 Ideal E SI2 61.5 55 326 3.95
2 2 0.21 Premium E SI1 59.8 61 326 3.89
3 3 0.23 Good E VS1 56.9 65 327 4.05
4 4 0.29 Premium I VS2 62.4 58 334 4.2
5 5 0.31 Good J SI2 63.3 58 335 4.34
6 6 0.24 Very Good J VVS2 62.8 57 336 3.94
7 7 0.24 Very Good I VVS1 62.3 57 336 3.95
Truncated results, showing first 1,000 rows.
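
Since inferSchema costs an extra pass over the file (as noted in the comment above), an alternative, shown here only as a sketch that assumes the column names and types seen in the output, is to declare the schema explicitly so no inference pass is needed:

from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType

# Explicit schema for the diamonds CSV; types assumed from the inferred schema shown earlier
diamondsSchema = StructType([
    StructField("_c0", IntegerType(), True),
    StructField("carat", DoubleType(), True),
    StructField("cut", StringType(), True),
    StructField("color", StringType(), True),
    StructField("clarity", StringType(), True),
    StructField("depth", DoubleType(), True),
    StructField("table", DoubleType(), True),
    StructField("price", IntegerType(), True),
    StructField("x", DoubleType(), True),
    StructField("y", DoubleType(), True),
    StructField("z", DoubleType(), True)
])

diamonds_explicit = spark.read.format("csv") \
    .option("header", "true") \
    .schema(diamondsSchema) \
    .load(dataPath)

diamonds_explicit.printSchema()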

Creating a DataFrame from a Text file

# Reading a text file to a DataFrame


textFile = spark.read.text("/databricks-datasets/samples/docs/README.md")

textFile.take(5)

Out[246]: [Row(value='Welcome to the Spark documentation!'),


Row(value=''),
Row(value='This readme will walk you through navigating and building the Spark documentation, which is included'),
Row(value='here with the Spark source code. You can also find documentation specific to release versions of'),
Row(value='Spark at http://spark.apache.org/documentation.html.')]

linesWithSpark = textFile.where(textFile.value.contains("Spark"))
display(linesWithSpark)

Table

  value
1 Welcome to the Spark documentation!
2 This readme will walk you through navigating and building the Spark documentation, which is included
3 here with the Spark source code. You can also find documentation specific to release versions of
4 Spark at http://spark.apache.org/documentation.html.
5 whichever version of Spark you currently have checked out of revision control.
6 The Spark documentation build uses a number of tools to build HTML docs and API docs in Scala,
7 We include the Spark documentation as part of the source (as opposed to using a hosted wiki, such as
Showing all 12 rows.

linesWithSpark = textFile.filter(textFile.value.contains("Spark"))
display(linesWithSpark)

Table

  value
1 Welcome to the Spark documentation!
2 This readme will walk you through navigating and building the Spark documentation, which is included
3 here with the Spark source code. You can also find documentation specific to release versions of
4 Spark at http://spark.apache.org/documentation.html.
5 whichever version of Spark you currently have checked out of revision control.
6 The Spark documentation build uses a number of tools to build HTML docs and API docs in Scala,
7 We include the Spark documentation as part of the source (as opposed to using a hosted wiki, such as


Showing all 12 rows.

Saving the DF (in delta format)

Remember that this is (also) the new format by default

diamonds = spark.read.csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header="true",


inferSchema="true")

# Deleting the file just to avoid error on writing


dbutils.fs.rm("/delta/diamonds", True)
# delete diamonds2 if exists (use after this run)
# dbutils.fs.rm("/delta/diamonds2", True)

Out[250]: True

%fs ls /delta

OK

diamonds.write.format("delta").save("/delta/diamonds")

%fs ls /delta

Table
   
  path name size modificationTime
1 dbfs:/delta/diamonds/ diamonds/ 0 0

Showing 1 row.

diamonds2delta = spark.read.load("/delta/diamonds")
display(diamonds2delta)

Table
       
  _c0 carat cut color clarity depth table price x
1 1 0.23 Ideal E SI2 61.5 55 326 3.95
2 2 0.21 Premium E SI1 59.8 61 326 3.89
3 3 0.23 Good E VS1 56.9 65 327 4.05
4 4 0.29 Premium I VS2 62.4 58 334 4.2
5 5 0.31 Good J SI2 63.3 58 335 4.34
6 6 0.24 Very Good J VVS2 62.8 57 336 3.94
7 7 0.24 Very Good I VVS1 62.3 57 336 3.95
Truncated results, showing first 1,000 rows.

Exploring DataFrames data

diamonds.select("carat","cut","color").show(5)

+-----+-------+-----+
|carat| cut|color|
+-----+-------+-----+
| 0.23| Ideal| E|
| 0.21|Premium| E|
| 0.23| Good| E|
| 0.29|Premium| I|
| 0.31| Good| J|
+-----+-------+-----+
only showing top 5 rows

diamonds.select("color","clarity","table","price").where("color = 'E'").show(5)


+-----+-------+-----+-----+
|color|clarity|table|price|
+-----+-------+-----+-----+
| E| SI2| 55.0| 326|
| E| SI1| 61.0| 326|
| E| VS1| 65.0| 327|
| E| VS2| 61.0| 337|
| E| SI2| 62.0| 345|
+-----+-------+-----+-----+
only showing top 5 rows

diamonds.sort('table').show(25)
#diamonds.sort('table').select('table').show()
#diamonds.sort(diamonds.table.asc(),diamonds.price.desc()).show()
#diamonds.sort(diamonds['table'],diamonds['price']).show()
#diamonds.orderBy('table').show()

+-----+-----+---------+-----+-------+-----+-----+-----+----+----+----+
| _c0|carat| cut|color|clarity|depth|table|price| x| y| z|
+-----+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|11369| 1.04| Ideal| I| VS1| 62.9| 43.0| 4997|6.45|6.41|4.04|
|35634| 0.29|Very Good| E| VS1| 62.8| 44.0| 474| 4.2|4.24|2.65|
| 5980| 1.0| Fair| I| VS1| 64.0| 49.0| 3951|6.43|6.39| 4.1|
|22702| 0.3| Fair| E| SI1| 64.5| 49.0| 630|4.28|4.25|2.75|
|25180| 2.0| Fair| H| SI1| 61.2| 50.0|13764|8.17|8.08|4.97|
| 7419| 1.02| Fair| F| SI1| 61.8| 50.0| 4227|6.59|6.51|4.05|
| 3239| 0.94| Fair| H| SI2| 66.0| 50.1| 3353|6.13|6.17|4.06|
| 1516| 0.91| Fair| F| SI2| 65.3| 51.0| 2996|6.05|5.98|3.93|
| 3980| 1.0| Premium| H| SI1| 62.2| 51.0| 3511|6.47| 6.4| 4.0|
| 4151| 0.91| Premium| F| SI2| 61.0| 51.0| 3546|6.24|6.21| 3.8|
| 8854| 1.0| Fair| E| VS2| 66.4| 51.0| 4480|6.31|6.22|4.16|
|26388| 2.01| Good| H| SI2| 64.0| 51.0|15888|8.08|8.01|5.15|
|33587| 0.37| Premium| F| VS1| 62.7| 51.0| 833|4.65|4.57|2.89|
|45799| 0.51| Fair| E| VS2| 65.5| 51.0| 1709|5.06|5.01| 3.3|
|46041| 0.57| Good| H| VS1| 63.7| 51.0| 1728|5.36|5.29|3.39|
|47631| 0.67| Good| I| VVS2| 58.9| 51.0| 1882|5.74|5.78| 3.4|
|24816| 2.0|Very Good| J| VS1| 61.0| 51.6|13203|8.14|8.18|4.97|
|10541| 1.0| Ideal| E| SI2| 62.2| 52.0| 4808|6.42|6.47|4.01|

# Choose this syntax with [] for better future compatibility and fewer errors
# (like column names that are also attributes on the DataFrame class)
diamonds.filter(diamonds['color'] == 'E').show(5)

+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat| cut|color|clarity|depth|table|price| x| y| z|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
| 1| 0.23| Ideal| E| SI2| 61.5| 55.0| 326|3.95|3.98|2.43|
| 2| 0.21|Premium| E| SI1| 59.8| 61.0| 326|3.89|3.84|2.31|
| 3| 0.23| Good| E| VS1| 56.9| 65.0| 327|4.05|4.07|2.31|
| 9| 0.22| Fair| E| VS2| 65.1| 61.0| 337|3.87|3.78|2.49|
| 15| 0.2|Premium| E| SI2| 60.2| 62.0| 345|3.79|3.75|2.27|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
only showing top 5 rows

diamonds.select(diamonds["clarity"],diamonds["price"] * 1.10).show(5)

+-------+------------------+
|clarity| (price * 1.1)|
+-------+------------------+
| SI2| 358.6|
| SI1| 358.6|
| VS1|359.70000000000005|
| VS2|367.40000000000003|
| SI2|368.50000000000006|
+-------+------------------+
only showing top 5 rows

# What is the average price of diamonds by color?


from pyspark.sql.functions import avg
display(diamonds.select("color","price").groupBy("color").agg(avg("price")))


Table
 
  color avg(price)
1 F 3724.886396981765
2 E 3076.7524752475247
3 D 3169.9540959409596
4 J 5323.81801994302
5 G 3999.135671271697
6 I 5091.874953891553
7 H 4486.669195568401
Showing all 7 rows.

Running SQL queries programmatically

#spark.catalog.dropTempView("temptable")
#spark.catalog.dropGlobalTempView("temptable")
# There are some limitations to dropping Tables with spark.catalog; use %sql instead

%sql
DROP VIEW IF EXISTS temptable

OK

# Use createTempView("name")
# write.saveAsTable is also an option (see below)

diamonds.createTempView("temptable")

spark.catalog.listTables()

Out[263]: [Table(name='temptable', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=T


rue)]

spark.sql(" select * from temptable where color = 'E' ").limit(10).show()

+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat| cut|color|clarity|depth|table|price| x| y| z|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
| 1| 0.23| Ideal| E| SI2| 61.5| 55.0| 326|3.95|3.98|2.43|
| 2| 0.21| Premium| E| SI1| 59.8| 61.0| 326|3.89|3.84|2.31|
| 3| 0.23| Good| E| VS1| 56.9| 65.0| 327|4.05|4.07|2.31|
| 9| 0.22| Fair| E| VS2| 65.1| 61.0| 337|3.87|3.78|2.49|
| 15| 0.2| Premium| E| SI2| 60.2| 62.0| 345|3.79|3.75|2.27|
| 16| 0.32| Premium| E| I1| 60.9| 58.0| 345|4.38|4.42|2.68|
| 22| 0.23|Very Good| E| VS2| 63.8| 55.0| 352|3.85|3.92|2.48|
| 33| 0.23|Very Good| E| VS1| 60.7| 59.0| 402|3.97|4.01|2.42|
| 34| 0.23|Very Good| E| VS1| 59.5| 58.0| 402|4.01|4.06| 2.4|
| 37| 0.23| Good| E| VS1| 64.1| 59.0| 402|3.83|3.85|2.46|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+

spark.sql(" select * from temptable where color = 'E' ").show(5)

+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat| cut|color|clarity|depth|table|price| x| y| z|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
| 1| 0.23| Ideal| E| SI2| 61.5| 55.0| 326|3.95|3.98|2.43|
| 2| 0.21|Premium| E| SI1| 59.8| 61.0| 326|3.89|3.84|2.31|
| 3| 0.23| Good| E| VS1| 56.9| 65.0| 327|4.05|4.07|2.31|
| 9| 0.22| Fair| E| VS2| 65.1| 61.0| 337|3.87|3.78|2.49|
| 15| 0.2|Premium| E| SI2| 60.2| 62.0| 345|3.79|3.75|2.27|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
only showing top 5 rows


%sql
DROP VIEW IF EXISTS temptable;
DROP TABLE IF EXISTS diamonds_;

OK

# Example with saveAsTable


diamonds.limit(5).write.saveAsTable("diamonds_")

spark.catalog.listTables()

Out[268]: [Table(name='diamonds_', catalog='spark_catalog', namespace=['default'], description=None, tableType='MANAG


ED', isTemporary=False)]

spark.sql(" select * from diamonds_ ").show()

+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat| cut|color|clarity|depth|table|price| x| y| z|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
| 1| 0.23| Ideal| E| SI2| 61.5| 55.0| 326|3.95|3.98|2.43|
| 2| 0.21|Premium| E| SI1| 59.8| 61.0| 326|3.89|3.84|2.31|
| 3| 0.23| Good| E| VS1| 56.9| 65.0| 327|4.05|4.07|2.31|
| 4| 0.29|Premium| I| VS2| 62.4| 58.0| 334| 4.2|4.23|2.63|
| 5| 0.31| Good| J| SI2| 63.3| 58.0| 335|4.34|4.35|2.75|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+

# Creating a DataFrame from a Table in Hive


df_f_table = spark.table("diamonds_")
df_f_table.show(3)

+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat| cut|color|clarity|depth|table|price| x| y| z|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
| 1| 0.23| Ideal| E| SI2| 61.5| 55.0| 326|3.95|3.98|2.43|
| 2| 0.21|Premium| E| SI1| 59.8| 61.0| 326|3.89|3.84|2.31|
| 3| 0.23| Good| E| VS1| 56.9| 65.0| 327|4.05|4.07|2.31|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
only showing top 3 rows

diamonds.explain()

== Physical Plan ==
FileScan csv [_c0#13653,carat#13654,cut#13655,color#13656,clarity#13657,depth#13658,table#13659,price#13660,x#13661,y
#13662,z#13663] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[dbfs:/databricks-d
atasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_c
0:int,carat:double,cut:string,color:string,clarity:string,depth:double,table:double,pric...

%sql
DROP VIEW IF EXISTS temptable;
DROP TABLE IF EXISTS diamonds_;

OK

Working with UDFs

#diamonds.select( "table" ).show(2)


diamonds.registerTempTable("TempTable")

/databricks/spark/python/pyspark/sql/dataframe.py:234: FutureWarning: Deprecated in 2.0, use createOrReplaceTempView


instead.
warnings.warn("Deprecated in 2.0, use createOrReplaceTempView instead.", FutureWarning)
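
As the warning says, createOrReplaceTempView is the non-deprecated replacement; a minimal sketch using the same DataFrame and view name (just an alternative to the cell above, not an extra step in the original notebook):

# Preferred API for registering the temporary view
diamonds.createOrReplaceTempView("TempTable")

spark.sql("select table, price from TempTable limit 3").show()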


# Spark UDFs
#from pyspark.sql.functions import udf

def price_plus(num):
    return num*2

def price_tag(num):
    if num < 330:
        tag = 'Good'
    else:
        tag = 'Bad'
    return tag

price_plusUDF = udf(price_plus)
price_tagUDF = udf(price_tag)

# Using UDFs with DataFrames


diamonds.select( "table", price_plusUDF("table").alias("Price_Plus"), price_tagUDF("table").alias("Price_Note")
).show(2)

+-----+----------+----------+
|table|Price_Plus|Price_Note|
+-----+----------+----------+
| 55.0| 110.0| Good|
| 61.0| 122.0| Good|
+-----+----------+----------+
only showing top 2 rows

# Using UDFs in %SQL


spark.udf.register("pricePlus", price_plus)
spark.udf.register("priceTag", price_tag)

Out[277]: <function __main__.price_tag(num)>

%sql select table, pricePlus(table) as TablePlus, priceTag(price) as PriceNote from TempTable limit 3

Table
  
  table TablePlus PriceNote
1 55 110.0 Good
2 61 122.0 Good
3 65 130.0 Good

Showing all 3 rows.

# Checking the fields in our table first


#diamonds.select('table','price').show(5)
diamonds.select(diamonds['table'],diamonds['price']).show(3)

+-----+-----+
|table|price|
+-----+-----+
| 55.0| 326|
| 61.0| 326|
| 65.0| 327|
+-----+-----+
only showing top 3 rows

# Use withColumn() to add a new column or derive a new column based on an existing one
display( diamonds.withColumn("table", price_plusUDF("table")))

Table
  _c0 carat cut color clarity depth table price x
1 1 0.23 Ideal E SI2 61.5 110.0 326 3.95
2 2 0.21 Premium E SI1 59.8 122.0 326 3.89
3 3 0.23 Good E VS1 56.9 130.0 327 4.05
5 5 0.31 Good J SI2 63.3 116.0 335 4.34
6 6 0.24 Very Good J VVS2 62.8 114.0 336 3.94
7 7 0.24 Very Good I VVS1 62.3 114.0 336 3.95
8 8 0.26 Very Good H SI1 61.9 110.0 337 4.07
9 9 0.22 Fair E VS2 65.1 122.0 337 3.87
10 10 0.23 Very Good H VS1 59.4 122.0 338 4
11 11 0.3 Good J SI1 64 110.0 339 4.25
12 12 0.23 Ideal J VS1 62.8 112.0 340 3.93
13 13 0.22 Premium F SI1 60.4 122.0 342 3.88
14 14 0.31 Ideal J SI2 62.2 108.0 344 4.35
15 15 0.2 Premium E SI2 60.2 124.0 345 3.79
16 16 0.32 Premium E I1 60.9 116.0 345 4.38
17 17 0.3 Ideal I SI2 62 108.0 348 4.31
Truncated results, showing first 1,000 rows.

Convert DataFrame to RDD (Row RDD)

df_people1 = spark.read.json("/databricks-datasets/samples/people/people.json")
rdd_1 = df_people1.rdd
rdd_1.collect()

Out[281]: [Row(age=40, name='Jane'),
Row(age=30, name='Andy'),
Row(age=50, name='Justin')]

ABD07 Data Lake | ELT example

Data Lake | Data Layers


1. Bronze: read raw data from Parquet files
2. Silver: perform ELT to clean and conform the data
3. Gold: aggregate the data to fit specific exploration use cases

Data is from Lending Club. It includes funded loans from 2012 to 2017. Each loan includes demographic information, current loan
status (Current, Late, Fully Paid, etc.) and latest payment info.

Notes to consider:
The management of the Data Lake layers is done here with folders in the file system, but it could also be done with databases in
Spark SQL (see the sketch below)
Parquet format will be used for the Bronze layer and the Delta format for the Silver and Gold layers
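
As noted above, the layers could also be managed with databases in Spark SQL rather than folders; the sketch below only illustrates that alternative, and the database names are hypothetical, not used elsewhere in this notebook:

# Hypothetical alternative: one Spark SQL database per Data Lake layer
for layer_db in ["abd_bronze", "abd_silver", "abd_gold"]:
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {layer_db}")

# A DataFrame would then be registered as a table in the layer's database instead of a path, e.g.:
# rawL.write.format("parquet").mode("overwrite").saveAsTable("abd_bronze.loans")

spark.sql("SHOW DATABASES").show()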

# Removing old file (in case of previous runs)


dbutils.fs.rm("/FileStore/tables/DL", True)
#dbutils.fs.rm("/FileStore/tables/DL/Bronze/", True)
#dbutils.fs.rm("/FileStore/tables/DL/Silver/", True)
#dbutils.fs.rm("/FileStore/tables/DL/Gold/", True)

Out[282]: False

# Configure the Path for the 3 data layer


DL_Bronze_Path = "/FileStore/tables/DL/Bronze/"
DL_Silver_Path = "/FileStore/tables/DL/Silver/"
DL_Gold_Path = "/FileStore/tables/DL/Gold/"

# Importing libraries
from pyspark.sql.functions import *
import time
import datetime

Read the Source data and process it for the Silver

This is where the "DATA" rules should apply

# Read Parquet files with Spark


# Let's assume the file has the daily Loans
rawL = spark.read.parquet("/databricks-datasets/samples/lending_club/parquet/")


rawL.count()

Out[286]: 1481560

# Check the structure of the data


rawL.printSchema()

root
|-- id: string (nullable = true)
|-- member_id: string (nullable = true)
|-- loan_amnt: float (nullable = true)
|-- funded_amnt: integer (nullable = true)
|-- funded_amnt_inv: double (nullable = true)
|-- term: string (nullable = true)
|-- int_rate: string (nullable = true)
|-- installment: double (nullable = true)
|-- grade: string (nullable = true)
|-- sub_grade: string (nullable = true)
|-- emp_title: string (nullable = true)
|-- emp_length: string (nullable = true)
|-- home_ownership: string (nullable = true)
|-- annual_inc: float (nullable = true)
|-- verification_status: string (nullable = true)
|-- loan_status: string (nullable = true)
|-- pymnt_plan: string (nullable = true)
|-- url: string (nullable = true)
|-- desc: string (nullable = true)
|-- purpose: string (nullable = true)

# Check the content of the data


rawL.display()

Table
       
  id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment
1 null null 35000 35000 35000 36 months 17.27% 1252.56
2 null null 8000 8000 8000 36 months 18.25% 290.23
3 null null 5000 5000 5000 36 months 6.97% 154.32
4 null null 10000 10000 10000 36 months 9.75% 321.5
5 null null 24000 24000 24000 36 months 9.75% 771.6
6 null null 9600 9600 9600 36 months 9.75% 308.64
7 null null 13000 13000 13000 60 months 8.39% 266.03
Truncated results, showing first 1,000 rows.

# Some null records are not relevant

rawL.filter("loan_amnt is not null").count()


#rawL.filter("loan_amnt is null").count()
#rawL.filter(col("loan_amnt").isNull() == True).count()
#rawL.filter(isnull("loan_amnt") == True).count()

Out[289]: 1481542

rawL = rawL.filter("loan_amnt is not null").limit(100)

rawL.count()

Out[291]: 100

rawL.display()

Table
  id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment
1 null null 35000 35000 35000 36 months 17.27% 1252.56
2 null null 8000 8000 8000 36 months 18.25% 290.23
6 null null 9600 9600 9600 36 months 9.75% 308.64
7 null null 13000 13000 13000 60 months 8.39% 266.03
Showing all 100 rows.

Correct data formats

# Transforming string columns into numeric columns
rawL = rawL.withColumn('int_rate', regexp_replace('int_rate', '%', '').cast('float')) \
    .withColumn('revol_util', regexp_replace('revol_util', '%', '').cast('float')) \
    .withColumn('issue_year', substring(rawL.issue_d, 5, 4).cast('double') ) \
    .withColumn('earliest_year', substring(rawL.earliest_cr_line, 5, 4).cast('double'))
17 null null 10500 10500 10500 36 months 11.99% 348.71 C
rawL = rawL.withColumn('emp_length', trim(regexp_replace(rawL.emp_length, "([ ]*+[a-zA-Z].*)|(n/a)", "") ))
rawL = rawL.withColumn('emp_length', trim(regexp_replace(rawL.emp_length, "< 1", "0") ))
rawL = rawL.withColumn('emp_length', trim(regexp_replace(rawL.emp_length, "10\\+", "10") ).cast('float'))

rawL.display()
#rawL.select('int_rate','revol_util','issue_year','earliest_year').display()

Table
       
  id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment
1 null null 35000 35000 35000 36 months 17.27 1252.56
2 null null 8000 8000 8000 36 months 18.25 290.23
3 null null 5000 5000 5000 36 months 6.97 154.32
4 null null 10000 10000 10000 36 months 9.75 321.5
5 null null 24000 24000 24000 36 months 9.75 771.6
6 null null 9600 9600 9600 36 months 9.75 308.64
7 null null 13000 13000 13000 60 months 8.39 266.03
Showing all 100 rows.

Save the cleaned Raw data as a Bronze file in Delta Lake


# Write the data in the Bronze path
# This data should be able to be reprocessed with minimal effort

date_time = datetime.datetime.now()
date_tag = date_time.strftime("%Y-%b-%d")
Bronze_Path = DL_Bronze_Path + date_tag

rawL.write.format('parquet').mode('overwrite').save(Bronze_Path + '/Loans')

Read the Bronze data and process it for the Silver layer

This is where the "BUSINESS" rules should apply

loans = spark.read.parquet(Bronze_Path + "/Loans")

# You may want to creat a log with statistics from your ELT process for control purposes
loans.count()

Out[298]: 100

display(loans)

Table
  id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment
1 null null 35000 35000 35000 36 months 17.27 1252.56
5 null null 24000 24000 24000 36 months 9.75 771.6 B
6 null null 9600 9600 9600 36 months 9.75 308.64 B
7 null null 13000 13000 13000 60 months 8.39 266.03 B
8 null null 9000 9000 9000 36 months 9.16 286.87 B
9 null null 18000 18000 18000 36 months 12.99 606.41 C
Showing all 100 rows.

Here we should apply the business rules to "conform" our data to the Silver layer

# Start by selecting only the columns we need for the Silver layer
loans = loans.select("loan_status", "int_rate", "revol_util", "issue_d", "earliest_cr_line", "emp_length", "verification_status", \
    "total_pymnt", "loan_amnt", "grade", "annual_inc", "dti", "addr_state", "term", "home_ownership", "purpose", \
    "issue_year", "earliest_year", "application_type", "delinq_2yrs", "total_acc")
# Creating 'bad_loan' label, which includes charged off, defaulted, and late repayments on loans
loans = loans.filter(loans.loan_status.isin(["Default", "Charged Off", "Fully Paid"])) \
.withColumn("bad_loan", (~(loans.loan_status == "Fully Paid")).cast("string"))

# Bucketing verification_status values together


loans = loans.withColumn('verification_status', trim(regexp_replace(loans.verification_status, 'Source Verified',
'Verified')))

# Calculating the 'credit_length_in_years' column


loans = loans.withColumn('credit_length_in_years', (loans.issue_year - loans.earliest_year))

# Calculating the 'net' column, the total amount of money earned or lost per loan
loans = loans.withColumn('net', round(loans.total_pymnt - loans.loan_amnt, 2))

display(loans)

Table
      
  loan_status int_rate revol_util issue_d earliest_cr_line emp_length verification_status total_pymn
1 Fully Paid 5.32 27.9 Mar-2016 Nov-2000 8 Not Verified 16098.34
2 Fully Paid 9.75 19.4 Mar-2016 Aug-2010 1 Verified 8663.31
3 Fully Paid 5.32 23.9 Mar-2016 Dec-1989 10 Not Verified 9361.74112
4 Fully Paid 12.99 75.4 Mar-2016 Apr-2007 1 Verified 11088.6700
5 Charged Off 21.18 87.5 Mar-2016 Jun-2004 0 Verified 4693.26
6 Fully Paid 7.39 59.1 Mar-2016 May-1995 2 Verified 7280.34770
7 Charged Off 15.31 7.7 Mar-2016 Jan-1998 null Verified 10162.81
Showing all 23 rows.

Save our cleaned and conformed data as a Silver file and table in the
Delta Lake
# Write the data in the Silver path
# This data should represent a clean history of business facts with the maximum level of data detail
# The Silver layer could store info by year (don't think it's necessary in this case)

#date_time = datetime.datetime.now()
#date_tag = date_time.strftime("%Y")
#Silver_Path = DL_Silver_Path + date_tag

# Write the data on disk (use append as previous records will exist in the SV table)
file_path_SV_loans = DL_Silver_Path + '/Loans_SV'
loans.write.format('delta').mode('append').save(file_path_SV_loans)

# Register an SQL table in the database to facilitate SQL user Queries


spark.sql(f"CREATE TABLE Loans_SV USING delta LOCATION '{file_path_SV_loans}'")

Out[304]: DataFrame[]


# Check the Table in the Catalog


spark.catalog.listTables()

Out[305]: [Table(name='loans_sv', catalog='spark_catalog', namespace=['default'], description=None, tableType='EXTERN


AL', isTemporary=False),
Table(name='TempTable', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True)]

Read the Silver data and prepare it with aggregations for a Business
Unit in the Gold layer
# Read the data (here we choose to read it from the Table)
loans_SV = spark.table('Loans_SV')

display(loans_SV)

Table
      
  loan_status int_rate revol_util issue_d earliest_cr_line emp_length verification_status total_pymn
1 Fully Paid 5.32 27.9 Mar-2016 Nov-2000 8 Not Verified 16098.34
2 Fully Paid 9.75 19.4 Mar-2016 Aug-2010 1 Verified 8663.31
3 Fully Paid 5.32 23.9 Mar-2016 Dec-1989 10 Not Verified 9361.74112
4 Fully Paid 12.99 75.4 Mar-2016 Apr-2007 1 Verified 11088.6700
5 Charged Off 21.18 87.5 Mar-2016 Jun-2004 0 Verified 4693.26
6 Fully Paid 7.39 59.1 Mar-2016 May-1995 2 Verified 7280.34770
7 Charged Off 15.31 7.7 Mar-2016 Jan-1998 null Verified 10162.81
Showing all 23 rows.

Create a Gold Table

Gold Tables are often created to provide clean, reliable data for a specific business unit or use case.

In our case, we'll create a Gold table that includes only 2 columns - addr_state and count - to provide an aggregated view of
our data. For our purposes, this table will allow us to show what Delta Lake can do, but in practice a table like this could be used
to feed a downstream reporting or BI tool that needs data formatted in a very specific way. Silver tables often feed multiple
downstream Gold tables.

# Aggregate the data from the Gold layer


loans_by_state = loans_SV.groupBy("addr_state").count()

loans_by_state.count()

Out[309]: 15

# Write out the data in Delta format


file_path_GD_loans_state = DL_Gold_Path + 'loans_by_state'
loans_by_state.write.format('delta').save(file_path_GD_loans_state)

# Register an SQL table in the database to facilitate SQL user Queries


spark.sql(f"CREATE TABLE loans_by_state USING delta LOCATION '{file_path_GD_loans_state}'")

Out[311]: DataFrame[]
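
Since Silver tables often feed multiple downstream Gold tables (as noted above), a second Gold aggregate could be derived from the same Silver data; the sketch below is purely illustrative (loans_by_grade and its path are hypothetical, not part of the original pipeline):

# Hypothetical second Gold table fed by the same Silver table
loans_by_grade = loans_SV.groupBy("grade").count()

file_path_GD_loans_grade = DL_Gold_Path + 'loans_by_grade'   # illustrative path
loans_by_grade.write.format('delta').mode('overwrite').save(file_path_GD_loans_grade)

spark.sql(f"CREATE TABLE IF NOT EXISTS loans_by_grade USING delta LOCATION '{file_path_GD_loans_grade}'")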

%sql
-- show tables;

Explore the Gold layer with SQL, as traditional users would


%sql show tables;

Table
  
  database tableName isTemporary


1 default loans_by_state false


2 default loans_sv false
3 temptable true

Showing all 3 rows.

%sql
SELECT *
FROM loans_by_state

Table
 
  addr_state count
1 SC 1
2 MN 1
3 VA 1
4 MI 1
5 WI 1
6 MD 2
7 MO 1
Showing all 15 rows.

%sql
SELECT addr_state, sum(`count`) AS loans
FROM loans_by_state
GROUP BY addr_state

Table
 
  addr_state loans
1 SC 1
2 MN 1
3 VA 1
4 MI 1
5 WI 1
6 MD 2
7 MO 1
Showing all 15 rows.

Drop the Tables you don't need just for house cleaning purposes
%sql
drop table loans_by_state;
drop table loans_sv;

OK

dbutils.fs.rm("/FileStore/tables/DL", True)

Out[316]: True

ABD07 Spark SQL Examples


spark


SparkSession - hive

SparkContext

Spark UI

Version
v3.3.1
Master
local[8]
AppName
Databricks Shell

# Check if you have tables registered


spark.sql("show tables").show()
#spark.catalog.listTables()
#spark.catalog.listTables('default')
#spark.catalog.listTables('global_temp')

+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| |temptable| true|
+--------+---------+-----------+

%sql show tables;

Table
  
  database tableName isTemporary
1 temptable true

Showing 1 row.

Using Group By and aggregation functions

from pyspark.sql.functions import *

#import pyspark.sql.functions as f
#if you use the syntax above you have to prefix the functions with "f".
#Ex: f.explode, f.split, f.avg

#avg_dim_metric = df.groupBy("col_dimension").agg(avg("col_metric"))
#avg_dim_metric.show()

from pyspark.sql.functions import *


dataGB = [("UK","Brooke", "F", 20), ("UK", "Denny", "M", 31), ("UK", "Jules", "M", 30), ("UK", "Tom", "M", 35),
("UK", "Mary","F", 25)]
dataGB = dataGB + [("PT","Pedro", "M", 28), ("PT", "Rui","M", 40), ("PT", "Carlos","M", 34), ("PT", "Maria","F", 45),
("PT", "Sandra","F", 28)]
#list(data)

# Create the DataFrame


dfGB = spark.createDataFrame(dataGB, ["country", "name", "gender", "age"])
dfGB.show()

+-------+------+------+---+
|country| name|gender|age|
+-------+------+------+---+
| UK|Brooke| F| 20|
| UK| Denny| M| 31|
| UK| Jules| M| 30|
| UK| Tom| M| 35|
| UK| Mary| F| 25|
| PT| Pedro| M| 28|
| PT| Rui| M| 40|
| PT|Carlos| M| 34|
| PT| Maria| F| 45|
| PT|Sandra| F| 28|


+-------+------+------+---+

# Group the rows by country and compute aggregate statistics (average, sum, max, min) on age
#dfGB.groupBy("country").agg(avg("age").alias("avg_age")).show()
dfGB.groupBy("country").agg(avg("age"), sum("age"), max("age"), min("age")).show()

+-------+--------+--------+--------+--------+
|country|avg(age)|sum(age)|max(age)|min(age)|
+-------+--------+--------+--------+--------+
| UK| 28.2| 141| 35| 20|
| PT| 35.0| 175| 45| 28|
+-------+--------+--------+--------+--------+

dfGB.groupBy("country").agg(avg("age"), sum("age"), max("age"), min("age")).show()

+-------+--------+--------+--------+--------+
|country|avg(age)|sum(age)|max(age)|min(age)|
+-------+--------+--------+--------+--------+
| UK| 28.2| 141| 35| 20|
| PT| 35.0| 175| 45| 28|
+-------+--------+--------+--------+--------+

dfGB.groupBy("country","gender").agg(avg("age")).show()

+-------+------+--------+
|country|gender|avg(age)|
+-------+------+--------+
| UK| F| 22.5|
| UK| M| 32.0|
| PT| M| 34.0|
| PT| F| 36.5|
+-------+------+--------+

dfGB.groupBy("country").pivot("gender").agg(avg("age")).show()

+-------+----+----+
|country| F| M|
+-------+----+----+
| PT|36.5|34.0|
| UK|22.5|32.0|
+-------+----+----+

Using Explode

dataExp = [('Client1', 'p1,p2,p3'), ('Client2', 'p1,p3,p5'), ('Client3', 'p3,p4')]


dfExp = spark.createDataFrame(dataExp, ['Client', 'List_Products'])

# Use withColumn to derive/create a new column 'Product' based on the column 'List_Products'
dfExp1 = dfExp.withColumn('Product', split('List_Products', ','))
dfExp1.show()

+-------+-------------+------------+
| Client|List_Products| Product|
+-------+-------------+------------+
|Client1| p1,p2,p3|[p1, p2, p3]|
|Client2| p1,p3,p5|[p1, p3, p5]|
|Client3| p3,p4| [p3, p4]|
+-------+-------------+------------+

dfExp2 = dfExp1.select('Client', explode('Product').alias('Product'))


dfExp2.display()

Table
 
  Client Product
1 Client1 p1


2 Client1 p2
3 Client1 p3
4 Client2 p1
5 Client2 p3
6 Client2 p5
7 Client3 p3
Showing all 8 rows.

dataExp = [('Client1', 'p1,p2,p3'), ('Client2', 'p1,p2,p3'), ('Client3', 'p3,p4')]


dfExp = spark.createDataFrame(dataExp, ['Client', 'List_Products'])
dfExp1 = dfExp.withColumn('Product', split('List_Products', ','))
dfExp2 = dfExp1.select('Client', explode('Product').alias('Product'))
dfExp2.display()

Table
 
  Client Product
1 Client1 p1
2 Client1 p2
3 Client1 p3
4 Client2 p1
5 Client2 p2
6 Client2 p3
7 Client3 p3
Showing all 8 rows.

# The same as above using a different syntax


arrayData = [('Client1',["p1","p2","p3"]),('Client2',["p1","p2","p3"]),('Client3',["p3", "p4"])]
df3 = spark.createDataFrame(data=arrayData, schema = ['Client','acquisitions'])
df4 = df3.select(df3['Client'], explode(df3['acquisitions']))
df4.show()

+-------+---+
| Client|col|
+-------+---+
|Client1| p1|
|Client1| p2|
|Client1| p3|
|Client2| p1|
|Client2| p2|
|Client2| p3|
|Client3| p3|
|Client3| p4|
+-------+---+

Join

EmpData = [(1,"Paul",10), (2,"Mary",20), (3,"Tom",10), (4,"Sandy",30)]


EmpDF = spark.createDataFrame(EmpData, ["emp_id","emp_name","dept_id"])
DeptData = [("Finance",10), ("Marketing",20), ("Sales",30),("HR",40)]
DeptDF=spark.createDataFrame(DeptData,["dept_name","dept_id"])

#df1.join(DF2, joinExprs, joinType = inner/cross/outer/full/left/right... )

EmpDF.join(DeptDF, "dept_id").show()
#EmpDF.join(DeptDF, on = "dept_id").show()
#EmpDF.join(DeptDF,EmpDF["dept_id"] == DeptDF["dept_id"]).show()

+-------+------+--------+---------+
|dept_id|emp_id|emp_name|dept_name|
+-------+------+--------+---------+
| 10| 1| Paul| Finance|
| 10| 3| Tom| Finance|
| 20| 2| Mary|Marketing|
| 30| 4| Sandy| Sales|
+-------+------+--------+---------+


# Joins don't support more than 2 DataFrames. Use a chained join instead.
# df1.join(df2,col).join(df3,col)
AdressData=[(1,"1523 Main St","SFO","CA"), (2,"3453 Orange St","SFO","NY"), (3,"34 Warner St","Jersey","NJ"), (4,"221
Cavalier St","Newark","DE"),(5,"789 Walnut St","Sandiago","CA")]
AdressData = spark.createDataFrame(AdressData, ["emp_id","addline1","city","state"])
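
Following the comment above, a chained join over the three DataFrames defined in this section (EmpDF, DeptDF, AdressData) would look like the sketch below; the selected columns are just an illustrative subset:

# Chain two joins: employees -> departments -> addresses
EmpDF.join(DeptDF, "dept_id") \
    .join(AdressData, "emp_id") \
    .select("emp_id", "emp_name", "dept_name", "city", "state") \
    .show()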

Union

EmpDataPlus = [(5,"Victor",10), (6,"Sam",20), (7,"Paty",10), (8,"Carol",30), (8,"Carol",30)]


EmpDFplus = spark.createDataFrame(EmpDataPlus, ["emp_id","emp_name","dept_id"])

#EmpDF.union(EmpDFplus).show()
# Check the number of columns first; DataFrames with a different number of columns will raise an error
# union() keeps duplicate rows; use distinct() or dropDuplicates() (which allows selecting the columns to compare) to remove them
#EmpDF.union(EmpDFplus).distinct().show()
EmpDF.union(EmpDFplus).dropDuplicates(['emp_id']).show()

+------+--------+-------+
|emp_id|emp_name|dept_id|
+------+--------+-------+
| 1| Paul| 10|
| 2| Mary| 20|
| 3| Tom| 10|
| 4| Sandy| 30|
| 5| Victor| 10|
| 6| Sam| 20|
| 7| Paty| 10|
| 8| Carol| 30|
+------+--------+-------+
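
To illustrate the earlier comment about column counts (a sketch, not from the original notebook): union() matches columns purely by position and requires the same number of columns, while unionByName() matches columns by name, so a DataFrame with the same columns in a different order is still combined correctly.

# Same column names in a different order: unionByName() aligns them by name
EmpDataMixed = [(9, 40, "Zoe"), (10, 20, "Leo")]   # hypothetical extra rows
EmpDFmixed = spark.createDataFrame(EmpDataMixed, ["emp_id", "dept_id", "emp_name"])

EmpDF.unionByName(EmpDFmixed).show()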

Conditional formatting

moviesDF = spark.read.csv("dbfs:/FileStore/tables/movielensABD.csv", header="false", sep=";", inferSchema="true") \


.toDF("id", "name", "genres", "na", "rating", "views")
display(moviesDF)

Table
 
  id name genres
1 1 Toy Story (1995) Adventure|Animation|Chil
2 2 Jumanji (1995) Adventure|Children|Fantas
3 3 Grumpier Old Men (1995) Comedy|Romance
4 4 Waiting to Exhale (1995) Comedy|Drama|Romance
5 5 Father of the Bride Part II (1995) Comedy
6 6 Heat (1995) Action|Crime|Thriller
7 7 Sabrina (1995) Comedy|Romance
Truncated results, showing first 1,000 rows.

moviesDF.withColumn('comedy',when(col('genres').contains('Comedy'),lit('Yes')).otherwise(lit('No'))).show()
# Function lit() creates a column of literal values

+---+--------------------+--------------------+---+------+----------+------+
| id| name| genres| na|rating| views|comedy|
+---+--------------------+--------------------+---+------+----------+------+
| 1| Toy Story (1995)|Adventure|Animati...| 7| 3.0| 851866703| Yes|
| 2| Jumanji (1995)|Adventure|Childre...| 15| 2.0|1134521380| No|
| 3|Grumpier Old Men ...| Comedy|Romance| 5| 4.0|1163374957| Yes|
| 4|Waiting to Exhale...|Comedy|Drama|Romance| 19| 3.0| 855192868| Yes|
| 5|Father of the Bri...| Comedy| 15| 4.5|1093070098| Yes|
| 6| Heat (1995)|Action|Crime|Thri...| 15| 4.0|1040205753| No|
| 7| Sabrina (1995)| Comedy|Romance| 18| 3.0| 856006982| Yes|
| 8| Tom and Huck (1995)| Adventure|Children| 30| 4.0| 968786809| No|
| 9| Sudden Death (1995)| Action| 18| 3.0| 856007219| No|
| 10| GoldenEye (1995)|Action|Adventure|...| 2| 4.0| 835355493| No|
| 11|American Presiden...|Comedy|Drama|Romance| 15| 2.5|1093028381| Yes|
| 12|Dracula: Dead and...| Comedy|Horror| 67| 3.0| 854711916| Yes|


| 13| Balto (1995)|Adventure|Animati...|182| 3.0| 845745917| No|


| 14| Nixon (1995)| Drama| 15| 2.5|1166586286| No|
| 15|Cutthroat Island ...|Action|Adventure|...| 73| 2.5|1255593501| No|
| 16| Casino (1995)| Crime|Drama| 15| 3.5|1093070150| No|
| 17|Sense and Sensibi...| Drama|Romance| 2| 5.0| 835355681| No|
| 18| Four Rooms (1995)| Comedy| 18| 3.0| 856007359| Yes|

moviesDF.withColumn('sensitive', when(col('genres').contains('Crime'), lit('Yes'))\


.when(col('genres').contains('Drama'), lit('Yes'))\
.when(col('genres').contains('Horror'), lit('Yes'))\
.otherwise(lit('No'))).show()

+---+--------------------+--------------------+---+------+----------+---------+
| id| name| genres| na|rating| views|sensitive|
+---+--------------------+--------------------+---+------+----------+---------+
| 1| Toy Story (1995)|Adventure|Animati...| 7| 3.0| 851866703| No|
| 2| Jumanji (1995)|Adventure|Childre...| 15| 2.0|1134521380| No|
| 3|Grumpier Old Men ...| Comedy|Romance| 5| 4.0|1163374957| No|
| 4|Waiting to Exhale...|Comedy|Drama|Romance| 19| 3.0| 855192868| Yes|
| 5|Father of the Bri...| Comedy| 15| 4.5|1093070098| No|
| 6| Heat (1995)|Action|Crime|Thri...| 15| 4.0|1040205753| Yes|
| 7| Sabrina (1995)| Comedy|Romance| 18| 3.0| 856006982| No|
| 8| Tom and Huck (1995)| Adventure|Children| 30| 4.0| 968786809| No|
| 9| Sudden Death (1995)| Action| 18| 3.0| 856007219| No|
| 10| GoldenEye (1995)|Action|Adventure|...| 2| 4.0| 835355493| No|
| 11|American Presiden...|Comedy|Drama|Romance| 15| 2.5|1093028381| Yes|
| 12|Dracula: Dead and...| Comedy|Horror| 67| 3.0| 854711916| Yes|
| 13| Balto (1995)|Adventure|Animati...|182| 3.0| 845745917| No|
| 14| Nixon (1995)| Drama| 15| 2.5|1166586286| Yes|
| 15|Cutthroat Island ...|Action|Adventure|...| 73| 2.5|1255593501| No|
| 16| Casino (1995)| Crime|Drama| 15| 3.5|1093070150| Yes|
| 17|Sense and Sensibi...| Drama|Romance| 2| 5.0| 835355681| Yes|
| 18| Four Rooms (1995)| Comedy| 18| 3.0| 856007359| No|

ABD07 Spark SQL Exercises

Exercise 1:
Check your Spark Session and if you have tables registered

spark

SparkSession - hive

SparkContext

Spark UI

Version
v3.3.1
Master
local[8]
AppName
Databricks Shell

# Check if you have tables registered


spark.sql("show tables").show()
#spark.catalog.listTables()

+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| |temptable| true|
+--------+---------+-----------+

Exercise 2:
Load the movielens (or movielensABD) file to your environment

Create a Dataframe based on the movielens file


moviesDF = spark.read.csv("dbfs:/FileStore/tables/movielens.txt", header="false", sep=";", inferSchema="true") \


.toDF("id", "name", "genres", "na", "rating", "views")
display(moviesDF)

Table
 
  id name genres
1 1 Toy Story (1995) Adventure|Animation|Chil
2 2 Jumanji (1995) Adventure|Children|Fantas
3 3 Grumpier Old Men (1995) Comedy|Romance
4 4 Waiting to Exhale (1995) Comedy|Drama|Romance
5 5 Father of the Bride Part II (1995) Comedy
6 6 Heat (1995) Action|Crime|Thriller
7 7 Sabrina (1995) Comedy|Romance
Truncated results, showing first 1,000 rows.

Exercise 3:
Show the data about 5 movies that have a rating higher than 2.5

# Where and Filter work the same way


moviesDF.select("name", "genres", "rating", "views").where("rating > 2.5").limit(5).show()
moviesDF.select("name", "genres", "rating", "views").where(moviesDF['rating'] > 2.5).show(5)

moviesDF.createOrReplaceTempView("movies")
#saveAstable
spark.sql("select name, genres, rating, views from movies where rating > 2.5 limit 5").show()

+--------------------+--------------------+------+----------+
| name| genres|rating| views|
+--------------------+--------------------+------+----------+
| Toy Story (1995)|Adventure|Animati...| 3.0| 851866703|
|Grumpier Old Men ...| Comedy|Romance| 4.0|1163374957|
|Waiting to Exhale...|Comedy|Drama|Romance| 3.0| 855192868|
|Father of the Bri...| Comedy| 4.5|1093070098|
| Heat (1995)|Action|Crime|Thri...| 4.0|1040205753|
+--------------------+--------------------+------+----------+

+--------------------+--------------------+------+----------+
| name| genres|rating| views|
+--------------------+--------------------+------+----------+
| Toy Story (1995)|Adventure|Animati...| 3.0| 851866703|
|Grumpier Old Men ...| Comedy|Romance| 4.0|1163374957|
|Waiting to Exhale...|Comedy|Drama|Romance| 3.0| 855192868|
|Father of the Bri...| Comedy| 4.5|1093070098|
| Heat (1995)|Action|Crime|Thri...| 4.0|1040205753|
+--------------------+--------------------+------+----------+
only showing top 5 rows

Exercise 4:
What different genres are there?

genresDF = moviesDF.select("genres").distinct()
display(genresDF)

#genresDF = spark.sql("select distinct(genres) from movies")


#display(genresDF)

Table

  genres
1 Comedy|Horror|Thriller
2 Adventure|Sci-Fi|Thriller
3 Action|Adventure|Drama|Fantasy
5 Comedy|Drama|Horror|Thriller
6 Action|Animation|Comedy|Sci-Fi
7 Animation|Children|Drama|Musical|Romance
8 Action|Adventure|Drama
9 Adventure|Animation
10 Adventure|Sci-Fi
11 Documentary|Musical|IMAX
12 Adventure|Children|Fantasy|Sci-Fi|Thriller
13 Documentary|Sci-Fi
14 Musical|Romance|War
15 Action|Adventure|Fantasy|Romance
16 Adventure|Children|Drama|Fantasy|IMAX
17 Crime|Drama|Fantasy|Horror|Thriller
Showing all 901 rows.

Exercise 5:
How many different genres are there?

genresDF.count()
#genresDF.createOrReplaceTempView("genres")
#spark.sql("select count(*) as count from genres").show()

Out[347]: 901

Exercise 6:
What is the average rating of the movies?

#import pyspark.sql.functions

from pyspark.sql.functions import avg

moviesDF.select(avg("rating")).show()

#spark.sql("select avg(rating) from movies").show()

+-----------------+
| avg(rating)|
+-----------------+
|3.223527465254798|
+-----------------+

What is the highest rating and the highest rating for each genres?

moviesDF.groupby().max("rating").show(5)

+-----------+
|max(rating)|
+-----------+
| 5.0|
+-----------+

moviesDF.groupby("genres").max("rating").alias("Max_Rating").show(5)

+--------------------+-----------+
| genres|max(rating)|
+--------------------+-----------+
|Comedy|Horror|Thr...| 5.0|
|Adventure|Sci-Fi|...| 2.0|
|Action|Adventure|...| 2.5|
| Action|Drama|Horror| 4.5|
|Comedy|Drama|Horr...| 2.5|
+--------------------+-----------+
only showing top 5 rows

Exercise 7:
Which movies have a rating greater than 3 and an even id number?

moviesDF.select("id","name").filter("(rating > 3) and (id%2 == 0)").show()


#spark.sql("select id, name from movies where rating > 3 and id%2 == 0").show()

+---+--------------------+
| id| name|
+---+--------------------+
| 6| Heat (1995)|
| 8| Tom and Huck (1995)|
| 10| GoldenEye (1995)|
| 16| Casino (1995)|
| 24| Powder (1995)|
| 28| Persuasion (1995)|
| 30|Shanghai Triad (Y...|
| 32|Twelve Monkeys (a...|
| 34| Babe (1995)|
| 36|Dead Man Walking ...|
| 40|Cry, the Beloved ...|
| 50|Usual Suspects, T...|
| 54|Big Green, The (1...|
| 68|French Twist (Gaz...|
| 72|Kicking and Screa...|
| 80|White Balloon, Th...|
| 82|Antonia's Line (A...|
| 84|Last Summer in th...|

Exercise 8:
Show the average rating per genre

genresAvgDF = moviesDF.select("genres","rating").groupBy("genres").avg("rating").toDF("genres","avg_rating")
#genresAvgDF = moviesDF.select("genres","rating").groupBy("genres").agg({"rating": "avg"}).toDF("genres","avg_rating")
#genresAvgDF = moviesDF.select("genres","rating").groupBy("genres").agg(avg("rating")).toDF("genres","avg_rating")
genresAvgDF.display()

#spark.sql("select genres, avg(rating) as avg_rating from movies group by genres").show()

Table
 
  genres avg_rating
1 Comedy|Horror|Thriller 2.9615384615384617
2 Adventure|Sci-Fi|Thriller 1.1666666666666667
3 Action|Adventure|Drama|Fantasy 1.3
4 Action|Drama|Horror 4.5
5 Comedy|Drama|Horror|Thriller 2.5
6 Action|Animation|Comedy|Sci-Fi 3.5
7 Animation|Children|Drama|Musical|Romance 4
Showing all 901 rows.

Exercise 9:
What genres have the highest average rating?

genresAvgDF.orderBy("avg_rating",ascending=False).show()

#genresAvgDF.createOrReplaceTempView("genres_avg_rating")
#spark.sql("select * from genres_avg_rating order by avg_rating desc").show()

+--------------------+----------+
| genres|avg_rating|
+--------------------+----------+
| Adventure|Thriller| 5.0|
|Crime|Fantasy|Horror| 5.0|
|Adventure|Comedy|...| 5.0|
|Adventure|Animati...| 5.0|
|Animation|Comedy|...| 5.0|
|Comedy|Drama|Fant...| 5.0|
|Action|Comedy|Fan...| 5.0|
|Animation|Comedy|...| 5.0|

| Drama|Mystery|War| 5.0|
|Children|Comedy|M...| 5.0|
|Action|Adventure|...| 5.0|
|Action|Comedy|Dra...| 5.0|
|Action|Fantasy|Ho...| 5.0|
|Children|Drama|Sc...| 5.0|
| Children|Drama|War| 5.0|
|Adventure|Documen...| 5.0|
| Mystery| 5.0|
| i | d | | |

Exercise 10:
How many "Comedy" genre movies have a rating of 3.0?

moviesDF.select("name","genres","rating") \
.where("genres like '%Comedy%' and rating == 3") \
.select("name") \
.display()

#spark.sql("select name from movies where genres like '%Comedy%' and rating == 3").show()

Table

  name
1 Toy Story (1995)
2 Waiting to Exhale (1995)
3 Sabrina (1995)
4 Dracula: Dead and Loving It (1995)
5 Four Rooms (1995)
6 Get Shorty (1995)
7 Mighty Aphrodite (1995)
Showing all 677 rows.

Exercise 11:
11.1 - Load the managers and teams csv.

11.2 - Perform a query that will return a dataframe containing the team name, its managerID, their number of wins (W)
and number of losses (L)

teamsDF = spark.read.csv("dbfs:/FileStore/tables/Teams.csv", header="true", sep=",", inferSchema="true")


managersDF = spark.read.csv("dbfs:/FileStore/tables/Managers.csv", header="true", sep=",", inferSchema="true")
teamsDF.createOrReplaceTempView("teams")
managersDF.createOrReplaceTempView("managers")
#display(teamsDF)
#display(managersDF)

teamsDF.select("yearID","teamID","name","W","L") \
.join(managersDF.select("teamID","yearID","managerID"), ["teamID", "yearID"]) \
.select("yearID","name", "managerID", "W", "L") \
.toDF("Year", "Name", "Manager", "Wins", "Losses") \
.show()

#spark.sql("select t.yearID as Year, \


# t.name as Name, \
# m.managerID as Manager, \
# t.W as Wins, \
# t.L as Losses \
# from teams t, \
# managers m \
# where m.teamID == t.teamID and \
# m.yearID == t.yearID").show()

+----+--------------------+----------+----+------+
|Year| Name| Manager|Wins|Losses|
+----+--------------------+----------+----+------+
|1871|Boston Red Stockings|wrighha01m| 20| 10|

|1871|Chicago White Sto...| woodji01m| 19| 9|


|1871|Cleveland Forest ...|paborch01m| 10| 19|
|1871|Fort Wayne Kekiongas|deaneha01m| 7| 12|
|1871|Fort Wayne Kekiongas|lennobi01m| 7| 12|
|1871| New York Mutuals|fergubo01m| 16| 17|
|1871|Philadelphia Athl...|mcbridi01m| 21| 7|
|1871|Rockford Forest C...|hastisc01m| 4| 21|
|1871| Troy Haymakers|cravebi01m| 13| 15|
|1871| Troy Haymakers| pikeli01m| 13| 15|
|1871| Washington Olympics|youngni99m| 15| 15|
|1872| Baltimore Canaries|millsev01m| 35| 19|
|1872| Baltimore Canaries|cravebi01m| 35| 19|
|1872| Brooklyn Eckfords| woodji01m| 3| 26|
|1872| Brooklyn Eckfords|clintji01m| 3| 26|
|1872| Brooklyn Atlantics|fergubo01m| 9| 28|
|1872|Boston Red Stockings|wrighha01m| 39| 8|

To push our minds a bit - an additional exercise

Exercise ACE:

Which single genre (i.e., only one genre, no |) has the highest average rating?

from pyspark.sql.functions import split


from pyspark.sql.functions import explode
moviesDF.select(explode(split(moviesDF["genres"],"[|]")).alias("genre"), "rating") \
.groupBy("genre") \
.avg("rating") \
.orderBy("avg(rating)" ,ascending=False) \
.show()

#moviesDF.select(explode(split(moviesDF["genres"],"[|]")).alias("genre"), "rating") \
# .groupBy("genre") \
# .agg({"rating": "avg"}) \
# .orderBy("avg(rating)" ,ascending=False) \
# .show()

+------------------+------------------+
| genre| avg(rating)|
+------------------+------------------+
|(no genres listed)| 3.735294117647059|
| Documentary| 3.641683778234086|
| Film-Noir| 3.549586776859504|
| War| 3.441256830601093|
| Western|3.4107142857142856|
| Drama| 3.383895563770795|
| Musical|3.3553299492385786|
| Animation| 3.335570469798658|
| Romance|3.2868267358857883|
| Mystery|3.2309124767225326|
| Crime| 3.208791208791209|
| Children|3.1348797250859106|
| Comedy|3.1303296038705777|
| Adventure| 3.121863799283154|
| Thriller| 3.053290623179965|
| Sci-Fi|3.0214917825537295|
| Fantasy|3.0191424196018377|
| Action|2.9504212572909916|

ABD09 - Spark Streaming (ABD)

Clean the (possible) old files and folders from previous processes

%ls -la /temp-files/

ls: cannot access '/temp-files/': No such file or directory

Clean the tags files

%sh rm -r /temp-files/*

rm: cannot remove '/temp-files/*': No such file or directory

%sh rmdir /temp-files

rmdir: failed to remove '/temp-files': No such file or directory

# Clean the Checkpoint folder


dbutils.fs.rm("/cp", True)

Out[362]: True

# Clean the Stream Output Folder (the parquet file)


dbutils.fs.rm("/Stream", True)

Out[363]: False

Starting the process

%sh rm -r /temp-files/*

rm: cannot remove '/temp-files/*': No such file or directory

%sh mkdir /temp-files

%sh ls -la /temp-files

total 8
drwxr-xr-x 2 root root 4096 Jan 5 14:15 .
drwxr-xr-x 1 root root 4096 Jan 5 14:15 ..

# Remember that with this python process below you will be working on the local (driver node) file system, not on DBFS
%pwd

Out[367]: '/databricks/driver'

# Create one file in our temp-files


from pyspark.sql.functions import *
import time
import datetime
file_path="/temp-files/"

# Create tags with a 'Yes' flag


date_time = datetime.datetime.now()
date_tag = date_time.strftime("%Y-%b-%d_%H-%M-%S")
file_name = file_path + "file_" + date_tag

f= open(file_name,"w+")
line = date_tag + ';' + 'ABD2022' + ';' + 'Yes'
f.write(line)
f.close()

# Create tags with a 'No' flag


date_time = datetime.datetime.now()
date_tag = date_time.strftime("%Y-%b-%d_%H-%M-%S")
file_name = file_path + "file_" + date_tag

f= open(file_name,"w+")
line = date_tag + ';' + 'ABD2022' + ';' + 'No'
f.write(line)
f.close()

%sh ls -la /temp-files

total 16
drwxr-xr-x 2 root root 4096 Jan 5 14:15 .
drwxr-xr-x 1 root root 4096 Jan 5 14:15 ..

-rw-r--r-- 1 root root 32 Jan 5 14:15 file_2023-Jan-05_14-15-32


-rw-r--r-- 1 root root 31 Jan 5 14:15 file_2023-Jan-05_14-15-39

#%sh rm -r /temp-files/*

# Checking the content of a tag file


# spark.read.text("file:/temp-files/....").collect()
spark.read.text("file:/temp-files/file_2023-Jan-05_14-15-32").display()

Table

  value
1 2023-Jan-05_14-15-32;ABD2022;Yes

Showing 1 row.

# Checking the code (if needed) to transform one text line of the tag file into a 3-column DF ('TimeStamp', 'Device', 'Flag')
#w = spark.read.text("file:/temp-files/file_2022-Nov-05_15-10-43")
#w = w.withColumn('TimeStamp', split('value', ';').getItem(0)) \
# .withColumn('Device', split('value', ';').getItem(1)) \
# .withColumn('Flag', split('value', ';').getItem(2)) \
# .withColumn('Tags_count', lit('1')) \
# .drop('value')
#w.display()

help(display)

Help on method display in module dbruntime.display:

display(input=None, *args, **kwargs) method of dbruntime.display.Display instance


Display plots or data.

Display plot:
- display() # no-op
- display(matplotlib.figure.Figure)

Display dataset:
- display(spark.DataFrame)
- display(list) # if list can be converted to DataFrame, e.g., list of named tuples
- display(pandas.DataFrame)
- display(koalas.DataFrame)
- display(pyspark.pandas.DataFrame)

Display any other value that has a _repr_html_() method

For Spark 2.0 and 2.1:


- display(DataFrame, streamName='optional', trigger=optional pyspark.sql.streaming.Trigger,
checkpointLocation='optional')

Load with standard structured streaming

# Define the reader for the tag's files


strmDF = (
spark
.readStream
.text("file:/temp-files/")
)

# Include transformations here (if needed)


strmDF = strmDF.withColumn('TimeStamp', split('value', ';').getItem(0)) \
.withColumn('Device', split('value', ';').getItem(1)) \
.withColumn('Flag', split('value', ';').getItem(2)) \
.drop('value')
strmDF = strmDF.where("Flag == 'Yes'")

# Write the tags (strmDF) in a parquet format


ABD_query = ( strmDF
.writeStream
.format("parquet")
.queryName("Devices_Yes")
.option("checkpointLocation","/cp/fileStream/" )
.start("/Stream/Stream_Out/")
)

 Devices_Yes (id: 8957059f-c44f-4724-b2bf-140c1decd6e8) Last updated: 1 day ago

# Display strmDF to see the content in real-time


display(strmDF)

 display_query_1 (id: bcce8459-25ce-4b0b-8f7a-14a16f2fa6be) Last updated: 1 day ago

Table
  
  TimeStamp Device Flag
1 2023-Jan-05_14-15-32 ABD2022 Yes

Showing 1 row.

# Show the final output of the tags file in a static DF


# df = spark.read.parquet("/Stream/Stream_Out/"))
# df.show()
display(spark.read.parquet("/Stream/Stream_Out/"))

Table
  
  TimeStamp Device Flag
1 2023-Jan-05_14-15-32 ABD2022 Yes

Showing 1 row.

dbutils.fs.ls("/Stream/Stream_Out/")

Out[382]: [FileInfo(path='dbfs:/Stream/Stream_Out/_spark_metadata/', name='_spark_metadata/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/Stream/Stream_Out/part-00000-3886032d-9ee2-4e3b-bcfb-4bfb86c28fa3-c000.snappy.parquet', name='part-00000-3886032d-9ee2-4e3b-bcfb-4bfb86c28fa3-c000.snappy.parquet', size=1204, modificationTime=1672928212000)]

Remember to stop all the streaming queries

ABD_query.status

Out[383]: {'message': 'Waiting for next trigger',


'isDataAvailable': False,
'isTriggerActive': False}

ABD_query.stop()

ABD_query.status

Out[385]: {'message': 'Stopped', 'isDataAvailable': False, 'isTriggerActive': False}
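
If several streaming queries were started in the same session, a small loop over spark.streams.active can stop them all. A minimal sketch (not one of the cells above):

# Sketch: stop every streaming query that is still active in this Spark session
for q in spark.streams.active:
    print("Stopping query:", q.name)
    q.stop()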

# Clean the Stream Output Folder (the parquet file)


dbutils.fs.rm("/Stream", True)

Out[386]: True

ABD09 - Spark Streaming (DB)


Databricks Spark Structured Streaming
In Structured Streaming the data stream is processed as continuous updates to a table.

1. create the input data/streaming table;

2. define the computation on the input table to a results table (as if it were a static table);

3. write the results table to an output sink.

Triggers
Developers define triggers to control how frequently the input table is updated.

Each time a trigger fires, Spark checks for new data (new rows for the input table), and updates the result.

From the docs for DataStreamWriter.trigger(Trigger):

The default value is ProcessingTime(0) and it will run the query as fast as possible.

And the process repeats in perpetuity.

The trigger specifies when the system should process the next set of data.

Trigger Type Example Notes

Unspecified DEFAULT - The query will be executed as soon as the system has completed processing the previous query

Fixed interval micro-batches .trigger(Trigger.ProcessingTime("6 hours")) The query will be executed in micro-batches and kicked off at the user-specified intervals

One-time micro-batch .trigger(Trigger.Once()) The query will execute only one micro-batch to process all the available data and then stop on its own

Continuous w/fixed checkpoint interval .trigger(Trigger.Continuous("1 second")) The query will be executed in a low-latency, continuous processing mode (http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#continuous-processing). EXPERIMENTAL in 2.3.2

In the example below, you will be using a fixed interval of 3 seconds:

.trigger(Trigger.ProcessingTime("3 seconds"))
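
The Trigger.* notation above comes from the Scala API; in PySpark the same choices are expressed as keyword arguments of DataStreamWriter.trigger. A minimal sketch, where dsw stands for any DataStreamWriter (an assumption, as in the tables further down):

# PySpark equivalents of the trigger types above (dsw is a DataStreamWriter)
dsw = dsw.trigger(processingTime="3 seconds")    # fixed-interval micro-batches
# dsw = dsw.trigger(once=True)                   # one-time micro-batch
# dsw = dsw.trigger(continuous="1 second")       # continuous processing (experimental)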

Checkpointing
A checkpoint stores the current state of your streaming job to a reliable storage system such as Azure Blob Storage or HDFS. It
does not store the state of your streaming job to the local file system of any node in your cluster.

Together with write ahead logs, a terminated stream can be restarted and it will continue from where it left off.

To enable this feature, you only need to specify the location of a checkpoint directory:

.option("checkpointLocation", checkpointPath)

Output Modes
Mode Example Notes

Complete .outputMode("complete") The entire updated Result Table is written to the sink. The individual sink implementation decides how to handle writing the entire table.

Append .outputMode("append") Only the new rows appended to the Result Table since the last trigger are written to the sink.

Update .outputMode("update") Only the rows in the Result Table that were updated since the last trigger will be outputted to the sink. Since Spark 2.1.1

In the example below, we are writing to a Parquet directory which only supports the append mode:

dsw.outputMode("append")

Output Sinks
DataStreamWriter.format accepts the following values, among others:

Output Sink Example Notes

File dsw.format("parquet"), dsw.format("csv"), ... Dumps the Result Table to a file. Supports Parquet, json, csv, etc.

Kafka dsw.format("kafka") Writes the output to one or more topics in Kafka

Console dsw.format("console") Prints data to the console (useful for debugging)

Memory dsw.format("memory") Updates an in-memory table, which can be queried through Spark SQL or the DataFrame API

foreach dsw.foreach(writer: ForeachWriter) This is your "escape hatch", allowing you to write your own type of sink.

Delta dsw.format("delta") A proprietary sink

In the example below, we will be appending files to a Parquet directory and specifying its location with this call:

.format("parquet").start(outputPathDir)

Things to consider on the code examples (a combined sketch follows this list):


0. We are giving the query a name via the call to .queryName
1. Spark begins running jobs once we call .start
2. The call to .start returns a StreamingQuery object
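
Putting the pieces above together, a minimal end-to-end sketch (the query name, trigger, checkpoint and output paths are illustrative assumptions; jsonSchema is the schema defined later in this notebook):

# Illustrative sketch only - not one of the queries used later in this notebook
inputStreamDF = (spark.readStream
  .schema(jsonSchema)                              # assumes jsonSchema is already defined
  .json("/databricks-datasets/structured-streaming/events/"))

exampleQuery = (inputStreamDF.writeStream
  .queryName("example_query")                      # name shown in the Spark UI
  .trigger(processingTime="3 seconds")             # fixed-interval micro-batches
  .outputMode("append")                            # the parquet sink only supports append
  .format("parquet")
  .option("checkpointLocation", "/cp/example/")    # assumed checkpoint path
  .start("/Stream/Example_Out/"))                  # assumed output path; returns a StreamingQuery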

Databricks Structured Streaming using Python DataFrames API


Apache Spark 2.0 adds the first version of a new higher-level stream processing API, Structured Streaming. In this notebook we
are going to take a quick look at how to use DataFrame API to build Structured Streaming applications. We want to compute
real-time metrics like running counts and windowed counts on a stream of timestamped actions (e.g. Open, Close, etc).

To run this notebook, import it and attach it to a Spark 2.x cluster.

spark

SparkSession - hive

SparkContext

Spark UI

Version
v3.3.1
Master
local[8]
AppName
Databricks Shell

Sample Data
We have some sample action data as files in /databricks-datasets/structured-streaming/events/ which we are going to
use to build this application. Let's take a look at the contents of this directory.

%fs ls /databricks-datasets/structured-streaming/events/

Table
   
  path name size modificationTime
1 dbfs:/databricks-datasets/structured-streaming/events/file-0.json file-0.json 72530 1469673865000
2 dbfs:/databricks-datasets/structured-streaming/events/file-1.json file-1.json 72961 1469673866000
3 dbfs:/databricks-datasets/structured-streaming/events/file-10.json file-10.json 73025 1469673878000
4 dbfs:/databricks-datasets/structured-streaming/events/file-11.json file-11.json 72999 1469673879000
5 dbfs:/databricks-datasets/structured-streaming/events/file-12.json file-12.json 72987 1469673880000
6 dbfs:/databricks-datasets/structured-streaming/events/file-13.json file-13.json 73006 1469673881000
7 dbfs:/databricks-datasets/structured-streaming/events/file-14.json file-14.json 73003 1469673882000
Showing all 50 rows.

There are about 50 JSON files in the directory. Let's see what each JSON file contains.

%fs head /databricks-datasets/structured-streaming/events/file-0.json

[Truncated to first 65536 bytes]


{"time":1469501107,"action":"Open"}
{"time":1469501147,"action":"Open"}
{"time":1469501202,"action":"Open"}
{"time":1469501219,"action":"Open"}
{"time":1469501225,"action":"Open"}
{"time":1469501234,"action":"Open"}
{"time":1469501245,"action":"Open"}
{"time":1469501246,"action":"Open"}
{"time":1469501248,"action":"Open"}
{"time":1469501256,"action":"Open"}
{"time":1469501264,"action":"Open"}
{"time":1469501266,"action":"Open"}
{"time":1469501267,"action":"Open"}
{"time":1469501269,"action":"Open"}
{"time":1469501271,"action":"Open"}
{"time":1469501282,"action":"Open"}
{"time":1469501285,"action":"Open"}
{"time":1469501291,"action":"Open"}
{"time":1469501297,"action":"Open"}
{"time":1469501303,"action":"Open"}

Each line in the file contains a JSON record with two fields - time and action. Let's try to analyze these files interactively.

Batch/Interactive Processing
The usual first step in attempting to process the data is to interactively query the data. Let's define a static DataFrame on the files,
and give it a table name.

from pyspark.sql.types import *

inputPath = "/databricks-datasets/structured-streaming/events/"

# Since we know the data format already, let's define the schema to speed up processing (no need for Spark to infer
schema)
jsonSchema = StructType([ StructField("time", TimestampType(), True), StructField("action", StringType(), True) ])

# Static DataFrame representing data in the JSON files


staticInputDF = (
spark
.read
.schema(jsonSchema)
.json(inputPath)
)

display(staticInputDF)

Table
 
  time action
1 2016-07-28T04:19:28.000+0000 Close
2 2016-07-28T04:19:28.000+0000 Close
3 2016-07-28T04:19:29.000+0000 Open
4 2016-07-28T04:19:31.000+0000 Close
5 2016-07-28T04:19:31.000+0000 Open
6 2016-07-28T04:19:31.000+0000 Open
7 2016-07-28T04:19:32.000+0000 Close
Truncated results, showing first 1,000 rows.

Now we can compute the number of "open" and "close" actions with one hour windows. To do this, we will group by the
action column and 1 hour windows over the time column.

from pyspark.sql.functions import * # for window() function

staticCountsDF = (
staticInputDF
.groupBy(
staticInputDF.action,
window(staticInputDF.time, "1 hour"))
.count()
)

staticCountsDF.cache()

# Register the DataFrame as table 'static_counts'


staticCountsDF.createOrReplaceTempView("static_counts")
display(staticCountsDF)

Table
  
  action window count
1 Close  {"start": "2016-07-26T13:00:00.000+0000", "end": "2016-07-26T14:00:00.000+0000"} 1028
2 Open  {"start": "2016-07-26T18:00:00.000+0000", "end": "2016-07-26T19:00:00.000+0000"} 1004
3 Close  {"start": "2016-07-27T02:00:00.000+0000", "end": "2016-07-27T03:00:00.000+0000"} 971
4 Open  {"start": "2016-07-27T04:00:00.000+0000", "end": "2016-07-27T05:00:00.000+0000"} 995
5 Open  {"start": "2016-07-27T05:00:00.000+0000", "end": "2016-07-27T06:00:00.000+0000"} 986
6 Open  {"start": "2016-07-26T05:00:00.000+0000", "end": "2016-07-26T06:00:00.000+0000"} 1000
7 Open  {"start": "2016-07-26T11:00:00.000+0000", "end": "2016-07-26T12:00:00.000+0000"} 991
Showing all 104 rows.

Now we can directly use SQL to query the table. For example, here are the total counts across all the hours.

%sql select action, sum(count) as total_count from static_counts group by action

Visualization: bar chart of total_count by action (Close, Open).

Showing all 2 rows.

How about a timeline of windowed counts?

%sql select action, date_format(window.end, "MMM-dd HH:mm") as time, count from static_counts order by time, action

Visualization: timeline of count by time and action (Jul-26 03:00 through Jul-27 22:00).

Showing all 104 rows.

Note the two ends of the graph. The close actions are generated such that they are after the corresponding open actions, so
there are more "opens" in the beginning and more "closes" in the end.

Stream Processing
Now that we have analyzed the data interactively, let's convert this to a streaming query that continuously updates as data
comes. Since we just have a static set of files, we are going to emulate a stream from them by reading one file at a time, in the
chronological order they were created. The query we have to write is pretty much the same as the interactive query above.

from pyspark.sql.functions import *

# Similar to definition of staticInputDF above, just using `readStream` instead of `read`


streamingInputDF = (
spark
.readStream
.schema(jsonSchema) # Set the schema of the JSON data
.option("maxFilesPerTrigger", 1) # Treat a sequence of files as a stream by picking one file at a time
.json(inputPath)
)

# Same query as staticInputDF


streamingCountsDF = (
streamingInputDF
.groupBy(
streamingInputDF.action,
window(streamingInputDF.time, "1 hour"))
.count()
)

# Is this DF actually a streaming DF?


streamingCountsDF.isStreaming

Out[392]: True

As you can see, streamingCountsDF is a streaming DataFrame (streamingCountsDF.isStreaming was true). You can start the
streaming computation by defining the sink and starting it. In our case, we want to interactively query the counts (same queries
as above), so we will keep the complete set of 1-hour counts in an in-memory table (note that this is for testing purposes only in
Spark 2.0).

spark.conf.set("spark.sql.shuffle.partitions", "2") # keep the size of shuffles small

query = (
streamingCountsDF
.writeStream
.format("memory") # memory = store in-memory table
.queryName("counts") # counts = name of the in-memory table
.outputMode("complete") # complete = all the counts should be in the table
.start()
)

 counts (id: 72bf3883-e774-45a7-a76f-c45b473c3c24) Last updated: 1 day ago

query is a handle to the streaming query that is running in the background. This query is continuously picking up files and
updating the windowed counts.

Note the status of query in the above cell. The progress bar shows that the query is active. Furthermore, if you expand the
> counts above, you will find the number of files it has already processed.
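
Besides the progress bar, the query handle itself can be inspected. A short sketch using attributes available on a PySpark StreamingQuery:

# Inspecting the StreamingQuery handle returned by .start()
print(query.id)            # unique id of this query run
print(query.status)        # e.g. {'message': 'Waiting for next trigger', ...}
print(query.lastProgress)  # metrics of the most recent micro-batch (None before the first one)
print(query.isActive)      # True while the stream is running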

Let's wait a bit for a few files to be processed and then interactively query the in-memory counts table.

from time import sleep


sleep(5) # wait a bit for computation to start

%sql select action, date_format(window.end, "MMM-dd HH:mm") as time, count from counts order by time, action

Visualization: timeline of count by time and action (Jul-26 03:00 through Jul-26 19:00).

Showing all 40 rows.

We see the timeline of windowed counts (similar to the static one earlier) building up. If we keep running this interactive query
repeatedly, we will see the latest updated counts which the streaming query is updating in the background.

sleep(5) # wait a bit more for more data to be computed

%sql select action, date_format(window.end, "MMM-dd HH:mm") as time, count from counts order by time, action

Visualization: timeline of count by time and action (Jul-26 03:00 through Jul-26 22:00).

Showing all 46 rows.

sleep(5) # wait a bit more for more data to be computed

%sql select action, date_format(window.end, "MMM-dd HH:mm") as time, count from counts order by time, action

Visualization: timeline of count by time and action (Jul-26 03:00 through Jul-27 01:00).

Showing all 52 rows.

Also, let's see the total number of "opens" and "closes".

%sql select action, sum(count) as total_count from counts group by action order by action

Visualization: bar chart of total_count by action (Close, Open).

Showing all 2 rows.

If you keep running the above query repeatedly, you will always find that the number of "opens" is more than the number of
"closes", as expected in a data stream where a "close" always appears after its corresponding "open". This shows that Structured
Streaming ensures prefix integrity. Read the blog posts linked below if you want to know more.

Note that there are only a few files, so once all of them have been consumed there will be no further updates to the counts. Rerun the query if you want
to interact with the streaming query again.

Finally, you can stop the query running in the background, either by clicking on the 'Cancel' link in the cell of the query, or by
executing query.stop() . Either way, when the query is stopped, the status of the corresponding cell above will automatically
update to TERMINATED .

query.stop()

ABD10 Graph Analysis

Databricks GraphFrames User Guide (Python)


This notebook demonstrates examples from the GraphFrames User Guide
(https://graphframes.github.io/graphframes/docs/_site/user-guide.html). The GraphFrames package is available from Spark
Packages (http://spark-packages.org/package/graphframes/graphframes).

NOTE: remember that to run this Notebook you must use a Databricks Runtime version that includes Spark ML

from functools import reduce


from pyspark.sql.functions import col, lit, when
from graphframes import *

Creating GraphFrames
Users can create GraphFrames from vertex and edge DataFrames.
Vertex DataFrame: A vertex DataFrame should contain a special column named "id" which specifies unique IDs for each vertex
in the graph.
Edge DataFrame: An edge DataFrame should contain two special columns: "src" (source vertex ID of edge) and "dst"
(destination vertex ID of edge).

Both DataFrames can have arbitrary other columns. Those columns can represent vertex and edge attributes.

Create the vertices first:

vertices = spark.createDataFrame([
("a", "Alice", 34),
("b", "Bob", 36),
("c", "Charlie", 30),
("d", "David", 29),
("e", "Esther", 32),
("f", "Fanny", 36),
("g", "Gabby", 60)], ["id", "name", "age"])

And then some edges:

edges = spark.createDataFrame([
("a", "b", "friend"),
("b", "c", "follow"),
("c", "b", "follow"),
("f", "c", "follow"),
("e", "f", "follow"),
("e", "d", "friend"),
("d", "a", "friend"),
("a", "e", "friend")
], ["src", "dst", "relationship"])

Let's create a graph from these vertices and these edges:

g = GraphFrame(vertices, edges)
display(g)

/databricks/spark/python/pyspark/sql/dataframe.py:150: UserWarning: DataFrame.sql_ctx is an internal property, and will be removed in future releases. Use DataFrame.sparkSession instead.
  warnings.warn(
GraphFrame(v:[id: string, name: string ... 1 more field], e:[src: string, dst: string ... 1 more field])

# This example graph also comes with the GraphFrames package


from graphframes.examples import Graphs
same_g = Graphs(sqlContext).friends()
print(same_g)

GraphFrame(v:[id: string, name: string ... 1 more field], e:[src: string, dst: string ... 1 more field])

Basic graph and DataFrame queries


GraphFrames provide several simple graph queries, such as node degree.

Also, since GraphFrames represent graphs as pairs of vertex and edge DataFrames, it is easy to make powerful queries directly on
the vertex and edge DataFrames. Those DataFrames are made available as vertices and edges fields in the GraphFrame.

display(g.vertices)

Table
  
  id name age
1 a Alice 34
2 b Bob 36
3 c Charlie 30
4 d David 29
5 e Esther 32
6 f Fanny 36
7 g Gabby 60
Showing all 7 rows.

display(g.edges)

Table
  
  src dst relationship
1 a b friend
2 b c follow
3 c b follow
4 f c follow
5 e f follow
6 e d friend
7 d a friend
Showing all 8 rows.

The incoming degree of the vertices:

display(g.inDegrees)

/databricks/spark/python/pyspark/sql/dataframe.py:129: UserWarning: DataFrame constructor is internal. Do not directly use it.
  warnings.warn("DataFrame constructor is internal. Do not directly use it.")

Table
 
  id inDegree
1 b 2
2 c 2
3 f 1
4 d 1
5 a 1
6 e 1

Showing all 6 rows.

The outgoing degree of the vertices:

display(g.outDegrees)

Table
 
  id outDegree
1 a 2
2 b 1
3 c 1
4 f 1
5 e 2
6 d 1

Showing all 6 rows.

The degree of the vertices:

# in and out degrees


display(g.degrees)

Table
 
  id degree
1 a 3
2 b 3
3 c 3
4 f 2
5 e 3
6 d 2

Showing all 6 rows.

You can run queries directly on the vertices DataFrame. For example, we can find the age of the youngest person in the graph:

youngest = g.vertices.groupBy().min("age")
#oldest = g.vertices.groupBy().max("age")
display(youngest)
#display(oldest)

Table

  min(age)
1 29

Showing 1 row.

Likewise, you can run queries on the edges DataFrame. For example, let's count the number of 'follow' relationships in the graph:

relationships = g.edges.groupBy("relationship").count()
display(relationships)

Table
 
  relationship count
1 friend 4
2 follow 4

Showing all 2 rows.

numFollows = g.edges.filter("relationship = 'follow'").count()


print("The number of follow edges is", numFollows)

The number of follow edges is 4

Follows = g.edges.filter("relationship = 'follow'")


Follows.show()

+---+---+------------+
|src|dst|relationship|
+---+---+------------+
| b| c| follow|
| c| b| follow|
| f| c| follow|
| e| f| follow|
+---+---+------------+

Friend = g.edges.filter("relationship = 'friend'")


Friend.show()

+---+---+------------+
|src|dst|relationship|
+---+---+------------+
| a| b| friend|
| e| d| friend|
| d| a| friend|
| a| e| friend|
+---+---+------------+

Motif finding
Using motifs you can build more complex relationships involving edges and vertices. The following cell finds the pairs of vertices
with edges in both directions between them. The result is a DataFrame, in which the column names are given by the motif keys.

Check out the GraphFrame User Guide (http://graphframes.github.io/user-guide.html#motif-finding) for more details on the API.

# Do a test with filter and then motif find just to keep the links of type "friend"
# Also do a test with the filter condition after the find()
# col(edge)["relationship"]
#motifs = g.filter("relationship = 'friend'").find("(x)-[e1]->(y); (y)-[e2]->(x)")
#motifs = g.find("(x)-[e1]->(y); (y)-[e2]->(x)").filter(e1['relationship'] == "friend")
#display(motifsw)

# Search for pairs of vertices with edges in both directions between them.
motifs = g.find("(x)-[e1]->(y); (y)-[e2]->(x)")
display(motifs)

Table

  x e1 y e2
1 {"id": "b", "name": "Bob", "age": 36} {"src": "b", "dst": "c", "relationship": "follow"} {"id": "c", "name": "Charlie", "age": 30} {"src": "c", "dst": "b", "relationship": "follow"}
2 {"id": "c", "name": "Charlie", "age": 30} {"src": "c", "dst": "b", "relationship": "follow"} {"id": "b", "name": "Bob", "age": 36} {"src": "b", "dst": "c", "relationship": "follow"}

Showing all 2 rows.

filtered = motifs.filter("x.age > 30 or y.age > 30")


display(filtered)

Table

  x e1 y e2
1 {"id": "b", "name": "Bob", "age": 36} {"src": "b", "dst": "c", "relationship": "follow"} {"id": "c", "name": "Charlie", "age": 30} {"src": "c", "dst": "b", "relationship": "follow"}
2 {"id": "c", "name": "Charlie", "age": 30} {"src": "c", "dst": "b", "relationship": "follow"} {"id": "b", "name": "Bob", "age": 36} {"src": "b", "dst": "c", "relationship": "follow"}

Showing all 2 rows.

# Test multiple conditions


motifs1 = g.find("(a)-[e1]->(b); (b)-[e2]->(c); (c)-[e3]->(a) ")
display(motifs1)

Table

  a e1 b e2
1 {"id": "a", "name": "Alice", "age": 34} {"src": "a", "dst": "e", "relationship": "friend"} {"id": "e", "name": "Esther", "age": 32} {"src": "e", "dst": "d", "relationship": "friend"}
2 {"id": "d", "name": "David", "age": 29} {"src": "d", "dst": "a", "relationship": "friend"} {"id": "a", "name": "Alice", "age": 34} {"src": "a", "dst": "e", "relationship": "friend"}
3 {"id": "e", "name": "Esther", "age": 32} {"src": "e", "dst": "d", "relationship": "friend"} {"id": "d", "name": "David", "age": 29} {"src": "d", "dst": "a", "relationship": "friend"}

Showing all 3 rows.

# Do more conditions
# A vertex connected to another vertex (the edge doesn't matter)
motifs2 = g.find("(a)-[]->(b)")
display(motifs2)

Table

  a b
1 {"id": "a", "name": "Alice", "age": 34} {"id": "b", "name": "Bob", "age": 36}
2 {"id": "b", "name": "Bob", "age": 36} {"id": "c", "name": "Charlie", "age": 30}
3 {"id": "c", "name": "Charlie", "age": 30} {"id": "b", "name": "Bob", "age": 36}
4 {"id": "f", "name": "Fanny", "age": 36} {"id": "c", "name": "Charlie", "age": 30}
5 {"id": "e", "name": "Esther", "age": 32} {"id": "f", "name": "Fanny", "age": 36}
Showing all 8 rows.

# Do more conditions
# Do we have a vertex connected to itself without intermediate vertices?
motifs3 = g.find("(a)-[]->(a)")
display(motifs3)

Query returned no results

# Do more conditions
# A vertex connected to itself through an intermediate vertex
motifs4 = g.find("(a)-[]->(b);(b)-[]->(a)")
display(motifs4)

Table

  a b
1 {"id": "b", "name": "Bob", "age": 36} {"id": "c", "name": "Charlie", "age": 30}
2 {"id": "c", "name": "Charlie", "age": 30} {"id": "b", "name": "Bob", "age": 36}

Showing all 2 rows.

Since the result is a DataFrame, more complex queries can be built on top of the motif. Let us find all the reciprocal relationships
in which one person is older than 30:

# Search for pairs of vertices with edges in both directions between them.
motifs5 = g.find("(a)-[e]->(b); (b)-[e2]->(a)")
display(motifs5)

Table

  a e b e2
1 {"id": "b", "name": "Bob", "age": 36} {"src": "b", "dst": "c", "relationship": "follow"} {"id": "c", "name": "Charlie", "age": 30} {"src": "c", "dst": "b", "relationship": "follow"}
2 {"id": "c", "name": "Charlie", "age": 30} {"src": "c", "dst": "b", "relationship": "follow"} {"id": "b", "name": "Bob", "age": 36} {"src": "b", "dst": "c", "relationship": "follow"}

Showing all 2 rows.

Subgraphs
GraphFrames provides APIs for building subgraphs by filtering on edges and vertices. These filters can be composed together, for
example the following subgraph only includes people who are more than 30 years old and have friends who are more than 30
years old.

g2 = g.filterEdges("relationship = 'friend'").filterVertices("age > 30").dropIsolatedVertices()

display(g2.vertices)

Table
  
  id name age
1 a Alice 34
2 b Bob 36
3 e Esther 32

Showing all 3 rows.

display(g2.edges)

Table
  
  src dst relationship
1 a b friend
2 a e friend

Showing all 2 rows.

Standard graph algorithms


GraphFrames comes with a number of standard graph algorithms built in:
Breadth-first search (BFS)
Connected components
Strongly connected components
Label Propagation Algorithm (LPA)
PageRank (regular and personalized)
Shortest paths
Triangle count

Breadth-first search (BFS)


BFS finds the shortest path(s) from one vertex to another vertex

# Search from "Alice" to "Charlie".


paths1 = g.bfs(" name = 'Alice' ", "name = 'Charlie' ")
display(paths1)

Table

  from e0 v1 e1
1 {"id": "a", "name": "Alice", "age": 34} {"src": "a", "dst": "b", "relationship": "friend"} {"id": "b", "name": "Bob", "age": 36} {"src": "b", "dst": "c", "relationship": "follow"}

Showing 1 row.

# Search from "Esther" for users of age < 32.


paths2 = g.bfs("name = 'Esther'", "age < 32")
display(paths2)

Table

  from e0 to
1 {"id": "e", "name": "Esther", "age": 32} {"src": "e", "dst": "d", "relationship": "friend"} {"id": "d", "name": "David", "age": 29}

Showing 1 row.

The search may also be limited by edge filters and maximum path lengths.

filteredPaths = g.bfs(
fromExpr = "name = 'Esther'",
toExpr = "age < 32",
edgeFilter = "relationship != 'friend'",
maxPathLength = 3)
display(filteredPaths)

Table

  from e0 v1 e1
1 {"id": "e", "name": "Esther", "age": 32} {"src": "e", "dst": "f", "relationship": "follow"} {"id": "f", "name": "Fanny", "age": 36} {"src": "f", "dst": "c", "relationship": "follow"}

Showing 1 row.

Connected components
Compute the connected component membership of each vertex and return a DataFrame with each vertex assigned a component
ID. The GraphFrames connected components implementation can take advantage of checkpointing to improve performance.

# Be prepared, this example may take +10m to run (don't do it during the class)
# Check the results against the graph's elements to see that g is the only vertex that ends up in its own component
sc.setCheckpointDir("/tmp/graphframes-example-connected-components")
result = g.connectedComponents()
display(result)

Table
   
  id name age component
1 a Alice 34 0
2 c Charlie 30 0
3 d David 29 0
4 e Esther 32 0
5 f Fanny 36 0
6 b Bob 36 0
7 g Gabby 60 8589934593
Showing all 7 rows.

Strongly connected components


Compute the strongly connected component (SCC) of each vertex and return a DataFrame with each vertex assigned to the SCC
containing that vertex.

result = g.stronglyConnectedComponents(maxIter=5)
display(result.select("id", "component"))

Table
 
  id component
1 a 0
2 b 1
3 c 1
4 d 0
5 e 0
6 f 4
7 g 8589934593
Showing all 7 rows.

Label Propagation
Run static Label Propagation Algorithm for detecting communities in networks.

Each node in the network is initially assigned to its own community. At every superstep, nodes send their community affiliation to
all neighbors and update their state to the most frequent community affiliation of incoming messages.

LPA is a standard community detection algorithm for graphs. It is very inexpensive computationally, although (1) convergence is
not guaranteed and (2) one can end up with trivial solutions (all nodes are identified into a single community).

result = g.labelPropagation(maxIter=5)
display(result)

Table
   
  id name age label
1 a Alice 34 2
2 b Bob 36 1
3 c Charlie 30 8589934592
4 d David 29 2
5 e Esther 32 2
6 f Fanny 36 2
7 g Gabby 60 8589934593

Showing all 7 rows.

PageRank
Identify important vertices in a graph based on connections.

results = g.pageRank(resetProbability=0.15, tol=0.01)


display(results.vertices)

Table
   
  id name age pagerank
1 a Alice 34 0.44910633706538744
2 b Bob 36 2.655507832863289
3 c Charlie 30 2.6878300011606218
4 d David 29 0.3283606792049851
5 e Esther 32 0.37085233187676075
6 f Fanny 36 0.3283606792049851
7 g Gabby 60 0.1799821386239711
Showing all 7 rows.

display(results.edges)

Table
   
  src dst relationship weight
1 a e friend 0.5
2 a b friend 0.5
3 b c follow 1
4 c b follow 1
5 d a friend 1
6 e d friend 0.5
7 e f follow 0.5
Showing all 8 rows.

# Run PageRank for a fixed number of iterations


g.pageRank(resetProbability=0.15, maxIter=5)

Out[436]: GraphFrame(v:[id: string, name: string ... 2 more fields], e:[src: string, dst: string ... 2 more fields])

# Run PageRank personalized for vertex "a"


g.pageRank(resetProbability=0.15, maxIter=10, sourceId="a")

Out[437]: GraphFrame(v:[id: string, name: string ... 2 more fields], e:[src: string, dst: string ... 2 more fields])

Shortest paths
Computes shortest paths to the given set of landmark vertices, where landmarks are specified by vertex ID.

results = g.shortestPaths(landmarks=["a", "d"])


#results = g.shortestPaths(["a", "d"])
display(results)

Table
   
  id name age distances
1 a Alice 34  {"d": 2, "a": 0}

2 b Bob 36 {}
3 c Charlie 30 {}
4 d David 29  {"d": 0, "a": 1}

5 e Esther 32  {"d": 1, "a": 2}

6 f Fanny 36 {}
7 g Gabby 60 {}

Showing all 7 rows.

Triangle count
Computes the number of triangles passing through each vertex.

results = g.triangleCount()
display(results)

Table
   
  count id name age
1 1 a Alice 34
2 0 b Bob 36
3 0 c Charlie 30
4 1 d David 29
5 1 e Esther 32
6 0 f Fanny 36
7 0 g Gabby 60
Showing all 7 rows.

ABD #11 Spark ML Examples with Libsvm

ML Examples

Don't forget to load the "sample_linear_regression_data.txt", "sample_libsvm_data.txt" and "sample_kmeans_data.txt" files

Linear Regression (supervised learning)


Using libsvm

%fs head FileStore/tables/sample_linear_regression_data.txt

[Truncated to first 65536 bytes]


-9.490009878824548 1:0.4551273600657362 2:0.36644694351969087 3:-0.38256108933468047 4:-0.4458430198517267 5:0.331097
90358914726 6:0.8067445293443565 7:-0.2624341731773887 8:-0.44850386111659524 9:-0.07269284838169332 10:0.56580355758
00715
0.2577820163584905 1:0.8386555657374337 2:-0.1270180511534269 3:0.499812362510895 4:-0.22686625128130267 5:-0.6452430
441812433 6:0.18869982177936828 7:-0.5804648622673358 8:0.651931743775642 9:-0.6555641246242951 10:0.1748547635725912
2
-4.438869807456516 1:0.5025608135349202 2:0.14208069682973434 3:0.16004976900412138 4:0.505019897181302 5:-0.93716352
23468384 6:-0.2841601610457427 7:0.6355938616712786 8:-0.1646249064941625 9:0.9480713629917628 10:0.42681251564645817
-19.782762789614537 1:-0.0388509668871313 2:-0.4166870051763918 3:0.8997202693189332 4:0.6409836467726933 5:0.2732890
95712564 6:-0.26175701211620517 7:-0.2794902492677298 8:-0.1306778297187794 9:-0.08536581111046115 10:-0.054623158248
28923
-7.966593841555266 1:-0.06195495876886281 2:0.6546448480299902 3:-0.6979368909424835 4:0.6677324708883314 5:-0.079387
25467767771 6:-0.43885601665437957 7:-0.608071585153688 8:-0.6414531182501653 9:0.7313735926547045 10:-0.026818676347
611925
-7.896274316726144 1:-0.15805658673794265 2:0.26573958270655806 3:0.3997172901343442 4:-0.3693430998846541 5:0.143240
61105995334 6:-0.25797542063247825 7:0.7436291919296774 8:0.6114618853239959 9:0.2324273700703574 10:-0.2512812878219
9144
-8.464803554195287 1:0.39449745853945895 2:0.817229160415142 3:-0.6077058562362969 4:0.6182496334554788 5:0.255866550
8269453 6:-0.07320145794330979 7:-0.38884168866510227 8:0.07981886851873865 9:0.27022202891277614 10:-0.7474843534024
693

from pyspark.ml.regression import LinearRegression

# Load training data


trainingLiR = spark.read.format("libsvm").load("dbfs:/FileStore/tables/sample_linear_regression_data.txt")

display(trainingLiR)

Table

  label features
-9.490009878824548  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.4551273600657362, 0.366446943519
1 -0.38256108933468047, -0.4458430198517267, 0.33109790358914726, 0.8067445293443565, -0.2624341731773887,
-0.44850386111659524, -0.07269284838169332, 0.5658035575800715]}
0.2577820163584905  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.8386555657374337, -0.12701805115
2 0.499812362510895, -0.22686625128130267, -0.6452430441812433, 0.18869982177936828, -0.5804648622673358,
0.651931743775642, -0.6555641246242951, 0.17485476357259122]}
-4.438869807456516  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.5025608135349202, 0.142080696829
3 0.16004976900412138, 0.505019897181302, -0.9371635223468384, -0.2841601610457427, 0.6355938616712786,
-0.1646249064941625, 0.9480713629917628, 0.42681251564645817]}
-19.782762789614537  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [-0.0388509668871313,
4 -0.4166870051763918, 0.8997202693189332, 0.6409836467726933, 0.273289095712564, -0.26175701211620517,
-0.2794902492677298, -0.1306778297187794, -0.08536581111046115, -0.05462315824828923]}
-7.966593841555266  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [-0.06195495876886281,
5 0.6546448480299902, -0.6979368909424835, 0.6677324708883314, -0.07938725467767771, -0.43885601665437957,
-0.608071585153688, -0.6414531182501653, 0.7313735926547045, -0.026818676347611925]}
-7.896274316726144  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [-0.15805658673794265,
6 0.26573958270655806, 0.3997172901343442, -0.3693430998846541, 0.14324061105995334, -0.25797542063247825,
0.7436291919296774, 0.6114618853239959, 0.2324273700703574, -0.25128128782199144]}
-8.464803554195287  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.39449745853945895, 0.81722916041

Showing all 501 rows.

lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model


lrModel = lr.fit(trainingLiR)

# Print the coefficients and intercept for linear regression


print("Coefficients: %s" % str(lrModel.coefficients))
print("Intercept: %s" % str(lrModel.intercept))

Coefficients: [0.0,0.3229251667740594,-0.3438548034562219,1.915601702345841,0.05288058680386255,0.765962720459771,0.
0,-0.15105392669186676,-0.21587930360904645,0.2202536918881343]
Intercept: 0.15989368442397356

# Summarize the model over the training set and print out some metrics
trainingSummary = lrModel.summary
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
trainingSummary.residuals.show()
print("Root Mean Squared Error - RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("R2: %f" % trainingSummary.r2)

numIterations: 6
objectiveHistory: [0.49999999999999994, 0.4967620357443381, 0.49363616643404634, 0.4936351537897608, 0.49363512141778
71, 0.49363512062528014, 0.4936351206216114]
+--------------------+
| residuals|
+--------------------+
| -9.889232683103197|
| 0.5533794340053553|
| -5.204019455758822|
| -20.566686715507508|
| -9.4497405180564|
| -6.909112502719487|
| -10.00431602969873|
| 2.0623978070504845|
| 3.1117508432954772|
| -15.89360822941938|
| -5.036284254673026|
| 6.4832158769943335|
| 12.429497299109002|
| -20.32003219007654|
| -2.0049838218725|

Linear Regression
with test (and train) data and pipelines

from pyspark.ml.regression import LinearRegression


from pyspark.ml import Pipeline

# Load data
data = spark.read.format("libsvm")\
.load("dbfs:/FileStore/tables/sample_linear_regression_data.txt")

# Split the data into training and test sets (20% held out for testing)
(trainingData, testData) = data.randomSplit([0.8,0.2])

# Create the Model


lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Chain Linear Regression in a Pipeline


pipeline = Pipeline(stages=[lr])

# Train Model
model = pipeline.fit(trainingData)

# Make Predictions
predictions = model.transform(testData)

# Show Predictions
display(predictions)

Table

  label features
-26.805483428483072  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.4572552704218824, -0.576096954000
1 -0.20809839485012915, 0.9140086345619809, -0.5922981637492224, -0.8969369345510854, 0.3741080343476908,
-0.01854004246308416, 0.07834089512221243, 0.3838413057880994]}
-23.487440120936512  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [-0.5195354431261132, 0.808035794841
2 0.8498613208566037, 0.044766977500795946, -0.9031972948753286, 0.284006053218262, 0.9640004956647206,
-0.04090127960289358, 0.44190479952918427, -0.7359820144913463]}
-19.66731861537172  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.9353590082406811, 0.8768609458072
3 0.9618210554140587, 0.12103715737151921, -0.7691766106953688, -0.4220229608873225, -0.18117247651928658,
-0.14333978019692784, -0.31512358142857066, 0.4022153556528465]}
-19.402336030214553  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.462288625222409, -0.9029755259427
4 0.7442695642729447, 0.3802724233363486, 0.4068685903786069, -0.5054707879424198, -0.8686166000900748,
-0.014710838968344575, -0.1362606460134499, 0.8444452252816472]}
-17.494200356883344  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [-0.4218585945316018,
5 0.15566399304488754, -0.164665303422032, -0.8579743106885072, 0.5651453461779163, -0.6582935645654426,
-0.40838717556437576, -0.19258926475033356, 0.9864284520934183, 0.7156150246487265]}
-17.32672073267595  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.31374599099683476,
6 -0.36270498808879115, 0.7456203273799138, 0.046239858938568856, -0.030136501929084014, -0.06596637210739509,
-0.46829487815816484, -0.2054839116368734, -0.7006480295111763, -0.6886047709544985]}
-17.026492264209548  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.8367805314799452, 0.155919044362

Showing all 92 rows.

print("Coefficients: " + str(model.stages[0].coefficients))


print("Intercept : " + str(model.stages[0].intercept))

Coefficients: [0.0,0.5632690395178652,-0.26439486198649054,1.510281213933446,0.0,0.6917959515358358,0.0,0.0,0.0,0.468
5019027558994]
Intercept : 0.3701057577245132

from pyspark.ml.evaluation import RegressionEvaluator

eval = RegressionEvaluator(labelCol="label", predictionCol="prediction")

print('RMSE:', eval.evaluate(predictions, {eval.metricName: "rmse"}))


print('R 2:', eval.evaluate(predictions, {eval.metricName: "r2"}))

RMSE: 10.335940526345189
R 2: 0.019420906856787323

Classification (supervised learning)


Using libsvm

%fs head FileStore/tables/sample_libsvm_data.txt

[Truncated to first 65536 bytes]


0 128:51 129:159 130:253 131:159 132:50 155:48 156:238 157:252 158:252 159:252 160:237 182:54 183:227 184:253 185:252
186:239 187:233 188:252 189:57 190:6 208:10 209:60 210:224 211:252 212:253 213:252 214:202 215:84 216:252 217:253 21
8:122 236:163 237:252 238:252 239:252 240:253 241:252 242:252 243:96 244:189 245:253 246:167 263:51 264:238 265:253 2
66:253 267:190 268:114 269:253 270:228 271:47 272:79 273:255 274:168 290:48 291:238 292:252 293:252 294:179 295:12 29
6:75 297:121 298:21 301:253 302:243 303:50 317:38 318:165 319:253 320:233 321:208 322:84 329:253 330:252 331:165 344:
7 345:178 346:252 347:240 348:71 349:19 350:28 357:253 358:252 359:195 372:57 373:252 374:252 375:63 385:253 386:252
387:195 400:198 401:253 402:190 413:255 414:253 415:196 427:76 428:246 429:252 430:112 441:253 442:252 443:148 455:85
456:252 457:230 458:25 467:7 468:135 469:253 470:186 471:12 483:85 484:252 485:223 494:7 495:131 496:252 497:225 498:
71 511:85 512:252 513:145 521:48 522:165 523:252 524:173 539:86 540:253 541:225 548:114 549:238 550:253 551:162 567:8
5 568:252 569:249 570:146 571:48 572:29 573:85 574:178 575:225 576:253 577:223 578:167 579:56 595:85 596:252 597:252
598:252 599:229 600:215 601:252 602:252 603:252 604:196 605:130 623:28 624:199 625:252 626:252 627:253 628:252 629:25
2 630:233 631:145 652:25 653:128 654:252 655:253 656:252 657:141 658:37
1 159:124 160:253 161:255 162:63 186:96 187:244 188:251 189:253 190:62 214:127 215:251 216:251 217:253 218:62 241:68
242:236 243:251 244:211 245:31 246:8 268:60 269:228 270:251 271:251 272:94 296:155 297:253 298:253 299:189 323:20 32
4:253 325:251 326:235 327:66 350:32 351:205 352:253 353:251 354:126 378:104 379:251 380:253 381:184 382:15 405:80 40
6:240 407:251 408:193 409:23 432:32 433:253 434:253 435:253 436:159 460:151 461:251 462:251 463:251 464:39 487:48 48
8:221 489:251 490:251 491:172 515:234 516:251 517:251 518:196 519:12 543:253 544:251 545:251 546:89 570:159 571:255 5
72:253 573:253 574:31 597:48 598:228 599:253 600:247 601:140 602:8 625:64 626:251 627:253 628:220 653:64 654:251 655:
253 656:220 681:24 682:193 683:253 684:220
1 125:145 126:255 127:211 128:31 152:32 153:237 154:253 155:252 156:71 180:11 181:175 182:253 183:252 184:71 209:144

from pyspark.ml.classification import LogisticRegression


from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Load training data


dataLoR = spark.read.format("libsvm").load("dbfs:/FileStore/tables/sample_libsvm_data.txt")

# Check the data, we have a label with 0s and 1s


display(dataLoR)

Table
 
  label features
0  {"vectorType": "sparse", "length": 692, "indices": [127, 128, 129, 130, 131, 154, 155, 156, 157, 158, 159, 181, 182, 183, 184, 185,
186, 187, 188, 189, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 262,
263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 289, 290, 291, 292, 293, 294, 295, 296, 297, 300, 301, 302, 316, 317, 318, 319,
320, 321, 328, 329, 330, 343, 344, 345, 346, 347, 348, 349, 356, 357, 358, 371, 372, 373, 374, 384, 385, 386, 399, 400, 401, 412, 413,
414, 426, 427, 428, 429, 440, 441, 442, 454, 455, 456, 457, 466, 467, 468, 469, 470, 482, 483, 484, 493, 494, 495, 496, 497, 510, 511,
512, 520, 521, 522, 523, 538, 539, 540, 547, 548, 549, 550, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 594, 595,
596, 597, 598, 599, 600, 601, 602, 603, 604, 622, 623, 624, 625, 626, 627, 628, 629, 630, 651, 652, 653, 654, 655, 656, 657], "values":
1
[51, 159, 253, 159, 50, 48, 238, 252, 252, 252, 237, 54, 227, 253, 252, 239, 233, 252, 57, 6, 10, 60, 224, 252, 253, 252, 202, 84, 252,
253, 122, 163, 252, 252, 252, 253, 252, 252, 96, 189, 253, 167, 51, 238, 253, 253, 190, 114, 253, 228, 47, 79, 255, 168, 48, 238, 252,
252, 179, 12, 75, 121, 21, 253, 243, 50, 38, 165, 253, 233, 208, 84, 253, 252, 165, 7, 178, 252, 240, 71, 19, 28, 253, 252, 195, 57, 252,
252, 63, 253, 252, 195, 198, 253, 190, 255, 253, 196, 76, 246, 252, 112, 253, 252, 148, 85, 252, 230, 25, 7, 135, 253, 186, 12, 85, 252,
223, 7, 131, 252, 225, 71, 85, 252, 145, 48, 165, 252, 173, 86, 253, 225, 114, 238, 253, 162, 85, 252, 249, 146, 48, 29, 85, 178, 225,
253, 223, 167, 56, 85, 252, 252, 252, 229, 215, 252, 252, 252, 196, 130, 28, 199, 252, 252, 253, 252, 252, 233, 145, 25, 128, 252, 253,
252, 141, 37]}
1  {"vectorType": "sparse", "length": 692, "indices": [158, 159, 160, 161, 185, 186, 187, 188, 189, 213, 214, 215, 216, 217, 240, 241,
242, 243, 244, 245, 267, 268, 269, 270, 271, 295, 296, 297, 298, 322, 323, 324, 325, 326, 349, 350, 351, 352, 353, 377, 378, 379, 380,
381, 404, 405, 406, 407, 408, 431, 432, 433, 434, 435, 459, 460, 461, 462, 463, 486, 487, 488, 489, 490, 514, 515, 516, 517, 518, 542,
543, 544, 545, 569, 570, 571, 572, 573, 596, 597, 598, 599, 600, 601, 624, 625, 626, 627, 652, 653, 654, 655, 680, 681, 682, 683],
2
"values": [124, 253, 255, 63, 96, 244, 251, 253, 62, 127, 251, 251, 253, 62, 68, 236, 251, 211, 31, 8, 60, 228, 251, 251, 94, 155, 253,
253, 189, 20, 253, 251, 235, 66, 32, 205, 253, 251, 126, 104, 251, 253, 184, 15, 80, 240, 251, 193, 23, 32, 253, 253, 253, 159, 151,
251, 251, 251, 39, 48, 221, 251, 251, 172, 234, 251, 251, 196, 12, 253, 251, 251, 89, 159, 255, 253, 253, 31, 48, 228, 253, 247, 140, 8,

Showing all 100 rows.

# Split the data into training and test sets (30% held out for testing)
(trainingDataLoR, testDataLoR) = dataLoR.randomSplit([0.7,0.3])

# Create the Model


loReg = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Train the Model


loregClassifier = loReg.fit(trainingDataLoR)

# Print the coefficients and intercept for linear regression


print("Coefficients: %s" % str(loregClassifier.coefficients))
print("Intercept: %s" % str(loregClassifier.intercept))


Coefficients: (692,[351,378,379,405,406,407,433,434,435,461,462,489],[0.0006062065933687492,0.0006353388402241491,0.0
010036254420385836,0.0005701600678249564,0.0010321061366040955,0.0012146854845528385,0.0006613971762257488,0.00102153
81836988545,0.0006819741249471635,0.0005969809559873252,0.0006768449741966736,0.0005984540021993428])
Intercept: -1.104081936861524

# Make predictions based on the test data


predictionsLoR = loregClassifier.transform(testDataLoR)

#predictionsLoR.select('*').display()
predictionsLoR.select("probability","prediction","label").display()
# Remember that the field probability will give you a probability value for each class

Table
  
  probability prediction label
1  {"vectorType": "dense", "length": 2, "values": [0.7510241560396478, 0.24897584396035222]} 0 0
2  {"vectorType": "dense", "length": 2, "values": [0.7510241560396478, 0.24897584396035222]} 0 0
3  {"vectorType": "dense", "length": 2, "values": [0.7510241560396478, 0.24897584396035222]} 0 0
4  {"vectorType": "dense", "length": 2, "values": [0.7510241560396478, 0.24897584396035222]} 0 0
5  {"vectorType": "dense", "length": 2, "values": [0.7510241560396478, 0.24897584396035222]} 0 0
6  {"vectorType": "dense", "length": 2, "values": [0.7510241560396478, 0.24897584396035222]} 0 0
7  {"vectorType": "dense", "length": 2, "values": [0.7510241560396478, 0.24897584396035222]} 0 0
Showing all 39 rows.

# Evaluate the Model


evaluatorLoR = MulticlassClassificationEvaluator(predictionCol="prediction")
evaluatorLoR.evaluate(predictionsLoR)

#print("Precision is: " + str(evaluatorLoR.evaluate(predictionsLoR)))

Out[490]: 0.9746298984034834
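
By default MulticlassClassificationEvaluator reports the f1 score. A minimal sketch (reusing the predictionsLoR DataFrame above; the metric list is illustrative) of requesting specific metrics explicitly:

# Sketch: ask the evaluator for specific metrics (default metricName is "f1")
for metric in ["f1", "accuracy", "weightedPrecision", "weightedRecall"]:
    value = evaluatorLoR.evaluate(predictionsLoR, {evaluatorLoR.metricName: metric})
    print(metric, "=", value)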

K-Means (unsupervised learning)

%fs head FileStore/tables/sample_kmeans_data.txt

0 1:0.0 2:0.0 3:0.0


1 1:0.1 2:0.1 3:0.1
2 1:0.2 2:0.2 3:0.2
3 1:9.0 2:9.0 3:9.0
4 1:9.1 2:9.1 3:9.1
5 1:9.2 2:9.2 3:9.2

from pyspark.ml.clustering import KMeans


from pyspark.ml.evaluation import ClusteringEvaluator

# Load data
# The label column in this case is only for complementary information (the model will not use it for training/estimation)
# It results from the libsvm reading; it's only a sequence number of the rows
dataset = spark.read.format("libsvm").load("dbfs:/FileStore/tables/sample_kmeans_data.txt")
dataset.display()

Table
  label features
1 0  {"vectorType": "sparse", "length": 3, "indices": [], "values": []}
2 1  {"vectorType": "sparse", "length": 3, "indices": [0, 1, 2], "values": [0.1, 0.1, 0.1]}
3 2  {"vectorType": "sparse", "length": 3, "indices": [0, 1, 2], "values": [0.2, 0.2, 0.2]}
4 3  {"vectorType": "sparse", "length": 3, "indices": [0, 1, 2], "values": [9, 9, 9]}
5 4  {"vectorType": "sparse", "length": 3, "indices": [0, 1, 2], "values": [9.1, 9.1, 9.1]}
Showing all 6 rows.


# Trains a k-means model with 2 seeds/clusters


kmeans = KMeans().setK(2)
model = kmeans.fit(dataset)

# Make predictions
predictions = model.transform(dataset)

display(predictions)

Table
  label features prediction
1 0  {"vectorType": "sparse", "length": 3, "indices": [], "values": []}  1
2 1  {"vectorType": "sparse", "length": 3, "indices": [0, 1, 2], "values": [0.1, 0.1, 0.1]}  1
3 2  {"vectorType": "sparse", "length": 3, "indices": [0, 1, 2], "values": [0.2, 0.2, 0.2]}  1
4 3  {"vectorType": "sparse", "length": 3, "indices": [0, 1, 2], "values": [9, 9, 9]}  0
5 4  {"vectorType": "sparse", "length": 3, "indices": [0, 1, 2], "values": [9.1, 9.1, 9.1]}  0
Showing all 6 rows.

# Evaluate clustering by computing Silhouette score


evaluator = ClusteringEvaluator()

silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

# Shows the result.


centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

Silhouette with squared euclidean distance = 0.9997530305375207


Cluster Centers:
[9.1 9.1 9.1]
[0.1 0.1 0.1]

ABD #12 ML Data Preparation

Loading the data


people = spark.createDataFrame([(1, 'Peter', 1.79, 90, 28,'M', 'Tiler'),
(2, 'Fritz', 1.78, None, 45,'M', None),
(2, 'Fritz', 1.78, None, 45,'M', None),
(3, 'Florence', 1.75, None, None, None, None),
(4, 'Nicola', 1.6, 60, 33,'F', 'Dancer'),
(5, 'Gregory', 1.8, 88, 54,'M', 'Teacher'),
(6, 'Steven', 1.82, None, None, 'M', None),
(7, 'Dagmar', 1.7, 64, 42,'F', 'Nurse'),
(8, 'Thomaz', 2.3, 10, 100,'M', 'Driver')],
['id', 'name', 'height', 'weight', 'age', 'gender', 'job'])
people.display()

Table
      
  id name height weight age gender job
1 1 Peter 1.79 90 28 M Tiler
2 2 Fritz 1.78 null 45 M null
3 2 Fritz 1.78 null 45 M null
4 3 Florence 1.75 null null null null
5 4 Nicola 1.6 60 33 F Dancer
6 5 Gregory 1.8 88 54 M Teacher
7 6 Steven 1.82 null null M null
Showing all 9 rows.


Descriptive statistics
# Use describe or summary for statistics (summary will give you more info)
#people.describe().display()
#people.select("age").summary().display()
people.summary().display()
#dbutils.data.summarize(people)

Table
     
  summary id name height weight age gend
1 count 9 9 9 5 7 8
2 mean 4.222222222222222 null 1.8133333333333335 62.4 49.57142857142857 null
3 stddev 2.438123139721299 null 0.194357917255768 32.292413969847466 23.8107618725731 null
4 min 1 Dagmar 1.6 10 28 F
5 25% 2 null 1.75 60 33 null
6 50% 4 null 1.78 64 45 null
7 75% 6 null 1.8 88 54 null
Showing all 8 rows.

from pyspark.sql.functions import *


#from pyspark.sql.functions import mean, min, max

people.select([mean('age'), min('age'), max('age')]).display()

Table
  
  avg(age) min(age) max(age)
1 49.57142857142857 28 100

Showing 1 row.

# Check the Skewness and Kurtosis of the columns


# Skewness: measures how asymmetric the distribution is (leaning to the right or to the left)
# Kurtosis: measures how heavy the tails of the distribution are compared to a normal distribution

# from pyspark.sql.functions import *


#cols = ['height', 'weight','age']
cols = people.columns
for i in range(len(cols)):
print("Skewness and Kurtosis for variable "+cols[i]+":")
print(people.select(skewness(people[i]), kurtosis(people[i])).display())

Skewness and Kurtosis for variable id:

Table
 
  skewness(id) kurtosis(id)
1 0.22135555123008185 -1.2793038693335663

Showing 1 row.

None
Skewness and Kurtosis for variable name:

Table
 
  skewness(name) kurtosis(name)
1 null null

Showing 1 row.

None
Skewness and Kurtosis for variable height:

Table
 
  skewness(height) kurtosis(height)


1 1.8736806669884842 2.7503936497452406

Showing 1 row.

None
Skewness and Kurtosis for variable weight:

Table
 
  skewness(weight) kurtosis(weight)
1 -0.8805430091452401 -0.5432331648482904

Showing 1 row.

None
Skewness and Kurtosis for variable age:

Table
 
  skewness(age) kurtosis(age)
1 1.5084230988771352 1.0914433556729701

Showing 1 row.

None
Skewness and Kurtosis for variable gender:

Table
 
  skewness(gender) kurtosis(gender)
1 null null

Showing 1 row.

None
Skewness and Kurtosis for variable job:

Table
 
  skewness(job) kurtosis(job)
1 null null

Showing 1 row.

None
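
The loop above calls display() once per column; as an alternative sketch (assuming the same people DataFrame), the numeric columns can also be summarised in a single select:

from pyspark.sql.functions import skewness, kurtosis

# Compute skewness and kurtosis for the numeric columns in one pass
num_cols = ['height', 'weight', 'age']
people.select([skewness(c).alias(c + '_skew') for c in num_cols] +
              [kurtosis(c).alias(c + '_kurt') for c in num_cols]).display()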

Data Preparation Examples

Derive columns, drop columns, filter columns (e.g NULLs)

display(people)

Table
      
  id name height weight age gender job
1 1 Peter 1.79 90 28 M Tiler
2 2 Fritz 1.78 null 45 M null
3 2 Fritz 1.78 null 45 M null
4 3 Florence 1.75 null null null null
5 4 Nicola 1.6 60 33 F Dancer
6 5 Gregory 1.8 88 54 M Teacher
7 6 Steven 1.82 null null M null
Showing all 9 rows.


# Eliminate duplicates with distinct() or dropDuplicates()


# df = people.distinct()
# dropDuplicates allows column selection. Ex: dropDuplicates(["name", "age"])

df = people.dropDuplicates()
df.display()

Table
      
  id name height weight age gender job
1 1 Peter 1.79 90 28 M Tiler
2 2 Fritz 1.78 null 45 M null
3 3 Florence 1.75 null null null null
4 4 Nicola 1.6 60 33 F Dancer
5 5 Gregory 1.8 88 54 M Teacher
6 6 Steven 1.82 null null M null
7 7 Dagmar 1.7 64 42 F Nurse
Showing all 8 rows.

# Checking if there are null values in the dataset


#from pyspark.sql.functions import *

#display(df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]))


df.select([count(when(isnull(c), c)).alias(c) for c in df.columns]).display()

Table
      
  id name height weight age gender job
1 0 0 0 3 2 1 3

Showing 1 row.

# removing null values


# dropna(thresh=2, subset=('field1','field2','field3'))
# Drops rows with null values
# thresh = minimum number of non-null values (valid data elements) required to keep the row
# how = 'any' or 'all' values (as null in the column)
df.dropna(how='any', subset="gender").display()

Table
      
  id name height weight age gender job
1 1 Peter 1.79 90 28 M Tiler
2 2 Fritz 1.78 null 45 M null
3 4 Nicola 1.6 60 33 F Dancer
4 5 Gregory 1.8 88 54 M Teacher
5 6 Steven 1.82 null null M null
6 7 Dagmar 1.7 64 42 F Nurse
7 8 Thomaz 2.3 10 100 M Driver
Showing all 7 rows.

df = df.dropna(how="any")
df.display()

Table
      
  id name height weight age gender job
1 1 Peter 1.79 90 28 M Tiler
2 4 Nicola 1.6 60 33 F Dancer
3 5 Gregory 1.8 88 54 M Teacher
4 7 Dagmar 1.7 64 42 F Nurse
5 8 Thomaz 2.3 10 100 M Driver

Showing all 5 rows.
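
Dropping rows is not the only option. A hedged sketch of imputing the missing numeric values instead, using pyspark.ml.feature.Imputer (it assumes a DataFrame that still contains the nulls, e.g. the deduplicated people data before dropna; the *_imp output column names are illustrative):

from pyspark.ml.feature import Imputer
from pyspark.sql.functions import col

# Imputer expects double/float columns, so cast the integer columns first
df_with_nulls = (people.dropDuplicates()
                 .withColumn("weight", col("weight").cast("double"))
                 .withColumn("age", col("age").cast("double")))

imputer = Imputer(inputCols=["weight", "age"],
                  outputCols=["weight_imp", "age_imp"],
                  strategy="mean")
imputer.fit(df_with_nulls).transform(df_with_nulls).display()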


# Add or derive a new feature


# Use selectExpr(), withColumn(), lit(), etc.

#df.withColumn("bmi", col("weight")/col("height")**2 ).display()


df = df.selectExpr("id", "name", "weight", "height", "age", "gender", "job",
"(weight/power(height,2)) as bmi")
df.display()

Table
       
  id name weight height age gender job bmi
1 1 Peter 90 1.79 28 M Tiler 28.089010954714272
2 4 Nicola 60 1.6 33 F Dancer 23.437499999999996
3 5 Gregory 88 1.8 54 M Teacher 27.160493827160494
4 7 Dagmar 64 1.7 42 F Nurse 22.145328719723185
5 8 Thomaz 10 2.3 100 M Driver 1.8903591682419663

Showing all 5 rows.

#from pyspark.sql.functions import *


from pyspark.sql.types import DoubleType

def bmi(weight, height):
    return (weight/(height**2))

bmi_udf = udf(bmi, DoubleType())

df.withColumn("bmi", bmi_udf(df["weight"],df["height"])).display()

Table
       
  id name weight height age gender job bmi
1 1 Peter 90 1.79 28 M Tiler 28.089010954714272
2 4 Nicola 60 1.6 33 F Dancer 23.437499999999996
3 5 Gregory 88 1.8 54 M Teacher 27.160493827160494
4 7 Dagmar 64 1.7 42 F Nurse 22.145328719723185
5 8 Thomaz 10 2.3 100 M Driver 1.8903591682419663

Showing all 5 rows.
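
Plain Python UDFs are evaluated row by row; as a sketch (assuming Spark 3.x, where pandas_udf accepts Python type hints), the same calculation can be written as a vectorized pandas UDF, which is usually faster:

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def bmi_pandas(weight: pd.Series, height: pd.Series) -> pd.Series:
    # Vectorized: operates on whole pandas Series instead of single rows
    return weight / (height ** 2)

df.withColumn("bmi", bmi_pandas("weight", "height")).display()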

# Last step to have a final table only with labels and features
df = df.drop("name")
df.display()

Table
      
  id weight height age gender job bmi
1 1 90 1.79 28 M Tiler 28.089010954714272
2 4 60 1.6 33 F Dancer 23.437499999999996
3 5 88 1.8 54 M Teacher 27.160493827160494
4 7 64 1.7 42 F Nurse 22.145328719723185
5 8 10 2.3 100 M Driver 1.8903591682419663

Showing all 5 rows.

Identifying Outliers

Common rule: an outlier is a value more than 1.5 * IQR below Q1 or above Q3

There are no outliers if all the values fall roughly within the [Q1 - 1.5*IQR, Q3 + 1.5*IQR] range

IQR is the interquartile range, defined as the difference between the upper (Q3) and lower (Q1) quartiles


# We'll use the .approxQuantile(...) method (it returns a list with the Q1 and Q3 values)
# The 1st parameter is the name of the column
# The 2nd parameter can be either a number between 0 and 1 (where 0.5 means the median) or a list (as in this case)
# The 3rd parameter specifies the acceptable level of error for each metric (0 means an exact value - it can be very expensive)

cols = ['weight', 'height', 'age']


bounds = {}
for col in cols:
    quantiles = df.approxQuantile(col, [0.25, 0.75], 0.05)
    IQR = quantiles[1] - quantiles[0]
    bounds[col] = [quantiles[0] - 1.5 * IQR, quantiles[1] + 1.5 * IQR]

# bounds dic will have the lower and upper bounds for each feature
print(bounds)

{'weight': [18.0, 130.0], 'height': [1.5499999999999998, 1.9500000000000002], 'age': [1.5, 85.5]}

# Let's identify the outliers now

outliers = df.select(['id','weight','height','age'] +
[( (df[c] < bounds[c][0]) | (df[c] > bounds[c][1]) ).alias(c + '_o') for c in cols ])

outliers.display()

Table
      
  id weight height age weight_o height_o age_o
1 1 90 1.79 28 false false false
2 4 60 1.6 33 false false false
3 5 88 1.8 54 false false false
4 7 64 1.7 42 false false false
5 8 10 2.3 100 true true true

Showing all 5 rows.

# Use a join now to eliminate the outliers from the original dataset
df_No_Outl = df.join(outliers['id','weight_o','height_o','age_o'], on='id')
#df_No_Outl.filter('age_o').show()
df_No_Outl.filter(df_No_Outl.age_o == False).select('id','height','weight','age').display()

Table
   
  id height weight age
1 1 1.79 90 28
2 4 1.6 60 33
3 5 1.8 88 54
4 7 1.7 64 42

Showing all 4 rows.
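
An alternative sketch (reusing df and the bounds dictionary computed above): instead of joining back the boolean flags, the rows can be filtered directly against the bounds:

from functools import reduce

# Keep only rows whose weight, height and age all fall inside their bounds
conditions = [(df[c] >= bounds[c][0]) & (df[c] <= bounds[c][1]) for c in ['weight', 'height', 'age']]
df.filter(reduce(lambda a, b: a & b, conditions)).display()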

Vectorize, StringIndexer, OneHotEncoding

Vectorize

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=["age","height","weight"], outputCol="features")

assembler.transform(df).display()
assembler.transform(df).printSchema()


Table
  id weight height age gender job bmi features
1 1 90 1.79 28 M Tiler 28.089010954714272  {"vectorType": "dense", ...}
2 4 60 1.6 33 F Dancer 23.437499999999996  {"vectorType": "dense", ...}
3 5 88 1.8 54 M Teacher 27.160493827160494  {"vectorType": "dense", ...}
4 7 64 1.7 42 F Nurse 22.145328719723185  {"vectorType": "dense", ...}
5 8 10 2.3 100 M Driver 1.8903591682419663  {"vectorType": "dense", ...}
Showing all 5 rows.

root
|-- id: long (nullable = true)
|-- weight: long (nullable = true)
|-- height: double (nullable = true)
|-- age: long (nullable = true)
|-- gender: string (nullable = true)
|-- job: string (nullable = true)
|-- bmi: double (nullable = true)
|-- features: vector (nullable = true)

StringIndexer
Used to transform a categorical string feature into a numerical feature

# drop columns in case they already exist (from previous run)


#df = df.drop("genderIndex","jobIndex")

from pyspark.ml.feature import StringIndexer

genderIndexer = StringIndexer(inputCol="gender", outputCol="genderIndex")


df = genderIndexer.fit(df).transform(df)

occupationIndexer = StringIndexer(inputCol="job", outputCol="jobIndex")


df = occupationIndexer.fit(df).transform(df)

df.display()

Table
       
  id weight height age gender job bmi genderIndex
1 1 90 1.79 28 M Tiler 28.089010954714272 0
2 4 60 1.6 33 F Dancer 23.437499999999996 1
3 5 88 1.8 54 M Teacher 27.160493827160494 0
4 7 64 1.7 42 F Nurse 22.145328719723185 1
5 8 10 2.3 100 M Driver 1.8903591682419663 0

Showing all 5 rows.

One Hot Encoding

For every distinct category, create a binary feature

Better than StringIndexer alone for non-ordered (nominal) categories

# In this example setDropLast=False, meaning the last category is also encoded


from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(inputCols=["jobIndex","genderIndex"],
outputCols=["jobOHEVector","genderOHEVector"]).setDropLast(False)

encoded = encoder.fit(df).transform(df)
encoded.display()


Table
  id weight height age gender job bmi genderIndex
1 1 90 1.79 28 M Tiler 28.089010954714272 0
2 4 60 1.6 33 F Dancer 23.437499999999996 1
3 5 88 1.8 54 M Teacher 27.160493827160494 0
4 7 64 1.7 42 F Nurse 22.145328719723185 1
Showing all 5 rows.

# In this example setDropLast=True, meaning the last category is ignored (not encoded)
from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(inputCols=["jobIndex","genderIndex"],
outputCols=["jobOHEVector","genderOHEVector"]).setDropLast(True)

encoded = encoder.fit(df).transform(df)
encoded.show()

+---+------+------+---+------+-------+------------------+-----------+--------+-------------+---------------+
| id|weight|height|age|gender| job| bmi|genderIndex|jobIndex| jobOHEVector|genderOHEVector|
+---+------+------+---+------+-------+------------------+-----------+--------+-------------+---------------+
| 1| 90| 1.79| 28| M| Tiler|28.089010954714272| 0.0| 4.0| (4,[],[])| (1,[0],[1.0])|
| 4| 60| 1.6| 33| F| Dancer|23.437499999999996| 1.0| 0.0|(4,[0],[1.0])| (1,[],[])|
| 5| 88| 1.8| 54| M|Teacher|27.160493827160494| 0.0| 3.0|(4,[3],[1.0])| (1,[0],[1.0])|
| 7| 64| 1.7| 42| F| Nurse|22.145328719723185| 1.0| 2.0|(4,[2],[1.0])| (1,[],[])|
| 8| 10| 2.3|100| M| Driver|1.8903591682419663| 0.0| 1.0|(4,[1],[1.0])| (1,[0],[1.0])|
+---+------+------+---+------+-------+------------------+-----------+--------+-------------+---------------+
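
The indexing, encoding and vectorizing steps above can also be chained in a single Pipeline. A minimal sketch, assuming a cleaned DataFrame (here called df_clean, an illustrative name) that still has the raw gender and job string columns and no *Index columns yet:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

indexers = [StringIndexer(inputCol=c, outputCol=c + "Index") for c in ["gender", "job"]]
encoder = OneHotEncoder(inputCols=["genderIndex", "jobIndex"],
                        outputCols=["genderOHEVector", "jobOHEVector"])
assembler = VectorAssembler(inputCols=["age", "height", "weight", "bmi",
                                       "genderOHEVector", "jobOHEVector"],
                            outputCol="features")

prep_pipeline = Pipeline(stages=indexers + [encoder, assembler])
#prep_pipeline.fit(df_clean).transform(df_clean).display()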

ABD #12 Spark ML Examples

ML Examples

Don't forget to load the "sample_linear_regression_data.txt", "sample_libsvm_data.txt" and "sample_kmeans_data.txt" files from the Moodle technical resources folder
You may want to replace the input datasets from Moodle with the ones already available on /databricks-datasets

%fs ls /databricks-datasets/samples/data/mllib

Table
 
  path name size
1 dbfs:/databricks-datasets/samples/data/mllib/.DS_Store .DS_Store 614
2 dbfs:/databricks-datasets/samples/data/mllib/als/ als/ 0
3 dbfs:/databricks-datasets/samples/data/mllib/gmm_data.txt gmm_data.txt 639
4 dbfs:/databricks-datasets/samples/data/mllib/kmeans_data.txt kmeans_data.txt 72
5 dbfs:/databricks-datasets/samples/data/mllib/lr-data/ lr-data/ 0
6 dbfs:/databricks-datasets/samples/data/mllib/lr_data.txt lr_data.txt 197
7 dbfs:/databricks-datasets/samples/data/mllib/pagerank data.txt pagerank data.txt 24
Showing all 20 rows.

%fs ls /databricks-datasets/definitive-guide/data/

Table
  
  path name size modifica
1 dbfs:/databricks-datasets/definitive-guide/data/activity-data/ activity-data/ 0 0
2 dbfs:/databricks-datasets/definitive-guide/data/bike-data/ bike-data/ 0 0
3 dbfs:/databricks-datasets/definitive-guide/data/binary-classification/ binary-classification/ 0 0
4 dbfs:/databricks-datasets/definitive-guide/data/clustering/ clustering/ 0 0


5 dbfs:/databricks-datasets/definitive-guide/data/flight-data/ flight-data/ 0 0
6 dbfs:/databricks-datasets/definitive-guide/data/flight-data-hive/ flight-data-hive/ 0 0
7 dbfs:/databricks-datasets/definitive-guide/data/multiclass-classification/ multiclass-classification/ 0 0
Showing all 14 rows.

Linear Regression (supervised learning)


Using libsvm

%fs head FileStore/tables/sample_linear_regression_data.txt

[Truncated to first 65536 bytes]


-9.490009878824548 1:0.4551273600657362 2:0.36644694351969087 3:-0.38256108933468047 4:-0.4458430198517267 5:0.331097
90358914726 6:0.8067445293443565 7:-0.2624341731773887 8:-0.44850386111659524 9:-0.07269284838169332 10:0.56580355758
00715
0.2577820163584905 1:0.8386555657374337 2:-0.1270180511534269 3:0.499812362510895 4:-0.22686625128130267 5:-0.6452430
441812433 6:0.18869982177936828 7:-0.5804648622673358 8:0.651931743775642 9:-0.6555641246242951 10:0.1748547635725912
2
-4.438869807456516 1:0.5025608135349202 2:0.14208069682973434 3:0.16004976900412138 4:0.505019897181302 5:-0.93716352
23468384 6:-0.2841601610457427 7:0.6355938616712786 8:-0.1646249064941625 9:0.9480713629917628 10:0.42681251564645817
-19.782762789614537 1:-0.0388509668871313 2:-0.4166870051763918 3:0.8997202693189332 4:0.6409836467726933 5:0.2732890
95712564 6:-0.26175701211620517 7:-0.2794902492677298 8:-0.1306778297187794 9:-0.08536581111046115 10:-0.054623158248
28923
-7.966593841555266 1:-0.06195495876886281 2:0.6546448480299902 3:-0.6979368909424835 4:0.6677324708883314 5:-0.079387
25467767771 6:-0.43885601665437957 7:-0.608071585153688 8:-0.6414531182501653 9:0.7313735926547045 10:-0.026818676347
611925
-7.896274316726144 1:-0.15805658673794265 2:0.26573958270655806 3:0.3997172901343442 4:-0.3693430998846541 5:0.143240
61105995334 6:-0.25797542063247825 7:0.7436291919296774 8:0.6114618853239959 9:0.2324273700703574 10:-0.2512812878219
9144
-8.464803554195287 1:0.39449745853945895 2:0.817229160415142 3:-0.6077058562362969 4:0.6182496334554788 5:0.255866550
8269453 6:-0.07320145794330979 7:-0.38884168866510227 8:0.07981886851873865 9:0.27022202891277614 10:-0.7474843534024
693

from pyspark.ml.regression import LinearRegression

# Load training data


trainingLiR = spark.read.format("libsvm").load("dbfs:/FileStore/tables/sample_linear_regression_data.txt")

display(trainingLiR)

Table

  label features
-9.490009878824548  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.4551273600657362, 0.366446943519
1 -0.38256108933468047, -0.4458430198517267, 0.33109790358914726, 0.8067445293443565, -0.2624341731773887,
-0.44850386111659524, -0.07269284838169332, 0.5658035575800715]}
0.2577820163584905  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.8386555657374337, -0.12701805115
2 0.499812362510895, -0.22686625128130267, -0.6452430441812433, 0.18869982177936828, -0.5804648622673358,
0.651931743775642, -0.6555641246242951, 0.17485476357259122]}
-4.438869807456516  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.5025608135349202, 0.142080696829
3 0.16004976900412138, 0.505019897181302, -0.9371635223468384, -0.2841601610457427, 0.6355938616712786,
-0.1646249064941625, 0.9480713629917628, 0.42681251564645817]}
-19.782762789614537  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [-0.0388509668871313,
4 -0.4166870051763918, 0.8997202693189332, 0.6409836467726933, 0.273289095712564, -0.26175701211620517,
-0.2794902492677298, -0.1306778297187794, -0.08536581111046115, -0.05462315824828923]}
-7.966593841555266  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [-0.06195495876886281,
5 0.6546448480299902, -0.6979368909424835, 0.6677324708883314, -0.07938725467767771, -0.43885601665437957,
-0.608071585153688, -0.6414531182501653, 0.7313735926547045, -0.026818676347611925]}
-7.896274316726144  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [-0.15805658673794265,
6 0.26573958270655806, 0.3997172901343442, -0.3693430998846541, 0.14324061105995334, -0.25797542063247825,
0.7436291919296774, 0.6114618853239959, 0.2324273700703574, -0.25128128782199144]}
-8.464803554195287  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.39449745853945895, 0.81722916041

Showing all 501 rows.


lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model


lrModel = lr.fit(trainingLiR)

# Print the coefficients and intercept for linear regression


print("Coefficients: %s" % str(lrModel.coefficients))
print("Intercept: %s" % str(lrModel.intercept))

Coefficients: [0.0,0.3229251667740594,-0.3438548034562219,1.915601702345841,0.05288058680386255,0.765962720459771,0.
0,-0.15105392669186676,-0.21587930360904645,0.2202536918881343]
Intercept: 0.15989368442397356

# Summarize the model over the training set and print out some metrics
# RMSE: square root of the variance of the residuals (Lower values of RMSE indicate better fit)
# R2: the proportional improvement in prediction from the model (0 -> the model does not improve prediction over the mean model; 1 -> indicates perfect prediction)
trainingSummary = lrModel.summary
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
trainingSummary.residuals.show()
print("Root Mean Squared Error - RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("R2: %f" % trainingSummary.r2)

numIterations: 6
objectiveHistory: [0.49999999999999994, 0.4967620357443381, 0.49363616643404634, 0.4936351537897608, 0.49363512141778
71, 0.49363512062528014, 0.4936351206216114]
+--------------------+
| residuals|
+--------------------+
| -9.889232683103197|
| 0.5533794340053553|
| -5.204019455758822|
| -20.566686715507508|
| -9.4497405180564|
| -6.909112502719487|
| -10.00431602969873|
| 2.0623978070504845|
| 3.1117508432954772|
| -15.89360822941938|
| -5.036284254673026|
| 6.4832158769943335|
| 12.429497299109002|
| -20.32003219007654|
| -2.0049838218725|

Linear Regression
with test (and train) data and pipelines


from pyspark.ml.regression import LinearRegression


from pyspark.ml import Pipeline

# Load data
data = spark.read.format("libsvm").load("dbfs:/FileStore/tables/sample_linear_regression_data.txt")

# Split the data into training and test sets (20% held out for testing)
(trainingData, testData) = data.randomSplit([0.8,0.2])

# Create the Model


lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Chain Linear Regression in a Pipeline


pipeline = Pipeline(stages=[lr])

# Train Model
model = pipeline.fit(trainingData)

# Make Predictions
predictions = model.transform(testData)

# Show Predictions
display(predictions)

Table

  label features
-26.805483428483072  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.4572552704218824, -0.576096954000
1 -0.20809839485012915, 0.9140086345619809, -0.5922981637492224, -0.8969369345510854, 0.3741080343476908,
-0.01854004246308416, 0.07834089512221243, 0.3838413057880994]}
-23.51088409032297  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [-0.4683538422180036, 0.14695401859
2 0.9113612952591796, -0.9838482669789823, 0.4506466371133697, 0.6456121712599778, 0.8264783725578371,
0.562664168655115, -0.8299281852090683, 0.40690300256653256]}
-19.402336030214553  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.462288625222409, -0.9029755259427
3 0.7442695642729447, 0.3802724233363486, 0.4068685903786069, -0.5054707879424198, -0.8686166000900748,
-0.014710838968344575, -0.1362606460134499, 0.8444452252816472]}
-18.27521356600463  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [-0.489685764918109, 0.683231434274
4 0.9115808714640257, -0.0004680515344936964, 0.03760860984717218, 0.4344127744883004, -0.30019645809377127,
-0.48339658188341783, -0.5488933834939806, -0.4735052851773165]}
-17.428674570939506  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.8562209225926345, 0.707720210065
5 0.7449487615498371, 0.4648122665228682, 0.20867633509077188, 0.08516406450475422, 0.22426604902631664,
-0.5503074163123833, -0.40653248591627533, -0.34680731694527833]}
-16.692207021311106  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.9117919458569854, 0.628599902089
6 -0.29426892743208954, -0.7936280881977256, 0.8429787263741186, 0.7932494418330283, 0.31956207523432667,
0.9890773145202636, -0.7936494627564858, 0.9917688731048739]}
-16.26143027545273  {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [-0.9309578475799722, 0.75917958809
Showing all 101 rows.

print("Coefficients: " + str(model.stages[0].coefficients))


print("Intercept : " + str(model.stages[0].intercept))

Coefficients: [0.0,1.9141348939000609,-0.05113069489802313,1.983468491510318,0.553327181606741,0.1749868889333724,0.
0,-0.4667102520693325,-0.9095076912374733,0.26743281796509416]
Intercept : 0.16479345157668718

from pyspark.ml.evaluation import RegressionEvaluator

eval = RegressionEvaluator(labelCol="label", predictionCol="prediction")

print('RMSE:', eval.evaluate(predictions, {eval.metricName: "rmse"}))


print('R2:', eval.evaluate(predictions, {eval.metricName: "r2"}))

RMSE: 10.735968494895788
R2: -0.08201975916265836
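
Since the regularization settings were fixed by hand, a natural next step is a small grid search. A sketch (reusing lr, pipeline, trainingData, testData and the eval evaluator defined above; the grid values are illustrative):

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1, 0.3])
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 0.8])
        .build())

# 3-fold cross-validation using RMSE (the evaluator's default metric) to pick the best model
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid, evaluator=eval, numFolds=3)
cvModel = cv.fit(trainingData)
print('RMSE on test data:', eval.evaluate(cvModel.transform(testData)))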

Classification (supervised learning)


Using libsvm

%fs head FileStore/tables/sample_libsvm_data.txt


[Truncated to first 65536 bytes]


0 128:51 129:159 130:253 131:159 132:50 155:48 156:238 157:252 158:252 159:252 160:237 182:54 183:227 184:253 185:252
186:239 187:233 188:252 189:57 190:6 208:10 209:60 210:224 211:252 212:253 213:252 214:202 215:84 216:252 217:253 21
8:122 236:163 237:252 238:252 239:252 240:253 241:252 242:252 243:96 244:189 245:253 246:167 263:51 264:238 265:253 2
66:253 267:190 268:114 269:253 270:228 271:47 272:79 273:255 274:168 290:48 291:238 292:252 293:252 294:179 295:12 29
6:75 297:121 298:21 301:253 302:243 303:50 317:38 318:165 319:253 320:233 321:208 322:84 329:253 330:252 331:165 344:
7 345:178 346:252 347:240 348:71 349:19 350:28 357:253 358:252 359:195 372:57 373:252 374:252 375:63 385:253 386:252
387:195 400:198 401:253 402:190 413:255 414:253 415:196 427:76 428:246 429:252 430:112 441:253 442:252 443:148 455:85
456:252 457:230 458:25 467:7 468:135 469:253 470:186 471:12 483:85 484:252 485:223 494:7 495:131 496:252 497:225 498:
71 511:85 512:252 513:145 521:48 522:165 523:252 524:173 539:86 540:253 541:225 548:114 549:238 550:253 551:162 567:8
5 568:252 569:249 570:146 571:48 572:29 573:85 574:178 575:225 576:253 577:223 578:167 579:56 595:85 596:252 597:252
598:252 599:229 600:215 601:252 602:252 603:252 604:196 605:130 623:28 624:199 625:252 626:252 627:253 628:252 629:25
2 630:233 631:145 652:25 653:128 654:252 655:253 656:252 657:141 658:37
1 159:124 160:253 161:255 162:63 186:96 187:244 188:251 189:253 190:62 214:127 215:251 216:251 217:253 218:62 241:68
242:236 243:251 244:211 245:31 246:8 268:60 269:228 270:251 271:251 272:94 296:155 297:253 298:253 299:189 323:20 32
4:253 325:251 326:235 327:66 350:32 351:205 352:253 353:251 354:126 378:104 379:251 380:253 381:184 382:15 405:80 40
6:240 407:251 408:193 409:23 432:32 433:253 434:253 435:253 436:159 460:151 461:251 462:251 463:251 464:39 487:48 48
8:221 489:251 490:251 491:172 515:234 516:251 517:251 518:196 519:12 543:253 544:251 545:251 546:89 570:159 571:255 5
72:253 573:253 574:31 597:48 598:228 599:253 600:247 601:140 602:8 625:64 626:251 627:253 628:220 653:64 654:251 655:
253 656:220 681:24 682:193 683:253 684:220
1 125:145 126:255 127:211 128:31 152:32 153:237 154:253 155:252 156:71 180:11 181:175 182:253 183:252 184:71 209:144

from pyspark.ml.classification import LogisticRegression


from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Load training data


dataLoR = spark.read.format("libsvm").load("dbfs:/FileStore/tables/sample_libsvm_data.txt")

# Check the data, we have a label with 0s and 1s


display(dataLoR)

Table
 
  label features
0  {"vectorType": "sparse", "length": 692, "indices": [127, 128, 129, 130, 131, 154, 155, 156, 157, 158, 159, 181, 182, 183, 184, 185,
186, 187, 188, 189, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 262,
263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 289, 290, 291, 292, 293, 294, 295, 296, 297, 300, 301, 302, 316, 317, 318, 319,
320, 321, 328, 329, 330, 343, 344, 345, 346, 347, 348, 349, 356, 357, 358, 371, 372, 373, 374, 384, 385, 386, 399, 400, 401, 412, 413,
414, 426, 427, 428, 429, 440, 441, 442, 454, 455, 456, 457, 466, 467, 468, 469, 470, 482, 483, 484, 493, 494, 495, 496, 497, 510, 511,
512, 520, 521, 522, 523, 538, 539, 540, 547, 548, 549, 550, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 594, 595,
596, 597, 598, 599, 600, 601, 602, 603, 604, 622, 623, 624, 625, 626, 627, 628, 629, 630, 651, 652, 653, 654, 655, 656, 657], "values":
1
[51, 159, 253, 159, 50, 48, 238, 252, 252, 252, 237, 54, 227, 253, 252, 239, 233, 252, 57, 6, 10, 60, 224, 252, 253, 252, 202, 84, 252,
253, 122, 163, 252, 252, 252, 253, 252, 252, 96, 189, 253, 167, 51, 238, 253, 253, 190, 114, 253, 228, 47, 79, 255, 168, 48, 238, 252,
252, 179, 12, 75, 121, 21, 253, 243, 50, 38, 165, 253, 233, 208, 84, 253, 252, 165, 7, 178, 252, 240, 71, 19, 28, 253, 252, 195, 57, 252,
252, 63, 253, 252, 195, 198, 253, 190, 255, 253, 196, 76, 246, 252, 112, 253, 252, 148, 85, 252, 230, 25, 7, 135, 253, 186, 12, 85, 252,
223, 7, 131, 252, 225, 71, 85, 252, 145, 48, 165, 252, 173, 86, 253, 225, 114, 238, 253, 162, 85, 252, 249, 146, 48, 29, 85, 178, 225,
253, 223, 167, 56, 85, 252, 252, 252, 229, 215, 252, 252, 252, 196, 130, 28, 199, 252, 252, 253, 252, 252, 233, 145, 25, 128, 252, 253,
252, 141, 37]}
1  {"vectorType": "sparse", "length": 692, "indices": [158, 159, 160, 161, 185, 186, 187, 188, 189, 213, 214, 215, 216, 217, 240, 241,
242, 243, 244, 245, 267, 268, 269, 270, 271, 295, 296, 297, 298, 322, 323, 324, 325, 326, 349, 350, 351, 352, 353, 377, 378, 379, 380,
381, 404, 405, 406, 407, 408, 431, 432, 433, 434, 435, 459, 460, 461, 462, 463, 486, 487, 488, 489, 490, 514, 515, 516, 517, 518, 542,
543, 544, 545, 569, 570, 571, 572, 573, 596, 597, 598, 599, 600, 601, 624, 625, 626, 627, 652, 653, 654, 655, 680, 681, 682, 683],
2
"values": [124, 253, 255, 63, 96, 244, 251, 253, 62, 127, 251, 251, 253, 62, 68, 236, 251, 211, 31, 8, 60, 228, 251, 251, 94, 155, 253,
253, 189, 20, 253, 251, 235, 66, 32, 205, 253, 251, 126, 104, 251, 253, 184, 15, 80, 240, 251, 193, 23, 32, 253, 253, 253, 159, 151,
251, 251, 251, 39, 48, 221, 251, 251, 172, 234, 251, 251, 196, 12, 253, 251, 251, 89, 159, 255, 253, 253, 31, 48, 228, 253, 247, 140, 8,
Showing all 100 rows.

# Split the data into training and test sets (30% held out for testing)
(trainingDataLoR, testDataLoR) = dataLoR.randomSplit([0.7,0.3])

# Create the Model


loReg = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Train the Model


loregClassifier = loReg.fit(trainingDataLoR)

# Print the coefficients and intercept for linear regression


print("Coefficients: %s" % str(loregClassifier.coefficients))
print("Intercept: %s" % str(loregClassifier.intercept))

Coefficients: (692,[271,272,300,328,350,351,356,378,379,405,406,407,433,434,435,461,462,489,490,511,512,517,539,540],
[-0.00019903767637136554,-0.0001941577234365856,-0.00034527933210997097,-4.680078802426797e-05,0.0002033194519132294
2,0.00022292147220978535,-5.213010992387977e-05,0.0005717340811962772,0.00022634251292289687,0.0003611998147730647,0.
0007059600020105616,0.0004143587482711157,0.000604649706484048,0.0006808244146618602,1.6865828617895953e-05,0.0004753


5748773760707,0.0006162903776126785,0.0004422202688935682,0.00035742728163686986,-5.462326765468662e-05,-0.0002070301
3094870845,0.0004217378448089352,-4.418484614812527e-05,-0.0002559427594559066])
Intercept: -0.2602368710870625

# Make predictions based on the test data


predictionsLoR = loregClassifier.transform(testDataLoR)

#predictionsLoR.select('*').display()
predictionsLoR.select("probability","prediction","label").display()
# Remember that the field probability will give you a probability value for each class

Table
  
  probability prediction label
1  {"vectorType": "dense", "length": 2, "values": [0.6348341003103886, 0.36516589968961144]} 0 0
2  {"vectorType": "dense", "length": 2, "values": [0.6127092158471116, 0.3872907841528884]} 0 0
3  {"vectorType": "dense", "length": 2, "values": [0.6482675547104254, 0.3517324452895746]} 0 0
4  {"vectorType": "dense", "length": 2, "values": [0.6440298352541282, 0.35597016474587184]} 0 0
5  {"vectorType": "dense", "length": 2, "values": [0.6015645505916722, 0.39843544940832776]} 0 0
6  {"vectorType": "dense", "length": 2, "values": [0.5893064637024384, 0.41069353629756156]} 0 0
7  {"vectorType": "dense", "length": 2, "values": [0.6395634012764051, 0.3604365987235949]} 0 0
Showing all 33 rows.

# Evaluate the Model


evaluatorLoR = MulticlassClassificationEvaluator(predictionCol="prediction")
evaluatorLoR.evaluate(predictionsLoR)

#print("Precision is: " + str(evaluatorLoR.evaluate(predictionsLoR)))

Out[574]: 1.0
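
For a binary label like this one it is also common to report the area under the ROC curve. A minimal sketch, reusing predictionsLoR (which contains the rawPrediction column produced by LogisticRegression):

from pyspark.ml.evaluation import BinaryClassificationEvaluator

binEval = BinaryClassificationEvaluator(labelCol="label",
                                        rawPredictionCol="rawPrediction",
                                        metricName="areaUnderROC")
print("AUC:", binEval.evaluate(predictionsLoR))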

K-Means (unsupervised learning)

%fs head FileStore/tables/sample_kmeans_data.txt

0 1:0.0 2:0.0 3:0.0


1 1:0.1 2:0.1 3:0.1
2 1:0.2 2:0.2 3:0.2
3 1:9.0 2:9.0 3:9.0
4 1:9.1 2:9.1 3:9.1
5 1:9.2 2:9.2 3:9.2

from pyspark.ml.clustering import KMeans


from pyspark.ml.evaluation import ClusteringEvaluator

# Load data
# The label column in this case is only for complementary information (the model will not use it for training/estimation)
# It results from the libsvm reading; it's only a sequence number of the rows
dataset = spark.read.format("libsvm").load("dbfs:/FileStore/tables/sample_kmeans_data.txt")
dataset.display()

Table
  label features
1 0  {"vectorType": "sparse", "length": 3, "indices": [], "values": []}
2 1  {"vectorType": "sparse", "length": 3, "indices": [0, 1, 2], "values": [0.1, 0.1, 0.1]}
3 2  {"vectorType": "sparse", "length": 3, "indices": [0, 1, 2], "values": [0.2, 0.2, 0.2]}
4 3  {"vectorType": "sparse", "length": 3, "indices": [0, 1, 2], "values": [9, 9, 9]}
5 4  {"vectorType": "sparse", "length": 3, "indices": [0, 1, 2], "values": [9.1, 9.1, 9.1]}
Showing all 6 rows.


# Train a k-means model with 2 seeds/clusters


kmeans = KMeans().setK(2)
kmodel = kmeans.fit(dataset)

# Just if you want to see the cluster numbers 0 to ...

# kmodel.summary.predictions.select("prediction").distinct().show()
# Just if you want to see the cluster centers
# print(kmodel.clusterCenters())

# Print the 2 clusters centers


kmodel.clusterCenters()

Out[578]: [array([9.1, 9.1, 9.1]), array([0.1, 0.1, 0.1])]

# Make predictions (if useful)


predictions = kmodel.transform(dataset)

display(predictions)

Table
  label features prediction
1 0  {"vectorType": "sparse", "length": 3, "indices": [], "values": []}  1
2 1  {"vectorType": "sparse", "length": 3, "indices": [0, 1, 2], "values": [0.1, 0.1, 0.1]}  1
3 2  {"vectorType": "sparse", "length": 3, "indices": [0, 1, 2], "values": [0.2, 0.2, 0.2]}  1
4 3  {"vectorType": "sparse", "length": 3, "indices": [0, 1, 2], "values": [9, 9, 9]}  0
5 4  {"vectorType": "sparse", "length": 3, "indices": [0, 1, 2], "values": [9.1, 9.1, 9.1]}  0
Showing all 6 rows.

# Evaluate clustering by computing Silhouette score


evaluator = ClusteringEvaluator()

silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

# Shows the results


centers = kmodel.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

Silhouette with squared euclidean distance = 0.9997530305375207


Cluster Centers:
[9.1 9.1 9.1]
[0.1 0.1 0.1]
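
Here k=2 was chosen by inspection. A sketch of comparing a few values of k by their silhouette score, reusing dataset and ClusteringEvaluator from above (the range of k is illustrative):

evaluator = ClusteringEvaluator()
for k in range(2, 5):
    km_model = KMeans().setK(k).setSeed(1).fit(dataset)
    preds = km_model.transform(dataset)
    print("k =", k, "silhouette =", evaluator.evaluate(preds))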

Association Rules - Frequent Pattern Mining (unsupervised learning)


https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html#frequent-pattern-mining
(https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html#frequent-pattern-mining)


from pyspark.ml.fpm import FPGrowth

df = spark.createDataFrame([
(0, [1, 2]),
(1, [1, 2, 3])
], ["id", "items"])

fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)


model = fpGrowth.fit(df)

# Display frequent itemsets.


model.freqItemsets.display()

# Display generated association rules.


model.associationRules.display()

# Transform examines the input items against all the association rules and summarizes the
# consequents as the prediction
model.transform(df).display()

Table
 
  items freq
1  [1] 2
2  [2] 2
3  [2, 1] 2
4  [3] 1
5  [3, 2] 1
6  [3, 2, 1] 1
7  [3, 1] 1
Showing all 7 rows.

Table
    
  antecedent consequent confidence lift support
1  [3, 1]  [2] 1 1 0.5
2  [3]  [2] 1 1 0.5
3  [3]  [1] 1 1 0.5
4  [2]  [1] 1 1 1
5  [3, 2]  [1] 1 1 0.5
6  [1]  [2] 1 1 1

Showing all 6 rows.

Table
  
  id items prediction
1 0  [1, 2] []
2 1  [1, 2, 3] []

Showing all 2 rows.

# Make a prediction based on a new acquisition


df_new = spark.createDataFrame([(0, [1])], ["id", "items"])
model.transform(df_new).display()

Table
  
  id items prediction
1 0  [1]  [2]

Showing 1 row.


# Second example with more items


#from pyspark.ml.fpm import FPGrowth

df1 = spark.createDataFrame([
(0, [1, 2, 5]),
(1, [1, 2, 3, 5]),
(2, [1, 2])
], ["id", "items"])

fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)


model1 = fpGrowth.fit(df1)

# Display frequent itemsets.


model1.freqItemsets.display()

# Display generated association rules.


model1.associationRules.display()

# Transform examines the input items against all the association rules and summarizes the
# consequents as the prediction
model1.transform(df1).display()

Table
 
  items freq
1  [1] 3
2  [2] 3
3  [2, 1] 3
4  [5] 2
5  [5, 2] 2
6  [5, 2, 1] 2
7  [5, 1] 2
Showing all 7 rows.

Table
    
  antecedent consequent confidence lift support
1  [5]  [2] 1 1 0.6666666666666666
2  [5]  [1] 1 1 0.6666666666666666
3  [5, 1]  [2] 1 1 0.6666666666666666
4  [5, 2]  [1] 1 1 0.6666666666666666
5  [2]  [1] 1 1 1
6  [2]  [5] 0.6666666666666666 1 0.6666666666666666
7  [2, 1]  [5] 0.6666666666666666 1 0.6666666666666666
Showing all 9 rows.

Table
  
  id items prediction
1 0  [1, 2, 5] []
2 1  [1, 2, 3, 5] []
3 2  [1, 2]  [5]

Showing all 3 rows.
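
The associationRules output is a regular DataFrame, so weak rules can simply be filtered out. A minimal sketch, reusing model1 from above (the thresholds are illustrative):

# Keep only rules with high confidence and positive lift, strongest first
(model1.associationRules
       .filter("confidence >= 0.8 AND lift >= 1.0")
       .orderBy("confidence", ascending=False)
       .display())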

Collaborative filtering - ALS (unsupervised learning)


Example from: https://spark.apache.org/docs/latest/ml-collaborative-filtering.html (https://spark.apache.org/docs/latest/ml-collaborative-filtering.html)

%fs ls /databricks-datasets/samples/data/mllib

Table
 
  path name size

1 dbfs:/databricks-datasets/samples/data/mllib/.DS_Store .DS_Store 614


2 dbfs:/databricks-datasets/samples/data/mllib/als/ als/ 0
3 dbfs:/databricks-datasets/samples/data/mllib/gmm_data.txt gmm_data.txt 639
4 dbfs:/databricks-datasets/samples/data/mllib/kmeans_data.txt kmeans_data.txt 72
5 dbfs:/databricks-datasets/samples/data/mllib/lr-data/ lr-data/ 0
6 dbfs:/databricks-datasets/samples/data/mllib/lr_data.txt lr_data.txt 197
7 dbfs:/databricks-datasets/samples/data/mllib/pagerank data.txt pagerank data.txt 24
Showing all 20 rows.

# Read the data


#from pyspark.ml.evaluation import RegressionEvaluator
#from pyspark.ml.recommendation import ALS
#from pyspark.sql import Row

#lines = spark.read.text("dbfs:/databricks-datasets/samples/data/mllib/sample_movielens_data.txt").rdd
#parts = lines.map(lambda row: row.value.split("::"))
#ratingsRDD = parts.map(lambda p: Row(userId=int(p[0]), movieId=int(p[1]),
# rating=float(p[2]) ))
#ratings = spark.createDataFrame(ratingsRDD)

# Read the data


from pyspark.sql.functions import *
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

ratings = spark.read.text("dbfs:/databricks-datasets/samples/data/mllib/sample_movielens_data.txt")
ratings = ratings.withColumn('userId', split('value', '::').getItem(0)).withColumn('movieId', split('value',
'::').getItem(1)).withColumn('rating', split('value', '::').getItem(2))
ratings = ratings.drop('value')

#from pyspark.sql.types import IntegerType,BooleanType,DateType


# Convert String to Integer Type

ratings.printSchema()
ratings = ratings.withColumn('userId', col("userId").cast('int'))
ratings = ratings.withColumn('movieId', col("movieId").cast('int'))
ratings = ratings.withColumn('rating', col("rating").cast('int'))
ratings.printSchema()

root
|-- userId: integer (nullable = true)
|-- movieId: string (nullable = true)
|-- rating: string (nullable = true)

root
|-- userId: integer (nullable = true)
|-- movieId: integer (nullable = true)
|-- rating: integer (nullable = true)

ratings.display()

Table
  
  userId movieId rating
1 0 2 3
2 0 3 1
3 0 5 2
4 0 9 4
5 0 11 1
6 0 12 2
7 0 15 1
Truncated results, showing first 1,000 rows.

(training, test) = ratings.randomSplit([0.8, 0.2])


# Build the recommendation model using ALS on the training data


# Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating", coldStartStrategy="drop")
model = als.fit(training)

# Evaluate the model by computing the RMSE on the test data


predictions = model.transform(test)

evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")


rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

Root-mean-square error = 1.9077745567741176

predictions.display()

Table
   
  userId movieId rating prediction
1 2 39 5 3.7424393
2 2 50 1 0.86440027
3 2 54 1 2.3776407
4 2 58 2 0.04814744
5 2 62 1 5.16774
6 2 65 1 0.67538136
7 2 66 3 3.651377
Showing all 307 rows.

# Generate top 10 movie recommendations for each user


userRecs = model.recommendForAllUsers(10)
# Generate top 10 user recommendations for each movie
movieRecs = model.recommendForAllItems(10)

#userRecs.display()
#movieRecs.display()

# Generate top 10 movie recommendations for a specified set of users


users = ratings.select(als.getUserCol()).distinct().limit(3)
userSubsetRecs = model.recommendForUserSubset(users, 10)
# Generate top 10 user recommendations for a specified set of movies
movies = ratings.select(als.getItemCol()).distinct().limit(3)
movieSubSetRecs = model.recommendForItemSubset(movies, 10)

movieSubSetRecs.display()

Table
 
  movieId recommendations
2  [{"userId": 8, "rating": 4.2190223}, {"userId": 14, "rating": 4.1397004}, {"userId": 21, "rating": 4.026415}, {"userId": 10, "rating":
1 3.8691313}, {"userId": 12, "rating": 3.6239657}, {"userId": 4, "rating": 3.5872579}, {"userId": 28, "rating": 3.4324553}, {"userId": 0,
"rating": 3.2764773}, {"userId": 6, "rating": 3.0584242}, {"userId": 5, "rating": 2.706623}]
3  [{"userId": 14, "rating": 2.7682219}, {"userId": 16, "rating": 2.315615}, {"userId": 8, "rating": 2.2152493}, {"userId": 11, "rating":
2 2.1919854}, {"userId": 2, "rating": 2.0185175}, {"userId": 24, "rating": 2.0180826}, {"userId": 22, "rating": 1.7710056}, {"userId": 25,
"rating": 1.6565917}, {"userId": 21, "rating": 1.4824837}, {"userId": 12, "rating": 1.455656}]
5  [{"userId": 16, "rating": 3.0494766}, {"userId": 18, "rating": 2.1953611}, {"userId": 26, "rating": 2.1228826}, {"userId": 15, "rating":
3 1.9957798}, {"userId": 2, "rating": 1.9644576}, {"userId": 3, "rating": 1.9028949}, {"userId": 22, "rating": 1.8979962}, {"userId": 0,
"rating": 1.8318737}, {"userId": 27, "rating": 1.8116448}, {"userId": 23, "rating": 1.800349}]

Showing all 3 rows.
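
The recommendations column is an array of structs; as a sketch (reusing userRecs from above), it can be flattened into one row per (userId, movieId, rating) with explode:

from pyspark.sql.functions import explode, col

flat_recs = (userRecs
             .withColumn("rec", explode("recommendations"))
             .select("userId", col("rec.movieId"), col("rec.rating")))
flat_recs.display()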

ABD13 Spark ML City home prices


Build a linear regression model to predict a city's median home price from its population

Load and preprocess the data


This notebook uses an example dataset of housing prices in different cities.
The goal is to build a simple model that predicts the median house price in a city from its population.

df = spark.read.csv("/databricks-datasets/samples/population-vs-price/data_geo.csv", header="true",
inferSchema="true")
display(df)

Table
    
  2014 rank City State State Code 2014 Population estimate 2015 median sales price
1 101 Birmingham Alabama AL 212247 162.9
2 125 Huntsville Alabama AL 188226 157.7
3 122 Mobile Alabama AL 194675 122.5
4 114 Montgomery Alabama AL 200481 129
5 64 Anchorage[19] Alaska AK 301010 null
6 78 Chandler Arizona AZ 254276 null
7 86 Gilbert[20] Arizona AZ 239277 null
Showing all 294 rows.

# Some of the column names contain spaces. Rename the columns to replace spaces with underscores and shorten the names.
from pyspark.sql.functions import col
exprs = [col(column).alias(column.replace(' ', '_')) for column in df.columns]
data = df.select(exprs)

# Cache data for faster reuse


data.cache()

Out[656]: DataFrame[2014_rank: int, City: string, State: string, State_Code: string, 2014_Population_estimate: int, 2
015_median_sales_price: double]

display(data)

Table
    
  2014_rank City State State_Code 2014_Population_estimate 2015_median_sales_price
1 101 Birmingham Alabama AL 212247 162.9
2 125 Huntsville Alabama AL 188226 157.7
3 122 Mobile Alabama AL 194675 122.5
4 114 Montgomery Alabama AL 200481 129
5 64 Anchorage[19] Alaska AK 301010 null
6 78 Chandler Arizona AZ 254276 null
7 86 Gilbert[20] Arizona AZ 239277 null
Showing all 294 rows.

# Check the number of rows in the DF


data.count()

Out[658]: 294

# Show column details


data.printSchema()

root
|-- 2014_rank: integer (nullable = true)
|-- City: string (nullable = true)
|-- State: string (nullable = true)
|-- State_Code: string (nullable = true)
|-- 2014_Population_estimate: integer (nullable = true)


|-- 2015_median_sales_price: double (nullable = true)

# Describing the columns. Use method describe or summary (summary will give you more info)
#data.summary("count", "min", "25%", "75%", "max").display()
data.summary().display()

Table
     
  summary 2014_rank City State State_Code 2014_Population_estimate 2015_median_sa
1 count 294 294 294 294 293 109
2 mean 147.5 null null null 307284.89761092153 211.26605504587
3 stddev 85.01470461043782 null null null 603487.8272175139 134.01724544927
4 min 1 Abilene Alabama AK 101408 78.6
5 25% 74 null null null 120958 141.1
6 50% 147 null null null 168586 177.2
7 75% 221 null null null 262146 218.9
Showing all 8 rows.

# an alternative approach
from pyspark.sql.functions import mean, min, max
data.select([mean('2014_rank'), min('2014_rank'), max('2014_rank')]).show()

+--------------+--------------+--------------+
|avg(2014_rank)|min(2014_rank)|max(2014_rank)|
+--------------+--------------+--------------+
| 147.5| 1| 294|
+--------------+--------------+--------------+

# Counting and Removing Null values


from pyspark.sql.functions import *
data.select([count(when(isnull(c), c)).alias(c) for c in data.columns]).display()

Table
     
  2014_rank City State State_Code 2014_Population_estimate 2015_median_sales_price
1 0 0 0 0 1 185

Showing 1 row.

# Drop rows with null/missing values


data = data.dropna()
data.count()

Out[663]: 109

# Display specific columns to use as feature(s) and label


data.select("2014_Population_estimate", "2015_median_sales_price").display()

Table
 
  2014_Population_estimate 2015_median_sales_price
1 212247 162.9
2 188226 157.7
3 194675 122.5
4 200481 129
5 1537058 206.1
6 527972 178.1
7 197706 131.8
Showing all 109 rows.

Visualize the "population/feature" vs "price/label" data

model_data = data.selectExpr("2014_Population_estimate as population", "2015_median_sales_price as label")


model_data.display()

Visualization
[Scatter plot of population vs label]
Showing all 109 rows.

Vectorize the data

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=["population"], outputCol="features")


dataset = assembler.transform(model_data)

# Same as above but with pipeline


#from pyspark.ml import Pipeline
#from pyspark.ml.feature import VectorAssembler

#stages = []
#assembler = VectorAssembler(inputCols=["population"], outputCol="features")
#stages += [assembler]
#pipeline = Pipeline(stages=stages)
#pipelineModel = pipeline.fit(model_data)
#dataset = pipelineModel.transform(model_data)

display(dataset)

Table
  
  population label features
1 212247 162.9  {"vectorType": "dense", "length": 1, "values": [212247]}

2 188226 157.7  {"vectorType": "dense", "length": 1, "values": [188226]}

3 194675 122.5  {"vectorType": "dense", "length": 1, "values": [194675]}

4 200481 129  {"vectorType": "dense", "length": 1, "values": [200481]}

5 1537058 206.1  {"vectorType": "dense", "length": 1, "values": [1537058]}

6 527972 178.1  {"vectorType": "dense", "length": 1, "values": [527972]}

7 197706 131.8  {"vectorType": "dense", "length": 1, "values": [197706]}


Showing all 109 rows.

Build the linear regression model


Goal
Predict y = 2015 median housing price
Using feature x = 2014 population estimate

# Import LinearRegression class


from pyspark.ml.regression import LinearRegression

# Create a linear regression object


lr = LinearRegression()

# Fit the model


model = lr.fit(dataset)
print(">>>> Model intercept: %r, coefficient: %r" % (model.intercept, model.coefficients[0]))

>>>> Model intercept: 191.29427575139394, coefficient: 3.779789682338248e-05
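
The fitted model is just a line: predicted_price = intercept + coefficient * population. As a quick sanity check, a minimal sketch that plugs in the intercept and coefficient printed above (212247 is Birmingham's population estimate from the dataset):

# Reproduce one prediction by hand from the printed model parameters.
intercept = 191.29427575139394
coefficient = 3.779789682338248e-05

population = 212247  # Birmingham's 2014 population estimate
predicted_price = intercept + coefficient * population
print(predicted_price)  # ~199.32, matching the first prediction row further below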

# Fit a new model - B - using different regularization parameters


#modelB = lr.fit(dataset)
#print(">>>> ModelB intercept: %r, coefficient: %r" % (modelB.intercept, modelB.coefficients[0]))

Make predictions
Use the transform() method on the model to generate predictions. The following code takes the fitted model (model) and creates a new table (predictions) containing both the label (original sales price) and the prediction (predicted sales price) based on the features (population).

predictions = model.transform(dataset)
predictions.show(10)

+----------+-----+-----------+------------------+
|population|label| features| prediction|
+----------+-----+-----------+------------------+
| 212247|162.9| [212247.0]| 199.3167659584664|
| 188226|157.7| [188226.0]|198.40882267887193|
| 194675|122.5| [194675.0]|198.65258131548592|
| 200481|129.0| [200481.0]|198.87203590444247|
| 1537058|206.1|[1537058.0]|249.39183544694856|
| 527972|178.1| [527972.0]|211.25050693302884|
| 197706|131.8| [197706.0]| 198.7671467407576|
| 346997|685.7| [346997.0]| 204.4100325554172|
| 3928864|434.7|[3928864.0]|339.79707185649573|
| 319504|281.0| [319504.0]|203.37085497805194|
+----------+-----+-----------+------------------+
only showing top 10 rows

#predictionsB = modelB.transform(dataset)
#predictionsB.show()

Evaluate the model


You can evaluate the model's performance by calculating the root mean squared error (RMSE) between the value predicted by the model (in the prediction column) and the actual value (in the label column). Use the PySpark RegressionEvaluator. If you also fit a second model (modelB), the same evaluator lets you compare the two.

from pyspark.ml.evaluation import RegressionEvaluator


evaluator = RegressionEvaluator(metricName="rmse")
RMSE_model = evaluator.evaluate(predictions)
#RMSE_modelB = evaluator.evaluate(predictionsB)
print("Model: Root Mean Squared Error = " + str(RMSE_model))
#print("ModelB: Root Mean Squared Error = " + str(RMSE_modelB))

Model: Root Mean Squared Error = 128.60202684284758
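
As a cross-check, the same RMSE can be computed directly from the predictions DataFrame; a minimal sketch, assuming the predictions table created above:

# RMSE = sqrt(mean((prediction - label)^2)), computed with DataFrame aggregations.
from pyspark.sql.functions import avg, sqrt, col

diff = col("prediction") - col("label")
rmse_manual = (predictions
               .select((diff * diff).alias("sq_err"))
               .agg(sqrt(avg("sq_err")).alias("rmse"))
               .first()["rmse"])
print("Manual RMSE =", rmse_manual)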

Plot residuals versus fitted values


Residual analysis is an important step in evaluating a model's performance. Ideally, the model's residuals -- the difference
between the predicted value and the actual value -- should be small and symmetric around 0. In this case there are some very
large residuals and poor symmetry around 0. This is not surprising; housing prices depend on many more variables than a city's
population.

display(model, dataset)
#display(modelB, dataset)

[Residual plot: residuals vs fitted values]
Showing all 102 rows.
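
Beyond the built-in plot, the residuals can also be materialized as a column for inspection; a minimal sketch, assuming the model and dataset variables from above (here residual = label - prediction, matching Spark's training-summary convention):

# Compute residuals explicitly so they can be inspected or aggregated.
from pyspark.sql.functions import col

residuals_df = (model.transform(dataset)
                .withColumn("residual", col("label") - col("prediction"))
                .select("prediction", "label", "residual"))
residuals_df.orderBy(col("residual").desc()).show(5)  # largest positive residuals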

ABD13 Spark ML Wine quality

Build a linear regression model to predict wine quality


%fs ls /databricks-datasets/wine-quality/

Table
   
  path name size modificationTime
1 dbfs:/databricks-datasets/wine-quality/README.md README.md 1066 1594262736000
2 dbfs:/databricks-datasets/wine-quality/winequality-red.csv winequality-red.csv 84199 1594262736000
3 dbfs:/databricks-datasets/wine-quality/winequality-white.csv winequality-white.csv 264426 1594262736000

Showing all 3 rows.

#winequality = spark.read.csv("dbfs:/databricks-datasets/wine-quality/winequality-red.csv", header="true", sep=",", inferSchema="true") \
#              .toDF("fixed_acidity","volatile_acidity","citric_acid","residual_sugar",
#                    "chlorides","free_sulfur_dioxide","total_sulfur_dioxide",
#                    "density","pH","sulphates","alcohol","quality")

winequality = spark.read.csv("dbfs:/databricks-datasets/wine-quality/winequality-red.csv", header="true", sep=";", inferSchema="true")
winequality.display()

Table
      
  fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide
1 7.4 0.7 0 1.9 0.076 11 34
2 7.8 0.88 0 2.6 0.098 25 67
3 7.8 0.76 0.04 2.3 0.092 15 54
4 11.2 0.28 0.56 1.9 0.075 17 60
5 7.4 0.7 0 1.9 0.076 11 34
6 7.4 0.66 0 1.8 0.075 13 40
7 7.9 0.6 0.06 1.6 0.069 15 59
Truncated results, showing first 1,000 rows.

winequality.columns

Out[679]: ['fixed acidity',


'volatile acidity',
'citric acid',
'residual sugar',
'chlorides',
'free sulfur dioxide',
'total sulfur dioxide',
'density',
'pH',
'sulphates',
'alcohol',
'quality']

# Some of the column names contain spaces. Rename the columns to replace spaces with underscores and shorten the names.
from pyspark.sql.functions import col
exprs = [col(column).alias(column.replace(' ', '_')) for column in winequality.columns]
#wq = winequality.select(*exprs)
wq = winequality.select(exprs)

# Cache data for faster reuse


wq.cache()

Out[680]: DataFrame[fixed_acidity: double, volatile_acidity: double, citric_acid: double, residual_sugar: double, chl
orides: double, free_sulfur_dioxide: double, total_sulfur_dioxide: double, density: double, pH: double, sulphates: do
uble, alcohol: double, quality: int]

Perform multiple linear regression to estimate the quality of the wine based on its components

from pyspark.ml.regression import LinearRegression


#from pyspark.ml.regression import LinearRegressionSummary
from pyspark.ml.feature import VectorAssembler
#from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline

# Split the data into training and test sets (20% held out for testing)
(TrainSet, TestSet) = wq.randomSplit([0.8,0.2])

# Vectorize features
lrAssembler = VectorAssembler(inputCols=wq.drop("quality").columns, outputCol="features")
# example on how to use less features
#lrAssembler = VectorAssembler(inputCols=["residual_sugar","alcohol"], outputCol="features")
#use .transform().show() if you want to see the vectorized features: lrAssembler.transform(wq).show()

# Configure model (by default the featuresCol="features")


lr = LinearRegression(labelCol="quality")

# Chain Vectorization and Linear Regression in a Pipeline


pipeline = Pipeline(stages=[lrAssembler,lr])

# Train the Model


lrModel = pipeline.fit(TrainSet)

# Make Predictions
lrPredictions = lrModel.transform(TestSet)

# Show Predictions
display(lrPredictions.select("quality","prediction","features"))

Table
 
  quality prediction features
1 6 6.812013814635069  {"vectorType": "dense", "length": 11, "values": [5, 0.38, 0.01, 1.6, 0.048, 26, 60, 0.99084, 3.7, 0.75, 14]}

2 6 6.901158006990844  {"vectorType": "dense", "length": 11, "values": [5.2, 0.34, 0, 1.8, 0.05, 27, 63, 0.9916, 3.68, 0.79, 14]}

3 7 6.250075230396158  {"vectorType": "dense", "length": 11, "values": [5.3, 0.57, 0.01, 1.7, 0.054, 5, 27, 0.9934, 3.57, 0.84, 12.5]}

4 8 6.781282409967318  {"vectorType": "dense", "length": 11, "values": [5.5, 0.49, 0.03, 1.8, 0.044, 28, 87, 0.9908, 3.5, 0.82, 14]}

5 8 5.898732615862613  {"vectorType": "dense", "length": 11, "values": [5.6, 0.85, 0.05, 1.4, 0.045, 12, 88, 0.9924, 3.56, 0.82, 12.9]}

6 6 5.930470507123344  {"vectorType": "dense", "length": 11, "values": [5.9, 0.29, 0.25, 13.4, 0.067, 72, 160, 0.99721, 3.33, 0.54, 10.3]

7 6 6.2845071417946645  {"vectorType": "dense", "length": 11, "values": [5.9, 0.44, 0, 1.6, 0.042, 3, 11, 0.9944, 3.48, 0.85, 11.7]}

Showing all 333 rows.

# Check the vectorized features


display(lrModel.stages[0].transform(wq))

Table
     
  fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide

1 7.4 0.7 0 1.9 0.076 11 34
2 7.8 0.88 0 2.6 0.098 25 67
3 7.8 0.76 0.04 2.3 0.092 15 54
4 11.2 0.28 0.56 1.9 0.075 17 60
5 7.4 0.7 0 1.9 0.076 11 34
6 7.4 0.66 0 1.8 0.075 13 40
7 7.9 0.6 0.06 1.6 0.069 15 59
8 7.3 0.65 0 1.2 0.065 15 21
9 7.8 0.58 0.02 2 0.073 9 18
10 7.5 0.5 0.36 6.1 0.071 17 102
11 6.7 0.58 0.08 1.8 0.097 15 65
12 7.5 0.5 0.36 6.1 0.071 17 102
13 5.6 0.615 0 1.6 0.089 16 59
14 7.8 0.61 0.29 1.6 0.114 9 29
15 8.9 0.62 0.18 3.8 0.176 52 145
16 8.9 0.62 0.19 3.9 0.17 51 148
17 8.5 0.28 0.56 1.8 0.092 35 103
Truncated results, showing first 1,000 rows.

# Print the coefficients and intercept for linear regression
print( "Coefficients List: ", lrModel.stages[1].coefficients )
print( "Intercept: ", lrModel.stages[1].intercept )
#print("Coefficients List: %s" % str(lrModel.stages[1].coefficients))
#print("Intercept: %s" % str(lrModel.stages[1].intercept))

print("______")
print("Coefficients:")
coefficients = lrModel.stages[1].coefficients

i=0
for feature in wq.drop("quality").columns:
    print(feature,":",coefficients[i])
    i+=1

Coefficients List: [0.04412133962368335,-1.145956599303815,-0.27217881452370296,0.024830741246880925,-1.431854793062


3557,0.005919420666068606,-0.003174571695856731,-13.053725661361636,-0.39876440295018467,0.8801817878882452,0.2862842
7013387836]
Intercept: 16.836627323331708
______
Coefficients:
fixed_acidity : 0.04412133962368335
volatile_acidity : -1.145956599303815
citric_acid : -0.27217881452370296
residual_sugar : 0.024830741246880925
chlorides : -1.4318547930623557
free_sulfur_dioxide : 0.005919420666068606
total_sulfur_dioxide : -0.003174571695856731
density : -13.053725661361636
pH : -0.39876440295018467
sulphates : 0.8801817878882452
alcohol : 0.28628427013387836
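
To see how these coefficients combine into a single prediction, a hedged sketch that rebuilds the model output for one row by hand as intercept + sum(coefficient_i * feature_i), assuming wq, lrModel, and coefficients from the cells above:

# Rebuild one prediction manually from the intercept and the coefficient vector.
row = wq.drop("quality").first()  # feature values in the same order as the assembler inputCols
manual_prediction = lrModel.stages[1].intercept + sum(
    c * v for c, v in zip(coefficients, row))
print("Manual prediction for the first row:", manual_prediction)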

# Check the model error and residuals


trainingSummary = lrModel.stages[1].summary
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("R2 : %f" % trainingSummary.r2)
trainingSummary.residuals.show()

RMSE: 0.640381
R2 : 0.365982
+--------------------+
| residuals|
+--------------------+
| -1.9391673311439117|
| 0.23068518081834455|
| 0.24707841053737667|
| -0.5782654426454705|
| 1.3413690754061012|
| 0.32905422766333814|
| -0.7809713840657988|
|-0.03194714354228...|
| 0.4859116081054946|
|-0.17243726153934347|
| 0.6811151925144436|
| 0.6764826321286925|
|-0.21924259397254353|
| -0.901158006990844|
| 1.0881932571611088|
| 0.1701630743161715|
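
The summary above reports error on the training data; a minimal sketch (assuming lrPredictions, the held-out test-set predictions created earlier) that scores the model on the test split with RegressionEvaluator:

# Evaluate the pipeline model on the held-out test set.
from pyspark.ml.evaluation import RegressionEvaluator

testEvaluator = RegressionEvaluator(labelCol="quality", predictionCol="prediction", metricName="rmse")
print("Test RMSE:", testEvaluator.evaluate(lrPredictions))
print("Test R2  :", testEvaluator.evaluate(lrPredictions, {testEvaluator.metricName: "r2"}))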

Build a K-means model to identify 4 wine clusters


#wc = spark.read.csv("/FileStore/tables/wine_clustering.csv", header="true", sep=",", inferSchema="true") \
# .toDF("Alcohol","Malic_Acid","Ash","Ash_Alcanity","Magnesium", \
# "Total_Phenols","Flavanoids","Nonflavanoid_Phenols", \
# "Proanthocyanins","Color_Intensity","Hue","OD280","Proline")
#display(wc)

wc = spark.read.csv("dbfs:/databricks-datasets/wine-quality/winequality-red.csv", header="true", sep=";", inferSchema="true")
wc.display()

Table
      
  fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide
1 7.4 0.7 0 1.9 0.076 11 34
2 7.8 0.88 0 2.6 0.098 25 67
3 7.8 0.76 0.04 2.3 0.092 15 54
4 11.2 0.28 0.56 1.9 0.075 17 60
5 7.4 0.7 0 1.9 0.076 11 34
6 7.4 0.66 0 1.8 0.075 13 40
7 7.9 0.6 0.06 1.6 0.069 15 59
Truncated results, showing first 1,000 rows.

from pyspark.ml.clustering import KMeans


from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline

clusterAssembler1 = VectorAssembler(inputCols=wc.columns, outputCol="features")


featureSet = clusterAssembler1.transform(wc).select("features")

# Trains a k-means model.


kmeans1 = KMeans().setK(4).setSeed(1)
clusterModel1 = kmeans1.fit(featureSet)

# Just if you want to see the cluster numbers 0 to ...


#clusterModel1.summary.predictions.select("prediction").distinct().show()
# Just if you want to see the cluster centers
#print(clusterModel1.clusterCenters())

# Make predictions
clusterPredictions1 = clusterModel1.transform(featureSet)

display(clusterPredictions1)

Table
 
  features prediction
1  {"vectorType": "dense", "length": 12, "values": [7.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.9978, 3.51, 0.56, 9.4, 5]} 1
2  {"vectorType": "dense", "length": 12, "values": [7.8, 0.88, 0, 2.6, 0.098, 25, 67, 0.9968, 3.2, 0.68, 9.8, 5]} 0
3  {"vectorType": "dense", "length": 12, "values": [7.8, 0.76, 0.04, 2.3, 0.092, 15, 54, 0.997, 3.26, 0.65, 9.8, 5]} 3
4  {"vectorType": "dense", "length": 12, "values": [11.2, 0.28, 0.56, 1.9, 0.075, 17, 60, 0.998, 3.16, 0.58, 9.8, 6]} 3
5  {"vectorType": "dense", "length": 12, "values": [7.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.9978, 3.51, 0.56, 9.4, 5]} 1
6  {"vectorType": "dense", "length": 12, "values": [7.4, 0.66, 0, 1.8, 0.075, 13, 40, 0.9978, 3.51, 0.56, 9.4, 5]} 3
7  {"vectorType": "dense", "length": 12, "values": [7.9, 0.6, 0.06, 1.6, 0.069, 15, 59, 0.9964, 3.3, 0.46, 9.4, 5]} 3
Truncated results, showing first 1,000 rows.
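
A quick follow-up (assuming clusterPredictions1 from above) to see how many wines land in each of the 4 clusters:

# Count the number of rows assigned to each cluster.
clusterPredictions1.groupBy("prediction").count().orderBy("prediction").show()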

from pyspark.ml.evaluation import ClusteringEvaluator

# Evaluate clustering by computing Silhouette score


evaluator1 = ClusteringEvaluator()

silhouette1 = evaluator1.evaluate(clusterPredictions1)
print("Silhouette with squared euclidean distance = " + str(silhouette1))

# Shows the result.


centers = clusterModel1.clusterCenters()
print("Cluster Centers: ")
for center in centers:
print(center)

Silhouette with squared euclidean distance = 0.656312517141184


Cluster Centers:
[ 8.07333333 0.551 0.27810526 2.8822807 0.09384211 24.48421053
80.58245614 0.99698351 3.31915789 0.65 10.14046784 5.47719298]
[ 8.54964438 0.51737553 0.27648649 2.38869132 0.08406828 8.19914651
20.31578947 0.99662371 3.30308677 0.64859175 10.60493125 5.74110953]

[7.96574074e+00 5.52175926e-01 3.11388889e-01 3.27546296e+00


8.85462963e-02 2.94583333e+01 1.28722222e+02 9.96942130e-01
3.23898148e+00 6.81944444e-01 9.90370370e+00 5.13888889e+00]
[ 8.21371769 0.52405567 0.25055666 2.39582505 0.08837177 18.80815109
46.027833 0.99674239 3.33326044 0.67101392 10.44025845 5.68588469]

The same as above but using pipelines

(change the number of clusters and check the new evaluation results)

from pyspark.ml.clustering import KMeans


from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline

# Split the data into training and test sets (20% held out for testing)
(clusterTrainSet, clusterTestSet) = wc.randomSplit([0.8,0.2])

# Vectorize features
clusterAssembler2 = VectorAssembler(inputCols=wc.columns, outputCol="features")

# Configure model
kmeans2 = KMeans().setK(4).setSeed(1)

# Chain Vectorization and K-Means in a Pipeline


pipeline2 = Pipeline(stages=[clusterAssembler2,kmeans2])

# Train Model
clusterModel2 = pipeline2.fit(clusterTrainSet)

# Make Predictions
clusterPredictions2 = clusterModel2.transform(clusterTestSet)

# Show Predictions
display(clusterPredictions2)

Table
      
  fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide
1 5 0.74 0 1.2 0.041 16 46
2 5 1.04 0.24 1.6 0.05 32 96
3 5.3 0.47 0.11 2.2 0.048 16 89
4 5.4 0.58 0.08 1.9 0.059 20 31
5 5.4 0.74 0.09 1.7 0.089 16 26
6 5.6 0.31 0.37 1.4 0.074 12 96
7 5.6 0.66 0 2.2 0.087 3 11
Showing all 305 rows.
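
The silhouette score and cluster centers below come from evaluating the pipeline model; a minimal sketch of that evaluation (assuming clusterPredictions2 and clusterModel2 from above; for the pipeline model the KMeans stage is clusterModel2.stages[1]):

# Evaluate the pipeline clustering model and show its cluster centers.
from pyspark.ml.evaluation import ClusteringEvaluator

evaluator2 = ClusteringEvaluator()
silhouette2 = evaluator2.evaluate(clusterPredictions2)
print("Silhouette with squared euclidean distance = " + str(silhouette2))

print("Cluster Centers: ")
for center in clusterModel2.stages[1].clusterCenters():
    print(center)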

Silhouette with squared euclidean distance = 0.6761510776375441


Cluster Centers:
[ 8.25967366 0.51685315 0.27184149 2.52109557 0.08976224 21.8986014
54.87412587 0.99683555 3.32624709 0.67759907 10.42408702 5.63636364]
[ 8.43363636 0.5225 0.26460606 2.36219697 0.08490758 8.97878788
22.31666667 0.99660085 3.31360606 0.64772727 10.57063131 5.72727273]
[7.9000e+00 3.0000e-01 6.8000e-01 8.3000e+00 5.0000e-02 3.7500e+01
2.8350e+02 9.9316e-01 3.0100e+00 5.1000e-01 1.2300e+01 7.0000e+00]
[7.88768473e+00 5.75467980e-01 2.72167488e-01 3.29285714e+00
9.09753695e-02 2.64655172e+01 1.07408867e+02 9.96987882e-01
3.28684729e+00 6.62758621e-01 9.97873563e+00 5.32019704e+00]
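
To follow the suggestion above of changing the number of clusters, a hedged sketch that sweeps k and prints the silhouette score for each value (assumes clusterAssembler2, clusterTrainSet, and clusterTestSet from the cells above; the range of k is arbitrary):

# Try several values of k and compare silhouette scores on the test split.
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml import Pipeline

evaluator_k = ClusteringEvaluator()
for k in range(2, 9):
    km = KMeans().setK(k).setSeed(1)
    model_k = Pipeline(stages=[clusterAssembler2, km]).fit(clusterTrainSet)
    preds_k = model_k.transform(clusterTestSet)
    print("k =", k, "silhouette =", evaluator_k.evaluate(preds_k))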
