ABD00 Notebooks Combined - Databricks
Python
int = 100; dec = 222.222   # note: these names shadow the built-in int type; clearer names are used below
print(type(int), type(dec))
Int = 100
Decimal = 222.222
print('single Q')
print("single Q")
print("lot's of")
single Q
single Q
lot's of
# dict was presumably defined earlier, e.g. dict = {'Lisboa': 1900, 'Porto': 4000}
# (note that this name shadows the built-in dict type)
#list(dict)
list(dict.items())
print(dict['Lisboa'])
print(dict['Porto'])
# Much more on lists but not relevant for our Spark examples
1900
4000
anonymous functions
x = lambda a : a + 2
print(x(3))
f = lambda x, y : x + y
f(1,1)
Out[11]: 2
# times2 was presumably defined earlier as a lambda, e.g. times2 = lambda x: x * 2
print(times2)
type(times2)
print(times2(2))
seq = [1,2,3,4,5]
print(seq)
[1, 2, 3, 4, 5]
list(map(times2,seq))
print(*map(times2,seq))
2 4 6 8 10
Out[19]: [2, 4]
2 4 6 8 10
Out[20]: [2, 4, 6, 8, 10]
print("Hi")
Hi
/databricks/driver
%sh ls -l
total 1304
drwxr-xr-x 2 root root 4096 Jan 1 1970 azure
drwxr-xr-x 2 root root 4096 Jan 1 1970 conf
drwxr-xr-x 3 root root 4096 Jan 5 11:00 eventlogs
-r-xr-xr-x 1 root root 3037 Jan 1 1970 hadoop_accessed_config.lst
drwxr-xr-x 2 root root 4096 Jan 5 11:01 logs
%sh ps  # ps lists the currently running processes and their PIDs, along with other information depending on the options used
%sh env
SHELL=/bin/bash
PIP_NO_INPUT=1
SUDO_GID=0
PYTHONHASHSEED=0
DISABLE_LOCAL_FILESYSTEM=false
JAVA_HOME=/usr/lib/jvm/zulu8-ca-amd64/jre/
MLR_PYTHONPATH=/etc/mlr_python_path
MLFLOW_PYTHON_EXECUTABLE=/databricks/spark/scripts/mlflow_python.sh
JAVA_OPTS= -Djava.io.tmpdir=/local_disk0/tmp -XX:-OmitStackTraceInFastThrow ... (long list of JVM options; output truncated in this export)
%sh free
%sh df  # df displays available space on a file system: view used space, see free disk space, and show what filesystems are mounted
Filesystem 1K-blocks Used Available Use% Mounted on
/var/lib/lxc/base-images/release__12.0.x-snapshot-cpu-ml-scala2.12__databricks-universe__head__93a7752__dde6fe5__jenkins__b0f3aff__format-2 153707984 17438296 128388984 12% /
none 492 0 492 0% /dev
/dev/xvdb 153707984 17438296 128388984 12% /mnt/readonly
/dev/mapper/vg-lv 455461216 10514868 421736776 3% /local_disk0
tmpfs 7808672 0 7808672 0% /sys/fs/cgroup
tmpfs 7808672 0 7808672 0% /dev/shm
tmpfs (remaining output truncated in this export)
Python 3.9.5
%fs ls /FileStore/tables
Table
path name size modificationTime
1 dbfs:/FileStore/tables/Managers.csv Managers.csv 133114 1667413379000
2 dbfs:/FileStore/tables/Teams.csv Teams.csv 524526 1667413401000
3 dbfs:/FileStore/tables/alice_in_wonderland.txt alice_in_wonderland.txt 148311 1663266909000
4 dbfs:/FileStore/tables/d2buy.csv d2buy.csv 407 1667434540000
5 dbfs:/FileStore/tables/linkFile.txt linkFile.txt 72 1663266909000
6 dbfs:/FileStore/tables/movielens.txt movielens.txt 616155 1667413258000
7 dbfs:/FileStore/tables/movielensABD.csv movielensABD.csv 616155 1667412560000
Showing all 13 rows.
Some standard fs linux commands for your knowledge (cp - copy, mv - move, rm - remove)
#dbutils.fs.rm('/FileStore/tables/dataset2.csv')
#%fs rm /FileStore/tables/dataset2.csv
%fs ls
Table
path name size modificationTime
1 dbfs:/FileStore/ FileStore/ 0 0
2 dbfs:/cp/ cp/ 0 0
3 dbfs:/databricks-datasets/ databricks-datasets/ 0 0
4 dbfs:/databricks-results/ databricks-results/ 0 0
5 dbfs:/delta/ delta/ 0 0
6 dbfs:/local_disk0/ local_disk0/ 0 0
7 dbfs:/tmp/ tmp/ 0 0
Showing all 8 rows.
%fs ls /databricks-datasets/
Table
path name size modificationTime
1 dbfs:/databricks-datasets/ databricks-datasets/ 0 0
2 dbfs:/databricks-datasets/COVID/ COVID/ 0 0
3 dbfs:/databricks-datasets/README.md README.md 976 1532468253000
4 dbfs:/databricks-datasets/Rdatasets/ Rdatasets/ 0 0
5 dbfs:/databricks-datasets/SPARK_README.md SPARK_README.md 3359 1455043490000
6 dbfs:/databricks-datasets/adult/ adult/ 0 0
7 dbfs:/databricks-datasets/airlines/ airlines/ 0 0
Showing all 55 rows.
%fs ls /databricks-datasets/definitive-guide/data/
Table
path name size modificationTime
1 dbfs:/databricks-datasets/definitive-guide/data/activity-data/ activity-data/ 0 0
2 dbfs:/databricks-datasets/definitive-guide/data/bike-data/ bike-data/ 0 0
3 dbfs:/databricks-datasets/definitive-guide/data/binary-classification/ binary-classification/ 0 0
4 dbfs:/databricks-datasets/definitive-guide/data/clustering/ clustering/ 0 0
5 dbfs:/databricks-datasets/definitive-guide/data/flight-data/ flight-data/ 0 0
6 dbfs:/databricks-datasets/definitive-guide/data/flight-data-hive/ flight-data-hive/ 0 0
7 dbfs:/databricks-datasets/definitive-guide/data/multiclass-classification/ multiclass-classification/ 0 0
Showing all 14 rows.
%fs ls dbfs:/databricks-datasets/samples/
Table
path name size modificationTime
1 dbfs:/databricks-datasets/samples/adam/ adam/ 0 0
2 dbfs:/databricks-datasets/samples/data/ data/ 0 0
3 dbfs:/databricks-datasets/samples/docs/ docs/ 0 0
4 dbfs:/databricks-datasets/samples/lending_club/ lending_club/ 0 0
5 dbfs:/databricks-datasets/samples/newsgroups/ newsgroups/ 0 0
6 dbfs:/databricks-datasets/samples/people/ people/ 0 0
7 dbfs:/databricks-datasets/samples/population-vs-price/ population-vs-price/ 0 0
Showing all 7 rows.
%fs ls dbfs:/databricks-datasets/samples/people
Table
path name size modificationTime
1 dbfs:/databricks-datasets/samples/people/people.json people.json 77 1534435526000
Showing 1 row.
%fs ls dbfs:/databricks-datasets/definitive-guide/data/flight-data/csv
Table
path name size modificationTime
1 dbfs:/databricks-datasets/definitive-guide/data/flight-data/csv/2010-summary.csv 2010-summary.csv 7121 152219204900
2 dbfs:/databricks-datasets/definitive-guide/data/flight-data/csv/2011-summary.csv 2011-summary.csv 7069 152219204900
3 dbfs:/databricks-datasets/definitive-guide/data/flight-data/csv/2012-summary.csv 2012-summary.csv 6857 152219204900
4 dbfs:/databricks-datasets/definitive-guide/data/flight-data/csv/2013-summary.csv 2013-summary.csv 7020 152219205000
5 dbfs:/databricks-datasets/definitive-guide/data/flight-data/csv/2014-summary.csv 2014-summary.csv 6729 152219205000
6 dbfs:/databricks-datasets/definitive-guide/data/flight-data/csv/2015-summary.csv 2015-summary.csv 7080 152219205000
dbutils.help()
This module provides various utilities for users to interact with the rest of Databricks.
credentials: DatabricksCredentialUtils -> Utilities for interacting with credentials within notebooks
data: DataUtils -> Utilities for understanding and interacting with datasets (EXPERIMENTAL)
fs: DbfsUtils -> Manipulates the Databricks filesystem (DBFS) from the console
jobs: JobsUtils -> Utilities for leveraging jobs features
library: LibraryUtils -> Utilities for session isolated libraries
meta: MetaUtils -> Methods to hook into the compiler (EXPERIMENTAL)
notebook: NotebookUtils -> Utilities for the control flow of a notebook (EXPERIMENTAL)
preview: Preview -> Utilities under preview category
secrets: SecretUtils -> Provides utilities for leveraging secrets within notebooks
widgets: WidgetsUtils -> Methods to create and get bound value of input widgets inside notebooks
dbutils.fs.help()
dbutils.fs provides utilities for working with FileSystems. Most methods in this package can take either a DBFS path (e.g., "/foo" or "dbfs:/foo"), or another
FileSystem URI. For more info about a method, use dbutils.fs.help("methodName"). In notebooks, you can also use the %fs shorthand to access DBFS.
The %fs shorthand maps straightforwardly onto dbutils calls. For example, "%fs head --maxBytes=10000 /file/path" translates into "dbutils.fs.head("/file/path",
maxBytes = 10000)".
fsutils
cp(from: String, to: String, recurse: boolean = false): boolean -> Copies a file or directory, possibly across FileSystems
head(file: String, maxBytes: int = 65536): String -> Returns up to the first 'maxBytes' bytes of the given file as a String encoded in UTF-8
ls(dir: String): Seq -> Lists the contents of a directory
mkdirs(dir: String): boolean -> Creates the given directory if it does not exist, also creating any necessary parent directories
mv(from: String, to: String, recurse: boolean = false): boolean -> Moves a file or directory, possibly across FileSystems
put(file: String, contents: String, overwrite: boolean = false): boolean -> Writes the given String out to a file, encoded in UTF-8
rm(dir: String, recurse: boolean = false): boolean -> Removes a file or directory
mount
mount(source: String, mountPoint: String, encryptionType: String = "", owner: String = null, extraConfigs: Map = Map.empty[String, String]):
boolean -> Mounts the given source directory into DBFS at the given mount point
mounts: Seq -> Displays information about what is mounted within DBFS
refreshMounts: boolean -> Forces all machines in this cluster to refresh their mount cache, ensuring they receive the most recent information
dbutils.fs.help('cp')
/**
* Copies a file or directory, possibly across FileSystems..
*
* Example: cp("/mnt/my-folder/a", "s3n://bucket/b")
*
* @param from FileSystem URI of the source file or directory
* @param to FileSystem URI of the destination file or directory
* @param recurse if true, all files and directories will be recursively copied
* @return true if all files were successfully copied
*/
cp(from: java.lang.String, to: java.lang.String, recurse: boolean = false): boolean
#dbutils.fs.cp("dbfs:/FileStore/old_file.txt", "file:/tmp/new/new_file.txt")
dbutils.widgets.help()
dbutils.widgets provides utilities for working with notebook widgets. You can create different types of widgets and get their bound value. For more info about a
method, use dbutils.widgets.help("methodName").
combobox(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a combobox input widget with a given name, default value
and choices
dropdown(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a dropdown input widget a with given name, default value and
choices
get(name: String): String -> Retrieves current value of an input widget
getArgument(name: String, optional: String): String -> (DEPRECATED) Equivalent to get
multiselect(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a multiselect input widget with a given name, default value
and choices
1/2.0
Out[41]: 0.5
spark
SparkSession - hive
SparkContext
Spark UI
Version
v3.3.1
Master
local[8]
AppName
Databricks Shell
spark.version
Out[43]: '3.3.1'
#spark.sparkContext.appName
spark.conf.get("spark.app.name")
spark.sparkContext.getConf().getAll()
+--------------------+
| value|
+--------------------+
| # Datasets|
| |
|This folder conta...|
| |
| |
|The datasets are ...|
| |
| ## Flight Data|
| |
|This data comes f...|
| |
| ## Retail Data|
| |
|Daqing Chen, Sai ...|
| |
|The data was down...|
| |
# tf was presumably created earlier with spark.read.text() on the definitive-guide README file
print(tf)
DataFrame[value: string]
type(tf)
Out[48]: pyspark.sql.dataframe.DataFrame
tf.display()
#display(tf)
Table
value
1 # Datasets
2
3 This folder contains all of the datasets used in The Definitive Guide.
4
tf.dtypes
tf.schema
tf.printSchema()
root
|-- value: string (nullable = true)
diamonds=spark.read.csv('/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv', header='True')
diamonds.show()
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat| cut|color|clarity|depth|table|price| x| y| z|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
| 1| 0.23| Ideal| E| SI2| 61.5| 55| 326|3.95|3.98|2.43|
| 2| 0.21| Premium| E| SI1| 59.8| 61| 326|3.89|3.84|2.31|
| 3| 0.23| Good| E| VS1| 56.9| 65| 327|4.05|4.07|2.31|
| 4| 0.29| Premium| I| VS2| 62.4| 58| 334| 4.2|4.23|2.63|
| 5| 0.31| Good| J| SI2| 63.3| 58| 335|4.34|4.35|2.75|
| 6| 0.24|Very Good| J| VVS2| 62.8| 57| 336|3.94|3.96|2.48|
| 7| 0.24|Very Good| I| VVS1| 62.3| 57| 336|3.95|3.98|2.47|
| 8| 0.26|Very Good| H| SI1| 61.9| 55| 337|4.07|4.11|2.53|
| 9| 0.22| Fair| E| VS2| 65.1| 61| 337|3.87|3.78|2.49|
| 10| 0.23|Very Good| H| VS1| 59.4| 61| 338| 4|4.05|2.39|
| 11| 0.3| Good| J| SI1| 64| 55| 339|4.25|4.28|2.73|
| 12| 0.23| Ideal| J| VS1| 62.8| 56| 340|3.93| 3.9|2.46|
| 13| 0.22| Premium| F| SI1| 60.4| 61| 342|3.88|3.84|2.33|
| 14| 0.31| Ideal| J| SI2| 62.2| 54| 344|4.35|4.37|2.71|
| 15| 0.2| Premium| E| SI2| 60.2| 62| 345|3.79|3.75|2.27|
| 16| 0.32| Premium| E| I1| 60.9| 58| 345|4.38|4.42|2.68|
| 17| 0.3| Ideal| I| SI2| 62| 54| 348|4.31|4.34|2.68|
| 18| 0.3| Good| J| SI1| 63.4| 54| 351|4.23|4.29| 2.7|
diamonds.printSchema()
root
|-- _c0: string (nullable = true)
spark.read.csv('/databricks-datasets/learning-spark-v2/flights/departuredelays.csv', inferSchema='true', header='True').show()
+-------+-----+--------+------+-----------+
| date|delay|distance|origin|destination|
+-------+-----+--------+------+-----------+
|1011245| 6| 602| ABE| ATL|
|1020600| -8| 369| ABE| DTW|
|1021245| -2| 602| ABE| ATL|
|1020605| -4| 602| ABE| ATL|
|1031245| -4| 602| ABE| ATL|
|1030605| 0| 602| ABE| ATL|
|1041243| 10| 602| ABE| ATL|
|1040605| 28| 602| ABE| ATL|
|1051245| 88| 602| ABE| ATL|
|1050605| 9| 602| ABE| ATL|
|1061215| -6| 602| ABE| ATL|
|1061725| 69| 602| ABE| ATL|
|1061230| 0| 369| ABE| DTW|
|1060625| -3| 602| ABE| ATL|
|1070600| 0| 369| ABE| DTW|
|1071725| 0| 602| ABE| ATL|
|1071230| 0| 369| ABE| DTW|
|1070625| 0| 602| ABE| ATL|
spark.read.csv('/databricks-datasets/learning-spark-v2/flights/departuredelays.csv', inferSchema='true', header='True').take(3)
%sh ls -l #Long listing. Possibly the most used option for ls.
total 1304
drwxr-xr-x 2 root root 4096 Jan 1 1970 azure
drwxr-xr-x 2 root root 4096 Jan 1 1970 conf
drwxr-xr-x 3 root root 4096 Jan 5 11:00 eventlogs
-r-xr-xr-x 1 root root 3037 Jan 1 1970 hadoop_accessed_config.lst
drwxr-xr-x 2 root root 4096 Jan 5 11:01 logs
drwxr-xr-x 5 root root 4096 Jan 5 11:05 metastore_db
-r-xr-xr-x 1 root root 1306848 Jan 1 1970 preload_class.lst
%fs ls /databricks-datasets/
Table
path name size modificationTime
1 dbfs:/databricks-datasets/ databricks-datasets/ 0 0
2 dbfs:/databricks-datasets/COVID/ COVID/ 0 0
3 dbfs:/databricks-datasets/README.md README.md 976 1532468253000
4 dbfs:/databricks-datasets/Rdatasets/ Rdatasets/ 0 0
5 dbfs:/databricks-datasets/SPARK_README.md SPARK_README.md 3359 1455043490000
6 dbfs:/databricks-datasets/adult/ adult/ 0 0
7 dbfs:/databricks-datasets/airlines/ airlines/ 0 0
Showing all 55 rows.
dbutils commands
dbutils.help()
This module provides various utilities for users to interact with the rest of Databricks.
credentials: DatabricksCredentialUtils -> Utilities for interacting with credentials within notebooks
data: DataUtils -> Utilities for understanding and interacting with datasets (EXPERIMENTAL)
fs: DbfsUtils -> Manipulates the Databricks filesystem (DBFS) from the console
jobs: JobsUtils -> Utilities for leveraging jobs features
library: LibraryUtils -> Utilities for session isolated libraries
meta: MetaUtils -> Methods to hook into the compiler (EXPERIMENTAL)
notebook: NotebookUtils -> Utilities for the control flow of a notebook (EXPERIMENTAL)
preview: Preview -> Utilities under preview category
secrets: SecretUtils -> Provides utilities for leveraging secrets within notebooks
widgets: WidgetsUtils -> Methods to create and get bound value of input widgets inside notebooks
dbutils.fs.help('cp')
/**
* Copies a file or directory, possibly across FileSystems..
*
* Example: cp("/mnt/my-folder/a", "s3n://bucket/b")
*
* @param from FileSystem URI of the source file or directory
* @param to FileSystem URI of the destination file or directory
* @param recurse if true, all files and directories will be recursively copied
* @return true if all files were successfully copied
*/
cp(from: java.lang.String, to: java.lang.String, recurse: boolean = false): boolean
Check the loaded files in your cluster (under dbfs:/FileStore/tables, as allowed by Databricks CE)
# Check the files you have in "dbfs:/FileStore/tables" (via command or the Databricks interface)
# If you don't have this folder already created automatically, import the purplecow.txt file as described below (2 options)
# 1) go to "Data > DBFS (on top) > Load" and import one file (raw). Ex: purplecow.txt (find it in Moodle under ABD Class Tech Resources)
# 2) go to the Databricks main page and use the import option to import one file. Ex: purplecow.txt
#dbutils.fs.ls("dbfs:/FileStore/tables")
#dbutils.fs.mkdirs("/FileStore/tables")
#dbutils.fs.rm("/FileStore/tables/purplecow.txt")
%fs ls dbfs:/FileStore/tables
Table
path name size modificationTime
1 dbfs:/FileStore/tables/Managers.csv Managers.csv 133114 1667413379000
2 dbfs:/FileStore/tables/Teams.csv Teams.csv 524526 1667413401000
3 dbfs:/FileStore/tables/alice_in_wonderland.txt alice_in_wonderland.txt 148311 1663266909000
4 dbfs:/FileStore/tables/d2buy.csv d2buy.csv 407 1667434540000
5 dbfs:/FileStore/tables/linkFile.txt linkFile.txt 72 1663266909000
6 dbfs:/FileStore/tables/movielens.txt movielens.txt 616155 1667413258000
7 dbfs:/FileStore/tables/movielensABD.csv movielensABD.csv 616155 1667412560000
Showing all 13 rows.
# Import the purplecow.txt file available in Moodle to dbfs:/FileStore/tables (using the Databricks interface)
# Then use %fs ls (or the Databricks interface) to see the saved file
%fs ls dbfs:/FileStore/tables/purplecow.txt
Table
path name size modificationTime
1 dbfs:/FileStore/tables/purplecow.txt purplecow.txt 109 1663266910000
Showing 1 row.
#dbutils.fs.cp("/FileStore/tables/purplecow.txt", "/FileStore/tables/purplecow1.txt")
%sh ls -l
total 1304
drwxr-xr-x 2 root root 4096 Jan 1 1970 azure
drwxr-xr-x 2 root root 4096 Jan 1 1970 conf
drwxr-xr-x 3 root root 4096 Jan 5 11:00 eventlogs
-r-xr-xr-x 1 root root 3037 Jan 1 1970 hadoop_accessed_config.lst
drwxr-xr-x 2 root root 4096 Jan 5 11:01 logs
drwxr-xr-x 5 root root 4096 Jan 5 11:05 metastore_db
-r-xr-xr-x 1 root root 1306848 Jan 1 1970 preload_class.lst
# Copy the purplecow.txt file from your Databricks filesystem to your driver node
dbutils.fs.cp("dbfs:/FileStore/tables/purplecow.txt", "file:/purplecow.txt")
Out[67]: True
%sh ls -l /
total 88
-r-xr-xr-x 1 root root 271 Jan 1 1970 BUILD
drwxrwxrwx 2 root root 4096 Jan 5 11:00 Workspace
lrwxrwxrwx 1 root root 7 Oct 19 16:47 bin -> usr/bin
drwxr-xr-x 2 root root 4096 Apr 15 2020 boot
drwxr-xr-x 1 root root 4096 Jan 5 11:04 databricks
drwxr-xr-x 2 root root 4096 Jan 5 11:00 dbfs
drwxr-xr-x 7 root root 540 Jan 5 11:00 dev
drwxr-xr-x 1 root root 4096 Jan 5 11:00 etc
drwxr-xr-x 1 root root 4096 Nov 23 01:50 home
lrwxrwxrwx 1 root root 7 Oct 19 16:47 lib -> usr/lib
lrwxrwxrwx 1 root root 9 Oct 19 16:47 lib32 -> usr/lib32
lrwxrwxrwx 1 root root 9 Oct 19 16:47 lib64 -> usr/lib64
lrwxrwxrwx 1 root root 10 Oct 19 16:47 libx32 -> usr/libx32
drwxr-xr-x 7 ubuntu ubuntu 4096 Jan 5 11:00 local_disk0
drwxr-xr-x 2 root root 4096 Oct 19 16:47 media
drwxr-xr-x 1 root root 4096 Jan 5 11:00 mnt
drwxr-xr-x 4 root root 4096 Nov 23 01:50 opt
dr-xr-xr-x 225 root root 0 Jan 5 10:59 proc
-rw-r--r-- 1 root root 109 Jan 5 11:14 purplecow.txt
drwxr-xr-x 1 root root 4096 Jan 5 11:05 root
# Delete the purplecow file on your driver (just for housekeeping purposes)
dbutils.fs.rm("file:/purplecow.txt")
Out[69]: True
Table
Name Zip-Code
1 Porto 1900
2 Lisboa 4000
df.write.format("parquet").save("dbfs:/FileStore/tables/df.parquet")
#df.write.format("delta").save("dbfs:/FileStore/tables/df.delta")
df.write.save("dbfs:/FileStore/tables/df.delta")
%fs ls dbfs:/FileStore/tables/
Table
path name size modificationTime
1 dbfs:/FileStore/tables/Managers.csv Managers.csv 133114 1667413379000
2 dbfs:/FileStore/tables/Teams.csv Teams.csv 524526 1667413401000
3 dbfs:/FileStore/tables/alice_in_wonderland.txt alice_in_wonderland.txt 148311 1663266909000
4 dbfs:/FileStore/tables/d2buy.csv d2buy.csv 407 1667434540000
5 dbfs:/FileStore/tables/df.delta/ df.delta/ 0 0
6 dbfs:/FileStore/tables/df.parquet/ df.parquet/ 0 0
7 dbfs:/FileStore/tables/linkFile.txt linkFile.txt 72 1663266909000
Showing all 15 rows.
%fs ls dbfs:/FileStore/tables/df.delta
Table
path name
1 dbfs:/FileStore/tables/df.delta/_delta_log/ _delta_log/
2 dbfs:/FileStore/tables/df.delta/part-00003-6fc4d7b8-dbb9-482a-a5d3-9180f7e7bc80-c000.snappy.parquet part-00003-6fc4d7b8-dbb9-4
3 dbfs:/FileStore/tables/df.delta/part-00007-f919575c-282a-4a0b-a942-935b519d720f-c000.snappy.parquet part-00007-f919575c-282a-4
%fs ls dbfs:/FileStore/tables/df.delta/_delta_log/
Table
path name size modificationTime
1 dbfs:/FileStore/tables/df.delta/_delta_log/.s3-optimization-0 .s3-optimization-0 0 167291727600
2 dbfs:/FileStore/tables/df.delta/_delta_log/.s3-optimization-1 .s3-optimization-1 0 167291727600
3 dbfs:/FileStore/tables/df.delta/_delta_log/.s3-optimization-2 .s3-optimization-2 0 167291727600
4 dbfs:/FileStore/tables/df.delta/_delta_log/00000000000000000000.crc 00000000000000000000.crc 2962 167291729100
5 dbfs:/FileStore/tables/df.delta/_delta_log/00000000000000000000.json 00000000000000000000.json 1969 167291727700
%fs ls dbfs:/FileStore/tables/df.parquet
Table
path name
1 dbfs:/FileStore/tables/df.parquet/_SUCCESS _SUCCESS
2 dbfs:/FileStore/tables/df.parquet/_committed_7290864805188671922 _committed_729086
3 dbfs:/FileStore/tables/df.parquet/_started_7290864805188671922 _started_729086480
dbfs:/FileStore/tables/df.parquet/part-00000-tid-7290864805188671922-1543ab27-2a0f-47b4-98d2-1fcef931c61e-32-1- part-00000-tid-7290
4
c000.snappy.parquet
dbfs:/FileStore/tables/df.parquet/part-00003-tid-7290864805188671922-1543ab27-2a0f-47b4-98d2-1fcef931c61e-35-1- part-00003-tid-7290
5
c000.snappy.parquet
dbfs:/FileStore/tables/df.parquet/part-00007-tid-7290864805188671922-1543ab27-2a0f-47b4-98d2-1fcef931c61e-39-1- part-00007-tid-7290
6
c000.snappy.parquet
%fs rm -r dbfs:/FileStore/tables/df.parquet
%fs rm -r dbfs:/FileStore/tables/df.delta
Python 3.9.5
Spark is a distributed processing engine that executes parallel processing in a cluster. A Spark cluster is made of one Driver node and many Executor nodes (JVMs - Java Virtual Machines).
sc
SparkContext
Spark UI
Version
v3.3.1
Master
local[8]
AppName
Databricks Shell
sc.appName
spark
SparkSession - hive
SparkContext
Spark UI
Version
v3.3.1
Master
local[8]
AppName
Databricks Shell
spark.version
Out[83]: '3.3.1'
SparkContext
Spark UI
Version
v3.3.1
Master
local[8]
AppName
Databricks Shell
spark.sparkContext.appName
#spark.sparkContext.appName
spark.conf.get("spark.app.name")
spark.conf.get("spark.app.name")
spark.conf.get("spark.sql.warehouse.dir")
Out[89]: 'dbfs:/user/hive/warehouse'
spark.sparkContext.master
Out[92]: 'local[8]'
spark.sparkContext.getConf().getAll()
('spark.sql.warehouse.dir', 'dbfs:/user/hive/warehouse'),
('spark.databricks.managedCatalog.clientClassName',
'com.databricks.managedcatalog.ManagedCatalogClientImpl'),
('spark.databricks.credential.scope.fs.gs.auth.access.tokenProviderClassName',
'com.databricks.backend.daemon.driver.credentials.CredentialScopeGCPTokenProvider'),
('spark.hadoop.fs.fcfs-s3.impl.disable.cache', 'true'),
('spark.hadoop.fs.s3a.retry.limit', '20'),
('spark.sql.streaming.checkpointFileManagerClass',
'com.databricks.spark.sql.streaming.DatabricksCheckpointFileManager'),
('spark.databricks.service.dbutils.repl.backend',
'com.databricks.dbconnect.ReplDBUtils'),
('spark.hadoop.databricks.s3.verifyBucketExists.enabled', 'false'),
('spark.streaming.driver.writeAheadLog.allowBatching', 'true'),
('spark.databricks.clusterSource', 'UI'),
('spark.hadoop.hive.server2.transport.mode', 'http'),
('spark.executor.memory', '8278m'),
help(spark)
class SparkSession(pyspark.sql.pandas.conversion.SparkConversionMixin)
| SparkSession(sparkContext: pyspark.context.SparkContext, jsparkSession: Optional[py4j.java_gateway.JavaObject] =
None, options: Dict[str, Any] = {})
|
| The entry point to programming Spark with the Dataset and DataFrame API.
|
| A SparkSession can be used create :class:`DataFrame`, register :class:`DataFrame` as
| tables, execute SQL over tables, cache tables, and read parquet files.
| To create a :class:`SparkSession`, use the following builder pattern:
|
| .. autoattribute:: builder
| :annotation:
|
| Examples
| --------
| >>> spark = SparkSession.builder \
| ... .master("local") \
| ... .appName("Word Count") \
| ... .config("spark.some.config.option", "some-value") \
%fs ls /databricks-datasets/
Table
path name size modificationTime
1 dbfs:/databricks-datasets/ databricks-datasets/ 0 0
2 dbfs:/databricks-datasets/COVID/ COVID/ 0 0
3 dbfs:/databricks-datasets/README.md README.md 976 1532468253000
4 dbfs:/databricks-datasets/Rdatasets/ Rdatasets/ 0 0
5 dbfs:/databricks-datasets/SPARK_README.md SPARK_README.md 3359 1455043490000
6 dbfs:/databricks-datasets/adult/ adult/ 0 0
7 dbfs:/databricks-datasets/airlines/ airlines/ 0 0
Showing all 55 rows.
%fs ls /FileStore/tables
Table
path name size modificationTime
1 dbfs:/FileStore/tables/Managers.csv Managers.csv 133114 1667413379000
2 dbfs:/FileStore/tables/Teams.csv Teams.csv 524526 1667413401000
3 dbfs:/FileStore/tables/alice_in_wonderland.txt alice_in_wonderland.txt 148311 1663266909000
4 dbfs:/FileStore/tables/d2buy.csv d2buy.csv 407 1667434540000
5 dbfs:/FileStore/tables/linkFile.txt linkFile.txt 72 1663266909000
6 dbfs:/FileStore/tables/movielens.txt movielens.txt 616155 1667413258000
7 dbfs:/FileStore/tables/movielensABD.csv movielensABD.csv 616155 1667412560000
Showing all 13 rows.
The SMS Spam Collection v.1 is a public set of labeled SMS messages that have been collected for mobile phone spam research. It has one collection composed of 5,574 English, real and non-encoded messages, tagged as being legitimate (ham) or spam.
## Composition
This corpus has been collected from free or free-for-research sources on the Internet:
- A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. The identification of the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: http://www.grumbletext.co.uk/.
- A subset of 3,375 randomly chosen ham SMS messages from the NUS SMS Corpus (NSC), which is a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans and mostly from students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available. The NUS SMS Corpus is available at: http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/.
- A list of 450 SMS ham messages collected from Caroline Tag's PhD Thesis available at http://etheses.bham.ac.uk/253/
%fs ls dbfs:/databricks-datasets/sms_spam_collection/
Table
path name size modificationTime
1 dbfs:/databricks-datasets/sms_spam_collection/README.md README.md 4344 1448498620000
2 dbfs:/databricks-datasets/sms_spam_collection/data-001/ data-001/ 0 0
# To do this you first have to upload the purplecow.txt file to Databricks
myrdd = sc.textFile("dbfs:/FileStore/tables/purplecow.txt")
myrdd.collect()
Transformations, like map() or filter(), create a new RDD from an existing one, resulting in another immutable RDD. All transformations are lazy; that is, they are not executed until an action is invoked.
Actions, like show() or count(), return a value (results) to the user. Other actions, like saveAsTextFile(), write the RDD to distributed storage (HDFS, DBFS or S3).
Transformations contribute to a query plan, but nothing is executed until an action is called.
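A minimal sketch of this lazy behaviour, reusing the purplecow.txt file loaded above (variable name assumed):
# textFile() and map() are lazy transformations: nothing is read or computed here
lazy_rdd = sc.textFile("dbfs:/FileStore/tables/purplecow.txt").map(lambda line: line.upper())
# count() is an action, so only now does Spark read the file and run the map
print(lazy_rdd.count())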
# Doing the same thing for reading a text file but in one line with "."/methods notation
sc.textFile("dbfs:/FileStore/tables/purplecow.txt").collect()
Out[101]: ['I never saw a purple cow.', 'I never hope to see one.']
mydata = sc.textFile("/databricks-datasets/samples/docs/README.md")
Out[103]: 65
mydata.take(5)
mydata.toDebugString()
Out[107]: 2
Filtering lines
Change the lines of the file to uppercase using RDDs / the low-level API (see the sketch below)
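The solution cells for these two exercises are not fully captured in this export; a minimal sketch with the low-level API, assuming the README RDD (mydata) loaded above:
# Filtering lines: keep only the lines that mention "Spark"
mydata.filter(lambda line: "Spark" in line).take(5)
# Change the lines to uppercase with a map transformation
mydata.map(lambda line: line.upper()).take(5)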
df = spark.createDataFrame([("James", 20), ("Anna", 31), ("Michael", 30), ("Charles", 35), ("Brooke", 25)], ["name", "age"])
df.show()
+-------+---+
| name|age|
+-------+---+
| James| 20|
| Anna| 31|
|Michael| 30|
|Charles| 35|
| Brooke| 25|
+-------+---+
diamonds=spark.read.csv('/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv', header='True')
diamonds.display()
Table
_c0 carat cut color clarity depth table price x
1 1 0.23 Ideal E SI2 61.5 55 326 3.95
2 2 0.21 Premium E SI1 59.8 61 326 3.89
3 3 0.23 Good E VS1 56.9 65 327 4.05
4 4 0.29 Premium I VS2 62.4 58 334 4.2
5 5 0.31 Good J SI2 63.3 58 335 4.34
6 6 0.24 Very Good J VVS2 62.8 57 336 3.94
7 7 0.24 Very Good I VVS1 62.3 57 336 3.95
Truncated results, showing first 1,000 rows.
+--------------------+
| value|
+--------------------+
# mydataframe was presumably created earlier, e.g. with spark.read.text() on purplecow.txt
mydataframe.printSchema()
root
|-- value: string (nullable = true)
== Physical Plan ==
FileScan text [value#1317] Batched: false, DataFilters: [], Format: Text, Location: InMemoryFileIndex(1 paths)[dbfs:/FileStore/tables/purplecow.txt], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>
mydataframe.rdd.toDebugString()
+--------------------+
| upper(value)|
+--------------------+
|I NEVER SAW A PUR...|
|I NEVER HOPE TO S...|
|BUT I CAN TELL YO...|
|I'D RATHER SEE TH...|
+--------------------+
SparkContext
Spark UI
Version
v3.3.1
Master
local[8]
AppName
Databricks Shell
spark.sparkContext
SparkContext
Spark UI
Version
v3.3.1
Master
local[8]
AppName
Databricks Shell
Creating an RDD
rdd1=sc.parallelize(["Lisboa", "Porto", "Faro"])
rdd1.collect()
Out[121]: pyspark.rdd.RDD
Numeric RDDs
Mylist = [50,59.2,59,57.2,53.5,53.2,55.4,51.8,53.6,55.4,54.7]
Myrdd = sc.parallelize(Mylist)
Myrdd.first()
Out[124]: 50
Myrdd.collect()
Out[125]: [50, 59.2, 59, 57.2, 53.5, 53.2, 55.4, 51.8, 53.6, 55.4, 54.7]
Myrdd.take(4)
Myrdd.top(4)
Myrdd.sum()
Out[128]: 603.0
Myrdd.mean()
Out[129]: 54.81818181818182
Myrdd.variance()
Out[130]: 7.383305785123963
Myrdd.stdev()
Out[131]: 2.717223911480974
Convert ºF to ºC
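The conversion cell itself is not captured in this export; a minimal sketch that would produce the output below (the name MyrddC is taken from the filter cell further down):
# Convert each Fahrenheit reading to Celsius with a map transformation
MyrddC = Myrdd.map(lambda f: (f - 32) * 5 / 9)
MyrddC.collect()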
Out[132]: [10.0,
15.11111111111111,
15.0,
14.000000000000002,
11.944444444444445,
11.777777777777779,
13.0,
10.999999999999998,
12.0,
13.0,
12.611111111111112]
MyrddC.filter(lambda n: n % 2 == 0).collect()
#MyrddC.filter(lambda n: n % 2 != 0).collect()
Text RDDs
rdd=sc.parallelize(["Do you know that", "a horse has one stomach", "but a cow has four"])
rdd.collect()
Out[136]: ['Do you know that', 'a horse has one stomach', 'but a cow has four']
Out[137]: ['DO YOU KNOW THAT', 'A HORSE HAS ONE STOMACH', 'BUT A COW HAS FOUR']
Out[138]: ['a horse has one stomach', 'but a cow has four']
Demonstrating flatMap
map() vs flatMap()
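The cells themselves are not captured here; a minimal sketch of the usual comparison on the rdd created above (the second output below looks like a distinct() over the flattened words):
# map() keeps one output element per input line (a list of word lists)
rdd.map(lambda line: line.split()).collect()
# flatMap() flattens the result into a single list of words
rdd.flatMap(lambda line: line.split()).collect()
# distinct() on the flattened words removes duplicates (order not guaranteed)
rdd.flatMap(lambda line: line.split()).distinct().collect()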
Out[141]: ['Do',
'you',
'know',
'that',
'a',
'horse',
'has',
'one',
'stomach',
'but',
'a',
'cow',
'has',
'four']
Out[142]: ['but',
'you',
'that',
'a',
'has',
'Do',
'four',
'stomach',
'cow',
'horse',
'know',
'one']
Shared Variables
Broadcast Variables
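The cell that creates the broadcast variable is not captured in this export; a minimal sketch, assuming a small parameter list (values are illustrative):
# Broadcast a read-only list of parameters to all executors
list_parm = sc.broadcast([10, 20, 30, 40])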
data = list_parm.value
print("Stored data ", data)
par2 = list_parm.value[2]
print("Parameter 2 is: ", par2)
Accumulators
accum = sc.accumulator(0)
myrdd = sc.parallelize([20,30,40,50])
myrdd.foreach(lambda n: accum.add(n))
final = accum.value
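Reading the accumulator back on the driver; with the values above the total should be 20 + 30 + 40 + 50 = 140:
print(final)   # 140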
SparkSession - hive
SparkContext
Spark UI
Version
v3.3.1
Master
local[8]
AppName
Databricks Shell
%fs ls "/FileStore/tables"
Table
path name size modificationTime
1 dbfs:/FileStore/tables/Managers.csv Managers.csv 133114 1667413379000
2 dbfs:/FileStore/tables/Teams.csv Teams.csv 524526 1667413401000
3 dbfs:/FileStore/tables/alice_in_wonderland.txt alice_in_wonderland.txt 148311 1663266909000
4 dbfs:/FileStore/tables/d2buy.csv d2buy.csv 407 1667434540000
5 dbfs:/FileStore/tables/linkFile.txt linkFile.txt 72 1663266909000
6 dbfs:/FileStore/tables/movielens.txt movielens.txt 616155 1667413258000
7 dbfs:/FileStore/tables/movielensABD.csv movielensABD.csv 616155 1667412560000
Showing all 13 rows.
Word count
rdd = sc.textFile("/FileStore/tables/purplecow.txt")
rdd.collect()
Out[157]: ['I',
'never',
'saw',
'a',
'purple',
'cow.',
'I',
'never',
'hope',
'to',
'see',
'one.',
'But',
'I',
'can',
'tell',
'you,',
'anyhow,',
"I'd",
'rather',
('anyhow,', 1),
('rather', 1),
('than', 1),
('one!', 1),
('I', 3),
('saw', 1),
('a', 1),
('purple', 1),
('hope', 1),
('to', 1),
('see', 2),
('one.', 1),
('can', 1),
('you,', 1),
("I'd", 1),
('be', 1)]
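The counting cell that produced the (word, count) pairs above is not captured in this export; a classic word-count sketch over the same file:
counts = sc.textFile("/FileStore/tables/purplecow.txt") \
    .flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
counts.collect()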
#dbutils.fs.ls("/work/rddOut.txt")
#dbutils.fs.rm("/work/wcount", recurse=True)
Pi estimation
import random
n_samples = 10000
0.7530264908969033
def inside(p):
x, y = random.random(), random.random()
return x*x + y*y < 1
Pi is roughly 3.154400
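The driver code for the estimate is only partially captured (the 0.753... value above looks like a single random draw); a minimal sketch consistent with the inside() function and the printed result:
# Count how many random points fall inside the unit quarter-circle
count = sc.parallelize(range(0, n_samples)).filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / n_samples))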
PageRank
Remember to load the linkFile from Moodle to the Databricks environment
file = "dbfs:/FileStore/tables/linkFile.txt"
iterations = 10
w = sc.textFile(file)
w.collect()
links = sc.textFile(file) \
.map(lambda line: line.split()) \
.map(lambda pages: (pages[0], pages[1])) \
.groupByKey() \
.persist()
links.collect()
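The computeContribs helper and the initial ranks RDD used by the loop below are not captured in this export; a minimal sketch consistent with the classic Spark PageRank example:
def computeContribs(neighbors, rank):
    # Distribute a page's rank evenly among the pages it links to
    num_neighbors = len(neighbors)
    for neighbor in neighbors:
        yield (neighbor, rank / num_neighbors)

# Every page starts with a rank of 1.0
ranks = links.map(lambda pageNeighbors: (pageNeighbors[0], 1.0))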
for x in range(iterations):
    contribs = links.join(ranks) \
        .flatMap(lambda neighborRanks: computeContribs(neighborRanks[1][0], neighborRanks[1][1]))
    ranks = contribs.reduceByKey(lambda v1, v2: v1 + v2) \
        .map(lambda pageContribs: (pageContribs[0], pageContribs[1] * 0.85 + 0.15))
('page1', 1.4313779845858583)
('page2', 0.4633039012638519)
('page3', 1.3758228705372553)
('page4', 0.7294952436130331)
links = sc.textFile(file) \
.map(lambda line: line.split()) \
.map(lambda pages: (pages[0], pages[1])) \
.groupByKey() \
.persist()
for x in range(iterations):
    contribs = links.join(ranks) \
        .flatMap(lambda neighborRanks: computeContribs(neighborRanks[1][0], neighborRanks[1][1]))
    ranks = contribs.reduceByKey(lambda v1, v2: v1 + v2) \
        .map(lambda pageContribs: (pageContribs[0], pageContribs[1] * 0.85 + 0.15))
('page1', 1.4313779845858583)
('page2', 0.4633039012638519)
('page3', 1.3758228705372553)
('page4', 0.7294952436130331)
spark
SparkSession - hive
SparkContext
Spark UI
Version
v3.3.1
Master
local[8]
AppName
Databricks Shell
# hit tab after the dot to see the available methods of the spark session object
spark
SparkSession - hive
SparkContext
Spark UI
Version
v3.3.1
Master
local[8]
AppName
Databricks Shell
Through the SparkSession.catalog field you can access all the Catalog metadata information about your tables, databases, UDFs, etc.
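The cell that produced the database listing below is not captured in this export; a minimal sketch using the catalog API (the tabular rendering presumably came from display()):
# Each entry has name, catalog, description and locationUri fields
spark.catalog.listDatabases()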
Table
name catalog description locationUri
1 abd2022_v2 spark_catalog dbfs:/user/hive/warehouse/abd2022_v2.db
2 default spark_catalog Default Hive database dbfs:/user/hive/warehouse
Table
databaseName
1 abd2022_v2
2 default
%sql
show databases;
Table
databaseName
1 abd2022_v2
2 default
Out[467]: 'default'
Out[470]: DataFrame[]
spark.sql('show databases').display()
Table
databaseName
1 abd2022
2 abd2022_v2
3 default
Out[472]: DataFrame[]
spark.catalog.currentDatabase()
Out[473]: 'abd2022'
Out[474]: DataFrame[]
# Let's create a Table (just to see it in our Database) using a SQL statement
# Remember: a Table is not the same thing as a DataFrame (more on this later)
# A DataFrame is data in memory
# A Table is data registered in the Spark Catalog
%sql
DROP TABLE IF EXISTS diamonds;
CREATE TABLE diamonds
USING csv
OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true")
OK
spark.catalog.listTables()
%sql
DROP TABLE IF EXISTS diamonds;
OK
Out[196]: []
# dep_delays was presumably created earlier from the departuredelays.csv file read above
type(dep_delays)
Out[198]: pyspark.sql.dataframe.DataFrame
dep_delays.schema
dep_delays.summary().display()
Table
summary date delay distance origin destination
1 count 1391578 1391578 1391578 1391578 1391578
2 mean 2180446.584000322 12.079802928761449 690.5508264718184 null null
3 stddev 838031.1536741006 38.8077337498565 513.6628153663316 null null
4 min 1010005 -112 21 ABE ABE
5 25% 1240630 -4 316 null null
6 50% 2161410 0 548 null null
7 75% 3101505 12 893 null null
Showing all 8 rows.
dbutils.data.summarize(dep_delays)
(interactive summary widget output; recoverable statistics below)
date:     count 1.39M, missing 0%, mean 2.18M, std dev 838k, zeros 0%, min 1.01M, median 2.16M, max 3.31M, data type: int
delay:    count 1.39M, missing 0%, mean 12.08, std dev 38.81, zeros 9.42%, min -112, median 0, max 1,642, data type: int
distance: count 1.39M, missing 0%, mean 690.55, std dev 513.66, zeros 0%, min 21, median 548, max 4,330, data type: int
origin, destination: categorical columns (count/missing/unique/top statistics not recoverable from this export)
spark.catalog.listTables()
# Note that there is no dep_delays on the Spark catalog
Out[202]: []
%sql
DROP TABLE IF EXISTS Dep_delays;
OK
dep_delays.write.saveAsTable("Dep_delays")
spark.catalog.listTables()
%sql
DROP TABLE IF EXISTS Dep_delays;
OK
DataFrames (like RDDs) support two types of operations: transformations and actions.
Transformations, like select() or filter(), create a new DataFrame from an existing one, resulting in another immutable DataFrame. All transformations are lazy; that is, they are not executed until an action is invoked.
Actions, like show() or count(), return a value with results to the user. Other actions, like save(), write the DataFrame to distributed storage (such as HDFS, DBFS or S3).
Transformations contribute to a query plan, but nothing is executed until an action is called.
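A minimal, self-contained sketch of this lazy behaviour with the DataFrame API (names assumed):
# select()/filter() are lazy transformations: they only extend the query plan
df_lazy = spark.range(10).filter("id % 2 = 0")
# count() is an action, so only now does Spark execute the plan
print(df_lazy.count())   # 5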
Creating a DataFrame
df = spark.range(5)
display(df)
Table
id
1 0
2 1
3 2
4 3
5 4
df.show()
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
df.display()
Table
id
1 0
2 1
3 2
4 3
5 4
df.take(3)
#display(df.describe())
df.describe()
+---+
|num|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
df.show()
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
+---+
|num|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
Table
value
1 0
2 2
3 4
4 6
5 8
# df2 was presumably created earlier, e.g. df2 = df.select((df["id"] * 2).alias("value"))
df2.schema
df2.printSchema()
df2.explain()
== Physical Plan ==
*(1) Project [(id#30861L * 2) AS value#30949L]
+- *(1) Range (0, 5, step=1, splits=8)
# spark.createDataFrame(rdd, schema)
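The cell body is not captured here; a minimal sketch that would produce the output below (variable names assumed):
rdd_zip = sc.parallelize([("Porto", 1900), ("Lisboa", 4000)])
df_zip = spark.createDataFrame(rdd_zip, ["Name", "Zip-Code"])
df_zip.show()
df_zip.printSchema()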
+------+--------+
| Name|Zip-Code|
+------+--------+
| Porto| 1900|
|Lisboa| 4000|
+------+--------+
root
|-- Name: string (nullable = true)
|-- Zip-Code: long (nullable = true)
# people was presumably created earlier (e.g. an RDD of Row objects with name and age)
df_people = spark.createDataFrame(people)
# You can also use rdd.toDF()
# df_people = people.toDF()
display(df_people)
Table
name age
1 Mark 25
2 Tom 22
3 Mary 20
4 Sofia 26
df_people.printSchema()
root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
+------+
| City|
+------+
| Porto|
|Lisboa|
| Faro|
+------+
+------+
| value|
+------+
| Porto|
|Lisboa|
| Faro|
+------+
+------+
| city|
+------+
| Porto|
|Lisboa|
| Faro|
+------+
Out[234]: True
JsonDF.show()
+---------+--------+---+-------+
| array| dict|int| string|
+---------+--------+---+-------+
|[1, 2, 3]|{value1}| 1|string1|
|[2, 4, 6]|{value2}| 2|string2|
|[3, 6, 9]|{value3}| 3|string3|
+---------+--------+---+-------+
%fs ls /databricks-datasets/samples/people
Table
path name size modificationTime
1 dbfs:/databricks-datasets/samples/people/people.json people.json 77 1534435526000
Showing 1 row.
dfpeople = spark.read.json("/databricks-datasets/samples/people/people.json")
display(dfpeople)
Table
age name
1 40 Jane
2 30 Andy
3 50 Justin
diamonds0.display()
Table
_c0 carat cut color clarity depth table price x
1 1 0.23 Ideal E SI2 61.5 55 326 3.95
2 2 0.21 Premium E SI1 59.8 61 326 3.89
3 3 0.23 Good E VS1 56.9 65 327 4.05
4 4 0.29 Premium I VS2 62.4 58 334 4.2
5 5 0.31 Good J SI2 63.3 58 335 4.34
6 6 0.24 Very Good J VVS2 62.8 57 336 3.94
7 7 0.24 Very Good I VVS1 62.3 57 336 3.95
Truncated results, showing first 1,000 rows.
Out[243]: ['_c0',
'carat',
'cut',
'color',
'clarity',
'depth',
'table',
'price',
'x',
'y',
'z']
# inferSchema means we will automatically figure out column types,
# at a cost of reading the data more than once
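The read itself is not captured in this export; presumably something like:
diamonds1 = spark.read.csv('/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv', header='True', inferSchema='True')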
display(diamonds1)
Table
_c0 carat cut color clarity depth table price x
1 1 0.23 Ideal E SI2 61.5 55 326 3.95
2 2 0.21 Premium E SI1 59.8 61 326 3.89
3 3 0.23 Good E VS1 56.9 65 327 4.05
4 4 0.29 Premium I VS2 62.4 58 334 4.2
5 5 0.31 Good J SI2 63.3 58 335 4.34
6 6 0.24 Very Good J VVS2 62.8 57 336 3.94
7 7 0.24 Very Good I VVS1 62.3 57 336 3.95
Truncated results, showing first 1,000 rows.
textFile.take(5)
linesWithSpark = textFile.where(textFile.value.contains("Spark"))
display(linesWithSpark)
Table
value
1 Welcome to the Spark documentation!
2 This readme will walk you through navigating and building the Spark documentation, which is included
3 here with the Spark source code. You can also find documentation specific to release versions of
4 Spark at http://spark.apache.org/documentation.html.
5 whichever version of Spark you currently have checked out of revision control.
6 The Spark documentation build uses a number of tools to build HTML docs and API docs in Scala,
7 We include the Spark documentation as part of the source (as opposed to using a hosted wiki, such as
Showing all 12 rows.
linesWithSpark = textFile.filter(textFile.value.contains("Spark"))
display(linesWithSpark)
Table
value
1 Welcome to the Spark documentation!
2 This readme will walk you through navigating and building the Spark documentation, which is included
3 here with the Spark source code. You can also find documentation specific to release versions of
4 Spark at http://spark.apache.org/documentation.html.
5 whichever version of Spark you currently have checked out of revision control.
6 The Spark documentation build uses a number of tools to build HTML docs and API docs in Scala,
7 We include the Spark documentation as part of the source (as opposed to using a hosted wiki, such as
Out[250]: True
%fs ls /delta
OK
diamonds.write.format("delta").save("/delta/diamonds")
%fs ls /delta
Table
path name size modificationTime
1 dbfs:/delta/diamonds/ diamonds/ 0 0
Showing 1 row.
diamonds2delta = spark.read.load("/delta/diamonds")
display(diamonds2delta)
Table
_c0 carat cut color clarity depth table price x
1 1 0.23 Ideal E SI2 61.5 55 326 3.95
2 2 0.21 Premium E SI1 59.8 61 326 3.89
3 3 0.23 Good E VS1 56.9 65 327 4.05
4 4 0.29 Premium I VS2 62.4 58 334 4.2
5 5 0.31 Good J SI2 63.3 58 335 4.34
6 6 0.24 Very Good J VVS2 62.8 57 336 3.94
7 7 0.24 Very Good I VVS1 62.3 57 336 3.95
Truncated results, showing first 1,000 rows.
diamonds.select("carat","cut","color").show(5)
+-----+-------+-----+
|carat| cut|color|
+-----+-------+-----+
| 0.23| Ideal| E|
| 0.21|Premium| E|
| 0.23| Good| E|
| 0.29|Premium| I|
| 0.31| Good| J|
+-----+-------+-----+
only showing top 5 rows
diamonds.select("color","clarity","table","price").where("color = 'E'").show(5)
+-----+-------+-----+-----+
|color|clarity|table|price|
+-----+-------+-----+-----+
| E| SI2| 55.0| 326|
| E| SI1| 61.0| 326|
| E| VS1| 65.0| 327|
| E| VS2| 61.0| 337|
| E| SI2| 62.0| 345|
+-----+-------+-----+-----+
only showing top 5 rows
diamonds.sort('table').show(25)
#diamonds.sort('table').select('table').show()
#diamonds.sort(diamonds.table.asc(),diamonds.price.desc()).show()
#diamonds.sort(diamonds['table'],diamonds['price']).show()
#diamonds.orderBy('table').show()
+-----+-----+---------+-----+-------+-----+-----+-----+----+----+----+
| _c0|carat| cut|color|clarity|depth|table|price| x| y| z|
+-----+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|11369| 1.04| Ideal| I| VS1| 62.9| 43.0| 4997|6.45|6.41|4.04|
|35634| 0.29|Very Good| E| VS1| 62.8| 44.0| 474| 4.2|4.24|2.65|
| 5980| 1.0| Fair| I| VS1| 64.0| 49.0| 3951|6.43|6.39| 4.1|
|22702| 0.3| Fair| E| SI1| 64.5| 49.0| 630|4.28|4.25|2.75|
|25180| 2.0| Fair| H| SI1| 61.2| 50.0|13764|8.17|8.08|4.97|
| 7419| 1.02| Fair| F| SI1| 61.8| 50.0| 4227|6.59|6.51|4.05|
| 3239| 0.94| Fair| H| SI2| 66.0| 50.1| 3353|6.13|6.17|4.06|
| 1516| 0.91| Fair| F| SI2| 65.3| 51.0| 2996|6.05|5.98|3.93|
| 3980| 1.0| Premium| H| SI1| 62.2| 51.0| 3511|6.47| 6.4| 4.0|
| 4151| 0.91| Premium| F| SI2| 61.0| 51.0| 3546|6.24|6.21| 3.8|
| 8854| 1.0| Fair| E| VS2| 66.4| 51.0| 4480|6.31|6.22|4.16|
|26388| 2.01| Good| H| SI2| 64.0| 51.0|15888|8.08|8.01|5.15|
|33587| 0.37| Premium| F| VS1| 62.7| 51.0| 833|4.65|4.57|2.89|
|45799| 0.51| Fair| E| VS2| 65.5| 51.0| 1709|5.06|5.01| 3.3|
|46041| 0.57| Good| H| VS1| 63.7| 51.0| 1728|5.36|5.29|3.39|
|47631| 0.67| Good| I| VVS2| 58.9| 51.0| 1882|5.74|5.78| 3.4|
|24816| 2.0|Very Good| J| VS1| 61.0| 51.6|13203|8.14|8.18|4.97|
|10541| 1.0| Ideal| E| SI2| 62.2| 52.0| 4808|6.42|6.47|4.01|
# Choose this syntax with [] for better future compatibility and fewer errors
# (e.g. with column names that are also attributes of the DataFrame class)
diamonds.filter(diamonds['color'] == 'E').show(5)
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat| cut|color|clarity|depth|table|price| x| y| z|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
| 1| 0.23| Ideal| E| SI2| 61.5| 55.0| 326|3.95|3.98|2.43|
| 2| 0.21|Premium| E| SI1| 59.8| 61.0| 326|3.89|3.84|2.31|
| 3| 0.23| Good| E| VS1| 56.9| 65.0| 327|4.05|4.07|2.31|
| 9| 0.22| Fair| E| VS2| 65.1| 61.0| 337|3.87|3.78|2.49|
| 15| 0.2|Premium| E| SI2| 60.2| 62.0| 345|3.79|3.75|2.27|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
only showing top 5 rows
diamonds.select(diamonds["clarity"],diamonds["price"] * 1.10).show(5)
+-------+------------------+
|clarity| (price * 1.1)|
+-------+------------------+
| SI2| 358.6|
| SI1| 358.6|
| VS1|359.70000000000005|
| VS2|367.40000000000003|
| SI2|368.50000000000006|
+-------+------------------+
only showing top 5 rows
Table
color avg(price)
1 F 3724.886396981765
2 E 3076.7524752475247
3 D 3169.9540959409596
4 J 5323.81801994302
5 G 3999.135671271697
6 I 5091.874953891553
7 H 4486.669195568401
Showing all 7 rows.
#spark.catalog.dropTempView("temptable")
#spark.catalog.dropGlobalTempView("temptable")
# There are some limitations to dropping Tables with spark.catalog, so use %sql instead
%sql
DROP VIEW IF EXISTS temptable
OK
# Use createTempView("name")
# write.saveAsTable is also an option (see below)
diamonds.createTempView("temptable")
spark.catalog.listTables()
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat| cut|color|clarity|depth|table|price| x| y| z|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
| 1| 0.23| Ideal| E| SI2| 61.5| 55.0| 326|3.95|3.98|2.43|
| 2| 0.21| Premium| E| SI1| 59.8| 61.0| 326|3.89|3.84|2.31|
| 3| 0.23| Good| E| VS1| 56.9| 65.0| 327|4.05|4.07|2.31|
| 9| 0.22| Fair| E| VS2| 65.1| 61.0| 337|3.87|3.78|2.49|
| 15| 0.2| Premium| E| SI2| 60.2| 62.0| 345|3.79|3.75|2.27|
| 16| 0.32| Premium| E| I1| 60.9| 58.0| 345|4.38|4.42|2.68|
| 22| 0.23|Very Good| E| VS2| 63.8| 55.0| 352|3.85|3.92|2.48|
| 33| 0.23|Very Good| E| VS1| 60.7| 59.0| 402|3.97|4.01|2.42|
| 34| 0.23|Very Good| E| VS1| 59.5| 58.0| 402|4.01|4.06| 2.4|
| 37| 0.23| Good| E| VS1| 64.1| 59.0| 402|3.83|3.85|2.46|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat| cut|color|clarity|depth|table|price| x| y| z|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
| 1| 0.23| Ideal| E| SI2| 61.5| 55.0| 326|3.95|3.98|2.43|
| 2| 0.21|Premium| E| SI1| 59.8| 61.0| 326|3.89|3.84|2.31|
| 3| 0.23| Good| E| VS1| 56.9| 65.0| 327|4.05|4.07|2.31|
| 9| 0.22| Fair| E| VS2| 65.1| 61.0| 337|3.87|3.78|2.49|
| 15| 0.2|Premium| E| SI2| 60.2| 62.0| 345|3.79|3.75|2.27|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
only showing top 5 rows
%sql
DROP VIEW IF EXISTS temptable;
DROP TABLE IF EXISTS diamonds_;
OK
spark.catalog.listTables()
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat| cut|color|clarity|depth|table|price| x| y| z|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
| 1| 0.23| Ideal| E| SI2| 61.5| 55.0| 326|3.95|3.98|2.43|
| 2| 0.21|Premium| E| SI1| 59.8| 61.0| 326|3.89|3.84|2.31|
| 3| 0.23| Good| E| VS1| 56.9| 65.0| 327|4.05|4.07|2.31|
| 4| 0.29|Premium| I| VS2| 62.4| 58.0| 334| 4.2|4.23|2.63|
| 5| 0.31| Good| J| SI2| 63.3| 58.0| 335|4.34|4.35|2.75|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat| cut|color|clarity|depth|table|price| x| y| z|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
| 1| 0.23| Ideal| E| SI2| 61.5| 55.0| 326|3.95|3.98|2.43|
| 2| 0.21|Premium| E| SI1| 59.8| 61.0| 326|3.89|3.84|2.31|
| 3| 0.23| Good| E| VS1| 56.9| 65.0| 327|4.05|4.07|2.31|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
only showing top 3 rows
diamonds.explain()
== Physical Plan ==
FileScan csv [_c0#13653,carat#13654,cut#13655,color#13656,clarity#13657,depth#13658,table#13659,price#13660,x#13661,y#13662,z#13663] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[dbfs:/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_c0:int,carat:double,cut:string,color:string,clarity:string,depth:double,table:double,pric...
%sql
DROP VIEW IF EXISTS temptable;
DROP TABLE IF EXISTS diamonds_;
OK
# Spark UDFs
from pyspark.sql.functions import udf   # udf() is used below; the original cell had this import commented out (probably imported earlier)
def price_plus(num):
return num*2
def price_tag(num):
if num < 330:
tag = 'Good'
else:
tag = 'Bad'
return tag
price_plusUDF = udf(price_plus)
price_tagUDF = udf(price_tag)
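The cell that produced the two-row output just below is not captured; presumably a select applying both UDFs, for example:
diamonds.select("table",
                price_plusUDF("table").alias("Price_Plus"),
                price_tagUDF("price").alias("Price_Note")).show(2)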
+-----+----------+----------+
|table|Price_Plus|Price_Note|
+-----+----------+----------+
| 55.0| 110.0| Good|
| 61.0| 122.0| Good|
+-----+----------+----------+
only showing top 2 rows
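The %sql cell below calls pricePlus and priceTag, so the Python functions were presumably also registered with the SQL engine, e.g.:
# Register the UDFs under the names used in the SQL query (registration cell not captured in this export)
spark.udf.register("pricePlus", price_plus)
spark.udf.register("priceTag", price_tag)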
%sql select table, pricePlus(table) as TablePlus, priceTag(price) as PriceNote from TempTable limit 3
Table
table TablePlus PriceNote
1 55 110.0 Good
2 61 122.0 Good
3 65 130.0 Good
+-----+-----+
|table|price|
+-----+-----+
| 55.0| 326|
| 61.0| 326|
| 65.0| 327|
+-----+-----+
only showing top 3 rows
# Use withColumn() to add a new column or derive a new column based on an existing one
display( diamonds.withColumn("table", price_plusUDF("table")))
Table
_c0 carat cut color clarity depth table price x
(the "table" column now holds the doubled values; earlier rows not captured in this export)
10 10 0.23 Very Good H VS1 59.4 122.0 338 4
11 11 0.3 Good J SI1 64 110.0 339 4.25
12 12 0.23 Ideal J VS1 62.8 112.0 340 3.93
13 13 0.22 Premium F SI1 60.4 122.0 342 3.88
14 14 0.31 Ideal J SI2 62.2 108.0 344 4.35
15 15 0.2 Premium E SI2 60.2 124.0 345 3.79
16 16 0.32 Premium E I1 60.9 116.0 345 4.38
17 17 0.3 Ideal I SI2 62 108.0 348 4.31
Convert DataFrame to RDD (Row RDD)
df_people1 = spark.read.json("/databricks-datasets/samples/people/people.json")
rdd_1 = df_people1.rdd
rdd_1.collect()
Out[281]: [Row(age=40, name='Jane'), Row(age=30, name='Andy'), Row(age=50, name='Justin')]
Data is from Lending Club. It includes funded loans from 2012 to 2017. Each loan includes demographic information, current loan
status (Current, Late, Fully Paid, etc.) and latest payment info.
Notes to consider:
The management of the Data Lake layers is done here with folders in the file system, but could also be done with databases in Spark SQL
The Parquet format will be used for the Bronze layer and the Delta format for the Silver and Gold layers
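The cells that define the Data Lake folder paths are not captured in this export; the later write cells reference DL_Bronze_Path and DL_Silver_Path, so presumably something like (paths are illustrative assumptions):
DL_Root = 'dbfs:/FileStore/tables/DataLake/'
DL_Bronze_Path = DL_Root + 'Bronze/'   # raw-ish Parquet files
DL_Silver_Path = DL_Root + 'Silver/'   # cleaned Delta files
DL_Gold_Path = DL_Root + 'Gold/'       # aggregated Delta files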
Out[282]: False
# Importing libraries
from pyspark.sql.functions import *
import time
import datetime
# rawL was presumably created earlier by reading the raw Lending Club files (the read cell is not captured in this export)
rawL.count()
Out[286]: 1481560
root
|-- id: string (nullable = true)
|-- member_id: string (nullable = true)
|-- loan_amnt: float (nullable = true)
|-- funded_amnt: integer (nullable = true)
|-- funded_amnt_inv: double (nullable = true)
|-- term: string (nullable = true)
|-- int_rate: string (nullable = true)
|-- installment: double (nullable = true)
|-- grade: string (nullable = true)
|-- sub_grade: string (nullable = true)
|-- emp_title: string (nullable = true)
|-- emp_length: string (nullable = true)
|-- home_ownership: string (nullable = true)
|-- annual_inc: float (nullable = true)
|-- verification_status: string (nullable = true)
|-- loan_status: string (nullable = true)
|-- pymnt_plan: string (nullable = true)
|-- url: string (nullable = true)
|-- desc: string (nullable = true)
|-- purpose: string (nullable = true)
Table
id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment
1 null null 35000 35000 35000 36 months 17.27% 1252.56
2 null null 8000 8000 8000 36 months 18.25% 290.23
3 null null 5000 5000 5000 36 months 6.97% 154.32
4 null null 10000 10000 10000 36 months 9.75% 321.5
5 null null 24000 24000 24000 36 months 9.75% 771.6
6 null null 9600 9600 9600 36 months 9.75% 308.64
7 null null 13000 13000 13000 60 months 8.39% 266.03
Truncated results, showing first 1,000 rows.
Out[289]: 1481542
# rawL was then reduced to a 100-row sample (the sampling cell is not captured in this export)
rawL.count()
Out[291]: 100
rawL.display()
Table
id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment
1 null null 35000 35000 35000 36 months 17.27% 1252.56 B
2 null null 8000 8000 8000 36 months 18.25% 290.23 B
(rows 3-5 not captured in this export)
6 null null 9600 9600 9600 36 months 9.75% 308.64 B
7 null null 13000 13000 13000 60 months 8.39% 266.03 B
8 null null 9000 9000 9000 36 months 9.16% 286.87 B
9 null null 18000 18000 18000 36 months 12.99% 606.41 C
10 null null 16000 16000 16000 36 months 5.32% 481.84 A
11 null null 8400 8400 8400 36 months 9.75% 270.06 B
12 null null 9000 9000 9000 36 months 5.32% 271.04 A
13 null null 20000 20000 20000 36 months 11.47% 659.24 B
14 null null 30000 30000 30000 36 months 12.99% 1010.68 C
15 null null 32000 32000 32000 36 months 10.75% 1043.86 B
16 null null 16300 16300 16300 36 months 9.75% 524.05 B
17 null null 10500 10500 10500 36 months 11.99% 348.71 C
Showing all 100 rows.
Correct data formats
# Transforming string columns into numeric columns
rawL = rawL.withColumn('int_rate', regexp_replace('int_rate', '%', '').cast('float')) \
    .withColumn('revol_util', regexp_replace('revol_util', '%', '').cast('float')) \
    .withColumn('issue_year', substring(rawL.issue_d, 5, 4).cast('double')) \
    .withColumn('earliest_year', substring(rawL.earliest_cr_line, 5, 4).cast('double'))
# Converting emp_length into numeric column
rawL = rawL.withColumn('emp_length', trim(regexp_replace(rawL.emp_length, "([ ]*+[a-zA-Z].*)|(n/a)", "") ))
rawL = rawL.withColumn('emp_length', trim(regexp_replace(rawL.emp_length, "< 1", "0") ))
rawL = rawL.withColumn('emp_length', trim(regexp_replace(rawL.emp_length, "10\\+", "10") ).cast('float'))
rawL.display()
#rawL.select('int_rate','revol_util','issue_year','earliest_year').display()
Table
id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment
1 null null 35000 35000 35000 36 months 17.27 1252.56
2 null null 8000 8000 8000 36 months 18.25 290.23
3 null null 5000 5000 5000 36 months 6.97 154.32
4 null null 10000 10000 10000 36 months 9.75 321.5
5 null null 24000 24000 24000 36 months 9.75 771.6
6 null null 9600 9600 9600 36 months 9.75 308.64
7 null null 13000 13000 13000 60 months 8.39 266.03
Showing all 100 rows.
date_time = datetime.datetime.now()
date_tag = date_time.strftime("%Y-%b-%d")
Bronze_Path = DL_Bronze_Path + date_tag
rawL.write.format('parquet').mode('overwrite').save(Bronze_Path + '/Loans')
Read the Bronze data and process it for the Silver layer
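The cell that reads the Bronze files back into loans is not captured in this extract; a minimal sketch, assuming the Bronze path written above:
# Read the Bronze parquet files back into a DataFrame (sketch, assumed path)
loans = spark.read.format('parquet').load(Bronze_Path + '/Loans')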
# You may want to create a log with statistics from your ELT process for control purposes
loans.count()
Out[298]: 100
display(loans)
Table
id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment
12 null null 9000 9000 9000 36 months 5.32 271.04
13 null null 20000 20000 20000 36 months 11.47 659.24
14 null null 30000 30000 30000 36 months 12.99 1010.68
15 null null 32000 32000 32000 36 months 10.75 1043.86
16 null null 16300 16300 16300 36 months 9.75 524.05
17 null null 10500 10500 10500 36 months 11.99 348.71
# Start by selecting only the columns we need for the Bronze layer
loans = loans.select("loan_status", "int_rate", "revol_util", "issue_d", "earliest_cr_line", "emp_length",
                     "verification_status", \
                     "total_pymnt", "loan_amnt", "grade", "annual_inc", "dti", "addr_state", "term",
                     "home_ownership", "purpose", \
                     "issue_year", "earliest_year", "application_type", "delinq_2yrs", "total_acc")
# Creating 'bad_loan' label, which includes charged off, defaulted, and late repayments on loans
loans = loans.filter(loans.loan_status.isin(["Default", "Charged Off", "Fully Paid"])) \
.withColumn("bad_loan", (~(loans.loan_status == "Fully Paid")).cast("string"))
# Calculating the 'net' column, the total amount of money earned or lost per loan
loans = loans.withColumn('net', round(loans.total_pymnt - loans.loan_amnt, 2))
display(loans)
Table
loan_status int_rate revol_util issue_d earliest_cr_line emp_length verification_status total_pymn
1 Fully Paid 5.32 27.9 Mar-2016 Nov-2000 8 Not Verified 16098.34
2 Fully Paid 9.75 19.4 Mar-2016 Aug-2010 1 Verified 8663.31
3 Fully Paid 5.32 23.9 Mar-2016 Dec-1989 10 Not Verified 9361.74112
4 Fully Paid 12.99 75.4 Mar-2016 Apr-2007 1 Verified 11088.6700
5 Charged Off 21.18 87.5 Mar-2016 Jun-2004 0 Verified 4693.26
6 Fully Paid 7.39 59.1 Mar-2016 May-1995 2 Verified 7280.34770
7 Charged Off 15.31 7.7 Mar-2016 Jan-1998 null Verified 10162.81
Showing all 23 rows.
Save our cleaned and conformed data as a Silver file and table in the
Delta Lake
# Write the data in the Silver path
# This data should represent a clean history of business facts with the maximum data detail (level)
# The Silver layer could store info by year (don't think it's necessary in this case)
#date_time = datetime.datetime.now()
#date_tag = date_time.strftime("%Y")
#Silver_Path = DL_Silver_Path + date_tag
# Write the data on disk (use append as previous records will exist in the SV table)
file_path_SV_loans = DL_Silver_Path + '/Loans_SV'
loans.write.format('delta').mode('append').save(file_path_SV_loans)
Out[304]: DataFrame[]
Read the Silver data and prepare it with aggregations for a Business
Unit in the Gold layer
# Read the data (in this option, by reading it from the table)
loans_SV = spark.table('Loans_SV')
display(loans_SV)
Table
loan_status int_rate revol_util issue_d earliest_cr_line emp_length verification_status total_pymn
1 Fully Paid 5.32 27.9 Mar-2016 Nov-2000 8 Not Verified 16098.34
2 Fully Paid 9.75 19.4 Mar-2016 Aug-2010 1 Verified 8663.31
3 Fully Paid 5.32 23.9 Mar-2016 Dec-1989 10 Not Verified 9361.74112
4 Fully Paid 12.99 75.4 Mar-2016 Apr-2007 1 Verified 11088.6700
5 Charged Off 21.18 87.5 Mar-2016 Jun-2004 0 Verified 4693.26
6 Fully Paid 7.39 59.1 Mar-2016 May-1995 2 Verified 7280.34770
7 Charged Off 15.31 7.7 Mar-2016 Jan-1998 null Verified 10162.81
Showing all 23 rows.
Create a Gold Table
Gold Tables are often created to provide clean, reliable data for a specific business unit or use case.
In our case, we'll create a Gold table that includes only 2 columns - addr_state and count - to provide an aggregated view of
our data. For our purposes, this table will allow us to show what Delta Lake can do, but in practice a table like this could be used
to feed a downstream reporting or BI tool that needs data formatted in a very specific way. Silver tables often feed multiple
downstream Gold tables.
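The cell that builds and registers loans_by_state is not captured here; a minimal sketch, assuming a simple count per state saved as a Delta table (the table name matches the SQL queries below):
# Aggregate the Silver data per state and register the Gold table (sketch)
loans_by_state = loans_SV.groupBy("addr_state").count()
loans_by_state.write.format("delta").mode("overwrite").saveAsTable("loans_by_state")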
loans_by_state.count()
Out[309]: 15
Out[311]: DataFrame[]
%sql
-- show tables;
Table
database tableName isTemporary
%sql
SELECT *
FROM loans_by_state
Table
addr_state count
1 SC 1
2 MN 1
3 VA 1
4 MI 1
5 WI 1
6 MD 2
7 MO 1
Showing all 15 rows.
%sql
SELECT addr_state, sum(`count`) AS loans
FROM loans_by_state
GROUP BY addr_state
Table
addr_state loans
1 SC 1
2 MN 1
3 VA 1
4 MI 1
5 WI 1
6 MD 2
7 MO 1
Showing all 15 rows.
Drop the Tables you don't need just for house cleaning purposes
%sql
drop table loans_by_state;
drop table loans_sv;
OK
dbutils.fs.rm("/FileStore/tables/DL", True)
Out[316]: True
SparkSession - hive
SparkContext
Spark UI
Version
v3.3.1
Master
local[8]
AppName
Databricks Shell
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| |temptable| true|
+--------+---------+-----------+
Table
database tableName isTemporary
1 temptable true
Showing 1 row.
#import pyspark.sql.functions as f
#if you use the syntax above you have to prefix the functions with "f".
#Ex: f.explode, f.split, f.avg
#avg_dim_metric = df.groupBy("col_dimension").agg(avg("col_metric"))
#avg_dim_metric.show()
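The cell that creates dfGB is missing from this extract; a minimal sketch, with the rows copied from the output below:
# Build the grouping example DataFrame (sketch)
dfGB = spark.createDataFrame(
    [("UK", "Brooke", "F", 20), ("UK", "Denny", "M", 31), ("UK", "Jules", "M", 30), ("UK", "Tom", "M", 35),
     ("UK", "Mary", "F", 25), ("PT", "Pedro", "M", 28), ("PT", "Rui", "M", 40), ("PT", "Carlos", "M", 34),
     ("PT", "Maria", "F", 45), ("PT", "Sandra", "F", 28)],
    ["country", "name", "gender", "age"])
dfGB.show()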
+-------+------+------+---+
|country| name|gender|age|
+-------+------+------+---+
| UK|Brooke| F| 20|
| UK| Denny| M| 31|
| UK| Jules| M| 30|
| UK| Tom| M| 35|
| UK| Mary| F| 25|
| PT| Pedro| M| 28|
| PT| Rui| M| 40|
| PT|Carlos| M| 34|
| PT| Maria| F| 45|
| PT|Sandra| F| 28|
+-------+------+------+---+
# Group the same names together, aggregate their ages, and compute an average
#dfGB.groupBy("country").agg(avg("age").alias("avg_age")).show()
dfGB.groupBy("country").agg(avg("age"), sum("age"), max("age"), min("age")).show()
+-------+--------+--------+--------+--------+
|country|avg(age)|sum(age)|max(age)|min(age)|
+-------+--------+--------+--------+--------+
| UK| 28.2| 141| 35| 20|
| PT| 35.0| 175| 45| 28|
+-------+--------+--------+--------+--------+
dfGB.groupBy("country","gender").agg(avg("age")).show()
+-------+------+--------+
|country|gender|avg(age)|
+-------+------+--------+
| UK| F| 22.5|
| UK| M| 32.0|
| PT| M| 34.0|
| PT| F| 36.5|
+-------+------+--------+
dfGB.groupBy("country").pivot("gender").agg(avg("age")).show()
+-------+----+----+
|country| F| M|
+-------+----+----+
| PT|36.5|34.0|
| UK|22.5|32.0|
+-------+----+----+
Using Explode
# Use withColumn to derive/create a new column 'Product' based on the column 'List_Products'
dfExp1 = dfExp.withColumn('Product', split('List_Products', ','))
dfExp1.show()
+-------+-------------+------------+
| Client|List_Products| Product|
+-------+-------------+------------+
|Client1| p1,p2,p3|[p1, p2, p3]|
|Client2| p1,p3,p5|[p1, p3, p5]|
|Client3| p3,p4| [p3, p4]|
+-------+-------------+------------+
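The explode step itself is not shown in this extract; a minimal sketch of how the tables below could be produced:
# Explode the Product array so each client/product pair becomes its own row (sketch)
from pyspark.sql.functions import explode
dfExp1.select('Client', explode('Product').alias('Product')).display()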
Table
Client Product
1 Client1 p1
2 Client1 p2
3 Client1 p3
4 Client2 p1
5 Client2 p3
6 Client2 p5
7 Client3 p3
Showing all 8 rows.
Table
Client Product
1 Client1 p1
2 Client1 p2
3 Client1 p3
4 Client2 p1
5 Client2 p2
6 Client2 p3
7 Client3 p3
Showing all 8 rows.
+-------+---+
| Client|col|
+-------+---+
|Client1| p1|
|Client1| p2|
|Client1| p3|
|Client2| p1|
|Client2| p2|
|Client2| p3|
|Client3| p3|
|Client3| p4|
+-------+---+
Join
EmpDF.join(DeptDF, "dept_id").show()
#EmpDF.join(DeptDF, on = "dept_id").show()
#EmpDF.join(DeptDF,EmpDF["dept_id"] == DeptDF["dept_id"]).show()
+-------+------+--------+---------+
|dept_id|emp_id|emp_name|dept_name|
+-------+------+--------+---------+
| 10| 1| Paul| Finance|
| 10| 3| Tom| Finance|
| 20| 2| Mary|Marketing|
| 30| 4| Sandy| Sales|
+-------+------+--------+---------+
# Joins don't support more than 2 DataFrames. Use a chained join instead.
# df1.join(df2,col).join(df3,col)
AdressData = [(1,"1523 Main St","SFO","CA"), (2,"3453 Orange St","SFO","NY"), (3,"34 Warner St","Jersey","NJ"),
              (4,"221 Cavalier St","Newark","DE"), (5,"789 Walnut St","Sandiago","CA")]
AdressData = spark.createDataFrame(AdressData, ["emp_id","addline1","city","state"])
Union
#EmpDF.union(EmpDFplus).show()
# Check the number of columns; a different number of columns will give an error
# Union keeps duplicate rows. Use distinct() or dropDuplicates() (allows selection of columns) to remove them
#EmpDF.union(EmpDFplus).distinct().show()
EmpDF.union(EmpDFplus).dropDuplicates(['emp_id']).show()
+------+--------+-------+
|emp_id|emp_name|dept_id|
+------+--------+-------+
| 1| Paul| 10|
| 2| Mary| 20|
| 3| Tom| 10|
| 4| Sandy| 30|
| 5| Victor| 10|
| 6| Sam| 20|
| 7| Paty| 10|
| 8| Carol| 30|
+------+--------+-------+
Conditional formatting
Table
id name genres
1 1 Toy Story (1995) Adventure|Animation|Chil
2 2 Jumanji (1995) Adventure|Children|Fantas
3 3 Grumpier Old Men (1995) Comedy|Romance
4 4 Waiting to Exhale (1995) Comedy|Drama|Romance
5 5 Father of the Bride Part II (1995) Comedy
6 6 Heat (1995) Action|Crime|Thriller
7 7 Sabrina (1995) Comedy|Romance
Truncated results, showing first 1,000 rows.
moviesDF.withColumn('comedy',when(col('genres').contains('Comedy'),lit('Yes')).otherwise(lit('No'))).show()
# Function lit() creates a column of literal values
+---+--------------------+--------------------+---+------+----------+------+
| id| name| genres| na|rating| views|comedy|
+---+--------------------+--------------------+---+------+----------+------+
| 1| Toy Story (1995)|Adventure|Animati...| 7| 3.0| 851866703| Yes|
| 2| Jumanji (1995)|Adventure|Childre...| 15| 2.0|1134521380| No|
| 3|Grumpier Old Men ...| Comedy|Romance| 5| 4.0|1163374957| Yes|
| 4|Waiting to Exhale...|Comedy|Drama|Romance| 19| 3.0| 855192868| Yes|
| 5|Father of the Bri...| Comedy| 15| 4.5|1093070098| Yes|
| 6| Heat (1995)|Action|Crime|Thri...| 15| 4.0|1040205753| No|
| 7| Sabrina (1995)| Comedy|Romance| 18| 3.0| 856006982| Yes|
| 8| Tom and Huck (1995)| Adventure|Children| 30| 4.0| 968786809| No|
| 9| Sudden Death (1995)| Action| 18| 3.0| 856007219| No|
| 10| GoldenEye (1995)|Action|Adventure|...| 2| 4.0| 835355493| No|
| 11|American Presiden...|Comedy|Drama|Romance| 15| 2.5|1093028381| Yes|
| 12|Dracula: Dead and...| Comedy|Horror| 67| 3.0| 854711916| Yes|
+---+--------------------+--------------------+---+------+----------+---------+
| id| name| genres| na|rating| views|sensitive|
+---+--------------------+--------------------+---+------+----------+---------+
| 1| Toy Story (1995)|Adventure|Animati...| 7| 3.0| 851866703| No|
| 2| Jumanji (1995)|Adventure|Childre...| 15| 2.0|1134521380| No|
| 3|Grumpier Old Men ...| Comedy|Romance| 5| 4.0|1163374957| No|
| 4|Waiting to Exhale...|Comedy|Drama|Romance| 19| 3.0| 855192868| Yes|
| 5|Father of the Bri...| Comedy| 15| 4.5|1093070098| No|
| 6| Heat (1995)|Action|Crime|Thri...| 15| 4.0|1040205753| Yes|
| 7| Sabrina (1995)| Comedy|Romance| 18| 3.0| 856006982| No|
| 8| Tom and Huck (1995)| Adventure|Children| 30| 4.0| 968786809| No|
| 9| Sudden Death (1995)| Action| 18| 3.0| 856007219| No|
| 10| GoldenEye (1995)|Action|Adventure|...| 2| 4.0| 835355493| No|
| 11|American Presiden...|Comedy|Drama|Romance| 15| 2.5|1093028381| Yes|
| 12|Dracula: Dead and...| Comedy|Horror| 67| 3.0| 854711916| Yes|
| 13| Balto (1995)|Adventure|Animati...|182| 3.0| 845745917| No|
| 14| Nixon (1995)| Drama| 15| 2.5|1166586286| Yes|
| 15|Cutthroat Island ...|Action|Adventure|...| 73| 2.5|1255593501| No|
| 16| Casino (1995)| Crime|Drama| 15| 3.5|1093070150| Yes|
| 17|Sense and Sensibi...| Drama|Romance| 2| 5.0| 835355681| Yes|
| 18| Four Rooms (1995)| Comedy| 18| 3.0| 856007359| No|
Exercise 1:
Check your Spark Session and if you have tables registered
spark
SparkSession - hive
SparkContext
Spark UI
Version
v3.3.1
Master
local[8]
AppName
Databricks Shell
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| |temptable| true|
+--------+---------+-----------+
Exercise 2:
Load the movielens (or movielensABD) file to your environment
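The loading cell is not captured here; a minimal sketch, with the file location and options as assumptions:
# Load the movielens CSV into a DataFrame (sketch; path is an assumption)
moviesDF = spark.read.csv("/FileStore/tables/movielensABD.csv", header=True, inferSchema=True)
display(moviesDF)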
Table
id name genres
1 1 Toy Story (1995) Adventure|Animation|Chil
2 2 Jumanji (1995) Adventure|Children|Fantas
3 3 Grumpier Old Men (1995) Comedy|Romance
4 4 Waiting to Exhale (1995) Comedy|Drama|Romance
5 5 Father of the Bride Part II (1995) Comedy
6 6 Heat (1995) Action|Crime|Thriller
7 7 Sabrina (1995) Comedy|Romance
Truncated results, showing first 1,000 rows.
Exercise 3:
Show the data for 5 movies that have a rating greater than 2.5
moviesDF.createOrReplaceTempView("movies")
#saveAstable
spark.sql("select name, genres, rating, views from movies where rating > 2.5 limit 5").show()
+--------------------+--------------------+------+----------+
| name| genres|rating| views|
+--------------------+--------------------+------+----------+
| Toy Story (1995)|Adventure|Animati...| 3.0| 851866703|
|Grumpier Old Men ...| Comedy|Romance| 4.0|1163374957|
|Waiting to Exhale...|Comedy|Drama|Romance| 3.0| 855192868|
|Father of the Bri...| Comedy| 4.5|1093070098|
| Heat (1995)|Action|Crime|Thri...| 4.0|1040205753|
+--------------------+--------------------+------+----------+
only showing top 5 rows
Exercise 4:
What different genres are there?
genresDF = moviesDF.select("genres").distinct()
display(genresDF)
Table
genres
1 Comedy|Horror|Thriller
2 Adventure|Sci-Fi|Thriller
3 Action|Adventure|Drama|Fantasy
5 Comedy|Drama|Horror|Thriller
6 Action|Animation|Comedy|Sci-Fi
7 Animation|Children|Drama|Musical|Romance
8 Action|Adventure|Drama
9 Adventure|Animation
10 Adventure|Sci-Fi
11 Documentary|Musical|IMAX
12 Adventure|Children|Fantasy|Sci-Fi|Thriller
13 Documentary|Sci-Fi
14 Musical|Romance|War
15 Action|Adventure|Fantasy|Romance
16 Adventure|Children|Drama|Fantasy|IMAX
17 Crime|Drama|Fantasy|Horror|Thriller
Showing all 901 rows.
Exercise 5:
How many different genres are there?
genresDF.count()
#genresDF.createOrReplaceTempView("genres")
#spark.sql("select count(*) as count from genres").show()
Out[347]: 901
Exercise 6:
What is the average rating of the movies?
#import pyspark.sql.functions
moviesDF.select(avg("rating")).show()
+-----------------+
| avg(rating)|
+-----------------+
|3.223527465254798|
+-----------------+
What is the highest rating overall, and the highest rating for each genre?
moviesDF.groupby().max("rating").show(5)
+-----------+
|max(rating)|
+-----------+
| 5.0|
+-----------+
moviesDF.groupby("genres").max("rating").alias("Max_Rating").show(5)
+--------------------+-----------+
| genres|max(rating)|
+--------------------+-----------+
|Comedy|Horror|Thr...| 5.0|
|Adventure|Sci-Fi|...| 2.0|
|Action|Adventure|...| 2.5|
| Action|Drama|Horror| 4.5|
|Comedy|Drama|Horr...| 2.5|
+--------------------+-----------+
only showing top 5 rows
Exercise 7:
What are the movies that have a rating greater than 3 and an even id number?
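The cell that produced the listing below is not captured; a minimal sketch:
# Movies with a rating above 3 and an even id (sketch)
moviesDF.filter("rating > 3 and id % 2 = 0").select("id", "name").show()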
+---+--------------------+
| id| name|
+---+--------------------+
| 6| Heat (1995)|
| 8| Tom and Huck (1995)|
| 10| GoldenEye (1995)|
| 16| Casino (1995)|
| 24| Powder (1995)|
| 28| Persuasion (1995)|
| 30|Shanghai Triad (Y...|
| 32|Twelve Monkeys (a...|
| 34| Babe (1995)|
| 36|Dead Man Walking ...|
| 40|Cry, the Beloved ...|
| 50|Usual Suspects, T...|
| 54|Big Green, The (1...|
| 68|French Twist (Gaz...|
| 72|Kicking and Screa...|
| 80|White Balloon, Th...|
| 82|Antonia's Line (A...|
| 84|Last Summer in th...|
Exercise 8:
Show the average rating per genre
genresAvgDF = moviesDF.select("genres","rating").groupBy("genres").avg("rating").toDF("genres","avg_rating")
#genresAvgDF = moviesDF.select("genres","rating").groupBy("genres").agg({"rating": "avg"}).toDF("genres","avg_rating")
#genresAvgDF = moviesDF.select("genres","rating").groupBy("genres").agg(avg("rating")).toDF("genres","avg_rating")
genresAvgDF.display()
Table
genres avg_rating
1 Comedy|Horror|Thriller 2.9615384615384617
2 Adventure|Sci-Fi|Thriller 1.1666666666666667
3 Action|Adventure|Drama|Fantasy 1.3
4 Action|Drama|Horror 4.5
5 Comedy|Drama|Horror|Thriller 2.5
6 Action|Animation|Comedy|Sci-Fi 3.5
7 Animation|Children|Drama|Musical|Romance 4
Showing all 901 rows.
Exercise 9:
What genres have the highest average rating?
genresAvgDF.orderBy("avg_rating",ascending=False).show()
#genresAvgDF.createOrReplaceTempView("genres_avg_rating")
#spark.sql("select * from genres_avg_rating order by avg_rating desc").show()
+--------------------+----------+
| genres|avg_rating|
+--------------------+----------+
| Adventure|Thriller| 5.0|
|Crime|Fantasy|Horror| 5.0|
|Adventure|Comedy|...| 5.0|
|Adventure|Animati...| 5.0|
|Animation|Comedy|...| 5.0|
|Comedy|Drama|Fant...| 5.0|
|Action|Comedy|Fan...| 5.0|
|Animation|Comedy|...| 5.0|
| Drama|Mystery|War| 5.0|
|Children|Comedy|M...| 5.0|
|Action|Adventure|...| 5.0|
|Action|Comedy|Dra...| 5.0|
|Action|Fantasy|Ho...| 5.0|
|Children|Drama|Sc...| 5.0|
| Children|Drama|War| 5.0|
|Adventure|Documen...| 5.0|
| Mystery| 5.0|
Exercise 10:
How many movies with a "Comedy" genre have a rating of 3.0?
moviesDF.select("name","genres","rating") \
.where("genres like '%Comedy%' and rating == 3") \
.select("name") \
.display()
#spark.sql("select name from movies where genres like '%Comedy%' and rating == 3").show()
Table
name
1 Toy Story (1995)
2 Waiting to Exhale (1995)
3 Sabrina (1995)
4 Dracula: Dead and Loving It (1995)
5 Four Rooms (1995)
6 Get Shorty (1995)
7 Mighty Aphrodite (1995)
Showing all 677 rows.
Exercise 11:
11.1 - Load the managers and teams csv.
11.2 - Perform a query that will return a dataframe containing the team name, its managerID, their number of wins (W)
and number of losses (L)
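The loading cells (11.1) are not captured in this extract; a minimal sketch, with the file locations as assumptions:
# Load the managers and teams CSV files (sketch; paths are assumptions)
managersDF = spark.read.csv("/FileStore/tables/managers.csv", header=True, inferSchema=True)
teamsDF = spark.read.csv("/FileStore/tables/teams.csv", header=True, inferSchema=True)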
teamsDF.select("yearID","teamID","name","W","L") \
.join(managersDF.select("teamID","yearID","managerID"), ["teamID", "yearID"]) \
.select("yearID","name", "managerID", "W", "L") \
.toDF("Year", "Name", "Manager", "Wins", "Losses") \
.show()
+----+--------------------+----------+----+------+
|Year| Name| Manager|Wins|Losses|
+----+--------------------+----------+----+------+
|1871|Boston Red Stockings|wrighha01m| 20| 10|
Exercise ACE:
Which is the single type genre (i.e., only one genre, no |) with the highest average rating?
#moviesDF.select(explode(split(moviesDF["genres"],"[|]")).alias("genre"), "rating") \
# .groupBy("genre") \
# .agg({"rating": "avg"}) \
# .orderBy("avg(rating)" ,ascending=False) \
# .show()
+------------------+------------------+
| genre| avg(rating)|
+------------------+------------------+
|(no genres listed)| 3.735294117647059|
| Documentary| 3.641683778234086|
| Film-Noir| 3.549586776859504|
| War| 3.441256830601093|
| Western|3.4107142857142856|
| Drama| 3.383895563770795|
| Musical|3.3553299492385786|
| Animation| 3.335570469798658|
| Romance|3.2868267358857883|
| Mystery|3.2309124767225326|
| Crime| 3.208791208791209|
| Children|3.1348797250859106|
| Comedy|3.1303296038705777|
| Adventure| 3.121863799283154|
| Thriller| 3.053290623179965|
| Sci-Fi|3.0214917825537295|
| Fantasy|3.0191424196018377|
| Action|2.9504212572909916|
Clean any old files and folders left over from previous runs
%sh rm -r /temp-files/*
Out[362]: True
Out[363]: False
%sh rm -r /temp-files/*
total 8
drwxr-xr-x 2 root root 4096 Jan 5 14:15 .
drwxr-xr-x 1 root root 4096 Jan 5 14:15 ..
# Remember that with the Python process below you will be working on the local (driver node) file system, not on DBFS
%pwd
Out[367]: '/databricks/driver'
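The cell that builds file_name is missing from this extract; a hypothetical sketch consistent with the tag value shown further below:
# Build the tag file name on the driver's local file system (sketch; names and format are assumptions)
import datetime
date_tag = datetime.datetime.now().strftime("%Y-%b-%d_%H-%M-%S")
file_name = "/temp-files/file_" + date_tag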
f= open(file_name,"w+")
line = date_tag + ';' + 'ABD2022' + ';' + 'Yes'
f.write(line)
f.close()
f= open(file_name,"w+")
line = date_tag + ';' + 'ABD2022' + ';' + 'No'
f.write(line)
f.close()
total 16
drwxr-xr-x 2 root root 4096 Jan 5 14:15 .
drwxr-xr-x 1 root root 4096 Jan 5 14:15 ..
#%sh rm -r /temp-files/*
Table
value
1 2023-Jan-05_14-15-32;ABD2022;Yes
Showing 1 row.
# Checking the code (if needed) to transform one text line of the tag file into a 3-column DF ('TimeStamp', 'Device', 'Flag')
#w = spark.read.text("file:/temp-files/file_2022-Nov-05_15-10-43")
#w = w.withColumn('TimeStamp', split('value', ';').getItem(0)) \
# .withColumn('Device', split('value', ';').getItem(1)) \
# .withColumn('Flag', split('value', ';').getItem(2)) \
# .withColumn('Tags_count', lit('1')) \
# .drop('value')
#w.display()
help(display)
Display plot:
- display() # no-op
- display(matplotlib.figure.Figure)
Display dataset:
- display(spark.DataFrame)
- display(list) # if list can be converted to DataFrame, e.g., list of named tuples
- display(pandas.DataFrame)
- display(koalas.DataFrame)
- display(pyspark.pandas.DataFrame)
Table
TimeStamp Device Flag
1 2023-Jan-05_14-15-32 ABD2022 Yes
Showing 1 row.
Table
TimeStamp Device Flag
1 2023-Jan-05_14-15-32 ABD2022 Yes
Showing 1 row.
dbutils.fs.ls("/Stream/Stream_Out/")
ABD_query.status
ABD_query.stop()
ABD_query.status
Out[386]: True
2. define the computation on the input table to a results table (as if it were a static table);
Triggers
Developers define triggers to control how frequently the input table is updated.
Each time a trigger fires, Spark checks for new data (new rows for the input table), and updates the result.
The default value is ProcessingTime(0) and it will run the query as fast as possible.
The trigger specifies when the system should process the next set of data.
Fixed interval micro-batches: .trigger(Trigger.ProcessingTime("6 hours")) - the query will be executed in micro-batches kicked off at the user-specified intervals.
One-time micro-batch: .trigger(Trigger.Once()) - the query will execute only one micro-batch to process all the available data and then stop on its own.
Example: .trigger(Trigger.ProcessingTime("3 seconds"))
Checkpointing
A checkpoint stores the current state of your streaming job to a reliable storage system such as Azure Blob Storage or HDFS. It
does not store the state of your streaming job to the local file system of any node in your cluster.
Together with write ahead logs, a terminated stream can be restarted and it will continue from where it left off.
To enable this feature, you only need to specify the location of a checkpoint directory:
.option("checkpointLocation", checkpointPath)
Output Modes
| Mode | Example | Notes |
| --- | --- | --- |
| Complete | .outputMode("complete") | The entire updated Result Table is written to the sink. The individual sink implementation decides how to handle writing the entire table. |
| Append | .outputMode("append") | Only the new rows appended to the Result Table since the last trigger are written to the sink. |
| Update | .outputMode("update") | Only the rows in the Result Table that were updated since the last trigger will be written to the sink. Since Spark 2.1.1. |
In the example below, we are writing to a Parquet directory which only supports the append mode:
dsw.outputMode("append")
Output Sinks
DataStreamWriter.format accepts the following values, among others:
| Output Sink | Example | Notes |
| --- | --- | --- |
| File | dsw.format("parquet"), dsw.format("csv"), ... | Dumps the Result Table to a file. Supports Parquet, JSON, CSV, etc. |
| foreach | dsw.foreach(writer: ForeachWriter) | This is your "escape hatch", allowing you to write your own type of sink. |
In the example below, we will be appending files to a Parquet directory and specifying its location with this call:
.format("parquet").start(outputPathDir)
spark
SparkSession - hive
SparkContext
Spark UI
Version
v3.3.1
Master
local[8]
AppName
Databricks Shell
Sample Data
We have some sample action data as files in /databricks-datasets/structured-streaming/events/ which we are going to
use to build this application. Let's take a look at the contents of this directory.
%fs ls /databricks-datasets/structured-streaming/events/
Table
path name size modificationTime
1 dbfs:/databricks-datasets/structured-streaming/events/file-0.json file-0.json 72530 1469673865000
2 dbfs:/databricks-datasets/structured-streaming/events/file-1.json file-1.json 72961 1469673866000
3 dbfs:/databricks-datasets/structured-streaming/events/file-10.json file-10.json 73025 1469673878000
4 dbfs:/databricks-datasets/structured-streaming/events/file-11.json file-11.json 72999 1469673879000
5 dbfs:/databricks-datasets/structured-streaming/events/file-12.json file-12.json 72987 1469673880000
6 dbfs:/databricks-datasets/structured-streaming/events/file-13.json file-13.json 73006 1469673881000
7 dbfs:/databricks-datasets/structured-streaming/events/file-14.json file-14.json 73003 1469673882000
Showing all 50 rows.
There are about 50 JSON files in the directory. Let's see what each JSON file contains.
Each line in the file contains a JSON record with two fields, time and action. Let's try to analyze these files interactively.
Batch/Interactive Processing
The usual first step in attempting to process the data is to interactively query the data. Let's define a static DataFrame on the files,
and give it a table name.
inputPath = "/databricks-datasets/structured-streaming/events/"
# Since we know the data format already, let's define the schema to speed up processing (no need for Spark to infer
schema)
jsonSchema = StructType([ StructField("time", TimestampType(), True), StructField("action", StringType(), True) ])
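The definition of staticInputDF is not shown in this extract; a minimal sketch using the path and schema above:
# Read the JSON files as a static DataFrame (sketch)
staticInputDF = (
  spark
    .read
    .schema(jsonSchema)
    .json(inputPath)
)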
display(staticInputDF)
Table
time action
1 2016-07-28T04:19:28.000+0000 Close
2 2016-07-28T04:19:28.000+0000 Close
3 2016-07-28T04:19:29.000+0000 Open
4 2016-07-28T04:19:31.000+0000 Close
5 2016-07-28T04:19:31.000+0000 Open
6 2016-07-28T04:19:31.000+0000 Open
7 2016-07-28T04:19:32.000+0000 Close
Truncated results, showing first 1,000 rows.
Now we can compute the number of "open" and "close" actions with one hour windows. To do this, we will group by the
action column and 1 hour windows over the time column.
staticCountsDF = (
staticInputDF
.groupBy(
staticInputDF.action,
window(staticInputDF.time, "1 hour"))
.count()
)
staticCountsDF.cache()
Table
action window count
1 Close {"start": "2016-07-26T13:00:00.000+0000", "end": "2016-07-26T14:00:00.000+0000"} 1028
2 Open {"start": "2016-07-26T18:00:00.000+0000", "end": "2016-07-26T19:00:00.000+0000"} 1004
3 Close {"start": "2016-07-27T02:00:00.000+0000", "end": "2016-07-27T03:00:00.000+0000"} 971
4 Open {"start": "2016-07-27T04:00:00.000+0000", "end": "2016-07-27T05:00:00.000+0000"} 995
5 Open {"start": "2016-07-27T05:00:00.000+0000", "end": "2016-07-27T06:00:00.000+0000"} 986
6 Open {"start": "2016-07-26T05:00:00.000+0000", "end": "2016-07-26T06:00:00.000+0000"} 1000
7 Open {"start": "2016-07-26T11:00:00.000+0000", "end": "2016-07-26T12:00:00.000+0000"} 991
Showing all 104 rows.
Now we can directly use SQL to query the table. For example, here are the total counts across all the hours.
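The registration of the temp view and the totals query are not captured here; a sketch of both steps (the query mirrors the one used later on the streaming counts):
staticCountsDF.createOrReplaceTempView("static_counts")
%sql select action, sum(count) as total_count from static_counts group by action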
Visualization: bar chart of total_count by action (Close, Open).
%sql select action, date_format(window.end, "MMM-dd HH:mm") as time, count from static_counts order by time, action
Visualization: hourly counts by time window and action, from Jul-26 to Jul-27.
Note the two ends of the graph. The close actions are generated such that they are after the corresponding open actions, so
there are more "opens" in the beginning and more "closes" in the end.
Stream Processing
Now that we have analyzed the data interactively, let's convert this to a streaming query that continuously updates as data
comes. Since we just have a static set of files, we are going to emulate a stream from them by reading one file at a time, in the
chronological order they were created. The query we have to write is pretty much the same as the interactive query above.
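The streaming definition itself is missing from this extract; a minimal sketch, mirroring the static query above and reading one file per trigger to emulate the stream:
from pyspark.sql.functions import window

# Treat the directory of files as a stream, one file per micro-batch (sketch)
streamingInputDF = (
  spark
    .readStream
    .schema(jsonSchema)               # reuse the schema defined earlier
    .option("maxFilesPerTrigger", 1)  # pick up one file at a time
    .json(inputPath)
)
# Same windowed count as the static query
streamingCountsDF = (
  streamingInputDF
    .groupBy(
      streamingInputDF.action,
      window(streamingInputDF.time, "1 hour"))
    .count()
)
streamingCountsDF.isStreaming  # True for a streaming DataFrame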
Out[392]: True
As you can see, streamingCountsDF is a streaming DataFrame ( streamingCountsDF.isStreaming was true ). You can start
the streaming computation by defining the sink and starting it. In our case, we want to interactively query the counts (same queries
as above), so we will set the complete set of 1 hour counts to be in an in-memory table (note that this is for testing purposes only in
Spark 2.0).
query = (
streamingCountsDF
.writeStream
.format("memory") # memory = store in-memory table
.queryName("counts") # counts = name of the in-memory table
.outputMode("complete") # complete = all the counts should be in the table
.start()
)
query is a handle to the streaming query that is running in the background. This query is continuously picking up files and
updating the windowed counts.
Note the status of query in the above cell. The progress bar shows that the query is active. Furthermore, if you expand the
> counts above, you will find the number of files they have already processed.
Let's wait a bit for a few files to be processed and then interactively query the in-memory counts table.
%sql select action, date_format(window.end, "MMM-dd HH:mm") as time, count from counts order by time, action
Visualization: windowed counts by time and action (Jul-26 03:00 through Jul-26 19:00).
We see the timeline of windowed counts (similar to the static one earlier) building up. If we keep running this interactive query
repeatedly, we will see the latest updated counts which the streaming query is updating in the background.
%sql select action, date_format(window.end, "MMM-dd HH:mm") as time, count from counts order by time, action
Visualization: windowed counts by time and action (Jul-26 03:00 through Jul-26 22:00).
%sql select action, date_format(window.end, "MMM-dd HH:mm") as time, count from counts order by time, action
Visualization: windowed counts by time and action (Jul-26 03:00 through Jul-27 01:00).
%sql select action, sum(count) as total_count from counts group by action order by action
Visualization: bar chart of total_count by action (Close, Open).
If you keep running the above query repeatedly, you will always find that the number of "opens" is more than the number of
"closes", as expected in a data stream where a "close" always appears after the corresponding "open". This shows that Structured
Streaming ensures prefix integrity. Read the blog posts linked below if you want to know more.
Note that there are only a few files, so after consuming all of them there will be no further updates to the counts. Rerun the query if you want
to interact with the streaming query again.
Finally, you can stop the query running in the background, either by clicking on the 'Cancel' link in the cell of the query, or by
executing query.stop() . Either way, when the query is stopped, the status of the corresponding cell above will automatically
update to TERMINATED .
query.stop()
NOTE: remember that for running this Notebook you must use a Databricks Runtime Version with Spark ML
Creating GraphFrames
Users can create GraphFrames from vertex and edge DataFrames.
Vertex DataFrame: A vertex DataFrame should contain a special column named "id" which specifies unique IDs for each vertex
in the graph.
Edge DataFrame: An edge DataFrame should contain two special columns: "src" (source vertex ID of edge) and "dst"
(destination vertex ID of edge).
Both DataFrames can have arbitrary other columns. Those columns can represent vertex and edge attributes.
vertices = spark.createDataFrame([
("a", "Alice", 34),
("b", "Bob", 36),
("c", "Charlie", 30),
("d", "David", 29),
("e", "Esther", 32),
("f", "Fanny", 36),
("g", "Gabby", 60)], ["id", "name", "age"])
edges = spark.createDataFrame([
("a", "b", "friend"),
("b", "c", "follow"),
("c", "b", "follow"),
("f", "c", "follow"),
("e", "f", "follow"),
("e", "d", "friend"),
("d", "a", "friend"),
("a", "e", "friend")
], ["src", "dst", "relationship"])
g = GraphFrame(vertices, edges)
display(g)
GraphFrame(v:[id: string, name: string ... 1 more field], e:[src: string, dst: string ... 1 more field])
Also, since GraphFrames represent graphs as pairs of vertex and edge DataFrames, it is easy to make powerful queries directly on
the vertex and edge DataFrames. Those DataFrames are made available as vertices and edges fields in the GraphFrame.
display(g.vertices)
Table
id name age
1 a Alice 34
2 b Bob 36
3 c Charlie 30
4 d David 29
5 e Esther 32
6 f Fanny 36
7 g Gabby 60
Showing all 7 rows.
display(g.edges)
Table
src dst relationship
display(g.inDegrees)
Table
id inDegree
1 b 2
2 c 2
3 f 1
4 d 1
5 a 1
6 e 1
display(g.outDegrees)
Table
id outDegree
1 a 2
2 b 1
3 c 1
4 f 1
5 e 2
6 d 1
Table
id degree
1 a 3
2 b 3
3 c 3
4 f 2
5 e 3
6 d 2
You can run queries directly on the vertices DataFrame. For example, we can find the age of the youngest person in the graph:
youngest = g.vertices.groupBy().min("age")
#oldest = g.vertices.groupBy().max("age")
display(youngest)
#display(oldest)
Table
min(age)
1 29
Showing 1 row.
Likewise, you can run queries on the edges DataFrame. For example, let's count the number of 'follow' relationships in the graph:
relationships = g.edges.groupBy("relationship").count()
display(relationships)
Table
relationship count
1 friend 4
2 follow 4
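The two listings below filter the edges by relationship type; the filtering cells are not captured, so here is a minimal sketch:
g.edges.filter("relationship = 'follow'").show()
g.edges.filter("relationship = 'friend'").show()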
+---+---+------------+
|src|dst|relationship|
+---+---+------------+
| b| c| follow|
| c| b| follow|
| f| c| follow|
| e| f| follow|
+---+---+------------+
+---+---+------------+
|src|dst|relationship|
+---+---+------------+
| a| b| friend|
| e| d| friend|
| d| a| friend|
| a| e| friend|
+---+---+------------+
Motif finding
Using motifs you can build more complex relationships involving edges and vertices. The following cell finds the pairs of vertices
with edges in both directions between them. The result is a DataFrame, in which the column names are given by the motif keys.
Check out the GraphFrame User Guide (http://graphframes.github.io/user-guide.html#motif-finding) for more details on the API.
# Do a test with filter and then motif find, keeping only the links of type "friend"
# Do also a test with the filter condition after the find()
# col(edge)["relationship"]
#motifs = g.filter("relationship = 'friend'").find("(x)-[e1]->(y); (y)-[e2]->(x)")
#motifs = g.find("(x)-[e1]->(y); (y)-[e2]->(x)").filter(e1['relationship'] == "friend")
#display(motifs)
# Search for pairs of vertices with edges in both directions between them.
motifs = g.find("(x)-[e1]->(y); (y)-[e2]->(x)")
display(motifs)
Table
x e1 y e2
{"id": "b", "name": "Bob", "age": 36} {"src": "b", "dst": "c", "relationship": {"id": "c", "name": "Charlie", "age": {"src": "c", "dst": "b", "
1
"follow"} 30} "follow"}
{"id": "c", "name": "Charlie", "age": {"src": "c", "dst": "b", "relationship": {"id": "b", "name": "Bob", "age": 36} {"src": "b", "dst": "c", "
2
Table
a e1 b e2
{"id": "a", "name": "Alice", "age": {"src": "a", "dst": "e", "relationship": {"id": "e", "name": "Esther", "age": {"src": "e", "dst": "d", "rel
1
34} "friend"} 32} "friend"}
{"id": "d", "name": "David", "age": {"src": "d", "dst": "a", "relationship": {"id": "a", "name": "Alice", "age": {"src": "a", "dst": "e", "rel
2
29} "friend"} 34} "friend"}
{"id": "e", "name": "Esther", "age": {"src": "e", "dst": "d", "relationship": {"id": "d", "name": "David", "age": {"src": "d", "dst": "a", "rel
3
32} "friend"} 29} "friend"}
# Do more conditions
# A vertex connected to another vertex (doesn't matter the edge)
motifs2 = g.find("(a)-[]->(b)")
display(motifs2)
Table
a b
1 {"id": "a", "name": "Alice", "age": 34} {"id": "b", "name": "Bob", "age": 36}
{"id": "b", "name": "Bob", "age": 36} {"id": "c", "name": "Charlie", "age":
2
30}
{"id": "c", "name": "Charlie", "age": {"id": "b", "name": "Bob", "age": 36}
3
30}
{"id": "f", "name": "Fanny", "age": {"id": "c", "name": "Charlie", "age":
4
36} 30}
{"id": "e" "name": "Esther" "age": {"id": "f" "name": "Fanny" "age":
Showing all 8 rows.
# Do more conditions
# Do we have a vertex connected to itself without intermediate vertices?
motifs3 = g.find("(a)-[]->(a)")
display(motifs3)
# Do more conditions
# A vertex connected to itself through an intermediate vertex
motifs4 = g.find("(a)-[]->(b);(b)-[]->(a)")
display(motifs4)
Table
a b
{"id": "b", "name": "Bob", "age": 36} {"id": "c", "name": "Charlie", "age":
1
30}
{"id": "c", "name": "Charlie", "age": {"id": "b", "name": "Bob", "age": 36}
2
Since the result is a DataFrame, more complex queries can be built on top of the motif. Let us find all the reciprocal relationships
in which one person is older than 30:
# Search for pairs of vertices with edges in both directions between them.
motifs5 = g.find("(a)-[e]->(b); (b)-[e2]->(a)")
display(motifs5)
Table
a e b e2
{"id": "b", "name": "Bob", "age": 36} {"src": "b", "dst": "c", "relationship": {"id": "c", "name": "Charlie", "age": {"src": "c", "dst": "b", "
1
"follow"} 30} "follow"}
{"id": "c", "name": "Charlie", "age": {"src": "c", "dst": "b", "relationship": {"id": "b", "name": "Bob", "age": 36} {"src": "b", "dst": "c", "
2
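The age filter itself is not shown in this extract; a minimal sketch of how it could be applied on top of the motif result:
from pyspark.sql.functions import col
# Keep reciprocal pairs where at least one of the two people is older than 30 (sketch)
display(motifs5.filter((col("a.age") > 30) | (col("b.age") > 30)))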
Subgraphs
GraphFrames provides APIs for building subgraphs by filtering on edges and vertices. These filters can be composed together, for
example the following subgraph only includes people who are more than 30 years old and have friends who are more than 30
years old.
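The cell that builds g2 is not captured here; a minimal sketch using the standard GraphFrames filters:
# Vertices older than 30, 'friend' edges only, and drop vertices left without edges (sketch)
g2 = g.filterVertices("age > 30").filterEdges("relationship = 'friend'").dropIsolatedVertices()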
display(g2.vertices)
Table
id name age
1 a Alice 34
2 b Bob 36
3 e Esther 32
display(g2.edges)
Table
src dst relationship
1 a b friend
2 a e friend
Table
from e0 v1 e1
{"id": "a", "name": "Alice", "age": {"src": "a", "dst": "b", "relationship": {"id": "b", "name": "Bob", "age": {"src": "b", "dst": "c", "relatio
1
34} "friend"} 36} "follow"}
Showing 1 row.
Table
from e0 to
{"id": "e", "name": "Esther", "age": {"src": "e", "dst": "d", "relationship": {"id": "d", "name": "David", "age":
1
32} "friend"} 29}
Showing 1 row.
The search may also be limited by edge filters and maximum path lengths.
filteredPaths = g.bfs(
fromExpr = "name = 'Esther'",
toExpr = "age < 32",
edgeFilter = "relationship != 'friend'",
maxPathLength = 3)
display(filteredPaths)
Table
from e0 v1 e1
{"id": "e", "name": "Esther", "age": {"src": "e", "dst": "f", "relationship": {"id": "f", "name": "Fanny", "age": {"src": "f", "dst": "c", "rela
1
32} "follow"} 36} "follow"}
Showing 1 row.
Connected components
Compute the connected component membership of each vertex and return a DataFrame with each vertex assigned a component
ID. The GraphFrames connected components implementation can take advantage of checkpointing to improve performance.
# Be prepared, this example may take +10m to run (don't do it during the class)
# Check the results with the graphs elements to see that G is the only element that is connected only to A
sc.setCheckpointDir("/tmp/graphframes-example-connected-components")
result = g.connectedComponents()
display(result)
Table
id name age component
1 a Alice 34 0
2 c Charlie 30 0
3 d David 29 0
4 e Esther 32 0
5 f Fanny 36 0
6 b Bob 36 0
7 g Gabby 60 8589934593
Showing all 7 rows.
result = g.stronglyConnectedComponents(maxIter=5)
display(result.select("id", "component"))
Table
id component
1 a 0
2 b 1
3 c 1
4 d 0
5 e 0
6 f 4
7 g 8589934593
Showing all 7 rows.
Label Propagation
Run static Label Propagation Algorithm for detecting communities in networks.
Each node in the network is initially assigned to its own community. At every superstep, nodes send their community affiliation to
all neighbors and update their state to the most frequent community affiliation of incoming messages.
LPA is a standard community detection algorithm for graphs. It is very inexpensive computationally, although (1) convergence is
not guaranteed and (2) one can end up with trivial solutions (all nodes are identified into a single community).
result = g.labelPropagation(maxIter=5)
display(result)
Table
id name age label
1 a Alice 34 2
2 b Bob 36 1
3 c Charlie 30 8589934592
4 d David 29 2
5 e Esther 32 2
6 f Fanny 36 2
7 g Gabby 60 8589934593
PageRank
Identify important vertices in a graph based on connections.
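The PageRank call is not shown in this extract; a minimal sketch with assumed parameters:
# Run PageRank until convergence; results.vertices gains a 'pagerank' column, results.edges a 'weight' column (sketch)
results = g.pageRank(resetProbability=0.15, tol=0.01)
display(results.vertices)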
Table
id name age pagerank
1 a Alice 34 0.44910633706538744
2 b Bob 36 2.655507832863289
3 c Charlie 30 2.6878300011606218
4 d David 29 0.3283606792049851
5 e Esther 32 0.37085233187676075
6 f Fanny 36 0.3283606792049851
7 g Gabby 60 0.1799821386239711
Showing all 7 rows.
display(results.edges)
Table
src dst relationship weight
1 a e friend 0.5
2 a b friend 0.5
3 b c follow 1
4 c b follow 1
5 d a friend 1
6 e d friend 0.5
7 e f follow 0.5
Showing all 8 rows.
Out[436]: GraphFrame(v:[id: string, name: string ... 2 more fields], e:[src: string, dst: string ... 2 more fields])
Out[437]: GraphFrame(v:[id: string, name: string ... 2 more fields], e:[src: string, dst: string ... 2 more fields])
Shortest paths
Computes shortest paths to the given set of landmark vertices, where landmarks are specified by vertex ID.
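The call itself is missing from this extract; a minimal sketch with landmarks matching the distances shown below:
# Shortest path distances from every vertex to the landmark vertices "a" and "d" (sketch)
results = g.shortestPaths(landmarks=["a", "d"])
display(results)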
Table
id name age distances
1 a Alice 34 {"d": 2, "a": 0}
2 b Bob 36 {}
3 c Charlie 30 {}
4 d David 29 {"d": 0, "a": 1}
6 f Fanny 36 {}
7 g Gabby 60 {}
Showing all 7 rows.
Triangle count
Computes the number of triangles passing through each vertex.
results = g.triangleCount()
display(results)
Table
count id name age
1 1 a Alice 34
2 0 b Bob 36
3 0 c Charlie 30
4 1 d David 29
5 1 e Esther 32
6 0 f Fanny 36
7 0 g Gabby 60
Showing all 7 rows.
ML Examples
display(trainingLiR)
Table
label features
-9.490009878824548 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.4551273600657362, 0.366446943519
1 -0.38256108933468047, -0.4458430198517267, 0.33109790358914726, 0.8067445293443565, -0.2624341731773887,
-0.44850386111659524, -0.07269284838169332, 0.5658035575800715]}
0.2577820163584905 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.8386555657374337, -0.12701805115
2 0.499812362510895, -0.22686625128130267, -0.6452430441812433, 0.18869982177936828, -0.5804648622673358,
0.651931743775642, -0.6555641246242951, 0.17485476357259122]}
-4.438869807456516 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.5025608135349202, 0.142080696829
3 0.16004976900412138, 0.505019897181302, -0.9371635223468384, -0.2841601610457427, 0.6355938616712786,
-0.1646249064941625, 0.9480713629917628, 0.42681251564645817]}
-19.782762789614537 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [-0.0388509668871313,
4 -0.4166870051763918, 0.8997202693189332, 0.6409836467726933, 0.273289095712564, -0.26175701211620517,
-0.2794902492677298, -0.1306778297187794, -0.08536581111046115, -0.05462315824828923]}
-7.966593841555266 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [-0.06195495876886281,
5 0.6546448480299902, -0.6979368909424835, 0.6677324708883314, -0.07938725467767771, -0.43885601665437957,
-0.608071585153688, -0.6414531182501653, 0.7313735926547045, -0.026818676347611925]}
-7.896274316726144 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [-0.15805658673794265,
6 0.26573958270655806, 0.3997172901343442, -0.3693430998846541, 0.14324061105995334, -0.25797542063247825,
0.7436291919296774, 0.6114618853239959, 0.2324273700703574, -0.25128128782199144]}
-8.464803554195287 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.39449745853945895, 0.81722916041
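The fitting cell is not captured in this export; a minimal sketch with assumed hyper-parameters that would print coefficients and an intercept like the ones below:
from pyspark.ml.regression import LinearRegression

# Train a linear regression model on the training data (sketch)
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(trainingLiR)
print("Coefficients: %s" % str(lrModel.coefficients))
print("Intercept: %s" % str(lrModel.intercept))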
Coefficients: [0.0,0.3229251667740594,-0.3438548034562219,1.915601702345841,0.05288058680386255,0.765962720459771,0.
0,-0.15105392669186676,-0.21587930360904645,0.2202536918881343]
Intercept: 0.15989368442397356
# Summarize the model over the training set and print out some metrics
trainingSummary = lrModel.summary
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
trainingSummary.residuals.show()
print("Root Mean Squared Error - RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("R2: %f" % trainingSummary.r2)
numIterations: 6
objectiveHistory: [0.49999999999999994, 0.4967620357443381, 0.49363616643404634, 0.4936351537897608, 0.49363512141778
71, 0.49363512062528014, 0.4936351206216114]
+--------------------+
| residuals|
+--------------------+
| -9.889232683103197|
| 0.5533794340053553|
| -5.204019455758822|
| -20.566686715507508|
| -9.4497405180564|
| -6.909112502719487|
| -10.00431602969873|
| 2.0623978070504845|
| 3.1117508432954772|
| -15.89360822941938|
| -5.036284254673026|
| 6.4832158769943335|
| 12.429497299109002|
| -20.32003219007654|
| -2.0049838218725|
Linear Regression
with test (and train) data and pipelines
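The pipeline definition is not captured in this extract; a minimal sketch with a single regression stage:
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression

# A one-stage pipeline: the libsvm data already provides 'label' and 'features' columns (sketch)
lr = LinearRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[lr])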
# Load data
data = spark.read.format("libsvm")\
.load("dbfs:/FileStore/tables/sample_linear_regression_data.txt")
# Split the data into training and test sets (20% held out for testing)
(trainingData, testData) = data.randomSplit([0.8,0.2])
# Train Model
model = pipeline.fit(trainingData)
# Make Predictions
predictions = model.transform(testData)
# Show Predictions
display(predictions)
Table
label features
-26.805483428483072 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.4572552704218824, -0.576096954000
1 -0.20809839485012915, 0.9140086345619809, -0.5922981637492224, -0.8969369345510854, 0.3741080343476908,
-0.01854004246308416, 0.07834089512221243, 0.3838413057880994]}
-23.487440120936512 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [-0.5195354431261132, 0.808035794841
2 0.8498613208566037, 0.044766977500795946, -0.9031972948753286, 0.284006053218262, 0.9640004956647206,
-0.04090127960289358, 0.44190479952918427, -0.7359820144913463]}
-19.66731861537172 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.9353590082406811, 0.8768609458072
3 0.9618210554140587, 0.12103715737151921, -0.7691766106953688, -0.4220229608873225, -0.18117247651928658,
-0.14333978019692784, -0.31512358142857066, 0.4022153556528465]}
-19.402336030214553 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.462288625222409, -0.9029755259427
4 0.7442695642729447, 0.3802724233363486, 0.4068685903786069, -0.5054707879424198, -0.8686166000900748,
-0.014710838968344575, -0.1362606460134499, 0.8444452252816472]}
-17.494200356883344 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [-0.4218585945316018,
5 0.15566399304488754, -0.164665303422032, -0.8579743106885072, 0.5651453461779163, -0.6582935645654426,
-0.40838717556437576, -0.19258926475033356, 0.9864284520934183, 0.7156150246487265]}
-17.32672073267595 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.31374599099683476,
6 -0.36270498808879115, 0.7456203273799138, 0.046239858938568856, -0.030136501929084014, -0.06596637210739509,
-0.46829487815816484, -0.2054839116368734, -0.7006480295111763, -0.6886047709544985]}
-17.026492264209548 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.8367805314799452, 0.155919044362
Coefficients: [0.0,0.5632690395178652,-0.26439486198649054,1.510281213933446,0.0,0.6917959515358358,0.0,0.0,0.0,0.468
5019027558994]
Intercept: 0.3701057577245132
RMSE: 10.335940526345189
R2: 0.019420906856787323
Table
label features
0 {"vectorType": "sparse", "length": 692, "indices": [127, 128, 129, 130, 131, 154, 155, 156, 157, 158, 159, 181, 182, 183, 184, 185,
186, 187, 188, 189, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 262,
263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 289, 290, 291, 292, 293, 294, 295, 296, 297, 300, 301, 302, 316, 317, 318, 319,
320, 321, 328, 329, 330, 343, 344, 345, 346, 347, 348, 349, 356, 357, 358, 371, 372, 373, 374, 384, 385, 386, 399, 400, 401, 412, 413,
414, 426, 427, 428, 429, 440, 441, 442, 454, 455, 456, 457, 466, 467, 468, 469, 470, 482, 483, 484, 493, 494, 495, 496, 497, 510, 511,
512, 520, 521, 522, 523, 538, 539, 540, 547, 548, 549, 550, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 594, 595,
596, 597, 598, 599, 600, 601, 602, 603, 604, 622, 623, 624, 625, 626, 627, 628, 629, 630, 651, 652, 653, 654, 655, 656, 657], "values":
1
[51, 159, 253, 159, 50, 48, 238, 252, 252, 252, 237, 54, 227, 253, 252, 239, 233, 252, 57, 6, 10, 60, 224, 252, 253, 252, 202, 84, 252,
253, 122, 163, 252, 252, 252, 253, 252, 252, 96, 189, 253, 167, 51, 238, 253, 253, 190, 114, 253, 228, 47, 79, 255, 168, 48, 238, 252,
252, 179, 12, 75, 121, 21, 253, 243, 50, 38, 165, 253, 233, 208, 84, 253, 252, 165, 7, 178, 252, 240, 71, 19, 28, 253, 252, 195, 57, 252,
252, 63, 253, 252, 195, 198, 253, 190, 255, 253, 196, 76, 246, 252, 112, 253, 252, 148, 85, 252, 230, 25, 7, 135, 253, 186, 12, 85, 252,
223, 7, 131, 252, 225, 71, 85, 252, 145, 48, 165, 252, 173, 86, 253, 225, 114, 238, 253, 162, 85, 252, 249, 146, 48, 29, 85, 178, 225,
253, 223, 167, 56, 85, 252, 252, 252, 229, 215, 252, 252, 252, 196, 130, 28, 199, 252, 252, 253, 252, 252, 233, 145, 25, 128, 252, 253,
252, 141, 37]}
1 {"vectorType": "sparse", "length": 692, "indices": [158, 159, 160, 161, 185, 186, 187, 188, 189, 213, 214, 215, 216, 217, 240, 241,
242, 243, 244, 245, 267, 268, 269, 270, 271, 295, 296, 297, 298, 322, 323, 324, 325, 326, 349, 350, 351, 352, 353, 377, 378, 379, 380,
381, 404, 405, 406, 407, 408, 431, 432, 433, 434, 435, 459, 460, 461, 462, 463, 486, 487, 488, 489, 490, 514, 515, 516, 517, 518, 542,
543, 544, 545, 569, 570, 571, 572, 573, 596, 597, 598, 599, 600, 601, 624, 625, 626, 627, 652, 653, 654, 655, 680, 681, 682, 683],
2
"values": [124, 253, 255, 63, 96, 244, 251, 253, 62, 127, 251, 251, 253, 62, 68, 236, 251, 211, 31, 8, 60, 228, 251, 251, 94, 155, 253,
253, 189, 20, 253, 251, 235, 66, 32, 205, 253, 251, 126, 104, 251, 253, 184, 15, 80, 240, 251, 193, 23, 32, 253, 253, 253, 159, 151,
251, 251, 251, 39, 48, 221, 251, 251, 172, 234, 251, 251, 196, 12, 253, 251, 251, 89, 159, 255, 253, 253, 31, 48, 228, 253, 247, 140, 8,
# Split the data into training and test sets (30% held out for testing)
(trainingDataLoR, testDataLoR) = dataLoR.randomSplit([0.7,0.3])
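The fitting and scoring cells are not shown here; a minimal sketch with assumed hyper-parameters:
from pyspark.ml.classification import LogisticRegression

# Fit a logistic regression on the training split and score the test split (sketch)
lor = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lorModel = lor.fit(trainingDataLoR)
print("Coefficients: " + str(lorModel.coefficients))
print("Intercept: " + str(lorModel.intercept))
predictionsLoR = lorModel.transform(testDataLoR)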
Coefficients: (692,[351,378,379,405,406,407,433,434,435,461,462,489],[0.0006062065933687492,0.0006353388402241491,0.0
010036254420385836,0.0005701600678249564,0.0010321061366040955,0.0012146854845528385,0.0006613971762257488,0.00102153
81836988545,0.0006819741249471635,0.0005969809559873252,0.0006768449741966736,0.0005984540021993428])
Intercept: -1.104081936861524
#predictionsLoR.select('*').display()
predictionsLoR.select("probability","prediction","label").display()
# Remember that the field probability will give you a probability value for each class
Table
probability prediction label
1 {"vectorType": "dense", "length": 2, "values": [0.7510241560396478, 0.24897584396035222]} 0 0
2 {"vectorType": "dense", "length": 2, "values": [0.7510241560396478, 0.24897584396035222]} 0 0
3 {"vectorType": "dense", "length": 2, "values": [0.7510241560396478, 0.24897584396035222]} 0 0
4 {"vectorType": "dense", "length": 2, "values": [0.7510241560396478, 0.24897584396035222]} 0 0
5 {"vectorType": "dense", "length": 2, "values": [0.7510241560396478, 0.24897584396035222]} 0 0
6 {"vectorType": "dense", "length": 2, "values": [0.7510241560396478, 0.24897584396035222]} 0 0
7 {"vectorType": "dense", "length": 2, "values": [0.7510241560396478, 0.24897584396035222]} 0 0
Showing all 39 rows.
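The evaluation cell is not captured; a minimal sketch consistent with the value below (area under the ROC curve):
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluate the predictions (sketch)
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol="label")
evaluator.evaluate(predictionsLoR)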
Out[490]: 0.9746298984034834
# Load data
# The label column in this case is only complementary information (the model will not use it for
# training/estimation); it comes from the libsvm file and is just a sequence number for the rows
dataset = spark.read.format("libsvm").load("dbfs:/FileStore/tables/sample_kmeans_data.txt")
dataset.display()
Table
label features
1 0 {"vectorType": "sparse", "length": 3, "indices": [], "values": []}
# Make predictions
predictions = model.transform(dataset)
display(predictions)
Table
label features prediction
1 0 {"vectorType": "sparse", "length": 3, "indices": [], "values": []} 1
2 1 {"vectorType": "sparse", "length": 3, "indices": [0, 1, 2], "values": [0.1, 0.1, 0.1]} 1
3 2 {"vectorType": "sparse", "length": 3, "indices": [0, 1, 2], "values": [0.2, 0.2, 0.2]} 1
4 3 {"vectorType": "sparse", "length": 3, "indices": [0, 1, 2], "values": [9, 9, 9]} 0
5 4 {"vectorType": "sparse", "length": 3, "indices": [0, 1, 2], "values": [9.1, 9.1, 9.1]} 0
Showing all 6 rows.
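The evaluator used below is not defined in this extract; a minimal sketch (silhouette with squared Euclidean distance is the default metric):
from pyspark.ml.evaluation import ClusteringEvaluator

evaluator = ClusteringEvaluator()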
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))
Table
id name height weight age gender job
1 1 Peter 1.79 90 28 M Tiler
2 2 Fritz 1.78 null 45 M null
3 2 Fritz 1.78 null 45 M null
4 3 Florence 1.75 null null null null
5 4 Nicola 1.6 60 33 F Dancer
6 5 Gregory 1.8 88 54 M Teacher
7 6 Steven 1.82 null null M null
Showing all 9 rows.
Descriptive statistics
# Use describe or summary for statistics (summary will give you more info)
#people.describe().display()
#people.select("age").summary().display()
people.summary().display()
#dbutils.data.summarize(people)
Table
summary id name height weight age gend
1 count 9 9 9 5 7 8
2 mean 4.222222222222222 null 1.8133333333333335 62.4 49.57142857142857 null
3 stddev 2.438123139721299 null 0.194357917255768 32.292413969847466 23.8107618725731 null
4 min 1 Dagmar 1.6 10 28 F
5 25% 2 null 1.75 60 33 null
6 50% 4 null 1.78 64 45 null
7 75% 6 null 1.8 88 54 null
Showing all 8 rows.
Table
avg(age) min(age) max(age)
1 49.57142857142857 28 100
Showing 1 row.
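The loop that produced the per-column statistics below is not captured; a minimal sketch (the stray 'None' lines in the output suggest the original also printed the return value of display()):
from pyspark.sql.functions import skewness, kurtosis

# Skewness and kurtosis for every column; string columns come back as null (sketch)
for c in people.columns:
    print("Skewness and Kurtosis for variable " + c + ":")
    display(people.select(skewness(c), kurtosis(c)))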
Table
skewness(id) kurtosis(id)
1 0.22135555123008185 -1.2793038693335663
Showing 1 row.
None
Skewness and Kurtosis for variable name:
Table
skewness(name) kurtosis(name)
1 null null
Showing 1 row.
None
Skewness and Kurtosis for variable height:
Table
skewness(height) kurtosis(height)
1 1.8736806669884842 2.7503936497452406
Showing 1 row.
None
Skewness and Kurtosis for variable weight:
Table
skewness(weight) kurtosis(weight)
1 -0.8805430091452401 -0.5432331648482904
Showing 1 row.
None
Skewness and Kurtosis for variable age:
Table
skewness(age) kurtosis(age)
1 1.5084230988771352 1.0914433556729701
Showing 1 row.
None
Skewness and Kurtosis for variable gender:
Table
skewness(gender) kurtosis(gender)
1 null null
Showing 1 row.
None
Skewness and Kurtosis for variable job:
Table
skewness(job) kurtosis(job)
1 null null
Showing 1 row.
None
display(people)
Table
id name height weight age gender job
1 1 Peter 1.79 90 28 M Tiler
2 2 Fritz 1.78 null 45 M null
3 2 Fritz 1.78 null 45 M null
4 3 Florence 1.75 null null null null
5 4 Nicola 1.6 60 33 F Dancer
6 5 Gregory 1.8 88 54 M Teacher
7 6 Steven 1.82 null null M null
Showing all 9 rows.
df = people.dropDuplicates()
df.display()
Table
id name height weight age gender job
1 1 Peter 1.79 90 28 M Tiler
2 2 Fritz 1.78 null 45 M null
3 3 Florence 1.75 null null null null
4 4 Nicola 1.6 60 33 F Dancer
5 5 Gregory 1.8 88 54 M Teacher
6 6 Steven 1.82 null null M null
7 7 Dagmar 1.7 64 42 F Nurse
Showing all 8 rows.
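The cell that counted missing values is not shown here; a minimal sketch that matches the table below:
from pyspark.sql.functions import col, count, when

# Number of nulls per column (sketch)
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).display()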
Table
id name height weight age gender job
1 0 0 0 3 2 1 3
Showing 1 row.
Table
id name height weight age gender job
1 1 Peter 1.79 90 28 M Tiler
2 2 Fritz 1.78 null 45 M null
3 4 Nicola 1.6 60 33 F Dancer
4 5 Gregory 1.8 88 54 M Teacher
5 6 Steven 1.82 null null M null
6 7 Dagmar 1.7 64 42 F Nurse
7 8 Thomaz 2.3 10 100 M Driver
Showing all 7 rows.
df = df.dropna(how="any")
df.display()
Table
id name height weight age gender job
1 1 Peter 1.79 90 28 M Tiler
2 4 Nicola 1.6 60 33 F Dancer
3 5 Gregory 1.8 88 54 M Teacher
4 7 Dagmar 1.7 64 42 F Nurse
5 8 Thomaz 2.3 10 100 M Driver
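The UDF definition is missing from this extract; a minimal sketch of a BMI UDF consistent with the values shown below (weight in kg divided by height in metres squared):
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

@udf(returnType=DoubleType())
def bmi_udf(weight, height):
    # BMI = weight / height^2 (sketch)
    return weight / (height ** 2)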
Table
id name weight height age gender job bmi
1 1 Peter 90 1.79 28 M Tiler 28.089010954714272
2 4 Nicola 60 1.6 33 F Dancer 23.437499999999996
3 5 Gregory 88 1.8 54 M Teacher 27.160493827160494
4 7 Dagmar 64 1.7 42 F Nurse 22.145328719723185
5 8 Thomaz 10 2.3 100 M Driver 1.8903591682419663
df.withColumn("bmi", bmi_udf(df["weight"],df["height"])).display()
Table
id name weight height age gender job bmi
1 1 Peter 90 1.79 28 M Tiler 28.089010954714272
2 4 Nicola 60 1.6 33 F Dancer 23.437499999999996
3 5 Gregory 88 1.8 54 M Teacher 27.160493827160494
4 7 Dagmar 64 1.7 42 F Nurse 22.145328719723185
5 8 Thomaz 10 2.3 100 M Driver 1.8903591682419663
# Last step to have a final table only with labels and features
df = df.drop("name")
df.display()
Table
id weight height age gender job bmi
1 1 90 1.79 28 M Tiler 28.089010954714272
2 4 60 1.6 33 F Dancer 23.437499999999996
3 5 88 1.8 54 M Teacher 27.160493827160494
4 7 64 1.7 42 F Nurse 22.145328719723185
5 8 10 2.3 100 M Driver 1.8903591682419663
Identifying Outliers
There are no outliers if all the values fall roughly within the [Q1 - 1.5*IQR, Q3 + 1.5*IQR] range
IQR is the interquartile range, defined as the difference between the upper (Q3) and lower (Q1) quartiles
# We'll use the .approxQuantile(...) method (that will give you a list with the Q1 and Q3 values)
# The 1st parameter is the name of the column
# The 2nd parameter can be either a number between 0 and 1 (where 0.5 means to calculate the median) or a list (as in this case)
# The 3rd parameter specifies the acceptable level of error for each metric (0 means an exact value - it can be very expensive)
# bounds dic will have the lower and upper bounds for each feature
print(bounds)
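The cell that actually builds the bounds dictionary is not included above. A minimal sketch of how it might look, assuming cols holds the numeric feature names used in the outlier check below:
cols = ['weight', 'height', 'age']   # assumed feature list
bounds = {}
for c in cols:
    # approxQuantile returns [Q1, Q3]; a relative error of 0.05 trades accuracy for speed
    quantiles = df.approxQuantile(c, [0.25, 0.75], 0.05)
    IQR = quantiles[1] - quantiles[0]
    bounds[c] = [quantiles[0] - 1.5 * IQR, quantiles[1] + 1.5 * IQR]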
outliers = df.select(['id','weight','height','age'] +
[( (df[c] < bounds[c][0]) | (df[c] > bounds[c][1]) ).alias(c + '_o') for c in cols ])
outliers.display()
Table
id weight height age weight_o height_o age_o
1 1 90 1.79 28 false false false
2 4 60 1.6 33 false false false
3 5 88 1.8 54 false false false
4 7 64 1.7 42 false false false
5 8 10 2.3 100 true true true
# Use a join now to eliminate the outliers from the original dataset
df_No_Outl = df.join(outliers['id','weight_o','height_o','age_o'], on='id')
#df_No_Outl.filter('age_o').show()
df_No_Outl.filter(df_No_Outl.age_o == False).select('id','height','weight','age').display()
Table
id height weight age
1 1 1.79 90 28
2 4 1.6 60 33
3 5 1.8 88 54
4 7 1.7 64 42
Vectorize
assembler.transform(df).display()
assembler.transform(df).printSchema()
Table
id weight height age gender job bmi features
1 1 90 1.79 28 M Tiler 28.089010954714272 {"vectorType": "dense", ...}
2 4 60 1.6 33 F Dancer 23.437499999999996 {"vectorType": "dense", ...}
root
|-- id: long (nullable = true)
|-- weight: long (nullable = true)
|-- height: double (nullable = true)
|-- age: long (nullable = true)
|-- gender: string (nullable = true)
|-- job: string (nullable = true)
|-- bmi: double (nullable = true)
|-- features: vector (nullable = true)
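The assembler used above is not defined in this export. A minimal sketch, assuming the numeric columns are combined into the features vector (the exact input column list and its order are an assumption):
from pyspark.ml.feature import VectorAssembler

# Combine numeric columns into a single vector column named "features"
assembler = VectorAssembler(inputCols=["weight", "height", "age", "bmi"], outputCol="features")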
StringIndexer
Used to transform a categorical string feature into a numerical feature
df.display()
Table
id weight height age gender job bmi genderIndex
1 1 90 1.79 28 M Tiler 28.089010954714272 0
2 4 60 1.6 33 F Dancer 23.437499999999996 1
3 5 88 1.8 54 M Teacher 27.160493827160494 0
4 7 64 1.7 42 F Nurse 22.145328719723185 1
5 8 10 2.3 100 M Driver 1.8903591682419663 0
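The indexing step that created genderIndex (and the jobIndex used by the encoder below) is not shown. A minimal sketch using StringIndexer with its default frequency-based ordering:
from pyspark.ml.feature import StringIndexer

# Map each distinct string value to a numeric index (the most frequent value gets 0.0)
indexer = StringIndexer(inputCols=["gender", "job"], outputCols=["genderIndex", "jobIndex"])
df = indexer.fit(df).transform(df)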
encoder = OneHotEncoder(inputCols=["jobIndex","genderIndex"],
outputCols=["jobOHEVector","genderOHEVector"]).setDropLast(False)
encoded = encoder.fit(df).transform(df)
encoded.display()
Table
id weight height age gender job bmi genderIndex
1 1 90 1.79 28 M Tiler 28.089010954714272 0
...
4 7 64 1.7 42 F Nurse 22.145328719723185 1
Showing all 5 rows.
# In this example setDropLast(True) is used, meaning the last category is ignored (not encoded)
from pyspark.ml.feature import OneHotEncoder
encoder = OneHotEncoder(inputCols=["jobIndex","genderIndex"],
outputCols=["jobOHEVector","genderOHEVector"]).setDropLast(True)
encoded = encoder.fit(df).transform(df)
encoded.show()
+---+------+------+---+------+-------+------------------+-----------+--------+-------------+---------------+
| id|weight|height|age|gender| job| bmi|genderIndex|jobIndex| jobOHEVector|genderOHEVector|
+---+------+------+---+------+-------+------------------+-----------+--------+-------------+---------------+
| 1| 90| 1.79| 28| M| Tiler|28.089010954714272| 0.0| 4.0| (4,[],[])| (1,[0],[1.0])|
| 4| 60| 1.6| 33| F| Dancer|23.437499999999996| 1.0| 0.0|(4,[0],[1.0])| (1,[],[])|
| 5| 88| 1.8| 54| M|Teacher|27.160493827160494| 0.0| 3.0|(4,[3],[1.0])| (1,[0],[1.0])|
| 7| 64| 1.7| 42| F| Nurse|22.145328719723185| 1.0| 2.0|(4,[2],[1.0])| (1,[],[])|
| 8| 10| 2.3|100| M| Driver|1.8903591682419663| 0.0| 1.0|(4,[1],[1.0])| (1,[0],[1.0])|
+---+------+------+---+------+-------+------------------+-----------+--------+-------------+---------------+
ML Examples
%fs ls /databricks-datasets/samples/data/mllib
Table
path name size
1 dbfs:/databricks-datasets/samples/data/mllib/.DS_Store .DS_Store 614
2 dbfs:/databricks-datasets/samples/data/mllib/als/ als/ 0
3 dbfs:/databricks-datasets/samples/data/mllib/gmm_data.txt gmm_data.txt 639
4 dbfs:/databricks-datasets/samples/data/mllib/kmeans_data.txt kmeans_data.txt 72
5 dbfs:/databricks-datasets/samples/data/mllib/lr-data/ lr-data/ 0
6 dbfs:/databricks-datasets/samples/data/mllib/lr_data.txt lr_data.txt 197
7 dbfs:/databricks-datasets/samples/data/mllib/pagerank_data.txt pagerank_data.txt 24
Showing all 20 rows.
%fs ls /databricks-datasets/definitive-guide/data/
Table
path name size modifica
1 dbfs:/databricks-datasets/definitive-guide/data/activity-data/ activity-data/ 0 0
2 dbfs:/databricks-datasets/definitive-guide/data/bike-data/ bike-data/ 0 0
3 dbfs:/databricks-datasets/definitive-guide/data/binary-classification/ binary-classification/ 0 0
4 dbfs:/databricks-datasets/definitive-guide/data/clustering/ clustering/ 0 0
5 dbfs:/databricks-datasets/definitive-guide/data/flight-data/ flight-data/ 0 0
6 dbfs:/databricks-datasets/definitive-guide/data/flight-data-hive/ flight-data-hive/ 0 0
7 dbfs:/databricks-datasets/definitive-guide/data/multiclass-classification/ multiclass-classification/ 0 0
Showing all 14 rows.
display(trainingLiR)
Table
label features
-9.490009878824548 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.4551273600657362, 0.366446943519
1 -0.38256108933468047, -0.4458430198517267, 0.33109790358914726, 0.8067445293443565, -0.2624341731773887,
-0.44850386111659524, -0.07269284838169332, 0.5658035575800715]}
0.2577820163584905 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.8386555657374337, -0.12701805115
2 0.499812362510895, -0.22686625128130267, -0.6452430441812433, 0.18869982177936828, -0.5804648622673358,
0.651931743775642, -0.6555641246242951, 0.17485476357259122]}
-4.438869807456516 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.5025608135349202, 0.142080696829
3 0.16004976900412138, 0.505019897181302, -0.9371635223468384, -0.2841601610457427, 0.6355938616712786,
-0.1646249064941625, 0.9480713629917628, 0.42681251564645817]}
-19.782762789614537 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [-0.0388509668871313,
4 -0.4166870051763918, 0.8997202693189332, 0.6409836467726933, 0.273289095712564, -0.26175701211620517,
-0.2794902492677298, -0.1306778297187794, -0.08536581111046115, -0.05462315824828923]}
-7.966593841555266 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [-0.06195495876886281,
5 0.6546448480299902, -0.6979368909424835, 0.6677324708883314, -0.07938725467767771, -0.43885601665437957,
-0.608071585153688, -0.6414531182501653, 0.7313735926547045, -0.026818676347611925]}
-7.896274316726144 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [-0.15805658673794265,
6 0.26573958270655806, 0.3997172901343442, -0.3693430998846541, 0.14324061105995334, -0.25797542063247825,
0.7436291919296774, 0.6114618853239959, 0.2324273700703574, -0.25128128782199144]}
-8.464803554195287 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.39449745853945895, 0.81722916041
Coefficients: [0.0,0.3229251667740594,-0.3438548034562219,1.915601702345841,0.05288058680386255,0.765962720459771,0.
0,-0.15105392669186676,-0.21587930360904645,0.2202536918881343]
Intercept: 0.15989368442397356
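The cell that fits the model whose coefficients and intercept are printed above is not shown. A minimal sketch, assuming an elastic-net LinearRegression on the libsvm training data (the hyperparameter values are assumptions):
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)  # assumed parameters
lrModel = lr.fit(trainingLiR)
print("Coefficients: %s" % str(lrModel.coefficients))
print("Intercept: %s" % str(lrModel.intercept))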
# Summarize the model over the training set and print out some metrics
# RMSE: square root of the variance of the residuals (Lower values of RMSE indicate better fit)
# R2: the proportional improvement in prediction from the model (0 -> the model does not improve prediction over the mean model; 1 -> indicates perfect prediction)
trainingSummary = lrModel.summary
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
trainingSummary.residuals.show()
print("Root Mean Squared Error - RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("R2: %f" % trainingSummary.r2)
numIterations: 6
objectiveHistory: [0.49999999999999994, 0.4967620357443381, 0.49363616643404634, 0.4936351537897608, 0.49363512141778
71, 0.49363512062528014, 0.4936351206216114]
+--------------------+
| residuals|
+--------------------+
| -9.889232683103197|
| 0.5533794340053553|
| -5.204019455758822|
| -20.566686715507508|
| -9.4497405180564|
| -6.909112502719487|
| -10.00431602969873|
| 2.0623978070504845|
| 3.1117508432954772|
| -15.89360822941938|
| -5.036284254673026|
| 6.4832158769943335|
| 12.429497299109002|
| -20.32003219007654|
| -2.0049838218725|
Linear Regression
with test (and train) data and pipelines
# Load data
data = spark.read.format("libsvm").load("dbfs:/FileStore/tables/sample_linear_regression_data.txt")
# Split the data into training and test sets (20% held out for testing)
(trainingData, testData) = data.randomSplit([0.8,0.2])
# Train Model
model = pipeline.fit(trainingData)
# Make Predictions
predictions = model.transform(testData)
# Show Predictions
display(predictions)
Table
label features
-26.805483428483072 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.4572552704218824, -0.576096954000
1 -0.20809839485012915, 0.9140086345619809, -0.5922981637492224, -0.8969369345510854, 0.3741080343476908,
-0.01854004246308416, 0.07834089512221243, 0.3838413057880994]}
-23.51088409032297 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [-0.4683538422180036, 0.14695401859
2 0.9113612952591796, -0.9838482669789823, 0.4506466371133697, 0.6456121712599778, 0.8264783725578371,
0.562664168655115, -0.8299281852090683, 0.40690300256653256]}
-19.402336030214553 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.462288625222409, -0.9029755259427
3 0.7442695642729447, 0.3802724233363486, 0.4068685903786069, -0.5054707879424198, -0.8686166000900748,
-0.014710838968344575, -0.1362606460134499, 0.8444452252816472]}
-18.27521356600463 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [-0.489685764918109, 0.683231434274
4 0.9115808714640257, -0.0004680515344936964, 0.03760860984717218, 0.4344127744883004, -0.30019645809377127,
-0.48339658188341783, -0.5488933834939806, -0.4735052851773165]}
-17.428674570939506 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.8562209225926345, 0.707720210065
5 0.7449487615498371, 0.4648122665228682, 0.20867633509077188, 0.08516406450475422, 0.22426604902631664,
-0.5503074163123833, -0.40653248591627533, -0.34680731694527833]}
-16.692207021311106 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [0.9117919458569854, 0.628599902089
6 -0.29426892743208954, -0.7936280881977256, 0.8429787263741186, 0.7932494418330283, 0.31956207523432667,
0.9890773145202636, -0.7936494627564858, 0.9917688731048739]}
-16.26143027545273 {"vectorType": "sparse", "length": 10, "indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "values": [-0.9309578475799722, 0.75917958809
Showing all 101 rows.
Coefficients: [0.0,1.9141348939000609,-0.05113069489802313,1.983468491510318,0.553327181606741,0.1749868889333724,0.
0,-0.4667102520693325,-0.9095076912374733,0.26743281796509416]
Intercept: 0.16479345157668718
RMSE: 10.735968494895788
R2: -0.08201975916265836
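The pipeline fitted above is not defined in this export, and neither is the code that computed the test RMSE and R2. A minimal sketch, assuming a single LinearRegression stage and a RegressionEvaluator on the predictions (parameter values are assumptions):
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# One-stage pipeline: the libsvm data already provides label and features columns
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)  # assumed parameters
pipeline = Pipeline(stages=[lr])

# One way to obtain test metrics like those reported above
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
print("RMSE: %f" % evaluator.evaluate(predictions))
print("R2: %f" % evaluator.evaluate(predictions, {evaluator.metricName: "r2"}))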
Table
label features
0 {"vectorType": "sparse", "length": 692, "indices": [127, 128, 129, 130, 131, 154, 155, 156, 157, 158, 159, 181, 182, 183, 184, 185,
186, 187, 188, 189, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 262,
263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 289, 290, 291, 292, 293, 294, 295, 296, 297, 300, 301, 302, 316, 317, 318, 319,
320, 321, 328, 329, 330, 343, 344, 345, 346, 347, 348, 349, 356, 357, 358, 371, 372, 373, 374, 384, 385, 386, 399, 400, 401, 412, 413,
414, 426, 427, 428, 429, 440, 441, 442, 454, 455, 456, 457, 466, 467, 468, 469, 470, 482, 483, 484, 493, 494, 495, 496, 497, 510, 511,
512, 520, 521, 522, 523, 538, 539, 540, 547, 548, 549, 550, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 594, 595,
596, 597, 598, 599, 600, 601, 602, 603, 604, 622, 623, 624, 625, 626, 627, 628, 629, 630, 651, 652, 653, 654, 655, 656, 657], "values":
1
[51, 159, 253, 159, 50, 48, 238, 252, 252, 252, 237, 54, 227, 253, 252, 239, 233, 252, 57, 6, 10, 60, 224, 252, 253, 252, 202, 84, 252,
253, 122, 163, 252, 252, 252, 253, 252, 252, 96, 189, 253, 167, 51, 238, 253, 253, 190, 114, 253, 228, 47, 79, 255, 168, 48, 238, 252,
252, 179, 12, 75, 121, 21, 253, 243, 50, 38, 165, 253, 233, 208, 84, 253, 252, 165, 7, 178, 252, 240, 71, 19, 28, 253, 252, 195, 57, 252,
252, 63, 253, 252, 195, 198, 253, 190, 255, 253, 196, 76, 246, 252, 112, 253, 252, 148, 85, 252, 230, 25, 7, 135, 253, 186, 12, 85, 252,
223, 7, 131, 252, 225, 71, 85, 252, 145, 48, 165, 252, 173, 86, 253, 225, 114, 238, 253, 162, 85, 252, 249, 146, 48, 29, 85, 178, 225,
253, 223, 167, 56, 85, 252, 252, 252, 229, 215, 252, 252, 252, 196, 130, 28, 199, 252, 252, 253, 252, 252, 233, 145, 25, 128, 252, 253,
252, 141, 37]}
1 {"vectorType": "sparse", "length": 692, "indices": [158, 159, 160, 161, 185, 186, 187, 188, 189, 213, 214, 215, 216, 217, 240, 241,
242, 243, 244, 245, 267, 268, 269, 270, 271, 295, 296, 297, 298, 322, 323, 324, 325, 326, 349, 350, 351, 352, 353, 377, 378, 379, 380,
381, 404, 405, 406, 407, 408, 431, 432, 433, 434, 435, 459, 460, 461, 462, 463, 486, 487, 488, 489, 490, 514, 515, 516, 517, 518, 542,
543, 544, 545, 569, 570, 571, 572, 573, 596, 597, 598, 599, 600, 601, 624, 625, 626, 627, 652, 653, 654, 655, 680, 681, 682, 683],
2
"values": [124, 253, 255, 63, 96, 244, 251, 253, 62, 127, 251, 251, 253, 62, 68, 236, 251, 211, 31, 8, 60, 228, 251, 251, 94, 155, 253,
253, 189, 20, 253, 251, 235, 66, 32, 205, 253, 251, 126, 104, 251, 253, 184, 15, 80, 240, 251, 193, 23, 32, 253, 253, 253, 159, 151,
251, 251, 251, 39, 48, 221, 251, 251, 172, 234, 251, 251, 196, 12, 253, 251, 251, 89, 159, 255, 253, 253, 31, 48, 228, 253, 247, 140, 8,
Showing all 100 rows.
# Split the data into training and test sets (30% held out for testing)
(trainingDataLoR, testDataLoR) = dataLoR.randomSplit([0.7,0.3])
Coefficients: (692,[271,272,300,328,350,351,356,378,379,405,406,407,433,434,435,461,462,489,490,511,512,517,539,540],
[-0.00019903767637136554,-0.0001941577234365856,-0.00034527933210997097,-4.680078802426797e-05,0.0002033194519132294
2,0.00022292147220978535,-5.213010992387977e-05,0.0005717340811962772,0.00022634251292289687,0.0003611998147730647,0.
0007059600020105616,0.0004143587482711157,0.000604649706484048,0.0006808244146618602,1.6865828617895953e-05,0.0004753
5748773760707,0.0006162903776126785,0.0004422202688935682,0.00035742728163686986,-5.462326765468662e-05,-0.0002070301
3094870845,0.0004217378448089352,-4.418484614812527e-05,-0.0002559427594559066])
Intercept: -0.2602368710870625
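The logistic regression fit that yields the coefficients and intercept above is not shown. A minimal sketch (the model/prediction variable names and the hyperparameters here are assumptions):
from pyspark.ml.classification import LogisticRegression

lorModel = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8).fit(trainingDataLoR)
predictionsLoR = lorModel.transform(testDataLoR)
print("Coefficients: %s" % str(lorModel.coefficients))
print("Intercept: %s" % str(lorModel.intercept))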
#predictionsLoR.select('*').display()
predictionsLoR.select("probability","prediction","label").display()
# Remember that the field probability will give you a probability value for each class
Table
probability prediction label
1 {"vectorType": "dense", "length": 2, "values": [0.6348341003103886, 0.36516589968961144]} 0 0
2 {"vectorType": "dense", "length": 2, "values": [0.6127092158471116, 0.3872907841528884]} 0 0
3 {"vectorType": "dense", "length": 2, "values": [0.6482675547104254, 0.3517324452895746]} 0 0
4 {"vectorType": "dense", "length": 2, "values": [0.6440298352541282, 0.35597016474587184]} 0 0
5 {"vectorType": "dense", "length": 2, "values": [0.6015645505916722, 0.39843544940832776]} 0 0
6 {"vectorType": "dense", "length": 2, "values": [0.5893064637024384, 0.41069353629756156]} 0 0
7 {"vectorType": "dense", "length": 2, "values": [0.6395634012764051, 0.3604365987235949]} 0 0
Showing all 33 rows.
Out[574]: 1.0
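The Out[574]: 1.0 above looks like an evaluation result for the classifier. A minimal sketch of how it might have been obtained, assuming the area under the ROC curve on the test predictions (the evaluator name is hypothetical):
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Area under the ROC curve for the logistic regression test predictions
evaluatorLoR = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="rawPrediction",
                                             metricName="areaUnderROC")
evaluatorLoR.evaluate(predictionsLoR)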
# Load data
# The label column in this case is only complementary information (the model will not use it for training/estimation)
# It results from the libsvm reading; it's only a sequence number of the rows
dataset = spark.read.format("libsvm").load("dbfs:/FileStore/tables/sample_kmeans_data.txt")
dataset.display()
Table
label features
1 0 {"vectorType": "sparse", "length": 3, "indices": [], "values": []}
display(predictions)
Table
label features prediction
1 0 {"vectorType": "sparse", "length": 3, "indices": [], "values": []} 1
2 1 {"vectorType": "sparse", "length": 3, "indices": [0, 1, 2], "values": [0.1, 0.1, 0.1]} 1
3 2 {"vectorType": "sparse", "length": 3, "indices": [0, 1, 2], "values": [0.2, 0.2, 0.2]} 1
4 3 {"vectorType": "sparse", "length": 3, "indices": [0, 1, 2], "values": [9, 9, 9]} 0
5 4 {"vectorType": "sparse", "length": 3, "indices": [0, 1, 2], "values": [9.1, 9.1, 9.1]} 0
Showing all 6 rows.
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))
df = spark.createDataFrame([
(0, [1, 2]),
(1, [1, 2, 3])
], ["id", "items"])
# transform() examines the input items against all the association rules and summarizes the
# consequents as the prediction
model.transform(df).display()
Table
items freq
1 [1] 2
2 [2] 2
3 [2, 1] 2
4 [3] 1
5 [3, 2] 1
6 [3, 2, 1] 1
7 [3, 1] 1
Showing all 7 rows.
Table
antecedent consequent confidence lift support
1 [3, 1] [2] 1 1 0.5
2 [3] [2] 1 1 0.5
3 [3] [1] 1 1 0.5
4 [2] [1] 1 1 1
5 [3, 2] [1] 1 1 0.5
6 [1] [2] 1 1 1
Table
id items prediction
1 0 [1, 2] []
2 1 [1, 2, 3] []
Table
id items prediction
1 0 [1] [2]
Showing 1 row.
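The FP-growth model (model) used in this example is not defined in this export. A minimal sketch of how it might have been trained and inspected (the minSupport/minConfidence thresholds are assumptions consistent with the itemsets shown):
from pyspark.ml.fpm import FPGrowth

# Mine frequent itemsets and association rules from the baskets in the items column
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)
model.freqItemsets.display()        # frequent itemsets with their counts
model.associationRules.display()    # antecedent -> consequent rules with confidence/lift/support
model.transform(df).display()       # predicted consequents for each input basket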
df1 = spark.createDataFrame([
(0, [1, 2, 5]),
(1, [1, 2, 3, 5]),
(2, [1, 2])
], ["id", "items"])
# transform() examines the input items against all the association rules and summarizes the
# consequents as the prediction
model1.transform(df1).display()
Table
items freq
1 [1] 3
2 [2] 3
3 [2, 1] 3
4 [5] 2
5 [5, 2] 2
6 [5, 2, 1] 2
7 [5, 1] 2
Showing all 7 rows.
Table
antecedent consequent confidence lift support
1 [5] [2] 1 1 0.6666666666666666
2 [5] [1] 1 1 0.6666666666666666
3 [5, 1] [2] 1 1 0.6666666666666666
4 [5, 2] [1] 1 1 0.6666666666666666
5 [2] [1] 1 1 1
6 [2] [5] 0.6666666666666666 1 0.6666666666666666
7 [2, 1] [5] 0.6666666666666666 1 0.6666666666666666
Showing all 9 rows.
Table
id items prediction
1 0 [1, 2, 5] []
2 1 [1, 2, 3, 5] []
3 2 [1, 2] [5]
%fs ls /databricks-datasets/samples/data/mllib
Table
path name size
1 dbfs:/databricks-datasets/samples/data/mllib/.DS_Store .DS_Store 614
#lines = spark.read.text("dbfs:/databricks-datasets/samples/data/mllib/sample_movielens_data.txt").rdd
#parts = lines.map(lambda row: row.value.split("::"))
#ratingsRDD = parts.map(lambda p: Row(userId=int(p[0]), movieId=int(p[1]),
# rating=float(p[2]) ))
#ratings = spark.createDataFrame(ratingsRDD)
from pyspark.sql.functions import split, col
ratings = spark.read.text("dbfs:/databricks-datasets/samples/data/mllib/sample_movielens_data.txt")
ratings = ratings.withColumn('userId', split('value', '::').getItem(0)) \
                 .withColumn('movieId', split('value', '::').getItem(1)) \
                 .withColumn('rating', split('value', '::').getItem(2))
ratings = ratings.drop('value')
ratings.printSchema()
ratings = ratings.withColumn('userId', col("userId").cast('int'))
ratings = ratings.withColumn('movieId', col("movieId").cast('int'))
ratings = ratings.withColumn('rating', col("rating").cast('int'))
ratings.printSchema()
root
|-- userId: integer (nullable = true)
|-- movieId: string (nullable = true)
|-- rating: string (nullable = true)
root
|-- userId: integer (nullable = true)
|-- movieId: integer (nullable = true)
|-- rating: integer (nullable = true)
ratings.display()
Table
userId movieId rating
1 0 2 3
2 0 3 1
3 0 5 2
4 0 9 4
5 0 11 1
6 0 12 2
7 0 15 1
Truncated results, showing first 1,000 rows.
predictions.display()
Table
userId movieId rating prediction
1 2 39 5 3.7424393
2 2 50 1 0.86440027
3 2 54 1 2.3776407
4 2 58 2 0.04814744
5 2 62 1 5.16774
6 2 65 1 0.67538136
7 2 66 3 3.651377
Showing all 307 rows.
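The ALS model behind the predictions and recommendations shown here is not included in this export. A minimal sketch (the split, hyperparameters, variable names and the movie subset are assumptions):
from pyspark.ml.recommendation import ALS

# Train an ALS collaborative-filtering model on the MovieLens-style ratings
(training, test) = ratings.randomSplit([0.8, 0.2])
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")
alsModel = als.fit(training)
predictions = alsModel.transform(test)
# Top-10 user recommendations for a small subset of movies (displayed as movieSubSetRecs below)
movieSubSet = ratings.select("movieId").distinct().limit(3)
movieSubSetRecs = alsModel.recommendForItemSubset(movieSubSet, 10)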
#userRecs.display()
#movieRecs.display()
movieSubSetRecs.display()
Table
movieId recommendations
2 [{"userId": 8, "rating": 4.2190223}, {"userId": 14, "rating": 4.1397004}, {"userId": 21, "rating": 4.026415}, {"userId": 10, "rating":
1 3.8691313}, {"userId": 12, "rating": 3.6239657}, {"userId": 4, "rating": 3.5872579}, {"userId": 28, "rating": 3.4324553}, {"userId": 0,
"rating": 3.2764773}, {"userId": 6, "rating": 3.0584242}, {"userId": 5, "rating": 2.706623}]
3 [{"userId": 14, "rating": 2.7682219}, {"userId": 16, "rating": 2.315615}, {"userId": 8, "rating": 2.2152493}, {"userId": 11, "rating":
2 2.1919854}, {"userId": 2, "rating": 2.0185175}, {"userId": 24, "rating": 2.0180826}, {"userId": 22, "rating": 1.7710056}, {"userId": 25,
"rating": 1.6565917}, {"userId": 21, "rating": 1.4824837}, {"userId": 12, "rating": 1.455656}]
5 [{"userId": 16, "rating": 3.0494766}, {"userId": 18, "rating": 2.1953611}, {"userId": 26, "rating": 2.1228826}, {"userId": 15, "rating":
3 1.9957798}, {"userId": 2, "rating": 1.9644576}, {"userId": 3, "rating": 1.9028949}, {"userId": 22, "rating": 1.8979962}, {"userId": 0,
"rating": 1.8318737}, {"userId": 27, "rating": 1.8116448}, {"userId": 23, "rating": 1.800349}]
df = spark.read.csv("/databricks-datasets/samples/population-vs-price/data_geo.csv", header="true",
inferSchema="true")
display(df)
Table
2014 rank City State State Code 2014 Population estimate 2015 median sales price
1 101 Birmingham Alabama AL 212247 162.9
2 125 Huntsville Alabama AL 188226 157.7
3 122 Mobile Alabama AL 194675 122.5
4 114 Montgomery Alabama AL 200481 129
5 64 Anchorage[19] Alaska AK 301010 null
6 78 Chandler Arizona AZ 254276 null
7 86 Gilbert[20] Arizona AZ 239277 null
Showing all 294 rows.
# Some of the column names contain spaces. Rename the columns to replace spaces with underscores and shorten the names.
from pyspark.sql.functions import col
exprs = [col(column).alias(column.replace(' ', '_')) for column in df.columns]
data = df.select(exprs)
Out[656]: DataFrame[2014_rank: int, City: string, State: string, State_Code: string, 2014_Population_estimate: int, 2
015_median_sales_price: double]
display(data)
Table
2014_rank City State State_Code 2014_Population_estimate 2015_median_sales_price
1 101 Birmingham Alabama AL 212247 162.9
2 125 Huntsville Alabama AL 188226 157.7
3 122 Mobile Alabama AL 194675 122.5
4 114 Montgomery Alabama AL 200481 129
5 64 Anchorage[19] Alaska AK 301010 null
6 78 Chandler Arizona AZ 254276 null
7 86 Gilbert[20] Arizona AZ 239277 null
Showing all 294 rows.
Out[658]: 294
root
|-- 2014_rank: integer (nullable = true)
|-- City: string (nullable = true)
|-- State: string (nullable = true)
|-- State_Code: string (nullable = true)
|-- 2014_Population_estimate: integer (nullable = true)
# Describing the columns. Use method describe or summary (summary will give you more info)
#data.summary("count", "min", "25%", "75%", "max").display()
data.summary().display()
Table
summary 2014_rank City State State_Code 2014_Population_estimate 2015_median_sa
1 count 294 294 294 294 293 109
2 mean 147.5 null null null 307284.89761092153 211.26605504587
3 stddev 85.01470461043782 null null null 603487.8272175139 134.01724544927
4 min 1 Abilene Alabama AK 101408 78.6
5 25% 74 null null null 120958 141.1
6 50% 147 null null null 168586 177.2
7 75% 221 null null null 262146 218.9
Showing all 8 rows.
# an alternative approach
from pyspark.sql.functions import mean, min, max
data.select([mean('2014_rank'), min('2014_rank'), max('2014_rank')]).show()
+--------------+--------------+--------------+
|avg(2014_rank)|min(2014_rank)|max(2014_rank)|
+--------------+--------------+--------------+
| 147.5| 1| 294|
+--------------+--------------+--------------+
Table
2014_rank City State State_Code 2014_Population_estimate 2015_median_sales_price
1 0 0 0 0 1 185
Showing 1 row.
Out[663]: 109
Table
2014_Population_estimate 2015_median_sales_price
1 212247 162.9
2 188226 157.7
3 194675 122.5
4 200481 129
5 1537058 206.1
6 527972 178.1
7 197706 131.8
Showing all 109 rows.
Visualization
[Chart: population (y-axis, up to 8.00M) vs. label (x-axis)]
#stages = []
#assembler = VectorAssembler(inputCols=["population"], outputCol="features")
#stages += [assembler]
#pipeline = Pipeline(stages=stages)
#pipelineModel = pipeline.fit(model_data)
#dataset = pipelineModel.transform(model_data)
display(dataset)
Table
population label features
1 212247 162.9 {"vectorType": "dense", "length": 1, "values": [212247]}
Make predictions
Use the transform() method on the model to generate predictions. The following code takes the fitted model and creates a new table (predictions) containing both the label (the original sales price) and the prediction (the predicted sales price) based on the features (population).
predictions = model.transform(dataset)
predictions.show(10)
+----------+-----+-----------+------------------+
|population|label| features| prediction|
+----------+-----+-----------+------------------+
| 212247|162.9| [212247.0]| 199.3167659584664|
| 188226|157.7| [188226.0]|198.40882267887193|
| 194675|122.5| [194675.0]|198.65258131548592|
| 200481|129.0| [200481.0]|198.87203590444247|
| 1537058|206.1|[1537058.0]|249.39183544694856|
| 527972|178.1| [527972.0]|211.25050693302884|
| 197706|131.8| [197706.0]| 198.7671467407576|
| 346997|685.7| [346997.0]| 204.4100325554172|
| 3928864|434.7|[3928864.0]|339.79707185649573|
| 319504|281.0| [319504.0]|203.37085497805194|
+----------+-----+-----------+------------------+
only showing top 10 rows
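The fitting of model (and of the commented-out modelB) is not shown. A minimal sketch, assuming a LinearRegression on the assembled population feature (the regularization values are assumptions):
from pyspark.ml.regression import LinearRegression

# Regress 2015 median sales price (label) on the 2014 population estimate (features)
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)  # assumed parameters
model = lr.fit(dataset)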
#predictionsB = modelB.transform(dataset)
#predictionsB.show()
display(model, dataset)
#display(modelB, dataset)
[Plot: residuals (y-axis) vs. fitted values (x-axis), from display(model, dataset)]
Table
path name size modificationTime
1 dbfs:/databricks-datasets/wine-quality/README.md README.md 1066 1594262736000
2 dbfs:/databricks-datasets/wine-quality/winequality-red.csv winequality-red.csv 84199 1594262736000
3 dbfs:/databricks-datasets/wine-quality/winequality-white.csv winequality-white.csv 264426 1594262736000
Table
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide
1 7.4 0.7 0 1.9 0.076 11 34
2 7.8 0.88 0 2.6 0.098 25 67
3 7.8 0.76 0.04 2.3 0.092 15 54
4 11.2 0.28 0.56 1.9 0.075 17 60
5 7.4 0.7 0 1.9 0.076 11 34
6 7.4 0.66 0 1.8 0.075 13 40
7 7.9 0.6 0.06 1.6 0.069 15 59
Truncated results, showing first 1,000 rows.
winequality.columns
# Some of the column names contain spaces. Rename the columns to replace spaces with underscores and shorten the names.
from pyspark.sql.functions import col
exprs = [col(column).alias(column.replace(' ', '_')) for column in winequality.columns]
#wq = winequality.select(*exprs)
wq = winequality.select(exprs)
Out[680]: DataFrame[fixed_acidity: double, volatile_acidity: double, citric_acid: double, residual_sugar: double, chl
orides: double, free_sulfur_dioxide: double, total_sulfur_dioxide: double, density: double, pH: double, sulphates: do
uble, alcohol: double, quality: int]
Perform multiple linear regression to estimate the quality of the wine based on its components
# Split the data into training and test sets (20% held out for testing)
(TrainSet, TestSet) = wq.randomSplit([0.8,0.2])
# Vectorize features
lrAssembler = VectorAssembler(inputCols=wq.drop("quality").columns, outputCol="features")
# example on how to use less features
#lrAssembler = VectorAssembler(inputCols=["residual_sugar","alcohol"], outputCol="features")
#use .transform().show() if you want to see the vectorized features: lrAssembler.transform(wq).show()
# Make Predictions
lrPredictions = lrModel.transform(TestSet)
# Show Predictions
display(lrPredictions.select("quality","prediction","features"))
Table
quality prediction features
1 6 6.812013814635069 {"vectorType": "dense", "length": 11, "values": [5, 0.38, 0.01, 1.6, 0.048, 26, 60, 0.99084, 3.7, 0.75, 14]}
2 6 6.901158006990844 {"vectorType": "dense", "length": 11, "values": [5.2, 0.34, 0, 1.8, 0.05, 27, 63, 0.9916, 3.68, 0.79, 14]}
3 7 6.250075230396158 {"vectorType": "dense", "length": 11, "values": [5.3, 0.57, 0.01, 1.7, 0.054, 5, 27, 0.9934, 3.57, 0.84, 12.5]}
4 8 6.781282409967318 {"vectorType": "dense", "length": 11, "values": [5.5, 0.49, 0.03, 1.8, 0.044, 28, 87, 0.9908, 3.5, 0.82, 14]}
5 8 5.898732615862613 {"vectorType": "dense", "length": 11, "values": [5.6, 0.85, 0.05, 1.4, 0.045, 12, 88, 0.9924, 3.56, 0.82, 12.9]}
6 6 5.930470507123344 {"vectorType": "dense", "length": 11, "values": [5.9, 0.29, 0.25, 13.4, 0.067, 72, 160, 0.99721, 3.33, 0.54, 10.3]
7 6 6.2845071417946645 {"vectorType": "dense", "length": 11, "values": [5.9, 0.44, 0, 1.6, 0.042, 3, 11, 0.9944, 3.48, 0.85, 11.7]}
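The lrModel used for these predictions comes from a pipeline that is not shown. A minimal sketch, assuming two stages: the lrAssembler defined above plus a LinearRegression on the quality label (the stage variable names are hypothetical):
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression

# Stage 0: vectorize the physico-chemical features; stage 1: regress quality on them
wineLR = LinearRegression(featuresCol="features", labelCol="quality")
lrPipeline = Pipeline(stages=[lrAssembler, wineLR])
lrModel = lrPipeline.fit(TrainSet)   # lrModel.stages[1] is then the LinearRegressionModel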
Table
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide
1 7.4 0.7 0 1.9 0.076 11 34
2 7.8 0.88 0 2.6 0.098 25 67
Truncated results, showing first 1,000 rows.
# Print the coefficients and intercept for linear regression
print("Coefficients List: ", lrModel.stages[1].coefficients)
print("Intercept: ", lrModel.stages[1].intercept)
#print("Coefficients List: %s" % str(lrModel.stages[1].coefficients))
#print("Intercept: %s" % str(lrModel.stages[1].intercept))
print("______")
print("Coefficients:")
coefficients = lrModel.stages[1].coefficients
RMSE: 0.640381
R2 : 0.365982
+--------------------+
| residuals|
+--------------------+
| -1.9391673311439117|
| 0.23068518081834455|
| 0.24707841053737667|
| -0.5782654426454705|
| 1.3413690754061012|
| 0.32905422766333814|
| -0.7809713840657988|
|-0.03194714354228...|
| 0.4859116081054946|
|-0.17243726153934347|
| 0.6811151925144436|
| 0.6764826321286925|
|-0.21924259397254353|
| -0.901158006990844|
| 1.0881932571611088|
| 0.1701630743161715|
Table
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide
1 7.4 0.7 0 1.9 0.076 11 34
2 7.8 0.88 0 2.6 0.098 25 67
3 7.8 0.76 0.04 2.3 0.092 15 54
4 11.2 0.28 0.56 1.9 0.075 17 60
5 7.4 0.7 0 1.9 0.076 11 34
6 7.4 0.66 0 1.8 0.075 13 40
7 7.9 0.6 0.06 1.6 0.069 15 59
Truncated results, showing first 1,000 rows.
# Make predictions
clusterPredictions1 = clusterModel1.transform(featureSet)
display(clusterPredictions1)
Table
features prediction
1 {"vectorType": "dense", "length": 12, "values": [7.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.9978, 3.51, 0.56, 9.4, 5]} 1
2 {"vectorType": "dense", "length": 12, "values": [7.8, 0.88, 0, 2.6, 0.098, 25, 67, 0.9968, 3.2, 0.68, 9.8, 5]} 0
3 {"vectorType": "dense", "length": 12, "values": [7.8, 0.76, 0.04, 2.3, 0.092, 15, 54, 0.997, 3.26, 0.65, 9.8, 5]} 3
4 {"vectorType": "dense", "length": 12, "values": [11.2, 0.28, 0.56, 1.9, 0.075, 17, 60, 0.998, 3.16, 0.58, 9.8, 6]} 3
5 {"vectorType": "dense", "length": 12, "values": [7.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.9978, 3.51, 0.56, 9.4, 5]} 1
6 {"vectorType": "dense", "length": 12, "values": [7.4, 0.66, 0, 1.8, 0.075, 13, 40, 0.9978, 3.51, 0.56, 9.4, 5]} 3
7 {"vectorType": "dense", "length": 12, "values": [7.9, 0.6, 0.06, 1.6, 0.069, 15, 59, 0.9964, 3.3, 0.46, 9.4, 5]} 3
Truncated results, showing first 1,000 rows.
silhouette1 = evaluator1.evaluate(clusterPredictions1)
print("Silhouette with squared euclidean distance = " + str(silhouette1))
(change the number of clusters and check the new evaluation results)
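One way to follow the suggestion above is to sweep several values of k and compare the silhouette scores; a minimal sketch reusing featureSet and evaluator1 from the cells above (the range of k is an assumption):
from pyspark.ml.clustering import KMeans

# Compare silhouette scores for different numbers of clusters
for k in range(2, 7):
    km = KMeans().setK(k).setSeed(1)
    preds = km.fit(featureSet).transform(featureSet)
    print(k, evaluator1.evaluate(preds))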
# Split the data into training and test sets (20% held out for testing)
(clusterTrainSet, clusterTestSet) = wc.randomSplit([0.8,0.2])
# Vectorize features
clusterAssembler2 = VectorAssembler(inputCols=wc.columns, outputCol="features")
# Configure model
kmeans2 = KMeans().setK(4).setSeed(1)
# Train Model
clusterModel2 = pipeline2.fit(clusterTrainSet)
# Make Predictions
clusterPredictions2 = clusterModel2.transform(clusterTestSet)
# Show Predictions
display(clusterPredictions2)
Table
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide
1 5 0.74 0 1.2 0.041 16 46
2 5 1.04 0.24 1.6 0.05 32 96
3 5.3 0.47 0.11 2.2 0.048 16 89
4 5.4 0.58 0.08 1.9 0.059 20 31
5 5.4 0.74 0.09 1.7 0.089 16 26
6 5.6 0.31 0.37 1.4 0.074 12 96
7 5.6 0.66 0 2.2 0.087 3 11
Showing all 305 rows.