
Analise de Dados

The document outlines a final exercise for a Cloud Computing III module, focusing on Apache Spark installation and data wrangling using PySpark in Google Colab. It includes detailed instructions for setting up an Apache Spark cluster, downloading necessary data, and performing data manipulation on UK macroeconomic data. The exercises aim to provide hands-on experience with Spark's capabilities in data processing and analysis.

Uploaded by

Fabio Pereira

Analista_M40_Exercicio_FINAL - Colab, 23/01/2025, 14:46
https://colab.research.google.com/#fileId=https%3A//storage.googleapi...

Module | Cloud Computing III


Exercise Notebook
Professor André Perez

Topics

1. Introduction;
2. Apache Spark;
3. Data Wrangling with Spark.

Exercises

1. Apache Spark

Replicate the activities from items 2.1 and 2.2 to install and configure an Apache Spark cluster on the
Google Colab virtual machine.

# Part 1: Installation and Configuration

# Download Spark, version 3.1.1


!wget -q https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz
!tar xf spark-3.1.1-bin-hadoop2.7.tgz && rm spark-3.1.1-bin-hadoop2.7.tgz


# Download and install Java, version 8


!apt-get update
!apt-get install -y openjdk-8-jdk-headless

# Install PySpark pinned to version 3.1.1, matching the Spark download


!pip install -q pyspark==3.1.1

# Point JAVA_HOME and SPARK_HOME at the Java and Spark installations
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"

# Install findspark, which makes the Spark installation importable from Python
!pip install -q findspark==1.4.2

import findspark
findspark.init()

Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:5 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Get:6 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Get:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease [24.3 kB]
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:11 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages [2,861 kB]
Get:12 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 Packages [1,519 kB]
Get:13 http://archive.ubuntu.com/ubuntu jammy-backports/universe amd64 Packages [35.2 kB]
Get:14 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [2,561 kB]
Get:15 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [1,228 kB]
Get:16 https://r2u.stat.illinois.edu/ubuntu jammy/main all Packages [8,622 kB]
Get:17 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 Packages
Get:18 https://r2u.stat.illinois.edu/ubuntu jammy/main amd64 Packages [2,647 kB]
Get:19 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy/main amd64 Package
Fetched 21.2 MB in 4s (5,622 kB/s)
Reading package lists... Done
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
libxtst6 openjdk-8-jre-headless
Suggested packages:
openjdk-8-demo openjdk-8-source libnss-mdns fonts-dejavu-extra fonts-nanum fonts-ipafont-g
fonts-ipafont-mincho fonts-wqy-microhei fonts-wqy-zenhei fonts-indic
The following NEW packages will be installed:
libxtst6 openjdk-8-jdk-headless openjdk-8-jre-headless
0 upgraded, 3 newly installed, 0 to remove and 57 not upgraded.
Need to get 39.7 MB of archives.
After this operation, 144 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 libxtst6 amd64 2:1.2.3-1build4 [13.4
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 openjdk-8-jre-headless a
Get:3 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 openjdk-8-jdk-headless a


Fetched 39.7 MB in 1s (56.6 MB/s)
Selecting previously unselected package libxtst6:amd64.
(Reading database ... 124561 files and directories currently installed.)
Preparing to unpack .../libxtst6_2%3a1.2.3-1build4_amd64.deb ...
Unpacking libxtst6:amd64 (2:1.2.3-1build4) ...
Selecting previously unselected package openjdk-8-jre-headless:amd64.
Preparing to unpack .../openjdk-8-jre-headless_8u432-ga~us1-0ubuntu2~22.04_amd64.deb ...
Unpacking openjdk-8-jre-headless:amd64 (8u432-ga~us1-0ubuntu2~22.04) ...
Selecting previously unselected package openjdk-8-jdk-headless:amd64.
Preparing to unpack .../openjdk-8-jdk-headless_8u432-ga~us1-0ubuntu2~22.04_amd64.deb ...
Unpacking openjdk-8-jdk-headless:amd64 (8u432-ga~us1-0ubuntu2~22.04) ...
Setting up libxtst6:amd64 (2:1.2.3-1build4) ...
Setting up openjdk-8-jre-headless:amd64 (8u432-ga~us1-0ubuntu2~22.04) ...
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/orbd to provide /usr/bi
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/servertool to provide /
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/tnameserv to provide /u
Setting up openjdk-8-jdk-headless:amd64 (8u432-ga~us1-0ubuntu2~22.04) ...
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/clhsdb to provide /usr/bin/
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/extcheck to provide /usr/bi
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/hsdb to provide /usr/bin/hs
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/idlj to provide /usr/bin/id
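
A quick sanity check can catch a wrong path before findspark.init() fails with a less obvious error. The sketch below is only an illustration: the helper name missing_spark_paths is hypothetical (it is not part of findspark or PySpark), and it assumes the two environment variables set earlier in this notebook are the only ones that matter.

```python
import os


def missing_spark_paths(env=os.environ):
    """Return the required variable names that are unset or not existing directories."""
    # Hypothetical helper for illustration; JAVA_HOME and SPARK_HOME are the
    # two variables this notebook sets before calling findspark.init().
    required = ("JAVA_HOME", "SPARK_HOME")
    return [name for name in required if not os.path.isdir(env.get(name, ""))]


problems = missing_spark_paths()
if problems:
    print("Check these variables before calling findspark.init():", problems)
else:
    print("JAVA_HOME and SPARK_HOME both point to existing directories.")
```

If the list is not empty, the usual causes are a Spark archive that was not extracted under /content or a Java package that installed to a different directory.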

2. Data Wrangling

The dataset available at this link contains macroeconomic data on the United Kingdom
since the 13th century.

2.1. Data

Download the data using the Google Colab virtual machine with the code
below.

# Part 2: Data Wrangling

# Import the required libraries


from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a Spark session


spark = SparkSession.builder.master("local[*]").getOrCreate()

# Download the data from the link below.


!wget -q "https://raw.githubusercontent.com/cluster-apps-on-docker/spark-standalone-cluster-on-do

2.2. Wrangling


# Check that the file was downloaded


!ls -l "uk-macroeconomic-data.csv"

-rw-r--r-- 1 root root 184319 Jan 23 15:08 uk-macroeconomic-data.csv

# Load the data
df = spark.read.csv("uk-macroeconomic-data.csv", header=True, inferSchema=True)

# Show the first records for verification


df.show(5)

+-----------+------------------------------------+-----------------------------------+------
|Description|Real GDP of England at market prices|Real GDP of England at factor cost |Real U
+-----------+------------------------------------+-----------------------------------+------
| Units| £mn, Chained Volu...| £mn, Chained Volu...|
| 1209| null| null|
| 1210| null| null|
| 1211| null| null|
| 1212| null| null|
+-----------+------------------------------------+-----------------------------------+------
only showing top 5 rows

# Check the structure of the DataFrame


df.printSchema()

root
|-- Description: string (nullable = true)
|-- Real GDP of England at market prices: string (nullable = true)
|-- Real GDP of England at factor cost : string (nullable = true)
|-- Real UK GDP at market prices, geographically-consistent estimate based on post-1922 bor
|-- Real UK GDP at factor cost, geographically-consistent estimate based on post-1922 borde
|-- Index of real UK GDP at factor cost - based on changing political boundaries, : string
|-- Composite estimate of English and (geographically-consistent) UK real GDP at factor cos
|-- HP-filter of log of real composite estimate of English and UK real GDP at factor cost:
|-- Real UK gross disposable national income at market prices, constant border estimate: st
|-- Real consumption: string (nullable = true)
|-- Real investment: string (nullable = true)
|-- Stockbuilding contribution: string (nullable = true)
|-- Real government consumption of goods and services: string (nullable = true)
|-- Export volumes: string (nullable = true)
|-- Import volumes: string (nullable = true)
|-- Nominal GDP of England at market prices: string (nullable = true)
|-- Nominal UK GDP at market prices: string (nullable = true)
|-- Nominal UK GDP at market prices.1: string (nullable = true)
|-- Population (GB+NI): string (nullable = true)
|-- Population (England): string (nullable = true)
|-- Employment: string (nullable = true)
|-- Unemployment rate: string (nullable = true)
|-- Average weekly hours worked: string (nullable = true)
|-- Capital Services, whole economy: string (nullable = true)
|-- TFP growth: string (nullable = true)
|-- Labour productivity: string (nullable = true)
|-- Labour productivity.1: string (nullable = true)
|-- Labour share, whole economy excluding rents: string (nullable = true)
|-- GDP deflator at market prices: string (nullable = true)
|-- Export prices: string (nullable = true)
|-- Import prices: string (nullable = true)
|-- Terms of Trade: string (nullable = true)
|-- $ Oil prices: string (nullable = true)
|-- Earnings per head: string (nullable = true)
|-- Consumer price index: string (nullable = true)
|-- Consumer price inflation: string (nullable = true)
|-- Real consumption wages: string (nullable = true)
|-- Wholesale/producer price index: string (nullable = true)
|-- Bank Rate: string (nullable = true)
|-- Bank Rate.1: string (nullable = true)
|-- 10 year/medium-term government bond yields: string (nullable = true)
|-- Consols / long-term government bond yields: string (nullable = true)
|-- Mortgage rates: string (nullable = true)
|-- Corporate borrowing rate from banks: string (nullable = true)
|-- Corporate bond yields: string (nullable = true)
|-- Share prices: string (nullable = true)
|-- $/£ exchange rate: string (nullable = true)
|-- Real $/£ exchange rate: string (nullable = true)
|-- Nominal ERI: string (nullable = true)
|-- Real ERI: string (nullable = true)
|-- House price index: string (nullable = true)
|-- Credit : string (nullable = true)
|-- Secured credit: string (nullable = true)
|-- Bank of England Balance sheet: string (nullable = true)
|-- Bank of England Balance sheet.1: string (nullable = true)
|-- Coin in circulation outside the Bank of England: string (nullable = true)
|-- Notes and coin in circulation: string (nullable = true)
|-- Monetary base: string (nullable = true)

# Select the relevant columns and sort by year in descending order
df_filtered = df.select("Description", "Unemployment rate", "Population (GB+NI)").orderBy(col(

# Show the first records for verification


df_filtered.show()

+-----------+-----------------+------------------+
|Description|Unemployment rate|Population (GB+NI)|
+-----------+-----------------+------------------+
| Units| %| 000s|
| 2016| 4.90| 65573|
| 2015| 5.38| 65110|
| 2014| 6.18| 64597|
| 2013| 7.61| 64106|
| 2012| 7.97| 63705|
| 2011| 8.11| 63285|
| 2010| 7.87| 62759|
| 2009| 7.61| 62260|
| 2008| 5.69| 61824|
| 2007| 5.33| 61319|
| 2006| 5.42| 60827|
| 2005| 4.83| 60413|
| 2004| 4.75| 59950|
| 2003| 5.01| 59637|
| 2002| 5.19| 59366|
| 2001| 5.10| 59113|
| 2000| 5.46| 58886|
| 1999| 5.98| 58684|
| 1998| 6.26| 58475|
+-----------+-----------------+------------------+
only showing top 20 rows

Process the data so that the final dataset presents the unemployment rate
( Unemployment rate ) and population ( Population (GB+NI) ) values ordered by year in
descending order:

year,population,unemployment_rate
...,...,...

To do this, use:

• PySpark

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("pyspark-notebook").getOrCreate()

dataframe = spark.read.csv(path='uk-macroeconomic-data.csv', sep=',', header=True)

# Viewing the first 10 rows of the DataFrame


dataframe.show(n=10)

+-----------+------------------------------------+-----------------------------------+------
|Description|Real GDP of England at market prices|Real GDP of England at factor cost |Real U
+-----------+------------------------------------+-----------------------------------+------
| Units| £mn, Chained Volu...| £mn, Chained Volu...|
| 1209| null| null|
| 1210| null| null|
| 1211| null| null|
| 1212| null| null|
| 1213| null| null|
| 1214| null| null|
| 1215| null| null|
| 1216| null| null|
| 1217| null| null|
+-----------+------------------------------------+-----------------------------------+------
only showing top 10 rows

# Viewing the columns present


dataframe.columns

['Description',
'Real GDP of England at market prices',
'Real GDP of England at factor cost ',
'Real UK GDP at market prices, geographically-consistent estimate based on post-1922 borders',
'Real UK GDP at factor cost, geographically-consistent estimate based on post-1922 borders',
'Index of real UK GDP at factor cost - based on changing political boundaries, ',
'Composite estimate of English and (geographically-consistent) UK real GDP at factor cost',
'HP-filter of log of real composite estimate of English and UK real GDP at factor cost',
'Real UK gross disposable national income at market prices, constant border estimate',
'Real consumption',
'Real investment',
'Stockbuilding contribution',
'Real government consumption of goods and services',
'Export volumes',
'Import volumes',
'Nominal GDP of England at market prices',
'Nominal UK GDP at market prices',
'Nominal UK GDP at market prices.1',
'Population (GB+NI)',
'Population (England)',
'Employment',
'Unemployment rate',
'Average weekly hours worked',
'Capital Services, whole economy',
'TFP growth',
'Labour productivity',
'Labour productivity.1',
'Labour share, whole economy excluding rents',
'GDP deflator at market prices',
'Export prices',
'Import prices',
'Terms of Trade',
'$ Oil prices',
'Earnings per head',
'Consumer price index',
'Consumer price inflation',
'Real consumption wages',
'Wholesale/producer price index',
'Bank Rate',
'Bank Rate.1',
'10 year/medium-term government bond yields',
'Consols / long-term government bond yields',
'Mortgage rates',
'Corporate borrowing rate from banks',
'Corporate bond yields',
'Share prices',
'$/£ exchange rate',
'Real $/£ exchange rate',
'Nominal ERI',
'Real ERI',
'House price index',
'Credit ',
'Secured credit',
'Bank of England Balance sheet',

# Viewing the number of columns
len(dataframe.columns)

# Selecting the relevant columns


dataframe = dataframe.select(['Description', 'Population (GB+NI)', 'Unemployment rate'])

# Renaming the columns to the names required by the exercise
data = (
    dataframe
    .withColumnRenamed('Description', 'year')
    .withColumnRenamed('Population (GB+NI)', 'population')
    .withColumnRenamed('Unemployment rate', 'unemployment_rate')
)

# Viewing the first 10 rows

data.show(n=10)

+-----+----------+-----------------+
| year|population|unemployment_rate|
+-----+----------+-----------------+
|Units| 000s| %|
| 1209| null| null|
| 1210| null| null|
| 1211| null| null|
| 1212| null| null|
| 1213| null| null|
| 1214| null| null|
| 1215| null| null|
| 1216| null| null|
| 1217| null| null|
+-----+----------+-----------------+
only showing top 10 rows

8 of 11 23/01/2025, 14:46
Analista_M40_Exercicio_FINAL - Colab https://colab.research.google.com/#fileId=https%3A//storage.googleapi...

# Selecting the row of the DataFrame whose 'year' column holds the units label


data_description = data.filter(data['year'] == 'Units')

# Viewing the result
data_description.show()

+-----+----------+-----------------+
| year|population|unemployment_rate|
+-----+----------+-----------------+
|Units| 000s| %|
+-----+----------+-----------------+

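The join method performs a distributed join of two DataFrames. The broadcast method "marks" a
DataFrame as "small", hinting that Spark should send a full copy of it over the network to every
executor so that the larger DataFrame does not need to be shuffled. With how='left_anti', the join
keeps only the rows of data that have no match in data_description, which drops the "Units" row.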

from pyspark.sql.functions import broadcast

data = data.join(other=broadcast(data_description), on=['year'], how='left_anti')

data.show(n=10)

+----+----------+-----------------+
|year|population|unemployment_rate|
+----+----------+-----------------+
|1209| null| null|
|1210| null| null|
|1211| null| null|
|1212| null| null|
|1213| null| null|
|1214| null| null|
|1215| null| null|
|1216| null| null|
|1217| null| null|
|1218| null| null|
+----+----------+-----------------+
only showing top 10 rows
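
To make the left_anti semantics concrete, here is a minimal plain-Python sketch. It is not Spark code and says nothing about how Spark executes the join; the function name left_anti is hypothetical, and the sample rows are copied from the outputs shown in this notebook.

```python
def left_anti(left_rows, right_rows, key):
    """Keep the rows of left_rows whose key value does not occur in right_rows."""
    # Collect the keys of the (small) right side once, then filter the left side.
    right_keys = {row[key] for row in right_rows}
    return [row for row in left_rows if row[key] not in right_keys]


# Sample rows copied from the outputs shown in this notebook.
data_rows = [
    {"year": "Units", "population": "000s", "unemployment_rate": "%"},
    {"year": "1209", "population": None, "unemployment_rate": None},
    {"year": "1855", "population": "23241", "unemployment_rate": "3.73"},
]
description_rows = [row for row in data_rows if row["year"] == "Units"]

cleaned_rows = left_anti(data_rows, description_rows, key="year")
print([row["year"] for row in cleaned_rows])  # ['1209', '1855']
```

Building the key set from the small side first loosely mirrors why broadcasting helps: every partition of the large table can be filtered locally once it has a copy of the small one.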

# Removing every row that has at least one null value


data = data.dropna()

# Viewing the result
data.show()

+----+----------+-----------------+
|year|population|unemployment_rate|
+----+----------+-----------------+
|1855| 23241| 3.73|
|1856| 23466| 3.52|
|1857| 23689| 3.95|
|1858| 23914| 5.23|
|1859| 24138| 3.27|
|1860| 24360| 2.94|
|1861| 24585| 3.72|
|1862| 24862| 4.68|
|1863| 25142| 4.15|
|1864| 25425| 2.99|
|1865| 25712| 2.96|
|1866| 26003| 3.29|
|1867| 26296| 4.84|
|1868| 26594| 5.01|
|1869| 26896| 4.68|
|1870| 27201| 3.77|
|1871| 27516| 3.08|
|1872| 27855| 2.31|
|1873| 28198| 2.40|
|1874| 28545| 2.83|
+----+----------+-----------------+
only showing top 20 rows

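# Sorting the data by year. Note: ascending=True sorts in ascending order, as the output shows;
# the descending order requested by the exercise requires ascending=False.


data = data.orderBy('year', ascending=True)

# Viewing the sorted data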


data.show()

+----+----------+-----------------+
|year|population|unemployment_rate|
+----+----------+-----------------+
|1855| 23241| 3.73|
|1856| 23466| 3.52|
|1857| 23689| 3.95|
|1858| 23914| 5.23|
|1859| 24138| 3.27|
|1860| 24360| 2.94|
|1861| 24585| 3.72|
|1862| 24862| 4.68|
|1863| 25142| 4.15|
|1864| 25425| 2.99|
|1865| 25712| 2.96|
|1866| 26003| 3.29|
|1867| 26296| 4.84|
|1868| 26594| 5.01|
|1869| 26896| 4.68|
|1870| 27201| 3.77|
|1871| 27516| 3.08|
|1872| 27855| 2.31|
|1873| 28198| 2.40|
|1874| 28545| 2.83|
+----+----------+-----------------+
only showing top 20 rows
