Data Analysis
Topics
1. Introduction;
2. Apache Spark;
3. Data Wrangling with Spark.
Exercises
1. Apache Spark
Repeat the activities from items 2.1 and 2.2 to install and configure an Apache Spark cluster on the
Google Colab virtual machine.
1 of 11 23/01/2025, 14:46
Analista_M40_Exercicio_FINAL - Colab https://colab.research.google.com/#fileId=https%3A//storage.googleapi...
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"
import findspark
findspark.init()
2. Data Wrangling
The dataset available at this link contains macroeconomic data on the United Kingdom
since the 13th century.
2.1. Data
Download the data to the Google Colab virtual machine with the code
below.
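The download cell did not survive in this printout. A minimal sketch of such a step, using only the Python standard library, could look like the following. The helper name `download_csv` is hypothetical, and the source URL is left as a parameter because the link is not reproduced here; the default file name is the one the later `spark.read.csv` call expects.

```python
import urllib.request
from pathlib import Path


def download_csv(url: str, dest: str = "uk-macroeconomic-data.csv") -> Path:
    """Download the file at `url` to `dest` and return the local path.

    `url` is an assumption supplied by the caller: use the dataset link
    given in the exercise statement.
    """
    local_path, _headers = urllib.request.urlretrieve(url, dest)
    return Path(local_path)
```

Calling `download_csv(url)` with the exercise link should leave `uk-macroeconomic-data.csv` in the working directory of the Colab virtual machine.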
2.2. Wrangling
# Load the data
df = spark.read.csv("uk-macroeconomic-data.csv", header=True, inferSchema=True)
+-----------+------------------------------------+-----------------------------------+------
|Description|Real GDP of England at market prices|Real GDP of England at factor cost |Real U
+-----------+------------------------------------+-----------------------------------+------
| Units| £mn, Chained Volu...| £mn, Chained Volu...|
| 1209| null| null|
| 1210| null| null|
| 1211| null| null|
| 1212| null| null|
+-----------+------------------------------------+-----------------------------------+------
only showing top 5 rows
root
|-- Description: string (nullable = true)
|-- Real GDP of England at market prices: string (nullable = true)
|-- Real GDP of England at factor cost : string (nullable = true)
|-- Real UK GDP at market prices, geographically-consistent estimate based on post-1922 bor
|-- Real UK GDP at factor cost, geographically-consistent estimate based on post-1922 borde
|-- Index of real UK GDP at factor cost - based on changing political boundaries, : string
|-- Composite estimate of English and (geographically-consistent) UK real GDP at factor cos
|-- HP-filter of log of real composite estimate of English and UK real GDP at factor cost:
|-- Real UK gross disposable national income at market prices, constant border estimate: st
|-- Real consumption: string (nullable = true)
|-- Real investment: string (nullable = true)
|-- Stockbuilding contribution: string (nullable = true)
|-- Real government consumption of goods and services: string (nullable = true)
|-- Export volumes: string (nullable = true)
|-- Import volumes: string (nullable = true)
|-- Nominal GDP of England at market prices: string (nullable = true)
|-- Nominal UK GDP at market prices: string (nullable = true)
|-- Nominal UK GDP at market prices.1: string (nullable = true)
|-- Population (GB+NI): string (nullable = true)
|-- Population (England): string (nullable = true)
|-- Employment: string (nullable = true)
|-- Unemployment rate: string (nullable = true)
+-----------+-----------------+------------------+
|Description|Unemployment rate|Population (GB+NI)|
+-----------+-----------------+------------------+
| Units| %| 000s|
| 2016| 4.90| 65573|
| 2015| 5.38| 65110|
| 2014| 6.18| 64597|
| 2013| 7.61| 64106|
| 2012| 7.97| 63705|
| 2011| 8.11| 63285|
| 2010| 7.87| 62759|
Process the data so that the final dataset contains the unemployment rate
( Unemployment rate ) and population ( Population (GB+NI) ) values, sorted by year in
descending order:
year,population,unemployment_rate
...,...,...
• PySpark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("pyspark-notebook").getOrCreate()
+-----------+------------------------------------+-----------------------------------+------
|Description|Real GDP of England at market prices|Real GDP of England at factor cost |Real U
+-----------+------------------------------------+-----------------------------------+------
| Units| £mn, Chained Volu...| £mn, Chained Volu...|
| 1209| null| null|
| 1210| null| null|
| 1211| null| null|
| 1212| null| null|
['Description',
'Real GDP of England at market prices',
'Real GDP of England at factor cost ',
'Real UK GDP at market prices, geographically-consistent estimate based on post-1922 borders',
'Real UK GDP at factor cost, geographically-consistent estimate based on post-1922 borders',
'Index of real UK GDP at factor cost - based on changing political boundaries, ',
'Composite estimate of English and (geographically-consistent) UK real GDP at factor cost',
'HP-filter of log of real composite estimate of English and UK real GDP at factor cost',
'Real UK gross disposable national income at market prices, constant border estimate',
'Real consumption',
'Real investment',
'Stockbuilding contribution',
'Real government consumption of goods and services',
'Export volumes',
'Import volumes',
'Nominal GDP of England at market prices',
'Nominal UK GDP at market prices',
'Nominal UK GDP at market prices.1',
'Population (GB+NI)',
'Population (England)',
'Employment',
'Unemployment rate',
'Average weekly hours worked',
'Capital Services, whole economy',
'TFP growth',
'Labour productivity',
'Labour productivity.1',
'Labour share, whole economy excluding rents',
'GDP deflator at market prices',
'Export prices',
'Import prices',
'Terms of Trade',
'$ Oil prices',
'Earnings per head',
'Consumer price index',
'Consumer price inflation',
'Real consumption wages',
# Rename the columns
data = (
    dataframe
    .withColumnRenamed('Description', 'year')
    .withColumnRenamed('Population (GB+NI)', 'population')
    .withColumnRenamed('Unemployment rate', 'unemployment_rate')
)
data.show(n=10)
+-----+----------+-----------------+
| year|population|unemployment_rate|
+-----+----------+-----------------+
|Units| 000s| %|
| 1209| null| null|
| 1210| null| null|
| 1211| null| null|
| 1212| null| null|
| 1213| null| null|
| 1214| null| null|
| 1215| null| null|
| 1216| null| null|
| 1217| null| null|
+-----+----------+-----------------+
only showing top 10 rows
# Show the result
data_description.show()
+-----+----------+-----------------+
| year|population|unemployment_rate|
+-----+----------+-----------------+
|Units| 000s| %|
+-----+----------+-----------------+
The join method performs a distributed join of two DataFrames. The broadcast function "marks" a
DataFrame as "small", hinting Spark to send a full copy of it over the network to every executor so
that the larger DataFrame does not have to be shuffled.
data.show(n=10)
+----+----------+-----------------+
|year|population|unemployment_rate|
+----+----------+-----------------+
|1209| null| null|
|1210| null| null|
|1211| null| null|
|1212| null| null|
|1213| null| null|
|1214| null| null|
|1215| null| null|
|1216| null| null|
|1217| null| null|
|1218| null| null|
+----+----------+-----------------+
only showing top 10 rows
# Show the result
data.show()
+----+----------+-----------------+
|year|population|unemployment_rate|
+----+----------+-----------------+
|1855| 23241| 3.73|
|1856| 23466| 3.52|
|1857| 23689| 3.95|
|1858| 23914| 5.23|
|1859| 24138| 3.27|
|1860| 24360| 2.94|
|1861| 24585| 3.72|
|1862| 24862| 4.68|
|1863| 25142| 4.15|
|1864| 25425| 2.99|
|1865| 25712| 2.96|
|1866| 26003| 3.29|
|1867| 26296| 4.84|
|1868| 26594| 5.01|
|1869| 26896| 4.68|
|1870| 27201| 3.77|
|1871| 27516| 3.08|
|1872| 27855| 2.31|
|1873| 28198| 2.40|
|1874| 28545| 2.83|
+----+----------+-----------------+
only showing top 20 rows
+----+----------+-----------------+
|year|population|unemployment_rate|
+----+----------+-----------------+
|1855| 23241| 3.73|
|1856| 23466| 3.52|
|1857| 23689| 3.95|
|1858| 23914| 5.23|
|1859| 24138| 3.27|
|1860| 24360| 2.94|
|1861| 24585| 3.72|
|1862| 24862| 4.68|
|1863| 25142| 4.15|
|1864| 25425| 2.99|
|1865| 25712| 2.96|
|1866| 26003| 3.29|
|1867| 26296| 4.84|
|1868| 26594| 5.01|
|1869| 26896| 4.68|
|1870| 27201| 3.77|
|1871| 27516| 3.08|
|1872| 27855| 2.31|