Data Analysis

The document outlines a final exercise for a Cloud Computing III module, focusing on Apache Spark installation and data wrangling using PySpark in Google Colab. It includes detailed instructions for setting up an Apache Spark cluster, downloading necessary data, and performing data manipulation on UK macroeconomic data. The exercises aim to provide hands-on experience with Spark's capabilities in data processing and analysis.

Uploaded by

Fabio Pereira


Analista_M40_Exercicio_FINAL - Colab https://colab.research.google.com/#fileId=https%3A//storage.googleapi...

Module | Cloud Computing III

Exercise Notebook
Professor André Perez

Topics

1. Introduction;
2. Apache Spark;
3. Data Wrangling with Spark.

Exercises

1. Apache Spark

Replicate the activities from items 2.1 and 2.2 to install and configure an Apache Spark cluster on the
Google Colab virtual machine.

# Part 1: Installation and Configuration

# Download Spark, version 3.1.1

!wget -q https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz
!tar xf spark-3.1.1-bin-hadoop2.7.tgz && rm spark-3.1.1-bin-hadoop2.7.tgz

# Download and install Java, version 8

!apt-get update
!apt-get install -y openjdk-8-jdk-headless

# Install PySpark, pinned to version 3.1.1 to match the Spark download

!pip install -q pyspark==3.1.1

# Point the environment at the Java and Spark installations
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"

!pip install -q findspark==1.4.2

# Make the Spark installation importable from Python
import findspark
findspark.init()
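Before calling findspark.init(), it can help to confirm that JAVA_HOME and SPARK_HOME point to directories that actually exist, since a typo in either path tends to surface later as a less obvious error. The sketch below is only an illustration: the helper name missing_paths is hypothetical and is not part of findspark or PySpark.

```python
import os


def missing_paths(env, names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variable names that are unset or do not point to a directory.

    `env` is any mapping of variable names to paths, e.g. os.environ.
    """
    missing = []
    for name in names:
        path = env.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing


# An empty list means both variables point to existing directories.
print(missing_paths(os.environ))
```

If the list is not empty, re-checking the download and apt-get steps above is a reasonable first move.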

Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:5 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Get:6 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Get:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease [24.3 kB]
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:11 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages [2,861 kB]
Get:12 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 Packages [1,519 kB]
Get:13 http://archive.ubuntu.com/ubuntu jammy-backports/universe amd64 Packages [35.2 kB]
Get:14 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [2,561 kB]
Get:15 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [1,228 kB]
Get:16 https://r2u.stat.illinois.edu/ubuntu jammy/main all Packages [8,622 kB]
Get:17 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 Packages
Get:18 https://r2u.stat.illinois.edu/ubuntu jammy/main amd64 Packages [2,647 kB]
Get:19 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy/main amd64 Package
Fetched 21.2 MB in 4s (5,622 kB/s)
Reading package lists... Done
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
libxtst6 openjdk-8-jre-headless
Suggested packages:
openjdk-8-demo openjdk-8-source libnss-mdns fonts-dejavu-extra fonts-nanum fonts-ipafont-g
fonts-ipafont-mincho fonts-wqy-microhei fonts-wqy-zenhei fonts-indic
The following NEW packages will be installed:
libxtst6 openjdk-8-jdk-headless openjdk-8-jre-headless
0 upgraded, 3 newly installed, 0 to remove and 57 not upgraded.
Need to get 39.7 MB of archives.
After this operation, 144 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 libxtst6 amd64 2:1.2.3-1build4 [13.4
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 openjdk-8-jre-headless a
Get:3 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 openjdk-8-jdk-headless a
Fetched 39.7 MB in 1s (56.6 MB/s)
Selecting previously unselected package libxtst6:amd64.
(Reading database ... 124561 files and directories currently installed.)
Preparing to unpack .../libxtst6_2%3a1.2.3-1build4_amd64.deb ...
Unpacking libxtst6:amd64 (2:1.2.3-1build4) ...
Selecting previously unselected package openjdk-8-jre-headless:amd64.
Preparing to unpack .../openjdk-8-jre-headless_8u432-ga~us1-0ubuntu2~22.04_amd64.deb ...
Unpacking openjdk-8-jre-headless:amd64 (8u432-ga~us1-0ubuntu2~22.04) ...
Selecting previously unselected package openjdk-8-jdk-headless:amd64.
Preparing to unpack .../openjdk-8-jdk-headless_8u432-ga~us1-0ubuntu2~22.04_amd64.deb ...
Unpacking openjdk-8-jdk-headless:amd64 (8u432-ga~us1-0ubuntu2~22.04) ...
Setting up libxtst6:amd64 (2:1.2.3-1build4) ...
Setting up openjdk-8-jre-headless:amd64 (8u432-ga~us1-0ubuntu2~22.04) ...
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/orbd to provide /usr/bi
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/servertool to provide /
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/tnameserv to provide /u
Setting up openjdk-8-jdk-headless:amd64 (8u432-ga~us1-0ubuntu2~22.04) ...
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/clhsdb to provide /usr/bin/
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/extcheck to provide /usr/bi
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/hsdb to provide /usr/bin/hs
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/idlj to provide /usr/bin/id

 2. Data Wrangling

The dataset available at this link contains macroeconomic data on the United Kingdom
since the 13th century.

2.1. Data

Download the data to the Google Colab virtual machine using the code
below.

# Part 2: Data Wrangling

# Import the required libraries

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a Spark session

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Download the data from the link below.

!wget -q "https://raw.githubusercontent.com/cluster-apps-on-docker/spark-standalone-cluster-on-do

2.2. Wrangling


# Check that the file was downloaded

!ls -l "uk-macroeconomic-data.csv"

-rw-r--r-- 1 root root 184319 Jan 23 15:08 uk-macroeconomic-data.csv

# Load the data
df = spark.read.csv("uk-macroeconomic-data.csv", header=True, inferSchema=True)

# Show the first records as a check

df.show(5)

# Inspect the DataFrame structure

df.printSchema()

root
|-- Description: string (nullable = true)
|-- Real GDP of England at market prices: string (nullable = true)
|-- Real GDP of England at factor cost : string (nullable = true)
|-- Real UK GDP at market prices, geographically-consistent estimate based on post-1922 bor
|-- Real UK GDP at factor cost, geographically-consistent estimate based on post-1922 borde
|-- Index of real UK GDP at factor cost - based on changing political boundaries, : string
|-- Composite estimate of English and (geographically-consistent) UK real GDP at factor cos
|-- HP-filter of log of real composite estimate of English and UK real GDP at factor cost:
|-- Real UK gross disposable national income at market prices, constant border estimate: st
|-- Real consumption: string (nullable = true)
|-- Real investment: string (nullable = true)
|-- Stockbuilding contribution: string (nullable = true)
|-- Real government consumption of goods and services: string (nullable = true)
|-- Export volumes: string (nullable = true)
|-- Import volumes: string (nullable = true)
|-- Nominal GDP of England at market prices: string (nullable = true)
|-- Nominal UK GDP at market prices: string (nullable = true)
|-- Nominal UK GDP at market prices.1: string (nullable = true)
|-- Population (GB+NI): string (nullable = true)
|-- Population (England): string (nullable = true)
|-- Employment: string (nullable = true)
|-- Unemployment rate: string (nullable = true)
|-- Average weekly hours worked: string (nullable = true)
|-- Capital Services, whole economy: string (nullable = true)
|-- TFP growth: string (nullable = true)
|-- Labour productivity: string (nullable = true)
|-- Labour productivity.1: string (nullable = true)
|-- Labour share, whole economy excluding rents: string (nullable = true)
|-- GDP deflator at market prices: string (nullable = true)
|-- Export prices: string (nullable = true)
|-- Import prices: string (nullable = true)
|-- Terms of Trade: string (nullable = true)
|-- $ Oil prices: string (nullable = true)
|-- Earnings per head: string (nullable = true)
|-- Consumer price index: string (nullable = true)
|-- Consumer price inflation: string (nullable = true)
|-- Real consumption wages: string (nullable = true)
|-- Wholesale/producer price index: string (nullable = true)
|-- Bank Rate: string (nullable = true)
|-- Bank Rate.1: string (nullable = true)
|-- 10 year/medium-term government bond yields: string (nullable = true)
|-- Consols / long-term government bond yields: string (nullable = true)
|-- Mortgage rates: string (nullable = true)
|-- Corporate borrowing rate from banks: string (nullable = true)
|-- Corporate bond yields: string (nullable = true)
|-- Share prices: string (nullable = true)
|-- $/£ exchange rate: string (nullable = true)
|-- Real $/£ exchange rate: string (nullable = true)
|-- Nominal ERI: string (nullable = true)
|-- Real ERI: string (nullable = true)
|-- House price index: string (nullable = true)
|-- Credit : string (nullable = true)
|-- Secured credit: string (nullable = true)
|-- Bank of England Balance sheet: string (nullable = true)
|-- Bank of England Balance sheet.1: string (nullable = true)
|-- Coin in circulation outside the Bank of England: string (nullable = true)
|-- Notes and coin in circulation: string (nullable = true)
|-- Monetary base: string (nullable = true)
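
Every column in this schema is reported as string even though inferSchema=True was passed. A likely reason, judging from the Units row that appears in the later outputs (with values such as % and 000s), is that a single non-numeric value in a column is enough to rule out a numeric type. The sketch below is a deliberately simplified plain-Python stand-in for that idea, not Spark's actual inference code; the function name infer_column_type is hypothetical.

```python
def infer_column_type(values):
    """Guess 'int', 'double' or 'string' for a list of raw CSV strings.

    None and empty strings count as missing and are ignored.
    """
    present = [v for v in values if v is not None and v != ""]

    def all_parse(cast):
        # True only if every present value can be converted with `cast`.
        try:
            for v in present:
                cast(v)
            return True
        except ValueError:
            return False

    if present and all_parse(int):
        return "int"
    if present and all_parse(float):
        return "double"
    return "string"


print(infer_column_type(["%", "4.90", "5.38"]))  # string
print(infer_column_type(["4.90", "5.38"]))       # double
print(infer_column_type(["65573", "65110"]))     # int
```

Under this reading, removing the Units row and then casting the columns would be one way to obtain numeric types.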

# Select the relevant columns and sort by year in descending order
df_filtered = df.select("Description", "Unemployment rate", "Population (GB+NI)").orderBy(col(

# Show the first records as a check

df_filtered.show()

+-----------+-----------------+------------------+
|Description|Unemployment rate|Population (GB+NI)|
+-----------+-----------------+------------------+
|      Units|                %|              000s|
|       2016|             4.90|             65573|
|       2015|             5.38|             65110|
|       2014|             6.18|             64597|
|       2013|             7.61|             64106|
|       2012|             7.97|             63705|
|       2011|             8.11|             63285|
|       2010|             7.87|             62759|
|       2009|             7.61|             62260|
|       2008|             5.69|             61824|
|       2007|             5.33|             61319|
|       2006|             5.42|             60827|
|       2005|             4.83|             60413|
|       2004|             4.75|             59950|
|       2003|             5.01|             59637|
|       2002|             5.19|             59366|
|       2001|             5.10|             59113|
|       2000|             5.46|             58886|
|       1999|             5.98|             58684|
|       1998|             6.26|             58475|
+-----------+-----------------+------------------+
only showing top 20 rows

Process the data so that the final dataset presents the unemployment rate
( Unemployment rate ) and population ( Population (GB+NI) ) values sorted by year in
descending order:

year,population,unemployment_rate
...,...,...

To do so, use:

• PySpark
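
As a reference for the target layout, the same steps (rename the three columns, drop the Units row, drop rows with missing values, sort by year descending) can be sketched in plain Python with the csv module. This is only an illustration of the expected result, not the PySpark solution the exercise asks for. The SAMPLE text is a tiny hand-made excerpt that reuses values shown in this notebook's outputs, and the function name wrangle is hypothetical.

```python
import csv
import io

# Tiny illustrative excerpt; the real file has many more columns and rows.
SAMPLE = """Description,Population (GB+NI),Unemployment rate
Units,000s,%
1212,,
2015,65110,5.38
2016,65573,4.90
"""


def wrangle(csv_text):
    """Return rows as dicts with year, population, unemployment_rate, newest first."""
    rows = []
    for rec in csv.DictReader(io.StringIO(csv_text)):
        year = rec["Description"]
        population = rec["Population (GB+NI)"]
        rate = rec["Unemployment rate"]
        if year == "Units":
            continue  # the row that only describes the units of each column
        if not population or not rate:
            continue  # skip rows with at least one missing value
        rows.append({
            "year": int(year),
            "population": int(population),
            "unemployment_rate": float(rate),
        })
    rows.sort(key=lambda r: r["year"], reverse=True)
    return rows


for row in wrangle(SAMPLE):
    print(row)
```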

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("pyspark-notebook").getOrCreate()

dataframe = spark.read.csv(path='uk-macroeconomic-data.csv', sep=',', header=True)

# View the first 10 rows of the DataFrame

dataframe.show(n=10)


| 1212| null| null|

# View the columns present

dataframe.columns

['Description',
'Real GDP of England at market prices',
'Real GDP of England at factor cost ',
'Real UK GDP at market prices, geographically-consistent estimate based on post-1922 borders',
'Real UK GDP at factor cost, geographically-consistent estimate based on post-1922 borders',
'Index of real UK GDP at factor cost - based on changing political boundaries, ',
'Composite estimate of English and (geographically-consistent) UK real GDP at factor cost',
'HP-filter of log of real composite estimate of English and UK real GDP at factor cost',
'Real UK gross disposable national income at market prices, constant border estimate',
'Real consumption',
'Real investment',
'Stockbuilding contribution',
'Real government consumption of goods and services',
'Export volumes',
'Import volumes',
'Nominal GDP of England at market prices',
'Nominal UK GDP at market prices',
'Nominal UK GDP at market prices.1',
'Population (GB+NI)',
'Population (England)',
'Employment',
'Unemployment rate',
'Average weekly hours worked',
'Capital Services, whole economy',
'TFP growth',
'Labour productivity',
'Labour productivity.1',
'Labour share, whole economy excluding rents',
'GDP deflator at market prices',
'Export prices',
'Import prices',
'Terms of Trade',
'$ Oil prices',
'Earnings per head',
'Consumer price index',
'Consumer price inflation',
'Real consumption wages',
'Wholesale/producer price index',
'Bank Rate',
'Bank Rate.1',
'10 year/medium-term government bond yields',
'Consols / long-term government bond yields',
'Mortgage rates',
'Corporate borrowing rate from banks',
'Corporate bond yields',
'Share prices',
'$/£ exchange rate',
'Real $/£ exchange rate',
'Nominal ERI',
'Real ERI',
'House price index',
'Credit ',
'Secured credit',
'Bank of England Balance sheet',

# View the number of columns
len(dataframe.columns)

# Select the relevant columns

dataframe = dataframe.select(['Description', 'Population (GB+NI)', 'Unemployment rate'])

# Rename the columns
data = (
    dataframe
    .withColumnRenamed('Description', 'year')
    .withColumnRenamed('Population (GB+NI)', 'population')
    .withColumnRenamed('Unemployment rate', 'unemployment_rate')
)

# View the first 10 rows

data.show(n=10)


# Select one row of the DataFrame based on the content of the 'year' column

data_description = data.filter(data['year'] == 'Units')

# View the result
data_description.show()

+-----+----------+-----------------+
| year|population|unemployment_rate|
+-----+----------+-----------------+
|Units|      000s|                %|
+-----+----------+-----------------+

The join method performs a distributed join of two DataFrames. The broadcast function "marks" a
DataFrame as "small", hinting that Spark should send a full copy of it over the network to every executor instead of shuffling the larger DataFrame.

from pyspark.sql.functions import broadcast

data = data.join(other=broadcast(data_description), on=['year'], how='left_anti')

data.show(n=10)
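
The left_anti join used above keeps only the rows of the left DataFrame whose key has no match in the right one, which is how the Units row gets removed. The plain-Python sketch below illustrates those semantics on lists of dictionaries; it is not how Spark implements the join, and the helper name left_anti_join is hypothetical. Building a lookup set from the small side loosely mirrors the idea behind broadcast: the small table is copied whole and probed locally.

```python
def left_anti_join(left, right, on):
    """Return the rows of `left` whose key columns have no match in `right`."""
    # The small side becomes an in-memory lookup set of key tuples.
    right_keys = {tuple(row[k] for k in on) for row in right}
    return [row for row in left if tuple(row[k] for k in on) not in right_keys]


data = [
    {"year": "Units", "population": "000s", "unemployment_rate": "%"},
    {"year": "2016", "population": "65573", "unemployment_rate": "4.90"},
    {"year": "2015", "population": "65110", "unemployment_rate": "5.38"},
]
data_description = [row for row in data if row["year"] == "Units"]

kept = left_anti_join(data, data_description, on=["year"])
print([row["year"] for row in kept])  # ['2016', '2015']
```

For a single known value such as 'Units', a plain filter on year != 'Units' would be an equally valid and simpler choice; the join form generalizes to many keys.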

# Remove every row that has at least one null value

data = data.dropna()

# View the result
data.show()

+----+----------+-----------------+
|year|population|unemployment_rate|
+----+----------+-----------------+
|1855|     23241|             3.73|
|1856|     23466|             3.52|
|1857|     23689|             3.95|
|1858|     23914|             5.23|
|1859|     24138|             3.27|
|1860|     24360|             2.94|
|1861|     24585|             3.72|
|1862|     24862|             4.68|
|1863|     25142|             4.15|
|1864|     25425|             2.99|
|1865|     25712|             2.96|
|1866|     26003|             3.29|
|1867|     26296|             4.84|
|1868|     26594|             5.01|
|1869|     26896|             4.68|
|1870|     27201|             3.77|
|1871|     27516|             3.08|
|1872|     27855|             2.31|
|1873|     28198|             2.40|
|1874|     28545|             2.83|
+----+----------+-----------------+
only showing top 20 rows

# Sort the data by year in descending order

data = data.orderBy('year', ascending=False)

# View the sorted data

data.show()
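
Because the CSV was read without a usable numeric schema, year is a string column, so orderBy on it compares text rather than numbers. For four-digit years the two orders coincide, but they diverge as soon as the values have different lengths. The plain-Python sketch below shows the difference; the value "999" is a made-up example chosen only to expose the pitfall and does not come from the dataset.

```python
# "999" is hypothetical; the other values appear in this notebook's outputs.
years = ["1855", "2016", "999", "1212"]

# Text comparison: "9" sorts after "2" and "1", so "999" comes first.
as_strings = sorted(years, reverse=True)

# Numeric comparison: convert each value with int before comparing.
as_ints = sorted(years, key=int, reverse=True)

print(as_strings)  # ['999', '2016', '1855', '1212']
print(as_ints)     # ['2016', '1855', '1212', '999']
```

Casting year to an integer type before sorting would make the ordering robust regardless of how many digits the values have.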
