Tutorial+1

The document provides a tutorial on data formats and ETL processes, focusing on Anaconda installation, Jupyter Notebook usage, and various data serialization formats such as CSV, JSON, XML, and Avro. It includes instructions for installing Jupyter Notebook, reading different file types, and a comparison between XML and HTML. Additionally, it introduces NLTK as a platform for processing human language data in Python.

Uploaded by

ong.sihui1

IS3107 Tutorial 1

Data Formats and ETL Example

1
TA: Benedict Tan
Final Year Business Analytics Student

2
Anaconda Installation

Anaconda offers one of the easiest ways to use Python on a single machine.

Different projects may depend on different versions of Python and its packages. Anaconda
creates a "virtual environment", an isolated library of dependencies for each project, so
there are no version conflicts between projects.

• Install on Windows - https://docs.anaconda.com/anaconda/install/windows/

• Install on Linux - https://docs.anaconda.com/anaconda/install/linux/

• Install on macOS - https://docs.anaconda.com/anaconda/install/mac-os/
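
To confirm which environment a Python session is actually using, a minimal sketch along these lines may help. It relies only on the standard library. The function name `describe_environment` is hypothetical, and `CONDA_DEFAULT_ENV` is the variable conda normally sets on activation, so it may be missing outside conda.

```python
import os
import sys


def describe_environment():
    """Return basic facts about the interpreter running this code."""
    return {
        # Full path of the Python executable; it sits inside the active environment.
        "executable": sys.executable,
        # Root folder of the environment that owns this interpreter.
        "prefix": sys.prefix,
        # Python version as "major.minor.micro".
        "python_version": ".".join(str(part) for part in sys.version_info[:3]),
        # Name of the activated conda environment, or None if conda did not set it.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running this in two different environments should show two different `executable` and `prefix` paths, which is the isolation described above.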

3
What is Jupyter Notebook?

• Jupyter Notebook is a common tool in data science that makes it easy to explore
and plot data.
• Notes are written in Markdown cells; take a look at a Markdown guide for more about
how to write them.
• A very popular choice for fast prototyping of data science and analytics code
(but not a substitute for a proper IDE, e.g., PyCharm, when developing .py files!).

5
Install Jupyter Notebook

Install Jupyter Notebook
• conda activate NAME
• pip install jupyter notebook

Use virtual environment in Jupyter Notebook
• conda activate NAME
• ipython kernel install --name "local-venv" --user

Remove kernel
• jupyter kernelspec list
• jupyter kernelspec uninstall unwanted-kernel

Reference: Jupyter Notebook Kernels: How to Add, Change, Remove

7
How to open Jupyter Notebook
Mac
1. Open Terminal
2. Enter command `jupyter notebook`

Windows
1. Open PowerShell
2. Enter command `jupyter notebook`

8
Data serialisation formats:
1. CSV
2. JSON
3. XML
4. Avro

How the data is transferred:
1. HTTP (all formats)
2. SOAP (XML)
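
To make the difference between the text formats concrete, the sketch below serialises one record as CSV, JSON and XML using only the Python standard library. The record and the helper names (`to_csv`, `to_json`, `to_xml`) are hypothetical; Avro is left out because it is a binary format that needs a third-party package.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample record used only for illustration.
record = {"name": "Alyssa", "age": 25}


def to_csv(row):
    """Serialise one dict as CSV text: a header line, then a value line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(row), lineterminator="\n")
    writer.writeheader()
    writer.writerow(row)
    return buffer.getvalue()


def to_json(row):
    """Serialise one dict as a JSON object."""
    return json.dumps(row)


def to_xml(row, root_tag="user"):
    """Serialise one dict as XML, one child element per field."""
    root = ET.Element(root_tag)
    for key, value in row.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


print(to_csv(record))
print(to_json(record))
print(to_xml(record))
```

Note that JSON keeps `25` as a number, while CSV and XML carry it as plain text that the reader has to convert back.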

10
How to install package in Jupyter Notebook?

pip install Package_name    (do not need to install every time)
import Package_name         (need to import the library every time)

Example:
pip install pandas
import pandas as pd

pd.show_versions()
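
Before installing, it can be useful to check whether a package is already available in the current environment. The sketch below is one way to do that with the standard library; `package_status` is a hypothetical helper name.

```python
from importlib import metadata, util


def package_status(name):
    """Return the installed version of `name`, or None if it cannot be imported."""
    if util.find_spec(name) is None:
        return None  # not installed in this environment
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        # Importable but without distribution metadata (e.g. a standard library module).
        return "unknown"


for pkg in ["pandas", "json"]:
    print(pkg, "->", package_status(pkg))
```

One caveat: `metadata.version` looks up the distribution name, which can differ from the import name (for example, `bs4` is installed as `beautifulsoup4`), so "unknown" does not always mean a standard library module.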

11
How to read csv in Jupyter Notebook?

import pandas as pd
mpg = pd.read_csv('Path_Route')

Example:
import pandas as pd
mydata = pd.read_csv('C:/Users/Mike/Documents/mpg.csv')
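
A self-contained sketch of the same call is shown below. It assumes pandas is installed, and the sample rows are made up; the file is written to a temporary folder so the example runs on any machine.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data standing in for a real file such as mpg.csv.
sample = "model,mpg,cyl\ncivic,30.5,4\nmustang,18.0,8\n"

with tempfile.TemporaryDirectory() as folder:
    csv_path = Path(folder) / "mpg_sample.csv"
    csv_path.write_text(sample, encoding="utf-8")

    # read_csv takes column names from the first row and guesses each dtype.
    mpg = pd.read_csv(csv_path)

print(mpg.shape)    # (rows, columns)
print(mpg.dtypes)   # inferred type of each column
print(mpg.head())   # first rows, for a quick sanity check
```

On Windows, writing the path with forward slashes (as in the example above) or as a raw string such as `r'C:\Users\...'` avoids problems with backslash escapes.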

12
How to read json in Jupyter Notebook?

import pandas as pd
df = pd.read_json('Path_Route')

Example:
import pandas as pd
df = pd.read_json("FILE_JSON.json")
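
The sketch below shows what `pd.read_json` does with a simple list of records. It assumes pandas is installed; the records are hypothetical and are passed as in-memory text instead of a file.

```python
import io
import json

import pandas as pd

# Hypothetical records; a list of objects maps naturally onto rows and columns.
records = [
    {"name": "Alyssa", "age": 25},
    {"name": "Ahmad", "age": 35},
]
json_text = json.dumps(records)

# Wrapping the text in StringIO lets read_json treat it like an open file.
df = pd.read_json(io.StringIO(json_text))

print(df)
print(df["age"].mean())
```

Deeply nested JSON usually does not fit this shape directly; loading it with the `json` module and flattening it with `pd.json_normalize` is a common alternative.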

13
How to read xml in Jupyter Notebook?

Example:
from bs4 import BeautifulSoup

# Read the XML file into a string
with open('dict.xml', 'r') as f:
    data = f.read()

# Parse the string with the XML parser (requires the lxml package)
Bs_data = BeautifulSoup(data, "xml")

# Find all instances of the tag `unique`
b_unique = Bs_data.find_all('unique')

print(b_unique)
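
If BeautifulSoup is not available, the standard library can do the same lookup. The sketch below swaps in `xml.etree.ElementTree`, which is a different parser from the one used in the slide; the XML content is made up to stand in for `dict.xml`.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML content standing in for dict.xml.
xml_text = """
<dictionary>
    <unique id="1">apple</unique>
    <common>the</common>
    <unique id="2">zebra</unique>
</dictionary>
"""

root = ET.fromstring(xml_text)

# iter() walks the whole tree and yields every element with the given tag.
for element in root.iter("unique"):
    print(element.tag, element.attrib, element.text)

unique_words = [element.text for element in root.iter("unique")]
print(unique_words)
```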

14
What is Avro:
1. Another serialisation format that serialises in compact binary format.
2. Efficient for both storage and transmission
3. Schema based

15
How to read avro file in Jupyter Notebook?

Example:
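import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# Load the schema that describes each record
schema = avro.schema.parse(open("user.avsc").read())

# Write two records to users.avro
writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
writer.append({"name": "Alyssa", "age": 25, "gender": "female"})
writer.append({"name": "Ahmad", "age": 35, "gender": "male"})
writer.close()

# Read the records back from users.avro
reader = DataFileReader(open("users.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()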
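
The example loads its schema from `user.avsc`, which the slides do not show. The sketch below is one plausible way to create such a file: the field names are inferred from the records in the example, while the namespace, record name and field types are assumptions.

```python
import json
import tempfile
from pathlib import Path

# Assumed schema: field names are inferred from the records appended in the
# example; the namespace, record name and types are made up for illustration.
user_schema = {
    "namespace": "example.avro",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "gender", "type": "string"},
    ],
}

# Written to a temporary folder here; in practice the file would sit
# next to the notebook so that open("user.avsc") can find it.
schema_path = Path(tempfile.mkdtemp()) / "user.avsc"
schema_path.write_text(json.dumps(user_schema, indent=2), encoding="utf-8")

print(schema_path.read_text(encoding="utf-8"))
```

An Avro schema is itself a JSON document, which is why the `json` module is enough to produce it.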

16
XML & HTML

What is the difference between XML and HTML?


• A) XML is used to describe data, defines data structure
• B) HTML is used to describe data, defines data structure
• C) XML is used to arrange data, defines organization and display
• D) HTML is used to arrange data, defines organization and
display

17
XML & HTML

What is the difference between XML and HTML?

Answer: A and D. XML is used to describe data and defines data structure; HTML is used to
arrange data and defines organization and display.

18
XML Example

• XML efficiently stores and carries data from place to place.
• While it is generally human readable, XML relies on other applications to display, analyze, or output the data. It only stores and moves it.
• XML is platform-agnostic and can hook into any application that supports it.
• XML tags are user-defined, so it is comparatively simple and easy to write and learn. You don’t need to memorize the tags as in HTML; you make them up yourself.
• It’s an extensible language that can have information written to or removed from it at any time.
• XML is dynamic and can be used to create non-static web pages.
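
The points about user-defined tags and extensibility can be illustrated with a short sketch using the standard library's `xml.etree.ElementTree`. The tag names (`course`, `title`, `topic`) are invented for this example.

```python
import xml.etree.ElementTree as ET

# Tag names below are invented for this example; XML does not predefine them.
course = ET.Element("course", attrib={"code": "IS3107"})
ET.SubElement(course, "title").text = "Tutorial 1"
ET.SubElement(course, "topic").text = "Data Formats"

# Extend the document later by adding a new element...
ET.SubElement(course, "topic").text = "ETL"

# ...or remove information that is no longer needed.
course.remove(course.find("title"))

print(ET.tostring(course, encoding="unicode"))
```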

XML vs. HTML

17


HTML Example

• HTML is the primary, standardized language for web development. It is platform-agnostic and works in all browsers and applications that support it.
• HTML uses a simple markup syntax made of tags and attributes. These tags are predefined.
• HTML is not case-sensitive and will display even with typos and syntax errors.
• It creates static web pages that don’t update or change.
• HTML can integrate with other web languages such as CSS, XML, and back-end languages.

18
XML vs. HTML

Key Difference
• XML is the abbreviation for Extensible Markup Language, whereas HTML stands for HyperText Markup Language.
• XML mainly focuses on the transfer of data, while HTML focuses on the presentation of data.
• XML tags are user-defined and extensible, whereas HTML has a fixed set of predefined tags.
• XML is case-sensitive, while HTML is case-insensitive.
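
The case-sensitivity difference can be checked directly with the standard library. In the sketch below, `TagCollector` and `xml_is_well_formed` are hypothetical helper names; the parsers themselves (`xml.etree.ElementTree` and `html.parser.HTMLParser`) ship with Python.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start tag name the HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def xml_is_well_formed(text):
    """Return True if `text` parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# In XML, <Note> and </note> are different names, so the tags do not match.
print(xml_is_well_formed("<Note>hello</note>"))  # False
print(xml_is_well_formed("<Note>hello</Note>"))  # True

# The HTML parser lowercases tag names and tolerates the mismatch.
collector = TagCollector()
collector.feed("<P>hello</p><BR>")
print(collector.tags)  # ['p', 'br']
```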

Summary
• XML’s primary function is storing and transporting data; it isn’t concerned with displaying the data.
• HTML is the primary language used for coding the front end of a website. While it’s commonly used alongside
and integrates with other languages like CSS, XML, and back-end languages such as Ruby and Python, HTML
is primarily responsible for crafting a website’s layout and basic appearance.

See this tutorial for more information.

19


NLTK

Human language
• needs external linguistic knowledge or data to process
• sometimes even needs a model or training
• must be processed before high-level analysis is possible

NLTK is a leading platform for building Python programs to work with human language data.
It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet,
along with a suite of text processing libraries for classification, tokenization, stemming,
tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.
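
As a small taste of tokenization and stemming, the sketch below assumes NLTK is installed (`pip install nltk`). It deliberately uses `wordpunct_tokenize` and `PorterStemmer`, which work without downloading any corpora; the sentence is made up.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Hypothetical sentence used only for illustration.
sentence = "The students were running several experiments."

# wordpunct_tokenize splits on word characters and punctuation using a
# regular expression, so it needs no extra downloads.
tokens = wordpunct_tokenize(sentence)

# The Porter stemmer strips common English suffixes with fixed rules.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

print(tokens)
print(stems)
```

Other tools such as `word_tokenize` or the WordNet corpus need their data fetched once with `nltk.download(...)` before first use.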

20
