Big Data Training1

This document provides information on various Hadoop ecosystem tools and concepts. It discusses tools for data ingestion (Sqoop, Flume), processing (Hive, Pig, Spark), resource management (YARN), and data storage (HBase, HDFS). It also includes examples of SQL queries and loading data into Hive tables.


select a source of data

a tool to bring in the data

HCatalog to define the schema
Hive to determine sentiment
use a BI tool to analyze the results

HBase has APIs to support random reads and writes

tools such as Splice Machine use these APIs to provide a SQL layer on top of HBase
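
For example, a quick hbase shell sketch of random writes and reads by row key (the 'transactions' table and 'cf' column family are made-up names for illustration):

# create a table with one column family, then write/read individual cells by row key
create 'transactions', 'cf'
put 'transactions', 'row-001', 'cf:amount', '125.50'
put 'transactions', 'row-001', 'cf:type', 'keyed'
get 'transactions', 'row-001'                  # random read of one row
get 'transactions', 'row-001', 'cf:amount'     # random read of one cell
scan 'transactions', {LIMIT => 5}              # a scan, by contrast, walks rows in key order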

MapReduce (MR) is one processing implementation; Spark is another

multiple NameNodes / namespaces

https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/Federation.html

create table salary (
    name varchar(20),
    age int,
    salary double,
    zipcode int);

load data local infile '/home/mapr/labs/data/salary.txt' into table salary fields terminated by ',';

alter table salary add column `id` int(10) unsigned primary key auto_increment;

sqoop import --connect jdbc:mysql://localhost/sqooplab --username root --password AmexImpetus --table salary

sqoop export --connect jdbc:mysql://localhost/sqooplab --table salaries2 --username root -P --export-dir salaryquery --input-fields-terminated-by ","

----------------------------

day 2

yarn

a container is equivalent to a mapper, a reducer, or a task of any other processing framework that is being executed

monitoring of jobs moved from a single point of failure (the MRv1 JobTracker) to multiple per-application ApplicationMasters

why is the ApplicationMaster started on data nodes?

spouts and bolts - processing units for Storm (not Spark)

CPU, memory - in capacity planning, only about 50-60% of a node's resources are available for processing

metadata on the ResourceManager - which jobs are running on which ApplicationMasters
-----------------------------------------------------------------------------

Flume agent

multiplexing - a channel selector that routes events from one source to multiple channels

does Storm have the capacity to scale horizontally? yes - Storm can scale without shutting down a topology - streaming ETL

can I increase the memory of a JVM while it is running?

flume is not used for real time scenarios

if the number of events is increasing unpredictably - use Kafka + Flume

file channels/memory channels

flume can have interceptors to modify events in flight
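
A rough sketch of a single Flume agent configuration (the agent name, paths, and the exec/HDFS source-sink choices are illustrative assumptions, not from the training):

# agent1 = one source -> one channel -> one sink
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# exec source tailing a log file
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app/app.log
agent1.sources.src1.channels = ch1

# interceptor that modifies/annotates events in flight (here: adds a timestamp header)
agent1.sources.src1.interceptors = ts
agent1.sources.src1.interceptors.ts.type = timestamp

# for multiplexing, a channel selector would route events to different channels by a header value:
# agent1.sources.src1.selector.type = multiplexing
# agent1.sources.src1.selector.header = eventType

# durable file channel (a memory channel is faster but loses events on a crash)
agent1.channels.ch1.type = file
agent1.channels.ch1.checkpointDir = /var/flume/checkpoint
agent1.channels.ch1.dataDirs = /var/flume/data

# HDFS sink; %Y-%m-%d works because the timestamp interceptor sets the event timestamp
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.channel = ch1
agent1.sinks.sink1.hdfs.path = /user/mapr/flume/events/%Y-%m-%d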

--------------------------------------------------------------------------------

Pig

datatypes are optional in Pig - the default is bytearray

if you don't specify them, typecasting might be expensive

until Pig execution hits a DUMP (or STORE), it doesn't load the data

in a Pig script, 'PARALLEL' indicates the number of reducers

order by - expensive

replicated joins - read more

is it a good idea to include data within a UDF?

- create a pool and connect the UDF to the pool, because the UDF executes many times

you can call a web service from a UDF

DataFu library - a collection of Pig UDFs

optimize pig scripts - filter early and often
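
A short Pig Latin sketch tying these notes together (paths, field names, and the small zip-to-state lookup are assumptions for illustration): filter early, use a replicated join for the small table, and set PARALLEL on the reduce-side operation.

-- declaring a schema avoids repeated casts from the default bytearray
raw = LOAD '/user/mapr/data/salary.txt' USING PigStorage(',')
      AS (name:chararray, age:int, salary:double, zipcode:int);

-- filter early and often: shrink the data before any join/group
adults = FILTER raw BY age >= 18 AND salary IS NOT NULL;

-- replicated join: every relation after the first is loaded into memory on the mappers
states = LOAD '/user/mapr/data/zip_to_state.txt' USING PigStorage(',')
         AS (zipcode:int, state:chararray);
joined = JOIN adults BY zipcode, states BY zipcode USING 'replicated';

-- PARALLEL sets the number of reducers for this operation
by_state = GROUP joined BY states::state PARALLEL 10;
avg_sal = FOREACH by_state GENERATE group AS state, AVG(joined.adults::salary) AS avg_salary;

DUMP avg_sal;   -- nothing is loaded or executed until DUMP (or STORE)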

sample

types of loaders and storage

if you want to write into Couchbase, you can write custom loaders/storers


http://chimera.labs.oreilly.com/books/1234000001811/ch06.html#more_joins

Pig - compression
HCatLoader - allows filtering to happen while loading

http://chimera.labs.oreilly.com/books/1234000001811/ch07.html#explain

---------------------------------------------------------------------------

Hive

hive partitions

when you load data using LOAD DATA, the data will not be parsed and distributed based on the partition key

but when loading with a dynamic-partition insert query, the data will be partitioned
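
A hedged HiveQL sketch of the difference (the 'names' table partitioned by state is defined later in these notes; 'names_staging' is an assumed unpartitioned staging table):

-- static load: the file is moved as-is into the partition you name, nothing is parsed
load data local inpath '/home/mapr/labs/data/hivedata_ca.txt'
into table names partition (state = 'CA');

-- dynamic partitioning: rows are read and routed by the value of the state column
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;

insert into table names partition (state)
select id, name, state from names_staging;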

map side joins

bucketing - further dividing partitions; useful in sampling
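
A minimal HiveQL sketch of bucketing and sampling (table name and bucket count are illustrative):

-- each partition is further divided into 8 buckets by custid
create table txn_bucketed (custid string, amount double)
partitioned by (state string)
clustered by (custid) into 8 buckets;

-- older Hive releases need this so inserts actually honor the bucket definition
set hive.enforce.bucketing = true;

-- sampling: read roughly one of the 8 buckets instead of scanning everything
select * from txn_bucketed tablesample (bucket 1 out of 8 on custid);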

skewed table

you can read from and write to Hive from a Pig script

normally Pig is used to structure data for Hive

denormalization is the key to best results

compression - best results when compressing columns (data of the same type)
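
For instance (a sketch; the choice of ORC with ZLIB is my assumption of what the note refers to), a columnar format compresses each column's same-typed data together:

create table incomedata_orc (gender string, age int, salary double, zip int)
stored as ORC
tblproperties ("orc.compress" = "ZLIB");

insert into table incomedata_orc select * from incomedata;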

-----

describe formatted <tablename>;

dfs -ls <location>

copy the above table to a different db

create table testcopy like test;

create table names (id int, name string) partitioned by (state string) row format
delimited fields terminated by '\t';

load data local inpath '/home/mapr/labs/data/hivedata_ca.txt' into table names partition (state = 'CA');

create external table incomedata ( gender string, age int, salary double, zip int )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/mapr/incomedata/';

CREATE EXTERNAL TABLE user (
    custid string,
    name string,
    age int,
    gender string,
    cm15 string,
    zipcode string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/mapr/data/riskdata/';   -- LOCATION must be a directory (here, the one containing Users.csv), not the file itself

Select user.*, transaction.* from user JOIN transaction ON (user.cm15 = transaction.cm15)
where transaction.amount > 100 and transaction.type = 'keyed';

-------------------------

create indexes in hive

deferred rebuild - implies that you initiate a command to build the index later on

recommendation: try partitioning/skewed tables before indexes
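
A minimal HiveQL sketch of a deferred-rebuild index (the index name is illustrative; the incomedata table appears earlier in these notes):

-- the index is defined now but not built yet
create index salary_zip_idx on table incomedata (zip)
as 'COMPACT' with deferred rebuild;

-- later, build (or refresh) the index data in a separate step
alter index salary_zip_idx on incomedata rebuild;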

------------------------------

Stinger

insert/update/delete (ACID) - Hive 0.14 - 1.0
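
A hedged sketch of the Hive 0.14+ ACID path (table name and bucket count are illustrative; transactional tables require ORC, bucketing, and the transaction manager to be enabled):

-- client-side settings needed for ACID, e.g.:
-- set hive.support.concurrency = true;
-- set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

create table salaries_acid (id int, name string, salary double)
clustered by (id) into 4 buckets
stored as ORC
tblproperties ("transactional" = "true");

insert into table salaries_acid values (1, 'alice', 90000.0);
update salaries_acid set salary = 95000.0 where id = 1;
delete from salaries_acid where id = 1;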

------
