Big Data Training1

This document provides information on various Hadoop ecosystem tools and concepts. It discusses tools for data ingestion (Sqoop, Flume), processing (Hive, Pig, Spark), resource management (YARN), and data storage (HBase, HDFS). It also includes examples of SQL queries and loading data into Hive tables.


select a source of data

a tool to bring in the data

HCatalog to define the schema
Hive to determine sentiment
use a BI tool to analyze the results

HBase has APIs to support random reads and writes

tools such as Splice Machine use these APIs to provide a SQL layer on top of HBase
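
For example, a quick hbase shell sketch of random writes and reads by row key (the 'transactions' table and 'cf' column family are made-up names for illustration):

# create a table with one column family, then write/read individual cells by row key
create 'transactions', 'cf'
put 'transactions', 'row-001', 'cf:amount', '125.50'
put 'transactions', 'row-001', 'cf:type', 'keyed'
get 'transactions', 'row-001'                  # random read of one row
get 'transactions', 'row-001', 'cf:amount'     # random read of one cell
scan 'transactions', {LIMIT => 5}              # a scan, by contrast, walks rows in key order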

MapReduce (MR) is one processing implementation; Spark is another

multiple NameNodes / namespaces

https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/Federation.html

create table salary (
    name varchar(20),
    age int,
    salary double,
    zipcode int);

load data local infile '/home/mapr/labs/data/salary.txt' into table salary fields terminated by ',';

alter table salary add column `id` int(10) unsigned primary key auto_increment;

sqoop import --connect jdbc:mysql://localhost/sqooplab --username root --password AmexImpetus --table salary

sqoop export --connect jdbc:mysql://localhost/sqooplab --table salaries2 --username root -P --export-dir salaryquery --input-fields-terminated-by ","

----------------------------

day 2

yarn

a container is equivalent to a mapper, a reducer, or a task of any other processing framework that is being executed

monitoring of jobs moved from a single point of failure (the MRv1 JobTracker) to multiple per-application ApplicationMasters

why is the ApplicationMaster started on data nodes?

spouts and bolts - processing units for Storm (not Spark)

CPU, memory - in capacity planning, only about 50-60% of a node's resources are available for processing

metadata on the ResourceManager - which jobs are running on which ApplicationMasters
-----------------------------------------------------------------------------

Flume agent

multiplexing - a channel selector that routes events from one source to multiple channels

does Storm have the capacity to scale horizontally? yes - Storm can scale without shutting down a topology - streaming ETL

can I increase the memory of a JVM while it is running?

flume is not used for real time scenarios

if the number of events is increasing unpredictably - use Kafka + Flume

file channels/memory channels

flume can have interceptors to modify events in flight
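
A rough sketch of a single Flume agent configuration (the agent name, paths, and the exec/HDFS source-sink choices are illustrative assumptions, not from the training):

# agent1 = one source -> one channel -> one sink
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# exec source tailing a log file
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app/app.log
agent1.sources.src1.channels = ch1

# interceptor that modifies/annotates events in flight (here: adds a timestamp header)
agent1.sources.src1.interceptors = ts
agent1.sources.src1.interceptors.ts.type = timestamp

# for multiplexing, a channel selector would route events to different channels by a header value:
# agent1.sources.src1.selector.type = multiplexing
# agent1.sources.src1.selector.header = eventType

# durable file channel (a memory channel is faster but loses events on a crash)
agent1.channels.ch1.type = file
agent1.channels.ch1.checkpointDir = /var/flume/checkpoint
agent1.channels.ch1.dataDirs = /var/flume/data

# HDFS sink; %Y-%m-%d works because the timestamp interceptor sets the event timestamp
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.channel = ch1
agent1.sinks.sink1.hdfs.path = /user/mapr/flume/events/%Y-%m-%d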

--------------------------------------------------------------------------------

Pig

datatypes are optional in Pig - the default is bytearray

if you don't specify them, typecasting might be expensive

until Pig execution hits a DUMP (or STORE), it doesn't load the data

in a Pig script, 'PARALLEL' indicates the number of reducers

order by - expensive

replicated joins - read more

is it a good idea to include data within a UDF?

- create a pool and connect the UDF to the pool, because the UDF executes many times

you can call a web service from a UDF

DataFu library - a collection of Pig UDFs

optimize pig scripts - filter early and often
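
A short Pig Latin sketch tying these notes together (paths, field names, and the small zip-to-state lookup are assumptions for illustration): filter early, use a replicated join for the small table, and set PARALLEL on the reduce-side operation.

-- declaring a schema avoids repeated casts from the default bytearray
raw = LOAD '/user/mapr/data/salary.txt' USING PigStorage(',')
      AS (name:chararray, age:int, salary:double, zipcode:int);

-- filter early and often: shrink the data before any join/group
adults = FILTER raw BY age >= 18 AND salary IS NOT NULL;

-- replicated join: every relation after the first is loaded into memory on the mappers
states = LOAD '/user/mapr/data/zip_to_state.txt' USING PigStorage(',')
         AS (zipcode:int, state:chararray);
joined = JOIN adults BY zipcode, states BY zipcode USING 'replicated';

-- PARALLEL sets the number of reducers for this operation
by_state = GROUP joined BY states::state PARALLEL 10;
avg_sal = FOREACH by_state GENERATE group AS state, AVG(joined.adults::salary) AS avg_salary;

DUMP avg_sal;   -- nothing is loaded or executed until DUMP (or STORE)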

sample

types of loaders and storage

if you want to write into Couchbase, you can write custom loaders/storers


http://chimera.labs.oreilly.com/books/1234000001811/ch06.html#more_joins

Pig - compression
HCatLoader - allows filtering to happen while loading

http://chimera.labs.oreilly.com/books/1234000001811/ch07.html#explain

---------------------------------------------------------------------------

Hive

hive partitions

when you load data using LOAD DATA, the data will not be parsed and distributed based on the partition key

but when loading with a dynamic-partition insert query, the data will be partitioned
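
A hedged HiveQL sketch of the difference (the 'names' table partitioned by state is defined later in these notes; 'names_staging' is an assumed unpartitioned staging table):

-- static load: the file is moved as-is into the partition you name, nothing is parsed
load data local inpath '/home/mapr/labs/data/hivedata_ca.txt'
into table names partition (state = 'CA');

-- dynamic partitioning: rows are read and routed by the value of the state column
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;

insert into table names partition (state)
select id, name, state from names_staging;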

map side joins

bucketing - further dividing partitions; useful in sampling
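
A minimal HiveQL sketch of bucketing and sampling (table name and bucket count are illustrative):

-- each partition is further divided into 8 buckets by custid
create table txn_bucketed (custid string, amount double)
partitioned by (state string)
clustered by (custid) into 8 buckets;

-- older Hive releases need this so inserts actually honor the bucket definition
set hive.enforce.bucketing = true;

-- sampling: read roughly one of the 8 buckets instead of scanning everything
select * from txn_bucketed tablesample (bucket 1 out of 8 on custid);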

skewed table

you can read from and write to Hive from a Pig script

normally Pig is used to structure data for Hive

denormalization is the key to best results

compression - best results when compressing columns (data of the same type)
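
For instance (a sketch; the choice of ORC with ZLIB is my assumption of what the note refers to), a columnar format compresses each column's same-typed data together:

create table incomedata_orc (gender string, age int, salary double, zip int)
stored as ORC
tblproperties ("orc.compress" = "ZLIB");

insert into table incomedata_orc select * from incomedata;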

-----

describe formatted <tablename>;

dfs -ls <location>

copy the above table to a different db

create table testcopy like test;

create table names (id int, name string) partitioned by (state string) row format
delimited fields terminated by '\t';

load data local inpath '/home/mapr/labs/data/hivedata_ca.txt' into table names partition (state = 'CA');

create external table incomedata ( gender string, age int, salary double, zip int )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/mapr/incomedata/';

CREATE EXTERNAL TABLE user (
    custid string,
    name string,
    age int,
    gender string,
    cm15 string,
    zipcode string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/mapr/data/riskdata/';   -- LOCATION must be a directory (here, the one containing Users.csv), not the file itself

Select user.*, transaction.* from user JOIN transaction ON (user.cm15 = transaction.cm15)
where transaction.amount > 100 and transaction.type = 'keyed';

-------------------------

create indexes in hive

deferred rebuild - implies that you initiate a command to build the index later on

recommendation: try partitioning/skewed tables before indexes
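
A minimal HiveQL sketch of a deferred-rebuild index (the index name is illustrative; the incomedata table appears earlier in these notes):

-- the index is defined now but not built yet
create index salary_zip_idx on table incomedata (zip)
as 'COMPACT' with deferred rebuild;

-- later, build (or refresh) the index data in a separate step
alter index salary_zip_idx on incomedata rebuild;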

------------------------------

Stinger

insert/update/delete (ACID) - Hive 0.14 - 1.0
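
A hedged sketch of the Hive 0.14+ ACID path (table name and bucket count are illustrative; transactional tables require ORC, bucketing, and the transaction manager to be enabled):

-- client-side settings needed for ACID, e.g.:
-- set hive.support.concurrency = true;
-- set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

create table salaries_acid (id int, name string, salary double)
clustered by (id) into 4 buckets
stored as ORC
tblproperties ("transactional" = "true");

insert into table salaries_acid values (1, 'alice', 90000.0);
update salaries_acid set salary = 95000.0 where id = 1;
delete from salaries_acid where id = 1;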

------
