DECISION ANALYTICS / BIG DATA

DISTRIBUTED FILE SYSTEMS

* The main purpose of a Distributed File System (DFS) is to allow users of physically distributed systems to share their data and resources exactly as they do local ones.
* The performance and reliability of such access should be comparable to that for files stored locally.
* Recent advances in higher-bandwidth switched local networks and in disk organization have led to high-performance, highly scalable file systems.

FEATURES OF A DISTRIBUTED FILE SYSTEM

* Transparency
* Concurrent Updates
* Replication
* Fault Tolerance
* Consistency
* Platform Independence
* Security
* Efficiency

TRANSPARENCY

* The illusion that all files are similar. Includes:
  - Access transparency: a single set of operations; clients that work on local files can work with remote files.
  - Location transparency: clients see a uniform name space; files can be relocated without changing their path names.
  - Mobility transparency: files can be moved without modifying programs or changing system tables.
  - Performance transparency: within limits, local and remote file access meet performance standards.
  - Scaling transparency: increased loads do not degrade performance significantly; capacity can be expanded.

CONCURRENT UPDATES

* Changes to a file from one client should not interfere with changes from other clients, even if the changes happen at the same time.
* Solutions often include file- or record-level locking (see the sketch below).
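To make the locking idea concrete, here is a minimal single-machine sketch using Java's standard java.nio FileLock API. The file name shared.dat is a placeholder, and a real DFS would coordinate locks through a lock server or lease protocol rather than through the local OS as shown here.

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LockedUpdate {
    public static void main(String[] args) throws Exception {
        Path file = Path.of("shared.dat"); // placeholder shared file
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            // Block until an exclusive lock on the whole file is granted,
            // so concurrent writers cannot interleave their updates.
            try (FileLock lock = channel.lock()) {
                channel.write(ByteBuffer.wrap("update\n".getBytes()));
            } // lock is released when this try-with-resources block exits
        }
    }
}
```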
REPLICATION

* A file may have several copies of its data at different locations, often for performance reasons.
* Requires updating the other copies when one copy is changed.
* Simple solution: change the master copy and periodically refresh the other copies.
* More complicated solution: multiple copies can be updated independently at the same time; this needs finer-grained refresh and/or merge.

FAULT TOLERANCE

* Continue to function when clients or servers fail.
* Detect, report, and correct faults that occur.
* Solutions often include:
  - Redundant copies of data, redundant hardware, backups, transaction logs, and other measures.
  - Stateless servers.
  - Idempotent operations.

CONSISTENCY

* Data must always be complete, current, and correct.
* A file seen by one process looks the same to all processes accessing it.
* Consistency is a special concern whenever data is duplicated.
* Solutions often include timestamps and ownership information.

PLATFORM INDEPENDENCE

* Allow access even though the hardware and OS are completely different in design, architecture, and functioning, and come from different vendors.
* Solutions often include a flexible communication protocol between clients and servers.

EFFICIENCY

* Overall, we want the same power and generality as local file systems.
* In the early days, the goal was to share an "expensive" resource: the disk.
* Now, the goal is convenient access to remotely stored files.

SECURITY

* File systems must be protected against unauthorized access, data corruption, loss, and other threats.
* Solutions include:
  - Access control mechanisms (ownership, permissions).
  - Encryption of commands or data to prevent "sniffing".

HADOOP

* Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
* Open-source data management with scale-out storage and distributed processing.

HISTORY

* Development started as a sub-project of Apache Lucene.
* Google published the GFS and MapReduce papers.
* Doug Cutting added DFS and MapReduce support and started at Yahoo!.
* Hadoop ran on a 1000-node cluster and became a top-level Apache project.
* Hadoop defeated supercomputers in the terabyte-sort benchmark.
* Facebook contributed SQL support for Hadoop.
* Apache released the first stable version, 1.0.
* Doug Cutting joined Cloudera.
* Hadoop 2.0, which contains YARN, was released.
* Hadoop 3.0 followed; Hadoop 3.3.4 is the latest version.

WHO USES HADOOP?

* Amazon
* Facebook
* Google
* New York Times
* Veoh
* Yahoo!
* ... many more

HADOOP COMPONENTS

Hadoop is a system for large-scale data processing. It has two main components:

* HDFS, the Hadoop Distributed File System (storage):
  - Distributed across "nodes".
  - Natively redundant; self-healing, high-bandwidth, clustered storage.
  - The NameNode tracks block locations.
* MapReduce (processing):
  - Splits a task across processors, "near" the data, and assembles the results.
  - The JobTracker manages the TaskTrackers.

NAMENODE AND DATANODES

* NameNode:
  - The master of the system.
  - Maintains and manages the blocks which are present on the DataNodes.
* DataNodes:
  - Slaves which are deployed on each machine and provide the actual storage.
  - Responsible for serving read and write requests for the clients.

SECONDARY NAMENODE

* Not a hot standby for the NameNode.
* Connects to the NameNode every hour (configurable).
* Performs housekeeping and backs up the NameNode metadata.
* The saved metadata can be used to rebuild a failed NameNode.

JOBTRACKER AND TASKTRACKER

* JobTracker:
  - Determines the execution plan for the job.
  - Assigns individual tasks.
* TaskTracker:
  - Keeps track of the performance of an individual mapper or reducer.

HADOOP DISTRIBUTED FILE SYSTEM OVERVIEW

* Responsible for storing data on the cluster.
* Data files are split into blocks and distributed across the nodes in the cluster.
* Each block is replicated multiple times.

HDFS BASIC CONCEPTS

* The Hadoop Distributed File System (HDFS) is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System.
* A large data file is distributed into blocks.
* Blocks are managed by different nodes in the cluster.
* Each block is replicated on multiple nodes.

HOW ARE FILES STORED?

* Files are split into blocks.
* Blocks are spread across many machines at load time.
* Different blocks from the same file will be stored on different machines.
* Blocks are replicated across multiple machines.
* The NameNode keeps track of which blocks make up a file and where they are stored.

DATA REPLICATION

* Default replication is 3-fold.

[Figure: HDFS data distribution, showing the blocks of an input file spread across the cluster nodes.]

DATA RETRIEVAL

* When a client wants to retrieve data, it:
  - Communicates with the NameNode to determine which blocks make up a file and on which DataNodes those blocks are stored.
  - Then communicates directly with the DataNodes to read the data (see the sketch below).
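This read path is exactly what Hadoop's Java client API performs under the hood. A minimal sketch, in which the NameNode address (hdfs://namenode:8020) and the file path (/data/example.txt) are placeholders for a real cluster:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder NameNode address
        FileSystem fs = FileSystem.get(conf);
        // open() consults the NameNode for the file's block locations;
        // the returned stream then fetches each block directly from DataNodes.
        try (FSDataInputStream in = fs.open(new Path("/data/example.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```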
FUNCTIONS OF A NAMENODE

* Manages the file system namespace.
* Maps a file name to a set of blocks.
* Maps a block to the DataNodes where it resides.
* Cluster configuration management.
* Replication engine for blocks.

NAMENODE METADATA

* Types of metadata:
  - List of files.
  - List of blocks for each file.
  - List of DataNodes for each block.
  - File attributes, e.g. creation time, replication factor.
* A transaction log:
  - Records file creations, file deletions, etc.

DATANODE

* A block server:
  - Stores data in the local file system (e.g. ext3).
  - Stores metadata of a block (e.g. CRC).
  - Serves data and metadata to clients.
* Block report:
  - Periodically sends a report of all existing blocks to the NameNode.
* Facilitates pipelining of data:
  - Forwards data to other specified DataNodes.

BLOCK PLACEMENT

* Current strategy:
  - One replica on the local node.
  - Second replica on a remote rack.
  - Third replica on the same remote rack.
  - Additional replicas are placed randomly.
* Clients read from the nearest replicas.

HEARTBEATS

* DataNodes send a heartbeat to the NameNode once every 3 seconds.
* The NameNode uses heartbeats to detect DataNode failure.

REPLICATION ENGINE

* The NameNode detects DataNode failures.
* Chooses new DataNodes for new replicas.
* Balances disk usage.
* Balances communication traffic to the DataNodes.

NAMENODE FAILURE

* The NameNode is a single point of failure.
* The transaction log is stored in multiple directories:
  - A directory on the local file system.
  - A directory on a remote file system (NFS/CIFS).

DATA PIPELINING

* The client retrieves a list of DataNodes on which to place the replicas of a block.
* The client writes the block to the first DataNode.
* The first DataNode forwards the data to the next node in the pipeline.
* When all replicas are written, the client moves on to write the next block of the file.

REBALANCER

* Goal: the percentage of disk in use should be similar across DataNodes.
* Usually run when new DataNodes are added.
* The cluster stays online while the Rebalancer is active.
* The Rebalancer is throttled to avoid network congestion.
* It is a command-line tool.

SECONDARY NAMENODE

* Copies the FsImage and transaction log from the NameNode to a temporary directory.
* Merges the FsImage and the transaction log into a new FsImage in the temporary directory.
* Uploads the new FsImage to the NameNode; the transaction log on the NameNode is then purged.

MAPREDUCE

Distributing computation across nodes.

WHY MAPREDUCE?

* Before MapReduce: concurrent systems, grid computing, or rolling your own solution.
* Considerations:
  - Threading is hard!
  - How do you scale to more machines?
  - How do you handle machine failures?
  - How do you facilitate communication between nodes?
  - Does your solution scale?
* Scale out, not up!

THE MAPREDUCE PARADIGM

* A platform for reliable and scalable computing.
* Runs over distributed file systems:
  - Google File System
  - Hadoop Distributed File System (HDFS)

MAPREDUCE PROGRAMMING MODEL

* Inspired by the map and reduce operations commonly used in functional programming languages like Lisp.
* Input: a set of key/value pairs.
* The user supplies two functions:
  - map(k, v) → list(k1, v1)
  - reduce(k1, list(v1)) → v2
* (k1, v1) is an intermediate key/value pair.
* The output is the set of (k1, v2) pairs.
* MapReduce is a programming model used to process large data sets in a batch-processing manner.
* It is a method for distributing computation across multiple nodes.
* A MapReduce program comprises:
  - a Map() procedure that performs filtering and sorting (such as sorting students by last name into queues, one queue for each name), and
  - a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies), as in the sketch below.
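The student example can be written out as plain, framework-free Java to show the shape of the model. The class and data here are illustrative only, and the grouping loop in main() stands in for the shuffle-and-sort phase that a real MapReduce engine performs between the two functions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class StudentQueues {
    // map: emit an intermediate pair (lastName, 1) for each student record
    static List<Map.Entry<String, Integer>> map(List<String> students) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String lastName : students) {
            pairs.add(Map.entry(lastName, 1));
        }
        return pairs;
    }

    // reduce: summarize one queue, i.e. count the students sharing a last name
    static int reduce(String lastName, List<Integer> ones) {
        return ones.size();
    }

    public static void main(String[] args) {
        List<String> students = List.of("Smith", "Jones", "Smith", "Lee", "Jones", "Smith");
        // stand-in for shuffle-and-sort: group intermediate values by key
        Map<String, List<Integer>> queues = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : map(students)) {
            queues.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                  .add(pair.getValue());
        }
        // prints the name frequencies, e.g. "Smith 3"
        queues.forEach((name, ones) -> System.out.println(name + " " + reduce(name, ones)));
    }
}
```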
THE MAPPER

* Each block is processed in isolation by a map task called the mapper.
* The map task runs on the node where the block is stored.

[Figure: the mapper turns input key-value pairs, e.g. (doc-id, doc-content), into intermediate key-value pairs, e.g. (word, wordcount-in-a-doc). Adapted from Jeff Ullman's course slides.]

SHUFFLE AND SORT

* The output from the mapper is sorted by key.
* All values with the same key are guaranteed to go to the same machine.

THE REDUCER

* Consolidates the results from the different mappers.
* Produces the final output.

[Figure: intermediate key-value pairs, e.g. (word, wordcount-in-a-doc), are grouped into key-value groups, e.g. (word, list-of-wordcount), analogous to a SQL GROUP BY, and reduced to final key-value pairs, e.g. (word, final-count), analogous to SQL aggregation. Adapted from Jeff Ullman's course slides.]

JOB TRACKER

* The JobTracker is the component to which client applications submit MapReduce programs (jobs).
* The JobTracker schedules client jobs and allocates tasks to the slave TaskTrackers that run on the individual worker machines (DataNodes).
* The JobTracker manages the overall execution of a MapReduce job.
* The JobTracker manages the resources of the cluster:
  - Manages the DataNodes, i.e. the TaskTrackers.
  - Keeps track of consumed and available resources.
  - Keeps track of already running tasks and provides fault tolerance for tasks, etc.

TASK TRACKER

* Each TaskTracker is responsible for executing and managing the individual tasks assigned by the JobTracker.
* The TaskTracker also handles the data motion between the map and reduce phases.
* One prime responsibility of the TaskTracker is to constantly communicate the status of its tasks to the JobTracker.
* If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it assumes the TaskTracker has crashed and resubmits the corresponding tasks to other nodes in the cluster.

HOW THE MAPREDUCE ENGINE WORKS

* Client applications submit jobs to the JobTracker.
* The JobTracker talks to the NameNode to determine the location of the data.
* The JobTracker locates TaskTracker nodes with available slots at or near the data.
* The JobTracker submits the work to the chosen TaskTracker nodes.
* The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
* A TaskTracker notifies the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
* When the work is completed, the JobTracker updates its status.

WORD COUNT EXAMPLE

[Figure: the overall MapReduce word count process. Input, Splitting, Mapping (K1,V1 to List(K2,V2)), Shuffling (K2, List(V2)), Reducing (List(K3,V3)), Final Result. For example, two occurrences of "Bear" are shuffled into (Bear, (1,1)) and reduced to a single count.]

The canonical Hadoop implementation of word count is sketched below.
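This version follows the standard Apache Hadoop tutorial example: the mapper emits (word, 1) for every token, the shuffle groups the ones by word, and the reducer sums them. Input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one); // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // emit (word, final-count)
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A job like this is packaged into a jar and submitted with, for example, hadoop jar wordcount.jar WordCount /input /output, where /input and /output are HDFS paths.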
