DECISION ANALYTICS / BIG DATA

DISTRIBUTED FILE SYSTEMS

* The main purpose of a Distributed File System (DFS) is to allow users of physically distributed systems to share their data and resources exactly as they do local ones.
* The performance and reliability of such access should be comparable to that for files stored locally.
* Recent advances in higher-bandwidth switched local networks and in disk organization have led to high-performance, highly scalable file systems.

FEATURES OF A DISTRIBUTED FILE SYSTEM

* Transparency
* Concurrent Updates
* Replication
* Fault Tolerance
* Consistency
* Platform Independence
* Security
* Efficiency

TRANSPARENCY

* The illusion that all files are similar. Includes:
  - Access transparency: a single set of operations; clients that work on local files can work with remote files.
  - Location transparency: clients see a uniform name space; files can be relocated without changing their path names.
  - Mobility transparency: files can be moved without modifying programs or changing system tables.
  - Performance transparency: within limits, local and remote file access meet performance standards.
  - Scaling transparency: increased loads do not degrade performance significantly; capacity can be expanded.

CONCURRENT UPDATES

* Changes to a file from one client should not interfere with changes from other clients, even if the changes happen at the same time.
* Solutions often include file- or record-level locking (see the sketch below).
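To make the locking idea concrete, here is a minimal single-machine sketch using Java's standard java.nio FileLock API. The file name shared.dat is a placeholder, and a real DFS would coordinate locks through a lock server or lease protocol rather than through the local OS as shown here.

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LockedUpdate {
    public static void main(String[] args) throws Exception {
        Path file = Path.of("shared.dat"); // placeholder shared file
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            // Block until an exclusive lock on the whole file is granted,
            // so concurrent writers cannot interleave their updates.
            try (FileLock lock = channel.lock()) {
                channel.write(ByteBuffer.wrap("update\n".getBytes()));
            } // lock is released when this try-with-resources block exits
        }
    }
}
```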
REPLICATION

* A file may have several copies of its data at different locations, often for performance reasons.
* Requires updating the other copies when one copy is changed.
* Simple solution: change the master copy and periodically refresh the other copies.
* More complicated solution: multiple copies can be updated independently at the same time; this needs finer-grained refresh and/or merge.

FAULT TOLERANCE

* Continue to function when clients or servers fail.
* Detect, report, and correct faults that occur.
* Solutions often include:
  - Redundant copies of data, redundant hardware, backups, transaction logs, and other measures.
  - Stateless servers.
  - Idempotent operations.

CONSISTENCY

* Data must always be complete, current, and correct.
* A file seen by one process looks the same to all processes accessing it.
* Consistency is a special concern whenever data is duplicated.
* Solutions often include timestamps and ownership information.

PLATFORM INDEPENDENCE

* Allow access even though the hardware and OS are completely different in design, architecture, and functioning, and come from different vendors.
* Solutions often include a flexible communication protocol between clients and servers.

EFFICIENCY

* Overall, we want the same power and generality as local file systems.
* In the early days, the goal was to share an "expensive" resource: the disk.
* Now, the goal is convenient access to remotely stored files.

SECURITY

* File systems must be protected against unauthorized access, data corruption, loss, and other threats.
* Solutions include:
  - Access control mechanisms (ownership, permissions).
  - Encryption of commands or data to prevent "sniffing".

HADOOP

* Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
* Open-source data management with scale-out storage and distributed processing.

HISTORY

* Development started as a sub-project of Apache Lucene.
* Google published the GFS and MapReduce papers.
* Doug Cutting added DFS and MapReduce support and started at Yahoo!.
* Hadoop ran on a 1000-node cluster and became a top-level Apache project.
* Hadoop defeated supercomputers in the terabyte-sort benchmark.
* Facebook contributed SQL support for Hadoop.
* Apache released the first stable version, 1.0.
* Doug Cutting joined Cloudera.
* Hadoop 2.0, which contains YARN, was released.
* Hadoop 3.0 followed; Hadoop 3.3.4 is the latest version.

WHO USES HADOOP?

* Amazon
* Facebook
* Google
* New York Times
* Veoh
* Yahoo!
* ... many more

HADOOP COMPONENTS

Hadoop is a system for large-scale data processing. It has two main components:

* HDFS, the Hadoop Distributed File System (storage):
  - Distributed across "nodes".
  - Natively redundant; self-healing, high-bandwidth, clustered storage.
  - The NameNode tracks block locations.
* MapReduce (processing):
  - Splits a task across processors, "near" the data, and assembles the results.
  - The JobTracker manages the TaskTrackers.

NAMENODE AND DATANODES

* NameNode:
  - The master of the system.
  - Maintains and manages the blocks which are present on the DataNodes.
* DataNodes:
  - Slaves which are deployed on each machine and provide the actual storage.
  - Responsible for serving read and write requests for the clients.

SECONDARY NAMENODE

* Not a hot standby for the NameNode.
* Connects to the NameNode every hour (configurable).
* Performs housekeeping and backs up the NameNode metadata.
* The saved metadata can be used to rebuild a failed NameNode.

JOBTRACKER AND TASKTRACKER

* JobTracker:
  - Determines the execution plan for the job.
  - Assigns individual tasks.
* TaskTracker:
  - Keeps track of the performance of an individual mapper or reducer.

HADOOP DISTRIBUTED FILE SYSTEM OVERVIEW

* Responsible for storing data on the cluster.
* Data files are split into blocks and distributed across the nodes in the cluster.
* Each block is replicated multiple times.

HDFS BASIC CONCEPTS

* The Hadoop Distributed File System (HDFS) is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System.
* A large data file is distributed into blocks.
* Blocks are managed by different nodes in the cluster.
* Each block is replicated on multiple nodes.

HOW ARE FILES STORED?

* Files are split into blocks.
* Blocks are spread across many machines at load time.
* Different blocks from the same file will be stored on different machines.
* Blocks are replicated across multiple machines.
* The NameNode keeps track of which blocks make up a file and where they are stored.

DATA REPLICATION

* Default replication is 3-fold.

[Figure: HDFS data distribution, showing the blocks of an input file spread across the cluster nodes.]

DATA RETRIEVAL

* When a client wants to retrieve data, it:
  - Communicates with the NameNode to determine which blocks make up a file and on which DataNodes those blocks are stored.
  - Then communicates directly with the DataNodes to read the data (see the sketch below).
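This read path is exactly what Hadoop's Java client API performs under the hood. A minimal sketch, in which the NameNode address (hdfs://namenode:8020) and the file path (/data/example.txt) are placeholders for a real cluster:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder NameNode address
        FileSystem fs = FileSystem.get(conf);
        // open() consults the NameNode for the file's block locations;
        // the returned stream then fetches each block directly from DataNodes.
        try (FSDataInputStream in = fs.open(new Path("/data/example.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```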
FUNCTIONS OF A NAMENODE

* Manages the file system namespace.
* Maps a file name to a set of blocks.
* Maps a block to the DataNodes where it resides.
* Cluster configuration management.
* Replication engine for blocks.

NAMENODE METADATA

* Types of metadata:
  - List of files.
  - List of blocks for each file.
  - List of DataNodes for each block.
  - File attributes, e.g. creation time, replication factor.
* A transaction log:
  - Records file creations, file deletions, etc.

DATANODE

* A block server:
  - Stores data in the local file system (e.g. ext3).
  - Stores metadata of a block (e.g. CRC).
  - Serves data and metadata to clients.
* Block report:
  - Periodically sends a report of all existing blocks to the NameNode.
* Facilitates pipelining of data:
  - Forwards data to other specified DataNodes.

BLOCK PLACEMENT

* Current strategy:
  - One replica on the local node.
  - Second replica on a remote rack.
  - Third replica on the same remote rack.
  - Additional replicas are placed randomly.
* Clients read from the nearest replicas.

HEARTBEATS

* DataNodes send a heartbeat to the NameNode once every 3 seconds.
* The NameNode uses heartbeats to detect DataNode failure.

REPLICATION ENGINE

* The NameNode detects DataNode failures.
* Chooses new DataNodes for new replicas.
* Balances disk usage.
* Balances communication traffic to the DataNodes.

NAMENODE FAILURE

* The NameNode is a single point of failure.
* The transaction log is stored in multiple directories:
  - A directory on the local file system.
  - A directory on a remote file system (NFS/CIFS).

DATA PIPELINING

* The client retrieves a list of DataNodes on which to place the replicas of a block.
* The client writes the block to the first DataNode.
* The first DataNode forwards the data to the next node in the pipeline.
* When all replicas are written, the client moves on to write the next block of the file.

REBALANCER

* Goal: the percentage of disk in use should be similar across DataNodes.
* Usually run when new DataNodes are added.
* The cluster stays online while the Rebalancer is active.
* The Rebalancer is throttled to avoid network congestion.
* It is a command-line tool.

SECONDARY NAMENODE

* Copies the FsImage and transaction log from the NameNode to a temporary directory.
* Merges the FsImage and the transaction log into a new FsImage in the temporary directory.
* Uploads the new FsImage to the NameNode; the transaction log on the NameNode is then purged.

MAPREDUCE

Distributing computation across nodes.

WHY MAPREDUCE?

* Before MapReduce: concurrent systems, grid computing, or rolling your own solution.
* Considerations:
  - Threading is hard!
  - How do you scale to more machines?
  - How do you handle machine failures?
  - How do you facilitate communication between nodes?
  - Does your solution scale?
* Scale out, not up!

THE MAPREDUCE PARADIGM

* A platform for reliable and scalable computing.
* Runs over distributed file systems:
  - Google File System
  - Hadoop Distributed File System (HDFS)

MAPREDUCE PROGRAMMING MODEL

* Inspired by the map and reduce operations commonly used in functional programming languages like Lisp.
* Input: a set of key/value pairs.
* The user supplies two functions:
  - map(k, v) → list(k1, v1)
  - reduce(k1, list(v1)) → v2
* (k1, v1) is an intermediate key/value pair.
* The output is the set of (k1, v2) pairs.
* MapReduce is a programming model used to process large data sets in a batch-processing manner.
* It is a method for distributing computation across multiple nodes.
* A MapReduce program comprises:
  - a Map() procedure that performs filtering and sorting (such as sorting students by last name into queues, one queue for each name), and
  - a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies), as in the sketch below.
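The student example can be written out as plain, framework-free Java to show the shape of the model. The class and data here are illustrative only, and the grouping loop in main() stands in for the shuffle-and-sort phase that a real MapReduce engine performs between the two functions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class StudentQueues {
    // map: emit an intermediate pair (lastName, 1) for each student record
    static List<Map.Entry<String, Integer>> map(List<String> students) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String lastName : students) {
            pairs.add(Map.entry(lastName, 1));
        }
        return pairs;
    }

    // reduce: summarize one queue, i.e. count the students sharing a last name
    static int reduce(String lastName, List<Integer> ones) {
        return ones.size();
    }

    public static void main(String[] args) {
        List<String> students = List.of("Smith", "Jones", "Smith", "Lee", "Jones", "Smith");
        // stand-in for shuffle-and-sort: group intermediate values by key
        Map<String, List<Integer>> queues = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : map(students)) {
            queues.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                  .add(pair.getValue());
        }
        // prints the name frequencies, e.g. "Smith 3"
        queues.forEach((name, ones) -> System.out.println(name + " " + reduce(name, ones)));
    }
}
```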
THE MAPPER

* Each block is processed in isolation by a map task called the mapper.
* The map task runs on the node where the block is stored.

[Figure: the mapper turns input key-value pairs, e.g. (doc-id, doc-content), into intermediate key-value pairs, e.g. (word, wordcount-in-a-doc). Adapted from Jeff Ullman's course slides.]

SHUFFLE AND SORT

* The output from the mapper is sorted by key.
* All values with the same key are guaranteed to go to the same machine.

THE REDUCER

* Consolidates the results from the different mappers.
* Produces the final output.

[Figure: intermediate key-value pairs, e.g. (word, wordcount-in-a-doc), are grouped into key-value groups, e.g. (word, list-of-wordcount), analogous to a SQL GROUP BY, and reduced to final key-value pairs, e.g. (word, final-count), analogous to SQL aggregation. Adapted from Jeff Ullman's course slides.]

JOB TRACKER

* The JobTracker is the component to which client applications submit MapReduce programs (jobs).
* The JobTracker schedules client jobs and allocates tasks to the slave TaskTrackers that run on the individual worker machines (DataNodes).
* The JobTracker manages the overall execution of a MapReduce job.
* The JobTracker manages the resources of the cluster:
  - Manages the DataNodes, i.e. the TaskTrackers.
  - Keeps track of consumed and available resources.
  - Keeps track of already running tasks and provides fault tolerance for tasks, etc.

TASK TRACKER

* Each TaskTracker is responsible for executing and managing the individual tasks assigned by the JobTracker.
* The TaskTracker also handles the data motion between the map and reduce phases.
* One prime responsibility of the TaskTracker is to constantly communicate the status of its tasks to the JobTracker.
* If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it assumes the TaskTracker has crashed and resubmits the corresponding tasks to other nodes in the cluster.

HOW THE MAPREDUCE ENGINE WORKS

* Client applications submit jobs to the JobTracker.
* The JobTracker talks to the NameNode to determine the location of the data.
* The JobTracker locates TaskTracker nodes with available slots at or near the data.
* The JobTracker submits the work to the chosen TaskTracker nodes.
* The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
* A TaskTracker notifies the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
* When the work is completed, the JobTracker updates its status.

WORD COUNT EXAMPLE

[Figure: the overall MapReduce word count process. Input, Splitting, Mapping (K1,V1 to List(K2,V2)), Shuffling (K2, List(V2)), Reducing (List(K3,V3)), Final Result. For example, two occurrences of "Bear" are shuffled into (Bear, (1,1)) and reduced to a single count.]

The canonical Hadoop implementation of word count is sketched below.
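This version follows the standard Apache Hadoop tutorial example: the mapper emits (word, 1) for every token, the shuffle groups the ones by word, and the reducer sums them. Input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one); // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // emit (word, final-count)
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A job like this is packaged into a jar and submitted with, for example, hadoop jar wordcount.jar WordCount /input /output, where /input and /output are HDFS paths.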
