Apache Spark is a fast, open-source cluster computing technology designed for efficient data processing, offering in-memory computation that significantly speeds up tasks compared to Hadoop MapReduce. It supports various processing types, including batch and stream processing, and includes components like Spark SQL, Spark Streaming, MLlib for machine learning, and GraphX for graph processing. YARN serves as the resource management and job scheduling framework for Hadoop, optimizing resource utilization and ensuring fault tolerance across distributed computing environments.
Unit 2
Big Data Platforms
Overview of Apache Spark
• Apache Spark is a fast cluster computing technology designed for fast computation.
• It is a fast, general-purpose, open-source engine for processing data across a wide range of workloads.
• It is built around agility, ease of use, and advanced analytics.
• Apache Spark is best known for running iterative machine learning algorithms.
• It is based on Hadoop MapReduce and extends the MapReduce model so that it can be used efficiently for more types of computation.
• The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
• Spark is designed to cover a wide range of processing, such as batch applications, iterative algorithms, interactive queries, and streaming.
• Spark can therefore be used to perform both batch processing and stream processing tasks.

Limitations of MapReduce
• Hadoop MapReduce is a programming model for processing big data sets with a parallel, distributed algorithm.
• A challenge with MapReduce is the sequential, multi-step process it takes to run a job.
• In each step, MapReduce reads data from the cluster, performs its operations, and writes the results back to HDFS.
• Because each step requires a disk read and a disk write, MapReduce jobs are slower due to the latency of disk I/O.

Advantages of Spark over MapReduce
• Spark was created to address the limitations of MapReduce by doing processing in memory, reducing the number of steps in a job, and reusing data across multiple parallel operations.
• With Spark, only one step is needed: data is read into memory, operations are performed, and the results are written back, resulting in much faster execution.
• Spark also reuses data by means of an in-memory cache, which greatly speeds up machine learning algorithms.

Overview of Apache Spark
• Spark provides the following functionality:
1. High-level APIs in Java, Scala, Python, and R.
2. A simplified way to handle the computationally intensive task of processing high volumes of real-time data.
3. Fast processing, up to 100x faster than Apache Hadoop MapReduce.
4. Interactive, ad-hoc data analysis.
5. Increased processing speed.
6. In-memory cluster computation capability.

Components of Apache Spark
• Spark Core is the foundation of the platform.
• It is responsible for memory management, fault recovery, scheduling, distributing and monitoring jobs, and interacting with storage systems.
• Spark Core is exposed through application programming interfaces (APIs) built for Java, Scala, Python, and R.
• These APIs hide the complexity of distributed processing behind simple, high-level operators.

Components of Apache Spark
• Spark SQL (Interactive Queries):
• Spark SQL is a distributed query engine that provides low-latency, interactive queries up to 100x faster than MapReduce.
• It includes a cost-based optimizer, columnar storage, and code generation for fast queries, while scaling to thousands of nodes.
• Business analysts can use standard SQL or the Hive Query Language to query data.
• Developers can use APIs available in Scala, Java, Python, and R.
• It supports various data sources out of the box, including JDBC, ODBC, JSON, HDFS, Hive, ORC, and Parquet.
• Connectors for other popular stores, such as Amazon Redshift, Amazon S3, Couchbase, Cassandra, MongoDB, Salesforce.com, Elasticsearch, and many others, are available in the Spark Packages ecosystem.
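To make the Spark Core and Spark SQL discussion concrete, the following is a minimal PySpark sketch that caches a DataFrame in memory and queries it with Spark SQL. The input path, the view name, and the column names are illustrative assumptions, not part of this unit.

```python
# Minimal PySpark sketch: in-memory caching plus a Spark SQL query.
# The input path and column names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSqlDemo").getOrCreate()

# Read a JSON data source (one of the formats Spark SQL supports out of the box).
people = spark.read.json("hdfs:///data/people.json")

# cache() keeps the DataFrame in memory, so repeated queries avoid re-reading from disk.
# This is the in-memory computation that distinguishes Spark from MapReduce.
people.cache()

# Register a temporary view and query it with standard SQL.
people.createOrReplaceTempView("people")
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18 ORDER BY age")
adults.show()

spark.stop()
```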
Components of Apache Spark
• Spark Streaming (Real-time):
• Spark Streaming is a real-time solution that leverages Spark Core's fast scheduling capability to do streaming analytics.
• It ingests data in mini-batches and enables analytics on that data with the same application code written for batch analytics.
• This improves developer productivity, because the same code can be used for batch processing and for real-time streaming applications.
• Spark Streaming supports data from Twitter, Kafka, Flume, HDFS, ZeroMQ, and many other sources found in the Spark Packages ecosystem.

Components of Apache Spark
• MLlib (Machine Learning):
• Spark includes MLlib, a library of algorithms for doing machine learning on data at scale (a short example follows this section).
• Machine learning models can be trained by data scientists with R or Python on any Hadoop data source, saved using MLlib, and imported into a Java- or Scala-based pipeline.
• Spark was designed for fast, interactive computation that runs in memory, enabling machine learning to run quickly.
• The algorithms include classification, regression, clustering, collaborative filtering, and pattern mining.

Components of Apache Spark
• GraphX (Graph Processing):
• Spark GraphX is a distributed graph processing framework built on top of Spark.
• GraphX provides ETL, exploratory analysis, and iterative graph computation, enabling users to interactively build and transform a graph data structure at scale.
• It comes with a highly flexible API and a selection of distributed graph algorithms.
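As referenced in the MLlib item above, here is a minimal sketch of training a model with Spark's machine learning API (pyspark.ml). The tiny synthetic dataset and the column names are assumptions chosen purely for illustration.

```python
# Minimal MLlib sketch: train a logistic regression classifier on a tiny in-memory dataset.
# The data values and column names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("MLlibDemo").getOrCreate()

# A toy labelled dataset: (label, feature vector).
train = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.1])),
     (1.0, Vectors.dense([2.0, 1.0])),
     (0.0, Vectors.dense([0.1, 1.3])),
     (1.0, Vectors.dense([1.9, 0.8]))],
    ["label", "features"])

# Fit the model; MLlib distributes the computation across the cluster.
lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(train)

# Apply the trained model to the training data just to show the prediction column.
model.transform(train).select("label", "prediction").show()

spark.stop()
```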
HDFS
• Hadoop's file system allows applications to be run across multiple servers.
• In HDFS, data is divided into blocks, and copies of these blocks are stored on other servers in the Hadoop cluster.
• This means an individual file is actually stored as smaller blocks that are replicated across multiple servers in the entire cluster.
• By default, HDFS replicates these smaller pieces of data onto two additional servers, giving three copies in total.
• This redundancy of data has the following benefits:
1. Higher availability of data.
2. Increased fault tolerance.
3. It allows the Hadoop cluster to break the work into smaller chunks and run those jobs on all the servers in the cluster, for better scalability.
4. Hadoop tries to assign workloads to the servers where the data to be processed is stored. This is known as data locality, and it is critical when working with large data sets.

Figure: an example of how data blocks are written to HDFS.
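The block-splitting and replication idea can be sketched in a few lines of Python. The 64 MB block size and the replication factor of three follow the defaults described in this unit; the round-robin placement below is a simplification for illustration, not the real NameNode placement policy.

```python
# Toy illustration of how HDFS splits a file into blocks and assigns replicas.
# Block size and replication factor follow the defaults described in this unit;
# the placement logic is a simplification, not the real HDFS policy.
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB default block size
REPLICATION = 3                 # one copy plus two additional copies

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return the number of blocks a file of the given size occupies (ceiling division)."""
    return (file_size_bytes + block_size - 1) // block_size

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes (toy round-robin policy)."""
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [datanodes[(block_id + r) % len(datanodes)]
                               for r in range(replication)]
    return placement

# Example: a 200 MB file on a 5-node cluster occupies 4 blocks, each stored on 3 DataNodes.
blocks = split_into_blocks(200 * 1024 * 1024)
print(blocks)  # 4
print(place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4", "dn5"]))
```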
HDFS
• A data file in HDFS is divided into blocks, and the default size of these blocks in Apache Hadoop is 64 MB.
• Hadoop was designed to scan through very large data sets, so it makes sense to use a very large block size, allowing each server to work on a larger chunk of data at a time.
• Coordination across a cluster has significant overhead, and that coordination is managed by HDFS.
• HDFS makes sure that replicas of each block are stored on a separate server rack, so that the data remains available even if an entire rack of servers is lost.

HDFS: NameNode
• All of Hadoop's data placement logic is managed by a special server called the NameNode.
• The NameNode keeps track of all the data files in HDFS, including where their blocks are stored.
• All of the NameNode's information is held in memory, which allows it to provide quick response times to storage manipulation or read requests.
• When a file is created, HDFS automatically communicates with the NameNode to allocate storage on specific servers and to perform the data replication.
• The NameNode maintains and manages the file system namespace and provides clients with the right access permissions.
• The NameNode performs file system namespace operations, including opening, closing, and renaming files and directories.
• The NameNode records any change to the file system namespace or its properties.

HDFS: DataNodes
• HDFS also has multiple DataNodes on a commodity hardware cluster.
• The DataNodes are generally organized within the same rack in the data center.
• Data is broken down into separate blocks and distributed among the various DataNodes for storage.
• The NameNode knows which DataNode contains which blocks and where the DataNodes reside within the cluster.
• The NameNode also manages access to the files across the DataNodes.
• The DataNodes serve read and write requests from clients and perform block creation, deletion, and replication when the NameNode instructs them to do so.
• The DataNodes are in constant communication with the NameNode, so the NameNode can determine which DataNodes are needed for specific tasks.
• The NameNode is always aware of the status of each DataNode. If the NameNode realizes that a DataNode is not working properly, it can immediately reassign that DataNode's task to a different node containing the same data block.
• DataNodes also communicate with each other, so that they can cooperate during normal file operations.

YARN
• YARN is considered the brain of the Hadoop architecture.
• Apart from resource management and allocation, it also performs job scheduling.
• YARN is the parallel processing framework for implementing distributed computing clusters.
• YARN allows different data processing methods, such as graph processing, interactive processing, stream processing, and batch processing, to run and process data stored in HDFS.
• YARN uses master servers and data servers. There is only one master server per cluster, and it runs the Resource Manager daemon (program).
• There are many data servers in the cluster; each one runs its own Node Manager daemon and, as required, an Application Master.

Components of YARN
1. Resource Manager
• It is responsible for resource allocation.
• On receiving processing requests, it passes parts of the requests to the corresponding Node Managers, where the actual processing takes place.
• It is the arbitrator of the cluster resources and decides the allocation of the available resources among the applications.
• It optimizes cluster utilization by keeping all resources in use as much of the time as possible.
• It has two major components: a) the Scheduler and b) the Application Manager.

a) Scheduler
• The Scheduler is responsible for allocating resources to the various running applications, subject to constraints such as capacities and queues.
• It is called a pure scheduler because it does not perform any monitoring or tracking of application status.
• If there is an application failure or hardware failure, the Scheduler does not guarantee to restart the failed tasks.
• It performs scheduling based on the resource requirements of the applications.
• It has a pluggable policy plug-in, which is responsible for partitioning the cluster resources among the various applications. There are two such plug-ins: the Capacity Scheduler and the Fair Scheduler.

b) Application Manager
• It is responsible for accepting job submissions.
• It negotiates the first container from the Resource Manager for executing the application-specific Application Master.
• It manages the running Application Masters in the cluster and provides the service of restarting the Application Master container on failure.

2. Node Manager
• It handles the individual nodes in a Hadoop cluster and manages user jobs and workflow on the given node.
• It registers with the Resource Manager and sends the status of its node.
• Its primary goal is to manage the application containers assigned to it by the Resource Manager.
• The Application Master requests an assigned container from the Node Manager by sending it a Container Launch Context (CLC), which includes everything the application needs in order to run.
• The Node Manager creates the requested container process and starts it.
• It monitors the resource usage (memory, CPU) of individual containers and also performs log management.

3. Application Master
• An application is a single job submitted to the framework. Each application has a unique Application Master associated with it, which is a framework-specific entity.
• It is the process that coordinates an application's execution in the cluster and also manages faults.
• Its task is to negotiate resources from the Resource Manager and work with the Node Manager to execute and monitor the component tasks.
• It is responsible for negotiating appropriate resource containers from the Resource Manager, tracking their status, and monitoring progress.

4. Container
• A container is a collection of physical resources, such as RAM, CPU cores, and disks, on a single node.
• YARN containers are managed by a Container Launch Context (CLC), which describes the container life cycle.
• This record contains a map of environment variables, dependencies stored in remotely accessible storage, security tokens, the payload for Node Manager services, and the command necessary to create the process.
• It grants an application the right to use a specific amount of resources (memory, CPU, etc.) on a specific host.

Workflow in YARN
1. The client submits an application to the Resource Manager.
2. The Resource Manager allocates a container to start the Application Manager.
3. The Application Manager registers itself with the Resource Manager.
4. The Application Manager negotiates containers from the Resource Manager.
5. The Application Manager notifies the Node Manager to launch the containers.
6. The application code is executed in the container on the Node Manager.
7. The client contacts the Resource Manager/Application Manager to monitor the application's status.
8. Once the processing is complete, the Application Manager un-registers with the Resource Manager.
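The eight-step workflow above is essentially a message sequence. The following toy Python sketch only exists to make the order of interactions concrete; every class and method name in it is hypothetical and does not correspond to any real YARN or Hadoop API.

```python
# Toy model of the YARN application workflow described above.
# All class and method names are hypothetical; they are NOT real YARN APIs.

class ResourceManager:
    def allocate_container(self, purpose):
        print(f"RM: allocating container for {purpose}")
        return {"node": "node-1", "memory_mb": 1024, "vcores": 1}

    def register(self, app_master):
        print(f"RM: registered {app_master}")

    def unregister(self, app_master):
        print(f"RM: un-registered {app_master}")


class NodeManager:
    def launch_container(self, container, command):
        # Steps 5 and 6: the Node Manager starts the container and the application code runs in it.
        print(f"NM: launching {container} to run '{command}'")


def run_application(rm, nm, job_command):
    # Step 1: the client submits an application (modelled here as calling this function).
    # Step 2: the Resource Manager allocates a container to start the application master.
    rm.allocate_container("application master")
    app_master = "app-master-0001"
    # Step 3: the application master registers itself with the Resource Manager.
    rm.register(app_master)
    # Step 4: it negotiates further containers from the Resource Manager.
    task_container = rm.allocate_container("task container")
    # Steps 5 and 6: the Node Manager launches the container and executes the application code.
    nm.launch_container(task_container, job_command)
    # Step 7 (the client polling for status) is omitted from this sketch.
    # Step 8: once processing is complete, the application master un-registers.
    rm.unregister(app_master)


run_application(ResourceManager(), NodeManager(), "wordcount input output")
```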
Key Features of YARN
• Multi-tenancy: YARN allows multiple processing engines to access the same data, fulfils the requirements of real-time systems, and manages the movement of that data within the framework.
• Sharing resources: YARN ensures there is no dependency between compute jobs. Each compute job runs on its own node and does not share its allocated resources; each job is responsible for its own assigned work.
• Cluster utilization: YARN optimizes the cluster by utilizing and allocating its resources dynamically.
• Fault tolerance: YARN is highly fault-tolerant. It allows failed compute jobs to be rescheduled without any implications for the final output.
• Scalability: YARN focuses mainly on scheduling resources; because of this, the number of data nodes can be expanded, increasing the processing capacity.
• Compatibility: Jobs that worked in MapReduce v1 can be migrated to higher versions of Hadoop with ease, ensuring the high compatibility of YARN.

MapReduce
• MapReduce is a programming framework that allows us to perform distributed and parallel processing of large data sets in a distributed environment.
• MapReduce consists of two important tasks: Map and Reduce.
• The Map task takes a set of data and converts it into another set of data, in which individual elements are broken down into tuples (key-value pairs).
• The Reduce task takes the output of the Map task as input and combines those data tuples (key-value pairs) into a smaller set of tuples.
• The Reduce task is always performed after the Map job.
• The reducer phase takes place after the mapper phase has completed.
• In the map job, a block of data is read and processed to produce key-value pairs as intermediate output.
• The output of a mapper, or map job (key-value pairs), is the input to the reducer.
• The reducer receives key-value pairs from multiple map jobs.
• The reducer then aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which is the final output.

MapReduce: Phases
• Input Phase − In this phase, a record reader translates each record in the input file and sends the parsed data to the mapper in the form of key-value pairs.
• Map − Map is a user-defined function that takes a series of key-value pairs and processes each one of them to generate key-value pairs.
• Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.
• Combiner − A combiner is a type of local reducer that groups similar data from the map phase into sets. It takes the intermediate keys from the mapper as input and applies user-defined code to aggregate the values within the small scope of a single mapper.
• Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. The individual key-value pairs are sorted by key into a larger data list. The data list groups the equivalent keys together so that their values can easily be iterated over in the Reducer task.
• Reducer − The Reducer takes the grouped key-value paired data as input and runs a reducer function on each group. Here the data can be aggregated, filtered, and combined in a number of ways, and this can require a wide range of processing. Once the execution is over, the Reducer gives zero or more key-value pairs to the final step.
• Output Phase − In the output phase, an output formatter translates the final key-value pairs from the Reducer function and writes them onto a file using a record writer.

MapReduce: Word Count Example
• A word count job emits each distinct word as a key with its total count as the value, for example:

Key       Value
bad       1
Class     1
good      1
Hadoop    3
is        2
to        1
Welcome   1

MapReduce
• One map task is created for each split, and it executes the map function for each record in the split.
• It is usually beneficial to have multiple splits, because the time taken to process a split is small compared to the time taken to process the whole input, and when the splits are processed in parallel, smaller splits give better load balancing.
• However, it is not desirable to have splits that are too small: when splits are too small, the overhead of managing the splits and creating the map tasks begins to dominate the total job execution time.
• For most jobs, it is better to make the split size equal to the size of an HDFS block (64 MB by default).
• Map tasks write their output to the local disk of the respective node, not to HDFS.
• Map output is intermediate output: it is processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away, so storing it in HDFS with replication would be overkill.
• If a node fails before the map output has been consumed by the reduce task, Hadoop reruns the map task on another node and re-creates the map output.
• The reduce task does not work on the concept of data locality: the output of every map task is fed to the reduce task, so map output is transferred to the machine where the reduce task is running.
• On that machine, the outputs are merged and then passed to the user-defined reduce function.
• Unlike the map output, reduce output is stored in HDFS; the first replica is stored on the local node and the other replicas are stored on off-rack nodes.
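To tie the phases and the word count example together, here is a sketch of a word count mapper and reducer in plain Python. The example input lines are chosen so that the counts match the table above; this is an illustration of the Map, Shuffle and Sort, and Reduce steps, not the exact code used in the unit.

```python
# Word count sketch: the mapper emits (word, 1) pairs and the reducer sums the counts per key.
# This illustrates the MapReduce phases described above; it is not tied to any cluster setup.
from itertools import groupby


def mapper(lines):
    """Map phase: break each record into (word, 1) intermediate key-value pairs."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


def reducer(pairs):
    """Reduce phase: pairs arrive sorted by key (the shuffle-and-sort step);
    group equal keys together and sum their values."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    # Illustrative input chosen so the counts match the table shown above.
    text = ["Welcome to Hadoop Class", "Hadoop is good", "Hadoop is bad"]
    # Shuffle and sort: order the intermediate pairs by key before reducing.
    intermediate = sorted(mapper(text))
    for word, total in reducer(intermediate):
        print(f"{word}\t{total}")
```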
PageRank Algorithm
• PageRank is a recursive algorithm developed by Google founder Larry Page to assign a real number (a score) to each page on the Web.
• Pages can then be ranked by these scores: the higher the score of a page, the more important the page is.
• According to Google, PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that if a website receives more links from other websites, it is more important.
• The PageRank algorithm outputs a probability distribution that represents the likelihood that a person randomly clicking on links will arrive at any particular page.
• PageRank can be calculated for collections of documents of any size.
• The PageRank computation requires several passes through the collection, called iterations, to adjust the approximate PageRank values so that they predict the true values more accurately.

PageRank Algorithm: Example
• Assume that there are four web pages: A, B, C, and D.
• Links from a page to itself, or multiple links from one single page to another single page, are ignored.
• PageRank is initialized to the same value for all pages.
• In the original form of PageRank, the sum of PageRank over all pages was the total number of pages on the web at that time, so each page in this example would have an initial value of 1.
• For simplification, however, PageRank is treated as a probability distribution between 0 and 1, so the initial value for each page in this example is 0.25.
• If the only links in the system were from pages B, C, and D to A, each link would transfer 0.25 PageRank to A upon the next iteration, for a total of 0.75, i.e. PR(A) = PR(B) + PR(C) + PR(D).
• Suppose instead that page B has links to pages C and A, page C has a link to page A, and page D has links to all three pages.
• Upon the first iteration, page B would transfer half of its existing value (0.125) to page A and the other half (0.125) to page C.
• Page C would transfer all of its existing value (0.25) to the only page it links to, A.
• Since page D has three outbound links, it would transfer one third of its existing value, approximately 0.083, to A.
• At the completion of this iteration, page A will have a PageRank of approximately 0.458, i.e. PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3.
• The PageRank conferred by an outbound link is equal to the document's own PageRank score divided by the number of its outbound links L( ).
• The general (simplified) equation for calculating the PageRank of a page u is: PR(u) = Σ PR(v) / L(v), where the sum runs over every page v that links to u and L(v) is the number of outbound links of page v.

PageRank Algorithm: MapReduce
• The MapReduce approach tackles the problem by taking advantage of running on a cluster (parallelization), and it scales up to very large data sets.
• At the beginning of each iteration, a node passes its PageRank contributions to the nodes it links to; this is also called spreading probability mass to neighbours via outgoing links.
• At the end of the iteration, each node sums up all the PageRank contributions that have been passed to it and computes an updated PageRank score; this is also called gathering the probability mass passed to a node via its incoming links.
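The following short Python sketch reproduces one iteration of the four-page example above using the simplified formula (no damping factor). The link structure for B, C, and D matches the worked example; page A's outgoing links are not specified in the text, so it is treated as contributing nothing in this iteration.

```python
# One iteration of simplified PageRank (no damping factor) on the four-page example:
# B links to C and A, C links to A, D links to A, B, and C.
# Page A's outgoing links are not specified in the example, so it contributes nothing here.

links = {
    "A": [],                 # not given in the worked example (treated as a dangling node)
    "B": ["C", "A"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}

# Initial PageRank: a uniform probability distribution over the four pages.
pr = {page: 0.25 for page in links}

def pagerank_iteration(pr, links):
    """Spread PR(v)/L(v) from every page v to each page it links to, then gather."""
    new_pr = {page: 0.0 for page in pr}
    for page, outgoing in links.items():
        if not outgoing:
            continue  # dangling nodes need more careful handling in real implementations
        share = pr[page] / len(outgoing)   # PR(v) / L(v)
        for target in outgoing:
            new_pr[target] += share
    return new_pr

pr = pagerank_iteration(pr, links)
print(round(pr["A"], 3))   # ~0.458, matching PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3
```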