Processing Unstructured Data in Hadoop

The document is a question bank focused on Data Analytics and Supporting Services, covering topics such as structured and unstructured data, machine learning types, Hadoop and Spark components, and data security. It includes multiple-choice questions, explanations of key concepts, and comparisons between technologies like Hadoop and Spark. Additionally, it addresses various frameworks and tools relevant to data processing and analytics.

Question Bank

UNIT-IV
Data Analytics and Supporting Services

1. A traditional RDBMS is unable to process –


a. Structured data
b. Unstructured data
c. Both structured and unstructured data
d. None of these
2. Structured data is managed in a database using –
a. .NET Framework
b. Structured Query Language
c. Natural Language Processing
d. All of these
3. What are the main components of Hadoop Ecosystem?
a. MapReduce, HDFS, YARN
b. GraphX, Gelly
c. CEP
d. None of the mentioned
4. NoSQL databases store unstructured data with no particular schema. (True / False)
5. Which of the following is not a NoSQL database?
a. MongoDB
b. SQL Server
c. Cassandra
d. None of the mentioned
6. ____________ is a distributed machine learning framework on top of Spark.
a. MLlib
b. Spark Streaming
c. GraphX
d. RDDs
7. ________________ is a resource management platform responsible for managing
compute resources in the cluster and using them in order to schedule users'
applications.
a. Hadoop Common
b. Hadoop Distributed File System (HDFS)
c. Hadoop YARN
d. Hadoop MapReduce
8. In simple term, machine learning is
a. Training based on historical data
b. Prediction to answer a query
c. Both a and b
d. None
9. Deep learning is
a. Subfield of machine learning
b. Learns features by its own
c. Mimics the working function of several features
d. All of the above
10. In Spark, a ______________________ is a read-only collection of objects
partitioned across a set of machines that can be rebuilt if a partition is lost.
a. Spark Streaming
b. Resilient Distributed Dataset (RDD)
c. GraphX
d. MLlib
11. Consider the following statements in the context of Spark:
Statement 1: Spark also gives you control over how you can partition your
Resilient Distributed Datasets (RDDs)
Statement 2: Spark allows you to choose whether you want to persist Resilient
Distributed Dataset (RDD) onto disk or not.
a. Only statement 1 is true
b. Only statement 2 is true
c. Both statements are true
d. Both statements are false
12. Which of the following are the simplest NoSQL databases?
a. Key-value
b. Wide-column
c. Document
d. All of the mentioned
13. Point out the incorrect statement in the context of Cassandra:
a. Cassandra was originally designed at Facebook
b. Cassandra is a centralized key-value store
c. Cassandra is designed to handle large amounts of data across many commodity
servers, providing high availability with no single point of failure
d. Cassandra uses a ring-based DHT (Distributed Hash Table) but without finger
tables or routing
Part-B
1. What is the difference between structured, semi-structured and
unstructured data?
Structured data is organized by means of a relational database. Semi-structured
data is partially organized by means of formats such as XML/RDF. Unstructured
data, on the other hand, consists of plain character and binary data with no
predefined organization.
2. What is structured data?
Structured data is most often categorized as quantitative data, and it's the type of
data most of us are used to working with. Think of data that fits neatly within fixed
fields and columns in relational databases and spreadsheets. Examples of
structured data include names, dates, addresses, credit card numbers, stock
information, geolocation, and more.
3. What is unstructured data?
Unstructured data is most often categorized as qualitative data, and it cannot be
processed and analyzed using conventional tools and methods.
Examples of unstructured data include text, video, audio, mobile activity, social
media activity, satellite imagery, surveillance imagery – the list goes on and on.
Unstructured data is difficult to deconstruct because it has no pre-defined model,
meaning it cannot be organized in relational databases. Instead, non-relational,
or NoSQL databases, are best fit for managing unstructured data.
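
Because unstructured and semi-structured records need no predefined model, a document store can hold records whose fields differ from one another. A minimal sketch (plain Python dictionaries standing in for documents in a store such as MongoDB; the field names are invented):

```python
# Toy illustration (not a real NoSQL client): documents in a document
# store need not share a fixed schema, unlike rows in an RDBMS table.
tweets = [
    {"user": "alice", "text": "hello", "hashtags": ["hi"]},
    {"user": "bob", "text": "photo day", "media": {"type": "image", "url": "x.jpg"}},
]

def fields(doc):
    """Return the set of field names a document happens to have."""
    return set(doc)

# Each document carries its own structure:
assert fields(tweets[0]) != fields(tweets[1])
```

A relational table would force every row into the same columns; here each record simply carries whatever fields it needs.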
4. How would you secure data in motion as well as data at rest?
Data encryption protects data at rest and in motion. Securing data at the
perimeter through measures like firewalls is simply a band-aid; from regular use to
warehousing, data must be protected at each point throughout its lifecycle.
What is the difference between data at rest and data in transit?
Data in transit, or data in motion, is data actively moving from one location to
another, such as across the internet or through a private network. Data protection
at rest aims to secure inactive data stored on any device or network.
5. What are the three pillars of Big Data?
 Structured Data
 Unstructured Data and
 Semi Structured Data

6. What are the different types of Machine Learning?


There are three ways in which machines learn:

1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning

Supervised Learning:

Supervised learning is a method in which the machine learns using labeled data.
 It is like learning under the guidance of a teacher
 Training dataset is like a teacher which is used to train the machine
 Model is trained on a pre-defined dataset before it starts making decisions
when given new data

Unsupervised Learning:

Unsupervised learning is a method in which the machine is trained on unlabelled
data, without any guidance.

 It is like learning without a teacher.
 Model learns through observation & finds structures in data.
 Model is given a dataset and is left to automatically find patterns and
relationships in that dataset by creating clusters.

Reinforcement Learning:

Reinforcement learning involves an agent that interacts with its environment by
producing actions & discovers errors or rewards.

 It is like being stuck on an isolated island, where you must explore the
environment and learn how to live and adapt to the living conditions on your
own.
 Model learns through the trial-and-error method
 It learns on the basis of reward or penalty given for every action it performs
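
The trial-and-error idea above can be sketched in a few lines: the agent repeatedly tries actions, observes a reward or penalty, and nudges its value estimates accordingly. This is a toy illustration with made-up actions and rewards, not a full reinforcement learning algorithm:

```python
import random

# Toy sketch of reward-based learning: the agent tries actions,
# receives a reward or penalty, and keeps a running value estimate
# per action, gradually coming to prefer the better one.
random.seed(0)
rewards = {"left": -1.0, "right": +1.0}   # environment, hidden from the agent
values = {"left": 0.0, "right": 0.0}      # the agent's estimates
alpha = 0.5                                # learning rate

for _ in range(50):
    action = random.choice(list(values))            # explore by trial
    r = rewards[action]                             # observe reward/penalty
    values[action] += alpha * (r - values[action])  # move estimate toward r

best = max(values, key=values.get)
assert best == "right"   # the agent has learned which action pays off
```

After enough trials the estimate for each action approaches its true reward, which is the essence of learning from reward and penalty rather than from labeled examples.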

7. How would you explain Machine Learning to a school-going kid?

 Suppose your friend invites you to his party where you meet total strangers.
Since you have no idea about them, you will mentally classify them on the
basis of gender, age group, dressing, etc.
 In this scenario, the strangers represent unlabeled data and the process of
classifying unlabeled data points is nothing but unsupervised learning.
 Since you didn’t use any prior knowledge about people and classified them
on-the-go, this becomes an unsupervised learning problem.

8. How does Deep Learning differ from Machine Learning?


Deep Learning: a form of machine learning that is inspired by the structure of the
human brain and is particularly effective in feature detection.

Machine Learning: all about algorithms that parse data, learn from that data, and
then apply what they've learned to make informed decisions.

9. Explain Classification and Regression


10. How is KNN different from K-means clustering?

11. Name some companies that use Hadoop.

 Yahoo (one of the biggest users & more than 80% code contributor to
Hadoop)
 Facebook
 Netflix
 Amazon
 Adobe
 eBay
 Hulu
 Spotify
 Rubikloud
 Twitter

12. What are the core components of Hadoop?


Hadoop is an open-source software framework for distributed storage and
processing of large datasets. Apache Hadoop core components are HDFS,
MapReduce, and YARN.
 HDFS- Hadoop Distributed File System (HDFS) is the primary storage system
of Hadoop. HDFS stores very large files on a cluster of commodity
hardware. It works on the principle of storing a small number of large files
rather than a huge number of small files. HDFS stores data reliably even in
the case of hardware failure. It provides high-throughput access to an
application by accessing data in parallel.
 MapReduce- MapReduce is the data processing layer of Hadoop. An
application written for it processes large structured and unstructured data stored
in HDFS. MapReduce processes a huge amount of data in parallel. It does this by
dividing the job (submitted job) into a set of independent tasks (sub-jobs). In
Hadoop, MapReduce works by breaking the processing into two
phases: Map and Reduce. The Map is the first phase of processing, where we
specify all the complex logic code. Reduce is the second phase of processing,
where we specify light-weight processing like aggregation/summation.
 YARN- YARN is the processing framework in Hadoop. It provides resource
management and allows multiple data processing engines, such as real-
time streaming, data science, and batch processing.
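
The two MapReduce phases can be sketched for the classic word-count example. This is a single-process toy illustration; a real Hadoop job runs the same two phases distributed across a cluster:

```python
from collections import defaultdict

# Map phase: emit an intermediate (word, 1) pair for every word.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Reduce phase: lightweight aggregation -- sum the counts per word.
# (The shuffle/sort step between the phases is implicit here.)
def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

data = ["big data big ideas", "big data"]
assert reduce_phase(map_phase(data)) == {"big": 3, "data": 2, "ideas": 1}
```

The map step holds the "complex logic" (tokenizing each record), while the reduce step is the light aggregation, exactly the split described above.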
13. What is Kafka?

Kafka is an open-source message broker project developed by the Apache
Software Foundation. It is written in Scala and is a distributed publish-subscribe
messaging system.

Kafka Salient Features

Feature | Description

High Throughput | Support for millions of messages with modest hardware

Scalability | Highly scalable distributed systems with no downtime

Replication | Messages are replicated across the cluster to provide support for
multiple subscribers and balance the consumers in case of failures

Durability | Provides support for persistence of messages to disk

Stream Processing | Used with real-time streaming applications like Apache Spark & Storm

Data Loss | Kafka with proper configurations can ensure zero data loss

14. List the various components in Kafka.

The four major components of Kafka are:

 Topic – a stream of messages belonging to the same type
 Producer – publishes messages to a topic
 Brokers – a set of servers where the published messages are stored
 Consumer – subscribes to various topics and pulls data from the brokers
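
The four roles can be illustrated with a toy in-memory model. This is an analogy only, not the real Kafka API; the class and topic names are invented:

```python
from collections import defaultdict

class Broker:
    """Stores published messages, grouped by topic."""
    def __init__(self):
        self.topics = defaultdict(list)   # topic -> stored messages

class Producer:
    """Publishes messages to a topic on a broker."""
    def __init__(self, broker):
        self.broker = broker
    def publish(self, topic, message):
        self.broker.topics[topic].append(message)

class Consumer:
    """Subscribes to topics and pulls data from the broker."""
    def __init__(self, broker):
        self.broker = broker
        self.offsets = defaultdict(int)   # topic -> next position to read
    def poll(self, topic):
        msgs = self.broker.topics[topic][self.offsets[topic]:]
        self.offsets[topic] += len(msgs)
        return msgs

broker = Broker()
Producer(broker).publish("sensor-data", {"temp": 21})
consumer = Consumer(broker)
assert consumer.poll("sensor-data") == [{"temp": 21}]
assert consumer.poll("sensor-data") == []   # already consumed up to the offset
```

Note the pull model: the broker only stores messages; each consumer tracks its own read position, which is also how real Kafka consumers track offsets.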

15. Compare Hadoop and Spark.


We will compare Hadoop MapReduce and Spark based on the following aspects:

Apache Spark vs. Hadoop


Feature Criteria | Apache Spark | Hadoop

Speed | 100 times faster than Hadoop | Decent speed

Processing | Real-time & batch processing | Batch processing only

Difficulty | Easy because of high-level modules | Tough to learn

Recovery | Allows recovery of partitions | Fault-tolerant

Interactivity | Has interactive modes | No interactive mode except Pig & Hive

16. List the key features of Apache Spark.


The following are the key features of Apache Spark:

1. Polyglot
2. Speed
3. Multiple Format Support
4. Lazy Evaluation
5. Real Time Computation
6. Hadoop Integration
7. Machine Learning

Let us look at these features in detail:

1. Polyglot: Spark provides high-level APIs in Java, Scala, Python and R.
Spark code can be written in any of these four languages. It provides a shell
in Scala and Python. The Scala shell can be accessed through ./bin/spark-
shell and Python shell through ./bin/pyspark from the installed directory.

2. Speed: Spark runs up to 100 times faster than Hadoop MapReduce for
large-scale data processing. Spark is able to achieve this speed through controlled
partitioning. It manages data using partitions that help parallelize distributed
data processing with minimal network traffic.
3. Multiple Formats: Spark supports multiple data sources such as Parquet,
JSON, Hive and Cassandra. The Data Sources API provides a pluggable
mechanism for accessing structured data through Spark SQL. Data sources
can be more than just simple pipes that convert data and pull it into Spark.

4. Lazy Evaluation: Apache Spark delays its evaluation till it is absolutely
necessary. This is one of the key factors contributing to its speed. For
transformations, Spark adds them to a DAG of computation, and only when
the driver requests some data does this DAG actually get executed.

5. Real Time Computation: Spark's computation is real-time and has less
latency because of its in-memory computation. Spark is designed for
massive scalability and the Spark team has documented users of the system
running production clusters with thousands of nodes and supports several
computational models.

6. Hadoop Integration: Apache Spark provides smooth compatibility with
Hadoop. This is a great boon for all the Big Data engineers who started their
careers with Hadoop. Spark is a potential replacement for the MapReduce
functions of Hadoop, while Spark has the ability to run on top of an existing
Hadoop cluster using YARN for resource scheduling.

7. Machine Learning: Spark's MLlib is the machine learning component
which is handy when it comes to big data processing. It eradicates the need
to use multiple tools, one for processing and one for machine learning. Spark
provides data engineers and data scientists with a powerful, unified engine
that is both fast and easy to use.
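
The lazy evaluation described in point 4 can be sketched with Python generators, which likewise defer work until a result is demanded. This is an analogy, not Spark code:

```python
# Sketch of lazy evaluation: like a Spark transformation, building the
# pipeline does no work; only the final "action" forces execution.
log = []

def transform(xs):
    for x in xs:
        log.append(x)          # side effect records when work actually happens
        yield x * 2

pipeline = transform(range(3))  # "transformation": nothing runs yet
assert log == []                # no work has been done so far

result = list(pipeline)         # "action": forces the pipeline to execute
assert result == [0, 2, 4]
assert log == [0, 1, 2]         # work happened only at the action
```

In Spark the same principle lets the engine see the whole DAG of transformations before running anything, so it can optimize and pipeline the work.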
17. What are the languages supported by Apache Spark and which is the most
popular one?
Apache Spark supports the following four languages: Scala, Java, Python and R.
Among these languages, Scala and Python have interactive shells for Spark. The
Scala shell can be accessed through ./bin/spark-shell and the Python shell
through ./bin/pyspark. Scala is the most used among them because Spark itself
is written in Scala.

18. What are benefits of Spark over MapReduce?


Spark has the following benefits over MapReduce:

1. Due to the availability of in-memory processing, Spark implements the
processing around 10 to 100 times faster than Hadoop MapReduce, whereas
MapReduce makes use of persistent storage for its data processing
tasks.
2. Unlike Hadoop, Spark provides inbuilt libraries to perform multiple tasks
from the same core, like batch processing, streaming, machine learning, and
interactive SQL queries. Hadoop, however, only supports batch processing.
3. Hadoop is highly disk-dependent whereas Spark promotes caching and in-
memory data storage.
4. Spark is capable of performing computations multiple times on the same
dataset. This is called iterative computation while there is no iterative
computing implemented by Hadoop.

19. What is YARN?


YARN is one of Hadoop's key features, and Spark can use it as a central
resource management platform to deliver scalable operations across the
cluster. YARN is a distributed container manager, like Mesos for example,
whereas Spark is a data processing tool. Spark can run on YARN, the same way
Hadoop MapReduce can run on YARN. Running Spark on YARN necessitates a
binary distribution of Spark that is built with YARN support.

20. How is Streaming implemented in Spark? Explain with examples.


Spark Streaming is used for processing real-time streaming data. Thus it is a useful
addition to the core Spark API. It enables high-throughput and fault-tolerant stream
processing of live data streams. The fundamental stream unit is DStream which is
basically a series of RDDs (Resilient Distributed Datasets) to process the real-time
data. The data from different sources like Flume and HDFS is streamed and finally
processed to file systems, live dashboards, and databases. It is similar to batch
processing in that the input data stream is divided into small batches.
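
The micro-batching idea can be sketched without Spark: split an incoming sequence into small batches and apply the same batch computation to each. This is a toy illustration only; real DStreams form batches by time interval rather than by count:

```python
# Toy sketch of micro-batching: a live stream is treated as a series
# of small batches (in Spark Streaming, a DStream is a series of RDDs).
def micro_batches(stream, batch_size):
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch   # flush the final partial batch

events = [3, 1, 4, 1, 5, 9, 2]
# The same batch computation (here, a sum) is applied to every batch:
totals = [sum(b) for b in micro_batches(events, 3)]
assert totals == [8, 15, 2]
```

This is why the text says streaming "is similar to batch processing": each micro-batch is processed with ordinary batch logic, just continuously.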

21. What is EDGE streaming analytics?

Edge Streaming Analytics is a powerful cloud-based tool for creating stream
processing workflows that can be deployed to edge devices. It helps manufacturers
and industrial companies to reduce time-to-market for any analytics-based
improvement like predictive maintenance, operational excellence or energy
efficiency.
22. What are streaming analytics?
Streaming analytics, also known as event stream processing, is the analysis of
huge pools of current and "in-motion" data (event streams) through the use of
continuous queries.
23. What is Xively cloud for IoT?
Xively (formerly known as Cosm and Pachube) is an Internet of Things
(IoT) platform owned by Google. Xively offers product companies a way to
connect products, manage connected devices and the data they produce, and
integrate that data into other systems. It is pronounced "zively" (rhymes with
lively).
24. What is Python web framework?

A Web framework is a collection of packages or modules which allow developers
to write Web applications or services without having to handle such low-level
details as protocols, sockets, or process/thread management.
25. What is Django?
Ans: Django is a high-level Python Web framework that encourages rapid
development and clean, pragmatic design. Built by experienced developers, it takes
care of much of the hassle of Web development, so you can focus on writing your
app without needing to reinvent the wheel. It's free and open source.
26. What does Django mean?
Ans: Django is named after Django Reinhardt, a gypsy jazz guitarist from the
1930s to early 1950s who is known as one of the best guitarists of all time.
27. Which architectural pattern does Django Follow?
Ans: Django follows the Model-View-Template (MVT) architectural pattern.
28. Explain the architecture of Django?
Ans: Django is based on MVT architecture. It contains the following layers:
Models: It describes the database schema and data structure.
Views: The view layer is the user interface. It controls what a user sees; the view
retrieves data from the appropriate models, executes any calculations made on the
data, and passes it to the template.
Templates: It determines how the user sees it. It describes how the data received
from the views should be changed or formatted for display on the page.
Controller: The controller is the heart of the system, handling requests and
responses, setting up database connections, and loading add-ons. In Django, the
framework itself and its URL parsing serve as the controller.
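
The MVT flow can be sketched as a toy in plain Python (not actual Django code; all names here are invented): the framework plays the controller, routing a request to a view, which pulls data from the model and hands it to a template for display.

```python
# Model: describes the data (a dict stands in for the database layer).
MODEL = {"articles": [{"title": "Hello Django"}]}

def render_template(titles):
    """Template: decides how the data is formatted for the page."""
    return "\n".join("<h1>%s</h1>" % t for t in titles)

def article_view(model):
    """View: retrieves data from the model and passes it to the template."""
    titles = [a["title"] for a in model["articles"]]
    return render_template(titles)

# "Controller": the framework matching a URL to a view.
routes = {"/articles/": article_view}
response = routes["/articles/"](MODEL)
assert response == "<h1>Hello Django</h1>"
```

In real Django the pieces map to models.py, a template file, views.py, and urls.py respectively, but the division of responsibilities is the same.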
29. Is Django stable?
Ans: Yes, Django is quite stable. Many companies like Disqus, Instagram,
Pinterest, and Mozilla have been using Django for many years.
30. What is AWS IoT Core?

AWS IoT Core is a managed cloud platform that lets connected devices easily and
securely interact with cloud applications and other devices. AWS IoT Core can
support billions of devices and trillions of messages, and can process and route
those messages to AWS endpoints and to other devices reliably and securely. With
AWS IoT Core, your applications can keep track of and communicate with all your
devices, all the time, even when they aren’t connected.
31. What does AWS IoT Core offer?

Connectivity between devices and the AWS cloud. First, with AWS IoT Core you
can communicate with connected devices securely, with low latency and with low
overhead. The communication can scale to as many devices as you want. AWS IoT
Core supports standard communication protocols (HTTP, MQTT, and WebSockets
are supported currently). Communication is secured using TLS.

Processing data sent from connected devices. Secondly, with AWS IoT Core you
can continuously ingest, filter, transform, and route the data streamed from
connected devices. You can take actions based on the data and route it for further
processing and analytics.

Application interaction with connected devices. Finally, AWS IoT Core accelerates
IoT application development. It serves as an easy to use interface for applications
running in the cloud and on mobile devices to access data sent from connected
devices, and send data and commands back to the devices.

32. Compare between AWS and OpenStack.

Criteria | AWS | OpenStack

License | Amazon proprietary | Open source

Operating system | Whatever AMIs are provided by AWS | Whatever the cloud
administrator provides

Performing repeatable operations | Through templates | Through text files

33. What is AWS?

AWS (Amazon Web Services) is a platform that provides secure cloud services,
database storage, compute power, content delivery, and other services to help
businesses scale and grow.
34. What is NETCONF/YANG ?
NETCONF/YANG provides a standardized way to programmatically update and
modify the configuration of a network device. YANG is the modelling language
that describes the configuration changes, whereas NETCONF is the protocol that
applies those changes to the relevant datastore (e.g., running, startup) on the
device.
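
As an illustration of the split (a hedged sketch, not taken from any specific device): YANG defines the data model, and NETCONF carries the change as an XML <edit-config> RPC aimed at a datastore. The interface name and description below are invented; the ietf-interfaces namespace is the standard YANG model for interface configuration.

```xml
<rpc message-id="101" xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <edit-config>
    <!-- the datastore being changed -->
    <target><running/></target>
    <config>
      <!-- payload structured according to a YANG model -->
      <interfaces xmlns="urn:ietf:params:xml:ns:yang:ietf-interfaces">
        <interface>
          <name>eth0</name>
          <description>uplink to core</description>
        </interface>
      </interfaces>
    </config>
  </edit-config>
</rpc>
```

The device validates the payload against its YANG model before applying it to the targeted datastore.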
Part –C
1. Explain the role of machine learning.
2. Describe NoSQL databases and their types.
3. Explain the Hadoop ecosystem.
4. Discuss Edge Streaming Analytics and Network Analytics in detail.
5. Explain the Xively Cloud for IoT.
