Processing Unstructured Data in Hadoop

The document is a question bank focused on Data Analytics and Supporting Services, covering topics such as structured and unstructured data, machine learning types, Hadoop and Spark components, and data security. It includes multiple-choice questions, explanations of key concepts, and comparisons between technologies like Hadoop and Spark. Additionally, it addresses various frameworks and tools relevant to data processing and analytics.

Question Bank

UNIT-IV
Data Analytics and Supporting Services

1. A traditional RDBMS is unable to process –


a. Structured data
b. Unstructured data
c. Both structured and unstructured data
d. None of these
2. Structured data is managed in a database using –
a. .NET Framework
b. Structured Query Language
c. Natural Language Processing
d. All of these
3. What are the main components of Hadoop Ecosystem?
a. MapReduce, HDFS, YARN
b. GraphX, Gelly
c. CEP
d. None of the mentioned
4. NoSQL databases store unstructured data with no particular schema. (True / False)
5. Which of the following is not a NoSQL database?
a. MongoDB
b. SQL Server
c. Cassandra
d. None of the mentioned
6. ____________ is a distributed machine learning framework on top of Spark.
a. MLlib
b. Spark Streaming
c. GraphX
d. RDDs
7. ________________ is a resource management platform responsible for managing
compute resources in the cluster and using them in order to schedule users'
applications.
a. Hadoop Common
b. Hadoop Distributed File System (HDFS)
c. Hadoop YARN
d. Hadoop MapReduce
8. In simple term, machine learning is
a. Training based on historical data
b. Prediction to answer a query
c. Both a and b
d. None
9. Deep learning is
a. Subfield of machine learning
b. Learns features by its own
c. Mimics the working function of several features
d. All of the above
10. In Spark, a ______________________ is a read-only collection of objects
partitioned across a set of machines that can be rebuilt if a partition is lost.
a. Spark Streaming
b. Resilient Distributed Dataset (RDD)
c. GraphX
d. MLlib
11. Consider the following statements in the context of Spark:
Statement 1: Spark also gives you control over how you can partition your
Resilient Distributed Datasets (RDDs)
Statement 2: Spark allows you to choose whether you want to persist Resilient
Distributed Dataset (RDD) onto disk or not.
a. Only statement 1 is true
b. Only statement 2 is true
c. Both statements are true
d. Both statements are false
12. Which of the following are the simplest NoSQL databases?
a. Key-value
b. Wide-column
c. Document
d. All of the mentioned
13. Point out the incorrect statement in the context of Cassandra:
a. Cassandra was originally designed at Facebook
b. Cassandra is a centralized key-value store
c. Cassandra is designed to handle large amounts of data across many commodity
servers, providing high availability with no single point of failure
d. Cassandra uses a ring-based DHT (Distributed Hash Table) but without finger
tables or routing
Part-B
1. What is the difference between structured, semi-structured and
unstructured data?
Structured data is organized by means of a relational database. Semi-structured
data is partially organized by means of formats such as XML/RDF. Unstructured
data, on the other hand, consists of plain character and binary data with no
predefined organization.
2. What is structured data?
Structured data is most often categorized as quantitative data, and it's the type of
data most of us are used to working with. Think of data that fits neatly within fixed
fields and columns in relational databases and spreadsheets. Examples of
structured data include names, dates, addresses, credit card numbers, stock
information, geolocation, and more.
3. What is unstructured data?
Unstructured data is most often categorized as qualitative data, and it cannot be
processed and analyzed using conventional tools and methods.
Examples of unstructured data include text, video, audio, mobile activity, social
media activity, satellite imagery, surveillance imagery – the list goes on and on.
Unstructured data is difficult to deconstruct because it has no pre-defined model,
meaning it cannot be organized in relational databases. Instead, non-relational,
or NoSQL databases, are best fit for managing unstructured data.
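
Because unstructured and semi-structured records need no predefined model, a document store can hold records whose fields differ from one another. A minimal sketch (plain Python dictionaries standing in for documents in a store such as MongoDB; the field names are invented):

```python
# Toy illustration (not a real NoSQL client): documents in a document
# store need not share a fixed schema, unlike rows in an RDBMS table.
tweets = [
    {"user": "alice", "text": "hello", "hashtags": ["hi"]},
    {"user": "bob", "text": "photo day", "media": {"type": "image", "url": "x.jpg"}},
]

def fields(doc):
    """Return the set of field names a document happens to have."""
    return set(doc)

# Each document carries its own structure:
assert fields(tweets[0]) != fields(tweets[1])
```

A relational table would force every row into the same columns; here each record simply carries whatever fields it needs.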
4. How would you secure data in motion as well as data at rest?
Data encryption protects data at rest and in motion. Securing data at the
perimeter through measures like firewalls is simply a band-aid; from regular use to
warehousing, data must be protected at each point throughout its lifecycle.
What is the difference between data at rest and data in transit?
Data in transit, or data in motion, is data actively moving from one location to
another, such as across the internet or through a private network. Data protection
at rest aims to secure inactive data stored on any device or network.
5. What are the three pillars of Big Data?
 Structured Data
 Unstructured Data and
 Semi Structured Data

6. What are the different types of Machine Learning?


There are three ways in which machines learn:

1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning

Supervised Learning:

Supervised learning is a method in which the machine learns using labeled data.
 It is like learning under the guidance of a teacher
 Training dataset is like a teacher which is used to train the machine
 Model is trained on a pre-defined dataset before it starts making decisions
when given new data

Unsupervised Learning:

Unsupervised learning is a method in which the machine is trained on unlabelled
data, without any guidance.

 It is like learning without a teacher.
 Model learns through observation & finds structures in data.
 Model is given a dataset and is left to automatically find patterns and
relationships in that dataset by creating clusters.

Reinforcement Learning:

Reinforcement learning involves an agent that interacts with its environment by
producing actions & discovers errors or rewards.

 It is like being stuck on an isolated island, where you must explore the
environment and learn how to live and adapt to the living conditions on your
own.
 Model learns through the trial-and-error method
 It learns on the basis of reward or penalty given for every action it performs
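
The trial-and-error idea above can be sketched in a few lines: the agent repeatedly tries actions, observes a reward or penalty, and nudges its value estimates accordingly. This is a toy illustration with made-up actions and rewards, not a full reinforcement learning algorithm:

```python
import random

# Toy sketch of reward-based learning: the agent tries actions,
# receives a reward or penalty, and keeps a running value estimate
# per action, gradually coming to prefer the better one.
random.seed(0)
rewards = {"left": -1.0, "right": +1.0}   # environment, hidden from the agent
values = {"left": 0.0, "right": 0.0}      # the agent's estimates
alpha = 0.5                                # learning rate

for _ in range(50):
    action = random.choice(list(values))            # explore by trial
    r = rewards[action]                             # observe reward/penalty
    values[action] += alpha * (r - values[action])  # move estimate toward r

best = max(values, key=values.get)
assert best == "right"   # the agent has learned which action pays off
```

After enough trials the estimate for each action approaches its true reward, which is the essence of learning from reward and penalty rather than from labeled examples.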

7. How would you explain Machine Learning to a school-going kid?

 Suppose your friend invites you to his party where you meet total strangers.
Since you have no idea about them, you will mentally classify them on the
basis of gender, age group, dressing, etc.
 In this scenario, the strangers represent unlabeled data and the process of
classifying unlabeled data points is nothing but unsupervised learning.
 Since you didn’t use any prior knowledge about people and classified them
on-the-go, this becomes an unsupervised learning problem.

8. How does Deep Learning differ from Machine Learning?


Deep Learning: a form of machine learning that is inspired by the structure of the
human brain and is particularly effective in feature detection.

Machine Learning: all about algorithms that parse data, learn from that data, and
then apply what they've learned to make informed decisions.

9. Explain Classification and Regression


10. How is KNN different from K-means clustering?

11. Name some companies that use Hadoop.

 Yahoo (one of the biggest users & more than 80% code contributor to
Hadoop)
 Facebook
 Netflix
 Amazon
 Adobe
 eBay
 Hulu
 Spotify
 Rubikloud
 Twitter

12. What are the core components of Hadoop?


Hadoop is an open-source software framework for distributed storage and
processing of large datasets. Apache Hadoop core components are HDFS,
MapReduce, and YARN.
 HDFS- Hadoop Distributed File System (HDFS) is the primary storage system
of Hadoop. HDFS stores very large files on a cluster of commodity
hardware. It works on the principle of storing a small number of large files
rather than a huge number of small files. HDFS stores data reliably even in
the case of hardware failure. It provides high-throughput access to an
application by accessing data in parallel.
 MapReduce- MapReduce is the data processing layer of Hadoop. An
application written for it processes large structured and unstructured data stored
in HDFS. MapReduce processes a huge amount of data in parallel. It does this by
dividing the job (submitted job) into a set of independent tasks (sub-jobs). In
Hadoop, MapReduce works by breaking the processing into two
phases: Map and Reduce. The Map is the first phase of processing, where we
specify all the complex logic code. Reduce is the second phase of processing,
where we specify light-weight processing like aggregation/summation.
 YARN- YARN is the processing framework in Hadoop. It provides resource
management and allows multiple data processing engines, such as real-
time streaming, data science, and batch processing.
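
The two MapReduce phases can be sketched for the classic word-count example. This is a single-process toy illustration; a real Hadoop job runs the same two phases distributed across a cluster:

```python
from collections import defaultdict

# Map phase: emit an intermediate (word, 1) pair for every word.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Reduce phase: lightweight aggregation -- sum the counts per word.
# (The shuffle/sort step between the phases is implicit here.)
def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

data = ["big data big ideas", "big data"]
assert reduce_phase(map_phase(data)) == {"big": 3, "data": 2, "ideas": 1}
```

The map step holds the "complex logic" (tokenizing each record), while the reduce step is the light aggregation, exactly the split described above.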
13. What is Kafka?

Kafka is an open-source message broker project developed by the Apache
Software Foundation. It is written in Scala and is a distributed publish-subscribe
messaging system.

Kafka Salient Features

Feature | Description

High Throughput | Support for millions of messages with modest hardware

Scalability | Highly scalable distributed systems with no downtime

Replication | Messages are replicated across the cluster to provide support for
multiple subscribers and balance the consumers in case of failures

Durability | Provides support for persistence of messages to disk

Stream Processing | Used with real-time streaming applications like Apache Spark & Storm

Data Loss | Kafka with proper configurations can ensure zero data loss

14. List the various components in Kafka.

The four major components of Kafka are:

 Topic – a stream of messages belonging to the same type
 Producer – publishes messages to a topic
 Brokers – a set of servers where the published messages are stored
 Consumer – subscribes to various topics and pulls data from the brokers
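
The four roles can be illustrated with a toy in-memory model. This is an analogy only, not the real Kafka API; the class and topic names are invented:

```python
from collections import defaultdict

class Broker:
    """Stores published messages, grouped by topic."""
    def __init__(self):
        self.topics = defaultdict(list)   # topic -> stored messages

class Producer:
    """Publishes messages to a topic on a broker."""
    def __init__(self, broker):
        self.broker = broker
    def publish(self, topic, message):
        self.broker.topics[topic].append(message)

class Consumer:
    """Subscribes to topics and pulls data from the broker."""
    def __init__(self, broker):
        self.broker = broker
        self.offsets = defaultdict(int)   # topic -> next position to read
    def poll(self, topic):
        msgs = self.broker.topics[topic][self.offsets[topic]:]
        self.offsets[topic] += len(msgs)
        return msgs

broker = Broker()
Producer(broker).publish("sensor-data", {"temp": 21})
consumer = Consumer(broker)
assert consumer.poll("sensor-data") == [{"temp": 21}]
assert consumer.poll("sensor-data") == []   # already consumed up to the offset
```

Note the pull model: the broker only stores messages; each consumer tracks its own read position, which is also how real Kafka consumers track offsets.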

15. Compare Hadoop and Spark.


We will compare Hadoop MapReduce and Spark based on the following aspects:

Apache Spark vs. Hadoop


Feature Criteria | Apache Spark | Hadoop

Speed | 100 times faster than Hadoop | Decent speed

Processing | Real-time & batch processing | Batch processing only

Difficulty | Easy because of high-level modules | Tough to learn

Recovery | Allows recovery of partitions | Fault-tolerant

Interactivity | Has interactive modes | No interactive mode except Pig & Hive

16. List the key features of Apache Spark.


The following are the key features of Apache Spark:

1. Polyglot
2. Speed
3. Multiple Format Support
4. Lazy Evaluation
5. Real Time Computation
6. Hadoop Integration
7. Machine Learning

Let us look at these features in detail:

1. Polyglot: Spark provides high-level APIs in Java, Scala, Python and R.
Spark code can be written in any of these four languages. It provides a shell
in Scala and Python. The Scala shell can be accessed through ./bin/spark-
shell and Python shell through ./bin/pyspark from the installed directory.

2. Speed: Spark runs up to 100 times faster than Hadoop MapReduce for
large-scale data processing. Spark is able to achieve this speed through controlled
partitioning. It manages data using partitions that help parallelize distributed
data processing with minimal network traffic.
3. Multiple Formats: Spark supports multiple data sources such as Parquet,
JSON, Hive and Cassandra. The Data Sources API provides a pluggable
mechanism for accessing structured data through Spark SQL. Data sources
can be more than just simple pipes that convert data and pull it into Spark.

4. Lazy Evaluation: Apache Spark delays its evaluation till it is absolutely
necessary. This is one of the key factors contributing to its speed. For
transformations, Spark adds them to a DAG of computation, and only when
the driver requests some data does this DAG actually get executed.

5. Real Time Computation: Spark's computation is real-time and has less
latency because of its in-memory computation. Spark is designed for
massive scalability and the Spark team has documented users of the system
running production clusters with thousands of nodes and supports several
computational models.

6. Hadoop Integration: Apache Spark provides smooth compatibility with
Hadoop. This is a great boon for all the Big Data engineers who started their
careers with Hadoop. Spark is a potential replacement for the MapReduce
functions of Hadoop, while Spark has the ability to run on top of an existing
Hadoop cluster using YARN for resource scheduling.

7. Machine Learning: Spark's MLlib is the machine learning component
which is handy when it comes to big data processing. It eradicates the need
to use multiple tools, one for processing and one for machine learning. Spark
provides data engineers and data scientists with a powerful, unified engine
that is both fast and easy to use.
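
The lazy evaluation described in point 4 can be sketched with Python generators, which likewise defer work until a result is demanded. This is an analogy, not Spark code:

```python
# Sketch of lazy evaluation: like a Spark transformation, building the
# pipeline does no work; only the final "action" forces execution.
log = []

def transform(xs):
    for x in xs:
        log.append(x)          # side effect records when work actually happens
        yield x * 2

pipeline = transform(range(3))  # "transformation": nothing runs yet
assert log == []                # no work has been done so far

result = list(pipeline)         # "action": forces the pipeline to execute
assert result == [0, 2, 4]
assert log == [0, 1, 2]         # work happened only at the action
```

In Spark the same principle lets the engine see the whole DAG of transformations before running anything, so it can optimize and pipeline the work.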
17. What are the languages supported by Apache Spark and which is the most
popular one?
Apache Spark supports the following four languages: Scala, Java, Python and R.
Among these languages, Scala and Python have interactive shells for Spark. The
Scala shell can be accessed through ./bin/spark-shell and the Python shell
through ./bin/pyspark. Scala is the most used among them because Spark itself
is written in Scala.

18. What are benefits of Spark over MapReduce?


Spark has the following benefits over MapReduce:

1. Due to the availability of in-memory processing, Spark implements the
processing around 10 to 100 times faster than Hadoop MapReduce, whereas
MapReduce makes use of persistent storage for its data processing
tasks.
2. Unlike Hadoop, Spark provides inbuilt libraries to perform multiple tasks
from the same core, like batch processing, streaming, machine learning, and
interactive SQL queries. Hadoop, however, only supports batch processing.
3. Hadoop is highly disk-dependent whereas Spark promotes caching and in-
memory data storage.
4. Spark is capable of performing computations multiple times on the same
dataset. This is called iterative computation while there is no iterative
computing implemented by Hadoop.

19. What is YARN?


YARN is one of Hadoop's key features, and Spark can use it as a central
resource management platform to deliver scalable operations across the
cluster. YARN is a distributed container manager, like Mesos for example,
whereas Spark is a data processing tool. Spark can run on YARN, the same way
Hadoop MapReduce can run on YARN. Running Spark on YARN necessitates a
binary distribution of Spark that is built with YARN support.

20. How is Streaming implemented in Spark? Explain with examples.


Spark Streaming is used for processing real-time streaming data. Thus it is a useful
addition to the core Spark API. It enables high-throughput and fault-tolerant stream
processing of live data streams. The fundamental stream unit is DStream which is
basically a series of RDDs (Resilient Distributed Datasets) to process the real-time
data. The data from different sources like Flume and HDFS is streamed and finally
processed to file systems, live dashboards, and databases. It is similar to batch
processing in that the input data stream is divided into small batches.
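
The micro-batching idea can be sketched without Spark: split an incoming sequence into small batches and apply the same batch computation to each. This is a toy illustration only; real DStreams form batches by time interval rather than by count:

```python
# Toy sketch of micro-batching: a live stream is treated as a series
# of small batches (in Spark Streaming, a DStream is a series of RDDs).
def micro_batches(stream, batch_size):
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch   # flush the final partial batch

events = [3, 1, 4, 1, 5, 9, 2]
# The same batch computation (here, a sum) is applied to every batch:
totals = [sum(b) for b in micro_batches(events, 3)]
assert totals == [8, 15, 2]
```

This is why the text says streaming "is similar to batch processing": each micro-batch is processed with ordinary batch logic, just continuously.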

21. What is EDGE streaming analytics?

Edge Streaming Analytics is a powerful cloud-based tool for creating stream
processing workflows that can be deployed to edge devices. It helps manufacturers
and industrial companies to reduce time-to-market for any analytics-based
improvement like predictive maintenance, operational excellence or energy
efficiency.
22. What are streaming analytics?
Streaming analytics, also known as event stream processing, is the analysis of
huge pools of current and "in-motion" data (event streams) through the use of
continuous queries.
23. What is Xively cloud for IoT?
Xively (formerly known as Cosm and Pachube) is an Internet of Things
(IoT) platform owned by Google. Xively offers product companies a way to
connect products, manage connected devices and the data they produce, and
integrate that data into other systems. It is pronounced "zively" (rhymes with
lively).
24. What is Python web framework?

A Web framework is a collection of packages or modules which allow developers
to write Web applications or services without having to handle such low-level
details as protocols, sockets, or process/thread management.
25. What is Django?
Ans: Django is a high-level Python Web framework that encourages rapid
development and clean, pragmatic design. Built by experienced developers, it takes
care of much of the hassle of Web development, so you can focus on writing your
app without needing to reinvent the wheel. It's free and open source.
26. What does Django mean?
Ans: Django is named after Django Reinhardt, a gypsy jazz guitarist from the
1930s to early 1950s who is known as one of the best guitarists of all time.
27. Which architectural pattern does Django Follow?
Ans: Django follows the Model-View-Template (MVT) architectural pattern.
28. Explain the architecture of Django?
Ans: Django is based on MVT architecture. It contains the following layers:
Models: It describes the database schema and data structure.
Views: The view layer is the user interface. It controls what a user sees; the view
retrieves data from the appropriate models, executes any calculations made on the
data, and passes it to the template.
Templates: It determines how the user sees it. It describes how the data received
from the views should be changed or formatted for display on the page.
Controller: The controller is the heart of the system, handling requests and
responses, setting up database connections, and loading add-ons. In Django, the
framework itself and its URL parsing serve as the controller.
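
The MVT flow can be sketched as a toy in plain Python (not actual Django code; all names here are invented): the framework plays the controller, routing a request to a view, which pulls data from the model and hands it to a template for display.

```python
# Model: describes the data (a dict stands in for the database layer).
MODEL = {"articles": [{"title": "Hello Django"}]}

def render_template(titles):
    """Template: decides how the data is formatted for the page."""
    return "\n".join("<h1>%s</h1>" % t for t in titles)

def article_view(model):
    """View: retrieves data from the model and passes it to the template."""
    titles = [a["title"] for a in model["articles"]]
    return render_template(titles)

# "Controller": the framework matching a URL to a view.
routes = {"/articles/": article_view}
response = routes["/articles/"](MODEL)
assert response == "<h1>Hello Django</h1>"
```

In real Django the pieces map to models.py, a template file, views.py, and urls.py respectively, but the division of responsibilities is the same.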
29. Is Django stable?
Ans: Yes, Django is quite stable. Many companies like Disqus, Instagram,
Pinterest, and Mozilla have been using Django for many years.
30. What is AWS IoT Core?

AWS IoT Core is a managed cloud platform that lets connected devices easily and
securely interact with cloud applications and other devices. AWS IoT Core can
support billions of devices and trillions of messages, and can process and route
those messages to AWS endpoints and to other devices reliably and securely. With
AWS IoT Core, your applications can keep track of and communicate with all your
devices, all the time, even when they aren’t connected.
31. What does AWS IoT Core offer?

Connectivity between devices and the AWS cloud. First, with AWS IoT Core you
can communicate with connected devices securely, with low latency and with low
overhead. The communication can scale to as many devices as you want. AWS IoT
Core supports standard communication protocols (HTTP, MQTT, and WebSockets
are supported currently). Communication is secured using TLS.

Processing data sent from connected devices. Secondly, with AWS IoT Core you
can continuously ingest, filter, transform, and route the data streamed from
connected devices. You can take actions based on the data and route it for further
processing and analytics.

Application interaction with connected devices. Finally, AWS IoT Core accelerates
IoT application development. It serves as an easy to use interface for applications
running in the cloud and on mobile devices to access data sent from connected
devices, and send data and commands back to the devices.

32. Compare between AWS and OpenStack.

Criteria | AWS | OpenStack

License | Amazon proprietary | Open source

Operating system | Whatever AMIs are provided by AWS | Whatever the cloud
administrator provides

Performing repeatable operations | Through templates | Through text files

33. What is AWS?

AWS (Amazon Web Services) is a platform that provides secure cloud services,
database storage, compute power, content delivery, and other services to help
businesses scale and grow.
34. What is NETCONF/YANG ?
NETCONF/YANG provides a standardized way to programmatically update and
modify the configuration of a network device. YANG is the modelling language
that describes the configuration changes, whereas NETCONF is the protocol that
applies those changes to the relevant datastore (e.g., running, startup) on the
device.
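
As an illustration of the split (a hedged sketch, not taken from any specific device): YANG defines the data model, and NETCONF carries the change as an XML <edit-config> RPC aimed at a datastore. The interface name and description below are invented; the ietf-interfaces namespace is the standard YANG model for interface configuration.

```xml
<rpc message-id="101" xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <edit-config>
    <!-- the datastore being changed -->
    <target><running/></target>
    <config>
      <!-- payload structured according to a YANG model -->
      <interfaces xmlns="urn:ietf:params:xml:ns:yang:ietf-interfaces">
        <interface>
          <name>eth0</name>
          <description>uplink to core</description>
        </interface>
      </interfaces>
    </config>
  </edit-config>
</rpc>
```

The device validates the payload against its YANG model before applying it to the targeted datastore.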
Part –C
1. Explain the role of machine learning.
2. Describe NoSQL databases and their types.
3. Explain the Hadoop ecosystem.
4. Discuss Edge Streaming Analytics and Network Analytics in detail.
5. Explain the Xively Cloud for IoT.
