UNIT 2 Notes by ARUN JHAPATE
According to Gartner:
"Big data is high-volume, high-velocity and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and
decision making."
From Wikipedia:
Big Data is a broad term for data sets so large or complex that they are difficult to process
using traditional data processing applications. Challenges include analysis, capture,
curation, search, sharing, storage, transfer, visualization, and information privacy.
Volume: The amount of data is immense. Each day 2.3 trillion gigabytes of new
data is being created.
Velocity: The speed of data (always in flux) and of its processing (analysis of streaming
data to produce near-real-time or real-time results).
Variety: The different types of data, structured as well as unstructured.
Visibility: This dimension refers to a customer's ability to see and track their
experience or order through the operations process. High-visibility operations include
courier companies, where you can track your package online, and retail stores where you
pick up the goods and purchase them over the counter.
Value: Value is the end game. After addressing volume, velocity, variety,
variability, veracity, and visualization – which takes a lot of time, effort and
resources – you want to be sure your organization is getting value from the data.
Variability: Variability is different from variety. A coffee shop may offer 6
different blends of coffee, but if you get the same blend every day and it tastes
different every day, that is variability. The same is true of data; if the meaning is
constantly changing it can have a huge impact on your data homogenization.
It is the combination of these factors, high-volume, high-velocity and high-variety that
serves as the basis for data to be termed Big Data. Big Data platforms and solutions
provide the tools, methods and technologies used to capture, curate, store, search and
analyze the data to find new correlations, relationships and trends that were previously
unavailable.
Big data analytics examines large amounts of data to uncover hidden patterns,
correlations and other insights. With today's technology, it's possible to analyze your
data and get answers from it immediately. Big Data Analytics helps you to understand
your organization better. With the use of Big data analytics, one can make informed
decisions without blindly relying on guesses.
The primary goal of Big Data applications is to help companies make more informed
business decisions by analyzing large volumes of data. It could include web server logs,
Internet click stream data, social media content and activity reports, text from customer
emails, mobile phone call details and machine data captured by multiple sensors.
Organisations from different domains are investing in Big Data applications to examine
large data sets and uncover hidden patterns, unknown correlations, market trends,
customer preferences and other useful business information. The application areas
covered here include:
Big Data Applications: Healthcare
Big Data Applications: Media & Entertainment
o Scheduling optimization
o Increasing acquisition and retention
o Ad targeting
o Content monetization and new product development
BIG DATA TECHNOLOGIES
The list of technology vendors offering big data solutions is seemingly infinite. Many of
the big data solutions that are particularly popular right now fit into one of the following
15 categories:
1. Hadoop
While Apache Hadoop may not be as dominant as it once was, it's nearly impossible to
talk about big data without mentioning this open source framework for distributed
processing of large data sets. Last year, Forrester predicted, "100% of all large enterprises
will adopt it (Hadoop and related technologies such as Spark) for big data
analytics within the next two years."
Over the years, Hadoop has grown to encompass an entire ecosystem of related software,
and many commercial big data solutions are based on Hadoop. In fact, Zion Market
Research forecasts that the market for Hadoop-based products and services will continue
to grow at a 50 percent CAGR through 2022, when it will be worth $87.14 billion, up
from $7.69 billion in 2016.
Key Hadoop vendors include Cloudera, Hortonworks and MapR, and the leading public
clouds all offer services that support the technology.
2. Spark
Apache Spark is part of the Hadoop ecosystem, but its use has become so widespread that
it deserves a category of its own. It is an engine for processing big data within Hadoop,
and it's up to one hundred times faster than the standard Hadoop engine, MapReduce.
In the AtScale 2016 Big Data Maturity Survey, 25 percent of respondents said that they
had already deployed Spark in production, and 33 percent more had Spark projects in
development. Clearly, interest in the technology is sizable and growing, and many
vendors with Hadoop offerings also offer Spark-based products.
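To make this concrete, here is a minimal PySpark sketch of a word count job; it assumes a local Spark installation with the pyspark package and an illustrative input file named input.txt, neither of which comes from the notes above.

# Minimal PySpark sketch: word count on a local text file (illustrative assumptions:
# pyspark is installed and input.txt exists).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").master("local[*]").getOrCreate()

lines = spark.sparkContext.textFile("input.txt")      # read the file as an RDD of lines
counts = (lines.flatMap(lambda line: line.split())    # split each line into words
               .map(lambda word: (word, 1))           # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))      # sum the counts per word in parallel

for word, count in counts.take(10):                   # pull a small sample back to the driver
    print(word, count)

spark.stop()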
3. R
R, another open source project, is a programming language and software environment designed for
working with statistics. The darling of data scientists, it is managed by the R Foundation
and available under the GPL 2 license. Many popular integrated development
environments (IDEs), including Eclipse and Visual Studio, support the language.
Several organizations that rank the popularity of various programming languages say that
R has become one of the most popular languages in the world. For example,
the IEEE says that R is the fifth most popular programming language, and
both Tiobe and RedMonk rank it 14th. This is significant because the programming
languages near the top of these charts are usually general-purpose languages that can be
used for many different kinds of work. For a language that is used almost exclusively for
big data projects to be so near the top demonstrates the significance of big data and the
importance of this language in its field.
4. Data Lakes
To make it easier to access their vast stores of data, many enterprises are setting up data
lakes. These are huge data repositories that collect data from many different sources and
store it in its natural state. This is different than a data warehouse, which also collects
data from disparate sources, but processes it and structures it for storage. In this case, the
lake and warehouse metaphors are fairly accurate. If data is like water, a data lake is
natural and unfiltered like a body of water, while a data warehouse is more like a
collection of water bottles stored on shelves.
Data lakes are particularly attractive when enterprises want to store data but aren't yet
sure how they might use it. A lot of Internet of Things (IoT) data might fit into that
category, and the IoT trend is playing into the growth of data lakes.
MarketsandMarkets predicts that data lake revenue will grow from $2.53 billion in 2016
to $8.81 billion by 2021.
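As a rough sketch of the distinction (the file layout, event record and table below are illustrative assumptions, not a product recipe), a data lake keeps each record exactly as it arrived, while a warehouse-style store parses and structures it first:

# Rough sketch: raw, as-is storage (lake) versus parsed, structured storage (warehouse).
# The event, paths and table are illustrative.
import json, os, sqlite3

event = {"sensor_id": 7, "reading": "23.4C", "raw_payload": "<xml>...</xml>"}

# Data lake: store the record in its natural state, as it arrived.
os.makedirs("lake/2016-01-01", exist_ok=True)
with open("lake/2016-01-01/event_0001.json", "w") as f:
    json.dump(event, f)

# Warehouse: parse, clean and structure the record before storing it.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS readings (sensor_id INTEGER, celsius REAL)")
conn.execute("INSERT INTO readings VALUES (?, ?)", (7, 23.4))
conn.commit()
conn.close()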
5. NoSQL Databases
NoSQL databases specialize in storing unstructured data and providing fast performance,
although they don't provide the same level of consistency as RDBMSes. Popular NoSQL
databases include MongoDB, Redis, Cassandra, Couchbase and many others; even the
leading RDBMS vendors like Oracle and IBM now also offer NoSQL databases.
NoSQL databases have become increasingly popular as the big data trend has grown.
According to Allied Market Research the NoSQL market could be worth $4.2 billion by
2020. However, the market for RDBMSes is still much, much larger than the market for
NoSQL.
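As a small illustration of the schema-less, key-value style many NoSQL stores use, the sketch below writes and reads a value through the redis-py client; it assumes a Redis server running locally on the default port, and the key name is made up for the example.

# Minimal key-value sketch with the redis-py client (assumes a local Redis server;
# the key and value are illustrative).
import redis

r = redis.Redis(host="localhost", port=6379)
r.set("user:42:last_login", "2016-11-03T10:15:00Z")   # write without any fixed schema
print(r.get("user:42:last_login"))                    # values come back as bytes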
6. Predictive Analytics
Predictive analytics is a sub-set of big data analytics that attempts to forecast future
events or behavior based on historical data. It draws on data mining, modeling and
machine learning techniques to predict what will happen next. It is often used for fraud
detection, credit scoring, marketing, finance and business analysis purposes.
In recent years, advances in artificial intelligence have enabled vast improvements in the
capabilities of predictive analytics solutions. As a result, enterprises have begun to invest
more in big data solutions with predictive capabilities. Many vendors, including
Microsoft, IBM, SAP, SAS, Statistica, RapidMiner, KNIME and others, offer predictive
analytics solutions. Zion Market Research says the Predictive Analytics market generated
$3.49 billion in revenue in 2016, a number that could reach $10.95 billion by 2022.
7. In-Memory Databases
In any computer system, the memory, also known as RAM, is orders of magnitude
faster than the long-term storage. If a big data analytics solution can process data that is
stored in memory, rather than data stored on a hard drive, it can perform dramatically
faster. And that's exactly what in-memory database technology does.
Many of the leading enterprise software vendors, including SAP, Oracle, Microsoft and
IBM, now offer in-memory database technology. In addition, several smaller companies
like Teradata, Tableau, Volt DB and DataStax offer in-memory database solutions.
Research from MarketsandMarkets estimates that total sales of in-memory technology
were $2.72 billion in 2016 and may grow to $6.58 billion by 2021.
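The idea can be demonstrated with SQLite's ":memory:" mode, where the whole database lives in RAM and queries never touch the disk; the table and rows below are illustrative, and this is only a sketch of the concept, not an enterprise in-memory database.

# Sketch of the in-memory idea: the database is held entirely in RAM.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120.0), ("south", 340.5), ("north", 99.9)])

for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(region, total)
conn.close()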
8. Big Data Security
Because big data repositories present an attractive target to hackers and advanced
persistent threats, big data security is a large and growing concern for enterprises. In the
AtScale survey, security was the second fastest-growing area of concern related to big
data.
According to the IDG report, the most popular types of big data security solutions include
identity and access controls (used by 59 percent of respondents), data encryption (52
percent) and data segregation (42 percent). Dozens of vendors offer big data security
solutions, and Apache Ranger, an open source project from the Hadoop ecosystem, is
also attracting growing attention.
9. Data Governance
Closely related to the idea of security is the concept of governance. Data governance is a
broad topic that encompasses all the processes related to the availability, usability and
integrity of data. It provides the basis for making sure that the data used for big data
analytics is accurate and appropriate, as well as providing an audit trail so that business
analysts or executives can see where data originated.
In the New Vantage Partners survey, 91.8 percent of the Fortune 1000 executives
surveyed said that governance was either critically important (52.5 percent) or important
(39.3 percent) to their big data initiatives. Vendors offering big data governance tools
include Collibra, IBM, SAS, Informatica, Adaptive and SAP.
10. Self-Service Capabilities
With data scientists and other big data experts in short supply — and commanding large
salaries — many organizations are looking for big data analytics tools that allow business
users to serve their own needs. In fact, a report from Research and
Markets estimates that the self-service business intelligence market generated $3.61
billion in revenue in 2016 and could grow to $7.31 billion by 2021. And Gartner has
noted, "The modern BI and analytics platform emerged in the last few years to meet new
organizational requirements for accessibility, agility and deeper analytical insight,
shifting the market from IT-led, system-of-record reporting to business-led, agile
analytics including self-service."
Hoping to take advantage of this trend, multiple business intelligence and big data
analytics vendors, such as Tableau, Microsoft, IBM, SAP, Splunk, Syncsort, SAS,
TIBCO, Oracle and others have added self-service capabilities to their solutions. Time will
tell whether any or all of the products turn out to be truly usable by non-experts and
whether they will provide the business value organizations are hoping to achieve with
their big data initiatives.
11. Artificial Intelligence
While the concept of artificial intelligence (AI) has been around nearly as long as there
have been computers, the technology has only become truly usable within the past couple
of years. In many ways, the big data trend has driven advances in AI, particularly in two
subsets of the discipline: machine learning and deep learning.
The standard definition of machine learning is that it is technology that gives "computers
the ability to learn without being explicitly programmed." In big data analytics, machine
learning technology allows systems to look at historical data, recognize patterns, build
models and predict future outcomes. It is also closely associated with predictive analytics.
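As a minimal illustration of learning from historical data to predict a future outcome (the tiny failure data set and the choice of a decision tree are made up for the example), a few lines of scikit-learn look like this:

# Sketch: learn a pattern from historical observations and predict a new case.
# The data (hours in use, temperature -> failed or not) is illustrative.
from sklearn.tree import DecisionTreeClassifier

X_history = [[100, 40], [1200, 85], [300, 45], [1500, 90], [200, 42], [1400, 88]]
y_history = [0, 1, 0, 1, 0, 1]          # 1 = equipment failed, 0 = did not fail

model = DecisionTreeClassifier().fit(X_history, y_history)
print(model.predict([[1300, 87]]))      # predicted outcome for a new, unseen case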
Deep learning is a type of machine learning technology that relies on artificial neural
networks and uses multiple layers of algorithms to analyze data. As a field, it holds a lot
of promise for allowing analytics tools to recognize the content in images and videos and
then process it accordingly.
Experts say this area of big data tools seems poised for a dramatic takeoff. IDC has
predicted, "By 2018, 75 percent of enterprise and ISV development will include
cognitive/AI or machine learning functionality in at least one application, including all
business analytics tools."
Leading AI vendors with tools related to big data include Google, IBM, Microsoft and
Amazon Web Services, and dozens of small startups are developing AI technology (and
getting acquired by the larger technology vendors).
12. Streaming Analytics
As organizations have become more familiar with the capabilities of big data analytics
solutions, they have begun demanding faster and faster access to insights. For these
enterprises, streaming analytics, with the ability to analyze data as it is being created, is
something of a holy grail. They are looking for solutions that can accept input from
multiple disparate sources, process it and return insights immediately, or as close to it
as possible. This is particularly desirable when it comes to new IoT deployments, which are
helping to drive the interest in streaming big data analytics.
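A toy sketch of the streaming idea (the simulated sensor readings are an illustrative assumption): rather than storing everything and analyzing it later, each value updates a running statistic, and unusual readings trigger an immediate reaction.

# Sketch of streaming analytics: analyze each value as it arrives.
import random
import time

def sensor_stream(n):
    # Simulate n temperature readings arriving one at a time.
    for _ in range(n):
        yield 20 + random.random() * 10
        time.sleep(0.01)                 # stand-in for real arrival delays

count, total = 0, 0.0
for reading in sensor_stream(100):
    count += 1
    total += reading
    running_avg = total / count          # the insight is updated on every new value
    if reading > 28:                     # react immediately to an unusual reading
        print(f"alert: {reading:.1f} (running average {running_avg:.1f})")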
Several vendors offer products that promise streaming analytics capabilities. They
include IBM, Software AG, SAP, TIBCO, Oracle, DataTorrent, SQLstream, Cisco,
Informatica and others. MarketsandMarkets believes the streaming analytics solutions
brought in $3.08 billion in revenue in 2016, which could increase to $13.70 billion by
2021.
13. Edge Computing
In addition to spurring interest in streaming analytics, the IoT trend is also generating
interest in edge computing. In some ways, edge computing is the opposite of cloud
computing. Instead of transmitting data to a centralized server for analysis, edge
computing systems analyze data very close to where it was created — at the edge of the
network.
The advantage of an edge computing system is that it reduces the amount of information
that must be transmitted over the network, thus reducing network traffic and related costs.
It also decreases demands on data centers or cloud computing facilities, freeing up
capacity for other workloads and eliminating a potential single point of failure.
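A toy sketch of that trade-off (the readings and the send function are illustrative assumptions): the device summarizes its own readings and transmits only the summary and any anomalies, instead of every raw value.

# Sketch of edge computing: summarize locally, transmit only the summary.
raw_readings = [21.3, 21.4, 21.2, 35.0, 21.5, 21.3]     # collected on the device

summary = {
    "count": len(raw_readings),
    "mean": sum(raw_readings) / len(raw_readings),
    "max": max(raw_readings),
    "anomalies": [r for r in raw_readings if r > 30],    # only unusual values in full
}

def send_to_cloud(payload):
    print("transmitting", payload)       # stand-in for a real network call

send_to_cloud(summary)                   # one small message instead of every reading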
While the market for edge computing, and more specifically for edge computing
analytics, is still developing, some analysts and venture capitalists have begun calling the
technology the "next big thing."
14. Blockchain
Also a favorite with forward-looking analysts and venture capitalists, blockchain is the
distributed database technology that underlies the Bitcoin digital currency. The unique
feature of a blockchain database is that once data has been written, it cannot be deleted or
changed after the fact. In addition, it is highly secure, which makes it an excellent choice
for big data applications in sensitive industries like banking, insurance, health care, retail
and others.
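A minimal sketch of the append-only idea (the records are made up, and this shows the general hash-chaining principle, not any particular blockchain's implementation): each block stores the hash of the previous block, so changing an earlier record breaks the chain and is detectable.

# Sketch of hash chaining: tampering with an earlier block invalidates the chain.
import hashlib, json

def block_hash(block):
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

chain = [{"index": 0, "data": "genesis", "prev_hash": "0" * 64}]

def append(data):
    prev = chain[-1]
    chain.append({"index": prev["index"] + 1, "data": data,
                  "prev_hash": block_hash(prev)})

append("payment: A -> B, 10 units")
append("payment: B -> C, 4 units")

chain[1]["data"] = "payment: A -> B, 1000 units"   # attempt to rewrite history
ok = all(chain[i]["prev_hash"] == block_hash(chain[i - 1]) for i in range(1, len(chain)))
print("chain valid:", ok)                          # False after the tampering above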
Blockchain technology is still in its infancy and use cases are still developing. However,
several vendors, including IBM, AWS, Microsoft and multiple startups, have rolled out
experimental or introductory solutions built on blockchain technology.
15. Prescriptive Analytics
Many analysts divide big data analytics tools into four big categories. The first,
descriptive analytics, simply tells what happened. The next type, diagnostic analytics,
goes a step further and provides a reason for why events occurred. The third type,
predictive analytics, discussed in depth above, attempts to determine what will happen
next. This is as sophisticated as most analytics tools currently on the market can get.
However, there is a fourth type of analytics that is even more sophisticated, although very
few products with these capabilities are available at this time. Prescriptive analytics
offers advice to companies about what they should do in order to make a desired result
happen. For example, while predictive analytics might give a company a warning that the
market for a particular product line is about to decrease, prescriptive analytics will
analyze various courses of action in response to those market changes and forecast the
most likely results.
Currently, very few enterprises have invested in prescriptive analytics, but many analysts
believe this will be the next big area of investment after organizations begin experiencing
the benefits of predictive analytics.
HADOOP
Hadoop is an open source distributed processing framework that manages data processing
and storage for big data applications running in clustered systems. It is at the center of a
growing ecosystem of big data technologies that are primarily used to support advanced
analytics initiatives, including predictive analytics, data mining and machine learning
applications. Hadoop can handle various forms of structured and unstructured data,
giving users more flexibility for collecting, processing and analyzing data than relational
databases and data warehouses provide.
Hadoop runs on clusters of commodity servers and can scale up to support thousands of
hardware nodes and massive amounts of data. It uses a namesake distributed file system
that's designed to provide rapid data access across the nodes in a cluster, plus fault-
tolerant capabilities so applications can continue to run if individual nodes fail.
Consequently, Hadoop became a foundational data management platform for big data
analytics uses after it emerged in the mid-2000s.
HISTORY OF HADOOP
Hadoop was created by computer scientists Doug Cutting and Mike Cafarella, initially to
support processing in the Nutch open source search engine and web crawler. After
Google published technical papers detailing its Google File System (GFS) and
MapReduce programming framework in 2003 and 2004, Cutting and Cafarella modified
earlier technology plans and developed a Java-based MapReduce implementation and a
file system modeled on Google's.
In early 2006, those elements were split off from Nutch and became a separate Apache
subproject, which Cutting named Hadoop after his son's stuffed elephant. At the same
time, Cutting was hired by internet services company Yahoo, which became the first
production user of Hadoop later in 2006.
Use of the framework grew over the next few years, and three independent Hadoop
vendors were founded: Cloudera in 2008, MapR a year later and Hortonworks as a Yahoo
spinoff in 2011. In addition, AWS launched a Hadoop cloud service called Elastic
MapReduce in 2009. That was all before Apache released Hadoop 1.0.0, which became
available in December 2011 after a succession of 0.x releases.
Put simply: Hadoop has two main components. The first component, the Hadoop
Distributed File System, helps split the data, put it on different nodes, replicate it and
manage it. The second component, MapReduce, processes the data on each node in
parallel and calculates the results of the job. There is also a resource management layer,
YARN (Yet Another Resource Negotiator), that schedules and manages the data processing jobs.
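A small local simulation of the MapReduce idea (this mimics the map, shuffle and reduce phases in plain Python for illustration; it is not the actual Hadoop API, and the input lines are made up):

# Sketch of MapReduce: map lines to (word, 1), group by word, reduce to counts.
from collections import defaultdict

lines = ["big data needs big storage", "hadoop splits big jobs"]

# Map phase: each line is processed independently, so different lines
# could run on different nodes.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: pairs with the same key are grouped together.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: each group is reduced to a single result, again independently.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)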
Key advantages of Hadoop include the following:
It can store and process vast amounts of structured and unstructured data quickly.
Application and data processing are protected against hardware failure: if one node goes down,
jobs are redirected automatically to other nodes so that the distributed computing doesn't fail.
The data doesn't have to be preprocessed before it's stored. Organizations can store as much
data as they want, including unstructured data such as text, videos and images, and decide how
to use it later.
It's scalable, so companies can add nodes to enable their systems to handle more data.
It can analyze data in real time to enable better decision making.
HADOOP APPLICATIONS
YARN greatly expanded the applications that Hadoop clusters can handle to include
stream processing and real-time analytics applications run in tandem with processing
engines, like Apache Spark and Apache Flink. For example, some manufacturers are
using real-time data that's streaming into Hadoop in predictive maintenance applications to try
to detect equipment failures before they occur. Fraud detection, website personalization and
customer experience scoring are other real-time use cases.
Because Hadoop can process and store such a wide assortment of data, it enables
organizations to set up data lakes as expansive reservoirs for incoming streams of
information. In a Hadoop data lake, raw data is often stored as is so data scientists and
other analysts can access the full data sets if need be; the data is then filtered and
prepared by analytics or IT teams as needed to support different applications.
Data lakes generally serve different purposes than traditional data warehouses that hold
cleansed sets of transaction data. But, in some cases, companies view their Hadoop data
lakes as modern-day data warehouses. Either way, the growing role of big data analytics
in business decision-making has made effective data governance and data security
processes a priority in data lake deployments.
Risk management -- financial institutions use Hadoop clusters to develop more accurate
risk analysis models for their customers. Financial services companies can use Hadoop to
build and run applications to assess risk, build investment models and develop trading
algorithms.
Predictive maintenance -- with input from IoT devices feeding data into big data
programs, companies in the energy industry can use Hadoop-powered analytics to help
predict when equipment might fail to determine when maintenance should be performed.
Supply chain risk management -- manufacturing companies, for example, can track the
movement of goods and vehicles so they can determine the costs of various transportation
options. Using Hadoop, manufacturers can analyze large amounts of historical, time-
stamped location data as well as map out potential delays so they can optimize their
delivery routes.
BEST BIG DATA ANALYTICS TOOLS
Big Data Analytics software is widely used in providing meaningful analysis of a large
set of data. This software helps in finding current market trends, customer preferences,
and other information.
Here are eight top Big Data Analytics tools with their key features.
1. Apache Hadoop
Apache Hadoop is the long-standing champion in the field of Big Data processing, well known
for its huge-scale data processing capabilities. This open source Big Data framework can run
on-prem or in the cloud and has quite low hardware requirements. The main Hadoop benefits
and features are as follows:
Hadoop Libraries — the needed glue for enabling third party modules to work
with Hadoop
2. Apache Spark
Apache Spark is the alternative — and in many aspects the successor — of Apache
Hadoop. Spark was built to address the shortcomings of Hadoop and it does this
incredibly well. For example, it can process both batch data and real-time data, and
operates up to 100 times faster than MapReduce. Spark provides in-memory data
processing capabilities, which are far faster than the disk-based processing used by
MapReduce. In addition, Spark works with HDFS, OpenStack and Apache Cassandra,
both in the cloud and on-prem, adding another layer of versatility to big data operations
for your business.
3. Apache Storm
Storm is another Apache product, a real-time framework for data stream processing,
which supports any programming language. The Storm scheduler balances the workload
between multiple nodes based on the topology configuration and works well with Hadoop
HDFS. Apache Storm has the following benefits:
Built-in fault-tolerance
Auto-restart on crashes
Clojure-written
4. Apache Cassandra
Apache Cassandra is one of the pillars behind Facebook's massive success, as it allows
processing of structured data sets distributed across a huge number of nodes across the globe.
It works well under heavy workloads due to its architecture without single points of failure,
and it boasts unique capabilities that no other NoSQL or relational database has, such as:
Built-in high-availability
5. MongoDB (https://www.guru99.com/mongodb-tutorials.html)
MongoDB is another great example of an open source NoSQL database with rich
features; it is cross-platform and compatible with many programming languages. IT Svit
uses MongoDB in a variety of cloud computing and monitoring solutions, and we
specifically developed a module for automated MongoDB backups using Terraform. The
most prominent MongoDB features are:
Stores any type of data, from text and integers to strings, arrays, dates and booleans
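A short pymongo sketch of that flexible document model (it assumes a MongoDB server running locally on the default port; the database, collection and document are illustrative):

# Sketch: one MongoDB document mixing strings, numbers, arrays, dates and booleans.
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["demo_db"]["events"]

events.insert_one({
    "user": "alice",
    "score": 42,
    "tags": ["big data", "nosql"],
    "created_at": datetime.utcnow(),
    "active": True,
})
print(events.find_one({"user": "alice"}))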
6. R Programming Environment
R is mostly used along with the JuPyteR stack (Julia, Python, R) for enabling wide-scale
statistical analysis and data visualization. The JupyteR Notebook is one of the four most
popular Big Data visualization tools, as it allows composing virtually any analytical model.
Key strengths include:
R is highly portable
R easily scales from a single test machine to vast Hadoop data lakes
7. Neo4j
Neo4j is an open source graph database that stores data as interconnected nodes and
relationships, with properties stored as key-value pairs. IT Svit has recently built a resilient
AWS infrastructure with Neo4j for one of our customers, and the database performs well
under a heavy workload of network data and graph-related requests.
8. Apache SAMOA
This is another member of the Apache family of tools used for Big Data processing. SAMOA
specializes in building distributed streaming algorithms for Big Data mining. This tool is
built with a pluggable architecture and must be used atop other Apache products, like the
Apache Storm engine mentioned earlier. Its other features used for Machine Learning include
the following:
Clustering
Classification
Normalization
Regression
Using Apache SAMOA enables distributed stream processing engines to provide tangible
benefits for real-time Big Data mining.
Final thoughts on the list of hot Big Data tools for 2018
The Big Data industry and data science evolve rapidly and have progressed a great deal lately,
with multiple Big Data projects and tools launched in 2017. Big Data remains one of the
hottest IT trends of 2018, along with IoT, blockchain, and AI & ML.
PREDICTIVE ANALYTICS
Predictive analytics is a form of advanced analytics that uses both new and
historical data to forecast activity, behavior and trends. It involves applying
statistical analysis techniques, analytical queries and automated machine learning
algorithms to data sets to create predictive models that place a numerical value --
or score -- on the likelihood of a particular event happening.
Predictive analytics software applications use variables that can be measured and
analyzed to predict the likely behavior of individuals, machinery or other entities.
For example, an insurance company is likely to take into account potential driving
safety variables, such as age, gender, location, type of vehicle and driving record,
when pricing and issuing auto insurance policies.
Multiple variables are combined into a predictive model capable of assessing
future probabilities with an acceptable level of reliability. The software relies
heavily on advanced algorithms and methodologies, such as logistic regression
models, time series analysis and decision trees.
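As a hedged illustration of that scoring idea (the variables and numbers are invented for the example and the model choice is ours, not taken from any real insurer), a logistic regression can turn a few measured variables into a probability score:

# Sketch of a predictive score: logistic regression outputs the probability of an event.
# Training data (driver age, years of driving record, past claims -> claim filed) is made up.
from sklearn.linear_model import LogisticRegression

X_train = [[18, 1, 2], [45, 20, 0], [30, 10, 1], [60, 35, 0], [22, 3, 2], [50, 25, 0]]
y_train = [1, 0, 1, 0, 1, 0]             # 1 = filed a claim, 0 = did not

model = LogisticRegression().fit(X_train, y_train)

new_applicant = [[25, 5, 1]]
score = model.predict_proba(new_applicant)[0][1]    # probability of a future claim
print(f"claim risk score: {score:.2f}")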
Predictive analytics has grown in prominence alongside the emergence of big data
systems. As enterprises have amassed larger and broader pools of data in Hadoop
clusters and other big data platforms, they have created increased data mining
opportunities to gain predictive insights. Heightened development and
commercialization of machine learning tools by IT vendors has also helped
expand predictive analytics capabilities.
MOBILE BUSINESS INTELLIGENCE
Mobile business intelligence (mobile BI) refers to the ability to provide business
and data analytics services to mobile/handheld devices and/or remote users. MBI enables
users with limited computing capacity to use and receive the same or similar features,
capabilities and processes as those found in a desktop-based business intelligence
software solution.
One of the major problems customers face when using mobile devices for
information retrieval is the fact that mobile BI is no longer as simple as the pure
display of BI content on a mobile device. Moreover, a mobile strategy has to be
defined to cope with different suppliers and systems as well as private phones.
Besides attempts to standardize with the same supplier, companies are also
concerned that solutions should have robust security features. These points have
led many to the conclusion that a proper concept and strategy must be in place
before supplying corporate information to mobile devices.
The first major benefit is the ability for end users to access information in their
mobile BI system at any time and from any location. This enables them to get
data and analytics in 'real time', which improves their daily operations and
means they can react more quickly to a wider range of events.
MBI works much like a standard BI software/solution but it is designed specifically for
handheld users. Typically, MBI requires a client end utility to be installed on mobile
devices, which remotely/wirelessly connect over the Internet or a mobile network to the
primary business intelligence application server. Upon connection, MBI users can
perform queries, and request and receive data. Similarly, clientless MBI solutions can be
accessed through a cloud server that provides Software as a Service business intelligence
(SaaS BI) or real-time business intelligence (RTBI, also called Real-Time BI).
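A rough sketch of what such a client-end request might look like (the server URL, endpoint, parameters and token below are entirely hypothetical, not any vendor's real API): the handheld client sends a query over HTTPS and receives the result set back.

# Hypothetical mobile BI client request; the endpoint and fields are illustrative only.
import requests

BI_SERVER = "https://bi.example.com/api/v1/query"          # hypothetical BI server

response = requests.get(
    BI_SERVER,
    params={"report": "daily_sales", "region": "north"},    # hypothetical query
    headers={"Authorization": "Bearer <access-token>"},     # placeholder token
    timeout=10,
)
response.raise_for_status()
for row in response.json().get("rows", []):
    print(row)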
WHAT IS CROWDSOURCING?
Crowdsourcing data collection consists of building data sets with the help of a large
group of people. There is a data source and there are data suppliers who are willing to
enrich the data with relevant, missing or new information.
This method originates from the scientific world. One of the first ever cases of
crowdsourcing was the Oxford English Dictionary. The project aimed to list all the words
that enjoy any recognized lifespan in standard English, along with their definitions and
explanations of usage. That was a gigantic task, so the dictionary creators invited the
crowd to help them on a voluntary basis.
Sounds familiar? Think no further than Wikipedia.
More than 1 million mappers work together to collect and supply data to OpenStreetMap,
making it full of valuable information about locations around the world.
THE IMPORTANCE OF CROWDSOURCING
The Internet is now a melting pot of user-generated content from blogs to Wikipedia
entries to YouTube videos. The distinction between producer and consumer is no longer
so clear-cut, as everyone is equipped with the tools needed to create as well as consume.
As a business strategy, soliciting customer input isn't new, and open source software has
proven the productivity possible through a large group of individuals.
THE HISTORY OF CROWDSOURCING
While the idea behind crowdsourcing isn't new, its active use online as a business-building
strategy has only been around since 2006. The phrase was initially coined by Jeff Howe, who
described a world in which people outside of a company contribute work toward that company's
projects. Video games have been utilizing crowdsourcing for many years through their beta
invitations: granting players early access to the game, studios request only that these
passionate gamers report bugs and gameplay issues as they encounter them, before the finished
product is released for sale and distribution.
Companies utilize crowdsourcing not only in a research and development capacity, but
also to simply get help from anyone for anything, whether it's word-of-mouth marketing,
creating content or giving feedback.
INFORMATION MANAGEMENT
Information management is concerned with deciding who should be able to access and read
information, and with whom it should or should not be shared.
Around 1970, when fourth-generation computing was in its early phase, computer scientists
developed various concepts for securing data (that is, preventing unauthorized access), and
they created the concept of the object, which focused on data security rather than just logic.
Before objects, there were entities called structures and unions that were used to manage
data-structure algorithms. These were quite similar to objects, but they could not encapsulate
behaviour the way an object encapsulates both its attributes and its behaviour. The concept of
object orientation developed with the first object-oriented language, SIMULA 67, but it
attracted far more attention when Bjarne Stroustrup applied the same concept in the release
of C++.
Need for Information Management in Web Applications
Web applications of the 20th century were not as secure as the web applications of today.
No system is perfect, and object-oriented systems also have certain limitations, but they
have largely resolved the problem of data security at the level of program logic: it is now
very hard to access data if you are an unauthorized user.
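A minimal Python sketch of that encapsulation idea (the class, the record and the token check are illustrative, not a real authorization framework): the data is kept in a private attribute and exposed only through a method that checks the caller's authorization.

# Sketch: encapsulation used for access control.
class PatientRecord:
    def __init__(self, name, diagnosis):
        self.__data = {"name": name, "diagnosis": diagnosis}   # not directly exposed
        self.__authorized_tokens = {"doctor-123"}

    def read(self, token):
        # Return the record only to authorized callers.
        if token not in self.__authorized_tokens:
            raise PermissionError("not authorized to read this record")
        return dict(self.__data)                               # a copy, not the internal state

record = PatientRecord("Alice", "flu")
print(record.read("doctor-123"))        # authorized access succeeds
try:
    record.read("stranger-999")         # unauthorized access is rejected
except PermissionError as err:
    print("blocked:", err)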
In the 1990s, when Sabeer Bhatia created the web-based email service Hotmail, people thought
many times before using it, because the chance of information leakage was very high. That is
why the object-oriented approach was adopted in web languages such as ASP and PHP: to ensure
data security and to build applications on top of self-contained environments, i.e.
frameworks, which are libraries of classes and functions that make programming easier.
Security matters because someone could use our information to track us or for illegal
purposes, so this concept came to be used by almost every popular programming language.
Information management is an essential part of today's web development: it ensures that data
is shared only with authorized users. It is not an easy task to manage the whole system,
delegate authority to users and manage privacy.
Facebook is a good example of an information system. Facebook provides privacy controls to
its users, which is one of the capabilities a well-designed information system should offer.
It grants authority to the graph nodes connected to you (i.e., your friends) to access the
information you have chosen to share with them. Nobody can see your private information on
Facebook except those to whom you have granted the authority to see it.