
1. What is big data? Explain in detail about the characteristics of big data.

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process
using on-hand database management tools or traditional data processing applications.
Any piece of information can be considered data. Data can take various forms and sizes, ranging from small data sets to very big data.
Big Data is characterized in terms of the following Vs:
Volume – Volume is how much data we have. Data sizes have grown to terabytes and beyond in the form of records and transactions; what used to be measured in gigabytes is now measured in zettabytes (ZB) or even yottabytes (YB). The IoT (Internet of Things) is creating exponential growth in data.
Velocity – the speed at which data is generated and becomes accessible. It means near real-time or real-time assimilation of data arriving in huge volumes.
Variety – Data comes in a huge variety of types: internal, external, behavioral, and/or social. It can be structured, semi-structured, or unstructured. Variety describes one of the biggest challenges of big data: it can include many different kinds of data, from XML to video to SMS. Organizing the data in a meaningful way is no simple task, especially when the data itself changes rapidly.
Veracity - is all about making sure the data is accurate, which requires processes to keep the bad data
from accumulating in your systems. The simplest example is contacts that enter your marketing
automation system with false names and inaccurate contact information. How many times have you seen
Mickey Mouse in your database? It’s the classic “garbage in, garbage out” challenge.
Value - is the end game. After addressing volume, velocity, variety, variability, veracity, and
visualization – which takes a lot of time, effort and resources – you want to be sure your organization is
getting value from the data.

2. Discuss the applications of big data analytics.


1. Big data in Banking
With large amounts of data streaming in from countless sources, banks have to find unique and innovative ways to manage big data. It is important to analyse customers' needs, provide service that matches their requirements, and minimize risk and fraud while maintaining regulatory compliance. Big data requires financial institutions to stay one step ahead with advanced analytics.
2. Big data in Government
When government agencies harness and apply analytics to their big data, they improve significantly in managing utilities, running agencies, dealing with traffic congestion, and preventing crime. But alongside the advantages of big data, governments must also address issues of transparency and privacy.
3. Big data in Health Care
When it comes to health care, patient records, treatment plans, prescription information, and so on all need to be handled quickly and accurately, and in some aspects with enough transparency to satisfy stringent industry regulations. Effective big data management helps uncover hidden insights that improve patient care.
4. Big data in Manufacturing
Manufacturers can improve quality and output while minimizing waste, processes that are key factors in today's highly competitive market. Several manufacturers are adopting analytics so they can solve problems faster and make more agile business decisions.
5. Big data in Retail
Maintaining customer relationships is the biggest challenge in the retail industry, and the best way to address it is to manage big data well. Retailers need unique marketing ideas to sell their products, the most effective way to handle transactions, and innovative tactics built on big data to improve their business.
6. Big data in Finance sector
Financial services firms have widely adopted big data analytics to inform better investment decisions with consistent returns. The big data pendulum for financial services has swung from passing fad to large-scale deployment over the last year.
7. Big data in Telecom
A recent report found that the use of data analytics tools in the telecom sector is expected to grow at a compound annual growth rate of 28% over the next four years.
8. Big data in retail sector
Retailers harness big data to offer consumers personalized shopping experiences, for example by analyzing how a customer came to make a purchase (the path to purchase). 66% of retailers have made financial gains in customer relationship management through big data.
9. Big Data in tourism
Big data is transforming the global tourism industry. People know more about the world than ever before and can build much more detailed itineraries with the help of big data.
10. Big data in Airlines
Big Data and Analytics give wings to the Aviation Industry. An airline now knows where a plane is
headed, where a passenger is sitting, and what a passenger is viewing on the IFE or connectivity system.
11. Big data in Social Media
Big data is a driving factor behind every marketing decision made by social media companies and it is
driving personalization to the extreme.
3. With a neat diagram, describe the working of the analytical processing model.
The first step, data selection, is very important, as data is the key component of any analytical process. The selection of data has a deterministic impact on the analytical models built later.
The second step is data collection. All data is gathered in a staging area, which could be a data mart or a data warehouse. Some basic exploratory analysis can be considered here using Online Analytical Processing (OLAP) facilities for multidimensional data analysis (e.g., rollup, drill down, slicing and dicing).
The third step is data cleaning, which gets rid of inconsistencies such as missing values, outliers, and duplicate data.
In the fourth step, additional transformations may be considered, such as binning, alphanumeric-to-numeric coding, geographical aggregation, and so forth.
The fifth step is the analytics step, in which an analytical model is estimated on the preprocessed and transformed data. Different types of analytics can be considered here (e.g., churn prediction, fraud detection, customer segmentation, market basket analysis).
Finally, once the model has been built, it will be interpreted and evaluated by the business experts.
Usually, many trivial patterns will be detected by the model.
For example, in a market basket analysis setting, one may find that spaghetti and spaghetti sauce are
often purchased together. These patterns are interesting because they provide some validation of the
model. But of course, the key issue here is to find the unexpected yet interesting and actionable patterns
(sometimes also referred to as knowledge diamonds) that can provide added value in the business
setting. Once the analytical model has been appropriately validated and approved, it can be put into
production as an analytics application (e.g., decision support system, scoring engine).
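As a rough illustration (not part of the original text), these steps can be sketched in Python with pandas and scikit-learn. The file name transactions.csv and the columns income, age, region, and churn are hypothetical placeholders.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Steps 1-2: select the data and gather it from the staging area (here a flat file).
df = pd.read_csv("transactions.csv")  # hypothetical extract from the data warehouse

# Step 3: data cleaning - remove duplicates and impute missing values.
df = df.drop_duplicates()
df["income"] = df["income"].fillna(df["income"].median())

# Step 4: transformation - binning and alphanumeric-to-numeric coding.
df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 50, 120], labels=[0, 1, 2]).astype(int)
X = pd.get_dummies(df[["income", "age_bin", "region"]], columns=["region"], drop_first=True)
y = df["churn"]

# Step 5: analytics - estimate a model (here churn prediction) on the preprocessed data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 6: the business expert interprets and evaluates the model before it goes into production.
print(model.score(X_test, y_test))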

4. Mention the different types of data sources for big data analytics. Explain.
More data is better to start off the analysis. Data can originate from a variety of different sources. They
are as follows:
• Transactional data
• Unstructured data
• Qualitative/Expert-based data
• Data poolers
• Publicly available data
Transactional Data: Transactions are the first important source of data. Transactional data consist of
structured, low level, detailed information capturing the key characteristics of a customer transaction
(e.g., purchase, claim, cash transfer, credit card payment). This type of data is usually stored in massive
online transaction processing (OLTP) relational databases. It can also be summarized over longer time
horizons by aggregating it into averages, absolute/relative trends, maximum/minimum values, and so on.
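A minimal pandas sketch of such aggregation, assuming a hypothetical transaction extract with customer_id, date, and amount columns (the file and column names are illustrative):

import pandas as pd

tx = pd.read_csv("transactions.csv", parse_dates=["date"])  # hypothetical OLTP extract

# Summarize low-level transactions into customer-level features over a longer horizon.
features = tx.groupby("customer_id")["amount"].agg(
    avg_amount="mean",       # average transaction value
    max_amount="max",        # maximum value
    min_amount="min",        # minimum value
    n_transactions="count",  # frequency
)
features["last_purchase"] = tx.groupby("customer_id")["date"].max()  # recency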
Unstructured data: Unstructured data embedded in text documents (e.g., emails, web pages, claim forms) or multimedia content can also be interesting to analyse. However, these sources typically require extensive preprocessing before they can be successfully included in an analytical exercise.
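As a simple illustration of such preprocessing (a sketch, not prescribed by the text), raw text can be turned into numeric features with TF-IDF before it enters the analytical exercise; the example documents below are invented.

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy claim descriptions standing in for emails, web pages, or claim forms.
docs = ["engine damage after collision",
        "water damage in the kitchen",
        "bicycle stolen from the garage"]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X_text = vectorizer.fit_transform(docs)    # sparse document-term matrix usable by a model
print(vectorizer.get_feature_names_out())  # the extracted terms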
Qualitative/Expert based data: Another important source of data is qualitative, expert-based data. An
expert is a person with a substantial amount of subject matter expertise within a particular setting (e.g.,
credit portfolio manager, brand manager). The expertise stems from both common sense and business
experience, and it is important to elicit expertise as much as possible before the analytics is run. This
will steer the modelling in the right direction and allow you to interpret the analytical results from the
right perspective. A popular example of applying expert-based validation is checking the univariate
signs of a regression model.
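A small sketch of such a sign check, using scikit-learn on made-up data (the variable names and values are purely illustrative, not from the text):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: debt ratio of a client versus a default flag.
X = np.array([[0.1], [0.8], [0.4], [0.9], [0.3], [0.7], [0.6], [0.2]])
y = np.array([0, 1, 0, 1, 1, 0, 1, 0])

clf = LogisticRegression().fit(X, y)
# The expert expects a positive sign here: a higher debt ratio should increase default risk.
print(clf.coef_)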
Data poolers: Nowadays, data poolers are becoming more and more important in the industry. Popular
examples are Dun & Bradstreet, Bureau Van Dijck, and Thomson Reuters. The core business of these
companies is to gather data in a particular setting (e.g., credit risk, marketing), build models with it, and
sell the output of these models (e.g., scores), possibly together with the underlying raw data, to
interested customers. A popular example of this in the United States is the FICO score, which is a credit
score ranging between 300 and 850 that is provided by the three most important credit bureaus:
Experian, Equifax, and TransUnion. Many financial institutions use these FICO scores either as their
final internal model, or as a benchmark against an internally developed credit scorecard to better
understand the weaknesses of the latter.
Publicly available data: Finally, plenty of publicly available data can be included in the analytical
exercise. A first important example is macroeconomic data about gross domestic product (GDP),
inflation, unemployment, and so on. By including this type of data in an analytical model, it will become
possible to see how the model varies with the state of the economy. This is especially relevant in a credit
risk setting, where typically all models need to be thoroughly stress tested. In addition, social media data
from Facebook, Twitter, and others can be an important source of information. However, one needs to be
careful here and make sure that all data gathering respects both local and international privacy
regulations.
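A possible sketch (with invented figures) of how macroeconomic indicators can be merged into a modeling data set by month:

import pandas as pd

# Hypothetical loan observations and publicly available macro indicators per month.
loans = pd.DataFrame({"loan_id": [1, 2, 3],
                      "month": ["2023-01", "2023-02", "2023-02"],
                      "defaulted": [0, 1, 0]})
macro = pd.DataFrame({"month": ["2023-01", "2023-02"],
                      "unemployment_rate": [3.6, 3.9],  # illustrative numbers
                      "gdp_growth": [0.4, 0.2]})

# The enriched data set lets the model vary with the state of the economy.
enriched = loans.merge(macro, on="month", how="left")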
5. List the various factors required for analytical model and explain.
Analytics is a term that is often used interchangeably with data science, data mining, knowledge
discovery, and others. The distinction between all those is not clear cut. All of these terms essentially
refer to extracting useful business patterns or mathematical decision models from a preprocessed data
set.
A good analytical model should satisfy several requirements, depending on the application area.
– Business relevance
– Statistical significance and predictive power
– Interpretability
– Justifiability
– Operational efficiency
– Economic cost
– Regulation and legislation

A first critical success factor is business relevance. The analytical model should actually solve the
business problem for which it was developed. It makes no sense to have a working analytical model that
got sidetracked from the original problem statement. In order to achieve business relevance, it is of key
importance that the business problem to be solved is appropriately defined, qualified, and agreed upon
by all parties involved at the outset of the analysis.
A second criterion is statistical performance. The model should have statistical significance and
predictive power. How this can be measured will depend upon the type of analytics considered. For
example, in a classification setting (churn, fraud), the model should have good discrimination power. In
a clustering setting, the clusters should be as homogenous as possible.
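For a classification model, discrimination power is commonly summarized by the area under the ROC curve (AUC); a tiny scikit-learn sketch with invented labels and scores:

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]                 # actual churn labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # model-predicted churn probabilities
# An AUC of 0.5 means random guessing; 1.0 means perfect discrimination.
print(roc_auc_score(y_true, y_score))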
Interpretability refers to understanding the patterns that the analytical model captures. This aspect has a
certain degree of subjectivism, since interpretability may depend on the business user’s knowledge. In
many settings, however, it is considered to be a key requirement. For example, in credit risk modeling or
medical diagnosis, interpretable models are absolutely needed to get good insight into the underlying
data patterns. In other settings, such as response modeling and fraud detection, having interpretable
models may be less of an issue.
Justifiability refers to the degree to which a model corresponds to prior business knowledge and
intuition. For example, a model stating that a higher debt ratio results in more creditworthy clients may
be interpretable, but is not justifiable because it contradicts basic financial intuition. Note that both
interpretability and justifiability often need to be balanced against statistical performance. Often one will
observe that high performing analytical models are incomprehensible and black box in nature.
Analytical models should also be operationally efficient. This refers to the efforts needed to collect the
data, preprocess it, evaluate the model, and feed its outputs to the business application (e.g., campaign
management, capital calculation). Especially in a real-time online scoring environment (e.g., fraud
detection) this may be a crucial characteristic. Operational efficiency also entails the efforts needed to
monitor and back test the model, and reestimate it when necessary.
Another key attention point is the economic cost needed to set up the analytical model. This includes the
costs to gather and preprocess the data, the costs to analyze the data, and the costs to put the resulting
analytical models into production. In addition, the software costs and human and computing resources
should be taken into account here. It is important to do a thorough cost–benefit analysis at the start of
the project.
Finally, analytical models should also comply with both local and international regulation and
legislation. For example, in a credit risk setting, the Basel II and Basel III Capital Accords have been
introduced to appropriately identify the types of data that can or cannot be used to build credit risk models.
6. Write a note on missing values, data sampling.
1. Missing values:
Missing values can occur because of various reasons.
• The information can be nonapplicable. For example, when modeling time of churn, this information is
only available for the churners and not for the non-churners because it is not applicable there.
• The information can also be undisclosed. For example, a customer decided not to disclose his or her
income because of privacy.
• Missing data can also originate because of an error during merging.
Table 1.2: Dealing with missing values
• Some analytical techniques (e.g., decision trees) can directly deal with missing values.
• Other techniques need some additional preprocessing. The following are the most popular schemes to
deal with missing values:
– Replace (impute). This implies replacing the missing value with a known value.
– Delete. This is the most straightforward option and consists of deleting observations or variables with
lots of missing values.
– Keep. Missing values can be meaningful (e.g., a customer did not disclose his or her income because
he or she is currently unemployed).
As a practical way of working, one can first statistically test whether the missing information is related to the target variable. If it is, we can adopt the keep strategy and make a special category for it. If not, one can, depending on the number of observations available, decide to either delete or impute.
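A short pandas sketch of the replace, delete, and keep schemes on an invented income column (the data and column names are illustrative only):

import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({"income": [2500, None, 4000, None, 3200],
                   "churn":  [0, 1, 0, 1, 0]})  # toy data

# Replace (impute): fill the missing value with a known value, here the median.
df["income_imputed"] = df["income"].fillna(df["income"].median())

# Delete: drop observations (or variables) with lots of missing values.
df_deleted = df.dropna(subset=["income"])

# Keep: the fact that a value is missing may itself be predictive.
df["income_missing"] = df["income"].isna().astype(int)

# Practical check: is missingness related to the target? If so, keep it as a category.
print(chi2_contingency(pd.crosstab(df["income_missing"], df["churn"]))[1])  # p-value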
2. Data Sampling:
Sampling takes a subset of past customer data and uses it to build an analytical model.
• The sample should also be taken from an average business period to get a picture of the target
population that is as accurate as possible.
• With the availability of high-performance computing facilities (e.g. grid computing and cloud
computing) one could also directly analyze the full data set.
• A key requirement for a good sample is that it should be representative of the future customers on which the analytical model will be run.
• The timing aspect is important because customers of today are more similar to customers of tomorrow than to customers of yesterday.
• Choosing the optimal time window for the sample involves a trade-off between having lots of data and having recent data.
Example: credit scoring. Assume one wants to build an application scorecard to score mortgage applications (a mortgage is a legal agreement that allows you to borrow money from a bank or similar organization, especially in order to buy a house, or the amount of money itself). The future population then consists of the through-the-door (TTD) population, and one needs a subset of the historical TTD population to build the analytical model. The figure shows the customers that were accepted under the old acceptance policy and those that were rejected. When building a sample, one can only make use of those that were accepted (since only their outcomes are known), which clearly implies a bias.
Figure 1.3: The reject inference problem in credit scoring
In stratified sampling, a sample is taken according to predefined strata. Example: a churn prediction or fraud detection context in which data sets are typically very skewed. When stratifying according to the target churn indicator, the sample will contain exactly the same percentage of churners and non-churners as the original data.
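A sketch of stratified sampling with scikit-learn on an artificial, skewed churn data set (about 5% churners; all values are invented):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"income": range(1000),
                   "churn": [1 if i % 20 == 0 else 0 for i in range(1000)]})  # ~5% churners

# Stratifying on the churn indicator keeps the churner percentage identical in the sample.
sample, _ = train_test_split(df, train_size=0.3, stratify=df["churn"], random_state=42)
print(df["churn"].mean(), sample["churn"].mean())  # both about 0.05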

7. What are the different types of data elements available for big data analysis?
It is important to appropriately consider the different types of data elements at the start of the analysis. The following types of data elements can be considered:
1. Continuous
2. Categorical
Continuous: These are data elements that are defined on an interval that can be limited or unlimited.
Examples include income, sales, RFM (recency, frequency, monetary).

Categorical: The categorical data elements are differentiated as follows:


Nominal: These are data elements that can only take on a limited set of values with no meaningful ordering in between.
Examples: marital status, profession, purpose of loan.
Ordinal: These are data elements that can only take on a limited set of values with a meaningful ordering
in between.
Examples: credit rating; age coded as young, middle aged, and old.
Binary: These are data elements that can only take on two values.
Example: gender, employment status.

Appropriately distinguishing between these different data elements is of key importance when importing the data into an analytics tool at the start of the analysis. For example, if marital status were incorrectly specified as a continuous data element, the software would calculate its mean, standard deviation, and so on, which is obviously meaningless.
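A small pandas sketch (with made-up values) of declaring each element type correctly so that only meaningful statistics are computed:

import pandas as pd

df = pd.DataFrame({
    "income":         [2500.0, 4000.0, 3200.0],         # continuous
    "marital_status": ["single", "married", "single"],   # nominal
    "credit_rating":  ["B", "A", "C"],                   # ordinal
    "employed":       [1, 0, 1],                         # binary
})
df["marital_status"] = df["marital_status"].astype("category")
df["credit_rating"] = pd.Categorical(df["credit_rating"],
                                     categories=["C", "B", "A"], ordered=True)

print(df["income"].mean())                  # meaningful for a continuous variable
print(df["marital_status"].value_counts())  # the appropriate summary for a nominal variable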

8. Explain the different techniques used in visual data exploration.


Visual data exploration is getting to know your data in an "informal" way. It allows you to get some initial insights into the data, which can then be usefully adopted throughout the modeling. Different plots/graphs can be useful here.
Examples:
Pie charts
Bar charts
Histograms and scatter plots
• A pie chart represents a variable’s distribution as a pie.
• Bar charts represent the frequency of each of the values (either absolute or relative) as bars.
• A histogram provides an easy way to visualize the central tendency and to determine the variability or spread of the data. It also allows us to contrast the observed data with standard known distributions.
• Scatter plots allow us to visualize one variable against another to see whether there are any correlation
patterns in the data.
• A word cloud is an image composed of words used in a particular text or subject, in which the size of each word indicates its frequency or importance; it visualizes how often terms occur.
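A matplotlib sketch of these plots on toy data (the variable names and numbers are invented); a word cloud can be produced in a similar way with the third-party wordcloud package.

import numpy as np
import matplotlib.pyplot as plt

status = ["single", "married", "divorced"]
counts = [40, 45, 15]
rng = np.random.default_rng(42)
income = rng.normal(3000, 500, 200)
age = rng.integers(20, 70, 200)

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].pie(counts, labels=status)  # distribution of a variable as a pie
axes[0, 1].bar(status, counts)         # frequencies as bars
axes[1, 0].hist(income, bins=20)       # central tendency and spread
axes[1, 1].scatter(age, income)        # correlation pattern between two variables
plt.tight_layout()
plt.show()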
Module-2
1. Explain in detail about inter and trans firewall analytics with a neat diagram.
Over the last 100 years, the supply chain has evolved to connect multiple companies and enable them to collaborate to create enormous value for the end-consumer, through concepts like CPFR (collaborative planning, forecasting and replenishment), a collection of business practices that leverage the Internet and electronic data interchange to reduce inventories and expenses while improving customer service, and VMI (vendor-managed inventory), a technique used by customers in which manufacturers receive sales data so they can forecast consumer demand more accurately.
In the healthcare industry, rich consumer insights can be generated by collaborating on data and insights from the health insurance provider, the pharmacy delivering the drugs, and the drug manufacturer. In fact, this is not necessarily limited to companies within the traditional demand-supply chain: disruptive value and efficiencies can be extracted by cooperating and exploring outside the boundaries of the firewall.
There are instances where a retailer and a social media company can come together to share insights on consumer behaviour that benefit both parties. Some of the more progressive companies will take this a step further and work on leveraging the large volumes of data outside the firewall, such as social data, location data, etc. It will not be long before internal data and insights from within the enterprise firewall are no longer a differentiator. We call this trend the move from intra- to inter- and trans-firewall analytics.
In previous years, companies were doing functional, silo-based analytics; silo analytics provides rich information through statistical reporting within a single function. Today, companies are doing intra-firewall analytics with data inside the firewall. In the future they will collaborate on insights with other companies to do inter-firewall analytics, as well as leverage public domain data to do trans-firewall analytics.

2. Write a note on cloud and big data.


• Cloud models are a basic necessity for the functioning of every industry, and it is just a matter of time before an industry shifts to the cloud model.
• Many startups do not have unlimited capital to invest in infrastructure; in such cases the cloud can provide Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).
• Data is exploding, both structured and unstructured; storage devices are very costly and can fail. So the cloud provides Storage Space as a Service (SPaaS), for example Google Drive, OneDrive, FileHipo, 4Shared, etc.
• The prices of cloud products, and the risks to be managed, have come down thanks to free storage space and high security.
• Many IT experts say:
• Stop saying "cloud": the true value lies in delivering software, data, and/or analytics, so instead of calling it the cloud computing model, it is more appropriate to speak of the "as a service" model.
• Acknowledge the business: the matters around information privacy, security, access, and delivery raised by implementing the cloud model lead to new business opportunities for many IT companies, for example AWS, Microsoft Azure, IBM BlueMix, etc.
• Fix the core business-technology gap: issues such as the ability to run analytics at scale in a virtual environment and ensuring the authenticity of information processing and analytics have to be fixed; solving them addresses many core business problems and reduces the gap between business and technology.
3. What is crowd sourcing analytics? Explain the different types of crowd sourcing.
4. Summarize the two critical components of Hadoop and explain in detail with a neat diagram.
5. Define the terms: data discovery and predictive analytics.

6. Explain the working of Kaggle's crowd sourcing.


7. Explain mobile intelligence and big data.
8. Explain the working together of HDFS and MapReduce.
9. What is predictive analysis? Why is it required? Discuss the leading trends in predictive analysis.
10. List and explain the technical features of Hadoop.
