What is Big Data Analytics
1. Risk Management
Use Case: Banco de Oro, a Philippine banking company, uses Big Data analytics to identify
fraudulent activities and discrepancies. The organization leverages it to narrow down a list of
suspects or root causes of problems.
• Stage 1 - Business case evaluation - The Big Data analytics lifecycle begins with a business case,
which defines the reason and goal behind the analysis.
• Stage 2 - Identification of data - Here, a broad variety of data sources are identified.
• Stage 3 - Data filtering - All of the identified data from the previous stage is filtered here to
remove corrupt data.
• Stage 4 - Data extraction - Data that is not compatible with the tool is extracted and then
transformed into a compatible form.
• Stage 5 - Data aggregation - In this stage, data with the same fields across different datasets are
integrated.
• Stage 6 - Data analysis - Data is evaluated using analytical and statistical tools to discover useful
information.
• Stage 7 - Visualization of data - With tools like Tableau, Power BI, and QlikView, Big Data analysts
can produce graphic visualizations of the analysis.
• Stage 8 - Final analysis result - This is the last step of the Big Data analytics lifecycle, where the
final results of the analysis are made available to business stakeholders who will take action.
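To make the middle stages concrete, here is a minimal sketch in Python using pandas; the dataset, the column names (account_id, amount), and the anomaly threshold are all hypothetical, chosen only to illustrate stages 3, 5, and 6.

import pandas as pd

# Hypothetical transaction data standing in for an identified source (stage 2).
raw = pd.DataFrame({
    "account_id": [1, 1, 2, 2, None],
    "amount": [120.0, 95.5, 40.0, -1.0, 60.0],
})

# Stage 3 - data filtering: remove corrupt records (missing IDs, negative amounts).
clean = raw.dropna(subset=["account_id"])
clean = clean[clean["amount"] >= 0]

# Stage 5 - data aggregation: integrate records that share the same field.
per_account = clean.groupby("account_id")["amount"].agg(["count", "sum", "mean"])

# Stage 6 - data analysis: flag accounts whose average spend looks anomalous.
threshold = 2 * per_account["mean"].mean()
suspicious = per_account[per_account["mean"] > threshold]
print(suspicious)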
1. Descriptive Analytics
This summarizes past data into a form that people can easily read. This helps in creating reports, like a
company’s revenue, profit, sales, and so on. Also, it helps in the tabulation of social media metrics.
Use Case: The Dow Chemical Company analyzed its past data to increase facility utilization across its
office and lab space. Using descriptive analytics, Dow was able to identify underutilized space. This space
consolidation helped the company save nearly US $4 million annually.
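As an illustration of descriptive analytics, the sketch below summarizes hypothetical facility-usage records with pandas; the column names and figures are invented for the example, not Dow's actual data.

import pandas as pd

# Hypothetical facility-usage log.
usage = pd.DataFrame({
    "building": ["A", "A", "B", "B", "C"],
    "occupancy_pct": [85, 90, 30, 25, 60],
})

# Descriptive analytics: summarize past data into an easily read report.
report = usage.groupby("building")["occupancy_pct"].mean()
print(report)                               # average utilization per building
print(report[report < 50].index.tolist())   # underutilized candidates for consolidation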
2. Diagnostic Analytics
This is done to understand what caused a problem in the first place. Techniques like drill-down, data
mining, and data recovery are all examples. Organizations use diagnostic analytics because it provides
in-depth insight into a particular problem.
Use Case: An e-commerce company’s report shows that their sales have gone down, although
customers are adding products to their carts. This can be due to various reasons like the form didn’t load
correctly, the shipping fee is too high, or there are not enough payment options available. This is where
you can use diagnostic analytics to find the reason.
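A drill-down of this kind can be sketched in a few lines of pandas; the event log and reason codes below are hypothetical.

import pandas as pd

# Hypothetical abandoned-cart events with candidate causes already recorded.
carts = pd.DataFrame({
    "cart_id": range(6),
    "abandon_reason": ["shipping_fee", "form_error", "shipping_fee",
                       "payment_options", "shipping_fee", "form_error"],
})

# Drill-down: count how often each candidate cause appears.
cause_counts = carts["abandon_reason"].value_counts()
print(cause_counts)   # the most frequent reason points to the likely root cause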
3. Predictive Analytics
This type of analytics looks at historical and present data to make predictions about the future.
Predictive analytics uses data mining, AI, and machine learning to forecast customer trends, market
trends, and so on.
Use Case: PayPal determines what kind of precautions they have to take to protect their clients against
fraudulent transactions. Using predictive analytics, the company uses all the historical payment data and
user behavior data and builds an algorithm that predicts fraudulent activities.
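The following is a minimal sketch of this idea using scikit-learn, with synthetic features standing in for payment and user-behavior data; PayPal's actual features, models, and scale are of course far beyond this.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for historical payment and user-behavior features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))             # e.g. amount, velocity, geo-distance
y = (X[:, 0] + X[:, 1] > 2).astype(int)    # toy fraud label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Predict the probability that unseen transactions are fraudulent.
print(model.predict_proba(X_test[:5])[:, 1])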
4. Prescriptive Analytics
This type of analytics prescribes the solution to a particular problem. Prescriptive analytics works with
both descriptive and predictive analytics. Most of the time, it relies on AI and machine learning.
Use Case: Prescriptive analytics can be used to maximize an airline’s profit. This type of analytics is used
to build an algorithm that will automatically adjust the flight fares based on numerous factors, including
customer demand, weather, destination, holiday seasons, and oil prices.
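A real airline would use ML-driven optimization, but a simple rule-based sketch conveys the idea; all coefficients and inputs below are made up for illustration.

# Hypothetical rule-based fare adjustment.
def adjust_fare(base_fare: float, demand_ratio: float,
                days_to_holiday: int, oil_price_index: float) -> float:
    fare = base_fare * (1 + 0.5 * max(demand_ratio - 1, 0))   # surge with demand
    if days_to_holiday <= 14:
        fare *= 1.15                                          # holiday-season premium
    fare *= 1 + 0.1 * (oil_price_index - 1)                   # pass through fuel costs
    return round(fare, 2)

print(adjust_fare(base_fare=200.0, demand_ratio=1.3,
                  days_to_holiday=10, oil_price_index=1.2))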
A widely used tool for this kind of work is Apache Spark, which supports real-time processing and analysis of large amounts of data.
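A minimal PySpark sketch of such an analysis might look as follows, assuming PySpark is installed and a sales.csv file with region and revenue columns exists; both the file and its columns are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SalesAnalysis").getOrCreate()

# Read the (hypothetical) dataset; Spark distributes the work across the cluster.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Aggregate revenue per region in parallel.
df.groupBy("region").sum("revenue").show()

spark.stop()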
Here are some of the sectors where Big Data is actively used:
• E-commerce - Predicting customer trends and optimizing prices are a few of the ways e-commerce
companies use Big Data analytics
• Marketing - Big Data analytics helps to drive high ROI marketing campaigns, which result in
improved sales
• Education - Used to develop new and improve existing courses based on market requirements
• Healthcare - With the help of a patient’s medical history, Big Data analytics is used to predict
how likely they are to have health issues
• Media and entertainment - Used to understand the demand for shows, movies, songs, and more,
and to deliver personalized recommendations to users
• Banking - Customer income and spending patterns help to predict the likelihood of choosing
various banking offers, like loans and credit cards
• Government - Big Data analytics helps governments in law enforcement, among other things
Big data is a collection of data from many sources and is often described by five characteristics, known as
the 5 V's: volume, velocity, variety, veracity, and value. These characteristics help to understand the
complexity of big data and can help data scientists derive more value from their data.
Volume: Volume refers to the 'size' or amount of data. For instance, YouTube has over 2.6 billion monthly
active users and generates a large amount of data daily, which can't be processed manually; thus,
modern techniques and tools are used to handle such voluminous data.
Velocity: Velocity refers to the 'speed' or rate at which data is accumulated. For instance, YouTube grew
from 200 million monthly active users in 2010 to 2.6 billion in 2022, and the rate at which those users
generate data has grown accordingly.
Variety: Variety refers to the 'heterogeneity' or diversity of data. The data can be structured,
unstructured, or semi-structured.
Veracity: Veracity refers to the 'trustworthiness' or quality of data, i.e., whether the data is free
from ambiguities.
Value: Value refers to the 'Insights' gained from the data. It means whether the given data set is
producing any useful result. Data, in its raw form, gives no valuable result, but once processed efficiently,
it can give us important insights that could help us in decision-making.
• Visualization
The use of tools like charts, graphs, and maps to create visual representations of data (a minimal sketch follows this list)
• Data provenance
Checking the origin of a piece of data, and the processes and techniques used to produce it
• Transparency
The right of a person to know whether a company collects, uses, or processes their personal data
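The visualization sketch mentioned above uses matplotlib; the monthly revenue figures are invented for the example.

import matplotlib.pyplot as plt

# Hypothetical monthly revenue figures to visualize.
months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [120, 135, 128, 150]

plt.bar(months, revenue)
plt.title("Monthly Revenue")
plt.ylabel("Revenue (USD, thousands)")
plt.show()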
DFS (Distributed File System) is a technology that allows you to group shared folders located on different
servers into one or more logically structured namespaces. The main purpose of the Distributed File
System (DFS) is to allow users of physically distributed systems to share their data and resources by
using a common file system. A collection of workstations and mainframes connected by a Local Area
Network (LAN) is a typical configuration of a Distributed File System. A DFS is executed as a part of the
operating system. In DFS, a namespace is created, and this process is transparent for the clients.
Components of DFS
DFS has two main components: the namespace component and the file replication component. In the
case of failure and heavy load, these components together improve data availability by allowing data
shared in different locations to be logically grouped under one folder, known as the "DFS root". It is not
necessary to use both components together: the namespace component can be used without the file
replication component, and the file replication component can be used between servers without the
namespace component.
Early iterations of DFS made use of Microsoft's File Replication Service (FRS), which allowed for
straightforward file replication between servers: FRS recognises new or updated files and distributes the
most recent version of the whole file to all servers. Windows Server 2003 R2 introduced "DFS
Replication" (DFSR), which improves on FRS by copying only the portions of files that have changed and
by minimising network traffic with data compression. Additionally, it provides users with flexible
configuration options to manage network traffic on a configurable schedule.
Features of DFS
• Transparency
o Structure transparency: There is no need for the client to know about the number or
locations of file servers and the storage devices. Multiple file servers should be provided
for performance, adaptability, and dependability.
o Access transparency: Both local and remote files should be accessible in the same
manner. The file system should automatically locate the accessed file and send it to the
client's side.
o Naming transparency: There should not be any hint in the name of the file to the
location of the file. Once a name is given to the file, it should not change when the file is
transferred from one node to another.
o Replication transparency: If a file is copied on multiple nodes, the copies of the file and
their locations should be hidden from the clients.
• User mobility: It will automatically bring the user’s home directory to the node where the user
logs in.
• Performance: Performance is based on the average amount of time needed to service client
requests. This time covers the CPU time + time taken to access secondary storage +
network access time. It is advisable that the performance of a Distributed File System be
comparable to that of a centralized file system.
• Simplicity and ease of use: The user interface of a file system should be simple and the number
of commands in the file should be small.
• High availability: A Distributed File System should be able to continue functioning in the event of
partial failures like a link failure, a node failure, or a storage drive crash.
A highly reliable and adaptable distributed file system should have multiple independent
file servers controlling multiple independent storage devices.
• Scalability: Since growing the network by adding new machines or joining two networks
together is routine, the distributed system will inevitably grow over time. As a result, a good
distributed file system should be built to scale quickly as the number of nodes and users in the
system grows. Service should not be substantially disrupted as the number of nodes and users
grows.
• Data integrity: Multiple users frequently share a file system. The integrity of data saved in a
shared file must be guaranteed by the file system. That is, concurrent access requests from many
users who are competing for access to the same file must be correctly synchronized using a
concurrency control method. Atomic transactions are a high-level concurrency control
mechanism for data integrity that is frequently offered to users by a file system (a minimal
locking sketch follows this list).
• Security: A distributed file system should be secure so that its users may trust that their data will
be kept private. To safeguard the information contained in the file system from unwanted &
unauthorized access, security mechanisms must be implemented.
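The locking sketch referenced under "Data integrity" is below. It shows single-machine concurrency control with a POSIX file lock; real distributed file systems use distributed lock managers or transactions, so this is only an analogy.

import fcntl

def append_record(path: str, record: str) -> None:
    # Exclusive lock so competing writers are correctly synchronized.
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        f.write(record + "\n")
        fcntl.flock(f, fcntl.LOCK_UN)   # release so other writers can proceed

append_record("shared.log", "user42: balance updated")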
Applications of DFS
• NFS: NFS stands for Network File System. It is a client-server architecture that allows a computer
user to view, store, and update files remotely. The protocol of NFS is one of the several
distributed file system standards for Network-Attached Storage (NAS).
• CIFS: CIFS stands for Common Internet File System. CIFS is a dialect of SMB; that is, CIFS is an
implementation of the SMB protocol designed by Microsoft.
• SMB: SMB stands for Server Message Block. It is a file-sharing protocol invented by IBM. The
SMB protocol was created to allow computers to perform read and write operations on files on a
remote host over a Local Area Network (LAN). The directories on the remote host that can be
accessed via SMB are called "shares".
• Hadoop: Hadoop is a collection of open-source software utilities. It provides a software framework
for distributed storage and processing of big data using the MapReduce programming model. The
core of Hadoop contains a storage part, known as the Hadoop Distributed File System (HDFS), and a
processing part, the MapReduce programming model (a word-count sketch in this style follows
this list).
• NetWare: NetWare is a discontinued computer network operating system developed by Novell, Inc.
It primarily used cooperative multitasking to run different services on a personal computer, using
the IPX network protocol.
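The word-count sketch referenced under Hadoop is below: a local simulation of the MapReduce programming model in Python, in the style of Hadoop Streaming (where real mappers and reducers read stdin and write stdout, and Hadoop handles distribution and shuffling).

from itertools import groupby

def mapper(lines):
    # Map: emit (word, 1) for every word in the input.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Reduce: sum the counts for each word after the shuffle/sort phase.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# Local stand-in for data that HDFS would normally store in blocks.
text = ["big data needs big tools", "hadoop stores big data"]
for word, total in reducer(mapper(text)):
    print(word, total)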
Working of DFS
• Standalone DFS namespace: It allows only for DFS roots that exist on the local computer
and do not use Active Directory. A standalone DFS namespace can only be accessed on the
computer on which it is created. It does not provide any fault tolerance and cannot be linked to
any other DFS. Standalone DFS roots are rarely encountered because of their limited advantages.
• Domain-based DFS namespace: It stores the configuration of DFS in Active Directory, making
the DFS namespace root accessible at \\<domainname>\<dfsroot> or \\<FQDN>\<dfsroot>.
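From a client's point of view, the namespace is transparent: files are listed and opened with ordinary paths, regardless of which physical server stores them. A minimal sketch, assuming a hypothetical domain-based root \\example.local\shared:

import os

# Hypothetical DFS namespace path (Windows UNC syntax); substitute your own
# domain name and DFS root.
dfs_root = r"\\example.local\shared"

# Clients browse the namespace like any folder, unaware of the physical servers.
for entry in os.listdir(dfs_root):
    print(entry)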
Advantages of Distributed File System (DFS)
• Improves the ability to scale the amount of data stored and to exchange data.
• Provides transparency of data even if a server or disk fails.
Disadvantages of Distributed File System (DFS)
• Nodes and connections need to be secured, so security is a concern.
• Messages and data can be lost in the network while moving from one node to another.
• Handling a database is harder in a Distributed File System than in a single-user system.
• Overloading can occur if all nodes try to send data at once.