Hamid Seminar Doc
The success and final outcome of this seminar required a great deal of guidance and
assistance from many people, and I am extremely fortunate to have received this all along
the completion of my seminar work. Whatever I have done is only due to such guidance
and assistance, and I would not forget to thank them.
I owe profound gratitude to our in-charge principal, Priyanka Parmar, my seminar
guide, Prof. Vijay Shah, and all the other assistant professors of Smt. Z.S. Patel College of
Computer Application, who took a keen interest in my seminar work and guided me
all along, till the completion of my seminar work, by providing all the necessary
information for presenting a good concept. I am extremely grateful to them for providing
such support and guidance even though they had busy schedules managing college
affairs.
I am thankful and fortunate to have received support and guidance from all the teaching
staff of the Bachelor of Computer Application department, which helped me successfully
complete my seminar work. I would also like to extend my sincere regards to all the non-
teaching staff of the Bachelor of Computer Application department for their support.
Name: MANDLEWALA HAMID
Roll no: 49
CONTENT
1) BIG DATA
7) DATA VISUALIZATION
BIG DATA
Big data refers to extremely large and complex data sets that cannot
be effectively processed or analyzed using traditional data processing
methods.
The history of Big Data dates back to the 1960s and 1970s, when
computers were first introduced for data processing. However, it was
not until the 1990s that the term "Big Data" was coined to describe
the growing volume, variety, and velocity of data being generated by
various sources.
In the early 2000s, the emergence of the internet and the proliferation of
digital devices led to a massive increase in the amount of data being
generated and collected. This, in turn, created a need for new tools and
technologies to store, process, and analyze the data.
The first major data project was created in 1937 and was ordered by
the Franklin D. Roosevelt administration after the Social Security Act
became law. The government had to keep track of contributions from
26 million Americans and more than 3 million employers. IBM got the
contract to develop a punch card-reading machine for this massive
bookkeeping project.
In more recent years, up to the present:
The development of cloud computing and distributed
systems enabled the storage and processing of massive
datasets.
Companies began using advanced analytics tools, including
machine learning and AI, to extract insights from data.
Real-time data processing became essential for businesses
to remain competitive.
Volume:
The sheer amount of data generated daily has exploded,
moving from gigabytes to zettabytes.
Velocity:
Data is generated at unprecedented speeds, necessitating
real-time processing.
Variety:
Data comes in various forms, including structured, semi-
structured, and unstructured formats.
CATEGORIES OF BIG DATA
STRUCTURED DATA
SEMI-STRUCTURED DATA
UNSTRUCTURED DATA
STRUCTURED DATA
Structured data is also called relational data. It is split into multiple tables
to enhance the integrity of the data by creating a single record to depict an
entity.
SEMI-STRUCTURED DATA
Semi-structured data is not bound by any rigid schema for data storage and
handling. The data is not in the relational format and is not neatly
organized into rows and columns like that in a spreadsheet.
UNSTRUCTURED DATA
Unstructured data is the kind of data that doesn’t adhere to any definite
schema or set of rules. Its arrangement is unplanned and haphazard.
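As a rough illustration of the three categories, a minimal Python sketch follows; the customer table, JSON record, and review text are made-up examples, not data from the seminar.

```python
import json
import sqlite3

# Structured data: rows and columns with a fixed schema (a relational table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Asha', 'Surat')")

# Semi-structured data: self-describing JSON with no rigid schema;
# the fields may vary from record to record.
semi_structured = json.loads('{"id": 2, "name": "Ravi", "tags": ["prime", "mobile-app"]}')

# Unstructured data: free text with no predefined organization at all.
unstructured = "Loved the quick delivery, but the packaging could be better!"

print(conn.execute("SELECT name FROM customer").fetchall())
print(semi_structured["tags"])
print(len(unstructured.split()), "words of free text")
```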
APPLICATIONS:
Inventory management
Demand forecasting
Route optimization for delivery
5) SOCIAL MEDIA ANALYSIS:
DESCRIPTION: Organizations analyze social media data to
understand public sentiment and brand perception.
APPLICATIONS:
Sentiment analysis for brand monitoring
Trend analysis for marketing strategies
Crisis management through real-time monitoring
6) FINANCIAL SERVICES:
DESCRIPTION: Financial institutions use big data for risk
management, customer insights, and regulatory compliance.
APPLICATIONS:
7) TELECOMMUNICATIONS:
DESCRIPTION: Telecom companies analyze big data to
improve network performance and customer services.
APPLICATION:
DATA SOURCE:
DATA PROCESSING:
Data Collection: Gathering data from various sources, including IoT
devices, social media, transaction records, and more.
Data Storage: Storing data in data lakes, warehouses, or cloud
storage solutions that can handle large volumes of data.
Data Cleaning: Removing inaccuracies, duplicates, and irrelevant
information to ensure data quality.
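A minimal data-cleaning sketch, assuming pandas is available; the column names and values below are invented for illustration.

```python
import pandas as pd

# Hypothetical raw records: a duplicate row, a missing value, and an irrelevant column.
raw = pd.DataFrame({
    "order_id": [101, 101, 102, 103],
    "amount":   [250.0, 250.0, None, 480.0],
    "debug_flag": ["x", "x", "y", "z"],   # irrelevant to the analysis
})

clean = (
    raw.drop_duplicates(subset="order_id")   # remove duplicate records
       .dropna(subset=["amount"])            # drop rows with missing amounts
       .drop(columns=["debug_flag"])         # discard irrelevant information
)
print(clean)
```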
DATA ANALYSIS:
The 5 V's are a set of characteristics of big data that define the opportunities
and challenges of big data analytics. These include the following:
1. VOLUME
2. VERACITY
3. VELOCITY
4. VALUE
5. VARIETY
1) VOLUME:
This refers to the massive amounts of data generated from different
sources.
The sheer volume of data generated today, from social media feeds,
IoT devices, transaction records and more, presents a significant
challenge.
Big data technologies and cloud-based storage solutions enable
organizations to store and manage these vast data sets cost-effectively,
protecting valuable data from being discarded due to storage limitations.
The larger the volume, the deeper the analysis can be, revealing trends and
patterns that smaller datasets may miss.
2) VERACITY:
Veracity refers to the accuracy and quality of data.
Data reliability and accuracy are critical, as decisions based on
inaccurate or incomplete data can lead to negative outcomes.
Veracity refers to the data's trustworthiness, encompassing data quality,
noise and anomaly detection issues.
Techniques and tools for data cleaning, validation and verification are
integral to ensuring the integrity of big data.
3) VELOCITY:
Velocity refers to the speed at which this data is generated and how
fast it's processed and analyzed.
Data is being produced at unprecedented speeds, from real-time social
media updates to high-frequency stock trading records.
The velocity at which data flows into organizations requires robust
processing capabilities to capture, process and deliver accurate analysis
in near real-time.
Stream processing frameworks and in-memory data processing are
designed to handle these rapid data streams and balance supply with
demand.
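As a toy sketch of in-memory processing of a rapid data stream, the loop below keeps a sliding-window average over simulated sensor readings; a production system would use a stream processing framework such as Spark Streaming or Kafka Streams rather than this hand-rolled example.

```python
from collections import deque
import random

window = deque(maxlen=100)          # in-memory sliding window of recent events

def handle_event(value):
    """Process one incoming event and return a near-real-time aggregate."""
    window.append(value)
    return sum(window) / len(window)

# Simulate a rapid stream of sensor readings arriving one by one.
for _ in range(1000):
    reading = random.uniform(20.0, 30.0)
    rolling_avg = handle_event(reading)

print(f"rolling average of the last {len(window)} readings: {rolling_avg:.2f}")
```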
4) VALUE:
Value refers to the overall worth that big data analytics should
provide.
Large data sets should be processed and analyzed to provide real-
world meaningful insights that can positively affect an organization's
decisions.
Big data analytics aims to extract actionable insights that offer
tangible value.
This involves turning vast data sets into meaningful information that
can inform strategic decisions, uncover new opportunities and drive
innovation.
5) VARIETY:
This refers to the data types, including structured, semi-structured and
unstructured data.
It also refers to the data's format, such as text, videos or images.
The variety in data means that organizations must have a flexible data
management system to handle, integrate and analyze different data
types.
This variety demands flexible data management systems to handle and
integrate disparate data types for comprehensive analysis.
BIG DATA ANALYTICS MODELS
DEFINITION:
1. Descriptive Analytics:
Descriptive analytics is one of the most common forms of
analytics that companies use to stay updated on current trends
and the company's operational performance.
It is one of the first steps of analyzing raw data by performing
simple mathematical operations and producing statements about
samples and measurements.
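A tiny descriptive-analytics sketch, assuming pandas is available; the monthly sales figures are invented for illustration.

```python
import pandas as pd

# Hypothetical monthly sales figures for a small retailer.
sales = pd.Series([120, 135, 150, 128, 160, 172],
                  index=["Jan", "Feb", "Mar", "Apr", "May", "Jun"])

# Simple mathematical summaries that describe what has already happened.
print("total:", sales.sum())
print("average:", sales.mean())
print("best month:", sales.idxmax())
print(sales.describe())    # count, mean, std, min, quartiles, max
```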
3. Predictive Analytics
As the name suggests, this type of data analytics is all about making
predictions about future outcomes based on insight from data.
In order to get the best results, it uses many sophisticated predictive
tools and models such as machine learning and statistical modeling.
Predictive analytics is one of the most widely used types of analytics
today. The market size is projected to reach $10.95 billion by 2022,
growing at a 21% rate over six years.
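A minimal predictive-analytics sketch, assuming scikit-learn and NumPy are installed; the advertising-spend and sales data is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic historical data: advertising spend vs. monthly sales.
rng = np.random.default_rng(0)
spend = rng.uniform(1, 100, size=(200, 1))
sales = 3.5 * spend[:, 0] + rng.normal(0, 10, size=200)

X_train, X_test, y_train, y_test = train_test_split(spend, sales, random_state=0)

# Learn a statistical model from past data, then forecast an unseen case.
model = LinearRegression().fit(X_train, y_train)
print("R^2 on unseen data:", round(model.score(X_test, y_test), 3))
print("forecast for a spend of 120:", model.predict(np.array([[120.0]]))[0])
```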
1. APACHE CASSANDRA:
It is one of the No-SQL databases which is highly scalable and has
high availability.
Data can be replicated across multiple data centers.
In Cassandra, fault tolerance is one of the big factors: failed nodes can
be easily replaced without any downtime.
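A hedged sketch using the Python cassandra-driver; the contact point, keyspace, table, and replication settings are assumptions for illustration, not part of the seminar material.

```python
from cassandra.cluster import Cluster

# Connect to a (hypothetical) local Cassandra node.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Keyspace with a replication factor > 1 so node failures can be tolerated.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS seminar
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS seminar.readings (
        sensor_id text, ts timestamp, value double,
        PRIMARY KEY (sensor_id, ts)
    )
""")
session.execute(
    "INSERT INTO seminar.readings (sensor_id, ts, value) "
    "VALUES (%s, toTimestamp(now()), %s)",
    ("sensor-1", 23.4),
)
```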
2. APACHE HADOOP:
Hadoop is one of the most widely used big data technologies. It handles
large-scale data and large file systems using the Hadoop Distributed File
System (HDFS).
It provides parallel processing through Hadoop's MapReduce framework.
Hadoop is a scalable system, capable of handling very large data volumes
and workloads.
For example, in real use cases, NextBio uses Hadoop MapReduce and HBase
to process multi-terabyte data sets of the human genome.
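A local, in-memory simulation of the MapReduce word-count pattern; a real Hadoop job would run the map and reduce steps across HDFS blocks on a cluster, but the logic sketched here is the same.

```python
from itertools import groupby

documents = ["big data needs big tools", "data is the new oil"]

# Map phase: emit (key, 1) pairs for every word.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group intermediate pairs by key.
mapped.sort(key=lambda kv: kv[0])

# Reduce phase: sum the counts for each key.
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=lambda kv: kv[0])}
print(counts)   # e.g. {'big': 2, 'data': 2, ...}
```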
3. APACHE HIVE:
It is used for data summarization and ad hoc querying, which means
querying and analyzing Big Data easily.
It is built on top of Hadoop to provide data summarization, ad hoc
queries, and analysis of large datasets using an SQL-like language
called HiveQL.
It is not a relational database and is not designed for real-time
queries.
Its features include: designed for OLAP, an SQL-like language called
HiveQL, and being fast, scalable, and extensible.
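A hedged HiveQL sketch issued through the PyHive client; the HiveServer2 host, port, table, and column names are assumptions for illustration.

```python
from pyhive import hive

# Connect to a (hypothetical) HiveServer2 instance.
conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# Ad hoc summarization over a large dataset using SQL-like HiveQL.
cursor.execute("""
    SELECT country, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM sales
    GROUP BY country
    ORDER BY revenue DESC
    LIMIT 10
""")
for row in cursor.fetchall():
    print(row)
```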
4. APACHE FLUME:
It is a distributed and reliable system that is used to collect,
aggregate, and move large amounts of log data from many data
sources toward a centralized data store.
5. APACHE SPARK:
The main objective of Spark is to speed up Hadoop's computational
processing; it was introduced by the Apache Software Foundation.
The main idea is that Spark is implemented with Hadoop in two ways: for
storage and for processing.
In practice, Spark uses Hadoop only for storage, because it has its own
cluster management and computation engine.
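A minimal PySpark sketch; the HDFS path and column names are assumptions. The point is that Spark can read from Hadoop storage while doing its own in-memory processing.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("seminar-demo").getOrCreate()

# Read data from a (hypothetical) HDFS path and aggregate it in memory.
df = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)
summary = df.groupBy("country").agg(F.sum("amount").alias("revenue"))
summary.orderBy(F.desc("revenue")).show(10)

spark.stop()
```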
6. APACHE KAFKA:
It is a distributed publish-subscribe messaging system; more specifically,
it provides a robust queue that can handle a high volume of data.
Messages can be passed from one point to another, that is, from a sender
to a receiver.
Message processing can be performed in both offline and online modes; it
is suitable for both.
To prevent data loss, Kafka messages are replicated within the cluster.
For real-time streaming data analysis, it integrates with Apache Storm and
Spark, and it is built on top of the ZooKeeper synchronization service.
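A hedged publish-subscribe sketch using the kafka-python client; the broker address and topic name are assumptions for illustration.

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: send a message to a topic on a (hypothetical) local broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("seminar-events", b'{"user": "hamid", "action": "login"}')
producer.flush()

# Consumer: read the same topic from the beginning.
consumer = KafkaConsumer(
    "seminar-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,   # stop polling after 5 seconds of silence
)
for message in consumer:
    print(message.value)
```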
7. MONGODB:
It is cross-platform and works with the concepts of collections and
documents.
It has document-oriented storage, which means data is stored in JSON-like
documents.
Indexes can be created on any attribute. Its features include high
availability, replication, rich queries, auto-sharding, and fast in-place
updates.
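A short PyMongo sketch; the database, collection, and field names are assumptions for illustration.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
products = client["seminar_db"]["products"]   # database and collection

# Document-oriented storage: records are JSON-like documents.
products.insert_one({"name": "keyboard", "price": 1200, "tags": ["usb", "wired"]})

# Indexes can be created on any attribute.
products.create_index("price")

# Rich queries over the stored documents.
for doc in products.find({"price": {"$lt": 2000}}):
    print(doc["name"], doc["price"])
```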
8. ELASTICSEARCH:
It is a real-time, distributed, open-source full-text search and analytics
engine.
It is highly scalable and can handle structured and unstructured data up
to petabytes; it can also be used as a replacement for document-based
stores such as MongoDB and RavenDB.
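A hedged full-text search sketch with the official Python client, assuming the 8.x client API; the index name and document fields are invented for illustration.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a document (stored as JSON and searchable in near real time).
es.index(index="articles", id=1, document={
    "title": "Big Data Basics",
    "body": "Volume, velocity and variety describe big data.",
})
es.indices.refresh(index="articles")

# Full-text search over the indexed documents.
hits = es.search(index="articles", query={"match": {"body": "velocity"}})
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["title"])
```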
WHERE DOES BIG DATA COME FROM?
DIFFERENCES BETWEEN
TRADITIONAL DATA AND BIG DATA
DATA VISUALIZATION
DEFINITION:
Prior to the 17th century, data visualization existed mainly in the realm of
maps, displaying land markers, cities, roads, and resources. As the demand
grew for more accurate mapping and physical measurement, better
visualizations were needed.
In 1644, Michael Florent Van Langren, a Flemish astronomer, is believed to
have provided the first visual representation of statistical data.
His one-dimensional line graph showed the twelve known estimates at
the time of the difference in longitude between Toledo and Rome, along
with the name of each astronomer who provided the estimate.
The latter half of the 19th century is what Friendly calls the Golden Age of
statistical graphics. Two famous examples of data visualization from that era
include John Snow’s (not that Jon Snow!) map of cholera outbreaks in the
London epidemic of 1854 and Charles Minard's 1869 chart showing the
number of men in Napoleon's infamous 1812 Russian campaign army, with
the army's location indicated by the X-axis and extreme cold temperatures
indicated at the points where frostbite took a fatal toll.
This time also provided us with a new visualization, the Rose Chart, created
by Florence Nightingale.
A number of factors contributed to this "Golden Age" of statistical graphing,
among them the industrial revolution, which created the modern business.
The latter half of the 20th century is what Friendly calls the 'rebirth of data
visualization', brought on by the emergence of computer processing and led
by pioneers such as John Tukey in the United States and Jacques Bertin in
France, who developed the science of information visualization in the areas
of statistics and cartography, respectively.
Bar Chart
Histogram
Gantt Chart
Heat Map
Box and Whisker Plot
Waterfall Chart
Area Chart
Scatter Plot
Pictogram Chart
Timeline
Highlight Table
Bullet Graph
Choropleth Map
Word Cloud
Network Diagram
Correlation Matrices
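As a small illustration, a few of the chart types listed above can be produced with matplotlib; the data below is made up.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Bar chart: comparison across categories.
axes[0].bar(["A", "B", "C"], [23, 45, 12])
axes[0].set_title("Bar Chart")

# Histogram: distribution of a numeric variable.
axes[1].hist(rng.normal(0, 1, 500), bins=20)
axes[1].set_title("Histogram")

# Scatter plot: relationship between two numeric variables.
x = rng.uniform(0, 10, 100)
axes[2].scatter(x, 2 * x + rng.normal(0, 2, 100), s=10)
axes[2].set_title("Scatter Plot")

plt.tight_layout()
plt.show()
```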