DBMS, Big Data Analytics Module 1 Notes
❖ Advantages:
✓ Since all data is stored at a single location, it is easier to access and co-ordinate.
✓ A centralized database has minimal data redundancy, since all data is stored in one place.
✓ It is generally cheaper to set up and maintain than the other database architectures.
❖ Disadvantages:
✓ Data traffic is heavier in a centralized database, since every request goes to the same site.
✓ If a system failure occurs at the central site, the entire database becomes unavailable and,
without backups, the data may be lost.
2. Client/Server Database Architecture:
✓ The client/server architecture is based on hardware and software components that interact
to form a system. The system includes three main components: clients, servers and
communication middleware.
2) Three-tier Architecture:
✓ This architecture adds an application server between the client and the database server. The
client communicates with the application server, which in turn communicates with the
database server. The application server stores the business rules (procedures and
constraints) used for accessing data from the database server.
3. Distributed Database
✓ A distributed database is not limited to one system; it is spread over
different sites, i.e., on multiple computers or over a network of computers.
✓ A distributed database system is located on various sites that don’t share physical components.
✓ This may be required when a particular database needs to be accessed by various users globally.
✓ It needs to be managed so that, to its users, it looks like one single database.
❖ Advantages:
✓ This database can be easily expanded as data is already spread across different physical
locations.
✓ The distributed database can easily be accessed from different networks.
✓ This database is more secure than a centralized database.
❖ Disadvantages:
✓ This database is very costly and difficult to maintain because of its complexity.
✓ It is difficult to provide a uniform view to the user, since the data is spread across different
physical locations.
WHAT IS DBMS?
➢ A DBMS, or database management system, is a software application used to create, access and
manage a database.
➢ It is application software that allows the user to create, maintain, access and manage the
database.
➢ The database itself is a repository or container that stores website and application
information used by various users and resources.
➢ With the help of a DBMS, we can easily create, retrieve and update the data in a database.
➢ A DBMS consists of a group of commands to manipulate the database and acts as an interface
between the end-users and the database.
➢ A database management system also aims to facilitate an overview of the databases by providing
a variety of administrative operations such as tuning, performance monitoring, and backup and
recovery.
DATABASE MANAGEMENT SYSTEM:
❖ Define Data: Allows the users to create, modify and delete the definitions that organize the data
in the database.
❖ Update Data: Allows the users to insert, modify and delete data in the database.
❖ Retrieve Data: Allows the users to retrieve data from the database based on the requirement.
❖ Administration of users: Registers the users and monitors their actions. Enforces data security,
maintains data integrity, monitors performance and deals with concurrency control.
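The define, update and retrieve functions listed above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the `student` table, its columns and the sample rows are assumptions made for the example, and SQLite stands in for whichever DBMS is actually used.

```python
import sqlite3

# In-memory SQLite database; the table, columns and rows are illustrative only.
conn = sqlite3.connect(":memory:")

# Define data: create the structure that organizes the data.
conn.execute(
    "CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT NOT NULL, marks INTEGER)"
)

# Update data: insert, modify and delete rows.
conn.executemany(
    "INSERT INTO student (id, name, marks) VALUES (?, ?, ?)",
    [(1, "Asha", 72), (2, "Ravi", 58), (3, "Meena", 91)],
)
conn.execute("UPDATE student SET marks = 65 WHERE id = 2")
conn.execute("DELETE FROM student WHERE id = 1")
conn.commit()

# Retrieve data: query rows that match a requirement.
rows = conn.execute(
    "SELECT name, marks FROM student WHERE marks >= 60 ORDER BY name"
).fetchall()
print(rows)
```

User administration (accounts, permissions, concurrency control) is handled by the DBMS server itself in systems such as MySQL or PostgreSQL, so it is not shown in this single-user sketch.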
TYPES OF DBMS:
➢ Following are the different types of DBMS:
❖ Hierarchical DBMS:
✓ This type of database management system showcases a predecessor-successor type
of relationship. We can consider it to be similar to a tree, where the nodes of the tree represent
records and the branches represent the parent-child links between them.
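A minimal sketch of the tree idea, assuming each node holds one record and each branch links a parent record to a child record. The `Record` class, the `path_to` helper and the sample names are hypothetical and exist only for this illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Record:
    """One node of the tree: a record plus its child records."""
    name: str
    children: list[Record] = field(default_factory=list)

    def add(self, child: Record) -> Record:
        self.children.append(child)
        return child


def path_to(root: Record, target: str, trail: tuple = ()) -> list | None:
    """Walk from the root downwards and return the path to the target record."""
    trail = trail + (root.name,)
    if root.name == target:
        return list(trail)
    for child in root.children:
        found = path_to(child, target, trail)
        if found:
            return found
    return None


# Illustrative hierarchy: every record except the root has exactly one parent.
college = Record("College")
cs = college.add(Record("Computer Science"))
college.add(Record("Mathematics"))
cs.add(Record("DBMS course"))

print(path_to(college, "DBMS course"))
```

Because each child has a single parent, a record can only be reached by following the path from the root, which is the predecessor-successor relationship described above.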
❖ Relational DBMS (RDBMS):
✓ This type of DBMS has a structure that allows the users to identify and access data in
relation to another piece of data in the database. In this type of DBMS, the data is stored in the
form of tables.
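Accessing data "in relation to another piece of data" can be sketched as a join between two tables. The example below uses Python's built-in sqlite3 module; the `department` and `employee` tables and their rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two illustrative tables; employee.dept_id refers to department.dept_id.
conn.executescript("""
CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
CREATE TABLE employee (
    emp_id   INTEGER PRIMARY KEY,
    emp_name TEXT,
    dept_id  INTEGER REFERENCES department(dept_id)
);
INSERT INTO department VALUES (10, 'Sales'), (20, 'Research');
INSERT INTO employee VALUES (1, 'Prashant', 10), (2, 'Seema', 20), (3, 'Satish', 10);
""")

# Relate each employee row to its department row through the shared dept_id value.
joined = conn.execute("""
    SELECT e.emp_name, d.dept_name
    FROM employee AS e
    JOIN department AS d ON d.dept_id = e.dept_id
    ORDER BY e.emp_id
""").fetchall()
print(joined)
```

The department name is stored once and looked up through the key, rather than being repeated in every employee row.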
❖ Network DBMS:
✓ This type of database management system supports many-to-many relations, where multiple
user records can be linked.
❖ Object-oriented DBMS:
✓ This type of database management system uses small individual software units called objects. Here,
each object contains a piece of data and the instructions for the actions to be done with that data.
WHAT IS DATA SCIENCE?
✓ Data science is a field that deals with structured, semi-structured and unstructured data.
✓ It involves practices like data cleansing, data preparation, data analysis, and much more.
✓ Data science is the combination of statistics, mathematics, programming, and problem-solving,
together with:
✓ Capturing data in ingenious ways
✓ The ability to look at things differently
✓ The activity of cleansing, preparing, and aligning data
✓ This umbrella term includes various techniques that are used when extracting insights and
information from data.
DATA SCIENCE:
✓ Data science is the study of data.
✓ It involves developing methods of recording, storing and analysing data to effectively extract useful
information.
✓ The goal of data science is to gain insights and knowledge from any type of data, both structured
and unstructured.
✓ Data science is related to computer science, but is a separate field.
✓ Computer science involves creating programs and algorithms to record and process data, while
data science covers any type of data analysis, which may or may not use computers.
✓ Data science is more closely related to the mathematics field of statistics, which includes the
collection, organization, analysis and presentation of data.
NEED FOR DATA SCIENCE:
✓ With the help of data science technology, we can convert massive amounts of raw and
unstructured data into meaningful insights.
✓ Data science technology is being adopted by various companies, whether big brands or start-ups.
Google, Amazon, Netflix, etc., which handle huge amounts of data, use data science
algorithms for a better customer experience.
COMPONENTS OF DATA SCIENCE:
1) Discovery: The first phase is discovery, which involves asking the right questions. When you start any
data science project, you need to determine the basic requirements, priorities, and project
budget. In this phase, we determine all the requirements of the project, such as the number
of people, technology, time, data and the end goal, and then frame the business problem at a first
hypothesis level.
2) Data preparation: Data preparation is also known as data munging. In this phase, we need to
perform the following tasks:
▪ Data cleaning
▪ Data reduction
▪ Data integration
▪ Data transformation
After performing all the above tasks, we can easily use this data for our further processes.
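The four preparation tasks can be sketched in plain Python. The two data sources, the field names and the choice of min-max scaling as the transformation are all assumptions made for this illustration, and the steps are run in a convenient order for the toy data rather than a prescribed one.

```python
# Two hypothetical sources that share a customer id.
orders = [
    {"id": 1, "amount": "250", "note": "gift"},
    {"id": 2, "amount": None, "note": ""},        # missing value
    {"id": 3, "amount": " 100 ", "note": "rush"},  # stray whitespace
]
profiles = {1: {"city": "Pune"}, 3: {"city": "Mysuru"}}

# Data cleaning: drop rows with a missing amount and convert the rest to numbers.
cleaned = [
    dict(row, amount=float(row["amount"].strip()))
    for row in orders
    if row["amount"] is not None
]

# Data integration: combine each order with the matching profile from the second source.
integrated = [{**row, **profiles.get(row["id"], {})} for row in cleaned]

# Data reduction: keep only the attributes needed for later analysis.
reduced = [{key: row.get(key) for key in ("id", "amount", "city")} for row in integrated]

# Data transformation: min-max scale the amount into the range 0..1.
low = min(row["amount"] for row in reduced)
high = max(row["amount"] for row in reduced)
prepared = [
    dict(row, amount_scaled=(row["amount"] - low) / (high - low)) for row in reduced
]

for row in prepared:
    print(row)
```

In practice these steps are usually carried out with a data-frame library, but the logic is the same.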
3) Model Planning: In this phase, we need to determine the various methods and techniques to establish
the relations between input variables. We apply exploratory data analysis (EDA), using various
statistical formulas and visualization tools, to understand the relations between variables and to see what
the data can tell us. Common tools used for model planning are:
• SQL Analysis Services
• R
• SAS
• Python
4) Model-building: In this phase, the process of model building starts. We create datasets for
training and testing purposes. We apply different techniques, such as association, classification
and clustering, to build the model. Following are some common model-building tools:
✓ SAS Enterprise Miner
✓ WEKA
✓ SPSS Modeler
✓ MATLAB
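As a rough illustration of creating training and testing datasets and then applying a classification technique, the sketch below splits a toy dataset and fits a nearest-centroid classifier in plain Python. The data, the 75/25 split and the choice of classifier are assumptions for this example; the tools listed above provide far more capable implementations.

```python
import random

# Toy labelled data: (hours_studied, attendance_percent) -> result. Values are made up.
data = [
    ((2, 50), "fail"), ((3, 55), "fail"), ((1, 40), "fail"), ((4, 60), "fail"),
    ((8, 90), "pass"), ((7, 85), "pass"), ((9, 95), "pass"), ((6, 80), "pass"),
]

# Create the training and testing datasets (75% / 25%) with a fixed seed.
rng = random.Random(0)
shuffled = data[:]
rng.shuffle(shuffled)
split = int(0.75 * len(shuffled))
train, test = shuffled[:split], shuffled[split:]


def centroids(rows):
    """Average the feature vectors of each class seen in the training rows."""
    sums = {}
    for features, label in rows:
        total, count = sums.get(label, ([0.0] * len(features), 0))
        sums[label] = ([t + f for t, f in zip(total, features)], count + 1)
    return {label: [t / count for t in total] for label, (total, count) in sums.items()}


def predict(model, features):
    """Return the label whose centroid is closest to the given features."""
    def squared_distance(centre):
        return sum((a - b) ** 2 for a, b in zip(centre, features))
    return min(model, key=lambda label: squared_distance(model[label]))


model = centroids(train)
accuracy = sum(predict(model, features) == label for features, label in test) / len(test)
print(f"accuracy on {len(test)} held-out rows: {accuracy:.2f}")
```

Evaluating only on the held-out test rows gives an estimate of how the model behaves on data it has not seen during training.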
5) Operationalize: In this phase, we deliver the final reports of the project, along with briefings, code,
and technical documents. This phase gives a clear overview of the complete project performance
and other components on a small scale before the full deployment.
6) Communicate results: In this phase, we check whether we have reached the goal set in the
initial phase. We communicate the findings and final result to the business team.
BIG DATA ANALYTICS
WHAT ARE BIG DATA TOOLS AND SOFTWARE?
✓ Hadoop
✓ Qubole
✓ Cassandra
✓ MongoDB
✓ Apache Storm
✓ CouchDB
1) Data creation, ingestion, or capture: Whether you generate data from data entry, acquire existing
data from other sources, or receive signals from devices, you get information somehow. This stage
describes when data values enter the firewalls of your system.
2) Data Processing: Data preparation typically includes integrating data from multiple sources, validating
data, and applying transformations.
3) Data Analysis: However you analyse and interpret your data, this is where the magic happens.
Exploring and interpreting your data may require a variety of analyses. This could mean statistical
analysis and visualization.
4) Data sharing or publication: This stage is where forecasts and insights turn into decisions and
direction.
5) Archiving: Once data has been collected, processed, analysed, and shared, it is typically stored for
future reference.
TYPES OF BIG DATA (5 Marks):
1. Structured Big Data:
✓ Any data that can be stored, accessed and processed in a fixed format is termed
‘structured’ data. Over time, computer science has achieved great
success in developing techniques for working with this kind of data (where the format is well
known in advance) and deriving value out of it. However, nowadays we are foreseeing
issues when the size of such data grows to a huge extent; typical sizes are in the range of
multiple zettabytes.
✓ Examples Of Structured Data
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
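Because every record above carries the same three fields, the records can be loaded into fixed-format rows and processed uniformly. The sketch below parses them with Python's built-in xml.etree.ElementTree; wrapping them in a `<records>` root element is an assumption made so that they form a single parseable document.

```python
import xml.etree.ElementTree as ET

raw = """
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
"""

# Wrap the records in one root element (an assumption) so they parse as a single document.
root = ET.fromstring(f"<records>{raw}</records>")

# Each record becomes a (name, sex, age) row with the same fields in the same order.
rows = [
    (rec.findtext("name"), rec.findtext("sex"), int(rec.findtext("age")))
    for rec in root.iter("rec")
]

average_age = sum(age for _, _, age in rows) / len(rows)
print(rows)
print(average_age)
```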
TYPES OF BIG DATA (10 Marks)
1.Structured Big Data:
✓ Data is stored in rows & columns. This type of data constitutes about 10% of today’s total data &
is accessible through a DBMS.
✓ E.g.: Official registers that are created by governmental institutions to store data on individuals,
enterprises & real estate.
2.Unstructured Big Data:
✓ Data of different forms like text, image, video, document, etc. This type of data accounts for about
90% of the data.
3.Geographic Data:
✓ Data related to roads, buildings, lakes, addresses, people, workplaces & transportation routes that
are generated from geographic information systems.
✓ E.g.: Google Maps
4.Real-Time Media:
✓ Real-time streaming of live or stored media data. One of the main sources of media data is services
such as YouTube, Flickr & Vimeo, which produce a huge amount of video, pictures & audio. Another
important source of real-time media is video conferencing, which allows two or more locations to
communicate through two-way video & audio transmission.
5.Natural Language Data:
✓ Human-generated data, particularly in the verbal form. The sources of natural language data include
speech capture devices, land phones & IoT devices that generate large volumes of text-like communication
between devices.
6.Time Series:
7.Event Data:
✓ Data generated by matching external events with time series. This requires
distinguishing important events from unimportant ones.
✓ E.g.: Information related to vehicle crashes or accidents can be collected & analysed to help
understand what the vehicles were doing before, during & after the event. The data is generated by
sensors fixed in different places of the vehicle body.
8.Network Data:
✓ Data concerning very large networks, such as social networks (e.g., Facebook & Twitter), information
networks (e.g., the World Wide Web), biological networks (e.g., biochemical, ecological & neural
networks) & technological networks (e.g., the Internet, telephone & transportation networks).
9.Linked Data:
✓ Data that is built upon standard Web technologies such as HTTP, RDF, SPARQL & URIs to share
information that can be semantically queried by computers. This allows data from different sources to
be connected & read.
APPLICATIONS OF BIG DATA ANALYTICS
1.Healthcare
✓ Big data analytics has improved healthcare by providing personalized medicine and prescriptive
analytics. Researchers are mining the data to see which treatments are more effective for particular
conditions, identify patterns related to drug side effects, and gain other important information that
can help patients and reduce costs. It’s possible to predict diseases that will escalate in specific areas.
Based on these predictions, it’s easier to strategize diagnostics and plan for stocking serums and vaccines.
2.Media & Entertainment
✓ Various companies in the media and entertainment industry are facing new business models for the
way they create, market and distribute their content. Big Data applications benefit the media and
entertainment industry through:
• Scheduling optimization
• Ad targeting
✓ An on-demand music service uses Hadoop Big Data analytics to collect data from its millions of users
worldwide and then uses the analysed data to give informed music recommendations to individual
users. Amazon Prime, which is driven to provide a great customer experience by offering video, music,
and Kindle books in a one-stop shop, also heavily utilizes Big Data.
3.Traffic Optimization
✓ Big Data helps in aggregating real-time traffic data gathered from road sensors, GPS devices and video
cameras. The potential traffic problems in dense areas can be prevented by adjusting public
transportation routes in real time.
4.Real-time Analytics to Optimize Flight Route
✓ Each unsold seat on an aircraft is a loss of revenue. Route analysis is done to determine
aircraft occupancy and route profitability. By analysing customers’ travel behaviour, airlines can
optimize flight routes to serve the maximum number of customers. Increasing the customer base is most
important for maximizing capacity utilization. Big data analytics makes this kind of route optimization
much easier; for example, the number of aircraft on the most profitable routes can be increased.
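A very small sketch of the occupancy part of route analysis: it aggregates seats offered and seats sold per route and ranks the routes by load factor (sold seats divided by available seats). The route codes and figures are invented for illustration, and a real analysis would also need fares and costs to judge profitability.

```python
from collections import defaultdict

# Hypothetical flight records: (route, seats_available, seats_sold).
flights = [
    ("BLR-DEL", 180, 171),
    ("BLR-DEL", 180, 162),
    ("BLR-GOI", 150, 90),
    ("BLR-GOI", 150, 105),
    ("DEL-BOM", 200, 190),
]

# Aggregate available and sold seats per route.
totals = defaultdict(lambda: [0, 0])
for route, seats, sold in flights:
    totals[route][0] += seats
    totals[route][1] += sold

# Load factor = seats sold / seats available; unsold seats are lost revenue.
load_factor = {route: sold / seats for route, (seats, sold) in totals.items()}
unsold = {route: seats - sold for route, (seats, sold) in totals.items()}

# Rank routes from best to worst occupancy.
ranked = sorted(load_factor, key=load_factor.get, reverse=True)
for route in ranked:
    print(f"{route}: load factor {load_factor[route]:.2f}, unsold seats {unsold[route]}")
```

Routes at the top of the ranking are candidates for extra aircraft, while routes at the bottom show where capacity is being wasted.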
5.E-commerce Recommendation
✓ By tracking customers’ spending habits and shopping behaviour, big retail stores provide recommendations
to the customer. E-commerce sites like Amazon, Walmart and Flipkart do product recommendation. They
track what product a customer is searching for and, based on that data, recommend that type of product
to that customer. As an example, suppose a customer searched for a bed cover on Amazon. Amazon
now has data that the customer may be interested in buying a bed cover. The next time that customer visits
a Google page, advertisements for various bed covers will be shown. Thus, advertisements for the right
product can be sent to the right customer. YouTube also recommends videos based on the types of video the
user previously liked and watched. Based on the content of the video the user is watching, relevant
advertisements are shown while the video runs. As an example, suppose someone is watching a tutorial
video on Big Data; then an advertisement for some other Big Data course will be shown during that video.
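The idea of recommending "that type of product" from what a customer searched can be sketched as a simple category match. The catalogue, the search history and the `recommend` function are hypothetical; real recommendation systems use far richer signals and models.

```python
from collections import Counter

# Hypothetical catalogue mapping each product to its type (category).
catalog = {
    "floral bed cover": "bed cover",
    "cotton bed cover": "bed cover",
    "striped bed cover": "bed cover",
    "pillow set": "bedding",
    "desk lamp": "lighting",
}

# What one customer searched for, in order.
search_history = ["cotton bed cover", "desk lamp", "floral bed cover"]


def recommend(history, catalog, limit=3):
    """Recommend unseen products from the type the customer searched for most often."""
    counts = Counter(catalog[item] for item in history if item in catalog)
    if not counts:
        return []
    top_type = counts.most_common(1)[0][0]
    seen = set(history)
    return [
        item for item, kind in catalog.items()
        if kind == top_type and item not in seen
    ][:limit]


print(recommend(search_history, catalog))
```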
6.Smart Farming
✓ Traditional tools are being replaced by sensor-equipped machines that can collect data from their
environments to control their behaviour, such as thermostats for temperature regulation or
algorithms for implementing crop protection strategies. Technology, combined with external big data
sources like weather data, market data, or standards with other farms, is contributing to the rapid
development of smart farming.
PROCESS OF BDA
1. Data Collection
✓ Data collection plays the most important role in the Big Data cycle. The Internet provides almost
unlimited sources of data for a variety of topics. The importance of this area depends on the type of
business, but traditional industries can acquire diverse sources of external data and combine them
with their transactional data. For example, let’s assume we would like to build a system that
recommends restaurants. The first step would be to gather data, in this case reviews of restaurants
from different websites, and store them in a database.
2. Data Cleansing
✓ Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset. When combining multiple data sources, there are many
opportunities for data to be duplicated or mislabelled.
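Continuing the restaurant example, the sketch below shows one possible cleansing pass: it normalises names, removes rows with corrupted or missing values, and drops duplicates that appear after combining sources. The field names, sample rows and the 0 to 5 rating range are assumptions for illustration.

```python
# Hypothetical reviews combined from several websites.
raw_reviews = [
    {"restaurant": " Spice Hub ", "rating": "4.5"},
    {"restaurant": "spice hub", "rating": "4.5"},    # duplicate once the name is normalised
    {"restaurant": "Green Leaf", "rating": "n/a"},   # corrupted rating
    {"restaurant": "", "rating": "3.0"},             # incomplete row
    {"restaurant": "Curry Point", "rating": "3.8"},
]


def clean(rows):
    """Fix formatting, then drop invalid, incomplete and duplicate rows."""
    seen, result = set(), []
    for row in rows:
        name = row["restaurant"].strip().title()   # fix inconsistent formatting
        try:
            rating = float(row["rating"])
        except ValueError:
            continue                               # corrupted value
        if not name or not 0 <= rating <= 5:
            continue                               # incomplete or out-of-range row
        key = (name, rating)
        if key in seen:
            continue                               # duplicate
        seen.add(key)
        result.append({"restaurant": name, "rating": rating})
    return result


cleaned_reviews = clean(raw_reviews)
print(cleaned_reviews)
```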
3. Data Exploration
✓ Data exploration is the first step of data analysis used to explore and visualize data to uncover insights
from the start or identify areas or patterns to dig into more. Using interactive dashboards and point-
and-click data exploration, users can better understand the bigger picture and get to insights faster.
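A first exploratory look at a numeric column can be sketched with Python's built-in statistics module and a text histogram. The ratings are made-up sample values; dashboards and plotting tools do the same job interactively.

```python
import statistics
from collections import Counter

# Hypothetical restaurant ratings after cleansing.
ratings = [4.5, 3.8, 4.0, 2.5, 4.5, 5.0, 3.0]

# Summary statistics give the overall picture at a glance.
summary = {
    "count": len(ratings),
    "mean": round(statistics.mean(ratings), 2),
    "median": statistics.median(ratings),
    "min": min(ratings),
    "max": max(ratings),
}
print(summary)

# A rough text histogram: how many ratings fall in each whole-star bucket.
buckets = Counter(int(rating) for rating in ratings)
for star in sorted(buckets):
    print(f"{star} | {'#' * buckets[star]}")
```

Even this small summary hints at where to dig further, for example why most ratings cluster in the 4-star bucket.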
4. Data Visualization
✓ Big data visualization is the process of displaying data in charts, graphs, maps, and other visual forms.
It is used to help people easily understand and interpret their data at a glance, and to clearly show
trends and patterns that arise from this data.
BIG DATA ANALYTICS TOOLS
▪ R-Programming
▪ Altamira LUMIFY
▪ Apache Hadoop
▪ MongoDB
▪ RapidMiner
▪ Apache Spark
▪ Microsoft Azure
▪ Zoho Analytics