DBMS, Big Data Analytics Module 1 Notes

The document discusses database management for data science. It defines key terms like data, database, and different types of databases. It also explains what a database management system (DBMS) is, different types of DBMS, and what data science is.

Uploaded by

Chaya Anu
Copyright © All Rights Reserved
DIGITAL FLUENCY

Module 1: Emerging Technologies


Database Management for Data Science
DATA:
❖ Data is raw facts, statistics, or figures that can be stored or recorded in an electronic machine.
❖ Data can be defined as a representation of facts, concepts or instructions in a formalized manner,
suitable for communication, interpretation or processing by a human or an electronic machine.
❖ Data is represented with the help of characters such as letters (A-Z, a-z), digits (0-9) or special
characters (+, -, /, *, <, >, =, etc.).
DATABASE:
❖ Database is a collection of related data.
❖ A database is a place where all the data is stored in a structured format.
❖ It helps users to easily access, manage and update the required information.
❖ In other words, a database is a big container in which all the information about a website or an
application is stored in a structured format.
❖ Example: a company can hold various details of its employees, such as name, empID, email, blood
group, salary and so on.
❖ All these details can be stored in a database named “employee” in a structured format
such as tables or a hierarchy.
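The employee example above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the column types and the sample row are assumptions, since the notes name the fields but not their format.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Columns follow the fields named in the notes; the types are assumed.
conn.execute(
    """
    CREATE TABLE employee (
        empID       INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        email       TEXT,
        blood_group TEXT,
        salary      REAL
    )
    """
)

# Hypothetical sample row, for illustration only.
conn.execute(
    "INSERT INTO employee VALUES (?, ?, ?, ?, ?)",
    (101, "Asha", "asha@example.com", "O+", 45000.0),
)

# Access the stored information through its key.
row = conn.execute(
    "SELECT name, salary FROM employee WHERE empID = ?", (101,)
).fetchone()
print(row)
```

Because every row has the same columns, any employee can be looked up, updated, or listed with the same kind of query.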
TYPES OF DATABASES:
1. Centralized Database:
✓ Works on a client-server basis.
✓ It is located at one particular site.
✓ This site is most often a central computer or database system, for example a desktop or
server CPU, or a mainframe computer.
✓ The controlling mechanism is also centralized, and the data is deposited in a central location.
✓ Files are identified by the location of their disk drive and by their names.
✓ Security is not such a crucial concern here.
✓ It is maintained and modified from that location only, and is usually accessed over a network
connection such as a LAN or WAN.
✓ Centralized databases are used by organizations such as colleges, companies, banks, etc.

❖ Advantages:
✓ Since all data is stored at a single location, it is easier to access and coordinate.
✓ The centralized database has minimal data redundancy, since all data is stored in a single place.
✓ It is generally cheaper than the other types of database.
❖ Disadvantages:
✓ Data traffic is heavier, because every request goes to the same location.
✓ If a system failure occurs at the central site, the entire database becomes unavailable and its data can be lost.
2. Client/Server Database Architecture:
✓ The client/server architecture is based on hardware and software components that interact
to form a system. The system includes three main components: Clients, Servers and
Communication Middleware.

Fig: Client/server system


✓ Client: The client is any computer process that requests service from the server.
✓ Server: The server is any computer process providing services to the clients.
✓ Communication Middleware: The communication middleware is any computer process
through which clients and servers communicate and is also known as communication
layer.
✓ There are basically two types of client/server architecture:
1) Two-tier Architecture:
✓ In this architecture, the user interface and application programs are placed on the client
side and the database system on the server side. The application programs that reside on the
client side invoke the DBMS on the server side.

2) Three-tier Architecture:
✓ This architecture adds an application server between the client and the database server. The
client communicates with the application server, which in turn communicates with the
database server. The application server stores the business rules (procedures and
constraints) used for accessing data from the database server.
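A rough sketch of the three-tier idea in Python. All three tiers run inside one process here purely for illustration; in a real system they would usually be separate programs talking over a network. The account table and the overdraft rule are hypothetical.

```python
import sqlite3


# Database tier: owns the data (an in-memory SQLite database here).
def make_database():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
    conn.execute("INSERT INTO account VALUES (1, 100.0)")
    return conn


# Application tier: holds the business rules and talks to the database.
class ApplicationServer:
    def __init__(self, conn):
        self.conn = conn

    def withdraw(self, account_id, amount):
        # Hypothetical business rule: a withdrawal may not overdraw the account.
        (balance,) = self.conn.execute(
            "SELECT balance FROM account WHERE id = ?", (account_id,)
        ).fetchone()
        if amount > balance:
            return "rejected"
        self.conn.execute(
            "UPDATE account SET balance = balance - ? WHERE id = ?",
            (amount, account_id),
        )
        return "ok"


# Client tier: never touches the database, only the application server.
def client(app):
    return [app.withdraw(1, 30.0), app.withdraw(1, 500.0)]


app = ApplicationServer(make_database())
results = client(app)
print(results)
```

In a two-tier version, the withdraw logic would sit in the client program itself and that program would query the database directly.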

3. Distributed Database
✓ A distributed database is a database that is not limited to one system; it is spread over
different sites, i.e., over multiple computers or a network of computers.
✓ A distributed database system is located on various sites that don’t share physical components.
✓ This may be required when a particular database needs to be accessed by various users globally.
✓ It needs to be managed in such a way that, to the users, it looks like one single database.

❖ Advantages:
✓ This database can be easily expanded, as the data is already spread across different physical
locations.
✓ The distributed database can easily be accessed from different networks.
✓ This database is more secure than a centralized database.
❖ Disadvantages:
✓ This database is very costly, and it is difficult to maintain because of its complexity.
✓ It is difficult to provide a uniform view to the user, since the data is spread across different
physical locations.

WHAT IS DBMS?
➢ DBMS or Database Management system is a software application used to access, create and
manage the database.

➢ A database management system is application software that allows the user to create,
maintain, access and manage the database.
➢ The database itself is the repository or container that stores website and application
information used by various users and resources.
➢ With the help of a DBMS, we can easily create, retrieve and update the data in a database.
➢ A DBMS consists of a group of commands to manipulate the database and acts as an interface
between the end-users and the database.
➢ A DBMS also aims to facilitate an overview of the databases by providing a variety of
administrative operations such as tuning, performance monitoring, and backup and recovery.
DATABASE MANAGEMENT SYSTEM:
❖ Define Data: Allows users to create, modify and delete the definitions that describe how data
is organized in the database.
❖ Update Data: Allows users to insert, modify and delete data in the database.
❖ Retrieve Data: Allows users to retrieve data from the database based on the requirement.
❖ Administration of users: Registers users and monitors their actions. Enforces data security,
maintains data integrity, monitors performance and deals with concurrency control.
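The first three functions above map onto familiar SQL statements. The sketch below uses Python's built-in sqlite3 module with a hypothetical student table. User administration is not shown, because SQLite is an embedded library without user accounts; server-based DBMS products handle that function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Define data: create the table definition, then modify it.
conn.execute("CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT)")
conn.execute("ALTER TABLE student ADD COLUMN course TEXT")

# Update data: insert, modify and delete rows.
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "Ravi", "BCA"), (2, "Meena", "BCom"), (3, "Kiran", "BCA")],
)
conn.execute("UPDATE student SET course = 'BSc' WHERE roll_no = 2")
conn.execute("DELETE FROM student WHERE roll_no = 3")

# Retrieve data: query according to a requirement.
rows = conn.execute(
    "SELECT roll_no, name, course FROM student ORDER BY roll_no"
).fetchall()
print(rows)
```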
TYPES OF DBMS:
➢ Following are the different types of DBMS:
❖ Hierarchical DBMS:
✓ This type of database management system exhibits a predecessor-successor (parent-child) type
of relationship. We can consider it to be similar to a tree, where the nodes of the tree represent
records and the branches of the tree represent the links between them.
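A small Python sketch of the tree idea. The college structure is hypothetical; each dictionary is one record, and its "children" list plays the role of the branches leading to its successors.

```python
# Each node is a record; "children" holds its successors.
tree = {
    "record": "College",
    "children": [
        {"record": "Commerce Dept", "children": [
            {"record": "BCom", "children": []},
            {"record": "BBA", "children": []},
        ]},
        {"record": "Science Dept", "children": [
            {"record": "BSc", "children": []},
        ]},
    ],
}


def path_to(node, target, trail=()):
    """Return the chain of predecessors from the root down to `target`."""
    trail = trail + (node["record"],)
    if node["record"] == target:
        return trail
    for child in node["children"]:
        found = path_to(child, target, trail)
        if found:
            return found
    return None


print(path_to(tree, "BBA"))
```

Every record is reached through exactly one parent, which is what distinguishes this model from the network model.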

❖ Relational DBMS (RDBMS):
✓ This type of DBMS has a structure that allows users to identify and access data in
relation to another piece of data in the database. In this type of DBMS, the data is stored in the
form of tables.

❖ Network DBMS:
✓ This type of database management system supports many-to-many relationships, in which multiple
user records can be linked to one another.
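A many-to-many relationship can be sketched in Python with two lookup tables built from the same set of links. The students and courses are hypothetical.

```python
from collections import defaultdict

# Hypothetical enrolment links: (student, course) pairs.
links = [("Ravi", "DBMS"), ("Ravi", "Statistics"), ("Meena", "DBMS")]

courses_of = defaultdict(set)
students_of = defaultdict(set)
for student, course in links:
    courses_of[student].add(course)    # one student, many courses
    students_of[course].add(student)   # one course, many students

print(sorted(courses_of["Ravi"]))
print(sorted(students_of["DBMS"]))
```

Unlike a tree, a record here can have several "parents": the DBMS course is linked to both students.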

❖ Object-oriented DBMS:

✓ This type of database management system uses small individual pieces of software called objects.
Here, each object contains a piece of data and the instructions for the actions to be done with the data.
WHAT IS DATA SCIENCE?
✓ Data science is a field that deals with structured, semi-structured and unstructured data.
✓ It involves practices like data cleansing, data preparation, data analysis, and much more.
✓ Data science is the combination of statistics, mathematics, programming, and problem-solving;
capturing data in ingenious ways; the ability to look at things differently; and the activity of
cleansing, preparing, and aligning data.
✓ This umbrella term includes various techniques that are used when extracting insights and
information from data.
DATA SCIENCE:
✓ Data science is the study of data.
✓ It involves developing methods of recording, storing and analysing data to effectively extract
useful information.
✓ The goal of data science is to gain insights and knowledge from any type of data, both structured
and unstructured.
✓ Data science is related to computer science, but is a separate field.
✓ Computer science involves creating programs and algorithms to record and process data, while
data science covers any type of data analysis, which may or may not use computers.
✓ Data science is more closely related to the mathematical field of statistics, which includes the
collection, organization, analysis and presentation of data.

NEED FOR DATA SCIENCE:
✓ With the help of data science technology, we can convert massive amounts of raw &
unstructured data into meaningful insights.
✓ Data science technology is being adopted by various companies, whether big brands or start-ups.
Google, Amazon, Netflix, etc., which handle huge amounts of data, use data science
algorithms for a better customer experience.
COMPONENTS OF DATA SCIENCE:

DATA SCIENCE PROCESS:

1) Discovery: The first phase is discovery, which involves asking the right questions. When you start any
data science project, you need to determine the basic requirements, priorities, and project
budget. In this phase, we determine all the requirements of the project, such as the number
of people, technology, time, data, and the end goal, and then frame the business problem at a first
hypothesis level.

2) Data preparation: Data preparation is also known as data munging. In this phase, we need to
perform the following tasks:
▪ Data cleaning
▪ Data reduction
▪ Data integration
▪ Data transformation
After performing all the above tasks, we can easily use this data for our further processes.
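The four preparation tasks can be illustrated on a tiny data set in plain Python. The records, the region lookup, and the order of the steps are all assumptions made for this sketch; real projects choose the order that suits their data.

```python
# Hypothetical raw records; None marks a missing value.
sales = [
    {"id": 1, "city": " mysuru ", "amount": "1200"},
    {"id": 2, "city": "Hubli", "amount": None},       # incomplete
    {"id": 1, "city": " mysuru ", "amount": "1200"},  # duplicate
    {"id": 3, "city": "hubli", "amount": "800"},
]
# Hypothetical second source, used for integration.
regions = {"Mysuru": "South", "Hubli": "North"}

# Data cleaning: drop incomplete rows and exact duplicates.
seen, cleaned = set(), []
for row in sales:
    key = tuple(sorted(row.items()))
    if row["amount"] is not None and key not in seen:
        seen.add(key)
        cleaned.append(row)

# Data transformation: tidy the text and convert amounts to numbers.
transformed = [
    {"id": r["id"], "city": r["city"].strip().title(), "amount": float(r["amount"])}
    for r in cleaned
]

# Data integration: attach the region from the second source.
integrated = [dict(r, region=regions[r["city"]]) for r in transformed]

# Data reduction: keep only the columns needed for analysis.
reduced = [{"region": r["region"], "amount": r["amount"]} for r in integrated]
print(reduced)
```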
3) Model Planning: In this phase, we determine the various methods and techniques for establishing
the relationships between input variables. We apply exploratory data analysis (EDA), using various
statistical formulas and visualization tools, to understand the relationships between variables and to
see what the data can tell us. Common tools used for model planning are:
• SQL Analysis Services
• R
• SAS
• Python
4) Model-building: In this phase, the process of model building starts. We create datasets for
training and testing purposes, and apply techniques such as association, classification,
and clustering to build the model. The following are some common model-building tools:
✓ SAS Enterprise Miner
✓ WEKA
✓ SPSS Modeler
✓ MATLAB
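To make the idea of training and testing datasets concrete, here is a minimal sketch in plain Python rather than in any of the tools listed above. It splits a small, made-up data set and applies a very simple classification technique (nearest centroid), chosen only because it fits in a few lines.

```python
import random

# Hypothetical labelled points: (feature_1, feature_2, label).
data = [(1.0, 1.2, "A"), (0.8, 1.0, "A"), (1.1, 0.9, "A"), (0.9, 1.1, "A"),
        (3.0, 3.2, "B"), (3.1, 2.9, "B"), (2.9, 3.0, "B"), (3.2, 3.1, "B")]

random.seed(0)                      # fixed seed so the split is repeatable
random.shuffle(data)
split = int(0.75 * len(data))       # 75% for training, 25% for testing
train, test = data[:split], data[split:]


def centroid(points):
    """Mean position of a list of (x, y) points."""
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)


# "Training": one centroid per label, computed from the training set only.
centroids = {
    label: centroid([(x, y) for x, y, lab in train if lab == label])
    for label in {lab for _, _, lab in train}
}


def predict(x, y):
    """Assign the label whose centroid is closest to (x, y)."""
    return min(
        centroids,
        key=lambda lab: (x - centroids[lab][0]) ** 2 + (y - centroids[lab][1]) ** 2,
    )


# "Testing": measure accuracy on points the model has not seen.
accuracy = sum(predict(x, y) == label for x, y, label in test) / len(test)
print(accuracy)
```

Keeping the test points out of the training step is what makes the accuracy figure a fair estimate of how the model behaves on new data.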
5) Operationalize: In this phase, we deliver the final reports of the project, along with briefings, code,
and technical documents. This phase gives a clear overview of the complete project performance
and other components on a small scale before full deployment.
6) Communicate results: In this phase, we check whether we have reached the goal set in the
initial phase. We communicate the findings and final result to the business team.

BIG DATA ANALYTICS

WHAT IS BIG DATA?


✓ Data that is very large in size is called big data.
✓ Normally we work with data measured in megabytes (Word documents, Excel sheets) or at most
gigabytes (movies, code), but data measured in petabytes, i.e., 10^15 bytes, is called big data.
✓ Big data is a term that describes large, hard-to-manage volumes of both structured and
unstructured data.
MEANING OF BIG DATA
✓ Big data refers to the large, diverse sets of information that grow at ever-increasing rates. It
encompasses the volume of information, the velocity or speed at which it is created & collected &
the variety or scope of the data points being covered. This describes the large volume of data both
structured & unstructured.
DEFINITIONS
✓ According to John Mashey, “Big data refers to the data sets with sizes beyond the ability of
commonly used software tools to capture, curate, manage & process data within a tolerable
elapsed time.”
✓ According to McKinsey, “Big data is the datasets whose size is beyond the ability of
typical database software tools to capture, store, manage & analyse.”

SOURCES OF BIG DATA:


❖ Social networking sites: Facebook, Google, LinkedIn and similar sites generate huge amounts of
data on a day-to-day basis, as they have billions of users worldwide.
❖ E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge numbers of logs from which
users' buying trends can be traced.
❖ Weather stations: Weather stations and satellites produce huge volumes of data, which are stored and
manipulated to forecast the weather.
❖ Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their
plans accordingly; for this they store the data of their millions of users.
❖ Share market: Stock exchanges across the world generate huge amounts of data through their daily
transactions.

3V's OF BIG DATA


1) Velocity: Data is increasing at a very fast rate. It is estimated that the volume of data will
double every 2 years.
2) Variety: Nowadays, data is not stored only in rows and columns. Data is structured as well as
unstructured. Log files and CCTV footage are unstructured data. Data that can be saved in tables,
like the transaction data of a bank, is structured data.
3) Volume: The amount of data we deal with is very large, on the order of petabytes.
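The "doubles every 2 years" estimate under Velocity is simple exponential growth. A short Python sketch, taking the estimate at face value and assuming a starting volume of 1 petabyte purely for illustration:

```python
def projected_volume(current_pb, years, doubling_period=2):
    """Volume after `years`, assuming it doubles every `doubling_period` years."""
    return current_pb * 2 ** (years / doubling_period)


# Hypothetical starting point: 1 petabyte (10**15 bytes).
for years in (2, 4, 10):
    print(years, projected_volume(1, years))
```

After 10 years the volume has doubled five times, so it is 32 times the starting amount.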

WHAT ARE BIG DATA TOOLS AND SOFTWARE?
✓ Hadoop
✓ Qubole
✓ Cassandra
✓ MongoDB
✓ Apache Storm
✓ CouchDB

BIG DATA LIFE CYCLE


The data life cycle, also called the information life cycle, refers to the entire period of time that data exists
in your system. This life cycle encompasses all the stages that your data goes through, from first capture
onward.

1) Data creation, ingestion, or capture: Whether you generate data from data entry, acquire existing
data from other sources, or receive signals from devices, you get information somehow. This stage
describes when data values enter the firewalls of your system.
2) Data Processing: Data preparation typically includes integrating data from multiple sources, validating
data, and applying transformations.
3) Data Analysis: However you analyse and interpret your data, this is where the magic happens.
Exploring and interpreting your data may require a variety of analyses. This could mean statistical
analysis and visualization.
4) Data sharing or publication: This stage is where forecasts and insights turn into decisions and
direction.
5) Archiving: Once data has been collected, processed, analysed, and shared, it is typically stored for
future reference.
TYPES OF BIG DATA (5 Marks):

1. Structured Big Data:
✓ Any data that can be stored, accessed and processed in a fixed format is termed
‘structured’ data. Over time, computer science has achieved great success in developing
techniques for working with such data (where the format is well known in advance) and in
deriving value from it. However, nowadays we foresee issues when the size of such data grows
to a huge extent, with typical sizes in the range of multiple zettabytes.
✓ Examples Of Structured Data

✓ An ‘Employee’ table in a database is an example of Structured Data


Employee_ID  Employee_Name    Gender  Department  Salary_In_lacs
2365         Rajesh Kulkarni  Male    Finance     650000
3398         Pratibha Joshi   Female  Admin       650000
7465         Shushil Roy      Male    Admin       650000
7500         Shubhojit Das    Male    Finance     650000
7699         Priya Sane       Female  Finance     650000

2. Unstructured Big Data:


✓ Any data with an unknown form or structure is classified as unstructured data. In addition
to its huge size, unstructured data poses multiple challenges in terms of processing it to
derive value. A typical example of unstructured data is a heterogeneous data source
containing a combination of simple text files, images, videos, etc. Nowadays, organizations
have a wealth of data available to them but, unfortunately, they don’t know how to derive value
from it, since this data is in its raw or unstructured form.
✓ Example of unstructured data: the output returned by ‘Google Search’.
3. Semi-structured Big Data:
✓ Semi-structured data can contain both forms of data. Semi-structured data looks structured in
form, but it is not actually defined by, e.g., a table definition in a relational DBMS.
An example of semi-structured data is data represented in an XML file.
✓ Examples Of Semi-Structured Data
✓ Personal data stored in an XML file-

<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>

<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>

<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>

<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>

<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
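
Records like these can be read with Python's built-in xml.etree.ElementTree module. The sketch below uses two of the records above; the enclosing <people> element is an assumption added for the example, because a well-formed XML document needs a single top-level element.

```python
import xml.etree.ElementTree as ET

# Two of the records from the notes, wrapped in an assumed <people> root.
xml_text = """
<people>
  <rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
  <rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
</people>
""".strip()

root = ET.fromstring(xml_text)

# The tags carry the structure, so each record can be turned into a dictionary.
people = [
    {
        "name": rec.findtext("name"),
        "sex": rec.findtext("sex"),
        "age": int(rec.findtext("age")),
    }
    for rec in root.findall("rec")
]
print(people)
```

The tags describe each value, but nothing forces every record to carry the same tags, which is why this kind of data is called semi-structured.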

TYPES OF BIG DATA (10 Marks)
1.Structured Big Data:

✓ Data is stored in rows & columns. This type of data constitutes about 10% of today’s total data &
is accessible through a DBMS.
✓ Eg: Official registers created by governmental institutions to store data on individuals,
enterprises & real estate

2.Unstructured Big Data:

✓ Data of different forms like text, image, video, document, etc. This type of data accounts for about
90% of the data.

3.Geographic Big Data

✓ Data related to roads, buildings, lakes, addresses, people, workplaces & transportation routes that
are generated from geographic information systems.
✓ Eg: Google Maps

4.Real-Time Media:

✓ Real-time streaming of live or stored media data. One of the main sources of media data is services
such as YouTube, Flickr & Vimeo, which produce a huge amount of video, pictures & audio. Another
important source of real-time media is video conferencing, which allows two or more locations to
communicate through two-way video & audio transmission.

5.Natural Language Data:

✓ Human-generated data, particularly in verbal form. The sources of natural language data include
speech capture devices, landline phones & IoT devices that generate large volumes of text-like
communication between devices.

6.Time Series:

✓ A sequence of data points or observations, typically consisting of successive measurements made
over a time interval.
✓ Eg: Ocean tides, counts of sunspots, measuring the level of unemployment
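
One common way to work with a time series is to smooth it with a moving average. A minimal Python sketch, with made-up monthly unemployment rates:

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive measurements."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical monthly unemployment rates, in percent.
rates = [6.0, 6.5, 7.0, 6.5, 6.0]
smoothed = moving_average(rates, 3)
print(smoothed)
```

The order of the measurements matters here, which is what separates a time series from an ordinary list of values.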

7.Event Data:

✓ Data generated by matching external events with time series. This requires separating the
important events from the unimportant ones.
✓ Eg: information related to vehicle crashes or accidents can be collected & analysed to help
understand what the vehicles were doing before, during & after the event. The data is generated by
sensors fixed in different places of the vehicle body.

8.Network Data:

✓ Data concerning very large networks, such as social networks (e.g., Facebook & Twitter), information
networks (e.g., the World Wide Web), biological networks (e.g., biochemical, ecological & neural
networks) & technological networks (e.g., the Internet, telephone & transportation networks).

9.Linked Data:

✓ Data that is built upon standard Web technologies such as HTTP, RDF, SPARQL & URIs to share
information that can be semantically queried by computers. This allows data from different sources to
be connected & read.

APPLICATIONS OF BIG DATA


1.Healthcare

✓ Big data analytics has improved healthcare by providing personalized medicine and prescriptive
analytics. Researchers are mining the data to see which treatments are more effective for particular
conditions, identify patterns related to drug side effects, and gain other important information that
can help patients and reduce costs. It is possible to predict diseases that will escalate in specific areas.
Based on these predictions, it is easier to strategize diagnostics and plan for stocking serums and vaccines.

2.Media & Entertainment

✓ Various companies in the media and entertainment industry are facing new business models for the
way they create, market and distribute their content. Big Data applications benefit the media and
entertainment industry by:

• Predicting what the audience wants

• Scheduling optimization

• Increasing acquisition and retention

• Ad targeting

• Content monetization and new product development

✓ Spotify, an on-demand music service, uses Hadoop Big Data analytics to collect data from its millions
of users worldwide and then uses the analysed data to give informed music recommendations to
individual users. Amazon Prime, which is driven to provide a great customer experience by offering
video, music, and Kindle books in a one-stop shop, also heavily utilizes Big Data.

3.Traffic Optimization

✓ Big Data helps in aggregating real-time traffic data gathered from road sensors, GPS devices and video
cameras. The potential traffic problems in dense areas can be prevented by adjusting public
transportation routes in real time.

4.Real-time Analytics to Optimize Flight Route

✓ With each unsold seat of the aircraft, there is a loss of revenue. Route analysis is done to determine
aircraft occupancy and route profitability. By analysing customers’ travel behaviour, airlines can
optimize flight routes to provide services to the maximum number of customers. Increasing the
customer base is most important for maximizing capacity utilization. Through big data analytics, route
optimization becomes much easier, and the number of aircraft on the most profitable routes can be increased.

5.E-commerce Recommendation

✓ By tracking customers’ spending habits and shopping behaviour, big retail stores provide
recommendations to the customer. E-commerce sites like Amazon, Walmart and Flipkart make product
recommendations. They track what products a customer is searching for and, based on that data,
recommend that type of product to that customer. As an example, suppose a customer searched for a bed
cover on Amazon. Amazon now has data suggesting that the customer may be interested in buying a bed
cover. The next time that customer visits a Google page, advertisements for various bed covers will be
shown. Thus, the right product can be advertised to the right customer. YouTube also recommends videos
based on the types of video a user has previously liked or watched. Based on the content of the video
the user is watching, relevant advertisements are shown while the video is running. As an example,
suppose someone is watching a tutorial video on big data; advertisements for other big data courses
will then be shown during that video.
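
One very simple way to produce such recommendations is to count which products appear together in customers' histories. The Python sketch below is only an illustration of that idea with made-up data; it is not a description of how Amazon, Flipkart or YouTube actually build their recommendations.

```python
from collections import Counter

# Hypothetical search or purchase histories, one set of products per customer.
histories = [
    {"bed cover", "pillow", "blanket"},
    {"bed cover", "pillow"},
    {"bed cover", "pillow", "blanket", "curtain"},
    {"laptop", "mouse"},
]


def recommend(product, histories, top_n=2):
    """Rank the products that most often appear alongside `product`."""
    counts = Counter()
    for items in histories:
        if product in items:
            counts.update(items - {product})
    return [item for item, _ in counts.most_common(top_n)]


suggestions = recommend("bed cover", histories)
print(suggestions)
```

Here a customer who searched for a bed cover is shown pillows and blankets, because those appear most often in the histories of other customers who looked at bed covers.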

6.Big data applications in agriculture

✓ Traditional tools are being replaced by sensor-equipped machines that can collect data from their
environments to control their behaviour – such as thermostats for temperature regulation or
algorithms for implementing crop protection strategies. Technology, combined with external big data
sources like weather data, market data, or standards with other farms, is contributing to the rapid
development of smart farming.

BIG DATA ANALYTICS


✓ Big data analytics is the use of advanced analytic techniques against very large, diverse data sets that
include structured, semi-structured & unstructured data, from different sources & in different sizes,
from terabytes to zettabytes.
✓ Big data analytics is the process of collecting, organizing & analysing large sets of data called Big
Data to discover patterns & other useful information. Big data analytics can help organizations to
better understand the information contained within the data & will also help identify the data that
is most important to the business & future business decisions.

PROCESS OF BDA
1. Data Collection

✓ Data collection plays the most important role in the Big Data cycle. The Internet provides almost
unlimited sources of data for a variety of topics. The importance of this area depends on the type of
business, but traditional industries can acquire a diverse source of external data and combine those
with their transactional data. For example, let’s assume we would like to build a system that
recommends restaurants. The first step would be to gather data, in this case, reviews of restaurants
from different websites and store them in a database.

2. Data Cleansing

✓ Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset. When combining multiple data sources, there are many
opportunities for data to be duplicated or mislabelled.

3. Data Exploration

✓ Data exploration is the first step of data analysis used to explore and visualize data to uncover insights
from the start or identify areas or patterns to dig into more. Using interactive dashboards and point-
and-click data exploration, users can better understand the bigger picture and get to insights faster.
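
Continuing the restaurant example from the data collection step, a first exploration pass might simply summarize the ratings per restaurant. A minimal Python sketch with made-up reviews:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical restaurant reviews: (restaurant, rating out of 5).
reviews = [
    ("Dosa Corner", 4), ("Dosa Corner", 5),
    ("Cafe Mysore", 3), ("Cafe Mysore", 4), ("Cafe Mysore", 2),
]

# Group the ratings by restaurant.
ratings = defaultdict(list)
for restaurant, rating in reviews:
    ratings[restaurant].append(rating)

# Basic summary statistics for each group.
summary = {
    name: {"count": len(r), "mean": mean(r), "min": min(r), "max": max(r)}
    for name, r in ratings.items()
}
for name, stats in summary.items():
    print(name, stats)
```

Summaries like these point to where to dig further, for example restaurants with few reviews or a wide spread of ratings.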

4. Data Visualization

✓ Big data visualization is the process of displaying data in charts, graphs, maps, and other visual forms.
It is used to help people easily understand and interpret their data at a glance, and to clearly show
trends and patterns that arise from this data.
BIG DATA ANALYTICS TOOLS
▪ R-Programming
▪ Altamira LUMIFY
▪ Apache Hadoop
▪ MongoDB
▪ RapidMiner
▪ Apache Spark
▪ Microsoft Azure
▪ Zoho Analytics
