
Big Data Analytics

(Course Code: AI732PE)

Dr. A. Pramod Kumar
Associate Professor,
Dept. of AIML,
CMR Engineering College (Autonomous)
Email: [email protected]
Contact: 9000159660
COURSE OUTCOMES
Upon successful completion of this course, student will be able to:

❑ CO1: Demonstrate the basic big data concepts


❑ CO2: Simulate and apply various big data technologies like Hadoop, MapReduce,
Spark, Impala, Pig, and Hive.
❑ CO3: Categorize and summarize the Big Data and its importance in Business
domains.
❑ CO4: Differentiate various learning approaches in machine learning to process
data, and to interpret the concepts of ML algorithms and test cases
❑ CO5: Develop numerous visualization plots with the aid of different tools
like Tableau, QlikView and D3.
Course Assessment

Course Title: Big Data Analytics        Course Type: Integrated
Course Code: AI732PE                    Credits: 3        Class: IV Year, I Semester

Course Structure:
TLP        Credits   Work Load   Contact Hours   Classes per Semester   Assessment Weightage
Theory     3         4           4               42                     CIE 30% | SEE 70%
Practice   0         0           0               0
Tutorial   -         -           -
Total      3         4           4               42

Course Lead: Dr. A. Pramod Kumar
Course Instructors: Theory – Dr. A. Pramod Kumar; Practice – Dr. A. Pramod Kumar
COURSE SYLLABUS
UNIT I: Data Management & Maintain Healthy, Safe & Secure Working Environment:
Data Management: Design Data Architecture and manage the data for analysis, understand various
sources of Data like Sensors/signal/GPS etc. Data Management, Data Quality (noise, outliers, missing
values, duplicate data) and Data Preprocessing. Export all the data onto Cloud ex. AWS/Rackspace etc.
Maintain Healthy, Safe & Secure Working Environment: Introduction, workplace safety, Report
Accidents & Emergencies, Protect health & safety as you work, course conclusion, and assessment.

UNIT- II: Big Data Tools & Provide Data/Information in Standard Formats :
Big Data Tools: Introduction to Big Data tools like Hadoop, Spark, Impala etc., Data ETL process,
Identify gaps in the data and follow-up for decision making.
Provide Data/Information in Standard Formats: Introduction, Knowledge Management, Standardized
reporting & compliances, Decision Models, course conclusion, and assessment.

UNIT- III: Big Data Analytics :


Big Data Analytics: Run descriptives to understand the nature of the available data, collate all the
data sources to suffice the business requirement, run descriptive statistics for all the variables and
observe the data ranges, outlier detection and elimination.

UNIT- IV: Machine Learning Algorithms


Machine Learning Algorithms: Hypothesis testing and determining the multiple analytical
methodologies, Train Model on 2/3 sample data using various Statistical/Machine learning
algorithms, Test model on 1/3 sample for prediction etc.
UNIT-V: Data Visualization:
Data Visualization: Prepare the data for Visualization, use tools like Tableau, QlikView and D3, draw
insights out of the Visualization tool. Product Implementation.
TEXT BOOKS
❑ Michael Minelli, Michelle Chambers and Ambiga Dhiraj, “Big Data, Big Analytics: Emerging
Business Intelligence and Analytic Trends for Today's Businesses”, Wiley, 2013.
❑ Arvind Sathi, “Big Data Analytics: Disruptive Technologies for Changing the Game”, 1st Edition,
IBM Corporation, 2012.
❑ Davy Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science: Big Data,
Machine Learning, and More, Using Python Tools”, Dreamtech Press, 2016.
❑ “Data Science & Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data”, EMC
Education Services, Wiley Publishers, 2015.
REFERENCE BOOKS:
❑ Introduction to Data Mining, Tan, Steinbach and Kumar, Addison Wesley, 2006.
❑ Cay Horstmann, “Big Java”, 4th Edition, John Wiley & Sons, Inc.
❑ Data Mining and Analysis: Fundamental Concepts and Algorithms, M. Zaki and W. Meira,
Cambridge University Press, 2014.
NPTEL/SWAYAM/MOOCS:
❑ https://onlinecourses.nptel.ac.in/noc23_cs112/preview
E books
❑ https://bmsce.ac.in/Content/IS/Big_Data_Analytics_-_Unit_1.pdf
❑ https://mrcet.com/downloads/digital_notes/IT/(R17A0528)%20BIG%20DATA%20ANALYTICS.pdf
E-RESOURCES:
1. http://freevideolectures.com/Course/3613/Big-Data-and-Hadoop/18
2. http://www.comp.nus.edu.sg/~ooibc/mapreduce-survey.pdf
UNIT-I
Data Management &
Maintain Healthy, Safe & Secure
Working Environment
Contents:
• OBJECTIVES
• OUTCOMES
• Units overview
• Big Data – Meaning & Definition
• Fields that generate big data
• Traditional Data Vs Big Data
• Big Data Analytics – Meaning
• The importance of big data analytics
• Analytics Models
• How big data analytics works
• Applications and key data sources
• Big Data Analytics - Use cases

Motivation
The world's technological per-capita capacity to store information has roughly doubled every 40
months.
As of 2012, about 2.5 exabytes (2.5×10^18 bytes) of data were created every day.
Relational database management systems and desktop statistics and visualization
packages often have difficulty handling big data.
Big Data is a new driver for the digital economy and society.
Gartner estimated a GDP contribution of hundreds of billions of dollars by 2020.
Data is an intangible factor of production, after labor and capital.
Data Science is often called the fourth paradigm of science.

According to Gartner, the definition of Big Data is:


“Big data is high-volume, high-velocity, and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and
decision making.”
Big Data refers to complex and large data sets that have to be processed and analyzed
to uncover valuable information that can benefit businesses and organizations.

Careers in Big Data –


● Data Analyst ● Data Scientist ● Big Data Architect ● Data Engineer ● Hadoop
Admin ● Hadoop Developer ● Spark Developers
The History of Big Data
Although the concept of big data itself is relatively new, the origins of large data sets go
back to the 1960s and '70s when the world of data was just getting started with the first
data centers and the development of the relational database.
Around 2005, people began to realize just how much data users generated through
Facebook, YouTube, and other online services. Hadoop (an open-source framework
created specifically to store and analyze big data sets) was developed that same year.
NoSQL also began to gain popularity during this time.
The development of open-source frameworks, such as Hadoop (and more recently, Spark)
was essential for the growth of big data because they make big data easier to work with
and cheaper to store. In the years since then, the volume of big data has skyrocketed.
Users are still generating huge amounts of data—but it’s not just humans who are doing
it.

With the advent of the Internet of Things (IoT), more objects and devices are connected to
the internet, gathering data on customer usage patterns and product performance. The
emergence of machine learning has produced still more data.
While big data has come far, its usefulness is only just beginning. Cloud computing has
expanded big data possibilities even further. The cloud offers truly elastic scalability,
where developers can simply spin up ad hoc clusters to test a subset of data.
Big data analytics is a method to uncover hidden patterns in large data sets and to
extract useful information. It can be divided into two major sub-systems: data
management and analysis.
• Big data analytics is a process of inspecting, differentiating, and transforming
big data with the goal of identifying useful information, suggesting conclusions,
and helping to take accurate decisions.
• Analytics includes both data mining and the communication of results to guide decision
making.
● 1960s – text files
● 1970s – spreadsheets / DBMS
● 1980s – RDBMS, but licensed (paid)
● 1990s – Data warehouses (built on many RDBMSs) – very costly
● 1995 – DFS (Distributed File System) proposed by Doug Cutting, but it failed because its processing model was not good enough
● 2003 – Google released the GFS paper; GFS (Google File System) stores data in much the same way as DFS
● 2004 – Google released the MapReduce paper on distributed processing; Doug Cutting studied it closely
● 2005 – DFS was enhanced to use MapReduce
● 2006 – DFS + MapReduce = Hadoop (named after his son's toy elephant), consisting of HDFS and MapReduce
● 2008 – Hadoop adoption grew rapidly, but writing MapReduce code in Java was hard. Tools emerged that let users write SQL-like queries and automatically convert them into MapReduce Java jobs. Hive (used by Facebook) and Pig (used by Yahoo, less popular) came to the market this year.
● 2009 – Cloudera: the problem was installation and maintenance, so Cloudera offered installation, maintenance, and upgrades as paid services.
● 2011 – Hortonworks: installation was free, while maintenance and upgrades were paid.
● 2014 – Spark: processes data much faster, is free of cost, supports SQL, streaming, and machine learning, and can run on Hadoop.
● 2016 – Cloud: managed clusters with installation, upgrades, and maintenance handled by the provider – AWS EMR (a Hadoop cluster ready within about 10 minutes at the click of a button), Azure HDInsight, Google Cloud Dataproc (GCP), and Alibaba Cloud.
● 2017 – Cloudera and Hortonworks merged.
● 2022 – The big data market largely lives in the cloud: cloud-based Hadoop and Spark are the dominant technologies today.
Traditional Data Vs Big Data
The importance of big data analytics
Big data analytics, through specialized systems and software, can lead to positive
business-related outcomes:
• New revenue opportunities
• More effective marketing
• Better customer service
• Improved operational efficiency
• Competitive advantages over rivals
How big data analytics works
Once the data is ready, it can be analyzed with the software commonly used for
advanced analytics processes. That includes tools for:
• data mining, which sifts through data sets in search of patterns and
relationships;
• predictive analytics, which builds models to forecast customer behavior and
other future developments;
• machine learning, which taps algorithms to analyze large data sets; and
• deep learning, a more advanced offshoot of machine learning.
Big Data technologies can be divided into two groups: batch processing, which
is analytics on data at rest, and stream processing, which is analytics on
data in motion.
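To make the batch-analytics side of this concrete, here is a minimal, hedged sketch in Python using pandas and scikit-learn. The file sales.csv and its columns (ad_spend, visits, revenue) are assumptions for illustration only; the 2/3–1/3 split mirrors the train/test approach mentioned in Unit IV.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load the data set (hypothetical file and columns).
df = pd.read_csv("sales.csv")

# Descriptive analytics: ranges, means, and spread of every numeric variable.
print(df.describe())

# Predictive analytics: train on 2/3 of the data, test on the remaining 1/3.
X, y = df[["ad_spend", "visits"]], df["revenue"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on the held-out third:", model.score(X_test, y_test))
```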
Applications and key data sources for big data and business analytics
Use cases for Big data analytics
Big Data Analytics Workflow
Data Structures
Big data can come in multiple forms, including structured and non-structured
data such as financial data, text files, multimedia files, and genetic mappings.
Most Big Data is unstructured or semi-structured in nature, which
requires different techniques and tools to process and analyze.
Distributed computing environments and massively parallel processing (MPP)
architectures that enable parallelized data ingest and analysis are the preferred
approach to process such complex data.

Big Data Growth is increasingly unstructured


Structured data: Data containing a defined data type, format, and structure
(that is, transaction data, online analytical processing [OLAP] data cubes,
traditional RDBMS tables, CSV files, and even simple spreadsheets). Structured data
can be processed, stored, and retrieved in a fixed format. It refers to highly organized
information that can be readily and seamlessly stored in and accessed from a
database by simple search engine algorithms.

Example of structured data


Semi-structured data: Semi-structured data is also referred to as self-describing
structure. This is the data which does not conform to a data model but has some
structure. Textual data files with a discernible pattern that enables parsing (such
as Extensible Markup Language [XML] data files that are self-describing and
defined by an XML schema).

Example of semi-structured data
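To make the idea of self-describing structure concrete, a small hedged Python sketch using only the standard library; the <employee> record and its fields are invented for this illustration.

```python
import xml.etree.ElementTree as ET

xml_data = """<employee id="101">
    <name>Asha</name>
    <department>Analytics</department>
    <salary currency="INR">950000</salary>
</employee>"""

root = ET.fromstring(xml_data)

# The tags describe the structure, so fields can be pulled out by name.
print(root.attrib["id"], root.findtext("name"), root.findtext("department"))
print(root.find("salary").attrib["currency"], root.findtext("salary"))
```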


Quasi-structured data: Textual data with erratic data formats that can be
formatted with effort, tools, and time (for instance, web clickstream data that
may contain inconsistencies in data values and formats).

Example of Quasi-structured data


Unstructured data: Data that has no inherent structure, which may include text
documents, PDFs, images, and video.
Unstructured data refers to the data that lacks any specific form or structure.
This makes it very difficult and time-consuming to process and analyze
unstructured data. Email is an example of unstructured data.

Example of unstructured data: video about Antarctica expedition


Characteristics of Big Data
Back in 2001, Gartner analyst Doug Laney listed the 3 ‘V’s of Big Data – Variety,
Velocity, and Volume. Let’s discuss the characteristics of big data.
a) Variety: Variety of Big Data refers to structured, unstructured, and
semi-structured data that is gathered from multiple sources. Today, data comes
in an array of forms such as emails, PDFs, photos, videos, audio, social media posts, and
so much more.
The commonly cited characteristics are: 1. Volume 2. Variety 3. Value 4. Velocity
5. Veracity – Data can't be perfect in a realistic environment; it will have inconsistencies,
errors, and noise. For instance, data collected through marketing surveys may include responses
from fake accounts.
6. Variability – Data flow is inconsistent, with periodic peaks. The same tweets can have
different meanings based on the context.

Volume, Velocity, Variety, Variability, Veracity, Value, and Visualization


b) Velocity: Velocity essentially refers to the speed at which data is being
created in real time. In a broader perspective, it comprises the rate of change,
linking of incoming data sets at varying speeds, and activity bursts.
c) Volume: Volume is one of the defining characteristics of big data. Big Data indicates
huge 'volumes' of data that are being generated on a daily basis from various sources
like social media platforms, business processes, machines, networks, human interactions,
etc. Such large amounts of data are stored in data warehouses.
d) Veracity: Veracity is all about the trustworthiness of the data. It refers to
inconsistencies and uncertainty in data; if the data is collected from trusted or reliable
sources, veracity issues are reduced.
Kilobyte (KB)  = 1,000 bytes      – a paragraph of a text document
Megabyte (MB)  = 1,000 Kilobytes  – a small novel
Gigabyte (GB)  = 1,000 Megabytes  – Beethoven's 5th Symphony
Terabyte (TB)  = 1,000 Gigabytes  – all the X-rays in a large hospital
Petabyte (PB)  = 1,000 Terabytes  – half the contents of all US academic research libraries
Exabyte (EB)   = 1,000 Petabytes  – about one fifth of the words people have ever spoken
Zettabyte (ZB) = 1,000 Exabytes   – as much information as there are grains of sand on all the world's beaches
Yottabyte (YB) = 1,000 Zettabytes – as much information as there are atoms in 7,000 human bodies
As data needs grew, so did more scalable data warehousing solutions. These
technologies enabled data to be managed centrally, providing benefits of
security, failover, and a single repository where users could rely on getting an
“official” source of data for financial reporting or other mission-critical tasks.
State of the Practice in Analytics
Current business problems provide many opportunities for organizations to become more
analytical and data driven.

Organizations have been trying to reduce customer churn, increase sales, and cross-sell
customers for many years.
What is new is the opportunity to fuse advanced analytical techniques with Big Data to
produce more impactful analyses for these traditional problems.

The last examples represent additional complexity and data requirements for
organizations: laws related to anti-money laundering (AML) and fraud prevention require
advanced analytical techniques to comply with and manage properly.
BI Versus Data Science
There are several ways to compare these groups of analytical techniques.
BI tends to provide reports, dashboards, and queries on business questions for the
current period or in the past. BI systems make it easy to answer questions related to
quarter-to-date revenue, progress toward quarterly targets, and understand how much
of a given product was sold in a prior quarter or year.
BI provides hindsight and some insight, and generally answers questions related to "when"
and "where" events occurred.
Data Science projects are needed when an organization must perform a more sophisticated
analysis with disaggregated or varied datasets.
Current Analytical Architecture
Data Science projects need workspaces that are purpose-built for experimenting with
data, with flexible and agile data architectures.
Most organizations still have data warehouses that provide excellent support for
traditional reporting and simple data analysis activities but unfortunately have a more
difficult time supporting more robust analyses.
For data sources to be loaded into the data warehouse, data needs to be well
understood, structured, and normalized with the appropriate data type definitions.
As a result of this level of control on the EDW, additional local systems may emerge in the
form of departmental warehouses and local data marts that business users create to
accommodate their need for flexible analysis. These local data marts may not have the
same constraints for security and structure as the main EDW and allow users to do some
level of more in-depth analysis.
Once in the data warehouse, data is read by additional applications across the enterprise
for BI and reporting purposes. These are high-priority operational processes getting critical
data feeds from the data warehouses and repositories.

At the end of this workflow, analysts get data provisioned for their downstream
analytics. Because users generally are not allowed to run custom or intensive analytics
on production databases, analysts create data extracts from the EDW to analyze data
offline in R or other local analytical tools.
Data moves in batches from EDW to local analytical tools. This workflow means that data
scientists are limited to performing in-memory analytics (such as with R, SAS, SPSS, or
Excel), which will restrict the size of the datasets they can use. As such, analysis may be
subject to constraints of sampling, which can skew model accuracy.
Drivers of Big Data
To better understand the market drivers related to Big Data, it is helpful to first
understand some past history of data stores and the kinds of repositories and tools to
manage these data stores.
The Big Data trend is generating an enormous amount of information from many new
sources. This data deluge requires advanced analytics and new market players to take
advantage of these opportunities and new market dynamics
Emerging Big Data Ecosystem and a New Approach to Analytics

Data devices: the "Sensornet" gathers data from multiple locations and continuously
generates new data about this data. For each gigabyte of new data created, an additional
petabyte of data is created about that data.
Smartphones provide another rich source of data. In addition to messaging and basic
phone usage, they store and transmit data about Internet usage, SMS usage, and
real-time location. This metadata can be used for analyzing traffic patterns by scanning
the density of smartphones in locations to track the speed of cars or the relative traffic
congestion on busy roads.
Retail shopping loyalty cards record not just the amount an individual spends, but the
locations of stores that person visits, the kinds of products purchased, the stores where
goods are purchased most often, and the combinations of products purchased together.
Data collectors: include entities that collect data from the devices and users.
Data results from a cable TV provider tracking the shows a person watches, which TV
channels someone will and will not pay for to watch on demand, and the prices someone
is willing to pay for premium TV content
Retail stores tracking the path a customer takes through their store while pushing a
shopping cart with an RFID chip so they can gauge which products get the most foot
traffic using geospatial data collected from the RFID chips
Data aggregators: compile the data collected from the various entities of the "SensorNet" or
the "Internet of Things." These organizations compile data from the devices and usage
patterns collected by government agencies, retail stores, and websites.
They can choose to transform and package the data as products to sell to list brokers,
who may want to generate marketing lists of people who may be good targets for
specific ad campaigns.
Data users and buyers: These groups directly benefit from the data collected and
aggregated by others within the data value chain.
Retail banks, acting as a data buyer, may want to know which customers have the highest
likelihood to apply for a second mortgage or a home equity line of credit. To provide input
for this analysis, retail banks may purchase data from a data aggregator. This kind of data
may include demographic information about people living in specific locations;
Using technologies such as Hadoop to perform natural language processing on
unstructured, textual data from social media websites, users can gauge the reaction to
events such as presidential campaigns.
Key Roles for the New Big Data Ecosystem:
The Big Data ecosystem demands three categories of roles. These roles were described in
the McKinsey Global Institute study on Big Data from May 2011.
The first group—Deep Analytical Talent— is technically savvy, with strong analytical skills.
Members possess a combination of skills to handle raw, unstructured data and to apply
complex analytical techniques at massive scales. This group has advanced training in
quantitative disciplines, such as mathematics, statistics, and machine learning.
The second group—Data Savvy Professionals—has less technical depth but has a basic
knowledge of statistics or machine learning and can define key questions that can be
answered using advanced analytics. These people tend to have a base knowledge of
working with data, or an appreciation for some of the work being performed by data
scientists and others with deep analytical talent.
The third group—Technology and Data Enablers—provides technical expertise to support
analytical projects, for example by provisioning and administering analytical sandboxes
and managing large-scale data architectures that enable widespread analytics within
companies and other organizations.
These three groups must work together closely to solve complex Big Data challenges.
There are three recurring sets of activities that data scientists perform:
Reframe business challenges as analytics challenges. Specifically, this is a skill to diagnose
business problems, consider the core of a given problem, and determine which kinds of
candidate analytical methods can be applied to solve it
Design, implement, and deploy statistical models and data mining techniques on Big Data.
This set of activities is mainly what people think about when they consider the role of the
Data Scientist: namely, applying complex or advanced analytical methods to a variety of
business problems using data.
Develop insights that lead to actionable recommendations. It is critical to note that
applying advanced methods to data problems does not necessarily drive new business
value.
Data scientists are generally thought of as having five main sets of skills and behavioral
characteristics
Quantitative skill: such as mathematics or statistics
Technical aptitude: namely, software engineering, machine learning, and programming
skills
Skeptical mind-set and critical thinking: It is important that data scientists can examine
their work critically rather than in a one-sided way.
Curious and creative: Data scientists are passionate about data and finding creative ways
to solve problems and portray information.
Communicative and collaborative: Data scientists must be able to articulate the business
value in a clear way and collaboratively work with other groups, including project
sponsors and key stakeholders
Examples of Big Data Analytics
the emerging Big Data ecosystem and new roles needed to support its growth. Big Data
presents many opportunities to improve sales and marketing analytics.
After analyzing consumer-purchasing behavior, Target’s statisticians determined that the
retailer made a great deal of money from three main life event situations.
Key Roles for a Successful Analytics Project

There are actually seven key roles that need to be fulfilled for a high-functioning data
science team to execute analytic projects successfully.
What is DFS:
Introduced by Doug Cutting in 1995.
● License-free
● We don’t need RDBMS servers
● How DFS works – for example, with 4 laptops, 1 acts as a Gateway node and the other 3 are
interconnected to each other (as in the accompanying image). Once you give data to the
Gateway node, it is stored there first and then, after running a few commands, it is
distributed across the other 3 nodes.
Why DFS failed:
As per Doug Cutting, if the data grows, the Gateway node must also be scaled up to process
it, which was a real problem because it meant spending a lot of money on the Gateway node.
How is GFS different from DFS?
GFS (Google File System) uses the same mechanism as DFS to store data, but for processing
it uses MapReduce: data does not come to the process; instead, the process goes to the
data and maps it on each node, and after collecting the results from all nodes the data is
reduced and returned.
Advantages of using Big Data:
1. Improved business processes
2. Fraud detection
3. Better decision-making
4. Increased productivity
5. Reduced costs
6. Improved customer service
7. Increased revenue
8. Increased agility
9. Greater innovation
10. Faster speed to market

Disadvantages of Big Data:
1. Privacy and security concerns
2. Need for technical expertise
3. Need for talent
4. Data quality
5. Need for cultural change
6. Compliance
7. Cyber security risks
8. Rapid change
9. Hardware needs
10. Costs
11. Difficulty integrating legacy systems
Design Data Architecture and manage the data for analysis
When do I need big data architecture
In the early days of computers and the Internet, much less data was used than
today: total data never grew beyond roughly 19 exabytes, but now about 2.5
quintillion bytes of data are generated per day.
Most of the data is generated from social media sites like Facebook, Instagram,
Twitter, etc., and the other sources can be e-business and e-commerce transactions,
hospital, school, and bank data, etc.
Big Data is the field of collecting large data sets from various sources like
social media, GPS, sensors, etc., analyzing them systematically, and extracting
useful patterns using tools and techniques adopted by enterprises. Before analyzing
and interpreting the data, the data architecture must be designed by the architect.
The data architecture is formed by designing three essential models, which are then
combined:
• Conceptual model –
It is a business model which uses the Entity Relationship (ER) model for relations
between entities and their attributes.
• Logical model –
It is a model where problems are represented in the form of logic, such as rows
and columns of data, classes, XML tags, and other DBMS techniques.
• Physical model –
The physical model holds the database design, such as which type of database technology
will be suitable for the architecture.
understand various sources of Data like Sensors/signal/GPS
Data collection is the process of acquiring, collecting, extracting, and
storing the voluminous amount of data which may be in the structured or
unstructured form like text, video, audio, XML files, records, or other
image files used in later stages of data analysis.
The main goal of data collection is to collect information-rich data.
The data which is Raw, original, and extracted directly from the official sources is
known as primary data.
This type of data is collected directly by performing techniques such as
questionnaires, interviews, and surveys
These can be both structured and unstructured like personal interviews or formal
interviews through telephone, face to face, email, etc.
Secondary data is the data which has already been collected and reused again for
some valid purpose. This type of data is previously recorded from primary data
and it has two types of sources named internal source and external source.
Data Management
Data management is an administrative process that includes acquiring, validating,
storing, protecting, and processing required data to ensure the accessibility,
reliability, and timeliness of the data for its users.
Data management software is essential, as we are creating and consuming data at
unprecedented rates.
Data management is the practice of managing data as a valuable resource to
unlock its potential for an organization.
Importance of Data management:
Data management plays a significant role in an organization's ability to
generate revenue, control costs.
Data management helps organizations to mitigate risks.
It enables decision making in organizations.
benefits of good data management
Optimum data quality
Improved user confidence
Efficient and timely access to data
Improves decision making in an organization
Managing data Resources:
An information system provides users with timely, accurate, and relevant
information.
The information is stored in computer files. When files are properly arranged
and maintained, users can easily access and retrieve the information when
they need it.
If the files are not properly managed, they can lead to chaos in information
processing.
Even if the hardware and software are excellent, the information system can
be very inefficient because of poor file management.
Areas of Data Management:
Data Modelling: is first creating a structure for the data that you collect and use,
and then organizing this data in a way that is easily accessible and efficient to
store and pull for reports and analysis.
Data warehousing: is storing data effectively so that it can be accessed and used
efficiently in future
Data Movement: is the ability to move data from one place to another. For
instance, data needs to be moved from where it is collected to a database and then
to an end user.
Data Quality (noise, outliers, missing values, duplicate data)
Data quality is the ability of your data to serve its intended purpose, based on
factors such as accuracy, completeness, consistency, and reliability; these factors
play a huge role in determining data quality.

Accuracy:
Erroneous values that deviate from the expected. The causes for inaccurate data
can be various
Human/computer errors during data entry and transmission
Users deliberately submitting incorrect values (called disguised missing data)
Incorrect formats for input fields
Duplication of training examples
Completeness:
Lacking attribute/feature values or values of interest. The dataset might be
incomplete due to:
Unavailability of data
Deletion of inconsistent data
Deletion of data deemed irrelevant initially
Consistency:
Inconsistent means data source containing discrepancies between different data
items.
Reliability:
Reliability means that data are reasonably complete and accurate, meet the
intended purposes, and are not subject to inappropriate alteration.
To make the process easier, data preprocessing is divided into four stages: data
cleaning, data integration, data reduction, and data transformation.
Data quality is also affected by outliers, missing values, noisy data, and
duplicate values.
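A minimal pandas sketch of a first-pass data-quality audit along these lines; the file customers.csv and the annual_income column are assumptions, and the 3-sigma rule is just one common convention for flagging outliers.

```python
import pandas as pd

df = pd.read_csv("customers.csv")

# Completeness: count missing values per column.
print("Missing values per column:\n", df.isna().sum())

# Duplicate data: count fully repeated rows.
print("Duplicate rows:", df.duplicated().sum())

# Crude outlier count for one numeric column using the 3-sigma rule.
col = df["annual_income"]
z = (col - col.mean()) / col.std()
print("Potential outliers in annual_income:", (z.abs() > 3).sum())
```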
Outliers:
Outliers are extreme values that deviate from other observations in the data; they may
indicate variability in a measurement, experimental errors, or a novelty.
An outlier is a point or an observation that deviates significantly from the other
observations.
Outlier detection from graphical representation: Scatter plot and Box plot
Scatter plot:
Scatter plots are used to plot data points on a horizontal and a vertical axis in the
attempt to show how much one variable is affected by another. A scatter plot uses
dots to represent values for two different numeric variables.

Box plot:
A boxplot is a standardized way of displaying the distribution of data based on a
five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and
maximum.
Most common causes of outliers in a data set:
Data entry errors (human errors)
Measurement errors (instrument errors)
Experimental errors (data extraction or experiment planning/executing errors)
Intentional (dummy outliers made to test detection methods)
Data processing errors (data manipulation or unintended data set mutations)
Sampling errors (extracting or mixing data from wrong or various sources)
Natural (not an error, novelties in data)
How to remove Outliers?
Deleting observations: We delete outlier values if they are due to data entry errors or
data processing errors, or if the outlier observations are very small in number. We can
also use trimming at both ends to remove outliers.
Transforming and binning values: Transforming variables can also eliminate
outliers; the natural log of a value reduces the variation caused by extreme values.
Binning is also a form of variable transformation. The Decision Tree algorithm deals
with outliers well due to the binning of variables. We can also use the process
of assigning weights to different observations.
Imputing: Like the imputation of missing values, we can also impute outliers. We can
use mean, median, or mode imputation methods. Before imputing values, we should
analyse whether the outlier is natural or artificial.
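To illustrate these options, a minimal sketch using the common IQR rule; the series values are made up, and the 1.5×IQR threshold is a standard convention rather than something prescribed by this unit.

```python
import pandas as pd

s = pd.Series([22, 24, 25, 23, 26, 24, 25, 110])   # 110 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print("Outliers:\n", s[(s < low) | (s > high)])

dropped = s[(s >= low) & (s <= high)]                  # deleting observations
capped = s.clip(lower=low, upper=high)                 # transforming: cap extreme values
imputed = s.mask((s < low) | (s > high), s.median())   # imputing with the median
```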
Missing data or Missing value:
Missing data in the training data set can reduce the power / fit of a model or can
lead to a biased model because we have not analysed the behavior and
relationships with other variables correctly. It can lead to wrong predictions or
classifications.
Data Extraction: It is possible that there are problems with the extraction process. In
such cases, we should double-check for correct data with the data guardians.
Data collection: These errors occur at the time of data collection and are harder to
correct. They can be categorized into four types:
Missing completely at random: This is a case when the probability of a missing
value is the same for all observations.
Missing at random: This is a case when a variable is missing at random and the missing
ratio varies for different values / levels of other input variables. For example: when
collecting data on age, females may have a higher rate of missing values compared to males.
Missing that depends on unobserved predictors: This is a case when the missing
values are not random and are related to the unobserved input variable. For
example: In a medical study, if a particular diagnostic causes discomfort, then
there is higher chance of drop out from the study.
Missing that depends on the missing value itself: This is a case when the
probability of missing value is directly correlated with missing value itself. For
example: People with higher or lower income are likely to provide non-response
to their earning.
Which are the methods to treat missing values?
Deletion: It is of two types: list-wise deletion and pair-wise deletion.
In list-wise deletion, we delete observations where any of the variables is missing.
Simplicity is one of the major advantages of this method, but it reduces the power of
the model because it reduces the sample size.
In pair-wise deletion, we perform analysis with all cases in which the variables
of interest are present. The advantage of this method is that it keeps as many cases
as possible available for analysis. One disadvantage of this method is that it uses
different sample sizes for different variables.
Mean/Mode/Median Imputation: Imputation is a method to fill in the missing
values with estimated ones. It can be of two types:
Generalized Imputation: In this case, we calculate the mean or median of all
non-missing values of that variable and then replace the missing values with that mean
or median. In the example table, the variable "Manpower" has missing values, so we take
the average of all non-missing values of "Manpower" (28.33) and then replace the missing
values with it.
Similar case Imputation: In this case, we calculate the average for gender "Male"
(29.75) and "Female" (25) individually over the non-missing values and then replace the
missing values based on gender. For "Male", we replace missing values of
manpower with 29.75 and for "Female" with 25.
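A minimal pandas sketch of both imputation styles; the small Gender/Manpower frame below is invented and will not reproduce the exact averages quoted above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Gender":   ["Male", "Male", "Female", "Female", "Male", "Female"],
    "Manpower": [30, 29, 25, np.nan, np.nan, 25],
})

# Generalized imputation: one overall mean for every missing value.
df["Manpower_general"] = df["Manpower"].fillna(df["Manpower"].mean())

# Similar case imputation: a separate mean per gender group.
df["Manpower_similar"] = df["Manpower"].fillna(
    df.groupby("Gender")["Manpower"].transform("mean")
)
print(df)
```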
Noisy Data:

Noisy data is meaningless data that can't be interpreted by machines. It can be
generated due to faulty data collection, data entry errors, etc.

Binning Method:
This method works on sorted data in order to smooth it. The whole data set is
divided into segments of equal size and then various methods are performed to
complete the task; each segment is handled separately (see the sketch after these
smoothing methods).
Regression: Here data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).
Clustering: This approach groups similar data into clusters. The outliers may go
undetected or may fall outside the clusters. Incorrect attribute values may be due to:
faulty data collection instruments
data entry problems
data transmission problems
technology limitations
inconsistency in naming conventions
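Of the smoothing approaches above, binning is the easiest to show in a few lines. A hedged sketch with made-up values, using equal-frequency bins smoothed by their means:

```python
import numpy as np

values = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = values.reshape(4, 3)        # 4 equal-frequency bins of 3 values each

# Replace every value in a bin with that bin's mean (smoothing by bin means).
smoothed_by_mean = np.repeat(bins.mean(axis=1), 3)
print("Bins:\n", bins)
print("Smoothed by bin means:", smoothed_by_mean)
```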
Duplicate values:
A dataset may include data objects which are duplicates of one another. This may happen
when, say, the same person submits a form more than once.
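A minimal pandas sketch of detecting and dropping such duplicates; the repeated form submission below is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["Ravi", "Meena", "Ravi"],
    "email": ["ravi@example.com", "meena@example.com", "ravi@example.com"],
})

print(df.duplicated())          # the repeated submission is marked True
deduped = df.drop_duplicates()  # keeps only the first occurrence of each record
print(deduped)
```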

Data Preprocessing
Data preprocessing is a data mining technique that involves transforming raw
data into an understandable format. Real-world data is often incomplete,
inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain
many errors.
Major Tasks in Data Preprocessing are
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Data Cleaning: Data is cleansed through processes such as filling in missing
values, smoothing the noisy data, or resolving the inconsistencies in the data.
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Data Integration:
Data with different representations are put together and conflicts within the
data are resolved. It involves the integration of multiple databases, data cubes, or files.
There are two major approaches for data integration: the "tight coupling
approach" and the "loose coupling approach".
Tight Coupling:
a data warehouse is treated as an information retrieval component. In this
coupling, data is combined from different sources into a single physical location
through the process of ETL – Extraction, Transformation and Loading.
Loose Coupling:
an interface is provided that takes the query from the user, transforms it in a way
the source database can understand and then sends the query directly to the source
databases to obtain the result.
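A minimal sketch of the tight-coupling style ETL flow described above, combining two assumed CSV sources into one SQLite table standing in for the warehouse; the file names, columns, and the SQLite "warehouse" are all assumptions.

```python
import sqlite3
import pandas as pd

# Extract: pull records from two assumed source files.
online = pd.read_csv("online_orders.csv")   # columns: order_id, amount_inr
retail = pd.read_csv("retail_orders.csv")   # columns: id, amount

# Transform: align schemas and tag the source channel.
online = online.rename(columns={"order_id": "id", "amount_inr": "amount"})
online["channel"] = "online"
retail["channel"] = "retail"
orders = pd.concat([online, retail], ignore_index=True)

# Load: write everything into one table in a single physical location.
with sqlite3.connect("warehouse.db") as conn:
    orders.to_sql("orders", conn, if_exists="replace", index=False)
```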
Issues in Data Integration:
Schema Integration: Integrate metadata from different sources. Matching the real-world
entities from multiple sources is referred to as the entity identification problem.
Redundancy: An attribute may be redundant if it can be derived or obtained
from another attribute or set of attributes. Inconsistencies in attributes can also
cause redundancies in the resulting data set.
Detection and resolution of data value conflicts: Attribute values from different
sources may differ for the same real-world entity. An attribute in one system may
be recorded at a lower level of abstraction than the “same” attribute in another.
Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for
mining process.
Normalization: It is done in order to scale the data values into a specified range
(-1.0 to 1.0 or 0.0 to 1.0).
Min-Max Normalization: This transforms the original data linearly. Suppose that min_F is
the minimum and max_F is the maximum of an attribute F. A value v of F is mapped to v' in
the new range [new_min, new_max] as:
v' = ((v - min_F) / (max_F - min_F)) × (new_max - new_min) + new_min
(a short code sketch follows this list of strategies).

Attribute Selection: In this strategy, new attributes are constructed from the given
set of attributes to help the mining process.
Discretization: This is done to replace the raw values of numeric attribute by
interval levels or conceptual levels.
Concept Hierarchy Generation: Here attributes are converted from lower level to
higher level in hierarchy. For Example-The attribute “city” can be converted to
“country”
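A minimal pandas sketch of the min-max normalization formula given above, scaling to the range [0.0, 1.0]; the income column and its values are assumptions.

```python
import pandas as pd

df = pd.DataFrame({"income": [20000, 35000, 50000, 80000, 120000]})

# v' = (v - min_F) / (max_F - min_F), i.e. new_min = 0.0 and new_max = 1.0
min_f, max_f = df["income"].min(), df["income"].max()
df["income_scaled"] = (df["income"] - min_f) / (max_f - min_f)
print(df)
```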
Data Reduction:
Data mining is a technique that is used to handle huge amounts of data, and
analysis becomes harder when working with such huge volumes.
To deal with this, we use data reduction techniques, which aim to increase
storage efficiency and reduce data storage and analysis costs.
The various steps to data reduction are:
Data Cube Aggregation: Aggregation operation is applied to data for the
construction of the data cube.
Attribute Subset Selection: The highly relevant attributes should be used, rest all
can be discarded
Data compression: It reduces the size of the files using different encoding
mechanisms (Huffman Encoding & run-length Encoding). We can divide it into
two types based on their compression techniques.
Lossless Compression: Encoding techniques (such as Run-Length Encoding) allow a
simple and modest reduction in data size. Lossless data compression uses
algorithms to restore the precise original data from the compressed data (see the RLE
sketch after this list).
Lossy Compression: Methods such as the Discrete Wavelet Transform and
PCA (Principal Component Analysis) are examples of this kind of compression. In
lossy data compression, the decompressed data may differ from the original data but
is still useful enough to retrieve information from.
Numerosity Reduction: This enables to store the model of data instead of whole
data, for example: Regression Models.
Dimensionality Reduction: This reduces the size of data by encoding
mechanisms. It can be lossy or lossless. If the original data can be retrieved after
reconstruction from the compressed data, such reduction is called lossless reduction;
otherwise it is called lossy reduction. Two effective methods of dimensionality
reduction are wavelet transforms and PCA (Principal Component Analysis).
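The run-length encoding mentioned under lossless compression can be sketched in a few lines; this is a toy illustration, not a production codec.

```python
from itertools import groupby

def rle_encode(data):
    """Store each run of repeated symbols once, together with its length."""
    return [(symbol, len(list(run))) for symbol, run in groupby(data)]

def rle_decode(pairs):
    """Expand (symbol, count) pairs back into the exact original string."""
    return "".join(symbol * count for symbol, count in pairs)

encoded = rle_encode("AAAABBBCCD")
print(encoded)              # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
print(rle_decode(encoded))  # AAAABBBCCD -- identical to the original (lossless)
```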
Data Discretization:
Data discretization refers to a method of converting a huge number of data values
into smaller ones so that the evaluation and management of data become easy. In
other words, data discretization is a method of converting attributes values of
continuous data into a finite set of intervals with minimum data loss.
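A minimal sketch of discretization with pandas, converting a continuous age attribute into labelled intervals; the bin edges and labels are assumptions.

```python
import pandas as pd

ages = pd.Series([5, 17, 23, 31, 46, 58, 64, 72])

# Replace raw ages with a small set of labelled intervals.
age_groups = pd.cut(ages,
                    bins=[0, 18, 35, 60, 100],
                    labels=["child", "young", "middle-aged", "senior"])
print(age_groups.value_counts())
```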
Export all the data onto Cloud (e.g., AWS/Rackspace)
We usually export our data to the cloud for purposes like safety, multiple access, and
real-time simultaneous analysis.
There are various vendors which provide cloud storage services. We are
discussing Amazon S3.
An Amazon S3 export transfers individual objects from Amazon S3 buckets to
your device, creating one file for each object. You can export from more than one
bucket and you can specify which files to export using manifest file options.
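The device-based AWS Import/Export job is described next. For routine exports over the network, a minimal hedged boto3 sketch for copying a local file into an S3 bucket is shown here; the bucket and file names are assumptions, and AWS credentials are assumed to be already configured in the environment.

```python
import boto3

s3 = boto3.client("s3")
s3.upload_file("exports/sales_2023.csv",   # local file to export
               "my-analytics-bucket",      # target S3 bucket (assumed name)
               "raw/sales_2023.csv")       # object key inside the bucket
```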
Export Job Process
1.You create an export manifest file that specifies how to load data onto your
device, including an encryption PIN code or password and details such as the
name of the bucket that contains the data to export.
2.You initiate an export job by sending a Create Job request that includes the
manifest file. You must submit a separate job request for each device. Your job
expires after 30 days. If you do not send a device, there is no charge.
You can send a CreateJob request using the AWS Import/Export Tool, the AWS
Command Line Interface (CLI), the AWS SDK for Java, or the AWS REST API.
The easiest method is the AWS Import/Export Tool.
EX: Sending a CreateJob Request Using the AWS Import/Export Web Service
Tool Sending a CreateJob Request Using the AWS SDK for Java Sending a
CreateJob Request Using the REST API
3. AWS Import/Export sends a response that includes a job ID, a signature value,
and information on how to print your pre-paid shipping label. The response also
saves a SIGNATURE file to your computer. You will need this information in
subsequent steps.
4.You copy the SIGNATURE file to the root directory of your storage device.
You can use the file AWS sent or copy the signature value from the response into
a new text file named SIGNATURE. The file name must be SIGNATURE and it
must be in the device's root directory.
Each device you send must include the unique SIGNATURE file for that device
and that JOBID. AWS Import/Export validates the SIGNATURE file on your
storage device before starting the data load. If the SIGNATURE file is missing or
invalid (if, for instance, it is associated with a different job request), the data load will not start.
5. Generate, print, and attach the pre-paid shipping label to the exterior of your
package. See Shipping Your Storage Device for information on how to get your
pre-paid shipping label.
6. You ship the device and cables to AWS through UPS. Make sure to include
your job ID on the shipping label and on the device you are shipping. Otherwise,
your job might be delayed. Your job expires after 30 days. If we receive your
package after your job expires, we will return your device. You will only be
charged for the shipping fees, if any.
7. AWS Import/Export validates the signature on the root drive of your storage
device. If the signature doesn't match the signature from the CreateJob response,
AWS Import/Export can’t load your data.
Once your storage device arrives at AWS, your data transfer typically begins by
the end of the next business day.
8. AWS reformats your device and encrypts your data using the PIN code or
password you provided in your manifest.
9. We repack your storage device and ship it to the return shipping address listed
in your manifest file. We do not ship to post office boxes.
You use your PIN code or TrueCrypt password to decrypt your device. For more
information, see Encrypting Your Data
Basic Workplace Safety Guidelines
Fire Safety
Employees should be aware of all emergency exits, including fire escape routes,
of the office building and also the locations of fire extinguishers and alarms.
Falls and Slips
To avoid falls and slips, all things must be arranged properly. Any spilt liquid,
food or other items such as paints must be immediately cleaned to avoid any
accidents.
First Aid
Employees should know about the location of first-aid kits in the office. First-aid
kits should be kept in places that can be reached quickly. These kits should
contain all the important items for treating cuts, burns, headaches, muscle cramps, etc.
Security
Employees should make sure that they keep their personal things in a safe place.
Electrical Safety
Employees must be provided basic knowledge of using electrical equipment and
common problems. Employees must also be provided instructions about electrical
safety such as keeping water and food items away from electrical equipment.
Report Accidents & Emergencies
An accident is an unplanned, uncontrolled, or unforeseen event resulting in injury
or harm to people and damages to goods.
For example, a customer having a heart attack or sudden outbreak of fire in your
organization needs immediate attention.
Each organization or chain of organizations has procedures and practices to
handle and report accidents and take care of emergencies.
The following are some of the guidelines for identifying and reporting an accident
or emergency:
Notice and correctly identify accidents and emergencies: You need to be aware of
what constitutes an emergency and what constitutes an accident in an
organization
Types of Accidents
1. Trip and fall  2. Slip and fall  3. Injuries caused by escalators or elevators
(or lifts)  4. Accidents due to falling goods  5. Accidents due to moving objects
Handling Accidents
Try to avoid accidents in your organization by finding out all potential hazards
and eliminating them. In case of an injury to a colleague or a customer due to an
accident in your organization, you should do the following
Attend to the injured person immediately
Inform your supervisor
Assist your supervisor
Types of Emergencies
Each organization also has policies and procedures to tackle emergency
situations. The purpose of these policies and procedures is to ensure safety and
well-being of customers and staff during emergencies.
Medical emergencies, such as heart attack or an expectant mother in labor: It is a
medical condition that poses an immediate risk to a person’s life or a long-term
threat to the person’s health if no actions are taken promptly.
Substance emergencies, such as fire, chemical spills, and explosions: Substance
emergency is an unfavorable situation caused by a toxic, hazardous, or
inflammable substance that has the capability of doing mass scale damage to
properties and people.
Structural emergencies, such as loss of power or collapsing of walls: Structural
emergency is an unfavourable situation caused by development of some faults in
the building in which the organization is located.
Security emergencies, such as armed robberies, intruders, and mob attacks or civil
disorder: Security emergency is an unfavorable situation caused by a breach in
security posing a significant danger to life and property.
Natural disaster emergencies, such as floods and earthquakes: It is an emergency
situation caused by some natural calamity leading to injuries or deaths, as well as
a large-scale destruction of properties and essential service infrastructures.
Handling General Emergencies
What is the evacuation plan and procedure to follow in case of an emergency?
Who all should you notify within the organization?
Which external agencies, such as police or ambulance, should you notify in
which emergency?
Regularly check that all emergency handling equipment is in working
condition, such as the fire extinguisher and fire alarm system.
Ensure that emergency exits are not obstructed and keys to such exits are easily
accessible. Never place any objects near the emergency doors or windows.
Summary
Identify and report accidents and emergencies:
▪ Notice and correctly identify accidents and emergencies.
▪ Get help promptly and in the most suitable way.
▪ Follow company policy and procedures for preventing further injury while
waiting for help to arrive.
▪ Act within the limits of your responsibility and authority when accidents and
emergencies arise.
▪ Promptly follow the instructions given by senior staff and the emergency
services personnel.
Handling accidents:
▪ Attend the injured person immediately.
▪ Inform your supervisor about the accident giving details.
▪ Assist your supervisor in investigating and finding out the actual cause of the
accident.
General emergency handling procedures:
▪ Keep a list of numbers to call during emergencies.
▪ Regularly check that all emergency handling equipment is in working
condition.
▪ Ensure that emergency exits are not obstructed.
Protect Health & Safety
Hazards: A hazard can harm an individual or an organization. For example, hazards
to an organization include loss of property or equipment, while hazards to an
individual involve harm to health or body.
Material: Knife or sharp edged nails can cause cuts.
Substance: Chemicals such as Benzene can cause fume suffocation. Inflammable
substances like petrol can cause fire.
Electrical energy: Naked wires or electrodes can result in electric shocks.
Condition: Wet floor can cause slippage. Working conditions in mines can cause
health hazards.
Gravitational energy: Objects falling on you can cause injury.
Rotating or moving objects: Clothes entangled in rotating objects can cause
serious harm.
Potential Sources of Hazards in an Organization
Using computers: Hazards include poor sitting postures or excessive duration of
sitting in one position. These hazards may result in pain and strain.
Handling office equipment: Improper handling of office equipment can result in
injuries.
Handling objects: Lifting or moving heavy items without proper procedure or
techniques can be a source of potential hazard.
Stress at work: In today’s organization, you may encounter various stress causing
hazards.
Working environment: Potential hazards may include poor ventilation,
inappropriate height chairs and tables, stiffness of furniture, poor lighting, staff
unaware of emergency procedures, or poor housekeeping.
General Evacuation Procedures
Each organization will have its own evacuation procedures as listed in its policies.
Leave the premises immediately and start moving towards the nearest emergency
exit.
Guide your customers to the emergency exits.
If possible, assist any person with disability to move towards the emergency exit.
Keep yourself light when evacuating the premises. You may carry your hand-held
belongings, such as bags or a briefcase, as you move towards the emergency exit.
Do not use the escalators or elevators (lifts) to avoid overcrowding and getting
trapped, in case there is a power failure. Use the stairs instead.
Go to the emergency assembly area. Check if any of your colleagues are missing
and immediately inform the personnel in charge of emergency evacuation or your
supervisor.
Do not go back to the building you have evacuated till you are informed by
authorized personnel that it is safe to go inside.
Summary
▪ Hazards can be defined as any source of potential harm or danger to someone
or any adverse health effect produced under certain condition.
▪ Some potential sources of hazards in an organization are as follows:
▪ Using computers
▪ Handling office equipment
▪ Handling objects
▪ Stress at work
▪ Working environment
▪ Every employee should be aware of evacuation procedures and follow them
properly during an emergency evacuation.
▪ Follow all safety rules and warning to keep your workplace free from
accidents.
▪ Recognize all safety signs in offices.
▪ Report any incidence of non-compliance to safety rules and anything that is a
safety hazard.
