
Introduction to big data management and processing

Viet-Trung Tran
Syllabus
1. Overview of big data storage and processing
2. The Hadoop ecosystem
3. The Hadoop Distributed File System (HDFS)
4. Non-relational (NoSQL) databases - part 1: Overview
5. Non-relational (NoSQL) databases - part 2: Common distributed architectures
6. Non-relational (NoSQL) databases - part 3: SQL querying on NoSQL
7. Distributed messaging systems
8. Batch processing techniques for big data - part 1: MapReduce
9. Batch processing techniques for big data - part 2: Apache Spark
10. Stream processing techniques for big data: Spark Streaming
11. Big data architectures: Lambda architecture
12. Big data analytics: Spark ML
How big is big data?
Data science: The 4th paradigm for scientific discovery
Big data in 2008
Big data in 2014
Big data today
Big data's numbers
Big data sources
• E-commerce
• Social networks
• Internet of things
• Data-intensive experiments (bioinformatics, quantum physics, etc.)

11
Data is the new oil

12
Big data: the 5 V's (volume, velocity, variety, veracity, value)

Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them (Wikipedia).
13
Big data – big value

source: wipro.com
14
Big Data in the education industry
• Customized and dynamic learning programs
  • Data collected on each student's learning history can be used to create customized programs and schemes that benefit individual students, improving overall results.
• Reframing course material
  • Course material can be reframed based on data about what, and how well, each student learns, collected through real-time monitoring of course components.
• Grading systems
  • New advances in grading systems have been introduced as a result of proper analysis of student data.
• Career prediction
  • Appropriate analysis of every student's records helps reveal their progress, strengths, weaknesses, and interests, and helps determine which career would suit them best in the future.

15
Edtech
• Coursera
• VioEdu
• https://byjus.com/
• Engaging Video Lessons
• Personalized Learning Journeys
• Mapped to the Syllabus
• In-depth Analysis
• Engaging Interactive Questions

16
Big Data in the healthcare industry
• Big data reduces treatment costs, since there is less chance of having to perform unnecessary diagnoses.
• It helps predict outbreaks of epidemics and decide what preventive measures could be taken to minimize their effects.
• It helps avoid preventable diseases by detecting them in early stages, before they get any worse, which in turn makes their treatment easier and more effective.
• Patients can be provided with evidence-based medicine, identified and prescribed after research on past medical results.

17
Big Data in the government sector
• Welfare schemes
  • Making faster, better-informed decisions regarding various political programs
  • Identifying areas in immediate need of attention
  • Staying up to date in agriculture by keeping track of all existing land and livestock
  • Overcoming national challenges such as unemployment, terrorism, energy resources exploration, and much more
• Cyber security
  • Big Data is heavily used for fraud detection.
  • It is also used in catching tax evaders.

18
Big Data in the media and entertainment industry

• Predicting the interests of audiences
• Optimized or on-demand scheduling of media streams on digital media distribution platforms
• Getting insights from customer reviews
• Effective targeting of advertisements
• Examples
  • Spotify, an on-demand music platform, uses Big Data analytics to collect data from all its users around the globe, then uses the analyzed data to give informed music recommendations and suggestions to every individual user.
  • Amazon Prime, which offers videos, music, and Kindle books in a one-stop shop, also makes heavy use of big data.

19
Big data in scientific discovery

20
Maximilien Brice, © CERN
Use Case: Thomson Reuters

• About: Thomson Reuters is the world's leading source of news and information for the financial and risk, legal, tax & accounting, and media markets.
• Challenge: Their data source (Twitter) produces a gigantic amount of data daily. It is challenging to (a) quickly analyze all the tweets and (b) distinguish real news from fake news and opinions.
• Impact after applying Big Data solutions:
  • Able to process 13 million tweets daily
  • Captures and detects news events across millions of tweets in 40 milliseconds
• (Source: https://www.cloudera.com/about/customers/thomson-reuters.html)
Use Case: MasterCard

• About: MasterCard operates the world's fastest payments processing network, delivering the products and services that make everyday commerce activities — such as shopping, traveling, running a business, and managing finances — easier, more secure, and more efficient.
• Challenge: Identify fraud across one million inquiries per month against a database that contains hundreds of millions of fraudulent businesses.
• Impact after applying Big Data solutions:
  • 5x increase in the number of searches supported annually
  • 25x increase in searches per customer daily
  • Dramatically improved search accuracy
• (Source: https://www.cloudera.com/about/customers/thomson-reuters.html)
Use Case: Intel Supply Chain

• About: Intel's supply chain reflects the company's global operations — Intel does business in more than 100 countries, with over 450 supplier factories and 16,000 suppliers.
• Challenge:
  • Multiple data hops -- data latencies of up to 12 hours
  • Data fragmentation, data reconciliation, and quality issues
• Impact after applying Big Data solutions:
  • Reduced the planning DB by 63% and the enterprise common core (ECC) DB by 80%
  • DB transactions decreased by 25-50%; ECC processing time decreased by 50%
  • 75% efficiency gain in account reconciliation; 45% reduction of IT support staff
• (Source: https://www.intel.com/content/www/us/en/it-management/intel-it-best-practices/transforming-intels-supply-chain-with-real-time-analytics-paper.html)
Use Case: Cisco WebEx

• About: Cisco WebEx supports more than 26 billion conference minutes each month. Its audio, video, and web conferencing services help users connect and collaborate with colleagues around the world.
• Challenge:
  • Gain an end-to-end view of the customer experience
  • Support an increasing volume of telemetry data
  • More rapidly uncover new fraud tactics
• Impact after applying Big Data solutions:
  • Identified 17x more fraud
  • Delivered platform at 1/10 the cost of traditional data warehouse and BI environment
• (Source: https://www.cloudera.com/about/customers/cisco.html)
Use Case: Deutsche Telekom

• About: Deutsche Telekom is a leading European telecommunications provider, delivering services to more than 150 million customers globally.
• Challenge:
  • Preventing network fraud
  • Data visibility and scalability
• Impact after applying Big Data solutions:
  • 5-10% lower customer churn
  • 10-20% lower revenue losses from fraud activities
  • 50% better operational efficiency
• (Source: https://www.cloudera.com/about/customers/deutsche-telekom.html)
Top 10 Company Market Cap Ranking History
(1998-2018)

https://www.youtube.com/watch?v=fobx4wIS6W0
26
Top 10 Company Market Cap Ranking History
(1998-2018)

27
Big data technology stack

28
Scalable data management
• Scalability
  • Able to manage increasingly big volumes of data
• Accessibility
  • Able to maintain efficiency in reading and writing data (I/O) into data storage systems
• Transparency
  • In a distributed environment, users should be able to access data over the network as easily as if the data were stored locally (see the sketch below).
  • Users should not have to know the physical location of data to access it.
• Availability
  • Fault tolerance
  • The number of users, system failures, or other consequences of distribution shouldn't compromise availability.

29
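The location-transparency point above can be made concrete with a small sketch. This is an illustrative example, not taken from the slides; the cluster address "namenode:9000" and the file path are assumed placeholders.

```python
# Minimal sketch of location transparency in a distributed file system:
# the application refers to data by a logical path, while block placement
# and replica selection are handled by HDFS, not by the user.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TransparencyDemo").getOrCreate()

# Illustrative address/path: the same one-line read works whether the file's
# blocks live on one node or are spread across a thousand nodes.
df = spark.read.text("hdfs://namenode:9000/data/events.log")
print(df.count())
```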
Data I/O landscape (approximate figures; a back-of-the-envelope calculation follows below)

• CPUs: ~10 GB/s
• Local disk (HDD): ~100 MB/s sequential, 3-12 ms random access, ~$0.025 per GB
• SSD: ~600 MB/s sequential, 0.1 ms random access, ~$0.35 per GB
• Network, nodes in the same rack: 1 Gb/s (125 MB/s)
• Network, nodes in another rack: 0.1-1 Gb/s

30
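A rough sanity check of what these numbers imply for full scans; this is illustrative arithmetic based on the rates above, not a benchmark.

```python
# Back-of-the-envelope: time to read 1 TB sequentially at the rates above
# (illustrative only; ignores seeks, protocol overhead, and parallelism).
TB = 10**12  # bytes

for name, mb_per_s in [("HDD, 100 MB/s", 100),
                       ("SSD, 600 MB/s", 600),
                       ("1 Gb/s network, 125 MB/s", 125)]:
    seconds = TB / (mb_per_s * 10**6)
    print(f"Scanning 1 TB over {name}: ~{seconds / 3600:.1f} hours")
```

Reading a terabyte from a single disk takes hours, which is one reason big data systems spread both data and I/O across many machines.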
Scalable data ingestion and processing

• Data ingestion
  • Data from different, complementary information systems has to be combined to gain a more comprehensive basis for satisfying information needs
  • How to ingest data efficiently from various, distributed, heterogeneous sources?
    • Different data formats
    • Different data models and schemas
    • Security and privacy
• Data processing
  • How to process massive volumes of data in a timely fashion?
  • How to process massive streams of data in real time?
  • Traditional parallel, distributed processing (OpenMP, MPI)
    • Big learning curve
    • Scalability is limited
    • Fault tolerance is hard to achieve
    • Expensive, high-performance computing infrastructure
  • Novel real-time processing architectures (a minimal sketch follows below)
    • E.g. mini-batches in Spark Streaming
    • E.g. complex event processing in Apache Flink

31
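To make the mini-batch idea concrete, here is a minimal sketch using Spark Streaming's DStream API. It is an assumed example: the socket source on localhost:9999 and the 5-second batch interval are placeholders, not details from the slides.

```python
# Minimal word count over a socket stream, processed in 5-second mini-batches.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "MiniBatchWordCount")
ssc = StreamingContext(sc, batchDuration=5)  # each micro-batch covers 5 seconds of input (placeholder interval)

lines = ssc.socketTextStream("localhost", 9999)  # placeholder source host/port
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print the counts computed for each mini-batch

ssc.start()
ssc.awaitTermination()
```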
Scalable analytic algorithms

• Challenges
  • Big volume
  • Big dimensionality
  • Real-time processing
• Scaling up machine learning algorithms
  • Adapting the algorithm to handle Big Data on a single machine
    • E.g. sub-sampling
    • E.g. principal component analysis
    • E.g. feature extraction and feature selection
  • Scaling up algorithms by parallelism (a k-NN sketch follows below)
    • E.g. k-NN classification based on MapReduce
    • E.g. scaling up support vector machines (SVM) by a divide-and-conquer approach

32
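As a concrete illustration of the MapReduce-style k-NN idea, the sketch below scores one query point against a distributed training set with Spark RDDs; the toy data, k=3, and the majority vote are assumptions for illustration only.

```python
# k-NN classification sketched in map/reduce terms: map every training point
# to its distance from the query, then keep the k globally nearest and vote.
import math
from pyspark import SparkContext

sc = SparkContext("local[*]", "KnnSketch")

k = 3
query = (1.0, 2.0)

# Toy (features, label) pairs; in practice this RDD would be loaded from HDFS.
train = sc.parallelize([((0.0, 0.0), "a"), ((1.0, 1.9), "b"), ((5.0, 5.0), "a"),
                        ((0.9, 2.1), "b"), ((4.0, 4.5), "a")])

def distance(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

# "Map": distance to the query; "reduce": global top-k smallest distances.
neighbours = (train.map(lambda point: (distance(point[0], query), point[1]))
                   .takeOrdered(k, key=lambda pair: pair[0]))

labels = [label for _, label in neighbours]
print(max(set(labels), key=labels.count))  # majority vote among the k neighbours
```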
E.g. the curse of dimensionality
• The required number of samples (to achieve the same accuracy) grows exponentially with the number of variables!
• In practice the number of training examples is fixed, so the classifier's performance usually degrades as the number of features grows (a small illustration follows below).
• In fact, after a certain point, increasing the dimensionality of the problem by adding new features actually degrades the performance of the classifier.

33
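A tiny illustration of the exponential-growth claim, assuming (purely for illustration) that each feature is discretized into 10 bins and that keeping the sampling density constant requires roughly one sample per cell.

```python
# Illustrative only: number of cells to cover at the same density
# when each of p features is split into 10 bins.
BINS = 10
for p in (1, 2, 3, 10):
    print(f"{p} feature(s): {BINS ** p} cells (and roughly that many samples needed)")
```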
Utilization and interpretability of big data

• Domain expertise is needed to identify problems and interpret analytics results
• Scalable visualization of millions of data points
  • to facilitate their interpretation and understanding

34
Privacy and security

35
Big data job trends

36
Talent shortage in big data

37
Big data skill set

38
Data engineers vs. data scientists

39
How to land big data related jobs
• Learn to code
• Coursera
• Udacity
• Freecodecamp
• Codecademy
• Math, Stats and machine learning
• Kaggle
• Hadoop, NoSQL, Spark
• Visualization and Reporting
• Tableau
• Pentaho
• Meetup & Share
• Find a mentor
• Internships, projects

40
Data science method
1. Formulate a question
2. Gather data
3. Analyze data
4. Product

Source: Foundational Methodology for Data Science, IBM, 2015 41


DeepQA: Incremental Progress in Precision and Confidence, 6/2007-11/2010

[Chart: precision (0-100%) versus % of questions answered (0-100%), showing successive DeepQA system snapshots from the 12/2007 baseline through 11/2010, when the system reached the "Winners Cloud" region.]
42
Cleaning big data: most time-consuming, least enjoyable data science task

• Data preparation accounts for about 80% of the work of data scientists

source: https://www.forbes.com/ 43
Cleaning big data: most time-consuming, least enjoyable data science task

• 57% of data scientists regard cleaning and organizing data as the least
enjoyable part of their work and 19% say this about collecting data
sets.

44
Online courses
• https://www.coursera.org/learn/nosql-database-systems
• https://who.rocq.inria.fr/Vassilis.Christophides/Big/index.htm
• https://www.coursera.org/learn/big-data-introduction?specialization=big-data
• https://www.coursera.org/learn/big-data-integration-processing?specialization=big-data
• https://www.coursera.org/learn/big-data-management?specialization=big-data
• https://www.coursera.org/learn/hadoop
• https://www.coursera.org/learn/scala-spark-big-data

46
