
Introduction to big data management and processing

Viet-Trung Tran
Syllabus
1. Overview of big data storage and processing
2. The Hadoop ecosystem
3. The Hadoop Distributed File System (HDFS)
4. Non-relational (NoSQL) databases - part 1: Overview
5. Non-relational (NoSQL) databases - part 2: Common distributed architectures
6. Non-relational (NoSQL) databases - part 3: SQL querying on NoSQL
7. Distributed messaging systems
8. Batch processing techniques for big data - part 1: MapReduce
9. Batch processing techniques for big data - part 2: Apache Spark
10. Stream processing techniques for big data: Spark Streaming
11. Big data architectures: Lambda architecture
12. Big data analytics: Spark ML
How big is big data?
Data science: The 4th paradigm for scientific discovery
Big data in 2008
Big data in 2014
Big data today
Big data's numbers
Big data sources
• E-commerce
• Social networks
• Internet of things
• Data-intensive experiments (bioinformatics, quantum physics, etc.)

11
Data is the new oil

12
Big data: the 5 V's (volume, velocity, variety, veracity, value)

Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them (Wikipedia).
13
Big data – big value

source: wipro.com
14
Big Data in the education industry
• Customized and dynamic learning programs
  • Data collected on each student's learning history can be used to create customized programs and schemes that benefit individual students, improving overall results.
• Reframing course material
  • Course material can be reframed based on data about what, and how well, each student learns, collected through real-time monitoring of course components.
• Grading systems
  • New advances in grading systems have been introduced as a result of proper analysis of student data.
• Career prediction
  • Appropriate analysis of every student's records helps reveal their progress, strengths, weaknesses, and interests, and helps determine which career would suit them best in the future.

15
Edtech
• Coursera
• VioEdu
• https://byjus.com/
• Engaging Video Lessons
• Personalized Learning Journeys
• Mapped to the Syllabus
• In-depth Analysis
• Engaging Interactive Questions

16
Big Data in the healthcare industry
• Big data reduces treatment costs, since there is less chance of having to perform unnecessary diagnoses.
• It helps predict outbreaks of epidemics and decide what preventive measures could be taken to minimize their effects.
• It helps avoid preventable diseases by detecting them in early stages, before they get any worse, which in turn makes their treatment easier and more effective.
• Patients can be provided with evidence-based medicine, identified and prescribed after research on past medical results.

17
Big Data in the government sector
• Welfare schemes
  • Making faster, better-informed decisions regarding various political programs
  • Identifying areas in immediate need of attention
  • Staying up to date in agriculture by keeping track of all existing land and livestock
  • Overcoming national challenges such as unemployment, terrorism, energy resources exploration, and much more
• Cyber security
  • Big Data is heavily used for fraud detection.
  • It is also used in catching tax evaders.

18
Big Data in the media and entertainment industry

• Predicting the interests of audiences
• Optimized or on-demand scheduling of media streams on digital media distribution platforms
• Getting insights from customer reviews
• Effective targeting of advertisements
• Examples
  • Spotify, an on-demand music platform, uses Big Data analytics to collect data from all its users around the globe, then uses the analyzed data to give informed music recommendations and suggestions to every individual user.
  • Amazon Prime, which offers videos, music, and Kindle books in a one-stop shop, also makes heavy use of big data.

19
Big data in scientific discovery

20
Maximilien Brice, © CERN
Use Case: Thomson Reuters

• About: Thomson Reuters is the world's leading source of news and information for the financial and risk, legal, tax & accounting, and media markets.
• Challenge: Their data source (Twitter) produces a gigantic amount of data daily. It is challenging to (a) quickly analyze all the tweets and (b) distinguish real news from fake news and opinions.
• Impact after applying Big Data solutions:
  • Able to process 13 million tweets daily
  • Captures and detects news events across millions of tweets in 40 milliseconds
• (Source: https://www.cloudera.com/about/customers/thomson-reuters.html)
Use Case: MasterCard

• About: MasterCard operates the world's fastest payments processing network, delivering the products and services that make everyday commerce activities — such as shopping, traveling, running a business, and managing finances — easier, more secure, and more efficient.
• Challenge: Identify fraud across one million inquiries per month against a database that contains hundreds of millions of fraudulent businesses.
• Impact after applying Big Data solutions:
  • 5x increase in the number of searches supported annually
  • 25x increase in searches per customer daily
  • Dramatically improved search accuracy
• (Source: https://www.cloudera.com/about/customers/thomson-reuters.html)
Use Case: Intel Supply Chain

• About: Intel's supply chain reflects the company's global operations — Intel does business in more than 100 countries, with over 450 supplier factories and 16,000 suppliers.
• Challenge:
  • Multiple data hops -- data latencies of up to 12 hours
  • Data fragmentation, data reconciliation, and quality issues
• Impact after applying Big Data solutions:
  • Reduced the planning DB by 63% and the enterprise common core (ECC) DB by 80%
  • DB transactions decreased by 25-50%; ECC processing time decreased by 50%
  • 75% efficiency gain in account reconciliation; 45% reduction of IT support staff
• (Source: https://www.intel.com/content/www/us/en/it-management/intel-it-best-practices/transforming-intels-supply-chain-with-real-time-analytics-paper.html)
Use Case: Cisco WebEx

• About: Cisco WebEx supports more than 26 billion conference minutes each month. Its audio, video, and web conferencing services help users connect and collaborate with colleagues around the world.
• Challenge:
  • Gain an end-to-end view of the customer experience
  • Support an increasing volume of telemetry data
  • More rapidly uncover new fraud tactics
• Impact after applying Big Data solutions:
  • Identified 17x more fraud
  • Delivered platform at 1/10 the cost of traditional data warehouse and BI environment
• (Source: https://www.cloudera.com/about/customers/cisco.html)
Use Case: Deutsche Telekom

• About: Deutsche Telekom is a leading European telecommunications provider, delivering services to more than 150 million customers globally.
• Challenge:
  • Preventing network fraud
  • Data visibility and scalability
• Impact after applying Big Data solutions:
  • 5-10% lower customer churn
  • 10-20% lower revenue losses from fraud activities
  • 50% better operational efficiency
• (Source: https://www.cloudera.com/about/customers/deutsche-telekom.html)
Top 10 Company Market Cap Ranking History
(1998-2018)

https://www.youtube.com/watch?v=fobx4wIS6W0
26
Top 10 Company Market Cap Ranking History
(1998-2018)

27
Big data technology stack

28
Scalable data management
• Scalability
  • Able to manage increasingly big volumes of data
• Accessibility
  • Able to maintain efficiency in reading and writing data (I/O) into data storage systems
• Transparency
  • In a distributed environment, users should be able to access data over the network as easily as if the data were stored locally (see the sketch below).
  • Users should not have to know the physical location of data to access it.
• Availability
  • Fault tolerance
  • The number of users, system failures, or other consequences of distribution shouldn't compromise availability.

29
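The location-transparency point above can be made concrete with a small sketch. This is an illustrative example, not taken from the slides; the cluster address "namenode:9000" and the file path are assumed placeholders.

```python
# Minimal sketch of location transparency in a distributed file system:
# the application refers to data by a logical path, while block placement
# and replica selection are handled by HDFS, not by the user.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TransparencyDemo").getOrCreate()

# Illustrative address/path: the same one-line read works whether the file's
# blocks live on one node or are spread across a thousand nodes.
df = spark.read.text("hdfs://namenode:9000/data/events.log")
print(df.count())
```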
Data I/O landscape (approximate figures; a back-of-the-envelope calculation follows below)

• CPUs: ~10 GB/s
• Local disk (HDD): ~100 MB/s sequential, 3-12 ms random access, ~$0.025 per GB
• SSD: ~600 MB/s sequential, 0.1 ms random access, ~$0.35 per GB
• Network, nodes in the same rack: 1 Gb/s (125 MB/s)
• Network, nodes in another rack: 0.1-1 Gb/s

30
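A rough sanity check of what these numbers imply for full scans; this is illustrative arithmetic based on the rates above, not a benchmark.

```python
# Back-of-the-envelope: time to read 1 TB sequentially at the rates above
# (illustrative only; ignores seeks, protocol overhead, and parallelism).
TB = 10**12  # bytes

for name, mb_per_s in [("HDD, 100 MB/s", 100),
                       ("SSD, 600 MB/s", 600),
                       ("1 Gb/s network, 125 MB/s", 125)]:
    seconds = TB / (mb_per_s * 10**6)
    print(f"Scanning 1 TB over {name}: ~{seconds / 3600:.1f} hours")
```

Reading a terabyte from a single disk takes hours, which is one reason big data systems spread both data and I/O across many machines.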
Scalable data ingestion and processing

• Data ingestion
  • Data from different, complementary information systems has to be combined to gain a more comprehensive basis for satisfying information needs
  • How to ingest data efficiently from various, distributed, heterogeneous sources?
    • Different data formats
    • Different data models and schemas
    • Security and privacy
• Data processing
  • How to process massive volumes of data in a timely fashion?
  • How to process massive streams of data in real time?
  • Traditional parallel, distributed processing (OpenMP, MPI)
    • Big learning curve
    • Scalability is limited
    • Fault tolerance is hard to achieve
    • Expensive, high-performance computing infrastructure
  • Novel real-time processing architectures (a minimal sketch follows below)
    • E.g. mini-batches in Spark Streaming
    • E.g. complex event processing in Apache Flink

31
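To make the mini-batch idea concrete, here is a minimal sketch using Spark Streaming's DStream API. It is an assumed example: the socket source on localhost:9999 and the 5-second batch interval are placeholders, not details from the slides.

```python
# Minimal word count over a socket stream, processed in 5-second mini-batches.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "MiniBatchWordCount")
ssc = StreamingContext(sc, batchDuration=5)  # each micro-batch covers 5 seconds of input (placeholder interval)

lines = ssc.socketTextStream("localhost", 9999)  # placeholder source host/port
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print the counts computed for each mini-batch

ssc.start()
ssc.awaitTermination()
```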
Scalable analytic algorithms

• Challenges
  • Big volume
  • Big dimensionality
  • Real-time processing
• Scaling up machine learning algorithms
  • Adapting the algorithm to handle Big Data on a single machine
    • E.g. sub-sampling
    • E.g. principal component analysis
    • E.g. feature extraction and feature selection
  • Scaling up algorithms by parallelism (a k-NN sketch follows below)
    • E.g. k-NN classification based on MapReduce
    • E.g. scaling up support vector machines (SVM) by a divide-and-conquer approach

32
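As a concrete illustration of the MapReduce-style k-NN idea, the sketch below scores one query point against a distributed training set with Spark RDDs; the toy data, k=3, and the majority vote are assumptions for illustration only.

```python
# k-NN classification sketched in map/reduce terms: map every training point
# to its distance from the query, then keep the k globally nearest and vote.
import math
from pyspark import SparkContext

sc = SparkContext("local[*]", "KnnSketch")

k = 3
query = (1.0, 2.0)

# Toy (features, label) pairs; in practice this RDD would be loaded from HDFS.
train = sc.parallelize([((0.0, 0.0), "a"), ((1.0, 1.9), "b"), ((5.0, 5.0), "a"),
                        ((0.9, 2.1), "b"), ((4.0, 4.5), "a")])

def distance(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

# "Map": distance to the query; "reduce": global top-k smallest distances.
neighbours = (train.map(lambda point: (distance(point[0], query), point[1]))
                   .takeOrdered(k, key=lambda pair: pair[0]))

labels = [label for _, label in neighbours]
print(max(set(labels), key=labels.count))  # majority vote among the k neighbours
```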
E.g. the curse of dimensionality
• The required number of samples (to achieve the same accuracy) grows exponentially with the number of variables!
• In practice the number of training examples is fixed, so the classifier's performance usually degrades as the number of features grows (a small illustration follows below).
• In fact, after a certain point, increasing the dimensionality of the problem by adding new features actually degrades the performance of the classifier.

33
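A tiny illustration of the exponential-growth claim, assuming (purely for illustration) that each feature is discretized into 10 bins and that keeping the sampling density constant requires roughly one sample per cell.

```python
# Illustrative only: number of cells to cover at the same density
# when each of p features is split into 10 bins.
BINS = 10
for p in (1, 2, 3, 10):
    print(f"{p} feature(s): {BINS ** p} cells (and roughly that many samples needed)")
```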
Utilization and interpretability of big data

• Domain expertise is needed to identify problems and interpret analytics results
• Scalable visualization of millions of data points
  • to facilitate their interpretation and understanding

34
Privacy and security

35
Big data job trends

36
Talent shortage in big data

37
Big data skill set

38
Data engineers vs. data scientists

39
How to land big data related jobs
• Learn to code
• Coursera
• Udacity
• Freecodecamp
• Codecademy
• Math, Stats and machine learning
• Kaggle
• Hadoop, NoSQL, Spark
• Visualization and Reporting
• Tableau
• Pentaho
• Meetup & Share
• Find a mentor
• Internships, projects

40
Data science method
1. Formulate a question
2. Gather data
3. Analyze data
4. Product

Source: Foundational Methodology for Data Science, IBM, 2015 41


DeepQA: Incremental Progress in Precision and Confidence, 6/2007-11/2010

[Chart: precision (0-100%) versus % of questions answered (0-100%), showing successive DeepQA system snapshots from the 12/2007 baseline through 11/2010, when the system reached the "Winners Cloud" region.]
42
Cleaning big data: most time-consuming, least enjoyable data science task

• Data preparation accounts for about 80% of the work of data scientists

source: https://www.forbes.com/ 43
Cleaning big data: most time-consuming, least enjoyable data science task

• 57% of data scientists regard cleaning and organizing data as the least
enjoyable part of their work and 19% say this about collecting data
sets.

44
Online courses
• https://www.coursera.org/learn/nosql-database-systems
• https://who.rocq.inria.fr/Vassilis.Christophides/Big/index.htm
• https://www.coursera.org/learn/big-data-introduction?specialization=big-data
• https://www.coursera.org/learn/big-data-integration-processing?specialization=big-data
• https://www.coursera.org/learn/big-data-management?specialization=big-data
• https://www.coursera.org/learn/hadoop
• https://www.coursera.org/learn/scala-spark-big-data

46
