Ins and Outs of Big Data: A Review

Conference Paper · October 2016


DOI: 10.1007/978-3-319-51234-1_7


All content following this page was uploaded by Hamidur Rahman on 24 October 2016.



The 3rd EAI International Conference on IoT Technologies for HealthCare (HealthyIoT 2016),
Västerås, Sweden, 18-19 Oct. 2016.

Ins and Outs of Big Data: A Review

Hamidur Rahman, Shahina Begum and Mobyen Uddin Ahmed


School of Innovation, Design and Engineering,
Mälardalen University, Västerås, Sweden
{[email protected]}
Abstract. With the rapid development of digital technologies and advanced
communications, a gigantic amount of data with massive and complex structure,
called ‘Big data’, is being produced every day at an exponentially growing
rate. The arrival of social media and the advent of smart homes, offices and
hospitals connected through the Internet of Things (IoT) also contribute
considerably to Big data. According to the study, Big data refers to data sets
of large magnitude that include structured, semi-structured and unstructured
data. The study also presents new technologies for data analysis, collection,
fast searching, proper sharing, exact storing, speedy transfer and hidden
pattern visualization, as well as violations of privacy. This paper presents an
overview of the ins and outs of Big data, in which the content, scope, samples,
methods, advantages, challenges and privacy of Big data are discussed. The goal
of this article is to provide Big data knowledge to the research community for
the sake of its many real-life applications such as traffic management, driver
monitoring, health care in hospitals, meteorology and so on.
Keywords: Big data Issue, Framework, Analytics, Challenges, Tools.

1 Introduction
The term ‘Big data’ has entered the research community more visibly since 2013.
Several authors have tried to explain the definition and the possible issues,
technologies, challenges and privacy of big data in a concise way [1-5]. For
example, in 2001 Laney highlighted the challenges and opportunities generated
by increased data through a 3Vs model, i.e., increases in volume, velocity and
variety [6]. In recent years, the world has become highly digitalized and
interconnected, and as a result the amount of data has been exploding.
Managing this massive amount of records therefore requires extremely powerful
business intelligence. The problem becomes even harder during data acquisition:
if the amount of data is too large, it is unclear what data to keep, what to
discard and how to store the data reliably. A clear definition of Big data,
covering the accumulation of different sorts of huge amounts of data, has been
in use only for the last 2-3 years. In 2015, the digital world expanded by 5.6
exabytes (1 exabyte = 10^18 bytes) of data created each day. This figure is
expected to double every 24 months or so [7]. As a result, storing, managing,
sharing, analyzing and visualizing information with typical database software
tools is not only difficult but also very risky. Big data can be structured,
semi-structured or unstructured in nature, but it can help businesses by
producing automated services to target their potential partners, agents or
customers.
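As a rough illustration of the growth claim, the sketch below projects the daily data volume from the 5.6 exabytes per day quoted for 2015. It assumes a strict exponential doubling every 24 months, which is a simplification of the "every 24 months or so" reported in [7]; the function name and its parameters are hypothetical and chosen only for this example.

```python
def projected_daily_volume_eb(years_after_2015, base_eb=5.6, doubling_months=24.0):
    """Daily data volume in exabytes under a strict doubling model.

    Assumption: volume(t) = base_eb * 2 ** (t / doubling_months), t in months.
    """
    months = years_after_2015 * 12.0
    return base_eb * 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    for years in (0, 2, 4, 6):
        volume = projected_daily_volume_eb(years)
        print(f"{2015 + years}: about {volume:.1f} EB per day")
```

Under these assumptions the daily volume doubles from 5.6 to 11.2 exabytes after two years and reaches 44.8 exabytes after six years; a different doubling period can be tried by changing `doubling_months`.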
Some Big data review articles are available online, but most of them emphasize
a specific area, e.g., big data frameworks, big data challenges or big data
applications, and almost all of them fail to provide a complete overview of
Big data [2, 8, 9]. In this paper, we present a complete overview of Big data
and its present state of the art. Additionally, we identify the important
characteristics of big data, Big data frameworks and analytics, challenges of
big data and possible solutions, and big data tools and their applications in
well-known companies. This article should be helpful for new researchers,
especially data scientists, research institutes and companies, to gain insight
into big data and its latest technologies for their research planning, business
activities and future demand for handling massive amounts of data.

2 Materials and Methods


Big data is a relatively new topic and the number of research articles
published so far in this area is limited. Around 60 Big data related articles
were collected from different online sources, with the IEEE Xplore,
ResearchGate and Google Scholar databases as the preferred sources. Some of the
articles were found through web searches in Google Chrome. During the search,
key words such as ‘big data’, ‘big data issues’, ‘big data challenges’, ‘big
data analytics’ and ‘recent trend of big data’ were used, and preference was
given to the most recent articles available in the online databases. We only
considered articles published between 2013 and 2016. About 70% of the collected
articles were studied in detail for this paper; the remaining 30% were excluded
because they were similar to the considered articles or less important for the
study.

3 Big Data Characteristics


Big data is usually characterized by three dimensions, the 3Vs: Volume,
Velocity and Variety [6]. However, other dimensions such as veracity, validity
and value can be at least equally important. In this study, the three
additional dimensions Veracity, Validity and Value are therefore considered and
presented in Figure 1 as the “6 Vs” of big data. The 1st V is Volume, which
concerns the amount of generated data, increasing tremendously every day. The
2nd V is Velocity, which has come to light because more and more data must be
provided to users immediately, whenever required, for real-time processing.
Variety is the 3rd V, considered because of the tremendous growth in the data
sources needed for analysis.
Veracity, which includes trust in the information received, is often cited as
an important 4th V of big data. Validity involves not only ensuring accurate
measurements but also the transparency of the assumptions and connections
behind the process. Value refers to the fact that the large volumes of data now
collected, measured in petabytes, exabytes or more, are highly valuable for
research institutes and industries.
[Figure 1 content: the 6 Vs of Big Data]
Volume: Terabyte, Petabyte, Records, Transaction
Velocity: Data-in-Motion, Data-in-Rest, Real-time/offline
Variety: Structured, Unstructured, Semi-Structured
Veracity: Authenticity, Availability, Accountability, Trustworthiness
Validity: Correct Data, Incorrect Data
Value: Statistical, Hypothetical, Correlations, Modeling

Fig. 1. The important characteristics of Big data (6V’s)

4 Big Data Analytic and Frameworks


In general, data analytics is one of the major parts of a Big data
environment; it is responsible for reducing the complexity of the data and of
the computation in order to obtain the expected patterns and outcomes from the
data sets. As a whole, a Big data framework comprises 3 main tasks, namely
initial planning, implementation and evaluation, which together span the 8
layers shown in Figure 2 [1].
Initialization: The first layer of any Big data framework is the primary
planning, which requires new investments for big changes; these changes
basically include the installation of a new technological infrastructure and a
new way to process and control data [10]. At the beginning it is extremely
important to identify the problems that need a solution and to decide whether
they can be solved with new technologies or just with the available software
and techniques. These problems could be large-volume challenges, real-time
processing, predictive analytics, on-demand analytics and so on.

[Figure 2 content: tasks and their layers]
Initialization: Starting
Implementation: Data Acquisition, Data Preprocessing, Search Retrieval,
Data Analysis, Visualization, Storage
Post Implementation: Evaluation and Sharing

Fig. 2. Layers of Big Data Analytic Task and Framework


Implementation: The second task is implementation, which comprises several
activities such as data storage, pre-processing, search retrieval, analysis and
visualization. Cloud computing technology is a great advantage in overcoming
the storage capacity problem [11, 12] and provides easy access for applications
from different corners of the world. Data analysis is one of the most important
steps in implementation, and various preprocessing operations are necessary to
address the different imperfections in the collected raw data. As the data
sources differ, multi-source data fusion technology such as [13-15] can be
applied. After that, all the data must be pre-processed to avoid duplication,
remove noise and delete unwanted signals [16-20]. For example, data can have
multiple formats because heterogeneous sources are involved. It can also be
mixed with noise in the form of unnecessary data, errors, outliers etc.
Additionally, the data may subsequently need to be adapted to fit the
requirements of the analysis algorithms. Therefore, data preprocessing includes
a wide range of operations such as cleaning, integration, reduction,
normalization, transformation, discretization etc. When the pre-processing
stage is done, search retrieval is performed to extract values for further
analysis by companies and institutions. Advanced analytics is one of the most
efficient approaches; it provides descriptive, inquisitive, predictive and
prescriptive analytics to perform complex analysis on either structured or
unstructured data. When the analysis is done, visualization is very important:
it guides the analysis process and presents the summary of the results in a
transparent, understandable and meaningful way. For simple graphical
representation of data, most software packages support classical charts and
dashboards.
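Three of the preprocessing operations named above (cleaning, normalization and discretization) can be sketched on a small list of numbers. This is only a minimal illustration: the function names, the outlier rule based on the median absolute deviation, its threshold and the sample readings are all assumptions made for this example and are not taken from the reviewed articles.

```python
from statistics import median


def clean(values, max_deviation=3.5):
    """Cleaning: drop missing readings (None) and gross outliers.

    A value counts as an outlier when its distance from the median exceeds
    `max_deviation` times the median absolute deviation (MAD). The threshold
    is an illustrative assumption, not a recommended setting.
    """
    present = [v for v in values if v is not None]
    if not present:
        return []
    center = median(present)
    mad = median(abs(v - center) for v in present)
    if mad == 0:
        return present  # no spread, so nothing can be flagged as an outlier
    return [v for v in present if abs(v - center) / mad <= max_deviation]


def normalize(values):
    """Normalization: min-max rescaling of the values to the range [0, 1]."""
    if not values:
        return []
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def discretize(unit_values, bins=3):
    """Discretization: map values in [0, 1] to equal-width bin indices."""
    return [min(int(v * bins), bins - 1) for v in unit_values]


raw = [12.0, 11.5, None, 12.4, 250.0, 11.9, 12.2]  # made-up sensor readings
cleaned = clean(raw)          # missing value and the 250.0 outlier are removed
scaled = normalize(cleaned)   # smallest reading becomes 0.0, largest 1.0
levels = discretize(scaled)   # each reading is assigned to one of three bins
print(cleaned)
print(levels)
```

In a real big data setting each step would run in a distributed fashion over partitions of the data, but the order of the operations stays the same: clean first, then rescale, then discretize if the analysis algorithm requires categorical input.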
Evaluation and Sharing: The third task of the big data framework is evaluating
the outcome and sharing it among the agents [10]. To evaluate a Big data
project, it is necessary to consider a range of diverse data inputs, the
quality of the data and the expected results. To develop procedures for Big
data evaluation, the project first needs to allow real-time stream processing
and incremental computation of statistics. Parallel processing and the
exploitation of distributed computing are also necessary so that data can be
processed in a reasonable amount of time. It should also be considered whether
the project can easily be integrated with visualization tools. Finally, it
should perform summary indexing to accelerate queries on big datasets.
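Incremental computation of statistics means updating a summary as each new value arrives instead of storing the whole stream. One common way to do this is Welford's online algorithm for the mean and variance, sketched below; the choice of this algorithm and the class name are assumptions for illustration, since the reviewed articles do not prescribe a specific method.

```python
class RunningStats:
    """Mean and sample variance of a stream, updated one value at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the summary (Welford's update)."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; defined as 0.0 for fewer than two observations."""
        if self.count < 2:
            return 0.0
        return self._m2 / (self.count - 1)


stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # made-up stream
    stats.update(reading)
print(stats.count, stats.mean, stats.variance)
```

The summary needs constant memory regardless of how many values have been seen, which is what makes this kind of update suitable for real-time streams.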

5 Big Data Challenges and Inconsistencies


Context awareness is one of the major analytic challenges; it focuses on some
portions of the data only, which helps to limit resource consumption [3].
Another crucial challenge is visual analysis, i.e., how the data appears to
human visual perception. Similarly, data efficiency, correlation between the
features of the data and content validation are notable challenges. Data
privacy, security and trust are also major concerns among organizations. When
the volume of data grows, it is difficult to gain insight into the data within
a given time period. Processing near-real-time data will always require a
processing interval in order to produce satisfactory output. The transition
between structured data stored in well-defined tables and the unstructured data
(images, videos, text) required for analysis affects the end-to-end processing
of data. The invention of new non-relational technologies will provide some
flexibility in data representation and processing.
Whenever big data are collected, aggregated, transformed or represented,
inconsistencies invariably find their way into large datasets [21]. This can be
attributed to a number of factors in human behavior and in the decision-making
process. When datasets contain a temporal attribute, data items with
conflicting circumstances may coincide or overlap in time. The time interval
relationships between conflicting data items can result in partial temporal
inconsistency. Spatial inconsistencies can arise from the geometric
representation of objects, the spatial relations between objects or the
aggregation of composite objects. As big datasets are increasingly generated
from social media, blogs, emails and crowd-sourced ratings, inconsistencies in
unstructured text and messages have become an important research topic. If two
texts refer to the same event or entity, they are said to be co-referent. Event
or entity co-reference is a necessary condition for text inconsistencies.
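The temporal case can be made concrete with a small check that flags pairs of records about the same entity that carry different values while their time intervals overlap. This is a simplified sketch and not the formal treatment in [21]: the record layout, the half-open interval convention and the sample data are assumptions made for illustration.

```python
from itertools import combinations
from typing import NamedTuple


class Record(NamedTuple):
    entity: str   # what the record is about
    value: str    # the claimed state of the entity
    start: int    # interval start (inclusive), e.g. an hour index
    end: int      # interval end (exclusive)


def overlaps(a, b):
    """True when the half-open intervals [start, end) share any time."""
    return a.start < b.end and b.start < a.end


def temporal_conflicts(records):
    """Pairs with the same entity, different values and overlapping time."""
    return [
        (a, b)
        for a, b in combinations(records, 2)
        if a.entity == b.entity and a.value != b.value and overlaps(a, b)
    ]


records = [  # made-up example data
    Record("patient-1", "ward A", 0, 10),
    Record("patient-1", "ward B", 8, 12),   # overlaps the first record
    Record("patient-1", "ward C", 12, 15),  # starts when the second ends
    Record("patient-2", "ward B", 0, 10),   # different entity, no conflict
]
conflicts = temporal_conflicts(records)
for first, second in conflicts:
    print(first, "conflicts with", second)
```

The pairwise comparison is quadratic in the number of records, so on large datasets one would typically group records by entity and sort them by start time first.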

6 Big Data Domain, Technology, Tools and Solution


The Big data domain has no boundary and ranges from retail applications to
governmental work. A big data set might amount to a petabyte (1024 terabytes)
or an exabyte (1024 petabytes) of data consisting of billions to trillions of
records of millions of people from different sources such as educational
institutions, research institutions, hospitals, small or multinational
companies, customer care, weather records, demographic data, social media
records, astronomical data etc. [21]. These massive data sets and their
applications draw on technologies such as Mathematics, Artificial Intelligence
(especially Machine Learning), Data Mining, Cloud Computing, Real-Time Data
Streaming and so on [21]. Time series analysis is also very useful, and there
are many visualization technologies that can be used with Big data.
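The unit sizes mentioned above follow the binary convention, in which each unit is 1024 times the previous one; a decimal convention with steps of 1000, in which an exabyte is 10^18 bytes, is also in common use. The short sketch below compares the two; the function and constant names are hypothetical and serve only this illustration.

```python
BINARY_STEP = 1024   # each unit is 1024 times the previous one
DECIMAL_STEP = 1000  # each unit is 1000 times the previous one

UNITS = ["byte", "kilobyte", "megabyte", "gigabyte",
         "terabyte", "petabyte", "exabyte"]


def unit_size_in_bytes(unit, step=BINARY_STEP):
    """Size of one `unit` in bytes under the chosen convention."""
    return step ** UNITS.index(unit)


if __name__ == "__main__":
    for unit in ("terabyte", "petabyte", "exabyte"):
        binary = unit_size_in_bytes(unit, BINARY_STEP)
        decimal = unit_size_in_bytes(unit, DECIMAL_STEP)
        print(f"1 {unit}: {binary} bytes (binary), {decimal} bytes (decimal), "
              f"ratio {binary / decimal:.3f}")
```

At the exabyte scale the two conventions differ by roughly 15%, so it is worth stating which one is meant when data volumes are reported.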
Today most renowned companies use big data tools for their particular needs.
For example, Hadoop¹ and MongoDB² are two of the leading data storage and
management tools; Hadoop is used by Google and Amazon, and MongoDB by MIT and
MTV. For data cleaning, the companies Stratebi and Platon use the DataCleaner³
tool. Teradata⁴ is another big data tool, used for data mining by Air Canada
and Cisco. Autodesk uses Qubole⁵ for its data analysis. For big data
visualization, Plot.ly⁶ is one of the most prominent tools and is used by
renowned companies such as Google, Goji and VTT. For data integration,
Pentaho⁷ is one of the leading tools, used by CAT, Logitech etc. Python⁸ is
widely used as a Big data language by companies such as AstraZeneca and
Carmanah. As a big data collection tool, Import.io⁹ is a pioneer and is used by
Quid, Nygg, OpenRise etc. Big data tools used by well-known companies are
listed in Table 1.

1 http://wiki.apache.org/hadoop/PoweredBy#G
2 https://www.mongodb.com/industries
3 https://datacleaner.org/testimonials
4 http://www.teradata.se/customers-list/browse/?LangType=1053&LangSelect=true
5 https://www.qubole.com/customer/?nabe=5695374637924352:1
6 https://plot.ly/#trusted-by
7 http://www.pentaho.com/customers
8 https://www.python.org/about/success/#engineering
9 https://www.import.io/
Table 1. Big Data Tools used by renowned Companies
No  Big data Tool  Where it is used
1   Hadoop         Google, Amazon, Alibaba, Facebook etc.
2   MongoDB        Citigroup, MIT, GOV.UK, eBay, MTV etc.
3   DataCleaner    Stratebi, Platon, BestBrains etc.
4   Teradata       Air Canada, Cisco, Coca-Cola, Coop, Dell, Daimler etc.
5   Qubole         Autodesk, Answers.com, Capilary, Quora, Nextdoor etc.
6   Plot.ly        Google, Goji, VTT, U.S. Air Force etc.
7   Pentaho        CAT, Nasdaq, Logitech, U.S. Navy etc.
8   Python         Forecastwatch.com, AstraZeneca, Carmanah etc.
9   Import.io      Quid, Nygg, OpenRise, University of Houston etc.

There are thousands of Big data tools, both commercial and available as free
trials, for extraction, storage, cleaning, mining, visualization, analysis and
integration. Table 2 shows the most popular big data tools.

Table 2. A number of popular Big Data Tools¹⁰.

No  Big data Area                Tools
1   Data Storage and Management  Hadoop, Cloudera, MongoDB, Talend
2   Data Cleaning                OpenRefine, DataCleaner
3   Data Mining                  RapidMiner, Teradata, FramedData, Kaggle
4   Data Analysis                Qubole, BigML, Statwing
5   Data Visualization           Tableau, Silk, CartoDB, Chartio, Plot.ly
6   Data Integration             Blockspring, Pentaho
7   Data Languages               R, Python, RegEx, XPath
8   Data Collection              Import.io

7 Conclusion
A general overview and concept of Big data has been discussed in this article,
including the 6 Vs of Big data, its frameworks and analytic issues.
Additionally, the difference between big and small data, popular tools,
inconsistencies and challenges have been reviewed. For the management and
analysis of petabytes and exabytes of data, a big data management system
cooperates to ensure a high level of data quality and accessibility and helps
to locate valuable information in large sets of unstructured and unplanned
data. The techniques reviewed here can be applied to various fields of
engineering, industry and medical science. Some real-life applications such as
autonomous driving, smooth transition for semi-autonomous driving or driver
monitoring in the context of big data analysis will be presented as future
work.

Acknowledgement: The authors would like to acknowledge the Swedish Knowledge Foundation
(KKS), the Swedish Governmental Agency for Innovation Systems (VINNOVA), Volvo Car Corporation, the
Swedish National Road and Transport Research Institute, Autoliv AB, Hök Instrument AB, and Prevas
AB Sweden for their support of the research projects in this area.

10 https://www.import.io/post/all-the-best-big-data-tools-and-how-to-use-them/
References
[1] F. Tekiner and J. A. Keane, "Big Data Framework," in 2013 IEEE International Conference on
Systems, Man, and Cybernetics, 2013, pp. 1494-1499.
[2] S. Sagiroglu and D. Sinanc, "Big data: A review," in Collaboration Technologies and Systems
(CTS), 2013 International Conference on, 2013, pp. 42-47.
[3] A. Katal, M. Wazid, and R. H. Goudar, "Big data: Issues, challenges, tools and Good practices,"
in Contemporary Computing (IC3), 2013 Sixth International Conference on, 2013, pp. 404-409.
[4] W. Xiong, Z. Yu, Z. Bei, J. Zhao, F. Zhang, Y. Zou, et al., "A characterization of big data
benchmarks," in Big Data, 2013 IEEE International Conference on, 2013, pp. 118-125.
[5] T. Lu, X. Guo, B. Xu, L. Zhao, Y. Peng, and H. Yang, "Next Big Thing in Big Data: The
Security of the ICT Supply Chain," in Social Computing (SocialCom), 2013 International
Conference on, 2013, pp. 1066-1073.
[6] D. Laney, "3-D Data Management: Controlling Data Volume, Velocity and Variety," META
Group Original Research Note, 2001.
[7] F. D. Ahmed, A. N. Jaber, M. B. A. Majid, and M. S. Ahmad, "Agent-based Big Data Analytics
in retailing: A case study," in Software Engineering and Computer Systems (ICSECS), 2015 4th
International Conference on, 2015, pp. 67-72.
[8] P. Gupta and N. Tyagi, "An approach towards big data-A review," in Computing,
Communication & Automation (ICCCA), 2015 International Conference on, 2015, pp. 118-123.
[9] T. Rout, M. Garanayak, M. R. Senapati, and S. K. Kamilla, "Big data and its applications: A
review," in Electrical, Electronics, Signals, Communication and Optimization (EESCO), 2015
International Conference on, 2015, pp. 1-5.
[10] H. Mousannif, H. Sabah, Y. Douiji, and Y. O. Sayad, "From Big Data to Big Projects: A Step-
by-Step Roadmap," in Future Internet of Things and Cloud (FiCloud), 2014 International
Conference on, 2014, pp. 373-378.
[11] G. Huang, J. He, C. H. Chi, W. Zhou, and Y. Zhang, "A Data as a Product Model for Future
Consumption of Big Stream Data in Clouds," in Services Computing (SCC), 2015 IEEE
International Conference on, 2015, pp. 256-263.
[12] I. Khan, S. K. Naqvi, M. Alam, and S. N. A. Rizvi, "Data model for Big Data in cloud
environment," in Computing for Sustainable Global Development (INDIACom), 2015 2nd
International Conference on, 2015, pp. 582-585.
[13] G. Suciu, A. Vulpe, R. Craciunescu, C. Butca, and V. Suciu, "Big data fusion for eHealth and
Ambient Assisted Living Cloud Applications," in Communications and Networking
(BlackSeaCom), 2015 IEEE International Black Sea Conference on, 2015, pp. 102-106.
[14] L. T. Yang, L. Kuang, J. Chen, F. Hao, and C. Luo, "A Holistic Approach to Distributed
Dimensionality Reduction of Big Data," IEEE Transactions on Cloud Computing, vol. PP, pp.
1-1, 2015.
[15] Y. Zheng, "Methodologies for Cross-Domain Data Fusion: An Overview," IEEE Transactions
on Big Data, vol. 1, pp. 16-34, 2015.
[16] S. Pandey and V. Tokekar, "Prominence of MapReduce in Big Data Processing," in
Communication Systems and Network Technologies (CSNT), 2014 Fourth International
Conference on, 2014, pp. 555-560.
[17] J. Wang, Z. Song, Q. Li, J. Yu, and F. Chen, "Semantic-based intelligent data clean framework
for big data," in Security, Pattern Analysis, and Cybernetics (SPAC), 2014 International
Conference on, 2014, pp. 448-453.
[18] S. Biookaghazadeh, Y. Xu, S. Zhou, and M. Zhao, "Enabling scientific data storage and
processing on big-data systems," in Big Data (Big Data), 2015 IEEE International Conference
on, 2015, pp. 1978-1984.
[19] Y. Diao, K. Y. Liu, X. Meng, X. Ye, and K. He, "A Big Data Online Cleaning Algorithm Based
on Dynamic Outlier Detection," in Cyber-Enabled Distributed Computing and Knowledge
Discovery (CyberC), 2015 International Conference on, 2015, pp. 230-234.
[20] I. Taleb, R. Dssouli, and M. A. Serhani, "Big Data Pre-processing: A Quality Framework," in
2015 IEEE International Congress on Big Data, 2015, pp. 191-198.
[21] D. Zhang, "Inconsistencies in big data," in Cognitive Informatics & Cognitive Computing
(ICCI*CC), 2013 12th IEEE International Conference on, 2013, pp. 61-67.
