Big Data For Org
http://www.scirp.org/journal/jcc
ISSN Online: 2327-5227
ISSN Print: 2327-5219
Keywords
Big Data, Big Data Models, Organization, Information System
1. Introduction
Business organizations have been using big data to improve their competitive advantage. According to McKinsey [1], organizations that can fully apply big data gain a competitive advantage over their competitors. Facebook users upload hundreds of terabytes of data each day, and these social media data are used to develop more advanced analyses that aim to extract more value from user data. Search engines like Google and Yahoo already monetize by associating appropriate ads with user queries (i.e., Google uses big data to deliver the right ads to the right user in a split second). In applying information systems to improve their organizational systems, most government organizations lag behind business organizations [2]. Meanwhile, some governments have already taken the initiative to gain the advantages of big data; e.g., the Obama administration announced an investment of more than $200 million for Big Data R&D in scientific foundations in 2012 [3]. Today, people are living in the data age, where data has become like oxygen, as organizations are producing more data than they can process.
P. P. Khine, W. Z. Shun
Hierarchy level: Business Insight. Description: Information is extracted and used in a way that helps improve business processes (e.g., predicting the trends of customer buying patterns based on current information).
Data "at rest" are data that can be retrieved from storage systems, such as data warehouses, RDBMS (Relational Database Management System) databases, and file systems, e.g., HDFS (Hadoop Distributed File System).
The traditional "bring data to the operations" style is not suitable for voluminous big data because it wastes a huge amount of computational power. Therefore, big data adopts the style of "operations go where the data exist" to reduce computational costs, which is achieved using well-established distributed and parallel computing technology [5]. Big data also differs from the traditional data paradigm. Traditional data warehouse approaches map data into a predefined schema, a "schema-on-write" approach. When big data systems handle data, there is no predefined schema; instead, the required schema definition is derived from the data itself. Therefore, the big data approach can be considered a "schema-on-read" approach.
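As a rough illustration of the schema-on-read idea, the sketch below (plain Python, not any particular big data framework; the record fields are invented for the example) stores raw records untouched and derives the needed fields only at read time:

```python
import json

# Raw records are stored as-is; no schema is enforced at write time.
raw_store = [
    '{"user": "alice", "clicks": 3}',
    '{"user": "bob", "clicks": 5, "country": "DE"}',
]

def read_with_schema(store, fields):
    """Apply a 'schema' only at read time: pick out the requested
    fields, tolerating records that lack some of them."""
    for line in store:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

rows = list(read_with_schema(raw_store, ["user", "clicks"]))
print(rows)  # [{'user': 'alice', 'clicks': 3}, {'user': 'bob', 'clicks': 5}]
```

Note that the second record's extra `country` field causes no error: under schema-on-read, writers are free to vary, and each reader imposes only the structure it needs.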
In the information age, with the proliferation of data in every corner of the world, the sources of big data can be difficult to differentiate. Big data originates from the proliferation of social media, the IoT, traditional operational systems, and people's involvement. The sources of big data stated in [4] are the IoT (Internet of Things), such as sensors; social networks, such as Twitter; open data made available by governments or some business organizations (e.g., Twitter data); and crowdsourcing, which encourages people to provide and enter data, especially for massive-scale projects (e.g., census data). The popularity, major changes, or new emergence of different organizations will create new sources of big data. For example, in the past, data from social media organizations such as Facebook and Twitter were not predicted to become big data sources. Currently, data from mobile phones handled by telecommunication companies and from IoT devices used in scientific research have become important big data sources. In the future, transportation vehicles with machine-to-machine communication (data for automobile manufacturing firms) and data from smart cities with many interconnected IoT devices will become big data sources because of their involvement in people's daily lives.
The Vs of big data are Volume, for the huge amount of data; Variety, for the different types of data; and Velocity, for the different data rates required by different kinds of systems [6].
Volume: When the scale of data surpasses traditional storage or techniques, that volume of data can generally be labeled big data volume. Depending on the type of organization, the amount of data can vary from one place to another, from gigabytes to terabytes, petabytes, and beyond [1]. Volume is the original characteristic behind the emergence of big data.
Variety: Includes structured data defined with a specific type and structure (e.g., string, numeric, and other data types found in most RDBMS databases); semi-structured data, which has no specific type but has some defined structure (e.g., XML tags, location data); unstructured data with no structure (e.g., audio, voice), whose structure has yet to be discovered [7]; and multi-structured data, which combines structured, semi-structured, and unstructured features [7] [8]. Variety comes from the complexity of data from the different information systems of the target organization.
Velocity: Velocity means the rate of data required by the application systems in the target organization's domain. The velocity of big data can be considered, in increasing order, as batch, near real-time, real-time, and stream [7]. The bigger the data volume, the more challenges velocity will likely face. Velocity is one of the most difficult big data characteristics to handle [8].
As more and more organizations try to use big data, additional V characteristics have appeared one after another, such as value, veracity, and validity. Value means that data retrieved from big data must support the objectives of the target organization and should create surplus value for it [7]. Veracity should address confidentiality in the available data to provide the required data integrity and security. Validity means that the data must come from a valid source and be clean, because these big data will be analyzed and the results applied in the business operations of the target organization.
Another V of data is "viability," or the volatility of data. Viability means the time data need to survive, i.e., the data's lifetime regardless of the systems. Based on viability, data in organizations can be classified into data with an unlimited lifetime and data with a limited lifetime. These data also need to be retrieved and used at some point in time. Viability is also the reason the volume challenge occurs in organizations.
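The lifetime classification above can be sketched as follows (a toy Python example; the record fields and logical clock are invented for illustration): each record carries an optional expiry time, and limited-lifetime records are filtered out once the clock passes it.

```python
# Toy sketch of "viability": records with an unlimited lifetime have
# no expiry, while limited-lifetime records expire at a logical time.
records = [
    {"id": 1, "expires_at": None},  # unlimited lifetime
    {"id": 2, "expires_at": 100},   # limited lifetime, already short
    {"id": 3, "expires_at": 300},
]

def viable(records, now):
    """Return the records still alive at logical time 'now'."""
    return [r for r in records
            if r["expires_at"] is None or r["expires_at"] > now]

alive = [r["id"] for r in viable(records, now=200)]
print(alive)  # [1, 3]
```

Retaining every record forever, as the unlimited-lifetime case does, is exactly how viability feeds the volume challenge described above.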
for sparse data, etc. NoSQL databases use the BASE properties (Basically Available, Soft state, Eventual consistency). Because big data is based on parallel computing and distributed technology, the CAP (Consistency, Availability, Partition tolerance) theorem affects big data technologies [10].
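The "eventual consistency" in BASE can be made concrete with a toy model (plain Python, not any real NoSQL system; the replica and log names are invented): a write lands on one replica first and propagates later, so a read from another replica can briefly return stale data before the replicas converge.

```python
class Replica:
    """A single copy of the data store."""
    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)

primary, secondary = Replica(), Replica()
pending = []  # replication log not yet applied to the secondary

def write(key, value):
    """Acknowledge the write after updating only the primary."""
    primary.data[key] = value
    pending.append((key, value))  # ship to the secondary later

def replicate():
    """Apply the backlog; afterwards the replicas agree."""
    while pending:
        key, value = pending.pop(0)
        secondary.data[key] = value

write("x", 1)
stale = secondary.read("x")       # None: replication has not run yet
replicate()
consistent = secondary.read("x")  # 1: the replicas have converged
```

The window between the write and `replicate()` is the "soft state" the BASE acronym refers to: the system is available throughout, at the cost of momentarily inconsistent reads.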
Data warehouses and data marts store valid, cleaned data through the ETL (Extract-Transform-Load) process: preprocessed, highly summarized, and integrated (transformed) data are loaded into the data warehouse for further use [11]. Because of the heterogeneous sources of big data, the traditional transformation process would impose a huge computational burden. Therefore, big data systems first "load" all the data and then transform only the data required by the systems in the organization, so the process changes into Extract-Load-Transform (ELT). As a result, new ideas such as the "data lake" have emerged, which try to store all data generated by an organization and subsume the data warehouse and data mart, although critics warn that a data lake can degenerate into a "data swamp" [12].
technical effects. This socio-technical model suggests that all of these components (organizational structure, people, job tasks, and information technology (IT)) must change simultaneously to achieve the objectives of the target organization and its information systems [2]. Sometimes these changes can alter business goals, relationships with people, and business processes for the target organization, blur organizational boundaries, and flatten the organization [1] [2]. Big data transforms the traditional siloed information systems in organizations into digital nervous systems, with information flowing in and out of related organizational systems. Organizational resistance to change needs to be considered in every information system implementation. The most common reason for the failure of large projects is not the failure of the technology but organizational and political resistance to change [2]. Big data projects need to avoid these kinds of mistakes and be implemented not only from an information system perspective but also from an organizational perspective.
Based on this big data project life cycle, organizations can develop their own big data projects. The best way to implement big data projects is to use technologies from both before and after big data, e.g., both Hadoop and a data warehouse, because they complement each other. The US government considers "all content as data" when implementing big data projects. In the digital era, data has the power to change the world and needs careful implementation.
The processed data stream is consumed again, and this process is repeated until the operation is halted by the user. A spout performs as a source of streams in a topology, and a bolt consumes streams and produces new streams; both execute in parallel [15].
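The spout/bolt topology idea can be mimicked with ordinary Python generators (a toy analogue, not Storm's actual Java API; the sentences and the word-count task are invented for illustration): the spout emits a stream of tuples, and each bolt consumes one stream and emits another, so they chain into a pipeline.

```python
def sentence_spout():
    """Spout: the source of the stream."""
    for sentence in ["big data", "big ideas"]:
        yield sentence

def split_bolt(stream):
    """Bolt: consume sentences, emit words."""
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    """Bolt: consume words, emit the final running counts."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

counts = count_bolt(split_bolt(sentence_spout()))
print(counts)  # {'big': 2, 'data': 1, 'ideas': 1}
```

In Storm proper, each spout and bolt would run as many parallel tasks across a cluster and the topology would run indefinitely; the generators here only capture the dataflow shape.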
There are other real-time processing tools for big data, such as Yahoo's S4 (Simple Scalable Streaming System), which is based on a combination of the actor model and the MapReduce model. S4 works with Processing Elements (PEs) that consume keyed data events. Messages are transmitted between PEs in the form of data events. Each PE's state is inaccessible to other PEs; event emission and consumption are the only modes of interaction between PEs. Processing Nodes (PNs) are the logical hosts of PEs and are responsible for listening to events, operating on incoming events, dispatching events with the assistance of the communication layer, and emitting output events [16]. There is no specific winner among stream processing models, and organizations can use whichever data models are consistent with their work.
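The keyed-PE idea can be sketched as follows (a toy Python model, not S4's actual API; the class and event fields are invented): one PE instance owns the private state for one key, and a dispatcher, standing in for the Processing Node, routes each event to the PE for its key.

```python
class CounterPE:
    """One Processing Element per key; its state is private to it."""
    def __init__(self, key):
        self.key = key
        self.count = 0

    def on_event(self, event):
        # The only way state changes is by consuming an event.
        self.count += event["value"]

pes = {}  # stands in for a Processing Node hosting keyed PEs

def dispatch(event):
    """Route an event to the PE owning its key, creating it lazily."""
    key = event["key"]
    if key not in pes:
        pes[key] = CounterPE(key)
    pes[key].on_event(event)

for e in [{"key": "a", "value": 1}, {"key": "b", "value": 2},
          {"key": "a", "value": 3}]:
    dispatch(e)

totals = {k: pe.count for k, pe in pes.items()}
print(totals)  # {'a': 4, 'b': 2}
```

Because PEs never touch each other's state, the dispatcher could place them on different machines without any locking, which is the point of the actor-style design.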
Regardless of batch or real-time processing, there are many open source and proprietary software frameworks for big data. Open source big data frameworks include Hadoop, HPCC (High Performance Computing Cluster), and Apache Accumulo [7]. Proprietary big data tools such as IBM BigInsights and Microsoft Azure have been used successfully in many business areas of different organizations. Big data tools and libraries are now also available in languages such as Python and R for many different kinds of organizations.
4. Conclusion
Big data is a very wide, multi-disciplinary field that requires collaboration among different research areas and organizations. Big data may change the traditional ETL process into an Extract-Load-Transform (ELT) process, as big data gains more advantage by moving algorithms near where the data exist. Like that of other information systems, the success of big data projects depends on organizational resistance to change: organizational structure, people, tasks, and information technologies need to change simultaneously to get the desired results. Based on the layered view of big data [13], big data projects can be implemented with a step-by-step roadmap [4]. Big data sources will vary with the past, present, and future of organizations and information systems. Big data has the power to change the landscape of organizations and information systems because its nature differs from traditional paradigms. Using big data technologies can give organizations an overall advantage through better efficiency and effectiveness. The future of big data will be the digital nervous system of the organization, where every possible system needs to consider big data a must-have technology. The data age is coming now.
Acknowledgements
I want to express my gratitude to my supervisor, Professor Wang Zhao Shun, for his encouragement and suggestions for improving this paper.
References
[1] Manyika, J., et al. (2011) Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute, San Francisco, CA, USA.
[2] Laudon, K.C. and Laudon, J.P. (2012) Management Information Systems: Managing
the Digital Firm. 13th Edition, Pearson Education, US.
[3] White House (2012) Fact Sheet: Big Data across the Federal Government.
[4] Mousanif, H., Sabah, H., Douiji, Y. and Sayad, Y.O. (2014) From Big Data to Big Projects: A Step-by-Step Roadmap. International Conference on Future Internet of Things and Cloud, 373-378.
[5] Oracle Enterprise Architecture White Paper (March 2016) An Enterprise Archi-
tect’s Guide to Big Data: Reference Architecture Overview.
[6] Laney, D. (2001) 3D Data Management: Controlling Data Volume, Velocity and
Variety, Gartner Report.
[7] Sagiroglu, S. and Sinanc, D. (2013) Big Data: A Review. International Conference on
Collaboration Technologies and Systems (CTS), 42-47.
[8] de Roos, D., Zikopoulos, P.C., Melnyk, R.B., Brown, B. and Coss, R. (2012) Hadoop
for Dummies. John Wiley & Sons, Inc., Hoboken, New Jersey, US.
[9] Grolinger, K., Hayes, M., Higashino, W.A., L'Heureux, A., Allison, D.S. and Capretz, M.A.M. (2014) Challenges of MapReduce in Big Data. IEEE 10th World Congress on Services, 182-189.
[10] Hurwitz, J.S., Nugent, A., Halper, F. and Kaufman, M. (2012) Big Data for Dum-
mies, 1st Edition, John Wiley & Sons, Inc, Hoboken, New Jersey, US.
[11] Han, J., Kamber, M. and Pei, J. (2006) Data Mining: Concepts and Techniques. 3rd
Edition, Elsevier (Singapore).
[12] Data Lake. https://en.m.wikipedia.org/wiki/Data_lake
[13] Hu, H., Wen, Y.G., Chua, T.-S. and Li, X.L. (2014) Toward Scalable Systems for Big
Data Analytics: A Technology Tutorial. IEEE Access, 2, 652-687.
https://doi.org/10.1109/ACCESS.2014.2332453
[14] Dean, J. and Ghemawat, S. (2008) MapReduce: Simplified Data Processing on Large
Clusters. Commun ACM, 107-113. https://doi.org/10.1145/1327452.1327492
[15] Storm Project. http://storm.apache.org/releases/2.0.0-SNAPSHOT/Concepts.html
[16] Neumeyer, L., Robbins, B., Nair, A. and Kesari, A. (2010) S4: Distributed Stream
Computing Platform. 2010 IEEE International Conference on Data Mining Work-
shops (ICDMW). https://doi.org/10.1109/ICDMW.2010.172