2018, Sriyanong - A Text Preprocessing Framework for Text Mining on Big Data Infrastructure
Abstract—Many people use social media every day to comment on various events and to post their own activities. Business organizations and industries can use social media to improve their products through text mining techniques. However, a single machine cannot compute a large amount of data because of its resource limitations, such as CPU, main memory, and storage. Moreover, the data consists of structured and unstructured data that must be prepared before computation with text mining or machine learning. Therefore, text preprocessing becomes one of the most important tasks, because it includes many steps. In this paper, we design and develop an efficient text preprocessing framework on a big data infrastructure, which is proposed to support the text preprocessing task and to reduce the computation time. The framework consists of four main modules: a data collection module, a storage module, a data cleaning module, and a feature extraction module. For the performance evaluation, we collected sentiment data from the Facebook page "Tasty", which has more than 90 million members. The data was separated into three different sizes, and each dataset was tested on different cluster system environments. Moreover, we also compared the computation time between our proposed framework and a single machine. The results show that our framework can reduce the computation time significantly. Thus, this framework can be applied to solve the problem of complex text preprocessing in text mining areas.

Keywords-Apache Hadoop; Apache Spark; big data; text mining; text preprocessing; sentiment analysis

I. INTRODUCTION

Social media sites such as Facebook, Twitter, Instagram, and LinkedIn are big data sources. In particular, Facebook has 1.40 billion daily active users and generates 4 petabytes of new data per day. Moreover, there are more than 60 million business pages [1]. According to Twitter statistics, there are 330 million active users every month, and 500 million tweets are sent per day. 36% of companies tweet more than 5 times per day. If a company posts a product image on Twitter, it is liked and retweeted 89% and 150% more, respectively [2]. To develop their own products and create new customer groups, many business organizations require the sentiment, opinion, or review data of users from different social media sites. However, this data cannot be applied directly, because it is raw and unstructured or semi-structured. Moreover, the huge volume of the sentiment data affects the computation time on a single computer because of its limitations of storage, CPU, and main memory. Hence, business organizations require a new technology, such as Apache Hadoop [3] or Apache Spark [4], for sentiment analysis in order to classify the feeling of a comment (positive or negative).

Sentiment analysis is currently the most popular task in the text mining area. It can be divided into four sub-tasks [5], but the text preprocessing task is a crucial step [6]. Moreover, it requires the most computation time and more care than the other tasks, because it affects the accuracy of data mining or machine learning (ML) algorithms such as Naïve Bayes, Decision Tree, and Support Vector Machine if the sentiment data is not prepared appropriately. In general, the text preprocessing task consists of data cleaning, data normalization, data transformation, feature extraction and selection, etc. [7]-[8]. For sentiment analysis, it involves various Natural Language Processing (NLP) techniques that deal with information correction and information extraction [9]. In the big data age, the sentiment analysis task has become a problem and a challenge for researchers and business organizations. In particular, business organizations demand new techniques and technologies that reduce the computation time of preprocessing the big sentiment data commented on various social media sites.

Fortunately, in 2006 Doug Cutting and Mike Cafarella co-founded an open-source software framework called Apache Hadoop. It stores large data files by splitting them into blocks (128 MB by default). Each block is then distributed to the nodes of a cluster system, and the large amount of data is computed with the MapReduce programming model. In 2009, Apache Spark was developed by Matei Zaharia at UC Berkeley's AMPLab, and it became a top-level Apache project in February 2014. It is an open-source tool that computes a large amount of data in a distributed main-memory system. Moreover, Apache Spark provides various programming languages, including Java, Python, Scala, and R, for developing applications. Both Apache Hadoop and Spark also ensure fault-tolerant distributed data processing in the cluster system. Moreover, Apache Spark computes MapReduce programs faster than Apache Hadoop.

The rest of this paper is organized as follows. In Section II, we briefly survey papers regarding text preprocessing techniques for text mining analysis using sentiment data. Section III provides background on the big data technologies in detail. Next, Section IV presents the
Authorized licensed use limited to: Princess Sumaya University for Technology. Downloaded on April 10,2023 at 14:43:51 UTC from IEEE Xplore. Restrictions apply.
text preprocessing framework that utilizes the big data technologies and explains each step of the text preprocessing. Then, the comparison results of the performance evaluation are provided in Section V. Finally, the conclusion and future work are presented in Section VI.

II. RELATED WORKS

This section summarizes related works in the area of text preprocessing for sentiment analysis. A. Krouska et al. [6] studied and reported the effect of text preprocessing techniques on sentiment analysis using Twitter datasets. The experimental results showed that the feature selection step improves the sentiment analysis accuracy of each machine learning algorithm. Moreover, Z. Jianqiang [10] presented comparison results of different text preprocessing methods using five Twitter datasets. The aim of this study was to increase the classification accuracy on each dataset. G. Angiani et al. [11] compared various text preprocessing techniques and their impact on accuracy; data cleaning is the important step that requires more care than the other processes, but it takes a long time. E. Haddi et al. [12] showed the experimental results of the text preprocessing step for predicting sentiment information of online movie reviews using SVM; the data transformation and filtering methods were combined to achieve high accuracy. D. Torunoğlu et al. [13] analyzed how various text preprocessing steps (i.e., stemming, stop words filtering, and word weighting) affect the accuracy of Turkish text analysis. M. Mhatre et al. [14] combined text preprocessing techniques to analyze the dataset named "Bag of Words Meets Bags of Popcorn"; the combination of three techniques, consisting of slang handling, stop words removal, and lemmatization, achieved the highest accuracy.

A. Teixeira and R. M. S. Laureano [15] demonstrated and explained the data extraction and preparation of a Facebook dataset for sentiment analysis using open-source tools. This paper also introduced big data technologies that can be adopted for the text preprocessing process; however, it only presented the technologies. F. Chao and L. Baoan [16] studied and proposed the idea of a Chinese word segmentation tool based on the MapReduce programming model to reduce the time of the text preprocessing step. However, this paper should have compared the computation time of the text preprocessing step between the new and the traditional tool.

S. Vijayarani et al. [17] provided details of text mining and text preprocessing techniques for discovering knowledge in text content posted by people on social media. They discussed the main steps of text preprocessing, consisting of stop words removal, stemming, and the TF-IDF algorithm. Moreover, the stemming algorithms of each group were also discussed, along with the advantages and disadvantages of some steps. This paper showed that each step of the text preprocessing is very important for performing text mining or sentiment analysis.

As surveyed above, some papers proposed new methods to increase the classification accuracy of machine learning algorithms, and some papers only introduced big data technologies for sentiment analysis or presented improvements to some steps of the text preprocessing. Therefore, this paper investigates the big data technologies and proposes a text preprocessing framework using Apache Spark for text mining analysis. This framework can reduce the computation time of the text preprocessing step and supports the execution of a large amount of sentiment data.

III. BACKGROUND

As mentioned earlier, the large amount of data is currently becoming a problem in many research areas, so researchers require new technologies to manage and compute very huge data. In this section, Apache Hadoop, MapReduce, and Apache Spark are described in detail.

A. Apache Hadoop and MapReduce

Apache Hadoop [3] is a framework used for the distributed processing of large amounts of data. It can be installed on a single machine or on a computer cluster. The Hadoop Distributed File System (HDFS) is its core distributed storage, which supports computing big data on commodity or low-cost machines. The architecture of an Apache Hadoop cluster consists of a NameNode as the manager node and a number of DataNodes as the worker nodes. The NameNode manages the file system metadata and controls access to files by clients. The DataNodes manage the storage of the actual data files: each file is split into one or more blocks, and each block is replicated (3 replicas by default) and distributed onto different DataNodes in the cluster system.

MapReduce [18] is currently one of the most popular parallel programming models. It automatically distributes and computes large amounts of data in parallel on a cluster system environment. MapReduce comprises two important functions, called map and reduce. Initially, MapReduce splits the large dataset into pieces (128 MB per piece) and distributes them to be stored on each machine in the cluster system. After that, in the word count example, the map function generates a set of intermediate key/value pairs: it counts each word occurrence and assigns it a key1/value1 pair. The reduce function then generates the result by summing the values of the same words. Finally, a key2/value2 pair is emitted as the result for each word.

B. Apache Spark

Apache Spark [4] is an open-source tool that processes large amounts of data for analysis in main memory. It is widely used in academia, research, business organizations, and so on to process big, complex datasets that cannot be processed and analyzed using traditional technologies. The Apache Spark stack consists of Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX, and SparkR. The Resilient Distributed Dataset (RDD) is the core of Apache Spark. It was designed to support in-memory data storage distributed across a cluster in a fault-tolerant and efficient manner. To achieve data processing, it provides two types of operations: transformations (e.g., map, filter, flatMap, union) and actions (e.g., reduce, collect, count, take). The main components of a Spark cluster system for running a Spark application are as follows: a driver program, a cluster manager, a worker node, an executor, and a task. Apache Spark can access data from various sources such as HDFS, HBase, Hive, and so on. Moreover, a Spark application written and designed by a developer using the Spark libraries can run on several cluster modes, including standalone, Mesos, and Hadoop YARN.

IV. PROPOSED FRAMEWORK

Figure 1 shows the overview of the text preprocessing framework, which is designed based on Apache Hadoop and Apache Spark to compute a large amount of sentiment data. This framework effectively solves the problem of the computation time. The proposed framework consists of four main modules: a data collection module, a storage module, a data cleaning module, and a feature extraction module. We develop the application for all text preprocessing steps in Python 3.6. The main role of each module is as follows.

1764720153780626_1765800420339266|" PIN IT FOR LATER: http://bzfd.it/1tvA4z1 "
1764720153780626_133454937256083|" יאיר כהן אוכל אמיתי "
1764720153780626_544699815715298|" Looks awesome! "
1764720153780626_1809406662656380|" That's good "
1764720153780626_1610458909246220|" بجد جامده "
1764720153780626_1140193149374318|" 看的都饿了 哈哈哈 "
1764720153780626_956166207836824|" This very good "

Figure 3. Example comments dataset of the Facebook page "Tasty".

B. Storage Module

The storage module stores the large amount of sentiment data. The advantage of this module is that it can handle structured, semi-structured, and unstructured data. In our cluster system, we choose Hortonworks Data Platform 2.6 (HDP) [19], which contains many of the Hadoop ecosystem components, including Apache Spark.

C. Data Cleaning Module

The sentiment data contains many words and special language styles, as shown in the example in Figure 3. In this paper, we propose the text preprocessing steps for text mining analysis using sentiment data in the English language only. Thus, the other languages are removed from the sentiment data; moreover, hashtags (#) and URLs are also removed. The data cleaning module is implemented on Apache Spark. Figure 4 shows some examples of the sentiment data after cleaning.
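The cleaning logic described above can be sketched in plain Python as follows. This is an illustrative sketch, not the paper's actual code: the function names, the "keep only ASCII text" heuristic for detecting English, and the HDFS path in the comment are assumptions.

```python
import re

# Patterns for the two removals named in the text: URLs and hashtags.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")

def clean_comment(line):
    """Strip the 'id|' prefix and surrounding quotes (as in Figure 3),
    then remove URLs and hashtags from the comment text."""
    text = line.split("|", 1)[-1].strip().strip('"').strip()
    text = URL_RE.sub("", text)
    text = HASHTAG_RE.sub("", text)
    return text.strip()

def is_english(text):
    """Crude assumed heuristic: keep only non-empty, all-ASCII comments,
    so Hebrew, Arabic, Chinese, etc. comments are dropped."""
    return bool(text) and all(ord(ch) < 128 for ch in text)

# On the cluster, this logic would run as Spark transformations, e.g.:
#   rdd = sc.textFile("hdfs:///data/tasty_comments.txt")  # assumed path
#   cleaned = rdd.map(clean_comment).filter(is_english)
```

Applied to the Figure 3 rows, the sketch keeps "Looks awesome!", "That's good", and "This very good" while dropping the non-English comments and stripping the URL from the "PIN IT FOR LATER" comment.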
D. Feature Extraction Module

The feature extraction module prepares the cleaned sentiment data for data mining or machine learning algorithms, which generally require the most main memory space and computation time. In this paper, the feature extraction steps, which take advantage of Apache Spark, are proposed to reduce the attributes of the sentiment data to an appropriate form for classification with data mining or machine learning algorithms. Figure 5 shows all the steps of the feature extraction module. The feature extraction steps are as follows.

Sentence segmentation step: This step separates the words of a sentence into a sequence of words. For example, the sentence "This very good" is separated into "This", "very", and "good". In this step, our application is also developed to remove commas, spaces, and all symbols (e.g., !, #, $, and so on).

Word stemming step: This step removes the various suffixes to identify the root or stem of each word in the sentiment data. The stemming step reduces the computation time and the main memory usage of the subsequent steps. This paper uses a suffix-stripping algorithm to remove the suffix of each word and find its root form. For example, suffixes such as "ed", "ly", or "ing" are removed.

Part-of-speech (POS) tagging step: The POS tagging step assigns a part-of-speech tag to each word in the sentiment data. This step is important and useful for knowing whether a word is a noun, verb, adjective, or adverb in the context of a sentence, phrase, or paragraph. Generally, POS tags are appended to each word and make it easy to select words to remove from the sentiment data. In this paper, we utilize the Hidden Markov Model (HMM) for POS tagging. Example POS tags are "JJ" (adjective), "RB" (adverb), "DT" (determiner), and so on. For example, the POS tagging in this paper emits results such as "This=IN", "very=RB", "good=JJ".

Stop words removal step: The sentiment data of the users usually includes the most common words of a language. The stop words removal step removes the words that cause problems for classification with machine learning algorithms. This step also reduces the number of features, which matters directly if the machines in the cluster system have insufficient main memory. The most common words removed from the sentiment data are articles (i.e., a, an, and the), prepositions (e.g., in, on, about, and so on), pronouns (e.g., he, she, it, you, and so on), and so on. These words are not important for sentiment analysis.

Term weighting step: Term weighting is widely used in text mining to calculate statistics of word frequency and to represent the importance of a word in a document. TF-IDF is the product of the TF (term frequency) and IDF (inverse document frequency) statistics. TF(t, d) counts how often the term t appears in the document d, and IDF(t, D) is based on the number of documents in the collection D that contain the term t. There are several methods to calculate the TF and IDF weights. In our experiments, we use the raw count method for the TF weighting and the inverse document frequency method for the IDF weighting. TF-IDF can be formulated as follows [20]:

TFIDF(t, d, D) = TF(t, d) · IDF(t, D)    (1)

V. EXPERIMENTAL AND COMPARISON RESULTS

This section explains and discusses the experimental and comparison results between our framework and the traditional method. Moreover, we also compare the computation time of the text preprocessing step on different cluster system environments. Each sentiment dataset is executed 5 times to calculate the average computation time.

A. Experimental Setup

The cluster system used to evaluate the performance of the text preprocessing steps based on Apache Spark consists of a master node and four slave nodes. These nodes have an Intel® Xeon® E3-1220 v3 3.10 GHz CPU, 16 GB of DDR3 RAM, a 1 TB SATA HDD, and an Intel® I210 Gigabit network adapter. Each node of the cluster system is connected through Gigabit Ethernet with a 3Com Baseline Switch 2928 and runs CentOS Linux 7.1. The cluster system was installed with Hortonworks Data Platform (HDP 2.6.3) [19] on Java JDK 1.8.0_112. We chose HDP because it is 100% open source and contains many Apache Hadoop ecosystem components. Each text preprocessing step of the framework was developed in Python 3.6 and runs on Apache Spark 2.1.

B. Comparison Results

To evaluate the computation time of the text preprocessing, we separated the raw sentiment dataset into three different datasets. Table I shows the number of records, the number of words, the average number of words per line, and the file size of each sentiment dataset.

TABLE I. DATASET SUMMARY

Dataset Name | # of records | # of Words | Avg. # of words per line | File size
dataset_1    | 2,556,603    | 18,815,547 | 7.359                    | 176 MB
dataset_2    | 4,559,717    | 32,780,415 | 7.189                    | 309 MB
dataset_3    | 7,170,220    | 51,277,922 | 7.151                    | 485 MB

TABLE II. THE RESULTS OF AVERAGE COMPUTATION TIME

Method             | dataset_1 (seconds) | dataset_2 (seconds) | dataset_3 (seconds)
Traditional method | 1899.862            | 3234.230            | 4964.510
1 node             | 33.208              | 57.423              | 90.767
2 nodes            | 32.725              | 57.162              | 90.343
3 nodes            | 32.408              | 56.848              | 89.862
4 nodes            | 32.012              | 56.169              | 88.370

Table II shows the comparison results of the average computation time of the text preprocessing between the traditional method and the cluster system environments. The algorithm of the traditional method was implemented in Python 3.6 and computed the sentiment data on a single machine. The traditional method took 1899.862, 3234.230, and 4964.510 seconds for dataset_1, dataset_2, and dataset_3, respectively. Clearly, when all three datasets were computed on 1 node, the computation took only 33.208, 57.423, and 90.767 seconds. The average computation time on 2 nodes for the respective datasets was 32.725, 57.162, and 90.343 seconds. In the case of computing all the datasets with 4 nodes, the average computation time was 32.012, 56.169, and 88.370 seconds, respectively.

The experimental results of the average computation time based on Apache Spark are shown in Figure 6. We can see that the average computation times clearly differ when the three datasets are computed on the same cluster system, whereas computing each sentiment dataset on different numbers of nodes changes the average computation time only slightly.

Figure 6. The comparison results of the average computation time of the text preprocessing step for different numbers of nodes.

VI. CONCLUSION AND FUTURE WORK

Text preprocessing is a most important task, and one of the most popular tools is Apache Spark, which computes large amounts of data in main memory faster than Apache Hadoop. This paper proposed a text preprocessing framework that can perform the execution of a large amount of sentiment data. The framework applied Apache Hadoop to store the large amount of sentiment data collected from the Facebook page named "Tasty". However, the sentiment data is unstructured and consists of many languages and symbols. Therefore, it needs to be cleaned into better sentiment data with the cleaning module, which is efficiently implemented on Apache Spark. Finally, the large amount of sentiment data was processed with the feature extraction module. We evaluated the computation time of the text preprocessing on different cluster system environments and compared it against the traditional method. The results show that our framework performs better than the traditional method.

In future work, we plan a performance evaluation of sentiment classification with machine learning algorithms based on Apache Spark. Moreover, we will also extend this paper to compare the accuracy measurement of feature selection. Our framework can also be improved and designed for use with other big data technologies.

REFERENCES

[1] Facebook statistics. Available at https://www.brandwatch.com/blog/47-facebook-statistics/
[2] Twitter statistics. Available at https://www.brandwatch.com/blog/44-twitter-stats/
[3] Apache Hadoop. Available at http://hadoop.apache.org/
[4] Apache Spark. Available at https://spark.apache.org/
[5] S. Bhuta and U. Doshi, "A review of techniques for sentiment analysis of Twitter data," Proc. 2014 International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT), Feb. 2014, pp. 583-591, doi:10.1109/ICICICT.2014.6781346.
[6] A. Krouska, C. Troussas and M. Virvou, "The effect of preprocessing techniques on Twitter sentiment analysis," Proc. 7th International Conference on Information, Intelligence, Systems & Applications (IISA), Jul. 2016, pp. 1-5, doi:10.1109/IISA.2016.7785373.
[7] S. Kotsiantis, D. Kanellopoulos and P. Pintelas, "Data preprocessing for supervised learning," International Journal of Computer Science, vol. 1, Dec. 2007, pp. 4091-4096.
[8] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez and F. Herrera, "Big data preprocessing: methods and prospects," Big Data Analytics, vol. 1, Dec. 2016, pp. 1-22, doi:10.1186/s41044-016-0014-0.
[9] K. Ravi and V. Ravi, "A survey on opinion mining and sentiment analysis: tasks, approaches and applications," Knowledge-Based Systems, vol. 89, Nov. 2015, pp. 14-46, doi:10.1016/j.knosys.2015.06.015.
[10] Z. Jianqiang, "Pre-processing boosting Twitter sentiment analysis?," Proc. IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity), Dec. 2015, pp. 748-753, doi:10.1109/SmartCity.2015.158.
[11] G. Angiani et al., "A comparison between preprocessing techniques for sentiment analysis in Twitter," Proc. 2nd International Workshop on Knowledge Discovery on the WEB (KDWEB 2016), vol. 1748, Sep. 2016.
[12] E. Haddi, X. Liu and Y. Shi, "The role of text pre-processing in sentiment analysis," Procedia Computer Science, vol. 17, May 2013, pp. 26-32, doi:10.1016/j.procs.2013.05.005.
[13] D. Torunoğlu, E. Çakirman, M. C. Ganiz, S. Akyokuş and M. Z. Gürbüz, "Analysis of preprocessing methods on classification of Turkish texts," Proc. International Symposium on Innovations in Intelligent Systems and Applications, Jun. 2011, pp. 112-117, doi:10.1109/INISTA.2011.5946084.
[14] M. Mhatre, D. Phondekar, P. Kadam, A. Chawathe and K. Ghag, "Dimensionality reduction for sentiment analysis using pre-processing techniques," Proc. International Conference on Computing Methodologies and Communication (ICCMC), Jul. 2017, pp. 16-21, doi:10.1109/ICCMC.2017.8282676.
[15] A. Teixeira and R. M. S. Laureano, "Data extraction and preparation to perform a sentiment analysis using open source tools: The example of a Facebook fashion brand page," Proc. 12th Iberian Conference on Information Systems and Technologies (CISTI), Jun. 2017, pp. 1-6, doi:10.23919/CISTI.2017.7975879.
[16] F. Chao and L. Baoan, "Chinese words segmentation based on double hash dictionary running on Hadoop," Proc. International Conference on Electromechanical Control Technology and Transportation, Nov. 2015, pp. 461-465, doi:10.2991/icectt-15.2015.88.
[17] S. Vijayarani, J. Ilamathi and Nithya, "Preprocessing techniques for text mining - an overview," International Journal of Computer Science & Communication Networks, vol. 5, 2015, pp. 7-16.
[18] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, Jan. 2008, pp. 107-113, doi:10.1145/1327452.1327492.
[19] Hortonworks Data Platform. Available at https://hortonworks.com/
[20] TF-IDF. Available at https://en.wikipedia.org/wiki/Tf-idf