Sentiment Analysis of Code-Mixed Text: A Review

Article in Turkish Journal of Computer and Mathematics Education (TURCOMAT) · April 2021
DOI: 10.17762/turcomat.v12i3.1239




Turkish Journal of Computer and Mathematics Education Vol.12 No.3 (2021), 2469-2478
Research Article


Nurul Husna Mahadzir1, Mohd Faizal Omar2*, Mohd Nasrun Mohd Nawi3, Anas A. Salameh4, Kasmaruddin Che Hussin5
1 Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA Kedah, 08400 Merbok, Kedah, Malaysia.
2 School of Quantitative Sciences, Universiti Utara Malaysia, 06010 Kedah, Malaysia.
3 School of Technology Management and Logistics, Universiti Utara Malaysia, 06010 Kedah, Malaysia.
4 College of Business Administration, Prince Sattam bin Abdulaziz University, 165 Al-Kharj 11942, Saudi Arabia.
5 Faculty of Entrepreneurship and Business, Universiti Malaysia Kelantan, Malaysia
Corresponding Author’s email: [email protected]*

Article History: Received: 10 November 2020; Revised: 12 January 2021; Accepted: 27 January 2021;
Published online: 05 April 2021
Abstract: In recent times, sentiment analysis has become one of the most active and increasingly popular research areas in
information retrieval and text mining. To date, sentiment analysis has been applied in various domains such as product,
movie, sport and political reviews. Most previous work in this field has focused on analyzing a single language,
especially English. However, with globalization and the growing number of Internet users worldwide, it is
common to see posts written in multiple languages. Moreover, in unstructured content such as Twitter posts, people tend to
mix languages in one sentence, which makes sentiment analysis even more challenging. This paper
reviews the state of the art of sentiment analysis for code-mixed text, including a detailed discussion of each focus area,
a qualitative comparison and the limitations of current approaches. It also highlights challenges along this line of research
and suggests several recommendations for future work.
Keywords: Sentiment Analysis, code-mixed, language

1. Introduction

Sentiment Analysis (SA) has been one of the most active research areas in Natural Language
Processing (NLP) since the early 2000s (B. Liu, 2012). Its aim is to automatically detect the emotions or opinions
conveyed by a speaker or writer from subjective information, especially information shared on the Web (B. Liu,
2012; B. Pang et al., 2008; E. Cambria et al., 2013). The importance of the field is reflected in the large
number of approaches and techniques proposed in research, as well as in the interest it has attracted from
organizations and companies in recent years. To date, SA has been applied to a wide variety of topics, such
as online product reviews (e.g., movies, mobile phones) (G. Di Fabbrizio et al., 2013), hotel reviews (M. Vela,
2011), and political and financial analysis (K. Ahmad et al., 2006; Y.E. Soelistio et al., 2015).

Most previous work in this field has focused on analyzing a single language, especially English.
However, with globalization and the growing number of Internet users worldwide, it is common to
see posts written in multiple languages, which makes SA more challenging.
Furthermore, in unstructured content such as Twitter posts, people tend to mix languages in one sentence. The
practice of using more than one language in a single sentence has emerged, yet such mixed language has rarely
been a subject of SA. A different approach or technique is needed to handle this kind of
data, because information expressed in another language may be missed if the analysis covers only a single language
(K. Dashtipour et al., 2016).

Code-mixed language arises in daily conversation because some multilingual
speakers feel more comfortable conveying messages in their native language than in English.

The next section discusses research findings in SA for code-mixed text, followed by a
qualitative comparison. Open issues and challenges are then highlighted, and the conclusion and future
work are presented in the last section.

2. Research Findings

Sentiment Analysis


There are two main approaches in SA: subjectivity analysis and sentiment classification. Subjectivity
analysis deals with detecting opinions or sentiments, while sentiment classification focuses on classifying
those opinions by polarity or ranking (A. Montoyo et al., 2012). Some researchers classify
text as positive, negative or neutral (A. Pak et al., 2010), while others use finer levels of
granularity such as highly positive, positive, neutral, negative or highly negative (S. Bhattacharjee et al., 2015).

Figure 1. SA approach (W. Medhat et al., 2014)

The approaches most broadly used to address the challenges of SA are the machine learning
and lexicon-based approaches, as depicted in Figure 1. The machine learning approach uses supervised,
semi-supervised or unsupervised learning to construct a model from a large training corpus (K. Ravi et al., 2015).
Within this approach, supervised learning techniques have been the most widely used in SA tasks. This
method uses a training corpus to learn a classifier function. The effectiveness of SA systems
that use supervised algorithms depends on combining appropriate algorithms with a set of
appropriate features. Commonly applied supervised sentiment classifiers include the Naïve
Bayes (NB) classifier, Support Vector Machine (SVM) and Maximum Entropy (ME), which classify data into
positive or negative categories (B. Liu, 2010).
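
As a rough illustration of the supervised route, the sketch below trains a multinomial Naïve Bayes classifier on unigram counts with add-one smoothing. It is a minimal example under stated assumptions: the function names `train_nb` and `predict_nb` and the six training sentences are invented for illustration and are not taken from any of the reviewed systems.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples):
    """Fit a multinomial Naive Bayes model on (text, label) pairs using unigram counts."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for token in text.lower().split():
            word_counts[label][token] += 1
            vocab.add(token)
    return label_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in text.lower().split():
            if token in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data (invented for illustration only).
TRAIN = [
    ("buku ni brilliant", "positive"),
    ("tahniah champion", "positive"),
    ("best gila movie ni", "positive"),
    ("jammed teruk stuck", "negative"),
    ("service teruk slow", "negative"),
    ("movie ni boring", "negative"),
]

model = train_nb(TRAIN)
print(predict_nb(model, "buku ni best"))
print(predict_nb(model, "jammed teruk"))
```

Because the features are plain unigrams, the classifier treats Malay and English tokens alike; whether that is adequate for real code-mixed data depends on the corpus and is not claimed here.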

On the other hand, the lexicon-based approach requires human annotation to manually construct a lexicon,
and it is divided into dictionary-based and corpus-based approaches. The dictionary-based approach depends
entirely on available resources such as WordNet to find opinion seed words, while the corpus-based approach
not only obtains opinion seed words but also finds other opinion words using a large domain-specific
corpus (R. Feldman, 2013). Specifically, this approach relies largely on lexical resources containing
words and their associated sentiment (sentiment lexicons) to perform the classification. Among the most
well-known sentiment lexicons are SentiWordNet (K. Denecke, 2008; A. Kumar, 2014), WordNet (G. A.
Miller, 1995) and SenticNet (E. Cambria et al., 2014).
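
The dictionary-based idea of growing a lexicon from opinion seed words can be sketched as a breadth-first walk over synonym links. This is only an illustration: the `SYNONYMS` table is a hypothetical stand-in for a resource such as WordNet, and the seed words, the `expand_lexicon` name and the `max_depth` parameter are assumptions made for the example.

```python
from collections import deque

# Hypothetical synonym links standing in for a lexical resource such as WordNet.
SYNONYMS = {
    "good": ["great", "fine"],
    "great": ["excellent"],
    "bad": ["poor", "awful"],
    "poor": ["terrible"],
}

# Hand-picked opinion seed words with known polarity (assumed for the example).
SEEDS = {"good": "positive", "bad": "negative"}


def expand_lexicon(seeds, synonyms, max_depth=2):
    """Breadth-first propagation of seed polarity along synonym links."""
    lexicon = dict(seeds)
    queue = deque((word, 0) for word in seeds)
    while queue:
        word, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the allowed number of hops
        for neighbour in synonyms.get(word, []):
            if neighbour not in lexicon:  # the first label assigned wins
                lexicon[neighbour] = lexicon[word]
                queue.append((neighbour, depth + 1))
    return lexicon


lexicon = expand_lexicon(SEEDS, SYNONYMS)
print(lexicon)
```

Limiting the number of hops is one simple way to reduce drift, since polarity becomes less reliable the further a word is from its seed.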

A large body of previous research has addressed sentiment expressed in English. Although
English remains the main language studied in this area, there are also efforts in
other languages such as Japanese (H.T.T.F. Tadashi Kumano Hideki Kashioka, 2003; A. Danielewicz-Betz et
al., 2015), Chinese (H.Y. Lee et al., 2011) and Malay (A. Alsaffar et al., 2014). SA for a language usually
relies on manually or semi-automatically constructed lexicons derived from dictionaries or corpora (G. Dehong,
2014; A.A. Ríos et al., 2014; A.B. Muhammad, 2016). The availability of these resources enables the
creation of rule-based SA or the construction of training data for classification purposes (D. Sitaram et
al., 2015).

Code-Mixed

Several terms are used interchangeably in the literature to refer to this concept, including mixed
language, code-mixing, and code-switching. All these terms refer to the use of more than one language in the
same conversational event, either in speaking or in writing (R. Bhargava et al., 2016; J.J. Gumperz, 1982).
Throughout this paper, the term ‘code-mixed’ will be used to refer to this phenomenon.

Code-mixing arises because some multilingual speakers or writers feel more comfortable
conveying information in their native language than in English. Code-mixed language, whether spoken or
written, is common in multilingual societies such as Malaysia and Singapore. It is
usually found in social media content such as Facebook, Twitter and forums. In Malaysia, social
media users tend to mix Malay and English, a blend known as ‘Bahasa Rojak’, in their informal communication
(K. Chuah, 2013). Below are examples of code-mixed posts on Twitter that contain both Malay and
English text:

Example 1: buku ni brilliant…everyone should read!!
Example 2: tahniah Azizul…the Keirin World Champion!
Example 3: jammed teruk from Tapah to Ipoh, dah 2jam stuck kat sini…

The statements in the examples above are a mixture of two languages, Malay and English. Words in italic
belong to the English language while the rest belong to the Malay language. Among the issues associated with
code-mixed text are grammatical differences and improper switching of languages within one sentence,
which introduce new challenges in the field of NLP. Therefore, different approaches and techniques are needed
to reach a performance level comparable to what has been achieved for a single language such as
English.

Sentiment Analysis of Code-Mixed

Although a great deal of work has focused on analysing monolingual and multilingual data,
some recent studies have analysed code-mixed content as well. A thorough literature search
based on title, abstract and introduction was conducted through several scholarly
search engines and online databases, namely Scopus, Google Scholar, Springer, ACM, and IEEE.
The keywords used to find the articles were ‘sentiment analysis mixed language’, ‘sentiment analysis
code-mixed’ and ‘sentiment analysis code-switching’. Articles in conference proceedings as well as refereed journals
that included these terms were considered. As a result, seventy papers published from 2008 were
scanned, and forty papers related to SA of code-mixed content were identified and included
in the analysis. Most efforts concentrate on five focus areas or specific tasks:
i) pre-processing, ii) language identification, iii) lexicon creation, iv) sentiment classification
and v) subjectivity analysis. Although this review is organized by focus area, some
previous works address more than one focus area. Table 1 summarizes the publications related to SA for
code-mixed text, their language pairs, and their research focus. Each research
focus is reviewed in detail in the following sections.

Table 1. Research focus areas


Publication | Language Pairs | Focus areas (one √ per area; columns: Pre-processing, Language Identification, Lexicon Creation, Sentiment Classification, Subjectivity Analysis)
[31] | Maltese – English | √
[32] | English – Spanish | √
[33] | English – Bengali | √
[34] | Urdu – English | √ √
[35] | Mandarin – English | √
[36] | Chinese – English | √
[37] | Malay – English | √
[38] | English – 30 non-English language | √
[39] | English – Arabic | √ √
[40] | Chinese – English | √ √ √
[41] | English – Spanish | √
[42] | Chinese – English | √
[43] | English – Hindi | √
[44] | English – Hindi – Bengali | √
[45] | English – Hindi | √
[46] | English – Hindi | √ √
[47] | English – Hindi | √ √ √
[48] | English – German | √
[27] | English – Hindi | √
[49] | English – Hindi | √ √
[50] | English – Hindi | √
[51] | English – Spanish | √
[52] | Chinese – English | √
[53] | English – Bangla | √
[54] | Chinese – English | √
[55] | English – Hindi | √
[56] | English – Bangla | √
[28] | English – Hindi | √ √
[57] | Singaporean English | √
[58] | English – Spanish | √
[59] | English – Hindi | √
[60] | English – Hindi | √
[61] | English – Manipuri | √
[62] | English – Bengali | √ √
[63] | Hindi – English – Bengali – Gujarati | √
[64] | English – Spanish | √
[65] | English – Portuguese | √
[66] | English – Chinese | √

Pre-processing

Early work on this subject focused on the pre-processing or normalization task, which involves
activities such as identifying noisy text, correcting spelling and removing stop words (N. Samsudin et
al., 2013; Y. Vyas et al., 2014). Dutta et al. (2015) studied the normalization of mixed English and Bangla text,
focusing on spelling correction using a noisy channel model. Zhang, Chen & Huang (2014)
introduced word translation and word categorization methods to normalize Chinese and English
texts. For word translation, a neural network language model was used to translate in-vocabulary English words to
Chinese, while for out-of-vocabulary words, a graph-based unsupervised model was applied to categorize them.
Sitaram et al. (2015) focused on normalizing four elements of code-mixed text in Indian social media, namely phonetic
typing, abbreviations, wordplay and slang words, and achieved an accuracy of 85% with their model.
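
To make the normalization step concrete, the sketch below lower-cases a post, strips punctuation at token edges, collapses elongated characters and expands short forms through a lookup table. It is a deliberately simple, lookup-based normalizer and not the noisy channel or neural models of the cited works; the table entries and the function names `collapse_elongation` and `normalize` are assumptions made for illustration.

```python
import re

# Illustrative slang/abbreviation table (assumed entries, not from the reviewed papers).
NORMALIZATION_TABLE = {
    "ni": "ini",
    "dah": "sudah",
    "kat": "di",
    "u": "you",
    "gr8": "great",
}


def collapse_elongation(token):
    """Reduce runs of three or more identical characters to one ('teruuuk' -> 'teruk')."""
    return re.sub(r"(.)\1{2,}", r"\1", token)


def normalize(text):
    """Lower-case, strip punctuation at token edges, collapse elongation, expand short forms."""
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(".,!?…")
        if not token:
            continue
        token = collapse_elongation(token)
        tokens.append(NORMALIZATION_TABLE.get(token, token))
    return " ".join(tokens)


print(normalize("jammed teruuuk from Tapah to Ipoh, dah 2jam stuck kat sini…"))
```

A table-driven design keeps the rules easy to inspect, at the cost of missing any variant that has not been listed.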

Language Identification

Identifying the language of each word is considered one of the most significant tasks for code-mixed content.
The majority of previous works have used a word-level approach to identify languages (U. Barman et al.,
2014; A. Das et al., 2014; S. Dutta et al., 2015; P. Lamabam et al., 2016). In other work,
Sharma et al. (2015) proposed a nearest-neighbor approach to deal with ambiguous words during the
language identification phase on mixed English – Hindi text, and applied a lexicon-based approach
to judge the sentiment of a statement.

King & Abney (2013) proposed a weakly supervised method for word-level language identification in
multilingual documents, while Barman et al. (2014) used a hybrid approach to identify
three languages: Bengali, English, and Hindi. The approach was able to classify ambiguous words because
it takes contextual clues into consideration during classification. Rudra et al. (2016) reported the language
preference in tweets written in Hindi and English by Indian users, and argued that Hindi is
preferred when expressing a negative opinion and swearing. Nguyen & Dogruoz (2013) performed language
identification on randomly selected posts in Turkish and Dutch from an online chat forum, using
manual annotation to label the data.
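
A word-level identifier of the simplest kind can be sketched as a dictionary lookup per token. The example below is an assumption-laden illustration rather than any cited system: the two word lists are tiny hand-made samples, the tags `ms` and `en` are chosen here for Malay and English, and `tag_languages` is a hypothetical helper name.

```python
# Tiny illustrative word lists (assumed; a real system would use full dictionaries).
MALAY_WORDS = {"buku", "ni", "tahniah", "teruk", "dah", "kat", "sini", "fail"}
ENGLISH_WORDS = {"brilliant", "everyone", "should", "read", "jammed", "from",
                 "to", "stuck", "fail"}


def tag_languages(tokens):
    """Label each token 'ms', 'en', 'ambiguous' (in both lists) or 'unknown'."""
    tags = []
    for token in tokens:
        word = token.lower()
        in_ms = word in MALAY_WORDS
        in_en = word in ENGLISH_WORDS
        if in_ms and in_en:
            tags.append("ambiguous")
        elif in_ms:
            tags.append("ms")
        elif in_en:
            tags.append("en")
        else:
            tags.append("unknown")
    return tags


tokens = "buku ni brilliant everyone should read".split()
print(list(zip(tokens, tag_languages(tokens))))
```

Words found in both lists are flagged rather than guessed, which leaves room for a context-aware step to resolve them.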

Lexicon Creation

Another line of related research concerns the construction of sentiment lexicons. Lee & Wang (2015)
annotated and analyzed an English – Chinese lexicon with five basic emotions: happiness, sadness, fear, anger,
and surprise. They used a Multiple Classifier System (MCS) for the classification. In the same line of study, Li,
Yu & Fung (2012) developed a Mandarin – English sentiment lexicon based on code-switching speech and text
data, which includes both intra-sentential and inter-sentential code-switching. For the text data, an
algorithm was developed to automatically download code-switching data from Chinese-language news.
Vilares et al. (2017) proposed an English – Spanish corpus of tweets with code-switching. Each
tweet was annotated according to the SentiStrength criteria, and a trinary scale (positive,
neutral and negative) was applied to classify the polarity (M. Thelwall et al., 2010).

Sentiment Classification / Polarity Detection

One of the major tasks in any SA activity is sentiment classification, or polarity detection. As mentioned
previously, there are two main approaches to classifying sentiment: the machine learning and the
lexicon-based approach. In the code-mixed setting, various methods within both approaches have been applied
to judge sentiment.

Machine Learning

Various machine learning techniques such as Naive Bayes (NB), Support Vector Machines (SVM) and
Maximum Entropy (MaxEnt) have been applied to classify sentiment. Narr et al. (2011) reported
71.5% accuracy on code-mixed text using an NB classifier on unigrams. Mukund & Srihari addressed issues and
challenges related to Urdish blog data, which consist of Urdu mixed with English, and proposed a statistical
Part-of-Speech (PoS) tagger and Structural Correspondence Learning (SCL) for the classification task. Sitaram et al.
trained a classifier directly on mixed English – Hindi data rather than translating it into a single
language. The technique was able to learn the grammatical transitions of both languages. Vilares et al.
applied various machine learning approaches to classify polarity in three different environments
(monolingual, multilingual and code-switching corpora). Raghavi
et al. trained a basic Support Vector Machine (SVM) based question classification system for English – Hindi data.
All the data were translated into English before feature selection and classification were performed. In
contrast, Yan et al. proposed a bilingual approach to process review comments written in Chinese and English.
Their models can analyse sentiment without translation and process two different languages
simultaneously. Wang et al. predicted emotion using a joint factor graph model that considers both
bilingual and emotional information, and their models significantly outperformed the baseline model
(p-value less than 0.01).

Lexicon-based

Lo et al. constructed a toolkit to analyse polarity for Singlish (Singaporean English) using a
semi-supervised approach. Unlike previous research, which relied on English knowledge bases such as
SentiWordNet and WordNet, Lo et al. used SenticNet, which includes 30,000 common-sense concepts, together
with negation and adversative term handling, as the core resource for polarity detection. Gajakosh et al. applied
fuzzy sets to classify hotel reviews into five polarity categories.
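
Negation and adversative handling in a lexicon-based scorer can be sketched as follows. This is a minimal illustration under stated assumptions, not the toolkit described above: the polarity values, the cue word sets and the names `score_sentence` and `polarity_label` are invented for the example and are not taken from SenticNet.

```python
# Illustrative polarity lexicon and cue words (assumed values, not taken from SenticNet).
POLARITY = {"brilliant": 1.0, "best": 1.0, "good": 0.5,
            "teruk": -1.0, "boring": -0.5, "slow": -0.5}
NEGATIONS = {"not", "tak", "tidak"}
ADVERSATIVES = {"but", "tapi"}


def score_sentence(text):
    """Sum word polarities, flipping the sign after a negation and scoring
    only the clause that follows the last adversative term."""
    tokens = text.lower().split()
    # Adversative handling: keep only the clause after the last 'but'/'tapi'.
    for i in range(len(tokens) - 1, -1, -1):
        if tokens[i] in ADVERSATIVES:
            tokens = tokens[i + 1:]
            break
    total, negate = 0.0, False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
            continue
        value = POLARITY.get(token, 0.0)
        if value:
            total += -value if negate else value
            negate = False  # a negation applies to the next sentiment word only
    return total


def polarity_label(text):
    """Map the numeric score to a positive, negative or neutral label."""
    score = score_sentence(text)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


print(polarity_label("movie ni tak boring"))
print(polarity_label("service slow tapi food best"))
```

Keeping only the clause after the adversative is a blunt heuristic; a weighting scheme would be a gentler alternative, but it is outside the scope of this sketch.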

Subjectivity Analysis

One effort in subjectivity analysis is found in [33], which generated a Bengali
subjectivity lexicon and proposed a Conditional Random Field (CRF) based approach as a subjectivity classifier
for mixed English and Bengali text. Abdul-Mageed et al. (2011) developed a system known as
SAMAR (Subjectivity and Sentiment Analysis for Arabic Social Media), one focus of which is to analyse
subjective information in various Arabic dialects.

Qualitative Comparison

Table 2 summarizes the techniques, datasets and limitations of previous research in the area of
this study.

Table 2. Comparisons of various techniques used, datasets and limitations


References | Focus Area | Technique | Domain/Dataset | Limitations
[42] | Normalization | Noisy channel approach, neural network; Classification: graph-based unsupervised method | Sina Weibo: 210 million posts | Lack of contextual information
[57] | Lexicon Creation, Sentiment Classification | Semi-supervised approach: SVM | Twitter | Concept ambiguity was removed manually
[49] | Normalization, Language Identification | Classification: lexicon-based approach, Hindi SentiWordNet, WordNet | Data from Forum for IR Evaluation 2013, 2014; 500 posts from Facebook and Youtube | Not able to deal with ambiguous words correctly
[27] | Sentiment Classification | Sentiment combination rules: recursive neural tensor network | Facebook and Google Plus: reviews on Virat Kohli; training data: 345; testing data: 97 | Does not cover ambiguous sentiment
[43] | Language Identification | Unsupervised dictionary-based approach, supervised classification (context-based SVM), Conditional Random Fields | Facebook group of Indian Univ students (2335 posts, 9813 comments) | The classifier without contextual clues did not perform well for the Hindi language
[38] | Language Identification | Weakly supervised method, n-grams, CRF model | 643 languages from four monolingual samples | Not able to handle named entities; incorrectly classifies shared words
[51] | Sentiment Classification | Supervised model based on bag-of-words | Twitter (3062 tweets) | The accuracy obtained on the code-switching corpus is lower than on the monolingual corpus
[28] | Language Identification, Sentiment Classification | L.I: Machine Learning; S.C: SentiWordNet; Final score: statistical technique | Training set: banglalyrics.net, Tamil lyrics, Indic websites | Ambiguous words manually removed
[61] | Language Identification | Trigram based; CRF model -> performs better | Twitter (700 code-mixed tweets), Facebook (300 code-mixed posts) | Data only tokenized; no pre-processing task has been adopted
[45] | Language Identification | Supervised: n-grams, dictionary-based, SVM | Facebook posts | Errors found in language boundary detection; evaluation on testing data does not perform well
[59] | Pre-processing, Sentiment Classification | CRF; supervised approach | Shared task by 12th Conf on NLP | Cannot disambiguate similar tags

3. Open Issues and Challenges

The studies conducted on SA of code-mixed language raise several issues. This paper highlights
three of them: language pairs, ambiguous words and subjectivity analysis.

Language Pairs

As shown in Figure 2, a large volume of published studies concentrates on English mixed
with Hindi (D. Sitaram et al., 2015; Y. Vyas et al., 2014; S. Sharma et al., 2015; A. Jamatia et
al., 2015) or Chinese (J. Zhao et al., 2012; Q. Zhang et al., 2014; S. Lee et al., 2015). Only a
few studies have been carried out on other language pairs, such as English with German, Portuguese or Malay. Because
code-mixing is very common in many multicultural and multilingual countries, it is important to
conduct more research on a wider range of languages.

[Figure 2: chart of language pairs (English – Spanish, English – Hindi, English – Arabic, English – Chinese, English – Malay, English – Portuguese, English – German); values shown: 40.54, 27.03, 13.51, 8.11, 5.41, 2.70, 2.70]

Figure 2. Language pairs.

Ambiguous Words

Another common challenge in SA of code-mixed text is ambiguity, which can arise
in several situations. First, a word may be spelled the same way in multiple languages but
carry different meanings (S. Sharma et al., 2015; S.K. Singh et al., 2017). For example, the word “fail” exists in
both Malay and English. In English, it means unsuccessful, whereas in Malay it means a
file or a folder. Second, a single word may carry multiple meanings within one language, such as the Malay
word madu (honey from the bee / women sharing a husband).
Word-by-word language identification, as practiced in most previous research, fails to identify such
ambiguous words accurately because they are spelled identically. The surrounding words must be taken into
consideration to obtain the sense and context needed to identify the word (A. Chanda et al., 2016).
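
One simple way to bring surrounding words into the decision is a majority vote over a small context window. The sketch below assumes each token already carries a word-level tag (`ms` for Malay, `en` for English, or `ambiguous` for a shared spelling such as “fail”); the tag names, the `window` parameter and the function name `resolve_ambiguous` are assumptions made for illustration, not a method from the cited works.

```python
def resolve_ambiguous(tags, window=2):
    """Replace each 'ambiguous' tag with the majority language of nearby unambiguous tags."""
    resolved = list(tags)
    for i, tag in enumerate(tags):
        if tag != "ambiguous":
            continue
        lo, hi = max(0, i - window), min(len(tags), i + window + 1)
        context = [t for t in tags[lo:hi] if t in ("ms", "en")]
        ms_votes = context.count("ms")
        en_votes = context.count("en")
        if ms_votes > en_votes:
            resolved[i] = "ms"
        elif en_votes > ms_votes:
            resolved[i] = "en"
        # On a tie (or with no context) the tag stays 'ambiguous'.
    return resolved


# "simpan fail ni dalam folder": 'fail' is surrounded mostly by Malay words.
malay_context = ["ms", "ambiguous", "ms", "ms", "en"]
# "i fail the exam": 'fail' is surrounded by English words.
english_context = ["en", "ambiguous", "en", "en"]

print(resolve_ambiguous(malay_context))
print(resolve_ambiguous(english_context))
```

A vote over neighbouring tags only settles the language of the word; choosing between senses within one language, as with madu, would need sense-level context that this sketch does not model.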

Subjectivity Analysis

Little attention has been paid to the subjectivity analysis task. Subjectivity analysis is usually
performed prior to detailed sentiment analysis and is considered an essential task for separating
subjective from objective sentences. It is worth investigating whether its outcome can improve the
sentiment classification of the system as a whole.
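
The two-stage arrangement described here, a subjectivity filter followed by polarity classification, can be sketched as below. The cue list, the polarity values and the names `is_subjective` and `classify` are assumptions made for illustration; a real subjectivity classifier would be learned or lexicon-driven rather than a fixed word list.

```python
# Illustrative cue list and polarity values (assumed for this sketch).
SUBJECTIVE_CUES = {"brilliant", "teruk", "best", "boring", "tahniah"}
POLARITY = {"brilliant": 1, "best": 1, "tahniah": 1, "teruk": -1, "boring": -1}


def is_subjective(text):
    """Stage 1: treat a sentence as subjective if it contains any opinion cue."""
    return any(token in SUBJECTIVE_CUES for token in text.lower().split())


def classify(text):
    """Stage 2 runs only on subjective sentences; objective ones are filtered out."""
    if not is_subjective(text):
        return "objective"
    score = sum(POLARITY.get(token, 0) for token in text.lower().split())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


print(classify("buku ni brilliant"))
print(classify("jammed teruk from tapah to ipoh"))
print(classify("train from tapah to ipoh"))
```

Filtering first means the polarity stage never has to assign a sentiment to purely factual sentences, which is the potential benefit the paragraph above suggests investigating.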

4. Conclusion and Future Work

This paper has presented a comprehensive literature search of state-of-the-art sentiment analysis of
code-mixed text, illustrating the current trend of the domain. The literature review revealed that most
research efforts in SA for code-mixed text have centered on the pre-processing, language identification,
lexicon construction, and sentiment classification tasks.

In the future, it will be necessary to cater for other language pairs and to resolve ambiguity issues by
going beyond word-level analysis in order to understand the context and the sentiment it conveys.

5. Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.


6. Funding Statement

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-
profit sectors.

References

1. B. Liu, Sentiment analysis and opinion mining, Synth. Lect. Hum. Lang. Technol. 5 (2012) 1–167.
2. B. Pang, L. Lee, Opinion Mining and Sentiment Analysis, 2008. doi:10.1561/1500000011.
3. E. Cambria, B. Schuller, Y. Xia, C. Havasi, New Avenues in Opinion Mining and Sentiment Analysis,
IEEE Intell. Syst. 28 (2013) 15–21. doi:10.1109/MIS.2013.30.
4. G. Di Fabbrizio, A.J. Stent, R. Gaizauskas, Summarizing Opinion-Related Information for Mobile
Devices, in: Mob. Speech Adv. Nat. Lang. Solut., Springer New York, New York, NY, 2013: pp. 289–
317. doi:10.1007/978-1-4614-6018-3_11.
5. M. Vela, Sentiment Analysis for Hotel Reviews, Proc. Comput. Linguist. Conf. 231527 (2011) 45–52.
doi:10.1051/matecconf/20167503002.
6. K. Ahmad, D. Cheng, Y. Almas, Multi-lingual Sentiment Analysis of Financial News Streams, Proc.
1st Intl Conf. Grid Technol. Financ. Model. Simul. (2006). doi:10.1109/IV.2005.143.
7. Y.E. Soelistio, M. Raditia, S. Surendra, Simple Text Mining for Sentiment Analysis of political figure
using naïve bayes classifier method, (2015) 99–104.
8. K. Dashtipour, S. Poria, A. Hussain, E. Cambria, A.Y.A. Hawalah, A. Gelbukh, Q. Zhou, Multilingual
Sentiment Analysis: State of the Art and Independent Comparison of Techniques, Cognit. Comput. 8
(2016) 757–771.
9. A. Montoyo, P. Martínez-Barco, A. Balahur, Subjectivity and sentiment analysis: An overview of the
current state of the area and envisaged developments, Decis. Support Syst. 53 (2012) 675–679.
10. A. Pak, P. Paroubek, Twitter as a Corpus for Sentiment Analysis and Opinion Mining, LREc. 10
(2010).
11. S. Bhattacharjee, A. Das, U. Bhattacharya, S.K. Parui, S. Roy, Sentiment analysis using cosine
similarity measure, in: 2015 IEEE 2nd Int. Conf. Recent Trends Inf. Syst., IEEE, 2015: pp. 27–32.
12. W. Medhat, A. Hassan, H. Korashy, Sentiment analysis algorithms and applications: A survey, Ain
Shams Eng. J. 5 (2014) 1093–1113.
13. K. Ravi, V. Ravi, A survey on opinion mining and sentiment analysis: Tasks, approaches and
applications, 2015.
14. B. Liu, Sentiment Analysis and Subjectivity, in: Handb. Nat. Lang. Process., 2010: pp. 1–38.
15. R. Feldman, Techniques and applications for sentiment analysis, Commun. ACM. 56 (2013) 82.
16. K. Denecke, Using SentiWordNet for multilingual sentiment analysis, in: Proc. - Int. Conf. Data Eng.,
2008: pp. 507–512.
17. A. Kumar, A. R, Sentiment Analysis Using Sentiwordnet And Semantic Approach, Int. J. Adv. Inf. Arts
Sci. Manag. ISSN. 1 (2014).
18. G. a. Miller, WordNet: a lexical database for English, Commun. ACM. 38 (1995) 39–41.
19. E. Cambria, D. Olsher, D. Rajagopal, SenticNet 3: a common and common-sense knowledge base for
cognition-driven sentiment analysis, in: Twenty-Eighth AAAI Conf., 2014: pp. 1515–1521.
20. H.T.T.F. Tadashi Kumano Hideki Kashioka, Construction and analysis of Japanese-English broadcast
news corpus with named entity tags, ACL2003
21. A. Danielewicz-Betz, H. Kaneda, M. Mozgovoy, M. Purgina, Creating English and Japanese Twitter
Corpora for Emotion Analysis, People. 1634 (2015) 5869.
22. H.Y. Lee, H. Renganathan, Chinese Sentiment Analysis Using Maximum Entropy, (2011) 89–93.
23. A. Alsaffar, N. Omar, Study on feature selection and machine learning algorithms for Malay sentiment
classification, in: Proc. 6th Int. Conf. Inf. Technol. Multimed., 2014: pp. 270–275.
24. G. Dehong, Cross-Lingual Sentiment Lexicon Learning, 2014.
25. A.A. Ríos, P.J. Amarilla, G.A.G. Lugo, Sentiment categorization on a creole language with lexicon-
based and machine learning techniques, Proc. - 2014 Brazilian Conf. Intell. Syst. BRACIS 2014. (2014)
37–43.
26. A.B. Muhammad, Contextual Lexicon-based Sentiment Analysis for Social Media, 2016.
27. D. Sitaram, S. Murthy, D. Ray, D. Sharma, K. Dhar, Sentiment analysis of mixed language employing
Hindi-English code switching, Mach. Learn. Cybern. (ICMLC), 2015 Int. Conf. 1 (2015) 271–276.
28. R. Bhargava, Y. Sharma, S. Sharma, Sentiment Analysis for Mixed Script Indic Sentences, in: Adv.
Comput. Commun. Informatics (ICACCI), 2016 Int. Conf. IEEE, 2016: pp. 524–529.
doi:10.1109/ICACCI.2016.7732099.
29. J.J. Gumperz, Discourse Strategies, Cambridge University Press, 1982.

30. K. Chuah, Aplikasi Media Sosial Dalam Pembelajaran Bahasa Inggeris : Persepsi Pelajar Universiti,
Issues Lang. Stud. 2 (2013) 56–63.
31. P.-J. Farrugia, TTS Pre-processing Issues for Mixed Language Support, in: Proc. CSAW’04, 2004: pp.
36–41.
32. T. Solorio, Y. Liu, Part-of-Speech Tagging for English-Spanish Code-Switched Text, in: Proc. Conf.
Empir. Methods Nat. Lang. Process., 2008: pp. 1051–1060.
33. A. Das, S. Bandyopadhyay, Subjectivity Detection in English and Bengali: A CRF-based Approach, in:
Proceeding ICON 2009, 2009.
34. S. Mukund, R. Srihari, Analyzing Urdu social media for sentiments using transfer learning with
controlled translations, in: Proc. Second Work. Lang. Soc. Media, 2012: pp. 1–8.
35. Y. Li, Y. Yu, P. Fung, A Mandarin-English Code-Switching Corpus, in: Lr. 2012 - Eighth Int. Conf.
Lang. Resour. Eval., 2012: pp. 2515–2519.
36. J. Zhao, X. Qiu, S. Zhang, F. Ji, X. Huang, Part-of-Speech Tagging for Chinese-English Mixed Texts
with Dynamic Features, in: Proc. 2012 Jt. Conf. Empir. Methods Nat. Lang. Process. Comput. Nat.
Lang. Learn., 2012: pp. 1379–1388.
37. N. Samsudin, A.R. Hamda, M. Puteh, M.Z.A. Nazri, Mining Opinion in Online Messages, Int. J. Adv.
Comput. Sci. Appl. 4 (2013) 19–24. http://ijacsa.thesai.org/.
38. B. King, S. Abney, Labeling the Languages of Words in Mixed-Language Documents using Weakly
Supervised Methods, in: HLT-NAACL, 2013: pp. 1110–1119.
39. M. Abdul-Mageed, M. Diab, S. Kübler, SAMAR: Subjectivity and sentiment analysis for Arabic social
media, Comput. Speech Lang. 28 (2014) 20–37.
40. G. Yan, W. He, J. Shen, C. Tang, A bilingual approach for conducting Chinese and English social
media sentiment analysis, Comput. Networks. 75 (2014) 491–503. doi:10.1016/j.comnet.2014.08.021.
41. S. Vicente, R. Agerri, G. Rigau, Simple , Robust and (almost) Unsupervised Generation of Polarity
Lexicons for Multiple Languages, in: EACL, 2014: pp. 88–97.
42. Q. Zhang, H. Chen, X. Huang, Chinese-English Mixed Text Normalization, in: Proc. 7th ACM Int.
Conf. Web Search Data Min., 2014: pp. 433–442. doi:10.1145/2556195.2556228.
43. U. Barman, A. Das, J. Wagner, J. Foster, Code Mixing: A Challenge for Language Identification in the
Language of Social Media, in: EMNLP 2014, 2014: p. 13.
44. U. Barman, J. Wagner, G. Chrupała, J. Foster, DCU-UVT: Word-Level Language Classification with
Code-Mixed Data, in: Proc. 2014 Conf. Empir. Methods Nat. Lang. Process., 2014: pp. 127–132.
45. A. Das, B. Gambäck, Identifying Languages at the Word Level in Code-Mixed Indian Social Media
Text, in: Proc. 11th Int. Conf. Nat. Lang. Process. Goa, India, 2014: pp. 169–178.
46. Y. Vyas, S. Gella, J. Sharma, K. Bali, M. Choudhury, POS Tagging of English-Hindi Code-Mixed
Social Media Content, in: Proc. Conf. Empir. Methods Nat. Lang. Process., 2014: pp. 974–979.
47. S. Sharma, P.Y.K.L. Srinivas, R.C. Balabantaray, Sentiment analysis of code-mix script, in: 2015 Int.
Conf. Comput. Netw. Commun. CoCoNet 2015, 2015: pp. 530–534.
doi:10.1109/CoCoNet.2015.7411238.
48. E. Gredel, Metaphorical patterns and the subprime mortgage crisis: Towards cross-linguistic, discourse-
specific and n-gram-based dictionaries for sentiment analysis, Stud. Commun. Sci. 15 (2015) 37–44.
doi:10.1016/j.scoms.2015.03.003.
49. S. Sharma, P.Y.K.L. Srinivas, R.C. Balabantaray, Text normalization of code mix and sentiment
analysis, in: 2015 Int. Conf. Adv. Comput. Commun. Informatics, ICACCI 2015, 2015: pp. 1468–1473.
doi:10.1109/ICACCI.2015.7275819.
50. A. Jamatia, B. Gambäck, A. Das, Part-of-Speech Tagging for Code-Mixed English-Hindi Twitter and
Facebook Chat Messages, in: Proc. Recent Adv. Nat. Lang. Process., 2015: pp. 239–248.
doi:10.13140/RG.2.1.1222.0640.
51. D. Vilares, M.A. Alonso, C. Gómez-Rodríguez, Sentiment Analysis on Monolingual, Multilingual and
Code-Switching Twitter Corpora, in: Proc. 6th Work. Comput. Approaches To Subj. Sentim. Soc.
Media Anal., 2015: pp. 2–8.
52. S. Lee, Z. Wang, Emotion in Code-switching Texts: Corpus Construction and Analysis, in: ACL-
IJCNLP 2015, 2015: pp. 91–99.
53. S. Dutta, T. Saha, S. Banerjee, S.K. Naskar, Text normalization in code-mixed social media text, in:
2015 IEEE 2nd Int. Conf. Recent Trends Inf. Syst. ReTIS 2015 - Proc., 2015: pp. 378–382.
doi:10.1109/ReTIS.2015.7232908.
54. Z. Wang, S.Y.M. Lee, S. Li, G. Zhou, Emotion Detection in Code-switching Texts via Bilingual and
Sentimental Information, in: Proc. 53rd Annu. Meet. Assoc. Comput. Linguist. 7th Int. Jt. Conf. Nat.
Lang. Process. (Volume 2 Short Pap.), 2015: pp. 763–768.
55. K.C. Raghavi, M. Chinnakotla, M. Shrivastava, "Answer ka type kya he?" Learning to
Classify Questions in Code-Mixed Language, in: Proc. 24th Int. Conf. World Wide Web, 2015: pp.
853–858. doi:10.1145/2740908.2743006.
56. S. Banerjee, A. Kuila, A. Roy, S.K. Naskar, P. Rosso, S. Bandyopadhyay, A Hybrid Approach for
Transliterated Word-Level Language Identification: CRF with Post-Processing Heuristics, in: Proc.
Forum Inf. Retr. Eval. - FIRE ’14, ACM Press, New York, New York, USA, 2015: pp. 54–59.
doi:10.1145/2824864.2824876.
57. S.L. Lo, E. Cambria, R. Chiong, D. Cornforth, A multilingual semi-supervised approach in deriving
Singlish sentic patterns for polarity detection, Knowledge-Based Syst. 105 (2016) 236–247.
doi:10.1016/j.knosys.2016.04.024.
58. D. Vilares, C. Gómez-Rodríguez, M.A. Alonso, EN-ES-CS: An English-Spanish Code-Switching
Twitter Corpus for Multilingual Sentiment Analysis, in: Proc. Tenth Int. Conf. Lang. Resour. Eval.,
2016: pp. 4149–4153.
59. S. Ghosh, S. Ghosh, D. Das, Part-of-speech Tagging of Code-Mixed Social Media Text, in: EMNLP
2016, 2016: pp. 90–97.
60. K. Rudra, S. Rijhwani, R. Begum, K. Bali, M. Choudhury, N. Ganguly, Understanding Language
Preference for Expression of Opinion and Sentiment: What do Hindi-English Speakers do on Twitter?,
in: Proc. 2016 Conf. Empir. Methods Nat. Lang. Process., 2016: pp. 1131–1141.
61. P. Lamabam, K. Chakma, A Language Identification System for Code-Mixed English-Manipuri Social
Media Text, in: 2nd IEEE Int. Conf. Eng. Technol. (ICETECH), 17th & 18th March 2016, Coimbatore,
TN, India, 2016: pp. 79–83.
62. A. Chanda, D. Das, C. Mazumdar, Unraveling the English-Bengali Code-Mixing Phenomenon, in:
Proc. Second Work. Comput. Approaches to Code Switch., 2016: pp. 80–89.
63. N. Bjørner, S. Prasad, L. Parida, Language Identification and Disambiguation in Indian Mixed-Script,
in: Lect. Notes Comput. Sci. (Including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics),
2016: pp. 113–121. doi:10.1007/978-3-319-28034-9.
64. D. Vilares, M.A. Alonso, C. Gómez-Rodríguez, Supervised sentiment analysis in multilingual
environments, Inf. Process. Manag. 53 (2017) 595–607. doi:10.1016/j.ipm.2017.01.004.
65. K. Becker, V.P. Moreira, A.G.L. dos Santos, Multilingual emotion classification using supervised
learning: Comparative experiments, Inf. Process. Manag. 53 (2017) 684–704.
doi:10.1016/j.ipm.2016.12.008.
66. Z. Wang, S. Lee, S. Li, G. Zhou, Emotion Analysis in Code-Switching Text with Joint Factor Graph
Model, IEEE/ACM Trans. Audio, Speech, Lang. Process. 25 (2017) 469–480.
doi:10.1109/TASLP.2016.2637280.
67. B. King, S. Abney, Labeling the Languages of Words in Mixed-Language Documents using Weakly
Supervised Methods, in: HLT-NAACL, 2013: pp. 1110–1119.
68. D.-P. Nguyen, A.S. Dogruoz, Word level language identification in online multilingual communication,
Assoc. Comput. Linguist. (2013).
69. M. Thelwall, K. Buckley, G. Paltoglou, D. Cai, A. Kappas, Sentiment strength detection in short
informal text, J. Am. Soc. Inf. Sci. Technol. 61 (2010) 2544–2558. doi:10.1002/asi.21416.
70. S. Narr, M. Ulfenhaus, S. Albayrak, Language-Independent Twitter Sentiment Analysis, Knowl.
Discov. Mach. Learn. (2012) 12–14.
71. M. Abdul-Mageed, M.T. Diab, Subjectivity and Sentiment Annotation of Modern Standard Arabic
Newswire, in: Proc. 49th Annu. Meet. Assoc. Comput. Linguist. Hum. Lang. Technol. Short Pap.,
2011: pp. 110–118.
72. S. Carter, W. Weerkamp, M. Tsagkias, Microblog language identification: overcoming the
limitations of short, unedited and idiomatic text, Lang. Resour. Eval. 47 (2013).
73. D. Jurgens, S. Dimitrov, D. Ruths, EMNLP 2014 First Workshop on Computational Approaches to
Code Switching Proceedings of the Workshop, in: EMNLP 2014, 2014: pp. 51–61.
74. S.K. Singh, K.S. Manoj, Importance and Challenges of Social Media Text, Int. J. Adv. Res. Comput.
Sci. 8 (2017) 2015–2018.
