
Identifying Groups of Fake Reviewers Using A Semisupervised Approach

The document discusses identifying groups of fake reviewers on online platforms. It proposes a semisupervised framework using the DeepWalk approach on reviewer graphs and clustering to detect candidate fake reviewer groups. The framework is validated on a real review dataset from Google Play with some known fraudulent reviewers. Experimental results show the approach can reasonably identify spammer groups.


IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS, VOL. 8, NO. 6, DECEMBER 2021, p. 1369

Identifying Groups of Fake Reviewers Using a Semisupervised Approach

Punit Rathore, Jayesh Soni, Nagarajan Prabakar, Marimuthu Palaniswami, Life Fellow, IEEE, and Paolo Santi

Abstract— Online product reviews have become increasingly important in digital consumer markets, where they play a crucial role in the purchasing decisions of most consumers. Unfortunately, spammers often take advantage of online reviews by writing fake reviews to promote/demote certain products. Most of the previous studies have focused on detecting fake reviews and individual fake reviewer-ids. However, to target a particular product, fake reviewers work collaboratively in groups and/or create multiple fake ids to write reviews and control the sentiments of the product. This article addresses the problem of finding such fake reviewer groups. More specifically, we propose a top-down framework for candidate fake reviewer groups’ detection based on the DeepWalk approach on reviewers’ graph data and a (modified) semisupervised clustering method, which can incorporate partial background knowledge. We validate our proposed framework on a real review dataset from the Google Play Store, which has partial ground-truth information about 2207 fraud reviewer-ids out of all 38 123 reviewer-ids in the dataset. Our experimental results demonstrate that the proposed approach is able to identify the candidate spammer groups with reasonable accuracy. The proposed approach can also be extended to detect groups of opinion spammers in social media (e.g., fake comments or fake postings) with temporal affinity, semantic characteristics, and sentiment analysis.

Index Terms— Computational social science, fake reviewer groups’ detection, semisupervised clustering, spammer detection.

Manuscript received November 17, 2020; revised May 14, 2021; accepted May 18, 2021. Date of publication June 10, 2021; date of current version December 1, 2021. (Corresponding author: Punit Rathore.)
Punit Rathore and Paolo Santi are with the Senseable City Laboratory, Department of Urban Planning and Studies, Massachusetts Institute of Technology, Cambridge, MA 02139 USA (e-mail: [email protected]; [email protected]).
Jayesh Soni and Nagarajan Prabakar are with the School of Computing and Information Sciences, Florida International University, Miami, FL 33199 USA (e-mail: [email protected]; [email protected]).
Marimuthu Palaniswami is with the Department of Electrical and Electronics Engineering, University of Melbourne, Parkville, VIC 3010, Australia (e-mail: [email protected]).
Digital Object Identifier 10.1109/TCSS.2021.3085406
2329-924X © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Consortium - Algeria (CERIST). Downloaded on October 31,2022 at 10:14:50 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

ONLINE reviews play a central role in the purchasing decisions of most consumers. Several reports [1], [2] suggest that over 93% [1] of consumers say that online reviews influence their purchasing decisions. Moreover, many service providers [3] use these reviews to enhance the quality of their products/services. Therefore, driven by the immense financial profits from products, fraudsters influence the business by posting fake or deceptive reviews/opinions to promote or demote some products. Since, currently, there are more than five million apps on the iOS and Google Play Store together, opinion spamming will become worse and more sophisticated in the future. Hence, detecting fake reviews and fake reviewers becomes extremely challenging.

Crowdsourcing sites play a vital role in such scenarios. Though generic crowdsourcing sites exist in the market, some specialized fraudsters’ crowdsourcing sites have emerged whose primary focus is only on performing fraud activities. Since the incentive structure appeals to fraudsters to perform malicious behaviors, fraudulent sites hire such people and form communities to commit fraud, such as spreading false opinions, promoting activities, instigating controversy and debates to build hype about certain topics, or controlling the sentiments about certain products to promote/demote them. Such communities often work in groups to fully control the sentiment of target products and to distribute the total effort. Moreover, to prevent being detected, some group members review nontarget products and review like normal users to deceive spam detection tools. Therefore, such fraudster groups are more harmful than the individual fraudster.

In recent years, there has been a rapid increase in such communities that perform fraudulent activities in groups. Fraudsters first create multiple unique accounts, then use crowdsourcing sites, try to connect with developers, and ultimately carry out fraudulent activities by posting fraudulent public opinions. Therefore, it is important to identify such groups of fraudsters. This article deals with this problem in the context of potential fake reviewer groups’ detection from online markets. Such a group of reviewers works collaboratively to write fake reviews. By a group of reviewers, we mean a set of reviewer-ids where the actual reviewers behind the reviewer-ids could be a single person with multiple (sock-puppet) ids, multiple persons, or a combination of both, as shown in Fig. 1.

Previous studies [4]–[13] mostly focused on detecting fake reviews and individual-level fraud detection. Surprisingly, only a few studies [14]–[22] aim to detect spammer groups. The reviewers of a spammer group usually coreview one or more target products. Hence, most of the existing studies use frequent itemset mining (FIM) methods to discover potential candidate groups, i.e., reviewers co-occurring in multiple reviewer–product transactions. These methods require the user to choose a minimum support threshold to identify candidate spammer groups. Using a very high support value, we may lose many useful potential spammer groups, whereas using a low support value causes high computational complexity due
to an exponential increase in itemsets [20]. Moreover, these studies are unable to capture overlapping groups.

Fig. 1. App review graph.

Most of the existing studies on candidate spammer group detection have adopted the unsupervised approach due to the lack of labeled data, such as ranking the groups using individual and group behavior indicators [14], [21], [22], clustering [19], [23], and community detection [24]. However, in many cases, partially labeled data or some useful background knowledge about certain reviewers or reviews may be available [7], [14], [22]. We believe that semisupervised methods can enhance prediction accuracy for candidate groups’ detection. Though semisupervised classification techniques make use of partially labeled data, they only work well when the labeled data is significant and represents all the relevant classes [25]. In contrast, semisupervised clustering [26] approaches can partition the data using the classes in the initial labeled data, as well as extend and modify the existing set of classes as needed to reflect other patterns in the data [27].

Therefore, to address the problems mentioned above, we propose a top-down framework for candidate groups’ detection based on a modified semisupervised clustering method and a DeepWalk approach on the topological structure of the reviewer graph. More specifically, we consider the data as a graph structure where nodes represent reviewer-ids and edges represent the number of products (apps in our case) commonly reviewed by both reviewer-ids. Then, we employ a DeepWalk approach that learns a social representation of a graph’s nodes by modeling a stream of short random walks followed by Word2Vec’s skip-gram model. We demonstrate the effectiveness of our framework by performing several experiments on a real-world review dataset, which is a one-of-a-kind dataset in the sense that it has partial ground-truth information about fraud reviewer-ids and the corresponding individuals behind them. This review dataset consists of 640 Google Play Store apps reviewed by 38 123 reviewer-ids. As partial background knowledge, we have information about 2207 fraud reviewer-ids that belong to 23 unique reviewers. The ground-truth data available to us are used both for partial background knowledge in our semisupervised clustering-based framework and for the validation of our proposed framework.

Unlike the majority of existing methods that focus on review text or behavioral analysis, our approach does not use review text information, as review text analysis is computationally inefficient and is often unreliable in fraud review detection [8], [15], [28]. We demonstrate that our proposed framework can discover candidate spammer groups using only the reviewer information from the social network data. Moreover, many previous studies on detecting review spammers relied on human evaluation of the candidate spammer groups their algorithm produced, which is a time-consuming and labor-intensive step subject to human judgment. To the best of our knowledge, this is the first study that evaluates the identified candidate spammer groups using the proposed framework against the available ground-truth information about fraud reviewers.

The remainder of this article is organized as follows. Section II summarizes the literature on individual and group spammer detection. Section III discusses the proposed approach, including the DeepWalk-based graph embedding and the semisupervised clustering technique for candidate spammer groups’ detection. Section IV presents the experiments and results on a real-world online review dataset, followed by discussion in Section V and conclusions in Section VI.

II. RELATED WORK

The problem of detecting fake reviews/reviewers has been extensively studied in the literature in recent years. These studies can be grouped into three main categories: fake review detection, fake reviewer detection, and fake reviewer groups’ detection. Compared to the first two categories, limited work has been done on fake reviewer groups’ detection. Due to the abundance of the literature on the first two categories, we restrict our discussion to spammer groups’ detection, which we deem pertinent to this article.

Mukherjee et al. [29] introduced the use of the FIM method to identify candidate review spammer groups from product-review transactions. Then, the authors proposed the GSRank model [14] that ranked the identified groups based on the spamicity of each group, computed by exploiting relationships among groups, target products, and individual reviewers. Xu et al. [15] identified the candidate spammer groups using FIM and then proposed a kNN-based method and a graph-based classification method, based on a pairwise Markov network, to predict the spam/nonspam label of each reviewer belonging to the candidate spammer groups. Allahbakhsh et al. [16] employed two clustering algorithms built upon the FIM technique to detect bicliques and then checked their spamicity based on group spam indicators. A biclique is a group of reviewers and a group of products where each reviewer in a group has rated each product in the corresponding group.

Wang et al. [20] proposed a divide-and-conquer algorithm, called GSBP, which solely relies on the topological structure of the reviewer graph. They argued that FIM tends to
detect small-size and tighter groups. Unlike GSRank, GSBP does not take the review content into consideration as one of its features. Akoglu et al. [11] proposed a Markov random field (MRF)-based model, FRAUDEagle, which ranks the individual reviewers based on their spamicity by exploiting the network effect among reviewers and products. To detect the spammer groups, they used a graph clustering technique on the subgraphs containing top-ranked spammers and products. Rayana and Akoglu [30] proposed SpEagle, which exploits both the relational data (review graph with only the review nodes) and metainformation (e.g., review content, timestamp, and ratings) as prior probabilities to detect spammer groups and fake reviews.

Choo et al. [24] performed sentiment analysis on reviews’ interaction data among users to differentiate nonspam communities from spam ones using community structures. The authors obtained comparable results on the Amazon dataset with respect to state-of-the-art approaches. Their results suggested that opinion group spammers have strong positive communities. Ye and Akoglu [18] proposed a new two-step model, called GroupStrainer, to identify the products targeted by a group of spammers. First, they compute the likelihood of the products being spam campaign targets using a new network footprint score (NFS) measure and then cluster spammers on a two-hop subgraph induced by top-ranking products.

Dhawan et al. [21] proposed a model, called DeFrauder, which first detects candidate fraud groups by leveraging the underlying graph and incorporating behavioral signals and then ranks each reviewer group based on the spam score. Bitarafan and Dadkhah [22] proposed a heterogeneous information network (HIN)-based approach that first identifies candidate spammer groups using biconnected components in the reviewer graph and then classifies the groups based on several group spammer indicators.

Most of the existing methods to discover candidate spammer groups exploit the FIM technique. However, there are several limitations [20] with FIM-based methods for candidate groups’ detection, such as: 1) a low support value causes high computational complexity as frequent itemsets grow exponentially; 2) a high support threshold value results in loss of many useful candidate groups; and 3) FIM is unable to capture overlapping groups. Moreover, the majority of the existing methods manually label the fake reviews or reviewers based on various spam indicators. To the best of our knowledge, this is the first study that validates the candidate spammer groups identified using our proposed framework against the actual (ground-truth) fake reviewer groups.

III. PROPOSED METHODOLOGY

In this section, first, we define the problem statement and then discuss our proposed computing framework.

A. Problem Definition

Let G = (R, E) denote a weighted coreview graph, where R is the node-set representing reviewer-ids, and E is the edge-set whose weights represent the number of apps coreviewed by the reviewer-ids of the associated vertices. For example, an edge (r_1, r_2) with edge weight w means that reviewer accounts r_1 and r_2 coreview the same w apps. Therefore, given the partially labeled data about the relationship between a set of reviewer spammers (or group members) R_g ⊂ R, the task is to identify all the candidate spammer groups from unlabeled nodes R′ ⊂ R \ R_g, where the number of candidate spammer groups is unknown.

In the proposed framework, as shown in Fig. 2, we detect candidate spammer groups merely based on the topological structure of the coreview graph. First, we obtain a representation for each node (reviewer-id) in G using the DeepWalk method, and then, we partition these reviewer-ids, represented by feature vectors, into different candidate spammer groups by employing a modified version of a semisupervised clustering algorithm, known as PCKMeans (Pairwise Constrained K-Means).

Fig. 2. Candidate spammer group detection framework.

B. DeepWalk

We leverage the DeepWalk approach to get the high-level representation of our coreview graph of reviewers. It is a natural language-based modeling approach that aims to generate a meaningful representation between vertices.
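
As a rough illustration of the weighted coreview graph G = (R, E) from Section III-A, which DeepWalk takes as input, the sketch below counts, for every pair of reviewer-ids, the number of apps both have reviewed. It is only a minimal sketch under stated assumptions: the function name `build_coreview_graph`, the `(reviewer_id, app_id)` input format, and the toy data are hypothetical and do not come from the article, which does not describe its implementation.

```python
from collections import defaultdict
from itertools import combinations


def build_coreview_graph(reviews):
    """Return {(r1, r2): w} with r1 < r2, where w is the number of apps
    reviewed by both reviewer-ids (the edge weight of the coreview graph).

    `reviews` is an iterable of (reviewer_id, app_id) pairs; this input
    format is an assumption made for the illustration.
    """
    # Group reviewer-ids by app; a set ignores repeated reviews of one app.
    reviewers_by_app = defaultdict(set)
    for reviewer_id, app_id in reviews:
        reviewers_by_app[app_id].add(reviewer_id)

    # Each app contributes +1 to the edge between every pair of its reviewers.
    weights = defaultdict(int)
    for reviewers in reviewers_by_app.values():
        for r1, r2 in combinations(sorted(reviewers), 2):
            weights[(r1, r2)] += 1
    return dict(weights)


# Toy data (hypothetical): r1 and r2 coreview two apps, r3 joins on appB only.
reviews = [
    ("r1", "appA"), ("r2", "appA"), ("r1", "appA"),
    ("r1", "appB"), ("r2", "appB"), ("r3", "appB"),
]
graph = build_coreview_graph(reviews)
```

Enumerating all reviewer pairs per app is quadratic in the number of reviewers of that app, so a graph of the size reported in the article would likely need a sparser or batched construction; the sketch favors clarity over scale.
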

DeepWalk produces this representation between vertices using two techniques in sequence: Random-Walk followed by Word2vec [31].

Through Random-Walk, it performs random sampling on the graph to generate a sequence of reviewer-ids. The sampling is governed by hyperparameters that include the following:
1) number of walks;
2) minimum path length (number of edges along the path);
3) minimum distance (sum of edge weights along the path).

Word2vec: It accepts the random walk output generated in the previous step as its input. It is basically a neural network that uses the skip-gram [32] technique to generate the vector representation of each node. This representation has semantic meaning, wherein similar reviewer accounts get similar vector representations and vice versa. The Word2vec model is governed by the following two hyperparameters: 1) representation size (number of features for the path) and 2) window size.

C. Semisupervised Clustering: Modified PCKMeans

In many cases, partially labeled data about reviews or some background knowledge about reviewers, e.g., reviewer location and IP address where he/she is posting a tweet from, may be available to us. Most of the existing studies on spammer group detection ignored the importance of such background information and used unsupervised approaches, such as the ranking of groups or clustering. Semisupervised clustering approaches have been successful in recent years in solving practical problems in many applications, including road detection, image classification, bioinformatics, information retrieval, and speech recognition [26].

In our work, we demonstrate that clustering performance for candidate spammer groups’ detection can be significantly improved by integrating partial background knowledge into a semisupervised clustering framework. In this work, we modify an existing semisupervised clustering method, called pairwise constrained k-means (PCKMeans) [33], to make it work efficiently when input constraints do not reflect all unique clusters present in the data. In the following, we first briefly discuss the PCKMeans clustering algorithm and then discuss our modifications.

1) PCKMeans [33]: PCKMeans is a semisupervised version of the k-means clustering algorithm that integrates the partially labeled data in its framework in the form of pairwise constraints, e.g., two instances must be in the same cluster [must-link (ML) constraints] or two instances cannot be in the same cluster [cannot-link (CL) constraints].

In the first step (Initialization) of PCKMeans, λ neighborhood sets {N_p}, p = 1, ..., λ, are computed from the ML and CL constraints such that points within each neighborhood have ML constraints between them, and the points in two different neighborhoods are connected by CL constraints. These neighborhood sets are then used to initialize the cluster centroids. If λ ≥ k, then the k neighborhood sets of largest size are considered for computing initial centroids. If λ < k, then the first λ centers are computed with the centroids of the λ neighborhood sets, and the remaining k − λ centers are chosen randomly.

In our modification, the first λ centroids are computed from ML and CL constraints as mentioned above; however, the next k − λ centroids are chosen using the maximin (MM) sampling scheme. The MM sampling scheme iteratively selects the points that are farthest from each other in the input data. Thus, we initialize the first λ samples with the λ centroids, and then, the remaining k − λ points (centroids) are chosen using MM sampling, in turn, to have maximum separation from the existing centers. These cluster centers are well scattered over the sample space.

In the second (cluster assignment) step, each point is assigned to its cluster such that its distance to the cluster centroid is minimized while satisfying as many ML and CL constraints as possible by the assignment. Let M and C denote the set of ML and CL constraints, respectively. Also, let l_i ∈ {1, 2, . . . , k} be the cluster assignment of a point x_i and 1 be the indicator function with 1[true] = 1 and 1[false] = 0; then, PCKMeans seeks to minimize the following modified objective function:

\[
J = \frac{1}{2} \sum_{x_i \in X} \lVert x_i - \mu_{l_i} \rVert^2 + \sum_{(x_i, x_j) \in M} w_{ij}\, \mathbb{1}[l_i \neq l_j] + \sum_{(x_i, x_j) \in C} \bar{w}_{ij}\, \mathbb{1}[l_i = l_j] \tag{1}
\]

where point x_i is assigned to the cluster X_{l_i} with centroid μ_{l_i}; w_ij and w̄_ij are user-defined penalties (incurred loss) for violating an ML constraint (if the ML points are assigned to two different clusters) and a CL constraint (if the CL points are assigned to the same cluster), respectively.

In the third (last) step, the centroids are reestimated by taking the mean of all feature vectors in the corresponding cluster. The pairwise constraints are not considered in this step. The second and third steps are performed repeatedly until the algorithm converges.

2) Constraints Selection: Note that, among available constraints, not all of them will be useful, particularly when the corresponding relationships can automatically and easily be deduced by a clustering algorithm. Only a few constraints, which greatly assist the algorithm in identifying complex and difficult patterns, will be more useful. To select the pairwise constraints that are more informative than random constraints, Basu et al. [33] proposed an active learning scheme that works in two phases: Explore and Consolidate. In the Explore phase, the farthest-first traversal property is used to obtain the disjoint neighborhoods until k neighborhoods are found. In this process, queries to generate pairwise constraints are made by pairing a point x with a random point from each neighborhood. If x is an ML-point with any existing neighborhood, it is added to that neighborhood; else, a new neighborhood is initiated with x. In the Consolidate phase, the neighborhoods obtained from the Explore step are consolidated with more points, where a proper neighborhood of a random point x ∈ X can be determined with a maximum of k − 1 queries.

Note that, if the actual number of clusters k is unknown, which is common for real-world datasets, only the Explore step can be used in the above active learning scheme to obtain consolidated neighborhoods, which is a more time-consuming process than the Consolidate step [33]. Therefore, to efficiently implement the above scheme, we make the following modification. First, we extract k′ MM points (neighborhoods) using the MM sampling scheme, which are the farthest from each other, where k′ is an overestimate (k′ > k) of the true but unknown number of clusters. Hathaway et al. [34] theoretically proved a proposition, which states that if k′ ≥ k and the data are compact-separated (CS), then MM sampling selects at least one object from each cluster in its sample.
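
The maximin (MM) sampling idea referred to above can be sketched as a farthest-first selection: starting from one point, each new pick is the point whose distance to its nearest already-picked point is largest. This is only an illustrative sketch, not the authors' implementation; the function name `maximin_sample`, the use of Euclidean distance, the choice of index 0 as the starting point, and the toy data are all assumptions made here.

```python
import math


def maximin_sample(points, k_prime, start=0):
    """Return indices of k_prime points chosen by maximin (farthest-first)
    sampling. The starting index is an assumption of this sketch."""
    chosen = [start]
    # nearest[i] = distance from point i to its closest already-chosen point
    nearest = [math.dist(p, points[start]) for p in points]
    while len(chosen) < min(k_prime, len(points)):
        # Pick the point that is farthest from everything chosen so far.
        nxt = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(nxt)
        # Update each point's distance to its nearest chosen point.
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[nxt]))
    return chosen


# Toy data (hypothetical): three well-separated clusters in 2-D.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # cluster 0
    (5.0, 5.0), (5.1, 5.0),               # cluster 1
    (10.0, 0.0), (10.0, 0.2),             # cluster 2
]
picked = maximin_sample(points, k_prime=3)
```

On this compact, well-separated toy data, three picks already cover all three clusters, which is the behavior the proposition describes for k′ ≥ k. In practice the sample would be drawn from the reviewer embeddings rather than 2-D points.
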
We also discovered in our experiments in [35] that the above proposition holds for most non-CS datasets, especially when k′ ≫ k. Initially, each of the k′ neighborhoods contains exactly one MM point. Next, pair(s) of neighborhoods that are connected by ML constraints are merged into one, giving us, say, c (≤ k′) neighborhoods. Next, the Consolidate step [33] is performed to find the ML neighborhood for a random point x ∈ X. This modification adds points to the corresponding neighborhoods at a faster rate than the original active learning scheme discussed in [33].

3) Additional Constraints Generation: Constraints provided by domain experts or generated from partially labeled data may be incomplete. However, by using the transitivity property on the available constraints, additional constraints can be generated, as follows:

\[
x_1 \xrightarrow{\text{similar}} x_2 \ \text{and}\ x_2 \xrightarrow{\text{similar}} x_3 \ \text{then}\ x_1 \xrightarrow{\text{similar}} x_3
\]
\[
x_1 \xrightarrow{\text{similar}} x_4 \ \text{and}\ x_4 \xrightarrow{\text{dissimilar}} x_5 \ \text{then}\ x_1 \xrightarrow{\text{dissimilar}} x_5 .
\]

The transitive closure expresses the transitive relation between objects. Therefore, we build the transitive closure of the initial constraint sets to expand the constraint sets using the Floyd–Warshall algorithm [36].

IV. NUMERICAL EVALUATION

A. Dataset

Gathering information about online reviewers during their review process is difficult as it violates their privacy. In addition, it is difficult to determine whether a review is genuine or fake. Fake reviews often refer to “illegal” activities since they express biased positive opinions to promote some target entities and/or negative opinions about competitors to damage their reputations. This challenge of detecting fraudsters requires some background information about fake reviewers as ground-truth attributes. Collecting such sensitive data is hard and raises critical ethical issues, while good quality data are practically nonexistent in the literature.

To address this difficult problem of the fraudulent app-reviewing scheme, we developed a survey questionnaire for 23 fraud freelancers recruited from Fiverr, a fast-growing marketplace for freelance services. We compensated these participants for their participation in the study. As part of the compensation agreement, the participants agreed to install, and installed, our custom application on their local computers to collect their review samples. Also, they understood that they own the data collected by this application. Through the IP address of the questionnaire response, we identified that the participants were from Bangladesh, Egypt, Germany, India, The Netherlands, Pakistan, the U.K., and the USA. The participants are male and aged 18–28 years, with different levels of education. The fraud expertise of the surveyed participants covered a wide spectrum of reviewing and assigning ratings in the online app market with misleading influential information or messages. We focused on the analysis of search rank fraud behavior targeted toward Google Play apps.

From the data collected by our custom application from their computers, we observed that each fraudster has a unique search rank fraud capability both in quality and quantity. For instance, some fraudsters have the ability to write several reviews (both positive and negative) for each app. These fraudsters together used 2207 unique Google Play reviewer-ids to write their reviews. The collected data about these reviewers and their reviews serve as the ground-truth data for our analysis.

1) Reviewer Graph: We also collected review data for 640 apps from the Google Play Store with reviewer-ids for all reviews of these apps. These 640 apps were reviewed by 38 123 unique reviewer-ids in our dataset. A fraudster uses multiple fake reviewer-ids to submit multiple reviews for the same app to avoid being detected. Hence, there is a high probability of having reviews from a group of reviewer-ids for the same app. To identify the association between a group of reviewer-ids and a fraudster, we created a coreview graph for each app. These 640 coreview graphs were then merged into a unified coreview graph G having |E| = 3 572 409 edges and |R| = 38 123 nodes, where a node represents a reviewer-id, and an edge weight indicates the total number of apps reviewed together by the respective reviewer-ids.

2) Partial Background Knowledge About Some Reviewers: The partial ground-truth dataset contains information about some fake accounts that belong to a single unique reviewer. As mentioned in the previous paragraphs, we have information about 23 fraudsters (unique reviewers) using 2207 unique Google Play reviewer-ids. Fig. 3 shows the distribution of each class, i.e., the number of reviewer-ids (class size) belonging to each unique reviewer (class).

Fig. 3. Number of reviewer-ids (class size) belonging to each unique reviewer (class) in partially labeled data.

B. Sentiment Analysis of Review Data

The sentiment analysis of the comments of users gives vital information about the applications. Sentiment is basically classified as positive, negative, or neutral. Manually reviewing every single comment is a cumbersome task; therefore, we computed the sentiment score for each text review using the TextBlob
TABLE I
C LASSIFICATION A CCURACY ON PARTIALLY L ABELED D ATA

python library. The sentiment score ranges in [−1, 1], where a score toward 1 indicates a positive review, a score toward −1 indicates a negative review, and 0 indicates a neutral review.

We observed that most of the app review text data for the 2207 ground-truth spammer accounts are of positive sentiment. To confirm this, we performed sentiment analysis on the app review text data for these accounts, which are clustered into 23 fraudster groups. The bar plots in Fig. 4 show the number of text reviews and the average sentiment score for each unique group (class). We notice that the average sentiment score of the reviews from each fraudster group is positive. Although it is possible for reviewers to write both positive and negative reviews, these reviewers were paid by the sponsors to write positive reviews to promote product ratings. Also, the reviewers are not native English speakers, and they used a limited and repeated vocabulary in their reviews. Based on this sentiment analysis, we determined that text analysis on app review text data for nonground-truth reviewer accounts will be inconclusive.

Fig. 4. (a) Number of reviews written by each unique (fraudster) reviewer group and (b) average sentiment score for each (fraudster) group.

C. Reviewer Representation Using DeepWalk

Once the coreview graph G is created, we employ the DeepWalk technique to generate an embedding for each reviewer account. While DeepWalk has demonstrated positive results for representation learning, its performance can change significantly based on the settings of its hyperparameters. Although there exist methods [37], [38] that can automatically learn the optimal values of the hyperparameters from the graph to capture better representations in the generated embedding, we make use of the partial ground-truth data available to us to identify the optimal values for each hyperparameter of the Random-Walk and Word2Vec methods. In this process, each combination of the hyperparameter values generates different Word2Vec embeddings. To expedite the entire process, we used a grid-search approach and implemented DeepWalk as a parallel processing task on multiple cores. For each set of hyperparameters, we generate a feature vector (embedding) for each reviewer account and compute the accuracy of classifying the 2207 feature vectors into 23 classes against the available ground-truth data. Then, we choose, as the optimal parameters and feature vectors, the set of hyperparameters and the corresponding embedding that yield the highest cross-validation classification accuracy on the partially labeled dataset.

We employed four popular multiclass classification models, namely random forest, support vector machine, decision tree (J48), and K-nearest neighbor (kNN), to obtain the best Word2Vec embedding. Table I shows the fivefold cross-validation classification accuracy for the best Word2Vec embedding. We can see that all classifiers achieve reasonably good classification accuracy (>80%), with kNN achieving the highest accuracy (87%) among them. At the end of this step, we have a 300-D feature vector for each reviewer-id as the best Word2Vec embedding for our data. Therefore, the size of the entire (both labeled and unlabeled) data is 38 123 × 300 and that of the labeled data is 2207 × 300. From here on, we denote the entire data by X and the labeled data by X_l.

D. t-SNE Visualization

To verify that our vector representations for reviewer-ids are reasonable, we used the t-SNE [39] method to project the high-dimensional vector representation of each reviewer-id into two dimensions for visualization. The foremost advantage of t-SNE is that it preserves local structure while reducing dimension. This means that points closer to each other in the high-dimensional space will also be closer to one another in the lower dimensional space.

Fig. 5 shows the embeddings of all 2207 fraud reviewer-ids (X_l) in 2-D, with each embedding colored according to the available ground-truth information (23 unique reviewers). We can see that there are some natural groupings of different reviewer-ids in the form of distinguishable dense clusters. Some embeddings are scattered in certain portions of the graph but have little overlap with embeddings from other reviewer accounts. These affinities to specific regions and natural groupings can be attributed to candidate spammer groups, which suggests that our vector representations are reasonable.

Fig. 5. t-SNE visualization of 2207 reviewer-ids that belong to 23 unique reviewers.

E. Constraint Generation Using Partially Labeled Data

We generate constraints from the partially labeled data X_l to use them in the semisupervised clustering framework. To generate constraints, we randomly select two instances from the partially available data and check their labels (which are used only for constraint generation and evaluation and not for clustering). If they are from the same class, we place the pair in the ML constraint set; otherwise, we place it in the CL constraint set. Note that we could generate hundreds of thousands of constraints from the partially labeled data available to us; however, not all the constraints will be useful. Therefore, we use the active learning technique, as mentioned in Section III-C2, to select the most informative constraints from the total available constraints.

F. Validation of Proposed Framework Using Ground-Truth Data, X_l

Before applying the proposed framework to the entire dataset X to identify candidate spammer groups, we first validate it on the 2207 fraud reviewer-accounts' data, X_l, for which we have ground-truth information readily available. We also compare the semisupervised clustering algorithm, (modified) PCKMeans (used in the proposed framework), with three standard unsupervised clustering algorithms, namely, k-means, DBSCAN [40], and Ward's (hierarchical) algorithm [41], and four other semisupervised clustering algorithms, namely, seeded k-means (Seeded-KM) [27], constraint k-means (CKM) [27], pairwise-constraint based k-means (COPKM) [42], PCKMeans [33], and constraint iVAT (ConiVAT) [43]. For a detailed explanation of these algorithms, we refer readers to [26], [43], [44].

In many practical cases, the partially labeled data have information about only some datapoints from certain clusters and may not have any representation from the remaining clusters present in the full data. Therefore, to have the same characteristics in the partially labeled data, we consider only 40% of the datapoints from 15 clusters (classes) of X_l to generate constraints in this experiment. In this way, the partially labeled data used to generate constraints in this experiment do not have any representation from the remaining eight clusters. We generated a total of 200 constraints (100 ML and 100 CL) and used them in all four semisupervised clustering algorithms as an input. We set the constraint violation cost w = 1.

The similarity of output partitions to ground-truth labels is measured using the partition accuracy (PA). The PA of a clustering algorithm is the ratio of the number of objects with matching ground-truth labels and algorithmic labels to the total number of objects in the data. The value of PA (%) ranges from 0 to 100, and a higher value implies a better match to the ground-truth partition. Before the PA can be calculated, we ensure that the algorithmic labels obtained from the clustering algorithms correspond to the same subsets in the ground truth. For a fair comparison, we set the number of clusters k = 23 as an input for all seven comparison algorithms. The PA is computed only on those datapoints for which we did not use ground-truth information for constraint generation.

TABLE II
Comparison of Three Unsupervised and Five Semisupervised Clustering Algorithms Based on the Average PA (%)

Table II shows the comparison of three unsupervised and five semisupervised clustering algorithms based on the average (ten trials) PA (%). We observed that: 1) among the unsupervised clustering algorithms, Ward's clustering algorithm achieves the highest clustering accuracy; 2) all five semisupervised algorithms boost the clustering accuracy; and 3) among the five semisupervised clustering algorithms, PCKMeans and Modified PCKMeans achieve reasonably better accuracy than the other three clustering algorithms. Though the accuracy of PCKMeans and Modified PCKMeans is comparable, an additional boost of 2% in the latter algorithm is achieved through our modifications, as discussed in Section III.
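The PA computation described above can be illustrated with a small sketch. The text does not state how algorithmic labels are aligned with the ground-truth subsets, so this example assumes a one-to-one matching between clusters and classes that maximizes the number of agreeing points. The function name `partition_accuracy` is hypothetical, and the brute-force search over permutations is only practical for a handful of clusters; for k = 23, an assignment solver such as SciPy's `linear_sum_assignment` would be the usual choice.

```python
from itertools import permutations


def partition_accuracy(true_labels, pred_labels):
    """Return PA (%) under the best one-to-one cluster-to-class matching.

    Assumption: alignment is a one-to-one matching that maximizes agreement.
    Brute force over permutations, so only suitable for small label sets.
    """
    if len(true_labels) != len(pred_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")

    # Distinct labels in order of first appearance.
    classes = list(dict.fromkeys(true_labels))
    clusters = list(dict.fromkeys(pred_labels))

    # overlap[(cluster, cls)] = number of points shared by a cluster and a class.
    overlap = {}
    for cls, cluster in zip(true_labels, pred_labels):
        overlap[(cluster, cls)] = overlap.get((cluster, cls), 0) + 1

    # Pad the shorter side with None so surplus clusters or classes stay unmatched.
    size = max(len(classes), len(clusters))
    classes += [None] * (size - len(classes))
    clusters += [None] * (size - len(clusters))

    best = 0
    for perm in permutations(classes):
        matched = sum(overlap.get((c, t), 0) for c, t in zip(clusters, perm))
        best = max(best, matched)
    return 100.0 * best / len(true_labels)


if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2, 2]
    found = [1, 1, 0, 0, 2, 0]  # same grouping up to relabeling, one point misplaced
    print(round(partition_accuracy(truth, found), 2))
```

In the setting of this section, the two label lists would hold only the datapoints that were not used for constraint generation.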

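The constraint setup used in this validation (labels from only a fraction of the points in a subset of the classes, turned into ML and CL pairs) can be sketched as follows. This is a simplified illustration: it samples pairs uniformly at random, whereas the framework selects informative constraints through active learning. The function names, the dictionary input format, and the seed handling are assumptions made for this example.

```python
import random


def sample_labeled_subset(labels, n_classes, fraction, seed=0):
    """Keep `fraction` of the points from `n_classes` randomly chosen classes.

    `labels` maps reviewer-id -> group label (hypothetical input format).
    """
    rng = random.Random(seed)
    kept_classes = rng.sample(sorted(set(labels.values())), n_classes)
    subset = {}
    for cls in kept_classes:
        members = sorted(r for r, c in labels.items() if c == cls)
        count = max(1, round(fraction * len(members)))
        for reviewer in rng.sample(members, count):
            subset[reviewer] = cls
    return subset


def generate_constraints(labels, n_must_link, n_cannot_link, seed=0,
                         max_attempts=100_000):
    """Draw random pairs; same label -> must-link, different -> cannot-link."""
    rng = random.Random(seed)
    ids = sorted(labels)
    must_link, cannot_link = set(), set()
    for _ in range(max_attempts):
        if len(must_link) >= n_must_link and len(cannot_link) >= n_cannot_link:
            break
        a, b = rng.sample(ids, 2)
        pair = (a, b) if a < b else (b, a)  # store each pair in one orientation
        if labels[a] == labels[b]:
            if len(must_link) < n_must_link:
                must_link.add(pair)
        elif len(cannot_link) < n_cannot_link:
            cannot_link.add(pair)
    return must_link, cannot_link


if __name__ == "__main__":
    # Toy data: 30 reviewer-ids spread over 3 groups.
    toy = {f"r{i:02d}": i % 3 for i in range(30)}
    known = sample_labeled_subset(toy, n_classes=2, fraction=0.4)
    ml, cl = generate_constraints(known, n_must_link=3, n_cannot_link=3)
    print(len(known), len(ml), len(cl))
```

With the values reported in this section, the call would use 15 of the 23 classes, a fraction of 0.4, and 100 constraints of each type.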

Although we knew the actual number of clusters (fraudster groups), k (23 classes), present in the ground-truth data X_l, k is usually unknown for many real-world unlabeled datasets. Similarly, we do not have any prior knowledge about k for the full dataset X used in this work. Therefore, we use a popular cluster validation metric, the Silhouette measure [45], to estimate the target k in our work. For a given clustering partition, the Silhouette coefficient is calculated using the mean intracluster distance and the mean nearest-cluster distance for each sample. Then, the final Silhouette score is the mean Silhouette coefficient of all data points. Before using it for the unlabeled dataset X, we verify it on the labeled data X_l to assess the (known) target number of the labeled subsets. The values of the Silhouette score range in [−1, 1], where 1 indicates the best clustering partition, −1 indicates the worst clustering partition, and values near 0 indicate overlapping clusters.

We applied the modified PCKMeans clustering on the ground-truth review data X_l, with the input number of clusters k varying between 2 and 60. Fig. 6 plots the Silhouette score for different k. The maximum Silhouette value in this plot suggests k = 23 as the optimal number of clusters, which matches the true number of classes (fraudster groups) present in the ground-truth data X_l. In summary, our proposed framework is able to estimate the true number of fraud reviewer groups in X_l and identify them from the topological data with ∼67% accuracy.

Fig. 6. Partially labeled data: Silhouette score for different numbers of clusters, with an optimal number (vertical dashed line) of clusters (fraudster groups) in the labeled data.

G. Experiments on the Entire (Both Labeled and Unlabeled) Data, X

In this experiment, we generate the pairwise constraints from the partially labeled data, X_l, and use them in our proposed approach to identify candidate spammer groups from the entire data, X. Since the number of observations to be clustered in this experiment is much higher than in our previous experiment, we also increase the input number of constraints from 200 to 500. More specifically, we generated a total of 500 constraints using 70% of the 2207 labeled datapoints from the 23 known classes and kept the remaining 30% for validation. Although we compute the overall clustering performance on the entire data X using the Silhouette measure, we also measure the clustering accuracy PA (%) on the validation data of X_l that were not used to generate constraints. The constraint violation cost was set as w = 1. The convergence threshold was set to 0.001, and the maximum number of iterations for modified PCKMeans was set to 100.

Since the total number of reviewer groups (k) present in the entire data X is unknown to us, we used the Silhouette measure to estimate k in the same manner as we did for the ground-truth data X_l. We applied our proposed framework on the entire data X with the number of clusters varying from 24 to 600 and computed the Silhouette score of each output partition. Fig. 7 plots these Silhouette scores for different numbers of clusters. The optimal number of clusters, k = 298, corresponding to the highest Silhouette score, suggests 298 candidate spammer groups in the entire data X using our proposed framework. Fig. 8 shows the distribution of the number of points in each cluster, suggesting an average of 127 reviewer-ids in each group, with a minimum of 4 and a maximum of 495 reviewer-ids in a group. The average number of reviewer-ids per group (127) in the entire data is comparable to the average number of reviewer-ids per group (95) in the partially labeled data. Moreover, the PA = 54.8% on the validation data of X_l indicates that the proposed framework is able to identify fraud reviewers from the 23 known spammer groups with 54.8% accuracy.

Fig. 7. Entire data: Silhouette score for different numbers of clusters, with an optimal number (vertical dashed line) of clusters (fraudster groups) in the entire data.

Fig. 8. Cluster size distribution.

V. DISCUSSION AND FUTURE WORK

In this work, we demonstrated that our proposed framework can make use of the partially available ground-truth data to enhance the detection of fake reviewer groups. Although the number of ground-truth data points available to us was far fewer than the total number of data points, our approach was able to identify groups of fraud reviewers with reasonable accuracy. Note that the partial ground-truth data are not essential in our semisupervised approach, but they boost the performance of fake reviewer group detection. A key application of our approach is in evaluating a new review for genuineness: when the associated reviewer-id is a member of a spammer group, the search space is significantly reduced from the entire collection of past reviews to the set of reviews by the reviewer-ids of that spammer group.

In our work, we used the DeepWalk embedding approach under the assumption that the network is static. Since the DeepWalk method is a transductive technique, it must be retrained whenever a new node is added. In most social networks, nodes and edges accrue to a growing network as new data arrive. Therefore, to extend our approach to dynamic networks, we plan to leverage a real-time streaming graph-embedding technique [46], [47] in our future work. The augmentation of stream data processing in our approach will dynamically reconfigure spammer groups in real time with newly created reviewer-ids, which will enable real-time detection of fake reviewer groups. Future online reviews from reviewer accounts of fraudsters detected by this approach can be quarantined at online forums to improve the authenticity of online reviews. In this work, the hyperparameters for the DeepWalk approach are learned directly using the partial ground-truth data in a semisupervised fashion. In our future work, we intend to explore other graph embedding techniques, such as node2vec [48], which can learn representations that organize nodes based on their network roles and/or the communities they belong to. Also, in the future, if we have additional information, such as temporal information, ratings, and review text, then our proposed approach would be able to give better results for fake reviewer group detection.

Also, we can adapt our proposed approach to detect opinion spammers on social media. Most spammers work collaboratively in groups to spread fake opinions through social media, such as Facebook and Twitter, to reach a receptive audience. Mostly, they use the same pieces of false information to reply to or comment on postings from real social media users on certain topics or issues. Our proposed approach can directly be extended to establish clusters of spammer-ids that have correlated semantic characteristics and temporal affinity. Furthermore, suitable natural language processing (NLP) techniques will enhance the accuracy of semantic correlation.

VI. CONCLUSION

Online review spamming has increasingly become a serious issue for online rating systems. Therefore, identifying group spamming activities is an important problem to prevent online customers from being influenced by fake reviews. In this article, we proposed a top-down framework to identify candidate fake reviewers' groups from social media data. The proposed framework first uses a DeepWalk approach to represent different reviewers in the form of feature vectors and then employs a semisupervised clustering method to identify candidate fake reviewers' groups using partial background knowledge.

We conducted experiments on a real-world review dataset that consists of 640 Google Play Store apps reviewed by 38 123 reviewer-ids and partial background knowledge about 2207 reviewer-ids as ground-truth information. We validated our proposed framework on the partial ground-truth data (2207 fake reviewer-ids belonging to 23 unique reviewers) to identify fraud reviewer groups from reviewers' graph data with ∼67% accuracy. Moreover, our experimental results on the entire (both labeled and unlabeled) data demonstrate that the proposed framework is able to identify candidate spammer groups with satisfactory accuracy, without review text data analysis.

REFERENCES

[1] D. Kaemingk. (2019). 20 Online Review Stats to Know in 2019. Accessed: Aug. 25, 2020. [Online]. Available: https://www.qualtrics.com/blog/online-review-stats/
[2] R. Kats. (2020). How are Consumers Spending Some of Their Time? Reading Reviews. Lots of Reviews. Accessed: Aug. 25, 2020. [Online]. Available: https://www.emarketer.com/content/how-are-consumers-spending-some-of-their-time-reading-reviews-lots-of-reviews
[3] K. Saleh. (2015). The Importance of Online Customer Review [Infographic]. Accessed: Aug. 25, 2020. [Online]. Available: https://www.invespcro.com/blog/the-importance-of-online-customer-reviews-infographic/
[4] N. Jindal and B. Liu, “Review spam detection,” in Proc. 16th Int. Conf. World Wide Web, 2007, pp. 1189–1190.
[5] N. Jindal and B. Liu, “Opinion spam and analysis,” in Proc. Int. Conf. Web Search Web Data Mining, 2008, pp. 219–230.
[6] E.-P. Lim, V.-A. Nguyen, N. Jindal, B. Liu, and H. W. Lauw, “Detecting product review spammers using rating behaviors,” in Proc. 19th ACM Int. Conf. Inf. Knowl. Manage., 2010, pp. 939–948.
[7] F. H. Li, M. Huang, Y. Yang, and X. Zhu, “Learning to identify review spam,” in Proc. 22nd Int. Joint Conf. Artif. Intell., 2011, pp. 1–6.
[8] M. Ott, Y. Choi, C. Cardie, and J. T. Hancock, “Finding deceptive opinion spam by any stretch of the imagination,” 2011, arXiv:1107.4557. [Online]. Available: http://arxiv.org/abs/1107.4557
[9] S. Feng, L. Xing, A. Gogar, and Y. Choi, “Distributional footprints of deceptive product reviews,” in Proc. ICWSM, 2012, vol. 12, no. 98, p. 105.
[10] G. Fei, A. Mukherjee, B. Liu, M. Hsu, M. Castellanos, and R. Ghosh, “Exploiting burstiness in reviews for review spammer detection,” in Proc. ICWSM, 2013, vol. 13, pp. 175–184.
[11] L. Akoglu, R. Chandy, and C. Faloutsos, “Opinion fraud detection in online reviews by network effects,” in Proc. ICWSM, 2013, vol. 13, nos. 2–11, p. 29.
[12] A. Heydari, M. A. Tavakoli, N. Salim, and Z. Heydari, “Detection of review spam: A survey,” Expert Syst. Appl., vol. 42, no. 7, pp. 3634–3642, May 2015.
[13] M. Crawford, T. M. Khoshgoftaar, J. D. Prusa, A. N. Richter, and H. A. Najada, “Survey of review spam detection using machine learning techniques,” J. Big Data, vol. 2, no. 1, p. 23, Dec. 2015.
[14] A. Mukherjee, B. Liu, and N. Glance, “Spotting fake reviewer groups in consumer reviews,” in Proc. 21st Int. Conf. World Wide Web, 2012, pp. 191–200.
[15] C. Xu, J. Zhang, K. Chang, and C. Long, “Uncovering collusive spammers in Chinese review websites,” in Proc. 22nd ACM Int. Conf. Inf. Knowl. Manage., 2013, pp. 979–988.
[16] M. Allahbakhsh et al., “Collusion detection in online rating systems,” in Proc. Asia–Pacific Web Conf. Berlin, Germany: Springer, 2013, pp. 196–207.
[17] C. Xu and J. Zhang, “Combating product review spam campaigns via multiple heterogeneous pairwise features,” in Proc. SIAM Int. Conf. Data Mining, Jun. 2015, pp. 172–180.
[18] J. Ye and L. Akoglu, “Discovering opinion spammer groups by network footprints,” in Proc. Joint Eur. Conf. Mach. Learn. Knowl. Discovery Databases. Cham, Switzerland: Springer, 2015, pp. 267–282.
[19] J. Soni and N. Prabakar, “Effective machine learning approach to detect groups of fake reviewers,” in Proc. 14th Int. Conf. Data Sci. (ICDATA), Las Vegas, NV, USA, 2018, pp. 3–9.
[20] Z. Wang, T. Hou, D. Song, Z. Li, and T. Kong, “Detecting review spammer groups via bipartite graph projection,” Comput. J., vol. 59, no. 6, pp. 861–874, Jun. 2016.
[21] S. Dhawan, S. C. R. Gangireddy, S. Kumar, and T. Chakraborty, “Spotting collective behaviour of online frauds in customer reviews,” 2019, arXiv:1905.13649. [Online]. Available: http://arxiv.org/abs/1905.13649


[22] A. Bitarafan and C. Dadkhah, “SPGD_HIN: Spammer group detection based on heterogeneous information network,” in Proc. 5th Int. Conf. Web Res. (ICWR), Apr. 2019, pp. 228–233.
[23] J. Soni, N. Prabakar, and H. Upadhyay, “Feature extraction through deepwalk on weighted graph,” in Proc. 15th Int. Conf. Data Sci. (ICDATA), 2019, pp. 1–7.
[24] E. Choo, T. Yu, and M. Chi, “Detecting opinion spammer groups through community discovery and sentiment analysis,” in Proc. IFIP Annu. Conf. Data Appl. Secur. Privacy. Cham, Switzerland: Springer, 2015, pp. 170–187.
[25] J. K. Rout, A. Dalmia, K.-K.-R. Choo, S. Bakshi, and S. K. Jena, “Revisiting semi-supervised learning for online deceptive review detection,” IEEE Access, vol. 5, pp. 1319–1327, 2017.
[26] Y. Qin, S. Ding, L. Wang, and Y. Wang, “Research progress on semi-supervised clustering,” Cogn. Comput., vol. 11, no. 5, pp. 599–612, Oct. 2019.
[27] S. Basu, A. Banerjee, and R. Mooney, “Semi-supervised clustering by seeding,” in Proc. 19th Int. Conf. Mach. Learn. (ICML), 2002, pp. 27–34.
[28] A. Mukherjee, V. Venkataraman, B. Liu, and N. S. Glance, “What yelp fake review filter might be doing?” in Proc. ICWSM, 2013, pp. 409–418.
[29] A. Mukherjee, B. Liu, J. Wang, N. Glance, and N. Jindal, “Detecting group review spam,” in Proc. 20th Int. Conf. Companion World Wide Web, 2011, pp. 93–94.
[30] S. Rayana and L. Akoglu, “Collective opinion spam detection: Bridging review networks and metadata,” in Proc. 21th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2015, pp. 985–994.
[31] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” 2013, arXiv:1301.3781. [Online]. Available: http://arxiv.org/abs/1301.3781
[32] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 3111–3119.
[33] S. Basu, A. Banerjee, and R. J. Mooney, “Active semi-supervision for pairwise constrained clustering,” in Proc. SIAM Int. Conf. Data Mining, Apr. 2004, pp. 333–344.
[34] R. J. Hathaway, J. C. Bezdek, and J. M. Huband, “Scalable visual assessment of cluster tendency for large data sets,” Pattern Recognit., vol. 39, no. 7, pp. 1315–1324, Jul. 2006.
[35] P. Rathore, D. Kumar, J. C. Bezdek, S. Rajasegarar, and M. Palaniswami, “A rapid hybrid clustering algorithm for large volumes of high dimensional data,” IEEE Trans. Knowl. Data Eng., vol. 31, no. 4, pp. 641–654, Apr. 2018.
[36] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms. Cambridge, MA, USA: MIT Press, 2009.
[37] S. Abu-El-Haija, B. Perozzi, R. Al-Rfou, and A. A. Alemi, “Watch your step: Learning node embeddings via graph attention,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 9180–9190.
[38] A. Epasto and B. Perozzi, “Is a single embedding enough? Learning node representations that capture multiple social contexts,” in Proc. World Wide Web Conf., 2019, pp. 394–404.
[39] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” J. Mach. Learn. Res., vol. 9, pp. 2579–2605, Nov. 2008.
[40] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proc. KDD, 1996, vol. 96, no. 34, pp. 226–231.
[41] J. H. Ward, “Hierarchical grouping to optimize an objective function,” J. Amer. Stat. Assoc., vol. 58, no. 301, pp. 236–244, Mar. 1963.
[42] K. Wagstaff et al., “Constrained k-means clustering with background knowledge,” in Proc. ICML, vol. 1, 2001, pp. 577–584.
[43] P. Rathore, J. C. Bezdek, P. Santi, and C. Ratti, “ConiVAT: Cluster tendency assessment and clustering with partial background knowledge,” 2020, arXiv:2008.09570. [Online]. Available: http://arxiv.org/abs/2008.09570
[44] E. Bair, “Semi-supervised clustering methods,” Wiley Interdiscipl. Rev., Comput. Statist., vol. 5, no. 5, pp. 349–361, Sep. 2013.
[45] P. J. Rousseeuw, “Silhouettes: A graphical aid to the interpretation and validation of cluster analysis,” J. Comput. Appl. Math., vol. 20, pp. 53–65, Nov. 1987.
[46] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 1024–1034.
[47] X. Liu, P.-C. Hsieh, N. Duffield, R. Chen, M. Xie, and X. Wen, “Real-time streaming graph embedding through local actions,” in Proc. Companion World Wide Web Conf., May 2019, pp. 285–293.
[48] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2016, pp. 855–864.

Punit Rathore received the master's degree in instrumentation engineering from IIT Kharagpur, Kharagpur, India, in 2011, and the Ph.D. degree from the Department of Electrical and Electronics Engineering, University of Melbourne, Melbourne, VIC, Australia, in 2019.
He was a Research Fellow with the School of Computing, National University of Singapore (NUS), Singapore. He is currently a Post-Doctoral Researcher with the Senseable City Laboratory, Department of Urban Studies and Planning, Massachusetts Institute of Technology (MIT), Cambridge, MA, USA. His research interests include big data clustering, spatiotemporal data mining, the Internet of Things, and urban analytics.

Jayesh Soni is currently pursuing the Ph.D. degree with the School of Computing and Information Sciences, Florida International University, Miami, FL, USA.
His current research on anomaly detection at the system level by leveraging artificial intelligence techniques is supported by the Department of Defense, Test Resource Management Center, USA. His research interests include cyber-security, big data, and parallel processing.

Nagarajan Prabakar received the M.Eng. degree in automation from the Indian Institute of Science, Bangalore, in 1979, and the Ph.D. degree in computer science from The University of Queensland, Brisbane, QLD, Australia, in 1985.
He is currently an Associate Professor with the School of Computing and Information Sciences, Florida International University, Miami, FL, USA. His research interests include machine learning-based object detection, anomaly detection for system security, and distributed optimization for real-world problems.

Marimuthu Palaniswami (Life Fellow, IEEE) received the M.E. degree in electrical, electronic and control engineering (EECE) from the Indian Institute of Science, Bengaluru, India, in 1979, the M.Eng.Sc. degree in EECE from the University of Melbourne, VIC, Australia, in 1983, and the Ph.D. degree from the University of Newcastle, Callaghan, NSW, Australia, in 1987.
He is currently a Professor with the University of Melbourne. He is representing Australia as a Core Partner in EU FP7 projects such as SENSEI, SmartSantander, IoT Initiative, and SocIoTal. He has been funded by several Australian Research Council (ARC) and industry grants (over 40 million) to conduct research in sensor network, IoT, health, environmental, and machine learning areas. His current research interests include sensor networks, IoT, machine learning, pattern recognition, and signal processing and control.

Paolo Santi received the Laurea degree and the Ph.D. degree in computer science from the University of Pisa, Pisa, Italy, in 1994.
He is currently a Principal Research Scientist with the MIT Senseable City Laboratory, and a Senior Researcher with the Istituto di Informatica e Telematica, CNR, Pisa. His research interests include wireless multihop networks, vehicular networks, smart mobility, and intelligent transportation.
Dr. Santi is a member of the IEEE Computer Society and has recently been recognized as a Distinguished Scientist by the Association for Computing Machinery. He has been involved in the technical and organizing committee of several conferences in the field. He is/has been an Associate Editor of the IEEE TRANSACTIONS ON MOBILE COMPUTING, the IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, and Computer Networks.
