
The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Collaborative, Dynamic and Diversified User Profiling

Shangsong Liang (1,2)
(1) School of Data and Computer Science, Sun Yat-sen University, China
(2) Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou 51006, China
[email protected]

Abstract

In this paper, we study the problem of dynamic user profiling in the context of streams of short texts. Previous work on user profiling works with long documents, does not consider collaborative information, and does not diversify the keywords used to profile users' interests. In contrast, we address the problem by proposing a user profiling algorithm (UPA), which consists of two models: the proposed collaborative interest tracking topic model (CITM) and the proposed streaming keyword diversification model (SKDM). UPA first utilizes CITM to collaboratively track each user's and his followees' dynamic interest distributions in the context of streams of short texts, and then utilizes SKDM to obtain top-k relevant and diversified keywords to profile users' interests at a specific point in time. Experiments were conducted on a Twitter dataset, and we found that UPA outperforms state-of-the-art non-dynamic and dynamic user profiling algorithms.

Introduction

Capturing users' dynamic interests underlying their posts on microblogging platforms such as Twitter is important to the success of applications that cater for users of such platforms, such as dynamic user clustering (Liang et al. 2017a; 2017b). In this paper, we study the problem of user profiling for streaming short documents (Balog et al. 2012; Liang 2018; Liang et al. 2018): collaboratively identifying users' dynamic interests and tracking how they evolve over time in the context of streaming short texts. Our goal is to infer users' and their collaborative topic distributions over time and to dynamically profile users' interests with a set of diversified keywords.

The first user profiling model was proposed in (Balog et al. 2007), where a set of relevant keywords was identified for each user in a static collection of long documents and the dynamics of users' interests were ignored. Recent work recognizes the importance of capturing users' dynamic interests over time, and a number of temporal profiling algorithms have been proposed for streams of long documents. However, previous work on user profiling suffers from the following problems: (1) It works with streams of long documents rather than short documents and makes the assumption that the content of documents is rich enough to infer users' dynamic interests. (2) It ignores collaborative information, such as friends' messages, when inferring users' interests at a specific point in time. (3) It simply retrieves a list of top-k keywords as a user's profile, and these keywords may be semantically similar to each other and thus redundant.

Accordingly, in this paper, to address the aforementioned drawbacks of previous work, we propose a User Profiling Algorithm for streams of short documents, abbreviated as UPA, which is collaborative, dynamic and diversified. UPA consists of two proposed models: a Collaborative Interest Tracking topic Model, abbreviated as CITM, and a Streaming Keyword Diversification Model, abbreviated as SKDM. The UPA algorithm first utilizes our proposed CITM to track the changes of users' dynamic interests in the context of streams of short documents. It then utilizes our proposed SKDM to produce top-k diversified keywords for profiling users' interests at a specific point in time.

Our CITM topic model works with streaming short texts and is a dynamic multinomial Dirichlet mixture topic model that is able to infer and track each user's dynamic interest distributions based not only on the user's own posts but also on the collaborative information, i.e., his followees' posts. Our hypothesis in CITM is that accounting for collaborative information is critical, especially for users with limited activity, infrequent short posts, and thus sparse information. To perform the inference of users' interest distributions in streams of short documents, we propose a collapsed Gibbs sampling algorithm. Our SKDM model then works with the users' dynamic interest distributions produced by CITM and aims at retrieving a set of relevant and diversified keywords for profiling users' interests at time t, such that redundancy among the retrieved keywords is avoided while relevant keywords are still kept to profile the users.

Our contributions are: (1) We propose a user profiling algorithm, UPA, to address the user profiling task in the context of streams of short documents. (2) We propose a topic model, CITM, that can collaboratively and dynamically track each user's and his followees' interests. (3) We propose a collapsed Gibbs sampling algorithm to infer users' and their followees' interest distributions. (4) We propose a streaming keyword diversification model, SKDM, to diversify the top-k keywords returned as users' profiling results at time t.

Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Related Work

User Profiling. User profiling has been gaining attention since the launch of the expert finding task at the TREC 2005 enterprise track (Craswell, de Vries, and Soboroff 2005). Balog and de Rijke (2007) worked with a static corpus of long documents and modeled the profile of a user as a set of relevant keywords. Recent work recognizes the importance of temporal user profiling. Temporal profiling for long documents was first introduced in (Rybak, Balog, and Nørvåg 2014), where topical areas were organized in a predefined taxonomy and interests were represented as a weighted, fixed tree built from the ACM classification system. A probabilistic model was proposed in (Fang and Godavarthy 2014), where experts' academic publications were used to investigate how personal interests evolve over time. We follow most previous work and retrieve top-k words as the profile of a user's interests.

Topic Modeling. Topic models provide a suite of algorithms for discovering the hidden thematic structure in a collection of documents. A topic model takes a set of documents as input and discovers a set of "latent topics" (recurring themes that are discussed in the collection) and the degree to which each document exhibits those topics (Blei, Ng, and Jordan 2003). Since the well-known topic models PLSI (Probabilistic Latent Semantic Indexing) (Hofmann 1999) and LDA (Latent Dirichlet Allocation) (Blei, Ng, and Jordan 2003) were proposed, topic models with dynamics have been widely studied. These include the Dynamic Topic Model (Blei and Lafferty 2006), the Dynamic Mixture Model (Wei, Sun, and Wang 2007), Topics over Time (Wang and McCallum 2006), the Topic Tracking Model (Iwata et al. 2009), and, more recently, a dynamic Dirichlet multinomial mixture topic model (Liang et al. 2017c), a user expertise tracking topic model (Liang 2018) and a user collaborative interest tracking topic model (Liang, Yilmaz, and Kanoulas 2018). To our knowledge, none of the existing dynamic topic models has considered the problem of user profiling for short texts that utilizes collaborative information to infer topic distributions.

Problem Formulation

We follow most of the previous work (Balog and de Rijke 2007; Berendsen et al. 2013; Liang et al. 2018) and retrieve top-k words as the profile of a user. The problem we address in this paper is then: given a set of users and the streams of short documents they generate, track their interests over time and dynamically identify a set of top-k relevant and diversified keywords for each user. The dynamic user profiling algorithm is essentially a function h that satisfies:

$$h: \mathcal{D}_t, \mathbf{u}_t \longrightarrow \mathcal{W}_t,$$

where D_t = {..., d_{t-2}, d_{t-1}, d_t} represents the stream of short documents generated by the users u_t up to time t, with d_t being the most recent set of short documents arriving at t; u_t = {u_1, u_2, ..., u_{|u_t|}} represents the set of users appearing in the stream up to time t, with u_i being the i-th user in u_t and |u_t| being the total number of users; and W_t = {w_{t,u_1}, w_{t,u_2}, ..., w_{t,u_{|u_t|}}} represents all users' profiling results at t, with w_{t,u_i} = {w_{t,u_i,1}, w_{t,u_i,2}, ..., w_{t,u_i,k}} being the profiling result, i.e., the top-k diversified keywords, for user u_i at t. We assume that the length of a document d in D_t is no more than a predefined small length (e.g., 140 characters in Twitter).

User Profiling Algorithm

In this section, we detail our proposed User Profiling Algorithm (UPA), which consists of the proposed Collaborative Interest Tracking topic Model (CITM) and the proposed Streaming Keyword Diversification Model (SKDM).

Overview

We model users' interests in streams by latent topics. The dynamic interests of each user u ∈ u_t at time period t can therefore be represented as a multinomial distribution θ_{t,u} over topics, where θ_{t,u} = {θ_{t,u,z}}_{z=1}^{Z}, with θ_{t,u,z} being the interest score on topic z for user u at time period t and Z being the total number of latent topics. Similarly, the dynamic interests of each user's followees at t can be represented as a multinomial distribution ψ_{t,u} = {ψ_{t,u,z}}_{z=1}^{Z}, with ψ_{t,u,z} being the interest score of user u's followees f_{t,u} as a whole on topic z at t. Here, f_{t,u} denotes all of user u's followees at t.

Our UPA algorithm consists of two main steps: (1) UPA first utilizes the proposed CITM to capture each user's dynamic interests θ_{t,u} and his collaborative interests ψ_{t,u}. (2) Once θ_{t,u} and ψ_{t,u} have been inferred, UPA utilizes SKDM to identify top-k relevant and diversified keywords for profiling the user u's dynamic interests at time period t.

Collaborative Interest Tracking Model

Modeling Interests over Time. We closely follow the previous work in (Liang, Yilmaz, and Kanoulas 2018; Liang et al. 2017a) and aim at inferring each user's dynamic interest distribution θ_{t,u} = {θ_{t,u,z}}_{z=1}^{Z} and his collaborative interest distribution ψ_{t,u} = {ψ_{t,u,z}}_{z=1}^{Z} at t in the context of streams of short documents in our CITM. We provide CITM's graphical representation in Fig. 1.

To track the dynamics of a user u's interests, we assume that the mean of his current interests θ_{t,u} at time period t is the same as that at t - 1, unless newly arrived documents associated with the user u are observed in the stream. With this assumption, and following previous work on dynamic topic models (Iwata et al. 2010; 2009; Wei, Sun, and Wang 2007), we use a Dirichlet prior with a set of precision values α_t = {α_{t,z}}_{z=1}^{Z}, in which the mean of the current distribution θ_{t,u} depends on the mean of the previous distribution θ_{t-1,u}:

$$P(\theta_{t,u} \mid \theta_{t-1,u}, \alpha_t) \propto \prod_{z=1}^{Z} \theta_{t,u,z}^{\alpha_{t,u,z}\,\theta_{t-1,u,z}-1}, \qquad (1)$$

where the precision value α_{t,z} = {α_{t,u,z}}_{u=1}^{|u_t|} represents the persistency of users' interests, i.e., how salient topic z is at time period t in contrast to t - 1 for the users. As this distribution is a conjugate prior of the multinomial distribution, inference can be performed by Gibbs sampling (Liu 1994).
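To make the evolving prior in (1) concrete, the following minimal sketch (our illustrative code, not the authors' implementation) draws θ_{t,u} from a Dirichlet whose parameter is the previous distribution θ_{t-1,u} scaled by the precision values α_{t,u}; the larger the precision, the closer the new interests stay to the old ones.

```python
import numpy as np

def draw_interests(theta_prev: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """Draw theta_{t,u} ~ Dirichlet(alpha_{t,u} * theta_{t-1,u}), as in Eq. (1).

    theta_prev: shape (Z,), the user's interest distribution at t-1.
    alpha:      shape (Z,), per-topic precision (persistency) values.
    """
    concentration = alpha * theta_prev  # Dirichlet parameter alpha_{t,u,z} * theta_{t-1,u,z}
    return np.random.dirichlet(concentration)

# A high precision keeps the new interests close to the old ones.
theta_prev = np.array([0.7, 0.1, 0.1, 0.1])
print(draw_interests(theta_prev, alpha=np.full(4, 100.0)))  # stays near [0.7, 0.1, 0.1, 0.1]
print(draw_interests(theta_prev, alpha=np.full(4, 2.0)))    # drifts much more
```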
Similarly, to track the dynamic changes of a user u's collaborative interest distribution ψ_{t,u}, we use the following Dirichlet prior with a set of precision values β_t = {β_{t,z}}_{z=1}^{Z}, in which the mean of the current distribution ψ_{t,u} evolves from that of the previous distribution ψ_{t-1,u}:

$$P(\psi_{t,u} \mid \psi_{t-1,u}, \beta_t) \propto \prod_{z=1}^{Z} \psi_{t,u,z}^{\beta_{t,u,z}\,\psi_{t-1,u,z}-1}, \qquad (2)$$

where the precision value β_{t,z} = {β_{t,u,z}}_{u=1}^{|u_t|} represents the persistency of users' collaborative interests, i.e., how salient topic z is at time period t in contrast to t - 1 for the users. In a similar way, to model the dynamic changes of the multinomial distribution of words specific to topic z, we assume a Dirichlet prior in which the mean of the current distribution φ_{t,z} = {φ_{t,z,v}}_{v=1}^{V} evolves from the mean of the previous distribution φ_{t-1,z}:

$$P(\phi_{t,z} \mid \phi_{t-1,z}, \gamma_t) \propto \prod_{v=1}^{V} \phi_{t,z,v}^{\gamma_{t,z,v}\,\phi_{t-1,z,v}-1}, \qquad (3)$$

where V is the total number of words in a vocabulary v = {v_i}_{i=1}^{V} and γ_t = {γ_{t,v}}_{v=1}^{V}, with γ_{t,v} = {γ_{t,z,v}}_{z=1}^{Z} representing the persistency of the word v in all topics at time t, a measure of how consistently the word belongs to the topics at t compared to t - 1. Later in this subsection, we propose a collapsed Gibbs sampling algorithm to infer all users' dynamic interest distributions Θ_t = {θ_{t,u}}_{u=1}^{|u_t|}, their corresponding dynamic collaborative interest distributions Ψ_t = {ψ_{t,u}}_{u=1}^{|u_t|}, and the words' dynamic topic distributions Φ_t = {φ_{t,z}}_{z=1}^{Z}, and we describe our update rules for obtaining the optimal persistency values α_t, β_t and γ_t.

Assume that we know all users' interest distributions at time t - 1, Θ_{t-1}, their collaborative interest distributions at time t - 1, Ψ_{t-1}, and the words' topic distributions, Φ_{t-1}. The proposed collaborative interest tracking model is then essentially a generative topic model that depends on Θ_{t-1}, Ψ_{t-1} and Φ_{t-1}. For initialization, and without loss of generality, we let θ_{0,u,z} = 1/Z, ψ_{0,u,z} = 1/Z and φ_{0,z,v} = 1/V at t = 0. Let d_{t,u} denote all the short documents posted by user u at time period t. The generative process of our model for documents in the stream at time t is as follows:

i. Draw Z multinomials φ_{t,z}, one for each topic z, from a Dirichlet prior distribution with parameter γ_{t,z} φ_{t-1,z};
ii. For each user u ∈ u_t, draw multinomials θ_{t,u} and ψ_{t,u} from Dirichlet distributions with priors α_{t,u} θ_{t-1,u} and β_{t,u} ψ_{t-1,u}, respectively;
iii. For each document d ∈ d_{t,u}, draw a topic z_d based on the mixture of θ_{t,u} and ψ_{t,u}, and then for each word v_d in the document d:
   (a) Draw the word v_d from the multinomial φ_{t,z_d}.

In the above generative process, given that the documents in the stream are short, and because most short documents are likely to talk about one single topic only (Yin and Wang 2014), we let all the words in the same document d be drawn from the multinomial distribution associated with the same topic z_d. See the graphical representation of CITM in Fig. 1.

[Figure 1: Graphical representation of our proposed CITM model. Shaded nodes represent observed variables.]
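Assuming the single-topic-per-document reading above, one time step of the generative story can be simulated as follows. The λ-mixture of θ_{t,u} and ψ_{t,u} used in step iii is our interpretation of "based on the mixture", and all names are illustrative rather than the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_documents(theta_u, psi_u, phi, n_docs, doc_len, lam=0.5):
    """Simulate step iii of the generative process for one user at time t.

    theta_u: (Z,) user's own interest distribution.
    psi_u:   (Z,) followees' (collaborative) interest distribution.
    phi:     (Z, V) per-topic word distributions.
    lam:     mixture weight on the collaborative interests.
    Each document gets a single topic z_d; all its words come from phi[z_d].
    """
    docs = []
    mix = (1.0 - lam) * theta_u + lam * psi_u  # mixture of own and collaborative interests
    for _ in range(n_docs):
        z_d = rng.choice(len(mix), p=mix)                       # one topic per document
        words = rng.choice(phi.shape[1], size=doc_len, p=phi[z_d])  # all words share z_d
        docs.append((int(z_d), words.tolist()))
    return docs

Z, V = 3, 10
phi = rng.dirichlet(np.ones(V), size=Z)      # word distribution for each topic
theta = np.array([0.8, 0.1, 0.1])
psi = np.array([0.2, 0.2, 0.6])
print(generate_documents(theta, psi, phi, n_docs=2, doc_len=5))
```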
Interest Distribution Inference. We propose a collapsed Gibbs sampling algorithm, based on the basic collapsed Gibbs sampler (Griffiths and Steyvers 2004; Wallach 2006), to approximately infer the distributions in our CITM topic model. As shown in Fig. 1 and the generative process, we adopt a conjugate (Dirichlet) prior for the multinomial distributions, and thus we can easily integrate out the uncertainty associated with the multinomials θ_{t,u}, ψ_{t,u} and φ_{t,z}.

We provide an overview of our proposed collapsed Gibbs sampling algorithm in Algorithm 1, where we denote m_{t,u,z}, o_{t,u,z} and n_{t,z,v} as the number of documents assigned to topic z for user u, the number of documents assigned to topic z for user u's followees, and the number of times word v is assigned to topic z for user u at t, respectively.

Algorithm 1: Inference for our CITM model at time t.
  Input: Distributions Θ_{t-1}, Ψ_{t-1} and Φ_{t-1} at t - 1; initialized α_t, β_t and γ_t; number of iterations N_iter.
  Output: Current distributions Θ_t, Ψ_t and Φ_t.
  1: Initialize topic assignments randomly for all documents in d_t
  2: for iteration = 1 to N_iter do
  3:   for user u = 1 to |u_t| do
  4:     for each document d in d_{t,u} do
  5:       Draw z_{t,u,d} from (5)
  6:       Update m_{t,u,z_{t,u,d}}, o_{t,u,z_{t,u,d}} and n_{t,z_{t,u,d},v}
  7:   Update α_t, β_t and γ_t
  8: Compute the posterior estimates Θ_t, Ψ_t and Φ_t.
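The control flow of Algorithm 1 corresponds to a short sampling loop. Below is a hedged Python skeleton of that loop; sample_topic_from_eq5 and update_precisions are placeholders standing in for Eqs. (5) and (6), and the bookkeeping for the followee counts o and word counts n is elided for brevity.

```python
import numpy as np

def citm_inference(docs_t, n_users, Z, n_iter, sample_topic_from_eq5, update_precisions):
    """Skeleton of Algorithm 1: collapsed Gibbs sampling for CITM at time t.

    docs_t[u] is the list of documents (lists of word ids) for user u at t.
    The two callables implement Eq. (5) and the fixed-point updates of Eq. (6).
    """
    m = np.zeros((n_users, Z), dtype=int)   # m_{t,u,z}: doc-topic counts per user
    o = np.zeros((n_users, Z), dtype=int)   # o_{t,u,z}: doc-topic counts of followees
    n = {}                                   # n_{t,z,v}: (topic, word) -> count
    z_assign = {}

    # Step 1: random initialization of topic assignments.
    for u in range(n_users):
        for d, doc in enumerate(docs_t[u]):
            z = np.random.randint(Z)
            z_assign[(u, d)] = z
            m[u, z] += 1
            # o and n would be updated analogously for followees and words.

    # Steps 2-7: sweep over all documents, resampling one topic per document.
    for _ in range(n_iter):
        for u in range(n_users):
            for d, doc in enumerate(docs_t[u]):
                old_z = z_assign[(u, d)]
                m[u, old_z] -= 1                       # remove doc from the counts
                new_z = sample_topic_from_eq5(u, doc, m, o, n)
                m[u, new_z] += 1                       # step 6: add doc back
                z_assign[(u, d)] = new_z
        update_precisions(m, o, n)                     # step 7: Eq. (6)
    return z_assign, m, o, n
```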
In the Gibbs sampling procedure, we need to calculate the conditional distribution P(z_{t,u,d} | z_{t,-(u,d)}, d_t, Θ_{t-1}, Ψ_{t-1}, Φ_{t-1}, u_t, α_t, β_t, γ_t) at time t, where z_{t,-(u,d)} represents the topic assignments for all the documents in d_t except the document d ∈ d_{t,u} associated with user u at t, and z_{t,u,d} is the topic assigned to the document d ∈ d_{t,u}.
To obtain this conditional distribution used during sampling, we begin with the joint probability of the current document set, P(z_t, d_t | Θ_{t-1}, Ψ_{t-1}, Φ_{t-1}, u_t, α_t, β_t, γ_t), at time t:

$$
\begin{aligned}
&P(\mathbf{z}_t, \mathbf{d}_t \mid \Theta_{t-1}, \Psi_{t-1}, \Phi_{t-1}, \mathbf{u}_t, \alpha_t, \beta_t, \gamma_t) \\
&\; = (1-\lambda)\, P(\mathbf{z}_t, \mathbf{d}_t \mid \Theta_{t-1}, \Phi_{t-1}, \mathbf{u}_t, \alpha_t, \gamma_t)
   + \lambda\, P(\mathbf{z}_t, \mathbf{d}_t \mid \Psi_{t-1}, \Phi_{t-1}, \mathbf{u}_t, \beta_t, \gamma_t) \\
&\; = (1-\lambda) \prod_z \frac{\Gamma\!\left(\sum_v \kappa_b\right)}{\prod_v \Gamma(\kappa_b)}
     \frac{\prod_v \Gamma(\kappa_a)}{\Gamma\!\left(\sum_v \kappa_a\right)}
     \cdot \prod_u \frac{\Gamma\!\left(\sum_z \kappa_2\right)}{\prod_z \Gamma(\kappa_2)}
     \frac{\prod_z \Gamma(\kappa_1)}{\Gamma\!\left(\sum_z \kappa_1\right)} \\
&\;\quad + \lambda \prod_z \frac{\Gamma\!\left(\sum_v \kappa_b\right)}{\prod_v \Gamma(\kappa_b)}
     \frac{\prod_v \Gamma(\kappa_a)}{\Gamma\!\left(\sum_v \kappa_a\right)}
     \cdot \prod_u \frac{\Gamma\!\left(\sum_z \kappa_4\right)}{\prod_z \Gamma(\kappa_4)}
     \frac{\prod_z \Gamma(\kappa_3)}{\Gamma\!\left(\sum_z \kappa_3\right)}, \qquad (4)
\end{aligned}
$$

where Γ(·) is the gamma function, λ is a free parameter that governs the linear mixture of a user's interests and his followees' interests, and the parameters κ are defined as: κ_1 = m_{t,u,z} + α_{t,u,z} θ, κ_2 = α_{t,u,z} θ, κ_3 = o_{t,u,z} + β_{t,u,z} ψ, κ_4 = β_{t,u,z} ψ, κ_a = n_{t,z,v} + γ_{t,z,v} φ, and κ_b = γ_{t,z,v} φ. Here, we let θ, ψ and φ abbreviate θ_{t-1,u,z}, ψ_{t-1,u,z} and φ_{t-1,z,v}, respectively.

Based on the joint distribution (4) and using the chain rule, we can conveniently obtain the following conditional distribution for the proposed Gibbs sampling (step 5 of Algorithm 1):

$$
\begin{aligned}
&P(z_{t,u,d} = z \mid \mathbf{z}_{t,-(u,d)}, \mathbf{d}_t, \Theta_{t-1}, \Psi_{t-1}, \Phi_{t-1}, \mathbf{u}_t, \alpha_t, \beta_t, \gamma_t) = \\
&\quad \left( (1-\lambda)\, \frac{m_{t,u,z} + \alpha_{t,u,z}\theta - 1}{\sum_{z=1}^{Z} (m_{t,u,z} + \alpha_{t,u,z}\theta) - 1}
   + \lambda\, \frac{o_{t,u,z} + \beta_{t,u,z}\psi - 1}{\sum_{z=1}^{Z} (o_{t,u,z} + \beta_{t,u,z}\psi) - 1} \right) \\
&\quad \times \frac{\prod_{v \in d} \prod_{j=1}^{N_{d,v}} \left(n_{t,z,v,-(u,d)} + \gamma_{t,z,v}\phi + j - 1\right)}
              {\prod_{i=1}^{N_d} \left(n_{t,z,-(u,d)} + i - 1 + \sum_{v=1}^{V} \gamma_{t,z,v}\phi\right)}, \qquad (5)
\end{aligned}
$$

where N_d, N_{d,v}, z_{t,-(u,d)}, n_{t,z,v,-(u,d)} and n_{t,z,-(u,d)} are the length of document d, the number of times word v appears in d, the topic assignments for all documents except the document d from user u at t, the number of times word v is assigned to topic z in all documents except the one from user u at t, and the number of documents assigned to z in all documents except the one from user u at t, respectively.
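As a concrete reading of Eq. (5), the sketch below scores every topic for one held-out document and samples a new assignment. It assumes the document's own counts have already been removed (the -(u,d) convention); all names are ours, and the max(0, ·) guard for the "-1" terms is an assumption of this sketch.

```python
import numpy as np

def sample_doc_topic(doc, m_u, o_u, n_zv, n_z, alpha_th, beta_ps, gamma_ph, lam, rng):
    """Sample z_{t,u,d} for one document via Eq. (5).

    doc:      list of word ids (this document excluded from all counts).
    m_u, o_u: (Z,) doc-topic counts for the user and for his followees.
    n_zv:     (Z, V) word-topic counts; n_z: (Z,) their row sums.
    alpha_th: (Z,) alpha_{t,u,z} * theta_{t-1,u,z}; beta_ps, gamma_ph analogous.
    """
    Z = len(m_u)
    user_part = (m_u + alpha_th - 1) / ((m_u + alpha_th).sum() - 1)
    foll_part = (o_u + beta_ps - 1) / ((o_u + beta_ps).sum() - 1)
    p = (1 - lam) * user_part + lam * foll_part

    gamma_sum = gamma_ph.sum(axis=1)  # sum_v gamma_{t,z,v} * phi_{t-1,z,v}
    for z in range(Z):
        log_like, seen = 0.0, {}
        for i, v in enumerate(doc):
            j = seen.get(v, 0)  # j-th occurrence of word v (0-based)
            log_like += np.log(n_zv[z, v] + gamma_ph[z, v] + j)        # numerator term
            log_like -= np.log(n_z[z] + i + gamma_sum[z])              # denominator term
            seen[v] = j + 1
        p[z] *= np.exp(log_like)

    p = np.maximum(p, 0.0)  # guard against small negatives from the "-1" terms
    return int(rng.choice(Z, p=p / p.sum()))
```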
At each iteration during the sampling (steps 2 to 7 of Algorithm 1), the precision parameters α_t, β_t and γ_t can be estimated by maximizing the joint distribution (4). We apply fixed-point iterations to obtain the optimal α_t, β_t and γ_t. By applying the two bounds in (Minka 2000), we can derive the following update rules for α_t, β_t and γ_t for maximizing the joint distribution in our fixed-point iterations:

$$
\begin{aligned}
\alpha_{t,u,z} &\leftarrow \frac{(1-\lambda)\,\alpha_{t,u,z}\left(\Delta(m_{t,u,z} + \alpha_{t,u,z}\theta) - \Delta(\alpha_{t,u,z}\theta)\right)}
  {\Delta\!\left(\sum_{z=1}^{Z} m_{t,u,z} + \alpha_{t,u,z}\theta\right) - \Delta\!\left(\sum_{z=1}^{Z} \alpha_{t,u,z}\theta\right)}, \\
\beta_{t,u,z} &\leftarrow \frac{\lambda\,\beta_{t,u,z}\left(\Delta(o_{t,u,z} + \beta_{t,u,z}\psi) - \Delta(\beta_{t,u,z}\psi)\right)}
  {\Delta\!\left(\sum_{z=1}^{Z} o_{t,u,z} + \beta_{t,u,z}\psi\right) - \Delta\!\left(\sum_{z=1}^{Z} \beta_{t,u,z}\psi\right)}, \qquad (6) \\
\gamma_{t,z,v} &\leftarrow \frac{\gamma_{t,z,v}\left(\Delta(n_{t,z,v} + \gamma_{t,z,v}\phi) - \Delta(\gamma_{t,z,v}\phi)\right)}
  {\Delta\!\left(\sum_{v=1}^{V} n_{t,z,v} + \gamma_{t,z,v}\phi\right) - \Delta\!\left(\sum_{v=1}^{V} \gamma_{t,z,v}\phi\right)},
\end{aligned}
$$

where Δ(x) = ∂ log Γ(x)/∂x is the digamma function. Our derivations of the update rules for α_t, β_t and γ_t in (6) are analogous to those in (Liang, Yilmaz, and Kanoulas 2018; Liang et al. 2017c; 2017b).
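The updates in (6) require only the digamma function. A hedged sketch of the α update for a single user follows (the β and γ updates are structurally identical); it is a direct transcription of the stated rule, not the authors' code.

```python
import numpy as np
from scipy.special import digamma

def update_alpha(alpha, theta_prev, m, lam, n_rounds=20):
    """Fixed-point iterations for the alpha update rule in Eq. (6), one user.

    alpha:      (Z,) current precision values alpha_{t,u,z}.
    theta_prev: (Z,) previous interests theta_{t-1,u,z}.
    m:          (Z,) document-topic counts m_{t,u,z}.
    """
    for _ in range(n_rounds):
        at = alpha * theta_prev  # alpha_{t,u,z} * theta_{t-1,u,z}
        num = (1 - lam) * alpha * (digamma(m + at) - digamma(at))
        den = digamma((m + at).sum()) - digamma(at.sum())
        alpha = num / den
    return alpha

alpha = np.ones(4)
theta_prev = np.array([0.4, 0.3, 0.2, 0.1])
m = np.array([10, 5, 3, 1])
print(update_alpha(alpha, theta_prev, m, lam=0.3))
```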
After the Gibbs sampling is done, using the fact that the Dirichlet distribution is conjugate to the multinomial distribution, we can conveniently infer each user's interest distribution θ_{t,u}, his collaborative interest distribution ψ_{t,u}, and the words' topic distribution φ_{t,z} at t, respectively, as:

$$\theta_{t,u,z} = \frac{m_{t,u,z} + \alpha_{t,u,z}}{\sum_{z'=1}^{Z} \left(m_{t,u,z'} + \alpha_{t,u,z'}\right)}, \quad
\psi_{t,u,z} = \frac{o_{t,u,z} + \beta_{t,u,z}}{\sum_{z'=1}^{Z} \left(o_{t,u,z'} + \beta_{t,u,z'}\right)}, \quad
\phi_{t,z,v} = \frac{n_{t,z,v} + \gamma_{t,z,v}}{\sum_{v'=1}^{V} \left(n_{t,z,v'} + \gamma_{t,z,v'}\right)}.$$
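Reading off these point estimates is a normalization of counts plus pseudo-counts; a minimal sketch with illustrative names:

```python
import numpy as np

def posterior_estimates(m, o, n, alpha, beta, gamma):
    """Posterior means after sampling: theta and psi from doc-topic counts,
    phi from word-topic counts, each smoothed by its precision parameters.

    m, o:        (U, Z) doc-topic counts for users and their followees.
    n:           (Z, V) word-topic counts.
    alpha, beta: (U, Z); gamma: (Z, V).
    """
    theta = (m + alpha) / (m + alpha).sum(axis=1, keepdims=True)
    psi = (o + beta) / (o + beta).sum(axis=1, keepdims=True)
    phi = (n + gamma) / (n + gamma).sum(axis=1, keepdims=True)
    return theta, psi, phi
```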

Streaming Keyword Diversification Model

After we obtain θ_{t,u}, ψ_{t,u} and φ_{t,z}, inspired by the PM-2 diversification method (Dang and Croft 2012), we closely follow the work in (Liang et al. 2017c; 2018) and propose a streaming keyword diversification model (Algorithm 2), SKDM. To generate top-k diversified keywords for each user u at t, SKDM starts with an empty keyword set w_{t,u} with k empty seats (step 2 of Algorithm 2) and a set of candidate keywords (step 3), ṽ, which initially contains all the words v in the vocabulary, i.e., ṽ = v. For each of the seats, it computes the quotient qt[z|t, u] for each topic z given a user u at t by the Sainte-Laguë formula (step 9): qt[z|t, u] = δ_{t,u,z} / (2 s_{z|t,u} + 1), where δ_{t,u,z} is the final probability that the user u has interest in topic z at t and is set to δ_{t,u,z} = (1 - λ) P(z|t, u) + λ P(z|t, f_{t,u}) (step 5), and s_{z|t,u} is the "number" of seats occupied by topic z (at initialization, s_{z|t,u} = 0 for all topics (step 6)). Here P(z|t, u) and P(z|t, f_{t,u}) are the probabilities of user u's own and his collaborative interest in topic z at t, respectively. We can obtain P(z|t, u) and P(z|t, f_{t,u}) directly from our CITM such that P(z|t, u) = θ_{t,u,z} and P(z|t, f_{t,u}) = ψ_{t,u,z}, i.e., we have:

$$\delta_{t,u} = (1-\lambda)\,\theta_{t,u} + \lambda\,\psi_{t,u}, \qquad (7)$$

where δ_{t,u} = {δ_{t,u,z}}_{z=1}^{Z} is user u's final interest distribution, inferred from both his own and his collaborative information at time t.

Algorithm 2: SKDM model for generating top-k keywords for collaborative, dynamic, diversified user profiling.
  Input: Current distributions Θ_t and Φ_t
  Output: All users' profiling results at time t, W_t
  1:  for u = 1, ..., |u_t| do
  2:    w_{t,u} ← ∅                               /* w_{t,u} ∈ W_t */
  3:    ṽ ← v
  4:    for z = 1, ..., Z do
  5:      δ_{t,u,z} ← (1 - λ) P(z|t, u) + λ P(z|t, f_{t,u})
  6:      s_{z|t,u} ← 0
  7:    for all positions in the ranked list w_{t,u} do
  8:      for z = 1, ..., Z do
  9:        qt[z|t, u] ← δ_{t,u,z} / (2 s_{z|t,u} + 1)
 10:      z* ← arg max_z qt[z|t, u]
 11:      v* ← arg max_{v ∈ ṽ} η_1 qt[z*|t, u] P(v|t, z*) + η_2 Σ_{z ≠ z*} qt[z|t, u] P(v|t, z) + (1 - η_1 - η_2) tfidf(v|t, u)
 12:      w_{t,u} ← w_{t,u} ∪ {v*}                /* append v* to w_{t,u} */
 13:      ṽ ← ṽ \ {v*}                            /* remove v* from ṽ */
 14:      for z = 1, ..., Z do
 15:        s_{z|t,u} ← s_{z|t,u} + P(v*|t, z) / Σ_{z'=1}^{Z} P(v*|t, z')

According to the Sainte-Laguë method, seats should be awarded to the topic with the largest quotient in order to best maintain the proportionality of the result list. Therefore, our SKDM assigns the current seat to the topic z* with the largest quotient (step 10). The keyword to fill this seat should be relevant not only to topic z* but also to the other topics, and should be specific to the user; thus we obtain the keyword v* for user u's profile at t as (step 11):

$$v^* \leftarrow \arg\max_{v \in \tilde{v}} \; \eta_1\, qt[z^*|t,u]\, P(v|t,z^*) + \eta_2 \sum_{z \neq z^*} qt[z|t,u]\, P(v|t,z) + (1 - \eta_1 - \eta_2)\, \mathrm{tfidf}(v|t,u),$$

where 0 ≤ η_1, η_2 ≤ 1 are two free parameters that satisfy 0 ≤ η_1 + η_2 ≤ 1, P(v|t, z) is the probability that v is associated with topic z at time t, which can be set to P(v|t, z) = φ_{t,z,v}, and tfidf(v|t, u) is a time-sensitive term frequency-inverse document frequency function for user u at t, defined as:

$$\mathrm{tfidf}(v|t,u) = \mathrm{tf}(v|\mathbf{d}_{t,u}) \times \mathrm{idf}(v|u,\mathbf{d}_t), \qquad (8)$$

where tf(v|d_{t,u}) = |{d ∈ d_{t,u} : v ∈ d}| / |d_{t,u}| is a term frequency function that computes the fraction of documents in the document set d_{t,u} that contain the word v, and idf(v|u, d_t) = log( |d_t| / (|{d ∈ d_t : v ∈ d}| + ε) ) is an inverse document frequency function, with ε set to 1 to avoid division by zero. According to (8), if v appears frequently in the document set d_{t,u} generated by user u but not frequently in the document set d_t generated by all the users, tfidf(v|t, u) will return a high score.

After the word v* is selected, SKDM adds v* as a result keyword to w_{t,u}, i.e., w_{t,u} ← w_{t,u} ∪ {v*} (step 12), removes it from the candidate word set ṽ, i.e., ṽ ← ṽ \ {v*} (step 13), and increases the "number" of seats occupied by each topic z by the topic's normalized relevance to v* (step 15): s_{z|t,u} ← s_{z|t,u} + P(v*|t, z) / Σ_{z'=1}^{Z} P(v*|t, z'). The process (steps 7 to 15) repeats until we obtain k diversified keywords. The order in which a keyword is appended to w_{t,u} determines its ranking in the profile. After the process is done, we obtain a set of diversified keywords w_{t,u} that profiles the user u at t. A runnable sketch of this selection loop is given below.
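To make the seat-allocation mechanics of Algorithm 2 concrete, here is a hedged, self-contained sketch of the selection loop for one user, including a simple reading of the time-sensitive tfidf of Eq. (8). Variable names and the η defaults are ours, not the authors'.

```python
import numpy as np

def tfidf(word, user_docs, all_docs, eps=1.0):
    """Time-sensitive tf-idf of Eq. (8): fraction of the user's documents
    containing `word`, times log inverse frequency over all documents."""
    tf = sum(word in d for d in user_docs) / max(len(user_docs), 1)
    idf = np.log(len(all_docs) / (sum(word in d for d in all_docs) + eps))
    return tf * idf

def skdm_profile(delta, phi, vocab, user_docs, all_docs, k, eta1=0.4, eta2=0.3):
    """PM-2-style keyword selection (Algorithm 2) for one user at time t.

    delta: (Z,) final interest distribution of Eq. (7).
    phi:   (Z, V) word-topic distributions, P(v|t, z) = phi[z, v].
    """
    Z = len(delta)
    seats = np.zeros(Z)                    # s_{z|t,u}, step 6
    candidates = set(range(len(vocab)))    # candidate words, step 3
    profile = []                           # w_{t,u}, step 2
    for _ in range(k):                     # steps 7-15
        quotient = delta / (2 * seats + 1)        # Sainte-Laguë quotient, step 9
        z_star = int(np.argmax(quotient))         # step 10
        best_v, best_score = None, -np.inf
        for v in candidates:                      # step 11
            rel = eta1 * quotient[z_star] * phi[z_star, v]
            others = eta2 * sum(quotient[z] * phi[z, v]
                                for z in range(Z) if z != z_star)
            score = rel + others + (1 - eta1 - eta2) * tfidf(vocab[v],
                                                             user_docs, all_docs)
            if score > best_score:
                best_v, best_score = v, score
        profile.append(vocab[best_v])             # step 12
        candidates.remove(best_v)                 # step 13
        seats += phi[:, best_v] / phi[:, best_v].sum()  # step 15
    return profile
```

The order of appends to `profile` is the keyword ranking, matching the prose description of SKDM.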
Experimental Setup

Research Questions

The research questions guiding the remainder of the paper are: (RQ1) How does UPA perform for user profiling compared to state-of-the-art methods? (RQ2) What is the contribution of the proposed interest tracking topic model, CITM, to the overall performance of UPA? (RQ3) What is the contribution of the collaborative information to user profiling? (RQ4) What is the impact of the length of the time intervals, t_i - t_{i-1}, in UPA?

Dataset

We work with a dataset crawled from Twitter (https://dev.twitter.com/). It contains 1,375 active, randomly selected users and their tweets posted from the beginning of their registrations up to May 31, 2015. According to the statistics, most of the users are followed by 2 to 50 followers. In total, we have 7.52 million tweets with timestamps, including those from users' followees. The average length of the tweets is 12 words.

We use this dataset as our stream of short documents. We obtain two categories of ground truths: one for evaluating relevance-oriented (RGT) performance and another for evaluating diversity-oriented (DGT) performance. To create the RGT ground truth, we split the dataset into 5 different partitions of time periods, i.e., a week, a month, a quarter, half a year and a year, respectively. For each Twitter user at every specific time period, an annotator was asked to generate a ranked list of top-k relevant keywords (k was decided by the annotators) as the user's profile. In total, 68 annotators took part in the labelling, with each of them labelling about 5 Twitter users for these 5 different partitions. To create the ground truth for diversity evaluation, DGT, as it is expensive to manually obtain aspects of the keywords from annotators, we cluster the relevant keywords by their word embeddings (publicly available from https://nlp.stanford.edu/projects/glove/) into 15 categories (information on the categories is available from http://dmoztools.net) by k-means (MacQueen 1967). Relevant keywords within a cluster are regarded as being relevant to the same aspect in the DGT ground truth.

Baselines

We compare our UPA with the following state-of-the-art baseline algorithms: (1) tfidf. It simply utilizes (8), i.e., the content of users' documents, to retrieve top-k keywords as profiles for the users. (2) Predictive Language Model (PLM). It models the dynamics of personal interests via a probabilistic language model (Fang and Godavarthy 2014). (3) Latent Dirichlet Allocation (LDA). This model (Blei, Ng, and Jordan 2003) infers topic distributions specific to each document via the LDA model. (4) Author Topic model (AuthorT). This model (Rosen-Zvi et al. 2004) infers topic distributions specific to each user in a static dataset. (5) Dynamic Topic Model (DTM). This dynamic model (Blei and Lafferty 2006) utilizes a Gaussian distribution to infer topic distributions of long documents in streams. (6) Topic over Time model (ToT). This dynamic model (Wang and McCallum 2006) normalizes the timestamps of long documents in a collection and then infers the topic distribution for each document. (7) Topic Tracking Model (TTM). This dynamic model (Iwata et al. 2009) captures the dynamic topic distributions of long documents arriving at time t in streams of long documents. (8) GSDMM. This is a Gibbs-sampling-based Dirichlet multinomial mixture model that assigns one topic to each short document in a static collection (Yin and Wang 2014).

For fair comparisons, the topic model baselines GSDMM, TTM, ToT, DTM and LDA use both each user's interests θ_{t,u} and their collaborative interests for profiling. As these baselines cannot directly infer collaborative interest distributions, we use the average interests of the user's followees as his collaborative interest distribution. Thus, unlike (7), in these baselines we use the mixture interests δ_{t,u} = (1 - λ) θ_{t,u} + λ (1/|f_{t,u}|) Σ_{u' ∈ f_{t,u}} θ_{t,u'} to represent each user's final interest distribution, with θ_{t,u} being inferred by the corresponding baseline topic models. The baselines tfidf, PLM and AuthorT are static profiling algorithms, while the others are dynamic. Again, for fair comparisons, UPA and all the other topic models use our SKDM algorithm to obtain the top-k keywords. We set the number of topics Z = 20 in all the topic models. For tuning the parameters λ, η_1 and η_2, we use a 70%/20%/10% split for our training, validation and test sets, respectively. The train/validation/test splits are permuted until all users have been chosen once for the test set. We repeat the experiments 10 times and report the average results.

For further analysis of the contribution of the collaborative interests ψ_{t,u} inferred by our CITM model to the profiling, we use another baseline, denoted as UPAavg, in which δ_{t,u} is set to (1 - λ) θ_{t,u} + λ (1/|f_{t,u}|) Σ_{u' ∈ f_{t,u}} θ_{t,u'}, with θ_{t,u} being inferred by CITM. Note that we still denote the proposed profiling algorithm using (7) as UPA.

Evaluation Metrics

We use the standard relevance-oriented evaluation metrics Pre@k (Precision at k), NDCG@k (Normalized Discounted Cumulative Gain at k), MRR@k (Mean Reciprocal Rank at k) and MAP@k (Mean Average Precision at k) (Croft, Metzler, and Strohman 2015), and the diversity-oriented metrics Pre-IA@k (Intent-Aware Pre@k) (Agrawal et al. 2009), α-NDCG@k (Clarke et al. 2008), MRR-IA@k (Agrawal et al. 2009) and MAP-IA@k (Agrawal et al. 2009). We also propose semantic versions of the original metrics, denoted as Pre-S@k, NDCG-S@k, MRR-S@k, MAP-S@k, Pre-IA-S@k, α-NDCG-S@k, MRR-IA-S@k and MAP-IA-S@k, respectively. The only difference between the original metrics and the corresponding semantic ones is the way the relevance score of a retrieved keyword v* against a ground-truth keyword v_gt is computed. For the original metrics, we let the relevance score be 1 if and only if v* = v_gt, and 0 otherwise; for the semantic versions, we let the relevance score be the cosine similarity between the word embedding vectors of v* and v_gt (a sketch of this soft scoring is given below). Since we usually choose only a few keywords to describe a user's profile, we compute the scores at depth 10, i.e., k = 10. For all the metrics we abbreviate M@k as M, where M is one of the metrics.
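The only new ingredient in the semantic variants is that soft relevance score. A small sketch of semantic Pre@k under this definition follows; taking each retrieved keyword's best match against the ground truth is one natural aggregation and an assumption of this sketch, as is the `embed` lookup.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def precision_s_at_k(retrieved, ground_truth, embed, k=10):
    """Semantic Pre@k: each retrieved keyword contributes its best cosine
    similarity against the ground-truth keywords (1.0 for exact matches),
    instead of the 0/1 score used by the original metric.

    retrieved, ground_truth: lists of keywords; embed: word -> vector.
    """
    scores = []
    for w in retrieved[:k]:
        if w in ground_truth:
            scores.append(1.0)  # exact match, as in the original metric
        else:
            scores.append(max(cosine(embed[w], embed[g]) for g in ground_truth))
    return sum(scores) / k
```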
Results and Discussions

In this section, we analyse our experimental results.

Table 1: Relevance performance of UPA, UPAavg and the baselines using time periods of a month. Statistically significant differences between UPAavg and GSDMM, and between UPA and UPAavg, are marked in the upper right-hand corner of UPAavg's and UPA's scores, respectively. Statistical significance is tested using a two-tailed paired t-test and is denoted using N for α = .01 and M for α = .05.

          Pre     NDCG    MRR     MAP     Pre-S   NDCG-S  MRR-S   MAP-S
tfidf     .254    .229    .375    .135    .409    .392    .853    .203
PLM       .273    .239    .668    .140    .417    .398    .870    .212
LDA       .281    .252    .674    .142    .424    .407    .878    .217
AuthorT   .288    .260    .674    .145    .429    .408    .897    .220
DTM       .295    .270    .694    .153    .436    .419    .883    .226
TTM       .301    .276    .728    .156    .440    .426    .882    .228
ToT       .312    .283    .744    .158    .445    .428    .884    .230
GSDMM     .321    .301    .746    .163    .452    .437    .891    .236
UPAavg    .367N   .361N   .840N   .195N   .483N   .468N   .939N   .262N
UPA       .399N   .398N   .860N   .211N   .501N   .490N   .946M   .274N

Table 2: Diversification performance of UPA, UPAavg and the baselines using time periods of a month. Notational conventions for the statistical significances are as in Table 1.

          Pre-IA  α-NDCG  MRR-IA  MAP-IA  Pre-IA-S  α-NDCG-S  MRR-IA-S  MAP-IA-S
tfidf     .157    .187    .480    .185    .257      .325      .725      .150
PLM       .162    .192    .487    .187    .265      .332      .742      .152
LDA       .171    .203    .493    .192    .272      .338      .744      .155
AuthorT   .174    .205    .505    .195    .276      .343      .748      .157
DTM       .177    .206    .507    .197    .279      .347      .748      .159
TTM       .180    .208    .509    .221    .282      .351      .751      .162
ToT       .182    .213    .513    .225    .290      .355      .754      .170
GSDMM     .194    .228    .525    .237    .304      .368      .780      .173
UPAavg    .238N   .265N   .597N   .252N   .362N     .421N     .808N     .216N
UPA       .266N   .302N   .623N   .266N   .395N     .452N     .814M     .231N

Overall Performance

We start by answering research question RQ1. The following findings can be observed from Tables 1 and 2: (1) In terms of both relevance and diversity, all the topic-model-based profiling algorithms, i.e., UPA, UPAavg, GSDMM, ToT, TTM, DTM, AuthorT and LDA, outperform the traditional algorithms, i.e., PLM and tfidf, which demonstrates that topic modeling does help to profile users' interests. (2) UPA and UPAavg outperform all the baseline models in terms of both relevance and diversity on all the metrics, which confirms the effectiveness of the proposed user profiling algorithm for the task. (3) The ordering of the methods, UPA > UPAavg > GSDMM > ToT ~ TTM ~ DTM ~ AuthorT ~ LDA > PLM > tfidf, is mostly consistent across the two ground truths and on the relevance and diversity evaluation metrics. Here A > B denotes that method A performs statistically significantly better than method B, and A ~ B denotes that we did not observe a significant difference between A and B. This, once again, confirms that UPA and its averaged version, UPAavg, outperform all the baselines. (4) In most cases, UPA > UPAavg holds, which confirms that the collaborative information inferred by the proposed topic model, CITM, does help to improve the profiling performance.

Additionally, Table 3 shows the top six keywords of an example user's dynamic profile over five quarters from April 2014 to May 2015. As shown in the table, the diversified keywords generated by UPA are semantically closer to those from the ground truth than those generated by the best baseline, GSDMM, which again demonstrates the effectiveness of the proposed UPA algorithm.

Table 3: Top six keywords of an example user's dynamic profile over five quarters from April 2014 to May 2015. The keywords from the DGT ground truth, from GSDMM and from UPA are presented for the user, respectively.

Apr. 2014 - Jun. 2014:
  Ground Truth: Apple, Java, iPhone, Python, ApplePay, ObjectiveC
  GSDMM: Apple, ComputerScience, iPhone, Java, Technology, iPad
  UPA: Apple, Java, iPhone, Programming, CPlusPlus, Computer
Jul. 2014 - Sep. 2014:
  Ground Truth: Apple, Git, iPad, ObjectiveC, AppleEvent, Python
  GSDMM: Apple, Company, University, Technology, Language, iOS
  UPA: Apple, Programming, iPad, Git, Event, Python
Oct. 2014 - Dec. 2014:
  Ground Truth: AppleEvent, LinkedInProfile, openEducation, iOS, NatsTwitter, Education
  GSDMM: Apple, Christmas, LinkedIn, Education, Friends, Degree
  UPA: Apple, LinkedIn, Education, iOS, Twitter, FB
Jan. 2015 - Mar. 2015:
  Ground Truth: Microblog, Students, LinkedInProfile, ArtsEducation, FB, AfterSchool
  GSDMM: Online, Education, Students, Website, Degree, Presentation
  UPA: LinkedIn, Students, Microblog, Education, Art, Twitter
Apr. 2015 - May 2015:
  Ground Truth: SocialMedia, Education, NatsTwitter, ConnectedLearning, FB, Courses
  GSDMM: Courses, Online, Presentation, Digital, Learning, Education
  UPA: Education, Media, Learning, FB, Courses

Contribution of CITM

We now turn to research question RQ2. Recall that the only difference between our UPA/UPAavg and the baselines is that UPA utilizes our CITM to track users' dynamic interests and then our SKDM to diversify the keywords, whereas the other topic models utilize different topic models to obtain users' interests and then the same SKDM for keyword diversification. As shown in Tables 1 and 2, UPA/UPAavg outperforms all the other topic models, i.e., GSDMM, ToT, TTM, DTM, AuthorT and LDA, which illustrates that the proposed topic model, CITM, is effective and contributes significantly to the performance of our user profiling algorithm.

Contribution of Collaborative Interests

Here we turn to research question RQ3. We vary the parameter λ that governs how much the collaborative information, ψ_{t,u}, is utilized for profiling. A larger λ indicates that more collaborative information is utilized for the profiling.

[Figure 2: Relevance and diversity performance of UPA, UPAavg and GSDMM on the representative metrics, Precision and Pre-IA, with varying values of λ.]

Fig. 2 shows the performance on the relevance and diversity evaluation metrics (using Precision and Pre-IA as representative metrics only), where we use the best baseline, GSDMM, as a representative. When we increase λ from 0 to 0.6, i.e., give more weight to the collaborative information, the performance of all the models gradually improves, with UPA still outperforming UPAavg and GSDMM. This, again, illustrates that integrating collaborative information into the models helps to improve the performance. Moreover, as shown in Fig. 2, UPA, which utilizes the inferred collaborative interests, outperforms UPAavg, which simply utilizes the average of the followees' interests as its collaborative interests; this once again demonstrates that the collaborative interests inferred by UPA are effective.

Impact of Time Period Length

Finally, we answer research question RQ4. We compare the performance for different time periods, i.e., a week, a month, a quarter, half a year and a year, using the two ground truths, RGT and DGT, on the representative relevance and diversity metrics, Precision and Pre-IA, in Fig. 3.

[Figure 3: Relevance and diversity performance of UPA, UPAavg and GSDMM on time periods of a week, a month, a quarter, half a year, and a year, respectively.]

As shown in Fig. 3, UPA and UPAavg beat the baselines for time periods of all lengths, which illustrates that our proposed user profiling algorithm works better than the state-of-the-art ones for dynamic user profiling regardless of period length. The performance of UPA, UPAavg and the best baseline, GSDMM, improves significantly on all the metrics when the period length increases from a week to a quarter, whereas it reaches a plateau as the time period further increases from a quarter to a year. In all cases UPA and UPAavg significantly outperform the best baseline, GSDMM. These findings illustrate that the performance of the proposed algorithms is robust and maintains significant improvements over the state-of-the-art non-dynamic and dynamic algorithms. In addition, UPA always outperforms UPAavg on all the metrics and all the different period lengths, which once again illustrates that the collaborative interest distribution inferred by the proposed CITM model helps to enhance the user profiling performance.

Conclusions

We have studied the problem of collaborative, dynamic and diversified user profiling in the context of streams of short texts.
To tackle the problem, we have proposed a streaming profiling algorithm, UPA, that consists of two models: the proposed collaborative interest tracking topic model, CITM, and the proposed streaming keyword diversification model, SKDM. Our CITM tracks the changes of users' and their followees' interest distributions in streams of short texts, a sequentially organized corpus of short texts, and our SKDM diversifies the top-k keywords for profiling users' dynamic interests. To effectively infer users' and their followees' dynamic interest distributions in our CITM model, we have proposed a collapsed Gibbs sampling algorithm in which a single topic is assigned to each document during sampling to address the textual sparsity problem. We have conducted experiments on a Twitter dataset. We evaluated the performance of our UPA and the baseline algorithms using two categories of ground truths, on both the original metrics and the proposed semantic versions of the metrics. Experimental results show that our UPA is able to profile users' dynamic interests over time for streams of short texts. In the future, we intend to utilize auxiliary resources, such as the Wikipedia articles that the entities in the short documents link to, for further improvement of user profiling.

References

Agrawal, R.; Gollapudi, S.; Halverson, A.; and Ieong, S. 2009. Diversifying search results. In WSDM, 5-14.
Balog, K., and de Rijke, M. 2007. Determining expert profiles (with an application to expert finding). In IJCAI, 2657-2662.
Balog, K.; Bogers, T.; Azzopardi, L.; de Rijke, M.; and van den Bosch, A. 2007. Broad expertise retrieval in sparse data environments. In SIGIR, 551-558.
Balog, K.; Fang, Y.; de Rijke, M.; Serdyukov, P.; and Si, L. 2012. Expertise retrieval. Found. Trends Inf. Retr. 6:127-256.
Berendsen, R.; de Rijke, M.; Balog, K.; Bogers, T.; and van den Bosch, A. 2013. On the assessment of expertise profiles. JASIST 64(10):2024-2044.
Blei, D. M., and Lafferty, J. D. 2006. Dynamic topic models. In ICML, 113-120.
Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3:993-1022.
Clarke, C. L. A.; Kolla, M.; Cormack, G. V.; Vechtomova, O.; Ashkan, A.; Büttcher, S.; and MacKinnon, I. 2008. Novelty and diversity in information retrieval evaluation. In SIGIR, 659-666.
Craswell, N.; de Vries, A. P.; and Soboroff, I. 2005. Overview of the TREC 2005 enterprise track. In TREC'05, 1-7.
Croft, W. B.; Metzler, D.; and Strohman, T. 2015. Search Engines: Information Retrieval in Practice. Addison-Wesley.
Dang, V., and Croft, W. B. 2012. Diversity by proportionality: An election-based approach to search result diversification. In SIGIR, 65-74.
Fang, Y., and Godavarthy, A. 2014. Modeling the dynamics of personal expertise. In SIGIR, 1107-1110.
Griffiths, T. L., and Steyvers, M. 2004. Finding scientific topics. PNAS 101:5228-5235.
Hofmann, T. 1999. Probabilistic latent semantic indexing. In SIGIR, 50-57.
Iwata, T.; Watanabe, S.; Yamada, T.; and Ueda, N. 2009. Topic tracking model for analyzing consumer purchase behavior. In IJCAI, volume 9, 1427-1432.
Iwata, T.; Yamada, T.; Sakurai, Y.; and Ueda, N. 2010. Online multiscale dynamic topic models. In KDD, 663-672.
Liang, S.; Ren, Z.; Yilmaz, E.; and Kanoulas, E. 2017a. Collaborative user clustering for short text streams. In AAAI, 3504-3510.
Liang, S.; Ren, Z.; Zhao, Y.; Ma, J.; Yilmaz, E.; and de Rijke, M. 2017b. Inferring dynamic user interests in streams of short texts for user clustering. ACM Trans. Inf. Syst. 36(1):10:1-10:37.
Liang, S.; Yilmaz, E.; Shen, H.; de Rijke, M.; and Croft, W. B. 2017c. Search result diversification in short text streams. ACM Trans. Inf. Syst. 36(1):8:1-8:35.
Liang, S.; Zhang, X.; Ren, Z.; and Kanoulas, E. 2018. Dynamic embeddings for user profiling in Twitter. In KDD, 1764-1773.
Liang, S.; Yilmaz, E.; and Kanoulas, E. 2018. Collaboratively tracking interests for user clustering in streams of short texts. IEEE Transactions on Knowledge and Data Engineering.
Liang, S. 2018. Dynamic user profiling for streams of short texts. In AAAI, 5860-5867.
Liu, J. S. 1994. The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. J. Am. Stat. Assoc. 89(427):958-966.
MacQueen, J. B. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 281-297.
Minka, T. 2000. Estimating a Dirichlet distribution. Technical report.
Rosen-Zvi, M.; Griffiths, T.; Steyvers, M.; and Smyth, P. 2004. The author-topic model for authors and documents. In UAI, 487-494.
Rybak, J.; Balog, K.; and Nørvåg, K. 2014. Temporal expertise profiling. In ECIR, 540-546.
Wallach, H. M. 2006. Topic modeling: beyond bag-of-words. In ICML, 977-984.
Wang, X., and McCallum, A. 2006. Topics over time: A non-Markov continuous-time model of topical trends. In KDD, 424-433.
Wei, X.; Sun, J.; and Wang, X. 2007. Dynamic mixture models for multiple time-series. In IJCAI, 2909-2914.
Yin, J., and Wang, J. 2014. A Dirichlet multinomial mixture model-based approach for short text clustering. In KDD, 233-242.