Doc2vec Explain
Steven Hill
University of California, San Diego
San Diego, California
PID: A53040946
[email protected]
\[
\frac{1}{T}\sum_{t=k}^{T-k} \log p(w_t \mid w_{t-k}, \ldots, w_{t+k})
\]

This corresponds to maximizing the average log probability. The predicted output word is decided by a multiclass classifier. Using softmax, we have

\[
p(w_t \mid w_{t-k}, \ldots, w_{t+k}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_{w_i}}}
\]

Figure 1: Word2Vec. This (CBOW) model tries to capture the probability of a target word being at the center of the input context words, predicting which word should appear given the surrounding words. Each word in the input context is represented as a one-hot encoding of that context word i, and the output node corresponds to the probability of the target word appearing given the context words. During training, the weights are adjusted to try to push the probability of the target word appearing with the context words to 1.

These models are trained using stochastic gradient descent with back propagation and are commonly known as neural language models. The output depends on the weights of the first and second layers, which together define the multinomial distribution over output words. After training converges, the weights corresponding to each input word and output word are used as vectors for word embeddings. Words with similar semantics are mapped to similar positions in vector space. These vectors can be used for a wide variety of tasks, including word algebra and language translation.

But because this learning is done through back propagation, it is not tractable to perform full softmax classification given the size of the word vocabulary (which can be in the millions). This is a problem for the output end of the model. In practice, either negative sampling or hierarchical softmax is used to speed up training; we forego an in-depth review of these alternative training methods for brevity. With negative sampling, logistic regression is used to compare a positive outcome against sampled "negative" cases. With hierarchical softmax, the probability distribution is encoded in a binary tree, where the left and right branch probabilities generate the probability of seeing a word given its context.

These models are likewise trained through back propagation with approximate methods for weight training. After training, there are vector embeddings for every word and document (if concatenation is used, they can be of different lengths). The authors of this methodology claim state-of-the-art results on sentiment analysis and information retrieval.

3.3 Multi-layer Perceptron Regression
Lastly, before we discuss the joint model for prediction, it helps to look at regression in a neural network setting (shown in Figure 3). The main difference between a neural network regression model and a classification model is that the output is no longer passed through a non-linearity. With proper regularization techniques, these standard models can be trained to have weights that generalize very well on test data.
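To make the softmax step above concrete, here is a minimal numpy sketch of a CBOW forward pass. The vocabulary size, embedding dimension, and random initialization are illustrative assumptions, not the paper's settings; the hidden layer averages the looked-up context embeddings, which is one standard CBOW choice.

```python
import numpy as np

rng = np.random.default_rng(0)

V, D = 10, 4                                 # toy vocabulary size, embedding dimension
W_in = rng.normal(scale=0.1, size=(V, D))    # input (context) embedding matrix
W_out = rng.normal(scale=0.1, size=(V, D))   # output (target) embedding matrix

def cbow_softmax(context_ids, W_in, W_out):
    """p(w_t | context) = exp(y_{w_t}) / sum_i exp(y_{w_i}),
    where y = W_out . h and h averages the context word embeddings."""
    h = W_in[context_ids].mean(axis=0)       # one-hot lookups, averaged
    y = W_out @ h                            # one raw score per vocabulary word
    e = np.exp(y - y.max())                  # subtract max for numerical stability
    return e / e.sum()                       # normalize into a distribution

probs = cbow_softmax([1, 3, 5, 7], W_in, W_out)
```

The full sum over the vocabulary in the denominator is exactly the cost that negative sampling and hierarchical softmax avoid at training time.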
[Figure: network diagrams with document nodes d1 ... dN, item nodes i1 ... iN, user node u1, word nodes w1 ... wM, and rating output ru,i]
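The regression setup of Section 3.3 can be sketched as a two-layer perceptron whose output stays linear. The layer sizes, ReLU hidden activation, and initialization below are assumptions for illustration, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp_regress(x, W1, b1, W2, b2):
    """Two-layer perceptron for regression: a non-linear hidden layer,
    but a plain linear output (no softmax), so any real value can be predicted."""
    h = np.maximum(0.0, W1 @ x + b1)   # ReLU hidden layer
    return W2 @ h + b2                 # linear output: a single predicted rating

d_in, d_hid = 8, 16                    # toy input and hidden sizes
W1 = rng.normal(scale=0.1, size=(d_hid, d_in)); b1 = np.zeros(d_hid)
W2 = rng.normal(scale=0.1, size=(1, d_hid));    b2 = np.zeros(1)

y_hat = mlp_regress(rng.normal(size=d_in), W1, b1, W2, b2)
```

Swapping the final line for a softmax over classes would turn the same network into a classifier, which is the distinction the section draws.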
• Let µ = 10
• Implemented in Tensorflow using the Adam optimizer
• Uses dropout with probability p = 0.9

The hyperparameters were chosen according to several criteria. Word vector sizes were chosen from what has worked heuristically well in past word embedding findings. The user and item dimensionality were picked to keep in line with past work done by [6]. The architecture and µ were selected based on results from random subsets of the data. Little was gained from increasing the hyperparameter sizes, and not enough to decide they were a major factor.

We run 10 trials of experiments, each of which randomly chooses 80% of the data for training, 10% for validation, and 10% for testing. For each trial we stop training once mean squared error has been minimized on the validation set. The results from the 10 trials are then averaged and presented in Table 1. Similar to results we have seen previously, the topic-model-based model that also used review text outperformed a baseline latent factor model. Interestingly enough, what ended up being a major influence over the results was the regularization: all models use L2 regularization, but the models that employed a dropout scheme had the best and most consistent results.

These results are significant because they show us that the joint Doc2Vec embeddings convey useful information (at least more so than previous models). According to [5], Doc2Vec embeddings capture document semantics just as Word2Vec embeddings do for words. A potential reason that our joint model outperforms HFT is that the user and item vectors might contain more semantic information than the document topic distributions achieved through HFT and LDA. It is hard to say whether the performance gain comes from the different training methodology, or whether to attribute it to the architecture and regularization.

Our joint Doc2Vec model creates a nonlinear function that determines the "agreeance" between user and item embeddings. The Doc2Vec part of the model pushes these vectors towards semantically meaningful positions in vector space, while the regression part pushes the semantics to correspond to beer ratings. All the parameters in the network work towards balancing the two tasks, with the hope that one influences the other in a positive way. At least for the task of rating prediction, we can see this goal has been accomplished.

Earlier we mentioned how important dropout regularization is to the model. Figure 5 shows the training curves for one trial of the joint Doc2Vec model with and without dropout.
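The paper's model itself was implemented in Tensorflow; the standalone numpy sketch below just illustrates the inverted-dropout mechanic, reading the reported p = 0.9 as a keep probability (an assumption about the paper's convention). Units are randomly zeroed during training and the survivors rescaled so the expected activation is unchanged, letting the network run unmodified at test time:

```python
import numpy as np

rng = np.random.default_rng(2)

def dropout(h, keep_prob, train=True):
    """Inverted dropout: zero each unit with probability 1 - keep_prob during
    training, and rescale the kept units so the expectation is preserved."""
    if not train:
        return h                              # identity at test time
    mask = rng.random(h.shape) < keep_prob    # True for units that survive
    return h * mask / keep_prob               # rescale survivors by 1/keep_prob

h = np.ones(10_000)
dropped = dropout(h, keep_prob=0.9)
# roughly 10% of units are zeroed; the mean stays near 1.0 in expectation
```

Because some co-adapted weights are absent on every training step, the network cannot rely on any single hidden unit, which is the regularizing effect [12] describes.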
Although not shown in Table 1, in all trials the joint Doc2Vec model performed similarly and outperformed the other models.

5. CONCLUSION
We looked at a joint model that uses both Doc2Vec and neural regression to achieve state-of-the-art performance (at least for the BeerAdvocate dataset and up until 2013). This approach was easily implemented in Tensorflow, and even though the hyperparameters were mostly heuristically chosen, it still outperformed similar model-based methods.
There are still many questions remaining regarding the model we have presented. Would these results hold for other datasets? With better regularization, could the HFT model achieve better results? Is there an ideal setting for the hyperparameters? Although the model is simple in theory, it has a lot of weights that need to be adjusted.
With all the recent work on vector encodings of words and text, it would be interesting to see how this model could be adjusted to further improve upon these results.

6. REFERENCES
[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
[2] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 43–52. Morgan Kaufmann Publishers Inc., 1998.
[3] A.-L. Deng, Y.-Y. Zhu, and B. Shi. A collaborative filtering recommendation algorithm based on item rating prediction. Journal of Software, 14(9):1621–1628, 2003.
[4] Y. Koren and R. Bell. Advances in collaborative filtering. In Recommender Systems Handbook, pages 77–118. Springer, 2015.
[5] Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. In ICML, volume 14, pages 1188–1196, 2014.
[6] J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, pages 165–172. ACM, 2013.
[7] J. McAuley, J. Leskovec, and D. Jurafsky. Learning attitudes and attributes from multi-aspect reviews. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pages 1020–1025. IEEE, 2012.
[8] J. J. McAuley and J. Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. In Proceedings of the 22nd International Conference on World Wide Web, pages 897–908. ACM, 2013.
[9] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[10] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[11] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web, pages 285–295. ACM, 2001.
[12] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.