Doc2vec Explain
Steven Hill
University of California, San Diego
San Diego, California
PID: A53040946
[email protected]
\[
\frac{1}{T}\sum_{t=k}^{T-k} \log p(w_t \mid w_{t-k}, \ldots, w_{t+k})
\]

This corresponds to maximizing the average log probability. The predicted output word is decided by a multiclass classifier. Using softmax, we have

\[
p(w_t \mid w_{t-k}, \ldots, w_{t+k}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_{w_i}}}
\]

Figure 1: Word2Vec. This (CBOW) model tries to capture the probability of a target word being at the center of the input context words, predicting which word should appear given the surrounding words. Each word in the input context is represented as a one-hot encoding of that context word i, and the output node corresponds to the probability of the target word appearing given the context words. During training, the weights are adjusted to try to push the probability of the target word appearing with the context words to 1.

These models are trained using stochastic gradient descent with back propagation and are commonly known as neural language models. The output depends on the weights of the first and second layers, which together define the multinomial distribution over output words. After training converges, the weights corresponding to each input word and output word are used as vectors for word embeddings. Words with similar semantics are mapped to similar positions in vector space. These vectors can be used for a wide variety of tasks, including word algebra and language translation.

But because this learning is done through back propagation, it is not tractable to perform full softmax classification given the size of the word vocabulary (which can be in the millions). This is a problem for the output end of the model. In practice, either negative sampling or hierarchical softmax is used to speed up training; we forego an in-depth review of these alternative training methods for brevity. With negative sampling, logistic regression is used to compare a positive outcome against sampled "negative" cases. With hierarchical softmax, the probability distribution is encoded in a binary tree, where the left and right branch probabilities generate the probability of seeing a word given its context.

These models are likewise trained through back propagation with approximate methods for weight training. After training, there are vector embeddings for every word and document (if concatenation is used, they can be of different lengths). The authors of this methodology claim state-of-the-art results on sentiment analysis and information retrieval.

3.3 Multi-layer Perceptron Regression
Lastly, before we discuss the joint model for prediction, it helps to look at regression in a neural network setting (shown in Figure 3). The main difference between a neural network regression model and a classification model is that the output is no longer passed through a non-linearity. With proper regularization techniques, these standard models can be trained to have weights that generalize very well on test data.
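To make the softmax step above concrete, here is a minimal numpy sketch of a CBOW forward pass. The vocabulary size, embedding dimension, and random initialization are illustrative assumptions, not the paper's settings; the hidden layer averages the looked-up context embeddings, which is one standard CBOW choice.

```python
import numpy as np

rng = np.random.default_rng(0)

V, D = 10, 4                                 # toy vocabulary size, embedding dimension
W_in = rng.normal(scale=0.1, size=(V, D))    # input (context) embedding matrix
W_out = rng.normal(scale=0.1, size=(V, D))   # output (target) embedding matrix

def cbow_softmax(context_ids, W_in, W_out):
    """p(w_t | context) = exp(y_{w_t}) / sum_i exp(y_{w_i}),
    where y = W_out . h and h averages the context word embeddings."""
    h = W_in[context_ids].mean(axis=0)       # one-hot lookups, averaged
    y = W_out @ h                            # one raw score per vocabulary word
    e = np.exp(y - y.max())                  # subtract max for numerical stability
    return e / e.sum()                       # normalize into a distribution

probs = cbow_softmax([1, 3, 5, 7], W_in, W_out)
```

The full sum over the vocabulary in the denominator is exactly the cost that negative sampling and hierarchical softmax avoid at training time.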
[Figure: network diagrams with document nodes d1 ... dN, item nodes i1 ... iN, user node u1, word nodes w1 ... wM, and rating output ru,i]
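The regression setup of Section 3.3 can be sketched as a two-layer perceptron whose output stays linear. The layer sizes, ReLU hidden activation, and initialization below are assumptions for illustration, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp_regress(x, W1, b1, W2, b2):
    """Two-layer perceptron for regression: a non-linear hidden layer,
    but a plain linear output (no softmax), so any real value can be predicted."""
    h = np.maximum(0.0, W1 @ x + b1)   # ReLU hidden layer
    return W2 @ h + b2                 # linear output: a single predicted rating

d_in, d_hid = 8, 16                    # toy input and hidden sizes
W1 = rng.normal(scale=0.1, size=(d_hid, d_in)); b1 = np.zeros(d_hid)
W2 = rng.normal(scale=0.1, size=(1, d_hid));    b2 = np.zeros(1)

y_hat = mlp_regress(rng.normal(size=d_in), W1, b1, W2, b2)
```

Swapping the final line for a softmax over classes would turn the same network into a classifier, which is the distinction the section draws.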
• Let µ = 10
• Implemented in Tensorflow using the Adam optimizer
• Uses dropout with probability p = 0.9

The hyperparameters were chosen according to several criteria. Word vector sizes were chosen from what has worked heuristically well in past word embedding findings. The user and item dimensionality were picked to keep in line with past work done by [6]. The architecture and µ were selected based on results from random subsets of the data. Little was gained from increasing the hyperparameter sizes, and not enough to decide they were a major factor.

We run 10 trials of experiments, each of which randomly chooses 80% of the data for training, 10% for validation, and 10% for testing. For each trial we stop training once mean squared error has been minimized on the validation set. The results from the 10 trials are then averaged and presented in Table 1. Similar to results we have seen previously, the topic-model-based model that also used review text outperformed a baseline latent factor model. Interestingly enough, what ended up being a major influence over the results was the regularization: all models use L2 regularization, but the models that employed a dropout scheme had the best and most consistent results.

These results are significant because they show us that the joint Doc2Vec embeddings convey useful information (at least more so than previous models). According to [5], Doc2Vec embeddings capture document semantics just as Word2Vec embeddings do for words. A potential reason that our joint model outperforms HFT is that the user and item vectors might contain more semantic information than the document topic distributions achieved through HFT and LDA. It is hard to say whether the performance gain comes from the different training methodology, or whether to attribute it to the architecture and regularization.

Our joint Doc2Vec model creates a nonlinear function that determines the "agreeance" between user and item embeddings. The Doc2Vec part of the model pushes these vectors towards semantically meaningful positions in vector space, while the regression part pushes the semantics to correspond to beer ratings. All the parameters in the network work towards balancing the two tasks, with the hope that one influences the other in a positive way. At least for the task of rating prediction, we can see this goal has been accomplished.

Earlier we mentioned how important dropout regularization is to the model. Figure 5 shows the training curves for one trial of the joint Doc2Vec model with and without dropout.
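The paper's model itself was implemented in Tensorflow; the standalone numpy sketch below just illustrates the inverted-dropout mechanic, reading the reported p = 0.9 as a keep probability (an assumption about the paper's convention). Units are randomly zeroed during training and the survivors rescaled so the expected activation is unchanged, letting the network run unmodified at test time:

```python
import numpy as np

rng = np.random.default_rng(2)

def dropout(h, keep_prob, train=True):
    """Inverted dropout: zero each unit with probability 1 - keep_prob during
    training, and rescale the kept units so the expectation is preserved."""
    if not train:
        return h                              # identity at test time
    mask = rng.random(h.shape) < keep_prob    # True for units that survive
    return h * mask / keep_prob               # rescale survivors by 1/keep_prob

h = np.ones(10_000)
dropped = dropout(h, keep_prob=0.9)
# roughly 10% of units are zeroed; the mean stays near 1.0 in expectation
```

Because some co-adapted weights are absent on every training step, the network cannot rely on any single hidden unit, which is the regularizing effect [12] describes.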
Although not shown in Table 1, in all trials the joint Doc2Vec model performed similarly and outperformed the other models.

5. CONCLUSION
We looked at a joint model that uses both Doc2Vec and neural regression to achieve state-of-the-art performance (at least for the BeerAdvocate dataset and up until 2013). This approach was easily implemented in Tensorflow, and even though the hyperparameters were mostly heuristically chosen, it still outperformed similar model-based methods.
There are still many questions remaining regarding the model we have presented. Would these results hold for other datasets? With better regularization, could the HFT model achieve better results? Is there an ideal setting for the hyperparameters? Although the model is simple in theory, it has a lot of weights that need to be adjusted.
With all the recent work on vector encodings of words and text, it would be interesting to see how this model could be adjusted to further improve upon these results.

6. REFERENCES
[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
[2] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 43–52. Morgan Kaufmann Publishers Inc., 1998.
[3] A.-L. Deng, Y.-Y. Zhu, and B. Shi. A collaborative filtering recommendation algorithm based on item rating prediction. Journal of Software, 14(9):1621–1628, 2003.
[4] Y. Koren and R. Bell. Advances in collaborative filtering. In Recommender Systems Handbook, pages 77–118. Springer, 2015.
[5] Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. In ICML, volume 14, pages 1188–1196, 2014.
[6] J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, pages 165–172. ACM, 2013.
[7] J. McAuley, J. Leskovec, and D. Jurafsky. Learning attitudes and attributes from multi-aspect reviews. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pages 1020–1025. IEEE, 2012.
[8] J. J. McAuley and J. Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. In Proceedings of the 22nd International Conference on World Wide Web, pages 897–908. ACM, 2013.
[9] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[10] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[11] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web, pages 285–295. ACM, 2001.
[12] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.