Semantic Product Search
In this paper, we address the question: Given rich customer behavior data, can we train a deep learning model to retrieve matching products in response to a query? Intuitively, there is reason to believe that customer behavior logs contain semantic information; customers who are intent on purchasing a product circumvent the limitations of lexical matching by query reformulation or by deeper exploration of the search results. The challenge is the sheer magnitude of the data as well as the presence of noise, a challenge that modern deep learning techniques address very effectively.

Product search is different from web search as the queries tend to be shorter and the positive signals (purchases) are sparser than clicks. Models based on conversion rates or click-through rates may incorrectly favor accessories (like a phone cover) over the main product (like a cell phone). This is further complicated by shoppers maintaining multiple intents during a single search session: a customer may be looking for a specific television model while also looking for accessories for this item at the lowest price and browsing additional products to qualify for free shipping. A product search engine should reduce the effort needed from a customer with a specific mission (narrow queries) while allowing shoppers to explore when they are looking for inspiration (broad queries).

As mentioned, product search typically operates in two stages: matching and ranking. Products that contain words in the query (Q_i) are the primary candidates. Products that have prior behavioral associations (products bought or clicked after issuing a query Q_i) are also included in the candidate set. The ranking step takes these candidates and orders them using a machine-learned rank function to optimize for customer satisfaction and business metrics.

We present a neural network trained with large amounts of purchase and click signals to complement a lexical search engine in ad hoc product retrieval. Our first contribution is a loss function with a built-in threshold to differentiate between random negative, impressed but not purchased, and purchased items. Our second contribution is the empirical result that recommends average pooling in combination with n-grams that capture short-range linguistic patterns instead of more complex architectures. Third, we show the effectiveness of consistent token hashing in Siamese networks for zero-shot learning and handling out-of-vocabulary tokens.

In Section 2, we highlight related work. In Section 3, we describe our model architecture, loss functions, and tokenization techniques, including our approach for unseen words. We then introduce the reader to the data and our input representations for queries and products in Section 4. Section 5 presents the evaluation metrics and our results. We provide implementation details and optimizations to efficiently train the model with large amounts of data in Section 6. Finally, we conclude in Section 7 with a discussion of future work.

[Figure 1: System architecture for augmenting product matching using semantic matching]

2 RELATED WORK
There is a rich literature in natural language processing (NLP) and information retrieval (IR) on capturing the semantics of queries and documents. Word2vec [18] garnered significant attention by demonstrating the use of word embeddings to capture semantic structure; synonyms cluster together in the embedding space. This technique was successfully applied to document ranking for web search with the DESM model [20]. Building from the ideas in word2vec, Diaz et al. [6] trained neural word embeddings to find neighboring words to expand queries with synonyms. Ultimately, based on these recent advancements and other key insights, the state-of-the-art models for semantic search can generally be classified into three categories:

(1) Latent Factor Models: nonlinear matrix completion approaches that learn query- and document-level embeddings without using their content.
(2) Factorized Models: approaches that separately convert queries and documents to low-dimensional embeddings based on their content.
(3) Interaction Models: approaches that build interaction matrices between the query and document text and use neural networks to mine patterns from the interaction matrix.

Deerwester et al. [5] introduced Latent Semantic Analysis (LSA), which computes a low-rank factorization of a term-document matrix to identify semantic concepts; it was further refined by [1, 7] and extended with ideas from Latent Dirichlet Allocation (LDA) [2] in [27]. In 2013, Huang et al. [11] published the seminal paper in the space of factorized models by introducing the Deep Semantic Similarity Model (DSSM). Inspired by LSA and Semantic Hashing [23], DSSM involves training an end-to-end deep neural network with a discriminative loss to learn a fixed-width representation for queries and documents. Fully connected units in the DSSM architecture were subsequently replaced with Convolutional Neural Networks (CNNs) [10, 24] and Recurrent Neural Networks (RNNs) [21] to respect word ordering. In an alternate approach, which articulated the idea of interaction models, Guo et al. [9] introduced the Deep Relevance Matching Model (DRMM), which leverages an interaction matrix to capture local term matching within neural approaches and has been successfully extended by MatchPyramid [22] and other techniques [12–14, 26, 29]. Nevertheless, these interaction methods require memory and computation proportional to the number of words in the document and hence are prohibitively expensive for online inference. In addition, Duet [19] combines local (interaction-based) and distributed (factorized) representations of text for matching.
3 MODEL
3.1 Neural Network Architecture
Our neural network architecture is shown in Figure 2. As in the distributed arm of the Duet model, our first model component is an embedding layer that consists of |V| × N parameters, where V is the vocabulary and N is the embedding dimension. Each row corresponds to the parameters for a word. Unlike Duet, we share our embeddings across the query and product. Intuitively, sharing the embedding layer in a Siamese network works well, capturing local word-level matching even before training these networks. Our experiments in Table 7 confirm this intuition. We discuss the specifics of our query and product representation in Section 4.

[Figure 2: Illustration of neural network architecture used for semantic search]

To generate a fixed-length embedding for the query (E_Q) and the product (E_P) from individual word embeddings, we use average pooling after observing little difference (<0.5%) in both MAP and Recall@100 relative to recurrent approaches like LSTM and GRU (see Table 2). Average pooling also requires far less computation, reducing training time and inference latency. We reconciled this departure from state-of-the-art solutions for Question Answering and other NLP tasks through an analysis that showed that, unlike web search, both query and product information tend to be shorter, without long-range dependencies. Additionally, product search queries do not contain stop words and typically require every query word (or its synonym) to be present in the product.

Queries typically have fewer words than the product content. Because of this, we observed a noticeable difference in the magnitude of query and product embeddings. This was expected, as the query and the product models were shared with no additional parameters to account for this variance. Hence, we introduced Batch Normalization layers [15] after the pooling layers for the query and the product arms. Finally, we compute the cosine similarity between E_Q and E_P. During online A/B testing, we precompute E_P for all the products in the catalog and use a k-Nearest Neighbors algorithm to retrieve the most similar products to a given query Q_i.

3.2 Loss Function
A critical decision when employing a vector space model is defining a match, especially in product search, where there is an important tradeoff between precision and recall. For example, accessories like mounts may also be relevant for the query "led tv." Pruning results based on a threshold is a common practice to identify the match set. Pointwise loss functions, such as mean squared error (MSE) or mean absolute error (MAE), require an additional step post-training to identify the threshold. Pairwise loss functions do not provide guarantees on the magnitude of scores (only on relative ordering) and thus do not work well in practice with threshold-based pruning. Hence, we started with a pointwise 2-part hinge loss function, shown in Equation (1), that maximizes the similarity between the query and a purchased product while minimizing the similarity between a query and random products.

Define ŷ := cos(E_Q, E_P), and let y = 1 if product P is purchased in response to query Q and y = 0 otherwise. Furthermore, let ℓ+(ŷ) := (−min(0, ŷ − ϵ+))^m and ℓ−(ŷ) := (max(0, ŷ − ϵ−))^m for some predefined thresholds ϵ+ and ϵ− and m ∈ {1, 2}. The 2-part hinge loss is defined as

L(ŷ, y) := y · ℓ+(ŷ) + (1 − y) · ℓ−(ŷ)   (1)

Intuitively, the loss ensures that when y = 0, ŷ is less than ϵ−, and when y = 1, ŷ is above ϵ+. After some empirical tuning on a validation set, we set ϵ+ = 0.9 and ϵ− = 0.2.

As shown in Table 1, the 2-part hinge loss improved offline matching performance by more than 2X over the MSE baseline. However, in Figure 3, a large overlap in score distribution between positives and negatives can be seen. Furthermore, the score distribution for negatives appeared bimodal. After manually inspecting the negatives in the overlapping region, we found that they were largely impressed but not purchased products, whose text is much closer to that of positives than random negatives are; this observation motivated a 3-part hinge loss that introduces a separate threshold for impressed negatives, treating them differently from random negatives (see Section 5.2).
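To make this concrete, below is a minimal PyTorch sketch of the shared-embedding Siamese encoder (average pooling, per-arm batch normalization, cosine similarity) together with the 2-part hinge loss of Equation (1). It is an illustration only: the class and function names, tensor shapes, and defaults (embedding dimension 256, ϵ+ = 0.9, ϵ− = 0.2, m = 2) are assumptions layered on the description above, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticSearchModel(nn.Module):
    """Siamese query/product encoder: shared embedding table, average pooling,
    batch normalization per arm, cosine similarity (a sketch of Figure 2)."""
    def __init__(self, vocab_size: int, embed_dim: int = 256, pad_id: int = 0):
        super().__init__()
        # One embedding table shared by the query and product arms.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
        # Batch normalization applied to each arm after pooling.
        self.query_bn = nn.BatchNorm1d(embed_dim)
        self.product_bn = nn.BatchNorm1d(embed_dim)
        self.pad_id = pad_id

    def encode(self, token_ids: torch.Tensor, bn: nn.BatchNorm1d) -> torch.Tensor:
        emb = self.embedding(token_ids)                              # (batch, seq, dim)
        mask = (token_ids != self.pad_id).unsqueeze(-1).float()      # ignore padding
        pooled = (emb * mask).sum(1) / mask.sum(1).clamp(min=1.0)    # average pooling
        return bn(pooled)

    def forward(self, query_ids: torch.Tensor, product_ids: torch.Tensor) -> torch.Tensor:
        e_q = self.encode(query_ids, self.query_bn)
        e_p = self.encode(product_ids, self.product_bn)
        return F.cosine_similarity(e_q, e_p, dim=-1)                 # y_hat in [-1, 1]

def two_part_hinge_loss(y_hat, y, eps_pos=0.9, eps_neg=0.2, m=2):
    """Equation (1): push purchased pairs above eps_pos, random pairs below eps_neg."""
    l_pos = torch.clamp(eps_pos - y_hat, min=0.0) ** m   # (-min(0, y_hat - eps_pos))^m
    l_neg = torch.clamp(y_hat - eps_neg, min=0.0) ** m   # (max(0, y_hat - eps_neg))^m
    return (y * l_pos + (1.0 - y) * l_neg).mean()
```

At serving time, the product arm of such a model can be run once to precompute E_P for the catalog, and a query embedding can then be matched against those vectors by cosine similarity through a k-Nearest Neighbors index, as described above.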
Rather than learning separate projections for these n-grams, we found that the simple approach of combining all tokens in a single bag-of-tokens performs well. We conclude this section by referring the reader to Figure 5, which walks through our tokenization methods for the example "artistic iphone 6s case". In Table 6, we show example queries and retrieved products to highlight the ability of our best model to handle synonyms, intents, and spelling errors, and its overall robustness.

[Figure 5: Aggregation of different tokenization methods illustrated with the processing of "artistic iphone 6s case"]

4 DATA
We use 11 months of search logs as training data and 1 month as evaluation. We sample 54 billion query-product training pairs. We preprocess these sampled pairs to 650 million rows by grouping the training data by query-product pairs over the entire time period and using the aggregated counts as weights for the pairs. We also decrease the training time by 3X by preprocessing the training data into tokens and using mmap to store the tokens. More details on our best practices for reducing training time can be found in Section 6.
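One way to realize the mmap trick above is sketched below; this is an illustrative assumption (the file name, fixed row width, and dtype are made up), not the authors' pipeline. Token ids are written once as a flat binary file and then memory-mapped, so training workers can slice batches without re-tokenizing or loading the full dataset into RAM.

```python
import numpy as np

MAX_LEN = 32  # assumed fixed number of token ids stored per query-product row

def write_token_file(rows, path="tokens.int32.bin"):
    """One-time preprocessing: pad each row of token ids and append it to a flat file."""
    with open(path, "wb") as f:
        for token_ids in rows:
            padded = np.zeros(MAX_LEN, dtype=np.int32)
            padded[:min(len(token_ids), MAX_LEN)] = token_ids[:MAX_LEN]
            f.write(padded.tobytes())

# Training time: memory-map the file; rows are read lazily from disk as they are sliced.
write_token_file([[5, 9, 2], [7, 1, 1, 3]])
tokens = np.memmap("tokens.int32.bin", dtype=np.int32, mode="r").reshape(-1, MAX_LEN)
batch = tokens[:2]   # e.g. the first rows of a training batch
```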
For a given customer query, each product is in exactly one of three categories: purchased, impressed but not purchased, or random. For each query, we target a ratio of 6 impressed and 7 random products for every query-product purchase. We sample this way to train the model for both matching and ranking, although in this paper we focus on matching. Intuitively, matching should differentiate purchased and impressed products from random ones; ranking should differentiate purchased products from impressed ones.

We choose the most frequent words to build our vocabulary, referred to as |V|. Each token in the vocabulary is assigned a unique numeric token id, while remaining tokens are assigned 0 or a hashing-based identifier. Queries are lowercased, split on whitespace, and converted into a sequence of token ids. We truncate the query tokens at the 99th percentile by length. Token vectors that are smaller than the predetermined length are padded to the right.
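The following is a minimal sketch of the query tokenization just described, combined with the unigram + bigram + character-trigram + OOV-hashing scheme evaluated later in Table 3. The vocabulary contents, truncation length, bin count, and hash function below are illustrative assumptions, not the production values.

```python
import zlib

MAX_TOKENS = 20          # assumed 99th-percentile truncation length
NUM_OOV_BINS = 500_000   # out-of-vocabulary bins, as in Table 3
VOCAB = {"artistic": 1, "iphone": 2, "6s": 3, "case": 4}   # toy vocabulary; id 0 = padding/unknown

def word_ngrams(words, n):
    return ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_trigrams(word):
    padded = "#" + word + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def tokenize(text, use_bigrams=True, use_trigrams=True, oov_hashing=True):
    words = text.lower().split()
    tokens = list(words)
    if use_bigrams:
        tokens += word_ngrams(words, 2)
    if use_trigrams:
        for w in words:
            tokens += char_trigrams(w)
    ids = []
    for tok in tokens:
        if tok in VOCAB:
            ids.append(VOCAB[tok])
        elif oov_hashing:
            # consistent hashing of unseen tokens into a reserved range of OOV bins
            ids.append(len(VOCAB) + 1 + zlib.crc32(tok.encode("utf-8")) % NUM_OOV_BINS)
        else:
            ids.append(0)
    ids = ids[:MAX_TOKENS]
    return ids + [0] * (MAX_TOKENS - len(ids))   # right-pad to a fixed length

print(tokenize("Artistic iPhone 6s Case"))
```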
Products have multiple attributes, like title, brand, and color, that are material to the matching process. We evaluated architectures that embed every attribute independently and concatenate the resulting embeddings to obtain the final product representation. However, large variability in the accuracy and availability of structured data across products led to 5% lower recall than simply concatenating the attributes. Hence, we decided to use an ordered bag of words of these attributes.

5 EXPERIMENTS
In this section, we describe our metrics, training procedure, and the results, including the impact of our method in production.

5.1 Metrics
We define two evaluation subtasks: matching and ranking.
(1) Matching: The goal of the matching task is to retrieve all relevant documents from a large corpus for a given query. In order to measure the matching performance, we first sample a set of 20K queries. We then evaluate the model's ability to recall purchased products from a sub-corpus of 1 million products for those queries. Note that the 1 million product corpus contains purchased and impressed products for every query from the evaluation period as well as additional random negatives. We tune the model hyperparameters to maximize Recall@100 and Mean Average Precision (MAP); a minimal computation sketch for these metrics follows this list.
(2) Ranking: The goal of this task is to order a set of documents by relevance, defined as purchase count conditioned on the query. The set of documents contains purchased and impressed products. We report standard information retrieval ranking metrics, such as Normalized Discounted Cumulative Gain (NDCG) and Mean Reciprocal Rank (MRR).
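As a reference for the two matching metrics, here is a simplified, self-contained sketch (the toy data and function names are made up; this is not the authors' evaluation code). Recall@100 is the fraction of purchased products recovered in the top 100 retrieved results, and average precision averages precision over the rank positions that hold purchased products.

```python
def recall_at_k(ranked_products, purchased, k=100):
    """Fraction of purchased products that appear in the top-k retrieved list."""
    top_k = set(ranked_products[:k])
    return len(top_k & purchased) / max(len(purchased), 1)

def average_precision(ranked_products, purchased):
    """Mean of precision@i over rank positions i that hold a purchased product."""
    hits, precisions = 0, []
    for i, product in enumerate(ranked_products, start=1):
        if product in purchased:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / max(len(purchased), 1)

# Averaging over the evaluation queries gives Recall@100 and MAP.
queries = [
    {"ranked": ["p1", "p7", "p3"], "purchased": {"p1", "p3"}},
    {"ranked": ["p9", "p2"], "purchased": {"p2"}},
]
print(sum(recall_at_k(q["ranked"], q["purchased"]) for q in queries) / len(queries))
print(sum(average_precision(q["ranked"], q["purchased"]) for q in queries) / len(queries))
```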
5.2 Results
In this section, we present the durable learnings from thousands of experiments. We fix the embedding dimension to 256, weight matrix initialization to Xavier initialization [8], batch size to 8192, and the optimizer to ADAM with the configuration α = 0.001, β1 = 0.9, β2 = 0.999, ϵ = 10^−8 for all the results presented. We refer to the hinge losses defined in Section 3.2 with m = 1 and m = 2 as the L1 and L2 variants respectively. Unigram tokenization is used in Table 1 and Table 2, as the relative ordering of results does not change with other more sophisticated tokenizations.

We present the results of different loss functions in Table 1. We see that the L2 variant of each loss consistently outperforms the L1. We hypothesize that L2 variants are robust to outliers in cosine similarity. The 3-part hinge loss outperforms the 2-part hinge loss in matching metrics in all experiments, although the two loss functions have similar ranking performance. By considering impressed negatives, whose text is usually more similar to that of positives than random negatives are, separately from random negatives in the 3-part hinge loss, the scores for positives and random negatives become better separated, as shown in Section 3.2. The model can better differentiate between positives and random negatives, improving Recall and MAP. Because the ranking task is not distinguishing between relevant and random products but instead focuses on ordering purchased and impressed products, it is not surprising that the 2-part and 3-part loss functions have similar performance.
Table 1: Loss Function Experiments using Unigram Tokenization and Average Pooling
Loss Recall MAP Matching NDCG Matching MRR Ranking NDCG Ranking MRR
BCE 0.586 0.486 0.695 0.473 0.711 0.954
MAE 0.044 0.020 0.275 0.192 0.611 0.905
MSE 0.238 0.144 0.490 0.377 0.680 0.948
2 Part L1 0.485 0.384 0.694 0.472 0.742 0.966
3 Part L1 0.691 0.616 0.762 0.536 0.760 0.971
2 Part L2 0.651 0.576 0.768 0.549 0.776* 0.973*
3 Part L2 0.735* 0.664* 0.791* 0.591* 0.772 0.973*
Table 2: Pooling Experiments using Unigram Tokenization
Loss Pooling Recall MAP Matching NDCG Matching MRR Ranking NDCG Ranking MRR
MSE ave 0.238 0.144 0.490 0.377 0.680 0.948
MSE gru 0.105 0.052 0.431 0.348 0.700 0.951
MSE lstm 0.102 0.048 0.404 0.286 0.697 0.948
3 Part L1 ave 0.691 0.616 0.762 0.536 0.760 0.971
3 Part L1 gru 0.651 0.574 0.701 0.376 0.727 0.965
3 Part L1 lstm 0.661 0.588 0.730 0.469 0.739 0.964
3 Part L2 ave 0.735 0.664 0.791* 0.591* 0.772 0.973
3 Part L2 gru 0.739* 0.659 0.777 0.578 0.775* 0.975*
3 Part L2 lstm 0.738 0.666* 0.767 0.527 0.775* 0.976*
Table 3: Tokenization Experiments with Average Pooling and 3 Part L2 Hinge Loss
Tokenization Recall MAP Matching NDCG Matching MRR Ranking NDCG Ranking MRR
Char Trigrams 0.673 0.586 0.718 0.502 0.741 0.955
Unigrams 0.735 0.664 0.791 0.591 0.772 0.973
Unigrams+Bigrams 0.758 0.696 0.784 0.577 0.768 0.974
Unigrams+Bigrams+Char Trigrams 0.764 0.707 0.800 0.615 0.794* 0.978
Unigrams+OOV 0.752 0.694 0.799 0.633 0.791 0.978
Unigrams+Bigrams+OOV 0.789 0.741 0.790 0.610 0.776 0.979
Unigrams+Bigrams+Char Trigrams+OOV 0.794* 0.745* 0.810* 0.659* 0.794* 0.980*
Unigrams(500K) 0.745 0.683 0.799 0.629 0.784 0.975
Word Unigrams(125K)+OOV(375K) 0.753 0.694 0.804 0.612 0.788 0.979
In Table 2, we present the results of using LSTM, GRU, and averaging to aggregate the token embeddings. Averaging performs similar to or slightly better than recurrent units with significantly less training time. As mentioned in Section 3.1, in the product search setting, queries and product titles tend to be relatively short, so averaging is sufficient to capture the short-range dependencies that exist in queries and product titles. Furthermore, recurrent methods are more expressive but introduce specialization between the query and title. Consequently, local word-level matching between the query and the product title may not be captured as well.

In Table 3, we compare the performance of using different tokenization methods. We use average pooling and the 3-part L2 hinge loss. For each tokenization method, we select the top k terms by frequency in the training data. Unless otherwise noted, k was set to 125K, 25K, 64K, and 500K for unigrams, bigrams, character trigrams, and out-of-vocabulary (OOV) bins respectively. It is worth noting that using only character trigrams, which was an essential component of DSSM [11], has competitive ranking but not matching performance compared to unigrams. Adding bigrams improves matching performance, as bigrams capture short phrase-level information that is not captured by averaging unigrams. For example, the unigrams for "chocolate milk" and "milk chocolate" are the same although these are different products. Additionally, including character trigrams improves the performance further, as character trigrams provide generalization and robustness to spelling errors. Adding OOV hashing improves the matching performance, as it allows better generalization to infrequent or unseen terms, with the caveat that it introduces additional parameters. To differentiate between the impact of additional parameters and OOV hashing, the last two rows in Table 3 compare 500K unigrams to 125K unigrams and 375K OOV bins. These models have the same number of parameters, but the model with OOV hashing performs better.
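As a tiny, self-contained illustration of the word-order point above (the helper below is hypothetical, not from the paper): with unigrams alone, "chocolate milk" and "milk chocolate" produce identical bags of tokens, while adding bigrams separates them.

```python
def bag(text, use_bigrams=False):
    """Bag-of-tokens for a phrase: sorted unigrams, optionally plus word bigrams."""
    words = text.lower().split()
    tokens = sorted(words)
    if use_bigrams:
        tokens += sorted("_".join(words[i:i + 2]) for i in range(len(words) - 1))
    return tokens

print(bag("chocolate milk") == bag("milk chocolate"))                                      # True
print(bag("chocolate milk", use_bigrams=True) == bag("milk chocolate", use_bigrams=True))  # False
```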
In Table 4, we present the results of using batch normalization, layer normalization, or neither on the aggregated query and product embeddings. The "Query Sorted" column refers to whether all positive, impressed, and random negative examples for a single query appear together or are shuffled throughout the data. The best matching performance is achieved using batch normalization and shuffled data. Using sorted data has a significantly negative impact on performance when using batch normalization but not when using layer normalization. Possibly, the batch estimates of mean and variance are highly biased when using sorted data.

Finally, in Table 5, we compare the results of our model to four baselines: DSSM [11], Match Pyramid [22], ARC-II [10], and our model with frozen, randomly initialized embeddings. We only use word unigrams or character trigrams in our model, as it is not immediately clear how to extend the bag-of-tokens approach to methods that incorporate ordering. We compare the performance of using the 3-part L2 hinge loss to the original loss presented for each model. Across all baselines, matching performance of the model improves using the 3-part L2 hinge loss. ARC-II and Match Pyramid ranking performance is similar or lower when using the 3-part loss. Ranking performance improves for DSSM, possibly because the original approach uses only random negatives to approximate the softmax normalization. More complex models, like Match Pyramid and ARC-II, had significantly lower matching and ranking performance while taking significantly longer to train and evaluate. These models are also much harder to tune and tend to overfit.

The embeddings in our model are trained end-to-end. Previous experiments using other methods, including GloVe and word2vec, to initialize the embeddings yielded poorer results than end-to-end training. When comparing our model with randomly initialized embeddings to one with trained embeddings, we see that end-to-end training results in over a 3X improvement in Recall@100 and MAP.

5.3 Online Experiments
We ran a total of three online match set augmentation experiments on a large e-commerce website across three categories: toys and games, kitchen, and pets. In all experiments, the conversion rate, revenue, and other key performance indicators (KPIs) statistically significantly increased. One challenge we faced with our semantic search solution was weeding out irrelevant results to meet the precision bar for production search quality. To boost the precision of the final results, we added guard rails through additional heuristics and ranking models to filter irrelevant products. A qualitative analysis of the augmented search results coupled with an increase in relevant business metrics provided us with compelling evidence that this approach contributed to our goal of helping customers effortlessly complete their shopping missions.

6 TRAINING ACCELERATION
During our offline experiments, we saw an average of 10% improvement in matching metrics by increasing the data from 200 million to 1.2 billion query-product pairs. In this section, we describe our multi-GPU training techniques for efficiently handling these larger datasets. Most parameters for this model lie in the embedding layer, and hence data parallel training approaches have high communication overhead. Furthermore, data parallel training limits the embedding matrix size, as the model must fit in a single GPU. The simplicity of average pooling combined with the separability of the Siamese architecture allows us to use model parallel training to increase the throughput. Letting k represent the embedding dimension and n represent the number of GPUs, the similarity function of our model is shown in Equation (3). The embedding matrix is split among the GPUs along the embedding dimension. The input is sent to all GPUs to look up the partial token embeddings and average them. Sending the input to all GPUs tends to be inexpensive, as the number of tokens is small in comparison with the token embeddings or the embedding matrix. Simply concatenating the partial average embeddings across GPUs requires O(2k) communication of floating point numbers per example in both forward and backward propagation. Equations (6) and (7) show how instead the cosine similarity can be decomposed to transmit only the partial sums and partial sums of squares. With this decomposition, we incur a constant communication cost of 3 scalars per GPU.

Sim(Q, P) = cos(E_Q, E_P)   (3)

cos(a, b) = \frac{a \cdot b}{\|a\|_2 \|b\|_2} = \frac{\sum_{i=1}^{k} a_i b_i}{\sqrt{\sum_{i=1}^{k} a_i^2} \sqrt{\sum_{i=1}^{k} b_i^2}}   (4)

Splitting the cosine similarity computation across n GPUs, with each GPU holding a slice of width

r = k/n,   (5)

\sum_{i=1}^{k} a_i b_i = \sum_{j=1}^{n} \sum_{l=1}^{r} a_{r(j-1)+l} \, b_{r(j-1)+l}   (6)

\sum_{i=1}^{k} a_i^2 = \sum_{j=1}^{n} \sum_{l=1}^{r} a_{r(j-1)+l}^2   (7)

Results from these experiments are shown in Figure 6. We ran experiments on a single AWS p3.16xlarge machine with 8 NVIDIA Tesla V100 GPUs (16GB), Intel Xeon E5-2686v4 processors, and 488GB of RAM. The training was run 5 times with 10 million examples. The median time, scaled to 1 billion examples, is reported.

To achieve scaling, we had to ensure that the gradient variables were placed on the same GPUs as their corresponding operations. This allows greater distribution of memory usage and computation across all GPUs. Unsurprisingly, splitting the model across GPUs for smaller embedding dimensions (<256) increases the overall training time. But beyond an embedding dimension of 512, the communication overhead is less than the additional computational power. Note that the training time is almost linear at a constant embedding dimension per GPU. In other words, training with an embedding dimension of 2048 on 2 GPUs and an embedding dimension of 1024 on 1 GPU have similar speeds. In Figure 6, this is shown by the dotted lines connecting points with the same embedding dimension per GPU. With ideal scaling, the lines would be horizontal.
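The decomposition in Equations (5)-(7) can be checked numerically with a short NumPy sketch. Each shard (standing in for a GPU) holds a slice of the embedding dimension and reports only three scalars, its partial dot product and two partial sums of squares, from which the exact cosine similarity is reconstructed. The shapes and random data below are assumptions for illustration.

```python
import numpy as np

k, n = 1024, 8                      # embedding dimension and number of shards ("GPUs")
r = k // n                          # Equation (5): slice width per shard
rng = np.random.default_rng(0)
a, b = rng.normal(size=k), rng.normal(size=k)   # pooled query / product embeddings

# Each shard j computes three scalars from its slices a_j, b_j.
partials = []
for j in range(n):
    a_j, b_j = a[j * r:(j + 1) * r], b[j * r:(j + 1) * r]
    partials.append((a_j @ b_j, a_j @ a_j, b_j @ b_j))

# Combine the per-shard scalars: Equations (6) and (7).
dot = sum(p[0] for p in partials)
norm_a = np.sqrt(sum(p[1] for p in partials))
norm_b = np.sqrt(sum(p[2] for p in partials))
cos_sharded = dot / (norm_a * norm_b)

cos_full = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
assert np.isclose(cos_sharded, cos_full)        # identical up to floating point error
```

The three scalars per shard correspond exactly to the constant 3-scalars-per-GPU communication cost mentioned above.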
Table 4: Normalization Experiments (Batch, Layer, or None) with Query-Sorted (T) vs. Shuffled (F) Data
Query Sorted Normalization Recall MAP Matching NDCG Matching MRR Ranking NDCG Ranking MRR
T batch 0.730 0.663 0.763 0.553 0.751 0.970
T layer 0.782 0.733 0.817* 0.649 0.812* 0.982*
T none 0.780 0.722 0.798 0.616 0.799 0.976
F batch 0.794* 0.745* 0.810 0.659* 0.794 0.980
F layer 0.791 0.743 0.807 0.629 0.797 0.980
F none 0.784 0.728 0.803 0.639 0.803 0.976
Table 6: Example Queries and Matched Products
Query: make it bake it suncatchers. Comments: Robustness to spelling error.
Query: healthy shampoo. Comments: Associates sulfate-free to healthy.
Query: collapsible step ladder. Comments: Synonymous intent.
Query: ninjago lego training kai minifigure. Comments: Drops uninformative token "training".

helped in hyperparameter optimization. Guy Lebanon, Vishy Vishwanathan and Inderjit Dhillon served as advisors throughout the project. Scott Le Grand and Edward Kandrot provided guidance to design and implement model parallel training.

REFERENCES
[1] Michael W Berry and Paul G Young. 1995. Using latent semantic indexing for multilanguage information retrieval. Computers and the Humanities 29, 6 (1995), 413–429.
[2] David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3, Jan (2003), 993–1022.
[3] Silviu Cucerzan and Eric Brill. 2004. Spelling correction as an iterative process that exploits the collective knowledge of web users. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.
[4] Hercules Dalianis. 2002. Evaluating a spelling support in a search engine. In International Conference on Application of Natural Language to Information Systems. Springer, 183–190.
[5] Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, 6 (1990), 391–407.
[6] Fernando Diaz, Bhaskar Mitra, and Nick Craswell. 2016. Query expansion with locally-trained word embeddings. arXiv preprint arXiv:1605.07891 (2016).
[7] Susan T Dumais, Todd A Letsche, Michael L Littman, and Thomas K Landauer. 1997. Automatic cross-language retrieval using latent semantic indexing. In AAAI Spring Symposium on Cross-Language Text and Speech Retrieval, Vol. 15. 21.
[8] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS '10). Society for Artificial Intelligence and Statistics.
[9] Jiafeng Guo, Yixing Fan, Qingyao Ai, and W Bruce Croft. 2016. A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management. ACM, 55–64.
[10] Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional Neural Network Architectures for Matching Natural Language Sentences. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 (NIPS '14). MIT Press, Cambridge, MA, USA, 2042–2050. http://dl.acm.org/citation.cfm?id=2969033.2969055
[11] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 2333–2338.
[12] Kai Hui, Andrew Yates, Klaus Berberich, and Gerard de Melo. 2017. PACRR: A position-aware neural IR model for relevance matching. arXiv preprint arXiv:1704.03940 (2017).
[13] Kai Hui, Andrew Yates, Klaus Berberich, and Gerard de Melo. 2017. RE-PACRR: A context and density-aware neural information retrieval model. arXiv preprint arXiv:1706.10192 (2017).
[14] Kai Hui, Andrew Yates, Klaus Berberich, and Gerard de Melo. 2018. Co-PACRR: A context-aware neural IR model for ad-hoc retrieval. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 279–287.
[15] Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015).
[16] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Association for Computational Linguistics, 427–431.
[17] Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK.
[18] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[19] Bhaskar Mitra, Fernando Diaz, and Nick Craswell. 2017. Learning to match using local and distributed representations of text for web search. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1291–1299.
[20] Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. 2016. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137 (2016).
[21] Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, and Rabab Ward. 2016. Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 24, 4 (2016), 694–707.
[22] Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. 2016. Text Matching as Image Recognition. In AAAI. 2793–2799.
[23] Ruslan Salakhutdinov and Geoffrey Hinton. 2009. Semantic hashing. International Journal of Approximate Reasoning 50, 7 (2009), 969–978.
[24] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management. ACM, 101–110.
[25] Grigori Sidorov, Francisco Velasquez, Efstathios Stamatatos, Alexander Gelbukh, and Liliana Chanona-Hernández. 2014. Syntactic n-grams as machine learning features for natural language processing. Expert Systems with Applications 41, 3 (2014), 853–860.
[26] Shengxian Wan, Yanyan Lan, Jun Xu, Jiafeng Guo, Liang Pang, and Xueqi Cheng. 2016. Match-SRNN: Modeling the recursive matching structure with spatial RNN. arXiv preprint arXiv:1604.04378 (2016).
[27] Xing Wei and W Bruce Croft. 2006. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 178–185.
[28] Kilian Weinberger, Anirban Dasgupta, Josh Attenberg, John Langford, and Alex Smola. 2009. Feature hashing for large scale multitask learning. arXiv preprint arXiv:0902.2206 (2009).
[29] Liu Yang, Qingyao Ai, Jiafeng Guo, and W Bruce Croft. 2016. aNMM: Ranking short answer texts with attention-based neural matching model. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management. ACM, 287–296.
[30] Justin Zobel and Alistair Moffat. 2006. Inverted files for text search engines. ACM Computing Surveys (CSUR) 38, 2 (2006), 6.
A ADDITIONAL EXPERIMENTS
This section details additional experiments completed to determine the model architecture and to tune model hyperparameters.

We demonstrate empirically in Table 7 that sharing the embedding layer between the query and product arm tends to perform better for matching results across multiple tokenizations and loss functions. As we described previously, sharing the embedding layer helps local word-level matching and generalization to unseen tokens when using OOV bins. Note that in this experiment, the number of model parameters was held constant. So the embedding dimension was 256 for the shared embedding layer but 128 for each of the decoupled query and product embedding layers.

In Table 8, we present the results of varying the OOV bin size. We see that matching performance improves as the bin size increases, although ranking performance peaks at lower bin sizes. These results confirm the intuition that adding OOV hashing leads to generalization to unseen tokens. This generalization improves matching performance, as there are fewer spurious matches resulting from OOV tokens mapping to the same bucket and/or simply excluding OOV tokens.