Subword Language Model for Query Auto-Completion (EMNLP-IJCNLP 2019)

Gyuwan Kim

arXiv Github

Motivations to Faster Neural Query Auto-Completion

When browsing on search engines, such as NAVER, users type in the information which they want to look for. Query auto-completion (QAC) suggests most likely completion candidates when a user enters the input. It is one of the essential features of search engines. In this paper, we suggest a method to speed-up QAC and eventually maximize user experience. 

The selection of an appropriate granularity level for text segmentation has been long studied over the variety of natural language processing problems: character, subword, word, phrase, sentence, and even paragraph. The best-performing granularity generally depends on tasks and datasets. The recent neural QAC models rely on a character-level language model since QAC systems need to respond whenever a user inputs a query in a character-by-character manner.

The generation process is auto-regressive, and the size of the search space is exponential to the sequence length. However, long character sequences make predictions slow and inaccurate in the constraints of limited computation. Also, character-level models are susceptible to errors because of long-range dependency. Thus, given these disadvantages, a need for a shorter query representation arises.

A word-level language model looks appealing choice because they can easily shorten sequence length. However, there are several critical disadvantages. Word-level models require larger vocabulary size and the number of parameters to learn, resulting in data sparsity. Furthermore, the world-level model has an out-of-vocabulary problem. As many search systems are in an open vocabulary setting, it is difficult for the word-level model to deal with the last incomplete token, which may not be in the vocabulary.

Despite the introduction of word-character hybrid models that word-level and character-level decoders are connected hierarchically (Luong and Manning, 2016; Wu et al., 2016), some factors hinder word-level processing for QAC. First of all, in some languages, such as Japanese and Chinese, the segmentation of sentences into words is unknown in advance and sometimes ambiguous. Word models assume whitespace as the boundary of words. Also, input queries usually include much noise, such as typos, grammatical errors, and spacing errors.

Subword Language Model for Query Auto-Completion

We introduce a subword language model for query auto-completion. The language model estimates the probability of token (in this case, a subword) sequence. The probability of a query q is the sum of all possible segmentation t.

We use two widely used segmentation algorithms: byte pair encoding (BPE) (Sennrich et al., 2015) and subword regularization (SR) (Kudo, 2018). Formally, a segmentation algorithm defines a probability distribution over a token sequence t conditioned on given query q: p_{seg}(t|q). The BPE algorithm is deterministic in that it segments greedily from left to right, whereas SR samples multiple segmentations stochastically, making the number of possible segmentations exponentially large. It is hard to calculate the likelihood of a given sentence using dynamic programming because even with the same prefix, hidden states vary upon different previous tokenization. The marginalization of all possible segmentations of very long sequences is intractable.

A subword language model is trained to maximize the average log-likelihood of the segmentation of sentences using the back-propagation through time (BPTT) (Rumelhart et al., 1986). 

Decoding (Retrace Algorithm and Reranking by Marginalization)

Given a prefix p, we want to find the most likely completion \hat{q}. However, searching in the infinitely large token sequence space is generally not an easy task. So we approximate the most likely completion by decoding for the best token sequence \hat{t} and then returning its corresponding query \tilde{q} by concatenating its token sequentially.

The fundamental practice is to segment p, feed it into language model to encode p, and use it for the decoding. Since finding \hat{t} is intractable as well, beam search decoding is used, but it only results in sub-optimal predictions. The two ways, including retrace algorithm and reranking method, is applied to improve such limitations. 

Language models for QAC should handle the last token that might be often incomplete. Subword language models, like character language models, can represent incomplete tokens in that they can generate any subsequence of sentences, whereas word language models cannot. For example, if we segment prefix as given to encode it by using neural networks, as illustrated in the figure, the segmentation of prefix may not match with that of a ground truth query. Such a phenomenon occurs because the prefix is an incomplete substring of the original desired query. In this case, this enforced segmentation is less likely to appear in training, especially for deterministic segmentation, such as BPE. Therefore, the model starting from this segmentation is unlikely to generate a ground truth query. 

To not miss any possible segmentation of target completion, we suggest a retrace algorithm that goes a few characters back from the end and generates candidates with restriction so that the candidates should match with retraced characters. In SR models, due to the stochasticity of segmentation, marginalization of all possible segmentations is recommended to calculate the likelihood of a query. Then we perform reranking with approximate marginalization using the output of beam search to better approximate than merely use argmax. Experimental results demonstrate that these techniques improve the robustness of the decoding process of the subword language model, enabling it to accomplish closer generation quality than that of the character baseline.

The gray area means the given prefix (“res”) of the query. The solid line indicates the boundary of the segmentation. Blue boxes indicate a fixed segmentation with a retrace algorithm at the end of the prefix.

Evaluation Metrics

Conventional QAC evaluation metrics, mean reciprocal rank (MRR), and partial-matching MRR (PMRR), require sampling of a prefix length and are favorable to short queries.

To remedy those problems, we propose a novel one called mean recoverable length (MRL). We define recoverable length RL as the number of characters right before the first position where candidates do not have the query. When all characters of a query are known, we can readily suggest it. If we delete chars from right to left in a one-by-one manner, the ground truth query is likely to disappear in the list of candidate completions. 

MRL is a useful metric for additive QAC, which suggests one word at a time instead of a whole-query completion (Vargas et al., 2016) in that it measures how many characters the system can predict correctly at once. MRL does not care about the order of candidates and check whether they contain the target query or not. Lastly, it eliminates the need to choose a prefix length in the test data. 

Decoding Results

Results of completion generation.
We group MPC, character language model baseline, and two subword language models separately. +R implies the retrace algorithm. +M implies reranking with approximate marginalization. QPS stands for query per second.
The higher the QPS, the better. The best results for each column related to accuracy are shown in bold for each segmentation algorithm (BPE and SR). SR model shows higher unseen PMRR scores (underlined).
Our models are faster than the character baseline.

We used the public AOL query log dataset (Pass et al., 2006) of three months and split them based on the time. LSTM language model architecture is used for the experiments. For more details, please refer to the paper.

For comparison, we train three individual language models: namely Char, BPE, SR. We implement a trie (Fredkin, 1960) data structure to employ the most popular completion baseline. All experiments were performed on NAVER Smart Machine Learning (NSML) platform (Sung et al., 2017; Kim et al., 2018).

As illustrated in the table above, we perform extensive experiments to investigate the performance of query auto-completion. MPC is a quite fast and robust baseline. However, it cannot predict unseen queries. In practice, combining the frequency-based traditional method and neural language model approach is likely to boost the accuracy and meet a trade-off between the performance and computational costs.

The results of MRR and PMRR in our best methods are close to that of the character model with less than 0.02 point drop. Notably, the SR model has better generalization ability in that its PMRR for unseen queries is higher than that of the character model. In a real scenario, it is more critical because the number of unseen queries gradually increases.

Models achieve higher performance with additional techniques, including retrace algorithm and reranking method than the models without them. In particular, the retrace algorithm enables a considerable improvement for the BPE case, while SR models only obtain small improvements. The reranking method by approximate marginalization gives a small amount of improvement and is orthogonal to the retrace algorithm.

In terms of the execution time with Tesla P40 GPU and Xeon CPU, subword-level models are up to 2.5 times faster than the character baseline with a minimal loss in performance.

Conclusion

In this paper, we propose subword language models for query auto-completion with additional techniques, retrace algorithm, and reranking with approximate marginalization. We aim to prove that the subword language model is an attractive choice for QAC, as an alternative to the character-level language model, if latency, in particular, is considered. Using a subword language model, we build an accurate and much faster QAC system compared to the character-level language model baseline. 

Our best models show improved performance in speed. While maintaining the generation query, we observe the significant speed-up of subword language models and compare them to the character-level baseline. The models achieve up to 2.5 times faster decoding speed with less than a 0.02 point drop of MRR and PMRR. We also suggest a new metric, mean recoverable length (MRL).

Further areas of research include knowledge distillation from the character language model, joint training of segmentation algorithm with language models, and efficient implementation to be deployed as a real application. Moreover, we hope that the proposed methods could be extended to other NLP/IR tasks.