Mixture Content Selection for Diverse Sequence Generation (EMNLP-IJCNLP 2019)

Jaemin Cho, Minjoon Seo, Hannaneh Hajishirzi

Seq2Seq is not for One-to-Many Mapping

Comparison between the standard encoder-decoder model and ours

An RNN encoder-decoder (Seq2Seq) model is widely used for sequence generation, in particular machine translation, where neural models have become competitive with human translation for some language pairs. Despite such accuracy, a generic encoder-decoder model often performs poorly at generating diverse outputs.

There are two different kinds of relationships between source and target sequences. First, paraphrasing and machine translation involve a one-to-one relationship, as the generated sequences should convey an identical meaning. On the other hand, summarization and question generation have a one-to-many relationship, in that a source sequence can be mapped to diverse target sequences by focusing on different contents of the source.

The semantic variance among the targets is generally low in tasks with a one-to-one relationship. For these tasks, it is reasonable to train a model with maximum likelihood estimation (MLE) and encourage all generated sentences to have a meaning similar to the target sentence. For example, the sentence "What kind of food do you like?" can be translated as follows: "Quel genre de nourriture aimez-vous?" / "Quel genre de plat aimez-vous?" Even though there are two variations in the translation of the source sentence, the meaning is almost identical.

However, maximum likelihood estimation may perform poorly on a one-to-many task, where a model should produce outputs with different semantics. Given the following sentence in a question generation task, "Donald Trump (1946 -) is the 45th president of the USA" (Target Answer: Donald Trump), one can think of the following questions, each with a different focus: "Who was born in 1946?" / "Who is the 45th president of the USA?" Nevertheless, training a sequence generation model with maximum likelihood estimation often causes a degeneration problem (Holtzman et al. 2019; Welleck et al. 2019). For example, a model produces "Who is the the the the the …" because "Who," "is," and "the" are the most frequent words in the data. This phenomenon happens when a model fails to capture the diversity of the target distribution. As the figure below shows, when the target distribution is diverse (has multiple modes), mode collapse can hurt accuracy as well as diversity.

In multi-modal distribution tasks, a one-to-one mapping learned from maximum likelihood estimation may result in suboptimal mapping.

To tackle this issue, we reformulate sequence generation in two stages: 1) Diverse Content Selection and 2) Focused Generation.

Two-Stage Generation Method

We factorize p(y|x) with latent variable m into two stages:

p(y|x) = \sum_{m} p(m|x) \, p(y|x, m)
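As a numeric sanity check of this factorization, consider a tiny discrete example with two focus masks and made-up probabilities (the values here are purely illustrative):

```python
# Marginalizing over two hypothetical focus masks m1, m2.
p_m_given_x = {"m1": 0.6, "m2": 0.4}    # Selector:  p(m|x)
p_y_given_xm = {"m1": 0.5, "m2": 0.2}   # Generator: p(y|x, m)

p_y_given_x = sum(p_m_given_x[m] * p_y_given_xm[m] for m in p_m_given_x)
assert abs(p_y_given_x - 0.38) < 1e-9   # 0.6*0.5 + 0.4*0.2 = 0.38
```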

1) Diverse Content Selection p(m|x)

This stage implements a one-to-many mapping. We introduce a content selection module called Selector. Selector consists of a mixture of experts (Jacobs et al. 1991; Eigen et al. 2014) that samples diverse latent variables m from a source sequence. Here, we define m as a binary mask of the same length as the source sequence. Each value of m indicates whether the corresponding source token is focused on during generation. Each expert samples a unique set of m to guide a generator (a seq2seq model in our case).

p(m|x) = \frac{1}{K} \sum_{z=1}^{K} p(m|x, z)
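The sampling step can be sketched as follows. This is a toy illustration only: in the real Selector, each expert's per-token Bernoulli probabilities p(m_i = 1 | x, z) come from a neural network, whereas here they are hand-picked to show how different experts yield different foci:

```python
import random

random.seed(0)

def sample_focus_masks(focus_probs_per_expert):
    """Given each expert's per-token Bernoulli probabilities p(m_i=1 | x, z),
    sample one binary focus mask m per expert."""
    masks = []
    for probs in focus_probs_per_expert:  # one expert z at a time
        mask = [1 if random.random() < p else 0 for p in probs]
        masks.append(mask)
    return masks

# Hypothetical probabilities for a 5-token source sentence and K=3 experts.
# Each expert concentrates on different tokens, yielding diverse foci.
expert_probs = [
    [0.9, 0.9, 0.1, 0.1, 0.1],  # expert 1: focus on the first phrase
    [0.1, 0.1, 0.9, 0.9, 0.1],  # expert 2: focus on the middle phrase
    [0.1, 0.1, 0.1, 0.9, 0.9],  # expert 3: focus on the last phrase
]
masks = sample_focus_masks(expert_probs)
assert len(masks) == 3 and all(len(m) == 5 for m in masks)
```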

2) Focused Generation p(y|x, m)

We let a generator (a seq2seq model) learn a one-to-one mapping. Based on the source tokens and the sampled binary masks, the generator outputs sentences guided by the focus m. We can thus generate diverse sentences by sampling diverse m and conditioning the generator differently. In our experiments, we concatenate embeddings of the binary mask values to the source token embeddings of the seq2seq encoder.
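The conditioning step amounts to widening each encoder input vector. A minimal sketch, with made-up embedding values and a hypothetical helper name:

```python
# Each binary mask value is embedded (one learned vector for 0, one for 1)
# and concatenated onto the source token embedding before it enters the
# seq2seq encoder.
def build_encoder_inputs(token_embeddings, focus_mask, focus_embeddings):
    """Concatenate a focus embedding onto each source token embedding."""
    assert len(token_embeddings) == len(focus_mask)
    return [emb + focus_embeddings[m]  # list concat stands in for vector concat
            for emb, m in zip(token_embeddings, focus_mask)]

token_embs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 tokens, dim 2
focus_embs = {0: [0.0], 1: [1.0]}                  # focus embeddings, dim 1
inputs = build_encoder_inputs(token_embs, [1, 0, 1], focus_embs)
assert inputs == [[0.1, 0.2, 1.0], [0.3, 0.4, 0.0], [0.5, 0.6, 1.0]]
```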

The attention heatmap of the NQG++ decoder with three different foci by Selector.

We visualize the attention of the NQG++ (Zhou et al. 2017a) decoder in question generation. Three different Selector experts sample these foci, and the model generates different questions by focusing on different contents.

Comparisons to the Existing Methods

The diagram above shows an overview of diverse sequence-to-sequence generation methods: (a) refers to our two-stage approach, (b) refers to search-based methods (Vijayakumar et al. 2018; Li et al. 2016b; Fan et al. 2018), and (c) refers to mixture decoders (Shen et al. 2019; He et al. 2018).

| Ours | Current Approaches |
| --- | --- |
| Diverse Encoding | Diverse Decoding |
| Manipulate where to focus | Manipulate an already-encoded input |
| High semantic diversity | Similar semantics in different expressions (paraphrasing) |


Marginalizing the Bernoulli distribution by enumerating all possible foci is not an easy task, as the cardinality of the focus space 2^S grows exponentially with the source sequence length S. Thus, we create a focus guide and use it to independently and directly train Selector and the generator. A focus guide is a proxy target of whether a source token is focused on during generation: a mask is 1 if the corresponding source token appears in the target sentence and 0 otherwise. During training, the focus guide acts as a target for Selector and is given as an additional input to the generator. We illustrate the overall training procedure in the algorithm above. For Selector, we use hard-EM (lines 2-6). For the generator, we use MLE (lines 7-8).
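The two training signals can be sketched as follows. This is an illustrative simplification (the function names are ours, and the real M-step gradient update is omitted): the focus guide marks source tokens that reappear in the target, and the hard-EM E-step assigns each example to the expert with the lowest loss:

```python
def focus_guide(source_tokens, target_tokens):
    """m_i = 1 iff the i-th source token appears in the target sentence."""
    target_set = set(target_tokens)
    return [1 if tok in target_set else 0 for tok in source_tokens]

def hard_em_assign(expert_losses):
    """Hard-EM E-step: pick the expert with the lowest loss on this example;
    only that expert receives a gradient update (M-step, not shown)."""
    return min(range(len(expert_losses)), key=lambda z: expert_losses[z])

src = "donald trump is the 45th president of the usa".split()
tgt = "who is the 45th president of the usa".split()
guide = focus_guide(src, tgt)
assert guide == [0, 0, 1, 1, 1, 1, 1, 1, 1]   # name tokens are not focused
assert hard_em_assign([2.3, 0.7, 1.5]) == 1   # expert 1 has the lowest loss
```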

Accuracy-Diversity Tradeoff

| Method | Top-1 ↑ | Oracle ↑ | Pairwise ↓ |
| --- | --- | --- | --- |
| **Search-Based Methods** | | | |
| 3-Beam Search | 13.590 | 16.848 | 67.277 |
| 5-Beam Search | 13.526 | 18.809 | 74.674 |
| 3-Diverse Beam Search | 13.696 | 16.989 | 68.018 |
| 5-Diverse Beam Search | 13.379 | 18.298 | 74.795 |
| 3-Truncated Sampling | 11.890 | 15.447 | **37.372** |
| 5-Truncated Sampling | 11.530 | 17.651 | 45.990 |
| **Mixture of Experts + Greedy Decoding** | | | |
| 3-Mixture Decoder | 14.720 | 19.324 | 51.360 |
| 5-Mixture Decoder | 15.166 | 21.965 | 58.727 |
| 3-Mixture Selector (Ours) | **15.874** | 20.437 | 47.493 |
| 5-Mixture Selector (Ours) | 15.672 | **22.451** | 59.815 |
| **Focus Guide during Test Time** | | | |
| 5-Beam Search + Focus Guide | | 24.580 | |

Question generation results: comparison of diverse generation methods on SQuAD.
All scores are from our experiments using NQG++ as the generator.
Method prefixes denote the number of generated questions per passage-answer pair.
The best scores are in **bold**.

We use variants of sentence similarity metrics, such as BLEU, to assess the accuracy and diversity of different models. First, a pairwise metric measures the similarity between generated hypotheses: a low pairwise score denotes high diversity. Second, the Top-1 metric evaluates the accuracy of the single best hypothesis among the K generated ones. Last but not least, an oracle metric measures how well the top-K generated sequences cover the target distribution (Ott et al. 2018; Vijayakumar et al. 2018).
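The three views can be sketched with a stand-in similarity function (unigram F1 here, purely for illustration; the actual metrics use BLEU/ROUGE variants):

```python
from collections import Counter

def unigram_f1(hyp, ref):
    """Simple stand-in for a sentence-similarity metric such as BLEU."""
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(hyp), overlap / len(ref)
    return 2 * p * r / (p + r)

def pairwise(hyps):
    """Average similarity between hypothesis pairs; lower = more diverse."""
    pairs = [(a, b) for i, a in enumerate(hyps) for b in hyps[i + 1:]]
    return sum(unigram_f1(a, b) for a, b in pairs) / len(pairs)

def top1(hyps, ref):
    """Similarity of the single highest-ranked hypothesis to the target."""
    return unigram_f1(hyps[0], ref)

def oracle(hyps, ref):
    """Best similarity among the K hypotheses; measures coverage."""
    return max(unigram_f1(h, ref) for h in hyps)

hyps = [h.split() for h in
        ["who is the 45th president", "who was born in 1946", "who is trump"]]
ref = "who is the 45th president of the usa".split()
assert oracle(hyps, ref) >= top1(hyps, ref)   # oracle upper-bounds Top-1
assert 0.0 <= pairwise(hyps) <= 1.0
```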

Mixture models achieve a better diversity-accuracy tradeoff than search-based methods. Diverse Beam Search (Vijayakumar et al. 2018) does not clearly improve either accuracy or diversity. While Truncated Sampling (Fan et al. 2018) shows the lowest pairwise similarity, it severely hurts accuracy. With the highest accuracy and the widest target distribution coverage, Selector (ours) achieves the best trade-off between diversity and accuracy among the compared methods.

The bottom row of the table shows the upper bound performance of Selector, obtained by feeding a focus guide (the overlap between the input and target sequences) to the generator at test time. The oracle metric (top-K accuracy) of Selector almost reaches this upper bound, indicating that Selector effectively covers the diverse question distribution.

No Training Overhead

Comparison of training time on CNN-DM

The figure above compares the number of mixtures against training time for the two mixture models (ours and Shen et al. 2019). Since the decoder dominates the training time of seq2seq models, adding mixture decoders (Shen et al. 2019) increases training time linearly. In contrast, training time is unaffected by the number of Selector experts.


We propose a novel two-stage sequence generation method based on diverse content selection. Our content selection module, Selector, can be added to an existing encoder-decoder model. We demonstrate that our model achieves not only better accuracy and diversity but also significantly shorter training time than the baselines in question generation and abstractive summarization*. Future research may include applying Selector to other generation tasks, such as diverse image captioning.

* Please refer to our paper for further details regarding abstractive summarization