Redesigning Neural Architectures for Sequence-to-Sequence Learning

The encoder-decoder model with soft attention is now the de facto standard for sequence-to-sequence learning, having enjoyed early success in tasks like translation, error correction, and speech recognition. In this talk, I will present a critique of various aspects of this popular model, including its soft attention mechanism, local loss function, and sequential decoding. I will present a new Posterior Attention Network that provides a more transparent joint attention and yields easy gains on several translation and morphological inflection tasks. Next, I will expose a little-known problem of miscalibration in state-of-the-art neural machine translation (NMT) systems. For structured outputs, as in NMT, calibration is important not just for attaching reliable confidence to predictions, but also for the proper functioning of beam-search inference. I will discuss reasons for miscalibration and some fixes. Finally, I will summarize recent research efforts towards parallel decoding of long sequences.
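For context on the calibration claim above, here is a minimal sketch (not from the talk) of how miscalibration is commonly quantified: expected calibration error (ECE) computed over per-token confidences. The function and variable names are illustrative assumptions, not part of the speaker's work.

```python
# Hypothetical sketch: expected calibration error over per-token predictions.
# A well-calibrated model has average confidence close to empirical accuracy
# within each confidence bin, so its ECE is close to 0.
import numpy as np

def expected_calibration_error(token_confidences, token_correct, n_bins=10):
    """Bin predictions by confidence; compare mean confidence with
    empirical accuracy in each bin, weighted by bin size."""
    confidences = np.asarray(token_confidences, dtype=float)
    correct = np.asarray(token_correct, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of tokens in bin
    return ece

# Example with made-up confidences and correctness indicators.
print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 1]))
```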

Date:
Speaker:
Sunita Sarawagi
Affiliation:
IIT Bombay