Scale your dot products in attention
I analyse unscaled dot-product attention in a language translation task using a seq2seq model and experimentally show why scaling is needed for dot-product attention, as used in Transformers.
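Before getting into the experiments, here is a minimal sketch (not taken from the article's experiments, and assuming query/key entries drawn from a standard normal distribution) of the core issue: the dot product of two d_k-dimensional vectors has a standard deviation that grows like sqrt(d_k), so unscaled attention logits blow up with dimension and push the softmax into a near one-hot regime. Dividing by sqrt(d_k), as in the Transformer's scaled dot-product attention, keeps the logits at unit scale.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# For standard-normal q and k, Var(q . k) = d_k, so the logit scale grows with sqrt(d_k).
for d_k in (16, 64, 256):
    q = rng.standard_normal((1000, d_k))
    k = rng.standard_normal((1000, d_k))
    logits_unscaled = np.einsum("nd,nd->n", q, k)       # row-wise dot products q . k
    logits_scaled = logits_unscaled / np.sqrt(d_k)      # q . k / sqrt(d_k)
    print(f"d_k={d_k:4d}  std(unscaled)={logits_unscaled.std():6.2f}  "
          f"std(scaled)={logits_scaled.std():5.2f}")

# Large unscaled logits saturate the softmax into a near one-hot distribution,
# which starves the non-maximal positions of gradient; scaling keeps it softer.
scores = rng.standard_normal(8) * np.sqrt(256)  # logits at the typical unscaled magnitude for d_k=256
print("softmax(unscaled):", np.round(softmax(scores), 3))
print("softmax(scaled):  ", np.round(softmax(scores / np.sqrt(256)), 3))
```

Running this shows the unscaled logit standard deviation tracking sqrt(d_k) while the scaled version stays near 1, which is the effect the experiments below examine in the context of a seq2seq translation model.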