Scale your dot product in attention

I analyse unscaled dot-product attention on a language-translation task using a seq2seq model and experimentally show why scaling is needed for dot-product attention, as used in transformers.
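As a quick illustration of the effect, here is a minimal NumPy sketch (the function names, shapes, and random inputs are my own, not the post's actual experiment): with queries and keys whose entries have unit variance, the raw dot-product scores have variance of roughly d_k, so the softmax saturates towards a one-hot distribution; dividing by sqrt(d_k) keeps the scores, and hence the gradients, well-behaved.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(q, k, v, scale=True):
    # q, k: (seq_len, d_k); v: (seq_len, d_v)
    d_k = q.shape[-1]
    scores = q @ k.T                    # raw dot-product scores, variance ~ d_k
    if scale:
        scores = scores / np.sqrt(d_k)  # rescale so the score variance stays ~ 1
    weights = softmax(scores)           # attention distribution over the keys
    return weights @ v, weights

# For large d_k the unscaled softmax saturates towards a one-hot vector,
# which is where the vanishing-gradient problem comes from.
rng = np.random.default_rng(0)
d_k = 512
q, k, v = (rng.standard_normal((8, d_k)) for _ in range(3))
_, w_unscaled = dot_product_attention(q, k, v, scale=False)
_, w_scaled = dot_product_attention(q, k, v, scale=True)
print("max weight per query, unscaled:", w_unscaled.max(axis=-1))  # close to 1.0
print("max weight per query, scaled:  ", w_scaled.max(axis=-1))    # much flatter
```

The post looks at the same question empirically in a seq2seq translation setting rather than with synthetic Gaussian inputs.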