Scale your dot products in attention
I analyse unscaled dot-product attention in a language translation task using a seq2seq model and experimentally show why scaling is needed for dot-product attention, as used in Transformers.
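Before getting into the experiments, here is a minimal sketch (not taken from the article's experiments, and assuming query/key entries drawn from a standard normal distribution) of the core issue: the dot product of two d_k-dimensional vectors has a standard deviation that grows like sqrt(d_k), so unscaled attention logits blow up with dimension and push the softmax into a near one-hot regime. Dividing by sqrt(d_k), as in the Transformer's scaled dot-product attention, keeps the logits at unit scale.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# For standard-normal q and k, Var(q . k) = d_k, so the logit scale grows with sqrt(d_k).
for d_k in (16, 64, 256):
    q = rng.standard_normal((1000, d_k))
    k = rng.standard_normal((1000, d_k))
    logits_unscaled = np.einsum("nd,nd->n", q, k)       # row-wise dot products q . k
    logits_scaled = logits_unscaled / np.sqrt(d_k)      # q . k / sqrt(d_k)
    print(f"d_k={d_k:4d}  std(unscaled)={logits_unscaled.std():6.2f}  "
          f"std(scaled)={logits_scaled.std():5.2f}")

# Large unscaled logits saturate the softmax into a near one-hot distribution,
# which starves the non-maximal positions of gradient; scaling keeps it softer.
scores = rng.standard_normal(8) * np.sqrt(256)  # logits at the typical unscaled magnitude for d_k=256
print("softmax(unscaled):", np.round(softmax(scores), 3))
print("softmax(scaled):  ", np.round(softmax(scores / np.sqrt(256)), 3))
```

Running this shows the unscaled logit standard deviation tracking sqrt(d_k) while the scaled version stays near 1, which is the effect the experiments below examine in the context of a seq2seq translation model.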