This article explores the evolution from Transformers to Large Language Models (LLMs), detailing the mechanisms of self-attention and multi-head attention, the role of position embeddings, various types of transformer models, and the training and fine-tuning processes of LLMs.
