Llama

Llama from scratch

Transformers vs. Llama

Normalization choice

Layer normalization or root mean square normalization (RMSNorm)
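
A minimal sketch of the difference, assuming PyTorch as the implementation language (the eps value and tensor shapes here are illustrative): RMSNorm rescales by the root mean square of the features only, whereas LayerNorm also subtracts the mean and adds a bias. Llama applies RMSNorm before the attention and feed-forward blocks (pre-normalization).

    import torch

    def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        # Rescale by the root mean square over the last dimension only;
        # unlike LayerNorm there is no mean subtraction and no bias term.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
        return x / rms * weight

    x = torch.randn(2, 8, 64)          # (batch, seq_len, dim)
    weight = torch.ones(64)            # learnable per-feature gain
    print(rms_norm(x, weight).shape)   # torch.Size([2, 8, 64])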

Attention mechanism

Rotary Position Embedding
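
A rough sketch of the rotary idea, assuming an interleaved-pair convention: each pair of features at position m is rotated by an angle m * theta_i. The official Llama code uses a complex-number formulation applied per attention head to the queries and keys, so the function name and shapes below are illustrative only.

    import torch

    def rotary_embed(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
        # x: (batch, seq_len, dim), dim must be even.
        # Pair (x_{2i}, x_{2i+1}) at position m is rotated by m * theta_i,
        # with theta_i = base ** (-2i / dim).
        _, seq_len, dim = x.shape
        theta = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)     # (dim/2,)
        angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), theta)   # (seq_len, dim/2)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., 0::2], x[..., 1::2]
        out = torch.empty_like(x)
        out[..., 0::2] = x1 * cos - x2 * sin
        out[..., 1::2] = x1 * sin + x2 * cos
        return out

    q = torch.randn(1, 16, 64)
    print(rotary_embed(q).shape)   # torch.Size([1, 16, 64])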

Self-attention
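
A single-head causal self-attention sketch, for reference; Llama itself is multi-head and applies RoPE to the queries and keys first, so this shows only the core scaled dot-product step with assumed weight shapes.

    import math
    import torch
    import torch.nn.functional as F

    def self_attention(x, w_q, w_k, w_v):
        # x: (batch, seq_len, dim); w_q / w_k / w_v: (dim, dim) projections.
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])   # (batch, seq, seq)
        seq_len = x.shape[1]
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(causal_mask, float("-inf"))     # hide future tokens
        return F.softmax(scores, dim=-1) @ v

    dim = 64
    x = torch.randn(1, 8, dim)
    w_q, w_k, w_v = (torch.randn(dim, dim) / math.sqrt(dim) for _ in range(3))
    print(self_attention(x, w_q, w_k, w_v).shape)   # torch.Size([1, 8, 64])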

KV cache
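
A minimal sketch of one decoding step with a KV cache, assuming single-head attention and a plain dict as the cache (real implementations keep a preallocated cache per layer and per head). Only the newest token is projected each step; the keys and values of earlier tokens are reused, so no causal mask is needed.

    import math
    import torch
    import torch.nn.functional as F

    def decode_step(x_new, w_q, w_k, w_v, cache):
        # x_new: (batch, 1, dim) -- only the newest token.
        q, k, v = x_new @ w_q, x_new @ w_k, x_new @ w_v
        cache["k"] = k if cache["k"] is None else torch.cat([cache["k"], k], dim=1)
        cache["v"] = v if cache["v"] is None else torch.cat([cache["v"], v], dim=1)
        scores = q @ cache["k"].transpose(-2, -1) / math.sqrt(q.shape[-1])  # (batch, 1, t)
        return F.softmax(scores, dim=-1) @ cache["v"]                       # (batch, 1, dim)

    dim = 64
    w_q, w_k, w_v = (torch.randn(dim, dim) / math.sqrt(dim) for _ in range(3))
    cache = {"k": None, "v": None}
    for _ in range(4):                       # simulate four decoding steps
        out = decode_step(torch.randn(1, 1, dim), w_q, w_k, w_v, cache)
    print(cache["k"].shape)                  # torch.Size([1, 4, 64]) -- grows with each step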

GPU memory bandwidth problem

Multi-query attention
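
A sketch of multi-query attention under the same assumed shapes: every query head shares a single key/value head, so the KV cache (and the memory traffic it causes) shrinks by a factor of n_heads. The causal mask is omitted for brevity, and note that Llama 2 actually uses the grouped-query variant (a few KV heads rather than one).

    import math
    import torch
    import torch.nn.functional as F

    def multi_query_attention(x, w_q, w_k, w_v, n_heads):
        # x: (batch, seq, dim). Queries get n_heads heads; keys/values get ONE shared head.
        b, s, dim = x.shape
        head_dim = dim // n_heads
        q = (x @ w_q).view(b, s, n_heads, head_dim).transpose(1, 2)   # (b, h, s, hd)
        k = (x @ w_k).unsqueeze(1)                                    # (b, 1, s, hd), shared
        v = (x @ w_v).unsqueeze(1)                                    # (b, 1, s, hd), shared
        scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)        # broadcast over heads
        out = F.softmax(scores, dim=-1) @ v                           # (b, h, s, hd)
        return out.transpose(1, 2).reshape(b, s, dim)

    dim, n_heads = 64, 8
    x = torch.randn(1, 8, dim)
    w_q = torch.randn(dim, dim) / math.sqrt(dim)
    w_k = torch.randn(dim, dim // n_heads) / math.sqrt(dim)   # projects to a single head
    w_v = torch.randn(dim, dim // n_heads) / math.sqrt(dim)
    print(multi_query_attention(x, w_q, w_k, w_v, n_heads).shape)   # torch.Size([1, 8, 64])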

SwiGLU activation

GLU Variants Improve Transformer (Shazeer, 2020)
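
A sketch of the SwiGLU feed-forward block used in Llama-style models; the weight names w1, w2, w3 follow the reference code's convention, and the hidden size of roughly 2/3 * 4 * dim is the rule of thumb from the paper (exact rounding differs between implementations).

    import torch
    import torch.nn.functional as F

    def swiglu_ffn(x, w1, w3, w2):
        # SiLU(x @ w1) gates the parallel branch (x @ w3); the product is
        # projected back down by w2. This replaces the single ReLU/GELU
        # branch of the original Transformer feed-forward layer.
        return (F.silu(x @ w1) * (x @ w3)) @ w2

    dim, hidden = 64, 172                    # hidden is roughly 2/3 * 4 * dim
    x = torch.randn(1, 8, dim)
    w1 = torch.randn(dim, hidden) / dim ** 0.5
    w3 = torch.randn(dim, hidden) / dim ** 0.5
    w2 = torch.randn(hidden, dim) / hidden ** 0.5
    print(swiglu_ffn(x, w1, w3, w2).shape)   # torch.Size([1, 8, 64])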