LLaMA from scratch
Transformer vs LLaMA

normalization choice
Layer normalization or Root mean square normalization (RMSNorm)
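A minimal PyTorch sketch of RMSNorm as used in LLaMA (the module name and the eps default are my own choices): unlike LayerNorm it skips mean-centering and the bias, keeping only a learnable gain.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root mean square normalization: rescale by the RMS of the activations,
    with a learnable gain but no mean-centering and no bias (unlike LayerNorm)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1 / rms(x) = rsqrt(mean(x^2)); normalize, then apply the learned gain
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight
```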
Attention mechanism

Rotary Position Embedding
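A sketch of rotary position embedding via complex multiplication (function names and the tensor layout are assumptions, in the spirit of the LLaMA reference code): each pair of channels in q and k is rotated by a position-dependent angle, so relative offsets show up directly in the q·k dot products.

```python
import torch

def precompute_rope_freqs(head_dim: int, max_seq_len: int, base: float = 10000.0) -> torch.Tensor:
    """Precompute e^{i * m * theta_k} for every position m and channel pair k
    (head_dim is assumed even)."""
    theta = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_seq_len).float()
    angles = torch.outer(positions, theta)                # (max_seq_len, head_dim/2)
    return torch.polar(torch.ones_like(angles), angles)   # complex64

def apply_rope(x: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    """Rotate query/key vectors. x: (batch, seq_len, n_heads, head_dim)."""
    # pair up adjacent channels and view them as complex numbers
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    freqs = freqs[: x.shape[1]].unsqueeze(0).unsqueeze(2)  # (1, seq_len, 1, head_dim/2)
    # complex multiplication = 2D rotation of each channel pair
    x_rotated = torch.view_as_real(x_complex * freqs).flatten(-2)
    return x_rotated.type_as(x)
```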




Self-attention
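A bare-bones causal self-attention block for reference (a sketch with illustrative names; no KV cache and no RoPE applied yet): queries, keys, and values are projected, split into heads, and combined with a scaled dot product under a causal mask.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Minimal causal self-attention over a full sequence (no KV cache)."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        self.wo = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        # project and split into heads: (b, n_heads, t, head_dim)
        q = self.wq(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        # scaled dot-product with a causal mask: each token attends only to the past
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        mask = torch.triu(torch.ones(t, t, device=x.device), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))
```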

KV cache
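A sketch of the idea (class and method names are my own): during autoregressive decoding, only the new token's keys and values are computed and appended to the cache, and attention then runs over the whole cached prefix instead of recomputing it every step.

```python
import torch

class KVCache:
    """Per-layer KV cache for incremental decoding: append new keys/values,
    return everything cached so far."""
    def __init__(self, batch: int, max_seq_len: int, n_kv_heads: int, head_dim: int):
        self.k = torch.zeros(batch, max_seq_len, n_kv_heads, head_dim)
        self.v = torch.zeros(batch, max_seq_len, n_kv_heads, head_dim)
        self.len = 0

    def update(self, new_k: torch.Tensor, new_v: torch.Tensor):
        """new_k / new_v: (batch, n_new_tokens, n_kv_heads, head_dim)."""
        n = new_k.shape[1]
        self.k[:, self.len:self.len + n] = new_k
        self.v[:, self.len:self.len + n] = new_v
        self.len += n
        # keys/values for the full prefix seen so far
        return self.k[:, :self.len], self.v[:, :self.len]
```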


GPU memory bandwidth problem

Multi-query attention
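A sketch of multi-query attention under the same assumptions as the self-attention block above: all query heads share a single key/value head, so the KV cache (and the memory traffic per decoding step) shrinks by a factor of n_heads.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    """Many query heads, one shared K/V head (illustrative names, no KV cache)."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, self.head_dim, bias=False)  # single K head
        self.wv = nn.Linear(dim, self.head_dim, bias=False)  # single V head
        self.wo = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        # one K/V head, broadcast across all query heads
        k = self.wk(x).view(b, t, 1, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(b, t, 1, self.head_dim).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        mask = torch.triu(torch.ones(t, t, device=x.device), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))
```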




SwiGLU activation
GLU Variants Improve Transformer (Shazeer, 2020)
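A sketch of the LLaMA-style feed-forward block with SwiGLU gating (the w1/w2/w3 names follow the common convention of the reference code; hidden_dim is left as a parameter): FFN(x) = W2(SiLU(W1 x) * W3 x), with no biases.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward: an up projection modulated by a SiLU-activated gate,
    then a down projection back to the model dimension."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(x) = x * sigmoid(x) (the "Swish" in SwiGLU); the gated product is SwiGLU
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```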