Gemma 2 2B — Architecture

Every tensor shape is annotated. The residual width d_model = 2304 is preserved end to end: each DecoderLayer leaves the shape of the residual stream unchanged, and the attention and MLP blocks only add updates to it.

Top-level flow: string → string

Input
list[str], len=B
e.g. ["The capital of France is"]
tokenizer.encode (SentencePiece BPE)
input_ids
[B, T] int64
T = number of tokens after BPE
embed_tokens lookup, then × √d_model = √2304 = 48
residual stream (initial)
[B, T, 2304] bf16
embed_tokens.weight shape = (256000, 2304); same matrix is reused as lm_head
× 26 DecoderLayers — see expansion below
residual stream (after layer 26)
[B, T, 2304] bf16
final RMSNorm (rms_norm_eps=1e-6)
normalized hidden
[B, T, 2304] bf16
lm_head — tied to embed_tokens.T
logits
[B, T, 256000] fp32
final logit softcap (cap = 30.0): logits ← 30 · tanh(logits / 30)
take last position [:, -1, :], softmax, argmax (or sample)
next_token_id
[B] int64
tokenizer.decode
Output
list[str], len=B
e.g. [" Paris"]
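
The scalar steps of the flow above (embedding scale, final logit softcap, greedy pick) can be sketched in plain Python. This is a minimal illustration, not the model's actual code: the function names are hypothetical, and it works on a single list of last-position logits rather than a [B, T, 256000] tensor.

```python
import math

D_MODEL = 2304        # residual width, from the text
FINAL_SOFTCAP = 30.0  # final logit softcap, from the text


def embed_scale(d_model: int = D_MODEL) -> float:
    """Multiplier applied to embedding rows before the first layer."""
    return math.sqrt(d_model)


def softcap(x: float, cap: float) -> float:
    """cap * tanh(x / cap): near-identity for small x, bounded by cap in magnitude."""
    return cap * math.tanh(x / cap)


def greedy_next_token(last_logits: list[float]) -> int:
    """Softcap the last-position logits and return the index of the largest one.

    Softmax is monotonic, so the argmax of the probabilities equals the
    argmax of the capped logits; the softmax itself is skipped here.
    """
    capped = [softcap(v, FINAL_SOFTCAP) for v in last_logits]
    return max(range(len(capped)), key=capped.__getitem__)
```

Because tanh is monotonic, the softcap never changes which token greedy decoding picks; it only compresses the logit range, which matters when sampling.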

One DecoderLayer (×26)

x_in
[B, T, 2304]
residual stream entering this layer
residual₁ ← x_in (save for skip connection)
input_layernorm (RMSNorm)
[B, T, 2304] → [B, T, 2304]
self_attn — Grouped-Query Attention
Q proj: 2304 → 8·256 = 2048  |   K proj: 2304 → 4·256 = 1024  |   V proj: 2304 → 4·256 = 1024
Reshape Q→[B, T, 8, 256]; K,V→[B, T, 4, 256]; repeat K,V ×2 to match Q heads
RoPE on Q and K (theta=10000)
scores = Q @ Kᵀ / √256, softcap 50 · tanh(scores / 50), mask (causal; limited to a 4096-token sliding window on sliding layers), softmax → probs
attn_out = probs @ V  →  [B, T, 8, 256] → [B, T, 2048]
O proj: 2048 → 2304
attn output shape: [B, T, 2304]
post_attention_layernorm (RMSNorm)
[B, T, 2304]
x_mid = residual₁ + post_attention_layernorm(attn output) (residual add)
x_mid
[B, T, 2304]
residual₂ ← x_mid
pre_feedforward_layernorm (RMSNorm)
[B, T, 2304]
MLP — GeGLU
gate_proj: 2304 → 9216  |   up_proj: 2304 → 9216  |   down_proj: 9216 → 2304
hidden = gelu(gate_proj(x)) * up_proj(x)  →  [B, T, 9216]
out = down_proj(hidden)  →  [B, T, 2304]
MLP output shape: [B, T, 2304]
post_feedforward_layernorm (RMSNorm)
[B, T, 2304]
x_out = residual₂ + post_feedforward_layernorm(MLP output) (residual add)
x_out
[B, T, 2304]
passed as x_in to the next layer (or to the final RMSNorm if this was layer 26)
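
The wiring of one layer (norm before and after each sub-block, then a residual add) can be sketched for a single position in plain Python. Several things here are assumptions rather than facts from the text: the (1 + weight) RMSNorm scaling, the tanh approximation of GELU, and the helper names. `attn` and `mlp` are placeholder callables; real attention mixes information across positions, which a per-vector callable cannot show.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm over one feature vector (assumes the (1 + weight) scaling convention)."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * (1.0 + w) for v, w in zip(x, weight)]


def gelu_tanh(v):
    """tanh approximation of GELU (assumed variant)."""
    inner = math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)
    return 0.5 * v * (1.0 + math.tanh(inner))


def geglu(gate, up):
    """Elementwise gelu(gate) * up: the hidden activation fed to down_proj."""
    return [gelu_tanh(g) * u for g, u in zip(gate, up)]


def decoder_layer(x_in, attn, mlp, w):
    """Residual wiring of one layer; w maps norm names to weight vectors."""
    # Attention half: pre-norm, sub-block, post-norm, residual add.
    a = attn(rms_norm(x_in, w["input"]))
    x_mid = [r + v for r, v in zip(x_in, rms_norm(a, w["post_attn"]))]
    # MLP half: same pattern on top of x_mid.
    m = mlp(rms_norm(x_mid, w["pre_ffn"]))
    return [r + v for r, v in zip(x_mid, rms_norm(m, w["post_ffn"]))]
```

With sub-blocks that output zeros, `decoder_layer` returns its input unchanged, which is the sense in which each layer only writes additive updates into the residual stream.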

Config (from model.config)

field                              | value
model_id                           | google/gemma-2-2b
params                             | 2.61 B
vocab_size                         | 256000
hidden_size (d_model)              | 2304
num_hidden_layers                  | 26
num_attention_heads (Q)            | 8
num_key_value_heads (K, V; GQA)    | 4
head_dim                           | 256
intermediate_size (MLP up)         | 9216
max_position_embeddings            | 8192
attn_window (sliding layers)       | 4096
attn_pattern                       | alternating: even layers sliding-window, odd layers full
rope_theta                         | 10000
rms_norm_eps                       | 1e-6
tied embeddings                    | yes (lm_head.weight == embed_tokens.weight)
activation                         | GeGLU (gelu(gate) * up)
logit softcap                      | 30.0
attn softcap                       | 50.0
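
How the attention-related fields combine can be sketched as follows. This is an illustration under stated assumptions: layer indices are taken to be 0-based, query heads are assumed to share K/V heads in consecutive groups, and the exact edge of the sliding window (whether the look-back is `window` or `window - 1` positions) is a guess. All function names are hypothetical.

```python
NUM_LAYERS = 26
NUM_Q_HEADS = 8
NUM_KV_HEADS = 4
HEAD_DIM = 256
SLIDING_WINDOW = 4096


def layer_kind(layer_idx: int) -> str:
    """Even (0-based, assumed) layers use sliding-window attention, odd layers full."""
    return "sliding" if layer_idx % 2 == 0 else "full"


def kv_head_for(q_head: int) -> int:
    """K/V head shared by a given query head (consecutive grouping assumed)."""
    group_size = NUM_Q_HEADS // NUM_KV_HEADS  # 2 query heads per K/V head
    return q_head // group_size


def may_attend(q_pos: int, k_pos: int, kind: str, window: int = SLIDING_WINDOW) -> bool:
    """Causal mask; sliding layers also limit how far back a query can look."""
    if k_pos > q_pos:
        return False  # never attend to future positions
    return kind == "full" or q_pos - k_pos < window
```

Under these assumptions 13 of the 26 layers are sliding-window layers, and for prompts shorter than 4096 tokens the two layer kinds apply the same mask.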

Parameter counts

component                        | shape           | params
embed_tokens (= lm_head, tied)   | (256000, 2304)  | 589.8 M
per layer: q_proj                | (2304, 2048)    | 4.7 M
per layer: k_proj                | (2304, 1024)    | 2.4 M
per layer: v_proj                | (2304, 1024)    | 2.4 M
per layer: o_proj                | (2048, 2304)    | 4.7 M
per layer: gate_proj             | (2304, 9216)    | 21.2 M
per layer: up_proj               | (2304, 9216)    | 21.2 M
per layer: down_proj             | (9216, 2304)    | 21.2 M
per layer: 4 × RMSNorm           | 4 × (2304,)     | ≈ 9 k
per layer (sum)                  |                 | ≈ 77.9 M
26 layers                        |                 | ≈ 2024.5 M
final RMSNorm                    | (2304,)         | ≈ 2 k
Total (with tied head)           |                 | ≈ 2.61 B
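
The totals can be recomputed from the shapes listed above. This is a small sketch with hypothetical names; it assumes the projections have no bias terms and counts the tied embedding matrix once.

```python
VOCAB, D_MODEL, N_LAYERS = 256000, 2304, 26
Q_DIM, KV_DIM, D_FF = 8 * 256, 4 * 256, 9216


def per_layer_params() -> int:
    """Weights in one DecoderLayer (no biases assumed)."""
    attn = D_MODEL * Q_DIM + 2 * D_MODEL * KV_DIM + Q_DIM * D_MODEL  # q, k, v, o
    mlp = 3 * D_MODEL * D_FF                                         # gate, up, down
    norms = 4 * D_MODEL                                              # four RMSNorm vectors
    return attn + mlp + norms


def total_params() -> int:
    """Whole model, with embed_tokens counted once because lm_head is tied to it."""
    embed = VOCAB * D_MODEL
    final_norm = D_MODEL
    return embed + N_LAYERS * per_layer_params() + final_norm
```

This gives 77,865,984 parameters per layer and 2,614,341,888 in total, that is ≈ 2.61 B.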