Gemma 2 2B — Architecture

Every tensor shape is annotated. The residual width d_model = 2304 is preserved end to end: each DecoderLayer leaves the shape of the residual stream unchanged, and the attention and MLP blocks write additive updates back into it.

Top-level flow: string → string

Input

list[str], len=B

e.g. ["The capital of France is"]

↓tokenizer.encode (SentencePiece BPE)

input_ids

[B, T] int64

T = number of tokens after BPE

↓embed_tokens lookup, then × √d_model = √2304 = 48

residual stream (initial)

[B, T, 2304] bf16

embed_tokens.weight shape = (256000, 2304); same matrix is reused as lm_head
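
The lookup-and-scale step can be sketched in plain Python. This is a toy illustration, not the model's code: `embed` and `toy_table` are hypothetical names, and the 3 × 4 table stands in for the real (256000, 2304) matrix.

```python
import math

D_MODEL = 2304  # hidden_size from the config table

def embed(token_ids, table):
    """Look up each token's row and scale it by sqrt(row width)."""
    scale = math.sqrt(len(table[0]))
    return [[v * scale for v in table[t]] for t in token_ids]

# Toy table: 3 tokens, width 4 (stands in for 256000 x 2304).
toy_table = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.5, -0.5, 0.25],
    [0.1, 0.2, 0.3, 0.4],
]

print(embed([1, 2], toy_table)[0])  # scale is sqrt(4) = 2, so [2.0, 1.0, -1.0, 0.5]
print(math.sqrt(D_MODEL))           # 48.0
```

For the real width the scale factor is exactly 48, since 48 × 48 = 2304.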

↓× 26 DecoderLayers — see expansion below

residual stream (after layer 26)

[B, T, 2304] bf16

↓final RMSNorm (rms_norm_eps=1e-6)

normalized hidden

[B, T, 2304] bf16

↓lm_head — tied to embed_tokens.T

logits

[B, T, 256000] fp32

final logit softcap (30.0) applied: logits = 30 * tanh(logits / 30)
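
A minimal sketch of the softcap formula, written for a single scalar. The helper name `softcap` is illustrative; the same expression is applied elementwise to the logits tensor.

```python
import math

def softcap(x, cap):
    """Squash x smoothly into (-cap, cap); close to the identity when |x| << cap."""
    return cap * math.tanh(x / cap)

for raw in (1.0, 30.0, 300.0):
    # Small values pass almost unchanged; large values saturate just below the cap.
    print(raw, round(softcap(raw, 30.0), 3))
```

Because tanh is bounded, no logit can exceed 30 in magnitude after this step, while small logits keep roughly their original value.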

↓take last position [:, -1, :], softmax, argmax (or sample)

next_token_id

[B] int64
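
The last-position selection above can be sketched on nested lists. This shows greedy decoding only (no sampling); `greedy_next_token` and `toy_logits` are hypothetical names, and the toy vocabulary has 3 entries instead of 256000.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_next_token(logits):
    """logits is a nested list shaped [B][T][V]; returns one token id per batch row."""
    out = []
    for seq in logits:
        probs = softmax(seq[-1])  # last position only, i.e. [:, -1, :]
        out.append(max(range(len(probs)), key=probs.__getitem__))
    return out

toy_logits = [
    [[0.1, 2.0, -1.0], [3.0, 0.5, 0.2]],
    [[0.0, 0.0, 1.0], [-2.0, 4.0, 1.0]],
]
print(greedy_next_token(toy_logits))  # [0, 1]
```

For pure argmax the softmax is not strictly needed, since it preserves ordering; it matters once sampling is used instead.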

↓tokenizer.decode

Output

list[str], len=B

e.g. [" Paris"]

One DecoderLayer (×26)

x_in

[B, T, 2304]

residual stream entering this layer

↓residual₁ ← x_in (save for skip connection)

input_layernorm (RMSNorm)

[B, T, 2304] → [B, T, 2304]
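
A sketch of RMSNorm on one vector of the residual stream. The `(1 + weight)` gain is an assumption taken from the Hugging Face Gemma implementation, not something stated in this sheet; `rms_norm` is an illustrative name.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide a vector by its root-mean-square, then apply a learned gain.

    Assumption: the gain is (1 + weight), so a zero weight means "no rescaling".
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * (1.0 + w) for v, w in zip(x, weight)]

# RMS of [3, -4, 0, 0] is sqrt(25 / 4) = 2.5, so the output is about [1.2, -1.6, 0, 0].
y = rms_norm([3.0, -4.0, 0.0, 0.0], [0.0] * 4)
print([round(v, 4) for v in y])
```

The shape is unchanged, which is why every norm in the layer maps [B, T, 2304] to [B, T, 2304].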

↓

self_attn — Grouped-Query Attention

Q proj: 2304 → 8·256 = 2048 | K proj: 2304 → 4·256 = 1024 | V proj: 2304 → 4·256 = 1024
Reshape Q→[B, T, 8, 256]; K,V→[B, T, 4, 256]; repeat K,V ×2 to match Q heads
RoPE on Q and K (theta=10000)
scores = Q @ Kᵀ / √256, softcap(50.0), mask (sliding-window=4096 OR full), softmax → weights
attn_out = weights @ V → [B, T, 8, 256] → [B, T, 2048]
O proj: 2048 → 2304
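
Two pieces of the attention block lend themselves to a small sketch: the causal mask (full or sliding-window) and the mapping from query heads to shared K/V heads. Both helpers are hypothetical. The window convention (the current token counts toward the window) and the consecutive grouping of query heads are assumptions.

```python
def causal_mask(T, window=None):
    """mask[i][j] is True where query i may attend to key j.

    window=None gives a full causal mask; window=W additionally requires i - j < W.
    """
    return [[j <= i and (window is None or i - j < window) for j in range(T)]
            for i in range(T)]

def kv_head_for(q_head, n_q_heads=8, n_kv_heads=4):
    """GQA: each consecutive group of query heads shares one K/V head."""
    return q_head // (n_q_heads // n_kv_heads)

full = causal_mask(5)
local = causal_mask(5, window=2)
print([sum(row) for row in full])          # [1, 2, 3, 4, 5]
print([sum(row) for row in local])         # [1, 2, 2, 2, 2]
print([kv_head_for(h) for h in range(8)])  # [0, 0, 1, 1, 2, 2, 3, 3]
```

With T ≤ 4096 the two masks are identical, so the sliding-window layers only differ from the full layers on longer sequences.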

↓attn output shape: [B, T, 2304]
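
For readers who want the whole score pipeline in one place, here is a single-head sketch in plain Python: scale, softcap, causal (optionally windowed) mask, softmax, weighted sum of V. It omits RoPE, the projections, and batching, and `attend` is an illustrative name rather than anything from the model code.

```python
import math

def attend(q, k, v, cap=50.0, window=None):
    """Single-head causal attention over lists of vectors (q, k, v: [T][d])."""
    T, d = len(q), len(q[0])
    out = []
    for i in range(T):
        # Keys visible to query i: causal, optionally only the last `window` positions.
        lo = 0 if window is None else max(0, i - window + 1)
        scores = []
        for j in range(lo, i + 1):
            s = sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            scores.append(cap * math.tanh(s / cap))  # softcap before softmax
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(w[n] / z * v[lo + n][c] for n in range(len(w)))
                    for c in range(len(v[0]))])
    return out

q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(attend(q, k, v)[0])            # first token sees only itself: [1.0, 0.0]
print(attend(q, k, v, window=1)[2])  # window of 1 also sees only itself: [2.0, 2.0]
```

Masked positions are simply skipped here; a tensor implementation would instead set their scores to a large negative value before the softmax.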

post_attention_layernorm (RMSNorm)

[B, T, 2304]

↓x_mid = residual₁ + ↑ (residual add)

x_mid

[B, T, 2304]

↓residual₂ ← x_mid

pre_feedforward_layernorm (RMSNorm)

[B, T, 2304]

↓

MLP — GeGLU

gate_proj: 2304 → 9216  |   up_proj: 2304 → 9216  |   down_proj: 9216 → 2304
hidden = gelu(gate_proj(x)) * up_proj(x)  →  [B, T, 9216]
out = down_proj(hidden)  →  [B, T, 2304]
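
The GeGLU block above, sketched with toy weights. The tanh approximation of GELU is an assumption (the sheet only says "gelu"); `gelu_tanh`, `matvec`, and `geglu_mlp` are illustrative names, and the 2 → 3 → 2 sizes stand in for 2304 → 9216 → 2304.

```python
import math

def gelu_tanh(x):
    """Tanh approximation of GELU (assumed variant)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def matvec(W, x):
    """W is stored as [out][in]; returns W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def geglu_mlp(x, W_gate, W_up, W_down):
    gate = [gelu_tanh(g) for g in matvec(W_gate, x)]  # gelu(gate_proj(x))
    up = matvec(W_up, x)                              # up_proj(x)
    hidden = [g * u for g, u in zip(gate, up)]        # elementwise gating
    return matvec(W_down, hidden)                     # down_proj(hidden)

W_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_up = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
W_down = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
y = geglu_mlp([1.0, -1.0], W_gate, W_up, W_down)
print(len(y))  # 2: output width matches the input width
```

The gate and up projections see the same input; only the gate branch passes through the nonlinearity.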

↓[B, T, 2304]

post_feedforward_layernorm (RMSNorm)

[B, T, 2304]

↓x_out = residual₂ + ↑

x_out

[B, T, 2304]

passed as x_in to the next layer (or to the final RMSNorm if this was layer 26)
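
The residual wiring of one layer can be condensed into a few lines. This is a structural sketch only: `decoder_layer` is a hypothetical helper, the sub-blocks are passed in as plain callables, and it operates on a single vector rather than a [B, T, 2304] tensor.

```python
def decoder_layer(x, attn, mlp, norms):
    """One layer's residual wiring.

    `norms` holds four callables in order:
    input, post-attention, pre-feedforward, post-feedforward.
    """
    n_in, n_post_attn, n_pre_ff, n_post_ff = norms
    # x_mid = residual_1 + post_attention_layernorm(self_attn(input_layernorm(x)))
    x_mid = [a + b for a, b in zip(x, n_post_attn(attn(n_in(x))))]
    # x_out = residual_2 + post_feedforward_layernorm(mlp(pre_feedforward_layernorm(x_mid)))
    x_out = [a + b for a, b in zip(x_mid, n_post_ff(mlp(n_pre_ff(x_mid))))]
    return x_out

identity = lambda v: v
double = lambda v: [2.0 * t for t in v]
zeros = lambda v: [0.0 for _ in v]

# If both sub-blocks output zeros, the residual stream passes through untouched.
print(decoder_layer([1.0, 2.0], attn=zeros, mlp=zeros, norms=[identity] * 4))     # [1.0, 2.0]
print(decoder_layer([1.0, 2.0], attn=double, mlp=identity, norms=[identity] * 4))  # [6.0, 12.0]
```

Note that each sub-block is normalized both before and after it runs, and only the normalized output is added to the residual stream.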

Config (from model.config)

field	value
model_id	google/gemma-2-2b
params	2.61 B
vocab_size	256000
hidden_size (d_model)	2304
num_hidden_layers	26
num_attention_heads (Q)	8
num_key_value_heads (K, V — GQA)	4
head_dim	256
intermediate_size (MLP up)	9216
max_position_embeddings	8192
attn_window (sliding layers)	4096
attn_pattern	alternating: even layers (0-indexed) sliding-window, odd layers full
rope_theta	10000
rms_norm_eps	1e-6
tied embeddings	yes (lm_head.weight == embed_tokens.weight)
activation	GeGLU (gelu(gate) * up)
logit softcap	30.0
attn softcap	50.0
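
A few derived quantities follow directly from the table. The sketch below copies the relevant fields into a plain dict (it does not load the real config object) and assumes layers are numbered from 0.

```python
config = {
    "hidden_size": 2304,
    "num_hidden_layers": 26,
    "num_attention_heads": 8,
    "num_key_value_heads": 4,
    "head_dim": 256,
    "intermediate_size": 9216,
}

q_width = config["num_attention_heads"] * config["head_dim"]
kv_width = config["num_key_value_heads"] * config["head_dim"]
group_size = config["num_attention_heads"] // config["num_key_value_heads"]

# Assumption: layers are numbered from 0, so layer 0 is a sliding-window layer.
layer_types = ["sliding" if i % 2 == 0 else "full"
               for i in range(config["num_hidden_layers"])]

print(q_width, kv_width, group_size)  # 2048 1024 2
print(layer_types[:4])                # ['sliding', 'full', 'sliding', 'full']
print(layer_types.count("sliding"))   # 13
```

The query width (2048) is smaller than hidden_size (2304), which is why o_proj maps 2048 back up to 2304.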

Parameter counts

component	shape	params
embed_tokens (= lm_head, tied)	(256000, 2304)	589.8 M
per layer: q_proj	(2304, 2048)	4.7 M
per layer: k_proj	(2304, 1024)	2.4 M
per layer: v_proj	(2304, 1024)	2.4 M
per layer: o_proj	(2048, 2304)	4.7 M
per layer: gate_proj	(2304, 9216)	21.2 M
per layer: up_proj	(2304, 9216)	21.2 M
per layer: down_proj	(9216, 2304)	21.2 M
per layer: 4 × RMSNorm	4 × (2304,)	≈ 9 k
per layer (sum)		≈ 77.9 M
26 layers		≈ 2024.5 M
final RMSNorm	(2304,)	≈ 2 k
Total (with tied head)		≈ 2.61 B
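
The totals can be re-derived from the shapes in the table. This is plain arithmetic under the sheet's own assumptions (no biases, tied lm_head counted once); the variable names are illustrative.

```python
D, V, L = 2304, 256000, 26       # hidden_size, vocab_size, num_hidden_layers
Q, KV, FF = 8 * 256, 4 * 256, 9216

embed = V * D                          # shared with lm_head (tied)
attn = D * Q + 2 * D * KV + Q * D      # q_proj, k_proj, v_proj, o_proj
mlp = 3 * D * FF                       # gate_proj, up_proj, down_proj
norms = 4 * D                          # four RMSNorm weight vectors per layer
per_layer = attn + mlp + norms
total = embed + L * per_layer + D      # + final RMSNorm

print(per_layer)      # 77865984
print(L * per_layer)  # 2024515584
print(total)          # 2614341888
```

The embedding matrix alone accounts for roughly 23% of the total, which is why tying it to lm_head matters at this model size.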