x_in
[B, T, 2304]
residual stream entering this layer
↓ residual₁ ← x_in (saved for the skip connection)
input_layernorm (RMSNorm)
[B, T, 2304] → [B, T, 2304]
↓
self_attn — Grouped-Query Attention
Q proj: 2304 → 8·256 = 2048 | K proj: 2304 → 4·256 = 1024 | V proj: 2304 → 4·256 = 1024
Reshape Q→[B, T, 8, 256]; K,V→[B, T, 4, 256]; repeat K,V ×2 to match Q heads
RoPE on Q and K (theta=10000)
scores = Q @ Kᵀ / √256; softcap: 50.0 · tanh(scores / 50.0); causal mask (sliding-window=4096 OR full); probs = softmax(scores)
attn_out = probs @ V → [B, T, 8, 256] → [B, T, 2048]
O proj: 2048 → 2304
↓ attn output shape: [B, T, 2304]
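The attention steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: RoPE is left out for brevity, the weight matrices are passed in as plain arrays with hypothetical names (`Wq`, `Wk`, `Wv`, `Wo`), and the exact boundary convention of the sliding window (here: each query sees itself and the `window - 1` keys before it) is an assumption.

```python
import numpy as np

def softcap(x, cap=50.0):
    # Squash logits smoothly into the range (-cap, cap).
    return cap * np.tanh(x / cap)

def causal_mask(T, window=None):
    # True where query i may attend to key j: j <= i,
    # and additionally i - j < window when a sliding window is given.
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    allowed = j <= i
    if window is not None:
        allowed &= (i - j) < window
    return allowed

def gqa_attention(x, Wq, Wk, Wv, Wo, n_q=8, n_kv=4, d_head=256, window=None):
    B, T, _ = x.shape
    q = (x @ Wq).reshape(B, T, n_q, d_head)
    k = (x @ Wk).reshape(B, T, n_kv, d_head)
    v = (x @ Wv).reshape(B, T, n_kv, d_head)
    # Repeat each K/V head so every Q head has a partner (x2 for 8 vs 4 heads).
    k = np.repeat(k, n_q // n_kv, axis=2)
    v = np.repeat(v, n_q // n_kv, axis=2)
    # NOTE: RoPE on q and k would be applied here; omitted in this sketch.
    scores = np.einsum("bqhd,bkhd->bhqk", q, k) / np.sqrt(d_head)
    scores = softcap(scores, 50.0)
    scores = np.where(causal_mask(T, window), scores, -np.inf)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    out = np.einsum("bhqk,bkhd->bqhd", probs, v).reshape(B, T, n_q * d_head)
    return out @ Wo  # O proj back to the model width
```

With the dimensions in the diagram, `Wq` would be 2304×2048, `Wk` and `Wv` 2304×1024, and `Wo` 2048×2304. Passing `window=4096` gives the sliding-window variant and `window=None` the full causal one.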
post_attention_layernorm (RMSNorm)
[B, T, 2304]
↓ x_mid = residual₁ + post_attention_layernorm(attn output) (residual add)
↓ residual₂ ← x_mid (saved for the second skip connection)
pre_feedforward_layernorm (RMSNorm)
[B, T, 2304]
↓
MLP — GeGLU
gate_proj: 2304 → 9216 | up_proj: 2304 → 9216 | down_proj: 9216 → 2304
hidden = gelu(gate_proj(x)) * up_proj(x) → [B, T, 9216]
out = down_proj(hidden) → [B, T, 2304]
↓ [B, T, 2304]
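The GeGLU block above is small enough to write out directly. The sketch below assumes bias-free projections and the tanh approximation of GELU; the diagram only says `gelu`, so the exact variant is an assumption, and the weight argument names are hypothetical.

```python
import numpy as np

def gelu_tanh(x):
    # tanh approximation of GELU (assumed variant; the diagram just says "gelu").
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def geglu_mlp(x, W_gate, W_up, W_down):
    # x: [B, T, d_model]; W_gate, W_up: [d_model, d_ff]; W_down: [d_ff, d_model]
    hidden = gelu_tanh(x @ W_gate) * (x @ W_up)  # gated hidden state, [B, T, d_ff]
    return hidden @ W_down                       # back to [B, T, d_model]
```

With the sizes in the diagram, `d_model` is 2304 and `d_ff` is 9216, so the three matrices hold 3 · 2304 · 9216 weights in total.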
post_feedforward_layernorm (RMSNorm)
[B, T, 2304]
↓ x_out = residual₂ + post_feedforward_layernorm(MLP output) (residual add)
x_out
[B, T, 2304]
passed as x_in to the next layer (or to the final RMSNorm if this is the last layer, layer 26)
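The overall wiring of one layer (a norm before and after each sub-block, with two residual adds) can be sketched as below. The attention and MLP sub-blocks are passed in as callables so the sketch stays self-contained. The `norms` dictionary keys, the `eps` value, and the plain multiplicative gain in `rms_norm` are assumptions for illustration; some implementations scale by `1 + weight` instead.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Normalize by the root-mean-square over the last axis, then apply a gain.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def decoder_layer(x_in, attn_fn, mlp_fn, norms):
    # Attention half: norm -> attention -> norm -> residual add.
    residual = x_in
    h = rms_norm(x_in, norms["input"])
    h = attn_fn(h)
    h = rms_norm(h, norms["post_attention"])
    x_mid = residual + h

    # MLP half: norm -> MLP -> norm -> residual add.
    residual = x_mid
    h = rms_norm(x_mid, norms["pre_feedforward"])
    h = mlp_fn(h)
    h = rms_norm(h, norms["post_feedforward"])
    return residual + h  # x_out, same shape as x_in
```

Because both sub-block outputs are normalized before being added back, each residual add contributes an update of controlled scale to the residual stream, whatever the magnitude of the raw attention or MLP output.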