A 70B-parameter model trained on substantially more data, in accordance with compute-optimal scaling laws, attaining superior accuracy for its compute budget (e.g., 67.5% on MMLU).
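For context, here is a minimal sketch of the compute-optimal rule of thumb this refers to: the scaling-law analysis suggests the number of training tokens should grow roughly in proportion to model size, approximately 20 tokens per parameter. The constant 20 and the helper `optimal_tokens` are illustrative assumptions, not values taken from this document.

```python
# Rough sketch of the compute-optimal scaling rule of thumb:
# training tokens D scale with parameter count N as D ~ 20 * N.
# The factor of 20 is an approximation, not a value stated in this document.

def optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal training-token count for a given model size."""
    return tokens_per_param * n_params

n = 70e9  # 70 billion parameters
print(f"~{optimal_tokens(n) / 1e12:.1f}T tokens")  # -> ~1.4T tokens
```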
Technical Specifications
Parameters
70 billion (70B)
Architecture
Autoregressive decoder-only transformer with 70 billion parameters, 80 layers, 64 attention heads, and a hidden dimension of 8192.
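As a sanity check on the figures above, a rough parameter count can be derived from the layer count and hidden dimension alone. The sketch below assumes a standard decoder-only block shape (feed-forward dimension of 4 × hidden dimension) and ignores layer norms, biases, and positional parameters; the vocabulary size is an assumption, since it is not given here.

```python
# Rough parameter-count estimate for the configuration above.
# Assumes a standard transformer block (d_ff = 4 * d_model); ignores
# layer norms, biases, and positional parameters. The vocabulary size
# is an assumed placeholder, not a value from this document.

def approx_params(n_layers: int, d_model: int, vocab_size: int = 32_000) -> int:
    attn = 4 * d_model * d_model       # Q, K, V, and output projections
    mlp = 2 * d_model * (4 * d_model)  # up- and down-projections, d_ff = 4 * d_model
    return n_layers * (attn + mlp) + vocab_size * d_model  # plus embedding table

print(f"~{approx_params(80, 8192) / 1e9:.1f}B parameters")  # -> ~64.7B
```

The estimate lands near, but below, the stated 70B, which is expected: the simplification omits per-layer norms, biases, and any positional or architecture-specific parameters.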