A suite of foundation models (7B–65B parameters) trained on up to 1.4T tokens, with the 65B version outperforming GPT-3 on many benchmarks.
Technical Specifications
Parameters
65B
Context Length
2,048 tokens (2K)
Architecture
Autoregressive decoder-only Transformer with SwiGLU activations, rotary positional embeddings (RoPE), and RMSNorm
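As a rough illustration of these components, the following is a minimal PyTorch sketch of RMSNorm, a SwiGLU feed-forward layer, and RoPE. The class names, dimension parameters, and the half-rotation RoPE variant shown here are illustrative assumptions, not the original implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square layer norm: rescales by the RMS of the activations (no mean-centering)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight


class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward block: SiLU(x W1) * (x W3), projected back with W2."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs by a position-dependent angle (RoPE), half-rotation form."""
    seq_len, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

In the full model these pieces sit inside each decoder layer: RMSNorm is applied before the attention and feed-forward sublayers (pre-normalization), RoPE is applied to the query and key projections inside attention, and the SwiGLU block replaces the standard ReLU feed-forward network.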