To run Llama 3.1 70B locally at Q3_K_M quantization, you need at least 48.4 GB of GPU VRAM.
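For a rough sense of where a figure like this comes from, the sketch below adds up the two largest contributors: quantized weights and the KV cache. It is an illustration under stated assumptions, not the calculator's actual formula, and it is not expected to reproduce the 48.4 GB figure, which likely includes overheads and settings not shown here. The function name `estimate_vram_gb`, the roughly 3.9 bits per weight used for Q3_K_M, the 8192-token context, and the 16-bit KV cache are all assumptions. The layer count, KV head count, and head dimension are Llama 3.1 70B's published architecture values.

```python
def estimate_vram_gb(n_params, bits_per_weight, n_layers, n_kv_heads,
                     head_dim, context_len, kv_bytes_per_value=2):
    """Rough VRAM estimate in GB (1 GB = 1e9 bytes): weights plus KV cache.

    Ignores activations, compute buffers, and framework overhead,
    so real usage will be higher than this number.
    """
    # Quantized weights: parameters times average bits per weight, in bytes.
    weights_bytes = n_params * bits_per_weight / 8

    # KV cache: one key and one value vector per layer, per KV head, per token.
    kv_cache_bytes = (2 * n_layers * n_kv_heads * head_dim
                      * context_len * kv_bytes_per_value)

    return {
        "weights_gb": weights_bytes / 1e9,
        "kv_cache_gb": kv_cache_bytes / 1e9,
        "total_gb": (weights_bytes + kv_cache_bytes) / 1e9,
    }


if __name__ == "__main__":
    estimate = estimate_vram_gb(
        n_params=70.6e9,       # approximate parameter count (assumption)
        bits_per_weight=3.9,   # assumed average for Q3_K_M, varies by model
        n_layers=80,
        n_kv_heads=8,          # grouped-query attention
        head_dim=128,
        context_len=8192,      # assumed context length
    )
    for name, value in estimate.items():
        print(f"{name}: {value:.1f}")
```

The gap between an estimate like this and the stated requirement is the part to budget for: longer contexts grow the KV cache linearly, and runtime buffers add a further margin on top.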
Use a lower-bit quantization to fit. Viable for testing at this scale.
Fits on a single 80 GB card. An industry standard for large-model inference.
State-of-the-art inference. 3× the bandwidth of the A100.
Use the interactive calculator to compare Llama 3.1 70B across all available formats.