Spark Large-Context Benchmark
Last 48 Hours — 2026-05-19 · Large-context benchmark data only. This report is intentionally data-heavy and preserves all benchmark tables in full.
Benchmark Report
Large-Context Only
Executive Verdict
Key findings from the last 48 hours of large-context benchmarking across all tested model paths and runtimes.
🏆 Best Native Official Qwen Result
Qwen/Qwen3.6-35B-A3B on vLLM/Ray TP=2 passed ~256K prompt tokens with 5/5 recall at ~2,098 effective tok/s.
🔬 Best Qwen Extrapolation Result
Same official Qwen path passed ~384K prompt tokens with 5/5 recall at ~1,775 effective tok/s — exceeds native 262K context; treat as experimental.
📏 Largest Stable Prompt Tested
Llama 4 Scout NVFP4 with vLLM FlashAttention/BF16 KV survived ~496K prompt tokens at ~827 effective tok/s, but recall was only 3/5.
Best Scout Quality Result
Scout NVFP4 with vLLM FlashInfer/FP8 KV passed ~233K prompt tokens with 5/5 recall at ~1,006 effective tok/s, but crashed near ~387K.
Best GGUF Long-Context Qwen Path
Qwen3 30B A3B Instruct GGUF reached ~240K prompt tokens with 5/5 recall; effective throughput fell to ~559 tok/s at that size; short decode remained fast at ~87 tok/s.
Large-Context Comparison Table
All tested model/path combinations ranked by prompt token count. Includes recall, elapsed time, and effective throughput.
Official Qwen3.6-35B-A3B Large-Context Data
Runtime: vLLM/Ray distributed executor, tensor parallel size 2 across Spark1 + Spark2. Model alias: qwen36-official. Critical launch fix: --default-chat-template-kwargs '{"enable_thinking": false}'.

The 384K recall result exceeds the native 262K context window. It worked synthetically but should not be counted as a guaranteed production-quality result.
Qwen3.6 Throughput Scaling
Effective tok/s across all official Qwen3.6-35B-A3B recall tests, showing throughput degradation as prompt size grows from 32K to 384K tokens.
Throughput degrades gracefully from 2,540 tok/s at 32K to 1,775 tok/s at 384K — a ~30% reduction over a 12× increase in prompt size. All tests achieved perfect 5/5 recall.
Llama 4 Scout NVFP4 Large-Context Data
Model: nvidia/Llama-4-Scout-17B-16E-Instruct-NVFP4. Native config advertises massive context (text_config.max_position_embeddings = 10485760). The model itself is a strong architectural candidate; the runtime backend was the limiting factor.

FlashInfer + FP8 KV delivers better recall quality (5/5) at lower context sizes up to ~233K tokens.

FlashAttention + BF16 KV is more stable at extreme context sizes (up to ~496K) but recall degrades to 3/5.
Scout Backend Comparison
FlashInfer vs. FlashAttention backends for Llama 4 Scout NVFP4 — quality vs. scale tradeoff.
FlashInfer + FP8 KV
Best quality path
  • 5/5 recall at 233K tokens
  • 1,006 effective tok/s
  • Crashed at ~387K (ValueError)
  • Upper bound: ~233K reliable
FlashAttention + BF16 KV
Best scale path
  • 3/5 recall at 380K–496K tokens
  • 794–827 effective tok/s
  • No crash up to 496K
  • Clean HTTP 400 at 524K limit

The model architecture supports up to ~10.5M position embeddings natively. Runtime backends are the current bottleneck, not the model itself.
GGUF Large-Context Data — Qwen3 30B A3B Instruct 2507
Runtime: llama.cpp on Spark1. Short sustained decode: ~87.4 output tok/s. Model: Qwen3 30B A3B Instruct 2507 GGUF Q4_K_M.
Throughput drops significantly as context grows — from 1,812 tok/s at 30K to 559 tok/s at 240K — but recall remains perfect (5/5) throughout. Prompt evaluation time dominates at near-cap sizes.
GGUF Large-Context Data — HauhauCS Qwen3.6 Q4_K_M
Runtime: llama.cpp on Spark1. Short sustained decode: ~76.2 output tok/s. Model: HauhauCS Qwen3.6 GGUF Q4_K_M path.

The initial 272K attempt was cleanly rejected by the 262K context cap — not a crash. The 219K near-cap retry succeeded with 5/5 recall at ~1,440 prompt tok/s.
Blocked / No Large-Context Results
Several model/runtime combinations were tested but did not produce valid large-context throughput numbers. These are documented as failure boundaries.

TensorRT-LLM blocked all three Scout NVFP4 paths. The optimized backend does not yet support Llama4ForConditionalGeneration.
Cross-Model Throughput Comparison
Effective tok/s at the highest context size where each model/path achieved 5/5 recall. GGUF paths show prompt tok/s; vLLM paths show combined effective tok/s.
Official Qwen3.6-35B-A3B on vLLM/Ray leads all paths at native context sizes. HauhauCS GGUF is competitive at mid-range context. The Qwen3 30B GGUF path lags significantly at near-cap sizes due to prompt eval domination.
What the Large-Context Data Says
Five key conclusions from the last 48 hours of benchmarking on the two-Spark setup.
1
For reliable 250K-class context today, use official Qwen3.6-35B-A3B on vLLM/Ray TP=2.
It is the cleanest 5/5 native-context result — 2,098 effective tok/s at 256K tokens with no crashes.
2
For maximum experimental context, Scout FlashAttention/BF16 KV reached ~496K, but recall quality was weaker.
3/5 recall at extreme sizes. Use only when raw context length matters more than answer quality.
3
For fast short decode, GGUF wins.
The Qwen3 30B GGUF path (~87 tok/s short decode) is far faster for generated tokens than official BF16 vLLM at short output lengths.
4
For production-quality claims, do not count Qwen 384K as guaranteed.
It worked synthetically but exceeds the native 262K context window. Treat as experimental only.
5
The 1M-context dream is not solved yet on this two-Spark setup.
Scout is the right model family architecturally, but current runtime backends are the constraint. TensorRT-LLM paths remain blocked.