NVIDIA DGX Spark In-Depth Review: A New Standard for Local AI Inference

hank808 2 hours ago

You guys that continue to compare DGX Spark to the Mac Studios, please remember two things:

1. Virtually every model that you'd run was developed on Nvidia gear and will run on Spark.
2. Spark has fast-as-hell interconnects, the sort of interconnects that one would want to use in an actual AI DC. You can use more than one Spark at the same time, with RDMA, and actually start to figure out how things work the way they do and why. You can do a lot with 200 Gb/s of interconnect.

SethTro 5 hours ago

The article doesn't seem to mention the price, which is $4,000. That makes it comparable to a 5090, but with 128GB of unified LPDDR5x vs. the 5090's 32GB of GDDR7.

bilekas 31 minutes ago

$4,000 is actually extremely competitive. Even for an at-home enthusiast setup this price is not out of reach. I was expecting something far higher. That said, Nvidia's MSRP has been something of a pipe dream recently, so we'll see about availability when it's actually released. Curious also to see how they may scale together.
EnPissant 3 hours ago

A 5090 is $2000.
- adrian_b 37 minutes ago
  
  But you put in a $1500 PC (with 128 GB DRAM).
  Still, a PC with a 5090 will give in many cases a much better bang for the buck, except when limited by the slower speed of the main memory.
  The greater bandwidth available when accessing the entire 128 GB memory is the only advantage of NVIDIA DGX, while a cheaper PC with discrete GPU has a faster GPU, a faster CPU and a faster local GPU memory.
- sandworm101 an hour ago
  
  That's MSRP, but try getting your hands on one without a bulk order and/or camping out in a tent all weekend. I have seen people in my area buying pre-built machines, as they often cost less than trying to buy an individual card.
CamperBob2 5 hours ago

And about 1/4 the memory bandwidth, which is what matters for inference.
- threeducks an hour ago
  
  More precisely, the RTX 5090 has a memory bandwidth of 1792 GB/s, while the DGX Spark only has 273 GB/s, which is about 1/6.5.
  For inference, the DGX Spark does not look like a good choice, as there are cheaper alternatives with better performance.
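The ratio quoted above is easy to check. A minimal sketch, using only the two bandwidth figures from the comment; the function and constant names are illustrative, not from any library:

```python
# Memory bandwidth figures as quoted in the comment above, in GB/s.
RTX_5090_GB_S = 1792.0
DGX_SPARK_GB_S = 273.0


def bandwidth_ratio(faster_gb_s: float, slower_gb_s: float) -> float:
    """Return how many times more bandwidth the faster device has."""
    if slower_gb_s <= 0:
        raise ValueError("bandwidth must be positive")
    return faster_gb_s / slower_gb_s


ratio = bandwidth_ratio(RTX_5090_GB_S, DGX_SPARK_GB_S)
print(f"5090 / Spark bandwidth ratio: {ratio:.2f}x")
```

This comes out at roughly 6.6x, so "about 1/6.5" is closer than the "about 1/4" estimate in the parent comment.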
nialse 4 hours ago

Well, that’s disappointing since the Mac Studio 128GB is $3,499. If Apple happens to launch a Mac Mini with 128GB RAM it would eat Nvidia Sparks’ lunch every day.
- pjmlp 9 minutes ago
  
  Only if it runs CUDA, MLX / Metal isn't comparable as ecosystem.
  People who keep pushing for Apple gear tend to forget that Apple has decided not to make available on its hardware what the industry considers standards, proprietary or not.
  Even if Metal is actually a cool API to program for.
- moondev 3 hours ago
  
  Just don't try to run a NCCL
- newman314 3 hours ago
  
  Agreed. I also wonder why they chose to test against a Mac Studio with only 64GB instead of 128GB.
  - yvbbrjdr 3 hours ago
    
    Hi, author here. I crowd-sourced the devices for benchmarking from my friends. It just happened that one of my friends has this device.
    
    ggerganov 3 hours ago
    
    FYI you should have used llama.cpp to do the benchmarks. It performs almost 20x faster than ollama for the gpt-oss-120b model. Here are some sample results on my spark:

    ggml_cuda_init: found 1 CUDA devices:
      Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes

    | model                  |      size |   params | backend | ngl | n_ubatch | fa |   test |             t/s |
    | ---------------------- | --------: | -------: | ------- | --: | -------: | -: | -----: | --------------: |
    | gpt-oss 20B MXFP4 MoE  | 11.27 GiB |  20.91 B | CUDA    |  99 |     2048 |  1 | pp4096 |  3564.31 ± 9.91 |
    | gpt-oss 20B MXFP4 MoE  | 11.27 GiB |  20.91 B | CUDA    |  99 |     2048 |  1 |   tg32 |    53.93 ± 1.71 |
    | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA    |  99 |     2048 |  1 | pp4096 | 1792.32 ± 34.74 |
    | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA    |  99 |     2048 |  1 |   tg32 |    38.54 ± 3.10 |
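To put the "almost 20x" claim in numbers, here is a rough sketch that divides the gpt-oss-120b llama.cpp figures above by the article's Ollama figures for the same model (94.67 t/s prefill and 11.66 t/s decode, as quoted further down this thread). It assumes the two runs are comparable, which may not hold exactly since prompt lengths and settings can differ; the names are illustrative only.

```python
# gpt-oss-120b throughput in tokens/s.
# llama.cpp numbers are the pp4096 and tg32 rows from the table above;
# Ollama numbers are the article's figures quoted later in the thread.
LLAMA_CPP = {"prefill": 1792.32, "decode": 38.54}
OLLAMA = {"prefill": 94.67, "decode": 11.66}


def speedup(new_tps: float, old_tps: float) -> float:
    """Return how many times faster new_tps is than old_tps."""
    if old_tps <= 0:
        raise ValueError("throughput must be positive")
    return new_tps / old_tps


speedups = {phase: speedup(LLAMA_CPP[phase], OLLAMA[phase]) for phase in LLAMA_CPP}
for phase, factor in speedups.items():
    print(f"{phase}: {factor:.1f}x")
```

On these numbers the near-20x gap is in prefill (about 19x), while decode differs by a bit more than 3x.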
    
    rajatgupta314 2 hours ago
    
    Is this the full weight model or quantized version? The GGUFs distributed on Hugging Face labeled as MXFP4 quantization have layers that are quantized to int8 (q8_0) instead of bf16 as suggested by OpenAI.
    Example looking at blk.0.attn_k.weight, it's q8_0 amongst other layers:
    https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/tree/main?s...
    Example looking at the same weight on Ollama is BF16:
    https://ollama.com/library/gpt-oss:20b/blobs/e7b273f96360
    
    yvbbrjdr 3 hours ago
    
    I see! Do you know what's causing the slowdown for ollama? They should be using the same backend..
    
    __mharrison__ 3 hours ago
    
    Curious how this compares to running on a Mac.

andrewgleave 38 minutes ago

Looks like MLX is not a supported backend in Ollama so the numbers for the Mac could be significantly higher in some cases.

It would be interesting to swap out Ollama for LM Studio and use their built-in MLX support and see the difference.

pixelpoet 4 hours ago

I wonder why they didn't test against the broadly available Strix Halo with 128GB of 256 GB/s memory bandwidth, 16 core full-fat Zen5 with AVX512 at $2k... it is a mystery...

yvbbrjdr 3 hours ago

Hi, author here. I crowd-sourced the devices for benchmarking from my friends. It just happened that none of my friends has this device.
- EnPissant 2 hours ago
  
  Something is wrong with your numbers: gpt-oss-20b and gpt-oss-120b should be much much faster than what you are seeing. I would suggest you familiarize yourself with llama-bench instead of ollama.
  Running gpt-oss-120b with an RTX 5090 and 2/3 of the experts offloaded to system RAM (less than half of the memory bandwidth of this thing), my machine gets ~4100tps prefill and ~40tps decode.
  Your spreadsheet shows the spark getting ~94tps prefill and ~11tps decode.
  Now, it's expected that my machine should slaughter this thing in prefill, but decode should be very similar or the spark a touch faster.
  - yvbbrjdr 2 hours ago
    
    We actually profiled one of the models, and saw that the last GeMM, which is completely memory bound, is taking a lot of time, which reduces the token speed by a lot.
EnPissant 2 hours ago

Strix Halo has the problem that prefill is incredibly slow if your context is not very small.
The only thing that might be interesting about this DGX Spark is that its prefill manages to be faster due to better compute. I haven't compared the numbers yet, but they are included in the article.

limoce an hour ago

> ollama gpt-oss 120b mxfp4 1 94.67 11.66

This is insanely slow given its 200+GB/s memory bandwidth. As a comparison, I've tested GPT OSS 120B on Strix Halo and it obtains 420tps prefill and >40tps decode.
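One way to see why the decode number looks off is to ask how much memory traffic per token it would imply. This is a back-of-the-envelope sketch that assumes decode is purely bandwidth-bound (a simplification) and uses the 273 GB/s Spark figure and the decode speeds quoted elsewhere in this thread; the function name is hypothetical.

```python
# Figures quoted elsewhere in this thread.
SPARK_BANDWIDTH_GB_S = 273.0   # DGX Spark memory bandwidth
OLLAMA_DECODE_TPS = 11.66      # article's Ollama result, gpt-oss-120b
LLAMA_CPP_DECODE_TPS = 38.54   # llama.cpp tg32 result, gpt-oss-120b


def implied_gb_per_token(bandwidth_gb_s: float, tokens_per_s: float) -> float:
    """GB of memory traffic per generated token if decode used all the bandwidth."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return bandwidth_gb_s / tokens_per_s


ollama_gb = implied_gb_per_token(SPARK_BANDWIDTH_GB_S, OLLAMA_DECODE_TPS)
llama_cpp_gb = implied_gb_per_token(SPARK_BANDWIDTH_GB_S, LLAMA_CPP_DECODE_TPS)
print(f"Ollama:    {ollama_gb:.1f} GB per token")
print(f"llama.cpp: {llama_cpp_gb:.1f} GB per token")
```

Under that assumption, the Ollama result would correspond to roughly 23 GB of traffic per token versus roughly 7 GB for llama.cpp, which suggests the Ollama run was not making full use of the available bandwidth.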