MLX's CUDA backend is getting better. It's especially nice if you appreciate fast startup times. But it's also quite fast in general. Here's Qwen3 4B in fp8 running on my DGX Spark:
- Processed 18.5k tokens in < 4 seconds
- Generates at 32.5 tok/sec with 18.5k context
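To put those two numbers in per-token terms, here is a quick back-of-the-envelope sketch. It is only arithmetic on the figures quoted above, not a benchmark. The helper names are made up for illustration, and "18.5k" is treated as roughly 18,500 tokens.

```python
# Rough arithmetic on the figures quoted in the post (not a benchmark).
# Function names are hypothetical; 18.5k is assumed to mean ~18,500 tokens.

def prefill_rate_lower_bound(prompt_tokens: int, max_seconds: float) -> float:
    """Minimum prompt-processing rate (tok/sec) if the prompt took at most max_seconds."""
    return prompt_tokens / max_seconds


def ms_per_token(tokens_per_second: float) -> float:
    """Average time per generated token, in milliseconds."""
    return 1000.0 / tokens_per_second


if __name__ == "__main__":
    # "< 4 seconds" means the real rate is at least this value.
    print(f"prefill: > {prefill_rate_lower_bound(18_500, 4.0):.0f} tok/sec")
    print(f"decode:  ~ {ms_per_token(32.5):.1f} ms per token")
```

So prompt processing runs at more than about 4,600 tok/sec, and generation at that context length costs about 31 ms per token.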
Also super simple to get up and running: