Enhancing DeepSeek R1 performance for on-device inference with ONNX Runtime.

By:

Parinita Rahi, Sunghoon Choi, Kunal Vaishnavi, Maanav Dalal

19TH FEBRUARY, 2025

Are you a developer looking to harness the power of your users’ local compute for AI inferencing on PCs with NPUs, GPUs, and CPUs? Look no further!

With this release, you can now run the DeepSeek R1 distilled models on CPU and GPU. The ONNX-optimized variants of the models are available to download and run from Hugging Face. Additionally, you can run these models on NPU; see the Windows Developer Blog for details.

Download and run your models easily!

The DeepSeek ONNX models enable you to run DeepSeek on any GPU or CPU, with token generation 1.3 to 6.3 times faster than native PyTorch. To get started with the model easily, you can use our ONNX Runtime Generate() API.
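If you prefer to call the Generate() API directly from Python rather than through the example script below, a minimal sketch looks roughly like the following. The model path matches the CPU quickstart that follows, the raw prompt skips the chat template that model-chat.py applies, and the method names follow the current onnxruntime-genai examples, so they may differ slightly between releases.

import onnxruntime_genai as og

# Path from the CPU quickstart below; adjust to wherever you downloaded the model.
model_path = "deepseek-r1-distill-qwen-1.5B/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4"

model = og.Model(model_path)
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("What is 1+1?"))

# Stream the response token by token.
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
print()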

Quickstart on CPU

Installing onnxruntime-genai and dependencies for CPU in a virtual environment:

python -m venv .venv && source .venv/bin/activate
pip install requests numpy --pre onnxruntime-genai

Download the model directly using the Hugging Face CLI:

huggingface-cli download onnxruntime/DeepSeek-R1-Distill-ONNX --include "deepseek-r1-distill-qwen-1.5B/*" --local-dir ./

Run chat inference on CPU. If you downloaded the model to a different location, adjust the model directory (-m) accordingly:

wget https://raw.githubusercontent.com/microsoft/onnxruntime-genai/refs/heads/main/examples/python/model-chat.py
python model-chat.py -m deepseek-r1-distill-qwen-1.5B/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 -e cpu

See instructions for GPU (CUDA, DML) here.

ONNX Model Performance Improvements

ONNX Runtime enables you to run your models on-device across CPU, GPU, and NPU, on silicon from Qualcomm, AMD, Intel, and NVIDIA. See the table below for some key benchmarks on Windows GPU and CPU devices; a simple throughput-measurement sketch follows the build specs.

| Model | Precision | Execution Provider | Device | Token Generation Throughput (tokens/sec) | Speed Up vs PyTorch |
|---|---|---|---|---|---|
| deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B | fp16 | CUDA | RTX 4090 | 197.195 | 4X |
| deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B | int4 | CUDA | RTX 4090 | 313.32 | 6.3X |
| deepseek-ai_DeepSeek-R1-Distill-Qwen-7B | fp16 | CUDA | RTX 4090 | 57.316 | 1.3X |
| deepseek-ai_DeepSeek-R1-Distill-Qwen-7B | int4 | CUDA | RTX 4090 | 161.00 | 3.7X |
| deepseek-ai_DeepSeek-R1-Distill-Qwen-7B | int4 | CPU | 13th Gen Intel i9 | 3.184 | 20X |
| deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B | int4 | CPU | 13th Gen Intel i9 | 11.749 | 1.4X |

CUDA BUILD SPECS: onnxruntime-genai-cuda==0.6.0, transformers==4.46.2, onnxruntime-gpu==1.20.1
CPU BUILD SPECS: onnxruntime-genai==0.6.0, transformers==4.46.2, onnxruntime==1.20.1
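As a rough illustration of how token generation throughput can be measured with the Generate() API, here is a simplified sketch. This is not the exact harness used to produce the numbers above; the prompt, token budget, and helper name are illustrative.

import time
import onnxruntime_genai as og

def tokens_per_second(model_dir, prompt, new_tokens=128):
    """Roughly measure token generation throughput for one ONNX model folder.

    The genai_config.json inside the folder describes the target execution
    provider (CPU, CUDA, ...), so the same code runs on any of them.
    """
    model = og.Model(model_dir)
    tokenizer = og.Tokenizer(model)

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=new_tokens + 256)

    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode(prompt))

    generated = 0
    start = time.perf_counter()
    while not generator.is_done() and generated < new_tokens:
        generator.generate_next_token()
        generated += 1
    return generated / (time.perf_counter() - start)

# Example, using the CPU model folder from the quickstart above:
# print(tokens_per_second(
#     "deepseek-r1-distill-qwen-1.5B/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4",
#     "Explain ONNX Runtime in one paragraph."))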

Easily fine-tune your models with Olive

This notebook provides a step-by-step guide to fine-tuning DeepSeek models using the Olive framework. It covers the process of setting up your environment, preparing your data, and leveraging Azure AI Foundry to optimize and deploy your models. The notebook is designed to help you get started quickly and efficiently with DeepSeek and Olive, making your AI development process smoother and more effective.
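Under the hood, Olive drives the fine-tuning and optimization passes from a single workflow configuration. As a rough sketch, such a workflow can be launched from Python as shown below; the config file name here is a placeholder, and the actual configuration (model, dataset, fine-tuning and ONNX optimization passes) is defined in the notebook.

# Minimal sketch of launching an Olive workflow from Python.
# "deepseek_finetune_config.json" is a placeholder; the real configuration
# comes from the notebook.
from olive.workflows import run as olive_run

olive_run("deepseek_finetune_config.json")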

Conclusion

Optimizing DeepSeek R1 distilled models with ONNX Runtime can lead to significant performance improvements. These optimized models are coming soon via Azure AI Foundry and can be easily accessed via the command line or the VS Code AI Toolkit.

By combining Azure AI Foundry, the AI Toolkit, Olive, and ONNX Runtime, you get an end-to-end solution for your model development experience. Stay tuned for more updates and best practices on enhancing AI model performance.