国思软件 - NVIDIA Steps In with Its First DeepSeek-R1 Optimization: B200 Performance Soars 25x, Crushing the H100

Reported by 新智元

Edited by: 好困犀牛

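NVIDIA recently open-sourced the first DeepSeek-R1 checkpoint optimized for the Blackwell architecture, reporting a 25x increase in inference throughput and a 20x reduction in cost per token. Meanwhile, DeepSeek has open-sourced a series of optimization projects for NVIDIA GPUs, and together the two are probing the limits of model performance.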

What happens when the magic of FP4 meets the raw compute of Blackwell?

The answer: inference performance jumps 25x and cost drops 20x.

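With local deployment of DeepSeek-R1 taking off, NVIDIA has stepped in itself and open-sourced the first optimized version built for the Blackwell architecture: DeepSeek-R1-FP4.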

With the new model, the B200 reaches an inference throughput of up to 21,088 tokens per second, 25 times the H100's 844 tokens per second.
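The headline multiplier follows directly from the two throughput figures quoted above. A quick check, using only those two numbers from the article (the variable names are illustrative):

```python
# Throughput figures quoted in the article, in tokens per second.
b200_tokens_per_s = 21_088
h100_tokens_per_s = 844

# Ratio of the two throughputs; the article rounds this to 25x.
speedup = b200_tokens_per_s / h100_tokens_per_s
print(f"B200 vs. H100 throughput: {speedup:.2f}x")
```

The ratio comes out just under 25, so "25x" is a rounded figure.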

At the same time, the cost per token falls by a factor of 20.

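By applying TensorRT DeepSeek optimizations on the Blackwell architecture, NVIDIA brought the FP4 model to production-grade accuracy: on the MMLU general-intelligence benchmark it reaches 99.8% of the FP8 model's score.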

DeepSeek-R1 optimized for Blackwell GPUs for the first time

NVIDIA's FP4-optimized DeepSeek-R1 checkpoint is now open-sourced on Hugging Face.

Model: https://huggingface.co/nvidia/DeepSeek-R1-FP4

Post-training quantization

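The model quantizes the weights and activations of the linear operators inside the Transformer blocks to FP4, making it suitable for inference with TensorRT-LLM.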
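To make "quantizing to FP4" concrete, here is a minimal sketch of rounding a block of values onto a 4-bit floating-point grid with one shared scale. It is an illustration of the general idea, not NVIDIA's actual recipe: the E2M1 value grid is the standard 4-bit float layout, but the block contents, the max-based scale, and the helper names `quantize_block` and `dequantize_block` are assumptions made for this example.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(values):
    """Fake-quantize one block: return (scale, values snapped to the FP4 grid)."""
    max_abs = max(abs(v) for v in values)
    # Assumed scaling rule: map the largest magnitude onto the largest FP4 value.
    scale = max_abs / FP4_GRID[-1] if max_abs > 0 else 1.0
    quantized = []
    for v in values:
        # Pick the nearest representable magnitude, then restore the sign.
        magnitude = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        quantized.append(magnitude if v >= 0 else -magnitude)
    return scale, quantized


def dequantize_block(scale, quantized):
    """Map FP4 grid values back to the original range."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.50, 0.33, 0.91, -1.20, 0.05, 0.76, -0.02]
scale, q = quantize_block(weights)
restored = dequantize_block(scale, q)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("scale:", scale)
print("FP4 grid values:", q)
print("max abs error:", round(max_error, 4))
```

Each value needs only 4 bits plus a share of the per-block scale, which is where the storage saving comes from; the rounding error is the accuracy cost that the MMLU comparison in the article is measuring.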

This optimization cuts the number of bits per parameter from 8 to 4, reducing disk-space and GPU-memory requirements by about 1.6x.
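Halving the bits per parameter would give 2x if every byte of the checkpoint were quantized. One plausible reading of the roughly 1.6x figure is that only part of the model (the linear operators inside the Transformer blocks) moves to 4 bits while the rest keeps its original precision. The sketch below works through that arithmetic; the quantized fractions are hypothetical inputs for illustration, not measured properties of DeepSeek-R1-FP4, and scale-factor overhead is ignored.

```python
def size_reduction(quantized_fraction, bits_before=8, bits_after=4):
    """Overall size ratio when only a fraction of the model's bytes is quantized."""
    remaining = (1 - quantized_fraction) + quantized_fraction * bits_after / bits_before
    return 1 / remaining


# Hypothetical fractions of the checkpoint that are quantized from 8 to 4 bits.
for fraction in (1.0, 0.9, 0.75):
    print(f"{fraction:.0%} quantized -> {size_reduction(fraction):.2f}x smaller")
```

Under these simplifying assumptions, a model with about three quarters of its bytes quantized lands at 1.6x, which is consistent with the figure quoted above.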

Deploying with TensorRT-LLM

To deploy the quantized FP4 checkpoint with the TensorRT-LLM LLM API and generate text responses for a given set of prompts, refer to the following example code:

Hardware requirements: an NVIDIA GPU supported by TensorRT-LLM (such as the B200); eight GPUs are needed for tensor parallelism with tensor_parallel_size=8.
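As a conceptual illustration of what tensor parallelism across eight GPUs means, the sketch below splits the output columns of a toy linear layer into eight shards, computes each shard separately, and concatenates the results. It is plain Python with hypothetical data and helper names; how TensorRT-LLM actually shards DeepSeek-R1 is not described in the article and may differ.

```python
TENSOR_PARALLEL_SIZE = 8  # mirrors tensor_parallel_size=8 in the article's example


def matvec(x, weight_columns):
    """Multiply vector x by a matrix stored as a list of columns."""
    return [sum(xi * wi for xi, wi in zip(x, column)) for column in weight_columns]


# Hypothetical toy layer: 4 inputs, 16 outputs, stored column by column.
x = [1.0, 2.0, 3.0, 4.0]
columns = [[(i + j) % 5 for i in range(4)] for j in range(16)]

# Each "GPU" holds an equal slice of the output columns.
shard_width = len(columns) // TENSOR_PARALLEL_SIZE
shards = [columns[r * shard_width:(r + 1) * shard_width]
          for r in range(TENSOR_PARALLEL_SIZE)]

# Every rank computes its slice independently; concatenating the slices
# plays the role of the gather step between devices.
partial_outputs = [matvec(x, shard) for shard in shards]
gathered = [y for part in partial_outputs for y in part]

assert gathered == matvec(x, columns)
print(f"{TENSOR_PARALLEL_SIZE} shards of {shard_width} columns reproduce the full output")
```

Because each device stores and multiplies only its own slice of the weights, the memory and compute for a layer are divided across the eight GPUs.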

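Performance optimization: the code relies on FP4 quantization, the TensorRT engine, and parallel computation, aiming for efficient, low-cost inference suited to production or high-throughput workloads.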

from tensorrt_llm import SamplingParams
from tensorrt_llm._torch import LLM


def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    # Cap each completion at 32 generated tokens.
    sampling_params = SamplingParams(max_tokens=32)

    # Load the FP4 checkpoint with tensor parallelism across 8 GPUs.
    llm = LLM(model="nvidia/DeepSeek-R1-FP4", tensor_parallel_size=8, enable_attention_dp=True)

    outputs = llm.generate(prompts, sampling_params)

    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


# The entry point of the program needs to be protected for spawning processes.
if __name__ == '__main__':
    main()