DeepSeek Starts Updating Frequently: Tile Kernels and DeepEP V2


Just now, DeepSeek’s GitHub started updating frequently. It launched and open-sourced a new repository, Tile Kernels, and at the same time updated the DeepEP repository, bringing DeepEP V2 online. It has been less than a week since DeepSeek last quietly updated Mega MoE and the FP4 Indexer.

DeepSeek Tile Kernels

Link: https://github.com/deepseek-ai/TileKernels

According to the introduction, Tile Kernels are GPU kernels optimized for LLM operations, built with TileLang. TileLang is a domain-specific language for expressing high-performance GPU kernels in Python, with characteristics such as easy portability, agile development, and automatic optimization.
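
TileLang itself compiles tile-level Python descriptions down to GPU code. As a rough illustration of the tiling idea such DSLs build on (plain NumPy here, not TileLang syntax, and purely a conceptual sketch):

```python
import numpy as np

# Illustrative only: the tile decomposition that tile-based kernel DSLs
# build on, written in plain NumPy rather than TileLang syntax.
def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int = 64) -> np.ndarray:
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=a.dtype)
    # Each (i, j) tile of C is accumulated from tile-sized blocks of A and B,
    # mirroring how a GPU thread block stages tiles in shared memory.
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                c[i:i+tile, j:j+tile] += (
                    a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
                )
    return c
```

In a real tile DSL, each loop level maps to grid/block/warp scheduling and the tile sizes become tunable parameters, which is what enables the automatic optimization mentioned above.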

The performance of Tile Kernels is extremely strong. As DeepSeek itself wrote: “Most kernels in this project are already close to the hardware performance limit in terms of compute intensity and memory bandwidth. Some of them have already been used internally in training and inference scenarios. However, they do not yet represent best practices, and we are continuing to improve the code quality and documentation.”

There is not much introductory information in the repository, yet between the lines it already hints at the underlying architectural innovations of DeepSeek’s next-generation models.

DeepSeek Tile Kernels Features

Here are some specific features of Tile Kernels:

Gating mechanism: Top-k expert selection and scoring for MoE routing

MoE routing: Mapping tokens to experts, fused expand/reduce, and weight normalization

Quantization: Supports FP8/FP4/E5M6 conversion in per-token, per-block, and per-channel modes, and fuses SwiGLU + quantization operations

Transpose: Batched transpose operations

Engram: Engram gating kernels, fusing RMSNorm, forward/backward propagation, and weight-gradient reduction

Manifold HyperConnection: Hyper-connection kernels, including Sinkhorn normalization and split/apply for mix

Modeling: High-level torch.autograd.Function wrappers that combine the underlying kernels into trainable layers (engram gate, mHC pipeline)
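
The gating and routing bullets above can be sketched in plain NumPy. This is a generic top-k gating scheme with weight normalization, not DeepSeek's exact kernel semantics; the softmax scoring and shapes are illustrative assumptions:

```python
import numpy as np

# Generic top-k MoE gating sketch (assumption: softmax scoring, per-token
# renormalization of the selected weights). Real kernels fuse these steps.
def topk_gate(logits: np.ndarray, k: int):
    # logits: [num_tokens, num_experts]
    scores = np.exp(logits - logits.max(axis=-1, keepdims=True))
    scores /= scores.sum(axis=-1, keepdims=True)        # softmax per token
    topk_idx = np.argsort(-scores, axis=-1)[:, :k]      # top-k expert ids
    topk_w = np.take_along_axis(scores, topk_idx, axis=-1)
    topk_w /= topk_w.sum(axis=-1, keepdims=True)        # weight normalization
    return topk_idx, topk_w
```

The routing kernels then map each token to its selected experts (expand), run the expert MLPs, and sum the outputs back per token using these normalized weights (reduce).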
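
The per-token quantization mode can likewise be sketched. This simulates the per-row scale computation against an FP8 E4M3-style maximum of 448 in float32; a real kernel emits true FP8 values on a nonuniform grid, so the integer rounding here is only a stand-in:

```python
import numpy as np

# Hedged sketch of per-token quantization to an FP8-like range (E4M3 max
# magnitude 448). Round-trip is simulated in float32 on an integer grid,
# which is NOT real FP8 spacing -- illustration only.
FP8_MAX = 448.0

def quant_dequant_per_token(x: np.ndarray):
    # x: [num_tokens, hidden]; one scale per token (per-row)
    amax = np.abs(x).max(axis=-1, keepdims=True)
    scale = np.maximum(amax, 1e-12) / FP8_MAX
    q = np.clip(np.round(x / scale), -FP8_MAX, FP8_MAX)
    return q * scale, scale
```

Per-block and per-channel modes differ only in the axis over which `amax` is taken; the fused SwiGLU + quantization kernel folds this step into the activation itself to avoid a round trip through global memory.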
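
For the Sinkhorn normalization mentioned under Manifold HyperConnection, the textbook iteration alternately rescales rows and columns toward a doubly stochastic matrix. Iteration count and the positivity-via-exponential choice below are illustrative assumptions, not the repository's exact kernel:

```python
import numpy as np

# Minimal Sinkhorn normalization sketch: alternate row and column rescaling
# until the (square) matrix is approximately doubly stochastic.
def sinkhorn(logits: np.ndarray, n_iters: int = 20) -> np.ndarray:
    p = np.exp(logits - logits.max())      # strictly positive entries
    for _ in range(n_iters):
        p /= p.sum(axis=1, keepdims=True)  # rows sum to 1
        p /= p.sum(axis=0, keepdims=True)  # columns sum to 1
    return p
```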

DeepSeek EPv2: Faster EP With Engram, PP, and CP Support

EPv2 link: https://github.com/deepseek-ai/DeepEP/pull/605

Earlier today, DeepSeek also released EPv2, the latest version of DeepEP, delivering faster expert parallelism (EP) and support for Engram / pipeline parallelism (PP) / context parallelism (CP).

As hardware, networks, and model architectures have evolved, DeepSeek’s earlier DeepEP V1 had accumulated too much historical baggage and too many performance issues.

This update completely restructures expert parallelism. Compared with V1, it needs only a fraction of the SM resources to reach peak performance, while also supporting larger-scale scale-up (within a single machine) and scale-out (across machines).

In addition, DeepSeek introduced an experimental 0 SM series in this update, including 0 SM Engram, 0 SM pipeline parallelism (PP), and 0 SM context parallelism (CP) All-gather operators. At the same time, the backend has been switched from NVSHMEM to the lighter NCCL Gin backend.

New Features in DeepSeek DeepEP V2

Here are some of the new features in DeepEP V2:

Fully JIT

NCCL Gin backend:

Header-only, extremely lightweight

Able to reuse existing NCCL communicators

EPv2:

Unifies high-throughput and low-latency APIs into a single interface, and adopts a brand-new GEMM layout

Supports larger scaling domains, up to EP2048

Introduces analytical SM and QP count calculation, so auto-tuning is no longer needed

Continues to support both Hybrid mode and Direct mode

For older V3-like training tasks, SM usage drops from 24 to 4–6, while maintaining the same or even better performance

0 SM Engram (with RDMA)

0 SM PP (with RDMA)

0 SM CP (with Copy Engine)

DeepSeek DeepEP V2 Performance

Following the configuration of DeepSeek-V3, tests were run under the new version with settings of 8K tokens per batch, 7168 hidden dimension, Top-8 experts, FP8 dispatch, and BF16 combine. The results are as follows:

[Figure: DeepEP V2 bandwidth results]

Note: The results shown are logical bandwidth. For example, in the case of EP 8 x 2, the 90 GB/s bandwidth actually includes traffic between local GPUs (local ranks).
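
To make the quoted test configuration concrete, here is a back-of-the-envelope payload calculation per batch. It ignores quantization scales, headers, and routing metadata (an assumption; those add a few percent):

```python
# Payload estimate for the DeepSeek-V3-style test configuration quoted above.
# Assumption: only the raw hidden-state bytes count, no scales or metadata.
tokens = 8192                 # 8K tokens per batch
hidden = 7168                 # hidden dimension
topk = 8                      # Top-8 experts per token
fp8_bytes, bf16_bytes = 1, 2

dispatch_bytes = tokens * topk * hidden * fp8_bytes   # FP8 dispatch
combine_bytes = tokens * topk * hidden * bf16_bytes   # BF16 combine
print(dispatch_bytes / 2**30, "GiB dispatched,",
      combine_bytes / 2**30, "GiB combined per batch")
```

Dividing such per-batch byte counts by the measured kernel time gives the logical bandwidth figures above, which is why intra-node traffic between local ranks is included.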

Compared with V1, V2 achieves up to 1.3× the peak performance while using as little as a quarter of the SM resources.

Finally, just a bit of advice for DeepSeek: hurry up and release V4 already. Everyone is getting impatient.
