WeMM-Embedding — Multimodal Embedding Model

WeMM-Embedding is a general-purpose multimodal embedding model unifying text, image and video representations to support cross-business retrieval inside the WeChat ecosystem.

Approach

  • Backbone: Qwen3-VL.
  • Deep Fusion module integrating multi-level semantic representations.
  • Deduplicated InfoNCE loss to remove false negatives.
  • Hierarchical sampler supporting arbitrary modality mixing during training.

Results

  • MMEB-V2 0.7523 — open-source #1.
  • UVRB 0.686 for video retrieval — open-source #1.
  • +2.6% improvement on Chinese image–text retrieval.