
Yuanxing Zhang (张远行)

I am currently a Senior Staff Researcher on the Kling Team at Kuaishou Technology, leading the multimodal understanding group, which covers MLLM foundation models, omni-modal captioning, prompt enhancement, multimodal representation, reward models, and understanding for unified models.
We are actively looking for research interns and full-time researchers to work on cutting-edge research topics. If you're interested in exploring these opportunities, please reach out to me at longo11070001@gmail.com.

I earned my Ph.D. from Peking University in 2020, where I was advised by Professor Kaigui Bian and Professor Xiaoming Li. My doctoral research in multimedia streaming and recommendation systems provided a strong foundation for my career in large-scale AI. After graduating, I joined Alibaba and contributed to XDL, a framework for training ultra-large-scale sparse machine learning models. My focus evolved in 2023 when I joined Taobao's Future Life Lab to work on Large Language Models. As of 2024, I am at Kuaishou, where I am dedicated to advancing multimodal understanding for the Kling video generation model.

News

  • [10/2025] Four papers from the Kling Team are accepted to AAAI 2026, including one (TEMPLE) from our multimodal understanding group.
  • [9/2025] Eight papers from the Kling Team are accepted to NeurIPS 2025, including two (MME-VideoOCR and MVU-Eval) from our multimodal understanding group.
  • [8/2025] Three papers (SEA, MIO, and RICO) are accepted to EMNLP 2025.
  • [7/2025] Three papers (TimeChat-Online, Mavors, and EditWorld) are accepted to ACMMM 2025.
  • [5/2025] Four papers (HAIC, MoD, VidCapBench and GenS) are accepted to ACL 2025.

Selected Projects from My Team [Full List]


Training Paradigms

A Reason-then-Describe Instruction Interpreter for Controllable Video Generation

Shengqiong Wu, Weicai Ye, Yuanxing Zhang, Jiahao Wang, Quande Liu, Xintao Wang, Pengfei Wan, Kun Gai, Hao Fei, Tat-Seng Chua

We propose ReaDe, a universal, model-agnostic interpreter that converts raw instructions into precise, actionable specifications for downstream video generators. ReaDe follows a reason-then-describe paradigm: it first analyzes the user request to identify core requirements and resolve ambiguities, then produces detailed guidance that enables faithful, controllable generation.
  Project Page  Paper (arXiv)
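
The reason-then-describe split maps naturally onto two chained LLM calls. Below is a minimal sketch of that flow, assuming a generic call_llm hook and prompt wording of my own; it illustrates the paradigm rather than the ReaDe implementation.

```python
# Illustrative two-stage "reason-then-describe" flow (not the official ReaDe code).
# `call_llm` is a hypothetical hook for any instruction-following (M)LLM backend.

REASON_PROMPT = (
    "Analyze the user's video-generation request. List the core requirements "
    "(subject, motion, camera, style) and flag any ambiguities with a proposed resolution.\n"
    "Request: {request}"
)

DESCRIBE_PROMPT = (
    "Using the analysis below, write a precise, self-contained specification that a "
    "text-to-video model can follow directly.\n"
    "Analysis: {analysis}\nOriginal request: {request}"
)


def call_llm(prompt: str) -> str:
    """Placeholder for any chat/completions backend."""
    raise NotImplementedError


def interpret(request: str) -> str:
    analysis = call_llm(REASON_PROMPT.format(request=request))                    # step 1: reason
    return call_llm(DESCRIBE_PROMPT.format(analysis=analysis, request=request))   # step 2: describe
```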

AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration

Xinlong Chen, Yue Ding, Weihong Lin, Jingyun Hua, Linli Yao, Yang Shi, Bozhou Li, Yuanxing Zhang, Qiang Liu, Pengfei Wan, Liang Wang, Tieniu Tan

Audiovisual video captioning aims to generate semantically rich descriptions with temporal alignment between visual and auditory events, thereby benefiting both video understanding and generation. We introduce AVoCaDO, a powerful audiovisual video captioner driven by the temporal orchestration between audio and visual modalities.
  Project Page  Paper (arXiv)  Model

Monet: Reasoning in Latent Visual Space Beyond Image and Language

Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan, Kun Gai, Xianghua Ying, Yisen Wang

We introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts.
  GitHub Page  Paper (arXiv)  Model

TEMPLE: Incentivizing Temporal Understanding of Video Large Language Models via Progressive Pre-SFT Alignment

Shicheng Li, Lei Li, Kun Ouyang, Shuhuai Ren, Yuanxin Liu, Yuanxing Zhang, Fuzheng Zhang, Lingpeng Kong, Qi Liu, Xu Sun

To address the scarcity of temporal information in training data, we introduce an automated pipeline that constructs temporality-intensive preference pairs in three steps: selecting temporally rich videos, designing video-specific perturbation strategies, and evaluating model responses on clean and perturbed inputs. Complementing this data pipeline, we provide additional supervision signals via preference learning and propose a novel Progressive Pre-SFT Alignment strategy with two key innovations: a curriculum that progressively increases perturbation difficulty to maximize data efficiency, and preference optimization applied before instruction tuning to incentivize fundamental temporal alignment.
  GitHub Page  Paper (arXiv)
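
The data side of TEMPLE's pipeline can be illustrated with a toy preference-pair builder. The perturbation used here (partial frame shuffling), the answer_fn hook, and the difficulty knob are my own placeholders, not the released pipeline.

```python
import random

# Toy sketch: build temporality-intensive preference pairs from clean vs. perturbed inputs.

def perturb(frames, difficulty: float):
    """Video-specific perturbation; here, shuffle a fraction of the frames."""
    frames = list(frames)
    k = max(2, int(len(frames) * difficulty))
    idx = random.sample(range(len(frames)), k)
    shuffled = random.sample(idx, len(idx))
    out = frames[:]
    for i, j in zip(idx, shuffled):
        out[i] = frames[j]
    return out

def build_pairs(videos, question, answer_fn, difficulty=0.3):
    """answer_fn(frames, question) -> model response; an assumed interface."""
    pairs = []
    for frames in videos:                       # 1) temporally rich videos (pre-selected)
        clean = answer_fn(frames, question)     # 3) response on clean input -> "chosen"
        noisy = answer_fn(perturb(frames, difficulty), question)  # 2)+3) perturbed -> "rejected"
        pairs.append({"prompt": question, "chosen": clean, "rejected": noisy})
    return pairs
```

In the curriculum setting, the difficulty parameter would be increased across training rounds before instruction tuning begins.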

RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction

Yuchi Wang, Yishuo Cai, Shuhuai Ren, Sihan Yang, Linli Yao, Yuanxin Liu, Yuanxing Zhang, Pengfei Wan, Xu Sun

We propose RICO, a novel framework that enhances captions through an iterative visual reconstruction-and-refinement pipeline. Our key idea is to (a) reconstruct the caption into an image using a text-to-image model, (b) compare the original image with the reconstructed image using an MLLM, and (c) refine the caption based on the detected discrepancies.
  GitHub Page  Paper (arXiv)
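
The reconstruct-compare-refine loop is compact enough to sketch directly. text_to_image and mllm below are assumed generic interfaces, and the prompts and stopping condition are illustrative rather than the paper's exact ones.

```python
# Minimal sketch of a reconstruct-compare-refine loop (hypothetical helper names).

def recaption(image, caption, text_to_image, mllm, num_rounds: int = 3) -> str:
    """
    image: the original image; caption: initial caption to improve.
    text_to_image(caption) -> reconstructed image (any T2I model).
    mllm(prompt, images) -> text (any multimodal LLM). Both are assumed interfaces.
    """
    for _ in range(num_rounds):
        recon = text_to_image(caption)                                # a) reconstruct
        diff = mllm("List visual discrepancies between image 1 and image 2.",
                    [image, recon])                                   # b) compare
        if "none" in diff.lower():
            break
        caption = mllm(f"Revise the caption to fix these issues: {diff}\n"
                       f"Caption: {caption}", [image])                # c) refine
    return caption
```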

D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models

Haoran Que, Jiaheng Liu, Ge Zhang, Chenchen Zhang, Xingwei Qu, Yinghao Ma, Feiyu Duan, Zhiqi Bai, Jiakai Wang, Yuanxing Zhang, Xu Tan, Jie Fu, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng

We propose to investigate the Scaling Law of Domain-specific Continual Pre-Training (D-CPT Law) to determine the optimal mixture ratio with acceptable training costs for LLMs of different sizes. Specifically, by fitting the D-CPT Law, we can predict the general and downstream performance of arbitrary mixture ratios, model sizes, and dataset sizes from a limited number of small-scale training runs. Moreover, we extend the standard D-CPT Law to cross-domain settings and propose the Cross-Domain D-CPT Law, which predicts the D-CPT Law of a target domain with very small training costs (about 1% of the normal training cost).
  Paper (arXiv)
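
To make the "fit once, predict any configuration" idea concrete, here is an illustrative curve fit over (mixture ratio, data size) measurements. The power-law form, the toy loss numbers, and the parameter names are placeholders of mine, not the actual D-CPT Law parameterization.

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative only: fit a simple parametric surface L(r, D) from small-scale runs,
# then query it for unseen configurations. NOT the paper's exact functional form.

def loss_model(x, E, A, alpha, B, beta):
    r, D = x                                  # r: domain mixture ratio, D: tokens (in billions)
    return E + A / np.power(D, alpha) + B / np.power(np.clip(r, 1e-3, None), beta)

# toy measurements from hypothetical small-scale runs: (ratio, tokens) -> validation loss
r = np.array([0.1, 0.3, 0.5, 0.1, 0.3, 0.5])
D = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
L = np.array([2.758, 2.691, 2.671, 2.414, 2.347, 2.327])

params, _ = curve_fit(loss_model, (r, D), L, p0=[1.5, 1.0, 0.3, 0.1, 0.4], maxfev=100000)

# predict the loss of an unseen (ratio, data size) configuration before running it
print(loss_model((np.array([0.4]), np.array([20.0])), *params))
```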

DDK: Distilling Domain Knowledge for Efficient Large Language Models

Jiaheng Liu, Chenchen Zhang, Jinyang Guo, Yuanxing Zhang, Haoran Que, Ken Deng, Zhiqi Bai, Jie Liu, Ge Zhang, Jiakai Wang, Yanan Wu, Congnan Liu, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng

DDK dynamically adjusts the composition of the distillation dataset in a smooth manner according to the domain performance differences between the teacher and student models, making the distillation process more stable and effective.
  Paper (arXiv)
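
A simplified version of the dynamic mixing idea: re-weight domains toward where the student lags the teacher most, and smooth the update so sampling stays stable. The softmax temperature and momentum smoothing below are my own stand-ins for the paper's smoothing mechanism.

```python
import numpy as np

# Sketch of gap-driven domain re-weighting for distillation data (simplified).

def update_domain_weights(teacher_loss, student_loss, prev_weights,
                          temperature: float = 1.0, momentum: float = 0.9):
    gap = np.maximum(np.asarray(student_loss) - np.asarray(teacher_loss), 0.0)
    target = np.exp(gap / temperature)
    target /= target.sum()                        # larger gap -> more samples from that domain
    weights = momentum * np.asarray(prev_weights) + (1.0 - momentum) * target
    return weights / weights.sum()

# usage: per-domain validation losses measured periodically during distillation
w = update_domain_weights(teacher_loss=[1.8, 2.0, 1.5],
                          student_loss=[2.4, 2.1, 2.3],
                          prev_weights=[1/3, 1/3, 1/3])
print(w)   # sampling probabilities for the next distillation interval
```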


Model Architecture

TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos

Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, Lingpeng Kong, Qi Liu, Yuanxing Zhang, Xu Sun

TimeChat-Online is a novel online VideoLLM designed for efficient streaming video understanding. Its core innovation, the Differential Token Drop (DTD) module, tackles visual redundancy by selectively preserving only meaningful temporal changes while eliminating static content between frames.
  Project Page  Paper (arXiv)  Model
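
A toy rendition of the DTD idea, assuming per-frame patch features are already available; the cosine-change criterion and the threshold are illustrative stand-ins for the module's actual drop rule.

```python
import torch

# Keep a frame's visual tokens only where they differ enough from the previous frame.

def differential_token_drop(tokens: torch.Tensor, threshold: float = 0.1):
    """
    tokens: [T, N, D] patch features for T frames with N tokens each.
    Returns a list of (frame_idx, kept_token_indices); frame 0 is kept in full.
    """
    kept = [(0, torch.arange(tokens.shape[1]))]
    for t in range(1, tokens.shape[0]):
        # change = 1 - cosine similarity to the same spatial token in the previous frame
        change = 1.0 - torch.nn.functional.cosine_similarity(tokens[t], tokens[t - 1], dim=-1)
        idx = torch.nonzero(change > threshold, as_tuple=False).squeeze(-1)
        kept.append((t, idx))                     # static tokens are dropped
    return kept

video_tokens = torch.randn(8, 196, 1024)          # e.g., 8 frames of 14x14 patches
print([len(i) for _, i in differential_token_drop(video_tokens)])
```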

Mavors: Multi-granularity Video Representation for Multimodal Large Language Model

Yang Shi, Jiaheng Liu, Yushuo Guan, Zhenhua Wu, Yuanxing Zhang, Zihao Wang, Weihong Lin, Jingyun Hua, Zekun Wang, Xinlong Chen, Bohan Zeng, Wentao Zhang, Fuzheng Zhang, Wenjing Yang, Di Zhang

Mavors directly encodes raw video content into latent representations through two core components: 1) an Intra-chunk Vision Encoder (IVE) that preserves high-resolution spatial features via 3D convolutions and Vision Transformers, and 2) an Inter-chunk Feature Aggregator (IFA) that establishes temporal coherence across chunks using transformer-based dependency modeling with chunk-level rotary position encodings.
  Project Page  Paper (arXiv) 
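
A heavily simplified sketch of how the two components compose, with toy module sizes and without the chunk-level rotary position encodings; the module names and shapes here are illustrative, not the Mavors implementation.

```python
import torch
import torch.nn as nn

class TinyIVE(nn.Module):
    """Intra-chunk encoder: a 3D-conv patchifier followed by a transformer block."""
    def __init__(self, dim=256):
        super().__init__()
        self.patch = nn.Conv3d(3, dim, kernel_size=(2, 16, 16), stride=(2, 16, 16))
        self.block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, chunk):                              # chunk: [B, 3, T, H, W]
        x = self.patch(chunk).flatten(2).transpose(1, 2)   # [B, tokens, dim]
        return self.block(x).mean(dim=1)                   # one latent per chunk

class TinyIFA(nn.Module):
    """Inter-chunk aggregator over the sequence of chunk latents."""
    def __init__(self, dim=256):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, chunk_latents):                      # [B, num_chunks, dim]
        return self.block(chunk_latents)

ive, ifa = TinyIVE(), TinyIFA()
video = torch.randn(1, 3, 16, 224, 224)                    # 16 frames
chunks = video.split(4, dim=2)                             # 4-frame chunks
latents = torch.stack([ive(c) for c in chunks], dim=1)
print(ifa(latents).shape)                                  # [1, 4, 256]
```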

MIO: A Foundation Model on Multimodal Tokens

Zekun Wang, King Zhu, Chunpu Xu, Wangchunshu Zhou, Jiaheng Liu, Yibo Zhang, Jiashuo Wang, Ning Shi, Siyu Li, Yizhi Li, Haoran Que, Zhaoxiang Zhang, Yuanxing Zhang, Ge Zhang, Ke Xu, Jie Fu, Wenhao Huang

We present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. MIO undergoes a four-stage training process: (1) alignment pre-training, (2) interleaved pre-training, (3) speech-enhanced pre-training, and (4) comprehensive supervised fine-tuning on diverse textual, visual, and speech tasks.
  Model  Paper (arXiv) 


Benchmarking

ViDiC: Video Difference Captioning

Jiangtao Wu, Shihao Li, Zhaozhou Bian, Jialu Chen, Runzhe Wen, An Ping, Yiwen He, Jiakai Wang, Yuanxing Zhang, Jiaheng Liu

We introduce the ViDiC (Video Difference Captioning) task and its corresponding ViDiC-1K dataset, designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to provide fine-grained descriptions of similarities and differences between video pairs. ViDiC comprises 1,000 curated video pairs annotated with over 4,000 comparative checklist items, covering seven categories: Subject, Style, Background, Cinematography, Motion, Location (Position), and Playback Techniques.
  Project Page  Paper (arXiv)  Data

OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs

Caorui Li, Yu Chen, Yiyan Ji, Jin Xu, Zhenyu Cui, Shihao Li, Yuanxing Zhang, Jiafu Tang, Zhenghao Song, Dingling Zhang, Ying He, Haoxiang Liu, Yuxuan Wang, Qiufeng Wang, Zhenhe Wu, Jiehui Luo, Zhiyu Pan, Weihao Xie, Chenchen Zhang, Zhaohui Wang, Jiayi Tian, Yanghai Wang, Zhe Cao, Minxin Dai, Ke Wang, Runzhe Wen, Yinghao Ma, Yaning Pan, Sungkyun Chang, Termeh Taheri, Haiwen Xia, Christos Plachouras, Emmanouil Benetos, Yizhi Li, Ge Zhang, Jian Yang, Tianhao Peng, Zili Wang, Minghao Liu, Junran Peng, Zhaoxiang Zhang, Jiaheng Liu

OmniVideoBench is a large-scale, rigorously curated benchmark for assessing synergistic audio-visual intelligence, emphasizing modality complementarity, logical consistency, and long-term temporal reasoning.
  Project Page  Paper (arXiv)  Data

MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

Tianhao Peng, Haochen Wang, Yuanxing Zhang, Zekun Wang, Zili Wang, Gavin Chang, Jian Yang, Shihao Li, Yanghai Wang, Xintao Wang, Houyi Li, Wei Ji, Pengfei Wan, Steven Huang, Zhaoxiang Zhang, Jiaheng Liu

We introduce MVU-Eval, the first comprehensive benchmark for evaluating multi-video understanding in MLLMs. It assesses eight core competencies: Object Recognition, Spatial Understanding, Counting, Comparison, Knowledge-Intensive Reasoning, In-Context Learning, Retrieval-Augmented Generation, and Temporal Reasoning.
  Project Page  Paper (arXiv)  Data

MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios

Yang Shi, Huanqian Wang, Wulin Xie, Huanyao Zhang, Lijie Zhao, Yi-Fan Zhang, Xinfeng Li, Chaoyou Fu, Zhuoer Wen, Wenting Liu, Zhuoran Zhang, Xinlong Chen, Bohan Zeng, Sihan Yang, Yushuo Guan, Zhang Zhang, Liang Wang, Haoxuan Li, Zhouchen Lin, Yuanxing Zhang, Pengfei Wan, Haotian Wang, Wenjing Yang

MME-VideoOCR features 10 task categories comprising 25 individual tasks and spans 44 diverse scenarios. These tasks extend beyond text recognition to incorporate deeper comprehension and reasoning of textual content within videos.
  Project Page  Paper (arXiv)  Data

ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation

Chenchen Zhang, Yuhang Li, Can Xu, Jiaheng Liu, Ao Liu, Changzhi Zhou, Ken Deng, Dengpeng Wu, Guanhua Huang, Kejiao Li, Qi Yi, Ruibin Xiong, Shihui Hu, Yue Zhang, Yuhao Jiang, Zenan Xu, Yuanxing Zhang, Wiggin Zhou, Chayse Zhou, Fengzong Lian

We introduce ArtifactsBench, a new benchmark and paradigm for the automated, multimodal evaluation of visual code generation. Our framework programmatically renders each generated artifact and captures its dynamic behavior through temporal screenshots.
  Project Page  Paper (arXiv)  Data
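
The render-and-capture step can be approximated with an off-the-shelf headless browser. The Playwright-based snippet below, with assumed paths, viewport, and timings, is only a rough illustration of "temporal screenshots", not the benchmark's actual harness.

```python
from playwright.sync_api import sync_playwright

def capture_temporal_screenshots(html_path: str, out_prefix: str,
                                 num_shots: int = 4, interval_ms: int = 500):
    """Render a generated HTML artifact and take screenshots at fixed intervals."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 720})
        page.goto(f"file://{html_path}")
        shots = []
        for i in range(num_shots):
            page.wait_for_timeout(interval_ms)    # let animations/interactions advance
            path = f"{out_prefix}_{i}.png"
            page.screenshot(path=path)
            shots.append(path)
        browser.close()
    return shots
```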

VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation

Xinlong Chen, Yuanxing Zhang, Chongling Rao, Yushuo Guan, Jiaheng Liu, Fuzheng Zhang, Chengru Song, Qiang Liu, Di Zhang, Tieniu Tan

This paper introduces VidCapBench, a video caption evaluation scheme specifically designed for T2V generation, agnostic to any particular caption format. VidCapBench employs a data annotation pipeline, combining expert model labeling and human refinement, to associate each collected video with key information spanning video aesthetics, content, motion, and physical laws.
  Project Page  Paper (arXiv)  Data