32nd International Conference on Multimedia Modeling


January 29-31, 2026
Prague, Czech Republic

Program at a glance

Note that the program is preliminary and some minor changes in the timing may occur.
Thursday Friday Saturday
January 29, 2026 January 30, 2026 January 31, 2026
Room 1 Room 2 Room 1 Room 2 Room 1 Room 2
8:00 8:30 Registration
8:30 9:00 Opening Keynote 2: Marcel Worring Keynote 3: Giuseppe Amato
9:00 9:30 Keynote 1: Jiri Matas
9:30 10:00 Session 4: Motion models
Chair: Werner Bailer
Session 8: Language & Text Generation
Chair: Duc Tien Dang Nguyen
10:00 10:10 Tea Break
10:10 10:30 Tea Break Tea Break
10:30 10:40 Session 1: Best paper candidates
Chair: Stevan Rudinac and Jan Zahalka
10:40 11:00 Session 5: Online Posters Session 9a: Demo Session Session 9b: MARS Spec. Session
Chair: Thu Nguyen
11:00 11:30
11:30 12:00
12:00 12:10 Lunch Break Lunch Break
Lunch Break
13:00 13:30 Session 2a: Recommendation & Graph learning
Chair: Kai Uwe Barthel
VBS Session Session 6: Vision-Language Models & Multimedia Applications
Chair: Giuseppe Amato
Session 10a: Datasets, Missing Data & Speech
Chair: Luca Rossetto
Session 10b: MOMST and HCMBA Spec. Session
Chair: Mario Döller
13:30 14:00
14:00 14:40
14:40 15:00 Tea Break Closing
15:00 15:10 Tea Break
15:10 15:30 Session 3a: Image enhancement, Object detection & Explanations
Chair: Max Fischer
15:30 16:00 Session 7: Video Retrieval & Datasets
Chair: Klaus Schoeffmann
Prague Tour and Boat Trip
16:00 16:10
16:10 16:30
16:30 17:00
17:00 17:10 Welcome Reception and VBS Session
17:10 17:30
17:30 18:00
18:00 18:30
18:30 19:00
19:00 19:30 Banquet
19:30 20:00
20:00 21:00
21:00 22:00

Detailed Program

Session, Time, Paper ID, Authors, Title
Session 1: Best paper candidates 29.1. 10:30 - 12:10 229 Tu, Teng; Liu, Xiaohao; Ma, Yunshan; Qi, Ji; Chua, Tat-Seng Integrating Symbolic and Waveform Music into Large Language Models
335 Chen, Lucy; Collins, KC Can AI Capture Emotion? A Study on Human Emotional Perception and Response to AI-Generated and Human-Composed Pop Music
340 ZHU, Bin; Yin, Hailong; Chen, Jingjing; Jiang, Yu-Gang Benchmarking Gaslighting Negation Attacks Against Reasoning Models
410 Deng, Zhixuan; Zhu, Yifan; Xiang, Lei; Jin, Shilong; Duan, Haoran; Long, Yang; Zhou, Yuan ZeroDINO: Entropy-Driven Granularity-Aware Semantic Fusion for Zero-Shot Learning
Session 2a: Recommendation & Graph learning 29.1. 13:00 - 14:40 106 Dose, Yuma; Hara, Takahiro Graph Contrastive Learning with Popularity and Neighborhood Awareness for Long-Tail Item Recommendation
149 Tsukuda, Kosetsu; Ishida, Keisuke; Takahashi, Takumi; Hamasaki, Masahiro; Goto, Masataka A Case Study of a Transparent and Controllable Music Recommender System with Multi-Relational Layers
182 Dang Hoang Minh, Triet; Tran Hoang, Anh; Nguyen Hoang, Hai; Tran Nguyen Minh, Quang; Tran Cong, Hieu; Nguyen, Thu; Nguyen Thanh, Binh PreBERT-Rec: Improving Topic Modeling in Recommendation Systems via Effective Data Preprocessing and BERT
207 Ruosch, Florian; Rossetto, Luca Applications of Multimodal Knowledge Graphs in Modeling Multimedia
348 Wang, Haoyang; Zhang, Shengbing; Fan, Xiaoya; Zhu, Junda; Zhang, Meng Enabling Efficient Distributed Graph Neural Network Acceleration with Near Memory Processing
Session 3a: Image enhancement, Object detection & Explanations 29.1. 15:10 - 16:10 109 Xu, Zhoutong; Wang, Zhangye MAGNet: Multi-Level Attention For Guided Thermal Infrared Image Super-Resolution
359 Hu, Weiyi; Cui, Hua; Hu, Haoran; Yang, Zhao TD-MBEV: Robust 3D Object Detection with Temporal Diffusion-Masked BEV
103 Bai, Yannan; Wang, Danding; Tang, Sheng; Cao, Juan; Li, Jintao Dissecting Deepfake Artifacts via Multimodal Explanations
Session 4: Motion models 30.1. 9:30 - 10:10 165 Li, Zhaoyang; Tian, Jinglan; Lyu, Na Conditional VQ-VAE for Action-Conditioned Motion Generation
283 Yu, Congrui; Fan, Bo; Lyu, Na MotionSlim: A Lightweight T2M Generation Framework Based on LLM
Session 6: Vision-Language Models & Multimedia Applications 30.1. 13:00 - 15:00 111 Zhu, Fengbin; Liu, Ziyang; NG, Xiang Yao; Wu, Haohui; Wang, Wenjie; Feng, Fuli; Wang, Chao; Luan, Huanbo; Chua, Tat-Seng MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding and Grounding
118 Martín-Fernández, Iván; Constantin, Mihai Gabriel; Ionescu, Bogdan; Esteban-Romero, Sergio; Fernández-Martínez, Fernando; Gil-Martín, Manuel A Case Study on Large Visual-Language Model Attention Explainability After Adaptation Using Persuasion Strategies in Advertisements
166 Chi, Jui-Feng; Chu, Wei-Ta; Lin, Sheng-Long Food Image Segmentation with LLM-Derived Ingredient Labels and Multimodal Fusion
272 Tran, Allie; Rossetto, Luca On the Brittleness of CLIP Text Encoders
337 Wu, Xinlan; Zhu, Bin; Han, Feng; Jiao, Pengkun; Chen, Jingjing Dual-LoRA and Quality-Enhanced Pseudo Replay for Multimodal Continual Food Learning
373 Gan, Kian-Yu; Nguyen, Phuong-Anh; Ngo, Chong-Wah Food Recognition with Visual Language Models: Search Re-ranking or Retrieval-Augmented Generation?
Session 7: Video Retrieval & Datasets 30.1. 15:30 - 17:10 120 LE, HOANG BAO; Tran, Allie; T. Nguyen, Binh; Zhou, Liting; Gurrin, Cathal FIGROTD: A Friendly-to-Handle Dataset for Image Guided Retrieval with Optional Text
128 Wattasseril, Jobin Idiculla; Scheibel, Willy; Döllner, Jürgen Benchmarking SmolVLM for Parking Occupancy Detection
185 Tarekegn, Adane Nega; Rabbi, Fazle; Opdahl, Andreas Lothe; Tessem, Bjørnar Multimodal Video Summarization with Mamba and Bayesian Approach
190 He, Chunjiang; Yang, Gang DiffSynth-LVOS: Enhancing Language-Guided Video Object Segmentation via Diffusion-Based Synthetic Data Generation
383 Kongmeesub, Onanong; Spiess, Florian; Gurrin, Cathal; Nie, Dongyun; Rattanatamrong, Prapaporn An Eye Tracking Dataset for Multimedia Retrieval
Session 8: Language & Text Generation 31.1. 9:30 - 10:10 216 Presacan, Oriana; Nik, Alireza; Thambawita, Vajira; Ionescu, Bogdan; Riegler, Michael A Comparative Study of Decoding Strategies in Medical Text Generation
281 Tran, Minh Huan; Tran Nguyen, Minh Quang; Pham, Phi Nhung; Huynh, Thanh Son; Nguyen, Thanh Binh HFS: Hierarchical Fine-Tuning for Span Detection and Aspect-Based Sentiment Analysis In Vietnamese Language
Session 9b: MARS special session 31.1. 11:00 - 12:20 260 Michael, Yonathan; Alansari, Mohamad; Assefa, Maregu; Werghi, Naoufel; Henschel, Andreas X-ThreatDet: Enhancing X-ray Threat Detection with Self-Supervised and Multi-modal learning
333 Neuschmied, Helmut; Winter, Martin; Bailer, Werner Improving Few-Shot Object Detection using Visual Explanations of DINOv2 Features
370 Schlegel, Udo; Weeber, Franziska; Lan, Jian; Seidl, Thomas PRSM: A Measure to Evaluate CLIP's Robustness Against Paraphrases
Session 10a: Datasets, Missing Data & Speech 31.1. 13:20 - 14:40 269 Do, Thanh Tu; Hua, Van; Dang, Uyen; Nguyen, Thu; Hicks, Steven; Halvorsen, Pål; Riegler, Michael A.; Nguyen, Binh T. Low-dimension Representation Estimation in Principal Component Analysis under Missing Data
289 Sun, Zhicong; Lo, Jacqueline; Hu, Jinxing WildfireX-SLAM: A Large-scale Low-altitude RGB-D Dataset for Wildfire SLAM and Beyond
311 Vo, Tuan L.; Dang, Uyen; Nguyen, Thu; Halvorsen, Pål; Riegler, Michael A.; Nguyen, Binh T. DPERC: Direct Parameter Estimation for Mixed Data with Random Missingness
407 Langø, Victoria; Hassan, Zohaib; Hicks, Steven Synthesizing Norwegian Dialects in Low-Resource TTS
Session 10b: MOMST and HCMBA special sessions 31.1. 13:20 - 14:40 115 Lu, Yi-Hsuan; Chu, Wei-Ta Vision-Based 3D Baseball Swing Trajectory Reconstruction and Swing Performance Analysis
245 Razyapov, Oskar; Vojtas, Peter; Balcar, Stepan A Video Benchmark Dataset for Indoor Object Positioning in Industrial Environments
119 Rajendran, Megani; Ng, Aik Beng; Tan, Chek Tien; Atmosukarto, Indri; Lim Jun Feng, Joey; Ping Shu Ho, Cliff; See, Simon AutoPose: Pose-Mixing for Rare Human Video Data Augmentation to Enhance Recognition
Virtual Posters (Session 5) 30.1. 10:40 - 12:00 101 Ding, Guohui; Fan, Tengyu; Wang, Chufei Multi-granular Feature Selection Fusion Method for Multimodal Named Entity Recognition
102 Fan, Ziyang; Tao, Li; Wang, Yi; Qu, Jingwei; Wang, Ying; Jiang, Fei DS-HGCN: A Dual-Stream Hypergraph Convolutional Network for Predicting Student Engagement via Social Contagion
113 Ye, Haiyang; Li, Dengshi; Wu, YuLin; LI, Wei; Fang, Yu; Li, Yuxin WavGateMamba: A Frequency-Enhanced and Gated Mamba Model for Multimodal Depression Detection
125 Lin, Hailan; Wei, Qijie; Tian, Kaibin; Zhao, Ruixiang; Li, Xirong Co-Teaching for Unsupervised Domain Expansion
127 Chen, Ziyu; Wang, Hanli Taming Image-based Vision-Language Pre-training Model with Bootstrapped Auxiliary Tasks for Video Captioning
130 Zou, Jiahao; Zhang, Congxuan; Ge, Liyue; He, Chao; Yang, Jiawen; Chen, Zhen; Lu, Ke SAP-DQR: Joining Spatial-Adaptive Pyramid and Adaptive Query Reorganization for Speed-Accuracy Instance Segmentation
132 Peng, Bo; Lyu, YuanJie; Qin, PengGang; Xu, Tong Towards Effective Long-Video Event Prediction via Multi-Level Event Semantics Mining
134 Qiu, Xinkuan; Zhou, Yongbin Comparative Robustness of CNNs, ViTs, and MLLMs under Image Corruption
135 WANG, YONGXIANG; Zhou, Gang; Liu, Wei; Zhou, Yang Stereo3D-NeRF: Generating 3D Visualizations with Paired Stereoscopic Views
137 Li, Hao; Cui, zhenchao MS-MRFNet: A Multi-Scale and Multi-Receptive Field Network for UAV Aerial Object Detection
138 FENG, XIAOJING; TAN, ZHENHUA; CHENG, ZIWEI; LUO, JIAYUAN HDBC: A Heterogeneous Dual-Branch Convolutional Network for Audio Splicing Detection
146 Xie, Wen; Zhu, Yanjun; Overgoor, Gijs; Bart, Yakov; Lapedriza Garcia, Agata; Ostadabbas, Sarah AdSum: Two-stream Audio-visual Summarization for Automated Video Advertisement Clipping
153 Wang, Xiaoyu; Liu, Jing RIDA: Detection of Adversarial Examples through Image Adaptive Local Reconstruction
154 Guo, YuNong; Liu, Jing AIT3D-DSR: An Adjustable Integration Targeted 3D Adversarial Attack based on Differentiable Structured Rendering
155 Li, Hongyang; Tao, Junyi; Wei, Qijie; Yang, Ningzhi; Wang, Meng; Yu, Weihong; Li, Xirong Cross-modal Fundus Image Registration under Large FoV Disparity
156 Liao, Yun; Chen, Nan; Liu, Junhui; Lyu, Jiayi; Hu, Zongxiao; Duan, Qing SeViMatch: A Detector-Based Image Matching Framework with Semantic-Visual Fusion
157 Gao, Ziyuan; Morel, Philippe Prompt-Aware Adaptive Elastic Weight Consolidation for Continual Learning in Medical Vision-Language Models
158 Du, Haizhou; Li, Wenhao M3RAG: An Adaptive Multi-Agent Framework for Multi-Modal Multi-Hop Reasoning
162 Sun, Ao; Hao, Shijie; Guo, Yanrong Illumination-Prior Guided Hybrid Network for Low-Light Image Enhancement
163 Teng, shangzhi; Li, yekai; Gong, xi; Lv, xueqiang Small Object Detection via Frequency-Based Multi-modal Fusion
170 Lu, Nengbo; Pan, Minghua; Sun, Shaohua; Liang, Yizhou GS-DMSR: Dynamic Sensitive Multi-scale Manifold Enhancement for Accelerated High-Quality 3D Gaussian Splatting
171 Liao, Yun; Lyu, Jiayi; Liu, Junhui; Chen, Nan; Hu, Zongxiao; Duan, Qing FFMatch: A FilterFormer-Based Network for Accurate Multimodal Image Matching
177 Fu, Cheng; Wu, Junlong; Chen, Xianhong; Peng, Hujin; Xu, Jing; Gong, Junyuan; Liu, Zeyun; Liu, Wenzheng; Deng, Tan; Yuan, Ming CCASNet: Criss-Cross Attention Enhanced Network with Dual-Channel Spatial Modeling for Medical Image Segmentation
178 Chen, Feiyu; Li, Zijian; Yu, Nanjun; Ruan, Tangjun; Ma, Teng; Zhang, Chao Diffusion-driven Deep Variational Image Clustering with Representation Decoupling
180 Shi, Tong; de Almeida, Melonie; Ivanova, Daniela; Pugeault, Nicolas; Henderson, Paul Splat-Portrait: Generalizing Talking Portraits with Gaussian Splatting
181 Lin, Yuzhen; Chen, Hongyi; Chen, Xuanjing; Wang, Shaowen; Xu, Ivonne; Jiang, Dongming CGMG: Collaborative-Guided Multimodal Generative Recommendation
183 Liu, Anqi; Cheng, Qimin; Du, Yingjie SHNet: Spectral Bias Guidance and Hierarchical Dependency Modeling Network for Camouflaged Object Detection
187 Singh, Mantek; Challagundla, Jeshwanth; Raina, Siddharth; Jarsania, Jasmin Efficient Reasoning Distillation: Small Video-Language Models via Synthetic CoT and Difficulty-Aware Fine-Tuning
189 Zhang, Lehan; Cheng, Yinlei; Hu, Shiqi; Zhou, Yiheng; Li, Shangxi; Zhao, Naidong MRAFnd: Multimodal Retrieval-Augmented Framework for Zero-Shot Fake News Detection
196 Li, Jiafeng; Cai, Xichang; Wu, Menglong Dual-Stream Attention Across Time-Frequency for Sound Event Detection
201 Yu, Qiqun; Chen, Yihua; Ma, Jiliang; Tang, Zhenjun No-Reference Image Quality Assessment via Attention-Based Feature Enhancement and Feature Interaction
214 Jia, Heng; Zhao, Na; Xu, Yunqiu; Zhu, Linchao; Yang, Yi GAS: Geometry-Appearance Synergy for Consistent Video Customization
215 Cai, Jiajun; Su, Jianmei RC-NeRF: Anti-Aliasing with Artifact Suppression via Adaptive Hybrid Sampling in Explicit Voxel Grids
218 Liang, Peirou; Yang, Meng; Wu, Zhiqian; Zhou, Peng Yuan; Liao, Yong DiSCo: Disrupting Semantic Consistency for Transferable Cross-modal Adversarial Attacks
219 Li, Junhao; Chen, Jiahao; Feng, Zhou; Zhou, Chunyi Auditing M-LLMs for Privacy Risks: A Synthetic Benchmark and Evaluation Framework
220 Ye, Qihao; Wang, Zhuowei FGR: Frequency Aware and Geometric Structure-guided Multi-modality Image Registration Framework
221 Wang, Zhangyi; Li, Zongze MedFuse-GRM: Multi-scale Feature Extraction and Medically-Guided Graph Relation Modeling for Multimodal Skin Lesion Classification
227 Lyu, Meiyi; Mo, Jiawei; Chen, Xuewen; Wang, Chaoqun AxialUNet: A Lightweight Network for Medical Image Segmentation with Axial Operators
230 Yang, Likai; Li, Nianqiao; Liang, Xiaoping; Chen, Lv; Tang, Zhenjun Video Hashing via a Mamba-Transformer Network for Retrieval
231 Wu, Feng; Li, Li; Wang, Zhaojing DFRF-MIAD: Multimodal Industrial Anomaly Detection via Feature Reconstruction and Fusion
233 Zhang, Guobin; Li, Li; Wang, Qihang; Wang, Zhaojing; Peng, Tao; Hu, Xinrong PSR-Diff: Polarization-Guided Diffusion Model for Single Image Specular Highlight Removal
235 Li, Shuai; Yuan, Xin; Chen, Minshi; Yin, Yi; Xu, Xin NPFML: Non-isotropic Potential Fields with Hierarchical Decay for Deep Metric Learning
236 Ren, Ruichao; Wang, Yiqi; Zhang, Jiaxin; Yin, Wen; Guo, Yong; LI, Xiaoling Robust Ensemble of GNNs with Adaptive Graph Structure Learning
237 Li, Yuxuan; Ren, Yuning Enhancing Vision Transformer with Multiple Fractional-Order Differential Operators for Image Desnowing
241 Vo, Thanh-Nhan; Nguyen, Trong-Thuan; Nguyen, Tam V.; Tran, Minh-Triet VENUS: Visual Editing with Noise Inversion Using Scene Graphs
244 Deng, Yuchen; Chen, Hongyou; QU, Lingfeng; JIANG, Yong; FAN, Yong Noise Scale Controllable Anomaly Synthesis Strategy for Industrial Anomaly Detection and Localization
247 Wang, Wei; Hu, Jiayi Enhancing Image Generation of Diffusion Models with Structural Image Guidance
249 Wang, Lin; Li, Tiansong; Wang, Guofen; Cui, Shaoguo; Wang, Hongkui; Yu, Li HCFFPN: Hierarchical Cross-scale Feature Fusion Pyramid Network for Small Target Detection in Unmanned Aerial Vehicle Images
250 Zeng, Zhaofu; Xing, Jian MP-CLIP: Unlocking Long-Text Understanding in CLIP via Multi-Paragraph Encoding
251 Wu, Bo Token-Based Multi-Condition Autoregressive Diffusion for Lung CT Image Generation
252 Li, Qingguan; Cong, Jiawei; Zhao, Kai DAHM: A Dual-Stream Attention Fusion Model for Hate Content Detection
256 zhang, feng; tan, junliang; chen, zhenming; feng, hao; guo, biao; chen, junyan; lu, yao; Jiang, Ming TTEdit: Cross-Modal Fusion with Diffusion Models for Detail-Aware Fashion Editing
259 Guan, JingShuo; Qi, Na; Zhu, Qing; Chen, Liang UCAMNet: HVI Color Space based Unsupervised Low-Light Enhancement via Uncertainty Constraint and Attention Mechanism
263 Huo, Guang; Wang, Yue DPC-FCNet: A Dual-Channel Cross-Modality Person Re-Identification Network with Enhanced Multi-Level Feature Correlation
264 Wang, Xiaoqiang; Zhao, Liurui; Wang, Yanjie Surface defect detection of photovoltaic panels based on deep learning and electroluminescent images
274 Chen, Zhiting; Bai, Jieyun; Lu, Hua; Li, Suining; Zhang, Xiaoshen DPNet: A Dual-Perception Fusion Network for Automated Coronary Artery Segmentation
275 Yan, Hongzhi; Su, Jianmei SDB: Safety Constraint Mechanism for Dual-Branch End-to-End Autonomous Driving
276 Li, Yiqian; Ma, JInhua CSQDA: A Parameter-efficient and Memory-efficient Tuning Method for Medical Image Classification
278 Mai, Zhiyang; Qian, Yukun; Wang, Haitao; Wu, Hejun; Zhou, Liangliang LCKPose: Laplacian Candidate Keypoints Modeling for 6D Object Pose Estimation
279 Niu, Wenlong; Zhang, Zebao SCP: Sinkhorn-reconciled Collaborative Prompt Learning for Vision-Language Models
280 Zhou, Xinying; Li, Leixiao; Lin, Hao DAGMP: A Multimodal Learning Approach Jointly Driven by Feature Fusion and Gradient Modulation
282 Miao, Guohua; Xie, Zhihua; Chang, Haolin; Tu, Chengyu Spatial-Spectral Prior Guided Mamba Network for Hyperspectral Image Super-Resolution
284 Yang, Jiale; Zhao, Kai; Zhang, Linlin; Li, Qingguan Boosting the Transferability of Adversarial Examples via Frequency Domain Masking and Adaptive Step Size
288 Chen, Honghui; Zhou, Fan; Wang, Ruomei; Zhao, Baoquan V-HOI: Velocity-Aware Human-Object Interaction Generation
293 ao, yu; han, hongze; li, yuqin; miao, yu; shi, weili LGF-Net: Integrating Local and Global Features in a Dual-Branch Architecture for Tooth Segmentation in CBCT Images
294 Li, Feng; Wu, Ke; Li, Yongwei MCN-CL: Multimodal Cross-Attention Network and Contrastive Learning for Multimodal Emotion Recognition
299 Wei, GuoDong; Yu, Jiayu; Ao, Yu; Li, YuQin; Guan, YuanYuan; Shi, WeiLi; Miao, Yu; Jiang, ZhenGang CTDiff: A Lightweight Hybrid Diffusion Network for Low-Light Endoscopic Image Enhancement
300 Xu, Tianshi; Sun, Zhengzheng; Hu, Yizheng; Shang, Junyuan; Wu, Si Hierarchical Cross-Modality Interaction for Unified Video-Text Retrieval Modeling
302 Ma, Penghao; Wei, Guangcun; Kong, Chuike; Li, Shuo; Fang, Jianfeng SE-EEND: A Structurally Enhanced End-to-End Neural Diarization System
304 Liu, Wenzheng; Yuan, Ming; Wang, Yizhou; Shen, Lianghao; Wang, Xiaofeng; Xing, Qianqian; Cao, Ronghui; Tang, Xiaoyong; Deng, Tan; Fu, Cheng SPADE: Attention-Guided Split Diffusion for Precise Spatial Control in Interior Layout Image Generation
309 Xu, Feifei; Zhu, Wenjing; Li, Dongyang; Li, Puzhe Question-Aware Spatial-Temporal Reasoning in Patch for Audio-Visual Question Answering
310 Wang, Haoyang; Liu, Liming; Zhang, Xinggong R^2-Mesh: Reinforcement Learning Powered Mesh Reconstruction via Geometry and Appearance Refinement
313 Zhang, Wenli; Zhu, Dali; Zeng, Hualin; Yang, Long TrackPhys: Learning Transferable Physiological Representations for Motion-Robust Heart Rate and 3D Mask Attack Detection
317 He, Zefeng; Hua, Yong; Yang, Xuan Epistemic Uncertainty Guided Bayesian Neural Network for Cardiac Image Registration
318 Liang, Yun; Luo, Tang; Chen, Zhichao; Zhong, Cankun TF-AttNet: An Efficient Time-Frequency Structure Modeling For Low-Complexity Acoustic Scene Classification
319 WANG, YONGXIANG; Zhou, Gang SPDGS: Spatial Pruning and Depth Priors for Sparse-View 3D Gaussian Splatting
321 Zhang, Xuan; Li, Wenjing; Liu, Zhiqiang; Hu, Zhipeng; Liu, Na SMLA-YOLO: Efficient Multiscale Small Defect Detection in Wind Turbine Blades via Dynamic Feature Calibration
324 Sun, Hao; Liu, Yue; Yu, Peiqi; Fan, Kexuan SFL-Net: Synergistic Spatial-Frequency Learning for Medical Image Segmentation
326 Zhao, Faqi; Cheng, Dingxin; Chen, Xuanda; Su, Kun Multi-view Interaction Network with Guided Contrastive Learning for Multimodal Summarization
328 Wang, Dongsheng; Zhu, Yuan; Wang, Yifei PGS-YOLO: A Lightweight and Accurate Framework for Aerial Small Object Detection in Urban Environments
330 Shi, Qifeng; Zhang, Yan DRA-YOLO: Dynamic Receptive-Field Attention and Dual-Gated Upsampling Module Model for Aerial Object Detection
332 Yang, Xing Yao; Jia, Meng kun Dynamic Spectral Fusion and Causal Graph Propagation for Multimodal Recommendation
334 Wang, Quan; Ibrayim, Mayire MAF3Net: Multiscale Attention and Frequency Domain Feature Fusion for Oracle Cross-Domain Recognition Networks
341 Gao, Han; Ning, Tao; Duan, Xiaodong; Wan, Xiaochun; Chen, Lvzuo; Chen, Jingsong; Wang, Yuangang Multimodal Personality Trait Recognition via Spatiotemporal Modeling and Dual-Stage Fusion
343 Ao, Yu; Liu, Chengji; Ye, Jida; Li, YuQin; Miao, Yu; SHi, Weili WES2P: Wavelet-Enhanced SAM2 for Automatic Polyp Segmentation
347 Zhou, Ziyu; Lei, Xia; Zhang, Linlin; Fan, Yongkai FACER: Evaluating and Enhancing Explainable Feature-Level Robustness via Causal Effects
349 Xu, Feifei; Li, Puzhe; Li, Dongyang; Huang, Luobin; Zhu, Wenjing Text-Driven Hybrid Curriculum Learning for Multimodal Sentiment Analysis
351 Liu, Guozhen; Yu, JiaMing; Tan, PanLong; Zhang, XiaoYu; Sun, Qinglin; Sun, Hao SK2-eNeRF: Event-Driven Neural Radiance Fields for 3D Reconstruction in Dynamic and Blurry Scenes
352 Li, Yuqin; Qi, Kailun; Dai, Bofan; Ao, Yu; Shi, Weili Study on Path Planning of Acupuncture Treatment for Allergic Rhinitis
358 Zhou, Aoxiang; Wu, Hao; Peng, Kaichen; Liu, Peng; Li, Xianxian Improving Few-shot Multi-modal Aspect-Level Sentiment Classification with Implicit In-context Learning
362 Zheng, Kaihong; Sun, Lingyun; Cui, Yanan; Guo, Lili; Zhang, Jian Graph Dynamic Fusion Network with Contrastive Learning for Multimodal Emotion Recognition in Conversation
366 Yang, Tianchi; Liu, Shenling; Wu, Shihong; Jiang, Wenhao; Luo, Yuchuan; Liu, Lin; Fu, Shaojing FAHALE: Federated Asynchronous Heterogeneity-Adaptive Learning with Output Estimation
367 Guo, Jiacai; Xu, Zili; Luo, Jianjie; Lee, Lap-Kei; Wang, Fu Lee; Yang, Zhenguo FANB-Net: Frequency-Awared Attention and Noise-injected Boosting for AI-generated Image Detection
369 hu, yunteng; zhao, hongyan; ke, zunwang HiCP-SS: Hierarchical Two-Level Prototype Copy-Paste for Semi-Supervised Medical Segmentation
372 Dong, Chunling; Xu, Hang A Hierarchical Multi-Scale Attention Enhancement Method for Micro-Expression Recognition: from Channel Modulation to Spatial-Scale Fusion
375 Liang, Fang; Guo, Jianshu; Wen, Xuexiang; Hu, Wenhao; Ye, Xiang; Wang, Gaoang GauScene: Physically Plausible Scene Generation via Language-Guided 3D Gaussian Interaction
380 Wei, Xinyue; Pang, Weiguang; Wang, Changwei; Fu, Kexue; Qu, Youyang; Gao, Longxiang SAMFF: A Semantic-Guided Zero-Shot Multi-Focus Image Fusion Framework
386 peng, yimin; yan, xu; cao, ziqiang M3ITR: Modeling Many-to-Many Relationships for Robust Image-Text Retrieval
387 Liu, Yanfei; Xu, Miaosen; Shi, Youchang; Zou, Zheng; Li, yuanqian; Wen, Hao MCPDS-CMNet: A Multi-Conditional Prior-guided Dual Spiral CNN-Mamba Network for Face Sketch-Photo Synthesis
389 Ren, Xuena SRSG-AReID: Self-Rewarded Shuffle Grouping for Robust Aerial Person Re-Identification
394 Qiu, Weibin; Wu, Suping; Xu, Hao; Yang, Jie; Zhang, Xiang ZC-MVSNet: Zero-Sum Convolution and Prior Fusion for Multi-View Stereo
398 Zhao, Yawei; Wumaier, Aishan; Guo, Xueliang; Lv, Yaxuan A Multi-aspect Multi-granularity Pronunciation Assessment Method Based on Multi-feature Fusion and Transformer Encoder
402 Zhu, Xinya; Li, Wenqiang; Qiao, Mengyu; Yang, Zhihui; Wang, Yang MOSS: Multi-modal Source Separation for Music Deepfake Detection
403 Zhang, Tianyun; Zou, Binfeng; Zhang, Xiaoshuai; Zhang, Guangyuan; Huang, Zhao; Liu, Jin; Zheng, Zhiwen; Huang, Xingru Graph-Dynamics Augmented Foundation Model for Surgical Instrument Segmentation
405 Ma, Qiang; Wu, Suping; Yang, Sheng; Qiu, Weibin; Liu, Feng; Jin, Zhaocheng; Xu, Hao HiAvatar: High-Fidelity Animatable Head Avatar
409 Yao, Yuheng; Zhuang, Liansheng; Long, Xiao; Wang, Shafei Exploring the Polysemy of Relations in Path Reasoning for Inductive Knowledge Graph Completion
150 Xue, Song lin; Zhang, Hong Towards Robust and Secure Cross-Domain Face Anti-Spoofing via Feature-Boundary Consistency
271 xue, song; du, jiayu; huang, wei; yang, qiulong A Backdoor Attack via Fixed Patch Triggers in Frequency Domain
296 Guo, Yusheng; Hao, Qiang; Liu, Zhao; Wu, Qingshuang; Lu, Yanliang; Gu, Ming; Hu, Su Backdoor-based Protection Architecture for Model Functional Services
400 Zhu, Jin; Du, Jiayu; Zhang, Fan; Chen, Xin; Zhou, Zhizhong Enhancing Ensemble Adversarial Defense via Dataset Orthogonal Decomposition
303 昊凯, 徐; 登实, 李 Emotion-aware Multi-modal Fusion for Human Behavior Analysis via Graph and State-space Modeling
Demo Session (Session 9a) 31.1. 11:00 - 12:20 206 Rossetto, Luca; Ruosch, Florian Beyond the Blob: Demonstrating MeGraS for Multimodal Knowledge Graph Interaction
416 Bui, Quoc-Anh; Lim, Serhane; Boudard, Tom; Rougeron, Gilles; Gasparini, Simone; Morin, Géraldine XROI-GS: Real-time XR Interactive Inspection of High-quality Objects of Interest in a 3D Gaussian Splats Scene
417 Hamanaka, Masatoshi AI-based Composition Tools for Composing A School Song With Student Participation
418 Baglanova, Aidana; Babakhojayeva, Zilola; Fakhrutdinov, Nail; Kalimzhanov, Ruslan; Azirakhmet, Umit; Yatbaz, Hakan Yekta; Yazici, Adnan ZhadigerAI: Software as a Service AI Platform for Kazakh and English Languages
422 Fernandez Roblero, Jaime Boanerjes; Syed, Ali Akbar Shah; Ali, Muhammad Intizar Ask VR: Vision Language Model Driven Scene Descriptor for Blind and Low Vision Users in VR Environment
424 Kostka de Sztemberg, Berenika Nawoja; Żywica, Patryk A Multimedia Pipeline for Interactive Game Archaeology on the Example of Wolfenstein 3D in Godot Engine
428 Pham, Minh-Anh; Luu, Duc-Tuan; Skivdal, Johannes; Dang-Nguyen, Duc-Tien Past Forward: Exif Revisited
430 Vats, Shivi; Timmerer, Christian; Hellwagner, Hermann STEP-MR: A Subjective Testing and Eye-Tracking Platform for Dynamic Point Clouds in Mixed Reality
VBS Sessions & Welcome reception 29.1. 13:00 - 16:10 (private) & 29.1. 17:00 - 19:30 (public) 411 Huynh, Viet-Tham; Le-Hinh, Nhut-Thanh; Nguyen-Ho, Thang-Long; Nguyen, Trong-Thuan; Gurrin, Cathal; Nguyen, Tam V.; Tran, Minh-Triet TapesVRy: Immersive Panoramic Exploration in Large-Scale Video Retrieval
412 Le, Huy M.; Nguyen, Tien Dat; Nguyen, Phuc Binh; Le Tran, Gia Bao; Truong Thien, Phu; Dinh, Cuong; Nguyen, Nga T.N.; Nguyen, Thuy T. N.; Ngo, Huy Gia; Nguyen, Tan Nhat; Nguyen, Binh T. Fusionista2.0: Efficiency Retrieval System for Large-Scale Datasets
413 Ueki, Kazuya; Muto, Ryo; Wada, Takuya; Akaba, Ryota; Zhang, Guannan U-Cker at VBS2026: A Web-Based Interactive Video Retrieval System with Multimodal Query Support
414 Cheng, Yu-Tong; Nguyen, Phuong-Anh; Kha, Kim-Thuy; Ngo, Chong-Wah VIREO @ Video Browser Showdown 2026
419 Ho-Le, Minh-Quan; Ho, Duy-Khang; Ninh, Tu V.; Gurrin, Cathal; Tran, Minh-Triet From Expert Practices to Intelligent Agents: Autonomy in Interactive Video Retrieval
420 Geller, Andrina; Arnold, Rahel; Waltenspül, Raphael; Schuldt, Heiko Extending vitrivr-engine with Emotion-Based Retrieval and a Modular User Interface
421 Arnold, Rahel; Pietzak, Anna; Schuldt, Heiko MediaMix: Multimedia Retrieval with Dual Backend Support and Result Exploration in MR
423 Nguyen-Ho, Thang-Long; Huynh, Viet-Tham; Tran, Allie; Tran, Minh-Triet; Gurrin, Cathal; Healy, Graham H-EAGLE: Hierarchical Extension of EAGLE for Multi-Level Semantic Video Retrieval
425 Pantelidis, Nick; Kosmidou, Eleni; Galanopoulos, Damianos; Georgalis, Dimitris; Pasios, Stefanos; Apostolidis, Konstantinos; Goulas, Andreas; Pegia, Maria; Tsionkis, Georgios; Gkountakos, Konstantinos; Kouvrakis, Grigorios; Moumtzidou, Anastasia; Gialampoukidis, Ilias; Vrochidis, Stefanos; Mezaris, Vasileios; Kompatsiaris, Ioannis VERGE in VBS 2026
426 Jäckl, Bastian; Verner, Benjamin; Stroh, Michael; Kloda, Vojtech; Nagy, Ladislav; Deussen, Oliver; Keim, Daniel A.; Lokoc, Jakub PraK V4 at the Video Browser Showdown 2026
427 Tran, Bao; Do, Tien; Duc Ngo, Thanh; Le, Duy-Dinh; Satoh, Shin’ichi NII-UIT at VBS2026: Towards Effective Visual Question Answering for Interactive and Multimodal Video Retrieval
429 Khan, Omar Shahbaz; Sharma, Ujjwal; Marcelino, Gonçalo; Rudinac, Stevan; Jónsson, Björn Þór Exquisitor at the Video Browser Showdown 2026: Temporal Queries Revisited

Keynotes

Jiri Matas

Jiri Matas is the head of the Visual Recognition Group at the Center for Machine Perception, Department of Cybernetics, Czech Technical University in Prague. He holds a PhD degree from the University of Surrey, UK (1995). He has published more than 300 papers, which have been cited about 74,000 times according to Google Scholar (h-index = 99). He received the best paper prize at the British Machine Vision Conference in 2002, 2005 and 2022, at the Asian Conference on Computer Vision in 2007, and at the International Conference on Document Analysis and Recognition in 2015. He served as a programme or general chair at ECCV 2004, 2016 and 2022 and at CVPR 2007 and 2022. He is an Editor-in-Chief of the International Journal of Computer Vision and was an Associate Editor-in-Chief of IEEE Transactions on Pattern Analysis and Machine Intelligence. He is on the computer science panel of the ERC.
His research interests include visual tracking, object recognition, image matching and retrieval, sequential pattern recognition, and RANSAC-type optimization methods. He has co-founded two companies, Eyedea Recognition (computer vision) and Locksley (combinatorial optimization).

Marcel Worring: Multimedia Analytics in the Foundation Model Era

Professor Marcel Worring

Abstract: Multimedia Analytics was introduced in 2010 by Chinchor et al. as the combination of multimedia analysis and visual analytics. In 2014 Zahalka et al. developed a model for Multimedia Analytics, while at the same time Sacha et al. developed a visual analytics model connecting interactive processes with human cognition. Until recently, these two fields have developed more or less in parallel. Foundation models have changed the playing field completely. They combine multimedia analysis with reasoning and generation capabilities. Still, for many tasks, humans outperform AI, and human-AI teaming is the best way forward. Yet the main communication channel between humans and AI is a textual prompt. Clearly, Visual Analytics could yield significant improvements here. In this talk, we will consider the historical developments in multimedia analysis and visual analytics and present a new multimedia / visual analytics model suited for the foundation model era. We show the value of the model by taking a number of existing systems as case studies, using the model to describe them, and suggesting improvements based on the model characteristics.

Bio: Marcel Worring is a full professor in the Informatics Institute of the University of Amsterdam. He leads the MultiX group, which researches multimedia analytics techniques for getting the richest information possible from data through AI algorithms, interactions, and interfaces, surpassing human and machine intelligence for applications and social impact in public health, forensics and law enforcement, cultural heritage, and data-driven business. He has been an associate editor of ACM TOMCCAP, IEEE Transactions on Multimedia, and IEEE Multimedia; he organized ACM Multimedia 2016 and MMM2024 and will organize ICMR2026. He is a fellow of ELLIS, the European Laboratory for Learning and Intelligent Systems, and co-founder of the Innovation Center for Artificial Intelligence.

Giuseppe Amato: Synergy between Extended Reality and Artificial Intelligence

Dr. Giuseppe Amato

Abstract: Extended Reality (XR) builds on Augmented Reality (AR) and Mixed Reality (MR), which themselves extend the foundations of Virtual Reality (VR). VR immerses users in fully digital environments through headsets, smart devices, or computer screens, but it isolates them from the physical world. AR and MR, instead, overlay digital content onto the real environment. Virtual objects are aligned, fused, and synchronized with physical surroundings, and in MR users can interact with both physical and virtual elements in a unified space.

XR goes even further. It is not limited to visual augmentation; it enhances the realism of interactions across multiple sensory dimensions. With XR, users can feel virtual objects—their weight, their temperature, even their texture and consistency—bringing the physical and digital worlds into deeper, more natural continuity.

Artificial Intelligence plays a crucial role in enabling this integration. AI can enrich virtual scenes with semantic information, reconstruct and interpret 3D environments, correct and enhance data during 3D digitization, support the creation of virtual worlds from physical ones, and facilitate seamless interaction across both realms.

In this keynote, we will explore how these challenges have been addressed within the SUN project (Social and hUman ceNtered XR – https://www.sun-xr-project.eu/) and discuss the current limitations and emerging opportunities in this rapidly evolving field.

Bio: Dr. Giuseppe Amato is a research director at CNR-ISTI in Pisa, where he leads the "Artificial Intelligence for Multimedia and Humanities" laboratory (AIMH - http://aimh.isti.cnr.it/). He was awarded a PhD in Computer Science at the University of Dortmund, Germany, in 2002. His main research interests are artificial intelligence, extended reality, content-based retrieval of multimedia documents, access methods for similarity search, and smart camera networks. He has published more than 200 papers in peer-reviewed international journals and conferences in the areas of artificial intelligence and multimedia information retrieval. He has participated in several EC and nationally funded research actions in the areas of Artificial Intelligence, Multimedia Information Retrieval, Computer Vision, Extended Reality, Robotics, and Cultural Heritage, and he currently coordinates the Social and hUman ceNtered XR (SUN) project (https://www.sun-xr-project.eu/) funded by the Horizon Europe Research & Innovation Programme. More at http://aimh.isti.cnr.it/giuseppeamato/