Program at a glance
| Thursday | Friday | Saturday | |||||
| January 29, 2026 | January 30, 2026 | January 31, 2026 | |||||
| Room 1 | Room 2 | Room 1 | Room 2 | Room 1 | Room 2 | ||
| 8:00 | 8:30 | Registration | |||||
| 8:30 | 9:00 | Opening | Keynote 2: Marcel Worring | Keynote 3: Giuseppe Amato | |||
| 9:00 | 9:30 | Keynote 1: Jiri Matas | |||||
| 9:30 | 10:00 | Session 4: Motion models Chair: Werner Bailer |
Session 8: Language & Text Generation Chair: Duc Tien Dang Nguyen |
||||
| 10:00 | 10:10 | Tea Break | |||||
| 10:10 | 10:30 | Tea Break | Tea Break | ||||
| 10:30 | 10:40 | Session 1: Best paper candidates Chair: Stevan Rudinac and Jan Zahalka |
|||||
| 10:40 | 11:00 | Session 5: Online Posters | Session 9a: Demo Session | Session 9b: MARS Spec. Session Chair: Thu Nguyen |
|||
| 11:00 | 11:30 | ||||||
| 11:30 | 12:00 | ||||||
| 12:00 | 12:10 | Lunch Break | Lunch Break | ||||
| Lunch Break | |||||||
| 13:00 | 13:30 | Session 2a: Recommendation & Graph learning Chair: Kai Uwe Barthel |
VBS Session | Session 6: Vision-Language Models & Multimedia Applications Chair: Giuseppe Amato |
Session 10a: Datasets, Missing Data & Speech Chair: Luca Rossetto |
Session 10b: MOMST and HCMBA Spec. Session Chair: Mario Döller |
|
| 13:30 | 14:00 | ||||||
| 14:00 | 14:40 | ||||||
| 14:40 | 15:00 | Tea Break | Closing | ||||
| 15:00 | 15:10 | Tea Break | |||||
| 15:10 | 15:30 | Session 3a: Image enhancement, Object detection & Explanations Chair: Max Fischer |
|||||
| 15:30 | 16:00 | Session 7: Video Retrieval & Datasets Chair: Klaus Schoeffmann |
Prague Tour and Boat Trip | ||||
| 16:00 | 16:10 | ||||||
| 16:10 | 16:30 | ||||||
| 16:30 | 17:00 | ||||||
| 17:00 | 17:10 | Welcome Reception and VBS Session | |||||
| 17:10 | 17:30 | ||||||
| 17:30 | 18:00 | ||||||
| 18:00 | 18:30 | ||||||
| 18:30 | 19:00 | ||||||
| 19:00 | 19:30 | Banquet | |||||
| 19:30 | 20:00 | ||||||
| 20:00 | 21:00 | ||||||
| 21:00 | 22:00 | ||||||
Detailed Program
| Session | Time | Paper ID | Authors | Title |
|---|---|---|---|---|
| Session 1: Best paper candidates | 29.1. 10:30 - 12:10 | 229 | Tu, Teng; Liu, Xiaohao; Ma, Yunshan; Qi, Ji; Chua, Tat-Seng | Integrating Symbolic and Waveform Music into Large Language Models |
| 335 | Chen, Lucy; Collins, KC | Can AI Capture Emotion? A Study on Human Emotional Perception and Response to AI-Generated and Human-Composed Pop Music | ||
| 340 | ZHU, Bin; Yin, Hailong; Chen, Jingjing; Jiang, Yu-Gang | Benchmarking Gaslighting Negation Attacks Against Reasoning Models | ||
| 410 | Deng, Zhixuan; Zhu, Yifan; Xiang, Lei; Jin, Shilong; Duan, Haoran; Long, Yang; Zhou, Yuan | ZeroDINO: Entropy-Driven Granularity-Aware Semantic Fusion for Zero-Shot Learning | ||
| Session 2a: Recommendation & Graph learning | 29.1. 13:00 - 14:40 | 106 | Dose, Yuma; Hara, Takahiro | Graph Contrastive Learning with Popularity and Neighborhood Awareness for Long-Tail Item Recommendation |
| 149 | Tsukuda, Kosetsu; Ishida, Keisuke; Takahashi, Takumi; Hamasaki, Masahiro; Goto, Masataka | A Case Study of a Transparent and Controllable Music Recommender System with Multi-Relational Layers | ||
| 182 | Dang Hoang Minh, Triet; Tran Hoang, Anh; Nguyen Hoang, Hai; Tran Nguyen Minh, Quang; Tran Cong, Hieu; Nguyen, Thu; Nguyen Thanh, Binh | PreBERT-Rec: Improving Topic Modeling in Recommendation Systems via Effective Data Preprocessing and BERT | ||
| 207 | Ruosch, Florian; Rossetto, Luca | Applications of Multimodal Knowledge Graphs in Modeling Multimedia | ||
| 348 | Wang, Haoyang; Zhang, Shengbing; Fan, Xiaoya; Zhu, Junda; Zhang, Meng | Enabling Efficient Distributed Graph Neural Network Acceleration with Near Memory Processing | ||
| Session 3a: Image enhancement, Object detection & Explanations | 29.1. 15:10 - 16:10 | 109 | Xu, Zhoutong; Wang, Zhangye | MAGNet: Multi-Level Attention For Guided Thermal Infrared Image Super-Resolution |
| 359 | Hu, Weiyi; Cui, Hua; Hu, Haoran; Yang, Zhao | TD-MBEV: Robust 3D Object Detection with Temporal Diffusion-Masked BEV | ||
| 103 | Bai, Yannan; Wang, Danding; Tang, Sheng; Cao, Juan; Li, Jintao | Dissecting Deepfake Artifacts via Multimodal Explanations | ||
| Session 4: Motion models | 30.1. 9:30 - 10:10 | 165 | Li, Zhaoyang; Tian, Jinglan; Lyu, Na | Conditional VQ-VAE for Action-Conditioned Motion Generation |
| 283 | Yu, Congrui; Fan, Bo; Lyu, Na | MotionSlim: A Lightweight T2M Generation Framework Based on LLM | ||
| Session 6: Vision-Language Models & Multimedia Applications | 30.1. 13:00 - 15:00 | 111 | Zhu, Fengbin; Liu, Ziyang; NG, Xiang Yao; Wu, Haohui; Wang, Wenjie; Feng, Fuli; Wang, Chao; Luan, Huanbo; Chua, Tat-Seng | MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding and Grounding |
| 118 | Martín-Fernández, Iván; Constantin, Mihai Gabriel; Ionescu, Bogdan; Esteban-Romero, Sergio; Fernández-Martínez, Fernando; Gil-Martín, Manuel | A Case Study on Large Visual-Language Model Attention Explainability After Adaptation Using Persuasion Strategies in Advertisements | ||
| 166 | Chi, Jui-Feng; Chu, Wei-Ta; Lin, Sheng-Long | Food Image Segmentation with LLM-Derived Ingredient Labels and Multimodal Fusion | ||
| 272 | Tran, Allie; Rossetto, Luca | On the Brittleness of CLIP Text Encoders | ||
| 337 | Wu, Xinlan; Zhu, Bin; Han, Feng; Jiao, Pengkun; Chen, Jingjing | Dual-LoRA and Quality-Enhanced Pseudo Replay for Multimodal Continual Food Learning | ||
| 373 | Gan, Kian-Yu; Nguyen, Phuong-Anh; Ngo, Chong-Wah | Food Recognition with Visual Language Models: Search Re-ranking or Retrieval-Augmented Generation? | ||
| Session 7: Video Retrieval & Datasets | 30.1. 15:30 - 17:10 | 120 | LE, HOANG BAO; Tran, Allie; T. Nguyen, Binh; Zhou, Liting; Gurrin, Cathal | FIGROTD: A Friendly-to-Handle Dataset for Image Guided Retrieval with Optional Text |
| 128 | Wattasseril, Jobin Idiculla; Scheibel, Willy; Döllner, Jürgen | Benchmarking SmolVLM for Parking Occupancy Detection | ||
| 185 | Tarekegn, Adane Nega; Rabbi, Fazle; Opdahl, Andreas Lothe; Tessem, Bjørnar | Multimodal Video Summarization with Mamba and Bayesian Approach | ||
| 190 | He, Chunjiang; Yang, Gang | DiffSynth-LVOS: Enhancing Language-Guided Video Object Segmentation via Diffusion-Based Synthetic Data Generation | ||
| 383 | Kongmeesub, Onanong; Spiess, Florian; Gurrin, Cathal; Nie, Dongyun; Rattanatamrong, Prapaporn | An Eye Tracking Dataset for Multimedia Retrieval | ||
| Session 8: Language & Text Generation | 31.1. 9:30 - 10:10 | 216 | Presacan, Oriana; Nik, Alireza; Thambawita, Vajira; Ionescu, Bogdan; Riegler, Michael | A Comparative Study of Decoding Strategies in Medical Text Generation |
| 281 | Tran, Minh Huan; Tran Nguyen, Minh Quang; Pham, Phi Nhung; Huynh, Thanh Son; Nguyen, Thanh Binh | HFS: Hierarchical Fine-Tuning for Span Detection and Aspect-Based Sentiment Analysis In Vietnamese Language | ||
| Session 9b: MARS special session | 31.1. 11:00 - 12:20 | 260 | Michael, Yonathan; Alansari, Mohamad; Assefa, Maregu; Werghi, Naoufel; Henschel, Andreas | X-ThreatDet: Enhancing X-ray Threat Detection with Self-Supervised and Multi-modal learning |
| 333 | Neuschmied, Helmut; Winter, Martin; Bailer, Werner | Improving Few-Shot Object Detection using Visual Explanations of DINOv2 Features | ||
| 370 | Schlegel, Udo; Weeber, Franziska; Lan, Jian; Seidl, Thomas | PRSM: A Measure to Evaluate CLIP's Robustness Against Paraphrases | ||
| Session 10a: Datasets, Missing Data & Speech | 31.1. 13:20 - 14:40 | 269 | Do, Thanh Tu; Hua, Van; Dang, Uyen; Nguyen, Thu; Hicks, Steven; Halvorsen, Pål; Riegler, Michael A.; Nguyen, Binh T. | Low-dimension Representation Estimation in Principal Component Analysis under Missing Data |
| 289 | Sun, Zhicong; Lo, Jacqueline; Hu, Jinxing | WildfireX-SLAM: A Large-scale Low-altitude RGB-D Dataset for Wildfire SLAM and Beyond | ||
| 311 | Vo, Tuan L.; Dang, Uyen; Nguyen, Thu; Halvorsen, Pål; Riegler, Michael A.; Nguyen, Binh T. | DPERC: Direct Parameter Estimation for Mixed Data with Random Missingness | ||
| 407 | Langø, Victoria; Hassan, Zohaib; Hicks, Steven | Synthesizing Norwegian Dialects in Low-Resource TTS | ||
| Session 10b: MOMST and HCMBA special sessions | 31.1. 13:20 - 14:40 | 115 | Lu, Yi-Hsuan; Chu, Wei-Ta | Vision-Based 3D Baseball Swing Trajectory Reconstruction and Swing Performance Analysis |
| 245 | Razyapov, Oskar; Vojtas, Peter; Balcar, Stepan | A Video Benchmark Dataset for Indoor Object Positioning in Industrial Environments | ||
| 119 | Rajendran, Megani; Ng, Aik Beng; Tan, Chek Tien; Atmosukarto, Indri; Lim Jun Feng, Joey; Ping Shu Ho, Cliff; See, Simon | AutoPose: Pose-Mixing for Rare Human Video Data Augmentation to Enhance Recognition | ||
| Virtual Posters (Session 5) | 30.1. 10:40 - 12:00 | 101 | Ding, Guohui; Fan, Tengyu; Wang, Chufei | Multi-granular Feature Selection Fusion Method for Multimodal Named Entity Recognition |
| 102 | Fan, Ziyang; Tao, Li; Wang, Yi; Qu, Jingwei; Wang, Ying; Jiang, Fei | DS-HGCN: A Dual-Stream Hypergraph Convolutional Network for Predicting Student Engagement via Social Contagion | ||
| 113 | Ye, Haiyang; Li, Dengshi; Wu, YuLin; LI, Wei; Fang, Yu; Li, Yuxin | WavGateMamba: A Frequency-Enhanced and Gated Mamba Model for Multimodal Depression Detection | ||
| 125 | Lin, Hailan; Wei, Qijie; Tian, Kaibin; Zhao, Ruixiang; Li, Xirong | Co-Teaching for Unsupervised Domain Expansion | ||
| 127 | Chen, Ziyu; Wang, Hanli | Taming Image-based Vision-Language Pre-training Model with Bootstrapped Auxiliary Tasks for Video Captioning | ||
| 130 | Zou, Jiahao; Zhang, Congxuan; Ge, Liyue; He, Chao; Yang, Jiawen; Chen, Zhen; Lu, Ke | SAP-DQR: Joining Spatial-Adaptive Pyramid and Adaptive Query Reorganization for Speed-Accuracy Instance Segmentation | ||
| 132 | Peng, Bo; Lyu, YuanJie; Qin, PengGang; Xu, Tong | Towards Effective Long-Video Event Prediction via Multi-Level Event Semantics Mining | ||
| 134 | Qiu, Xinkuan; Zhou, Yongbin | Comparative Robustness of CNNs, ViTs, and MLLMs under Image Corruption | ||
| 135 | WANG, YONGXIANG; Zhou, Gang; Liu, Wei; Zhou, Yang | STEREO3D-NERF: GENERATING 3D VISUALIZATIONS WITH PAIRED STEREOSCOPIC VIEWS | ||
| 137 | Li, Hao; Cui, zhenchao | MS-MRFNet: A Multi-Scale and Multi-Receptive Field Network for UAV Aerial Object Detection | ||
| 138 | FENG, XIAOJING; TAN, ZHENHUA; CHENG, ZIWEI; LUO, JIAYUAN | HDBC: A Heterogeneous Dual-Branch Convolutional Network for Audio Splicing Detection | ||
| 146 | Xie, Wen; Zhu, Yanjun; Overgoor, Gijs; Bart, Yakov; Lapedriza Garcia, Agata; Ostadabbas, Sarah | AdSum: Two-stream Audio-visual Summarization for Automated Video Advertisement Clipping | ||
| 153 | Wang, Xiaoyu; Liu, Jing | RIDA: Detection of Adversarial Examples through Image Adaptive Local Reconstruction | ||
| 154 | Guo, YuNong; Liu, Jing | AIT3D-DSR: An Adjustable Integration Targeted 3D Adversarial Attack based on Differentiable Structured Rendering | ||
| 155 | Li, Hongyang; Tao, Junyi; Wei, Qijie; Yang, Ningzhi; Wang, Meng; Yu, Weihong; Li, Xirong | Cross-modal Fundus Image Registration under Large FoV Disparity | ||
| 156 | Liao, Yun; Chen, Nan; Liu, Junhui; Lyu, Jiayi; Hu, Zongxiao; Duan, Qing | SeViMatch: A Detector-Based Image Matching Framework with Semantic-Visual Fusion | ||
| 157 | Gao, Ziyuan; Morel, Philippe | Prompt-Aware Adaptive Elastic Weight Consolidation for Continual Learning in Medical Vision-Language Models | ||
| 158 | Du, Haizhou; Li, Wenhao | M3RAG: An Adaptive Multi-Agent Framework for Multi-Modal Multi-Hop Reasoning | ||
| 162 | Sun, Ao; Hao, Shijie; Guo, Yanrong | Illumination-Prior Guided Hybrid Network for Low-Light Image Enhancement | ||
| 163 | Teng, shangzhi; Li, yekai; Gong, xi; Lv, xueqiang | Small Object Detection via Frequency-Based Multi-modal Fusion | ||
| 170 | Lu, Nengbo; Pan, Minghua; Sun, Shaohua; Liang, Yizhou | GS-DMSR: Dynamic Sensitive Multi-scale Manifold Enhancement for Accelerated High-Quality 3D Gaussian Splatting | ||
| 171 | Liao, Yun; Lyu, Jiayi; Liu, Junhui; Chen, Nan; Hu, Zongxiao; Duan, Qing | FFMatch: A FilterFormer-Based Network for Accurate Multimodal Image Matching | ||
| 177 | Fu, Cheng; Wu, Junlong; Chen, Xianhong; Peng, Hujin; Xu, Jing; Gong, Junyuan; Liu, Zeyun; Liu, Wenzheng; Deng, Tan; Yuan, Ming | CCASNet: Criss-Cross Attention Enhanced Network with Dual-Channel Spatial Modeling for Medical Image Segmentation | ||
| 178 | Chen, Feiyu; Li, Zijian; Yu, Nanjun; Ruan, Tangjun; Ma, Teng; Zhang, Chao | Diffusion-driven Deep Variational Image Clustering with Representation Decoupling | ||
| 180 | Shi, Tong; de Almeida, Melonie; Ivanova, Daniela; Pugeault, Nicolas; Henderson, Paul | Splat-Portrait: Generalizing Talking Portraits with Gaussian Splatting | ||
| 181 | Lin, Yuzhen; Chen, Hongyi; Chen, Xuanjing; Wang, Shaowen; Xu, Ivonne; Jiang, Dongming | CGMG: Collaborative-Guided Multimodal Generative Recommendation | ||
| 183 | Liu, Anqi; Cheng, Qimin; Du, Yingjie | SHNet: Spectral Bias Guidance and Hierarchical Dependency Modeling Network for Camouflaged Object Detection | ||
| 187 | Singh, Mantek; Challagundla, Jeshwanth; Raina, Siddharth; Jarsania, Jasmin | Efficient Reasoning Distillation: Small Video-Language Models via Synthetic CoT and Difficulty-Aware Fine-Tuning | ||
| 189 | Zhang, Lehan; Cheng, Yinlei; Hu, Shiqi; Zhou, Yiheng; Li, Shangxi; Zhao, Naidong | MRAFnd: Multimodal Retrieval-Augmented Framework for Zero-Shot Fake News Detection | ||
| 196 | Li, Jiafeng; Cai, Xichang; Wu, Menglong | Dual-Stream Attention Across Time-Frequency for Sound Event Detection | ||
| 201 | Yu, Qiqun; Chen, Yihua; Ma, Jiliang; Tang, Zhenjun | No-Reference Image Quality Assessment via Attention-Based Feature Enhancement and Feature Interaction | ||
| 214 | Jia, Heng; Zhao, Na; Xu, Yunqiu; Zhu, Linchao; Yang, Yi | GAS: Geometry-Appearance Synergy for Consistent Video Customization | ||
| 215 | Cai, Jiajun; Su, Jianmei | RC-NeRF: Anti-Aliasing with Artifact Suppression via Adaptive Hybrid Sampling in Explicit Voxel Grids | ||
| 218 | Liang, Peirou; Yang, Meng; Wu, Zhiqian; Zhou, Peng Yuan; Liao, Yong | DiSCo: Disrupting Semantic Consistency for Transferable Cross-modal Adversarial Attacks | ||
| 219 | Li, Junhao; Chen, Jiahao; Feng, Zhou; Zhou, Chunyi | Auditing M-LLMs for Privacy Risks: A Synthetic Benchmark and Evaluation Framework | ||
| 220 | Ye, Qihao; Wang, Zhuowei | FGR: Frequency Aware and Geometric Structure-guided Multi-modality Image Registration Framework | ||
| 221 | Wang, Zhangyi; Li, Zongze | MedFuse-GRM: Multi-scale Feature Extraction and Medically-Guided Graph Relation Modeling for Multimodal Skin Lesion Classification | ||
| 227 | Lyu, Meiyi; Mo, Jiawei; Chen, Xuewen; Wang, Chaoqun | AxialUNet: A Lightweight Network for Medical Image Segmentation with Axial Operators | ||
| 230 | Yang, Likai; Li, Nianqiao; Liang, Xiaoping; Chen, Lv; Tang, Zhenjun | Video Hashing via a Mamba-Transformer Network for Retrieval | ||
| 231 | Wu, Feng; Li, Li; Wang, Zhaojing | DFRF-MIAD: Multimodal Industrial Anomaly Detection via Feature Reconstruction and Fusion | ||
| 233 | Zhang, Guobin; Li, Li; Wang, Qihang; Wang, Zhaojing; Peng, Tao; Hu, Xinrong | PSR-Diff: Polarization-Guided Diffusion Model for Single Image Specular Highlight Removal | ||
| 235 | Li, Shuai; Yuan, Xin; Chen, Minshi; Yin, Yi; Xu, Xin | NPFML: Non-isotropic Potential Fields with Hierarchical Decay for Deep Metric Learning | ||
| 236 | Ren, Ruichao; Wang, Yiqi; Zhang, Jiaxin; Yin, Wen; Guo, Yong; LI, Xiaoling | Robust Ensemble of GNNs with Adaptive Graph Structure Learning | ||
| 237 | Li, Yuxuan; Ren, Yuning | Enhancing Vision Transformer with Multiple Fractional-Order Differential Operators for Image Desnowing | ||
| 241 | Vo, Thanh-Nhan; Nguyen, Trong-Thuan; Nguyen, Tam V.; Tran, Minh-Triet | VENUS: Visual Editing with Noise Inversion Using Scene Graphs | ||
| 244 | Deng, Yuchen; Chen, Hongyou; QU, Lingfeng; JIANG, Yong; FAN, Yong | Noise Scale Controllable Anomaly Synthesis Strategy for Industrial Anomaly Detection and Localization | ||
| 247 | Wang, Wei; Hu, Jiayi | Enhancing Image Generation of Diffusion Models with Structural Image Guidance | ||
| 249 | Wang, Lin; Li, Tiansong; Wang, Guofen; Cui, Shaoguo; Wang, Hongkui; Yu, Li | HCFFPN: Hierarchical Cross-scale Feature Fusion Pyramid Network for Small Target Detection in Unmanned Aerial Vehicle Images | ||
| 250 | Zeng, Zhaofu; Xing, Jian | MP-CLIP: Unlocking Long-Text Understanding in CLIP via Multi-Paragraph Encoding | ||
| 251 | Wu, Bo | Token-Based Multi-Condition Autoregressive Diffusion for Lung CT Image Generation | ||
| 252 | Li, Qingguan; Cong, Jiawei; Zhao, Kai | DAHM: A Dual-Stream Attention Fusion Model for Hate Content Detection | ||
| 256 | zhang, feng; tan, junliang; chen, zhenming; feng, hao; guo, biao; chen, junyan; lu, yao; Jiang, Ming | TTEdit: Cross-Modal Fusion with Diffusion Models for Detail-Aware Fashion Editing | ||
| 259 | Guan, JingShuo; Qi, Na; Zhu, Qing; Chen, Liang | UCAMNet: HVI Color Space based Unsupervised Low-Light Enhancement via Uncertainty Constraint and Attention Mechanism | ||
| 263 | Huo, Guang; Wang, Yue | DPC-FCNet: A Dual-Channel Cross-Modality Person Re-Identification Network with Enhanced Multi-Level Feature Correlation | ||
| 264 | Wang, Xiaoqiang; Zhao, Liurui; Wang, Yanjie | Surface defect detection of photovoltaic panels based on deep learning and electroluminescent images | ||
| 274 | Chen, Zhiting; Bai, Jieyun; Lu, Hua; Li, Suining; Zhang, Xiaoshen | DPNet: A Dual-Perception Fusion Network for Automated Coronary Artery Segmentation | ||
| 275 | Yan, Hongzhi; Su, Jianmei | SDB: Safety Constraint Mechanism for Dual-Branch End-to-End Autonomous Driving | ||
| 276 | Li, Yiqian; Ma, JInhua | CSQDA: A Parameter-efficient and Memory-efficient Tuning Method for Medical Image Classification | ||
| 278 | Mai, Zhiyang; Qian, Yukun; Wang, Haitao; Wu, Hejun; Zhou, Liangliang | LCKPose: Laplacian Candidate Keypoints Modeling for 6D Object Pose Estimation | ||
| 279 | Niu, Wenlong; Zhang, Zebao | SCP: Sinkhorn-reconciled Collaborative Prompt Learning for Vision-Language Models | ||
| 280 | Zhou, Xinying; Li, Leixiao; Lin, Hao | DAGMP: A Multimodal Learning Approach Jointly Driven by Feature Fusion and Gradient Modulation | ||
| 282 | Miao, Guohua; Xie, Zhihua; Chang, Haolin; Tu, Chengyu | Spatial-Spectral Prior Guided Mamba Network for Hyperspectral Image Super-Resolution | ||
| 284 | Yang, Jiale; Zhao, Kai; Zhang, Linlin; Li, Qingguan | Boosting the Transferability of Adversarial Examples via Frequency Domain Masking and Adaptive Step Size | ||
| 288 | Chen, Honghui; Zhou, Fan; Wang, Ruomei; Zhao, Baoquan | V-HOI: Velocity-Aware Human-Object Interaction Generation | ||
| 293 | ao, yu; han, hongze; li, yuqin; miao, yu; shi, weili | LGF-Net: Integrating Local and Global Features in a Dual-Branch Architecture for Tooth Segmentation in CBCT Images | ||
| 294 | Li, Feng; Wu, Ke; Li, Yongwei | MCN-CL: Multimodal Cross-Attention Network and Contrastive Learning for Multimodal Emotion Recognition | ||
| 299 | Wei, GuoDong; Yu, Jiayu; Ao, Yu; Li, YuQin; Guan, YuanYuan; Shi, WeiLi; Miao, Yu; Jiang, ZhenGang | CTDiff: A Lightweight Hybrid Diffusion Network for Low-Light Endoscopic Image Enhancement | ||
| 300 | Xu, Tianshi; Sun, Zhengzheng; Hu, Yizheng; Shang, Junyuan; Wu, Si | Hierarchical Cross-Modality Interaction for Unified Video-Text Retrieval Modeling | ||
| 302 | Ma, Penghao; Wei, Guangcun; Kong, Chuike; Li, Shuo; Fang, Jianfeng | SE-EEND: A Structurally Enhanced End-to-End Neural Diarization System | ||
| 304 | Liu, Wenzheng; Yuan, Ming; Wang, Yizhou; Shen, Lianghao; Wang, Xiaofeng; Xing, Qianqian; Cao, Ronghui; Tang, Xiaoyong; Deng, Tan; Fu, Cheng | SPADE: Attention-Guided Split Diffusion for Precise Spatial Control in Interior Layout Image Generation | ||
| 309 | Xu, Feifei; Zhu, Wenjing; Li, Dongyang; Li, Puzhe | Question-Aware Spatial-Temporal Reasoning in Patch for Audio-Visual Question Answering | ||
| 310 | Wang, Haoyang; Liu, Liming; Zhang, Xinggong | R^2-Mesh: Reinforcement Learning Powered Mesh Reconstruction via Geometry and Appearance Refinement | ||
| 313 | Zhang, Wenli; Zhu, Dali; Zeng, Hualin; Yang, Long | TrackPhys: Learning Transferable Physiological Representations for Motion-Robust Heart Rate and 3D Mask Attack Detection | ||
| 317 | He, Zefeng; Hua, Yong; Yang, Xuan | Epistemic Uncertainty Guided Bayesian Neural Network for Cardiac Image Registration | ||
| 318 | Liang, Yun; Luo, Tang; Chen, Zhichao; Zhong, Cankun | TF-AttNet: An Efficient Time-Frequency Structure Modeling For Low-Complexity Acoustic Scene Classification | ||
| 319 | WANG, YONGXIANG; Zhou, Gang | SPDGS: Spatial Pruning and Depth Priors for Sparse-View 3D Gaussian Splatting | ||
| 321 | Zhang, Xuan; Li, Wenjing; Liu, Zhiqiang; Hu, Zhipeng; Liu, Na | SMLA-YOLO: Efficient Multiscale Small Defect Detection in Wind Turbine Blades via Dynamic Feature Calibration | ||
| 324 | Sun, Hao; Liu, Yue; Yu, Peiqi; Fan, Kexuan | SFL-Net: Synergistic Spatial-Frequency Learning for Medical Image Segmentation | ||
| 326 | Zhao, Faqi; Cheng, Dingxin; Chen, Xuanda; Su, Kun | Multi-view Interaction Network with Guided Contrastive Learning for Multimodal Summarization | ||
| 328 | Wang, Dongsheng; Zhu, Yuan; Wang, Yifei | PGS-YOLO: A Lightweight and Accurate Framework for Aerial Small Object Detection in Urban Environments | ||
| 330 | Shi, Qifeng; Zhang, Yan | DRA-YOLO: Dynamic Receptive-Field Attention and Dual-Gated Upsampling Module Model for Aerial Object Detection | ||
| 332 | Yang, Xing Yao; Jia, Meng kun | Dynamic Spectral Fusion and Causal Graph Propagation for Multimodal Recommendation | ||
| 334 | Wang, Quan; Ibrayim, Mayire | MAF3Net: Multiscale Attention and Frequency Domain Feature Fusion for Oracle Cross-Domain Recognition Networks | ||
| 341 | Gao, Han; Ning, Tao; Duan, Xiaodong; Wan, Xiaochun; Chen, Lvzuo; Chen, Jingsong; Wang, Yuangang | Multimodal Personality Trait Recognition via Spatiotemporal Modeling and Dual-Stage Fusion | ||
| 343 | Ao, Yu; Liu, Chengji; Ye, Jida; Li, YuQin; Miao, Yu; SHi, Weili | WES2P: Wavelet-Enhanced SAM2 for Automatic Polyp Segmentation | ||
| 347 | Zhou, Ziyu; Lei, Xia; Zhang, Linlin; Fan, Yongkai | FACER: Evaluating and Enhancing Explainable Feature-Level Robustness via Causal Effects | ||
| 349 | Xu, Feifei; Li, Puzhe; Li, Dongyang; Huang, Luobin; Zhu, Wenjing | Text-Driven Hybrid Curriculum Learning for Multimodal Sentiment Analysis | ||
| 351 | Liu, Guozhen; Yu, JiaMing; Tan, PanLong; Zhang, XiaoYu; Sun, Qinglin; Sun, Hao | SK2-eNeRF: Event-Driven Neural Radiance Fields for 3D Reconstruction in Dynamic and Blurry Scenes | ||
| 352 | Li, Yuqin; Qi, Kailun; Dai, Bofan; Ao, Yu; Shi, Weili | Study on Path Planning of Acupuncture Treatment for Allergic Rhinitis | ||
| 358 | Zhou, Aoxiang; Wu, Hao; Peng, Kaichen; Liu, Peng; Li, Xianxian | Improving Few-shot Multi-modal Aspect-Level Sentiment Classification with Implicit In-context Learning | ||
| 362 | Zheng, Kaihong; Sun, Lingyun; Cui, Yanan; Guo, Lili; Zhang, Jian | Graph Dynamic Fusion Network with Contrastive Learning for Multimodal Emotion Recognition in Conversation | ||
| 366 | Yang, Tianchi; Liu, Shenling; Wu, Shihong; Jiang, Wenhao; Luo, Yuchuan; Liu, Lin; Fu, Shaojing | FAHALE: Federated Asynchronous Heterogeneity-Adaptive Learning with Output Estimation | ||
| 367 | Guo, Jiacai; Xu, Zili; Luo, Jianjie; Lee, Lap-Kei; Wang, Fu Lee; Yang, Zhenguo | FANB-Net: Frequency-Awared Attention and Noise-injected Boosting for AI-generated Image Detection | ||
| 369 | hu, yunteng; zhao, hongyan; ke, zunwang | HiCP-SS: Hierarchical Two-Level Prototype Copy-Paste for Semi-Supervised Medical Segmentation | ||
| 372 | Dong, Chunling; Xu, Hang | A Hierarchical Multi-Scale Attention Enhancement Method for Micro-Expression Recognition: from Channel Modulation to Spatial-Scale Fusion | ||
| 375 | Liang, Fang; Guo, Jianshu; Wen, Xuexiang; Hu, Wenhao; Ye, Xiang; Wang, Gaoang | GauScene: Physically Plausible Scene Generation via Language-Guided 3D Gaussian Interaction | ||
| 380 | Wei, Xinyue; Pang, Weiguang; Wang, Changwei; Fu, Kexue; Qu, Youyang; Gao, Longxiang | SAMFF: A Semantic-Guided Zero-Shot Multi-Focus Image Fusion Framework | ||
| 386 | peng, yimin; yan, xu; cao, ziqiang | M3ITR: Modeling Many-to-Many Relationships for Robust Image-Text Retrieval | ||
| 387 | Liu, Yanfei; Xu, Miaosen; Shi, Youchang; Zou, Zheng; Li, yuanqian; Wen, Hao | MCPDS-CMNet: A Multi-Conditional Prior-guided Dual Spiral CNN-Mamba Network for Face Sketch-Photo Synthesis | ||
| 389 | Ren, Xuena | SRSG-AReID: Self-Rewarded Shuffle Grouping for Robust Aerial Person Re-Identification | ||
| 394 | Qiu, Weibin; Wu, Suping; Xu, Hao; Yang, Jie; Zhang, Xiang | ZC-MVSNet: Zero-Sum Convolution and Prior Fusion for Multi-View Stereo | ||
| 398 | Zhao, Yawei; Wumaier, Aishan; Guo, Xueliang; Lv, Yaxuan | A Multi-aspect Multi-granularity Pronunciation Assessment Method Based on Multi-feature Fusion and Transformer Encoder | ||
| 402 | Zhu, Xinya; Li, Wenqiang; Qiao, Mengyu; Yang, Zhihui; Wang, Yang | MOSS: Multi-modal Source Separation for Music Deepfake Detection | ||
| 403 | Zhang, Tianyun; Zou, Binfeng; Zhang, Xiaoshuai; Zhang, Guangyuan; Huang, Zhao; Liu, Jin; Zheng, Zhiwen; Huang, Xingru | Graph-Dynamics Augmented Foundation Model for Surgical Instrument Segmentation | ||
| 405 | Ma, Qiang; Wu, Suping; Yang, Sheng; Qiu, Weibin; Liu, Feng; Jin, Zhaocheng; Xu, Hao | HiAvatar: High-Fidelity Animatable Head Avatar | ||
| 409 | Yao, Yuheng; Zhuang, Liansheng; Long, Xiao; Wang, Shafei | Exploring the Polysemy of Relations in Path Reasoning for Inductive Knowledge Graph Completion | ||
| 150 | Xue, Song lin; Zhang, Hong | Towards Robust and Secure Cross-Domain Face Anti-Spoofing via Feature-Boundary Consistency | ||
| 271 | xue, song; du, jiayu; huang, wei; yang, qiulong | A Backdoor Attack via Fixed Patch Triggers in Frequency Domain | ||
| 296 | Guo, Yusheng; Hao, Qiang; Liu, Zhao; Wu, Qingshuang; Lu, Yanliang; Gu, Ming; Hu, Su | Backdoor-based Protection Architecture for Model Functional Services | ||
| 400 | Zhu, Jin; Du, Jiayu; Zhang, Fan; Chen, Xin; Zhou, Zhizhong | Enhancing Ensemble Adversarial Defense via Dataset Orthogonal Decomposition | ||
| 303 | 昊凯, 徐; 登实, 李 | Emotion-aware Multi-modal Fusion for Human Behavior Analysis via Graph and State-space Modeling | ||
| Demo Session (Session 9a) | 31.1. 11:00 - 12:20 | 206 | Rossetto, Luca; Ruosch, Florian | Beyond the Blob: Demonstrating MeGraS for Multimodal Knowledge Graph Interaction |
| 416 | Bui, Quoc-Anh; Lim, Serhane; Boudard, Tom; Rougeron, Gilles; Gasparini, Simone; Morin, Géraldine | XROI-GS: Real-time XR Interactive Inspection of High-quality Objects of Interest in a 3D Gaussian Splats Scene | ||
| 417 | Hamanaka, Masatoshi | AI-based Composition Tools for Composing A School Song With Student Participation | ||
| 418 | Baglanova, Aidana; Babakhojayeva, Zilola; Fakhrutdinov, Nail; Kalimzhanov, Ruslan; Azirakhmet, Umit; Yatbaz, Hakan Yekta; Yazici, Adnan | ZhadigerAI: Software as a Service AI Platform for Kazakh and English Languages | ||
| 422 | Fernandez Roblero, Jaime Boanerjes; Syed, Ali Akbar Shah; Ali, Muhammad Intizar | Ask VR: Vision Language Model Driven Scene Descriptor for Blind and Low Vision Users in VR Environment | ||
| 424 | Kostka de Sztemberg, Berenika Nawoja; Żywica, Patryk | A Multimedia Pipeline for Interactive Game Archaeology on the Example of Wolfenstein 3D in Godot Engine | ||
| 428 | Pham, Minh-Anh; Luu, Duc-Tuan; Skivdal, Johannes; Dang-Nguyen, Duc-Tien | Past Forward: Exif Revisited | ||
| 430 | Vats, Shivi; Timmerer, Christian; Hellwagner, Hermann | STEP-MR: A Subjective Testing and Eye-Tracking Platform for Dynamic Point Clouds in Mixed Reality | ||
| VBS Sessions & Welcome reception | 29.1. 13:00 - 16:10 (private) & 29.1. 17:00 - 19:30 (public) | 411 | Huynh, Viet-Tham; Le-Hinh, Nhut-Thanh; Nguyen-Ho, Thang-Long; Nguyen, Trong-Thuan; Gurrin, Cathal; Nguyen, Tam V.; Tran, Minh-Triet | TapesVRy: Immersive Panoramic Exploration in Large-Scale Video Retrieval |
| 412 | Le, Huy M.; Nguyen, Tien Dat; Nguyen, Phuc Binh; Le Tran, Gia Bao; Truong Thien, Phu; Dinh, Cuong; Nguyen, Nga T.N.; Nguyen, Thuy T. N.; Ngo, Huy Gia; Nguyen, Tan Nhat; Nguyen, Binh T. | Fusionista2.0: Efficiency Retrieval System for Large-Scale Datasets | ||
| 413 | Ueki, Kazuya; Muto, Ryo; Wada, Takuya; Akaba, Ryota; Zhang, Guannan | U-Cker at VBS2026: A Web-Based Interactive Video Retrieval System with Multimodal Query Support | ||
| 414 | Cheng, Yu-Tong; Nguyen, Phuong-Anh; Kha, Kim-Thuy; Ngo, Chong-Wah | VIREO @ Video Browser Showdown 2026 | ||
| 419 | Ho-Le, Minh-Quan; Ho, Duy-Khang; Ninh, Tu V.; Gurrin, Cathal; Tran, Minh-Triet | From Expert Practices to Intelligent Agents: Autonomy in Interactive Video Retrieval | ||
| 420 | Geller, Andrina; Arnold, Rahel; Waltenspül, Raphael; Schuldt, Heiko | Extending vitrivr-engine with Emotion-Based Retrieval and a Modular User Interface | ||
| 421 | Arnold, Rahel; Pietzak, Anna; Schuldt, Heiko | MediaMix: Multimedia Retrieval with Dual Backend Support and Result Exploration in MR | ||
| 423 | Nguyen-Ho, Thang-Long; Huynh, Viet-Tham; Tran, Allie; Tran, Minh-Triet; Gurrin, Cathal; Healy, Graham | H-EAGLE: Hierarchical Extension of EAGLE for Multi-Level Semantic Video Retrieval | ||
| 425 | Pantelidis, Nick; Kosmidou, Eleni; Galanopoulos, Damianos; Georgalis, Dimitris; Pasios, Stefanos; Apostolidis, Konstantinos; Goulas, Andreas; Pegia, Maria; Tsionkis, Georgios; Gkountakos, Konstantinos; Kouvrakis, Grigorios; Moumtzidou, Anastasia; Gialampoukidis, Ilias; Vrochidis, Stefanos; Mezaris, Vasileios; Kompatsiaris, Ioannis | VERGE in VBS 2026 | ||
| 426 | Jäckl, Bastian; Verner, Benjamin; Stroh, Michael; Kloda, Vojtech; Nagy, Ladislav; Deussen, Oliver; Keim, Daniel A.; Lokoc, Jakub | PraK V4 at the Video Browser Showdown 2026 | ||
| 427 | Tran, Bao; Do, Tien; Duc Ngo, Thanh; Le, Duy-Dinh; Satoh, Shin’ichi | NII-UIT at VBS2026: Towards Effective Visual Question Answering for Interactive and Multimodal Video Retrieval | ||
| 429 | Khan, Omar Shahbaz; Sharma, Ujjwal; Marcelino, Gonçalo; Rudinac, Stevan; Jónsson, Björn Þór | Exquisitor at the Video Browser Showdown 2026: Temporal Queries Revisited |
Keynotes

Jiri Matas is the head of the Visual Recognition Group at the Center for Machine Perception, Department of Cybernetics, Czech Technical University in Prague. He holds a PhD degree from the University of Surrey, UK (1995). He has published more than 300 papers that have been cited about 74,000 times according to Google Scholar (h-index = 99). He received the best paper prize at the British Machine Vision Conference in 2002, 2005 and 2022, at the Asian Conference on Computer Vision in 2007, and at the Int. Conf. on Document Analysis and Recognition in 2015. J. Matas served as a programme or general chair at ECCV 2004, 2016, 2022 and CVPR 2007 and 2022. He is an Editor-in-Chief of the International Journal of Computer Vision and was an Associate Editor-in-Chief of IEEE T. Pattern Analysis and Machine Intelligence. He is on the computer science panel of the ERC.
His research interests include visual tracking, object recognition, image matching and retrieval, sequential pattern recognition, and RANSAC-type optimization methods.
He has co-founded two companies, Eyedea Recognition (computer vision) and Locksley (combinatorial optimization).
Marcel Worring: Multimedia Analytics in the Foundation Model Era

Abstract: Multimedia Analytics was introduced in 2010 by Chinchor et al. as the combination of multimedia analysis and visual analytics. In 2014 Zahalka et al. developed a model for Multimedia Analytics, while at the same time Sacha et al. developed a visual analytics model connecting interactive processes with human cognition. Until recently, these two fields have developed more or less in parallel. Foundation models have changed the playing field completely. They combine multimedia analysis with reasoning and generation capabilities. Still, for a lot of tasks, humans outperform AI, and Human-AI teaming is the best way forward. Yet the main communication channel between humans and AI is a textual prompt. Clearly, Visual Analytics could yield significant improvements here. In this talk, we will consider the historical developments in multimedia analysis and visual analytics and present a new multimedia / visual analytics model suited for the foundation model era. We show the value of the model by taking a number of existing systems as case studies, using the model to describe them, and suggesting improvements based on the model characteristics.
Bio: Marcel Worring is a full professor in the Informatics Institute of the University of Amsterdam. He leads the MultiX group, which researches multimedia analytics techniques for getting the richest information possible from data through AI algorithms, interactions, and interfaces, surpassing human and machine intelligence for applications and social impact in public health, forensics and law enforcement, cultural heritage, and data-driven business. He has been an associate editor of ACM TOMCCAP, IEEE Transactions on Multimedia, and IEEE Multimedia, organized ACM Multimedia 2016 and MMM2024, and will organize ICMR2026. He is a fellow of ELLIS, the European Laboratory for Learning and Intelligent Systems, and co-founder of the Innovation Center for Artificial Intelligence.
Giuseppe Amato: Synergy between Extended Reality and Artificial Intelligence

Abstract: Extended Reality (XR) builds on Augmented Reality (AR) and Mixed Reality (MR), which themselves extend the foundations of Virtual Reality (VR). VR immerses users in fully digital environments through headsets, smart devices, or computer screens, but it isolates them from the physical world. AR and MR, instead, overlay digital content onto the real environment. Virtual objects are aligned, fused, and synchronized with physical surroundings, and in MR users can interact with both physical and virtual elements in a unified space.
XR goes even further. It is not limited to visual augmentation; it enhances the realism of interactions across multiple sensory dimensions. With XR, users can feel virtual objects—their weight, their temperature, even their texture and consistency—bringing the physical and digital worlds into deeper, more natural continuity.
Artificial Intelligence plays a crucial role in enabling this integration. AI can enrich virtual scenes with semantic information, reconstruct and interpret 3D environments, correct and enhance data during 3D digitization, support the creation of virtual worlds from physical ones, and facilitate seamless interaction across both realms.
In this keynote, we will explore how these challenges have been addressed within the SUN project (Social and hUman ceNtered XR – https://www.sun-xr-project.eu/) and discuss the current limitations and emerging opportunities in this rapidly evolving field.
Bio: Dr. Giuseppe Amato is a research director at CNR-ISTI in Pisa, where he leads the "Artificial Intelligence for Multimedia and Humanities" laboratory (AIMH - http://aimh.isti.cnr.it/ ). He was awarded a PhD in Computer Science at the University of Dortmund, Germany, in 2002. His main research interests are artificial intelligence, extended reality, content-based retrieval of multimedia documents, access methods for similarity search, and smart camera networks. He has published more than 200 papers in peer-reviewed international journals and conferences in the areas of artificial intelligence and multimedia information retrieval. He has participated in several EC-funded and nationally funded research actions in the areas of Artificial Intelligence, Multimedia Information Retrieval, Computer Vision, Extended Reality, Robotics, and Cultural Heritage, and he currently coordinates the Social and hUman ceNtered XR (SUN) project (https://www.sun-xr-project.eu/) funded by the Horizon Europe Research & Innovation Programme. More at http://aimh.isti.cnr.it/giuseppeamato/