Paper Digest: ACM Multimedia 2024 Papers & Highlights
Interested users can choose to read all MM-2024 papers in our digest console, which supports more features.
To search for papers presented at MM-2024 on a specific topic, use the search by venue (MM-2024) service. To summarize the latest research published at MM-2024 on a specific topic, use the review by venue (MM-2024) service. To synthesize the findings from MM-2024 into comprehensive reports, try MM-2024 Research. If you are interested in browsing papers by author, we also provide a comprehensive list of all MM-2024 authors & their papers.
This curated list is created by the Paper Digest Team. Experience the cutting-edge capabilities of Paper Digest, an innovative AI-powered research platform that delivers personalized and comprehensive updates on the latest research in your field. It also empowers you to read and write articles, get answers, conduct literature reviews, and generate research reports.
Experience the full potential of our services today!
TABLE 1: Paper Digest: ACM Multimedia 2024 Papers & Highlights
| # | Paper | Author(s) |
|---|---|---|
| 1 | WeakSAM: Segment Anything Meets Weakly-supervised Instance-level Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper introduces WeakSAM, which solves weakly-supervised object detection (WSOD) and segmentation by utilizing the pre-learned world knowledge contained in a vision foundation model, i.e., the Segment Anything Model (SAM). |
Lianghui Zhu; Junwei Zhou; Yan Liu; Xin Hao; Wenyu Liu; Xinggang Wang; |
| 2 | Decoding Urban Industrial Complexity: Enhancing Knowledge-Driven Insights Via IndustryScopeGPT Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper introduces IndustryScopeKG, a pioneering large-scale multi-modal, multi-level industrial park knowledge graph, which integrates diverse urban data including street views, corporate, socio-economic, and geospatial information, capturing the complex relationships and semantics within industrial parks. Alongside this, we present the IndustryScopeGPT framework, which leverages Large Language Models (LLMs) with Monte Carlo Tree Search to enhance tool-augmented reasoning and decision-making in Industrial Park Planning and Operation (IPPO). |
Siqi Wang; Chao Liang; Yunfan Gao; Yang Liu; Jing Li; Haofen Wang; |
| 3 | Zero-Shot Controllable Image-to-Video Animation Via Motion Decomposition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a new challenging task called Zero-Shot Controllable Image-to-Video Animation, where the goal is to animate an image based on motion trajectories defined by the user, without fine-tuning the base model. |
Shoubin Yu; Jacob Zhiyuan Fang; Jian Zheng; Gunnar Sigurdsson; Vicente Ordonez; Robinson Piramuthu; Mohit Bansal; |
| 4 | GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present GPT4Video, a unified framework that seamlessly and lightly integrates with LLMs, visual feature extractors, and stable diffusion generative models for cohesive video understanding and generation. |
Zhanyu Wang; Longyue Wang; Zhen Zhao; Minghao Wu; Chenyang Lyu; Huayang Li; Deng Cai; Luping Zhou; Shuming Shi; Zhaopeng Tu; |
| 5 | Tango 2: Aligning Diffusion-based Text-to-Audio Generations Through Direct Preference Optimization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: As such, in this work, using an existing text-to-audio model Tango, we synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from. |
Navonil Majumder; Chia-Yu Hung; Deepanway Ghosal; Wei-Ning Hsu; Rada Mihalcea; Soujanya Poria; |
| 6 | VoiceTuner: Self-Supervised Pre-training and Efficient Fine-tuning For Voice Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose VoiceTuner, with a self-supervised pre-training and efficient fine-tuning approach for low-resource voice generation. |
Rongjie Huang; Yongqi Wang; Ruofan Hu; Xiaoshan Xu; Zhiqing Hong; Dongchao Yang; Xize Cheng; Zehan Wang; Ziyue Jiang; Zhenhui Ye; Luping Liu; Siqi Zheng; Zhou Zhao; |
| 7 | T2I-Scorer: Quantitative Evaluation on Text-to-Image Generation Via Fine-Tuned Large Multi-Modal Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In our study, we introduce the T2I-Scorer, a novel two-stage training methodology aimed at fine-tuning LMMs for T2I evaluation. |
Haoning Wu; Xiele Wu; Chunyi Li; Zicheng Zhang; Chaofeng Chen; Xiaohong Liu; Guangtao Zhai; Weisi Lin; |
| 8 | Parameter-Efficient Complementary Expert Learning for Long-Tailed Visual Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To unleash the intrinsic representation capability of pretrained foundation models, in this work, we propose a new Parameter-Efficient Complementary Expert Learning (PECEL) for LTR. |
Lixiang Ru; Xin Guo; Lei Yu; Yingying Zhang; Jiangwei Lao; Jian Wang; Jingdong Chen; Yansheng Li; Ming Yang; |
| 9 | HS-Surf: A Novel High-Frequency Surface Shell Radiance Field to Improve Large-Scale Scene Rendering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Additionally, most models do not consider the drastic changes in distances. To address these issues, we propose a novel high-frequency surface shell radiance field, which uses depth-guided information to create a shell enveloping the scene surface under the current view, and then samples conic frustums on this shell to render high-frequency textures. |
Jiongming Qin; Fei Luo; Tuo Cao; Wenju Xu; Chunxia Xiao; |
| 10 | Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To establish an audio dataset with high-quality captions, we propose an innovative, automatic approach leveraging multimodal inputs, such as video frames and audio streams. |
Luoyi Sun; Xuenan Xu; Mengyue Wu; Weidi Xie; |
| 11 | GraphLearner: Graph Node Clustering with Fully Learnable Augmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, the augmentation samples in existing methods are always predefined by human experience and agnostic to the downstream clustering task, thus leading to high human resource costs and poor performance. To overcome these limitations, we propose Graph Node Clustering with Fully Learnable Augmentation, termed GraphLearner. |
Xihong Yang; Erxue Min; Ke Liang; Yue Liu; Siwei Wang; Sihang Zhou; Huijun Wu; Xinwang Liu; En Zhu; |
| 12 | RefMask3D: Language-Guided Transformer for 3D Referring Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose RefMask3D to explore the comprehensive multi-modal feature interaction and understanding. |
Shuting He; Henghui Ding; |
| 13 | CustomNet: Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Others train an extra encoder to extract object visual information for customization efficiently but struggle to preserve the object’s identity. To address these limitations, we present CustomNet, a unified encoder-based object customization framework that explicitly incorporates 3D novel view synthesis capabilities into the customization process. |
Ziyang Yuan; Mingdeng Cao; Xintao Wang; Zhongang Qi; Chun Yuan; Ying Shan; |
| 14 | FiLo: Zero-Shot Anomaly Detection By Fine-Grained Description and High-Quality Localization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Additionally, computing feature similarities for single patches struggles to pinpoint specific locations of anomalies with various sizes and scales. To address these issues, we propose a novel ZSAD method called FiLo, comprising two components: adaptively learned Fine-Grained Description (FG-Des) and position-enhanced High-Quality Localization (HQ-Loc). |
Zhaopeng Gu; Bingke Zhu; Guibo Zhu; Yingying Chen; Hao Li; Ming Tang; Jinqiao Wang; |
| 15 | GIST: Improving Parameter Efficient Fine-Tuning Via Knowledge Interaction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: These oversights lead to insufficient utilization of knowledge and suboptimal performance. To address these issues, we propose a novel fine-tuning framework, named GIST, that can be seamlessly integrated into the current PEFT methods in a plug-and-play manner. |
Jiacheng Ruan; Jingsheng Gao; Mingye Xie; Suncheng Xiang; Zefang Yu; Ting Liu; Yuzhuo Fu; Xiaoye Qu; |
| 16 | Embracing Adaptation: An Effective Dynamic Defense Strategy Against Adversarial Examples Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We argue that to counter more formidable attacks, models should continually adapt to various attack methods. |
Shenglin Yin; Kelu Yao; Zhen Xiao; Jieyi Long; |
| 17 | Inferring 3D Occupancy Fields Through Implicit Reasoning on Silhouette Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Instead, we propose to use implicit reasoning, that is, we reason directly on the implicit occupancy field without explicit rendering. |
Baorui Ma; Yu-Shen Liu; Matthias Zwicker; Zhizhong Han; |
| 18 | DRMF: Degradation-Robust Multi-Modal Image Fusion Via Composable Diffusion Prior Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present Degradation-Robust Multi-modality image Fusion (DRMF), leveraging the powerful generative properties of diffusion models to counteract various degradations during image fusion. |
Linfeng Tang; Yuxin Deng; Xunpeng Yi; Qinglong Yan; Yixuan Yuan; Jiayi Ma; |
| 19 | PerFRDiff: Personalised Weight Editing for Multiple Appropriate Facial Reaction Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose the first online personalised multiple appropriate facial reaction generation (MAFRG) approach which learns a unique personalised cognitive style from the target human listener’s previous facial behaviours and represents it as a set of network weight shifts. |
Hengde Zhu; Xiangyu Kong; Weicheng Xie; Xin Huang; Linlin Shen; Lu Liu; Hatice Gunes; Siyang Song; |
| 20 | Similarity Preserving Transformer Cross-Modal Hashing for Video-Text Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In addition, effectively exploiting the multi-modal structure is a remarkable challenge owing to the complex nature of video and text. To address the above issues, we propose Similarity Preserving Transformer Cross-Modal Hashing (SPTCH), a new unsupervised deep cross-modal hashing method for video-text retrieval. |
Qianxin Huang; Siyao Peng; Xiaobo Shen; Yun-Hao Yuan; Shirui Pan; |
| 21 | Towards Open-vocabulary HOI Detection with Calibrated Vision-language Models and Locality-aware Queries Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, the absence of novel human-object position distributions often leads to overfitting on the base categories within their learned queries. To address these issues, we propose a two-step framework named CaM-LQ, Calibrating visual-language Models (e.g., CLIP) for open-vocabulary HOI detection with Locality-aware Queries. |
Zhenhao Yang; Xin Liu; Deqiang Ouyang; Guiduo Duan; Dongyang Zhang; Tao He; Yuan-Fang Li; |
| 22 | EMVCC: Enhanced Multi-View Contrastive Clustering for Hyperspectral Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: At the same time, the data representation via self-supervised contrastive loss is not specifically designed for clustering. Thus, to tackle this challenge, we propose a novel multi-view clustering method, i.e., Enhanced Multi-View Contrastive Clustering (EMVCC). |
Fulin Luo; Yi Liu; Xiuwen Gong; Zhixiong Nan; Tan Guo; |
| 23 | Frequency-Aware GAN for Imperceptible Transfer Attack on 3D Point Clouds Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we focus on designing a transfer-based black-box attack method, called Transferable Frequency-aware 3D GAN, to delve into achieving a high black-box ASR by improving the adversarial transferability while making the adversarial samples more imperceptible. |
Xiaowen Cai; Yunbo Tao; Daizong Liu; Pan Zhou; Xiaoye Qu; Jianfeng Dong; Keke Tang; Lichao Sun; |
| 24 | DGMamba: Domain Generalization Via Generalized State Space Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a novel framework for DG, named DGMamba, that excels in strong generalizability toward unseen domains and meanwhile has the advantages of global receptive fields, and efficient linear complexity. |
Shaocong Long; Qianyu Zhou; Xiangtai Li; Xuequan Lu; Chenhao Ying; Yuan Luo; Lizhuang Ma; Shuicheng Yan; |
| 25 | Leveraging Weak Cross-Modal Guidance for Coherence Modelling Via Iterative Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Despite the effectiveness, labeled associated coherency information is not always available and might be costly to acquire, making the cross-modal guidance hard to leverage. To tackle this challenge, this paper explores a new way to take advantage of cross-modal guidance without gold labels on coherency, and proposes the Weak Cross-Modal Guided Ordering (WeGO) model. |
Yi Bin; Junrong Liao; Yujuan Ding; HaoXuan Li; Yang Yang; See-Kiong Ng; Heng Tao Shen; |
| 26 | GalleryGPT: Analyzing Paintings with Large Multimodal Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To facilitate the research progress, in this paper, we step further to compose comprehensive analysis inspired by the remarkable perception and generation ability of large multimodal models. |
Yi Bin; Wenhao Shi; Yujuan Ding; Zhiqiang Hu; Zheng Wang; Yang Yang; See-Kiong Ng; Heng Tao Shen; |
| 27 | Q-Ground: Image Quality Grounding with Large Multi-modality Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we introduce Q-Ground, the first framework aimed at tackling fine-scale visual quality grounding by combining large multi-modality models with detailed visual quality analysis. |
Chaofeng Chen; Sensen Yang; Haoning Wu; Liang Liao; Zicheng Zhang; Annan Wang; Wenxiu Sun; Qiong Yan; Weisi Lin; |
| 28 | Heterogeneous Graph Guided Contrastive Learning for Spatially Resolved Transcriptomics Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we construct a heterogeneous Graph guided Contrastive Learning (stGCL) for aggregating spatial transcriptomics data. |
Xiao He; Chang Tang; Xinwang Liu; Chuankun Li; Shan An; Zhenglai Li; |
| 29 | ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Existing methods demonstrate a significant performance drop when there are multiple distractions in an image, indicating an insufficient understanding of the fine-grained semantics and spatial relationships between objects. In this paper, we propose a novel approach, the Relation and Semantic-sensitive Visual Grounding (ResVG) model, to address this issue. |
Minghang Zheng; Jiahua Zhang; Qingchao Chen; Yuxin Peng; Yang Liu; |
| 30 | Unraveling Motion Uncertainty for Local Motion Deblurring Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To that end, we propose a novel method named Motion-Uncertainty-Guided Network (MUGNet), which harnesses a probabilistic representational model to explicitly address the intricacies stemming from motion uncertainties. |
Zeyu Xiao; Zhihe Lu; Michael Bi Mi; Zhiwei Xiong; Xinchao Wang; |
| 31 | P-BiC: Ultra-High-Definition Image Moiré Patterns Removal Via Patch Bilateral Compensation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel patch bilateral compensation network (P-BiC) for moiré pattern removal in UHD images, which is memory-efficient and prior-knowledge-based. |
Zeyu Xiao; Zhihe Lu; Xinchao Wang; |
| 32 | MPLUG-PaperOwl: Scientific Diagram Analysis with The Multimodal Large Language Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: By aligning diagrams in the paper with related paragraphs, we construct professional diagram analysis samples for training and evaluation. |
Anwen Hu; Yaya Shi; Haiyang Xu; Jiabo Ye; Qinghao Ye; Ming Yan; Chenliang Li; Qi Qian; Ji Zhang; Fei Huang; |
| 33 | CIRP: Cross-Item Relational Pre-training for Multimodal Product Bundling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Multimodal pre-train models could be the potential solutions given their promising performance on various multimodal downstream tasks. |
Yunshan Ma; Yingzhi He; Wenjun Zhong; Xiang Wang; Roger Zimmermann; Tat-Seng Chua; |
| 34 | GeoFormer: Learning Point Cloud Completion with Tri-Plane Integrated Transformer Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce a GeoFormer that simultaneously enhances the global geometric structure of the points and improves the local details. |
Jinpeng Yu; Binbin Huang; Yuxuan Zhang; Huaxia Li; Xu Tang; Shenghua Gao; |
| 35 | JoReS-Diff: Joint Retinex and Semantic Priors in Diffusion Model for Low-light Image Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we propose JoReS-Diff, a novel approach that incorporates Retinex- and semantic-based priors as the additional pre-processing condition to regulate the generating capabilities of the diffusion model. |
Yuhui Wu; Guoqing Wang; Zhiwen Wang; Yang Yang; Tianyu Li; Malu Zhang; Chongyi Li; Heng Tao Shen; |
| 36 | Attentive Linguistic Tracking in Diffusion Models for Training-free Text-guided Image Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce VICTORIA, a novel approach that augments TIE by incorporating linguistic knowledge into the manipulation of attention maps during image generation. |
Bingyan Liu; Chengyu Wang; Jun Huang; Kui Jia; |
| 37 | Automatic and Aligned Anchor Learning Strategy for Multi-View Clustering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, it is not reasonable to assume an identical number of anchors across all views, as this assumption restricts the representational capacity of anchors in individual views. To address the above issues, we propose a view-adaptive anchor multi-view clustering method called Multi-view Clustering with Automatic and Aligned Anchor (3AMVC). |
Huimin Ma; Siwei Wang; Shengju Yu; Suyuan Liu; Jun-Jie Huang; Huijun Wu; Xinwang Liu; En Zhu; |
| 38 | A Lightweight Anchor-Based Incremental Framework for Multi-view Clustering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose A Lightweight Anchor-Based Incremental Framework for Multi-view Clustering. |
Qian Qu; Xinhang Wan; Weixuan Liang; Jiyuan Liu; Yu Feng; Huiying Xu; Xinwang Liu; En Zhu; |
| 39 | MAG-Edit: Localized Image Editing in Complex Scenarios Via Mask-Based Attention-Adjusted Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose MAG-Edit, a plug-and-play, inference-stage optimization method, that empowers attention-based editing approaches, such as P2P, to enhance localized image editing in intricate scenarios. |
Qi Mao; Lan Chen; Yuchao Gu; Zhen Fang; Mike Zheng Shou; |
| 40 | MM-Forecast: A Multimodal Approach to Temporal Event Forecasting with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To bridge this gap, we are particularly interested in two key questions of: 1) why images will help in temporal event forecasting, and 2) how to integrate images into the LLM-based forecasting framework. To answer these research questions, we propose to identify two essential functions that images play in the scenario of temporal event forecasting, i.e., highlighting and complementary. |
Haoxuan Li; Zhengmao Yang; Yunshan Ma; Yi Bin; Yang Yang; Tat-Seng Chua; |
| 41 | Real-time Parameter Evaluation of High-speed Microfluidic Droplets Using Continuous Spike Streams Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a real-time evaluation method for high-speed droplet parameters based on spike-based microfluidic flow-focusing, named RTDE, that integrates spike camera into the droplet collection system to efficiently capture information using spike stream. |
Bo Xiong; Changqing Su; Zihan Lin; Yanqin Chen; You Zhou; Zhen Cheng; Zhaofei Yu; Tiejun Huang; |
| 42 | SI-BiViT: Binarizing Vision Transformers with Spatial Interaction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we design a ViT binarization approach dubbed SI-BiViT to incorporate spatial interaction in the binarization process. |
Peng Yin; Xiaosu Zhu; Jingkuan Song; Lianli Gao; Heng Tao Shen; |
| 43 | TDSD: Text-Driven Scene-Decoupled Weakly Supervised Video Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we aim to address the challenge of scene-dependent weakly supervised video anomaly detection by decoupling scenes. |
Shengyang Sun; Jiashen Hua; Junyi Feng; Dongxu Wei; Baisheng Lai; Xiaojin Gong; |
| 44 | Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a refined taxonomy of hallucinations, featuring a new category: Event Hallucination. |
Chaoya Jiang; Hongrui Jia; Mengfan Dong; Wei Ye; Haiyang Xu; Ming Yan; Ji Zhang; Shikun Zhang; |
| 45 | HeroMaker: Human-centric Video Editing with Motion Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Simultaneously, some patterns on the human body appear intermittently throughout the video, posing a knotty problem in identifying visual correspondence. To address the above problems, we present HeroMaker. |
Shiyu Liu; Zibo Zhao; Yihao Zhi; Yiqun Zhao; Binbin Huang; Shuo Wang; Ruoyu Wang; Michael Xuan; Zhengxin Li; Shenghua Gao; |
| 46 | AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Specifically, to align MLLMs with human aesthetics perception, we construct a corpus-rich aesthetic critique database with 21,904 diverse-sourced images and 88K human natural language feedbacks, which are collected via progressive questions, ranging from coarse-grained aesthetic grades to fine-grained aesthetic descriptions. |
Yipo Huang; Xiangfei Sheng; Zhichao Yang; Quan Yuan; Zhichao Duan; Pengfei Chen; Leida Li; Weisi Lin; Guangming Shi; |
| 47 | FedSLS: Exploring Federated Aggregation in Saliency Latent Space Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a saliency latent space feature aggregation method (FedSLS) across federated clients. |
Hengyi Wang; Weiying Xie; Jitao Ma; Daixun Li; Yunsong Li; |
| 48 | AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: While most of the research efforts in this domain are focused on detecting high-quality deepfake images and videos, only a few works address the problem of the localization of small segments of audio-visual manipulations embedded in real videos. In this research, we emulate the process of such content generation and propose the AV-Deepfake1M dataset. |
Zhixi Cai; Shreya Ghosh; Aman Pankaj Adatia; Munawar Hayat; Abhinav Dhall; Tom Gedeon; Kalin Stefanov; |
| 49 | MetaEnzyme: Meta Pan-Enzyme Learning for Task-Adaptive Redesign Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Consequently, computational enzyme design is relatively overlooked within the broader protein domain and remains in its early stages. In this work, we address these challenges by introducing MetaEnzyme, a staged and unified enzyme design framework. |
Jiangbin Zheng; Han Zhang; Qianqing Xu; An-Ping Zeng; Stan Z. Li; |
| 50 | Causal-driven Large Language Models with Faithful Reasoning for Knowledge Question Answering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We instantiate this theory within the context of Knowledge Question Answering (KQA) by constructing a causal graph that delineates the pathways between the candidate knowledge and belief. Through the application of the do-calculus rules from structural causal models, we devise an unbiased estimation framework based on this causal graph, thereby establishing a methodology for knowledge modeling grounded in causal inference. |
Jiawei Wang; Da Cao; Shaofei Lu; Zhanchang Ma; Junbin Xiao; Tat-Seng Chua; |
| 51 | Stochastic Context Consistency Reasoning for Domain Adaptive Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To mitigate the problem, we propose a stochastic context consistency reasoning network with the self-training framework. |
Yiming Cui; Liang Li; Jiehua Zhang; Chenggang Yan; Hongkui Wang; Shuai Wang; Heng Jin; Li Wu; |
| 52 | Decoupling General and Personalized Knowledge in Federated Learning Via Additive and Low-rank Decomposition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, as these two types of parameters are put together like a jigsaw puzzle into a single model during the training process, each parameter may simultaneously absorb both general and client-specific knowledge, thus struggling to separate the two types of knowledge effectively. In this paper, we introduce FedDecomp, a simple but effective PFL paradigm that employs parameter additive decomposition to address this issue. |
Xinghao Wu; Xuefeng Liu; Jianwei Niu; Haolin Wang; Shaojie Tang; Guogang Zhu; Hao Su; |
| 53 | LoopGaussian: Creating 3D Cinemagraph with Multi-view Images Via Eulerian Motion Field Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We advance cinemagraph from 2D image space to 3-dimensional (3D) space with high quality by proposing LoopGaussian. |
Jiyang Li; Lechao Cheng; Zhangye Wang; Tingting Mu; Jingxuan He; |
| 54 | ReForm-Eval: Evaluating Large Vision Language Models Via Unified Re-Formulation of Task-Oriented Benchmarks Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To effectively leverage the annotations available and reduce the manual efforts required for constructing new benchmarks, we propose to re-formulate existing benchmarks into unified LVLM-compatible formats. |
Zejun Li; Ye Wang; Mengfei Du; Qingwen Liu; Binhao Wu; Jiwen Zhang; Chengxing Zhou; Zhihao Fan; Jie Fu; Jingjing Chen; Zhongyu Wei; Xuanjing Huang; |
| 55 | Observe Before Generate: Emotion-Cause Aware Video Caption for Multimodal Emotion Cause Generation in Conversations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing studies merely extract utterances from conversations as cause evidence, which is too coarse-grained to locate the exact causes from other modalities, especially those that may be reflected only in a specific video frame of an utterance. To address these limitations, we introduce a new task named Multimodal Emotion Cause Generation in Conversations (MECGC), which aims to generate an abstractive summary clearly and intuitively describing the causes that trigger the given emotion based on the multimodal context of conversations. |
Fanfan Wang; Heqing Ma; Xiangqing Shen; Jianfei Yu; Rui Xia; |
| 56 | MPT: Multi-grained Prompt Tuning for Text-Video Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This approach may lead to sub-optimal performance due to the incorporation of irrelevant and indiscriminate knowledge. To address such an issue, we present a Multi-grained Prompt Tuning (MPT) for text-video retrieval, that designs a variety of specific prompts to effectively explore semantic interaction across different modalities with diverse granularity. |
Haonan Zhang; Pengpeng Zeng; Lianli Gao; Jingkuan Song; Heng Tao Shen; |
| 57 | Fact: Teaching MLLMs with Faithful, Concise and Transferable Rationales Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce Fact, a novel paradigm designed to generate multimodal rationales that are faithful, concise, and transferable for teaching MLLMs. |
Minghe Gao; Shuang Chen; Liang Pang; Yuan Yao; Jisheng Dang; Wenqiao Zhang; Juncheng Li; Siliang Tang; Yueting Zhuang; Tat-Seng Chua; |
| 58 | Counterfactually Augmented Event Matching for De-biased Temporal Sentence Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we design a novel framework, termed Counterfactually-Augmented Event Matching (CAEM), which incorporates counterfactual data augmentation to learn event-query joint representations to resist the training bias. |
Xun Jiang; Zhuoyuan Wei; Shenshen Li; Xing Xu; Jingkuan Song; Heng Tao Shen; |
| 59 | LiDAR-NeRF: Novel LiDAR View Synthesis Via Neural Radiance Fields Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce a new task, novel view synthesis for LiDAR sensors. |
Tang Tao; Longfei Gao; Guangrun Wang; Yixing Lao; Peng Chen; Hengshuang Zhao; Dayang Hao; Xiaodan Liang; Mathieu Salzmann; Kaicheng Yu; |
| 60 | Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Based on T2VQA-DB, we propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA). |
Tengchuan Kou; Xiaohong Liu; Zicheng Zhang; Chunyi Li; Haoning Wu; Xiongkuo Min; Guangtao Zhai; Ning Liu; |
| 61 | MAGIC: Rethinking Dynamic Convolution Design for Medical Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: ii) The linear kernel aggregation is inefficient, restricting the model’s capacity to learn more intricate patterns. In this paper, we rethink the dynamic convolution design to address these limitations and propose multi-dimensional aggregation dynamic convolution (MAGIC). |
Shijie Li; Yunbin Tu; Qingyuan Xiang; Zheng Li; |
| 62 | Ego3DT: Tracking Every 3D Object in Ego-centric Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: One significant challenge within this realm is the accurate localization and tracking of objects in ego-centric videos, primarily due to the substantial variability in viewing angles. Addressing this issue, this paper introduces a novel zero-shot approach for the 3D reconstruction and tracking of all objects from the ego-centric video. |
Shengyu Hao; Wenhao Chai; Zhonghan Zhao; Meiqi Sun; Wendi Hu; Jieyang Zhou; Yixian Zhao; Qi Li; Yizhou Wang; Xi Li; Gaoang Wang; |
| 63 | Realistic Full-Body Motion Generation from Sparse Tracking with State Space Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, the complex structure of the human body further complicates this task. To address these issues, we present Motion Mamba Diffusion (MMD), a novel conditional diffusion model, which effectively utilizes the sequence modeling capability of SSMs and the robust generation ability of diffusion models to track full-body poses accurately. |
Kun Dong; Jian Xue; Zehai Niu; Xing Lan; Ke Lu; Qingyuan Liu; Xiaoyu Qin; |
| 64 | Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Faced with those difficulties, we address video try-on by proposing a diffusion-based framework named “Tunnel Try-on.” |
Zhengze Xu; Mengting Chen; Zhao Wang; Linyu Xing; Zhonghua Zhai; Nong Sang; Jinsong Lan; Shuai Xiao; Changxin Gao; |
| 65 | Uni-DlLoRA: Style Fine-Tuning for Fashion Image Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods focus on enhancing the generative model with diversity while lacking ID-preserved domain translation. This paper introduces a novel model named Uni-DlLoRA to release this constraint. |
Fangjian Liao; Xingxing Zou; Waikeung Wong; |
| 66 | View Gap Matters: Cross-view Topology and Information Decoupling for Multi-view Clustering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a Tree-Based View-Gap Maintaining Multi-View Clustering (TGM-MVC) method. |
Fangdi Wang; Jiaqi Jin; Zhibin Dong; Xihong Yang; Yu Feng; Xinwang Liu; Xinzhong Zhu; Siwei Wang; Tianrui Liu; En Zhu; |
| 67 | Making Large Language Models Perform Better in Knowledge Graph Completion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we explore methods to incorporate structural information into the LLMs, with the overarching goal of facilitating structure-aware reasoning. |
Yichi Zhang; Zhuo Chen; Lingbing Guo; Yajing Xu; Wen Zhang; Huajun Chen; |
| 68 | Chain of Visual Perception: Harnessing Multimodal Large Language Models for Zero-shot Camouflaged Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce a novel multimodal camo-perceptive framework (MMCPF) aimed at handling zero-shot Camouflaged Object Detection (COD) by leveraging the powerful capabilities of Multimodal Large Language Models (MLLMs). |
Lv Tang; Peng-Tao Jiang; Zhi-Hao Shen; Hao Zhang; Jin-Wei Chen; Bo Li; |
| 69 | Consistent123: One Image to Highly Consistent 3D Asset Using Case-Aware Diffusion Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose Consistent123, a case-aware two-stage method for highly consistent 3D asset reconstruction from one image with both 2D and 3D diffusion priors. |
Yukang Lin; Haonan Han; Chaoqun Gong; Zunnan Xu; Yachao Zhang; Xiu Li; |
| 70 | Bilateral Adaptive Cross-Modal Fusion Prompt Learning for CLIP Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose that the proper alignment for downstream tasks is determined by the flexibility of the interaction between cross-modal information, which compensates for the absence of contrastive loss during the adaptation process. |
Qiang Wang; Ke Yan; Shouhong Ding; |
| 71 | AudioLCM: Efficient and High-Quality Text-to-Audio Generation with Minimal Inference Steps Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce AudioLCM, a novel consistency-based model tailored for efficient and high-quality text-to-audio generation. |
Huadai Liu; Rongjie Huang; Yang Liu; Hengyuan Cao; Jialei Wang; Xize Cheng; Siqi Zheng; Zhou Zhao; |
| 72 | Deblurring Neural Radiance Fields with Event-driven Bundle Adjustment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose Event-driven Bundle Adjustment for Deblurring Neural Radiance Fields (EBAD-NeRF) to jointly optimize the learnable poses and NeRF parameters by leveraging the hybrid event-RGB data. |
Yunshan Qi; Lin Zhu; Yifan Zhao; Nan Bao; Jia Li; |
| 73 | A Unimodal Valence-Arousal Driven Contrastive Learning Framework for Multimodal Multi-Label Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, existing works mainly learn the unimodal representation based on the multimodal supervision signal of a single sample, failing to explicitly capture the unique emotional state of each modality as well as its emotional correlation between samples. To overcome these issues, we propose a Unimodal Valence-Arousal driven contrastive learning framework (UniVA) for the MMER task. |
Wenjie Zheng; Jianfei Yu; Rui Xia; |
| 74 | Multi-Label Learning with Block Diagonal Labels Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a new setting, i.e. block diagonal labels, to reduce the workload on both sides. |
Leqi Shen; Sicheng Zhao; Yifeng Zhang; Hui Chen; Jundong Zhou; Pengzhang Liu; Yongjun Bao; Guiguang Ding; |
| 75 | Multimodal Emotion Recognition Calibration in Conversations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This contradicts the foundational principle in informatics, namely, the elimination of uncertainty. Based on this, we propose a novel calibration framework CMERC to calibrate MERC models without altering the model structure. |
Geng Tu; Feng Xiong; Bin Liang; Hui Wang; Xi Zeng; Ruifeng Xu; |
| 76 | FreePIH: Training-Free Painterly Image Harmonization with Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unlike existing methods that require either training auxiliary networks or fine-tuning a large pre-trained backbone, or both, to harmonize a foreground object with a painterly-style background image, our FreePIH tames the denoising process as a plug-in module for foreground image style transfer. Specifically, we find that the very last few steps of the denoising (i.e., generation) process strongly correspond to the stylistic information of images, and based on this, we propose to augment the latent features of both the foreground and background images with Gaussians for a direct denoising-based harmonization. |
Ruibin Li; Jingcai Guo; Qihua Zhou; Song Guo; |
| 77 | Learning Dual Enhanced Representation for Contrastive Multi-view Clustering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Secondly, cluster-level CL lacks the guidance of global information and is always restricted by the local diversity information. We in this paper Learn dUal enhanCed rEpresentation for Contrastive Multi-view Clustering (LUCE-CMC) to effectively address the above challenges, and it mainly contains two parts, i.e., enhanced feature-level CL (En-FeaCL) and enhanced cluster-level CL (En-CluCL). |
Guoliang Zou; Yangdong Ye; Tongji Chen; Shizhe Hu; |
| 78 | XMeCap: Meme Caption Generation with Sub-Image Adaptability Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: After that, we introduce the XMeCap framework, a novel approach that adopts supervised fine-tuning and reinforcement learning based on an innovative reward model, which factors in both global and local similarities between visuals and text. |
Yuyan Chen; Songzhou Yan; Zhihong Zhu; Zhixu Li; Yanghua Xiao; |
| 79 | Generative Multimodal Data Augmentation for Low-Resource Multimodal Named Entity Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel Generative Multimodal Data Augmentation (GMDA) framework for MNER, which contains two stages: Multimodal Text Generation and Multimodal Image Generation. |
Ziyan Li; Jianfei Yu; Jia Yang; Wenya Wang; Li Yang; Rui Xia; |
| 80 | 3DPCP-Net: A Lightweight Progressive 3D Correspondence Pruning Network for Accurate and Efficient Point Cloud Registration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: These approaches are either insufficiently accurate or inefficient, often requiring more network parameters. To address this issue, we propose a lightweight network, 3DPCP-Net, for fast and robust registration. |
Jingtao Wang; Zechao Li; |
| 81 | DVF: Advancing Robust and Accurate Fine-Grained Image Retrieval with Retrieval Guidelines Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a meticulous analysis leading to the proposal of practical guidelines to identify subcategory-specific discrepancies and generate discriminative features to design effective FGIR models. |
Xin Jiang; Hao Tang; Rui Yan; Jinhui Tang; Zechao Li; |
| 82 | HyperTime: Hyperparameter Optimization for Combating Temporal Distribution Shifts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a hyperparameter optimization method named HyperTime to find hyperparameters robust to potential temporal distribution shifts in the unseen test data. |
Shaokun Zhang; Yiran Wu; Zhonghua Zheng; Qingyun Wu; Chi Wang; |
| 83 | Unsupervised Multi-view Pedestrian Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: With the prosperity of the intelligent surveillance, multiple cameras have been applied to localize pedestrians more accurately. However, previous methods rely on laborious … |
Mengyin Liu; Chao Zhu; Shiqi Ren; Xu-Cheng Yin; |
| 84 | LanEvil: Benchmarking The Robustness of Lane Detection to Environmental Illusions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To defend against environmental illusions, we propose the Attention Area Mixing (AAM) approach using hard examples, which witnesses a significant robustness improvement (+3.76%) under illumination effects. |
Tianyuan Zhang; Lu Wang; Hainan Li; Yisong Xiao; Siyuan Liang; Aishan Liu; Xianglong Liu; Dacheng Tao; |
| 85 | LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a fine-grained adaptive VLM architecture for Chinese medical visual conversations through parameter-efficient tuning. |
Xuechen Guo; Wenhao Chai; Shi-Yan Li; Gaoang Wang; |
| 86 | What’s The Real: A Novel Design Philosophy for Robust AI-Synthesized Voice Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we analyze the limitations of existing fake voice detectors and propose a new design philosophy, guiding the detection model to prioritize learning human voice features rather than the difference between the human voice and the synthetic voice. Based on this philosophy, we propose a novel AI-synthesized voice detection framework named SiFSafer, which uses pre-trained speech representation models to enhance the learning of feature distribution in human voices and the adapter fine-tuning to optimize the performance. |
Xuan Hai; Xin Liu; Yuan Tan; Gang Liu; Song Li; Weina Niu; Rui Zhou; Xiaokang Zhou; |
| 87 | Reliable Attribute-missing Multi-view Clustering with Instance-level and Feature-level Cooperative Imputation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Furthermore, current methods uniformly treat all missing attributes as zero values, thus failing to differentiate between real and technical zeroes, potentially resulting in data over-imputation. To mitigate these challenges, we introduce a novel Reliable Attribute-Missing Multi-View Clustering method (RAM-MVC). |
Dayu Hu; Suyuan Liu; Jun Wang; Junpu Zhang; Siwei Wang; Xingchen Hu; Xinzhong Zhu; Chang Tang; Xinwang Liu; |
| 88 | HOGDA: Boosting Semi-supervised Graph Domain Adaptation Via High-Order Structure-Guided Adaptive Feature Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most existing studies often directly utilize graph convolutional network (GCN)-based feature extractors to capture domain-invariant node features, while neglecting the issue that GCNs are insufficient in collecting complex structure information in graphs. Considering the importance of graph structure information in encoding the complex relationship among nodes and edges, this paper aims to utilize such powerful information to assist graph transfer learning. |
Jun Dan; Weiming Liu; Mushui Liu; Chunfeng Xie; Shunjie Dong; Guofang Ma; Yanchao Tan; Jiazheng Xing; |
| 89 | TAVGBench: Benchmarking Text to Audible-Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To support research in this field, we have developed a comprehensive Text to Audible-Video Generation Benchmark (TAVGBench), which contains over 1.7 million clips with a total duration of 11.8 thousand hours. We propose an automatic annotation pipeline to ensure each audible video has detailed descriptions for both its audio and video contents. |
Yuxin Mao; Xuyang Shen; Jing Zhang; Zhen Qin; Jinxing Zhou; Mochu Xiang; Yiran Zhong; Yuchao Dai; |
| 90 | It Takes Two: Accurate Gait Recognition in The Wild Via Cross-granularity Alignment Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To discover the advantages of silhouette and parsing and overcome their limitations, this paper proposes a novel cross-granularity alignment gait recognition method, named XGait, to unleash the power of gait representations of different granularity. |
Jinkai Zheng; Xinchen Liu; Boyue Zhang; Chenggang Yan; Jiyong Zhang; Wu Liu; Yongdong Zhang; |
| 91 | Prior Metadata-Driven RAW Reconstruction: Eliminating The Need for Per-Image Metadata Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Reconstruction methods of RAW images from sRGB data typically require additional metadata from the RAW image, which increases camera processing computations. To address this problem, we propose using Prior Meta as a reference to reconstruct the RAW data instead of relying on per-image metadata. |
Wencheng Han; Chen Zhang; Yang Zhou; Wentao Liu; Chen Qian; Cheng-zhong Xu; Jianbing Shen; |
| 92 | GOI: Find 3D Gaussians of Interest with An Optimizable Open-vocabulary Semantic-space Hyperplane Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we introduce GOI, a framework that integrates semantic features from 2D vision-language foundation models into 3D Gaussian Splatting (3DGS) and identifies 3D Gaussians of Interest using an Optimizable Semantic-space Hyperplane. |
Yansong Qu; Shaohui Dai; Xinyang Li; Jianghang Lin; Liujuan Cao; Shengchuan Zhang; Rongrong Ji; |
| 93 | Virtual Visual-Guided Domain-Shadow Fusion Via Modal Exchanging for Domain-Specific Multi-Modal Neural Machine Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This challenge can lead to a decrease in machine translation performance for domain-specific terms. To tackle this problem, this paper presents a virtual visual scene-guided domain-shadow multi-modal fusion mechanism to simultaneously integrate multi-grained domain visual details and text with the guidance of modality-agnostic virtual visual scene, thereby enhancing machine translation performance for DMNMT, especially for domain terms. |
Zhenyu Hou; Junjun Guo; |
| 94 | DreamLCM: Towards High Quality Text-to-3D Generation Via Latent Consistency Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, to address the issue, we propose DreamLCM which incorporates the Latent Consistency Model (LCM). |
Yiming Zhong; Xiaolin Zhang; Yao Zhao; Yunchao Wei; |
| 95 | Exploring Data Efficiency in Image Restoration: A Gaussian Denoising Case Study Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This hypothesis is rigorously tested through experiments conducted on synthetically blurred datasets. Building on this premise, we delve into the data efficiency within training datasets and introduce an effective and stabilized method for quantifying content information, thereby enabling the ranking of training images based on their influence. |
Zhengwei Yin; Mingze Ma; Guixu Lin; Yinqiang Zheng; |
| 96 | FlexIR: Towards Flexible and Manipulable Image Restoration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While some studies have explored dynamic restoration through the integration of an auxiliary network within a unified framework, these approaches often fall short in practical applications due to the complexities involved in training, retraining, and hyperparameter adjustment, as well as the limitation of being totally controlled by the auxiliary network and biased by the training data. To address these challenges, we introduce FlexIR: a flexible and manipulable framework for image restoration. |
Zhengwei Yin; Guixu Lin; Mengshun Hu; Hao Zhang; Yinqiang Zheng; |
| 97 | WisdoM: Improving Multimodal Sentiment Analysis By Fusing Contextual World Knowledge Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a plug-in framework named WisdoM, to leverage the contextual world knowledge induced from the large vision-language models (LVLMs) for enhanced MSA. |
Wenbin Wang; Liang Ding; Li Shen; Yong Luo; Han Hu; Dacheng Tao; |
| 98 | AlignCLIP: Align Multi Domains of Texts Input for CLIP Models with Object-IoU Loss Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose the use of caption object parsing to identify the objects set contained within captions. |
Lu Zhang; Ke Yan; Shouhong Ding; |
| 99 | MaterialSeg3D: Segmenting Dense Materials from 2D Priors for 3D Assets Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In contrast, humans can effortlessly circumvent this ambiguity by deducing the material of the object from its appearance and semantics. Motivated by this insight, we propose MaterialSeg3D, a 3D asset material generation framework to infer underlying material from the 2D semantic prior. |
Zeyu Li; Ruitong Gan; Chuanchen Luo; Yuxi Wang; Jiaheng Liu; Ziwei Zhu; Qing Li; Xucheng Yin; Man Zhang; Zhaoxiang Zhang; Junran Peng; |
| 100 | Attribute-Driven Multimodal Hierarchical Prompts for Image Aesthetic Quality Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To fully explore the attribute information perceived by users for evaluating image aesthetic quality, this paper proposes an image aesthetic quality assessment method based on attribute-driven multimodal hierarchical prompts. |
Hancheng Zhu; Ju Shi; Zhiwen Shao; Rui Yao; Yong Zhou; Jiaqi Zhao; Leida Li; |
| 101 | Gait Recognition in Large-scale Free Environment Via Single LiDAR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, based on a single LiDAR, we present the Hierarchical Multi-representation Feature Interaction Network (HMRNet) for robust gait recognition. |
Xiao Han; Yiming Ren; Peishan Cong; Yujing Sun; Jingya Wang; Lan Xu; Yuexin Ma; |
| 102 | Towards Practical Human Motion Prediction with LiDAR Point Clouds Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose LiDAR-HMP, the first single-LiDAR-based 3D human motion prediction approach, which receives the raw LiDAR point cloud as input and forecasts future 3D human poses directly. |
Xiao Han; Yiming Ren; Yichen Yao; Yujing Sun; Yuexin Ma; |
| 103 | Cluster-driven Personalized Federated Recommendation with Interest-aware Graph Convolution Network for Multimedia Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Graph Convolutional Networks (GCNs) offer a promising method by utilizing the information from high-order neighbors, but face challenges in federated settings due to problems such as over-smoothing, data heterogeneity, and elevated communication expenses. To resolve these problems, we propose a Cluster-driven Personalized Federated Recommender System with Interest-aware Graph Convolution Network (CPF-GCN) for multimedia recommendation. |
Xingyuan Mao; Yuwen Liu; Lianyong Qi; Li Duan; Xiaolong Xu; Xuyun Zhang; Wanchun Dou; Amin Beheshti; Xiaokang Zhou; |
| 104 | F-3DGS: Factorized Coordinates and Representations for 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To mitigate the storage overhead, we propose Factorized 3D Gaussian Splatting (F-3DGS), a novel approach that drastically reduces storage requirements while preserving image quality. |
Xiangyu Sun; Joo Chan Lee; Daniel Rho; Jong Hwan Ko; Usman Ali; Eunbyung Park; |
| 105 | When ControlNet Meets Inexplicit Masks: A Case Study of ControlNet on Its Contour-following Ability Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: ControlNet excels at creating content that closely matches precise contours in user-provided masks. However, when these masks contain noise, as a frequent occurrence with … |
Wenjie Xuan; Yufei Xu; Shanshan Zhao; Chaoyue Wang; Juhua Liu; Bo Du; Dacheng Tao; |
| 106 | Traj2Former: A Local Context-aware Snapshot and Sequential Dual Fusion Transformer for Trajectory Classification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Second, though efforts have been made to incorporate a shape feature by rendering trajectories into images, they fail to model the local correspondence between GPS points and image pixels. To address these issues, we propose a novel model termed Traj2Former to spotlight the spatial distribution of the adjacent trajectory points (i.e., contextual snapshot) and enhance the snapshot fusion between the trajectory data and the corresponding spatial contexts. |
Yuan Xie; Yichen Zhang; Yifang Yin; Sheng Zhang; Ying Zhang; Rajiv Shah; Roger Zimmermann; Guoqing Xiao; |
| 107 | Enhancing Underwater Images Via Asymmetric Multi-Scale Invertible Networks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Underwater images, often plagued by complex degradation, pose significant challenges for image enhancement. To address these challenges, the paper redefines underwater image enhancement as an image decomposition problem and proposes a deep invertible neural network (INN) that accurately predicts both the latent image and the degradation effects. |
Yuhui Quan; Xiaoheng Tan; Yan Huang; Yong Xu; Hui Ji; |
| 108 | Prototypical Prompting for Text-to-image Person Re-identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we study the problem of Text-to-Image Person Re-identification (TIReID), which aims to find images of the same identity described by a text sentence from a pool of candidate images. |
Shuanglin Yan; Jun Liu; Neng Dong; Liyan Zhang; Jinhui Tang; |
| 109 | HandRefiner: Refining Malformed Hands in Generated Images By Diffusion-based Conditional Inpainting Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: For correct hand generation, our paper introduces a lightweight post-processing solution called HandRefiner. |
Wenquan Lu; Yufei Xu; Jing Zhang; Chaoyue Wang; Dacheng Tao; |
| 110 | Decoder Pre-Training with Only Text for Scene Text Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We note that vision-language models like CLIP, pre-trained on extensive real image-text pairs, effectively align images and text in a unified embedding space, suggesting the potential to derive the representations of real images from text alone. Building upon this premise, we introduce a novel method named Decoder Pre-training with only text for STR (DPTR). |
Shuai Zhao; Yongkun Du; Zhineng Chen; Yu-Gang Jiang; |
| 111 | StableMoFusion: Towards Robust and Efficient Diffusion-based Motion Generation Framework Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To eliminate footskate, we identify foot-ground contact and correct foot motions along the denoising process. By organically combining these well-designed components together, we present StableMoFusion, a robust and efficient framework for human motion generation. |
Yiheng Huang; Hui Yang; Chuanchen Luo; Yuxi Wang; Shibiao Xu; Zhaoxiang Zhang; Man Zhang; Junran Peng; |
| 112 | Resisting Over-Smoothing in Graph Neural Networks Via Dual-Dimensional Decoupling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we consider the oversmoothing issue from two aspects of the node embedding space: dimension and instance. |
Wei Shen; Mang Ye; Wenke Huang; |
| 113 | Segment Anything with Precise Interaction Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Although the Segment Anything Model (SAM) has achieved impressive results in many segmentation tasks and benchmarks, its performance noticeably deteriorates when applied to … |
Mengzhen Liu; Mengyu Wang; Henghui Ding; Yilong Xu; Yao Zhao; Yunchao Wei; |
| 114 | Calibrating Prompt from History for Continual Vision-Language Retrieval and Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, to enable neural networks to better understand diverse modalities in real-world scenarios, we investigate continual learning for two typical vision-language applications, i.e., retrieval and grounding. |
Tao Jin; Weicai Yan; Ye Wang; Sihang Cai; Qifan Shuai; Zhou Zhao; |
| 115 | TreeReward: Improve Diffusion Model Via Tree-Structured Feedback Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, to address the limitation of the fine-grained feedback data, we first design a novel AI + Expert feedback data construction pipeline, yielding a high-quality feedback dataset of about 2.2M entries encompassing six fine-grained dimensions at a relatively low cost. Built upon this dataset, we introduce a tree-structured reward model to exploit the fine-grained feedback data efficiently and provide tailored optimization during feedback learning. |
Jiacheng Zhang; Jie Wu; Huafeng Kuang; Haiming Zhang; Yuxi Ren; Weifeng Chen; Manlin Zhang; Xuefeng Xiao; Guanbin Li; |
| 116 | Label Decoupling and Reconstruction: A Two-Stage Training Framework for Long-tailed Multi-label Medical Image Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the clinical reality is complex and multifaceted, with patients often suffering from multiple intertwined diseases, not all of which are equally common, leading to medical datasets that are frequently characterized by multi-labels and a long-tailed distribution. In this paper, we propose a method involving label decoupling and reconstruction (LDRNet) to address these two specific challenges. |
Jie Huang; Zhao-Min Chen; Xiaoqin Zhang; Yisu Ge; Lusi Ye; Guodao Zhang; Huiling Chen;
| 117 | ReToMe-VA: Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose the Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack (ReToMe-VA), which is the first framework to generate imperceptible adversarial video clips with higher transferability. |
Ziyi Gao; Kai Chen; Zhipeng Wei; Tingshu Mou; Jingjing Chen; Zhiyu Tan; Hao Li; Yu-Gang Jiang; |
| 118 | LMM-PCQA: Assisting Point Cloud Quality Assessment with LMM Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Given LMMs’ exceptional performance and robustness in low-level vision and quality assessment tasks, this study aims to investigate the feasibility of imparting PCQA knowledge to LMMs through text supervision. |
Zicheng Zhang; Haoning Wu; Yingjie Zhou; Chunyi Li; Wei Sun; Chaofeng Chen; Xiongkuo Min; Xiaohong Liu; Weisi Lin; Guangtao Zhai; |
| 119 | Boosting Semi-supervised Crowd Counting with Scale-based Active Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a simple yet effective active labeling strategy to explicitly select informative unlabeled images, guided by the intra-scale uncertainty and inter-scale inconsistency metrics. |
Shiwei Zhang; Wei Ke; Shuai Liu; Xiaopeng Hong; Tong Zhang; |
| 120 | 3D-GRES: Generalized 3D Referring Expression Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, current approaches are limited to segmenting a single target, restricting the versatility of the task. To overcome this limitation, we introduce Generalized 3D Referring Expression Segmentation (3D-GRES), which extends the capability to segment any number of instances based on natural language instructions. |
Changli Wu; Yihang Liu; Jiayi Ji; Yiwei Ma; Haowei Wang; Gen Luo; Henghui Ding; Xiaoshuai Sun; Rongrong Ji; |
| 121 | IconDM: Text-Guided Icon Set Expansion Using Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unlike ordinary images, icons exhibit richer fine-grained stylistic elements, including tones, line widths, shapes, shadow effects, etc., which puts higher demands on capturing and preserving detailed styles during icon generation. To address the challenges, we propose IconDM, a method based on pre-trained text-to-image (T2I) diffusion models. |
Jiawei Lin; Zhaoyun Jiang; Jiaqi Guo; Shizhao Sun; Ting Liu; Zijiang Yang; Jian-Guang Lou; Dongmei Zhang; |
| 122 | HMR-Adapter: A Lightweight Adapter with Dual-Path Cross Augmentation for Expressive Human Mesh Recovery Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, whole-body estimation models often inaccurately estimate hand poses, while hand expert models struggle with severe occlusions. To overcome these limitations, we introduce a dual-path cross augmentation framework with a novel adaptation approach called HMR-Adapter that enhances existing large HMR models. |
Wenhao Shen; Wanqi Yin; Hao Wang; Chen Wei; Zhongang Cai; Lei Yang; Guosheng Lin; |
| 123 | CoIn: A Lightweight and Effective Framework for Story Visualization and Continuation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The autoregressive framework modifies the large pretrained text-to-image model in an auto-regressive manner with additional history modules, leading to large model size, resource-intensive requirements, and slow generation speed. To address these issues, we propose a lightweight and effective framework, namely CoIn. |
Ming Tao; Bing-Kun Bao; Hao Tang; Yaowei Wang; Changsheng Xu; |
| 124 | Dual-view Pyramid Network for Video Frame Interpolation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We aim to unleash the multifaceted knowledge yielded by the hierarchical views at multiple scales in a pyramid network. |
Yao Luo; Ming Yang; Jinhui Tang; |
| 125 | GLoMo: Global-Local Modal Fusion for Multimodal Sentiment Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Additionally, the integration of multiple local representations, and the fusion of local and global information present significant challenges. To address these limitations, we propose the Global-Local Modal (GLoMo) Fusion framework. |
Yan Zhuang; Yanru Zhang; Zheng Hu; Xiaoyue Zhang; Jiawen Deng; Fuji Ren; |
| 126 | SAT3D: Image-driven Semantic Attribute Transfer in 3D Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose an image-driven Semantic Attribute Transfer method in 3D (SAT3D) by editing semantic attributes from a reference image. |
Zhijun Zhai; Zengmao Wang; Xiaoxiao Long; Kaixuan Zhou; Bo Du; |
| 127 | Parameter-efficient Is Not Sufficient: Exploring Parameter, Memory, and Time Efficient Adapter Tuning for Dense Predictions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite their success, the existing PETL methods in CV can be computationally expensive and incur large memory and time costs during training, which prevents low-resource users from conducting research and applications on large models. In this work, we propose Parameter, Memory, and Time Efficient Visual Adapter (E3VA) tuning to address this issue. |
Dongshuo Yin; Xueting Han; Bin Li; Hao Feng; Jing Bai; |
| 128 | Cascaded Adversarial Attack: Simultaneously Fooling Rain Removal and Semantic Segmentation Networks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: When applying high-level visual algorithms to rainy scenes, it is customary to preprocess the rainy images using low-level rain removal networks, followed by visual networks to achieve the desired objectives. |
Zhiwen Wang; Yuhui Wu; Zheng Wang; Jiwei Wei; Tianyu Li; Guoqing Wang; Yang Yang; Hengtao Shen; |
| 129 | Prior-free Balanced Replay: Uncertainty-guided Reservoir Sampling for Long-Tailed Continual Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose a novel Prior-free Balanced Replay (PBR) framework to learn from a long-tailed data stream with less forgetting. |
Lei Liu; Li Liu; Yawen Cui; |
| 130 | InstantAS: Minimum Coverage Sampling for Arbitrary-Size Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents the InstantAS method for arbitrary-size image generation. |
Changshuo Wang; Mingzhe Yu; Lei Wu; Lei Meng; Xiang Li; Xiangxu Meng; |
| 131 | TALE: Training-free Cross-domain Image Composition Via Adaptive Latent Manipulation and Energy-guided Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present TALE, a novel training-free framework harnessing the generative capabilities of text-to-image diffusion models to address the cross-domain image composition task that focuses on flawlessly incorporating user-specified objects into a designated visual context regardless of domain disparity. |
Kien T. Pham; Jingye Chen; Qifeng Chen; |
| 132 | Adaptive Selection Based Referring Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, the direct fusion of word-level features into coarse aligned features disrupts the established vision-language alignment. In this paper, we introduce an innovative framework for RIS that seeks to overcome these challenges with adaptive alignment of vision and language features, termed the Adaptive Selection with Dual Alignment (ASDA). |
Pengfei Yue; Jianghang Lin; Shengchuan Zhang; Jie Hu; Yilin Lu; Hongwei Niu; Haixin Ding; Yan Zhang; Guannan Jiang; Liujuan Cao; Rongrong Ji; |
| 133 | NovaChart: A Large-scale Dataset Towards Chart Understanding and Generation of Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To build NovaChart, we propose a data generation engine for metadata curation, chart visualization and instruction formulation. |
Linmei Hu; Duokang Wang; Yiming Pan; Jifan Yu; Yingxia Shao; Chong Feng; Liqiang Nie; |
| 134 | Dual-Resolution Fusion Modeling for Unsupervised Cross-Resolution Person Re-Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, acquiring manual labels requires considerable human effort, greatly limiting the flexibility of existing CR-ReID methods. To address this issue, we propose a dual-resolution fusion modeling (DRFM) framework to tackle the CR-ReID problem in an unsupervised manner. |
Zhiqi Pang; Lingling Zhao; Chunyu Wang; |
| 135 | Serial Section Microscopy Image Inpainting Guided By Axial Optical Flow Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose an optical flow-based serial section inpainting architecture to effectively combine the 3D structure information from neighboring sections and 2D image features from surrounding regions. |
Yiran Cheng; Bintao He; Fa Zhang; Renmin Han; |
| 136 | AdvQDet: Detecting Query-Based Adversarial Attacks with Adversarial Contrastive Prompt Tuning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Arguably, the key challenge here is to assign high similarity scores for any two intermediate adversarial examples perturbed from the same clean image. To address this challenge, we propose a novel Adversarial Contrastive Prompt Tuning (ACPT) method to robustly fine-tune the CLIP image encoder to extract similar embeddings for any two intermediate adversarial queries. |
Xin Wang; Kai Chen; Xingjun Ma; Zhineng Chen; Jingjing Chen; Yu-Gang Jiang; |
| 137 | Prompting to Adapt Foundational Segmentation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This significantly limits the models’ generalization capabilities and efficiency in deployment. In this study, we propose a novel adaptation paradigm, termed “prompting-to-adapt”, to tackle the above issue by introducing an innovative image prompter. |
Jie Hu; Jie Li; Yue Ma; Liujuan Cao; Songan Zhang; Wei Zhang; Guannan Jiang; Rongrong Ji; |
| 138 | WorldGPT: Empowering LLM As Multimodal World Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce WorldGPT, a generalist world model built upon a Multimodal Large Language Model (MLLM). |
Zhiqi Ge; Hongzhe Huang; Mingze Zhou; Juncheng Li; Guoming Wang; Siliang Tang; Yueting Zhuang; |
| 139 | HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, there exist significant task gaps between pre-training and grounding. Therefore, to address these gaps, we propose a concise and efficient hierarchical multimodal fine-grained modulation framework, namely HiVG. |
Linhui Xiao; Xiaoshan Yang; Fang Peng; Yaowei Wang; Changsheng Xu; |
| 140 | Multimodal Low-light Image Enhancement with Depth Information Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The integration of depth information into image restoration is a research question worthy of exploration. Therefore, in this paper, we propose a multimodal low-light image enhancement task based on depth information and establish a dataset named LED (Low-light Image Enhanced with Depth Map), consisting of 1,365 samples. |
Zhen Wang; Dongyuan Li; Guang Li; Ziqing Zhang; Renhe Jiang; |
| 141 | PROMOTE: Prior-Guided Diffusion Model with Global-Local Contrastive Learning for Exemplar-Based Image Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these methods often suffer from two challenges: 1) Insufficient excavation of domain-invariant features leads to low-quality cross-domain correspondences, and 2) Inaccurate correspondences result in errors propagated during the translation process due to a lack of reliable prior guidance. To tackle these issues, we propose a novel prior-guided diffusion model with global-local contrastive learning (PROMOTE), which is trained in a self-supervised manner. |
Guojin Zhong; Yihu Guo; Jin Yuan; Qianjun Zhang; Weili Guan; Long Chen; |
| 142 | Focus, Distinguish, and Prompt: Unleashing CLIP for Efficient and Flexible Scene Text Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Different from them, in this work we propose to explore the intrinsic potential of Contrastive Language-Image Pre-training (CLIP) for OCR-free scene text retrieval. |
Gangyan Zeng; Yuan Zhang; Jin Wei; Dongbao Yang; Peng Zhang; Yiwen Gao; Xugong Qin; Yu Zhou; |
| 143 | Autogenic Language Embedding for Coherent Point Tracking Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce a novel approach leveraging language embeddings to enhance the coherence of frame-wise visual features related to the same object. |
Zikai Song; Ying Tang; Run Luo; Lintao Ma; Junqing Yu; Yi-Ping Phoebe Chen; Wei Yang; |
| 144 | Semantic Editing Increment Benefits Zero-Shot Composed Image Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, these methods neglect the fact that ZS-CIR entails considering not only the final similarity between the composed text and retrieved images but also the semantic increment during the compositional editing process. To address this limitation, this paper proposes a training-free method called Semantic Editing Increment for ZS-CIR (SEIZE) to retrieve the target image based on the query image and text without training. |
Zhenyu Yang; Shengsheng Qian; Dizhan Xue; Jiahong Wu; Fan Yang; Weiming Dong; Changsheng Xu; |
| 145 | Towards Multimodal-augmented Pre-trained Language Models Via Self-balanced Expectation-Maximization Iteration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these methods: (1) inevitably encounter modality gaps and noise; (2) treat all modalities indiscriminately; and (3) ignore visual or acoustic semantics of key entities. To tackle these challenges, we propose a novel principled iterative framework for multimodal-augmented PLMs termed MASE, which achieves efficient and balanced injection of multimodal semantics under the proposed Expectation Maximization (EM) based iterative algorithm. |
Xianwei Zhuang; Xuxin Cheng; Zhihong Zhu; Zhanpeng Chen; Hongxiang Li; Yuexian Zou; |
| 146 | Poisoning for Debiasing: Fair Recognition Via Eliminating Bias Uncovered in Data Poisoning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we first reveal that previous biased models fit target labels, which results in failing to expose data bias. To tackle this issue, we propose poisoner, which utilizes data poisoning to embed the biases learned by biased models into the poisoned training data, thereby encouraging the models to learn more biases. |
Yi Zhang; Zhefeng Wang; Rui Hu; Xinyu Duan; Yi Zheng; Baoxing Huai; Jiarun Han; Jitao Sang; |
| 147 | IControl3D: An Interactive System for Controllable 3D Scene Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we present iControl3D, a novel interactive system that empowers users to generate and render customizable 3D scenes with precise control. |
Xingyi Li; Yizheng Wu; Jun Cen; Juewen Peng; Kewei Wang; Ke Xian; Zhe Wang; Zhiguo Cao; Guosheng Lin; |
| 148 | Knowledge-Aware Artifact Image Synthesis with LLM-Enhanced Prompting and Multi-Source Supervision Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel knowledge-aware artifact image synthesis approach that brings lost historical objects accurately into their visual forms. |
Shengguang Wu; Zhenglun Chen; Qi Su; |
| 149 | Semantic Codebook Learning for Dynamic Recommendation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, it faces the challenges of a large parameter search space and sparse, noisy user-item interactions, which reduce the applicability of the generated model parameters. The Semantic Codebook Learning for Dynamic Recommendation Models (SOLID) framework presents a significant advancement in DSR by effectively tackling these challenges. |
Zheqi Lv; Shaoxuan He; Tianyu Zhan; Shengyu Zhang; Wenqiao Zhang; Jingyuan Chen; Zhou Zhao; Fei Wu; |
| 150 | DERO: Diffusion-Model-Erasure Robust Watermarking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unfortunately, most existing robust watermarking methods fail to tackle such an erasure attack since they are primarily designed for traditional channel distortions. To address this issue, this paper proposes DERO, a diffusion-model-erasure robust watermarking framework. |
Han Fang; Kejiang Chen; Yupeng Qiu; Zehua Ma; Weiming Zhang; Ee-Chien Chang; |
| 151 | TeRF: Text-driven and Region-aware Flexible Visible and Infrared Image Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: On the one hand, we propose a flexible image fusion framework with multiple large language and vision models, which facilitates the visual-text interaction. |
Hebaixu Wang; Hao Zhang; Xunpeng Yi; Xinyu Xiang; Leyuan Fang; Jiayi Ma; |
| 152 | Domain-Conditioned Transformer for Fully Test-time Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We observe that, when applying a transformer network model into a new domain, the self-attention profiles of image samples in the target domain deviate significantly from those in the source domain, which results in large performance degradation during domain changes. To address this important issue, we propose a new structure for the self-attention modules in the transformer. |
Yushun Tang; Shuoshuo Chen; Jiyuan Jia; Yi Zhang; Zhihai He; |
| 153 | Bridging Gaps in Content and Knowledge for Multimodal Entity Linking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The other is the knowledge gap, indicating insufficient knowledge extraction and reasoning during the linking process. To bridge these gaps, we propose a novel framework FissFuse, as well as a plug-and-play knowledge-aware re-ranking method KAR. |
Pengfei Luo; Tong Xu; Che Liu; Suojuan Zhang; Linli Xu; Minglei Li; Enhong Chen; |
| 154 | Multi-Granularity Hand Action Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To handle multi-granularity in hand actions, we propose MG-HAD, an End-to-End Multi-Granularity Hand Action Detection method. |
Ting Zhe; Jing Zhang; Yongqian Li; Yong Luo; Han Hu; Dacheng Tao; |
| 155 | Towards Low-latency Event-based Visual Recognition with Hybrid Step-wise Distillation Spiking Neural Networks Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose Hybrid Step-wise Distillation (HSD) method, tailored for neuromorphic datasets, to mitigate the notable decline in performance at lower time steps. |
Xian Zhong; Shengwang Hu; Wenxuan Liu; Wenxin Huang; Jianhao Ding; Zhaofei Yu; Tiejun Huang; |
| 156 | Towards High-performance Spiking Transformers from ANN to SNN Conversion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose an Expectation Compensation Module to preserve the accuracy of the conversion. |
Zihan Huang; Xinyu Shi; Zecheng Hao; Tong Bu; Jianhao Ding; Zhaofei Yu; Tiejun Huang; |
| 157 | Open-Vocabulary Video Scene Graph Generation Via Union-aware Semantic Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite the popular VLMs facilitating preliminary exploration of open-vocabulary VidSGG tasks, the correspondence between visual union regions and relation predicates is usually ignored. Therefore, we propose an Open-vocabulary VidSGG framework named Union-Aware Semantic Alignment Network (UASAN) to explore the alignment between visual union regions and relation predicate concepts in the same semantic space. |
Ziyue Wu; Junyu Gao; Changsheng Xu; |
| 158 | Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce a Quantization-aware Scale LeArning method based on multimodal Warmup, termed QSLAW. |
JingJing Xie; Yuxin Zhang; Mingbao Lin; Liujuan Cao; Rongrong Ji; |
| 159 | FRADE: Forgery-aware Audio-distilled Multimodal Learning for Deepfake Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these approaches may undermine the prior knowledge of pretrained ViTs and ignore the domain gap between different modalities, resulting in unsatisfactory performance. To tackle these challenges, in this paper, we propose a new framework, i.e., Forgery-aware Audio-distilled Multimodal Learning (FRADE), for deepfake detection. |
Fan Nie; Jiangqun Ni; Jian Zhang; Bin Zhang; Weizhe Zhang; |
| 160 | PhysReaction: Physically Plausible Real-Time Humanoid Reaction Synthesis Via Forward Dynamics Guided 4D Imitation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a Forward Dynamics Guided 4D Imitation method to generate physically plausible human-like reactions. |
Yunze Liu; Changxi Chen; Chenjing Ding; Li Yi; |
| 161 | QE-BEV: Query Evolution for Bird’s Eye View Object Detection in Varied Contexts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, prior implementations of dynamic queries have often faced difficulties in effectively leveraging these relationships, particularly when it comes to integrating temporal information in a computationally efficient manner. Addressing this limitation, we introduce a framework utilizing a dynamic query evolution strategy, which harnesses K-means clustering and Top-K attention mechanisms for refined spatio-temporal data processing. |
Jiawei Yao; Yingxin Lai; Hongrui Kou; Tong Wu; Ruixi Liu; |
| 162 | White-box Multimodal Jailbreaks Against Large Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Specifically, we propose a dual optimization objective aimed at guiding the model to generate highly toxic affirmative responses. |
Ruofan Wang; Xingjun Ma; Hanxu Zhou; Chuanjun Ji; Guangnan Ye; Yu-Gang Jiang; |
| 163 | G-Refine: A General Quality Refiner for Text-to-Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In terms of both perception and alignment, existing models cannot always guarantee high-quality results. To mitigate this limitation, we introduce G-Refine, a general image quality refiner designed to enhance low-quality images without compromising the integrity of high-quality ones. |
Chunyi Li; Haoning Wu; Hongkun Hao; Zicheng Zhang; Tengchuan Kou; Chaofeng Chen; Lei Bai; Xiaohong Liu; Weisi Lin; Guangtao Zhai; |
| 164 | Highly Transferable Diffusion-based Unrestricted Adversarial Attack on Pre-trained Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Stable Diffusion, which contains multiple cross-attention modules, possesses great potential in facilitating adversarial transferability by leveraging abundant cross-modal interactions. Therefore, we propose a Multimodal Diffusion-based Attack (MDA), which conducts adversarial attacks against VLMs using Stable Diffusion. |
Wenzhuo Xu; Kai Chen; Ziyi Gao; Zhipeng Wei; Jingjing Chen; Yu-Gang Jiang; |
| 165 | SleepMG: Multimodal Generalizable Sleep Staging with Inter-modal Balance of Classification and Domain Discrimination Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To balance inter-modal differences and achieve highly accurate cross-domain sleep staging, we propose SleepMG, a Multimodal Generalizable Sleep staging method. |
Shuo Ma; Yingwei Zhang; Qiqi Zhang; Yiqiang Chen; Haoran Wang; Ziyu Jia; |
| 166 | Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models Within Perturbed Inputs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Prior works primarily center on evaluating hallucination using standard, unperturbed benchmarks, which overlook the prevalent occurrence of perturbed inputs in real-world scenarios, such as image cropping or blurring, that are critical for a comprehensive assessment of MLLMs’ hallucination. In this paper, to bridge this gap, we propose Hallu-PI, the first benchmark designed to evaluate Hallucination in MLLMs within Perturbed Inputs. |
Peng Ding; Jingyu Wu; Jun Kuang; Dan Ma; Xuezhi Cao; Xunliang Cai; Shi Chen; Jiajun Chen; Shujian Huang; |
| 167 | Future Motion Dynamic Modeling Via Hybrid Supervision for Multi-Person Motion Prediction Uncertainty Reduction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing approaches always overlook the motion dynamic modeling among the prediction frames to reduce the uncertainty, but leave it entirely up to the deep neural networks, which lack a dynamic inductive bias, leading to suboptimal performance. This paper addresses this limitation by proposing an effective multi-person motion prediction method named Hybrid Supervision Transformer (HSFormer), which formulates the dynamic modeling within the prediction horizon as a novel hybrid supervision task. |
Yan Zhuang; Yanlu Cai; Weizhong Zhang; Cheng Jin; |
| 168 | Visual-Language Collaborative Representation Network for Broad-Domain Few-Shot Image Classification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods excessively depend on the multi-modal information representation and alignment capabilities acquired from CLIP pre-training, which hinders accurate generalization to unfamiliar domains. To address this issue, this paper introduces a novel visual-language collaborative representation network (MCRNet), aiming at acquiring a generalized capability for collaborative fusion and representation of multi-modal information. |
Qianyu Guo; Jieji Ren; Haofen Wang; Tianxing Wu; Weifeng Ge; Wenqiang Zhang; |
| 169 | Embodied Contrastive Learning with Geometric Consistency and Behavioral Awareness for Object Navigation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, current agents are built upon occlusion-prone visual observations or compressed 2D semantic maps, which hinder their embodied perception of 3D scene geometry and easily lead to ambiguous object localization and blind exploration. To address these limitations, we present an Embodied Contrastive Learning (ECL) method with Geometric Consistency (GC) and Behavioral Awareness (BA), which motivates agents to actively encode 3D scene layouts and semantic cues. |
Bolei Chen; Jiaxu Kang; Ping Zhong; Yixiong Liang; Yu Sheng; Jianxin Wang; |
| 170 | DEITalk: Speech-Driven 3D Facial Animation with Dynamic Emotional Intensity Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To model emotional saliency variations in long-term audio contexts, we design a dynamic emotional intensity (DEI) modeling module and a dynamic positional encoding (DPE) strategy. |
Kang Shen; Haifeng Xia; Guangxing Geng; Guangyue Geng; Siyu Xia; Zhengming Ding; |
| 171 | Safe-SD: Safe and Traceable Stable Diffusion with Text Prompt Trigger for Invisible Generative Watermarking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a Safe and high-traceable Stable Diffusion framework (namely Safe-SD) to adaptively implant the graphical watermarks (e.g., QR code) into the imperceptible structure-related pixels during the generative diffusion process for supporting text-driven invisible watermarking and detection. |
Zhiyuan Ma; Guoli Jia; Biqing Qi; Bowen Zhou; |
| 172 | HICEScore: A Hierarchical Metric for Image Captioning Evaluation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To move forward, we propose a novel reference-free metric for image captioning evaluation, dubbed Hierarchical Image Captioning Evaluation Score (HICE-S). |
Zequn Zeng; Jianqiao Sun; Hao Zhang; Tiansheng Wen; Yudi Su; Yan Xie; Zhengjue Wang; Bo Chen; |
| 173 | One-bit Deep Hashing: Towards Resource-Efficient Hashing Model with Binary Neural Network Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Second, the evolution of hash code aggregation undergoes two stages in BNN-DH, which is different from CNN-based DH. Based on these findings, we designed a strong and general method called One-bit Deep Hashing (ODH). |
Liyang He; Zhenya Huang; Chenglong Liu; Rui Li; Runze Wu; Qi Liu; Enhong Chen; |
| 174 | Spatio-temporal Heterogeneous Federated Learning for Time Series Classification with Multi-view Orthogonal Training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we focus on sensitive time series data collected by distributed sensors in real-world applications. |
Chenrui Wu; Haishuai Wang; Xiang Zhang; Zhen Fang; Jiajun Bu; |
| 175 | DreamVTON: Customizing 3D Virtual Try-on with Personalized Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel customizing 3D human try-on model, named DreamVTON, to separately optimize the geometry and texture of the 3D human. |
Zhenyu Xie; Haoye Dong; Yufei Gao; Zehua Ma; Xiaodan Liang; |
| 176 | Seeing Text in The Dark: Algorithm and Benchmark Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose an efficient and effective single-stage approach for localizing text in the dark that circumvents the need for LLE. |
Chengpei Xu; Hao Fu; Long Ma; Wenjing Jia; Chengqi Zhang; Feng Xia; Xiaoyu Ai; Binghao Li; Wenjie Zhang; |
| 177 | Cover-separable Fixed Neural Network Steganography Via Deep Generative Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, the stego-images generated by the existing FNNS methods exhibit high distortion, which is prone to be detected by steganalysis tools. To deal with this issue, we propose a Cover-separable Fixed Neural Network Steganography, namely Cs-FNNS. |
Guobiao Li; Sheng Li; Zhenxing Qian; Xinpeng Zhang; |
| 178 | Rethinking The Implicit Optimization Paradigm with Dual Alignments for Referring Remote Sensing Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we rethink the issues with the implicit optimization paradigm and address the RRSIS task from a dual-alignment perspective. |
Yuwen Pan; Rui Sun; Yuan Wang; Tianzhu Zhang; Yongdong Zhang; |
| 179 | Test-Time Training on Graphs with Large Language Models (LLMs) Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by the great annotation ability of Large Language Models (LLMs) on Text-Attributed Graphs (TAGs), we propose to enhance the test-time training on graphs with LLMs as annotators. |
Jiaxin Zhang; Yiqi Wang; Xihong Yang; Siwei Wang; Yu Feng; Yu Shi; Ruichao Ren; En Zhu; Xinwang Liu; |
| 180 | DisControlFace: Adding Disentangled Control to Diffusion Autoencoder for One-shot Explicit Facial Image Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we focus on exploring explicit fine-grained control of generative facial image editing, all while generating faithful facial appearances and consistent semantic details, which, however, is quite challenging and has not been extensively explored, especially under a one-shot scenario. |
Haozhe Jia; Yan Li; Hengfei Cui; Di Xu; Yuwang Wang; Tao Yu; |
| 181 | VmambaSCI: Dynamic Deep Unfolding Network with Mamba for Compressive Spectral Imaging Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a dynamic deep unfolding network with mamba for compressive spectral imaging, called VmambaSCI. |
Mingjin Zhang; Longyi Li; Wenxuan Shi; Jie Guo; Yunsong Li; Xinbo Gao; |
| 182 | Object-Level Pseudo-3D Lifting for Distance-Aware Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We observe the natural advantage of objects being well-separated in high-dimensional space and propose a novel 2D MOT framework, “Detecting-Lifting-Tracking” (DLT). |
Haoyuan Jin; Xuesong Nie; Yunfeng Yan; Xi Chen; Zhihang Zhu; Donglian Qi; |
| 183 | Few-Shot Multimodal Explanation for Visual Question Answering Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper aims to promote explainable VQA from both data and method perspectives. |
Dizhan Xue; Shengsheng Qian; Changsheng Xu; |
| 184 | Identity-Driven Multimedia Forgery Detection Via Reference Assistance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, official media concerning relevant identities on the Internet can serve as prior knowledge, aiding both the audience and forgery detectors in determining the true identity. Therefore, we propose an identity-driven multimedia forgery dataset, IDForge, which contains 249,138 video shots sourced from 324 wild videos of 54 celebrities collected from the Internet. |
Junhao Xu; Jingjing Chen; Xue Song; Feng Han; Haijun Shan; Yu-Gang Jiang; |
| 185 | Explore Hybrid Modeling for Moving Infrared Small Target Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a hybrid modeling method for moving infrared small target detection via smoothed-particle hydrodynamics (SPH) and Markov decision processes (MDP). |
Mingjin Zhang; Shilong Liu; Yuanjun Ouyang; Jie Guo; Zhihong Tang; Yunsong Li; |
| 186 | HazeSpace2M: A Dataset for Haze Aware Single Image Dehazing Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Using the dataset, we introduce a technique of haze type classification followed by specialized dehazers to clear hazy images. |
Md Tanvir Islam; Nasir Rahim; Saeed Anwar; Muhammad Saqib; Sambit Bakshi; Khan Muhammad; |
| 187 | Streamable Portrait Video Editing with Probabilistic Pixel Correspondence Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present a brand new system, StreamEdit, which is primarily designed to edit streaming videos. |
Xiaodi Li; |
| 188 | Shapley Value-based Contrastive Alignment for Multimodal Information Extraction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methodologies primarily rely on direct Image-Text interactions, a paradigm that often faces significant challenges due to semantic and modality gaps between images and text. In this paper, we introduce a new paradigm of Image-Context-Text interaction, where large multimodal models (LMMs) are utilized to generate descriptive textual context to bridge these gaps. |
Wen Luo; Yu Xia; Shen Tianshu; Sujian Li; |
| 189 | SkipVSR: Adaptive Patch Routing for Video Super-Resolution with Inter-Frame Mask Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To accelerate the inference of VSR models, we propose a scalable method based on adaptive patch routing to achieve practical speedup. |
Zekun Ai; Xiaotong Luo; Yanyun Qu; Yuan Xie; |
| 190 | Learning in Order! A Sequential Strategy to Learn Invariant Features for Multimodal Sentiment Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To estimate sentiment polarities on unseen out-of-distribution data, we introduce a multimodal model that is trained either in a single source domain or multiple source domains using our learning strategy. |
Xianbing Zhao; Lizhen Qu; Tao Feng; Jianfei Cai; Buzhou Tang; |
| 191 | Navigating Weight Prediction with Diet Diary Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Furthermore, we propose a novel task of weight prediction with a dietary diary that aims to leverage historical food intake and weight to predict future weights. To tackle this task, we propose a model-agnostic time series forecasting framework. |
Yinxuan Gui; Bin Zhu; Jingjing Chen; Chong Wah Ngo; Yu-Gang Jiang; |
| 192 | Multimodal Inplace Prompt Tuning for Open-set Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In order to fully leverage the capabilities of large language models and augment prompt encoding for detection, this study introduces a redundancy assessment metric to identify uniform attention patterns. |
Guilin Li; Mengdan Zhang; Xiawu Zheng; Peixian Chen; Zihan Wang; Yunhang Shen; Mingchen Zhuge; Chenglin Wu; Fei Chao; Ke Li; Xing Sun; Rongrong Ji; |
| 193 | Cross-Modal Meta Consensus for Heterogeneous Federated Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a meta learning strategy tailored for Multimodal Federated Learning in a multitask setting, which harmonizes intra-modal and inter-modal feature spaces through the Cross-Modal Meta Consensus. |
Shuai Li; Fan Qi; Zixin Zhang; Changsheng Xu; |
| 194 | SFP: Spurious Feature-Targeted Pruning for Out-of-Distribution Generalization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The precise adaptation of model sparsity, specifically tailored for spurious features, remains a significant challenge. Motivated by the insight that in-distribution (ID) data containing spurious features may exhibit lower experiential risk, we propose a novel Spurious Feature-targeted Pruning framework, dubbed SFP, to induce the authentic invariant substructures without referring to the above concerns. |
Yingchun Wang; Jingcai Guo; Song Guo; Yi Liu; Jie Zhang; Weizhan Zhang; |
| 195 | StarStream: Live Video Analytics Over Space Networking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We accordingly develop StarStream, a novel LSN-adaptive streaming framework for LVA. |
Miao Zhang; Jiaxing Li; Haoyuan Zhao; Linfeng Shen; Jiangchuan Liu; |
| 196 | Towards Stricter Black-box Integrity Verification of Deep Neural Network Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our analysis reveals that existing fingerprinting methods, which are typically focused on significant tampering, lack the sensitivity needed to effectively detect subtle yet common and potentially severe modifications. To address this limitation, we propose MiSentry (Model Integrity Sentry), a novel fingerprinting method that leverages meta-learning. |
Chaoxiang He; Xiaofan Bai; Xiaojing Ma; Bin B. Zhu; Pingyi Hu; Jiayun Fu; Hai Jin; Dongmei Zhang; |
| 197 | Cantor: Inspiring Multimodal Chain-of-Thought of MLLM Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture. |
Timin Gao; Peixian Chen; Mengdan Zhang; Chaoyou Fu; Yunhang Shen; Yan Zhang; Shengchuan Zhang; Xiawu Zheng; Xing Sun; Liujuan Cao; Rongrong Ji; |
| 198 | Sampling to Distill: Knowledge Transfer from Open-World Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To tackle the issue, we propose a novel Open-world Data Sampling Distillation (ODSD) method for the DFKD task without the redundant generation process. |
Yuzheng Wang; Zhaoyu Chen; Jie Zhang; Dingkang Yang; Zuhao Ge; Yang Liu; Siao Liu; Yunquan Sun; Wenqiang Zhang; Lizhe Qi; |
| 199 | PrimKD: Primary Modality Guided Multimodal Fusion for RGB-D Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This oversight can potentially hinder segmentation performance, especially considering that RGB images typically contain significantly more information than depth images. To address this issue, we propose PrimKD, a knowledge distillation based approach that focuses on guided multimodal fusion, with an emphasis on leveraging the primary RGB modality. |
Zhiwei Hao; Zhongyu Xiao; Yong Luo; Jianyuan Guo; Jing Wang; Li Shen; Han Hu; |
| 200 | Visual-linguistic Cross-domain Feature Learning with Group Attention and Gamma-correct Gated Fusion for Extracting Commonsense Knowledge Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the retrieved images may not always cover all possible relations, and the informative features across the bag of images are often overlooked. To address these challenges, a Multi-modal Cross-domain Feature Learning framework is proposed to incorporate the general domain knowledge from a large vision-text foundation model, ViT-GPT2, to handle unseen relations and exploit complementary information from multiple sources. |
Jialu Zhang; Xinyi Wang; Chenglin Yao; Jianfeng Ren; Xudong Jiang; |
| 201 | Blind Face Video Restoration with Temporal Consistent Generative Prior and Degradation-Aware Prompt Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these methods, primarily designed for images, encounter challenges in maintaining temporal consistency when applied to face video restoration. To tackle this issue, we introduce StableBFVR, an innovative Blind Face Video Restoration method based on Stable Diffusion that incorporates temporal information into the generative prior. |
Jingfan Tan; Hyunhee Park; Ying Zhang; Tao Wang; Kaihao Zhang; Xiangyu Kong; Pengwen Dai; Zikun Liu; Wenhan Luo; |
| 202 | PRISM: PRogressive Dependency MaxImization for Scale-invariant Image Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Meanwhile, the scale discrepancy issue still troubles existing methods. To address the above issues, we propose PRogressive dependency maxImization for Scale-invariant image Matching (PRISM), which jointly prunes irrelevant patch features and tackles the scale discrepancy. |
Xudong Cai; Yongcai Wang; Lun Luo; Minhang Wang; Deying Li; Jintao Xu; Weihao Gu; Rui Ai; |
| 203 | Simple Yet Effective: Structure Guided Pre-trained Transformer for Multi-modal Knowledge Graph Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Various information in different modalities is stored in an intuitive way in multi-modal knowledge graphs (MKGs), which are utilized in different downstream tasks, like recommendation. … |
Ke Liang; Lingyuan Meng; Yue Liu; Meng Liu; Wei Wei; Suyuan Liu; Wenxuan Tu; Siwei Wang; Sihang Zhou; Xinwang Liu; |
| 204 | Selective Vision-Language Subspace Projection for Few-shot CLIP Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, most existing methods overlook modality gaps in CLIP’s encoded features, which is shown as the text and image features lie far apart from each other, resulting in limited classification performance. To tackle this issue, we introduce a method called Selective Vision-Language Subspace Projection (SSP), which incorporates local image features and utilizes them as a bridge to enhance the alignment between image-text pairs. |
Xingyu Zhu; Beier Zhu; Yi Tan; Shuo Wang; Yanbin Hao; Hanwang Zhang; |
| 205 | MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose MoTrans, a customized motion transfer method enabling video generation with similar motion in new contexts. |
Xiaomin Li; Xu Jia; Qinghe Wang; Haiwen Diao; Mengmeng Ge; Pengxiang Li; You He; Huchuan Lu; |
| 206 | Event-Guided Rolling Shutter Correction with Time-Aware Cross-Attentions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we explore the characteristics of RS images and event data for the design of the rolling shutter correction (RSC) model. |
Hefei Huang; Xu Jia; Xinyu Zhang; Shengming Li; Huchuan Lu; |
| 207 | MSFNet: Multi-Scale Fusion Network for Brain-Controlled Speaker Extraction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a multi-scale fusion network (MSFNet) for brain-controlled speaker extraction, which utilizes the EEG recorded from the listener to extract the target speech. |
Cunhang Fan; Jingjing Zhang; Hongyu Zhang; Wang Xiang; Jianhua Tao; Xinhui Li; Jiangyan Yi; Dianbo Sui; Zhao Lv; |
| 208 | Data Generation Scheme for Thermal Modality with Edge-Guided Adversarial Conditional Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end, this paper introduces a novel approach termed the edge-guided conditional diffusion model (ECDM). |
Guoqing Zhu; Honghu Pan; Qiang Wang; Chao Tian; Chao Yang; Zhenyu He; |
| 209 | DisenStudio: Customized Multi-Subject Text-to-Video Generation with Disentangled Spatial Control Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To tackle the problems, in this paper, we propose DisenStudio, a novel framework that can generate text-guided videos for customized multiple subjects, given few images for each subject. |
Hong Chen; Xin Wang; Yipeng Zhang; Yuwei Zhou; Zeyang Zhang; Siao Tang; Wenwu Zhu; |
| 210 | VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The conventional finetuning process with the randomly sampled data points results in diminished training efficiency. To address this drawback, we propose a novel approach, Vision-language Collaborative Active Finetuning (VeCAF). |
Rongyu Zhang; Zefan Cai; Huanrui Yang; Zidong Liu; Denis Gudovskiy; Tomoyuki Okuno; Yohei Nakata; Kurt Keutzer; Baobao Chang; Yuan Du; Li Du; Shanghang Zhang; |
| 211 | Asymmetric Event-Guided Video Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This assumption proves limiting in emerging high-resolution devices, such as dual-lens smartphones and unmanned aerial vehicles, where such precise calibration is typically unavailable. To unlock more event-guided application scenarios, we perform the task of asymmetric event-guided VSR for the first time, and we propose an Asymmetric Event-guided VSR Network (AsEVSRN) for this new task. |
Zeyu Xiao; Dachun Kai; Yueyi Zhang; Xiaoyan Sun; Zhiwei Xiong; |
| 212 | Neighbor Does Matter: Curriculum Global Positive-Negative Sampling for Vision-Language Pre-training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the sampling strategies adopted by current VLP works are limited in two ways: i) they only focus on negative sampling, ignoring the importance of more informative positive samples; ii) their sampling strategies are conducted in the local in-batch level, which may lead to sub-optimal results. To tackle these problems, in this paper, we propose a curriculum-based Global Positive-Negative Sampling (GPN-S) framework for vision-language pre-training, which conducts both positive and negative sampling in the global level, grounded on the notion of neighborhood relationships. |
Bin Huang; Feng He; Qi Wang; Hong Chen; Guohao Li; Zhifan Feng; Xin Wang; Wenwu Zhu; |
| 213 | DiffGlue: Diffusion-Aided Image Feature Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel method called DiffGlue that introduces the Diffusion Model into the sparse image feature matching framework. |
Shihua Zhang; Jiayi Ma; |
| 214 | RDLNet: A Novel and Accurate Real-world Document Localization Method Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The increasing use of smartphones for capturing documents in various real-world conditions has underscored the need for robust document localization technologies. |
Yaqiang Wu; Zhen Xu; Yong Duan; Yanlai Wu; Qinghua Zheng; Hui Li; Xiaochen Hu; Lianwen Jin; |
| 215 | AIGCs Confuse AI Too: Investigating and Explaining Synthetic Image-induced Hallucinations in Large Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this study, we underscore the exacerbated hallucination phenomena in Large Vision-Language Models (LVLMs) caused by AI-synthetic images. |
Yifei Gao; Jiaqi Wang; Zhiyu Lin; Jitao Sang; |
| 216 | TrGa: Reconsidering The Application of Graph Neural Networks in Two-View Correspondence Pruning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In addition, previous works directly utilize the outputs of off-the-shelf GNNs, thus leading to confusion between sparse correspondence attribute features and their global structural information. To alleviate these issues, we propose a two-view correspondence pruning network TrGa. |
Luanyuan Dai; Xiaoyu Du; Jinhui Tang; |
| 217 | Differential-Perceptive and Retrieval-Augmented MLLM for Change Captioning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we aim to empower MLLMs with the capability to perceive subtle differences between paired images and enhance their performance in generating change captions. |
Xian Zhang; Haokun Wen; Jianlong Wu; Pengda Qin; Hui Xue’; Liqiang Nie; |
| 218 | Robust Variational Contrastive Learning for Partially View-unaligned Clustering Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: While some methods are proposed to align the unaligned views by learning view-invariant representations, almost all of them overlook specific information across different views for complementarity, limiting performance improvement. To address these problems, we propose a robust framework, dubbed VariatIonal ConTrAstive Learning (VITAL), designed to learn both common and specific information simultaneously. |
Changhao He; Hongyuan Zhu; Peng Hu; Xi Peng; |
| 219 | Understanding The Impact of AI-Generated Content on Social Media: The Pixiv Case Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Yet, to date, we know little about how the arrival of AIGC has impacted the social media ecosystem. To fill this gap, we present a comprehensive study of Pixiv, an online community for artists who wish to share and receive feedback on their illustrations. |
Yiluo Wei; Gareth Tyson; |
| 220 | Exploring The Use of Abusive Generative AI Models on Civitai Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This has led to the emergence of AI-Generated Content (AIGC) social platforms, such as Civitai. These distinctive social platforms allow users to build and share their own generative AI models, thereby enhancing the potential for more diverse artistic expression. |
Yiluo Wei; Yiming Zhu; Pan Hui; Gareth Tyson; |
| 221 | REmoNet: Reducing Emotional Label Noise Via Multi-regularized Self-supervision Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In addition, there is a need to comprehensively capture the temporal-spatial-spectral characteristics of EEG signals and to cope with low signal-to-noise ratio (SNR) issues. To tackle these challenges, we propose a comprehensive pipeline named REmoNet, which leverages novel self-supervised techniques and multi-regularized co-learning. |
Wei-Bang Jiang; Yu-Ting Lan; Bao-Liang Lu; |
| 222 | Relational Diffusion Distillation for Efficient Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: How to transfer better diffusion knowledge from teacher models is a more valuable problem but rarely studied. Therefore, we propose Relational Diffusion Distillation (RDD), a novel distillation method tailored specifically for distilling diffusion models. |
Weilun Feng; Chuanguang Yang; Zhulin An; Libo Huang; Boyu Diao; Fei Wang; Yongjun Xu; |
| 223 | SelM: Selective Mechanism Based Audio-Visual Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Consequently, they struggle with illusion issues, where nonexistent audio cues are erroneously linked to visual objects. In this paper, we present SelM, a novel architecture that leverages selective mechanisms to counteract these illusions. |
Jiaxu Li; Songsong Yu; Yifan Wang; Lijun Wang; Huchuan Lu; |
| 224 | Sketch3D: Style-Consistent Guidance for Sketch-to-3D Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To overcome them, this paper proposes a novel generation paradigm, Sketch3D, to generate realistic 3D assets whose shape aligns with the input sketch and whose color matches the textual description. |
Wangguandong Zheng; Haifeng Xia; Rui Chen; Libo Sun; Ming Shao; Siyu Xia; Zhengming Ding; |
| 225 | Robust Live Streaming Over LEO Satellite Constellations: Measurement, Analysis, and Handover-Aware Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Yet this rise in popularity contrasts with the reality that a substantial segment of the global population still lacks Internet access. The emergence of Low Earth orbit Satellite Networks (LSNs), such as SpaceX’s Starlink and Amazon’s Project Kuiper, presents a promising solution to fill this gap. |
Hao Fang; Haoyuan Zhao; Jianxin Shi; Miao Zhang; Guanzhen Wu; Yi Ching Chou; Feng Wang; Jiangchuan Liu; |
| 226 | Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we harness the knowledge in LLMs to produce a set of descriptive sentences that contain distinctive features for identifying given actions. |
Chengyou Jia; Minnan Luo; Xiaojun Chang; Zhuohang Dang; Mingfei Han; Mengmeng Wang; Guang Dai; Sizhe Dang; Jingdong Wang; |
| 227 | ListenFormer: Responsive Listening Head Generation with Non-autoregressive Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methods exhibit two limitations: 1) the generation capability of their models is limited, resulting in generated videos that are far from real ones, and 2) they mostly employ autoregressive generative models, unable to mitigate the risk of error accumulation. To tackle these issues, we propose ListenFormer, which leverages the powerful temporal modeling capability of transformers for generation. |
Miao Liu; Jing Wang; Xinyuan Qian; Haizhou Li; |
| 228 | De-fine: Decomposing and Refining Visual Programs with Auto-Feedback Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by Benders decomposition, we introduce De-fine, a training-free framework that automatically decomposes complex tasks into simpler subtasks and refines programs through auto-feedback. |
Minghe Gao; Juncheng Li; Hao Fei; Liang Pang; Wei Ji; Guoming Wang; Zheqi Lv; Wenqiao Zhang; Siliang Tang; Yueting Zhuang; |
| 229 | Explicit Granularity and Implicit Scale Correspondence Learning for Point-Supervised Video Moment Localization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose a Semantic Granularity and Scale Correspondence Integration (SG-SCI) framework aimed at leveraging limited single-frame annotation for correspondence learning. |
Kun Wang; Hao Liu; Lirong Jie; Zixu Li; Yupeng Hu; Liqiang Nie; |
| 230 | Task-Adapter: Task-specific Adaptation of Image Models for Few-shot Action Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a simple but effective task-specific adaptation method (Task-Adapter) for few-shot action recognition. |
Congqi Cao; Yueran Zhang; Yating Yu; Qinyi Lv; Lingtong Min; Yanning Zhang; |
| 231 | MagicFight: Personalized Martial Arts Combat Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We identify a significant gap: existing models for single-person dancing generation prove insufficient for capturing the subtleties and complexities of two engaged fighters, resulting in challenges such as identity confusion, anomalous limbs, and action mismatches. To address this, we introduce a pioneering new task, Personalized Martial Arts Combat Video Generation. |
Jiancheng Huang; Mingfu Yan; Songyan Chen; Yi Huang; Shifeng Chen; |
| 232 | CREST: Cross-modal Resonance Through Evidential Deep Learning for Enhanced Zero-Shot Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In response, we propose a bidirectional cross-modal ZSL approach CREST. |
Haojian Huang; Xiaozhennn Qiao; Zhuo Chen; Haodong Chen; Bingyu Li; Zhe Sun; Mulin Chen; Xuelong Li; |
| 233 | Mamba3D: Enhancing Local Features for 3D Point Cloud Analysis Via State Space Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we present Mamba3D, a state space model tailored for point cloud learning to enhance local feature extraction, achieving superior performance, high efficiency, and scalability potential. |
Xu Han; Yuan Tang; Zhaoxuan Wang; Xianzhi Li; |
| 234 | Improving Interaction Comfort in Authoring Task in AR-HRI Through Dynamic Dual-Layer Interaction Adjustment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study proposes a dynamic dual-layer interaction adjustment mechanism to improve user comfort and interaction efficiency. |
Yunqiang Pei; Keiyue Zhang; Hongrong Yang; Yong Tao; Qihang Tang; Jialei Tang; Guoqing Wang; Zhitao Liu; Ning Xie; Peng Wang; Yang Yang; Hengtao Shen; |
| 235 | Emotion Recognition in HMDs: A Multi-task Approach Using Physiological Signals and Occluded Faces Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The study presents a novel approach to emotion recognition in XR, addressing the limitations of facial occlusion by HMDs. |
Yunqiang Pei; Jialei Tang; Qihang Tang; Mingfeng Zha; Dongyu Xie; Guoqing Wang; Zhitao Liu; Ning Xie; Peng Wang; Yang Yang; Hengtao Shen; |
| 236 | RSC-SNN: Exploring The Trade-off Between Adversarial Robustness and Accuracy in Spiking Neural Networks Via Randomized Smoothing Coding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, it is still unclear in theory how the adversarial robustness of SNNs is derived, and whether SNNs can still maintain their adversarial robustness advantage on large-scale dataset tasks. This work theoretically demonstrates that SNN’s inherent adversarial robustness stems from its Poisson coding. |
Keming Wu; Man Yao; Yuhong Chou; Xuerui Qiu; Rui Yang; Bo Xu; Guoqi Li; |
| 237 | Diverse Consensuses Paired with Motion Estimation-Based Multi-Model Fitting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel method of diverse Consensuses paired with Motion estimation-based multi-Model Fitting (CMMF), which leverages three types of diverse consensuses along with inter-model collaboration to enhance the effectiveness of multi-model fusion. |
Wenyu Yin; Shuyuan Lin; Yang Lu; Hanzi Wang; |
| 238 | Simplifying Cross-modal Interaction Via Modality-Shared Features for RGBT Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing RGBT tracking methods face challenges due to significant modality differences and selective emphasis on interactive information, leading to inefficiencies in the cross-modal interaction. To address these issues, we propose a novel Integrating Interaction into Modality-shared Features with ViT (IIMF) framework, which is a simplified cross-modal interaction network including modality-shared, RGB modality-specific, and TIR modality-specific branches. |
Liqiu Chen; Yuqing Huang; Hengyu Li; Zikun Zhou; Zhenyu He; |
| 239 | SparseFormer: Detecting Objects in HRW Shots Via Sparse Vision Transformer Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a novel model-agnostic sparse vision transformer, dubbed SparseFormer, to bridge the gap of object detection between close-up and HRW shots. |
Wenxi Li; Yuchen Guo; Jilai Zheng; Haozhe Lin; Chao Ma; Lu Fang; Xiaokang Yang; |
| 240 | PC2: Pseudo-Classification Based Pseudo-Captioning for Noisy Correspondence Learning in Cross-Modal Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Such noise often stems from mismatched data pairs, which is a significant obstacle distinct from traditional noisy labels. This paper introduces Pseudo-Classification based Pseudo-Captioning (PC2) framework to address this challenge. |
Yue Duan; Zhangxuan Gu; Zhenzhe Ying; Lei Qi; Changhua Meng; Yinghuan Shi; |
| 241 | P2SAM: Probabilistically Prompted SAMs Are Efficient Segmentator for Ambiguous Medical Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Generating diverse plausible outputs from a single input is crucial for addressing visual ambiguities, exemplified in medical imaging where experts may provide varying semantic segmentation annotations for the same image. Existing methods handle ambiguous segmentation by relying on probabilistic modeling and extensive multi-output annotated data, while often struggling with the limited ambiguously labeled datasets common in real-world applications. To surmount the challenge, we propose P²SAM, a novel framework that leverages the Segment Anything Model (SAM)’s prior knowledge for ambiguous object segmentation. |
Yuzhi Huang; Chenxin Li; Zixu Lin; Hengyu Liu; Haote Xu; Yifan Liu; Yue Huang; Xinghao Ding; Xiaotong Tu; Yixuan Yuan; |
| 242 | SpikeGS: 3D Gaussian Splatting from Spike Streams with High-Speed Camera Motion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, to train SpikeGS, we establish computational equations between the rendering process of 3DGS and the processes of instantaneous imaging and exposing-like imaging of the continuous spike stream. |
Jiyuan Zhang; Kang Chen; Shiyan Chen; Yajing Zheng; Tiejun Huang; Zhaofei Yu; |
| 243 | Prompt2Poster: Automatically Artistic Chinese Poster Creation from Prompt Only Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To create desired artistic Chinese posters including an aligned background, reasonable layouts, and stylized graphical texts from given prompts only, we propose an automatic poster creation framework, named Prompt2Poster. |
Shaodong Wang; Yunyang Ge; Liuhan Chen; Haiyang Zhou; Qian Wang; Xinhua Cheng; Li Yuan; |
| 244 | FM-CLIP: Flexible Modal CLIP for Face Anti-Spoofing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, borrowing a solution from large-scale vision-language models (VLMs) instead of directly removing modality-specific signals from visual features, we propose a novel Flexible Modal CLIP (FM-CLIP) for flexible modal FAS, which can utilize text features to dynamically adjust visual features to be modality independent. |
Ajian Liu; Hui Ma; Junze Zheng; Haocheng Yuan; Xiaoyuan Yu; Yanyan Liang; Sergio Escalera; Jun Wan; Zhen Lei; |
| 245 | Group-aware Parameter-efficient Updating for Content-Adaptive Neural Video Compression Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Subsequently, we introduce a parameter-efficient delta-tuning strategy, which is achieved by integrating several lightweight adapters into each encoding component using both serial and parallel configurations. |
Zhenghao Chen; Luping Zhou; Zhihao Hu; Dong Xu; |
| 246 | Sparse Query Dense: Enhancing 3D Object Detection with Pseudo Points Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Meanwhile, due to the imprecision of depth completion, the pseudo points suffer from noise and local structural ambiguity, which limit the further improvement of detection accuracy. This paper presents SQDNet, a novel framework designed to address these challenges. |
Yujian Mo; Yan Wu; Junqiao Zhao; Zhenjie Hou; Weiquan Huang; Yinghao Hu; Jijun Wang; Jun Yan; |
| 247 | Revisiting Unsupervised Temporal Action Localization: The Primacy of High-Quality Actionness and Pseudolabels Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Which pseudolabeled instances from clustering should be chosen for model training? After extensive explorations, we propose a novel yet simple framework called Consistency-Oriented Progressive high actionness Learning to address these issues. |
Han Jiang; Haoyu Tang; Ming Yan; Ji Zhang; Mingzhu Xu; Yupeng Hu; Jihua Zhu; Liqiang Nie; |
| 248 | MMHead: Towards Fine-grained Multi-modal 3D Facial Animation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Concretely, we integrate five public 2D portrait video datasets, and propose an automatic pipeline to 1) reconstruct 3D facial motion sequences from monocular videos; and 2) obtain hierarchical text annotations with the help of AU detection and ChatGPT. |
Sijing Wu; Yunhao Li; Yichao Yan; Huiyu Duan; Ziwei Liu; Guangtao Zhai; |
| 249 | SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Furthermore, existing RGB-based sign retrieval works suffer from the huge memory cost of dense visual data embedding in end-to-end training, and adopt an offline RGB encoder instead, leading to suboptimal feature representation. To address these issues, we propose a novel sign language representation framework called Semantically Enhanced Dual-Stream Encoder (SEDS), which integrates Pose and RGB modalities to represent the local and global information of sign language videos. |
Longtao Jiang; Min Wang; Zecheng Li; Yao Fang; Wengang Zhou; Houqiang Li; |
| 250 | Overcoming The Pitfalls of Vision-Language Model for Image-Text Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite significant advancements facilitated by large-scale Contrastive Language-Image Pretraining (CLIP) models, we found that existing methods fall short in bridging the fine-grained semantic gap between visual and textual representations. To address the above pitfalls, we propose a model called Local and Generative-driven Modality Gap Correction (LG-MGC), which devotes to simultaneously enhancing representation learning and alleviating the modality gap in cross-modal retrieval. |
Feifei Zhang; Sijia Qu; Fan Shi; Changsheng Xu; |
| 251 | Efficient Single Image Super-Resolution with Entropy Attention and Receptive Field Augmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present an efficient SR model to mitigate the dilemma between model efficiency and SR performance, which is dubbed Entropy Attention and Receptive Field Augmentation network (EARFA), and is composed of a novel entropy attention (EA) and a shifting large kernel attention (SLKA). |
Xiaole Zhao; Linze Li; Chengxing Xie; Xiaoming Zhang; Ting Jiang; Wenjie Lin; Shuaicheng Liu; Tianrui Li; |
| 252 | Eliminate Before Align: A Remote Sensing Image-Text Retrieval Framework with Keyword Explicit Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing FM-based approaches neglect the negative impact of weakly correlated sample pairs and the key distinctions among remote sensing texts, leading to biased and superficial exploration of sample pairs. To address these challenges, we propose a novel Eliminate Before Align strategy with Keyword Explicit Reasoning framework (EBAKER) for RSITR. |
Zhong Ji; Changxu Meng; Yan Zhang; Haoran Wang; Yanwei Pang; Jungong Han; |
| 253 | Learning A Low-Level Vision Generalist Via Visual Task Prompt Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In addition, these methods are sensitive to prompt image content and often struggle with low-frequency information processing. In this paper, we propose a Visual task Prompt-based Image Processing (VPIP) framework to overcome these challenges. |
Xiangyu Chen; Yihao Liu; Yuandong Pu; Wenlong Zhang; Jiantao Zhou; Yu Qiao; Chao Dong; |
| 254 | ColVO: Colonoscopic Visual Odometry Considering Geometric and Photometric Consistency Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Secondly, the light source in the colon environment moves with the colonoscope, leading to brightness fluctuations among continuous frame images. To address these two issues, we propose ColVO, a novel deep learning-based Visual Odometry framework, which can continuously estimate colon depth and colonoscopic pose using two key components: a deep couple strategy for depth and pose estimation (DCDP) and a light consistent calibration mechanism (LCC). |
Ruyu Liu; Zhengzhe Liu; Haoyu Zhang; Guodao Zhang; Jianhua Zhang; Weiguo Sheng; Xiufeng Liu; Yaochu Jin; |
| 255 | V2A-Mark: Versatile Deep Visual-Audio Watermarking for Manipulation Localization and Copyright Protection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: AI-generated video has revolutionized short video production, filmmaking, and personalized media, making video local editing an essential tool. However, this progress also blurs … |
Xuanyu Zhang; Youmin Xu; Runyi Li; Jiwen Yu; Weiqi Li; Zhipei Xu; Jian Zhang; |
| 256 | SymAttack: Symmetry-aware Imperceptible Adversarial Attacks on 3D Point Clouds Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Given that adversarial perturbations tend to disrupt the inherent symmetry in objects, we recognize this disruption as the primary cause of the lack of imperceptibility in these attacks. In this paper, we introduce a novel framework, symmetry-aware imperceptible adversarial attacks on 3D point clouds (SymAttack), to address this issue. |
Keke Tang; Zhensu Wang; Weilong Peng; Lujie Huang; Le Wang; Peican Zhu; Wenping Wang; Zhihong Tian; |
| 257 | Multi-Scale and Detail-Enhanced Segment Anything Model for Salient Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Additionally, SAM lacks the utilization of multi-scale and multi-level information, as well as the incorporation of fine-grained details. To address these shortcomings, we propose a Multi-scale and Detail-enhanced SAM (MDSAM) for SOD. |
Shixuan Gao; Pingping Zhang; Tianyu Yan; Huchuan Lu; |
| 258 | Virtual Agent Positioning Driven By Personal Characteristics Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel pipeline for relocating virtual agents in new scenarios based on their personal characteristics. |
Jingjing Liu; Youyi Zheng; Kun Zhou; |
| 259 | OneChart: Purify The Chart Structural Extraction Via One Auxiliary Token Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Even advanced large vision-language models (LVLMs) with billions of parameters struggle to handle such tasks satisfactorily. To address this, we propose OneChart: a reliable agent specifically devised for the structural extraction of chart information. |
Jinyue Chen; Lingyu Kong; Haoran Wei; Chenglong Liu; Zheng Ge; Liang Zhao; Jianjian Sun; Chunrui Han; Xiangyu Zhang; |
| 260 | Medical Report Generation Via Multimodal Spatio-Temporal Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing works normally generate reports from single chest radiographs, although historical examination data also serve as crucial references for radiologists in real-world clinical settings. To address this constraint, we introduce a novel framework that mimics the workflow of radiologists. |
Xin Mei; Rui Mao; Xiaoyan Cai; Libin Yang; Erik Cambria; |
| 261 | Advancing 3D Object Grounding Beyond A Single 3D Scene Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To achieve more accurate localization, we propose a baseline method named GNL3D, a Grouped Neural Listener for 3D grounding in the group-wise setting, which extends the traditional 3D object grounding pipeline with a novel language-guided consensus aggregation and distribution mechanism to explicitly exploit the intra-group visual connections. |
Wencan Huang; Daizong Liu; Wei Hu; |
| 262 | Mesh-Centric Gaussian Splatting for Human Avatar Modelling with Real-time Dynamic Mesh Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study introduces a novel approach, Mesh-Centric Gaussian Splatting (MCGS), which introduces a unique representation Mesh-Centric SDF and optimizes it using high-efficiency Gaussian Splatting. |
Ruiqi Zhang; Jie Chen; |
| 263 | Towards Artist-Like Painting Agents with Multi-Granularity Semantic Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Lacking a direct mapping (and consequently differentiability) between the pixel domain and the stroke parameter search space, these methods often yield non-realistic/artist-incompatible stroke decompositions, hindering their further application in high-quality art generation. To explicitly address this issue, we propose a novel SBR-based image-to-painting framework that aligns with artistic oil painting behaviors/techniques. |
Zhangli Hu; Ye Chen; Zhongyin Zhao; Jinfan Liu; Bilian Ke; Bingbing Ni; |
| 264 | When, Where, and What? A Benchmark for Accident Anticipation and Localization with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Traditional accident anticipation models primarily utilizing dashcam videos are adept at predicting when an accident may occur but fall short in localizing the incident and identifying involved entities. Addressing this gap, this study introduces a novel framework that integrates Large Language Models (LLMs) to enhance predictive capabilities across multiple dimensions: what, when, and where accidents might occur. |
Haicheng Liao; Yongkang Li; Chengyue Wang; Yanchen Guan; Kahou Tam; Chunlin Tian; Li Li; Chengzhong Xu; Zhenning Li; |
| 265 | CRASH: Crash Recognition and Anticipation System Harnessing with Context-Aware and Temporal Focus Attentions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This task presents substantial challenges stemming from the unpredictable nature of traffic accidents, their long-tail distribution, the intricacies of traffic scene dynamics, and the inherently constrained field of vision of onboard cameras. To address these challenges, this study introduces a novel accident anticipation framework for AVs, termed CRASH. |
Haicheng Liao; Haoyu Sun; Huanming Shen; Chengyue Wang; Chunlin Tian; KaHou Tam; Li Li; Chengzhong Xu; Zhenning Li; |
| 266 | COCO-LC: Colorfulness Controllable Language-based Colorization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a novel coarse-to-fine framework, COlorfulness COntrollable Language-based Colorization (COCO-LC), that effectively reinforces the image-text correspondence with coarsely colorized results. |
Yifan Li; Yuhang Bai; Shuai Yang; Jiaying Liu; |
| 267 | X-Prompt: Multi-modal Visual Prompt for Video Object Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a universal framework named X-Prompt for all multi-modal video object segmentation tasks, designated as RGB+X. |
Pinxue Guo; Wanyun Li; Hao Huang; Lingyi Hong; Xinyu Zhou; Zhaoyu Chen; Jinglun Li; Kaixun Jiang; Wei Zhang; Wenqiang Zhang; |
| 268 | Auto DragGAN: Editing The Generative Image Manifold in An Autoregressive Manner Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Through experimental analysis, we discover that a short movement distance from handle points to target points yields a high-fidelity edited image, as the model only needs to predict the movement of a small portion of pixels. |
Pengxiang Cai; Zhiwei Liu; Guibo Zhu; Yunfang Niu; Jinqiao Wang; |
| 269 | Freehand Sketch Generation from Mechanical Components Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To retain essential modeling features as much as possible and rationalize stroke distribution, we introduce a novel edge-constraint stroke initialization. |
Zhichao Liao; Fengyuan Piao; Di Huang; Xinghui Li; Yue Ma; Pingfa Feng; Heming Fang; Long Zeng; |
| 270 | A Picture Is Worth A Graph: A Blueprint Debate Paradigm for Multimodal Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To address the issue, we propose a deductive (top-down) debating approach called Blueprint Debate on Graphs (BDoG). |
Changmeng Zheng; Dayong Liang; Wengyu Zhang; Xiao-Yong Wei; Tat-Seng Chua; Qing Li; |
| 271 | MagicVFX: Visual Effects Synthesis in Just Minutes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The experimental results demonstrate that the pipeline we established can effectively produce impressive visual effects synthesis outcomes, thereby evidencing the significant potential of existing AIGC technology for application in visual effects synthesis tasks. |
Jiaqi Guo; Lianli Gao; Junchen Zhu; Jiaxin Zhang; Siyang Li; Jingkuan Song; |
| 272 | Color4E: Event Demosaicing for Full-color Event Guided Image Deblurring Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The challenges associated with this approach include demosaicing color events for reconstructing full-resolution sampled signals and fusing bimodal signals to achieve image deblurring. To meet these challenges, we propose a novel network called Color4E to enhance the color restoration quality for the image deblurring task. |
Yi Ma; Peiqi Duan; Yuchen Hong; Chu Zhou; Yu Zhang; Jimmy Ren; Boxin Shi; |
| 273 | From Covert Hiding To Visual Editing: Robust Generative Video Steganography Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Traditional video steganography methods are based on modifying the covert space for embedding, whereas we propose an innovative approach that embeds the secret message within semantic features for steganography during the video editing process. |
Xueying Mao; Xiaoxiao Hu; Wanli Peng; Zhenliang Gan; Zhenxing Qian; Xinpeng Zhang; Sheng Li; |
| 274 | Aspects Are Anchors: Towards Multimodal Aspect-based Sentiment Analysis Via Aspect-driven Alignment and Refinement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To alleviate NCP, in this paper, we introduce Aspect-driven Alignment and Refinement (ADAR), which is a two-stage coarse-to-fine alignment framework. |
Zhanpeng Chen; Zhihong Zhu; Wanshi Xu; Yunyan Zhang; Xian Wu; Yefeng Zheng; |
| 275 | FakingRecipe: Detecting Fake News on Short Video Platforms from The Perspective of Creative Process Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Unlike existing works mostly focusing on analyzing what is presented, we introduce a novel perspective that considers how it might be created. |
Yuyan Bu; Qiang Sheng; Juan Cao; Peng Qi; Danding Wang; Jintao Li; |
| 276 | Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we present High-resolution Image-to-3D model (Hi3D), a new video diffusion based paradigm that redefines a single image to multi-view images as 3D-aware sequential image generation (i.e., orbital video generation). |
Haibo Yang; Yang Chen; Yingwei Pan; Ting Yao; Zhineng Chen; Chong-Wah Ngo; Tao Mei; |
| 277 | Improving Out-of-Distribution Detection with Disentangled Foreground and Background Features Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper considers the importance of feature disentanglement in out-of-distribution detection and proposes the simultaneous exploitation of both foreground and background features to support the detection of OOD inputs. |
Choubo Ding; Guansong Pang; |
| 278 | Deep Instruction Tuning for Segment Anything Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Although SAM can support different types of segmentation prompts, we note that, compared to point- and box-guided segmentations, it performs much worse on text-instructed tasks, e.g., referring image segmentation (RIS). In this paper, we argue that deep text instruction tuning is key to mitigate such shortcoming caused by the shallow fusion scheme in its default light-weight mask decoder. |
Xiaorui Huang; Gen Luo; Chaoyang Zhu; Bo Tong; Yiyi Zhou; Xiaoshuai Sun; Rongrong Ji; |
| 279 | StealthDiffusion: Towards Evading Diffusion Forensic Detection Through Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, current adversarial attacks often introduce visible noise, have poor transferability, and fail to address spectral differences between AI-generated and genuine images. To address this, we propose StealthDiffusion, a framework based on stable diffusion that modifies AI-generated images into high-quality, imperceptible adversarial examples capable of evading state-of-the-art forensic detectors. |
Ziyin Zhou; Ke Sun; Zhongxi Chen; Huafeng Kuang; Xiaoshuai Sun; Rongrong Ji; |
| 280 | QueryMatch: A Query-based Contrastive Learning Framework for Weakly Supervised Visual Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel query-based one-stage framework for weakly supervised visual grounding, namely QueryMatch. |
Shengxin Chen; Gen Luo; Yiyi Zhou; Xiaoshuai Sun; Guannan Jiang; Rongrong Ji; |
| 281 | Training-Free Feature Reconstruction with Sparse Optimization for Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we address the challenge of adapting vision-language models (VLMs) to few-shot image recognition in a training-free manner. |
Yi Zhang; Ke Yu; Angelica I. Aviles-Rivero; Jiyuan Jia; Yushun Tang; Zhihai He; |
| 282 | GaussianTalker: Speaker-specific Talking Head Synthesis Via 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose GaussianTalker, a novel method for audio-driven talking head synthesis based on 3D Gaussian Splatting. |
Hongyun Yu; Zhan Qu; Qihang Yu; Jianchuan Chen; Zhonghua Jiang; Zhiwen Chen; Shengyu Zhang; Jimin Xu; Fei Wu; Chengfei Lv; Gang Yu; |
| 283 | Large Point-to-Gaussian Model for Image-to-3D Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a large Point-to-Gaussian model for image-to-3D generation, which takes as input the initial point cloud produced by a large 3D diffusion model conditioned on the 2D image and generates the Gaussian parameters. |
Longfei Lu; Huachen Gao; Tao Dai; Yaohua Zha; Zhi Hou; Junta Wu; Shu-Tao Xia; |
| 284 | Audio-Driven Identity Manipulation for Face Inpainting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our main insight is that a person’s voice carries distinct identity markers, such as age and gender, which provide an essential supplement for identity-aware face inpainting. |
Yuqi Sun; Qing Lin; Weimin Tan; Bo Yan; |
| 285 | RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this study, we propose a simultaneous multi-speaker separation framework that can facilitate the concurrent separation of multiple speakers within a singular process. |
Tianrui Pan; Jie Liu; Bohan Wang; Jie Tang; Gangshan Wu; |
| 286 | Expanded Convolutional Neural Network Based Look-Up Tables for High Efficient Single-Image Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the performance of EC-LUT regarding SR quality and LUT volume is unsatisfactory. To address these limitations, this paper proposes a novel expanded convolutional neural network (ECNN). |
Kai Yin; Jie Shen; |
| 287 | RainMamba: Enhanced Locality Learning with State Space Models for Video Deraining Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Unexpectedly, its uni-dimensional sequential process on videos destroys the local correlations across the spatio-temporal dimension by distancing adjacent pixels. To address this, we present an improved SSMs-based video deraining network (RainMamba) with a novel Hilbert scanning mechanism to better capture sequence-level local information. |
Hongtao Wu; Yijun Yang; Huihui Xu; Weiming Wang; Jinni Zhou; Lei Zhu; |
| 288 | Mixed Prototype Correction for Causal Inference in Medical Image Classification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a mixed prototype correction for causal inference (MPCCI) method, aimed at mitigating the impact of unseen confounding factors on the causal relationships between medical images and disease labels, so as to enhance the diagnostic accuracy of deep learning models. |
Yajie Zhang; Zhi-An Huang; Zhiliang Hong; Songsong Wu; Jibin Wu; Kay Chen Tan; |
| 289 | Emphasizing Semantic Consistency of Salient Posture for Speech-Driven Gesture Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel speech-driven gesture generation method by emphasizing the semantic consistency of salient posture. |
Fengqi Liu; Hexiang Wang; Jingyu Gong; Ran Yi; Qianyu Zhou; Xuequan Lu; Jiangbo Lu; Lizhuang Ma; |
| 290 | Enhanced Tensorial Self-representation Subspace Learning for Incomplete Multi-view Clustering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In addition, designing an effective strategy to retain salient features while eliminating noise is rarely considered in IMVC. To tackle these issues, we propose a novel self-representation learning method with missing sample recovery and enhanced low-rank tensor regularization. |
Hangjun Che; Xinyu Pu; Deqiang Ouyang; Beibei Li; |
| 291 | ViewPCGC: View-Guided Learned Point Cloud Geometry Compression Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address the limitation, we innovatively propose a view-guided learned point cloud geometry compression scheme, namely ViewPCGC. |
Huiming Zheng; Wei Gao; Zhuozhen Yu; Tiesong Zhao; Ge Li; |
| 292 | PS-TTL: Prototype-based Soft-labels and Test-Time Learning for Few-shot Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: The fine-tuning based paradigm is currently dominating this field, where detectors are initially pre-trained on base classes with sufficient samples and then fine-tuned on novel ones with few samples, but the scarcity of labeled samples of novel classes greatly interferes with precisely fitting their data distribution, thus hampering the performance. To address this issue, we propose a new framework for FSOD, namely Prototype-based Soft-labels and Test-Time Learning (PS-TTL). |
Yingjie Gao; Yanan Zhang; Ziyue Huang; Nanqing Liu; Di Huang; |
| 293 | Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a challenging Open-set Video-based Facial Expression Recognition (OV-FER) task, aiming to identify both known and new, unseen facial expressions. |
Yuanyuan Liu; Yuxuan Huang; Shuyang Liu; Yibing Zhan; Zijing Chen; Zhe Chen; |
| 294 | TGCA-PVT: Topic-Guided Context-Aware Pyramid Vision Transformer for Sticker Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: While conventional image emotion recognition focuses on global features, sticker emotion recognition necessitates incorporating both global and local features, along with additional modalities like text. To address this, we introduce a topic ID-guided transformer method to facilitate a more nuanced analysis of the stickers. |
Jian Chen; Wei Wang; Yuzhu Hu; Junxin Chen; Han Liu; Xiping Hu; |
| 295 | UrbanCross: Enhancing Satellite Image-Text Retrieval with Cross-Domain Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methods often overlook significant domain gaps across diverse urban landscapes, primarily focusing on enhancing retrieval performance within single domains. To tackle this issue, we present UrbanCross, a new framework for cross-domain satellite image-text retrieval. |
Siru Zhong; Xixuan Hao; Yibo Yan; Ying Zhang; Yangqiu Song; Yuxuan Liang; |
| 296 | FTF-ER: Feature-Topology Fusion-Based Experience Replay Method for Continual Graph Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In addition, the topology-based ER methods only consider local topological information and add neighboring nodes to the buffer, which ignores the global topological information and increases memory overhead. To bridge these gaps, we propose a novel method called Feature-Topology Fusion-based Experience Replay (FTF-ER) to effectively mitigate the catastrophic forgetting issue with enhanced efficiency. |
Jinhui Pang; Changqing Lin; Xiaoshuai Hao; Rong Yin; Zixuan Wang; Zhihui Zhang; Jinglin He; Huang Tai Sheng; |
| 297 | Beyond The Known: Ambiguity-Aware Multi-view Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This demands that predictors not only recognize familiar patterns but also adaptively interpret unknown ones out of training scope. To address this challenge, we propose an Ambiguity-Aware Multi-view Learning Framework, which integrates four synergistic modules into an end-to-end framework to achieve generalizability and reliability beyond the known. |
Zihan Fang; Shide Du; Yuhong Chen; Shiping Wang; |
| 298 | Language-Driven Interactive Shadow Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This work presents the Referring Video Shadow Detection (RVSD), which is an innovative task that rejuvenates the classic paradigm by facilitating the segmentation of particular shadows in videos based on descriptive natural language prompts. |
Hongqiu Wang; Wei Wang; Haipeng Zhou; Huihui Xu; Shaozhi Wu; Lei Zhu; |
| 299 | DySarl: Dynamic Structure-Aware Representation Learning for Multimodal Knowledge Graph Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In addition, existing studies have largely ignored the dynamic impact of different multimodal features on different decision facts for reasoning; they utilize asymmetric co-attention to independently learn the static interplay between different modalities without dynamically joining the reasoning process. We propose a novel Dynamic Structure-aware representation learning method, namely DySarl, to overcome this problem and significantly improve the MKG reasoning performance. |
Kangzheng Liu; Feng Zhao; Yu Yang; Guandong Xu; |
| 300 | Few-Shot Joint Multimodal Entity-Relation Extraction Via Knowledge-Enhanced Cross-modal Prompt Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address the insufficient information in the few-shot setting, we introduce the Knowledge-Enhanced Cross-modal Prompt Model (KECPM) for JMERE. |
Li Yuan; Yi Cai; Junsheng Huang; |
| 301 | Enabling Synergistic Full-Body Control in Prompt-Based Co-Speech Motion Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: The major challenges lie in 1) the existing speech-to-motion datasets only involve highly limited full-body motions, making a wide range of common human activities out of training distribution; 2) these datasets also lack annotated user prompts. To address these challenges, we propose SynTalker, which utilizes the off-the-shelf text-to-motion dataset as an auxiliary for supplementing the missing full-body motion and prompts. |
Bohong Chen; Yumeng Li; Yao-Xiang Ding; Tianjia Shao; Kun Zhou; |
| 302 | Accurate and Lightweight Learning for Specific Domain Image-Text Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these methods for SDITR often neglect two critical aspects: the enhancement of modal-level distribution consistency within the retrieval space and the reduction of CLIP’s computational cost during inference. To address these issues, this paper presents a novel framework, Accurate and lightweight learning for specific domain Image-text Retrieval (AIR), based on CLIP. |
Rui Yang; Shuang Wang; Jianwei Tao; Yingping Han; Qiaoling Lin; YanHe Guo; Biao Hou; Licheng Jiao; |
| 303 | Multimodal Unlearnable Examples: Protecting Data Against Multimodal Contrastive Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose Multi-step Error Minimization (MEM), a novel optimization process for generating multimodal unlearnable examples. |
Xinwei Liu; Xiaojun Jia; Yuan Xun; Siyuan Liang; Xiaochun Cao; |
| 304 | AL-GTD: Deep Active Learning for Gaze Target Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, our goal is to reduce the reliance on the size of labeled training data for gaze target detection. |
Francesco Tonini; Nicola Dall’Asen; Lorenzo Vaquero; Cigdem Beyan; Elisa Ricci; |
| 305 | SOAP: Enhancing Spatio-Temporal Relation and Motion Information Capturing for Few-Shot Action Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: They also capture motion information via narrow perspectives between adjacent frames without considering density, leading to insufficient motion information capturing. Therefore, we propose a novel plug-and-play architecture for FSAR called Spatio-tempOral frAme tuPle enhancer (SOAP) in this paper. |
Wenbo Huang; Jinghui Zhang; Xuwei Qian; Zhen Wu; Meng Wang; Lei Zhang; |
| 306 | Combating Visual Question Answering Hallucinations Via Robust Multi-Space Co-Debias Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Such distortions can cause hallucination distributions that deviate significantly from the true data, resulting in the model producing factually incorrect predictions. To address this challenge, we propose a robust Multi-Space Co-debias Learning (MSCD) approach for combating VQA hallucinations, which effectively mitigates bias-induced instance and distribution shifts in multi-space under a unified paradigm. |
Jiawei Zhu; Yishu Liu; Huanjia Zhu; Hui Lin; Yuncheng Jiang; Zheng Zhang; Bingzhi Chen; |
| 307 | Subjective and Objective Quality-of-Experience Assessment for 3D Talking Heads Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, transmitting 3D THs poses significant challenges due to its complex and voluminous nature, often leading to pronounced distortion and a compromised user experience. Addressing this challenge, we introduce the 3D Talking Heads Quality Assessment (THQA-3D) dataset, comprising 1,000 sets of distorted and 50 original TH mesh sequences (MSs), to facilitate quality assessment in 3D TH transmission. |
Yingjie Zhou; Zicheng Zhang; Wei Sun; Xiaohong Liu; Xiongkuo Min; Guangtao Zhai; |
| 308 | Harmony in Diversity: Improving All-in-One Image Restoration Via Multi-Task Collaboration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we extend and redefine the conventional all-in-one image restoration task as a multi-task learning problem and propose a straightforward yet effective active-reweighting strategy, dubbed Art, to harmonize the optimization of multiple degradation tasks. |
Gang Wu; Junjun Jiang; Kui Jiang; Xianming Liu; |
| 309 | Disentangled-Multimodal Privileged Knowledge Distillation for Depression Recognition with Incomplete Multimodal Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In particular, these methods often encounter performance degradation when certain modalities are either missing or degraded. To tackle this issue, we present a generalizable multimodal framework for DR by aggregating feature disentanglement and privileged knowledge distillation. |
Yuchen Pan; Junjun Jiang; Kui Jiang; Xianming Liu; |
| 310 | Multimodal Fusion Via Hypergraph Autoencoder and Contrastive Learning for Emotion Recognition in Conversation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This approach can inherently heighten the redundancy in contextual messages and excessive graph network smoothing, particularly in the context of long-distance conversations. To address this issue, we propose a framework that dynamically adjusts hypergraph connections by variational hypergraph autoencoder (VHGAE), and employs contrastive learning to mitigate uncertainty factors during the reconstruction process. |
Zijian Yi; Ziming Zhao; Zhishu Shen; Tiehua Zhang; |
| 311 | Unleashing The Power of Generic Segmentation Model: A Simple Baseline for Infrared Small Target Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This gap primarily arises from the significant modality differences and the limited availability of infrared data. In this study, we aim to bridge this divergence by investigating the adaptation of generic segmentation models, such as the Segment Anything Model (SAM), to IRSTD tasks. |
Mingjin Zhang; Chi Zhang; Qiming Zhang; Yunsong Li; Xinbo Gao; Jing Zhang; |
| 312 | 3D Gaussian Editing with A Single Image Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While previous works on editing have achieved interesting results through manipulating 3D meshes, they often require accurately reconstructed meshes to perform editing, which limits their application in 3D content generation. To address this gap, we introduce a novel single-image-driven 3D scene editing approach based on 3D Gaussian Splatting, enabling intuitive manipulation via directly editing the content on a 2D image plane. |
Guan Luo; Tian-Xing Xu; Ying-Tian Liu; Xiao-Xiong Fan; Fang-Lue Zhang; Song-Hai Zhang; |
| 313 | Generative Expressive Conversational Speech Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In addition, due to the limitations of small-scale datasets containing scripted recording styles, they often fail to simulate real natural conversational styles. To address the above issues, we propose a novel generative expressive CSS system, termed GPT-Talker. |
Rui Liu; Yifan Hu; Yi Ren; Xiang Yin; Haizhou Li; |
| 314 | Point-GCC: Universal Self-supervised 3D Scene Pre-training Via Geometry-Color Contrast Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Specifically, we propose a universal 3D scene pre-training framework via Geometry-Color Contrast (Point-GCC), which aligns geometry and color information using a Siamese network. |
Guofan Fan; Zekun Qi; Wenkai Shi; Kaisheng Ma; |
| 315 | Hierarchical Perceptual and Predictive Analogy-Inference Network for Abstract Visual Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a Hierarchical Perception and Predictive Analogy-Inference network (HP^2AI), consisting of three major components that tackle key challenges of RPM problems. |
Wentao He; Jianfeng Ren; Ruibin Bai; Xudong Jiang; |
| 316 | Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite the widespread adoption of Vision-Language Understanding (VLU) benchmarks such as VQA v2, OKVQA, A-OKVQA, GQA, VCR, SWAG, and VisualCOMET, our analysis reveals a pervasive issue affecting their integrity: these benchmarks contain samples where answers rely on assumptions unsupported by the provided context. |
Junzhang Liu; Zhecan Wang; Hammad Ayyubi; Haoxuan You; Chris Thomas; Rui Sun; Shih-Fu Chang; Kai-Wei Chang; |
| 317 | ROI-Guided Point Cloud Geometry Compression Towards Human and Machine Vision Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To tackle this problem, we are the first to introduce a novel Region of Interest (ROI)-guided Point Cloud Geometry Compression (RPCGC) method for human and machine vision. Our framework employs a dual-branch parallel structure, where the base layer encodes and decodes a simplified version of the point cloud, and the enhancement layer refines this by focusing on geometry details. |
Liang Xie; Wei Gao; Huiming Zheng; Ge Li; |
| 318 | MaskMentor: Unlocking The Potential of Masked Self-Teaching for Missing Modality RGB-D Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing RGB-D semantic segmentation methods struggle to handle modality missing input, where only RGB images or depth maps are available, leading to degenerated segmentation performance. We tackle this issue using MaskMentor, a new pre-training framework for modality missing segmentation, which advances its counterparts via two novel designs: Masked Modality and Image Modeling (M2IM), and Self-Teaching via Token-Pixel Joint reconstruction (STTP). |
Zhida Zhao; Jia Li; Lijun Wang; Yifan Wang; Huchuan Lu; |
| 319 | FD2Talk: Towards Generalized Talking Head Generation with Facial Decoupled Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This is because they either solely use the diffusion model to obtain an intermediate representation and then employ another pre-trained renderer, or they overlook the feature decoupling of complex facial details, such as expressions, head poses and appearance textures. Therefore, we propose a Facial Decoupled Diffusion model for Talking head generation called FD2Talk, which fully leverages the advantages of diffusion models and decouples the complex facial details through multi-stages. |
Ziyu Yao; Xuxin Cheng; Zhiqi Huang; |
| 320 | GROOT: Generating Robust Watermark for Diffusion-Model-Based Audio Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Watermarking emerges as a proactive and sustainable tactic, preemptively regulating the creation and dissemination of synthesized content. Thus, this paper, as a pioneer, proposes the generative robust audio watermarking method (Groot), presenting a paradigm for proactively supervising the synthesized audio and its source diffusion models. |
Weizhi Liu; Yue Li; Dongdong Lin; Hui Tian; Haizhou Li; |
| 321 | Enhancing Pre-trained ViTs for Downstream Task Adaptation: A Locality-Aware Prompt Learning Method Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Particularly, fully supervised pre-trained ViTs, such as Vanilla ViT and CLIP, face the challenge of locality vanishing when adapting to downstream tasks. To address this, we introduce a novel LOcality-aware pRompt lEarning (LORE) method, aiming to improve the adaptation of pre-trained ViTs to downstream tasks. |
Shaokun Wang; Yifan Yu; Yuhang He; Yihong Gong; |
| 322 | Perceive Before Respond: Improving Sticker Response Selection By Emotion Distillation and Hard Mining Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a ‘Perceive before Respond’ (PBR) training paradigm. |
Wuyou Xia; Shengzhe Liu; Qin Rong; Guoli Jia; Eunil Park; Jufeng Yang; |
| 323 | Diffusion Networks with Task-Specific Noise Control for Radiology Report Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose to conduct RRG with diffusion networks by controlling the noise with task-specific features, which leverages irrelevant visual and textual information as noise rather than the stochastic Gaussian noise, and allows the diffusion networks to filter particular information through iterative denoising, thus performing a precise and controlled report generation process. |
Yuanhe Tian; Fei Xia; Yan Song; |
| 324 | Adaptive Hierarchical Aggregation for Federated Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Nonetheless, the issue of data heterogeneity introduces distinct challenges to federated object detection, evident in diminished object perception, classification and localization abilities. In response, we introduce a task-driven federated learning methodology, dubbed Adaptive Hierarchical Aggregation (FedAHA), tailored to overcome these obstacles. |
Ruofan Jia; Weiying Xie; Jie Lei; Yunsong Li; |
| 325 | HPC: Hierarchical Progressive Coding Framework for Volumetric Video Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Current NeRF compression lacks the flexibility to adjust video quality and bitrate within a single model for various network and device capacities. To address these issues, we propose HPC, a novel hierarchical progressive volumetric video coding framework achieving variable bitrate using a single model. |
Zihan Zheng; Houqiang Zhong; Qiang Hu; Xiaoyun Zhang; Li Song; Ya Zhang; Yanfeng Wang; |
| 326 | WaveDN: A Wavelet-based Training-free Zero-shot Enhancement for Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these methods often fall short in scenarios where labeled data for downstream tasks is either unavailable or insufficient for fine-tuning, and the training of additional parameter modules may considerably impair the existing transferability of VLMs on open-set tasks. To alleviate this issue, we introduce WaveDN, a wavelet-based distribution normalization method that can boost the VLMs’ performance on downstream tasks without parametric modules or labeled data. |
Jiulin Li; Mengyu Yang; Ye Tian; Lanshan Zhang; Yongchun Lu; Jice Liu; Wendong Wang; |
| 327 | A Novel State Space Model with Local Enhancement and State Sharing for Image Fusion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Specifically, we propose the local-enhanced vision Mamba block, dubbed as LEVM. |
Zihan Cao; Xiao Wu; Liang-Jian Deng; Yu Zhong; |
| 328 | Adaptive Instance-wise Multi-view Clustering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We design a collaborative framework with the augmented Lagrangian method to refine all subtasks towards optimal solutions iteratively. |
Shudong Huang; Hecheng Cai; Hao Dai; Wentao Feng; Jiancheng Lv; |
| 329 | RFFNet: Towards Robust and Flexible Fusion for Low-Light Image Denoising Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a robust and flexible fusion network (RFFNet) for low-light image denoising. |
Qiang Wang; Yuning Cui; Yawen Li; Yaping Ruan; Ben Zhu; Wenqi Ren; |
| 330 | Leveraging RGB-Pressure for Whole-body Human-to-Humanoid Motion Imitation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we establish an RGB-Pressure (RGB-P) based humanoid imitation system, achieving accurate and stable end-to-end mapping from human body models to robot control parameters. |
Yi Lu; Shenghao Ren; Qiu Shen; Xun Cao; |
| 331 | Do LLMs Understand Visual Anomalies? Uncovering LLM’s Capabilities in Zero-shot Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing approaches depend on static anomaly prompts that are prone to cross-semantic ambiguity, and prioritize global image-level representations over crucial local pixel-level image-to-text alignment that is necessary for accurate anomaly localization. In this paper, we present ALFA, a training-free approach designed to address these challenges via a unified model. |
Jiaqi Zhu; Shaofeng Cai; Fang Deng; Beng Chin Ooi; Junran Wu; |
| 332 | A Sample-driven Selection Framework: Towards Graph Contrastive Networks with Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, recent GCL methods often rely on uniform negative sample selection schemes, such as random sampling, which results in suboptimal performance. To tackle this challenge, we present GraphSaSe, a tailored approach specifically designed for graph contrastive learning. |
Xiangping Zheng; Xiuxin Hao; Bo Wu; Xigang Bao; Xuan Zhang; Wei Li; Xun Liang; |
| 333 | Joint-Motion Mutual Learning for Pose Estimation in Video Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Comparatively, methods that attempt to refine the initial heatmap fail to consider any spatio-temporal motion features. As a result, the performance of existing methods for pose estimation falls short due to the lack of ability to leverage both local joint (heatmap) information and global motion (feature) dynamics. To address this problem, we propose a novel joint-motion mutual learning framework for pose estimation, which effectively concentrates on both local joint dependency and global pixel-level motion dynamics. |
Sifan Wu; Haipeng Chen; Yifang Yin; Sihao Hu; Runyang Feng; Yingying Jiao; Ziqi Yang; Zhenguang Liu; |
| 334 | Self-Supervised Visual Preference Alignment Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper makes the first attempt towards unsupervised preference alignment in Vision-Language Models (VLMs). |
Ke Zhu; Liang Zhao; Zheng Ge; Xiangyu Zhang; |
| 335 | Sustainable Self-evolution Adversarial Training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we introduce a continual adversarial defense pipeline to realize learning from various kinds of adversarial examples across multiple stages. |
Wenxuan Wang; Chenglei Wang; Huihui Qi; Menghao Ye; Xuelin Qian; Peng Wang; Yanning Zhang; |
| 336 | R4D-planes: Remapping Planes For Novel View Synthesis and Self-Supervised Decoupling of Monocular Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: For the latter decoupling problem, previous neural radiance field methods require frequent tuning of the relevant parameters for different scenes, which is very inconvenient for practical use. We consider the above problems and propose a new representation of dynamic scenes based on tensor decomposition, which we call R4D-planes. |
Junyuan Guo; Hao Tang; Teng Wang; Chao Wang; |
| 337 | Semantic-aware Next-Best-View for Multi-DoFs Mobile System in Search-and-Acquisition Based Visual Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we formulate a novel information gain that integrates both visibility and semantic gain in a unified form to select the semantic-aware Next-Best-View. |
Xiaotong Yu; Chang Wen Chen; |
| 338 | A Multilevel Guidance-Exploration Network and Behavior-Scene Matching Method for Human Behavior Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Different from their methods, inspired by the Student-Teacher Network, we propose a novel framework called the Multilevel Guidance-Exploration Network (MGENet), which detects anomalies through the difference in high-level representation between the Guidance and Exploration network. |
Guoqing Yang; Zhiming Luo; Jianzhe Gao; Yingxin Lai; Kun Yang; Yifan He; Shaozi Li; |
| 339 | Modality-Balanced Learning for Multimedia Recommendation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Even worse, we find that in multimodal recommendation models, all modalities suffer from the problem of insufficient optimization. To address these issues, we propose a Counterfactual Knowledge Distillation method that could solve the imbalance problem and make the best use of all modalities. |
Jinghao Zhang; Guofan Liu; Qiang Liu; Shu Wu; Liang Wang; |
| 340 | NFT1000: A Cross-Modal Dataset For Non-Fungible Token Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce a benchmark dataset named the NFT Top1000 Visual-Text Dataset (NFT1000), containing 7.56 million image-text pairs collected from the 1000 most famous PFP NFT collections by sales volume on the Ethereum blockchain. |
Shuxun Wang; Yunfei Lei; Ziqi Zhang; Wei Liu; Haowei Liu; Li Yang; Bing Li; Wenjuan Li; Jin Gao; Weiming Hu; |
| 341 | LDStega: Practical and Robust Generative Image Steganography Based on Latent Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a practical and robust generative image steganography based on Latent Diffusion Models, called LDStega. |
Yinyin Peng; Yaofei Wang; Donghui Hu; Kejiang Chen; Xianjin Rong; Weiming Zhang; |
| 342 | Model-Based Non-Independent Distortion Cost Design for Effective JPEG Steganography Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, it remains a challenge to exploit the correlations between DCT coefficients for secure steganography in practical scenarios where only a single compressed JPEG image is available. To cope with this, we propose a novel model-based steganographic scheme using the Conditional Random Field (CRF) model with four-element cross-neighborhood to capture the dependencies among DCT coefficients for JPEG steganography with symmetric embedding. |
Yuanfeng Pan; Wenkang Su; Jiangqun Ni; Qingliang Liu; Yulin Zhang; Donghua Jiang; |
| 343 | ProFD: Prompt-Guided Feature Disentangling for Occluded Person Re-Identification Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, due to missing part appearance information caused by occlusion and noisy spatial information from external models, these purely vision-based approaches fail to correctly learn the features of human body parts from limited training data and struggle in accurately locating body parts, ultimately leading to misaligned part features. To tackle these challenges, we propose a Prompt-guided Feature Disentangling method (ProFD), which leverages the rich pre-trained knowledge in the textual modality to help the model generate well-aligned part features. |
Can Cui; Siteng Huang; Wenxuan Song; Pengxiang Ding; Min Zhang; Donglin Wang; |
| 344 | DPO: Dual-Perturbation Optimization for Test-time Adaptation in 3D Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To address the aforementioned challenges, we propose dual-perturbation optimization (DPO) for Test-time Adaptation in 3D Object Detection (TTA-3OD). |
Zhuoxiao Chen; Zixin Wang; Yadan Luo; Sen Wang; Zi Huang; |
| 345 | Coarse-to-Fine Proposal Refinement Framework for Audio Temporal Forgery Detection and Localization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, existing countermeasures still serve a classification purpose and fail to perform meaningful analysis of the start and end timestamps of partial forgery segments. To address this challenge, we introduce a novel coarse-to-fine proposal refinement framework (CFPRF) that incorporates a frame-level detection network (FDN) and a proposal refinement network (PRN) for audio temporal forgery detection and localization. |
Junyan Wu; Wei Lu; Xiangyang Luo; Rui Yang; Qian Wang; Xiaochun Cao; |
| 346 | Towards Efficient and Diverse Generative Model for Unconditional Human Motion Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address the issues, we propose an efficient method called MOOT for unconditional human motion synthesis. |
Hua Yu; Weiming Liu; Jiapeng Bai; Xu Gui; Yaqing Hou; YewSoon Ong; Qiang Zhang; |
| 347 | Cross-Modal Coherence-Enhanced Feedback Prompting for News Captioning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Given that these articles often contain extensive information not directly related to the image, captions may end up misaligned with the visual content. To mitigate this issue, we propose the novel cross-modal coherence-enhanced feedback prompting method to clarify the crucial elements that align closely with the visual content for news captioning. |
Ning Xu; Yifei Gao; Ting-Ting Zhang; Hongshuo Tian; An-An Liu; |
| 348 | Cognition-Supervised Saliency Detection: Contrasting EEG Signals and Visual Stimuli Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Here, we explore a novel method that utilizes signals measured directly from human cognition via electroencephalogram (EEG) in response to natural visual perception. |
Jun Ma; Tuukka Ruotsalo; |
| 349 | Rethinking Image Editing Detection in The Era of Generative AI Revolution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Considering that image editing and manipulation technologies pose significant threats to the authenticity and security of image content, research on image regional manipulation detection has always been a critical issue. |
Zhihao Sun; Haipeng Fang; Juan Cao; Xinying Zhao; Danding Wang; |
| 350 | MDR: Multi-stage Decoupled Relational Knowledge Distillation with Adaptive Stage Selection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We also find that adding distance-wise relational information to contrastive-learning-based methods negatively impacts distillation quality, revealing an implicit contention between angle-wise and distance-wise attributes. Therefore, we propose a Multi-stage Decoupled Relational (MDR) KD framework equipped with an adaptive stage selection to identify the stages that maximize the efficacy of transferring the relational knowledge. |
Jiaqi Wang; Lu Lu; Mingmin Chi; Jian Chen; |
| 351 | Rethinking The Architecture Design for Efficient Generic Event Boundary Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper demonstrates that SOTA GEBD models often prioritize final performance over model complexity, resulting in low inference speed. We contribute to addressing this challenge by reexamining the architecture of GEBD models and uncovering several surprising findings. |
Ziwei Zheng; Zechuan Zhang; Yulin Wang; Shiji Song; Gao Huang; Le Yang; |
| 352 | Learning Cross-Spectral Prior for Image Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: With the rising interest in multi-camera cross-spectral systems, cross-spectral images have been widely used in computer vision and image processing. An effective super-resolution (SR) method can therefore provide high-resolution (HR) cross-spectral images for a wide range of research and applications. |
Chenxi Ma; Weimin Tan; Shili Zhou; Bo Yan; |
| 353 | Integrating Content-Semantics-World Knowledge to Detect Stress from Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a three-level content-semantic-world knowledge framework, addressing three particular issues for video-based stress detection. |
Yang Ding; Yi Dai; Xin Wang; Ling Feng; Lei Cao; Huijun Zhang; |
| 354 | Adaptive Pruning of Channel Spatial Dependability in Convolutional Neural Networks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we explore CNN pruning strategies from the perspective of model interpretability. |
Weiying Xie; Mei Yuan; Jitao Ma; Yunsong Li; |
| 355 | Understanding and Tackling Scattering and Reflective Flare for Mobile Camera Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In addition, current solutions only partially address ISP-related deterioration due to a lack of comprehensive raw image datasets for flare study. To bridge these research gaps, we introduce a new raw image dataset tailored for mobile camera systems, focusing on eliminating flare. |
Fengbo Lan; Chang Wen Chen; |
| 356 | Dual-path Collaborative Generation Network for Emotional Video Captioning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end, we propose a dual-path collaborative generation network, which dynamically perceives the evolution of visual emotional cues while generating emotional captions through collaborative learning. |
Cheng Ye; Weidong Chen; Jingyu Li; Lei Zhang; Zhendong Mao; |
| 357 | Enhancing Robustness in Learning with Noisy Labels: An Asymmetric Co-Training Approach Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose an Asymmetric Co-Training (ACT) method to mitigate the detrimental effects of label noise. |
Mengmeng Sheng; Zeren Sun; Gensheng Pei; Tao Chen; Haonan Luo; Yazhou Yao; |
| 358 | RHKH: Relational Hypergraph Neural Network for Link Prediction on N-ary Knowledge Hypergraph Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Motivated by these observations, and avoiding breaking the knowledge structure of KHs as previous studies do, we propose the first KH reasoning model based on original knowledge formats, RHKH. |
Yuzhuo Wang; Junwei He; Hongzhi Wang; |
| 359 | DanceCamAnimator: Keyframe-Based Controllable 3D Dance Camera Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, in previous works, every camera frame is equally treated and this causes jittering and unavoidable smoothing in post-processing. To solve these problems, we propose to integrate animator dance cinematography knowledge by formulating this task as a three-stage process: keyframe detection, keyframe synthesis, and tween function prediction. |
Zixuan Wang; Jiayi Li; Xiaoyu Qin; Shikun Sun; Songtao Zhou; Jia Jia; Jiebo Luo; |
| 360 | AbsGS: Recovering Fine Details in 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present a comprehensive analysis of the cause of aforementioned artifacts, namely gradient collision, which prevents large Gaussians in over-reconstructed regions from splitting. |
Zongxin Ye; Wenyu Li; Sidun Liu; Peng Qiao; Yong Dou; |
| 361 | Event-ID: Intrinsic Decomposition Using An Event Camera Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Event-ID, an event-based intrinsic decomposition framework that leverages events and images for stable decomposition under extreme scenarios. |
Zehao Chen; Zhan Lu; De Ma; Huajin Tang; Xudong Jiang; Qian Zheng; Gang Pan; |
| 362 | PTSBench: A Comprehensive Post-Training Sparsity Benchmark Towards Algorithms and Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose the first comprehensive post-training sparsity benchmark called PTSBench towards algorithms and models. |
Zining Wang; Jinyang Guo; Ruihao Gong; Yang Yong; Aishan Liu; Yushi Huang; Jiaheng Liu; Xianglong Liu; |
| 363 | Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most anomalous events tend to occur in localized spatial regions rather than the entire video frames, which implies existing frame-level feature based works may be misled by the dominant background information and lack the interpretation of the detected anomalies. To address this dilemma, this paper introduces a novel method called STPrompt that learns spatio-temporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs). |
Peng Wu; Xuerong Zhou; Guansong Pang; Zhiwei Yang; Qingsen Yan; Peng Wang; Yanning Zhang; |
| 364 | Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, aligning their representations poses challenges due to the significant semantic gap between vision and text, as well as the lower quality of non-English representations caused by pre-trained encoders and data noise. To overcome these challenges, we propose LECCR, a novel solution that incorporates the multi-modal large language model (MLLM) to improve the alignment between visual and non-English representations. |
Yabing Wang; Le Wang; Qiang Zhou; Zhibin Wang; Hao Li; Gang Hua; Wei Tang; |
| 365 | A Coarse to Fine Detection Method for Prohibited Object in X-ray Images Based on Progressive Transformer Decoder Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Transformer-based prohibited object detection methods for X-ray images continue to emerge, but they still exhibit shortcomings such as poor performance and high computational complexity when detecting prohibited objects under heavy occlusion. Therefore, a coarse-to-fine detection method for prohibited objects in X-ray images based on a progressive Transformer decoder is proposed in this paper. |
Chunjie Ma; Lina Du; Zan Gao; Li Zhuo; Meng Wang; |
| 366 | COMD: Training-free Video Motion Transfer With Camera-Object Motion Disentanglement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, existing methods pre-define camera motion types during training, which limits their flexibility in camera control, preventing the realization of some specific camera controls, such as various camera movements in films. Therefore, to reduce training costs and achieve flexible camera control, we propose COMD, a novel training-free video motion transfer model, which disentangles camera motions and object motions in source videos and transfers the extracted camera motions to new videos. |
Teng Hu; Jiangning Zhang; Ran Yi; Yating Wang; Jieyu Weng; Hongrui Huang; Yabiao Wang; Lizhuang Ma; |
| 367 | RainyScape: Unsupervised Rainy Scene Reconstruction Using Decoupled Neural Rendering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose RainyScape, an unsupervised framework to reconstruct pristine scenes from a collection of multi-view rainy images. |
Xianqiang Lyu; Hui Liu; Junhui Hou; |
| 368 | FreeEnhance: Tuning-Free Image Enhancement Via Content-Consistent Noising-and-Denoising Process Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel framework, namely FreeEnhance, for content-consistent image enhancement using the off-the-shelf image diffusion models. |
Yang Luo; Yiheng Zhang; Zhaofan Qiu; Ting Yao; Zhineng Chen; Yu-Gang Jiang; Tao Mei; |
| 369 | Timeline and Boundary Guided Diffusion Network for Video Shadow Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Moreover, few works address the VSD problem by considering the characteristic (i.e., boundary) of shadow. Motivated by this, we propose a Timeline and Boundary Guided Diffusion (TBGDiff) network for VSD where we take account of the past-future temporal guidance and boundary information jointly. |
Haipeng Zhou; Hongqiu Wang; Tian Ye; Zhaohu Xing; Jun Ma; Ping Li; Qiong Wang; Lei Zhu; |
| 370 | Towards Multi-view Consistent Graph Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Furthermore, there is a notable absence of theoretical guidance for constructing multi-view data topologies, leading to uncertainty regarding the progression of graph embeddings toward a consistent state. To tackle these challenges, we introduce a framework named energy-constrained multi-view graph diffusion. |
Jielong Lu; Zhihao Wu; Zhaoliang Chen; Zhiling Cai; Shiping Wang; |
| 371 | Correlation-Driven Multi-Modality Graph Decomposition for Cross-Subject Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study introduces a novel framework, termed Correlation-Driven Multi-Modality Graph Decomposition (CMMGD). |
Wuliang Huang; Yiqiang Chen; Xinlong Jiang; Chenlong Gao; Qian Chen; Teng Zhang; Bingjie Yan; Yifan Wang; Jianrong Yang; |
| 372 | IF-Garments: Reconstructing Your Intersection-Free Multi-Layered Garments from Monocular Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Besides, there are inevitable and undetectable overlaps for a single video that hinder researchers from modeling complete and intersection-free multi-layered clothing. To address the above limitations, in this paper, we propose a novel method to reconstruct multi-layered clothing from multiple monocular videos sequentially, which surpasses existing work in generalization and robustness against penetration. |
Mingyang Sun; Qipeng Yan; Zhuoer Liang; Dongliang Kou; Dingkang Yang; Ruisheng Yuan; Xiao Zhao; Mingcheng Li; Lihua Zhang; |
| 373 | Cons2Plan: Vector Floorplan Generation from Various Conditions Via A Learning Framework Based on Conditional Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a learning framework, named Cons2Plan, for automatically generating high-quality vector floorplans from various conditions. |
Shibo Hong; Xuhong Zhang; Tianyu Du; Sheng Cheng; Xun Wang; Jianwei Yin; |
| 374 | Narrowing The Gap Between Vision and Action in Navigation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Second, in these models, the existing waypoint predictors neglect object semantics and their attributes related to passability, which can be informative in indicating the feasibility of actions. To address these two issues, we introduce a low-level action decoder jointly trained with high-level action prediction, enabling the current VLN agent to learn and ground the selected visual view to the low-level controls. |
Yue Zhang; Parisa Kordjamshidi; |
| 375 | UNER: A Unified Prediction Head for Named Entity Recognition in Visually-rich Documents Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the research in VrD-NER faces three major challenges: complex document layouts, incorrect reading orders, and unsuitable task formulations. To address these challenges, we propose a query-aware entity extraction head, namely UNER, to collaborate with existing multi-modal document transformers to develop more robust VrD-NER models. |
Yi Tu; Chong Zhang; Ya Guo; Huan Chen; Jinyang Tang; Huijia Zhu; Qi Zhang; |
| 376 | Reverse2Complete: Unpaired Multimodal Point Cloud Completion Via Guided Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel unpaired multimodal shape completion approach that directly operates on point coordinate space. |
Wenxiao Zhang; Hossein Rahmani; Xun Yang; Jun Liu; |
| 377 | Informative Point Cloud Dataset Extraction for Classification Via Gradient-based Points Moving Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a new open problem in the point cloud field, named point cloud condensation: Can we condense a large point cloud dataset into a much smaller synthetic dataset while preserving the important information of the original large dataset? |
Wenxiao Zhang; Ziqi Wang; Li Xu; Xun Yang; Jun Liu; |
| 378 | Towards Robust Physical-world Backdoor Attacks on Lane Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing backdoor attack methods on LD exhibit limited effectiveness in dynamic real-world scenarios, primarily because they fail to consider dynamic scene factors, including changes in driving perspectives (e.g., viewpoint transformations) and environmental conditions (e.g., weather or lighting changes). To tackle this issue, this paper introduces BadLANE, a dynamic scene adaptation backdoor attack for LD designed to withstand changes in real-world dynamic scene factors. |
Xinwei Zhang; Aishan Liu; Tianyuan Zhang; Siyuan Liang; Xianglong Liu; |
| 379 | Illumination Distribution Prior for Low-light Image Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a simple but effective illumination distribution prior (IDP) for images to illuminate the darkness. |
Chao Wang; Yang Zhou; Liangtian He; Fenglai Lin; Hongming Chen; Liang-Jian Deng; |
| 380 | Edit As You Wish: Video Caption Editing with Multi-grained User Control Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a novel Video Caption Editing (VCE) task to automatically revise an existing video description guided by multi-grained user requests. |
Linli Yao; Yuanmeng Zhang; Ziheng Wang; Xinglin Hou; Tiezheng Ge; Yuning Jiang; Xu Sun; Qin Jin; |
| 381 | Towards Labeling-free Fine-grained Animal Pose Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we are interested in identifying denser and finer animal joints. |
Dan Zeng; Yu Zhu; Shuiwang Li; Qijun Zhao; Qiaomu Shen; Bo Tang; |
| 382 | SpeechCraft: A Fine-Grained Expressive Speech Dataset with Natural Language Description Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, constructing such datasets presents a major trade-off between large-scale data collection and high-quality annotation. To tackle this challenge, we propose an automatic speech annotation system for expressiveness interpretation that annotates in-the-wild speech clips with expressive and vivid human language descriptions. |
Zeyu Jin; Jia Jia; Qixin Wang; Kehan Li; Shuoyi Zhou; Songtao Zhou; Xiaoyu Qin; Zhiyong Wu; |
| 383 | Linearly-evolved Transformer for Pan-sharpening Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Despite the remarkable advancement, their success may be at the huge cost of model parameters and FLOPs, thus preventing its application over low-resource satellites. To address this challenge between favorable performance and expensive computation, we tailor an efficient linearly-evolved transformer variant and employ it to construct a lightweight pan-sharpening framework. |
Junming Hou; Zihan Cao; Naishan Zheng; Xuan Li; Xiaoyu Chen; Xinyang Liu; Xiaofeng Cong; Danfeng Hong; Man Zhou; |
| 384 | MambaTrack: A Simple Baseline for Multiple Object Tracking with State Space Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by the great expectation of state space models (SSMs), such as Mamba, in long-term sequence modeling with near-linear complexity, we introduce a Mamba-based motion model named Mamba moTion Predictor (MTP). |
Changcheng Xiao; Qiong Cao; Zhigang Luo; Long Lan; |
| 385 | InMu-Net: Advancing Multi-modal Intent Detection Via Information Bottleneck and Multi-sensory Processing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite the promising advancements in complex fusion mechanisms or architecture designs, challenges remain due to: (1) various noise and redundancy in both visual and audio modalities and (2) long-tailed distributions of intent categories. In this paper, to tackle the above two issues, we propose InMu-Net, a simple yet effective framework for MID from the Information bottleneck and Multi-sensory processing perspective. |
Zhihong Zhu; Xuxin Cheng; Zhaorun Chen; Yuyan Chen; Yunyan Zhang; Xian Wu; Yefeng Zheng; Bowen Xing; |
| 386 | Distribution Consistency Guided Hashing for Cross-Modal Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: 2) They frequently utilize pairwise similarities to guide hashing learning and neglect class distribution correlations. To overcome these two issues, we propose a novel Distribution Consistency Guided Hashing (DCGH) framework. |
Yuan Sun; Kaiming Liu; Yongxiang Li; Zhenwen Ren; Jian Dai; Dezhong Peng; |
| 387 | UniStyle: Unified Style Modeling for Speaking Style Captioning and Stylistic Speech Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, an innovative framework, referred to as UniStyle, is proposed to incorporate the capabilities of both speaking style captioning and style-controllable speech synthesis. |
Xinfa Zhu; Wenjie Tian; Xinsheng Wang; Lei He; Yujia Xiao; Xi Wang; Xu Tan; Sheng Zhao; Lei Xie; |
| 388 | Multi-Modality Co-Learning for Efficient Skeleton-based Action Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Nevertheless, the absence of detailed body information in skeletons restricts performance, while other multimodal methods require substantial inference resources and are inefficient when using multimodal data during both training and inference stages. To address this and fully harness the complementary multimodal features, we propose a novel multi-modality co-learning (MMCL) framework by leveraging the multimodal large language models (LLMs) as auxiliary networks for efficient skeleton-based action recognition, which engages in multi-modality co-learning during the training stage and keeps efficiency by employing only concise skeletons in inference. |
Jinfu Liu; Chen Chen; Mengyuan Liu; |
| 389 | Bridging The Modality Gap: Dimension Information Alignment and Sparse Spatial Constraint for Image-Text Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Directly interacting such embeddings lacks rationality and may capture inaccurate correlation. Therefore, we propose a novel method called DIAS to bridge the modality gap from two aspects: (1) We align the information representation of embeddings from different modalities in corresponding dimension to ensure the correlation calculation is based on interactions of similar information. |
Xiang Ma; Xuemei Li; Lexin Fang; Caiming Zhang; |
| 390 | Heterogeneity-Aware Federated Deep Multi-View Clustering Towards Diverse Feature Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing federated multi-view clustering algorithms often result in misalignment in feature representations among clients, difficulty in integrating information across multiple views, and poor performance in heterogeneous scenarios. To address these challenges, we propose HFMVC, a heterogeneity-aware federated deep multi-view clustering method. |
Xiaorui Jiang; Zhongyi Ma; Yulin Fu; Yong Liao; Pengyuan Zhou; |
| 391 | Advancing Generalized Deepfake Detector with Forgery Perception Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we shift the focus to real-time perception analysis in the training process and generalize deepfake detectors through an efficient method dubbed Forgery Perception Guidance (FPG). |
Ruiyang Xia; Dawei Zhou; Decheng Liu; Lin Yuan; Shuodi Wang; Jie Li; Nannan Wang; Xinbo Gao; |
| 392 | Generative Motion Stylization of Cross-structure Characters Within Canonical Motion Space Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a generative motion stylization pipeline, named MotionS, for synthesizing diverse and stylized motion on cross-structure characters using cross-modality style prompts. |
Jiaxu Zhang; Xin Chen; Gang Yu; Zhigang Tu; |
| 393 | Saliency-Guided Fine-Grained Temporal Mask Learning for Few-Shot Action Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, coarse-level temporal relation modeling can make the few-shot models overfit in high-discrepancy temporal context, and ignore the low-discrepancy but high-semantic relevance action details in the video. To address these issues, we propose a saliency-guided fine-grained temporal mask learning method that models the temporal atomic action relation for few-shot action recognition in a finer manner. |
Shuo Zheng; Yuanjie Dang; Peng Chen; Ruohong Huan; Dongdong Zhao; Ronghua Liang; |
| 394 | Consistencies Are All You Need for Semi-supervised Vision-Language Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, we insist that an efficient tracker should excel in tracking the target, regardless of the temporal direction. Building upon these insights, we propose the pioneering semi-supervised learning scheme for VLT task, representing a crucial step towards reducing the dependency on high-quality yet costly labeled data. |
Jiawei Ge; Jiuxin Cao; Xuelin Zhu; Xinyu Zhang; Chang Liu; Kun Wang; Bo Liu; |
| 395 | Boosting Non-causal Semantic Elimination: An Unconventional Harnessing of LVM for Open-World Deepfake Interpretation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To align research with the evolving technologies of forgery, we propose a new task named Open-World Deepfake Interpretation (OW-DFI). |
Zhaoyang Li; Zhu Teng; Baopeng Zhang; Jianping Fan; |
| 396 | CompGS: Efficient 3D Scene Representation Via Compressed Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Herein, we propose an efficient 3D scene representation, named Compressed Gaussian Splatting (CompGS), which harnesses compact Gaussian primitives for faithful 3D scene modeling with a remarkably reduced data size. |
Xiangrui Liu; Xinju Wu; Pingping Zhang; Shiqi Wang; Zhu Li; Sam Kwong; |
| 397 | Zero-Shot Character Identification and Speaker Prediction in Comics Via Iterative Multimodal Fusion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Recent large language models (LLMs) have shown great capability for text understanding and reasoning, while their application to multimodal content analysis is still an open problem. To address this problem, we propose an iterative multimodal framework, the first to employ multimodal information for both character identification and speaker prediction tasks. |
Yingxuan Li; Ryota Hinami; Kiyoharu Aizawa; Yusuke Matsui; |
| 398 | FIND: Fine-tuning Initial Noise Distribution with Policy Optimization for Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce a Fine-tuning Initial Noise Distribution (FIND) framework with policy optimization, which unleashes the powerful potential of pre-trained diffusion networks by directly optimizing the initial distribution to align the generated contents with user-input prompts. |
Changgu Chen; Libing Yang; Xiaoyan Yang; Lianggangxu Chen; Gaoqi He; Changbo Wang; Yang Li; |
| 399 | Tangram-Splatting: Optimizing 3D Gaussian Splatting Through Tangram-inspired Shape Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by the tangram, an ancient Chinese puzzle, we introduce a novel methodology (Tangram-Splatting) that leverages shape priors to optimize 3D scene fitting. |
Yi Wang; Ningze Zhong; Minglin Chen; Longguang Wang; Yulan Guo; |
| 400 | SOIL: Contrastive Second-Order Interest Learning for Multimodal Recommendation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To fully exploit user interest, we propose a Second-Order Interest Learning (SOIL) framework to retrieve second-order interest from unrecorded suboptimal items. |
Hongzu Su; Jingjing Li; Fengling Li; Ke Lu; Lei Zhu; |
| 401 | FedDEO: Description-Enhanced One-Shot Federated Learning with Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose FedDEO, a Description-Enhanced One-Shot Federated Learning Method with DMs, offering a novel exploration of utilizing the DM in OSFL. |
Mingzhao Yang; Shangchao Su; Bin Li; Xiangyang Xue; |
| 402 | PixelFade: Privacy-preserving Person Re-identification with Noise-guided Progressive Replacement Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose an iterative method (PixelFade) to optimize pedestrian images into noise-like images to resist recovery attacks. |
Delong Zhang; Yi-Xing Peng; Xiao-Ming Wu; Ancong Wu; Wei-Shi Zheng; |
| 403 | Language-Guided Visual Prompt Compensation for Multi-Modal Remote Sensing Image Classification with Modality Absence Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a language-guided visual prompt compensation network (LVPCnet) to achieve joint classification in case of arbitrary modality absence using a unified model that simultaneously considers modality complementarity. |
Ling Huang; Wenqian Dong; Song Xiao; Jiahui Qu; Yuanbo Yang; Yunsong Li; |
| 404 | LoMOE: Localized Multi-Object Editing Via Multi-Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Nevertheless, previous approaches have primarily relied on textual prompts for image editing, which tend to be less effective when making precise edits to specific objects or fine-grained regions within a scene containing single/multiple objects. We introduce a novel framework for zero-shot localized multi-object editing through a multi-diffusion process to overcome this challenge. |
Goirik Chakrabarty; Aditya Chandrasekar; Ramya Hebbalaguppe; Prathosh AP; |
| 405 | Thinking Temporal Automatic White Balance: Datasets, Models and Benchmarks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In response, we propose CTANet, which integrates cross-frame attention and RepViT for self-adjustment to content and illumination variations. |
Chunxiao Li; Shuyang Wang; Xuejing Kang; Anlong Ming; |
| 406 | Hierarchical Debiasing and Noisy Correction for Cross-domain Video Tube Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: (2) The pseudo labels tend to identify video tubes that are widely present in the source domain, rather than accurately localizing the correct video tubes specific to the target domain samples. To address the above issues, we propose the unsupervised domain adaptation model via Hierarchical dEbiAsing and noisy corRecTion (HEART) for cross-domain video tube retrieval, which contains two characteristic modules: Layered Feature Debiasing (including the adversarial feature alignment and the graph based alignment) and Pseudo Label Refinement. |
Jingqiao Xiu; Mengze Li; Wei Ji; Jingyuan Chen; Hanbin Zhao; Shin’ichi Satoh; Roger Zimmermann; |
| 407 | FBSDiff: Plug-and-Play Frequency Band Substitution of Diffusion Features for Highly Controllable Text-Driven Image Translation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To guide T2I generation with a reference image, we propose to decompose diverse guiding factors with different frequency bands of diffusion features in the DCT spectral space, and accordingly devise a novel frequency band substitution layer which realizes dynamic control of the reference image to the T2I generation result in a plug-and-play manner. |
Xiang Gao; Jiaying Liu; |
| 408 | Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce RTime, a novel temporal-emphasized video-text retrieval dataset. |
Yang Du; Yuqi Liu; Qin Jin; |
| 409 | Q-MoE: Connector for MLLMs with Text-Driven Routing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address the issue, this paper proposes Q-MoE, a query-based connector with Mixture-of-Experts (MoE) to extract task-specific information with text-driven routing. |
Hanzi Wang; Jiamin Ren; Yifeng Ding; Lei Ren; Huixing Jiang; Wei Chen; Fangxiang Feng; Xiaojie Wang; |
| 410 | Infusion: Preventing Customized Text-to-Image Diffusion from Overfitting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Drawing from the analysis, we propose Infusion, a T2I customization method that enables the learning of target concepts to avoid being constrained by limited training diversities, while preserving non-customized knowledge. |
Weili Zeng; Yichao Yan; Qi Zhu; Zhuo Chen; Pengzhi Chu; Weiming Zhao; Xiaokang Yang; |
| 411 | KNN Transformer with Pyramid Prompts for Few-Shot Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Few-Shot Learning (FSL) aims to recognize new classes with limited labeled data. Recent studies have attempted to address the challenge of rare samples with textual prompts to … |
Wenhao Li; Qiangchang Wang; Peng Zhao; Yilong Yin; |
| 412 | Multi-modal Auto-regressive Modeling Via Visual Tokens Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we successfully perform multi-modal auto-regressive modeling with a unified objective for the first time. |
Tianshuo Peng; Zuchao Li; Lefei Zhang; Hai Zhao; Ping Wang; Bo Du; |
| 413 | Disentangling Identity Features from Interference Factors for Cloth-Changing Person Re-identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a novel algorithm for thoroughly disentangling identity features from interference factors brought by clothes and camera view changes while ensuring the robustness and discriminability. |
Yubo Li; De Cheng; Chaowei Fang; Changzhe Jiao; Nannan Wang; Xinbo Gao; |
| 414 | Adversarial Experts Model for Black-box Domain Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Conventional approaches typically tackle the black-box noisy label problem from two aspects: self-knowledge distillation and pseudo-label denoising, both achieving limited performance due to limited knowledge information. To mitigate this issue, we explore the potential of off-the-shelf vision-language (ViL) multimodal models with rich semantic information for black-box domain adaptation by introducing an Adversarial Experts Model (AEM). |
Siying Xiao; Mao Ye; Qichen He; Shuaifeng Li; Song Tang; Xiatian Zhu; |
| 415 | PAIR: Pre-denosing Augmented Image Retrieval Model for Defending Adversarial Patches Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose the Pre-denosing Augmented Image Retrieval (PAIR) model, a new approach designed to protect image retrieval systems against adversarial patch attacks. |
Ziyang Zhou; Pinghui Wang; Zi Liang; Ruofei Zhang; Haitao Bai; |
| 416 | PSSD-Transformer: Powerful Sparse Spike-Driven Transformer for Image Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, SNNs also face several challenges: i) Existing SNNs are not purely additive and involve a substantial amount of floating-point computations, which contradicts the original design intention of adapting to neuromorphic chips; ii) The incorrect positioning of convolutional and pooling layers relative to spiking layers leads to reduced accuracy; iii) Leaky Integrate-and-Fire (LIF) neurons have limited capability in representing local information, which is disadvantageous for downstream visual tasks like semantic segmentation. To address the challenges in SNNs, i) we introduce Pure Sparse Self Attention (PSSA) and Dynamic Spiking Membrane Shortcut (DSMS), combining them to tackle the issue of floating-point computations; ii) the Spiking Precise Gradient downsampling (SPG-down) method is proposed for accurate gradient transmission; iii) the Group-LIF neuron concept is introduced to ensure LIF neurons’ capability in representing local information both horizontally and vertically, enhancing their applicability in semantic segmentation tasks. Ultimately, these three solutions are integrated into the Powerful Sparse-Spike-Driven Transformer (PSSD-Transformer), effectively handling semantic segmentation tasks and addressing the challenges inherent in SNNs. |
Hongzhi Wang; Xiubo Liang; Tao Zhang; Yue Gu; Weidong Geng; |
| 417 | MFRGN: Multi-scale Feature Representation Generalization Network for Ground-to-Aerial Geo-localization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose a pure end-to-end solution, free from task-specific techniques, termed the Multi-scale Feature Representation Generalization Network (MFRGN) to improve generalization. |
Yuntao Wang; Jinpu Zhang; Ruonan Wei; Wenbo Gao; Yuehuan Wang; |
| 418 | ZePo: Zero-Shot Portrait Stylization with Faster Sampling Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, current portrait stylization methods generally require either model fine-tuning based on examples or the employment of DDIM Inversion to revert images to noise space, both of which substantially decelerate the image generation process. To overcome these limitations, this paper presents an inversion-free portrait stylization framework based on diffusion models that accomplishes content and style feature fusion in merely four sampling steps. |
Jin Liu; Huaibo Huang; Jie Cao; Ran He; |
| 419 | WSEL: EEG Feature Selection with Weighted Self-expression Learning for Incomplete Multi-dimensional Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address the aforementioned problem, we propose a novel EEG feature selection model with weighted self-expression learning (WSEL). |
Xueyuan Xu; Li Zhuo; Jinxin Lu; Xia Wu; |
| 420 | Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: As the massive multilingual modeling of visual data requires huge computational costs, we propose a novel efficient training strategy, processing with visual speech units. |
Minsu Kim; Jeonghun Yeo; Se Jin Park; Hyeongseop Rha; Yong Man Ro; |
| 421 | Motion-aware Latent Diffusion Models for Video Frame Interpolation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel diffusion framework, Motion-Aware latent Diffusion models (MADiff), which is specifically designed for the VFI task. |
Zhilin Huang; Yijie Yu; Ling Yang; Chujun Qin; Bing Zheng; Xiawu Zheng; Zikun Zhou; Yaowei Wang; Wenming Yang; |
| 422 | Generalized Source-Free Domain-adaptive Segmentation Via Reliable Knowledge Propagation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we focus on a more challenging paradigm in semantic segmentation, Generalized SFDA (G-SFDA), aiming to achieve robust performance on both source and target domains. |
Qi Zang; Shuang Wang; Dong Zhao; Yang Hu; Dou Quan; Jinlong Li; Nicu Sebe; Zhun Zhong; |
| 423 | U2UData: A Large-scale Cooperative Perception Dataset for Swarm UAVs Autonomous Flight Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper presents U2UData, the first large-scale cooperative perception dataset for swarm UAVs autonomous flight. |
Tongtong Feng; Xin Wang; Feilin Han; Leping Zhang; Wenwu Zhu; |
| 424 | Exposure Completing for Temporally Consistent Neural High Dynamic Range Video Rendering Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a novel paradigm to render HDR frames via completing the absent exposure information, hence the exposure information is complete and consistent. |
Jiahao Cui; Wei Jiang; Zhan Peng; Zhiyu Pan; Zhiguo Cao; |
| 425 | Cefdet: Cognitive Effectiveness Network Based on Fuzzy Inference for Action Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Moreover, these methods frequently generate detection results with cognitive abnormalities. To solve the above problems, this study proposes a cognitive effectiveness network based on fuzzy inference (Cefdet), which introduces the concept of ‘cognition-based detection’ to simulate human cognition. |
Zhe Luo; Weina Fu; Shuai Liu; Saeed Anwar; Muhammad Saqib; Sambit Bakshi; Khan Muhammad; |
| 426 | SMART: Self-Weighted Multimodal Fusion for Diagnostics of Neurodegenerative Disorders Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recent multimodal learning methods adopt deep encoders to extract features and simple concatenation or alignment techniques for feature fusion, which suffer from the representation degeneration issue due to vast amounts of irrelevant information. To address this challenge, we propose a deep self-weighted multimodal relevance weighting approach, which leverages clustering-based contrastive learning and eliminates intra- and inter-modal irrelevancy. |
Qiuhui Chen; Yi Hong; |
| 427 | PrimeComposer: Faster Progressively Combined Diffusion for Image Composition with Attention Steering Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we formulate image composition as a subject-based local editing task, solely focusing on foreground generation. |
Yibin Wang; Weizhong Zhang; Jianwei Zheng; Cheng Jin; |
| 428 | GPD-VVTO: Preserving Garment Details in Video Virtual Try-On Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In light of previous research, we posit that the task of video virtual try-on can be decomposed into two key aspects: (1) single-frame results are realistic and natural, while retaining consistency with the garment; (2) the person’s actions and the garment are coherent throughout the entire video. To address these two aspects, we propose a novel two-stage framework based on Latent Diffusion Model, namely Garment-Preserving Diffusion for Video Virtual Try-On (GPD-VVTO). |
Yuanbin Wang; Weilun Dai; Long Chan; Huanyu Zhou; Aixi Zhang; Si Liu; |
| 429 | EvilEdit: Backdooring Text-to-Image Diffusion Models in One Second Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing backdoor attacks typically demand substantial tuning data for poisoning, limiting their practicality and potentially degrading the overall performance of T2I diffusion models. To address these issues, we propose EvilEdit, a training-free and data-free backdoor attack against T2I diffusion models. |
Hao Wang; Shangwei Guo; Jialing He; Kangjie Chen; Shudong Zhang; Tianwei Zhang; Tao Xiang; |
| 430 | SemGIR: Semantic-Guided Image Regeneration Based Method for AI-generated Image Detection and Attribution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the variety of semantic text prompts yields diverse generated images, posing significant challenges to existing detection methodologies that rely solely on learning from image features, particularly in scenarios with limited samples. To tackle these challenges, this paper presents a novel perspective on the AI-generated image detection task, advocating for detection under semantic-decoupling conditions. |
Xiao Yu; Kejiang Chen; Kai Zeng; Han Fang; Zijin Yang; Xiuwei Shang; Yuang Qi; Weiming Zhang; Nenghai Yu; |
| 431 | HcaNet: Haze-concentration-aware Network for Real-scene Dehazing with Codebook Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose a haze concentration aware network (HcaNet), its haze-concentration-aware module (HcaM) can reduce the information loss in the vector quantization stage and achieve an adaptive domain transfer for regions with different degrees of degradation. |
Yi Liu; Jiachen Li; Yanchun Ma; Qing Xie; Yongjian Liu; |
| 432 | Collaborative Training of Tiny-Large Vision Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Collaborative Training of Tiny-Large Vision Language Models (CTVLMs), a framework connecting large and tiny models via a projection layer and leveraging a synergistic training strategy. |
Shichen Lu; Longteng Guo; Wenxuan Wang; Zijia Zhao; Tongtian Yue; Jing Liu; Si Liu; |
| 433 | Align2Concept: Language Guided Interpretable Image Recognition By Visual Prototype and Textual Concept Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by cognitive science and vision-language learning, we propose a Prototype-Concept Alignment Network (ProCoNet) for learning visual prototypes under the guidance of textual concepts. |
Jiaqi Wang; Pichao Wang; Yi Feng; Huafeng Liu; Chang Gao; Liping Jing; |
| 434 | Learning Geometry Consistent Neural Radiance Fields from Sparse and Unposed Views Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, collecting dense input images for a NeRF with accurate camera poses is highly expensive in many real-world scenarios. In this paper, we propose to learn Geometry Consistent Neural Radiance Field (GC-NeRF), to tackle this challenge by jointly optimizing a NeRF and its corresponding camera poses with sparse (as low as 2) and unposed views. |
Qi Zhang; Chi Huang; Qian Zhang; Nan Li; Wei Feng; |
| 435 | Scalable Super-Resolution Neural Operator Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes an inference-time adaptive network width optimization method for arbitrary-scale SR modules, dubbed the Scalable Super-Resolution Neural Operator (SSRNO), which is capable of efficient, performance-preserving deployment on various mobile or edge devices with only a single user-input parameter indicating the desired compression rate. |
Lei Han; Xuesong Zhang; |
| 436 | Detached and Interactive Multimodal Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, traditional MML methods generally use the joint learning framework with a uniform learning objective that can lead to the modality competition issue, where feedback predominantly comes from certain modalities, limiting the full potential of others. In response to this challenge, this paper introduces DI-MML, a novel detached MML framework designed to learn complementary information across modalities under the premise of avoiding modality competition. |
Yunfeng Fan; Wenchao Xu; Haozhao Wang; Junhong Liu; Song Guo; |
| 437 | Cross-Task Knowledge Transfer for Semi-supervised Joint 3D Grounding and Captioning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we propose a novel 3D Cross-Task Teacher-Student Framework (3D-CTTSF) for joint 3D grounding and captioning in the semi-supervised setting, where each branch contains parallel grounding and captioning modules. |
Yang Liu; Daizong Liu; Zongming Guo; Wei Hu; |
| 438 | Dynamic Prompting of Frozen Text-to-Image Diffusion Models for Panoptic Narrative Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the direct use of phrase features as static prompts to apply frozen Diffusion models to the PNG task still suffers from a large task gap and insufficient vision-language interaction, yielding inferior performance. Therefore, we propose an Extractive-Injective Phrase Adapter (EIPA) bypass within the Diffusion UNet to dynamically update phrase prompts with image features and inject the multimodal cues back, which leverages the fine-grained image-text alignment capability of Diffusion models more sufficiently. |
Hongyu Li; Tianrui Hui; Zihan Ding; Jing Zhang; Bin Ma; Xiaoming Wei; Jizhong Han; Si Liu; |
| 439 | ArtSpeech: Adaptive Text-to-Speech Synthesis with Articulatory Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To validate the effectiveness of our approach, we conducted comprehensive experiments and analyses using widely recognized speech corpora such as the LJSpeech and LibriTTS datasets, yielding promising improvements in similarity between the generated results and the target speaker’s voice and prosody. |
Zhongxu Wang; Yujia Wang; Mingzhu Li; Hua Huang; |
| 440 | Hybrid Cost Volume for Memory-Efficient Optical Flow Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a novel Hybrid Cost Volume for memory-efficient optical flow, named HCV. |
Yang Zhao; Gangwei Xu; Gang Wu; |
| 441 | Multi-view Self-Supervised Contrastive Learning for Multivariate Time Series Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose a novel self-supervised general-purpose framework called Temporal-Frequency and Contextual Consistency (TFCC). |
Yuhan Wu; Xiyu Meng; Yang He; Junru Zhang; Haowen Zhang; Yabo Dong; Dongming Lu; |
| 442 | Not All Inputs Are Valid: Towards Open-Set Video Moment Retrieval Using Language Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we creatively explore a brand-new VMR setting termed Open-Set Video Moment Retrieval (OS-VMR), where we should not only retrieve precise moments based on in-distribution (ID) queries, but also reject out-of-distribution (OOD) queries. |
Xiang Fang; Wanlong Fang; Daizong Liu; Xiaoye Qu; Jianfeng Dong; Pan Zhou; Renfu Li; Zichuan Xu; Lixing Chen; Panpan Zheng; Yu Cheng; |
| 443 | AesMamba: Universal Image Aesthetic Assessment with State Space Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Is it possible to design a universal IAA framework applicable to the whole IAA task taxonomy? In this paper, we explore this issue and propose a modular IAA framework, dubbed AesMamba. |
Fei Gao; Yuhao Lin; Jiaqi Shi; Maoying Qiao; Nannan Wang; |
| 444 | A General Framework to Boost 3D GS Initialization for Text-to-3D Generation By Lexical Richness Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, such strategies suffer from two critical yet challenging problems: 1) the final shapes are still similar to the initial ones even after training; 2) shapes can be produced only from simple texts, e.g., a dog, but not from lexically richer (or harder) texts, e.g., a dog is sitting on the top of the airplane. To address these problems, this paper proposes a novel general framework to boost 3D GS initialization for text-to-3D generation based on lexical richness. |
Lutao Jiang; Hangyu Li; Lin Wang; |
| 445 | Exploring The Robustness of Decision-Level Through Adversarial Attacks on LLM-Based Embodied Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: At the same time, during the attack process, to more accurately ascertain whether our method is successful in attacking the LLM-based embodied model, we devise a new attack success evaluation method utilizing the BLIP2 model. |
Shuyuan Liu; Jiawei Chen; Shouwei Ruan; Hang Su; Zhaoxia Yin; |
| 446 | Large Multi-modality Model Assisted AI-Generated Image Quality Assessment Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To address the shortfall in semantic content perception of current IQA models, we introduce a large Multi-modality model Assisted AI-Generated Image Quality Assessment (MA-AGIQA) model, which utilizes semantically informed guidance to sense semantic information and extract semantic vectors through carefully designed text prompts. |
Puyi Wang; Wei Sun; Zicheng Zhang; Jun Jia; Yanwei Jiang; Zhichao Zhang; Xiongkuo Min; Guangtao Zhai; |
| 447 | MoS2: Mixture of Scale and Shift Experts for Text-Only Video Captioning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, manually annotating coherent textual descriptions for videos is laborious and time-consuming. To address this challenge, we propose a novel approach that enhances video captioning using only synthetic text data. |
Heng Jia; Yunqiu Xu; Linchao Zhu; Guang Chen; Yufei Wang; Yi Yang; |
| 448 | PASSION: Towards Effective Incomplete Multi-Modal Medical Image Segmentation with Imbalanced Missing Rates Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, the common practice that complete-modality data is visible during model training is far from realistic, as modalities can have imbalanced missing rates in clinical scenarios. In this paper, we, for the first time, formulate such a challenging setting and propose Preference-Aware Self-diStillatION (PASSION) for incomplete multi-modal medical image segmentation under imbalanced missing rates. |
Junjie Shi; Caozhi Shang; Zhaobin Sun; Li Yu; Xin Yang; Zengqiang Yan; |
| 449 | Not All Pairs Are Equal: Hierarchical Learning for Average-Precision-Oriented Video Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To effectively bridge this gap, in this work, we aim to address two primary challenges: a) The current similarity measure and AP-based loss are suboptimal for video retrieval; b) The noticeable noise from frame-to-frame matching introduces ambiguity in estimating the AP loss. In response to these challenges, we propose the Hierarchical learning framework for Average-Precision-oriented Video Retrieval (HAP-VR). |
Yang Liu; Qianqian Xu; Peisong Wen; Siran Dai; Qingming Huang; |
| 450 | CMT: Co-training Mean-Teacher for Unsupervised Domain Adaptation on 3D Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce a novel Co-training Mean-Teacher (CMT) framework for unsupervised domain adaptation in 3D object detection. |
Shijie Chen; Junbao Zhuo; Xin Li; Haizhuang Liu; Rongquan Wang; Jiansheng Chen; Huimin Ma; |
| 451 | Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we aim to improve both coarse- and fine-grained audio-language alignment in large-scale contrastive pre-training. |
Yiming Li; Zhifang Guo; Xiangdong Wang; Hong Liu; |
| 452 | Reversing Structural Pattern Learning with Biologically Inspired Knowledge Distillation for Spiking Neural Networks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although some pruning methods have been proposed to tackle this problem, they normally ignore the fact that the neural topology in the human brain can be adjusted dynamically. Inspired by this, this paper proposes an evolutionary structure construction method for building more reasonable SNNs. |
Qi Xu; Yaxin Li; Xuanye Fang; Jiangrong Shen; Qiang Zhang; Gang Pan; |
| 453 | RSNN: Recurrent Spiking Neural Networks for Dynamic Spatial-Temporal Information Processing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose an efficient Recurrent Spiking Neural Network (RSNN) that uses spiking-based neural dynamics to reduce the time-domain information loss of the original slice samples when processing dynamic spatial-temporal information. |
Qi Xu; Xuanye Fang; Yaxin Li; Jiangrong Shen; De Ma; Yi Xu; Gang Pan; |
| 454 | Multi-Modal Inductive Framework for Text-Video Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a generation-based TVR paradigm facilitated by LLM distillation to better learn and capture deep retrieval knowledge for text-video retrieval, amid the rapid evolution of Large Language Models. |
Qian Li; Yucheng Zhou; Cheng Ji; Feihong Lu; Jianian Gong; Shangguang Wang; Jianxin Li; |
| 455 | DERD: Data-free Adversarial Robustness Distillation Through Self-adversarial Teacher Group Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a Data-free Experts-guided Robustness Distillation (DERD) to extend robustness distillation to the data-free paradigm, which offers three advantages: (1) Dual-level adversarial learning strategy achieves robustness distillation without real data. |
Yuhang Zhou; Yushu Zhang; Leo Yu Zhang; Zhongyun Hua; |
| 456 | Minerva: Enhancing Quantum Network Performance for High-Fidelity Multimedia Transmission Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In addition, when selecting a single link, existing methods can easily fall into the exploration and exploitation dilemma, given various fidelity distributions. To address this issue, this paper proposes a new framework that selects high-fidelity link transmission for multiple tasks through median elimination to estimate fidelity and transmission strategies, thereby improving the application scalability of quantum networks. |
Tingting Li; Ziming Zhao; Jianwei Yin; |
| 457 | In Situ 3D Scene Synthesis for Ubiquitous Embodied Interfaces Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose a scene agent that synthesizes situated 3D virtual scenes as a kind of ubiquitous embodied interface in VR for users. |
Haiyan Jiang; Leiyu Song; Dongdong Weng; Zhe Sun; Huiying Li; Xiaonuo Dongye; Zhenliang Zhang; |
| 458 | Partial Multi-label Learning Based On Near-Far Neighborhood Label Enhancement And Nonlinear Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The existing PML methods have the following problems: (1) the correlation between samples and labels is not fully utilized; (2) the nonlinear nature of the model is not taken into account. To solve these problems, we propose a new method of PML based on label enhancement of near and far neighbor information and nonlinear guidance (PML-LENFN). |
Yu Chen; Yanan Wu; Na Han; Xiaozhao Fang; Bingzhi Chen; Jie Wen; |
| 459 | Breaking Modality Gap in RGBT Tracking: Coupled Knowledge Distillation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose a novel Coupled Knowledge Distillation framework called CKD, which pursues common styles across different modalities to break the modality gap for high-performance RGBT tracking. |
Andong Lu; Jiacong Zhao; Chenglong Li; Yun Xiao; Bin Luo; |
| 460 | CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce CACE-Net, which differs from most existing methods that solely use audio signals to guide visual information. |
Xiang He; Xiangxi Liu; Yang Li; Dongcheng Zhao; Guobin Shen; Qingqun Kong; Xin Yang; Yi Zeng; |
| 461 | Dual-Modeling Decouple Distillation for Unsupervised Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose Dual-Modeling Decouple Distillation (DMDD) for unsupervised anomaly detection. |
Xinyue Liu; Jianyuan Wang; Biao Leng; Shuo Zhang; |
| 462 | APP: Adaptive Pose Pooling for 3D Human Pose Estimation from Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, this approach is inherently influenced by the errors introduced by 2D pose detectors and overlooks the intrinsic spatial information embedded within RGB images. To address these challenges, we introduce a versatile module called Adaptive Pose Pooling (APP), which is compatible with many existing 2D-to-3D lifting models. |
Jinyan Zhang; Mengyuan Liu; Hong Liu; Guoquan Wang; Wenhao Li; |
| 463 | MambaGesture: Enhancing Co-Speech Gesture Generation with Mamba and Disentangled Multi-Modality Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While diffusion models have shown impressive capabilities, current approaches often overlook a wide range of modalities and their interactions, resulting in less dynamic and contextually varied gestures. To address these challenges, we present MambaGesture, a novel framework integrating a Mamba-based attention block, MambaAttn, with a multi-modality feature fusion module, SEAD. |
Chencan Fu; Yabiao Wang; Jiangning Zhang; Zhengkai Jiang; Xiaofeng Mao; Jiafu Wu; Weijian Cao; Chengjie Wang; Yanhao Ge; Yong Liu; |
| 464 | Robust Pseudo-label Learning with Neighbor Relation for Unsupervised Visible-Infrared Person Re-Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Due to the high intra-class variations, noisy pseudo-labels are difficult to calibrate completely. Therefore, we introduce a Neighbor Relation Learning module to reduce high intra-class variations by modeling potential interactions between all samples. |
Xiangbo Yin; Jiangming Shi; Yachao Zhang; Yang Lu; Zhizhong Zhang; Yuan Xie; Yanyun Qu; |
| 465 | Self-Supervised Emotion Representation Disentanglement for Speech-Preserving Facial Expression Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce a novel framework, Self-Supervised Emotion Representation Disentanglement (SSERD), to disentangle emotion representation for accurate emotion transfer while implementing a paired data construction module to facilitate automated, photorealistic facial animations. |
Zhihua Xu; Tianshui Chen; Zhijing Yang; Chunmei Qing; Yukai Shi; Liang Lin; |
| 466 | Towards Photorealistic Video Colorization Via Gated Color-Guided Image Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, based on a pretrained text-to-image model, we introduce the Gated Color Guidance module (GCG), enabling the model to adaptively perform color propagation or generation according to the structural differences between reference and grayscale frames. |
Jiaxing Li; Hongbo Zhao; Yijun Wang; Jianxin Lin; |
| 467 | Efficiency in Focus: LayerNorm As A Catalyst for Fine-tuning Medical Visual Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we endeavour to explore an alternative to traditional PEFT methods, especially the impact of fine-tuning Layer Normalization (LayerNorm) layers, Feedforward Networks and Attention layers on the Med-VLMs. |
Jiawei Chen; Dingkang Yang; Yue Jiang; Mingcheng Li; Jinjie Wei; Xiaolu Hou; Lihua Zhang; |
| 468 | FewVS: A Vision-Semantics Integration Framework for Few-Shot Image Classification Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: As a result, such vision-semantics alignment is inherently biased, leading to suboptimal integration outcomes. In this paper, we avoid such biased vision-semantics alignment by introducing CLIP, a natural bridge between vision and semantics, and enforcing unbiased vision-vision alignment as a proxy task. |
Zhuoling Li; Yong Wang; Kaitong Li; |
| 469 | Semantic Alignment for Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This issue becomes more serious in scenarios where significant variations exist among the images (e.g., visual storytelling). To address this challenge, we introduce Semantic Alignment for Multi-modal large language models (SAM). |
Tao Wu; Mengze Li; Jingyuan Chen; Wei Ji; Wang Lin; Jinyang Gao; Kun Kuang; Zhou Zhao; Fei Wu; |
| 470 | Unpaired Photo-realistic Image Deraining with Energy-informed Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose an energy-informed diffusion model for unpaired photo-realistic image deraining (UPID-EDM). |
Yuanbo Wen; Tao Gao; Ting Chen; |
| 471 | Hypergraph-guided Intra- and Inter-category Relation Modeling for Fine-grained Visual Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, previous methods fail to combine these two complementary dimensions and mine the intrinsic relations among various semantic features. To address these limitations, we propose HI2R, a Hypergraph-guided Intra- and Inter-category Relation Modeling approach, which simultaneously extracts the intra-category structural information and inter-category relation for more precise reasoning. |
Lu Chen; Qiangchang Wang; Zhaohui Li; Yilong Yin; |
| 472 | Prototype-Guided Dual-Transformer Reasoning for Video Individual Counting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we contribute a novel Prototype-guided Dual-Transformer Reasoning framework, termed PDTR, which takes both similarity and difference of adjacent frames into account to achieve accurate counting in an end-to-end regression manner. |
Rui Li; Yishu Liu; Huafeng Li; Jinxing Li; Guangming Lu; |
| 473 | CT2C-QA: Multimodal Question Answering Over Chinese Text, Table and Chart Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present CT2C-QA, a pioneering Chinese reasoning-based QA dataset that includes an extensive collection of text, tables, and charts, meticulously compiled from 200 selectively sourced webpages. |
Bowen Zhao; Tianhao Cheng; Yuejie Zhang; Ying Cheng; Rui Feng; Xiaobo Zhang; |
| 474 | FARFusion V2: A Geometry-based Radar-Camera Fusion Method on The Ground for Roadside Far-Range 3D Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, real-world issues like uneven terrain and sensor movement degrade these transformations’ precision, impacting fusion effectiveness. To alleviate these issues, we propose a geometry-based Radar-camera fusion method on the ground, namely FARFusion V2. |
Yao Li; Jiajun Deng; Yuxuan Xiao; Yingjie Wang; Xiaomeng Chu; Jianmin Ji; Yanyong Zhang; |
| 475 | View-consistent Object Removal in Radiance Fields Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce a novel RF editing pipeline that significantly enhances consistency by requiring the inpainting of only a single reference image. |
Yiren Lu; Jing Ma; Yu Yin; |
| 476 | High Fidelity Aggregated Planar Prior Assisted PatchMatch Multi-View Stereo Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In addition, due to the unreliable planar models in large-scale low-textured objects, the reconstruction results are incomplete. To address the above issues, we introduce the segmentation generated from Segment Anything Model into PatchMatch. |
Jie Liang; Rongjie Wang; Rui Peng; Zhe Zhang; Kaiqiang Xiong; Ronggang Wang; |
| 477 | Event Traffic Forecasting with Sparse Multimodal Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, to broaden the applicable scenarios of traffic forecasting, we focus on modeling the impact of events on traffic patterns and propose an event traffic forecasting problem with multimodal inputs. |
Xiao Han; Zhenduo Zhang; Yiling Wu; Xinfeng Zhang; Zhe Wu; |
| 478 | GS3LAM: Gaussian Semantic Splatting SLAM Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose GS3LAM, a Gaussian Semantic Splatting SLAM framework, which takes multimodal data as input and can render consistent, continuous dense semantic maps in real-time. |
Linfei Li; Lin Zhang; Zhong Wang; Ying Shen; |
| 479 | Animatable 3D Gaussian: Fast and High-Quality Reconstruction of Multiple Human Avatars Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To reduce consumption, we propose Animatable 3D Gaussian, which learns human avatars from input images and poses. |
Yang Liu; Xiang Huang; Minghan Qin; Qinwei Lin; Haoqian Wang; |
| 480 | Cross-modal Observation Hypothesis Inference Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose the Cross-modal Observation hypothesIs iNference task (COIN). |
Mengze Li; Kairong Han; Jiahe Xu; Yueying Li; Tao Wu; Zhou Zhao; Jiaxu Miao; Shengyu Zhang; Jingyuan Chen; |
| 481 | VoCAPTER: Voting-based Pose Tracking for Category-level Articulated Object Via Inter-frame Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we deal with the problem of category-level online robust 9D pose tracking of articulated objects, where we propose VoCAPTER, a novel 3D Voting-based Category-level Articulated object Pose TrackER. |
Li Zhang; Zean Han; Yan Zhong; Qiaojun Yu; Xingyu Wu; Xue Wang; Rujing Wang; |
| 482 | MultiColor: Image Colorization By Learning from Multiple Color Spaces Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we employ a set of dedicated colorization modules, one for each individual color space. |
Xiangcheng Du; Zhao Zhou; Xingjiao Wu; Yanlong Wang; Zhuoyao Wang; Yingbin Zheng; Cheng Jin; |
| 483 | Digging Into Contrastive Learning for Robust Depth Estimation with Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a novel robust depth estimation method called D4RD, featuring a custom contrastive learning mode tailored for diffusion models to mitigate performance degradation in complex environments. |
Jiyuan Wang; Chunyu Lin; Lang Nie; Kang Liao; Shuwei Shao; Yao Zhao; |
| 484 | ExpressiveSinger: Multilingual and Multi-Style Score-based Singing Voice Synthesis with Expressive Performance Control Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Additionally, integrating data and supporting diverse languages and styles in SVS remain challenging. To tackle these issues, this paper presents ExpressiveSinger, an SVS framework that leverages a cascade of diffusion models to generate realistic singing across multiple languages, styles, and techniques from scores and lyrics. |
Shuqi Dai; Ming-Yu Liu; Rafael Valle; Siddharth Gururani; |
| 485 | AutoM3L: An Automated Multimodal Machine Learning Framework with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end, we introduce AutoM3L, an innovative Automated Multimodal Machine Learning framework that leverages LLMs as controllers to automatically construct multimodal training pipelines. |
Daqin Luo; Chengjian Feng; Yuxuan Nong; Yiqing Shen; |
| 486 | PriFU: Capturing Task-Relevant Information Without Adversarial Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although effective, these adversarial learning-based approaches suffer not only from convergence difficulties, but also from limited generalization beyond the specific privacy for which they are trained. To address these issues, we propose a method for privacy preservation in the inference phase by removing task-irrelevant information, which requires no knowledge of the privacy attacks nor introduction of adversarial learning. |
Xiuli Bi; Yang Hu; Bo Liu; Weisheng Li; Pamela Cosman; Bin Xiao; |
| 487 | SDePR: Fine-Grained Leaf Image Retrieval with Structural Deep Patch Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we investigate, for the first time, how to mine spatial structure and contextual information from the activations of the convolutional layers of CNNs for FGLIR. |
Xin Chen; Bin Wang; Jinzheng Jiang; Kunkun Zhang; Yongsheng Gao; |
| 488 | ReCorD: Reasoning and Correcting Diffusion for HOI Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Despite significant advancements in such generative models, challenges persist in depicting detailed human-object interactions, especially regarding pose and object placement accuracy. We introduce a training-free method named Reasoning and Correcting Diffusion (ReCorD) to address these challenges. |
Jian-Yu Jiang-Lin; Kang-Yang Huang; Ling Lo; Yi-Ning Huang; Terence Lin; Jhih-Ciang Wu; Hong-Han Shuai; Wen-Huang Cheng; |
| 489 | Rainmer: Learning Multi-view Representations for Comprehensive Image Deraining and Beyond Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We address image deraining under complex backgrounds, diverse rain scenarios, and varying illumination conditions, representing a highly practical and challenging problem. |
Wu Ran; Peirong Ma; Zhiquan He; Hong Lu; |
| 490 | SceneExpander: Real-Time Scene Synthesis for Interactive Floor Plan Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The system proposed in this paper generates scenes over floor plans in real-time. |
Shao-Kui Zhang; Junkai Huang; Liang Yue; Jia-Tong Zhang; Jia-Hong Liu; Yu-Kun Lai; Song-Hai Zhang; |
| 491 | Controllable Procedural Generation of Landscapes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a controllable framework for procedurally generating landscapes. |
Jia-Hong Liu; Shao-Kui Zhang; Chuyue Zhang; Song-Hai Zhang; |
| 492 | ScenePhotographer: Object-Oriented Photography for Residential Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces ScenePhotographer, an object-oriented framework for automatic view selection in residential scenes. |
Shao-Kui Zhang; Hanxi Zhu; Xuebin Chen; Jinghuan Chen; Zhike Peng; Ziyang Chen; Yong-Liang Yang; Song-Hai Zhang; |
| 493 | A Plug-and-Play Method for Rare Human-Object Interactions Detection By Bridging Domain Gap Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Unfortunately, there is a significant domain gap between the generated data and the original data, and simply merging the generated images into the original dataset cannot significantly boost the performance. To alleviate the above problem, we present a novel model-agnostic framework called Context-Enhanced Feature Alignment (CEFA) module, which can effectively align the generated data with the original data at the feature level and bridge the domain gap. |
Lijun Zhang; Wei Suo; Peng Wang; Yanning Zhang; |
| 494 | MMAL: Multi-Modal Analytic Learning for Exemplar-Free Audio-Visual Class Incremental Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, in the context of Audio-Visual Class-Incremental Learning (AVCIL), the effective integration and utilization of heterogeneous modalities, with their complementary and enhancing characteristics, remains largely unexplored. To bridge this gap, we propose the Multi-Modal Analytic Learning (MMAL) framework, an exemplar-free solution for AVCIL that employs a closed-form, linear approach. |
Xianghu Yue; Xueyi Zhang; Yiming Chen; Chengwei Zhang; Mingrui Lao; Huiping Zhuang; Xinyuan Qian; Haizhou Li; |
| 495 | Fast and Scalable Incomplete Multi-View Clustering with Duality Optimal Graph Filtering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present an alternative optimization algorithm with linear complexity. |
Liang Du; Yukai Shi; Yan Chen; Peng Zhou; Yuhua Qian; |
| 496 | Speech Reconstruction from Silent Lip and Tongue Articulation By Diffusion Models and Text-Guided Pseudo Target Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To overcome the domain discrepancy between silent and standard vocalized articulation, we introduce a novel pseudo target generation strategy. |
Rui-Chen Zheng; Yang Ai; Zhen-Hua Ling; |
| 497 | PEAN: A Diffusion-Based Prior-Enhanced Attention Network for Scene Text Image Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Two factors in scene text images, visual structure and semantic information, affect the recognition performance significantly. To mitigate the effects from these factors, this paper proposes a Prior-Enhanced Attention Network (PEAN). |
Zuoyan Zhao; Hui Xue; Pengfei Fang; Shipeng Zhu; |
| 498 | QPT-V2: Masked Image Modeling Advances Visual Scoring Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While masked image modeling (MIM) has achieved noteworthy advancements across various high-level tasks (e.g., classification and detection), in this work we take a novel perspective and investigate its capabilities in terms of quality- and aesthetics-awareness. |
Qizhi Xie; Kun Yuan; Yunpeng Qu; Mingda Wu; Ming Sun; Chao Zhou; Jihong Zhu; |
| 499 | Enhanced Experts with Uncertainty-Aware Routing for Multimodal Sentiment Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose the Enhanced experts with Uncertainty-Aware Routing (EUAR) method to address the influence of noisy data on multimodal sentiment analysis by capturing uncertainty and dynamically altering the network. |
Zixian Gao; Disen Hu; Xun Jiang; Huimin Lu; Heng Tao Shen; Xing Xu; |
| 500 | Prior Knowledge Integration Via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we explore the use of large language models (LLMs) to enhance video moment retrieval (VMR) by integrating general knowledge and pseudo-events as priors. |
Yiyang Jiang; Wengyu Zhang; Xulu Zhang; Xiao-Yong Wei; Chang Wen Chen; Qing Li; |
This table includes only the 500 papers selected by our selection algorithm. To see the full list, please visit Paper Digest: MM-2024 (Full List).