Paper Digest: ACM Multimedia 2023 Papers & Highlights
Interested users can read all MM-2023 papers in our digest console, which supports additional features.
To search for papers presented at MM-2023 on a specific topic, use the search by venue (MM-2023) service. To summarize the latest research published at MM-2023 on a specific topic, use the review by venue (MM-2023) service. To synthesize the findings from MM 2023 into comprehensive reports, try the MM-2023 Research service. If you would like to browse papers by author, we provide a comprehensive list of all MM-2023 authors & their papers.
This curated list is created by the Paper Digest Team. Experience the cutting-edge capabilities of Paper Digest, an AI-powered research platform that delivers personalized and comprehensive updates on the latest research in your field. It also empowers you to read articles, write articles, get answers, conduct literature reviews, and generate research reports.
Experience the full potential of our services today!
TABLE 1: Paper Digest: ACM Multimedia 2023 Papers & Highlights
| # | Paper & Highlight | Author(s) |
|---|---|---|
| 1 | UnifiedGesture: A Unified Gesture Synthesis Model for Multiple Skeletons. Highlight: In addition, it is a challenging task due to the weak correlation between speech and gestures. To address these problems, we present UnifiedGesture, a novel diffusion model-based speech-driven gesture synthesis approach, trained on multiple gesture datasets with different skeletons. | Sicheng Yang; Zilin Wang; Zhiyong Wu; Minglei Li; Zhensong Zhang; Qiaochu Huang; Lei Hao; Songcen Xu; Xiaofei Wu; Changpeng Yang; Zonghong Dai |
| 2 | SelfTalk: A Self-Supervised Commutative Training Diagram to Comprehend 3D Talking Faces. Highlight: The framework constructs a network system consisting of three modules: facial animator, speech recognizer, and lip-reading interpreter. The core of SelfTalk is a commutative training diagram that facilitates compatible features exchange among audio, text, and lip shape, enabling our models to learn the intricate connection between these factors. | Ziqiao Peng; Yihao Luo; Yue Shi; Hao Xu; Xiangyu Zhu; Hongyan Liu; Jun He; Zhaoxin Fan |
| 3 | Counterfactual Cross-modality Reasoning for Weakly Supervised Video Moment Localization. Highlight: However, the reconstruction may be influenced by the latent spurious correlation between the unmasked and the masked parts, which distorts the restoring process and further degrades the efficacy of contrastive learning since the masked words are not completely reconstructed from the cross-modality knowledge. In this paper, we discover and mitigate this spurious correlation through a novel proposed counterfactual cross-modality reasoning method. | Zezhong Lv; Bing Su; Ji-Rong Wen |
| 4 | Synthesizing Long-Term Human Motions with Diffusion Models Via Coherent Sampling. Highlight: Secondly, they generate subsequent actions in an autoregressive manner without considering the influence of future actions on previous ones. To address these issues, we propose a novel approach that utilizes a past-conditioned diffusion model with two optional coherent sampling methods: Past Inpainting Sampling and Compositional Transition Sampling. | Zhao Yang; Bing Su; Ji-Rong Wen |
| 5 | Text-to-Audio Generation Using Instruction Guided Latent Diffusion Model. Highlight: The immense scale of recent large language models (LLMs) allows many interesting properties, such as instruction- and chain-of-thought-based fine-tuning, that have significantly improved zero- and few-shot performance in many natural language processing (NLP) tasks. Inspired by such successes, we adopt the instruction-tuned LLM Flan-T5 as the text encoder for text-to-audio (TTA) generation, a task where the goal is to generate audio from a textual description. | Deepanway Ghosal; Navonil Majumder; Ambuj Mehrish; Soujanya Poria |
| 6 | Spatio-Temporal Branching for Motion Prediction Using Motion Increments. Highlight: In this paper, we propose a novel spatio-temporal branching network using incremental information for HMP, which decouples the learning of temporal-domain and spatial-domain features, extracts more motion information, and achieves complementary cross-domain knowledge learning through knowledge distillation. | Jiexin Wang; Yujie Zhou; Wenwen Qiang; Ying Ba; Bing Su; Ji-Rong Wen |
| 7 | Simple Techniques Are Sufficient for Boosting Adversarial Transferability. Highlight: In this work, we revisit the method of output space attack and improve it from two perspectives. | Chaoning Zhang; Philipp Benz; Adil Karjauv; In So Kweon; Choong Seon Hong |
| 8 | Towards Explainable In-the-Wild Video Quality Assessment: A Database and A Language-Prompted Approach. Highlight: Unlike early definitions that usually focus on limited distortion types, VQA on in-the-wild videos is especially challenging as it could be affected by complicated factors, including various distortions and diverse contents. Though subjective studies have collected overall quality scores for these videos, how the abstract quality scores relate with specific factors is still obscure, hindering VQA methods from more concrete quality evaluations (e.g., sharpness of a video). | Haoning Wu; Erli Zhang; Liang Liao; Chaofeng Chen; Jingwen Hou; Annan Wang; Wenxiu Sun; Qiong Yan; Weisi Lin |
| 9 | Taming The Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow. Highlight: However, simply using clothes as a condition for guiding the diffusion model to inpaint is insufficient to maintain the details of the clothes. To overcome this challenge, we propose an exemplar-based inpainting approach that leverages a warping module to guide the diffusion model’s generation effectively. | Junhong Gou; Siyu Sun; Jianfu Zhang; Jianlou Si; Chen Qian; Liqing Zhang |
| 10 | Beyond Generic: Enhancing Image Captioning with Real-World Knowledge Using Vision-Language Pre-Training Model. Highlight: However, using VLP models faces challenges: zero-shot inference suffers from knowledge hallucination that leads to low-quality descriptions, but the generic bias in downstream task fine-tuning hinders the VLP model from expressing knowledge. To address these concerns, we propose a simple yet effective method called Knowledge-guided Replay (K-Replay), which enables the retention of pre-training knowledge during fine-tuning. | Kanzhi Cheng; Wenpo Song; Zheng Ma; Wenhao Zhu; Zixuan Zhu; Jianbing Zhang |
| 11 | CLE Diffusion: Controllable Light Enhancement Diffusion Model. Highlight: However, most existing enhancement algorithms are designed to homogeneously increase the brightness of images to a pre-defined extent, limiting the user experience. To address this issue, we propose Controllable Light Enhancement Diffusion Model, dubbed CLE Diffusion, a novel diffusion framework to provide users with rich controllability. Built with a conditional diffusion model, we introduce an illumination embedding to let users control their desired brightness level. | Yuyang Yin; Dejia Xu; Chuangchuang Tan; Ping Liu; Yao Zhao; Yunchao Wei |
| 12 | CONVERT: Contrastive Graph Clustering with Reliable Augmentation. Highlight: The reliability of the augmented view semantics for contrastive learning cannot be guaranteed, thus limiting the model performance. To address these problems, we propose a novel CONtrastiVe Graph ClustEring network with Reliable AugmenTation (CONVERT). | Xihong Yang; Cheng Tan; Yue Liu; Ke Liang; Siwei Wang; Sihang Zhou; Jun Xia; Stan Z. Li; Xinwang Liu; En Zhu |
| 13 | DealMVC: Dual Contrastive Calibration for Multi-view Clustering. Highlight: The existing multi-view models mainly focus on the consistency of the same samples in different views while ignoring the circumstance of similar but different samples in cross-view scenarios. To solve this problem, we propose a novel Dual contrastive calibration network for Multi-View Clustering (DealMVC). | Xihong Yang; Jin Jiaqi; Siwei Wang; Ke Liang; Yue Liu; Yi Wen; Suyuan Liu; Sihang Zhou; Xinwang Liu; En Zhu |
| 14 | Hashing One With All. Highlight: Determining and maintaining the relations between each datum and all the others maximally utilize the semantic diversity of the training set, but can we explore this on the holistic dataset using a single network end-to-end? In this paper, we take a step towards this vision by proposing Overview Hashing (OH). | Jiaguo Yu; Yuming Shen; Haofeng Zhang |
| 15 | Bridging Language and Geometric Primitives for Zero-shot Point Cloud Segmentation. Highlight: To this end, we propose a novel framework to learn the geometric primitives shared in seen and unseen categories’ objects and employ a fine-grained alignment between language and the learned geometric primitives. | Runnan Chen; Xinge Zhu; Nenglun Chen; Wei Li; Yuexin Ma; Ruigang Yang; Wenping Wang |
| 16 | A Lightweight Collective-attention Network for Change Detection. Highlight: Unfortunately, instead of devoting full attention to changes, most existing solutions often expend unnecessary resources yet derive task-irrelevant features. To relieve this issue, we propose a collective-attention network, which enjoys lightweight model architecture yet guarantees high performance. | Yuchao Feng; Yanyan Shao; Honghui Xu; Jinshan Xu; Jianwei Zheng |
| 17 | Dynamic Contrastive Learning with Pseudo-samples Intervention for Weakly Supervised Joint Video MR and HD. Highlight: In addition, some tasks introduce weakly supervised learning with random masks, while the single masking forces the model to focus on masked words and ignore multi-modal contextual information. In view of this, we attempt weakly supervised joint tasks (MR+HD) and propose Dynamic Contrastive Learning with Pseudo-Sample Intervention (CPI) for better multi-modal video comprehension. | Shuhan Kong; Liang Li; Beichen Zhang; Wenyu Wang; Bin Jiang; Chenggang Yan; Changhao Xu |
| 18 | VTLayout: A Multi-Modal Approach for Video Text Layout. Highlight: To bridge the gap between video OCR and understanding, we explore the study of video text layout in this work. | Yuxuan Zhao; Jin Ma; Zhongang Qi; Zehua Xie; Yu Luo; Qiusheng Kang; Ying Shan |
| 19 | LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation. Highlight: In this work, we strive to synthesize high-fidelity images that are semantically aligned with a given textual prompt without any guidance. | Leigang Qu; Shengqiong Wu; Hao Fei; Liqiang Nie; Tat-Seng Chua |
| 20 | RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training. Highlight: In this paper, we propose a retrieval-augmented pretrain-and-finetune paradigm named RAMM for biomedical VQA to overcome the data limitation issue. | Zheng Yuan; Qiao Jin; Chuanqi Tan; Zhengyun Zhao; Hongyi Yuan; Fei Huang; Songfang Huang |
| 21 | DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder. Highlight: Additionally, these methods require an external pretrained model for extracting these representations, whose performance sets an upper bound on talking face generation. To address these limitations, we propose a novel method called DAE-Talker that leverages data-driven latent representations obtained from a diffusion autoencoder (DAE). | Chenpeng Du; Qi Chen; Tianyu He; Xu Tan; Xie Chen; Kai Yu; Sheng Zhao; Jiang Bian |
| 22 | HSIC-based Moving Weight Averaging for Few-Shot Open-Set Object Detection. Highlight: In this work, we solve the challenging few-shot open-set object detection problems from three aspects. | Binyi Su; Hua Zhang; Zhong Zhou |
| 23 | Two-stage Content-Aware Layout Generation for Poster Designs. Highlight: Additionally, existing layout generation models often fail to incorporate explicit aesthetic principles such as alignment and non-overlap, and neglect implicit aesthetic principles which are hard to model. To address these issues, this paper proposes a two-stage content-aware layout generation framework for poster layout generation. | Shang Chai; Liansheng Zhuang; Fengying Yan; Zihan Zhou |
| 24 | Improving The Transferability of Adversarial Examples with Arbitrary Style Transfer. Highlight: Hence, we propose a novel attack method named Style Transfer Method (STM) that utilizes a proposed arbitrary style transfer network to transform the images into different domains. | Zhijin Ge; Fanhua Shang; Hongying Liu; Yuanyuan Liu; Liang Wan; Wei Feng; Xiaosen Wang |
| 25 | UniSA: Unified Generative Framework for Sentiment Analysis. Highlight: However, unifying all subtasks in sentiment analysis presents numerous challenges, including modality alignment, unified input/output forms, and dataset bias. To address these challenges, we propose a Task-Specific Prompt method to jointly model subtasks and introduce a multimodal generative framework called UniSA. | Zaijing Li; Ting-En Lin; Yuchuan Wu; Meng Liu; Fengxiao Tang; Ming Zhao; Yongbin Li |
| 26 | Few-shot Multimodal Sentiment Analysis Based on Multimodal Probabilistic Fusion Prompts. Highlight: To enhance the model’s robustness, we introduce a probabilistic fusion method to fuse output predictions from multiple diverse prompts for each input. | Xiaocui Yang; Shi Feng; Daling Wang; Yifei Zhang; Soujanya Poria |
| 27 | Skeleton MixFormer: Multivariate Topology Representation for Skeleton-based Action Recognition. Highlight: The root cause is that the current skeleton transformer depends on the self-attention mechanism of the complete channel of the global joint, ignoring the highly discriminative differential correlation within the channel, so it is challenging to learn the expression of the multivariate topology dynamically. To tackle this, we present Skeleton MixFormer, an innovative spatio-temporal architecture to effectively represent the physical correlations and temporal interactivity of the compact skeleton data. | Wentian Xin; Qiguang Miao; Yi Liu; Ruyi Liu; Chi-Man Pun; Cheng Shi |
| 28 | Text-to-Image Diffusion Models Can Be Easily Backdoored Through Multimodal Data Poisoning. Highlight: To gain a better understanding of the training process and potential risks of text-to-image synthesis, we perform a systematic investigation of backdoor attack on text-to-image diffusion models and propose BadT2I, a general multimodal backdoor attack framework that tampers with image synthesis in diverse semantic levels. | Shengfang Zhai; Yinpeng Dong; Qingni Shen; Shi Pu; Yuejian Fang; Hang Su |
| 29 | Degeneration-Tuning: Using Scrambled Grid Shield Unwanted Concepts from Stable Diffusion. Highlight: In this work, we propose a novel strategy named Degeneration-Tuning (DT) to shield contents of unwanted concepts from SD weights. | Zixuan Ni; Longhui Wei; Jiacheng Li; Siliang Tang; Yueting Zhuang; Qi Tian |
| 30 | RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture. Highlight: In this paper, we propose RoomDreamer, which leverages powerful natural language to synthesize a new room with a different style. | Liangchen Song; Liangliang Cao; Hongyu Xu; Kai Kang; Feng Tang; Junsong Yuan; Zhao Yang |
| 31 | Gradient-Free Textual Inversion. Highlight: In this paper, we introduce a gradient-free framework to optimize the continuous textual inversion in an iterative evolutionary strategy. | Zhengcong Fei; Mingyuan Fan; Junshi Huang |
| 32 | UER: A Heuristic Bias Addressing Approach for Online Continual Learning. Highlight: In this paper, we try to address the bias issue by a more straightforward and more efficient method. | Huiwei Lin; Shanshan Feng; Baoquan Zhang; Hongliang Qiao; Xutao Li; Yunming Ye |
| 33 | Digital Twins Fuzzy System Based on Time Series Forecasting Model LFTformer. Highlight: For wind power forecasting, this paper proposes a new algorithm, LFTformer, which combines the Transformer model with linear fuzzy information granulation (LFIG). | Jinkang Guo; Zhibo Wan; Zhihan Lv |
| 34 | Learning Discriminative Feature Representation for Open Set Action Recognition. Highlight: In this paper, we propose a novel framework for OSAR that enriches the discriminative representation from a backbone with a reconstructive one to further improve performance. | Hongjie Zhang; Yi Liu; Yali Wang; Limin Wang; Yu Qiao |
| 35 | Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval. Highlight: To fill this research gap, inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities. | Yi Bin; Haoxuan Li; Yahui Xu; Xing Xu; Yang Yang; Heng Tao Shen |
| 36 | Mirror-NeRF: Learning Neural Radiance Fields for Mirrors with Whitted-Style Ray Tracing. Highlight: However, since no physical reflection is considered in its rendering pipeline, NeRF mistakes the reflection in the mirror as a separate virtual scene, leading to the inaccurate reconstruction of the mirror and multi-view inconsistent reflections in the mirror. In this paper, we present a novel neural rendering framework, named Mirror-NeRF, which is able to learn accurate geometry and reflection of the mirror and support various scene manipulation applications with mirrors, such as adding new objects or mirrors into the scene and synthesizing the reflections of these new objects in mirrors, controlling mirror roughness, etc. | Junyi Zeng; Chong Bao; Rui Chen; Zilong Dong; Guofeng Zhang; Hujun Bao; Zhaopeng Cui |
| 37 | Visual Causal Scene Refinement for Video Question Answering. Highlight: In this paper, to discover critical video segments and frames that serve as the visual causal scene for generating reliable answers, we present a causal analysis of VideoQA and propose a framework for cross-modal causal relational reasoning, named Visual Causal Scene Refinement (VCSR). | Yushen Wei; Yang Liu; Hong Yan; Guanbin Li; Liang Lin |
| 38 | Exploring The Knowledge Transferred By Response-Based Teacher-Student Distillation. Highlight: The method is motivated by observing that the predicted probabilities reflect the relation among labels, which is the knowledge to be transferred. | Liangchen Song; Xuan Gong; Helong Zhou; Jiajie Chen; Qian Zhang; David Doermann; Junsong Yuan |
| 39 | DiffDance: Cascaded Human Motion Diffusion Model for Dance Generation. Highlight: Conventional autoregressive methods introduce compounding errors during sampling and struggle to capture the long-term structure of dance sequences. To address these limitations, we present a novel cascaded motion diffusion model, DiffDance, designed for high-resolution, long-form dance generation. | Qiaosong Qi; Le Zhuo; Aixi Zhang; Yue Liao; Fei Fang; Si Liu; Shuicheng Yan |
| 40 | ALEX: Towards Effective Graph Transfer Learning with Noisy Labels. Highlight: To bridge this gap, the present paper investigates the problem of graph transfer learning in the presence of label noise, which transfers knowledge from a noisy source graph to an unlabeled target graph. We introduce a novel technique termed Balance Alignment and Information-aware Examination (ALEX) to address this challenge. | Jingyang Yuan; Xiao Luo; Yifang Qin; Zhengyang Mao; Wei Ju; Ming Zhang |
| 41 | Decoupled Cross-Scale Cross-View Interaction for Stereo Image Enhancement in The Dark. Highlight: This can be attributed to two main factors: 1) insufficient single-scale inter-view interaction hinders the exploitation of valuable cross-view cues; 2) lacking long-range dependency leads to the inability to deal with the spatial long-range effects caused by illumination degradation. To address these limitations, we propose a novel LLSIE model named Decoupled Cross-Scale Cross-View Interaction Network (DCI-Net). | Huan Zheng; Zhao Zhang; Jicong Fan; Richang Hong; Yi Yang; Shuicheng Yan |
| 42 | Hierarchical Prompt Learning Using CLIP for Multi-label Classification with Single Positive Labels. Highlight: Though achieving promising performance, they generally consider labels independently, leaving out the inherent hierarchical semantic relationship among labels which reveals that labels can be clustered into groups. In this paper, we propose a hierarchical prompt learning method with a novel Hierarchical Semantic Prompt Network (HSPNet) to harness such hierarchical semantic relationships using a large-scale pretrained vision and language model, i.e., CLIP, for SPML. | Ao Wang; Hui Chen; Zijia Lin; Zixuan Ding; Pengzhang Liu; Yongjun Bao; Weipeng Yan; Guiguang Ding |
| 43 | LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On. Highlight: This work introduces LaDI-VTON, the first Latent Diffusion textual Inversion-enhanced model for the Virtual Try-ON task. | Davide Morelli; Alberto Baldrati; Giuseppe Cartella; Marcella Cornia; Marco Bertini; Rita Cucchiara |
| 44 | Object Segmentation By Mining Cross-Modal Semantics. Highlight: In this paper, we propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features, with the aim of controlling the modal contribution based on relative entropy. | Zongwei Wu; Jingjing Wang; Zhuyun Zhou; Zhaochong An; Qiuping Jiang; Cédric Demonceaux; Guolei Sun; Radu Timofte |
| 45 | MAE-DFER: Efficient Masked Autoencoder for Self-supervised Dynamic Facial Expression Recognition. Highlight: Inspired by recent unprecedented success of masked autoencoders (e.g., VideoMAE), this paper proposes MAE-DFER, a novel self-supervised method which leverages large-scale self-supervised pre-training on abundant unlabeled data to largely advance the development of DFER. | Licai Sun; Zheng Lian; Bin Liu; Jianhua Tao |
| 46 | Multimodal Color Recommendation in Vector Graphic Documents. Highlight: In this study, we propose a multimodal masked color model that integrates both color and textual contexts to provide text-aware color recommendation for graphic documents. | Qianru Qiu; Xueting Wang; Mayu Otani |
| 47 | Dance with You: The Diversity Controllable Dancer Generation Via Diffusion Models. Highlight: In this paper, we introduce a novel multi-dancer synthesis task called partner dancer generation, which involves synthesizing virtual human dancers capable of performing dance with users. | Siyue Yao; Mingjie Sun; Bingliang Li; Fengyu Yang; Junle Wang; Ruimao Zhang |
| 48 | CgT-GAN: CLIP-guided Text GAN for Image Captioning. Highlight: Given that it is trivial to obtain images in the real world, we propose CLIP-guided text GAN (CgT-GAN), which incorporates images into the training process to enable the model to see real visual modality. | Jiarui Yu; Haoran Li; Yanbin Hao; Bin Zhu; Tong Xu; Xiangnan He |
| 49 | A Symbolic Characters Aware Model for Solving Geometry Problems. Highlight: However, by simply tokenizing symbolic characters into individual letters (e.g., ‘A’, ‘B’ and ‘C’), existing works fail to study them explicitly and thus lose the semantic relationship with the diagram. In this paper, we develop a symbolic character-aware model to fully explore the role of these characters in both text and diagram understanding and optimize the model under a multi-modal reasoning framework. | Maizhen Ning; Qiu-Feng Wang; Kaizhu Huang; Xiaowei Huang |
| 50 | Enhancing Visibility in Nighttime Haze Images Using Guided APSF and Gradient Adaptive Convolution. Highlight: In this paper, we enhance the visibility from a single nighttime haze image by suppressing glow and enhancing low-light regions. | Yeying Jin; Beibei Lin; Wending Yan; Yuan Yuan; Wei Ye; Robby T. Tan |
| 51 | DANet: Multi-scale UAV Target Detection with Dynamic Feature Perception and Scale-aware Knowledge Distillation. Highlight: In this paper, we design a dynamic attentive network (DANet) incorporating a scale-adaptive feature enhancement mechanism (SaFEM) and an attention-guided cross-weighting feature aggregator (ACFA). | Houzhang Fang; Zikai Liao; Lu Wang; Qingshan Li; Yi Chang; Luxin Yan; Xuhua Wang |
| 52 | Your Negative May Not Be True Negative: Boosting Image-Text Matching with False Negative Elimination. Highlight: In this paper, we propose a novel False Negative Elimination (FNE) strategy to select negatives via sampling, which could alleviate the problem introduced by false negatives. | HaoXuan Li; Yi Bin; Junrong Liao; Yang Yang; Heng Tao Shen |
| 53 | Neural Image Popularity Assessment with Retrieval-augmented Transformer. Highlight: Therefore, a social preference model should be able to account for this user-specific aspect of the task. To address this issue, we present a retrieval-augmented approach that leverages both image features and user-specific statistics for neural image popularity assessment. | Liya Ji; Chan Ho Park; Zhefan Rao; Qifeng Chen |
| 54 | Joint Searching and Grounding: Multi-Granularity Video Content Retrieval. Highlight: In this paper, we introduce a challenging but more realistic task called Multi-Granularity Video Content Retrieval (MGVCR), which involves retrieving both video files and specific video content with their temporal locations. | Zhiguo Chen; Xun Jiang; Xing Xu; Zuo Cao; Yijun Mo; Heng Tao Shen |
| 55 | Event-guided Frame Interpolation and Dynamic Range Expansion of Single Rolling Shutter Image. Highlight: Another limitation of RS cameras in complex dynamic scenarios lies in the dynamic range, since traditional ways of multiple exposure for high dynamic range (HDR) imaging will fail due to alignment issues. To deal with these two challenges simultaneously, we propose to use an event camera for assistance, which has much faster temporal response and wider dynamic range. | Guixu Lin; Jin Han; Mingdeng Cao; Zhihang Zhong; Yinqiang Zheng |
| 56 | Redundancy-aware Transformer for Video Question Answering. Highlight: To this end, we propose a novel transformer-based architecture that aims to model VideoQA in a redundancy-aware manner. | Yicong Li; Xun Yang; An Zhang; Chun Feng; Xiang Wang; Tat-Seng Chua |
| 57 | DAWN: Direction-aware Attention Wavelet Network for Image Deraining. Highlight: However, the existing wavelet-based methods ignore the heterogeneous degradation for different coefficients due to the inherent directional characteristics of rain streaks, leading to inter-frequency conflicts and compromised deraining results. To address this issue, we propose a novel Direction-aware Attention Wavelet Network (DAWN) for rain streaks removal. | Kui Jiang; Wenxuan Liu; Zheng Wang; Xian Zhong; Junjun Jiang; Chia-Wen Lin |
| 58 | Scalable Incomplete Multi-View Clustering with Structure Alignment. Highlight: Such an AUP-ID would cause inaccurate graph fusion and degrade clustering performance. To tackle these issues, we propose a novel incomplete anchor graph learning framework termed Scalable Incomplete Multi-View Clustering with Structure Alignment (SIMVC-SA). | Yi Wen; Siwei Wang; Ke Liang; Weixuan Liang; Xinhang Wan; Xinwang Liu; Suyuan Liu; Jiyuan Liu; En Zhu |
| 59 | DCEL: Deep Cross-modal Evidential Learning for Text-Based Person Retrieval. Highlight: To this end, we propose a novel framework termed Deep Cross-modal Evidential Learning (DCEL), which deploys evidential deep learning to consider the cross-modal alignment uncertainty. | Shenshen Li; Xing Xu; Yang Yang; Fumin Shen; Yijun Mo; Yujie Li; Heng Tao Shen |
| 60 | ProtoHPE: Prototype-guided High-frequency Patch Enhancement for Visible-Infrared Person Re-identification. Highlight: In contrast, we find that some cross-modal correlated high-frequency components contain discriminative visual patterns and are less affected by variations such as wavelength, pose, and background clutter than holistic images. Therefore, we are motivated to bridge the modality gap based on such high-frequency components, and propose Prototype-guided High-frequency Patch Enhancement (ProtoHPE) with two core designs. | Guiwei Zhang; Yongfei Zhang; Zichang Tan |
| 61 | Latent-space Unfolding for MRI Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To achieve a performance promotion yet with the guarantee of running efficiency, in this work, we propose a latent-space unfolding network (LsUNet). |
Jiawei Jiang; Yuchao Feng; Jiacheng Chen; Dongyan Guo; Jianwei Zheng; |
| 62 | Uni-paint: A Unified Framework for Multimodal Image Inpainting with Pretrained Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Unfortunately, existing diffusion-based inpainting methods are limited to single-modal guidance and require task-specific training, hindering their cross-modal scalability. To address these limitations, we propose Uni-paint, a unified framework for multimodal inpainting that offers various modes of guidance, including unconditional, text-driven, stroke-driven, exemplar-driven inpainting, as well as a combination of these modes. |
Shiyuan Yang; Xiaodong Chen; Jing Liao; |
| 63 | Fine-Grained Multimodal Named Entity Recognition and Grounding with A Generative Framework Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, existing MNER studies primarily classify entities into four coarse-grained entity types, which are often insufficient to map them to their real-world referents. To solve these limitations, we introduce a task named Fine-grained Multimodal Named Entity Recognition and Grounding (FMNERG) in this paper, which aims to simultaneously extract named entities in text, their fine-grained entity types, and their grounded visual objects in image. |
Jieming Wang; Ziyan Li; Jianfei Yu; Li Yang; Rui Xia; |
| 64 | Open-Scenario Domain Adaptive Object Detection in Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these algorithms typically assume a single source and target domain for adaptation, which is not representative of the more complex data distributions in practice. To address this issue, we propose a novel Open-Scenario Domain Adaptive Object Detection (OSDA), which leverages multiple source and target domains for more practical and effective domain adaptation. |
Zeyu Ma; Ziqiang Zheng; Jiwei Wei; Xiaoyong Wei; Yang Yang; Heng Tao Shen; |
| 65 | Semantic-based Selection, Synthesis, and Supervision for Few-shot Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This is because a feature extractor trained on base samples (known knowledge) tends to focus on the textures and structures of the objects it learns, which is inadequate for describing novel samples. To solve these issues, we introduce semantics and propose a Semantic-based Selection, Synthesis, and Supervision (4S) method, where semantics provide more diverse and informative supervision for recognizing novel objects. |
Jinda Lu; Shuo Wang; Xinyu Zhang; Yanbin Hao; Xiangnan He; |
| 66 | Point-aware Interaction and CNN-induced Refinement Network for RGB-D Salient Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end, we introduce CNNs-assisted Transformer architecture and propose a novel RGB-D SOD network with Point-aware Interaction and CNN-induced Refinement (PICR-Net). |
Runmin Cong; Hongyu Liu; Chen Zhang; Wei Zhang; Feng Zheng; Ran Song; Sam Kwong; |
| 67 | SDDNet: Style-guided Dual-layer Disentanglement Network for Shadow Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Despite significant progress in shadow detection, current methods still struggle with the adverse impact of background color, which may lead to errors when shadows are present on complex backgrounds. Drawing inspiration from the human visual system, we treat the input shadow image as a composition of a background layer and a shadow layer, and design a Style-guided Dual-layer Disentanglement Network (SDDNet) to model these layers independently. |
Runmin Cong; Yuchen Guan; Jinpeng Chen; Wei Zhang; Yao Zhao; Sam Kwong; |
| 68 | Equivariant Learning for Out-of-Distribution Cold-start Recommendation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose an equivariant learning framework, which aims to achieve equivariant alignment between item features, feature representations, and CF representations in the underrepresented feature space. |
Wenjie Wang; Xinyu Lin; Liuhui Wang; Fuli Feng; Yinwei Wei; Tat-Seng Chua; |
| 69 | Personalized Behavior-Aware Transformer for Multi-Behavior Sequential Recommendation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: On the other hand, there exists comprehensive co-influence between behavior correlations and item collaborations, the intensity of which is deeply affected by temporal factors. To tackle these challenges, we propose a Personalized Behavior-Aware Transformer framework (PBAT) for MBSR problem, which models personalized patterns and multifaceted sequential collaborations in a novel way to boost recommendation performance. |
Jiajie Su; Chaochao Chen; Zibin Lin; Xi Li; Weiming Liu; Xiaolin Zheng; |
| 70 | Deconfounded Multimodal Learning for Spatio-temporal Video Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: As a result, they may suffer from spurious correlations and lack the ability to generalize well to new or diverse scenarios. To overcome this limitation, we introduce a deconfounded multimodal learning framework, which utilizes a structural causal model to treat dataset biases as a confounder and subsequently remove their confounding effect. |
Jiawei Wang; Zhanchang Ma; Da Cao; Yuquan Le; Junbin Xiao; Tat-Seng Chua; |
| 71 | Frequency Perception Network for Camouflaged Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Considering that the features of the camouflaged object and the background are more discriminative in the frequency domain, we propose a novel learnable and separable frequency perception mechanism driven by the semantic hierarchy in the frequency domain. |
Runmin Cong; Mengyao Sun; Sanyi Zhang; Xiaofei Zhou; Wei Zhang; Yao Zhao; |
| 72 | Modeling Multi-Relational Connectivity for Personalized Fashion Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: With a new perspective, this paper proposes to formulate the personalized item matching as the multi-relational connectivity and apply a single-component translation operation to model the targeted third-order interactions. |
Yujuan Ding; P.Y. Mok; Yi Bin; Xun Yang; Zhiyong Cheng; |
| 73 | Noise-Robust Continual Test-Time Domain Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we address the noise-robustness problem in continual TTA by offering three effective recipes to mitigate it. |
Zhiqi Yu; Jingjing Li; Zhekai Du; Fengling Li; Lei Zhu; Yang Yang; |
| 74 | Sparse Sharing Relation Network for Panoptic Driving Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, to model the co-occurrence and spatial relationships of traffic objects, we propose to use a Graph Convolutional Network (GCN) block operating on the patches of feature maps. |
Fan Jiang; Zilei Wang; |
| 75 | Target-Guided Composed Image Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although existing efforts have achieved compelling success, they overlook two aspects: modeling the conflict relationship between the reference image and the modification text to improve multimodal query composition, and modeling the adaptive matching degree to promote the ranking of candidate images that may match the given query at different levels. To address these two limitations, in this work, we propose a Target-Guided Composed Image Retrieval network (TG-CIR). |
Haokun Wen; Xian Zhang; Xuemeng Song; Yinwei Wei; Liqiang Nie; |
| 76 | VioLET: Vision-Language Efficient Tuning with Collaborative Multi-modal Gradients Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We investigate this limitation and demonstrate that simultaneous tuning of the two modalities in such models leads to multi-modal forgetting and catastrophic performance degradation, particularly when generalizing to new classes. To address this issue, we propose a novel PET approach called VioLET (Vision Language Efficient Tuning) that utilizes collaborative multi-modal gradients to unlock the full potential of both modalities. |
Yaoming Wang; Yuchen Liu; Xiaopeng Zhang; Jin Li; Bowen Shi; Chenglin Li; Wenrui Dai; Hongkai Xiong; Qi Tian; |
| 77 | Depth-Aware Sparse Transformer for Video-Language Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, it is cumbersome to compute the self-attention on a long-range sequence and heterogeneous video-level representations with regard to computation cost and flexibility on various frame scales. To tackle this, we propose a hierarchical transformer, termed Depth-Aware Sparse Transformer (DAST). |
Haonan Zhang; Lianli Gao; Pengpeng Zeng; Alan Hanjalic; Heng Tao Shen; |
| 78 | SepMark: Deep Separable Watermarking for Unified Source Tracing and Deepfake Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Although many countermeasures have been developed to detect Deepfakes ex-post, undoubtedly, passive forensics has not considered any preventive measures for the pristine face before foreseeable manipulations. To complete this forensics ecosystem, we thus put forward the proactive solution dubbed SepMark, which provides a unified framework for source tracing and Deepfake detection. |
Xiaoshuai Wu; Xin Liao; Bo Ou; |
| 79 | Faster Video Moment Retrieval with Point-Level Supervision Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing VMR methods suffer from two defects: (1) massive expensive temporal annotations are required to obtain satisfying performance; (2) complicated cross-modal interaction modules are deployed, which lead to high computational cost and low efficiency for the retrieval process. To address these issues, we propose a novel method termed Cheaper and Faster Moment Retrieval (CFMR), which balances the retrieval accuracy, efficiency, and annotation cost for VMR. |
Xun Jiang; Zailei Zhou; Xing Xu; Yang Yang; Guoqing Wang; Heng Tao Shen; |
| 80 | Scene-aware Human Pose Generation Using Transformer Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we focus on contextual affordance learning, i.e., using affordance as context to generate a reasonable human pose in a scene. |
Jieteng Yao; Junjie Chen; Li Niu; Bin Sheng; |
| 81 | Guided Image Synthesis Via Initial Image Editing in Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In particular, we show that modifying a part of the initial image affects the corresponding region of the generated image while leaving other regions unaffected, which is useful for repainting tasks. |
Jiafeng Mao; Xueting Wang; Kiyoharu Aizawa; |
| 82 | StableVQA: A Deep No-Reference Quality Assessment Model for Video Stability Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In recent years, many video stabilization algorithms have been proposed, yet no specific and accurate metric enables comprehensively evaluating the stability of videos. |
Tengchuan Kou; Xiaohong Liu; Wei Sun; Jun Jia; Xiongkuo Min; Guangtao Zhai; Ning Liu; |
| 83 | Cross-modality Representation Interactive Learning for Multimodal Sentiment Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a Cross-modality Representation Interactive Learning (CRIL) approach, which adopts the text modality to guide other modalities for learning representative feature tokens, contributing to effective multimodal fusion in multimodal sentiment analysis. |
Jian Huang; Yanli Ji; Yang Yang; Heng Tao Shen; |
| 84 | Graph-Based Video-Language Learning with Multi-Grained Audio-Visual Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: One of the key challenges in this area is how to effectively integrate visual and linguistic information to enable machines to understand video content and query information. In this work, we leverage graph-based representations and multi-grained audio-visual alignment to address this challenge. |
Chenyang Lyu; Wenxi Li; Tianbo Ji; Longyue Wang; Liting Zhou; Cathal Gurrin; Linyi Yang; Yi Yu; Yvette Graham; Jennifer Foster; |
| 85 | CARIS: Context-Aware Referring Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The stand-alone linguistic features are therefore unable to align with all visual concepts, resulting in inaccurate segmentation. In this paper, we propose to address this issue by incorporating rich visual context into linguistic features for sufficient vision-language alignment. |
Sun-Ao Liu; Yiheng Zhang; Zhaofan Qiu; Hongtao Xie; Yongdong Zhang; Ting Yao; |
| 86 | Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Built upon the HostSG, we present a niche-targeting VidSRL framework. |
Yu Zhao; Hao Fei; Yixin Cao; Bobo Li; Meishan Zhang; Jianguo Wei; Min Zhang; Tat-Seng Chua; |
| 87 | KeyPosS: Plug-and-Play Facial Landmark Detection Through GPS-Inspired True-Range Multilateration Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Conventional heatmap or coordinate regression-based techniques, however, often face challenges in terms of computational burden and quantization errors. To address these issues, we present the KeyPoint Positioning System (KeyPosS) – a groundbreaking facial landmark detection framework that stands out from existing methods. |
Xu Bao; Zhi-Qi Cheng; Jun-Yan He; Wangmeng Xiang; Chenyang Li; Jingdong Sun; Hanbing Liu; Wei Liu; Bin Luo; Yifeng Geng; Xuansong Xie; |
| 88 | Style-Controllable Generalized Person Re-identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To improve the difficulty of metric learning under multi-source training, we design a Style-aware Hard-negative Sampling (SHS) strategy. |
Yuke Li; Jingkuan Song; Hao Ni; Heng Tao Shen; |
| 89 | PiPa: Pixel- and Patch-wise Self-supervised Learning for Domain Adaptative Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, the primary intra-domain knowledge, such as context correlation inside an image, remains under-explored. In an attempt to fill this gap, we revisit the current pixel contrast in semantic segmentation and propose a unified pixel- and patch-wise self-supervised learning framework, called PiPa, for domain adaptive semantic segmentation that facilitates intra-image pixel-wise correlations and patch-wise semantic consistency against different contexts. |
Mu Chen; Zhedong Zheng; Yi Yang; Tat-Seng Chua; |
| 90 | UniSinger: Unified End-to-End Singing Voice Synthesis With Cross-Modality Information Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose UniSinger, a unified end-to-end singing voice synthesizer, which integrates three abilities related to singing voice generation: singing voice synthesis (SVS), singing voice conversion (SVC), and singing voice editing (SVE) into a single framework. |
Zhiqing Hong; Chenye Cui; Rongjie Huang; Lichao Zhang; Jinglin Liu; Jinzheng He; Zhou Zhao; |
| 91 | HELIOS: Hyper-Relational Schema Modeling from Knowledge Graphs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Against this background, we study the problem of modeling hyper-relational schema, which is formulated as mixed hyper-relational tuples ({Th}, r, {Tt}, k, {Tv1},…) with two-fold hyper-relations: each type set T may contain multiple types and each schema tuple may contain multiple key-type set pairs (k, Tv). To address this problem, we propose HELIOS, a hyper-relational schema model designed to subtly learn from such hyper-relational schema tuples by capturing not only the correlation between multiple types of a single entity, but also the correlation between types of different entities and relations in a schema tuple. |
Yuhuan Lu; Bangchao Deng; Weijian Yu; Dingqi Yang; |
| 92 | PoSynDA: Multi-Hypothesis Pose Synthesis Domain Adaptation for Robust 3D Human Pose Estimation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: The current 3D human pose estimators face challenges in adapting to new datasets due to the scarcity of 2D-3D pose pairs in target domain training sets. We present the Multi-Hypothesis Pose Synthesis Domain Adaptation (PoSynDA) framework to overcome this issue without extensive target domain annotation. |
Hanbing Liu; Jun-Yan He; Zhi-Qi Cheng; Wangmeng Xiang; Qize Yang; Wenhao Chai; Gaoang Wang; Xu Bao; Bin Luo; Yifeng Geng; Xuansong Xie; |
| 93 | PixelFace+: Towards Controllable Face Generation and Manipulation with Text Descriptions and Segmentation Masks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: More importantly, we hope that the generation can be easily controlled via interactively editing both types of information, making face generation more applicable to real-world applications. To accomplish this target, we propose a novel face generation model termed PixelFace+. |
Xiaoxiong Du; Jun Peng; Yiyi Zhou; Jinlu Zhang; Siting Chen; Guannan Jiang; Xiaoshuai Sun; Rongrong Ji; |
| 94 | Transformer-based Open-world Instance Segmentation with Cross-task Consistency Regularization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we instead promote a single-stage transformer-based framework for OWIS. |
Xizhe Xue; Dongdong Yu; Lingqiao Liu; Yu Liu; Satoshi Tsutsui; Ying Li; Zehuan Yuan; Ping Song; Mike Zheng Shou; |
| 95 | Seeing in Flowing: Adapting CLIP for Action Recognition with Motion Prompts Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose that the key lies in explicitly modeling the motion cues flowing in video frames. |
Qiang Wang; Junlong Du; Ke Yan; Shouhong Ding; |
| 96 | CPLFormer: Cross-scale Prototype Learning Transformer for Image Snow Removal Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose a novel approach, CPLFormer, which uses snow prototypes to gain comprehensive clean-scene understanding by learning from cross-scale features, outperforming convolutional networks and vanilla transformer-based solutions. |
Sixiang Chen; Tian Ye; Yun Liu; Jinbin Bai; Haoyu Chen; Yunlong Lin; Jun Shi; Erkang Chen; |
| 97 | SUG: Single-dataset Unified Generalization for 3D Point Cloud Classification Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, different from previous 2D DG works, we focus on the 3D DG problem and propose a Single-dataset Unified Generalization (SUG) framework that only leverages a single source dataset to alleviate the unforeseen domain differences faced by a well-trained source model. |
Siyuan Huang; Bo Zhang; Botian Shi; Hongsheng Li; Yikang Li; Peng Gao; |
| 98 | ASTDF-Net: Attention-Based Spatial-Temporal Dual-Stream Fusion Network for EEG-Based Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel attention-based spatial-temporal dual-stream fusion network, named ASTDF-Net, for EEG-based emotion recognition. |
Peiliang Gong; Ziyu Jia; Pengpai Wang; Yueying Zhou; Daoqiang Zhang; |
| 99 | Physics-Based Adversarial Attack on Near-Infrared Human Detector for Nighttime Surveillance Camera Systems Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we identify fundamental vulnerabilities in NIR-based image understanding caused by color and texture loss due to the intrinsic characteristics of clothes’ reflectance and cameras’ spectral sensitivity in the NIR range. |
Muyao Niu; Zhuoxiao Li; Yifan Zhan; Huy H. Nguyen; Isao Echizen; Yinqiang Zheng; |
| 100 | GCMA: Generative Cross-Modal Transferable Adversarial Attacks from Images to Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose an effective Generative Cross-Modal Attacks (GCMA) framework to enhance adversarial transferability from image domains to video domains. |
Kai Chen; Zhipeng Wei; Jingjing Chen; Zuxuan Wu; Yu-Gang Jiang; |
| 101 | Giving Text More Imagination Space for Image-text Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In order to both narrow the cross-modal heterogeneity gap and balance the information discrepancy, we propose an imagination network to enrich the text modality based on a pre-trained framework, which is helpful for image-text matching. |
Xinfeng Dong; Longfei Han; Dingwen Zhang; Li Liu; Junwei Han; Huaxiang Zhang; |
| 102 | Whether You Can Locate or Not? Interactive Referring Expression Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose an Interactive REG (IREG) model that can interact with a real REC model, utilizing signals indicating whether the object is located and the visual region located by the REC model to gradually modify REs. |
Fulong Ye; Yuxing Long; Fangxiang Feng; Xiaojie Wang; |
| 103 | Recurrent Multi-scale Transformer for High-Resolution Salient Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: As far as we know, it is the largest dataset for the HRSOD task, which will significantly help future works in training and evaluating models. |
Xinhao Deng; Pingping Zhang; Wei Liu; Huchuan Lu; |
| 104 | Cross-modal & Cross-domain Learning for Unsupervised LiDAR Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To answer it, this paper studies a new 3DLSS setting where a 2D dataset (source) with semantic annotations and a paired but unannotated 2D image and 3D LiDAR data (target) are available. To achieve 3DLSS in this scenario, we propose Cross-Modal and Cross-Domain Learning (CoMoDaL). |
Yiyang Chen; Shanshan Zhao; Changxing Ding; Liyao Tang; Chaoyue Wang; Dacheng Tao; |
| 105 | Fine-Grained Music Plagiarism Detection: Revealing Plagiarists Through Bipartite Graph Matching and A Comprehensive Large-Scale Dataset Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To detect the fine-grained plagiarism pairs effectively, we propose a graph-based method called Bipartite Melody Matching Detector (BMM-Det), which formulates the problem as a maximum matching problem on a bipartite graph. |
Wenxuan Liu; Tianyao He; Chen Gong; Ning Zhang; Hua Yang; Junchi Yan; |
| 106 | MEAformer: Multi-modal Entity Alignment Transformer for Meta Modality Hybrid Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper introduces MEAformer, a multi-modal entity alignment transformer approach for meta modality hybrid, which dynamically predicts the mutual correlation coefficients among modalities for more fine-grained entity-level modality fusion and alignment. |
Zhuo Chen; Jiaoyan Chen; Wen Zhang; Lingbing Guo; Yin Fang; Yufeng Huang; Yichi Zhang; Yuxia Geng; Jeff Z. Pan; Wenting Song; Huajun Chen; |
| 107 | Self-Relational Graph Convolution Network for Skeleton-Based Action Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work proposes a novel spatial module called Multi-scale self-relational graph convolution (MS-SRGC) for dynamically modeling joint relations of action instances. |
Sophyani Banaamwini Yussif; Ning Xie; Yang Yang; Heng Tao Shen; |
| 108 | Modal-aware Visual Prompting for Incomplete Multi-modal Brain Tumor Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In contrast to previous prompts that typically use textual network embeddings, we utilize embeddings as the prompts generated by a modality state classifier that focuses on the missing modality states. |
Yansheng Qiu; Ziyuan Zhao; Hongdou Yao; Delin Chen; Zheng Wang; |
| 109 | Beat: Bi-directional One-to-Many Embedding Alignment for Text-based Person Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Previous methods have attempted to align text and image samples in a modal-shared space, but they face uncertainties in optimization directions due to the movable features of both modalities and the failure to account for one-to-many relationships of image-text pairs in TPR datasets. To address this issue, we propose an effective bi-directional one-to-many embedding paradigm that offers a clear optimization direction for each sample, thus mitigating the optimization problem. |
Yiwei Ma; Xiaoshuai Sun; Jiayi Ji; Guannan Jiang; Weilin Zhuang; Rongrong Ji; |
| 110 | Active CT Reconstruction with A Learned Sampling Policy Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these methods are still stuck with a fixed uniform SV (USV) sampling strategy, which inhibits the possibility of acquiring a better image with an even reduced dose. In this paper, we explore this possibility via learning an active SV (ASV) sampling policy that optimizes the sampling positions for region-of-interest (RoI)-specific, high-quality reconstruction. |
Ce Wang; Kun Shang; Haimiao Zhang; Shang Zhao; Dong Liang; S Kevin Zhou; |
| 111 | Hierarchical Category-Enhanced Prototype Learning for Imbalanced Temporal Recommendation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the significant imbalance of items in the training data poses a major challenge to predictive accuracy. Existing approaches attempt to alleviate this issue by modifying the loss function or utilizing resampling techniques, but such approaches may inadvertently amplify the specificity of certain behaviors. To address this problem, we propose a novel temporal recommendation algorithm called HCRec. |
Xiyue Gao; Zhuoqi Ma; Jiangtao Cui; Xiaofang Xia; Cai Xu; |
| 112 | PSNEA: Pseudo-Siamese Network for Entity Alignment Between Multi-modal Knowledge Graphs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Most existing methods focus on reducing the embedding differences between multiple modalities while neglecting the following challenges: 1) cannot handle the heterogeneity across graphs, 2) suffer from the scarcity of pre-aligned data (a.k.a. initial seeds). To tackle these issues, we propose a Pseudo-Siamese Network for multi-modal Entity Alignment (PSNEA). |
Wenxin Ni; Qianqian Xu; Yangbangyan Jiang; Zongsheng Cao; Xiaochun Cao; Qingming Huang; |
| 113 | Federated Learning with Label-Masking Distillation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we focus on label distribution skew in federated learning, where due to the different user behavior of the client, label distributions between different clients are significantly different. |
Jianghu Lu; Shikun Li; Kexin Bao; Pengju Wang; Zhenxing Qian; Shiming Ge; |
| 114 | GraMMaR: Ground-aware Motion Model for 3D Human Motion Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end, we propose a novel Ground-aware Motion Model for 3D Human Motion Reconstruction, named GraMMaR, which jointly learns the distribution of transitions in both pose and interaction between every joint and ground plane at each time step of a motion sequence. |
Sihan Ma; Qiong Cao; Hongwei Yi; Jing Zhang; Dacheng Tao; |
| 115 | Open-Vocabulary Object Detection Via Scene Graph Discovery Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel Scene-Graph-Based Discovery Network (SGDN) that exploits scene graph cues for OV detection. |
Hengcan Shi; Munawar Hayat; Jianfei Cai; |
| 116 | NightHazeFormer: Single Nighttime Haze Removal Using Prior Query Transformer Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose an end-to-end transformer-based framework for nighttime haze removal, called NightHazeFormer. |
Yun Liu; Zhongsheng Yan; Sixiang Chen; Tian Ye; Wenqi Ren; Erkang Chen; |
| 117 | Pedestrian-specific Bipartite-aware Similarity Learning for Text-based Person Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these methods underestimate the key cue role of mismatched region-word pairs and ignore the problem of low similarity between matched region-word pairs. To alleviate these issues, we propose a novel Pedestrian-specific Bipartite-aware Similarity Learning (PBSL) framework that efficiently reveals the plausible and credible levels of contribution of pedestrian-specific mismatched and matched region-word pairs towards overall similarity. |
Fei Shen; Xiangbo Shu; Xiaoyu Du; Jinhui Tang; |
| 118 | Generalizing Face Forgery Detection Via Uncertainty Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, deterministic networks cannot effectively capture noise and distribution shifts in the input, which makes them less robust and prone to poor generalization in real-world scenarios. To address this problem, in this paper, we propose an Uncertainty-Aware Learning (UAL) method for face forgery detection. |
Yanqi Wu; Xue Song; Jingjing Chen; Yu-Gang Jiang; |
| 119 | Personalized Image Aesthetics Assessment with Attribute-guided Fine-grained Feature Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Consequently, we propose an attribute-guided fine-grained feature-aware personalized image aesthetics assessment method, which can fully capture fine-grained features from multiple attributes to represent users’ aesthetic preferences for images. |
Hancheng Zhu; Zhiwen Shao; Yong Zhou; Guangcheng Wang; Pengfei Chen; Leida Li; |
| 120 | Normality Learning-based Graph Anomaly Detection Via Multi-Scale Contrastive Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end, we propose a normality learning-based GAD framework via multi-scale contrastive learning networks (NLGAD for short). |
Jingcan Duan; Pei Zhang; Siwei Wang; Jingtao Hu; Hu Jin; Jiaxin Zhang; Haifang Zhou; Xinwang Liu; |
| 121 | Locate and Verify: A Two-Stream Network for Improved Deepfake Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: (3) Confronted with the challenge of obtaining forgery annotations, we propose a Semi-supervised Patch Similarity Learning strategy to estimate patch-level forged location annotations. |
Chao Shuai; Jieming Zhong; Shuang Wu; Feng Lin; Zhibo Wang; Zhongjie Ba; Zhenguang Liu; Lorenzo Cavallaro; Kui Ren; |
| 122 | Recognizing High-Speed Moving Objects with Spike Camera Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This framework improves recognition accuracy while substantially decreasing recognition latency, enabling our method to accurately recognize moving objects at an equivalent speed of 514 km/h using only 1 ms of spike stream. |
Junwei Zhao; Jianming Ye; Shiliang Shiliang; Zhaofei Yu; Tiejun Huang; |
| 123 | PNT-Edge: Towards Robust Edge Detection with Noisy Labels By Learning Pixel-level Noise Transitions Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To address the label-noise issue for edge detection, this paper proposes to learn Pixel-level Noise Transitions to model the label-corruption process. |
Wenjie Xuan; Shanshan Zhao; Yu Yao; Juhua Liu; Tongliang Liu; Yixin Chen; Bo Du; Dacheng Tao; |
| 124 | Improving Cross-Modal Recipe Retrieval with Component-Aware Prompted CLIP Embedding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a Component-aware Instance-specific Prompt learning (CIP) model that fully exploits the ability of large-scale VLP models. |
Xu Huang; Jin Liu; Zhizhong Zhang; Yuan Xie; |
| 125 | Underwater Image Enhancement By Transformer-based Diffusion Model with Non-uniform Sampling for Skip Strategy Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we present an approach to image enhancement with diffusion model in underwater scenes. |
Yi Tang; Hiroshi Kawasaki; Takafumi Iwaguchi; |
| 126 | A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper presents a prior instruction representation framework (PIR) for remote sensing image-text retrieval, aimed at remote sensing vision-language understanding tasks to solve the semantic noise problem. |
Jiancheng Pan; Qing Ma; Cong Bai; |
| 127 | Curriculum-Listener: Consistency- and Complementarity-Aware Audio-Enhanced Temporal Sentence Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The few works considering audio simply regard it as an additional modality, overlooking that: i) it’s non-trivial to explore consistency and complementarity between audio and visual; ii) such exploration requires handling different levels of information densities and noises in the two modalities. To tackle these challenges, we propose Adaptive Dual-branch Promoted Network (ADPN) to exploit such consistency and complementarity: i) we introduce a dual-branch pipeline capable of jointly training visual-only and audio-visual branches to simultaneously eliminate inter-modal interference; ii) we design Text-Guided Clues Miner (TGCM) to discover crucial locating clues via considering both consistency and complementarity during audio-visual interaction guided by text semantics; iii) we propose a novel curriculum-based denoising optimization strategy, where we adaptively evaluate sample difficulty as a measure of noise intensity in a self-aware fashion. |
Houlun Chen; Xin Wang; Xiaohan Lan; Hong Chen; Xuguang Duan; Jia Jia; Wenwu Zhu; |
| 128 | Learning Comprehensive Representations with Richer Self for Text-to-Image Person Re-Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose a simple yet effective framework, called LCR2S, for modeling many-to-many correspondences of the same identity by learning comprehensive representations for both modalities from a novel perspective. |
Shuanglin Yan; Neng Dong; Jun Liu; Liyan Zhang; Jinhui Tang; |
| 129 | ParliRobo: Participant Lightweight AI Robots for Massively Multiplayer Online Games (MMOGs) Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we attempt to develop practical PARs (dubbed ParliRobo) showing acceptably humanoid behaviors with affordable infrastructures under a challenging scenario: a 3D-FPS (first-person shooter) mobile MMOG with real-time interaction requirements. |
Jianwei Zheng; Changnan Xiao; Mingliang Li; Zhenhua Li; Feng Qian; Wei Liu; Xudong Wu; |
| 130 | Generating Explanations for Embodied Action Decision from Visual Observation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we study generating action decisions and explanations based on visual observation. |
Xiaohan Wang; Yuehu Liu; Xinhang Song; Beibei Wang; Shuqiang Jiang; |
| 131 | Automatic Human Scene Interaction Through Contact Estimation and Motion Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a novel approach to automatic human scene interaction that effectively recovers the human body mesh and high-precision contact information, subsequently enabling adaptation to different environments. |
Mingrui Zhang; Ming Chen; Yan Zhou; Li Chen; Weihua Jian; Pengfei Wan; |
| 132 | Improving Human-Object Interaction Detection Via Virtual Image Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most works pursue designing better architectures to learn overall features more efficiently, while ignoring the long-tail nature of interaction-object pair categories. In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL). |
Shuman Fang; Shuai Liu; Jie Li; Guannan Jiang; Xianming Lin; Rongrong Ji; |
| 133 | Hierarchical Masked 3D Diffusion Model for Video Outpainting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a masked 3D diffusion model for video outpainting. |
Fanda Fan; Chaoxu Guo; Litong Gong; Biao Wang; Tiezheng Ge; Yuning Jiang; Chunjie Luo; Jianfeng Zhan; |
| 134 | AniPixel: Towards Animatable Pixel-Aligned Human Avatar Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose AniPixel, a novel animatable and generalizable human avatar reconstruction method that leverages pixel-aligned features for body geometry prediction and RGB color blending. |
Jinlong Fan; Jing Zhang; Zhi Hou; Dacheng Tao; |
| 135 | SA-GDA: Spectral Augmentation for Graph Domain Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present the Spectral Augmentation for Graph Domain Adaptation (SA-GDA) for graph node classification. |
Jinhui Pang; Zixuan Wang; Jiliang Tang; Mingyan Xiao; Nan Yin; |
| 136 | Prompt Me Up: Unleashing The Power of Alignments for Multimodal Entity and Relation Extraction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Despite attempts at various fusions, previous works have overlooked many unlabeled image-caption pairs, such as NewsCLIPing. This paper proposes innovative pre-training objectives for entity-object and relation-image alignment, extracting objects from images and aligning them with entity and relation prompts for soft pseudo-labels. |
Xuming Hu; Junzhe Chen; Aiwei Liu; Shiao Meng; Lijie Wen; Philip S. Yu; |
| 137 | Rethinking Missing Modality Learning from A Decoding Perspective Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these methods rely on strong assumptions (i.e., all the pre-defined modalities are available for each input sample during training and the number of modalities is fixed). To solve this problem, we propose a simple yet effective method called Interaction Augmented Prototype Decomposition (IPD) for a more general setting, where the number of modalities is arbitrary and various incomplete modality conditions occur in both the training and inference phases, even including unseen testing conditions. |
Tao Jin; Xize Cheng; Linjun Li; Wang Lin; Ye Wang; Zhou Zhao; |
| 138 | Unveiling The Power of CLIP in Unsupervised Visible-Infrared Person Re-Identification Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a new prompt learning paradigm for unsupervised visible-infrared person re-identification (USL-VI-ReID) by taking full advantage of the visual-text representation ability from CLIP. |
Zhong Chen; Zhizhong Zhang; Xin Tan; Yanyun Qu; Yuan Xie; |
| 139 | Make-It-4D: Synthesizing A Consistent Long-Term Dynamic Scene Video from A Single Image Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we present Make-It-4D, a novel method that can generate a consistent long-term dynamic video from a single image. |
Liao Shen; Xingyi Li; Huiqiang Sun; Juewen Peng; Ke Xian; Zhiguo Cao; Guosheng Lin; |
| 140 | Emo-DNA: Emotion Decoupling and Alignment Learning for Cross-Corpus Speech Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Existing methods, typically based on unsupervised domain adaptation (UDA), struggle to learn corpus-invariant features by global distribution alignment, but unfortunately, the resulting features are mixed with corpus-specific features or not class-discriminative. To tackle these challenges, we propose a novel Emotion Decoupling aNd Alignment learning framework (EMO-DNA) for cross-corpus SER, a novel UDA method to learn emotion-relevant corpus-invariant features. |
Jiaxin Ye; Yujie Wei; Xin-Cheng Wen; Chenglong Ma; Zhizhong Huang; Kunhong Liu; Hongming Shan; |
| 141 | Generalized Universal Domain Adaptation with Generative Flow Networks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The key challenge of GUDA is developing and identifying novel target categories while estimating the target label distribution. To address this problem, we take advantage of the powerful exploration capability of generative flow networks and propose an active domain adaptation algorithm named GFlowDA, which selects diverse samples with probabilities proportional to a reward function. |
Didi Zhu; Yinchuan Li; Yunfeng Shao; Jianye Hao; Fei Wu; Kun Kuang; Jun Xiao; Chao Wu; |
| 142 | PDE-based Progressive Prediction Framework for Attribute Compression of 3D Point Clouds Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Besides, we propose a low-complexity method for calculating partial derivative operations on point clouds to address the uncertainty of neighbor occupancy in three-dimensional space. |
Xiaodong Yang; Yiting Shao; Shan Liu; Thomas H. Li; Ge Li; |
| 143 | Resolve Domain Conflicts for Generalizable Remote Physiological Measurement Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, they often overlook the underlying conflict issues in the rPPG field, such as (1) label conflict resulting from different phase delays between physiological signal labels and face videos at the instance level, and (2) attribute conflict stemming from distribution shifts caused by head movements, illumination changes, skin types, etc. To address this, we introduce the DOmain-HArmonious framework (DOHA). |
Weiyu Sun; Xinyu Zhang; Hao Lu; Ying Chen; Yun Ge; Xiaolin Huang; Jie Yuan; Yingcong Chen; |
| 144 | V2Depth: Monocular Depth Estimation Via Feature-Level Virtual-View Simulation and Refinement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present V2Depth, a novel coarse-to-fine framework with Virtual View feature simulation for supervised monocular Depth estimation. |
Zizhang Wu; Zhuozheng Li; Zhi-Gang Fan; Yunzhe Wu; Jian Pu; Xianzhi Li; |
| 145 | Towards Decision-based Sparse Attacks on Video Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While existing research has primarily focused on sparse attacks against image models, there is a notable gap in evaluating the robustness of video recognition models. To bridge this gap, we are the first to study sparse video attacks and propose an attack framework named V-DSA in the most challenging decision-based setting, in which the threat model only returns the predicted hard label. |
Kaixun Jiang; Zhaoyu Chen; Xinyu Zhou; Jingyu Zhang; Lingyi Hong; JiaFeng Wang; Bo Li; Yan Wang; Wenqiang Zhang; |
| 146 | Exploring The Adversarial Robustness of Video Object Segmentation Via One-shot Adversarial Attacks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Adversarial robustness refers to the ability of the model to resist malicious attacks on adversarial examples. To address this gap, we propose a one-shot adversarial robustness evaluation framework (i.e., the adversary only perturbs the first frame) for VOS models, including white-box and black-box attacks. |
Kaixun Jiang; Lingyi Hong; Zhaoyu Chen; Pinxue Guo; Zeng Tao; Yan Wang; Wenqiang Zhang; |
| 147 | AesCLIP: Multi-Attribute Contrastive Learning for Image Aesthetics Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, how to learn aesthetics-aware attributes from CLIP-based semantic space has not been addressed before. With this motivation, this paper presents a CLIP-based multi-attribute contrastive learning framework for IAA, dubbed AesCLIP. |
Xiangfei Sheng; Leida Li; Pengfei Chen; Jinjian Wu; Weisheng Dong; Yuzhe Yang; Liwu Xu; Yaqian Li; Guangming Shi; |
| 148 | Suspected Objects Matter: Rethinking Model’s Prediction for One-stage Visual Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, exploring their relationships in the one-stage paradigm is non-trivial because: (1) no object proposals are available as the basis on which to select suspected objects and perform relationship modeling; (2) suspected objects are more confusing than others, as they may share similar semantics, be entangled with certain relationships, etc., and thereby more easily mislead the model’s prediction. Toward this end, we propose a Suspected Object Transformation mechanism (SOT), which can be seamlessly integrated into existing CNN and Transformer-based one-stage visual grounders to encourage the target object selection among the suspected ones. |
Yang Jiao; Zequn Jie; Jingjing Chen; Lin Ma; Yu-Gang Jiang; |
| 149 | Topological Structure Learning for Weakly-Supervised Out-of-Distribution Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To solve the new problem, we propose an effective method called Topological Structure Learning (TSL). |
Rundong He; Rongxue Li; Zhongyi Han; Xihong Yang; Yilong Yin; |
| 150 | COPA: Efficient Vision-Language Pre-training Through Collaborative Object- and Patch-Text Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a ViT-based VLP technique that efficiently incorporates object information through a novel patch-text alignment mechanism. |
Chaoya Jiang; Haiyang Xu; Wei Ye; Qinghao Ye; Chenliang Li; Ming Yan; Bin Bi; Shikun Zhang; Fei Huang; Ji Zhang; |
| 151 | Reinforcement Graph Clustering with Unknown Cluster Number Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To enable the deep graph clustering algorithms to work without the guidance of the predefined cluster number, we propose a new deep graph clustering method termed Reinforcement Graph Clustering (RGC). |
Yue Liu; Ke Liang; Jun Xia; Xihong Yang; Sihang Zhou; Meng Liu; Xinwang Liu; Stan Z. Li; |
| 152 | Language-guided Human Motion Synthesis with Atomic Actions Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Previous methods face limitations in generalization to novel actions, often resulting in unrealistic or incoherent motion sequences. In this paper, we propose ATOM (ATomic mOtion Modeling) to mitigate this problem, by decomposing actions into atomic actions, and employing a curriculum learning strategy to learn atomic action composition. |
Yuanhao Zhai; Mingzhen Huang; Tianyu Luan; Lu Dong; Ifeoma Nwogu; Siwei Lyu; David Doermann; Junsong Yuan; |
| 153 | That’s What I Said: Fully-Controllable Talking Face Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The goal of this paper is to synthesise talking faces with controllable facial motions. |
Youngjoon Jang; Kyeongha Rho; Jongbin Woo; Hyeongkeun Lee; Jihwan Park; Youshin Lim; Byeong-Yeol Kim; Joon Son Chung; |
| 154 | Semantic-aware Consistency Network for Cloth-changing Person Re-Identification Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Despite recent progress in CC-ReID, existing approaches are still hindered by the interference of clothing variations since they lack effective constraints to keep the model consistently focused on clothing-irrelevant regions. To address this issue, we present a Semantic-aware Consistency Network (SCNet) to learn identity-related semantic features by proposing effective consistency constraints. |
Peini Guo; Hong Liu; Jianbing Wu; Guoquan Wang; Tao Wang; |
| 155 | Relational Contrastive Learning for Scene Text Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end, we propose to enrich the textual relations via rearrangement, hierarchy and interaction, and design a unified framework called RCLSTR: Relational Contrastive Learning for Scene Text Recognition. |
Jinglei Zhang; Tiancheng Lin; Yi Xu; Kai Chen; Rui Zhang; |
| 156 | Focusing on Flexible Masks: A Novel Framework for Panoptic Scene Graph Generation with Relation Constraints Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In the inference phase, we present an innovative concept of employing relation predictions to constrain segmentation and design a relation-constrained segmentation algorithm. |
Jiarui Yang; Chuan Wang; Zeming Liu; Jiahong Wu; Dongsheng Wang; Liang Yang; Xiaochun Cao; |
| 157 | FedGH: Heterogeneous Federated Learning with Generalized Global Header Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Existing model-heterogeneous FL approaches often require publicly available datasets and incur high communication and/or computational costs, which limit their performance. To address these limitations, we propose a simple but effective Federated Global prediction Header (FedGH) approach. |
Liping Yi; Gang Wang; Xiaoguang Liu; Zhuan Shi; Han Yu; |
| 158 | SpeechTripleNet: End-to-End Disentangled Speech Representation Learning for Content, Timbre and Prosody Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces SpeechTripleNet, an end-to-end method to disentangle speech into representations for content, timbre, and prosody. |
Hui Lu; Xixin Wu; Zhiyong Wu; Helen Meng; |
| 159 | Cross-modal Contrastive Learning for Multimodal Fake News Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To address that, we propose COOLANT, a cross-modal contrastive learning framework for multimodal fake news detection, aiming to achieve more accurate image-text alignment. |
Longzheng Wang; Chuang Zhang; Hongbo Xu; Yongxiu Xu; Xiaohan Xu; Siqi Wang; |
| 160 | Filling in The Blank: Rationale-Augmented Prompt Tuning for TextVQA Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we rethink the characteristics of the TextVQA task and find that scene text is indeed a special kind of language embedded in images. |
Gangyan Zeng; Yuan Zhang; Yu Zhou; Bo Fang; Guoqing Zhao; Xin Wei; Weiping Wang; |
| 161 | Relation Triplet Construction for Cross-modal Text-to-Video Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To model fine-grained visual relations, this paper proposes a Multi-Granularity Matching (MGM) framework that considers both fine-grained relation triplet matching and coarse-grained global semantic matching for text-to-video retrieval. |
Xue Song; Jingjing Chen; Yu-Gang Jiang; |
| 162 | Multi-Spectral Image Stitching Via Spatial Graph Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Capitalizing on the strengths of Graph Convolutional Networks (GCNs) in modeling feature relationships, we propose a spatial graph reasoning based multi-spectral image stitching method that effectively distills the deformation and integration of multi-spectral images across different viewpoints. |
Zhiying Jiang; Zengxi Zhang; Jinyuan Liu; Xin Fan; Risheng Liu; |
| 163 | Incremental Few Shot Semantic Segmentation Via Class-agnostic Mask Proposal and Language-driven Classifier Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a new IFSS framework called CaLNet, i.e., Class-agnostic mask proposal and Language-driven classifier incremental few-shot semantic segmentation network. |
Leo Shan; Wenzhang Zhou; Grace Zhao; |
| 164 | Deep Image Harmonization in Dual Color Spaces Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we explore image harmonization in dual color spaces, which supplements entangled RGB features with disentangled L, a, b features to alleviate the workload in harmonization process. |
Linfeng Tan; Jiangtong Li; Li Niu; Liqing Zhang; |
| 165 | Elucidate Gender Fairness in Singing Voice Transcription Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We find that different pitch distributions, rather than gender data imbalance, contribute to this disparity. To address this issue, we propose using an attribute predictor to predict gender labels and adversarially training the SVT system to enforce the gender-invariance of acoustic representations. |
Xiangming Gu; Wei Zeng; Ye Wang; |
| 166 | WormTrack: Dataset and Benchmark for Multi-Object Tracking in Worm Crowds Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We observed that state-of-the-art MOT methods suffer a considerable performance drop on the new dataset. Therefore, we propose a customized MOT method for worm crowds by deeply understanding the physical characteristics of worms and scenes. |
Zhiyu Jin; Hanyang Yu; Chen Haul; Linxiang Wang; Zuobin Zhu; Qiu Shen; Xun Cao; |
| 167 | Precise Target-Oriented Attack Against Deep Hashing-based Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, they exhibit poor performance when facing a more precise single-target label selection. In this work, we propose a novel Precise Target-Oriented Attack, dubbed PTA, to enhance the precision of such targeted attacks. |
Wenshuo Zhao; Jingkuan Song; Shengming Yuan; Lianli Gao; Yang Yang; Hengtao Shen; |
| 168 | FedCE: Personalized Federated Learning Method Based on Clustering Ensembles Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In order to make the cluster more suitable for the distribution features of user data, we propose a clustering-ensemble based federated learning method (FedCE) that sets each client associated with multiple clusters. |
Luxin Cai; Naiyue Chen; Yuanzhouhan Cao; Jiahuan He; Yidong Li; |
| 169 | A Simple Baseline for Open-World Tracking Via Self-training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address the problem, we propose a simple baseline, SimOWT. |
Bingyang Wang; Tanlin Li; Jiannan Wu; Yi Jiang; Huchuan Lu; You He; |
| 170 | Semi-Supervised Panoptic Narrative Grounding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce a novel Semi-Supervised Panoptic Narrative Grounding (SS-PNG) learning scheme, capitalizing on a smaller set of labeled image-text pairs and a larger set of unlabeled pairs to achieve competitive performance. |
Danni Yang; Jiayi Ji; Xiaoshuai Sun; Haowei Wang; Yinan Li; Yiwei Ma; Rongrong Ji; |
| 171 | HARP: Let Object Detector Undergo Hyperplasia to Counter Adversarial Patches Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by the phenomenon of bone hyperplasia in human bodies, we present a novel model-side adversarial patch defense, called HARP (Hyperplasia based Adversarial Patch defense). |
Junzhe Cai; Shuiyan Chen; Heng Li; Beihao Xia; Zimin Mao; Wei Yuan; |
| 172 | Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a multi-view joint modality modeling approach, termed JM3D, to obtain a unified representation for point cloud, text, and image. |
Haowei Wang; Jiji Tang; Jiayi Ji; Xiaoshuai Sun; Rongsheng Zhang; Yiwei Ma; Minda Zhao; Lincheng Li; Zeng Zhao; Tangjie Lv; Rongrong Ji; |
| 173 | DeNoL: A Few-Shot-Sample-Based Decoupling Noise Layer for Cross-channel Watermarking Robustness Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Meanwhile, directly using limited data for training may lead to an over-fitting issue. To address this limitation, we propose DeNoL, a decoupling noise layer for cross-channel simulation which only needs few-shot samples. |
Han Fang; Kejiang Chen; Yupeng Qiu; Jiayang Liu; Ke Xu; Chengfang Fang; Weiming Zhang; Ee-Chien Chang; |
| 174 | High Fidelity Face Swapping Via Semantics Disentanglement and Structure Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel Semantics and Structure-aware face Swapping framework (S2Swap) that exploits semantics disentanglement and structure enhancement for high fidelity face generation. |
Fengyuan Liu; Lingyun Yu; Hongtao Xie; Chuanbin Liu; Zhiguo Ding; Quanwei Yang; Yongdong Zhang; |
| 175 | All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, inspired by the recent success of exploring foundation models with unified architecture for both natural language and computer vision tasks, we propose an All-in-One framework, which learns joint feature extraction and interaction by adopting a unified transformer backbone. |
Chunhui Zhang; Xin Sun; Yiqian Yang; Li Liu; Qiong Liu; Xi Zhou; Yanfeng Wang; |
| 176 | M3Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite the progress made in FS coarse-grained action recognition, current approaches encounter two challenges when dealing with the fine-grained action categories: the inability to capture subtle action details and the insufficiency of learning from limited data that exhibit high intra-class variance and inter-class similarity. To address these limitations, we propose M3Net, a matching-based framework for FS-FG action recognition, which incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making across multiple viewpoints. Multi-view encoding captures rich contextual details from the intra-frame, intra-video, and intra-episode perspectives, generating customized higher-order embeddings for fine-grained data. Multi-view matching integrates various matching functions enabling flexible relation modeling within limited samples to handle multi-scale spatio-temporal variations by leveraging the instance-specific, category-specific, and task-specific perspectives. |
Hao Tang; Jun Liu; Shuanglin Yan; Rui Yan; Zechao Li; Jinhui Tang; |
| 177 | Zero-Shot Object Detection By Semantics-Aware DETR with Adaptive Contrastive Loss Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing ZSD methods typically suffer from two drawbacks: 1) Due to the lack of data on unseen categories during the training phase, the model inevitably has a bias towards the seen categories, i.e., it prefers to subsume objects of unseen categories to seen categories; 2) It is usually very tricky for the feature extractor trained on data of seen categories to learn discriminative features that are good enough to help the model transfer the knowledge learned from data of seen categories to unseen categories. To tackle these problems, this paper proposes a novel zero-shot detection method based on a semantics-aware DETR and a class-wise adaptive contrastive loss. |
Huan Liu; Lu Zhang; Jihong Guan; Shuigeng Zhou; |
| 178 | Entropy-based Optimization on Individual and Global Predictions for Semi-Supervised Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we emphasize the cruciality of global prediction constraints and propose a new SSL method that employs Entropy-based optimization on both Individual and Global predictions of unlabeled instances, dubbed EntInG. |
Zhen Zhao; Meng Zhao; Ye Liu; Di Yin; Luping Zhou; |
| 179 | Unlocking The Power of Multimodal Learning for Emotion Recognition in Conversation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We demonstrate through experiments that the main reason for this plateau is an imbalanced assignment of gradients across modalities. To address this issue, we propose fine-grained adaptive gradient modulation, a plug-in approach to rebalance the gradients of modalities. |
Yunxiao Wang; Meng Liu; Zhe Li; Yupeng Hu; Xin Luo; Liqiang Nie; |
| 180 | PointCRT: Detecting Backdoor in 3D Point Cloud Via Corruption Robustness Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: As a result, there is a pressing need to bridge the gap for the development of a universal approach that is specifically designed for 3D point clouds. In this paper, we propose the first test-time backdoor sample detection method for 3D point clouds without assumptions about the backdoor triggers, called Point Clouds Corruption Robustness Test (PointCRT). |
Shengshan Hu; Wei Liu; Minghui Li; Yechao Zhang; Xiaogeng Liu; Xianlong Wang; Leo Yu Zhang; Junhui Hou; |
| 181 | Weakly-supervised Video Scene Graph Generation Via Unbiased Cross-modal Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Due to the imbalanced data distribution and the lack of fine-grained annotations, models learned in this setting are prone to bias. Therefore, we propose an Unbiased Cross-Modal Learning (UCML) framework to address the WS-VidSGG task. |
Ziyue Wu; Junyu Gao; Changsheng Xu; |
| 182 | CUCL: Codebook for Unsupervised Continual Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This phenomenon can render the model inappropriate for practical applications. To address this issue, after analyzing the phenomenon and identifying the lack of diversity as a vital factor, we propose a method named Codebook for Unsupervised Continual Learning (CUCL), which encourages the model to learn discriminative features to complete the class boundary. |
Chen Cheng; Jingkuan Song; Xiaosu Zhu; Junchen Zhu; Lianli Gao; Hengtao Shen; |
| 183 | GridFormer: Towards Accurate Table Structure Recognition Via Grid Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: All tables can be represented as grids. Based on this observation, we propose GridFormer, a novel approach for interpreting unconstrained table structures by predicting the vertex and edge of a grid. |
Pengyuan Lyu; Weihong Ma; Hongyi Wang; Yuechen Yu; Chengquan Zhang; Kun Yao; Yang Xue; Jingdong Wang; |
| 184 | Uniformly Distributed Category Prototype-Guided Vision-Language Framework for Long-Tail Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This uneven feature space distribution causes the model to exhibit unclear and inseparable decision boundaries on the uniformly distributed test set, which lowers its performance. To address these challenges, we propose a uniformly distributed category prototype-guided vision-language framework to effectively mitigate the feature space bias caused by data imbalance. |
Xiaoxuan He; Siming Fu; Xinpeng Ding; Yuchen Cao; Hualiang Wang; |
| 185 | SimulFlow: Simultaneously Extracting Feature and Identifying Target for Unsupervised Video Object Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel UVOS model called SimulFlow that simultaneously performs feature extraction and target identification, enabling efficient and effective unsupervised video object segmentation. |
Lingyi Hong; Wei Zhang; Shuyong Gao; Hong Lu; WenQiang Zhang; |
| 186 | TMac: Temporal Multi-Modal Graph Learning for Acoustic Event Classification Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, existing methods deal with each modal feature independently and simply fuse them together, which neglects the mining of temporal relation and thus leads to sub-optimal performance. With this motivation, we propose a Temporal Multi-modal graph learning method for Acoustic event Classification, called TMac, by modeling such temporal information via graph learning techniques. |
Meng Liu; Ke Liang; Dayu Hu; Hao Yu; Yue Liu; Lingyuan Meng; Wenxuan Tu; Sihang Zhou; Xinwang Liu; |
| 187 | EmotionKD: A Cross-Modal Knowledge Distillation Framework for Emotion Recognition Based on Physiological Signals Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the high cost and inconvenience of EEG signal acquisition severely hinder the popularity of multi-modal emotion recognition in real-world scenarios, while GSR signals are easier to obtain. To address this challenge, we propose EmotionKD, a framework for cross-modal knowledge distillation that simultaneously models the heterogeneity and interactivity of GSR and EEG signals under a unified framework. |
Yucheng Liu; Ziyu Jia; Haichao Wang; |
| 188 | RetouchingFFHQ: A Large-scale Dataset for Fine-grained Face Retouching Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce RetouchingFFHQ, a large-scale and fine-grained face retouching dataset that contains over half a million conditionally-retouched images. |
Qichao Ying; Jiaxin Liu; Sheng Li; Haisheng Xu; Zhenxing Qian; Xinpeng Zhang; |
| 189 | Painterly Image Harmonization Using Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Previous methods for this task mainly rely on inference optimization or generative adversarial networks, but they are either very time-consuming or struggle with fine control of the foreground objects (e.g., texture and content details). To address these issues, we propose a novel Painterly Harmonization stable Diffusion model (PHDiffusion), which includes a lightweight adaptive encoder and a Dual Encoder Fusion (DEF) module. |
Lingxiao Lu; Jiangtong Li; Junyan Cao; Li Niu; Liqing Zhang; |
| 190 | Lightweight Super-Resolution Head for Human Pose Estimation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Besides, we propose SRPose to gradually recover the HR heatmaps from LR heatmaps and degraded features in a coarse-to-fine manner. |
Haonan Wang; Jie Liu; Jie Tang; Gangshan Wu; |
| 191 | Efficiency-optimized Video Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: After examining the source of the computation cost, we confirm that the main calculation comes from the redundancy of the convolution. To address this issue, we propose Efficiency-optimized Video Diffusion Models to reduce the network’s computation cost by minimizing the input and output channels of the convolution. |
Zijun Deng; Xiangteng He; Yuxin Peng; |
| 192 | PEARL: Preprocessing Enhanced Adversarial Robust Learning of Image Deraining for Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present the first attempt to improve the robustness of semantic segmentation tasks by simultaneously handling different types of degradation factors. |
Xianghao Jiao; Yaohua Liu; Jiaxin Gao; Xinyuan Chu; Xin Fan; Risheng Liu; |
| 193 | Efficient Spatio-Temporal Video Grounding with Semantic-Guided Feature Decomposition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Current approaches address the STVG task with end-to-end frameworks while suffering from heavy computational complexity and insufficient spatio-temporal interactions. To overcome these limitations, we propose a novel Semantic-Guided Feature Decomposition based Network (SGFDN). |
Weikang Wang; Jing Liu; Yuting Su; Weizhi Nie; |
| 194 | Efficient Bilateral Cross-Modality Cluster Matching for Unsupervised Visible-Infrared Person ReID Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose a novel bilateral cluster matching-based learning framework to reduce the modality gap by matching cross-modality clusters. |
De Cheng; Lingfeng He; Nannan Wang; Shizhou Zhang; Zhen Wang; Xinbo Gao; |
| 195 | Unsupervised Visible-Infrared Person ReID By Collaborative Learning with Neighbor-Guided Label Refinement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The key to addressing the USL-VI-ReID task is to solve the cross-modality data association problem for further heterogeneous joint learning. To address this issue, we propose a Dual Optimal Transport Label Assignment (DOTLA) framework to simultaneously assign the generated labels from one modality to its counterpart modality. |
De Cheng; Xiaojian Huang; Nannan Wang; Lingfeng He; Zhihui Li; Xinbo Gao; |
| 196 | Universal Domain Adaptive Network Embedding for Node Classification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Nonetheless, the complex network relationships between nodes increase the difficulty of this universal domain adaptive node classification task. In this work, we propose a novel Universal Domain Adaptive Network Embedding (UDANE) framework, which learns transferable node representations across networks to succeed in such a task. |
Jushuo Chen; Feifei Dai; Xiaoyan Gu; Jiang Zhou; Bo Li; Weipinng Wang; |
| 197 | MISSRec: Pre-training and Transferring Multi-modal Interest-aware Sequence Representation for Recommendation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose MISSRec, a multi-modal pre-training and transfer learning framework for SR. |
Jinpeng Wang; Ziyun Zeng; Yunxiao Wang; Yuting Wang; Xingyu Lu; Tianxiang Li; Jun Yuan; Rui Zhang; Hai-Tao Zheng; Shu-Tao Xia; |
| 198 | Points-to-3D: Bridging The Gap Between Sparse Points and Shape-Controllable Text-to-3D Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a flexible framework of Points-to-3D to bridge the gap between sparse yet freely available 3D points and realistic shape-controllable 3D generation by distilling the knowledge from both 2D and 3D diffusion models. |
Chaohui Yu; Qiang Zhou; Jingliang Li; Zhe Zhang; Zhibin Wang; Fan Wang; |
| 199 | PAIF: Perception-Aware Infrared-Visible Image Fusion for Attack-Tolerant Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, a perception-aware fusion framework is proposed to promote segmentation robustness in adversarial scenes. |
Zhu Liu; Jinyuan Liu; Benzhuang Zhang; Long Ma; Xin Fan; Risheng Liu; |
| 200 | Hierarchical Dynamic Image Harmonization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a hierarchical dynamic network (HDNet) to adapt features from local to global view for better feature transformation in efficient image harmonization. |
Haoxing Chen; Zhangxuan Gu; Yaohui Li; Jun Lan; Changhua Meng; Weiqiang Wang; Huaxiong Li; |
| 201 | IN/ACTive: A Distance-Technology-Mediated Stage for Performer-Audience Telepresence and Environmental Control Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The increasing virtualization of the performance process has resulted in passive interactions between performer and audience based on the observing paradigm, but without the presence of in-person staging. We designed a more interactive paradigm of remote performance using a multimedia exhibition strategy where visitors can alter the environment of the performer’s location by changing its music and lighting, whereas the performer can create engagement in the exhibition space by affecting a remote robotic arm. |
RAY LC; Sijia Liu; Qiaosheng Lyu; |
| 202 | Iterative Learning with Extra and Inner Knowledge for Long-tail Dynamic Scene Graph Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel method named Iterative Learning with Extra and Inner Knowledge (I2LEK) to address the long-tail problem in dynamic SGG. |
Yiming Li; Xiaoshan Yang; Changsheng Xu; |
| 203 | IDDR-NGP: Incorporating Detectors for Distractors Removal with Instant Neural Radiance Field Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents the first unified distractor removal method, named IDDR-NGP, which directly operates on Instant-NGP. |
Xianliang Huang; Jiajie Gou; Shuhang Chen; Zhizhou Zhong; Jihong Guan; Shuigeng Zhou; |
| 204 | Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we explore the potential of the Contrastive Language-Image Pretraining (CLIP) model in scene text recognition (STR), and establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR) to leverage both visual and linguistic knowledge in CLIP. |
Zixiao Wang; Hongtao Xie; Yuxin Wang; Jianjun Xu; Boqiang Zhang; Yongdong Zhang; |
| 205 | One-stage Low-resolution Text Recognition with High-resolution Knowledge Transfer Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Moreover, the recognition accuracy of the second stage heavily depends on the reconstruction quality of the first stage, causing ineffectiveness. In this work, we attempt to address these challenges from a novel perspective: adapting the recognizer to low-resolution inputs by transferring knowledge from the high-resolution counterpart. Guided by this idea, we propose an efficient and effective knowledge distillation framework to achieve multi-level knowledge transfer. Specifically, the visual focus loss is proposed to extract the character position knowledge with resolution gap reduction and character region focus, the semantic contrastive loss is employed to exploit the contextual semantic knowledge with contrastive learning, and the soft logits loss facilitates both local word-level and global sequence-level learning from the soft teacher label. Extensive experiments show that the proposed one-stage pipeline significantly outperforms super-resolution based two-stage frameworks in terms of effectiveness and efficiency, accompanied by favorable robustness. Code is available at https://github.com/csguoh/KD-LTR. |
Hang Guo; Tao Dai; Mingyan Zhu; Guanghao Meng; Bin Chen; Zhi Wang; Shu-Tao Xia; |
| 206 | Automatic Asymmetric Embedding Cost Learning Via Generative Adversarial Networks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a novel Generative Adversarial Network (GAN)-based steganography approach that independently learns asymmetric embedding costs from scratch. |
Dongxia Huang; Weiqi Luo; Peijia Zheng; Jiwu Huang; |
| 207 | Occluded Skeleton-Based Human Action Recognition with Dual Inhibition Training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we propose the Part-aware and Dual-inhibition Graph Convolutional Network (PDGCN), which comprises three parts: Input Skeleton Inhibition (ISI), Part-Aware Representation Learning (PARL) and Predicted Score Inhibition (PSI). |
Zhenjie Chen; Hongsong Wang; Jie Gui; |
| 208 | Motion-Decoupled Spiking Transformer for Audio-Visual Zero-Shot Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most existing methods suffer from background scene bias and capture fewer motion details because they employ a single-stream network that processes scene and motion information as a unified entity. In this paper, we address this challenge by proposing a novel dual-stream architecture, the Motion-Decoupled Spiking Transformer (MDFT), to explicitly decouple contextual semantic information from highly sparse dynamic motion information. |
Wenrui Li; Xi-Le Zhao; Zhengyu Ma; Xingtao Wang; Xiaopeng Fan; Yonghong Tian; |
| 209 | BLAT: Bootstrapping Language-Audio Pre-training Based on AudioSet Tag-guided Synthetic Data Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose to utilize audio captioning to generate text directly from audio, without the aid of the visual modality so that potential noise from modality mismatch is eliminated. |
Xuenan Xu; Zhiling Zhang; Zelin Zhou; Pingyue Zhang; Zeyu Xie; Mengyue Wu; Kenny Q. Zhu; |
| 210 | DPNET: Dynamic Poly-attention Network for Trustworthy Multi-modal Classification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel Dynamic Poly-attention Network (DPNET) for trustworthy multi-modal classification. |
Xin Zou; Chang Tang; Xiao Zheng; Zhenglai Li; Xiao He; Shan An; Xinwang Liu; |
| 211 | Foreground/Background-Masked Interaction Learning for Spatio-temporal Action Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, such approaches are somewhat inelegant: they either 1) roughly treat all actors as interacting equivalently with frames/parts, or 2) expensively borrow multiple costly detectors to acquire the special parts. To solve the above dilemma, we propose a novel Foreground/Background-masked Interaction Learning (dubbed as FBI Learning) framework to learn the multi-actor features by attentively interacting with the hands-down foreground and background frames. |
Keke Chen; Xiangbo Shu; Guo-Sen Xie; Rui Yan; Jinhui Tang; |
| 212 | Handling Label Uncertainty for Camera Incremental Person Re-Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we investigate person ReID in an unexplored scenario named Camera Incremental Person ReID (CIPR), which advances existing lifelong person ReID by taking into account the class overlap issue. |
Zexian Yang; Dayan Wu; Wanqian Zhang; Bo Li; Weipinng Wang; |
| 213 | Towards Unified Text-based Person Retrieval: A Large-scale Multi-Attribute and Language Search Benchmark Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce a large Multi-Attribute and Language Search dataset for text-based person retrieval, called MALS, and explore the feasibility of performing pre-training on both attribute recognition and image-text matching tasks in one unified framework. |
Shuyu Yang; Yinan Zhou; Zhedong Zheng; Yaxiong Wang; Li Zhu; Yujiao Wu; |
| 214 | Learning Occlusion Disentanglement with Fine-grained Localization for Occluded Person Re-identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a Fine-grained Occlusion Disentanglement Network (FODN) that can extract more information from limited person regions. |
Wenfeng Liu; Xudong Wang; Lei Tan; Yan Zhang; Pingyang Dai; Yongjian We; Rongrong Ji; |
| 215 | Masked Text Modeling: A Self-Supervised Pre-training Method for Scene Text Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we explore the potential of the CLIP model, and propose a novel self-supervised Masked Text Modeling (MTM) pre-training method for scene text detection, which can be trained with unlabeled data and improve the linguistic reasoning ability for text occlusion. |
Keran Wang; Hongtao Xie; Yuxin Wang; Dongming Zhang; Yadong Qu; Zuan Gao; Yongdong Zhang; |
| 216 | RecolorNeRF: Layer Decomposed Radiance Fields for Efficient Color Editing of 3D Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present RecolorNeRF, a novel user-friendly color editing approach for the neural radiance fields. |
Bingchen Gong; Yuehao Wang; Xiaoguang Han; Qi Dou; |
| 217 | Modality Profile – A New Critical Aspect to Be Considered When Generating RGB-D Salient Object Detection Training Set Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: A training dataset with a modality profile similar to the test dataset can significantly improve performance. To address this, we present a viable solution for automatically generating a training dataset with any desired modality profile in a weakly supervised manner. |
Xuehao Wang; Shuai Li; Chenglizhao Chen; Aimin Hao; Hong Qin; |
| 218 | Semi-Supervised Convolutional Vision Transformer with Bi-Level Uncertainty Estimation for Medical Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we revisit the model of semi-supervised learning and develop a novel CNN-Transformer learning framework that allows for effective segmentation of medical images by producing complementary and reliable features and pseudo-label with bi-level uncertainty. |
Huimin Huang; Yawen Huang; Shiao Xie; Lanfen Lin; Tong Ruofeng; Yen-wei Chen; Yuexiang Li; Yefeng Zheng; |
| 219 | Explicifying Neural Implicit Fields for Efficient Dynamic Human Avatar Modeling Via A Neural Explicit Surface Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a technique for efficiently modeling dynamic humans by explicifying the implicit neural fields via a Neural Explicit Surface (NES). |
Ruiqi Zhang; Jie Chen; Qiang Wang; |
| 220 | HSVLT: Hierarchical Scale-Aware Vision-Language Transformer for Multi-Label Image Classification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Additionally, they only extract visual features and perform cross-modal fusion at a single scale, neglecting objects with different characteristics. To address these issues, we propose a Hierarchical Scale-Aware Vision-Language Transformer (HSVLT) with two appealing designs: (1) A hierarchical multi-scale architecture that involves a Cross-Scale Aggregation module, which leverages joint multi-modal features extracted from multiple scales to recognize objects of varying sizes and appearances in images. |
Shuyi Ouyang; Hongyi Wang; Ziwei Niu; Zhenjia Bai; Shiao Xie; Yingying Xu; Ruofeng Tong; Yen-Wei Chen; Lanfen Lin; |
| 221 | Advancing Video Question Answering with A Multi-modal and Multi-layer Question Enhancement Network Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methodologies often lean towards video understanding and cross-modal information interaction modeling but tend to overlook the crucial aspect of comprehensive question understanding. To address this gap, we introduce the multi-modal and multi-layer question enhancement network, a groundbreaking framework emphasizing nuanced question understanding. |
Meng Liu; Fenglei Zhang; Xin Luo; Fan Liu; Yinwei Wei; Liqiang Nie; |
| 222 | FFNeRV: Flow-Guided Frame-Wise Neural Representations for Videos Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose FFNeRV, a novel method for incorporating flow information into frame-wise representations to exploit the temporal redundancy across the frames in videos inspired by the standard video codecs. |
Joo Chan Lee; Daniel Rho; Jong Hwan Ko; Eunbyung Park; |
| 223 | DTF-Net: Category-Level Pose Estimation and Shape Reconstruction Via Deformable Template Field Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In contrast, other approaches aim to achieve category-level estimation and reconstruction by leveraging normalized geometric structure priors, but the static prior-based reconstruction struggles with substantial intra-class variations. To solve these problems, we propose the DTF-Net, a novel framework for pose estimation and shape reconstruction based on implicit neural fields of object categories. |
Haowen Wang; Zhipeng Fan; Zhen Zhao; Zhengping Che; Zhiyuan Xu; Dong Liu; Feifei Feng; Yakun Huang; Xiuquan Qiao; Jian Tang; |
| 224 | Hardware-friendly Scalable Image Super Resolution with Progressive Structured Sparsity Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To solve the aforementioned problems, we propose Hardware-friendly Scalable SR (HSSR) with progressive structured sparsity. |
Fangchen Ye; Jin Lin; Hongzhan Huang; Jianping Fan; Zhongchao Shi; Yuan Xie; Yanyun Qu; |
| 225 | RTQ: Rethinking Video-language Understanding Based on Image-text Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Current techniques have only partially tackled these issues, and our quantitative analysis indicates that some of these methods are complementary. In light of this, we propose a novel framework called RTQ (Refine, Temporal model, and Query), which addresses these challenges simultaneously. |
Xiao Wang; Yaoyu Li; Tian Gan; Zheng Zhang; Jingjing Lv; Liqiang Nie; |
| 226 | General Debiasing for Multimodal Sentiment Analysis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end, we propose a general debiasing framework based on Inverse Probability Weighting (IPW), which adaptively assigns small weights to the samples with larger bias (i.e., the severer spurious correlations). |
Teng Sun; Juntong Ni; Wenjie Wang; Liqiang Jing; Yinwei Wei; Liqiang Nie; |
| 227 | Event-Diffusion: Event-Based Image Reconstruction and Restoration with Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we argue that the key to event-based image reconstruction is to enhance the edge information of objects and restore the artifacts in the reconstructed images. |
Quanmin Liang; Xiawu Zheng; Kai Huang; Yan Zhang; Jie Chen; Yonghong Tian; |
| 228 | ProTegO: Protect Text Content Against OCR Extraction Attack Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose ProTegO, a novel text content protection method against the OCR extraction attack, which generates adversarial underpaintings that do not affect human reading but can interfere with OCR after taking screenshots. |
Yanru He; Kejiang Chen; Guoqiang Chen; Zehua Ma; Kui Zhang; Jie Zhang; Huanyu Bian; Han Fang; Weiming Zhang; Nenghai Yu; |
| 229 | Adaptive Feature Swapping for Unsupervised Domain Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a simple but effective technique named Adaptive Feature Swapping for learning domain invariant features in Unsupervised Domain Adaptation (UDA). |
Junbao Zhuo; Xingyu Zhao; Shuhao Cui; Qingming Huang; Shuhui Wang; |
| 230 | Client-Adaptive Cross-Model Reconstruction Network for Modality-Incomplete Multimodal Federated Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a Client-Adaptive Cross-Modal Reconstruction Network (CACMRN) to solve the modality-incomplete multimodal federated learning (MI-MFL). |
Baochen Xiong; Xiaoshan Yang; Yaguang Song; Yaowei Wang; Changsheng Xu; |
| 231 | Deep Neural Network Watermarking Against Model Extraction Attack Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most methods are vulnerable to model extraction attacks, where attackers collect output labels from the model to train a surrogate or a replica. To address this issue, we present a novel DNN watermarking approach, named SSW, which constructs an adaptive trigger set progressively by optimizing over a pair of symmetric shadow models to enhance the robustness to model extraction. |
Jingxuan Tan; Nan Zhong; Zhenxing Qian; Xinpeng Zhang; Sheng Li; |
| 232 | Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment for Markup-to-Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper proposes a novel model named "Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment" (FSA-CDM), which introduces contrastive positive/negative samples into the diffusion model to boost performance for markup-to-image generation. |
Guojin Zhong; Jin Yuan; Pan Wang; Kailun Yang; Weili Guan; Zhiyong Li; |
| 233 | MV-Diffusion: Motion-aware Video Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a Motion-aware Video Diffusion Model (MV-Diffusion) for enhancing the temporal consistency of generated videos using autoregressive diffusion models. |
Zijun Deng; Xiangteng He; Yuxin Peng; Xiongwei Zhu; Lele Cheng; |
| 234 | Graph to Grid: Learning Deep Representations for Multimodal Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these simple models have difficulty decoupling complex emotion patterns due to their limited representation capacity. To address this problem, we propose the graph-to-grid (G2G), a concise and plug-and-play module that transforms the 1-D graph-like data into the two-dimensional (2-D) grid-like data via the numerical relation coding. |
Ming Jin; Jinpeng Li; |
| 235 | Synthesizing Videos from Images for Image-to-Video Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we provide a new perspective and propose a single-stage method that synthesizes video from the source static image and converts the image-to-video adaptation problem into a video-to-video adaptation problem. |
Junbao Zhuo; Xingyu Zhao; Shuhui Wang; Huimin Ma; Qingming Huang; |
| 236 | Joint Local Relational Augmentation and Global Nash Equilibrium for Federated Learning with Non-IID Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose FedRANE, which consists of two main modules, i.e., local relational augmentation (LRA) and global Nash equilibrium (GNE), to resolve intra- and inter-client inconsistency simultaneously. |
Xinting Liao; Chaochao Chen; Weiming Liu; Pengyang Zhou; Huabin Zhu; Shuheng Shen; Weiqiang Wang; Mengling Hu; Yanchao Tan; Xiaolin Zheng; |
| 237 | Multimodal Adaptive Emotion Transformer with Flexible Modality Inputs on A Novel Dataset with Continuous Labels Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we present a novel multimodal emotion dataset that incorporates electroencephalography (EEG) and eye movement signals to systematically explore human emotions. |
Wei-Bang Jiang; Xuan-Hao Liu; Wei-Long Zheng; Bao-Liang Lu; |
| 238 | FourLLIE: Boosting Low-Light Image Enhancement By Fourier Frequency Information Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Some researchers noticed that, in the Fourier space, the lightness degradation mainly exists in the amplitude component and the rest exists in the phase component. By incorporating both the Fourier frequency and the spatial information, these researchers proposed remarkable solutions for LLIE. |
Chenxi Wang; Hongjun Wu; Zhi Jin; |
| 239 | Brighten-and-Colorize: A Decoupled Network for Customized Low-Light Image Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: It easily leads to chromatic aberration and, to some extent, limits the diverse applications of chrominance in customized LLIE. In this work, a "brighten-and-colorize" network (called BCNet), which introduces image colorization to LLIE, is proposed to address the above issues. |
Chenxi Wang; Zhi Jin; |
| 240 | Digging Into Depth Priors for Outdoor Neural Radiance Fields Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we provide a comprehensive study and evaluation of employing depth priors to outdoor neural radiance fields, covering common depth sensing technologies and most application ways. |
Chen Wang; Jiadai Sun; Lina Liu; Chenming Wu; Zhelun Shen; Dayan Wu; Yuchao Dai; Liangjun Zhang; |
| 241 | FeaCo: Reaching Robust Feature-Level Consensus in Noisy Pose Conditions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, limited sensor accuracy leads to noisy poses that misalign observations among vehicles. To address this problem, we propose FeaCo, which achieves robust Feature-level Consensus among collaborating agents in noisy pose conditions without additional training. |
Jiaming Gu; Jingyu Zhang; Muyang Zhang; Weiliang Meng; Shibiao Xu; Jiguang Zhang; Xiaopeng Zhang; |
| 242 | FedAA: Using Non-sensitive Modalities to Improve Federated Learning While Preserving Image Privacy Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we demonstrate that the problem can be solved by sharing information from the non-sensitive modality (e.g., metadata, non-sensitive descriptions, etc.) while keeping the sensitive information of images protected. |
Dong Chen; Siliang Tang; Zijin Shen; Guoming Wang; Jun Xiao; Yueting Zhuang; Carl Yang; |
| 243 | Graph Based Spatial-temporal Fusion for Multi-modal Person Re-identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a Graph based Spatio-Temporal Fusion model for high-performance multi-modal person Re-ID, namely G-Fusion, to mitigate the impact of noise. |
Yaobin Zhang; Jianming Lv; Chen Liu; Hongmin Cai; |
| 244 | M2ATS: A Real-world Multimodal Air Traffic Situation Benchmark Dataset and Beyond Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, almost all ATC-related datasets are only unimodal for certain tasks, which fails to comprehensively illustrate the traffic situation to further support real-world studies. To address this gap, a multimodal air traffic situation (M2ATS) dataset is constructed to advance AI-related research in the ATC domain, including airspace information, flight plan, trajectory, and speech. |
Dongyue Guo; Yi Lin; Xuehang You; Zhongping Yang; Jizhe Zhou; Bo Yang; Jianwei Zhang; Han Shi; Shasha Hu; Zheng Zhang; |
| 245 | Optimizing Adaptive Video Streaming with Human Feedback Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose Jade, which leverages reinforcement learning with human feedback (RLHF) technologies to better align with the users’ opinion scores. |
Tianchi Huang; Rui-Xiao Zhang; Chenglei Wu; Lifeng Sun; |
| 246 | Knowledge Prompt-tuning for Sequential Recommendation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To summarize, we believe that a good recommendation system should utilize both general and domain knowledge simultaneously. Therefore, we introduce an external knowledge base and propose Knowledge Prompt-tuning for Sequential Recommendation (KP4SR). |
Jianyang Zhai; Xiawu Zheng; Chang-Dong Wang; Hui Li; Yonghong Tian; |
| 247 | Improving Few-shot Image Generation By Structural Discrimination and Textural Modulation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, such an intuitive combination of images/features only exploits the most relevant information for generation, leading to poor diversity and coarse-grained semantic fusion. To remedy this, this paper proposes a novel textural modulation (TexMod) mechanism to inject external semantic signals into internal local representations. |
Mengping Yang; Zhe Wang; Wenyi Feng; Qian Zhang; Ting Xiao; |
| 248 | Differentially Private Sparse Mapping for Privacy-Preserving Cross Domain Recommendation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we focus on the Privacy-Preserving Cross-Domain Recommendation problem (PPCDR). |
Weiming Liu; Xiaolin Zheng; Chaochao Chen; Mengling Hu; Xinting Liao; Fan Wang; Yanchao Tan; Dan Meng; Jun Wang; |
| 249 | Better Integrating Vision and Semantics for Improving Few-shot Classification Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a novel method, called bimodal integrator (BMI), to better integrate visual and semantic prototypes. |
Zhuoling Li; Yong Wang; |
| 250 | Learning Non-Uniform-Sampling for Ultra-High-Definition Image Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, previous studies typically rely on the uniform and content-agnostic downsampling method that equally treats various regions regardless of their complexities, thus limiting the detail reconstruction in UHD image enhancement. To alleviate this issue, we propose a novel spatial-variant and invertible non-uniform downsampler that adaptively adjusts the sampling rate according to the richness of details. |
Wei Yu; Qi Zhu; Naishan Zheng; Jie Huang; Man Zhou; Feng Zhao; |
| 251 | Unbalanced Multi-view Deep Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Previous methods for this problem have at least one of the following drawbacks: (1) despising the information of low dimensional views; (2) constructing balanced view-specific inter-instance similarity graphs or employing decision-level fusion, which cannot well learn multi-level inter-view correlations and is limited to category-related tasks such as clustering. To eliminate all these drawbacks, we present an Unbalanced Multi-view Deep Learning (UMDL) method. |
Cai Xu; Zehui Li; Ziyu Guan; Wei Zhao; Xiangyu Song; Yue Wu; Jianxin Li; |
| 252 | Dual Dynamic Proxy Hashing Network for Long-tailed Image Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose a novel Dual Dynamic Proxy Hashing Network (DDPHN) with two sets of learnable dynamic proxies, i.e. hash proxies and feature proxies, to improve the discrimination of hash codes for tail-class samples. |
Yan Jiang; Hongtao Xie; Lei Zhang; Pandeng Li; Dongming Zhang; Yongdong Zhang; |
| 253 | LDRM: Degradation Rectify Model for Low-light Imaging Via Color-Monochrome Cameras Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods usually pursue higher brightness, resulting in unrealistic exposure. Inspired by the Human Vision System (HVS), where rods perceive more light while cones perceive more colors, we propose a Low-light Degradation Rectify Model (LDRM) with color-monochrome cameras to solve this problem. |
Junhong Lin; Shufan Pei; Bing Chen; Nanfeng Jiang; Wei Gao; Tiesong Zhao; |
| 254 | LHNet: A Low-cost Hybrid Network for Single Image Dehazing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The shortcomings of the above three types of methods lead to issues such as imbalanced colors and incoherent details in the predicted haze-free image. To address these challenges, we propose a new Low-cost Hybrid Network called LHNet. |
Shenghai Yuan; Jijia Chen; Jiaqi Li; Wenchao Jiang; Song Guo; |
| 255 | TIVA-KG: A Multimodal Knowledge Graph with Text, Image, Video and Audio Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose TIVA-KG, a multimodal Knowledge Graph covering Text, Image, Video and Audio, which can benefit various downstream tasks. |
Xin Wang; Benyuan Meng; Hong Chen; Yuan Meng; Ke Lv; Wenwu Zhu; |
| 256 | Mixup-Augmented Temporally Debiased Video Grounding with Content-Location Disentanglement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a Disentangled Feature Mixup (DFM) framework for debiased VG, which is capable of performing unbiased grounding to tackle the temporal bias issue. |
Xin Wang; Zihao Wu; Hong Chen; Xiaohan Lan; Wenwu Zhu; |
| 257 | Echoes: Unsupervised Debiasing Via Pseudo-bias Labeling in An Echo Chamber Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper first presents experimental analyses revealing that the existing biased models overfit to bias-conflicting samples in the training data, which negatively impacts the debiasing performance of the target models. To address this issue, we propose a straightforward and effective method called Echoes, which trains a biased model and a target model with a different strategy. |
Rui Hu; Yahan Tu; Jitao Sang; |
| 258 | Prototypical Cross-domain Knowledge Transfer for Cervical Dysplasia Visual Inspection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To robustly learn the transferable information across datasets, we propose a novel prototype-based knowledge filtering method to estimate the transferability of cross-domain samples. |
Yichen Zhang; Yifang Yin; Ying Zhang; Zhenguang Liu; Zheng Wang; Roger Zimmermann; |
| 259 | Separate and Locate: Rethink The Text in Text-based Visual Question Answering Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: The 1-D position embedding can only represent the left-right sequence relationship between words in a sentence, but not the complex spatial position relationship. To tackle these problems, we propose a novel method named Separate and Locate (SaL) that explores text contextual cues and designs spatial position embedding to construct spatial relations between OCR texts. |
Chengyang Fang; Jiangnan Li; Liang Li; Can Ma; Dayong Hu; |
| 260 | Filling The Information Gap Between Video and Query for Language-Driven Moment Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, instead of training with a single query, we propose to utilize the diversity and complementarity among different queries corresponding to the same video moment for enriching the textual semantics. |
Daizong Liu; Xiaoye Qu; Jianfeng Dong; Guoshun Nan; Pan Zhou; Zichuan Xu; Lixing Chen; He Yan; Yu Cheng; |
| 261 | Unlearnable Examples Give A False Sense of Security: Piercing Through Unexploitable Data with Learnable Examples Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Unfortunately, we find UEs provide a false sense of security, because they cannot stop unauthorized users from utilizing other unprotected data to remove the protection, by turning unlearnable data into learnable again. Motivated by this observation, we formally define a new threat by introducing learnable unauthorized examples (LEs), which are UEs with their protection removed. |
Wan Jiang; Yunfeng Diao; He Wang; Jianxin Sun; Meng Wang; Richang Hong; |
| 262 | Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a novel task, zero-shot voice conversion based on face images (zero-shot FaceVC), which aims at converting the voice characteristics of an utterance from any source speaker to a newly coming target speaker, solely relying on a single face image of the target speaker. To address this task, we propose a face-voice memory-based zero-shot FaceVC method. |
Zheng-Yan Sheng; Yang Ai; Yan-Nian Chen; Zhen-Hua Ling; |
| 263 | Distilling Vision-Language Foundation Models: A Data-Free Approach Via Prompt Diversification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we discuss the extension of DFKD to Vision-Language Foundation Models without access to the billion-level image-text datasets. |
Yunyi Xuan; Weijie Chen; Shicai Yang; Di Xie; Luojun Lin; Yueting Zhuang; |
| 264 | AdaBrowse: Adaptive Video Browser for Efficient Continuous Sign Language Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Raw videos have been proven to own considerable feature redundancy where in many cases only a portion of frames can already meet the requirements for accurate recognition. In this paper, we are interested in whether such redundancy can be effectively leveraged to facilitate efficient inference in continuous sign language recognition (CSLR). |
Lianyu Hu; Liqing Gao; Zekang Liu; Chi-Man Pun; Wei Feng; |
| 265 | UniNeXt: Exploring A Unified Architecture for Vision Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose UniNeXt, an improved general architecture for the vision backbone. |
Fangjian Lin; Jianlong Yuan; Sitong Wu; Fan Wang; Zhibin Wang; |
| 266 | On The Importance of Spatial Relations for Few-shot Action Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we find that the spatial misalignment between objects also occurs in videos, notably more common than the temporal inconsistency. |
Yilun Zhang; Yuqian Fu; Xingjun Ma; Lizhe Qi; Jingjing Chen; Zuxuan Wu; Yu-Gang Jiang; |
| 267 | Scene-Generalizable Interactive Segmentation of Radiance Fields Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we make the first attempt at Scene-Generalizable Interactive Segmentation in Radiance Fields (SGISRF) and propose a novel SGISRF method, which can perform 3D object segmentation for novel (unseen) scenes represented by radiance fields, guided by only a few interactive user clicks in a given set of multi-view 2D images. |
Songlin Tang; Wenjie Pei; Xin Tao; Tanghui Jia; Guangming Lu; Yu-Wing Tai; |
| 268 | Toward High Quality Facial Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we explore high-performance pre-training methods to boost the face analysis tasks such as face alignment and face parsing. |
Yue Wang; Jinlong Peng; Jiangning Zhang; Ran Yi; Liang Liu; Yabiao Wang; Chengjie Wang; |
| 269 | PatchBackdoor: Backdoor Attack Against Deep Neural Networks Without Model Modification Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we show that backdoor attacks can be achieved without any model modification. |
Yizhen Yuan; Rui Kong; Shenghao Xie; Yuanchun Li; Yunxin Liu; |
| 270 | High-Order Tensor Recovery Coupling Multilayer Subspace Priori with Application in Video Restoration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, to better address the high-order tensor recovery issue, we propose a novel method that couples multilayer subspace priors with high-order tensor recovery techniques for tensor completion and robust tensor principal component analysis. |
Hao Tan; Weichao Kong; Feng Zhang; Wenjin Qin; Jianjun Wang; |
| 271 | Slowfast Diversity-aware Prototype Learning for Egocentric Action Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose a novel Slowfast Diversity-aware Prototype learning (SDP) to effectively capture interacting objects by learning compact yet diverse prototypes, and adaptively capture motion in either long-time video or short-time video. |
Guangzhao Dai; Xiangbo Shu; Rui Yan; Peng Huang; Jinhui Tang; |
| 272 | Unlocking The Power of Cross-Dimensional Semantic Dependency for Image-Text Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Ignoring this intrinsic information probably leads to suboptimal aggregation for semantic similarity, impairing cross-modal matching learning. To solve this issue, we propose a novel cross-dimensional semantic dependency-aware model (called X-Dim), which explicitly and adaptively mines the semantic dependencies between dimensions in the shared space, enabling dimensions with joint dependencies to be enhanced and utilized. |
Kun Zhang; Lei Zhang; Bo Hu; Mengxiao Zhu; Zhendong Mao; |
| 273 | Improving Zero-shot Visual Question Answering Via Large Language Models with Reasoning Question Prompts Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end, we present Reasoning Question Prompts for VQA tasks, which can further activate the potential of LLMs in zero-shot scenarios. |
Yunshi Lan; Xiang Li; Xin Liu; Yang Li; Wei Qin; Weining Qian; |
| 274 | SeeDS: Semantic Separable Diffusion Synthesizer for Zero-shot Food Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, the complexity of semantic attributes and intra-class feature diversity poses challenges for ZSD methods in distinguishing fine-grained food classes. To tackle this, we propose the Semantic Separable Diffusion Synthesizer (SeeDS) framework for Zero-Shot Food Detection (ZSFD). |
Pengfei Zhou; Weiqing Min; Yang Zhang; Jiajun Song; Ying Jin; Shuqiang Jiang; |
| 275 | SEAM: Searching Transferable Mixed-Precision Quantization Policy Through Large Margin Regularization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This limits the practicality of MPQ in real-world deployment scenarios. To address this issue, this paper proposes a novel method for efficiently searching for effective MPQ policies using a small proxy dataset instead of the large-scale dataset used for training the model. |
Chen Tang; Kai Ouyang; Zenghao Chai; Yunpeng Bai; Yuan Meng; Zhi Wang; Wenwu Zhu; |
| 276 | DRIN: Dynamic Relation Interactive Network for Multimodal Entity Linking Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Second, their alignment is static, leading to low performance when dealing with complex and diverse data. To address these issues, we propose a novel framework called Dynamic Relation Interactive Network (DRIN) for MEL tasks. |
Shangyu Xing; Fei Zhao; Zhen Wu; Chunhui Li; Jianbing Zhang; Xinyu Dai; |
| 277 | Pixel Adapter: A Graph-Based Post-Processing Approach for Scene Text Image Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, the upsampling module, which is crucial in the process of converting low-resolution images to high-resolution ones, has received little attention in existing works. To address this issue, we propose the Pixel Adapter Module (PAM) based on graph attention to address pixel distortion caused by upsampling. |
Wenyu Zhang; Xin Deng; Baojun Jia; Xingtong Yu; Yifan Chen; Jin Ma; Qing Ding; Xinming Zhang; |
| 278 | Uncertainty-Driven Dynamic Degradation Perceiving and Background Modeling for Efficient Single Image Desnowing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by them, we propose Dynamic Perceiving for Degraded Regions and Axial-Pooling Attention for Background Structure Modeling, which together couple a new network architecture, dubbed as D2P-BMNet. |
Sixiang Chen; Tian Ye; Chenghao Xue; Haoyu Chen; Yun Liu; Erkang Chen; Lei Zhu; |
| 279 | Exploring Inconsistent Knowledge Distillation for Object Detection with Data Augmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose inconsistent knowledge distillation (IKD), which aims to distill knowledge inherent in the teacher model’s counter-intuitive perceptions. |
Jiawei Liang; Siyuan Liang; Aishan Liu; Ke Ma; Jingzhi Li; Xiaochun Cao; |
| 280 | Improving Federated Person Re-Identification Through Feature-Aware Proximity and Aggregation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel feature-aware local proximity and global aggregation method for federated ReID to extract robust feature representations. |
Pengling Zhang; Huibin Yan; Wenhui Wu; Shuoyao Wang; |
| 281 | Propagation Is All You Need: A New Framework for Representation Learning and Classifier Training on Graphs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Accordingly, a novel GNN-specific training framework is proposed by simultaneously updating node representations and classifier parameters via a unified feature propagation scheme. |
Jiaming Zhuo; Can Cui; Kun Fu; Bingxin Niu; Dongxiao He; Yuanfang Guo; Zhen Wang; Chuan Wang; Xiaochun Cao; Liang Yang; |
| 282 | Parsing Is All You Need for Accurate Gait Recognition in The Wild Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To achieve accurate gait recognition in the wild, this paper presents a novel gait representation, named Gait Parsing Sequence (GPS). |
Jinkai Zheng; Xinchen Liu; Shuai Wang; Lihao Wang; Chenggang Yan; Wu Liu; |
| 283 | YOGA: Yet Another Geometry-based Point Cloud Compressor Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: It is flexible, allowing for the separable lossy compression of geometry and color attributes, and variable-rate coding using a single neural model; it is high-efficiency, significantly outperforming the latest G-PCC standard quantitatively and qualitatively, e.g., 25% BD-BR gains using PCQM (Point Cloud Quality Metric) as the distortion assessment; and it is lightweight, e.g., similar runtime as the G-PCC codec, owing to the use of sparse convolution and parallel entropy coding. |
Junteng Zhang; Tong Chen; Dandan Ding; Zhan Ma; |
| 284 | Learning Intra and Inter-Camera Invariance for Isolated Camera Supervised Person Re-identification Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To eliminate the confounding effect of camera bias, we propose to learn both intra- and inter-camera invariance under a unified framework. |
Menglin Wang; Xiaojin Gong; |
| 285 | AffectFAL: Federated Active Affective Computing with Non-IID Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: A major challenge in federated active learning is the inconsistency between the active sampling goals of global and local models, particularly in scenarios with Non-IID data across clients, which exacerbates the problem. To address the above challenge, we propose AffectFAL, a federated active affective computing framework. |
Zixin Zhang; Fan Qi; Shuai Li; Changsheng Xu; |
| 286 | Dense Object Grounding in 3D Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we introduce a new challenging task, called 3D Dense Object Grounding (3D DOG), to jointly localize multiple objects described in a more complicated paragraph rather than a single sentence. |
Wencan Huang; Daizong Liu; Wei Hu; |
| 287 | AutoPoster: A Highly Automatic and Content-aware Design System for Advertising Poster Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces AutoPoster, a highly automatic and content-aware system for generating advertising posters. |
Jinpeng Lin; Min Zhou; Ye Ma; Yifan Gao; Chenxi Fei; Yangjian Chen; Zhang Yu; Tiezheng Ge; |
| 288 | Learning Causality-inspired Representation Consistency for Video Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by causal representation learning, we think that there exists a causal variable capable of adequately representing the general patterns of regular events in which anomalies will present significant variations. Therefore, we design a causality-inspired representation consistency (CRC) framework to implicitly learn the unobservable causal variables of normality directly from available normal videos and detect abnormal events with the learned representation consistency. |
Yang Liu; Zhaoyang Xia; Mengyang Zhao; Donglai Wei; Yuzheng Wang; Siao Liu; Bobo Ju; Gaoyun Fang; Jing Liu; Liang Song; |
| 289 | PetalView: Fine-grained Location and Orientation Extraction of Street-view Images Via Cross-view Local Search Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, when an angle prior is given, we propose a learnable prior angle mixer to utilize this information. Our method obtains the best performance on the VIGOR dataset and successfully improves the performance on the KITTI dataset test 1 set, with the recall within 1 meter (r@1m) for location estimation reaching 68.88% and recall within 1 degree (r@1d) reaching 21.10% when no angle prior is available; with an angle prior, it achieves stable estimations at r@1m and r@1d above 70% and 21%, up to a 40-degree noise level. |
Wenmiao Hu; Yichen Zhang; Yuxuan Liang; Xianjing Han; Yifang Yin; Hannes Kruppa; See-Kiong Ng; Roger Zimmermann; |
| 290 | C2MR: Continual Cross-Modal Retrieval for Streaming Multi-modal Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose an online continual learning setup, OC-CMR, to formalize the data-incremental growth challenge faced by cross-modal retrieval systems. |
Huaiwen Zhang; Yang Yang; Fan Qi; Shengsheng Qian; Changsheng Xu; |
| 291 | EasyNet: An Easy Network for 3D Industrial Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Extensive experiments show that EasyNet achieves an anomaly detection AUROC of 92.6% without using pretrained models and memory banks. |
Ruitao Chen; Guoyang Xie; Jiaqi Liu; Jinbao Wang; Ziqi Luo; Jinfan Wang; Feng Zheng; |
| 292 | Probability Distribution Based Frame-supervised Language-driven Action Localization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This task is challenging due to the absence of complete and accurate annotation of action boundaries, hindering visual-language alignment and action boundary prediction. To address this challenge, we propose a novel method that introduces distribution functions to model both the probability of action frame and that of boundary frame. |
Shuo Yang; Zirui Shang; Xinxiao Wu; |
| 293 | SemanticRT: A Large-Scale Dataset and Method for Robust Semantic Segmentation in Multispectral Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, unlike traditional RGB-only semantic segmentation, the lack of a large-scale MSS dataset has become a hindrance to the progress of this field. To address this issue, we introduce a SemanticRT dataset – the largest MSS dataset to date, comprising 11,371 high-quality, pixel-level annotated RGB-thermal image pairs. |
Wei Ji; Jingjing Li; Cheng Bian; Zhicheng Zhang; Li Cheng; |
| 294 | Finding Efficient Pruned Network Via Refined Gradients for Pruned Weights Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, using these coarse gradients causes training instability and performance degradation owing to the unreliable gradient signal of the STE approximation. In this work, to tackle this issue, we introduce refined gradients to update the pruned weights by forming dual forwarding paths from two sets (pruned and unpruned) of weights. |
Jangho Kim; Jayeon Yoo; Yeji Song; KiYoon Yoo; Nojun Kwak; |
| 295 | DocDiff: Document Enhancement Via Residual Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, existing regression-based methods optimized for pixel-level distortion reduction tend to suffer from significant loss of high-frequency information, leading to distorted and blurred text edges. To compensate for this major deficiency, we propose DocDiff, the first diffusion-based framework specifically designed for diverse challenging document enhancement problems, including document deblurring, denoising, and removal of watermarks and seals. |
Zongyuan Yang; Baolin Liu; Yongping Xiong; Lan Yi; Guibin Wu; Xiaojun Tang; Ziqi Liu; Junjie Zhou; Xing Zhang; |
| 296 | DUSA: Decoupled Unsupervised Sim2Real Adaptation for Vehicle-to-Everything Collaborative Perception Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To take full advantage of simulated data, we present a new unsupervised sim2real domain adaptation method for V2X collaborative detection named Decoupled Unsupervised Sim2Real Adaptation (DUSA). |
Xianghao Kong; Wentao Jiang; Jinrang Jia; Yifeng Shi; Runsheng Xu; Si Liu; |
| 297 | CenterLPS: Segment Instances By Centers for LiDAR Panoptic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by the center representation and kernel-based segmentation, we propose a new detection-free and clustering-free framework called CenterLPS, with the center-based instance encoding and decoding paradigm. |
Jianbiao Mei; Yu Yang; Mengmeng Wang; Zizhang Li; Xiaojun Hou; Jongwon Ra; Laijian Li; Yong Liu; |
| 298 | Intra- and Inter-Modal Curriculum for Multimodal Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing solutions to this problem suffer from two limitations: i) they merely focus on inter-modal balance, failing to consider the influence of intra-modal data on each modality; ii) their implementations heavily rely on unimodal performances or losses, thus being suboptimal for the tasks requiring modal interactions (e.g., visual question answering). To tackle these limitations, we propose I2MCL, a generic Intra- and Inter-Modal Curriculum Learning framework which simultaneously considers both data difficulty and modality balance for multimodal learning. |
Yuwei Zhou; Xin Wang; Hong Chen; Xuguang Duan; Wenwu Zhu; |
| 299 | Diffused Fourier Network for Video Action Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work proposes a novel model, dubbed as diffused Fourier network (DFN) for video action segmentation. |
Borui Jiang; Yadong Mu; |
| 300 | GoRec: A Generative Cold-start Recommendation Framework Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: These hybrid preference representations contained auxiliary collaborative signals, so current solutions designed alignment functions to transfer learned hybrid preference representations to cold items. Despite the effectiveness, we argue that they are still limited as these models relied heavily on manually designed alignment functions, which are easily influenced by the limited item records and noises in the training data. To tackle the above limitations, we propose a Generative cold-start Recommendation (GoRec) framework for multimedia-based new item recommendation. |
Haoyue Bai; Min Hou; Le Wu; Yonghui Yang; Kun Zhang; Richang Hong; Meng Wang; |
| 301 | Multimodal Prompt Transformer with Hybrid Contrastive Learning for Emotion Recognition in Conversation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Emotions can exist in multiple modalities, and multimodal ERC mainly faces two problems: (1) the noise problem in the cross-modal information fusion process, and (2) the prediction problem of less sample emotion labels that are semantically similar but different categories. To address these issues and fully utilize the features of each modality, we adopted the following strategies: first, deep emotion cues extraction was performed on modalities with strong representation ability, and feature filters were designed as multimodal prompt information for modalities with weak representation ability. |
Shihao Zou; Xianying Huang; Xudong Shen; |
| 302 | WaterFlow: Heuristic Normalizing Flow for Underwater Image Enhancement and Beyond Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To balance the visual quality and application, we propose a heuristic normalizing flow for detection-driven underwater image enhancement, dubbed WaterFlow. |
ZengXi Zhang; Zhiying Jiang; Jinyuan Liu; Xin Fan; Risheng Liu; |
| 303 | Distribution Consistency Based Fast Anchor Imputation for Incomplete Multi-view Clustering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To break the existing limitations, we propose a novel Distribution Consistency based Fast Anchor Imputation for Incomplete Multi-view Clustering (DCFAI-IMVC) method. |
Xingfeng Li; Yinghui Sun; Quansen Sun; Jia Dai; Zhenwen Ren; |
| 304 | Video Infringement Detection Via Feature Disentanglement and Mutual Information Maximization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Current state-of-the-art methods tend to simply feed high-dimensional mixed video features into deep neural networks and count on the networks to extract useful representations. Despite its simplicity, this paradigm heavily relies on the original entangled features and lacks constraints guaranteeing that useful task-relevant semantics are extracted from the features. In this paper, we seek to tackle the above challenges from two aspects: (1) We propose to disentangle an original high-dimensional feature into multiple sub-features, explicitly decomposing the feature into exclusive lower-dimensional components. |
Zhenguang Liu; Xinyang Yu; Ruili Wang; Shuai Ye; Zhe Ma; Jianfeng Dong; Sifeng He; Feng Qian; Xiaobo Zhang; Roger Zimmermann; Lei Yang; |
| 305 | Combating Online Misinformation Videos: Characterization, Detection, and Future Directions Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Besides summarizing existing studies, we discuss related areas and outline open issues and future directions to encourage and guide more research on misinformation video detection. |
Yuyan Bu; Qiang Sheng; Juan Cao; Peng Qi; Danding Wang; Jintao Li; |
| 306 | Unsupervised Domain Adaptation for Video Object Grounding with Cascaded Debiasing Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper addresses the Unsupervised Domain Adaptation (UDA) for the dense frame prediction task – Video Object Grounding (VOG). |
Mengze Li; Haoyu Zhang; Juncheng Li; Zhou Zhao; Wenqiao Zhang; Shengyu Zhang; Shiliang Pu; Yueting Zhuang; Fei Wu; |
| 307 | LUNA: Language As Continuing Anchors for Referring Expression Comprehension Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose LUNA, which uses language as continuing anchors to guide box prediction in a Transformer decoder, and thus show that language-guided location priors can be effectively exploited in a Transformer-based architecture. |
Yaoyuan Liang; Zhao Yang; Yansong Tang; Jiashuo Fan; Ziran Li; Jingang Wang; Philip H.S. Torr; Shao-Lun Huang; |
| 308 | 3DStyle-Diffusion: Pursuing Fine-grained Text-driven 3D Stylization with 2D Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose a new 3DStyle-Diffusion model that triggers fine-grained stylization of 3D meshes with additional controllable appearance and geometric guidance from 2D Diffusion models. |
Haibo Yang; Yang Chen; Yingwei Pan; Ting Yao; Zhineng Chen; Tao Mei; |
| 309 | Isolation and Induction: Training Robust Deep Neural Networks Against Model Stealing Attacks Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To address the problems, this paper proposes Isolation and Induction (InI), a novel and effective training framework for model stealing defenses. |
Jun Guo; Xingyu Zheng; Aishan Liu; Siyuan Liang; Yisong Xiao; Yichao Wu; Xianglong Liu; |
| 310 | MindDiffuser: Controlled Image Reconstruction from Human Brain Activity with Semantic and Structural Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To address the aforementioned issue, we propose a two-stage image reconstruction model called MindDiffuser. |
Yizhuo Lu; Changde Du; Qiongyi Zhou; Dianpeng Wang; Huiguang He; |
| 311 | Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Thirdly, duplicate and unnecessary information can add complexity and complicate entangled spatio-temporal modeling. To address the above issues, we propose an innovative heuristic architecture called Multi-stage Factorized Spatio-Temporal (MFST) for RGB-D action and gesture recognition. |
Yujun Ma; Benjia Zhou; Ruili Wang; Pichao Wang; |
| 312 | VCMaster: Generating Diverse and Fluent Live Video Comments Based on Multimodal Contexts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a novel framework called VCMaster for multimodal live video comments generation, which balances the diversity and quality of generated comments to create human-like sentences. |
Manman Zhang; Ge Luo; Yuchen Ma; Sheng Li; Zhenxing Qian; Xinpeng Zhang; |
| 313 | ChinaOpen: A Dataset for Open-world Multimodal Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper introduces ChinaOpen, a dataset sourced from Bilibili, a popular Chinese video-sharing website, for open-world multimodal learning. |
Aozhu Chen; Ziyuan Wang; Chengbo Dong; Kaibin Tian; Ruixiang Zhao; Xun Liang; Zhanhui Kang; Xirong Li; |
| 314 | AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This can greatly benefit various complex downstream tasks, including cross-modal image-text retrieval and image classification. Despite its promising prospect, the security issue of cross-modal pre-trained encoders has not been fully explored yet, especially when the pre-trained encoder is publicly available for commercial use. In this work, we propose AdvCLIP, the first attack framework for generating downstream-agnostic adversarial examples based on cross-modal pre-trained encoders. |
Ziqi Zhou; Shengshan Hu; Minghui Li; Hangtao Zhang; Yechao Zhang; Hai Jin; |
| 315 | Multi-scale Target-Aware Framework for Constrained Splicing Detection and Localization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a multi-scale target-aware framework to couple feature extraction and correlation matching in a unified pipeline. |
Yuxuan Tan; Yuanman Li; Limin Zeng; Jiaxiong Ye; Wei Wang; Xia Li; |
| 316 | Mask-Guided Progressive Network for Joint Raindrop and Rain Streak Removal in Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The bottleneck is the lack of a video dataset in which each video frame contains both rain streaks and raindrops. To address this issue, in this work we generate a synthetic dataset, namely VRDS, with 102 rainy videos from diverse scenarios, where each video frame has the corresponding rain streak map, raindrop mask, and the underlying rain-free clean image (ground truth). |
Hongtao Wu; Yijun Yang; Haoyu Chen; Jingjing Ren; Lei Zhu; |
| 317 | Exploiting Low-confidence Pseudo-labels for Source-free Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Current SFOD methods utilize a threshold-based pseudo-label approach in the adaptation phase, which is typically limited to high-confidence pseudo-labels and results in a loss of information. To address this issue, we propose a new approach to take full advantage of pseudo-labels by introducing high and low confidence thresholds. |
Zhihong Chen; Zilei Wang; Yixin Zhang; |
| 318 | Where to Find Fascinating Inter-Graph Supervision: Imbalanced Graph Classification with Kernel Information Bottleneck Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, it is disadvantageous to accurately derive reliable inter-graph supervision because the redundancy information from majority graphs is introduced to obscure the representations of minority graphs during the propagation process. To tackle this issue, we propose a novel method that integrates the restricted random walk kernel with the global graph information bottleneck (GIB) to improve imbalanced graph classification. |
Hui Tang; Xun Liang; |
| 319 | Zero-shot Skeleton-based Action Recognition Via Mutual Information Estimation and Maximization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose a new zero-shot skeleton-based action recognition method via mutual information (MI) estimation and maximization. |
Yujie Zhou; Wenwen Qiang; Anyi Rao; Ning Lin; Bing Su; Jiaqi Wang; |
| 320 | Enhancing Real-Time Super Resolution with Partial Convolution and Efficient Variance Attention Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a simple network named PCEVAnet by constructing the PCEVA block, which leverages Partial Convolution and Efficient Variance Attention. |
Zhou Zhou; Jiahao Chao; Jiali Gong; Hongfan Gao; Zhenbing Zeng; Zhengfeng Yang; |
| 321 | Neural Video Compression with Spatio-Temporal Cross-Covariance Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Based on ST-XCT, we introduce a novel transformer-based end-to-end optimized NVC framework. |
Zhenghao Chen; Lucas Relic; Roberto Azevedo; Yang Zhang; Markus Gross; Dong Xu; Luping Zhou; Christopher Schroers; |
| 322 | Benign Shortcut for Debiasing: Fair Visual Recognition Via Intervention with Shortcut Features Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose Shortcut Debiasing, to first transfer the target task’s learning of bias attributes from bias features to shortcut features, and then employ causal intervention to eliminate shortcut features during inference. |
Yi Zhang; Jitao Sang; Junyang Wang; Dongmei Jiang; Yaowei Wang; |
| 323 | Cross-Architecture Distillation for Face Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We develop an Adaptable Prompting Teacher network (APT) that integrates prompts into the teacher, enabling it to manage distillation-specific knowledge while preserving the model’s discriminative capacity. |
Weisong Zhao; Xiangyu Zhu; Zhixiang He; Xiao-Yu Zhang; Zhen Lei; |
| 324 | Efficient Hierarchical Multi-view Fusion Transformer for 3D Human Pose Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Most existing methods have problems of over-reliance on camera parameters or insufficient semantic feature extraction. To address these issues, this paper proposes a hierarchical multi-view fusion transformer (HMVformer) framework for 3D HPE, incorporating cross-view feature fusion methods into the spatial and temporal feature extraction process in a coarse-to-fine manner. |
Kangkang Zhou; Lijun Zhang; Feng Lu; Xiang-Dong Zhou; Yu Shi; |
| 325 | Cultural Self-Adaptive Multimodal Gesture Generation Based on Multiple Culture Gesture Dataset Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We broadly evaluate our method across four large-scale benchmark datasets. |
Jingyu Wu; Shi Chen; Shuyu Gan; Weijun Li; Changyuan Yang; Lingyun Sun; |
| 326 | Knowledge Decomposition and Replay: A Novel Cross-modal Image-Text Retrieval Continual Learning Method Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To enable machines to mimic human cognitive abilities and alleviate the catastrophic forgetting problem in cross-modal image-text retrieval (CMITR), this paper proposes a novel continual learning method, Knowledge Decomposition and Replay (KDR), which emulates the process of knowledge decomposition and replay exhibited by humans in complex and changing environments. |
Rui Yang; Shuang Wang; Huan Zhang; Siyuan Xu; YanHe Guo; Xiutiao Ye; Biao Hou; Licheng Jiao; |
| 327 | MEDIC: A Multimodal Empathy Dataset in Counseling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we construct a multimodal empathy dataset collected from face-to-face psychological counseling sessions. |
Zhouan Zhu; Chenguang Li; Jicai Pan; Xin Li; Yufei Xiao; Yanan Chang; Feiyi Zheng; Shangfei Wang; |
| 328 | Regress Before Construct: Regress Autoencoder for Point Cloud Self-supervised Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose Point Regress AutoEncoder (Point-RAE), a new scheme for regressive autoencoders for point cloud self-supervised learning. |
Yang Liu; Chen Chen; Can Wang; Xulin King; Mengyuan Liu; |
| 329 | DuDoINet: Dual-Domain Implicit Network for Multi-Modality MR Image Arbitrary-scale Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we propose a dual-domain learning scheme for multi-modality MR image SR, which allows the network to sufficiently exploit the frequency and image domain information in MR images. |
Guangyuan Li; Wei Xing; Lei Zhao; Zehua Lan; Zhanjie Zhang; Jiakai Sun; Haolin Yin; Huaizhong Lin; Zhijie Lin; |
| 330 | Self-Reference Image Super-Resolution Via Pre-trained Diffusion Large Model and Window Adjustable Transformer Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Nevertheless, it is time-consuming, laborious, and even impossible in some cases to find high-quality reference images. To tackle this problem, we propose a brand-new self-reference image super-resolution approach using a pre-trained diffusion large model and a window adjustable transformer, termed DWTrans. |
Guangyuan Li; Wei Xing; Lei Zhao; Zehua Lan; Jiakai Sun; Zhanjie Zhang; Quanwei Zhang; Huaizhong Lin; Zhijie Lin; |
| 331 | STIRER: A Unified Model for Low-Resolution Scene Text Image Recovery and Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, a novel model called STIRER (the abbreviation of Scene Text Image REcovery and Recognition) is proposed to effectively and simultaneously recover and recognize LR scene text images under a unified framework. |
Minyi Zhao; Shijie Xuyang; Jihong Guan; Shuigeng Zhou; |
| 332 | FedCD: A Classifier Debiased Federated Learning Framework for Non-IID Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing federated learning approaches tend to bias towards classes containing a larger number of samples during local updates, which causes unwanted drift in the local classifiers. To address this issue, we propose a classifier debiased federated learning framework named FedCD for non-IID data. |
Yunfei Long; Zhe Xue; Lingyang Chu; Tianlong Zhang; Junjiang Wu; Yu Zang; Junping Du; |
| 333 | Calibration-based Dual Prototypical Contrastive Learning Approach for Domain Generalization Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, the prototypes of the same class in different domains may be different while the prototypes of different classes may be similar, which may affect the learning of class-wise domain-invariant features. Based on these observations, a calibration-based dual prototypical contrastive learning (CDPCL) approach is proposed to reduce the domain discrepancy between the learned class-wise features and the prototypes of different domains for domain generalization semantic segmentation. |
Muxin Liao; Shishun Tian; Yuhang Zhang; Guoguang Hua; Wenbin Zou; Xia Li; |
| 334 | Multi-teacher Self-training for Semi-supervised Node Classification with Noisy Labels Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most existing methods assume that the training data are with correct labels, but in the real world, the graph-structured data often carry noisy labels to reduce the effectiveness of GNNs. To address this issue, this paper proposes a new label correction method, called multi-teacher self-training (MTS-GNN for short), to conduct semi-supervised node classification with noisy labels. |
Yujing Liu; Zongqian Wu; Zhengyu Lu; Guoqiu Wen; Junbo Ma; Guangquan Lu; Xiaofeng Zhu; |
| 335 | Retrieval-based Knowledge Augmented Vision Language Pre-training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, not all knowledge present in images/texts is useful, therefore prior approaches often struggle to effectively integrate knowledge, visual, and textual information. In this study, we propose REtrieval-based knowledge Augmented Vision Language (REAVL), a novel knowledge-augmented pre-training framework to address the above issues. |
Jiahua Rao; Zifei Shan; Longpo Liu; Yao Zhou; Yuedong Yang; |
| 336 | Interpolation Normalization for Contrast Domain Generalization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, in this paper, we find that directly applying contrastive-based methods is not effective in domain generalization. To overcome this limitation, we propose to leverage a novel contrastive learning approach that promotes class-discriminative and class-balanced features from source domains. |
Mengzhu Wang; Junyang Chen; Huan Wang; Huisi Wu; Zhidan Liu; Qin Zhang; |
| 337 | Exploring Dual Representations in Large-Scale Point Clouds: A Simple Weakly Supervised Semantic Segmentation Framework Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To optimize point cloud representations, we propose a novel framework for the dual representation query network (DRQNet). |
Jiaming Liu; Yue Wu; Maoguo Gong; Qiguang Miao; Wenping Ma; Cai Xu; |
| 338 | Adaptive Contrastive Learning for Learning Robust Representations Under Label Noise Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methods either conduct sample-level processes and then use the resultant subset to construct pairs or directly perform pair-level selecting using a fixed threshold, both leading to sub-optimal pairing and subsequent representation learning. To address this issue, we propose a novel adaptive contrastive learning method (ACL) working at the pair level to select contrastive pairs adaptively. |
Zihao Wang; Weichen Zhang; Weihong Bao; Fei Long; Chun Yuan; |
| 339 | Light-VQA: A Multi-Dimensional Quality Assessment Model for Low-Light Video Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, due to the limitations of photographic equipment and techniques, UGC videos often contain various degradations, among which one of the most visually unfavorable effects is underexposure. Therefore, corresponding video enhancement algorithms such as Low-Light Video Enhancement (LLVE) have been proposed to deal with this specific degradation. |
Yunlong Dong; Xiaohong Liu; Yixuan Gao; Xunchu Zhou; Tao Tan; Guangtao Zhai; |
| 340 | PVG: Progressive Vision Graph for Vision Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Though Vision GNN (ViG) adopts graph-level features for complex images, it has some issues, such as inaccurate neighbor node selection, expensive node information aggregation calculation, and over-smoothing in the deep layers. To address the above problems, we propose a Progressive Vision Graph (PVG) architecture for vision recognition task. |
JiaFu Wu; Jian Li; Jiangning Zhang; Boshen Zhang; Mingmin Chi; Yabiao Wang; Chengjie Wang; |
| 341 | View While Moving: Efficient Video Recognition in Long-untrimmed Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Recent adaptive methods for efficient video recognition mostly follow the two-stage paradigm of preview-then-recognition and have achieved great success on multiple video … |
Ye Tian; Mengyu Yang; Lanshan Zhang; Zhizhen Zhang; Yang Liu; Xiaohui Xie; Xirong Que; Wendong Wang; |
| 342 | S-OmniMVS: Incorporating Sphere Geometry Into Omnidirectional Stereo Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we revisit omnidirectional MVS by incorporating three sphere geometry priors: spherical projection, spherical continuity, and spherical position. |
Zisong Chen; Chunyu Lin; Lang Nie; Zhijie Shen; Kang Liao; Yuanzhouhan Cao; Yao Zhao; |
| 343 | Event-based Motion Deblurring with Modality-Aware Decomposition and Recomposition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, inspired by the two-pathway visual system, a novel dual-stream based framework is proposed for motion deblurring (DS-Deblur), which flexibly utilizes the respective advantages from frame and event. |
Wen Yang; Jinjian Wu; Leida Li; Weisheng Dong; Guangming Shi; |
| 344 | Online Distillation-enhanced Multi-modal Transformer for Sequential Recommendation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Empirical experiments are conducted to demonstrate the need to design a framework suitable for collaborative learning and fusion of diverse information. Based on this, we propose a new model-agnostic framework for multi-modal sequential recommendation tasks, called Online Distillation-enhanced Multi-modal Transformer (ODMT), to enhance feature interaction and mutual learning among multi-source input (ID, text, and image), while avoiding conflicts among different features during training, thereby improving recommendation accuracy. |
Wei Ji; Xiangyan Liu; An Zhang; Yinwei Wei; Yongxin Ni; Xiang Wang; |
| 345 | Non-Exemplar Class-Incremental Learning Via Adaptive Old Class Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel NECIL method named POLO with an adaPtive Old cLass recOnstruction mechanism, in which a density-based prototype reinforcement method (DBR), a topology-correction prototype adaptation method (TPA), and an adaptive prototype augmentation method (APA) are designed to reconstruct pseudo features of old classes in new incremental sessions. |
Shaokun Wang; Weiwei Shi; Yuhang He; Yifan Yu; Yihong Gong; |
| 346 | Contrastive Intra- and Inter-Modality Generation for Enhancing Incomplete Multimedia Recommendation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To deal with the challenge of missing modalities, in this paper, we propose a novel framework of Contrastive Intra- and Inter-Modality Generation (CI2MG) for enhancing incomplete multimedia recommendation. |
Zhenghong Lin; Yanchao Tan; Yunfei Zhan; Weiming Liu; Fan Wang; Chaochao Chen; Shiping Wang; Carl Yang; |
| 347 | Ada-DQA: Adaptive Diverse Quality-aware Feature Acquisition for Video Quality Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To surmount the constraint of insufficient training data, in this paper, we first consider the complete range of video distribution diversity (i.e. content, distortion, motion) and employ diverse pretrained models (e.g. architecture, pretext task, pre-training dataset) to benefit quality representation. An Adaptive Diverse Quality-aware feature Acquisition (Ada-DQA) framework is proposed to capture desired quality-related features generated by these frozen pretrained models. |
Hongbo Liu; Mingda Wu; Kun Yuan; Ming Sun; Yansong Tang; Chuanchuan Zheng; Xing Wen; Xiu Li; |
| 348 | Semantic-Aware Generator and Low-level Feature Augmentation for Few-shot Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To ameliorate the generation quality, we in this paper propose a Semantic-Aware Generator (SAG) to provide explicit semantic guidance to the discriminator, and a Low-level Feature Augmentation (LFA) technique to provide fine-grained information, facilitating the diversity. |
Zhe Wang; Jiaoyan Guan; Mengping Yang; Ting Xiao; Ziqiu Chi; |
| 349 | Rethinking The Localization in Weakly Supervised Object Localization Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Weakly supervised object localization (WSOL) is one of the most popular and challenging tasks in computer vision. This task is to localize the objects in the images given only the … |
Rui Xu; Yong Luo; Han Hu; Bo Du; Jialie Shen; Yonggang Wen; |
| 350 | Skeletal Spatial-Temporal Semantics Guided Homogeneous-Heterogeneous Multimodal Network for Action Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a Spatial Transformer & Selective Temporal encoder (ST&ST) for skeleton-based action recognition by constructing two modules: Reranking-Enhanced Dynamic Mask Transformer (RE-DMT) and Selective Kernel Temporal Convolution (SK-TC). |
Chenwei Zhang; Yuxuan Hu; Min Yang; Chengming Li; Xiping Hu; |
| 351 | Towards Robust Real-Time Scene Text Detection: From Semantic to Instance Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To learn instance-aware representation, we propose to combine top-down modeling (TDM) with the bottom-up framework to provide implicit instance-level clues for the encoder. |
Xugong Qin; Pengyuan Lyu; Chengquan Zhang; Yu Zhou; Kun Yao; Peng Zhang; Hailun Lin; Weiping Wang; |
| 352 | Expand BERT Representation with Visual Information Via Grounded Language Learning with Multimodal Partial Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: As a result, during representation learning, there is a mismatch between the visual information and the contextual meaning of the sentence. To overcome this limitation, we propose GroundedBERT – a grounded language learning method that enhances the BERT representation with visually grounded information. |
Cong-Duy Nguyen; The-Anh Vu-Le; Thong Nguyen; Tho Quan; Anh-Tuan Luu; |
| 353 | Where and How: Mitigating Confusion in Neural Radiance Fields from Sparse Inputs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, due to the inherent limitations of sparse inputs and the gap between non-adjacent views, rendering results often suffer from over-fitting and foggy surfaces, a phenomenon we refer to as CONFUSION during volume rendering. In this paper, we analyze the root cause of this confusion and attribute it to two fundamental questions: WHERE and HOW. |
Yanqi Bao; Yuxin Li; Jing Huo; Tianyu Ding; Xinyue Liang; Wenbin Li; Yang Gao; |
| 354 | Context-Aware Talking-Head Video Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we fully utilize the video context to design a novel framework for talking-head video editing, which achieves efficiency, disentangled motion control, and sequential smoothness. |
Songlin Yang; Wei Wang; Jun Ling; Bo Peng; Xu Tan; Jing Dong; |
| 355 | QA-CLIMS: Question-Answer Cross Language Image Matching for Weakly Supervised Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel Question-Answer Cross-Language-Image Matching framework for WSSS (QA-CLIMS), leveraging the vision-language foundation model to maximize the text-based understanding of images and guide the generation of activation maps. |
Songhe Deng; Wei Zhuo; Jinheng Xie; Linlin Shen; |
| 356 | AcFormer: An Aligned and Compact Transformer for Multimodal Sentiment Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Additionally, existing methods lack consideration for the efficiency of modal fusion. To tackle these issues, we propose AcFormer, which contains two core ingredients: i) contrastive learning within and across modalities to explicitly align different modality streams before fusion; and ii) pivot attention for multimodal interaction/fusion. |
Daoming Zong; Chaoyue Ding; Baoxiang Li; Jiakui Li; Ken Zheng; Qunyan Zhou; |
| 357 | Depth-aided Camouflaged Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by research on biology and evolution, we introduce depth information as an additional cue to help break camouflage, which can provide spatial information and texture-free separation for foreground and background. |
Qingwei Wang; Jinyu Yang; Xiaosheng Yu; Fangyi Wang; Peng Chen; Feng Zheng; |
| 358 | Control3D: Towards Controllable Text-to-3D Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing text-to-3D techniques lack a crucial ability in the creative process: interactively controlling and shaping the synthetic 3D contents according to users’ desired specifications (e.g., sketch). To alleviate this issue, we present the first attempt at text-to-3D generation conditioned on an additional hand-drawn sketch, namely Control3D, which enhances controllability for users. |
Yang Chen; Yingwei Pan; Yehao Li; Ting Yao; Tao Mei; |
| 359 | Layout Sequence Prediction From Noisy Mobile Modality Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Nevertheless, real-world situations often involve obstructed cameras, missed objects, or objects out of sight due to environmental factors, leading to incomplete or noisy trajectories. To overcome these limitations, we propose LTrajDiff, a novel approach that treats objects obstructed or out of sight as equally important as those with fully visible trajectories. |
Haichao Zhang; Yi Xu; Hongsheng Lu; Takayuki Shimizu; Yun Fu; |
| 360 | Zero-Shot Learning By Harnessing Adversarial Samples Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To take the advantage of image augmentations while mitigating the semantic distortion issue, we propose a novel ZSL approach by Harnessing Adversarial Samples (HAS). |
Zhi Chen; Pengfei Zhang; Jingjing Li; Sen Wang; Zi Huang; |
| 361 | Freq-HD: An Interpretable Frequency-based High-Dynamics Affective Clip Selection Method for In-the-Wild Facial Expression Recognition in Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To provide more expression-related clips for DFER models, we propose a novel and interpretable frequency-based method (Freq-HD) for high-dynamics affective clip selection. |
Zeng Tao; Yan Wang; Zhaoyu Chen; Boyang Wang; Shaoqi Yan; Kaixun Jiang; Shuyong Gao; Wenqiang Zhang; |
| 362 | StegaDDPM: Generative Image Steganography Based on Denoising Diffusion Probabilistic Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To ensure secure and dependable communication, we propose a novel generative image steganography based on the denoising diffusion probabilistic model, called StegaDDPM. |
Yinyin Peng; Donghui Hu; Yaofei Wang; Kejiang Chen; Gang Pei; Weiming Zhang; |
| 363 | Flexible and Secure Watermarking for Latent Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, post-hoc watermarking methods can be easily circumvented to obtain non-watermarked images, and the existing watermarking methods designed for LDMs can only embed a fixed message, i.e., the to-be-embedded message cannot be changed unless the model is retrained. Therefore, in this work, we propose an end-to-end watermarking method based on the encoder-decoder (ENDE) and message-matrix. |
Cheng Xiong; Chuan Qin; Guorui Feng; Xinpeng Zhang; |
| 364 | All-in-one Multi-degradation Image Restoration Network Via Hierarchical Degradation Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Previous works have employed contrastive learning to learn the degradation representation from observed images, but this often leads to representation drift caused by deficient positive and negative pairs. To address this issue, we propose a novel All-in-one Multi-degradation Image Restoration Network (AMIRNet) that can effectively capture and utilize accurate degradation representation for image restoration. |
Cheng Zhang; Yu Zhu; Qingsen Yan; Jinqiu Sun; Yanning Zhang; |
| 365 | What2comm: Towards Communication-efficient Collaborative Perception Via Feature Decoupling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite advancements in previous approaches, challenges remain due to redundant communication patterns and vulnerable collaboration processes. To address these issues, we propose What2comm, an end-to-end collaborative perception framework to achieve a trade-off between perception performance and communication bandwidth. |
Kun Yang; Dingkang Yang; Jingyu Zhang; Hanqi Wang; Peng Sun; Liang Song; |
| 366 | Making Users Indistinguishable: Attribute-wise Unlearning in Recommender Systems Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we focus on a strict but practical setting of AU, namely Post-Training Attribute Unlearning (PoT-AU), where unlearning can only be performed after the training of the recommendation model is completed. |
Yuyuan Li; Chaochao Chen; Xiaolin Zheng; Yizhao Zhang; Zhongxuan Han; Dan Meng; Jun Wang; |
| 367 | Cerebrovascular Segmentation in TOF-MRA with Topology Regularization Adversarial Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a topology regularization adversarial model for cerebrovascular segmentation in TOF-MRA images. |
Cheng Chen; Yunqing Chen; Shuang Song; Jianan Wang; Huansheng Ning; Ruoxiu Xiao; |
| 368 | RAIRNet: Region-Aware Identity Rectification for Face Forgery Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, there are still several non-negligible problems: (1) the generic identity extractor is trained entirely on real images, leading to enormous identity representation bias when processing forged content; (2) the identity information of a forged image is hybrid and regionally distributed, while a single global identity feature can hardly reflect this local identity inconsistency. To solve the above problems, in this paper a novel Region-Aware Identity Rectification Network (RAIRNet) is proposed to effectively rectify the identity bias and adaptively exploit the inconsistent local regions. |
Mingqi Fang; Lingyun Yu; Hongtao Xie; Junqiang Wu; Zezheng Wang; Jiahong Li; Yongdong Zhang; |
| 369 | Frequency-based Zero-Shot Learning with Phase Augmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Nevertheless, existing ZSL methods typically learn visual features directly from the RGB domain, which can impede the recognition of certain attributes. To overcome this limitation, we propose a novel ZSL method named Frequency-based Phase Augmentation (FPA) network, which learns an effective representation of the attributes in the frequency domain. |
Wanting Yin; Hongtao Xie; Lei Zhang; Jiannan Ge; Pandeng Li; Chuanbin Liu; Yongdong Zhang; |
| 370 | Edge-Assisted On-Device Model Update for Video Analytics in Adverse Environments Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose an edge-assisted framework that continuously updates the lightweight model deployed on the end cameras to achieve accurate predictions in adverse environments. |
Yuxin Kong; Peng Yang; Yan Cheng; |
| 371 | Learning from Easy to Hard Pairs: Multi-step Reasoning Network for Human-Object Interaction Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we argue that the model should prioritize hard samples after inferring easy ones, and hard samples can benefit from easy ones. |
Yuchen Zhou; Guang Tan; Mengtang Li; Chao Gou; |
| 372 | Slow-Fast Time Parameter Aggregation Network for Class-Incremental Lip Reading Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a benchmark for Class-Incremental Lip-Reading (CILR). |
Xueyi Zhang; Chengwei Zhang; Tao Wang; Jun Tang; Songyang Lao; Haizhou Li; |
| 373 | Mixture-of-Experts Learner for Single Long-Tailed Domain Generalization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Domain generalization (DG) refers to the task of training a model on multiple source domains and testing it on a different target domain with a different distribution. In this paper, we address a more challenging and realistic scenario known as Single Long-Tailed Domain Generalization, where only one source domain is available and the minority class in this domain has an abundance of instances in other domains. |
Mengzhu Wang; Jianlong Yuan; Zhibin Wang; |
| 374 | Enhancing Domain-Invariant Parts for Generalized Zero-Shot Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we will review the characteristics of attributes. |
Yang Zhang; Songhe Feng; |
| 375 | Blind Image Super-resolution with Rich Texture-Aware Codebook Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In detail, multiple HR images may produce similar LR versions due to complex blind degradations, causing HR-dependent-only codebooks to have limited texture diversity when faced with confusing LR inputs. To alleviate this problem, we propose the Rich Texture-aware Codebook-based Network (RTCNet), which consists of the Degradation-robust Texture Prior Module (DTPM) and the Patch-aware Texture Prior Module (PTPM). |
Rui Qin; Ming Sun; Fangyuan Zhang; Xing Wen; Bin Wang; |
| 376 | PromptMTopic: Unsupervised Multimodal Topic Modeling of Memes Using Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose PromptMTopic, a novel multimodal prompt-based model designed to learn topics from both text and visual modalities by leveraging the language modeling capabilities of large language models. |
Nirmalendu Prakash; Han Wang; Nguyen Khoi Hoang; Ming Shan Hee; Roy Ka-Wei Lee; |
| 377 | In-processing User Constrained Dominant Sets for User-Oriented Fairness in Recommender Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The existing research on UOF is limited and fails to deal with the root cause of the UOF issue: the learning process between advantaged and disadvantaged users is unfair. To tackle this issue, we propose an In-processing User Constrained Dominant Sets (In-UCDS) framework, which is a general framework that can be applied to any backbone recommendation model to achieve user-oriented fairness. |
Zhongxuan Han; Chaochao Chen; Xiaolin Zheng; Weiming Liu; Jun Wang; Wenjie Cheng; Yuyuan Li; |
| 378 | Patchmatch Stereo++: Patchmatch Binocular Stereo with Continuous Disparity Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Patchmatch Stereo++, inspired by the traditional Patchmatch Stereo to achieve better continuous disparity optimization in deep-learning-based methods. |
Wenjia Ren; Qingmin Liao; Zhijing Shao; Xiangru Lin; Xin Yue; Yu Zhang; Zongqing Lu; |
| 379 | ICMH-Net: Neural Image Compression Towards Both Machine Vision and Human Vision Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, our objective is to enhance image compression methods for both human vision quality and machine vision tasks simultaneously. |
Lei Liu; Zhihao Hu; Zhenghao Chen; Dong Xu; |
| 380 | Practical Deep Dispersed Watermarking with Synchronization and Fusion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Moreover, most works usually demonstrate robustness against typical non-geometric attacks (e.g., JPEG compression) but ignore common geometric attacks (e.g., Rotate) and more challenging combined attacks. To overcome the above limitations, we propose a practical deep Dispersed Watermarking with Synchronization and Fusion, called DWSF. |
Hengchang Guo; Qilong Zhang; Junwei Luo; Feng Guo; Wenbin Zhang; Xiaodong Su; Minglei Li; |
| 381 | Real-time Facial Animation for 3D Stylized Character with Emotion Dynamics Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present two real-time solutions which drive character expressions in a geometrically consistent and perceptually valid way. |
Ye Pan; Ruisi Zhang; Jingying Wang; Yu Ding; Kenny Mitchell; |
| 382 | Moby: Empowering 2D Models for Efficient Point Cloud Analytics on The Edge Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we present Moby, a novel system that demonstrates the feasibility and potential of our approach. |
Jingzong Li; Yik Hong Cai; Libin Liu; Yu Mao; Chun Jason Xue; Hong Xu; |
| 383 | TeViS: Translating Text Synopses to Video Storyboards Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a new task called Text synopsis to Video Storyboard (TeViS) which aims to retrieve an ordered sequence of images as the video storyboard to visualize the text synopsis. |
Xu Gu; Yuchong Sun; Feiyue Ni; Shizhe Chen; Xihua Wang; Ruihua Song; Boyuan Li; Xiang Cao; |
| 384 | Adversarial Training of Deep Neural Networks Guided By Texture and Structural Information Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although recent studies have shown that data augmentation can effectively reduce this gap, most methods heavily rely on generating large amounts of training data without considering which features are beneficial for model robustness, making them inefficient. To address the above issue, we propose a two-stage AT algorithm for image data that adopts different data augmentation strategies during the training process to improve model robustness. |
Zhaoxin Wang; Handing Wang; Cong Tian; Yaochu Jin; |
| 385 | Single Domain Generalization Via Unsupervised Diversity Probe Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel adversarial method, termed Unsupervised Diversity Probe (UDP), to synthesize novel and diverse samples in fully unsupervised settings. |
Kehua Guo; Rui Ding; Tian Qiu; Xiangyuan Zhu; Zheng Wu; Liwei Wang; Hui Fang; |
| 386 | Stroke-based Neural Painting and Stylization with Dynamically Predicted Painting Region Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To solve the problem, we propose Compositional Neural Painter, a novel stroke-based rendering framework which dynamically predicts the next painting region based on the current canvas, instead of dividing the image plane uniformly into painting regions. |
Teng Hu; Ran Yi; Haokun Zhu; Liang Liu; Jinlong Peng; Yabiao Wang; Chengjie Wang; Lizhuang Ma; |
| 387 | Collaborative Learning of Diverse Experts for Source-free Universal Domain Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Most existing methods are developed based on a single-expert target model for both known- and unknown-class data training, such that the known- and unknown-class data in the target domain may not be separated well from each other. To address this issue, we propose a novel Collaborative Learning of Diverse Experts (CoDE) method for SFUniDA. |
Meng Shen; Yanzuo Lu; Yanxu Hu; Andy J. Ma; |
| 388 | Localization-assisted Uncertainty Score Disentanglement Network for Action Quality Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a Localization-assisted Uncertainty Score Disentanglement Network (LUSD-Net) to handle the two predictions, PCS and TES. |
Yanli Ji; Lingfeng Ye; Huili Huang; Lijing Mao; Yang Zhou; Lingling Gao; |
| 389 | SimHMR: A Simple Query-based Framework for Parameterized Human Mesh Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a simple query-based framework, dubbed SimHMR, for parameterized human mesh reconstruction. |
Zihao Huang; Min Shi; Chengxin Liu; Ke Xian; Zhiguo Cao; |
| 390 | LandmarkGait: Intrinsic Human Parsing for Gait Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, its practical application in gait recognition is often hindered by missing RGB modality, lack of annotated body parts, and difficulty in balancing parsing quantity and quality. To address this issue, we propose LandmarkGait, an accessible and alternative parsing-based solution for gait recognition. |
Zengbin Wang; Saihui Hou; Man Zhang; Xu Liu; Chunshui Cao; Yongzhen Huang; Shibiao Xu; |
| 391 | FOLT: Fast Multiple Object Tracking from UAV-captured Videos Based on Optical Flow Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, MOT for the videos captured by unmanned aerial vehicles (UAV) is still challenging due to small object size, blurred object appearance, and very large and/or irregular motion in both ground objects and UAV platforms. In this paper, we propose FOLT to mitigate these problems and reach fast and accurate MOT in UAV view. |
Mufeng Yao; Jiaqi Wang; Jinlong Peng; Mingmin Chi; Chao Liu; |
| 392 | Localized and Balanced Efficient Incomplete Multi-view Clustering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most of the existing works are inapplicable to large-scale clustering tasks and their clustering results are unstable, since these methods have high computational complexities and their results are produced by k-means rather than their designed learning models. In this paper, we propose a new one-step incomplete multi-view clustering model, called Localized and Balanced Incomplete Multi-view Clustering (LBIMVC), to address these issues. |
Jie Wen; Gehui Xu; Chengliang Liu; Lunke Fei; Chao Huang; Wei Wang; Yong Xu; |
| 393 | When Masked Image Modeling Meets Source-free Unsupervised Domain Adaptation: Dual-Level Masked Network for Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Nevertheless, due to the domain bias, these methods tend to suffer from confusion between classes with a similar visual appearance in different domains. To address the above issue, we propose to enhance discriminability towards target samples with masked image modeling to model spatial context relations as additional recognition clues. |
Gang Li; Xianzheng Ma; Zhao Wang; Hao Li; Qifei Zhang; Chao Wu; |
| 394 | Feeling Present! From Physical to Virtual Cinematography Lighting Education with Metashadow Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Traditional teaching methods, combining basic lighting equipment operation with slide lectures, often yield unsatisfactory results, hindering students’ mastery of cinematography lighting techniques. Therefore, we propose Metashadow, a virtual reality (VR) cinematography lighting education system demonstrating the feasibility of learning in a virtual soundstage. |
Zheng Wei; Xian Xu; Lik-Hang Lee; Wai Tong; Huamin Qu; Pan Hui; |
| 395 | Physical Invisible Backdoor Based on Camera Imaging Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a novel physical invisible backdoor based on camera imaging without changing natural image pixels. |
Yusheng Guo; Nan Zhong; Zhenxing Qian; Xinpeng Zhang; |
| 396 | Little Strokes Fell Great Oaks: Boosting The Hierarchical Features for Multi-exposure Image Fusion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Additionally, unsupervised techniques predominantly employ rudimentary weighted summation for color channel processing, culminating in an overall desaturated final image tone. To partially mitigate these issues, this study proposes a gamma correction module specifically designed to fully leverage latent information embedded within source images. |
Pan Mu; Zhiying Du; Jinyuan Liu; Cong Bai; |
| 397 | A Generalized Physical-knowledge-guided Dynamic Model for Underwater Image Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, it is difficult to obtain high-quality paired training samples for a generalized model. To tackle these challenges, we design a Generalized Underwater image enhancement method via a Physical-knowledge-guided Dynamic Model (GUPDM for short). |
Pan Mu; Hanning Xu; Zheyuan Liu; Zheng Wang; Sixian Chan; Cong Bai; |
| 398 | Prior-Guided Accuracy-Bias Tradeoff Learning for CTR Prediction in Multimedia Recommendation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, in many business scenarios, it is usually possible to extract a subset of features associated with the biases by means of expert knowledge, i.e., the confounding proxy features. Therefore, in this paper, we propose a novel debiasing framework with confounding proxy priors for the accuracy-bias tradeoff learning in the multimedia recommendation, or CP2Rec for short, in which these confounding proxy features driven by the expert experience are integrated into the model as prior knowledge corresponding to the biases. |
Dugang Liu; Yang Qiao; Xing Tang; Liang Chen; Xiuqiang He; Zhong Ming; |
| 399 | Object Detection Difficulty: Suppressing Over-aggregation for Faster and Better Video Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose an image-level Object Detection Difficulty (ODD) metric to quantify the difficulty of detecting objects in a given image. |
Bingqing Zhang; Sen Wang; Yifan Liu; Brano Kusy; Xue Li; Jiajun Liu; |
| 400 | DeepSVC: Deep Scalable Video Coding for Both Machine and Human Vision Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a deep scalable video codec (DeepSVC) to support three-layer scalability from machine to human vision. |
Hongbin Lin; Bolin Chen; Zhichen Zhang; Jielian Lin; Xu Wang; Tiesong Zhao; |
| 401 | LocLoc: Low-level Cues and Local-area Guides for Weakly Supervised Object Localization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a unified framework that simultaneously improves localization and classification accuracy, termed as LocLoc (Low-level Cues and Local-area Guides). |
Xinzi Cao; Xiawu Zheng; Yunhang Shen; Ke Li; Jie Chen; Yutong Lu; Yonghong Tian; |
| 402 | Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Although these approaches have achieved significant performance, they suffer from the complex yet redundant multi-stream model designs, each of which is also limited to the fixed input skeleton modality. To alleviate these issues, in this paper, we propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL, which exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner. |
Shengkai Sun; Daizong Liu; Jianfeng Dong; Xiaoye Qu; Junyu Gao; Xun Yang; Xun Wang; Meng Wang; |
| 403 | Self-Supervised Cross-Language Scene Text Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose and formulate the task of cross-language scene text editing, modifying the text content of a scene image into new text in another language, while preserving the scene text style and background texture. |
Fuxiang Yang; Tonghua Su; Xiang Zhou; Donglin Di; Zhongjie Wang; Songze Li; |
| 404 | Data-Scarce Animal Face Alignment Via Bi-Directional Cross-Species Knowledge Transfer Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose Cross-Species Knowledge Transfer, Meta-CSKT, for animal face alignment, which consists of a base network and an adaptation network. |
Dan Zeng; Shanchuan Hong; Shuiwang Li; Qiaomu Shen; Bo Tang; |
| 405 | ControlStyle: Text-Driven Stylized Image Generation Using Diffusion Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a new task for stylizing text-to-image models, namely text-driven stylized image generation, that further enhances editability in content creation. |
Jingwen Chen; Yingwei Pan; Ting Yao; Tao Mei; |
| 406 | Clip Fusion with Bi-level Optimization for Human Mesh Reconstruction from Monocular Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To mine more temporal information from the video, we present a bi-level clip inference network for HMR, which leverages both local motion and global context explicitly for dense 3D reconstruction. |
Peng Wu; Xiankai Lu; Jianbing Shen; Yilong Yin; |
| 407 | Interactive Interior Design Recommendation Via Coarse-to-fine Multimodal Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Recent efforts in developing intelligent interior design systems have focused on generating textual requirement-based decoration designs while neglecting the problem of how to mine homeowner’s hidden preferences and choose the proper initial design. To fill this gap, we propose an Interactive Interior Design Recommendation System (IIDRS) based on reinforcement learning (RL). |
He Zhang; Ying Sun; Weiyu Guo; Yafei Liu; Haonan Lu; Xiaodong Lin; Hui Xiong; |
| 408 | High-order Complementarity Induced Fast Multi-View Clustering with Enhanced Tensor Rank Minimization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: 3) Tensor structure is rarely utilized for high-order complementarity investigation. In light of this, we propose High-order Complementarity Induced Fast Multi-View Clustering with Enhanced Tensor Rank Minimization (CFMVC-ETR). |
Jintian Ji; Songhe Feng; |
| 409 | FCBoost-Net: A Generative Network for Synthesizing Multiple Collocated Outfits Via Fashion Compatibility Boosting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this research, we present FCBoost-Net, a new framework for outfit generation that leverages the power of pre-trained generative models to produce multiple collocated and diversified outfits. |
Dongliang Zhou; Haijun Zhang; Jianghong Ma; Jicong Fan; Zhao Zhang; |
| 410 | Text-Only Training for Visual Storytelling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Most existing solutions predominantly depend on paired image-text training data, which can be costly to collect and challenging to scale. To address this, we formulate visual storytelling as a visual-conditioned story generation problem and propose a text-only training method that separates the learning of cross-modality alignment and story generation. |
Yuechen Wang; Wengang Zhou; Zhenbo Lu; Houqiang Li; |
| 411 | Dynamic Low-Rank Instance Adaptation for Universal Neural Image Compression Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We thus introduce a dynamic gating network on top of the low-rank adaptation method, in order to decide which decoder layer should employ adaptation. |
Yue Lv; Jinxi Xiang; Jun Zhang; Wenming Yang; Xiao Han; Wei Yang; |
| 412 | A Capture to Registration Framework for Realistic Image Super-Resolution in The Industry Environment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address the image distortion in building realistic LR-HR image pairs in the industry environment, we design a capture to registration framework. |
Boyang Wang; Yan Wang; Qing Zhao; Junxiong Lin; Zeng Tao; Pinxue Guo; Zhaoyu Chen; Kaixun Jiang; Shaoqi Yan; Shuyong Gao; Wenqiang Zhang; |
| 413 | SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To improve the capacities for narrative prompts, we propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models. |
Shanshan Zhong; Zhongzhan Huang; Weushao Wen; Jinghui Qin; Liang Lin; |
| 414 | JAVP: Joint-Aware Video Processing with Edge-Cloud Collaboration for DNN Inference Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a joint-aware video processing (JAVP) architecture for edge-cloud collaboration. |
Zheming Yang; Wen Ji; Qi Guo; Zhi Wang; |
| 415 | CoCa: A Connectivity-Aware Cascade Framework for Histology Gland Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Therefore, we provide a novel perspective for gland segmentation by incorporating gland connectivity information to locate critical errors within TCAs. We propose a Connectivity-Aware Cascade framework (CoCa) that explicitly encodes gland connectivity information into the network to locate all connectivity errors during training and then leverage attention operations to focus on these errors. |
Yu Bai; Bo Zhang; Zheng Zhang; Wu Liu; Jinwen Li; Xiangyang Gong; Wendong Wang; |
| 416 | Federated Deep Multi-View Clustering with Global Self-Supervision Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Second, the storage and usage of data from multiple clients in a distributed environment can lead to incompleteness of multi-view data. To address these challenges, we propose a novel federated deep multi-view clustering method that can mine complementary cluster structures from multiple clients, while dealing with data incompleteness and privacy concerns. |
Xinyue Chen; Jie Xu; Yazhou Ren; Xiaorong Pu; Ce Zhu; Xiaofeng Zhu; Zhifeng Hao; Lifang He; |
| 417 | Chain-of-Look Prompting for Verb-centric Surgical Triplet Recognition in Endoscopic Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In contrast, our method, called chain-of-look prompting, casts the problem of surgical triplet recognition as visual prompt generation from large-scale vision-language (VL) models, and explicitly decomposes the task into a series of video reasoning processes. |
Nan Xi; Jingjing Meng; Junsong Yuan; |
| 418 | Video Entailment Via Reaching A Structure-Aware Cross-modal Consensus Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: As human beings, we make sense of the world by synthesizing information from different sense perceptions, which can acquire consensus among multiple modalities to form a more thorough and coherent representation of the surroundings, as well as to perform complicated understanding tasks. In this paper, we attempt to recreate this ability to infer the truthfulness of a given statement in the context of video entailment. |
Xuan Yao; Junyu Gao; Mengyuan Chen; Changsheng Xu; |
| 419 | Selecting Learnable Training Samples Is All DETRs Need in Crowded Pedestrian Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To settle the issue, we propose a simple but effective sample selection method for DETRs, Sample Selection for Crowded Pedestrians (SSCP), which consists of the constraint-guided label assignment scheme (CGLA) and the utilizability-aware focal loss (UAFL). |
Feng Gao; Jiaxu Leng; Ji Gan; Xinbo Gao; |
| 420 | Automatic Network Architecture Search for RGB-D Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, due to limited human efforts and time costs, their performance might be inferior for complex scenarios. To address this issue, we propose the first Neural Architecture Search (NAS) method that designs the network automatically. |
Wenna Wang; Tao Zhuo; Xiuwei Zhang; Mingjun Sun; Hanlin Yin; Yinghui Xing; Yanning Zhang; |
| 421 | Reinforcement Learning-based Adversarial Attacks on Object Detectors Using Reward Shaping Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In the field of object detector attacks, previous methods primarily rely on fixed gradient optimization or patch-based cover techniques, often leading to suboptimal attack performance and excessive distortions. To address these limitations, we propose a novel attack method, Interactive Reinforcement-based Sparse Attack (IRSA), which employs Reinforcement Learning (RL) to discover the vulnerabilities of object detectors and systematically generate erroneous results. |
Zhenbo Shi; Wei Yang; Zhenbo Xu; Zhidong Yu; Liusheng Huang; |
| 422 | Emotion-Prior Awareness Network for Emotional Video Captioning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Besides, we develop a novel subordinate emotion masking mechanism between the catalog level and lexical level that facilitates coarse-to-fine emotion learning. |
Peipei Song; Dan Guo; Xun Yang; Shengeng Tang; Erkun Yang; Meng Wang; |
| 423 | Weakly-Supervised Text Instance Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we take the first attempt to perform weakly-supervised text instance segmentation through bridging text recognition and text segmentation. |
Xinyan Zu; Haiyang Yu; Bin Li; Xiangyang Xue; |
| 424 | External Knowledge Dynamic Modeling for Image-text Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, they lack flexibility due to the limitations of fixed information and empirical feedback. To address these issues, we develop an External Knowledge Dynamic Modeling (EKDM) architecture based on the filtering mechanism, which dynamically explores different knowledge towards varied image-text pairs. |
Song Yang; Qiang Li; Wenhui Li; Min Liu; Xuanya Li; Anan Liu; |
| 426 | Modality-agnostic Augmented Multi-Collaboration Representation for Semi-supervised Heterogenous Face Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel Modality-Agnostic Augmented Multi-Collaboration representation for Heterogeneous Face Recognition (MAMCO-HFR) in a semi-supervised manner. |
Decheng Liu; Weizhao Yang; Chunlei Peng; Nannan Wang; Ruimin Hu; Xinbo Gao; |
| 427 | Long Short-Term Graph Memory Against Class-imbalanced Over-smoothing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To alleviate the difficulty of specifying complicated relationships, this paper presents a novel perspective on GNNs, i.e., the representations of one node in different layers can be seen as a sequence of states. |
Liang Yang; Jiayi Wang; Tingting Zhang; Dongxiao He; Chuan Wang; Yuanfang Guo; Xiaochun Cao; Bingxin Niu; Zhen Wang; |
| 427 | HCSD-Net: Single Image Desnowing with Color Space Transformation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods usually mask the locations of noises and remove them in RGB color space. In this paper, we rethink this problem by investigating the impacts of color space selection. |
Ting Zhang; Nanfeng Jiang; Hongxin Wu; Keke Zhang; Yuzhen Niu; Tiesong Zhao; |
| 428 | CropCap: Embedding Visual Cross-Partition Dependency for Image Captioning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this case, some similar features are captured, which leads to feature redundancy that decreases the performance. To respond to this issue, this paper proposes a novel image captioner embedding visual cross-partition dependency, dubbed CropCap. |
Bo Wang; Zhao Zhang; Suiyi Zhao; Haijun Zhang; Richang Hong; Meng Wang; |
| 429 | Scene Text Segmentation with Text-Focused Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To explicitly incorporate text location information to guide text segmentation, we propose an end-to-end text-focused segmentation framework, where text detection and segmentation are jointly optimized. |
Haiyang Yu; Xiaocong Wang; Ke Niu; Bin Li; Xiangyang Xue; |
| 430 | Multi-view Graph Clustering Via Efficient Global-Local Spectral Embedding Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address the optimization involved in GLSEF, we present an efficient alternating optimization algorithm accompanied by convergence and time complexity analyses. |
Penglei Wang; Danyang Wu; Rong Wang; Feiping Nie; |
| 431 | Enhancing Sentence Representation with Visually-supervised Multimodal Pre-training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unfortunately, this approach neglects the complementary information provided by multimodal data, which can enhance the effectiveness of sentence representation. To address this issue, we propose a Visually-supervised Pre-trained Multimodal Model (ViP) for sentence representation. |
Zhe Li; Laurence T. Yang; Xin Nie; Bocheng Ren; Xianjun Deng; |
| 432 | Learning Profitable NFT Image Diffusions Via Multiple Visual-Policy Guided Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing works can fall short in generating visually-pleasing and highly-profitable NFT images, mainly due to the lack of 1) plentiful and fine-grained visual attribute prompts for an NFT image, and 2) effective optimization metrics for generating high-quality NFT images. To solve these challenges, we propose a Diffusion based generation framework with Multiple Visual-Policies as rewards (i.e., Diffusion-MVP) for NFT images. |
Huiguo He; Tianfu Wang; Huan Yang; Jianlong Fu; Nicholas Jing Yuan; Jian Yin; Hongyang Chao; Qi Zhang; |
| 433 | Multimodal Physiological Signals Fusion for Online Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel Online Multimodal HyperGraph Learning (OMHGL) method to fuse multimodal information for emotion recognition based on time-series physiological signals. |
Tongjie Pan; Yalan Ye; Hecheng Cai; Shudong Huang; Yang Yang; Guoqing Wang; |
| 434 | LocalPose: Object Pose Estimation with Local Geometry Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present LocalPose, a novel method for 9 DoF object pose estimation from object point clouds. |
Yang Xiao; Bo Duan; Mingwei Sun; Jingwei Huang; |
| 435 | Biased-Predicate Annotation Identification Via Unbiased Visual Predicate Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Obviously, it is essential for the PSG task to tackle this multi-modal contradiction. Therefore, we propose a novel method that utilizes unbiased visual predicate representations for Biased-Annotation Identification (BAI) as a fundamental step for PSG/SGG tasks. |
Li Li; Chenwei Wang; You Qin; Wei Ji; Renjie Liang; |
| 436 | OCSKB: An Object Component Sketch Knowledge Base for Fast 6D Pose Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a fast pipeline for sketch modeling with our tool. |
Guangming Shi; Xuyang Li; Xuemei Xie; Mingxuan Yu; Chengwei Rao; Jiakai Luo; |
| 437 | Object Part Parsing with Hierarchical Dual Transformer Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The current methods ignore the specific hierarchical structure of the object, which can be used as strong prior knowledge. To address this, we propose the Hierarchical Dual Transformer (HDTR) to explore the contribution of the typical structural priors of the object parts. |
Jiamin Chen; Jianlou Si; Naihao Liu; Yao Wu; Li Niu; Chen Qian; |
| 438 | Kernel Dimension Matters: To Activate Available Kernels for Real-time Video Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Thus, this paper proposes a kernel-split strategy to activate available kernels for real-time inference. |
Shuo Jin; Meiqin Liu; Chao Yao; Chunyu Lin; Yao Zhao; |
| 439 | End-to-end XY Separation for Single Image Blind Deblurring Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the one-encoder-one-decoder and the recently proposed one-encoder-two-decoder structures of basic units both fail to comprehensively take advantage of the directional separability of 2D deblurring, which increases the learning content of networks, thus leading to performance degradation. To thoroughly decouple the deblurring into two spatially orthogonal parts, we propose a novel substitution for U-net and its variant, called XYU-net. |
Liuhan Chen; Yirou Wang; Yongyong Chen; |
| 440 | Ada3Diff: Defending Against 3D Adversarial Point Clouds Via Adaptive Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To remedy it, this paper introduces a novel distortion-aware defense framework that can rebuild the pristine data distribution with a tailored intensity estimator and a diffusion model. |
Kui Zhang; Hang Zhou; Jie Zhang; Qidong Huang; Weiming Zhang; Nenghai Yu; |
| 441 | Pseudo Object Replay and Mining for Incremental Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a pseudo object replay and mining method (PseudoRM) to handle the co-occurrence dependent problem, reducing the performance degradation caused by the absence of old-class objects. |
Dongbao Yang; Yu Zhou; Xiaopeng Hong; Aoting Zhang; Xin Wei; Linchengxi Zeng; Zhi Qiao; Weipinng Wang; |
| 442 | A Unified Query-based Paradigm for Camouflaged Instance Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end, inspired by query-based transformers, we propose a unified query-based multi-task learning framework for camouflaged instance segmentation, termed UQFormer, which builds a set of mask queries and a set of boundary queries to learn a shared composed query representation and efficiently integrates global camouflaged object region and boundary cues, for simultaneous instance segmentation and instance boundary detection in camouflaged scenarios. |
Bo Dong; Jialun Pei; Rongrong Gao; Tian-Zhu Xiang; Shuo Wang; Huan Xiong; |
| 443 | LHAct: Rectifying Extremely Low and High Activations for Out-of-Distribution Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose an approach called Rectifying Extremely Low and High Activations (LHAct). |
Yue Yuan; Rundong He; Zhongyi Han; Yilong Yin; |
| 444 | PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specially, we introduce a new speech augmentation algorithm for robust prosody extraction. |
Yimin Deng; Huaizhen Tang; Xulong Zhang; Jianzong Wang; Ning Cheng; Jing Xiao; |
| 445 | Patch-Aware Representation Learning for Facial Expression Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods for facial expression recognition (FER) lack the utilization of prior facial knowledge, primarily focusing on expression-related regions while disregarding explicitly processing expression-independent information. This paper proposes a patch-aware FER method that incorporates facial keypoints to guide the model and learns precise representations through two collaborative streams, addressing these issues. |
Yi Wu; Shangfei Wang; Yanan Chang; |
| 446 | Partial Annotation-based Video Moment Retrieval Via Iterative Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Therefore, we design a new setting of VMR where users can easily point to small segments of non-controversy video moments and our proposed method can automatically fill in the remaining parts based on the video and query semantics. To support this, we propose a new framework named Video Moment Retrieval via Iterative Learning (VMRIL). |
Wei Ji; Renjie Liang; Lizi Liao; Hao Fei; Fuli Feng; |
| 447 | Rethinking Voice-Face Correlation: A Geometry View Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we aim to investigate the capability of reconstructing the 3D facial shape from voice from a geometry perspective without any semantic information. |
Xiang Li; Yandong Wen; Muqiao Yang; Jinglu Wang; Rita Singh; Bhiksha Raj; |
| 448 | Temporal Sentence Grounding in Streaming Videos Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Thus, TSGSV is challenging since it requires the model to infer without future frames and process long historical frames effectively, which is untouched in the early methods. To specifically address the above challenges, we propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames and reinforces the frames that are relevant to the query. |
Tian Gan; Xiao Wang; Yan Sun; Jianlong Wu; Qingpei Guo; Liqiang Nie; |
| 449 | Transferring CLIP’s Knowledge Into Zero-Shot Point Cloud Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we focus on zero-shot point cloud semantic segmentation and propose a simple yet effective baseline to transfer the visual-linguistic knowledge implied in CLIP to point cloud encoder at both feature and output levels. |
Yuanbin Wang; Shaofei Huang; Yulu Gao; Zhen Wang; Rui Wang; Kehua Sheng; Bo Zhang; Si Liu; |
| 450 | Read Ten Lines at One Glance: Line-Aware Semi-Autoregressive Transformer for Multi-Line Handwritten Mathematical Expression Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although existing methods have achieved promising performance on publicly available datasets, they still struggle to recognize multi-line mathematical expressions (MEs), suffering from complex structures and slow inference speed. To address these issues, we propose a Line-Aware Semi-autoregressive Transformer (LAST) that treats multi-line mathematical expression sequences as two-dimensional dual-end structures. |
Wentao Yang; Zhe Li; Dezhi Peng; Lianwen Jin; Mengchao He; Cong Yao; |
| 451 | Reconnecting The Broken Civilization: Patchwork Integration of Fragments from Ancient Manuscripts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The traditional process of reconstructing these fragments is an arduous task, demanding exhaustive manual intervention and a global collaboration among archaeologists. This paper presents a transformative approach to this challenge, harnessing multi-media techniques to restore the connectable fragments of the invaluable Dunhuang scrolls. |
Yuqing Zhang; Zhou Fang; Xinyu Yang; Shengyu Zhang; Baoyi He; Huaiyong Dou; Junchi Yan; Yongquan Zhang; Fei Wu; |
| 452 | Orthogonal Temporal Interpolation for Zero-Shot Video Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose a model called OTI for ZSVR by employing orthogonal temporal interpolation and the matching loss based on VLMs. |
Yan Zhu; Junbao Zhuo; Bin Ma; Jiajia Geng; Xiaoming Wei; Xiaolin Wei; Shuhui Wang; |
| 453 | Rethinking Neighborhood Consistency Learning on Unsupervised Domain Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although neighborhood-based PL can help preserve the local structure, it also risks assigning the whole local neighborhood to the wrong semantic category. To address this issue, we propose a novel framework called neighborhood consistency learning (NCL) that operates at both the semantic and instance levels and features a new consistency objective function. |
Chang Liu; Lichen Wang; Yun Fu; |
| 455 | Fearless Luminance Adaptation: A Macro-Micro-Hierarchical Transformer for Exposure Correction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, the inherent limitations of convolutions hinder the model's ability to restore faithful color or details in extremely over-/under-exposed regions. To overcome these limitations, we propose a Macro-Micro-Hierarchical transformer, which consists of a macro attention to capture long-range dependencies, a micro attention to extract local features, and a hierarchical structure for coarse-to-fine correction. |
Gehui Li; Jinyuan Liu; Long Ma; Zhiying Jiang; Xin Fan; Risheng Liu; |
| 455 | Cal-SFDA: Source-Free Domain-adaptive Semantic Segmentation with Differentiable Expected Calibration Error Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a novel calibration-guided source-free domain adaptive semantic segmentation (Cal-SFDA) framework. |
Zixin Wang; Yadan Luo; Zhi Chen; Sen Wang; Zi Huang; |
| 456 | Enhancing Visually-Rich Document Understanding Via Layout Structure Modeling Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose GraphLayoutLM, a novel document understanding model that leverages the modeling of layout structure graph to inject document layout knowledge into the model. |
Qiwei Li; Zuchao Li; Xiantao Cai; Bo Du; Hai Zhao; |
| 457 | Toward Zero-shot Character Recognition: A Gold Standard Dataset with Radical-level Annotations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To increase the adaptability of ACCID, we propose a splicing-based synthetic character algorithm to augment the training samples and apply an image denoising method to improve the image quality. |
Xiaolei Diao; Daqian Shi; Jian Li; Lida Shi; Mingzhe Yue; Ruihua Qi; Chuntao Li; Hao Xu; |
| 459 | Bidomain Modeling Paradigm for Pansharpening Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel bidomain modeling paradigm for the pansharpening problem (dubbed BiMPan), which takes into account both local spectral specificity and global spatial detail. |
Junming Hou; Qi Cao; Ran Ran; Che Liu; Junling Li; Liang-jian Deng; |
| 459 | Attentive Alignment Network for Multispectral Pedestrian Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the misalignment between different modalities in spatial dimension and modality reliability would introduce harmful information during feature fusion, limiting the performance of multispectral pedestrian detection. To address the above issues, we propose an attentive alignment network, consisting of an attentive position alignment (APA) module and an attentive modality alignment (AMA) module. |
Nuo Chen; Jin Xie; Jing Nie; Jiale Cao; Zhuang Shao; Yanwei Pang; |
| 460 | Zero-shot Micro-video Classification with Neural Variational Inference in Graph Prototype Network Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a zero-shot micro-video classification model (NVIGPN) by exploiting the hidden topics behind items to guide the representation learning in user-item interactions. |
Junyang Chen; Jialong Wang; Zhijiang Dai; Huisi Wu; Mengzhu Wang; Qin Zhang; Huan Wang; |
| 462 | Learning Style-Invariant Robust Representation for Generalizable Visual Instance Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this case, the limited style variance in training data may cause the model to learn an incorrect reliance on superficial style features and reduce the generalizability of the model. To address this issue, we propose a novel Style-Invariant robust Representation Learning (SIRL) method for this challenging task, which mainly aims to first diversify the training data with style augmentation, and then enforce the model to learn style-invariant features. |
Tianyu Chang; Xun Yang; Xin Luo; Wei Ji; Meng Wang; |
| 462 | Understanding User Behavior in Volumetric Video Watching: Dataset, Analysis and Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We conduct an in-depth analysis to understand user behaviors when viewing volumetric videos. |
Kaiyuan Hu; Haowen Yang; Yili Jin; Junhua Liu; Yongting Chen; Miao Zhang; Fangxin Wang; |
| 463 | Co-Salient Object Detection with Semantic-Level Consensus Extraction and Dispersion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The proposed method is evaluated on three commonly used CoSOD datasets and achieves state-of-the-art performance. |
Peiran Xu; Yadong Mu; |
| 464 | Debunking Free Fusion Myth: Online Multi-view Anomaly Detection with Disentangled Product-of-Experts Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most of the existing methods 1) are only suitable for two views or type-specific anomalies, 2) suffer from the issue of fusion disentanglement, and 3) do not support online detection after model deployment. To address these challenges, our main ideas in this paper are three-fold: multi-view learning, disentangled representation learning, and generative model. |
Hao Wang; Zhi-Qi Cheng; Jingdong Sun; Xin Yang; Xiao Wu; Hongyang Chen; Yan Yang; |
| 465 | Hierarchical Semantic Enhancement Network for Multimodal Fake News Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel Hierarchical Semantic Enhancement Network (HSEN) for multimodal fake news detection by learning text-related image semantic and precise news high-order knowledge semantic information. |
Qiang Zhang; Jiawei Liu; Fanrui Zhang; Jingyi Xie; Zheng-Jun Zha; |
| 466 | Personalized Single Image Reflection Removal Network Through Adaptive Cascade Refinement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we aim to restore a reflection-free image from a single reflection-contaminated image captured through the glass. |
Mengyi Wang; Xinxin Zhang; Yongshun Gong; Yilong Yin; |
| 467 | A Multitask Framework for Graffiti-to-Image Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, due to the large number of unknown areas in the graffiti, the generated results may be blurred, resulting in poor visual effects. To address these challenges, this paper proposes a multi-task framework that can predict unknown regions by learning semantic mask from graffiti, thereby improving the quality of generated real scene images. |
Ying Yang; Mulin Chen; Xuelong Li; |
| 468 | BMI-Net: A Brain-inspired Multimodal Interaction Network for Image Aesthetic Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose a brain-inspired multimodal interaction network (BMI-Net) that simulates how the association area of the cerebral cortex processes sensory stimuli. |
Xixi Nie; Bo Hu; Xinbo Gao; Leida Li; Xiaodan Zhang; Bin Xiao; |
| 469 | Enhancing Multi-modal Multi-hop Question Answering Via Structured Knowledge and Unified Retrieval-Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, the pipelined approaches of retrieval and generation might result in poor generation performance when retrieval performance is low. To address these issues, we propose a Structured Knowledge and Unified Retrieval-Generation (SKURG) approach. |
Qian Yang; Qian Chen; Wen Wang; Baotian Hu; Min Zhang; |
| 471 | Cross-Modal and Multi-Attribute Face Recognition: A Benchmark Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Based on modal information removal, we propose a NIR-VIS cross-modal face recognition model. |
Feng Lin; Kaiqiang fu; Hao Luo; Ziyue Zhan; Zhibo Wang; Zhenguang Liu; Lorenzo Cavallaro; Kui Ren; |
| 471 | FashionDiff: A Controllable Diffusion Model Using Pairwise Fashion Elements for Intelligent Design Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite recent advances in intelligence-driven fashion design, the complexity of the diverse elements of a fashion item, such as its texture, color and shape, which are associated with the semantic information conveyed, continues to present challenges in terms of generating high-quality fashion images as well as achieving a controllable editing process. To address this issue, we propose a unified framework, FashionDiff, that leverages the diverse elements in fashion items to generate new items. |
Han Yan; Haijun Zhang; Xiangyu Mu; Jicong Fan; Zhao Zhang; |
| 472 | InspirNET: An Unsupervised Generative Adversarial Network with Controllable Fine-grained Texture Disentanglement for Fashion Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To accomplish fine-grained texture disentanglement, we propose InspirNET, an unsupervised disentangled generative adversarial framework, that manipulates textures in a fine-grained latent space so as to produce new textures effectively, aiming to broaden the range of fashion options available to common users with distinct textures as well as boosting designers’ potential for fashion innovation and inspiration. |
Han Yan; Haijun Zhang; Jie Hou; Jicong Fan; Zhao Zhang; |
| 473 | Towards Visual Taxonomy Expansion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose Visual Taxonomy Expansion (VTE), introducing visual features into the taxonomy expansion task. |
Tinghui Zhu; Jingping Liu; Jiaqing Liang; Haiyun Jiang; Yanghua Xiao; Zongyu Wang; Rui Xie; Yunsen Xian; |
| 474 | Designing Loving-Kindness Meditation in Virtual Reality for Long-Distance Romantic Relationships Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper organized a series of workshops with couples to build a prototype of a couple-preferred LKM app. Through analysis of participants’ design works and semi-structured interviews, we derived design considerations for such VR apps and created a prototype for couples to experience. |
Xian Wang; Xiaoyu Mo; Lik-Hang Lee; Xiaoying Wei; Xiaofu Jin; Mingming Fan; Pan Hui; |
| 476 | FastLLVE: Real-Time Low-Light Video Enhancement with Intensity-Aware Look-Up Table Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, 3D Convolutional Neural Network (CNN)-based methods, which are designed for video to maintain inter-frame consistency, are computationally expensive, making them impractical for real-time applications. To address these issues, we propose an efficient pipeline named FastLLVE that leverages the Look-Up-Table (LUT) technique to maintain inter-frame brightness consistency effectively. |
Wenhao Li; Guangyang Wu; Wenyi Wang; Peiran Ren; Xiaohong Liu; |
| 476 | TIRDet: Mono-Modality Thermal InfraRed Object Detection Based on Prior Thermal-To-Visible Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel neural network called TIRDet, which only utilizes Thermal InfraRed (TIR) images for mono-modality object detection. |
Zeyu Wang; Fabien Colonnier; Jinghong Zheng; Jyotibdha Acharya; Wenyu Jiang; Kejie Huang; |
| 477 | Factorized Omnidirectional Representation Based Vision GNN for Anisotropic 3D Multimodal MR Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address the anisotropy issue, we propose FOrViG, an asymmetric vision graph neural network (GNN) framework that captures the correlation between different slices by constructing a graph for multi-slice images and aggregating information from adjacent nodes. |
Bo Zhang; YunPeng Tan; Zheng Zhang; Wu Liu; Hui Gao; Zhijun Xi; Wendong Wang; |
| 478 | Dynamic Compositional Graph Convolutional Network for Efficient Composite Human Motion Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Observing that atomic actions can happen at the same time and thus formulating the composite actions, we propose the composite human motion prediction task. |
Wanying Zhang; Shen Zhao; Fanyang Meng; Songtao Wu; Mengyuan Liu; |
| 479 | Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We thereby present a new Triplet Partial Margin Contrastive Learning (TPM-CL) module to construct partial order triplet samples by automatically generating fine-grained hard negatives for matched text-video pairs. |
Chen Jiang; Hong Liu; Xuzheng Yu; Qing Wang; Yuan Cheng; Jia Xu; Zhongyi Liu; Qingpei Guo; Wei Chu; Ming Yang; Yuan Qi; |
| 480 | Interactive Image Style Transfer Guided By Graffiti Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: At the same time, the style distribution of the generated stylized image texture differs from the real artwork. In this paper, we propose an interactive image style transfer network (IIST-Net) to overcome the above limitations. |
Quan Wang; Yanli Ren; Xinpeng Zhang; Guorui Feng; |
| 481 | Rethinking Neural Style Transfer: Generating Personalized and Watermarked Stylized Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a personalized and watermark-guided style transfer network (PWST-Net) to tackle the aforementioned issues. |
Quan Wang; Sheng Li; Xinpeng Zhang; Guorui Feng; |
| 483 | Moiré Backdoor Attack (MBA): A Novel Trigger for Pedestrian Detectors in The Physical World Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we specifically focus on the safety-critical task of pedestrian detection and propose a novel backdoor trigger by exploiting the Moiré effect. |
Hui Wei; Hanxun Yu; Kewei Zhang; Zhixiang Wang; Jianke Zhu; Zheng Wang; |
| 483 | Breaking The Barrier Between Pre-training and Fine-tuning: A Hybrid Prompting Model for Knowledge-Based VQA Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, because the targets of the pre-training and fine-tuning stages are different, there is an evident barrier that prevents the cross-modal comprehension ability developed in the pre-training stage from fully endowing the fine-tuning task. To break this barrier, in this paper, we propose a novel hybrid prompting model for knowledge-based VQA, which inherits and incorporates the pre-training and fine-tuning tasks with a shared objective. |
Zhongfan Sun; Yongli Hu; Qingqing Gao; Huajie Jiang; Junbin Gao; Yanfeng Sun; Baocai Yin; |
| 485 | Multi-label Emotion Analysis in Conversation Via Multimodal Knowledge Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present Self-supervised Multi-Label Peer Collaborative Distillation (SeMuL-PCD) Learning via an efficient Multimodal Transformer Network, in which complementary feedback from multiple mode-specific peer networks (e.g., transcript, audio, visual) is distilled into a single mode-ensembled fusion network for estimating multiple emotions simultaneously. |
Sidharth Anand; Naresh Kumar Devulapally; Sreyasee Das Bhattacharjee; Junsong Yuan; |
| 485 | Spatial-angular Quality-aware Representation Learning for Blind Light Field Image Quality Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel BLFIQA method using spatial-angular quality-aware representation learning in a self-supervised learning manner. |
Jianjun Xiang; Yuanjie Dang; Peng Chen; Ronghua Liang; Ruohong Huan; Zhengyu Zhang; |
| 486 | MTSN: Multiscale Temporal Similarity Network for Temporal Action Localization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose Multiscale Temporal Similarity Network (MTSN), a novel one-stage method for TAL, which mainly benefits from dynamic complementary modeling and temporal similarity decoding. |
Xiaodong Jin; Taiping Zhang; |
| 487 | Disentangle Propagation and Restoration for Efficient Video Recovery Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose the first framework for accelerating video recovery, which aims to efficiently recover high-quality videos from degraded inputs affected by various deteriorative factors. |
Cong Huang; Jiahao Li; Lei Chu; Dong Liu; Yan Lu; |
| 488 | Modal-aware Bias Constrained Contrastive Learning for Multimodal Recommendation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This random method is likely to lose important information and introduce new noise, resulting in biased augmentation data. Therefore, we propose a Modal-aware Bias Constrained Contrastive Learning method (BCCL) to solve the above problems. |
Wei Yang; Zhengru Fang; Tianle Zhang; Shiguang Wu; Chi Lu; |
| 489 | Towards Real-Time Neural Video Codec for Cross-Platform Application Using Calibration Information Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a real-time cross-platform neural video codec, which is capable of efficiently decoding 720P video bitstreams (25 FPS) from other encoding platforms on a consumer-grade GPU (e.g., NVIDIA RTX 2080). |
Kuan Tian; Yonghang Guan; Jinxi Xiang; Jun Zhang; Xiao Han; Wei Yang; |
| 490 | Recurrent Self-Supervised Video Denoising with Denser Receptive Field Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Self-supervised video denoising has seen decent progress through the use of blind spot networks. However, under their blind spot constraints, previous self-supervised video … |
Zichun Wang; Yulun Zhang; Debing Zhang; Ying Fu; |
| 491 | Multispectral Object Detection Via Cross-Modal Conflict-Aware Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite substantial advancements in this domain, current methodologies primarily rely on rudimentary accumulation operations to combine complementary information from disparate modalities, overlooking the semantic conflicts that arise from the intrinsic heterogeneity among modalities. To address this issue, we propose a novel learning network, the Cross-modal Conflict-Aware Learning Network (CALNet), that takes into account semantic conflicts and complementary information within multi-modal input. |
Xiao He; Chang Tang; Xin Zou; Wei Zhang; |
| 492 | Unfolding Once Is Enough: A Deployment-Friendly Transformer Unit for Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Therefore, they inevitably contain some redundant operators, posing challenges for subsequent deployment in real-world applications. In this paper, we propose a deployment-friendly transformer unit, namely UFONE (i.e., UnFolding ONce is Enough), to alleviate these problems. |
Yong Liu; Hang Dong; Boyang Liang; Songwei Liu; Qingji Dong; Kai Chen; Fangmin Chen; Lean Fu; Fei Wang; |
| 493 | Draw2Edit: Mask-Free Sketch-Guided Image Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Such approaches, however, present a limitation: the mask can cause loss of essential semantic information, compelling the model to perform restoration rather than editing the image. To address this challenge, we propose a novel mask-free image modification method, named Draw2Edit, which enables direct drawing of sketches and editing of images without pixel-level masks, simplifying the editing process. |
Yiwen Xu; Ruoyu Guo; Maurice Pagnucco; Yang Song; |
| 494 | Sensing Micro-Motion Human Patterns Using Multimodal MmRadar and Video Signal for Affective and Psychological Intelligence Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce the Remote Multimodal Affective and Psychological (ReMAP) dataset and, for the first time, apply head micro-tremor (HMT) signals for affective and psychological perception. |
Yiwei Ru; Peipei Li; Muyi Sun; Yunlong Wang; Kunbo Zhang; Qi Li; Zhaofeng He; Zhenan Sun; |
| 495 | Orthogonal Uncertainty Representation of Data Manifold for Robust Long-Tailed Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose an Orthogonal Uncertainty Representation (OUR) of feature embedding and an end-to-end training strategy to improve the long-tail phenomenon of model robustness. |
Yanbiao Ma; Licheng Jiao; Fang Liu; Shuyuan Yang; Xu Liu; Lingling Li; |
| 496 | Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this study, an Induction Network is proposed to bridge the modality gap more effectively. |
Tianyu Liu; Peng Zhang; Wei Huang; Yufei Zha; Tao You; Yanning Zhang; |
| 497 | A Closer Look at Classifier in Adversarial Domain Generalization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: For example, generative adversarial networks have been widely used, but suffer from the problem of low intra-class diversity, which can lead to poor generalization ability. To address this issue, we propose a new method called auxiliary classifier in adversarial domain generalization (CloCls). |
Ye Wang; Junyang Chen; Mengzhu Wang; Hao Li; Wei Wang; Houcheng Su; Zhihui Lai; Wei Wang; Zhenghan Chen; |
| 498 | Automatic Generation of Commercial Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a system that automatically synthesizes market-like commercial scenes in virtual environments. |
Shao-Kui Zhang; Jia-Hong Liu; Yike Li; Tianyi Xiong; Ke-Xin Ren; Hongbo Fu; Song-Hai Zhang; |
| 499 | Karma: Adaptive Video Streaming Via Causal Sequence Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We evaluate Karma through trace-driven simulations and real-world field tests, demonstrating superior performance compared to existing state-of-the-art ABR algorithms, with an average QoE improvement ranging from 10.8% to 18.7% across diverse network conditions. |
Bowei Xu; Hao Chen; Zhan Ma; |
| 500 | Event-Enhanced Multi-Modal Spiking Neural Network for Dynamic Obstacle Avoidance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we approach robust dynamic obstacle avoidance twofold. |
Yang Wang; Bo Dong; Yuji Zhang; Yunduo Zhou; Haiyang Mei; Ziqi Wei; Xin Yang; |
This table only includes 500 papers selected based on our selection algorithm. To continue with the full list, please visit Paper Digest: MM-2023 (Full List).