Semi-supervised Acoustic Scene Classification under Domain Shift

Challenge Description:

Acoustic scene classification (ASC) is a crucial research problem in computational audition that aims to recognize the unique acoustic characteristics of an environment. One of the main challenges of the ASC task is domain shift caused by a distribution gap between training and testing data. Although substantial progress has been made on device generalization in recent years, domain shift between different regions, involving characteristics such as time, space, culture, and language, remains insufficiently explored. This challenge emphasizes the need for ASC models to perform robustly under domain-shifted conditions. In addition, given the abundance of unlabeled acoustic scene data in the real world, we encourage participants to study ways of exploiting these unlabeled data, in particular by innovating with semi-supervised learning techniques that make more effective use of abundant real-world data.

Challenge Website:


Jisheng Bai

Northwestern Polytechnical University, China

Jianfeng Chen

Northwestern Polytechnical University, China

Bin Xiang

Xi’an Lianfeng Acoustic Technologies Co., Ltd., China

Mou Wang

Institute of Acoustics, Chinese Academy of Sciences, China

Haohe Liu

University of Surrey, UK

Mark D. Plumbley

University of Surrey, UK

Woon-Seng Gan

Nanyang Technological University, Singapore

Susanto Rahardja

Northwestern Polytechnical University, China

Chat-scenario Chinese Lipreading (ChatCLR) Challenge

Challenge Description:

People acquire information through auditory (e.g., voice) and visual (e.g., lip movements) cues to understand spoken content. In poor acoustic conditions, the audio may be drowned out by noise, making the content difficult to recover. Lipreading infers spoken content from lip movements and sits at the intersection of computer vision and natural language processing. Existing lipreading research concentrates primarily on English, so Chinese lipreading deserves increased attention. Chinese lipreading is especially challenging because of the large number of Chinese characters and their complex mapping to lip movements. The lack of large-scale Chinese lipreading datasets further constrains research progress: existing datasets mainly feature professional announcers or carefully prepared topics, limiting practical applicability. In contrast, our competition is based on videos recorded in a real-home scenario with 2-6 speakers chatting in a relaxed and unscripted manner.

Task 1: Wake Word Lipreading

Task 2: Target Speaker Lipreading

Challenge Website:


Jun Du

University of Science and Technology of China

Chin-Hui Lee

Georgia Institute of Technology

Sabato Marco Siniscalchi

Kore University of Enna

Low-power Efficient and Accurate Facial-Landmark Detection for Embedded Systems

Challenge Description:

In the field of computer vision, facial-landmark detection is crucial for applications such as augmented reality and facial recognition. This competition invites participants to develop lightweight, efficient deep learning models for accurate facial-landmark detection across diverse expressions and lighting conditions. The model should deliver low-power, real-time performance on embedded systems, especially MediaTek’s Dimensity Series platforms. Participants will use their models to identify 51 specific facial landmarks in a test dataset and submit the results as TXT files. Accuracy is assessed by the mean landmark error, normalized by the inter-ocular distance.
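The evaluation metric described above — mean landmark error normalized by inter-ocular distance, commonly called NME — can be sketched as follows. The eye-landmark indices (19 and 28) used to measure the inter-ocular distance are hypothetical, since the challenge's 51-point indexing scheme is not specified here.

```python
import numpy as np

def normalized_mean_error(pred, gt, left_eye_idx, right_eye_idx):
    """Mean per-landmark Euclidean error, normalized by inter-ocular distance.

    pred, gt: (N, 2) arrays of predicted and ground-truth (x, y) landmarks.
    left_eye_idx, right_eye_idx: indices of the two landmarks whose distance
    defines the inter-ocular normalizer (dataset-specific; assumed here).
    """
    per_point_err = np.linalg.norm(pred - gt, axis=1)        # (N,) errors
    inter_ocular = np.linalg.norm(gt[left_eye_idx] - gt[right_eye_idx])
    return per_point_err.mean() / inter_ocular

# Toy check: shifting every point by 0.1 px, with an inter-ocular
# distance of 1 px, yields an NME of approximately 0.1.
gt = np.zeros((51, 2))
gt[28] = [1.0, 0.0]                      # hypothetical right-eye landmark
pred = gt + np.array([0.1, 0.0])         # uniform horizontal shift
print(normalized_mean_error(pred, gt, 19, 28))
```

Normalizing by the inter-ocular distance makes the score invariant to face scale, so predictions on large and small faces are compared fairly.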

Challenge Website:


Po-Chi Hu

Pervasive Artificial Intelligence Research (PAIR) Labs 

Jiun-In Guo

Intelligent Vision System (IVS) Lab, National Yang Ming Chiao Tung University (NYCU), Taiwan

Marvin Chen


Hsien-Kai Kuo


Chia-Chi Tsai

AI System (AIS) Lab, National Cheng Kung University (NCKU), Taiwan

Multi-Modal Video Reasoning and Analyzing Competition (MMVRAC)

Challenge Description:

We humans process with great ease the enormous amount of multi-modal, multimedia information we encounter in our daily lives, including visuals, sounds, texts, and interactions with our surroundings.

For machines to assist us in the holistic understanding and analysis of events, or even to approach human-level intelligence (e.g., Artificial General Intelligence (AGI)), they need to process visual information from real-world videos, alongside complementary audio and textual data, about events, scenes, objects, actions, and interactions.

Hence, through this Grand Challenge we hope to further advance multi-modal video reasoning and analysis for different scenarios and real-world applications, using various challenging multi-modal datasets covering a range of computer vision tasks (i.e., video grounding, spatiotemporal event grounding, video question answering, sound source localization, person re-identification, attribute recognition, pose estimation, skeleton-based action recognition, spatiotemporal action localization, behavioral graph analysis, and animal pose estimation and action recognition).

Challenge Website:


Jun Liu

Singapore University of Technology and Design

Bingquan Shen

DSO National Laboratories and National University of Singapore

Ping Hu

Boston University

Kian Eng Ong

Singapore University of Technology and Design

Duo Peng

Singapore University of Technology and Design

Haoxuan Qu

Singapore University of Technology and Design

Xun Long Ng

Singapore University of Technology and Design

IEEE ICME 2024 is inviting proposals for its Grand Challenge (GC) program. Please see the following guidelines for this year’s grand challenge submission:

  1. GC organizers cannot participate as competitors in the GC they are organizing.
  2. GC organizers should provide a dataset that is open for the competition, in particular: i) a training set with ground truth that is available to all participants, and ii) a testing set that is also available to participants, but whose ground truth is hidden. GC organizers are responsible for evaluating all submitted models on the testing set (plus, optionally, a second testing set that is hidden from the participants).
  3. GC organizers should coordinate reviews of the GC papers that describe submitted models. Submitted GC papers may omit the results section, which will be completed after GC organizers have finished their evaluations. GC papers have the same camera-ready deadline as regular papers, but a much tighter review cycle. The GC paper submission deadline should coincide with the model submission deadline for GCs; similarly, GC paper acceptance announcements should coincide with the announcement of model evaluation results. See the dates listed below for details. This gives participants the maximum amount of time to prepare their models for submission to GC organizers for evaluation.
  4. GC organizers should organize accepted GC papers into their own 2-hour GC session in the ICME’24 program, which should include GC paper presentations (oral or poster), followed by a panel or open discussion. Winners of competitions should be announced during the GC sessions.
  5. Based on the quality of the accepted GC papers, GC organizers can subsequently submit a proposal for inclusion in a special section of IEEE Transactions on Multimedia via expedited reviews. Furthermore, organizers of a successfully coordinated GC can submit a renewal proposal for the following ICME edition.


Grand Challenge Proposals should follow the requirements below:

  1. Host organisation
  2. Coordinator contacts
  3. Challenge title
  4. Challenge description (maximum 500 words) and website for further details
  5. Dataset/APIs/Library URL
  6. Evaluation criteria (maximum 500 words)
  7. Submission deadline
  8. Submission guidelines
  9. Additional information


Please submit the proposals in PDF format to the relevant Grand Challenge chairs.

Important Dates

Grand Challenge proposal submission deadline: 10-Jan-2024 (extended from 15-Dec-2023). All submissions are due 11:59 PM Pacific Time.

Proposal Acceptance Notification: 20-Jan-2024.

Grand Challenge Chairs

Susanto Rahardja

SIT, Singapore

Jiaying Liu

Peking University, China