News

Introduction

Multimodal large language models (MLLMs) have the potential to support clinical training and assessment by helping medical experts interpret procedural videos and verify adherence to standardized workflows. Reliable deployment in these settings requires evidence that models can continuously track students' actions during clinical skill assessments, a capability that underpins MLLMs' understanding of clinical skills. Systematically evaluating and improving this understanding and continuous perception is therefore essential for building reliable, high-impact AI systems for medical education. To address this need, this shared task on medical question answering targets clinical skill assessment scenarios.

Important Dates

Note that all deadlines are 23:59:59 AoE (UTC-12).

Task Definition

ClinSkill QA formulates clinical skill understanding and continuous perception for clinical skill assessment as an ordering task: the MLLM must arrange shuffled key frames into a coherent sequence of clinical actions and explain the resulting order. The dataset is built from video clips of medical students performing clinical procedures, collected from Zhongnan Hospital of Wuhan University and cofun [1]. This study was approved by the Institutional Review Board (IRB), and all data collection and processing followed the relevant ethical guidelines.

Dataset

ClinSkill QA is built on 200 sets of shuffled key frames extracted from three types of clinical skill videos. Each set of key frames represents a sequence of continuous actions and is accompanied by expert-annotated ground-truth ordering and order rationales.

Example

Input frames: [A.jpg, B.jpg, C.jpg, D.jpg]

Output order: ["D", "B", "A", "C"]

Explanation: This example covers the transition from initial approach to pre-compression preparation and the onset of chest compressions in CPR training. The ordering is determined by the visible progression from approaching the manikin, to opening clothing for chest exposure, to starting compressions.

Figure D: The operator is approaching a supine manikin. The clothing appears fully fastened and no patient contact is visible yet, which is most consistent with an early pre-compression stage. This should precede Figure B, where chest exposure has already started.

Figure B: The operator uses both hands to unzip the jacket, partially opening the clothing to expose the chest area. This follows Figure D (clothing still fastened) and precedes Figure A, where the clothing is fully opened and moved aside.

Figure A: The clothing is fully opened and arranged to both sides, indicating complete chest exposure immediately before compressions. This should come after Figure B (partial opening) and before Figure C (compressions underway).

Figure C: The operator places one hand over the other on the manikin’s chest and is performing chest compressions. The chest is exposed and compressions have begun, so this should follow Figure A.

Evaluation

For evaluation, we use Task Accuracy (exact-match ordering) and Pairwise Accuracy (the fraction of adjacent pairs ordered correctly) for the ordering results, and BERTScore together with an LLM-as-judge (G-Eval) for assessing the quality of the ordering explanations.

For the i-th sample (a set of shuffled keyframes):

Ordering evaluation
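As a minimal sketch of the two ordering metrics (assuming Pairwise Accuracy checks, for each adjacent pair in the ground-truth order, whether the prediction preserves its relative order; the function name is illustrative):

```python
def ordering_metrics(pred: list[str], gold: list[str]) -> tuple[float, float]:
    """Task Accuracy (exact match with the gold order) and Pairwise
    Accuracy (fraction of adjacent gold pairs whose relative order is
    preserved in the predicted order)."""
    task_acc = 1.0 if pred == gold else 0.0
    pos = {frame: i for i, frame in enumerate(pred)}  # frame -> predicted index
    adjacent = list(zip(gold, gold[1:]))              # adjacent gold pairs
    kept = sum(1 for a, b in adjacent if pos[a] < pos[b])
    return task_acc, kept / len(adjacent)
```

On the CPR example above, a prediction matching the gold order ["D", "B", "A", "C"] scores 1.0 on both metrics; swapping the first two frames drops Task Accuracy to 0.0 while Pairwise Accuracy remains 2/3.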

Rationale evaluation
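For the LLM-as-judge component, a G-Eval style setup passes each model rationale together with the expert reference to a judge model with a scoring rubric. The template and 1-to-5 scale below are illustrative assumptions, not the official rubric:

```python
# Hypothetical G-Eval style judging prompt; the rubric wording and
# score scale are assumptions, not the task's official criteria.
JUDGE_TEMPLATE = (
    "You are evaluating the rationale a model gives for its frame ordering.\n"
    "Reference rationale:\n{reference}\n\n"
    "Model rationale:\n{candidate}\n\n"
    "Rate the model rationale from 1 (poor) to 5 (excellent) for factual\n"
    "consistency with the reference and coherence of the described action\n"
    "sequence. Answer with a single integer."
)

def build_judge_prompt(candidate: str, reference: str) -> str:
    """Fill the judging template for one sample's rationale pair."""
    return JUDGE_TEMPLATE.format(candidate=candidate, reference=reference)
```

The filled prompt is then sent to the judge model, and the returned integer is averaged over samples.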

Registration and Submission

Organizers

References