5.2. Limitations of Current Community Challenges
There are some limitations in current community challenge evaluation tasks on biomedical text mining.
Firstly, the data used in evaluation tasks are often not representative. Biomedical and healthcare data are sensitive and raise privacy concerns, making it difficult to obtain large-scale datasets. As a result, many tasks rely on small or synthetic datasets, which may limit their representativeness and applicability. Data quality and annotation are also challenging, as annotating medical data is complex and requires a high level of expertise from annotators. In terms of data types, many tasks collect only a single type of data. In the future, task organizers should consider incorporating data from multiple sources and modalities. Additionally, some datasets are restricted to use during the evaluation and are not made available afterward, limiting the impact of community challenge evaluation tasks.
Secondly, the developed solutions may lack sufficient innovation and exhibit poor reproducibility. Some participants tend to apply established methods to achieve quick results, rather than exploring novel and innovative approaches. This can lead to a lack of innovation and diversity in methods, limiting further technical advancements. On the other hand, some algorithms may perform well on specific datasets but fail to generalize to others, reducing their broad applicability and reliability. To encourage innovation, future community challenges should emphasize the exploration of novel techniques and reward participants for their creativity. When integrating with existing systems, algorithm performance must be evaluated across multiple datasets and scenarios to ensure robustness and generalizability.
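As an illustration of the cross-dataset evaluation recommended above, the following is a minimal sketch, not a protocol used by any of the challenges discussed. It scores one model on several datasets and reports the spread between the best and worst result. The function names, the `(text, label)` data format, the keyword-based toy model, and the example sentences are all hypothetical and chosen only to keep the example self-contained.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[str, int]  # (text, binary label); hypothetical toy format


def binary_f1(gold: Sequence[int], pred: Sequence[int]) -> float:
    """F1 for the positive class (label 1); returns 0.0 when there are no true positives."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def evaluate_across_datasets(
    predict: Callable[[str], int],
    datasets: Dict[str, List[Example]],
) -> dict:
    """Score one model on every dataset and summarise how much the scores vary."""
    per_dataset = {}
    for name, examples in datasets.items():
        gold = [label for _, label in examples]
        pred = [predict(text) for text, _ in examples]
        per_dataset[name] = binary_f1(gold, pred)
    values = list(per_dataset.values())
    return {
        "per_dataset": per_dataset,
        "mean": mean(values),
        "worst": min(values),
        "gap": max(values) - min(values),  # a large gap suggests poor generalization
    }


def keyword_model(text: str) -> int:
    """Toy stand-in for a trained classifier: flags any mention of 'fever'."""
    return int("fever" in text.lower())


# Invented sentences, for illustration only.
toy_datasets = {
    "in_domain": [
        ("Patient has fever", 1),
        ("No complaints", 0),
        ("High fever and cough", 1),
    ],
    "out_of_domain": [
        ("Pt febrile, temp 39C", 1),
        ("Denies fever", 0),
        ("Afebrile today", 0),
    ],
}

report = evaluate_across_datasets(keyword_model, toy_datasets)
print(report)
```

In this toy setting the keyword model is perfect on the first dataset and fails on the second, where the wording changes and a negation appears. Reporting the worst score and the gap alongside the mean makes that kind of brittleness visible, which a single leaderboard number on one dataset would hide.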
Lastly, there is a gap between evaluation tasks and applications in clinical practice. These tasks are generally abstracted from complex real-world problems and simplified to facilitate evaluation and comparison. However, such simplification may not reflect the complexity, diversity, and ambiguity of problems in real-world clinical settings. For example, several tasks focus on fundamental NLP tasks, such as medical information extraction and text classification, while more intricate challenges closely aligned with clinical practice are lacking. In real-world clinical practice, models often need to handle more noise and uncertainty than current evaluation tasks account for. Therefore, even systems that perform well in evaluation tasks may not achieve similar results in practical applications.
This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.
Authors:
(1) Hui Zong, Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610041, China (these authors contributed equally);
(2) Rongrong Wu, Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610041, China (these authors contributed equally);
(3) Jiaxue Cha, Shanghai Key Laboratory of Signaling and Disease Research, Laboratory of Receptor-Based Bio-Medicine, Collaborative Innovation Center for Brain Science, School of Life Sciences and Technology, Tongji University, Shanghai, 200092, China;
(4) Erman Wu, Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610041, China;
(5) Jiakun Li, Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610041, China and Department of Urology, West China Hospital, Sichuan University, Chengdu, 610041, China;
(6) Liang Tao, Faculty of Business Information, Shanghai Business School, Shanghai, 201400, China;
(7) Zuofeng Li, Takeda Co. Ltd., Shanghai, 200040, China;
(8) Buzhou Tang, Department of Computer Science, Harbin Institute of Technology, Shenzhen, 518055, China;
(9) Bairong Shen, Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610041, China (corresponding author).