Biomedical NLP: Text Generation & Knowledge Reasoning Tasks

24 Apr 2025

Abstract and 1. Introduction

2. Community Challenges Overview and 2.1 CCKS

2.2 CHIP and 2.3 CCIR, CSMI, CCL and DCIC

3. Evaluation Tasks Overview and 3.1 Information Extraction

3.2 Text Classification and Text Similarity

3.3 Knowledge Graph and Question Answering

3.4 Text Generation and Knowledge Reasoning and 3.5 Large Language Model Evaluation

4. Translational Informatics in Biomedical Text Mining

5. Discussion and Perspective

5.1. Contributions of Community Challenges

5.2. Limitations of Current Community Challenges

5.3. Future Perspectives in the Era of Large Language Models, and References

Figure Legends and Tables

3.4 Text Generation and Knowledge Reasoning

Text generation is the process of producing natural language text, while knowledge reasoning is the process of drawing inferences from existing knowledge and information. Both are complex and challenging NLP tasks. In 2020, CHIP organized a question generation task focused on traditional Chinese medical literature and related texts from internet forums. Participants were required to develop algorithms that process these texts and generate questions. In 2021, CCKS organized a dialogue generation task based on the medical dialogue dataset MedDG [67], which covers 12 types of common gastrointestinal diseases. The task aimed to generate question-answer pairs containing 160 related entities from five categories: diseases, symptoms, attributes, examinations, and medications. In the same year, CCKS organized another task focused on reading comprehension of Chinese popular medical knowledge: given texts and questions, the task was to extract the corresponding text spans as answers. For medical knowledge reasoning, in 2021 CCL released the medical dialogue-based intelligent diagnosis evaluation task, which explored the identification of medical entities and symptom information from doctor-patient dialogue texts, the automatic generation of medical reports, and the simulation of the dialogue process to determine specific diseases. In 2022, CHIP organized a task on automatic clinical diagnostic coding from given diagnostic information (e.g., admission diagnosis, preoperative diagnosis, postoperative diagnosis, and discharge diagnosis), surgery, medication, and medical advice [47].
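To make the reading comprehension setting concrete, the sketch below scores a predicted answer span against a reference span. It is only an illustration: the source does not state which metrics the CCKS task used, so exact match and character-level F1 are assumptions borrowed from common span-extraction practice, and the function names and the example passage are hypothetical.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the two spans are identical after trimming whitespace."""
    return float(prediction.strip() == reference.strip())


def char_f1(prediction: str, reference: str) -> float:
    """Character-level F1 between a predicted span and a reference span.

    Character-level overlap is an assumption here; it avoids the need for
    word segmentation, which is convenient for Chinese text.
    """
    pred_chars = [c for c in prediction if not c.isspace()]
    ref_chars = [c for c in reference if not c.isspace()]
    if not pred_chars or not ref_chars:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_chars == ref_chars)
    # Multiset intersection counts each shared character at most
    # as often as it occurs in both spans.
    overlap = sum((Counter(pred_chars) & Counter(ref_chars)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(ref_chars)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the answer is a span copied from the passage.
passage = "Mild gastritis is commonly managed with antacids and rest."
start = passage.index("antacids")
gold_span = passage[start:start + len("antacids and rest")]
predicted_span = passage[start:start + len("antacids")]

print(exact_match(predicted_span, gold_span))
print(round(char_f1(predicted_span, gold_span), 4))
```

A partially correct span earns no exact-match credit but a non-zero F1, which is why the two scores are usually reported together.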

3.5 Large Language Model Evaluation

Large language models (LLMs) possess powerful capabilities in text understanding and generation, and extensive research has explored a wide range of potential applications for them. In the biomedical and healthcare fields, the development and application of LLMs must meet higher evaluation standards because of the domain's specialization, rigor, privacy requirements, and ethical considerations. Ensuring the reliability and credibility of these models in clinical applications is crucial, which makes comprehensive evaluation of biomedical LLMs extremely important. In 2023, building on the CBLUE benchmark [64], CCKS organized a task that transformed various NLP tasks from different medical scenarios into prompt-based language generation tasks, creating the large-scale prompt tuning benchmark PromptCBLUE [59, 60]. The organizers then optimized the dataset and ran evaluation tasks at CHIP in 2023 [50]. Additionally, CHIP organized another task in 2023 that released the CHIP-YIER-LLM dataset, which contains multiple-choice questions collected from medical licensing exams, medical textbooks, medical literature, clinical practice guidelines, and publicly available EHRs. This task was designed to assess the capabilities of LLMs in the field of biomedical research.
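The two evaluation styles described above can be sketched in a few lines: recasting a conventional NLP instance as a generation prompt, and scoring free-text model responses to multiple-choice questions. This is a minimal sketch under stated assumptions. The prompt templates, the `ChoiceItem` structure, and the answer-parsing rule are all hypothetical; they are not the actual PromptCBLUE or CHIP-YIER-LLM formats, which the source does not describe.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChoiceItem:
    question: str
    options: dict[str, str]  # option label -> option text
    answer: str              # gold option label


def ner_to_prompt(text: str, entity_types: list[str]) -> str:
    """Recast an entity-recognition instance as a generation prompt (hypothetical template)."""
    types = ", ".join(entity_types)
    return (
        f"Find all entities of the following types in the sentence: {types}.\n"
        f"Sentence: {text}\n"
        "Answer as 'type: mention' pairs, one per line."
    )


def choice_to_prompt(item: ChoiceItem) -> str:
    """Render a multiple-choice question as a plain-text prompt (hypothetical template)."""
    lines = [f"Question: {item.question}"]
    lines += [f"{label}. {text}" for label, text in sorted(item.options.items())]
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)


def parse_choice(response: str, labels: list[str]) -> Optional[str]:
    """Return the first standalone option label found in a response, if any.

    This is a deliberately simple heuristic: a response that begins with the
    English article "A" would be misread as option A.
    """
    pattern = r"\b(" + "|".join(re.escape(label) for label in labels) + r")\b"
    match = re.search(pattern, response)
    return match.group(1) if match else None


def accuracy(items: list[ChoiceItem], responses: list[str]) -> float:
    """Fraction of items whose parsed response equals the gold label."""
    if not items:
        return 0.0
    correct = sum(
        parse_choice(response, list(item.options)) == item.answer
        for item, response in zip(items, responses)
    )
    return correct / len(items)


# Hypothetical items and model responses, for illustration only.
items = [
    ChoiceItem(
        "Which drug class reduces gastric acid secretion?",
        {"A": "Proton pump inhibitors", "B": "Beta blockers"},
        "A",
    ),
    ChoiceItem(
        "Which examination directly visualizes the gastric mucosa?",
        {"A": "Electrocardiogram", "B": "Gastroscopy"},
        "B",
    ),
]
responses = ["A", "The answer is A."]

print(choice_to_prompt(items[0]))
print(accuracy(items, responses))
```

Parsing the answer out of free text is the fragile step in this kind of evaluation, so a real benchmark would typically constrain the output format or use a stricter parser than the heuristic shown here.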

This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.

Authors:

(1) Hui Zong, Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610041, China (equal contribution);

(2) Rongrong Wu, Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610041, China (equal contribution);

(3) Jiaxue Cha, Shanghai Key Laboratory of Signaling and Disease Research, Laboratory of Receptor-Based Bio-Medicine, Collaborative Innovation Center for Brain Science, School of Life Sciences and Technology, Tongji University, Shanghai, 200092, China;

(4) Erman Wu, Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610041, China;

(5) Jiakun Li, Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610041, China and Department of Urology, West China Hospital, Sichuan University, Chengdu, 610041, China;

(6) Liang Tao, Faculty of Business Information, Shanghai Business School, Shanghai, 201400, China;

(7) Zuofeng Li, Takeda Co. Ltd., Shanghai, 200040, China;

(8) Buzhou Tang, Department of Computer Science, Harbin Institute of Technology, Shenzhen, 518055, China;

(9) Bairong Shen, Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610041, China (corresponding author).