Chinese Biomedical Text Mining Challenges: Information Extraction Tasks Overview

23 Apr 2025

Abstract and 1. Introduction

2. Community Challenges Overview and 2.1 CCKS

2.2 CHIP and 2.3 CCIR, CSMI, CCL and DCIC

3. Evaluation Tasks Overview and 3.1 Information Extraction

3.2 Text Classification and Text Similarity

3.3 Knowledge Graph and Question Answering

3.4 Text Generation and Knowledge Reasoning and 3.5 Large Language Model Evaluation

4. Translational Informatics in Biomedical Text Mining

5. Discussion and Perspective

5.1. Contributions of Community Challenges

5.2. Limitations of Current Community Challenges

5.3. Future Perspectives in the Era of Large Language Models, and References

Figure Legends and Tables

3. Evaluation Tasks Overview

Figure 2 presents the distribution of data sources, organizations, and artificial intelligence tasks in the Chinese biomedical text mining community challenges. We categorize data sources into electronic health records (EHRs) and non-electronic health records (non-EHRs). EHR data comprise various types, such as radiology reports, pathology reports, and scanned copies of paper-based medical records. Non-EHR data consist of published literature (e.g., title and abstract text), internet medical data (e.g., health consultation questions, medical popularization content, online consultation data, and doctor-patient question-answering dialogues), clinical trial registration documents (e.g., eligibility criteria text), clinical practice guidelines, medical textbooks, and medication instructions. Most tasks used textual data from a single source; only a few incorporated both EHR and non-EHR data. For task organization, the results indicate that the majority of tasks were jointly organized by researchers from both academia and industry, highlighting the collaboration between the fields of medicine and engineering. Additionally, more tasks were organized by academic researchers than by industrial researchers, suggesting that academia plays a pivotal role in driving advancements within the field. For artificial intelligence tasks, named entity recognition received the most attention, followed by question answering, text classification, relation extraction, entity normalization, knowledge graph, text similarity, event extraction, large language model evaluation, attribute extraction, optical character recognition, and text generation. The following sections describe these tasks in detail.

3.1 Information Extraction

Information extraction is a fundamental task in the field of biomedical text mining, encompassing the extraction of domain-specific entities, attributes, relationships, and events from both structured and unstructured biomedical texts.

Named entity recognition is the most common task in Chinese biomedical text mining research. From 2017 to 2021, CCKS organized a clinical named entity recognition task every year. The data for these tasks were all derived from electronic medical records from real hospitals, with entity types and dataset sizes varying from year to year. In 2017, the task defined five entity types: symptom and sign, examination and test, disease and diagnosis, treatment, and body part [53]. In 2018, the task defined five entity types: anatomical site, symptom description, symptoms item, medication, and surgery [54]. In 2019, the task defined the same five entity types as in 2017 [51]. In 2020, six entity types were defined: disease and diagnosis, examination, test, surgery, medication, and anatomical site [52]. The entity types for 2021 remained the same as in 2020 [55].

The named entity recognition tasks organized by CHIP are more diverse in terms of data types and entity types. In 2020, two named entity recognition tasks were released. One focused on identifying and extracting entities from traditional Chinese medicine instruction texts, covering 13 types of drug-related entities: drug, drug ingredient, disease, symptom, syndrome, disease group, food, food group, person group, drug group, drug dosage, drug taste, and drug efficacy. The other aimed to identify and extract clinical entities from medical documents, with entities divided into 9 types: disease, symptom, drug, medical equipment, procedure, body part, test item, microorganism, and department [62]. In 2023, two further named entity recognition tasks were released. One focused on few-shot medical named entity recognition, defining 15 labels, including item, sociology, disease, etiology, body, age, adjuvant, therapy, electroencephalogram, equipment, drug, procedure, treatment, microorganism, department, epidemiology, symptom, and others. The other focused on identifying PICOS information in Chinese medical literature, where the PICOS elements are population, intervention, comparison, outcome, and study design. Additionally, in 2021, DCIC organized a task that required identifying annotated entities in pathology reports of tumors.
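To make the shape of a named entity recognition output concrete, the following is a minimal sketch that decodes character-level BIO tags into entity spans. The BIO scheme, the label names `body_part` and `symptom` (loosely modeled on the 2017 CCKS entity types), and the example sentence are assumptions made for illustration; the challenges described above do not necessarily use this exact format.

```python
from typing import List, Tuple

Span = Tuple[str, str, int, int]  # (text, entity type, start, end); end is exclusive


def decode_bio(tokens: List[str], tags: List[str]) -> List[Span]:
    """Convert per-character BIO tags into entity spans."""
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel "O" flushes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append(("".join(tokens[start:i]), label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


# Hypothetical sentence: "患者右肺疼痛" (the patient has pain in the right lung).
tokens = list("患者右肺疼痛")
tags = ["O", "O", "B-body_part", "I-body_part", "B-symptom", "I-symptom"]
entities = decode_bio(tokens, tags)
```

Here `entities` holds two spans, one for "右肺" (body part) and one for "疼痛" (symptom). Chinese text is tagged per character in this sketch because it has no whitespace word boundaries; a system built on word segmentation would tag words instead.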

Entity normalization is generally performed after named entity recognition. Its purpose is to map entities to a unified standard terminology in order to facilitate information exchange and knowledge sharing. In biomedical research, commonly used standard terminologies include the International Classification of Diseases 10th Revision (ICD-10), the Unified Medical Language System (UMLS), the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), and others. In 2019, CHIP organized an entity normalization task with the aim of mapping surgery-related entities in Chinese EHRs to the standard terminology "ICD9-2017 Peking Union Medical College Clinical Edition". In 2020 and 2021, CHIP organized two entity normalization tasks for diagnosis-related entities using the standard terminology "ICD-10 Beijing Clinical Edition v601".

Event extraction refers to the process of identifying specific events or occurrences mentioned in text, while attribute extraction aims to extract specific attributes or features from text. In the clinical domain, these tasks focus on extracting events or attributes related to healthcare and clinical processes. For example, the task defined by CHIP in 2018 targeted radiology reports of lung cancer and breast cancer and included three attributes: tumor size, primary tumor site, and metastatic site. Subsequently, CCKS organized tasks in 2019 and 2020 to further explore the extraction of the same three attributes [51, 52]. In 2021, CHIP organized a clinical event extraction task, which aimed to identify four attributes from a given medical history or medical imaging report: anatomical location, subject, description, and occurrence status.
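As a rough illustration of attribute extraction, a rule-based baseline for the tumor size attribute could look like the sketch below. The regular expression, the function name `extract_tumor_sizes`, and the sample report sentence are hypothetical; real reports express sizes in many more ways, and the systems submitted to these tasks are not described in this section.

```python
import re
from typing import List

# One measurement such as "3.2cm", optionally followed by further dimensions
# joined by a multiplication sign, e.g. "3.2cm×2.5cm".
# Limitation: forms with a single trailing unit ("3.2×2.5cm") are not captured whole.
_NUMBER_UNIT = r"\d+(?:\.\d+)?\s*(?:cm|mm)"
SIZE_PATTERN = re.compile(rf"{_NUMBER_UNIT}(?:\s*[×xX*]\s*{_NUMBER_UNIT})*")


def extract_tumor_sizes(report: str) -> List[str]:
    """Return every size expression found in a report sentence."""
    return SIZE_PATTERN.findall(report)


# Hypothetical report sentence: "a mass is seen in the right upper lobe,
# measuring about 3.2cm×2.5cm".
sizes = extract_tumor_sizes("右肺上叶见肿块,大小约3.2cm×2.5cm。")
```

A pattern like this finds size expressions but cannot tell whether a measurement belongs to the tumor or to another finding, which is one reason such attributes are framed as extraction tasks rather than simple pattern matching.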

Relation extraction is the process of identifying entities in unstructured texts and determining the relationships between them. In 2020, CHIP organized a relation extraction task that defined 53 types of relationships and required systems to analyze medical text sentences and output all relationships that met the specified conditions [63]. In 2022, CHIP organized another relation extraction task that focused on extracting three key types of medical causal inference relationships from online consultation texts [46]: causal relationships, conditional relationships, and hypothetical relationships.
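Relation extraction output is commonly represented as (subject, relation, object) triples and scored by exact match against gold annotations. The sketch below assumes that representation and a precision, recall, and F1 scoring scheme; the section above does not state the official metric of either CHIP task, and the relation names and triples are invented for illustration.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def triple_prf(gold: Set[Triple], pred: Set[Triple]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over sets of triples."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold and predicted triples for one sentence about diabetes.
gold = {("糖尿病", "drug_therapy", "二甲双胍"), ("糖尿病", "clinical_symptom", "多饮")}
pred = {("糖尿病", "drug_therapy", "二甲双胍"), ("糖尿病", "clinical_symptom", "乏力")}
precision, recall, f1 = triple_prf(gold, pred)
```

In this example one of the two predicted triples is correct and one of the two gold triples is recovered, so precision, recall, and F1 are all 0.5.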

Optical character recognition is a specific type of information extraction task that converts printed or handwritten text into machine-readable text. In clinical practice, various paper-based medical documents are generated, and the information they contain can be used to assist in clinical diagnosis and medical insurance claims [66]. In 2022, CHIP released a task for which the organizers collected scanned images of four types of medical records: discharge summaries, outpatient invoices, medication invoices, and hospitalization invoices [65]. The task explored structured data generation and information extraction for use in insurance claims. In 2023, CHIP released another task with a corpus of scanned drug package inserts, aiming to identify the entities and relationships within them.

Some information extraction tasks contain multiple subtasks. For example, in 2022, CHIP organized a task that aimed to extract semantic associations between genes and diseases from scientific literature [48]. This task defined 12 types of named entities, including nine molecular objects and three regulations. It also required recognizing two semantic roles, ThemeOf and CauseOf, as well as four regulatory types: Loss of Function (LOF), Gain of Function (GOF), Regulation (REG), and Compound change of a function (COM). Another task released by CHIP in 2022 aimed to extract medical decision trees from unstructured texts such as clinical practice guidelines and medical textbooks [49]. The task required systems to identify entities and relationships within the text and to interconnect this information in order to construct a complete clinical decision-making process.
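For the gene-disease task described above, the combination of subtasks can be pictured as a small data schema. The sketch below is an assumption about how such output might be organized, not the task's official format: the class and field names are hypothetical, the entity type strings are placeholders for the 12 task-defined types, and attaching the regulatory type to a gene-disease pair is a simplification. Only the two role names and the four regulatory type names come from the text.

```python
from dataclasses import dataclass
from enum import Enum


class Regulation(Enum):
    LOF = "Loss of Function"
    GOF = "Gain of Function"
    REG = "Regulation"
    COM = "Compound change of a function"


class Role(Enum):
    THEME_OF = "ThemeOf"
    CAUSE_OF = "CauseOf"


@dataclass(frozen=True)
class Mention:
    text: str
    entity_type: str  # placeholder for one of the 12 task-defined entity types
    start: int        # character offset where the mention begins
    end: int          # character offset just past the mention


@dataclass(frozen=True)
class RoleLink:
    source: Mention
    target: Mention
    role: Role


@dataclass(frozen=True)
class GeneDiseaseAssociation:
    gene: Mention
    disease: Mention
    regulation: Regulation


# Illustrative record for the phrase "BRCA1 mutation in breast cancer";
# offsets refer to that phrase, and the LOF label is chosen for the example.
gene = Mention("BRCA1", "Gene", 0, 5)
disease = Mention("breast cancer", "Disease", 18, 31)
association = GeneDiseaseAssociation(gene, disease, Regulation.LOF)
```

Keeping mentions, role links, and associations as separate record types mirrors the layered subtasks: entity recognition fills `Mention`, role recognition fills `RoleLink`, and the regulatory type is assigned last.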

This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.

Authors:

(1) Hui Zong, Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610041, China (contributed equally);

(2) Rongrong Wu, Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610041, China (contributed equally);

(3) Jiaxue Cha, Shanghai Key Laboratory of Signaling and Disease Research, Laboratory of Receptor-Based Bio-Medicine, Collaborative Innovation Center for Brain Science, School of Life Sciences and Technology, Tongji University, Shanghai, 200092, China;

(4) Erman Wu, Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610041, China;

(5) Jiakun Li, Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610041, China and Department of Urology, West China Hospital, Sichuan University, Chengdu, 610041, China;

(6) Liang Tao, Faculty of Business Information, Shanghai Business School, Shanghai, 201400, China;

(7) Zuofeng Li, Takeda Co. Ltd., Shanghai, 200040, China;

(8) Buzhou Tang, Department of Computer Science, Harbin Institute of Technology, Shenzhen, 518055, China;

(9) Bairong Shen, Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610041, China (corresponding author).