Authors:
(1) Jinge Wang, Department of Microbiology, Immunology & Cell Biology, West Virginia University, Morgantown, WV 26506, USA;
(2) Zien Cheng, Department of Microbiology, Immunology & Cell Biology, West Virginia University, Morgantown, WV 26506, USA;
(3) Qiuming Yao, School of Computing, University of Nebraska-Lincoln, Lincoln, NE 68588, USA;
(4) Li Liu, College of Health Solutions, Arizona State University, Phoenix, AZ 85004, USA and Biodesign Institute, Arizona State University, Tempe, AZ 85281, USA;
(5) Dong Xu, Department of Electrical Engineering and Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA;
(6) Gangqing Hu, Department of Microbiology, Immunology & Cell Biology, West Virginia University, Morgantown, WV 26506, USA (Michael.hu@hsc.wvu.edu).
Table of Links
4. Biomedical Text Mining and 4.1. Performance Assessments across typical tasks
4.2. Biological pathway mining
5.1. Human-in-the-Loop and 5.2. In-context Learning
6. Biomedical Image Understanding
7.1. Application in Applied Bioinformatics
7.2. Biomedical Database Access
7.3. Online tools for Coding with ChatGPT
7.4. Benchmarks for Bioinformatics Coding
8. Chatbots in Bioinformatics Education
9. Discussion and Future Perspectives
7.4. BENCHMARKS FOR BIOINFORMATICS CODING
A thorough assessment of coding proficiency in bioinformatics necessitates comprehensive benchmarks that cover a broad range of topics in the field. Writing individual functions is a fundamental skill in developing advanced bioinformatics algorithms. BIOCODER[103] is a benchmark that evaluates language models' proficiency in function writing. It encompasses over 2,200 Python and Java functions derived from authentic bioinformatics codebases, in addition to 253 functions sourced from the Rosalind project. Comparative analyses have shown that GPT-3.5 and GPT-4 significantly outperform smaller, coding-specific language models. Interestingly, integrating topic-specific context, such as imported objects, into the baseline task descriptions markedly enhances accuracy. However, even the most adept models, namely the GPT series, reach an accuracy ceiling at 60%. A significant proportion of the failures are attributed to syntax or runtime errors[103], suggesting that ChatGPT's effectiveness in bioinformatics coding could be further enhanced through human feedback on error messages.
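To make the evaluation loop concrete, the sketch below shows a minimal function-writing benchmark harness in the spirit described above: a task prompt (optionally enriched with topic-specific context) is sent to a model, and the generated function is executed against hidden test cases, with failures classified as syntax errors, runtime errors, or wrong answers. All names here (gc_content, generate_code, evaluate) are hypothetical illustrations, not the actual BIOCODER implementation; the model call is stubbed so the harness runs end to end.

```python
"""Minimal sketch of a function-writing benchmark harness (illustrative only)."""
import traceback


def generate_code(prompt: str) -> str:
    # Placeholder for a language-model call (e.g., the ChatGPT API); a canned
    # answer is returned here so the harness can be run without network access.
    return (
        "def gc_content(seq):\n"
        "    seq = seq.upper()\n"
        "    return (seq.count('G') + seq.count('C')) / len(seq)\n"
    )


def evaluate(candidate_src: str, test_cases):
    """Execute candidate code and classify the outcome."""
    namespace = {}
    try:
        exec(candidate_src, namespace)      # invalid code raises SyntaxError
    except SyntaxError:
        return "syntax_error"
    func = namespace.get("gc_content")
    if func is None:
        return "missing_function"
    try:
        for args, expected in test_cases:
            if abs(func(*args) - expected) > 1e-9:
                return "wrong_answer"
    except Exception:
        traceback.print_exc()               # surface the error for human feedback
        return "runtime_error"
    return "pass"


if __name__ == "__main__":
    task_prompt = (
        "Write a Python function gc_content(seq) that returns the fraction "
        "of G and C bases in a DNA sequence."
        # Topic-specific context (e.g., imported objects) would be appended
        # here in the context-enriched setting described in the text.
    )
    tests = [(("ATGC",), 0.5), (("GGCC",), 1.0), (("ATAT",), 0.0)]
    print("outcome:", evaluate(generate_code(task_prompt), tests))
```

Classifying failures this way also makes it straightforward to feed syntax and runtime error messages back to the model, the human-in-the-loop refinement suggested above.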
Execution success is crucial, yet it represents only one facet of evaluating bioinformatics code quality. Sarwal et al. [104] proposed a comprehensive evaluation framework comprising seven metrics that assess both subjective and objective dimensions of code writing: readability, correctness, efficiency, simplicity, error handling, code examples, and clarity of input/output specifications. Each metric is scored on a scale from 1 to 10 and normalized independently across models after evaluation. When applied to a variety of common bioinformatics tasks, this framework highlighted GPT-4's superior performance over alternatives such as Bard and LLaMA. However, the current evaluation remains narrowly focused on a limited set of tasks[104]. Expanding these evaluations to a broader range of bioinformatics domains calls for community-led efforts to achieve a comprehensive appraisal of these language models.
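The sketch below illustrates the per-metric normalization step described above, assuming a simple min-max scheme: each of the seven metrics is rated 1-10 per model and then normalized independently across models before aggregation. The model names and scores are made-up placeholders, not results from [104], and the exact normalization used there may differ.

```python
"""Illustrative per-metric score normalization across models (placeholder data)."""
METRICS = ["readability", "correctness", "efficiency", "simplicity",
           "error_handling", "code_examples", "io_clarity"]

# Hypothetical raw ratings (1-10) for each model on each of the seven metrics.
raw_scores = {
    "model_A": [9, 8, 7, 8, 6, 7, 8],
    "model_B": [7, 6, 6, 7, 5, 6, 7],
    "model_C": [6, 5, 5, 6, 4, 5, 6],
}


def normalize_across_models(scores):
    """Min-max normalize each metric independently across all models."""
    models = list(scores)
    normalized = {m: [] for m in models}
    for i in range(len(METRICS)):
        column = [scores[m][i] for m in models]
        lo, hi = min(column), max(column)
        span = (hi - lo) or 1.0              # avoid division by zero on ties
        for m in models:
            normalized[m].append((scores[m][i] - lo) / span)
    return normalized


if __name__ == "__main__":
    for model, values in normalize_across_models(raw_scores).items():
        overall = sum(values) / len(values)  # simple mean as one possible aggregate
        print(f"{model}: mean normalized score = {overall:.2f}")
```

Normalizing each metric across models keeps any single dimension from dominating the aggregate simply because raters used a wider portion of the 1-10 scale for it.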
This paper is available on arxiv under CC BY 4.0 DEED license.