Authors:
(1) Jinge Wang, Department of Microbiology, Immunology & Cell Biology, West Virginia University, Morgantown, WV 26506, USA;
(2) Zien Cheng, Department of Microbiology, Immunology & Cell Biology, West Virginia University, Morgantown, WV 26506, USA;
(3) Qiuming Yao, School of Computing, University of Nebraska-Lincoln, Lincoln, NE 68588, USA;
(4) Li Liu, College of Health Solutions, Arizona State University, Phoenix, AZ 85004, USA and Biodesign Institute, Arizona State University, Tempe, AZ 85281, USA;
(5) Dong Xu, Department of Electrical Engineering and Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA;
(6) Gangqing Hu, Department of Microbiology, Immunology & Cell Biology, West Virginia University, Morgantown, WV 26506, USA (Michael.hu@hsc.wvu.edu).
Table of Links
4. Biomedical Text Mining and 4.1. Performance Assessments across typical tasks
4.2. Biological pathway mining
5.1. Human-in-the-Loop and 5.2. In-context Learning
6. Biomedical Image Understanding
7.1 Application in Applied Bioinformatics
7.2. Biomedical Database Access
7.3. Online tools for Coding with ChatGPT
7.4 Benchmarks for Bioinformatics Coding
8. Chatbots in Bioinformatics Education
9. Discussion and Future Perspectives
7. BIOINFORMATICS PROGRAMMING
ChatGPT enables scientists who may not possess advanced programming skills to perform bioinformatics analyses. Users can describe data characteristics, analysis details, and objectives in natural language, prompting ChatGPT to respond with executable code. In this context, we define “prompt bioinformatics” as the use of natural language instructions (prompts) to guide chatbots toward reliable and reproducible bioinformatics data analysis through code generation[13]. This concept differs from the bioinformatics chatbots developed before the GPT era, such as DrBioRight[92] and RiboChat[93], in three respects. First, in prompt bioinformatics the code is generated on the fly by the chatbot in response to a description of the data analysis. Second, the generated code inherently varies across chat sessions even for the same instruction, which calls for new methods to ensure reproducible results. Lastly, the concept covers a broad range of bioinformatics topics, particularly those in applied bioinformatics, where data analysis methods are relatively mature.
Early case studies showcase ChatGPT's versatility across diverse bioinformatics coding tasks, from aligning sequencing reads to constructing evolutionary trees[10], and its strong performance on introductory course exercises[12]. ChatGPT excels at writing short scripts that call existing functions according to specific instructions[94]. However, it struggles to write longer, working code for more complex data analyses, and the resulting errors often require domain-specific knowledge to spot and correct[94].
7.1. APPLICATION IN APPLIED BIOINFORMATICS
In applied bioinformatics, established data analysis methods are widely used, which increases the likelihood that they appear in LLM training datasets. Applied bioinformatics is therefore fertile ground for practicing prompt bioinformatics and evaluating its effectiveness. AutoBA[95], a Python package powered by LLMs, streamlines applied bioinformatics for multi-omics data analysis by autonomously designing analysis plans, generating code, managing package installations, and executing the code. Across 40 varied sequencing-based analysis scenarios, AutoBA with GPT-4 attained a 65% success rate in end-to-end automation[95]. Feeding error messages back to the model for code correction significantly enhanced this success rate. In addition, AutoBA uses retrieval-augmented generation (RAG) to make code generation more robust[95].
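The error-message feedback described above can be illustrated with a minimal sketch. It is not AutoBA's implementation: the function names (`run_script`, `generate_with_feedback`, `stub_llm`) are hypothetical, and a hard-coded stub stands in for the LLM call so that the loop can be run without any API access. The idea is simply to execute the generated script, capture the traceback on failure, and pass that traceback into the next generation request.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional, Tuple


def run_script(code: str, timeout: int = 60) -> Tuple[bool, str]:
    """Run a Python script in a subprocess; return (succeeded, stdout or stderr)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "analysis.py"
        script.write_text(code)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False, f"Script exceeded the {timeout}s time limit."
    if proc.returncode == 0:
        return True, proc.stdout
    return False, proc.stderr


def generate_with_feedback(
    task: str,
    generate_code: Callable[[str, Optional[str]], str],
    max_attempts: int = 3,
) -> dict:
    """Ask for code, run it, and feed any error message into the next request."""
    error: Optional[str] = None
    code = ""
    for attempt in range(1, max_attempts + 1):
        code = generate_code(task, error)
        ok, output = run_script(code)
        if ok:
            return {"success": True, "attempts": attempt, "code": code, "output": output}
        error = output  # the traceback becomes part of the next prompt
    return {"success": False, "attempts": max_attempts, "code": code, "output": error}


def stub_llm(task: str, error: Optional[str]) -> str:
    """Hypothetical stand-in for an LLM: the first draft is buggy, the retry is fixed."""
    if error is None:
        # First draft calls a function that was never defined (NameError).
        return "print(gc_content('ATGC'))\n"
    return (
        "def gc_content(seq):\n"
        "    return (seq.count('G') + seq.count('C')) / len(seq)\n"
        "print(gc_content('ATGC'))\n"
    )
```

Calling `generate_with_feedback("Compute the GC content of ATGC", stub_llm)` fails on the first draft and succeeds on the second. In a real setting `generate_code` would wrap a chat-completion request whose prompt includes the previous traceback, and the generated code should run in a sandbox rather than directly on the host.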
Mergen[96] is an R package that uses LLMs to automate data analysis. It crafts, executes, and refines code based on user-provided textual descriptions. Including file headers in prompts and feeding back error messages notably improve coding efficacy. The evaluation tasks for Mergen, while relevant to bioinformatics, are general-purpose in scope, covering machine learning, statistics, visualization, and data wrangling. Interestingly, role-playing does not yield significant improvements, possibly because of the general nature of the tasks and the mismatch between the assumed bioinformatician role and the task requirements.
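To make the idea of including file headers in a prompt concrete, the sketch below assembles a prompt that embeds the first lines of a delimited input file. Mergen itself is an R package and its actual prompt format is not reproduced here; the Python function names (`read_header`, `build_prompt`) and the prompt wording are assumptions chosen for illustration only.

```python
import csv
from pathlib import Path
from typing import List


def read_header(path: str, n_rows: int = 2, delimiter: str = "\t") -> List[List[str]]:
    """Return the header line plus up to n_rows data lines of a delimited file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        # zip stops once range is exhausted, so only n_rows + 1 lines are read.
        return [row for _, row in zip(range(n_rows + 1), reader)]


def build_prompt(task: str, path: str, n_rows: int = 2, delimiter: str = "\t") -> str:
    """Combine a task description with a preview of the input file."""
    rows = read_header(path, n_rows, delimiter)
    preview = "\n".join(delimiter.join(row) for row in rows)
    return (
        f"Task: {task}\n"
        f"Input file: {Path(path).name}\n"
        f"First {len(rows)} lines of the file:\n{preview}\n"
        "Return only runnable code."
    )
```

With the column names and a few example rows in view, the model does not have to guess how the data are laid out, which is one plausible reason this kind of context helps.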
LLMs exhibit inherent limitations in coding with tools beyond their training datasets. Bioinformaticians typically consult user manuals and source code to master new tools, a process LLMs could emulate. The BioMANIA framework[97] exemplifies this approach by creating conversational chatbots for open-source, well-documented Python tools. After parsing the APIs from source code and user manuals, it employs GPT-4 to generate instructions for API usage. These instructions are used to train a BERT-based model that suggests the most appropriate APIs for a user's query, with GPT-4 then predicting parameters and executing the API calls. Evaluation of the method identifies areas for improvement, such as tutorial documentation and API design, guiding the future development of chatbot-compatible tools[97].
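The retrieval step, in which a user query is matched against per-API instructions to shortlist candidate APIs, can be sketched as follows. This is a deliberately simplified stand-in: BioMANIA uses a BERT-based model, whereas the sketch ranks by bag-of-words cosine similarity, and the API names and instruction texts in `API_INSTRUCTIONS` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List

# Hypothetical API names paired with short usage instructions.
API_INSTRUCTIONS: Dict[str, str] = {
    "toolkit.normalize_counts": "Normalize raw read counts per cell to a common total",
    "toolkit.cluster_cells": "Cluster cells into groups using a neighborhood graph",
    "toolkit.plot_umap": "Plot a UMAP embedding of cells colored by cluster",
}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_apis(query: str, api_instructions: Dict[str, str], top_k: int = 2) -> List[str]:
    """Return the top_k API names whose instructions best match the query."""
    q = Counter(tokenize(query))
    scored = [
        (cosine(q, Counter(tokenize(text))), name)
        for name, text in api_instructions.items()
    ]
    # Highest score first; ties are broken alphabetically by API name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]
```

For example, `rank_apis("normalize the raw counts", API_INSTRUCTIONS, top_k=1)` shortlists `toolkit.normalize_counts`. A learned encoder would also handle paraphrases and synonyms that exact token overlap misses, which is why a model-based retriever is the more realistic choice.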
This paper is available on arXiv under the CC BY 4.0 DEED license.