Authors:
(1) Jinge Wang, Department of Microbiology, Immunology & Cell Biology, West Virginia University, Morgantown, WV 26506, USA;
(2) Zien Cheng, Department of Microbiology, Immunology & Cell Biology, West Virginia University, Morgantown, WV 26506, USA;
(3) Qiuming Yao, School of Computing, University of Nebraska-Lincoln, Lincoln, NE 68588, USA;
(4) Li Liu, College of Health Solutions, Arizona State University, Phoenix, AZ 85004, USA and Biodesign Institute, Arizona State University, Tempe, AZ 85281, USA;
(5) Dong Xu, Department of Electrical Engineering and Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA;
(6) Gangqing Hu, Department of Microbiology, Immunology & Cell Biology, West Virginia University, Morgantown, WV 26506, USA ([email protected]).
Table of Links
4. Biomedical Text Mining and 4.1. Performance Assessments across typical tasks
4.2. Biological pathway mining
5.1. Human-in-the-Loop and 5.2. In-context Learning
6. Biomedical Image Understanding
7.1 Application in Applied Bioinformatics
7.2. Biomedical Database Access
7.3. Online tools for Coding with ChatGPT
7.4 Benchmarks for Bioinformatics Coding
8. Chatbots in Bioinformatics Education
9. Discussion and Future Perspectives
7.2. BIOMEDICAL DATABASE ACCESS
Structured Query Language (SQL) serves as a pivotal tool for navigating bioinformatics databases. Mastering SQL requires users to have both programming skills and a deep understanding of the database's data schema—prerequisites that many biomedical scientists find challenging. Recent advancements have seen LLM-chatbots like ChatGPT stepping in to translate natural language questions into SQL queries[98], significantly easing database access for non-programmers.
Sima and de Farias [99] explored ChatGPT-4's ability to explain and generate SPARQL queries for public biological and bioinformatics databases. Asked to explain a complex SPARQL query that identifies human genes linked to cancer and their orthologs in rat brains, which requires combining data from the UniProt, OMA, and Bgee databases, ChatGPT adeptly broke down the query's elements. However, its attempt to craft a SPARQL query from a natural language description of the same database search contained inaccuracies that required specific human feedback to correct. Notably, augmenting prompts with semantic clues such as variable names and inline comments substantially improved the translation of questions into the corresponding SPARQL queries when evaluated on a fine-tuned OpenLlama LLM[100].
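To illustrate what such semantic clues might look like in practice, the following minimal sketch rewrites a toy query with opaque variable names into one with descriptive names and inline comments. The `ex:` vocabulary, the variable names, and the helper `add_semantic_clues` are hypothetical and are not taken from [99] or [100]; real federated queries over UniProt, OMA, and Bgee are considerably longer.

```python
import re

# A toy query with opaque variable names. The ex: vocabulary is hypothetical.
OPAQUE_QUERY = """\
PREFIX ex: <http://example.org/>
SELECT ?a ?b WHERE {
  ?a ex:associatedWith ex:Cancer .
  ?a ex:hasOrtholog ?b .
  ?b ex:expressedIn ex:RatBrain .
}"""


def add_semantic_clues(query, names, comments):
    """Rename variables and append inline comments to matching lines.

    names:    maps an old variable name (without '?') to a descriptive one.
    comments: maps a substring of a line to the comment appended to that line.
    """
    def rename(match):
        return "?" + names.get(match.group(1), match.group(1))

    annotated = []
    for line in query.splitlines():
        # Look up the comment on the original line, before variables are renamed.
        comment = next((text for key, text in comments.items() if key in line), None)
        line = re.sub(r"\?(\w+)", rename, line)
        annotated.append(f"{line}  # {comment}" if comment else line)
    return "\n".join(annotated)


CLUED_QUERY = add_semantic_clues(
    OPAQUE_QUERY,
    names={"a": "human_gene", "b": "rat_ortholog"},
    comments={
        "ex:associatedWith": "human genes linked to cancer",
        "ex:hasOrtholog": "orthologs of those genes",
        "ex:expressedIn": "keep orthologs found in rat brain",
    },
)
print(CLUED_QUERY)
```

The annotated query, paired with its natural language question, could then serve as an in-context example or as training data; which of the two clues contributes more is not something this sketch can show.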
Chen and Stadler [101] applied GPT-3.5 and GPT-4 to convert user inputs into SQL queries for accessing a database of SARS-CoV-2 genomes and their annotations. Through systematic prompting and learning from numerous examples, the chatbot showed proficiency in understanding the database structure and generated accurate queries for 90.6% of requests with GPT-4 and 75.2% with GPT-3.5. In addition, the chatbot initiates a new session to explain each query so that users can cross-reference the explanation with their own input, minimizing the risk of misunderstanding.
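A minimal sketch of this kind of schema-aware, example-driven prompting is given below. It assumes a one-table SQLite schema and a single question/query pair, both invented for illustration; the actual schema, prompts, and examples used in [101] are not reproduced here. The helper `is_valid_sql` adds one possible safeguard: it checks that a generated query compiles against the schema before anything is run.

```python
import sqlite3

# Hypothetical schema; the actual database in [101] is not reproduced here.
SCHEMA = """\
CREATE TABLE genome (
    accession TEXT PRIMARY KEY,
    lineage TEXT,
    country TEXT,
    collection_date TEXT
);"""

# Hypothetical question/query pairs used as in-context examples.
EXAMPLES = [
    ("How many genomes were collected in Germany?",
     "SELECT COUNT(*) FROM genome WHERE country = 'Germany';"),
]


def build_prompt(schema, examples, question):
    """Assemble a few-shot text-to-SQL prompt from schema, examples, and question."""
    parts = ["Translate the question into one SQLite query.", "Schema:", schema]
    for example_question, example_sql in examples:
        parts += [f"Q: {example_question}", f"SQL: {example_sql}"]
    parts += [f"Q: {question}", "SQL:"]
    return "\n".join(parts)


def is_valid_sql(schema, sql):
    """Check that a query compiles against the schema without running it."""
    connection = sqlite3.connect(":memory:")
    try:
        connection.executescript(schema)
        # EXPLAIN compiles the statement and returns its plan instead of its rows.
        connection.execute(f"EXPLAIN {sql}")
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        connection.close()


prompt = build_prompt(SCHEMA, EXAMPLES, "Which lineages were sampled in 2021?")
candidate = "SELECT DISTINCT lineage FROM genome WHERE collection_date LIKE '2021%';"
print(prompt)
print(is_valid_sql(SCHEMA, candidate))
```

A query that passes this check is syntactically sound and refers only to existing tables and columns; whether it answers the user's question still has to be verified, which is the role of the separate explanation session described above.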
7.3. ONLINE TOOLS FOR CODING WITH CHATGPT
The Code Interpreter, officially integrated into ChatGPT-4 during the summer of 2023, represents a significant advancement in streamlining computational tasks. This feature facilitates a wide array of operations, including data upload, specification of analysis requirements, generation and execution of Python code, visualization of results, and data download, all through natural language instructions. It stands out for its ability to dynamically adapt code in response to runtime errors and self-assess the outcomes of code execution. Despite its broad applicability for general-purpose tasks such as data manipulation and visualization, its utility in bioinformatics data analysis encounters limitations such as the absence of bioinformatics-specific packages and the inability to access external databases[102].
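The error-driven adaptation described above can be pictured as a generate, execute, and repair loop. The sketch below is an assumption about the general pattern, not a description of how the Code Interpreter is implemented: `generate_code` is a hypothetical callable standing in for a request to an LLM, and `fake_generator` is a stub that lets the loop run without any model.

```python
import traceback


def run_with_repair(generate_code, task, max_attempts=3):
    """Execute generated code, feeding any runtime error back to the generator.

    generate_code(task, feedback) is a hypothetical callable standing in for an
    LLM request; feedback is None on the first attempt and a traceback afterwards.
    Returns (namespace, attempts_used) or raises RuntimeError if all attempts fail.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        code = generate_code(task, feedback)
        namespace = {}
        try:
            exec(code, namespace)  # NOTE: only run untrusted code in a sandbox
            return namespace, attempt
        except Exception:
            feedback = traceback.format_exc()
    raise RuntimeError(f"No working code after {max_attempts} attempts:\n{feedback}")


def fake_generator(task, feedback):
    """Stand-in for an LLM: the first draft has a bug, the revision fixes it."""
    if feedback is None:
        return "result = sum(values) / len(values)"  # 'values' is undefined
    return "values = [2, 4, 6]\nresult = sum(values) / len(values)"


namespace, attempts = run_with_repair(fake_generator, "compute the mean")
print(namespace["result"], attempts)
```

Capping the number of attempts keeps the loop from cycling indefinitely when the error cannot be fixed from the traceback alone, for example when a required package is missing from the execution environment.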
Shortly after the release of ChatGPT in November 2022, RTutor.AI emerged as a pioneering GPT-powered web server dedicated to data analysis. This R-based platform lets users upload a single tabular dataset and articulate their data analysis requirements in natural language. RTutor.AI manages data importing and type conversion, then uses OpenAI's API to generate R code. It executes the generated code and produces downloadable HTML reports that include plots. A subsequent application, Chatlize.AI, developed by the same team, adopts a tree-of-thought methodology[46] to enhance data analysis exploration. This approach, which extends to Python, enables the generation of multiple code versions for a given analysis task, their execution, and comprehensive documentation of the results. Users can then select a specific version of the code for further analysis. This feature is particularly valuable for exploratory data analysis, making Chatlize.AI a flexible solution for practicing prompt bioinformatics.
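The branching idea of producing several code versions, running each, and documenting the outcome can be sketched as follows. This is an illustrative assumption and does not reproduce Chatlize.AI: the candidate snippets are hard-coded where a real tool would obtain them from an LLM, and the function name `explore_candidates` is hypothetical.

```python
import statistics


def explore_candidates(candidates, data):
    """Run every candidate snippet on the same data and document the outcome.

    candidates maps a label to Python source that reads `data` and sets `result`.
    In a real tool the snippets would come from an LLM; here they are hard-coded.
    """
    report = []
    for label, source in candidates.items():
        # Each candidate gets its own copy of the data and a fresh namespace.
        namespace = {"data": list(data), "statistics": statistics}
        try:
            exec(source, namespace)  # NOTE: sandbox untrusted code in practice
            report.append({"label": label, "ok": True, "result": namespace.get("result")})
        except Exception as error:
            report.append({"label": label, "ok": False, "result": repr(error)})
    return report


CANDIDATES = {
    "mean": "result = statistics.mean(data)",
    "median": "result = statistics.median(data)",
    "broken": "result = statistics.mean(data) / 0",
}

report = explore_candidates(CANDIDATES, [1, 2, 3, 10])
for entry in report:
    print(entry)
```

Recording failed branches alongside successful ones keeps the exploration auditable, and the user can pick any successful entry as the starting point for the next round of analysis.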
This paper is available on arXiv under the CC BY 4.0 DEED license.
 
 