Publication #411069

Research Project: Integrated Research to Improve Aquatic Animal Health in Warmwater Aquaculture

Location: Aquatic Animal Health Research

Title: A supervised machine learning workflow for the reduction of highly dimensional biological data

Authors:
Andersen, Linnea
Reading, Benjamin - North Carolina State University

Submitted to: Artificial Intelligence in the Life Sciences
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 11/24/2023
Publication Date: 6/1/2024
Citation: Andersen, L.K., Reading, B.J. 2023. A supervised machine learning workflow for the reduction of highly dimensional biological data. Artificial Intelligence in the Life Sciences. 5:100090. https://doi.org/10.1016/j.ailsci.2023.100090.
DOI: https://doi.org/10.1016/j.ailsci.2023.100090

Interpretive Summary: Advances in technology have allowed more data to be collected in all areas of our lives, from home life (e.g., smart appliances) to scientific research (e.g., genetics). Traditional methods of analyzing data, such as basic statistical approaches, are less effective at extracting meaningful information from the large amounts of data collected today. One solution is to apply artificial intelligence (AI)-based machine learning (ML) approaches to data analysis. Briefly, AI is a term used to describe computer systems designed to perform tasks in a way similar to how an “intelligent being” (a human or another animal) would, without receiving direct instructions from a human. Computer algorithms and technologies that allow AI systems to perform tasks in this way by “learning” patterns from the input data fall under the umbrella of ML, and the learning of data patterns to produce output a human can understand is referred to as data mining. AI/ML strategies for data mining have a greater capacity for handling the large amounts and various types of data currently being generated across society. However, AI/ML approaches are not yet well established or widely known, even among many scientific researchers, which makes analyzing and interpreting data and communicating results difficult. In this paper, we present an ML workflow that can be applied to many different types of data within and outside of science and that reduces large datasets to only the items that are most important for a given comparison or question. For example, this workflow has been used in biology research projects to answer questions such as when butterflies may emerge based on weather patterns and which genes are most important in distinguishing animals that are stunted in growth from those that grow well.

Technical Abstract: Recent technological advancements have revolutionized research capabilities across the biological sciences by enabling the collection of large datasets that provide a broader picture of systems, from the cellular to the ecosystem level, at a more refined resolution. The rapid rate at which these data are generated has exacerbated bottlenecks in study design and data analysis, especially as conventional methods that incorporate traditional statistical tests and assumptions are not suitable or sufficient for highly dimensional data (i.e., more than 1,000 variables). The application of machine learning techniques to large data analysis is one promising and increasingly popular solution. However, limited expertise in interpreting the results of machine learning models to gain meaningful biological insight poses a great challenge. To address this challenge, this paper provides a user-friendly machine learning workflow that can be applied to a wide variety of data types to reduce these large datasets to those variables (attributes) most determinant of experimental and/or observed conditions, as well as a general overview of data analysis and machine learning approaches and related considerations. The workflow presented here has been beta-tested with great success and is recommended for incorporation into large-data analysis pipelines as a standardized approach to reducing data dimensionality. Moreover, the workflow is flexible, and the underlying concepts and steps can be modified to best suit user needs, objectives, and study parameters.
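
Illustrative sketch: The abstract does not spell out the individual steps of the workflow, so the following Python example is only a minimal illustration of the general idea of supervised attribute reduction, not the authors' method. It assumes a two-class comparison and ranks each attribute with a simple signal-to-noise style score (difference of class means divided by the sum of class standard deviations), then keeps the top-ranked attributes. The function names, the scoring rule, and the synthetic dataset (40 samples, 2,000 attributes, three of them made informative) are all assumptions chosen for illustration.

```python
import random
import statistics


def rank_attributes(rows, labels):
    """Return (score, attribute_index) pairs, highest score first.

    Assumed scoring rule (not from the paper): |mean_a - mean_b|
    divided by (std_a + std_b), computed per attribute.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch assumes exactly two classes")
    scored = []
    for j in range(len(rows[0])):
        a = [r[j] for r, y in zip(rows, labels) if y == classes[0]]
        b = [r[j] for r, y in zip(rows, labels) if y == classes[1]]
        gap = abs(statistics.mean(a) - statistics.mean(b))
        spread = statistics.pstdev(a) + statistics.pstdev(b)
        if spread > 0:
            score = gap / spread
        else:
            # Constant within both classes: perfect separation or no signal.
            score = float("inf") if gap > 0 else 0.0
        scored.append((score, j))
    return sorted(scored, reverse=True)


def reduce_attributes(rows, labels, keep):
    """Keep only the `keep` best-scoring attributes (columns)."""
    top = sorted(j for _, j in rank_attributes(rows, labels)[:keep])
    reduced = [[r[j] for j in top] for r in rows]
    return top, reduced


def make_demo_data(seed=0, n_samples=40, n_attributes=2000,
                   informative=(5, 42, 1337), shift=5.0):
    """Synthetic two-class data; only `informative` columns differ by class."""
    rng = random.Random(seed)
    rows, labels = [], []
    for i in range(n_samples):
        label = i % 2  # alternate classes so the groups are balanced
        row = [rng.gauss(0.0, 1.0) for _ in range(n_attributes)]
        if label == 1:
            for j in informative:
                row[j] += shift
        rows.append(row)
        labels.append(label)
    return rows, labels


if __name__ == "__main__":
    rows, labels = make_demo_data()
    kept, reduced = reduce_attributes(rows, labels, keep=3)
    print("attributes kept:", kept)
    print("reduced shape:", len(reduced), "x", len(reduced[0]))
```

In practice, the number of attributes to keep and the scoring rule would be chosen to suit the study, and the reduced set would typically be checked with cross-validation on held-out samples before any biological interpretation.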