Location: Plant Science Research
Title: Good statistical practices in agronomy using categorical data analysis, with alfalfa examples having Poisson and binomial underlying distributions
MOWERS, RONALD - Vis Viva Energy Economics Consulting
BUCCIARELLI, BRUNA - University Of Minnesota
CAO, YUANYUAN - University Of Minnesota
Samac, Deborah - Debby
Submitted to: Crops
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 5/3/2022
Publication Date: 5/13/2022
Citation: Mowers, R.P., Bucciarelli, B., Cao, Y., Samac, D.A., Xu, Z. 2022. Good statistical practices in agronomy using categorical data analysis, with alfalfa examples having Poisson and binomial underlying distributions. Crops. 2(2):154-171. https://doi.org/10.3390/crops2020012.
Interpretive Summary: More than 11,000 R packages and 137,000 Python libraries for data analysis and statistical inference are freely available online. Because R and Python are free, open source, and easy to use, they have become very popular for calculating p-values in tests of statistical significance. However, they also pose potential problems: they can generate misleading information and so lead to risky decisions. This article describes statistical issues that arise when analyzing experiments with categorical data and presents approaches to resolving them. We highlight some faulty analyses and demonstrate better alternatives in R, compared with standard SAS or JMP programming. Although the statistical methods presented here are not new, this guide for practitioners covers material not usually taught in introductory statistics courses for agronomists, breeders, and other researchers.
Technical Abstract: Categorical data are measured on a scale consisting of a set of categories. Such data can be derived from qualitative classifications or from countable quantitative data, and are common in biological research and crop breeding. Appropriate analysis of categorical data is essential for drawing correct inferences from experiments. However, categorical data can introduce unique issues in data analysis, particularly in real-world settings with collaborators and with dynamic data that are updated periodically. This paper discusses common problems arising from categorical variable analysis and data transformations, demonstrates the risks of misapplying an analysis, and suggests approaches to address these challenges using two data sets from alfalfa breeding programs. For each data set, we present several analysis methods, e.g., a simple t-test, analysis of variance (ANOVA), split-plot analysis, a generalized linear model (GLM), and a generalized linear mixed model (GLMM), using R with R Markdown and the standard statistical analysis software SAS/JMP. The goal is to demonstrate good analysis practices for categorical data by comparing potentially 'bad' analyses with better ones, avoiding too much reliance on reaching significance at a p-value threshold of 0.05, and navigating the morass of ever-increasing numbers of potential R functions. The research focuses on three main aspects: choosing the right data distribution, using the correct error terms for hypothesis-test p-values, including the right type of sums of squares (Type I, II, or III), and choosing proper statistical models for categorical data analysis. Our results show the importance of good statistical practice in helping agronomists, breeders, and other researchers apply appropriate statistical approaches and draw more reliable conclusions from their data.
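To make the point about choosing the right data distribution concrete, the following minimal sketch treats count data as Poisson rather than normal. It is not taken from the paper: the counts are made up for illustration, the function names `poisson_lrt_two_groups` and `pearson_dispersion` are hypothetical, and it uses Python's standard library rather than the R, SAS, or JMP code the article discusses. For a single two-level treatment factor, the Poisson GLM fitted values are simply the group means, so the likelihood-ratio (deviance) test can be written in closed form.

```python
import math

def poisson_lrt_two_groups(counts_a, counts_b):
    """Likelihood-ratio test that two groups of counts share one Poisson mean.

    Equivalent to comparing a Poisson GLM with a two-level treatment factor
    against an intercept-only model. Returns (G statistic, p-value on 1 df).
    """
    total_a, total_b = sum(counts_a), sum(counts_b)
    n_a, n_b = len(counts_a), len(counts_b)
    mean_a, mean_b = total_a / n_a, total_b / n_b
    mean_all = (total_a + total_b) / (n_a + n_b)

    def term(total, mean):
        # A group whose counts are all zero contributes nothing to the statistic.
        return total * math.log(mean / mean_all) if total > 0 else 0.0

    g = max(2.0 * (term(total_a, mean_a) + term(total_b, mean_b)), 0.0)
    # Upper tail of a chi-square with 1 df: P(X > g) = erfc(sqrt(g / 2)).
    p_value = math.erfc(math.sqrt(g / 2.0))
    return g, p_value

def pearson_dispersion(groups):
    """Pearson chi-square divided by residual df, using group means as fits.

    Values well above 1 suggest overdispersion, i.e. a plain Poisson model
    may understate the error variance.
    """
    chi_sq, n_obs = 0.0, 0
    for counts in groups:
        mean = sum(counts) / len(counts)
        if mean > 0:
            chi_sq += sum((y - mean) ** 2 for y in counts) / mean
        n_obs += len(counts)
    return chi_sq / (n_obs - len(groups))

# Hypothetical counts per plot (illustrative only, not data from the paper).
control = [3, 5, 4, 6, 2, 4]
treated = [8, 7, 10, 6, 9, 8]

g_stat, p_value = poisson_lrt_two_groups(control, treated)
dispersion = pearson_dispersion([control, treated])
print(f"G = {g_stat:.3f}, p = {p_value:.4f}, dispersion = {dispersion:.3f}")
```

The dispersion check matters because a Poisson model assumes the variance equals the mean. When the ratio is far above 1, a model that accounts for the extra variation (for example a GLMM with a random plot effect) is usually a safer choice than either a plain Poisson GLM or a t-test on raw counts.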