Integrating and optimizing genomic, weather, and secondary trait data for multiclass classification



Modern plant breeding programs collect several data types, such as weather, images, and secondary or associated traits, in addition to the main trait (e.g., grain yield). Genomic data is high-dimensional and often crowds out smaller data types when the two are naively combined to explain the response variable. Methods are needed that can effectively combine data types of differing sizes to improve predictions. In the face of changing climate conditions, methods are also needed that can combine weather information with genotype data to better predict the performance of lines. In this work, we develop a novel three-stage classifier that predicts multi-class traits by combining three data types: genomic, weather, and secondary trait. The method addresses several challenges in this problem, including confounding, differing sizes of data types, and threshold optimization. We examined the method in different settings, including binary and multi-class responses, various penalization schemes, and class balances. We compared our method with standard machine learning (ML) methods such as random forests and support vector machines, using various classification accuracy metrics to assess performance and model size to assess sparsity. Our method performed similarly to or better than the ML methods across these settings. More importantly, the classifiers obtained were highly sparse, allowing for a straightforward interpretation of the relationships between the response and the selected predictors.
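As a rough illustration of what threshold optimization can mean for a binary response with unbalanced classes, the sketch below scans candidate cutoffs on predicted scores and keeps the one with the highest balanced accuracy. This is a generic example, not the procedure used in the paper: the function names, the toy labels and scores, and the choice of balanced accuracy as the criterion are all assumptions made for illustration.

```python
from typing import List, Tuple


def balanced_accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Mean of sensitivity and specificity for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(y_true)
    neg = len(y_true) - pos
    sensitivity = tp / pos if pos else 0.0
    specificity = tn / neg if neg else 0.0
    return (sensitivity + specificity) / 2


def best_threshold(y_true: List[int], scores: List[float]) -> Tuple[float, float]:
    """Scan each observed score as a cutoff; return (cutoff, balanced accuracy)."""
    best_t, best_ba = 0.5, -1.0
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        ba = balanced_accuracy(y_true, preds)
        if ba > best_ba:  # strict inequality keeps the smallest best cutoff
            best_t, best_ba = t, ba
    return best_t, best_ba


# Hypothetical toy data: 7 negatives, 3 positives, scores from some classifier.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.28, 0.40, 0.45]

threshold, score = best_threshold(y_true, scores)
default_score = balanced_accuracy(y_true, [1 if s >= 0.5 else 0 for s in scores])
print(f"default cutoff 0.5: balanced accuracy = {default_score:.3f}")
print(f"tuned cutoff {threshold}: balanced accuracy = {score:.3f}")
```

With these toy scores, the default cutoff of 0.5 predicts every observation as the majority class, while a lower cutoff recovers the minority class at the cost of a few false positives. In practice the cutoff would be chosen on held-out data rather than on the data used to fit the classifier.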

Frontiers in Genetics
Vamsi Manthena
Data Scientist II | Ph.D. Statistics

Data Scientist | Statistical Consultant