Location: Crop Improvement and Genetics Research
Title: LncMachine: a machine learning algorithm for long noncoding RNA annotation in plants
CAGIRICI, BUSRA - Oak Ridge Institute For Science And Education (ORISE)
GALVEZ, SERGIO - University Of Malaga
BUDAK, HIKMET - Montana Bioagriculture Inc
Submitted to: Functional and Integrative Genomics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 1/25/2021
Publication Date: 2/26/2021
Citation: Cagirici, B.H., Galvez, S., Sen, T.Z., Budak, H. 2021. LncMachine: a machine learning algorithm for long noncoding RNA annotation in plants. Functional and Integrative Genomics. 21:195-204. https://doi.org/10.1007/s10142-021-00769-w.
Interpretive Summary: Long non-coding ribonucleic acids (lncRNAs) play important functional roles in cells, but their experimental identification is costly and imprecise. Here we develop a machine learning-based tool called LncMachine that identifies lncRNAs computationally with an average accuracy of 92.67%. We selected and refined sequence features that improved the final performance of the tool, and we compared its performance against three other popular lncRNA-prediction methods. The installable code of our tool, together with prebuilt prediction models and the training/test datasets, is freely and publicly available on GitHub.
Technical Abstract: Long noncoding RNAs (lncRNAs) have attracted considerable attention in recent years, following the elucidation of their critical roles in numerous important biological processes. Manual annotation of lncRNAs is restricted by known gene annotations and is prone to false prediction due to the incompleteness of available data. However, with the advent of high-throughput sequencing technologies, a large amount of high-quality data has become available for annotation, especially for plant species such as wheat. Here, we compared the prediction accuracies of several machine learning algorithms using 10-fold cross-validation. This study includes a comprehensive feature selection step to filter out irrelevant and redundant features. We present an alignment-free coding potential prediction tool, LncMachine with the Random Forest algorithm, which is specific to crop species and achieves higher accuracies than the currently available popular tools (CPC2, CPAT, and CNIT). In addition, LncMachine with Random Forest performed well on human and mouse data, with an average accuracy of 92.67%. LncMachine can run several algorithms in real time and provide the best model for a specific study. It accepts either a FASTA file or a tab-separated CSV file containing features for each sample. Because its code is openly available, LncMachine can readily be applied to a wide range of studies.
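The two input forms mentioned in the abstract (a FASTA file, or a tab-separated table with one row of features per sample) can be related by a small illustrative sketch. The Python code below is not LncMachine's implementation, and the three features it computes (sequence length, GC content, and longest forward-strand open reading frame) are assumptions chosen only as typical alignment-free descriptors; the actual feature set used by the tool is not listed here. The function names and column names are likewise hypothetical.

```python
import csv
import io

STOP_CODONS = {"TAA", "TAG", "TGA"}


def parse_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks).upper()
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks).upper()


def gc_content(seq):
    """Fraction of G and C nucleotides in the sequence (0.0 if empty)."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_orf_length(seq):
    """Longest ATG-to-stop ORF on the forward strand, in nucleotides.

    The stop codon is included in the reported length.
    """
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def features_to_tsv(fasta_text):
    """Return a tab-separated feature table, one row per FASTA record."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Hypothetical column names, chosen for this illustration only.
    writer.writerow(["id", "length", "gc_content", "longest_orf"])
    for name, seq in parse_fasta(fasta_text):
        seq = seq.replace("U", "T")  # treat RNA and DNA alphabets alike
        writer.writerow(
            [name, len(seq), f"{gc_content(seq):.3f}", longest_orf_length(seq)]
        )
    return out.getvalue()


EXAMPLE = """>tx1 example transcript
ATGAAACCCGGGTAA
>tx2
GGGCCC
"""

print(features_to_tsv(EXAMPLE))
```

A table of this shape, extended with a class label per transcript, is the kind of input on which several classifiers could then be compared by 10-fold cross-validation, as the abstract describes.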