Classifying COVID-19 based on amino acids encoding with machine learning algorithms

Walaa Alkady, Khaled ElBahnasy, Víctor Leiva, Walaa Gad

Research output: Contribution to journal › Article › peer-review

15 Scopus citations

Abstract

COVID-19 causes serious respiratory illness. Accurate identification of the viral infection cycle therefore plays a key role in designing appropriate vaccines. The risk of this disease depends on proteins that interact with human receptors. In this paper, we formulate a novel model for COVID-19 named “amino acid encoding based prediction” (AAPred). This model is accurate, classifies the various coronavirus types, and distinguishes SARS-CoV-2 from other coronaviruses. With the AAPred model, we enhance performance by reducing the number of features, selecting the most important ones using statistical criteria. We analyze the protein sequence of SARS-CoV-2 to understand the viral infection cycle. Six machine learning classifiers (decision trees, k-nearest neighbors, random forest, support vector machine, bagging ensemble, and gradient boosting) are used to evaluate the model in terms of accuracy, precision, sensitivity, and specificity. We implement the proposed approach computationally and apply it to real data from the National Genomics Data Center. The experimental results show that the AAPred model reduces the feature set to seven features. The average accuracy over 10-fold cross-validation is 98.69%, precision is 98.72%, sensitivity is 96.81%, and specificity is 97.72%. The features are selected using information gain and classified with random forest. The proposed model predicts the type of coronavirus and reduces the number of extracted features. We find that SARS-CoV-2 has physicochemical characteristics similar to those of SARS-CoV in some regions. We also report that SARS-CoV-2 has infection cycles and sequences similar to those of SARS-CoV in some regions, indicating that vaccines may have an effect on SARS-CoV-2. A comparison with deep learning shows results similar to those of our method.
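The abstract does not spell out the exact encoding, so the following is only a minimal sketch of one common choice, amino acid composition (listed among the keywords): each protein sequence is mapped to the fraction of each of the 20 standard residues. The function name `amino_acid_composition` and the toy sequence are assumptions made for illustration and do not come from the paper.

```python
from collections import Counter

# The 20 standard amino acids, in one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in a protein sequence.

    Hypothetical helper for illustration. Characters outside the 20 standard
    residues (for example 'X' or gap symbols) are ignored.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # No valid residues: return an all-zero vector instead of dividing by zero.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Toy fragment for illustration only; not a real viral protein record.
features = amino_acid_composition("MFVFLVLLPLVSSQ")
print(dict(zip(AMINO_ACIDS, features)))
```

Fixed-length vectors of this kind could then be passed to a feature-selection step and a classifier such as a random forest, as the abstract describes. Those stages are omitted here because the abstract does not give their settings.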

Original language: English
Article number: 104535
Journal: Chemometrics and Intelligent Laboratory Systems
Volume: 224
State: Published - 15 May 2022

Keywords

  • ANOVA
  • Amino acid composition
  • Artificial intelligence
  • Bagging ensemble and gradient boosting
  • Chi-square test
  • Deep learning
  • Feature extraction and selection
  • Information gain
  • LASSO
  • Molecular modeling
  • Protein sequence
  • SARS-CoV-2
