The development of the VirulentPred 2.0 prediction models can be described in five steps (Figure 1; Steps A-E), given below:
(A). Data collection: The generation of the current positive dataset began by retrieving 3580 and 2852 virulent protein sequences from the VFDB and UniProt databases, respectively.
(B). Data pre-processing: From VFDB, the core dataset file was downloaded, whereas sequences were retrieved from UniProt using keywords such as virulence, adhesin, adhesins, toxin, invasion, capsule and other terms related to virulence. These 6432 sequences were then strictly screened to filter out entries labelled as “probable”, “putative”, “similarity”, “fragments”, “hypothetical”, “unknown” and “possible” (Figure 1).
To build the negative dataset, annotated sequences of bacterial enzymes were downloaded from the UniProt database; the sequences were searched mainly within the bacterial proteomes for which virulent protein sequences had been obtained. This dataset was likewise strictly screened to obtain a high-quality negative dataset.
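A minimal sketch of this keyword-based screening step is given below. It assumes Biopython and FASTA-formatted input; the file names and the exact exclusion list are illustrative and not taken from the authors' pipeline.

```python
# Hedged sketch of the screening step: drop entries whose annotations contain
# uncertainty keywords. File names and the keyword list are illustrative.
from Bio import SeqIO

EXCLUDE_TERMS = ("probable", "putative", "similarity", "fragment",
                 "hypothetical", "unknown", "possible")

def screen_fasta(in_path: str, out_path: str) -> int:
    """Keep only records whose descriptions contain none of the excluded terms."""
    kept = [record for record in SeqIO.parse(in_path, "fasta")
            if not any(term in record.description.lower() for term in EXCLUDE_TERMS)]
    SeqIO.write(kept, out_path, "fasta")
    return len(kept)

# Example: screen_fasta("virulent_raw.fasta", "virulent_screened.fasta")
```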
Subsequently, the positive and negative datasets were clustered with CD-HIT at a sequence identity cut-off of 0.5, so that sequences sharing 50% or more identity were reduced to a single representative, thereby removing redundancy (a sketch of this clustering step is given after Table 1). The distribution of virulent protein sequences from different bacterial pathogens after refinement is shown in Table 1. Eventually, we obtained a non-redundant dataset of 6781 sequences, comprising 3375 virulent and 3406 non-virulent protein sequences, for the generation of the models (Figure 1).
Table 1. Distribution of virulent protein sequences from different bacterial genera after refinement.

| Sr. No. | Bacterial genus | No. of protein sequences |
| --- | --- | --- |
| 1 | Acinetobacter | 42 |
| 2 | Aeromonas | 181 |
| 3 | Anaplasma | 21 |
| 4 | Bacillus | 92 |
| 5 | Bartonella | 64 |
| 6 | Bordetella | 62 |
| 7 | Brucella | 55 |
| 8 | Burkholderia | 143 |
| 9 | Campylobacter | 117 |
| 10 | Chlamydia | 52 |
| 11 | Clostridium | 41 |
| 12 | Corynebacterium | 16 |
| 13 | Coxiella | 147 |
| 14 | Enterococcus | 39 |
| 15 | Escherichia | 235 |
| 16 | Francisella | 99 |
| 17 | Haemophilus | 69 |
| 18 | Helicobacter | 101 |
| 19 | Klebsiella | 89 |
| 20 | Legionella | 383 |
| 21 | Listeria | 44 |
| 22 | Mycobacterium | 172 |
| 23 | Mycoplasma | 26 |
| 24 | Neisseria | 53 |
| 25 | Pseudomonas | 235 |
| 26 | Rickettsia | 28 |
| 27 | Salmonella | 156 |
| 28 | Shigella | 74 |
| 29 | Staphylococcus | 140 |
| 30 | Streptococcus | 121 |
| 31 | Vibrio | 165 |
| 32 | Yersinia | 113 |
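The redundancy-removal step described in step (B) can be reproduced with the CD-HIT command-line tool; the sketch below shows one plausible invocation (the word size of 3 follows the CD-HIT manual's recommendation for identity thresholds of about 0.5-0.6, and the file names are placeholders).

```python
# Hedged sketch: cluster sequences with CD-HIT at 50% identity and keep one
# representative per cluster. Assumes the cd-hit binary is available on PATH.
import subprocess

def run_cdhit(in_fasta: str, out_fasta: str, identity: float = 0.5) -> None:
    subprocess.run(
        ["cd-hit",
         "-i", in_fasta,       # input FASTA file
         "-o", out_fasta,      # non-redundant representative sequences
         "-c", str(identity),  # sequence identity cut-off
         "-n", "3"],           # word size suggested for cut-offs of 0.5-0.6
        check=True,
    )

# Example: run_cdhit("all_screened.fasta", "all_nonredundant.fasta")
```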
(C). Dataset preparation: Preparation of training and test datasets is an essential requirement for a machine learning study. In the present study, the positive and negative datasets were randomly shuffled and each divided into 80% training and 20% test datasets. Table 2 provides the distribution of positive and negative dataset protein sequences used in the present study. Further, the training dataset was randomly divided, through an in-built feature of AutoGluon, into an internal training set (0.9 fraction of the data) and a holdout validation set (0.1 fraction), used for training and internal evaluation of the ML models, respectively (a sketch of this split is given after Table 2).
Table 2. Distribution of protein sequences in the positive and negative datasets.

| Dataset type | Total number of protein sequences | Number of protein sequences in training dataset (80%) | Number of protein sequences in test dataset (20%) |
| --- | --- | --- | --- |
| Positive dataset | 3375 | 2700 | 675 |
| Negative dataset | 3406 | 2725 | 681 |
| Both datasets | 6781 | 5425 | 1356 |
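The 80/20 split described in step (C) could be reproduced roughly as follows; this is a sketch assuming the per-sequence feature vectors are stored in a pandas DataFrame with a binary "label" column (1 = virulent, 0 = non-virulent), which is an assumption rather than the authors' exact format.

```python
# Hedged sketch of the 80/20 train/test split, stratified so that the
# positive/negative ratio is preserved in both subsets.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("features.csv")  # one row per protein; "label" column assumed

train_df, test_df = train_test_split(
    data, test_size=0.20, stratify=data["label"], shuffle=True, random_state=42
)

train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)
```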
(D). Sequence-based feature calculation: The standalone package GPSR 1.0 was used to calculate the amino acid composition (AAC), dipeptide composition (DPC), tripeptide composition (TPC) and binary profile pattern (BPP) of the virulent and non-virulent protein sequences. An in-house Perl script was used to compute the PSI-BLAST-generated PSSM profiles; for this, an iterative PSI-BLAST search was performed against the SwissProt database with a cut-off E-value of 0.001.
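For illustration, the sketch below re-implements two of the simpler composition features (AAC and DPC) and shows one plausible way of generating PSSM profiles with BLAST+ psiblast; the study itself used GPSR 1.0 and an in-house Perl script, so the database name, iteration count and file names here are assumptions.

```python
# Hedged sketch of sequence-based feature calculation.
import subprocess
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq: str) -> list:
    """Amino acid composition: 20 values, percentage of each residue in the sequence."""
    seq = seq.upper()
    return [100.0 * seq.count(a) / len(seq) for a in AMINO_ACIDS]

def dpc(seq: str) -> list:
    """Dipeptide composition: 400 values, percentage of each ordered residue pair."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return [100.0 * pairs.count(a + b) / len(pairs)
            for a, b in product(AMINO_ACIDS, repeat=2)]

def make_pssm(query_fasta: str, pssm_out: str) -> None:
    """Generate an ASCII PSSM profile with an iterative PSI-BLAST search (E-value 0.001)."""
    subprocess.run(
        ["psiblast",
         "-query", query_fasta,
         "-db", "swissprot",        # assumed name of a local SwissProt BLAST database
         "-num_iterations", "3",    # iteration count is an assumption
         "-evalue", "0.001",
         "-out_ascii_pssm", pssm_out],
        check=True,
    )
```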
(E). Model training, evaluation and deployment: A total of 14 different ML algorithms available in AutoGluon were used for training and performance evaluation of the machine learning (ML) models (Table 3). The training dataset was used both to train the models (0.9 fraction) and to identify the best-performing models with the help of the validation dataset (the 0.1 holdout fraction of the training dataset). The models with the highest accuracy on the validation dataset are automatically selected and saved as the best models by AutoGluon. The saved models were further evaluated on the test dataset to estimate their real-life performance. The model that performed best on both the validation and test datasets, i.e., the PSSM-based model, is deployed on the VirulentPred 2.0 web server. Moreover, the code and the best model are also available for download, so that users can run them as standalone predictors on their desktops or workstations (a sketch of the AutoGluon workflow is given after Table 3).
Table 3. Machine learning algorithms available in AutoGluon used for model training and evaluation.

| Sr. No. | Algorithm name | Reference |
| --- | --- | --- |
| 1 | CatBoost | |
| 2 | ExtraTreesEntr | |
| 3 | ExtraTreesGini | |
| 4 | KNeighborsDist | https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html |
| 5 | KNeighborsUnif | https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html |
| 6 | LightGBM | |
| 7 | LightGBMLarge | |
| 8 | LightGBMXT | |
| 9 | NeuralNetFastAI | https://auto.gluon.ai/0.4.0/api/autogluon.tabular.models.html#nnfastaitabularmodel |
| 10 | NeuralNetTorch | https://auto.gluon.ai/0.4.0/api/autogluon.tabular.models.html#tabularneuralnettorchmodel |
| 11 | RandomForestEntr | https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html |
| 12 | RandomForestGini | https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html |
| 13 | WeightedEnsemble | https://www.cs.cornell.edu/~alexn/papers/shotgun.icml04.revised.rev2.pdf |
| 14 | XGBoost | |
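A minimal sketch of the AutoGluon workflow described in step (E) is given below. It assumes the train.csv and test.csv files from step (C) with a "label" column; the column name and output path are illustrative, not the authors' exact configuration.

```python
# Hedged sketch: train the AutoGluon model suite with a 0.1 holdout validation
# fraction, then rank the saved models on the independent test dataset.
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset("train.csv")
test_data = TabularDataset("test.csv")

predictor = TabularPredictor(label="label", eval_metric="accuracy",
                             path="virulentpred2_models")
predictor.fit(train_data, holdout_frac=0.1)  # 0.9 for training, 0.1 for internal validation

# The leaderboard reports validation and test performance for every trained model,
# including weighted ensembles such as WeightedEnsemble_L2.
print(predictor.leaderboard(test_data))
print(predictor.evaluate(test_data))
```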
Comparative performance of VirulentPred models: A comparative evaluation on the latest test dataset (from VirulentPred 2.0) was performed to assess the performance improvement achieved by VirulentPred 2.0 over the previously developed VirulentPred models. Table 4 provides the comparative performance of the best models from VirulentPred 2.0 and VirulentPred 1.0 (the reported metrics are defined in the short sketch after the table). VirulentPred 2.0 is clearly more accurate than its previous version. For example, for the cascade SVM classifier, an increase of approximately 7 percentage points in prediction accuracy (from 75.74% to 82.82%) is achieved with VirulentPred 2.0, whereas for the PSSM profile-based models, the increase is about 11 percentage points (from 74.19% to 85.18%). Therefore, the PSSM profile-based model is deployed as the best predictor model on the VirulentPred 2.0 web server and in the standalone software. This model was trained and evaluated with the "WeightedEnsemble_L2" technique of AutoGluon.
Table 4. Comparative performance of the best models from VirulentPred 2.0 and VirulentPred 1.0.

| Model type | Validation dataset accuracy (%) | Test dataset sensitivity (%) | Test dataset specificity (%) | Test dataset accuracy (%) | Test dataset MCC |
| --- | --- | --- | --- | --- | --- |
| VirulentPred 2.0 (Cascade classifier-based model) | 100 | 79.11 | 86.49 | 82.82 | 0.66 |
| VirulentPred (Cascade SVM-based classifier model) | N/A | 77.48 | 74.01 | 75.74 | 0.52 |
| VirulentPred 2.0 (PSSM profile-based model) | 84.71 | 85.33 | 85.02 | 85.18 | 0.70 |
| VirulentPred (PSSM profile-based model) | N/A | 68.44 | 79.88 | 74.19 | 0.49 |
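For reference, the metrics reported in Table 4 can be computed from confusion-matrix counts as in the short sketch below (illustrative code, not taken from the study).

```python
# Sensitivity, specificity, accuracy (all in %) and MCC from TP, TN, FP, FN counts.
import math

def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    sensitivity = tp / (tp + fn)                      # true positive rate
    specificity = tn / (tn + fp)                      # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity_pct": 100 * sensitivity,
            "specificity_pct": 100 * specificity,
            "accuracy_pct": 100 * accuracy,
            "mcc": mcc}
```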