The approach used to develop the VirulentPred 2.0 prediction models can be described in five steps (Figure 1; Steps A-E), given below:

(A). Data collection: The generation of the current positive dataset began with the retrieval of 3580 and 2852 virulent protein sequences from the VFDB and UniProt databases, respectively.

(B). Data pre-processing: The core dataset file was downloaded from VFDB, whereas sequences were obtained from UniProt using keywords such as virulence, adhesin, adhesion, toxin, invasion, capsule and other virulence-related terms. These 6432 sequences were then strictly screened to filter out entries labelled as "probable", "putative", "similarity", "fragments", "hypothetical", "unknown" or "possible" (Figure 1).

To build the negative dataset, annotated sequences of bacterial enzymes were downloaded from the UniProt database. These sequences were mainly retrieved from the bacterial proteomes for which virulent protein sequences had been obtained, and the resulting set was strictly screened to obtain a high-quality negative dataset.
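As an illustration of the keyword-based screening described in Steps (A) and (B), the minimal Python sketch below drops FASTA entries whose descriptions contain ambiguous annotation terms. The input/output file names and the exact term list are assumptions of this sketch, not the study's actual scripts.

```python
# Minimal sketch of the annotation-based screening described in Step (B).
# File names and the keyword list are illustrative assumptions.

AMBIGUOUS_TERMS = ("probable", "putative", "similarity", "fragment",
                   "hypothetical", "unknown", "possible")

def read_fasta(path):
    """Yield (header, sequence) tuples from a FASTA file."""
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)

def screen(records):
    """Drop entries whose description contains an ambiguous annotation term."""
    for header, seq in records:
        desc = header.lower()
        if not any(term in desc for term in AMBIGUOUS_TERMS):
            yield header, seq

if __name__ == "__main__":
    # "uniprot_virulent_raw.fasta" is a hypothetical input file name
    kept = list(screen(read_fasta("uniprot_virulent_raw.fasta")))
    with open("virulent_screened.fasta", "w") as out:
        for header, seq in kept:
            out.write(f">{header}\n{seq}\n")
    print(f"Retained {len(kept)} sequences after screening")
```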

Figure 1. A flowchart depicting the overall approach implemented for data collection, pre-processing, model building and deployment of the best model for VirulentPred 2.0.


Subsequently, the positive and negative datasets were clustered using CD-HIT at a sequence identity cut-off of 0.5 to remove redundant, highly similar sequences. The distribution of virulent protein sequences from different bacterial pathogens after refinement is shown in Table 1. Eventually, a non-redundant dataset of 6781 sequences was obtained, comprising 3375 virulent and 3406 non-virulent protein sequences, for the generation of models (Figure 1).
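A minimal sketch of this redundancy-reduction step is given below, assuming the CD-HIT binary is installed and available on the PATH; the file names and word-size setting are illustrative assumptions (CD-HIT requires the word size to match the chosen identity cut-off).

```python
# Minimal sketch of the CD-HIT redundancy-reduction step from Step (B).
import subprocess

def run_cdhit(in_fasta, out_fasta, identity=0.5, word_size=3):
    """Cluster sequences with CD-HIT at the given identity cut-off.

    For identity cut-offs in the 0.5-0.6 range, CD-HIT's recommended
    word size (-n) is 3.
    """
    cmd = [
        "cd-hit",
        "-i", in_fasta,       # input FASTA
        "-o", out_fasta,      # representative (non-redundant) sequences
        "-c", str(identity),  # sequence identity cut-off (0.5 as in Step B)
        "-n", str(word_size),
        "-d", "0",            # keep full sequence names in the .clstr file
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Hypothetical file names for the positive and negative sets
    run_cdhit("virulent_screened.fasta", "virulent_nr.fasta")
    run_cdhit("non_virulent_screened.fasta", "non_virulent_nr.fasta")
```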

Table 1. The distribution of virulent protein sequences from different bacterial pathogens.

Sr. No.   Bacteria genus      No. of protein sequences
1         Acinetobacter       42
2         Aeromonas           181
3         Anaplasma           21
4         Bacillus            92
5         Bartonella          64
6         Bordetella          62
7         Brucella            55
8         Burkholderia        143
9         Campylobacter       117
10        Chlamydia           52
11        Clostridium         41
12        Corynebacterium     16
13        Coxiella            147
14        Enterococcus        39
15        Escherichia         235
16        Francisella         99
17        Haemophilus         69
18        Helicobacter        101
19        Klebsiella          89
20        Legionella          383
21        Listeria            44
22        Mycobacterium       172
23        Mycoplasma          26
24        Neisseria           53
25        Pseudomonas         235
26        Rickettsia          28
27        Salmonella          156
28        Shigella            74
29        Staphylococcus      140
30        Streptococcus       121
31        Vibrio              165
32        Yersinia            113


(C). Dataset preparation: Preparation of training and test datasets is an essential requirement for a machine learning study. In the present study, the positive and negative datasets were randomly shuffled and each divided into 80% training and 20% test datasets (an illustrative splitting sketch is given after Table 2). Table 2 provides the distribution of positive and negative dataset protein sequences used in the present study. Further, the training dataset was randomly divided into an actual training set (0.9 fraction of the data) and a validation set (0.1 holdout fraction) through an in-built feature of AutoGluon, for the training and internal evaluation of the ML models, respectively.

Table 2. Distribution of protein sequences in the training and test datasets.

Dataset type       Total number of protein sequences   Training dataset (80%)   Test dataset (20%)
Positive dataset   3375                                2700                     675
Negative dataset   3406                                2725                     681
Both datasets      6781                                5425                     1356
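As an illustration of the 80%/20% split in Step (C), the following Python sketch uses scikit-learn's train_test_split with stratification to approximate the per-class random shuffling and splitting described above. The file name "features.csv" and its "label" column are assumptions of this sketch.

```python
# Illustrative 80/20 split corresponding to Step (C) and Table 2.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("features.csv")   # one row per protein, plus a class label column
train_df, test_df = train_test_split(
    data,
    test_size=0.20,              # 20% held out as the independent test set
    stratify=data["label"],      # keep the positive/negative ratio in both splits
    shuffle=True,
    random_state=42,
)
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)
# The further 0.9/0.1 training/validation split is handled internally by
# AutoGluon during model fitting, as described in Step (E).
```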


(D). Sequence-based feature calculation: The standalone package GPSR 1.0 was used to calculate the amino acid composition (AAC), dipeptide composition (DPC), tripeptide composition (TPC) and binary profile patterns (BPP) of the virulent and non-virulent protein sequences. An in-house Perl script was used to calculate the PSI-BLAST-generated PSSM profiles. For this calculation, an iterative PSI-BLAST search was performed against the SwissProt database with a cut-off E-value of 0.001.
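As an illustration of Step (D), the sketch below computes the 20-dimensional amino acid composition in plain Python and shows how an ASCII PSSM could be generated with NCBI BLAST+'s psiblast, matching the stated SwissProt database and E-value cut-off of 0.001. The number of iterations, file names and database path are assumptions of this sketch; the original study used GPSR 1.0 and an in-house Perl script.

```python
# Minimal sketch of two feature types named in Step (D): amino acid
# composition (AAC) and a PSI-BLAST PSSM profile.
import subprocess
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(sequence):
    """Return the 20-dimensional amino acid composition (fractions) of a protein."""
    sequence = sequence.upper()
    counts = Counter(sequence)
    length = len(sequence) or 1
    return [counts.get(aa, 0) / length for aa in AMINO_ACIDS]

def pssm_profile(query_fasta, out_pssm, db="swissprot", iterations=3):
    """Generate an ASCII PSSM with PSI-BLAST (NCBI BLAST+ must be installed).

    The database path and number of iterations are assumptions of this sketch;
    only the SwissProt target and the 0.001 E-value cut-off are stated in the text.
    """
    subprocess.run([
        "psiblast",
        "-query", query_fasta,
        "-db", db,
        "-num_iterations", str(iterations),
        "-evalue", "0.001",
        "-out_ascii_pssm", out_pssm,
    ], check=True)

# Example: composition vector for an arbitrary illustrative sequence
print(aac("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```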

(E). Model training, evaluation and deployment: A total of 14 different ML algorithms available with AutoGluon were used for the training and performance evaluation of the machine learning (ML) models (Table 3; an illustrative AutoGluon training sketch is given after the table). The training dataset (0.9 fraction) was used to train the ML models, and the best performing models were identified with the help of the validation dataset (the 0.1 holdout fraction of the training dataset). The models with the highest accuracy on the validation dataset were automatically selected and saved as the best models by AutoGluon. The saved models were further evaluated on the test dataset to estimate their real-life performance. The model performing best on both the validation and test datasets, i.e. the PSSM-based model, was deployed on the VirulentPred 2.0 web server. Moreover, the code and the best model are also available for download, so that users can run them as standalone predictors on their desktops or workstations.

Table 3. List of algorithms (available with AutoGluon) used for the training and evaluation of VirulentPred 2.0 ML models.

Sr. No.   Algorithm name      Reference
1         CatBoost            https://catboost.ai/
2         ExtraTreesEntr      https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn.ensemble.ExtraTreesClassifier
3         ExtraTreesGini      https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn.ensemble.ExtraTreesClassifier
4         KNeighborsDist      https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
5         KNeighborsUnif      https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
6         LightGBM            https://lightgbm.readthedocs.io/en/latest/
7         LightGBMLarge       https://lightgbm.readthedocs.io/en/latest/
8         LightGBMXT          https://lightgbm.readthedocs.io/en/latest/
9         NeuralNetFastAI     https://auto.gluon.ai/0.4.0/api/autogluon.tabular.models.html#nnfastaitabularmodel
10        NeuralNetTorch      https://auto.gluon.ai/0.4.0/api/autogluon.tabular.models.html#tabularneuralnettorchmodel
11        RandomForestEntr    https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
12        RandomForestGini    https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
13        WeightedEnsemble    https://www.cs.cornell.edu/~alexn/papers/shotgun.icml04.revised.rev2.pdf
14        XGBoost             https://xgboost.readthedocs.io/en/latest/
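The following sketch illustrates how a comparable AutoGluon run could be set up for Step (E), assuming the training and test feature tables are stored as train.csv and test.csv with a "label" column; these names and the exact training settings are assumptions rather than the study's actual configuration.

```python
# Illustrative AutoGluon training run corresponding to Step (E).
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset("train.csv")  # hypothetical feature table from Steps (C)-(D)
test_data = TabularDataset("test.csv")

predictor = TabularPredictor(label="label", eval_metric="accuracy").fit(
    train_data,
    holdout_frac=0.1,   # internal 90/10 training/validation split (Step C)
)

# The leaderboard ranks all trained models (CatBoost, LightGBM, WeightedEnsemble, ...)
# by validation score; evaluate() estimates real-life performance on the test set.
print(predictor.leaderboard(test_data))
print(predictor.evaluate(test_data))
```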


Comparative performance of VirulentPred models: A comparative evaluation on the latest test dataset (from VirulentPred 2.0) was used to assess the performance improvement achieved by VirulentPred 2.0 with respect to the previously developed VirulentPred models. Table 4 provides the comparative performance of the best models from VirulentPred 2.0 and VirulentPred 1.0. It can be seen that VirulentPred 2.0 is considerably more accurate than its previous version. For example, in the case of the "Cascade SVM classifier", the prediction accuracy increased by approximately 7 percentage points (from 75.74% to 82.82%) with VirulentPred 2.0, whereas for the PSSM profile-based models it increased by about 11 percentage points (from 74.19% to 85.18%). Therefore, the PSSM profile-based model was deployed as the best predictor model on the VirulentPred 2.0 web server and in the standalone software. This model was trained and evaluated with the "WeightedEnsemble_L2" technique of AutoGluon.
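The sensitivity, specificity, accuracy and MCC values reported in Table 4 follow their standard confusion-matrix definitions; the small helper below is only an illustration of these formulas, not the evaluation code used in the study.

```python
# Standard confusion-matrix metrics used in Table 4 (illustrative helper).
import math

def metrics(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)                    # true positive rate
    specificity = tn / (tn + fp)                    # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ((tp * tn) - (fp * fn)) / denom if denom else 0.0
    return sensitivity, specificity, accuracy, mcc
```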

Table 4. Performance of the best models from VirulentPred 2.0 and VirulentPred (1.0) on the latest test dataset.

Model type                                           Validation accuracy (%)   Test sensitivity (%)   Test specificity (%)   Test accuracy (%)   Test MCC
VirulentPred 2.0 (Cascade classifier-based model)    100                       79.11                  86.49                  82.82               0.66
VirulentPred (Cascade SVM-based classifier model)    N/A                       77.48                  74.01                  75.74               0.52
VirulentPred 2.0 (PSSM profile-based model)          84.71                     85.33                  85.02                  85.18               0.70
VirulentPred (PSSM profile-based model)              N/A                       68.44                  79.88                  74.19               0.49