The approach used to develop the VirulentPred 2.0 prediction models can be described in five steps (Figure 1; Steps A-E), given below:

(A). Data collection: The generation of the current positive dataset began with the retrieval of 3580 and 2852 virulent protein sequences from the VFDB and UniProt databases, respectively.

(B). Data pre-processing: The core dataset file was downloaded from VFDB, whereas sequences were obtained from UniProt using keywords such as virulence, adhesin, adhesion, toxin, invasion, capsule and other virulence-related terms. These 6432 sequences were then strictly screened to filter out entries labelled as "probable", "putative", "similarity", "fragments", "hypothetical", "unknown" or "possible" (Figure 1).

To build the negative dataset, annotated sequences of bacterial enzymes were downloaded from the UniProt database. These sequences were mainly retrieved from the bacterial proteomes for which virulent protein sequences had been obtained, and the resulting set was strictly screened to obtain a high-quality negative dataset.
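As an illustration of the keyword-based screening described in Steps (A) and (B), the minimal Python sketch below drops FASTA entries whose descriptions contain ambiguous annotation terms. The input/output file names and the exact term list are assumptions of this sketch, not the study's actual scripts.

```python
# Minimal sketch of the annotation-based screening described in Step (B).
# File names and the keyword list are illustrative assumptions.

AMBIGUOUS_TERMS = ("probable", "putative", "similarity", "fragment",
                   "hypothetical", "unknown", "possible")

def read_fasta(path):
    """Yield (header, sequence) tuples from a FASTA file."""
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)

def screen(records):
    """Drop entries whose description contains an ambiguous annotation term."""
    for header, seq in records:
        desc = header.lower()
        if not any(term in desc for term in AMBIGUOUS_TERMS):
            yield header, seq

if __name__ == "__main__":
    # "uniprot_virulent_raw.fasta" is a hypothetical input file name
    kept = list(screen(read_fasta("uniprot_virulent_raw.fasta")))
    with open("virulent_screened.fasta", "w") as out:
        for header, seq in kept:
            out.write(f">{header}\n{seq}\n")
    print(f"Retained {len(kept)} sequences after screening")
```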

Figure 1. A flowchart depicting the overall approach implemented for data collection, pre-processing, model building and deployment of the best model for VirulentPred 2.0.


Subsequently, the positive and negative datasets were clustered using CD-HIT at a sequence identity cut-off of 0.5 to remove redundant, highly similar sequences. The distribution of virulent protein sequences from different bacterial pathogens after refinement is shown in Table 1. Eventually, a non-redundant dataset of 6781 sequences was obtained, comprising 3375 virulent and 3406 non-virulent protein sequences, for the generation of models (Figure 1).
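A minimal sketch of this redundancy-reduction step is given below, assuming the CD-HIT binary is installed and available on the PATH; the file names and word-size setting are illustrative assumptions (CD-HIT requires the word size to match the chosen identity cut-off).

```python
# Minimal sketch of the CD-HIT redundancy-reduction step from Step (B).
import subprocess

def run_cdhit(in_fasta, out_fasta, identity=0.5, word_size=3):
    """Cluster sequences with CD-HIT at the given identity cut-off.

    For identity cut-offs in the 0.5-0.6 range, CD-HIT's recommended
    word size (-n) is 3.
    """
    cmd = [
        "cd-hit",
        "-i", in_fasta,       # input FASTA
        "-o", out_fasta,      # representative (non-redundant) sequences
        "-c", str(identity),  # sequence identity cut-off (0.5 as in Step B)
        "-n", str(word_size),
        "-d", "0",            # keep full sequence names in the .clstr file
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Hypothetical file names for the positive and negative sets
    run_cdhit("virulent_screened.fasta", "virulent_nr.fasta")
    run_cdhit("non_virulent_screened.fasta", "non_virulent_nr.fasta")
```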

Table 1. The distribution of virulent protein sequences from different bacterial pathogens.

Sr. No.   Bacteria genus      No. of protein sequences
1         Acinetobacter       42
2         Aeromonas           181
3         Anaplasma           21
4         Bacillus            92
5         Bartonella          64
6         Bordetella          62
7         Brucella            55
8         Burkholderia        143
9         Campylobacter       117
10        Chlamydia           52
11        Clostridium         41
12        Corynebacterium     16
13        Coxiella            147
14        Enterococcus        39
15        Escherichia         235
16        Francisella         99
17        Haemophilus         69
18        Helicobacter        101
19        Klebsiella          89
20        Legionella          383
21        Listeria            44
22        Mycobacterium       172
23        Mycoplasma          26
24        Neisseria           53
25        Pseudomonas         235
26        Rickettsia          28
27        Salmonella          156
28        Shigella            74
29        Staphylococcus      140
30        Streptococcus       121
31        Vibrio              165
32        Yersinia            113


(C). Dataset preparation: Preparation of training and test datasets is an essential requirement for a machine learning study. In the present study, the positive and negative datasets were randomly shuffled and each divided into 80% training and 20% test datasets (an illustrative splitting sketch is given after Table 2). Table 2 provides the distribution of positive and negative dataset protein sequences used in the present study. Further, the training dataset was randomly divided into an actual training set (0.9 fraction of the data) and a validation set (0.1 holdout fraction) through an in-built feature of AutoGluon, for the training and internal evaluation of the ML models, respectively.

Table 2. Distribution of protein sequences in the training and test datasets.

Dataset type       Total number of protein sequences   Training dataset (80%)   Test dataset (20%)
Positive dataset   3375                                2700                     675
Negative dataset   3406                                2725                     681
Both datasets      6781                                5425                     1356
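As an illustration of the 80%/20% split in Step (C), the following Python sketch uses scikit-learn's train_test_split with stratification to approximate the per-class random shuffling and splitting described above. The file name "features.csv" and its "label" column are assumptions of this sketch.

```python
# Illustrative 80/20 split corresponding to Step (C) and Table 2.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("features.csv")   # one row per protein, plus a class label column
train_df, test_df = train_test_split(
    data,
    test_size=0.20,              # 20% held out as the independent test set
    stratify=data["label"],      # keep the positive/negative ratio in both splits
    shuffle=True,
    random_state=42,
)
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)
# The further 0.9/0.1 training/validation split is handled internally by
# AutoGluon during model fitting, as described in Step (E).
```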


(D). Sequence-based feature calculation: The standalone package GPSR 1.0 was used to calculate the amino acid composition (AAC), dipeptide composition (DPC), tripeptide composition (TPC) and binary profile patterns (BPP) of the virulent and non-virulent protein sequences. An in-house Perl script was used to calculate the PSI-BLAST-generated PSSM profiles. For this calculation, an iterative PSI-BLAST search was performed against the SwissProt database with a cut-off E-value of 0.001.
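As an illustration of Step (D), the sketch below computes the 20-dimensional amino acid composition in plain Python and shows how an ASCII PSSM could be generated with NCBI BLAST+'s psiblast, matching the stated SwissProt database and E-value cut-off of 0.001. The number of iterations, file names and database path are assumptions of this sketch; the original study used GPSR 1.0 and an in-house Perl script.

```python
# Minimal sketch of two feature types named in Step (D): amino acid
# composition (AAC) and a PSI-BLAST PSSM profile.
import subprocess
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(sequence):
    """Return the 20-dimensional amino acid composition (fractions) of a protein."""
    sequence = sequence.upper()
    counts = Counter(sequence)
    length = len(sequence) or 1
    return [counts.get(aa, 0) / length for aa in AMINO_ACIDS]

def pssm_profile(query_fasta, out_pssm, db="swissprot", iterations=3):
    """Generate an ASCII PSSM with PSI-BLAST (NCBI BLAST+ must be installed).

    The database path and number of iterations are assumptions of this sketch;
    only the SwissProt target and the 0.001 E-value cut-off are stated in the text.
    """
    subprocess.run([
        "psiblast",
        "-query", query_fasta,
        "-db", db,
        "-num_iterations", str(iterations),
        "-evalue", "0.001",
        "-out_ascii_pssm", out_pssm,
    ], check=True)

# Example: composition vector for an arbitrary illustrative sequence
print(aac("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```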

(E). Model training, evaluation and deployment: A total of 14 different ML algorithms available with AutoGluon were used for the training and performance evaluation of the machine learning (ML) models (Table 3; an illustrative AutoGluon training sketch is given after the table). The training dataset (0.9 fraction) was used to train the ML models, and the best performing models were identified with the help of the validation dataset (the 0.1 holdout fraction of the training dataset). The models with the highest accuracy on the validation dataset were automatically selected and saved as the best models by AutoGluon. The saved models were further evaluated on the test dataset to estimate their real-life performance. The model performing best on both the validation and test datasets, i.e. the PSSM-based model, was deployed on the VirulentPred 2.0 web server. Moreover, the code and the best model are also available for download, so that users can run them as standalone predictors on their desktops or workstations.

Table 3. List of algorithms (available with AutoGluon) used for the training and evaluation of VirulentPred 2.0 ML models.

Sr. No.   Algorithm name      Reference
1         CatBoost            https://catboost.ai/
2         ExtraTreesEntr      https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn.ensemble.ExtraTreesClassifier
3         ExtraTreesGini      https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn.ensemble.ExtraTreesClassifier
4         KNeighborsDist      https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
5         KNeighborsUnif      https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
6         LightGBM            https://lightgbm.readthedocs.io/en/latest/
7         LightGBMLarge       https://lightgbm.readthedocs.io/en/latest/
8         LightGBMXT          https://lightgbm.readthedocs.io/en/latest/
9         NeuralNetFastAI     https://auto.gluon.ai/0.4.0/api/autogluon.tabular.models.html#nnfastaitabularmodel
10        NeuralNetTorch      https://auto.gluon.ai/0.4.0/api/autogluon.tabular.models.html#tabularneuralnettorchmodel
11        RandomForestEntr    https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
12        RandomForestGini    https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
13        WeightedEnsemble    https://www.cs.cornell.edu/~alexn/papers/shotgun.icml04.revised.rev2.pdf
14        XGBoost             https://xgboost.readthedocs.io/en/latest/
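The following sketch illustrates how a comparable AutoGluon run could be set up for Step (E), assuming the training and test feature tables are stored as train.csv and test.csv with a "label" column; these names and the exact training settings are assumptions rather than the study's actual configuration.

```python
# Illustrative AutoGluon training run corresponding to Step (E).
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset("train.csv")  # hypothetical feature table from Steps (C)-(D)
test_data = TabularDataset("test.csv")

predictor = TabularPredictor(label="label", eval_metric="accuracy").fit(
    train_data,
    holdout_frac=0.1,   # internal 90/10 training/validation split (Step C)
)

# The leaderboard ranks all trained models (CatBoost, LightGBM, WeightedEnsemble, ...)
# by validation score; evaluate() estimates real-life performance on the test set.
print(predictor.leaderboard(test_data))
print(predictor.evaluate(test_data))
```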


Comparative performance of VirulentPred models: A comparative evaluation on the latest test dataset (from VirulentPred 2.0) was used to assess the performance improvement achieved by VirulentPred 2.0 with respect to the previously developed VirulentPred models. Table 4 provides the comparative performance of the best models from VirulentPred 2.0 and VirulentPred 1.0. It can be seen that VirulentPred 2.0 is considerably more accurate than its previous version. For example, in the case of the "Cascade SVM classifier", the prediction accuracy increased by approximately 7 percentage points (from 75.74% to 82.82%) with VirulentPred 2.0, whereas for the PSSM profile-based models it increased by about 11 percentage points (from 74.19% to 85.18%). Therefore, the PSSM profile-based model was deployed as the best predictor model on the VirulentPred 2.0 web server and in the standalone software. This model was trained and evaluated with the "WeightedEnsemble_L2" technique of AutoGluon.
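The sensitivity, specificity, accuracy and MCC values reported in Table 4 follow their standard confusion-matrix definitions; the small helper below is only an illustration of these formulas, not the evaluation code used in the study.

```python
# Standard confusion-matrix metrics used in Table 4 (illustrative helper).
import math

def metrics(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)                    # true positive rate
    specificity = tn / (tn + fp)                    # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ((tp * tn) - (fp * fn)) / denom if denom else 0.0
    return sensitivity, specificity, accuracy, mcc
```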

Table 4. Performance of the best models from VirulentPred 2.0 and VirulentPred (1.0) on the latest test dataset.

Model type                                           Validation accuracy (%)   Test sensitivity (%)   Test specificity (%)   Test accuracy (%)   Test MCC
VirulentPred 2.0 (Cascade classifier-based model)    100                       79.11                  86.49                  82.82               0.66
VirulentPred (Cascade SVM-based classifier model)    N/A                       77.48                  74.01                  75.74               0.52
VirulentPred 2.0 (PSSM profile-based model)          84.71                     85.33                  85.02                  85.18               0.70
VirulentPred (PSSM profile-based model)              N/A                       68.44                  79.88                  74.19               0.49