Dataset Information: The training dataset in the present work was compiled from several database sources, including Swiss-Prot and RefSeq, and reduced to 40% non-redundancy using the PISCES program. The positive dataset contains full-length annotated lipocalins, and the negative dataset contains non-lipocalins.
Training Datasets:
Positive Training Dataset
Negative Training Dataset
Supplementary Datasets:
Blind Test Positive Dataset
Blind Test Negative Dataset
Support Vector Machine:
Support vector machines (SVMs) are universal binary classifiers based on statistical learning and optimization theory. SVMs are particularly attractive for biological analysis because they can handle noise, large datasets and large input spaces, and can map non-linear input data into a high-dimensional feature space with minimum error on the training set. The feature vectors are mapped into this feature space by a kernel function. During binary classification, the SVM constructs a hyperplane in the feature space that optimally separates the two classes of feature vectors, namely the hyperplane that maximizes the margin between the two classes.
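As an illustration of how a kernel and a separating hyperplane combine into a prediction, the following minimal sketch evaluates an RBF kernel and the resulting decision value for a new feature vector. The support vectors, coefficients, bias and gamma value are made-up toy numbers, not values from the trained models, and the function names are hypothetical.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def decision_value(x, support_vectors, coeffs, bias, gamma=0.5):
    """f(x) = sum_i coeffs[i] * K(sv_i, x) + bias.

    coeffs[i] stands for alpha_i * y_i of the i-th support vector.
    The sign of f(x) gives the predicted class.
    """
    total = sum(c * rbf_kernel(sv, x, gamma)
                for sv, c in zip(support_vectors, coeffs))
    return total + bias


# Toy model (assumed values): one support vector per class.
svs = [[0.0, 0.0], [1.0, 1.0]]
coeffs = [-1.0, 1.0]

score = decision_value([0.9, 0.9], svs, coeffs, bias=0.0)
label = 1 if score > 0 else -1
print(score, label)
```

A point close to the positive support vector receives a positive score, and a point close to the negative one receives a negative score.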
In this study, we used the freely available package svm_light. The software lets users define a number of parameters and select any of the built-in kernel functions, including linear, RBF and polynomial (of a given degree), or supply a user-defined kernel.
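svm_light reads each training example as a line of the form "label index:value index:value ...", with feature indices starting at 1 and listed in increasing order. The sketch below formats a feature vector in that layout. The helper name and the file names in the comments are hypothetical, and the exact command-line options should be checked against the package documentation.

```python
def to_svmlight_line(label, features):
    """Format one example as '<label> <index>:<value> ...'.

    Indices are 1-based and increasing; zero-valued features are
    skipped because the format is sparse.
    """
    parts = ["+1" if label > 0 else "-1"]
    for index, value in enumerate(features, start=1):
        if value != 0:
            parts.append(f"{index}:{value:.6g}")
    return " ".join(parts)


line = to_svmlight_line(+1, [0.5, 0.0, 0.25])
print(line)

# Assumed usage with hypothetical file names (kernel type 2 is RBF):
#   svm_learn -t 2 -g <gamma> -c <C> train.dat model
#   svm_classify test.dat model predictions
```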
Evaluation of Modules:
The performance of all modules developed in this method was evaluated using jackknife, or leave-one-out, cross-validation (LOOCV). In LOOCV, one sequence was used as the testing data (for validation of the generated model) while the remaining sequences in the dataset were used as the training data to develop a model. This was iterated N times, until each sequence in the dataset had become the testing data exactly once. The performance of each module was assessed by calculating the accuracy, sensitivity, specificity and Matthews correlation coefficient (MCC). The formulas of the evaluation parameters are shown below.
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
MCC = (TP x TN - FP x FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
where TP: True positive, TN: True negative, FN: False negative, FP: False positive
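The leave-one-out procedure and the four evaluation measures can be sketched as follows, assuming the standard definitions of sensitivity, specificity, accuracy and MCC. The function names and the example counts are illustrative only and are not results from this study; values are returned as fractions rather than percentages.

```python
import math


def leave_one_out_splits(n):
    """Yield (training_indices, test_index) so each item is tested once."""
    for i in range(n):
        yield [j for j in range(n) if j != i], i


def evaluate(tp, tn, fp, fn):
    """Return the evaluation measures computed from confusion-matrix counts."""
    total = tp + tn + fp + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
        # MCC is reported as 0 when any marginal count is zero.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


splits = list(leave_one_out_splits(4))
metrics = evaluate(tp=40, tn=45, fp=5, fn=10)  # made-up counts
print(splits[0], metrics)
```

In a full run, a model would be trained on each set of training indices, the held-out sequence would be classified, and the TP, TN, FP and FN counts would be accumulated over all N iterations before calling evaluate.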
ROC:- For each SVM module, threshold-independent performance was measured by plotting a ROC (Receiver Operating Characteristic) curve of the true positive rate (sensitivity) against the false positive rate (1 - specificity). The ROC curves summarize the performance of all SVM modules optimized with their best parameters. The area under the curve (AUC) of the PSSM-based model was the highest among all modules.
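A ROC curve can be traced by sweeping the decision threshold over the SVM scores and recording the false positive rate and true positive rate at each step; the AUC then follows from the trapezoidal rule. The sketch below is a minimal version that assumes labels of +1 and -1 with at least one example of each class. The scores and labels are toy values, not outputs of the modules described here.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the threshold stepwise."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, -1, 1, -1, -1]
curve = roc_points(scores, labels)
area = auc(curve)
print(curve, area)
```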
Prediction Approaches:
The following SVM modules were constructed to capture global information about the protein sequence for accurate prediction.
Amino Acid Composition based SVM:- The SVM was provided with a 20-dimensional vector based on the amino acid composition of each protein, where amino acid composition is the fraction of each amino acid in the protein. For the composition-based modules, the best results were achieved using the RBF kernel.
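A minimal sketch of the 20-dimensional composition vector is given below. The ordering of the amino acids and the decision to ignore non-standard residue codes are assumptions made for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues


def amino_acid_composition(sequence):
    """Return the fraction of each of the 20 amino acids in the sequence."""
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    n = len(residues)
    return [residues.count(a) / n for a in AMINO_ACIDS]


vector = amino_acid_composition("AAC")
print(len(vector), vector[:2])
```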
Dipeptide Composition based SVM:- An SVM was developed on the basis of the dipeptide composition of proteins, which gives a feature space of 400 (20 x 20) dimensions. The best results were achieved using the RBF kernel.
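The 400-dimensional dipeptide vector can be sketched in the same way, counting each overlapping pair of adjacent residues and dividing by the total number of pairs. The ordering of the dipeptides and the skipping of pairs that contain non-standard residues are assumptions made for this example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the fraction of each of the 400 dipeptides in the sequence."""
    seq = sequence.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pairs = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    if not pairs:
        raise ValueError("sequence contains no standard dipeptides")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for p in pairs:
        counts[p] += 1
    return [counts[d] / len(pairs) for d in DIPEPTIDES]


vector = dipeptide_composition("ACAC")  # pairs: AC, CA, AC
print(len(vector), vector[DIPEPTIDES.index("AC")])
```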
Secondary Structure Composition (SSC):- Secondary structure is an important feature of cyclins because of their characteristic helical domains, which adopt helix-rich cyclin folds containing the cyclin box. The secondary structure of each residue was predicted by PSIPRED. The scores for helix, sheet and coil of the individual residues were extracted and averaged for each amino acid type, giving a matrix of 60 (20 x 3) values.
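One way to obtain a 20 x 3 feature vector from per-residue predictions is to average the helix, sheet and coil scores separately over all residues of the same amino acid type. The sketch below follows that reading, which is an assumption about how the averaging was done. The input is a hypothetical list of (helix, sheet, coil) score triples, one per residue, and parsing of the PSIPRED output file is not shown.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues


def secondary_structure_composition(sequence, scores):
    """Average (helix, sheet, coil) scores per amino acid type.

    Returns a flat list of 60 values: three averages for each of the
    20 amino acids. Amino acids absent from the sequence contribute zeros.
    """
    if len(sequence) != len(scores):
        raise ValueError("need exactly one score triple per residue")
    sums = {a: [0.0, 0.0, 0.0] for a in AMINO_ACIDS}
    counts = {a: 0 for a in AMINO_ACIDS}
    for residue, triple in zip(sequence.upper(), scores):
        if residue in sums:
            counts[residue] += 1
            for k in range(3):
                sums[residue][k] += triple[k]
    vector = []
    for a in AMINO_ACIDS:
        n = counts[a]
        vector.extend(s / n if n else 0.0 for s in sums[a])
    return vector


# Toy scores (made up), ordered as (helix, sheet, coil).
toy_scores = [(0.8, 0.1, 0.1), (0.6, 0.2, 0.2), (0.1, 0.7, 0.2)]
vector = secondary_structure_composition("AAC", toy_scores)
print(len(vector), vector[:6])
```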
Position-Specific Scoring Matrix (PSSM) Composition:- This model was designed using PSI-BLAST, which can detect remote homologies. For each residue position, the PSSM holds 20 substitution scores, which provide evolutionary information about the protein at the level of residue types. Three iterations of PSI-BLAST were carried out at an E-value cut-off of 0.001. Each value in the PSSM represents the likelihood of a particular residue substitution at a specific position of the protein; these values were first normalized to between 0 and 1 using the logistic function.
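The logistic normalization maps each PSSM score x to 1 / (1 + exp(-x)). The sketch below applies it to a matrix given as a list of 20-value rows and then condenses the result into a fixed-length vector. The condensing step (summing the rows belonging to each amino acid type and dividing by the sequence length, which gives 20 x 20 = 400 values) is an assumption made for this example, since the composition step is not spelled out above. Parsing of the PSI-BLAST output is not shown.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues


def logistic(x):
    """Map a raw PSSM score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def normalize_pssm(pssm):
    """Apply the logistic function to every score in the matrix."""
    return [[logistic(score) for score in row] for row in pssm]


def pssm_composition(sequence, pssm):
    """Assumed composition step: 400 values, one 20-value block per residue type."""
    normalized = normalize_pssm(pssm)
    blocks = {a: [0.0] * 20 for a in AMINO_ACIDS}
    for residue, row in zip(sequence.upper(), normalized):
        if residue in blocks:
            for k in range(20):
                blocks[residue][k] += row[k]
    length = len(sequence)
    return [value / length for a in AMINO_ACIDS for value in blocks[a]]


# Toy matrix (made up): two positions, all raw scores zero.
toy_pssm = [[0] * 20, [0] * 20]
vector = pssm_composition("AC", toy_pssm)
print(len(vector), vector[0])
```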
Hybrid Approach based Prediction:- To improve prediction accuracy, we adopted several strategies. We developed eight different hybrid modules on the basis of amino acid, dipeptide, secondary structure and PSSM compositions.
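A common way to build a hybrid input is to concatenate the individual feature vectors into a single longer vector before passing it to the SVM. The sketch below shows that idea under the assumption that the hybrid modules were formed by concatenation; the particular combinations used in the eight modules are not listed here, and the placeholder vectors are made up.

```python
def hybrid_vector(*feature_vectors):
    """Concatenate several feature vectors into one input vector."""
    combined = []
    for vector in feature_vectors:
        combined.extend(vector)
    return combined


# Placeholder vectors with the dimensions of two of the feature types.
amino_acid_features = [0.05] * 20
dipeptide_features = [0.0025] * 400

combined = hybrid_vector(amino_acid_features, dipeptide_features)
print(len(combined))
```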