The training dataset used in the present work was obtained from several database sources, including SwissProt, RefSeq and EMBL, and sequence redundancy was reduced to 30% identity using the program PISCES. This means that no two sequences are more than 30% identical. The dataset contains only complete protein sequences that have been experimentally determined to be cyclins. The final training dataset therefore has cyclins as the positive dataset and non-cyclins as the negative dataset.
[Figure or table labels: Positive Training Dataset; Negative Training Dataset; Blind Test Dataset; Organisms; Classification; Cyclins Subfamilies]
Support Vector Machine:
Support vector machines (SVMs) are universal binary classifiers based on statistical learning and optimization theory. The SVM is particularly attractive for biological analysis because it can handle noise, large datasets and large input spaces, and because it maps non-linear input data into a high-dimensional feature space with minimum error on the training set. During binary classification, it constructs a hyperplane in the feature space that optimally separates two different classes of feature vectors. These feature vectors are mapped into the feature space by a kernel function. The hyperplane found by the SVM is the one that maximizes the separating margin between the two classes. This property makes the SVM superior to many other classifiers based on artificial intelligence. The basic idea of SVM is depicted below:
In this study, we used SVM_light to predict novel cyclins. The software is freely downloadable from http://www.cs.cornell.edu/People/tj/svm_light/. It enables users to define a number of parameters and to choose among inbuilt kernel functions, including linear, RBF and polynomial (of a given degree), or to supply a user-defined kernel.
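The kernel choices named above can be illustrated with a minimal sketch. It uses the textbook forms of the linear, polynomial and RBF kernels and the generic kernel decision function; the function and parameter names (`degree`, `scale`, `coef0`, `gamma`, `decision_value`) are illustrative assumptions and are not taken from the SVM_light source code.

```python
import math


def dot(a, b):
    """Inner product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def linear_kernel(a, b):
    """K(a, b) = a . b"""
    return dot(a, b)


def polynomial_kernel(a, b, degree=3, scale=1.0, coef0=1.0):
    """K(a, b) = (scale * a . b + coef0) ** degree"""
    return (scale * dot(a, b) + coef0) ** degree


def rbf_kernel(a, b, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2)"""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)


def decision_value(x, support_vectors, coeffs, bias, kernel):
    """Generic kernel decision function.

    coeffs[i] is assumed to hold alpha_i * y_i for support vector i.
    A positive return value means the positive class (here: cyclin).
    """
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, coeffs)) + bias
```

With an RBF kernel, a query vector close to a positive support vector and far from a negative one yields a positive decision value, which is the behaviour the margin-maximizing hyperplane encodes in the feature space.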
Evaluation of Modules:
The performance of all modules developed in this method was evaluated using jackknife (leave-one-out) cross-validation (LOOCV). In LOOCV, one sequence is used as the testing data (for validation of the generated model) while the remaining sequences in the dataset are used as the training data to develop the model. This is iterated N times, until each sequence in the dataset has been the testing data exactly once. The performance of each module was assessed by calculating the accuracy, sensitivity, specificity and Matthews correlation coefficient (MCC). The formulas of the evaluation parameters are shown below.
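Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
MCC = (TP x TN - FP x FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
where TP: true positives, TN: true negatives, FN: false negatives, FP: false positives.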
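The leave-one-out procedure and the four evaluation parameters can be sketched as follows. This is a minimal illustration using the standard definitions of the measures; the function names and the +1/-1 label convention are assumptions made for the example, and the classifier training step itself is left out.

```python
import math


def leave_one_out_splits(n):
    """Yield (train_indices, test_index) so each item is held out exactly once."""
    for held_out in range(n):
        train = [i for i in range(n) if i != held_out]
        yield train, held_out


def confusion_counts(true_labels, predicted_labels):
    """Count TP, TN, FP, FN; label +1 is positive, anything else negative."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluation_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and MCC from confusion counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }
```

In use, a model would be trained on each `train` index list, the held-out sequence predicted, and the N collected predictions passed to `confusion_counts` and then `evaluation_metrics`.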
ROC:- For each SVM module, threshold-independent performance was measured by plotting the ROC (Receiver Operating Characteristic) curve of the true positive rate (sensitivity) against the false positive rate (1-specificity). The ROC curve summarizes the performance of each SVM module optimized with its best parameters. The AUC (area under the curve) of the PSSM-based model was the highest among all modules.
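A minimal sketch of how an ROC curve and its AUC can be computed from SVM decision values is given here. It sweeps the threshold over the sorted scores and integrates with the trapezoidal rule; the function names and the +1 label for positives are assumptions for illustration, and this is not the plotting procedure used in the study.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points.

    scores: decision values, higher means more likely positive.
    labels: +1 for positives, anything else for negatives.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

An AUC of 1.0 corresponds to a perfect ranking of positives above negatives, and 0.5 to a random ranking.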
The following SVM modules were constructed to capture global information about the protein sequence for accurate prediction.
Amino Acid Composition based SVM:- The SVM was provided with a 20-dimensional vector based on the amino acid composition of the proteins. The amino acid composition is the fraction of each amino acid in a protein. For the composition-based modules, the best results were achieved using the RBF kernel.
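The 20-dimensional composition vector can be sketched as below. The ordering of the residues and the decision to ignore non-standard characters (such as X) are assumptions made for this example.

```python
# Assumed ordering of the 20 standard amino acids (alphabetical by one-letter code).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return [residues.count(aa) / len(residues) for aa in AMINO_ACIDS]
```

For example, the toy sequence `"ACAC"` gives 0.5 for alanine, 0.5 for cysteine and 0 for every other residue.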
Dipeptide Composition based SVM:- An SVM was developed on the basis of the dipeptide composition of the proteins. This resulted in a feature space of 400 dimensions (20 x 20 possible dipeptides). The best results were achieved using the RBF kernel.
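A minimal sketch of the 400-dimensional dipeptide composition follows, using the usual definition as the fraction of each overlapping residue pair among all pairs in the sequence. The residue ordering and the skipping of pairs containing non-standard characters are assumptions of this example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs: AA, AC, AD, ..., YY.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the fraction of each of the 400 dipeptides in the sequence."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]  # overlapping window of two residues
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no standard dipeptides")
    return [counts[dp] / total for dp in DIPEPTIDES]
```

Unlike the amino acid composition, this vector retains some local order information, since `"AC"` and `"CA"` are counted separately.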
Secondary Structure Composition (SSC):- Secondary structure is an important feature of cyclins because of their characteristic helix-rich cyclin folds, which contain the cyclin box. The secondary structure of each residue was predicted by PSIPRED. The scores corresponding to helix, sheet and coil for individual residues were extracted and averaged, giving a matrix of 60 values (20x3).
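One way to arrive at a 20x3 matrix is sketched below, under the assumption that the per-residue helix, sheet and coil scores are averaged separately for each of the 20 amino acid types. That grouping is an interpretation of the description, not a confirmed detail of the method. The input is assumed to be already parsed into one (helix, sheet, coil) tuple per residue; reading PSIPRED output files is not shown.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def secondary_structure_composition(sequence, scores):
    """Return 60 features: mean (helix, sheet, coil) score per amino acid type.

    sequence: protein sequence as a string.
    scores:   one (helix, sheet, coil) tuple per residue, in sequence order
              (hypothetical pre-parsed form of the predictor output).
    """
    if len(sequence) != len(scores):
        raise ValueError("need exactly one score triple per residue")
    sums = {aa: [0.0, 0.0, 0.0] for aa in AMINO_ACIDS}
    counts = dict.fromkeys(AMINO_ACIDS, 0)
    for aa, triple in zip(sequence.upper(), scores):
        if aa in sums:
            counts[aa] += 1
            for k in range(3):
                sums[aa][k] += triple[k]
    features = []
    for aa in AMINO_ACIDS:
        n = counts[aa]
        # Amino acid types absent from the sequence contribute zeros.
        features.extend(total / n if n else 0.0 for total in sums[aa])
    return features
```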
Position Specific Scoring Matrix (PSSM) Composition:- This model was designed using PSI-BLAST, which can detect remote homologies. For each residue of the sequence, the PSSM holds 20 substitution scores, which provide evolutionary information about the protein at the level of residue types. Three iterations of PSI-BLAST were carried out at an E-value cut-off of 0.001. Each value in the PSSM represents the likelihood of a particular residue substitution at a specific position of the protein; these values were first normalized between 0 and 1 using the logistic function.
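The logistic normalization can be sketched as below. The second function shows one common way of turning a variable-length PSSM into a fixed-length vector, by averaging the normalized rows for each residue type (20 x 20 = 400 values); that aggregation step is an assumption of this example and is not stated in the text. The PSSM is assumed to be already parsed into one row of 20 scores per residue.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def logistic(x):
    """Map a raw PSSM score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def normalize_pssm(pssm):
    """Apply the logistic function to every score in the matrix."""
    return [[logistic(value) for value in row] for row in pssm]


def pssm_composition(sequence, pssm):
    """Return 400 features: mean normalized PSSM row per amino acid type.

    The per-residue-type averaging is an assumed aggregation scheme.
    """
    if len(sequence) != len(pssm):
        raise ValueError("need exactly one PSSM row per residue")
    sums = {aa: [0.0] * 20 for aa in AMINO_ACIDS}
    counts = dict.fromkeys(AMINO_ACIDS, 0)
    for aa, row in zip(sequence.upper(), normalize_pssm(pssm)):
        if aa in sums:
            counts[aa] += 1
            for k, value in enumerate(row):
                sums[aa][k] += value
    features = []
    for aa in AMINO_ACIDS:
        n = counts[aa]
        features.extend(total / n if n else 0.0 for total in sums[aa])
    return features
```

A raw score of 0 maps to 0.5, strongly positive scores approach 1, and strongly negative scores approach 0.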
Hybrid Approach based Prediction:- To improve the prediction accuracy, we adopted several strategies. We developed eight different hybrid modules on the basis of amino acid, dipeptide, secondary structure and PSSM compositions.
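A hybrid input can be sketched by concatenating the individual feature vectors and writing the result in the sparse `<label> <index>:<value>` line format that SVM_light reads. Concatenation is assumed here as the combination strategy, since the text does not specify how the hybrid modules merge their features, and the function names are illustrative.

```python
def hybrid_vector(*feature_vectors):
    """Concatenate several feature vectors into one (assumed hybrid scheme)."""
    combined = []
    for vector in feature_vectors:
        combined.extend(vector)
    return combined


def to_svmlight_line(label, features):
    """Format one example as an SVM_light input line.

    Feature indices start at 1 and appear in increasing order;
    zero-valued features are omitted, as the sparse format allows.
    """
    parts = ["+1" if label > 0 else "-1"]
    for index, value in enumerate(features, start=1):
        if value != 0:
            parts.append(f"{index}:{value:.6g}")
    return " ".join(parts)
```

For instance, combining a 20-dimensional amino acid composition with a 400-dimensional dipeptide composition in this way would give a 420-dimensional input line per protein.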