Dataset Information

We collected experimentally verified in vivo methylated arginine sites from the literature along with those reported in the UniProt database[1]. We did not consider sites reported only in vitro that lacked credible evidence of occurring in vivo. The extracted dataset contains 6754 methylation sites from 2077 protein sequences. We removed sites and proteins with ambiguities, such as nonstandard amino acids, site mismatches, very small protein fragments (fewer than 30 aa) and obsolete protein entries. We did not include methylation sites from the PhosphoSitePlus database[2], since it did not provide the exact experimental source or other supporting information needed to verify PTM evidence. However, the majority of our methylation data matched the sites that PhosphoSitePlus reports as extracted from the literature.

We assume that the local environment around a methylated arginine, dictated by its flanking residues, plays a major role in substrate selectivity and catalysis by PRMTs. This assumption arises from observations that the PRMT active site and certain substrate features complement each other (although not always). For instance, in one substrate, positively charged flanking residues were shown to affect substrate binding and catalysis by the PRMT active site (Osborne 2007)[3]. This is consistent with the fact that the surface surrounding the active site in some PRMTs has grooves that are acidic in nature. Additionally, many of the known methylated arginine sites lie in either glycine-arginine rich (GAR) regions or arginine-rich and proline/serine-rich regions, which have been shown to favour arginine methylation. To assess the role of flanking residues, we generated symmetric peptide datasets of varying window lengths (7, 11, 15, 19, 23, 27, 31 and 35), all centered on the methylated arginine. Since we adopted position-specific feature encoding for model building, it was necessary to pad the ends of peptides that lacked symmetry (where the arginine lies too close to a protein terminus) with the arbitrary residue 'X', a convention also followed by some previous prediction classifiers.
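
The window extraction and padding described above can be sketched as follows. This is a minimal illustration, not the code used in this work: the function name extract_window, the 1-based site numbering and the example sequence are assumptions made for the sketch.

```python
def extract_window(sequence, site, window, pad="X"):
    """Return the peptide of length `window` centred on the arginine at
    1-based position `site`, padding with `pad` where the window runs
    past a protein terminus. (Hypothetical helper for illustration.)"""
    if window % 2 == 0:
        raise ValueError("window length must be odd to be symmetric")
    center = site - 1  # convert to 0-based index
    if sequence[center] != "R":
        raise ValueError("site %d is not an arginine" % site)
    half = window // 2
    left = sequence[max(0, center - half):center]
    right = sequence[center + 1:center + 1 + half]
    # Pad the short side(s) so the arginine stays in the middle position.
    return left.rjust(half, pad) + "R" + right.ljust(half, pad)


# Toy sequence (not taken from the dataset) with arginines at positions 3 and 9.
toy = "MGRGGFGGRGG"
print(extract_window(toy, 3, 7))
print(extract_window(toy, 9, 7))
```

With position-specific encoding, every peptide must have the same length and the same central position, which is why the padding is applied on whichever side is short rather than trimming the window.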

We followed the conventional practice of generating a negative set from sites that are not reported to be methylated in the methylated proteins. Briefly, we first created an unlabeled class of all arginine sites in the respective methylated proteins that were not reported to be methylated. We termed this set unlabeled because it may contain sites that can be methylated but have not yet been established as such. Using CD-HIT-2D[4] with a 40% identity cut-off, we created a negative set from this unlabeled set by removing sequences that were similar to the positive set.
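
As a rough sketch of the first step, the unlabeled class for one protein is simply every arginine position that is absent from its list of reported methylation sites. The function name and the toy input below are illustrative assumptions, not part of the original pipeline.

```python
def unlabeled_arginine_sites(sequence, methylated_sites):
    """Return the 1-based positions of arginines in `sequence` that are
    not among the reported methylated sites. (Hypothetical helper.)"""
    methylated = set(methylated_sites)
    return [
        position
        for position, residue in enumerate(sequence, start=1)
        if residue == "R" and position not in methylated
    ]


# Toy protein with arginines at positions 3, 9 and 12; only site 3 is
# assumed to be reported as methylated.
print(unlabeled_arginine_sites("MGRGGFGGRGGR", [3]))
```

Peptide windows built around these positions form the unlabeled set that is then filtered against the positive set.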

The data are likely to contain highly similar peptide sequences, since two thirds of the data come from the human and mouse proteomes and since multiple adjacent arginine residues are methylated in arginine-rich sequences such as GAR peptides. Because most of our features are calculated position-wise, we reduced this bias, especially during feature assessment with the training set, by removing similar sequences from both the positive and pseudo-negative sets using CD-HIT with a 40% identity cut-off. The pseudo-negative sets for window lengths 7, 11 and 15 were far smaller than the corresponding positive sets, and these window lengths were therefore excluded from model building.
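
For readers who want to reproduce the redundancy filtering with the command-line CD-HIT tools, the sketch below assembles the corresponding invocations. The options -i, -i2, -o, -c and -n are standard CD-HIT options, and a word size of 2 is what the CD-HIT documentation recommends for identity thresholds between 0.4 and 0.5. The file names are placeholders, and the exact settings used in this work (for example, whether the web server or a local installation was used) are not stated, so this should be read as an assumption.

```python
import shutil
import subprocess


def cd_hit_command(fasta_in, fasta_out, identity=0.4, second_db=None):
    """Build a cd-hit (or, if `second_db` is given, cd-hit-2d) command.

    For cd-hit-2d, the output holds the sequences of `second_db` that are
    not similar to any sequence in `fasta_in`."""
    program = "cd-hit-2d" if second_db else "cd-hit"
    command = [program, "-i", fasta_in]
    if second_db:
        command += ["-i2", second_db]
    # -n 2 is the word size suggested for thresholds of 0.4 to 0.5.
    command += ["-o", fasta_out, "-c", str(identity), "-n", "2"]
    return command


def run_if_available(command):
    """Run the command only when the program is installed on this machine."""
    if shutil.which(command[0]) is None:
        print("not installed, would run:", " ".join(command))
        return False
    subprocess.run(command, check=True)
    return True


# Placeholder file names: remove redundancy within the positive peptides.
print(" ".join(cd_hit_command("positive.fasta", "positive_nr.fasta")))
```

Passing second_db="unlabeled.fasta" to the same helper would yield the CD-HIT-2D form, which keeps only unlabeled peptides that are dissimilar to the positive set.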

For each window length, the positive dataset was split randomly into a training set and a test set in a 4:1 ratio. We also split the negative dataset into a training set and a test set, with the negative test set equal in size to the positive test set. For window lengths of 19 and above, the negative training set was much larger than the positive training set. To overcome this class imbalance, we opted for under-sampling and created subsets of the negative training set, each in a 1:1 ratio with the positive training set, by random sampling. To save computational time, we restricted the number of negative training subsets to 5 for each window length. During the course of our work, we accumulated more instances of arginine-methylated proteins from recent studies and separately prepared an independent dataset for final evaluation and comparison.
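
The splitting and under-sampling scheme can be sketched as below. This is an illustration under stated assumptions rather than the original procedure: the function name, the fixed random seed and, in particular, the choice to draw the negative subsets without overlap are assumptions, since the text does not say whether the subsets were disjoint.

```python
import random


def split_and_undersample(positives, negatives, n_subsets=5, seed=0):
    """Split positives 4:1, match the negative test set to the positive
    test set, and draw balanced negative training subsets."""
    rng = random.Random(seed)
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)

    n_train = len(pos) * 4 // 5  # 80% of positives, rounded down
    pos_train, pos_test = pos[:n_train], pos[n_train:]

    # Negative test set has the same size as the positive test set.
    neg_test, neg_train = neg[:len(pos_test)], neg[len(pos_test):]

    # Assumption: subsets are disjoint slices of the shuffled negatives,
    # each the same size as the positive training set (1:1 ratio).
    size = len(pos_train)
    n_subsets = min(n_subsets, len(neg_train) // size) if size else 0
    subsets = [neg_train[k * size:(k + 1) * size] for k in range(n_subsets)]
    return pos_train, pos_test, neg_train, neg_test, subsets


# Dummy peptides sized like the window-length-19 row of Table 1.
positives = ["P%d" % i for i in range(1298)]
negatives = ["N%d" % i for i in range(5539)]
pos_train, pos_test, neg_train, neg_test, subsets = split_and_undersample(
    positives, negatives
)
print(len(pos_train), len(pos_test), len(neg_train), len(neg_test), len(subsets))
```

Using integer arithmetic for the 80% cut reproduces the training and test sizes listed in Table 1 for every window length.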


Datasets       Supplementary Information


Table 1. Dataset information (after CD-HIT) for different window lengths

                Positive dataset                                     Negative dataset
Window length   Complete set   Training set (80%)   Test set (20%)   Complete set   Training set   Test set
19              1298           1038                 260              5539           5279           260
23              1964           1571                 393              20004          19611          393
27              1845           1476                 369              17729          17360          369
31              2288           1830                 458              31603          31145          458
35              2175           1740                 435              28250          27815          435



References:

1. Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, et al. (2004) UniProt: the Universal Protein knowledgebase. Nucleic Acids Research 32: D115-119.

2. Hornbeck PV, Kornhauser JM, Tkachev S, Zhang B, Skrzypek E, Murray B, Latham V, Sullivan M (2011) PhosphoSitePlus: a comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse.

3. Osborne TC, Obianyo O, Zhang X, Cheng X, Thompson PR (2007) Protein Arginine Methyltransferase 1: Positively Charged Residues in Substrate Peptides Distal to the Site of Methylation Are Important for Substrate Binding and Catalysis.

4. Huang Y, Niu B, Gao Y, Fu L, Li W (2010) CD-HIT Suite: a web server for clustering and comparing biological sequences.