Protein classification algorithms over a distributed computing environment

One of the most important challenges in modern Bioinformatics is the accurate prediction of the functional behavior of proteins. To this end, researchers from the Intelligent Systems and Software Engineering Lab (Dept. of Electrical and Computer Engineering) have been working successfully for several years on the design and implementation of novel data mining algorithms [1-3].

The strong correlation that exists between the properties of a protein and its motif sequence (Figure 1) makes the prediction of protein function possible. The core concept of any approach is to employ data mining techniques in order to construct models, based on data generated from already annotated protein sequences. A major issue in such approaches is the complexity of the problem in terms of data size and computational cost. However, the utilization of the HellasGrid Infrastructure and the EGI Grid, coupled with the close support of the Scientific Computing Center at A.U.Th., helped overcome the computational difficulties often encountered in protein classification problems.

Figure 1: [a] P00747 (Plasminogen precursor – PLMN_HUMAN) protein chain, and [b] an amino-acid pattern expressed as a regular expression

 

G-Class was the first data-mining algorithm successfully ported to the EGI Grid infrastructure [4]. The G-Class methodology follows a “divide and conquer” approach comprised of 3 steps (Figure 2).

Figure 2: First, protein data from PROSITE, an expert-based database, are divided into multiple disjoint sets, each one preserving the original data distribution. The new sets are used as training sets, and multiple models are derived by means of standard data mining algorithms. Finally, the models are combined to produce the final classification rules, which can be used to classify a given instance and evaluate the methodology.

 

G-Class was a fairly simplistic approach to the protein classification problem, using generic data mining algorithms for the construction of several models simultaneously. However, the results were impressive both in terms of the speed-up ratio (ranging from 10 to 60) and the amount of data (ranging from 662 proteins over 27 different classes, to 7027 proteins over 96 classes) that were able to be processed (Figure 3).

Figure 3: The processing time in all cases follows the (e^{-alpha x}) model, where (alpha) depends on the size of the original dataset and (x) is the number of splits. The accuracy of the methodology is fairly constant over the number of splits, with minor fluctuations owing to the distribution of the instances of the overlapping protein classes over the different dataset splits.

 

A second approach was aiming towards the automatic annotation of protein sequences. Although there are a lot of tools for protein annotation, such as the Gene Ontology Project, ProDom, Pfam, and SCOP, in order to assign annotation terms to new non-annotated protein sequences, they have to be either processed directly in a lab or characterized through similarity to already annotated sequences. At the moment, the amino acid sequence of more than 1.000.000 proteins has been obtained. On the contrary, the properties and functions of only 4% of these proteins are known. Therefore, the need for a systematic way to derive clues for the properties of a protein by inspecting its amino acid sequence is obvious. PROTEAS is a novel parallel methodology for protein function prediction which predicts the annotation of an unknown protein, by running its motif sequence each model, producing similarity scores [5-6]. This methodology has been implemented so that it can effectively utilize various classification schemata, such as Gene Ontology, SCOP families, etc (Figure 4).

Figure 4: PROTEAS workflow diagram

 

The main drawback of this methodology is that it requires a substantial amount of computational time to complete. It has been shown experimentally that the execution time needed to process the entire dataset on a single processor is prohibitively long. In order to address this issue, PROTEAS has been implemented both as a standalone and as a grid-based application. The grid-based application utilizes the MPI library for communication between distinct processes and uses the EGI Grid infrastructure in order to minimize the execution times (Figure 5).

 

Figure 5: Execution times for model training

 

Moreover, the Grid provides for the seamless integration of the training process and the actual model evaluation by allowing the concurrent retraining of Gene Ontology models from different input sources or experts and the use of the existing ones (Figure 6).

Figure 6: Execution times for specific Train/Test set ratio and different number of input files (left column), and for different ratios but specific number of input files (right column)

 

The application was executed on available clusters using from 4 to 16 processors in various experiment configurations (Figure 7). In all cases the accuracy of the results was very high and the overall execution time was satisfactory.

 

Figure 7: Total processing times for the classification of a single protein sequence, based on the number of CPUs used and the number of input files used as the model construction base.

 

Contact details:

  • Pericles A. Mitkas, Professor, AUTH, mitkas (at) eng.auth.gr
  • Fotis E. Psomopoulos, Research Associate, CERTH, fpsom (at) issel.ee.auth.gr
  • Scientific Computing Center, AUTH, contact (at) grid.auth.gr

 

References:

  1. Fotis E. Psomopoulos and Pericles A. Mitkas, “Bioinformatics Algorithm Development for Grid Environments”, Journal of Systems & Software, vol. 83, No 7. (2010), pp. 1249-1257.
  2. Fotis E. Psomopoulos and Pericles A. Mitkas, “Data Mining in Proteomics using Grid Computing”, Handbook of Research on Computational Grid Technologies for Life Sciences, Biomedicine and Healthcare, Editor: Mario Cannataro, Laboratory of Bioinformatics, University Magna Graecia of Catanzaro, 88100 Catanzaro, Italy, 2009, (chapter 13, pp. 245-267), UK: IGI Global.
  3. Fotis E. Psomopoulos and Pericles A. Mitkas: “Sizing Up: Bioinformatics in a Grid Context”, 3rd Conference of the Hellenic Society For Computational Biology and Bioinformatics – HSCBB ’08, 30-31 October 2008, Thessaloniki, Greece.
  4. Helen Polychroniadou, Fotis E. Psomopoulos and Pericles A. Mitkas: “g-Class: A Divide and Conquer Application for Grid Protein Classification”, Proceedings of the 2nd ADMKD 2006: Workshop on Data Mining and Knowledge Discovery (in conjunction with ADBIS’2006: The 10th East-European Conference on Advances in Databases and Information Systems), 3-7 September 2006, Thessaloniki, Greece, pp. 121-132.
  5. Christos N. Gkekas, Fotis E. Psomopoulos and Pericles A. Mitkas, “A parallel data mining application for Gene Ontology term prediction”, 3d EGEE User Forum, Polydome Conference Centre, 11-14 February 2008, Clermont-Ferrand, France.
  6. Christos N. Gkekas, Fotis E. Psomopoulos and Pericles A. Mitkas, “A parallel data mining methodology for protein function prediction utilizing finite state automata”, presented at the 2nd Electrical and Computer Engineering Student Conference, April 2008, Athens, Greece.