How do I make my repayments Payday loans What are payday loans
Off

## Investigating the nature of explosive percolation transition

The Laboratory of Computational Physics is actively involved in the field of investigating the phase transition of various natural and artificial systems. Currently, much effort is being concentrated on the definition of the type of phase transition for a new competitive model named “explosive” percolation: when filling sequentially an empty lattice with occupied sites, instead of randomly occupying a site or bond (according to the classical paradigm), we choose two candidates and investigate which one of them leads to the smaller clustering. The one that does this is kept as a new occupied site on the lattice while the second one is discarded (Figure 1). This procedure considerably slows down the emergence of the giant component, which is now formed abruptly, thus the term “explosive”.

Figure 1: Achlioptas Process according to the sum rule (APSR) for site percolation. White cells correspond to unoccupied sites while colored cells correspond to occupied sites. Different colors (red,green,gray,blue) indicate different clusters. (a) We randomly select two trial unoccupied sites (yellow), noted by A and B, one at a time. We evaluate the size of the clusters that are formed and contain sites A and B, $$s_A$$ and $$s_B$$ respectively. In this example $$s_A = 10$$ and $$s_B = 14$$. (b) According to the Achlioptas Process, we keep site A which leads to the smaller cluster and discard site B.

Following the first publication of Achlioptas et al., a debate was initiated between various teams whether the procedure is continuous or discontinuous. Contributing to these considerations, we have investigated explosive site percolation, both using the product and sum rules. It was found that the exponent $$\beta / \nu$$ is vanishing small for both cases, pointing towards the continuity of the transition. Also, we performed numerical analysis for the case of a reverse Achlioptas process (Figure 2). It was shown that for finite systems there is a hysteresis loop between the reverse and forward procedure (Figure 2). This loop vanishes at infinity, giving strong evidence for the continuity of the “explosive” site percolation (Figure 3). Moreover, “explosive” site and bond percolation seem to belong to a different universality class

 Figure 2: Reverse Achlioptas Process (AP1) for site percolation according to the sum rule. Blue is for the occupied sites while white for the unoccupied sites. Initially, the lattice is fully occupied. (a) An instance of the process. We randomly choose two trial sites (yellow), noted as A and B, and remove them from the lattice. (b) The clusters formed after the removal. (c) We place site A again in the lattice and calculate the size of the cluster in which it belongs, $$s_A = 16$$. (d) We do the same as before for the case of site B and calculate $$s_B = 26$$. We remove site A which leads to the formation of the smaller cluster and keep site B.

Figure 3: (a) Hysteresis loop between a reverse (red dots) and the forward (black squares) Achlioptas process for a $$700\times 700$$ system. (b) The loop vanishes in the thermodynamic limit

Simulations were performed on the EGI. A diagram of the number of jobs and CPU hours consumed per month is shown in Figure 4. We have used extensively the gLite parametric job submission mechanism, using as parameter the different realizations of the system. On average, more than 1000 jobs per simulation were submitted for each lattice size. Considering a typical $$1000 \times 1000$$ lattice, the average time consumed for one run approached 172 minutes. If we had to perform the calculations on a single CPU, this would mean that it would take us 120 days to get complete results for just one lattice size. Using the EGI thus has helped us minimize this time approximately to 172 minutes. This translates to a time gain of the order of $$10^3$$. Moreover, given the availability of more resources, this gain may be even higher. This is a very important feature, because we can numerically analyze systems of the order of $$10^6$$ in a tolerable amount of time.

Figure 4: Number of jobs and CPU hours per month consumed for the simulations

References:

1. D.Achlioptas, R.M. D’Souza and J. Spencer, Explosive Percolation in Random Networks, Science 323,p. 1453 ,(2009)
2. R.A. da Costa, S.N.Dorogovtsev, A.V.Goltsev, and J.F.F.Mendes, “Explosive Percolation” Transition is Actually Continuous, Physical Review Letters 105(25),255701,(2010)
3. P.Grassberger, C. Christensen, G. Bizhani, S-W Son, and M. Paczuski, Explosive Percolation is Continuous, but with Unusual Finite Size Behavior, Phys Rev Lett. 106(22), (2011)
4. O. Riordan and L. Warnke, Explosive percolation is continuous, Science 333 (2011)
5. R.M. Ziff, Explosive growth in biased dynamic percolation on two-dimensional regular lattice networks, Physical Review Letters 103(4),45701,(2009)
6. F. Radicchi and S. Fortunato, Explosive Percolation: A numerical analysis, Physical Review E 81(3),036110,(2010)
7. N.A.M. Araújo and H.J.Hermann, Explosive Percolation via Control of the Largest Cluster, Physical Review Letters 105(3),035701,(2010)
Off

## Protein classification algorithms over a distributed computing environment

One of the most important challenges in modern Bioinformatics is the accurate prediction of the functional behavior of proteins. To this end, researchers from the Intelligent Systems and Software Engineering Lab (Dept. of Electrical and Computer Engineering) have been working successfully for several years on the design and implementation of novel data mining algorithms [1-3].

The strong correlation that exists between the properties of a protein and its motif sequence (Figure 1) makes the prediction of protein function possible. The core concept of any approach is to employ data mining techniques in order to construct models, based on data generated from already annotated protein sequences. A major issue in such approaches is the complexity of the problem in terms of data size and computational cost. However, the utilization of the HellasGrid Infrastructure and the EGI Grid, coupled with the close support of the Scientific Computing Center at A.U.Th., helped overcome the computational difficulties often encountered in protein classification problems.

Figure 1: [a] P00747 (Plasminogen precursor – PLMN_HUMAN) protein chain, and [b] an amino-acid pattern expressed as a regular expression

G-Class was the first data-mining algorithm successfully ported to the EGI Grid infrastructure [4]. The G-Class methodology follows a “divide and conquer” approach comprised of 3 steps (Figure 2).

Figure 2: First, protein data from PROSITE, an expert-based database, are divided into multiple disjoint sets, each one preserving the original data distribution. The new sets are used as training sets, and multiple models are derived by means of standard data mining algorithms. Finally, the models are combined to produce the final classification rules, which can be used to classify a given instance and evaluate the methodology.

G-Class was a fairly simplistic approach to the protein classification problem, using generic data mining algorithms for the construction of several models simultaneously. However, the results were impressive both in terms of the speed-up ratio (ranging from 10 to 60) and the amount of data (ranging from 662 proteins over 27 different classes, to 7027 proteins over 96 classes) that were able to be processed (Figure 3).

Figure 3: The processing time in all cases follows the $$e^{-\alpha x}$$ model, where $$\alpha$$ depends on the size of the original dataset and $$x$$ is the number of splits. The accuracy of the methodology is fairly constant over the number of splits, with minor fluctuations owing to the distribution of the instances of the overlapping protein classes over the different dataset splits.

A second approach was aiming towards the automatic annotation of protein sequences. Although there are a lot of tools for protein annotation, such as the Gene Ontology Project, ProDom, Pfam, and SCOP, in order to assign annotation terms to new non-annotated protein sequences, they have to be either processed directly in a lab or characterized through similarity to already annotated sequences. At the moment, the amino acid sequence of more than 1.000.000 proteins has been obtained. On the contrary, the properties and functions of only 4% of these proteins are known. Therefore, the need for a systematic way to derive clues for the properties of a protein by inspecting its amino acid sequence is obvious. PROTEAS is a novel parallel methodology for protein function prediction which predicts the annotation of an unknown protein, by running its motif sequence each model, producing similarity scores [5-6]. This methodology has been implemented so that it can effectively utilize various classification schemata, such as Gene Ontology, SCOP families, etc (Figure 4).

Figure 4: PROTEAS workflow diagram

The main drawback of this methodology is that it requires a substantial amount of computational time to complete. It has been shown experimentally that the execution time needed to process the entire dataset on a single processor is prohibitively long. In order to address this issue, PROTEAS has been implemented both as a standalone and as a grid-based application. The grid-based application utilizes the MPI library for communication between distinct processes and uses the EGI Grid infrastructure in order to minimize the execution times (Figure 5).

Figure 5: Execution times for model training

Moreover, the Grid provides for the seamless integration of the training process and the actual model evaluation by allowing the concurrent retraining of Gene Ontology models from different input sources or experts and the use of the existing ones (Figure 6).

Figure 6: Execution times for specific Train/Test set ratio and different number of input files (left column), and for different ratios but specific number of input files (right column)

The application was executed on available clusters using from 4 to 16 processors in various experiment configurations (Figure 7). In all cases the accuracy of the results was very high and the overall execution time was satisfactory.

Figure 7: Total processing times for the classification of a single protein sequence, based on the number of CPUs used and the number of input files used as the model construction base.

Contact details:

• Pericles A. Mitkas, Professor, AUTH, mitkas (at) eng.auth.gr
• Fotis E. Psomopoulos, Research Associate, CERTH, fpsom (at) issel.ee.auth.gr
• Scientific Computing Center, AUTH, contact (at) grid.auth.gr

References:

1. Fotis E. Psomopoulos and Pericles A. Mitkas, “Bioinformatics Algorithm Development for Grid Environments”, Journal of Systems & Software, vol. 83, No 7. (2010), pp. 1249-1257.
2. Fotis E. Psomopoulos and Pericles A. Mitkas, “Data Mining in Proteomics using Grid Computing”, Handbook of Research on Computational Grid Technologies for Life Sciences, Biomedicine and Healthcare, Editor: Mario Cannataro, Laboratory of Bioinformatics, University Magna Graecia of Catanzaro, 88100 Catanzaro, Italy, 2009, (chapter 13, pp. 245-267), UK: IGI Global.
3. Fotis E. Psomopoulos and Pericles A. Mitkas: “Sizing Up: Bioinformatics in a Grid Context”, 3rd Conference of the Hellenic Society For Computational Biology and Bioinformatics – HSCBB ’08, 30-31 October 2008, Thessaloniki, Greece.
4. Helen Polychroniadou, Fotis E. Psomopoulos and Pericles A. Mitkas: “g-Class: A Divide and Conquer Application for Grid Protein Classification”, Proceedings of the 2nd ADMKD 2006: Workshop on Data Mining and Knowledge Discovery (in conjunction with ADBIS’2006: The 10th East-European Conference on Advances in Databases and Information Systems), 3-7 September 2006, Thessaloniki, Greece, pp. 121-132.
5. Christos N. Gkekas, Fotis E. Psomopoulos and Pericles A. Mitkas, “A parallel data mining application for Gene Ontology term prediction”, 3d EGEE User Forum, Polydome Conference Centre, 11-14 February 2008, Clermont-Ferrand, France.
6. Christos N. Gkekas, Fotis E. Psomopoulos and Pericles A. Mitkas, “A parallel data mining methodology for protein function prediction utilizing finite state automata”, presented at the 2nd Electrical and Computer Engineering Student Conference, April 2008, Athens, Greece.
Pages:1234»