
Courtesy GADU/GNARE
A powerful approach to the interpretation of newly sequenced genomes is comparative analysis
against all annotated sequences in publicly available resources. The largest sequence database
at the National Center for Biotechnology Information currently contains 2.4 million protein
sequences. The precision of genetic sequence analysis and assignment of function to genes
can be increased markedly by the use of multiple bioinformatics algorithms for data analysis.
GADU, the Genome Analysis and Databases Update tool that serves as GNARE's analysis module at
Argonne National Laboratory's Mathematics and Computer Science Division, pre-computes analysis
results for every sequence, finding protein similarities (BLAST), protein family domains
(BLOCKS), and structural characteristics.
Grid resources are used to run the resulting millions of processes, a task that must be
repeated frequently owing to the exponentially growing amount of data.
GADU searches periodically through DNA and protein databases for new and updated genomes and
then computes and publishes derived values. Analysis of a single bacterial genome of 4,000
sequences by three bioinformatics tools (BLAST, PFAM, and BLOCKS) requires 12,000 steps, each
taking on the order of 30 seconds of run time. GADU is able to perform these tasks in a timely
fashion only because it has access to distributed resources provided by two U.S.
national-scale infrastructures, TeraGrid and Open Science Grid.
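
The workload figures quoted above can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope estimate, not part of GADU: the function names are hypothetical, the 200-worker figure is an illustrative assumption rather than a number from the article, and the parallel estimate assumes ideal scaling with no queueing or data-transfer overhead.

```python
import math


def workload(num_sequences, num_tools, seconds_per_step):
    """Return (total steps, serial run time in hours) for one genome."""
    steps = num_sequences * num_tools            # one step per (sequence, tool) pair
    serial_hours = steps * seconds_per_step / 3600
    return steps, serial_hours


def ideal_wall_clock_hours(steps, seconds_per_step, workers):
    """Idealised wall-clock time when steps are spread evenly over workers.

    Assumes perfect parallelism: no scheduling, queueing, or transfer cost.
    """
    rounds = math.ceil(steps / workers)          # steps each worker must run in turn
    return rounds * seconds_per_step / 3600


# Figures from the article: 4,000 sequences, 3 tools, about 30 s per step.
steps, serial_hours = workload(4000, 3, 30)
print(f"steps: {steps}, serial run time: {serial_hours:.0f} hours")

# Hypothetical grid allocation of 200 concurrent workers (an assumption).
parallel_hours = ideal_wall_clock_hours(steps, 30, workers=200)
print(f"ideal wall-clock time on 200 workers: {parallel_hours:.1f} hours")
```

Run one after another, the 12,000 steps come to about 100 hours for a single genome, which is why the periodic re-analysis depends on distributed resources.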
—Dinanath Sulakhe