Xianfeng Chen, Kurt A. Gust, and Edward J. Perkins
Environmental Laboratory, ERDC, Vicksburg, MS 39180
Supercomputer Assembly and Annotation of Transcriptomes for Assessing Impacts
of Army Stressors on Ecological Receptors
Introduction
• Although genomic tool development for ecologically relevant non-model species has lagged behind that for model species, advances in sequencing
technology, bioinformatics processing, and gene expression platforms have led to an increasing number of non-model species having deep-coverage, well-annotated transcriptomes from which high-quality genomic tools have been produced.
• We have developed a bioinformatics infrastructure and data processing pipeline that takes raw sequence data to robustly annotated coding
genes, supporting gene expression profiling and biological impact assessment of Army stressors on ecological receptors such as the Western fence
lizard (Sceloporus occidentalis) and Japanese quail (Coturnix coturnix).
• These gene expression and cyber-infrastructure tools are proving indispensable as the focus of biological research and regulatory
decision frameworks continues to shift toward systems biology and predictive toxicology approaches.
Figure 1. Japanese Quail and Western Fence Lizard.
Table 3. Homology-based coding potential detection and annotation of unigenes
against the following protein databases: NR.aa (10,606,545 proteins),
Refseq (6,392,535 proteins), UniProt-SwissProt (515,203 proteins),
Uniref90 (6,544,144 proteins), and Uniref100 (9,865,668 proteins).
Results
• The sequencing effort produced over 328 million bases for the Western Fence Lizard (WFL) and 189 million bases for
Japanese Quail (JQ) [Figure 1] in 928,780 and 559,833 sequence reads, respectively (Table 1).
• A total of 928,759 WFL and 559,819 JQ sequences were clustered and assembled using Gene Indices Clustering Tools (TGICL,
J. Craig Venter Institute) into 58,962 and 44,455 unigenes, respectively (Table 2).
• Assembled unigenes were annotated using the Basic Local Alignment Search Tool (BLAST) against five publicly available protein sequence
databases on the DoD supercomputers Diamond (SGI Altix ICE) and Jade (Cray XT4) [Figure 2], which characterized 33 to 44% of the
unigenes (Tables 2 and 3).
• Sequences with significant similarity to known proteins were used to design custom high-density gene expression microarrays for
assessing the impacts of Army activity on the health of the JQ and WFL environmental models.
• Thus, this effort has developed a cyber-infrastructure capability (http://jeff.ifxworks.com/EGGT/) at the Environmental Laboratory to
rapidly develop genomic infrastructure and gene expression tools for any environmental model that emerges as a species of interest [Figures 3, 4,
and 5].
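The annotation step above can be illustrated with a minimal sketch. It assumes BLAST results in the standard 12-column tabular layout (query ID in column 1, e-value in column 11) and calls a unigene "coding" when any hit meets an e-value cutoff. The function name, the example records, and the 1e-5 cutoff are hypothetical; the poster does not state the threshold or the scripts actually used.

```python
import io


def classify_unigenes(blast_tab, all_ids, max_evalue=1e-5):
    """Split unigene IDs into (coding, noncoding) sets.

    blast_tab  : iterable of tab-separated BLAST lines (12-column tabular layout)
    all_ids    : every unigene ID that was searched
    max_evalue : hypothetical significance cutoff (not taken from the poster)
    """
    all_ids = set(all_ids)
    coding = set()
    for line in blast_tab:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            coding.add(query_id)
    coding &= all_ids
    return coding, all_ids - coding


# Made-up records: contig1 has a strong hit, contig2 a weak one, contig3 none.
example = io.StringIO(
    "contig1\tprotA\t85.0\t120\t18\t0\t1\t360\t1\t120\t1e-40\t150\n"
    "contig2\tprotB\t30.0\t40\t28\t0\t1\t120\t5\t44\t2.5\t22\n"
)
coding, noncoding = classify_unigenes(example, ["contig1", "contig2", "contig3"])
print(sorted(coding), sorted(noncoding))
```

Unigenes with no hit at all fall into the non-coding set, which matches how the Coding Detected and NonCoding Detected counts in Table 3 sum to the dataset total.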
Table 1. Results of GS-FLX pyrosequencing of normalized cDNA libraries for
Western fence lizard (WFL) and Japanese quail (JQ).
Sequencing Parameters      WFL            JQ
Raw Wells                  2,125,263      1,157,019
Key Pass Wells             2,061,220      1,103,565
Passed Filter Wells        928,780        559,833
Total Bases                328,540,934    189,239,672
Length Average             354            338
Median Reads Length        397            388
Longest Reads Length       2,043          686
Shortest Reads Length      2              11
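As a quick consistency check on Table 1, the average read length should equal total bases divided by the number of reads that passed filtering. The sketch below recomputes it from the table values; the dictionary layout and function name are illustrative only.

```python
# Values copied from Table 1; keys are illustrative names.
table1 = {
    "WFL": {"passed_filter_reads": 928_780, "total_bases": 328_540_934},
    "JQ": {"passed_filter_reads": 559_833, "total_bases": 189_239_672},
}


def mean_read_length(total_bases, reads):
    """Average read length in bases."""
    return total_bases / reads


mean_lengths = {
    species: round(mean_read_length(v["total_bases"], v["passed_filter_reads"]))
    for species, v in table1.items()
}
print(mean_lengths)
```

The rounded results agree with the Length Average row of Table 1 (354 for WFL, 338 for JQ).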
Table 3 (data).
Unigene Dataset   Protein Database     Coding Detected   NonCoding Detected   % Coding
WFL Contigs       NR.aa                23,385            30,512               43.39%
WFL Contigs       Refseq               23,173            30,724               43.00%
WFL Contigs       UniProt-SwissProt    21,593            32,304               40.06%
WFL Contigs       Uniref100            23,463            30,434               43.53%
WFL Contigs       Uniref90             23,508            30,389               43.62%
WFL Singlets      NR.aa                1,425             1,825                44.33%
WFL Singlets      Refseq               1,440             1,837                43.94%
WFL Singlets      UniProt-SwissProt    1,457             1,820                44.46%
WFL Singlets      Uniref100            1,465             1,812                44.71%
WFL Singlets      Uniref90             1,298             1,979                39.61%
JQ Contigs        NR.aa                17,873            23,193               43.52%
JQ Contigs        Refseq               17,732            23,334               43.18%
JQ Contigs        UniProt-SwissProt    15,513            25,553               37.78%
JQ Contigs        Uniref100            18,034            23,032               43.92%
JQ Contigs        Uniref90             18,031            23,035               43.91%
JQ Singlets       NR.aa                1,208             2,181                35.65%
JQ Singlets       Refseq               1,195             2,194                35.26%
JQ Singlets       UniProt-SwissProt    1,140             2,249                33.64%
JQ Singlets       Uniref100            1,217             2,172                35.91%
JQ Singlets       Uniref90             1,211             2,178                35.73%
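The % Coding values in Table 3 appear to be the coding count divided by the sum of coding and non-coding counts. The sketch below recomputes two rows under that assumption and checks that each pair of counts sums to the contig totals reported in Table 2. The function name is illustrative.

```python
def percent_coding(coding, noncoding):
    """Percentage of unigenes with detected coding potential."""
    return 100.0 * coding / (coding + noncoding)


# WFL contigs vs. NR.aa and JQ contigs vs. UniProt-SwissProt (counts from Table 3).
wfl_contigs_nr = percent_coding(23_385, 30_512)
jq_contigs_swissprot = percent_coding(15_513, 25_553)

# Coding + non-coding should equal the assembled contig totals in Table 2.
wfl_contig_total = 23_385 + 30_512
jq_contig_total = 15_513 + 25_553

print(f"{wfl_contigs_nr:.2f}% of {wfl_contig_total:,} WFL contigs")
print(f"{jq_contigs_swissprot:.2f}% of {jq_contig_total:,} JQ contigs")
```

Both recomputed percentages round to the tabulated values (43.39% and 37.78%), and the totals match the 53,897 WFL and 41,066 JQ assembled contigs.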
Table 2. Summary of sequence clustering and assembly
for Western Fence Lizard (WFL) and Japanese Quail (JQ).
Sequence Assembly          WFL         JQ
Total ESTs Available       928,759     559,819
Total Assembled Contigs    53,897      41,066
Total Singlets             5,065       3,389
Total Unigenes             58,962      44,455
Figure 2. Diamond and Jade Supercomputers at ITL ERDC.
Figure 3. Web dissemination of the JQ and WFL transcriptome datasets
(http://jeff.ifxworks.com/EGGT/Quail_Lizard.html).
Figure 4. Proposed bioinformatics system architecture.
Diagram components:
Web Services
Data Query, Data Upload via http:
Data Exchange Using XML-Based SOAP
Batch Processing
High-Performance and High-Throughput Computing Using Supercomputers
Data Management: (1) Data Uploading; (2) Data Validation; (3) Data Analysis; (4) Data Processing
Perl & Java
Oracle Relational Database
Private File Server
Public File Server
Figure 5. Web-based tools for transcriptomes and unigene analysis.