Web-based/Open-source Tools for Bioinformatics and Genome Analysis
http://www.faculty.ucr.edu/~tgirke/Teaching/Gen240B_2003.ppt
Bioinformatics Areas
A. Traditional Bioinformatics
  • Sequence analysis
  • Gene expression analysis
  • Proteomics
  • Metabolic profiling
  • Phenotypes
  • Networks
B. Structural Bioinformatics
  • Molecular modeling
  • Drug design
C. Biological Databases
Systems Biology
Focus of this Seminar
1. Sequences
2. Structure
3. Expression
4. Functional Groups
Bio* Projects and Databases
1. Some Analysis Steps
Fragment Assembly: ESTs and genes
- Mapping
- Annotation
- Gene predictions
  • ORFs, UTRs, introns, exons, promoters
  • Many errors in eukaryotic genomes!
- Similarity searches
  • BLAST, FASTA, Smith-Waterman
Gene families
- Domain databases
- Multiple alignments
Structure/Function
- 2D, 3D structure (availability?)
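The gene prediction step above lists ORFs among the features to annotate. As a rough sketch only, the following Python snippet scans the three forward reading frames for stretches that start with ATG and end at a stop codon. The function name `find_orfs`, the `min_codons` threshold, and the example sequence are illustrative assumptions, not part of the original slides. Real gene finders also handle the reverse strand, introns, and statistical models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    Coordinates are 0-based; end is exclusive and includes the stop codon.
    min_codons counts codons from the ATG up to, but excluding, the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ATG in this frame
    return orfs

# Hypothetical toy sequence with a single short ORF in frame 2
example = "CCATGAAATTTGGGTAACC"
print(find_orfs(example))
```

The length threshold is a crude filter; short ORFs occur by chance, which is one reason automated eukaryotic annotations contain errors.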
Important Sequence Databases
Selection
NCBI
Entrez: http://www.ncbi.nlm.nih.gov/
Batch Entrez: http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi
Downloads: ftp://ftp.ncbi.nih.gov/blast/db/
EMBL-EBI
General: http://www.ebi.ac.uk/
Downloads: http://www.ebi.ac.uk/FTP/
Swiss-Prot
General: http://us.expasy.org/
Downloads: http://us.expasy.org/expasy_urls.html
TIGR
General: http://www.tigr.org/
Downloads: ftp://ftp.tigr.org/pub/data/
Protein Data Bank (PDB)
General: http://www.rcsb.org/pdb/
Downloads: ftp://ftp.rcsb.org/pub/pdb/data
Example: NCBI
Sequence Database Searches
Important search algorithms
- Smith-Waterman, FASTA, BLAST
BLAST Flavors: http://www.ncbi.nlm.nih.gov/Sitemap/index.html#BLAST
- BLAST: BLASTN, BLASTP, TBLASTN, TBLASTX
- Psi-BLAST: Position-Specific Iterated BLAST
- RPS-BLAST: Reverse Position-Specific BLAST
- Phi-BLAST: Pattern Hit Initiated BLAST
- Mega-BLAST: about 10 times faster than BLASTN
- BLAST2: pairwise comparisons
- WU-BLAST: Washington University BLAST
Download of NCBI BLAST tools: ftp://ftp.ncbi.nih.gov/toolbox/
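To make the Smith-Waterman entry in the list of search algorithms more concrete, here is a minimal sketch of the local alignment score computation. The scoring values (match 2, mismatch -1, linear gap -2) and the function name `smith_waterman_score` are assumptions chosen for illustration. Production tools typically use substitution matrices and affine gap penalties, and BLAST and FASTA are faster heuristics rather than this exact dynamic program.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)      # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("XXACGTXX", "YYACGTYY"))
```

Only two rows are kept in memory because the sketch returns the score alone; recovering the alignment itself would require the full matrix and a traceback.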
Homework Assignment
Finish only one assignment!
- Go to http://www.ncbi.nlm.nih.gov/, select the protein database, and run the query: P450 & hydroxylase & human [organism]. Under ‘Limits’, select SwissProt.
- Report the final query syntax from the ‘Details’ page.
- Save the GIs from this final query to a file (select ‘GI List’ format under Display).
- Report how many GIs you retrieved.
- Retrieve the corresponding sequences through Batch Entrez (http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi) using the GI list file as query input, then save the sequences in FASTA format.
- Generate a multiple alignment and tree of these sequences using MultAlin (http://prodes.toulouse.inra.fr/multalin/multalin.html).
- Save the multiple alignment and tree to a file.
- Identify the putative heme-binding cysteine.
- Open the corresponding SwissProt page (http://us.expasy.org/sprot/) for the first P450 sequence in your list.
- Compare the putative heme-binding cysteine with the consensus pattern from the Prosite database.
- Report the corresponding Pfam ID.
- How many mouse (Mus musculus) sequences are in this family? (Use ‘species tree’ on the Pfam database.)
- Run BLASTP against the nr database (again using the first P450 in your list), click on “See Conserved Domains from CDD” (this runs RPS-BLAST), then click on the red P450 domain.
- Compare the resulting alignment with the result from MultAlin.
- View the 3D structure in Cn3D, highlight the heme-binding cysteine, and save the structure (screenshot).
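The comparison against a Prosite consensus pattern can also be scripted once the sequences are saved in FASTA format. The sketch below parses FASTA text and converts a PROSITE-style pattern into a Python regular expression. The pattern and the sequence used here are made-up toy values, not the actual P450 signature from Prosite, and the helper names `parse_fasta`, `prosite_to_regex`, and `find_matches` are hypothetical. The converter covers only the common syntax elements (x, [..], {..}, repeat counts, and terminal anchors).

```python
import re

def parse_fasta(text):
    """Parse FASTA-formatted text into {identifier: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]     # identifier is the first word
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(chunks) for n, chunks in records.items()}

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")   # excluded residues
        element = element.replace("(", "{").replace(")", "}")    # repeat counts
        element = element.replace("x", ".")                      # any residue
        element = element.replace("<", "^").replace(">", "$")    # terminal anchors
        parts.append(element)
    return "".join(parts)

def find_matches(pattern, sequence):
    """Return (start, end, matched_text) with 1-based inclusive coordinates."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.end(), m.group()) for m in regex.finditer(sequence)]

# Toy pattern and toy sequence, for illustration only
toy_pattern = "[FW]-x(2)-G-{F}-C"
toy_fasta = ">seq1 toy example\nMKTFAAG\nRCLG\n"
for seq_id, seq in parse_fasta(toy_fasta).items():
    print(seq_id, find_matches(toy_pattern, seq))
```

For the homework, the pattern string would be replaced with the one shown on the Prosite entry, and the position of the C in each match compared with the cysteine identified in the multiple alignment.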
Remote Homology Detection
Psi-BLAST/RPS-BLAST
HMMs: HMMER, SAM
Domain databases
Fold recognition approaches (Meta Servers)
Protein Domain Databases
Selection
PFAM
http://pfam.wustl.edu/
PROSITE
http://us.expasy.org/prosite/
ProDom
http://prodes.toulouse.inra.fr/prodom/2002.1/html/home.php
InterPro
http://www.ebi.ac.uk/interpro/
Selection of Tools for Promoter Analysis
- Verbumculus, UC Riverside
  • http://www.cs.ucr.edu/%7Estelo/Verbumculus/
- AlignACE & ScanACE
  • http://arep.med.harvard.edu/mrnadata/mrnasoft.html
- MEME and META-MEME, San Diego Supercomputer Center:
  • http://www.sdsc.edu/Research/biology/
- Regulatory Sequence Analysis Tools (RSA)
  • http://rsat.ulb.ac.be/rsat/
- Gibbs Motif Sampler, Cold Spring Harbor:
  • http://argon.cshl.org/ioschikz/gibbsDNA/mgibbsDNA-form.html
- Motif Sampler, searches for over-represented motifs
  • http://www.esat.kuleuven.ac.be/~thijs/Work/MotifSampler.html
- Stanford, motif finding in upstream sequences
  • http://genome-www4.stanford.edu/cgi-bin/ewing/oligoAnalysis.pl
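Several of the tools listed above search upstream sequences for over-represented motifs. The underlying idea can be sketched with simple word counting: compare k-mer frequencies in a set of promoter sequences with those in a background set. This is a simplified illustration and not the algorithm of any listed tool; the function names, the pseudocount, and the toy sequences are assumptions.

```python
from collections import Counter

def kmer_counts(sequences, k):
    """Count all overlapping words of length k in the given sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def enrichment(foreground, background, k=6, pseudocount=1.0):
    """Rank k-mers by foreground/background frequency ratio (highest first)."""
    fg = kmer_counts(foreground, k)
    bg = kmer_counts(background, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    n_kmers = 4 ** k                       # number of possible DNA words
    scores = {}
    for kmer, count in fg.items():
        # Pseudocounts avoid division by zero for words absent from the background
        fg_freq = (count + pseudocount) / (fg_total + pseudocount * n_kmers)
        bg_freq = (bg[kmer] + pseudocount) / (bg_total + pseudocount * n_kmers)
        scores[kmer] = fg_freq / bg_freq
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy upstream sequences sharing the word GACGTC, plus a toy background set
upstream = ["TTGACGTCAA", "CCGACGTCGG", "AAGACGTCTT"]
background = ["ATATATATAT", "CGCGCGCGCG", "AATTCCGGAA"]
print(enrichment(upstream, background)[0])
```

Word counting of this kind finds exact words only; samplers such as Gibbs-based tools model degenerate motifs with position weight matrices.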
Example: RSA
Promoter Databases
Selection
Regulatory Sequence Analysis Tools (RSA)
http://rsat.ulb.ac.be/rsat/
Eukaryotic Promoter Database
http://www.epd.isb-sib.ch/
Human Promoter Database
http://zlab.bu.edu/%7Emfrith/HPD.html
Arabidopsis
http://exon.cshl.org/cgi-bin/atprobe/atprobe.pl
Alternative Homework
Do only one assignment!
- Work through the tutorial of the Regulatory Sequence Analysis Tools (http://rsat.ulb.ac.be/rsat/).
- Provide a short summary of the different tools.
2. Protein Modeling
- Tool collection: http://faculty.ucr.edu/~tgirke/Links.htm
- Databases:
  Protein Data Bank:
    General: http://www.rcsb.org/pdb/
    Downloads: ftp://ftp.rcsb.org/pub/pdb/data
  More databases: http://faculty.ucr.edu/~tgirke/Links.htm#Databases
3. Microarrays and Chips
Definition: Hybridization-based technique that allows simultaneous analysis of thousands of samples on a solid substrate.
Applications: Examples
- Transcriptional Profiling
- Gene copy number
- Resequencing
- Genotyping
- Single-nucleotide polymorphism
- DNA-protein interaction
- Insertional library screening
- Identification of new cell lines
- Etc.
Developing Areas:
- Protein arrays
- Chemical arrays
Why Microarrays?
[Figure: input samples (WT, mutants, transgenics; treatments: biotic, abiotic, chemicals) are analyzed on DNA arrays (gene expression) to give outputs (prognosis, diagnosis, target identification)]
- Simultaneous analysis of over 50,000 genes
- Signaling and Metabolic Networks
- Regulatory genes
- First step in discovery of gene function
- Prediction of limiting factors in biological processes
- Rapid analysis of mutants and transgenics
- Reduce time of costly clinical studies and field trials
Basic Analysis Steps
Image analysis
Filtering, background correction
Standardization, scaling and normalization
Significance analysis (replicates)
Cluster analysis (time series)
Integration with sequence and functional information
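The background correction and normalization steps listed above can be illustrated for a two-color array. The sketch below subtracts a background value, computes log2(Cy5/Cy3) ratios, and centers them on the array median. Global median centering is only one of several normalization approaches and is chosen here as an assumption; the function names, the intensity floor, and the toy intensities are likewise hypothetical.

```python
import math
from statistics import median

def log_ratios(cy5, cy3, background=0.0, floor=1.0):
    """Background-correct spot intensities and return log2(Cy5/Cy3) per spot."""
    ratios = []
    for red, green in zip(cy5, cy3):
        # The floor keeps corrected intensities positive so the log is defined
        red = max(red - background, floor)
        green = max(green - background, floor)
        ratios.append(math.log2(red / green))
    return ratios

def median_center(values):
    """Shift log ratios so that the array median becomes zero."""
    center = median(values)
    return [v - center for v in values]

# Toy intensities for three spots on one array
raw = log_ratios([200, 400, 800], [100, 100, 100])
print(median_center(raw))
```

Median centering assumes that most genes do not change between the two samples; intensity-dependent methods are often preferred when that assumption or the dye response is in doubt.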
Planning Steps of Transcriptional Profiling Experiments
1. Biological question(s), e.g.:
- Which genes are up- or down-regulated in a mutant/transgenic line?
- Which genes cycle during a series of treatments?
2. Selection of best biological samples
- Minimize variability in sample collection.
3. Develop validation and follow-up strategy for expected expression hits
- e.g. real-time PCR and analysis of transgenics or mutants
4. Choose type of experiment
- pairwise: e.g. WT vs. mutant/transgenic
- series of time points or treatments
  • allows cluster analysis
5. Choose reference
- sample with maximum number of expressed genes (maximum biological information)
- pooled RNA of all points: less variability from reference, saves chips
[Figure: experimental design diagrams with sample labels WTt1, WTt2, MTt1, MTt2 and WTt1 to WTt5]
6. How many replicates?
- biological replicate: starts with sample collection
- technical replicate: starts usually with same RNA isolation
- dye-swaps: (1) WT-Cy3:MT-Cy5, (2) WT-Cy5:MT-Cy3
7. Management of sample collection and RNA isolation
- Define a “realistic” volume
- RNA quality tests!!!!
8. cDNA/cRNA labeling
- Which labeling technique? RNA amplification, reliability, sensitivity, etc.
9. Array hybridizations and post-processing
10. Array scanning
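The dye-swap design in step 6 can be worked through numerically. If each array reports M = log2(Cy5/Cy3), the first array measures log2(MT/WT) and the swapped array measures log2(WT/MT), so a dye-specific bias adds to both while the biological signal changes sign. A minimal sketch, assuming log ratios have already been computed per gene (the function name and values are illustrative):

```python
def combine_dye_swap(m_forward, m_swapped):
    """Combine log2(Cy5/Cy3) values from a dye-swap pair of arrays.

    m_forward: array with WT-Cy3 and MT-Cy5, so M estimates log2(MT/WT)
    m_swapped: array with WT-Cy5 and MT-Cy3, so M estimates log2(WT/MT)
    Returns (combined log2(MT/WT), estimated dye bias) per gene.
    """
    combined = [(f - s) / 2 for f, s in zip(m_forward, m_swapped)]
    dye_bias = [(f + s) / 2 for f, s in zip(m_forward, m_swapped)]
    return combined, dye_bias

# Toy values for two genes; both carry a dye bias of +0.5
combined, bias = combine_dye_swap([1.5, -0.5], [-0.5, 1.5])
print(combined, bias)
```

Half the difference cancels an additive dye effect, and half the sum estimates that effect, which is why dye swaps are often paired with technical replication.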
Important Pattern Recognition (Clustering) Methods
Hierarchical clustering
- single, average (UPGMA) and complete linkage
Non-hierarchical clustering
- Self Organizing Maps (SOM)
- k-means
Dimension Reduction Analysis
- Principal Component Analysis
Neural Networks & Machine Learning
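As an illustration of the hierarchical methods listed above, the following sketch performs agglomerative clustering with average linkage (UPGMA) on expression profiles. Euclidean distance, the function names, and the toy profiles are assumptions made for the example; correlation-based distances are also widely used for expression data, and packages such as J-Express or R provide optimized implementations.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two expression vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def average_linkage(profiles):
    """Agglomerative clustering with average linkage (UPGMA).

    profiles maps a gene name to its expression vector.
    Returns the merge steps as (members_a, members_b, distance).
    """
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average of all pairwise distances between the two clusters
                pairs = [euclidean(profiles[a], profiles[b])
                         for a in clusters[i] for b in clusters[j]]
                dist = sum(pairs) / len(pairs)
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Toy profiles: two tight pairs of genes that are far from each other
toy = {"g1": [0.0, 0.0], "g2": [0.0, 1.0], "g3": [5.0, 5.0], "g4": [5.0, 6.0]}
for step in average_linkage(toy):
    print(step)
```

Swapping the average for `min(pairs)` or `max(pairs)` would give single or complete linkage. The brute-force search is fine for a handful of genes but scales poorly to whole arrays.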
Tools for Microarray Analysis
Image analysis: ScanAlyze
Normalization: SNOMAD, R projects
Mining/clustering: J-Express, R projects
Much more: http://faculty.ucr.edu/%7Etgirke/Links.htm#Profiling
Example of an Integrated Clustering Tool: J-Express
Microarray Databases
Selection
- Stanford Microarray Database (SMD)
  http://genome-www5.stanford.edu/MicroArray/SMD/
- Gene Expression Omnibus (GEO)
  http://www.ncbi.nlm.nih.gov/geo/
Alternative Homework Assignment
Do only one assignment!
- Go to the SNOMAD page (Standardization and Normalization of Microarray Data): http://pevsnerlab.kennedykrieger.org/snomadinput.html
- Select “Use an Example dataset to see how SNOMAD works” and choose either option #2 (Incyte dataset) or #3 (Affymetrix dataset). If you prefer, you can use your own or other public data instead. A good resource for downloading public data is the Stanford site: http://genome-www5.stanford.edu/cgi-bin/SMD/publicData.pl
- Select all possible transformations and graphs and submit the data for processing.
- Report: Give a short description (one or two sentences) of each graph/transformation in the returned results.
4. Functional Groups
Assigning “Biological Meaning” to Profiling Data
- Protein Families
  • COGs (43 genomes, NCBI): http://www.ncbi.nlm.nih.gov/COG/
  • Protein Domain Databases (PFAM)
- Gene Ontology Consortium
  Definition: controlled vocabulary for all organisms
  http://www.geneontology.org/
- Pathways
  • KEGG Metabolic Pathways: http://www.genome.ad.jp/kegg/kegg2.html
  • WIT Database (39 genomes): http://wit.mcs.anl.gov/WIT2/
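One common way to attach such functional categories to profiling data is to ask whether a category is over-represented among the genes of a cluster. The slides do not prescribe a test; the sketch below uses the hypergeometric distribution as an assumed, widely used choice, with a hypothetical function name and toy numbers.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated genes in a cluster.

    N: genes on the array, K: genes on the array carrying the annotation,
    n: genes in the cluster, k: annotated genes observed in the cluster.
    """
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)

# Toy numbers: 10 genes on the array, 5 annotated, cluster of 2, both annotated
print(hypergeom_pvalue(2, 2, 5, 10))
```

When many categories are tested at once, the resulting p-values need a multiple-testing correction before they are interpreted.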
Toolboxes for Bioinformaticians
Popular scripting languages
Perl: http://www.perl.com/
Python: http://www.python.org/
Bio* modules for processing data from databases and applications
BioPerl: http://bio.perl.org/
BioPython: http://biopython.org/
BioJava: http://www.biojava.org/
BioRuby: http://bioruby.org/
Statistics
R: http://www.R-project.org
BioConductor (Microarray): http://www.bioconductor.org/
Database systems
MySQL: http://www.mysql.com/
PostgreSQL: http://www.postgresql.org/