Genomes and sequence alignment

Monday #6
10
Problem set questions
20
Sequences: what's out there?
Two major archives
NCBI: total nucleotide repository, "reference" sequences (more later), tools, features, annotations, etc.
EBI: EU counterpart, functionally equivalent (tends to have a bit less data, a bit better tools)
DNA
Complete genome sequences
Fully assembled into linear or circular chromosomes
Annotated with specific remaining errors: gaps, ambiguities, expected mistakes per nucleotide
Screened for contamination
Draft genome sequences
Partially assembled into reasonably sized contigs/scaffolds to a few X coverage
Can still contain substantial contamination and error, but very usable if you don't trust specifics
Miscellany
GenBank accepts any identified sequence:
Plasmids, chromosomes, individual transcripts, different clones/inserts, metagenomes, etc.
I've never found most of these random sequences to be particularly useful
Raw short reads
NCBI: Sequence Read Archive (SRA) just went poof; state of high-throughput sequencing unclear
EBI: European Nucleotide Archive (ENA) accepts raw high-throughput seq. read files
These are both difficult to use; what do you do with someone else's gigantic raw short reads?
RNA
cDNAs/ESTs
Not used so much anymore: single-pass, high quality sequences from reverse-transcribed mRNAs
Can be used to catalog portions of genomes that are actively transcribed
Great for organisms without high quality sequenced genomes or annotations
Poor-man's RNA-seq (I can say this now; couldn't five years ago!)
RNA-seq
In the US, deposited in GEO like microarrays
In the EU, deposited in EMBL like DNA
Specific RNA types (miRNA, rRNA, etc.) deposited in specialty databases
Transcriptomic sequence database management is hooey so far
Amino acids
Won't discuss today, but AA seqs. typically handled very differently and in different DBs
Features: annotations, from location to function
Loci are referred to as "features", which can be anything
Genes, introns/exons, polymorphisms, regulatory elements, conserved regions, islands, etc.
Raw sequences don't have these (obviously!)
Have to be added after the fact, usually first-pass computational, subsequently curated
DBs differ by type of feature, automation process, level of curation, philosophy, etc. etc.
Now typically much harder to get these than the sequence itself!
Alignments
Pairwise alignment: best match between exactly two sequences
Multiple sequence alignment (MSA): best match between >2 sequences
Note that MSA != best multiple pairwise alignments
Comparative genomes, conservation, and evolution: more on this later
Searches: BLAST and friends are a special case of pairwise alignment
Find the best part of a "big" reference sequence that matches a "small" query sequence
Or find the best "other" sequence in a big database that matches a "small" query sequence
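A minimal sketch of the pairwise case, assuming a plain Smith-Waterman local alignment with made-up match/mismatch/gap scores (illustrative values, not BLAST's defaults; BLAST itself is a faster heuristic). It returns only the best local score, not the alignment:

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between strings a and b."""
    # prev holds the previous row of the dynamic programming table.
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            up = prev[j] + gap          # gap in b
            left = curr[j - 1] + gap    # gap in a
            # Flooring at 0 is what makes the alignment local:
            # a bad prefix is dropped instead of dragging the score down.
            score = max(0, diag, up, left)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

Under these scores, a "small" query that matches part of a "big" reference scores the same as it would against that part alone, which is the behavior a search tool wants.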
A note on gene IDs
How do you put a standard identifier on _anything_ in genomics?
Largest ratio of (poor implementation)*(irritation)/(necessity)/(apparent simplicity) in bioinformatics
Partially overlapping ID systems:
GenBank, RefSeq, EMBL-Bank, EMBL, UniGene, UniRef, HomoloGene, KO,
every array platform, Entrez, HGNC, KEGG, UCSC, every model organism DB...
And these just cover genes!
Different competing systems for proteins, functions, diseases, physiology, you name it
A large % of the reason we learn Python is so you can automate things like gene ID conversion
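A hedged sketch of what automated ID conversion can look like, assuming you already have a tab-delimited mapping table with a header row (for example, one exported from a database). The column names and every ID below are invented placeholders, not real identifiers:

```python
import csv
import io

def load_id_map(handle, from_col, to_col):
    """Read a tab-delimited table with a header row into a from -> {to} dict."""
    mapping = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        src, dst = row[from_col], row[to_col]
        if src and dst:  # skip rows where either ID is missing
            mapping.setdefault(src, set()).add(dst)
    return mapping

def convert_ids(ids, mapping):
    """Return (converted, unmapped); IDs with several targets keep all of them."""
    converted, unmapped = {}, []
    for gene_id in ids:
        if gene_id in mapping:
            converted[gene_id] = sorted(mapping[gene_id])
        else:
            unmapped.append(gene_id)
    return converted, unmapped

# Placeholder table: these IDs are invented for illustration only.
EXAMPLE_TABLE = (
    "symbol\tother_id\n"
    "GENE_A\tID0001\n"
    "GENE_A\tID0002\n"
    "GENE_B\tID0003\n"
    "GENE_C\t\n"
)

id_map = load_id_map(io.StringIO(EXAMPLE_TABLE), "symbol", "other_id")
converted, unmapped = convert_ids(["GENE_A", "GENE_B", "GENE_C", "GENE_D"], id_map)
```

Keeping one-to-many mappings and reporting unmapped IDs explicitly matters here, because partially overlapping ID systems rarely map one-to-one.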
20
What can you get where
NCBI GenBank and RefSeq
GenBank handles "vanilla" sequences with mandatory identifiers and optional feature annotations
RefSeq filters GenBank to provide the "best" sequences for features of interest
In particular, genomes, chromosomes, genes, mRNAs, peptides
Mostly curated and non-redundant, uses ID codes to indicate status/type
NC = chromosome/genome, NM = mRNA, NP = protein, NR = other RNA, X* (XM, XP, XR) = automated predictions
Entrez handles "conceptual" information about genes and other loci
For example one Entrez gene ID will tag a gene's location, GenBank IDs, RefSeq IDs, etc.
Plus links to other info: other DBs, disease, function, conservation, regulation, etc.
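Since the RefSeq prefix encodes status/type, it is easy to check in code. A small sketch, assuming accessions of the form PREFIX_digits.version; the table covers only the prefixes listed above plus the XM/XP/XR predicted-model variants, and the example accessions are placeholders:

```python
REFSEQ_PREFIXES = {
    "NC": "chromosome/genome",
    "NM": "mRNA",
    "NP": "protein",
    "NR": "other RNA",
    "XM": "mRNA (automated prediction)",
    "XP": "protein (automated prediction)",
    "XR": "other RNA (automated prediction)",
}

def describe_refseq(accession):
    """Describe a RefSeq-style accession from its two-letter prefix."""
    prefix, sep, _rest = accession.partition("_")
    if not sep:
        # GenBank-style accessions have no underscore.
        return "not RefSeq-style (no underscore)"
    return REFSEQ_PREFIXES.get(prefix.upper(), "RefSeq-style, prefix not in this table")

# Placeholder accessions, not real records.
examples = {acc: describe_refseq(acc) for acc in ("NM_000000.0", "XP_000000.0", "U00000")}
```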
EBI EMBL and Ensembl
Beautiful genome browser for finished + draft genomes
End-to-end annotation pipeline from genomes to proteins with varying levels of curation
My favorite – consistent ID scheme, pretty web interface
Programmatic interface at www.biomart.org – critical for automation
UCSC Genome Browser
The tool for overlaying your own information (tracks) over GenBank genomes
Allows either start-stop feature annotations or per-nucleotide continuous "wiggle" tracks
Primary site for (mod)ENCODE, polymorphisms (HapMap), conservation, nucleosomes, etc.
UI solidly in 1990s, Web 2.0ey updates like JBrowse mildly horrid but progressing
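The two track styles map onto two simple text formats. A rough sketch of writing each by hand, assuming the BED convention (0-based start, exclusive end) for start-stop features and a fixedStep wiggle block (1-based start) for per-nucleotide values; the helper names, feature name, and values are hypothetical:

```python
def bed_line(chrom, start, end, name):
    """One BED feature line; BED starts are 0-based and ends are exclusive."""
    if not 0 <= start < end:
        raise ValueError("need 0 <= start < end")
    return f"{chrom}\t{start}\t{end}\t{name}"

def wiggle_fixed_step(chrom, start, values, step=1):
    """A fixedStep wiggle block; wiggle positions are 1-based."""
    header = f"fixedStep chrom={chrom} start={start} step={step}"
    return "\n".join([header] + [f"{v:g}" for v in values])

# The same region both ways: bases 100-200 as a feature,
# and three per-base values starting at base 100.
feature = bed_line("chr1", 99, 200, "my_feature")
signal = wiggle_fixed_step("chr1", 100, [0.5, 2, 1.25])
```

The off-by-one between the two formats (99 versus 100 for the same first base) is a common source of shifted tracks.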
30
BLAST, considerations, and flavors
Basic Local Alignment Search Tool: performs one or more local alignments
Heuristically scores "best" alignments, length of stat. sig. alignment, %ID, and p/e-value
E-value: expected # of hits this good given the query + database size
Will thus change for different databases, or for the same DB over time!
Related to bit score: the alignment score normalized for the scoring system, so it does not depend on database size
i.e. how much better the match scores than random sequence would
Interconvertible with p-value, the probability of getting a hit this good by chance
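These relationships can be written down directly. A minimal sketch, assuming the standard Karlin-Altschul forms E = m * n * 2^(-S') and p = 1 - exp(-E), where S' is the bit score and m and n are query and database lengths (BLAST uses length-corrected "effective" values, which this ignores):

```python
import math

def evalue_from_bit_score(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least this well."""
    return query_len * db_len * 2.0 ** (-bit_score)

def pvalue_from_evalue(evalue):
    """Probability of at least one chance hit, assuming Poisson-distributed hits."""
    # -expm1(-E) is 1 - exp(-E), computed accurately for tiny E.
    return -math.expm1(-evalue)

# Same hit (bit score 40, query of 100) against a small and a large database.
small_db = evalue_from_bit_score(40, 100, 10**6)
large_db = evalue_from_bit_score(40, 100, 10**9)
```

For small values E and p are nearly equal, but E can exceed 1 while p cannot, and the same bit score gives a 1000-fold larger E-value against the 1000-fold larger database.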
DNA/DNA, DNA/protein, protein/DNA, protein/protein
blastn: nuc/nuc, best at finding evolutionarily related DNA sequences
blastp: aa/aa, best at finding functionally related protein sequences
blastx: nuc->aa, best at finding functionally related seqs. without a translated ORF (slow!)
tblastn: aa->nuc, ditto above (slower, less often useful)
Finds protein homologs in unannotated sequence – now usually easier to annotate
tblastx: nuc/nuc with translation of both query and database, ditto above (slowest)
Only use this if you have no idea what your sequence is or what you're looking for
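What the translated flavors do to a nucleotide sequence before comparing it can be sketched as a six-frame translation. This assumes the standard genetic code and an uppercase or lowercase ACGT input; it is an illustration of the idea, not how BLAST implements it:

```python
from itertools import product

# Standard genetic code in the usual compact TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate from the first base; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))

def six_frames(seq):
    """Translate all three offsets on both strands, keyed '+1'..'+3', '-1'..'-3'."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames

frames = six_frames("ATGGCC")
```

Six translations per sequence is part of why the translated searches are slow, and tblastx does this to both the query and the database.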
psi-blast: Position Specific Iterative, good for very remote homology
Builds a position-specific scoring matrix (PSSM) from the significant hits of each round, then searches again with it to pick up weaker homologs
This is not a class about the BLAST algorithm(s)!
Basic idea: not all sequence similarities are created equal
BLAST uses scoring matrices that capture empirical nuc/nuc and aa/aa mutation probabilities
PAM# matrices: Point or Percent Accepted Mutation, # expected mutations per 100aa
Note that these are _with_ replacement, so PAM250 per 100aa means ~80aa affected
BLOSUM# matrices: BLOcks SUbstitution Matrix
# is the %ID at which the training sequences were clustered, so higher is closer; opposite of PAM!
Basic idea is that perfect hits of a given _word size_ are found first
Mismatches and gaps are extended around these high-scoring segment pairs (HSPs)
Higher word size = faster = more similar sequences = more false negatives
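The seeding step can be sketched in a few lines, assuming exact word matches only (protein BLAST also accepts similar "neighborhood" words, which is skipped here). The function name and sequences are made up for illustration:

```python
from collections import defaultdict

def find_seeds(query, subject, word_size):
    """Return (query_pos, subject_pos) for every exact shared word."""
    # Index every word in the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)
    # Scan the subject once, looking each word up in the index.
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], ()):
            seeds.append((i, j))
    return seeds

query, subject = "ACGTACGA", "TTACGTTT"
seeds_w3 = find_seeds(query, subject, 3)
seeds_w4 = find_seeds(query, subject, 4)
seeds_w6 = find_seeds(query, subject, 6)
```

With these toy sequences, word size 3 yields more seeds than 4, and 6 yields none: fewer seeds means less extension work (faster), but a match with no perfect word of that size is never found at all.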
Some potentially obvious caveats
e-values are not probabilities!
nuc/nuc, aa/aa, and translated queries will tell you very different things
Often boils down to conservation versus function
It's easy to get very strong matches for only short segments of a query sequence
Must watch all of e-value, %ID, hit length, and hit structure
Think of e-value like a p-value, %ID and length like effect size
It's easy to be fooled by sequence families like transposable elements, TM domains, env. seqs.
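Watching e-value, %ID, and hit length together is easy to automate. A sketch, assuming BLAST+ default tabular output (-outfmt 6, 12 tab-separated columns); the thresholds are arbitrary placeholders, the example rows are invented, and alignment length over query length is only a rough stand-in for query coverage because it counts gaps:

```python
from collections import namedtuple

Hit = namedtuple("Hit", "qseqid sseqid pident length evalue bitscore")

def parse_outfmt6(lines):
    """Parse default 12-column BLAST+ tabular output (-outfmt 6)."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        # Columns: qseqid sseqid pident length mismatch gapopen
        #          qstart qend sstart send evalue bitscore
        yield Hit(f[0], f[1], float(f[2]), int(f[3]), float(f[10]), float(f[11]))

def keep_hit(hit, query_len, max_evalue=1e-5, min_pident=30.0, min_query_frac=0.5):
    """Require significance, identity, and a hit spanning enough of the query."""
    return (hit.evalue <= max_evalue
            and hit.pident >= min_pident
            and hit.length / query_len >= min_query_frac)

# Invented rows for a 300-residue query: a good hit, a short perfect hit,
# and a long but insignificant hit.
rows = [
    "q1\ts1\t98.0\t300\t6\t0\t1\t300\t10\t309\t1e-150\t540",
    "q1\ts2\t100.0\t25\t0\t0\t40\t64\t5\t29\t2e-06\t50.1",
    "q1\ts3\t28.0\t280\t190\t12\t5\t284\t1\t270\t0.5\t30.2",
]
kept = [h.sseqid for h in parse_outfmt6(rows) if keep_hit(h, query_len=300)]
```

The second row is the trap described above: 100% identity and a passing e-value, but it covers under a tenth of the query, so only the first hit survives.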
30
Tools
BioMart: http://www.biomart.org
UGENE: http://ugene.unipro.ru
GenomeView: http://genomeview.org
Some others that didn't make the cut
Genome Workbench: http://www.ncbi.nlm.nih.gov/projects/gbench/
This would be my backup option: automatically connects to NCBI, like UGENE but different
IGV: http://www.broadinstitute.org/igv/
IGV is to GenomeView as Genome Workbench is to UGENE: nice alternative, bit more complex
geWorkbench: http://wiki.c2b2.columbia.edu/workbench/
Great for combining basic seq. data with expression/proteomics, fewer seq.-specific operations
SeqVISTA: http://zlab.bu.edu/SeqVISTA/
Potentially interesting for comparative genomics, but buggy and not as feature-rich