Download Genomes and sequence alignment

Monday #6

10  Problem set questions

20  Sequences: what's out there?
    Two major archives
        NCBI: total nucleotide repository, "reference" sequences (more later), tools, features, annotations, etc.
        EBI: EU counterpart, functionally equivalent (tends to have a bit less data, a bit better tools)
    DNA
        Complete genome sequences
            Fully assembled into linear or circular chromosomes
            Annotated with specific remaining errors: gaps, ambiguities, expected mistakes per nucleotide
            Screened for contamination
        Draft genome sequences
            Partially assembled into reasonably sized contigs/scaffolds to a few X coverage
            Can still contain substantial contamination and error, but very usable if you don't trust specifics
        Miscellany
            GenBank accepts any identified sequence:
                Plasmids, chromosomes, individual transcripts, different clones/inserts, metagenomes, etc.
            I've never found most of these random sequences to be particularly useful
        Raw short reads
            NCBI: Sequence Read Archive (SRA) just went poof; state of high-throughput sequencing unclear
            EBI: European Nucleotide Archive (ENA) accepts raw high-throughput seq. read files
            These are both difficult to use; what do you do with someone else's gigantic raw short reads?
    RNA
        cDNAs/ESTs
            Not used so much anymore: single pass, high quality sequences from RTed mRNAs
            Can be used to catalog portions of genomes that are actively transcribed
            Great for organisms without high quality sequenced genomes or annotations
            Poor-man's RNA-seq (I can say this now; couldn't five years ago!)
        RNA-seq
            In the US, deposited in GEO like microarrays
            In the EU, deposited in EMBL like DNA
            Specific RNA types (miRNA, rRNA, etc.) deposited in specialty databases
            Transcriptomic sequence database management is hooey so far
    Amino acids
        Won't discuss today, but AA seqs. are typically handled very differently and in different DBs
    Features: annotations, from location to function
        Loci are referred to as "features", which can be anything
            Genes, introns/exons, polymorphisms, regulatory elements, conserved regions, islands, etc.
        Raw sequences don't have these (obviously!)
            Have to be added after the fact, usually first-pass computational, subsequently curated
            DBs differ by type of feature, automation process, level of curation, philosophy, etc. etc.
            Now typically much harder to get these than the sequence itself!
    Alignments
        Pairwise alignment: best match between exactly two sequences
        Multiple sequence alignment (MSA): best match between >2 sequences
            Note that MSA != best multiple pairwise alignments
            Comparative genomics, conservation, and evolution: more on this later
        Searches: BLAST and friends are a special case of pairwise alignment
            Find the best part of a "big" reference sequence that matches a "small" query sequence
            Or find the best "other" sequence in a big database that matches a "small" query sequence
    A note on gene IDs
        How do you put a standard identifier on _anything_ in genomics?
        Largest ratio of (poor implementation)*(irritation)/(necessity)/(apparent simplicity) in bioinformatics
        Partially overlapping ID systems:
            GenBank, RefSeq, EMBL-Bank, EMBL, UniGene, UniRef, HomoloGene, KO, every array platform, Entrez, HGNC, KEGG, UCSC, every model organism DB...
        And these just cover genes!
            Different competing systems for proteins, functions, diseases, physiology, you name it
        A large % of the reason we learn Python is so you can automate things like gene ID conversion

20  What can you get where
    NCBI GenBank and RefSeq
        GenBank handles "vanilla" sequences with mandatory identifiers and optional feature annotations
        RefSeq filters GenBank to provide the "best" sequences for features of interest
            In particular, genomes, chromosomes, genes, mRNAs, peptides
            Mostly curated and non-redundant, uses ID codes to indicate status/type
            NC = chromosome/genome, NM = mRNA, NP = protein, NR = other RNA, X = automated
        Entrez handles "conceptual" information about genes and other loci
            For example one Entrez gene ID will tag a gene's location, GenBank IDs, RefSeq IDs, etc.
            Plus links to other info: other DBs, disease, function, conservation, regulation, etc.
    EBI EMBL and Ensembl
        Beautiful genome browser for finished + draft genomes
        End-to-end annotation pipeline from genomes to proteins with varying levels of curation
        My favorite: consistent ID scheme, pretty web interface
        Programmatic interface at www.biomart.org: critical for automation
    UCSC Genome Browser
        The tool for overlaying your own information (tracks) over GenBank genomes
        Allows either start-stop feature annotations or per-nucleotide continuous "wiggle" tracks
        Primary site for (mod)ENCODE, polymorphisms (HapMap), conservation, nucleosomes, etc.
        UI solidly in 1990s, Web 2.0ey updates like JBrowse mildly horrid but progressing

30  BLAST, considerations, and flavors
    Basic Local Alignment Search Tool: performs one or more local alignments
        Heuristically scores "best" alignments, length of stat. sig. alignment, %ID, and p/e-value
        E-value: expected # of hits this good given the query + database size
            Will thus change for different databases, or for the same DB over time!
            Related to bit score: amount of information in the query relative to the DB
                i.e. amount of non-random change in the sequence
            Interconvertible with p-value, the probability of getting a hit this good by chance
    DNA/DNA, DNA/protein, protein/DNA, protein/protein
        blastn: nuc/nuc, best at finding evolutionarily related DNA sequences
        blastp: aa/aa, best at finding functionally related protein sequences
        blastx: nuc->aa, best at finding functionally related seqs. without a translated ORF (slow!)
        tblastn: aa->nuc, ditto above (slower, less often useful)
            Finds protein homologs in unannotated sequence; now usually easier to annotate
        tblastx: nuc/nuc with translation, ditto above (slowest)
            Only use this if you have no idea what your sequence is or what you're looking for
        psi-blast: Position Specific Iterative, good for very remote homology
            Starts with weak hits, repeatedly "zooms in" to the surrounding area to strengthen hit
    This is not a class about the BLAST algorithm(s)!
        Basic idea: not all sequence similarities are created equal
            BLAST uses scoring matrices that capture empirical nuc/nuc and aa/aa mutation probabilities
            PAM# matrices: Point or Percent Accepted Mutation, # expected mutations per 100aa
                Note that these are _with_ replacement, so PAM250 per 100aa means ~80aa affected
            BLOSUM# matrices: BLOcks of Amino Acid SUbstitution Matrix
                Based on # expected %ID, so higher is closer; opposite of PAM!
        Basic idea is that perfect hits of a given _word size_ are found first
            Mismatches and gaps are extended around these high-scoring segment pairs (HSPs)
            Higher word size = faster = more similar sequences = more false negatives
    Some potentially obvious caveats
        e-values are not probabilities!
        nuc/nuc, aa/aa, and translated queries will tell you very different things
            Often boils down to conservation versus function
        It's easy to get very strong matches for only short segments of a query sequence
            Must watch all of e-value, %ID, hit length, and hit structure
            Think of e-value like a p-value, %ID and length like effect size
        It's easy to be fooled by sequence families like transposable elements, TM domains, env. seqs.

30  Tools
    BioMart: http://www.biomart.org
    UGENE: http://ugene.unipro.ru
    GenomeView: http://genomeview.org
    Some others that didn't make the cut
        Genome Workbench: http://www.ncbi.nlm.nih.gov/projects/gbench/
            This would be my backup option: automatically connects to NCBI, like UGENE but different
        IGV: http://www.broadinstitute.org/igv/
            IGV is to GenomeView as Genome Workbench is to UGENE: nice alternative, bit more complex
        geWorkbench: http://wiki.c2b2.columbia.edu/workbench/
            Great for combining basic seq. data with expression/proteomics, fewer seq.-specific operations
        SeqVISTA: http://zlab.bu.edu/SeqVISTA/
            Potentially interesting for comparative genomics, but buggy and not as feature-rich
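
Worked example: RefSeq ID codes

Since the notes point out that much of the reason for learning Python is automating ID handling, here is a minimal sketch that reads the RefSeq ID codes listed under "What can you get where" (NC, NM, NP, NR, X). The function name `describe_refseq` and the lookup table are made up for illustration, and the accession strings in the demo are only examples of the format, not records that were looked up. Anything with an "X" prefix (for instance XM, XP, XR) is simply reported as automated.

```python
# Prefix meanings copied from the notes; this table is illustrative,
# not an official or exhaustive list of RefSeq prefixes.
REFSEQ_PREFIXES = {
    "NC": "chromosome/genome",
    "NM": "mRNA",
    "NP": "protein",
    "NR": "other RNA",
}


def describe_refseq(accession: str) -> str:
    """Return a rough description of a RefSeq-style accession (hypothetical helper)."""
    # RefSeq-style accessions look like PREFIX_digits.version, e.g. NM_000546.6
    prefix, sep, _rest = accession.strip().upper().partition("_")
    if not sep:
        # No underscore: probably a plain GenBank-style accession
        return "not a RefSeq-style accession"
    if prefix in REFSEQ_PREFIXES:
        return REFSEQ_PREFIXES[prefix]
    if prefix.startswith("X"):
        # X* prefixes mark automatically generated records
        return "automated"
    return "unknown RefSeq prefix"


if __name__ == "__main__":
    # Example strings only; they illustrate the format.
    for acc in ["NM_000546.6", "NP_000537.3", "XM_000001.1", "U00096.3"]:
        print(acc, "->", describe_refseq(acc))
```

A table lookup like this is only a first-pass filter; converting between ID systems still needs a mapping source such as BioMart or Entrez.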

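Worked example: bit score, E-value, and p-value

The BLAST notes say the E-value depends on query and database size, is related to the bit score, and is interconvertible with a p-value. A minimal sketch of those relationships, assuming the standard Karlin-Altschul form E = m * n * 2^(-S') for bit score S' and the Poisson conversion P = 1 - exp(-E). The function names are invented here, and the lengths are passed in raw, whereas real BLAST uses edge-corrected "effective" lengths, so the numbers will not exactly match BLAST output.

```python
import math


def evalue_from_bits(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits scoring at least bit_score.

    Uses E = m * n * 2**(-S'), with m = query length and n = database length
    (raw lengths, an assumption; BLAST itself uses effective lengths).
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def pvalue_from_evalue(evalue: float) -> float:
    """Probability of at least one chance hit this good: P = 1 - exp(-E)."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for tiny E
    return -math.expm1(-evalue)


if __name__ == "__main__":
    query_len = 300  # hypothetical query length
    for db_len in (10**6, 10**9):  # hypothetical database sizes
        e = evalue_from_bits(40.0, query_len, db_len)
        print(f"db={db_len:>10}  E={e:.3g}  P={pvalue_from_evalue(e):.3g}")
```

The same bit score gives a larger E-value against the larger database, which is why E-values change between databases and over time. For small E the p-value is nearly equal to E, but for large E the two diverge (P can never exceed 1), which is the sense in which e-values are not probabilities.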