What is sequencing?
Video: https://www.youtube.com/watch?v=womKfikWlxM (Illumina video)

Sequences: what's out there?
NCBI: total nucleotide repository, "reference" sequences (more later), tools, features, annotations, etc. https://www.youtube.com/watch?v=Phxkg5H5Q6E
EBI: EU counterpart, functionally equivalent (tends to have a bit less data, a bit better tools)

DNA:
  Complete genome sequences
  Draft genome sequences: lower accuracy, only partially assembled; useful, but annotation often improves dramatically later.
  Miscellany: GenBank accepts any identified sequence.
  Eric Altermann et al. PNAS 2005;102:3906-3912

Raw short reads:
  Sequence Read Archive (SRA)
  European Nucleotide Archive (ENA)
  Plans are under way to bring SRA and ENA together under the INSDC, standardizing deposition protocols, storage formats, access patterns, etc.

RNA: cDNA/ESTs
  Not used so much anymore: single-pass, high quality sequences from reverse-transcribed (RT) mRNAs.
  Can be used to catalog the portions of a genome that are actively transcribed.
  Great for organisms without high-quality sequenced genomes or annotations.
  ESTs are often 300-800 bp.
  Early efforts identified many hundreds of genes that were novel at the time.
  dbEST is a division of GenBank.

RNA: RNA-seq
  US and EU repositories; underutilized.

Amino acids:
  Won't discuss today, but amino acid sequences are typically handled very differently and in different databases.

Features: annotations, from location to function.
  Loci are referred to as "features", which can be anything: genes, introns/exons, polymorphisms, regulatory elements, conserved regions, islands, etc.

Alignments
Pairwise alignment is the process of lining up two sequences to achieve maximal levels of identity.
Fig 3.5 Pevsner. Pairwise alignment of human beta globin (query) and myoglobin (subject)

Basic Local Alignment Search Tool (BLAST)
BLAST is an algorithm that lets the user select one sequence (the query), perform pairwise sequence alignments between the query and every sequence in a database, and identify the database sequences that resemble the query.

We can assess the relatedness of any two proteins by performing a pairwise alignment using the NCBI pairwise BLAST tool. Perform the following steps:
1.
Choose the protein BLAST program and select “BLAST 2 sequences” for our comparison of two proteins. An alternative is to select blastn (for “BLAST nucleotides”) for DNA-DNA comparison.
2. Enter the sequences or their accession numbers. Here we use the sequence of human beta globin in FASTA format, and for myoglobin we use the accession number (Fig. 3.4).
3. Select any optional parameters:
   Scoring matrices: BLOSUM#, PAM#
   Gap extension penalty
   Change reward and penalty values
4. Click BLAST. The output includes a pairwise alignment using the single-letter amino acid code.
NOTE: Similar pairs of residues are structurally or functionally related. They may look different, but they share similar biochemical properties.

BLAST algorithm
1) Make a k-letter word list of the query sequence, for example k = 3.
2) List the possible matching words. BLAST cares only about the high-scoring words.
We use a scoring matrix to compare each word in the list from 1) with all possible 3-letter words. For example, for the query word PQG, different scores are obtained when it is compared to PEG and to PQA. Only keep words that surpass a threshold T.
3) Organize the remaining high-scoring words into an efficient search tree.
4) Repeat steps 2-3) for each k-letter word in the query sequence.
5) Scan the database sequences for exact matches with the remaining high-scoring words. Each match seeds a high-scoring segment pair (HSP).
6) List all the HSPs in the database whose score is high enough to be considered.
7) Evaluate the significance of the HSP score (E-value).
8) Report every match whose expect score is lower than a threshold parameter E.
E-value: the number of times that an unrelated database sequence would obtain a score S higher than x by chance.

BLAST scoring matrices: PAM#, BLOSUM#
BLOSUM62 scoring matrix of Henikoff and Henikoff (1992).

Sequences: what's out there? A note on gene IDs
How do you put a standard identifier on _anything_ in genomics?
Partially overlapping ID systems: GenBank, RefSeq, EMBL-Bank, EMBL, UniGene, UniRef, HomoloGene, KO, every array platform, Entrez, HGNC, KEGG, UCSC, every model organism DB...
And these just cover genes!
Different competing systems for proteins, functions, diseases, physiology, you name it.
A large part of the reason we learn Python is so you can automate things like gene ID conversion.

What can you get where?
GenBank: http://www.ncbi.nlm.nih.gov/genbank/
  Tutorial: https://www.youtube.com/watch?v=g5a__okj5Zs
RefSeq: http://www.ncbi.nlm.nih.gov/refseq/
Ensembl: http://www.ensembl.org/index.html
  Tutorial: http://www.ensembl.org/Multi/Help/Movie?db=core;id=188
UCSC Genome Browser: https://genome.ucsc.edu
JBrowse: http://jbrowse.org
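
Example: BLAST word list and seeding in Python
The word-list, threshold, and scan steps of the BLAST algorithm described earlier (steps 1, 2 and 5) can be sketched in a few lines of Python. This is a minimal sketch, not NCBI's implementation. The five-residue score table is a hand-copied excerpt of BLOSUM62 and should be checked against the published matrix. The function names (word_score, query_words, neighborhood, find_seeds) are made up for illustration. A plain dict stands in for the efficient search tree of step 3, and "surpass T" is read here as "score at least T".

```python
from itertools import product

# Assumption: hand-copied excerpt of BLOSUM62 covering only A, E, G, P, Q.
# A real search would load the full 20-residue matrix.
_PAIR_SCORES = {
    ("A", "A"): 4, ("A", "E"): -1, ("A", "G"): 0, ("A", "P"): -1, ("A", "Q"): -1,
    ("E", "E"): 5, ("E", "G"): -2, ("E", "P"): -1, ("E", "Q"): 2,
    ("G", "G"): 6, ("G", "P"): -2, ("G", "Q"): -2,
    ("P", "P"): 7, ("P", "Q"): -1,
    ("Q", "Q"): 5,
}
ALPHABET = "AEGPQ"  # toy alphabet matching the excerpt above


def pair_score(a, b):
    """Score one residue pair; the table is symmetric, so try both orders."""
    key = (a, b) if (a, b) in _PAIR_SCORES else (b, a)
    return _PAIR_SCORES[key]


def word_score(w1, w2):
    """Sum of per-position scores for two words of equal length."""
    return sum(pair_score(a, b) for a, b in zip(w1, w2))


def query_words(query, k=3):
    """Step 1: every overlapping k-letter word of the query."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


def neighborhood(word, threshold):
    """Step 2: all words over ALPHABET scoring at least `threshold` against `word`."""
    hits = {}
    for letters in product(ALPHABET, repeat=len(word)):
        candidate = "".join(letters)
        score = word_score(word, candidate)
        if score >= threshold:
            hits[candidate] = score
    return hits


def find_seeds(query, subject, k=3, threshold=13):
    """Steps 3-5: index the high-scoring words, then scan the subject for exact hits.

    Returns (query_position, subject_position) pairs. Extension of each seed
    into an HSP and the E-value calculation are not implemented here.
    """
    index = {}  # dict used in place of a search tree
    for qpos, word in enumerate(query_words(query, k)):
        for neighbor in neighborhood(word, threshold):
            index.setdefault(neighbor, []).append(qpos)
    seeds = []
    for spos in range(len(subject) - k + 1):
        for qpos in index.get(subject[spos:spos + k], []):
            seeds.append((qpos, spos))
    return seeds


if __name__ == "__main__":
    print(word_score("PQG", "PEG"), word_score("PQG", "PQA"))
    print(neighborhood("PQG", 13))
    print(find_seeds("PQG", "AAPEGAA"))
```

With T = 13, the query word PQG keeps PEG (score 15) in its word list but drops PQA (score 12), so a subject sequence containing PEG still produces a seed even though it has no exact copy of PQG.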