Database Searching
• DB Web addresses
• Orthology and paralogy
• A practical approach
• Searching the primaries
• Searching the secondaries
• Significance of database matches
• Software Web addresses

Why Search Databases?
• To find out whether a new DNA sequence has already been deposited in the databanks.
• To find proteins homologous to a putative coding ORF.
• To find similar non-coding DNA stretches in the database (for example, repeat elements or regulatory sequences).
• To locate false priming sites for a set of PCR oligonucleotides.

What Databases Are Available?
• DNA (nucleotide sequences). The big databases: GenBank, EMBL, DDBJ and their weekly updates. These databases exchange information routinely.
• Genomic databases such as Human (GDB), Mouse (MGD), Yeast (SGD), etc.
• Special databases:
  – ESTs (expressed sequence tags)
  – STSs (sequence-tagged sites)
  – EPD (Eukaryotic Promoter Database)
  – REPBASE (repetitive sequence database)
  – and many others.
• Protein (amino acid sequences). The big databases are:
  – Swiss-Prot (high level of annotation)
  – PIR (Protein Identification Resource)
• Translated databases such as:
  – SPTREMBL (translated EMBL)
  – GenPept (translation of coding regions in GenBank)
• Special databases such as:
  – PDB (sequences derived from the 3D structures in the Brookhaven PDB)

Web Addresses
• http://www.ncbi.nlm.nih.gov/Entrez/
  – http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=nucleotide
  – http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
  – http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein

Let us go to http://www.ncbi.nlm.nih.gov/Entrez/

What is GenBank?
• http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
• GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences …

Access to GenBank
• http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
• GenBank is available for searching at NCBI via several methods.
• The GenBank database is designed to provide and encourage access within the scientific community to the most up-to-date and comprehensive DNA sequence information. Therefore, NCBI places no restrictions on the use or distribution of the GenBank data.

NCBI databases
• http://www.ncbi.nlm.nih.gov/Database/index.html
Let us try a tutorial: http://www.ncbi.nlm.nih.gov/Database/tut1.html

Web Addresses
• http://www.ebi.ac.uk/Databases/
  – http://www.ebi.ac.uk/embl/index.html
  – http://www.ebi.ac.uk/swissprot/index.html
  – http://www.ebi.ac.uk/microarray/ArrayExpress/arrayexpress.html

Homology and Analogy
It is important to understand a concept that underpins sequence analysis: homology. The term homology is confounded and abused in the literature. Simply put, sequences are said to be homologous if they are related by divergence from a common ancestor.

What Is Homology? (from the Technion course)
• Similarity or likeness between properties in species.
• Before Darwin, homology was defined morphologically.
• Example:

Homology
Bats and butterflies both fly, but their wings are different. Bats fly and whales swim, yet the bones in a bat's wing and a whale's flipper are strikingly alike. Bat wings and butterfly wings are not homologous. Bat wings and whale flippers are homologous.

Homology: Interpretation from Darwin to the 21st Century
• Darwin (1859) explained homology as the result of descent with modification from a common ancestor.
• Modern genetics: homology information is in the genes.
• Two sequences are homologous if they are both similar and have a common ancestor.

When Does Similarity Imply Homology?
• Similarity by itself is not enough: for example, similarity between short sequences could arise by chance in sequences that descend from different ancestors.
• Sufficiently extensive similarity typically implies homology (and usually we have no direct evidence of descent).
• Sequence similarity comes with a significance measure.
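
The point that short matches can be random while longer ones are unlikely to be can be illustrated with a small shuffle test. This is a minimal sketch under stated assumptions, not the statistic used by any database search program: the function names (identity, shuffle_p_value), the ungapped position-by-position comparison, and the toy sequences are all hypothetical choices made for illustration.

```python
import random


def identity(a: str, b: str) -> float:
    """Fraction of matching positions in two equal-length, ungapped sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def shuffle_p_value(a: str, b: str, trials: int = 2000, seed: int = 0) -> float:
    """Estimate how often a shuffled copy of b matches a at least as well as b does."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    observed = identity(a, b)
    letters = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)  # same composition as b, random order
        if identity(a, "".join(letters)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate above zero.
    return (hits + 1) / (trials + 1)


# Hypothetical toy sequences (not taken from any database).
short_a, short_b = "ACGT", "ACGA"
long_a = "ACGTTGCAAGCTTAGGCTAACGTACCGATTGACCTAGGAT"
long_b = long_a[:30] + "T" * 10  # first 30 positions kept, last 10 replaced

for name, a, b in (("short", short_a, short_b), ("long", long_a, long_b)):
    print(name, identity(a, b), shuffle_p_value(a, b))
```

The two pairs have similar identity (75% and 80%), but a shuffled copy reaches the short pair's score fairly often, whereas it practically never reaches the long pair's score. That difference is what a significance measure is meant to capture.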

Homology and Analogy
Understanding homology allows us to appreciate the concept of analogy. Analogy is encountered in protein structures that share similar folds but have no demonstrable sequence similarity, or that share groups of catalytic residues with almost exactly equivalent spatial geometries but otherwise have neither sequence nor structural similarity. Such relationships are thought to result from convergence to similar biological solutions from different evolutionary starting points.

The essence of sequence analysis is the inference of homology. Homology is not a measure of similarity, but an absolute statement that sequences have a divergent rather than a convergent relationship. Thus, phrases that quantify homology (such as "percent homology") are meaningless.

Orthology and Paralogy
Homologous proteins may perform the same function in different species (orthologues) or different but related functions within one organism (paralogues). Comparison of orthologues allows study of molecular palaeontology, while paralogues have provided deeper insights into the underlying mechanisms of evolution.

Paralogues arose from single genes via successive duplication events. The duplicated genes followed separate evolutionary pathways, and new specificities evolved through variation and adaptation.

Complete genomes
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome
• Let us take a walk among the genomes.

COGs
Phylogenetic classification of proteins encoded in complete genomes
Clusters of Orthologous Groups of proteins (COGs) were delineated by comparing protein sequences encoded in 43 complete genomes, representing 30 major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain.
Proteins from two eukaryotic genomes (Drosophila melanogaster and Caenorhabditis elegans) were assigned to COGs and can be reached from each individual COG page.

COGs
• http://www.ncbi.nlm.nih.gov/COG/
• Cognitor
• http://www.ncbi.nlm.nih.gov/COG/xognitor.html
• COG Help
• http://www.ncbi.nlm.nih.gov/COG/COGhelp.html#top
» FTP ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Mycobacterium_leprae/
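
The COG description above states that each COG contains individual proteins or groups of paralogs from at least 3 lineages. The sketch below illustrates only that membership criterion on toy data. It does not reproduce how COGs were actually delineated (that involved comparing protein sequences), and the genome names, lineage labels, protein identifiers, and function names are all hypothetical.

```python
from collections import defaultdict

MIN_LINEAGES = 3  # threshold quoted in the COG description


def group_by_genome(cluster):
    """Map each genome to its proteins in the cluster.

    A genome with more than one protein in the same cluster
    corresponds to a group of paralogs in this toy model.
    """
    groups = defaultdict(list)
    for genome, protein in cluster:
        groups[genome].append(protein)
    return dict(groups)


def lineages_covered(cluster, lineage_of):
    """Set of phylogenetic lineages represented in the cluster."""
    return {lineage_of[genome] for genome, _protein in cluster}


def meets_lineage_threshold(cluster, lineage_of, min_lineages=MIN_LINEAGES):
    """True if the cluster spans at least min_lineages distinct lineages."""
    return len(lineages_covered(cluster, lineage_of)) >= min_lineages


# Hypothetical toy data: genome -> lineage, and clusters of (genome, protein).
lineage_of = {
    "genomeA": "lineage1",
    "genomeB": "lineage2",
    "genomeC": "lineage3",
    "genomeD": "lineage1",  # same lineage as genomeA
}
cluster_1 = [("genomeA", "p1"), ("genomeA", "p2"), ("genomeB", "p3"), ("genomeC", "p4")]
cluster_2 = [("genomeA", "p5"), ("genomeD", "p6"), ("genomeB", "p7")]

print(meets_lineage_threshold(cluster_1, lineage_of))  # three lineages covered
print(meets_lineage_threshold(cluster_2, lineage_of))  # only two lineages covered
print(group_by_genome(cluster_1))
```

In this toy model cluster_2 contains three genomes but only two lineages, so it falls short of the threshold, while genomeA contributes a pair of paralogs to cluster_1.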