Tools in bioinformatics
Fall 2009-10

Overview
Goals
• To provide students with practical knowledge of bioinformatics tools and their application in research
Prerequisites
• The course “Introduction to bioinformatics”
• Familiarity with topics in molecular biology (cell biology, biochemistry, and genetics)
• Basic familiarity with computers and the internet

Administration
Course website: http://ibis.tau.ac.il/intro_bioinfo/tools.html

Administration
Classes: a class will be given every two weeks.
There are three class groups:
• Sunday 16:00-18:00
• Monday 12:00-14:00
• Monday 14:00-16:00
Location: computer classroom Sherman 03

Administration
Teachers:
• Nimrod Rubinstein [email protected] (Sundays)
• Daiana Alaluf [email protected] (Mondays I)
• Osnat Penn [email protected] (Mondays II)
• Reception hours: email your instructor any question at any time, or set an appointment (Britania 405, 6409245)

Requirements
• Assignments: 50% of final grade (compulsory)
• Assignments include classwork and homework:
  - Classwork is planned to be completed during the lesson and handed in at the end of it. It will be checked but not graded.
  - Homework should be handed in at the following lesson (two weeks after it is handed out). It will be checked and graded.
• Final project: 50% of final grade
When emailing your instructor (a question, your assignment, or anything else), please state in the “Subject” field: “Tools in Bioinfo”, IDs, CW/HW number (if relevant)

BIOINFORMATICS DATABASES

What’s in a database?
• Sequences: genes, proteins, etc.
• Full genomes
• Expression data
• Structures
• Annotation (information about genes/proteins):
  - function
  - cellular location
  - chromosomal location
  - introns/exons
  - phenotypes, diseases
• Publications

NCBI and Entrez
• One of the largest and most comprehensive database collections; it belongs to the NIH (National Institutes of Health, the primary federal agency for conducting and supporting medical research in the USA)
• Entrez is the search engine of NCBI
• Search for: genes, proteins, genomes, structures, diseases, publications, and more
http://www.ncbi.nlm.nih.gov

PubMed: NCBI’s database of biomedical articles
Yang X, Kurteva S, Ren X, Lee S, Sodroski J. “Subunit stoichiometry of human immunodeficiency virus type 1 envelope glycoprotein trimers during virus entry into host cells”, J Virol. 2006 May;80(9):4388-95.

Use fields!
Yang[AU] AND glycoprotein[TI] AND 2006[DP] AND J virol[TA]
For the full list of field tags, go to Help -> Search Field Descriptions and Tags

Example
• Retrieve all publications in which the first author is Davidovich C and the last author is Yonath A

Using limits
Retrieve the publications of Yonath A in the journals Nature and Proc Natl Acad Sci U S A from the last 5 years

Google Scholar
http://scholar.google.com/

GenBank: NCBI’s gene & protein database
• GenBank is an annotated collection of all publicly available DNA sequences (and their amino-acid translations)
• Holds ~106.5 billion bases in ~108.5 million sequence records (Oct. 2009)

Search demonstration
Searching NCBI for the human CD4 protein

Using field descriptions, qualifiers, and boolean operators
• Cd4[GENE] AND human[ORGN]
  or
  Cd4[gene name] AND human[organism]
• List of field codes: http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers
• Boolean operators: AND, OR, NOT
Note: do not use the Protein name field [PROT], only GENE!
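
The fielded queries shown on the slides above (a value, a field tag in square brackets, and the boolean operators AND, OR, NOT) can also be composed programmatically. The following is a minimal sketch, not part of the course material: the helper names `field_term`, `combine`, and `esearch_url` are hypothetical, and the use of NCBI's E-utilities `esearch` endpoint is an assumption added for illustration (the slides only use the web search box). The code builds the query strings and a URL but does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumption: NCBI's public E-utilities search endpoint (not mentioned on the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(value, tag):
    """Attach an Entrez field tag to a search value, e.g. Yang[AU]."""
    return f"{value}[{tag}]"


def combine(operator, *terms):
    """Join terms with one of the boolean operators listed on the slide."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported boolean operator: {operator}")
    return f" {operator} ".join(terms)


def esearch_url(db, term):
    """Build (but do not fetch) a search URL for the given database and query."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term})


# The PubMed query from the "Use fields!" slide.
pubmed_query = combine(
    "AND",
    field_term("Yang", "AU"),
    field_term("glycoprotein", "TI"),
    field_term("2006", "DP"),
    field_term("J virol", "TA"),
)

# The human CD4 query from the field-qualifier slide.
gene_query = combine("AND", field_term("Cd4", "GENE"), field_term("human", "ORGN"))

print(pubmed_query)
print(gene_query)
print(esearch_url("protein", gene_query))
```

Building the query from tagged terms keeps each field explicit, which makes it harder to forget a tag and accidentally search all fields.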
This time we search the protein database directly

RefSeq
• A subcollection of NCBI databases containing only non-redundant, highly annotated entries (genomic DNA, transcripts (RNA), and protein products)

An explanation of GenBank records

FASTA format
The header line starts with ">" and holds the ID/accession followed by a description; the sequence follows on the next lines.
>gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIK
ILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQL
LVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSG
TWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWW
QAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLA
LEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWV
LNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCV
RCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
Save accession numbers for future use (it makes searching quicker). RefSeq accession number: NP_000607.1

Downloading

Swissprot
A protein sequence database that strives to provide a high level of annotation regarding:
• the function of a protein
• domain structure
• post-translational modifications
• variants
One entry for each protein
http://www.expasy.ch/sprot

GenBank vs. Swissprot
[Figure: GenBank results compared with Swiss-Prot results]

PDB: Protein Data Bank
• Main database of 3D structures of macromolecules
• Includes ~61,000 entries (proteins, nucleic acids, complex assemblies)
• Is highly redundant
http://www.rcsb.org

Human CD4 in complex with HIV gp120
PDB ID 1G9M
[Figure: structure of the complex, with gp120 and CD4 labeled]

Accession Numbers
• GenBank / EMBL:
  - two letters followed by six digits, e.g. AY123456
  - one letter followed by five digits, e.g. U12345
• RefSeq: accession numbers can be distinguished from GenBank accessions by their prefix (two characters + underscore), e.g. NP_015325
  - NM_: mRNA transcript
  - NP_: protein
• Swissprot: six characters, by position: 1 [O,P,Q], 2 [0-9], 3 [A-Z,0-9], 4 [A-Z,0-9], 5 [A-Z,0-9], 6 [0-9], e.g. P12345 and Q9JJS7
• PDB: one digit followed by three letters/digits, e.g. 1hxw

GeneCards
• All-in-one database of human genes (a project of the Weizmann Institute)
• Attempts to integrate as many databases and publications as possible, along with all available knowledge
http://www.genecards.org

Organism-specific databases
• Model organisms have independent databases
HIV database: http://hiv-web.lanl.gov/content/index

Summary
• General and comprehensive databases: NCBI, EMBL
• Genome-specific databases (to be discussed): UCSC, ENSEMBL
• Highly annotated databases:
  - Human genes: GeneCards
  - Proteins: Swissprot, RefSeq
  - Structures: PDB

As important:
1. Google (or any search engine)

And always remember:
2. RT(F)M: read the manual! (/help/FAQ)

GO: Gene Ontology

Gene Ontology
• Strives to provide consistent descriptions of gene products obtained from different databases
• GO annotations include three hierarchical ontologies of gene products:
  - cellular component(s): the environment in which the gene product functions
  - biological process(es): the biological program/pathway in which the gene product is involved
  - molecular function(s): the elemental activities of the gene product
• E.g., cytochrome c:
  - cellular components: mitochondrial matrix and mitochondrial inner membrane
  - biological processes: oxidative phosphorylation and induction of cell death
  - molecular functions: oxidoreductase activity

AmiGO: the official GO browser

Through NCBI

Enrichment analysis
• Query set: n genes in total, k of them with function f
• Reference set: N genes in total, K of them with function f
Is k/n significantly greater than K/N?

Statistical significance testing
Problem formulation: in a group of N genes there are K “special” ones. If we sample n genes out of N (without replacement) and find k “special” ones, would that be considered a random outcome?
Mathematically, we use the hypergeometric distribution to compute the probability of obtaining k or more “special” ones in a sample of n:
\[
p\text{-value} = 1 - \sum_{i=0}^{k-1} f_{HG}(i; N, K, n) = 1 - \sum_{i=0}^{k-1} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}
\]

Materials & Methods
21,121 siRNA knockdown assays, covering the entire coding-sequence part of the genome

Results
273 HIV-dependency factors (HDFs) were discovered
[Figure panels: Biological processes; Subcellular localizations; Molecular functions]

Observations
• Nuclear pore complex components: their loss may impede HIV nuclear access
• Mediator members (Mediator couples TFs to Pol II): points to a requirement for activators to bind the HIV LTRs
• Enzymes involved in glycosylation: HIV’s envelope protein is heavily glycosylated, which assists the virus in entering cells
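
The enrichment p-value from the "Statistical significance testing" slide can be evaluated with a few lines of code. The sketch below is illustrative only: the function name `enrichment_p_value` and the toy numbers are hypothetical and do not come from the study discussed above. It sums the upper tail (i = k up to min(n, K)) directly, which by the complement rule equals 1 minus the sum over i = 0 to k-1 in the formula, and avoids subtracting two nearly equal numbers when the p-value is very small.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """Probability of drawing k or more "special" genes.

    N: genes in the reference set, K of which are "special" (have function f)
    n: genes sampled without replacement (the query set)
    k: "special" genes observed in the sample
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= n):
        raise ValueError("expected 0 <= K <= N, 0 <= n <= N and 0 <= k <= n")
    total = comb(N, n)  # number of ways to choose any n genes out of N
    # Ways to choose exactly i special genes and n - i non-special ones,
    # summed over every i that is at least k.
    upper_tail = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1)
    )
    return upper_tail / total


# Toy numbers (hypothetical): 10 genes, 4 special, sample 3, observe 2 special.
p = enrichment_p_value(k=2, n=3, K=4, N=10)
print(f"p-value = {p:.4f}")
```

Because `math.comb` works on exact integers, the counts stay exact and only the final division produces a floating-point number.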
