Tools in bioinformatics
Fall 2009-10

Overview
Goals: To provide students with practical knowledge of bioinformatics tools and their application in research
Prerequisites:
• The course “Introduction to bioinformatics”
• Familiarity with topics in molecular biology (cell biology, biochemistry, and genetics)
• Basic familiarity with computers & internet

Administration
Course website: http://ibis.tau.ac.il/intro_bioinfo/tools.html
Classes: A class will be given every two weeks
There are three class groups:
• Sunday 16:00-18:00
• Monday 12:00-14:00
• Monday 14:00-16:00
Location: Computer classroom Sherman 03
Teachers:
• Nimrod Rubinstein [email protected] (Sundays)
• Daiana Alaluf [email protected] (Mondays I)
• Osnat Penn [email protected] (Mondays II)
Reception hours: Email your instructor any question at any time or set an appointment (Britania 405, 6409245)

Requirements
Assignments – 50% of final grade (compulsory)
Assignments include classwork and homework:
• Classwork is planned to be completed during the lesson and handed in at the end of it. It will be checked but not graded.
• Homework should be handed in at the following lesson (two weeks after it is handed out). It will be checked and graded.
Final project – 50% of final grade
When emailing your instructor (a question, your assignment, or anything else), please state in the “Subject” field: “Tools in Bioinfo”, IDs, CW/HW number (if relevant)

BIOINFORMATICS DATABASES

What’s in a database?
• Sequences – genes, proteins, etc.
• Full genomes
• Expression data
• Structures
• Annotation – information about genes/proteins:
  - function
  - cellular location
  - chromosomal location
  - introns/exons
  - phenotypes, diseases
• Publications

NCBI and Entrez
One of the largest and most comprehensive databases, belonging to the NIH (National Institutes of Health, the primary federal agency for conducting and supporting medical research in the USA)
Entrez is the search engine of NCBI
Search for: genes, proteins, genomes, structures, diseases, publications, and more
http://www.ncbi.nlm.nih.gov

PubMed: NCBI’s database of biomedical articles
Yang X, Kurteva S, Ren X, Lee S, Sodroski J. “Subunit stoichiometry of human immunodeficiency virus type 1 envelope glycoprotein trimers during virus entry into host cells”, J Virol. 2006 May;80(9):4388-95.

Use fields!
Yang[AU] AND glycoprotein[TI] AND 2006[DP] AND J virol[TA]
For the full list of field tags: go to help -> Search Field Descriptions and Tags

Example
Retrieve all publications in which the first author is Davidovich C and the last author is Yonath A

Using limits
Retrieve the publications of Yonath A in the journals Nature and Proc Natl Acad Sci U S A, in the last 5 years

Google scholar
http://scholar.google.com/

GenBank: NCBI’s gene & protein database
GenBank is an annotated collection of all publicly available DNA sequences (and their amino-acid translations)
Holds ~106.5 billion bases in ~108.5 million sequence records (Oct. 2009)

Search demonstration
Searching NCBI for the protein human CD4

Using field descriptions, qualifiers, and boolean operators
Cd4[GENE] AND human[ORGN]
Or
Cd4[gene name] AND human[organism]
List of field codes: http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers
Boolean Operators: AND OR NOT
Note: do not use the field Protein name [PROT], only GENE!
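
The field-tagged queries shown above all follow one pattern: each search term is followed by its field tag in square brackets, and the terms are joined by a boolean operator. As a minimal sketch of that pattern, the Python snippet below assembles such query strings. The helper name `build_query` is hypothetical (it is not part of Entrez or of any NCBI library), and the snippet only builds the text you would paste into the search box; it does not contact NCBI.

```python
def build_query(terms, operator="AND"):
    """Join (value, field_tag) pairs into an Entrez-style query string.

    `terms` is a list of (value, tag) tuples, e.g. ("Yang", "AU").
    `operator` is one of the boolean operators named on the slide.
    """
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR, or NOT")
    # Each term becomes value[TAG]; terms are separated by the operator.
    return f" {operator} ".join(f"{value}[{tag}]" for value, tag in terms)


# The PubMed query from the "Use fields!" slide.
pubmed_query = build_query(
    [("Yang", "AU"), ("glycoprotein", "TI"), ("2006", "DP"), ("J virol", "TA")]
)

# The gene search for human CD4.
gene_query = build_query([("Cd4", "GENE"), ("human", "ORGN")])

print(pubmed_query)
print(gene_query)
```

Keeping the terms as (value, tag) pairs makes it easy to swap a tag (for example `TI` for a different field) without retyping the whole query.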
This time we directly search in the protein database

RefSeq
Subcollection of NCBI databases with only nonredundant, highly annotated entries (genomic DNA, transcript (RNA), and protein products)

An explanation of GenBank records

Fasta format
Header (ID/accession, description):
>gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
Sequence:
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIK
ILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQL
LVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSG
TWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWW
QAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLA
LEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWV
LNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCV
RCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
Save accession numbers for future use (makes searching quicker):
RefSeq accession number: NP_000607.1

Downloading

Swissprot
A protein sequence database which strives to provide a high level of annotation regarding:
* the function of a protein
* domain structure
* post-translational modifications
* variants
One entry for each protein
http://www.expasy.ch/sprot

GenBank vs. Swissprot
(Screenshots: GenBank results; Swiss-Prot results)

PDB: Protein Data Bank
Main database of 3D structures of macromolecules
Includes ~61,000 entries (proteins, nucleic acids, complex assemblies)
Is highly redundant
http://www.rcsb.org

Human CD4 in complex with HIV gp120
PDB ID 1G9M
(Figure labels: gp120, CD4)

Accession Numbers
GenBank / EMBL:
  Two letters followed by six digits, e.g.: AY123456
  One letter followed by five digits, e.g.: U12345
RefSeq:
  RefSeq accession numbers can be distinguished from GenBank accessions by their prefix (2 characters + underscore), e.g.: NP_015325
  NM_: mRNA transcript, NP_: protein
Swissprot:
  Six characters: position 1 [O,P,Q], position 2 [0-9], positions 3-5 [A-Z,0-9], position 6 [0-9], e.g.: P12345 and Q9JJS7
PDB:
  One digit followed by three letters/digits, e.g.: 1hxw

GeneCards
All-in-one database of human genes (a project by the Weizmann Institute)
Attempts to integrate as many databases, publications, and as much of the available knowledge as possible
http://www.genecards.org

Organism specific databases
Model organisms have independent databases:
HIV database http://hiv-web.lanl.gov/content/index

Summary
General and comprehensive databases: NCBI, EMBL
Genome specific databases (to be discussed): UCSC, ENSEMBL
Highly annotated databases:
Human genes:
• Genecards
Proteins:
• Swissprot, RefSeq
Structures:
• PDB

As important:
1. Google (or any search engine)

And always remember:
2. RT(F)M - Read the manual!!!
(/help/FAQ)

GO: Gene Ontology
The Gene Ontology strives to provide consistent descriptions of gene products obtained from different databases
GO annotations include three hierarchical ontologies of gene products:
• cellular component(s) – the environment in which the gene product functions
• biological process(es) – the biological program/pathway in which the gene product is involved
• molecular function(s) – the elemental activities of the gene product
E.g., cytochrome c:
• cellular components: mitochondrial matrix and mitochondrial inner membrane
• biological processes: oxidative phosphorylation and induction of cell death
• molecular functions: oxidoreductase activity

AmiGO: the official GO browser

Through NCBI

Enrichment analysis
Query set: Total – n genes; Function f – k genes
Reference set: Total – N genes; Function f – K genes
Is k/n significantly greater than K/N?

Statistical significance testing
Problem formulation: In a group of N genes there are K “special” ones. If we sample n genes out of N (without replacement) and find k “special” ones, would that be considered a random outcome?
Mathematically, we use the hypergeometric distribution to compute the probability of obtaining k or more “special” ones in a sample of n:

$$p\text{-value} = 1 - \sum_{i=0}^{k-1} f_{HG}(i; N, K, n) = 1 - \sum_{i=0}^{k-1} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}$$

Materials & Methods
21,121 siRNA knockdown assays, covering the entire coding-sequence part of the genome

Results
273 HIV-dependency factors (HDFs) were discovered
(Figure panels: Biological processes; Subcellular localizations; Molecular functions)

Observations
• Nuclear pore complex: their loss may impede HIV nuclear access
• Mediator members (couple TFs to Pol II): requirement for activators to bind HIV LTRs
• Enzymes involved in glycosylation: HIV’s envelope protein is heavily glycosylated, which assists the virus's entry into cells
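
The hypergeometric test from the "Statistical significance testing" slide can be tried out on small numbers. The sketch below is one possible way to compute it in Python using only the standard library; the function name `enrichment_p_value` and the toy numbers in the usage line are illustrative assumptions, not taken from the course or from the HIV study. It uses the same symbols as the slide (N genes in the reference set, K of them with function f, n genes in the query set, k of them with function f) and evaluates the probability of seeing k or more by chance.

```python
from math import comb


def enrichment_p_value(k, N, K, n):
    """P(X >= k) for a hypergeometric X: n draws without replacement
    from N genes, K of which are "special"."""
    if not (0 <= K <= N and 0 <= n <= N and k >= 0):
        raise ValueError("need 0 <= K <= N, 0 <= n <= N and k >= 0")
    total = comb(N, n)  # number of ways to choose the n-gene sample
    # Number of samples containing fewer than k special genes
    # (the sum over i = 0 .. k-1 in the slide's formula).
    lower = sum(comb(K, i) * comb(N - K, n - i) for i in range(min(k, n + 1)))
    # 1 - lower/total, computed on exact integers before dividing,
    # so very small p-values are not lost to rounding.
    return (total - lower) / total


# Toy numbers (hypothetical): 10 genes, 4 special, sample of 3, 2 special found.
p = enrichment_p_value(k=2, N=10, K=4, n=3)
print(p)
```

A small p-value suggests that k/n exceeds K/N by more than random sampling would explain. In practice, many functions are tested at once, so some correction for multiple testing is usually applied on top of this single-test value.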