Download Chapter 2 PowerPoint Slides
Genome Sequence
• Chapter 2

Position Weight Matrix
• TATA box -- believed to be used by RNA polymerase to find the transcription start site
• Given a PWM (Table MM2.1) -- how do we decide if any given 15 nts form a "TATA" box?
• Ex: if the genome-wide GC average == 44%
• P(A pos. 1) = P(A or T) * .5 = (1-.44)*.5 = .56*.5 = .28 (expected value)
• If TATA box, P(A pos. 1) = 61/392 ~ 0.1556 (from Table MM2.1) (observed value)
• Finally, P(A in TATA box)/P(A overall) == 0.1556/.28 ==> take the logarithm == log odds ratio

log odds
• log odds > 1 ==> NT is likely at that position of a real TATA box
• log odds < -1 ==> NT is less likely to occur at that position of a TATA box than overall
• since these are logs -- can sum over all positions to get a log odds score

Table MM2.1

Tables MM2.2 and MM2.3

Genome resources for Annotated Genes
• Entrez http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi
• GeneCards http://www.genecards.org/index.shtml

BLOSUM62 substitution Matrix
• default substitution matrix for BLAST alignment between two amino acid sequences
• provides the "score" (or "cost") of aligning one amino acid to another
• common substitutions have higher scores
• based on observed amino acid substitutions in orthologous proteins with aligned sequences

Table MM2.4

Table MM2.5

Protein Structure
• PDB http://www.rcsb.org/pdb/home/home.do
• Entrez http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi

Protein Structure Prediction
• predict 3D structure from primary amino acid sequence
• considered computationally intractable
• however, many individuals are working on this problem

Structure and Function Descriptions
• Gene Ontology (or "GO")
• Controlled vocabulary of hierarchical terms that describes genes/proteins
• biological process -- overall objective
• molecular function -- biochemical activity
• cellular component -- location of protein activity
• http://www.geneontology.org/

Survey
• SURVEY ENDS -- 12/02/2007

Caution
• 1) Note -- multiple listings of information/data in multiple
different databases are not necessarily independent validations (many sites cross-link/cross-reference). The literature is probably the lowest-level confirmatory resource
• 2) Do not assume that all data sources/repositories contain the same information (examples: gene assemblies)

Mapping
• STSs -- sequence tagged sites -- a pair of primers that amplifies a distinct portion of the genome
• chromosomes were fragmented and inserted into bacteria and/or yeast -- to maintain the DNA
• bacterial vectors carried approximately 150 kb of sequence -- BAC (E. coli)
• YACs -- 150 kb to 1.5 Mb
• Using restriction maps and the STSs, the BACs and YACs could be assembled into longer contigs
• Mapping was considered crucial by the public effort due to the number and sizes of large repeats in the genome

Vector
A vector in biology has several meanings:
* An organism (biotic vector, pollinator) or medium (abiotic vector, e.g. wind) which transports pollen to a stigma.
* An organism that transmits disease by conveying pathogens from one host to another (vector insect)
* A virus used to deliver genetic material into a cell
* A piece of DNA meant to carry DNA fragments into a host cell

Figure 2.5

ESTs - Expressed Sequence Tags
• cDNA (DNA made from RNA)
• short reads of cDNAs (typically 200-800 nts) from the 3' and 5' ends of genes (cDNA)
• Massive efforts to sequence ESTs -- assembled into a database -- UniGene
• ESTs provide information on:
  – existence of genes
  – tissue specific expression
  – alternative splicing
  – cluster of ESTs -- bioinformatics problem (Show NEIBank)

Whole Genome Shotgun Sequencing
• TIGR (The Institute for Genomic Research) -- sequenced a bacterium
• TIGR and Celera (Craig Venter) split
  – the basic idea of WGSS is to cut up the DNA into small pieces
  – sequence all the pieces
  – then, using software, assemble all of the overlapping sequences

Human Genome Project
• Publicly funded effort
  – considerable effort was put into mapping and marker
identification
    • for assembly
    • and organization of sequences for sequencing
• markers used to choose the minimum number of slightly overlapping fragments that completely spanned each chromosome
• called the "golden tiling path" -- ~45,000 fragments

Clones, Libraries, and Mapping
http://www.genome.gov/Pages/Education/Kit/main.cfm?pageid=92#
E. coli BACs -- 100,000 - 200,000; markers or STS

Mapping -- an aside
• Genetic map
• physical map
  – radiation hybrid map

Clones, Libraries, and Mapping
• BACs are typically cleaved into smaller fragments -- about 2000 bases -- and stored in E. coli vectors (a virus or a plasmid)
• the precise order of the larger BACs is determined first, because determining the order of many smaller fragments is more work
• however, shorter fragments are more amenable to the chemistry of the sequencing reactions

Celera -- Shotgun Sequencing
• Genome -- cleaved into 27,271,853 fragments
• Celera had access to the public data
• Claimed not to have used it for assembly (not clear why). Any guesses why?

Whole Genome Shotgun VS Mapping
• WGS fails in large sections of highly repetitive DNA
• Yet the WGS "lost" only 103 genes (pretty good), given the cost/time savings
• In hindsight, a hybrid approach appears to be optimal
  – WGS to get a majority of sequence (say 6x coverage)
  – minimum tiling path to resolve repetitive regions
  – estimated that 3000 BACs would be sufficient for human (93% less than was sequenced for human)
• However, it would be impossible to know this without the results of both approaches for comparison

Figure 2.B1 (labels: 28 locations, 690 kb, 46 kb)

Figure 2.B2

Annotating Genomes -- Summary
"Fully annotating a newly sequenced genome requires many people with different academic backgrounds working together in teams. As you might guess, software development for genome analysis is a very hot research area in computer science, mathematics, engineering, and biology.
Few people can master more than one or two of these areas, so collaborations are common. If you learn both math and biology, you will have many career opportunities…"

How many proteins? From "one" "gene"
• Nox1 -- ESTs identified the gene for voltage-gated H+ ion channels
• 3 different mRNAs are encoded (2 long, 1 short) through alternative splicing -- and they are tissue specific

Figure 2.6

Imprinting
• Prader-Willi Syndrome
  – weak muscles, short stature, obesity, intellectual disability
• Angelman Syndrome
  – balance problems, impaired motor skills, excessively happy demeanor, severe intellectual disability
• deletion of Mb in 15q11-13
• disease status determined by "imprinting" -- marking of a gene so that only the paternal or maternal copy will be transcribed (the other copy is not transcribed)
• Is this form of inheritance supported by Mendel's rules?

Imprinting
• many genes are involved with the placenta and developing embryos
• analogous to a parasitic drain on the mother
• genetically speaking, the male would prefer to extract maximal nutrients, to ensure propagation of his genetic line
• the female would prefer to ration resources to increase the chance of having multiple lines
• this opposition is genetic "conflict"

Imprinting -- example
• IGF2 -- insulin-like growth factor (paternally expressed gene)
• IGF2R -- receptor expressed only by the maternally inherited allele
• by controlling the embryonic expression of the receptor, the mother maintains control of the paternally driven ligand from IGF2

Figure 2.7 Igf2 gene expression -- in 2 strains of mouse

Methylation -- a mechanism
• DNA methylation -- the addition of a methyl group (CH3) to cytosine (C) in DNA
• methylation
  – associated with non-transcription
  – loss of methylation observed in cancerous cells
• regulation of gene expression without altering the DNA sequence is epigenetic regulation
• ~400,000 methylated sites in any given cell type
• ~100 unique cell types in humans ==> 40,000,000 methylated sites?
• Plus gender-based differences in expression

Figure 2.8

Other interesting finds…
• haplotypes and genetic "invariability"?
• micro-RNAs (a new kind of gene)
• pseudogenes
• more duplications and deletions than expected (copy number variations)
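
Worked example -- Position Weight Matrix log odds
The log odds arithmetic from the Position Weight Matrix slide can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the numbers 44%, 61 and 392 come from the slide, while the function names, the choice of log base 2 (the slide does not state a base) and the two-column toy matrix are assumptions made only for this example.

```python
import math

# Values quoted on the slide.
GC_FRACTION = 0.44       # genome-wide GC average
OBSERVED_A_COUNT = 61    # count of A at position 1 (Table MM2.1)
TOTAL_COUNT = 392        # denominator used on the slide for position 1


def background_prob(nt, gc_fraction):
    """Expected frequency of one nucleotide, assuming G == C and A == T."""
    if nt in "GC":
        return gc_fraction / 2
    return (1 - gc_fraction) / 2


def log_odds(observed, expected, base=2):
    """Log of observed/expected. Base 2 is an assumption, not from the slide."""
    return math.log(observed / expected, base)


def score_site(site, pwm_log_odds):
    """Sum the per-position log odds for a candidate site.

    pwm_log_odds is a list with one dict per position, mapping
    nucleotide -> log odds (hypothetical layout for this sketch).
    """
    return sum(column[nt] for nt, column in zip(site, pwm_log_odds))


expected_a = background_prob("A", GC_FRACTION)   # .56 * .5 = .28
observed_a = OBSERVED_A_COUNT / TOTAL_COUNT      # 61/392 ~ 0.1556
a_pos1 = log_odds(observed_a, expected_a)        # negative: A is under-represented

# Toy two-position matrix with made-up log odds, only to show the summation.
toy_pwm = [
    {"A": a_pos1, "C": -1.5, "G": -1.0, "T": 1.2},
    {"A": 0.9, "C": -2.0, "G": -2.0, "T": 0.1},
]
toy_score = score_site("TA", toy_pwm)            # 1.2 + 0.9

print(f"expected={expected_a:.4f} observed={observed_a:.4f} log_odds={a_pos1:.2f}")
print(f"toy score for 'TA' = {toy_score:.2f}")
```

With base 2 the log odds for A at position 1 comes out at about -0.85, so A is less common there than in the genome overall. A real scan would build one column per position of the 15-nt window from Table MM2.1 and sum across all 15 positions.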