Introduction to genomes & genome browsers
Celia van Gelder, CMBI, UMC Radboud, June 2010

Content
• The human genome
• Human genetic variation
• SNPs
• CNVs
• Alternative splicing
• Browsing the human genome

The human genome
• Genome: the entire sequence of DNA in a cell
• 3 billion base pairs (3 Gb)
• 22 autosome pairs + the X and Y sex chromosomes
• Chromosome length varies from ~50 Mb to ~250 Mb
• About 22,000 protein-coding genes
• The human genome is 99.9% identical among individuals

Eukaryotic genomes: more than collections of genes
• Protein-coding genes
• RNA genes (rRNA, snRNA, snoRNA, miRNA, tRNA)
• Structural DNA (centromeres, telomeres)
• Regulation-related sequences (promoters, enhancers, silencers, insulators)
• Parasite sequences (transposons)
• Pseudogenes (non-functional gene-like sequences)
• Simple sequence repeats

The human genome (continued)
• Only 1.2% codes for proteins
• Long introns, short exons
• Large spaces between genes
• More than half consists of repetitive DNA
From: Molecular Biology of the Cell (4th edition) (Alberts et al., 2002)

Variation along genome sequence
• Nucleotide usage varies along chromosomes
  – Protein-coding regions tend to have high GC levels
• Genes are not equally distributed across the chromosomes
  – Housekeeping genes are generally found in gene-dense areas
  – Gene-poor areas tend to have many tissue-specific genes
From: Ensembl

Chromosome organisation (1)
From: Lodish (4th edition)

Chromosome organisation (2)
• DNA is packed in chromatin
• Genes that are OFF: non-active genes are often in densely packed chromatin (30-nm fiber)
• Genes that are ON: active genes are in less dense chromatin (beads-on-a-string)
• Gene regulation works by changing chromatin density and by methylation/acetylation of the histones
From: Lodish (4th edition)

Human Genetic Variation
• Every human has essentially the same set of genes
• But there are different forms of each gene, known as alleles
  – blue vs. brown eyes
  – genetic diseases such as cystic fibrosis or Huntington's disease are caused by dysfunctional alleles

Variations in the Genome
Common sequence variations:
• Polymorphism
• Deletions
• Insertions
• Chromosome Translocations

Today's focus
1. Single Nucleotide Polymorphisms (SNPs)
2. Copy number variations (CNVs)
3. Alternative transcripts

Single Nucleotide Polymorphisms (SNPs)
• SNPs are DNA sequence variations that occur when a single nucleotide (A, T, C, or G) in the genome sequence is altered.
• For a variation to be considered a SNP, it must occur in at least 1% of the population.
• SNPs, which make up about 90% of all human genetic variation, occur every 100 to 300 bases along the 3-billion-base human genome.
[Figure: two aligned DNA duplexes that differ at a single base pair, labelled as the SNP]

SNPs & medicine
• Although more than 99% of human DNA sequences are the same, variations in DNA sequence can have a major impact on how humans respond to:
  – disease;
  – environmental factors such as bacteria, viruses, toxins, and chemicals;
  – drugs (and their side effects).
• This makes SNPs valuable for biomedical research and for developing pharmaceutical products or medical diagnostics.

SNP & disease, example: Alzheimer's disease & apolipoprotein E
• ApoE contains two SNPs that result in three possible alleles for this gene: E2, E3, and E4.
• Each allele differs by one DNA base, and the protein product of each allele differs by one amino acid.
• Each individual inherits one maternal copy of ApoE and one paternal copy of ApoE.
• Research has shown that a person who inherits at least one E4 allele has a greater chance of developing Alzheimer's disease.

Copy Number Variation
• People do not vary only at the nucleotide level (SNPs)
• Copy Number Variations (CNVs): duplications and deletions of pieces of chromosome
• When there are genes in the CNV areas, this can lead to variations in the number of gene copies between individuals
• CNVs may either be inherited or caused by de novo mutation
[Figure: copy number (CN) per cell. Normal cell: CN=2; deletion: CN=0 or CN=1; amplification: CN=3 or CN=4]

CNVs
• CNVs are common in cancer and other diseases.
• CNVs are also common in normal individuals and contribute to our uniqueness. These changes can also influence susceptibility to disease.
• Since CNVs often encompass genes, they can have important roles both in characterizing human disease and in discovering drug response targets.

CNV & disease, examples
CNVs have been implicated in:
• Cancer (gene copy number can be elevated in cancer cells)
• Autism
• Schizophrenia (dept. human genetics)
• Mental retardation (dept. human genetics)

Alternative splicing
• Defects in the machinery of alternative splicing have been implicated in many diseases, including:
  – neuropathological conditions such as Alzheimer disease
  – cystic fibrosis and conditions involving growth and developmental defects
  – many human cancers

Exponential Growth in Genomic Sequence Data
[Figure: number of genomes over time. Milestones: first 2 bacterial genomes complete; first eukaryote complete (yeast); first metazoan complete (roundworm); currently 1000+ completed genomes]

Annotating the genome
• Genome annotation is the process of attaching biological information to sequences. It consists of two main steps:
  1. identifying elements on the genome, a process called gene finding, and
  2. attaching biological information to these elements.
• Automatic annotation tools try to perform all this by computer analysis, as opposed to manual annotation, which involves human expertise. Ideally, these approaches co-exist and complement each other in the same annotation pipeline.

Why present the whole genome?
• Browsers provide context to understand genomic regions of interest
• See features in and around a specific gene
• Explore larger chromosome regions
• Search and retrieve information on a gene and genome scale
• Investigate genome organization
• Compare genomes

Basic Genome Annotation
• Genomic location
• Gene features
  – Exons
  – Introns
  – UTRs
• Transcript(s)
  – Pseudogenes
  – Non-coding RNA
• Protein(s)
• Links to other sources of information

Advanced Genome Annotation
• Cytogenetic bands
• Polymorphic markers
• Genetic variation
• Repetitive sequences
• Expressed Sequence Tags (ESTs)
• cDNAs or mRNAs from related species
• Regions of sequence homology
• Genomic sequence variation

Possible research questions
P. Schattner, Genomics 93 (2009):187-195

[Human] Genome Browsers
Not limited to human data only
• EBI Ensembl
• NCBI Map Viewer
• UCSC Genome Browser

Other Ensembl Installations
http://www.ensemblgenomes.org/

Organized Data Based on Chromosome Location
Tracks:
• genes & predictions
• variations & repeats
• cross-species comparative data
• & many more types of data, from expression & regulation to mRNA and ESTs…
Gene X:
• Description
• Transcript data
• Structure
• Gene Ontology
• Pathway Data
• Homologous Genes
• Expression Data
• Etc.
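The idea of organizing data by chromosome location can be illustrated with a small sketch. This is not how Ensembl or any other browser is implemented; the `Feature` class, the track names, and all coordinates below are invented for illustration. It only shows the basic principle: every annotation carries a chromosome, a start, and an end, so a browser view is a query for features that overlap a region, grouped by track.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """One annotated element on a chromosome (hypothetical structure)."""
    track: str   # e.g. "genes", "variations", "repeats"
    name: str
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive


def features_in_region(features, chrom, start, end):
    """Return features on `chrom` that overlap [start, end], sorted by position."""
    hits = [
        f for f in features
        if f.chrom == chrom and f.start <= end and f.end >= start
    ]
    return sorted(hits, key=lambda f: (f.start, f.end))


def tracks_for_region(features, chrom, start, end):
    """Group the names of overlapping features by track, like rows in a browser view."""
    grouped = {}
    for f in features_in_region(features, chrom, start, end):
        grouped.setdefault(f.track, []).append(f.name)
    return grouped


# Made-up example data: names and coordinates are not real annotations.
FEATURES = [
    Feature("genes", "GENE_X", "1", 1000, 5000),
    Feature("variations", "snp_a", "1", 1200, 1200),
    Feature("repeats", "rep_1", "1", 6000, 6500),
    Feature("genes", "GENE_Y", "2", 1000, 2000),
]

if __name__ == "__main__":
    print(tracks_for_region(FEATURES, "1", 900, 5500))
```

A linear scan is enough for a toy list; tools that handle whole genomes typically rely on indexed structures for such queries, which this sketch does not attempt.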
Ensembl Genes – biological basis
• All Ensembl transcripts are based on proteins and mRNAs in:
  – UniProt/Swiss-Prot (manually curated)
  – UniProt/TrEMBL
  – NCBI RefSeq (manually curated)

Ensembl Homepage

Ensembl and VEGA/Havana
• Ensembl: automatic annotation pipeline; gene building all at once (whole genome)
• VEGA/Havana (Vertebrate Genome Annotation): manual curation on a case-by-case basis
• Merged (gold): Havana/Ensembl

HGNC
• HGNC – a unique name and symbol for every gene in human
http://www.genenames.org/

Names in Ensembl
• ENSG### Ensembl Gene ID
• ENST### Ensembl Transcript ID
• ENSP### Ensembl Peptide ID
• ENSE### Ensembl Exon ID
• For species other than human, a species code is inserted after ENS:
  – MUS (Mus musculus) for mouse: ENSMUSG###
  – DAR (Danio rerio) for zebrafish: ENSDARG###, etc.

Tabs in Ensembl
• Location Tab
• Transcript Tab
• Gene summary Tab

Ensembl: An Example
[Screenshot: tracks in the location view; click for more details]

Gene Structure in Ensembl

Synopsis: What can I do with Ensembl?
• View genes along with other annotation along the chromosome
• View alternative transcripts (including splice variants) for a gene
• Explore homologues and phylogenetic trees across more than 40 species for any gene
• Compare whole genome alignments and conserved regions across species
• View microarray sequences that match Ensembl genes
• View ESTs, clones, mRNA and proteins for any chromosomal region
• Examine single nucleotide polymorphisms (SNPs) for a gene or chromosomal region
• View SNPs across strains (rat, mouse), populations (human), or even breeds (dog)
• View positions and sequence of mRNA and protein that align with an Ensembl gene
©CMBI 2009

Alternative Transcripts

Ensembl: Many Additional Tools
• BLAST/BLAT: best scoring match
• BioMart: data retrieval and download
Copyright OpenHelix. No use or reproduction without
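The naming scheme under "Names in Ensembl" is regular enough to take apart with a short script. The sketch below is a minimal illustration, not an Ensembl tool: the function name `parse_ensembl_id` is invented, the species table holds only the three cases mentioned in the slides (human, mouse, zebrafish), and it assumes the "###" part is a plain run of digits with no version suffix.

```python
import re

# Feature-type letters as listed under "Names in Ensembl".
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "peptide", "E": "exon"}

# Only the species named in the slides; an empty code means human.
SPECIES_CODES = {"": "human", "MUS": "mouse", "DAR": "zebrafish"}

# ENS + optional three-letter species code + feature letter + digits.
_STABLE_ID = re.compile(r"^ENS([A-Z]{3})?([GTPE])(\d+)$")


def parse_ensembl_id(stable_id):
    """Split an Ensembl-style ID into species, feature type, and number.

    Raises ValueError if the text does not follow the ENS[code][GTPE]digits
    pattern. Species codes missing from SPECIES_CODES are reported as "unknown".
    """
    match = _STABLE_ID.match(stable_id)
    if match is None:
        raise ValueError(f"not an Ensembl-style ID: {stable_id!r}")
    code, letter, number = match.groups()
    return {
        "species": SPECIES_CODES.get(code or "", "unknown"),
        "feature": FEATURE_TYPES[letter],
        "number": number,
    }


if __name__ == "__main__":
    # IDs used only to show the format.
    for example in ("ENSG00000139618", "ENSMUSG00000017167", "ENSDART00000000001"):
        print(example, parse_ensembl_id(example))
```

Because the species code is optional in the pattern, a human ID such as `ENSG...` is read as "no code, feature letter G", while `ENSMUSG...` is read as "code MUS, feature letter G".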