The Bioinformatics Roadshow
Tórshavn, The Faroe Islands
28-29 November 2012
BROWSING GENES AND GENOMES
WITH ENSEMBL
EXERCISES AND ANSWERS
Contents: BROWSER, BIOMART, VARIATION, COMPARATIVE GENOMICS
Note: These exercises are based on Ensembl version 69 (October 2012).
Once a newer version has gone live, version 69 will remain available at
http://e69.ensembl.org/. If your answer doesn’t correspond with the given
answer, please consult the instructor.
______________________________________________________________
BROWSER
______________________________________________________________
Exercise 1 – Exploring a gene
(a) Find the human F9 (coagulation factor IX) gene. On which chromosome
and which strand of the genome is this gene located? How many transcripts
(splice variants) have been annotated for it?
(b) What is the longest transcript? How long is the protein it encodes? Has
this transcript been annotated automatically (by Ensembl) or manually (by
Havana)? How many exons does it have? Are any of the exons completely or
partially untranslated?
(c) Have a look at the external references for ENST00000218099. What is the
function of F9?
(d) Is it possible to monitor expression of ENST00000218099 with the
ILLUMINA HumanWG_6_V2 microarray? If so, can it also be used to monitor
expression of the other two transcripts?
(e) In which part (i.e. the N-terminal or C-terminal half) of the protein encoded
by ENST00000218099 does its peptidase activity reside?
(f) Have any missense variants been discovered for the protein encoded by
ENST00000218099?
(g) Is there a mouse orthologue predicted for the human F9 gene?
(h) If you have a gene of interest of your own, explore what information
Ensembl displays about it!
______________________________________________________________
Answer
(a)
- Go to the Ensembl homepage (http://www.ensembl.org/).
- Select ‘Search: Human’ and type ‘f9’ or ‘factor IX’ in the ‘for’ text box.
- Click [Go].
- Click on ‘Gene’ on the page with search results.
- Click on ‘Human’.
- Click on ‘F9’ or ‘ENSG00000101981’.
The human F9 gene is located on the X chromosome on the forward strand.
Three transcripts have been annotated for this gene:
ENST00000218099 (F9-001), ENST00000394090 (F9-201) and
ENST00000479617 (F9-002).
(b)
The longest transcript is ENST00000218099 (F9-001). The length of this
transcript is 2780 base pairs and the length of the encoded protein is 461
amino acids.
- Click on the transcript ‘F9-001’ in the ‘Gene Summary’ display.
It is an Ensembl/Havana merge transcript that has been annotated both
automatically and manually. This can also be seen from the fact that the
transcript is golden coloured.
- Click on the Ensembl Transcript ID ‘ENST00000218099’ in the list of
  transcripts.
It has eight exons.
- Click on ‘Sequence - Exons’ in the side menu.
The first and last exons are partially untranslated (sequence shown in purple).
(c)
- Click on ‘External References - General identifiers’ in the side menu.
- Explore some of the links (a good place to start is usually
  ‘UniProtKB/Swiss-Prot’).
- Do the same for ‘Ontology – Ontology table’.
Factor IX is a vitamin K-dependent plasma protein that participates in the
intrinsic pathway of blood coagulation by converting factor X to its active form
in the presence of Ca2+ ions, phospholipids, and factor VIIIa (this is the
function description as found in UniProtKB/Swiss-Prot).
(d)
- Click on ‘External References - Oligo probes’ in the side menu.
The ILLUMINA HumanWG_6_V2 microarray contains one probe,
ILMN_1787598, that maps to ENST00000218099, so it is possible to monitor
expression of this transcript using this array.
- Click on ‘ENST00000394090’ and ‘ENST00000479617’ in the list of
  transcripts.
No ILLUMINA HumanWG_6_V2 probes map to the other two transcripts, so
expression of these transcripts cannot be monitored using this array.
(e)
- Click on ‘ENST00000218099’ in case you are not already on the
  ‘Transcript: F9-001’ tab.
- Click on ‘Protein Information – Protein summary’ in the side menu.
- Click on ‘Protein Information - Domains & features’ in the side menu.
The peptidase activity of the protein resides in the peptidase domain that is
located in the C-terminal half of the protein.
(f)
- Click on ‘Protein Information - Variations’ in the side menu.
- Type ‘missense’ in the ‘Filter’ text box at the top of the ‘Variations’
  table.
For the protein encoded by ENST00000218099 many missense variants have
been discovered. Most are imported from dbSNP (i.e. all variants with an
identifier starting with ‘rs’) while a few are from the COSMIC (Catalogue of
Somatic Mutations in Cancer) database (i.e. all variants with an identifier
starting with ‘COSM’) and the NHLBI GO Exome Sequencing Project (ESP)
(i.e. all variants starting with ‘TMP_ESP’).
(g)
- Click on the ‘Gene: F9’ tab.
- Click on ‘Comparative Genomics - Orthologues’ in the side menu.
- Type ‘mouse’ in the ‘Filter’ text box at the top of the ‘Selected
  orthologues’ table.
There is one mouse orthologue predicted for human F9, i.e.
ENSMUSG00000031138.
______________________________________________________________
Exercise 2 – Exploring a region
(a) Go to the region from bp 32,448,000 to 33,198,000 on human
chromosome 13. On which cytogenetic band is this region located? How
many contigs make up this portion of the assembly (contigs are contiguous
stretches of DNA sequence that have been assembled solely based on direct
sequencing information)?
(b) Zoom in on the BRCA2 gene.
(c) Are there any BAC clones that contain the complete BRCA2 gene?
(d) Add the track with RefSeq gene models. Has RefSeq annotated the
BRCA2 gene? If so, how many transcripts have been annotated? Do they
differ from the Ensembl transcripts?
(e) Export the genomic sequence of the region you are looking at in FASTA
format.
(f) Delete all tracks you added to the ‘Region in detail’ page.
(g) If you have a genomic region of interest of your own, explore what
information Ensembl displays about it!
______________________________________________________________
Answer
(a)
- Go to the Ensembl homepage (http://www.ensembl.org/).
- Select ‘Search: Human’ and type ‘13:32448000-33198000’ in the ‘for’
  text box (or alternatively leave the ‘Search’ drop-down list as it is and
  type ‘human 13:32448000-33198000’ in the ‘for’ text box).
- Click [Go].
This genomic region is located on cytogenetic band q13.1. It is made up of
seven contigs, indicated by the alternating light and dark blue coloured bars in
the ‘Contigs’ track.
(b)
- Use your mouse to draw a box around the BRCA2 transcripts.
- Click on ‘Jump to region’ in the pop-up menu.
(c)
- Click [Configure this page] in the side menu (or the ‘Configure this
  page’ icon on the main panel).
- Type ‘clones’ in the ‘Find a track’ text box.
- Select ‘1Mb clone set’, ‘32k clone set’ and ‘Tilepath’.
- Click (✓).
It doesn’t look like there is a clone that contains the complete BRCA2 gene.
For example clone RP11-37E23 contains most of the gene, but not its very 3’
end.
(d)
- Click [Configure this page] in the side menu.
- Type ‘refseq’ in the ‘Find a track’ text box.
- Select ‘RefSeq import – Expanded with labels’.
- Click (✓).
- Click on individual transcript models to retrieve more information
  about them.
RefSeq has annotated one transcript for the BRCA2 gene, i.e.
NM_000059.3. This transcript is almost identical to Ensembl transcript
BRCA2-001 (ENST00000380152). Both encode a 3418 aa protein. The
RefSeq transcript is 6 bp shorter at the 5’ end and 462 bp longer at the 3’ end.
(e)
- Click [Export data] in the side menu.
- Click [Next>].
- Click on ‘Text’.
Note that the sequence has a header line that provides information about the
genome assembly (GRCh37), the chromosome, the start and end coordinates
and the strand. For example:
>13 dna:chromosome chromosome:GRCh37:13:32883613:32978196:1
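If you want to process exported sequences in a script, the header line can be split into its parts. The following is a minimal Python sketch and not part of Ensembl; the names `Region`, `parse_fasta_header` and `region_length` are made up for illustration. It assumes that the last field of the header always has the form coordinate system:assembly:name:start:end:strand, as in the example above, and that the coordinates are inclusive at both ends.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Location described by an exported FASTA header (illustrative type)."""
    assembly: str
    chromosome: str
    start: int
    end: int
    strand: int


def parse_fasta_header(header: str) -> Region:
    """Split a header such as
    '>13 dna:chromosome chromosome:GRCh37:13:32883613:32978196:1'."""
    # The last whitespace-separated field holds the colon-delimited location.
    location = header.lstrip(">").split()[-1]
    _coord_system, assembly, chromosome, start, end, strand = location.split(":")
    return Region(assembly, chromosome, int(start), int(end), int(strand))


def region_length(region: Region) -> int:
    # Assumption: start and end are both included in the region, hence the +1.
    return region.end - region.start + 1


header = ">13 dna:chromosome chromosome:GRCh37:13:32883613:32978196:1"
region = parse_fasta_header(header)
print(region)
print("Length in bp:", region_length(region))
```

A strand value of 1 stands for the forward strand; the sketch does not validate the header, so a line in a different format will raise an error.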
(f)
- Click [Configure this page] in the side menu.
- Click [Reset configuration].
- Click (✓).
______________________________________________________________
______________________________________________________________
BIOMART
______________________________________________________________
Exercise 1
The paper ‘Fine mapping of the usher syndrome type IC to chromosome
11p14 and identification of flanking markers by haplotype analysis’ (Ayyagari
et al. Mol Vis. 1995 Oct 25;1:2) describes the mapping of the human Usher
Syndrome type I C to the genomic region between the markers D11S1397
and D11S1310.
Confirm this finding by generating a list of the genes located in the region
between D11S1397 and D11S1310. Include the Ensembl Gene ID, name and
description.
______________________________________________________________
Answer
- Go to the Ensembl homepage (http://www.ensembl.org/).
- Click on the ‘BioMart’ link on the toolbar.
... or if you are already in BioMart:
- Click the [New] button on the toolbar.
- Choose the ‘Ensembl Genes 69’ database.
- Choose the ‘Homo sapiens genes (GRCh37.p8)’ dataset.
- Click on ‘Filters’ in the left panel.
- Expand the ‘REGION’ section by clicking on the + box.
- Type ‘Marker Start: d11s1397’ and ‘Marker End: d11s1310’.
- Click on ‘Attributes’ in the left panel.
- Expand the ‘GENE’ section by clicking on the + box.
- Deselect ‘Ensembl Transcript ID’.
- Select ‘Associated Gene Name’ and ‘Description’.
- Click the [Results] button on the toolbar.
- Select ‘View All rows as HTML’ or export all results to a file.
Your results should show 33 genes. Among these there should be one gene
(ENSG00000006611) named ‘USH1C’ with the description ‘Usher syndrome
1C (autosomal recessive, severe) [Source:HGNC Symbol;Acc:12597]’. This
confirms that Ayyagari et al. correctly mapped Usher Syndrome type I C to
this genomic region.
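The same query can also be written down as a BioMart XML document (the [XML] button on the BioMart toolbar shows the XML for the query you have built). The Python sketch below only assembles such a document as text; it does not contact any server. The helper `build_biomart_query` is hypothetical, and the internal filter and attribute names used here (`marker_start`, `marker_end`, `ensembl_gene_id`, `external_gene_id`, `description`) are assumptions that should be checked against the XML that BioMart itself generates for your release.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Assemble a BioMart-style XML query as a string (illustrative helper)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "0",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    dataset_el = ET.SubElement(
        query, "Dataset", {"name": dataset, "interface": "default"}
    )
    # One <Filter> element per restriction on the gene set.
    for name, value in filters.items():
        ET.SubElement(dataset_el, "Filter", {"name": name, "value": value})
    # One <Attribute> element per output column.
    for name in attributes:
        ET.SubElement(dataset_el, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Internal names below are assumed, not taken from the exercise text.
xml_text = build_biomart_query(
    dataset="hsapiens_gene_ensembl",
    filters={"marker_start": "D11S1397", "marker_end": "D11S1310"},
    attributes=["ensembl_gene_id", "external_gene_id", "description"],
)
print(xml_text)
```

Writing the query down this way makes it easy to rerun the same selection later, for example against a newer Ensembl release.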
______________________________________________________________
Exercise 2
In the paper ‘Discovery of novel biomarkers by microarray analysis of
peripheral blood mononuclear cell gene expression in benzene-exposed
workers’ (Forrest et al. Environ Health Perspect. 2005 June;
113(6): 801–807) the effect of benzene exposure on peripheral blood
mononuclear cell gene expression in a population of shoe factory workers
with well-characterized occupational exposures was examined using
microarrays. The microarray used was the Affymetrix U133A/B GeneChip
(also called ‘U133 plus 2’). The top 25 probe sets up-regulated by benzene
exposure were:
207630_s_at, 221840_at, 219228_at, 204924_at, 227613_at, 223454_at,
228962_at, 214696_at, 210732_s_at, 212371_at, 225390_s_at, 227645_at,
226652_at, 221641_s_at, 202055_at, 226743_at, 228393_s_at, 225120_at,
218515_at, 202224_at, 200614_at, 212014_x_at, 223461_at, 209835_x_at,
213315_x_at
(a) Generate a list of the genes to which these probe sets map. Include the
Ensembl Gene ID, name and description as well as the probe set name.
(b) As a first step towards analysing them for possible regulatory features they
have in common, retrieve the 250 bp upstream of the transcripts of these
genes. Include the Ensembl Gene and Transcript ID, name and description in
the sequence header.
(c) In order to be able to study these human genes in mouse, generate a list
of the human genes and their mouse orthologues. Include the Ensembl Gene
ID for both the human and mouse genes and the homology type in your list.
______________________________________________________________
Answer
(a)
- Go to the Ensembl homepage (http://www.ensembl.org/).
- Click on the ‘BioMart’ link on the toolbar.
... or if you are already in BioMart:
- Click the [New] button on the toolbar.
- Choose the ‘Ensembl Genes 69’ database.
- Choose the ‘Homo sapiens genes (GRCh37.p8)’ dataset.
- Click on ‘Filters’ in the left panel.
- Expand the ‘GENE’ section by clicking on the + box.
- Select ‘ID list limit - Affy hg u133 plus 2 probeset ID(s)’.
- Enter the list of probeset IDs in the text box (either comma
  separated or as a list).
- Click on ‘Attributes’ in the left panel.
- Expand the ‘GENE’ section by clicking on the + box.
- Deselect ‘Ensembl Transcript ID’.
- Select ‘Associated Gene Name’ and ‘Description’.
- Expand the ‘EXTERNAL’ section by clicking on the + box.
- Select ‘Affy HG U133-PLUS-2 probeset’.
- Click the [Results] button on the toolbar.
- Select ‘View All rows as HTML’ or export all results to a file. Tick the
  box ‘Unique results only’.
Your results should show 25 genes. In most cases, one probe set maps to
one gene. Exceptions are 212014_x_at and 209835_x_at, which both map to
ENSG00000026508 (CD44), 227613_at and 219228_at, which both map to
ENSG00000130844 (ZNF331), and 213315_x_at, which maps to both
ENSG00000197620 (CXorf40A) and ENSG00000197021 (CXorf40B).
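Before pasting a long ID list into a filter box, it can help to check that it contains the number of entries you expect. The following Python sketch splits a pasted list on commas and whitespace and removes duplicates; the function name `parse_id_list` is made up for illustration, and the list is the one given in the exercise.

```python
import re

PASTED_IDS = """
207630_s_at, 221840_at, 219228_at, 204924_at, 227613_at, 223454_at,
228962_at, 214696_at, 210732_s_at, 212371_at, 225390_s_at, 227645_at,
226652_at, 221641_s_at, 202055_at, 226743_at, 228393_s_at, 225120_at,
218515_at, 202224_at, 200614_at, 212014_x_at, 223461_at, 209835_x_at,
213315_x_at
"""


def parse_id_list(text):
    """Split on commas and/or whitespace, dropping empty items and duplicates."""
    tokens = [token for token in re.split(r"[,\s]+", text) if token]
    # dict.fromkeys keeps the first occurrence of each ID in its original order.
    return list(dict.fromkeys(tokens))


probe_sets = parse_id_list(PASTED_IDS)
print(len(probe_sets), "probe set IDs")
print("\n".join(probe_sets))
```

Because the split accepts both separators, the sketch works whether the IDs are comma separated or given one per line.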
(b)
You can leave the dataset and filters the same, so you can directly specify the
attributes:
- Click on ‘Attributes’ in the left panel.
- Select the ‘Sequences’ attributes page.
- Expand the ‘SEQUENCES’ section by clicking on the + box.
- Select ‘Flank (Transcript)’.
- Type ‘250’ in the ‘Upstream flank’ text box.
- Expand the ‘Header Information’ section by clicking on the + box.
- Select ‘Associated Gene Name’ and ‘Description’.
Note: ‘Flank (Transcript)’ will give the flanks for all the transcripts of a gene
with multiple transcripts. ‘Flank (Gene)’ will only give the flank for the
transcript with the outermost 5’ (or 3’) end.
- Click the [Results] button on the toolbar.
- Select ‘View All rows as FASTA’ or export all results to a file.
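To see what ‘upstream’ means on each strand, the coordinates of an upstream flank can be worked out by hand. The Python sketch below is an illustration only, not BioMart's own code: the function `upstream_flank` and the transcript start used in the example are hypothetical, and it assumes 1-based inclusive coordinates with strand given as 1 (forward) or -1 (reverse).

```python
def upstream_flank(transcript_start, strand, length=250):
    """Return (start, end) of the flank upstream of a transcript's 5' end.

    transcript_start is the genomic position of the 5' end of the transcript,
    i.e. the lowest coordinate on the forward strand and the highest
    coordinate on the reverse strand.
    """
    if strand == 1:
        # Upstream lies at lower coordinates; do not run off the chromosome start.
        return (max(1, transcript_start - length), transcript_start - 1)
    if strand == -1:
        # On the reverse strand, upstream lies at higher coordinates.
        return (transcript_start + 1, transcript_start + length)
    raise ValueError("strand must be 1 or -1")


# Hypothetical 5' end at position 1000, for illustration only.
print(upstream_flank(1000, 1))
print(upstream_flank(1000, -1))
```

Note that for a reverse-strand transcript the retrieved sequence is normally reported as the reverse complement of this interval, so that it reads 5' to 3' in the direction of the transcript.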
(c)
You can leave the dataset and filters the same, so you can directly specify the
attributes:
- Click on ‘Attributes’ in the left panel.
- Select the ‘Homologs’ attributes page.
- Expand the ‘GENE’ section by clicking on the + box.
- Deselect ‘Ensembl Transcript ID’.
- Expand the ‘ORTHOLOGS’ section by clicking on the + box.
- Select ‘Mouse Ensembl Gene ID’ and ‘Homology Type’.
- Click the [Results] button on the toolbar.
- Select ‘View All rows as HTML’ or export all results to a file. Tick the
  box ‘Unique results only’.
Your results should show that for most of the 25 human genes a one-to-one
orthologue in mouse has been identified, while ENSG00000123130 has two
mouse orthologues and ENSG00000172716 has three mouse orthologues.
ENSG00000197620 and ENSG00000197021 map to the same mouse gene.
For four human genes (ENSG00000186594, ENSG00000263141,
ENSG00000130844 and ENSG00000089335) no mouse orthologue has been
identified.
______________________________________________________________
Exercise 3
In the paper ‘Identification of seven new prostate cancer susceptibility loci
through a genome-wide association study’ (Eeles et al. Nat Genet. 2009
Oct;41(10):1116-21.) the following twelve variants are shown to be associated
with prostate cancer susceptibility: rs10993994, rs2735839, rs4242384,
rs6983267, rs7931342, rs7501939, rs9364554, rs6465657, rs5945619,
rs2660753, rs1016343, rs1859962.
Use BioMart to generate a list of the genes to which these variants map.
Include the Ensembl Gene ID, name and description.
Hint: You should start with the Ensembl Variation Mart. To be able to include
the gene name and description, add the Ensembl human genes as a second
dataset.
______________________________________________________________
Answer
- Go to the Ensembl homepage (http://www.ensembl.org/).
- Click on the ‘BioMart’ link on the toolbar.
- Choose the ‘Ensembl Variation 69’ database.
- Choose the ‘Homo sapiens Variation (GRCh37)’ dataset.
- Click on ‘Filters’ in the left panel.
- Expand the ‘GENERAL VARIATION FILTERS’ section by clicking on
  the + box.
- Enter the list of SNP IDs in the ‘Filter by Variation ID’ text box
  (either comma separated or as a list).
Add the Ensembl human genes as a second dataset:
- Click on ‘Dataset’ at the bottom of the left panel.
- Choose the ‘[Ensembl Genes 69] Homo sapiens genes (GRCh37.p8)’
  dataset.
- Click on ‘Attributes’ in the left panel.
- Expand the ‘GENE’ section by clicking on the + box.
- Deselect ‘Ensembl Transcript ID’.
- Select ‘Description’.
- Expand the ‘EXTERNAL’ section by clicking on the + box.
- Select ‘HGNC symbol’.
- Click the [Results] button on the toolbar.
- Select ‘View All rows as HTML’ or export all results to a file. Tick the
  box ‘Unique results only’.
Your results should show that six out of the twelve variants map to one
Ensembl gene, while three (rs5945619, rs10993994 and rs7501939) map to
two Ensembl genes. Seven of the genes have an HGNC symbol assigned.
Three of the twelve variants don’t map to an Ensembl gene.
______________________________________________________________
______________________________________________________________
VARIATION
______________________________________________________________
Exercise 1 – Exploring a sequence variant
The MTHFR (methylenetetrahydrofolate reductase (NAD(P)H)) gene encodes
an enzyme that is involved in the processing of amino acids. One of the
variants in this gene, Ala222Val (A222V), can lead to elevated plasma
homocysteine levels in humans, which is considered to be a risk factor for
cardiovascular disease (see also
http://en.wikipedia.org/wiki/Methylenetetrahydrofolate_reductase and
http://en.wikipedia.org/wiki/Homocysteine).
(a) Find the MTHFR gene for human. Go to the ‘Variation table’ page. What is
the dbSNP accession number (rs number) for the Ala222Val variant?
(b) Is the consequence type of the variant missense for all transcripts that
have been annotated for the MTHFR gene?
(c) Why are its alleles in Ensembl given as G/A and not as C/T, like in the
literature (C665T or C677T) and dbSNP?
(d) Why does Ensembl put the G allele first (G/A)?
(e) Is there ethnic variability in the frequency of the A allele?
(f) In which paper is the association between the variant and homocysteine
levels described?
(g) According to the data imported from dbSNP, G is the ancestral allele of
this variant. Ancestral alleles in dbSNP are based on a comparison between
human and chimp. Does the sequence at the position of the variant in the
other primates, i.e. gorilla, orangutan, macaque and marmoset, confirm that G
is indeed the ancestral allele?
(h) (optional) Were both alleles already present in Neandertal? To answer this
question, have a look at the individual reads at the genomic position of the
variant in the Neandertal Genome Browser
(http://neandertal.ensemblgenomes.org/).
______________________________________________________________
Answer
(a)
- Go to the Ensembl homepage (http://www.ensembl.org/).
- Select ‘Search: Human’ and type ‘mthfr’ in the ‘for’ text box.
- Click [Go].
- Click on ‘Gene’ on the page with search results.
- Click on ‘Human’.
- Click on ‘Variation Table’ on the page with search results.
- Click on ‘Show’ for the ‘Missense variants’ in the ‘Summary of variations
  in ENSG00000177000 by consequence’ table.
- Type e.g. ‘222’ and/or ‘a/v’ in the ‘Filter’ text box.
The dbSNP accession number for the Ala222Val variant is rs1801133.
Note that HGVS (Human Genome Variation Society) notations are not by
default shown in the table. They can be added as follows:
- Click on ‘Configure this page’ in the side menu.
- Click on ‘Consequence options’.
- Check ‘Show HGVS notations’.
- Click (✓).
(b)
- Click on ‘rs1801133’.
- Click on ‘Genomic context – Genes and regulation’ in the side menu or on
  the ‘Genes and regulation’ icon.
No, rs1801133 is missense for four MTHFR transcripts. It is downstream for
one MTHFR transcript, i.e. ENST00000418034.
Note that in total nine transcripts have been annotated for the MTHFR gene:
http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000177000.
(c)
In Ensembl the alleles of rs1801133 are given as G/A, because these are the
alleles in the forward strand of the genome. In the literature and dbSNP
(http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=rs1801133) the alleles are
given as C/T, because the MTHFR gene is located on the reverse strand of
the genome, thus the alleles in the actual gene and transcript sequences are
C/T.
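The conversion between the two strands is a simple base complement. As a small illustration in Python (the function name `alleles_on_other_strand` is made up for this sketch), an allele string reported on one strand can be rewritten for the opposite strand as follows; alleles longer than one base are also reversed, so that they still read 5' to 3'.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def alleles_on_other_strand(allele_string):
    """Convert e.g. 'G/A' on one strand to the alleles on the opposite strand."""
    converted = []
    for allele in allele_string.split("/"):
        # Complement every base, then reverse to keep 5'->3' orientation.
        converted.append(allele.translate(COMPLEMENT)[::-1])
    return "/".join(converted)


# rs1801133: forward-strand alleles as shown in Ensembl.
print(alleles_on_other_strand("G/A"))
```

Applying the function twice returns the original allele string, which is a quick sanity check when comparing Ensembl output with the literature.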
(d)
In Ensembl the allele that is present in the GRCh37 reference genome
assembly is put first, i.e. G. In the literature normally the major allele (in the
population of interest) is put first. In the case of rs1801133 the allele in the
reference genome is the major allele in almost all populations studied, but as
the reference genome is an amalgamation of the genomes of just a few
individuals this is by no means the case for all variants.
(e)
- Click on ‘Population genetics’ in the side menu.
Yes, there is considerable ethnic variation in the frequency of the A allele.
Among the 1000 Genomes populations studied, it ranges from 0.108 in the
YRI (Yoruba in Ibadan, Nigeria) population to 0.567 in the CLM (Colombians
from Medellin, Colombia) population.
(f)
- Click on ‘Phenotype Data’ in the side menu.
- Click on ‘pubmed/20031578’ next to ‘Homocysteine levels’ in the
  ‘Phenotype Data’ table.
The association between rs1801133 and homocysteine levels is described in
the paper ‘Novel associations of CPS1, MUT, NOX4, and DPEP1 with plasma
homocysteine in a healthy population: a genome-wide evaluation of 13 974
participants in the Women's Genome Health Study’ (Paré et al. Circ
Cardiovasc Genet. 2009 Apr;2(2):142-50.).
(g)
- Click on ‘Phylogenetic Context’ in the side menu.
Gorilla, orangutan, macaque and marmoset all have a G in this position,
which confirms that G is indeed the ancestral allele.
(h)
- Go to the Neandertal Genome Browser
  (http://neandertal.ensemblgenomes.org/).
- Type ‘rs1801133’ in the ‘Search Neandertal’ text box.
- Click [Go].
- Click on ‘rs1801133’ on the page with search results.
- Click on ‘Jump to region in detail’.
- Click on ‘Configure this page’ in the side menu.
- Click on ‘Variation features’.
- Select ‘All variations – Normal’.
- Click [SAVE and close].
- Draw a box of about 50 bp around rs1801133 (shown in yellow in the
  center of the display).
- Click on ‘Jump to region’ on the pop-up menu.
The ‘Sequences’ track shows that there are three reads for Neandertal at the
position of rs1801133, all with a G, so based on these (extremely limited!)
data there is no evidence that both alleles were already present in Neandertal.
______________________________________________________________
Exercise 2 – Variant Effect Predictor
Resequencing of the genomic region of the human CFTR (cystic fibrosis
transmembrane conductance regulator (ATP-binding cassette sub-family C,
member 7)) gene (ENSG00000001626) has, amongst others, revealed the
following variants:
7 117171039 117171039 G/A +
7 117171092 117171092 T/C +
7 117171122 117171122 T/C +
(a) Determine if these variants result in a change in the proteins encoded by
any of the Ensembl transcripts of the CFTR gene. Are any of the variants
deleterious according to the SIFT tool? Does PolyPhen agree with this? Have
the variants already been annotated by Ensembl?
(b) Have a look at the uploaded variants on the ‘Region in detail’ page.
______________________________________________________________
Answer
(a)
- Go to the Ensembl homepage (http://www.ensembl.org/).
- Click on the ‘Tools’ link on the toolbar.
- Click on the icon (spanner/wrench) for the ‘Variant Effect Predictor –
  Online tool’.
- Type ‘New variants’ in the ‘Name for this upload (optional)’ text box.
- Enter the list of variants in the ‘Paste file’ text box:
7 117171039 117171039 G/A +
7 117171092 117171092 T/C +
7 117171122 117171122 T/C +
- Select ‘SIFT predictions: Prediction only’.
- Select ‘PolyPhen predictions: Prediction only’.
- Click [Next>].
- Click on ‘HTML’.
Two of the variants (at positions 117171092 and 117171122) are missense in
three of the encoded proteins and thus cause an amino acid change (L/P and
I/T, respectively). One variant (at position 117171039) is synonymous and
thus doesn’t change the amino acid sequence (A).
Both missense variants are considered deleterious according to SIFT.
PolyPhen does not agree in all cases.
All three variants have already been annotated by Ensembl as can be seen
from the dbSNP and COSMIC accession numbers that are shown in the ‘Colocated Variation’ column.
(b)
- For one of the variants, click on the coordinates in the ‘Location’ column
  on the ‘Variant Effect Predictor Results’ page.
The uploaded variants are shown on the ‘Region in detail’ page in a new track
named ‘New variants’. The ‘Sequence variants (dbSNP and all other sources)’
track has also been added by default to allow comparison with the uploaded
variants.
- Use your mouse to draw a box around the uploaded variants.
- Click on ‘Jump to region’ in the pop-up menu.
- Drag the ‘New variants’ track until it’s just above or below the
  ‘Sequence variants (dbSNP and all other sources)’ track, so the two tracks
  can be more easily compared.
It can be clearly seen that all three uploaded variants correspond to an
already annotated variant shown in the ‘Sequence variants (dbSNP and all
other sources)’ track, one synonymous (green) and two missense (yellow).
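The variant lines used in this exercise have five whitespace-separated fields: chromosome, start, end, alleles and strand. If you prepare such input with a script, it is worth checking each line before uploading it. The Python sketch below is an illustration only; the names `Variant` and `parse_variant_line` are hypothetical, and it assumes that the first allele is the reference allele and that a single-base substitution has identical start and end coordinates.

```python
from typing import NamedTuple

VARIANT_LINES = """\
7 117171039 117171039 G/A +
7 117171092 117171092 T/C +
7 117171122 117171122 T/C +
"""


class Variant(NamedTuple):
    chromosome: str
    start: int
    end: int
    reference: str
    alternative: str
    strand: str


def parse_variant_line(line):
    """Parse one 'chromosome start end alleles strand' line."""
    chromosome, start, end, alleles, strand = line.split()
    # Assumption: the allele string is 'reference/alternative'.
    reference, alternative = alleles.split("/")
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand}")
    return Variant(chromosome, int(start), int(end), reference, alternative, strand)


variants = [parse_variant_line(line) for line in VARIANT_LINES.splitlines() if line.strip()]
for variant in variants:
    print(variant)
```

A line with a missing or extra field raises an error in this sketch, which catches the most common formatting mistakes before the file is uploaded.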
______________________________________________________________
______________________________________________________________
COMPARATIVE GENOMICS
______________________________________________________________
Exercise 1 – Orthologues, paralogues and gene trees
The photoreceptor cells in the retina of the human eye contain a number of
different photoreceptors. The rod cells contain rhodopsin, which is responsible
for monochromatic vision in the dark. The cone cells all contain one of three
types of opsins, which respond to long-wave (red), medium-wave (green) and
short-wave (blue) light, respectively, and are responsible for trichromatic
colour vision (see also http://en.wikipedia.org/wiki/Opsin).
(a) Find the gene encoding the long-wave-sensitive (red) opsin for human.
(b) How many within-species paralogues have been identified for this gene?
Note the ‘Target %id’ and ‘Query %id’. Which paralogues show the most
sequence similarity with the red opsin?
(c) Have a look at the genomic location of the OPN1LW (red opsin),
OPN1MW and OPN1MW2 (green opsin), and OPN1SW (blue opsin) genes.
Does their location explain why red-green colour blindness is much more
prevalent in males than in females (e.g. in the US population 7% vs 0.4%)?
(d) Have a look at the gene tree for the OPN1LW gene. Which of its
paralogues are due to the most recent duplication event? Is this reflected in
the sequence similarity between the red opsin and these paralogues when
compared with the other paralogues (see question b)? On which taxonomic
level did this duplication take place?
(e) Retrieve an alignment between the red opsin and all its paralogues in
Jalview. To this end, select all aligned protein sequences using ‘Select >
Select all’, then sort them by using ‘Calculate > Sort > by ID’, subsequently
select all human protein sequences and finally use ‘Select > Invert Sequence
Selection’ and ‘Edit > Delete’ to delete all non-human protein sequences.
______________________________________________________________
Answer
(a)
- Go to the Ensembl homepage (http://www.ensembl.org/).
- Select ‘Search: Human’ and type ‘long wave sensitive opsin’ in the ‘for’
  text box.
- Click [Go].
- Click on ‘Gene’ on the page with search results.
- Click on ‘Human’.
- Click on ‘OPN1LW’.
Note that ‘LW’ in the gene symbol OPN1LW stands for ‘long-wave’.
(b)
- Click on ‘Comparative Genomics - Paralogues’ in the side menu.
Nine within-species paralogues have been identified for the human OPN1LW
gene. According to the Target and Query %id, the proteins encoded by the
genes ENSG00000147380 (OPN1MW) and ENSG00000166160
(OPN1MW2), i.e. the medium-wave-sensitive (green) opsins, show the
highest sequence similarity to red opsin (Target %id indicates the percentage
of the sequence of red opsin matching the sequence of the paralogue protein.
Query %id indicates the percentage of the sequence of the paralogue protein
matching the sequence of red opsin).
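The difference between the two percentages becomes clear when they are computed from a pairwise alignment of two proteins of unequal length. The Python sketch below follows the definitions given above; it is not Ensembl's own code, the function name `percent_identities` is made up, and the two short aligned sequences are invented purely for illustration.

```python
def percent_identities(aligned_gene, aligned_paralogue):
    """Return (target_id, query_id) for two aligned sequences of equal length.

    target_id: percentage of the gene of interest matching the paralogue.
    query_id:  percentage of the paralogue matching the gene of interest.
    Gaps are written as '-'.
    """
    if len(aligned_gene) != len(aligned_paralogue):
        raise ValueError("aligned sequences must have the same length")
    # Count columns where both sequences carry the same residue (not a gap).
    matches = sum(
        1 for a, b in zip(aligned_gene, aligned_paralogue) if a == b and a != "-"
    )
    gene_length = len(aligned_gene.replace("-", ""))
    paralogue_length = len(aligned_paralogue.replace("-", ""))
    return (100.0 * matches / gene_length, 100.0 * matches / paralogue_length)


# Invented toy alignment: the second protein is two residues longer.
target_id, query_id = percent_identities("MKTLLA--", "MKTLLAGG")
print(f"Target %id: {target_id:.1f}, Query %id: {query_id:.1f}")
```

In the toy example the shorter protein matches over its full length while the longer one does not, which is why the two values can differ for the same pair of proteins.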
(c)
- Click on the ‘Location: X:153,409,698-153,424,507’ tab.
The OPN1LW (red opsin) and OPN1MW and OPN1MW2 (green opsin) genes
are located next to each other on the X chromosome, while the OPN1SW
(blue opsin) gene is located on chromosome 7. As females have two X
chromosomes a normal gene on one chromosome can often make up for a
defective one on the other, whereas males cannot make up for a defective
gene. Thus, red-green colour blindness is much more prevalent in males than
in females. Variation in the genes for red and green opsin can cause subtle
differences in colour perception, while tandem rearrangements due to unequal
crossing-over between these genes cause more serious defects in colour
vision.
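The size of the difference between males and females can be estimated with a short calculation. The Python sketch below is a simplification and rests on assumptions that are not in the exercise: a single recessive X-linked locus, random mating, and a population in Hardy-Weinberg equilibrium. Under these assumptions the frequency of affected males equals the allele frequency q, and the expected frequency of affected females is q squared.

```python
def expected_female_frequency(male_frequency):
    """Expected frequency of affected females for a recessive X-linked trait.

    Males carry one X chromosome, so the male frequency equals the allele
    frequency q; females need two defective copies, which occurs with
    probability q * q under the assumptions stated above.
    """
    q = male_frequency
    return q * q


male_frequency = 0.07  # 7% of males, as quoted in the question
female_frequency = expected_female_frequency(male_frequency)
print(f"Expected in females: {female_frequency:.2%}")
```

The result, about 0.5%, is of the same order as the 0.4% quoted in the question; an exact match is not expected, because red-green colour blindness involves more than one gene and several types of defect.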
(d)
- Click on the ‘Gene: OPN1LW’ tab.
- Click on ‘Comparative Genomics - Gene tree (image)’ in the side menu.
- Click on ‘View options: View paralogs of current gene’ below the gene
  tree image.
- Click on the nodes (red squares) for the duplication events that have
  given rise to the various paralogues.
A duplication event on the level of the Hominines or Homininae (humans,
gorillas and chimpanzees) has given rise to the OPN1LW (red opsin) and
OPN1MW and OPN1MW2 (green opsin) genes. The other paralogues are
due to earlier duplication events. This agrees with the fact that the green
opsins show the highest sequence similarity with red opsin (see question b)
and the fact that the genes for the red and green opsins are located close to
each other on the genome (see question c).
Note: On the ‘Paralogues’ page nine paralogues are shown (see question b).
Five of these are of the type ‘other paralogue’. These are paralogues that are
too distant to be in the same gene tree, but can still be related as part of a
broader “super-family”. Therefore, the gene tree for the OPN1LW gene only
shows four of its nine paralogues. The precise taxonomic level of duplication
for the ‘other paralogues’ is left as undetermined.
(e)
- Click on the speciation node (blue square) that is at the base of the
  complete gene tree.
- Click on ‘Expand for Jalview’ in the pop-up menu (that should say
  ‘Taxon: Chordates’).
- Click [Start Jalview].
- Close the pop-up window with the gene tree.
- Click on ‘Select > Select all’ on the menu bar of the pop-up window with
  the protein sequence alignment.
- Click on ‘Calculate > Sort > by ID’ on the menu bar.
- Select the protein sequences of the human paralogues.
- Click on ‘Select > Invert Sequence Selection’ on the menu bar.
- Click on ‘Edit > Delete’ on the menu bar.
As the alignment is based on the complete set of protein sequences in the
gene tree, the alignment of this subset of five proteins will contain empty
columns. These can be removed using the option ‘Edit > Remove Empty
Columns’ on the menu bar.
8 Click on ‘Edit > Remove Empty Columns’ on the menu bar.
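Why a subset of an alignment contains empty columns, and what removing them does, can be sketched in a few lines of Python. This is only an illustration of the idea, not Jalview's implementation; the sequence names and residues below are a hypothetical toy alignment.

```python
def remove_empty_columns(seqs, gap="-"):
    """Drop every alignment column in which all sequences have a gap."""
    if not seqs:
        return {}
    length = len(next(iter(seqs.values())))
    # Keep a column if at least one sequence has a residue there.
    keep = [i for i in range(length)
            if any(s[i] != gap for s in seqs.values())]
    return {name: "".join(s[i] for i in keep) for name, s in seqs.items()}


# Toy alignment (hypothetical): the third sequence forces gaps in the others.
full = {
    "human_A": "MA--QW-K",
    "human_B": "MS--QW-R",
    "fish_X":  "MAGTQWLK",
}

# Keeping only the human sequences leaves columns that contain gaps only.
subset = {name: s for name, s in full.items() if name.startswith("human")}
trimmed = remove_empty_columns(subset)
for name, s in trimmed.items():
    print(name, s)
```

The gaps in the two human sequences exist only to accommodate residues in the sequence that was deleted, which is why they can be removed without changing how the remaining sequences align to each other.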
______________________________________________________________
Exercise 2 – Whole genome alignments
Protein-coding sequences are not the only sequences that are evolutionarily conserved. Many
conserved, and even ultraconserved, sequences are found in the non-coding
parts of the genome. These are of interest because of their potential to be
involved in gene regulation.
(a) Find the Ensembl BRCA2 (breast cancer type 2 susceptibility protein)
gene for human and go to its ‘Region in detail’ page.
(b) Turn on some of the ‘BLASTZ alignment’ and ‘Translated BLAT alignment’
tracks for a broad taxonomic range of species (e.g. chimp, mouse, platypus,
chicken, anole lizard and zebrafish). Does the degree of conservation
between human and the various other species reflect their evolutionary
relationship? Which parts of the BRCA2 gene show the most conservation?
Would you have expected this?
(c) Have a look at the ‘Conservation score’ and ‘Constrained elements’ tracks
for the sets of 36 mammals and 19 vertebrates. Do these tracks confirm what
is shown in the tracks with pairwise alignment data?
(d) Go to the human POLA1 (polymerase (DNA directed), alpha 1, catalytic
subunit) gene. How does conservation in this gene compare with that in the
BRCA2 gene?
(e) In the paper ‘Ultraconserved elements in the human genome’ (Bejerano et
al. Science. 2004 May 28;304(5675):1321-5) sequences are described that
are for no apparent reason perfectly conserved across many mammals (and
often beyond).
Below are the coordinates of the ten ultraconserved elements that are located
in the POLA1 gene and the intergenic region between POLA1 and its
downstream neighbour, the ARX (aristaless related homeobox) gene.
chrX  24823511  24823785  uc.460
chrX  24864797  24865193  uc.461
chrX  24894826  24895604  uc.462
chrX  24915882  24916156  uc.463
chrX  24916158  24916927  uc.464
chrX  24917481  24917790  uc.465
chrX  24946458  24946806  uc.466
chrX  25008354  25009084  uc.467
chrX  25017563  25018051  uc.468
chrX  25018053  25018274  uc.469
(These coordinates were taken from http://users.soe.ucsc.edu/~jill/ultra.html
and converted from NCBI35 to GRCh37).
Upload these ten elements to Ensembl.
Hint: To this end, click [Manage your data] in the side menu and subsequently
on ‘Upload Data’. The rest should be self-explanatory. The above data are in
BED format.
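For readers who prefer to prepare the upload in a script, the following minimal sketch assembles the ten elements into BED text (chromosome, start, end, name, one feature per line). The tuple list and the helper name `to_bed` are illustrative and not part of Ensembl. Note that BED start positions are formally zero-based; the exercise does not state whether the listed values were adjusted for this, so the sketch writes the numbers exactly as given.

```python
# The ten ultraconserved elements as listed in the exercise:
# (chromosome, start, end, name).
ELEMENTS = [
    ("chrX", 24823511, 24823785, "uc.460"),
    ("chrX", 24864797, 24865193, "uc.461"),
    ("chrX", 24894826, 24895604, "uc.462"),
    ("chrX", 24915882, 24916156, "uc.463"),
    ("chrX", 24916158, 24916927, "uc.464"),
    ("chrX", 24917481, 24917790, "uc.465"),
    ("chrX", 24946458, 24946806, "uc.466"),
    ("chrX", 25008354, 25009084, "uc.467"),
    ("chrX", 25017563, 25018051, "uc.468"),
    ("chrX", 25018053, 25018274, "uc.469"),
]


def to_bed(rows):
    """Return tab-separated BED text, one feature per line."""
    for chrom, start, end, name in rows:
        if start >= end:
            raise ValueError(f"{name}: start must be smaller than end")
    return "\n".join(f"{chrom}\t{start}\t{end}\t{name}"
                     for chrom, start, end, name in rows)


bed_text = to_bed(ELEMENTS)
print(bed_text)
```

The printed text can be pasted into the ‘Paste data’ text box, or saved to a file and uploaded instead.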
(f) Have a look at the conservation of uc.460 on the basepair level for the
group of 13 eutherian mammals and the group of 19 amniota vertebrates. Is
this ultraconserved element indeed almost perfectly conserved across the
mammals? And beyond?
______________________________________________________________
Answer
(a)
8 Go to the Ensembl homepage (http://www.ensembl.org/).
8 Select ‘Search: Human’ and type ‘brca2’ in the ‘for’ text box.
8 Click [Go].
8 Click on ‘Gene’ on the page with search results.
8 Click on ‘Human’.
8 Click on ‘13:32889611-32973805:1’ below ‘BRCA2’.
You may want to turn off all tracks that you added to the display in the
previous exercises.
8 Click [Configure this page] in the side menu.
8 Click [Reset configuration].
8 Click (P).
(b)
8 Click [Configure this page] in the side menu.
8 Click on ‘Comparative genomics - BLASTZ/LASTz alignments’.
8 Select ‘Chicken (Gallus gallus) – BLASTZ_NET – Compact’, ‘Chimpanzee
(Pan troglodytes) – LASTZ_NET – Compact’, ‘Mouse (Mus musculus) –
LASTZ_NET – Compact’ and ‘Platypus (Ornithorhynchus anatinus) –
BLASTZ_NET – Compact’.
8 Click on ‘Comparative genomics - Translated blat alignments’.
8 Select ‘Anole Lizard (Anolis carolinensis) – Compact’ and ‘Zebrafish
(Danio rerio) – Compact’.
8 Click (P).
Yes, the degree of conservation reflects the evolutionary relationship
between human and the other species: the highest degree of conservation is
found in chimp, followed by mouse, platypus, chicken, lizard and zebrafish,
in that order. The exonic sequences of BRCA2 in particular seem to be highly
conserved between the various species. This is to be expected, because
exons are supposed to be under higher selection pressure than
intronic and intergenic sequences.
(c)
8 Click [Configure this page] in the side menu.
8 Click on ‘Comparative genomics – Conservation regions’.
8 Click on ‘Enable/disable all Conservation regions’ and select ‘On’.
8 Click (P).
The ‘Conservation score’ and ‘Constrained elements’ tracks largely
correspond with the data seen in the pairwise alignment tracks; all exons of
the BRCA2 gene show a high degree of conservation. Note, however, that the
5’ and 3’ UTRs do not.
(d)
8 Type ‘pola1’ in the ‘Gene’ text box.
8 Click [Go].
In contrast to the BRCA2 gene, the POLA1 gene also shows conservation in
its non-coding regions, especially in its last three introns.
(e)
8 Click [Manage your data] in the side menu.
8 Click on ‘Upload Data’.
8 Type ‘Ultraconserved elements’ in the ‘Name for this upload (optional)’
text box.
8 Select ‘Data format: BED’.
8 Paste the list of ultraconserved elements in the ‘Paste data’ text box.
8 Click [Upload].
8 Click (P).
A new track named ‘Ultraconserved elements’ has now been added to the
‘Region in detail’ page.
To display the names of the ultraconserved elements:
8 Hover over the ‘Ultraconserved elements’ track name.
8 Hover over the ‘Change track style’ icon (the cogwheel).
8 Select ‘Labels’.
8 Zoom in on the region encompassing the 3’ ends of the POLA1 and ARX
genes and the intergenic region.
Because the group of conserved sequences lies at the 3’ end of the 303-kb
POLA1 gene, nearer to the 3’ end of ARX than to the rest of POLA1,
Bejerano et al. have suggested that their function is not related to
POLA1 but that they instead form a cluster of enhancers of ARX.
(f)
8 Type ‘X:24823511-24823785’ in the ‘Location’ text box.
8 Click [Go].
8 Click on ‘Comparative Genomics – Alignments (text)’ in the side menu.
8 Select ‘Alignment: 13 eutherian mammals EPO’.
8 Click [Go].
To give alignment positions where more than 50% of the bases match a
light-blue background colour:
8 Click [Configure this page].
8 Select ‘Conservation regions: All conserved regions’.
8 Click (P).
Ultraconserved element uc.460 is indeed almost perfectly conserved across
the mammals.
8 Select ‘Alignment: 19 amniota vertebrates Pecan’.
8 Click [Go].
Uc.460 is also conserved in other vertebrates such as opossum, chicken, turkey,
zebra finch and anole lizard.
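The ‘more than 50% of bases match’ criterion used for the highlighting above can be sketched as a per-column count. This is a minimal illustration and not Ensembl's code: how Ensembl treats gaps is not stated in this exercise, so the sketch assumes that gaps count towards the column total but never count as a match. The toy sequences are hypothetical.

```python
from collections import Counter


def conserved_columns(seqs, threshold=0.5, gap="-"):
    """Return indices of columns whose most common base exceeds `threshold`.

    Assumption: the fraction is taken over all rows, gaps included,
    and a gap is never counted as a matching base.
    """
    hits = []
    for i, column in enumerate(zip(*seqs)):
        bases = [b.upper() for b in column if b != gap]
        if not bases:
            continue  # a column of gaps only cannot be conserved
        top_count = Counter(bases).most_common(1)[0][1]
        if top_count / len(column) > threshold:
            hits.append(i)
    return hits


# Toy alignment of four equally long rows (hypothetical data).
toy = ["ACGTA",
       "ACGAA",
       "ACTCA",
       "A-TG-"]
conserved = conserved_columns(toy)
print(conserved)
```

In this toy example the third column is not flagged, because exactly half of the rows agree and the criterion is strictly more than 50%. An ultraconserved element would be flagged at (almost) every position.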
______________________________________________________________