Centro Università di Milano-Scuola
per la diffusione delle bioscienze e delle biotecnologie
www.cusmibio.unimi.it
Gene Hunting for the Cystic Fibrosis gene
The Cus-Mi-Bio staff, composed of both University Professors and High School teachers, are the
scientific editors and authors of this Handbook’s contents.
Workshop Leaders
Giovanna Viale
Professor of Biology and Genetics, Dept. of Biology and Genetics for Medical Sciences, University of
Milan, via Viotti 3/5, Milan, Italy
Cinzia Grazioli
High school teacher working full-time at Cus-Mi-Bio
Dept. of Biology and Genetics for Medical Sciences, University of Milan, via Viotti 3/5, Milan, Italy
Cristina Gritti
High school teacher working full-time at Cus-Mi-Bio
Dept. of Biology and Genetics for Medical Sciences, University of Milan, via Viotti 3/5, Milan, Italy
Cystic Fibrosis
A bioinformatic gene hunting
Scenario
You are a geneticist, and a young couple asks you for counselling. Sophie and David have two
children: Peter, 6 years old, is a healthy boy; Mary, 4 years old, has a frequent cough, colds and
recurrent lung infections. Sophie tells you that her sister died at 12 years of age from Cystic Fibrosis
(CF), while in David’s family there are no cases of the disease. Sophie is now in the 6th week of
pregnancy with her third child and asks about her risk of giving birth to a CF-affected child. You
suggest that Mary has a sweat test to establish Na+ and Cl- ion levels. This is a very simple test
routinely used for CF diagnosis.
From the data available at this point, draw the pedigree of the family and answer the following questions:

• The risk of Sophie being heterozygous for CF
• The risk of David being heterozygous for CF
• The risk of the foetus being affected by CF
• The probability of the baby being healthy / being heterozygous for CF
David and Sophie come back to show you the results of the sweat test. The test confirms the diagnosis of
CF for Mary (Na+ = 87 mmol/l, well above the reference limit of 60 mmol/l). With this new piece of
information, answer the same questions again.

• The risk of Sophie being heterozygous for CF
• The risk of David being heterozygous for CF
• The risk of the foetus being affected by CF
• The probability of the baby being healthy / being heterozygous for CF
The affected child Mary can now be represented by a full symbol in the pedigree.
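As a minimal sketch of the Mendelian reasoning behind these questions, the Python snippet below enumerates the four equally likely allele combinations for a child of two parents. The genotype notation ("A" for the normal allele, "a" for a CF allele) and the function name are illustrative assumptions, not part of the handbook.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def offspring_distribution(parent1, parent2):
    """Return the probability of each child genotype for two parental genotypes.

    Genotypes are two-letter strings: "A" = normal allele, "a" = CF allele.
    Each parent passes on one of its two alleles with probability 1/2.
    """
    counts = Counter("".join(sorted(pair)) for pair in product(parent1, parent2))
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}

# Example: two heterozygous (carrier) parents
distribution = offspring_distribution("Aa", "Aa")
for genotype, probability in sorted(distribution.items()):
    print(genotype, probability)
```

Changing the two parental genotypes lets you check your pedigree answers for other kinds of couples.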
Cystic Fibrosis (CF)
Cystic fibrosis is an autosomal recessive genetic disorder associated with abnormal secretions of the
exocrine glands and therefore affecting many vital body functions, from respiration to digestion and
reproduction. The disease is caused by mutations that disrupt the cystic fibrosis transmembrane
conductance regulator (CFTR) protein (Fig.1), resulting in abnormal regulation of electrolyte transport
across the epithelial cell membrane. The CFTR channel protein possesses two ATP-hydrolyzing domains
which allow the protein to use energy in the form of ATP to control channel gating. Malfunction of CFTR
results in production of poorly hydrated, thickened mucous secretions in the lungs and pancreas and in
the characteristic high chloride levels in sweat (Fig.2).
Fig.1. CFTR is a unique member of the ABC transporter family that forms a novel Cl- channel. It is
located predominantly in the apical membrane of epithelia where it mediates transepithelial salt and
liquid movement. The protein has 5 domains: 2 MSD (membrane spanning domains), 2 NBD (nucleotide
binding domains) binding ATP and a regulatory (R) domain. In the majority of CF cases, the NBD1 domain
is affected.
Fig. 2. Main body apparatus affected in CF. In all affected individuals, sweat glands produce excessive
amounts of salt. Thick mucus blocks digestive enzymes in the pancreas. The viscous mucus blocks
respiratory ducts and makes respiration difficult. In males, dense mucus obstructs sperm ducts leading to
infertility. Only 2-3% of the affected males are fertile.
CF is the most common hereditary disease in the Caucasian population, with an incidence of 1 in
2500-3500 individuals. One in 25-30 people of European descent carries one mutated copy of the CF gene
(is heterozygous).
The CFTR gene is found at the q31.2 locus of chromosome 7, is 230 000 base pairs long (27 exons), and
encodes a protein that is 1,480 amino acids long. The most common mutation, ΔF508, is a deletion (Δ) of
three nucleotides that results in the loss of the amino acid phenylalanine (F) at position 508 of the
protein (Fig.3). This mutation accounts for about 70% of CF cases worldwide and more than 80% in
Northern European populations. There are, however, over 1,400 other rare mutations that can produce CF.
All the identified mutations affect single nucleotides or a small number of adjacent nucleotides. To
describe mutations, a specific nomenclature is used: a mutation can be described as a change in genomic
DNA (g.), in cDNA (c.) or in the protein (p.). For example, p.F508del or ΔF508 means that the mutation
is a deletion of the amino acid phenylalanine (F) in position 508. Five mutations cover around 90% of CF
cases in Caucasian populations (Tab.1). Affected individuals can carry two identical mutations or two
different mutations.
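The incidence and the carrier frequency quoted above can be related to each other through the Hardy-Weinberg principle. The sketch below assumes a population in Hardy-Weinberg equilibrium (random mating, no selection), which is an idealization; the function name is illustrative and not part of the handbook.

```python
import math

def carrier_frequency(incidence):
    """Carrier (heterozygote) frequency 2pq for a recessive disease,
    given the incidence q**2 of affected individuals."""
    q = math.sqrt(incidence)   # frequency of the disease allele
    p = 1.0 - q                # frequency of the normal allele
    return 2.0 * p * q

# Incidence range given in the text: 1 in 2500 to 1 in 3500
for denominator in (2500, 3500):
    frequency = carrier_frequency(1 / denominator)
    print(f"incidence 1 in {denominator}: about 1 carrier in {1 / frequency:.0f}")
```

Under these assumptions the results fall close to the 1 in 25-30 carrier frequency given in the text.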
Mutation        Exon         Frequency (%)
p.F508del       10           81.0
p.Gly551Asp     11           3.5
p.Gly542X       11           1.1
c.621+1G>T      Intron 4     1.0
c.1898+1G>A     Intron 12    0.9
Tab.1 Most common mutations in the CFTR gene in Northern Europe; p and c mean that the mutation is
described in the protein or in cDNA respectively; “+” means that the involved nucleotide is the first in
the intron following the exon which terminates at the nucleotide position indicated by the number before
the “+”.
Fig. 3. The figure shows chromosome 7 involved in CF and the
nucleotide and aminoacid sequences involved in the ΔF508
(F508 del) mutation.
CF mutations affect CFTR function at different levels. As shown in Fig.4, nonsense mutations cause
premature arrest of protein synthesis, while missense mutations can affect protein processing, regulation
or function. The ΔF508 deletion affects CFTR glycosylation in the Golgi, blocking its transport to the
plasma membrane. The different mutations result in different severity of the disease. There is no cure
for CF. Average life expectancy is around 36.8 years. Management of CF is limited to mitigating and
reducing the effects of the most frequent mutations. Gene therapy holds promise as a potential avenue to
cure cystic fibrosis. Gene therapy attempts to place a normal copy of the CFTR gene into affected cells.
Studies have shown that, to prevent the lung manifestations of cystic fibrosis, only 5–10% of the normal
amount of CFTR gene expression is needed.
Fig. 4. Effects of most common mutations on CFTR function.
CF is due to mutations of the CFTR gene on chromosome 7. CF is an autosomal recessive disease and
therefore Mary carries two mutated copies of the CFTR gene. (Note that, like all affected individuals,
Mary could carry two identical mutations or two different mutations.)
To answer Sophie’s and David’s questions you need to know the genotypes of some crucial members of
the family. Which ones?
DNA analysis allows the definition of genotypes and the identification of the mutations present in the
family (in particular in Mary and her parents). You recommend that Sophie undergo amniocentesis to
examine fetal DNA for the presence of CF mutations.
The most common DNA-based diagnostic test for CF is a hybridization technique (Dot Blot) for the
detection of the most frequent mutations of the CFTR gene (see Tab.1 in the CF card). DNA is extracted
from the patient’s cells and deposited in multiple spots on a filter, denatured and hybridized to
labelled oligonucleotide probes specific for the different mutations. After extensive washing to
eliminate excess unbound probe, a simple colorimetric reaction is performed to identify positive probe
hybridization.
Dot Blot results:
The three DNA samples (the parents’ and Mary’s) hybridize to the same mutation-specific probe;
The parents’ DNA also hybridizes to the control probe (wild type gene);
Mary’s DNA does not hybridize to the control probe;
Fetal DNA (from amniocentesis) hybridizes to a specific probe (the same as Mary’s and her parents’) and
to the control probe.
From the Dot Blot results it is possible to derive:

• The parents’ genotypes
• Mary’s genotype
• The fetus’s genotype
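The logic for reading a Dot Blot can be written down as a small decision rule. The sketch below is only an illustration: it assumes one mutation-specific probe and one wild type control probe covering the same site, and the function name and result strings are hypothetical. The hybridization patterns are those listed in the Dot Blot results above.

```python
def genotype_from_dot_blot(mutant_probe, control_probe):
    """Infer a genotype from two hybridization signals (True = positive spot)."""
    if mutant_probe and control_probe:
        return "heterozygous (one mutated and one wild type allele)"
    if mutant_probe:
        return "homozygous for the mutation"
    if control_probe:
        return "no copy of this mutation detected"
    return "uninformative (no signal, repeat the test)"

# (mutation-specific probe, control probe) as reported in the Dot Blot results
samples = {
    "Sophie": (True, True),
    "David": (True, True),
    "Mary": (True, False),
    "fetus": (True, True),
}
for name, (mutant, control) in samples.items():
    print(f"{name}: {genotype_from_dot_blot(mutant, control)}")
```

Note that a positive control with a negative mutation probe only excludes the mutation tested by that probe, not other CFTR mutations.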
Identification of the mutation present in Mary’s family
You are given the sequence of the probe hybridizing with Mary’s and her parents’ DNAs and are
requested to identify the mutation present in this family.
Probe sequence: 5’- CACCATTTAAAGAAAATATCGGTGTTTCCTATGATGA -3’
Start your search at the National Center for Biotechnology Information (NCBI)
http://www.ncbi.nlm.nih.gov/
In the homepage, choose BLAST (on the horizontal menu bar at the top of the page).
BLAST (Basic Local Alignment Search Tool) finds regions of local similarity between sequences
(homology regions). The program compares nucleotide or protein sequences to sequence databases and
calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary
relationships between sequences as well as help identify members of gene families.
The software BLAST is composed of different algorithms allowing the alignment not only of nucleotide
sequences to other nucleotide sequences, but also protein sequences to protein sequences, nucleotide
sequences to protein sequences and vice versa, using the rules of the genetic code to translate nucleotide
into aminoacid sequences.
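To make the idea of a local alignment score concrete, here is a minimal Python sketch of the Smith-Waterman algorithm, the exact dynamic-programming method for local alignment. BLAST itself uses a faster heuristic and is not reproduced here; the scoring values (match, mismatch, gap) and the two short test sequences are arbitrary assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b (Smith-Waterman)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment can restart anywhere, so scores never drop below 0
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best

# A short query found intact inside a longer sequence scores 6 matches x 2
print(smith_waterman_score("GGTGTT", "TTGGTGTTCC"))
```

The higher the score, the better the best local match between the two sequences; BLAST reports a comparable score for every hit together with its statistical significance.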
In the BLAST page of the NCBI site choose Nucleotide-nucleotide BLAST (blastn). In the search box paste
the sequence you want to compare to the database.
In the field “choose database” you can choose the database you want to use for the analysis. Choose the
default option nr (non redundant), which contains all the deposited non redundant sequences; the name is
now only a historical definition, since much of the information contained in the databases is
redundant. (In many cases, you might want to restrict the alignment to sequences from a given organism;
in this case, you can specify the organism of interest in the field “Options for advanced blasting” and
restrict sequence analysis, for instance, to Homo sapiens).
All the underlined entries are hyperlinks with short definitions of the corresponding field. You can have
further information clicking on the various entries. For simplicity, we will use standard criteria for the
analysis, without any option for advanced blasting. Once you have pasted your sequence, click on the
“BLAST!” button.

Identification of the sequence with the best alignment score
The results page is divided into five sections.
In the first section you find:
• link to “reference” on the program (BLASTN);
• information on the database;
• link to “taxonomy report” where you can find a phylogenetic representation of all the species to
which the sequences identified by the BLAST analysis belong. The same sequences are also listed by
organism;
• information about the “query” sequence (i.e. length, shown as “n. letters”, where letters stands for
nucleotide bases).
The second section is a graphic representation of the sequences which have obtained the highest scores in
the alignment to the “query”:
• The red thick line represents the “query”;
• the numbers underneath refer to the length in nucleotides;
• each of the thin lines below the “query”, of various colors, shows an alignment of the “query”
sequence to one sequence from the nucleotide database;
• the color code used to represent sequences reflects the alignment score obtained, which in turn
depends on the percent identity between the two sequences. If you move the pointer over the different
colored segments, the name and the accession number of the corresponding sequence will appear; clicking
on one of the segments will take you to the corresponding sequence alignment;
• under the graphic scheme you find a new link: “distance tree of result”, which is a graphic
representation of the evolutionary distances separating the sequences which share alignments with the
“query”. BLAST can be used to infer evolutionary relationships between sequences. Phylogenetic tree
programs have been developed to provide graphic representations (trees) of the relationships existing
among biological sequences. Note that, since they are inevitably based on a certain degree of
approximation, phylogenetic trees in certain cases don’t give definite results.
In the third section, under the phylogenetic tree, nucleotide sequences producing significant alignments
to the “query” are listed, starting from those with the highest score and the lowest Expect value (E).
The E value is the number of matches with a similar score that one expects to find just by chance in a
database of the current size. For example, an E value of 1 assigned to a hit means that one might expect
to see 1 match with a similar score simply by chance. The lower the E value, i.e. the closer it is to
"0", the more "significant" the match is.
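The relation between score, database size and E value can be sketched with the standard formula E = m × n × 2^(−S′), where S′ is the bit score, m the query length and n the database length. The snippet below is a simplified illustration: real BLAST uses effective (corrected) lengths, and the database size used here is a hypothetical round number, not the size of nr.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)

QUERY_LENGTH = 37            # length of the probe used in this exercise
DATABASE_LENGTH = 10 ** 9    # hypothetical database size in nucleotides

for bit_score in (20, 40, 60):
    e = expect_value(bit_score, QUERY_LENGTH, DATABASE_LENGTH)
    print(f"bit score {bit_score}: E = {e:.2e}")
```

Each additional bit of score halves the E value, while a larger database increases it: the same alignment is less surprising in a bigger database.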
The fourth section shows significant alignments of the "query" sequence to sequences in the database
with the highest similarity to the "query" (“subject” sequences).
For each alignment, the following features are given:
• alignment Score;
• Expect, i.e. ”E value”;
• Identities showing the ratio between the number of identical bases (in this case 37) and the alignment
length (in this case 40); in brackets the resulting percentage (in this case 92%) between the two sequences
in the region of alignment;
• Gaps showing the ratio between the number of gaps present in the alignment (in this case 3) and the
alignment length; in brackets the resulting percentage (in this case 7%);
• Strand shows the polarity of the "query" sequence with respect to the sequence in the database
(Plus/Plus means that the “query” sequence has the same polarity as the sequence in the database, while
Plus/Minus means that the two sequences have opposite polarity).
• The complete alignment follows between the "query" sequence and a given sequence in the database
(“sbjct”). Numbers correspond to base positions within the two sequences; when in a given position the
base in the "query" sequence corresponds to that in the “sbjct” sequence, the symbol "|" appears in
between. This symbol is missing when, in a given position, the nucleotide in the "query" sequence and
that in the “sbjct” sequence are different or when one sequence shows a gap, as found in the alignments
with low identity scores.
The fifth section contains statistical details of the search.
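The Identities and Gaps figures described above can be recomputed from any gapped alignment. The following sketch uses a short made-up pair of aligned sequences (not the real BLAST output) and assumes that percentages are truncated to whole numbers, which reproduces the 92% and 7% of the example; whether BLAST always rounds this way is an assumption.

```python
def percent(count, length):
    """Whole-number percentage, truncated (assumed rounding rule)."""
    return count * 100 // length

def alignment_stats(query_aln, sbjct_aln):
    """Identities, gaps and match line for two aligned sequences ('-' = gap)."""
    if len(query_aln) != len(sbjct_aln):
        raise ValueError("aligned sequences must have the same length")
    pairs = list(zip(query_aln, sbjct_aln))
    identities = sum(q == s and q != "-" for q, s in pairs)
    gaps = sum(q == "-" or s == "-" for q, s in pairs)
    midline = "".join("|" if q == s and q != "-" else " " for q, s in pairs)
    return {
        "length": len(pairs),
        "identities": identities,
        "identity_pct": percent(identities, len(pairs)),
        "gaps": gaps,
        "gap_pct": percent(gaps, len(pairs)),
        "midline": midline,
    }

# Toy alignment, invented for illustration only
query = "ACGT-ACGT"
sbjct = "ACGTTACCT"
stats = alignment_stats(query, sbjct)
print(query)
print(stats["midline"])
print(sbjct)
print(stats)

# The figures quoted in the text: 37 identities and 3 gaps over 40 positions
print(percent(37, 40), percent(3, 40))
```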
Now click on the link corresponding to S71320.1 (the first in the list) to go to the database page
containing info relative to this sequence.
Which mutation is present in Mary’s family?
Hunting the cystic fibrosis gene
Go back to the NCBI homepage:
http://www.ncbi.nlm.nih.gov/
To limit your search, choose Nucleotide in the Search box and type CFTR cystic fibrosis homo sapiens in
the for box. Now click GO.
The database has found 301 hits.
In the list below, choose the hit related to the CFTR gene and click on the corresponding code
NM_000492 to find a typical GenBank nucleotide file.
A lot of it is hard to read, but a few things are clear. First, you find the LOCUS DEFINITION and the
ACCESSION code. Under references, you find citations to the publication of this sequence in the
scientific literature. To see an abstract of the article in which this gene is described, click the PubMed
link below the reference.
Scrolling down the page you find the FEATURES section, i.e. info on the main sequence elements, the
protein sequence and the coding sequence (CDS).
To obtain the CFTR cDNA sequence, click in the box Display (top left of the page) and select FASTA. In
the following page you find the complete cDNA sequence in the format compatible to your next search.
Save the sequence in a word file (cDNAFASTA.doc).
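If you prefer to handle the FASTA file with a script rather than a word processor, the format is easy to read: a header line starting with ">" followed by one or more lines of sequence. The parser below is a minimal sketch in plain Python; the record shown is a toy example invented for illustration, not the real CFTR cDNA.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())   # sequence lines are joined, in upper case
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = """>toy_record hypothetical example, not the real CFTR cDNA
atcatc
tttggtgtt
"""
for header, sequence in parse_fasta(example):
    print(header)
    print(sequence, len(sequence))
```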
Going back to the NM_000492 page, find the identifying code of the normal CFTR protein (NP_000483)
and its aminoacid sequence. Copy and save it as a CFTRprot.doc file. You can count the aa residues
(using the word tool “word count”), or you can find the protein length (number of aa residues) clicking on
the protein code.
From the cDNA sequence you can find the genomic sequence of the CFTR gene using the BLAT
(BLAST-Like Alignment Tool) software.
Go to http://genome.ucsc.edu/cgi-bin/hgBlat?db=mm2 and paste the full cDNA sequence (in FASTA
format) in the BLAT search box. This software allows you to easily compare cDNA sequences and genomic
sequences and to identify the exon-intron structure of a given gene.
Submit and, in the results page, click “details” on the left of the first record (score 6106, size in
nucleotides 8094, 100% identity). The gene, as you already know, is located on chromosome 7.
In the following page, you find your cDNA sequence and, below, the alignment of this sequence to the
genomic sequence. Matching bases in cDNA and genomic sequences are colored blue and capitalized.
Light blue bases mark the boundaries of gaps in either sequence (often splice sites).
In the genomic sequence below, the aligned regions (exons) are shown in capital blue letters while the
regions with no alignment (introns) are in small black letters. Click on links, in the left column, to
navigate through the alignment. Clicking on the various blocks, you are directed to the corresponding
exon in the genomic sequence.
The CFTR gene has 27 exons.
Construction of deleted cDNA
Open the text file cDNAFASTA.doc, make a copy and save it as ΔcDNA.doc. You now have to introduce the
ΔF508 deletion into the wild type cDNA sequence.
We suggest you use the Find function to locate the 15-nucleotide wild type sequence shown in Fig.3 (CF
card). The sequence ATCATCTTTGGTGTT includes the positions deleted in the ΔF508 mutation. Delete the 3
nucleotides involved in ΔF508 (CTT) and save the new sequence.
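The same edit can be done with a few lines of Python instead of the Find function. This is a minimal sketch: the function name is illustrative, and it assumes that the 15-nucleotide window occurs exactly once in the cDNA (it stops with an error otherwise, so that you can edit by hand).

```python
WILD_TYPE_WINDOW = "ATCATCTTTGGTGTT"   # 15 nucleotides given in the text
DELETED = "CTT"                        # the 3 nucleotides removed in ΔF508

def delete_f508(cdna):
    """Return the cDNA with the CTT of the ΔF508 window removed."""
    cdna = cdna.upper()
    start = cdna.find(WILD_TYPE_WINDOW)
    if start == -1:
        raise ValueError("wild type window not found")
    if cdna.find(WILD_TYPE_WINDOW, start + 1) != -1:
        raise ValueError("window occurs more than once; edit by hand")
    cut = start + WILD_TYPE_WINDOW.find(DELETED)
    return cdna[:cut] + cdna[cut + len(DELETED):]

print(WILD_TYPE_WINDOW)
print(delete_f508(WILD_TYPE_WINDOW))
```

Applied to the full wild type cDNA, the function returns a sequence that is exactly 3 nucleotides shorter.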
Translation of the deleted cDNA sequence to obtain the mutated protein sequence
Go to: http://arbl.cvmbs.colostate.edu/molkit/translate/index.html
Here you can translate your cDNA sequence into an aminoacid sequence. Just paste your deleted cDNA
sequence in the white box and click “translate DNA”.
You will get an image with a black background, showing the results of the translation of your sequence in
the six possible reading frames (three for each DNA strand). Putative ATG start codons are shown in green
and stop codons in mauve.
Choose the frame with the longest open reading frame, ORF (i.e. the longest sequence without stop
codons). In our case, the longest ORF is in forward frame 1. Now click “Text Output” to visualize the
sequence and its translation into amino acids.
Choose forward frame 1. You can easily personalize the software output (you can choose to visualize
the aa sequence with the one or three letter code, alone or together with the DNA sequence). Choose “one
letter code” and “amino acids and DNA”. Copy the whole content of the box into a new Word file (pCFTR
mutated.doc) using the font “courier new”, font size 8, black; this font is suitable for sequence
alignments since all the letters and spaces have the same width.
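What the translation tool does can be sketched in plain Python: translate the sequence in the six reading frames with the standard genetic code and look for the longest stretch without stop codons. This is an illustrative sketch, not the code of the web tool; function names are assumptions, and "*" marks a stop codon.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated with T, C, A, G at each position
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def reverse_complement(dna):
    """Sequence of the opposite strand, read 5' to 3'."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate a DNA string codon by codon ('X' = unreadable codon)."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Translations in the three forward and three reverse reading frames."""
    frames = {}
    for strand, seq in (("forward", dna.upper()), ("reverse", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand} frame {offset + 1}"] = translate(seq[offset:])
    return frames

def longest_orf(protein):
    """Longest stretch of a translation without a stop codon."""
    return max(protein.split("*"), key=len)

# The 15-nucleotide window from the text, before and after the deletion
print(translate("ATCATCTTTGGTGTT"))
print(translate("ATCATTGGTGTT"))
```

For the full deleted cDNA you would call `six_frames` on the whole sequence and compare `longest_orf` across the six frames.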
Localization of the main features of the cDNA
Now we can start the localization of all the main cDNA/mRNA features, i.e.:
• 5’UTR (5’ UnTranslated Region)
• start codon (start of translation, ATG (AUG) coding for methionine)
• CDS (CoDing Sequence)
• stop codon (end of translation, TAA (UAA) / TAG (UAG) / TGA (UGA))
• 3’UTR (3’ UnTranslated Region)
• polyA site (polyadenylation site): the consensus sequence is “AATAAA”, generally located 20-30 bp
(range 10-35 bp) upstream of the polyA tail. In some genes, this signal can slightly differ from the
consensus sequence (e.g. ATTAAA).
Now try to identify all the above mentioned elements and mark them with the yellow highlighter.
Starting from the 5’ end of the molecule, find the correct start codon, i.e. the first ATG
(corresponding to the first aa M) that is not followed, at short distance, by any stop codon (aa
position 45). Once you have identified the start codon, you can delete the upstream aa sequence. This
region (5’UTR) is not translated by the ribosomes into protein. Now find the stop codon (in this case a
TGA codon) and delete the downstream sequence (3’UTR). The CDS (i.e. the sequence which is translated
into protein) extends from nucleotide 133 to nucleotide 4570. Finally, identify the polyadenylation
signal, the sequence that signals where transcription can stop and where the polyA tail is added
(starting at position 6106).
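The search for these features can also be sketched in Python. The example below runs on a short invented sequence, not on the CFTR cDNA; function names are illustrative. One assumption to note: the CDS range returned here includes the stop codon, which is a matter of convention.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def codon_number(nt_position):
    """Codon (aa) number, in frame 1, of a nucleotide position counted from 1."""
    return (nt_position - 1) // 3 + 1

def find_cds(cdna, start_nt):
    """Return (start, end) of the CDS beginning at the ATG at position start_nt.

    Positions are counted from 1; end is the last base of the first
    in-frame stop codon (assumed convention). Returns None if no stop is found.
    """
    cdna = cdna.upper()
    first = start_nt - 1
    if cdna[first:first + 3] != "ATG":
        raise ValueError("no ATG at the given position")
    for j in range(first, len(cdna) - 2, 3):
        if cdna[j:j + 3] in STOP_CODONS:
            return start_nt, j + 3
    return None

def find_polya_signals(cdna, signals=("AATAAA", "ATTAAA")):
    """Positions (counted from 1) of polyadenylation signal sequences."""
    cdna = cdna.upper()
    hits = []
    for signal in signals:
        index = cdna.find(signal)
        while index != -1:
            hits.append((index + 1, signal))
            index = cdna.find(signal, index + 1)
    return sorted(hits)

# Toy sequence: 5'UTR GGC, ATG, one codon, TGA stop, 3'UTR with AATAAA
toy = "GGC" + "ATG" + "AAA" + "TGA" + "CCAATAAACC"
print(find_cds(toy, 4))
print(find_polya_signals(toy))
# Nucleotide 133 in frame 1 corresponds to the aa position given in the text
print(codon_number(133))
```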
How long is the mutated protein?
Compare it with the length of the normal protein you have taken note of before.
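Besides comparing lengths, you can let a script point out where two protein sequences differ. The sketch below uses `difflib` from the Python standard library on the short fragment translated from the 15-nucleotide window (IIFGV) and its deleted version (IIGV); positions are therefore counted within the fragment, not within the whole protein. The function name is illustrative.

```python
from difflib import SequenceMatcher

def describe_differences(normal, mutated):
    """List (change, position, normal residues, mutated residues).

    Positions are counted from 1 in the normal sequence.
    """
    matcher = SequenceMatcher(None, normal, mutated, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, i1 + 1, normal[i1:i2], mutated[j1:j2]))
    return changes

normal_fragment = "IIFGV"    # translation of ATCATCTTTGGTGTT
mutated_fragment = "IIGV"    # translation after deleting CTT
print(len(normal_fragment) - len(mutated_fragment))
print(describe_differences(normal_fragment, mutated_fragment))
```

With the two full-length sequences saved earlier, the same function reports the change at its position in the complete protein.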
View the CFTR 3D structure and compare the normal and mutated protein structures.
Accurate 3D macromolecular structures are obtained by two main techniques: X-ray crystallography
(83%) and Nuclear Magnetic Resonance (NMR). Rough structure predictions can also be obtained by
comparative modelling (when the unknown structure shows a good level of sequence identity with a
known structure), but templates are available for only 20% of domains. This kind of analysis, although
less accurate, can also provide important insights into the structure/function of a protein. The 3D
coordinates of each single atom are collected in a file and can be viewed using dedicated software. We
will use the Deep View software. All files containing the coordinates of the molecules whose
structure has already been determined are collected (in different formats) in the PDB (“Protein Data
Bank”) database (and are accessible using different kinds of software).
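A PDB file is plain text in which each ATOM line stores, in fixed columns, the atom name, the residue name, the chain, the residue number and the x, y, z coordinates. The sketch below reads those columns in plain Python. The two ATOM lines are generated by a toy helper with invented coordinates and a simplified atom-name placement; they are not taken from a real structure.

```python
def format_atom(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a fixed-column ATOM line (toy helper for this example)."""
    return (f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
            f"{x:>8.3f}{y:>8.3f}{z:>8.3f}")

def residues_in_pdb(lines):
    """Return the (chain, residue number, residue name) of each residue, in order."""
    seen = []
    for line in lines:
        if not line.startswith("ATOM"):
            continue
        # Columns 18-20: residue name, 22: chain, 23-26: residue number
        key = (line[21], int(line[22:26]), line[17:20].strip())
        if key not in seen:
            seen.append(key)
    return seen

def coordinates(line):
    """x, y, z of an ATOM line (columns 31-38, 39-46, 47-54)."""
    return float(line[30:38]), float(line[38:46]), float(line[46:54])

# Invented records: residues 507 and 509 present, 508 absent
toy_lines = [
    format_atom(1, "CA", "ILE", "A", 507, 0.0, 0.0, 0.0),
    format_atom(2, "CA", "GLY", "A", 509, 3.8, 0.0, 0.0),
]
residues = residues_in_pdb(toy_lines)
print(residues)
print(any(number == 508 for _, number, _ in residues))
```

With a real file you would pass `open("file.pdb")` to `residues_in_pdb` to check which residue numbers are present in the crystal.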
Deep View
Deep View (formerly called Swiss-PdbViewer) is a powerful molecular graphics program available from
the Expert Protein Analysis System (ExPASy) Molecular Biology Server in Geneva, Switzerland.
Deep View is downloadable from http://www.expasy.org/spdbv/.
Deep View is an application with a user-friendly interface that lets you view protein 3D structures,
create models and also analyse several proteins at the same time. The proteins can be superimposed in
order to deduce structural alignments and compare their active sites or any other relevant parts. Amino
acid mutations, H-bonds, angles and distances between atoms are easy to obtain thanks to the intuitive
graphic and menu interface.
For proteins of known sequence but unknown structure, DeepView submits amino acid sequences to
ExPASy to find homologous proteins, onto which you can subsequently align your sequence to build a
preliminary three-dimensional model. Then DeepView submits your alignment to ExPASy, where the
SWISS-MODEL server builds a final model, called a homology model, and returns it directly to
DeepView.
With the Deep View software, you will learn to:
• observe tertiary structures of proteins, identify domains and secondary structures
• get familiar with the different types of visualizations: ribbon diagrams, backbone and sidechains
• superimpose and compare two, or many, models simultaneously
• identify the residues involved in a given pathologic mutation within a structure (e.g. F508del)
• compare 3D structures of the normal and mutated protein.
Open the file 2BBO (with “.pdb” or “.ent” extension) containing the 3D structure of the normal CFTR
protein. This file is also available at the “Protein Data Bank” site http://www.rcsb.org/pdb/ by typing
CFTR in the search box.
Open the file with the “DeepView” visualization software.
To display a molecule, simply drag a PDB file onto the Swiss-PdbViewer icon, select the "Open" item from
the "File" menu, or select one of the recently opened proteins that appear at the bottom of the File menu.
[Figure: the DeepView toolbar (Center, Drag, Zoom and Rotate icons, information about the protein) above
the image of the protein structure]
By default, the molecule will appear in wireframe representation in the Display Window.
The upper window gives access to the Menu and to the most common tools to manipulate the protein.
The lower window shows the protein structure.
Protein manipulation
The icons in the toolbar window allow protein manipulation. The icon on the left allows you to move the
protein to the center of the screen. The three next icons, above the Move all button, are used to drag,
zoom and rotate the protein. Once the icon has been selected (with a mouse click), you can manipulate
the protein shown in the lower window by moving the mouse.
Icons of the third group, on the right, are for measuring distances, bond angles, and dihedral angles.
We won’t use these functions.
The icon with the written page, to the left of Move all, opens a window containing info about the
protein, including its amino acid sequence.
Control Panel
Use the command Wind: Control Panel to display it. The window will appear on the left part of the
screen.
The Control Panel lists the amino-acids residues and other contents of the PDB
file. You use the Control Panel to select residues, establish the content of the
display, label residues, and color them.
The first click on an inactive window activates it, but does not cause other
changes. First, simply scroll down to the bottom of the list to see how many
amino-acid residues the protein contains. Now click and drag, starting on the first
residue at the top of the window, dragging down to the tenth, and releasing the
mouse button. All group names from 1 to 10 should turn red. Groups printed in
red are now selected. The simplest way to select a small number of residues is to
click their names and drag to select a range of them. Press the Return key: this
action hides all groups except those selected. Notice also that a checkmark has
appeared in the show column next to the selected residues, indicating that they are
on display. A side chain is shown only if the rest of the residue is shown, so you
only see the side chains of displayed residues in the graphics window.
There are also two narrow columns to the left of the group column. The first
column is blank when the current model contains only one protein chain. If the
crystal is made of more chains, in this column the letters A, B etc will appear. In
the Control Panel a chain corresponding to aa residues 389-678 is shown,
corresponding to the NBD1 functional domain of the CFTR protein. The second
column contains groups of the letters h or s. Groups labeled h comprise alpha
helices, while groups labeled s comprise strands of beta sheet. Click anywhere in the A column to
select the entire chain (all the A groups turn red). Press Return and the A chain will be displayed in
the window. In the same way, you can select h or s parts of the chain. If you want to select two
separate groups, select the first one and then the second while holding down the Control key. Once
selected, press Return to visualize them.
Select some residues and click on the columns show, side, labl, surface and ribn.
• Labl: shows the names of the selected aa residues.
• Surface: shows for each aa the Van der Waals surface, rendered as a dotted surface. Other kinds of
surfaces are available from the menu under the black triangle symbol.
• Ribn: draws the ribbon representation of the 3D protein structure.
You can select or deselect groups by keeping the mouse button pressed and dragging along the
Control Panel columns; you can get the same result by clicking on the top of the columns.
Remember that you can center, move, zoom, rotate the image in the display window using the
corresponding icons.
To show the alpha helices (red) and beta sheet (yellow) structure of the protein, in Control Panel select
all the aminoacids in the ribn column (pressing the control and shift keys), from the Color menu select
secondary structure and from the display menu select render in 3D.
Coloring (Color menu)
DeepView provides many different ways to color a model. Colors can reveal structural, chemical, and
comparative features of the protein.
Go to the Color menu and click:
• Secondary Structure: DeepView colors helical residues red, beta sheet residues (strands) yellow, and
all others gray. Notice that the color for each residue is displayed in a small square beside the residue in
the right-hand column of the Control Panel.
• Secondary Structure Succession: DeepView colors helices and strands, but with this command,
color reflects the order of each structural element in the overall sequence of residues. The result is
that it is easy to follow the chain through the protein: elements of secondary structure are colored
from the N-terminal to the C-terminal end in the order violet, blue, green, yellow, orange, red.
• Chain: DeepView colors the entire model yellow. If there were more than one chain in this model,
each chain would be shown in a different color. This color provides good contrast to the colors you will
now add for sidechains.
• Type: This command recolors the side chains of all residues according to chemical type: non polar
groups in grey (note that most non polar hydrophobic side chains are inside the protein), acidic
(negative) groups in red and basic (positive) groups in blue.
• CPK: this operation returns the atoms to the standard colors: white for carbon, red for oxygen, blue
for nitrogen and yellow for sulfur.
N.B You can change the default colors for atoms, individual residues or complex selections of residues
(chains) or the background color from Preferences (clicking color). In the new window you can choose
the colors you like.
Select Menu
From the Select menu you can select the submenus:
• All: to select all the protein residues (press Return to display).
• Secondary structure: to select and display different parts of the protein (press Return to display).
- Helices: to select and display residues forming alpha helices,
- Strand: to select and display only beta sheets.
- Coil: to select and display the remaining residues.
• Group property: to select and display only positive, negative, polar and non polar residues (press
Return to display).
Wind Menu
From the Wind Menu you can select the submenu:
• Alignment: The Align window appears at the bottom left. You use this window when you want to
compare two or more proteins. The sequences of the two chains in one-letter abbreviations are shown.
The name of the currently active layer is shown in red, with the reference layer at the top. You can
click a layer name to make it active, and you can also select residues and switch layers from this
window.
Select phenylalanine 508 in the list of aa residues and color it blue; in the 3D image, the aa position
will be shown and in the sequence in the Alignment window the name of the currently active residue will
be shown in red.
Comparing the wild type and mutated forms of the CFTR protein
In this case, we can align the wild type protein with its aberrant form. 3D structures are available for both
forms of this important protein. We can identify the mutated amino acid from its position in the sequence
and view the structural changes induced by the mutation. This is an optimal situation. However, 3D
structures are not available for many proteins, and even less often for different forms of the same protein.
Thus, one has to resort to identifying the mutation from the amino acid positions in the sequence and
inferring structural changes from the nature and position of the change. This is certainly less than optimal
and will change with the availability of more accurate structure prediction algorithms, which will make it
possible to build testable hypotheses without having to experimentally determine the 3D structure of
mutated proteins. Resolving protein structures experimentally is a tremendous effort and should be
reserved for the final confirmation of already validated drug targets.
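When only sequences are available, locating a deletion comes down to comparing the two sequences position by position. A minimal sketch in Python, using the standard library module difflib: the function name is hypothetical, and the sequences are toy letters chosen so that the deleted residue lands on position 508. They are not the real CFTR sequence.

```python
from difflib import SequenceMatcher


def find_deletions(wild_type, mutant, first_residue_number=1):
    """Return (residue_number, residue) pairs present in wild_type but absent in mutant."""
    # autojunk=False keeps difflib from ignoring very frequent letters in long sequences.
    matcher = SequenceMatcher(None, wild_type, mutant, autojunk=False)
    deletions = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "delete":
            for index in range(i1, i2):
                deletions.append((first_residue_number + index, wild_type[index]))
    return deletions


# Toy fragments: the first letter is assumed to be residue 504.
wild_type_fragment = "ACDEFGHIK"
mutant_fragment = "ACDEGHIK"
print(find_deletions(wild_type_fragment, mutant_fragment, first_residue_number=504))
# [(508, 'F')]
```

In real sequences, a deletion next to identical residues can be placed at more than one equivalent position, so the reported number may differ from the conventional name of the mutation. A dedicated alignment tool is the better choice for real data.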
Now you will compare two different forms of the same protein, the normal and the F508del forms of
CFTR. DeepView allows you to superimpose different models and to color them to emphasize
conformational differences.
Open the file 1XMJ.pdb (F508del) in DeepView (the normal protein is already displayed on your
screen). Below the 3D structure, you will find the sequence alignment of the two proteins. The 1XMJ
structure corresponds to aa residues 389-678. In the Control Panel, select the aa residues around
position 508 and give them a different color. As expected, F508 is missing in the 1XMJ crystal. To
compare the two structures, you carry out a simple superposition of the two proteins in the same window.
Before you do this, click on the Control Panel, then click and hold on the name 2BBO at the top. A menu
appears, showing the names of all loaded models. Select 1XMJ. Examine the effects of the visible and
can move buttons on 1XMJ. Restore movement and visibility for 1XMJ.
To better compare the two structures, color the normal protein model (2BBO) in pink before
superimposing them. To act on one of the two proteins (to change colors, etc.), choose the correct file
from the scroll-down menu in the Control Panel.
To superimpose the two structures, select Magic Fit from the Fit menu. You will immediately note that
the two proteins have superimposable 3D structures.
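A common numerical summary of how well two superimposed structures match is the root-mean-square deviation (RMSD) between equivalent atoms: the smaller the value, the closer the two models. The Python sketch below only computes the RMSD for two coordinate lists that are already superimposed and matched atom by atom. It does not perform the fit itself, and the function name and coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("both structures need the same, non-zero number of atoms")
    squared_distance_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_distance_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_distance_sum / len(coords_a))


# Made-up coordinates for two matched atoms in each model.
model_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(model_a, model_b))  # 1.0
```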
Go back to the Cystic Fibrosis card at the beginning of this handbook.
As shown in Fig. 4, the F508del mutation blocks the processing of the CFTR protein in the ER. The
mutated protein does not reach the membrane, not because of its altered structure but because of
its incomplete maturation. F508 is involved in a crucial signal for the correct post-translational
modification of the protein: the F508del protein does not pass the quality control step at the level of the
endoplasmic reticulum. The mutated protein is instead transported to proteasomes and degraded.
Links
Bioinformatics is a tool for accessing and managing data in a database. Bioinformatics includes:
1. Databases of experimental data generated in research laboratories
2. Software for navigating databases
A biological database is a collection of data generated by experimental research, both from in vitro and
in vivo studies, from bioinformatics analysis (in silico) and from the scientific literature.
Databases are systematically arranged collections of data structured to enable efficient user access and
management of different types of information.
Databases are composed of “entries”. Each entry contains data according to the database type.
First level Databases: collections of nucleotide (DNA, RNA) or protein sequences and info to identify
their species of origin and their functions.
Specialized Databases: collections of specific info about taxonomy, functional domains, scientific
literature, diseases associated with sequence mutations.
A list of Databases used by Bioinformatics labs (save them in your Bookmarks)
http://www.ncbi.nlm.nih.gov/
NCBI (National Center for Biotechnology Information, NIH) has developed a public database and the
software for genome analysis.
http://www.ensembl.org
Ensembl (from the French word “ensemble” and “EMBL”, European Molecular Biology Laboratory) is a
database created through the collaboration of EMBL - European Bioinformatics Institute (EBI) and the
Wellcome Trust Sanger Institute (WTSI) to develop a software system that automatically manages all the
available annotations on some eukaryotic genomes.
http://www.expasy.org/sprot/sprot-top.html
Swiss Prot
http://smart.embl-heidelberg.de/
SMART (Simple Modular Architecture Research Tool)
http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/
The PDBsum Database is a collection of images that give a preview of the 3D structures
contained in the Protein Data Bank (PDB). It shows the different components of each
structure (i.e. protein chains, ligands, metal ions) and gives a schematic view of their interactions.
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
OMIM, Online Mendelian Inheritance in Man
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=pubmed
PubMed, accessible from NCBI Entrez, was developed by the National Center for Biotechnology Information
(NCBI) and the National Library of Medicine (NLM), at the U.S. National Institutes of Health (NIH).
PubMed gives access to the scientific and medical literature.
Bioinformatics tools
http://genome.ucsc.edu/cgi-bin/hgBlat?db=mm2
BLAT (BLAST-Like Alignment Tool),
http://arbl.cvmbs.colostate.edu/molkit/translate/index.html
This tool lets you enter nucleic acid sequences (cDNA or mRNA) and obtain all the possible
translations into amino acid sequences.
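"All the possible translations" of a nucleotide sequence are usually understood as the reading frames: three on the sequence as entered and three on its reverse complement. The Python sketch below illustrates this under that assumption. It uses the standard genetic code, is not the code behind the tool linked above, and its function names and toy sequence are hypothetical. Stop codons are written as `*` and any codon containing an unrecognized letter as `X`.

```python
from itertools import product

# Standard genetic code, listed in the usual T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string written with A, C, G, T."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; a trailing partial codon is ignored."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(sequence):
    """Return the translation of each frame: +1..+3 forward, -1..-3 reverse strand."""
    dna = sequence.upper().replace("U", "T")  # accept mRNA as well as cDNA
    frames = {}
    for strand, strand_sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_sequence[offset:])
    return frames


# Toy sequence, not taken from the CFTR gene.
for frame, protein in six_frame_translation("ATGTTTGGT").items():
    print(frame, protein)
```

For the toy sequence, frame +1 reads ATG TTT GGT and gives MFG, while the other frames give different, shorter peptides. Choosing the biologically meaningful frame is the step the user still has to make.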
http://www.expasy.org/spdbv/
DeepView (from the Expert Protein Analysis System (ExPASy) Molecular Biology Server in Geneva) is an
application with a user-friendly interface that lets you view protein 3D structures, create models
and analyze several proteins at the same time. For proteins of known sequence but unknown
structure, DeepView submits amino acid sequences to ExPASy to find homologous proteins, onto which
you can subsequently align your sequence to build a preliminary three-dimensional model. Then
DeepView submits your alignment to ExPASy, where the SWISS-MODEL server builds a final model,
called a homology model, and returns it directly to DeepView.