Exercises databases
Bioinformatics
BIOMART/ GENOME ALIGNMENT III
CONTENTS
Biomart/ GENOME ALIGNMENT
Introduction
Downloading the sequences
Aligning the sequences using AVID
Aligning the sequences using VISTA genome browser
INTRODUCTION
The comparison of the mouse and human genomes has demonstrated the power of comparative genomics in inferring the
evolutionary history of species and in identifying functional regions in genomes. The possibilities for identifying regions
under selection are enhanced with the addition of more sequences and this observation has led to numerous ‘focused
sequencing’ projects which seek to obtain sequence for a small region of a genome in numerous other organisms.
Biologists who seek to analyze conserved regions among homologous sequences are faced with the daunting task of
aligning large genomic regions and subsequently sifting through massive amounts of data. In order to facilitate the
discovery process without requiring biologists to download and install complex software, a number of web servers for
alignment and analysis have been set up in recent years. These servers align submitted sequences and then generate plots
or graphs designed to help researchers identify conserved regions.
AVID is a pairwise global alignment program; its multiple-alignment extension, MAVID, is a progressive alignment program.
MAVID works by recursively aligning the ‘alignments’ at ancestral nodes of the guide tree. At each internal node, ancestral
sequences are inferred from the existing alignments using maximum likelihood and these ancestral sequences are then
aligned using the AVID program.
The server goes through a number of steps:
1. Sequences are repeat-masked using the DUST program (Tatusov and Lipman, unpublished).
2. A random (almost complete) binary guide tree is generated for alignment of the sequences using the progressive
alignment method.
3. The sequences are aligned using AVID.
4. A phylogenetic tree is inferred from the multiple alignment using the neighbor joining method.
5. Steps 3 and 4 are repeated for a total of three iterations.
6. Pairwise alignments are generated from the multiple alignment with respect to all of the sequences and these are
used to generate conservation plots and to identify conserved regions.
In this exercise we will perform an alignment of orthologs of the pax6 gene.
1) First, we will download the sequences of the orthologs of the pax6 gene using Biomart.
2) Subsequently, the orthologs will be aligned using AVID/LAGAN.
DOWNLOADING THE SEQUENCES/ BIOMART
Go to Biomart: http://www.ensembl.org/biomart/martview/f0e53c7f00c3cded9dff7e1e22d391dc
If needed, view the tutorial first: http://www.ensembl.org/info/website/tutorials/index.html


Choose a database (“Ensembl Genes 78”)
Choose a dataset (“Homo sapiens genes (GRCh38)”)

Go to the Filters section
You want to select a specified gene based on its ENSEMBL gene ID (ENSG00000007372) in the human genome.

Move to the Attributes section
Select the attributes you want to download:
You might want to select the position and the strand of the gene on the genome. Also select an external identifier (e.g.
HGNC ID) and, most importantly, the ENSEMBL gene ID of the original gene.


Press the Results button to see your selection
Export the requested data as a Tab Delimited text file
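The selection made in the web form above can also be written down as a BioMart XML query, which is convenient for repeating the download later. The following is a minimal sketch, not the handout's own method. It assumes the Ensembl internal names `hsapiens_gene_ensembl` (dataset), `ensembl_gene_id` (filter) and `chromosome_name`, `start_position`, `end_position`, `strand`, `hgnc_id` (attributes), and the `martservice` URL; these can change between releases, so compare with the output of the ‘XML’ button in Biomart. `build_query` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed endpoint and internal names; verify them with the "XML" button in BioMart.
MARTSERVICE = "http://www.ensembl.org/biomart/martservice"


def build_query(gene_id, attributes, dataset="hsapiens_gene_ensembl"):
    """Return a BioMart XML query that selects one gene by its Ensembl gene ID."""
    query = ET.Element("Query", virtualSchemaName="default", formatter="TSV",
                       header="1", uniqueRows="1")
    ds = ET.SubElement(query, "Dataset", name=dataset, interface="default")
    # Filter: restrict the dataset to the requested gene.
    ET.SubElement(ds, "Filter", name="ensembl_gene_id", value=gene_id)
    # Attributes: the columns of the exported table.
    for attr in attributes:
        ET.SubElement(ds, "Attribute", name=attr)
    return ET.tostring(query, encoding="unicode")


xml_query = build_query(
    "ENSG00000007372",
    ["ensembl_gene_id", "chromosome_name", "start_position",
     "end_position", "strand", "hgnc_id"],
)
# URL that would return the tab-delimited table (built here, not fetched).
request_url = MARTSERVICE + "?" + urlencode({"query": xml_query})
print(xml_query)
```

Opening `request_url` in a browser (or with `urllib.request.urlopen`) should return the same tab-delimited table as the Results page, provided the assumed names are still valid.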
How many PAX6 gene transcripts were selected?

Go to the Attributes section and select the homologous genes (orthologs) in chicken, fugu and mouse.
This information can be used to download the corresponding sequences.
Go to the attributes and select ‘sequences’.
You can select different parts of the sequence: either you select the protein sequence (introns spliced out, only the translated
part is downloaded), or you download the entire gene (introns and exons included, ignoring the
transcript information) together with the 5′ and 3′ ends.
Try both. How many sequences do you get when you download the peptide? How many when you download the gene?
Explain.
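To answer the counting question for a downloaded file, it is enough to count the header lines, since every FASTA record starts with ‘>’. A small sketch; the two-record string is made-up toy data, not real Biomart output, and `count_fasta_records` is a hypothetical helper name.

```python
import io


def count_fasta_records(handle):
    """Count sequences in a FASTA stream: every record starts with a '>' line."""
    return sum(1 for line in handle if line.startswith(">"))


# Toy data for illustration only (not real BioMart output).
toy = io.StringIO(">transcript_A\nMQNSHSGV\n>transcript_B\nMQNSHSGVNQ\n")
n_records = count_fasta_records(toy)
print(n_records)
```

For a real download, pass an open file instead of the toy stream, for example `with open("peptides.fasta") as fh: count_fasta_records(fh)` (the file name is a placeholder).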
You can do this via the Export data section in the left panel.
Download the gene (genomic sequence, unspliced gene). Do not forget to also export ‘header’ information, such as the
strand, the exon positions and the gene start and end, and save the result as a text file (FASTA).
To annotate the file we will also download the gene's structural information (exon and intron start positions and the order
of the exons and introns). Save it in Excel format. This information will be used to build an annotation file for our sequence.
To test what exactly we downloaded, go to the Ensembl gene page for PAX6 and view the sequence
information. Use the find function to locate the beginning (search for
CCCTCTTTTCTTATCA) and the end of your downloaded sequence in the displayed sequence. This
shows you that the sequence you downloaded starts with the end of the last exon and ends with the
beginning of the first exon (as the gene is located on the -1 strand). So the downloaded sequence is not
reverse complemented.
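The manual find step above can also be done programmatically: take the first bases of the downloaded sequence and test whether they occur in the displayed sequence as-is or as their reverse complement. This is a sketch under the assumption that both sequences are available as plain strings; `orientation` is a hypothetical helper name.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def orientation(downloaded, displayed, probe_len=16):
    """Say how the start of `downloaded` occurs in `displayed`."""
    probe = downloaded[:probe_len]
    if probe in displayed:
        return "same"
    if reverse_complement(probe) in displayed:
        return "reverse_complement"
    return "not_found"


probe = "CCCTCTTTTCTTATCA"  # start of the downloaded human sequence (from the text)
print(reverse_complement(probe))
```

If `orientation` returns "same", the two sequences are on the same strand; "reverse_complement" means one of them has been reverse complemented.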
Looking at the header information of the pax6 gene downloaded from Biomart, we see that
the gene start is 31784792 and the gene end is 31817961:
>ENSG00000007372|1|31784792|31817961|31801776;31812926;31806013;31793802;31802834;31800856;31806462;
31817948;31790019;31806925;31793553;31794788;31794114;31790860;31806921;31811677;31
812183;31803673;31811015;31801912;31811118;31811137;31804046;31811331;31811045;3181
1237;31802971;31791309;31817961;31803333;31812177;31804619;31800646;31817937;318040
25;31810667;31811308;31794126;31801335;31804044;31810305;31817874|31801561;31812572
;31805389;31793652;31802729;31800763;31806344;31817809;31789936;31806849;31802704;3
1793438;31806402;31794630;31794032;31790710;31801728;31811115;31788910;31784792;318
00691;31812093;31803398;31810828;31801869;31801871;31794780;31811213;31800707;31801
762;31789830;31804452;31801578;31800539;31789182;31789918;31794098;31801617;3178991
7;31801745;31803952;31789922;31801230;31793173;31793483;31809906;31806406;31789913;
31793674
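A header like the one above can be split mechanically: fields are separated by ‘|’ and list entries by ‘;’. The sketch below assumes the field order seen in this header (gene ID, a second field whose meaning depends on the exported attributes, gene start, gene end, exon ends, exon starts); the order follows the attributes chosen at export, so it may differ in your file. The header used here is a shortened version of the real one, and `parse_biomart_header` is a hypothetical helper name.

```python
def parse_biomart_header(header):
    """Split a '|'-separated BioMart FASTA header into named fields."""
    def to_ints(text):
        return sorted({int(x) for x in text.split(";") if x})

    fields = header.lstrip(">").strip().split("|")
    gene_id, second, gene_start, gene_end, ends, starts = fields
    return {
        "gene_id": gene_id,
        "field_2": second,  # meaning depends on the exported attributes
        "gene_start": int(gene_start),
        "gene_end": int(gene_end),
        "exon_ends": to_ints(ends),      # assumed: 5th field holds exon ends
        "exon_starts": to_ints(starts),  # assumed: 6th field holds exon starts
    }


# Shortened header (only three exons kept) for illustration.
short = (">ENSG00000007372|1|31784792|31817961|"
         "31790019;31817961;31790860|31784792;31817809;31790710")
info = parse_biomart_header(short)
print(info["gene_start"], info["gene_end"])
```

A quick sanity check is that the smallest exon start equals the gene start and the largest exon end equals the gene end.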
We know from the structural information that our downloaded sequence starts at the end of the last exon (exon 14, 31784792) and
ends with the start of the first exon (exon 1, 31817961); these values can be obtained from the structural annotation file. So the
beginning and the end of the downloaded sequence correspond to the end of exon 14 and the beginning of exon 1, respectively (as we
observed previously).
So to annotate the positions of the exons:
Position 1 in our downloaded file corresponds to the genomic position 31784792.
Position 33170 in our downloaded file corresponds to the genomic position 31817961 (31817961-31784792+1 = 33170).
This information will be used to annotate the positions of the exons on our downloaded file (see below).
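The conversion from genomic position to position in the downloaded file is a single subtraction. A short sketch using the gene start from the header and four exon coordinates from the structural annotation; `to_file_position` is a hypothetical helper name.

```python
GENE_START = 31784792  # genomic position of base 1 in the downloaded file
GENE_END = 31817961


def to_file_position(genomic_pos, gene_start=GENE_START):
    """Convert a genomic coordinate to a 1-based position in the downloaded file."""
    return genomic_pos - gene_start + 1


exons = {
    "exon 14": (31784792, 31790019),
    "exon 13": (31790710, 31790860),
    "exon 12": (31793438, 31793553),
    "exon 1": (31817809, 31817961),
}
for name, (start, end) in exons.items():
    print(name, to_file_position(start), to_file_position(end))
# exon 14 1 5228
# exon 13 5919 6069
# exon 12 8647 8762
# exon 1 33018 33170
```

Because the file is not reverse complemented, the same formula works for every exon even though the gene lies on the -1 strand; only the exon order in the file is reversed.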
AVID
You might wish to add the annotation (positions of the exons, introns) to the multiple alignment you are going to make in
AVID. You will have to construct a gff file with the essential annotation. To construct this file you need the exon positions.
See the instructions for creating this file in the figures below.
Position in file = genomic position - 31784792 + 1

Exon     Exon ID          Genomic start  Genomic end  Start in file  End in file
Exon 1   ENSE00001479873  31817809       31817961     33018          33170
Exon 14  ENSE00001213516  31784792       31790019     1              5228
Exon 13  ENSE00003700637  31790710       31790860     5919           6069
Exon 12  ENSE00003701932  31793438       31793553     8647           8762
If you do this in Excel, this results in (exon number, followed by the annotation line):
    < 1 33170 PAX6
14  1 5228 exon
13  5919 6069 exon
12  8647 8762 exon
11  8861 9011 exon
10  9241 9323 exon
9   9839 9997 exon
8   15900 16065 exon
7   16770 16985 exon
6   15900 16065 exon
4   17913 18043 exon
3   21621 21671 UTR
2   22058 22134 UTR
1   33018 33170 UTR
This information can also be obtained directly from Ensembl. Go to the genome
browser. Search for human PAX6. Select the gene summary and select ‘download sequences’ in the left
panel.
< 1 33170 PAX6
1 5142 UTR
5143 5228 exon
5919 6069 exon
8647 8762 exon
8861 9011 exon
9241 9323 exon
9839 9997 exon
15900 16065 exon
16770 16985 exon
17080 17121 exon
17913 18043 exon
21611 21620 exon
21621 21671 UTR
22058 22134 UTR
26037 26224 UTR
33018 33170 UTR
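An annotation file in the layout shown above (one gene line, then one ‘start end type’ line per feature) can be written from a list of features. This is a sketch that assumes only what the examples in this handout show: ‘<’ marks the gene on the -1 strand (human) and ‘>’ the gene on the +1 strand (mouse). `format_annotation` is a hypothetical helper name.

```python
def format_annotation(gene_name, length, strand, features):
    """Build annotation text: one gene line, then 'start end type' lines.

    strand: -1 gives '<', +1 gives '>' (as in the human and mouse examples).
    features: iterable of (start, end, kind) in file coordinates.
    """
    marker = "<" if strand < 0 else ">"
    lines = [f"{marker} 1 {length} {gene_name}"]
    for start, end, kind in sorted(features):
        if not 1 <= start <= end <= length:
            raise ValueError(f"feature {start}-{end} lies outside 1-{length}")
        lines.append(f"{start} {end} {kind}")
    return "\n".join(lines) + "\n"


# First three features of the human annotation, given in arbitrary order.
features = [(5143, 5228, "exon"), (1, 5142, "UTR"), (5919, 6069, "exon")]
text = format_annotation("PAX6", 33170, -1, features)
print(text, end="")
```

Sorting the features by start position and checking that they fall inside the sequence catches the most common spreadsheet mistakes, such as a row pasted twice or out of order.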
Download the orthologs
Repeat the complete flow to download the corresponding complete gene sequences of the orthologs, making use of their
ENSEMBL gene IDs (save them in separate FASTA files). For mouse, the gene ID is
ENSMUSG00000027168.
Check, by comparing with the gene sequence in the Ensembl browser, what
exactly you downloaded: indeed, your downloaded sequence starts with
the first exon (the gene is located on the +1 strand).
Download the annotation file (for Pax6, the featured strand and the forward strand give the same result).
Forward strand
> 1 28465 Pax6
1 150 UTR
10911 10990 UTR
11348 11398 UTR
11399 11408 exon
14929 15059 exon
15852 15893 exon
15988 16203 exon
16879 17044 exon
22666 22824 exon
23295 23377 exon
23571 23721 exon
23838 23953 exon
26407 26557 exon
27371 27456 exon
27457 28465 UTR
Featured strand
> 1 28465 Pax6
1 150 UTR
10911 10990 UTR
11348 11398 UTR
11399 11408 exon
14929 15059 exon
15852 15893 exon
15988 16203 exon
16879 17044 exon
22666 22824 exon
23295 23377 exon
23571 23721 exon
23838 23953 exon
26407 26557 exon
27371 27456 exon
27457 28465 UTR
Compare with the annotation you made yourself.
Download the gene structural info and save it in xls format.
Exon 1 (ENSMUSG00000027168, rank 1): genomic start 105668900, genomic end 105669049
Start in file: 105668900-105668900+1 = 1
End in file: 105669049-105668900+1 = 150
> 1 28465 Pax6
1 150 UTR
10911 10990 UTR
11348 11398 UTR
11399 11408 exon
14929 15059 exon
15852 15893 exon
15988 16203 exon
16879 17044 exon
22666 22824 exon
23295 23377 exon
23571 23721 exon
23838 23953 exon
26407 26557 exon
27371 27456 exon
ALIGNING THE SEQUENCES USING AVID





Go to the mVISTA tools of the VISTA genome browser (http://genome.lbl.gov/vista/index.shtml).
Align the sequences with LAGAN.
Specify the number of sequences you want to align, then press ‘Submit’.
Fill in your email address and provide the FASTA and annotation files, then press ‘Submit’.
Wait until you get an email with the results.
Before you use the downloaded FASTA files you have to adapt them: use a short header (otherwise the visualization is
messed up) and put a line break after the header and before the sequence (otherwise you do not have a correct FASTA file).
ENSMUSG00000027168_gene_unspliced_23012014_adapted.txt
ENSG00000007372_gene_unspliced_23012014_adapted.txt
Annotation files:
human_PAX6_annotatie_23012014.txt
mus_PAX6_annotatie_23012014.txt
Do not reverse complement the sequences.
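The FASTA clean-up described above (short header, line break between header and sequence) can be sketched as follows. The sketch assumes that the header is the first line of the downloaded text; `adapt_fasta` and the short name ‘human’ are hypothetical. The sequence itself is left unchanged, so nothing is reverse complemented.

```python
def adapt_fasta(raw, short_name, width=60):
    """Return FASTA text with a short header and the sequence on wrapped lines."""
    header, _, rest = raw.lstrip().partition("\n")
    if not header.startswith(">"):
        raise ValueError("input does not start with a FASTA header")
    sequence = "".join(rest.split())  # drop line breaks and stray whitespace
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{short_name}\n" + "\n".join(chunks) + "\n"


# Toy record with a long BioMart-style header (sequence is made up).
raw = ">ENSG00000007372|1|31784792|31817961\nACGT\nACGT\n"
adapted = adapt_fasta(raw, "human", width=5)
print(adapted, end="")
```

For a real file, read the download with `open(...).read()`, pass it through `adapt_fasta`, and write the result to the `_adapted.txt` file that is uploaded to mVISTA.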
View the pdf file in which all sequences are compared relative to the human sequence.
The blue boxes are the exons in the human sequence from the annotation file. Note the high similarity between the rat
and the mouse sequences. Even in remote organisms such as fugu and zebrafish some of the human exons are conserved.
Between rat, mouse and human, parts of the sequence in the region 1000 bp upstream of the first exon are conserved
as well. These might correspond to regulatory motifs responsible for transcriptional regulation.
The low similarity between the two zebrafish sequences (paralogs) can be attributed to the poor sequence quality of the
second zebrafish copy (genome assembly not complete yet) and the short sequence of fugu.
If you want more practice, you can also start from the gene ENSMUSG00000025190.
ALIGNING THE SEQUENCES USING VISTA GENOME BROWSER


Go to the VISTA genome browser website: http://genome.lbl.gov/vista/index.shtml
Go to the Precomputed Alignments.

Provide the proper coordinates for the human PAX6 gene (Chr11: 31,806,340-31,839,509) and Submit
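Before submitting, it can help to check that the region has the expected size. The sketch below parses the coordinate string and computes its length; `parse_region` is a hypothetical helper name.

```python
def parse_region(region):
    """Parse 'Chr11: 31,806,340-31,839,509' into (chromosome, start, end)."""
    chrom, _, span = region.partition(":")
    start_text, _, end_text = span.partition("-")
    start = int(start_text.replace(",", "").strip())
    end = int(end_text.replace(",", "").strip())
    return chrom.strip(), start, end


chrom, start, end = parse_region("Chr11: 31,806,340-31,839,509")
length = end - start + 1
print(chrom, start, end, length)  # Chr11 31806340 31839509 33170
```

The length, 33170, equals that of the sequence downloaded from Biomart, while both coordinates are shifted by the same amount (21548) relative to the Biomart gene start and end. That would be consistent with the browser using a different genome assembly, although this is an assumption and is not stated in the exercise.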
Compare with the previous results (note that we do not have the UTRs).