Download Biomart/ GENOME ALIGNMENT III

Exercises databases Bioinformatics

BIOMART/ GENOME ALIGNMENT III

CONTENTS
Biomart/ GENOME ALIGNMENT
Introduction
Downloading the sequences
Aligning the sequences using AVID
Aligning the sequences using VISTA genome browser

INTRODUCTION

The comparison of the mouse and human genomes has demonstrated the power of comparative genomics in inferring the evolutionary history of species and in identifying functional regions in genomes. The possibilities for identifying regions under selection are enhanced with the addition of more sequences, and this observation has led to numerous ‘focused sequencing’ projects which seek to obtain sequence for a small region of a genome in numerous other organisms.

Biologists who seek to analyze conserved regions among homologous sequences are faced with the daunting task of aligning large genomic regions and subsequently sifting through massive amounts of data. In order to facilitate the discovery process without requiring biologists to download and install complex software, a number of web servers for alignment and analysis have been set up in recent years. These servers align submitted sequences and then generate plots or graphs designed to help researchers identify conserved regions.

AVID is a progressive alignment program.
The program works by recursively aligning the ‘alignments’ at ancestral nodes of the guide tree. At each internal node, ancestral sequences are inferred from the existing alignments using maximum likelihood, and these alignments are then aligned using the AVID program.

The server goes through a number of steps:

1. Sequences are repeat-masked using the DUST program (Tatusov and Lipman, unpublished).
2. A random (almost complete) binary guide tree is generated for alignment of the sequences using the progressive alignment method.
3. The sequences are aligned using AVID.
4. A phylogenetic tree is inferred from the multiple alignment using the neighbor joining method.
5. Steps 3 and 4 are repeated for a total of three iterations.
6. Pairwise alignments are generated from the multiple alignment with respect to all of the sequences, and these are used to generate conservation plots and to identify conserved regions.

In this exercise we will align orthologs of the PAX6 gene:

1) First, we download the sequences of the PAX6 orthologs using BioMart.
2) Subsequently, the orthologs are aligned using AVID/LAGAN.

DOWNLOADING THE SEQUENCES/ BIOMART

- Go to BioMart: http://www.ensembl.org/biomart/martview/f0e53c7f00c3cded9dff7e1e22d391dc (a tutorial is available at http://www.ensembl.org/info/website/tutorials/index.html).
- Choose a database (“Ensembl Genes 78”).
- Choose a dataset (“Homo sapiens genes (GRCh38)”).
- Go to the Filters section. Select the gene of interest by its Ensembl gene ID (ENSG00000007372) in the human genome.
- Move to the Attributes section and select the attributes you want to download. You might want to select the position and the strand of the gene on the genome. Also select a gene identifier (e.g. HGNC ID) and, most importantly, the Ensembl gene ID of the original gene.
- Press the Results button to see your selection.
- Export the requested data as a tab-delimited text file. How many PAX6 gene transcripts were selected?
- Go to the Attributes section and select the homologous genes in chicken, fugu and mouse. This information can be used to download the corresponding sequences.
- Go to the Attributes section and select ‘Sequences’. You can select different parts of the sequence: either the protein sequence (introns spliced out, only the translated part is downloaded) or the entire gene (introns and exons included, ignoring the transcript information) together with the 5' and 3' ends. Try both. How many sequences do you get when you download the peptide? How many when you download the gene? Explain.

You can export the data from the Export data section in the left panel. Download the gene (genomic sequence, unspliced gene). Do not forget to also export ‘header’ information such as the strand, the exon positions and the gene start and end, and save the result as a text file (FASTA).

To annotate the file we will also download the gene's structural information (the start positions and the order of the exons and introns). Save it in Excel format. This information will be used to build an annotation file for our sequence file.

To check what exactly we downloaded, go to the Ensembl gene page for PAX6 and view the sequence information. Use the find function to locate the beginning (search for CCCTCTTTTCTTATCA) and the end of your downloaded sequence in the displayed sequence. This shows that the sequence you downloaded starts with the end of the last exon and ends with the beginning of the first exon (as the gene is located on the -1 strand). So the downloaded sequence is not reverse complemented.
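The orientation check described above can also be done programmatically instead of with the browser's find function. The following is a minimal Python sketch, not part of the original exercise: the function names are illustrative, and the "displayed" sequence is a short made-up string standing in for the sequence shown by Ensembl. Whether the probe is found as typed or as its reverse complement depends on the orientation in which the browser displays the gene.

```python
from typing import Optional, Tuple


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T/N, any case)."""
    complement = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(complement)[::-1]


def locate_probe(probe: str, displayed: str) -> Optional[Tuple[str, int]]:
    """Find `probe` in `displayed`, as typed or as its reverse complement.

    Returns (orientation, 1-based position), or None if the probe is absent.
    """
    probe, displayed = probe.upper(), displayed.upper()
    pos = displayed.find(probe)
    if pos != -1:
        return ("forward", pos + 1)
    pos = displayed.find(reverse_complement(probe))
    if pos != -1:
        return ("reverse", pos + 1)
    return None


# Toy stand-in for the displayed sequence (NOT real PAX6 sequence):
# six arbitrary bases, the reverse complement of the probe, six more bases.
toy_displayed = "GGATCC" + "TGATAAGAAAAGAGGG" + "TTTAAA"

print(locate_probe("CCCTCTTTTCTTATCA", toy_displayed))
```

In practice you would paste the first bases of the downloaded FASTA file as the probe and the sequence copied from the gene page as the displayed sequence.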
Looking at the header information of the PAX6 gene downloaded from BioMart, we see that the gene start is 31784792 and the gene end is 31817961:

>ENSG00000007372|1|31784792|31817961|31801776;31812926;31806013;31793802;31802834;31800856;31806462;31817948;31790019;31806925;31793553;31794788;31794114;31790860;31806921;31811677;31812183;31803673;31811015;31801912;31811118;31811137;31804046;31811331;31811045;31811237;31802971;31791309;31817961;31803333;31812177;31804619;31800646;31817937;31804025;31810667;31811308;31794126;31801335;31804044;31810305;31817874|31801561;31812572;31805389;31793652;31802729;31800763;31806344;31817809;31789936;31806849;31802704;31793438;31806402;31794630;31794032;31790710;31801728;31811115;31788910;31784792;31800691;31812093;31803398;31810828;31801869;31801871;31794780;31811213;31800707;31801762;31789830;31804452;31801578;31800539;31789182;31789918;31794098;31801617;31789917;31801745;31803952;31789922;31801230;31793173;31793483;31809906;31806406;31789913;31793674

We know from the structural information that our downloaded sequence starts at the end of the last exon (exon 14, position 31784792) and ends with the start of the first exon (exon 1, position 31817961). These values can be obtained from the structural annotation file. So the beginning and the end of the downloaded sequence correspond to the end of exon 14 and the beginning of exon 1, respectively (as we observed previously).

So to annotate the positions of the exons:

- Position 1 in our downloaded file corresponds to genomic position 31784792.
- Position 33170 in our downloaded file corresponds to genomic position 31817961 (31817961-31784792+1 = 33170).

This information will be used to annotate the positions of the exons on our downloaded file (see below).

AVID

You might wish to add the annotation (positions of the exons and introns) to the multiple alignment you are going to make in AVID. You will have to construct a gff file with the essential annotation.
To construct this file you need the exon positions. Each genomic position is converted to a position in the downloaded file by subtracting the gene start (31784792) and adding 1:

Exon     Exon ID          Genomic start  Genomic end  Start in file                End in file
Exon 1   ENSE00001479873  31817809       31817961     31817809-31784792+1 = 33018  31817961-31784792+1 = 33170
Exon 14  ENSE00001213516  31784792       31790019     31784792-31784792+1 = 1      31790019-31784792+1 = 5228
Exon 13  ENSE00003700637  31790710       31790860     31790710-31784792+1 = 5919   31790860-31784792+1 = 6069
Exon 12  ENSE00003701932  31793438       31793553     31793438-31784792+1 = 8647   31793553-31784792+1 = 8762

If you do this in Excel, this results in:

< 1 33170 PAX6

Exon  Start  End    Type
14    1      5228   exon
13    5919   6069   exon
12    8647   8762   exon
11    8861   9011   exon
10    9241   9323   exon
9     9839   9997   exon
8     15900  16065  exon
7     16770  16985  exon
6     15900  16065  exon
4     17913  18043  exon
3     21621  21671  UTR
2     22058  22134  UTR
1     33018  33170  UTR

This information can also be obtained directly from Ensembl. Go to the genome browser and search for human PAX6. Select the gene summary and choose ‘download sequences’ in the left panel.

< 1 33170 PAX6
1 5142 UTR
5143 5228 exon
5919 6069 exon
8647 8762 exon
8861 9011 exon
9241 9323 exon
9839 9997 exon
15900 16065 exon
16770 16985 exon
17080 17121 exon
17913 18043 exon
21611 21620 exon
21621 21671 UTR
22058 22134 UTR
26037 26224 UTR
33018 33170 UTR

Download the orthologs

Repeat the complete flow to download the corresponding complete gene sequences of the orthologs, making use of their Ensembl gene IDs (save them in separate FASTA files).
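When repeating the flow for several genes, the conversion from genomic coordinates to positions in the downloaded file (genomic position minus gene start plus one) can be scripted rather than done in Excel. Below is a minimal Python sketch using the human PAX6 values from this exercise. The function names and the layout of the feature list are illustrative assumptions, only four of the exons are included, and the feature types follow the Excel-derived annotation.

```python
GENE_START = 31784792  # human PAX6 gene start (from the BioMart header)
GENE_END = 31817961    # human PAX6 gene end (from the BioMart header)

# (genomic start, genomic end, feature type) for a few exons of the exercise
FEATURES = [
    (31817809, 31817961, "UTR"),   # exon 1
    (31784792, 31790019, "exon"),  # exon 14
    (31790710, 31790860, "exon"),  # exon 13
    (31793438, 31793553, "exon"),  # exon 12
]


def to_local(genomic_pos, gene_start):
    """Convert a genomic coordinate to a 1-based position in the downloaded file."""
    return genomic_pos - gene_start + 1


def annotation_lines(gene_name, gene_start, gene_end, features, strand="<"):
    """Build annotation lines in the layout used in this exercise.

    The first line holds the strand symbol, the gene extent and the gene name;
    each following line holds start, end and type of one feature.
    """
    lines = [f"{strand} 1 {to_local(gene_end, gene_start)} {gene_name}"]
    for start, end, kind in sorted(features):  # sort by genomic start
        lines.append(
            f"{to_local(start, gene_start)} {to_local(end, gene_start)} {kind}"
        )
    return lines


for line in annotation_lines("PAX6", GENE_START, GENE_END, FEATURES):
    print(line)
```

This conversion only applies when the downloaded sequence is not reverse complemented, as is the case in this exercise. For a gene on the forward strand, pass ">" as the strand symbol.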
Mouse ortholog: ENSMUSG00000027168

Check, by comparing with the gene sequence in the Ensembl browser, what exactly you downloaded: your downloaded sequence indeed starts with the first exon (the gene is located on the +1 strand).

Download the annotation file (for Pax6, the featured strand and the forward strand give the same result).

Forward strand:

> 1 28465 Pax6
1 150 UTR
10911 10990 UTR
11348 11398 UTR
11399 11408 exon
14929 15059 exon
15852 15893 exon
15988 16203 exon
16879 17044 exon
22666 22824 exon
23295 23377 exon
23571 23721 exon
23838 23953 exon
26407 26557 exon
27371 27456 exon
27457 28465 UTR

Featured strand:

> 1 28465 Pax6
1 150 UTR
10911 10990 UTR
11348 11398 UTR
11399 11408 exon
14929 15059 exon
15852 15893 exon
15988 16203 exon
16879 17044 exon
22666 22824 exon
23295 23377 exon
23571 23721 exon
23838 23953 exon
26407 26557 exon
27371 27456 exon
27457 28465 UTR

Compare this with the annotation you make yourself. Download the gene structural information and save it as an xls file. For exon 1 of ENSMUSG00000027168 (genomic start 105668900, genomic end 105669049), the start in the downloaded file is 105668900-105668900+1 = 1 and the end is 105669049-105668900+1 = 150. This results in:

> 1 28465 Pax6
1 150 UTR
10911 10990 UTR
11348 11398 UTR
11399 11408 exon
14929 15059 exon
15852 15893 exon
15988 16203 exon
16879 17044 exon
22666 22824 exon
23295 23377 exon
23571 23721 exon
23838 23953 exon
26407 26557 exon
27371 27456 exon

ALIGNING THE SEQUENCES USING AVID

- Go to the mVISTA tools of the VISTA genome browser (http://genome.lbl.gov/vista/index.shtml).
- Align the sequences with LAGAN.
- Specify the number of sequences you want to align, then press ‘Submit’.
- Fill in your email address and provide the FASTA and annotation files, then press ‘Submit’.
- Wait until you get an email with the results.

Before you use the downloaded FASTA files you have to adapt them: give each file a short header (otherwise the visualization is messed up) and make sure there is a line break after the header and before the sequence (otherwise the file is not a correct FASTA file).
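The adaptation of the FASTA files can be done by hand in a text editor, but it can also be scripted. The following is a minimal Python sketch under stated assumptions: the file contains exactly one record, the header occupies the first line, and the function name `adapt_fasta` and the short name are illustrative choices, not part of the exercise. It replaces the long BioMart header with a short one, normalizes the line breaks so that the header is followed by a proper newline, and rewraps the sequence.

```python
def adapt_fasta(text, short_name, width=60):
    """Return a single-record FASTA text with a short header.

    Line endings (\\n, \\r\\n or \\r) are normalized, blank lines are dropped
    and the sequence is rewrapped to `width` characters per line.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA record starting with '>'")
    if any(ln.startswith(">") for ln in lines[1:]):
        raise ValueError("expected exactly one FASTA record")
    sequence = "".join(lines[1:])
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + short_name] + wrapped) + "\n"


# Toy record with a long header and Windows line endings (made-up sequence).
raw = ">ENSG00000007372|1|31784792|31817961\r\nACGTACGTAC\r\nGGTT\r\n"
print(adapt_fasta(raw, "human_PAX6", width=8))
```

For a real file, read its contents into a string, pass it through the function, and write the result to a new file, so that the original download stays untouched.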
Sequence files:
- ENSMUSG00000027168_gene_unspliced_23012014_adapted.txt
- ENSG00000007372_gene_unspliced_23012014_adapted.txt

Annotation files:
- human_PAX6_annotatie_23012014.txt
- mus_PAX6_annotatie_23012014.txt

Do not reverse complement the sequences.

View the PDF file in which all sequences are compared relative to the human sequence. The blue boxes are the exons in the human sequence from the annotation file. Note the high homology between the rat and the mouse sequence. Even in remote organisms such as fugu and zebrafish some of the human exons are conserved. Between rat, mouse and human, parts of the sequence in the region 1000 bp upstream of the first exon are conserved as well. These might correspond to regulatory motifs responsible for transcriptional regulation. The low homology between the two zebrafish sequences (paralogs) can be attributed to the bad sequence quality of the second zebrafish copy (genome assembly not complete yet) and the short sequence of fugu.

For additional practice, you can also start from the gene ENSMUSG00000025190.

ALIGNING THE SEQUENCES USING VISTA GENOME BROWSER

- Go to the VISTA genome browser website: http://genome.lbl.gov/vista/index.shtml
- Go to the Precomputed Alignments.
- Provide the proper coordinates for the human PAX6 gene (Chr11: 31,806,340-31,839,509) and press Submit.
- Compare with the previous results (note that we do not have the UTRs).
