Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
GBIO009-1 - Bioinformatics Homework 1 and 2 review session Presented by Kirill Bessonov November 2012 ____________________________________________________________________________________________________________________ Kirill Bessonov 1 GBIO009-1 - Bioinformatics HW1: classical Q & A (GenomeGraphs) (1) • First two questions were on Bioconductor libraries. There are BioC 608 packages • To get citations on particular library use citation("library_name") • You were asked to get genomic data on specific gene library(GenomeGraphs) #download the whole database of Ensemble IDs ensembl_Human_Genes = useMart("ensembl",dataset="hsapiens_gene_ensembl"); #get info on gene form the database on the Ensemble ID gene <- makeGene(id = "ENSG00000115145", type="ensembl_gene_id", biomart = ensembl_Human_Genes ) #get info on transcript transcript <- makeTranscript(id = "ENSG00000115145", type="ensembl_gene_id", biomart= ensembl_Human_Genes) gdPlot ( list("gene"=gene, "transcripts"=transcript)) #retrieve info from the database displaying first 25 entries getBM(c("ensembl_gene_id", "hgnc_symbol", "description"), filter=c("with_exon_transcript", "with_protein_id", "with_transcript_variation"),values=list(TRUE, TRUE, TRUE), ensembl_Human_Genes )[1:25,] ____________________________________________________________________________________________________________________ Kirill Bessonov 2 GBIO009-1 - Bioinformatics HW1: classical Q & A (GenomeGraphs) (2) • What is the gene name (i.e. hgnc_symbol) and function represented by the Ensembl ID - ENSG00000115145? geneInfo=getBM(c("ensembl_gene_id", "hgnc_symbol", "description"), filter=c("with_exon_transcript", "with_protein_id", "with_transcript_variation"),values=list(TRUE, TRUE, TRUE), ensembl_Human_Genes ) > geneInfo[geneInfo$ensembl_gene_id == "ENSG00000115145", ] ensembl_gene_id hgnc_symbol description 4829 ENSG00000115145 STAM2 signal transducing adaptor molecule (SH3 domain and ITAM motif) 2 • How many exons does the ensemble id ENSG00000115145 has? 51 exons attr(gene, "ens") ensembl_gene_id ensembl_transcript_id ensembl_exon_id exon_chrom_start exon_chrom_end rank strand biotype 1 ENSG00000115145 ENST00000263904 ENSE00001351655 153032117 153032506 1 -1 protein_coding 2 ENSG00000115145 ENST00000263904 ENSE00002888710 153006659 153006743 2 -1 protein_coding …… 48 ENSG00000115145 ENST00000494589 ENSE00002785037 153004538 153004636 3 -1 protein_coding 49 ENSG00000115145 ENST00000494589 ENSE00002808134 153003676 153003822 4 -1 protein_coding 50 ENSG00000115145 ENST00000494589 ENSE00002929781 153001402 153001471 5 -1 protein_coding 51 ENSG00000115145 ENST00000494589 ENSE00001828491 153000503 153000527 6 -1 protein_coding ____________________________________________________________________________________________________________________ Kirill Bessonov 3 GBIO009-1 - Bioinformatics HW1: classical Q & A (GenomeGraphs) (3) • Execute the following command. How many chromosomes do you see? 25 chromosomes. 22 autosomal pairs, 1 sex pair and one mitochondrial chromosome • Why the number of chromosomes in this Ensembl dataset is greater than 23 chromosome pairs? What does “MT”, “X” and “Y” refer to? Because of the MT chromosome, since X and Y can be grouped to a single pair > getBM("chromosome_name","","", ensembl_Human_Genes)[c(1:22,433:435),1] [1] "1" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2" "20" "21" "22" "3" "4" "5" "6" "7" "8" "9" "MT" "X" "Y" ____________________________________________________________________________________________________________________ Kirill Bessonov 4 GBIO009-1 - Bioinformatics HW2: Pairwise alignments (classical Q&A) ____________________________________________________________________________________________________________________ Kirill Bessonov 5 GBIO009-1 - Bioinformatics HW2: Pairwise alignments (classical Q&A) Q1 • • Please align globally using Needleman–Wunsch algorithm the following DNA sequences. Use The following scoring rules: a) gap -5; b) match between two bases +5; c) mismatch between two bases +3; ____________________________________________________________________________________________________________________ Kirill Bessonov 6 GBIO009-1 - Bioinformatics HW2: Pairwise alignments (classical Q&A) Q3 • Do local protein alignment using BLOSUM 62 matrix on the HEAGAWGHEE and PAWHAE sequence. The scoring rules are a) gap -8; matches and mismatches are given in BLOSUM 62 matrix. ____________________________________________________________________________________________________________________ Kirill Bessonov 7 GBIO009-1 - Bioinformatics HW2: Pairwise alignments (classical Q&A) Q5 Produce a dot plot of Human and Mouse p53 proteins from previous question and paste the plot below. Complete the lines of R code to get the dot plot. Are both proteins similar? Yes, very similar since we see clear diagonal corresponding to >90% of sequences length Where is/are the region(s) of greatest variation occur? Between 50-100 ____________________________________________________________________________________________________________________ Kirill Bessonov 8 GBIO009-1 - Bioinformatics HW2: Pairwise alignments (classical Q&A) Q7 • What global alignment score do you get for the two p53 proteins, when you use the BLOSUM62 alignment matrix, a gap opening penalty of -10 and a gap extension penalty of 0.5? Answer: score of 1556 query("p53_HUMAN", "AC=P04637"); p53_HUMAN_seq = getSequence(p53_HUMAN); query("p53_MOUSE", "AC=P02340"); p53_MOUSE_seq = getSequence(p53_MOUSE); globalAlign <- pairwiseAlignment(p53_HUMAN_seq, p53_MOUSE_seq, substitutionMatrix = "BLOSUM62", gapOpening = -10, gapExtension = -0.5) • Errors: the R-code was not stated and the ID of proteins were not given such as Uniprot ID P04637 ____________________________________________________________________________________________________________________ Kirill Bessonov 9 GBIO009-1 - Bioinformatics HW2: Computer Style Implementation of NW algorithm in R ____________________________________________________________________________________________________________________ Kirill Bessonov 10 GBIO009-1 - Bioinformatics HW2: Computer style (NW algorithm) [1] • Given the pseudo-code implement NW algorithm in R – Algorithm has two parts • Calculation of the alignment F-matrix • Finding the optimal path(s) through the matrix for to length(A) d = gap penalty score F(i,0) ← d*i for j=0 to length(B) i and j = positions in A F(0,j) ← d*j for i=1 to length(A){ for j=1 to length(B) { Match ← F(i-1,j-1) + S(Ai, Bj) Delete ← F(i-1, j) + d Insert ← F(i, j-1) + d F(i,j) ← max(Match, Insert, Delete) } } & B sequences ____________________________________________________________________________________________________________________ Kirill Bessonov 11 GBIO009-1 - Bioinformatics HW2: Computer style (NW algorithm) [2] Fmatrix = function(A,B){ fmatrix = matrix(0, nrow = (nchar(A)+1) , ncol = nchar(B)+1) d = -8 #this is gap penalty for(i in 0 : nchar(A)){ fmatrix[i+1,1] = d * i #populates initial row with gap penalty } for(j in 0 : nchar(B)){ fmatrix[1,j+1] = d * i } for(i in 1 : nchar(A)){ for(j in 1 : nchar(B)) { score = rules(A,B) #get me sccore for the pair of aa or nt match = fmatrix[i,j] + score delete = fmatrix[i,j+1] + d insert = fmatrix[i+1,j] + d fmatrix[i+1,j+1] = max(match,delete,insert) } } colnames(fmatrix) = strsplit( paste(" " , B, sep=""), "")[[1]]; rownames(fmatrix) = strsplit( paste(" " , A, sep=""), "")[[1]]; return(fmatrix) } ____________________________________________________________________________________________________________________ Kirill Bessonov 12 GBIO009-1 - Bioinformatics HW2: Computer style (NW algorithm) [3] rules = function(A,B){ s.matrix <- matrix(rep(0,16), nrow = 4, ncol=4, byrow=TRUE, dimnames = list(c("A","C","G","T"),c("A","C","T","G"))) s.matrix["A",] = c(2,-1,-1,-1) s.matrix["C",] = c(-1,2,-1,-1) s.matrix["T",] = c(-1,-1,2,-1) s.matrix["G",] = c(-1,-1,-1,2) } > s.matrix A C T G A 2 -1 -1 -1 C -1 2 -1 -1 G -1 -1 2 -1 T -1 -1 -1 2 ____________________________________________________________________________________________________________________ Kirill Bessonov 13 GBIO009-1 - Bioinformatics HW2: Computer style (NW algorithm) [4] • Check the F-matrix fmatrix=Fmatrix("ATCG", "TG") T G -32 -32 -32 A -8 -16 -24 T -16 -6 -14 C -24 -14 -4 G -32 -22 -12 • Start finding the optimal path(s) through the matrix AlignmentA = "" AlignmentB = "" i = nchar(A) + 1 j = nchar(B) + 1 while(i > 1 && j > 1){ CurrentScore = fmatrix[i,j] #get score at current position of F-matrix ScoreDiag = fmatrix[i - 1, j - 1] ScoreUp = fmatrix[i, j - 1] what is around that F-matrix cell? ScoreLeft = fmatrix[i - 1, j] ____________________________________________________________________________________________________________________ Kirill Bessonov 14 GBIO009-1 - Bioinformatics HW1: Computer style (NW algorithm) [5] • Selecting the bottom right cell and starting to trace-back the path of optimal alignment AlignmentA = "" AlignmentB = "" Which cell of the F-matrix I am now? while(i > 1 && j > 1){ CurrentScore = fmatrix[i,j] ScoreDiag = fmatrix[i - 1, j - 1] ScoreUp = fmatrix[i, j - 1] ScoreLeft = fmatrix[i - 1, j] On diagonal path: previous + next cell #considering the score came from diagonal if (CurrentScore == ScoreDiag + s.matrix[substr(A,i,i), substr(B,j,j)) ){ AlignmentA = paste(substr(A,i-1,i-1),AlignmentA, sep = "") AlignmentB = paste(substr(B,j-1,j-1),AlignmentB, sep = "") i = i - 1 j = j - 1 } ____________________________________________________________________________________________________________________ Kirill Bessonov 15 GBIO009-1 - Bioinformatics HW2: Computer style (NW algorithm) [6] #considering if the score comes from left (introducing a gap) else if(CurrentScore == ScoreLeft + d){ AlignmentA = paste(substr(A,i-1,i-1),AlignmentA, sep = "") AlignmentB = paste( "-", AlignmentB, sep = "") i = i - 1 } #considering if the score comes from upper cell (introducing a gap) else if(CurrentScore == ScoreUp + d) { AlignmentA = paste( "-", AlignmentA, sep = "") AlignmentB = paste(substr(B,j-1,j-1), AlignmentB, sep = "") j = j – 1 } print(AlignmentA) print(AlignmentB) finalScore = cat("Final score :",fmatrix[(nchar(A)+1),(nchar(B)+1)]) ____________________________________________________________________________________________________________________ Kirill Bessonov 16 GBIO009-1 - Bioinformatics HW2: Computer style (NW algorithm) [7] • The scoring matrices could have been accessed though character indices not requiring conversion and making code faster • How one would output more than one BEST possible alignments? • Please use more comments in your R-code • Would be nice to see trace-backs visually • Also the scoring rules were not stated clearly ____________________________________________________________________________________________________________________ Kirill Bessonov 17