Detection of Transcription Factor Binding Sites
MICHAEL MORRA
CSE 4939W

Background
DNA is composed of four chemical bases:
Adenine – A
Thymine – T
Guanine – G
Cytosine – C
Image from: http://www.genetest.org/page5.html

Background (Continued)
Each individual organism has a unique DNA sequence.
The DNA sequence contains information that a cell can use to construct proteins.
Each set of instructions within this sequence is called a gene.
Image from: http://www.buzzle.com/articles/point-mutations.html

Transcription Factors
To regulate the expression of genes, proteins known as transcription factors are used.
Each transcription factor binds to the DNA sequence, turning a gene on or off.
Image from: http://www.cs.uiuc.edu/homes/sinhas/work.html

Binding Sites
The portions of the DNA where transcription factors are able to bind are known as binding sites.
The binding sites of a single transcription factor may vary in sequence.

Introduction
Detecting binding sites is important to understanding the regulatory network of an organism.
Because binding sites can vary considerably, searching for them within a DNA sequence is tedious.

Project
Implement a method that accurately and precisely locates transcription factor binding sites within a DNA sequence.
Data
4 species (Human, Mouse, Fruit Fly & Yeast)

Species      Transcription factors   Binding sites
Human        26                      300
Mouse        12                      98
Fruit Fly    6                       51
Yeast        8                       75

Multiple Sequence Alignment
To analyze the data effectively, each transcription factor's binding sites need to be aligned.
Alignment tool: http://www.ebi.ac.uk/Tools/clustalw2/index.html
Example input:
>s1
GACTTTTCGCT
>s2
CGATTTTCTCG
>s3
GCATTTTCCCA
>s4
AGAGAAAACCC
>s5
GAATAACCCAAGAGAAA
>s6
ACAGAAAAATC
>s7
CGAGAAAATCG
>s8
TGGTTTTCCCG
>s9
GGGTTTCTCCC

Scoring
Berg and von Hippel method
Score (form implied by the worked example): sum over j = 1..l of log10( (nj(tj) + 0.5) / (nj(0) + 0.5) )
l = length of the sequence to be scored
j = position in the sequence
tj = base at position j in the sequence to be scored
nj(b) = number of times base b occurs at position j in the alignment
nj(0) = number of times the most common base occurs at position j in the alignment

Scoring Example
Sequence to be scored: ACTCA

j   tj   nj(0)   nj(tj)   term
1   A    3       3        log(3.5/3.5) = log(1)
2   C    2       1        log(1.5/2.5)
3   T    2       2        log(2.5/2.5) = log(1)
4   C    2       1        log(1.5/2.5)
5   A    2       2        log(2.5/2.5) = log(1)

Score = log(1) + log(1.5/2.5) + log(1) + log(1.5/2.5) + log(1) = -0.443697499 (base-10 logarithm)

Leave One Out Cross Validation
To determine the effectiveness of the algorithm, a cross validation technique is used.
One binding site is left out when the multiple sequence alignment is performed, and that left-out sequence is then scored against the alignment.
If the algorithm is effective, the left-out sequence should score higher than the majority (>80-90%) of the other binding sites within that species.

Implementation
Language: C++
Input
  Multiple sequence alignment of a transcription factor's binding sites
  All binding sites of a species
Output
  Scores
  Results of leave one out cross validation

Desired Functionality
Handle cases where the sequence to be scored is longer or shorter than the multiple sequence alignment.
Slide the sequence over the alignment and take the highest-scoring portion.

Timeline
Oct 4th – Oct 18th: Create multiple sequence alignments for all transcription factors
Oct 18th – Nov 15th: Implement scoring algorithm in C++
Nov 15th – Nov 29th: Implement leave one out methods
Nov 29th – Dec 6th: Tweaks and improvements

Questions?
Image from: http://www.ideacenter.org/contentmgr/showdetails.php/id/954
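Appendix: scoring sketch
The scoring, sliding, and leave-one-out steps from the earlier slides could be sketched in C++ roughly as follows. This is a minimal sketch and not the project's actual code. All names (Counts, baseIndex, countColumns, scoreWindow, bestScore, fractionOutscored) are hypothetical. The 0.5 pseudocount and base-10 logarithm are inferred from the worked scoring example. The sliding convention (the shorter of sequence and alignment is kept fully inside the longer) is an assumption, and re-aligning without the held-out site is assumed to happen outside this code.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Per-column counts of A, C, G, T in an alignment (hypothetical type name).
using Counts = std::vector<std::array<int, 4>>;

// Map a base to 0..3; gaps and unknown symbols map to -1.
int baseIndex(char b) {
    switch (b) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// Count how often each base occurs in every alignment column.
Counts countColumns(const std::vector<std::string>& alignment) {
    Counts counts(alignment.empty() ? 0 : alignment.front().size(),
                  std::array<int, 4>{});
    for (const std::string& row : alignment) {
        assert(row.size() == counts.size());  // rows must have equal length
        for (std::size_t j = 0; j < row.size(); ++j) {
            int b = baseIndex(row[j]);
            if (b >= 0) ++counts[j][b];
        }
    }
    return counts;
}

// Sum of log10((n_j(t_j) + 0.5) / (n_j(0) + 0.5)) over len positions,
// pairing seq[seqStart + k] with alignment column colStart + k.
double scoreWindow(const Counts& counts, const std::string& seq,
                   std::size_t seqStart, std::size_t colStart,
                   std::size_t len) {
    double score = 0.0;
    for (std::size_t k = 0; k < len; ++k) {
        const std::array<int, 4>& col = counts[colStart + k];
        int b = baseIndex(seq[seqStart + k]);
        double observed = (b >= 0) ? col[b] : 0;
        double mostCommon = *std::max_element(col.begin(), col.end());
        score += std::log10((observed + 0.5) / (mostCommon + 0.5));
    }
    return score;
}

// Slide the shorter of (sequence, alignment) along the longer one and
// keep the highest-scoring placement (assumed sliding convention).
double bestScore(const Counts& counts, const std::string& seq) {
    std::size_t len = std::min(counts.size(), seq.size());
    std::size_t shifts = std::max(counts.size(), seq.size()) - len;
    bool seqLonger = seq.size() >= counts.size();
    double best = -std::numeric_limits<double>::infinity();
    for (std::size_t s = 0; s <= shifts; ++s) {
        double sc = seqLonger ? scoreWindow(counts, seq, s, 0, len)
                              : scoreWindow(counts, seq, 0, s, len);
        best = std::max(best, sc);
    }
    return best;
}

// Fraction of the other sites that the held-out site strictly outscores.
// counts must come from an alignment built without the held-out site.
double fractionOutscored(const Counts& counts, const std::string& heldOut,
                         const std::vector<std::string>& others) {
    if (others.empty()) return 0.0;
    double target = bestScore(counts, heldOut);
    std::size_t beaten = 0;
    for (const std::string& site : others)
        if (bestScore(counts, site) < target) ++beaten;
    return static_cast<double>(beaten) / others.size();
}
```

The alignment behind the worked example is not shown in the slides. A made-up alignment with the same per-column counts, {"AGTGA", "AGTGA", "ACACC", "GTCAG"}, reproduces it: bestScore on "ACTCA" gives about -0.4437. A shorter sequence contributes fewer terms to the sum, so scores for sequences of different lengths may not be directly comparable without some normalization.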