DNA Sequences Analysis
Hasan Alshahrani
CS6800

• Statistical background: HMMs.
• What is a DNA sequence?
• How to get a DNA sequence.
• DNA sequence formats.
• Analysis methods and tools.
• What is next?

HMMs
A hidden Markov model (HMM) is a very useful statistical model in molecular biology, although it first became widely known through speech recognition. An HMM can serve as a statistical profile of a sequence family (protein or DNA) and can then be used to search a database for other family members or similar sequences.

Q1: How can HMMs be used in DNA analysis?

To calculate the probability of the sequence ACTTCG, we multiply the probabilities, where each factor after the first is the conditional probability that a certain nucleotide appears in a position, given the nucleotide in the previous position:
P(ACTTCG...) = P1(A) * P2(C|A) * P3(T|C) * P4(T|T) * P5(C|T) * P6(G|C) * ...
More formally, the state of an HMM cannot be observed directly, but the hidden state q_t can be inferred from the random observation Y_t.

What is a DNA sequence?
• DNA consists of two long interwoven strands that form the famous "double helix". Each strand is built from a small set of molecules called nucleotides.
• The length of double-stranded DNA is often expressed in units of base pairs (bp), kilobase pairs (kb), or megabase pairs (Mb), so that, for example, a size of 5 × 10^6 bp can be written equivalently as 5000 kb or 5 Mb.
• Collectively, the chromosomes of the human genome (one copy of each) consist of approximately 3 × 10^9 bp of DNA; the 46 chromosomes in one human cell carry two copies, about 6 × 10^9 bp.

How to get a DNA sequence
• DNA sequencing uses chemical methods to determine the order of the nucleotide bases (adenine, guanine, cytosine, and thymine) in a molecule of DNA.
• It is used in many fields and applications, such as forensics and the study of biological systems.
• Why don't we use powerful text-searching algorithms and tools to search DNA databases?
DNA can be sequenced by a chemical procedure that breaks a terminally labelled DNA molecule partially at each repetition of a base.
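Returning to the probability calculation in the HMM section, the product of conditional probabilities for ACTTCG can be sketched in a few lines of Python. This is a minimal sketch of a plain first-order Markov chain, not a full HMM with hidden states. The names `INITIAL`, `TRANSITION`, and `sequence_probability` are hypothetical, the probability values are made up for illustration, and the sketch assumes the same transition table applies at every position (the slide's P1, P2, ... allow them to differ).

```python
# Illustrative (made-up) probabilities; real values would be estimated from data.
INITIAL = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# TRANSITION[prev][cur] = P(cur | prev); each row sums to 1.
TRANSITION = {
    "A": {"A": 0.10, "C": 0.40, "G": 0.30, "T": 0.20},
    "C": {"A": 0.20, "C": 0.20, "G": 0.50, "T": 0.10},
    "G": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "T": {"A": 0.10, "C": 0.30, "G": 0.20, "T": 0.40},
}


def sequence_probability(seq, initial, transition):
    """Return P(x1) * P(x2|x1) * ... * P(xn|x(n-1)) for a nucleotide string."""
    if not seq:
        return 1.0  # empty product
    prob = initial[seq[0]]
    # Walk over consecutive pairs (previous nucleotide, current nucleotide).
    for prev, cur in zip(seq, seq[1:]):
        prob *= transition[prev][cur]
    return prob


# P(A) * P(C|A) * P(T|C) * P(T|T) * P(C|T) * P(G|C)
print(f"{sequence_probability('ACTTCG', INITIAL, TRANSITION):.6f}")
```

For long sequences the product underflows quickly, so in practice the logarithms of the probabilities are usually summed instead.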
DNA sequencing can be done by different methods:
1. Maxam-Gilbert sequencing
2. Chain-termination methods
3. Dye-terminator sequencing
4. Automation and sample preparation
5. Large-scale sequencing strategies

Q2: Name four DNA sequencing methods.

Example: a chain-termination method
A DNA sequencing printout. The sequence is represented by a series of peaks, one for each nucleotide position. In this example, a red peak is an A, blue is a C, orange is a G, and green is a T.

DNA sequence formats:
• Plain sequence format
• EMBL format
• FASTA format
• GCG format
• GCG-RSF (rich sequence format)
• GenBank format
• IG format

FASTA format:
• FASTA is the standard format in bioinformatics for representing either nucleotide sequences or peptide sequences.
• The format uses single-letter codes and allows sequence names and comments.
• A FASTA record consists of a single-line description (starting with ">") followed by the sequence data on multiple lines.
• Each line of sequence should be no longer than 80 characters.
• Sequence identifiers follow conventions defined by NCBI.

Q3: What is FASTA format?

NCBI database:
• The National Center for Biotechnology Information (NCBI, www.ncbi.nlm.nih.gov) in the US maintains a huge collection of DNA and protein sequences.
• Each sequence in NCBI is stored in a separate record with a unique identifier called an accession.
• Example: by going to the NCBI website and using the accession NC_001477, we can retrieve the DNA sequence of the Dengue virus, which causes Dengue fever.

NCBI cont.
The database can be queried either directly from the website or with the R functions choosebank() and query().

Analysis methods
The analysis methods fall into five main groups:
• Knowledge-based single sequence analysis.
• Pairwise sequence comparison.
• Multiple sequence alignment.
• Sequence motif discovery in multiple alignments.
• Phylogenetic inference.

Q4: What are the main methods of DNA sequence analysis?
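The FASTA layout described in the formats section (one description line, then sequence data split over several lines) can be read and written with a short Python sketch. The function names `parse_fasta` and `format_fasta` and the toy records are hypothetical, chosen only for illustration; real projects would more likely rely on an existing library.

```python
import io

# Toy records, invented for this example (not real database entries).
EXAMPLE_FASTA = """>toy_seq_1 hypothetical example record
ACTTCG
GAATTC
>toy_seq_2
GATTA
"""


def parse_fasta(handle):
    """Yield (description, sequence) pairs from an iterable of FASTA lines."""
    description, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)


def format_fasta(description, sequence, width=80):
    """Return one FASTA record with sequence lines of at most `width` characters."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + description + "\n" + "\n".join(lines) + "\n"


for name, seq in parse_fasta(io.StringIO(EXAMPLE_FASTA)):
    print(name, len(seq))
```

The parser joins the sequence lines of each record into one string, so the line length of the input file does not matter; the 80-character convention is only applied when writing.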
Analysis methods: alignment
• Alignment: comparing a sequence with sequences that have already been reported and stored in a database.
• Alignment can be global or local.
• Local alignments reveal regions that are highly similar, but do not necessarily provide a comparison across the entire two sequences.
• The global approach compares one whole sequence with other entire sequences.

[Figure: alignment examples]

Alignment tools: BLAST
• The most common local alignment tool is BLAST (Basic Local Alignment Search Tool), developed by Altschul et al. (1990. J Mol Biol 215:403).
"BLAST is a set of algorithms that attempt to find a short fragment of a query sequence that aligns perfectly with a fragment of a subject sequence found in a database."
• The initial alignment must score above a neighborhood score threshold (T); the fragment is then used as a seed to extend the alignment in both directions. This means that the BLAST algorithm first breaks the query into short words of a specific length.
Joshua Naranjo

Q5: What is the BLAST algorithm? State its steps.

Can R help?
• Yes.
• It has many useful packages for processing DNA sequences.
• It can be used to access BLAST as well.

Examples: DNA sequence composition
1. GC fraction: GC content, the percentage of Gs and Cs in the sequence, is one of the fundamental properties of a genome sequence. It can be computed in two ways:
• The lengthy way is to count the Gs and Cs and divide by the length of the whole string.
• The other way is to use the function GC() from the R package SeqinR, which is the option taken here.
2. DNA words: this is the same idea as finding the frequency of single nucleotides such as A or G, but with longer words. Words can be 2 nucleotides long, such as "GC" or "CA", 3 nucleotides long, such as "AAA", 4 nucleotides long, and so on.
3. Global alignment: find the score of the optimal global alignment between the sequences 'GAATTC' and 'GATTA'.
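The three examples above (GC fraction, DNA words, and a global alignment score) can be sketched in plain Python rather than with the R packages named in the slides. The function names are hypothetical, and the scoring scheme (match +2, mismatch -1, each gap position -2) is an assumption made for this sketch, so the score it gives need not match what a particular R function returns with its own defaults. The alignment score uses the standard Needleman-Wunsch dynamic programming recurrence with a linear gap penalty.

```python
from collections import Counter


def gc_fraction(seq):
    """Fraction of G and C letters in the sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def word_counts(seq, k):
    """Count overlapping DNA words of length k."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Optimal global alignment score (Needleman-Wunsch, linear gap penalty)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                           prev[j] + gap,      # gap in b
                           cur[j - 1] + gap))  # gap in a
        prev = cur
    return prev[-1]


print(gc_fraction("GAATTC"))
print(word_counts("GAATTC", 2).most_common())
print(global_alignment_score("GAATTC", "GATTA"))  # 5 under the assumed scores
```

Only the previous row of the dynamic programming table is kept, which is enough when just the score is needed; recovering the alignment itself would require storing the full table or traceback pointers.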
Comparing two sequences using a dot plot (dotPlot() in SeqinR)

Is it that easy? No.
• It is not simply a matter of giving the sequences to R and getting the results.
• It is an art that requires a degree of skill.
• The sequences to be compared must be fitted to a form that reflects some shared quality, for example:
  - how they look structurally,
  - how they evolved from a common ancestor, or
  - the optimization of a mathematical construct.

What is next?
Are we monkeys?

References:
1. http://www.garlandscience.com/res/pdf/9780815365099_ch02.pdf
2. http://library.umac.mo/ebooks/b28050393.pdf
3. https://courses.cs.washington.edu/courses/cse527/00wi/lectures/roottr.pdf
4. http://www.lancaster.ac.uk/pg/nemeth/Hidden%20Markov%20Models%20with%20Applications%20to%20DNA%20Sequence%20Analysis.pdf
5. https://www.ndsu.edu/pubweb/~mcclean/plsc411/Blast-explanation-lecture-and-overhead.pdf
6. http://www.cs.ru.ac.za/research/g07V3343/deliverables%5CShort%20Paper%5CSpecies%20Identification%20through%20DNA%20String%20Analysis%20-%20Summary.pdf
7. http://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html
8. https://www.bioconductor.org/packages/3.3/bioc/vignettes/DECIPHER/inst/doc/ArtOfAlignmentInR.pdf