EE150a – Genomic Signal and Information Processing
• Seminar series
  – lectures at the first 3 meetings, followed by student presentations
  – statistical signal processing basics
  – background reading for each meeting
• Location: Moore 080 (except today)
• List of papers with links: www.its.caltech.edu/~hvikalo/gsip.html
  – minor modifications of the list are likely
• Contact: Haris Vikalo, Moore 125
  – Phone: 395-4184
  – E-mail: [email protected]
• Check the website occasionally for updates and a growing list of research-related links
• Today’s handouts:
  – basic course info and a list of papers
  – R. Karp’s “Mathematical Challenges from Genomics and Molecular Biology”
  – sign-up sheet
• Next time: Prof. Vaidyanathan’s lecture on “Signal Processing Problems in Genomics”
• In two weeks: lecture on DNA microarray technology and novel estimation techniques of gene expression levels
• Today: introduction with a brief overview of the topics for presentation

Central Dogma of Molecular Biology
• Flow of information in a cell: DNA → RNA → protein
• [Due to Francis Crick. It has recently been realized that the dogma requires modifications, but more about that later in the course.]
• Recent development of high-throughput technologies that study the above flow
  – requires interdisciplinary effort
  – dealing with a huge amount of information

DNA Structure
• Four nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T)
• Bindings:
  – A with T (weaker), C with G (stronger)
• Forms a double helix
  – nucleotides within each strand are linked via sugar-phosphate bonds (strong); the two strands are linked via hydrogen bonds (weak)
• Genes are the parts of the DNA that encode proteins:
  – …AACTCGCATCGAACTCTAAGTC…
[Image: genetics.gsk.com/graphics/dna-big.gif]

Sidenote: Sequence Alignment
• Perhaps the most fundamental operation in bioinformatics
  – used to decide if two genes or proteins are related by function, structure, or evolutionary history
  – can identify patterns of conservation and variability
• Performs pairwise matching between the characters of each sequence
• One place where it is useful: SNP (single-nucleotide polymorphism) detection
  – SNPs may indicate the development of a disease (myocardial diseases, arthritis, etc. have been associated with SNPs)
• Sequence alignment is the first student presentation topic in the series (HMM, dynamic programming, Bayesian methods)

Details of the information flow
• Replication of DNA
  – {A,C,G,T} to {A,C,G,T}
• Transcription of DNA to mRNA
  – {A,C,G,T} to {A,C,G,U}
• Translation of mRNA to proteins
  – {A,C,G,U} to {20 amino acids}
[Image: http://www-stat.stanford.edu/~susan/courses/s166/central.gif]

Genes can be turned on and off

Microarray Technology
• A medium for matching known and unknown DNA samples based on hybridization (base-pairing)
• Two major applications
  – identification of a sequence (gene or gene mutation)
  – determination of expression level (abundance) of genes
• Enables massively parallel gene expression studies
• Two types of molecules take part in the experiments:
  – probes, arranged in an orderly fashion on an array
  – targets, the unknown samples to be detected

Types of Microarrays
• “Traditionally”, there are two formats:
  – probe cDNA immobilized to a solid surface using robot spotting and exposed to a set of targets, and
  – an array of oligonucleotide probes synthesized on chip (via, e.g., photolithography)
• Targets are typically fluorescently labeled cDNA molecules obtained from mRNA samples
  – hybridize to their complementary probes
  – image readout

Illustration: DNA microarray
[Image: http://pcf1.chembio.ntnu.no/~bka/images/MicroArrays.jpg]

Sample Microarray Readout

Some Design Issues
• Hybridization is the binding of a target to its perfect complement
• However, when a probe differs from the target's perfect complement by a small number of bases, the target may still bind
• This non-specific binding (cross-hybridization) is a source of measurement noise
• In special cases (e.g., arrays for gene detection), the designer has a lot of control over the landscape of the probes on the array
• The second topic for presentations considers a combinatorial design of such arrays
• [How to deal with cross-hybridization on arrays used for expression level measurements is the topic of the third lecture.]
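
The hybridization and cross-hybridization bullets above can be illustrated with a small sketch. It uses only the pairing rules from the DNA Structure slide (A with T, C with G). The function names, the equal-length restriction, and the `max_mismatches` threshold are assumptions made for illustration; real binding depends on thermodynamics, not just on a mismatch count.

```python
# Watson-Crick pairing as listed on the DNA Structure slide: A-T, C-G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the strand that would hybridize perfectly to `seq`."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def mismatches(probe: str, target: str) -> int:
    """Count positions where `target` differs from the probe's perfect complement.

    Assumption: both sequences have equal length and are written 5'->3'.
    """
    if len(probe) != len(target):
        raise ValueError("probe and target must have equal length")
    perfect = reverse_complement(probe)
    return sum(a != b for a, b in zip(perfect, target))


def may_bind(probe: str, target: str, max_mismatches: int = 1) -> bool:
    """Crude binding model (hypothetical): bind if few enough bases mismatch."""
    return mismatches(probe, target) <= max_mismatches


probe = "AACTCG"
print(reverse_complement(probe))       # perfect target: CGAGTT
print(mismatches(probe, "CGAGTA"))     # one base off: 1
print(may_bind(probe, "CGAGTA"))       # cross-hybridization under this model
```

Under this toy model, a target with a single mismatched base still counts as bound, which is exactly the kind of non-specific signal that shows up as measurement noise.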
Clustering Gene Expression Profiles
• Microarrays measure expression levels of thousands of genes simultaneously
• For instance, we might take samples at different times during a biological process
• Cluster data in the expression level space
  – relatedness in biological function often implies similarity in expression behavior (and vice versa)
  – similar expression behavior indicates co-expression
• Clustering of expression level data is one of the topics (traditional statistical methods, but also graph-theoretic and information-theoretic approaches, etc.)

Example of Clustering
• Rows: expression levels of various genes
• Columns: time progression
• So-called hierarchical clustering
[Image: http://www.genomatix.de/gif/node43_documentation.gif]

Co-regulated genes
• Co-expressed genes may be co-regulated
  – a combination of transcription factors (activating or repressing proteins) regulates genes jointly
• Finding binding sites (control regions) of co-regulated genes is another topic
• HMM, probabilistic methods (EM, Gibbs sampling)

Genetic Regulatory Networks
• Proteins take part in gene regulation
  – feedback loop in the Central Dogma information flow
• Thus, to fully understand gene regulation, we need to consider interactions
  – DNA, RNA, proteins, small molecules
• Requires a network formalism
  – directed graphs, Boolean networks, Bayesian networks, differential equations, etc.
• Explore some of these models in the gene regulation context

An Illustration of a Regulatory Network

Protein Translation/Folding
• [Should time permit.]
• The sequence-structure relationship will play a very important role in the postgenomic era
  – potentially great impact on genetics, pharmaceutical chemistry, and protein design
  – diseases such as Alzheimer’s are believed to be related to protein misfolding
• Computationally very hard
  – parallel, distributed computing

Genomic data fusion
• Consider the problem of classifying a protein and assume that we know:
  – the original gene sequence encoding the protein
  – gene expression levels
  – some of the protein-protein interactions
• Question: how to combine the various types of data to classify the protein
• The last (right now…) topic of the seminar will be data fusion of the various genomic data listed above
  – an efficient statistical learning algorithm based on convex optimization

Summary
• Trying to understand gene regulation
• Recent technologies have revolutionized research
  – huge amount of data
• Multidisciplinary; identify opportunities
• Challenging problems, quite important:
  – understanding information processes at the genetic level gives insights into phenotypic effects (disease)
  – some of the ultimate goals are molecular diagnostics and creating personalized drugs
