Bioinformatics
Brad Windle
[email protected]
Ph# 628-1956
Web Site: http://www.people.vcu.edu/~bwindle/Courses
Click on Link to MEDC 310 course
Or http://www.phc.vcu.edu/310/

Profiling

The term "bioinformatics" is about 15 years old. It covers a variety of data analyses that include:
• DNA and protein sequence analysis
• Biological analysis of drugs, which can overlap with chemoinformatics
• Genetics
• Taxonomy
• Clinical data statistics
• Genomic and proteomic research

Bioinformatics is sometimes equated with the term "data mining", which is commonly used in e-business and internet data handling.

Chemoinformatics

Chemoinformatics has a special challenge in that the structure of a compound or drug needs to be quantified. Specific structures are characterized by molecular descriptors useful in Quantitative Structure Activity Relationship (QSAR) modeling. QSAR tells you what it is about the structure of a drug that makes it do what it does.

Much of this information has implications for what a drug will do in a cell. However, the complexity of a cell makes the reality of what a drug does in the cell deviate significantly from what is anticipated based on chemistry and enzymatic assays. This stresses the need for characterizing drugs based on more biological data.

Analogies for looking for patterns

Looking at patterns in images
• A mixture of many patterns
• We need to identify individual patterns
• There are methods for extracting the patterns from the data
• There is also noise that obscures the patterns
• One method for identifying object patterns of interest amidst the noise
• Another method for identifying different object patterns of interest amidst the noise
• This is what was actually buried in the noise

Questions?
Philosophy of Science

Reductionist Approach (Reductionism) vs. Systems Approach (Systemism)

Reductionist: Traditional Scientific Methods
Systems Approach: Updated Scientific Methods

• Traditional: Observations are made with or without making changes to the system.
  Updated: Technology allows a large number of observations to be made.
• Traditional: Data are analyzed and a hypothesis is developed.
  Updated: Bioinformatics allows analysis of a large amount of data.
• Traditional: Experiments are designed and conducted to test the hypothesis; this usually involves changing something in the system.
• Traditional: Observations are made to determine if the hypothesis is true or false.
  Updated: Technology allows a large number of observations to be made.
• Traditional: Data are analyzed and conclusions are made.
  Updated: Bioinformatics allows analysis of a large amount of data.
• Traditional: The hypothesis is either proved true, and the work advances to the next stage, or it is proved false, and new observations are made or the data are reanalyzed to develop a better hypothesis.

How Does a Cell, or Person, Respond to Therapy or a Drug?

Treat 10 people suffering from Disease A with Drug X.
• 2 people suffer adverse reactions
• 3 exhibit good recovery from disease
• 2 exhibit modest recovery from disease
• 3 exhibit no sign of recovery from disease

What Factors Cause Differences Between People?

Genes and their sequence

Health-wise
• Disease
• Health-related Traits
• Response to Drugs

What Are the Differences in Genes?

Single nucleotide polymorphisms (SNPs)

SerSerIleAsnGlyGlnLeuArgPro
AGTTCTATAAATGGCCAGCTTAGACCT
TCAAGATATTTACCGGTCGAATCTGGA

SerSerIleHisGlyGlnIleArgPro
AGTTCTATACATGGCCAGATTAGACCA
TCAAGATATGTACCGGTCTAATCTGGT

How does a difference in a gene affect drug response?
• Transport of the drug
• Metabolism of the drug
• Interaction with the drug target

5 Million SNPs

Let's say there are 10 SNPs that contribute to response to Drug X.

Combinatorial approach to identifying SNPs that correlate with drug response:
• All combinations of 10 SNPs drawn from 5 million = $10^{60}$
• Narrow the SNPs down to those within genes: 100,000
• Combinations of 10 SNPs drawn from 100,000 = $10^{43}$

Traveling Salesman Problem

The SNPs described thus far were inherited, affecting the quality of proteins.
• What about differences between people that are somatic?
• What about quantitative differences in proteins?

Differences in Protein Expression and Gene Expression
• 20,000 genes - Genomics
• 100,000 proteins - Proteomics

Traditional Scientific Methods vs. Updated Scientific Methods

• Traditional: Observations are made with or without making changes to the system.
  Updated: Technology allows a large number of observations to be made.
• Traditional: Data are analyzed and a hypothesis is developed.
  Updated: Bioinformatics allows analysis of a large amount of data.
• Traditional: Experiments are designed and conducted to test the hypothesis; this usually involves changing something in the system.
• Traditional: Observations are made to determine if the hypothesis is true or false.
  Updated: Technology allows a large number of observations to be made.
• Traditional: Data are analyzed and conclusions are made.
  Updated: Bioinformatics allows analysis of a large amount of data.
• Traditional: The hypothesis is either proved true, and the work advances to the next stage, or it is proved false, and new observations are made or the data are reanalyzed to develop a better hypothesis.

In genomics and proteomics research, the data are extensive and the patterns complex. The emphasis shifts from asking specific questions or testing hypotheses to trying to filter out the most significant observations the data offer.
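The SNP example and the combination counts above can be checked with a short script. This is only a sketch: the codon table is deliberately partial (it holds just the codons that occur in the two example sequences), the names `person_1` and `person_2` are illustrative labels rather than anything from the slides, and the counts assume the order of the 10 SNPs does not matter.

```python
import math

# Partial standard genetic code: ONLY the codons used in the two example sequences.
CODON_TO_AA = {
    "AGT": "Ser", "TCT": "Ser", "ATA": "Ile", "ATT": "Ile",
    "AAT": "Asn", "CAT": "His", "GGC": "Gly", "CAG": "Gln",
    "CTT": "Leu", "AGA": "Arg", "CCT": "Pro", "CCA": "Pro",
}

def translate(dna):
    """Translate a coding-strand DNA string, three bases at a time."""
    return [CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3)]

def snp_positions(seq_a, seq_b):
    """Return the 0-based positions where two equal-length sequences differ."""
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]

# Top strands of the two example sequences (labels are hypothetical).
person_1 = "AGTTCTATAAATGGCCAGCTTAGACCT"
person_2 = "AGTTCTATACATGGCCAGATTAGACCA"

snps = snp_positions(person_1, person_2)  # [9, 18, 26]
protein_1 = translate(person_1)
protein_2 = translate(person_2)

for pos in snps:
    codon = pos // 3
    print(f"base {pos}: {person_1[pos]}->{person_2[pos]}, "
          f"codon {codon}: {protein_1[codon]}->{protein_2[codon]}")

# Number of ways to choose 10 SNPs, ignoring order.
all_snps = math.comb(5_000_000, 10)
gene_snps = math.comb(100_000, 10)

def order_of_magnitude(n):
    """Power of ten of a positive integer (number of digits minus one)."""
    return len(str(n)) - 1

print(order_of_magnitude(all_snps))   # 60
print(order_of_magnitude(gene_snps))  # 43
```

Of the three base differences, two change the amino acid (Asn to His, Leu to Ile) and the third leaves Pro unchanged, so not every SNP alters the protein.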
Bioinformatics and data mining in general use two forms of learning: unsupervised learning and supervised learning.

Supervised learning is the process of learning by example:
• Use example patterns with known characteristics to learn and predict characteristics for the unknown
• This is essentially the modeling process

Unsupervised learning is learning by observation; exploratory data analysis is a general form of it:
• Let the data reveal prominent patterns and associations; you don't look for specific patterns

Exploratory data analysis is used when there is no hypothesis to test, or when there is no specific pattern expected. This type of analysis shows the most significant patterns or trends within the data; it does not imply biological or statistical significance.

Cluster analysis is a popular form of exploratory data analysis. Cluster analysis sorts whatever is being analyzed into clusters with the greatest similarities in trend or pattern. It is a form of non-descriptive statistics and exploratory data analysis. A dendrogram or tree diagram is used to present the results. Below is an example of a dendrogram for bacterial species of Escherichia.

New technology = lots of data

Microarray Technology
• DNA Microarray
• Cell 1's mRNA
• Cell 2's mRNA
• Pseudo-colored Microarray Spots

The total intensity for each spot is summed and the values plotted on a scatterplot. A scatterplot of 2000 points is shown. Each point represents a gene.

Cluster analysis methods

The most straightforward methods involve calculating the Euclidean (Euclid) distance between two points, for all combinations of points.

Pythagorean Theorem

For two points $(x_1, y_1)$ and $(x_2, y_2)$, the distance is $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$.

If we perform cluster analysis on the 2000 points, we can see that we have one giant cluster with a handful of outliers.

Adding Dimensions to Cluster Analysis

The distance calculation between points $p$ and $q$ in $n$ dimensions would be: $d = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \dots + (p_n - q_n)^2}$

Thus, while we can't visualize more than three dimensions, the computer can perform cluster analysis on as many dimensions as imaginable or as processing time allows.
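The distance-based clustering described above can be sketched in a few lines of Python. Two assumptions are worth flagging: the slides do not say which linkage rule is used, so single linkage (distance between the closest members of two clusters) is chosen here purely for illustration, and the five four-dimensional "gene" points are made-up values, not data from the lecture.

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Euclidean distance between two points with any number of dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    Cluster-to-cluster distance is the smallest point-to-point distance
    (single linkage, an assumption made for this sketch).
    Returns a list of clusters, each a list of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index of cluster a, index of cluster b)
        for a, b in combinations(range(len(clusters)), 2):
            d = min(euclidean(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters

# Hypothetical expression values for 5 genes measured in 4 samples.
genes = [
    (1.0, 1.1, 0.9, 1.0),
    (1.1, 1.0, 1.0, 0.9),
    (5.0, 5.2, 4.9, 5.1),
    (5.1, 5.0, 5.0, 5.0),
    (9.0, 0.5, 9.1, 0.4),
]

result = single_linkage(genes, 3)
print(result)  # [[0, 1], [2, 3], [4]]
```

Recording the order and distance of each merge, rather than stopping at a fixed number of clusters, is what yields the dendrogram. This brute-force loop is fine for a handful of points but would be slow for 2000 genes.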
Pearson Correlation Coefficient

Two-fold Cluster Analysis

Gene expression analysis in drug development can involve a large number of genes and a large number of drugs. It is important to identify not only which genes cluster together, but also which drugs cluster together. This is done by two-fold cluster analysis: the genes are arranged and clustered, as are the drugs. Drugs that elicit similar gene expression patterns will cluster. Both sets of clusters can be viewed in a single 2-D dendrogram.

Questions?

Cluster Tree of cell lines

Classifying Cancer

Using supervised learning, models have been developed for:
• Classifying different subsets of cancers that the pathologist can't
• Predicting response to therapy and patient prognosis

Any kind of data can be explored

Cell response profile
Monks et al., Anti-Cancer Drug Design 12:553 (1997)

Drug clusters correspond to drug targets or mechanisms of action, not necessarily drug structure.
Scherf et al., Nature Genetics 24:236 (2000)

Exploratory tools allow us to focus on what is most relevant based on the data, and to develop relevant hypotheses.

For example: Geldanamycin is cytotoxic through inhibition of microtubules

The End

Any Questions?
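As a supplement to the Pearson correlation coefficient and two-fold cluster analysis slides, the sketch below shows why correlation is a useful similarity measure for expression profiles: it compares the shape of two profiles rather than their magnitude. The drug names and numbers are hypothetical, and using `1 - r` as a distance is one common convention, not necessarily the one used in the studies cited.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_distance(x, y):
    """0 for perfectly correlated profiles, 2 for perfectly anti-correlated ones."""
    return 1.0 - pearson(x, y)

# Hypothetical expression changes of four genes after treatment with three drugs.
drug_a = [1.0, 2.0, 3.0, 4.0]
drug_b = [10.0, 20.0, 30.0, 40.0]   # same pattern as drug_a, ten times stronger
drug_c = [4.0, 3.0, 2.0, 1.0]       # opposite pattern

r_ab = pearson(drug_a, drug_b)
r_ac = pearson(drug_a, drug_c)
print(round(r_ab, 6), round(r_ac, 6))
```

Here `drug_a` and `drug_b` are far apart in Euclidean terms but perfectly correlated, so a correlation-based cluster analysis would group them. For two-fold clustering, the same distance would be applied once to the rows of the expression matrix (genes) and once to its columns (drugs).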