Gene Expression and Networks

Microarray Analysis
• Unsupervised methods
  – Partition methods: k-means, SOM (self-organizing maps)
  – Hierarchical clustering
• Supervised methods
  – Analysis of variance
  – Discriminant analysis
  – Support vector machine (SVM)

Clustering
• Grouping genes together according to their expression profiles
• Hierarchical clustering: generate a tree
  – Each gene is a leaf on the tree
  – Distances reflect similarity of expression
  – Internal nodes represent functional groups
  – Similar approach to phylogenetic trees
• k-means clustering: generate k groups
  – Number k is chosen in advance
  – Each group represents similar expression

Hierarchical Clustering Example
Five separate clusters are indicated by colored bars and by identical coloring of the corresponding region of the dendrogram. The sequence-verified named genes in these clusters contain multiple genes involved in (A) cholesterol biosynthesis, (B) the cell cycle, (C) the immediate-early response, (D) signaling and angiogenesis, and (E) wound healing and tissue remodeling. These clusters also contain named genes not involved in these processes and numerous uncharacterized genes.

Expression Correlation
• Similar expression between genes can mean that
  – One gene controls the other in a pathway
  – Both genes are controlled by another
  – Both genes are required at the same time in the cell cycle
  – Both genes have similar function
• Clusters can help identify regulatory motifs
  – Search for motifs in upstream promoter regions of all the genes in a cluster

Support Vector Machine (SVM)
• As applied to gene expression data, an SVM would begin with a set of genes that have a common function, for example, genes coding for components of the proteasome (positive set). In addition, a separate set of genes that are known not to be members of the functional class (negative set) is specified.
• Using this training set, an SVM would learn to discriminate between the members and nonmembers of a given functional class based on expression data.
• Having learned the expression features of the class, the SVM could recognize new genes as members or as non-members of the class based on their expression data.

How Do SVMs Work?
• Knowing the label of each example, the SVM tries to separate all training examples correctly while maximizing the margin, that is, the distance between the separating hyperplane and the nearest points of each class.
• Kernel: if this is not possible in the input space, the SVM searches for a hyperplane in a higher-dimensional space.

Probe Selection
• Probe on DNA chip is shorter than target
  – Choice of which section to hybridize
• Select a region which is unstructured
  – RNA folding, DNA stem-and-loop
• Choose a region which is target-specific
  – Avoid cross-hybridization with other DNA
• Avoid regions containing variation
  – Minimize presence of SNP sites

Probe Design
Two main factors to optimize
• Sensitivity
  – Strength of interaction with target sequence
  – Requires knowledge of target only
• Specificity
  – Weakness of interaction with other sequences
  – Requires knowledge of ‘background’

Sensitivity
• Basic measure: best gapless alignment of entire probe against part of target sequence
  [Figure: the probe CTACACGA placed at successive gapless positions along the target AGTGCAAGTCCGATATGCCGTAATGCTATCA, each position scored as mismatches plus matches: -2+6=+4, -7+1=-6, -6+2=-4, -6+2=-4, -8]
• Better: +3 for C–G, +2 for A–T, etc.

Selectivity
• E-value: can be calculated by BLASTing the probe against the genome studied in the specific experiment.
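The gapless-alignment measure from the Sensitivity slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the +1 per complementary pair and -1 per mismatch scheme is inferred from the arithmetic in the slide's figure (for example, -2+6=+4 for an 8-base probe), the probe is paired with the target position by position without reversing it, and the mismatch penalty in the weighted scheme is a hypothetical choice because the slide only says "etc." The function names are invented for this sketch.

```python
# Watson-Crick partner of each base; pairing is checked position by position.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def pair_score(probe_base, target_base, weighted=False, mismatch=-1):
    """Score one probe/target position.

    Unweighted: +1 for a complementary pair, `mismatch` otherwise
    (scheme inferred from the slide's figure, not stated there).
    Weighted: +3 for C-G, +2 for A-T, `mismatch` otherwise
    (the mismatch value is an assumption).
    """
    if COMPLEMENT[probe_base] != target_base:
        return mismatch
    if not weighted:
        return 1
    return 3 if probe_base in "CG" else 2


def score_at(probe, target, offset, weighted=False):
    """Gapless score of the whole probe placed at `offset` on the target."""
    window = target[offset:offset + len(probe)]
    return sum(pair_score(p, t, weighted) for p, t in zip(probe, window))


def best_gapless_score(probe, target, weighted=False):
    """Return (best score, offset) over every full-length placement."""
    offsets = range(len(target) - len(probe) + 1)
    return max((score_at(probe, target, o, weighted), o) for o in offsets)


# Probe and target sequences taken from the slide.
probe = "CTACACGA"
target = "AGTGCAAGTCCGATATGCCGTAATGCTATCA"
best_score, best_offset = best_gapless_score(probe, target)
```

With the slide's sequences, the best unweighted placement has six complementary pairs and two mismatches, which reproduces the +4 shown in the figure. Passing `weighted=True` applies the C–G and A–T weights instead.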
Sources of Inaccuracy
• Some sequences bind better than others
  – Cross-hybridization, A–T versus G–C
• Scanning of microarray images
  – Scratches, smears, cell spillage
• Effects of experimental conditions
  – Point in cell cycle, temperature, density

Gene Expression Databases and Resources on the Web
• GEO (Gene Expression Omnibus)
  – http://www.ncbi.nlm.nih.gov/geo/
• List of gene expression web resources
  – http://industry.ebi.ac.uk/~alan/MicroArray/
• Another list with literature references
  – http://www.gene-chips.com/
• Cancer Gene Anatomy Project
  – http://cgap.nci.nih.gov/
• Stanford Microarray Database
  – http://genome-www.stanford.edu/microarray/

Functional Genomics
The task is to define the function of a gene (or its protein) in the life processes of the organism, where function refers to the role it plays in a larger context.

GO (Gene Ontology)
http://www.geneontology.org/
• The GO project aims to develop three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated
  – molecular functions (F)
  – biological processes (P)
  – cellular components (C)
• An ontology is a description of the concepts and relationships that can exist for an agent or a community of agents.

GO Annotations
RIM11: GO evidence and references
• Molecular Function
  – glycogen synthase kinase 3 activity (ISS)
  – protein serine/threonine kinase activity (IDA)
• Biological Process
  – protein amino acid phosphorylation (IGI, ISS)
  – proteolysis (IGI)
  – response to stress (IGI, IMP)
  – sporulation (sensu Fungi) (IMP)
• Cellular Component
  – cytoplasm (IDA)
Extracted from SGD (Saccharomyces Genome Database)

Cellular Processes
• The cell is a dynamic entity
  – Grows, divides, responds to environmental changes
• Cellular processes are composed of molecular interactions
[Figure: yeast cell cycle]

• Different cellular processes can be represented as graphs
  – Genetic networks
  – Metabolic pathways
  – Regulatory networks
  – Protein-protein interaction networks

Representing Genetic Networks
• Entity: gene, protein, ligand
• Relationship (entity to entity): enhances, represses, becomes
• Enabler: energy source, catalyst

Metabolic Pathways

Regulatory Network

Analysis of Transcription Regulation Networks
Network motifs: connected patterns of interactions that recur in the integrated cellular network statistically significantly more often than at random

Analysis of Transcription Regulation Networks
Network motifs (Shen-Orr S. et al., 2002):
• Feed-forward loop
• Single input module
• Dense regulons

A network of interactions can be built for all proteins in an organism
P1      P2      DATA TYPE
Gal4    Gal80
Ste12   Dig2
Swi4    Swi6
…
A large network of 8184 interactions among 4140 S. cerevisiae proteins

High-throughput biological data is required for generating networks
• Measure direct interactions
  – DNA footprinting
  – One-hybrid, two-hybrid experiments
  – Accurate but low throughput
[Figure: yeast two-hybrid]

Networks generated from microarray data are less accurate
• Expression levels with microarrays
  – Examine expression correlations
  – Problem: multiple interpretations
  – High throughput but only suggestive

Other Resources
• BioCyc
  – http://www.biocyc.org/
• Biomolecular Interaction Network Database
  – http://www.bind.ca/
• ‘What is There’ Interaction Database
  – http://wit.mcs.anl.gov/WIT2/
• Gene Ontology Consortium
  – http://www.geneontology.org/
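
To make the idea of building a network from expression correlations concrete, here is a minimal sketch in Python. It is not a method taken from the slides: the use of Pearson correlation, the 0.9 cutoff, the function names, and the toy expression profiles are all assumptions chosen for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def coexpression_edges(profiles, threshold=0.9):
    """Link every gene pair whose |correlation| reaches the threshold.

    The threshold is a hypothetical cutoff, not a value from the slides.
    Returns a list of (gene1, gene2, rounded correlation) tuples.
    """
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, round(r, 3)))
    return edges


# Made-up expression levels across four conditions (illustrative only).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],  # rises with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],  # falls as geneA rises
    "geneD": [1.0, 3.0, 1.0, 3.0],  # only weakly related to the others
}
edges = coexpression_edges(profiles)
```

An edge produced this way says only that two genes vary together. As the slide notes, that is open to multiple interpretations (one gene regulating the other, a shared regulator, or a common function), so the resulting network is suggestive rather than a record of direct interactions.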