Chapter 18 Functional Genomics
•High-throughput analysis of gene function at the genome level
•Transcriptomics
•Reveals how genes work together to form metabolic, regulatory and signaling pathways and networks
•Reveals co-expressed and co-regulated genes
•Insight into biological function of the whole genome

Expressed sequence tags (ESTs)
The protocol
•Libraries of cDNA clones are prepared by reverse transcription of mRNA using oligo-d(T) primers
•Clones are randomly selected for sequencing of 200-400 nt from the 3’ or 5’ end
•Thus, if the mRNA of a gene was present, its corresponding EST is likely to be found
•The number of copies of each EST gives an indication of the level of the mRNA (~expression level)
Drawbacks
•High error rates
•Vector sequence contamination
•Double inserts: vector_left-5’-gene_1-3’-5’-gene_2-3’-vector_right
•Highly expressed genes predominate in the EST library

EST Index Construction
•Remove vector sequences (VecScreen)
•Cluster ESTs to associate them with single, unique genes
•Derive consensus sequence to produce EST contig
•Use HMM to delineate coding region
•Annotate by database similarity search of translated sequence

UniGene
•http://www.ncbi.nlm.nih.gov/unigene/
•Database of overlapping, processed EST sequences
•Sequences imported from dbEST and GenBank
•Only EST sequences with 3’ ends are used (avoids double inserts)
•Vector sequences removed
•ESTs searched against database of known genes
•EST sequence corrected to that of known gene
•Partitioned into clusters and assembled into contigs

TIGR Gene Indices
•http://compbio.dfci.harvard.edu/tgi/

Serial Analysis of Gene Expression (SAGE)
•The number of times that a sequence tag is present in the sequenced pool is proportional to the level of the corresponding mRNA
•Tags are typically 15 bp in length
•Sequencing is the most costly step
•10,000 clones representing ~500,000 tags are typically sequenced
•Tag ID is very sensitive to sequencing error due to the short tag length
•One tag can map to more than one gene
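
The SAGE counting principle above (tag frequency in the sequenced pool as a proxy for mRNA level) can be illustrated with a minimal Python sketch. The function name `tag_abundance` and the three 15 bp tags are hypothetical, invented for illustration only; real pipelines also have to handle sequencing errors and tags that map to more than one gene, which this sketch ignores.

```python
from collections import Counter


def tag_abundance(tags):
    """Return {tag: (count, fraction of all tags)} for a pool of sequenced tags."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: (n, n / total) for tag, n in counts.items()}


# Hypothetical pool of 15 bp tags (made-up sequences, not real SAGE data).
pool = (
    ["ACGTACGTACGTACG"] * 6
    + ["TTTTGGGGCCCCAAA"] * 3
    + ["GATTACAGATTACAG"] * 1
)

abundance = tag_abundance(pool)
for tag, (count, fraction) in sorted(abundance.items(), key=lambda kv: -kv[1][0]):
    print(f"{tag}  count={count}  fraction={fraction:.2f}")
```

In this toy pool the first tag accounts for 6 of the 10 tags, so under the proportionality assumption its mRNA would be estimated as the most abundant of the three.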

SAGEmap
•http://www.ncbi.nlm.nih.gov/SAGE/
•Use cDNA sequence to find the corresponding SAGE tag and its expression level

SAGExProfiler
•Allows subtraction of one SAGE library from another to view differences
•Allows identification of genes differentially expressed in control versus experimental samples

Microarray analysis
•Oligonucleotides (25-70 nt) or cDNA are spotted onto a poly-lysine coated microscope slide
•The spots are called probes; they hybridize to fluorescently labeled DNA samples
•Probes must be specific enough to minimize cross-hybridization
•Use BLAST to find unique regions in genes
•Use RepeatMasker to remove low-complexity regions (nonspecific)
•No stable internal secondary structures (use Mfold)
•All probes should have a similar Tm and a GC content of ~55%

OligoWiz
•http://www.cbs.dtu.dk/services/OligoWiz/
•Java client for oligo design

Microarray analysis
Image processing
•Extract total RNA or mRNA from tissues or cells
•Incorporate Cy3 (absorption 550 nm, emission 570 nm) or Cy5 (absorption 650 nm, emission 670 nm) fluorescent dye into two cDNA samples (control and experiment) during the RT step
•Mix the two samples and hybridize simultaneously to the microarray slide
•Visualize hybridization of one cDNA set at 570 nm and the other at 670 nm
•Subtract background
•Express as ratio experiment : control
•Log2 transform: 2:1 gives log2(2) = 1; 1:2 gives log2(0.5) = -1
•Initial “raw analysis” performed in GenePix

Data transformation and normalization
•Background-corrected ratios (experiment : control) are further normalized to correct for technical errors and biases
•Non-linear relation in intensity-ratio plot (panel C) can be corrected by Lowess normalization (localized, weighted linear regression)

ArrayPlot
•http://transcriptome.ens.fr/arrayplot/
•Windows program that allows visualization, filtering and normalization of microarray data

Statistical analysis to identify differentially expressed genes
•Arbitrary 2-fold cut-off
•Replicates
•t-Test and ANOVA
•Calculate 95% confidence level

Microarray Data classification
•Identify groups of genes with similar expression profiles
•Partition data into groups on grounds of similarity

Distance measure
•Euclidean distance between genes X and Y under conditions i = 1, 2, …, n
$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
•Pearson correlation, where x_av and y_av are the mean expression levels of X and Y, and sd_x and sd_y their standard deviations
$$r(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - x_{av}}{sd_x} \right) \left( \frac{y_i - y_{av}}{sd_y} \right)$$
•If profiles are identical, corr = +1; if anti-correlated, -1; if there is no correlation, 0

Supervised and unsupervised classification
•Genes can be grouped by the distances between their expression profiles
•Supervised: classification into a set of predefined categories
•Unsupervised: no predefined categories; categories are defined by data similarities
•Genes within a category are more similar to each other than to genes in other categories
•Co-regulated genes often reflect similar functions
•Functions of unknown genes may thus be assigned
•Clustering algorithms are of two types
Agglomerative
•Find the two most similar data points and merge them; repeat the merging process until no unmerged points remain
Divisive
•Lump all data together and successively remove the most distant data points until all data points are resolved

Hierarchical clustering
•Uses agglomerative approach to construct relationship tree
•Different linkage types
•Single: minimum distance between members of two clusters
•Complete: maximum distance
•Average: mean of the distances
•Number of clusters depends on an arbitrary user-set threshold

k-Means clustering
•Does not produce a dendrogram
•Classifies data through single-step partition
•Average of each group is calculated, and the distance of each point from this average
•Randomly reassign all data points to new clusters
•Recompute distances
•If distance to group average is smallest, retain point in group
•If not, reassign in next round

Self organizing maps (SOMs)
•Similar to k-means
•Define number of nodes, and assign data points randomly to nodes
•Calculate distances to node averages
•Redistribute data points, and recalculate
•Repeat until no decrease in distance is obtained
•Nodes are not isolated groups, but are connected

Clustering programs
Cluster
•http://rana.lbl.gov/EisenSoftware.htm
•Windows program, capable of hierarchical clustering, SOM and k-means clustering
Treeview
•http://rana.lbl.gov/EisenSoftware.htm
•Visualizes data from the Cluster program
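
As a worked illustration of the distance measures and linkage types covered in this chapter, here is a minimal pure-Python sketch. It is not the code used by Cluster or any other program named here. The function names and the example expression profiles are hypothetical. The Pearson function assumes the 1/n (population) form of the standard deviation and is undefined for a completely flat profile (sd = 0).

```python
import math


def euclidean(x, y):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation using the 1/n (population) standard deviation.

    Raises ZeroDivisionError if either profile is constant (sd = 0).
    """
    n = len(x)
    x_av, y_av = sum(x) / n, sum(y) / n
    sd_x = math.sqrt(sum((a - x_av) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - y_av) ** 2 for b in y) / n)
    return sum(((a - x_av) / sd_x) * ((b - y_av) / sd_y) for a, b in zip(x, y)) / n


def linkage_distance(cluster_a, cluster_b, linkage="average"):
    """Distance between two clusters under single, complete or average linkage."""
    pairwise = [euclidean(p, q) for p in cluster_a for q in cluster_b]
    if linkage == "single":      # minimum distance between members
        return min(pairwise)
    if linkage == "complete":    # maximum distance between members
        return max(pairwise)
    if linkage == "average":     # mean of all pairwise distances
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")


# Hypothetical log2 expression ratios over four conditions (invented values).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # same shape as gene_a, larger amplitude
gene_c = [4.0, 3.0, 2.0, 1.0]   # mirror image of gene_a

print(euclidean(gene_a, gene_b), pearson(gene_a, gene_b))
print(euclidean(gene_a, gene_c), pearson(gene_a, gene_c))
print(linkage_distance([gene_a, gene_b], [gene_c], linkage="single"))
```

With these made-up profiles, gene_a and gene_b have a Pearson correlation of +1 (up to floating-point rounding) despite a non-zero Euclidean distance, while gene_a and gene_c correlate at -1. This is why the choice of distance measure changes which genes end up clustered together.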