D. RESEARCH DESIGN & METHODS Patient demographics and classification. Study subjects are derived from the Immunodeficiency Services Clinic of the Erie County Medical Center, Buffalo, NY, the largest treatment provider to HIV-infected patients in Western NY. This is the first HIV adherence clinic in the U.S. and the first to use Therapeutic Drug Monitoring (TDM). Our patient enrollment is 1200, with an age range of 16-78 yr. Patient demographics are shown in Figure 3. Of the patients, 34% are female and 66% are male. The ethnic distribution includes 507 (42.2%) African Americans, 471 (39.3%) Caucasians, 133 (11.1%) Hispanics, and 89 (7.4%) other ethnicities. Clinical samples from HIV-1 infected patients are obtained by Chiu-Bin Hsiao, M.D., Immunodeficiency Services Director and Co-I of this application. Donors are apprised of the study and informed consent is obtained consistent with the policies of the UB Health Sciences IRB. A standard data collection form that includes demographic, social, HIV risk factor, clinical, treatment and hospitalization data is completed for each subject. The Centers for Disease Control and Prevention's classification system for HIV-1 infection is used to determine clinical category; this system provides mutually exclusive states based on the intersection of clinical, virological and immunological criteria. We adhere to the following criteria for the classification of LTNP patients: a) duration of infection >12 yr; b) CD4 count >500; c) viral load <10,000 RNA copies/ml; and d) no antiretroviral therapy. Based on these criteria, we currently have at least 60 LTNP subjects available for this study. A small percentage of HIV-infected patients who rapidly progress to AIDS within 4 yr after primary infection are classified as rapid progressors (RP) and are not included in this study. The remainder of our clinic population is designated normal progressors (NP), and they will serve as a clinical comparator to our LTNP cohort. A clinical database, created with Microsoft Access and including age, gender, duration of infection, CD4 count, HIV-1 viral load, treatment regimens, and disease outcomes, is available for correlation with our genomic and proteomic data. Statistical analyses are performed with SAS software. Descriptive statistics are used to describe the population (Figure 3) and chi-square analysis is used to test for differences between groups. Logistic regression is used to determine independent factors associated with LTNP while controlling for other factors. All subjects are encoded to prevent individual identification and to provide the utmost personal privacy consistent with the guidelines of the Health Insurance Portability and Accountability Act of 1996 (HIPAA).
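For illustration, the cohort definition and regression analysis just described can be sketched as below. The analyses themselves are performed in SAS; this Python sketch is only illustrative, and the file and column names (clinic_cohort.csv, duration_yr, on_art, etc.) are hypothetical.

```python
# Hypothetical sketch of the LTNP classification criteria and the logistic
# regression for independent factors associated with LTNP. Assumes a CSV
# export of the Microsoft Access clinical database with invented columns.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("clinic_cohort.csv")  # hypothetical export

# LTNP criteria: >12 yr infected, CD4 > 500, viral load < 10,000 copies/ml,
# no antiretroviral therapy (on_art is an assumed boolean column).
df["ltnp"] = (
    (df["duration_yr"] > 12)
    & (df["cd4"] > 500)
    & (df["viral_load"] < 10_000)
    & (~df["on_art"])
).astype(int)

# Logistic regression: factors associated with LTNP while controlling
# for the other covariates.
model = smf.logit("ltnp ~ age + C(gender) + C(ethnicity)", data=df).fit()
print(model.summary())
```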
Standard clinical methods are briefly described below. CD4 counts: Absolute counts of circulating CD4+ T lymphocytes are obtained with a dedicated CyFlow® Counter. Qualitative HIV-1 DNA PCR amplification: We use the Amplicor HIV-1 test (Roche Diagnostic Systems, Inc.), the only FDA-licensed kit available for qualitative HIV-1 DNA PCR amplification, to diagnose HIV-1 infection. The assay utilizes a recombinant Thermus thermophilus enzyme to catalyze both reverse transcription and DNA amplification of a 142-base HIV gag gene sequence. The assay has an analytic sensitivity of 200 RNA copies/ml and a quantitation limit of 400 RNA copies/ml, with a dynamic range up to 750,000 RNA copies/ml. The Amplicor HIV-1 Monitor Ultrasensitive Specimen Preparation Protocol (Roche Molecular Systems) concentrates virus by centrifugation and increases sensitivity to 50 copies/ml; the quantitation limit is 200 copies/ml. p24 assay: A commercial ELISA kit (Zeptometrix, Buffalo, NY) is used to quantitate p24 in plasma samples. The sensitivity of the p24 assay is 4 pg/ml. Western Blot Analysis of p24: Total protein is extracted from peripheral blood mononuclear cells (PBMC) of subjects using the Mammalian Protein Extraction Reagent (Pierce, Rockford, IL); 30 µg of protein is loaded per lane and separated by 4-20% SDS-Tris-glycine PAGE. Membranes are probed with p24 monoclonal antibodies. Anticipated Results: We diligently screen our HIV-1 infected patients and have found that ~5-8% of them meet criteria for classification as LTNP; thus we anticipate no problems in procuring adequate numbers of subjects for this unique patient cohort. The fundamental mechanisms that determine the LTNP state remain largely unexplored. Thus this study may yield previously unrecognized biomarkers of the LTNP state that may serve as predictors of disease progression and could yield new therapeutic agents. Research Design and Methods for Specific Aim I: Genomic Analysis: Gene arrays are performed to identify functional genes in PBMC from HIV-1 infected NP and LTNP patients. RNA from PBMC is used for cDNA microarrays. A total of 15 arrays are run from each subject cohort, which is sufficient to derive statistically significant differences as determined by a power analysis. Differences in expression levels are determined using at least 2-fold changes between the patient groups. A p value of <0.05 using the non-parametric Wilcoxon-Mann-Whitney test is considered significant.
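A minimal sketch of this significance filter (at least 2-fold change plus Wilcoxon-Mann-Whitney p < 0.05), assuming the two cohorts' expression values are held in genes x arrays matrices:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def significant_genes(np_expr, ltnp_expr, fold=2.0, alpha=0.05):
    """Flag genes with >= `fold` change between cohorts and a
    Wilcoxon-Mann-Whitney p-value < alpha.
    np_expr, ltnp_expr: genes x arrays matrices (15 arrays per cohort)."""
    hits = []
    for g in range(np_expr.shape[0]):
        a, b = np_expr[g], ltnp_expr[g]
        ratio = np.mean(b) / np.mean(a)
        if max(ratio, 1.0 / ratio) < fold:   # require >= 2-fold change
            continue
        _, p = mannwhitneyu(a, b, alternative="two-sided")
        if p < alpha:
            hits.append(g)
    return hits
```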
PBMC Isolation: PBMC are isolated from 20 ml blood samples from patients. Blood is diluted 1:2 with Dulbecco's PBS without MgCl2 and CaCl2, containing 5 mM Na2-EDTA, and then overlaid on 15 ml Ficoll-Paque® Plus (Amersham-Pharmacia, Piscataway, NJ) in a 50 ml tube. Samples are centrifuged for 20 min at 700 × g at 20°C. The PBMC interface is carefully removed, washed twice with PBS/EDTA, resuspended in 2 ml of complete RPMI medium, and the total number of cells counted using a hemocytometer. RNA Extraction: Cytoplasmic RNA is extracted from 3×10⁶ PBMC/ml using Trizol reagent (GIBCO-Life Technologies, Grand Island, NY) (Chomczynski and Sacchi, 1987). RNA is quantitated using an ND-1000 spectrophotometer (NanoDrop™, Wilmington, DE) and isolated RNA is stored at -80°C. Gene Microarrays. Production of cDNA microarrays: The cDNA arrays are performed at the adjacent Roswell Park Cancer Institute Microarray and Genomics Core Facility. They consist of ~6000 cDNA clones (Research Genetics) selected based on their association with the immune response as cited in the scientific literature. Each clone is amplified from 100 ng of bacterial DNA by PCR amplification of the insert using M13 universal primers for the plasmids represented in the clone set (5'-TGAGCGGATAACAATTTCACACAG-3', 5'-GTTTTCCCAGTCACGACGTTG-3'). Each PCR product (75 µl) is purified by ethanol precipitation, resuspended in 25% DMSO and adjusted to 200 ng/µl. Printing solutions are spotted in duplicate on Type A Schott Glass slides using a MicroGrid II TAS Arrayer and MicroSpot 10K split pins (Apogent Discoveries). Preparation and hybridization of fluorescently labeled cDNA: A total of 30 RNA samples (30 cDNA arrays), 15 each from NP and LTNP donors, are screened for gene expression. From each RNA sample, cDNA is synthesized and labeled with Cy3 (NP) and Cy5 (LTNP) dyes using the Atlas PowerScript Fluorescent Labeling Kit (BD Biosciences). For each reverse transcription reaction, 2.5 µg total RNA is mixed with 2 µl of random primers (Invitrogen) in a total volume of 10 µl, heated to 70°C for 5 min and cooled to 42°C. An equal volume of reaction mix (4 µl 5X first-strand buffer, 2 µl 10X dNTP mix, 2 µl DTT, 1 µl deionized H2O, and 1 µl PowerScript Reverse Transcriptase) is added to the sample. After 1 hr at 42°C, RNA is degraded by incubating at 70°C for 5 min. The mixture is cooled to 37°C and incubated for 15 min with 0.2 µl RNase H (10 units/µl). The resultant amino-modified cDNA is purified, precipitated, and fluorescently labeled. Uncoupled dye is removed from the labeled probe by washing 3 times on a QIAquick PCR Purification Kit (Qiagen). The probe is eluted in 60 µl elution buffer and dried in a SpeedVac. Prior to hybridization, the 2 separate probes are resuspended in 10 µl dH2O, combined, and mixed with 2 µl of Human Cot-1 DNA (20 µg/µl, Invitrogen) and 2 µl of poly A (20 µg/µl, Sigma). The probe mixture is denatured at 95°C for 5 min, placed on ice for 1 min, and prepared for hybridization by addition of 110 µl of preheated (65°C) SlideHyb #3 buffer (Ambion). After 5 min incubation at 65°C, the probe solution is placed on the array in an assembled GeneTAC hybridization station module (Genomic Solutions, Inc). Slides are incubated at 55°C for 16-18 hr with occasional pulsation of the solution. After hybridization, slides are automatically washed in the GeneTAC station with decreasing concentrations of SSC and SDS. The final wash is 30 sec in 0.1X SSC, followed by a 5 sec 100% ethanol dip. The slides are dried and scanned immediately on a GenePix 4200A scanner (Axon, Inc). To minimize intra-array variations, 2 hybridizations are performed for each sample type, with the Cy dye labeling interchanged, to provide technical replicates.
[Figure 10: Normalization procedures for gene microarray data. A. M vs A plots before and after Lowess normalization: Cy3 and Cy5 labeled expression intensities from 1 of the 4 slides, plotted as M (log2(Cy3/Cy5)) vs A ((log2(Cy3) + log2(Cy5))/2) to detect intensity-dependent expression on each individual slide, (a) before and (b) after Lowess normalization; dashed lines show the dye labeling efficiency and red lines show the intensity-dependent dye bias. B. Distribution of expression values from three slides under different normalizations: boxplots of Cy3 (green) and Cy5 (red) labeled expression intensities from 3 slides, (a) without normalization, (b) after Lowess normalization, and (c) after both Lowess and scale normalization. C. Distribution of differentially expressed genes: scatter plot of the average log2 ratio between the 2 patient cohorts for the 5043 genes; statistically significant genes (p<0.05 and FDR<0.05) identified by 2-step t-test and SAM analyses are shown in color, including 70 down-regulated genes (green) and 90 up-regulated genes (red).]
Image Analysis: Hybridized slides are scanned using a GenePix 4200A scanner to generate high-resolution (10 µm) images for both Cy3 and Cy5 channels. Image analysis is performed on the raw image files using ImaGene (version 6.0.1) from BioDiscovery Inc. Each cDNA spot is defined by a circular region.
The size of the region is programmatically adjusted to match the size of the spot. Local background for a spot is determined by ignoring a 2-3 pixel buffer region around the spot and measuring signal intensity in a 2-3 pixel wide area outside the buffer region. Raw signal intensity values for each spot and its background region are segmented using a proprietary optimized segmentation algorithm that excludes pixels that are not representative of the majority of pixels in that region. The background-corrected signal for each cDNA spot is the mean signal (of all the pixels in the region) minus the mean local background. The output of the image analysis is 2 tab-delimited files, one for each channel, containing all of the raw fluorescence data. Microarray data processing and analysis: Expression data extracted from image files are first checked by an M (log2(Cy3/Cy5)) vs A ((log2(Cy3) + log2(Cy5))/2) plot to see whether intensity-dependent expression bias exists between spots (genes) labeled with Cy3 and Cy5 on each individual slide. On determining that intensity-dependent expression bias exists for all slides, we perform a Lowess data normalization to correct the observed bias. We then perform a global normalization to bring the median expression values of Cy3 and Cy5 on all 4 slides to the same scale. This is done by selecting a baseline array (e.g. Cy3) from 1 of the 4 slides, followed by scaling the expression values of each of the remaining 7 arrays to the median value of the baseline array: x_i' = (m_base / m_i) * x_i, where m_base is the median expression value of the baseline array and m_i is the median of the ith array. After data normalization, the average intensity of individual genes on each slide is computed using an in-house developed Perl script. A total of 6000 average expression values are obtained, including empty, dry, null, and DMSO control spots. Paired t-tests on normalized intensity with p-values <0.05 are used to generate a list of genes with significant change in expression between NP and LTNP samples. The false positive rate of significant genes is estimated using the SAM algorithm (Tusher et al, 2001). Quality control measures: Ratios of housekeeping genes (G3PDH and β-actin), scaling factors, background, and Q-values must be within acceptable limits. Fold-change is calculated from the signal log ratio. Microarray data preprocessing: Preprocessing of microarray data is very important for enhancing meaningful data characteristics. One of the most important procedures is data normalization, which corrects systematic differences, such as intensity-dependent expression bias and differing dye efficiency, between and across datasets. To check whether a bias exists between gene spots labeled with Cy3 and Cy5 on each individual slide, we first make an M vs A plot for genes from each individual slide. Figure 10A [a] shows that the data display bias, not only from dye labeling efficiency, indicated by the non-zero (dashed) lines in the M vs A plot, but also from intensity-dependent dye bias, indicated by the Lowess fitting line in red. Therefore, within-slide normalization is first carried out using intensity-dependent normalization. This procedure corrects most of the expression bias, as shown in Figure 10A [b]. However, this normalization cannot correct the bias introduced from different slides (Figure 10B [a] & [b]). Therefore, scale normalization is used to bring the overall intensity of Cy5 and Cy3 on each slide to the same level (Figure 10B [c]).
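The within-slide Lowess and between-slide scale normalizations described above can be sketched as follows. The actual analysis uses our own scripts; the smoothing fraction below is an assumed parameter.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def lowess_normalize(cy3, cy5, frac=0.3):
    """Within-slide intensity-dependent (Lowess) normalization.
    Returns normalized log2 ratios M' = M - lowess(M | A)."""
    m = np.log2(cy3 / cy5)
    a = 0.5 * (np.log2(cy3) + np.log2(cy5))
    fitted = lowess(m, a, frac=frac, return_sorted=False)
    return m - fitted

def scale_to_baseline(arrays, baseline=0):
    """Between-slide scale normalization: bring each array's median to the
    baseline array's median, x_i' = (m_base / m_i) * x_i."""
    m_base = np.median(arrays[baseline])
    return [x * (m_base / np.median(x)) for x in arrays]
```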
To remove genes with great expression variation between replicates, we perform data filtering by measuring the repeatability of gene expression using the coefficient of variation (CV), computed in R (http://www.r-project.org). This is done by computing the CV of Cy3/Cy5 ratios for all individual genes from the 4 slides, followed by constructing a 99% confidence interval for the CV values of all genes. Genes with CVs outside the upper 99% confidence interval limit are regarded as unreliable measurements and are removed from further analysis. Gene ontology analysis: The use of gene ontologies enables us to summarize results of quantitative analyses and annotate genes and their products with a limited set of attributes. The 3 organizing principles of gene ontology are molecular function, biological process, and cellular component. A gene product has one or more molecular functions and is used in one or more biological processes; it might be associated with one or more cellular components. Molecular function describes activities, such as catalytic or binding activities. In gene ontology, molecular functions represent activities rather than the entities (molecules or complexes) that perform the actions, and do not specify where or when, or in what context, the action takes place. Molecular functions generally correspond to activities that can be performed by individual gene products, but some activities are performed by complexes of gene products. A biological process is a series of events accomplished by one or more ordered assemblies of molecular functions. It may be difficult to distinguish between a biological process and a molecular function, but the general rule is that a process must have more than one distinct step. However, a biological process is distinct from a pathway. GeneSifter software (VizXlabs) allows user-defined filtering to focus on data of greatest interest, and these queried files can be exported for secondary analyses. GeneSifter software rapidly characterizes the biology involved in a particular experiment, and helps identify specific genes of interest from a list of potential targets by identification of broad biological themes. Potential results: As shown in Preliminary Results, our data set demonstrates reproducible and significant changes in genes that had been previously implicated or hypothesized to have a role in the pathogenesis of HIV-1 infection. Gene ontology analysis classifies the significantly modulated genes into distinct functional groups (Figure 6). These data provide insight into the signaling mechanisms that are involved in HIV-1 disease progression. However, cDNA data must be interpreted with caution, since data analyses may yield a large number of hypothetical genes that are not specific to HIV-1 disease. This method can identify unique genes that could be new biomarkers for the progression of HIV infection.
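The replicate-CV filter described at the top of this subsection is performed in R; a Python sketch of the same idea follows. The exact construction of the "99% confidence interval" is not specified above, so this sketch assumes an upper quantile of the CV distribution as the cutoff.

```python
import numpy as np

def filter_by_cv(ratios, upper_q=0.99):
    """Remove genes with unreliable measurements across replicate slides.
    ratios: genes x slides matrix of Cy3/Cy5 ratios (4 slides here).
    Genes whose CV exceeds the upper 99% limit of the CV distribution
    are dropped (assumed interpretation of the interval described above)."""
    cv = ratios.std(axis=1, ddof=1) / ratios.mean(axis=1)
    limit = np.quantile(cv, upper_q)
    keep = cv <= limit
    return keep, cv, limit
```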
Proteomic Analysis: Proteomic studies are undertaken to analyze differential protein expression in lysates of PBMC from NP and LTNP patients, to identify unique proteins associated with a specific cohort. Proteins are extracted from lysates of PBMC from NP and LTNP and run on 2-dimensional difference gel electrophoresis (2D-DIGE; Ettan™ DIGE system, Amersham Biosciences, Piscataway, NJ), available as a core facility at our institution. A total of 15 2D-DIGE runs are performed for samples from each cohort, which is sufficient to derive statistically significant differences as determined by power analysis. The Ettan DIGE system yields highly accurate, quantitative data, and the key benefit is that multiplexing enables the incorporation of the same internal standard on every 2-D gel. Extracts from NP and LTNP are co-electrophoresed on the same gel and quantified, avoiding problems associated with registering independently derived gel maps. This technique depends on pre-derivatization of samples with a fluorescent Cy dye (Cy2, Cy3, or Cy5). These dyes are molecular weight- and charge-matched, have non-overlapping excitation and emission spectra, and can be covalently attached to amino groups of proteins. At the end of the run, DeCyder Differential Analysis software, consisting of Differential In-gel Analysis (DIA) and Biological Variation Analysis (BVA) components, permits automated detection of spots, background subtraction, quantitation, normalization, internal standardization and inter-gel matching. This software can screen 2500 to 10,000 spots. We select up to 32 individual protein spots showing the highest fold-increase or decrease in intensity and a corresponding statistically significant p value between NP and LTNP from the 2D gels. Spots are robotically picked from the gels and deposited into the wells of a microtiter plate. This is followed by in-gel tryptic digestion and peptide isolation. Digested spots are analyzed in our Proteomics Core Facility. Our initial proteomics data were determined by matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry (MS). Protein Identification by Peptide Mass Fingerprinting: Most spots are initially analyzed by MALDI-TOF using peptide mass fingerprinting (PMF). The MALDI-TOF instrument is relatively tolerant of salts and other common contaminants extracted with peptides from an in-gel digestion. Extracts are pooled, dried, reconstituted in 50% ACN and 0.1% TFA, mixed 1:1 with CHCA (4 mg/ml) in 50% ACN, 0.1% TFA, and spotted on specified wells of MALDI plates. External calibration between m/z 1000-3000 and tuning of the MALDI-TOF MS instrument (resolution >10,000 FWHM) is routinely achieved using human adrenocorticotropic hormone (ACTH) fragment 18-39, which provides a monoisotopic calibration peak at 2465.199 (M+H)+. The instrument is further calibrated using a PEG (1K, 2K, and 3K) mixture, and sensitivity is checked (signal-to-noise ratio of 124:1 using 10 fmol GFP) before analyzing samples. Positive identification of a 50 fmol yeast aldehyde dehydrogenase peptide mixture (Waters Corp) by PMF is used routinely as quality control. High mass accuracy, automated MALDI-MS spectra are acquired in the reflectron positive mode from each sample spot, followed by acquisition from the lock-mass well (ACTH standard) for external lock-mass correction. Acquisition of PMF spectra is followed by protein identification through database matching using PLGS (v 2.3) and Mascot (v 2.2, 2-cpu licensed version). Remaining aliquots of the protein spots not identified by this approach, or that give ambiguous results, are subjected to nanospray or LC-MS/MS analysis: a nano high-performance liquid chromatography tandem mass spectrometry (nano-LC-MS/MS) method coupled with SEQUEST for protein identification. MS data are analyzed by SEQUEST software and searched against the latest version of the entire National Center for Biotechnology Information Conserved Domains Protein Database, considering candidate peptides of approximately the same mass, using a 16-processor IBM cluster computer.
The closest 500 sequences provide theoretical tandem mass spectra, with fragment ions produced depending on the amino acid sequence of the peptide. Experimental spectra are compared to theoretical spectra using cross-correlation analysis. Results are filtered with DTASelect software (Tabb et al, 2002) to limit the possibilities. Criteria for a positive peptide identification are a cross-correlation value (Xcorr) of 2.5 or greater for a 2+ ion, 3.5 or greater for a 3+ ion, a delta Xcorr of 0.1 or greater, and at least 1 tryptic terminus (Hunter et al, 2002). The best match is reported as an identified protein. Post-translational modification of proteins is indicated when spots with different migratory properties on 2D-DIGE are subsequently identified as the same protein by tandem MS. When multiple spots of the same protein migrate in a diagonal pattern, this is consistent with post-translational modification by glycosylation. However, a linear migratory pattern of several spots of the same protein on 2-D gels suggests post-translational phosphorylation. These qualitative differences can be resolved by isolation of the protein of interest and chemical analysis of post-translational glycosylation and/or phosphorylation. For verification of protein spot identification, we use immunoidentification on 2D western blots, which may reveal additional immunoreactive spots representing isoforms or post-translational modifications that can be analyzed as above. N-terminal sequencing can be carried out on spots to confirm their identity. Sequences of new proteins unique to a particular sample are input into SP3 and SPARKS 2 to determine their possible structural folds, and into INSPIRE 2.0 and INSP3IRE for their potential binding sites and possible binding partners. Functions of identified proteins will be tested. Lastly, changes in protein expression are correlated with changes in gene expression. To limit variations in the analysis of genomic and proteomic data, we repeatedly measure gene and protein expression in aliquots of the same sample and reciprocally interchange the Cy dye labeling of the proteins. Limitations of DIGE-based proteomics, and solutions and strategies: Limitations of DIGE-based proteomics include low-abundance proteins, hydrophobic proteins such as integral membrane proteins, poor resolution of proteins at extreme pI values, poor resolution of high MW proteins, and difficulty in identifying low MW proteins. Reasons for failure of protein identification vary, ranging from incorrect excision of the gel spot, poor extraction of peptides in the in-gel digestion procedure, incomplete trypsinization, faulty spotting on the MALDI plate, insufficient sensitivity of the mass spectrometer, suppression of ionization, and loss of mass accuracy, to low peptide numbers from low MW proteins. Strategies to improve the identification of low-abundance proteins include using higher protein concentrations and enrichment methods (Quin et al., 2005), especially in the study of post-translational modifications (Larsen, 2005). In case of failure or ambiguity of identification, we routinely use LC-MS/MS analysis (Q-ToF Premier interfaced with a Waters NanoAcquity UPLC, or LTQ instruments interfaced with Eksigent or GE Healthcare MDLC nano-flow chromatography systems).
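A sketch of the DTASelect-style acceptance criteria quoted above, with illustrative field names for each peptide-spectrum match:

```python
def passes_dtaselect(psm):
    """Apply the positive-identification criteria quoted above:
    Xcorr >= 2.5 (2+), >= 3.5 (3+), delta Xcorr >= 0.1,
    and at least one tryptic terminus. `psm` is a dict with keys
    charge, xcorr, delta_cn, tryptic_termini (hypothetical names)."""
    xcorr_min = {2: 2.5, 3: 3.5}
    need = xcorr_min.get(psm["charge"])
    if need is None:          # criteria are stated only for 2+ and 3+ ions
        return False
    return (psm["xcorr"] >= need
            and psm["delta_cn"] >= 0.1
            and psm["tryptic_termini"] >= 1)
```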
Proteins containing more than one transmembrane domain are difficult to separate by 2D-DIGE, and strong detergents, such as SDS, interfere with isoelectric focusing. Thiourea, along with urea and detergents such as CHAPS, has improved but not completely solved the problem (Luche et al., 2003). Other detergents, such as oligooxyethylene, sulfobetaine, dodecyl maltoside, and decaethylene glycol monohexadecyl ether, also can be used in 2D-DIGE (Luche et al., 2003; Santoni et al., 2000). PMF by MALDI suffers from the inherent limitation of not being able to resolve mass changes in peptides due to post-translational modifications. This hinders protein identification. We usually use tandem mass spectrometry with the Q-ToF Premier interfaced with the NanoAcquity UPLC in these situations. Tandem MS uses collision-induced dissociation (CID) in the presence of an inert gas to fragment individual peptide ions. The second mass analyzer measures the molecular masses of the fragments. The product ion spectrum allows detailed analysis of the specific selected ion. The spectra acquired by LC-MS/MS are routinely processed using MassLynx 4.1, followed by protein identification through database matching of at least 2 matched unique sequenced peptides, using PLGS v 2.3 and Mascot v 2.2. As noted below, we have additional software packages and approaches for troubleshooting protein or peptide identification. Validation of results by additional experimental procedures, such as immunoblotting, is routinely performed. We also combine DIGE proteomics analysis with a bioinformatic pathway analysis such as MetaCore. Proteins analyzed by DIGE are used as input data for pathway analysis, which points to several additional proteins, including low-abundance regulatory proteins or transcription factors, that could be involved in the HIV-1 pathogenesis process. Label-free proteomic comparison (expression profiling): Label-free proteomic quantification, defined as the relative quantification of proteins by direct comparison of peptide peak areas between LC/MS runs without the use of peptide labeling, has been emerging as an excellent alternative to 2D gels and label-based methods such as isotope-coded affinity tags (ICAT), isobaric tags for relative and absolute quantitation (iTRAQ), 18O incorporation, or stable-isotope labeling by amino acids in cell culture (SILAC).
[Figure 11: A high-resolution, low void volume, homogeneous-mixing nanoLC/nanospray interface developed in our proteomics facility.]
Label-free methods can provide better accuracy and eliminate expensive, and sometimes problematic, labeling steps (Higgs et al., 2007). However, label-free quantification of differentially expressed proteins on a proteomic scale remains challenging for several reasons. First, cleanup of extracted proteins from complex samples without compromising quantitative information can be a significant challenge (Higgs et al., 2007). Second, label-free strategies are intrinsically biased toward higher-abundance proteins, while more important, lower-abundance regulatory proteins might escape sequencing (Bantscheff et al., 2007). To address these challenges, we shall apply several novel technical advances that have been developed and validated in our lab for label-free proteome comparisons. Our experimental approach involves 3 key innovations. First, samples are extracted by an optimized method (see below). Before analyzing the samples by nano-flow liquid chromatography/mass spectrometry (nano-LC/MS), surfactants and matrix components such as lipids that could compromise reversed-phase chromatography must be removed while avoiding loss of proteins.
Traditionally, crude protein extracts are resolved on a preparative SDS-PAGE, followed by excision of gel bands and in-gel digestion (Cho et al., 2007), but this approach suffers from relatively low recovery of peptides following in-gel digestion, and band excision is inexact, leading to the inevitable loss of proteins at the edges of bands and thereby compromising quantification. We have developed an approach involving protein precipitation and on-pellet digestion; this method provides high protein/peptide recovery and identification (ID) of a larger number of proteins/peptides, while avoiding gel separation. The crude protein extract is subjected to a multiple-step precipitation procedure employing organic solvents that is optimized for the specific cell or tissue sample. For example, the protein extract is mixed with 4:2:3 (vol:vol:vol) methanol:chloroform:water, and then centrifuged. After removal of the aqueous layer, which contains hydrophilic non-protein components, 4 vol of methanol is added to eliminate phase separation and precipitate the protein. After centrifugation and removal of the supernatant, a small amount of enzyme is added to the pelleted protein to initiate partial proteolysis and solubilization of the pellet. The mixture is then reduced, alkylated, and further digested to completely cleaved peptides, which are ready for nano-LC/MS analysis. This strategy is superior to the traditional gel fractionation method in both protein recovery and peptide ID. Second, because the samples proposed for proteomic analysis are all highly complex, the ability to separate the samples efficiently prior to analysis by the MS/MS detector is critical to obtain sequence and quantitative information, particularly for lower-abundance regulatory proteins. We developed a novel, high-resolution nanospray ionization (nano-LC/NSI) configuration, which provides homogeneous mixing, low void volume, and high chromatographic resolving power. The setup is illustrated in Figure 11. A trap and a small-particle nano-HPLC column are connected back-to-back by an entirely metal zero-dead-volume tee, with a waste line connected to the 90º arm. Because there is no valve between the trap and nano-column, peak tailing and band broadening due to turbulence/mixing in valve channels are eliminated. A large-diameter trap and bi-directional sample loading/analysis are used to achieve homogeneous nano-gradient mixing, a highly reproducible gradient, and a high loading capacity. An online zero-dead-volume conductivity sensor is used to monitor gradient quality and trap washing efficiency. Because of the highly reproducible gradient and high resolving power this interface provides, we can employ shallow gradients and long run times to resolve very complex protein mixtures without unacceptable peak broadening. Third, the design for sample analysis and data processing is shown in Figure 12. We use 2 label-free quantification software packages, Sieve (Thermo) and DeCyder MS (GE Healthcare), to provide relative quantification by comparison of multiple runs and samples. Searches for peptide ID are performed with SEQUEST running on a 32-node cluster in UB's supercomputer center. Identified proteins are documented, and those of interest for quantification at higher temporal resolution are analyzed using the ultra-sensitive quantification scheme employing LC/MRM (multiple reaction monitoring)-MS/MS described below.
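As a rough illustration of label-free comparison by peak areas, the sketch below median-normalizes peptide peak areas per run and reports LTNP/NP log2 ratios. In practice, Sieve and DeCyder MS perform this step; the run-naming convention and data layout here are hypothetical.

```python
import numpy as np
import pandas as pd

def label_free_compare(runs):
    """Minimal label-free comparison. `runs` maps run name (e.g. 'NP_1',
    'LTNP_1') to a DataFrame with columns ['peptide', 'area'] of peptide
    peak areas from one LC/MS run (illustrative layout)."""
    tables = []
    for name, df in runs.items():
        norm = df.set_index("peptide")["area"]
        norm = norm / norm.median()          # per-run median normalization
        tables.append(norm.rename(name))
    # Keep peptides observed in every run (a simplification of real matching).
    merged = pd.concat(tables, axis=1, join="inner")
    np_runs = [c for c in merged if c.startswith("NP")]
    ltnp_runs = [c for c in merged if c.startswith("LTNP")]
    return np.log2(merged[ltnp_runs].mean(axis=1)
                   / merged[np_runs].mean(axis=1))
```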
Comprehensive identification of post-translational modifications (PTM) in complex samples: PTM play a critical role in physiological and pathological processes, and are important markers for certain pathological states (Hunter, 2007). The ability to identify PTM in PBMC will provide key insights necessary to accomplish the aims of this project. Our proteomic facility has developed analytical procedures to routinely and comprehensively monitor 20 biologically important PTM in biological samples, such as phosphorylation, methylation, dimethylation, ethylation, biotinylation, ubiquitinylation, nitrosylation, etc. Due to the usually low stoichiometry of PTM, combined with the complexity of the digested sample, identifying multiple PTM in these samples is challenging. To increase the number of PTM identified, we shall employ the high-resolution nano-LC/NSI system described above, and use a shallow gradient to improve resolution of the complex mixture. In addition, we shall employ a dual-enzyme/dual-activation technique that we recently developed. The flow chart of this new technique is illustrated in Figure 13.
[Figure 13: Scheme for a novel dual-enzyme and dual-activation method to significantly enhance PTM identification.]
In this sample processing scheme, two enzymes are employed individually in parallel: trypsin (cuts at K and R) and V8 (cuts at D and E). These produce complementary proteolytic peptide profiles. Parallel samples are analyzed by both collisionally activated dissociation (CID) and electron transfer dissociation (ETD), which are alternative approaches for fragmenting peptides to obtain sequence-informative product ions. CID can fragment singly to triply charged peptides with good efficiency, but not peptides with higher charge states, and often does not preserve fragile PTM such as phosphorylation. In contrast, ETD is not optimal for doubly charged peptides, and does not work well on singly charged peptides, but can efficiently fragment peptides having 3 or more charges. It also can preserve phosphorylations. Therefore CID and ETD provide complementary information. Using this dual-enzyme and dual-activation method, we are able to observe significantly more PTM than when using trypsin and CID alone, as well as identify more peptides/proteins (data not shown). For example, with conditions otherwise the same, using trypsin and CID alone to analyze a human liver sample enabled identification of ~5400 PTM when using a set of stringent filters, while the dual-enzyme/dual-activation approach resulted in ~9000 PTM identified. For each sample, the database search for CID samples is filtered using Xcorr (2.0 for z=1, 2.5 for z=2, and 3.0 for z=3) and probability score (P<0.01); for samples analyzed by ETD, the standard filter is Xcorr > 2.5 for z=2, > 3.0 for z=3, and > 3.5 for z=4, with Sf > 0.7. All identified proteins, peptides and PTM are cataloged in a database.
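A sketch of the dual-filter logic just quoted, combining PTM identifications from the trypsin/CID and V8/ETD analyses; the field names are illustrative, not those of any particular search engine.

```python
def passes_cid(psm):
    """CID filter quoted above: Xcorr 2.0 (z=1), 2.5 (z=2), 3.0 (z=3), P < 0.01."""
    xcorr_min = {1: 2.0, 2: 2.5, 3: 3.0}
    need = xcorr_min.get(psm["charge"])
    return need is not None and psm["xcorr"] >= need and psm["p"] < 0.01

def passes_etd(psm):
    """ETD filter quoted above: Xcorr > 2.5 (z=2), > 3.0 (z=3), > 3.5 (z=4), Sf > 0.7."""
    xcorr_min = {2: 2.5, 3: 3.0, 4: 3.5}
    need = xcorr_min.get(psm["charge"])
    return need is not None and psm["xcorr"] > need and psm["sf"] > 0.7

def merged_ptm_ids(cid_psms, etd_psms):
    """Union of PTM identifications from the two complementary analyses."""
    ids = {p["ptm_site"] for p in cid_psms if passes_cid(p)}
    ids |= {p["ptm_site"] for p in etd_psms if passes_etd(p)}
    return ids
```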
Potential results: As seen in our preliminary data (Figures 8 & 9), proteomic analysis using 2D-DIGE showed significant changes in protein expression between the NP and the LTNP cohorts. Identification of these unique protein spots is currently underway using the several methodologies described above. These investigations will provide a list of known or unknown proteins that play a role in HIV-1 disease progression. Data analyses may yield a large number of hypothetical genes and proteins, some of which may not have been described in the literature or exist in a database. These unanticipated proteins may be candidates for new biomarkers associated with progression of HIV disease, or even emerge as potential targets for innovative therapies. Although we expect a large number of proteins to be expressed in each group, only a few may be expressed at levels detected by the 2D-DIGE approach. Even with low-concentration polyacrylamide gels, some high MW proteins may not be adequately separated by 2D electrophoresis. Limitations of this methodology are peptide clustering, protein modifications, protein-protein interactions, low mass accuracy, and enzymatic cleavage of specific proteins, some or all of which can result in inaccurate or incomplete identification of the unique protein. Once a unique protein has been identified, we shall quantitate its expression with western blots. Attempts will be made to account for any confounding variables using the technological advances in proteomics described above. Candidate genes and proteins, as selected by genomic and proteomic analysis, will be confirmed by appropriate assays such as QPCR, western blots, and antibody-based microarrays. Changes in protein expression will be correlated with changes in gene expression. To limit variations in the analysis of genomic and proteomic data, we shall repeatedly measure gene/protein expression in different aliquots of the same sample using reciprocal dye labeling. We anticipate that using a combination of these state-of-the-art proteomic methodologies we shall be able to identify new biomarkers of HIV infection. Research Design and Methods for Specific Aim II: SNP analysis: SNP analysis will be undertaken by Sequenom MassARRAY spectrometry, a powerful genotyping technology that efficiently and precisely measures the amount of genetic target material, and variations therein, using matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS). It is able to deliver reliable and specific data from trace amounts of DNA from patients. This methodology is one of the cheapest and most error-free technologies for high-throughput SNP typing. It uses samples arrayed in 384-well plates and allows custom genotyping of SNPs within candidate genes or genomic intervals. Sequenom MassARRAY identifies alleles on the basis of mass; it therefore does not require expensive labeled primers, and it is far more reliable than other genotyping approaches, with a <0.5% error rate. Briefly, the technology involves PCR amplification of the region containing the SNP of interest, an optimized primer extension reaction to generate allele-specific DNA products, and chip-based mass spectrometry for separation and analysis of the DNA analytes. A single post-PCR primer extension reaction generates diagnostic products that, based on their unique mass values, allow discrimination between two alleles. The entire process has been designed for complete automation, including assay development, PCR setup, post-PCR treatment, nanoliter transfer of diagnostic products onto silicon chips, serial reading of chip positions in the mass spectrometer, and final analytical interpretation. Sequenom SPECTRODESIGNER software designs primers for genotyping in multiplex fashion. iPLEX software allows for the design of assays for 24-28 SNPs at a time. Sequenom MassARRAY will be used to identify the allelic variants RANTES In1.1C, CCR2b-64I, CCR5-∆32, IL10-5'A, IL-4 -589T, TNF-α -238A and HLA-B27 in NP and LTNP.
DNA is extracted from PBMC isolated from NP and LTNP samples and used for the Sequenom MassARRAY analysis. A total of 65 samples from each cohort will be run in triplicate on a 384-sample plate. A total of 8 SNPs will be run per sample plate. Genotyping data are automatically analyzed using a script that computes descriptive statistics and runs the Shapiro-Wilk test for goodness-of-fit to normality. Tests of association are automatically performed using a script that runs the modified χ2 test. The interface also allows classical association studies to be carried out based on the genotypes of individuals. Analysis of the genetic variants using real-time, quantitative PCR: Quantitation of gene expression for the specific SNPs listed above will be performed using QPCR in both study groups, NP and LTNP. The RANTES In1.1C, CCR2b-64I and CCR5-∆32 alleles, which have been implicated in HIV-1 disease progression, will be genotyped using DNA extracted from PBMC from patients in each cohort (NP or LTNP). Genomic DNA will be extracted from 5-10 ml of peripheral blood using standard procedures. DNA (100 ng) will be amplified using QPCR. We have designed the following primer sequences for use in QPCR: CCR5-∆32 allele (CCR5-∆32-F, 5'-CTTCATTACACCTGCAGCT-3' and CCR5-∆32-R, 5'-TGAAGATAAGCCTCACAGCC-3'); RANTES In1.1C allele (RANTES In1.1C-F, 5'-CCTGGTCTTGACCACCACA-3' and RANTES In1.1C-R, 5'-GCTGACAGGCATGAGTCAGA-3'); CCR2b-64I allele (CCR2b-64I-F, 5'-TTG TGG GCA ACA TGA TGG-3' and CCR2b-64I-R, 5'-GAG CCC ACA ATG GGA GAG TA-3'); IL10-592A allele (IL10-592A-F, 5'-TACTCTTACCCACTTCCCCC-3' and IL10-592A-R, 5'-TGAGAAATAATTGGGTCCCC-3'); IL-4 -589T allele (IL-4 -589T-F, 5'-CAGTCCTCTGGCCAGAGAG-3' and IL-4 -589T-R, 5'-CACCGCATGTACAAACTCCC-3'); TNF-α -238G/A allele (TNF-α -238G/A-F, 5'-AGAAGACCCCCCTCGGAACC-3' and TNF-α -238G/A-R, 5'-ATCTGGAGGAAGCGGTAGTG-3'); and HLA-B27 allele (HLA-B27-F, 5'-GGG TCT CAC ACC CTC CAG AAT-3' and HLA-B27-R, 5'-CGG CGG TCC AGG AGC T-3'). PCR fragments for the CCR5-∆32, RANTES In1.1C, CCR2b-64I, IL10-592A, IL-4 -589T, TNF-α -238G/A and HLA-B27 alleles are 164 bp, 240 bp, 128 bp, 311 bp, 700 bp, 152 bp and 135 bp, respectively, when separated on 2% agarose gels. Primer mixes always include control primers that amplify a nonpolymorphic region. Relative abundance of each mRNA species is quantitated by QPCR. Relative expression of mRNA species is calculated using the comparative CT method (Shivley et al, 2003). All data are controlled for quantity of RNA input by measurements on a reference gene, β-actin, with the 18S RNA standard as an internal control. Results on RNA from LTNP samples are normalized to results obtained on RNA from NP samples. Data are expressed as a transcript accumulation index (TAI), assuming that all PCR reactions work at 100% efficiency.
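A minimal sketch of the comparative CT calculation under the stated 100% efficiency assumption, with β-actin as the reference gene and the NP cohort as the calibrator (the example CT values below are invented for illustration):

```python
def transcript_accumulation_index(ct_target_ltnp, ct_ref_ltnp,
                                  ct_target_np, ct_ref_np):
    """Comparative CT method assuming 100% PCR efficiency:
    TAI = 2**(-ddCT), normalized to the reference gene and to the
    NP calibrator."""
    d_ct_ltnp = ct_target_ltnp - ct_ref_ltnp    # normalize to reference gene
    d_ct_np = ct_target_np - ct_ref_np
    dd_ct = d_ct_ltnp - d_ct_np                 # normalize to NP calibrator
    return 2.0 ** (-dd_ct)

# Example: target CT 24.1 vs beta-actin 17.9 in LTNP; 26.0 vs 18.1 in NP
# -> ddCT = (24.1 - 17.9) - (26.0 - 18.1) = -1.7 -> TAI = 2**1.7 ~ 3.2-fold.
```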
Potential results: Accurate estimation of allele frequencies requires calculation of a correction factor for unequal allelic amplification from the peak height ratios of a small set of heterozygotes. The Sequenom MassARRAY method gives the best reproducibility. The use of pooled samples as controls in each run (we use pools of 384 individuals) and multiple replicates increases the accuracy of detection. Although genotyping accuracy has not been systematically examined, no genotyping method is 100% accurate, and as many as 5% of individual genotypes could be mis-called. Such genotyping errors would decrease the power to detect quantitative trait loci or could have serious effects on linkage disequilibrium measures (Abecasis et al, 2001; Akey et al, 2001). Most of the scoring errors are caused by ambiguities in the allele peaks, sample-to-sample contamination, or mislabeling of DNAs. The use of pools will reduce all of these sources of error. Nevertheless, Sequenom MassARRAY provides very accurate association studies in a large set of samples, compared with genotyping individual samples. We expect that data obtained from this analysis will provide information on specific allelic variants that play an important role in HIV-1 disease progression in our patient cohorts. Having identified the most important allelic variants from among the 8 SNPs studied, we will further quantitate the expression levels of the specific genes in our patient cohorts by QPCR. Research Design and Methods for Specific Aim III: This aim focuses on the development of new computational tools to integrate and analyze genomic, proteomic, and clinical data from our different HIV-1 infected patient cohorts. The wealth of information that will be produced by our project will require novel methods for its storage, analysis, and dissemination. Thus, information management and analysis is a key component of this proposal. III.a Data warehouse, data modeling and system design: The most essential component of the proposed research is the design of an efficient and organized database allowing integration of the various streams of information and study. Implementation of the system to organize our data is a complex and data-intensive process, since the data are inherently noisy, complex, and distributed across multiple information resources. The core of this project's information management environment, therefore, will be a data warehouse and associated tools. The data warehouse will be used to integrate information on HIV-1 and host cell gene expression, protein identification, post-transcriptional and post-translational modifications of proteins, functional activities, and protein-macromolecule interactions with clinical data from our HIV-1 patient cohorts. Application tools will be built to analyze and integrate these data, and to provide easy access to all this information via the World Wide Web. To support the intensive computational research activities proposed here, we must design a robust system to integrate genomic, proteomic, and clinical datasets. A high-level logical view of the system architecture is shown in Figure 14. The central component, the data warehouse, will contain primary, derived and external data. The primary information includes genomic, proteomic, and clinical data collected from individual patients. Since clinical data obtained from various patients are heterogeneous datasets, they will be transformed and imported into the data warehouse using Perl programs. Associated with the primary data will be a set of derived data generated from our data analysis approaches and different data analysis tools. Public information also will be integrated into the database. These include data from the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/) such as human EST, UniGene and RefSeq data; gene ontology data from the Gene Ontology Consortium (http://www.geneontology.org/); protein-protein interaction data from the DIP database (http://dip.doe-mbi.ucla.edu/); protein domain models from the Pfam database (http://pfam.wustl.edu/); and microarray gene expression datasets from GEO (http://www.ncbi.nlm.nih.gov/projects/geo/).
[Figure 14: System architecture of the integrated data warehouse under development, showing data sources (clinical data and sample annotations, gene functional annotations, gene expression datasets, promoter sequences and motifs, protein and domain interactions, protein structural information); data integration (extraction, transformation, cleaning and loading; metadata capture; object-oriented integration; data quality control; refreshment); the data warehouse; unified access through a standard interface for application tools; and data mining services (ad hoc queries, OLAP, cluster analysis, mining gene regulatory networks, interactome prediction, pathway analysis).]
Schema design: Our preliminary study showed that we can use multidimensional schemas to effectively model the microarray, proteomic and experimental data, and provide highly efficient query processing. However, multidimensional schemas do not appear to be sufficient for modeling the semantics of the clinical data and sample data. For the clinical data, a multidimensional schema has difficulty modeling the complex many-to-many relationships between the fact and dimensions, and providing bi-temporal support for some clinical measures. If a single fact table were used to store all the different clinical measures, most entries (including foreign keys of the central fact table) would contain null values due to the incompleteness of the data. We propose to develop a data model that is most suitable to specify and integrate biomedical data, including genomic, proteomic, clinical, and other related data such as ontology information, in the data warehouse, so that efficient querying and analysis of these data can be performed. We propose to design a new schema called Hybrid BioSchema (HBS), which combines a multidimensional model with an object-oriented model. The HBS model can support both a multidimensional on-line analytical processing (MOLAP) server and an object-oriented (OO) data processing server. We can then benefit from the greater scalability of the OO model and the faster computation of MOLAP. In HBS, data spaces are specified as either main data spaces or sub-data spaces. The main data space is modeled by the OO model, and each sub-data space is modeled by the multidimensional model and is connected to central objects in the main data space. Thus, each sub-space can be individually managed. Clinical data and sample data are in the main data space (for example, a patient object with demographic and other related attributes), while microarray and proteomic data are in the sub-data spaces. The object-oriented model can express all the complex relationships between entities that exist in the clinical data. This model can be applied to many biomedical applications. The HBS schema has many advantages. First, MOLAP operations can easily be applied to analyze genomic and proteomic data, and the data in the main data space can be accessed through the OO server, with explicit navigation capabilities. Thus, the analysis can be applied to any biomedical data with high query performance. Second, HBS is very expressive, showing clear data semantics, because the main data space has clear object relationships. Third, HBS is very scalable and extensible. Updating or adding a large amount of data can easily be done in any sub-data space; if we need to update or add data in the main data space, the OO data server can be used. Finally, HBS has a very simple and concrete structure, and the resulting data warehouse schema is easy to understand.
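To make the HBS idea concrete, the sketch below models a patient object in the OO main data space with attached multidimensional sub-data spaces. All class and field names are hypothetical; a production system would realize this in ORACLE's object schema rather than application code.

```python
# Illustrative sketch only: one OO main-data-space object (Patient) with
# multidimensional sub-data spaces (fact tables keyed by dimensions).
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class SubDataSpace:
    """Multidimensional fact table, e.g. expression values keyed by
    (probe_id, time_point) dimension tuples."""
    facts: Dict[Tuple[str, str], float] = field(default_factory=dict)

    def roll_up(self, probe_ids):
        """A basic MOLAP-style aggregation over a set of probes."""
        vals = [v for (p, _), v in self.facts.items() if p in probe_ids]
        return sum(vals) / len(vals) if vals else None

@dataclass
class Patient:
    """Main data space object with demographic/clinical attributes."""
    patient_id: str
    age: int
    gender: str
    cohort: str                      # "NP" or "LTNP"
    microarray: SubDataSpace = field(default_factory=SubDataSpace)
    proteomics: SubDataSpace = field(default_factory=SubDataSpace)
```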
Querying and online analytical processing: The most important feature of MOLAP is its ability to present multidimensional data at different levels of detail through roll-up and drill-down operations along a concept hierarchy associated with a dimension. Such operations are especially suited for gene and protein expression data analysis, where at least three major concept hierarchies can be identified. For example, for the Array Probe dimension in the gene data sub-space, gene expression data may be summarized using Gene Ontology (GO) or other ontology hierarchies (Gene Ontology Consortium, 2000; Rosse and Mejino, 2003). These vocabularies encode significant background knowledge of biology, and thus are important for meaningful MOLAP analyses of genomic and proteomic data. The classification hierarchies of the other dimensions may also be defined in a similar way based on domain-specific knowledge. The above-mentioned ontology hierarchies are critical for meaningful MOLAP analyses of clinical and gene expression data. Examples include summarization of clinical test results using disease hierarchies for pattern discovery, and gene expression summarization using terms from GO's biological process hierarchy, which may reveal changes in pathways due to diseases or in response to drug treatments, and thus provide useful information for clinical research and drug discovery. System construction: Data warehousing is a complex and data-intensive process that transforms heterogeneous data into an integrated object and multidimensional representation. This is especially true for biomedical data warehousing, in which data are inherently noisy, complex and distributed across multiple information resources. We will develop wrapper programs for data extraction from various sources, design transformation algorithms for integration of the heterogeneous datasets, and define constraints for data quality control. To monitor data lineage and quality, it is important to capture all the metadata generated in the data staging process. The metadata include the semantics and structure of the data warehousing processes. Thus, the metadata repository is a key resource for the operational design, evolution and refreshment of the data warehouse. We will design a framework to model the clinical and genomic data warehousing processes, and store the information in the metadata repository. The underlying database for managing the data warehouse will be powered by the ORACLE database management system, which has proven to be stable, reliable, and easy to administer; it will serve as the data repository. The object database architecture in ORACLE will enable us to specify complex clinical data and relationships. The relational database architecture in ORACLE will enable users to ask complex questions regarding all aspects of HIV disease. Constituting the application layer, a set of Perl and Java modules and programs provides centralized program logic to handle transactions between the user and the database. Finally, users will be able to interact with and navigate the database via a user-friendly Web interface.
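As a concrete illustration of the roll-up operation described under "Querying and online analytical processing" above, the sketch below aggregates per-gene expression to GO terms; the data-frame layouts are assumed.

```python
import pandas as pd

def roll_up_by_go(expr, gene2go):
    """Roll up per-gene expression to GO terms along the ontology hierarchy.
    expr: DataFrame indexed by gene, one column per sample;
    gene2go: DataFrame with columns ['gene', 'go_term'] (a gene may map
    to several terms). Drill-down is the inverse operation: inspecting
    the individual genes behind a term's aggregate."""
    joined = gene2go.merge(expr, left_on="gene", right_index=True)
    return joined.drop(columns="gene").groupby("go_term").mean()
```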
III.b Advanced analysis methods and tools: We now describe the analysis methods and tools to aid researchers in mining the data. We will support the use of routine statistical and computational approaches to identify differentially expressed genes and proteins. For genomic analysis, the gene expression image analysis of the cDNA array and quality control will be performed with the Microarray Suite software (MAS 5.0), which utilizes global (linear) normalization procedures. The latest version of the software uses the paired version of the non-parametric Wilcoxon test. For the proteomic analysis, Ciphergen ProteinChip software version 3.0 will be used to analyze the spectra and relative abundance of the individual proteins. Further, a classification tree will be utilized to define cutpoints in variable peaks in the different patient cohorts, NP and LTNP. The best discrimination between the 2 cohorts will be achieved by considering all possible confounding variables in the algorithms used in the analysis. A classification tree will be constructed by separating the subgroups based on the selected cutpoints. These splitting processes will be continued until the minimum number of patients required in each patient cohort is reached. The final tree will be pruned to eliminate overfitting of the datasets. The data will be cross-validated, tree error rates will be estimated, and the tree with the smallest error rate will be selected. Further, in a parallel analysis, the protein peaks that are over-expressed or under-expressed in the different patient cohorts will be compared to control groups. The results of this analysis will be compared to those of the classification tree to evaluate the sensitivity of the analysis. In parallel, DeCyder software will be used to distinguish clear statistical differences in protein expression between treated versus untreated samples, and between the different patient cohorts.
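A sketch of the classification-tree construction with pruning and cross-validated model selection just described; the actual analysis tools may differ, and the parameters here are assumptions.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

def fit_pruned_tree(peaks, labels, min_leaf=5):
    """Build a classification tree on protein-peak intensities, then select
    the cost-complexity pruning level with the smallest cross-validated
    error rate ('the tree with the smallest error rate' above).
    min_leaf enforces a minimum number of patients per terminal node."""
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
        peaks, labels)
    search = GridSearchCV(
        DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=0),
        {"ccp_alpha": path.ccp_alphas},   # candidate pruning strengths
        cv=5)
    search.fit(peaks, labels)
    return search.best_estimator_
```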
However, since the genomic, proteomic, and clinical data are very complex, fast-evolving, and often incomplete, we shall also investigate more advanced data analysis approaches to analyze gene and protein profiles based on biological significance. Target selection, common profile extraction and association: Target selection discovers a subset of genes/proteins that explain the phenotypic variations. While previous studies have proved useful for identifying informative genes/proteins, several fundamental challenges remain. (1) Correlations between genes are often ignored. Most previous target selection approaches are single-gene-based. These methods simply assume genes are independent, ignoring their correlations; however, genes are well known to interact with each other through gene regulatory networks, and the assumption of independence between genes/proteins oversimplifies the complex relationships between them. (2) Domain knowledge can be incorporated to improve the performance of target selection. Gene/protein array datasets are typically noisy due to technical constraints; incorporation of domain knowledge will help to reduce the effect of noise and improve the quality of the result. (3) Association between gene and protein expression may provide a considerable amount of information for delineating the roles of genes/proteins in the disease state. However, few, if any, studies have been devoted to analyzing the inherent correlation between these two types of expression. Correlation-based feature extraction: We propose to use the correlation between genes (or proteins) as features for identifying phenotypes. Such correlation between gene expression levels could result from underlying biological processes and warrants further investigation. Instead of trying to remove correlation from the selected gene set, we examine whether such correlation itself is a good predictor of sample class labels. As we pointed out in the preliminary study, the correlated genes bear biological meaning: the weighted summation or difference of expression levels of several genes. Incorporation of domain knowledge: We shall integrate domain knowledge, such as that embedded in the Gene Ontology (GO) annotations (Gene Ontology Consortium, 2000), into our target selection process. The rationale is that while it is likely that even random gene expression can achieve relatively high discriminative scores when the number of samples is limited, it is less likely that several random genes annotated with the same GO term will all have similarly high scores. Our algorithm first examines, for each GO term, whether the genes annotated with it have statistically higher discriminative scores. Where this is so, it indicates a correlation between the corresponding GO term and the sample class labels. We then choose from genes that are annotated with GO terms that are highly correlated with sample class labels. The discriminative power (DP) of a GO term for sample class labels is defined as the percentage of genes annotated with this GO term whose discriminative scores are larger than a threshold. The DP of a GO term measures the collective discriminative power of the individual genes annotated with that term; the higher the DP score, the more strongly a GO term is correlated with sample class labels. The best GO-adjusted single-gene-based scores are then sorted, and top-ranked genes are selected as informative genes. The utility of the best GO-adjusted scores is twofold: irrelevant noise is further filtered, and implicit sample classes may become explicit. Association analysis between mRNA and protein expression: Association between mRNA and protein expression may provide a considerable amount of information for delineating the roles of genes and proteins in disease states. So far few, if any, studies have been devoted to analyzing the inherent correlation between these two types of expression. Instead of using a simple Spearman rank correlation, association analysis between mRNA and protein expression profiles can be performed using the k-way interaction information (KWII) (Jakulin, 2005), which enables detection of two- or three-way interactions between mRNA and protein expression samples and can be used for effectively detecting differentially expressed genes/proteins in HIV. We propose to use k-way interaction information (Chanda et al., 2007; 2008), an information-theoretic metric (Moore et al., 2006; Bhasi et al., 2006a, 2006b; Liu et al., 2005) that can be considered a multivariate generalization of the Kullback-Leibler divergence (KLD) (Liu et al., 2005; Rosenberg et al., 2003; Smith et al., 2001; Anderson and Thompson, 2002), for association analysis of genetic data (SNPs) with gene and protein expression data. For the n-variable case on the set ν = {X1, X2, ..., Xn}, the KWII can be written succinctly as an alternating sum over all possible subsets T of ν using difference operator notation. The following definition of KWII follows that of Jakulin (2005): KWII(X1; ...; Xn) = -Σ_{T⊆ν} (-1)^(|ν|-|T|) H(T), where H denotes entropy. The KWII measures the gain or loss of information due to the inclusion of additional variables in the model.
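A direct transcription of the KWII definition above into code, using empirical entropies of discretized variables (for k = 2 it reduces to mutual information):

```python
import itertools
import numpy as np

def entropy(*cols):
    """Empirical joint entropy (bits) of one or more discretized variables."""
    _, counts = np.unique(np.column_stack(cols), axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def kwii(*variables):
    """K-way interaction information:
    KWII(X1;...;Xk) = -sum over nonempty subsets T of (-1)**(k-|T|) * H(T)
    (the empty set contributes H = 0). Inputs are assumed to be
    discretized, e.g. binned expression levels or SNP genotypes."""
    k = len(variables)
    total = 0.0
    for r in range(1, k + 1):
        for subset in itertools.combinations(variables, r):
            total += (-1) ** (k - r) * entropy(*subset)
    return -total
```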
The KWII quantitates interactions by representing the information that cannot be obtained without observing all k variables at the same time (Jakulin, 2005). In the bivariate case the KWII is always positive, but in the multivariate case it can be positive or negative. The interpretation of KWII values is intuitive: positive values indicate synergy between variables, negative values indicate redundancy between variables, and a zero value indicates the absence of k-way interactions. The significance of the KWII values can be ascertained using permutation- or bootstrap-based methods. The entropy calculations for the expression profile combinations needed for the KWII can be done using kernel-based density estimation of the empirical probability distributions in a nonparametric fashion, which further highlights the versatility and flexibility of the approach. Furthermore, to identify genes that may serve as ideal targets for novel treatment strategies, we shall examine the selected subsets of mRNA and protein expression shown to be significant. If we independently discover the same mRNA/protein to be differentially expressed in both the genomic and the proteomic profiles, the chance of error is greatly reduced; thus, by combining the results from genomic and proteomic profiles, we gain a significant increase in the power of detecting differentially expressed genes. We also plan to analyze genetic data from SNPs together with the gene expression data of NP and LTNP subjects. As an example, consider a single gene G whose expression levels are measured across T time points and N individuals. Let G(n, t) denote the expression level for the nth individual at the tth time point, and let Gm denote the mean expression level for this gene across all individuals. Then, treating each G(n) as a random variable, its KLD from the distribution of Gm can be determined using empirical estimation methods. Finally, these KLD values for G can be analyzed for association with combinations of genetic markers using the KWII, to detect genes whose expression levels differ in the presence of different alleles at a marker position. This approach can be extended to the analysis of combinations of multiple genes and multiple markers by treating the data as multivariate.

Gene biomarker selection. From the above analysis, we expect to obtain a set of genes displaying differential expression between the HIV-1 infected patient cohorts, LTNP and NP. We shall select 10 target genes for each comparison and use both Q-PCR and proteomics approaches to determine the optimal mRNA and protein signatures for single-gene biomarkers from multiple patient cohorts and healthy individuals. The Q-PCR and proteomics approaches will provide, for each sample, an accurate measurement of the mRNA and protein concentration, which will be used to construct receiver operating characteristic (ROC) curves for the determination of sensitivity and specificity. The percentage of HIV-infected patients whose mRNA and/or protein concentration passes a selected threshold (based on the above data analysis) is defined as the "sensitivity"; the percentage of healthy individuals whose mRNA and/or protein concentration does not pass the selected threshold is defined as the "specificity". We shall look for an optimal mRNA and/or protein concentration threshold, with the maximum combination of sensitivity and specificity on the ROC curve, as the single-gene biomarker.
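A minimal sketch of the threshold selection just described (the concentrations are simulated; cohort sizes and distribution parameters are invented for illustration) picks the point on the ROC curve that maximizes sensitivity plus specificity, i.e., Youden's J statistic:

```python
# Minimal sketch: choosing a single-gene biomarker threshold from a ROC curve
# by maximizing sensitivity + specificity (Youden's J). In practice the
# concentrations would come from Q-PCR or proteomic assays.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
conc_patients = rng.lognormal(mean=1.0, sigma=0.5, size=65)  # HIV-infected cohort
conc_healthy = rng.lognormal(mean=0.5, sigma=0.5, size=65)   # healthy controls

y_true = np.concatenate([np.ones(65), np.zeros(65)])         # 1 = patient
scores = np.concatenate([conc_patients, conc_healthy])

fpr, tpr, thresholds = roc_curve(y_true, scores)
j = tpr - fpr                                                # Youden's J per threshold
best = np.argmax(j)
print(f"threshold={thresholds[best]:.3f}, "
      f"sensitivity={tpr[best]:.2f}, specificity={1 - fpr[best]:.2f}")
```

Maximizing Youden's J is one common way to realize the "maximum sensitivity and specificity combination"; other criteria, such as cost-weighted optima, could be substituted.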
To reduce false predictions, we shall integrate the results from both Q-PCR and proteomics and select as gene biomarkers those genes showing similar trends in their ROC curves. For HIV-infected patients before and after drug treatment, a conceptual model can be designed similarly. In addition to single-gene biomarkers, we shall also look for multi-gene biomarkers. This will first be done by building classification and regression tree models using the thresholds from the single-gene biomarker analysis above. Out of the 10 selected genes, we shall investigate biomarkers using 3 or 4 genes, yielding C(10,3) = 120 and C(10,4) = 210 classification and regression trees, respectively. With 65 subjects in each group, this method will classify the 130 samples into final patient and healthy groups in 3 or 4 splitting steps. As in the single-gene biomarker analyses, the percentage of the 65 patients assigned to the classified patient group is the "sensitivity", and the percentage of the 65 healthy individuals assigned to the classified healthy group is the "specificity". The few gene combinations with maximum sensitivity and specificity will be used as multi-gene biomarkers. As with the single-gene biomarkers, we shall integrate the results from both Q-PCR and proteomics to reduce false predictions. To incorporate other phenotype information into our analyses, we shall also build a regression model integrating each patient's age, gender, and medical history. The best model will be sought using backward stepwise regression and validated by a leave-one-out cross-validation approach. These models will provide a useful means of predicting disease states, and HIV patients' responses to drug treatment, from expression profiles.

Genetic network reconstruction. A full understanding of virtually any complex biological system requires the identification of the regulatory networks that control gene expression within that system. Such networks are composed of genes, the transcription factors that regulate them and, crucially, the cis-regulatory sequences on which the transcription factors act. Fully comprehending these regulatory networks is fundamental to understanding normal development, progression to disease, and response to pharmacological agents. In eukaryotic organisms, the spatial and temporal regulation of a gene's expression, and its expression level, are generally mediated by multiple transcription factors (TFs). Therefore, identifying synergistic TFs and elucidating the relationships among them are of great importance for understanding gene regulatory networks. Previous methods for the identification of synergistic TFs are based on either TF enrichment from co-regulated genes or phylogenetic footprinting. Despite the success of these methods, both have limitations. For example, methods based on phylogenetically conserved sequences, although they can greatly reduce the false prediction rate (Wasserman and Sandelin, 2004), risk missing potentially significant observations. Moreover, if the species compared are very closely related, nonfunctional sequences may not have diverged enough to allow functional sequence motifs to be identified; conversely, if the species are distantly related, short conserved regions may be masked by nonfunctional background sequences. We shall employ a new strategy to identify synergistic TFs.
First, information from both genomics and proteomics will be integrated to find genes co-regulated in the HIV-infected patient cohorts versus healthy individuals. Human orthologous promoter sequences within 1 kb upstream of the annotated transcriptional start sites (TSS) will then be obtained for these co-regulated genes from the Database of Transcriptional Start Sites (Yamashita et al., 2006). The orthologous promoter sequences will be searched for transcription factor binding sites (TFBSs) using the Match® program (Kel et al., 2003) and ~550 position weight matrices (PWMs) from the professional TRANSFAC 9.1 database (Matys et al., 2003). To minimize false predictions, we shall use a novel approach developed in-house (Hu et al., 2007) to detect synergistic TFs among these co-regulated genes. In this approach, rather than aligning the regulatory sequences from orthologous genes and then identifying conserved TFBSs in the alignment, we propose a new concept of function conservation at two levels: the first is functional conservation of TFs between species; the second is functional conservation of TFBSs between the promoter sequences of individual orthologous genes. The algorithm for this function conservation approach has been implemented at three levels: (1) functional TFBS enrichment based on the pattern of binding-site arrangement on the promoters of orthologous genes under a distance constraint; (2) enrichment of overlapping orthologous genes whose regulatory sequences contain the enriched TFBS combinations; and (3) integration of function conservation from both the TF and TFBS levels by correlation analyses. Genome-wide TF analyses have demonstrated that this algorithm is better able to predict synergistic, functional TFBSs, TF-TF interactions, and thus genetic networks. We shall combine our novel approaches with existing tools to discover the genetic network involved in HIV disease and shall build publicly available application tools (see database section) into the database system.

Identification of genetic contributions to HIV and drug response. One challenge of the post-genomic era is to develop robust strategies for identifying the genetic contributions to HIV and drug response, which involve multiple gene interactions. Central to an understanding of such complex systems, and of the role of genes underlying individual responses to drug treatment, are effective models and software tools that characterize genetic variation based on genomic data, mostly SNPs. Although SNP data are available from the International HapMap Project (The International HapMap Consortium; Li, 2005), whole-genome association studies cannot be performed directly on the genotypes or allele frequencies of individual markers, owing to the relatively low power of each SNP and the huge number of SNPs overall. To increase the power of detection, we have chosen closely linked SNPs, inherited together over evolutionary history, to find specific patterns in the non-random association between alleles and the haplotype structures they form. We shall employ algorithms such as haplotype-based methods (Li, 2005) or clustering techniques (Liu et al., 1999) for haplotype mapping to find disease susceptibility (DS) genes whose haplotypes embed causal mutations, especially mutations of recent origin. Haplotypes carrying these mutations tend to be close to one another in haplotype space because of linkage disequilibrium, whereas other haplotypes can be regarded as random noise sampled from the haplotype space.
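As a toy illustration of testing haplotype-phenotype association (the haplotypes, counts, and the apparent enrichment are all invented; the project's haplotype-mapping algorithms are considerably more sophisticated), a chi-square test on a haplotype-by-phenotype contingency table could look like this:

```python
# Illustrative sketch: association between haplotypes over a few linked SNPs
# and progression phenotype (LTNP vs. NP), via a chi-square test on a
# haplotype-by-phenotype table. All values below are hypothetical.
import numpy as np
from scipy.stats import chi2_contingency

haplotypes = ["ACG", "ATG", "GCG", "GTA"]
counts = np.array([
    [22,  8],   # ACG enriched in LTNP (hypothetical DS-gene haplotype)
    [10, 20],
    [ 9, 19],
    [ 9, 18],
])  # columns: (LTNP carriers, NP carriers)

chi2, p, dof, _ = chi2_contingency(counts)
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.4f}")
for hap, (n_ltnp, n_np) in zip(haplotypes, counts):
    print(f"haplotype {hap}: LTNP={n_ltnp}, NP={n_np}")
```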
The association between genetics and disease state will not only identify disease-related genes but also provide fingerprints for personalized medicine.

Tools
Interactive mining. We propose to design InterM, an integrated environment for the interactive exploration of coherent expression patterns and co-expressed genes/proteins in gene/protein expression data for HIV infection. Our system will integrate the users' domain knowledge and effectively handle the high connectivity in the data. Based on our density-based approach, InterM models a cluster of co-expressed genes as a dense area (Jiang et al., 2004; 2005). Through this density-based model, InterM can distinguish co-expressed genes from intermediate genes by their relative density. The coherent expression pattern in a dense area is represented by the expression profile of the gene with the highest local density in that area. The other genes in the same dense area can be sorted in a list according to the similarity between their expression profiles and the coherent expression pattern; since intermediate genes have low similarity to the coherent pattern, they fall at the rear of the sorted list. Users can set a similarity threshold and thus cut the intermediate genes from the cluster. A user should be able to explore the co-expressed genes and their coherent patterns by unfolding a hierarchy of genes and patterns, starting from the root. The three main components of InterM are shown in Figure 15. Users can explore the coherent patterns in the data set and save/load them through the pattern manager (Figure 15(a)). InterM has a working zone (Figure 15(b)) that integrates the parallel coordinates, the coherent pattern index graph (Jiang et al., 2005), and a tree view. When users select a node in the tree view, the working zone displays the corresponding expression profiles and coherent pattern index graph. Users can click on the coherent pattern index graph to split the node or to roll back previous split operations; the tree structure is adjusted dynamically according to these exploration operations. We will also design a gene annotation panel: given a specific node of the hierarchical tree, the panel sorts the genes belonging to that node and displays the name and annotation (if any) for each gene (Figure 15(c)). This InterM function helps integrate such domain knowledge into the system.

Figure 15: Screen snapshots of InterM: (a) pattern manager; (b) working zone; (c) gene annotation panel.

Visualization tool. We shall expand the scope of our VizStruct tool (Zhang et al., 2004) to meet the needs of the proposed HIV research by providing the following functions for viewing the structure of genomic data. Zip zooming view. We propose a zip zooming view method that extends circular parallel-coordinate plots. Instead of showing all dimensional information, it combines several adjacent dimensions and displays the reduced-dimension information. The number of dimensions displayed, which we call the granularity setting, can be set by the user, allowing different levels of combination. Two distant points in the input space may be mapped to nearby points in 2D space, or vice versa. One remedy is to use the zip zooming view to inspect such points more closely. Another is to let the user interactively adjust the weights of individual dimensions to change the data distribution in 2D space; this can readily separate falsely mapped points.
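To suggest how such a weighted, dimension-merging projection might work, here is a rough sketch of a weighted circular-coordinates mapping in the spirit of the zip zooming view (this is an illustration only; the function, its parameters, and the merging rule are assumptions, and the actual VizStruct mapping may differ):

```python
# Rough sketch: each dimension is anchored on a circle, adjacent dimensions can
# be merged ("zipped"), and per-dimension weights let the user pull apart points
# that collide in 2D. Illustrative only, not the VizStruct implementation.
import numpy as np

def circular_project(X, weights=None, zip_factor=1):
    """Map rows of X (n_samples x n_dims) to 2D via weighted circular coordinates."""
    if zip_factor > 1:                          # merge groups of adjacent dimensions
        n = X.shape[1] // zip_factor * zip_factor
        X = X[:, :n].reshape(X.shape[0], -1, zip_factor).mean(axis=2)
    d = X.shape[1]
    if weights is None:
        weights = np.ones(d)
    angles = 2 * np.pi * np.arange(d) / d       # anchor each dimension on a circle
    anchors = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    W = X * weights                             # user-adjustable dimension weights
    return (W @ anchors) / W.sum(axis=1, keepdims=True)

X = np.random.default_rng(2).random((100, 12))  # 100 samples, 12 expression values
coords = circular_project(X, zip_factor=3)      # zip every 3 adjacent dimensions
print(coords.shape)                             # (100, 2)
```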
By adjusting the coordinate weights of the dataset, the data's original static view becomes dynamic, which may compensate for the information lost in the mapping. Dimension tour. To tackle the multi-dimensional nature of gene data effectively, we will design an animation method called the dimension tour: a sequence of scatterplots or zip zooming views in which each frame has a specific dimension-parameter setting. A dimension tour can thus be defined as a function of time. Our system can be used in a variety of ways during exploratory array data analysis. Figure 16 shows three snapshots taken from a dimension tour of a sample data set.

Figure 16: Three snapshots from a dimension tour of a sample data set.

We shall also develop a visualization tool to provide an integrated view of the various genomic and proteomic data in the warehouse. Figure 17 shows an example in which gene expression changes are mapped onto the known protein interactome and Gene Ontology annotations are used to characterize the main functions of the highly connected graph components (e.g., signal transduction, regulation of cell cycle, cell-cell signaling, DNA replication/repair, and apoptosis). The graph provides an integrated, global view of cellular changes due to disease or in response to treatment.

Figure 17: Genomic and proteomic data integration (example: tumor versus normal colon tissues). Red nodes represent up-regulated genes, green nodes down-regulated genes, and black nodes genes whose expression is unknown.

At present, information and tools for the systematic analysis of the host and viral genes and proteins associated with HIV-1 infection are limited and scattered across a wide range of online resources. The application tools that we shall develop will be user-friendly and will constitute a bioinformatics resource that integrates the genomic and proteomic analysis of host and HIV-1 proteins and correlates this information with clinical data from unique HIV-1 patient cohorts. Our ultimate aim is to use the systems biology approach to better understand HIV-1 disease and to build a dedicated database that makes this information readily available to the public.

Software Dissemination and Timeline
Data resources and analysis tools that we develop will be made publicly available according to NIH policies. The tools will be geared for use by bench biologists and users with a wide range of quantitative skills. We shall establish procedures and training to ensure that the software and tools developed by this project are of production quality and easy to use by biomedical researchers outside our university, and we shall incorporate these software packages into a novel computing environment that facilitates their widespread distribution to the biomedical research community. UB has a strong foundation in place on which this effort will be based: the proposed environment will be built around a system currently in place to support biomedical research at UB, namely BioACE (Bioinformatics Application Computing Environment).
We shall develop a portal that ties together all of the services an existing or prospective client will need, including: (1) a means to request/register for services; (2) links to online training materials; (3) application and tool downloads; and (4) access to database applications. We intend to implement mechanisms whereby authorized users may access these services and any documentation securely; access may be via direct download or via media such as CDs or DVDs. We shall also develop a database to hold bibliographic and reference materials relevant to the project. The timeline for the project follows.

Timeline of Research Development Plan
Year 1 (4/1/2009 to 3/31/2010): AIM I: patient sample collection; genomic analysis; proteomic analysis (2D-DIGE, MALDI-TOF, NanoLC-MS/MS). AIM III: design of data schema and data collection.
Year 2 (4/1/2010 to 3/31/2011): AIM I: proteomic analysis continued (iTRAQ, SILAC, PTM analysis). AIM II: quantitation of allelic variants. AIM III: data integration and data loading.
Year 3 (4/1/2011 to 3/31/2012): AIM II: SNP analysis. AIM III: design of data analysis methods.
Year 4 (4/1/2012 to 3/31/2013): AIMs I, II, and III: application of the data analysis methods to genomic, proteomic, and clinical data.
Year 5 (4/1/2013 to 3/31/2014): AIMs I, II, and III: results analysis and summarization; manuscript preparation and submission; tool dissemination; preparation of a future grant proposal based on the results obtained from this investigation.