D. RESEARCH DESIGN & METHODS
Patient demographics and classification.
Study subjects are derived from the Immunodeficiency Services Clinic of the Erie County Medical Center,
Buffalo, NY, the largest treatment provider to HIV-infected patients in Western NY. This is the first HIV
adherence clinic in the U.S. and the first to use Therapeutic Drug Monitoring (TDM). Our patient enrollment
is 1200, with an age range of 16-78 yr. Patient demographics are shown in Figure 3. Of the patients,
34% are female and 66% are male. The ethnic distribution includes 507 (42.2%) African Americans, 471
(39.3%) Caucasians, 133 (11.1%) Hispanics, and 89 (7.4%) other ethnicities. Clinical samples from HIV-1
infected patients are obtained by Chiu-Bin Hsiao, M.D., Immunodeficiency Services Director and Co-I of
this application. Donors are apprised of the study and informed consent is obtained consistent with the
policies of the UB Health Sciences IRB. A standard data collection form that includes demographic, social,
HIV risk factor, clinical, treatment and hospitalization data is completed for each subject. The Centers for
Disease Control and Prevention’s classification system for HIV-1 infection is used to determine clinical
category; this system provides mutually exclusive states based on the intersection of clinical, virological, and
immunological criteria. We adhere to the following criteria for the classification of LTNP patients: a)
duration of infection >12 yr; b) CD4 count >500; c) viral load <10,000 RNA copies/ml and d) no
antiretroviral therapy. Based on these criteria, we currently have at least 60 LTNP subjects available for
this study. A small percentage of HIV-infected patients who rapidly progress to AIDS within 4 yr after
primary infection are classified as rapid progressors (RP) and are not included in this study. The remainder
of our clinic population is designated normal progressors (NP) and they will serve as a clinical comparator
to our LTNP cohort. A clinical database, created with Microsoft Access, including age, gender, duration of
infection, CD4 count, HIV-1 viral load, treatment regimens, and disease outcomes is available for
correlation with our genomic and proteomic data. Statistical analyses are performed with SAS software.
Descriptive statistics are used to describe the population (Figure 3) and Chi-square analysis is used to test
for differences between groups. Logistic regression is used to determine independent factors associated
with LTNP while controlling for other factors. All subjects are encoded to prevent individual identification
and to provide the utmost personal privacy consistent with the guidelines of the Health Insurance
Portability and Accountability Act of 1996 (HIPAA). Standard clinical methods are briefly described below.
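For illustration, the chi-square group comparison described above can be sketched in Python for a 2×2 table (the proposal's analyses are run in SAS; the function name and example data below are hypothetical):

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table,
    e.g. LTNP status (rows) vs a binary risk factor (columns)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    # Expected counts under independence: row_total * col_total / n.
    expected = [[(a + b) * (a + c) / n, (a + b) * (b + d) / n],
                [(c + d) * (a + c) / n, (c + d) * (b + d) / n]]
    return sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
               for i in range(2) for j in range(2))
```

The resulting statistic would then be compared against a chi-square distribution with 1 degree of freedom, a step that SAS performs directly.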
CD4 counts:
Absolute counts of circulating CD4+ T lymphocytes are obtained with a dedicated CyFlow® Counter.
Qualitative HIV-1 DNA PCR amplification:
We use the Amplicor HIV-1 test (Roche Diagnostic Systems, Inc.), the only FDA-licensed kit available
for qualitative HIV-1 DNA PCR amplification, to diagnose HIV-1 infection. The assay utilizes recombinant
Thermus thermophilus enzyme to catalyze both reverse transcription and DNA amplification of a 142-base
HIV gag gene sequence. The assay has an analytic sensitivity of 200 RNA copies/ml and a quantitation
limit of 400 RNA copies/ml, with a dynamic range up to 750,000 RNA copies/ml. The Amplicor HIV-1
Monitor Ultrasensitive Specimen Preparation Protocol, (Roche Molecular Systems) concentrates virus by
centrifugation and increases sensitivity to 50 copies/ml; the quantitation limit is 200 copies/ml.
p24 assay:
A commercial ELISA kit (ZeptoMetrix, Buffalo, NY) is used to quantitate p24 in plasma samples. The
sensitivity of the p24 assay is 4 pg/ml.
Western Blot Analysis of p24:
Total protein is extracted from peripheral blood mononuclear cells (PBMC) of subjects using the
Mammalian Protein Extraction Reagent (Pierce, Rockford, IL); 30 µg of protein is loaded per lane and
separated on a 4-20% Tris-glycine SDS-PAGE gel. Membranes are probed with p24 monoclonal antibodies.
Anticipated Results: We diligently screen our HIV-1 infected patients and have found ~5-8% of them
meet criteria for classification as LTNP; thus, we anticipate no problems in procuring adequate numbers of
subjects for this unique patient cohort. The fundamental mechanisms that determine the LTNP state
remain largely unexplored. Thus this study may yield previously unrecognized biomarkers of the LTNP
state that may serve as predictors of disease progression and could yield new therapeutic agents.
Research Design and Methods for Specific Aim I:
Genomic Analysis:
Gene arrays are performed to identify functional genes in PBMC from HIV-1 infected NP and LTNP
patients. RNA from PBMC is used for cDNA microarrays. A total of 15 arrays are run from each subject
cohort, which is sufficient to derive statistically significant differences as determined by a power analysis.
Differences in expression levels are determined using at least 2-fold changes between the patient groups.
A p value of <0.05 using the non-parametric Wilcoxon–Mann–Whitney test is considered as significant.
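The 2-fold screening step can be sketched as follows (an illustrative Python fragment; `differential_genes` is a hypothetical helper, and the Wilcoxon–Mann–Whitney p-value would in practice come from a statistics package such as R or SAS):

```python
from math import log2

def differential_genes(np_expr, ltnp_expr, fold=2.0):
    """Flag genes whose mean expression differs by at least `fold`
    between the NP and LTNP cohorts (the 2-fold screen described in
    the text).  Inputs map gene -> list of expression values."""
    flagged = {}
    for gene in np_expr:
        mean_np = sum(np_expr[gene]) / len(np_expr[gene])
        mean_ltnp = sum(ltnp_expr[gene]) / len(ltnp_expr[gene])
        ratio = log2(mean_ltnp / mean_np)   # log2 fold change, LTNP vs NP
        if abs(ratio) >= log2(fold):
            flagged[gene] = ratio
    return flagged
```

Genes passing this filter would then be tested with the non-parametric Wilcoxon–Mann–Whitney test at p < 0.05.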
PBMC Isolation: PBMC are isolated from 20 ml blood samples from patients. Blood is diluted 1:2 with
Dulbecco's PBS (without MgCl2 and CaCl2) containing 5 mM Na2-EDTA and then overlaid on 15 ml Ficoll-Paque® Plus
(Amersham-Pharmacia, Piscataway, NJ) in a 50 ml tube. Samples are centrifuged for 20 min at 700 × g at
20°C. The PBMC interface is carefully removed, washed twice with PBS/EDTA, resuspended in 2 ml of
complete RPMI media and the total number of cells counted using a hemocytometer.
RNA Extraction: Cytoplasmic RNA is extracted from 3 × 10⁶ PBMC/ml using Trizol reagent (GIBCO-Life
Technologies, Grand Island, NY) (Chomczynski and Sacchi, 1987). RNA is quantitated using an ND-1000
spectrophotometer (NanoDrop™, Wilmington, DE) and isolated RNA is stored at –80°C.
Gene Microarrays:
Production of cDNA microarrays: The cDNA arrays are performed at the adjacent Roswell Park Cancer
Institute, Microarray and Genomics Core Facility. They consist of ~6000 cDNA clones (Research Genetics)
selected based on their association with the immune response as cited in the scientific literature. Each
clone is amplified from 100 ng of bacterial DNA by performing PCR amplification of the insert using M13
universal primers for the plasmids represented in the clone set (5'–TGAGCGGATAACAATTTCACACAG–
3', 5'–GTTTTCCCAGTCACGACGTTG–3'). Each PCR product (75 µl) is purified by ethanol precipitation,
resuspended in 25% DMSO and adjusted to 200 ng/µl. Printing solutions are spotted in duplicate on Type A
Schott Glass slides using a MicroGrid II TAS Arrayer and MicroSpot 10K split pins (Apogent Discoveries).
Preparation and hybridization of fluorescent labeled cDNA: A total of 30 RNA samples (30 cDNA arrays),
15 each from NP and LTNP donors, are screened for gene expression. From each RNA sample, cDNA is
synthesized and labeled with Cy3 (NP) and Cy5 (LTNP) dyes using the Atlas Powerscript Fluorescent
Labeling Kit (BD Biosciences). For each reverse transcription reaction, 2.5 µg total RNA is mixed with 2 µl
of random primers (Invitrogen) in a total volume of 10 µl, heated to 70°C for 5 min and cooled to 42°C. An
equal volume of reaction mix (4 µl 5X first-strand buffer, 2 µl 10X dNTP mix, 2 µl DTT, 1 µl deionized H2O,
and 1 µl Powerscript Reverse Transcriptase) is added to the sample. After 1 hr at 42°C, RNA is degraded
by incubating at 70°C for 5 min. The mixture is cooled to 37°C and incubated for 15 min with 0.2 µl RNase
H (10 units/µl). The resultant amino-modified cDNA is purified, precipitated, and fluorescently labeled.
Uncoupled dye is removed from the labeled probe by washing 3 times on a Qiaquick PCR Purification Kit
(Qiagen). The probe is eluted in 60 µl elution buffer and dried in a SpeedVac. Prior to hybridization, the 2
separate probes are resuspended in 10 µl dH2O, combined, and mixed with 2 µl of Human Cot-1 DNA
(20 µg/µl, Invitrogen) and 2 µl of poly A (20 µg/µl, Sigma). The probe mixture is denatured at 95°C for 5
min, placed on ice for 1 min, and prepared for hybridization by addition of 110 µl of preheated (65°C)
SlideHyb #3 buffer (Ambion). After 5 min incubation at 65°C, the probe solution is placed on the array in an
assembled GeneTAC hybridization station module (Genomic Solutions, Inc). Slides are incubated at 55°C
for 16–18 hr with occasional pulsation of the solution. After hybridization, slides are automatically washed
in the GeneTAC station with decreasing concentrations of SSC and SDS. The final wash is 30 sec in 0.1X
SSC, followed by a 5 sec 100% ethanol dip. The slides are dried and scanned immediately on a GenePix
4200A scanner (Axon, Inc). To minimize intra-array variations, 2 hybridizations for each sample type are
performed, and labeling with the Cy dyes is interchanged, to provide technical replicates.

Figure 10: Normalization procedures for gene microarray data. A. M vs A plots before and after Lowess
normalization. Cy3- and Cy5-labeled expression intensities from 1 of the 4 slides were plotted using
M (log2(Cy3/Cy5)) vs A ((log2(Cy3) + log2(Cy5))/2) for the detection of intensity-dependent expression on
each individual slide: (a) data before Lowess normalization and (b) data after normalization. Dashed lines
show the dye labeling efficiency and red lines show the intensity-dependent dye bias. B. Distribution of
expression values from three slides under different normalizations. Cy3 (green) and Cy5 (red) labeled
expression intensities from 3 slides were plotted as boxplots to show their variances: (a) data without
normalization, (b) data after Lowess normalization, and (c) data after both Lowess and scale normalization.
C. Distribution of differentially expressed genes. The scatter plot shows the average log2 ratio between
samples from the 2 patient cohorts for the 5043 genes. Statistically significant genes (p<0.05 and
FDR<0.05) identified by 2-step analyses of t-test and SAM are shown in color, including 70 down-regulated
genes (green) and 90 up-regulated genes (red).
Image Analysis: Hybridized slides are scanned using a GenePix 4200A scanner to generate high-resolution
(10 µm) images for both Cy3 and Cy5 channels. Image analysis is performed on the raw image files using
ImaGene (version 6.0.1) from BioDiscovery Inc. Each cDNA spot is defined by a circular region. The size of
the region is programmatically adjusted to match the size of the spot. Local background for a spot is
determined by ignoring a 2-3 pixel buffer region around the spot and measuring signal intensity in a 2-3
pixel wide area outside the buffer region. Raw signal intensity values for each spot and its background
region are segmented using a proprietary optimized segmentation algorithm that excludes pixels that are
not representative of the majority pixels in that region. The background corrected signal for each cDNA
spot is the mean signal (of all the pixels in the region) minus the mean local background. The output of the
image analysis is 2 tab delimited files, one for each channel, containing all of the raw fluorescence data.
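The spot-quantification rule (mean spot signal minus mean local background, with a buffer ring ignored) can be sketched as follows. This is a simplification: ImaGene's proprietary segmentation additionally excludes non-representative pixels, and the function below assumes a simple circular spot.

```python
from statistics import mean

def corrected_signal(image, cx, cy, spot_r, buffer_px=2, bg_px=2):
    """Background-corrected spot signal on a 2D pixel grid: mean of
    pixels within `spot_r` of the spot center minus the mean of an
    annulus outside a `buffer_px`-wide ignored ring."""
    spot, background = [], []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            d = ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5
            if d <= spot_r:
                spot.append(value)
            elif spot_r + buffer_px < d <= spot_r + buffer_px + bg_px:
                background.append(value)
            # pixels in the buffer ring are ignored entirely
    return mean(spot) - mean(background)
```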
Microarray data processing and analysis: Expression data extracted from image files are first checked by a
M (log2(Cy3/Cy5)) vs A ((log2(Cy3) + log2(Cy5))/2)) plot to see if intensity-dependent expression bias exists
between spots (genes) labeled with Cy3 and Cy5 on each individual slide. On determining that intensity
dependent expression bias exists for all slides, we perform a Lowess data normalization to correct the
observed intensity-dependent expression bias. We then perform a global normalization to bring the median
expression values of Cy3 and Cy5 on all 4 slides to the same scale. This is done by selecting a baseline
array (e.g. Cy3) from 1 of the 4 slides, followed by scaling expression values of the remaining 7 arrays to
the median value of the baseline array (m̃_base):

    x_i' = (m̃_base / m̃_i) · x_i,

where m̃_i is the median expression value of the i-th array and x_i its raw intensity.
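The scaling step can be sketched in Python (illustrative only; the actual pipeline uses an in-house Perl script, and `scale_to_baseline` is a hypothetical name):

```python
from statistics import median

def scale_to_baseline(arrays, baseline=0):
    """Global (scale) normalization: multiply each array's intensities
    by m_base / m_i so every array shares the baseline array's median,
    matching x_i' = (m_base / m_i) * x_i."""
    m_base = median(arrays[baseline])
    return [[x * m_base / median(arr) for x in arr] for arr in arrays]
```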
After data normalization, the average intensity of individual genes on each slide is computed using an in-house Perl script. A total of 6000 average expression values are obtained, including empty,
dry, null, and DMSO control spots. Paired t-tests on normalized intensity with p-values <0.05 are used to
generate a list of genes with significant change in expression between NP and LTNP samples. The false
positive rate of significant genes is estimated using the SAM algorithm (Tusher et al, 2001).
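As an illustration of multiple-testing control, a Benjamini–Hochberg step-up procedure is sketched below. Note that SAM estimates the false discovery rate by permutation rather than by this formula, so the sketch conveys only the general idea of FDR control:

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: find the largest rank k
    with p_(k) <= k * alpha / m and call the k smallest p-values
    significant.  Returns a boolean flag per input p-value."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            cutoff = rank           # largest rank passing the threshold
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            significant[i] = True
    return significant
```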
Quality control measures: Ratios of housekeeping genes, G3PDH and β-actin, scaling factors, background,
and Q-values must be within acceptable limits. Fold-change is calculated from the signal log ratio.
Microarray data preprocessing: Preprocessing of microarray data is very important for enhancing
meaningful data characteristics. One of the most important procedures is data normalization, which
corrects systematic differences such as intensity-dependent expression bias and different dye efficiency
between and across datasets. To check whether a bias exists between gene spots labeled with Cy3 and
Cy5 on each individual slide, we first make an M vs A plot for genes from each individual slide. Figure 10A
[a] shows that the data display bias, not only from dye labeling efficiency, indicated by the non-zero (dashed) lines
from the M vs A plot, but also from intensity-dependent dye bias indicated by the Lowess fitting line in red.
Therefore, the within slide normalization is first carried out using intensity–dependent normalization. This
procedure corrects most of the expression bias, as shown in Figure 10A [b]. However, this normalization cannot
correct the bias introduced from different slides (Figure 10B [a] & [b]). Therefore, scale normalization is
used to bring the overall intensity of Cy5 and Cy3 in each slide to the same level (Figure 10B [c]). To
remove genes with large expression variation between replicates, we perform data filtering by measuring
the repeatability of gene expression using the coefficient of variation (CV) (http://www.r-project.org). This is
done by computing the CV of Cy3/Cy5 ratios for all individual genes from 4 slides, followed by constructing
a 99% confidence interval for the CV values of all genes. Genes with CVs outside the upper 99%
confidence interval limit are regarded as unreliable measurements and are removed from further analysis.
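The CV-based filter can be sketched as follows (illustrative Python; here the cutoff is passed in directly, whereas the analysis derives it from the upper 99% confidence limit of the CV distribution in R):

```python
from statistics import mean, stdev

def filter_by_cv(ratios_by_gene, max_cv):
    """Keep only genes whose Cy3/Cy5 ratios are reproducible across
    replicate slides, i.e. whose coefficient of variation
    (stdev / mean) does not exceed `max_cv`."""
    kept = {}
    for gene, ratios in ratios_by_gene.items():
        cv = stdev(ratios) / mean(ratios)
        if cv <= max_cv:
            kept[gene] = ratios
    return kept
```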
Gene ontology analysis:
The use of gene ontologies enables us to summarize results of quantitative analyses and annotate
genes and their products with a limited set of attributes. The 3 organizing principles of gene ontology are
molecular function, biological process, and cellular component. A gene product has one or more molecular
functions and is used in one or more biological processes; it might be associated with one or more cellular
components. Molecular function describes activities, such as catalytic or binding effects. In gene ontology,
molecular functions represent activities rather than entities (molecules or complexes) that perform the
actions and do not specify where or when, or in what context, the action takes place. Molecular functions
generally correspond to activities that can be performed by individual gene products, but some activities
are performed by complexes of gene products. A biological process is a series of events accomplished by
one or more ordered assemblies of molecular functions. It may be difficult to distinguish between a
biological process and a molecular function, but the general rule is that a process must have more than
one distinct step. However, a biological process is distinct from a pathway. GeneSifter software (VizXlabs)
allows user-defined filtering to focus on data of greatest interest and these queried files can be exported for
secondary analyses. GeneSifter software rapidly characterizes the biology involved in a particular
experiment, and helps identify specific genes of interest from a list of potential targets by identification of
broad biological themes.
Potential results: As shown in Preliminary Results, our data set demonstrates reproducible and significant
changes in genes that had been previously implicated or hypothesized to have a role in pathogenesis of
HIV-1 infection. Gene ontology analysis classifies the significantly modulated genes into distinct functional
groups (Figure 6). These data provide insight into the signaling mechanisms that are involved in HIV-1
disease progression. However, cDNA array data must be interpreted with caution, since data analysis may yield a large
number of hypothetical genes that are not specific to HIV-1 disease. This
method can identify unique genes that could serve as new biomarkers for the progression of HIV infection.
Proteomic Analysis:
Proteomic studies are undertaken to analyze differential protein expression in lysates of PBMC from
NP and LTNP patients to identify unique proteins associated with a specific cohort. Proteins are extracted
from lysates of PBMC from NP and LTNP and run on 2-dimensional difference gel electrophoresis (2D-DIGE; Ettan™ DIGE system, Amersham Biosciences, Piscataway, NJ), available as a core facility at our
institution. A total of 15 2D-DIGE runs are performed for samples from each cohort, which is sufficient to
derive statistically significant differences as determined by power analysis. The Ettan DIGE system yields
highly accurate, quantitative data and the key benefit is that multiplexing enables the incorporation of the
same internal standard on every 2-D gel. Extracts from NP and LTNP are co-electrophoresed on the same
gel and quantified, avoiding problems associated with registering independently-derived gel maps. This
technique depends on pre-derivatization of samples with a fluorescent Cy-dye (Cy-2, Cy-3, or Cy-5). These
dyes are molecular weight and charge-matched, have non-overlapping excitation and emission spectra,
and can be covalently attached to amino groups of proteins. At the end of the run, DeCyder Differential
Analysis software, consisting of Differential In-gel Analysis (DIA) and Biological Variation Analysis (BVA)
components, permits automated detection of spots, background subtraction, quantitation, normalization,
internal standardization and intergel matching. This software can screen 2500 to 10,000 spots. We select
up to 32 individual protein spots showing the highest fold-increase or decrease in intensity and a
corresponding statistically significant p value between NP and LTNP from the 2D gels. Spots are robotically
picked from the gels and deposited into the wells of a microtiter plate. This is followed by in-gel tryptic
digestion and peptide isolation. Digested spots are analyzed in our Proteomics Core Facility. Our initial
proteomics data were determined by matrix-assisted, laser-desorption-ionization/time of flight (MALDI-TOF)
mass spectrometry (MS).
Protein Identification by Peptide Mass Fingerprinting:
Most spots are initially analyzed by MALDI-TOF using peptide mass fingerprinting (PMF). The MALDI-TOF
instrument is relatively tolerant of salts and other common contaminants extracted with peptides from an in-gel
digestion. Extracts are pooled, dried, and reconstituted in 50% ACN, 0.1% TFA, mixed 1:1
with CHCA (4 mg/ml) in 50% ACN, 0.1% TFA, and an aliquot is spotted on specified wells of MALDI plates.
External calibration between m/z 1000-3000 and tuning of the MALDI-TOF MS instrument (Resolution
>10,000 FWHM) is routinely achieved using human adrenocorticotropic hormone (ACTH) fragment 18-39,
which provides a monoisotopic calibration peak at m/z 2465.199 (M+H)+. The instrument is further calibrated
using a PEG (1K, 2K, and 3K) mixture, and sensitivity is checked at a signal-to-noise ratio of 124:1 using
10 fmol GFP before analyzing samples. Positive identification of 50 fmol yeast aldehyde dehydrogenase
peptide mixture (Waters Corp) by PMF is used routinely as quality control. High mass accuracy, automated
MALDI-MS spectra are acquired in the reflectron positive mode from each sample spot followed by
acquisition from the lock-mass well (ACTH standard) for external lock-mass correction. The acquired
spectra are then subjected to PMF-based protein identification through database matching using PLGS (v 2.3) and
Mascot (v 2.2, 2-CPU licensed version). Remaining aliquots of the protein spots not identified by this
approach or that give ambiguous results are subjected to nanospray or LC/MS/MS analysis.
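The core idea of peptide mass fingerprinting, an in-silico digest whose peptide masses are matched against observed m/z values, can be sketched as follows. This is illustrative Python, not the PLGS/Mascot implementation: only a few residue masses are shown, and trypsin's no-cleavage-before-proline rule is omitted.

```python
# Monoisotopic residue masses in Da (a small subset for illustration);
# water is added once per intact peptide.
RESIDUE = {"G": 57.02146, "A": 71.03711, "K": 128.09496, "R": 156.10111}
WATER = 18.01056

def tryptic_digest(seq):
    """Cut a sequence C-terminal to every K or R (simplified trypsin:
    the real enzyme does not cut before proline)."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        if aa in "KR":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides

def peptide_mass(peptide):
    """Monoisotopic mass of an uncharged peptide."""
    return sum(RESIDUE[aa] for aa in peptide) + WATER
```

In a real PMF search, each computed mass would be compared to the observed peak list within a small mass tolerance, and proteins ranked by the number of matching peptides.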
Nano high-performance liquid chromatography tandem mass spectrometry (nano-LC-MS/MS)
coupled with SEQUEST for protein identification.
MS data are analyzed by SEQUEST software and searched against the latest version of the entire
National Center for Biotechnology's Conserved Domains Protein Database for fragments that are
approximately the same mass, using a 16-processor IBM cluster computer. The closest 500 sequences
provide theoretical tandem mass spectra, with fragment ions produced depending on the amino acid
sequence of the peptide. Experimental spectra are compared to theoretical spectra using cross correlation
analysis. Results are filtered with DTASelect software (Tabb et al, 2002) to limit the possibilities. Criteria for
a positive peptide identification are a cross-correlation value (Xcorr) of 2.5 or greater for a 2+ ion,
3.5 or greater for a 3+ ion, a delta Xcorr of 0.1 or greater, and at least 1 tryptic terminus (Hunter et al, 2002).
The best match is reported as an identified protein. Post-translational modification of proteins is indicated
when spots with different migratory properties on 2-D DIGE are subsequently identified as the same protein
by tandem MS. When multiple spots of the same protein migrate in a diagonal pattern, this is consistent
with post-translational modification by glycosylation. However, a linear migratory pattern of several spots of
the same protein on 2-D gels suggests post-translational phosphorylation. These qualitative differences
can be resolved by isolation of the protein of interest and chemical analysis of post-translational
glycosylation and/or phosphorylation. For verification of protein spot identification, we use immuno-identification on 2D western blots, which may reveal additional immunoreactive spots representing isoforms
or post-translational modifications that can be analyzed as above. N-terminal sequencing can be carried
out on spots to confirm their identity. Sequences of new proteins unique to a particular sample are input in
SP3 and SPARKS 2 to determine their possible structural folds and in INSPIRE 2.0 and INSP3IRE for their
potential binding sites and possible binding partners. Functions of identified proteins will be tested. Lastly,
changes in protein expression are correlated with changes in gene expression. To limit variations in the
analysis of genomic and proteomic data, we repeatedly measure gene and protein expression in aliquots of
the same sample and do reciprocal interchanging of the Cy dye labeling of the proteins.
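The acceptance thresholds quoted above for a positive peptide identification can be captured in a small helper (a sketch; `passes_filter` is a hypothetical name, and DTASelect itself applies more elaborate logic):

```python
def passes_filter(xcorr, charge, delta_cn, tryptic_termini):
    """Apply the stated criteria: Xcorr >= 2.5 for 2+ ions, >= 3.5 for
    3+ ions, delta Xcorr >= 0.1, and at least one tryptic terminus."""
    thresholds = {2: 2.5, 3: 3.5}
    if charge not in thresholds:
        return False        # no threshold stated for other charge states
    return (xcorr >= thresholds[charge]
            and delta_cn >= 0.1
            and tryptic_termini >= 1)
```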
Limitations of DIGE based proteomics-solutions and strategies: Limitations of DIGE-based proteomics
include: low-abundance proteins, hydrophobic proteins such as integral membrane proteins, poor
resolution of proteins at extreme pI values, poor resolution of high-MW proteins, and difficulty in identifying
low-MW proteins. There are varying reasons for failure of protein identification, ranging from incorrect
excision of the gel spot, poor extraction of peptides during in-gel digestion, incomplete
trypsinization, faulty spotting on the MALDI plate, insufficient sensitivity of the mass spectrometer,
and suppression of ionization, to loss of mass accuracy and low peptide numbers from low-MW proteins.
Strategies to improve the identification of low abundance proteins include using higher protein
concentrations and enrichment methods (Quin et al., 2005), especially in the study of post-translational
modifications (Larsen, 2005). In case of failure or ambiguous identifications, we routinely use LC-MS/MS
analysis (Q-ToF Premier interfaced with a Waters NanoAcquity UPLC, or LTQ instruments interfaced
with Eksigent or GE Healthcare MDLC nano-flow chromatography systems). Proteins containing more than
one transmembrane domain are difficult to separate by 2D-DIGE. Use of strong detergents, such as SDS,
interferes with isoelectric focusing. Thiourea, along with urea and detergents such as CHAPS, have
improved but not completely solved the problem (Luche et al., 2003). Other detergents, such as
oligooxyethylene, sulfobetaine, dodecyl maltoside, and decaethylene glycol monohexadecyl ether, can also be
used in 2D-DIGE (Luche et al., 2003; Santoni et al., 2000).
PMF by MALDI suffers from the inherent limitation of not being able to resolve mass changes in
peptides due to post-translational modifications. This hinders protein identification. We usually use tandem
mass spectrometry with the Q-ToF Premier interfaced with the NanoAcquity UPLC in these situations.
Tandem MS uses collision-induced dissociation (CID) in the presence of an inert gas to fragment individual
peptide ions. The second mass-analyzer measures the molecular masses of the fragments. The product
ion spectrum allows detailed analysis of the specific selected ion. The spectra acquired by LC-MS/MS are
routinely processed using MassLynx 4.1, followed by protein identification through database matching of at
least 2 matched unique sequenced peptides, using PLGS v 2.3 and Mascot v 2.2. As noted below, we
have additional software packages and approaches for troubleshooting protein or peptide identification.
Validation of results by additional experimental procedures, such as immunoblotting, is routinely
performed. We also combine DIGE-proteomics analysis with a bioinformatic pathway analysis such as
MetaCore. Proteins analyzed by DIGE are used as input data for pathway analysis, which points to several
additional proteins, including low-abundance regulatory proteins or transcription factors, that could be
involved in the HIV-1 pathogenesis process.
Label-free proteomic comparison (expression profiling).

Figure 11: A high-resolution, low void volume, homogeneous mixing nano-LC/nanospray interface
developed in our proteomics facility.

Label-free proteomic quantification, defined as the relative quantification of proteins by direct comparison
of peptide peak areas between LC/MS runs without the use of peptide labeling, has been emerging as an
excellent alternative to 2D gels and label-based methods such as isotope-coded affinity tag (ICAT), isobaric
tag for relative and absolute quantitation (iTRAQ), 18O-incorporation, or stable-isotope labeling by amino
acids in cell culture (SILAC). Label-free methods can provide better accuracy and eliminate expensive, and
sometimes problematic, labeling steps (Higgs et al., 2007). However, label-free quantification of
differentially expressed proteins on a proteomic scale remains challenging for several reasons. First,
cleanup of extracted proteins from complex samples without compromising quantitative information can be
a significant challenge (Higgs et al., 2007). Second, label-free strategies are intrinsically biased toward
higher-abundance proteins, while more important, lower-abundance regulatory proteins might escape
sequencing (Bantscheff et al., 2007). To address these
challenges, we shall apply several novel technical advances that have been developed and validated in our
lab for label-free proteome comparisons. Our experimental approach involves 3 key innovations:
First, samples are extracted by a method that is optimized (see below). Before analyzing the samples by
nano-flow liquid chromatography/mass spectrometry (nano-LC/MS), surfactants and matrix components
such as lipids that could compromise reversed-phase chromatography must be removed while avoiding
loss of proteins. Traditionally, crude protein extracts are resolved on a preparative SDS-PAGE, followed by
excision of gel bands and in-gel digestion (Cho et al., 2007), but this approach suffers from relatively low
recovery of peptides following in-gel digestion, and band excision is inexact, leading to the inevitable loss
of proteins at the edge of bands, and thereby compromising quantification. We have developed an
approach involving protein precipitation and on-pellet digestion; this method provides high protein/peptide
recovery and identification (ID) of a larger number of proteins/peptides, while avoiding gel separation. The
crude protein extract is subjected to a multiple-step precipitation procedure employing organic solvents that
is optimized for the specific cell or tissue sample. For example, the protein extract is mixed with 4:2:3
(vol:vol:vol) methanol:chloroform:water, and then centrifuged. After removal of the aqueous layer that
contains hydrophilic non-protein components, 4 vol methanol is added to eliminate phase separation and
precipitate the protein. After centrifugation and removal of the supernatant, a small amount of enzyme is
added to the pelleted protein to initiate partial proteolysis and solubilization of the pellet. The mixture is
then reduced, alkylated, and further digested to completely cleaved peptides, which are ready for nano-LC/MS analysis. This strategy is superior to the traditional gel fractionation method in both protein recovery
and peptide ID.
Second, because the samples proposed for proteomic analysis are all highly complex, the ability to
separate the samples efficiently prior to analysis by the MS/MS detector is critical to obtain sequence and
quantitative information, particularly for lower-abundance regulatory proteins. We developed a novel,
high-resolution nanospray ionization (nano-LC/NSI) configuration, which provides homogeneous mixing,
low void volume, and high chromatographic resolving power. The setup is illustrated in Figure 11.
A trap and a small-particle nano-HPLC
column are connected back-to-back by
an entirely metal zero-dead-volume tee,
with a waste line connected to the 90º
arm. Because there is no valve between
the trap and nano-column, peak tailing and band broadening due to turbulence/mixing in valve channels
are eliminated. A large-diameter trap and bi-directional sample loading/analysis are used to achieve
homogenous nano-gradient mixing, a highly reproducible gradient, and a high loading capacity. An online
zero-dead-volume conductivity sensor is used to monitor gradient quality and trap washing efficiency.
Because of the highly reproducible gradient and high resolving power this interface provides, we can
employ shallow gradients and long run times to resolve very complex protein mixtures without
unacceptable peak broadening.
Third, the design for sample analysis and data processing is shown in Figure 12. We use 2 label-free
quantification software packages, Sieve (Thermo) and DeCyder MS (GE Healthcare), to provide relative
quantification by comparison of multiple runs and samples. Searches for peptide ID are performed with
SEQUEST running on a 32-node cluster in UB's supercomputer center. Identified proteins are documented,
and those of interest for quantification at higher temporal resolution are analyzed using the ultra-sensitive
quantification scheme employing LC/MRM (multiple reaction monitoring)-MS/MS described below.
Comprehensive Identification of post-translational modifications (PTM) in complex samples.
PTM play a critical role in physiological and pathological processes, and are important markers for certain
pathological states (Hunter, 2007). The ability to identify PTM in PBMC will provide key insights necessary
to accomplish the aims of this project. Our proteomic facility has developed analytical procedures to routinely and comprehensively monitor 20 biologically important PTM in biological samples, such as phosphorylation, methylation, dimethylation, ethylation, biotinylation, ubiquitinylation, and nitrosylation. Due to the usually low stoichiometry of PTM, combined with the complexity of the digested sample, identifying multiple PTM in these samples is challenging.
Figure 13: Scheme for a novel dual-enzyme and dual-activation method to enhance PTM identification significantly.
To increase the number of PTM identified, we shall employ the high resolution nano-LC/NSI system
described above, and use a shallow gradient to improve resolution of the complex mixture. In addition, we
shall employ a dual-enzyme/dual-activation technique that we recently developed. The flow chart of this
new technique is illustrated in Figure 13. In this sample processing scheme, two enzymes are employed
individually in parallel: trypsin (cuts at K and R) and V8 (cuts at D and E). These produce complementary
proteolytic peptide profiles. Parallel samples are analyzed by both collisionally activated dissociation (CID)
and electron transfer dissociation (ETD), which are alternative approaches for fragmenting peptides to
obtain sequence-informative product ions. CID can fragment singly to triply charged peptides with good
efficiency, but not peptides with higher charge-states, and often does not preserve fragile PTM such as
phosphorylation. In contrast, ETD is not optimal for doubly-charged peptides, and does not work well on
singly-charged peptides, but can efficiently fragment peptides having 3 or more charges. It also can
preserve phosphorylation. Therefore, CID and ETD provide complementary information. Using this dual-enzyme and dual-activation method, we are able to observe significantly more PTM, and to identify more peptides/proteins, than when using trypsin and CID alone (data not shown). For example, with
conditions otherwise the same, using trypsin and CID alone to analyze a human liver sample enables
identification of ~5400 PTM when using a set of stringent filters, whereas the dual-enzyme/dual-activation approach identifies ~9000 PTM. For each sample, the database search for CID samples is
filtered using Xcorr (2.0 z=1, 2.5 z=2 and 3 z=3) and probability score (P<0.01); for samples analyzed by
ETD, the standard filter will be Xcorr > 2.5 z=2, 3 z=3 and 3.5 if z=4 and Sf>0.7. All identified proteins,
peptides and PTM are cataloged in a database.
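The charge-state-dependent filters described above can be sketched as follows (the PSM records and field names are illustrative, not the actual search-engine output format):

```python
# Sketch of the charge-dependent Xcorr filters described in the text.
# PSM dictionaries and their field names are hypothetical.
CID_XCORR = {1: 2.0, 2: 2.5, 3: 3.0}   # Xcorr thresholds by charge (CID)
ETD_XCORR = {2: 2.5, 3: 3.0, 4: 3.5}   # Xcorr thresholds by charge (ETD)

def passes_cid(psm):
    """CID filter: Xcorr threshold by charge plus probability score P < 0.01."""
    thr = CID_XCORR.get(psm["z"])
    return thr is not None and psm["xcorr"] >= thr and psm["p"] < 0.01

def passes_etd(psm):
    """ETD filter: Xcorr threshold by charge plus Sf > 0.7."""
    thr = ETD_XCORR.get(psm["z"])
    return thr is not None and psm["xcorr"] > thr and psm["sf"] > 0.7

cid_hits = [{"z": 2, "xcorr": 2.8, "p": 0.001},
            {"z": 3, "xcorr": 2.7, "p": 0.001}]
kept = [p for p in cid_hits if passes_cid(p)]  # only the first record passes
```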
Potential results:
As seen in our preliminary data (Figures 8 & 9), proteomic analysis using 2D-DIGE showed significant
changes in protein expression between the NP and the LTNP cohorts. Identification of these unique protein
spots is currently underway using several methodologies described above. These investigations will
provide a list of known or unknown proteins that play a role in HIV-1 disease progression. A large number of hypothetical genes and proteins may emerge from the data analyses; some of them may not have been described in the literature or exist in any database. These unanticipated proteins may be candidates for new biomarkers associated with progression of HIV disease, or may even emerge as potential targets for innovative therapies. Although we expect a large number of proteins to be expressed in each group, only a few may
be expressed at levels detected by the 2D-DIGE approach. Even with low concentration polyacrylamide
gels, some high MW proteins may not be adequately separated by 2D electrophoresis. Limitations of this
methodology are peptide clustering, protein modifications, protein-protein interaction, low mass accuracy,
and enzymatic cleavage of specific proteins, some or all of which can result in inaccurate or incomplete
identification of the unique protein. Once a unique protein has been identified we shall quantitate its
expression with western blots. Attempts will be made to account for any confounding variables using
technological advances in proteomics as described above. Candidate genes and proteins, as selected by
genomic and proteomic analysis, will be confirmed by appropriate assays such as QPCR, and western and
antibody based microarrays. Changes in protein expression will be correlated with changes in gene
expression. To limit variations arising from the analysis of genomic and proteomic data, we shall repeatedly measure gene/protein expression in different aliquots of the same sample, using reciprocal labeling. We anticipate that using a combination of these state-of-the-art proteomic
methodologies we shall be able to identify new biomarkers of HIV infection.
Research Design and Methods for Specific Aim II:
SNP analysis:
SNP analysis will be undertaken by Sequenom MassARRAY spectrometry, a powerful genotyping
technology that efficiently and precisely measures the amount of genetic target material and variations
therein using matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS). It is able to deliver reliable and specific data from trace amounts of patient DNA. This methodology is one of the cheapest and most error-free technologies for high-throughput SNP typing. It uses samples arrayed in 384-well plates and allows custom genotyping of SNPs within candidate genes or genomic intervals. Sequenom MassARRAY can be used to identify alleles on the basis of mass, and therefore does not require expensive labeled primers; it is also far more reliable than other genotyping approaches, with a <0.5% error rate. Briefly, the technology involves PCR amplification of the region containing the SNP of
interest, an optimized primer extension reaction to generate allele-specific DNA products, and chip-based
mass spectrometry for separation and analysis of the DNA analytes. A single post-PCR primer extension
reaction generates diagnostic products that, based on their unique mass values, allow discrimination
between two alleles. The entire process has been designed for complete automation including assay
development, PCR setup, post-PCR treatment, nano liter transfer of diagnostic products onto silicon chips,
serial reading of chip positions in the mass spectrometer, and final analytical interpretation. Sequenom
SPECTRODESIGNER software designs primers for genotyping in multiplex fashion. IPLEX software allows
for the design of assays for 24-28 SNPs at a time. Sequenom MassARRAY will be used to identify allelic
variants, RANTES In1.1c, CCR2b-641, CCR5-∆32, IL10-5'A, IL-4 -589T, TNF-α -238A and HLA-B27, in NP
and LTNP. DNA is extracted from PBMC isolated from NP and LTNP samples and used for the Sequenom
MassARRAY analysis. A total of 65 samples from each cohort will be run in triplicate on a 384-sample plate. A total of 8 SNPs will be run per sample plate. Genotyping data are automatically analyzed using a
script that computes descriptive statistics and runs the Shapiro–Wilk test for goodness-of-fit to normality.
Tests of association are automatically performed using a script that runs the modified χ2 test. The interface
also allows classical association studies to be carried out based on genotypes of individuals.
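As a rough illustration of the association-testing step, a 2x2 chi-square test of allele carriage between cohorts can be sketched as follows (the counts are hypothetical; the production pipeline uses a modified χ2 test within the Sequenom interface):

```python
# Minimal allele-association sketch: 2x2 chi-square test comparing allele
# carriage in LTNP vs NP cohorts. Counts below are hypothetical.
from math import erfc, sqrt

def chi2_2x2(a, b, c, d):
    """Chi-square statistic and p-value (1 df) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = erfc(sqrt(stat / 2))  # chi-square survival function for 1 df
    return stat, p

# rows: LTNP, NP; columns: allele carriers, non-carriers
stat, p = chi2_2x2(30, 35, 12, 53)
```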
Analysis of the genetic variants using real time, quantitative PCR:
Quantitation of gene expression of the specific SNPs listed above will be performed using QPCR in
both study groups, NP and LTNP. The RANTES In1.1c, CCR2b-641 and the CCR5-∆32 alleles that have
been implicated in HIV-1 disease progression will be genotyped using DNA extracted from PBMC from
patients in one of our cohorts (NP or LTNP). Genomic DNA will be extracted from 5–10 ml of peripheral
blood, using standard procedures. DNA (100 ng) will be amplified using QPCR. The following are
sequences that we have designed for use in QPCR: CCR5-∆32 allele (CCR5-∆32-F, 5'-CTTCATTACACCTGCAGCT-3' and CCR5-∆32-R, 5'-TGAAGATAAGCCTCACAGCC-3'); RANTES In1.1C allele (RANTES In1.1C-F, 5'-CCTGGTCTTGACCACCACA-3' and RANTES In1.1C-R, 5'-GCTGACAGGCATGAGTCAGA-3'); CCR2b-641 allele (CCR2b-641-F, 5'-TTG TGG GCA ACA TGA TGG-3' and CCR2b-641-R, 5'-GAG CCC ACA ATG GGA GAG TA-3'); IL10-592A allele (IL10-592A-F, 5'-TACTCTTACCCACTTCCCCC-3' and IL10-592A-R, 5'-TGAGAAATAATTGGGTCCCC-3'); IL-4 -589T allele (IL-4 -589T-F, 5'-CAGTCCTCTGGCCAGAGAG-3' and IL-4 -589T-R, 5'-CACCGCATGTACAAACTCCC-3'); TNF-α -238G/A allele (TNF-α -238G/A-F, 5'-AGAAGACCCCCCTCGGAACC-3' and TNF-α -238G/A-R, 5'-ATCTGGAGGAAGCGGTAGTG-3'); and HLA-B27 allele (HLA-B27-F, 5'-GGG TCT CAC ACC CTC CAG AAT-3' and HLA-B27-R, 5'-CGG CGG TCC AGG AGC T-3'). PCR fragments for the CCR5-∆32, RANTES In1.1C, CCR2b-641, IL10-592A, IL-4 -589T, TNF-α -238G/A and HLA-B27 alleles are 164 bp, 240 bp, 128 bp, 311 bp, 700 bp, 152 bp and 135 bp, respectively, when separated on 2% agarose gels. Primer
mixes always include control primers that amplify a nonpolymorphic region. Relative abundance of each
mRNA species is quantitated by QPCR. Relative expression of mRNA species is calculated using the
comparative CT method (Shivley et al, 2003). All data are controlled for quantity of RNA input by
measurements on a reference gene, β-actin, and the 18S RNA standard as internal controls. Results on
RNA from LTNP samples are normalized to results obtained on RNA from NP samples. Data are expressed
as transcript accumulation index (TAI) assuming that all PCR reactions are working at 100% efficiency.
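The comparative CT calculation can be sketched as follows (CT values are hypothetical; real analyses would also verify amplification efficiency rather than assume 100%):

```python
# Comparative CT sketch: fold change (transcript accumulation index, TAI)
# assuming 100% PCR efficiency. CT values below are hypothetical.
def tai(ct_target_ltnp, ct_ref_ltnp, ct_target_np, ct_ref_np):
    """TAI = 2^-ddCt: target normalized to the reference gene, LTNP to NP."""
    d_ct_ltnp = ct_target_ltnp - ct_ref_ltnp   # delta-CT in LTNP sample
    d_ct_np = ct_target_np - ct_ref_np         # delta-CT in NP sample
    return 2 ** -(d_ct_ltnp - d_ct_np)

fold = tai(24.0, 18.0, 26.0, 18.0)  # target crosses 2 cycles earlier in LTNP
```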
Potential results:
Accurate estimation of allele frequencies requires calculation of a correction factor for unequal allelic
amplification from the peak height ratios of a small set of heterozygotes. The Sequenom MassARRAY
method gives the best reproducibility. The use of pooled samples as controls in each run (we use pools of
384 individuals) and multiple replicates increase the accuracy of detection. Although genotyping accuracy
has not been systematically examined, no genotyping method is 100% accurate and as many as 5% of
individual genotypes could be mis-called. Such genotyping errors would decrease the power to detect
quantitative trait loci or could have serious effects on linkage disequilibrium measures (Abecasis et al,
2001; Akey et al, 2001). Most of the scoring errors are caused by ambiguities in the allele peaks, sample-to-sample contamination, or mislabeling of DNAs. The use of pools will reduce all of these sources of error.
Nevertheless, Sequenom MassARRAY provides very accurate association studies, in a large set of
samples, compared with genotyping individual samples. We expect that data obtained from this analysis
will provide information on specific allelic variants that play an important role in HIV-1 disease progression
in our patient cohorts. Having identified the most important allelic variants from among the 8 SNPs studied,
we will further quantitate the expression levels of specific genes in our patient cohorts by QPCR.
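The correction for unequal allelic amplification described in this section can be sketched as follows (peak heights are hypothetical). Heterozygotes carry a true 1:1 allele ratio, so their observed peak-height ratio estimates the amplification bias, which can then be applied to peaks from pooled samples:

```python
# Sketch of the unequal-amplification correction from heterozygote peaks.
# Peak-height values are hypothetical.
from statistics import mean

def correction_factor(het_peak_pairs):
    """k = mean observed A/B peak-height ratio in known heterozygotes,
    where the true allele ratio is 1:1."""
    return mean(a / b for a, b in het_peak_pairs)

def pooled_allele_freq(peak_a, peak_b, k):
    """Bias-corrected frequency of allele A in a pooled sample."""
    return peak_a / (peak_a + k * peak_b)

k = correction_factor([(1.2, 1.0), (1.3, 1.0), (1.25, 1.0)])
freq = pooled_allele_freq(2.5, 1.0, k)
```

Note the sanity check built into the definition: a pool composed entirely of heterozygotes reproduces a corrected frequency of 0.5.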
Research Design and Methods for Specific Aim III:
This aim focuses on the development of new computational tools to integrate and analyze genomic,
proteomic, and clinical data from our different HIV-1 infected patient cohorts. The wealth of information that
will be produced by our project will require novel methods for its storage, analysis, and dissemination.
Thus, information management and analysis is a key component of this proposal.
III.a Data warehouse, data modeling and system design:
The most essential component of the proposed research is the design of an efficient and organized database allowing integration of the various streams of information under study. Implementation of the system to
organize our data is a complex and data-intensive process since the data are inherently noisy, complex,
and distributed across multiple information resources. The core of this project’s information management
environment, therefore, will be a data warehouse and associated tools. The data warehouse will be used to
integrate information on HIV-1 and host cell gene expression, protein identification, post-transcriptional and
posttranslational modifications of proteins, functional activities, and protein-macromolecule interactions with
clinical data from our HIV-1 patient cohorts. Application tools will be built to analyze and integrate these data, and to provide easy access to all this information via the World Wide Web.
To support the intensive computational research activities proposed here, we must design
a robust system to integrate genomic, proteomic, and clinical datasets. A high-level logical view of system
architecture is shown in Figure 14. The central component, the data warehouse, will contain primary,
derived and external data. The primary information includes genomic, proteomic, and clinical data collected
from individual patients. Since clinical data obtained from various patients are heterogeneous datasets,
they will be transformed and imported into the data warehouse using Perl programs. Associated with the
primary data will be a set of derived data generated from our data analysis approaches and different data
analysis tools. Public information also will be integrated into the database. These include data from the
National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/) such as human EST, UniGene
and RefSeq data; gene ontology data from the Gene Ontology Consortium (http://www.geneontology.org/);
protein-protein interaction data from the DIP database (http://dip.doe-mbi.ucla.edu/); protein domain
models from the Pfam database (http://pfam.wustl.edu/); and microarray gene expression datasets from
GEO (http://www.ncbi.nlm.nih.gov/projects/geo/).
Figure 14. System architecture of the integrated data warehouse under development. Data sources (clinical data and sample annotations; gene functional annotations; gene expression datasets; promoter sequences and motifs; protein and domain interactions; protein structural information) pass through data integration (extraction, transformation, cleaning and loading; metadata capture and integration; data quality control; refreshment) into the data warehouse. A unified access layer (a standard interface for application tools; object-oriented access with basic data operators) supports data mining: ad hoc queries, OLAP, cluster analysis, mining gene regulatory networks, interactome prediction, and pathway analysis.
Schema design. Our preliminary study showed that we can use multidimensional schemas to
effectively model the microarray, proteomic and experimental data, and provide highly efficient query
processing. However, multidimensional schemas do not appear to be sufficient for modeling the semantics of the clinical data and sample data. For the clinical data, a multidimensional schema has difficulty modeling the complex many-to-many relationships between the fact and dimensions, and providing bi-temporal support for some clinical measures. If a single fact table is used to store all the different
clinical measures, most entries (including foreign keys of the central fact table) would contain null values
due to the incompleteness of data.
We propose to develop a data model that is most suitable to specify and integrate biomedical data
including genomic, proteomic, clinical, and other related data such as ontology information in the data
10
warehouse so that efficient querying and analysis of these data can be performed. We propose to design a
new schema called Hybrid BioSchema (HBS), which combines a multidimensional model with an object-oriented model. The HBS model can support both a multidimensional on-line analytical processing (MOLAP) server and an object-oriented (OO) data processing server, so we benefit from the greater scalability of the OO model and the faster computation of MOLAP. In HBS, data spaces are specified as either main data
spaces or sub-data spaces. The main data space is modeled by the OO model and each sub-data space is
modeled by the multidimensional model and is connected to central objects in the main data space. Thus,
each sub-space can be individually managed. Clinical data and sample data are in the main data space, for
example, a patient object with demographic and other related attributes, while microarray and proteomic
data are in the sub-data spaces. The object-oriented model can express all the complex relationships among the entities that exist in the clinical data. This model can be applied to many biomedical applications.
The HBS schema has many advantages. First, MOLAP operations can be applied easily to analyze
genomic and proteomic data and the data in the main data space can be accessed through OO server,
with explicit navigation capabilities. Thus, the analysis can be applied to any biomedical data with high
query performance. Second, HBS is very expressive, showing clear data semantics because the main data
space has clear object relationships. Third, HBS is very scalable and extensible. Updating or adding a
large amount of data can be easily done in any sub-data space; if we need to update or add data to the main data space, the OO data server can be used. Finally, HBS has a very simple and concrete structure, and the data warehouse schema is easy to understand.
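The HBS idea of an object-oriented main data space linked to multidimensional sub-data spaces can be caricatured in a few lines (class and field names are illustrative only, not the actual warehouse schema):

```python
# Conceptual sketch of HBS: clinical data as objects in the main data space,
# expression measurements in a multidimensional sub-data space keyed by
# (patient, probe, timepoint). All names and values are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Patient:                      # object in the OO main data space
    patient_id: str
    cohort: str                     # "LTNP" or "NP"
    age: int
    samples: list = field(default_factory=list)

class ExpressionCube:               # multidimensional sub-data space
    def __init__(self):
        self.cells = {}             # (patient_id, probe, timepoint) -> value
    def load(self, patient_id, probe, timepoint, value):
        self.cells[(patient_id, probe, timepoint)] = value
    def slice_by_patient(self, patient_id):
        """Connect a sub-data space back to a central patient object."""
        return {k: v for k, v in self.cells.items() if k[0] == patient_id}

p = Patient("P001", "LTNP", 45)
cube = ExpressionCube()
cube.load("P001", "CCR5", "wk0", 7.2)
cube.load("P002", "CCR5", "wk0", 9.1)
```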
Querying and online analytical processing. The most important feature of MOLAP is its ability to
present multidimensional data at different levels of detail through roll-up and drill-down operations along a
concept hierarchy associated with a dimension. Such operations are especially suited for gene and protein
expression data analysis, where at least three major concept hierarchies can be identified. For example, for
the Array Probe dimension in the gene data sub-space, gene expression data may be summarized using
Gene Ontology (GO) or other ontology hierarchies (Gene Ontology Consortium 2000; Rosse and Mejino.,
2003). These vocabularies encode significant background knowledge of biology, and thus are important for
meaningful MOLAP analyses of genomic and proteomic data. The classification hierarchies of the other
dimensions may also be defined in a similar way based on domain-specific knowledge. The above-mentioned ontology hierarchies are critical for meaningful MOLAP analyses of clinical and gene expression
data. Examples include summarization of clinical test results using disease hierarchies for pattern
discoveries, and gene expression summarization using terms from GO’s biological process hierarchy,
which may reveal changes in pathways due to diseases or in response to drug treatments, and thus
provide useful information for clinical research and drug discovery.
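A roll-up along a GO-style concept hierarchy amounts to grouping probe-level measurements by their annotated term, as in this minimal sketch (the probe-to-term mappings and expression values are hypothetical):

```python
# Sketch of a MOLAP-style roll-up: per-probe expression values summarized
# at the GO biological-process level. Mappings and values are hypothetical.
from collections import defaultdict
from statistics import mean

go_annotation = {                   # probe -> GO biological-process term
    "probe1": "immune response", "probe2": "immune response",
    "probe3": "apoptosis",
}
expression = {"probe1": 2.0, "probe2": 4.0, "probe3": 1.5}

def roll_up(expr, annotation):
    """Summarize probe-level values at the concept-hierarchy level (mean)."""
    groups = defaultdict(list)
    for probe, value in expr.items():
        groups[annotation[probe]].append(value)
    return {term: mean(vals) for term, vals in groups.items()}

summary = roll_up(expression, go_annotation)
```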
System construction: Data warehousing is a complex and data-intensive process that transforms heterogeneous data into an integrated object-oriented and multidimensional representation. This is especially true for biomedical data warehousing, in which data are inherently noisy, complex and distributed across multiple
information resources. We will develop wrapper programs for data extraction from various sources, design
transformation algorithms for integration of the heterogeneous datasets, and define constraints for data
quality control. To monitor data lineage and quality, it is important to capture all the metadata generated in
the data staging process. The metadata include the semantics and structure of the data warehousing
processes. Thus, the metadata repository is a key resource for the operational design, evolution and
refreshment of the data warehouse. We will design a framework to model the clinical and genomic data
warehousing processes, and store the information in the metadata repository.
The underlying database for managing the data warehouse will be powered by the ORACLE
database management system, which has proven to be stable, reliable, and easy to administer. It will serve
as the data repository. The object database architecture in ORACLE will enable us to specify complex
clinical data and relationships. The relational database architecture in ORACLE will enable users to ask complex questions regarding all aspects of HIV disease. Constituting the application layer, a set of Perl and Java modules and programs will provide centralized program logic to handle transactions between the user and the database. Finally, users will be able to interact with and navigate the database via a user-friendly Web interface.
III.b Advanced analysis methods and tools
We now describe the analysis methods and tools to aid researchers in mining the data. We will
support the use of routine statistical and computational approaches to identify differentially expressed
genes and proteins. For genomic analysis, the gene expression image analysis of the cDNA array and
quality control will be performed with the Microarray Suite software (MAS 5.0), which utilizes global (linear)
normalization procedures. The latest version of the software uses the paired version of the non-parametric
Wilcoxon’s test. For the proteomic analysis, Ciphergen Protein Chip software version 3.0 will be used to
analyze the spectra and relative abundance of the individual proteins. Further, a classification tree will be
utilized to define cutpoints in variable peaks in the different patient cohorts, NP and LTNP. The best discrimination between the 2 cohorts will be achieved by considering all possible confounding variables in the algorithms used in the analysis. A classification tree will be constructed by separating subgroups based on the selected cutpoints. These splitting processes will continue until the minimum number of patients required in each patient cohort is reached. The final tree will be pruned to eliminate overfitting of the datasets. The data will be cross-validated, the tree error rate will be estimated, and the tree with the smallest error rate will be selected. Further, in parallel analysis, the protein peaks that are over-expressed
or under-expressed in different patient cohorts will be compared to control groups. The results of this
analysis will be compared to that of the classification tree to evaluate the sensitivity of the analysis. In
parallel analysis, De Cyder software will be used to distinguish clear statistical differences in protein
expression between treated versus untreated, and between different patient cohorts.
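The cutpoint-selection step at the heart of the classification-tree analysis can be sketched for a single protein peak (intensities and cohort labels are hypothetical; the real analysis considers many peaks, confounders, pruning and cross-validation):

```python
# Sketch of cutpoint selection for one protein peak: choose the threshold
# that best separates the two cohorts. Intensities are hypothetical.
def best_cutpoint(values, labels):
    """Scan candidate thresholds; return (cutpoint, accuracy)."""
    best = (None, 0.0)
    for cut in sorted(set(values)):
        correct = sum((v > cut) == (lab == "NP")
                      for v, lab in zip(values, labels))
        # allow either direction of the split
        acc = max(correct, len(values) - correct) / len(values)
        if acc > best[1]:
            best = (cut, acc)
    return best

peaks  = [1.1, 1.3, 1.2, 2.6, 2.9, 2.4]
cohort = ["LTNP", "LTNP", "LTNP", "NP", "NP", "NP"]
cut, acc = best_cutpoint(peaks, cohort)
```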
However, since the genomic, proteomic, and clinical data are very complex, fast evolving and are
often incomplete, we shall investigate more advanced data analysis approaches to analyze gene and
protein profiles based on biological significance.
Target selection, common profile extraction and association.
Target selection discovers a subset of genes/proteins which explain the phenotypic variations. While
previous studies have proved useful for identifying informative genes/proteins, several fundamental
challenges still remain. (1) Correlations between genes are often ignored. Most previous target selection approaches are single-gene-based; these methods simply assume genes are independent, ignoring their correlations. However, genes are well known to interact with each other through gene regulatory networks. The assumption of independence between genes/proteins oversimplifies the complex
relationship between them. (2) Domain knowledge can be incorporated to improve the performance of
target selection. Gene/protein array datasets are typically noisy due to technical constraints. Incorporation
of domain knowledge will help to reduce the effect of noise and improve the quality of the results. (3) Association between gene and protein expression may provide a considerable amount of information for delineating the roles of genes/proteins in the disease state. However, few, if any, studies have been devoted to analyzing the inherent correlation between these two types of expression.
Correlation-based feature extraction. We propose to use the correlation between genes (or proteins)
as features for identifying phenotypes. Such correlation between gene expression levels could result from underlying biological processes and warrants further investigation. Instead of trying to eliminate correlation in the selected gene set, we examine whether such correlation itself is a good predictor of sample class labels. As we pointed out in the preliminary study, the correlated genes bear biological meaning: the weighted summation or difference of expression levels of several genes.
Incorporation of domain knowledge. We shall integrate domain knowledge, such as that embedded in the Gene Ontology (GO) (Gene Ontology Consortium 2000) annotations, into our target
selection process. The rationale is that while it is likely that even random gene expression can achieve
relatively high discriminative scores when the number of samples is limited, it is less likely that several
random genes annotated with the same GO term will all have similarly high scores. Our algorithm first
examines for each GO term whether genes annotated with it have statistically higher discriminative scores.
Where this is so, this is an indication of a correlation between the corresponding GO term and sample
class labels. We then choose from genes that are annotated with GO terms that are highly correlated with
sample class labels. The discriminative power DP of a GO term of sample class labels is then defined as
the percentage of genes that are annotated with this GO term with discriminative scores larger than a
threshold. The discriminative power of a GO term measures the collective discriminative power of individual
genes annotated with that GO term. The higher the DP score, the more strongly a GO term is correlated with sample class labels. The best GO-adjusted single-gene-based scores are then sorted, and top-ranked genes are selected as informative genes. The utility of the best GO-adjusted scores is twofold: irrelevant noise is further filtered, and implicit sample classes may become explicit.
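The DP score defined above can be computed directly, as in this sketch (the gene scores, GO annotations, and threshold are hypothetical):

```python
# Sketch of the discriminative-power (DP) score of a GO term: the fraction
# of genes annotated with the term whose discriminative score exceeds a
# threshold. Scores and annotations below are hypothetical.
def dp_score(go_term, annotations, scores, threshold):
    genes = [g for g, terms in annotations.items() if go_term in terms]
    if not genes:
        return 0.0
    return sum(scores[g] > threshold for g in genes) / len(genes)

annotations = {"g1": {"GO:0006955"}, "g2": {"GO:0006955"},
               "g3": {"GO:0006955"}, "g4": {"GO:0008219"}}
scores = {"g1": 0.9, "g2": 0.8, "g3": 0.2, "g4": 0.95}
dp = dp_score("GO:0006955", annotations, scores, threshold=0.5)
```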
Association analysis between mRNA and protein expression. Association between mRNA and protein
expression may provide considerable amount of information for delineating the roles of genes and proteins
in disease states. So far few, if any, studies have been devoted to analyzing inherent correlation between
these two types of expression. Instead of using simple Spearman rank correlation, association analysis between mRNA and protein expression profiles can be performed using KWII (Jakulin, 2005), which enables detection of two- or three-way interactions between mRNA and protein expression samples and can be used to effectively detect differentially expressed genes/proteins in HIV. We propose to use the k-way interaction information (Chanda et al., 2007; 2008), an information-theoretic metric (Moore et al., 2006; Bhasi et al., 2006a, 2006b; Liu et al., 2005) that can be considered a multivariate generalization of the Kullback-Leibler divergence (KLD) (Liu et al., 2005; Rosenberg et al., 2003; Smith et al., 2001; Anderson and Thompson, 2002), for association analysis of genetic data (SNPs) with gene and protein expression
data. For the n-variable case on the set ν = {X1, X2, …, Xn}, the KWII can be written succinctly as an alternating sum over all possible subsets T of ν using difference-operator notation. The following definition of KWII follows that of Jakulin (Jakulin, 2005): KWII(ν) = −Σ_{T⊆ν} (−1)^{|ν∖T|} H(T), where H denotes entropy. The
KWII measures the gain or loss of information due to the inclusion of additional variables in the model. It
quantitates interactions by representing the information that cannot be obtained without observing all k
variables at the same time (Jakulin, 2005). In the bi-variate case, the KWII is always positive, but in the multivariate case, KWII can be positive or negative. The interpretation of KWII values is intuitive because
positive values indicate synergy between variables, negative values indicate redundancy between
variables and a zero value indicates the absence of k-way interactions. The significance of the KWII values
can be ascertained using permutation or bootstrap based methods. Entropy calculations of the expression
profile combinations needed for KWII can be done using kernel based density estimations of the empirical
probability distributions in a nonparametric fashion that further highlights the versatility and flexibility of the
approach. Furthermore, to identify genes that may serve as ideal targets for novel treatment strategies, we
shall examine the selected subsets of mRNA and protein expressions shown to be significant. If from both
genomic and proteomic profiles we independently discover the same mRNA/protein to be differentially
expressed, then the chance of error can be greatly reduced. Thus, by combining the results from genomic and proteomic profiles, we can gain significant increases in the power to detect differentially expressed genes.
We also plan to analyze genetic data from SNPs together with gene expression data of NP and
LTNP subjects. As an example, consider a single gene G whose expression levels are measured across
T time points and N individuals or subjects. Let G(n, t) denote the expression level for the nth individual at
the tth time point. Let Gm denote the mean expression level for this gene across all the individuals. Then
treating each G(n) as a random variable, its KLD with the distribution of Gm can be determined using
empirical estimation methods. Finally these KLD values for G can be analyzed for association with the
combinations of genetic markers using KWII to detect genes whose expression levels differ in the
presence of different alleles at a marker position. This approach can be extended to analysis of
combinations of multiple genes and multiple markers upon treating the data as multivariate data.
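The per-individual KLD step can be sketched with discretized expression-level distributions (the bins and probabilities are hypothetical; in practice the empirical distributions would come from kernel density estimates):

```python
# Sketch of the KLD comparison: divergence of an individual's discretized
# expression-level distribution from the cohort-mean profile.
# Bin probabilities below are hypothetical.
from math import log2

def kld(p, q):
    """Kullback-Leibler divergence D(p || q) in bits over shared bins."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# binned expression-level distributions (low / medium / high)
individual = [0.7, 0.2, 0.1]
cohort_mean = [0.3, 0.4, 0.3]
d = kld(individual, cohort_mean)
```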
Gene biomarker selection.
From the above analysis, we expect to get a set of genes displaying differential expression between
the HIV-1 infected patient cohorts, LTNP and NP. We shall select 10 target genes for each comparison and
use both QPCR and proteomics approaches to determine the optimal mRNA and protein signatures for single-gene biomarkers from multiple patient cohorts and healthy individuals. QPCR and proteomics approaches
will provide for each sample an accurate measurement for the mRNA and protein concentration which will
be used for the construction of receiver operator characteristic (ROC) curves for the determination of
sensitivity and specificity. For the HIV infected patient cohorts and healthy individuals, the percentage of patients whose specific mRNA and/or protein concentration passes selected thresholds based on the above data analysis is defined as the “sensitivity”; the percentage of healthy individuals whose mRNA and/or protein concentration does not pass the selected threshold is defined as the “specificity”. We shall look for an optimal mRNA and/or protein concentration threshold, which has the maximum sensitivity and specificity combination in ROC curves, as the single-gene biomarker. To reduce false predictions, we shall integrate the
results from both QPCR and proteomics and find those genes with similar trend in ROC curves as the gene
biomarker. For the HIV infected patients before and after drug treatments, a conceptual model can be
similarly designed.
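The sensitivity/specificity trade-off can be sketched as follows (expression values are hypothetical; the threshold maximizing their sum corresponds to the optimal ROC operating point):

```python
# ROC sketch for a single-gene biomarker: sensitivity is the fraction of
# patients whose level passes the threshold; specificity the fraction of
# healthy individuals whose level does not. Values are hypothetical.
def sens_spec(patients, healthy, threshold):
    sens = sum(v > threshold for v in patients) / len(patients)
    spec = sum(v <= threshold for v in healthy) / len(healthy)
    return sens, spec

def best_threshold(patients, healthy):
    """Pick the threshold maximizing sensitivity + specificity."""
    candidates = sorted(set(patients) | set(healthy))
    return max(candidates, key=lambda t: sum(sens_spec(patients, healthy, t)))

patients = [5.1, 6.0, 4.8, 7.2]
healthy = [2.0, 3.1, 2.7, 4.0]
t = best_threshold(patients, healthy)
sens, spec = sens_spec(patients, healthy, t)
```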
In addition to the single gene biomarker, we also shall look for multi-gene biomarkers. This will first be
done by building a classification and regression trees model using the threshold from the above single
gene biomarker. Out of the 10 selected genes we shall investigate the biomarkers using 3 or 4 genes,
which will result in 120 and 210 classification and regression trees, respectively. If 65 patients in each
patient cohort were used, this method will classify the 130 samples into final patient and normal groups by
3 or 4 splitting steps. As in the single-gene biomarker analyses, the percentage of patients from the 65 patients in the classified patient groups is the “sensitivity”; the percentage of healthy individuals from the 65 healthy persons in the classified healthy group is the “specificity”. A few gene combinations
with maximum sensitivity and specificity will be used for multi-gene biomarkers. Similar to the single gene
biomarker, we shall integrate the results from both QPCR and proteomics to reduce false predictions. To
incorporate other phenotype information into our analyses, we shall also build a regression model by
integrating the patient’s age, gender, and medical history into the analysis. The best model will be
searched using the backward stepwise regression and validated by leave-one-out cross-validation
approach. These models will provide useful means for predicting disease states and HIV patients
responding to drug treatment based on expression profiles.
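As a rough sketch of the multi-gene search, the following enumerates the C(10,3) = 120 and C(10,4) = 210 candidate panels with `itertools.combinations` and scores each panel by a simple majority-vote rule over the single-gene thresholds; the voting rule is a hypothetical stand-in for the actual classification and regression tree fitting:

```python
from itertools import combinations

def classify(sample, panel, thresholds):
    """Call a sample 'patient' when a majority of the panel's genes exceed
    their single-gene thresholds -- a crude stand-in for a 3- or 4-split
    classification tree."""
    votes = sum(sample[g] > thresholds[g] for g in panel)
    return votes > len(panel) // 2

def panel_score(panel, samples, labels, thresholds):
    """Sensitivity + specificity of one candidate panel over labelled
    samples (label True = patient, False = healthy)."""
    calls = [classify(s, panel, thresholds) for s in samples]
    n_pat = sum(labels)
    sens = sum(1 for c, l in zip(calls, labels) if c and l) / n_pat
    spec = sum(1 for c, l in zip(calls, labels) if not c and not l) / (len(labels) - n_pat)
    return sens + spec

genes = range(10)
panels3 = list(combinations(genes, 3))  # C(10,3) = 120 candidate panels
panels4 = list(combinations(genes, 4))  # C(10,4) = 210 candidate panels
```

Ranking all panels by `panel_score` and keeping the top few corresponds to selecting the gene combinations with maximum sensitivity and specificity.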
Genetic network reconstruction.
A full understanding of virtually any complex biological system requires the identification of the
regulatory networks that control gene expression within that system. Such networks are composed of
genes, the transcription factors that regulate them, and crucially, the cis-regulatory sequences on which the
transcription factors act. Fully comprehending all aspects of these regulatory networks is fundamental to
understanding normal development, progression to disease, and response to pharmacological agents. In eukaryotic organisms, the spatial, temporal, and quantitative regulation of a gene's expression is generally mediated by multiple transcription factors (TFs). Therefore, the identification of synergistic TFs
and the elucidation of relationships among them are of great importance for understanding gene regulatory
networks. Previous methods employed for the identification of synergistic TFs are based on either TF
enrichment from co-regulated genes or phylogenetic footprinting. Despite the success of these methods,
both have limitations. For example, methods based on phylogenetically conserved sequences, although they can greatly reduce the false prediction rate (Wasserman and Sandelin, 2004), can miss potentially significant observations. Moreover, if the compared species are very closely related,
nonfunctional sequences may not have diverged enough to allow functional sequence motifs to be
identified; conversely, if the species are distantly related, short conserved regions may be masked by
nonfunctional background sequences.
We shall employ a new strategy to identify synergistic TFs. First, information from genomics and proteomics will be integrated to find genes co-regulated in the HIV-infected patient cohorts vs. healthy individuals. Human orthologous promoter sequences within 1 kb upstream of the annotated transcription start sites (TSS) will then be obtained from the Database of Transcriptional Start Sites (Yamashita et al., 2006) for these co-regulated genes. The orthologous promoter sequences will be searched for transcription
factor binding sites (TFBSs) using the Match® program (Kel et al., 2003) and ~550 position weight matrices (PWMs) from the professional TRANSFAC 9.1 database (Matys et al., 2003). To minimize false
predictions, we shall use an in-house developed novel approach (Hu et al., 2007) to detect synergistic TFs
in these co-regulated genes. In this approach, rather than aligning the regulatory sequences from orthologous genes and then identifying conserved TFBSs in the alignment, we proposed a new concept of function conservation, with two components, to identify TF combinations: the first is functional conservation of TFs between species; the second is functional conservation of TFBSs between the promoter sequences of individual orthologous genes. The algorithm for this function conservation approach has been implemented at 3
levels: (1) functional TFBS enrichment based on the pattern of binding site arrangement on promoters of
orthologous genes by distance constraint, (2) enrichment of overlapping orthologous genes whose
regulatory sequences contain the enriched TFBS combinations, and (3) integration of function conservation
from both TF and TFBS levels by correlation analyses. Genome-wide TF analyses have demonstrated that
our novel algorithm is better able to predict synergistic, functional TFBSs, TF-TF interactions, and thus
genetic networks. We shall combine our novel approaches with existing tools for discovering a genetic
network involved in HIV disease and build publicly available application tools (see database section) into
the database system.
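A simplified sketch of the distance-constraint step at level (1): here two illustrative consensus patterns (an AP-1-like and an NF-κB-like site) stand in for the ~550 TRANSFAC PWMs scored by Match, and a TFBS pair is counted as co-occurring on a promoter when two hits fall within a fixed distance:

```python
import re

def motif_hits(seq, pattern):
    """Start positions of a consensus motif; the real pipeline would score
    ~550 TRANSFAC position weight matrices with Match instead."""
    return [m.start() for m in re.finditer(pattern, seq)]

def cooccurs_within(seq, pat_a, pat_b, max_dist):
    """True when a hit of pat_a lies within max_dist bp of a hit of pat_b,
    i.e. the TFBS pair satisfies the distance constraint on this promoter."""
    hits_a, hits_b = motif_hits(seq, pat_a), motif_hits(seq, pat_b)
    return any(abs(a - b) <= max_dist for a in hits_a for b in hits_b)

def pair_enrichment(promoters, pat_a, pat_b, max_dist=100):
    """Fraction of promoters on which the TFBS pair co-occurs within max_dist."""
    n = sum(cooccurs_within(p, pat_a, pat_b, max_dist) for p in promoters)
    return n / len(promoters)

# Illustrative consensus sites: AP-1 (TGACTCA) and an NF-kB-like site.
promoters = ["AAATGACTCAAAGGGACTTTCCAA", "A" * 24]
frac = pair_enrichment(promoters, "TGACTCA", "GGGACTTTCC")
```

Comparing this fraction between the co-regulated gene set and a background set, on the promoters of orthologous genes, gives the enrichment signal used at levels (2) and (3).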
Identification of genetic contributions to HIV and drug response.
One challenge in the post-genomic era is to develop robust strategies for identifying the genetic contributions to HIV and drug response, which involve multiple gene interactions. Central to an understanding of such complex systems, and of the role of genes underlying individual responses to drug treatments, are effective models and software tools for characterizing
genetic variations based on genomic data, mostly SNPs. Although SNP data are available from the International HapMap Project (Li 2005), one cannot perform whole genome-based association studies directly on the genotypes or allele frequencies of individual markers, owing to the relatively low power of each SNP and the huge number of total SNPs. To increase the power of detection,
we have chosen closely linked SNPs inherited together during the history of evolution to find specific
patterns in the non-random association between alleles and the haplotype structures that they form. We
shall employ different algorithms, such as haplotype-based methods (Li 2005) or clustering techniques (Liu et al., 1999), for haplotype mapping to find disease susceptibility (DS) genes embedded in haplotypes, especially mutations of recent origin. These mutations tend to be close to each other due to linkage disequilibrium, while other haplotypes can be regarded as random noise sampled from the haplotype space. The association between genotype and disease state will not only result in the identification of
disease related genes but also provide personalized medicine fingerprinting.
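The notion of closely linked SNPs inherited together can be made concrete with the standard pairwise linkage-disequilibrium statistics D and r². A minimal sketch over phased 0/1 haplotypes (illustrative data, not our patient cohort):

```python
def linkage_stats(haplotypes):
    """D and r^2 between two biallelic SNPs, from phased haplotypes given
    as (allele_at_A, allele_at_B) pairs coded 0/1."""
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n       # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n       # frequency of allele 1 at SNP B
    p_ab = sum(a * b for a, b in haplotypes) / n  # joint frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2

# Perfect LD: the two SNPs are always inherited together.
d, r2 = linkage_stats([(1, 1)] * 5 + [(0, 0)] * 5)
```

High r² across a run of neighbouring SNPs is what groups them into the haplotype blocks that the mapping algorithms exploit.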
Tools
Interactive mining. We propose to design InterM, an integrated environment for interactive exploration
of coherent expression patterns and co-expressed genes/proteins in gene/protein expression data for HIV
infections. Our system will integrate the users' domain knowledge and effectively handle the high
connectivity in the data. Based on our density-based approach, InterM models a cluster of co-expressed
genes as a dense area (Jiang et al., 2004; 2005). Through this density-based model, InterM can distinguish
co-expressed genes from intermediate genes by their relative density. The coherent expression pattern in a
dense area is represented by the expression profile of the gene that has the highest local density in the
dense area. Other genes in the same dense area can be sorted in a list according to the similarity between
their expression profiles and the coherent expression pattern. Since the intermediate genes have low
similarity to the coherent pattern, they fall at the rear of the sorted list. Users can set a similarity threshold and thus cut the intermediate genes from the cluster. A user should be able to explore the co-expressed genes and their coherent patterns by unfolding a hierarchy of genes and patterns. The
exploration starts from the root. Three main components of InterM are shown in Figure 15. Users can
explore the coherent patterns in the data set and save/load the coherent patterns through the pattern
manager (Figure 15(a)). InterM has a working zone (Figure 15(b)), which integrates the parallel
coordinates, the coherent pattern index graph (Jiang et al., 2005), and a tree view. Users can select a node
in the tree view, then the working zone will display the corresponding expression profiles and coherent
pattern index graph. Users can click on the coherent pattern index graph to split the node or roll back
previous split operations. The tree structure is adjusted dynamically according to the exploration
operations. We will also design a gene annotation panel. Given a specific node on the hierarchical tree, the
panel sorts the genes belonging to the node, and displays the name and the annotation (if any) for each
gene (Figure 15(c)). This InterM function helps to integrate such domain knowledge into the system.
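The density-based model behind InterM can be sketched as follows, with local density approximated by the number of profiles whose Pearson correlation to a gene exceeds a radius: the gene of highest density supplies the coherent pattern, and the sorted list pushes intermediate genes to the rear. Function names here are illustrative, not InterM's actual API:

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two (non-constant) expression profiles."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def coherent_pattern(profiles, radius=0.9):
    """The gene with the most neighbours above the correlation radius has
    the highest local density and supplies the coherent pattern; the rest
    are sorted by similarity, so intermediate genes fall to the rear."""
    density = [sum(pearson(p, q) >= radius for q in profiles) for p in profiles]
    center = max(range(len(profiles)), key=density.__getitem__)
    order = sorted(range(len(profiles)),
                   key=lambda i: pearson(profiles[i], profiles[center]),
                   reverse=True)
    return center, order
```

Cutting `order` at a user-chosen similarity threshold corresponds to trimming the intermediate genes from the cluster.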
Figure 15: Screen snapshots of InterM: (a) pattern manager, (b) working zone, (c) gene annotation panel.
Visualization tool. We shall expand the scope of our VizStruct tool (Zhang et al., 2004) to meet the
need of the proposed HIV research by providing the following functions for viewing the structures of
genomic data:
Zip zooming view. We propose a zip zooming view method extending circular parallel coordinate
plots. Instead of showing all dimensional information, it combines several adjacent dimensions and
displays the reduced-dimension information. The number of dimensions displayed, which we call the granularity setting, can be set by the user, allowing different levels of combination. Two distant points in input space may be mapped to nearby points in 2D space, or vice versa. One solution is to use the zip zooming view to inspect two such points more closely. Another is to allow the user to interactively adjust the weight of each dimension to change the data distribution in 2D space; this can readily separate falsely mapped points. By adjusting the coordinate weights, the dataset's original static view becomes dynamic, which may compensate for the information loss from the mapping.
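A minimal sketch of the mapping and the zip zooming reduction, assuming a RadViz-style weighted placement of dimension anchors on a circle (an assumption for illustration; VizStruct's exact projection may differ):

```python
import math

def radial_map(point, weights=None):
    """Map a d-dimensional sample to 2D: place the d dimension anchors
    evenly on a circle and take the weighted centroid, in the spirit of
    circular parallel coordinate plots.  Adjusting `weights` interactively
    re-spreads points that were falsely mapped close together."""
    d = len(point)
    weights = weights or [1.0] * d
    vals = [v * w for v, w in zip(point, weights)]
    total = sum(vals) or 1.0  # avoid division by zero for all-zero rows
    x = sum(v * math.cos(2 * math.pi * i / d) for i, v in enumerate(vals)) / total
    y = sum(v * math.sin(2 * math.pi * i / d) for i, v in enumerate(vals)) / total
    return x, y

def zip_zoom(point, granularity):
    """Zip zooming: average each run of `granularity` adjacent dimensions,
    so the view displays the reduced-dimension information."""
    return [sum(point[i:i + granularity]) / len(point[i:i + granularity])
            for i in range(0, len(point), granularity)]
```

A sequence of such views with time-varying weight settings is exactly the frame sequence needed for the dimension tour below.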
Dimension tour. To effectively tackle the multi-dimensional nature of gene expression data, we will design an animation method called the dimension tour: a sequence of either scatterplots or zip zooming views in which each frame has a specific dimension-parameter setting, so that the tour can be defined as a function of time. Our system can be used in a variety of ways during exploratory array data analysis. Figure 16 shows three snapshots taken from a dimension tour of a sample set.
Figure 16: Three snapshots from a dimension tour of a sample set.
We shall also develop a visualization tool to provide an integrated view of various genomic and
proteomic data in the warehouse. Figure 17 shows an example, in which gene expression changes are
mapped onto the known protein interactome, and gene ontology annotations are used to characterize the
main function of the highly connected graph components. The graph provides an integrated and global
view of cellular changes due to disease or in response to treatment.
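A toy sketch of that integration step, colouring interactome nodes by expression change and attaching GO annotations (the dictionaries are illustrative; the accession IDs and annotations are taken from Figure 17):

```python
def color_nodes(interactome, fold_changes, go_terms):
    """Colour each protein node by expression change (red = up, green = down,
    black = unknown) and attach its GO annotation, yielding the integrated
    graph view of cellular changes described above."""
    view = {}
    for node in interactome:
        fc = fold_changes.get(node)
        if fc is None:
            color = "black"
        elif fc > 1.0:
            color = "red"
        else:
            color = "green"
        view[node] = (color, go_terms.get(node, "unannotated"))
    return view

# Toy inputs using accessions and GO terms from Figure 17.
interactome = {"P19838": ["Q14164"], "Q14164": ["P19838"], "P07900": []}
fold_changes = {"P19838": 2.5, "Q14164": 0.4}
go_terms = {"P19838": "signal transduction", "Q14164": "protein phosphorylation"}
view = color_nodes(interactome, fold_changes, go_terms)
```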
Figure 17. Genomic and proteomic data integration: gene expression changes are mapped onto the known protein interactome (tumor versus normal colon tissues), and GO annotations (e.g., regulation of cell cycle; signal transduction and protein biosynthesis; cell-cell signaling; DNA replication/repair; apoptosis) characterize the highly connected graph components. Red nodes represent up-regulated genes, green nodes down-regulated genes, and black nodes genes whose expression is unknown.
At present, information and tools for the systematic analysis of the genes and proteins of both the
host and virus associated with HIV-1 infections are limited and scattered across a wide range of online
resources. The application tools that we shall develop will be user-friendly, and constitute a bioinformatics
resource that integrates the genomic and proteomic analysis of host and HIV-1 proteins and correlates this
information to clinical data from unique HIV-1 patient cohorts. Our ultimate aim is to use the systems
biology approach to better understand HIV-1 disease and build a specific database to make this
information readily available to the public.
Software Dissemination and Timeline
Data resources and analysis tools that we develop will be made publicly available according to NIH
policies. The tools will be geared for use by bench biologists and users with a wide range of quantitative
skills. We shall establish procedures and training to ensure that the software and tools developed by this project are production quality and easy to use by biomedical researchers outside our university, and shall incorporate these software packages in a novel computing environment that facilitates their widespread
distribution to the biomedical research community. UB has a strong foundation in place from which this
effort will be based. Specifically, the proposed environment will be built around a system currently in place
to support biomedical research at UB, namely BioACE (Bioinformatics Application Computing
Environment). We shall develop a portal that will tie together all of the services that an existing or
prospective client will need. These include: (1) a means to request/register for services, (2) links to online
training materials, (3) application and tool downloads, and (4) access to database applications. We intend
to implement mechanisms whereby authorized users may access these and any documentation securely.
Access may be via direct download or via media such as CDs or DVDs. We shall also develop a database
to hold bibliographic and reference materials relevant to the project.
The timeline for the project follows:
TIME LINE OF RESEARCH DEVELOPMENT PLAN
Year 1 (4/1/2009 to 3/31/2010): Aim I: patient sample collection; genomic analysis and proteomic analysis (2D-DIGE, MALDI-TOF, nano-LC-MS/MS). Aim III: design of data schema and data collection.
Year 2 (4/1/2010 to 3/31/2011): Aim I: proteomic analysis continued (iTRAQ, SILAC, PTM analysis). Aim II: quantitation of allelic variants. Aim III: data integration and data loading.
Year 3 (4/1/2011 to 3/31/2012): Aim II: SNP analysis. Aim III: design of data analysis methods.
Year 4 (4/1/2012 to 3/31/2013): Aims I, II, and III: applying data analysis methods to genomic, proteomic, and clinical data.
Year 5 (4/1/2013 to 3/31/2014): Aims I, II, and III: results analysis and summarization; manuscript preparation and submission; tool dissemination; and preparation of a future grant proposal based on the results of this investigation.