Integrating Discovery, Development,
and Commercial Data into
Data Mining
Jennifer Sloan
Data Mining Consultant
GlaxoSmithKline: US Pharma IT
15 September 2004
Data Mining Definition
Data Mining is a process that uses a
variety of data analysis tools to discover
patterns and relationships in data that
may be used to make valid and accurate
predictions.
Data Mining is a tool that allows us to:
• Identify problematic areas
• Control process variability
• Make concrete decisions on business needs
• Develop a model which can aid in future business decisions
Commercial Data
Analyzing Multivariate Data
Managing Data Usage
Model Building
Multivariate Data Sets

Data are multivariate in nature

Large data sets containing multiple criteria
within each observation

Comparing multiple vectors is nearly
impossible without reducing to a single point
Here we view 5-dimensional information on one observation. Each
point represents a prescriber and the color represents a Market Share
increase or decrease. Overlapping distributions make this difficult to
interpret and further analysis is required. Over 200K observations are
represented in this graph.
The same observations are shown, but now two-way interactions between
the variables help us determine which variables are affecting market shifts
and lead to models that predict prescriber behavior.
Drug Development
Drug Development Issues

Adverse Event Reporting System (AERS)
Over 2 million AE reports, covering approximately 2000
drugs and biologics, submitted to the FDA since 1968


Creates Extremely Complicated Matrix of Data
Recently, Data Mining methods have helped
address this issue with the development of a
method used to examine large databases for
associations between drugs and AEs
Data Mining Algorithm

Multi-Item Gamma Poisson Shrinker (MGPS)
Developed by William DuMouchel (AT&T)
Through statistical modeling, this Empirical Bayesian
method identifies higher-than-expected reporting
relationships of drug-event combinations

Automated, web-based system with rapid drilldown capability
MGPS runs using all event terms and drugs in the AERS
database and produces results for all drug-event
combinations
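The shrinkage idea behind "higher-than-expected reporting" can be illustrated with a small sketch. This is not the MGPS implementation: MGPS fits a mixture of gamma priors to the whole database, whereas the sketch below assumes a single Gamma(alpha, beta) prior with fixed, hypothetical hyperparameters, and the drug names and counts are invented for illustration only.

```python
from dataclasses import dataclass


@dataclass
class DrugEventCount:
    drug: str
    event: str
    observed: int    # N: reports mentioning both the drug and the event
    expected: float  # E: count expected if drug and event were unrelated


def relative_reporting_ratio(c: DrugEventCount) -> float:
    """Raw disproportionality: observed count divided by expected count."""
    return c.observed / c.expected


def shrunken_ratio(c: DrugEventCount, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Posterior mean of lambda when N ~ Poisson(lambda * E) and
    lambda ~ Gamma(alpha, beta). alpha and beta are assumed values here;
    an empirical Bayes method would estimate them from the data."""
    return (alpha + c.observed) / (beta + c.expected)


# Hypothetical counts: both pairs have the same raw ratio of 20.
counts = [
    DrugEventCount("drug_A", "rash", observed=2, expected=0.1),
    DrugEventCount("drug_B", "nausea", observed=200, expected=10.0),
]

for c in counts:
    print(c.drug, c.event,
          round(relative_reporting_ratio(c), 2),
          round(shrunken_ratio(c), 2))
```

Both pairs have the same raw ratio, but the pair backed by only two reports is pulled strongly toward the prior, while the pair backed by 200 reports barely moves. That is the sense in which shrinkage guards against signals driven by very small counts.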
MGPS: Significance



Handles Complex Stratification
(age, gender, year of report > 945 categories)
Performs complex computations in minimal
amount of time: Much MORE EFFICIENT
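As a rough illustration of what stratification means for the expected counts, the sketch below computes an expected drug-event count separately within each stratum and then sums over strata. The report format, the stratum labels, and the function name are assumptions made for this example, not the AERS schema or the MGPS code.

```python
from collections import defaultdict


def stratified_expected(reports, drug, event):
    """Expected number of reports mentioning both `drug` and `event`,
    assuming independence within each stratum.

    reports: iterable of (stratum, set_of_drugs, set_of_events) tuples
             (a hypothetical, simplified report format).
    """
    totals = defaultdict(int)      # reports per stratum
    with_drug = defaultdict(int)   # reports per stratum mentioning the drug
    with_event = defaultdict(int)  # reports per stratum mentioning the event
    for stratum, drugs, events in reports:
        totals[stratum] += 1
        if drug in drugs:
            with_drug[stratum] += 1
        if event in events:
            with_event[stratum] += 1
    # Sum over strata of n_drug * n_event / n_total.
    return sum(with_drug[s] * with_event[s] / n for s, n in totals.items())


# Invented reports; a stratum here is (gender, age band).
reports = [
    (("F", "65+"), {"drug_A"}, {"rash"}),
    (("F", "65+"), {"drug_A"}, {"nausea"}),
    (("F", "65+"), {"drug_B"}, {"rash"}),
    (("F", "65+"), {"drug_B"}, {"nausea"}),
    (("M", "18-64"), {"drug_A"}, {"nausea"}),
    (("M", "18-64"), {"drug_B"}, {"nausea"}),
]

expected = stratified_expected(reports, "drug_A", "rash")
print(expected)
```

Computing the expectation within strata keeps a drug that is mostly given to one age group from looking associated with events that are simply common in that age group.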
Real World Example:
Membership: PhRMA-FDA
Working Group
Chair: June Almenoff (GSK)
FDA Involvement
Involved PhRMA companies: Abbott, Allergan,
AstraZeneca, Bristol-Myers Squibb,
GlaxoSmithKline, Johnson & Johnson, Lilly, Merck,
Novartis, Schering-Plough, Pfizer, Roche, Wyeth
Drug Discovery
SCAM—Statistical Classification of
Activities of Molecules

Recursive partitioning customized for
chemistry

Creates a structure-activity relationship (SAR)
model

Handles large numbers of descriptors (> 1
million)
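One step of recursive partitioning on binary descriptors can be sketched as follows. This is a minimal illustration under stated assumptions, not the SCAM program: descriptors are given as bit strings, the split is chosen by the largest absolute two-sample t statistic, and `min_size` is a hypothetical parameter. The compounds and activities are invented.

```python
import math
from statistics import mean, variance


def t_statistic(present, absent):
    """Pooled-variance two-sample t statistic; None if it is undefined."""
    n1, n2 = len(present), len(absent)
    pooled_var = ((n1 - 1) * variance(present)
                  + (n2 - 1) * variance(absent)) / (n1 + n2 - 2)
    if pooled_var == 0:
        return None
    return (mean(present) - mean(absent)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


def best_split(descriptors, activities, min_size=2):
    """Return (feature_index, t) for the descriptor bit that best separates
    high-activity from low-activity compounds, or None if no bit qualifies."""
    best = None
    for j in range(len(descriptors[0])):
        present = [y for bits, y in zip(descriptors, activities) if bits[j] == "1"]
        absent = [y for bits, y in zip(descriptors, activities) if bits[j] == "0"]
        if len(present) < min_size or len(absent) < min_size:
            continue  # too few compounds on one side to estimate a variance
        t = t_statistic(present, absent)
        if t is not None and (best is None or abs(t) > abs(best[1])):
            best = (j, t)
    return best


# Invented toy data: six compounds, three descriptor bits each.
descriptors = ["101", "100", "110", "011", "001", "010"]
activities = [2.5, 2.7, 2.4, 0.2, 0.4, 0.3]

feature, t = best_split(descriptors, activities)
print(feature, round(t, 2))
```

Applying `best_split` again to each of the two resulting groups, until no split is strong enough, gives the tree. A production tool would also correct for the number of descriptors tested at each node.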
SCAM: Data Structure
[Figure: biological activities Y1, Y2, Y3, Y4, ..., Yn (>100K) shown alongside chemical structures encoded as binary descriptor strings such as 1010111010000000000001 (> 2 million)]
SCAM’s Recursive Partitioning
Parent node: n = 1650, ave = 0.34, sd = 0.81
Split on Feature:
n = 1614, ave = 0.29, sd = 0.73
n = 36, ave = 2.60, sd = 0.9
$$ t = \frac{\text{Signal}}{\text{Noise}} = \frac{2.60 - 0.29}{0.734\sqrt{\frac{1}{36} + \frac{1}{1614}}} = 18.68 $$
rP = 2.03E-70
aP = 1.30E-66
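The arithmetic behind the t value on this slide can be checked from the summary statistics alone. The sketch below assumes the noise term is a pooled standard deviation multiplied by the square root of the summed reciprocal group sizes, which is the standard two-sample form; the function names are chosen for this example.

```python
import math


def pooled_sd(n1, sd1, n2, sd2):
    """Pooled standard deviation of two groups from their sizes and SDs."""
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def two_sample_t(n1, mean1, sd1, n2, mean2, sd2):
    """Signal (difference in means) divided by noise (standard error)."""
    noise = pooled_sd(n1, sd1, n2, sd2) * math.sqrt(1 / n1 + 1 / n2)
    return (mean1 - mean2) / noise


# Values from the slide: the two child nodes of the split.
s = pooled_sd(36, 0.9, 1614, 0.73)
t = two_sample_t(36, 2.60, 0.9, 1614, 0.29, 0.73)
print(round(s, 3), round(t, 2))  # approximately 0.734 and 18.68, as on the slide
```

The pooled standard deviation comes out at about 0.734, which explains where that figure in the denominator comes from.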
SCAM Tree
Advantages of SCAM
• Works for complex situations, mixtures and interactions
• Output is easy to understand and explain
• High statistical power
• Produces a valid answer
SCAM Drawbacks
• Data greedy
• Only one view of the data
• Binary descriptors may be too “crude”
• Disposition of outliers is difficult
• Highly correlated variables may be obscured
• Higher order interactions may be masked

Concluding Remarks

Data Mining enables us to efficiently handle
LARGE amounts of data

Data Mining allows us to perform analyses IN
REAL TIME

Data Mining covers a wide array of topics in
the drug industry, and its benefits are plentiful
References
Almenoff, June S., et al. “Disproportionality Analysis Using Empirical Bayes Data Mining: A Tool for the Evaluation of Drug Interactions in the Post-Marketing Setting.” Pharmacoepidemiology and Drug Safety, 12, 517-521 (2003).
Donahue, Rafe. “An Overview of Data Mining in Drug Development and Marketing.” http://home.earthlink.net/~rafedonahue. May 2003.
Hawkins, D.M. and G.V. Kass. “Automatic Interaction Detection.” Topics in Applied Multivariate Analysis, ed. Hawkins (1982).
Hawkins, D.M., S.S. Young and A. Rusinko. “Analysis of a Large Structure-Activity Data Set Using Recursive Partitioning.” QSAR, 16, 296-302 (1997).