Download surveillance - Emerson Statistics

Documentation for Cancer Surveillance Study
Studies of Cancer Incidence using Surveillance Data
It has long been recognized that there is wide variation in cancer incidence rates by
geographic location, and there is considerable epidemiologic interest in the investigation
of major causes of such disparities. By using cancer surveillance systems, investigators
can perform ecologic analyses of various risk factors. The basic approach is to perform
“ecologic studies” examining the correlation between (average) cancer incidence rates
and average exposure to hypothesized risk factors. As a first step, such studies are
extremely inexpensive compared to case-control studies, cohort studies, or clinical trials.
The accompanying data set (surveillance.txt) is derived from the incident cases of various
cancer types as reported to registries that participated in the Surveillance, Epidemiology
and End Results Program (SEER) of the National Cancer Institute during the years 1973
to 1984. In that program, a number of geographic regions in the United States strive to
collect data on all incident cancers diagnosed within their respective geographical
catchment areas. Each case thus identified is characterized with respect to age, sex,
ethnicity, and state or country of birth, among many other variables.
The data from the SEER registries can thus constitute the numerators in estimating cancer
incidence rates. Denominator data is then estimated from U.S. census data: For the
accompanying data set, a sample of 19% of the 1980 United States census data has
provided estimates of the numbers of people who lived in the geographical areas covered
by the SEER registries during that time period. These populations are enumerated for
each stratum defined by age, sex, ethnicity, and country of birth. The person-years of
observation for a given stratum is then estimated as the corresponding population size
times the number of years each registry contributed incident cases to the study.
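The person-years calculation described above can be sketched as follows. This is only a minimal illustration: it assumes that a registry contributes every calendar year from its starting year through 1984 inclusive, and that the 1980 census estimate stands in for the stratum population in each of those years. The function names and the example population are hypothetical.

```python
LAST_YEAR = 1984  # assumed final year of case collection


def years_contributed(start_year, last_year=LAST_YEAR):
    """Number of calendar years from start_year through last_year, inclusive."""
    if start_year > last_year:
        raise ValueError("start_year must not be later than last_year")
    return last_year - start_year + 1


def person_years(population, start_year, last_year=LAST_YEAR):
    """Estimated person-years: stratum population times years contributed."""
    return population * years_contributed(start_year, last_year)


# Made-up stratum of 10,000 people in a registry that started in 1973.
print(person_years(10_000, 1973))
```

Because the population is held fixed at its 1980 value, this is an approximation to the true person-years at risk rather than an exact count.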
Such study designs have been used in exploratory analyses to address such hypotheses as:
• A protective effect of vitamin D on the incidence of colorectal cancer. The
colorectal cancer incidence of 9 SEER sites was analyzed as a function of the
average amount of sunlight received in the geographical area.
• The relative contribution of genetics versus environmental or cultural exposures
(e.g., diet) in cancer incidence. For instance, incidence of stomach cancer among
U.S. white males or females is considerably lower than that among Japanese
males or females. If the relatively high rate of stomach cancer among Japanese in
Japan is due to a genetic predisposition, we might reasonably expect that
Japanese-Americans would have a similarly high incidence of stomach cancer. On
the other hand, if the high incidence of stomach cancer in Japan is primarily due
to environmental and cultural factors, we might expect that Americans of
Japanese ancestry would have stomach cancer incidence rates closer to those of
white Americans. We might further expect that immigrant Japanese, i.e., those
ethnic Japanese who were born in Japan and immigrated to the United States,
would show an intermediate risk of stomach cancer.
Standard Disclaimer: Such analyses are of course subject to the “ecologic fallacy”:
Trends in cancer incidence rates that correlate with average exposure to some risk factor
may not translate to the individual level risk. For instance, a study of prostate cancer
incidence across countries as a function of the per capita use of oral contraceptives (OC)
found that there was a significant trend toward higher prostate cancer incidence in those
countries that had the higher percentage of women using oral contraceptives (Margel,
Fleshner: BMJ Open 1, 2011). However, we can be certain that those individuals having
the highest exposure to oral contraceptives (women) did not ever have prostate cancer.
(To be fair, the authors never supposed that their analysis would apply to women. Instead
they were supposing that higher OC use would lead to higher presence of OCs and their
metabolic derivatives in water supplies and other environmental exposures. It could
indeed be the case that men having higher exposures were more likely to have prostate
cancer, and this study was designed as an inexpensive means of obtaining preliminary
data. Arguing against that however, is the finding that OC metabolites are a very low
proportion of the estrogens and estrogen-like compounds in drinking water.)
These data also illustrate a problem that commonly arises in data analysis: missing data. A
key variable of interest, country of birth, is
often missing from the SEER data. Depending upon the mechanism of missing data, it
might or might not be reasonable to merely exclude from both the numerator and
denominator all people for whom birthplace data were missing. In this case, however, our
problem is compounded because the source of the denominator data is more complete.
Approximately 0.6% of subjects are missing birthplace information in the census data,
while the missing data rates in the SEER data are 10-35%. Thus we have no way of
knowing the size of the denominator population in each stratum for which we would have
had complete ascertainment of birthplace by the SEER registry.
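To see how large the problem is in a particular registry, one could tabulate the share of cases whose birthplace is unknown. The sketch below is an assumption-laden illustration, not part of the original analysis: it uses the variable names of the accompanying data set (SEER, bplace, cases), the function name is hypothetical, and the counts are made up.

```python
from collections import defaultdict


def unknown_birthplace_fraction(rows):
    """Fraction of cases with bplace == "Unknown", by SEER site.

    rows: iterable of dicts with keys "SEER", "bplace", and "cases".
    """
    total = defaultdict(int)
    unknown = defaultdict(int)
    for row in rows:
        total[row["SEER"]] += row["cases"]
        if row["bplace"] == "Unknown":
            unknown[row["SEER"]] += row["cases"]
    # Sites with no cases at all are skipped to avoid dividing by zero.
    return {site: unknown[site] / total[site] for site in total if total[site] > 0}


# Made-up counts for two sites, for illustration only.
example_rows = [
    {"SEER": "CT", "bplace": "US", "cases": 70},
    {"SEER": "CT", "bplace": "nonUS", "cases": 10},
    {"SEER": "CT", "bplace": "Unknown", "cases": 20},
    {"SEER": "UT", "bplace": "US", "cases": 45},
    {"SEER": "UT", "bplace": "Unknown", "cases": 5},
]
fractions = unknown_birthplace_fraction(example_rows)
print(fractions)
```

Applying the same tabulation to the census80 column would give the corresponding share in the denominator, which makes the mismatch between the two sources visible site by site.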
Description of Data Set
The file surveillance.txt is a free-field (white space delimited) text file containing data
on incident colorectal cancer cases extracted from nine SEER registries. Cases are
restricted to patients whose race is white and whose cancer diagnoses corresponded to
one of 13 ICD-O morphology codes. For seven SEER sites, cases were abstracted if they
represented diagnoses made after January 1, 1973, while the start dates were January 1,
1974 for the Western Washington (Seattle) SEER site and January 1, 1975 for the Atlanta
SEER site. In all cases, data collection ceased for diagnoses after December 31, 1984.
The data consist of counts of the cases extracted from the SEER registries within strata
defined by sex, age (within 5 year age groups), birthplace (US, non-US, or unknown),
and SEER site. Also included are estimated catchment area populations within each
stratum as derived from a sample of 1980 U.S. census data. Each row of the file
corresponds to a single SEER-sex-age-birthplace stratum, with the first row of the file
containing the following variable names:
bplace     character string indicating birthplace (“US”, “nonUS”, or “Unknown”)
male       indicator of case sex (0 = female, 1 = male)
age        five year age category, indicated by the midpoint of the five year interval
           (so 0-5 yr = 2.5, 5-10 yr = 7.5, … 80-85 yr = 82.5, > 85 yr = 87.5)
SEER       character string indicating SEER site (“ATL” = Atlanta,
           “CT” = Connecticut, “DET” = Detroit, “HI” = Hawaii, “IA” = Iowa,
           “NM” = New Mexico, “SF” = San Francisco, “UT” = Utah,
           “WWA” = Western Washington (Seattle))
startyr    starting year for case collection from the corresponding SEER (all case
           collection stopped on December 31, 1984)
cases      number of colorectal cancer diagnoses made at the corresponding SEER
           site during the data collection period for subjects who were of the
           corresponding age, sex, and birthplace
census80   estimated white population of the specified sex, age, and birthplace in the
           SEER geographic catchment area according to the 1980 census
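As a rough sketch of how a file in this layout might be read and summarized, the example below parses whitespace-delimited rows and computes crude incidence rates per 100,000 person-years by birthplace. Several things here are assumptions rather than facts about surveillance.txt: that character values may or may not be wrapped in double quotes, that census80 can be non-integer because it is an estimate, and that person-years are taken as census80 times the number of calendar years from startyr through 1984. The function names are hypothetical and the sample rows are made up.

```python
import io
from collections import defaultdict

# Made-up rows in the documented column order, for illustration only.
SAMPLE = """\
bplace male age SEER startyr cases census80
US 1 62.5 CT 1973 120 40000
nonUS 1 62.5 CT 1973 15 4000
Unknown 1 62.5 CT 1973 30 300
US 0 62.5 ATL 1975 50 25000
"""


def read_strata(handle):
    """Parse a whitespace-delimited file whose first row holds variable names."""
    header = handle.readline().split()
    rows = []
    for line in handle:
        fields = [field.strip('"') for field in line.split()]
        if not fields:
            continue  # skip blank lines
        row = dict(zip(header, fields))
        row["male"] = int(row["male"])
        row["age"] = float(row["age"])
        row["startyr"] = int(row["startyr"])
        row["cases"] = int(row["cases"])
        row["census80"] = float(row["census80"])
        rows.append(row)
    return rows


def crude_rates(rows, last_year=1984):
    """Crude incidence per 100,000 person-years, grouped by birthplace."""
    cases = defaultdict(int)
    pyears = defaultdict(float)
    for row in rows:
        years = last_year - row["startyr"] + 1
        cases[row["bplace"]] += row["cases"]
        pyears[row["bplace"]] += row["census80"] * years
    return {b: 1e5 * cases[b] / pyears[b] for b in cases if pyears[b] > 0}


rows = read_strata(io.StringIO(SAMPLE))
rates = crude_rates(rows)
for birthplace, rate in sorted(rates.items()):
    print(birthplace, round(rate, 1))
```

To run this against the actual file, pass an open file handle for surveillance.txt to read_strata in place of the in-memory sample. The rate computed for the “Unknown” category should not be read as a real incidence rate, since birthplace is missing far more often in the case counts than in the census data.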