Documentation for Cancer Surveillance Study

Studies of Cancer Incidence using Surveillance Data

It has long been recognized that cancer incidence rates vary widely by geographic location, and there is considerable epidemiologic interest in investigating the major causes of such disparities. By using cancer surveillance systems, investigators can perform ecologic analyses of various risk factors. The basic approach is to perform “ecologic studies” examining the correlation between (average) cancer incidence rates and average exposure to hypothesized risk factors. As a first step, such studies are extremely inexpensive compared to case-control studies, cohort studies, or clinical trials.

The accompanying data set (surveillance.txt) is derived from the incident cases of various cancer types as reported to registries that participated in the Surveillance, Epidemiology and End Results Program (SEER) of the National Cancer Institute during the years 1973 to 1984. In that program, a number of geographic regions in the United States strive to collect data on all incident cancers diagnosed within their respective geographical catchment areas. Each case thus identified is characterized with respect to age, sex, ethnicity, and state or country of birth, among many other variables. The data from the SEER registries can thus constitute the numerators in estimating cancer incidence rates.

Denominator data are then estimated from U.S. census data: for the accompanying data set, a sample of 19% of the 1980 United States census data has provided estimates of the numbers of people who lived in the geographical areas covered by the SEER registries during that time period. These populations are enumerated for each stratum defined by age, sex, ethnicity, and country of birth.
The person-years of observation for a given stratum is then estimated as the corresponding population size times the number of years each registry contributed incident cases to the study.

Such study designs have been used in exploratory analyses to address such hypotheses as:

- A protective effect of vitamin D on the incidence of colorectal cancer. The colorectal cancer incidence of 9 SEER sites was analyzed as a function of the average amount of sunlight received in the geographical area.
- The relative contribution of genetics versus environmental or cultural exposures (e.g., diet) in cancer incidence. For instance, incidence of stomach cancer among U.S. white males or females is considerably lower than that among Japanese males or females. If the relatively high rate of stomach cancer among Japanese in Japan is due to a genetic predisposition, we might reasonably expect that Japanese-Americans would have a similarly high incidence of stomach cancer. On the other hand, if the high incidence of stomach cancer in Japan is primarily due to environmental and cultural factors, we might expect that Americans of Japanese ancestry would have stomach cancer incidence rates closer to those of white Americans. We might further expect that immigrant Japanese, i.e., those ethnic Japanese who were born in Japan and immigrated to the United States, would show an intermediate risk of stomach cancer.

Standard Disclaimer: Such analyses are of course subject to the “ecologic fallacy”: trends in cancer incidence rates that correlate with average exposure to some risk factor may not translate to individual-level risk.
For instance, a study of prostate cancer incidence across countries as a function of the per capita use of oral contraceptives (OC) found a significant trend toward higher prostate cancer incidence in those countries that had a higher percentage of women using oral contraceptives (Margel, Fleshner: BMJ Open 1, 2011). However, we can be certain that those individuals having the highest exposure to oral contraceptives (women) did not ever have prostate cancer. (To be fair, the authors never supposed that their analysis would apply to women. Instead they were supposing that higher OC use would lead to higher presence of OCs and their metabolic derivatives in water supplies and other environmental exposures. It could indeed be the case that men having higher exposures were more likely to have prostate cancer, and this study was designed as an inexpensive means of obtaining preliminary data. Arguing against that, however, is the finding that OC metabolites are a very low proportion of the estrogens and estrogen-like compounds in drinking water.)

These data also illustrate a problem that commonly arises in data analysis: missing data. A key variable of interest, country of birth, is often missing from the SEER data. Depending upon the mechanism of missing data, it might or might not be reasonable to merely exclude from both the numerator and denominator all people for whom birthplace data were missing. In this case, however, our problem is compounded because the source of the denominator data is more complete. Approximately 0.6% of subjects are missing birthplace information in the census data, while the missing data rates in the SEER data are 10-35%. Thus we have no way of knowing the size of the denominator population in each stratum for which we would have had complete ascertainment of birthplace by the SEER registry.
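The mismatch in missing birthplace between numerator and denominator can be checked directly from the stratum counts. The following is a minimal sketch, assuming the column names given in the data set description (bplace, SEER, cases, census80) and assuming that the census counts also carry an "Unknown" birthplace stratum. The rows in SAMPLE are made up for illustration and are not real SEER counts; the helper names read_strata and unknown_fraction are hypothetical.

```python
import io
from collections import defaultdict

# Made-up rows in the layout of surveillance.txt (NOT real SEER counts).
SAMPLE = """\
bplace male age SEER startyr cases census80
US 1 62.5 IA 1973 80 50000
nonUS 1 62.5 IA 1973 5 2000
Unknown 1 62.5 IA 1973 15 300
US 1 62.5 ATL 1975 40 30000
nonUS 1 62.5 ATL 1975 2 1000
Unknown 1 62.5 ATL 1975 18 200
"""


def read_strata(handle):
    """Parse a whitespace-delimited file whose first row holds variable names."""
    header = handle.readline().split()
    rows = []
    for line in handle:
        if line.strip():  # skip blank lines
            rows.append(dict(zip(header, line.split())))
    return rows


def unknown_fraction(rows, count_field):
    """Per SEER site, the share of count_field in the Unknown birthplace strata."""
    total = defaultdict(float)
    unknown = defaultdict(float)
    for row in rows:
        n = float(row[count_field])
        total[row["SEER"]] += n
        if row["bplace"] == "Unknown":
            unknown[row["SEER"]] += n
    return {site: unknown[site] / total[site] for site in total if total[site] > 0}


rows = read_strata(io.StringIO(SAMPLE))
case_missing = unknown_fraction(rows, "cases")    # numerator (SEER) missingness
pop_missing = unknown_fraction(rows, "census80")  # denominator (census) missingness

for site in sorted(case_missing):
    print(f"{site}: cases {case_missing[site]:.1%} unknown, "
          f"census {pop_missing[site]:.1%} unknown")
```

To run this against the real file, replace io.StringIO(SAMPLE) with open("surveillance.txt"). A large gap between the two fractions at a site is the situation described above: dropping the "Unknown" strata would remove far more cases than population.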
Description of Data Set

The file surveillance.txt is a free-field (white space delimited) text file containing data on incident colorectal cancer cases extracted from nine SEER registries. Cases are restricted to patients whose race is white and whose cancer diagnoses corresponded to one of 13 ICD-O morphology codes. For seven SEER sites, cases were abstracted if they represented diagnoses made after January 1, 1973, while the start dates were January 1, 1974 for the Western Washington (Seattle) SEER site and January 1, 1975 for the Atlanta SEER site. In all cases, data collection ceased for diagnoses after December 31, 1984. The data consist of counts of the cases extracted from the SEER registries within strata defined by sex, age (in 5-year age groups), birthplace (US, non-US, or unknown), and SEER site. Also included are estimated catchment area populations within each stratum as derived from a sample of 1980 U.S. census data.
Each row of the file corresponds to a single SEER-sex-age-birthplace stratum, with the first row of the file containing the following variable names:

- bplace: character string indicating birthplace (“US”, “nonUS”, or “Unknown”)
- male: indicator of case sex (0 = female, 1 = male)
- age: five-year age category, indicated by the midpoint of the five-year interval (so 0-5 yr = 2.5, 5-10 yr = 7.5, …, 80-85 yr = 82.5, > 85 yr = 87.5)
- SEER: character string indicating SEER site (“ATL” = Atlanta, “CT” = Connecticut, “DET” = Detroit, “HI” = Hawaii, “IA” = Iowa, “NM” = New Mexico, “SF” = San Francisco, “UT” = Utah, “WWA” = Western Washington (Seattle))
- startyr: starting year for case collection from the corresponding SEER site (all case collection stopped on December 31, 1984)
- cases: number of colorectal cancer diagnoses made at the corresponding SEER site during the data collection period for subjects who were of the corresponding age, sex, and birthplace
- census80: estimated white population of the specified sex, age, and birthplace in the SEER geographic catchment area according to the 1980 census
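As an illustration of how these variables combine, the sketch below estimates person-years as census80 times the number of collection years and then computes a crude incidence rate per 100,000 person-years for each stratum. It assumes that a site starting on January 1 of startyr contributes 1984 - startyr + 1 full years (12 for a 1973 start, 11 for 1974, 10 for 1975). The rows in SAMPLE are made up and are not real SEER counts, and the function names load_strata, person_years, and rate_per_100k are hypothetical.

```python
import io

END_YEAR = 1984  # last calendar year of case collection, per the documentation

# Made-up rows in the layout of surveillance.txt (NOT real SEER counts).
SAMPLE = """\
bplace male age SEER startyr cases census80
US 1 62.5 IA 1973 120 50000
US 1 62.5 WWA 1974 55 25000
nonUS 0 67.5 ATL 1975 4 2000
"""


def load_strata(handle):
    """Read whitespace-delimited strata and convert the numeric fields."""
    header = handle.readline().split()
    strata = []
    for line in handle:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        rec = dict(zip(header, fields))
        rec["male"] = int(rec["male"])
        rec["age"] = float(rec["age"])
        rec["startyr"] = int(rec["startyr"])
        rec["cases"] = int(rec["cases"])
        rec["census80"] = float(rec["census80"])  # an estimate, so keep as float
        strata.append(rec)
    return strata


def person_years(rec):
    """Population size times the number of years the registry contributed cases."""
    years = END_YEAR - rec["startyr"] + 1  # assumption: full calendar years
    return rec["census80"] * years


def rate_per_100k(rec):
    """Crude incidence rate per 100,000 person-years for one stratum."""
    return 1e5 * rec["cases"] / person_years(rec)


strata = load_strata(io.StringIO(SAMPLE))
for rec in strata:
    print(rec["SEER"], rec["bplace"], rec["age"],
          person_years(rec), round(rate_per_100k(rec), 1))
```

To use the real data, pass open("surveillance.txt") to load_strata in place of the in-memory sample. Strata with census80 equal to zero would need to be skipped or pooled before computing a rate.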