Information-Statistical Approach for a Strategic Planning
of a Community-Based Wireless Project
Wei-Tsu Yang1 and Bon K. Sy2
1 Queens College/CUNY, Computer Science Department, Flushing NY 11367, U.S.A.
[email protected]
2 Queens College/CUNY, Computer Science Department, Flushing NY 11367, U.S.A.
[email protected]
Abstract. The objective of this paper is to apply an information-statistical data
mining technique to achieve two specific goals related to building a
community-based wireless infrastructure. The first goal is to discover from data
the characteristics behind a successful project on building a community-based
wireless infrastructure; e.g., NYCwireless. The second goal is to estimate the
node distribution of a wireless infrastructure if such a community-based
wireless project is to be invested and expanded to the Queens county of New
York City. The first step is to discover statistically significant event patterns
that attempt to explain the characteristics of the NYCwireless project. Then
an information-statistical approach is applied to discover an optimal probability
model, based on the Shannon entropy criterion, for estimating the node distribution
of a projected wireless infrastructure in Queens County of New York City.
1 Introduction
NYCwireless [1] is a non-profit organization that provides free wireless Internet
service over radio connections to mobile users in public spaces such as coffee shops
and parks throughout the Manhattan metropolitan area in New York City. Each node is
operated independently by volunteers using their own equipment.
By the end of April 2002, over ninety network nodes were listed in the
NYCwireless database for the New York City metropolitan area. More than half of
them (49) are located in Manhattan. In contrast, Queens County accounts for less
than 10% of the nodes. Yet according to the population survey conducted by the U.S.
Census Bureau [2], Queens County has a larger population than Manhattan.
In this paper, we attempt to answer two specific questions that may provide
valuable information for strategic planning and to assist in the decision making
process behind building a community-based wireless infrastructure in the Queens
County of New York City:
1. What are the characteristics, quantified by statistically significant event patterns,
of the active participants who contributed to the community-based wireless
infrastructure in Manhattan?
2. Based on the information revealed by statistically significant event patterns, what
is the optimal probability model, with respect to the Shannon entropy criterion, that
can be used as a basis for projecting the spatial distribution of wireless nodes in
Queens County of New York City if resources identical to those NYCwireless
invested in Manhattan are applied to Queens?
2 Information Statistical Approach
Since the primary goal of our project is to discover the relationship between event
patterns and the distribution of wireless nodes, an information-statistical approach is
chosen. Such an approach is well suited to uncovering unknown event
patterns that are statistically significant [3]. The choice of representation and
characteristics for capturing the behavior of wireless node owners is still an open
issue. In this research, we choose a representation framework that is based on
multi-valued variables, and we apply an information-statistical approach to reveal the
information embedded in the event patterns.
The information-statistical approach to data mining is built upon the concept of
patterns. The concept of patterns is common in the data mining community [4].
Grenander has discussed extensively a general concept of patterns from the
perspective of applied mathematics, with application to understanding the relationship
between image set patterns and statistical geometry [5,6,7].
One notion of the concept of patterns that we have explored is to capture the
meaning and the quality of the information embedded in data. In comparison to the
concept of patterns discussed by Grenander, one interesting aspect we have found is the
possibility of interpreting joint events of discrete random variables that survive a
statistical hypothesis test of interdependency as statistically significant association
patterns. In doing so, significant previous work [8,9,10,11,12,13]
may be used to provide a unified framework for linking information theory with
statistical analysis. The significance of such a linkage is that it provides a
basis not only for using statistical approaches to reveal hidden significant association
patterns, but also for using information theory as a measurement instrument to determine
the quality of information obtained from statistical analysis. For further details on the
application of statistical techniques for analyzing and discovering statistical patterns,
and of information theory for interpreting the meaning behind the statistical analysis,
readers are referred to a report elsewhere [3]. A specific example will be shown later
to illustrate the process of discovering significant event patterns using the mutual
information measure and chi-square analysis.
In our proposed information-statistical data mining technique, the statistically
significant event patterns just discovered are then used to identify an optimal
probability model that maximizes Shannon entropy. An optimal model maximizing
Shannon entropy has the property of minimizing undesirable bias contributed by the
unknown information. As discussed elsewhere [14], identifying an optimal probability
model based on the marginal and joint frequency information of discrete random
variables is an optimization problem. Specifically, the optimization problem consists
of a set of linear probability constraints, and a non-linear objective function due to
the Shannon entropy criterion. In the next section, we will describe how the proposed
information-statistical approach can be applied to (1) identify the characteristics of
active participants who contributed to the community-based wireless infrastructure in
Manhattan, and (2) project the spatial distribution of wireless nodes in Queens
County of New York City based on the findings in (1).
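The optimization problem described in this section can be written out in general form as follows. This is only a sketch of the standard maximum-entropy formulation; the symbols $p_i$, $S_k$, $c_k$, $n$, and $m$ are notation assumed here for illustration and are not taken from [14].

```latex
\begin{aligned}
\max_{p_1,\dots,p_n}\quad & H(P) = -\sum_{i=1}^{n} p_i \log_2 p_i \\
\text{subject to}\quad & \sum_{i=1}^{n} p_i = 1, \qquad p_i \ge 0 \quad (i = 1,\dots,n), \\
& \sum_{i \in S_k} p_i = c_k \quad (k = 1,\dots,m),
\end{aligned}
```

Here $p_i$ stands for the $i$-th joint probability term of the model, $S_k$ for the set of joint terms consistent with the $k$-th marginal or joint event, and $c_k$ for the probability of that event estimated from the data. The constraints are linear in the $p_i$, while the entropy objective is non-linear.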
3 Application of Data Mining to Wireless Project Feasibility Study
According to the mission statement of NYCwireless, all of the wireless access points
are managed by independent volunteers who have a broadband Internet connection.
An interesting question, though a difficult one to answer, is why they are willing to
share their broadband Internet connection with the public at their own cost. Since
each of them may have his or her own reasons, we approach the question
from a slightly different perspective: we attempt to find out what
characteristics these volunteers have in common.
According to the surveys of U.S. households with Internet access obtained
from the National Telecommunications and Information Administration (NTIA) and
from the Economics and Statistics Administration (ESA), which use U.S. Census Bureau
Current Population Survey Supplements [15,16], family income, age, and educational
attainment are three main factors affecting Internet use in America. As such, these
three factors are chosen as a basis for understanding the NYCwireless access point
distribution.
3.1 Discovering the characteristics
Using these three parameters (family income, age, and educational attainment),
we attempt to understand the characteristics of the active participants
who contributed to the community-based wireless infrastructure in Manhattan.
The data set used in this study is drawn from the 1990 Census Data Lookup server
[2]. The survey was conducted by the U.S. Census Bureau in 1990. We are aware of
the issue of "synchronizing" data sets that cover different time frames (see
Discussion for details).
The data set used in this study will be referred to as DS1990. A total of 92 ZIP
codes are listed in the 1990 Census Data for Manhattan. However, only 38 of them
actually have valid data. Thus, only these 38 ZIP codes in DS1990 are included in this
study. Using ZIP code as a basis to partition the 1990 Census Data for Manhattan, four
simple frequency counts were performed on each partition: (1) the number of
individuals aged between 25 and 49, (2) the number of individuals with an educational
attainment at or above a bachelor's degree, (3) the number of individuals with
a personal income level above $25,000, and (4) the number of wireless access
points. We then computed the mean μ and the standard deviation σ for each of
the four frequency counts using the tallies of all ZIP codes.
Four parameters are defined for the frequency information relevant to ZIP codes.
They are AG: individuals aged between 25 and 49 from all ZIP codes, ED: educational
attainment at or above a bachelor's degree from all ZIP codes, IN: personal
income level above $25,000 from all ZIP codes, and NO: the number of wireless
access points from all ZIP codes.
To convert these four parameters into discrete random variables, each parameter
is quantized into four possible states {1, 2, 3, 4}, with each state representing an
interval bounded by the mean μ and one standard deviation σ on either side of it. For
example, AG = 1 means that the frequency count of individuals aged between 25 and 49
in a ZIP code is more than one standard deviation below the mean; i.e., below μ-σ.
Pr(AG=1) is the percentage a/b * 100% computed over the sample population
partitioned by ZIP code. Within a ZIP code (indexed by i), we count the
number of individuals aged between 25 and 49, denoted by $t_i$. We then
calculate the mean $\mu = \frac{1}{b}\sum_{i=1}^{b} t_i$ (and similarly the standard
deviation σ), where b is the number of ZIP codes. For each ZIP code we then determine
whether $t_i$ is less than μ-σ. The number of ZIP codes for which $t_i$ is less than
μ-σ is a. The percentage a/b * 100% defines Pr(AG=1) (see Appendix 1 for details).
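The quantization and the percentage count described above can be sketched in Python as follows. This is an illustrative sketch rather than the S-PLUS code used in the study: the function names and the toy counts are hypothetical, the use of the population standard deviation is an assumption (the text does not say which estimator was used), and the treatment of counts that fall exactly on a boundary is also an assumption.

```python
from statistics import mean, pstdev


def quantize(count, mu, sigma):
    """Map a frequency count to one of the four states {1, 2, 3, 4}."""
    if count < mu - sigma:
        return 1          # below mu - sigma
    if count < mu:
        return 2          # between mu - sigma and mu
    if count <= mu + sigma:
        return 3          # between mu and mu + sigma
    return 4              # above mu + sigma


def state_probabilities(counts):
    """Return Pr(state) as the fraction a/b of ZIP codes in each state."""
    mu = mean(counts)
    sigma = pstdev(counts)  # assumption: population standard deviation
    states = [quantize(t, mu, sigma) for t in counts]
    b = len(counts)         # number of ZIP codes
    return {s: states.count(s) / b for s in (1, 2, 3, 4)}


# Hypothetical counts t_i of residents aged 25-49 in six ZIP codes.
toy_counts = [120, 450, 500, 530, 610, 980]
probs = state_probabilities(toy_counts)
print(probs[1])  # fraction of ZIP codes with AG = 1
```

The same two functions would apply unchanged to the ED, IN, and NO counts, since each parameter is quantized against its own mean and standard deviation.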
3.2 Results
An S-PLUS function was written to detect event associations based on the following
statistical hypothesis:
(1)
where the mutual information analysis is represented by :
(2)
The function returns, in a list, every pattern that passes the test in (1). An
instantiation of all four variables is an event pattern. In other words, an event pattern
is a 4-tuple of variable-value pairs. Out of 256 (4*4*4*4) possible event patterns, our
search function detects 26 significant event patterns (see Appendix 2
for details).
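To make the search over event patterns concrete, the sketch below enumerates the 256 possible 4-tuples and scores each observed pattern. Because equations (1) and (2) are not legible in this copy, the score used here is an assumption: the common pointwise form log2 of the joint probability over the product of the marginals. The significance test itself is not reproduced, and the function name and toy records are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2


def pattern_mutual_information(records):
    """Score each observed pattern as log2(Pr(joint) / product of Pr(marginals)).

    records is a list of 4-tuples of states, one tuple per ZIP code.
    """
    n = len(records)
    joint = Counter(records)
    marginals = [Counter(r[k] for r in records) for k in range(4)]
    scores = {}
    for pattern, count in joint.items():
        p_joint = count / n
        p_independent = 1.0
        for k, state in enumerate(pattern):
            p_independent *= marginals[k][state] / n
        scores[pattern] = log2(p_joint / p_independent)
    return scores


# All 4*4*4*4 = 256 candidate event patterns (X1, X2, X3, X4).
all_patterns = list(product((1, 2, 3, 4), repeat=4))

# Hypothetical quantized records for four ZIP codes.
toy_records = [(1, 1, 1, 1), (1, 1, 1, 1), (3, 3, 4, 4), (3, 3, 4, 4)]
scores = pattern_mutual_information(toy_records)
print(len(all_patterns), scores[(3, 3, 4, 4)])
```

Patterns that never occur in the data have zero joint probability and are simply absent from the returned dictionary, so only observed patterns are candidates for the significance test.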
We then cross-validated the results with an alternative approach discussed in [3]. Of
the 26 significant patterns found, 18 were chosen based on the union of the two result
sets. The next step is to derive the probability of each pattern in order to formulate
probability constraints. Since X1:1 in our data set represents the absence of wireless
nodes, it contributes nothing useful to our prediction. Hence, we exclude all discovered
significant patterns with X1:1. In addition, three marginal probability terms, Pr(X1:3)
= 0.18, Pr(X2:4) = 0.16, and Pr(X3:4) = 0.26, are also left out of our probability
constraints, because an optimal probability model will yield values
close to the actual ones anyway.
The optimal model is derived by applying the probability model discovery
algorithm discussed elsewhere [14]. Out of 256 joint probability terms of the joint
probability model, there are twenty-two non-zero probability terms (as listed in
Appendix 3).
Since we are interested in projecting the spatial distribution of wireless nodes in
Queens County in this project, we focus on the most probable event patterns
that are also statistically significant. Note that (X1:3, X2:3, X3:4, X4:4) is the most
probable event pattern that is also statistically significant. Together with the other
significant event patterns (X1:3, X2:3, X3:3, X4:3), (X1:3, X2:3, X3:3, X4:4),
and (X1:3, X2:4, X3:4, X4:4), it reveals information that can be stated as follows:
In an area characterized by its ZIP code, if the frequency count of individuals aged
between 25 and 49 is equal to or above the overall mean, and the frequency count
of individuals with an educational attainment at or above a
bachelor's degree is equal to or above the overall mean, and the frequency count of
individuals with a personal income level above $25,000 is equal to or above
the overall mean, then we can project that there will be multiple wireless network nodes
in that ZIP area.
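The rule stated above amounts to a conjunction of three threshold checks. A minimal sketch follows; the function name, the dictionary layout, and all numbers are hypothetical, and the means would in practice come from the calibrated census counts.

```python
def projects_multiple_nodes(counts, means):
    """Return True when every factor count is at or above its overall mean.

    counts and means are dictionaries keyed by the factor names AG, ED, IN.
    """
    return all(counts[factor] >= means[factor] for factor in ("AG", "ED", "IN"))


# Hypothetical overall means and two hypothetical ZIP areas.
means = {"AG": 1000.0, "ED": 400.0, "IN": 600.0}
zip_a = {"AG": 1200, "ED": 400, "IN": 900}   # all three at or above the mean
zip_b = {"AG": 1200, "ED": 350, "IN": 900}   # ED falls below its mean

print(projects_multiple_nodes(zip_a, means))
print(projects_multiple_nodes(zip_b, means))
```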
3.3 Evaluation of our prediction
Applying the above model to the adjusted 1990 census data (see Discussion for details)
for Queens County, 14 ZIP areas (11104, 11354, 11355, 11357, 11365, 11367,
11368, 11372, 11373, 11374, 11375, 11377, 11385, 11435) are projected to have one
or more wireless network nodes.
As of 01/13/2003, among the 63 ZIP areas in Queens County, there are more than
thirty wireless network nodes in the following 13 ZIP areas: 11103, 11104, 11354,
11355, 11359, 11364, 11365, 11367, 11373, 11374, 11375, 11416, 11428.
The number of unique ZIP areas in the two lists combined is 19. Eight of the 13 actual
ZIP areas (with wireless nodes present) are among the 14 projected ZIP areas. Therefore,
there are five (13 – 8) false negative cases and six (14 – 8) false positive cases.
Hence, the false negative error rate is 5/19 = 26.3% and the false positive error rate is
6/19 = 31.6%.
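The counts in this evaluation can be checked with plain set arithmetic on the two ZIP lists given above. The sketch below follows the paper's convention of dividing by the 19 ZIP areas in the union of the two lists, which differs from the more usual definitions of false negative and false positive rates; the variable names are illustrative.

```python
projected = {11104, 11354, 11355, 11357, 11365, 11367, 11368,
             11372, 11373, 11374, 11375, 11377, 11385, 11435}
actual = {11103, 11104, 11354, 11355, 11359, 11364, 11365,
          11367, 11373, 11374, 11375, 11416, 11428}

hits = projected & actual             # projected and actually present
false_negatives = actual - projected  # present but not projected
false_positives = projected - actual  # projected but not present
all_areas = projected | actual        # unique ZIP areas in either list

fn_rate = len(false_negatives) / len(all_areas)
fp_rate = len(false_positives) / len(all_areas)

print(len(hits), len(all_areas))       # 8 19
print(f"{fn_rate:.1%} {fp_rate:.1%}")  # 26.3% 31.6%
```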
4 Discussion
The 1990 Census data set used for data mining has the following data quality issues:
1. For both Manhattan and Queens Counties, there is an integrity issue; e.g., the sum
of individuals does not equal the reported total count in some cases. This occurred
in each of the three selected factors. A normalization factor has been applied
to correct the problem.
2. The data on the Census Data Lookup server were collected in 1990; however, the
wireless network node distribution data drawn from NYCwireless are the most up-to-date
available. These two data sets do not cover the same time frame, so we have a data
inconsistency problem.
In order to calibrate our data sets, we estimated the 2001 population data based on
the percent change in population from 1990 to 2000 and applied the adjustment
accordingly. In addition to data calibration in terms of time, we believe data
calibration in terms of population density is essential as well. The population of
sub-areas in Manhattan varies considerably. As a result, if we simply used the
time-adjusted frequency counts drawn from the Census Data without further calibration,
the data mining results would be subject to bias.
To calibrate the frequency counts for each ZIP code in Manhattan, we multiply each
time-adjusted value by the ratio of the total population of New York County to the
population of that ZIP code. The data set pertinent to Queens County
is calibrated in the same manner.
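The population-density calibration described above is a single multiplication per ZIP code. A minimal sketch follows, taking the time-adjusted count as given, since the exact extrapolation from the 1990 to 2000 percent change is not spelled out; the function name and the numbers are hypothetical.

```python
def calibrate(time_adjusted_count, county_population, zip_population):
    """Scale a time-adjusted count by county population over ZIP population."""
    if zip_population <= 0:
        raise ValueError("zip_population must be positive")
    return time_adjusted_count * (county_population / zip_population)


# Hypothetical example: 200 individuals counted in a ZIP code of 50,000
# residents, within a county of 1,000,000 residents.
calibrated = calibrate(200, 1_000_000, 50_000)
print(calibrated)  # 4000.0
```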
Over 30% of the ZIP code data for Manhattan are missing from the 1990 census
data set. Although this affects data quality, the data set is still by far the most
comprehensive one in terms of the socioeconomic and demographic variables
available. We chose to eliminate sub-areas with missing data from our analysis.
In light of the discussion above, we are aware of potential bias that might exist
in our data analysis due to data quality issues. In the future, when Census 2000 data or
other reliable data sets on socioeconomic and demographic variables are available for
public lookup, follow-up research based on the new data set will be conducted.
5 Conclusion
This research yields several useful findings regarding the distribution of NYCwireless
network nodes. In particular, the numbers of individuals aged between 25
and 49, with an educational attainment at or above a bachelor's degree, and with a
personal income level above $25,000 are strongly associated with the distribution of
wireless access points in Manhattan.
Based on the optimal model found with the information-statistical approach, 14 ZIP
areas in Queens County are projected to have one or more wireless nodes. Although
our projection has a relatively high false positive rate, its false negative rate is lower.
Given that the characteristics of the participants who are willing to contribute to the
community-based wireless infrastructure are very difficult to define, we believe our
projection is valuable for predicting the spatial distribution of wireless nodes in
Queens County of New York City.
The performance of the information-statistical approach will be the focus of our
follow-up research. It will be interesting to see how other data mining algorithms
perform against it in terms of false negative rate and false positive rate.
Some other parameters, such as political affiliation, profession, and business density
of sub-areas, are not covered in this paper. Political affiliation is a sociological
factor that may provide insights into a person's behavior and decisions. A public-minded
person might adopt the idea of the community-based NYCwireless project more quickly than
others. A person's profession may also have a decisive influence on the ability to run a
wireless access point. The density of coffee shops and bars in a sub-area is
vital for evaluating the maximum number of beneficiaries of wireless access points. It is
also a factor that may influence the existence of a wireless access point. A person might
be reluctant to join the community if he or she does not expect anyone in the
neighborhood to use the access point. These factors will also be the focus of our future study.
6 Acknowledgement
This work is supported in part by an NSF DUE CCLI grant #0088778, and a PSC-CUNY Research Award.
References
1. (WWW http://www.nycwireless.net/) NYCwireless
2. (WWW http://homer.ssd.census.gov/cdrom/lookup) 1990 Census Data.
3. Sy B.K., "Information-Statistical Pattern Based Approach for Data Mining," Journal of
Statistical Computing and Simulation, Gordon and Breach Publishing Group, NJ, 69(2),
2001.
4. Fayyad, U. M. and Piatetsky-Shapiro, G. and Smyth, P., "From Data Mining to Knowledge
Discovery: An Overview", in Advances in Knowledge Discovery and Data Mining, (editors:
Fayyad, U. M. and Piatetsky-Shapiro, G. and Smyth, P. and Uthurusamy, R.), chapter 1, p 1-34,
AAAI Press / MIT Press, 1996.
5. Grenander U., Chow Y., Keenan K.M., 1991, HANDS: A Pattern Theoretic Study of
Biological Shapes, Springer-Verlag, New York.
6. Grenander U., 1993, General Pattern Theory, Oxford University Press, Oxford.
7. Grenander U., 1996, Elements of Pattern Theory, The Johns Hopkins University Press, ISBN
0-8018-5187-4.
8. Chen J. and Gupta A.K., "Information Criterion and Change Point Problem for Regular
Models," Technical Report No. 98-05, Department of Math. and Stat., Bowling Green State U.,
Ohio.
9. Cover T.M. and Thomas J.A., Elements of Information Theory, Wiley 1991.
10. Good I.J., "Weight of Evidence, Correlation, Explanatory Power, Information, and the
Utility of Experiments," Journal of the Royal Statistical Society, Ser. B, 22:319-331, 1960.
11. Haberman S.J., "The Analysis of Residuals in Cross-classified Tables," Biometrics, 29:205-220, 1973.
12. Kullback S. and Leibler R., "On Information and Sufficiency," Ann. Math. Statistics, 22:79-86, 1951.
13. Kullback S., Information and Statistics, Wiley and Sons, New York, 1959.
14. Sy B.K., "Probability Model Selection Using Information-Theoretic Optimization
Criterion," Journal of Statistical Computing and Simulation, Gordon and Breach
Publishing Group, NJ, 69(3), 2001.
15. (WWW http://www.ntia.doc.gov/ntiahome/dn/html/Chapter2.htm) A Nation Online:
How Americans Are Expanding Their Use Of The Internet.
16. (WWW http://www.ntia.doc.gov/ntiahome/dn/hhs/HHSchartsindex.html) U.S. households
with Internet access.
17. (WWW http://bonnet2.geol.qc.edu/jscs9901.html) Information-Statistical Pattern based
Approach for Data Mining.
Appendix 1. The summarization of symbols and variables
Variable   Symbol   State
NO         X1       {1,2,3,4}
AG         X2       {1,2,3,4}
IN         X3       {1,2,3,4}
ED         X4       {1,2,3,4}
State description:
X1:
1 = ‘The number of wireless network nodes is below μ-σ’
2 = ‘The number of wireless network nodes is between μ-σ and μ’
3 = ‘The number of wireless network nodes is between μ and μ+σ’
4 = ‘The number of wireless network nodes is above μ+σ’
X2:
1 = ‘Sub-population of those aged between 25 and 49 is below μ-σ’
2 = ‘Sub-population of those aged between 25 and 49 is between μ-σ and μ’
3 = ‘Sub-population of those aged between 25 and 49 is between μ and μ+σ’
4 = ‘Sub-population of those aged between 25 and 49 is above μ+σ’
X3:
1 = ‘Total population with a personal income level above $25,000 is below μ-σ’
2 = ‘Total population with a personal income level above $25,000 is between μ-σ and μ’
3 = ‘Total population with a personal income level above $25,000 is between μ and μ+σ’
4 = ‘Total population with a personal income level above $25,000 is above μ+σ’
X4:
1 = ‘Total population with an educational attainment at or above college is below μ-σ’
2 = ‘Total population with an educational attainment at or above college is between μ-σ and μ’
3 = ‘Total population with an educational attainment at or above college is between μ and μ+σ’
4 = ‘Total population with an educational attainment at or above college is above μ+σ’
Appendix 2. The result of significant event patterns discovery
Event pattern              Mutual Information   Chi-square/2N
X1:1, X2:1, X3:1, X4:1     3.583911             0.2651224
X1:1, X2:1, X3:2, X4:1     1.883471             0.02579801
X1:1, X2:2, X3:1, X4:1     2.261983             0.03953932
X1:1, X2:1, X3:2, X4:2     2.21362              0.03755223
X1:1, X2:2, X3:4, X4:4     1.592132             0.01771876
X1:1, X2:3, X3:3, X4:3     2.84367              0.06997035
X1:1, X2:3, X3:4, X4:4     2.106705             0.06682213
X1:1, X2:4, X3:2, X4:2     3.950586             0.3559456
X1:1, X2:4, X3:2, X4:3     3.172978             0.09381503
X1:1, X2:4, X3:3, X4:3     4.066063             0.1948605
X1:2, X2:1, X3:1, X4:1     4.284351             0.6922545
X1:2, X2:1, X3:2, X4:2     2.651025             0.05842556
X1:2, X2:2, X3:1, X4:1     2.37746              0.04458763
X1:2, X2:2, X3:2, X4:1     3.261983             0.3038388
X1:2, X2:3, X3:2, X4:3     2.066063             0.03192413
X1:2, X2:3, X3:3, X4:4     2.736755             0.1267294
X1:2, X2:4, X3:4, X4:4     2.444575             0.0477283
X1:3, X2:3, X3:3, X4:3     3.736755             0.1500842
X1:3, X2:3, X3:3, X4:4     2.514363             0.05116419
X1:3, X2:3, X3:4, X4:4     3.99979              0.7400093
X1:3, X2:4, X3:4, X4:4     3.222182             0.09788331
X1:4, X2:1, X3:1, X4:2     4.351465             0.24293
X1:4, X2:2, X3:2, X4:2     4.329097             0.4776154
X1:4, X2:2, X3:2, X4:3     3.55149              0.1290799
X1:4, X2:3, X3:3, X4:4     2.736755             0.06336469
X1:4, X2:3, X3:4, X4:4     2.222182             0.03789873
Appendix 3. Probability Model for Pattern Inference
Table 1. Probability model for pattern inference
Index   X1   X2   X3   X4   Pr(X1,X2,X3,X4)
3       1    1    1    3    0.108
4       1    1    1    4    0.0174
18      1    2    1    2    0.0046
24      1    2    2    4    0.0462
33      1    3    1    1    0.0888
38      1    3    2    2    0.0452
65      2    1    1    1    0.079
69      2    1    2    1    0.0386
70      2    1    2    2    0.026
76      2    1    3    4    0.0444
85      2    2    2    1    0.079
108     2    3    3    4    0.053
152     3    2    2    3    0.026
171     3    3    3    3    0.026
172     3    3    3    4    0.026
176     3    3    4    4    0.105
192     3    4    4    4    0.026
194     4    1    1    2    0.026
214     4    2    2    2    0.053
215     4    2    2    3    0.026
236     4    3    3    4    0.026