Automating the Committee Meeting: Intelligent Integration of Information From Diverse Sources
Pedrito Maynard-Zhang
Department of Computer Science & Systems Analysis

Information Integration
Information integration is ubiquitous:
• Committee meetings
• Research papers
• Information retrieval on the web
• Assessing intelligence on the battlefield
• …

Outline
• Introduction
• Automating Information Integration
  – Database Integration
  – Model Integration
  – Conflict Resolution and Meta-Information
• Integrating Learned Probabilistic Information
• Conclusion and Current Work

Multi-Disciplinary Research
• Databases (e.g., Halevy's group at U. of Washington)
• Artificial Intelligence (e.g., Stanford's Knowledge Systems Laboratory)
• Business (e.g., MIT-Sloan's Aggregators Group)
• Decision Analysis (e.g., Clemen & Winkler's work at Duke)

Database Integration
[Figure: a bioinformatics query passes through a mediation layer to the source databases OMIM, GeneClinics, LocusLink, and Entrez, which cover genes, proteins, and nucleotide sequences.]

Database Integration
• Application: querying distributed databases
• Examples:
  – Bioinformatics
  – Corporate data management
  – Question-answer systems on the web
  – Detecting bioterrorism

Model Integration
[Figure: a "super model" combines an expert system ("if cancer then operate …"), a probabilistic model, and a mathematical model.]

Model Integration
• Applications: diagnosis and prediction
• Examples:
  – Medical diagnosis
  – NASA spacecraft design and diagnosis
  – Expert system integration
  – Combining commonsense knowledge bases

Challenges
• Efficient query processing and optimization
  – Parsing XML
• Defining expressive yet tractable mediator languages
• Handling heterogeneous source languages
  – Wrapper technology development

Challenges
• Resolving ontological differences
  – e.g., realizing that the field "Name" for one source stores the same information as "First Name" and "Last Name" for another.
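The "Name" vs. "First Name"/"Last Name" example above can be made concrete with a small mediator-side normalization step. This is a minimal sketch, not the talk's actual wrapper technology; the record format and the `normalize` function are invented for illustration:

```python
# Sketch of one ontological mapping from the "Challenges" slide:
# source A stores a single "Name" field, while source B splits the same
# information across "First Name" and "Last Name". A mediator can
# normalize both into a common schema before integration.

def normalize(record):
    """Map a source record into the mediator's schema {"Name": ...}."""
    if "Name" in record:
        return {"Name": record["Name"]}
    if "First Name" in record and "Last Name" in record:
        return {"Name": record["First Name"] + " " + record["Last Name"]}
    raise ValueError("unrecognized source schema: %r" % record)

print(normalize({"First Name": "Ada", "Last Name": "Lovelace"}))
# {'Name': 'Ada Lovelace'}
```

Real mediators discover such correspondences (semi-)automatically; this sketch only shows what a resolved mapping looks like once found.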
Challenges
• Detecting conflicts
• Resolving conflicts
  – Resolution is done manually in practice
  – We can automate more!

Uninformed Integration
[Figure: asked "What's the weather like?", three sources answer "raining", "sunny", and "raining".]

Intelligent Integration
[Figure: asked "What's the weather like?", a meteorologist answers "raining", a practical joker answers "sunny", and one's own eyes say "raining".]

Types of Meta-Information
• Credibility, experience, political clout
• Areas of expertise
• How the source acquired its information:
  – The source's sources
  – The processes the source used to accumulate information
• Structure of the data representation

Outline
• Introduction
• Automating Information Integration
• Integrating Learned Probabilistic Information
  – Medical Scenario
  – Semantic Framework
  – LinOP-Based Aggregation
  – Aggregating Bayesian Networks
  – Experimental Validation
• Conclusion and Current Work

Medical Expert System Scenario
[Figure: an expert system combines the models of three doctors with 20, 10, and 3 years of experience.]

Source Meta-Information
• Doctors learned probabilistic models from patient data using a known, standard learning algorithm.
• We know the doctors' relative amounts of experience (i.e., years of practice).

Popular Aggregation Approaches
• Intuition approach: take simple weighted averages, etc. → unexpected behavior
• Axiomatic approach: find an aggregation algorithm satisfying certain "obvious" properties → impossibility results
• Problem: neither approach is semantically grounded

Aggregation Semantics
[Figure: M samples generated from the true distribution p are divided among the sources; each source runs a learning algorithm to produce p1, p2, …, pL; an aggregation algorithm combines these into the aggregate distribution p̂, which should approximate the optimal distribution p* (not available in practice).]

Linear Opinion Pool (LinOP)
• LinOP: a weighted sum of joint distributions.
• Precisely, for joint distributions pi and a joint variable instantiation w,
  LinOP(p1, p2, …, pL)(w) = Σi αi pi(w).
• The weights αi encode relative experience.
• Satisfies unanimity, non-dictatorship, and marginalization.
• Does not preserve shared independences.
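The LinOP definition above is a straightforward weighted sum, which a short sketch makes concrete. The distributions, weights, and variable values here are invented for illustration:

```python
# Linear Opinion Pool (LinOP): a weighted sum of joint distributions.
# Each distribution maps a joint instantiation w to a probability;
# the weights alpha_i (relative experience) must sum to 1.

def linop(distributions, weights):
    """Combine joint distributions p_i into sum_i alpha_i * p_i(w)."""
    assert abs(sum(weights) - 1.0) < 1e-9
    aggregate = {}
    for p, alpha in zip(distributions, weights):
        for w, prob in p.items():
            aggregate[w] = aggregate.get(w, 0.0) + alpha * prob
    return aggregate

# Two doctors' beliefs about the weather, weighted by experience:
p1 = {"rain": 0.8, "dry": 0.2}    # 20 years of experience
p2 = {"rain": 0.5, "dry": 0.5}    #  5 years of experience
agg = linop([p1, p2], [0.8, 0.2])
# agg["rain"] ≈ 0.74, agg["dry"] ≈ 0.26
```

Note how this satisfies unanimity by construction: if every pi(w) is the same value, the weighted sum returns that value.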
LinOP and Joint Learning
If
– the sources learn joint distributions using maximum likelihood or MAP learning, and
– the same learning framework would be used on the combined data set to learn p*,
then p* = LinOP(p1, p2, …, pL).

Bayesian Network (BN)
• Summary: a compact, graphical representation of a probability distribution.
• Definition: a directed acyclic graph (DAG) over nodes (random variables); each node has a local conditional probability distribution (CPD) associated with it.
• Exploits causal structure in the domain.

Alarm BN
[Figure: the classic Alarm network. Burglary and Earthquake are parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls. CPDs: P(B) = .001; P(E) = .002; P(A | B, E) = .95, .94, .29, .001 for (B, E) = (+, +), (+, −), (−, +), (−, −); P(J | A) = .90 if A, .05 otherwise; P(M | A) = .70 if A, .01 otherwise.]

BN Advantages
• Compact representation; the graph encodes conditional independences.
• Elicitation is easy in practice.
• Inference is efficient in practice.
• Can be learned from data.
• Deployed successfully: medical diagnosis, Microsoft Office, NASA Mission Control, and more.

BN Learning
• Idea: select the BN most likely to have generated the data.
• Standard algorithm:
  – Search over structures by adding, deleting, and reversing edges.
  – Parameterize and score structures using statistics from the data.
  – Penalize complex structures.

Aggregating BNs
• Each source i learns a BN pi.
• p* is the BN we would learn from the combined data set.
• We want to approximate p* as closely as possible by aggregating p1, …, pL.
• Source information: estimates of the sources' relative experience and of the total amount of data seen (M).

AGGR: BN Aggregation Algorithm
• Idea: use a BN learning algorithm.
• Problem: we don't have the data.
• Key observation: we can use LinOP to approximate the statistics needed for the parameterization and scoring steps!
• Also, we can use LinOP's properties to make the algorithm reasonably efficient.
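The key observation behind AGGR can be sketched concretely: the sufficient statistics a BN learner needs, counts N[x, u] for a node value x and parent instantiation u, are unavailable without the data, but they can be approximated as M times the LinOP marginal probability. This is a simplified sketch, not the AGGR implementation: each source here is a full joint table over two binary variables, whereas in the real algorithm the marginals come from inference in the sources' BNs. The variable names and numbers are invented for illustration:

```python
# Approximating BN-learning statistics via LinOP:
#     N[x, u] ≈ M * LinOP(p1, ..., pL)(x, u)
# where M is the total (hypothetical) combined data set size.

VARS = ("smoking", "cancer")  # fixed variable order for the joint keys

def linop_marginal(dists, weights, query):
    """LinOP probability of the partial instantiation `query`
    (a dict var -> value) under weighted full joint tables."""
    total = 0.0
    for p, alpha in zip(dists, weights):
        for values, prob in p.items():
            w = dict(zip(VARS, values))
            if all(w[v] == val for v, val in query.items()):
                total += alpha * prob
    return total

def expected_count(dists, weights, M, query):
    """Approximate count of `query` in a combined data set of size M."""
    return M * linop_marginal(dists, weights, query)

# Two sources' learned joints, equal weights, M = 1000 total samples:
p1 = {(1, 1): 0.20, (1, 0): 0.30, (0, 1): 0.05, (0, 0): 0.45}
p2 = {(1, 1): 0.10, (1, 0): 0.40, (0, 1): 0.10, (0, 0): 0.40}
n_sc = expected_count([p1, p2], [0.5, 0.5], 1000, {"smoking": 1, "cancer": 1})
n_s = expected_count([p1, p2], [0.5, 0.5], 1000, {"smoking": 1})
cpd = n_sc / n_s  # estimated P(cancer=1 | smoking=1) = 150/500 = 0.3
```

The same approximated counts feed both the CPD parameterization and the structure score, which is why a single LinOP suffices for the whole learning loop.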
Asia BN
[Figure: the Asia network over Visit to Asia, Tuberculosis, Smoking, Lung Cancer, Bronchitis, Abnormality in Chest X-Ray, and Dyspnea.]

Experimental Setup
• Generate data for the sources from the well-known Asia BN, which relates smoking, visiting Asia, and lung cancer.
• Compare our algorithm AGGR against the optimal algorithm OPT, which has access to the combined data set.
• Accuracy measure: KL divergence from the generating distribution.

Experiments
• Sensitivity to M
  – The size of the combined data set, M, varies.
  – AGGR's estimate of M is accurate.
• Sensitivity to the estimate of M
  – The size of the combined data set, M, is fixed.
  – AGGR's estimate of M varies.

Sensitivity to M
[Chart: KL divergence (0.000 to 0.100) vs. M (200 to 3000) for S1, S2, OPT, and AGGR.]

Sensitivity to Estimate of M
[Chart: KL divergence (0.0000 to 0.0016) vs. M′/M on a log10 scale (−1.00 to 1.00) for S1, S2, OPT, and AGGR, with M = 10k.]

Subpopulations
• Each source's data may come from a different subpopulation P(D|Si), where D is the data.
• We want to learn P(D).
• P(D) = LinOP(P(D|S1), P(D|S2), …, P(D|SL)), with the sources' weights based on P(Si).
• We can apply the same algorithm.

Subpopulations Experiments
• In the Asia network domain, one doctor practices in San Francisco and another in Cincinnati.
• The subpopulations have different priors for smoking and for having visited Asia, so the doctors' beliefs are biased.
• The aggregate distribution comes much closer to the original distribution.

Doctor Asia BN
[Figure: the Asia network as learned by an individual doctor, over the same variables as above.]

Subpopulations
[Chart: KL divergence (0.00 to 0.18) vs. M (200 to 3000) for S1, S2, OPT, and AGGR.]

Contributions
• A semantic framework for aggregating learned probabilistic models.
• A LinOP-based algorithm for aggregating learned BNs.
• Experiments showing the algorithm behaves well.
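The experiments' accuracy measure, KL divergence from the generating distribution, is simple to compute for discrete distributions. A minimal sketch, with toy distributions invented for illustration:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_w p(w) * log(p(w) / q(w)): how far an
    estimate q is from the generating distribution p
    (0 if and only if the two distributions are equal)."""
    total = 0.0
    for w, pw in p.items():
        if pw > 0.0:
            total += pw * math.log(pw / q[w])
    return total

p_true = {"rain": 0.70, "dry": 0.30}   # generating distribution
p_hat  = {"rain": 0.74, "dry": 0.26}   # an aggregate's estimate
assert kl_divergence(p_true, p_true) == 0.0
assert kl_divergence(p_true, p_hat) > 0.0  # imperfect estimate
```

In the experiments, p is the Asia BN's distribution and q is each of S1, S2, OPT, and AGGR, so lower curves mean better approximations.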
Outline
• Introduction
• Automating Information Integration
• Integrating Learned Probabilistic Information
• Conclusion and Current Work

Conclusion
• Conflict resolution is key in automated information integration.
• It is a difficult task in general.
• However, information about sources is often readily available.
• Principled use of this information can greatly enhance the ability to resolve conflicts intelligently.

Current Work
• Allow dependence between sources' data sets in the probabilistic aggregation work.
• Apply the semantic framework to aggregation in other learning paradigms.
• Explore applications of the algorithms to database integration, RoboCup, stock market prediction, etc.
• Make committee meetings obsolete!

Multi-Agent Research Zone
• Research interests:
  – Information integration
  – Multi-agent machine learning
  – RoboCup soccer simulation league testbed
• Masters students:
  – Jian Xu: information integration in medical informatics
  – Linxin Gan: ensemble learning in stock market prediction

CSA Graduate Program
• Masters in Computer Science
• Research areas include:
  – machine learning, KRR, and MAS
  – information retrieval, databases, and NLP
  – networking and virtual environments
  – simulation and evolutionary computation
  – software engineering and formal methods

http://unixgen.muohio.edu/~maynarp/
[email protected]