Informatics and Information Engineering
CSE 300

Prof. Steven A. Demurjian, Sr.
Computer Science & Engineering Department
The University of Connecticut
371 Fairfield Road, Box U-255
Storrs, CT 06269-2155
[email protected]
http://www.engr.uconn.edu/~steve
(860) 486-4818

Copyright © 2008 by S. Demurjian, Storrs, CT. Portions of these slides are being used with the permission of Dr. Ling Liu, Associate Professor, College of Computing, Georgia Tech.

Overview
- Informatics
  - What is Informatics?
  - What is Biomedical Informatics?
  - What are Key Biomedical Informatics Challenges?
- Information Engineering
  - Data vs. Information vs. Knowledge
  - What is Science? What is Engineering?
  - What is Information Consistency?
- Information Usage and Repositories
  - How do we Store and Utilize Information?
  - Role of Web in Informatics
  - Sharing, Collaboration, and Security
  - Databases vs. Data Mining

Informatics
- Informatics is:
  - Management and Processing of Data
  - From Multiple Sources/Contexts
  - Involves Classification (Ontologies), Collection, Storage, Analysis, Dissemination
- Informatics is Multi-Disciplinary
  - Computing (Model, Store, Process Information)
  - Social Science (User Interactions, HCI)
  - Statistics (Analysis)
- Informatics Can Apply to Multiple Domains:
  - Business, Biology, Fine Arts, Humanities
  - Pharmacology, Nursing, Medicine, etc.

What is Informatics?
- Heterogeneous Field – Interaction between People, Information and Technology
  - Computer Science and Engineering
  - Social Science (Human Computer Interface)
  - Information Science (Data Storage, Retrieval and Mining)
- Diagram labels: Informatics; People; Information; Technology
- Adapted from Shortliffe textbook

What is Biomedical Informatics (BMI)?
- BMI is Information and its Usage Associated with the Research and Practice of Medicine, Including:
  - Clinical Informatics for Patient Care
    - Medical Record + Personal Health Record
  - Bioinformatics for Research/Biology to Bedside
    - From Genomics to Proteomics
  - Public Health Informatics (State and Federal)
    - Tracking Trends in Public Sector
  - Clinical Research Informatics
    - Deidentified Repositories and Databases
    - Facilitate Epidemiological Research and Ongoing Clinical Studies (Drug Trials, Data Analysis, etc.)

What are Key BMI Focal Areas?
- T1 Research
  - Transition Bench Results into Clinical Research
- Clinical Research
  - Applying Clinical Research Results via Trials with Patients on Medication, Devices, Treatment Plans
- T2 Research
  - Translating “Successful” Clinical Trials into Practice and the Community
- Clinical Practice
  - Tracking all of the Information Associated with a Patient and his/her Care
- Integrated and Inter-Disciplinary Information Spectrum

What is Medical Informatics?
- Clinical Informatics, Pharmacy Informatics
- Public Health Informatics
- Consumer Health Informatics
- Nursing Informatics
- Systems and People Issues
  - Intended to Improve Clinical Outcomes, Satisfaction and Efficiency
  - Workflow Changes, Business Implications, Implementation, etc.
  - Patient Centered – Personal Health Record and Medical Home
  - Care Centered – Pay for Performance, Improving Treatment Compliance

What is Bioinformatics?
- Focused on Research Tools for T1:
  - Genomic and Proteomic Tools, Evaluation Methods, Computing and Database Needs
  - Information Retrieval and Manipulation of Large Distributed (caBIG) Data Sets (cabig.cancer.gov/index.asp)
  - Often Requires Grid Computing
  - Includes Cancer and Immunology Research
- Increasing Need to Tie These Separate Types of Systems Together = Personalized Medicine
  - Biology and the Bedside (www.i2b2.org)

Where is Data/How is it Used?
- Medical and Administrative Data Found in Clinical Information Systems (CIS) Such As:
  - Hospital Info. Systems, Electronic Medical Records
  - Personal Health Records…
  - Pharmacy, Nursing, Picture Archiving Systems
  - Complex Data Storage and Retrieval – Many Different Systems
- T1 Research Increasingly Reliant on CIS
- T2 Research is Reliant on:
  - End Systems for Embedding EBM (Evidence-Based Medicine) Guidelines
  - Measuring Outcomes, Looking at Policy

What are Major Informatics Challenges?
- Shortage of Trained People Nationally
- Slows Adoption of Health Information Technology
- Results in Poor Planning and Coordination, Duplication of Efforts and Incomplete Evaluation
- What are Critical Needs?
  - Dually Trained Clinicians or Researchers in Leadership of some Initiatives
  - Connect all folks with Informatics Roles across Institutions to Improve Efficiency
  - Multi-Disciplinary: CSE, Statistics, Biology, Medicine, Nursing, Pharmacy, etc.
- Emerging Standards for Information Modeling and Exchange (www.hl7.org) based on XML

Information Engineering
- Data vs. Information vs. Knowledge
  - How do we Differentiate Between them?
  - Where are they used in BMI?
- Science vs. Engineering
  - What is each of their Roles in Informatics?
  - How can we Engineer Information?
  - What is their Role in BMI?
- What is Information Engineering?
  - What are the Unique Challenges and Opportunities?
  - What is Available Today and Tomorrow?

From American Heritage
- Data
  - Information, esp. information organized for analysis or used as the basis for a decision.
  - Numerical information in a form suitable for processing by computer.
- Information
  - The act of informing or the condition of being informed; communication of knowledge.
  - A non-accidental signal used as an input to a computer or communications system.
- Knowledge
  - The state or fact of knowing.
  - The sum or range of what has been perceived, discovered, or learned.
  - Specific information about something.

From Webster’s 9th Collegiate
- Data
  - Factual information (e.g. statistics) used as a basis for reasoning, discussion, or calculation.
- Information
  - The communication of knowledge or intelligence.
  - Something (as a message, experimental data, or a picture) which justifies change in a construct (as a plan or theory) that represents physical or mental experience or another construct.
  - Quantitative measure of the content of information.
- Knowledge
  - The fact or condition of having information or of being learned.
  - The sum of what is known: the body of truth, information, and principles acquired by mankind.

Data vs. Information vs. Knowledge
- Overlapping Definitions
- Conflicting Definitions
- Agreement on Data
- Knowledge and Information - Synonyms
- Discussion Questions:
  - Equivalence of Knowledge/Information?
  - How can we Distinguish them?
  - Do these Three Terms Cover Possibilities?

Data, Information, and Knowledge in BMI
- Data – Basic Level
  - BP, Pulse, Temperature
  - Peak Flow, Glucose Level, Biopsy Result
  - X-Ray, MRI, CAT Scan
- Information – First Level of Interpretation
  - BPs, Peak Flow, Glucose over Time
  - Interpreting Scan (Radiologist) or Biopsy Result (Oncologist)
- Knowledge – Applying Experience towards Diagnosis
  - What can Low Peak Flows over Time lead to?
  - What Next Step after Positive Scan or Biopsy?
  - What if Glucose Level is Yo-yoing?

From American Heritage
- Science
  - The observation, identification, description, experimental investigation, and theoretical explanation of natural phenomena.
  - Methodological activity, discipline, or study.
  - An activity that appears to require study & method.
  - Knowledge, esp. gained through experience.
- Engineering
  - The application of scientific and mathematical principles to practical ends such as the design, construction, and operation of efficient and economical structures, equipment, and systems.

From Webster’s 9th Collegiate
- Science
  - The state of knowing: knowledge as distinguished from ignorance or misunderstanding.
  - A department of systemized knowledge as an object of study.
  - A system or method reconciling practical ends with scientific laws.
- Engineering
  - The application of science and mathematics by which the properties of matter and the sources of energy in nature are made useful to people in structures, machines, products, systems, and processes.

Science and Engineering in BMI
- Science
  - Data/Information Collection & Analysis to Reach Hypothesis
  - Patients with CHF and Lipitor have Fewer Heart Attacks than CHF and Baby Aspirin
  - Verify in Clinical Research/Epidemiological Study
- Engineering
  - Usage of Information in Practice
  - Apply Scientific Results to Medical Practice
  - Image Processing used to Identify Tumors in CT and MRI Scans
  - Transfer of Radiologists’ Knowledge into Computer-Based (Assisted) Solution
  - An Engineering Solution to Scientific Result

What is Information Engineering?
- Incorporation of an Engineering Approach and Discipline to the Generation of Information and the Promotion of the Better Use of Information and Resources
- Information Engineering Unifies and Combines:
  - Software Engineering
  - Database Engineering
  - Security Engineering
  - Performance Engineering
  - Etc...
- Moral: Systems Cannot and Must Not be Engineered in a Vacuum!
- Particularly true in BMI (T1, T2, Clinical Research, and Clinical Practice)

Information Engineering is Motivated by:
- Realization that Management/Control of Information will be a Primary Concern as we Continue through the 1990s and into the 21st Century
- Currently in an Age of Information - Volume and Complexity Dependencies
- Critical Systems Heavily Depend on Information:
  - Airline/Hotel/Auto Reservations
  - Telecommunications
  - Banking/ATMs
  - ATM/Credit Cards at Gas Stations/Supermarkets
  - Credit Bureaus Electronically Collect Information from Many Diverse Sources
  - E-Tailing
  - Medical Care/All Aspects of BMI

Info. Engrg. - Challenge for 21st Century
- Timely and Efficient Utilization of Information
  - Significantly Impacts Productivity
  - Supports and Promotes Collaboration for Competitive Advantage
  - Use Information in New and Different Ways
- Collection, Synthesis, Analyses of Information
  - Better Understanding of Processes, Sales, Productivity, etc.
  - Dissemination of Only Relevant/Significant Information - Reduce Overload
- Implications for BMI?
  - Sharing of Results – Benefit Mankind
  - Ability to Research Rare Diseases
  - Are there Unknown Isolated “Cures”?

How is Information Engineered?
- Careful Thought to its Definition/Purpose & Thorough Understanding of its Intended Usage/Potential Impact
- Ensure and Maintain its Consistency
  - Quality, Correctness, and Relevance
- Protect and Control its Availability (Secure Access)
  - Who can Access What Information in Which Location and at What Time?
- Long-Term Persistent Storage/Recoverability
  - Cost, Reusability, Longitudinal, and Cumulative Experience
- Integration of Past, Present and Future Information via Intranet and Internet Access
- What are Implications/Challenges for BMI?
  - Let’s Discuss Briefly…

Towards Information Consistency
- Consistency of Information is Key!
- Consistency Gauged with respect to:
  - Usage of Information
  - Persistency of Information
  - Integrity/Security of Information
    - Allowable Values and Protection from Misuse
  - Validity (Relevance) of Information
    - Means Something to Someone in a Positive Way
- Discussion Questions:
  - Why is Consistency Important for BMI?
  - How is Consistency Attained for BMI?
  - What Else Impacts Consistency in BMI?

What's Available to Support IE?
- What Can be Provided to Make the Advanced Application Design Process:
  - More Complete?
  - More Robust?
  - More Responsive?
  - Less Error Prone?
- Current Choices to Support Information Engineering:
  - Conventional Programming Languages and Data Models
  - Object-Oriented Programming Languages
  - Object-Oriented DBS
  - XML Databases
  - Middleware and SOA (Web)
  - Data Mining/Warehouses

What are Key Questions?
- Focus on Information and its Behavior
  - What are Different Kinds of Information?
  - How is Information Manipulated?
  - Is Same Information Stored in Different Ways?
  - What are Information Interdependencies?
  - Will Information Persist? Long-Term DB? Versions of Information?
  - What Past Info. is Needed from Legacy DBs or Applications?
  - Who Needs Access to What Info. When?
  - What Information is Available Across WWW?
- All of these Questions Apply to BMI!

Information Usage and Repositories
- How do we Store and Utilize Information?
  - Databases
  - Data Mining
- What are Key Issues?
  - Information Sharing/Data Correctness
  - Collaboration
    1. Among Providers and Researchers
    2. Among Providers and Patients
    3. Among Patients (Support Groups)
  - Security
    1. Control of Patient Information (De-identified)
    2. Secure Exchange/Patient Ownership
    3. Establish Custom Patient Controlled Groups
  - What is the Role of Web in Informatics?
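
The security item "Control of Patient Information (De-identified)" can be made concrete with a minimal Python sketch. Everything in it is an assumption made for illustration: the field names, the list of direct identifiers, and the salted-hash pseudonym are hypothetical and do not come from the course project. Real de-identification must handle many more identifier types and the risk of re-identification from the remaining fields.

```python
import hashlib

# Hypothetical field names; a real record follows the schema of its source system.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "ssn"}


def deidentify(record, salt):
    """Return a copy of the record with direct identifiers removed.

    The patient id is replaced by a salted hash, so visits by the same
    patient can still be linked inside one research repository.
    """
    raw = (salt + str(record["patient_id"])).encode("utf-8")
    pseudonym = hashlib.sha256(raw).hexdigest()[:12]
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key != "patient_id"
    }
    cleaned["pseudonym"] = pseudonym
    return cleaned


# Illustrative visit record (made-up values).
visit = {
    "patient_id": 1017,
    "name": "Jane Doe",
    "phone": "555-0100",
    "glucose_mg_dl": 142,
    "peak_flow_l_min": 380,
}

shared = deidentify(visit, salt="research-study-A")
print(sorted(shared))
```

Only the clinical values and the pseudonym reach the shared repository. Using a different salt per study is one way to keep the same patient from being linked across studies; whether that is desirable is a policy decision, not a technical one.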

The Role of a Database
- Database is a Norm in Today's and Tomorrow's Applications
- Usage of Information Tightly Linked to its Storage
- Integration of Database - Key Component
- Support Many Representations of “Same” Information
- Promotes Retrieval of Information Geared Towards User Needs and Responsibilities
- Gap Exists Between Standalone Programming Applications and Database Systems
- For BMI:
  - Database (Data Warehouse) is a Key Feature
  - Need for Access to Data (De-identified)
  - Need to Share and Interact among Stakeholders

DBMS Architecture
- DBMS Languages
  - Data Definition Language (DDL)
  - Data Manipulation Language (DML)
    - From Embedded Queries or DB Commands Within a Program
    - “Stand-alone” Query Language
- Host Language:
  - DML Specification (e.g., SQL) is Embedded in a “Host” Programming Language (e.g., Java, C++)
- DBMS Interfaces
  - Menu-Based Interface
  - Graphical Interface
  - Forms-Based Interface
  - Interface for DBA (DB Administrator)

ANSI/SPARC - Three Schema Architecture
- External Data Schema (Users’ view)
- Conceptual Data Schema (Logical Schema)
- Internal Data Schema (Physical Schema)

How are these Used for BMI?
- Internal Data Schema (Physical Schema)
  - Hidden Data Representation for Storage of BMI Data in Proprietary Format
  - Under the Control of DB System
- Conceptual Data Schema (Logical Schema)
  - The Data Model for the BMI Application
  - Access to Schema Controllable via SQL
- External Data Schema (Users’ view)
  - Subsets of the Data Model for Different Users
  - External View for Patients
  - External View for Providers
  - External View for Clinical Researchers
  - Need Ability for a Patient to Control Access to his/her Own External View

Data Independence
- Ability that Allows Application Programs Not to be Affected by Changes in Irrelevant Parts of the Conceptual Data Representation, Data Storage Structure and Data Access Methods
- Invisibility (Transparency) of the Details of Entire Database Organization, Storage Structure and Access Strategy to the Users
  - Both Logical and Physical
- Recall Software Engineering Concepts:
  - Abstraction: the Details of an Application's Components Can Be Hidden, Providing a Broad Perspective on the Design
  - Representation Independence: Changes Can Be Made to the Implementation that have No Impact on the Interface and Its Users

Physical Data Independence
- The Ability to Modify the Physical Data Representation Without Causing Application Programs to Be Rewritten
- Examples:
  - Transparency of the Physical Storage Organization
  - Transparency of Physical Access Paths
  - Numeric Data Representation and Units
  - Character Data Representation
  - Data Coding
  - Physical Data Structure
- All of these are Vital for BMI – Particularly if we Use Standards to Achieve Application Independence

Physical Data Independence
- Physical Data Independence is a Measure of How Much the Internal Schema Can Change Without Affecting the Application Programs
- In BMI – Allows us to Plug and Play Different DBMS Platforms – Extensible and Versatile Integration
- Diagram label: Physical

Logical Data Independence
- Transparency of the Entire Database
  Conceptual Organization
- As a Result:
  - Transparency of Logical Access Strategy
  - Addition of New Entities
  - Removal of Entities
  - Virtual (Derived) Data Items
  - Union of Records
- Views
  - Common Mechanism for Logical Data Dependency
  - Provide Different Logical Data Contexts to Different Users Based on Their Needs
  - Update Views vs. Read-Only Views

Logical Data Independence
- Logical Data Independence is a Measure of How Much the Conceptual Schema Can Change Without Affecting the Application Programs
- For BMI – Allows us to Separate End User Applications (Patients, Providers, etc.) from DB
- Diagram label: Logical

Classic Information System Design
- (slide content not captured in this text)

Data vs. Information
- (slide content not captured in this text)

Programming Language Systems vs. DBS
- Similarities and Differences Exist At System Level:
  - Shared Resources vs. Shared Data
  - Execution Granularity - Programs vs. Transactions
  - Granularity Difference - Files vs. Instances
- Classic Problem of “Impedance Mismatch”
  - Thin Layer of Overlap between PLS (C++, Java, etc.) and Relational Database System
- What will Future Bring?
  - SQL3 with Object-Oriented Extensions
  - XML Databases (Apache Xindice, Sedna, etc.)
- Diagram labels: Today (PLS, RDBS); Tomorrow? (PLS, XML DBS)

What is Today’s Impedance Mismatch?
- Relational Data Organizes Information into Flat Files
  - Relational Tables with Primary Key
  - High Number of Tuples per Table (1000s & more)
  - Limited Number of Tables (10-50) for Even Large Size Application
  - Limited Linkages Among Tables (Foreign Keys)
- What Does BMI/PHR/EMR Require?
  - For Each Patient, Track Multiple Dependencies
    - Visits per Patient
    - Tests per Patient
    - Prescriptions per Patient
  - Data Inherently Complex and Interdependent
  - Flattened into Relational Format

The Health Care Application - Classes
- (three slides; content not captured in this text)

The Health Care Application - Relationships
- (slide content not captured in this text)

How Does Mismatch Occur?
- On Left – OO Classes
  - Inheritance
  - Dependencies
- Programmatic View
  - C++ or Java Usage
  - Staging from DB to OO
- Above – Relational Tables
  - Item(Phy_Name*, Date*, Visit_Flag, Symptom, Diagnosis, Treatment, Presc_Flag, Pre_No, Pharm_Name, Medication, Test_Flag, Test_Code, Spec_No, Status, Tech)
  - Stage Data from Tables into OO (e.g. Java) Format
  - Utilize JDBC
- What are the Implications/Impacts?

Implications and Impact
- Three Copies of “Same” Information in Different Forms:
  - Database Table (Item)
  - OO Representation – Server Side (Classes)
  - GUI Display – Client Side (html/xml)
- What can this Lead to?
- Sample Record: Dr. D, Jan 01, 08; Fever, Flu, Bed Rest; No Scripts; No Tests
- Item(Phy_Name*, Date*, Visit_Flag, Symptom, Diagnosis, Treatment, Presc_Flag, Pre_No, Pharm_Name, Medication, Test_Flag, Test_Code, Spec_No, Status, Tech)

What is one Possible Solution?
- Standards and Usage of XML
  - Consider CDA – Clinical Document Architecture
  - Standard for Clinical (Provider) Medical Record
- Clinical Record Organized as:
  - <patient_encounter> - location
  - <legal_authenticator> - MD
  - <originating_organization> and <provider>
  - <patient> - name, birthdate, gender
  - <body_confidentiality-”CONF1”> - note
    - History
    - Past Medical History
    - Medications
    - Allergies
    - Social History
    - Physical Exam
    - Vitals (BP, Resp, Temp, HR)
    - Etc...

What is one Possible Solution?
- Let’s Explore this in Greater Detail Starting with the CDA Header

    <?xml version="1.0"?>
    <!DOCTYPE levelone PUBLIC "-//HL7//DTD CDA Level One 1.0//EN" "levelone_1.0.dtd">
    <levelone>
      <clinical_document_header>
        <id EX="a123" RT="2.16.840.1.113883.3.933"/>
        <set_id EX="B" RT="2.16.840.1.113883.3.933"/>
        <version_nbr V="2"/>
        <document_type_cd V="11488-4" S="2.16.840.1.113883.6.1" DN="Consultation note"/>
        <origination_dttm V="2000-04-07"/>
        <confidentiality_cd ID="CONF1" V="N" S="2.16.840.1.113883.5.1xxx"/>
        <confidentiality_cd ID="CONF2" V="R" S="2.16.840.1.113883.5.1xxx"/>
        <document_relationship>
          <document_relationship.type_cd V="RPLC"/>
          <related_document>
            <id EX="a234" RT="2.16.840.1.113883.3.933"/>
            <set_id EX="B" RT="2.16.840.1.113883.3.933"/>
            <version_nbr V="1"/>
          </related_document>
        </document_relationship>
        <fulfills_order>
          <fulfills_order.type_cd V="FLFS"/>
          <order><id EX="x23ABC" RT="2.16.840.1.113883.3.933"/></order>
          <order><id EX="x42CDE" RT="2.16.840.1.113883.3.933"/></order>
        </fulfills_order>

CDA Example - Continued
- (eight slides; content not captured in this text)

Information Sharing/Access: Potential Pitfalls
- Another Critical Issue is Information Sharing
  - Perception: How do I see/understand Data/Info?
  - Differences: What is the Reality?
- Dealing with Information at Different Levels
  - Syntax – Format of Information
  - Semantics – Meaning of Information
  - Pragmatics – Usage of Information
- When Unifying Databases/Information Repositories, Must Address all Three!
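
The three levels listed above can be told apart in a small Python sketch. It is an illustration under stated assumptions, not part of any health care standard: the `'<number> <unit>'` message format, the `Weight` type, and the weight-based dosing rule are all hypothetical.

```python
from dataclasses import dataclass

LB_PER_KG = 2.20462  # pounds per kilogram


@dataclass
class Weight:
    value: float
    unit: str  # "kg" or "lb"


def parse_weight(message):
    """Syntax: check the format of a '<number> <unit>' string such as '154 lb'."""
    parts = message.split()
    if len(parts) != 2:
        raise ValueError(f"expected '<number> <unit>', got {message!r}")
    value, unit = float(parts[0]), parts[1].lower()
    if unit not in ("kg", "lb"):
        raise ValueError(f"unknown unit {unit!r}")
    return Weight(value, unit)


def to_kilograms(weight):
    """Semantics: interpret the number according to the unit it carries."""
    return weight.value if weight.unit == "kg" else weight.value / LB_PER_KG


def dose_mg(weight, mg_per_kg):
    """Pragmatics: use the interpreted value for a (hypothetical) weight-based dose."""
    return round(to_kilograms(weight) * mg_per_kg, 1)
```

A message can pass the syntactic check and still be misread if the unit is dropped or assumed, which is why the sketch keeps the unit attached to the value until the point of use.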
- Data Integrity and Data Security
  - Correct and Consistent Values
  - Assurance in All Secure Accesses
- For BMI – All of the Above are Critical for Correct Usage and Interpretation in All Contexts (T1, T2, …)

Information Syntactic Considerations
- Syntax is Structure and Format of the Information That is Needed to Support a Coalition
- Incorrect Structure or Format Could Result in Anything from a Simple Error Message to a Catastrophic Event
- For Sharing, Strict Formats Need to be Maintained
- Health Care Data Suffers from Lack of Standards
  - Standards for Diagnosis (Insurance Industry)
  - Emerging Standards Include:
    - Health Level 7 (HL7)
    - Based on XML
  - Formats Non-Standard for Different Health Organizations, Insurers, Pharmacy Networks, etc.
  - N*N Translations Prone to Errors!

Information Semantics Concerns
- Semantics (Meaning and Interpretation)
  - NATO and US - Different Message Formats
  - Distances (Miles vs. Kilometers)
  - Grid Coordinates (Mils, Degrees)
  - Maps (Grid, True, and Magnetic North)
- What Can Happen in Health Care Data?
  - Possible to Confuse Dosages of Medications?
  - Weight of Patients (Pounds vs. Kilos)?
  - Measurement of Vital Signs?
  - Dana Farber Chemo Death – Checks/Balances
  - What Others are Possible?

Syntactic & Semantic Considerations
- What’s Available to Support Information Sharing?
- How do we Ensure that Information can be Accurately and Precisely Exchanged?
- How do we Associate Semantics with the Information to be Exchanged?
- What Can we Do to Verify the Syntactic Exchange and that Semantics are Maintained?
- Can Information Exchange Facilitate Federation?
- Can this be Handled Dynamically?
- Or, Must we Statically Solve Information Sharing in Advance?

Information Pragmatics Considerations
- Pragmatics Require that we Totally Understand Information Usage and Information Meaning
  - What are the Critical Information Sources?
  - How will Information Flow Among Them?
  - What Systems Need Access to these Sources?
  - How will that Access be Delivered?
  - Who (People/Roles) will Need to See What When?
  - How will What a Person Sees Impact Other Sources?
- Focus on: Way that Information is Utilized and Understood in its Specific Context
- Can Medical Info be Misused even if Understood?

Information Pragmatics Considerations
- What are Pragmatics Issues re. Underinsured and Uninsured Populations in Event?
  - How Can we Use Info Effectively if we Don’t Know if it is Complete?
  - Has Info from All Sources Been Collected?
  - What Happens if Same Patient in Different Repositories Can’t be Reconciled?
  - What if Patient is Unresponsive and Can’t Supply any Info?
  - Is Usage of Info Complicated due to Incompleteness? Multiple Locations?
- Or, if the Event is Major – will all Patient Populations Suffer Same Substandard Care?

Collaboration and Security
- Two Concepts go Hand in Hand
- Strong Parallels
  - Collaboration
    - Among Providers and Researchers
    - Among Providers and Patients
    - Among Patients (Support Groups)
  - Security
    - Control of Patient Information (De-identified)
    - Secure Exchange/Patient Ownership
    - Establish Custom Patient Controlled Groups
- Let’s Explore them Both via our Semester Project
- Also Consider Emergent and Policy Issues

Collaboration: Providers and Researchers
- Providers
  - Seeking new Treatment Plans
  - Looking for Clinical Research Studies for Patients
  - Looking to Communicate with Clinical Researchers
- Researchers
  - Publish Evidence-Based Guidelines
  - New Treatments
  - Collect Data on Provider Visits
  - Provide Forum to Discuss with Provider
  - Allow Provider to Upload Anonymous Outcomes
- Also – Need to Collaborate Among Researchers of All Types (SharePoint, Wikis, etc.)

Collaboration: Providers and Patients
- Patients
  - Open Personal Health Record to Providers
  - Patients have
    - Data Entry Facility for Chronic Conditions
    - Ability to Graph and Track their Disease
  - Education Materials also Available
- Providers
  - Securely Communicate (email) with Patients (see https://www.relayhealth.com/rh/specific/patients/default.aspx)
  - Access to Authorized Patient Data
  - Tracking of Patients (to Reduce Office Visits)
  - Proactive Intervention to Head off Potential Hospitalizations/Problems via Treatment Algorithms to Auto-Notify Based on Data Values

Collaboration: Among Patients
- Patients
  - Provide Each with a List of Support Groups
  - Allow them to Join Groups or Form New Groups
  - Secure Communication via:
    - Email
    - Chatting Environment
    - Link to Actual (Physical) Meetings
  - Repository of Available Support Groups
- Overall:
  - Patients can Meet other Patients with Same Issues
  - Vital for Patients with Rare Diseases
  - Form On-Line Communities

Security: General Concepts
- Authentication
  - Proving you are who you are
  - Signing a Message
  - Is the Client who S/he Says they are?
- Authorization
  - Granting/Denying Access
  - Revoking Access
  - Does the Client have Permission to do what S/he Wants?
- Encryption
  - Establishing Communications Such that No One but Receiver will Get the Content of the Message
  - Symmetric Encryption
  - Public Key Encryption

Key Security Issues
- Legal and Ethical Issues
  - Information that Must be Protected
  - Information that Must be Accessible
- Policy Issues
  - Who Can See What Information When?
  - Applications Limits w.r.t. Data vs. Users?
- System Level Enforcement
  - What is Provided by the DBMS? Programming Language? OS? Application?
  - How Do All of the Pieces Interact?
- Multiple Security Levels/Organizational Enforcement
  - Mapping Security to Organizational Hierarchy
  - Protecting Information in Organization

What are Key Access Control Concepts?
- Assurance
  - Are the Security Privileges for Each User Adequate to Support their Activities?
  - Do the Security Privileges for Each User Meet but Not Exceed their Capabilities?
- Consistency
  - Are the Defined Security Privileges for Each User Internally Consistent?
    - Least-Privilege Principle: Just Enough Access
  - Are the Defined Security Privileges for Related Users Globally Consistent?
    - Mutual-Exclusion: Read for Some-Write for Others

Available Security Approaches
- Mandatory Access Control (MAC)
  - Bell/LaPadula Security Model
  - Security Classification Levels for Data Items
  - Access Based on Security Clearance of User
- Role Based Access Control (RBAC)
  - Govern Access to Information based on Role
  - Users can Play Different Roles at Different Times
  - Responsibilities of Users Guiding Factor
  - Facilitate User Interactions while Simultaneously Protecting Sensitive Data
- Discretionary Access Control (DAC)
  - Richer Set of Access Modes - Govern Access to Information based on User Id
  - Discretionary Rules on Access Privileges
  - Focused on Application Needs/Requirements

Mandatory Security Mechanism
- Typical Security Classification Levels for Subjects/Programs and Objects/Resources
  - Top Secret (TS) and Secret (S)
  - Confidential (C) and Unclassified (U)
- Rules:
  - TS is the Highest and U is the Lowest Level
  - TS > S > C > U
- Security Levels:
  - C1 is Security Clearance Given to User U1
  - C2 is Security Classification Given to Object O1
  - U1 can Access O1 iff C1 ≥ C2
  - This is Referred to as the Domination of U1 Over O1
- Not Prevalent in BMI – But May have Relevance

Role Based Access Control (RBAC)
- Focuses on Defining Roles of Typical Behavior
  - Nurse, Nurse-Manager, Education-RN
  - Physician, Attending-MD, Specialist
  - Student, Faculty-Advisor, Head
- Focus on Duties that are Shared During Authorization of Roles to Users
- Establish Boundaries of Access
  - User Steve with Role Faculty-Advisor
    - Limited to Faculty Capabilities on PeopleSoft
    - Only Can
      Manipulate His Advisees
  - User Steve with Role Associate Head
    - Possible Overlap in Responsibilities w/ Faculty-Advisor
    - Other Activities not given to Faculty-Advisor Role

Why is RBAC Needed?
- In Health Care, Different Professionals (e.g., Nurses vs. Physicians vs. Administrators, etc.) Require Select Access to Sensitive Patient Data
- Suppose we have a Patient Access Client
  - Lois playing the Nurse Role would be Allowed to Enter Patient History, Record Vital Signs, etc.
  - Steve playing M.D. Role would be Allowed to do all of a Nurse plus Write Orders, Enter Scripts, etc.
  - Vicky playing Admin Role would be Allowed to Enter Demographic/Insurance Info.
- Role Dictates Client Behavior
  - Physicians Write Scripts
  - Nurses Enter Patient Data (Vitals + History)
  - All Access Shared Medical Record
  - Access is Limited Based on Role

Discretionary Access Control
- Discretionary
  - Grant Privileges to Users, Including Capabilities to Access Specific Data Items in a Specific Mode
  - Available in Most Commercial DBMSs
- Aspects of DAC
  - User’s Identity
  - Predefined Discretionary “Rules” Defined by the Security Administrator
  - Allows User to “Delegate” Capabilities to Another User
    - Delegate Capabilities and Ability to Delegate
    - Role Delegation and Delegation Authority
- DAC Available in SQL2

What is Role Delegation?
- Role Delegation, a User-to-User Relationship, Allows an Original User (OU) to Transfer Responsibility for a Particular Role to a Delegated User (DU)
- Two Major Types of Delegation
  - Administratively-directed Delegation: an Administrative Infrastructure Outside the Direct Control of a User Mediates Delegation
  - User-directed Delegation: a User (Playing a Role) Determines If and When to Delegate a Role to Another User
- In Both, Security Administrators Still Oversee Who Can Do What When w.r.t. Delegation

Why is Role Delegation Important?
- Many Different Scenarios Under Which Privileges May Need to be Passed to Other Individuals
  - Large Organizations often Require Delegation to Meet Demands on Individuals in Specific Roles for Certain Periods of Time
- True in Many Different Sectors
  - Health Care and Financial Services
  - Engineering and Academic Setting
- Example:
  - Reda Delegates Head Role to Steve when Traveling
- Key Issues:
  - Who Controls Delegation to Whom?
  - How are Delegation Requirements Enforced?

Coalitions for Clinical/Translational Science
- Diagram labels (participants): Pfizer; Bayer; UConn Storrs; UConn Health Center; Saint Francis, CCMC, …; DCF, DSS, etc.; NIH; FDA; NSF
- Info. Sharing - Joint R&D
- Support T1, T2, and Clinical Research
- Company and University Partnerships
- Collaborative Funding Opportunities
- Cohesive and Trusted Environment
- Existing Systems/Databases and New Applications
- How do you Protect Commercial Interests? Promote Research Advancement?
- Free Read for Some Data/Limited for Other?
- Commercialization vs. Intellectual Property?
- Balancing Cooperation with Propriety

Emergent Public Policy Issues
- How do we Protect a Person’s DNA?
  - Who Owns a Person’s DNA?
  - Who Can Profit from Person’s DNA?
  - Can Person’s DNA be Used to Deny Insurance? Employment? Etc.
  - How do you Define Security Limitations/Access?
- What about i2b2 – Informatics for Integrating Biology and the Bedside (see https://www.i2b2.org/)
  - Scalable Informatics Framework to Bridge
    - Clinical Research Data
    - Vast Data Banks for Basic Science Research
  - Goal: Understand Genetic Bases of Diseases

Emergent Public Policy Issues
- Can DNA Repositories be Anonymously Available for Medical Research?
  - Do Societal Needs Trump Individual Rights?
  - Can DNA be Made Available Anonymously for Medical Research?
  - De-identified Data Repositories
  - Privacy Protecting Data Mining
- International Repository Might Allow Medical Researchers Access to Large Enough Data Set for Rare Conditions (e.g., Orphan Drug Act)
- Individual Rights vs.
Medical Advances   IIE-78 Internet and the Web  CSE 300 A Major Opportunity for Business  A Global Marketplace  Business Across State and Country Boundaries  A Way of Extending Services  Online Payment vs. VISA, Mastercard  A Medium for Creation of New Services  Publishers, Travel Agents, Teller, Virtual Yellow Pages, Online Auctions …   A Boon for Academia  Research Interactions and Collaborations  Free Software for Classroom/Research Usage  Opportunities for Exploration of Technologies in Student Projects What are Implications for BMI? Where is the Adv? IIE-79 WWW: Three Market Segments Server CSE 300 Business to Business Corporate Network    Server Intranet     Decision support Mfg.. System monitoring corporate repositories Workgroups Information sharing Ordering info./status Targeted electronic commerce Internet Corporate Server Network Internet     Sales Marketing Information Services Provider Network Server Provider Network Exposure to Outside IIE-80 Information Delivery Problems on the Net  CSE 300    Everyone can Publish Information on the Web Independently at Any Time  Consequently, there is an Information Explosion  Identifying Information Content More Difficult There are too Many Search Engines but too Few Capable of Returning High Quality Data Most Search Engines are Useful for Ad-hoc Searches but Awkward for Tracking Changes What are Information Delivery Issues for BMI?  Publishing of Patient Education Materials  Publishing of Provider Education Materials  How Can Patients/Providers find what Need?  How do they Know if its Relevant? Reputable? IIE-81 Example Web Applications  CSE 300   Scenario 1: World Wide Wait  A Major Event is Underway and the Latest, Up-tothe Minute Results are Being Posted on the Web  You Want to Monitor the Results for this Important Event, so you Fire up your Trusty Web Browser, Pointing at the Result Posting Site, and Wait, and Wait, and Wait … What is the Problem?  
  ◦ The Scalability Problems are the Result of a Mismatch Between the Data Access Characteristics of the Application and the Technology Used to Implement the Application
• May not be Relevant to BMI: Hard to Apply Scenario
IIE-82

Example Web Applications
• Scenario 2:
  ◦ Many Applications Today Need to Track Changes in Local and Remote Data Sources and Send Notifications If Some Condition Over the Data Source(s) is Met
  ◦ To Monitor Changes on the Web, You Need to Fire up Your Trusty Web Browser from Time to Time, Cache the Most Recent Result, and Difference Manually Each Time You Poll the Data Source(s)
• Issue: Pure Pull is Not the Answer to All Problems
• BMI: If a Patient Enters Data that Sets off a Chain Reaction, how Can the Provider be Notified and in Turn the Provider Notify the Patient (Bad Health Event)?
IIE-83

What is the Problem?
• Applications are Asymmetric but the Web is Not
  ◦ Computation Centric vs. Information Flow Centric
• Types of Asymmetry
  ◦ Network Asymmetry
    ▪ Satellite, CATV, Mobile Clients, Etc.
  ◦ Client to Server Ratio
    ▪ Too Many Clients can Swamp Servers
  ◦ Data Volume
    ▪ Mouse and Key Click vs. Content Delivery
  ◦ Update and Information Creation
    ▪ Clients Need to be Informed or Must Poll
• Clearly, for BMI, Simple Web Environment/Browser is Not Sufficient – No Auto-Notification
IIE-84

What are Information Delivery Styles?
• Pull-Based System
  ◦ Transfer of Data from Server to Client is Initiated by a Client Pull
  ◦ Clients Determine when to Get Information
  ◦ Potential for Information to be Old Unless Client Periodically Pulls
• Push-Based System
  ◦ Transfer of Data from Server to Client is Initiated by a Server Push
  ◦ Clients may get Overloaded if Push is Too Frequent
• Hybrid
  ◦ Pull and Push Combined
  ◦ Pull First and then Push Continually
IIE-85

Publish/Subscribe
• Semantics: Servers Publish/Clients Subscribe
  ◦ Servers Publish Information Online
  ◦ Clients Subscribe to the Information of Interest (Subscription-based Information Delivery)
  ◦ Data Flow is Initiated by the Data Sources (Servers) and is Aperiodic
  ◦ Danger: Subscriptions can Lead to Other Unwanted Subscriptions
• Applications
  ◦ Unicast: Database Triggers and Active Databases
  ◦ 1-to-n: Online News Groups
• May work for Clinical Researcher to Provider Push
IIE-86

Design Options for Nodes
• Three Types of Nodes:
  ◦ Data Sources
    ▪ Provide Base Data which is to be Disseminated
  ◦ Clients
    ▪ Who are the Net Consumers of the Information
  ◦ Information Brokers
    ▪ Acquire Information from Other Data Sources, Add Value to that Information and then Distribute this Information to Other Consumers
    ▪ By Creating a Hierarchy of Brokers, Information Delivery can be Tailored to the Need of Many Users
  ◦ Brokers may be Ideal Intermediaries for BMI!
    ▪ Act on Behalf of Patients, Providers
    ▪ Incorporate Secure Access
IIE-87

Research Challenges
• Ubiquitous/Pervasive: Many computers and information appliances everywhere, networked together
• Inherent Complexity:
  ◦ Coping with Latency (Sometimes Unpredictable)
  ◦ Failure Detection and Recovery (Partial Failure)
  ◦ Concurrency, Load Balancing, Availability, Scale
  ◦ Service Partitioning
  ◦ Ordering of Distributed Events
• "Accidental" Complexity:
  ◦ Heterogeneity: Beyond the Local Case: Platform, Protocol, Plus All Local Heterogeneity in Spades.
  ◦ Autonomy: Change and Evolve Autonomously
  ◦ Tool Deficiencies: Language Support (Sockets, RPC), Debugging, Etc.
IIE-88

Infosphere
• Problem: too many sources, too much information
[Figure: the Internet as an Information Jungle; Infopipes deliver Clean, Reliable, Timely Information, Anywhere; labels: Digital Earth, Personalized Filtering & Info. Delivery, Sensors]
IIE-89

Current State-of-Art
[Figure: Thin Client, Web Server, Mainframe, Database Server]
IIE-90

Infosphere Scenario – for BMI
[Figure: Infotaps & Fat Clients, Sensors, Variety of Servers, Many sources, Database Server]
IIE-91

Heterogeneity and Autonomy
• Heterogeneity:
  ◦ How Much can we Really Integrate?
  ◦ Syntactic Integration
    ▪ Different Formats and Models
    ▪ Web/SQL Query Languages
  ◦ Semantic Interoperability
    ▪ Basic Research on Ontology, Etc.
• Autonomy
  ◦ No Central DBA on the Net
  ◦ Independent Evolution of Schema and Content
  ◦ Interoperation is Voluntary
  ◦ Interface Technology (Support for ISVs)
    ▪ DCOM: Microsoft Standard
    ▪ CORBA, Etc.
IIE-92

Security and Data Quality
• Security
  ◦ System Security in the Broad Sense
  ◦ Attacks: Penetrations, Denial of Service
  ◦ System (and Information) Survivability
    ▪ Security Fault Tolerance
    ▪ Replication for Performance, Availability, and Survivability
• Data Quality
  ◦ Web Data Quality Problems
    ▪ Local Updates with Global Effects
    ▪ Unchecked Redundancy (Mutual Copying)
    ▪ Registration of Unchecked Information
    ▪ Spam on the Rise
IIE-93

Legacy Data Challenge
• Legacy Applications and Data
  ◦ Definition: Important and Difficult to Replace
  ◦ Typically, Mainframe Mission Critical Code
  ◦ Most are OLTP and Database Applications
• Evolution of Legacy Databases
  ◦ Client-server Architectures
  ◦ Wrappers
  ◦ Expensive and Gradual in Any Case
IIE-94

Potential Value Added/Jumping on Bandwagon
• Sophisticated Query Capability
  ◦ Combining SQL with Keyword Queries
• Consistent Updates
  ◦ Atomic Transactions and Beyond
• But Everything has to be in a Database!
  ◦ Only If we Stick with Classic DB Assumptions
• Relaxing DB Assumptions
  ◦ Interoperable Query Processing
  ◦ Extended Transaction Updates
• Commodity DB Software
  ◦ A Little Help is Still Good If it is Cheap
  ◦ Internet Facilitates Software Distribution
  ◦ Databases as Middleware
IIE-95

Data Warehousing and Data Mining
• Data Warehousing
  ◦ Provide Access to Data for Complex Analysis, Knowledge Discovery, and Decision Making
  ◦ Underlying Infrastructure in Support of Mining
  ◦ Provides Means to Interact with Multiple DBs
  ◦ OLAP (On-Line Analytical Processing) vs. OLTP
• Data Mining
  ◦ Discovery of Information in Vast Data Sets
  ◦ Search for Patterns and Common Features
  ◦ Discover Information not Previously Known
    ▪ Medical Records Accessible Nationwide
    ▪ Research/Discover Cures for Rare Diseases
  ◦ Relies on Knowledge Discovery in DBs (KDD)
IIE-96

Data Warehousing and OLAP
• A Data Warehouse
  ◦ Database is Maintained Separately from an Operational Database
  ◦ "A Subject-Oriented, Integrated, Time-Variant, and Non-Volatile Collection of Data in Support of Management's Decision Making Process" [W. H. Inmon]
• OLAP (On-Line Analytical Processing)
  ◦ Analysis of Complex Data in the Warehouse
  ◦ Attempt to Attain "Value" through Analysis
  ◦ Relies on Trained, Skilled Knowledge Workers who Discover Information
• Data Mart
  ◦ Organized Data for a Subset of an Organization
  ◦ Establish De-Identified Marts for BMI Research
IIE-97

Building a Data Warehouse
• Option 1
  ◦ Leverage Existing Repositories
  ◦ Collate and Collect
  ◦ May Not Capture All Relevant Data
• Option 2
  ◦ Start from Scratch
  ◦ Utilize Underlying Corporate Data
[Figure: corporate data warehouse; Option 1: Consolidate Data Marts; Option 2: Build from scratch from corporate data]
IIE-98

BMI – Partition/Excerpt Data Warehouse
• Clinical and Epidemiological Research (and for T2 and T1)
• Each Study Submitted to Institutional Review Board (IRB)
  ◦ For Human Subjects (Assess Risks, Protect Privacy)
  ◦ See: http://resadm.uchc.edu/hspo/irb/
• To Satisfy IRB (and Privacy, Security, etc.), Reverse Process to Create a Data Mart for each Approved Study
  ◦ Export/Excerpt Study Data from Warehouse
  ◦ May be Single or Multiple Sources
[Figure: BMI data warehouse feeding multiple data marts]
IIE-99

Data Warehouse Characteristics
• Utilizes a "Multi-Dimensional" Data Model
• Warehouse Comprised of
  ◦ Store of Integrated Data from Multiple Sources
  ◦ Processed into Multi-Dimensional Model
• Warehouse Supports
  ◦ Time Series and Trend Analysis
  ◦ "Super-Excel" Integrated with DB Technologies
• Data is Less Volatile than Regular DB
  ◦ Doesn't Dramatically Change Over Time
  ◦ Updates at Regular Intervals
  ◦ Specific Refresh Policy Regarding Some Data
IIE-100

Three Tier Architecture
[Figure: three-tier architecture. Labels: External data sources; Operational databases; monitor; integrator; Extract, Transform, Load, Refresh; Data Warehouse; metadata; Data marts; serve; OLAP Server; Summarization report; Query report; Data mining]
IIE-101

Data Warehouse Design
• Most Data Warehouses use a Star Schema to Represent the Multi-Dimensional Data Model
• Each Dimension is Represented by a Dimension Table that Provides its Multidimensional Coordinates
• A Fact Table Connects All Dimension Tables with a Multiple Join and Stores Measures for those Coordinates
  ◦ Each Tuple in Fact Table Represents the Content of One Dimension
  ◦ Each Tuple in the Fact Table Consists of a Pointer to Each of the Dimensional Tables
  ◦ Links Between the Fact Table and the Dimensional Tables Form a Shape Like a Star
IIE-102

What is a Multi-Dimensional Data Cube?
• Representation of Information in Two or More Dimensions
• Typical Two-Dimensional - Spreadsheet
• In Practice, to Track Trends or Conduct Analysis, Three or More Dimensions are Useful
• For BMI – Axes for Diagnosis, Drug, Subject Age
IIE-103

Multi-Dimensional Schemas
• Supporting Multi-Dimensional Schemas Requires Two Types of Tables:
  ◦ Dimension Table: Tuples of Attributes for Each Dimension
  ◦ Fact Table: Measured/Observed Variables with Pointers into Dimension Table
• Star Schema
  ◦ Characterizes Data Cubes by having a Fact Table with a Single Table for Each Dimension
• Snowflake Schema
  ◦ Dimension Tables from Star Schema are Organized into Hierarchy via Normalization
• Both Represent Storage Structures for Cubes
IIE-104

Example of Star Schema
  Sale Fact Table:  Date, Product, Store, Customer, Unit_Sales, Dollar_Sales
  Date:             Date, Month, Year
  Product:          ProductNo, ProdName, ProdDesc, Category
  Store:            StoreID, City, State, Country, Region
  Customer:         CustID, CustName, CustCity, CustCountry
IIE-105

Example of Star Schema for BMI
  Patient Fact Table:  Visit Date, Vitals, Symptoms, Patient, Medications
  Date:                Date, Month, Year
  Vitals:              BP, Temp, Resp, HR (Pulse)
  Symptoms:            Pulmonary, Heart, Mus-Skel, Skin, Digestive, Etc.
  Patient:             PatientID, PatientName, PatientCity, PatientCountry
  Medications:         Reference another Star Schema for all Meds
IIE-106

A Second Example of Star Schema …
[Figure]
IIE-107

and Corresponding Snowflake Schema
[Figure]
IIE-108

Data Warehouse Issues
• Data Acquisition
  ◦ Extraction from Heterogeneous Sources
  ◦ Reformatted into Warehouse Context - Names, Meanings, Data Domains Must be Consistent
  ◦ Data Cleaning for Validity and Quality: is the Data as Expected w.r.t. Content? Value?
  ◦ Transition of Data into Data Model of Warehouse
  ◦ Loading of Data into the Warehouse
• Other Issues Include:
  ◦ How Current is the Data? Frequency of Update?
  ◦ Availability of Warehouse? Dependencies of Data?
  ◦ Distribution, Replication, and Partitioning Needs?
  ◦ Loading Time (Clean, Format, Copy, Transmit, Index Creation, etc.)?
  ◦ For CTSA – Data Ownership (Competing Hosps).
IIE-109

Knowledge Discovery
• Data Warehousing Requires Knowledge Discovery to Organize/Extract Information Meaningfully
• Knowledge Discovery
  ◦ Technology to Extract Interesting Knowledge (Rules, Patterns, Regularities, Constraints) from a Vast Data Set
  ◦ Process of Non-trivial Extraction of Implicit, Previously Unknown, and Potentially Useful Information from Large Collection of Data
• Data Mining
  ◦ A Critical Step in the Knowledge Discovery Process
  ◦ Extracts Implicit Information from Large Data Set
IIE-110

Steps in a KDD Process
• Learning the Application Domain (goals)
• Gathering and Integrating Data
• Data Cleaning
• Data Integration
• Data Transformation/Consolidation
• Data Mining
  ◦ Choosing the Mining Method(s) and Algorithm(s)
  ◦ Mining: Search for Patterns or Rules of Interest
• Analysis and Evaluation of the Mining Results
• Use of Discovered Knowledge in Decision Making
• Important Caveats
  ◦ This is Not an Automated Process!
  ◦ Requires Significant Human Interaction!
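A minimal sketch of the cleaning, transformation, and mining steps listed above, written in Python. The records, field names, and the choice of simple frequency counting as the "mining method" are all hypothetical and chosen only for illustration; they do not come from the slides. Domain learning, evaluation of the results, and their use in decision making stay with a human analyst, as the caveats note.

```python
from collections import Counter

# Hypothetical raw records gathered from several sources (illustration only).
RAW = [
    {"patient": "p1", "birth_year": 1945, "med": "A"},
    {"patient": "p2", "birth_year": 1955, "med": "B"},
    {"patient": "p3", "birth_year": None, "med": "A"},   # missing value
    {"patient": "p4", "birth_year": 1948, "med": "a "},  # inconsistent formatting
    {"patient": "p5", "birth_year": 1959, "med": "B"},
]


def clean(records):
    """Data cleaning: drop records with missing fields, normalize text values."""
    cleaned = []
    for record in records:
        if any(value is None for value in record.values()):
            continue
        cleaned.append({**record, "med": record["med"].strip().upper()})
    return cleaned


def transform(records):
    """Data transformation/consolidation: reduce each record to (decade, med)."""
    return [(record["birth_year"] // 10 * 10, record["med"]) for record in records]


def mine(pairs):
    """Data mining: count how often each (decade, med) combination occurs."""
    return Counter(pairs)


patterns = mine(transform(clean(RAW)))
for (decade, med), count in patterns.most_common():
    print(f"{decade}s / {med}: {count}")
```

Keeping each step as its own function mirrors the step list: an analyst can inspect the intermediate output of any stage before deciding whether to continue.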
IIE-111

OLAP Strategies
• OLAP Strategies
  ◦ Roll-Up: Summarization of Data
  ◦ Drill-Down: from the General to Specific (Details)
  ◦ Pivot: Cross Tabulate the Data Cubes
  ◦ Slice and Dice: Projection Operations Across Dimensions
  ◦ Sorting: Ordering Result Sets
  ◦ Selection: Access by Value or Value Range
• Implementation Issues
  ◦ Persistent with Infrequent Updates (Loading)
  ◦ Optimization for Performance on Queries is More Complex - Across Multi-Dimensional Cubes
  ◦ Recovery Less Critical - Mostly Read Only
  ◦ Temporal Aspects of Data (Versions) Important
IIE-112

On-Line Analytical Processing
• Data Cube
  ◦ A Multidimensional Array
  ◦ Each Attribute is a Dimension
• In Example Below, the Data Must be Interpreted so that it Can be Aggregated by Region/Product/Date

  Product      Store     Date     Sale
  acron        Rolla,MO  7/3/99   325.24
  budwiser     LA,CA     5/22/99  833.92
  large pants  NY,NY     2/12/99  771.24
  3' diaper    Cuba,MO   7/30/99  81.99

[Figure: data cube with axes Product (Pants, Diapers, Beer, Nuts), Region (West, East, Central, Mountain, South), and Date (Jan, Feb, March, April)]
IIE-113

On-Line Analytical Processing
• For BMI – Imagine a Data Table with Patient Data
  ◦ Define Axes
  ◦ Summarize Data
  ◦ Create Perspective to Match Research Goal
  ◦ Essentially De-identified Data Mart

  Patient  Med      BirthDat  Dosage
  Steve    Lipitor  1/1/45    10mg
  John     Zocor    2/2/55    80mg
  Harry    Crestor  3/3/65    5mg
  Lois     Lipitor  4/4/66    20mg
  Charles  Crestor  7/1/59    10mg

[Figure: data cube with axes Medication (Lescol, Crestor, Zocor, Lipitor), Dosage (5, 10, 20, 40, 80), and Decade (1940s, 1950s, 1960s, 1970s)]
IIE-114

Examples of Data Mining
• The Slicing Action
  ◦ A Vertical or Horizontal Slice Across Entire Cube
[Figure: Multi-Dimensional Data Cube with axes Months, Products, Sales; Slice on city Atlanta]
IIE-115

Examples of Data Mining
• The Dicing Action
  ◦ A Slice First Identifies One Dimension
  ◦ A Selection of Any Cube within the Slice which Essentially Constrains All Three Dimensions
[Figure: cube with axes Months, Products, Sales; Dice on Electronics and Atlanta, March 2000]
IIE-116

Examples of Data Mining
• Drill Down: Takes a Facet (e.g., Q1) and Decomposes into Finer Detail
• Roll Up: Combines Multiple Dimensions, From Individual Cities to State
[Figure: cubes with axes Products, Sales, and quarters Q1 to Q4; Drill down on Q1 into Jan, Feb, March; Roll Up on Location (State, USA)]
IIE-117

Mining Other Types of Data
• Analysis and Access Dramatically More Complicated!
• Time Series Data for Glucose, BP, Peak Flow, etc.
• Spatial databases
• Multimedia databases
• World Wide Web
• Time series data
• Geographical and Satellite Data
IIE-118

Advantages/Objectives of Data Mining
• Descriptive Mining
  ◦ Discover and Describe General Properties
  ◦ 60% of People who Buy Beer on Friday also have Bought Nuts or Chips in the Past Three Months
• Predictive Mining
  ◦ Infer Interesting Properties based on Available Data
  ◦ People who Buy Beer on Friday usually also Buy Nuts or Chips
• Result of Mining
  ◦ Order from Chaos
  ◦ Mining Large Data Sets in Multiple Dimensions Allows Businesses, Individuals, etc. to Learn about Trends, Behavior, etc.
  ◦ Impact on Marketing Strategy
IIE-119

Data Mining Methods (1)
• Association
  ◦ Discover the Frequency of Items Occurring Together in a Transaction or an Event
  ◦ Example
    ▪ 80% of Customers who Buy Milk also Buy Bread; Hence - Bread and Milk Adjacent in Supermarket
    ▪ 50% of Customers Forget to Buy Milk/Soda/Drinks; Hence - Available at Register
• Prediction
  ◦ Predicts Some Unknown or Missing Information based on Available Data
  ◦ Example
    ▪ Forecast Sale Value of Electronic Products for Next Quarter via Available Data from Past Three Quarters
IIE-120

Association Rules
• Motivated by Market Analysis
• Rules of the Form
  ◦ Item_1 ^ Item_2 ^ … ^ Item_k → Item_{k+1} ^ … ^ Item_n
• Example
  ◦ "Beer ^ Soft Drink → Pop Corn"
• Problem: Discovering All Interesting Association Rules in a Large Database is Difficult!
  ◦ Issues
    ▪ Interestingness
    ▪ Completeness
    ▪ Efficiency
  ◦ Basic Measurement for Association Rules
    ▪ Support of the Rule
    ▪ Confidence of the Rule
IIE-121

Data Mining Methods (2)
• Classification
  ◦ Determine the Class or Category of an Object based on its Properties
  ◦ Example
    ▪ Classify Companies based on the Final Sale Results in the Past Quarter
• Clustering
  ◦ Organize a Set of Multi-dimensional Data Objects in Groups to Minimize Inter-group Similarity and Maximize Intra-group Similarity
  ◦ Example
    ▪ Group Crime Locations to Find Distribution Patterns
IIE-122

Classification
• Two Stages
  ◦ Learning Stage: Construction of a Classification Function or Model
  ◦ Classification Stage: Prediction of Classes of Objects Using the Function or Model
• Tools for Classification
  ◦ Decision Tree
  ◦ Bayesian Network
  ◦ Neural Network
  ◦ Regression
• Problem
  ◦ Given a Set of Objects whose Classes are Known (Training Set), Derive a Classification Model which can Correctly Classify Future Objects
IIE-123

An Example
• Attributes

  Attribute    Possible Values
  outlook      sunny, overcast, rain
  temperature  continuous
  humidity     continuous
  windy        true, false

• Class Attribute - Play/Don't Play the Game
• Training Set
  ◦ Values that Set the Condition for the Classification
  ◦ What is the Pattern Below?

  Outlook   Temperature  Humidity  Windy  Play
  sunny     85           85        false  No
  overcast  83           78        false  Yes
  sunny     80           90        true   No
  sunny     72           95        false  No
  sunny     72           70        false  Yes
  …         …            …         …      ...
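The two stages of classification can be sketched on the five training rows shown above. This is a minimal illustration in Python, not the method used in the course: it learns a one-level decision tree (a "stump") on a single numeric attribute by trying the midpoints between observed values as thresholds. The function names and the choice of humidity as the split attribute are assumptions made for the example, and the rows elided by "…" in the slide are not used.

```python
# Training rows copied from the slide: (outlook, temperature, humidity, windy, play)
TRAINING_SET = [
    ("sunny", 85, 85, False, "No"),
    ("overcast", 83, 78, False, "Yes"),
    ("sunny", 80, 90, True, "No"),
    ("sunny", 72, 95, False, "No"),
    ("sunny", 72, 70, False, "Yes"),
]
HUMIDITY = 2  # column index of the attribute used for the split (an assumption)


def learn_stump(rows, column):
    """Learning stage: choose the threshold and labels with the fewest training errors."""
    values = sorted({row[column] for row in rows})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for threshold in candidates:
        for low_label, high_label in (("Yes", "No"), ("No", "Yes")):
            errors = sum(
                1
                for row in rows
                if (low_label if row[column] <= threshold else high_label) != row[-1]
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, low_label, high_label)
    return best


def classify(stump, value):
    """Classification stage: apply the learned threshold to a new object."""
    _, threshold, low_label, high_label = stump
    return low_label if value <= threshold else high_label


stump = learn_stump(TRAINING_SET, HUMIDITY)
print(stump)                # (training errors, threshold, label below, label above)
print(classify(stump, 65))  # class predicted for a new day with humidity 65
```

On these five rows a humidity threshold separates the Play and Don't Play days without error, which is one answer to the slide's question about the pattern. With the full training set a single split would likely not suffice, and a deeper tree or another of the listed tools would be needed.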
IIE-124

Data Mining Methods (3)
• Summarization
  ◦ Characterization (Summarization) of General Features of Objects in the Target Class
  ◦ Example
    ▪ Characterize People's Buying Patterns on the Weekend
    ▪ Potential Impact on "Sale Items" & "When Sales Start"
    ▪ Department Stores with Bonus Coupons
• Discrimination
  ◦ Comparison of General Features of Objects Between a Target Class and a Contrasting Class
  ◦ Example
    ▪ Comparing Students in Engineering and in Art
    ▪ Attempt to Arrive at Commonalities/Differences
IIE-125

Summarization Technique
• Attribute-Oriented Induction
• Generalization using Concept Hierarchy (Taxonomy)

  barcode  category    brand       content    size
  14998    milk        dairyland   Skim       2L
  12998    mechanical  MotorCraft  valve 23a  12in
  …        …           …           …          ...

  Category  Content  Count
  milk      skim     280
  milk      2%       98
  …         …        ...

[Figure: concept hierarchy: food → Milk, bread, …; Milk → Skim milk, 2% milk, …; bread → White bread, whole wheat, …; brands: Lucern, Dairyland, Wonder, Safeway, …]
IIE-126

Why is Data Mining Popular?
• Technology Push
  ◦ Technology for Collecting Large Quantity of Data
    ▪ Bar Code, Scanners, Satellites, Cameras
  ◦ Technology for Storing Large Collection of Data
    ▪ Databases, Data Warehouses
    ▪ Variety of Data Repositories, such as Virtual Worlds, Digital Media, World Wide Web
• Corporations want to Improve Direct Marketing and Promotions - Driving Technology Advances
  ◦ Targeted Marketing by Age, Region, Income, etc.
  ◦ Exploiting User Preferences/Customized Shopping
• What is Potential for BMI?
  ◦ How do you see Data Mining Utilized?
  ◦ What are Key Issues to Worry About?
IIE-127

Requirements & Challenges in Data Mining
• Security and Social
  ◦ What Information is Available to Mine?
  ◦ Preferences via Store Cards/Web Purchases
  ◦ What is Your Comfort Level with Trends?
• User Interfaces and Visualization
  ◦ What Tools Must be Provided for End Users of Data Mining Systems?
  ◦ How are Results for Multi-Dimensional Data Displayed?
• Performance Guarantees
  ◦ Range from Real-Time for Some Queries to Long-Term for Other Queries
• Data Sources of Complex Data Types or Unstructured Data - Ability to Format, Clean, and Load Data Sets
IIE-128

Concluding Remarks
• We've looked at:
  ◦ Informatics
  ◦ Information Engineering
  ◦ Information Usage and Repositories
• Focused on Their Applicability and Relevance for BMI
• Likely Generated More Questions than Answers
IIE-129