Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
REZA ZAFARANI AND HUAN LIU DATA MINING AND MACHINE LEARNING LABORATORY (DMML) ARIZONA STATE UNIVERSITY KDD 2013 – CHICAGO, ILLINOIS How hard can it be to identify an individual across sites? Privacy Experts Claim Advertisers Know a lot about People Can they stop showing you the same repetitive ads across sites? More information about individuals Many social media sites Partial Information Huan Liu Google+ Age N/A Location USA Education Complementary Information Facebook USC (1985-89) Connectivity is not available Consistency in Information Availability Better User Profiles Can we connect individuals across sites? Can we verify that the information provided across sites belong to the same individual? Human behavior generates - Behavioral Information redundancy Modeling Information shared across sites - Minimum Information provides a behavioral fingerprint MOBIUS MOdeling Behavior for Identifying Users across Sites Minimum information available on ALL sites: Usernames Identification Function Candidate Username (john.smith) Prior Usernames ({jsmith, john.s}) Generates Captured Via Behavior 1 Information Redundancy Feature Set 1 Behavior 2 Information Redundancy Feature Set 2 Behavior n Information Redundancy Feature Set n Identification Function Learning Framework Data Human Limitation Behaviors Exogenous Factors Endogenous Factors Time & Memory Limitation Knowledge Limitation Typing Patterns Language Patterns Personal Attributes & Traits Habits Using Same Usernames 59% of individuals use the same username 5 Username Length Likelihood 4 2 0 0 1 0 2 0 3 0 4 0 5 1 0 6 7 8 9 10 0 11 12 Limited Vocabulary Identifying individuals by their vocabulary size Limited Alphabet Alphabet Size is correlated to language: शमंत कुमार -> Shamanth Kumar QWERTY Keyboard Variants: AZERTY, QWERTZ DVORAK Keyboard Keyboard type impacts your usernames Modifying Previous Usernames Adding Prefixes/Suffixes, Abbreviating, Swapping or Adding/Removing Characters Creating Similar Usernames Nametag and Gateman Username Observation Likelihood Usernames come from a language model Previous Methods: 1) Zafarani and Liu, 2009 Data: 2) Perito et al., 2011 200,000 instances (50% class balance) Baselines: 414 Features 1) Exact Username Match 2) Substring Match 3) Patterns in Letters 100 90 80 70 60 50 40 30 20 10 0 77 63.12 Exact Username Match 66 77.59 91.38 49.25 Substring Matching Patterns in Letters Zafarani and Liu Perito et al. Naïve Bayes 94.5 94 93.5 93 92.5 92 91.5 91 90.5 90 89.5 89 93.59 93.7 93.71 93.77 93.8 91.38 Naïve Bayes 90.87 J48 Random Forest L2-reg L2- L1-reg L2Loss SVM Loss SVM L2-reg L1-reg Logistic Logistic Regression Regression Discover applications ofin sites Human Behavior Results Information shared across A methodology for connecting Incorporating features connecting users across sites Information Redundancy acts as a behavioral fingerprint individuals across sites indigenous to specific sites A behavioral modeling approach Uses minimum information across sites Allows for integration of additional behaviors when required