Detecting Web Spam through Backward Propagation of Distrust
CS315 - Web Search and Mining

And Now For Something Completely(?) Different
Propaganda: an attempt to modify human behavior, and thus influence people’s actions, in ways beneficial to the propagandists

Theory of Propaganda
Developed by the Institute for Propaganda Analysis, 1938-42
Propagandistic techniques (and ways of detecting propaganda):
  Word games - associate a good/bad concept with a social entity (Glittering Generalities, Name Calling)
  Transfer - use special privileges (e.g., office) to breach trust
  Testimonial - famous non-experts’ claims
  Plain Folk - people like us think this way
  Bandwagon - everybody’s doing it, jump on the wagon
  Card Stacking - use of bad logic

Web Spammers as Propagandists
Web spammers can be seen as employing propagandistic techniques in order to modify the Web graph
There is a pattern to how they spam!

Anti-Spam Lessons from Society
What would you do if you realized that you should not trust a member of your trust network?
[Figure: a trust network around YOU, with nodes including Mom, Partner, Your Boss, Prof. X, Rev. Y, Joe (a plumber), Famous Actress, The Coffee Joint, NYTimes, US Pres., and Democracy; several nodes are marked with "?"]

Anti-Propagandistic Lessons for the Web
How do you deal with propaganda in real life?
Backwards propagation of distrust: the recommender of an untrustworthy message becomes untrustworthy
Can you transfer this technique to the web?

An Anti-Propagandistic Algorithm: Backwards Propagation of Distrust
Start from an untrustworthy site s
S = {s}
Using BFS for depth D do:
  Find the set U of sites linking to sites in S (using the Google API for up to B backlinks (b-links) per site)
  Ignore blogs, directories, edu’s
  S = S + U
Find the bi-connected component BCC of U that includes s
BCC shows multiple paths to boost the reputation of s

BCC vs Periphery
Since the BCC reveals multiple paths to boost the reputation of s, we expect it to contain a higher percentage of untrustworthy sites
The Periphery of the BCC, on the other hand, should have a significantly lower percentage of untrustworthy sites
[Figure: explored neighborhoods, showing the BCC and the Periphery]

Evaluated Experimental Results
The trustworthiness of the starting site is a very good predictor of the trustworthiness of the BCC sites
The BCC is significantly more predictive of untrustworthiness than the Periphery
[Figure: results for BCC vs Periphery]

Link Farms vs MAS
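
As a rough illustration of the backward-propagation steps listed earlier (BFS over backlinks, then splitting the explored sites into BCC and Periphery), here is a minimal Python sketch. It rests on several assumptions that are not in the slides: the Google API lookup is replaced by a hypothetical in-memory dictionary `backlinks`, links are treated as undirected when computing bi-connected components, and when s belongs to more than one component the largest one is taken as "the" BCC. All function names and the example sites are made up for illustration.

```python
from collections import defaultdict

def explore_backlinks(backlinks, start, depth, max_backlinks, ignore):
    """BFS backwards from `start`; return (explored sites, undirected edges)."""
    seen, frontier, edges = {start}, [start], set()
    for _ in range(depth):
        next_frontier = []
        for site in frontier:
            # Hypothetical stand-in for the backlink query: at most B per site.
            for src in backlinks.get(site, [])[:max_backlinks]:
                if src == site or ignore(src):
                    continue
                edges.add(frozenset((src, site)))
                if src not in seen:
                    seen.add(src)
                    next_frontier.append(src)
        frontier = next_frontier
    return seen, edges

def biconnected_components(edges):
    """Node sets of bi-connected components (recursive DFS, small graphs only)."""
    adj = defaultdict(set)
    for edge in edges:
        a, b = tuple(edge)
        adj[a].add(b)
        adj[b].add(a)
    disc, low, stack, components = {}, {}, [], []

    def dfs(node, parent):
        disc[node] = low[node] = len(disc)
        for nbr in adj[node]:
            if nbr == parent:
                continue
            if nbr not in disc:
                stack.append((node, nbr))
                dfs(nbr, node)
                low[node] = min(low[node], low[nbr])
                if low[nbr] >= disc[node]:
                    # `node` separates this subtree: pop one component.
                    comp = set()
                    while True:
                        a, b = stack.pop()
                        comp.update((a, b))
                        if (a, b) == (node, nbr):
                            break
                    components.append(comp)
            elif disc[nbr] < disc[node]:
                # Back edge to an ancestor.
                stack.append((node, nbr))
                low[node] = min(low[node], disc[nbr])

    for node in list(adj):
        if node not in disc:
            dfs(node, None)
    return components

def bcc_and_periphery(backlinks, start, depth=2, max_backlinks=10,
                      ignore=lambda site: False):
    sites, edges = explore_backlinks(backlinks, start, depth, max_backlinks, ignore)
    candidates = [c for c in biconnected_components(edges) if start in c]
    # Assumption: take the largest component containing the start site.
    bcc = max(candidates, key=len, default={start})
    return bcc, sites - bcc

# Made-up example: two farm sites link to the spam site and to each other.
backlinks = {
    "spam.example": ["farm-a.example", "farm-b.example", "blog.example"],
    "farm-a.example": ["farm-b.example", "bystander.example"],
    "farm-b.example": ["farm-a.example"],
}
bcc, periphery = bcc_and_periphery(
    backlinks, "spam.example", depth=2, max_backlinks=10,
    ignore=lambda site: site.startswith("blog."),  # crude blog filter
)
print(sorted(bcc), sorted(periphery))
```

In this toy graph the two farm sites and the spam site sit on a cycle, so they end up in the BCC, while the site that links in along a single path lands in the Periphery. A real run would need an actual backlink source and a more careful filter for blogs, directories, and edu sites.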