CLASSIFICATION OF IMAGE SPAM

A Thesis Presented to The Graduate Faculty of The University of Akron
In Partial Fulfillment of the Requirements for the Degree
Master of Computer Science

Shruti Wakade
August, 2011

Thesis Approved / Accepted:
Advisor: Dr. Kathy J. Liszka
Department Chair: Dr. Chien-Chung Chan
Committee Member: Dr. Zhong-Hui Duan
Committee Member: Dr. Chien-Chung Chan
Dean of the College: Dr. Chand Midha
Dean of the Graduate School: Dr. George R. Newkome

ABSTRACT

Image spam has been one of the most prevalent forms of spam since its inception. Spammers have refined their techniques to use smaller, more colorful, photo-quality images as spam. Despite numerous efforts by researchers and free mail services like Yahoo Mail and Gmail to build efficient filters against e-mail spam, spam filters still fail to arrest image spam. This research is an attempt to understand the techniques used in spamming and to identify a set of features that can help classify image spam apart from photographs. A set of eight features was identified based on observations and existing research in this area. Among the eight features, six have been introduced by us and two have been taken from previous research. Data mining techniques were then applied to classify image spam from photographs. Identifying a set of efficient yet computationally inexpensive features was the objective that guided this work. We achieved a classification accuracy of 89% on the test samples. A detailed trail of image spam was also studied to identify the most prevalent types and patterns in image spam.
Our results indicate that five of the six features we introduced proved to be highly significant in distinguishing image spam from photographs.

ACKNOWLEDGEMENTS

I extend my heartfelt gratitude and appreciation to Dr. Kathy J. Liszka, an extremely helpful teacher and a wonderful advisor, who is the guiding force behind this research work. Without her guidance, input, and encouragement this work would not have been possible. I express my sincere appreciation and gratitude to Dr. Chan for helping me with the data mining experiments and for insightful corrections. I appreciate my committee member Dr. Duan for her thoughtful input. I wish to thank Chuck Van Tilburg for his help in the research labs and for providing a workable environment there. I also wish to thank Knujon for contributing spam images, which helped me build a substantial corpus for this research. Last, but not least, I would like to convey my heartfelt gratitude to my family and friends for their constant encouragement, support, and timely help.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER
I. INTRODUCTION
II.
SPAM DEFINITION AND TYPES
   2.1 Overview
   2.2 Types of Spam
   2.3 Image Spam
   2.4 Related Research
III. SPAM IMAGES AND DATASET
   3.1 Types of Images
   3.2 Image Spam Dataset
   3.3 Corpus
      3.3.1 Statistics of Images in the Corpus
   3.4 Preprocessing
      3.4.1 Feature Selection
   3.5 Feature Extraction Process
IV.
DATA MINING TECHNIQUES
   4.1 Data Mining Overview
   4.2 Classification
   4.3 Decision Trees
      4.3.1 J48
      4.3.2 RepTree
V. EXPERIMENTS AND RESULTS
   5.1 Weka Data Mining Tool
   5.2 Data Set Preparation
   5.3 Methodology
      5.3.1 Run 1 - Using J48 Classifier
      5.3.2 Run 2 - Using RepTree Classifier
      5.3.3 Depth of the RepTree
      5.3.4 Dataset Proportions
      5.3.5 Training and Testing Data Selection
      5.3.6 Testing on Unseen Data
VI. VALIDATION BY FEATURE ANALYSIS
VII.
TRENDS IN IMAGE SPAM
   7.1 Count of Image Spam
   7.2 Trend of the Month
   7.3 New Trends in Image Spam
      7.3.1 Scraped Images
      7.3.2 Malware Embedding in Images
VIII. CONCLUSIONS AND FUTURE WORK
REFERENCES
APPENDICES
   APPENDIX A. DATA ANALYSIS
   APPENDIX B.
GENERATING MD5SUM AND SELECTING UNIQUE FILES

LIST OF TABLES

3.1 Statistics of the images collected to form the corpus
4.1 Example data for classification
4.2 Example data for clustering
5.1 Depth value of RepTree
5.2 Accuracy of classification for different ratios of ham and spam images
5.3 Count of spam images in 2010
5.4 Accuracy of classification for unseen samples
5.5 Computing time for extracting features
7.1 Image spam count in 2008-2011

LIST OF FIGURES

1.1 Example of Image Spam
2.1 Adding noise to the Image
2.2 Wavy images
2.3 Rotating Image and adding noise
3.1 Text only image spam
3.2 Adding random colored pixels
3.3 Adding color streaks
3.4 Adding a wild background
3.5 Examples of Standard Images
3.6 Similar spam images but with different checksums
3.7 Low log average luminance
3.8 High log average luminance
3.9 Ham image background
3.10 Spam image background
4.1 Example of decision tree using J48 classifier for weather data
5.1 Part of Decision tree generated by the classifier
6.1 Plots for range of feature values for Image Spam and Ham images
7.1 Images that appeared on February 14th - Valentine's Day
7.2 Image spam type for the time period of January 2010 - February 2011
7.3 Example of a scraped spam image
7.4 Malware Embedded .png Image
7.5 Binary form of images before and after saving as .hta file
A.1 Average of feature values for all Image Spam in the year 2010
B.1 Script to generate md5sum
B.2 Output of md5sum script
B.3 Script to delete the duplicate files
B.4 Output of the Java program to identify duplicate files

CHAPTER I

INTRODUCTION

E-mail is one of the most integral parts of communication over the internet today. However, each day we spend several minutes deleting spam: unsolicited e-mails advertising products, offering loans at low interest rates, drugs, and more. Though spam filters are able to identify the majority of e-mail spam, spammers are continuously developing newer techniques to send spam messages to more and more people. With the advent of technology, mobile devices and other portable electronic devices are now Wi-Fi enabled, and internet telephony (VoIP, voice over internet protocol) has made communicating across the world easy and inexpensive. Social networks like Twitter, Facebook, and MySpace are very popular means of connecting with friends across the globe. However, this has opened a new audience for spammers to exploit. Spam is not limited to e-mail anymore; it appears on VoIP in the form of unsolicited marketing or advertising phone calls, and as marketing, advertising, and pornography links on social networks. Spam is everywhere! There are many ways spammers can learn your e-mail address and send you spam even if you never open any spam mail or click any suspicious links. If you are on a social network and do not set your privacy settings, your data, including your location, e-mail, and friend lists, is available to anyone. If you subscribe to newsgroups, your e-mail address can be easily harvested. A dictionary attack is one such technique to harvest e-mail addresses.
So it is easy to find information with little time and effort, and spammers have plenty of it. Most spammers use bots to do the job for them, so even if only one user responds to their spam, it is worth the effort of sending e-mail to hundreds of people. Filters today can arrest most of the e-mail spam that appears in the form of text. Blacklisting known spamming IPs can also prevent spam to a certain extent. This research deals with image spam, a spamming technique in which the spam message is embedded in an image instead of being directly part of the message body. Two examples of spam images are shown in Figure 1.1. We test a method to classify spam images using decision trees in the Weka data mining tool.

The remainder of this report is organized as follows. Chapter two provides an overview of spam and the current research in the area. Chapter three describes the spam images and the dataset, including preprocessing and feature extraction. Chapter four covers the data mining techniques used. Chapter five presents the experiments and results of classifying spam and non-spam images. Chapter six validates the features by analysis, chapter seven examines trends in image spam, and the final chapter discusses conclusions and future work.

Figure 1.1 Example of Image Spam

CHAPTER II

SPAM DEFINITION AND TYPES

2.1 Overview

In general, spam refers to the use of electronic messaging to send unsolicited messages to a large group of addresses arbitrarily. Though e-mail spam is the most widely known form of spam, it also appears in many other electronic media such as chat, internet telephony, social networks, and web spamming. The cost of transmitting these messages is borne by the users who receive them and by the ISPs, who cannot stop the spam traffic and are forced to increase bandwidth to accommodate it. The spammers only need to manage the mailing lists that they target.
Some common examples of spam are:
- Advertisements in the form of pop-ups selling products or offering free downloads when we click a link on a web page
- Unsolicited e-mails with inappropriate content, offers, or political views
- Redundant calls on IMs like Skype offering mortgages or loans with low interest rates
- Links on social networks that lead to free downloads, easy income, or pornography
- Unsolicited text messages offering loans, low-priced products, etc.

2.2 Types of Spam

E-mail spam - Also known as unsolicited bulk e-mail, this is the most common form of spam we see. Mostly the motive behind these messages is to advertise and sell products, steal information (phishing), express political views, distribute pornography, or inject malware. The first e-mail spam is said to have been sent in 1978 by DEC, as an invitation to all ARPANET addresses to the reception of their new DEC-20 machine. After this incident, 1994 saw the first big USENET spam, which proclaimed religious writing and caused a lot of controversy and debate. The next big spam was the green card lottery, where two attorneys sent bulk USENET postings offering green card visas to immigrants [1]. As time passed, spam grew more and more in volume and in severity. Today most spam mail is sent using bots: compromised systems controlled by a master system to send spam messages. If a user falls prey to the spam, the user's system may be compromised and become a bot itself, the user's credentials may be stolen, or malware may be injected into the user's machine. E-mail harvesting is the most common way to get e-mail addresses. The method involves spammers purchasing addresses from other spammers, or using harvesting bots which collect e-mail addresses from postings on Usenet, internet forums, etc. [26]. Another method is the dictionary attack, where valid e-mail addresses are generated by guessing common usernames in a domain.
Apart from these, social networks today provide an easy way to reach larger audiences and are the new favorite among spammers.

Instant messaging and texting - Spam in instant messengers like Skype and Yahoo Messenger generally comes in the form of friend requests from unknown people. It is sparser than e-mail spam. Text spam consists of promotional offers, advertisements for low interest rate loans, etc., in the form of text messages from unknown sources.

Search engine spam (spamdexing) - Refers to spamming web pages to falsely increase their ranking in search results.

2.3 Image Spam

Image spam is a variant of e-mail spam where spammers embed the spam message in an image instead of placing it directly in the mail content, in order to evade spam filters. Spam filters look for certain keywords, like Viagra, cash, and money, which are commonly related to spam e-mails. However, when the message is inside an image, the spam filters cannot effectively catch it. There are many techniques spammers have used to confuse spam filters. Some examples are [10]:
- Adding random words before HTML
- Using white text on a white background
- Disguising characters, as in "M*oney"
- Adding bogus HTML tags with a lot of text
- Adding spaces in words, as in "l o w I n t e r e s t R a t e"

As stronger filters developed to track these messages, spammers came up with newer techniques such as image spam and sending spam in PDF documents. With the use of Optical Character Recognition (OCR) filters it became possible to extract the contents of images and check whether an image had spam content. However, this process is expensive, and spammers came up with new ways to evade OCR filters, including:
- Rotating images or making them look wavy
- Adding noise to the images
- Slicing the image and rotating each component

Figures 2.1, 2.2 and 2.3 below show some of the above image spamming techniques.
Figure 2.1 Adding noise to the Image
Figure 2.2 Wavy images
Figure 2.3 Rotating Image and adding noise

2.4 Related Research

Image spam has not been studied as extensively as e-mail spam; however, some recent research on image spam involves detecting text in the spam message or identifying low-level features like header properties and histograms. Artificial neural networks (ANNs) have been used to identify image spam. The images were first normalized into grayscale values between 0 and 1. An ANN was then trained on these images using a supervised learning approach and tested on classification of new samples of spam images. A classification accuracy of about 70% was reported for unseen images [11]. Low-level features like image width, height, aspect ratio, file size, compression, and image area, all extracted from the image header, have been used along with features like the number of colors, variance, frequently occurring colors, the primary color in the image, and color saturation; color histograms were also computed. A set of binary features was used to indicate the file type (JPEG, BMP, or PNG), and SVM classifiers were used to classify images. An accuracy of over 95% was reported [12]. Aradhye et al. used their existing work to detect text embedded in digital photographs. After the text was extracted, they analyzed it, computed other features like color saturation and color heterogeneity, and used SVM classifiers to classify images. They obtained an accuracy of 85% [13]. Features similar to those in [12] were used in another study for classification using the C4.5 decision tree algorithm in Weka and a support vector machine. Their results indicated that the support vector machine performed better than C4.5, as it had a larger area under the ROC curve [14].
Another study used an agglomerative hierarchical clustering algorithm to cluster the spam images based on a similarity measure of color histograms and gradient orientation histograms. The training set was selected from the clustered groups. They built a probabilistic boosting tree based on the training set to distinguish spam images from ham images, and claimed an accuracy of about 89% [15]. Peizhou et al. extracted file properties as in [12] and [14] and then set a threshold for these properties. The test image properties were compared with the threshold values. If an image was flagged during this step as possible spam, it was then sent to histogram testing, where the histogram similarity between the test image and the threshold was compared. The advantage of this two-step classification was that the first step trapped many of the spam image files. They claimed an accuracy of about 68% for JPEG images and 23% for GIF images in step 1, and 84% for JPEG and 80% for GIF in step 2 [16].

CHAPTER III

SPAM IMAGES AND DATASET

Not only are spam images a way to evade spam filters, but they also have leverage over e-mail spam:
- Colorful: images have more colors than a regular e-mail, which makes them look attractive and professional
- Images come as attachments and may be named arbitrarily, so they are difficult to detect unless their contents are analyzed
- There are more ways to randomize the message: rotating, skewing, blurring by adding noise, and animation using GIF images

3.1 Types of Images

Text-only images - Some images contain only text.

Figure 3.1 Text only image spam

Randomization - In order to thwart signature-based anti-spam devices, spammers add random color stripes, random colored pixels, and shades of colors.

Figure 3.2 Adding random colored pixels
Figure 3.3 Adding color streaks

Wild backgrounds - It is difficult for OCR to detect text in these images.

Figure 3.4 Adding a wild background

Animated GIF and multipart images - The image is split into multiple parts, some containing the message and others containing some animation. The frames in the image rotate fast enough to display only the final result to the user.

Standard images - These are neat-looking images; none of the above tricks are employed, which gives them a genuine look. The entire message is contained in the image, and hence scanners cannot detect it. In fact, many of the images that come in as spam today have a professional look, making them look just like photographs. Figure 3.5 below gives two such examples.

Figure 3.5 Examples of Standard Images

3.2 Image Spam Dataset

Spam images come in various formats. We encountered .bmp, .gif, .jpeg, and .png formats while collecting images for this research. BMP stands for bitmap image file, comprising image data in a raster format. It is a device-independent format, and the files are not compressed, so sizes are larger. GIF (Graphics Interchange Format) is also a bitmap format, supporting up to 8 bits per pixel. It supports animations and allows 256 colors in each frame.
PNG (Portable Network Graphics) is a bitmap format that was developed to improve on the GIF format and employs lossless data compression [6]. We converted all the images to the JPEG (Joint Photographic Experts Group) format for this research. JPEG supports a 24-bit color map, i.e., 8 bits per color, and has small file sizes. It uses lossy compression, but it is possible to adjust the degree of compression. JPEG is the standard image format in many photography devices.

3.3 Corpus

A corpus refers to a collection; in our case, the collection of images that we have used in the research. All the spam images were collected from Knujon, a spam reporting service which reports illicit spam websites. We downloaded zipped files of images from spam mails collected from users all over the world by setting up a WinSCP server locally. The time period of the images ranges from March 2009 to September 2010. Ham images were first collected from personal photographs and then from Flickr [3] using the Flickr downloader [4]. The Flickr downloader software lets you download Flickr images that are under a Creative Commons license. To download images, we install the software and, in the query, provide either tags like "nature", "landscape", or "people" to specify which type of image to search for, or group names, if any. Some images were downloaded from Google Images [5], Wikipedia [6], and the National Geographic Channel [7] with different search tags like places, wildlife, and cities. We eliminated spam images with pornographic content and photographic spam, as these are not a part of this research. Photographic spam images are actually photographs (usually thumbnails or of smaller dimensions) which are sometimes used in spam images.
3.3.1 Statistics of Images in the Corpus

Table 3.1 Statistics of the images collected to form the corpus

Spam
Source                                      Total number of images    Unique images
Knujon                                      19762                     16186

Ham
Source                                      Total number of images
Personal                                    1022
Wikipedia and NGC pictures (wiki pages)     454
Flickr                                      3964
Total                                       5440

In order to ensure that we do not use duplicate images in our training set, we use a script from "Artificial neural networks as a tool for identifying spam" [11] to compute the checksum of each individual image. A checksum is a simple way to check data integrity during transmission. MD5 (Message-Digest algorithm 5) is a cryptographic hash function with a 128-bit hash value. It is very unlikely that two different files have the same MD5 checksum, and hence an MD5 hash acts like a digital fingerprint of a file. In the case of spam, many images look similar but have different checksums. For example, the two images in Figure 3.6 look identical but have different checksums.

(a) MD5 = 996484e6cc7340ee2067ee96074ce324
(b) MD5 = 400fea9508fb5d759d9d698ef293c937

Figure 3.6 Similar spam images but with different checksums

Creating a large enough dataset is a labor-intensive task, as it requires removing pornographic images and photo spam manually. Compared to the SpamArchive dataset [27], the images we downloaded came from different geographical locations and were more recent (2009-2011). The SpamArchive corpus has images dated up to 2007 and contains a significant number of duplicate images. The Princeton corpus [28] has about 1071 images. The study conducted by Dredze et al. [29] used 2359 ham images, as opposed to the 7040 ham images (5440 + 1600) used in this study, and 3421 spam images, as opposed to our 69,516 spam images (16,186 + 53,330).

3.4 Preprocessing

Preprocessing refers to cleaning up data or preparing it in an appropriate format for the experiment. We did the initial cleanup of images by removing images with adult content and photo spam.
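The checksum-based deduplication described above can be sketched in Java (the language used for feature extraction later in this chapter) with the standard MessageDigest API. The class and method names here are our own illustration, not the thesis scripts reproduced in Appendix B.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class Md5Dedup {

    // Compute the MD5 checksum of a byte array as a 32-character hex string.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is always available in the JDK
        }
    }

    // Keep only the first file seen for each checksum; return the unique paths.
    static List<Path> uniqueByChecksum(List<Path> files) {
        Map<String, Path> seen = new LinkedHashMap<>();
        for (Path p : files) {
            try {
                // Files with identical bytes share a checksum and are skipped.
                seen.putIfAbsent(md5Hex(Files.readAllBytes(p)), p);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return new ArrayList<>(seen.values());
    }
}
```

Note that, as Figure 3.6 shows, this only removes byte-identical duplicates; visually similar images with different bytes still hash to different checksums.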
Next, we converted all images to the .jpg extension using the ImageMagick utility [8]. ImageMagick is an open source utility that can convert images between various formats and also perform modifications like resizing and cropping.

The ImageMagick command that converts all the .bmp files in a folder to the .jpg format is:

    mogrify -format jpg *.bmp

Other formats can be converted similarly. Most of the ham images are very large, around 30 MB, as they are photographs taken with a camera. Some of the spam images are quite large too. In order to make the preprocessing faster, we resized all the images so that they did not exceed a size of 150 KB. The ImageMagick command that resizes all the files in a folder is:

    mogrify -resize 50% *.bmp

Resizing does not affect any of the features.

3.4.1 Feature Selection

When we look at an image, we can immediately identify whether it is colorful, i.e., has a lot of colors, or blurred so that we cannot see its contents clearly. In this research we examined such features to distinguish spam images from non-spam. The following features were explored.

a) Luminance of the image - Luminance refers to the brightness of an image. Some images are brighter than others.

Figure 3.7 Low log average luminance
Figure 3.8 High log average luminance

In the images above, we can see that the image in Figure 3.8 is brighter, i.e., has more luminance, than the image in Figure 3.7. Most spam images are not very bright, since they are not taken with a camera; the main intent of sending them is to evade the spam filter and reach as many recipients as possible. Hence, they have small file sizes and are not very clear or bright. We can compare the luminance of different images using the log-average luminance, which is calculated by finding the geometric mean of the luminance values of all pixels. In a grayscale image, the luminance value is the pixel value.
In a color image, the luminance value is found by a weighted sum [9]:

    Luminance = 0.27 R + 0.67 G + 0.06 B    (3.1)

We then took the average luminance over all the pixels in an image:

    Average luminance = (sum of per-pixel luminance) / (number of pixels)    (3.2)

b) Number of colors - We computed how many colors an image has. A JPEG image has a 24-bit color map, i.e., each pixel is 24 bits [2]. This means that an image can have 2^24 = 16,777,216 different colors. We divide this range into 1677 bins, with 10,000 consecutive colors falling into each bin. The number of colors then ranges from 0 to 1677, where 1677 is the maximum number of colors an image can have in our computations.

c) Color saturation - Color saturation can be described as the pureness of a color; for example, how red the color red is in an image. If a pixel has values (R, G, B) = (255, 0, 0), the pixel has a high saturation of red. As defined by Aradhye et al. [13] and Frankel et al. [17], color saturation is the ratio of the number of pixels in an image for which the difference max(R, G, B) - min(R, G, B) is greater than some threshold T to the total number of pixels. (T is set to 50 by Frankel et al. and in this work.) For every pixel in the image we calculate the maximum and minimum among the R, G, B values and take the difference, using a counter to track how many pixels have a difference greater than T = 50. Finally, we divide the counter by the number of pixels in the image to obtain the saturation value.

d) White pixel concentration - Spam images generally have a solid background which is mostly pastel or white in color. For example, Figure 3.9 (a ham image) has subtle shades of different colors but no solid background, unlike Figure 3.10 (a spam image), which has a mostly white background.
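As an illustration of how the per-pixel features above can be computed from packed 24-bit RGB values (equations 3.1 and 3.2 for luminance, and the thresholded max-min difference for saturation), here is a minimal Java sketch. The class and method names are our own, not the thesis code.

```java
public class PixelFeatures {

    // Equation 3.1: luminance as a weighted sum of the 8-bit color components.
    static double luminance(int r, int g, int b) {
        return 0.27 * r + 0.67 * g + 0.06 * b;
    }

    // Equation 3.2: average luminance over an array of packed 0xRRGGBB pixels.
    static double averageLuminance(int[] pixels) {
        double sum = 0;
        for (int p : pixels) {
            int r = (p >> 16) & 0xFF, g = (p >> 8) & 0xFF, b = p & 0xFF;
            sum += luminance(r, g, b);
        }
        return sum / pixels.length;
    }

    // Saturation: fraction of pixels with max(R,G,B) - min(R,G,B) > threshold.
    static double saturation(int[] pixels, int threshold) {
        int count = 0;
        for (int p : pixels) {
            int r = (p >> 16) & 0xFF, g = (p >> 8) & 0xFF, b = p & 0xFF;
            int max = Math.max(r, Math.max(g, b));
            int min = Math.min(r, Math.min(g, b));
            if (max - min > threshold) count++;
        }
        return (double) count / pixels.length;
    }
}
```

For example, a pure red pixel (255, 0, 0) counts toward the saturation ratio at T = 50, while a gray pixel (128, 128, 128) does not.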
Figure 3.9 Ham image background
Figure 3.10 Spam image background

We calculate how many pixels in a given image have all of their r, g, b component values above 250, as any value in this range is white or a pale pastel shade. Then we take the ratio of the number of white pixels to the total number of pixels in the image. If an image has more white pixels than any other color, then it is probably the background color of the image. Most photographs do not use such a background.

e) Standard deviation of colors- Each image has three color components, namely red, green and blue. Pixels have different combinations of these r, g, and b values and hence display different colors. We computed the standard deviation of each component: red, green and blue. The value tells us how much variation there is in each color component.

f) Hue- Hue can be described as the dominant wavelength in a color model which describes a given color. For example, if we are looking at an apple, the hue value is red, as that is what the color of the apple looks like to us. Java's Color class provides a method to determine the hue value of a given pixel. We compute the hue values for each pixel and then take the mean of these values to represent the hue of the image.

3.5 Feature Extraction Process
The image features were computed using Java programs. In order to compute any of the features for an image we need to know the color components (r, g, b) of each pixel. We have used Java's ImageIO package to retrieve the pixel values of an image. Then, we extract the (r, g, b) values from each of these for further processing.

a) Luminance is calculated for each pixel in the image, and we take the average of all the pixels' luminance values.

b) In order to count the number of colors we read each pixel value and decide which color bin to put it in. Hence, all the pixels are assigned to one of the 1677 color bins.
c) Color saturation is computed for each pixel, and the average value is computed for the entire image.

d) We read each pixel and check if it falls within the white pixel threshold we set. Then we compute what percentage of the total pixels is white/pastel in color.

e) The standard deviation of each color component r, g, b is computed.

f) Hue is computed for each pixel and then the average value is taken.

CHAPTER IV
DATA MINING TECHNIQUES

4.1 Data Mining Overview
Data mining is the process of detecting patterns in data. When the data is large, data mining techniques make it easier to analyze. The process is either completely automated or semi-automated [18]. Data mining problems can be primarily categorized into classification, clustering and association. Classification sorts a given set of data into different categories using a classification model generated during a learning process. The data is represented by conditional attributes and a decision label. The conditional attributes describe features of the data, and their values determine what the decision label is. For example, consider the data about fruits in Table 4.1. In this case color, size and taste are the conditional attributes, and the decision values are apple, cherry, lemon, melon, grape, etc. This is also known as supervised learning because the decision values are known to us.

Table 4.1 Example data for classification

Color   Size (diameter)   Taste   Decision
Red     4 inches          Sweet   Apple
Red     1 inch            Sweet   Cherry
Green   2 inches          Sour    Lemon
Green   15 inches         Sweet   Melon
Black   0.5 inch          Sweet   Grape

The classification algorithm first builds a model which tells it what rules to follow when classifying a new instance of data. So if we gave it a new instance with attributes Color - Red, Size - 4 inches and Taste - Sweet, the classification algorithm knows that it is an Apple, because it learned this when we trained it using the data in Table 4.1. Clustering is a technique of grouping data.
We do not know the decision label in this case, so it is called unsupervised learning. Consider the example in Table 4.2; this time we only have attribute values and do not know the decision.

Table 4.2 Example data for clustering

Color    Size (diameter)   Taste
Red      4 inches          Sweet
Red      3.5 inches        Sweet
Green    2 inches          Sour
Green    15 inches         Sweet
Black    0.5 inch          Sweet
Red      4 inches          Sweet
Green    2 inches          Sour
Red      3.5 inches        Sweet
Green    15 inches         Sweet
Orange   3 inches          Sweet

If we had to group the fruits in Table 4.2, we would try to group fruits which have similar properties of color, size, and taste, also called a similarity measure. We could group all red fruits which are sweet as group 1, all green fruits which are sweet as group 2, all green fruits which are sour as group 3 and all orange fruits which are sweet as group 4. There are many arrangements possible based on which features we take into account.

Association rule mining is used to find frequently occurring patterns among a set of items or objects. There is not much difference between association rule mining and classification rule mining, except that association rules can predict any attribute, not just the decision or class attribute. Unlike classification rules, association rules are not used together as a set [18]. Based on the data at hand and what we wish to find from the analysis, we can choose which of the above three techniques is beneficial and feasible. In this research we have used the classification technique to classify images as spam or ham based on their features.

4.2 Classification
Classification predicts categorical class labels. It constructs a model based on a training set and the values of the class attribute, and uses the model to classify new instances. Each record in this dataset is assumed to belong to a predetermined class. The training set is used for constructing the classifier model. The model can be represented as classification rules, decision trees, or mathematical formulae.
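For instance, a rule model learned from the fruit data in Table 4.1 might be expressed as the following sketch. The rules are hand-written here for illustration only, not actual output from a learning algorithm.

```java
// Illustrative classification rules for the fruit example in Table 4.1
// (hand-written for illustration; not generated by Weka).
public class FruitRules {
    static String classify(String color, double diameterInches, String taste) {
        if (color.equals("Red") && diameterInches >= 3) return "Apple";
        if (color.equals("Red")) return "Cherry";
        if (color.equals("Green") && taste.equals("Sour")) return "Lemon";
        if (color.equals("Green")) return "Melon";
        return "Grape"; // fallback for the remaining training class
    }
}
```

Applying these rules to the new instance (Red, 4 inches, Sweet) yields "Apple", matching the behavior described above.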
The model is then used for classifying new or unknown instances (also called a testing set). To estimate the accuracy of the model, the known label of each test sample is compared with the classified result from the model. Accuracy is the percentage of test set samples that are correctly classified by the model [18].

4.3 Decision Trees
Decision trees take a divide-and-conquer approach to learning from a set of independent instances. A decision tree is a flow-chart-like tree structure where each internal node denotes a test on an attribute [18]. A branch represents a result of the test. The leaf nodes represent class or decision labels. In the beginning, all the training examples are at the root of the tree. The examples are then partitioned recursively based on selected attributes. Some of the branches are pruned to remove branches that reflect noise or outliers. Figure 4.1 shows an example of a decision tree.

Figure 4.1 Example of decision tree using J48 classifier for weather data

Decision trees are constructed in a top-down, recursive, divide-and-conquer manner. The basic algorithm is greedy, and all the training examples are at the root in the beginning. Instances are partitioned recursively based on selected attributes, which are chosen on the basis of a heuristic or statistical measure like information gain. New instances are classified based on the model or tree generated. Partitioning stops when all samples for a given node belong to the same class, when there are no remaining attributes for further partitioning, or when no samples are left.

4.3.1 J48
Of the many algorithms available for decision trees, J48 is one of the most popular. It is an open source Java implementation of the C4.5 algorithm in the Weka data mining tool. C4.5 is an extension of ID3 developed by Quinlan [19]. The general approach of the algorithm is to choose the attribute of the data that most effectively splits the sample into subsets enriched in one class or the other.
The criterion used to choose an attribute for splitting is information gain, which is the difference in the entropy values resulting from choosing an attribute to split the data. The attribute chosen has the highest normalized information gain among all the attributes. The procedure is then repeated on the smaller sublists.

4.3.2 RepTree
RepTree builds a decision tree using information gain/variance reduction and prunes it using reduced-error pruning. Optimized for speed, it only sorts values for numeric attributes once. It deals with missing values by splitting instances into pieces, as C4.5 does [18]. Some other decision tree algorithms in Weka include ID3, RandomTree, RandomForest, etc.

CHAPTER V
EXPERIMENTS AND RESULTS

5.1 Weka Data Mining Tool
In order to implement the classification we used a data mining tool called Weka, which is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License [20]. We have used Weka version 3.6 for our experiments. In order to load data into Weka it should be converted into Weka's native "arff" format. ARFF stands for Attribute-Relation File Format. It is an ASCII text file that describes a list of instances sharing a set of attributes [21]. Weka is implemented in the Java programming language. Hence, it is possible to import Weka's jar file and use it in our Java programs. The tool has both GUI and command line support.
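An ARFF file for the features used in this work might look like the following sketch. The attribute names and the data values here are our illustration; the exact names used in the thesis are not given.

```
@relation image_spam

@attribute luminance numeric
@attribute num_colors numeric
@attribute saturation numeric
@attribute white_pixel_conc numeric
@attribute stddev_red numeric
@attribute stddev_green numeric
@attribute stddev_blue numeric
@attribute hue numeric
@attribute class {spam, ham}

@data
212.4, 35, 0.12, 0.61, 22.1, 20.8, 19.5, 0.14, spam
94.7, 410, 0.55, 0.02, 61.3, 58.9, 64.2, 0.38, ham
```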
We have used Weka's class library to write a program that generates the REPTree classifier, compares the classifier-generated class label with the actual class label of each test sample, and computes the classification accuracy.

5.2 Data Set Preparation
The data is the collection of images we have used for the purpose of this research. After preprocessing each image and extracting its features, we write the features to a text file. Each image is then represented as a vector of the feature values associated with that image.

5.3 Methodology
5.3.1 Experiments with J48 classifier
In the first attempt we tried to use J48 to classify the data. We had 16186 spam and 5440 ham images, out of which we randomly selected 90% for training and 10% for testing. We made 10 sets in this fashion and then used these independent testing sets to check the efficiency of the classification model. Since most of the values in the feature set are floating point, the decision tree is very wide and has many leaf nodes. Pruning the tree and discretization did not help much; despite them the tree was still large, since each value had two decimal places and all of the features had a wide range of values. So, although the accuracy was about 98%, the tree was too large to be of use in analyzing the results.

5.3.2 Experiments with RepTree Classifier
We decided to see if any other classifier could be used which had an efficiency comparable to the J48 results and generated a more comprehensible tree. RepTree is another classifier in Weka; it is a fast decision tree learner. It builds a decision/regression tree using information gain/variance and prunes it using reduced-error pruning (with backfitting). It only sorts values for numeric attributes once. Missing values are dealt with by splitting the corresponding instances into pieces, as in C4.5 [20]. The reason for choosing RepTree is that, like J48, it uses information gain to choose attributes.
Also, it lets you limit the depth of the tree, and sorting numeric attributes only once reduces the number of branches of the tree. The accuracies were almost equal to those of J48, with the advantage of a smaller tree.

5.3.3 Depth of the RepTree
As mentioned earlier, RepTree lets us limit the depth of the tree. However, the question is what value to set the depth parameter to. What would indicate an optimum value for the depth of the tree? To solve this problem we used the concept of hill climbing. We started with a depth of 1 and then kept increasing the depth in intervals of 1. We observed the accuracy at each interval change; the point at which the accuracy stopped increasing and either became constant or started declining is the optimum value for the depth of the tree. Table 5.1 below lists the depth value for each of the different datasets.

Table 5.1 Depth value of RepTree

Ratio of Spam to Ham   No of Spam   No of Ham   Depth
1:1                    5440         5440        5
7:3                    16186        5440        7
1:9                    604          5440        4
9:1                    9000         1000        6

5.3.4 Dataset Proportions
We had 16187 unique spam images and 5440 ham images in the dataset. In practice there are no statistics on how many spam images are encountered for each ham image, because ham images cannot be harvested. The problem here was in what ratio to choose ham images and spam images. To start with, we had 16187 spam images and 5440 ham images, which is approximately a ratio of 3:1 of spam to ham. We used this as an initial set for our experiments. We ran the experiments with RepTree and computed the classification accuracy. We then chose a 1:1 ratio, i.e. one spam image for each ham image, and computed the accuracy. We then decided to check two more boundary conditions, i.e. 1:9 and 9:1 ratios of spam to ham. We then chose whichever combination gave the best accuracy. Table 5.2 lists the accuracies of each of these combinations and the number of images in each set.
Table 5.2 Accuracy of classification for different ratios of ham and spam images

Ratio of Spam to Ham   No of Spam   No of Ham   Accuracy
1:1                    5440         5440        97.94
7:3                    16186        5440        98.28
1:9                    604          5440        99.23
9:1                    9000         1000        98.05

It was necessary that we try these combinations to choose what proportion of ham and spam images should form the training set. From Table 5.2 we can see that we get fairly similar accuracies for every ratio. We did not choose 1:9 or 9:1, as these are extreme cases where we have only 10% of ham or spam. We chose 7:3 as it had better efficiency than the 1:1 ratio, and it would cover more spam images and train the classifier better.

5.3.5 Training and testing data selection
After choosing the ratio, we used 16186 spam images and 5440 ham images to form a training set. For our experiment we then divided this set into ten different sets of training and testing data with a 90% (training) - 10% (testing) split. We then used the RepTree classifier in Weka and supplied independent test files as testing input. We repeated the experiment 10 times, once for each set of training and testing data. We obtained an average classification accuracy of 98.28%.

5.3.6 Testing on unseen data
After classification using RepTree, we wanted to test how this classification would work on new files, something not seen by the algorithm before. We used the 53300 unique images from the year 2010 to test the classifier we generated using the training data. We also downloaded images for January 2011 to test how the classifier would work on recent spam images. Table 5.3 shows the count of images after cleanup.

Table 5.3 Count of spam images in 2010

Month           No. of Images   No. of JPEG   No. of GIF
Jan-10          2341            1945          396
Feb-10          2333            1689          644
Mar-10          12991           10626         2365
Apr-10          11471           9077          2394
May-10          8502            5801          2701
Jun-10          15734           14292         1442
Jul-10          5263            4620          643
Aug-10          33698           32631         1067
Sep-10          8777            8054          723
Oct-10          1713            1223          490
Nov-10          2303            2248          55
Dec-10          795             670           125
Jan-11          717             259           458
Total           106638          93135         13503
Unique Images                   53300

In order to test the classifier we used the Weka libraries, which are in Java. The program generates a classifier using the training set and the RepTree classifier of Weka. It then compares the decision assigned to each test sample by the classifier to the actual label and computes the accuracy of the classifier. The test files were generated for each month with the features of that month's spam images and an equal number of ham feature vectors which were not used in the training set. We had 1688 ham images which were not part of the training set, so we had a maximum of 1688 ham images and 1688 spam images in a test file. Some months had fewer than 1688 spam images; in such cases we randomly picked an equal number of ham images. Similarly, if the number of spam samples was more than 1688, we chose 1688 images randomly from them. We then used the decision tree to classify these test files. Table 5.4 lists the accuracies for each of the months. The average efficiency of the classifier for unseen samples is 89%.

Table 5.4 Accuracy of classification for unseen samples

Month    Accuracy   Recall   Precision   F-Measure   FPR      FNR
10-Jan   0.9252     0.9055   0.9427      0.9237      0.0551   0.0945
10-Feb   0.9057     0.8618   0.9448      0.9014      0.0503   0.1382
10-Mar   0.9390     0.9313   0.9458      0.9385      0.0533   0.0687
10-Apr   0.8990     0.8513   0.9411      0.8939      0.0533   0.1487
10-May   0.8320     0.7174   0.9308      0.8103      0.0533   0.2826
10-Jun   0.7933     0.6439   0.9184      0.7570      0.0572   0.3561
10-Jul   0.8451     0.7469   0.9295      0.8282      0.0567   0.2531
10-Aug   0.9648     0.9828   0.9485      0.9654      0.0533   0.0172
10-Sep   0.9428     0.9390   0.9463      0.9426      0.0533   0.0610
10-Oct   0.8182     0.6913   0.9263      0.7918      0.0550   0.3087
10-Nov   0.9476     0.9359   0.9582      0.9469      0.0408   0.0641
10-Dec   0.8241     0.6912   0.9415      0.7972      0.0429   0.3088
11-Jan   0.8210     0.6960   0.9281      0.7955      0.0540   0.3040

Table 5.5 lists the computation time for all the features for spam and ham images.
Table 5.5 Computing time for extracting features

Features                      Spam (ms)   Ham (ms)   Ham + Spam (ms)   Avg. (ham + spam)
Luminance                     938163      509577     1447740           66.9444
Saturation                    1146220     496305     1642525           75.9514
Hue                           1416485     573444     1989929           92.0156
Number of colors              1555486     615987     2171473           100.4103
White Pixel Concentration     1022250     461654     1483904           68.6167
Standard Deviation            1283551     483472     1767023           81.7083
Total time for all features   7362155     3140439    10502594          485.6467

The RepTree generated has 63 leaf nodes. Figure 5.1 shows part of the tree generated by the classifier. The important features from the tree are average luminance, number of colors and white pixel concentration. The same tree is used to test each of the 13 test files mentioned in Table 5.4, ensuring that we test the samples against the same classifier model each time.

Figure 5.1 Part of Decision tree generated by the classifier

CHAPTER VI
VALIDATION BY FEATURE ANALYSIS

We had 52744 unique images from the year 2010 and about 500 images from January 2011. After feature extraction we looked for patterns in the values of the features for image spam. We had eight different features, and each of these features has a value that lies in a range. For example, average luminance can be between 0-255, average hue is between 0-1, number of colors is between 0-1677 and so on. We divided these values into ranges of equal intervals and counted how many images had feature values in each range. The graphs in Figure 6.1 below show the distribution of spam and ham images in these ranges.
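The equal-interval binning described above can be sketched as follows. This is a minimal sketch; it assumes every feature value lies in the closed range [min, max].

```java
// Sketch: count how many images fall into each equal-width bin
// of a feature's value range, as used for the distribution plots.
public class FeatureHistogram {
    static int[] histogram(double[] values, double min, double max, int numBins) {
        int[] counts = new int[numBins];
        double width = (max - min) / numBins;
        for (double v : values) {
            int bin = (int) ((v - min) / width);
            if (bin >= numBins) bin = numBins - 1; // clamp v == max into last bin
            counts[bin]++;
        }
        return counts;
    }
}
```

For example, binning average-luminance values over [0, 255] with 10-unit-wide intervals produces the per-range counts plotted for spam and ham separately.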
[Figure 6.1 comprises paired histograms, for image spam and ham images respectively, of the following feature distributions: average luminance, average color saturation, average hue, number of colors, percentage of pastel pixels, and the standard deviations of the red, green and blue components. Each plot shows number of images against the feature value range.]

Figure 6.1 Plots for range of feature values for Image Spam and Ham images

The above plots show that most of the feature values for image spam lie in specific ranges. For example, most of the spam images have an average luminance value in the range of 200-220; for ham images these values are spread over a wider range, and none of them fall in the range of 200-220. Hence this is the easiest way to determine whether an image could be spam. The decision tree also chooses luminance as the root of the tree, with a cutoff value of 192.36 for average luminance. Next, if we look at the average color saturation values, we see that ham images have them spread over a wider range, mostly between 0-0.8, while spam images have saturation values mostly in the range of 0-0.2. Similarly, the average hue for ham images is spread over more ranges than for spam images. The number of colors and the white/pastel pixel concentration are not very helpful measures for distinguishing spam from ham, as spam and ham images have values in similar ranges for both. The standard deviation of the color components is again more spread out for ham images than for spam images. These are expected values, as photographs have more varied shades of color than a spam image. Pictures are taken at different times of the day, so the luminance values are spread across different ranges.
Also, the luminance value of ham images is lower than that of spam images, because spam images are generally brighter so that they attract the attention of the user. The text may be skewed and there may be random noise in the image, but the image is bright in appearance so that a human user can read it easily but an OCR cannot. Since the values of the features lie in clearly distinct ranges, it is easy for the classifier to separate spam from ham images. This explains the accuracy of 89% on unseen samples of spam and ham images. The graphs also hint that spam images may rotate the content, add random noise or random pixels, or skew the text to obfuscate the filters, but there are some features that do not change in spite of these tricks. For example, according to the graphs the luminance value is similar across many different spam images. Spammers cannot vary these properties often, and this is probably the reason why we get a fairly high accuracy of 89%.

CHAPTER VII
TRENDS IN IMAGE SPAM

7.1 Count of Image Spam
Most research work is dedicated to the classification or identification of spam images, and very little literature exists on how image spam has evolved and changed in the past few years. We wanted to see what the trend in spam images is and track the growth and decline of image spam. In addition to the 2010 images, we downloaded images for the years 2008 and 2009 and the first two months of 2011. We did not have an image spam collection for the entire year of 2008, so we took the counts from August to December. Table 7.1 below lists the number of images in each month of each year.

Table 7.1 Image spam count in 2008-2011

Month   2008   2009    2010    2011
Jan     -      118     3171    717
Feb     -      764     3451    781
Mar     -      2268    16403   -
Apr     -      1008    18462   -
May     -      1277    7337    -
Jun     -      10863   18141   -
Jul     -      7840    6725    -
Aug     3660   12883   36003   -
Sep     5527   13329   9105    -
Oct     8021   8040    2233    -
Nov     3525   5883    2601    -
Dec     601    4119    943     -

If we observe Table 7.1, we can see that the spammers follow a pattern in sending image spam.
The count increases slowly over the months, with a sudden drop in one of them. Then, somewhere in the middle of the year, around June to September, there is a sudden explosion in the number of images. Since most spam is sent using bots, the reason for a sudden drop could be the shutdown of some of the spam bots. Image spam was quite low from November 2008, after the closure of the rogue hosting provider McColo in November 2008 [22]. There was then an increase from June 2009, and the count kept rising and falling until September 2010. During September 2010, spamit.com, an affiliate program used by several spamming botnets, was closed down. Hence the spam count decreased, and in early 2011 we see fewer spam images. This is also shown in our spam counts in Table 7.1, and the statistics are in accordance with the statistics of spam images across the world, even though we are using only a subset of image spam. A number of other botnets like Pushdo/Cutwail, Mega-D, etc. have been identified and closed down, but there are always more to fill the void [22].

7.2 Trend of the Month
We also manually looked at these images to identify the prevalent trends in each month of the year. This analysis provides insight into what strategies spammers use in sending spam. For example, at New Year we see a lot of spam related to gifts, candies, fitness equipment, etc. Gifts and candies are popular presents during this time, and the most common New Year resolution is to get fit! A similar trend is observed during Christmas, Valentine's Day, etc. Figure 7.1 shows examples of images that appeared on February 14, 2011, Valentine's Day. It is also interesting to note how spammers capture what people might be really curious to see, for example weight loss, beauty products and exercise equipment.

Figure 7.1 Images that appeared on February 14th - Valentine's Day

We looked at image spam from each month and then noted the most prominent trends for the month.
Then we added up the frequency of appearance of each trend across the year to check which of the image types are observed most frequently in image spam. Figure 7.2 below shows the graph obtained for each image type and its frequency of occurrence in the years 2010 and 2011. We can see from the graph that the most frequently appearing types of spam are pharmaceutical, pornography, hardware devices, software products for sale, photo spam, clothes, and UPS or other delivery service message alerts. In fact, pharmaceutical image spam appears in every single month from January 2010 to February 2011.

[Figure 7.2 is a bar chart of image spam type versus frequency (0-15); the types include wines, weight loss, Valentine's Day, text spam, sinks, scenic pictures, religious, protest mails for Europe, political, pharmaceutical, monitors, images that do not open, kid's shoes, insurance, holiday gifts, gifts, foreign language images, flowers, credit/debt help, chocolates, cats, cameras, architectural designs, and advertising for schools.]

Figure 7.2 Image spam type for the time period of January 2010 - Feb 2011

7.3 New trends in image spam
7.3.1 Scraped Images
One of the newer techniques for creating image spam is scraping the image such that it cannot be read by a file reader but can be viewed in a picture editor. This makes it possible to convey the message to the user while preventing automated processing of the image. Scraping is one way to render an image unreadable; other ways of tampering include improper header information, incorrect color maps, etc. File readers which demand that these be in the correct format cannot parse the image data. Figure 7.3 shows an example of one such image. We have not dealt with these images in this research, and hence they were removed during preprocessing.

Figure 7.3 Example of a scraped spam image

7.3.2 Malware embedding in images
Another observation in the spam image trend is the insertion of malware into JPEG files.
We found that images from May, September and November of 2010 had malware embedded in them. This was not seen in the earlier images from 2008-2009. Most of the malware was either a Trojan or a virus. There is little literature available on how these images are used for malware embedding and how they attack the victim. In general, when a non-executable file containing an executable is double clicked, the non-executable file is opened. For example, if a JPG file has malware embedded in it, then double clicking opens the JPG file, not the executable. In order to execute the embedded executable, a loader is needed; the loader is part of the malware already present on the infected machine, and it extracts the executable file and runs it. The executable is embedded inside a file to evade some basic security filters. This is one of the ways malware can download malicious content from the web. So, if a JPG with an embedded executable file exists, probably another infection running on the system actually downloaded that JPG file. This is not the only way for an embedded executable file to be executed; the JPG could also exploit some other flaw in the system and run the executable. Generally the exploit files do not contain embedded files. So, if we see a JPG with an embedded executable, it is unlikely to be an exploit attack; instead, there is a possibility that another infection is running on the system [23]. When we downloaded the images we did click some of them to actually look at the content, but opening the files with executables in them did not infect the system. This could be because the loader to extract the executable was not installed on the machine. The images from Knujon are stripped from e-mails, so the loader may not have been a part of them.
Recently, Microsoft's Malware Protection Center discovered a variation of a malicious image which looks like a simple .png file [24]. The image contains instructions which ask the user to open it in MS Paint and then resave it as an .hta file of image type Bitmap. The lower part of the image looks like random noise, but when the file is resaved according to the directions, this noisy part decompresses into a JavaScript payload which executes when the .hta file is opened. The .hta extension denotes an HTML application. Hence the system will ignore the leading BMP information and execute the following HTML/JavaScript information, which then runs the payload. Figure 7.4 below shows a snapshot of the image and the subsequent .hta format file [25]. Figure 7.5 shows the binary data of the image before and after resaving as an .hta file [25].

Figure 7.4 Malware Embedded in a .png Image
Figure 7.5 Binary form images before and after saving as .hta file

This is a complicated technique, and it requires the user to perform a lot of actions in order to succeed. Hence it is unlikely to be used in a widespread attack, although it exposes the possibilities for hiding malware and data in image formats. It is also possible that cyber criminals are using these images as a medium to share malware and spread it easily.

CHAPTER VIII
CONCLUSIONS AND FUTURE WORK

The experiment has provided us with insightful observations about how spam images have evolved in a year. Many spam images are almost photo quality and have multiple colors. This makes the classification process trickier, as it gets harder to distinguish these images from photographs. Newer techniques are used in generating image spam, like scraping off header information, images that do not load when viewed as thumbnails but will open with a picture editor, malware injection, etc.
We also observed that images follow trends in a particular month; for example, a dominant trend in a month could be advertisements for wines, exercise equipment, pharmaceuticals, or chocolates, or it could be chain letters, advertisements for dating websites, links on making fast money, and many more. The advantages of the described classification approach are that we extract features which are easily computable and we achieve good accuracy for unseen samples from a recent time period. Irrespective of the file size we can compute the features easily, as resizing or converting formats does not affect the feature values. Unlike the methods described in [3] and [5], we compute high-level features which can be computed only after reading the file contents; but spam images today have quality comparable to photo images, hence low-level features will not always be an efficient measure. Even though we compute features from image contents, the process is not time consuming, as we can resize images and reduce the complexity. We have used simple tools like Java programs and ImageMagick commands to carry out the experiments. Finally, an accuracy of 98% on trained samples and 88% on untrained samples is encouraging. In Table 5.3 we can see that for the November 2010 samples we have an accuracy of almost 94%, even though these images were never used to train the classifier. This indicates that spam images recur with modifications like changes in a few pixels, rotation, or noise, but properties like luminance, saturation, and hue are not modified frequently. Also, in each month we get an accuracy of at least 79%. Some months had very few spam samples, which could be a reason we see a slight dip in the accuracy. However, if the classifier is updated frequently with new incoming spam images, it might provide better classification accuracy over a period of time, as spam images recur with slight modifications each time.
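To illustrate the kind of color features discussed above, the sketch below computes mean hue, lightness, and saturation over a list of pixels using Python's standard colorsys module. In the thesis these values were computed with ImageMagick; this toy version works on a hypothetical in-memory pixel list (in practice the RGB triples would come from a decoded image file).

```python
# Toy mean hue / lightness / saturation over an image's pixels.
# The thesis used ImageMagick for this; here the pixel list is a
# hypothetical stand-in for a decoded image.
import colorsys

def mean_hls(pixels):
    """pixels: iterable of (r, g, b) tuples with 0-255 channel values."""
    n = 0
    h_sum = l_sum = s_sum = 0.0
    for r, g, b in pixels:
        h, l, s = colorsys.rgb_to_hls(r / 255.0, g / 255.0, b / 255.0)
        h_sum += h
        l_sum += l
        s_sum += s
        n += 1
    return h_sum / n, l_sum / n, s_sum / n

if __name__ == "__main__":
    # A flat red patch: hue 0.0, lightness 0.5, saturation 1.0.
    red_patch = [(255, 0, 0)] * 100
    print(mean_hls(red_patch))  # -> (0.0, 0.5, 1.0)
```

One design caveat: hue is a circular quantity (pure red sits near both 0 and 1), so a production feature extractor might use circular statistics rather than a plain arithmetic mean; the simple average is kept here for clarity.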
A challenge in the experiment was that ham images are not so easily available. We can download images using the Flickr API; however, processing them is time consuming and hence our corpus is limited. In our experiment we selected eight features of significance. As future work, a more detailed exploration in the area of image processing might yield more features, such as the number of objects in an image, the presence of text detected using image segmentation, or the presence of random noise detected through adjacent pixel exploration. We have not dealt with images which are scraped or have malware embedded in them. This is an interesting area of future work and might provide details such as the presence of steganography in an image. This can also be used as a feature in the classification, as normal photographs do not have such content embedded in them.

REFERENCES

[1] Brad Templeton, http://www.templetons.com/brad/spamterm.html, last accessed date 03/24/2010
[2] Types of Bitmaps, http://msdn.microsoft.com/en-us/library/ms536393%28VS.85%29.aspx, last accessed date 03/24/2010
[3] Flickr, http://www.flickr.com/, last accessed date 03/24/2010
[4] Flickr Downloader tool, http://download.cnet.com/Flickr-Downloader/3000-12512_4-10790953.html, last accessed date 03/24/2010
[5] Google Images, http://www.google.com/imghp?hl=en&tab=wi, last accessed date 03/24/2010
[6] Wikipedia, Image formats, http://en.wikipedia.org/wiki/Image_file_formats, last accessed date 03/24/2010
[7] National Geographic, http://photography.nationalgeographic.com/photography/?source=NavPhoHome, last accessed date 03/24/2010
[8] ImageMagick, http://www.imagemagick.org/script/index.php, last accessed date 03/24/2010
[9] Luminance of Images, http://www.cacs.louisiana.edu/~cice/lal/index.html, last accessed date 03/24/2010
[10] Spammer's Compendium, http://www.jgc.org/tsc.html, last accessed date 03/24/2010
[11] Hope, P., Bowling, J. R., and Liszka, K.
J., "Artificial Neural Networks as a Tool for Identifying Image Spam," The 2009 International Conference on Security and Management (SAM'09), July 2009, pp. 447-451.
[12] Chao Wang, Fengli Zhang, Fagen Li, Qiao Liu, "Image spam classification based on low-level image features," 2010 International Conference on Communications, Circuits and Systems (ICCCAS), pp. 290-293, 28-30 July 2010.
[13] Aradhye, H. B., Myers, G. K., Herson, J. A., "Image analysis for efficient categorization of image-based spam e-mail," Proceedings of the Eighth International Conference on Document Analysis and Recognition (ICDAR 2005), pp. 914-918, Vol. 2, 29 Aug.-1 Sept. 2005.
[14] Krasser, S., Yuchun Tang, Gould, J., Alperovitch, D., Judge, P., "Identifying Image Spam based on Header and File Properties using C4.5 Decision Trees and Support Vector Machine Learning," IEEE SMC Information Assurance and Security Workshop (IAW '07), pp. 255-261, 20-22 June 2007.
[15] Yan Gao, Ming Yang, Xiaonan Zhao, Bryan Pardo, Ying Wu, Thrasyvoulos N. Pappas, Alok Choudhary, "Image Spam Hunter," EECS Dept., Northwestern Univ., Evanston, IL, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2008).
[16] Peizhou He, Xiangming Wen, Wei Zheng, "A Simple Method for Filtering Image Spam," 2009 Eighth IEEE/ACIS International Conference on Computer and Information Science.
[17] C. Frankel, M. Swain, and V. Athitsos, "Webseer: An Image Search Engine for the World Wide Web," Univ. of Chicago Technical Report TR96-14, 1996.
[18] Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, ISBN-13: 978-0-12-088407-0.
[19] Quinlan, J. R., C4.5: Programs for Machine Learning,
Morgan Kaufmann Publishers, 1993.
[20] Weka, http://www.cs.waikato.ac.nz/ml/weka/, last accessed date 03/24/2010
[21] Attribute-Relation File Format, http://www.cs.waikato.ac.nz/~ml/weka/arff.html, last accessed date 03/24/2010
[22] Help Net Security, http://www.net-security.org/secworld.php?id=10594, last accessed date 03/24/2010
[23] Wilders Security Forums, http://www.wilderssecurity.com/showthread.php?t=251875&highlight=malware+images, last accessed date 03/24/2010
[24] Sunnetbeskerming, http://www.beskerming.com/commentary/2010/08/12/527/Malware_in_Images,_a_Social_Engineering_Example, last accessed date 03/24/2010
[25] Painting by Numbers, Microsoft Malware Protection Center, Threat Response and Research blog, http://blogs.technet.com/b/mmpc/archive/2010/08/09/painting-by-numbers.aspx, last accessed date 03/24/2010
[26] E-mail address harvesting, http://en.wikipedia.org/wiki/E-mail_address_harvesting, last accessed date 06/06/2011
[27] Image Spam Dataset, http://www.cs.jhu.edu/~mdredze/datasets/image_spam/, last accessed date 06/06/2011
[28] Princeton Spam Image Benchmark, http://www.cs.princeton.edu/cass/spam/, last accessed date 06/06/2011
[29] Mark Dredze, Reuven Gevaryahu, Ari Elias-Bachrach, "Learning Fast Classifiers for Image Spam," Fourth Conference on Email and Anti-Spam (CEAS), 2007.

APPENDICES

APPENDIX A

DATA ANALYSIS

In order to understand the patterns and trends in image spam over the last year, we performed many analyses to come up with a list of the most frequent trends. This section describes the process of trend analysis in detail. After cleaning out pornographic images and photo spam, we counted the number of images in each month and calculated the feature values for all the spam images of 2010. We had 52,744 unique images and feature values for each of them. Initially, in order to observe patterns in the feature values, we took the average of each feature value over the whole set of 52,744 images and plotted it.
Figure A.1 below shows the graphs generated for the average value of each feature for the whole year.

Figure A.1 Average of feature values for all image spam in the year 2010

APPENDIX B

GENERATING MD5SUM AND SELECTING UNIQUE FILES

We used unique images to train the classifier. In order to generate an md5sum for each image, we used a small script which generates the md5sum of a directory recursively. We then used a Java program to list the duplicate files; duplicates were removed based on the time stamp, and we preserved the files which had the older time stamp. Finally, another script deleted the files listed by the Java program. This section shows the two scripts used.

#!/bin/sh
folders=`ls $1`
for all in $folders
do
    files=`ls $1/$all`
    for f in $files
    do
        md5sum $1/$all/$f >> totalMd5.txt
    done
done

Figure B.1 Script to generate md5sum

The script in figure B.1 takes a directory as a command line argument and, for each file in each subdirectory, computes the md5sum and writes it to a file called totalMd5.txt in the current directory. The output of the script is shown in figure B.2 below. It lists the md5sum and the complete path of each file.
58e33c78c71af982a7d2cd30410b2051  RENAMED2010Images/April_renamed/spam10000.jpg
d15eca70160d379e7c00ca3e99e83fdf  RENAMED2010Images/April_renamed/spam10001.jpg
b9de7049b5cfcea19f80d4fd55f27386  RENAMED2010Images/April_renamed/spam10002.jpg
0c85c7f4ed35fc533ff96a50eff932ac  RENAMED2010Images/April_renamed/spam10003.jpg
887f3bb4449b81305af7f4a4f6b207f1  RENAMED2010Images/April_renamed/spam10004.jpg
8bd3f7b70c5f7de4e21c611943716601  RENAMED2010Images/April_renamed/spam10005.jpg
8bed5edd0494f480a1585bd16d435b23  RENAMED2010Images/April_renamed/spam10006.jpg
a6e9cd6a120e15317089b8a69566d98f  RENAMED2010Images/April_renamed/spam10007.jpg
fde107714769faa696ecebedbad20a01  RENAMED2010Images/April_renamed/spam10008.jpg
31167a234781f3244137384afa55aef3  RENAMED2010Images/April_renamed/spam10009.jpg
5046c4254cd9f1b58b66d750117d29a3  RENAMED2010Images/April_renamed/spam1000.jpg
f61d48470a87afb9f4251fa619dc7f03  RENAMED2010Images/April_renamed/spam10010.jpg
9979242622466893910121e5f9dc8f71  RENAMED2010Images/April_renamed/spam10011.jpg
2a7aa17d86385cd3d8b1edbde1042f12  RENAMED2010Images/April_renamed/spam10012.jpg
d0565db3dbb9df703b6ba2f52ce1c825  RENAMED2010Images/April_renamed/spam10013.jpg
16b953f44c00f243a073dcbee6b7676d  RENAMED2010Images/April_renamed/spam10014.jpg
2384245e6a800a79064888dd60d306fd  RENAMED2010Images/April_renamed/spam10015.jpg
c0ca7f0c0eb846df8ef5d0c6f1026f46  RENAMED2010Images/April_renamed/spam10016.jpg
9ad7ae4f335525e26e8e8a1e7b11c94e  RENAMED2010Images/April_renamed/spam10017.jpg
4d489c5bd9a3653f5f55e544d86ed596  RENAMED2010Images/April_renamed/spam10018.jpg

Figure B.2 Output of md5sum script

The script in figure B.3 takes a text file as input and deletes the files listed in it.

#!/bin/sh
cat $1 | xargs -I % rm %

Figure B.3 Script to delete the duplicate files

The output of the Java program is shown in figure B.4.
It simply lists the md5sum and the name of each file to be deleted using the script in figure B.3.

3031c161229f18aecb506ea36c7c5fd2  January2009/17Jan2009/37668-Gn27Swc1.gif
3031c161229f18aecb506ea36c7c5fd2  January2009/17Jan2009/5097-Gn27Swc1-4.gif
3031c161229f18aecb506ea36c7c5fd2  January2009/17Jan2009/73252-Gn27Swc1-1.gif
3031c161229f18aecb506ea36c7c5fd2  January2009/17Jan2009/89006-Gn27Swc1-5.gif
3031c161229f18aecb506ea36c7c5fd2  January2009/17Jan2009/97853-Gn27Swc1-2.gif
38bccaf5a2599b6306b7d72e8e1e1b4f  January2009/13Jan2009/66763-image002-4.jpg
38bccaf5a2599b6306b7d72e8e1e1b4f  January2009/13Jan2009/67048-image002-3.jpg
3a59d81ecbbe5b02afd265d9921e563e  January2009/7Jan2009/88273-EK2xUr%280A%29.gif
647e0d7808b2b21a4dff05a7c980a803  January2009/7Jan2009/4790-Sodu%28eL%29.gif
77c84cdb68665fe7b042f8d2d963d249  January2009/13Jan2009/88162-image001-4.jpg

Figure B.4 Output of the Java program to identify duplicate files