CLASSIFICATION OF IMAGE SPAM
A Thesis
Presented to
The Graduate Faculty of The University of Akron
In Partial Fulfillment
of the Requirements for the Degree
Master of Computer Science
Shruti Wakade
August, 2011
CLASSIFICATION OF IMAGE SPAM
Shruti Wakade
Thesis
Approved:
Accepted:
_______________________________
Advisor
Dr. Kathy J. Liszka
_______________________________
Department Chair
Dr. Chien-Chung Chan
_______________________________
Committee Member
Dr. Zhong-Hui Duan
_______________________________
Dean of the College
Dr. Chand Midha
_______________________________
Committee Member
Dr. Chien-Chung Chan
______________________________
Dean of the Graduate School
Dr. George R. Newkome
________________________________
Date
ABSTRACT
Image spam has been one of the most prevalent forms of spam since its inception.
Spammers have refined their techniques to use smaller, more colorful, photo-quality
images as spam. In spite of numerous efforts by researchers and free mail services such as
Yahoo Mail and Gmail to build efficient filters against e-mail spam, spam filters still fail
to arrest image spam. This research is an attempt to understand the techniques used in
spamming and to identify a set of features that can help in classifying image spam apart
from photographs.
A set of eight features was identified based on observations and existing research in this
area. Six of the eight features were introduced by us and the other two were taken from
previous research. Data mining techniques were then applied to classify image spam
versus photographs. Identifying a set of effective yet computationally inexpensive
features was the objective that guided this research work.
We achieved a classification accuracy of 89% for the test samples. A detailed trail of
image spam was also studied to identify the most prevalent types and patterns in image
spam. Our results indicate that five of the six features we introduced proved to be of high
significance in distinguishing image spam from photographs.
ACKNOWLEDGEMENTS
I extend my heartfelt gratitude and appreciation to Dr. Kathy J. Liszka, an extremely
helpful teacher and a wonderful advisor who is the guiding force behind this research
work. Without her guidance, inputs and encouragement this work would not have been
possible.
I express my sincere appreciation and gratitude to Dr. Chan for helping me with the data
mining experiments and for insightful corrections. I appreciate my committee member
Dr. Duan for her thoughtful inputs. I wish to thank Chuck Van Tilburg, for extending his
help in the research labs and providing a workable environment in the labs. I also wish to
thank Knujon for contributing spam images which helped me to build a substantial
corpus for this research.
Last, but not the least, I would like to convey my heartfelt gratitude to my family and
friends for their constant encouragement, support and timely help.
TABLE OF CONTENTS
Page
LIST OF TABLES..............................................................................................................ix
LIST OF FIGURES.............................................................................................................x
CHAPTER
I. INTRODUCTION............................................................................................................1
II. SPAM DEFINITION AND TYPES...............................................................................3
2.1 Overview............................................................................................................3
2.2 Types of spam....................................................................................................4
2.3 Image Spam.......................................................................................................5
2.4 Related Research................................................................................................7
III. SPAM IMAGES AND DATASET..............................................................................9
3.1 Types of Images................................................................................................9
3.2 Image Spam Dataset........................................................................................11
3.3 Corpus..............................................................................................................12
3.3.1 Statistics of Images in the Corpus.....................................................13
3.4 Preprocessing...................................................................................................14
3.4.1 Feature Selection...............................................................................15
3.5 Feature Extraction Process...............................................................................18
IV. DATA MINING TECHNIQUES................................................................................20
4.1 Data Mining Overview....................................................................................20
4.2 Classification....................................................................................................22
4.3 Decision Trees.................................................................................................23
4.3.1 J48.....................................................................................................24
4.3.2 RepTree ............................................................................................24
V. EXPERIMENTS AND RESULTS...............................................................................25
5.1 Weka Data Mining Tool..................................................................................27
5.2 Data Set Preparation........................................................................................26
5.3 Methodology....................................................................................................26
5.3.1 Run 1- Using J48 Classifier..............................................................26
5.3.2 Run 2- Using RepTree Classifier......................................................27
5.3.3 Depth of the RepTree........................................................................27
5.3.4 Dataset Proportions...........................................................................28
5.3.5 Training and Testing data selection..................................................29
5.3.6 Testing on Unseen data.....................................................................29
VI. VALIDATION BY FEATURE ANALYSIS.............................................................33
VII. TRENDS IN IMAGE SPAM.....................................................................................38
7.1 Count of Image Spam......................................................................................38
7.2 Trend of the Month..........................................................................................39
7.3 New Trends in Image Spam.............................................................................42
7.3.1 Scraped Images.................................................................................42
7.3.2 Malware Embedding in Images........................................................42
VIII. CONCLUSIONS AND FUTURE WORK...............................................................46
REFERENCES..................................................................................................................49
APPENDICES...................................................................................................................52
APPENDIX A.
DATA ANALYSIS............................................................53
APPENDIX B.
GENERATING MD5SUM AND SELECTING
UNIQUE FILES.................................................................55
LIST OF TABLES
Table
Page
3.1 Statistics of the images collected to form the corpus...................................................13
4.1 Example data for classification....................................................................................20
4.2 Example data for clustering.........................................................................................21
5.1 Depth value of RepTree...............................................................................................28
5.2 Accuracy of classification for different ratios of ham and spam images.....................28
5.3 Count of spam images in 2010.....................................................................................30
5.4 Accuracy of classification for unseen samples............................................................31
5.5 Computing time for extracting features.......................................................................31
7.1 Image spam count in 2008- 2011.................................................................................38
LIST OF FIGURES
Figure
Page
1.1 Example of Image Spam................................................................................................2
2.1 Adding noise to the Image.............................................................................................6
2.2 Wavy images..................................................................................................................6
2.3 Rotating Image and adding noise...................................................................................6
3.1 Text only image spam...................................................................................................9
3.2 Adding random colored pixels.....................................................................................10
3.3 Adding color streaks....................................................................................................10
3.4 Adding a wild background...........................................................................................10
3.5 Examples of Standard Images......................................................................................11
3.6 Similar spam images but with different checksums....................................................14
3.7 Low log average luminance.........................................................................................15
3.8 High log average luminance........................................................................................15
3.9 Ham image background...............................................................................................17
3.10 Spam image background............................................................................................17
4.1 Example of decision tree using J48 classifier for weather data...................................23
5.1 Part of Decision tree generated by the classifier..........................................................32
6.1 Plots for range of feature values for Image Spam and Ham images............................33
7.1 Images that appeared on February 14th - Valentine’s Day...........................................40
7.2 Image spam type for the time period of January 2010- Feb 2011...............................41
7.3 Example of a scraped spam image...............................................................................42
7.4 Malware Embedded .png Image..................................................................................44
7.5 Binary form of images before and after saving as .hta file..........................................45
A.1 Average of feature values for all Image Spam in the year of 2010............................52
B.1 Script to generate md5sum..........................................................................................54
B.2 Output of md5sum script.............................................................................................55
B.3 Script to delete the duplicate files...............................................................................55
B.4 Output of the Java program to identify duplicate files................................................56
CHAPTER I
INTRODUCTION
E-mail is one of the most integral parts of communication over the internet today. However,
each day we spend several minutes deleting spam: unsolicited e-mails advertising products,
offering loans at low interest rates, drugs, and so on. Though spam filters are able to
identify the majority of e-mail spam, spammers are continuously developing newer
techniques to send spam messages to more and more people. With the advent of
technology, mobile devices and other portable electronic devices are now Wi-Fi enabled,
and internet telephony (VoIP, voice over internet protocol) has made communicating
across the world easy and inexpensive. Social networks like Twitter, Facebook and
MySpace are very popular means of connecting with friends across the globe. However,
this has opened a new audience for spammers to exploit. Spam is not just limited to e-mail
anymore; it appears on VoIP in the form of unsolicited marketing or advertising phone
calls, and as marketing, advertising and pornography links on social networks. Spam is
everywhere!
There are many ways spammers can get to know your e-mail address and send you spam
even though you never open spam mails or click any suspicious links. If you are on any
social network and do not set your privacy settings, your data, including your location,
e-mail address and friend lists, is available to anyone. If you subscribe to newsgroups,
your e-mail address can be easily harvested; a dictionary attack is another technique used
to harvest e-mail addresses. So it is easy to find information with little time and effort,
and spammers have lots and lots of it.
Most spammers use bots to do the job for them, so even if only one user responds to their
spam it is worth the effort of sending e-mail to hundreds of people.
Filters today can arrest most of the e-mail spam that appears in the form of text.
Blacklisting known spamming IPs can also prevent spam to a certain extent. This research
deals with image spam, a spamming technique in which the spam message is embedded in
an image instead of being placed directly in the message body.
Two examples of spam images are shown in Figure 1.1.
We test a method to classify spam images using decision trees in the Weka data mining
tool. The remainder of this report is organized as follows. Chapter two provides an
overview of spam and the current research in the area. Chapter three describes the spam
image dataset, preprocessing and feature extraction. Chapter four introduces the data
mining techniques used. Chapter five presents the experiments and the results of
classifying spam and non-spam images. Chapter six validates the selected features,
chapter seven examines trends in image spam, and the final chapter discusses conclusions
and future work.
Figure 1.1 Example of Image Spam
CHAPTER II
SPAM DEFINITION AND TYPES
2.1 Overview
In general, spam refers to the use of electronic messaging to send unsolicited messages to
a large group of addresses arbitrarily. Though e-mail spam is the most widely known form
of spam, it also appears in many other electronic media such as chat, internet telephony,
social networks, web spamming, etc. The cost of transmitting these messages is borne by
the users who receive them and by the ISPs, who cannot stop the spam traffic and are
forced to increase bandwidth to accommodate it. The spammers only need to manage the
mailing lists that they target. Some common examples of spam are:

• Advertisements in the form of pop-ups selling products or offering free downloads when
we click a link on a web page

• Unsolicited e-mails with inappropriate content, offers, or political views

• Redundant calls on IMs like Skype offering mortgages or loans with low interest rates

• Links on social networks that lead to free downloads, easy income, or pornography

• Unsolicited text messages offering loans, low-priced products, etc.
2.2 Types of spam

• E-mail spam: also known as unsolicited bulk e-mail, this is the most common form of
spam we see. The motive behind these messages is mostly to advertise and sell products,
steal information (phishing), express political views, distribute pornography or inject
malware. The first e-mail spam is said to have been sent in 1978 by DEC, as an invitation
to all ARPANET addresses to the reception for their new DEC-20 machine. After this
incident, the first big USENET spam came in 1994: a religious posting that caused a lot of
controversy and debate. The next big spam was the green card lottery, in which two
attorneys sent bulk USENET postings offering green card visas to immigrants [1]. As
time passed, spam grew in volume and in severity. Today most spam mail is sent using
bots, compromised systems controlled by a master system to send spam messages. If a
user falls prey to the spam, the user's system may itself be compromised and turned into a
bot, the user's credentials may be stolen, or malware may be injected into the user's
machine. E-mail harvesting is the most common way to obtain addresses: spammers
purchase addresses from other spammers, or use harvesting bots which collect e-mail
addresses from postings on Usenet, internet forums, etc. [26]. Another method is the
dictionary attack, where valid e-mail addresses are generated by guessing common
usernames in a domain. Apart from these, social networks today provide an easy way to
reach larger audiences and are the new favorite among spammers.
• Instant messaging and texting: spam in instant messengers like Skype and Yahoo
Messenger generally comes in the form of friend requests from unknown people; it is less
common than e-mail spam. Text spam consists of promotional offers, advertisements for
low interest rate loans, etc., sent as text messages from unknown sources.

• Search engine spam (spamdexing): spamming web pages to falsely increase their
ranking in search engine results.
2.3 Image spam
Image spam is a variant of e-mail spam in which spammers embed the spam message in
an image instead of placing it directly in the mail content, in order to evade spam filters.
Spam filters look for certain key words like Viagra, cash and money which are commonly
related to spam e-mails; when the message is inside an image, however, the filters cannot
effectively catch it. There are many techniques which spammers have used to confuse
spam filters. Some examples are [10]:

• Adding random words before the HTML

• Using white text on a white background

• Using characters, as in M*oney

• Adding bogus HTML tags with a lot of text

• Adding spaces in words, as in "l o w I n t e r e s t R a t e"
As stronger filters were developed to track these messages, spammers came up with newer
techniques such as image spam and sending spam as PDF documents. With the use of
Optical Character Recognition (OCR) filters it became possible to extract the contents of
the images and then check whether an image carried spam content.
However, this process is expensive, and spammers came up with new ways to evade the
OCR filter, including:

• Rotating images or making them look wavy

• Adding noise to the images

• Slicing the image and rotating each component

Figures 2.1, 2.2 and 2.3 below show some of these image spamming techniques.
Figure 2.1 Adding noise to the Image
Figure 2.2 Wavy images
Figure 2.3 Rotating Image and adding noise
2.4 Related Research
Image spam has not been studied as extensively as e-mail spam; however, some recent
research on image spam involves detecting the text in the spam message or identifying
low-level features such as header properties and histograms.
Artificial neural networks (ANNs) have been used to identify image spam. The images
were first normalized into grayscale values between 0 and 1. An ANN was then trained on
these images using a supervised learning approach and tested on new samples of spam
images. A classification accuracy of about 70% was reported for unseen images [11].
Low-level features like image width, height, aspect ratio, file size, compression and
image area, all extracted from the image header, have been used along with another set of
features such as the number of colors, variance, the most frequently occurring colors, the
primary color in the image and the color saturation; color histograms were also computed.
A set of binary features was used to indicate the file type, such as JPEG, BMP or PNG,
and SVM classifiers were used to classify the images. An accuracy of over 95% was
reported [12].
Aradhye et al. used their existing work to detect text embedded in digital photographs.
After the text was extracted, they analyzed it and computed other features like color
saturation and color heterogeneity, and used SVM classifiers to classify the images. They
obtained an accuracy of 85% [13].
Features similar to those in [12] were used in another study for classification with the
C4.5 decision tree algorithm in Weka and a support vector machine. Their results
indicated that the support vector machine performed better than C4.5, as it had a larger
area under the ROC curve [14].
Another study used an agglomerative hierarchical clustering algorithm to cluster the spam
images based on a similarity measure of color histograms and gradient orientation
histograms. The training set was selected from the clustered groups, and a probabilistic
boosting tree was built on it to distinguish spam images from ham images. They claimed
an accuracy of about 89% [15].
Peizhou et al. extracted file properties as in [12], [14] and then set a threshold for these
properties. The properties of a test image were compared with the threshold values; if the
image was flagged during this step as possible spam, it was then sent to histogram testing,
where the histogram similarity of the test image and the threshold value was compared.
The advantage of this two-step classification was that the first step trapped many of the
spam image files. They claimed accuracy of about 68% for JPEG images and 23% for
GIF images in step 1, and 84% for JPEG and 80% for GIF in step 2 [16].
CHAPTER III
SPAM IMAGES AND DATASET
Spam images are not only a way to evade spam filters; they also have advantages over
plain e-mail spam:

• Colorful: images carry more colors than a regular e-mail, which makes the message
look attractive and professional

• Images come as attachments and may be named arbitrarily, so they are difficult to
detect unless their contents are analyzed

• There are more ways to randomize the message: rotation, skewing, blurring by adding
noise, or animation using GIF images
3.1 Types of Images

• Text-only images: some images contain only text (Figure 3.1).

Figure 3.1 Text only image spam
• Randomization: in order to thwart signature-based anti-spam devices, spammers add
random color stripes, random colored pixels and shades of colors (Figures 3.2 and 3.3).

Figure 3.2 Adding random colored pixels

Figure 3.3 Adding color streaks

• Wild backgrounds: a busy background makes it difficult for OCR to detect the text in
the image (Figure 3.4).

Figure 3.4 Adding a wild background
• Animated GIF and multipart images: the image is split into multiple parts, some
containing the message and others containing some animation. The frames rotate fast
enough that only the final result is displayed to the user.

• Standard images: these are neat-looking images in which none of the above tricks are
employed, which gives them a genuine look. The entire message is contained in the
image and hence text scanners cannot detect it. In fact, many of the images that arrive as
spam today have a professional look, making them appear just like photographs. Figure
3.5 below gives two such examples.
Figure 3.5 Examples of Standard Images
3.2 Image spam dataset
Spam images come in various formats. We encountered .bmp, .gif, .jpeg and .png formats
while collecting images for this research. BMP stands for bitmap image file, comprising
image data in a raster format; it is a device-independent format and the files are not
compressed, so their size is larger. GIF (Graphics Interchange Format) is also a bitmap
format which supports up to 8 bits per pixel; it supports animation and allows 256 colors
in each frame. PNG (Portable Network Graphics) is a bitmap format which was developed
to improve on GIF and employs lossless data compression [6].
We converted all the images to the JPEG (Joint Photographic Experts Group) format for
this research. JPEG supports a 24-bit color map, i.e. 8 bits per color component, and has a
small file size. It uses lossy compression, but the degree of compression can be adjusted.
JPEG is a standard image format in many photography devices.
3.3 Corpus
Corpus refers to a collection, in our case the collection of images that we have used in
this research. All the spam images were collected from Knujon, a spam reporting service
which reports illicit spam websites. We downloaded zipped files of images from spam
mails collected from users all over the world, transferring them with a local WinSCP
setup. The time period of the images ranges from March 2009 to September 2010. Ham
images were first collected from personal photographs and then from Flickr [3] using the
Flickr downloader [4].
The Flickr downloader software lets you download Flickr images that are under a
Creative Commons license. In order to download images we need to install the software
and, in the query, either provide tags like "nature", "landscape", "people", etc. to specify
which type of image to search for, or provide group names if any.
Some images were also downloaded from Google Images [5], Wikipedia [6] and National
Geographic [7] with different search tags like places, wildlife, cities, etc. We eliminated
spam images with pornographic content and photographic spam, as these are not a part of
this research. Photographic spam images are actual photographs (usually thumbnails or of
small dimensions) which are sometimes used in spam images.
3.3.1 Statistics of images in the corpus
Table 3.1 Statistics of the images collected to form the corpus

Spam source                                    Total number of images   Unique images
Knujon                                         19762                    16186

Ham source                                     Total number of images
Personal                                       1022
Wikipedia and NGC pictures on the wiki pages   454
Flickr                                         3964
Total                                          5440
In order to ensure that we did not use duplicate images in our training set, we used a
script from the work on artificial neural networks as a tool for identifying spam [11] to
compute a checksum for each individual image. A checksum is a simple way to check
data integrity during data transmission. MD5 (Message-Digest algorithm 5) is a
cryptographic hash function with a 128-bit hash value. It is very unlikely that two
different files have the same MD5 checksum, and hence the MD5 hash acts like a digital
fingerprint of a file. Many spam images look similar but have different checksums. For
example, the two images in Figure 3.6 look identical but have different checksums.
(a) MD5 = 996484e6cc7340ee2067ee96074ce324
(b) MD5 = 400fea9508fb5d759d9d698ef293c937
Figure 3.6 Similar spam images but with different checksums
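The cleanup scripts used for this step are listed in Appendix B. As a rough illustration only
(the directory name and the commented-out deletion step are hypothetical, not the exact
programs used in this work), a Java sketch of checksum-based duplicate detection could
look like this:

    import java.io.File;
    import java.math.BigInteger;
    import java.nio.file.Files;
    import java.security.MessageDigest;
    import java.util.HashMap;
    import java.util.Map;

    public class DedupByMd5 {
        public static void main(String[] args) throws Exception {
            File dir = new File(args.length > 0 ? args[0] : "spam_images");
            Map<String, File> seen = new HashMap<>();   // MD5 digest -> first file seen with it
            MessageDigest md5 = MessageDigest.getInstance("MD5");

            for (File f : dir.listFiles()) {
                if (!f.isFile()) continue;
                byte[] digest = md5.digest(Files.readAllBytes(f.toPath()));
                String hex = String.format("%032x", new BigInteger(1, digest));
                if (seen.containsKey(hex)) {
                    System.out.println("duplicate: " + f + " matches " + seen.get(hex));
                    // f.delete();   // a cleanup step like the Appendix B script would delete it
                } else {
                    seen.put(hex, f);
                }
            }
            System.out.println("unique images: " + seen.size());
        }
    }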
Creating a large enough dataset is a labor-intensive task, as it requires removing
pornographic images and photo spam manually. Compared to the SpamArchive dataset
[27], the images we downloaded came from different geographical locations and were
more recent (2009-2011). The SpamArchive corpus contains images dated only up to
2007 and includes a significant number of duplicates. The Princeton corpus [28] has
about 1071 images. The study conducted by Dredze et al. [29] used 2359 ham images as
opposed to the 7040 ham images (5440 + 1600) and 3421 spam images as opposed to the
69,516 spam images (16,186 + 53,330) used in this study.
3.4 Pre-Processing
Preprocessing refers to cleaning up the data, or preparing it in an appropriate format for
the experiment. We did the initial cleanup of images by removing images with adult
content and photo spam. Next, we converted all images to the .jpg format using the
ImageMagick utility [8]. ImageMagick is an open source utility that can convert images
between various formats and also perform modifications like resizing, cropping, etc.
The ImageMagick command that converts all the files in a folder from .bmp to .jpg format
is mogrify -format jpg *.bmp; other formats can be converted similarly. Most of the ham
images are very large, up to around 30 MB, as they are photographs taken with a camera,
and some of the spam images are quite large too. In order to make the preprocessing
faster we resized all the images so that they did not exceed 150 KB. The ImageMagick
command that resizes all the files in a folder is mogrify -resize 50% *.bmp. Resizing does
not affect any of the features.
3.4.1 Feature Selection
When we look at an image we can immediately identify whether it is colorful, i.e. has a
lot of colors, or whether it is blurred and we cannot see the contents clearly. In this
research we have looked at some such features to distinguish spam images from
non-spam. The following features have been explored.

a) Luminance of the image: luminance refers to the brightness of an image; some images
are brighter than others.

Figure 3.7 Low log average luminance
Figure 3.8 High log average luminance
In the images above we can see that the image in Figure 3.8 is brighter, or has more
luminance, than the image in Figure 3.7. Most spam images are not very bright, as they
are not taken with a camera and the main intent of sending them is to get past the spam
filter and reach as many recipients as possible; hence they have a small file size and are
not very clear or bright.
We can compare the luminance of different images using the log-average luminance,
which is calculated as the geometric mean of the luminance values of all pixels. In a
grayscale image, the luminance value is simply the pixel value. In a color image, the
luminance value is found by a weighted sum [9]:

Luminance = 0.27 R + 0.67 G + 0.06 B                                        (3.1)

We then took the average luminance over all the pixels in an image:

AverageLuminance = (sum of Luminance over all pixels) / (number of pixels)  (3.2)
b) Number of colors: we computed how many colors an image has. A JPEG image has a
24-bit color map, i.e. each pixel is 24 bits [2]. This means that an image can have 2^24, or
16,777,216, different colors. We divide this range into 1,677 bins, with roughly 10,000
consecutive color values falling into one bin. The number of colors then ranges from 0 to
1,677, where 1,677 is the maximum number of colors an image can have for our
computations.
c) Color saturation: color saturation can be described as the pureness of a color; for
example, how red the color red is in an image. If a pixel has the value (R, G, B) =
(255, 0, 0), the pixel has a high saturation of red.
As defined by Aradhye et al. [13] and Frankel et al. [17], color saturation is the ratio of
the number of pixels in an image for which the difference max(R, G, B) - min(R, G, B) is
greater than some threshold T to the total number of pixels. The threshold T is set to 50
by Frankel et al. and in this work.
For every pixel in an image we calculate the maximum and minimum among the R, G, B
values and take the difference. We use a counter to count how many pixels have a
difference greater than T (T = 50). Finally, we divide the counter by the number of pixels
in the image to obtain the saturation value.
d) White pixel concentration: spam images generally have a solid background which is
mostly pastel or white in color. For example, Figure 3.9 (a ham image) has subtle shades
of different colors but no solid background, unlike Figure 3.10 (a spam image), which has
a mostly white background.
Figure 3.9 Ham image background
Figure 3.10 Spam image background
We calculate how many pixels in a given image have all of their R, G, B component
values above 250, as any value above this threshold corresponds to a white or pastel shade
such as gray or pale white. We then take the ratio of the number of white pixels to the
total number of pixels in the image. If an image has more white pixels than any other
color, that is probably the background color of the image. Most photographs do not use
such a background.
e) Standard deviation of colors: each image has three color components, namely red,
green and blue. Pixels have different combinations of these R, G and B values and hence
display different colors. We computed the standard deviation of each component; the
value tells us how much variation there is in each color component.
f) Hue: hue can be described as the dominant wavelength in a color model which
describes a given color. For example, if we are looking at an apple the hue is red, as that
is what the color of the apple looks like to us. Java's Color class provides a method to
determine the hue value of a given pixel. We compute the hue values for each pixel and
then take the mean of these values to represent the hue of the image.
3.5 Feature Extraction Process
The image features were computed using Java programs. In order to compute any of the
features for an image we need to know the color components (r, g, b) of each pixel. We
used Java's ImageIO package to retrieve the pixel values of an image and then extracted
the (r, g, b) values of each pixel for further processing; a sketch of this process appears
after the list below.
a) Luminance is calculated for each pixel in the image, and we take the average of all the
pixels' luminance values.
b) To count the number of colors we read each pixel value and decide which color bin to
put it in; every pixel is thus assigned to one of the 1,677 color bins.
c) Color saturation is computed for each pixel and the average value is computed for the
entire image.
d) We read each pixel and check whether it falls in the white pixel threshold range we
set, then compute what percentage of the total pixels is white/pastel in color.
e) The standard deviation of each color component r, g, b is computed.
f) Hue is computed for each pixel and then the average value is taken.
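The following Java sketch illustrates the extraction steps above. It is a simplified
illustration rather than the exact program used in this research: it reads one image with
ImageIO and computes the average luminance (Equation 3.1), the color saturation with
T = 50, the white/pastel pixel concentration, the average hue via Java's Color class, and
the number of occupied color bins; the standard deviations of the color components are
omitted for brevity.

    import java.awt.Color;
    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.util.BitSet;
    import javax.imageio.ImageIO;

    public class ImageFeatures {
        public static void main(String[] args) throws Exception {
            BufferedImage img = ImageIO.read(new File(args[0]));
            int w = img.getWidth(), h = img.getHeight();
            long pixels = (long) w * h;

            double lumSum = 0, hueSum = 0;
            long saturated = 0, white = 0;
            BitSet bins = new BitSet(1678);                 // one bin per ~10,000 consecutive colors

            for (int y = 0; y < h; y++) {
                for (int x = 0; x < w; x++) {
                    int rgb = img.getRGB(x, y);
                    int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;

                    lumSum += 0.27 * r + 0.67 * g + 0.06 * b;   // Equation 3.1
                    int max = Math.max(r, Math.max(g, b));
                    int min = Math.min(r, Math.min(g, b));
                    if (max - min > 50) saturated++;            // saturation threshold T = 50
                    if (r > 250 && g > 250 && b > 250) white++; // white/pastel pixel
                    hueSum += Color.RGBtoHSB(r, g, b, null)[0]; // hue component, in [0, 1]
                    bins.set((rgb & 0xFFFFFF) / 10000);         // color bin this pixel falls into
                }
            }

            System.out.println("average luminance:         " + lumSum / pixels);
            System.out.println("color saturation:          " + (double) saturated / pixels);
            System.out.println("white pixel concentration: " + (double) white / pixels);
            System.out.println("average hue:               " + hueSum / pixels);
            System.out.println("number of color bins:      " + bins.cardinality());
        }
    }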
CHAPTER IV
DATA MINING TECHNIQUES
4.1 Data Mining Overview
Data mining is the process of detecting patterns in data. When the data is large, data
mining techniques make it easier to analyze. The process is either completely automated
or semi-automated [18].
Data mining problems can be primarily categorized into classification, clustering and
association. Classification assigns a given set of data to different categories using a
classification model generated during a learning process. The data is represented by
conditional attributes and a decision label; the conditional attributes describe features of
the data, and their values determine what the decision label is. For example, consider the
data about fruits in Table 4.1. In this case color, size and taste are the conditional
attributes and the decision values are apple, cherry, lemon, and so on. Classification is
also known as supervised learning because the decision values are known to us.
Table 4.1 Example data for classification

Color   Size (diameter)   Taste   Decision
Red     4 inches          Sweet   Apple
Red     1 inch            Sweet   Cherry
Green   2 inches          Sour    Lemon
Green   15 inches         Sweet   Melon
Black   0.5 inch          Sweet   Grape
The classification algorithm first builds a model which tells it what rules to follow when
classifying a new instance of data. So if we give it a new instance with attributes
Color = Red, Size = 4 inches and Taste = Sweet, the algorithm knows that it is an apple,
because it learned this when we trained it using the data in Table 4.1.
Clustering is a technique for grouping data. In this case we do not know the decision
label, and it is called unsupervised learning. Consider the example in Table 4.2; this time
we only have attribute values and do not know the decision.
Table 4.2 Example data for clustering

Color    Size (diameter)   Taste
Red      4 inches          Sweet
Red      3.5 inches        Sweet
Green    2 inches          Sour
Green    15 inches         Sweet
Black    0.5 inch          Sweet
Red      4 inches          Sweet
Green    2 inches          Sour
Red      3.5 inches        Sweet
Green    15 inches         Sweet
Orange   3 inches          Sweet
If we had to group the fruits in Table 4.2, we would group fruits which have similar
properties of color, size and taste, also called a similarity measure. We could group all
red fruits which are sweet as group 1, all green fruits which are sweet as group 2, all
green fruits which are sour as group 3 and all orange fruits which are sweet as group 4.
Many arrangements are possible, depending on which features we take into account.
Association rule mining is used to find frequently occurring patterns among a set of items
or objects. There is not much difference between association rule mining and
classification rule mining, except that association rules can predict any attribute, not just
the decision or class attribute. Unlike classification rules, association rules are not used
together as a set [18].
Based on the data at hand and what we wish to find from the analysis, we can choose
which of the above three techniques is beneficial and feasible. In this research we have
used the classification technique to classify images as spam or ham based on their
features.
4.2 Classification
Classification predicts categorical class labels. It constructs a model based on a training
set and the values of the class attribute and uses it to classify new instances. Each record
in the training set is assumed to belong to a predetermined class; this set is used to
construct the classifier model. The model can be represented as classification rules,
decision trees, or mathematical formulae, and is then used to classify new or unknown
instances (also called a testing set). To estimate the accuracy of the model, the known
label of each test sample is compared with the classified result from the model. Accuracy
is the percentage of test set samples that are correctly classified by the model [18].
4.3 Decision Trees
Decision trees are a divide-and-conquer approach to learning from a set of independent
instances. A decision tree is a flow-chart-like tree structure in which each internal node
denotes a test on an attribute [18].
A branch represents a result of the test, and the leaf nodes represent class or decision
labels. In the beginning, all the training examples are at the root of the tree. The examples
are partitioned recursively based on selected attributes. Some branches are pruned to
remove branches that reflect noise or outliers. Figure 4.1 shows an example of a decision
tree.
Figure 4.1 Example of decision tree using J48 classifier for weather data
Decision trees are constructed in a top-down, recursive, divide-and-conquer manner. The
basic algorithm is greedy: all the training examples start at the root, and instances are
partitioned recursively based on selected attributes. Attributes are selected on the basis of
a heuristic or statistical measure like information gain. Based on the model or tree
generated, new instances are classified. Partitioning stops when all samples at a given
node belong to the same class, when there are no remaining attributes for further
partitioning, or when no samples are left.
4.3.1 J48
Of the many algorithms available for decision trees, J48 is one of the most popular. It is
an open source Java implementation of the C4.5 algorithm in the Weka data mining tool.
C4.5 is an extension of ID3 developed by Quinlan [19].
The algorithm chooses the attribute of the data that most effectively splits its sample into
subsets enriched in one class or the other. The criterion used to choose an attribute for
splitting is information gain, the reduction in entropy that results from splitting the data
on that attribute. The attribute with the highest normalized information gain among all
the attributes is chosen, and the procedure is repeated on the smaller sublists.
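For reference, the standard definitions underlying this splitting criterion can be written as
follows, where S is the set of training examples at a node, A a candidate attribute, p_i the
proportion of examples in class i, and S_v the subset of S with value v for A:

    H(S) = -\sum_i p_i \log_2 p_i

    IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)

    GainRatio(S, A) = \frac{IG(S, A)}{-\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}}

C4.5 (and hence J48) uses the gain ratio, i.e. the information gain normalized by the split
information, to avoid favoring attributes with many distinct values.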
4.3.2 RepTree
RepTree builds a decision tree using information gain/variance reduction and prunes it
using reduced-error pruning. Optimized for speed, it sorts values for numeric attributes
only once. It deals with missing values by splitting instances into pieces, as C4.5 does
[18]. Some other decision tree algorithms in Weka include ID3, RandomTree,
RandomForest, etc.
CHAPTER V
EXPERIMENTS AND RESULTS
5.1 Weka Data Mining Tool
In order to implement the classification we used a data mining tool called Weka, a
collection of machine learning algorithms for data mining tasks. The algorithms can
either be applied directly to a dataset or called from your own Java code. Weka contains
tools for data pre-processing, classification, regression, clustering, association rules, and
visualization. It is also well suited for developing new machine learning schemes. Weka
is open source software issued under the GNU General Public License [20].
We used Weka version 3.6 for our experiments. In order to load data into Weka it should
be converted into Weka's native "ARFF" format. ARFF stands for Attribute-Relation File
Format; it is an ASCII text file that describes a list of instances sharing a set of attributes
[21].
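As an illustration only, a feature file for the eight features described in Chapter III could
look like the following in ARFF format; the attribute names and data rows here are
hypothetical and are not taken from the actual corpus.

    @relation image_spam_features

    @attribute avg_luminance     numeric
    @attribute num_colors        numeric
    @attribute saturation        numeric
    @attribute white_pixel_ratio numeric
    @attribute stddev_red        numeric
    @attribute stddev_green      numeric
    @attribute stddev_blue       numeric
    @attribute avg_hue           numeric
    @attribute class             {spam, ham}

    @data
    210.5, 112, 0.13, 0.62, 41.2, 38.7, 40.1, 0.21, spam
    97.3,  305, 0.44, 0.04, 88.5, 92.1, 80.3, 0.35, ham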
Weka is implemented in Java programming language. Hence, it is possible to import
weka’s jar file and use in our java programs. The tool has both GUI and command line
support. We have used weka’s class library to write a program that can generate the
25
REPTree classifier and test the classifier generated class label with the actual class label
of a test sample and compute the classification accuracy.
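A minimal sketch of such a program, assuming the feature vectors have already been
written to ARFF files (the file names are placeholders), is shown below; it loads a training
set and an independent test set, builds a REPTree, and reports the accuracy.

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.REPTree;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RepTreeRunner {
        public static void main(String[] args) throws Exception {
            // Load the training set and an independent test set (placeholder file names).
            Instances train = new DataSource("train.arff").getDataSet();
            Instances test  = new DataSource("test_month.arff").getDataSet();
            train.setClassIndex(train.numAttributes() - 1);   // last attribute is the spam/ham label
            test.setClassIndex(test.numAttributes() - 1);

            // Build the REPTree classifier on the training data.
            REPTree tree = new REPTree();
            tree.buildClassifier(train);

            // Compare the label predicted for each test sample with its actual label.
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            System.out.println("accuracy: " + eval.pctCorrect() + " %");
            System.out.println(eval.toSummaryString());
            System.out.println(tree);                         // the tree itself, for inspection
        }
    }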
5.2 Data Set Preparation
The dataset is the collection of images we have used for the purpose of this research.
After preprocessing each image and extracting its features, we write the features to a text
file. Each image is then represented as a vector of the feature values associated with it.
5.3 Methodology
5.3.1 Experiments with J48 classifier
In the first attempt we used J48 to classify the data. We had 16,186 spam and 5,440 ham
images, out of which we randomly selected 90% for training and 10% for testing. We
made ten sets in this fashion and used the independent testing sets to check the efficiency
of the classification model. Since most of the values in the feature set are floating point,
the decision tree is very wide and has many leaf nodes. Pruning the tree and discretization
did not help much: the tree remained large because each value had two decimal places
and every feature had a wide range of values. So, although the accuracy was about 98%,
the tree was too large to be of use in analyzing the results.
5.3.2 Experiments with RepTree Classifier
We then looked for another classifier with comparable efficiency to J48 that would
generate a more comprehensible tree. RepTree is another classifier in Weka; it is a fast
decision tree learner that builds a decision/regression tree using information gain/variance
and prunes it using reduced-error pruning (with backfitting). It sorts values for numeric
attributes only once, and missing values are dealt with by splitting the corresponding
instances into pieces, as in C4.5 [20].
The reason for choosing RepTree is that, like J48, it uses information gain to choose
attributes. It also lets us limit the depth of the tree, and sorting numeric attributes only
once reduces the number of branches. The accuracies were almost equal to those of J48,
with the advantage of a smaller tree.
5.3.3 Depth of the RepTree
As mentioned earlier, RepTree lets us limit the depth of the tree. The question, however,
is what value to set the depth parameter to: what would be an optimum depth for the
tree? To answer this we used the concept of hill climbing. We started with a depth of 1
and kept increasing the depth in steps of 1, observing the accuracy at each step. The point
at which the accuracy stopped increasing, either becoming constant or starting to decline,
was taken as the optimum depth of the tree. Table 5.1 below lists the depth value for each
of the different datasets.
Table 5.1 Depth value of RepTree

Ratio of Spam to Ham   No. of Spam   No. of Ham   Depth
1:1                    5440          5440         5
7:3                    16186         5440         7
1:9                    604           5440         4
9:1                    9000          1000         6
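A sketch of this hill-climbing search over the depth parameter, again using the Weka
class library and assuming pre-built training and test sets, might look like this; it stops as
soon as the accuracy no longer improves, as described above.

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.REPTree;
    import weka.core.Instances;

    public class DepthSearch {
        // Returns the smallest depth at which classification accuracy stops improving.
        static int bestDepth(Instances train, Instances test) throws Exception {
            double bestAcc = -1.0;
            int best = 1;
            for (int depth = 1; depth <= 20; depth++) {
                REPTree tree = new REPTree();
                tree.setMaxDepth(depth);              // limit the depth of the RepTree
                tree.buildClassifier(train);

                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(tree, test);
                double acc = eval.pctCorrect();
                if (acc <= bestAcc) break;            // constant or declining accuracy: stop
                bestAcc = acc;
                best = depth;
            }
            return best;
        }
    }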
5.3.4 Dataset Proportions
We had 16,186 unique spam images and 5,440 ham images in the dataset. There are no
real-world statistics on how many spam images are encountered for each ham image,
because ham images cannot be harvested, so the question was in what ratio to choose ham
and spam images. To start with, we had 16,186 spam images and 5,440 ham images,
which is approximately a 3:1 ratio of spam to ham, so we used this as an initial set for our
experiments. We ran the experiments with RepTree and computed the classification
accuracy. We then chose a 1:1 ratio, i.e. one spam image for each ham image, and
computed the accuracy. Finally we checked two more boundary conditions, 1:9 and 9:1
ratios of spam to ham, and chose whichever combination gave the best accuracy. Table
5.2 lists the accuracy of each of these combinations and the number of images in each set.
Table 5.2 Accuracy of classification for different ratios of ham and spam images

Ratio of Spam to Ham   No. of Spam   No. of Ham   Accuracy (%)
1:1                    5440          5440         97.94
7:3                    16186         5440         98.28
1:9                    604           5440         99.23
9:1                    9000          1000         98.05
It was necessary to try these combinations in order to choose what proportion of ham and
spam images should form the training set. From Table 5.2 we can see that we get fairly
similar accuracies for every ratio. We did not choose 1:9 or 9:1, as these are extreme cases
where we have only 10% of ham or spam. We chose 7:3 because it had better accuracy
than the 1:1 ratio and it would cover more spam images and train the classifier better.
5.3.5 Training and testing data selection
After choosing the ratio, we used 16,186 spam images and 5,440 ham images to form the
dataset. We divided this set into ten different training and testing sets with a 90%
(training) to 10% (testing) split. We then used the RepTree classifier in Weka and
supplied independent test files as testing input. We repeated the experiment ten times,
once for each training/testing set, and obtained an average classification accuracy of
98.28%.
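A sketch of how such a repeated 90/10 split can be produced with the Weka API follows;
the corpus file name and the use of the run index as a random seed are illustrative
choices, not the exact procedure used in this work.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.REPTree;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RepeatedSplit {
        public static void main(String[] args) throws Exception {
            Instances all = new DataSource("corpus.arff").getDataSet();
            all.setClassIndex(all.numAttributes() - 1);

            double sum = 0;
            for (int run = 0; run < 10; run++) {
                Instances shuffled = new Instances(all);
                shuffled.randomize(new Random(run));          // different shuffle per run
                int trainSize = (int) Math.round(shuffled.numInstances() * 0.9);
                Instances train = new Instances(shuffled, 0, trainSize);
                Instances test  = new Instances(shuffled, trainSize,
                                                shuffled.numInstances() - trainSize);

                REPTree tree = new REPTree();
                tree.buildClassifier(train);
                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(tree, test);
                sum += eval.pctCorrect();
            }
            System.out.println("average accuracy over 10 runs: " + sum / 10 + " %");
        }
    }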
5.3.6 Testing on unseen data
After classification using RepTree we wanted to test how the classifier would work on
new files, something not seen by the algorithm before. We used the 53,300 unique images
from the year 2010 to test the classifier generated from the training data. We also
downloaded images from January 2011 to test how the classifier would work on recent
spam images. Table 5.3 shows the count of images after cleanup.
Table 5.3 Count of spam images in 2010

Month           No. of Images   No. of JPEG   No. of GIF
Jan-10          2341            1945          396
Feb-10          2333            1689          644
Mar-10          12991           10626         2365
Apr-10          11471           9077          2394
May-10          8502            5801          2701
Jun-10          15734           14292         1442
Jul-10          5263            4620          643
Aug-10          33698           32631         1067
Sep-10          8777            8054          723
Oct-10          1713            1223          490
Nov-10          2303            2248          55
Dec-10          795             670           125
Jan-11          717             259           458
Total           106638          93135         13503
Unique images                   53300
In order to test the classifier we used the Weka libraries, which are in Java. The program
generates a classifier from the training set using Weka's RepTree, then compares the
decision assigned to each test sample by the classifier with the actual label and computes
the accuracy of the classifier.
A test file was generated for each month, containing the features of the spam images for
that month and an equal number of ham feature vectors that had not been used in the
training set. We had 1,688 ham images that were not part of training, so a test file
contained at most 1,688 ham images and 1,688 spam images. For months with fewer than
1,688 spam images we randomly picked an equal number of ham images, and for months
with more than 1,688 spam images we randomly chose 1,688 of them. We then used the
decision tree to classify these test files. Table 5.4 lists the accuracies for each of the
months. The average accuracy of the classifier on unseen samples is 89%.
Table 5.4 Accuracy of classification for unseen samples

Month    Accuracy   Recall   Precision   F-Measure   FPR      FNR
10-Jan   0.9252     0.9055   0.9427      0.9237      0.0551   0.0945
10-Feb   0.9057     0.8618   0.9448      0.9014      0.0503   0.1382
10-Mar   0.9390     0.9313   0.9458      0.9385      0.0533   0.0687
10-Apr   0.8990     0.8513   0.9411      0.8939      0.0533   0.1487
10-May   0.8320     0.7174   0.9308      0.8103      0.0533   0.2826
10-Jun   0.7933     0.6439   0.9184      0.7570      0.0572   0.3561
10-Jul   0.8451     0.7469   0.9295      0.8282      0.0567   0.2531
10-Aug   0.9648     0.9828   0.9485      0.9654      0.0533   0.0172
10-Sep   0.9428     0.9390   0.9463      0.9426      0.0533   0.0610
10-Oct   0.8182     0.6913   0.9263      0.7918      0.0550   0.3087
10-Nov   0.9476     0.9359   0.9582      0.9469      0.0408   0.0641
10-Dec   0.8241     0.6912   0.9415      0.7972      0.0429   0.3088
11-Jan   0.8210     0.6960   0.9281      0.7955      0.0540   0.3040
Table 5.5 lists the computation time for extracting all the features for the spam and ham
images.
Table 5.5 Computing time for extracting features

Feature                       Spam (ms)   Ham (ms)   Ham + Spam (ms)   Avg. per image (ms)
Luminance                     938163      509577     1447740           66.9444
Saturation                    1146220     496305     1642525           75.9514
Hue                           1416485     573444     1989929           92.0156
Number of colors              1555486     615987     2171473           100.4103
White pixel concentration     1022250     461654     1483904           68.6167
Standard deviation            1283551     483472     1767023           81.7083
Total time for all features   7362155     3140439    10502594          485.6467
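The per-image average in the last column is consistent with the size of the corpus used
for training: the total of 10,502,594 ms divided by the 21,626 images (16,186 spam +
5,440 ham) gives approximately 485.6 ms of feature extraction time per image.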
The generated RepTree has 63 leaf nodes. Figure 5.1 shows part of the tree generated by
the classifier.
The most important features in the tree are average luminance, number of colors and
white pixel concentration. The same tree is used to test each of the 13 test files listed in
Table 5.4, which ensures that all samples are tested against the same classifier model.
Figure 5.1 Part of Decision tree generated by the classifier
CHAPTER VI
VALIDATION BY FEATURE ANALYSIS
We had 52,744 unique images from the year 2010 and about 500 images from January
2011. After feature extraction we looked for patterns in the feature values of image spam.
We had eight different features, and each feature takes values in a certain range: for
example, average luminance lies between 0 and 255, average hue between 0 and 1, the
number of colors between 0 and 1,677, and so on. We divided these ranges into intervals
of equal width and counted how many images had feature values in each interval. The
plots in Figure 6.1 below show the resulting distributions for spam and ham images.
[Figure 6.1 consists of paired histograms for image spam and ham images, plotting the
number of images against binned values of each feature: average luminance, average
color saturation, average hue, number of colors, percentage of pastel pixels, and the
standard deviations of the red, green and blue components.]
Figure 6.1 Plots for range of feature values for Image Spam and Ham images
The plots above show that most of the feature values for image spam lie in specific
ranges. For example, most spam images have an average luminance value in the range
200-220, whereas for ham images the values are spread over a wider range and none fall
in the range 200-220.
Hence this is the easiest way to determine whether an image could be spam. The decision
tree also chooses luminance as the root of the tree, with a cutoff value of 192.36 for
average luminance. Next, looking at the average color saturation, we see that ham images
are spread over a wider range than spam images, whose saturation values lie mostly
between 0 and 0.2; ham images have saturation values mostly between 0 and 0.8.
Similarly, the average hue of ham images is spread across more ranges than that of spam
images. The number of colors and the white/pastel pixel concentration are not very
helpful for separating spam from ham, as both spam and ham images have values in
similar ranges. The standard deviation of the color components is again more spread out
for ham images than for spam images. These are expected values, as photographs have
more varied shades of color than a spam image. Pictures are taken at different times of
the day, so their luminance values are spread across different ranges. Also, the luminance
of ham images is lower than that of spam images, because spam images are generally
brighter in order to attract the user's attention. The text may be skewed and there may be
random noise in the image, but the image is bright in appearance so that a human user
can read it easily while an OCR cannot.
Since the feature values fall in clearly distinct ranges, it is easy for the classifier to
separate spam from ham images. This explains the accuracy of 89% on unseen samples of
spam and ham images.
The plots also suggest that while spammers may rotate the content, add random noise or
random pixels, or skew the text to confuse the filters, some features do not change in
spite of these tricks. For example, the luminance value is similar across many different
spam images. Spammers cannot vary these properties easily, and this is probably the
reason we obtain a fairly high accuracy of 89%.
CHAPTER VII
TRENDS IN IMAGE SPAM
7.1 Count of Image spam
Most research work is dedicated to the classification or identification of spam images,
and very little literature exists on how image spam has evolved and changed in the past
few years. We wanted to see what the trends in spam images are and track the growth
and decline of image spam. In addition to the 2010 images we already had, we
downloaded images for the years 2008 and 2009 and the first two months of 2011. We
did not have an image spam collection for the entire year of 2008, so we took the counts
from August to December. Table 7.1 below lists the number of images in each month of
each year.
Table 7.1 Image spam count in 2008- 2011
Month    2008     2009     2010     2011
Jan        -       118     3171      717
Feb        -       764     3451      781
Mar        -      2268    16403        -
Apr        -      1008    18462        -
May        -      1277     7337        -
June       -     10863    18141        -
July       -      7840     6725        -
Aug     3660     12883    36003        -
Sep     5527     13329     9105        -
Oct     8021      8040     2233        -
Nov     3525      5883     2601        -
Dec      601      4119      943        -
From Table 7.1 we can see that spammers follow a pattern in sending image spam. The count increases slowly over the months, then drops suddenly in one of them, and somewhere in the middle of the year, around June to September, there is a sudden explosion in the number of images. Since most spam is sent using bots, the likely reason for a sudden drop is the shutdown of some of the spam bots. Image spam was quite low from November 2008, after the closing of the rogue hosting provider McColo that month [22]. It increased again from June 2009 and kept fluctuating until September 2010, when spamit.com, an affiliate program used by several spamming botnets, was closed down. The spam count then decreased, and in early 2011 we see fewer spam images. This is also reflected in our spam counts in Table 7.1, and the statistics are in accordance with worldwide statistics on image spam, even though we are using only a subset of it. A number of other botnets such as Pushdo/Cutwail and Mega-D have been identified and closed down, but there are always more to fill the void [22].
7.2 Trend of the Month
We also manually looked through the images to identify the prevalent trends in each month of the year. This analysis provides insight into the strategies spammers use when sending spam. For example, around New Year we see a lot of spam related to gifts, candies, fitness equipment, and so on: gifts and candies are common presents at that time, and the most common New Year resolution is to get fit. A similar trend is observed around Christmas, Valentine's Day, etc.
Figure 7.1 shows examples of images that appeared on February 14, 2011, observed as Valentine's Day. It is also interesting to note how spammers capture what people might be genuinely curious to see, for example weight loss, beauty products, and exercise equipment.
Figure 7.1 Images that appeared on February 14th - Valentine’s Day
We looked at image spam from each month and noted the most prominent trends for that month. We then added up the frequency of appearance of each trend across the year to check which image types are observed most frequently in image spam. Figure 7.2 below shows the graph of image type versus frequency of occurrence for the years 2010 and 2011.
The graph shows that the most frequently appearing types of spam are pharmaceutical, pornography, hardware devices, software products for sale, photo spam, clothes, and UPS or other delivery service message alerts. In fact, pharmaceutical image spam appears in every single month from January 2010 to February 2011.
[Bar chart omitted: "Frequency of Image Spam Types"; y-axis: type of image spam (wines, weight loss, Valentine's Day, text spam, sinks, scenic pictures, religious, protest mails for Europe, political, pharmaceutical, monitors, images that do not open, kid's shoes, insurance, holiday gifts, gifts, foreign language images, flowers, credit/debt help, chocolates, cats, cameras, architectural designs, advertising school); x-axis: frequency (0-15).]
Figure 7.2 Image spam types for the time period of January 2010 - February 2011
7.3 New trends in image spam
7.3.1 Scraped Images
One of the newer techniques for creating image spam is scraping the image so that it cannot be read by a file reader but can still be viewed in a picture editor. This makes it possible to convey the message to the user while preventing automated processing of the image. Scraping is one way to render an image unreadable; other forms of tampering include improper header information, incorrect color maps, and so on. File readers that require these to be in the correct format cannot parse the image data. Figure 7.3 shows an example of one such image. We have not dealt with these images in this research, and they are therefore removed during preprocessing.
Figure 7.3 Example of a scraped spam image
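As an illustration, the following is a minimal sketch of how such unreadable files could be screened out during preprocessing, assuming Java's standard ImageIO reader; it is not the exact preprocessing step used in this research.

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class ReadabilityFilter {
    // Returns true only if a registered reader can parse the file as an image.
    static boolean isReadable(File f) {
        try {
            BufferedImage img = ImageIO.read(f);
            return img != null;   // null: no reader recognized the data
        } catch (IOException e) {
            return false;         // corrupt header, bad color map, truncated data, etc.
        }
    }

    public static void main(String[] args) {
        for (String name : args) {
            System.out.println(name + (isReadable(new File(name)) ? " : ok" : " : skipped (unreadable)"));
        }
    }
}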
7.3.2 Malware embedding in images
Another trend we observed in spam images is the insertion of malware into JPEG files. Images from May, September, and November of 2010 had malware embedded in them; this was not seen in the earlier images from 2008-2009. Most of the malware was either a Trojan or a virus.
There is little literature available on how these images are used for malware embedding and how they attack the victim. In general, when a non-executable file containing an executable is double-clicked, only the non-executable file is opened. For example, if a JPG file has malware embedded in it, double-clicking opens the JPG file, not the executable. To execute the embedded executable, a loader is needed; this loader is part of malware already present on the infected machine, and it extracts the executable file and runs it. The executable is embedded inside another file to evade basic security filters. This is one of the ways malware can download malicious content from the web.
So, if a JPG with an embedded executable file exists, there is probably another infection running on the system which actually downloaded that JPG file. This is not the only way for an embedded executable to be run: the JPG could also exploit some other flaw in the system and run the executable. Generally, however, exploit files do not contain embedded files, so if we see a JPG with an embedded executable, it is unlikely to be an exploit attack; instead, there is a possibility that another infection is running on the system [23].
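One simple heuristic for flagging a JPG that carries an appended payload, sketched below purely for illustration (it was not implemented in this thesis), is to measure how many bytes follow the last JPEG end-of-image marker (FF D9); a large trailing block may indicate embedded content.

import java.io.File;
import java.nio.file.Files;

public class TrailingDataCheck {
    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(new File(args[0]).toPath());
        int lastEoi = -1;
        // Locate the last JPEG end-of-image marker (0xFF 0xD9).
        for (int i = 0; i + 1 < data.length; i++) {
            if ((data[i] & 0xFF) == 0xFF && (data[i + 1] & 0xFF) == 0xD9) {
                lastEoi = i;
            }
        }
        int trailing = (lastEoi < 0) ? 0 : data.length - (lastEoi + 2);
        // Many bytes after the marker may mean an appended (possibly executable) payload.
        System.out.println("bytes after EOI marker: " + trailing);
    }
}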
When we downloaded the images we did open some of them to look at the content, but opening the files with executables embedded in them did not infect the system. This could be because the loader needed to extract the executable was not installed on the machine; the images from Knujon are stripped from e-mails, so the loader may not have been part of them.
Recently, Microsoft's Malware Protection Center discovered a variant of a malicious image which looks like a simple .png file [24]. The image contains instructions asking the user to open it in MS Paint and then resave it as an .hta file of image type Bitmap. The lower part of the image looks like random noise, but when the file is resaved according to the directions, this noisy part decompresses into a JavaScript payload which executes when the .hta file is opened. The .hta extension denotes an HTML application, so the system ignores the leading BMP information and executes the following HTML/JavaScript, which then runs the payload. Figure 7.4 below shows a snapshot of the image and the subsequent .hta file [25]. Figure 7.5 shows the binary data of the image before and after resaving as an .hta file [25].
Figure 7.4 Malware Embedded in a .png Image
Figure 7.5 Binary data of the image before and after saving as an .hta file
This is a complicated technique, and it requires the user to perform many actions in order to succeed, so it is unlikely to be used in a widespread attack. It does, however, expose the possibilities for hiding malware and data in image formats. It is also possible that cyber criminals use such images simply as a medium to share malware and spread it easily.
CHAPTER VIII
CONCLUSIONS AND FUTURE WORK
The experiment has provided us with insightful observations about how spam images have evolved over a year. Many spam images are almost photo-quality images with multiple colors, which makes the classification process trickier as it becomes harder to distinguish them from photographs. Newer techniques are used to generate image spam, such as scraping off header information, images that do not load when viewed as thumbnails but open in a picture editor, malware injection, and so on. We also observed that images follow trends within a particular month; for example, the dominant trend in a month could be advertisements for wines, exercise equipment, pharmaceuticals, or chocolates, or it could be chain letters, advertisements for dating websites, links on making fast money, and many more.
The advantages of the described approach to classification are that the features we extract are very easily computable and that we achieve good accuracy on unseen samples from a recent time period. Irrespective of the file size we can compute the features easily, as resizing or converting formats does not affect their values. Unlike the methods described in [3] and [5], we compute high-level features, which can be obtained only after reading the file contents; but spam images today have quality comparable to photographs, so low-level features will not always be an effective measure. Even though we compute features from the image contents, the process is not time consuming, because we can resize the images and reduce the complexity.
We used simple tools such as Java programs and ImageMagick commands to carry out the experiments. Finally, an accuracy of 98% on trained samples and 88% on untrained samples is encouraging. Table 5.3 shows an accuracy of almost 94% for the November 2010 samples, even though these images were never used to train the classifier. This indicates that spam images recur with modifications such as a change in a few pixels, rotation, or added noise, but properties like luminance, saturation, and hue are not modified frequently. In each month we also get an accuracy of at least 79%; some months had very few spam samples, which could explain the slight dip in accuracy. If the classifier is updated frequently with new incoming spam images, it should provide better classification accuracy over time, as spam images recur with slight modifications each time.
A challenge in the experiment was obtaining ham images, which are not so easily available. We can download images using the Flickr API, but processing them is time consuming, and hence our corpus is limited. In our experiment we selected eight features of significance. As future work, a more detailed exploration of image processing might yield more features, such as the number of objects in an image, the presence of text detected using image segmentation, or the presence of random noise detected by examining adjacent pixels. We have not dealt with images that are scraped or have malware embedded in them. This is an interesting area of future work and might provide details such as the presence of steganography in an image, which could also be used as a feature in classification, since normal photographs do not have such content embedded in them.
REFERENCES
[1] Brad Templeton, http://www.templetons.com/brad/spamterm.html, last accessed
date 03/24/2010
[2] Types of Bitmaps, http://msdn.microsoft.com/en-us/library/ms536393%28VS.85%29.aspx, last accessed date 03/24/2010
[3] Flickr , http://www.flickr.com/ , last accessed date 03/24/2010
[4] Flickr Downloader tool, http://download.cnet.com/Flickr-Downloader/3000-12512_410790953.html, last accessed date 03/24/2010
[5] Google Images, http://www.google.com/imghp?hl=en&tab=wi, last accessed date
03/24/2010
[6] Wikipedia, Image formats, http://en.wikipedia.org/wiki/Image_file_formats, last
accessed date 03/24/2010
[7] NationalGeographic,
http://photography.nationalgeographic.com/photography/?source=NavPhoHome, last
accessed date 03/24/2010
[8] ImageMagick, http://www.imagemagick.org/script/index.php, last accessed date
03/24/2010
[9] Luminance of Images, http://www.cacs.louisiana.edu/~cice/lal/index.html, last
accessed date 03/24/2010
[10] Spammer’s Compendium, http://www.jgc.org/tsc.html, last accessed date
03/24/2010
[11] Hope, P., Bowling, J. R., and Liszka, K. J., Artificial Neural Networks as a Tool for
Identifying Image Spam, The 2009 International Conference on Security and
Management (SAM'09), July 2009, pp. 447-451.
[12] Chao Wang; Fengli Zhang; Fagen Li; Qiao Liu; , "Image spam classification based
on low-level image features," Communications, Circuits and Systems (ICCCAS),
2010 International Conference on , vol., no., pp.290-293, 28-30 July 2010
[13] Aradhye, H.B.; Myers, G.K.; Herson, J.A.; , "Image analysis for efficient
categorization of image-based spam e-mail," Document Analysis and Recognition,
2005. Proceedings. Eighth International Conference on , vol., no., pp. 914- 918 Vol.
2, 29 Aug.-1 Sept. 2005
[14] Krasser, S.; Yuchun Tang; Gould, J.; Alperovitch, D.; Judge, P.; , "Identifying Image
Spam based on Header and File Properties using C4.5 Decision Trees and Support
Vector Machine Learning," Information Assurance and Security Workshop, 2007.
IAW '07. IEEE SMC , vol., no., pp.255-261, 20-22 June 2007
[15] Yan Gao, Ming Yang, Xiaonan Zhao, Bryan Pardo, Ying Wu, Thrasyvoulos N.
Pappas, Alok Choudhary, “Image Spam Hunter”, EECS Dept., Northwestern Univ.,
Evanston, IL, Proceedings Acoustics, Speech and Signal Processing, 2008. ICASSP
2008.
[16] Peizhou He, Xiangming Wen, Wei Zheng, “A Simple Method for Filtering Image
Spam”, 2009 Eighth IEEE/ACIS International Conference on Computer and
Information Science
[17] C. Frankel, M. Swain, and V. Athitsos, “Webseer: An Image Search Engine for the
World Wide Web,” Univ. of Chicago Technical Report TR96-14, 1996.
[18] Data Mining: Practical Machine Learning Tools and Techniques, Ian H. Witten, Eibe
Frank, ISBN-13: 978-0-12-088407-0
[19] Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers,
1993
[20] Weka, http://www.cs.waikato.ac.nz/ml/weka/, last accessed date 03/24/2010
[21] Attribute-Relation File Format, http://www.cs.waikato.ac.nz/~ml/weka/arff.html, last
accessed date 03/24/2010
[22] Help Net Security, http://www.net-security.org/secworld.php?id=10594, last
accessed date 03/24/2010
[23] Wilders Security Forums,
http://www.wilderssecurity.com/showthread.php?t=251875&highlight=malware+im
ages, last accessed date 03/24/2010
[24] Sunnetbeskerming,
http://www.beskerming.com/commentary/2010/08/12/527/Malware_in_Images,_a_S
ocial_Engineering_Example, last accessed date 03/24/2010
[25] Painting by Numbers, Microsoft Malware Protection Center, Threat Response and
research blog, http://blogs.technet.com/b/mmpc/archive/2010/08/09/painting-by-numbers.aspx, last accessed date 03/24/2010
[26] E-mail address harvesting, http://en.wikipedia.org/wiki/E-mail_address_harvesting,
last accessed date 06/06/2011
[27] Image Spam Dataset, http://www.cs.jhu.edu/~mdredze/datasets/image_spam/, last
accessed date 06/06/2011
[28] Princeton Spam Image Benchmark, http://www.cs.princeton.edu/cass/spam/, last
accessed date 06/06/2011
[29] Dredze Mark, Gevaryahu Reuven, Elias-Bachrach Ari, “Learning Fast Classifiers for
Image Spam,” Fourth Conference on Email and Anti-Spam (CEAS), 2007.
APPENDICES
APPENDIX A
DATA ANALYSIS
In order to understand the patterns and trends in image spam over the last year, we performed several analyses to come up with a list of the most frequent trends. This section describes the trend analysis process in detail.
After cleaning out pornographic images and photo spam, we counted the number of images in each month and calculated the feature values for all the spam images for the year 2010. We had 52,744 unique images and feature values for each of them. To observe patterns in the feature values, we initially took the average of each feature over the whole set of 52,744 images and plotted it. Figure A.1 below shows the graphs generated for the average value of each feature for the whole year.
Figure A.1 Average of feature values for all Image Spam in the year of 2010
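For illustration, these per-feature averages could be computed with a short program like the sketch below, which assumes a comma-separated file containing one row of numeric feature values per image; the actual format of our feature files may differ.

import java.io.BufferedReader;
import java.io.FileReader;

public class FeatureAverages {
    public static void main(String[] args) throws Exception {
        double[] sums = null;
        long rows = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(",");
                if (sums == null) sums = new double[parts.length];
                for (int i = 0; i < parts.length; i++) {
                    sums[i] += Double.parseDouble(parts[i].trim());
                }
                rows++;
            }
        }
        if (sums == null) return;   // empty input file
        // Average of each feature column over all images in the file.
        for (int i = 0; i < sums.length; i++) {
            System.out.printf("feature %d average: %.4f%n", i, sums[i] / rows);
        }
    }
}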
APPENDIX B
GENERATING MD5SUM AND SELECTING UNIQUE FILES
We used unique images to train the classifier. To generate the md5sums we used a small script that simply computes the md5sum of every file in a directory recursively, and we then used a Java program to list the duplicate files. Duplicates were removed based on the time stamp, preserving the file with the older time stamp, and another script deleted the files listed by the Java program. This section shows the two scripts used.
#!/bin/sh
folders=`ls $1`
for all in $folders
do
    files=`ls $1/$all`
    for f in $files
    do
        md5sum $1/$all/$f >> totalMd5.txt
    done
done
Figure B.1 Script to generate md5sum
The script in Figure B.1 takes a directory as a command-line argument and, for each file in each subdirectory, computes the md5sum and appends it to a file called totalMd5.txt in the current directory.
The output of the script is shown in Figure B.2 below. It lists the md5sum and the complete path of each file.
58e33c78c71af982a7d2cd30410b2051  RENAMED2010Images/April_renamed/spam10000.jpg
d15eca70160d379e7c00ca3e99e83fdf  RENAMED2010Images/April_renamed/spam10001.jpg
b9de7049b5cfcea19f80d4fd55f27386  RENAMED2010Images/April_renamed/spam10002.jpg
0c85c7f4ed35fc533ff96a50eff932ac  RENAMED2010Images/April_renamed/spam10003.jpg
887f3bb4449b81305af7f4a4f6b207f1  RENAMED2010Images/April_renamed/spam10004.jpg
8bd3f7b70c5f7de4e21c611943716601  RENAMED2010Images/April_renamed/spam10005.jpg
8bed5edd0494f480a1585bd16d435b23  RENAMED2010Images/April_renamed/spam10006.jpg
a6e9cd6a120e15317089b8a69566d98f  RENAMED2010Images/April_renamed/spam10007.jpg
fde107714769faa696ecebedbad20a01  RENAMED2010Images/April_renamed/spam10008.jpg
31167a234781f3244137384afa55aef3  RENAMED2010Images/April_renamed/spam10009.jpg
5046c4254cd9f1b58b66d750117d29a3  RENAMED2010Images/April_renamed/spam1000.jpg
f61d48470a87afb9f4251fa619dc7f03  RENAMED2010Images/April_renamed/spam10010.jpg
9979242622466893910121e5f9dc8f71  RENAMED2010Images/April_renamed/spam10011.jpg
2a7aa17d86385cd3d8b1edbde1042f12  RENAMED2010Images/April_renamed/spam10012.jpg
d0565db3dbb9df703b6ba2f52ce1c825  RENAMED2010Images/April_renamed/spam10013.jpg
16b953f44c00f243a073dcbee6b7676d  RENAMED2010Images/April_renamed/spam10014.jpg
2384245e6a800a79064888dd60d306fd  RENAMED2010Images/April_renamed/spam10015.jpg
c0ca7f0c0eb846df8ef5d0c6f1026f46  RENAMED2010Images/April_renamed/spam10016.jpg
9ad7ae4f335525e26e8e8a1e7b11c94e  RENAMED2010Images/April_renamed/spam10017.jpg
4d489c5bd9a3653f5f55e544d86ed596  RENAMED2010Images/April_renamed/spam10018.jpg
Figure B.2 Output of md5sum script
The script in Figure B.3 takes a text file as input and deletes the files listed in it.
#!/bin/sh
cat $1 | xargs -I % rm %
Figure B.3 Script to delete the duplicate files
The output of the Java program is shown in Figure B.4. It simply lists the md5sum and the name of each file to be deleted using the script in Figure B.3.
3031c161229f18aecb506ea36c7c5fd2  January2009/17Jan2009/37668-Gn27Swc1.gif
3031c161229f18aecb506ea36c7c5fd2  January2009/17Jan2009/5097-Gn27Swc1-4.gif
3031c161229f18aecb506ea36c7c5fd2  January2009/17Jan2009/73252-Gn27Swc1-1.gif
3031c161229f18aecb506ea36c7c5fd2  January2009/17Jan2009/89006-Gn27Swc1-5.gif
3031c161229f18aecb506ea36c7c5fd2  January2009/17Jan2009/97853-Gn27Swc1-2.gif
38bccaf5a2599b6306b7d72e8e1e1b4f  January2009/13Jan2009/66763-image002-4.jpg
38bccaf5a2599b6306b7d72e8e1e1b4f  January2009/13Jan2009/67048-image002-3.jpg
3a59d81ecbbe5b02afd265d9921e563e  January2009/7Jan2009/88273-EK2xUr%280A%29.gif
647e0d7808b2b21a4dff05a7c980a803  January2009/7Jan2009/4790-Sodu%28eL%29.gif
77c84cdb68665fe7b042f8d2d963d249  January2009/13Jan2009/88162-image001-4.jpg
Figure B.4 Output of the Java program to identify duplicate files
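The Java program itself is not reproduced here; the sketch below shows one possible way to perform the same duplicate-listing step, assuming the totalMd5.txt format of Figure B.2 (md5sum followed by the file path on each line) and keeping the file with the oldest time stamp in each group, as described above. The actual program may differ. Its output can then be passed to the deletion script in Figure B.3.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateLister {
    public static void main(String[] args) throws Exception {
        Map<String, List<File>> byHash = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Each line of totalMd5.txt: "<md5sum>  <path>"
                String[] parts = line.trim().split("\\s+", 2);
                if (parts.length < 2) continue;
                byHash.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(new File(parts[1]));
            }
        }
        for (Map.Entry<String, List<File>> e : byHash.entrySet()) {
            List<File> files = e.getValue();
            if (files.size() < 2) continue;
            // Keep the oldest copy; list the rest for deletion.
            files.sort((a, b) -> Long.compare(a.lastModified(), b.lastModified()));
            for (int i = 1; i < files.size(); i++) {
                System.out.println(e.getKey() + " " + files.get(i).getPath());
            }
        }
    }
}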