USENIX Association
Proceedings of the 13th USENIX Security Symposium
San Diego, CA, USA, August 9–13, 2004

© 2004 by The USENIX Association. All Rights Reserved. For more information about the USENIX Association: Phone: 1 510 528 8649; FAX: 1 510 548 5738; Email: [email protected]; WWW: http://www.usenix.org. Rights to individual papers remain with the author or the author's employer. Permission is granted for noncommercial reproduction of the work for educational or research purposes. This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein.

Privacy-Preserving Sharing and Correlation of Security Alerts

Patrick Lincoln* (lincoln@csl.sri.com), Phillip Porras† (porras@sdl.sri.com), Vitaly Shmatikov‡ (shmat@csl.sri.com)
SRI International

* Partially supported by ONR grants N00014-01-1-0837 and N00014-03-1-0961 and Maryland Procurement Office contract MDA904-02-C-0458.
† Partially supported by ARDA under Air Force Research Laboratory contract F30602-03-C-0234.
‡ Partially supported by ONR grants N00014-01-1-0837 and N00014-03-1-0961.

Abstract

We present a practical scheme for Internet-scale collaborative analysis of information security threats which provides strong privacy guarantees to contributors of alerts. Wide-area analysis centers are proving a valuable early warning service against worms, viruses, and other malicious activities. At the same time, protecting individual and organizational privacy is no longer optional in today's business climate. We propose a set of data sanitization techniques that enable community alert aggregation and correlation, while maintaining privacy for alert contributors. Our approach is practical, scalable, does not rely on trusted third parties or secure multiparty computation schemes, and does not require sophisticated key management.

1 Introduction

Over the past few years, computer viruses and worms have evolved from nuisances to some of the most serious threats to Internet-connected computing assets. Global infections such as Code Red and Code Red II [21, 40], Nimda [30], Slammer [20], MBlaster [18], and MyDoom [17] are among an ever-growing number of self-replicating malicious code attacks plaguing the Internet with increasing frequency. These attacks have caused major disruptions, affecting hundreds of thousands of computers worldwide. Recognition and diagnosis of these threats play an important role in defending computer assets. Until recently, however, network defense has been viewed as the responsibility of individual sites. Firewalls, intrusion detection, and antivirus tools are, for the most part, deployed in the mode of independent site protection. Although these tools successfully defend against low or moderate levels of attack, no known technology can completely prevent large-scale concerted attacks.

There is an emerging interest in the development of Internet-scale threat analysis centers. Conceptually, these centers are data repositories to which pools of volunteer networks contribute security alerts, such as firewall logs, reports from antivirus software, and intrusion detection alerts (we will use the terms analysis center and alert repository interchangeably). Through collection of continually updated alerts across a wide and diverse contributor pool, one hopes to gain a perspective on Internet-wide trends, dominant intrusion patterns, and inflections in alert content that may be indicative of new wide-spreading threats. The sampling size and diversity of contributors are thus of great importance, as they impact the speed and fidelity with which threat diagnoses can be formulated.

We are interested in protecting sensitive data contained in security alerts against malicious users of alert repositories and corrupt repositories. The risk of leaking sensitive information may negatively impact the size and diversity of the contributor pool, add legal liabilities to center managers, and limit accessibility of raw alert content.

We consider a three-way tradeoff between privacy, utility, and performance: privacy of alert contributors; utility of the analyses that can be performed on the sanitized data; and the performance cost that must be borne by alert contributors and analysts. Our objective is a solution that is reasonably efficient, privacy-preserving, and practically useful.

We investigate several types of attacks, including dictionary attacks which defeat simple-minded data protection schemes based on hashing IP addresses. In particular, we focus on attackers who may use the analysis center as a means to probe the security posture of a specific contributor and infer sensitive data such as internal network topology by analyzing (artificially stimulated) alerts.

We present a set of techniques for sanitization of alert data. They ensure secrecy of sensitive information contained in the alerts, while enabling a large class of legitimate analyses to be performed on the sanitized alert pool. We then explain how trust requirements between the alert contributors and analysis centers can be further reduced by deploying an overlay protocol for randomized alert routing, and give a quantitative estimate of anonymity provided by this technique. We conclude by discussing performance issues.

2 Related Work

Established Internet analysis centers, such as DShield [34] and Symantec's DeepSight [32], gather alerts from a diverse population of sensors. For example, in April 2003, DShield reported a contributor pool of around 41,000 registered participants and around 2000 regular submitters, who submit a total of 5 to 10 million alerts daily [7]. These centers proved effective in recognizing short-term inflections in alert content and volume that may indicate wide-scale malicious phenomena [39], as well as in tracking important security trends that may allow sites to better tune their security postures [31].
Other research has shown how to use distributed security information to infer Internet DoS activity [22], and how to improve the speed and accuracy of large-scale multi-enterprise alert analysis centers [38].

Alert sharing communities have not yet enjoyed wide-scale adoption, in part due to privacy concerns of potential alert contributors and managers of community alert repositories. Raw alerts may expose site-private topological information, proprietary content, client relationships, and the site's defensive capabilities and vulnerabilities. With this in mind, established systems suppress sensitive alert content before it is distributed to analysis centers (e.g., field suppression is a configurable option in DShield's alert extraction software). Even with these measures, organizations such as DeepSight and DShield must be granted a substantial degree of trust by the alert producers, since suppression and anonymization must be balanced against the need to maintain the utility of the alert.

2.1 Packet trace anonymization

Several approaches have been proposed for anonymization of Internet packet traces [25, 36, 24]. For example, Pang and Paxson proposed a high-level language and tool [24] as part of the Bro package, enabling anonymization of packet header and content. They are interested in wide-scale network traces such as FTP sessions, while our application is alert management. Further, we examine strategies that mitigate dictionary attacks from adversaries who can stimulate and then observe alert production within the target's site.

2.2 Database obfuscation

The database community has examined the problem of mining aggregate data while protecting privacy at the level of individual records. One approach is to randomly perturb the values in individual records [1, 2] and compensate for the randomization at the aggregate level. This approach is potentially vulnerable to privacy breaches. If a data item is repeatedly submitted and perturbed (differently each time), much information about the original value can be inferred.
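
To see why repeated perturbation leaks information, consider a minimal sketch that assumes the perturbation adds independent zero-mean noise to a numeric value. The noise model, the function names, and the parameters below are illustrative assumptions and are not taken from the cited schemes. Averaging many independently perturbed reports of the same value cancels most of the noise.

```python
import random
import statistics


def perturb(value, spread, rng):
    """Report `value` plus independent noise drawn uniformly from [-spread, spread]."""
    return value + rng.uniform(-spread, spread)


def estimate_from_reports(reports):
    """Attacker's estimate: zero-mean noise averages out across repeated reports."""
    return statistics.fmean(reports)


rng = random.Random(2004)       # fixed seed so the sketch is repeatable
true_value = 135.0              # hypothetical sensitive value, e.g. a port number
reports = [perturb(true_value, 50.0, rng) for _ in range(10_000)]

# Error of a single perturbed report versus the error of the averaged estimate.
single_error = abs(reports[0] - true_value)
averaged_error = abs(estimate_from_reports(reports) - true_value)
```

A single report can be off by as much as the full noise spread, while the error of the averaged estimate shrinks roughly with the square root of the number of reports.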
In our context, an attacker could intentionally probe the same IP address using the same attack strings. If the (randomly perturbed) reports of the attack are disambiguated from other alerts based on the attack's unique statistical aspects, the attacker can use them to learn important details of the original alert.

2.3 SMC schemes

Consider two or more parties who want to perform a joint computation, but no party is willing to reveal its input. This problem is known as Secure Multiparty Computation (SMC). It deals with computing a probabilistic function in a distributed system where each participant independently holds one of the inputs, while ensuring correctness of the computation and revealing no information to a participant other than his input and output. There exist general-purpose constructions that convert any polynomial computation to a secure multiparty computation [37]. Recent work has considerably improved the efficiency of such computations when an approximate answer is sufficient [13]. Applications include privacy-preserving data classification, clustering, generalization, summarization, characterization, and association rule mining. Clifton et al. [8] present methods for secure addition, set union, size of set intersection, and scalar product. Lindell and Pinkas [19] propose a protocol for secure decision tree induction, consisting of many invocations of smaller private computations such as oblivious function evaluation. Unfortunately, the cost of even the most efficient SMC schemes is too high for the purpose of large-scale security alert distribution.

3 Format of Security Alerts

Network data collected to support threat analysis, fault diagnosis, and intrusion report correlation may range from simple MIB statistics to detailed activity reports produced by complex applications such as intrusion or anomaly detection systems. So far, we have used the term security alert loosely to refer to site-local activity produced by a network security component (sensor) as it reports on observed activity or upon an action it has taken in response to observed activity. A security alert can represent a very diverse range of information, depending on the type of the security device that produced it. In this section, we consider the typical content of security alerts from the three primary types of alert contributors used in the context of Internet-scale threat analysis centers.

Firewalls reside at the gateways of networks, and contribute reports that indicate "deny" and "allow" actions for traffic across the gateway boundary. Most typically, firewalls contribute alerts flagging incoming packets that were denied. Volume, port, and source distribution patterns of such packets provide significant insight into the probe and exploit targets of malicious systems, new attack tools, and self-propagating malicious applications.

Intrusion detection systems include network- and host-based systems, and may employ misuse or anomaly detection. Unlike firewalls, intrusion detection reports may represent a wide variety of event types, and can report on anomalous phenomena that span arbitrarily long durations of time or events.

Antivirus software reports email- and file-borne virus detection on individual hosts. Reports include virus type, infection target, and the response action, which is typically to clean or quarantine the infection.

Table 1 summarizes the fields that constitute a typical firewall (FW), intrusion detection (ID), or antivirus (AV) security alert in its raw form, prior to data sanitization.

Field          Alert types   Description
Source IP      FW, ID        Typically refers to the source IP address of the machine that initiated the session or transferred the transaction that caused the alert to fire. In IDS alerts, this field may represent the victim, not the attacker, since some systems alert upon an attack reply rather than request.
Source Port    FW, ID        Source TCP or UDP port of the machine that initiated the session or transferred the transaction that caused the alert to fire.
Dest IP        FW, ID, AV    Typically refers to the destination IP address of the session or transaction that caused the alert to fire. In AV systems, Dest IP can identify the machine in which the infection is discovered.
Dest Port      FW, ID        Destination TCP or UDP port of the session or transaction that caused the alert to fire.
Protocol       FW, ID        Protocol type (e.g., UDP, TCP, ICMP).
Timestamp      FW, ID, AV    May incorporate incident start time, end time, incident report time.
Sensor ID      FW, ID, AV    May incorporate the brand and model of the sensor and a unique identifier for the individual instantiation of the sensor.
Count          FW, ID, AV    Often used to represent some notion of repeated activity, either at the alert or event (e.g., packet) level.
Event ID       FW, ID, AV    Uniquely defines the alert type for the given sensor.
Outcome        FW, ID, AV    Reports the status or disposition of the reported activity. For firewalls, it may report whether the log entry was associated with an allow or deny rule. For AV, it may indicate infection disposition (e.g., Symantec's AV indicates whether the infected file is cleaned or quarantined). Outcome fields for IDS tools are highly vendor-specific.
Captured Data  ID            Some IDS sensors have the ability to report part or all of the data content in which the alert was applied.
Infected File  AV            Antivirus logs include the identity of the file that was infected.

Table 1: Summary of security alert content.

4 Threat Model

To support collaborative threat analysis, the alert repository will be published, at least partially, and thus made available to the attacker. In the worst case, the adversary may be able to compromise the alert repository and gain direct access to raw alerts reported to that repository. It is thus very important to ensure that alerts are reported in a sanitized form that preserves privacy of sensitive information about the producer's network. In this section, we outline the goals of a typical attacker and the means he or she may employ to subvert our alert sharing scheme.

4.1 Sensitive fields

IP addresses.
Any field that contains an IP address, such as Source IP or Dest IP, is sensitive, since it reveals potentially valuable information about the internal topology of the network under attack. Knowing the relationship between IP addresses and various types of alerts may allow the attacker to track propagation of the attack through a network which is not normally visible to him (e.g., located behind a firewall). Even though the Source IP field is usually associated with the source of the attack, it may (a) contain the address of an infected system on the internal network, or (b) identify organizations that have a legitimate relationship with the targeted network. For example, the attacker may be able to discover that attacking a particular system in organization A leads to alerts arriving from a sensor within organization B with A's address in the Source IP field, and thus learn that there is a relationship between the two organizations. Popular intrusion detection systems such as Snort [28] include rules that are highly prone to producing false positives, while other rules simply log security-relevant events that are not specifically associated with an attack. An attacker who is aware of such behavior can closely analyze the source IP addresses of these alerts to gain a sense of the sites with which the producer regularly communicates.

Captured and infected data. Data contained in Captured Data and Infected File fields are extremely sensitive. File names, email addresses, document fragments, pieces of IP addresses, application-specific data and so on may leak private information stored on infected systems and reveal network topology or site-specific vulnerabilities.

4.2 Sensitive associations

The attacker may use certain associations between the fields of a security alert to learn the security posture of the producer site.

Configurations. Sensitive information includes the site's set of network services, protocols, operating systems, and network-accessible content residing within its boundaries.
While some of this information may be revealed through direct interactions with external systems, the breadth of probing can be monitored and controlled by the target site. Associations between security alert fields that could potentially lead to undesirable disclosures include [Source IP, Source Port, Protocol] and [Dest IP, Dest Port, Protocol].

Site vulnerabilities. Revealing the disposition of unsuccessful attacks may be undesirable. Associations between alert producers and the Sensor ID, Event ID and Outcome fields may potentially lead to such disclosures.

Defense coverage. Sites may not want to reveal their detection coverage, including information about versions and configurations of security products that are operating within their boundaries. Attacks and probes mounted against a site with the intention of observing, potentially through indirect inference, which sensors are running and their alert production patterns, would seriously impact the site's security posture. Associations between alert producers and the Sensor ID and Event ID fields are thus sensitive.

In current practice, these sensitivities are handled in a variety of ways. The first approach is to suppress sensitive fields at the alert producer's site before the alert is forwarded to a remote alert repository. For example, the DShield alert extractor provides various configuration options to suppress fields and an IP blacklist that allows a site to suppress sensitive addresses. The second approach is to apply cryptographic hashing to fields, allowing equality checks while maintaining a degree of content privacy (this approach may be vulnerable to dictionary attacks, as explained below). The third approach is simply to trust the alert repository with ensuring that neither content nor indirect associations be openly revealed.

4.3 Potential attacks

We describe several threats faced by any alert sharing scheme, in the order of increasing severity. The attacker may launch attacks of several types simultaneously.

Casual browsing.
Alerts published by a repository may be copied, stored and shared by any Internet user, and are thus forever out of control. The mildest attack is casual browsing, where a curious user looks for familiar IP prefixes and sensor IDs in the published alerts. This attack is easy to defend against, e.g., by hashing all sensitive data.

Probe-response. A determined attacker may attempt to use the alert repository as a verification oracle. For example, he may target a particular system and then observe the alerts published by the repository to determine whether the attack has been detected, and, if so, how it was reported. By comparing IP addresses contained in the reported alert with that of the targeted system, the attacker may learn network topology, sensor locations, and other valuable information.

Dictionary attacks. The attacker can precompute possible values of alerts that may be generated by the targeted network, and then search through the data published by the repository to find whether any of the actual alerts match his guesses. This attack is especially powerful since standard hashing of IP addresses does not protect against it. For example, the attacker can simply compute hashes for all 256 IP addresses on the targeted subnet and check the published alerts to see if any of the hash values match. Using semantically secure encryption on sensitive fields is sufficient to foil dictionary attacks, but such encryption also makes collaborative analysis infeasible because two encryptions of the same plaintext produce different ciphertexts with overwhelming probability. A polynomially-bounded analyst cannot feasibly perform equality comparisons unless he knows the key or engages in further interaction with the alert producer.

Alert flooding.
If the repository publishes only the highest-volume alerts (or those satisfying any other group condition), the attacker may target a particular system and then "flush out" the stimulated alert by flooding the repository with fake alerts that match the expected value of the alert produced by the targeted system. This involves either spoofing source addresses of legitimate sensors, setting up a bogus sensor, or taking over an existing sensor. Flooding will cause the repository to publish the real alert along with the fakes. The attacker can discard the fakes and analyze the real alert.

Repository corruption. Finally, the attacker may deliberately set up his own repository or take control of an existing repository, perhaps in a manner invisible to the repository administrator. This attack is particularly serious. It eliminates the need for alert flooding and aggravates the consequences of probe-response, since it gives the attacker immediate access to raw reported alerts, as well as the ability to determine exactly (e.g., by inspecting incoming IP packets) where the alert has arrived from. We describe several partial solutions in section 6. Solutions based on sophisticated cryptographic techniques such as oblivious transfer [26] currently appear impractical. They provide better theoretical privacy at the cost of an unacceptable decrease in utility and performance, but the balance may shift in favor of cryptography-based solutions with the development of more practical techniques.

5 Alert Sharing Infrastructure

To enable open collaborative analysis of security alerts and real-time attack detection, we propose to establish alert repositories which will receive alerts from many sensors, some of them public and located at visible network nodes and others hidden on corporate networks deep behind firewalls.
Achieving this requires a robust architecture for information dissemination, ideally with no single point of failure (to provide higher reliability in the face of random faults and outages), no single point of trust (to provide stronger privacy guarantees against insider misuse in any one organization), and few if any leverage points for attackers.

The core of the proposed system is a set of repositories where alerts are stored and accessed during analysis. Each repository is very simple: it accepts alerts from anywhere, strips out source information, and publishes them immediately or after some delay. There is no cryptographic processing and no key management (unless the repository performs re-keying; see section 6.2). As described in section 6.3, multiple repositories make it more difficult for the attacker to infer the source of sanitized alerts. The repositories may share alerts, but they are not required to be synchronized, thus not every alert will be visible to every analysis engine. For performance reasons, analysis engines normally interact with a single repository or mirror site.

Figure 1 shows the major data flows among a small set of sensors, producers, repositories, and analysis engines. The sensor trapezoids consist of firewalls, intrusion detection systems, antivirus software, and possibly other security alert generators. The producer boxes represent local collection points for an enterprise or part of an enterprise. These boxes perform the sanitization steps such as hashing IP addresses, and are controlled by the reporting organization. The repository cylinders represent public or semipublic databases containing reported data. A repository may be controlled by a producer or by an analysis organization. The analysis diamonds represent analysis services which process the published alerts for historical trends, event frequency changes, and other aggregation or correlation functions.
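
The repository behavior described above (accept an alert from anywhere, strip its source information, and publish it immediately or after a delay) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `Repository`, its methods, and the representation of alerts as dictionaries are all hypothetical.

```python
import heapq
import itertools


class Repository:
    """Toy alert repository (hypothetical API): accept, strip source, publish."""

    def __init__(self, delay_seconds=0.0):
        self.delay_seconds = delay_seconds
        self._pending = []                # min-heap of (release_time, seq, alert)
        self._seq = itertools.count()     # tie-breaker so dicts are never compared
        self.published = []

    def submit(self, alert, sender_address, now):
        """Accept an already sanitized alert; the sender address is discarded."""
        del sender_address                # source information is never stored
        release_time = now + self.delay_seconds
        heapq.heappush(self._pending, (release_time, next(self._seq), dict(alert)))

    def publish_due(self, now):
        """Move every alert whose delay has elapsed to the published list."""
        while self._pending and self._pending[0][0] <= now:
            _, _, alert = heapq.heappop(self._pending)
            self.published.append(alert)
        return self.published


repo = Repository(delay_seconds=60.0)
repo.submit({"dest_port": 135, "protocol": 6}, sender_address="192.0.2.7", now=0.0)
early = list(repo.publish_due(now=30.0))    # nothing is released yet
late = list(repo.publish_due(now=60.0))     # the alert appears, without its sender
```

Note that the sketch performs no cryptographic processing and holds no keys, in line with the description; sanitization is assumed to have happened at the producer.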

An enterprise (such as a major research lab famed for computer security research) may be sensitive to public disclosure of possible attacks, and wish to keep private even the volume of alerts it generates. As described in section 6.3, the repositories can optionally form a randomized alert routing network. Although we have not implemented this feature, randomized routing can provide strong anonymity guarantees for alert sources. A repository may also be configured so that only events whose volume exceeds a certain threshold are published. This will have relatively little impact on historical and inflection analysis (see section 7), but may disable identification of stealth attacks associated with low alert volumes.

Figure 1: Data flows in alert processing.

As shown in figure 2, sensors vary greatly in the volume of alerts they produce in a given day, but the total alert volume is substantial. This graph depicts the number of alerts produced on a single day by 1,416 sensors reporting to DShield. At the high end, over 7 million alerts were produced by one firewall, apparently experiencing a certain DoS-like attack. Several other sensors were near or above a million alerts. The median sensor produced only 177 alerts. The total alert volume of 19,147,322 alerts reported on that day, across a total of 1,416 different sensors from many organizations spread over a wide geographic area, constrains practical implementation choices. In particular, secure multiparty computation (SMC) approaches (see section 2.3) and many privacy-preserving data mining techniques add impractical levels of overhead to alert analysis. With over a thousand reporting sensors, naive SMC approaches would require tremendous network bandwidth and unsupportable CPU or cryptographic coprocessor performance for even moderate levels of analysis query traffic. It is possible that special-purpose SMC schemes developed specifically for this problem would prove more practical. In this paper, we propose simple solutions which enable a broad set of analyses on sanitized alerts that would normally require raw alert data.

Figure 2: Alert volume per sensor (semi-log scale). Data courtesy DShield.

6 Alert Sanitization

We propose several techniques that are used in combination to protect the alert sharing infrastructure from threats described in section 4. Some of the mechanisms are "heavier" than others and impose higher communication and computational requirements on alert contributors. On the other hand, they provide better protection against serious threats such as complete corruption of the alert repository. The exact set of techniques may be selected by each organization or contributor pool individually, depending on the level of trust they are willing to place in a particular repository or set of repositories.

6.1 Design requirements

We do not consider solutions that require alert sources to trust the repository with protecting privacy of the reported data. In the context of completely open public repositories, as opposed to trusted services such as DeepSight [32] and DShield [34], such solutions are both impractical (a commercial enterprise is unlikely to trust an open repository to be careful with business secrets) and dangerous for the repository operator, as she may be exposed to legal liability if the repository is attacked and private alert data compromised.

We also rule out solutions that require sharing of secret keys between sensors. An obvious solution might involve encrypting sensitive data with a common key to enable alert comparison by infrastructure participants, while hiding the data from a casual observer. This approach may solve the corrupt repository problem, but it is vulnerable if the attacker signs up as a participant, gains access to the common key, and breaks privacy of alerts generated by all other participants.

Finally, solutions that require multiple producers to collaborate and/or interact to protect a single alert are impractical in our context. Given the volume of alerts, especially when the network is under attack, the communication overhead is likely to prove prohibitive. This eliminates mechanisms based on threshold cryptography [11, 14] such as proactive security [15, 6], and secure multiparty computation (see section 2.3), even though they remain secure when a subset of participants has been corrupted by the adversary.

6.2 Basic privacy protection

Scrubbing sensitive fields. Before an alert is sent to the repository, the producer must remove all sensitive information not needed for the collaborative analyses described in section 7, including all content in Captured Data, Infected File and Outcome fields. A more advanced version of our system may enable privacy-preserving analysis based on commonalities in the Captured Data field, e.g., presence of "bad words" associated with a particular virus. Possible techniques include encryption with keyword-specific trapdoors in the manner of [29, 5]. The Sensor ID field may be either remapped to a unique persistent pseudonym (e.g., a randomly generated string) that leaks no information about the organization that owns it, or replaced with just the make and model information. The Timestamp field is rounded to the nearest minute. Although this disables fine-grained propagation analyses, it adds uncertainty for attackers staging probe-response attacks.

Hiding IP addresses. Suppose the attacker controls the repository. He may launch an attack and then attempt to use the alert generated by the victim's sensor to analyze the attack's propagation through the victim's internal network. Therefore, the producer must hide both Source IP and Dest IP addresses before releasing the alert to the repository. Encrypting IP addresses under a key known only to the producer is unacceptable, as it hides too much information.
With a semantically secure encryption scheme, encrypting the same IP address twice will produce different ciphertexts, disabling collaborative analysis. Hashing the address using a standard, universally computable hash function such as SHA-1 or MD5 enables dictionary attacks. If the attacker controls the repository, he can target a system on a particular subnet and pre-compute hash values of all possible IP addresses at which sensors may be located or to which he expects the attack to propagate. This is feasible since the address space in question is relatively small: either 256 or 65536 addresses (potentially even smaller if the attacker can make an educated guess). The attacker verifies his guesses by checking whether the received alert contains any of the pre-computed values.

Our solution strikes a balance between privacy and utility. The producer hashes all IP addresses that belong to his own network using a keyed hash function such as HMAC [3, 4] with his secret key. All IP addresses that belong to external networks are hashed using a standard hash function such as SHA-1 [23]. This guarantees privacy for IP addresses on the producer's own network since the attacker cannot verify his guesses without knowing the producer's key. In particular, probe-response fails to yield any useful information. Of course, if these addresses appear in alerts generated by other organizations, then no privacy can be guaranteed.

We pay a price in decreased functionality since alerts about events on the network of organization A that have been generated by A's sensors cannot be compared with the alerts about the same events generated by organization B's sensors. Recall, however, that we are interested in detecting large-scale events. If A is under heavy attack, chances are that it will be detected not only by A's and B's sensors, but also by sensors of C, D, and so on. Because A's network is external to B, C, and D, their alerts will have A's IP addresses hashed using the same standard hash function.
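
The hashing rule above, and the dictionary attack it is meant to resist, can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the choice of HMAC-SHA1, the dotted-quad input encoding, the example subnet, and all names such as `sanitize_ip` and `PRODUCER_KEY` are hypothetical.

```python
import hashlib
import hmac
import ipaddress

PRODUCER_KEY = b"secret known only to this producer"   # hypothetical key
OWN_NETWORK = ipaddress.ip_network("172.16.30.0/24")   # hypothetical internal subnet


def sanitize_ip(address):
    """Keyed hash for the producer's own addresses, plain SHA-1 for external ones."""
    data = address.encode("ascii")
    if ipaddress.ip_address(address) in OWN_NETWORK:
        return hmac.new(PRODUCER_KEY, data, hashlib.sha1).hexdigest()
    return hashlib.sha1(data).hexdigest()


def dictionary_attack(published_hash, subnet):
    """Attacker without the key: hash every address on the subnet with plain SHA-1."""
    for host in ipaddress.ip_network(subnet):
        candidate = str(host)
        if hashlib.sha1(candidate.encode("ascii")).hexdigest() == published_hash:
            return candidate
    return None


internal = sanitize_ip("172.16.30.2")    # belongs to the producer: keyed hash
external = sanitize_ip("173.19.33.1")    # belongs to someone else: plain hash

# Guessing the 256 addresses of the right /24 recovers the plainly hashed address
# but finds no match for the keyed one.
recovered_external = dictionary_attack(external, "173.19.33.0/24")
recovered_internal = dictionary_attack(internal, "172.16.30.0/24")
```

Both outputs have the same length and format, so an observer cannot tell from the value alone which of the two functions produced it.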
The standard hash produces the same value for every occurrence of the same IP address, enabling matching and counting of hash values corresponding to frequently occurring addresses. Intuitively, any subset of participants can match and compare their observations of events happening in someone else's network. The cost of increased privacy is decreased utility because hashing destroys topological information, as discussed in section 7.2. Naturally, an organization can always analyze alerts referring to its own network, since they are all hashed under the organization's own key.

An additional benefit of using keyed hashes for alerts about the organization's own events and plain hashes for other organizations' events is that the attacker cannot feasibly determine which of the two functions was used. Even if the attacker controls the repository and directly receives A's alerts, he cannot tell whether an alert refers to an event in A's or someone else's network. The attacker may still attempt to verify his guesses by precomputing hashes of expected IP addresses and checking alerts submitted by other organizations, but with hundreds of thousands of alerts per hour and thousands of possible addresses this task is exceedingly hard. Staging a targeted probe-response attack is also more difficult: the probe may never be detected by another organization's sensors, which means that the response is never computed using the plain hash, and the attacker cannot stage a dictionary attack at all. Finally, note that keyed hashes do not require PKI or complicated key management since keys are never exchanged between sites.

Re-keying by the repository. To provide additional protection against a casual observer or an outside attacker when an alert is published, the repository may replace all (hashed) IP addresses with their keyed hashes, using the repository's own private key. This is done on top of hashing by the alert producer, and preserves the ability to compare and match IP addresses for equality, since all second-level hashes use the same key.
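
Re-keying can be sketched as a second keyed hash that the repository applies to whatever value the producer submitted. The sketch below assumes HMAC-SHA1 for the second-level hash and hexadecimal strings for the submitted values; the key and the function name `rekey` are hypothetical.

```python
import hashlib
import hmac

REPOSITORY_KEY = b"secret known only to this repository"   # hypothetical key


def rekey(producer_hash):
    """Second-level keyed hash applied by the repository before publication."""
    data = producer_hash.encode("ascii")
    return hmac.new(REPOSITORY_KEY, data, hashlib.sha1).hexdigest()


# Two producers, B and C, report the same external address, so both submit
# the same plain SHA-1 value; a third alert concerns a different address.
from_b = hashlib.sha1(b"173.19.33.1").hexdigest()
from_c = hashlib.sha1(b"173.19.33.1").hexdigest()
other = hashlib.sha1(b"173.19.33.2").hexdigest()

published = [rekey(value) for value in (from_b, from_c, other)]
# Equality survives re-keying, but an outsider's precomputed SHA-1 table no
# longer matches anything that is published.
```

An analyst can still count how often each published value occurs, which is all that equality-based correlation requires.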
This additional keyed hashing by the repository defeats all probe-response and dictionary attacks except when the attacker controls the repository itself and all of its keys, in which case we fall back on protection provided by the producer's keyed hashing.

Randomized hot list thresholds. For collaborative detection of high-volume events, it is sufficient for the repository to publish only the hot list of reported alerts that have something in common (e.g., source IP address, port/protocol combination, event id) and whose number exceeds a certain threshold. As described in section 4, this may be vulnerable to a flooding attack, in which the attacker launches a probe, and then attempts to force the repository to publish the targeted system's response, if any, by flooding it with "matching" fake alerts based on his guesses of what the real alert looks like.

Our solution is to introduce a slight random variation in the threshold value. For example, if the threshold is 20, the repository chooses a random value T between 18 and 22, and, if T is exceeded, publishes only T alerts. If the attacker submits 20 fake alerts and a hot list of 20 alerts is published, the attacker doesn't know if the repository received 20 or 21 alerts, including a matching alert from the victim. There is a small risk that some alerts will be lost if their number is too small to trigger publication, but such alerts are not useful for detecting high-volume events.

Delayed alert publication. If the alert data is used only for research on historical trends (see section 7.1), delayed alert publication provides a feasible defense against probe-response attacks. The repository simply publishes the data several weeks or months later, without Timestamp fields. The attacker would not be able to use this data to correlate his probes with the victim's responses.

Examples of basic sanitization for different alert types can be found in tables 2 through 4.

6.3 Multiple repositories

We now describe a "heavy-duty" solution for the corrupt repository problem.
Instead of using a single alert repository, envision multiple repositories, operated by different owners and distributed throughout the Internet (e.g., open-source code for setting up a repository may be made available to anyone who wishes to operate one). We do not require the repositories to synchronize their alert datasets, so the additional complexity is low. Information about available repositories is compiled into a periodically published list. An organization that wants to take advantage of the alert sharing infrastructure chooses one or more repositories in any way it sees fit: randomly, on the basis of previously established trust, or using a reputation mechanism such as [9, 12].

Field ID        Raw firewall alert    Sanitized firewall alert
Source IP       172.16.30.2           0x16e9368f
Source Port     1147                  1147
Dest IP         173.19.33.1           0x78a65237
Dest Port       135                   135
Protocol        6                     6
Timestamp       09032003:01:03:10     09032003:01:03:00
Sensor          PIX-4-10060231        PIX
Count           1                     1
Event ID        Deny                  Deny
Outcome         none                  none
Capture Data    none                  none
Infected File   none                  none

Table 2: Example firewall security alert sanitization.

Field ID        Raw IDS alert                                          Sanitized IDS alert
Source IP       172.16.30.49                                           0xb099562
Source Port     1299                                                   1299
Dest IP         176.20.22.43                                           0xd6e79b79
Dest Port       80                                                     80
Protocol        6                                                      6
Timestamp       10132003:11:41:09                                      10132003:11:41:00
Sensor          EM-HTTP-90209321                                       EM-HTTP
Count           1                                                      1
Event ID        CGI ATTACK                                             CGI ATTACK
Outcome         NO REPLY                                               none
Capture Data    /scripts/.%255c%255c./winnt/system32/cmd.exe?/c+dir    none
Infected File   none                                                   none

Table 3: Example IDS security alert sanitization.

Field ID        Raw AV alert          Sanitized AV alert
Source IP       none                  none
Source Port     none                  none
Dest IP         176.30.22.11          0xb4dd807
Dest Port       none                  none
Protocol        none                  none
Timestamp       11172003:09:39:00     11172003:09:39:00
Sensor          NORTON-AV-02209302    NORTON-AV
Count           1                     1
Event ID        W32.Sobig.F.Dam       W32.Sobig.F.Dam
Outcome         Left alone            none
Capture Data    none                  none
Infected File   A0014566.pdf          none

Table 4: Example antivirus security alert sanitization.
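The non-address transformations visible in tables 2 through 4 can be expressed compactly. The Python sketch below encodes only what can be read off those tables: the seconds of the timestamp are zeroed, the numeric instance suffix of the sensor identifier is dropped, and the Outcome, Capture Data, and Infected File fields are suppressed. The dictionary keys, the function name, and the regular expression are illustrative assumptions, and whether the timestamp is meant to be truncated or rounded cannot be determined from the three examples. IP hashing is handled separately and is not repeated here.

```python
import re

# Fields whose content is replaced by "none" in all three example tables.
SUPPRESSED_FIELDS = ("outcome", "capture_data", "infected_file")


def sanitize_fields(alert):
    """Return a copy of `alert` with non-address fields sanitized."""
    out = dict(alert)

    # Timestamp format in the tables is MMDDYYYY:HH:MM:SS; zero the seconds.
    timestamp = out.get("timestamp", "none")
    if timestamp != "none":
        out["timestamp"] = timestamp[:-2] + "00"

    # Keep the sensor family (e.g. "PIX"), drop trailing numeric components
    # such as "-4-10060231" that could identify a particular device.
    sensor = out.get("sensor", "none")
    if sensor != "none":
        out["sensor"] = re.sub(r"(-\d+)+$", "", sensor)

    for name in SUPPRESSED_FIELDS:
        out[name] = "none"
    return out
```

Ports, protocol, count, and event ID pass through unchanged, matching the tables.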
With multiple repositories, it is insufficient for the attacker to gain control of just one repository to launch a probe-response attack because the victim may report his alert to a different repository. The costs for the attacker increase linearly with the number of repositories. The costs for alert producers do not increase at all, since the amount of processing per alert does not depend on the number of repositories.

While spreading alerts over several repositories decreases opportunities for collaborative analysis, real-time detection of high-volume events is still feasible. If multiple systems are under simultaneous attack, chances are their alerts will be reported to different repositories in sufficient numbers to pass the "hot list" threshold and trigger publication. By monitoring a sufficiently large subset of the repositories for simultaneous spikes of similar alerts, it will be possible to detect an attack in progress and adopt an appropriate defensive posture. Repositories may also engage in periodic or on-demand exchanges of significant perturbations in incoming alert patterns. This could further help build an aggregate detection capability, especially as the number of would-be repositories grows large.

Randomized alert routing. For better privacy, we propose to deploy an overlay protocol for randomized peer-to-peer routing of alerts in the spirit of Crowds [27] or Onion routing [33]. Each alert producer and repository sets up a simple alert router outside its firewall. The routers form a network. When a batch of alerts is ready for release, the producer chooses one of the other routers at random and sends the batch to it. After receiving the alerts, a router flips a biased coin and, with probability $p$ (a parameter of the system), forwards the alert to the next randomly selected router, or, with probability $1 - p$, deposits it into a randomly selected repository. The alert producer may also specify the desired repository as part of the alert batch.

Such a network is very simple to set up since, in contrast to full-blown anonymous communication systems such as Onion routing, there is no need to establish return paths or permanent channels. The routers don't need to maintain per-alert state or use any cryptography. All they need to do is randomly forward all received alerts and periodically update the table with the addresses of other routers in the network.

When an alert enters the network, all origin data is lost after the first hop. Even if the attacker controls some of the routers and repositories, he cannot be sure whether an alert has been generated by its apparent source or routed on behalf of another producer. This provides probabilistic anonymity for alert sources, which is quantified below. The disadvantage is the communication overhead and increased latency for alerts before they arrive at the repository (note that there is no cryptographic overhead).

Anonymity estimates. To quantify the anonymity that alert contributors will enjoy if the repositories and producers form a randomized alert routing network, we compute the increase in attacker workload as a function of the average routing path length. If $p$ is the probability of forwarding at each hop, then the average path length is $m = 2 + \frac{p}{1 - p}$. Reversing the equation, the forwarding probability $p$ must be equal to $\frac{m - 2}{m - 1}$ to achieve the average path length of $m$.

Suppose the network contains $n$ routers, of which $c$ are controlled by the attacker. The probability that a random path contains a router controlled by the attacker is $\frac{c\,(n - np + cp + p)}{n^2 - np\,(n - c)}$ [27]. For large $n$, this value is close to $\frac{c}{n}$, which means that almost a $1 - \frac{c}{n}$ fraction of alerts will not be observed by the attacker and thus remain completely anonymous.

For each of the $\frac{c}{n}$ fraction of alerts that are observed by the attacker, the probability that its apparent source (the site from which an attacker-controlled router has received it) is the actual source can be calculated as $\frac{n - p\,(n - c - 1)}{n}$ [27]. We interpret the inverse of this probability as the attacker workload.
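The estimates above are easy to evaluate numerically. The following Python sketch implements the expressions for the average path length, the probability that a path is observed, and the attacker workload, with c denoting the number of attacker-controlled routers. It also includes a small simulation of the coin-flipping router to check the path-length formula. The function names and the hop-counting convention (one hop from the producer to the first router, one per forward, one for the final deposit) are assumptions made for illustration.

```python
import random


def average_path_length(p):
    """Expected number of hops for forwarding probability p (0 <= p < 1)."""
    return 2 + p / (1 - p)


def forwarding_probability(m):
    """Forwarding probability needed for an average path length m (m >= 2)."""
    return (m - 2) / (m - 1)


def p_path_observed(n, c, p):
    """Probability that a random path contains an attacker-controlled router."""
    return c * (n - n * p + c * p + p) / (n ** 2 - n * p * (n - c))


def p_apparent_is_actual(n, c, p):
    """Probability that the apparent source of an observed alert is the real one."""
    return (n - p * (n - c - 1)) / n


def attacker_workload(n, c, p):
    """Inverse of the probability above, read as a testing multiplier."""
    return 1 / p_apparent_is_actual(n, c, p)


def simulate_hops(p, rng):
    """Route one alert: forward with probability p, otherwise deposit it."""
    hops = 1                   # producer -> first router
    while rng.random() < p:
        hops += 1              # router -> next randomly selected router
    return hops + 1            # last router -> repository
```

For instance, `forwarding_probability(3)` gives 0.5, and averaging `simulate_hops(0.5, random.Random(0))` over many trials should come out close to 3 hops.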
For example, if there is only a 25% chance that the observed alert was produced by its apparent source, the attacker needs to perform 4 times the testing to determine whether the apparent source is the true origin. As expected, higher values of forwarding probability $p$ provide better anonymity at the cost of increased latency (modeled as an increase in the average number of hops an alert has to travel before arriving at the repository). This relationship is plotted (assuming n = 100 routers) in figure 3.

Figure 3: Estimated anonymity provided by randomized alert routing.

7 Supported Analyses

Alert sanitization techniques described in section 6 protect sensitive information contained in raw alerts, but still allow a wide variety of large-scale, cross-organization analyses to be performed on the sanitized data.

7.1 Historical trend analyses

This class of analyses seeks to understand the statistical characteristics and trends in alert production that have been observed over various durations of time. For example, [31] offers a compendium of the trends observed in firewall and intrusion detection alert production from a sample set of over 400 organizations in 30 countries.

Source- and target-based. Given a large alert corpus, alert sources and targets may be categorized from various perspectives, such as event production patterns. Because of privacy-preserving data sanitization, geographical information and domain types cannot be inferred from the published alerts. One possible solution is to rely on self-classification and allow contributors to associate concise high-level profiles with each alert, including such attributes as country, business type, and so on (e.g., "an academic institution in California"). This will enable some forms of trend/categorical analysis, but will also potentially make alert contributors more vulnerable to dictionary attacks.

We do enable identification of (anonymous) sources producing the greatest volume of alerts and alerts with the greatest aggregate severity.
The activity of egregious sources is likely to be reported by multiple organizations, thus the corresponding address will be hashed using a universally computable hash function such as SHA-1. These sources can be blacklisted by distributing filters with the corresponding hash value. When installed, they would filter out all traffic for which the hash of the source IP address matches the provided value. There is a cost to this filtering, since it requires the firewall to hash the IP addresses of all incoming traffic to determine which ones need to be filtered out, although this may be acceptable when the network is under a heavy attack (this hashing is benign as opposed to dictionary attacks described in section 4.3). Repositories should beware of malicious blacklisting caused by the attacker submitting a large number of fake alerts implicating an innocent system.

Port/protocol- and event production-based. These analyses may offer help in understanding which kinds of reconnaissance are performed as a precursor to a larger scale exploit, or help characterize the extent to which an attack has spread.

7.2 Event-driven analyses

Real-time alert data published by alert repositories offers compelling value as a source of early warning signs that a new outbreak of malicious activity is emerging across the contributor pool. The focus of this analysis is to identify significant changes or sudden inflections in alert production that may be indicative of a currently occurring attack.

Intensity analysis identifies extremely aggressive sources causing a large number of alerts from multiple contributors. Although the sources remain anonymous, hash values of their IP addresses can be published and/or distributed to contributors to help them adjust their filtering policies, as described above.

Sudden and widespread inflections in the volume and ratios of event IDs and Dest Ports in the incoming alert streams may indicate the emergence of a new intrusion threat that is affecting a growing subset of the contributor pool.
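Both intensity analysis and hash-based blacklisting work directly on sanitized values. The Python sketch below is a hedged illustration: it assumes each alert reaches the analyst as a pair of an opaque contributor tag and a plain-hashed source address, and the function names and both thresholds are hypothetical. The filter side shows the per-packet cost mentioned above, since every incoming source address must be hashed before it can be compared with the blacklist.

```python
import hashlib
import ipaddress
from collections import defaultdict


def intensity_hot_sources(alerts, min_alerts, min_contributors):
    """Hashed sources reported often enough and by enough distinct contributors.

    `alerts` is an iterable of (contributor_tag, hashed_source) pairs.
    """
    volume = defaultdict(int)
    reporters = defaultdict(set)
    for contributor, source_hash in alerts:
        volume[source_hash] += 1
        reporters[source_hash].add(contributor)
    return sorted(
        h for h in volume
        if volume[h] >= min_alerts and len(reporters[h]) >= min_contributors
    )


def plain_hash(addr):
    """Universally computable hash of an address (SHA-1 assumed here)."""
    return hashlib.sha1(ipaddress.ip_address(addr).packed).hexdigest()


def should_drop(source_addr, blacklist):
    """Firewall-side check: hash the incoming source and compare with the list."""
    return plain_hash(source_addr) in blacklist
```

Requiring several distinct contributors before a source is listed is one simple guard against the malicious blacklisting noted above, although it does not help if the attacker can submit alerts under many identities.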
Aggregation of the volume and severity of alerts observed in the incoming alert streams may provide a basis from which to capture an overall assessment or "Defcon level" of the threats that the contributor pool is currently facing.

A more challenging task is to identify propagation patterns in the occurrence of event IDs and volumes, which is necessary to analyze spreading behavior of Internet-scale intrusion activity. Both hashing and keyed hashing destroy all topological information in IP addresses, making it infeasible to determine whether two sanitized alerts belong to the same region of address space. A possible solution may be offered by prefix-preserving anonymization [36], but we leave these techniques for future investigation.

8 Performance

As illustrated in figure 2, large volumes of alert data are being generated, and alert production among members of the contributor pool can vary greatly. Security services can produce inundations of security alerts when they are the target of a denial of service attack, and when there is a widespread outbreak of a virulent worm or virus. During such periods of significant stress, alert production and processing can pose a significant burden on sensors, repositories, and analysts, and thus limit the utility of the alerts. This is a significant motivator for work on alert reduction methods [35, 10], and places constraints on the acceptable costs of alert sanitization.

As we show below, the cost of providing privacy to alert producers in our scheme is very low: there is a small impact on the performance of alert producers, and virtually no impact on the performance of supported analyses (of course, some analyses are disabled due to data sanitization). We argue that our scheme provides a sensible three-way tradeoff between utility of alert analysis, performance of the alert sharing infrastructure, and privacy of alert producers.

Performance of alert producers.
To understand the CPU impact of alert sanitization, we benchmarked IP hashing on large alert corpuses under the scheme proposed in section 6.2, using SHA-1 on external IP addresses (primarily Source IP), and HMAC on internal IP addresses (primarily Dest IP). The experiment was conducted on a FreeBSD 1.4 GHz Intel Pentium III workstation using Mark Shellor's free software implementation of SHA and HMAC (source code is available at http://search.cpan.org/src/MSHELOR/Digest-SHA-4.1.0/src/). We employed two large alert repositories. One repository, produced from our laboratory firewall, consisted of 4,224,122 records collected over a three hour period during an intense exposure to the Kuang 2 virus [16]. The second repository consisted of 19,146,346 records collected over a 24 hour period by DShield.

Table 5 presents the results of the IP address hiding scheme on the DShield and laboratory alert corpuses, reported in CPU seconds per million records. The baseline represents the number of CPU seconds required to read the alerts from secondary storage per 1 million records. The hashed and cached-8 times indicate the number of CPU seconds required to apply SHA and HMAC hashing to the Source IP and Dest IP fields per 1 million records. The delta rows represent the difference between the baseline alert reporting performance and the sanitized alert reporting performance.

             DShield.org    Laboratory
baseline     29.81          75.80
hashed       64.16          110.34
delta        34.35          34.54
cached-8     56.84          106.20
delta        27.02          30.40

Table 5: CPU impact of IP hashing (seconds per 1 million alerts).

Cached-8 represents a moderately optimized implementation with a very small cache holding the last 8 encountered IP addresses. Because our sanitization scheme is deterministic, we can use the previously hashed IP addresses from the cache. Caching makes sense in two cases:

* The site is hit by a scan across its full IP address space by a few infected or malicious external hosts. In this case, a few Source IP addresses will occur with regularity, resulting in a high cache hit ratio.
* The site is hit by distributed-denial-of-service-type traffic against a subset of its valid servers. In this case, a few Dest IP addresses will occur with regularity, resulting in a high cache hit ratio.

For the IP addresses not in the trusted domain (to which SHA is applied), caching achieved savings of about 65%.

The results reveal that the performance impact is modest, less than the cost of I/O in our implementation. For a sensor producing 1 million alerts per hour, the additional hashing expense is roughly 30 seconds of CPU time per hour. This overhead should be considered in the context of the much larger task of alert caching and periodic batched transmission to a remote alert repository. Key management is relatively cheap in our case: there is no need for PKI and keys are never distributed outside the producer's site.

The expected cost of randomized routing to anonymize alert sources depends on the parameters of the routing network such as the forwarding probability and is roughly linear in the number of hops. There is no cryptographic processing and alert routers are stateless (see section 6.3).

Performance of analysis. To achieve the balance between privacy and utility, our sanitization methods have been designed to have minimal or no effect on the performance of primary analyses. In particular, sanitized IP addresses are mapped into the same size record as the original IP addresses, and cross-alert comparisons can be carried out at the repository without any network interaction. Comparing hashes for equality takes the same time as comparing IP addresses, so there is zero impact on performance.

When a troublesome source IP address is identified, this information may need to be propagated back to the producer (this is infeasible in the randomized-routing setting due to the high overhead of maintaining a return path for each alert). The producer may opt to reveal the actual IP address of the offender.
In the case of a widespread attack, many sensors may complain about a single IP address, and any of the victims may choose to reveal the source of the threat, to enable defensive filters to be tuned appropriately. Measuring the costs of such selective revelation is beyond the scope of this paper.

9 Conclusions

We have described a broad set of privacy concerns that limit the ability of sites to share security alert information, and enumerated a number of data sanitization techniques that strike a balance between the privacy of alert producers and the functional needs of multisite correlation services, without imposing heavy performance costs. Our techniques are practical even for large alert loads, and, most importantly, do not require that alert contributors trust alert repositories to protect their sensitive data. This enables creation of open community-access repositories that will offer a better perspective on Internet-wide trends, real-time detection of emerging threats and a source of data for malicious code research.

As a first prototype to demonstrate basic alert sanitization with live sensors, we are developing a Snort alert delivery plugin that implements SHA/HMAC and field sanitization discussed in section 6.2. We also plan to analyze defenses against probe-response attacks in which the attacker artificially stimulates an alert with a rare Event Id and then uses this Event Id as a marker to recognize the response in the general alert traffic.

Acknowledgements. We thank Keith Skinner for his support in the initial performance analysis, and Johannes Ullrich from the SANS Internet Storm Center for providing access to data set samples from DShield.org. We are grateful to the anonymous reviewers for useful comments.

References

[1] R. Agrawal, A. Evfimievski, and R. Srikant. Information sharing across private databases. In Proc. ACM SIGMOD '03, pages 86–97, 2003.
[2] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu. Hippocratic databases. In Proc. VLDB '02, pages 143–154, 2002.
[3] M. Bellare, R. Canetti, and H. Krawczyk. Keying hash functions for message authentication. In Proc. CRYPTO '96, volume 1109 of LNCS, pages 1–15. Springer-Verlag, 1996.
[4] M. Bellare, R. Canetti, and H. Krawczyk. HMAC: Keyed-hashing for message authentication. Internet RFC 2104, February 1997.
[5] D. Boneh, G. D. Crescenzo, R. Ostrovsky, and G. Persiano. Public key encryption with keyword search. In Proc. EUROCRYPT '04, volume 3027 of LNCS, pages 506–522. Springer-Verlag, 2004.
[6] R. Canetti, R. Gennaro, A. Herzberg, and D. Naor. Proactive security: long-term protection against break-ins. Cryptobytes, 3(1):1–8, 1997.
[7] K. Carr and D. Duffy. Taking the Internet by storm. CSOonline.com, April 2003.
[8] C. Clifton, M. Kantarcioglou, J. Vaidya, X. Lin, and M. Zhu. Tools for privacy preserving distributed data mining. ACM SIGKDD Explorations, 4(2):28–34, 2002.
[9] E. Damiani, S. D. C. di Vimercati, S. Paraboschi, P. Samarati, and F. Violante. A reputation-based approach to choosing reliable resources in peer-to-peer networks. In Proc. ACM CCS '02, pages 207–216, 2002.
[10] H. Debar and A. Wespi. Aggregation and correlation of intrusion-detection alerts. In Proc. RAID '01, pages 85–103, 2001.
[11] Y. Desmedt and Y. Frankel. Threshold cryptosystems. In Proc. CRYPTO '89, volume 435 of LNCS, pages 307–315. Springer-Verlag, 1989.
[12] R. Dingledine, N. Mathewson, and P. Syverson. Reputation in P2P anonymity systems. In Proc. Workshop on Economics of Peer-to-Peer Systems, 2003.
[13] J. Feigenbaum, Y. Ishai, T. Malkin, K. Nissim, M. Strauss, and R. Wright. Secure multiparty computation of approximations. In Proc. ICALP '01, volume 2076 of LNCS, pages 927–938. Springer-Verlag, 2001.
[14] P. Gemmell. An introduction to threshold cryptography. Cryptobytes, 2(3):7–12, 1997.
[15] A. Herzberg, S. Jarecki, H. Krawczyk, and M. Yung. Proactive secret sharing, or how to cope with perpetual leakage. In Proc. CRYPTO '95, volume 963 of LNCS, pages 339–352. Springer-Verlag, 1995.
[16] Internet Security Systems. Xbackdoorkuang2v (4074). ISS X-Force Advisory, April 2003.
[17] J. Jegon. Security firm: MyDoom worm fastest yet. CNN.com, January 2004.
[18] J. Leyden. Blaster rewrites Windows worm rules. The Register, August 2003. http://www.securityfocus.com/news/6725.
[19] Y. Lindell and B. Pinkas. Privacy preserving data mining. In Proc. CRYPTO '00, volume 1880 of LNCS, pages 36–54. Springer-Verlag, 2000.
[20] D. Moore, V. Paxson, S. Savage, S. Staniford, and N. Weaver. Inside the Slammer worm. IEEE Security and Privacy, 1(4), 2003.
[21] D. Moore, C. Shannon, and K. Claffy. Code-Red: a case study on the spread and victims of an Internet worm. In Proc. ACM Internet Measurement Workshop '03, pages 273–284, 2003.
[22] D. Moore, G. Voelker, and S. Savage. Inferring Internet denial-of-service activity. In Proc. USENIX Security Symposium, pages 9–22, 2001.
[23] NIST. Secure hash standard. FIPS PUB 180-1, April 1995.
[24] R. Pang and V. Paxson. A high-level programming environment for packet trace anonymization and transformation. In Proc. ACM SIGCOMM '03, pages 339–351, 2003.
[25] M. Peuhkuri. A method to compress and anonymize packet traces. In Proc. ACM Internet Measurement Workshop '01, pages 257–261, 2001.
[26] M. Rabin. How to exchange secrets by oblivious transfer. Aiken Computation Laboratory Technical Memo TR-81, 1981.
[27] M. Reiter and A. Rubin. Crowds: anonymity for web transactions. ACM Transactions on Information and System Security, 1(1):66–92, 1998.
[28] Snort. http://www.snort.org, 2004.
[29] D. Song, D. Wagner, and A. Perrig. Practical techniques for searches on encrypted data. In Proc. IEEE Symposium on Security and Privacy, pages 44–55, 2000.
[30] S. Staniford, V. Paxson, and N. Weaver. How to own the Internet in your spare time. In Proc. USENIX Security Symposium, pages 149–167, 2002.
[31] Symantec. Symantec Internet security threat report. Technical report, Symantec Managed Security Services, February 2003.
[32] Symantec. DeepSight threat management system home page. http://tms.symantec.com, 2004.
[33] P. Syverson, D. Goldschlag, and M. Reed. Anonymous connections and onion routing. In Proc. IEEE Symposium on Security and Privacy, pages 44–54, 1997.
[34] J. Ullrich. DShield home page. http://www.dshield.org, 2004.
[35] A. Valdes and K. Skinner. Probabilistic alert correlation. In Proc. RAID '01, pages 54–68, 2001.
[36] J. Xu, J. Fan, M. Ammar, and S. Moon. On the design and performance of prefix-preserving IP traffic trace anonymization. In Proc. ACM Internet Measurement Workshop '01, pages 263–266, 2001.
[37] A. Yao. Protocols for secure computation. In Proc. IEEE FOCS '82, pages 160–164, 1982.
[38] V. Yegneswaran, P. Barford, and S. Jha. Global intrusion detection in the DOMINO overlay system. In Proc. NDSS '04, 2004.
[39] V. Yegneswaran, P. Barford, and J. Ullrich. Internet intrusions: global characteristics and prevalence. In Proc. ACM SIGMETRICS '03, pages 138–147, 2003.
[40] C. Zou, W. Gong, and D. Towsley. Code Red worm propagation modeling and analysis. In Proc. ACM CCS '02, pages 138–147, 2002.