
Humboldt-Universität zu Berlin, School of Computer Science, Unter den Linden 6, 10099 Berlin, Germany

Abstract. The Discovery Challenge 2006 deals with personalized spam filtering and generalization across related learning tasks. In this overview of the challenge we motivate and describe the problem setting and the evaluation measure. We give details on the construction of the data sets and discuss the results.
The Discovery Challenge 2006 is about personalized spam filtering and generalization across related learning tasks. People spend an increasing amount of time reading messages and deciding whether they are spam or non-spam. Some users spend additional time labeling their received spam messages in order to train local spam filters running on their desktop machines. Email service providers want to relieve users from this burden by installing server-based spam filters.
Training such filters cannot rely on labeled messages from the individual users, but only on publicly available sources, such as newsgroup messages or emails received through "spam traps" (email addresses that are published in a way invisible to human readers but are collected by the web crawlers of spammers).
This combined source of training data differs from the distributions of the emails received by individual users. When learning spam filters for individual users from this type of data, one needs to cope with a discrepancy between the distributions governing training and test data, and one needs a balance between generalization and adaptation. The generalization and adaptation can rely on large amounts of unlabeled emails in the users' inboxes that are accessible to server-based spam filters. Utilizing this unlabeled data, a spam filter can be adapted to the properties of a specific user's inbox; but when little unlabeled data for a user is available, a generalization over multiple users is advised.
The Discovery Challenge 2006 covers this setting: labeled training data collected from publicly available sources are provided. The unlabeled inboxes of several users serve as test data. The inboxes differ in the distribution of emails.
The goal is to construct a spam filter for each single user that correctly classifies its emails as spam or non-spam. A clever way of utilizing the available sets of unlabeled emails from different users is required.
This overview is organized as follows. In Section 2, we discuss the problem setting and define the evaluation measure. We describe the data sets in Section 3. Section 4 gives an overview of the participants and summarizes the results. In Section 5 we discuss the different approaches and Section 6 concludes.
In the problem setting of the challenge, the inboxes of several users are given and the goal is to correctly classify the messages in each inbox as spam or non-spam. No labeled training examples from the inboxes are available; instead, one common set of labeled data is given. The labeled data and the inboxes are governed by different distributions. A learning algorithm cannot rely only on the labeled data because the bias between training data and inboxes hinders learning of a correct classification model for the inboxes. The unlabeled data in the inboxes need to be used to adapt to their distributions.
The individual distributions of the inboxes are neither independent (identical spam messages are sent to many users), nor are they likely to be identical: distributions of inbound messages vary greatly between (professional, recreational, American, Chinese, ...) email users. A learning algorithm can exploit the similarity of the inboxes.
There are two different tasks that differ in the number of inboxes and the proportion of labeled to unlabeled data (see Section 3).
Usually, cross-validation is used for tuning the parameters of a classification model. In our case, cross-validation cannot be used because the emails in the inboxes are unlabeled. We provide a second set of labeled training data and inboxes for parameter tuning. The difference between the tuning set and the evaluation set is that the emails in the inboxes of the tuning set are labeled.
The feature representation of the tuning data differs from that of the evaluation data (different dictionary). This means that the tuning data cannot be used to augment the training data.
The problem setting differs from the standard setting of semi-supervised learning in several ways:
– there is a bias between training and evaluation data; the training and test data are governed by different distributions,
– several distinct but similar unlabeled inboxes are given; a multi-task learning or a transfer learning approach can be used for modeling and exploiting the similarity between inboxes,
– the number of labeled emails is larger than the number of unlabeled examples.

The evaluation criterion for the challenge is the AUC value. The AUC value is the area under the ROC curve (Receiver Operating Characteristic curve). A ROC curve is a plot of true positive rate vs. false positive rate as the prediction threshold sweeps through all possible values. The area under this curve has the nice property that it specifies the probability that, when we draw one positive and one negative example at random, the decision function assigns a higher value to the positive than to the negative example.
We compute AUC values for each inbox separately and average over all inboxes of the task. The winner for each task is the participant with the highest average AUC value. There is an additional creativity award for each task for the most interesting solutions in terms of non-straightforward approaches, innovative ideas, and assumed high impact.
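To make the evaluation procedure concrete, the following sketch computes the AUC via its pairwise-comparison interpretation and averages the per-inbox values. The function names and the data layout are illustrative assumptions, not part of the challenge tooling.

```python
from itertools import product

def auc(scores, labels):
    """AUC as the probability that a randomly drawn spam email (label 1)
    receives a higher score than a randomly drawn non-spam email (label 0);
    ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

def challenge_score(inboxes):
    """Average of the per-inbox AUC values, as used for the ranking.
    `inboxes` is a list of (scores, labels) pairs, one per inbox."""
    return sum(auc(s, y) for s, y in inboxes) / len(inboxes)
```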
The composition of the labeled training set is the same for both tasks; the tasks differ only in the number of emails. 50% of the labeled training data are spam emails sent by blacklisted servers of the Spamhaus project (www.spamhaus.org). 40% are non-spam emails from the SpamAssassin corpus and 10% are non-spam emails sent from about 100 different subscribed English and German newsletters. Table 1 summarizes the composition of the labeled training data for both tasks. The labeled data of the tuning set has the same size and composition as the actual training data but consists of different emails.
Table 1. Composition of labeled training data.
Evaluating the filters with respect to the personal distributions of messages requires labeled emails from distinct users. We construct different inboxes using real but publicly disclosed messages. As the non-spam part of the inboxes we use messages received by distinct Enron employees from the Enron corpus [9], cleaned of spam. Each inbox is augmented with spam messages from distinct spam sources.
Some spam sources are used for multiple inboxes; in those cases all available emails from the source were sorted by date and split into consecutive subsets. Because of topic drift, the distribution of the emails in the different parts differs.
The two tasks differ in the number and size of the inboxes: task A has 3 and task B 15 evaluation inboxes. The size of the inboxes is 2500 emails in task A and 400 emails in task B. Tables 2 and 3 summarize the composition of the evaluation and the tuning inboxes for tasks A and B. Each inbox consists of 50% spam and 50% non-spam emails.
The messages are preprocessed and transformed into a bag-of-words representation. We provide feature vectors with term frequencies. Our preprocessing uses charset, MIME, base64, URL (RFC 1738), and subject line decoding (RFC 2047). Our tokenization takes care of HTML tags, following the X-tokenizer proposed by Siefkes et al. [3].
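As a minimal sketch of this kind of term-frequency encoding: the tokenizer below is a plain whitespace split and the dictionary is built on the fly, both strong simplifications of the actual preprocessing pipeline (which additionally performs the decoding and HTML-aware tokenization described above).

```python
from collections import Counter

def build_dictionary(messages):
    """Assign an integer id to every token observed in the corpus."""
    vocab = {}
    for text in messages:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def term_frequency_vector(text, vocab):
    """Sparse term-frequency vector: {feature id: term count}.
    Tokens outside the dictionary are dropped, mirroring the fact that
    tuning and evaluation data use different dictionaries."""
    counts = Counter(text.lower().split())
    return {vocab[t]: c for t, c in counts.items() if t in vocab}
```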
Table 2. Composition of the evaluation and tuning inboxes for task A (spam sources):
– Dornbos spam trap, part 1 (www.dornbos.com)
– Dornbos spam trap, part 2 (www.dornbos.com)
– spam trap of Bruce Guenter, part 1 (www.em.ca/bruceg/spam)
– personal spam of Richard Jones, part 1 (www.annexia.org/spam)
– spam collection of SpamArchive.org, part 1
– spam collection of SpamArchive.org, part 2
– personal spam of Paul Wouters, part 1 (www.xtdnet.nl/paul/spam)
– Dornbos spam trap, part 3 (www.dornbos.com)
– Dornbos spam trap, part 4 (www.dornbos.com)
– spam trap of Bruce Guenter, part 2 (www.em.ca/bruceg/spam)
– spam trap of Bruce Guenter, part 3 (www.em.ca/bruceg/spam)
– personal spam of Richard Jones, part 2 (www.annexia.org/spam)
– spam collection of SpamArchive.org, part 3
– personal spam of Paul Wouters, part 2 (www.xtdnet.nl/paul/spam)
– spam trap of Bruce Guenter, part 4 (www.em.ca/bruceg/spam)

Table 3. Composition of the evaluation and tuning inboxes for task B (spam sources):
– spam trap of Bruce Guenter (www.em.ca/bruceg/spam)
– personal spam of Tobias Scheffer (www.em.ca/bruceg/spam)
– Dornbos spam trap, part 3 (www.dornbos.com)
57 teams from 19 different countries participated in the challenge. 26 participants submitted their results for evaluation; 20 teams have an academic and 6 teams a commercial background. Not all teams submitted results for both tasks. We averaged the AUC values over all inboxes as described above and determined the ranking. We conducted significance tests at a significance level of 5% to test the null hypothesis that the second rank has a higher AUC value than the first.
The test statistic is computed as described in Hanley and McNeil [7]. For task A we could not reject the null hypothesis for ranks two and three; this means there is no statistically significant difference between them and they are all ranked first. For task B we could reject the null hypothesis for the second rank; this means there is one winner.
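As a rough illustration of such a test, the sketch below uses the Hanley-McNeil approximation of the standard error of a single AUC and a z-statistic for the difference of two AUC values. The correlation term r is set to zero for simplicity; the test in [7] estimates it from the fact that both classifiers are evaluated on the same cases, so this is a simplified variant, not the exact procedure used for the challenge.

```python
import math

def auc_standard_error(auc, n_pos, n_neg):
    """Hanley-McNeil approximation of the standard error of an AUC
    estimated from n_pos positive and n_neg negative examples."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(var)

def z_statistic(auc1, auc2, n_pos, n_neg, r=0.0):
    """z-statistic for the difference of two AUC values; reject the null
    hypothesis at the 5% level (one-sided) if z > 1.645."""
    se1 = auc_standard_error(auc1, n_pos, n_neg)
    se2 = auc_standard_error(auc2, n_pos, n_neg)
    return (auc1 - auc2) / math.sqrt(se1 ** 2 + se2 ** 2 - 2.0 * r * se1 * se2)
```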
Tables 4 and 5 show the first five ranks for task A and task B, respectively.
Figure 1 displays the distribution of AUC over all ranks. Some participants report higher results in their workshop papers because they improved their algorithms after the submission deadline.
Fig. 1. Distribution of AUC performance dependent on rank over all participants for task A (left) and task B (right).
We selected the solution of Bernhard Pfahringer (University of Waikato, New Zealand) for the Spam Filtering Creativity Award for tasks A and B; we decided to award one team for both tasks instead of one for each task because most teams used the same algorithm for both tasks. Details on his algorithm are given in the next section.
Table 4. First five ranks for task A:
– Khurram Junejo, Mirza Yousaf, Asim Karim (Lahore University of Management Sciences, Pakistan)
– Bernhard Pfahringer (University of Waikato, New Zealand)
– Kushagra Gupta, Vikrant Chaudhary, Nikhil Marwah, Chirag Taneja (Inductis India Pvt Ltd)
– Nikolaos Trogkanis (National Technical University of Athens, Greece) and Georgios Paliouras (National Center of Scientific Research “Demokritos”, Greece)
– Chao Xu, Yiming Zhou (School of Computer Science and Engineering, Beijing University, China)
– Lalit Wangikar, Mansi Khanna, Ankush Talwar, Nikhil Marwah, Chirag Taneja (Inductis India Pvt Ltd)
– Dimitrios Mavroeidis, Konstantinos Chaidos, Stefanos Pirillos, Dimosthenis Christopoulos, Michalis Vazirgiannis (DB-NET Lab, Informatics Dept., Athens University of Economics and Business, Greece)

Table 5. First five ranks for task B:
– Gordon Cormack (University of Waterloo, Canada)
– Nikolaos Trogkanis (National Technical University of Athens, Greece) and Georgios Paliouras (National Center of Scientific Research “Demokritos”, Greece)
– Kushagra Gupta, Vikrant Chaudhary, Nikhil Marwah, Chirag Taneja (Inductis India Pvt Ltd)
– Alexander Dyakonov (Moscow State University, Russia)
– Wenyuan Dai (Apex Data & Knowledge Management Lab, Shanghai Jiao Tong University)

The teams approached the problem in very different ways, but most of the participants used variants of semi-supervised learning techniques. Among the semi-supervised algorithms were graph-based algorithms [2], large-margin-based methods [1, 4, 10], self-training approaches [6, 8], positive-only learning [10], and multi-view learning methods [4]. The assumption in most of those algorithms is that the unlabeled data is drawn from the same distribution as the labeled data. This assumption is violated in our case, but semi-supervised learning nevertheless reduces the error compared to methods that do not utilize the unlabeled data.
Bernhard Pfahringer, the winner of the creativity award, accounts for the bias between training and evaluation data in two ways [2]. Firstly, whenever a prediction for some evaluation email is needed, his algorithm transforms the whole training set by selecting only those features which are actually present in the evaluation email (i.e., have a non-zero value). A classification model is trained using this transformed training set and that model's prediction is used for the evaluation example in question. This procedure forces the learner to concentrate on the features that are actually present in the evaluation example. This idea of filtering non-existent features is similar to the approach of Steinberg and Golovnya [11]. Secondly, Pfahringer uses a learning algorithm by Zhou et al. [5] that is one of the best known graph-based semi-supervised learning algorithms.
The algorithm of Zhou et al. originally suffers from a cubic runtime complexity in the number of examples. Pfahringer develops a variant of this algorithm with linear complexity. The tremendous reduction in runtime and memory requirements makes the algorithm applicable to large data sets.
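The following sketch illustrates the two ingredients just described: per-example feature filtering of the training set, and the iterative form of Zhou et al.'s label propagation, F ← αSF + (1−α)Y, which converges to the closed-form solution (1−α)(I−αS)⁻¹Y. This is a dense illustration whose cost is quadratic per iteration, not Pfahringer's linear-time variant; the similarity matrix S is assumed to be symmetrically normalized.

```python
import numpy as np

def filter_features(X_train, x_eval):
    """Keep only the feature columns that are non-zero in the evaluation
    email, so the learner concentrates on features present in that email."""
    active = x_eval != 0
    return X_train[:, active], x_eval[active]

def label_propagation(S, y, alpha=0.99, iters=100):
    """Iterative form of 'learning with local and global consistency' [5].
    S: normalized similarity matrix over all (labeled + unlabeled) emails.
    y: +1 for spam, -1 for non-spam, 0 for unlabeled examples.
    Returns a score vector; its sign gives the predicted labels."""
    Y = y.astype(float)
    F = Y.copy()
    for _ in range(iters):
        F = alpha * (S @ F) + (1.0 - alpha) * Y
    return F
```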
Trogkanis and Paliouras [10], ranked second in both tasks, are very cautious when transferring knowledge from labeled to biased unlabeled data. Their approach is almost unsupervised. A classifier trained on the labeled data is allowed to label only a very few unlabeled emails with high confidence. In the subsequent step the labeled data is ignored and a semi-supervised algorithm is applied only to the inbox emails.
Two teams developed models that account for the similarity of inboxes with transfer learning. Participant Mohammad Al-Hasan (Rensselaer Polytechnic Institute) first measures the pairwise cosine similarity of all emails between all inboxes. In a second step a self-training-like learning algorithm learns separate classifiers for all inboxes in parallel. In each self-training iteration the most confident previously unlabeled email for each inbox is labeled together with the most similar email from one other inbox. With this approach, confident decisions from one inbox are transferred to other inboxes. Trogkanis and Paliouras [10] use semi-supervised learning and augment the unlabeled data of one inbox with a weighted set of the unlabeled emails of all other inboxes. Gordon Cormack, ranked first in task B, even ignores the separation of emails into inboxes and pools all inboxes into one unlabeled set for semi-supervised training.
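For concreteness, here is a small sketch of the cosine-similarity step used to link emails across inboxes; the surrounding self-training loop (confidence scoring, label assignment) is omitted, and the vector layout is an assumption for illustration rather than Al-Hasan's actual implementation.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom > 0.0 else 0.0

def most_similar_email(x, other_inbox):
    """Row index of the email in `other_inbox` (one email per row)
    that is most similar to the email vector x."""
    sims = [cosine_similarity(x, row) for row in other_inbox]
    return int(np.argmax(sims))
```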
Most of the participants obtained lower classification errors by utilizing the data from the unlabeled inboxes in addition to the labeled data. Those results indicate that server-sided spam filtering can be improved by personalization using unlabeled inboxes.
The results of the participants show that a wide range of semi-supervised learning algorithms can improve the classification performance for the problem setting of the challenge. Most semi-supervised learning algorithms make the implicit assumption that the training and test data are drawn from the same distribution. This assumption is violated in our case. It is an open problem to develop semi-supervised learning methods that account for a bias between training and test data. We assume that such methods could further improve the benefit of spam filter personalization.
Some participants used transfer learning to account for the similarity between the inboxes. In their experiments the knowledge transfer between the inboxes improved the classification performance. Algorithms that did not exploit the similarity between the inboxes but used more sophisticated semi-supervised methods received higher scores in the overall ranking. This raises the question whether the best semi-supervised approaches can be integrated into a transfer or multi-task learning framework and whether this further improves the classification performance.
To our knowledge there is no spam filtering software for practical settings that utilizes unlabeled examples within a learning framework; also, personalization in server-sided spam filtering algorithms is widely disregarded. In this respect the results of the Discovery Challenge are encouraging. A real application in a server-sided setting poses additional challenges regarding the scalability of the methods.
The Discovery Challenge 2006 has been supported by Strato Rechenzentrum AG and by the German Science Foundation DFG under grant SCHE540/10-2.
1. Kyriakopoulou A. and Kalamboukis T. Text classification using clustering. In Proceedings of the ECML-PKDD Discovery Challenge Workshop, 2006.
2. Pfahringer B. A semi-supervised spam mail detector. In Proceedings of the ECML-PKDD Discovery Challenge Workshop, 2006.
3. Siefkes C., Assis F., Chhabra S., and Yerazunis W. Combining winnow and orthogonal sparse bigrams for incremental spam filtering. In Proceedings of the European Conference on Principles and Practice of Knowledge Discovery in Databases, 2004.
4. Mavroeidis D., Chaidos K., Pirillos S., Christopoulos D., and Vazirgiannis M. Using tri-training and support vector machines for addressing the ECML-PKDD 2006 discovery challenge. In Proceedings of the ECML-PKDD Discovery Challenge Workshop, 2006.
5. Zhou D., Bousquet O., Lal T., Weston J., and Schölkopf B. Learning with local and global consistency. In Advances in Neural Information Processing Systems, 2003.
6. Cormack G. Harnessing unlabeled examples through iterative application of dynamic Markov modeling. In Proceedings of the ECML-PKDD Discovery Challenge Workshop, 2006.
7. Hanley J. and McNeil B. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology, 148:839–843, 1983.
8. Junejo K., Yousaf M., and Karim A. A two-pass statistical approach for automatic personalized spam filtering. In Proceedings of the ECML-PKDD Discovery Challenge Workshop, 2006.
9. Klimt B. and Yang Y. The Enron corpus: A new dataset for email classification research. In Proceedings of the European Conference on Machine Learning, 2004.
10. Trogkanis N. and Paliouras G. TPN2: Using positive-only learning to deal with the heterogeneity of labeled and unlabeled data. In Proceedings of the ECML-PKDD Discovery Challenge Workshop, 2006.
11. Steinberg D. and Golovnya M. Identifying spam with predictive models. In Proceedings of the ECML-PKDD Discovery Challenge Workshop, 2006.

