The paper entitled “Web spam classification: a few features worth more”, co-written by M. Erdélyi, A. Garzó, and A. A. Benczúr, has been accepted for presentation at Joint Web Quality 2011, held in conjunction with WWW2011, Hyderabad, India, ACM Press 2011.
In this paper we investigate how much various classes of Web spam features, some requiring very high computational effort, add to the classification accuracy. We find that advances in machine learning, an area that has received less attention in the adversarial IR community, yield more improvement than new features and result in low-cost yet accurate spam filters.
While Web spam targets the high commercial value of top-ranked search-engine results, Web archives suffer quality deterioration and resource waste as a side effect. So far, Web spam filtering technologies are rarely used by Web archivists, but their use is planned for the future, as indicated by a survey with responses from more than 20 institutions worldwide. These archives typically operate on modest budgets that prohibit running standalone Web spam filtering, but collaborative efforts could lead to a high-quality solution for them. In this paper we illustrate spam filtering needs, opportunities and blockers for Internet archives by analyzing several crawl snapshots, and we show the difficulty of migrating filter models across different crawls using the example of the 13 .uk snapshots performed by UbiCrawler, which include WEBSPAM-UK2006 and WEBSPAM-UK2007. See the full paper.
Presented at IWAW 08 by Radu Pop, Wolf Siberski, Mark Williamson
See presentation here
Presented at IWAW 08 by Andras Benczur, David Siklosi, Jacint Szabo, Istvan Biro, Zsolt Fekete, Miklos Kurucz, Attila Pereszlenyi, Simon Racz, Adrienn Szabo
While Web archive quality is endangered by Web spam, a side effect of the high commercial value of top-ranked search-engine results, so far Web spam filtering technologies are rarely used by Web archivists. In this paper we make the first attempt to disseminate existing methodology and envision a solution for Web archives to share knowledge and unite efforts in Web spam hunting. We survey the state of the art in Web spam filtering illustrated by the recent Web spam challenge data sets and techniques and describe the filtering solution for archives envisioned in the LiWA project.
See paper here