Spam Cleansing
Updated September 24, 2008. Area: Spam Cleansing.

Web spam (automatically generated Web content designed to cheat search engines) has become a real industry, estimated to account for up to 20% of the Web.

The goal here is to provide methods and tools for distinguishing useful, high-quality Web content from spam content.

- Develop methods to maintain a clean view of the documents and their change history.

- Save resources by excluding spam and useless content when crawling for the Web archive.

- Adjust crawling priority (depth, frequency of visits, cleansing efforts) based on content quality rather than on the site owner's search engine optimization efforts.

- Develop methods to distinguish real changes in dynamic content from surface modifications (e.g., an updated creation timestamp), to provide a more useful record of wayback history.
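
As an illustration of the last objective, the following minimal sketch treats two versions of a page as unchanged when they differ only in date or time strings. It is not LiWA's method; the patterns and the function names (`normalize`, `fingerprint`, `is_real_change`) are assumptions made for this example, and a real system would need a far richer notion of surface modification.

```python
import hashlib
import re

# Hypothetical patterns for volatile "surface" content. These are
# illustrative assumptions, not a list taken from the LiWA project.
VOLATILE_PATTERNS = [
    # ISO-like dates, optionally followed by a time (2008-09-24 10:15:00)
    re.compile(r"\d{4}-\d{2}-\d{2}([T ]\d{2}:\d{2}(:\d{2})?)?"),
    # Bare clock times (10:15 or 10:15:00)
    re.compile(r"\b\d{1,2}:\d{2}(:\d{2})?\b"),
]


def normalize(text: str) -> str:
    """Remove volatile fragments and collapse runs of whitespace."""
    for pattern in VOLATILE_PATTERNS:
        text = pattern.sub("", text)
    return " ".join(text.split())


def fingerprint(text: str) -> str:
    """Hash the normalized text so versions can be compared cheaply."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def is_real_change(old: str, new: str) -> bool:
    """Return True if the versions differ after surface content is removed."""
    return fingerprint(old) != fingerprint(new)
```

With this sketch, two captures that differ only in a "generated on" timestamp produce the same fingerprint, while an edit to the surrounding text produces a different one. Exact hashing is the simplest possible choice; it flags every remaining difference as a real change, however small.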

In this domain, LiWA intends to provide improvements measurable in terms of: the percentage of spam (1) crawled and (2) presented to the user; the quality of depth and freshness over a sampled set of Web sites; and the amount of detected changes that are manually classified as minor or irrelevant.

