Updated September 24, 2008 — Area: Spam Cleansing.
Web spam (automatically generated Web content designed to deceive search engines) has become a real industry, estimated to account for up to 20% of the Web.
The goal here is to provide methods and tools for distinguishing useful, high-quality Web content from spam content.
- Develop methods to maintain a clean view of the archived documents and their change history.
- Save resources by excluding spam and useless content when crawling for the Web archive.
- Adjust crawling priority (depth, frequency of visits, cleansing effort) based on content quality rather than on the amount of search engine optimization effort invested by the site owner.
- Develop methods to distinguish real changes in dynamic content from surface modifications (e.g., creation timestamps), to provide a more useful record of wayback history.
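The last objective — separating real content changes from surface modifications such as regenerated timestamps — can be illustrated with a minimal sketch: normalize a page by stripping volatile features, then compare fingerprints of the remainder. The patterns and function names below are hypothetical illustrations, not part of the LiWA tools.

```python
import hashlib
import re

# Volatile surface features whose changes should not count as real
# content updates (illustrative list; a real crawler would use a
# richer, site-aware set of rules).
VOLATILE_PATTERNS = [
    re.compile(r"\b\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}(:\d{2})?\b"),  # timestamps
    re.compile(r"\b(Last\s+updated|Generated)\s*:?", re.IGNORECASE),
]

def content_fingerprint(html_text: str) -> str:
    """Hash the page after removing volatile surface features."""
    text = html_text
    for pattern in VOLATILE_PATTERNS:
        text = pattern.sub("", text)
    # Collapse whitespace so formatting-only edits do not register.
    text = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def has_real_change(old_html: str, new_html: str) -> bool:
    """True only if the two snapshots differ after normalization."""
    return content_fingerprint(old_html) != content_fingerprint(new_html)
```

Under this scheme, two crawls of a page that differ only in a regenerated "Generated: …" line hash to the same fingerprint, so the archive records no spurious new version.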
In this domain, LiWA intends to deliver improvements measurable in terms of: the percentage of spam (1) crawled and (2) presented to the user; the quality of depth and freshness over a sampled set of Web sites; and the fraction of detected changes that are manually classified as minor or irrelevant.
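To make the spam-exclusion objective concrete, a common starting point in the web-spam literature is scoring pages on simple content features such as keyword repetition and link density. The sketch below is a hypothetical heuristic with illustrative, untuned thresholds; it is not the LiWA method.

```python
import re
from collections import Counter

def spam_features(text: str, num_links: int) -> dict:
    """Compute simple content features often used in web-spam
    detection: fraction taken by the most common word, average
    word length, and links per word."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {"top_word_fraction": 0.0, "avg_word_len": 0.0,
                "link_density": 0.0}
    counts = Counter(words)
    return {
        "top_word_fraction": counts.most_common(1)[0][1] / len(words),
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "link_density": num_links / len(words),
    }

def looks_like_spam(text: str, num_links: int) -> bool:
    """Illustrative thresholds: heavy keyword stuffing or extreme
    link density are typical signals of machine-generated spam."""
    f = spam_features(text, num_links)
    return f["top_word_fraction"] > 0.25 or f["link_density"] > 0.5
```

A crawler could apply such a filter before fetching a site's deeper pages, feeding the "percentage of spam crawled" metric above; a production system would replace the thresholds with a trained classifier.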