Data Quality in Web Archiving

Posted March 02, 2009
Areas: Temporal Coherence, General.

A paper on “Data Quality in Web Archiving” has been accepted for presentation at the 3rd Workshop on Information Credibility on the Web (WICOW 2009) in conjunction with WWW 2009.

The paper “Data Quality in Web Archiving” by Marc Spaniol, Dimitar Denev, Arturas Mazeika, Pierre Senellart and Gerhard Weikum has been accepted for presentation at the 3rd Workshop on Information Credibility on the Web (WICOW 2009). The workshop and the paper presentation take place on April 20 in Madrid, Spain, in conjunction with the 18th International World Wide Web Conference (WWW 2009). The paper addresses a problem in capturing large Web sites: a crawl may span hours or even days, which increases the risk that the contents collected so far are incoherent with the parts that are still to be crawled. It introduces a model for identifying coherent sections of an archive and thus for measuring data quality in Web archiving. Additionally, it introduces a crawling strategy that aims to ensure archive coherence by minimizing the diffusion of Web site captures.