The SHARC framework for data quality in Web archiving

Posted March 10, 2011
Areas : Archive Fidelity, General.

The paper “The SHARC framework for data quality in Web archiving”, co-written by D. Denev, A. Mazeika, M. Spaniol and G. Weikum, has been accepted for publication in the VLDB Journal 2011 (impact factor: 4.517 (2009)).

The paper is available for download via Online First in the VLDB Journal.

 

Archiving Data Objects using Web Feeds

Posted October 22, 2010
Areas : Archive Fidelity, General.

The paper entitled “Archiving Data Objects using Web Feeds” by M. Oita and P. Senellart has been accepted for presentation at IWAW 2010.

Web feeds, in either the RSS or the Atom XML-based format, are evolving descriptive documents that characterize a dynamic hub of a Web site and help subscribers keep up with the most recent Web content of interest. This paper shows how Web feeds can be useful instruments for information extraction and Web page change detection. Web pages referenced by feed items are usually blog posts or news articles: data of a dynamic (and hence ephemeral) nature that is clustered topically in a feed channel.
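To illustrate the idea of feed-based change detection, here is a minimal sketch in Python. It is not the method of the paper: it only compares two snapshots of an RSS 2.0 feed and reports which referenced pages are new, which have changed metadata, and which have dropped out of the feed. The function names (`feed_items`, `diff_snapshots`) and the `example.org` sample data are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET


def feed_items(rss_xml):
    """Map each item's link to its (title, pubDate) pair in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    items = {}
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link:
            continue  # skip items that reference no Web page
        title = (item.findtext("title") or "").strip()
        pub_date = (item.findtext("pubDate") or "").strip()
        items[link] = (title, pub_date)
    return items


def diff_snapshots(old_xml, new_xml):
    """Return sorted lists of (added, changed, dropped) links between two snapshots."""
    old, new = feed_items(old_xml), feed_items(new_xml)
    added = sorted(set(new) - set(old))
    dropped = sorted(set(old) - set(new))
    # An item counts as changed when its title or publication date differs.
    changed = sorted(link for link in set(old) & set(new) if old[link] != new[link])
    return added, changed, dropped


# Hypothetical sample data: two snapshots of the same feed channel.
OLD_SNAPSHOT = """<rss version="2.0"><channel><title>Example news</title>
<item><title>A</title><link>http://example.org/a</link>
<pubDate>Mon, 18 Oct 2010 09:00:00 GMT</pubDate></item>
<item><title>B</title><link>http://example.org/b</link>
<pubDate>Tue, 19 Oct 2010 09:00:00 GMT</pubDate></item>
</channel></rss>"""

NEW_SNAPSHOT = """<rss version="2.0"><channel><title>Example news</title>
<item><title>B (updated)</title><link>http://example.org/b</link>
<pubDate>Wed, 20 Oct 2010 09:00:00 GMT</pubDate></item>
<item><title>C</title><link>http://example.org/c</link>
<pubDate>Thu, 21 Oct 2010 09:00:00 GMT</pubDate></item>
</channel></rss>"""

if __name__ == "__main__":
    added, changed, dropped = diff_snapshots(OLD_SNAPSHOT, NEW_SNAPSHOT)
    print("added:", added)
    print("changed:", changed)
    print("dropped:", dropped)
```

Items that drop out of the feed are the ephemeral ones: in this sketch, an archiving crawler would need to have fetched them while they were still listed.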

 

IWAW 2009

Posted November 13, 2009
Areas : Archive Fidelity, Temporal Coherence, Semantic Evolution, Social Web, General.

IWAW09 took place on the 30th of September and 1st of October 2009, in conjunction with ECDL in Corfu (Greece). The proceedings are now available online.

Around 40 participants attended IWAW 2009, which took place on Sep. 30 / Oct. 1 2009, in conjunction with ECDL in Corfu (Greece). The workshop provided a comprehensive overview of active research and practice in the preservation of the Web. This year’s workshop also addressed several new approaches and research topics (from the preservation of virtual worlds to the temporal dimension of Web archives) as well as practical issues faced by archiving institutions, specifically with respect to managing the storage of large volumes of digital material. In this context, a special session was devoted to the WARC storage format, which has been accepted as a new ISO standard (ISO 28500:2009), as well as to emerging tool support for handling these container objects. In general, scalability issues and the management of large-volume crawls were topics of intensive discussion, drawing on the growing body of experience at the numerous institutions that now run Web archiving activities in a range of different configurations.

 

LiWA Architecture

Posted October 07, 2008
Areas : Archive Fidelity, Spam Cleansing, Temporal Coherence, Semantic Evolution, Social Web, Rich Media, General.

Presented at IWAW 08 by Radu Pop, Wolf Siberski, Mark Williamson

Overview of the current state of the LiWA architecture and a proposal for the testbed infrastructure. The focus was on the modularity of the architecture and on communication between the different modules via web service invocation.

See presentation here
 

The Challenge of Dynamic Links

Posted October 07, 2008
Areas : Archive Fidelity, Rich Media, General.

Presented at IWAW 08 by Mark Williamson

See presentation here

 

Web Spam: a Survey with Vision for the Archivist

Posted October 06, 2008
Areas : Archive Fidelity, Spam Cleansing, General.

Presented at IWAW 08 by Andras Benczur, David Siklosi, Jacint Szabo, Istvan Biro, Zsolt Fekete, Miklos Kurucz, Attila Pereszlenyi, Simon Racz, Adrienn Szabo

Web archive quality is endangered by Web spam, a side effect of the high commercial value of top-ranked search-engine results, yet Web spam filtering technologies have so far rarely been used by Web archivists. In this paper we make the first attempt to disseminate existing methodology and envision a solution for Web archives to share knowledge and unite efforts in Web spam hunting. We survey the state of the art in Web spam filtering, illustrated by the recent Web spam challenge data sets and techniques, and describe the filtering solution for archives envisioned in the LiWA project.
See paper here