Temporal Analysis for Web Spam Detection: An Overview

Posted March 10, 2011
Areas : Temporal Coherence, General.

The paper “Temporal Analysis for Web Spam Detection: An Overview”, co-written by M. Erdélyi and A. A. Benczúr, has been accepted for presentation at TWAW 2011, held in conjunction with WWW 2011 in Hyderabad, India (CEUR Workshop Proceedings, 2011).

In this paper we give a comprehensive overview of temporal features devised for Web spam detection and provide measurements for different feature sets.

 

The SOLAR System for Sharp Web Archiving

Posted October 22, 2010
Areas : Temporal Coherence, General.

The paper entitled “The SOLAR System for Sharp Web Archiving” by A. Mazeika, D. Denev, M. Spaniol and G. Weikum has been accepted for presentation at IWAW 2010.

This paper presents the SOLAR (Scheduling of Downloads for Archiving of Web Sites) system for sharp Web archiving. SOLAR crawls all pages of a Web site and then re-crawls the visited pages, forming visit-revisit intervals. If all visit-revisit intervals overlap and no page changed between its visit and revisit, then all pages are “sharp” and captured as if the entire site were downloaded instantaneously. SOLAR judiciously schedules visits and revisits to maximize the number of sharp pages based on predictions of page-specific change rates. Experiments with synthetic data show that SOLAR outperforms existing techniques and captures sites as sharply as possible.
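The sharpness condition described above can be illustrated with a minimal sketch. This is not the SOLAR implementation: the `Capture` record, its field names, and the helper functions are hypothetical, and "all intervals overlap" is read here as "all intervals share at least one common point in time".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Capture:
    """Hypothetical record for one page of a crawled site."""
    url: str
    visit: float     # time of the first download
    revisit: float   # time of the second download
    changed: bool    # True if the content differed between visit and revisit


def intervals_overlap(captures: List[Capture]) -> bool:
    # Closed intervals on a line share a common point exactly when
    # the latest start is not after the earliest end.
    latest_visit = max(c.visit for c in captures)
    earliest_revisit = min(c.revisit for c in captures)
    return latest_visit <= earliest_revisit


def site_is_sharp(captures: List[Capture]) -> bool:
    # Sharp (under the assumptions above): every interval overlaps
    # and no page changed between its visit and its revisit.
    if not captures:
        return False
    return intervals_overlap(captures) and not any(c.changed for c in captures)
```

For instance, three unchanged pages with intervals [0, 10], [2, 8] and [4, 6] would count as sharp under this reading, while adding a page captured in [11, 12] would break the overlap.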

 

A Language Modeling Approach for Temporal Information Needs

Posted December 21, 2009
Areas : Temporal Coherence, General.

A paper titled “A Language Modeling Approach for Temporal Information Needs” by Klaus Berberich, Srikanta Bedathur, Omar Alonso, and Gerhard Weikum has been accepted for presentation at the 32nd European Conference on Information Retrieval (ECIR 2010).

The paper proposes a language modeling approach that leverages temporal expressions to improve retrieval effectiveness for temporal information needs. Experiments on the New York Times Annotated Corpus with relevance assessments obtained from Amazon Mechanical Turk show that the method yields substantial improvements. The paper will be available as part of the ECIR 2010 proceedings.

 

“Catch me if you can”: Visual Analysis of Coherence Defects in Web Archiving

Posted November 13, 2009
Areas : Temporal Coherence.

A paper on “Visual Analysis of Coherence Defects in Web Archiving” has been published as part of IWAW09, which took place on the 30th of September and 1st of October 2009, in conjunction with ECDL in Corfu (Greece). The paper is available online as part of the IWAW 2009 proceedings.

The paper “‘Catch me if you can’: Visual Analysis of Coherence Defects in Web Archiving” by Marc Spaniol, Arturas Mazeika, Dimitar Denev and Gerhard Weikum deals with the problems in Web archiving that arise because the World Wide Web is a continuously evolving network of contents (e.g. Web pages, images, sound files, etc.) and an interconnecting link structure. The paper discusses questions about detecting, measuring and, finally, understanding coherence defects. To this end, visualization strategies are presented that can be applied at different levels of granularity: working with (in the ideal case) properly set last-modified timestamps, based on metadata extracted from the crawler in accelerated crawl-revisit pairs, or from the Internet Archive’s WARC files. In order to help the archivist in understanding the nature of these defects, this paper investigates means for visualizing change behavior and archive coherence.

 

IWAW 2009

Posted November 13, 2009
Areas : Archive Fidelity, Temporal Coherence, Semantic Evolution, Social Web, General.

IWAW09 took place on the 30th of September and 1st of October 2009, in conjunction with ECDL in Corfu (Greece). The proceedings are now available online.

Around 40 participants attended IWAW2009, which took place on Sep. 30 / Oct. 1 2009, in conjunction with ECDL in Corfu (Greece). The workshop provided a comprehensive overview of active research and practice on the preservation of the Web. This year’s workshop also addressed several new approaches and research topics (from the preservation of virtual worlds to the temporal dimension of Web archives) as well as practical issues faced by archiving institutions, specifically with respect to managing the storage of large volumes of digital material. In this context, a special session was devoted to the WARC storage format, which has been accepted as a new ISO standard (ISO 28500:2009), as well as emerging tool support to handle these container objects. In general, scalability issues and managing large-volume crawls were topics of intensive discussion, based on the growing body of experience now available in numerous institutions running Web archiving activities in a range of different configurations.

 

SHARC: Framework for Quality-Conscious Web Archiving

Posted May 29, 2009
Areas : Temporal Coherence, General.

A paper on quality-conscious web archiving has been accepted at the 35th International Conference on Very Large Data Bases (VLDB 2009).

The paper on quality-conscious web archiving by Dimitar Denev, Arturas Mazeika, Marc Spaniol, and Gerhard Weikum has been accepted for presentation at the 35th International Conference on Very Large Data Bases (VLDB 2009). The conference takes place on 24-28 August in Lyon, France. The paper presents the SHARC framework for assessing the data quality in Web archives and for tuning capturing strategies towards better quality with given resources. The paper defines quality measures, characterizes their properties, and derives a suite of quality-conscious scheduling strategies for archive crawling.

 

Data Quality in Web Archiving

Posted March 02, 2009
Areas : Temporal Coherence, General.

A paper on “Data Quality in Web Archiving” has been accepted for presentation at the 3rd Workshop on Information Credibility on the Web (WICOW 2009) in conjunction with WWW 2009.

The paper on “Data Quality in Web Archiving” by Marc Spaniol, Dimitar Denev, Arturas Mazeika, Pierre Senellart and Gerhard Weikum has been accepted for presentation at the 3rd Workshop on Information Credibility on the Web (WICOW 2009). The workshop and paper presentation take place on April 20 in Madrid, Spain, and the workshop is organized in conjunction with the 18th International World Wide Web Conference (WWW 2009). The paper addresses the problem that capturing a large Web site may span hours or even days, which increases the risk that contents collected so far are incoherent with the parts that are still to be crawled. The paper introduces a model for identifying coherent sections of an archive and, thus, measuring the data quality in Web archiving. Additionally, a crawling strategy is introduced that aims to ensure archive coherence by minimizing the diffusion of Web site captures.

 

Time Will Tell: Leveraging Temporal Expressions in IR

Posted January 26, 2009
Areas : Temporal Coherence, General.

Presented at WSDM 09 by Irem Arikan and Klaus Berberich (MPG)

Presented by Irem Arikan and Klaus Berberich in the late-breaking results at WSDM 09, Barcelona
See presentation here

 

LiWA Architecture

Posted October 07, 2008
Areas : Archive Fidelity, Spam Cleansing, Temporal Coherence, Semantic Evolution, Social Web, Rich Media, General.

Presented at IWAW 08 by Radu Pop, Wolf Siberski, Mark Williamson

Overview of the current state of the LiWA architecture and proposal for the testbed infrastructure. The focus was on the modularity of the architecture and the communication between different modules based on web service invocation.

See presentation here
 

“Catch me if you can”. Temporal Coherence of Web Archives

Posted October 05, 2008
Areas : Temporal Coherence, General.

Presented at IWAW 08 by Marc Spaniol

See presentation here