The paper “The SHARC framework for data quality in Web archiving”, co-written by D. Denev, A. Mazeika, M. Spaniol and G. Weikum, has been accepted for publication in the VLDB Journal 2011 (impact factor: 4.517 (2009)).
The paper is available for download via Online First in the VLDB Journal.
The paper “Temporal Analysis for Web Spam Detection: An Overview”, co-written by M. Erdélyi and A. A. Benczúr, has been accepted for presentation at TWAW 2011 in conjunction with WWW2011, Hyderabad, India, CEUR Workshop Proceedings 2011.
In this paper we give a comprehensive overview of temporal features devised for Web spam detection providing measurements for different feature sets.
The paper entitled “Web spam classification: a few features worth more”, co-written by M. Erdélyi, A. Garzó, and A. A. Benczúr has been accepted for presentation in Joint Web Quality 2011 in conjunction with the WWW2011, Hyderabad, India, ACM Press 2011.
In this paper we investigate how much various classes of Web spam features, some requiring very high computational effort, add to the classification accuracy. We find that advances in machine learning, an area that has received less attention in the adversarial IR community, yield more improvement than new features and result in low-cost yet accurate spam filters.
G. Zenz, N. Tahmasebi and T. Risse have been invited to submit their paper “Language Evolution On The Go” (Extended Version) to the Journal on Multimedia Tools and Applications.
The paper “On the Applicability of Word Sense Discrimination on 201 Years of Modern English”, co-written by N. Tahmasebi, K. Niklas, G. Zenz and T. Risse, has been submitted to the Journal of Computational Linguistics.
Word sense discrimination is the first, important step towards automatic detection of language evolution within large, historic document collections. By comparing the found word senses over time, we can reveal and use important information that will improve understanding and accessibility of a digital archive. Algorithms for word sense discrimination have been developed while keeping today’s language in mind and have thus been evaluated on well selected, modern datasets. The quality of the word senses found in the discrimination step has a large impact on the detection of language evolution. Therefore, as a first step, we verify that word sense discrimination can successfully be applied to digitized historic documents and that the results correctly correspond to word senses. Because accessibility of digitized historic collections is influenced also by the quality of the optical character recognition (OCR), as a second step we investigate the effects of OCR errors on word sense discrimination results. All evaluations in this paper are performed on The Times Archive, a collection of newspaper articles from 1785 - 1985.
The paper entitled “Language Evolution On The Go” by G. Zenz, N. Tahmasebi and T. Risse will be presented at SAME 2010
Knowing about the evolution of a term can significantly decrease the time needed to search for information. It can also aid in quickly getting a broader overview, which is essential when one is on the move. In this paper we present a solution for providing language evolution knowledge “on the go”. At the 3rd International Workshop on Semantic Ambient Media Experience 2010, held on November 10th in conjunction with AmI-10 in Malaga, Spain, the LiWA project will present a mobile interface for easy access and visualization as well as an overview of how this evolution was found.
The paper entitled “Archiving Web Video” by R. Pop, G. Vasile and J. Masanès has been accepted for presentation at IWAW 2010
Web archivists have a difficult time gathering web videos, which are more often than not served with non-standard tools and protocols. This paper offers a survey of the state of the art in this domain.
Based on several years of experience in gathering web video content, detailed examples are presented to help understand the issues and solutions involved in capturing web video content.
This paper also presents an architectural framework for scaling web video content capture developed as part of the EU research project LiWA.
The paper entitled “The SOLAR System for Sharp Web Archiving” by A. Mazeika, D. Denev, M. Spaniol and G. Weikum has been accepted for presentation at IWAW 2010
This paper presents the SOLAR (Scheduling of Downloads for Archiving of Web Sites) system for sharp Web archiving. SOLAR crawls all pages of a Web site and then re-crawls the visited pages, forming visit-revisit intervals. If all visit-revisit intervals overlap and no page changed between its visit and revisit, then all pages are “sharp” and captured as if the entire site were downloaded instantaneously. SOLAR judiciously schedules visits and revisits to maximize the number of sharp pages based on the predictions of page-specific change rates. Experiments with synthetic data show that SOLAR outperforms existing techniques and captures the sites as sharply as possible.
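The sharpness condition described in this abstract can be illustrated with a minimal sketch. It is not the SOLAR scheduler itself: the `PageCapture` record, its field names and the helper functions are hypothetical names introduced for illustration, and the change times of each page are assumed to be known.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PageCapture:
    """Hypothetical record of one page's visit-revisit interval."""
    url: str
    visit: float    # time of the first download
    revisit: float  # time of the second download
    changes: List[float] = field(default_factory=list)  # assumed known change times


def intervals_overlap(captures):
    """True if all visit-revisit intervals share at least one common point in time."""
    if not captures:
        return True
    # Intervals on a line all overlap exactly when the latest start
    # is not after the earliest end.
    return max(c.visit for c in captures) <= min(c.revisit for c in captures)


def unchanged(capture):
    """True if the page did not change between its visit and its revisit."""
    return not any(capture.visit <= t <= capture.revisit for t in capture.changes)


def all_sharp(captures):
    """True if the capture looks as if the site was downloaded instantaneously."""
    return intervals_overlap(captures) and all(unchanged(c) for c in captures)
```

In this sketch, a single page that changes inside its own interval, or a single interval that does not overlap the others, is enough to make the capture of the site non-sharp.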
The paper entitled “Archiving Data Objects using Web Feeds” by M. Oita and P. Senellart has been accepted for presentation at IWAW 2010
Web feeds, either in RSS or Atom XML-based formats, are evolving descriptive documents that characterize a dynamic hub of a Web site and help subscribers keep up with the most recent Web content of interest. This paper shows how Web feeds can be useful instruments for information extraction and Web page change detection. Web pages referenced by feed items are usually blog posts or news articles, that is, data of a dynamic (and hence ephemeral) nature that is clustered topically in a feed channel.
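As a rough illustration of how feed items point to the pages worth archiving, the following sketch reads an RSS 2.0 document and lists item links that were not seen before. It is not the method of the paper: the sample feed, the `feed_items` and `new_links` helpers and the idea of detecting new content by comparing links against a set of previously seen links are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed; the URLs are placeholders.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example channel</title>
    <item>
      <title>First post</title>
      <link>http://example.org/posts/1</link>
      <pubDate>Mon, 06 Sep 2010 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.org/posts/2</link>
      <pubDate>Tue, 07 Sep 2010 09:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def feed_items(rss_text):
    """Return title, link and publication date of every item in an RSS 2.0 feed."""
    root = ET.fromstring(rss_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "pubDate": item.findtext("pubDate", default=""),
        }
        for item in root.iter("item")
    ]


def new_links(seen_links, rss_text):
    """Return links of feed items that are not in the set of already seen links."""
    return [it["link"] for it in feed_items(rss_text) if it["link"] not in seen_links]
```

Calling `new_links` with the links collected from an earlier version of the feed yields the pages that have appeared since then and are candidates for capture.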
The paper entitled “Terminology Evolution Module for Web Archives in the LiWA Context” by N. Tahmasebi, G. Zenz, T. Risse and T. Iofciu has been accepted for presentation at IWAW 2010
This paper presents the LiWA terminology evolution module, TeVo, which takes us one step closer to fully automatic detection of terminology evolution. TeVo consists of a pipeline for finding evolution from web archives based on the UIMA framework. The LiWA TeVo module consists of two main processing chains, the first for WARC file extraction and text processing and the second for finding terminology evolution. The TeVo browser, a terminology evolution browser that aids in exploring the evolution of terms present in archives, is also presented.
A paper entitled “Using Word Sense Discrimination on Historic Document Collections” has been accepted for presentation at the 10th ACM/IEEE JCDL
The paper entitled “Using Word Sense Discrimination on Historic Document Collections” by Nina Tahmasebi, Kai Niklas, Thomas Theuerkauf and Thomas Risse has been accepted at the 10th ACM/IEEE Joint Conference on Digital Libraries. The paper evaluates word sense discrimination on historic document collections to investigate whether word senses can be found automatically using modern technology applied to historic data. The paper also investigates what impact OCR errors, present in scanned historic documents, have on finding word senses automatically. Finding word senses automatically is the first step towards detecting terminology evolution and hence an important step in our research. Nina Tahmasebi will present the paper on June 22nd, 2010 at JCDL, which is held in conjunction with ICADL in Surfers Paradise (Gold Coast, Australia).
A paper titled “A Language Modeling Approach for Temporal Information Needs” by Klaus Berberich, Srikanta Bedathur, Omar Alonso, and Gerhard Weikum has been accepted for presentation in the 32nd European Conference on Information Retrieval (ECIR 2010).
The paper proposes a language modeling approach that leverages temporal expressions to improve retrieval effectiveness for temporal information needs. Experiments on the New York Times Annotated Corpus with relevance assessments obtained from Amazon Mechanical Turk show that the method yields substantial improvements. The paper will be available as part of the ECIR 2010 proceedings.
A paper on First Results on Detecting Term Evolutions has been accepted at IWAW 2009
The paper “First Results on Detecting Term Evolutions” by Nina Tahmasebi, Sukriti Ramesh and Thomas Risse has been accepted and presented at IWAW09, which took place on the 30th of September and 1st of October 2009, in conjunction with ECDL in Corfu, Greece. The paper presents first results on detecting term evolutions.
A paper on Automatic Detection on Terminology Evolution has been accepted at On The Move Academy in conjunction with On The Move Federated Conferences 2009
The paper entitled “Automatic Detection on Terminology Evolution” by Nina Tahmasebi has been accepted for presentation at the On The Move Academy 2009 in conjunction with the On The Move Federated Conferences, Vilamoura, Portugal 2009. The paper won the Best Paper Award, which was handed out during the social event on November 4. The paper presents a Ph.D. proposal on the topic of detecting term evolutions for use in information retrieval in long-term archives.
A paper on “Visual Analysis of Coherence Defects in Web Archiving” has been published as part of IWAW09, which took place on the 30th of September and 1st of October 2009, in conjunction with ECDL in Corfu (Greece). The paper is available online as part of the IWAW 2009 proceedings.
The paper “‘Catch me if you can’: Visual Analysis of Coherence Defects in Web Archiving” by Marc Spaniol, Arturas Mazeika, Dimitar Denev and Gerhard Weikum deals with the problems in Web archiving arising from the fact that the World Wide Web is a continuously evolving network of contents (e.g. Web pages, images, sound files, etc.) and an interconnecting link structure. The paper discusses questions that arise about detecting, measuring and, finally, understanding coherence defects. To this end, visualization strategies are presented that might be applied at different levels of granularity: working with (in the ideal case) properly set last-modified timestamps, based on metadata extracted from the crawler in accelerated crawl-revisit pairs, or from the Internet Archive’s WARC files. In order to help the archivist in understanding the nature of these defects, this paper investigates means for visualizing change behavior and archive coherence.
Around 40 participants attended IWAW2009, which took place on Sep. 30 / Oct. 1 2009, in conjunction with ECDL in Corfu (Greece). The workshop provided a comprehensive overview of active research and practice on the preservation of the Web. This year’s workshop also addressed several new approaches and research topics (from virtual worlds preservation to the temporal dimension of Web archives) as well as practical issues faced by archiving institutions, specifically with respect to managing the storage of large volumes of digital material. In this context, a special session was devoted to the WARC storage format, which has been accepted as a new ISO standard (ISO 28500:2009), as well as emerging tool support to handle these container objects. In general, scalability issues and managing large-volume crawls were topics of intensive discussion, based on the increasing body of experience now available in numerous institutions running a series of Web archiving activities in a range of different configurations.
A paper on dealing with terminology evolution in web archives has been accepted in the 12th International Workshop on the Web and Databases (WebDB 2009)
The paper entitled ‘Bridging the Terminology Gap in Web Archive Search’ by Klaus Berberich, Srikanta Bedathur, Mauro Sozio, and Gerhard Weikum has been accepted in the 12th International Workshop on the Web and Databases (WebDB 2009). The paper proposes a method to find query reformulations that paraphrase users’ information needs using past terminology. Such query reformulations are key to retrieving old but highly relevant documents in web archives that were written using now outdated terminology. Klaus Berberich will present the paper on June 28th, 2009 at WebDB 2009, which is held in conjunction with SIGMOD 2009 in Providence (Rhode Island, USA).
A paper on quality-conscious web archiving has been accepted in the 35th International Conference on Very Large Data Bases (VLDB 2009)
The paper on quality-conscious web archiving by Dimitar Denev, Arturas Mazeika, Marc Spaniol, and Gerhard Weikum has been accepted for presentation at the 35th International Conference on Very Large Data Bases (VLDB 2009). The conference takes place on 24-28 August in Lyon, France. The paper presents the SHARC framework for assessing the data quality in Web archives and for tuning capturing strategies towards better quality with given resources. The paper defines quality measures, characterises their properties, and derives a suite of quality-conscious scheduling strategies for archive crawling.
While Web spam is targeted at the high commercial value of top-ranked search-engine results, Web archives observe quality deterioration and resource waste as a side effect. So far, Web spam filtering technologies are rarely used by Web archivists, but their use is planned for the future, as indicated in a survey with responses from more than 20 institutions worldwide. These archives typically operate on a modest budget that prohibits the operation of standalone Web spam filtering, but collaborative efforts could lead to a high-quality solution for them. In this paper we illustrate spam filtering needs, opportunities and blockers for Internet archives by analyzing several crawl snapshots, and we show the difficulty of migrating filter models across different crawls using the example of the 13 .uk snapshots performed by UbiCrawler that include WEBSPAM-UK2006 and WEBSPAM-UK2007. See the full paper.
A paper on “Data Quality in Web Archiving” has been accepted for presentation at the 3rd Workshop on Information Credibility on the Web (WICOW 2009) in conjunction with WWW 2009.
The paper on “Data Quality in Web Archiving” by Marc Spaniol, Dimitar Denev, Arturas Mazeika, Pierre Senellart and Gerhard Weikum has been accepted for presentation at the 3rd Workshop on Information Credibility on the Web (WICOW 2009). The workshop and paper presentation take place on April 20 in Madrid, Spain, and are organized in conjunction with the 18th International World Wide Web Conference (WWW 2009). The paper addresses the problem that capturing a large Web site may span hours or even days, which increases the risk that contents collected so far are incoherent with the parts that are still to be crawled. The paper introduces a model for identifying coherent sections of an archive and, thus, measuring the data quality in Web archiving. Additionally, a crawling strategy is introduced that aims to ensure archive coherence by minimizing the diffusion of Web site captures.
Presented at WSDM 09 by Irem Arikan and Klaus Berberich (MPG)
Presented by Irem Arikan and Klaus Berberich in the late-breaking results at WSDM 09, Barcelona
See presentation here
Presented at IWAW 08 by Radu Pop, Wolf Siberski, Mark Williamson
See presentation here
Presented at IWAW 08 by Mark Williamson
See presentation here
Presented at IWAW 08 by Andras Benczur, David Siklosi, Jacint Szabo, Istvan Biro, Zsolt Fekete, Miklos Kurucz, Attila Pereszlenyi, Simon Racz, Adrienn Szabo
While Web archive quality is endangered by Web spam, a side effect of the high commercial value of top-ranked search-engine results, so far Web spam filtering technologies are rarely used by Web archivists. In this paper we make the first attempt to disseminate existing methodology and envision a solution for Web archives to share knowledge and unite efforts in Web spam hunting. We survey the state of the art in Web spam filtering illustrated by the recent Web spam challenge data sets and techniques and describe the filtering solution for archives envisioned in the LiWA project.
See paper here
Presented at IWAW 08 by Marc Spaniol
See presentation here
Presented at IWAW 08 by Nina Tahmasebi, Tereza Iofciu, Thomas Risse, Claudia Niederée, Wolf Siberski
The correspondence between the terminology used for querying and the one used in the content objects to be retrieved is a crucial prerequisite for effective retrieval technology. However, as terminology evolves over time, a growing gap opens up between older documents in (long-term) archives and the active language used for querying such archives. Thus, technologies for detecting and systematically handling terminology evolution are required to ensure “semantic” accessibility of (Web) archive content in the long run. As a starting point for dealing with terminology evolution, this paper formalizes the problem and discusses issues, first ideas and relevant technologies.