Press release originally published in German by L3S.
January 13, 2011
Internet pages are like stars in the sky: there are countless numbers of them, and new ones appear every day. They bring new texts, information, and pictures into existence, some of which exist only on the Internet. But who is to decide which pages are worth preserving? Libraries and archives are currently rather helpless in dealing with such gigantic amounts of data. With the current state of technology, it is not feasible to review and select all material for archiving. But this is about to change.
Two European projects will help to preserve the digital cultural heritage of the World Wide Web. The “Archive Community Memories” project focuses on the automatic selection of Web content that is socially relevant. A new archiving method will now specifically search for topics or events on the Web and rate their importance. To achieve this, not only the Web pages of organizations or companies are evaluated, but also private content such as publicly accessible blogs or social networks like Facebook. Social networks can be very helpful in discovering important Web pages, as users suggest such pages to their friends. By harnessing this and other information, the project will help to optimize and speed up the reviewing process of national libraries and archives.
The total budget of the EU project is eight million euros. L3S Research Center at Leibniz University Hannover receives one million euros and leads the scientific management. The overall management is led by researchers from the University of Sheffield. Several other partners are also involved in the project, such as Yahoo!, Südwestrundfunk, and Deutsche Welle.
The project is a follow-up to “Living Web Archives”, on which researchers from L3S Research Center and other European partners have been working over the past years. The goal was to improve the quality of Web archives, especially regarding multimedia content and spam detection, and to enable the use of the archive by future generations.
LiWA partners are pleased to announce the open-source release of the complete set of components and tools resulting from the LiWA project.
They are all grouped under the “liwa-technologies” project on Google Code:
1° The Rich Media Capture Module - a plug-in dedicated to the capture of streaming video content:
2° The Temporal Coherence Analyser - a plug-in dedicated to the analysis of the temporal coherence of the archived Web content:
3° The Spam Assessment Interface - a Web service that enables the quality assessment of the archived Web content:
4° The Semantic Analyser - a component dedicated to the detection of terminology evolution:
5° The Web Archive UI Framework - a client-side framework that helps create user interface helpers for Web archive browsing:
To learn more about each component, the Google project also provides a wiki space, giving a brief description of each module and the necessary steps for its deployment: http://code.google.com/p/liwa-technologies/w/list
You are all welcome to download and try out the LiWA components. Your feedback and comments will be greatly appreciated, helping us to improve the documentation and the usability of the technologies.
LiWA partners are pleased to announce the open-source release of the LiWA Evolution Tracking Module.
The LiWA Terminology Evolution Tracking Module is a Java module for word sense evolution tracking, released under the “liwa-technologies” project on Google Code:
The LiWA Newsletter No3 is now available, summarizing the findings and results of the 36-month project. Enjoy reading it!
The article “The SHARC framework for data quality in Web archiving”, co-written by D. Denev, A. Mazeika, M. Spaniol and G. Weikum, has been accepted for publication in the VLDB Journal 2011 (impact factor: 4.517 (2009)).
The article is available for download via Online First in the VLDB Journal.
The paper entitled “Web spam classification: a few features worth more”, co-written by M. Erdélyi, A. Garzó, and A. A. Benczúr, has been accepted for presentation at Joint Web Quality 2011, held in conjunction with WWW2011, Hyderabad, India, ACM Press 2011.
In this paper we investigate how much various classes of Web spam features, some requiring very high computational effort, add to the classification accuracy. We realize that advances in machine learning, an area that has received less attention in the adversarial IR community, yield more improvement than new features and result in low-cost yet accurate spam filters.
The paper “Temporal Analysis for Web Spam Detection: An Overview”, co-written by M. Erdélyi and A. A. Benczúr, has been accepted for presentation at TWAW 2011, held in conjunction with WWW2011, Hyderabad, India, CEUR Workshop Proceedings 2011.
In this paper we give a comprehensive overview of temporal features devised for Web spam detection providing measurements for different feature sets.
G. Zenz, N. Tahmasebi, and T. Risse have been invited to submit their paper “Language Evolution On The Go” (Extended Version) to the Journal on Multimedia Tools and Applications.
The paper “On the Applicability of Word Sense Discrimination on 201 Years of Modern English”, co-written by N. Tahmasebi, K. Niklas, G. Zenz, and T. Risse, has been submitted to the Journal of Computational Linguistics.
Word sense discrimination is the first, important step towards automatic detection of language evolution within large, historic document collections. By comparing the found word senses over time, we can reveal and use important information that will improve understanding and accessibility of a digital archive. Algorithms for word sense discrimination have been developed with today’s language in mind and have thus been evaluated on well-selected, modern datasets. The quality of the word senses found in the discrimination step has a large impact on the detection of language evolution. Therefore, as a first step, we verify that word sense discrimination can successfully be applied to digitized historic documents and that the results correctly correspond to word senses. Because the accessibility of digitized historic collections is also influenced by the quality of the optical character recognition (OCR), as a second step we investigate the effects of OCR errors on word sense discrimination results. All evaluations in this paper are performed on The Times Archive, a collection of newspaper articles from 1785 to 1985.
Jaap Blom gave a talk about Internet Archiving in an Audiovisual Institute at IWAW 2010 in Vienna, Austria, on September 22, 2010.
In this presentation three use cases were presented:
* preserve Dutch public broadcasting websites (preservation of Dutch cultural heritage)
* collect Internet AV materials (mainly AV content that is broadcast on the Internet but not on traditional media)
* preserve web context (to be used by archivists for looking up relevant context information for annotating Radio & Television items)
On December 15, 2010, the 2nd LiWA Terminology Evolution Evaluation Workshop will be held at L3S Research Center in Hannover, Germany. The workshop aims at evaluating terminology evolution found inside long-term archives. The workshop attendees will also evaluate the performance of the Terminology Evolution Browser, a tool developed within LiWA to better visualize evolution.
The 1st LiWA Terminology Evolution Evaluation Workshop was held on March 16, 2010 at L3S Research Center, Hannover, Germany. The workshop spanned half a day and aimed at evaluating the outcome of the LiWA WP5 technology.
The poster entitled “What if web archiving was as reliable as a simple button?” was presented at FIAT 2010, held from the 16th to the 18th of October in Dublin.
The 2010 IIPC Working Group meetings were held in Vienna (Austria) in conjunction with iPres2010 and IWAW2010, on the 23rd and 24th of September
Radu Pop presented the LiWA tools released in open-source, during the Harvesting Working Group session of the IIPC meetings.
The paper entitled “Language Evolution On The Go” by G. Zenz, N. Tahmasebi and T. Risse will be presented at SAME 2010
Knowing about the evolution of a term can significantly decrease the time needed to search for information. It can also help in quickly getting a broader overview, which is essential when one is on the move. In this paper we present a solution for providing language evolution knowledge “on the go”. At the 3rd International Workshop on Semantic Ambient Media Experience 2010, held on November 10th in conjunction with AmI-10 in Malaga, Spain, the LiWA project will present a mobile interface for easy access and visualization as well as an overview of how this evolution was found.
IWAW10 takes place on 22nd and 23rd of September in Vienna at the Austrian National Library
The following papers have been accepted for presentation at this International Web Archiving Workshop:
- “Archiving Web Video”, Radu Pop, Gabriel Vasile and Julien Masanes
- “The SOLAR System for Sharp Web Archiving”, Arturas Mazeika, Dimitar Denev, Marc Spaniol and Gerhard Weikum
- “Terminology Evolution Module for Web Archives in the LiWA Context”, Nina Tahmasebi, Gideon Zenz, Tereza Iofciu and Thomas Risse
- “Archiving Data Objects using Web Feeds”, Marilena Oita and Pierre Senellart.
The LiWA consortium released the Terminology Extraction Pipeline Version 1.0 as part of the Terminology Evolution Detection module. The released processing pipeline covers pre-processing, natural language processing and creation of a co-occurrence graph, implemented as the four sub-modules described below.
In more detail, the WARC Collection Reader (WARC Extraction) extracts the text and time metadata for each site archived in the input crawl. The POS (Part Of Speech) Tagger is an aggregate analysis engine from Dextract. It consists of a tokenizer, a language-independent part-of-speech tagger and a lemmatizer (TreeTagger). In the Term Extraction sub-module, we read the annotated sites and extract the lemmas and the different parts of speech that were identified for the archived sites. After that, we index the terms in a database (MySQL) index. In the Co-occurrence Analysis, we extract lemma or noun co-occurrence matrices for the indexed crawl from the database index.
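The co-occurrence step can be illustrated with a minimal Python sketch. It is not the LiWA implementation: it assumes the lemmas of each archived document are already available as plain lists (standing in for the tagger output and the MySQL index), and the function name `cooccurrence_counts` and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count in how many documents each pair of lemmas appears together.

    `documents` is an iterable of lemma lists (an assumed input format).
    Pairs are stored in alphabetical order so (a, b) and (b, a) coincide.
    """
    counts = Counter()
    for lemmas in documents:
        # Each lemma counts once per document, regardless of repetitions.
        unique = sorted(set(lemmas))
        for pair in combinations(unique, 2):
            counts[pair] += 1
    return counts


# Hypothetical lemma lists standing in for three archived pages.
docs = [
    ["web", "archive", "crawl"],
    ["web", "spam", "crawl"],
    ["archive", "terminology"],
]
counts = cooccurrence_counts(docs)
print(counts[("crawl", "web")])    # pair seen in two documents
print(counts[("archive", "web")])  # pair seen in one document
```

The resulting counter is a sparse form of the co-occurrence matrix; restricting the input lists to nouns would give the noun-only variant mentioned above.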
The Terminology Extraction Pipeline can be downloaded from Google Code:
ePractice.eu is a new service that merges the eGovernment Observatory with the Good Practice Framework and allows users to meet people, share experience and learn in public e-domains. The LiWA Project is now listed in the cases of the eGovernment domain.
LiWA partners have been busy over the last two years developing new Web archiving technology within the LiWA project.
They are now pleased to announce the open-source release of several components, as part of the “liwa-technologies” project on Google Code: http://code.google.com/p/liwa-technologies/.
You are all welcome to download and try out the plug-in dedicated to the streaming video content, the Rich Media Capture module:
This is still an experimental version of the software, so your feedback and comments will be greatly appreciated and will help us to improve the documentation and the usability of the module.
More LiWA tools will be available soon. Please stay tuned for the next releases.
The paper “Using Word Sense Discrimination on Historic Document Collections” has been accepted for the 10th ACM/IEEE JCDL
The paper entitled “Using Word Sense Discrimination on Historic Document Collections” by Nina Tahmasebi, Kai Niklas, Thomas Theuerkauf and Thomas Risse has been accepted at the 10th ACM/IEEE Joint Conference on Digital Libraries. The paper evaluates word sense discrimination on historic document collections to investigate whether word senses can be found automatically using modern technology applied to historic data. The paper also investigates what impact OCR errors, present in scanned historic documents, have on finding word senses automatically. Finding word senses automatically is the first step towards detecting terminology evolution and hence an important step in our research. Nina Tahmasebi will present the paper on June 22nd, 2010 at JCDL, which is held in conjunction with ICADL in Surfers Paradise (Gold Coast, Australia).
Thomas Risse presented LiWA at the meeting of the GI Working Group on Digital Libraries
Thomas Risse gave the members of the GI Working Group on Digital Libraries an update on the activities and results of the LiWA project during their bi-annual meeting on Tuesday, 14.4.2010.
Around 40 participants attended IWAW2009, which took place on Sep. 30 / Oct. 1 2009, in conjunction with ECDL in Corfu (Greece). The workshop provided a comprehensive overview of active research and practice on the preservation of the Web. This year’s workshop also addressed several new approaches and research topics (from virtual worlds preservation to the temporal dimension of Web archives) as well as practical issues faced by archiving institutions, specifically with respect to managing the storage of large volumes of digital material. In this context, a special session was devoted to the WARC storage format, which has been accepted as a new ISO standard (ISO 28500:2009), as well as emerging tool support to handle these container objects. In general, scalability issues and managing large-volume crawls were topics of intensive discussions, based on the increasing body of experience available in numerous institutions by now, running a series of Web archiving activities in a range of different configurations.
The LiWA applications and their R&D challenges will be presented at the conference “Cultural Heritage on line Empowering users: an active role for user communities” in Florence, Italy, on the 15th and 16th of December, 2009.
Web content plays an increasingly important role in the knowledge-based society, and the preservation and long-term accessibility of Web history has high value (e.g., for scholarly studies, market analyses, intellectual property disputes, etc.). There is strongly growing interest in its preservation by libraries and archival organizations as well as emerging industrial services. Web content characteristics (high dynamics, volatility, contributor and format variety) make adequate Web archiving a challenge.
LiWA will look beyond the pure “freezing” of Web content snapshots for a long time, transforming pure snapshot storage into a “Living” Web Archive. In order to create Living Web Archives, the LiWA project will address R&D challenges in three areas: archive fidelity, archive coherence and archive interpretability. The results of the project will be demonstrated within two application scenarios, namely “Streaming Archive” and “Social Web Archive”. The Streaming Archive application will showcase the building of an audio-visual Web archive and how Web information related to audio and video broadcasts can be preserved. The Social Web application will demonstrate how Web archives can capture the dynamics and the different types of user interaction of the social Web.
The LiWA Newsletter No2 is now available.
This second Newsletter focuses on the Streaming and Social Web applications.
Learn more by reading the presentation of the LiWA partners involved in the development of each application.
Dr. Thomas Risse (L3S) will give a talk at the “JISC, the DPC and the UK Web Archiving Consortium Workshop”, at The British Library Conference Centre in London, on July 21st.
The paper “From Web page storages to Living Web Archive” will be presented by Dr. Thomas Risse at the JISC, the DPC and the UK Web Archiving Consortium Workshop, which will take place at The British Library Conference Centre in London on July 21st.
A paper on dealing with terminology evolution in web archives has been accepted in the 12th International Workshop on the Web and Databases (WebDB 2009)
The paper entitled ‘Bridging the Terminology Gap in Web Archive Search’ by Klaus Berberich, Srikanta Bedathur, Mauro Sozio, and Gerhard Weikum has been accepted in the 12th International Workshop on the Web and Databases (WebDB 2009). The paper proposes a method to find query reformulations that paraphrase users’ information needs using past terminology. Such query reformulations are key to retrieving old but highly relevant documents in web archives that were written using now outdated terminology. Klaus Berberich will present the paper on June 28th, 2009 at WebDB 2009, which is held in conjunction with SIGMOD 2009 in Providence (Rhode Island, USA).
A paper on quality-conscious web archiving has been accepted in the 35th International Conference on Very Large Data Bases (VLDB 2009)
The paper on quality-conscious web archiving by Dimitar Denev, Arturas Mazeika, Marc Spaniol, and Gerhard Weikum has been accepted for presentation at the 35th International Conference on Very Large Data Bases (VLDB 2009). The conference takes place on 24-28 August in Lyon, France. The paper presents the SHARC framework for assessing the data quality in Web archives and for tuning capturing strategies towards better quality with given resources. The paper defines quality measures, characterises their properties, and derives a suite of quality-conscious scheduling strategies for archive crawling.
Nina Tahmasebi (L3S) will present a poster on the LiWA project at the CHORUS conference in Brussels (Belgium) on May 26-27, 2009. The poster will present the overall goals of LiWA and in particular semantic evolution.
Dr. Thomas Risse (L3S) will give a talk on “Terminology Evolution in Web Archiving” at Seminarraum, Institut für Knowledge und Business Engineering, Rathausstrasse 19/9, A-1010 Wien, on April 30, 2009.
Due to the central role that the World Wide Web plays in nearly all areas of today’s life, adequate Web archiving has become a cultural necessity in preserving knowledge. The next generation web archiving technologies will overcome limitations in content capture, preservation, analysis and enrichment. One important aspect is the archive interpretability. The correspondence between the terminology used for querying and the one used in content objects to be retrieved is a crucial prerequisite for effective content access based on retrieval technology. However, as terminology is evolving over time, a growing gap opens up between older documents in (long-term) archives and the active language used for querying such archives. Thus, technologies for detecting and systematically handling terminology evolution are required to ensure “semantic” accessibility of (Web) archive content in the long run. As a starting point for dealing with terminology evolution, the talk presents the problem and discusses issues, approaches and relevant technologies.
A talk on “Data Quality in Web Archiving” will be given at the 3rd Workshop on Information Credibility on the Web (WICOW 2009) in Madrid (Spain) on Monday, April 20.
The paper “Data Quality in Web Archiving” by Marc Spaniol, Dimitar Denev, Arturas Mazeika, Pierre Senellart and Gerhard Weikum will be presented at the 3rd Workshop on Information Credibility on the Web (WICOW 2009). The workshop and paper presentation take place on April 20 in Madrid, Spain, and are organized in conjunction with the 18th International World Wide Web Conference (WWW 2009). The paper addresses the problem that capturing a large Web site may span hours or even days, which increases the risk that contents collected so far are incoherent with the parts that are still to be crawled. The paper introduces a model for identifying coherent sections of an archive and, thus, measuring the data quality in Web archiving. Additionally, a crawling strategy is introduced that aims to ensure archive coherence by minimizing the diffusion of Web site captures.
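To make the notion of coherence more concrete, here is a minimal Python sketch under simplifying assumptions. It is one possible reading of the idea, not the model defined in the paper: a captured page is treated as coherent with a reference time if the page did not change between its capture and that reference time. The function names, the numeric timestamps and the sample data are all hypothetical.

```python
def is_coherent(capture_time, change_times, reference_time):
    """Return True if no change separates the capture from the reference time.

    Assumption: a change at time t makes the captured version differ from
    the version valid at the reference time when lo < t <= hi, where lo and
    hi are the earlier and later of the two time points.
    """
    lo, hi = sorted((capture_time, reference_time))
    return not any(lo < t <= hi for t in change_times)


def coherent_pages(captures, changes, reference_time):
    """Return the set of pages whose captures are coherent with the reference time."""
    return {
        page
        for page, captured_at in captures.items()
        if is_coherent(captured_at, changes.get(page, []), reference_time)
    }


# Hypothetical crawl: capture time per page and known change times per page.
captures = {"index.html": 10, "news.html": 40, "about.html": 70}
changes = {"index.html": [5, 90], "news.html": [20, 35], "about.html": [50]}

coherent = coherent_pages(captures, changes, reference_time=40)
fraction = len(coherent) / len(captures)
print(sorted(coherent), fraction)
```

In this toy crawl, the capture of about.html is separated from the reference time by a change, so only two of the three captures count as coherent. A crawl that keeps all captures close together in time leaves less room for such intervening changes.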
Julien Masanès (EA) gave a talk on “Building on Internet Memory” during a XRCE Seminar in Grenoble on March 12, 2009.
LiWA was presented, along with the necessity and challenges of digital preservation in the current context of mass information, diversity and evolution of the Web.
Marc Spaniol (Max-Planck-Institute for Computer Science) gave a lecture on “Web Archiving” within the scope of the course “Creation of an E-learning Module for Web Archiving” on March 17, 2009 at Stuttgart Media University (HdM), Germany.
On March 17, 2009, Dr. Marc Spaniol from Max-Planck-Institute for Computer Science presented ongoing research taking place within the LiWA project to students and scientists of Stuttgart Media University (HdM). During the 90-minute lecture, Dr. Spaniol introduced the main issues in Web archiving and presented examples of Web spam as well as crawling strategies that ensure temporal coherence. The lecture was part of an elective course on “Creation of an E-learning Module for Web Archiving” in the scope of the Library and Information Management program. This course is organized by Prof. Markus Hennies and Prof. Heidrun Wiesenmüller M.A. The lecture was open to the public and took place at 14.15 on March 17 at HdM’s site in Wolframstraße. Guests interested in Web archiving and/or LiWA (in particular) were welcome.
The first edition of LiWA NEWS, our newsletter, is published.
For the first edition of LiWA NEWS, our newsletter, we have asked all the LiWA research partners to present their goals and summarize their achievements for the first year. We hope you will enjoy reading it.
Marc Spaniol (Max-Planck-Institute for Computer Science) gave a lecture on “Web Archiving” within the scope of the “Web Science” course on November 28, 2008 at RWTH Aachen University, Germany.
The course on Web Science at RWTH Aachen University, organized by Prof. Dr. Matthias Jarke and Dr. Ralf Klamma of Lehrstuhl Informatik 5, addresses Web Science as a new and challenging field of study in computer science. This course covers the wide range of current and emerging Web concepts, technologies and Web-based software systems. In order to present recent approaches in the scope of Web archiving, Dr. Marc Spaniol from Max-Planck-Institute for Computer Science was invited to present ongoing research taking place within the LiWA project. During the 90-minute lecture, Dr. Spaniol introduced the main issues in Web archiving and presented examples of Web spam as well as crawling strategies that ensure temporal coherence.
A dedicated session took place during the 8th International Web Archiving Workshop
Over 70 web archivists and researchers in this domain attended the 8th edition of IWAW during which a full session was dedicated to present research objectives and early results from LiWA.
There were lots of questions and interest from the audience, which is a good sign for us. Links to the presentations from this session are listed below:
Web Spam: a Survey with Vision for the Archivist
Andras Benczur, David Siklosi, Jacint Szabo, Istvan Biro, Zsolt Fekete, Miklos Kurucz, Attila Pereszlenyi, Simon Racz, Adrienn Szabo (paper, presentation)
Radu Pop, Wolf Siberski, Mark Williamson (presentation)
“Catch me if you can”. Temporal Coherence of Web Archives
Marc Spaniol (presentation)
The Challenge of Dynamic Links
Mark Williamson (presentation)
Thomas Risse (L3S) presented LiWA and the problem of terminology evolution during the meeting of the IFIP 2.6 Working Group on Databases
Due to the central role that the World Wide Web plays in nearly all areas of today’s life, adequate Web archiving has become a cultural necessity in preserving knowledge. The next generation web archiving technologies will overcome limitations in content capture, preservation, analysis and enrichment. One important aspect is the archive interpretability. The correspondence between the terminology used for querying and the one used in content objects to be retrieved is a crucial prerequisite for effective content access based on retrieval technology. However, as terminology is evolving over time, a growing gap opens up between older documents in (long-term) archives and the active language used for querying such archives. Thus, technologies for detecting and systematically handling terminology evolution are required to ensure “semantic” accessibility of (Web) archive content in the long run. As a starting point for dealing with terminology evolution, the talk presents the problem and discusses issues, first ideas and relevant technologies.
Thomas Risse (L3S) presented LiWA and the problem of terminology evolution during the foundation meeting of the German Digital Library Working Group of the Gesellschaft für Informatik e.V.
Thomas Risse (L3S) presented LiWA and the problem of terminology evolution at the Institute for Natural Language Processing of the University of Stuttgart
Due to the central role that the World Wide Web plays in nearly all areas of today’s life, adequate Web archiving has become a cultural necessity in preserving knowledge. A first generation of Web archiving technology has been built by pioneers in the domain based on existing search technology. The next generation web archiving technologies will overcome limitations in content capture, preservation, analysis and enrichment. It is the goal of the LiWA project (Living Web Archives, IST FP7 216267) to turn Web archives from pure Web page storages into “living Web archives”. Such living archives will be capable of handling a variety of content types and of dealing with evolution as well as long-term archive interpretability.
One important aspect is the archive interpretability. The correspondence between the terminology used for querying and the one used in content objects to be retrieved is a crucial prerequisite for effective content access based on retrieval technology. However, as terminology is evolving over time, a growing gap opens up between older documents in (long-term) archives and the active language used for querying such archives. Thus, technologies for detecting and systematically handling terminology evolution are required to ensure “semantic” accessibility of (Web) archive content in the long run.
In this talk we give an overview of the LiWA project and present the problem of terminology evolution in more detail, giving a more formal problem statement and discussing issues, first ideas and relevant technologies.
February 18, 2008
The Living Web Archives project will carry Web archiving beyond the current approach, characterized by static snapshots, to one that fully accounts for the dynamics and interrelations of Web content. Through an ambitious plan of research, the project aims to improve archive fidelity, coherence, and interpretability.
The three-year project is funded by the European Union through the Seventh Research Framework Programme.
The project kick-off meeting was held earlier this month in Hannover, Germany.
The result of LiWA’s work will be a set of next generation Web archiving methods and tools making possible the creation and long-term usability of high-quality Web archives. Aspects of the project’s research will focus on improving Web capture fidelity by providing for the capture of the hidden Web and all types of content, including content that is currently difficult to gather, and by filtering out unwanted content through spam and trap detection.
Other aspects of the project will address the temporal incoherence inherent in current Web capture methods and tools. The research will also address the rapid semantic and technological evolution of the Web in order to promote the long-term viability of Web archives.
LiWA research will lead to scalable technologies, applicable to the Web archiving goals in institutions with varying collection policies.
The project lead is the L3S Research Center in Hannover, Germany. The other partners are:
- European Archive Foundation, The Netherlands
- Max Planck Institute for Computer Science, Germany
- Computer and Automation Research Institute of the Hungarian Academy of Sciences, Hungary
- Netherlands Institute for Sound & Vision, The Netherlands
- Hanzo Archives Limited, England
- National Library of the Czech Republic, Czech Republic
- Moravian Library, Czech Republic
For more information on Living Web Archives, please visit: http://www.liwa-project.eu/index.html
December 17, 2007, Luxembourg
Download the slides : [pdf 1.30Mb]