LiWA technologies released in Open Source

Posted August 30, 2010
Areas : Archive Fidelity, Spam Cleansing, Temporal Coherence, Semantic Evolution, Social Web, Rich Media, General.

LiWA partners are pleased to announce the open-source release of the complete set of components and tools produced by the LiWA project.

They are all grouped under the “liwa-technologies” project on Google Code:
http://code.google.com/p/liwa-technologies/.


1° The Rich Media Capture Module - a plug-in dedicated to the capture of streaming video content:
http://code.google.com/p/liwa-technologies/source/browse/rich-media-capture
http://code.google.com/p/liwa-technologies/downloads/detail?name=rich-media-capture-plugin-1.0.jar

2° The Temporal Coherence Analyser - a plug-in dedicated to the analysis of the temporal coherence of the archived Web content:
http://code.google.com/p/liwa-technologies/source/browse/temporal-coherence

3° The Spam Assessment Interface - a Web service that enables the quality assessment of the archived Web content:
http://code.google.com/p/liwa-technologies/source/browse/assessment-interface

4° The Semantic Analyser - a component dedicated to the detection of terminology evolution:
http://code.google.com/p/liwa-technologies/source/browse/SemanticAnalyser
http://code.google.com/p/liwa-technologies/downloads/detail?name=SemanticAnalyser-1.0.zip

5° The Web Archive UI Framework - a client-side framework that helps create user interface helpers for Web archive browsing:
http://code.google.com/p/liwa-technologies/source/browse/web-archive-ui-framework



To learn more about each component, see the wiki space of the Google project, which gives a brief description of each module and the steps needed to deploy it: http://code.google.com/p/liwa-technologies/w/list



You are all welcome to download and try out the LiWA components. Your feedback and comments will be greatly appreciated, helping us to improve the documentation and the usability of the technologies.

 

LiWA Evolution Tracking Module released

Posted April 22, 2011
Areas : Semantic Evolution.

LiWA partners are pleased to announce the open-source release of the LiWA Evolution Tracking Module.

The LiWA Terminology Evolution Tracking Module is a Java module for word sense evolution tracking, released under the “liwa-technologies” project on Google Code:
http://code.google.com/p/liwa-technologies/downloads/detail?name=LiWAEvoTracking.zip&can=2&q=

 

LiWA Third Newsletter published

Posted April 07, 2011
Areas : Archive Fidelity, Spam Cleansing, Temporal Coherence, Semantic Evolution, Social Web, Rich Media, General, Events.

The LiWA Newsletter No. 3 is now available, summarizing the findings and results of the 36-month project. Enjoy reading it!

 

Language Evolution On The Go

Posted March 10, 2011
Areas : Semantic Evolution, General, Events.

G. Zenz, N. Tahmasebi, and T. Risse have been invited to submit their paper “Language Evolution On The Go” (Extended Version) to the Journal on Multimedia Tools and Applications.

 

On the Applicability of Word Sense Discrimination on 201 Years of Modern English

Posted March 10, 2011
Areas : Semantic Evolution, General, Events.

The paper “On the Applicability of Word Sense Discrimination on 201 Years of Modern English”, co-written by N. Tahmasebi, K. Niklas, G. Zenz, and T. Risse, has been submitted to the Journal of Computational Linguistics.

Word sense discrimination is the first, important step towards automatic detection of language evolution within large, historic document collections. By comparing the found word senses over time, we can reveal and use important information that will improve understanding and accessibility of a digital archive. Algorithms for word sense discrimination have been developed while keeping today’s language in mind and have thus been evaluated on well-selected, modern datasets. The quality of the word senses found in the discrimination step has a large impact on the detection of language evolution. Therefore, as a first step, we verify that word sense discrimination can successfully be applied to digitized historic documents and that the results correctly correspond to word senses. Because accessibility of digitized historic collections is also influenced by the quality of the optical character recognition (OCR), as a second step we investigate the effects of OCR errors on word sense discrimination results. All evaluations in this paper are performed on The Times Archive, a collection of newspaper articles from 1785 to 1985.

 

2nd LiWA Terminology Evolution Evaluation Workshop

Posted December 29, 2010
Areas : Semantic Evolution, General, Events.

On December 15, 2010, the 2nd LiWA Terminology Evolution Evaluation Workshop will be held in Hanover, Germany, at the L3S Research Center. The workshop aims at evaluating terminology evolution found inside long-term archives. The workshop attendees will also evaluate the performance of the Terminology Evolution Browser, a tool developed within LiWA to better visualize evolution.

 

1st LiWA Terminology Evolution Evaluation Workshop

Posted December 29, 2010
Areas : Semantic Evolution, General, Events.

The 1st LiWA Terminology Evolution Evaluation Workshop was held on March 16, 2010 at the L3S Research Center, Hannover, Germany. The workshop spanned half a day and aimed at evaluating the outcomes of the LiWA WP5 technology.

 

Language Evolution On The Go

Posted November 04, 2010
Areas : Semantic Evolution, General, Events.

The paper entitled “Language Evolution On The Go” by G. Zenz, N. Tahmasebi and T. Risse will be presented at SAME 2010.

Knowing about the evolution of a term can significantly decrease the time needed to search for information. It can also help in quickly getting a broader overview, which is essential when one is on the move. In this paper we present a solution for providing language evolution knowledge “on the go”. At the 3rd International Workshop on Semantic Ambient Media Experience 2010, held on November 10th in conjunction with AmI-10 in Malaga, Spain, the LiWA project will present a mobile interface for easy access and visualization as well as an overview of how this evolution was found.

 

LiWA papers at IWAW10

Posted September 21, 2010
Areas : Archive Fidelity, Temporal Coherence, Semantic Evolution, General, Events.

IWAW10 takes place on the 22nd and 23rd of September in Vienna at the Austrian National Library.

The following papers have been accepted for presentation at this International Web Archiving Workshop:
- “Archiving Web Video”, Radu Pop, Gabriel Vasile and Julien Masanes
- “The SOLAR System for Sharp Web Archiving”, Arturas Mazeika, Dimitar Denev, Marc Spaniol and Gerhard Weikum
- “Terminology Evolution Module for Web Archives in the LiWA Context”, Nina Tahmasebi, Gideon Zenz, Tereza Iofciu and Thomas Risse
- “Archiving Data Objects using Web Feeds”, Marilena Oita and Pierre Senellart.

 

Terminology Extraction Pipeline Version 1.0 released

Posted August 09, 2010
Areas : Semantic Evolution, General, Events.

The LiWA consortium released the Terminology Extraction Pipeline Version 1.0 as part of the Terminology Evolution Detection module. The released processing pipeline consists of the following major steps: pre-processing, natural language processing, and creation of the co-occurrence graph.

In more detail, the WARC Collection Reader (WARC Extraction) extracts the text and time metadata for each site archived in the input crawl. The POS (Part Of Speech) Tagger is an aggregate analysis engine from Dextract. It consists of a tokenizer, a language-independent part-of-speech tagger and a lemmatizer (TreeTagger). In the Term Extraction sub-module, we read the annotated sites and extract the lemmas and the different parts of speech that were identified for the archived sites. After that, we index the terms in a database (MySQL) index (see below). In the Co-occurrence Analysis, we extract lemma or noun co-occurrence matrices for the indexed crawl from the database index.
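
The co-occurrence step can be illustrated with a minimal sketch. It is not the LiWA implementation: it assumes the pages have already been lemmatized, it keeps everything in memory instead of reading from the MySQL index, and the function name `cooccurrence_counts` and the sample pages are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count in how many documents each pair of lemmas appears together.

    `documents` is an iterable of lemma lists, one list per archived page.
    The result is a Counter keyed by alphabetically sorted lemma pairs.
    """
    counts = Counter()
    for lemmas in documents:
        # Sort the distinct lemmas so each pair has one canonical key and
        # is counted once per document, however often the lemmas repeat.
        unique = sorted(set(lemmas))
        counts.update(combinations(unique, 2))
    return counts


# Hypothetical, already lemmatized pages (stand-ins for tagger output).
pages = [
    ["web", "archive", "crawl"],
    ["web", "archive", "spam"],
    ["archive", "crawl"],
]
matrix = cooccurrence_counts(pages)
print(matrix[("archive", "web")])
```

Counting per document is only one possible choice; a sentence-level or sliding-window count would follow the same pattern with smaller units passed in as `documents`.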

The Terminology Extraction Pipeline can be downloaded from Google Code:
http://code.google.com/p/liwa-technologies/downloads/detail?name=SemanticAnalyser-1.0.zip

 

Using Word Sense Discrimination on Historic Document Collections

Posted April 20, 2010
Areas : Semantic Evolution, General, Events.

The paper “Using Word Sense Discrimination on Historic Document Collections” has been accepted for the 10th ACM/IEEE JCDL.

The paper entitled “Using Word Sense Discrimination on Historic Document Collections” by Nina Tahmasebi, Kai Niklas, Thomas Theuerkauf and Thomas Risse has been accepted at the 10th ACM/IEEE Joint Conference on Digital Libraries. The paper evaluates word sense discrimination on historic document collections to investigate whether word senses can be found automatically using modern technology applied to historic data. The paper also investigates what impact OCR errors, present in scanned historic documents, have on finding word senses automatically. Finding word senses automatically is the first step towards detecting terminology evolution and hence an important step in our research. Nina Tahmasebi will present the paper on June 22nd, 2010 at JCDL, which is held in conjunction with ICADL in Surfers Paradise (Gold Coast, Australia).

 

IWAW Proceedings online

Posted November 13, 2009
Areas : Archive Fidelity, Temporal Coherence, Semantic Evolution, Social Web, General, Events.

IWAW09 took place on the 30th of September and 1st of October 2009, in conjunction with ECDL in Corfu (Greece). The proceedings are now available online.

Around 40 participants attended IWAW2009, which took place on Sep. 30 / Oct. 1 2009, in conjunction with ECDL in Corfu (Greece). The workshop provided a comprehensive overview of active research and practice on the preservation of the Web. This year’s workshop also addressed several new approaches and research topics (from virtual worlds preservation to the temporal dimension of Web archives) as well as practical issues faced by archiving institutions, specifically with respect to managing the storage of large volumes of digital material. In this context, a special session was devoted to the WARC storage format, which has been accepted as a new ISO standard (ISO 28500:2009), as well as emerging tool support to handle these container objects. In general, scalability issues and managing large-volume crawls were topics of intensive discussions, based on the increasing body of experience now available in numerous institutions running a series of Web archiving activities in a range of different configurations.

 

Bridging the Terminology Gap in Web Archive Search

Posted June 08, 2009
Areas : Semantic Evolution, General, Events.

A paper on dealing with terminology evolution in web archives has been accepted at the 12th International Workshop on the Web and Databases (WebDB 2009).

The paper entitled ‘Bridging the Terminology Gap in Web Archive Search’ by Klaus Berberich, Srikanta Bedathur, Mauro Sozio, and Gerhard Weikum has been accepted at the 12th International Workshop on the Web and Databases (WebDB 2009). The paper proposes a method to find query reformulations that paraphrase users’ information needs using past terminology. Such query reformulations are key to retrieving old but highly relevant documents in web archives that were written using now outdated terminology. Klaus Berberich will present the paper on June 28th, 2009 at WebDB 2009, which is held in conjunction with SIGMOD 2009 in Providence (Rhode Island, USA).

 

LiWA Poster at the CHORUS Conference in Brussels

Posted May 18, 2009
Areas : Semantic Evolution, General, Events.

Nina Tahmasebi (L3S) will present a poster on the LiWA project at the CHORUS conference in Brussels (Belgium) on May 26-27, 2009. The poster will present the overall goals of LiWA and in particular semantic evolution.

 

Talk “Terminology Evolution in Web Archiving” at Vienna

Posted April 20, 2009
Areas : Semantic Evolution, General, Events.

Dr. Thomas Risse (L3S) will give a talk on “Terminology Evolution in Web Archiving” on April 30, 2009, in the Seminarraum of the Institut für Knowledge und Business Engineering, Rathausstrasse 19/9, A-1010 Wien.

Due to the central role that the World Wide Web plays in nearly all areas of today’s life, adequate Web archiving has become a cultural necessity in preserving knowledge. The next generation web archiving technologies will overcome limitations in content capture, preservation, analysis and enrichment. One important aspect is the archive interpretability. The correspondence between the terminology used for querying and the one used in content objects to be retrieved is a crucial prerequisite for effective content access based on retrieval technology. However, as terminology is evolving over time, a growing gap opens up between older documents in (long-term) archives and the active language used for querying such archives. Thus, technologies for detecting and systematically handling terminology evolution are required to ensure “semantic” accessibility of (Web) archive content in the long run. As a starting point for dealing with terminology evolution, the talk presents the problem and discusses issues, approaches and relevant technologies.

 

Half day session on LiWA during IWAW

Posted September 18, 2008
Areas : Archive Fidelity, Spam Cleansing, Temporal Coherence, Semantic Evolution, General, Events.

A dedicated session took place during the 8th International Web Archiving Workshop.

Over 70 web archivists and researchers in this domain attended the 8th edition of IWAW, during which a full session was dedicated to presenting research objectives and early results from LiWA.
The audience showed a lot of interest and asked many questions, which is a good sign for us. The presentations from this session are listed below:

Web Spam: a Survey with Vision for the Archivist
Andras Benczur, David Siklosi, Jacint Szabo, Istvan Biro, Zsolt Fekete, Miklos Kurucz, Attila Pereszlenyi, Simon Racz, Adrienn Szabo (paper, presentation)

Terminology Evolution in Web Archiving: Open Issues
Nina Tahmasebi, Tereza Iofciu, Thomas Risse, Claudia Niederée, Wolf Siberski (paper, presentation)

LiWA Architecture
Radu Pop, Wolf Siberski, Mark Williamson (presentation)

“Catch me if you can”. Temporal Coherence of Web Archives
Marc Spaniol (presentation)

The Challenge of Dynamic Links
Mark Williamson (presentation)

 

 

Presentation at the IFIP WG 2.6. Meeting

Posted September 10, 2008
Areas : Semantic Evolution, Events.

Thomas Risse (L3S) presented LiWA and the problem of terminology evolution during the meeting of the IFIP 2.6 Working Group on Databases.

Due to the central role that the World Wide Web plays in nearly all areas of today’s life, adequate Web archiving has become a cultural necessity in preserving knowledge. The next generation web archiving technologies will overcome limitations in content capture, preservation, analysis and enrichment. One important aspect is the archive interpretability. The correspondence between the terminology used for querying and the one used in content objects to be retrieved is a crucial prerequisite for effective content access based on retrieval technology. However, as terminology is evolving over time, a growing gap opens up between older documents in (long-term) archives and the active language used for querying such archives. Thus, technologies for detecting and systematically handling terminology evolution are required to ensure “semantic” accessibility of (Web) archive content in the long run. As a starting point for dealing with terminology evolution, the talk presented the problem and discussed issues, first ideas and relevant technologies.

 

Presentation at the GI-DL Meeting in Karlsruhe

Posted July 08, 2008
Areas : Semantic Evolution, Events.

Thomas Risse (L3S) presented LiWA and the problem of terminology evolution during the foundation meeting of the German Digital Library Working Group of the Gesellschaft für Informatik e.V.

 

Presentation at the University Stuttgart

Posted June 27, 2008
Areas : Semantic Evolution, Events.

Thomas Risse (L3S) presented LiWA and the problem of terminology evolution at the Institute for Natural Language Processing of the University of Stuttgart.

Abstract

Due to the central role that the World Wide Web plays in nearly all areas of today’s life, adequate Web archiving has become a cultural necessity in preserving knowledge. A first generation of Web archiving technology has been built by pioneers in the domain based on existing search technology. The next generation web archiving technologies will overcome limitations in content capture, preservation, analysis and enrichment. It is the goal of the LiWA project (Living Web Archives, IST FP7 216267) to turn Web archives from pure Web page stores into “living Web archives”. Such living archives will be capable of handling a variety of content types and of dealing with evolution as well as long-term archive interpretability.

One important aspect is the archive interpretability. The correspondence between the terminology used for querying and the one used in content objects to be retrieved is a crucial prerequisite for effective content access based on retrieval technology. However, as terminology is evolving over time, a growing gap opens up between older documents in (long-term) archives and the active language used for querying such archives. Thus, technologies for detecting and systematically handling terminology evolution are required to ensure “semantic” accessibility of (Web) archive content in the long run.

In this talk we give an overview of the LiWA project and present the problem of terminology evolution in more detail, giving a more formal problem statement and discussing issues, first ideas and relevant technologies.