Using Word Sense Discrimination on Historic Document Collections
The paper “Using Word Sense Discrimination on Historic Document Collections” has been accepted for the 10th ACM/IEEE JCDL.
The paper entitled “Using Word Sense Discrimination on Historic Document Collections” by Nina Tahmasebi, Kai Niklas, Thomas Theuerkauf and Thomas Risse has been accepted at the 10th ACM/IEEE Joint Conference on Digital Libraries. The paper evaluates word sense discrimination on historic document collections to investigate whether word senses can be found automatically when modern technology is applied to historic data. It also investigates what impact OCR errors, which are present in scanned historic documents, have on finding word senses automatically. Finding word senses automatically is the first step towards detecting terminology evolution and hence an important step in our research. Nina Tahmasebi will present the paper on June 22nd, 2010 at JCDL, which is held in conjunction with ICADL in Surfers Paradise (Gold Coast, Australia).
IWAW Proceedings online
IWAW09 took place on the 30th of September and 1st of October 2009, in conjunction with ECDL in Corfu (Greece). The proceedings are now available online.
Around 40 participants attended IWAW2009, which took place on Sep. 30 / Oct. 1 2009, in conjunction with ECDL in Corfu (Greece). The workshop provided a comprehensive overview of active research and practice on the preservation of the Web. This year’s workshop also covered several new approaches and research topics (from the preservation of virtual worlds to the temporal dimension of Web archives) as well as practical issues faced by archiving institutions, specifically with respect to managing the storage of large volumes of digital material. In this context, a special session was devoted to the WARC storage format, which has been accepted as a new ISO standard (ISO 28500:2009), as well as to emerging tool support for handling these container objects. In general, scalability issues and the management of large-volume crawls were topics of intensive discussion, drawing on the growing body of experience at numerous institutions that now run Web archiving activities in a range of different configurations.
Bridging the Terminology Gap in Web Archive Search
A paper on dealing with terminology evolution in web archives has been accepted at the 12th International Workshop on the Web and Databases (WebDB 2009).
The paper entitled “Bridging the Terminology Gap in Web Archive Search” by Klaus Berberich, Srikanta Bedathur, Mauro Sozio, and Gerhard Weikum has been accepted at the 12th International Workshop on the Web and Databases (WebDB 2009). The paper proposes a method to find query reformulations that paraphrase users’ information needs using past terminology. Such query reformulations are key to retrieving old but highly relevant documents in web archives that were written using now outdated terminology. Klaus Berberich will present the paper on June 28th, 2009 at WebDB 2009, which is held in conjunction with SIGMOD 2009 in Providence (Rhode Island, USA).
LiWA Poster at the CHORUS Conference in Brussels
Nina Tahmasebi (L3S) will present a poster on the LiWA project at the CHORUS conference in Brussels (Belgium) on May 26-27, 2009. The poster will present the overall goals of LiWA and, in particular, its work on semantic evolution.
Talk “Terminology Evolution in Web Archiving” in Vienna
Dr. Thomas Risse (L3S) will give a talk on “Terminology Evolution in Web Archiving” on April 30, 2009 in the Seminarraum of the Institut für Knowledge und Business Engineering, Rathausstrasse 19/9, A-1010 Wien.
Due to the central role that the World Wide Web plays in nearly all areas of today’s life, adequate Web archiving has become a cultural necessity in preserving knowledge. The next generation of Web archiving technologies will overcome limitations in content capture, preservation, analysis and enrichment. One important aspect is archive interpretability. The correspondence between the terminology used for querying and the terminology used in the content objects to be retrieved is a crucial prerequisite for effective content access based on retrieval technology. However, as terminology evolves over time, a growing gap opens up between older documents in (long-term) archives and the active language used for querying such archives. Thus, technologies for detecting and systematically handling terminology evolution are required to ensure “semantic” accessibility of (Web) archive content in the long run. As a starting point for dealing with terminology evolution, the talk presents the problem and discusses issues, approaches and relevant technologies.
Half day session on LiWA during IWAW
A dedicated session took place during the 8th International Web Archiving Workshop.
Over 70 web archivists and researchers in this domain attended the 8th edition of IWAW, during which a full session was dedicated to presenting research objectives and early results from LiWA.
The audience showed great interest and asked many questions, which is a good sign for us. The presentations from this session are listed below:
Web Spam: a Survey with Vision for the Archivist
Andras Benczur, David Siklosi, Jacint Szabo, Istvan Biro, Zsolt Fekete, Miklos Kurucz, Attila Pereszlenyi, Simon Racz, Adrienn Szabo (paper, presentation)
Terminology Evolution in Web Archiving: Open Issues
Nina Tahmasebi, Tereza Iofciu, Thomas Risse, Claudia Niederée, Wolf Siberski (paper, presentation)
LiWA Architecture
Radu Pop, Wolf Siberski, Mark Williamson (presentation)
“Catch me if you can”. Temporal Coherence of Web Archives
Marc Spaniol (presentation)
The Challenge of Dynamic Links
Mark Williamson (presentation)
Presentation at the IFIP WG 2.6 Meeting
Thomas Risse (L3S) presented LiWA and the problem of terminology evolution during the meeting of the IFIP Working Group 2.6 on Databases.
Due to the central role that the World Wide Web plays in nearly all areas of today’s life, adequate Web archiving has become a cultural necessity in preserving knowledge. The next generation of Web archiving technologies will overcome limitations in content capture, preservation, analysis and enrichment. One important aspect is archive interpretability. The correspondence between the terminology used for querying and the terminology used in the content objects to be retrieved is a crucial prerequisite for effective content access based on retrieval technology. However, as terminology evolves over time, a growing gap opens up between older documents in (long-term) archives and the active language used for querying such archives. Thus, technologies for detecting and systematically handling terminology evolution are required to ensure “semantic” accessibility of (Web) archive content in the long run. As a starting point for dealing with terminology evolution, the talk presents the problem and discusses issues, first ideas and relevant technologies.
Presentation at the GI-DL Meeting in Karlsruhe
Thomas Risse (L3S) presented LiWA and the problem of terminology evolution during the foundation meeting of the German Digital Library Working Group of the Gesellschaft für Informatik e.V.
Presentation at the University of Stuttgart
Thomas Risse (L3S) presented LiWA and the problem of terminology evolution at the Institute for Natural Language Processing of the University of Stuttgart.
Abstract
Due to the central role that the World Wide Web plays in nearly all areas of today’s life, adequate Web archiving has become a cultural necessity in preserving knowledge. A first generation of Web archiving technology has been built by pioneers in the domain based on existing search technology. The next generation of Web archiving technologies will overcome limitations in content capture, preservation, analysis and enrichment. It is the goal of the LiWA project (Living Web Archives, IST FP7 216267) to turn Web archives from pure Web page stores into “living Web archives”. Such living archives will be capable of handling a variety of content types and of dealing with evolution as well as long-term archive interpretability.
One important aspect is archive interpretability. The correspondence between the terminology used for querying and the terminology used in the content objects to be retrieved is a crucial prerequisite for effective content access based on retrieval technology. However, as terminology evolves over time, a growing gap opens up between older documents in (long-term) archives and the active language used for querying such archives. Thus, technologies for detecting and systematically handling terminology evolution are required to ensure “semantic” accessibility of (Web) archive content in the long run.
In this talk we give an overview of the LiWA project and present the problem of terminology evolution in more detail by giving a more formal problem statement and discussing issues, first ideas and relevant technologies.
