The LiWA consortium released the Terminology Extraction Pipeline Version 1.0 as part of the Terminology Evolution Detection module. The released processing pipeline consists of four major steps: WARC extraction (pre-processing), part-of-speech tagging (natural language processing), term extraction and indexing, and co-occurrence analysis (creation of the co-occurrence graph).
In more detail, the WARC Collection Reader (WARC Extraction) extracts the text and time metadata for each site archived in the input crawl. The POS (Part Of Speech) Tagger is an aggregate analysis engine from Dextract. It consists of a tokenizer and a language-independent part-of-speech tagger and lemmatizer (TreeTagger). In the Term Extraction sub-module, we read the annotated sites and extract the lemmas and the parts of speech that were identified for them. We then index the terms in a MySQL database index (see below). In the Co-occurrence Analysis, we extract lemma or noun co-occurrence matrices for the indexed crawl from the database index.
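To illustrate the idea behind the Co-occurrence Analysis step, the following is a minimal sketch in Python and not the code shipped with the pipeline. It assumes that each archived site has already been reduced to a list of lemmas, and that two lemmas co-occur when they appear in the same site; the released pipeline reads its input from the database index and may use a different co-occurrence window. The function name `cooccurrence_counts` and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each pair of lemmas, the number of documents containing both.

    `documents` is an iterable of lemma lists (one list per archived site).
    Pairs are stored as alphabetically ordered tuples, so (a, b) and (b, a)
    share a single entry. Document-level co-occurrence is an assumption
    made for this sketch.
    """
    counts = Counter()
    for lemmas in documents:
        # Deduplicate so that a pair is counted at most once per document.
        unique = sorted(set(lemmas))
        for first, second in combinations(unique, 2):
            counts[(first, second)] += 1
    return counts


# Hypothetical lemmatized sites, standing in for rows read from the index.
docs = [
    ["web", "archive", "crawl"],
    ["web", "archive", "term"],
    ["term", "evolution"],
]

counts = cooccurrence_counts(docs)
print(counts[("archive", "web")])  # number of sites containing both lemmas
```

The resulting counts form a sparse, symmetric co-occurrence matrix; treating each lemma as a node and each non-zero count as a weighted edge gives the co-occurrence graph mentioned above.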
The Terminology Extraction Pipeline can be downloaded from Google Code: