VIMM Working Group 3.3 collected and discussed examples from different fields of automatic information extraction relevant to virtual cultural heritage, such as text mining and image pattern recognition (interpretation). After listing mechanisms for automatic information extraction and collecting examples, the group examined and discussed these findings with regard to their relevance to the field of cultural heritage (CH).
The outcomes include a list of examples of automatic information extraction. Everyone is invited to enrich, discuss and comment on this list.
Example/ Title | BBC World Service radio archive |
Method of AIE | automatic extraction of topics |
Area of application | Audio |
Aim | The project team first used speech-to-text technology to create transcripts, albeit “noisy” ones. They then built a “semantic tagger” called KiWi, designed specifically to work on these noisy transcripts, which automatically assigns topics drawn from DBpedia (a store of structured data extracted from Wikipedia) to the radio programmes. From this data they built a prototype website that lets people explore the archive. While exploring, users can approve, correct, or add to the machine-generated metadata, improving it for everyone. |
Technology | speech-to-text; entity extraction from text |
Affiliated project | |
Link | http://www.bbc.co.uk/rd/projects/worldservice-archive-proto |
Further reading | |
Why best-practice? | BBC materials are a good example of a varied cultural heritage resource. They have worked extensively to implement Linked Data within their systems. This experiment links automatic extraction to human review/correction, which I think is a good approach. |
Comment (Pros, Cons) | Project is finished. |
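The combination of automatic topic suggestion and human review described in this example can be sketched roughly as follows. This is not the KiWi tagger: the label lookup `TOPIC_LABELS`, the sample transcript and the functions `suggest_topics` and `apply_review` are assumptions made for illustration, and plain substring counting stands in for real entity extraction.

```python
from collections import Counter

# Hypothetical label-to-topic lookup; the real tagger draws its topics from DBpedia.
TOPIC_LABELS = {
    "nelson mandela": "http://dbpedia.org/resource/Nelson_Mandela",
    "south africa": "http://dbpedia.org/resource/South_Africa",
    "apartheid": "http://dbpedia.org/resource/Apartheid",
}

def suggest_topics(transcript, labels=TOPIC_LABELS):
    """Count how often each topic label occurs in a (noisy) transcript."""
    text = " ".join(transcript.lower().split())  # normalise case and whitespace
    counts = Counter()
    for label, uri in labels.items():
        n = text.count(label)  # naive substring count, assumed good enough for a sketch
        if n:
            counts[uri] += n
    return counts

def apply_review(suggested, rejected=(), added=()):
    """Merge machine suggestions with human decisions (reject or add topics)."""
    final = [uri for uri, _ in suggested.most_common() if uri not in rejected]
    final += [uri for uri in added if uri not in final]
    return final

# "apartheid" was mis-transcribed as "a part hide", so the tagger misses it.
transcript = "nelson mandela spoke in south africa about a part hide and the future of south africa"
suggested = suggest_topics(transcript)
final = apply_review(suggested, added=["http://dbpedia.org/resource/Apartheid"])
```

Here the reviewer supplies the topic that the transcription error hid from the tagger, which is the kind of correction the prototype website invites.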
Example/ Title | London Smells |
Method of AIE | Text data mining |
Area of application | Text |
Aim | The project aims to use the dataset as a tool to explore and interpret the lives and health of 19th- and 20th-century Londoners. |
Technology | Data mining over 5500 Medical Officer of Health (MOH) reports from the Greater London area spanning from 1848 to 1972 and creating a dataset of smell-related words. “We are currently working on extracting implied smells using NLP.” |
Affiliated project | |
Link | http://londonsmells.co.uk/ |
Further reading | [see web site] |
Why best-practice? | Demonstrates the practicability of extracting meaningful specialised information from a generic source. |
Comment (Pros, Cons) | |
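As a rough illustration of the word-level mining described in this example, the following sketch counts smell-related words in report text. The vocabulary `SMELL_TERMS`, the sample sentence and the function name `count_smell_terms` are assumptions made for this example; the project's actual word list and pipeline are not described here, and implied smells would need more than a word list.

```python
import re
from collections import Counter

# Hypothetical smell vocabulary; the project's real word list is not given in the source.
SMELL_TERMS = {"stench", "odour", "fumes", "effluvia", "smell"}

def count_smell_terms(text, vocabulary=SMELL_TERMS):
    """Count case-insensitive whole-word occurrences of vocabulary terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token in vocabulary)

# Invented sample text standing in for a Medical Officer of Health report.
report = (
    "The stench from the tannery was most offensive; noxious fumes "
    "and a foul smell persisted. In July the stench returned."
)
counts = count_smell_terms(report)
```

Aggregating such counts per report year or district would give a simple dataset of explicit smell mentions.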
Example/ Title | STAR/ARIADNE |
Method of AIE | Text data mining |
Area of application | Text (grey literature) |
Aim | Generation of semantically useful structured information from full text sources. Results expressed as Linked Data (based on CIDOC CRM and Getty AAT). A working web application prototype is available via http://ariadne-lod.isti.cnr.it/description.html – queries concern wooden objects (e.g. samples of beech wood keels), optionally from a given date range, with automatic expansion over AAT hierarchies of wood types and some associative relationships. |
Technology | |
Affiliated project | STAR; ARIADNE |
Link | http://ariadne-lod.isti.cnr.it/description.html |
Further reading | Vlachidis A, Tudhope D. 2015. A knowledge-based approach to Information Extraction for semantic interoperability in the archaeology domain. Journal of the Association for Information Science and Technology, 67(5), pp. 1138–1152, Wiley. https://doi.org/10.1002/asi.23485 (evaluation of outcomes)
Vlachidis A, Tudhope D. 2015. Negation detection and word sense disambiguation in digital archaeology reports for the purposes of semantic annotation. Program: electronic library and information systems, 49(2), pp. 118–134, Emerald. https://doi.org/10.1108/PROG-10-2014-0076 (negation detection; currently freely available)
See also Andreas Vlachidis's PhD work portal: http://andronikos.co.uk/ |
Why best-practice? | |
Comment (Pros, Cons) | The main focus was to explore the technical feasibility of the semantic integration (the NLP pipelines are experimental prototypes). |
Techniques | Same method as the BBC example (above): extracting keywords and linking them to knowledge bases |
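The automatic expansion of a query over a hierarchy of wood types, combined with an optional date range, can be sketched as below. The hierarchy fragment `NARROWER`, the sample records and the functions `expand` and `search` are assumptions for illustration only; they do not reproduce the Getty AAT structure or the ARIADNE application.

```python
# Hypothetical fragment of a wood-type hierarchy (broader term -> narrower terms).
NARROWER = {
    "wood": ["hardwood", "softwood"],
    "hardwood": ["beech", "oak"],
    "softwood": ["pine"],
}

def expand(term, narrower=NARROWER):
    """Return the term together with all transitively narrower terms."""
    result, stack, seen = [term], [term], {term}
    while stack:
        current = stack.pop()
        for child in narrower.get(current, []):
            if child not in seen:
                seen.add(child)
                result.append(child)
                stack.append(child)
    return result

def search(records, material, start=None, end=None):
    """Find records whose material falls under `material`, optionally within a date range."""
    terms = set(expand(material))
    return [
        r for r in records
        if r["material"] in terms
        and (start is None or r["year"] >= start)
        and (end is None or r["year"] <= end)
    ]

# Invented sample records.
records = [
    {"object": "keel", "material": "beech", "year": 1450},
    {"object": "bowl", "material": "pine", "year": 900},
    {"object": "post", "material": "oak", "year": 1700},
]
hits = search(records, "hardwood", start=1400, end=1500)
```

A query for "hardwood" thus also finds the beech keel, although the record never mentions the broader term.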
Example/ Title | Compact Descriptors for Visual Search (CDVS) |
Method of AIE | Descriptors that enable visual search and are compact enough to be embedded in the image metadata. In addition to visual search, these descriptors can be used for classification, matching, indexing and automated generation of textual metadata. |
Area of application | Visual search |
Aim | Enable efficient and interoperable design of visual search applications. In particular:
– ensure interoperability of visual search applications and databases,
– enable high level of performance of implementations conformant to the standard,
– simplify design of visual search applications,
– enable hardware support for descriptor extraction and matching functionality in mobile devices,
– reduce load on wireless networks transmitting visual search-related information. |
Technology | Compact binary image descriptors |
Affiliated project | |
Link | http://mpeg.chiariglione.org/standards/mpeg-7/compact-descriptors-visual-search |
Further reading | http://ieeexplore.ieee.org/document/7149289/ |
Why best-practice? | – Standardized technology (enabling interoperability across systems)
– Metadata enrichment with limited overhead |
Comment (Pros, Cons) | Pro:
– Standardized technology (enabling interoperability across systems)
– Metadata enrichment with limited overhead
Con:
– Limited adoption so far |
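To give an idea of why compact binary descriptors are cheap to store and compare, the sketch below matches a query descriptor against a small database by Hamming distance. It is a generic illustration, not the normative CDVS extraction or matching procedure; the 16-bit descriptors, the file names and the functions `hamming` and `best_match` are all invented for this example.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors stored as integers."""
    return bin(a ^ b).count("1")

def best_match(query: int, database: dict) -> str:
    """Return the key of the database descriptor closest to the query."""
    return min(database, key=lambda name: hamming(query, database[name]))

# Invented 16-bit descriptors; real descriptors are longer and derived from image features.
database = {
    "vase.jpg": 0b1011001110001111,
    "coin.jpg": 0b0100110001110000,
}
query = 0b1011001110001101  # differs from the vase descriptor in a single bit

match = best_match(query, database)
```

Because the comparison is an XOR followed by a bit count, it suits mobile hardware and keeps the amount of data sent over the network small.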
Example/ Title | Caffe |
Method of AIE | Deep learning framework |
Area of application | Neural network library that makes state-of-the-art computer vision systems easier to implement. |
Aim | Making development of deep learning applications more accessible |
Technology | Deep learning / neural networks |
Affiliated project | |
Link | http://caffe.berkeleyvision.org |
Further reading | https://goo.gl/3UNkCZ |
Why best-practice? | Deep learning is the driver of many modern AI applications, including advanced vision applications such as object recognition and classification. Training deep learning networks requires huge amounts of (annotated) data, which is often hard to acquire. Frameworks such as Caffe provide base layers that have been trained on millions of images. The network has therefore already learned to recognize basic generic image features and can then be trained further for more specific domains with a more moderate reference data set. |
Comment (Pros, Cons) | Caffe enables development of state-of-the-art, high-performance vision applications. It is, however, only a base framework, and programming experience is required to build applications on top of it. It is a framework for building systems that learn to classify and extract information. |
Example/ Title | Traces through Time |
Method of AIE | metadata matching |
Area of application | biographical records |
Aim | This project aims to identify where multiple biographical records refer to the same individual, by matching on key facts in the metadata. Matches are assigned a confidence score. |
Technology | Custom metadata matching algorithm with support for fuzzy matching. |
Affiliated project | TNA Discovery resource |
Link | http://blog.nationalarchives.gov.uk/blog/making-connections-tracing-people-collection/#more-27523 |
Further reading | |
Why best-practice? | Matching people is an important use case, so any examples of this are helpful. |
Comment (Pros, Cons) | This is a prototype service, so implementation details may change over time. |
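Fuzzy matching of biographical records with a confidence score can be sketched as follows. This is not the project's custom algorithm: the field names, the weights, the year tolerance and the functions `name_similarity` and `match_confidence` are assumptions, and Python's standard `difflib.SequenceMatcher` stands in for whatever string comparison the real service uses.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Similarity of two names in [0, 1], tolerant of spelling variants."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_confidence(rec_a: dict, rec_b: dict, year_tolerance: int = 2) -> float:
    """Combine name similarity and birth-year agreement into one score in [0, 1]."""
    name_score = name_similarity(rec_a["name"], rec_b["name"])
    year_gap = abs(rec_a["birth_year"] - rec_b["birth_year"])
    year_score = 1.0 if year_gap <= year_tolerance else 0.0
    # Assumed weights: the name counts three times as much as the birth year.
    return 0.75 * name_score + 0.25 * year_score

# Invented sample records.
a = {"name": "William Shakspere", "birth_year": 1564}
b = {"name": "William Shakespeare", "birth_year": 1564}
c = {"name": "John Smith", "birth_year": 1790}

likely = match_confidence(a, b)
unlikely = match_confidence(a, c)
```

In practice a threshold on the score would decide which candidate pairs are shown as probable matches and which are left for manual checking.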
Example/ Title | FREME Framework |
Method of AIE | Adaptive and multilingual content enrichment, including named entity recognition, machine translation and terminology annotation |
Area of application | Various; the framework is not specific to an application area, but use cases include metadata enrichment and the use of data sources such as ORCID, which are highly relevant to the research community at large |
Aim | Ease access to language and data technologies |
Technology | software-as-a-service framework |
Affiliated project | freme-project, see http://freme-project.eu/ |
Link | https://freme-project.github.io/ |
Further reading | Overview paper: https://svn.aksw.org/papers/2016/LREC_FREME_Overview/public.pdf |
Why best-practice? | Usage of various formats in the linked data realm (e.g. NIF, OntoLex, ITS 2.0) and general APIs for lowering the barrier to access data and language technologies. |
Comment (Pros, Cons) | easy to use and adopt, promotion of standards for enrichment – that was the aim |
Example/ Title | Europeana semantic enrichment |
Method of AIE | semantic enrichment, linking |
Area of application | any metadata record (text) |
Aim | The aim is to enrich the data providers’ metadata aggregated in Europeana (more than 50 million records) by automatically linking text strings found in the metadata to controlled terms from Linked Open Data datasets or vocabularies, such as GeoNames for places and DBpedia for person names and concepts. |
Technology | metadata linking |
Affiliated project | |
Link | http://europeana.eu |
Further reading | http://pro.europeana.eu/share-your-data/data-guidelines/europeana-semantic-enrichment
https://docs.google.com/document/d/1JvjrWMTpMIH7WnuieNqcT0zpJAXUPo6x4uMBj1pEx0Y |
Why best-practice? | Simple technology and large scale implementation |
Comment (Pros, Cons) | The process is automatic and has no further validation step, so enrichments can be inaccurate. It extracts information from metadata, not content, and works with string matching only. It is applied at large scale. |
Example/ Title | Europeana Fashion semantic enrichment |
Method of AIE | NER through NLP and regular expression string matching, linking |
Area of application | metadata records (text) |
Aim | Enrich fashion-specific metadata properties, such as dc:type, dcterms:medium, edmfp:technique and gr:color, by extracting named entities from textual descriptions using NLP techniques in a multilingual context and linking the extracted entities to LOD sources such as Getty AAT and DBpedia. |
Technology | SaaS |
Affiliated project | Europeana Fashion |
Link | http://www.europeanafashion.eu |
Further reading | http://bit.ly/2jiRAGS
https://link.springer.com/chapter/10.1007%2F978-3-319-49607-8_7 |
Why best-practice? | Broadly applicable to metadata enrichment, with very accurate results even in a multilingual context |
Comment (Pros, Cons) | Multilingual NLP. Validation tool still missing. |
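The regular-expression part of this kind of enrichment can be sketched as below: terms found in a free-text description are normalised and assigned to metadata properties. The term lists in `TERMS`, the sample descriptions and the function `enrich` are assumptions for illustration; the actual service also uses NLP-based entity recognition and links the results to Getty AAT and DBpedia identifiers, which this sketch omits.

```python
import re

# Hypothetical multilingual term lists (English and Italian), mapped to a normalised value.
TERMS = {
    "gr:color": {"red": "red", "rosso": "red", "blue": "blue", "blu": "blue"},
    "dcterms:medium": {"silk": "silk", "seta": "silk", "wool": "wool", "lana": "wool"},
}

def enrich(description: str) -> dict:
    """Extract normalised property values from a description by whole-word matching."""
    found = {}
    for prop, terms in TERMS.items():
        pattern = r"\b(" + "|".join(map(re.escape, terms)) + r")\b"
        matches = re.findall(pattern, description, flags=re.IGNORECASE)
        values = sorted({terms[m.lower()] for m in matches})
        if values:
            found[prop] = values
    return found

italian = enrich("Abito da sera in seta, rosso e blu")
english = enrich("Red silk dress")
```

Mapping surface forms from several languages to one normalised value is what makes the later linking step language-independent; a validation tool would still be needed to catch false matches.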