VIMM Working Group 3.3 collected and discussed examples from different fields of automatic information extraction (AIE) relevant to virtual cultural heritage (CH), such as text mining and image pattern recognition (interpretation). After listing mechanisms for automatic information extraction and collecting examples, the group examined and discussed these findings with regard to their relevance for the field of CH.

The outcomes include a list of examples of automatic information extraction. Everyone is invited to enrich, discuss and comment on this list.

 

Example/ Title BBC World Service radio archive
Method of AIE automatic extraction of topics
Area of application Audio
Aim The project first used speech-to-text technology to create transcripts, albeit “noisy” ones. The team then built a “semantic tagger” called KiWi, specially designed to work on the “noisy” transcripts, that automatically assigns topics, drawn from DBpedia, Wikipedia’s store of structured data, to the radio programmes. From this data they built a prototype website that lets people explore the archive. While doing so, users can approve, correct, or add to the machine-generated metadata, improving it for everyone.
Technology speech-to-text; entity extraction from text
Affiliated project
Link http://www.bbc.co.uk/rd/projects/worldservice-archive-proto
Further reading
Why best-practice? BBC materials are a good example of a varied cultural heritage resource. They have worked extensively to implement Linked Data within their systems.  This experiment links automatic extraction to human review/correction, which I think is a good approach.
Comment (Pros, Cons) Project is finished.
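The tagging and review steps described in this example can be illustrated with a minimal sketch. It is not the BBC's tagger: the gazetteer, the `tag_topics` and `review` functions and the `min_mentions` threshold are all assumptions made for illustration, and a real system would draw its labels from DBpedia rather than from a hand-written dictionary.

```python
import re
from collections import Counter

# Hypothetical mini-gazetteer mapping surface forms to DBpedia resource URIs.
GAZETTEER = {
    "nelson mandela": "http://dbpedia.org/resource/Nelson_Mandela",
    "south africa": "http://dbpedia.org/resource/South_Africa",
    "apartheid": "http://dbpedia.org/resource/Apartheid",
}

def tag_topics(transcript, min_mentions=2):
    """Suggest topics mentioned at least min_mentions times, most frequent first."""
    # Lowercase and strip punctuation so matching tolerates some transcript noise.
    text = " ".join(re.sub(r"[^a-z ]+", " ", transcript.lower()).split())
    counts = Counter()
    for label, uri in GAZETTEER.items():
        mentions = len(re.findall(r"\b" + re.escape(label) + r"\b", text))
        if mentions >= min_mentions:
            counts[uri] = mentions
    return [
        {"uri": uri, "mentions": n, "status": "machine"}
        for uri, n in counts.most_common()
    ]

def review(tag, approved):
    """Record a human decision on a machine-generated tag."""
    return {**tag, "status": "approved" if approved else "rejected"}
```

Requiring repeated mentions is one simple way to reduce false tags caused by recognition errors. The `status` field keeps machine suggestions distinguishable from tags a person has reviewed.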
Example/ Title London Smells
Method of AIE Text data mining
Area of application Text
Aim The aim of the project is to use the dataset as a tool to explore and interpret the lives and health of the 19th and 20th century Londoners.
Technology Data mining over 5500 Medical Officer of Health (MOH) reports from the Greater London area spanning from 1848 to 1972 and creating a dataset of smell-related words. “We are currently working on extracting implied smells using NLP.”
Affiliated project
Link http://londonsmells.co.uk/
Further reading [see web site]
Why best-practice? Demonstrates the practicability of extracting meaningful specialised information from a generic source.
Comment (Pros, Cons)
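As a rough illustration of building a dataset of smell-related words from dated reports, the sketch below counts vocabulary terms per report year. The seed word list and the function name `smell_counts_by_year` are assumptions; the project's actual vocabulary and pipeline are not reproduced here.

```python
import re
from collections import Counter, defaultdict

# Hypothetical seed vocabulary of smell-related words.
SMELL_TERMS = {"smell", "stench", "odour", "effluvia", "fumes", "stink"}

def smell_counts_by_year(reports):
    """reports: iterable of (year, text). Returns {year: Counter of smell terms}."""
    by_year = defaultdict(Counter)
    for year, text in reports:
        for token in re.findall(r"[a-z]+", text.lower()):
            # Crude plural handling: "smells" -> "smell", "odours" -> "odour".
            if token.endswith("s") and token[:-1] in SMELL_TERMS:
                token = token[:-1]
            if token in SMELL_TERMS:
                by_year[year][token] += 1
    return dict(by_year)
```

A keyword count like this only finds explicit mentions. Implied smells, which the project says it is working on, would need more than a word list.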
Example/ Title STAR/ARIADNE
Method of AIE Text data mining
Area of application Text (grey literature)
Aim Generation of semantically useful structured information from full-text sources. Results are expressed as Linked Data (based on CIDOC CRM and Getty AAT). A working web application prototype is available at http://ariadne-lod.isti.cnr.it/description.html. Its queries concern wooden objects (e.g. samples of beech wood keels), optionally restricted to a given date range, with automatic expansion over AAT hierarchies of wood types and some associative relationships.
Technology
Affiliated project STAR; ARIADNE
Link http://ariadne-lod.isti.cnr.it/description.html
Further reading Vlachidis A, Tudhope D. 2015. A knowledge-based approach to Information Extraction for semantic interoperability in the archaeology domain. Journal of the Association for Information Science and Technology, 67(5), 1138–1152, Wiley. https://doi.org/10.1002/asi.23485 (evaluation of outcomes)

Vlachidis A, Tudhope D. 2015. Negation detection and word sense disambiguation in digital archaeology reports for the purposes of semantic annotation. Program: electronic library and information systems, 49(2), 118–134, Emerald. https://doi.org/10.1108/PROG-10-2014-0076 (negation detection; currently freely available)

See also Andreas Vlachidis's PhD work portal: http://andronikos.co.uk/

Why best-practice?
Comment (Pros, Cons) The main focus was to explore the technical feasibility of the semantic integration (the NLP components are experimental prototype pipelines).
Techniques Same method as the BBC example (above): extracting keywords and linking them to knowledge bases
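The query expansion over a hierarchy of wood types mentioned in the aim can be sketched as follows. This is an assumption-laden illustration, not the ARIADNE prototype: the `NARROWER` mapping is a made-up fragment with plain labels rather than Getty AAT identifiers, and the record fields `material` and `year` are hypothetical.

```python
# Hypothetical fragment of a wood-type hierarchy (broader term -> narrower terms).
NARROWER = {
    "wood": ["hardwood", "softwood"],
    "hardwood": ["beech", "oak"],
    "softwood": ["pine"],
}

def expand(term):
    """Return the term followed by all of its narrower terms (depth-first)."""
    terms = [term]
    for child in NARROWER.get(term, []):
        terms.extend(expand(child))
    return terms

def search(records, material, year_from=None, year_to=None):
    """Find records whose material is the query term or any narrower term."""
    wanted = set(expand(material))
    hits = []
    for rec in records:
        if rec["material"] not in wanted:
            continue
        if year_from is not None and rec["year"] < year_from:
            continue
        if year_to is not None and rec["year"] > year_to:
            continue
        hits.append(rec)
    return hits
```

With expansion, a query for "hardwood" also returns objects recorded only as "beech", which a plain keyword search would miss.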
Example/ Title Compact Descriptors for Visual Search (CDVS)
Method of AIE Descriptors that enable visual search and are compact enough to be embedded in the image metadata. In addition to visual search, these descriptors can be used for classification, matching, indexing and automated textual metadata generation.
Area of application Visual search
Aim Enable efficient and interoperable design of visual search applications. In particular:

– ensure interoperability of visual search applications and databases,

– enable high level of performance of implementations conformant to the standard,

– simplify design of visual search applications,

– enable hardware support for descriptor extraction and matching functionality in mobile devices,

– reduce load on wireless networks transmitting visual search-related information.

Technology Compact binary image descriptors
Affiliated project
Link http://mpeg.chiariglione.org/standards/mpeg-7/compact-descriptors-visual-search
Further reading http://ieeexplore.ieee.org/document/7149289/
Why best-practice? – Standardized technology (enabling interoperability across systems)

– Metadata enrichment with limited overhead

Comment (Pros, Cons) Pro:

– Standardized technology (enabling interoperability across systems)

– Metadata enrichment with limited overhead

Con:

– Limited adoption so far

Comment:

  • A standard to improve AIE
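To show why compact binary descriptors are cheap to compare, the sketch below matches descriptors by Hamming distance. It does not implement the CDVS descriptor format or its matching procedure: descriptors are simply assumed to be bit strings stored as Python integers, and `best_match` and `max_distance` are illustrative names.

```python
def hamming(a, b):
    """Number of differing bits between two descriptors stored as integers."""
    return bin(a ^ b).count("1")

def best_match(query, database, max_distance=8):
    """Return (image_id, distance) for the closest descriptor, or None if too far."""
    if not database:
        return None
    image_id, descriptor = min(
        database.items(), key=lambda item: hamming(query, item[1])
    )
    distance = hamming(query, descriptor)
    return (image_id, distance) if distance <= max_distance else None
```

Comparing two descriptors needs only an XOR and a bit count, which is why such descriptors suit mobile devices and add little overhead when stored in metadata.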
Example/ Title Caffe
Method of AIE Deep learning framework
Area of application Computer vision: a neural network library that makes state-of-the-art computer vision systems easier to implement.
Aim Making development of deep learning applications more accessible
Technology Deep learning / neural networks
Affiliated project
Link http://caffe.berkeleyvision.org
Further reading https://goo.gl/3UNkCZ
Why best-practice? Deep learning is the driver of many modern AI applications, including advanced vision applications such as object recognition and classification. Training deep learning networks requires huge amounts of (annotated) data, which is often hard to acquire. Frameworks such as Caffe provide base layers that have been trained on millions of images. The network has therefore already learned to recognize basic generic image features and can then be extended and trained for more specific domains with a more moderate reference data set.

Comment (Pros, Cons) Caffe enables development of state-of-the-art, high-performance vision applications. Nevertheless, it is only a base framework, and programming experience is required to build applications on top of it.

A framework that learns to classify and extract information.

Example/ Title Traces through Time
Method of AIE metadata matching
Area of application biographical records
Aim This project aims to identify where multiple biographical records refer to the same individual, by matching on key facts in the metadata.  Matches are assigned a confidence score.
Technology Custom metadata matching algorithm with support for fuzzy matching.
Affiliated project TNA Discovery resource
Link http://blog.nationalarchives.gov.uk/blog/making-connections-tracing-people-collection/#more-27523
Further reading
Why best-practice? Matching people is an important use case, so any examples of this are helpful.
Comment (Pros, Cons) This is a prototype service, so implementation details may change over time.
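A confidence-scored fuzzy match between two biographical records might look like the sketch below. It is not the Traces through Time algorithm, whose details are not given here: the field names, the weights and the three-year tolerance are assumptions, and string similarity comes from Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher

# Hypothetical field weights; they sum to 1 so the score stays in [0, 1].
WEIGHTS = {"name": 0.6, "birth_year": 0.25, "place": 0.15}

def text_similarity(a, b):
    """Case-insensitive similarity ratio between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def year_similarity(a, b, tolerance=3):
    """1.0 for identical years, falling linearly to 0.0 at `tolerance` years apart."""
    return max(0.0, 1.0 - abs(a - b) / tolerance)

def match_confidence(rec_a, rec_b):
    """Weighted confidence that two records describe the same person."""
    scores = {
        "name": text_similarity(rec_a["name"], rec_b["name"]),
        "birth_year": year_similarity(rec_a["birth_year"], rec_b["birth_year"]),
        "place": text_similarity(rec_a["place"], rec_b["place"]),
    }
    return sum(WEIGHTS[field] * scores[field] for field in WEIGHTS)
```

Returning a score rather than a yes/no answer lets a service show likely matches first and leave borderline cases for human judgement.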
Example/ Title FREME Framework
Method of AIE Adaptive and multilingual content enrichment, including named entity recognition, machine translation and terminology annotation
Area of application Various. The framework is not specific to an application area, but use cases include metadata enrichment and the use of data sources such as ORCID, which are highly relevant to the research community at large
Aim Ease access to language and data technologies
Technology software-as-a-service framework
Affiliated project freme-project, see http://freme-project.eu/
Link https://freme-project.github.io/
Further reading Overview paper: https://svn.aksw.org/papers/2016/LREC_FREME_Overview/public.pdf
Why best-practice? Usage of various formats in the linked data realm (e.g. NIF, OntoLex, ITS 2.0) and general APIs for lowering the barrier to access data and language technologies.
Comment (Pros, Cons) Easy to use and adopt; promotion of standards for enrichment (these were the project's aims)
Example/ Title Europeana semantic enrichment
Method of AIE semantic enrichment, linking
Area of application any metadata record (text)
Aim The aim is to enrich data providers’ metadata (more than 50 million records) aggregated in Europeana, by automatically linking text strings found in the metadata to controlled terms from Linked Open Data datasets or vocabularies, such as GeoNames for places and DBpedia for person names and concepts.
Technology metadata linking
Affiliated project
Link http://europeana.eu
Further reading http://pro.europeana.eu/share-your-data/data-guidelines/europeana-semantic-enrichment

https://docs.google.com/document/d/1JvjrWMTpMIH7WnuieNqcT0zpJAXUPo6x4uMBj1pEx0Y

Why best-practice? Simple technology and large scale implementation
Comment (Pros, Cons) The process is automatic and lacks further validation (so enrichments can be inaccurate)

Extracts information from metadata, not content. Works with string-matching only. Large scale, no validation.
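String-matching enrichment of a metadata field can be sketched in a few lines. This is only an illustration of the general technique, not Europeana's enrichment service: the vocabulary uses placeholder `example.org` URIs instead of real GeoNames or DBpedia identifiers, and the `enrich` function and the `_links` suffix are assumptions.

```python
# Hypothetical vocabulary: normalised label -> placeholder URI.
PLACES = {
    "paris": "http://example.org/place/paris",
    "vienna": "http://example.org/place/vienna",
}

def normalise(value):
    """Lowercase a string and collapse surrounding and repeated whitespace."""
    return " ".join(value.lower().split())

def enrich(record, field="dcterms:spatial", vocabulary=PLACES):
    """Return a copy of the record with matched URIs stored under field + '_links'."""
    links = []
    for value in record.get(field, []):
        uri = vocabulary.get(normalise(value))
        if uri is not None and uri not in links:
            links.append(uri)
    enriched = dict(record)
    enriched[field + "_links"] = links
    return enriched
```

Because the match is on the string alone, a value such as "Paris" is linked to the same URI whether the record means Paris in France or Paris in Texas, which is the kind of inaccuracy that a validation step would catch.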

Example/ Title Europeana Fashion semantic enrichment
Method of AIE NER through NLP and regular expression string matching, linking
Area of application metadata records (text)
Aim Enrich fashion-specific metadata properties, such as dc:type, dcterms:medium, edmfp:technique and gr:color, by extracting named entities from textual descriptions using NLP techniques in a multilingual context and linking the extracted entities to LOD sources such as Getty AAT and DBpedia.
Technology SaaS
Affiliated project Europeana Fashion
Link http://www.europeanafashion.eu
Further reading http://bit.ly/2jiRAGS

https://link.springer.com/chapter/10.1007%2F978-3-319-49607-8_7

Why best-practice? Broadly applicable in the area of metadata enrichment, with very accurate results even in a multilingual context
Comment (Pros, Cons) Multilingual NLP. Validation tool still missing.
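The regular-expression part of this approach can be illustrated with a small multilingual term list that maps surface forms to a property and a concept. Everything in the sketch is an assumption made for illustration: the term list, the placeholder `example.org` URIs (standing in for Getty AAT or DBpedia concepts) and the `extract_properties` function are not taken from the Europeana Fashion service.

```python
import re

# Hypothetical multilingual term list: surface form -> (property, placeholder URI).
TERMS = {
    "red": ("gr:color", "http://example.org/concept/red"),
    "rosso": ("gr:color", "http://example.org/concept/red"),
    "rouge": ("gr:color", "http://example.org/concept/red"),
    "silk": ("dcterms:medium", "http://example.org/concept/silk"),
    "seta": ("dcterms:medium", "http://example.org/concept/silk"),
    "soie": ("dcterms:medium", "http://example.org/concept/silk"),
}

# One alternation over all terms, longest first, matched on word boundaries.
PATTERN = re.compile(
    r"\b(" + "|".join(sorted(map(re.escape, TERMS), key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)

def extract_properties(description):
    """Return {property: set of concept URIs} found in a free-text description."""
    found = {}
    for match in PATTERN.finditer(description):
        prop, uri = TERMS[match.group(1).lower()]
        found.setdefault(prop, set()).add(uri)
    return found
```

Mapping the Italian, French and English forms to the same concept URI is what makes the enriched values comparable across languages.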