Harnessing the power of the Oxford English Dictionary for linguistic research and NLP applications

How the OED Text Annotator may help bring text mining and natural language processing technologies to the next level.

Published by Anton Vasetenkov

Linguistic research and automated text analysis often involve breaking sentences apart into tokens and identifying the base form and meaning of each individual element, with the resulting "interpretative" data being either used directly or fed into various kinds of downstream processing tools and algorithms. The algorithms that perform these tasks draw on carefully crafted linguistic datasets, and the quality of those datasets largely determines how useful the produced annotations are.
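To make the idea concrete, here is a minimal Python sketch of tokenisation followed by a base-form lookup. The `LEMMAS` table is made-up toy data standing in for a curated linguistic dataset, and the function names are only illustrative; real systems rely on far larger resources and more careful tokenisation rules.

```python
import re

# Toy lexicon mapping inflected forms to base forms (hypothetical data,
# standing in for a carefully curated linguistic dataset).
LEMMAS = {
    "dogs": "dog",
    "barked": "bark",
    "was": "be",
    "mice": "mouse",
}


def tokenise(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def annotate(text):
    """Pair each token with its base form; unknown words map to themselves."""
    return [(token, LEMMAS.get(token, token)) for token in tokenise(text)]


print(annotate("The dogs barked."))
# [('the', 'the'), ('dogs', 'dog'), ('barked', 'bark')]
```

Even in this toy version, the output is only as good as the lexicon behind it: a word missing from `LEMMAS` simply falls back to its surface form.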

The Oxford English Dictionary, or OED, is one of the largest dictionaries ever compiled, which makes it one of the most valuable and comprehensive linguistic datasets for the English language. Published by Oxford University Press as the definitive record of the language, it features about 600,000 words, 3 million quotations, and over 1,000 years of English.

The OED Text Annotator is the first and only publicly available tool for linguistic annotation based on the OED. Built by Oxford Languages, the system performs tokenisation, part-of-speech tagging, and lemmatisation on a given text and links each word to its corresponding OED lexeme through sense disambiguation. In the publicly available version of the application, the word origin and usage data from the resulting annotations is picked up by another tool, the OED Text Visualizer, which displays it in an interactive visual format.
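The shape of such an annotation pipeline can be sketched in a few lines of Python. This is not the OED Text Annotator's actual implementation or API: the `TAGGER` and `LEXEMES` tables, the `Annotation` class, and the lexeme identifiers are all hypothetical, and sense disambiguation is reduced here to a lookup by lemma and part of speech, whereas a real system would also use the surrounding context.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical toy tagger: surface form -> (part of speech, lemma).
# A real tagger would decide this from context, not from a fixed table.
TAGGER = {
    "the": ("DET", "the"),
    "dog": ("NOUN", "dog"),
    "barks": ("VERB", "bark"),
    "bark": ("NOUN", "bark"),
}

# Hypothetical lexeme identifiers keyed by (lemma, part of speech).
# These are made up for illustration and are not real OED identifiers.
LEXEMES = {
    ("dog", "NOUN"): "lexeme:dog_n",
    ("bark", "VERB"): "lexeme:bark_v",
    ("bark", "NOUN"): "lexeme:bark_n",
}


@dataclass
class Annotation:
    token: str
    pos: Optional[str]
    lemma: Optional[str]
    lexeme_id: Optional[str]


def annotate(text):
    """Tokenise, tag, lemmatise, and link each word to a lexeme if known."""
    annotations = []
    for token in re.findall(r"[A-Za-z]+", text):
        pos, lemma = TAGGER.get(token.lower(), (None, None))
        lexeme_id = LEXEMES.get((lemma, pos))  # None when no entry matches
        annotations.append(Annotation(token, pos, lemma, lexeme_id))
    return annotations


for annotation in annotate("The dog barks"):
    print(annotation)
```

Keeping the lemma and part of speech alongside the lexeme link means downstream tools, such as a visualiser, can read whichever layer of the annotation they need.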

So what does it all mean?

The OED Text Annotator mobilises the value and richness of the OED dataset for linguistic analysis and NLP applications. It may therefore both support new discoveries in research that relies on text analysis and help power more advanced and robust NLP systems for industrial applications.
