Made with ❤️ by Anton Vasetenkov
Towards more linked lexicographical data: Lexemes on Wikidata
Sep 12, 2020

A glimpse into the meaning and other properties of words described with structured and linked data.

Wikidata is arguably the world's best-known open structured knowledge base. Generally speaking, all things stored in Wikidata are entities, which are either real-world objects such as Sky Tower or more abstract topics such as JavaScript. Such "common" entities are known as concepts or Q-items, and account for the majority of data stored in Wikidata.

In May 2018, Wikidata added support for a new kind of entity: lexemes, or L-items. A central concept in lexicography and linguistic analysis, a lexeme is a unit of language that groups together the words related through inflection. For example, the English verb run is a lexeme covering the word forms run, runs, ran, and running, all of which share the same core meaning. Capturing such lexicographical information is important, and publishing it as linked data greatly increases the utility of the resulting dataset.
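To make the grouping concrete, here is a minimal Python sketch that treats a lexeme as a lemma plus its inflected forms and builds a reverse index from each form back to the lemma. The names `LEXEME_FORMS` and `build_form_index` are illustrative only and are not part of any Wikidata API; the data is just the run example from the paragraph above.

```python
from collections import defaultdict

# Inflected forms grouped under their lexeme's lemma (the "run" example above).
LEXEME_FORMS = {
    "run": ["run", "runs", "ran", "running"],
}


def build_form_index(lexeme_forms):
    """Map every surface form to the set of lemmas it can belong to."""
    index = defaultdict(set)
    for lemma, forms in lexeme_forms.items():
        for form in forms:
            index[form].add(lemma)
    return dict(index)


FORM_INDEX = build_form_index(LEXEME_FORMS)
print(FORM_INDEX["ran"])  # {'run'}
```

A set is used for the values because the same surface form can belong to more than one lexeme (for instance, a noun and a verb that are spelled alike).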

The structure of a lexeme

Each L-item in Wikidata has a headword (lemma) and is linked to its language, its lexical category, and its lexical forms and senses. For example, the English noun table has multiple senses, including "item of furniture" and "arrangement of data in rows and columns", and lists the forms table, tables, table's, and tables'.
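The structure described above can be sketched as a small set of Python dataclasses. This is a simplified illustration, not the actual Wikidata data model, which is richer (real entities also carry identifiers and statements, for example). The class and field names are assumptions made for this sketch; Q1860 (English) and Q1084 (noun) are the item IDs used in the queries later in this article.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Form:
    representation: str  # the written form, e.g. "tables"


@dataclass
class Sense:
    gloss: str  # a short description of one meaning


@dataclass
class Lexeme:
    lemma: str             # the headword
    language: str          # Wikidata item ID of the language
    lexical_category: str  # Wikidata item ID of the lexical category
    forms: List[Form] = field(default_factory=list)
    senses: List[Sense] = field(default_factory=list)


# The English noun "table" as described in the text.
TABLE = Lexeme(
    lemma="table",
    language="Q1860",          # English
    lexical_category="Q1084",  # noun
    forms=[Form(r) for r in ("table", "tables", "table's", "tables'")],
    senses=[
        Sense("item of furniture"),
        Sense("arrangement of data in rows and columns"),
    ],
)

print([form.representation for form in TABLE.forms])
print([sense.gloss for sense in TABLE.senses])
```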

Accessing the lexeme data programmatically

As with all Wikidata entities, the statements about lexemes can be queried using the Wikidata Query Service.

For example, this query returns the canonical forms (lemmas) of all English noun lexemes:

SELECT ?lexeme ?lemma
WHERE {
    ?lexeme dct:language wd:Q1860 ;               # language: English
            wikibase:lexicalCategory wd:Q1084 ;   # lexical category: noun
            wikibase:lemma ?lemma .               # canonical form (lemma)
}

Query results (truncated):

lexeme       lemma
wd:L11845    gut
wd:L12158    pollution
wd:L12190    trap
wd:L12204    transit
wd:L12212    freight
...          ...
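The same query can also be sent from a script. Below is a minimal sketch using only the Python standard library. It assumes the public Wikidata Query Service endpoint at https://query.wikidata.org/sparql and the standard SPARQL JSON results format. The function names, the `LIMIT 5` clause, and the User-Agent string are additions made for this example, and `SAMPLE_RESPONSE` is a hand-written payload shaped like a real response (its single row is taken from the results above) so that the parsing step can be tried offline.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# The query from the text, with a LIMIT added to keep the response small.
ENGLISH_NOUNS_QUERY = """
SELECT ?lexeme ?lemma
WHERE {
    ?lexeme dct:language wd:Q1860 ;
            wikibase:lexicalCategory wd:Q1084 ;
            wikibase:lemma ?lemma .
}
LIMIT 5
"""


def build_request(query, user_agent="lexeme-demo/0.1 (example script)"):
    """Build a GET request for the query; the User-Agent value is a placeholder."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    headers = {
        "Accept": "application/sparql-results+json",
        "User-Agent": user_agent,
    }
    return urllib.request.Request(url, headers=headers)


def rows_from_results(payload):
    """Flatten a SPARQL JSON result into a list of {variable: value} dicts."""
    variables = payload["head"]["vars"]
    return [
        {name: binding[name]["value"] for name in variables if name in binding}
        for binding in payload["results"]["bindings"]
    ]


def run_query(query):
    """Send the query to the endpoint (needs network access) and return rows."""
    with urllib.request.urlopen(build_request(query), timeout=60) as response:
        return rows_from_results(json.load(response))


# Offline demonstration of the parsing step with a hand-written payload.
SAMPLE_RESPONSE = {
    "head": {"vars": ["lexeme", "lemma"]},
    "results": {"bindings": [{
        "lexeme": {"type": "uri", "value": "http://www.wikidata.org/entity/L11845"},
        "lemma": {"type": "literal", "xml:lang": "en", "value": "gut"},
    }]},
}
print(rows_from_results(SAMPLE_RESPONSE))
```

Calling `run_query(ENGLISH_NOUNS_QUERY)` performs the actual HTTP request. It is left out of the snippet so that the example runs without network access.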

To get all the words in all languages that mean "(liquid) water", the following query can be used:

SELECT ?languageLabel ?lemma
WHERE {
    ?lexeme dct:language ?language ;
            wikibase:lemma ?lemma ;
            ontolex:sense [
                wdt:P5137 wd:Q29053744    # P5137 (item for this sense) = liquid water
            ] .
    # Resolve ?language to a human-readable ?languageLabel
    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY ?languageLabel

Query results (truncated):

languageLabel    lemma
'Are'are         wai
Abaza            дзы
Abkhaz           аӡы
Acehnese         ie
Achi             yaʼ
...              ...
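The concept in this query can be swapped for any other Wikidata item. The following sketch parameterises the query by item ID and groups result rows by language. The helper names (`words_for_concept_query`, `group_by_language`) are hypothetical, and `SAMPLE_ROWS` simply restates the truncated results shown above rather than a live response.

```python
import re
from collections import defaultdict
from string import Template

# The query from the text, with the concept's item ID left as a placeholder.
SENSE_QUERY = Template("""
SELECT ?languageLabel ?lemma
WHERE {
    ?lexeme dct:language ?language ;
            wikibase:lemma ?lemma ;
            ontolex:sense [
                wdt:P5137 wd:$item
            ] .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY ?languageLabel
""")


def words_for_concept_query(item_id):
    """Return the query text for one concept, rejecting malformed item IDs."""
    if not re.fullmatch(r"Q[1-9][0-9]*", item_id):
        raise ValueError(f"not a Wikidata item ID: {item_id!r}")
    return SENSE_QUERY.substitute(item=item_id)


def group_by_language(rows):
    """Collect lemmas per language label, keeping the original row order."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["languageLabel"]].append(row["lemma"])
    return dict(grouped)


# Rows copied from the truncated results above.
SAMPLE_ROWS = [
    {"languageLabel": "'Are'are", "lemma": "wai"},
    {"languageLabel": "Abaza", "lemma": "дзы"},
    {"languageLabel": "Abkhaz", "lemma": "аӡы"},
    {"languageLabel": "Acehnese", "lemma": "ie"},
    {"languageLabel": "Achi", "lemma": "yaʼ"},
]

print(words_for_concept_query("Q29053744"))
print(group_by_language(SAMPLE_ROWS))
```

Validating the ID before substituting it keeps arbitrary text from being spliced into the query string.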

Applications of Wikidata lexemes

High-quality structured linguistic datasets are the sine qua non of computational linguistics and its applications. As the need for more comprehensive linguistic datasets keeps growing, Wikidata lexemes offer an open, linked, multilingual resource for systems that tackle problems such as word-sense disambiguation, machine translation, text classification, and summarisation.

The bottom line

Since the first release in 2018, Wikidata's lexicographical data has grown to include over 300,000 lexemes in over 700 languages (as of September 2020).

There is little doubt that Wikidata lexemes will become the go-to source of structured lexicographical data for language researchers and technologists.
