22 - 23 June 2020

Towards an ontology based on Hallig-Wartburg’s Begriffssystem for Historical Linguistic Linked Data

Authors:  Sabine Tittel, Frances Gillis-Webber and Alessandro A. Nannini

Conference:  7th Workshop on Linked Data in Linguistics: Building Tools and Infrastructures (LDL 2020), co-located with LREC 2020. Presented online.


Empowering end users to search for historical linguistic content with a performance that far exceeds the search functions offered by websites of, e.g., historical dictionaries is undoubtedly a major advantage of (Linguistic) Linked Open Data ([L]LOD). An important aim of lexicography is to enable a language-independent, onomasiological approach, and modelling linguistic resources following the LOD paradigm facilitates the semantic mapping to ontologies that makes this approach possible. Hallig-Wartburg's Begriffssystem (HW) is a well-known extra-linguistic conceptual system used as an onomasiological framework by many historical lexicographical and lexicological works. Published in 1952, HW has since been digitised. With proprietary XML data as the starting point, our goal is the transformation of HW into Linked Open Data in order to facilitate its use by linguistic resources modelled as LOD. In this paper, we describe the particularities of the HW conceptual model and the method of converting HW. We discuss two approaches: (i) the representation of HW in RDF using SKOS, the SKOS thesaurus extension, and XKOS, and (ii) the creation of a lightweight ontology expressed in OWL, based on the RDF/SKOS model. The outcome is illustrated with use cases of medieval Gascon and Italian.
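As a rough illustration of approach (i), the sketch below emits SKOS statements for two concepts of a hierarchy as N-Triples. It is not the paper's actual model: the `http://example.org/hw/` namespace, the notations `A` and `A-IV`, and the function name are assumptions made for the example, and the French labels are illustrative and should be checked against the HW edition. Only the SKOS terms themselves (`skos:Concept`, `skos:prefLabel`, `skos:inScheme`, `skos:broader`) are standard.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/hw/"  # hypothetical namespace, not the published one


def concept_triples(notation, label, lang, broader=None):
    """Return N-Triples lines describing one concept of the scheme."""
    # Escape backslashes and double quotes so the literal stays valid.
    safe_label = label.replace("\\", "\\\\").replace('"', '\\"')
    subject = f"<{BASE}{notation}>"
    lines = [
        f"{subject} <{RDF_TYPE}> <{SKOS}Concept> .",
        f'{subject} <{SKOS}prefLabel> "{safe_label}"@{lang} .',
        f"{subject} <{SKOS}inScheme> <{BASE}scheme> .",
    ]
    if broader is not None:
        # Link the concept to its parent in the hierarchy.
        lines.append(f"{subject} <{SKOS}broader> <{BASE}{broader}> .")
    return lines


if __name__ == "__main__":
    # Illustrative labels only; notations are made up for this sketch.
    for line in concept_triples("A", "L'Univers", "fr"):
        print(line)
    for line in concept_triples("A-IV", "Les animaux", "fr", broader="A"):
        print(line)
```

A lexical resource modelled as LOD could then point at such a concept URI, which is what makes language-independent, onomasiological lookups possible.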

11 - 16 May 2020

A Framework for Shared Agreement of Language Tags beyond ISO 639

Authors:  Frances Gillis-Webber and Sabine Tittel

Conference:  12th edition of the Language Resources and Evaluation Conference (LREC 2020)


The identification and annotation of languages in an unambiguous and standardized way is essential for the description of linguistic data. It is the prerequisite for machine-based interpretation, aggregation, and re-use of the data with respect to different languages. This makes it a key aspect especially for Linked Data and the multilingual Semantic Web. The standard for language tags is defined by IETF’s BCP 47, and ISO 639 provides the language codes that are the tags' main constituents. However, for the identification of lesser-known languages, endangered languages, regional varieties or historical stages of a language, the ISO 639 codes are insufficient. Also, the optional language sub-tags compliant with BCP 47 are not fine-grained enough to represent linguistic variation. We propose a versatile pattern that extends the BCP 47 sub-tag privateuse and is, thus, able to overcome the limits of BCP 47 and ISO 639. Sufficient coverage of the pattern is demonstrated with the use case of linguistic Linked Data of the endangered Gascon language. We show how to use a URI shortcode for the extended sub-tag, making the length compliant with BCP 47. We achieve this with a web application and API developed to encode and decode the language tag.
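The length constraint mentioned above comes from the BCP 47 grammar, where a privateuse sequence is `x` followed by one or more subtags of 1 to 8 alphanumeric characters each. The following minimal sketch checks that rule when assembling a tag. It does not reproduce the pattern proposed in the paper or the web application's API: the function name and the example subtags are hypothetical, and the language subtag itself is not validated.

```python
import re

# BCP 47 (RFC 5646): privateuse = "x" 1*("-" (1*8alphanum))
_PRIVATE_SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def make_private_use_tag(language, parts):
    """Join a language subtag with privateuse subtags, e.g. 'oc-x-gascon'.

    Raises ValueError if any privateuse subtag breaks the 1-8
    alphanumeric-character rule. The language subtag is passed through
    unchecked in this sketch.
    """
    if not parts:
        raise ValueError("at least one privateuse subtag is required")
    for part in parts:
        if not _PRIVATE_SUBTAG.fullmatch(part):
            raise ValueError(f"invalid privateuse subtag: {part!r}")
    return "-".join([language, "x", *parts])


if __name__ == "__main__":
    print(make_private_use_tag("oc", ["gascon"]))
    try:
        # 12 characters: too long for a single subtag.
        make_private_use_tag("oc", ["gasconlandes"])
    except ValueError as err:
        print("rejected:", err)
```

A descriptor that is too long for a single subtag is the kind of case where a shorter code standing in for the full description becomes necessary.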

7 February 2020

Ontology-Based Data Access of Animals with Ontop - A Tutorial

Authors:  Frances Gillis-Webber and C. Maria Keet


View:  An Introduction to Ontology Engineering by C. Maria Keet (Textbook version 1.5, Appendix A.2, Page 256)


The aim of this tutorial is to demonstrate the concept of Ontology-Based Data Access (OBDA), where one queries the data residing in a database through the ontology. We use the Ontop framework for this, which is compatible with the Protégé ontology development environment (ODE), and MySQL is used as the relational database.
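The idea behind OBDA can be sketched without Ontop itself. In the toy example below, mappings tie ontology classes to SQL queries, and a query for a class is answered by also consulting the mappings of its subclasses. Everything here is an assumption made for illustration: the table names, class names and animals are invented, SQLite (from the Python standard library) stands in for MySQL, and the functions are not part of Ontop or Protégé, which use their own mapping language and SPARQL.

```python
import sqlite3

# Hypothetical mappings: ontology class -> SQL query returning its instances.
MAPPINGS = {
    "Herbivore": "SELECT name FROM herbivore",
    "Carnivore": "SELECT name FROM carnivore",
}

# Hypothetical class hierarchy: subclass -> superclass.
SUBCLASS_OF = {"Herbivore": "Animal", "Carnivore": "Animal"}


def build_db():
    """Create an in-memory database with two small tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE herbivore (name TEXT)")
    conn.execute("CREATE TABLE carnivore (name TEXT)")
    conn.execute("INSERT INTO herbivore VALUES ('giraffe')")
    conn.execute("INSERT INTO carnivore VALUES ('lion')")
    return conn


def subclasses(cls):
    """Return cls together with all its direct and indirect subclasses."""
    result = [cls]
    for child, parent in SUBCLASS_OF.items():
        if parent == cls:
            result.extend(subclasses(child))
    return result


def instances_of(conn, cls):
    """Answer 'all instances of cls' by running the mapped SQL queries."""
    names = set()
    for candidate in subclasses(cls):
        sql = MAPPINGS.get(candidate)
        if sql is not None:
            names.update(row[0] for row in conn.execute(sql))
    return sorted(names)


if __name__ == "__main__":
    conn = build_db()
    # 'Animal' has no table of its own; the hierarchy supplies the answer.
    print(instances_of(conn, "Animal"))
```

The point of the sketch is that `Animal` has no table of its own: the answer is obtained through the ontology's hierarchy, while the data stays in the relational database.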

13 December 2019

MPhil Graduation

Location:  Sarah Baartman Hall, University of Cape Town

Dissertation:  The Construction of a Linguistic Linked Data Framework for Bilingual Lexicographic Resources

Supervisors:  Richard Higgs and Connie Bitso

Department:  Department of Knowledge and Information Stewardship (previously Library and Information Studies Centre), University of Cape Town

Download:  DISSERTATION


Little-known lexicographic resources can be of tremendous value to users once digitised. By extending the digitisation efforts for a lexicographic resource, converting the human-readable digital object to a state that is also machine-readable, structured data can be created that is semantically interoperable, thereby enabling the lexicographic resource to access, and be accessed by, other semantically interoperable resources. The purpose of this study is to formulate a process for converting a lexicographic resource in print form to a machine-readable bilingual lexicographic resource applying linguistic linked data principles, using the English-Xhosa Dictionary for Nurses as a case study. This is accomplished by creating a linked data framework, in which data are expressed in the form of RDF triples and URIs, in a manner which allows for extensibility to a multilingual resource. Click languages with characters not typically represented by the Roman alphabet are also considered. The purpose of this linked data framework is to define each lexical entry as “historically dynamic”, instead of “ontologically static” (Rafferty, 2016:5). For a framework which has instances in constant evolution, focus is thus given to the management of provenance and linked data generation thereof. The output is an implementation framework which provides methodological guidelines for similar language resources in the interdisciplinary field of Library and Information Science.

1 - 3 October 2019

Identification of Languages in Linked Open Data: a Case Study of Linguistic Data of French Combining a Diatopic with a Diachronic Perspective

Authors:  Sabine Tittel and Frances Gillis-Webber

Conference:  eLex 2019: Smart Lexicography

View:  Conference Proceedings  |  Download:  PAPER


When modelling linguistic resources as Linked Data, the identification of languages using language tags and language codes is a mandatory task. IETF’s BCP 47 defines the standard for tags, and ISO 639 provides the codes. However, these codes are insufficient for the identification of diatopic variation within a language and, also, for different historical language stages. This weakness hampers the accurate identification of data, which in turn leads to ambiguity when extending, aggregating and re-using this data—a key notion of Linked Open Data and the Semantic Web. We show the limitations of language identification with a case study of French linguistic data from both a diachronic and a diatopic perspective. Our exemplary data derives from dictionaries of Old French, Middle French, and of Modern French dialects, and from a Modern French linguistic atlas. For each exemplar, we propose a solution using the privateuse sub-tag of BCP 47’s language tag, staying within the boundaries of existing standards. Using a predefined pattern for the privateuse sub-tag, the solutions enable a dialect, a patois, in combination with a time period, to be defined and identified. This can lead to shared agreement of language tags that will increase interoperability within the context of Linked Data.

30 June - 6 July 2019

Summer School:  ISWS 2019, International Semantic Web Research Summer School

Location:  Bertinoro, Italy  |  Bertinoro International Center for Informatics

Funding:  €970 grant kindly provided by ISWS

23 June 2019

A Model for Language Annotations on the Web

Authors:  Frances Gillis-Webber, Sabine Tittel and C. Maria Keet

Conference:  1st Iberoamerican Knowledge Graphs and Semantic Web Conference (KGSWC 2019)

View: (Communications in Computer and Information Science 2019, 1029)  |  SUPPLEMENTARY MATERIAL


Several annotation models have been proposed to enable a multilingual Semantic Web. Such models home in on the word and its morphology and assume the language tag and URI come from external resources. These resources, such as ISO 639 and Glottolog, have limited coverage of the world’s languages and have a very limited thesaurus-like structure at best, which hampers language annotation, hence constraining research in Digital Humanities and other fields. To resolve this ‘outsourced’ task of the current models, we developed a model for representing information about languages, the Model for Language Annotation (MoLA), such that basic language information can be recorded consistently and therewith queried and analyzed as well. This includes the various types of languages, families, and the relations among them. MoLA is formalized in OWL so that it can integrate with Linguistic Linked Data resources. Sufficient coverage of MoLA is demonstrated with the use case of French.

20 - 23 May 2019

The Shortcomings of Language Tags for Linked Data when Modeling Lesser-Known Languages

Authors:  Frances Gillis-Webber and Sabine Tittel

Conference:  2nd Conference on Language, Data and Knowledge (LDK 2019)

Location:  Leipzig, Germany  |  University of Leipzig in the Assembly Hall and University Church of St. Paul

View: (OASIcs 2019, 70)  |  Download:  PRESENTATION


In recent years, the modeling of data from linguistic resources with Resource Description Framework (RDF), following the Linked Data paradigm and using the OntoLex-Lemon vocabulary, has become a prevalent method to create datasets for a multilingual web of data. An important aspect of data modeling is the use of language tags to mark lexicons, lexemes, word senses, etc. of a linguistic dataset. However, attempts to model data from lesser-known languages show significant shortcomings with the authoritative list of language codes by ISO 639: for many lesser-known languages spoken by minorities and also for historical stages of languages, language codes, the basis of language tags, are simply not available. This paper discusses these shortcomings based on the examples of three such languages, i.e., two varieties of click languages of Southern Africa together with Old French, and suggests solutions for the issues identified.

6 November 2018

Conversion of the English-Xhosa Dictionary for Nurses to a Linguistic Linked Data Framework

Author:  Frances Gillis-Webber

Journal:   Special Issue of Information: Towards the Multilingual Web of Data

View: (Information 2018, 9(11), 274)


The English-Xhosa Dictionary for Nurses (EXDN) is a bilingual, unidirectional printed dictionary in the public domain, with English and isiXhosa as the language pair. By extending the digitisation efforts of EXDN from a human-readable digital object to a machine-readable state, using Resource Description Framework (RDF) as the data model, semantically interoperable structured data can be created, thus enabling EXDN’s data to be reused, aggregated and integrated with other language resources, where it can serve as a potential aid in the development of future language resources for isiXhosa, an under-resourced language in South Africa. The methodological guidelines for the construction of a Linguistic Linked Data framework (LLDF) for a lexicographic resource, as applied to EXDN, are described, where an LLDF can be defined as a framework: (1) which describes data in RDF, (2) using a model designed for the representation of linguistic information, (3) which adheres to Linked Data principles, and (4) which supports versioning, allowing for change. The result is a bidirectional lexicographic resource, previously bounded and static, now unbounded and evolving, with the ability to extend to multilingualism.
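Criteria (1) and (2) of the definition above can be made concrete with a small sketch that serialises one lexical entry as Turtle using OntoLex-Lemon terms. This is not the article's actual data or URI pattern: the `http://example.org/exdn/` namespace, the entry identifier and the headword are placeholders chosen for illustration, and versioning and provenance (criterion 4) are left out. Only the OntoLex-Lemon terms (`ontolex:LexicalEntry`, `ontolex:canonicalForm`, `ontolex:Form`, `ontolex:writtenRep`) are taken from the published vocabulary.

```python
PREFIXES = (
    "@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .\n"
    "@prefix : <http://example.org/exdn/> .\n"  # hypothetical namespace
)


def lexical_entry_turtle(entry_id, written_rep, lang):
    """Return a Turtle document for one entry and its canonical form."""
    # Escape backslashes and double quotes inside the literal.
    literal = written_rep.replace("\\", "\\\\").replace('"', '\\"')
    return (
        f"{PREFIXES}\n"
        f":{entry_id} a ontolex:LexicalEntry ;\n"
        f"    ontolex:canonicalForm :{entry_id}_form .\n"
        f"\n"
        f":{entry_id}_form a ontolex:Form ;\n"
        f'    ontolex:writtenRep "{literal}"@{lang} .\n'
    )


if __name__ == "__main__":
    # Placeholder headword, not taken from the dictionary itself.
    print(lexical_entry_turtle("entry_nurse", "nurse", "en"))
```

Because each entry and form has its own URI, other resources can link to them, which is the basis for the reuse and aggregation described above.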

10 - 15 September 2018

Summer School:  ISAO 2018, 4th Interdisciplinary School on Applied Ontology

Location:  Cape Town, South Africa  |  University of Cape Town

2 - 6 July 2018

Converting the English-Xhosa Dictionary for Nurses to Linguistic Linked Data

Author:  Frances Gillis-Webber

Conference: International Congress of Linguists (ICL20)

Location:  Cape Town, South Africa  |  Cape Town International Convention Centre (CTICC)

7 - 12 May 2018

Managing Provenance and Versioning for an (Evolving) Dictionary in Linked Data Format

Author:  Frances Gillis-Webber

Conference:  6th Workshop on Linked Data in Linguistics: Towards Linguistic Data Science (LDL-2018), 11th edition of the Language Resources and Evaluation Conference (LREC 2018)

Location:  Miyazaki, Japan  |  Phoenix Seagaia Conference Center



The English-Xhosa Dictionary for Nurses is a unidirectional dictionary with English and isiXhosa as the language pair, published in 1935 and recently converted to Linguistic Linked Data. Using the OntoLex-Lemon model, an ontological framework was created whose purpose was to present each lexical entry as “historically dynamic” instead of “ontologically static” (Veltman, 2006:6, cited in Rafferty, 2016:5). Particular attention was therefore given to provenance information and the generation of linked data for an ontological framework whose instances are constantly evolving. The output is a framework which provides guidelines for similar applications regarding URI patterns, provenance, versioning, and the generation of RDF data.

26 - 30 June 2017

Summer School:  2nd Summer Datathon on Linguistic Linked Open Data (SD-LLOD-17)

Location:  Cercedilla, Spain  |  Residencia Lucas Olazábal of Universidad Politécnica de Madrid

17 - 20 January 2017

Ways to Improve the User Experience in Digital Humanities

Presenter:  Frances Gillis-Webber

Conference:  Inaugural Conference of the Digital Humanities Association of Southern Africa (DHASA 2017)

Location:  Stellenbosch, South Africa  |  Stellenbosch Institute for Advanced Studies (STIAS)

3 - 4 March 2016

Workshop:  OWL & Protégé Tutorial

Location:  Manchester, United Kingdom  |  University of Manchester

Funding:  £500 travel grant kindly provided by University of Manchester

