Zaza Text Corpus Project

This digital corpus was developed as part of a postdoctoral research project funded by TÜBİTAK (The Scientific and Technological Research Council of Turkey) through its 2219 – International Postdoctoral Research Fellowship Program. The project was conducted at the Institute for the Interdisciplinary Study of Language Evolution and the Zurich Center for Linguistics at the University of Zurich, under the academic supervision of Prof. Dr. Paul Widmer and Dr. Dagmar Jung.

Purpose and Structure

The corpus is based on a curated selection of 20 texts written in a Northern dialect of Zazaki (Kırmancki) and serves as the foundation for a structured, linguistically annotated Zaza language resource.

Each text is enriched with detailed linguistic metadata, including:

  • Part-of-speech (POS) tags
  • Named entities
  • Structural levels (sentence, phrase, word)

The annotations span several linguistic domains and subcategories:

  • Syntax
  • Morphology
  • Semantics
  • Morphosyntax
  • Lexicology
  • Phrase structure

Technology

Annotation was performed using the digital tool CATMA (Computer Assisted Text Markup and Analysis). While some texts are already fully annotated, others are still in progress. Annotations will continue to be added incrementally.

In the web interface, hovering the mouse over a word displays its annotations. Both words and annotation layers can be searched, with results highlighted in color for easy identification. Annotations can also be downloaded for each individual text or as a combined dataset.

Users can explore the corpus through:

  • Keyword search
  • Filtering options
  • Statistical analysis tools
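
For readers who download the annotations and want to reproduce this kind of search and counting offline, the following is a minimal sketch in Python. It assumes the annotations have already been loaded into a list of simple records. The field names (text_id, word, category, tag) and the sample values are hypothetical placeholders and do not reflect the corpus's actual export format.

```python
from collections import Counter

# Hypothetical records, one per annotation. Field names and values are
# illustrative placeholders, not the real export schema of the corpus.
SAMPLE = [
    {"text_id": "t01", "word": "w1", "category": "Morphology", "tag": "tagA"},
    {"text_id": "t01", "word": "w1", "category": "Syntax", "tag": "tagB"},
    {"text_id": "t02", "word": "w2", "category": "Morphology", "tag": "tagA"},
    {"text_id": "t02", "word": "w3", "category": "Semantics", "tag": "tagC"},
]


def search(records, word=None, category=None):
    """Return the records matching the given word and/or annotation category."""
    return [
        r for r in records
        if (word is None or r["word"] == word)
        and (category is None or r["category"] == category)
    ]


def count_by_category(records):
    """Count how many annotations fall into each category."""
    return Counter(r["category"] for r in records)


if __name__ == "__main__":
    # Keyword search: every annotation attached to one word form.
    for record in search(SAMPLE, word="w1"):
        print(record["text_id"], record["category"], record["tag"])
    # Simple statistic: number of annotations per category.
    print(count_by_category(SAMPLE))
```

Filtering by category and by word can be combined in one call, for example search(SAMPLE, word="w1", category="Syntax"). Real data would need a loading step adapted to whatever format the download actually provides.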

Project Objectives

This corpus was developed to serve multiple purposes:

  • To support linguistic and philological research
  • To contribute to the documentation and revitalization of the Zaza language (Zazaki)
  • To provide high-quality language resources for students and educators
  • To preserve and promote Zaza cultural heritage in its original linguistic form

This corpus was created by Assoc. Prof. Dr. İlyas Arslan.
All included texts were previously published by their respective authors.
The use of these texts in the corpus was authorized by the authors.
For academic or other use, please cite as follows:

Arslan, İ. (2025). Zaza Text Corpus. University of Zurich & Munzur University. https://ilyasarslan62.github.io/zazatextcorpus/