
Madrid Ancient Greek Wordlist (MAGWL)


A list compiled by Daniel Riaño Rufilanchas
Last updated: 18.04.2020 20:08 CTM

I have compiled here a list of Ancient Greek words, lemmatized, morphologically analysed, and glossed, to be used in all kinds of projects related to the Ancient Greek world, scholarly or educational, as well as in linguistic and all kinds of Digital Humanities projects. Today, the list contains more than 1,300,000 different forms of Ancient Greek words (and variant spellings for words found in papyri) and about 3,800,000 different possible morphological analyses for them. The words come from every kind of literary text, in prose or verse, from Homer (c. 8th century BC) to the 6th century AD, plus many non-literary papyri coming from the Papyrological Navigator.

The list contains the proper names (PN) identified as such (through capitalisation) in the sources. Some effort has been put into automatically identifying the nature of the referent (a person, a building, an event…). A much more complete list of personal names, though, is to be found in our list of Ancient Greek Personal Names.

This page was set up to collect reactions and ideas before the final release of the full MAGWL. The list will be uploaded in a way that allows any student or researcher to contribute their corrections and to download the entire list.

You can see the first 10K entries of the list here.

A MAJOR REVISION OF THE WORDLIST IS CURRENTLY UNDER WAY. THOUSANDS OF GHOST FORMS AND FALSE FORM-LEMMA PAIRINGS ARE BEING REMOVED, AND SEVERAL THOUSAND NEW FORMS ARE BEING ENTERED. PLEASE BE PATIENT!

Many thanks to Silvia, Antonio and José Antonio for their help!

This page is a work in progress by the Grupo de Lingüística Griega del ILC (GLG).


What does the MAGWL look like?

The MAGWL consists of a series of lines, each containing lexical and morphological information about a Greek word; or rather, about any morphological form of a word, or even about spelling variants of a given morphological form, in the case of the papyrological documentation.

Here is an example:

Βουκολίδα Βουκολίδης ø Βουκολιδα n-d---ma-;n-d---mn-;n-d---mv- ø Bucolides, PN of a man Morph

μετεστρατοπεδεύσατο μεταστρατοπεδεύω μετά-στρατοπεδεύω μεταστρατοπεδευω v3saim--- ø shift oneʼs ground; camp Diorisis;Morph

Every line consists of eight items:
  1. A morphological form (or variant spelling) of a Greek word.

  2. The lemma to which that form belongs.

  3. If the word has a prefix, the parts of the word (prefixes + lexeme, no further Wortbildung description is given).

  4. The form of the word, without diacritics.

  5. The possible morphological parsings of the form, using the compact Perseus schema of annotation. (The meaning of the letters at each position is given under "The morphological tags" below.)

  6. Occasional, and always incomplete, information on the dialects where the form is attested.

  7. A gloss of the word (in English, or in Spanish if no English translation was found).

  8. The corpora where any of the above items were found.
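
The eight items can be read mechanically. The following Python is a minimal sketch, not the project's own tooling: it assumes that the items are separated by tabs (the page does not state the delimiter), that "ø" marks an empty item, and that ";" separates alternatives in items 5 and 8, as the sample lines suggest. The names Entry, parse_line and strip_diacritics are hypothetical. In the sample lines on this page, item 4 sometimes matches the form and sometimes the lemma with diacritics removed, so the sketch does not check it against either.

```python
import unicodedata
from dataclasses import dataclass
from typing import List

EMPTY = "ø"  # assumed marker for an empty item, as in the sample lines


@dataclass
class Entry:
    form: str            # 1. inflected form or variant spelling
    lemma: str           # 2. lemma
    parts: str           # 3. prefixes + lexeme ("" if none)
    plain: str           # 4. form without diacritics
    parsings: List[str]  # 5. nine-character morphological tags
    dialects: str        # 6. dialect information ("" if none)
    gloss: str           # 7. gloss (kept whole, even if it contains ";")
    corpora: List[str]   # 8. corpora where the data were found


def _text(value: str) -> str:
    return "" if value == EMPTY else value


def _alternatives(value: str) -> List[str]:
    return [] if value == EMPTY else value.split(";")


def parse_line(line: str, sep: str = "\t") -> Entry:
    """Split one line into its eight items (the delimiter is an assumption)."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 8:
        raise ValueError(f"expected 8 items, found {len(fields)}")
    form, lemma, parts, plain, parsings, dialects, gloss, corpora = fields
    return Entry(form, lemma, _text(parts), plain, _alternatives(parsings),
                 _text(dialects), gloss, _alternatives(corpora))


def strip_diacritics(word: str) -> str:
    """Remove accents, breathings and other combining marks from a word.

    This also drops the iota subscript; whether the list does so is not stated.
    """
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The second sample line above, rebuilt with the assumed tab delimiter.
sample = "\t".join([
    "μετεστρατοπεδεύσατο", "μεταστρατοπεδεύω", "μετά-στρατοπεδεύω",
    "μεταστρατοπεδευω", "v3saim---", "ø", "shift oneʼs ground; camp",
    "Diorisis;Morph",
])
entry = parse_line(sample)
print(entry.lemma, entry.parsings, entry.corpora)
print(strip_diacritics("Βουκολίδα"))
```

Keeping the gloss as a single string matters because glosses themselves may contain ";", as in the sample above.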


Where does all this data come from?

Data in this list comes from four kinds of open data sources:

All the repositories of data used for the list are identified in the list by the abbreviation placed at the beginning of each paragraph of the following section. This list would not be possible without the creators of the digital resources mentioned here. To them, our deepest gratitude. The following comments are intended as a guide for the user of the Madrid_AGWL, to allow him or her to assess the reliability of each piece of data. In no way should they be interpreted as criticism of any project.

Previous Word Lists openly accessible

Morph.

Numerically speaking, the largest share of the data comes from a word list generated by the Morpheus parser. It uses Beta Code and includes lemmatization and morphological tagging without part-of-speech identification, and provides a gloss for most forms. This list used to be available at several places on the web, especially on the sites related to the Perseus Project, the ultimate origin of this parser and of much other data. We call this the MorpheusList. Today, a version of this list is accessible together with the linguistic resources of the excellent Diogenes program for querying the TLG_D.

MorphU

Giuseppe G. A. Celano improved the above list and contributed a new Morpheus Unicode list, which corrects errors in the previous list.

Trism, AGPN (GLG_NPGA)

The Nombres de persona en griego antiguo (Ancient Greek Personal Names) list is a c. 80K-word list built by the same team that built MAGWL, using The Lexicon of Greek Personal Names and the Trismegistos repository of papyrological & epigraphic resources (all names come from papyri).

Treebanks with POS tagging and lemmatization

In one way or another, all the treebanks mentioned here follow the guidelines of the Perseus Dependency Treebanks.

Gorman

Vanessa Gorman has built an Ancient Greek treebank of more than 500K words coming from literary texts.

Diorisis

«The Diorisis Ancient Greek Corpus is a digital collection of ancient Greek texts (from Homer to the early fifth century AD) compiled for linguistic analyses, and specifically with the purpose of developing a computational model of semantic change in Ancient Greek» built by Barbara McGillivray and Alessandro Vatri.

Keersm

Alek Keersmaekers has built a huge Treebank of documentary papyri texts containing several million words (plus a good number of parts of words and gaps).

Riaño

The undersigned built his own corpus of grammatically analysed texts, which was then ported into an immediate-constituents treebank and later converted into a stratified dependency treebank compliant with the AGLDTB schema. Data from the initial 100K words of this corpus (Ancient Greek prose writers) have been included in this list.

Ancient Greek lexica previously digitised

The relevant data of the LSJ and the DGE was extracted and processed thanks to the wonderful Logeion project. No words are enough to thank its creators.

LSJ

Liddell-Scott-Jones is the standard Ancient Greek dictionary in English-speaking countries. In the version digitised in Logeion, it contains 116,184 lemmata. Few of them are proper names.

DGE

Volumes 1 to 7 of the Diccionario Griego Español cover the Ancient Greek vocabulary up to ἔξαυος. The digitised version of the DGE in Logeion contains 59,330 lemmata, including some proper nouns.

Inference work from crossing pieces of all the above data

This is my own work.


How the sausage is made

The resulting list (with its many errors) is the output of processing all the above material and consolidating it automatically. All the lemmatization is the work of those responsible for the word lists and the treebanks, but chiefly of the people in charge of the Morph lists.

Some of the features of the list depend on the original sources. For instance, the original Morph list lemmatised many verbal compounds by differentiating the first preverb from all the rest, but the results were sometimes problematic. MorphU fixed some of the impossible analyses.

The assignment of the part of speech of the forms is guesswork based on the grammatical information provided by Morph, or on the cues that may be obtained from the lexica (Ancient Greek lexica usually do not say explicitly whether a word is, say, a verb, an adjective or a noun, but the information they provide should give a human user enough cues to infer that).

All the treebanks (except Riaño's) depend in some way on the Morph list, or on a corrected version thereof, for lemmatization and POS tagging. However, they have corrected many of its errors, Gorman most notably.

Before processing, Keersmaekers' corpus was cleaned of several thousand numerals and a fair number of artifacts (sometimes the digital result of the different conventions used for editing papyri over the last two centuries). In the process I may have deleted some legitimate words.

This list contains many non-Greek names (names without a Greek etymology). I made no attempt to delete or separate them from the rest of the names. Often, such names appear without inflection. Quite often, the editors did not try to accent the forms or the lemma.

I have used LSJ and DGE to correct some apparent anomalies in the treebanks. I used LGPN and the Trismegistos repository of names to do the same with the proper names. These changes are not noted in the list, except for the fact that the name of the corpus (LGPN or Trism) will appear in the line corresponding to the nominative singular of that PN.

Some of the treebanks may present some lacunae in proper names, especially (for reasons unknown to me) with names starting with rho and omega.

The glosses of the words in the list come from the first two definitions of LSJ and DGE. Because of that procedure, the gloss is not always the most common meaning of the word in all periods (lexica sometimes start with the oldest assumed meaning, or the earliest attested one, or the intransitive before the transitive, or the other way around, etc.). When I couldn't extract a definition from LSJ (or when a word is not present in that dictionary) I used the DGE (preceded by "Sp." for Spanish).
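
The fallback described above could be sketched as follows. This is an illustration, not the procedure actually used: it assumes that the senses of each dictionary entry have already been extracted into lists of strings, that the two senses are joined with "; ", and that the Spanish marker is written as "Sp. " in front of the gloss. The function name choose_gloss and the placeholder senses are hypothetical.

```python
from typing import List

EMPTY = "ø"  # assumed marker for a missing gloss


def choose_gloss(lsj_senses: List[str], dge_senses: List[str],
                 limit: int = 2, sep: str = "; ") -> str:
    """Prefer the first `limit` LSJ senses; otherwise fall back to DGE."""
    lsj = [s.strip() for s in lsj_senses if s.strip()][:limit]
    if lsj:
        return sep.join(lsj)
    dge = [s.strip() for s in dge_senses if s.strip()][:limit]
    if dge:
        return "Sp. " + sep.join(dge)  # mark the gloss as Spanish
    return EMPTY


# Placeholder senses, for illustration only (not taken from the dictionaries).
print(choose_gloss(["sense A", "sense B", "sense C"], ["acepción A"]))
print(choose_gloss([], ["acepción A", "acepción B"]))
```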

For the personal names, I first provide an automatic transliteration of the name (most of the time it should correspond to at least one possible transliteration in English). Then, if I was able to teach the program to recognise what kind of entity it is (based on the clues provided by the dictionary's definition), I give a tag that classifies the name among persons, events and places. The tags you find in such cases are "PN of a festivity", "PN of a woman", "PN of a man", etc. (PN is of course "Proper Name"). This kind of tagging is intended for students of Greek and for automatic linguistic analysis.

Because of all the above, you may assume that: a) whenever data from LSJ and DGE on one side disagree with the rest of the sources on the other, the information coming from the first two sources is usually the safer bet; b) whenever data from Morph disagree with the treebanks depending on it, the data from the human-curated source are generally the better (this is especially true of Gorman).

For some reason, the English convention of capitalising verbs and adjectives derived from proper names was extended to the modern printing of Ancient Greek. Inconsistencies in the capitalisation of nominals and of adjectives and verbs derived from personal names have been dealt with in accordance with this convention. For that purpose I have used data taken, or inferred, from LGPN, Trismegistos or the lexica. Likewise, inconsistencies in accentuation have been carefully (and incompletely) dealt with. In the same vein, lexica and the list of personal names have been used to identify the nature of the referent of PNs from other sources, and to identify the right POS tagging for some entries. Most of the time, I have not used the information from lexica and LGPN on the sex of the bearer of a proper name to rectify the gender of nouns in other corpora, since Greek names are often used for both sexes.

When a name has a common meaning (as many Greek names do), I added that meaning in parentheses, after the transliteration and the identification tag, for instance:

  • Ζηνάριον Ζηνάριον ø Ζηναριον n-s---fn- ø Zenarion, PN of a woman Keersm;LGPN;Trism
  • θησεῖα θησεῖον ø θησειον n-p---na-;n-p---nn-;n-p---nv- ø Thesion, PN of a festivity Morph
  • Θησείδαις Θησεῖδαι ø Θησειδαις a-p---md- ø Thesidae, PN of a man Morph
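
For automatic processing, a gloss of this kind can be taken apart again. The pattern below is a sketch inferred from the sample lines on this page and from the description above (transliteration, then the tag, then an optional meaning in parentheses); the real list may contain layouts that it does not cover. The function name parse_pn_gloss and the last example string are hypothetical.

```python
import re
from typing import Dict, Optional

# name, "PN of a/an <kind>", then an optional "(meaning)" at the end.
PN_GLOSS = re.compile(
    r"^(?P<name>[^,]+), PN of an? (?P<kind>[^()]+?)(?: \((?P<meaning>.+)\))?$"
)


def parse_pn_gloss(gloss: str) -> Optional[Dict[str, Optional[str]]]:
    """Return name, kind and meaning, or None if the gloss is not a PN gloss."""
    match = PN_GLOSS.match(gloss.strip())
    return match.groupdict() if match else None


print(parse_pn_gloss("Zenarion, PN of a woman"))
print(parse_pn_gloss("Thesion, PN of a festivity"))
# Hypothetical gloss with a parenthesised meaning, for illustration only.
print(parse_pn_gloss("Name, PN of a man (meaning)"))
```

Glosses of ordinary words simply return None, so the same function can be used to separate proper names from the rest of the list.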

During the process of building this list I have occasionally deleted several impossible forms or non-words.


The morphological tags

I have used the Perseus nine-position schema for the morphological description of Greek and Latin words. Although it is not complete (e.g., there is no way to indicate verbal adjectives) and is somewhat problematic (are the infinitive and the participle moods?), it is a compact way to describe the greater part of basic Greek morphology.

The meanings of the characters used for the morphological description (5th item of each line of the list) are these:

1 part of speech

  • n noun
  • v verb
  • a adjective
  • d adverb
  • l article
  • g particle
  • c conjunction
  • r preposition
  • p pronoun
  • m numeral
  • i interjection
  • u punctuation

2 person

  • 1 first person
  • 2 second person
  • 3 third person

3 number

  • s singular
  • p plural
  • d dual

4 tense

  • p present
  • i imperfect
  • r perfect
  • l pluperfect
  • t future perfect
  • f future
  • a aorist

5 mood

  • i indicative
  • s subjunctive
  • o optative
  • n infinitive
  • m imperative
  • p participle

6 voice

  • a active
  • p passive
  • m middle
  • e medio-passive

7 gender

  • m masculine
  • f feminine
  • n neuter

8 case

  • n nominative
  • g genitive
  • d dative
  • a accusative
  • v vocative
  • l locative

9 degree

  • c comparative
  • s superlative
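
The tables above translate directly into a small decoder. The following Python sketch maps each of the nine positions to its label and skips positions filled with "-"; it also splits item 5 on ";" when a form has several parsings, as in the sample lines. The names POSITIONS, decode_tag and decode_parsings are hypothetical, and characters not listed above are reported as unknown rather than guessed.

```python
from typing import Dict, List

# (label, {character: meaning}) for positions 1 to 9, copied from the tables above.
POSITIONS = [
    ("part of speech", {"n": "noun", "v": "verb", "a": "adjective", "d": "adverb",
                        "l": "article", "g": "particle", "c": "conjunction",
                        "r": "preposition", "p": "pronoun", "m": "numeral",
                        "i": "interjection", "u": "punctuation"}),
    ("person", {"1": "first person", "2": "second person", "3": "third person"}),
    ("number", {"s": "singular", "p": "plural", "d": "dual"}),
    ("tense", {"p": "present", "i": "imperfect", "r": "perfect", "l": "pluperfect",
               "t": "future perfect", "f": "future", "a": "aorist"}),
    ("mood", {"i": "indicative", "s": "subjunctive", "o": "optative",
              "n": "infinitive", "m": "imperative", "p": "participle"}),
    ("voice", {"a": "active", "p": "passive", "m": "middle", "e": "medio-passive"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "v": "vocative", "l": "locative"}),
    ("degree", {"c": "comparative", "s": "superlative"}),
]


def decode_tag(tag: str) -> Dict[str, str]:
    """Expand one nine-character tag into a {label: meaning} dictionary."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, found {len(tag)}")
    decoded = {}
    for (label, table), char in zip(POSITIONS, tag):
        if char != "-":  # "-" means the category does not apply
            decoded[label] = table.get(char, f"unknown ({char})")
    return decoded


def decode_parsings(field: str) -> List[Dict[str, str]]:
    """Decode item 5 of a line, where alternative parsings are joined by ';'."""
    return [decode_tag(tag) for tag in field.split(";") if tag]


print(decode_tag("v3saim---"))
for parsing in decode_parsings("n-d---ma-;n-d---mn-;n-d---mv-"):
    print(parsing)
```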



Please send your comments and ideas to Daniel

The contents of this site are CC by Daniel Riaño Rufilanchas
