I haven’t written here in some time. It has been a busy year. But a recent event made me feel it was an excellent time to write a little about the architecture I created behind the Dicionário da Língua Portuguesa [Portuguese Language Dictionary] from the Academia das Ciências de Lisboa [Lisbon Academy of Sciences].

The story of this project is long, but to keep it simple: a dictionary published in 2001 was re-engineered, from its PDF, into an XML format. The process was driven by the typography of the document, together with some rules. There were challenges, and there were errors, but the dictionary entries were converted to XML documents, one for each word, and stored in accordance with (a slightly modified version of) the Text Encoding Initiative standard, namely Chapter 9, which describes how dictionaries should be encoded.
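To give an idea of what such a document looks like, here is a minimal, invented entry using element names from TEI P5 Chapter 9 (`entry`, `form`, `orth`, `gramGrp`, `sense`, `def`). The word and definition are illustrative only, and the project's real markup is, as noted, slightly modified:

```xml
<entry xml:id="casa">
  <form type="lemma">
    <orth>casa</orth>
  </form>
  <gramGrp>
    <pos>nome</pos>
    <gen>feminino</gen>
  </gramGrp>
  <sense n="1">
    <def>edifício destinado a habitação</def>
  </sense>
</entry>
```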

These 69K documents were then added to an XML-aware database, eXist-DB. To support the development of the dictionary, a full-fledged web application, LeXmart, was developed on top of eXist-DB, using XQuery, XPath, XML, JavaScript, CSS and HTML. You can learn a little more about LeXmart on its website.

One of the first interesting developments in this project was how backups are stored. eXist allows scheduled jobs to run, so a scheduled backup was created. But this just exports the whole database, either as a ZIP file or as a tree of the files inside the database. Per se, this is not a backup and, especially, not an incremental backup. To handle that, a standard Linux cron job runs a few hours later, gets the backup log file and, if the backup is not corrupt, adds those files as a new version in a Git repository, taking care of adding new files and removing deleted ones. This is a clean way to store changes without handling the versioning myself. After this process, the export is removed; only the copy in the Git repository is kept.
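The actual job is a cron entry, but the logic can be sketched roughly as below. The paths, the log-file name, and the sanity check are all assumptions for illustration; only the overall flow (check the export, mirror it into a Git working tree, commit, delete the export) follows the description above:

```python
import shutil
import subprocess
from pathlib import Path

BACKUP_DIR = Path("/var/backups/existdb/latest")  # where eXist writes its export (assumption)
REPO_DIR = Path("/var/backups/dictionary-git")    # Git repository holding the history (assumption)

def backup_is_sane(backup_dir: Path) -> bool:
    """Rough sanity check: the export log exists and reports no errors (hypothetical log name)."""
    log = backup_dir / "backup.log"
    return log.exists() and "error" not in log.read_text().lower()

def commit_backup() -> None:
    if not backup_is_sane(BACKUP_DIR):
        return  # a corrupt export is skipped; the previous Git version stays current
    # Mirror the export into the working tree; --delete drops files removed from the database.
    subprocess.run(["rsync", "-a", "--delete", f"{BACKUP_DIR}/", f"{REPO_DIR}/"], check=True)
    # 'git add -A' stages new, modified and deleted files in one go.
    subprocess.run(["git", "-C", str(REPO_DIR), "add", "-A"], check=True)
    subprocess.run(["git", "-C", str(REPO_DIR), "commit", "-m", "nightly dictionary backup"], check=True)
    # The raw export is no longer needed; only the Git history is kept.
    shutil.rmtree(BACKUP_DIR)
```

Letting Git compute the deltas means the script never has to diff 69K files itself.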

Another issue was how to make the dictionary available to the end user. One of the first requirements was for it to be fast; another was not to interfere with the dictionary development. To help with that, I decided to create a database using ArangoDB. This is a NoSQL database, written in C++, with support for document storage as well as graphs. Unfortunately, I am not using graphs yet, but I will get there someday. This database is quite similar to MongoDB but seems more stable. Also, the interface with this database is performed in Perl, using my own driver implementation, cleverly named Arango::Tango.

The synchronization process between eXist-DB and ArangoDB is, thus, a Perl script that runs every two hours from another cron job. It is an incremental synchronization tool that only copies (and deletes) entries changed in the last 8 hours (of course, two hours would be enough, but as a job might fail, the larger window ensures no entry is lost). The synchronization process also looks into the XML and extracts some information to create indexed fields in the database, making searches faster.
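The two core ideas here, the safety window and the field extraction, can be sketched as follows. The real tool is a Perl script; this Python sketch assumes hypothetical element names (`orth`, `pos`) for the extracted fields:

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta

# Wider than the 2-hour schedule, so a single failed run loses nothing.
WINDOW = timedelta(hours=8)

def needs_sync(last_modified: datetime, now: datetime) -> bool:
    """An entry is copied (or deleted) if it changed inside the safety window."""
    return now - last_modified <= WINDOW

def index_fields(entry_xml: str) -> dict:
    """Pull a few searchable fields out of a TEI-like entry (element names are assumptions)."""
    root = ET.fromstring(entry_xml)
    return {
        "lemma": root.findtext(".//orth"),
        "pos": root.findtext(".//pos"),
    }
```

Storing these extracted fields alongside the raw XML is what lets ArangoDB index them and answer searches without parsing XML at query time.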

The front end is a WordPress site. Not that I like PHP, but, as you may notice, this same site runs on WordPress too. WordPress allows non-technical people to edit the dictionary site freely, and only the search engine is managed by me. For that, specific PHP code was written to perform a REST call to a small Perl API, written in Dancer2. The API performs the search, chews some of the XML when needed, and returns it. This XML is incorporated directly into the website, and CSS makes it pretty.
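The WordPress side of that interaction is just an HTTP request that embeds whatever XML comes back. A sketch of the client side, with the endpoint path, port, and query parameter all being assumptions rather than the project's real API:

```python
import urllib.parse
import urllib.request

API_BASE = "http://localhost:5000"  # where the Dancer2 API would listen (assumption)

def build_search_url(term: str) -> str:
    """Build the REST search URL; urlencode handles accented Portuguese characters."""
    return f"{API_BASE}/search?" + urllib.parse.urlencode({"q": term})

def search(term: str) -> str:
    """Call the search API and return the XML fragment to embed in the page."""
    with urllib.request.urlopen(build_search_url(term)) as resp:
        return resp.read().decode("utf-8")
```

Returning ready-to-embed XML keeps the PHP layer trivial: it only forwards the query and pastes the response into the page, leaving presentation to CSS.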

The API itself uses jSpell, which includes a Perl module to perform morphological analysis. This allows the user to type an inflected form and get as the result the infinitive form (for verbs) or the masculine singular form (for nouns and adjectives).
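To make the behaviour concrete, here is a toy stand-in for what that analysis delivers to the search: a mapping from an inflected form to the headword the dictionary actually stores. The table is a tiny hand-written illustration, not jSpell's real output or API:

```python
# Hand-written illustration of inflected form -> headword (not real jSpell data).
LEMMAS = {
    "cantei": "cantar",    # verb: first person singular past -> infinitive
    "cantamos": "cantar",
    "gatas": "gato",       # noun: feminine plural -> masculine singular
    "bonitas": "bonito",   # adjective: feminine plural -> masculine singular
}

def headword_of(form: str) -> str:
    """Return the dictionary headword for an inflected form, falling back to the form itself."""
    return LEMMAS.get(form.lower(), form.lower())
```

So a search for "cantei" lands on the entry for "cantar", even though the inflected form has no entry of its own.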

While this architecture does not have anything particularly new, I think the tools used are quite interesting and unusual, which is why I am sharing them with you. I hope you enjoy it.
