## Experimenting with eXist-DB on Docker

I’ve been using eXist-DB for some time for the Dicionário da Academia de Ciências de Lisboa project, which is being revived from PDF into TEI so that a new digital version can be released soon.

Recently I needed to update the server where eXist-DB was running, and decided to use a dockerized version of it. Although that may make things a little slower (I am not really sure), it makes the setup easier to replicate: I can now easily have the dictionary database running on my laptop or on the server, using the same code.

I am using the latest default version of the eXist-DB Docker image. The only difference is that, because my XQuery code uses FunctX functions, I needed to import that module. Thus, my Dockerfile consists of:

```dockerfile
FROM existdb/existdb:latest

ADD http://exist-db.org/exist/apps/public-repo/public/functx-1.0.1.xar /exist/autodeploy
```

I keep the data and the application in a Git repository, as exported by the eXist-DB backup tool. Thus, I decided to create a simple script to import the data, instead of building a Docker image that already contains it. My docker-compose.yml file consists of:

```yaml
version: '3.3'
services:
  exist:
    build: ./dacl/docker
    container_name: exist
    ports:
      - 8080:8080
      - 8443:8443
    volumes:
      - ./data:/exist/data
      - ./config:/exist/config
      - ./dacl:/import
      - ./outdir:/export
```


The relevant parts:

- `build` points to the folder containing the Dockerfile (`dacl/docker`).
- Ports 8080 and 8443 are used by eXist-DB; I am just forwarding them to the host.
- There are four volumes: `data` stores the binary database data, `config` stores the configuration files, and `import` and `export` are used to import data, and to export data for backup.
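With the Dockerfile and the compose file above, bringing the database up is a single command. A minimal sketch, assuming the directory layout shown above:

```shell
# Build the image (auto-deploying the FunctX module) and start eXist-DB
docker-compose up -d --build

# The eXist-DB web interface should then answer on the forwarded port:
#   http://localhost:8080/exist/
```
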

To import all the data into the database I use a shell script: the first command restores the collections from the backup, and the last two run auxiliary XQuery scripts, one to re-index the application data and the other to create the proper groups and users and assign passwords.

First, the import.sh script is:

```shell
#!/usr/bin/env bash

# Restore the collections from the backup mounted in the import volume
docker-compose exec -T exist java org.exist.start.Main backup -u admin -r /import/db/schemas/__contents__.xml

# Run the auxiliary XQuery scripts: re-index the application data, set up users
docker-compose exec -T exist java org.exist.start.Main client -u admin -F /import/xq/repair.xq
docker-compose exec -T exist java org.exist.start.Main client -u admin -F /import/xq/users.xq
```

Note that the XQuery scripts live in the same folder that is mounted as the import volume; otherwise they would not be accessible from inside the container.

The repair XQuery script holds this code:

```xquery
import module namespace repair="http://exist-db.org/xquery/repo/repair"
    at "resource:org/exist/xquery/modules/expathrepo/repair.xql";

repair:clean-all(),
repair:repair()
```

And finally, the users XQuery script has the following code:

```xquery
sm:passwd('admin', 'admin-password'),
sm:create-group('dacl'),
sm:chmod(xs:anyURI('/db/apps/academia'), 'rwxrwx---')
```

Also, in case it is useful, this is my backup.sh script:

```shell
#!/usr/bin/env bash

# Dump the /db/academia collection into the export volume
docker-compose exec -T exist java org.exist.start.Main backup -u admin -p admin.entrada -b /db/academia -d /export

# Sync the exported data back into the Git working tree
rsync -aASPvz --delete-after outdir/db/ dacl/db/

DATE=$(date +%Y%m%d)
cd dacl && git commit -a -m "Backup $DATE" && git push origin v5
```

Of course this is not rocket science, and this approach might have a lot of problems but, on the other hand, it might come in handy to someone.

## Scopus, and other messy services

For a long time I have been against the creation of journal and conference indexes that try to stamp the contributions published or presented at those venues as good or bad. While I agree that some conferences accept almost every document they receive, that does not mean a distracted researcher could not end up publishing some great work there. This is the main problem of statistics altogether: they take the whole for the part.

Other examples of the problems with this kind of journal ranking can be given. Depending on the area you do your research in, there are very different numbers of venues in which to publish your work. Thus, some researchers will have a simple task while others will have an impossible one.

While some topics can be published in a wide range of journals, other subjects are so specialized that there aren’t many reference venues in which to publish them. Of course the researcher can publish the work in a less specific journal; some broad-scope journals have a good ranking. But then, is it better to publish your work somewhere the typical researcher will not search for work on that specific topic, or in a less reputable conference where the whole community publishes?

If this all seems too abstract, I can give an example. There aren’t many top-ranking journals in which to publish on Natural Language Processing (NLP). Even for conferences, only a few have a good ranking. But if you look at any NLP paper, I can be almost sure you will find at least one reference to a paper published at the Language Resources and Evaluation Conference (LREC). This one is not top ranked. Also, it has quite a small rejection rate. But all the main researchers of the area publish there. It is a huge conference, with more than 700 papers accepted each edition.

But even if you accept that journals and conferences should be rated, and that a researcher’s work should be evaluated by such rankings, then you must look at what these rating organizations do. Let me give you a small example: I found that a journal I co-edit is present on SCImago (here). The problem is that, although the name and ISSN are correct, the subject classification is somewhat off and, worse, all the information on the publisher is wrong. SCImago claims that this information comes from Scopus, and that they just process it. But after contacting Scopus a couple of times, they weren’t able to fix the data; they redirected me to SCImago. When contacting SCImago, they routed the conversation back to Scopus. To show Scopus that this is their problem, I asked SCImago to reply to my mail with Scopus among the recipients. They just replied to me with the same copy-and-paste text as before.

If these services aren’t able to fix their data, even when asked to, can we trust all the other data they claim to have?

## Edx + Microsoft Course on R

I have participated in some Coursera courses, on very interesting subjects. For the first time I decided to enroll in a course on edX, and I ended up in a Microsoft Data Camp course on R. The course is very basic, focused more on syntax than on statistics. Good enough. It is based on 5-to-6-minute videos, followed by some exercises done online. Although too basic, the course was interesting enough for me, as I did not know anything about R.

But what really annoyed me was the final evaluation. First, the instructions claim we have 4 minutes per question… but the timer starts at 3 minutes. Well… Microsoft… Then, instead of questions on using R for interesting tasks (take this data, transform it, then plot it), the exam is a set of puzzle questions, with placeholders we need to fill in to obtain some result. Yes, because when using R we will be faced with puzzles. And worse, there is no pause button. I understand wanting to limit the time to answer each question, but why not allow pausing between questions? Especially when you are at home and someone rings your doorbell. Well, after 20 minutes or so at the door, I returned to the exam, and still got 75%.

And, OK, I did not pay for a verified certificate. But you could at least provide a digital certificate we could print and show. OK, I’ll get back to Coursera. See you, edX!

## Digital Object Identifier

If you are an academic, you know what a DOI is, and you know that a lot of websites now require your publications to have one so you can refer to them (for example, Publons). DOIs are managed by the IDF, the International DOI Foundation, a not-for-profit organization.

On the other hand, I am a co-editor of Linguamática, a free and open-access journal. It has existed for eight years and never got any funding. Editors and reviewers are not paid. Publication is free, and so is access to the contents.

For some time I have wanted to add DOIs to Linguamática. But all the DOI registration agencies I consulted have paid membership only, plus fees for each DOI created. This is not possible for Linguamática, unless we charge fees to the authors or to the readers. And for us it is better to stay without a DOI than to change our policies.

Yesterday I tried to contact the IDF directly, asking if there was any way to get a DOI:

Dear Sirs,

I am one of the editors of Linguamática (http://linguamatica.com). This journal is in its eighth year of existence, without any fee for authors or readers. We do everything for the evolution of the natural language processing area for free.

As you know, a lot of services require DOI identifiers. Unfortunately we do not have any means to pay for the DOI services of one of the registration agencies. Does the DOI/IDF have any service for this kind of initiative?

Thank you

Unfortunately this was the answer I got:

Hello,

Unfortunately, no. In order to acquire a DOI you must work with an existing RA, and it is up to each RA to establish a business plan and set pricing. Crossref offers low pricing for your type of journal, but I doubt they assign DOIs at no cost at all.

And as you might guess, RAs (registration agencies) are not not-for-profit organizations. As an example, Crossref has an annual fee of 275 dollars, plus an extra dollar for each deposited article (Linguamática publishes 10 to 20 articles, at most, each year). So we would need around 300 dollars each year to have DOIs. Unfortunately that is not possible.

I wonder if other open-access journals have similar problems, and how they solved them. Or if someone else thinks an Open-DOI is needed, supported by volunteers and by minimal monetary contributions for server and domain expenses…

## Open-Access Journals

It seems that Harvard fears Open-Access journals, and decided to run a bogus research project to show that Open-Access journals accept bad/wrong papers. They submitted a fake paper to a number of Open-Access journals and found that a high percentage accepted it.

What are people claiming now? That Open-Access journals do not do peer review, or do it poorly, and that they are more interested in publishing anything than in assessing the quality of the documents being published.

The fact is that this study is flawed from the beginning. What does a percentage say? If I say that 80% of women cheat on their husbands, would that say they are worse than men? Probably the men’s percentage is higher, but if it wasn’t computed, there is no way to compare.

So, what does it mean that a lot of Open-Access journals accept unacceptable papers? Just that. It doesn’t mean regular journals are better, or that conferences are better. You might know that there is a whole industry behind conferences (I organized a bunch of them; the highest fee so far was 120 euro, and I offered lunches and a dinner… others I attend cost more than 500 euro and offer nothing at all!). The same happens with regular journals, and with Open-Access journals, of course.

But please, do valid research. Take the article and submit it to the same number of Open-Access journals, standard journals, and conferences. Then compare the results. That is research! Computing a single percentage means nothing.

Please do not smear Open-Access journals. They are the way to go for public research!

## Map-Reduce, or why I hate software patents.

Recently you have probably been hearing a lot about map-reduce. I first heard the term at last year’s Codebits. Although I wasn’t there, there was a talk with that title. I confess that, knowing that map and reduce are common functional operators in different programming languages, I did not look at the talk abstract. During this year’s Yet Another Perl Workshop Europe, in Pisa, I saw a book on Hadoop, asked a friend who wanted to buy it what it was about, and he said: a framework to implement Map-Reduce.

That made me think… wait… this must be the name of something different from what I thought it was. Looking deeper, I understood the concept. Googling, I found that Google filed the patent request in 2004, and was granted the patent in 2010. I also found that I had used that construct in 2007, and documented it in my PhD thesis in 2008. Of course I did not call it Map-Reduce. In fact, I did not call it anything fancy. It was just a way to get results. I named it my “divide and conquer approach”. And I had not heard of Google’s approach either. I just arrived at it because I needed some results.
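For what it’s worth, the pattern itself is simple: a map step emits key/value pairs, a shuffle groups equal keys together, and a reduce step aggregates each group. A hypothetical sketch (not the code from my thesis) of the classic word-count example, using plain Unix pipes:

```shell
#!/usr/bin/env bash
# Map-reduce sketch with Unix pipes (illustration only):
#   map:     tr splits the input into one word (key) per line
#   shuffle: sort brings identical keys together
#   reduce:  uniq -c counts each group of equal keys
echo "to be or not to be" | tr ' ' '\n' | sort | uniq -c
```

Hadoop does essentially the same thing, only distributed across many machines, with the shuffle and fault tolerance handled by the framework.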

So, this is yet another reason why I hate software patents.

## TEI – Well Done!

I will not detail anything about TEI here. Sorry. I would just like to let you know that every time I need to work with a TEI subset, I find myself amazed by the quality of its documentation and the details they thought of before writing the standard.

Sometimes I catch myself thinking… do I really need all this stuff? The common answer is no, I do not need so much detail in my annotations.

But that doesn’t mean I should not use TEI. I should probably look at the section covering the items I am trying to annotate and meditate on it. Probably I will not need all the different tags and details that TEI defines. But I am almost sure I will find one or two I had not thought about. Then I can use the portion of TEI I really want and forget about the rest. My document will probably not validate against TEI, but it will probably not be too far off. And if someone else looks at the document, she will probably understand it. And if she doesn’t, I can always point to the TEI documentation and say: I am not using it all, just the subset I thought relevant.

Where am I using TEI? You can see it being used in the Dicionário-Aberto project, where the dictionary is encoded in a TEI subset. I am also looking at the TEI header and filtering it, making it an option for annotating documents in a parallel corpora project.

## DBLP Bibliography Database and Scientific Publications in Portugal

In Portugal, universities rate researchers according to whether their publications are indexed in Internet article databases like DBLP or ISI Web of Knowledge. Basically, if your article is not indexed anywhere, it is class C. If it is in DBLP, it is class B. Finally, if it is present in ISI Web of Knowledge, it is classified as class A.

That is, if you can persuade the DBLP author to publish the information about a conference or a journal, you can get your article rated B. Then, if a commercial company (that is, ISI Web of Knowledge) includes your article, you get a class A article.

I wonder how a single guy (Michael Ley is doing a great job; that is not the problem) can find out whether a journal is good or not across all areas. I do not know what Michael’s own research is about, but I do not believe he can discern which conferences or journals are good for Parallel Computation, Natural Language Processing, Bio-Informatics, Artificial Intelligence, etc., etc.

Also, I wonder why there is a journal in DBLP with a single published issue listed, and without all its articles. Yes, there is a journal with more than thirty issues. Only one is in DBLP. And that one is not complete: just half the articles are listed.

Yes, I tried a couple of times (in fact, more than four times) to send the full information about that journal, and offered to add the BibTeX entries for all the journal issues myself. I never got an answer.

The same happened when I sent (twice) the index of a journal on Natural Language Processing for the Iberian Languages. No answer at all. Is it because it is a bad journal? Probably. But I do not think my mails were read at all.

I can make similar comments about ISI Web of Knowledge. Why is a company maintaining this index? Why is this index paid for? If a journal or conference pays for its inclusion, do you think the company will reply that it does not have enough quality to be listed?

More questions can be asked. Check the number of conferences or journals on computer architecture. Then check the number of conferences or journals in Natural Language Processing. Then check the number of indexed conferences or journals in each of these areas. Yes, it is easier to be a GOOD researcher in computer architecture than in Natural Language Processing. Go figure why…