Yesterday I was trying to automate a process using a Makefile. At first, I was mostly using Bash inside the Makefile to iterate over a lot of files and perform operations. Unfortunately, that approach does not let make run multiple jobs in parallel and take advantage of the multiple processors available in the machine.

I decided to try to convert my Bash loops into Makefile rules, using patterns, and decided not to give up my original project structure. I needed to find a way to have the Makefile do what I wanted without changing anything. I found a way. I learned a lot. And I decided to share my findings. Stay with me.

Update 0: A few minutes after posting I understood I could simplify the template. Search below for this update.

Update 1: Thanks to my friend, José João, there are some simplifications below. Search for this update to see them.

The problem at hand

I am doing OCR (optical character recognition) on a dictionary, using Tesseract. I have a folder for each letter in the dictionary. Inside, I have an image file for each column of the dictionary. Just like the left side of the image below.

On the right, you have the resulting files of my process. First, each image will be sent through Tesseract, creating a file with the same name (different extension) with the OCRed text. Then, each group of text files inside a letter folder will be concatenated together, and processed through a filter, that does some cleanup on the files. The result of this process is the text files in the root folder whose name is just a letter. Finally, all these text files are concatenated together in the all.txt text file.

The main challenge (or at least my challenge, as I might be missing some completely obvious solution) was to write rules that process files inside the folders without repeating the rules, one for each letter. Also, the way Makefile patterns work (the % character) does not make it easy to match against a folder name.

The next sections describe, step by step, each rule I ended up with.

Running Tesseract on each page

Makefiles support patterns. I was expecting to have something like:

Makefile
%.txt: %.png
	tesseract -l por $< $<
Makefile

Here there are two main issues:

  • The variable $< will hold the input filename, like A/001.png. Thus, if we use it as the second argument for Tesseract, that same name will be used as the prefix for the output filename. The produced file will be named A/001.png.txt
  • Unfortunately, as far as I could tell, the “percent” % pattern cannot be used for files inside folders, just for files in the same folder as the makefile (or, at least, not by default).

For the first issue, we can use one of make's text functions, named subst. It receives three arguments: the text to be replaced, the replacement text, and the text in which the replacement will take place. Note that you can’t add spaces after the commas, or they will be interpreted as part of the arguments. The first iteration for the code above was:

Makefile
%.txt: %.png
	tesseract -l por $< $(subst .png,,$<)
Makefile
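To check what such a text function returns without running Tesseract, a throwaway makefile can print the result with $(info). This is only a sketch: the value A/001.png is the example file name from above, and the show target is a hypothetical name added so make has something to do. It also prints $(basename ...), another built-in function that strips the suffix.

```makefile
# Hypothetical sample value; in the real rule this comes from $<.
input := A/001.png

# subst replaces every occurrence of ".png" in the text, wherever it appears.
$(info subst:    $(subst .png,,$(input)))
# basename removes only the final suffix of each word.
$(info basename: $(basename $(input)))

# Dummy target so that make has something to do.
.PHONY: show
show: ;
```

Both lines should print A/001, which is the output prefix Tesseract needs.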

Update 1: Makefiles have a lot of special (automatic) variables, like $<, which is the first dependency, or $@, which is the target. Just like that, $* is the stem, the part that matched the percent sign, which is exactly what I want in the second line. Thus, we can write simply:

Makefile
%.txt: %.png
	tesseract -l por $< $*
Makefile
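A quick way to get a feel for these variables is to echo them from a rule. The sketch below uses made-up names (demo, extra.png) and declares everything phony, so no real files or Tesseract runs are needed. It also prints $^, the list of all dependencies, next to $<.

```makefile
# Hypothetical names, declared phony so no real files are needed.
.PHONY: demo A/001.txt A/001.png extra.png

demo: A/001.txt

# The % matches "A/001", so that is the stem stored in $*.
A/001.txt: %.txt: %.png extra.png
	@echo 'target              $@'
	@echo 'first prerequisite  $<'
	@echo 'all prerequisites   $^'
	@echo 'stem                $*'

# Empty recipes for the fake prerequisites.
A/001.png extra.png: ;
```

Running make on this file should show A/001.txt as the target, A/001.png as the first prerequisite, both png names as the full list, and A/001 as the stem.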

For the second issue, we need a little more work. While by default the percent does not match folder names, we can supply a list of the files where the match will occur. And, in this case, the percent sign will match the filename string, even if it contains path separators.

To make this work we need a variable with the list of text files that will be generated. Fortunately, we can compute that list from the list of png files. This is done using two variables, using the wildcard and patsubst functions:

Makefile
columns=$(wildcard ?/*.png)
text_columns=$(patsubst %.png,%.txt,$(columns))

$(text_columns) : %.txt : %.png
	tesseract -l por $< $(subst .png,,$<)
Makefile

The first line creates the list of png files inside each dictionary letter folder. The wildcard function is straightforward: it expands the wildcard into a list of files. The contents of this list will be something like A/001.png A/002.png.

The second line computes the list of text files based on the list of png files. We replace the file extensions using the patsubst function. Again, three arguments: the pattern to search for, the replacement pattern, and the list to which the replacement is applied. Note that we use a pattern, which guarantees we are replacing only the file extensions. The resulting list will contain something like A/001.txt A/002.txt.
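The list computation can be tried in isolation. The sketch below replaces the wildcard call with a fixed, hypothetical list so it runs in an empty folder, and also shows the substitution reference syntax, which GNU make documents as a shorthand for this kind of patsubst call. The text_columns_short variable and the show target are names invented for the example.

```makefile
# Hypothetical fixed list standing in for $(wildcard ?/*.png).
columns := A/001.png A/002.png B/001.png

text_columns := $(patsubst %.png,%.txt,$(columns))
# Substitution reference: shorthand for the patsubst call above.
text_columns_short := $(columns:.png=.txt)

$(info $(text_columns))
$(info $(text_columns_short))

# Dummy target so that make has something to do.
.PHONY: show
show: ;
```

Both $(info) lines should print the same list: A/001.txt A/002.txt B/001.txt.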

With this list, we can define the rule presented in line 4: the same as we had before, but with an extra first part, which is what GNU make calls a static pattern rule. This is, basically, the list of targets to which the rule applies. As this list is precomputed and supplied, the percent sign can match the whole string, and make understands it should apply the rule to any of those files, even if it is inside a folder.

Concatenating and processing each letter

For this step, we want a rule like the following one (that doesn’t work):

Makefile
A.txt: A/*.txt
	cat A/*.txt > A.txt
	tools/pre-processor.pl A.txt
	mv A.txt.out A.txt
Makefile

The rule content is simple: concatenate all files, call a custom preprocessor tool that uses the input filename to generate an output file with the .out extension, and finally rename that file to the desired output name.

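And why doesn’t this work? Because I do not want to repeat the rule for each letter, and because a wildcard in the dependencies (that A/*.txt) only expands to files that already exist, so on a fresh run it would not list the text files that Tesseract has yet to produce.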

The first attempt was to use an approach similar to the one described above, but it wasn’t possible. The problem of expanding the wildcard was driving me mad. The result is a Higher-Order Rule: a rule that is generated on the fly for each letter.

One step at a time. We need the list of the letter text file outputs. For that, we will use the name of the folders, like this:

Makefile
letters=$(wildcard ?)
letter_texts=$(addsuffix .txt,$(letters))
Makefile

Note that with line 1 we need to make sure there are no other files or folders with a one-character name, as they would be matched too. The second line just adds the .txt suffix to each element of the list of folders using the addsuffix function. So, the content of this list is something like A.txt B.txt C.txt.
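If stray one-character files are a concern, one possible alternative (not what I used) is to derive the letters from the png list instead of from a second wildcard. The sketch below assumes the same folder layout; it only finds folders that contain at least one png file, and the show target is a hypothetical name.

```makefile
# Assumes the same layout as above: one-letter folders holding png files.
columns := $(wildcard ?/*.png)

# $(dir ...) keeps the folder part ("A/"), patsubst drops the trailing slash,
# and sort removes the duplicates.
letters := $(sort $(patsubst %/,%,$(dir $(columns))))
letter_texts := $(addsuffix .txt,$(letters))

$(info letters:      $(letters))
$(info letter_texts: $(letter_texts))

# Dummy target so that make has something to do.
.PHONY: show
show: ;
```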

Now, we create a template for our dynamic rules. Hold your breath.

Makefile
define build_letter_text_rule
$(1).txt : $(filter $(1)/%,$(text_columns))
	cat $(filter $(1)/%,$(text_columns)) > $(1).txt
	tools/pre-processor.pl $(1).txt
	mv $(1).txt.out $(1).txt
endef
Makefile

Line 1 defines the name of the rule template. The define directive is just a different way to define a standard variable, but it is easier to use when you need multiline content.

The variable $(1) is the first parameter passed to the template whenever we instantiate it. In our case, it is the letter of the folder. So, the target of the rule is $(1).txt, that is, something like A.txt or B.txt.

The dependencies for the rule use the filter function. It selects all elements from a list that match a pattern. The list is the one we computed earlier, with the names of all the text files, one per page: A/001.txt A/002.txt B/003.txt. The pattern is the name of the folder: $(1)/%. So, for the letter A, the pattern is A/%, and it filters from the list only the files in that folder: A/001.txt A/002.txt. Cool, huh?
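The filter step can be checked on its own with the example list from the paragraph above. In this sketch, files_for is a hypothetical helper name (it is not part of my makefile) that wraps the filter call so it can be reused with call.

```makefile
# Sample list taken from the example above.
text_columns := A/001.txt A/002.txt B/003.txt

# Hypothetical helper: $(1) is the letter, as in the template.
files_for = $(filter $(1)/%,$(text_columns))

$(info A: $(call files_for,A))
$(info B: $(call files_for,B))

# Dummy target so that make has something to do.
.PHONY: show
show: ;
```

The first line should list A/001.txt A/002.txt and the second only B/003.txt. A helper like this is also one way to avoid writing the filter expression twice in the template.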

The remaining lines are now simple to understand. We use $(1) whenever we want to refer to the letter, and we use the filter line to get the list of text files to be concatenated. I know I can define variables there… being lazy.

With the template ready, we can now create a dynamic rule for each letter:

Makefile
$(foreach letter,$(letters),$(eval $(call build_letter_text_rule,$(letter))))
Makefile

If you know Lisp, this probably looks familiar. Not easy to read. The foreach function iterates through a list (the second argument; in this case, the list of letters). The first argument is the name of the variable that will hold the value of each item in the list. The third argument is what to expand on each iteration. In this case, the evaluation of a Makefile rule (the string stored in that template). For that, we use the eval function. It just receives the code to be evaluated, which is the string returned by calling our template with the relevant letter as argument. Magically, a set of 26 rules is created on the fly. Lovely.
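When a generated rule misbehaves, a common debugging trick is to swap eval for info, so make prints the text it would have evaluated. The sketch below uses hypothetical sample lists and a shortened copy of the template so it runs in an empty folder.

```makefile
# Hypothetical sample data standing in for the computed lists.
letters := A B
text_columns := A/001.txt A/002.txt B/003.txt

# Shortened copy of the template, enough to see the expansion.
define build_letter_text_rule
$(1).txt : $(filter $(1)/%,$(text_columns))
	cat $(filter $(1)/%,$(text_columns)) > $(1).txt
endef

# Print the generated text instead of evaluating it.
$(foreach letter,$(letters),$(info $(call build_letter_text_rule,$(letter))))

# Dummy target so that make has something to do.
.PHONY: show
show: ;
```

This should print one rule per letter, for example an A.txt rule that depends on A/001.txt A/002.txt and a B.txt rule that depends on B/003.txt.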

Update 0: We can use the special makefile variables in the template as well, but we need to protect them by doubling the dollar sign, as they should be expanded only when the recipe runs, not when the template is instantiated. Note that the cat line needs $^ (all the dependencies), as $< holds only the first one:

Makefile
define build_letter_text_rule
$(1).txt : $(filter $(1)/%,$(text_columns))
	cat $$^ > $$@
	tools/pre-processor.pl $$@
	mv $$@.out $$@
endef
Makefile

Concatenating all letters

The final step is to concatenate all the text files. This is extremely simple as we already have the list of files stored in a variable:

Makefile
all.txt: $(letter_texts)
	cat $(letter_texts) > all.txt
Makefile
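For reference, here is a sketch of how the pieces above might be assembled into one file. It is not a verbatim copy of my makefile: placing all.txt first so it becomes the default goal is an assumption, and the recipes use the automatic variables ($^ for all dependencies, $@ for the target). It also assumes the letter folders and png files already exist.

```makefile
# Lists computed from the folder layout (one-letter folders with png files).
columns      = $(wildcard ?/*.png)
text_columns = $(patsubst %.png,%.txt,$(columns))
letters      = $(wildcard ?)
letter_texts = $(addsuffix .txt,$(letters))

# First rule in the file, so it is the default goal (an assumption).
all.txt: $(letter_texts)
	cat $^ > $@

# One OCR job per column image; $* is the stem, for example A/001.
$(text_columns) : %.txt : %.png
	tesseract -l por $< $*

# Template for the per-letter rules; $$ delays expansion until the recipe runs.
define build_letter_text_rule
$(1).txt : $(filter $(1)/%,$(text_columns))
	cat $$^ > $$@
	tools/pre-processor.pl $$@
	mv $$@.out $$@
endef

# Instantiate the template once per letter folder.
$(foreach letter,$(letters),$(eval $(call build_letter_text_rule,$(letter))))
```

With a file like this, make -j 4 lets the independent Tesseract jobs run in parallel.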

Conclusions

The code is not easy to read. But the makefile is quite small (about 20 lines). Fewer lines than the number of letters in the dictionary. Perfect.

Regarding time, the process that sequentially took 21 minutes is now taking 12 minutes using 4 jobs (make -j 4). Wonderful.
