Yesterday I was trying to automate a process using a Makefile. At first, I was mostly using Bash inside the Makefile to iterate over a lot of files and perform operations. Unfortunately, that approach does not let make run multiple jobs in parallel and take advantage of the multiple processors available in the machine.
I decided to try to convert my Bash loops into Makefile rules, using patterns, and decided not to give up my original project structure. I needed to find a way to have the Makefile do what I wanted without changing anything. I found a way. I learned a lot. And I decided to share my findings. Stay with me.
Update 0: A few minutes after posting I understood I could simplify the template. Search below for this update.
Update 1: Thanks to my friend, José João, there are some simplifications below. Search for this update to see them.
The problem at hand
I am doing OCR (optical character recognition) on a dictionary, using Tesseract. I have a folder for each letter in the dictionary. Inside, I have an image file for each column of the dictionary, just like the left side of the image below.
On the right, you have the files that result from my process. First, each image is sent through Tesseract, creating a file with the same name (different extension) with the OCRed text. Then, the text files inside each letter folder are concatenated and processed through a filter that does some cleanup. The result of this step is the set of text files in the root folder whose name is just a letter. Finally, all these text files are concatenated into the all.txt text file.
The main challenge (or at least, my challenge, as I might be missing some completely obvious solution) was to make rules that process files inside the folders without repeating the rules, one for each letter. Also, the way the Makefile pattern works (the % character) does not make it easy to match against a folder name.
The next sections describe, step by step, each rule I ended up with.
Running Tesseract on each page
Makefiles support patterns. I was expecting to have something like:
%.txt: %.png
	tesseract -l por $< $<
Here there are two main issues:
- The variable $< will hold the input filename, like A/001.png. Thus, if we use it as the second argument for Tesseract, that same name will be used as the prefix for the output filename. The produced file will be named A/001.png.txt.
- Unfortunately, the “percent” (%) pattern cannot be used for files inside folders, just for files in the same folder as the Makefile (or, at least, not by default).
For the first issue, we can use a Makefile text function named subst. It receives three arguments: the text to be replaced, the replacement text, and the text in which the replacement will take place. Note that you can’t add spaces around the arguments, or they will be interpreted as part of them. The first iteration of the code above was:
%.txt: %.png
	tesseract -l por $< $(subst .png,,$<)
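As a minimal sketch (not part of the original Makefile), the following prints what three GNU make text functions return for a column name. The variables page and odd, and the file name A/notes.png.png, are made up for illustration; the empty all target is only there so make has something to build. It suggests why a pattern-based replacement is safer than subst when the extension text could appear more than once.

```makefile
# Hypothetical file names, used only to compare the functions.
page := A/001.png
odd  := A/notes.png.png

# For a regular name, all three give the same prefix: A/001
$(info subst:    $(subst .png,,$(page)))
$(info basename: $(basename $(page)))
$(info patsubst: $(patsubst %.png,%,$(page)))

# subst removes every occurrence of ".png":   A/notes
# basename and patsubst only drop the suffix: A/notes.png
$(info subst:    $(subst .png,,$(odd)))
$(info basename: $(basename $(odd)))
$(info patsubst: $(patsubst %.png,%,$(odd)))

# Empty recipe, so that make has a goal and exits quietly.
all: ;
```

Saving this as a separate file and running make -f on it prints the six lines while the file is being read, before any rule runs.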
Update 1: Makefiles have a lot of special variables, like $<, which is the first dependency, $^, which is the full list of dependencies, or $@, which is the target. In the same way, $* is the stem, the text matched by the percent sign, which is exactly what I want as the second argument. Thus, we can simply write:
%.txt: %.png
	tesseract -l por $< $*
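To see these special variables side by side, here is a small sketch that only echoes them. The extensions .in and .out and the file extra.in are hypothetical, chosen so the example does not touch the OCR files; it assumes page.in and extra.in exist in the current folder.

```makefile
# Hypothetical rule: build NAME.out from NAME.in plus a shared extra.in.
# The recipe only prints the automatic variables; it creates no file.
%.out: %.in extra.in
	@echo 'target       (at sign):  $@'
	@echo 'first dep    (less than): $<'
	@echo 'all deps     (caret):    $^'
	@echo 'stem         (star):     $*'
```

Running make page.out should report page.out as the target, page.in as the first dependency, page.in extra.in as the full list, and page as the stem.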
For the second issue, we need a little more work. While by default the percent does not match folder names, we can supply a list of the files where the match will occur. In that case, the percent sign will match the filename string, even if it contains path separators.
To make this work we need a variable with the list of text files that will be generated. Fortunately, we can compute that list from the list of png files, using two variables and the wildcard and patsubst functions:
columns=$(wildcard ?/*.png)
text_columns=$(patsubst %.png,%.txt,$(columns))
$(text_columns) : %.txt : %.png
	tesseract -l por $< $(subst .png,,$<)
The first line creates the list of png files inside each dictionary letter folder. The wildcard function is straightforward: it expands the wildcard into a list of files. The contents of this list will be something like A/001.png A/002.png.
The second line computes the list of text files based on the list of png files. We replace the file extensions using the patsubst function. Again, three arguments: the pattern to search for, the replacement pattern, and the list where the replacement is applied. Note that we use a pattern, which guarantees we are replacing only the file extensions. The resulting list will contain something like A/001.txt A/002.txt.
With this list, we can define the rule that follows the two variables: the same as we had before, but with an extra first part, the list of targets to which this rule applies (a static pattern rule, in GNU make terms). As this list is precomputed and supplied, the percent sign can match the whole string, so the rule applies to every file in the list, even those inside a folder.
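A quick way to check both lists before running Tesseract is to print them and replace the recipe with an echo. This is only a sketch: it assumes the folder layout described above (at least one ?/*.png file), the ocr target name is hypothetical, and := is used instead of = so each list is computed once.

```makefile
columns      := $(wildcard ?/*.png)
text_columns := $(patsubst %.png,%.txt,$(columns))

# Printed while the Makefile is read, before any rule runs.
$(info columns      = $(columns))
$(info text_columns = $(text_columns))

# Hypothetical umbrella target: asks for every per-column text file.
.PHONY: ocr
ocr: $(text_columns)

# Dry run of the static pattern rule: show the command instead of running it.
$(text_columns): %.txt: %.png
	@echo "would run: tesseract -l por $< $*"
```

With the real recipe back in place, a target like ocr is also a convenient goal for make -j 4, since each column can be processed independently.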
Concatenating and processing each letter
For this step, we want a rule like the following one (that doesn’t work):
A.txt: A/*.txt
	cat A/*.txt > A.txt
	tools/pre-processor.pl A.txt
	mv A.txt.out A.txt
The rule content is simple: concatenate all the files, call a custom preprocessor tool that uses the input filename to generate an output file with the .out extension, and finally rename that file to the desired output.
And why doesn’t this work? Because I do not want to repeat the rule for each letter, and because the wildcard in the dependency list (that A/*.txt) only matches files that already exist, while these text files are still to be generated by Tesseract.
The first attempt was to use an approach similar to the one described above. But it wasn’t possible. The problem of expanding the wildcard was driving me mad. The result is a higher-order rule: a rule that is generated on the fly for each letter.
One step at a time. We need the list of the text files to produce, one per letter. For that, we will use the names of the folders, like this:
letters=$(wildcard ?)
letter_texts=$(addsuffix .txt,$(letters))
Note that, with the first line, we need to make sure there are no other files or folders with a one-character name. The second line just adds the .txt suffix to each element of the list of folders, using the addsuffix function. So, the content of this list is something like A.txt B.txt C.txt.
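If stray one-character files are a concern, one possible alternative (a sketch, not what the original Makefile does) is to derive the letters from the png list itself. It relies on the dir, patsubst, and sort functions of GNU make; sort also removes duplicates. The trade-off is that a folder without any png file is not listed.

```makefile
columns := $(wildcard ?/*.png)

# dir keeps the folder part (A/ A/ B/ ...), patsubst drops the trailing
# slash, and sort orders the names and removes the repeated ones.
letters      := $(sort $(patsubst %/,%,$(dir $(columns))))
letter_texts := $(addsuffix .txt,$(letters))

$(info letters      = $(letters))
$(info letter_texts = $(letter_texts))

# Empty recipe, so that make has a goal and exits quietly.
all: ;
```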
Now, we create a template for our dynamic rules. Hold your breath.
define build_letter_text_rule
$(1).txt : $(filter $(1)/%,$(text_columns))
	cat $(filter $(1)/%,$(text_columns)) > $(1).txt
	tools/pre-processor.pl $(1).txt
	mv $(1).txt.out $(1).txt
endef
Line 1 defines the name of the rule template. The define directive is just a different way to define a standard variable, but it is easier to use when you need multiline content.
The variable $(1) is the first parameter passed to the template whenever we instantiate it. In our case, it is the letter of the folder. So, the target of the rule is $(1).txt, that is, something like A.txt or B.txt.
The dependencies of the rule use the filter function. It selects all the elements of a list that match a pattern. The list is the one we computed earlier, with the names of all the text files, one per page: A/001.txt A/002.txt B/003.txt. The pattern is the name of the folder: $(1)/%. So, for the letter A, the pattern is A/%, and it keeps only the files in that folder: A/001.txt A/002.txt. Cool, huh?
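The selection can be tried in isolation with a hard-coded list. This sketch uses the same three example names as above; the square brackets are only there to make an empty result visible.

```makefile
# Example list taken from the text; a real run would compute it.
text_columns := A/001.txt A/002.txt B/003.txt

$(info A -> [$(filter A/%,$(text_columns))])
$(info B -> [$(filter B/%,$(text_columns))])
$(info C -> [$(filter C/%,$(text_columns))])

# Empty recipe, so that make has a goal and exits quietly.
all: ;
```

The three lines printed are A -> [A/001.txt A/002.txt], B -> [B/003.txt], and C -> [], since no element starts with C/.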
The remaining lines are now simple to understand. We use $(1) whenever we want to refer to the letter, and we repeat the filter expression to get the list of text files to be concatenated. I know I could define variables there… I am being lazy.
With the template ready, we can now create a dynamic rule for each letter:
$(foreach letter,$(letters),$(eval $(call build_letter_text_rule,$(letter))))
If you know Lisp, you can probably relate. Not easy to read. The foreach function iterates through a list (the second argument; in this case, the list of letters). The first argument is the name of the variable that will hold the value of each item in the list. The third argument is what to do on each iteration: in this case, the evaluation of a Makefile rule (the string stored in that template). For that, we use the eval function. It just receives the code to be evaluated, which is the string returned by a call to our template, passing the relevant letter as the argument. Magically, a set of 26 rules is created on the fly. Lovely.
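When generated rules misbehave, a common debugging trick is to swap eval for info, so make prints the text it would otherwise evaluate. The sketch below repeats the template with two hard-coded letters and an example file list, so it can be run in an empty folder.

```makefile
# Hard-coded stand-ins for the computed lists.
letters      := A B
text_columns := A/001.txt A/002.txt B/003.txt

define build_letter_text_rule
$(1).txt : $(filter $(1)/%,$(text_columns))
	cat $(filter $(1)/%,$(text_columns)) > $(1).txt
	tools/pre-processor.pl $(1).txt
	mv $(1).txt.out $(1).txt
endef

# info instead of eval: print each generated rule, define nothing.
$(foreach letter,$(letters),$(info $(call build_letter_text_rule,$(letter))))

# Empty recipe, so that make has a goal and exits quietly.
all: ;
```

The output is the two expanded rules, one for A.txt and one for B.txt, exactly as eval would receive them.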
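Update 0: We can use the special Makefile variables in the template as well, but we need to protect them with a double dollar sign, as they should be expanded only when the recipe runs, not when the template is instantiated. Note that we need $^ (all the dependencies) rather than $< (only the first one) to concatenate every file:
define build_letter_text_rule
$(1).txt : $(filter $(1)/%,$(text_columns))
	cat $$^ > $$@
	tools/pre-processor.pl $$@
	mv $$@.out $$@
endef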
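To watch the double dollar sign at work, here is a tiny sketch with a made-up template (demo_rule and the .demo suffix are hypothetical). The call expands $(1) immediately and turns each double dollar into a single one, which eval then stores in the recipe for make to expand when the recipe runs.

```makefile
define demo_rule
$(1).demo:
	@echo "target is $$@, letter is $(1)"
endef

# Show the text that eval receives: the letter is already filled in,
# and the target variable is left with a single dollar sign.
$(info $(call demo_rule,A))

# Actually define the rule.
$(eval $(call demo_rule,A))
```

Running make A.demo first prints the generated rule and then the message with A.demo as the target. No file is created, so the recipe runs on every invocation.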
Concatenating all letters
The final step is to concatenate all the text files. This is extremely simple as we already have the list of files stored in a variable:
all.txt: $(letter_texts)
	cat $(letter_texts) > all.txt
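One optional addition, sketched here as a fragment to append to the Makefile rather than something the original contains: make normally builds the first target it finds in the file, which, with the rules in the order shown in this post, would be a single column text file. Two GNU make special names can help, assuming a reasonably recent GNU make.

```makefile
# Build all.txt when make is called without naming a target.
.DEFAULT_GOAL := all.txt

# If a recipe fails halfway, delete its target so that a partial
# letter file is not mistaken for a finished one on the next run.
.DELETE_ON_ERROR:
```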
Conclusions
The code is not easy to read. But the makefile is quite small (about 20 lines). Less than the number of letters in the dictionary. Perfect.
Regarding time, the process that took 21 minutes sequentially now takes 12 minutes using 4 jobs (make -j 4). Wonderful.