Using make for automatic processing
Recently, I needed to extract text from a PDF file that had been produced from scanned pages. To get the page images back out, I used the command
pdfimages file.pdf IMG
This stored the original scans as PBM image files. The files were named following the pattern IMG-XXX.pbm, with XXX starting at 000 and incrementing.
In the next step I converted the raw .pbm files to TIFF, the format that the Tesseract OCR software prefers. I did this with the ImageMagick command convert, which at the same time processed the images to remove spots and other blemishes. The complete format conversion and image processing command was
convert -verbose -colorspace gray -median 3x3 -resize 300% -blur 10 input.pbm output.tiff
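Applied to every extracted scan, this command amounts to a plain sequential loop. The following is only a sketch: the function name convert_all is made up for illustration, and the leading echo turns it into a dry run that prints the commands instead of calling ImageMagick.

```shell
#!/bin/sh
# Sequential baseline: handle one scan after another.
# convert_all is a hypothetical helper name, not part of any tool.
# The leading "echo" makes this a dry run that only prints each command;
# remove it to call ImageMagick for real (assumes convert is installed).
convert_all() {
    for src in "$@"; do
        dst="${src%.pbm}.tiff"   # IMG-000.pbm -> IMG-000.tiff
        echo convert -verbose -colorspace gray -median 3x3 -resize 300% -blur 10 "$src" "$dst"
    done
}

convert_all IMG-000.pbm IMG-001.pbm
```

With real files, the call would be convert_all IMG-*.pbm. Each iteration waits for the previous one to finish, so only one process runs at any time.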
As I repeatedly tried different image processing chains on multiple images at once, the execution time exceeded one minute. Looking at the CPU load, it occurred to me that only one core was in use at a time. Running several processes in parallel on a multi-core CPU would surely speed up every OCR run. At first, though, I was unsure whether there was a simple process manager that limits the number of concurrently running processes. Then I remembered an old acquaintance, make, or, more precisely, GNU make. The tool also resolves dependencies automatically, which additionally helps with scheduling the processes.
The only problem left to solve was the lack of rules for translation, or, in make parlance, pattern rules, from the source image to the OCRed text. Luckily, years ago I used make to automatically produce both printable and audible music scores from descriptions written in ABC notation, so I only had to reread the make documentation. This is the result:
CONVERT = convert
IMFLAGS = -verbose -colorspace gray -median 3x3 -resize 300% -blur 10
TESSERACT = tesseract
RM = rm

SRCS = $(wildcard *.pbm)
INTERMEDIATES = ${SRCS:.pbm=.tiff}
RESULTS = ${SRCS:.pbm=.txt}

.PHONY: all clean
all: $(RESULTS)

.PRECIOUS: $(INTERMEDIATES)

clean:
	$(RM) -f $(INTERMEDIATES) $(RESULTS)

%.tiff: %.pbm
	$(CONVERT) $(IMFLAGS) $< $@

%.txt: %.tiff
	$(TESSERACT) $< $(basename $@) -l deu
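The commands convert, tesseract and rm are stored in the variables CONVERT, TESSERACT and RM so that non-standard install locations can be specified. The variables SRCS, INTERMEDIATES and RESULTS contain file names: SRCS lists every .pbm file in the current directory, and the other two are derived from it by replacing the suffix. The .PHONY declaration instructs make to run the targets all and clean even when files named all or clean actually exist. The .PRECIOUS declaration instructs make to preserve the intermediate .tiff files, which it would otherwise delete once the results are built. Line groups starting with the form %to: %from are pattern rules (rules for translation). Each recipe line below such a rule must begin with a tab character. In the recipes, $< and $@ stand for the source file (the first prerequisite) and the target file, respectively.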
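To see how make chains the two pattern rules, a dry run is enough. The sketch below assumes GNU make is on the PATH; it writes a shortened copy of the Makefile into a scratch directory (the variable names demo_dir and plan are illustrative), creates two empty stand-in scans, and lets make -n print the commands it would run. Neither convert nor tesseract has to be installed, because nothing is executed.

```shell
#!/bin/sh
# Dry-run demo of the pattern-rule chain .pbm -> .tiff -> .txt.
# Assumption: GNU make is installed. demo_dir and plan are made-up names.
demo_dir=$(mktemp -d) || exit 1
cd "$demo_dir" || exit 1
touch IMG-000.pbm IMG-001.pbm   # empty stand-ins for real scans

# Recipe lines must start with a tab, hence the \t in the printf format.
{
    printf '%s\n'   'SRCS = $(wildcard *.pbm)'
    printf '%s\n'   'INTERMEDIATES = $(SRCS:.pbm=.tiff)'
    printf '%s\n'   'RESULTS = $(SRCS:.pbm=.txt)'
    printf '%s\n'   'all: $(RESULTS)'
    printf '%s\n'   '.PRECIOUS: $(INTERMEDIATES)'
    printf '%s\n'   '%.tiff: %.pbm'
    printf '\t%s\n' 'convert $< $@'
    printf '%s\n'   '%.txt: %.tiff'
    printf '\t%s\n' 'tesseract $< $(basename $@) -l deu'
} > Makefile

# -n prints the recipes without running them.
plan=$(make -n)
echo "$plan"

# Clean up the scratch directory.
cd / && rm -rf "$demo_dir"
```

For every .pbm file the printed plan should contain one convert line followed by one tesseract line, with $< and $@ already replaced by the concrete file names.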
Run it with make -j number, where number is either the number of cores or the number of hardware threads of the CPU. On an AMD CPU with 4 cores, I used make -j 4.
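The job count does not have to be typed by hand. As a small sketch, assuming a system that provides nproc (GNU coreutils) or getconf _NPROCESSORS_ONLN, the number of available processors can be looked up and passed on; the variable name njobs is made up for illustration.

```shell
#!/bin/sh
# Derive the job count from the number of online processors.
# nproc comes with GNU coreutils; getconf _NPROCESSORS_ONLN is a common
# fallback on Linux and macOS. If both are missing, fall back to 1 job.
njobs=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Print the command instead of running it, so the sketch is safe anywhere.
# To run the build for real, use: make -j "$njobs"
echo "make -j $njobs"
```

Note that both tools count logical processors, so on a CPU with simultaneous multithreading the result is the number of hardware threads rather than physical cores.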
With make, which is available for any off-the-shelf OS, you can speed up lengthy processing on multi-core systems. In addition, make ensures seamless translation between the different stages of a processing chain.
Which leads to the question: where do you use make, aside from everyday source code compilation?