In this part of the series, I describe how to fetch German law texts from https://www.gesetze-im-internet.de.
The (federal) laws in Germany are published by the Federal Ministry of Justice and Consumer Protection on https://www.gesetze-im-internet.de. There are also land (i.e. state) laws, published here, administrative regulations, published here, and many more laws, but for the sake of simplicity we will use the texts of federal laws only.
As stated on the notes page, four formats are available:
- HTML (which you can view in a browser)
- PDF (most suitable for archiving or for printed documents)
- EPUB (for e-book readers)
- XML (original format, which can be converted easily to other formats)
The format of the XML representation is defined by this DTD, which will become very helpful in the next part of this series.
As also stated on the notes page mentioned above, the index of XML documents is available at http://www.gesetze-im-internet.de/gii-toc.xml. This index links to XML documents, packed into ZIP archives, all of them having the same name
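Reading such an index with the standard library could look roughly like the sketch below. The element names (`items`, `item`, `title`, `link`) and the sample URLs are assumptions made for illustration; the real layout of gii-toc.xml should be checked before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt: the <items>/<item>/<title>/<link> layout and the
# archive names are assumptions, not taken from the real index file.
SAMPLE_INDEX = """<items>
  <item>
    <title>Example law A</title>
    <link>http://www.gesetze-im-internet.de/example_a/archive.zip</link>
  </item>
  <item>
    <title>Example law B</title>
    <link>http://www.gesetze-im-internet.de/example_b/archive.zip</link>
  </item>
  <item>
    <title>Entry without a link</title>
  </item>
</items>"""


def parse_index(xml_text):
    """Return (title, link) pairs for every <item> that carries a <link>."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            title = (item.findtext("title") or "").strip()
            entries.append((title, link.strip()))
    return entries


if __name__ == "__main__":
    for title, link in parse_index(SAMPLE_INDEX):
        print(title, link)
```

Because every archive shares one file name, the directory part of each link is what distinguishes one law from another.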
The choice of the format
Of the four available formats, we need the one that represents the resulting text with the least markup. This requirement comes from the need to generate a future law text with as little markup as possible.
This requirement, of course, eliminates the PDF format, because it is tailored to print media. The HTML format could be converted to text, for example with the venerable html2text, but the content of each law is split across many small sections, which complicates the conversion. The conversion of the EPUB format to text is difficult to customise, at least in comparison to XML. Finally, for the XML format there is already a converter to plain text, described in another post.
So we need the documents in XML format.
How to parse HTML with batteries included
If you are wondering at this point why I am overlooking two very nice and well-tried Python packages, please read the list under First things first in this article.
To parse HTML with the HTMLParser class, you simply create a subclass of it. Then, depending on what you need to get from the HTML data, you implement the handle_* methods. For example, to parse links from the https://www.gesetze-im-internet.de front page, you implement handle_starttag and collect the href attribute of every a tag.
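A minimal sketch of such a subclass is shown here. The names `LinkCollector` and `collect_links` are made up for this illustration, and the HTML is passed in as a string so that the example runs without network access.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names and passes the
        # attributes as a list of (name, value) pairs.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def collect_links(html_text):
    """Feed a complete HTML document to the parser and return its links."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = '<p><a href="/aktuell.html">Current</a> <a name="top">anchor</a></p>'
    print(collect_links(sample))
```

For the real front page, the string would come from `urllib.request.urlopen(...).read()`, decoded with the charset the page declares.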
Collecting all XML documents
While, as mentioned above, there is a list of XML documents here, we will try to collect URLs of all XML documents from the list of current documents at http://www.gesetze-im-internet.de/aktuell.html.
The parser implemented for this page is similar to the previous example. As the current documents are grouped by their first character into separate lists, this parser collects the links to these lists.
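One way to write this parser is sketched below. It assumes that the links to the per-character lists can be recognised by a fixed fragment in their href; the default `Teilliste_` is a guess about the page markup, not something stated in this article, so it is exposed as a parameter.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

AKTUELL_URL = "http://www.gesetze-im-internet.de/aktuell.html"


class PartialListParser(HTMLParser):
    """Collect absolute URLs of the per-character document lists."""

    def __init__(self, base_url=AKTUELL_URL, marker="Teilliste_"):
        super().__init__()
        self.base_url = base_url
        self.marker = marker  # assumed href fragment, adjust to the real page
        self.partial_list_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            # Resolve relative links such as "./Teilliste_A.html".
            url = urljoin(self.base_url, href)
            if url not in self.partial_list_urls:
                self.partial_list_urls.append(url)


if __name__ == "__main__":
    parser = PartialListParser()
    parser.feed('<a href="./Teilliste_A.html">A</a> <a href="/impressum.html">x</a>')
    print(parser.partial_list_urls)
```

Resolving every href with `urljoin` keeps the collected list usable on its own, whatever page it was read from.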
As all links to document lists are stored in the variable partial_list_urls, we must add another parser to fetch the links to XML documents. This parser also stores law names.
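The second parser might be sketched as follows. It remembers the text inside each matching link as the law name. How the list pages actually reference the archives is not shown in this article, so the `suffix` filter is an assumption that may need adjusting, for instance if the pages link to a per-law index page from which the archive URL has to be derived.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LawLinkParser(HTMLParser):
    """Collect (law name, absolute URL) pairs from one partial list page."""

    def __init__(self, base_url, suffix=".zip"):
        super().__init__()
        self.base_url = base_url
        self.suffix = suffix  # assumed way to recognise archive links
        self.laws = []
        self._href = None  # URL of the link currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.endswith(self.suffix):
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        # Text is only recorded while inside a matching link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the collected link text.
            name = " ".join("".join(self._text).split())
            self.laws.append((name, self._href))
            self._href = None
```

Feeding the page behind every entry of partial_list_urls to a fresh `LawLinkParser` and concatenating the `laws` lists gives the full set of names and URLs.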
Complete fetch code
The complete fetch code combines the two examples and adds some error handling and some urlretrieve action as well.
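The download step could look roughly like the sketch below; it is not the original script. The helper names `cache_name` and `fetch_all`, the cache layout, and the choice to skip files that already exist are all assumptions. Since the archives share one file name, the local name is built from the last two parts of the URL path.

```python
import os
from urllib.request import urlretrieve


def cache_name(url):
    """Build a local file name from the last two parts of the URL path."""
    parts = [part for part in url.split("/") if part]
    return "_".join(parts[-2:])


def fetch_all(urls, cache_dir="cache"):
    """Download every URL into cache_dir; return (fetched, failed) lists."""
    os.makedirs(cache_dir, exist_ok=True)
    fetched, failed = [], []
    for url in urls:
        target = os.path.join(cache_dir, cache_name(url))
        if os.path.exists(target):
            # Already downloaded in an earlier run.
            fetched.append(target)
            continue
        try:
            urlretrieve(url, target)
        except (OSError, ValueError) as error:
            # URLError and HTTPError are subclasses of OSError;
            # ValueError covers malformed URLs.
            if os.path.exists(target):
                os.remove(target)  # drop a partial download
            failed.append((url, str(error)))
        else:
            fetched.append(target)
    return fetched, failed
```

Returning the failures instead of stopping at the first one lets a long run finish, and the failed URLs can be retried afterwards.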
After executing the complete fetch code, we end up with 6518 ZIP files in the cache directory.
In the next step, we will build the text corpus from all the law texts fetched.