Developer tools
The Aleph toolkit includes a number of Python libraries for data parsing and normalisation that can be used independently of the core tool.
All of the tools below are released regularly and can be installed from the Python Package Index (PyPI) using pip (for example, pip install fingerprints):
fingerprints is a Python library that heavily normalises names of companies and people before comparison. This includes transliteration, normalising word order, and the normalisation of company type suffixes like Limited (Ltd) or Aktiengesellschaft (AG). fingerprints depends on normality and works best when pyicu is installed.
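The kind of normalisation described above can be illustrated with a small standard-library sketch. This is not the fingerprints implementation or its API: the function name simple_fingerprint and the two-entry COMPANY_TYPES table are hypothetical, and the sketch only strips accents from Latin text rather than performing real transliteration.

```python
import re
import unicodedata

# Hypothetical, tiny suffix table for illustration only.
COMPANY_TYPES = {
    "limited": "ltd",
    "aktiengesellschaft": "ag",
}


def simple_fingerprint(name):
    """Return a crude comparison key for a company or person name."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Lowercase, then turn every run of other characters into a space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", stripped.lower())
    # Map long company type names onto their short forms.
    tokens = [COMPANY_TYPES.get(token, token) for token in cleaned.split()]
    # Sort the tokens so that word order does not affect the comparison.
    return " ".join(sorted(tokens))


print(simple_fingerprint("Siemens Aktiengesellschaft"))
print(simple_fingerprint("SIEMENS AG"))
```

Under these assumptions, both calls print the same key, so the two spellings would compare as equal.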
pdflib is a Python to C binding for the poppler PDF parser. It's used to extract text and images from PDF files with a high level of error tolerance.
msglite is a fork of msg-extractor, a parser for Microsoft Outlook MSG files. These binary email files are OLE containers (like old-style Word or Excel documents) and require some tickling before they will confess details about the contained email message.
countrynames turns country names into the corresponding two-letter ISO codes. For example, United States and Delaware both become us, and England becomes gb. Due to the work area of the OCCRP, this includes some exotic country designations, such as Yugoslavia, Transnistria and the Soviet Union (now deceased).
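As a rough illustration of the lookup described above, here is a toy version built on a plain dictionary. It is not the countrynames API: the function name to_country_code is hypothetical, and the table contains only the three examples given in the text.

```python
# Toy lookup table holding only the examples mentioned above.
COUNTRY_CODES = {
    "united states": "us",
    "delaware": "us",
    "england": "gb",
}


def to_country_code(name):
    """Return a two-letter code for a known name, or None otherwise."""
    # Lowercase and collapse whitespace so spelling variants match.
    key = " ".join(name.lower().split())
    return COUNTRY_CODES.get(key)


print(to_country_code("United States"))
print(to_country_code("England"))
```

Sub-national names such as Delaware map onto the code of the country they belong to, which is why several keys can share one value.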
pantomime is a simple tool for dealing with MIME type names (such as text/plain). It contains both a parser and a normaliser for MIME declarations, and defines many common types as constants.
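To show what parsing and normalising a MIME declaration involves, the sketch below uses the standard library's email.message module instead of pantomime itself. The function name normalise_mime and the PLAIN constant are hypothetical and only mirror the ideas described above.

```python
from email.message import Message

# Hypothetical constant, in the spirit of predefined common types.
PLAIN = "text/plain"


def normalise_mime(declaration):
    """Return (type, charset) for a MIME declaration, both lowercased."""
    msg = Message()
    msg["Content-Type"] = declaration
    # get_content_type() lowercases the "maintype/subtype" part;
    # get_content_charset() returns the charset in lowercase, or None.
    return msg.get_content_type(), msg.get_content_charset()


mime_type, charset = normalise_mime("TEXT/Plain; charset=UTF-8")
print(mime_type == PLAIN, charset)
```

Comparing against a constant after normalisation avoids mismatches caused by casing or by trailing parameters in the raw declaration.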
languagecodes is a Python library that handles the normalisation of language identifiers into ISO 3-letter codes. For example, en becomes eng and de becomes deu.