All of the tools below are released regularly and can be installed from the Python package registry.
fingerprints is a Python library that heavily normalises names of companies and people before comparison. This includes transliteration, word order, and the normalisation of company type suffixes like Limited (Ltd) or Aktiengesellschaft (AG).
fingerprints depends on normality and works best when pyicu is installed.
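To make the idea of a normalised comparison key concrete, here is a minimal sketch using only the standard library. It is not how fingerprints is implemented: the function name `sketch_fingerprint` and the three-entry `COMPANY_TYPES` table are hypothetical, and stripping accents with `unicodedata` is only a rough stand-in for real transliteration.

```python
import re
import unicodedata

# Hypothetical, tiny mapping of company type names to short forms.
# A real table would cover many more legal forms and languages.
COMPANY_TYPES = {
    "limited": "ltd",
    "aktiengesellschaft": "ag",
    "incorporated": "inc",
}


def sketch_fingerprint(name):
    """Return a crude comparison key for a company or person name."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lower-case, then turn every run of other characters into a space.
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    # Collapse spelled-out company types to their short form.
    tokens = [COMPANY_TYPES.get(token, token) for token in text.split()]
    # Sort the tokens so that word order no longer matters.
    return " ".join(sorted(tokens))


print(sketch_fingerprint("Siemens Aktiengesellschaft"))  # ag siemens
print(sketch_fingerprint("Müller, Hans"))                # hans muller
```

With this key, "SIEMENS AG" and "Siemens Aktiengesellschaft" compare as equal, as do "Hans Müller" and "Müller, Hans".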
pdflib is a Python to C binding for the poppler PDF parser. It's used to extract text and images from PDF files with a high level of error tolerance.
msglite is a fork of msg-extractor, a parser for Microsoft Outlook MSG files. These binary email files are OLE containers (like old-style Word or Excel documents) and require some tickling before they will confess details about the contained email message.
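As a small illustration of what "OLE container" means in practice, the sketch below checks whether some bytes start with the eight-byte signature shared by OLE compound files. It does not use msglite and does not parse the message itself; the helper name `looks_like_ole` is hypothetical.

```python
# Signature found at the start of every OLE compound file
# (MSG files as well as old-style .doc and .xls documents).
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def looks_like_ole(data):
    """Return True if the given bytes begin with the OLE signature."""
    return data[:8] == OLE_MAGIC


print(looks_like_ole(OLE_MAGIC + b"\x00" * 504))  # True
print(looks_like_ole(b"%PDF-1.7"))                # False
```

A check like this only identifies the container format. Telling an MSG file apart from a Word document still requires reading the streams stored inside the container.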
countrynames helps to turn country names into two-letter ISO codes representing that country. For example, United States or gb. Due to the work area of the OCCRP, this includes some exotic country designations, such as Yugoslavia, Transnistria and the Soviet Union (now deceased).
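The lookup described above can be sketched with a plain dictionary. This is an illustration only, not the countrynames API: the function name `sketch_to_code`, the handful of table entries and the choice to return upper-case codes are all assumptions made for the example.

```python
# Hypothetical, deliberately tiny lookup table of normalised names.
COUNTRY_CODES = {
    "united states": "US",
    "united states of america": "US",
    "united kingdom": "GB",
    "great britain": "GB",
}
KNOWN_CODES = set(COUNTRY_CODES.values())


def sketch_to_code(name):
    """Map a country name (or an existing code) to a two-letter code."""
    # Normalise case and whitespace before looking anything up.
    key = " ".join(name.lower().split())
    # Values that already are a known code pass through unchanged.
    if key.upper() in KNOWN_CODES:
        return key.upper()
    # Unknown names yield None rather than a guess.
    return COUNTRY_CODES.get(key)


print(sketch_to_code("United States"))  # US
print(sketch_to_code("gb"))             # GB
print(sketch_to_code("Atlantis"))       # None
```

Returning `None` for unknown input keeps bad matches out of the data; a caller can then decide whether to try a fuzzier comparison.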
pantomime is a simple tool for dealing with MIME type names (such as text/plain). It contains both a parser and normaliser for MIME declarations, and many common types defined as constants.
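A rough idea of what normalising a MIME declaration involves, written with the standard library only. The names `sketch_normalize_mimetype`, `PLAIN` and `DEFAULT` are hypothetical and are not taken from pantomime; the fallback to application/octet-stream is an assumption made for this sketch.

```python
# Hypothetical constants for two common types.
PLAIN = "text/plain"
DEFAULT = "application/octet-stream"


def sketch_normalize_mimetype(value, default=DEFAULT):
    """Reduce a MIME declaration to a lower-case 'family/subtype' string."""
    if not value:
        return default
    # Drop parameters such as "; charset=utf-8".
    media_type = value.split(";", 1)[0].strip().lower()
    family, sep, subtype = media_type.partition("/")
    # Anything without both parts is treated as unknown.
    if not sep or not family or not subtype:
        return default
    return f"{family}/{subtype}"


print(sketch_normalize_mimetype("TEXT/Plain; charset=UTF-8"))  # text/plain
print(sketch_normalize_mimetype("garbage"))  # application/octet-stream
```

Comparing the result against a constant such as `PLAIN` avoids scattering string literals with slightly different spellings across a code base.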
languagecodes is a Python library that handles the normalisation of language identifiers into ISO 3-letter codes. For example,