fingerprints is a Python library that heavily normalises names of companies and people before comparison. This includes transliteration, word order, and the normalisation of company type suffixes like Limited (Ltd) or Aktiengesellschaft (AG).
normality and works best when
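As a rough illustration of the kind of normalisation described above, here is a minimal sketch that uses only the Python standard library. It is not the fingerprints API or its implementation: the `fingerprint` function and the two-entry `COMPANY_TYPES` table are hypothetical, and accent stripping stands in for real transliteration.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny suffix table (an assumption for this sketch).
COMPANY_TYPES = {"limited": "ltd", "aktiengesellschaft": "ag"}


def fingerprint(name):
    """Return a crude comparison key for a name, or None if nothing is left."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Case-fold and split on anything that is not a word character.
    tokens = re.findall(r"\w+", text.casefold())
    # Map long company type names onto their short forms.
    tokens = [COMPANY_TYPES.get(token, token) for token in tokens]
    # Sort the tokens so that word order no longer matters.
    return " ".join(sorted(tokens)) or None


print(fingerprint("Siemens Aktiengesellschaft") == fingerprint("SIEMENS AG"))
```

Sorting the tokens is what makes the key independent of word order, at the cost of also treating genuinely different orderings as equal.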
pdflib is a Python-to-C binding for the poppler PDF parser. It's used to extract text and images from PDF files with a high level of error tolerance.
msglite is a fork of msg-extractor, a parser for Microsoft Outlook MSG files. These binary email files are OLE containers (like old-style Word or Excel documents) and require some tickling before they will confess details about the contained email message.
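To make the "OLE container" point concrete, the following sketch checks whether some bytes start with the eight-byte signature of an OLE compound file. It is not part of msglite; `looks_like_ole` is a hypothetical helper, and a match only says the data is some OLE container (old-style Word and Excel files share the signature), not that it is an MSG file.

```python
# Signature at the start of every OLE compound file.
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def looks_like_ole(data: bytes) -> bool:
    """Return True if the data begins with the OLE compound file signature."""
    return data[:8] == OLE_MAGIC


print(looks_like_ole(OLE_MAGIC + b"\x00" * 504))  # True
print(looks_like_ole(b"%PDF-1.7"))                # False
```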
countrynames helps to turn country names into the two-letter ISO codes that represent them. For example, the United Kingdom is represented by gb. Due to the work area of the OCCRP, this includes some exotic country designations, such as Yugoslavia, Transnistria and the Soviet Union (now deceased).
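The lookup described above can be pictured as a normalise-then-match step. The sketch below is an assumption-laden toy, not the countrynames API: `to_code`, `normalise` and the five-entry `NAMES_TO_CODES` table are all hypothetical, and a usable table would hold many more spellings per country.

```python
import unicodedata

# Hypothetical, deliberately tiny table of name variants (an assumption).
NAMES_TO_CODES = {
    "united kingdom": "gb",
    "great britain": "gb",
    "germany": "de",
    "deutschland": "de",
    "cote d'ivoire": "ci",
}


def normalise(name):
    """Strip accents, fold case and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return " ".join(text.casefold().split())


def to_code(name):
    """Return the two-letter code for a known name variant, else None."""
    return NAMES_TO_CODES.get(normalise(name))


print(to_code("  United   Kingdom "))
print(to_code("Côte d'Ivoire"))
```

Normalising before the lookup keeps the table small, since spelling differences in case, accents and spacing collapse onto a single key.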
pantomime is a simple tool for dealing with MIME type names (such as text/plain). It contains both a parser and normaliser for MIME declarations, and many common types defined as constants.
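What normalising a MIME declaration involves can be sketched in a few lines of plain Python. This is not pantomime's parser: `normalize_mimetype` is a hypothetical function, and falling back to application/octet-stream for unusable input is an assumption made for this example.

```python
# Fallback for missing or malformed declarations (an assumption of this sketch).
DEFAULT = "application/octet-stream"


def normalize_mimetype(declaration, default=DEFAULT):
    """Reduce a MIME declaration to a lower-case 'type/subtype' string."""
    if declaration is None:
        return default
    # Drop parameters such as '; charset=utf-8', then trim and lower-case.
    base = declaration.split(";", 1)[0].strip().lower()
    # A usable name has exactly one slash with text on both sides.
    parts = base.split("/")
    if len(parts) != 2 or not all(parts):
        return default
    return base


print(normalize_mimetype('TEXT/Plain; charset="UTF-8"'))
print(normalize_mimetype("garbage"))
```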