With Aleph, we try to support all file formats commonly found in leaked evidence used in investigative reporting. Unlike other systems, Aleph does not use Apache Tika for content extraction. This lets us extract structured information more precisely and generate detailed online previews for a variety of formats.
Basic data formats like plain text, HTML and XML.
Office formats including Word, PowerPoint, LibreOffice Text, LibreOffice Impress, WordPerfect, RTF, PDF, ClarisWorks, EPUB, DjVu, Lotus WordPro, StarOffice, AbiWord, PageMaker, MacWrite, etc.
Tabular formats like Excel, Excel 2007, OpenDocument Spreadsheet, DBF, Comma-Separated Values, SQLite, Access.
E-Mail formats including plain MIME email (RFC822), Outlook MSG, Outlook PST, Outlook Mac Backups (OLM), MBOX, vCard.
Archive/package formats like ZIP, RAR, Tar, 7Zip, Gzip, BZip2.
Media formats including JPEG, PNG, GIF, TIFF, SVG, and metadata from common video and audio files.
Aleph attempts to extract written text from any image submitted to the engine. This includes images embedded in PDF or other office format files, such as scanned documents. When performing OCR, Aleph supports two backends: Tesseract 4 and the Google Vision API.
The Google Vision API produces much higher-quality output than Tesseract, but using it requires submitting the source images to a remote service and can incur significant costs.
Tesseract, on the other hand, benefits heavily from knowing the language of the documents from which it is attempting to extract content. If you are seeing extremely weak recognition results, make sure that the collection containing the documents has a collection language set.
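To illustrate why the language setting matters, here is a minimal sketch of how a language hint reaches Tesseract through its command-line interface, where the `-l` option selects one or more language models. This is not how Aleph itself invokes Tesseract; the `TESSERACT_LANGS` mapping and the `tesseract_command` helper are hypothetical names used only for this example.

```python
import shlex

# Hypothetical mapping from two-letter language codes (as a collection
# might store them) to Tesseract language model names. Extend as needed.
TESSERACT_LANGS = {"en": "eng", "de": "deu", "fr": "fra", "es": "spa", "ru": "rus"}


def tesseract_command(image_path, languages):
    """Build a tesseract CLI invocation that passes language hints via -l."""
    models = [TESSERACT_LANGS[code] for code in languages if code in TESSERACT_LANGS]
    command = ["tesseract", image_path, "stdout"]
    if models:
        # Tesseract accepts several language models joined with "+".
        command += ["-l", "+".join(models)]
    return command


if __name__ == "__main__":
    # Print the command line for a German/English scan.
    print(shlex.join(tesseract_command("scan.png", ["de", "en"])))
```

Without any recognised language code the sketch omits `-l` entirely and Tesseract falls back to its default model, which is the situation in which recognition quality tends to suffer.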
Aleph performs named entity recognition (NER) immediately before indexing data to Elasticsearch. The terminology here can be confusing: although called "entity extraction", the process actually extracts names from entities (e.g. a PDF or an E-Mail).
Currently, text processing begins with language classification using fastText, and the text is then fed into spaCy for NER. While names of people and companies are tagged directly, locations are checked against the GeoNames database, and the result is used to tag individual documents with countries. Additionally, a number of regular expressions are used to perform rule-based extraction of phone numbers, email addresses, IBANs and IPs.
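The rule-based part of this step can be sketched with the Python standard library alone. The patterns below are simplified illustrations, not the expressions Aleph actually uses, and `extract_tags` and `is_valid_iban` are hypothetical helpers. Phone numbers are left out because matching them reliably usually needs a dedicated library.

```python
import ipaddress
import re

# Simplified, illustrative patterns (assumptions, not Aleph's own rules).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,4})?\b")


def is_valid_iban(iban):
    """Check the ISO 13616 mod-97 checksum of a compact (space-free) IBAN."""
    rearranged = iban[4:] + iban[:4]
    # Letters map to numbers: A=10 ... Z=35, which base-36 parsing provides.
    digits = "".join(str(int(char, 36)) for char in rearranged)
    return int(digits) % 97 == 1


def extract_tags(text):
    """Return validated email, IP and IBAN candidates found in the text."""
    emails = {match.lower() for match in EMAIL_RE.findall(text)}
    ips = set()
    for candidate in IPV4_RE.findall(text):
        try:
            # Rejects out-of-range octets such as 999.1.1.1.
            ips.add(str(ipaddress.ip_address(candidate)))
        except ValueError:
            continue
    ibans = set()
    for candidate in IBAN_RE.findall(text):
        compact = candidate.replace(" ", "")
        if is_valid_iban(compact):
            ibans.add(compact)
    return {"emails": sorted(emails), "ips": sorted(ips), "ibans": sorted(ibans)}


if __name__ == "__main__":
    sample = (
        "Contact jane.doe@example.org from 192.168.0.1 or 999.1.1.1; "
        "pay to GB82 WEST 1234 5698 7654 32."
    )
    print(extract_tags(sample))
```

Pairing each pattern with a validation step (address parsing for IPs, a checksum for IBANs) keeps false positives down, which matters when the extracted values are later used as tags.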
Once extracted, these tags are added as properties to the Follow the Money entity of the Document that they have been extracted from. They can be found in the following fields:
We're extremely happy to consider pull requests that add further types of linguistic and pattern-based extraction.
Can Britain leave the European Union? Yes, it's possible; but complicated and will probably not make your life better in the way that you're expecting.
Aleph's document ingest service requires a large number of command-line utilities and libraries to be installed within a certain version range in order to operate correctly. While we'd love to be able to ship e.g. a Debian package in the long term, the work required for this is significant.