Technical FAQ

This section answers technical questions about how the Aleph system works.

What file formats can Aleph extract content from?

With Aleph, we try to support all file formats commonly found in leaked evidence used in investigative reporting. Unlike other systems, Aleph does not use Apache Tika for content extraction. This allows us to extract structured information more precisely and to generate detailed online previews for a variety of formats.

  • Basic data formats like plain text, HTML and XML.

  • Office formats including Word, PowerPoint, LibreOffice Text, LibreOffice Impress, WordPerfect, RTF, PDF, ClarisWorks, EPub, DjVu, Lotus WordPro, StarOffice, AbiWord, PageMaker, MacWrite, etc.

  • Tabular formats like Excel, Excel 2007, OpenDocument Spreadsheet, DBF, Comma-Separated Values, SQLite, Access.

  • E-Mail formats including plain MIME email (RFC822), Outlook MSG, Outlook PST, Outlook Mac Backups (OLM), MBOX, VCard.

  • Archive/package formats like ZIP, RAR, Tar, 7Zip, Gzip, BZip2.

  • Media formats including JPEG, PNG, GIF, TIFF, SVG, and metadata from common video and audio files.

Does Aleph perform optical character recognition (OCR)?

Aleph attempts to extract written text from any image submitted to the engine. This includes images embedded in PDFs or other office format files, such as scanned documents. When performing OCR, Aleph supports two backends: Tesseract 4 and the Google Vision API.

The output generated by the Google Vision API is of much higher quality than that generated by Tesseract, but it requires submitting the source images to a remote service and can incur significant costs.

Tesseract, on the other hand, benefits heavily from knowing the language of the documents from which it is attempting to extract content. If you are seeing extremely weak recognition results, make sure that the collection containing the documents has a collection language set.

How does Aleph extract named entities from text?

Aleph performs named entity recognition (NER) immediately before indexing data to ElasticSearch. The terminology here can be confusing: although called "entity extraction", the process actually extracts names from entities (e.g. a PDF or an E-Mail).

Currently, text processing begins with language classification using fasttext, and the text is then fed into spaCy for NER. While names of people and companies are tagged directly, locations are checked against the GeoNames database. This is used to tag individual documents with countries. Additionally, a number of regular expressions are used to perform rule-based extraction of phone numbers, email addresses, IBANs and IPs.

Once extracted, these tags are added as properties to the Follow the Money entity of the Document that they have been extracted from. They can be found in the following fields: detectedLanguage, namesMentioned, country, ipMentioned, emailMentioned, phoneMentioned and ibanMentioned.
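
As a rough illustration of the rule-based step, the sketch below pulls email addresses and IPv4 addresses out of a text with regular expressions and files them under the emailMentioned and ipMentioned fields named above. The patterns and the extract_tags function are simplified assumptions made for illustration; they are not the expressions Aleph actually uses.

```python
import ipaddress
import re

# Simplified, illustrative patterns (assumptions, not Aleph's own rules).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_tags(text):
    """Return a dict of pattern-based tags found in the text."""
    # Lower-case emails so that differently cased duplicates collapse.
    emails = {match.lower() for match in EMAIL_RE.findall(text)}
    ips = set()
    for candidate in IPV4_RE.findall(text):
        try:
            # Reject look-alikes such as 999.1.1.1.
            ips.add(str(ipaddress.IPv4Address(candidate)))
        except ValueError:
            continue
    return {
        "emailMentioned": sorted(emails),
        "ipMentioned": sorted(ips),
    }
```

Calling extract_tags on the text of a document would give a dict whose keys mirror the field names listed above. Phone numbers and IBANs generally need more than a regular expression (for example checksum validation), which is why they are left out of this sketch.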

We're extremely happy to consider pull requests that add further types of linguistic and pattern-based extraction.

Can I run Aleph without using Docker?

Can Britain leave the European Union? Yes, it's possible; but complicated and will probably not make your life better in the way that you're expecting.

Aleph's document ingest service requires a large number of command-line utilities and libraries to be installed within a certain version range in order to operate correctly. While we'd love to be able to ship e.g. a Debian package in the long term, the work required for this is significant.

How can I upgrade to a new version of Aleph?

Aleph does not perform updates and database migrations automatically (except in the Kubernetes setup, which runs them as a job). Once you have the latest version, you can run the command below to upgrade the existing installation (i.e. apply changes to the database model or the search index format). In production mode, you may want to perform a backup before running an upgrade.

host$ docker-compose run --rm shell aleph upgrade

You will have to pull new docker images or check out the latest version using Git in order to fetch the latest product version.

I get an error about missing tables. How do I fix it?

If running aleph commands gives you warnings about missing tables, you probably need to migrate your database to the latest schema. Try:

make upgrade

Why do entities have two-part IDs?

When looking at an Aleph URL, you may notice that every entity ID has two parts, separated by a dot (.), for example: deadbeef.3cd336a9859bdf2be917f561430f2a83e5da292b. The first part is the actual entity ID, while the second part is a signature (HMAC) assigned by the server when indexing the data.

The background for this is a security mitigation. There are various places in Aleph where a user can assign arbitrary IDs to new entities, including the collection _bulk API. In these cases, an attacker could attempt to inject an ID already used by another collection and thus overwrite its data.

To avoid this, each entity ID is assigned a namespace ID suffix for the collection it is submitted to. This way, multiple collections can have entities with the same ID without overwriting each other's data.

When using the Aleph API, you can submit either form: a version of the entity with its signature, or without, via the _bulk API. The signature will be fixed up automatically.
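
To make the mechanism more concrete, here is a minimal sketch of signing and re-signing entity IDs with an HMAC. It is not Aleph's actual implementation: the secret key, the way the collection ID is mixed into the key, the assumption that plain IDs contain no dot, and the choice of SHA-1 (suggested only by the 40-character signature in the example above) are all assumptions.

```python
import hashlib
import hmac

SECRET_KEY = b"change-me"  # hypothetical server-side secret


def _signature(plain_id, collection_id):
    # Assumption: the key combines the server secret and the collection ID.
    key = SECRET_KEY + str(collection_id).encode("utf-8")
    return hmac.new(key, plain_id.encode("utf-8"), hashlib.sha1).hexdigest()


def sign(entity_id, collection_id):
    """Return 'id.signature', discarding any signature already attached."""
    plain_id = entity_id.split(".", 1)[0]
    return f"{plain_id}.{_signature(plain_id, collection_id)}"


def verify(signed_id, collection_id):
    """Check that the signature matches the ID for this collection."""
    plain_id, _, signature = signed_id.partition(".")
    expected = _signature(plain_id, collection_id)
    # Constant-time comparison on bytes avoids leaking timing information.
    return hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8"))
```

Under this scheme, sign("deadbeef", 1) and sign("deadbeef", 2) produce different full IDs, so the same plain ID submitted to two collections cannot collide, and an ID carrying a foreign or missing signature is simply re-signed for the target collection.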

ElasticSearch will not start. What's wrong?

Most problems arise when the ElasticSearch container doesn't start up properly, or doesn't start in time. If upgrade fails with an error like the following, this is what happened:

NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fb11b6ab0d0>: Failed to establish a new connection: [Errno 111] Connection refused

You can find out specifically what went wrong with ES by consulting the logs for that container:

docker-compose logs elasticsearch

You will almost certainly need to run the following before you build:

sysctl -w vm.max_map_count=262144

Or to set this permanently, in /etc/sysctl.conf add:

vm.max_map_count=262144

Max file descriptors

If the error in your ES container contains:

elasticsearch_1 | [1]: max file descriptors [4096] for elasticsearch process is too low, increase to at least [65536]

Please see the relevant ElasticSearch documentation for this issue.

ElasticSearch has gone into read-only mode, why?

When the disk of the host machine is nearly full (by default, once usage passes ElasticSearch's 95% flood-stage watermark), ElasticSearch stops accepting writes to the index as an emergency measure. You would see errors like this:

AuthorizationException(403, 'cluster_block_exception', 'index [aleph-collection-v1] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];')

To fix this, try the following:

  • Run docker system prune on the host machine

  • Inside a make shell, run this curl command: curl -XPUT -H "Content-Type: application/json" http://elasticsearch:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'
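
The same settings change can also be issued from Python, which may be handy in a maintenance script. This is a sketch using only the standard library; it assumes the ElasticSearch URL from the curl command above and has to run somewhere that can resolve the elasticsearch host name (for example inside a make shell). The function names are illustrative.

```python
import json
import urllib.request

ES_URL = "http://elasticsearch:9200"  # same host as in the curl command


def build_unblock_request(es_url=ES_URL):
    """Build a PUT request that clears the read-only block on all indexes."""
    # Setting the value to null resets it to the default (no block).
    body = json.dumps({"index.blocks.read_only_allow_delete": None})
    return urllib.request.Request(
        url=f"{es_url}/_all/_settings",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def clear_read_only_block(es_url=ES_URL):
    """Send the request and return the decoded JSON response."""
    request = build_unblock_request(es_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

Freeing disk space first still matters: if the disk remains nearly full, ElasticSearch may apply the block again.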

Something else is wrong, what do I do?

Try turning it off and on again.

If all else fails, you may just need to wait a little longer for the ES service to initialize before you run upgrade. Doing the following (after make build) should be sufficient:

  1. make shell

  2. Inside the aleph shell run aleph upgrade.

  3. If that succeeds, in a new terminal run make web to launch the UI and API.
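
If you script the steps above, you could wait for ElasticSearch to accept connections before running aleph upgrade instead of retrying by hand. The sketch below polls a TCP port until it opens or a timeout expires. The function name and the timings are arbitrary choices, and an open port only shows that the service is listening, not that the cluster is fully ready.

```python
import socket
import time


def wait_for_port(host, port, timeout=120.0, interval=2.0):
    """Return True once host:port accepts TCP connections, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or host not resolvable yet; retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Inside the aleph shell, this could be called as wait_for_port("elasticsearch", 9200), using the host and port from the curl command earlier in this FAQ, and the upgrade started only if it returns True.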

If that does not help, visit the Aleph Slack and ask the community for support.

How can I clear parts of the redis cache?

redis-cli --scan --pattern 'ocr:*' | xargs redis-cli del
redis-cli --scan --pattern 'aleph:authz:*' | xargs redis-cli del
redis-cli --scan --pattern 'aleph:metadata:*' | xargs redis-cli del
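
The same cleanup can be done from Python. The sketch below assumes a client object that offers scan_iter(match=...) and delete(*keys), as the redis-py client does; the function name and the batch size are illustrative. Deleting in batches avoids building one very long command.

```python
def delete_by_pattern(client, pattern, batch_size=500):
    """Delete all keys matching a glob pattern; return the number of keys removed."""
    deleted = 0
    batch = []
    for key in client.scan_iter(match=pattern):
        batch.append(key)
        if len(batch) >= batch_size:
            # delete() reports how many of the given keys actually existed.
            deleted += client.delete(*batch)
            batch = []
    if batch:
        deleted += client.delete(*batch)
    return deleted
```

With redis-py installed, this could be called as delete_by_pattern(redis.Redis(), "ocr:*"); the connection details depend on your setup.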