Technical FAQ

This section answers technical questions about how the Aleph system works.

What file formats can Aleph extract content from?

With Aleph, we try to support all file formats commonly found in leaked evidence used in investigative reporting. Unlike other systems, Aleph does not use Apache Tika for content extraction. This allows us to extract structured information more precisely and to generate detailed online previews for a variety of formats.

  • Basic data formats like plain text, HTML and XML.

  • Office formats including Word, PowerPoint, LibreOffice Text, LibreOffice Impress, WordPerfect, RTF, PDF, ClarisWorks, EPUB, DjVu, Lotus WordPro, StarOffice, Abiword, PageMaker, MacWrite, etc.

  • Tabular formats like Excel, Excel 2007, OpenDocument Spreadsheet, DBF, Comma-Separated Values, SQLite, Access.

  • E-Mail formats including plain MIME email (RFC822), Outlook MSG, Outlook PST, Outlook Mac Backups (OLM), MBOX, VCard.

  • Archive/package formats like ZIP, RAR, Tar, 7Zip, Gzip, BZip2.

  • Media formats including JPEG, PNG, GIF, TIFF, SVG, and metadata from common video and audio files.

Does Aleph perform optical character recognition (OCR)?

Aleph attempts to extract written text from any image submitted to the engine. This includes images included in PDF or other office format files, such as in scanned documents. When performing OCR, Aleph supports two backends: Tesseract 4, and the Google Vision API.

The output generated by Google Vision API is much higher quality than that generated by Tesseract, but requires submitting the source images to a remote service, while also incurring potentially significant costs.

Tesseract, on the other hand, benefits heavily from knowing the language of the documents from which it is attempting to extract content. If you are seeing extremely weak recognition results, make sure that the collection containing the documents has a collection language set.

How does Aleph extract named entities from text?

Aleph performs named entity recognition (NER) immediately before indexing data to ElasticSearch. The terminology here can be confusing: although called "entity extraction", the process actually extracts names from entities (e.g. a PDF or an E-Mail).

Currently, text processing begins with language classification using fastText, and then feeds into spaCy for NER. While names of people and companies are tagged directly, locations are checked against the GeoNames database. This is used to tag individual documents with countries. Additionally, a number of regular expressions are used to perform rule-based extraction of phone numbers, email addresses, IBANs and IPs.

Once extracted, these tags are added as properties to the Follow the Money entity of the Document that they have been extracted from. They can be found in the following fields: detectedLanguage, namesMentioned, country, ipMentioned, emailMentioned, phoneMentioned and ibanMentioned.
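
To give a feel for the rule-based part, here is a minimal sketch of pattern-based extraction in Python. The regular expressions and the extract_tags helper are illustrative assumptions, not the patterns Aleph actually ships; only the field names emailMentioned and ipMentioned come from the list above.

```python
import re

# Hypothetical, deliberately simple patterns for illustration only.
# Real-world extraction needs stricter validation than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_tags(text):
    """Return de-duplicated, sorted tags keyed by FollowTheMoney-style field names."""
    # Lower-case e-mail addresses so that the same address is only tagged once.
    emails = sorted({match.lower() for match in EMAIL_RE.findall(text)})
    # Keep only dotted quads whose four parts are all valid octets (0-255).
    ips = sorted(
        {
            ip
            for ip in IPV4_RE.findall(text)
            if all(int(part) <= 255 for part in ip.split("."))
        }
    )
    return {"emailMentioned": emails, "ipMentioned": ips}
```

Calling extract_tags on a piece of document text returns a dictionary whose values could then be attached to the entity as properties. Phone numbers and IBANs are left out here because they need checksum and country-aware validation that a short regular expression cannot provide.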

We're extremely happy to consider pull requests that add further types of linguistic and pattern-based extraction.

How do I add support for a new language to Aleph?

There are two aspects to adding support for a new language to Aleph: translating the user interface, and adapting the processing pipeline.

To add a new language for the Aleph user interface ("localisation"), register a user account on transifex.com and apply to become a member of the Aleph organisation. Start a new translation and translate all strings in the various Aleph components (followthemoney, aleph-ui, aleph-api and react-ftm).

If you wish to try out the translation on a local developer install of Aleph, please make sure you have the Transifex command-line client installed and configured. Then run the following sequence of commands:

make translate
cd ui/
npm run translate
cd ..
make translate

In terms of adapting the processing pipeline, go through the following items:

  • Check that the hard-coded language list in FollowTheMoney includes the three-letter code for your language (module followthemoney.types.language). If the language you are adding has multiple language codes, you may want to add a synonym mapping to the languagecodes Python library.

  • Check that the ingest-file service in its Dockerfile installs a Tesseract model for your language, if one is available in Ubuntu.

  • Check if a spaCy model is available for named entity extraction, and add it to the Dockerfile in ingest-file. Also make sure to adapt the INGESTORS_NER_MODELS environment variable in that file.

Can I run Aleph without using Docker?

Can Britain leave the European Union? Yes, it's possible; but complicated and will probably not make your life better in the way that you're expecting.

Aleph's document ingest services require a large number of command-line utilities and libraries to be installed within a certain version range in order to operate correctly. While we'd love to be able to ship e.g. a Debian package in the long term, the work required for this is significant.

Here's a guide for running Aleph without Docker on Debian with systemd.

How do I upgrade to a new version of Aleph?

Aleph does not perform updates and database migrations automatically. Once you have the latest version, you can run the commands below to upgrade the existing installation (i.e. apply changes to the database model or the search index format).

Before you upgrade, check the release notes to make sure you understand the latest release and know about new options and features that have been added.

The procedures for upgrading are different between production and development mode:

In development mode, make sure you've pulled the latest version from GitHub. We recommend you check out develop if you want to contribute code. Then, run:

make build
make upgrade

In production mode, make sure you perform a backup of the main database and the ElasticSearch index before running an upgrade.

Then, make sure you are using the latest docker-compose.yml file. You can do this by checking out the source repo, but really you just need that one file (and your config in aleph.env). Then, run:

docker-compose pull --parallel
# Terminate the existing install (enter downtime!):
docker-compose down
docker-compose up -d redis postgres elasticsearch
# Wait a minute or so while services boot up...
# Run upgrade:
docker-compose run --rm shell aleph upgrade
# Restart prod system:
docker-compose up -d

I get an error about missing tables. How do I fix it?

If running aleph commands gives you warnings about missing tables, you probably need to migrate your database to the latest schema. Try:

make upgrade

My import is stuck at 67%, what's wrong?

This often means you're not running an Aleph worker process - the component responsible for indexing documents, generating caches, cross-reference and email alerts. When you operate in development mode (using the make commands), this is the case by default.

To fix this issue in development mode, just run a worker:

make worker

If you're encountering this issue in production mode, try to check the worker log files to understand the issue.

How can I make imports run faster?

The included docker-compose configuration for production mode has no understanding of how powerful your server is. It will run just a single instance of each of the services involved in data imports: worker, ingest-file and convert-document.

The easiest way to speed up processing is to scale up those services. Make a shell script to start docker-compose with a set of arguments like this:

docker-compose up --scale ingest-file=8 --scale convert-document=4 --scale worker=2

The number of ingest-file processes could be the number of CPUs in your machine, and convert-document needs to be scaled up for imports with many office documents, but never higher than ingest-file.
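
If you'd rather derive those numbers than hard-code them, the rule of thumb above can be sketched in Python. The scale_arguments helper and the ratios used for convert-document and worker are assumptions chosen to reproduce the example command on an 8-CPU machine; tune them for your own workload.

```python
import os
import shlex


def scale_arguments(cpu_count, office_heavy=False):
    """Build a docker-compose command line from the rule of thumb above."""
    # One ingest-file process per CPU, as suggested in the text.
    ingest = max(1, cpu_count)
    # Assumed ratio: half of ingest-file for office-heavy imports, a quarter otherwise.
    convert = max(1, ingest // 2 if office_heavy else ingest // 4)
    # convert-document should never be scaled higher than ingest-file.
    convert = min(convert, ingest)
    # Assumed ratio: one worker per four ingest-file processes.
    worker = max(1, ingest // 4)
    return [
        "docker-compose", "up",
        "--scale", f"ingest-file={ingest}",
        "--scale", f"convert-document={convert}",
        "--scale", f"worker={worker}",
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(scale_arguments(os.cpu_count() or 1, office_heavy=True)))
```

The script only prints the command, so you can paste it into the shell script mentioned above once you are happy with the numbers.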

ElasticSearch will not start. What's wrong?

Most problems arise when the ElasticSearch container doesn't start up properly, or in time. If upgrade fails with an error like NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fb11b6ab0d0>: Failed to establish a new connection: [Errno 111] Connection refused, this is what happened.

You can find out specifically what went wrong with ES by consulting the logs for that container:

docker-compose -f docker-compose.dev.yml logs elasticsearch

You will almost certainly need to run the following before you build:

sysctl -w vm.max_map_count=262144

Or to set this permanently, in /etc/sysctl.conf add:

vm.max_map_count=262144
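
To check whether the setting has taken effect before you build, a small Python sketch like the following can help. It assumes a Linux host that exposes the value at /proc/sys/vm/max_map_count; the helper names are made up for this example.

```python
from pathlib import Path

# The value quoted in the sysctl command above.
REQUIRED_MAX_MAP_COUNT = 262144


def meets_requirement(raw_value):
    """Return True if the raw sysctl value is at least the required minimum."""
    return int(raw_value.strip()) >= REQUIRED_MAX_MAP_COUNT


def read_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Read the current kernel setting as text (Linux only)."""
    return Path(path).read_text()
```

On the Docker host, meets_requirement(read_max_map_count()) should return True once the sysctl change has been applied.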

Max file descriptors

If the error in your ES container contains:

elasticsearch_1 | [1]: max file descriptors [4096] for elasticsearch process is too low, increase to at least [65536]

Please see the relevant ElasticSearch documentation for this issue.

ElasticSearch has gone into read-only mode, why?

When the host machine disk is over 90% full, ElasticSearch can decide to stop writes to the index as an emergency measure. You would see errors like this:

AuthorizationException(403, 'cluster_block_exception', 'index [aleph-collection-v1] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];')

To fix this, try the following:

  • Run docker system prune on the host machine

  • Inside a make shell, run this curl command: curl -XPUT -H "Content-Type: application/json" http://elasticsearch:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'
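
If curl is not available in your environment, the same request can be sketched with the Python standard library. The URL, header and JSON body are taken from the curl command above; the function names are assumptions made for this example.

```python
import json
import urllib.request


def build_unblock_request(base_url="http://elasticsearch:9200"):
    """Build the PUT request that clears the read-only block on all indices."""
    # json.dumps turns None into null, matching the body of the curl command.
    body = json.dumps({"index.blocks.read_only_allow_delete": None}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_all/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def clear_read_only_block(base_url="http://elasticsearch:9200"):
    """Send the request and return the HTTP status code and response text."""
    with urllib.request.urlopen(build_unblock_request(base_url), timeout=10) as response:
        return response.status, response.read().decode("utf-8")
```

Run clear_read_only_block() from a Python prompt inside the shell container, where the elasticsearch host name resolves. Free up disk space first, otherwise ElasticSearch is likely to re-apply the block.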

How do I shut down Aleph?

When you're running in development mode, run:

make stop

In production mode, the equivalent command is:

docker-compose down --remove-orphans

Something else is wrong, what do I do?

Try turning it off and on again

If all else fails, you may just need to wait a little longer for the ES service to initialize before you run upgrade.

  1. Shut down all Aleph components: make stop

  2. Re-build the development docker containers: make build

  3. Apply the latest data migrations: make upgrade

  4. If that succeeds, in a new terminal run make web to launch the UI and API, and make worker to start a worker service.
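
Rather than guessing how long to wait for ElasticSearch, you can poll its port before running the upgrade. The following Python sketch is one way to do that; wait_for_port is a hypothetical helper, and an open port only shows that the service accepts connections, not that it has finished initializing.

```python
import socket
import time


def wait_for_port(host, port, timeout=120.0, interval=2.0):
    """Poll until a TCP connection succeeds; return False once the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connection means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused, timed out, or host not resolvable yet.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Inside the shell container you might call wait_for_port("elasticsearch", 9200) and only proceed with the upgrade once it returns True.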

Talk to the community

If that does not help, come visit the Aleph Slack and talk to the community to get support.

How can I clear parts of the redis cache?

# Quote the patterns so the shell does not expand the * before redis-cli sees it
redis-cli --scan --pattern 'aleph:authz:*' | xargs redis-cli del
redis-cli --scan --pattern 'aleph:metadata:*' | xargs redis-cli del

Can I customise the text of the about page?

The "About" section in the Navbar is based on a micro-CMS that is a bit like Jekyll. You can see the templates in the Aleph GitHub repository at aleph/pages. Pages can be customised by setting an environment variable, ALEPH_PAGES_PATH, to point to a directory with content pages.

All pages with the menu: true header set will be added to the Navbar, others will just be shown in the sidebar menu inside the "About" section.

How to add these pages to the running Aleph container is more of a Docker problem, so you might want to look into how to build a derived image for the api service, or just mount a path from the server as a volume inside the api.

How do I manage users and groups?

The options for managing users and groups in Aleph are very limited. This is because many installations delegate those tasks to a separate OAuth single sign-on service, such as Keycloak (an example configuration exists in contrib/keycloak).

That's why adding features like password resets, an admin UI for user creation or group management is not on the roadmap of the OCCRP developer team. However, other developers are encouraged to implement them and contribute the code.

Can I run Aleph on Kubernetes?

That's where it's most at home. We don't yet provide an official Helm chart (help wanted!), but if you hit one of the OCCRP staff up on Slack, we might be able to share (parts of) our manifests.

We aggressively use auto-scaling both on the cluster and pod level, which helps to combine fast imports with limited operational cost.

How can I connect to the database directly?

If you want to manipulate the SQL database directly (e.g. to edit a user, create or delete a group), you can connect to the PostgreSQL database.

In development mode, the database is exposed on the host at 127.0.0.1:15432. (User, password and database name are all aleph). You can also connect from the shell container:

make shell
psql $ALEPH_DATABASE_URI

The same can be done if you run an instance of the shell container in production mode.
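
If a connection fails, it can help to check which host, port and database a URI actually points at. Here is a small Python sketch using only the standard library. The example URI in the usage note is assembled from the development values given above and is an assumption about the exact URI format, not a value copied from the Aleph configuration.

```python
from urllib.parse import urlsplit


def describe_database_uri(uri):
    """Split a database URI into its parts, leaving out the password."""
    parts = urlsplit(uri)
    return {
        "host": parts.hostname,
        "port": parts.port,
        "user": parts.username,
        "database": parts.path.lstrip("/"),
    }
```

For example, describe_database_uri("postgresql://aleph:aleph@127.0.0.1:15432/aleph") reports host 127.0.0.1, port 15432, user aleph and database aleph. In the shell container you could pass in the value of ALEPH_DATABASE_URI instead.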

Why do entities have two-part IDs?

When looking at an Aleph URL, you may notice that every entity ID has two parts, separated by a dot (.), for example: deadbeef.3cd336a9859bdf2be917f561430f2a83e5da292b. The first part is the actual entity ID, while the second part is a signature (HMAC) assigned by the server when indexing the data.

The background for this is a security mitigation. There are various places in Aleph where a user can actually assign arbitrary IDs to new entities, including the collection _bulk API. In these cases, an attacker could attempt to inject an ID already used by another collection and thus overwrite its data.

To avoid this, each entity ID is assigned a namespace ID suffix for the collection it is submitted to. This way, multiple collections can have entities with the same ID without overwriting each other's data.

When using the Aleph API, you can submit either form: a version of the entity with its signature, or without, via the _bulk API. The signature will be fixed up automatically.
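
The idea can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Aleph's actual implementation: the per-collection key, the choice of SHA-1 (picked only because the example signature above is 40 hex characters long) and both function names are hypothetical, and the sketch assumes the entity ID itself contains no dot.

```python
import hashlib
import hmac


def sign_entity_id(entity_id, namespace_key):
    """Return "<entity_id>.<signature>" for the given collection key."""
    # Drop any existing signature so signed and unsigned IDs give the same result.
    base_id = entity_id.split(".", 1)[0]
    digest = hmac.new(namespace_key, base_id.encode("utf-8"), hashlib.sha1)
    return f"{base_id}.{digest.hexdigest()}"


def verify_entity_id(signed_id, namespace_key):
    """Check that a signed ID carries the signature for this collection."""
    expected = sign_entity_id(signed_id, namespace_key)
    # Constant-time comparison avoids leaking how much of the signature matched.
    return hmac.compare_digest(signed_id.encode("utf-8"), expected.encode("utf-8"))
```

Because the signature depends on the collection key, the same base ID submitted to two collections ends up as two different signed IDs, and an ID signed for one collection does not verify in another. Re-signing an already signed ID gives the same result, which mirrors the "fixed up automatically" behaviour described above.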

Why don't you use a graph database?

The benefit of storing Aleph as a graph would be running path queries and quick pattern matching ("Show me all the companies owned by people who have the same name as a politician").

The downsides are:

  • User-controlled access to Aleph must always go through security checks, and we haven't really found a graph DB that would handle the security model of Aleph without generating incredibly complex (and slow) queries.

  • While some graph databases have Lucene built in, that doesn't replace ElasticSearch. Simple search is a killer use case and needs to be really good and offer advanced features like facets, text normalisation and index sharding.

  • There's a lot of data in some Aleph instances. It's not clear how many graph databases can respond to queries against billions of entities within an HTTP response cycle.

  • Our data is usually not well integrated, so the graph is less dense than you might think, unless we fully pre-compute all possible entity duplicates as graph links.

All of this said, we'd really love to hear about any experiments regarding this. In OCCRP we sometimes materialise partial Aleph graphs into Neo4J and let analysts browse them via Linkurious. We're hoping to look into dgraph as a possible backend at some point.

We have well-defined graph semantics for FollowTheMoney data and you can export any data (including documents like emails) in Aleph into various graph formats (RDF, Neo4J, and GEXF for Gephi).
