Fixed a bug in the tokenisation of the search index that prevented numbers from being searchable. The fix only applies to collections (re-)indexed after this release.
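For illustration, a tokenizer that keeps digit runs as searchable tokens might look like this (a hypothetical sketch, not Aleph's actual analyzer):

```python
import re

def tokenize(text):
    # Split on non-alphanumeric characters, keeping digit runs as
    # tokens so that numbers (e.g. invoice or registration numbers)
    # remain searchable rather than being dropped.
    return [t.lower() for t in re.findall(r"[^\W_]+", text)]

print(tokenize("Invoice 20412 from ACME Ltd."))
# ['invoice', '20412', 'from', 'acme', 'ltd']
```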
Improved scoring of cross-references based on a regression model derived from user judgements, and tuned the way Aleph compares properties in the "Mentions" tab of documents.
For Outlook email files (`.msg`), the RTF variant of the message body will now be indexed as an attachment to the message.
Inline the helm chart into the Aleph repository; it is now shipped with the main application. This requires updating your helm configuration if you've been using the previous charts.
Loads of bug fixes for small UI issues.
Re-design the Investigation UI to guide the user through common actions.
Refactor much of the state handling in the React app.
Bug fixes on ingestors.
Allow entities in one collection to reference those in another.
Re-name personal datasets to "Investigations" in the UI.
Introduce user interfaces for profiles, an interactive way to de-duplicate data. Fix various bugs in profile logic in the backend.
Get rid of the global scoped search, show separate search bars closer to the subject of the search in the user interface.
Introduce structured logging of JSON objects to Stackdriver.
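A minimal sketch of emitting one JSON object per log line (illustrative only; the actual Stackdriver integration has its own formatter and fields):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    # Serialise each record as a single JSON object, the shape that
    # structured-logging backends such as Stackdriver can ingest.
    def format(self, record):
        return json.dumps({
            "severity": record.levelname,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("aleph")
log.addHandler(handler)
log.warning("reindex started")
# emits: {"severity": "WARNING", "message": "reindex started"}
```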
Polish data loading in the user interface and fix bugs in various features.
Work on Arabic/RTL i18n, nested directionality.
Fix bugs in the OIDC logout flow.
Pairwise judgement API to replace xref decisions API.
Aleph 3.9.5 uses OpenID Connect to largely automate the configuration of delegated login. Previous versions of Aleph configured an OAuth2 client explicitly, which also required coding custom handlers for each OAuth provider. The new system also addresses a number of potential security issues.
Unfortunately, the transition requires some incompatible changes:
You now need to configure `ALEPH_OAUTH_METADATA_URL`, which sets the endpoint used by OIDC to self-configure.
Examples of valid metadata URLs for services like Google, Azure, Amazon Cognito and Keycloak can be found in the example configuration file.
The existing option `ALEPH_OAUTH_AUTHORIZE_URL` is no longer needed, and `ALEPH_OAUTH_SCOPE` is now optional.
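To show what the self-configuration relies on, here is a trimmed, made-up OIDC discovery document of the kind served from a provider's `.well-known/openid-configuration` endpoint (the URLs are examples, not a real provider's):

```python
import json

# An illustrative OIDC discovery document; with ALEPH_OAUTH_METADATA_URL
# set, the client fetches one of these and discovers the endpoints
# itself instead of needing them configured individually.
metadata = json.loads("""
{
  "issuer": "https://auth.example.com/realms/aleph",
  "authorization_endpoint": "https://auth.example.com/auth",
  "token_endpoint": "https://auth.example.com/token",
  "end_session_endpoint": "https://auth.example.com/logout"
}
""")

print(metadata["authorization_endpoint"])
# https://auth.example.com/auth
```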
The database IDs generated for users and groups will be different. For users, the ID should be re-written the first time a user logs in after the upgrade. Groups, on the other hand, may require a SQL intervention to adapt their IDs. For example, with a Keycloak provider, the change would be:
```sql
UPDATE role SET foreign_id = REPLACE(foreign_id, 'kc:', 'group:') WHERE type = 'group';
```
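The effect of that statement can be reproduced on a toy copy of the `role` table (using an in-memory SQLite database just for demonstration; production Aleph uses PostgreSQL):

```python
import sqlite3

# Simulate the Keycloak group ID rewrite on a toy "role" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE role (foreign_id TEXT, type TEXT)")
conn.executemany(
    "INSERT INTO role VALUES (?, ?)",
    [("kc:editors", "group"), ("kc:alice", "user")],
)
# Only group rows are rewritten; user IDs are fixed on next login.
conn.execute(
    "UPDATE role SET foreign_id = REPLACE(foreign_id, 'kc:', 'group:') "
    "WHERE type = 'group'"
)
rows = dict(conn.execute("SELECT type, foreign_id FROM role"))
print(rows)
# {'group': 'group:editors', 'user': 'kc:alice'}
```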
Beyond these breaking changes, some other differences are notable:
Logging out of Aleph will now also log a user out of the OAuth provider, where supported (e.g. Keycloak, Azure).
If a user is blocked or deleted while using the site, their session will be disabled by the worker backend within an hour. (This can be forced by running
Changes unrelated to OAuth:
EntitySets no longer contain an `entities` array of all their members; use the sub-resource instead.
Multiple bug fixes in UI related to i18n.
Move the file ingestor service `ingest-file` to its own repository to decouple versioning and CI/CD.
Show transliterated names of non-Latin entities in the user interface.
Refactor query serialisation, remove in-database query log.
Fix out-of-memory errors in cross-referencing.
Extensive bug fixes in the mapping UI.
New data exports feature that lets users make offline exports of search and cross-reference results.
New home page, based on the stupid CMS we introduced for the about section.
Ability to map entities into lists via the UI and alephclient.
Tons of bug fixes in UI and backend.
UI for managing lists of entities within a dataset. This lets you make sub-sets of a dataset, e.g. "The Family", "Lawyers" or "Core companies".
Ability to cross-reference a collection of documents against structured data collections using `Mention` schema stubs. Requires a dataset reingest before it takes effect.
New internationalisation mechanism for the React bits, using JSON-formatted translation files.
Move the linkages API ("god entities" / record linkage) to use entity sets instead of its own database model.
Remove soft-deletion for some model types (permissions, entities, alerts, mappings).
Date histogram facet and filtering tool on search results.
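As a toy illustration of what a date histogram facet computes, result dates can be bucketed by interval, here by year (the real facet is computed in the search backend, not client-side):

```python
from collections import Counter
from datetime import date

# Illustrative result dates; a date histogram counts hits per bucket.
dates = [date(2019, 3, 1), date(2019, 7, 12), date(2020, 1, 5)]
histogram = Counter(d.year for d in dates)

print(sorted(histogram.items()))
# [(2019, 2), (2020, 1)]
```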
Added example code for how to add text processors to Aleph.
Re-worked collection stats caching to avoid super slow requests when no cache is present.
Tons of bug fixes.
Introduce EntitySets, as user-curated sets of ... entities! All diagrams are now entitysets, as will be timelines and bookmarks.
Refactor queue and processing code for the Aleph worker.
"Expand node" support in network diagrams pulls relevant connections from the backend and shows them to the user while browsing a network diagram.
Correctly handle the use of multi-threading when using Google Cloud Storage Python client libraries.
We've re-worked the way entities are aggregated before they are loaded into the search index. This was required because Aleph is becoming more interactive and needs to handle non-bulk operations better. It also improves metadata handling, like which user uploaded a document, or when an entity was last updated. Aleph will now always keep a full record of the entities in the SQL database, whichever way they are submitted. To this end, we've migrated to `followthemoney-store` (i.e. balkhash 2.0). This will start to apply to existing collections when they are re-ingested or re-indexed.
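The aggregation step can be sketched as merging entity fragments, i.e. partial statements about the same entity ID, into one record before indexing (a simplification of what `followthemoney-store` does in SQL; property names here are examples):

```python
# Merge property fragments for one entity ID, preserving order and
# de-duplicating values, so the index sees a single complete record.
def merge_fragments(fragments):
    merged = {}
    for frag in fragments:
        for prop, values in frag.items():
            merged.setdefault(prop, [])
            for value in values:
                if value not in merged[prop]:
                    merged[prop].append(value)
    return merged

frags = [
    {"name": ["ACME Ltd."]},
    {"name": ["ACME Ltd.", "ACME Limited"], "country": ["gb"]},
]
print(merge_fragments(frags))
# {'name': ['ACME Ltd.', 'ACME Limited'], 'country': ['gb']}
```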
Aleph has two new APIs for doing a collection `reingest` and `reindex`. The existing `process` collection API is gone.
`alephclient` now supports running `delete` on a collection.
Operators can expedite the rollout of the new backend by running `aleph reingest-casefiles` and `aleph reindex-casefiles` to re-process all existing personal datasets.
Numerous UI fixes make the table editor and network diagrams much smoother.
We've introduced a table editor in the user interface for manually editing entities in personal datasets.
A graph expand API for entities returns all entities adjacent to an entity for network-based exploration of the data.
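A toy sketch of what such an expand call returns, namely the entities adjacent to a given one in an edge list (the entity IDs and response shape here are made up for illustration):

```python
# Illustrative edge list between entity IDs.
edges = [
    ("person-1", "company-1"),
    ("person-1", "company-2"),
    ("person-2", "company-1"),
]

def expand(entity_id):
    # Return all entities sharing an edge with entity_id.
    adjacent = set()
    for source, target in edges:
        if source == entity_id:
            adjacent.add(target)
        elif target == entity_id:
            adjacent.add(source)
    return sorted(adjacent)

print(expand("company-1"))
# ['person-1', 'person-2']
```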
Linkages, a new data model. A linkage is essentially an annotation on an entity saying it is the same as some other entities (in other datasets). This would, for example, let you group together all mentions of a politician into a single profile. Linkages are currently created via the Xref UI, which now has a ‘review mode’.
In the future, profiles (i.e. the composite of many linkages) will start showing up in different places in the UI, introducing an increasingly stronger notion of data integration. Because linkages are based on a reporter's judgement, they belong to either a) the reporter, or b) a group of users, so they are always somewhat contextualised, not fully public.
Our hope is also that the data collected via linkages will provide training material for a machine learning-based approach to cross-referencing.
Users who employ OAuth may need to change their settings to define an `ALEPH_OAUTH_HANDLER` in their `aleph.env`. By default, the following handlers are supported:
Run VIS2 / network diagrams on Aleph as an experimental feature.
Fixed two SECURITY ISSUES in the software: one that would let an attacker enumerate registered users, and another that could be exploited for XSS with a forged document. They were discovered by two friendly hackers from blbec.online, who kindly reported them to us.
Introduced synonames, an extension to our install of ElasticSearch that allows us to expand names into cultural transliterations. So for example, a search for `Christoph` will now also search `Кристоф`, even though they aren't literally the same name. This should increase recall for cross-cultural queries. The whole thing was a project from @Aparna, generating these aliases from Wikidata entries.
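Conceptually, the expansion behaves like mapping a name onto its transliterated aliases at query time (a hand-rolled sketch; the real feature is an ElasticSearch extension, and the aliases below are examples of what might be generated from Wikidata):

```python
# Illustrative alias table: lower-cased name -> transliterations.
SYNONAMES = {
    "christoph": ["кристоф", "kristof"],
}

def expand_query(term):
    # Search for the literal term plus any known transliterations.
    return [term] + SYNONAMES.get(term.lower(), [])

print(expand_query("Christoph"))
# ['Christoph', 'кристоф', 'kristof']
```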
These changes come alongside a lot of UI and backend polishing, so things should be much more smooth all around.
The goal of `aleph` 3.0.0 is to harmonise the handling of data inside the index. Instead of having different formats and mappings for documents, entities, table rows and document pages, there is now just one type of index object: an entity. This means that document-based data is now completely 'translated' to the `followthemoney` ontology used by `aleph` (meaning that, in theory, each page of a document and each row of a table is now a node in the object graph).
In order to accomplish this, a complete re-index is required in all cases. The recommended path of migrating from a 2.x.x installation is this set of commands in an aleph container shell:

```bash
# Re-create the indexes:
aleph resetindex
# Apply a database schema change:
aleph upgrade
# Re-index collections and documents:
aleph repair --entities
```
Be advised that any data loaded via the entity mapping mechanism will need to be re-loaded after this. It is also worth noting that at OCCRP, we have now started generating mapped data via the `followthemoney` command-line tool, and are using `alephclient` to bulk-load the resulting stream of entities into the system. This has proven to be significantly quicker than the built-in mapping process.
ALEPH_REDIS_EXPIRE are now
`ALEPH_OCR_VISION_API` is now `OCR_VISION_API`; it enables use of the Google Vision API for optical character recognition.
The `/api/2/collections/<id>/ingest` API now only accepts a single file, or no file (which will create a folder). The response body contains only the ID of the generated document. The status code on success is now 201, not 200.
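A hedged sketch of client-side handling for the new response contract (the function and field names below are assumptions for illustration, not part of the API itself):

```python
# Handle an ingest response under the new contract: success is
# 201 Created, and the body carries only the new document's ID.
def handle_ingest_response(status, body):
    if status != 201:
        raise RuntimeError("ingest failed with status %s" % status)
    return body["id"]

print(handle_ingest_response(201, {"id": "doc-123"}))
# doc-123
```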