Importing structured data
Instructions for projecting structured data into Aleph's knowledge graph using a data mapping.
Aleph is commonly used to import data tables (like a companies registry, list of persons of interest, or a set of government contracts) that a user wants to search and browse.
It does this by mapping tabular source data to the Follow the Money (FtM) data model. Aleph defines entities (such as People, Companies, Assets or Court Cases) and the relationships between them, such as Family relations, business interests like Ownership or Directorship, or other links like Sanctions and Payments.
To load structured data into Aleph, an entity mapping file needs to be created. A mapping file uses YAML syntax to describe how information from a table can be projected to FtM entities.
Mappings can only be applied to tabular data. Other file types, such as PDFs, e-mails or PowerPoint presentations, are converted into Follow the Money entities using an automated process (ingest-file) that extracts only limited semantics.

Getting started

In order to map data to a Follow the Money model, you will need the following: a source data table, a tool to process the mapping, and a mapping file (to direct the tool).
Source data can be either a CSV (comma-separated values) file using the UTF-8 character encoding, or a valid connection string to connect to a SQL database. Using SQL as a source also lets you perform JOINs within the database while mapping data.
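Both source types are declared at the start of a query. A minimal sketch of the two variants (the URL, connection string and table name below are placeholders, not real endpoints):

```yaml
# Reading from a CSV file over HTTP:
queries:
  - csv_url: https://example.org/companies.csv

# Reading from a SQL database instead (the table key names the
# source table to read from):
queries:
  - database: postgresql://user:password@localhost/registry
    table: companies
```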
In order to execute a mapping, you need either a running instance of the Aleph server, or you need to install the ftm command-line utility. Using the command-line tool is recommended for playing around with a mapping while you are refining it. When using the server to map data, you must make sure that the source data is located at a place that is accessible to the Aleph worker, such as an HTTP URL for a CSV file or a database that Aleph can connect with.
To write a mapping file, you will first need to identify:
  • the types of entities included in the dataset (e.g. People, Companies, Directorships)
  • the properties that describe each entity (e.g. the name of a Company, or the birthDate of a Person)
  • and the field or combination of fields that can be used to generate a key (this is used to uniquely identify each entity in the dataset). Find more details on these requirements below.

A simple mapping example

Writing a mapping file is often an iterative process: you can start with a minimal mapping and gradually expand it to refine the data model.
Below is a simple mapping file. It downloads a list of British members of parliament and transforms them into FtM Person entities.
brexitonians.yml
```yaml
gb_parliament_57:
  queries:
    - csv_url: http://bit.ly/uk-mps-csv
      entities:
        member:
          schema: Person
          keys:
            - id
          properties:
            name:
              column: name
```
The mapping file specifies a dataset name (gb_parliament_57) and uses a single query to pull data from a CSV file (the dataset is from the excellent EveryPolitician project). The query generates a Person entity, mapping the CSV's id column to a key, and the CSV's name column to the FtM property name.
Try saving this file to your computer and executing it with the ftm command-line tool: ftm map brexitonians.yml.
Aleph will now have a dataset called gb_parliament_57, which is a list of MP names. To use FtM language: the data has been mapped to a Person entity with a single property (name).

Assigning additional properties

However, the source CSV file has far more detail on each MP, from e-mail addresses to political party affiliation. To include this data in gb_parliament_57, we need to map each CSV column to the respective property as defined in the FtM schema. The properties vary based on the type of entity (a Person will have different properties from a Company).
To find out what properties exist for a particular schema, you can check out the YAML-based schema definitions on GitHub or the diagrams here. (As seen in the diagram, FtM schemata have a hierarchical structure. Entities can be assigned all the properties of their parent entities. For instance, Person is a child of LegalEntity, which is a child of Thing. As such, a Person can be assigned all of the properties of a LegalEntity and a Thing.)
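Since properties are inherited, a mapping can freely mix properties defined at different levels of the hierarchy. A short sketch (the column names are hypothetical):

```yaml
member:
  schema: Person
  properties:
    name:        # defined on Thing
      column: name
    email:       # inherited from LegalEntity
      column: email
    birthDate:   # specific to Person
      column: dob
```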
Here's an updated mapping file, which maps additional columns from the CSV file to properties in the Person schemata (email, nationality, and alias).
brexitoids.yml
```yaml
gb_parliament_57:
  queries:
    - csv_url: http://bit.ly/uk-mps-csv
      entities:
        member:
          schema: Person
          keys:
            - id
          properties:
            name:
              column: name
            alias:
              column: sort_name
            email:
              column: email
            nationality:
              literal: GB
```

Generating multiple entities

Now that we've generated a detailed record for each MP, we might want to add their party membership. First, let's map a party entity (Line 12 onwards):
brexicles.yml
```yaml
gb_parliament_57:
  queries:
    - csv_url: http://bit.ly/uk-mps-csv
      entities:
        member:
          schema: Person
          keys:
            - id
          properties:
            name:
              column: name
        party:
          schema: Organization
          keys:
            - group_id
          properties:
            name:
              column: group
```
When run, this mapping will create twice as many entities as before: the MPs and their parties. Note how each party is generated multiple times (once for each of its members). When you're using the command-line, you will need to perform entity aggregation to merge these duplicates. Running the mapping inside the Aleph server will do this automatically.

Creating relationships between entities

What this does not yet do, however, is explicitly create a link between each MP and their party. In Follow the Money parlance, links (or relationships) are just another entity type. Note how, on lines 5 and 12 in the above mapping, we are assigning a temporary name for the member and the party. We can use these references when generating a third entity, the Membership:
brexosaurs.yml
```yaml
gb_parliament_57:
  queries:
    - csv_url: http://bit.ly/uk-mps-csv
      entities:
        member:
          schema: Person
          keys:
            - id
          properties:
            name:
              column: name
        party:
          schema: Organization
          keys:
            - group_id
          properties:
            name:
              column: group
        membership:
          schema: Membership
          keys:
            - id
            - group_id
          properties:
            organization:
              entity: party
            member:
              entity: member
```
When loaded into Aleph, this mapping will show browsable entities for each member and party, and list the memberships on each of their profile pages. You can also export this data to a more conventional node-graph data model for use in Neo4j or Gephi.

A more realistic complex mapping

The companies registry of the Republic of Moldova is an open dataset that consists of three separate source files that, taken together, produce a graph of company information, ownership and management:
  • companies.csv with companies' details like name, id, address, incorporation date;
  • directors.csv with names of directors and their details;
  • founders.csv also with names and details of the founding entities (i.e. major shareholders).
The mapping example given below describes the relationship between the companies stored in companies.csv and directors and founders, stored in directors.csv and founders.csv respectively.
moldova.yml
```yaml
# The name of the mapping, which will become the foreign_id of the
# collection on Aleph.
md_companies:
  queries:
    - csv_url: http://assets.data.occrp.org/tools/aleph/fixtures/md_companies/companies.csv
      # Entity definition section.
      entities:
        # This is an arbitrary entity name that will be used throughout this
        # query section of the mapping.
        company:
          # Entity schema type from the Follow the Money model.
          schema: Company
          # List of columns that are used as unique identifiers for each
          # record. Could also be viewed as record aggregation when there are
          # several records for the same company that differ only in, for
          # example, the address field. In this case the resulting entity
          # will contain address values merged from different source data
          # records.
          keys:
            - IDNO
            - Denumirea_completă
          # A set of properties that describe the chosen schema type.
          # For each property, one or several columns can be used to get the
          # value from. A literal string value can be given instead of a
          # column value, e.g. for a country code.
          properties:
            name:
              column: Denumirea_completă
            registrationNumber:
              column: IDNO
            incorporationDate:
              column: Data_înregistrării
            address:
              column: Adresa
            jurisdiction:
              literal: MD
            legalForm:
              column: Forma_org
            status:
              column: Statutul
    # With this query, director records are loaded and the Directorship
    # relation is defined between directors and companies.
    - csv_url: http://assets.data.occrp.org/tools/aleph/fixtures/md_companies/directors.csv
      entities:
        # Again a Company entity is constructed using the same set of keys
        # as in the query above, in order to be referred to in the
        # Directorship definition.
        company:
          schema: Company
          keys:
            - Company_IDNO
            - Company_Name
        director:
          schema: LegalEntity
          keys:
            - Company_Name
            - Company_IDNO
            - Director
          properties:
            name:
              column: Director
              # Only include records that have a non-empty `Director` column.
              required: true
        directorship:
          schema: Directorship
          # To avoid key collisions between directors and directorships, an
          # additional literal string value is given with `key_literal`.
          key_literal: Directorship
          keys:
            - Company_Name
            - Company_IDNO
            - Director
          properties:
            # Linking together directors and companies: the director and
            # organization properties of the Directorship interval contain
            # references to the director and company entities that were
            # constructed previously.
            director:
              entity: director
              required: true
            organization:
              entity: company
              required: true
            role:
              literal: director
    # Similar to directors: in order to link founders to companies through
    # an Ownership relation, the company and founder entities have to be
    # declared again in each query section.
    - csv_url: http://assets.data.occrp.org/tools/aleph/fixtures/md_companies/founders.csv
      entities:
        company:
          schema: Company
          keys:
            - Company_IDNO
            - Company_Name
        founder:
          schema: LegalEntity
          keys:
            - Company_Name
            - Company_IDNO
            - Founder
          properties:
            name:
              column: Founder
              required: true
        ownership:
          schema: Ownership
          key_literal: Ownership
          keys:
            - Company_Name
            - Company_IDNO
            - Founder
          properties:
            owner:
              entity: founder
              required: true
            asset:
              entity: company
              required: true
            role:
              literal: founder
    # If there are extra tables with data that has to be linked to companies,
    # the same set of keys is repeated and the relevant properties are
    # declared.
    - csv_url: http://assets.data.occrp.org/tools/aleph/fixtures/md_companies/licensed.csv
      entities:
        company:
          schema: Company
          keys:
            - Company_IDNO
            - Company_Name
          properties:
            sector:
              column: Denumire
    - csv_url: http://assets.data.occrp.org/tools/aleph/fixtures/md_companies/unlicensed.csv
      entities:
        company:
          schema: Company
          keys:
            - Company_IDNO
            - Company_Name
          properties:
            sector:
              column: Denumire
            caemCode:
              column: Cod_CAEM
```

Generating unique keys

When creating entities from a dataset, each generated entity must be assigned a unique ID. This ID is computed from the keys defined in the mapping file. When writing the file, it is therefore necessary to understand what combination of source columns from the original table can be used to uniquely identify an entity in the context of a dataset. Failing to do so will result in "key collisions", a problem that produces a variety of errors which are sometimes hard to diagnose:
  • Entities' properties contain values from different unrelated records (e.g. addresses, dates of birth);
  • Wrong entity types (Persons are generated as LegalEntities instead)
  • Related entities are merged together in various ways;
  • Error messages are shown when trying to load the mapping (e.g. Cannot index abstract schema or No common ancestor: ... )
For example, given a table of people with their personal details, the mapping below might not always be valid, because different people can have the same first and last name (and thus a key collision will happen).
bad.yml
```yaml
entities:
  person:
    schema: Person
    keys:
      - FirstName
      - LastName
    properties:
      firstName:
        column: FirstName
      lastName:
        column: LastName
      birthDate:
        column: DoB
```
The solution is to include in the list of keys as many properties as is necessary and sufficient to eliminate any intersection between unrelated entities of the same type.
good.yml
```yaml
entities:
  person:
    schema: Person
    keys:
      - FirstName
      - LastName
      - DoB
    properties:
      firstName:
        column: FirstName
      lastName:
        column: LastName
      birthDate:
        column: DoB
```
Keys for events (Ownership, Sanction, Family) will usually be a combination of the keys of the entities that such an event links together.
combined.yml
```yaml
entities:
  company:
    schema: Company
    keys:
      - company_name
  owner:
    schema: Person
    keys:
      - owner_name
  ownership:
    schema: Ownership
    keys:
      - company_name
      - owner_name
```

Loading a mapping from a SQL database

In the examples shown above, data has been loaded from CSV files. The mapping system can also connect to a SQL database using SQLAlchemy. Depending on the database system used, further Python drivers (such as psycopg2 or mysqlclient) might be required for specific backends.
When loading from a SQL database, you can begin your query with a specification of the tables you wish to access, and how they should be joined:
database.yml
```yaml
za_cipc:
  queries:
    - database: postgresql://localhost/cipc
      tables:
        - table: za_cipc_companies
          alias: companies
        - table: za_cipc_directors
          alias: directors
      joins:
        - left: companies.regno
          right: directors.company_regno
```
Please note that when you query more than one table at the same time, all the column names used in the mapping need to be qualified with the table name or alias, i.e. companies.name or directors.name instead of just name.
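For example, with the companies alias defined above, entity definitions would use the qualified names throughout. A sketch, assuming the za_cipc_companies table has regno and name columns:

```yaml
entities:
  company:
    schema: Company
    keys:
      - companies.regno
    properties:
      name:
        column: companies.name
```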
Mappings support substitution of environment variables. Instead of storing your database credentials to a mapping file, you might want to reference an environment variable like ${DATABASE_URI} in the mapping file, and define the username and password externally.
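For example (assuming a DATABASE_URI variable is set in the environment where the mapping is executed):

```yaml
za_cipc:
  queries:
    - database: ${DATABASE_URI}
```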

Filtering source data

When loading data from a mapping, you may sometimes want to filter the data so that only part of a table is imported. FtM mappings will only let you do equality filters; anything more complex than that should be considered data cleaning and be done prior to executing the mapping.
filters.yml
```yaml
gb_parliament_57:
  queries:
    - csv_url: http://bit.ly/uk-mps-csv
      filters:
        group: "Conservative"
      filters_not:
        gender: "male"
      entities:
        member:
          schema: Person
          keys:
            - id
          properties:
            name:
              column: name
```

Extra functions for property values

Mapping a column value to a property is normally a straight copy operation:
```yaml
[...]
properties:
  name:
    column: person_name
```
There are some tricks available, however:
hacks.yml
```yaml
# Setting multiple values for a property:
properties:
  name:
    columns:
      - person_name
      - maiden_name

# Merging values ad hoc:
properties:
  name:
    columns:
      - first_name
      - last_name
    join: " "

# Setting a constant value:
properties:
  country:
    literal: "SS"

# Defining a date format:
properties:
  birthDate:
    column: dob
    format: "%d.%m.%Y"
```
In general, we are not seeking to incorporate further data cleaning functionality into the mapping process. It's a good idea to design your data pipeline such that loading entities via a mapping is preceded by a data cleanup step in which the necessary normalisations are applied.

Loading a mapping

Mapping files can be loaded in two different ways — either from an aleph shell:
```bash
aleph bulkload mapping.yml
```
or by using a combination of followthemoney-util and alephclient command-line tools:
```bash
ftm map mapping.yml | ftm aggregate | alephclient write-entities -f dataset_name
```