Crawling data with Memorious
Memorious is the web crawling framework that is part of the Aleph toolkit. It can be used to periodically retrieve structured and unstructured data from the web and load it into Aleph.
Memorious (named after Funes the Memorious) is a lightweight, distributed web scraping toolkit. It can:
  • Maintain an overview of a fleet of crawlers
  • Scrape and store both structured and unstructured data from the web
  • Load the scraped data into Aleph in a variety of ways
  • Schedule crawler execution at regular intervals
  • Store execution information and error messages
  • Distribute scraping tasks across multiple machines
  • Make crawlers modular and simple tasks re-usable
  • Get out of your way as much as possible

Memorious management interface

Memorious has a neat user interface to monitor the status of your crawler fleet at a glance. The interface also lets you start, stop and inspect crawlers with ease.

How it works

Memorious crawlers consist of a YAML configuration file and, optionally, Python functions that define custom crawler operations (a sketch of such a function follows the example below). Frequent operations, such as making HTTP requests or writing data into a database, can be handled by built-in operations that ship with Memorious. Memorious also provides handy utilities to load the scraped data into Aleph for further processing.
A really simple crawler configuration in Memorious might look like this:
```yaml
# Scraper for the OCCRP web site.
name: occrp_web_site
description: 'Organized Crime and Corruption Reporting Project'
# Uncomment to run this scraper automatically:
# schedule: weekly
pipeline:
  init:
    # This first stage will get the ball rolling with a seed URL.
    method: seed
    params:
      urls:
        - https://occrp.org
    handle:
      pass: fetch

  fetch:
    # Download the seed page.
    method: fetch
    params:
      # These rules specify which pages should be scraped or included:
      rules:
        and:
          - domain: occrp.org
    handle:
      pass: parse

  parse:
    # Parse the scraped pages to find if they contain additional links.
    method: parse
    params:
      # Additional rules to determine if a scraped page should be stored or not.
      # In this example, we're only keeping PDFs, Word files, etc.
      store:
        or:
          - mime_group: archives
          - mime_group: documents
    handle:
      store: store
      # This makes it a recursive web crawler:
      fetch: fetch

  store:
    # Push the crawled documents into an Aleph collection.
    method: aleph_emit
    params:
      collection: occrp_web_site
```
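
Beyond the built-in methods, a stage's method can also point to a custom Python function. As a minimal sketch, assuming an illustrative module operations.py and a hypothetical strip_prefix parameter (neither is part of Memorious itself): a custom operation receives the crawler context and the data dictionary emitted by the previous stage, and hands its result onward via context.emit.

```python
# operations.py -- illustrative module for a custom Memorious operation
def clean_title(context, data):
    """Tidy up a page title before passing the record on to the
    next pipeline stage."""
    # Values declared under `params:` in the YAML stage are
    # available through context.params.
    prefix = context.params.get("strip_prefix", "")

    title = data.get("title", "")
    if prefix and title.startswith(prefix):
        data["title"] = title[len(prefix):].strip()

    # Hand the (possibly modified) data to whichever stage is
    # configured under `handle:` in the crawler YAML.
    context.emit(data=data)
```

A stage would then reference this function as `method: operations:clean_title`, following the module:function notation Memorious uses for custom methods.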

Getting started

To learn more about Memorious, check out the documentation and the example crawlers in the project repository at https://github.com/alephdata/memorious.