Crawling data with Memorious
Memorious is a web crawling framework and part of the Aleph toolkit. It can be used to periodically retrieve structured and unstructured data from the web and load it into Aleph. With Memorious, you can:
- Maintain an overview of a fleet of crawlers
- Scrape and store both structured and unstructured data from the web
- Load the scraped data to Aleph in a variety of ways
- Schedule crawler execution in regular intervals
- Store execution information and error messages
- Distribute scraping tasks across multiple machines
- Make crawlers modular and simple tasks re-usable
- Get out of your way as much as possible
Memorious has a neat user interface to monitor the status of your crawler fleet at a glance. The interface also lets you start, stop and inspect crawlers with ease.
Memorious crawlers consist of a YAML configuration file and some (optional) Python functions that define crawler operations. A number of utility operations come packaged with Memorious, so frequent tasks like making HTTP requests or writing data into a database can be handled by built-in operations. Memorious also provides handy utilities to load the scraped data into Aleph for further processing.
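As a sketch of what a custom operation looks like: Memorious stage functions follow a `(context, data)` convention, where `context` exposes stage parameters, HTTP helpers, and logging, and `context.emit()` passes a data dict on to the next stage. The function name and the fields it touches here are illustrative, not part of the library's API:

```python
# Hypothetical custom Memorious operation. The (context, data) signature
# is the Memorious stage convention; the rest is illustrative.
def clean_title(context, data):
    """Normalize a scraped title before handing it to the next stage."""
    title = data.get("title") or ""
    # Collapse runs of whitespace and stray newlines from the HTML source.
    data["title"] = " ".join(title.split())
    # Forward the cleaned record to whatever stage is configured next.
    context.emit(data=data)
```

In the YAML configuration, such a function would be referenced by its module path (e.g. `method: mymodule:clean_title`, where `mymodule` is a placeholder for your own package).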
A really simple crawler configuration in Memorious might look like this:
```yaml
# Scraper for the OCCRP web site.
name: occrp_web_site
description: 'Organized Crime and Corruption Reporting Project'
# Uncomment to run this scraper automatically:
# schedule: weekly
pipeline:
  init:
    # This first stage will get the ball rolling with a seed URL.
    method: seed
    params:
      urls:
        - https://occrp.org
    handle:
      pass: fetch
  fetch:
    # Download the seed page
    method: fetch
    params:
      # These rules specify which pages should be scraped or included:
      rules:
        and:
          - domain: occrp.org
    handle:
      pass: parse
  parse:
    # Parse the scraped pages to find if they contain additional links.
    method: parse
    params:
      # Additional rules to determine if a scraped page should be stored or not.
      # In this example, we're only keeping PDFs, word files, etc.
      store:
        or:
          - mime_group: archives
          - mime_group: documents
    handle:
      store: store
      # this makes it a recursive web crawler:
      fetch: fetch
  store:
    # Store the crawled documents to a directory
    method: directory
    params:
      path: /data/results
```
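Conceptually, the stages above form a loop: pages are fetched, parsed for links (which feed back into fetching) and for documents (which are stored). The control flow can be sketched as a toy, in-memory crawl; this is not Memorious code, and the page graph, URLs, and MIME types below are invented for illustration:

```python
# Toy model of the recursive fetch -> parse -> store loop described by a
# Memorious pipeline. Everything here (pages, MIME types) is made up.
PAGES = {
    "https://occrp.org": {
        "mime": "text/html",
        "links": ["https://occrp.org/report.pdf", "https://other.example/x"],
    },
    "https://occrp.org/report.pdf": {"mime": "application/pdf", "links": []},
}
DOCUMENT_MIMES = {"application/pdf"}  # stand-in for `mime_group: documents`

def crawl(seed):
    stored, seen, queue = [], set(), [seed]
    while queue:
        url = queue.pop()
        if url in seen or not url.startswith("https://occrp.org"):
            continue  # the `domain: occrp.org` rule in the fetch stage
        seen.add(url)
        page = PAGES.get(url)
        if page is None:
            continue
        if page["mime"] in DOCUMENT_MIMES:
            stored.append(url)        # the `store` stage keeps documents
        queue.extend(page["links"])   # parsed links recurse back to fetch
    return stored
```

The `seen` set plays the same role as Memorious's deduplication of already-visited URLs: without it, a recursive crawler would revisit pages endlessly.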
To learn more about Memorious, you can: