Memorious (named after Funes the Memorious) is a light-weight distributed web scraping toolkit. It can:
Maintain an overview of a fleet of crawlers
Scrape and store both structured and unstructured data from the web
Load the scraped data to Aleph in a variety of ways
Schedule crawler execution in regular intervals
Store execution information and error messages
Distribute scraping tasks across multiple machines
Make crawlers modular and simple tasks re-usable
Get out of your way as much as possible
Memorious has a neat user interface to monitor the status of your crawler fleet at a glance. The interface also lets you start, stop and inspect crawlers with ease.
Memorious crawlers consist of a YAML configuration file and some (optional) Python functions that define crawler operations. Memorious ships with built-in utility operations for frequent tasks such as making HTTP requests or writing data into a database. It also provides utilities to load the scraped data into Aleph for further processing.
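As a rough illustration of what one of those optional Python functions could look like, here is a minimal sketch. It assumes that an operation is a plain function receiving a `context` object and a `data` dictionary, and that it passes results on by calling `context.emit`; check the Memorious documentation for the exact interface. The names `tag_domain` and `FakeContext` are hypothetical and exist only for this example.

```python
from urllib.parse import urlparse


def tag_domain(context, data):
    """Hypothetical custom stage: annotate each item with its host name."""
    url = data.get("url")
    if not url:
        # Nothing to annotate; drop the item instead of passing it on.
        return
    enriched = dict(data)
    enriched["domain"] = urlparse(url).hostname
    # Hand the item to whichever stage the YAML `handle` block maps to `pass`
    # (assumed emit signature, see the note above).
    context.emit(rule="pass", data=enriched)


class FakeContext:
    """Stand-in for the real context object, used only to try the function locally."""

    def __init__(self):
        self.emitted = []

    def emit(self, rule="pass", data=None):
        self.emitted.append((rule, data))


if __name__ == "__main__":
    ctx = FakeContext()
    tag_domain(ctx, {"url": "https://occrp.org/en/investigations"})
    print(ctx.emitted)
```

Keeping the function free of global state makes it easy to exercise with a stand-in context like this before wiring it into a crawler configuration.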
A really simple crawler configuration in Memorious might look like this:
# Scraper for the OCCRP web site.
name: occrp_web_site
description: 'Organized Crime and Corruption Reporting Project'
# Uncomment to run this scraper automatically:
# schedule: weekly
pipeline:
  init:
    # This first stage will get the ball rolling with a seed URL.
    method: seed
    params:
      urls:
        - https://occrp.org
    handle:
      pass: fetch
  fetch:
    # Download the seed page
    method: fetch
    params:
      # These rules specify which pages should be scraped or included:
      rules:
        and:
          - domain: occrp.org
    handle:
      pass: parse
  parse:
    # Parse the scraped pages to find if they contain additional links.
    method: parse
    params:
      # Additional rules to determine if a scraped page should be stored or not.
      # In this example, we're only keeping PDFs, word files, etc.
      store:
        or:
          - mime_group: archives
          - mime_group: documents
    handle:
      store: store
      # this makes it a recursive web crawler:
      fetch: fetch
  store:
    # Load the crawled documents into an Aleph collection
    method: aleph_emit
    params:
      collection: occrp_web_site
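To make the flow between stages easier to follow, the toy simulation below traces how the `handle` entries of this configuration route items from one stage to the next. It is a sketch for intuition only, not how Memorious is implemented: the `SITE` dictionary is invented stand-in data, the stage logic is heavily simplified, and the `fetched` set is there just so the toy loop terminates.

```python
from collections import deque

# The `handle` blocks of the configuration, reduced to a routing table:
# stage name -> {rule emitted by the stage -> next stage}.
HANDLERS = {
    "init": {"pass": "fetch"},
    "fetch": {"pass": "parse"},
    "parse": {"store": "store", "fetch": "fetch"},
    "store": {},
}

# Hypothetical stand-in for the web: URL -> (mime group, outgoing links).
SITE = {
    "https://occrp.org": ("web", ["https://occrp.org/report.pdf", "https://occrp.org/about"]),
    "https://occrp.org/about": ("web", ["https://occrp.org"]),
    "https://occrp.org/report.pdf": ("documents", []),
}


def run_stage(stage, url):
    """Return the (rule, url) pairs a simplified version of the stage would emit."""
    if stage in ("init", "fetch"):
        return [("pass", url)]
    if stage == "parse":
        mime_group, links = SITE[url]
        emitted = [("fetch", link) for link in links]
        # Mirrors the `store` rule: keep only archives and documents.
        if mime_group in ("archives", "documents"):
            emitted.append(("store", url))
        return emitted
    return []  # `store` is the end of the pipeline


def crawl(seed):
    """Walk the pipeline from the seed URL and return the URLs that reach `store`."""
    queue = deque([("init", seed)])
    fetched, stored = set(), []
    while queue:
        stage, url = queue.popleft()
        if stage == "fetch":
            if url in fetched:
                continue  # already visited; skip to avoid looping forever
            fetched.add(url)
        if stage == "store":
            stored.append(url)
        for rule, out in run_stage(stage, url):
            next_stage = HANDLERS[stage].get(rule)
            if next_stage:
                queue.append((next_stage, out))
    return stored


if __name__ == "__main__":
    print(crawl("https://occrp.org"))
```

The `fetch: fetch` entry under `parse` is what sends newly discovered links back to the `fetch` stage, which is why the configuration behaves as a recursive crawler.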
To learn more about Memorious, you can: