mp_explore_source_template
Template with Playwright to build a source.
Check out Pre-requisites for more general information on what you need to know and check before starting.
Arguments
- display_browser (Optional): When enabled, a browser window is opened displaying the actions being performed.
Template
Structure
All of the files contain comments like # FIXME: ... with what needs to be updated in this template. Do not release the package unless you have addressed all of those comments.
Aside from those, you should change the name of the mp_explore_source_template folder to something that makes sense for your source. The convention is to name your package mp_explore_source_IDENTIFIER.
The files to look out for are the following:
- pyproject.toml defines the Python project. This allows you to build and publish the package.
- mp_explore_source_template/__init__.py contains the logic of your source.
  - The fetch_data function contains the scraping logic.
  - The metadata function contains basic metadata about the source.
- tests/test_pipeline.py contains a simple pipeline that you can use to check the output of your source.
  - See Testing for more information.
Testing
There is a testing pipeline you can use to test your source. It allows you to check two things:
- That your source runs correctly, and
- The data output is what you expect.
To run the pipeline, use the harness script. It sets up a clean virtual environment and installs all of the necessary dependencies. The environment is reused between runs, but you can reset it at any time with the --reset flag.
python3 ./tests/harness.py [--reset]
Publishing
Before publishing your package, send an e-mail to contact@fsfe.org with the following subject line: [MP-Explore] Package <YOUR_PACKAGE_NAME> for <SOURCE_OF_DATA>.
It’s helpful for us to stay informed about all published packages to maintain a good overview of the ecosystem. We may also consider integrating it into the official MP Explore packages.
Pre-requisites
MP Explore and its ecosystem are based on Python and Pandas. This template integrates Playwright to simplify the scaffolding process. If you do not know Python, you might be interested in The Python Tutorial.
MP Explore Sources are, at their core, Python modules that obtain data from official sources programmatically, however they see fit. The fetch_data function is only expected to produce a DataFrame with the fetched information.
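As a rough illustration of that last point, the sketch below turns scraped records into a Pandas DataFrame. The helper name records_to_frame, the column names, and the sample data are all hypothetical; check the FIXME comments in the template for the columns your source actually needs.

```python
import pandas as pd

# Hypothetical column names; the real template may expect different ones.
COLUMNS = ["name", "group", "email"]


def records_to_frame(records):
    """Turn scraped records (one dict per MP) into a DataFrame with fixed columns."""
    # Passing `columns` fixes the column order and fills absent keys with NaN.
    return pd.DataFrame(records, columns=COLUMNS)


# Made-up sample data standing in for whatever the scraper collected.
scraped = [
    {"name": "Ada Example", "group": "Group A", "email": "ada@parliament.example"},
    {"name": "Ben Sample", "group": "Group B"},  # e-mail not found on the page
]

frame = records_to_frame(scraped)
print(frame)
```

Fixing the column list up front keeps the output shape stable even when a page is missing a field for some MPs.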
First of all, you need to choose which parliament you want to create sources for. From our experience, the websites and official sources of national and regional parliaments tend to be the easiest to work with, but your mileage may vary.
After choosing a parliament, you need to look for official and up-to-date pages that contain, at least, the following information:
- Members of Parliament (MPs) with their respective names, group affiliation, and e-mail address.
- Committees and which MPs are part of which committee and in which capacity.
Without both, the core filtering capabilities of MP Explore are diminished.
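To see why both pieces matter, here is a minimal sketch of filtering MPs by committee with Pandas. It is not how MP Explore implements filtering; the two tables, their column names, and the members_of helper are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical MP list: name, group affiliation, and e-mail address.
members = pd.DataFrame([
    {"name": "Ada Example", "group": "Group A", "email": "ada@parliament.example"},
    {"name": "Ben Sample", "group": "Group B", "email": "ben@parliament.example"},
])

# Hypothetical committee memberships: which MP sits where, and in which capacity.
memberships = pd.DataFrame([
    {"name": "Ada Example", "committee": "Budget", "role": "Chair"},
    {"name": "Ben Sample", "committee": "Budget", "role": "Member"},
    {"name": "Ada Example", "committee": "Digital Affairs", "role": "Member"},
])


def members_of(committee):
    """Return contact details and roles for everyone on the given committee."""
    # A left join keeps every membership row and attaches the MP's details.
    joined = memberships.merge(members, on="name", how="left")
    return joined[joined["committee"] == committee].reset_index(drop=True)


budget = members_of("Budget")
print(budget[["name", "role", "email"]])
```

If either table is missing, this kind of query cannot be answered: without memberships there is nothing to filter on, and without the MP list there are no contact details to return.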
When looking for pages that contain such information, you might stumble upon machine-readable versions of a subset of that information. If you consider using such sources, make sure that the information is up-to-date, correct, and maintained.
Most likely, you will not find this information on a single page, so you will need to navigate different pages and experiment with how to extract information from them. You can look into existing MP Explore Sources (mp_explore_source_*) to get an idea of how they are currently implemented.
Normally, a basic knowledge of HTML/XML and Web Technologies is required to work out how to extract such information. If you do not know much about either, you might be interested in the Core Learning Modules of the MDN Web Development Course.
Once you learn how to extract this information from the various pages you found, ideally ending up with a few JavaScript functions that can be easily abstracted, you can start editing the fetch_data function!
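As a starting point, the sketch below shows one way such a JavaScript function could be run through Playwright and its result tidied up in Python. The CSS selectors, the scrape_members and clean_records helpers, and the record keys are all hypothetical; the template's own fetch_data may be organised differently.

```python
# JavaScript that runs inside the page and returns one plain object per table row.
# The selectors are hypothetical; adapt them to the page you are scraping.
EXTRACT_MPS_JS = """
() => Array.from(document.querySelectorAll("table.members tbody tr")).map(row => ({
    name: row.querySelector("td.name")?.textContent ?? "",
    group: row.querySelector("td.group")?.textContent ?? "",
    email: row.querySelector("a[href^='mailto:']")?.getAttribute("href") ?? "",
}))
"""


def scrape_members(url: str, display_browser: bool = False) -> list[dict]:
    """Open the page, run the JavaScript extractor, and return its raw result."""
    # Imported here so the rest of the module can be used without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Assumption: showing the browser simply means launching it non-headless.
        browser = p.chromium.launch(headless=not display_browser)
        try:
            page = browser.new_page()
            page.goto(url)
            return page.evaluate(EXTRACT_MPS_JS)
        finally:
            browser.close()


def clean_records(rows: list[dict]) -> list[dict]:
    """Normalise whitespace, strip 'mailto:' prefixes, and drop rows without a name."""
    cleaned = []
    for row in rows:
        # Collapse runs of whitespace (including newlines) into single spaces.
        record = {key: " ".join(str(value or "").split()) for key, value in row.items()}
        record["email"] = record.get("email", "").removeprefix("mailto:").lower()
        if record.get("name"):
            cleaned.append(record)
    return cleaned
```

Keeping the JavaScript in a named constant and the clean-up in plain Python makes each part easy to test separately: clean_records can be checked against hand-written rows without launching a browser.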