mp_scrape_source_template
Template with Playwright to build a source.
Arguments
- (Optional) display_browser: When enabled, a browser window is opened displaying the actions being performed.
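As a rough illustration of how an argument like display_browser could be wired to Playwright, the sketch below maps it onto Playwright's headless launch option. The helper names launch_options and page_title are hypothetical and not part of the template, and treating display_browser as the inverse of headless is an assumption about how the template behaves.

```python
def launch_options(display_browser: bool = False) -> dict:
    """Translate display_browser into keyword arguments for Playwright's launch()."""
    # Assumption: showing the browser window means launching it non-headless.
    return {"headless": not display_browser}


async def page_title(url: str, display_browser: bool = False) -> str:
    """Open a page and return its title (hypothetical helper for illustration)."""
    # Imported lazily so this module can be imported without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(**launch_options(display_browser))
        page = await browser.new_page()
        await page.goto(url)
        title = await page.title()
        await browser.close()
        return title
```

Running page_title with display_browser=True through asyncio.run should open a visible Chromium window for the duration of the call, which can help when debugging selectors.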
Template
Structure
All of the files contain comments like # FIXME: ... describing what needs to be updated in this template. Do not release the package until you have addressed all of those comments.
Aside from those, you should rename the mp_scrape_source_template folder to something that makes sense for your source. The convention is to name your package mp_scrape_source_IDENTIFIER.
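A quick way to confirm both points before a release is a small check script. The following is a minimal sketch, not something shipped with the template: the function names are hypothetical, and it assumes the FIXME comments live in .py and .toml files and that hidden directories (for example a virtual environment in a dot-folder) can be skipped.

```python
from pathlib import Path

TEMPLATE_NAME = "mp_scrape_source_template"
# Built in two parts so this script does not flag itself when stored in the repository.
MARKER = "# " + "FIXME"
# Assumption: only these file types carry the template comments.
CHECKED_SUFFIXES = {".py", ".toml"}


def find_fixmes(root: Path) -> list[tuple[Path, int, str]]:
    """Return (relative path, line number, text) for every line with the marker."""
    hits = []
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        # Skip hidden files and directories such as .git or .venv.
        if any(part.startswith(".") for part in relative.parts):
            continue
        if not path.is_file() or path.suffix not in CHECKED_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if MARKER in line:
                hits.append((relative, number, line.strip()))
    return hits


def template_folder_remains(root: Path) -> bool:
    """True if the package folder still carries the template's name."""
    return (root / TEMPLATE_NAME).is_dir()


def report(root: Path) -> int:
    """Print every open item and return how many were found."""
    problems = 0
    for path, number, line in find_fixmes(root):
        print(f"{path}:{number}: {line}")
        problems += 1
    if template_folder_remains(root):
        print(f"Rename the {TEMPLATE_NAME} folder to mp_scrape_source_IDENTIFIER")
        problems += 1
    return problems
```

Calling report(Path(".")) from the repository root lists each remaining item; a return value of 0 means nothing was found by this check.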
The files to look out for are the following:
- pyproject.toml defines the Python project. This allows you to build and publish the package.
- mp_scrape_source_template/__init__.py contains the logic of your source.
  - The fetch_data function contains the scraping logic.
  - The metadata function contains basic metadata about the source.
- tests/test_pipeline.py contains a simple pipeline that you can use to check the output of your source.
  - See Testing for more information.
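To make the split between the two functions concrete, here is a dependency-free sketch of the idea. It does not reproduce the template's real signatures, base classes, or return types, which should be taken from mp_scrape_source_template/__init__.py itself; the dictionary keys and the placeholder row are assumptions made purely for illustration.

```python
def metadata() -> dict:
    """Describe the source: what it is called and which arguments it accepts."""
    # Hypothetical fields; the real template defines its own metadata structure.
    return {
        "name": "Example source",
        "identifier": "example",
        "arguments": {"display_browser": "Show the browser window while scraping"},
    }


def fetch_data(display_browser: bool = False) -> list[dict]:
    """Return the scraped records, one dictionary per record."""
    # A real source would drive Playwright here. A fixed placeholder row keeps
    # the sketch runnable without a browser or network access.
    return [{"name": "PLACEHOLDER NAME", "group": "PLACEHOLDER GROUP"}]
```

Keeping the description of the source separate from the scraping logic means a pipeline can inspect what a source offers without launching a browser.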
Testing
There is a testing pipeline you can use to test your source. It allows you to check two things:
- That your source runs correctly, and
- The data output is what you expect.
To run the pipeline, use the harness script. It sets up a clean virtual environment and installs all of the necessary dependencies. The environment is reused between runs, but you can reset it at any time with the --reset flag.
python3 ./tests/harness.py [--reset]
Publishing
Before publishing your package, send an e-mail to contact@fsfe.org with the following subject line: [MP-Scrape] Package <YOUR_PACKAGE_NAME> for <SOURCE_OF_DATA>.
It’s helpful for us to stay informed about all published packages to maintain a good overview of the ecosystem. We may also consider integrating it into the official MP Scrape packages.
Follow the Python Packaging User Guide instructions to build and publish your package. For this you will need a PyPI account.