A parser to transform TED .xml files to OCDS-compliant JSON.
Linus Sehn c228fb9865 Some rudimentary dev instructions 3 weeks ago
scripts Add PostgreSQL loading mechanism to script 3 weeks ago
tedective_api Rename `api.py` to `app.py` to avoid name collision 3 weeks ago
tests Rename project to `tedective-api` 3 weeks ago
.dockerignore Add docker-related files for API 3 weeks ago
.drone.yml Fix deployment 3 weeks ago
.gitignore Add some files to `.gitignore` 3 weeks ago
Dockerfile Rename `api.py` to `app.py` to avoid name collision 3 weeks ago
README.md Some rudimentary dev instructions 3 weeks ago
cpv2013.csv Add current csv codes 3 weeks ago
docker-compose.prod.yml Also mount dumps 3 weeks ago
docker-compose.yml Update docker-compose files 3 weeks ago
flake.lock Update and simplify flake 2 months ago
flake.nix Rename project to `tedective-api` 3 weeks ago
poetry.lock Add some packages 3 weeks ago
pyproject.toml Rename project to `tedective-api` 3 weeks ago
shell.nix Add some packages to devEnv 2 months ago


What is this about?

TL;DR This is an API that you can use to find out who buys what in the EU. Check out the API docs at https://api.tedective.org/docs

Another aim is to bring the XML files published by the European Union into shape by transforming them to the Open Contracting Data Standard (OCDS).

Despite a range of previous efforts to parse and analyse TED data, no existing offering fulfils all of the following requirements for the provision of TED data:

  1. It is free for non-commercial use
  2. It is up-to-date (updates at least on a monthly basis) and deduplicated
  3. It is available as OCDS-compliant JSON (via bulk downloads and REST API)

Sustainably providing long-term access to European tendering data in a way that fulfils these three requirements could enable numerous applications of interest to civil society, European business and beyond, and could greatly enhance the transparency of business activity in Europe. A range of interesting questions could be answered with this data if it were available in a well-documented and easy-to-understand format that is interoperable with tendering data published elsewhere.

What's inside

TED XML notices are downloaded and parsed into several databases (PostgreSQL and Neo4j) that you can query using the API located at https://api.tedective.org

Finished

In its current state, the repo contains:

  • Script (download.sh) for downloading and 1-to-1 mapping the TED notices from XML into large JSON files (.jsonl). Each file holds all the data for one month - one notice per line.
  • Script (parser.py) for parsing these JSON files into a PostgreSQL database according to a simplified data model.
  • An interactive API (fastapi) that offers multiple routes for querying the database. Its documentation is at https://api.tedective.org/docs
  • Database dumps for each version of the parsed tendering data, and scripts to create them in a development environment (dump_*.sh)
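
The "one notice per line" layout of the monthly .jsonl files means they can be read as a stream without loading a whole month into memory. The following is a minimal sketch of that idea and is not taken from parser.py; the function names and the file name in the usage comment are hypothetical, and the structure of an individual notice is not assumed.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def iter_notices(path: Path) -> Iterator[dict[str, Any]]:
    """Yield one parsed notice per non-empty line of a .jsonl file."""
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between notices
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the offending line so a broken notice is easy to find
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc


def count_notices(path: Path) -> int:
    """Count the notices in a file without keeping them in memory."""
    return sum(1 for _ in iter_notices(path))


# Usage (hypothetical file name):
# print(count_notices(Path("2022-01.jsonl")))
```

Because the reader is a generator, a database loader could consume it in batches instead of building one large list.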

Work in progress

  • CI/CD for database and API updates
  • alembic migrations to ease the extension of the PostgreSQL database in the future
  • Authentication
  • Party deduplication (explicitly specifying weights for the different, properly transliterated reconciliation attributes; possibly something more involved using simple ML, see the dedupe library)
  • Fully OCDS-compliant output

Development

To start developing locally, run:

# This starts neo4j and postgres
docker-compose up -d

# This starts the API in reload mode
poetry install
poetry shell
uvicorn tedective_api.app:app --reload

# At this point, no data will be parsed. So, open another terminal and run:
./scripts/download.sh
parse

Then you can navigate to http://localhost:8000 to see the interactive API documentation.
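
Once the API is running, its routes can also be listed programmatically. The sketch below assumes the app still serves FastAPI's default OpenAPI document at /openapi.json (this can be disabled or moved, so treat it as an assumption). The helper names are hypothetical.

```python
import json
from urllib.request import urlopen

# HTTP methods that may appear as keys of an OpenAPI path item
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def fetch_openapi_schema(base_url: str = "http://localhost:8000") -> dict:
    """Download the OpenAPI document (assumes the default /openapi.json path)."""
    with urlopen(f"{base_url}/openapi.json", timeout=10) as response:
        return json.load(response)


def list_routes(schema: dict) -> list[tuple[str, str]]:
    """Return (METHOD, path) pairs from an OpenAPI document, sorted by path."""
    routes = []
    for path, path_item in schema.get("paths", {}).items():
        for key in path_item:
            if key in HTTP_METHODS:  # ignore non-method keys such as "parameters"
                routes.append((key.upper(), path))
    return sorted(routes, key=lambda route: (route[1], route[0]))


# Usage, with the API running locally:
# for method, path in list_routes(fetch_openapi_schema()):
#     print(method, path)
```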

Resources

There are a number of resources that are helpful in developing and maintaining this parser. The most important ones are:

  1. The OCDS Schema reference which describes in detail what an OCDS release is and what it should look like
  2. The OCDS European Union Profile which maps fields in TED notices to fields in an OCDS release (Github, Internal Tooling).
  3. The OJS's standard forms for public procurement site, which lists all the forms used in any given TED notice.
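
As a rough orientation for what an OCDS release looks like, the sketch below builds a minimal release as a Python dict and checks the top-level fields that, to my reading of the OCDS 1.1 release schema, are required (ocid, id, date, tag, initiationType). All values, including the ocid prefix and the party details, are placeholders, and the helper name is hypothetical. The OCDS Schema reference remains the authority on the details.

```python
# Top-level fields required by the OCDS 1.1 release schema (as I understand it)
REQUIRED_RELEASE_FIELDS = ("ocid", "id", "date", "tag", "initiationType")


def missing_release_fields(release: dict) -> list[str]:
    """Return the required top-level fields that are absent from a release."""
    return [field for field in REQUIRED_RELEASE_FIELDS if field not in release]


# Placeholder values only; not real TED data
example_release = {
    "ocid": "ocds-xxxxxx-000001",  # placeholder prefix plus a contracting process id
    "id": "000001-01",
    "date": "2022-01-01T00:00:00Z",
    "tag": ["tender"],
    "initiationType": "tender",
    "parties": [
        {"id": "buyer-1", "name": "Example Authority", "roles": ["buyer"]},
    ],
    "buyer": {"id": "buyer-1", "name": "Example Authority"},
    "tender": {"id": "000001", "title": "Example tender"},
}

print(missing_release_fields(example_release))
```

A check like this covers only the presence of fields. Full compliance would need validation against the published JSON Schema.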

Previous efforts

  • TheyBuyForYou (a project by "a consortium of 10 leading companies, universities, research centres, government departments and local authorities in the UK, Norway, Italy, Spain and Slovenia" funded by the EU Horizon 2020 programme. The project cost the EU around €3.3 million and was developed over two years until December 2020. It is now largely dysfunctional and out-of-date. Some code seems to be publicly available but is provided without an explicit license)

  • DigiWhist's opentender.eu (seems mostly abandoned, last substantial commit 2 years ago; last data update 2020/21 which coincides roughly with the founding of TenderX, a private for-profit tender/company data offering by one of the DigiWhist researchers)

    NB: This dataset seems to be used by the OCDS tool for scraping globally available OCDS data releases. We should aim to replace this with the data parsed by us once it's ready.

    NB: There was apparently some collaboration between the OKFN and the lead researcher now behind TenderX (Mihály Fazekas, here is his personal website).

  • OpenTED (seems abandoned, last commit 2015; didn't work with OCDS as it wasn't developed at the time)

  • opented (very old attempt at parsing TED data that didn't turn out to work)

  • OpenTED Browser (an academic paper about )

  • ExtracTED (according to the README, this was used to parse data between 2014 and 2016; last commit 5 years ago)

  • eu-hack (last commit 15 months ago; the author is a data scientist at Amazon and the target format is CSV; I could not get the code to parse more recent TED data without errors)