Use a database solution that makes sense #124

Open
opened 2025-11-14 10:18:36 +00:00 by fkobi · 3 comments
Owner

Right now we are using a relational database with no relations...

Right now we are using a relational database with no relations...
fkobi added this to the 2.0 -- A thought-out backend redesign milestone 2025-11-14 10:18:36 +00:00
fkobi added the
component
database
prio
mid
labels 2025-11-14 10:18:36 +00:00
fkobi self-assigned this 2025-11-14 10:18:36 +00:00
Author
Owner

Since we are making a database around a shell tool I suggest we use a plain database filesystem, just like in the 60s.
Let's use the Ockham's razor and reap these benefits:

  • less dependencies
  • improved efficiency
  • UNIX idiomacy

From Python's perspective it will just call a black box, collect it's stdout as logs (and remove them if all suceeds).


Only NoSQL makes sense here and since we have just one table with structured entries that runs on a single host I think only LevelDB would make sense here and I think a POSIX filesystem is a better solution.

EDIT: I will write an email with a full proposal

Since we are making a database around a shell tool I suggest we use a plain database filesystem, just like in the 60s. Let's use the Ockham's razor and reap these benefits: - less dependencies - improved efficiency - UNIX idiomacy From Python's perspective it will just call a [black box](https://git.fsfe.org/reuse/api/commit/9d1caef29eb2954ff31c193d6a7e077c4e9d3566), collect it's stdout as logs (and remove them if all suceeds). --- Only NoSQL makes sense here and since we have just one table with structured entries that runs on a single host I think only LevelDB would make sense here and I think a POSIX filesystem is a better solution. EDIT: I will write an email with a full proposal
fkobi added a new dependency 2025-11-14 12:46:35 +00:00
Owner

Note that REUSE API initially used Sqlite as DB. Linus migrated this to PostgreSQL as performance degraded significantly. IIRC this was mainly due to concurrent read/write operations with large chunks of data (e.g. the lint report or sbom of large projects).

I am afraid using the FS will not be better than SQLite was previously but rather complicate and customise the development for future contributors and maintainers, while using a standard DB is... well... more standard.

Note that REUSE API initially used Sqlite as DB. Linus migrated this to PostgreSQL as performance degraded significantly. IIRC this was mainly due to concurrent read/write operations with large chunks of data (e.g. the lint report or sbom of large projects). I am afraid using the FS will not be better than SQLite was previously but rather complicate and customise the development for future contributors and maintainers, while using a standard DB is... well... more standard.
Author
Owner

Database server vs database file (AKA postgres vs sqlite)

SQLite is a serverless database, so just a file.
The bulk of the data is two stdout files (for this repo around 7 and 31 KB in size).
As I understand it every time that these files need to be updated a flock -x is aquired.
Now imagine that we have 3k entries (so 6k locks) per this .sqlite file.

Database servers have a daemon that manages a way more complex data structure that is able to optimize these locks (and many more things).
Database servers are more efficient than database files for concurrent writes and big data, it is indisputable.

Catch: why this is irrelevant here

No matter what relational database is used for storing non-relational data -- it is still inefficient & unidiomatic (not to say wrong).

Filesystem vs relational database

I think for storing data on a single host a filesystem will be more efficient than any SQL solution and possibly more efficient than any database software!
My reasoning here is the software complexity: relations are the most crucial in relational databases and not having to implement them removes the main complex factor (and also purpose) of relational databases.

On the other hand using any database server to access what essentially is:

  • HEAD hash (40 char string)
  • database status (2 bits of information)
  • reuse lint rval (int)
  • reuse lint output (text file)
  • reuse spdx output (text file)
  • last_access (timestamp)

will introduce overhead that could be avoided if a plain filesystem is used.
As a bonus, the timestamp could be saved as HEAD hash file's mtime.

Bonus: FS VS SQLite for locks

Again with 3k entries: the whole SQLite database can be locked by 6 entries.
The filesystem, if we were to assume the structure proposed by update-entry.sh would look as follows:

$ ls -lh git.fsfe.org/reuse/api/
total 48K
-rw-r--r-- 1 fkobi fkobi   41 Nov 17 10:58 HEAD
-rw-r--r-- 1 fkobi fkobi 6.7K Nov 17 10:58 lint
-rw-r--r-- 1 fkobi fkobi    2 Nov 17 10:58 return-value
-rw-r--r-- 1 fkobi fkobi  31K Nov 17 10:58 spdx

There could be symbolic locks on per-entry basis (.locked files, as portage & pacman do it), so there would be no need to any FS-level blocks. Worst case scenario we symbolically lock 3k entries by creating 3k empty files!


I am afraid not using a database software (the most classic approach)...

I agree that it is less widely used so people generally have less experience with that approach.
I acknowledge that it is the biggest downside of that proposal.
However besides making the database more independent from repos.json I think it is the second most significant improvement proposed in 2.0.

### Database server vs database file (AKA postgres vs sqlite) SQLite is a serverless database, so just a file. The bulk of the data is two stdout files (for this repo around 7 and 31 KB in size). As I understand it every time that these files need to be updated a `flock -x` is aquired. Now imagine that we have 3k entries (so 6k locks) per this `.sqlite` file. Database servers have a daemon that manages a way more complex data structure that is able to optimize these locks (and many more things). Database servers are more efficient than database files for concurrent writes and big data, it is indisputable. #### Catch: why this is irrelevant here No matter what relational database is used for storing non-relational data -- it is still inefficient & unidiomatic (not to say wrong). ### Filesystem vs relational database I think for storing data on a single host a filesystem will be more efficient than any SQL solution and possibly more efficient than any database software! My reasoning here is the software complexity: **relations** are the most crucial in **relational** databases and not having to implement them removes the main complex factor (and also purpose) of relational databases. On the other hand using any database server to access what essentially is: - HEAD hash (40 char string) - database status (`2 bits of information`) - `reuse lint` rval (int) - `reuse lint` output (text file) - `reuse spdx` output (text file) - last_access (timestamp) will introduce overhead that could be avoided if a plain filesystem is used. As a bonus, the timestamp could be saved as HEAD hash file's `mtime`. #### Bonus: FS VS SQLite for locks Again with 3k entries: the **whole SQLite database** can be locked by 6 entries. The filesystem, if we were to assume the structure proposed by [`update-entry.sh`](https://git.fsfe.org/reuse/api/src/branch/new-files) would look as follows: ```console $ ls -lh git.fsfe.org/reuse/api/ total 48K -rw-r--r-- 1 fkobi fkobi 41 Nov 17 10:58 HEAD -rw-r--r-- 1 fkobi fkobi 6.7K Nov 17 10:58 lint -rw-r--r-- 1 fkobi fkobi 2 Nov 17 10:58 return-value -rw-r--r-- 1 fkobi fkobi 31K Nov 17 10:58 spdx ``` There could be symbolic locks on per-entry basis (`.locked` files, as `portage` & `pacman` do it), so there would be no need to any FS-level blocks. Worst case scenario we symbolically lock 3k entries by creating 3k empty files! --- > I am afraid **not using a database software** (the most classic approach)... I agree that it is less widely used so people generally have less experience with that approach. I acknowledge that it is the biggest downside of that proposal. However besides making the database more independent from repos.json I think it is the **second most significant** improvement proposed in 2.0.
fkobi added the enhancement label 2025-12-17 11:17:11 +00:00
Sign in to join this conversation.