Use a database solution that makes sense #124
Notifications
Due Date
No due date set.
Blocks
#116 Implement CRUD
reuse/api
Reference: reuse/api#124
Reference in New Issue
Block a user
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Right now we are using a relational database with no relations...
Since we are making a database around a shell tool I suggest we use a plain database filesystem, just like in the 60s.
Let's use the Ockham's razor and reap these benefits:
From Python's perspective it will just call a black box, collect it's stdout as logs (and remove them if all suceeds).
Only NoSQL makes sense here and since we have just one table with structured entries that runs on a single host I think only LevelDB would make sense here and I think a POSIX filesystem is a better solution.
EDIT: I will write an email with a full proposal
Note that REUSE API initially used Sqlite as DB. Linus migrated this to PostgreSQL as performance degraded significantly. IIRC this was mainly due to concurrent read/write operations with large chunks of data (e.g. the lint report or sbom of large projects).
I am afraid using the FS will not be better than SQLite was previously but rather complicate and customise the development for future contributors and maintainers, while using a standard DB is... well... more standard.
Database server vs database file (AKA postgres vs sqlite)
SQLite is a serverless database, so just a file.
The bulk of the data is two stdout files (for this repo around 7 and 31 KB in size).
As I understand it every time that these files need to be updated a
flock -xis aquired.Now imagine that we have 3k entries (so 6k locks) per this
.sqlitefile.Database servers have a daemon that manages a way more complex data structure that is able to optimize these locks (and many more things).
Database servers are more efficient than database files for concurrent writes and big data, it is indisputable.
Catch: why this is irrelevant here
No matter what relational database is used for storing non-relational data -- it is still inefficient & unidiomatic (not to say wrong).
Filesystem vs relational database
I think for storing data on a single host a filesystem will be more efficient than any SQL solution and possibly more efficient than any database software!
My reasoning here is the software complexity: relations are the most crucial in relational databases and not having to implement them removes the main complex factor (and also purpose) of relational databases.
On the other hand using any database server to access what essentially is:
2 bits of information)reuse lintrval (int)reuse lintoutput (text file)reuse spdxoutput (text file)will introduce overhead that could be avoided if a plain filesystem is used.
As a bonus, the timestamp could be saved as HEAD hash file's
mtime.Bonus: FS VS SQLite for locks
Again with 3k entries: the whole SQLite database can be locked by 6 entries.
The filesystem, if we were to assume the structure proposed by
update-entry.shwould look as follows:There could be symbolic locks on per-entry basis (
.lockedfiles, asportage&pacmando it), so there would be no need to any FS-level blocks. Worst case scenario we symbolically lock 3k entries by creating 3k empty files!I agree that it is less widely used so people generally have less experience with that approach.
I acknowledge that it is the biggest downside of that proposal.
However besides making the database more independent from repos.json I think it is the second most significant improvement proposed in 2.0.