Database is derived from repos.json #122

Open
opened 2025-11-14 10:04:21 +00:00 by fkobi · 7 comments
Owner

Right now, for a project (URL) to have an entry in the database, it needs to be registered (https://git.fsfe.org/reuse/api/src/commit/1a1e223bfcca0e55c6eb3d79908e9662960b6cb7/reuse_api/models.py#L46-L52):

    if not cls.is_registered(kwargs.get("url")):
        # Error as breaks the pipeline & should not happen
        current_app.logger.error(
            "Project not registered, not creating an entry: '%s'",
            kwargs.get("url"),
        )
        return None

The registration information is stored in `repos.json` (https://git.fsfe.org/reuse/api/src/commit/1a1e223bfcca0e55c6eb3d79908e9662960b6cb7/reuse_api/models.py#L30-L38):

    def is_registered(url: str) -> bool:
        """
        Ensure the user is registered and has validated their email from the `forms` app
        """
        with open(current_app.config["FORMS_FILE"]) as f:
            for project in json.load(f):
                if project["include_vars"]["project"].lower() == url.lower():
                    return True
        return False

That means that our database is derived from `repos.json`; it is effectively just a cache.

---

EDIT: to state it more plainly: the project should have just one persistent database.
fkobi added this to the 2.0 -- A thought-out backend redesign milestone 2025-11-14 10:04:21 +00:00
fkobi added the bug, component/database, prio/high labels 2025-11-14 10:04:21 +00:00
fkobi self-assigned this 2025-11-14 10:04:21 +00:00
fkobi added a new dependency 2025-11-14 10:07:11 +00:00
Owner

I think the title describes the main issue of REUSE API: that it depends on a JSON file produced by an app on the same host.

To fix this, forms needs an API from which a consumer app can get the relevant data.

Author
Owner

I disagree rather strongly.
I think *depending* (at all) on `repos.json` as the data source is hacky.
The fact that the [file is read-only](https://git.fsfe.org/reuse/api/issues/123) is the crucial design flaw: it effectively causes [all high priority database 2.0 issues](https://git.fsfe.org/reuse/api/milestone/25?type=all&state=all&labels=557%2C559&milestone=25).

Reading is the only thing that the API itself does with the JSON file, and I think it does it as well as it can.
Thus I think adding a reading API to forms is actually low priority.

TL;DR: IMO forms can function perfectly well (although a bit kludgy) with just rw access to `repos.json`.

---

If one wants to think about forms, I think there is a better question:
**If the two systems were to be redesigned, how would we like them to interact with each other?**

IMO the rule of thumb says "let there be one database!" but with my current knowledge (or lack thereof) about forms I think that is a bad idea.
Forms was introduced for one task only: to provide a single boolean piece of information, namely whether a new database record should be created for a project.
Achieving this would be as simple as having a plain-text file with URLs that one could `grep -q` through.
Creating an API for getting a boolean value adds complexity.

I think that the conclusion "add an `exists(url: str) -> bool` API to forms" is a misguided solution for reuse/api problems.
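
For illustration, the plain-text check mentioned above could look roughly like the sketch below. It is only a sketch under assumptions: the file name `registered_urls.txt` and the one-URL-per-line layout are hypothetical and not part of reuse-api or forms. The comparison is case-insensitive, mirroring what `is_registered` does today.

```python
from pathlib import Path


def is_registered(url: str, registry: Path) -> bool:
    """Return True if `url` appears as a whole line in `registry`.

    Assumption: `registry` is a hypothetical plain-text file with one
    project URL per line. The comparison ignores case and surrounding
    whitespace.
    """
    wanted = url.strip().lower()
    with registry.open(encoding="utf-8") as f:
        return any(line.strip().lower() == wanted for line in f)
```

On the shell side this corresponds roughly to `grep -qixF "$url" registered_urls.txt` (quiet, case-insensitive, whole-line, fixed-string match).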
Owner

It surely adds complexity, but an understandable one. Relying on read access to a file created by an independent application is usually considered bad design.

I consider the return on investment of adding this to forms quite good:

  1. Adding this to forms isn't a big deal.
  2. It may have positive effects for testing this application, and setting it up as a dev instance (if you see the current workflow for this, you'll notice the insanity)
  3. It may also be beneficial for other interacting systems of forms. See some issues in forms about this.

For a very good reason, you'll find an "API first" principle in most professional architectures and organizations ;)

Member

The main problem I see with this "forms first" approach is the limited time frame @fkobi has for it. The time will not be enough to analyze and overhaul forms and also implement the needed changes in the services that use forms as a source.

Owner

How you set priorities is of course up to you. I am just pointing out that it's generally a bad idea to continue hacking around problems when you should actually solve the root issues once you detected them.

Owner

Please see fsfe-system-hackers/forms#130 where I suggest a basic API for Forms that may allow you to build on proper API workflows from the start. I hope this will be helpful for you :)

Author
Owner

> it's generally a bad idea to continue hacking around problems when you should actually solve the root issues once you detected them.

I agree with that 100%.

However I think your PR is misguided.


Before explaining why I think it is the case I would like to make sure that everyone is on the same page.

> The REUSE API then can query the Forms API for this information **if a repo isn't found in its own database**

A project/repo should be missing from the database __only once__ (if the software functions correctly):
`reuse-api` needs `forms` __ONLY(!!)__ for the function that I brought up in the second code block of the issue text.


Let's assume that your forms PR got merged and this project used it. Great, the issue is fixed!

Or is it..?

Forms still holds all the registration data so it is the authoritative information source.
There are still two databases. Entries in reuse's should exist only if they are present in forms'.

Let me bring up the last line from the issue's description:

> That means that our database is derived from repos.json; it effectively is just cache.

In this scenario api's database is STILL derived from forms' but this time not directly from a file but from a file accessed via an authenticated HTTP...
(And still effectively functions as cache)

Summing up, your proposal does not fix the issue but makes it look better through adding complexity.
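
A minimal sketch of this argument, with hypothetical names throughout (`file_source`, `http_source` and `Database` are illustrations, not code from reuse-api or forms, and `fetch` stands in for whatever client call a Forms API would offer). Whichever transport is plugged in, an entry can only be created if forms says so, so the database stays derived from forms:

```python
from typing import Callable, Dict, Iterable, Optional

# A registration source answers one question: is this URL registered?
RegistrationSource = Callable[[str], bool]


def file_source(registered: Iterable[str]) -> RegistrationSource:
    """Stand-in for reading the registration file directly."""
    known = {url.lower() for url in registered}
    return lambda url: url.lower() in known


def http_source(fetch: Callable[[str], bool]) -> RegistrationSource:
    """Stand-in for asking a Forms API; `fetch` is a hypothetical client call."""
    return lambda url: fetch(url.lower())


class Database:
    """Toy database that, like the current model, refuses unregistered URLs."""

    def __init__(self, is_registered: RegistrationSource) -> None:
        self._is_registered = is_registered
        self.entries: Dict[str, str] = {}

    def create(self, url: str) -> Optional[str]:
        # The gate is identical for both sources: no registration, no entry.
        if not self._is_registered(url):
            return None
        key = url.lower()
        self.entries[key] = "initialising"
        return key
```

Swapping `file_source` for `http_source` changes how the answer travels, not who owns it.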


EDIT: to make it clear, I did not mean for this comment to be harsh.

fkobi added the enhancement label 2025-12-17 11:17:28 +00:00