Deal with duplicated repos #3

Closed
opened 2019-08-01 16:24:28 +00:00 by max.mehl · 9 comments
Owner

Git repos usually have multiple URLs, e.g.

https://github.com/fsfe/reuse-tool
https://github.com/fsfe/reuse-tool.git
git://github.com/fsfe/reuse-tool
git@github.com/fsfe/reuse-tool.git

Can we make sure that we actually check only one instance of this project, and therefore issue only one badge? And can we do this for all kinds of source forges?

max.mehl changed title from How to deal with duplicated repos to How to deal with duplicated repos? 2019-08-01 16:24:33 +00:00
Owner

> git@github.com/fsfe/reuse-tool.git

This syntax does not work.

I agree that the rest probably needs some fixing. I would suggest rewriting any incoming request to the format "git://github.com/fsfe/reuse-tool", but I am not sure whether all platforms support that syntax. This is the kind of thing that, once implemented, can cost someone an hour of their time when a check doesn't work and they can't figure out why, only to find out that their URLs are being rewritten.

Author
Owner

How about a dropdown of supported schemes, e.g. only http, https, and git, so people know what to enter? The rest of the URL is probably always the same (except for the .git suffix) and could be checked for duplicates in the backend.
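
A rough sketch of what such a backend duplicate check could look like, assuming the scheme list from the dropdown idea. The name `dedup_key` and the decision to lowercase only the host are illustrative assumptions, not existing API code:

```python
from urllib.parse import urlsplit

# Assumed set of schemes, taken from the dropdown suggestion.
SUPPORTED_SCHEMES = {"git", "https", "http"}


def dedup_key(url: str) -> str:
    """Return a scheme-independent key for comparing repository URLs.

    Hypothetical helper: two URLs with the same key are treated as the
    same repository.
    """
    parts = urlsplit(url.strip())
    if parts.scheme.lower() not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported or missing scheme in {url!r}")
    # Host names are case-insensitive; paths are left untouched because
    # not every forge treats them as case-insensitive.
    host = parts.netloc.lower()
    path = parts.path.rstrip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    return host + path
```

With this, `https://github.com/fsfe/reuse-tool`, `https://github.com/fsfe/reuse-tool.git` and `git://github.com/fsfe/reuse-tool` all map to the same key, while the scheme-less `git@github.com/fsfe/reuse-tool.git` is rejected.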

max.mehl added this to the 0.1 milestone 2019-08-07 14:51:10 +00:00
max.mehl changed title from How to deal with duplicated repos? to Deal with duplicated repos 2019-08-07 15:14:18 +00:00
Member

Are we actually sure we want to forbid re-registering with a different scheme? What if, for example, somebody registers http://git.acme.com/foo/bar and later decides to completely switch the server from http to https?

What's the damage for us when multiple URLs are registered, when only one of them will actually be queried?

Member

I just had another idea: we could store just the URL without the scheme, and when it comes to checking, we try "git", "https" and "http" (in a fixed, TBD order) and take the first one that works. This would even implicitly solve the issue of repositories changing the supported access scheme.

Maybe that costs us a few seconds when linting the repositories not supporting our first choice, but that runs in an asynchronous queue anyway.
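
To make the idea concrete, here is a minimal sketch. The order in `SCHEME_ORDER` is only a placeholder, and `probe` is a hypothetical callable standing in for whatever check decides that a URL "works":

```python
from typing import Callable, Optional

# Placeholder order; the actual order is still to be decided.
SCHEME_ORDER = ("git", "https", "http")


def first_working_url(bare_url: str, probe: Callable[[str], bool]) -> Optional[str]:
    """Try each scheme in a fixed order and return the first URL that works.

    `bare_url` is the stored URL without a scheme, e.g. "git.acme.com/foo/bar".
    `probe` is a hypothetical check that returns True if the URL is reachable.
    """
    for scheme in SCHEME_ORDER:
        candidate = f"{scheme}://{bare_url}"
        if probe(candidate):
            return candidate
    # None of the schemes worked.
    return None
```

Because the scheme is resolved on every check, a server that later switches from http to https would be picked up without re-registering.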

Author
Owner

> What’s the damage for us when multiple URLs are registered, when only one of them will actually be queried?

Resources, I am afraid. I would prefer to lint the Linux kernel just once per commit (at least the primary repo)...

> I just had another idea: we could store just the URL without the scheme, and when it comes to checking, we try “git”, “https” and “http” (in a fixed, TBD order) and take the first one that works. This would even implicitly solve the issue of repositories changing the supported access scheme.

Yes, that could be a viable solution!

Member

@carmenbianca what do you think about the proposal to just try git, https, and http and take the first that works? What do you think would be the best order to try?

Owner

> what do you think about the proposal to just try git, https, and http and take the first that works? What do you think would be the best order to try?

@reinhard This seems to work for me, in that order. There is probably some weird server out there that behaves differently based on protocol, but it's probably fine. The order `git -> https -> http` seems fine.

reinhard self-assigned this 2019-08-20 14:12:17 +00:00
Member

@carmenbianca instead of opening 3 ssh connections to the reuse lint server for the 3 tries, would it be smarter to improve the reuse-lint-repo script and make it accept the URL without protocol and try all 3 variants within a single run of the script?

Member

@carmenbianca please forget the above question. The API does a `git ls-remote` on the repository and can remember which of the protocols worked before starting the remote lint.
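
Roughly, that could look like the sketch below. This is not the actual API code: the function names, the 30-second timeout, and the injectable `run` parameter (there only so the logic can be exercised without network access) are assumptions.

```python
import os
import subprocess
from typing import Optional

# Order as agreed earlier in this thread.
SCHEME_ORDER = ("git", "https", "http")


def ls_remote_ok(url: str, run=subprocess.run, timeout: float = 30.0) -> bool:
    """Return True if `git ls-remote <url>` exits successfully."""
    # GIT_TERMINAL_PROMPT=0 keeps git from waiting for credentials.
    env = dict(os.environ, GIT_TERMINAL_PROMPT="0")
    try:
        result = run(
            ["git", "ls-remote", url],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            env=env,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        # git is not installed, or the server did not answer in time.
        return False
    return result.returncode == 0


def working_scheme(bare_url: str, run=subprocess.run) -> Optional[str]:
    """Return the first scheme for which ls-remote succeeds, or None."""
    for scheme in SCHEME_ORDER:
        if ls_remote_ok(f"{scheme}://{bare_url}", run=run):
            return scheme
    return None
```

The returned scheme can then be stored alongside the scheme-less URL and passed on to the remote lint, so the lint server itself only ever sees one concrete URL.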

Reference: reuse/api#3