#64 Translation warning box sometimes wrong

Open
opened 1 year ago by max.mehl · 19 comments
max.mehl commented 1 year ago

On some pages the red “translation outdated” box is a false positive. For example, index.de.xhtml is newer than its English counterpart: [DE](https://git.fsfe.org/FSFE/fsfe-website/commits/branch/master/index.de.xhtml) vs [EN](https://git.fsfe.org/FSFE/fsfe-website/commits/branch/master/index.en.xhtml).

On some other pages it seems to me that there is a false negative, but I don’t have an example right now. Maybe that error will also turn up when the logic is fixed.
max.mehl changed title from Translation warning box on wrong to Translation warning box sometimes wrong 1 year ago
max.mehl added the xsl label 1 year ago
max.mehl added the bug label 1 year ago
max.mehl added the build label 1 year ago
max.mehl removed the xsl label 1 year ago
paul commented 1 year ago

This is odd, because on the server both files were updated on November 21, even though according to git log the English version should be 14 days older.

We are relying on filesystem timestamps here. Is there any chance that git touches files during merges, even when those files themselves did not change?
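
For illustration, here is a minimal sketch of a filesystem-timestamp check of the kind described above. The function name `is_outdated_by_mtime` is hypothetical and is not taken from the FSFE build scripts; the actual check may be implemented differently.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the FSFE build scripts.
# Usage: is_outdated_by_mtime ORIGINAL TRANSLATION
# Succeeds (exit status 0) when TRANSLATION has an older mtime than ORIGINAL.
is_outdated_by_mtime() {
    # "-ot" compares modification times only; file content is never inspected.
    # It is supported by bash, dash and ksh, although older POSIX editions
    # do not require it.
    [ -f "$1" ] && [ -f "$2" ] && [ "$2" -ot "$1" ]
}
```

Because only the mtime is compared, anything that rewrites the original file without changing its content (a checkout, a merge, or a plain `touch`) makes every translation with an earlier mtime look outdated.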

On a side note: is it possible to enable proper time display in Gitea? It is no fun to browse the repository for timestamp discrepancies when everything says “approx. 1 month ago” or “2 days ago”.

Also: I do not think this has anything to do with XSL either.

max.mehl commented 1 year ago
Owner

Yes, git touches files after a merge, which kind of makes sense in a git mindset. By the way, this also seems to happen after a commit; at least my text editor sometimes tells me that a file has been changed after such an operation.

> On a side note: is it possible to enable proper time display in Gitea? It is no fun to browse the repository for timestamp discrepancies when everything says “approx. 1 month ago” or “2 days ago”.

Not ideal, but if you hover over this indicator you should see a more detailed time. No idea how to make this the default view.

> Also: I do not think this has anything to do with XSL either.

Changed that, too.

paul commented 1 year ago

The status log confirms that the file was modified during the merge, despite not being part of a related commit (I believe?): https://status.fsfe.org/fsfe.org/status_1511267050.html

Good, so this means we should definitely find a way to make the box depend on commit timestamps. Looking those up during page build is a lot slower than checking filesystem timestamps.

It might be faster to update the filesystem timestamps at the start of the build to reflect commit times. Because this is also slow to do for the entire repo, maybe it should be implemented as part of the VCS updater (git_build_into() in buildrun.sh). This way we could limit lookups of the commit log to files from the merge.
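
A rough sketch of that idea, under stated assumptions: the function name `sync_mtimes_to_commits` and its two revision arguments are hypothetical, this is not the code of git_build_into(), and `touch -d "@SECONDS"` relies on GNU touch being available on the build host. It must be run from the top of the working tree, because `git diff --name-only` prints paths relative to the repository root.

```shell
#!/bin/sh
# Hypothetical helper, not the actual code in buildrun.sh.
# Usage: sync_mtimes_to_commits OLD_REV NEW_REV
# Resets the mtime of every file added or modified between the two revisions
# to the committer date of the last commit that touched it.
sync_mtimes_to_commits() {
    old_rev=$1
    new_rev=$2
    # Only look at files that the update actually brought in.
    git diff --name-only --diff-filter=AM "$old_rev" "$new_rev" |
    while IFS= read -r file; do
        # %ct prints the committer date as a Unix timestamp.
        stamp=$(git log -1 --format=%ct "$new_rev" -- "$file")
        if [ -n "$stamp" ] && [ -f "$file" ]; then
            # Assumption: GNU touch, which accepts "@SECONDS" as a date.
            touch -d "@$stamp" "$file"
        fi
    done
}
```

Called as `sync_mtimes_to_commits "$old_head" "$new_head"` right after the update, this runs one `git log` per changed file rather than one per file in the repository. File names that git quotes in its output (for example names with non-ASCII characters) are not handled by this sketch.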

reinhard commented 5 months ago
Collaborator

Maybe with some git magic we can automatically add a `timestamp="..."` attribute with the timestamp of the commit in the root node of all XML files?

This might also be related to #837.

max.mehl commented 2 months ago
Owner

Personally, I am a bit hesitant to automatically add/edit something in all XML files. Preferably, any lookup should be done on the server side.

@reinhard, do you think you could estimate the speed loss if we looked up the last edit time and author via Git? If it’s significant, could we do it during the midnight run?

reinhard commented 2 months ago
Collaborator

I haven’t really tested, but I guess the speed loss would be enormous. Maybe you can do a test during the webathon?

max.mehl added this to the Hackathon1905 milestone 2 months ago
ulf self-assigned this 2 months ago
ulf commented 2 months ago
Collaborator

I am attaching a test script that does 30*30 file comparisons. On my rather old machine comparison of file timestamps takes less than one second, while comparison of git commit times takes about 44 seconds.

That is a huge difference, but since the build process as a whole is quite time-consuming, the overall effect of the modified time comparison may not be that dramatic and may even be acceptable.

max.mehl commented 2 months ago
Owner

PR #952 by @ulf

Oh wow, that’s quite a difference, and probably opposite to what we want to achieve with the latest improvements of the build script.

I wonder whether we could cache at least the change time of the EN file somehow to reduce the time for all the translations by almost 50%.

@reinhard, would you see any other potential methods to reduce the check time? Perhaps a file containing all timestamps which is only updated incrementally if one file is being changed?

reinhard commented 2 months ago
Collaborator

If I am not mistaken, we need the date of the last commit for two purposes:

  1. To check whether a translation is outdated (see #2). There might be better indicators for that than the commit time anyway.

  2. To display the date of the last change on the webpage itself (see #837). Actually, most websites out there don’t display that information at all, and in the past years, where this information in fact was not present, nobody missed it. So we might want to think again whether we really want that.

If we actually *do* want the date of the last commit, I still think a commit hook would be the best solution, since it is essentially the same as what we had in SVN times.

max.mehl commented 2 months ago
Owner

> If we actually do want the date of the last commit, I still think a commit hook would be the best solution, since it is essentially the same as what we had in SVN times.

Could you please explain which kind of hook and what it shall do? I don’t remember the server-side hooks we had with SVN.

reinhard commented 2 months ago
Collaborator

SVN filled in the $Author: and $Date: information upon commit; that was not a server-side hook but rather a built-in SVN function.

Essentially, we would just need some git magic which writes the date of the last commit into a predefined space within the file.

max.mehl commented 2 months ago
Owner

OK, we could do that, but it would mean that we effectively would have to add this information to all files initially. That would mean that we also touch outdated files…

reinhard commented 2 months ago
Collaborator

@max.mehl you have a point there :-/

ulf commented 2 months ago
Collaborator

I am going to do some test builds to find out how large the actual impact of the modified timestamp comparison is. Please hold the line …

ulf commented 2 months ago
Collaborator

During a full build more than 50,000 checks are done in order to find outdated translations.

Currently the checks are done by comparing file modification times. This takes almost no time (about 10 seconds on my box). A full build takes approx. 110 minutes on my box.

In PR #952 the checks are done by comparing git commit times. For each comparison “git log” is called for each file and the commit times are then compared. There is no optimisation. With these changes a full build takes approx. 135 minutes on my box.
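
As an illustration only, an unoptimised commit-time comparison of this kind could look like the sketch below. This is not the code from PR #952; the function names `last_commit_time` and `is_outdated_by_commit` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers, not the code from PR #952.

# Prints the committer date (Unix timestamp) of the last commit touching FILE,
# or nothing if the file has no history.
last_commit_time() {
    git log -1 --format=%ct -- "$1"
}

# Usage: is_outdated_by_commit ORIGINAL TRANSLATION
# Succeeds when the translation's last commit is older than the original's.
is_outdated_by_commit() {
    orig_time=$(last_commit_time "$1")
    trans_time=$(last_commit_time "$2")
    # Files without history yield empty strings and are treated as not outdated.
    [ -n "$orig_time" ] && [ -n "$trans_time" ] && [ "$trans_time" -lt "$orig_time" ]
}
```

Each comparison starts two `git log` processes, so with more than 50,000 checks in a full build the process start-up cost alone becomes noticeable. In exchange, the result no longer depends on file modification times.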

PR #974 contains an alternative approach. During phase 1 of the build a “sidecar file” is created or updated for each “*.en.xhtml” file that contains its outdated translations. (This is implicitly assuming that “en” is the original language and all others are translations. This should be adjusted/generalised/fixed.) These sidecar files are later used to identify outdated files. With these changes a full build takes approx. 115 minutes on my box.

max.mehl commented 2 months ago
Owner

@reinhard what do you think, is either of these two approaches doable for our setup, especially if we do partial builds? Time-wise, the difference is much smaller than I thought, but perhaps we could run a test on our build server?

reinhard commented 2 months ago
Collaborator

I just had the following idea and would like to hear your feedback:

  • In the phase 1 Makefile, run `touch -d "$(git log -1 --format="%ci" ${file})" .${file}.date` for each file which was updated since the last build run.
  • So at the end of phase 1 Makefile, each file has a hidden companion whose filetime is the actual commit date of the real file.
  • This filetime can easily and cheaply be queried for all purposes, like determination of outdated translations, or inclusion of the commit date in the HTML output.
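
The three steps above can be sketched as follows. The function names are hypothetical, and GNU touch is assumed for parsing the `%ci` date format. Note that `.${file}.date` as written only yields a hidden file next to the original when `${file}` has no directory part, so the sketch builds the companion path with dirname and basename; that layout is an assumption, not something stated in the thread.

```shell
#!/bin/sh
# Hypothetical helpers sketching the companion-file idea.

# Prints the path of the hidden companion file placed beside FILE.
date_file_for() {
    printf '%s/.%s.date\n' "$(dirname "$1")" "$(basename "$1")"
}

# Creates or updates the companion so that its mtime equals the date of the
# last commit touching FILE. Meant to run once per file changed since the
# last build.
update_date_file() {
    stamp=$(git log -1 --format=%ci -- "$1")
    if [ -n "$stamp" ]; then
        # Assumption: GNU touch, which parses "2017-11-07 12:00:00 +0000".
        touch -d "$stamp" "$(date_file_for "$1")"
    fi
}

# Usage: is_outdated_by_date_file ORIGINAL TRANSLATION
# A cheap mtime comparison on the companions; no git call is needed here.
is_outdated_by_date_file() {
    orig_stamp=$(date_file_for "$1")
    trans_stamp=$(date_file_for "$2")
    [ -f "$orig_stamp" ] && [ -f "$trans_stamp" ] && [ "$trans_stamp" -ot "$orig_stamp" ]
}
```

The `git log` cost is paid once per changed file in phase 1, while each of the many later checks is a plain timestamp comparison that is unaffected by git rewriting the source files themselves.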

BUT I would also suggest that we really first decide about whether we actually want to use commit dates for anything, see my other comment of 27 May 9:33.

ulf commented 2 months ago
Collaborator

Sounds good.

I am wondering whether it is faster to call “git log” for each language to create “.date” files or to call “git log” once for the en.xhtml and once for the whole set of translations to create “.outdated-translations” files.

One should do some tests.

max.mehl commented 2 months ago
Owner

I am impartial on the format and strategy to create the companion files, but it sounds like a good plan to me.

> BUT I would also suggest that we really first decide about whether we actually want to use commit dates for anything, see my other comment of 27 May 9:33.

That’s a complicated one.

  • Regarding showing the time stamps, it’s quite useful for debugging (and also a future sitemap file), but I would be fine if we hid that in an HTML comment.
  • Regarding outdated translations, we would have to find a good strategy which is suitable for webmasters, translators and editors alike. And even if we had this, I doubt that we could apply it to all old files; we would rather have to start using it incrementally. Until then, I think we have to rely on git commit times.
No Milestone
No Assignees
4 Participants