Mass-convert files to UTF-8 #918
This PR adds some new/improved tools to convert a lot of files to UTF-8, and fixes #641.
It makes a few basic checks though:
The script tools/encoding-convert.sh makes use of check-translation-status.sh.

Here is the complete log of the initial run. It shows successful conversions and the files which have been ignored due to their "outdatedness":
I noticed there are also files which declare a non-UTF-8 encoding but actually are UTF-8 (or US-ASCII, which is a subset of UTF-8 and therefore already valid). I checked this by running:
That results in:
So in the next commit I will just change the declared XML encoding to UTF-8 for those marked as "actually-utf" (sorry for the stupid name).
This is the list of files which are non-UTF-8 but not outdated. That happens because some files do not have an EN base version, so we cannot easily check which one is the original and thereby avoid changing an outdated file.
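As a rough illustration of that step, here is a minimal Python sketch that rewrites the declared XML encoding only when the file's bytes already decode as UTF-8. It is not the code from this PR (the PR uses shell scripts); the function names `is_valid_utf8`, `declared_encoding` and `fix_declaration` are made up for this example.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the file, e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
_DECL = re.compile(rb'^(<\?xml[^>]*?encoding=)(["\'])([A-Za-z0-9._-]+)\2')


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (pure ASCII also passes)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def declared_encoding(data: bytes):
    """Return the encoding named in the XML declaration, or None if absent."""
    match = _DECL.match(data)
    return match.group(3).decode("ascii") if match else None


def fix_declaration(data: bytes) -> bytes:
    """Set the declared encoding to UTF-8, but only for 'actually-utf' content.

    Files whose bytes are not valid UTF-8 are returned unchanged, because
    they need a real conversion rather than a header edit.
    """
    declared = declared_encoding(data)
    if declared is None or declared.lower() in ("utf-8", "utf8"):
        return data
    if not is_valid_utf8(data):
        return data
    return _DECL.sub(rb"\1\2UTF-8\2", data, count=1)
```

Working on raw bytes rather than decoded text avoids guessing an encoding before the check has been made.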
One way to continue here could be to check whether there actually IS another language version. If not, we could safely convert the encoding.
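A minimal sketch of such a check in Python, assuming files are named `<base>.<lang>.<ext>` (for example `foo.es.xhtml`); that naming pattern and the helper name `other_language_versions` are assumptions for illustration, not taken from the scripts in this PR.

```python
from pathlib import Path


def other_language_versions(path):
    """Return sibling files with the same base name and extension but a
    different language code. Assumes names of the form <base>.<lang>.<ext>."""
    p = Path(path)
    parts = p.name.split(".")
    if len(parts) < 3:
        return []  # name does not follow the assumed pattern
    base, lang, ext = ".".join(parts[:-2]), parts[-2], parts[-1]
    siblings = []
    for candidate in p.parent.iterdir():
        cparts = candidate.name.split(".")
        if len(cparts) < 3:
            continue
        same_base = ".".join(cparts[:-2]) == base
        if same_base and cparts[-1] == ext and cparts[-2] != lang:
            siblings.append(candidate)
    return sorted(siblings)
```

An empty result would mean the file exists in one language only, so converting its encoding cannot put it out of sync with another version.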
There was one file which was present in more than one language:
For that one it was obvious that the ES version was the original and that there is no discrepancy between both versions. So I've changed the encoding for ES, and fake-updated the CA version.
The next commit was about updating the rest of the list above.
What I forgot: fake-updating all up-to-date translations whose EN original changed due to the encoding changes
Solved by adding the -o flag to check-translation-status.sh and doing some semi-automatic comparisons, and by that a fake-update to
The following 67 files are still not UTF-8 because they are outdated against their EN original. For some, the English version just fixed a typo, some others definitely lag behind content-wise.
We'll have to go through them. Useful tools would be git log and tools/check-translation-status.sh -a -f <file> to check the change dates of all correlated files.

The latest commits also solve the leftover files mentioned above as well as a few other edge cases like files that are declared as non-UTF in their XML header but actually are UTF-8.
I have merged to the test branch to test it.
I tested ~20 sites on test.fsfe.org and everything looks fine. Will merge therefore :)