#1635 Add a search functionality (Fixes #739 )

Merged
max.mehl merged 1 commits from search-engine into master 5 months ago
vincent commented 7 months ago
Collaborator

Add a page `fsfe.org/search.en.html`.

To try this:

  1. Run `python3 tools/index-website.py` from the website root directory
  2. Build `search/search.en.xhtml`

TODO:

- [x] Figure out how to rename search.en.html to index.en.html so that it's the default page for `fsfe.org/search`.
- [x] Index more than just `/news`
- [x] Filter results in the user language + English
- [x] Wire index building to the build system (or to a cron task?)
- [x] Host lunr.js script on fsfe.org instead of a CDN
- [x] Polish the design and make the page user-friendly
- [x] Separate news from other content in the search results
- [x] Display the date for news items
- [x] Documentation and mention python dependency
- [x] Use logging in the indexation script
- [x] Index the document type ("news" or "page")
vincent changed title from Add a search functionality (#739 ) to Add a search functionality (Fixes #739 ) 7 months ago
vincent changed title from Add a search functionality (Fixes #739 ) to WIP: Add a search functionality (Fixes #739 ) 7 months ago
Owner

Cool! Didn't try it out completely yet, but regarding one of your todos:

> Figure out how to rename search.en.html to index.en.html so that it's the default page for `fsfe.org/search`.

That should happen automatically. For instance, check out https://fsfe.org/activities/routers/. You can leave it blank, remove the trailing slash, or add routers.html or index.html. The build system takes care of all of these. It just doesn't work with the normal local preview.
vincent commented 7 months ago
Poster
Collaborator

> That should happen automatically. For instance, check out https://fsfe.org/activities/routers/. You can leave it blank, remove the trailing slash, or add routers.html or index.html. The build system takes care of all of these. It just doesn't work with the normal local preview.

Perfect!
Owner

I just played around with it. First of all: it's a really clever and slim idea, so thanks a lot!

Some feedback apart from your todos above, of course knowing that this is WIP:

  • The results are all lower-case. Is this to make search case-insensitive? Anyway, it would be preferable if the original case could be shown.
  • How are the results ordered? Especially for news items, having an item of 2015 above one from 2020 is a bit awkward. Or shall we show the date in these cases?
  • If there are two news items from the same date, but in different languages, only those of the currently selected language – or EN as fallback – should be shown.
  • Do you plan to include "normal" articles as well? So if I type in "router", I would expect to see `/activities/routers` as well as the news items in this category.
  • Since you search for tags: some tags are not really intuitive, e.g. "swpat" for software patents. But of course, if people search for patents, they will expect content labeled with this. Do you have any idea how we could cope with that?
vincent commented 6 months ago
Poster
Collaborator
> • The results are all lower-case. Is this to make search case-insensitive? Anyway, it would be preferable if the original case could be shown.

Fixed with 69538ac9de :-)

> • How are the results ordered? Especially for news items, having an item of 2015 above one from 2020 is a bit awkward. Or shall we show the date in these cases?

The ranking algorithm is called [BM25](https://en.wikipedia.org/wiki/Okapi_BM25).

In a nutshell, the query is split into individual tokens (e.g. "free software" is split into ["free", "software"]). Then, for each token, three things are taken into account to create a score for that token and a particular document: the number of occurrences in the document, the document length, and the inverse document frequency of the token, which is derived from the number of documents in which the token appears ([more details](https://vl8r.eu/posts/2019/04/01/text-representations-for-machine-learning-and-deep-learning/#tf-idf-term-frequency---inverse-document-frequency)). The scores are then summed up across the tokens composing the query, so you end up with a score for the query and a document. You repeat this for each document in the corpus and sort with the highest scores first :-) I take only [the top 10](https://git.fsfe.org/FSFE/fsfe-website/src/commit/69538ac9de25bcb8e81d62cad154b93551520625/search/search.en.xhtml#L47) results.

So the ranking is purely text-based. Maybe we can sort the top 10 results by date?
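
To make the scoring description above more concrete, here is a minimal Python sketch of BM25 ranking. It is only an illustration of the general formula, not the lunr.js code used on the search page; the whitespace tokenizer, the `k1` and `b` values and the function names are assumptions chosen for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (assumption): lower-case and split on whitespace.
    return text.lower().split()


def bm25_rank(query, documents, k1=1.2, b=0.75, top_n=10):
    """Return up to top_n (document index, score) pairs, best score first."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scored = []
    for idx, tokens in enumerate(tokenized):
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            tf = counts[term]
            if tf == 0:
                continue
            # Inverse document frequency: rarer tokens weigh more.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency, dampened and normalised by document length.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        if score > 0:
            scored.append((idx, score))

    # Highest score first, then keep only the first top_n results.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

Because the score depends only on token counts and document lengths, a news item from 2015 can outrank one from 2020, which is the behaviour discussed above.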

> • If there are two news items from the same date, but in different languages, only those of the currently selected language – or EN as fallback – should be shown.

Should be doable. I guess we can parse the current URL to know the language used by the user, and compare it with the URL of the search results. I would say that regardless of the date, we should filter results to only get the current language (if not English) and English results. That would improve the user experience IMO.
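
A rough sketch of that filtering idea, written in Python for readability (on the actual page this logic would live in JavaScript). It assumes that every URL follows the `name.<lang>.html` pattern, as `search.en.html` does; the function names and the fallback to English for URLs without a language code are hypothetical.

```python
import re

# Assumption: the language is a two-letter code right before ".html".
LANG_PATTERN = re.compile(r"\.([a-z]{2})\.html$")


def language_of(url, default="en"):
    """Extract the language code from a URL, falling back to the default."""
    match = LANG_PATTERN.search(url)
    return match.group(1) if match else default


def filter_by_language(result_urls, current_url):
    """Keep only results in the user's current language or in English."""
    wanted = {language_of(current_url), "en"}
    return [url for url in result_urls if language_of(url) in wanted]
```

For a visitor on `/search/search.fr.html`, this keeps the French and English results and drops everything else.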

> • Do you plan to include "normal" articles as well? So if I type in "router", I would expect to see `/activities/routers` as well as the news items in this category.

Yes. For now I only index `/news` but that will be extended to `/activities` and any other pages that one would expect to find through the search engine.

Edit: added `/about` and `/activities` to the index in 4cde1bab99.

> • Since you search for tags: some tags are not really intuitive, e.g. "swpat" for software patents. But of course, if people search for patents, they will expect content labeled with this. Do you have any idea how we could cope with that?

Uhm, it's not easy. We can't computationally tell if a tag is intuitive. I guess that all articles tagged with "swpat" would contain the word "patent" or its translation, so I'm not sure if it's an issue.
Owner

> So the ranking is purely text-based. Maybe we can sort the top 10 results by date?

Thanks for the explanation! Yes, either sorting by date, or showing the date.

Another thought: would it work to show the type of the item? This would allow readers to distinguish between different types of resources.

> > • If there are two news items from the same date, but in different languages, only those of the currently selected language – or EN as fallback – should be shown.
>
> Should be doable. I guess we can parse the current URL to know the language used by the user, and compare it with the URL of the search results. I would say that regardless of the date, we should filter results to only get the current language (if not English) and English results. That would improve the user experience IMO.

Definitely, yes!

> > • Do you plan to include "normal" articles as well? So if I type in "router", I would expect to see `/activities/routers` as well as the news items in this category.
>
> Yes. For now I only index `/news` but that will be extended to `/activities` and any other pages that one would expect to find through the search engine.
>
> Edit: added `/about` and `/activities` to the index in 4cde1bab99.

Great!

> > • Since you search for tags: some tags are not really intuitive, e.g. "swpat" for software patents. But of course, if people search for patents, they will expect content labeled with this. Do you have any idea how we could cope with that?
>
> Uhm, it's not easy. We can't computationally tell if a tag is intuitive. I guess that all articles tagged with "swpat" would contain the word "patent" or its translation, so I'm not sure if it's an issue.

Ah, so you search the full article text, or just the teaser? In any case I agree that users will probably find what they search for.
vincent commented 6 months ago
Poster
Collaborator

> > So the ranking is purely text-based. Maybe we can sort the top 10 results by date?
>
> Thanks for the explanation! Yes, either sorting by date, or showing the date.
>
> Another thought: would it work to show the type of the item? This would allow readers to distinguish between different types of resources.

Yes, I definitely need to add contextual information, either by adding the type (/news, /about, /activities and so on) or the complete URL. Displaying the date would be nice as well.

> Ah, so you search the full article text, or just the teaser? In any case I agree that users will probably find what they search for.

Currently [I only search in the title and the tags](https://git.fsfe.org/FSFE/fsfe-website/src/commit/817d1640d8612d661c9a79709453f46417dbdb28/search/search.en.xhtml#L35). The reason for not adding full-text search is that it would make the index quite large (6.4M gzipped if we index the full text of /news, /about and /activities), which would impact both the download size of the index (every user downloads the `index.js` file) and the search performance. I could add some clever trick to only index relevant words from the full text, but I'm not sure that it would make the index size bearable. There's a trade-off between the usefulness of the search engine, the size of the data indexed and its performance.

The logic of only indexing the teaser would only work for /news.
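
One way to compare such index variants before committing to one is to measure their compressed size directly. The following Python sketch does that for a list of records; the record layout (`title`, `tags`, `text`) and the JSON serialisation are assumptions made for the example and do not describe the real `index.js` format.

```python
import gzip
import json


def gzipped_size(records, fields):
    """Size in bytes of the gzip-compressed JSON index restricted to `fields`."""
    # Keep only the fields that would actually be shipped to the browser.
    slim = [{field: record.get(field, "") for field in fields} for record in records]
    payload = json.dumps(slim, ensure_ascii=False).encode("utf-8")
    return len(gzip.compress(payload))
```

Calling it once with `["title", "tags"]` and once with `["title", "tags", "text"]` gives a quick estimate of what full-text indexing would add to every visitor's download.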

vincent commented 6 months ago
Poster
Collaborator

@max.mehl and others, I also wonder if it's ok to only put a search box in the header. Should we add a "search" button `<input type="submit">` next to the search box to submit the search? For a less tech-savvy audience, is it obvious that pressing enter after entering text in the textbox submits the search and leads to the search page?
Owner

Thanks for calculating the index size. When speaking of a teaser, I always mean the first paragraph, as this introduces the topic and will use the important terms – otherwise it would be a really bad teaser. The title could do so as well, but sometimes for the sake of a catchy title, one might omit that.

As an example, I am thinking about a text about software patents. It could be like the following:

```
Title: European Commission blindly follows wrong argumentation by troll firms
Teaser: The EC blindly followed the argumentation by legal troll firms in their regulation about software patents...
Tags: swpat
```

In this case, only the teaser would produce a result when someone searches for "patent".

> @max.mehl and others, I also wonder if it's ok to only put a search box in the header. Should we add a "search" button `<input type="submit">` next to the search box to submit the search? For a less tech-savvy audience, is it obvious that pressing enter after entering text in the textbox submits the search and leads to the search page?

I would add an icon to make the purpose of the box clearer. E.g., what about [this](https://stackoverflow.com/a/40303351/4273755)?
vincent commented 6 months ago
Poster
Collaborator

> In this case, only the teaser would produce a result when someone searches for "patent".

Yea, I agree that it could be useful. But adding the teaser adds quite a lot of size to the index.

I have added the teaser in fb3fad84fd (the size of the index would now be 490K). [Stopwords](https://en.wikipedia.org/wiki/Stop_word) are removed for Dutch, English, French, German and Italian to make the field slimmer. Despite the larger index.js file, the search is still fast on my computer ™, though I don't know whether this will be problematic once live in production. This commit will not be squashed so it can be easily reverted if need be.
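
For readers unfamiliar with stopword removal, this is roughly what it means, as a small Python sketch. The word lists below are tiny hand-picked samples and the function name is hypothetical; the lists and code actually used in the commit may look quite different.

```python
# Illustrative, deliberately incomplete stopword lists (assumption).
STOPWORDS = {
    "en": {"the", "a", "an", "of", "in", "and", "by", "their", "about"},
    "fr": {"le", "la", "les", "de", "des", "et", "un", "une"},
}


def strip_stopwords(text, lang):
    """Drop very common words so the indexed teaser field gets smaller."""
    stop = STOPWORDS.get(lang, set())
    # Compare case-insensitively, but keep the original case of kept words.
    return " ".join(word for word in text.split() if word.lower() not in stop)
```

Words such as "patents" survive, so the teaser still matches the searches that matter while taking up less space in the index.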

> I would add an icon to make the purpose of the box clearer. E.g., what about [this](https://stackoverflow.com/a/40303351/4273755)?

I have made the search icon clickable in ef08fbf877.
Owner

Thanks a lot for the updates!

For me, the index is currently 1.5M, but search is still sufficiently fast, which of course can change if the index has to be loaded from the server.

I do wonder, however, why the search term `patents` leads to no results, although the word is in the index file?

Also, some questions/remarks:

  • I've re-styled the box a bit following bootstrap paradigms. Please feel free to revert as you wish, it's just a proposal :)
  • Would it actually make more sense to run the search server-side? This would eradicate the JS dependency and perhaps speed up the result generation (given that we will not get spammed with requests)
vincent commented 6 months ago
Poster
Collaborator

> For me, the index is currently 1.5M, but search is still sufficiently fast, which of course can change if the index has to be loaded from the server.

Yes, me too. But I'm talking about the compressed size (the size which will be downloaded by the website visitors).

> I do wonder, however, why the search term `patents` leads to no results, although the word is in the index file?

Fixed in 8122d772d8

> • I've re-styled the box a bit following bootstrap paradigms. Please feel free to revert as you wish, it's just a proposal :)

Looks nice, but maybe it's a bit inconsistent with the rest of the items in the menu? "login", "change language" and "donate" have their respective icons on the left but with e1653bb0d1 the search icon is on the right of the text.

> • Would it actually make more sense to run the search server-side? This would eradicate the JS dependency and perhaps speed up the result generation (given that we will not get spammed with requests)

Sorry, I should have added the rationale behind the technical choices in the first comment of the PR. Anyway, here are the reasons why I went with JavaScript for the search functionality (in order of importance):

  • This is not a core functionality of the website and only one page would use JavaScript
  • I assume that people disabling JavaScript know about search engine features for website-specific queries (e.g. [site:fsfe.org patents](https://duckduckgo.com/?t=ffab&q=site%3Afsfe.org+patents), which works with JS disabled) so they don't need the search feature anyway. Is it a reasonable assumption?
  • Building the search feature in JS allows a simple and portable implementation which integrates well with our static website logic and will most likely be easy to integrate in the build process
  • The JS implementation doesn't increase the attack surface of our infrastructure and doesn't open the door to abuses
  • It's easy to develop and maintain the functionality if done in JavaScript

On the other hand, server-side has the following disadvantages:

  • Server-side full-text search implementations (examples: [Elasticsearch](https://www.elastic.co/), [Solr](https://lucene.apache.org/solr)) come with a database, which complicates our server infrastructure and adds friction to the website development
  • No "lightweight" and well-maintained full-text search implementations exist in a server-side language that doesn't require a full-scale database
  • This may cause security issues
  • This makes the development of the feature itself more complicated

The advantages of server-side:

  • It will work if JS is disabled
  • The size of the search page will be lower, so users will download less data. The size of the gzipped lunr.js library is 8.3K and the size of the gzipped index is 490K.

So this is all about trade-offs. Of course, if the feature had been a central part of the website, present on every page, my point of view would have been completely different.

Regarding the search performance, this is hard to tell without doing both the client-side and server-side implementation with the same set of features.

What do you think?
Owner

> Looks nice, but maybe it's a bit inconsistent with the rest of the items in the menu? "login", "change language" and "donate" have their respective icons on the left but with e1653bb0d1 the search icon is on the right of the text.

Ah yes, good point. I've moved it in 0c6e55fdea, and we can surely think about just making it a blue search symbol without the blue background.

> > • Would it actually make more sense to run the search server-side? This would eradicate the JS dependency and perhaps speed up the result generation (given that we will not get spammed with requests)
>
> Sorry, I should have added the rationale behind the technical choices in the first comment of the PR. Anyway, here are the reasons why I went with JavaScript for the search functionality (in order of importance):

Thanks for the elaborate answer!

> So this is all about trade-offs. Of course, if the feature had been a central part of the website, present on every page, my point of view would have been completely different.

Well, it would be present on any page then ;)

> Regarding the search performance, this is hard to tell without doing both the client-side and server-side implementation with the same set of features.

I agree, and especially the server-performance and -security arguments convince me. And I didn't know that the slim library you used in the JS would not be so trivial to implement server-side.

So I'd say let's continue with the current approach. I can surely help with styling, and wiring the whole index to the build process somehow.
vincent commented 6 months ago
Poster
Collaborator

> Ah yes, good point. I've moved it in 0c6e55fdea, and we can surely think about just making it a blue search symbol without the blue background.

Done in [`867be575cf`](https://git.fsfe.org/FSFE/fsfe-website/commit/867be575cfd573efb19c6e47ac6e1c6e96e986fb), though I'm not sure if it's the best way to do it.

I've rebased to latest master because [Drone had issues with auto-merging](https://drone.fsfe.org/FSFE/fsfe-website/5099) for some reason.

> So I'd say let's continue with the current approach. I can surely help with styling, and wiring the whole index to the build process somehow.

Thanks!
max.mehl self-assigned this 6 months ago
vincent was assigned by max.mehl 6 months ago
Owner

I've equipped the build server with python3-bs4, which should suffice to run `index-website.py`. I've also improved the layout and text a bit.

Also, there's now a step in the Makefile to run the indexing with every build. It costs us 30s on the current hardware, but it's the cleanest solution.
vincent commented 5 months ago
Poster
Collaborator

> I've equipped the build server with python3-bs4, which should suffice to run `index-website.py`. I've also improved the layout and text a bit.

Thanks! In 2a5757fb23 I separated the results for news items from the results for `/activities` and `/about`, and I now display the date for news items.

> Also, there's now a step in the Makefile to run the indexing with every build. It costs us 30s on the current hardware, but it's the cleanest solution.

How many CPUs does the build server have? In [`index-website.py` I spawn 4 processes](https://git.fsfe.org/FSFE/fsfe-website/src/commit/2a5757fb23e4c1f0bf0a6a43f4d79810dcf8014b/tools/index-website.py#L33). Depending on the build server hardware, we may want to increase that number, which would speed up the indexing :-)
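
For illustration only, here is a minimal sketch of how a worker count could be tied to the hardware instead of being fixed at 4. It is not the code from `index-website.py`: `pick_n_processes`, `word_count` and `index_all` are hypothetical names, and the "indexing" step is a stand-in that just counts words.

```python
import multiprocessing
import os


def pick_n_processes(requested=4, available=None):
    """Cap the requested worker count at the number of available CPUs."""
    if available is None:
        # os.cpu_count() may return None on some platforms.
        available = os.cpu_count() or 1
    return max(1, min(requested, available))


def word_count(text):
    """Stand-in for the per-file indexing work (hypothetical)."""
    return len(text.split())


def index_all(texts, n_processes):
    """Distribute the per-item work over a pool of worker processes."""
    with multiprocessing.Pool(n_processes) as pool:
        return pool.map(word_count, texts)


if __name__ == "__main__":
    n = pick_n_processes()
    print(index_all(["free software", "public money public code"], n))
```

Whether a larger or smaller pool is actually faster depends on the machine and on how much of the work is I/O-bound, so it is worth measuring rather than assuming.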
vincent commented 5 months ago
Poster
Collaborator

Also, once the pull request has been merged, I suggest adding the python3 and python3-bs4 packages to [the list of packages needed to build the website](https://wiki.fsfe.org/TechDocs/Mainpage/BuildLocally#Set_up_necessary_packages).
vincent changed title from WIP: Add a search functionality (Fixes #739 ) to Add a search functionality (Fixes #739 ) 5 months ago
Owner

Great changes, that's much more usable now!

Do you know of a way to make the whole thing translatable? Currently, quite a lot is hardcoded in English in the JavaScript. Could we somehow make use of our XSL gettext function (e.g. `<xsl:call-template name="fsfe-gettext"><xsl:with-param name="id" select="'search/placeholder'" /></xsl:call-template>`)?

Also, I'm not sure whether /about, /news, and /activities are all that users expect. What about /freesoftware, /contribute, and so on?

> How many CPUs does the build server have? In index-website.py I spawn 4 processes. Depending on the build server hardware, we may want to increase that number, which would speed up the indexing :-)

Two virtual cores. I hope that with the new cluster(s) we can increase the number to at least 4 or 6.
Owner

Ah, regarding echoing in index-website.py: The `* ` indentation was on purpose, to match what you see when running `make` and what's shown on [the status page](https://status.fsfe.org/fsfe.org/).

Ideally, the output would look like:

```
* Creating search index
*  Spawning 4 processes
*  Indexing 3013 files
*  Indexing done!
*  [...]/search/index.js
```
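
As a rough sketch of how a Python script could produce that prefix through the `logging` module, a formatter can prepend `* ` to every message. This is an assumption about one possible approach, not the code in `index-website.py`; the logger name and both helper functions are made up for the example.

```python
import io
import logging
import sys


def make_build_logger(stream):
    """Return a logger whose lines start with '* ', like the make output."""
    logger = logging.getLogger("index-website-sketch")  # hypothetical name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep the root logger from echoing lines
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("* %(message)s"))
    logger.addHandler(handler)
    return logger


def report_indexing(logger, n_processes, n_files):
    """Emit the status lines; a leading space marks a sub-step."""
    logger.info("Creating search index")
    logger.info(" Spawning %d processes", n_processes)
    logger.info(" Indexing %d files", n_files)
    logger.info(" Indexing done!")


if __name__ == "__main__":
    report_indexing(make_build_logger(sys.stdout), 4, 3013)
```

Keeping the prefix in the formatter means individual log calls only have to decide whether they are a top-level step or an indented sub-step.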
vincent commented 5 months ago
Poster
Collaborator

> Do you know of a way to make the whole thing translatable? Currently, quite a lot is hardcoded in English in the JavaScript. Could we somehow make use of our XSL gettext function (e.g. `<xsl:call-template name="fsfe-gettext"><xsl:with-param name="id" select="'search/placeholder'" /></xsl:call-template>`)?

I tried for quite some time to use XSL templates in `search.en.xml`, but to no avail :-( I'm far from an XSL master though, so if someone wants to step in and manages to call the fsfe-gettext template inside `search.en.xml`, that would be great.

One thing to note, however, is that the page is perfectly translatable. We can translate the content and create `search.{de,es,it,nl,fr}.xml` and so on. It's just harder.

> Also, I'm not sure whether /about, /news, and /activities are all that users expect. What about /freesoftware, /contribute, and so on?

Added /freesoftware and /contribute in fda4c822d8, and in d1fb0b571b I index all XHTML pages. This raises the number of indexed pages from 3013 to 3222, so I don't think it makes a noticeable difference performance-wise.

> > How many CPUs does the build server have? In index-website.py I spawn 4 processes. Depending on the build server hardware, we may want to increase that number, which would speed up the indexing :-)
>
> Two virtual cores. I hope that with the new cluster(s) we can increase the number to at least 4 or 6.

Maybe it would be better to create only 2 processes then. Can you please try changing the `n_processes` variable in `index-website.py` to see if it makes the indexing quicker?
vincent commented 5 months ago
Poster
Collaborator

> Ah, regarding echoing in index-website.py: The `* ` indentation was on purpose, to match what you see when running `make` and what's shown on [the status page](https://status.fsfe.org/fsfe.org/).

Sorry, I should have been more careful ^^' Fixed in c1ed6fc4e7
vincent commented 5 months ago
Poster
Collaborator

I also tried to improve performance in 00a9afcd11 and 88c41317a1.
Owner

I just made the form translatable.

By the way, I noticed two things regarding indexing:

* Only XHTML files from the third level (no/no/here) seem to be found. E.g. searching for "policy" should at least show `/activities/policy.html`. Is this because of the path `*/**/*.xhtml` in L72?
* I can still only seem to find stuff from news, about and activities. Perhaps because of the restriction to these folders in p.map() (L74)?
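
One possible explanation for the first point, offered as an assumption since the exact call on L72 is not shown here: in Python's `glob` module, `**` only matches across directory levels when `recursive=True` is passed. Without it, `**` behaves like a plain `*`, so `*/**/*.xhtml` matches files exactly three levels deep. The sketch below reproduces this on a throwaway directory tree; the file names and helper functions are invented for the example.

```python
import glob
import os
import tempfile


def make_tree(root):
    """Create one empty .xhtml file at each of three depths (made-up names)."""
    for rel in ("index.en.xhtml",
                "activities/policy.en.xhtml",
                "news/2020/news-1.en.xhtml"):
        path = os.path.join(root, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()


def find_xhtml(root, pattern, recursive):
    """Return sorted paths relative to root that match the glob pattern."""
    matches = glob.glob(os.path.join(root, pattern), recursive=recursive)
    return sorted(os.path.relpath(m, root).replace(os.sep, "/")
                  for m in matches)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        make_tree(root)
        # '**' acts like '*' here: only third-level files are found.
        print(find_xhtml(root, "*/**/*.xhtml", recursive=False))
        # '**' now spans zero or more directories below the first level.
        print(find_xhtml(root, "*/**/*.xhtml", recursive=True))
        # Dropping the leading '*/' also picks up files in the root itself.
        print(find_xhtml(root, "**/*.xhtml", recursive=True))
```

Note that `pathlib.Path.glob` always treats `**` recursively, so the behaviour depends on which API the script uses.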
Owner

> Maybe it would be better to create only 2 processes then. Can you please try changing the `n_processes` variable in `index-website.py` to see if it makes the indexing quicker?

It doesn't make any significant difference whether I use 2, 4 or 8 threads. So I'd keep it at four, considering that one of the first things I'd like to do with the new clusters is to move the build server to a VM with at least 4 vCores.
vincent commented 5 months ago
Poster
Collaborator

> By the way, I noticed two things regarding indexing:
>
> * Only XHTML files from the third level (no/no/here) seem to be found. E.g. searching for "policy" should at least show `/activities/policy.html`. Is this because of the path `*/**/*.xhtml` in L72?
> * I can still only seem to find stuff from news, about and activities. Perhaps because of the restriction to these folders in p.map() (L74)?

Fixed, sorry!
max.mehl merged commit e753e02fce into master 5 months ago
max.mehl deleted branch search-engine 5 months ago
Owner

Thanks a lot, @vincent and @max.mehl. Great job!

The pull request has been merged as e753e02fce.