Add a search functionality (Fixes #739 ) #1635
3 Participants
Reference: FSFE/fsfe-website#1635
Add a page fsfe.org/search.en.html.
To try this:
python3 tools/index-website.py from the website root directory
search/search.en.xhtml
TODO:
fsfe.org/search
./news
Title changed from "Add a search functionality (#739)" to "Add a search functionality (Fixes #739)"
Title changed from "Add a search functionality (Fixes #739)" to "WIP: Add a search functionality (Fixes #739)"
Cool! I haven't tried it out completely yet, but regarding one of your todos:
That should happen automatically. For instance, check out https://fsfe.org/activities/routers/. You can leave it blank, remove the trailing slash, or add routers.html or index.html. The build system takes care of all of these. It just doesn't work with the normal local preview.
Perfect!
I just played around with it. First of all: it's a really clever and slim idea, so thanks a lot!
Some feedback apart from your todos above, of course knowing that this is WIP:
/activities/routers
as well as the news items to this category.
Fixed with 69538ac9de :-)
The ranking algorithm is called BM25.
In a nutshell, the query is split into individual tokens (e.g. "free software" is split into ["free", "software"]). Then, for each token, the number of occurrences in the document, the document length, and the inverse document frequency of the token (derived from the number of documents in which the token appears; more details) are taken into account to create a score for that token and that document. The scores are then summed across the tokens composing the query, so you end up with a score for the query and the document. You repeat this for each document in the corpus and sort with the highest scores first :-) I take only the top 10 results.
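For readers who want to see the scoring spelled out, here is a minimal Python sketch of Okapi BM25 as described above. It is not the code used in this PR (the search itself runs in JavaScript in the browser). The function name bm25_rank, the whitespace tokenizer, and the parameter values k1=1.5 and b=0.75 are assumptions chosen for illustration, and the sketch expects a non-empty list of documents.

```python
import math
from collections import Counter


def bm25_rank(query, documents, k1=1.5, b=0.75, top_n=10):
    """Return (score, document_index) pairs, best first (illustrative only)."""
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents each token appears.
    df = Counter(token for doc in tokenized for token in set(doc))

    scores = []
    for index, doc in enumerate(tokenized):
        tf = Counter(doc)  # occurrences of each token in this document
        score = 0.0
        for token in query.lower().split():
            if token not in tf:
                continue
            # Rare tokens get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[token] + 0.5) / (df[token] + 0.5))
            # Long documents are penalised relative to the average length.
            norm = tf[token] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[token] * (k1 + 1) / norm
        scores.append((score, index))

    # Highest score first; keep only the top results.
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return scores[:top_n]


docs = [
    "free software is a matter of liberty",
    "router freedom news",
    "software patents in europe",
]
ranking = bm25_rank("free software", docs)
```

In this toy corpus the first document matches both query tokens, the third matches only "software", and the second matches neither, so it ends up last with a score of zero.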
So the ranking is purely text-based. Maybe we can sort the top 10 results by date?
Should be doable. I guess we can parse the current URL to know the language used by the user, and compare it with the URL of the search results. I would say that regardless of the date, we should filter results to only get the current language (if not English) and English results. That would improve the user experience IMO.
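A rough sketch of that filtering idea, written in Python for illustration only (the real search runs client-side in JavaScript). It assumes page paths end in a two-letter language code followed by .html, as in search.en.html; the helper names language_of and filter_by_language and the example paths are hypothetical.

```python
import re

# Assumption: paths look like /news/item.de.html, with no query string.
LANG_PATTERN = re.compile(r"\.([a-z]{2})\.html$")


def language_of(url, default="en"):
    """Extract the two-letter language code from a page path."""
    match = LANG_PATTERN.search(url)
    return match.group(1) if match else default


def filter_by_language(results, current_url):
    """Keep results in the visitor's language, plus English ones."""
    wanted = {language_of(current_url), "en"}
    return [url for url in results if language_of(url) in wanted]


results = ["/news/a.de.html", "/news/a.en.html", "/news/a.fr.html"]
filtered = filter_by_language(results, "/search/search.de.html")
```

With a German search page as the current URL, only the German and English results remain; the French one is dropped.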
Yes. For now I only index /news but that will be extended to /activities and any other pages that one would expect to find through the search engine.
Edit: added /about and /activities to the index in 4cde1bab99.
Uhm, it's not easy. We can't computationally tell if a tag is intuitive. I guess that all articles tagged with "swpat" would contain the word "patent" or its translation, so I'm not sure if it's an issue.
Thanks for the explanation! Yes, either sorting by date, or showing the date.
Another thought: would it work to show the type of the item? This would allow readers to distinguish between different types of resources.
Definitely, yes!
Great!
Ah, so you search for the full article text, or just the teaser? In any case I agree that users will probably find what they search for.
Yes, I definitely need to add contextual information. Either by adding the type (/news, /about, /activities and so on) or the complete URL. Displaying the date would be nice as well.
Currently I only search in the title and the tags. The reason for not adding full-text search is that it would make the index quite large (6.4M gzipped if we index the full text of /news, /about and /activities), and that would impact the download size of the index (every user downloads the index.js file) and the search performance. I could add some clever trick to only index relevant words from the full text, but I'm not sure that it would make the index size bearable. There's a trade-off between the usefulness of the search engine, the size of the data indexed and its performance.
The logic of only indexing the teaser would only work for /news.
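To make the size trade-off above easier to reproduce, here is a small Python sketch that compares the raw and gzip-compressed size of an index. The index layout (a list of dictionaries serialised as JSON) and the sample entries are assumptions for illustration; the actual index.js format in this PR may differ.

```python
import gzip
import json


def index_sizes(index):
    """Return (raw_bytes, gzipped_bytes) for a JSON-serialised index."""
    raw = json.dumps(index).encode("utf-8")
    return len(raw), len(gzip.compress(raw))


# Hypothetical sample entries, only here to have something to measure.
sample = [
    {
        "url": f"/news/item-{i}.en.html",
        "title": "Free Software news",
        "tags": ["policy"],
    }
    for i in range(200)
]
raw_size, gz_size = index_sizes(sample)
print(f"raw: {raw_size} bytes, gzipped: {gz_size} bytes")
```

The gzipped figure is the one that matters for visitors, assuming the web server serves the file compressed.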
@max.mehl and others, I also wonder if it's OK to only put a search box in the header. Should we add a "search" button <input type="submit"> next to the search box to submit the search? For a less tech-savvy audience, is it obvious that pressing enter after entering text in the textbox submits the search and leads to the search page?
Thanks for calculating the index size. When speaking of a teaser, I always mean the first paragraph, as this introduces the topic and will use the important terms – otherwise it would be a really bad teaser. The title could do so as well, but sometimes for the sake of a catchy title, one might omit that.
As an example, I am thinking about a text about software patents. It could be like the following:
In this case, only the teaser would produce a result when someone searches for "patent".
I would add an icon to make the purpose of the box clearer. E.g., what about this?
Yeah, I agree that it could be useful. But adding the teaser adds quite a lot of size to the index.
I have added the teaser in fb3fad84fd (the size of the index would now be 490K). Stopwords are removed for Dutch, English, French, German and Italian to make the field slimmer. Despite the larger index.js file, the search is still fast on my computer ™, though I don't know whether this will be problematic once live in production. This commit will not be squashed so it can be easily reverted if need be.
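As an illustration of the stopword removal mentioned above, here is a minimal Python sketch. The tiny word lists and the function name strip_stopwords are assumptions made for this example; they are not the lists or the code from fb3fad84fd.

```python
# Deliberately tiny, hypothetical stopword lists; real lists are much longer.
STOPWORDS = {
    "en": {"the", "a", "of", "and", "is", "in", "to"},
    "fr": {"le", "la", "les", "de", "et", "un", "une"},
}


def strip_stopwords(text, lang):
    """Drop common words so the indexed teaser field stays small."""
    stop = STOPWORDS.get(lang, set())
    return " ".join(word for word in text.lower().split() if word not in stop)


teaser = strip_stopwords("The history of software patents in Europe", "en")
```

Languages without a list are passed through unchanged apart from lowercasing, which keeps the sketch safe for translations that have no stopword list yet.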
I have made the search icon clickable in ef08fbf877.
Thanks a lot for the updates!
For me, the index is currently 1.5M, but search is also sufficiently fast, which of course can change if the index has to be loaded from the server.
I do wonder, however, why the search term "patents" leads to no results, although the phrase is in the index file?
Also, some questions/remarks:
Yes, me too. But I'm talking about the compressed size (the size which will be downloaded by the website visitors).
Fixed in 8122d772d8
Looks nice, but maybe it's a bit inconsistent with the rest of the items in the menu? "login", "change language" and "donate" have their respective icons on the left but with e1653bb0d1 the search icon is on the right of the text.
Sorry, I should have added the rationale behind the technical choices in the first comment of the PR. Anyway, here are the reasons why I went with JavaScript for the search functionality (in order of importance):
On the other hand, server-side has the following disadvantages:
The advantages of server-side:
So this is all about trade-offs. Of course, if the feature had been a central part of the website, present on every page, my point of view would have been completely different.
Regarding the search performance, this is hard to tell without doing both the client-side and server-side implementation with the same set of features.
What do you think?
Ah yes, good point. I've moved it in 0c6e55fdea, and we can surely think about just making it a blue search symbol without the blue background.
Thanks for the elaborate answer!
Well, it would be present on any page then ;)
I agree, and especially the server-performance and -security argument convinces me. And I didn't know that the slim library you used in the JS would not be so trivial to implement server-side.
So I'd say let's continue with the current approach. I can surely help with styling, and wiring the whole index to the build process somehow.
Done in 867be575cf, though I'm not sure if it's the best way to do it.
I've rebased to latest master because Drone had issues with auto-merging for some reason.
Thanks!
I've equipped the build server with python3-bs4, which should suffice to run index-website.py. I've also improved the layout and text a bit.
Also, there's now a step in the Makefile to run the index with every build. Costs us 30s on the current hardware, but it's the cleanest solution.
Thanks! In 2a5757fb23 I separate the results for news items from the results for /activities and /about. And I display the date for news items.
How many CPUs does the build server have? In index-website.py I spawn 4 processes. Depending on the build server hardware we may want to increase that number, which would speed up the indexation :-)
Also, once the pull request has been merged, I suggest adding the python3 and python3-bs4 packages to the list of packages needed to build the website.
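For context, spawning a fixed number of worker processes for indexing could look roughly like the Python sketch below. This is not the code from index-website.py: the functions index_page, choose_n_processes and build_index, the returned fields, and the example paths are all hypothetical.

```python
import os
from multiprocessing import Pool


def index_page(path):
    """Placeholder for the real work (parsing a page, extracting title and tags)."""
    return {"url": path, "title": path.rsplit("/", 1)[-1]}


def choose_n_processes(requested, available=None):
    """Never start more workers than there are cores (at least one)."""
    if available is None:
        available = os.cpu_count() or 1
    return max(1, min(requested, available))


def build_index(paths, n_processes=4):
    """Index all pages in parallel and return one entry per page."""
    with Pool(processes=choose_n_processes(n_processes)) as pool:
        return pool.map(index_page, paths)


if __name__ == "__main__":
    pages = ["/news/a.en.html", "/about/b.en.html"]
    print(build_index(pages, n_processes=4))
```

Capping the worker count at the number of available cores is one possible design choice; whether it actually helps depends on how much of the indexing time is spent on I/O rather than on the CPU.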
Title changed from "WIP: Add a search functionality (Fixes #739)" to "Add a search functionality (Fixes #739)"
Great changes, that's much more usable now!
Do you know how we can make the whole thing translatable? Currently, quite a lot is hardcoded as EN in the JavaScript. Could we somehow make use of our XSL gettext function (e.g. <xsl:call-template name="fsfe-gettext"><xsl:with-param name="id" select="'search/placeholder'" /></xsl:call-template>)?
Also, I'm not sure whether /about, /news, and /activities is all users expect. What about /freesoftware, /contribute and so on?
Two virtual cores. I hope that with the new cluster(s) we can increase the number to at least 4 or 6.
Ah, regarding echoing in index-website.py: The * indentation was on purpose to match what you see when running make, and what's shown on the status page.
Ideally, the output would look like:
I tried for quite some time to use XSL templates in search.en.xml, but to no avail :-( I'm far from an XSL master though, so if someone wants to step in and manages to call the fsfe-gettext template inside search.en.xml, that would be great.
One thing to note however is that the page is perfectly translatable. We can translate the content and create search.{de,es,it,nl,fr}.xml and so on. It's just harder.
Added /freesoftware and /contribute in fda4c822d8, and in d1fb0b571b I index all XHTML pages. This raises the number of indexed pages to 3222. It was 3013 before, so I don't think this creates a noticeable difference performance-wise.
Maybe it would be better to create only 2 processes then. Can you please try to change the n_processes variable in index-website.py to see if it makes the indexation quicker?
Sorry, I should have been more careful ^^' Fixed in c1ed6fc4e7
Also, I tried to improve performance in 00a9afcd11 and 88c41317a1.
I just made the form translatable.
By the way, I noticed two things regarding indexing:
/activities/policy.html. Is this because of the path */**/*.xhtml in L72?
It doesn't make any significant difference whether I use 2, 4 or 8 threads. So I'd keep it at four, considering that one of the first things I'd like to do with the new clusters is to move the build server to a VM with at least 4 vCores.
Fixed, sorry!
Thanks a lot, @vincent and @max.mehl. Great job!