Search function sometimes misses actual teaser #2472

Closed
opened 2022-03-02 14:46:44 +00:00 by max.mehl · 4 comments
Owner

The search index misses some results because the indexer takes the first `<p>` element as the teaser, and that element is often not the actual teaser.

Example: [This search](https://fsfe.org/search/search.en.html?q=promotion) does not show [this page](https://fsfe.org/contribute/spreadtheword), but it should.

The entry of the spreadtheword page in the search index looks like the following:

{"url": "https://fsfe.org/contribute/spreadtheword.en.html", "tags": "", "title": "Spread the word", "teaser": "Contribute", "type": "page", "date": null}

"Contribute" is the first `<p>` element on this page (https://git.fsfe.org/FSFE/fsfe-website/src/commit/41cda35a26f825950081c8c3d66ad03a3519ee75/contribute/spreadtheword.en.xhtml#L12):

<p id="category"><a href="/contribute/">Contribute</a></p>

Possible solution: make the indexer ignore such paragraphs, including the second and third on this page, which are also not actual teasers. The teaser is built here (https://git.fsfe.org/FSFE/fsfe-website/src/commit/41cda35a26f825950081c8c3d66ad03a3519ee75/tools/index-website.py#L59):

"teaser": " ".join(
max.mehl added the bug, help wanted labels 2022-03-02 14:46:44 +00:00
Member

Thanks for reporting! I see two solutions:

  1. As you rightly said, the indexer currently assumes that the first <p> is the teaser. Maybe we can narrow that assumption down to specific subfolders or files? For example, we can probably assume that the first <p> is always the teaser for news items.

  2. Identify teasers with a 'teaser' CSS class

Option 1 will always lead to edge cases; option 2 will be more tedious but more rewarding.
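To make the two options concrete, here is a minimal sketch of both. It is not the code in `tools/index-website.py`: it assumes well-formed, un-namespaced XHTML parsed with the standard library's `xml.etree.ElementTree`, and the names `FIRST_P_PATTERNS`, `teaser_by_path` and `teaser_by_class`, the `news/*` pattern, the sample news path and the second paragraph of the sample page are all made up for illustration.

```python
import xml.etree.ElementTree as ET
from fnmatch import fnmatch

# Hypothetical: path patterns where the first <p> is trusted as the teaser.
FIRST_P_PATTERNS = ["news/*"]


def text_of(p):
    """Return all text below the element with whitespace collapsed."""
    return " ".join("".join(p.itertext()).split())


def teaser_by_path(path, root):
    """Option 1: use the first <p>, but only for trusted paths."""
    if not any(fnmatch(path, pattern) for pattern in FIRST_P_PATTERNS):
        return ""
    first = next(root.iter("p"), None)
    return text_of(first) if first is not None else ""


def teaser_by_class(root):
    """Option 2: use the first <p> that carries a 'teaser' class."""
    for p in root.iter("p"):
        if "teaser" in p.get("class", "").split():
            return text_of(p)
    return ""


# Sample page: the first <p> is from the report, the second is invented.
page = ET.fromstring(
    '<html><body>'
    '<p id="category"><a href="/contribute/">Contribute</a></p>'
    '<p class="teaser">Help us promote Free Software.</p>'
    '</body></html>'
)
print(teaser_by_path("contribute/spreadtheword.en.xhtml", page))
print(teaser_by_path("news/2022/example.en.xhtml", page))
print(teaser_by_class(page))
```

With option 1 the contribute page simply gets no teaser unless its path is listed, which is where the edge cases come from. Option 2 finds the right paragraph but only after every page has been marked up.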
Author
Owner

Thanks. I lean towards option 1, although I see the problem of having to exclude all these edge cases. Option 2 makes things harder for editors, and we already have a number of ids and classes for introductions and such.

@reinhard, what's your take on this?

Member

I think the best solution would be to explicitly exclude some paragraphs based on id and/or class and then take the first paragraph of what is left.

In case we implement #1348, we could add a rule that a paragraph formatted with class="lead" is the teaser with the highest priority.

Author
Owner

Sounds good. So we would tackle this from two sides:

  1. Exclude obvious false-positive teasers based on id, class (or length?)
  2. When #1348 is done, this lead class would overrule the heuristics from the first point.

Right?

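The two-sided plan could look roughly like the sketch below. It is only an illustration under assumptions, not the actual indexer: it parses well-formed, un-namespaced XHTML with the standard library's `xml.etree.ElementTree`, and `EXCLUDED_IDS`, `EXCLUDED_CLASSES`, `MIN_LENGTH` and `pick_teaser` are hypothetical names and values. Only the `category` id and the `lead` class come from this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical exclusion rules. "category" is the id from the reported page;
# the class set and the length threshold are placeholders to be tuned.
EXCLUDED_IDS = {"category"}
EXCLUDED_CLASSES = set()
MIN_LENGTH = 20


def text_of(p):
    """Return all text below the element with whitespace collapsed."""
    return " ".join("".join(p.itertext()).split())


def pick_teaser(root):
    """Pick a teaser: class="lead" wins, otherwise the first <p> left over
    after excluding paragraphs by id, class or length."""
    paragraphs = list(root.iter("p"))

    # Point 2: a paragraph with class="lead" overrules the heuristics.
    for p in paragraphs:
        if "lead" in p.get("class", "").split():
            return text_of(p)

    # Point 1: skip obvious false positives, take the first remaining one.
    for p in paragraphs:
        if p.get("id") in EXCLUDED_IDS:
            continue
        if EXCLUDED_CLASSES & set(p.get("class", "").split()):
            continue
        text = text_of(p)
        if len(text) < MIN_LENGTH:
            continue
        return text
    return ""
```

Checking for `lead` first means the exclusion lists only need to cover pages that have not been marked up yet, so they can stay short.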
vincent self-assigned this 2022-08-22 16:25:53 +00:00
Reference: FSFE/fsfe-website#2472