The lang parameter to text/gemini
solderpunk
solderpunk at SDF.ORG
Thu May 28 22:15:35 BST 2020
On Thu, May 28, 2020 at 04:51:56PM -0400, Natalie Pendragon wrote:
> I agree with this in principle (i.e., as a guidepost for good user
> experience in a search engine), but there can be complicating factors
> in practice. In particular, some common and generally effective text
> indexing processes involve things like Porter stemming words (so
> "stemmed" and "stemming" would both get indexed as something like
> "stem") and removal of "stop words" (and, in, the...). As you might
> imagine, both of these operations are specific to language.
>
> So, simply in creating an index of Geminispace, there might already be
> an assumed "default" language. In the case of GUS, this is English. I
> don't stop GUS from indexing any non-English content currently, but
> the quality of indexing is lower for other languages. Operations like
> the above (Porter stemming and removing stop words) will simply be
> no-ops.
>
> And then the other side of this experience is that when a user types
> in a search query, that also goes through the same process - the query
> is Porter stemmed, stripped of its stop words, then shuttled off to
> the TF-IDF index to find and score the actual matches.
Thanks for shedding some light on the processing that happens behind the
scenes in GUS! Language declaration is even more important to search
engines than I had realised. Once we've got this specced we will have
to really encourage the authors of servers to make it possible for users
to control this parameter, and encourage the folks at non-English
servers to use it!
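As a rough illustration of the pipeline Natalie describes above, here is a
minimal Python sketch. Everything in it is an assumption made for
illustration and not GUS's actual code: the stop-word list is a tiny sample,
`crude_stem` is a crude suffix stripper standing in for real Porter stemming,
and the scoring is one simple TF-IDF variant. Documents and queries go
through the same `normalise` step, and for any language other than English
the stemming and stop-word steps are no-ops.

```python
import math
import re
from collections import Counter

# Assumption: a tiny sample stop-word list, not the one GUS uses.
ENGLISH_STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def crude_stem(word):
    """Strip a few English suffixes (a stand-in for Porter stemming)."""
    stripped = False
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            stripped = True
            break
    # "stemm" -> "stem": undo consonant doubling left by the suffix.
    if stripped and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


def normalise(text, lang="en"):
    """Tokenise, then stem and drop stop words for English only."""
    tokens = re.findall(r"\w+", text.lower())
    if lang != "en":
        # No stemmer or stop-word list for this language: both are no-ops.
        return tokens
    return [crude_stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS]


def build_index(docs, lang="en"):
    """Map each document name to a Counter of its normalised terms."""
    return {name: Counter(normalise(text, lang)) for name, text in docs.items()}


def search(query, index, lang="en"):
    """Return matching document names, best TF-IDF score first."""
    terms = normalise(query, lang)  # same processing as at index time
    n_docs = len(index)
    scores = {}
    for name, counts in index.items():
        total = sum(counts.values())
        score = 0.0
        for term in terms:
            doc_freq = sum(1 for c in index.values() if term in c)
            if doc_freq == 0 or counts[term] == 0:
                continue
            tf = counts[term] / total
            idf = math.log(1 + n_docs / doc_freq)
            score += tf * idf
        if score > 0:
            scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "a": "Stemming the words in an index",
    "b": "Gemini capsules and gopher holes",
}
index = build_index(docs)
results = search("stemmed", index)
```

Because the query "stemmed" and the document word "Stemming" both reduce to
the same stem, the query matches document "a" even though the surface forms
differ. Passing a `lang` other than "en" skips that reduction, which is why
indexing quality drops for other languages.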
> What does directionality mean in this context?
Left-to-right vs right-to-left vs top-to-bottom, etc. I don't think
this will actually be relevant for search at all?
> In terms of how the above all sounds:
> - I support the addition of the document-level lang parameter, which
> can accept multiple values, to the spec.
> - I do NOT support the addition of a default document-level lang value
> to the spec.
> - I do NOT support the addition of a line-level lang parameter.
Thanks for the nice, clear summary!
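For client and crawler authors, here is a minimal sketch of reading such a
document-level parameter from a response's MIME type. The comma-separated
syntax for multiple values and the function name `parse_lang` are
assumptions for illustration, since the exact syntax is not settled in this
message. In line with the summary above, no default language is assumed when
the parameter is absent.

```python
def parse_lang(meta):
    """Return the language tags from a MIME type string.

    Assumes a form like 'text/gemini; lang=en,fr' (hypothetical syntax).
    Returns an empty list when no lang parameter is present.
    """
    _mime_type, *params = [part.strip() for part in meta.split(";")]
    for param in params:
        key, sep, value = param.partition("=")
        if sep and key.strip().lower() == "lang":
            value = value.strip().strip('"')
            return [tag.strip() for tag in value.split(",") if tag.strip()]
    return []  # no parameter given, so no language is assumed


tags = parse_lang("text/gemini; lang=en,fr")
```

A search engine could use the returned list to pick a stemmer and stop-word
list per document, and fall back to language-neutral indexing when the list
is empty.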
Cheers,
Solderpunk