The lang parameter to text/gemini

Natalie Pendragon natpen at natpen.net
Thu May 28 21:51:56 BST 2020


On Thu, May 28, 2020 at 06:43:21PM +0000, solderpunk wrote:
> As far as I recall, nobody actually objected to this as something we
> should do in principle, instead we just got distracted by various edge
> cases.  But I guess I may as well ask now: does anybody think this is a
> *bad* idea?

Nope, I think it's a nice addition and not a bad idea at all! Low
extensibility, high value for the two use cases you described (screen
readers and search engines).

On Thu, May 28, 2020 at 06:43:21PM +0000, solderpunk wrote:
> I was, and am, opposed to putting a default language in the spec.

Agreed.

On Thu, May 28, 2020 at 06:43:21PM +0000, solderpunk wrote:
> The case of search engines is trickier, since their resulting database
> does not have just one user but many.  This was where autodetection
> first came up, which some people seemed to get carried away with.  Fully
> generalised autodetection of language is computationally expensive and
> it gives answers with some uncertainty.  A large search engine project
> *may* want to think about it - the idea of clients for human users
> doing it as a routine response to a lack of a lang parameter is nuts.

Agreed.

On Thu, May 28, 2020 at 06:43:21PM +0000, solderpunk wrote:
> A simpler option for search engines might simply be to interpret a user
> request of "only show me results in languages X" as "don't show results
> *known* to be in languages other than X".  i.e. documents for which the
> language is not known are always possible search results.  This is
> imperfect, but, well, sometimes life is.
>
> In short, I am not sure that the lack of specified default behaviour is
> a good reason not to go ahead with this.
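
The filtering rule quoted above could be sketched roughly as follows.
The result shape (a dict with a "lang" key that is None when the
language is unknown) and the function name are hypothetical, chosen
only for illustration; this is not how any existing search engine
stores its data.

```python
def language_filter(results, wanted):
    """Drop only results *known* to be in a language outside `wanted`.

    Each result is assumed (hypothetically) to be a dict whose "lang"
    value is a language tag string, or None when no language is known.
    """
    wanted = {tag.lower() for tag in wanted}
    return [
        r for r in results
        if r["lang"] is None or r["lang"].lower() in wanted
    ]


results = [
    {"url": "gemini://example.org/a", "lang": "en"},
    {"url": "gemini://example.org/b", "lang": "fi"},
    {"url": "gemini://example.org/c", "lang": None},  # language unknown
]
kept = [r["url"] for r in language_filter(results, ["en"])]
```

Documents with no known language always stay in the result set, which
is the "imperfect but simple" behaviour described in the quote.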

I agree with this in principle (i.e., as a guidepost for good user
experience in a search engine), but there can be complicating factors
in practice. In particular, some common and generally effective text
indexing processes involve things like Porter stemming of words (so
"stemmed" and "stemming" would both get indexed as something like
"stem") and removal of "stop words" ("and", "in", "the", ...). As you
might imagine, both of these operations are language-specific.
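
To make the two operations concrete, here is a minimal sketch. The
stemmer is a deliberately crude toy that strips two English suffixes;
it is NOT the real Porter algorithm, and the stop word set holds only
the three examples mentioned above. All names are made up for
illustration.

```python
# Toy illustration only: NOT the real Porter stemming algorithm.
STOP_WORDS = {"and", "in", "the"}  # just the three examples from the text


def toy_stem(word):
    """Strip "ing" or "ed" as a crude stand-in for Porter stemming."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "stemm" -> "stem".
            if word[-1] == word[-2] and word[-1] not in "aeiou":
                word = word[:-1]
            break
    return word


def index_terms(text):
    """Lowercase, split on whitespace, drop stop words, stem the rest."""
    words = text.lower().split()
    return [toy_stem(w) for w in words if w not in STOP_WORDS]


english = index_terms("stemmed and stemming in the index")
german = index_terms("der Hund und die Katze")
```

With these toy English-only rules, the English sentence collapses to
three terms while the German one passes through with nothing removed
and nothing stemmed.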

So, simply in creating an index of Geminispace, there might already be
an assumed "default" language. In the case of GUS, this is English. I
don't currently stop GUS from indexing non-English content, but the
quality of indexing is lower for other languages. Operations like
the above (Porter stemming and removing stop words) will simply be
no-ops.

And then the other side of this experience is that when a user types
in a search query, it goes through the same process: the query is
Porter stemmed, stripped of its stop words, then shuttled off to the
TF-IDF index to find and score the actual matches.
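
A rough sketch of that symmetry, with the same analysis function
applied to documents at index time and to the query at search time.
This is not GUS's actual code: the function names, the in-memory
index, and the particular TF-IDF weighting (raw term count times
log(N/df) + 1) are all assumptions, and stemming is left out to keep
the example short.

```python
import math
from collections import Counter

STOP_WORDS = {"and", "in", "the"}  # illustrative, English-only


def analyze(text):
    """Shared analysis step: lowercase, split, drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_index(docs):
    """docs maps doc id -> text. Returns (term counts, document frequencies)."""
    tf = {doc_id: Counter(analyze(text)) for doc_id, text in docs.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    return tf, df


def search(query, tf, df):
    """Score documents by summed TF-IDF over the analyzed query terms."""
    n_docs = len(tf)
    scores = {}
    for term in analyze(query):  # the query goes through the same process
        if term not in df:
            continue
        idf = math.log(n_docs / df[term]) + 1.0
        for doc_id, counts in tf.items():
            if term in counts:
                scores[doc_id] = scores.get(doc_id, 0.0) + counts[term] * idf
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "a": "the gemini protocol",
    "b": "gemini search in gemini space",
    "c": "the gopher protocol",
}
tf, df = build_index(docs)
hits = search("the Gemini", tf, df)
```

Because the stop word list is baked into `analyze`, the index and the
query path share the same implicit language, which is the practical
"default" described above.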

For what it's worth, I do not believe any of what I've written here is
an argument for adding a default language to the spec. That, to me,
feels solidly outside the appropriate scope of the spec. But, if we're
talking search engines, there's probably going to end up being a
default language in practice for any search engine based on mainstream
full-text search approaches.

On Thu, May 28, 2020 at 06:43:21PM +0000, solderpunk wrote:
> The second question was what to do when a document contains text in
> multiple languages.  This is a trickier question.  I'd prefer not to
> define a new line type to handle it.  We could at least allow the lang
> parameter to accept multiple values separated by some delimiter.

Agreed! I think the power-to-weight ratio of a line-specific lang
value is too low. Allowing multiple lang values at the document level
feels like a nice balance though.
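
As a sketch of what a multi-valued document-level parameter might look
like to a client: the delimiter has not been settled in this thread,
so the comma used below is purely an assumption, as are the function
name and the example meta strings.

```python
def parse_langs(meta):
    """Return the language tags from a MIME-style meta string.

    Assumes (hypothetically) a comma-separated lang parameter, e.g.
    "text/gemini; lang=en,fr". Returns [] when no lang parameter is
    present; no default language is filled in.
    """
    _mime_type, *params = [part.strip() for part in meta.split(";")]
    for param in params:
        name, _, value = param.partition("=")
        if name.strip().lower() == "lang":
            tags = value.strip().strip('"').split(",")
            return [tag.strip() for tag in tags if tag.strip()]
    return []


two = parse_langs("text/gemini; lang=en,fr")
none = parse_langs("text/gemini")
```

An absent parameter yields an empty list rather than a guessed value,
in keeping with not putting a default language in the spec.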

On Thu, May 28, 2020 at 06:43:21PM +0000, solderpunk wrote:
> There's also the question of directionality, which I think might require
> a separate parameter entirely.  But let's focus on the language thing
> for now.  How does the above sound to people?

What does directionality mean in this context?

In terms of how the above all sounds:
- I support the addition of the document-level lang parameter, which
  can accept multiple values, to the spec.
- I do NOT support the addition of a default document-level lang value
  to the spec.
- I do NOT support the addition of a line-level lang parameter.

Natalie

