Logging format for Gemini servers
Anna “CyberTailor”
cyber at sysrq.in
Sun Jul 18 23:22:49 BST 2021
Hello everyone, today I'd like to talk about access logs.
Almost every HTTP server uses the NCSA Common Log Format (or its
superset, the Combined Log Format). This is very cool, because
developers of miscellaneous utilities (like fail2ban or monitoring
tools) don't need to bother writing a log parser for each server.
## Example log entry
    .------------------------------ IP address of the client which made the request
    |     .------------------------ rfc1413 identity (always "-" in practice)
    |     |   .-------------------- authorized user ID (as in .htpasswd file)
    |     |   |               .---- datetime string [%d/%b/%Y:%H:%M:%S %z]
    |     |   |               |
    *     *   *               *
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
                                                            *                 *   *
                                                            |                 |   |
HTTP method, resource and protocol version -----------------.                 |   |
HTTP status code returned to the client --------------------------------------.   |
number of bytes of data transferred (without headers) ----------------------------.
References:
=> https://en.wikipedia.org/wiki/Common_Log_Format
=> https://publib.boulder.ibm.com/tividd/td/ITWSA/ITWSA_info45/en_US/HTML/guide/c-logs.html#common
=> https://www.loganalyzer.net/log-analyzer/apache-common-log.html
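To make the field layout above concrete, here is a minimal parsing sketch in Python. The regular expression and the name `parse_clf` are my own illustration, not taken from any particular server or tool, and the date parsing assumes an English (C) locale for the month abbreviation.

```python
import re
from datetime import datetime

# One named group per Common Log Format field (illustrative pattern).
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_clf(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Same layout as in the diagram: [%d/%b/%Y:%H:%M:%S %z]
    entry["date"] = datetime.strptime(entry["date"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # Some servers write "-" when no body was sent; treat that as zero bytes.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry

entry = parse_clf(
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)
print(entry["host"], entry["authuser"], entry["status"], entry["bytes"])
```

A single pattern like this is the whole point of a shared format: the same few lines would work for any server that writes it.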
## Adaptability
If you look at the Gophernicus code, it uses the Combined Log Format.
That is nice but confusing (seeing the "HTTP/1.0" string and HTTP status
codes in a Gopher server's log feels weird). Still, compatibility is
worth it.
I think the Common Log Format can be applied to Gemini too. The only
problem is that the format has no field for <META>. It also won't look
good in syslog, because syslog adds its own timestamp and every entry
ends up with two datetimes.
Let's review the syntax:
> host ident authuser date request status bytes
Everything is obvious except authuser. I suggest using the last 7
characters of the client certificate's SHA-1 hash (git's abbreviated
hashes have shown that 7 characters are enough).
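As a rough sketch of what such an entry could look like, the Python below fills the seven fields for a Gemini request. Everything here is an assumption made for illustration: the function names, logging the requested URL as the "request" field, and hashing the DER-encoded certificate (for example the bytes returned by `ssl.SSLSocket.getpeercert(binary_form=True)`). The certificate bytes in the example are a placeholder, not a real certificate.

```python
import hashlib
from datetime import datetime, timezone

def cert_authuser(cert_der):
    """Last 7 hex characters of the certificate's SHA-1 hash, or "-" without a certificate."""
    if not cert_der:
        return "-"
    return hashlib.sha1(cert_der).hexdigest()[-7:]

def format_entry(host, cert_der, when, url, status, size, ident="-"):
    """Build one line in 'host ident authuser date request status bytes' order."""
    date = when.strftime("%d/%b/%Y:%H:%M:%S %z")
    # Assumption: the Gemini request line (just the URL) goes into the request field.
    return f'{host} {ident} {cert_authuser(cert_der)} [{date}] "{url}" {status} {size}'

when = datetime(2021, 7, 18, 23, 22, 49, tzinfo=timezone.utc)
line = format_entry(
    "127.0.0.1",
    b"placeholder certificate bytes",  # stands in for real DER data
    when,
    "gemini://example.space/",
    20,
    2326,
)
print(line)
```

Note that <META> still has nowhere to go in this layout, which is the limitation mentioned above.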
## RFC 1413: Ident protocol
If you run a web server, you probably understand how useful the
User-Agent header is for identifying robots visiting your website.
Thankfully, Gemini doesn't require client identification, as there are
no compatibility issues between different Gemini clients. But that makes
learning anything about robots very hard for capsule operators :(
I'm grateful to Stéphane Bortzmeyer for including additional info in
his crawler's robots.txt requests:
> gemini://example.space/robots.txt?robot=true&uri=gemini://gemini.bortzmeyer.org/software/lupa/
I'd like to suggest another solution to this problem (so that we have 15
competing standards later).
Let's suppose Yuri runs a Gemini server and Sergei runs a Gemini search
engine *AND* an identd server, for example fakeidentd:
=> http://www.guru-group.fi/~too/sw/ A static, secure identd. One source file only!
Sergei's crawler makes a request to Yuri's server. Yuri's server sends
an ident query to Sergei's identd server, reads the response and writes
the access log entry. Yuri reads 'celestial-crawler' in the logs and
gets excited about his capsule getting indexed.
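The exchange described above could be sketched like this in Python. This is only an illustration under my own assumptions: the function names are made up, the reply is read with a single `recv`, and any error or timeout falls back to "-" so the log line stays valid. The query carries the port on the queried host (the crawler's source port) first and the port on the querying host (1965 for Gemini) second.

```python
import socket

IDENT_PORT = 113  # TCP port assigned to the Identification Protocol

def build_query(remote_port, local_port):
    """Ident query: port on the queried host first, then port on the querying host."""
    return f"{remote_port}, {local_port}\r\n".encode("ascii")

def parse_reply(reply):
    """Return the user ID from a USERID reply, or "-" for errors and malformed replies."""
    parts = reply.decode("ascii", errors="replace").strip().split(":", 3)
    if len(parts) == 4 and parts[1].strip() == "USERID":
        # Surrounding whitespace is dropped so the value fits a
        # space-delimited log field (a simplification for logging).
        return parts[3].strip() or "-"
    return "-"

def ident_lookup(remote_host, remote_port, local_port, timeout=2.0):
    """Ask the identd on remote_host who owns the connection; "-" on any failure."""
    try:
        with socket.create_connection((remote_host, IDENT_PORT), timeout=timeout) as sock:
            sock.sendall(build_query(remote_port, local_port))
            return parse_reply(sock.recv(1024))
    except OSError:
        return "-"

# Illustrative reply, as Sergei's identd might send it for source port 51234.
sample = b"51234, 1965 : USERID : UNIX : celestial-crawler\r\n"
print(parse_reply(sample))
```

The short timeout matters: most clients won't run an identd at all, and the server shouldn't hold up the response while waiting for one.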
Upsides:
* looks cool and fun
* opt-in
* actually standardized
* 'ident' field can be logged every time a request is made
* human visitors can leave their names in server logs so Geminispace
feels more comfy and personal
=> https://tvtropes.org/pmwiki/pmwiki.php/Main/KilroyWasHere
Downsides:
* identd probably won't work behind an ISP's NAT
* requires writing asynchronous or threaded server code to avoid
blocking the main thread (although separating the logger and listener
processes is a good idea anyway, as it's more secure)
* default fail2ban filters rely on 'ident' field always being "-"
What are your thoughts?
Feel free to ask questions 🙃