Be aware of your Locale
I happened to reinstall my PC recently. While I set things up to be pretty much out of the box, I left one glaring omission which came back to haunt me (for an afternoon or two, after which I devised and quickly deployed a fix).
Note: This post mostly follows the same path I took to investigate, except for a couple of dead ends before I "figured out" the solution.
1. The problem
I was writing my previous article about getting LASIK when I suddenly noticed that all the Euro signs in the article had been replaced with "???". This perturbed me quite a bit, chiefly because I found it hard to believe that even a simple SSG like Haunt would fail to handle special characters properly.
And yet "that's exactly what happened" (obviously not, but bear with me for a moment): The encoding of the source code was correct. The parsed SXML was correct. The final HTML was not correct.
Now, I'm not exactly privy to the inner workings of Haunt. I have dabbled in its source, mostly to figure out defaults, but I haven't really dug deep. Because of that, I only had vague ideas where to start looking.
Based on the documentation for Readers, I knew that whichever format you write in originally, Haunt will first parse it into SXML, then render that SXML into HTML and write it to disk.
My first assumption was that it must have been one of the extensions I use. Ox-html stable IDs doesn't touch the parsed output in the slightest, and is so simple that I figured it was unlikely to be the cause.
I'm also using Jakob L. Kreuze's Org-mode reader for Haunt, which is quite a bit larger and more complex. But, after adding a debug print line and checking the output, the generated SXML was fine.
Okay, so the extensions are fine. Then it must be Haunt. By grepping the source code for sxml, I found a file called html.scm, which contains the following:
(define* (sxml->html tree #:optional (port (current-output-port)))
  "Write the serialized HTML form of @var{tree} to @var{port}."
  (define (write-escaped-string str)
    (define (escape ch)
      (case ch
        ((#\") (put-string port "&quot;"))
        ((#\&) (put-string port "&amp;"))
        ((#\<) (put-string port "&lt;"))
        ((#\>) (put-string port "&gt;"))
        (else (put-char port ch))))
    (string-for-each escape str))
  ;; <... The rest elided because it's not relevant ...>
Well, that was easy, right? The code just isn't escaping "€" correctly.
Except, no, that explanation doesn't track in the slightest due to the
else clause of that function. Haunt only escapes these four special
cases, everything else is passed through as-is.
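To make that pass-through behaviour concrete, here is a rough shell re-creation of those four escaping rules (my own sketch, not Haunt's actual code; it only assumes a POSIX sed):

```shell
# A sketch of Haunt's escaping rules as a sed pipeline: exactly four
# characters are rewritten, and & goes first so the entities produced
# by the later substitutions aren't double-escaped.
escape_html() {
  sed -e 's/&/\&amp;/g' \
      -e 's/</\&lt;/g' \
      -e 's/>/\&gt;/g' \
      -e 's/"/\&quot;/g'
}

# The Euro sign falls into the "else" bucket and passes through untouched:
printf '%s' '€ < "tags" & such' | escape_html
```

This prints € &lt; &quot;tags&quot; &amp; such — the markup characters become entities, the Euro sign comes out exactly as it went in.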
At this point I figured, well, I might as well try out this function.
I fired up Geiser, Emacs's integrated Scheme REPL. Because I'm also using direnv and its accompanying Emacs extension, emacs-direnv, it automatically picked up the right version of Guile, along with the right environment variables to import Haunt's modules, from the .envrc file found in the root of my site repository.
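As an aside, there's nothing magical about that file; a minimal, entirely hypothetical .envrc (the real paths depend on how Guile and Haunt are installed, and mine differs) might look like:

```shell
# Hypothetical .envrc -- direnv sources this on entering the directory
# and exports the variables into your environment. Paths are placeholders.
export GUILE_LOAD_PATH="/path/to/haunt/share/guile/site/3.0:$GUILE_LOAD_PATH"
export PATH="/path/to/haunt/bin:$PATH"
```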
As my first attempt, I called the function with a very simple SXML tree:
scheme@(guile-user)> (use-modules (haunt html))
scheme@(guile-user)> (sxml->html-string '(html "€"))
$6 = "<html>\ufffd\ufffd\ufffd</html>"
It sure seems like sxml->html[1] is the problem, huh? Let's
simplify it even further:
scheme@(guile-user)> (sxml->html-string "€")
$7 = "\ufffd\ufffd\ufffd"
With a sudden stroke of inspiration, I figured what if we drop sxml->html?
scheme@(guile-user)> "€"
$8 = "\ufffd\ufffd\ufffd"
A-ha! So the problem is in fact Guile, not Haunt. On one hand, this was good, because I could rule out an entire project. On the other hand, I had now fallen into the even bigger bog of figuring out what Guile was doing wrong.
One thing that struck me as odd was that not only were the files written to disk mangled, the little arrows Haunt usually prints when copying files were also replaced by question marks.
copy assets/imgs/2026-03-10-lasik/drops.avif → /assets/imgs/2026-03-10-lasik/drops.avif
This already gave me a hunch, but the Unicode code point \ufffd was also a glowing hint at what was going on. This character is called the Replacement Character, because it is what text-rendering software shows when a character code cannot be properly decoded.
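Incidentally, the fact that there were three replacement characters is itself a clue: in UTF-8 the Euro sign is encoded as three bytes, and a decoder that cannot make sense of them substitutes U+FFFD for each byte separately. You can see the bytes for yourself with od:

```shell
# The Euro sign occupies three bytes in UTF-8 (e2 82 ac). A decoder
# stuck in a single-byte locale mangles each byte individually, which
# is why one "€" came back as three replacement characters.
printf '€' | od -An -tx1
```

Three bytes in, three replacement characters out.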
Okay, so not even the terminal was able to encode things. Since I'm using Alacritty at the moment, which is a very modern terminal emulator, the idea that it didn't support something as basic as a Euro sign seemed impossible.
2. The solution
And that was when I finally realized the issue: My locale was busted.
If you're using a mainstream distro (and haven't done anything overly fancy), you might not even know what a locale is. In a nutshell, it's a set of rules that defines how locale-aware applications format things that differ from language to language. Examples include dates (YYYY-MM-DD or DD-MM-YYYY?), decimal separators (1.5 or 1,5?), money, and so on.
All of these rules are stored in locale files, which usually come pre-generated with your distro. However, if you happen to be using Arch (or one of its derivatives), you have to manually specify which locales you intend to use and then generate them, before they become usable.
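On Arch, the dance looks roughly like this (a sketch using en_US; the file and tool names are the standard glibc ones, and root is required):

```shell
# 1. Uncomment the locale you want in /etc/locale.gen:
sudo sed -i 's/^#en_US.UTF-8 UTF-8/en_US.UTF-8 UTF-8/' /etc/locale.gen

# 2. Compile the enabled locales:
sudo locale-gen

# 3. Verify it is now available:
locale -a | grep -i 'en_US'
```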
If you forget to do this (like I have), your system will fall back to the "C/POSIX Locale". Here's a great quote I found on Stack Exchange to explain what it does (emphasis mine):
Stéphane Chazelas: The C locale is a special locale that is meant to be the simplest locale. You could also say that while the other locales are for humans, the C locale is for computers.
In the C locale, characters are single bytes, the charset is ASCII (well, is not required to, but in practice will be in the systems most of us will ever get to use), […] and things like currency symbols are not defined.
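Guile aside, you can watch this single-byte worldview in action with any locale-aware tool; wc -m counts characters according to LC_CTYPE (the second line assumes a UTF-8 locale such as C.UTF-8 is available, which recent glibc ships by default):

```shell
# In the C locale every byte is one "character", so the Euro sign's
# three UTF-8 bytes count as three:
printf '€' | LC_ALL=C wc -m        # -> 3

# Under a UTF-8 locale the same bytes decode to a single character
# (prints 1, provided C.UTF-8 is actually generated on your system):
printf '€' | LC_ALL=C.UTF-8 wc -m
```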
This explained everything. Now I only had to figure out how to fix it.
What followed was a long (and ultimately pointless) wild-goose chase: I first checked how Guile handles locales, then how to make sure the locale is actually loaded, then made sure my terminal was set to (what I thought was) the right LANG value, and so on.
Nothing seemed to help, until I realized I hadn't checked the most obvious thing: what language/region is set in KDE. I opened the settings, went to the right category, and found it was set to British English, or en_GB.UTF-8.
I did indeed want my PC to be using an English locale, but I only had
en_US.UTF-8 and hu_HU.UTF-8 (my native language) enabled in my
locale.gen file. Somehow, I completely failed to enable
en_GB.UTF-8!
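This failure mode, the environment asking for a locale that was never generated, is easy to check for. A rough diagnostic (assuming a glibc system, where locale -a lists the compiled locales) might be:

```shell
# Compare what the environment requests with what is actually compiled.
# Note: locale -a normalizes names (it prints "en_US.utf8", not
# "en_US.UTF-8"), so we match on the lang_REGION prefix only.
echo "LANG is: ${LANG:-<unset>}"
prefix="${LANG%%.*}"
if locale -a | grep -qi "^${prefix:-C}"; then
  echo "requested locale appears to be generated"
else
  echo "requested locale is missing -- time to run locale-gen"
fi
```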
With some shaking of my head, I changed my locale to en_US.UTF-8,
rebooted my PC, regenerated my site, and guess what, everything was
working nicely.
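For completeness, on a systemd-based setup the whole fix can be captured in two commands (a sketch; localectl writes LANG to /etc/locale.conf, which you can also edit by hand):

```shell
# Make sure the desired locale is uncommented in /etc/locale.gen,
# then compile it and set it system-wide (needs root):
sudo locale-gen
sudo localectl set-locale LANG=en_US.UTF-8
# Log out (or reboot) so every session picks up the new environment.
```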
Moral of the story? If your applications are behaving in silly ways when it comes to Unicode, your first assumption shouldn't be that they're faulty, but that there's a misconfiguration somewhere along the chain.
I also must emphasise the boon that is FOSS again. Had Haunt and Guile been closed source programs, I probably would have had a far rougher adventure figuring out what's set badly… or I would have had to abandon using them in the first place.
Thanks for reading!
Footnotes:
[1] sxml->html-string is basically just a wrapper around sxml->html that collects the output into a string instead of writing it to a port.
