Why is the output of case folding different from what I expected?

use feature 'fc';
use utf8;
use open ':std', ':encoding(UTF-8)';
my $char = "İ";
my $folded = fc($char);
print "fc(İ): $folded\n";
print "Hex values: ", join(" ", map { sprintf "%02X", ord($_) } split //, $folded), "\n";
The above code was generated by ChatGPT, which said the output should be:
fc(İ): i
Hex values: 69
but on both of my laptops (macOS with Perl v5.40.2 and Ubuntu with Perl v5.38.2) the output is:
fc(İ): i̇
Hex values: 69 307
I'm just three days into learning Perl and I'm curious whether any additional configuration is necessary.
I tried using Perl’s fc() function to case-fold the character İ. I expected it to return "i", but instead it returned "i̇" (an i with a combining dot above). I’m confused because I thought fc would behave like Unicode case folding. Is there something I need to configure?
Answer
You have
- U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
Unicode 14.0.0 defines its fold as follows:
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
The two entries mean there are two ways to fold this code point.
The first produces a result longer than the original string, as indicated by “F” for “full case folding”. This is the preferred folding where possible. Implementations that can’t handle a folding that grows the string fall back to the non-growing alternative, normally denoted by an “S” for “simple case folding”.
But here we have “T” instead of “S” because the character in question is very exceptional. Of “T”, the standard says:
special case for uppercase I and dotted uppercase I
- For non-Turkic languages, this mapping is normally not used.
- For Turkic languages (tr, az), this mapping can be used instead of the normal mapping for these characters. Note that the Turkic mappings do not maintain canonical equivalence without additional processing. See the discussions of case mapping in the Unicode Standard for more information.
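The two entries quoted above can also be inspected from Perl itself. The following is a minimal sketch using the core Unicode::UCD module; it assumes the `full` and `turkic` fields of the hash returned by `casefold()` hold space-separated hex code points, as that module's documentation describes. The variable names are only illustrative.

```perl
use strict;
use warnings;
use Unicode::UCD 'casefold';

# Look up the case-folding data that this Perl ships for U+0130.
my $entry = casefold(0x0130);

# 'full' and 'turkic' are strings of space-separated hex code points;
# convert them to lists of numeric code points.
my @full   = map { hex($_) } split ' ', $entry->{full};
my @turkic = map { hex($_) } split ' ', $entry->{turkic};

printf "full:   %s\n", join ' ', map { sprintf 'U+%04X', $_ } @full;
printf "turkic: %s\n", join ' ', map { sprintf 'U+%04X', $_ } @turkic;
```

The `full` line should correspond to the “F” entry and the `turkic` line to the “T” entry.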
Perl supports foldings that are larger than the original, and Perl's fc is not language-aware, so it returns the following, as mandated by the Unicode spec:
- U+0069 LATIN SMALL LETTER I
- U+0307 COMBINING DOT ABOVE
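If the Turkic behaviour is what you actually want, one possible workaround is to apply the “T” mappings yourself before calling fc. The sketch below is only an illustration: `fc_turkic` is a hypothetical helper, not a Perl built-in. It covers just the two “T” mappings in CaseFolding.txt (U+0130 to U+0069, and U+0049 to U+0131), and it does nothing about the canonical-equivalence caveat mentioned in the standard's note.

```perl
use strict;
use warnings;
use feature 'fc';

# Hypothetical helper, not part of Perl: apply the two Turkic "T" mappings
# (U+0130 -> U+0069 and U+0049 -> U+0131), then fold everything else with fc.
sub fc_turkic {
    my ($str) = @_;
    $str =~ tr/\x{0130}\x{0049}/\x{0069}\x{0131}/;
    return fc($str);
}

my $default = fc("\x{0130}");         # language-neutral full folding
my $turkic  = fc_turkic("\x{0130}");  # Turkic-style folding

# Print the code points of each result in hex.
for my $pair ([ 'fc', $default ], [ 'fc_turkic', $turkic ]) {
    my ($label, $value) = @$pair;
    printf "%-10s %s\n", "$label:",
        join ' ', map { sprintf '%04X', ord($_) } split //, $value;
}
```

Note that a helper like this also turns a plain ASCII “I” into dotless “ı”, so it is only appropriate for text known to be Turkish or Azerbaijani.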