Saturday, February 20, 2010

Perl HTML::Template and UTF-8 Unicode

HTML::Template gives you no way to say what encoding a template is in:


#!/usr/bin/perl -w
use strict;
use Encode;
use HTML::Template;
my $template = HTML::Template->new(
    filehandle => *DATA,
);
print Encode::encode('UTF-8', $template->output);
__DATA__
¡™£¢∞§¶•ªº

prints a double-encoded mess rather than ¡™£¢∞§¶•ªº (or something like that!)

In the example above, this makes sense, since we're reading from an open filehandle (even if it's only our magical DATA) that we never put an encoding layer on. That's easy to fix:



#!/usr/bin/perl -w
use strict;
use Encode;

binmode DATA, ':encoding(UTF-8)';

use HTML::Template;
my $template = HTML::Template->new(
    filehandle => *DATA,
);
print Encode::encode('UTF-8', $template->output);
__DATA__
¡™£¢∞§¶•ªº

prints ¡™£¢∞§¶•ªº, yay!
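To see why the encoding layer makes the difference, here's a minimal sketch using nothing but Encode (no HTML::Template involved). The variable names are mine, and the three-character string is just a shortened stand-in for the test string above. It compares encoding raw bytes a second time with decoding them first, which is what the :encoding(UTF-8) layer does for us.

```perl
#!/usr/bin/perl -w
use strict;
use Encode;

# The text we expect, as a proper character string (shortened test string).
my $chars = "\x{a1}\x{2122}\x{a3}";

# What a filehandle with no encoding layer hands back: raw UTF-8 bytes.
my $bytes = Encode::encode('UTF-8', $chars);

# Encoding those bytes again treats every byte as a Latin-1 character,
# so each non-ASCII byte turns into two bytes: double encoding.
my $double = Encode::encode('UTF-8', $bytes);

# Decoding first, then encoding, gets us back exactly what we started with.
my $round_trip = Encode::encode('UTF-8', Encode::decode('UTF-8', $bytes));

printf "characters: %d, bytes: %d, double-encoded bytes: %d\n",
    length $chars, length $bytes, length $double;
print "round trip matches the original bytes\n" if $round_trip eq $bytes;
```

The byte count doubling is the giveaway: nothing was lost, it was just encoded twice.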


This also works if we want to just pass a reference to a scalar to HTML::Template:


#!/usr/bin/perl -w
use strict;
use Encode;
my $content = "\x{a1}\x{2122}\x{a3}\x{a2}\x{221e}\x{a7}\x{b6}\x{2022}\x{aa}\x{ba}";
use HTML::Template;
my $template = HTML::Template->new(
    scalarref => \$content,
);
print Encode::encode('UTF-8', $template->output);

prints ¡™£¢∞§¶•ªº, yay!

This doesn't work if we just give it the name of a template file. That's a shame, because passing a filename is really useful: HTML::Template can search through a file structure (or at least an array of directories, via its path option) looking for the file.

And this is where encoding madness begins.

'Cause I know what you're thinking: just treat HTML::Template's output like information that's coming from outside your program (since, if you're using a template *file*, it kinda is).

So, all you need to do is decode (this is the WRONG WAY to solve the problem, but let's just make that mistake...) the return value of ->output, like this:


#!/usr/bin/perl -w
use strict;
use Encode;
my $content = "\x{a1}\x{2122}\x{a3}\x{a2}\x{221e}\x{a7}\x{b6}\x{2022}\x{aa}\x{ba}";

my $filename = 'utf8string.tmpl';

open my $fh, '>:encoding(UTF-8)', $filename or die $!;
print $fh $content;
close $fh;

use HTML::Template;
my $template = HTML::Template->new(
    filename => $filename,
);

my $output = $template->output;
$output = Encode::decode('UTF-8', $output);

print Encode::encode('UTF-8', $output);


prints, ¡™£¢∞§¶•ªº. Yes.

But... what if your template has a variable (it is a templating system, after all) and the value you pass to param() is a Unicode character string? MUAHAHA!


#!/usr/bin/perl -w
use strict;
use Encode;
my $content = "
<!-- tmpl_var one -->
\x{a1}\x{2122}\x{a3}\x{a2}\x{221e}\x{a7}\x{b6}\x{2022}\x{aa}\x{ba}
";

my $filename = 'utf8string.tmpl';

open my $fh, '>:encoding(UTF-8)', $filename or die $!;
print $fh $content;
close $fh;


use HTML::Template;
my $template = HTML::Template->new(
    filename => $filename,
);
$template->param(
    one => "\x{a1}\x{2122}\x{a3}\x{a2}\x{221e}\x{a7}\x{b6}\x{2022}\x{aa}\x{ba}",
);

my $output = $template->output;
$output = Encode::decode('UTF-8', $output);

print Encode::encode('UTF-8', $output);

Cannot decode string with wide characters at /System/Library/Perl/5.10.0/darwin-thread-multi-2level/Encode.pm line 162.


Bahahaha!
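That error isn't HTML::Template's doing, by the way. Here's a small sketch, again with Encode alone and with made-up variable names, of what happens when decode() is handed a string that already contains characters above 0xFF instead of bytes:

```perl
#!/usr/bin/perl -w
use strict;
use Encode;

# A character string holding a code point above 0xFF (a "wide" character),
# just like the value we passed to param().
my $already_decoded = "\x{2122}";

# decode() wants bytes. Given wide characters it dies, so trap that with eval.
my $result = eval { Encode::decode('UTF-8', $already_decoded) };
my $error  = $@;

print "decode failed: $error" if $error;
```

So once a single wide character has made it into the output, decoding the whole thing is off the table.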

Take out those decode/encode lines (I know it looks strange to have one right after the other) and you'll still get weird output:


¡™£¢∞§¶•ªº
¡™£¢∞§¶•ªº


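Darned if you do/don't. Those two lines should have the same string. They don't: the first comes from param() and is a character string, while the second comes from the template file and was read as raw bytes. No amount of encoding/decoding of the combined output is going to help.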
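Here's the same mismatch boiled down to a sketch with no template at all. $chars stands in for the param() value and $bytes for the template text read off disk; both names, and the two-character string, are just for illustration.

```perl
#!/usr/bin/perl -w
use strict;
use Encode;

# Like the value handed to param(): a real character string.
my $chars = "\x{a1}\x{2122}";
# Like template text read from a file with no encoding layer: raw UTF-8 bytes.
my $bytes = Encode::encode('UTF-8', $chars);

# Joining the two silently treats each byte as a Latin-1 character.
my $mixed = $chars . "\n" . $bytes;
my ($line_one, $line_two) = split /\n/, Encode::encode('UTF-8', $mixed);
print "line one is fine\n"           if $line_one eq $bytes;
print "line two is double-encoded\n" if $line_two ne $bytes;

# Decoding the bytes *before* they meet the characters makes both lines agree.
my $fixed = $chars . "\n" . Encode::decode('UTF-8', $bytes);
my ($fixed_one, $fixed_two) = split /\n/, Encode::encode('UTF-8', $fixed);
print "both lines match after decoding first\n" if $fixed_one eq $fixed_two;
```

The second half is the whole point: the decoding has to happen to the file's contents on their own, before they get mixed with anything else.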


The trick, other than tweaking HTML::Template's source to include file filter layer thingamabobs, is to decode the contents of the file it opens up.

How to do that?

Trolling through the HTML::Template mailing list archives leads to the idea of using an HTML::Template filter that matches everything and does our decoding for us:



#!/usr/bin/perl -w
use strict;
use Encode;
my $content = "
<!-- tmpl_var one -->
\x{a1}\x{2122}\x{a3}\x{a2}\x{221e}\x{a7}\x{b6}\x{2022}\x{aa}\x{ba}
";

my $filename = 'utf8string.tmpl';

open my $fh, '>:encoding(UTF-8)', $filename or die $!;
print $fh $content;
close $fh;


use HTML::Template;
my $template = HTML::Template->new(
    filename => $filename,
    filter   => [
        { sub => \&decode_str, format => 'scalar' },
    ],
);
$template->param(
    one => "\x{a1}\x{2122}\x{a3}\x{a2}\x{221e}\x{a7}\x{b6}\x{2022}\x{aa}\x{ba}",
);

my $output = $template->output;


print Encode::encode('UTF-8', $output);



sub decode_str {
    my $ref = shift;
    ${$ref} = Encode::decode('UTF-8', ${$ref});
}

This sort of lines everything up: the template text gets decoded into characters on the way in, the output gets encoded as UTF-8 on the way out, and all that stuff that the unicodefaqthingy perldoc (perlunifaq) tells you to do.

But, oh, it gets better.

DON'T use that filter trick if you're using a scalarref, or a filehandle that already has an encoding layer on it! The template text is already decoded by then, so you'll get a nice error, like this:

HTML::Template->new() : fatal error occured during filter call: Cannot decode string with wide characters at /System/Library/Perl/5.10.0/darwin-thread-multi-2level/Encode.pm line 162.
at /Library/Perl/5.10.0/HTML/Template.pm line 1697
HTML::Template::_init_template('HTML::Template=HASH(0x1008aafb8)') called at /Library/Perl/5.10.0/HTML/Template.pm line 1238
HTML::Template::_init('HTML::Template=HASH(0x1008aafb8)') called at /Library/Perl/5.10.0/HTML/Template.pm line 1124



Brilliant.


So I don't know what the best advice to give is. If you're passing the template as a scalarref, DON'T use that filter, unless you want to encode your template beforehand (which makes little sense?).

If it's a filename, use that filter trick perhaps (or edit the source code of HTML::Template).
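One possible middle ground, and this is only a sketch I'm assuming would do the job, not something from the mailing list: have the filter check whether the template text is already a character string before decoding it. decode_unless_decoded is a made-up name, and utf8::is_utf8 only looks at Perl's internal flag, so treat it as a heuristic. A character string with nothing above 0xFF in it may not carry the flag at all.

```perl
#!/usr/bin/perl -w
use strict;
use Encode;

# Hypothetical stand-in for decode_str: decode only what still looks like bytes.
sub decode_unless_decoded {
    my $ref = shift;
    return if utf8::is_utf8(${$ref});    # already characters, leave it alone
    ${$ref} = Encode::decode('UTF-8', ${$ref});
    return;
}

# Template text as it would arrive from a plain file: raw UTF-8 bytes.
my $from_file = Encode::encode('UTF-8', "\x{a1}\x{2122}");
# Template text as it would arrive from a scalarref: already characters.
my $from_scalar = "\x{a1}\x{2122}";

decode_unless_decoded(\$from_file);
decode_unless_decoded(\$from_scalar);

print "both ended up as the same character string\n"
    if $from_file eq $from_scalar;
```

Plugged in as { sub => \&decode_unless_decoded, format => 'scalar' }, the idea is that the same filter could sit in front of a filename, a scalarref, or a layered filehandle without blowing up on the last two.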

2 comments:

Robert said...

You should talk to Sam Tregar and see what ideas he has about it.

The Perl Hacker Painter said...

At least from going through the HTML::Template mailing lists, the other option is to switch from HTML::Template to HTML::Template::Compiled. I'm not too interested in doing that, since H::T::Compiled doesn't support things like Template Expressions, which is something that I expose as a feature to advanced users of this project.

As far as I understand, Sam is very, very conservative about releases of H::T, but currently H::T doesn't have the best plugin support for extending his fine module.