Try to parse this file with the following Perl script:
#!/usr/bin/env perl

use strict;
use HTML::HTML5::Parser;
use utf8;                            # for the characters in the script.
use open ':encoding(UTF-8)';         # for the file arguments.
binmode STDIN, ':encoding(UTF-8)';   # for stdin.
binmode STDOUT, ':encoding(UTF-8)';  # for stdout.

@ARGV == 1 or die "Usage: $0 <file.html>\n";

my $parser = HTML::HTML5::Parser->new;
my $doc = $parser->parse_file($ARGV[0]);
print "Charset: '", $parser->charset($doc), "'\n";
print $doc->toString();
See Debian bug 750946.
For the test: "é"