Chris,
I think I've "got it." Finally. :-) Here's the story:
Thank you, Julian. The missing element was the <meta ../> tag in the output
XHTML. Without telling the browser that the encoding was UTF-8, my browser
simply rendered what it thought must be 1-byte ASCII. When it saw the
two-byte Unicode characters, it simply rendered them as a pair of ASCII
characters. With the right meta tag (shown in my XSL below), everything
renders fine on IE 6 and Netscape 7.
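
The effect can be reproduced outside a browser. The following is a minimal Python sketch, assuming the browser fell back to a single-byte code page such as Windows-1252 when no charset was declared (the thread does not say which fallback was used, so `cp1252` is an assumption):

```python
# Sample text: copyright sign (U+00A9) and en dash (U+2013).
text = "Copyright \u00a9 1999\u20132004 by ABC."

# What a UTF-8 serializer writes to disk.
utf8_bytes = text.encode("utf-8")

# Correct reading: decode as UTF-8, which is what the <meta> tag requests.
as_utf8 = utf8_bytes.decode("utf-8")

# Wrong guess: treat every byte as one character of a single-byte code page.
# cp1252 is an assumed fallback, chosen only for illustration.
as_cp1252 = utf8_bytes.decode("cp1252")

print(as_utf8)
print(as_cp1252)  # the copyright sign shows up as two characters
```

The bytes are the same in both cases; only the decoding differs.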
I use Vim. I haven't taken the time to figure out how to make Vim
Unicode-aware. In this case, it's a good thing, because it allows me to see
the two-byte pairs (and even three-byte sequences for chars like –). It's
part of what allowed me to diagnose the problem. In fact, if I hadn't seen
anything funny in the "raw" HTML output, I would have used 'od -c' or
something to see what was actually in there.
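
For anyone without `od` at hand, here is a rough Python stand-in. The helper name `dump_bytes` and its layout are made up for this sketch; it only shows which bytes are actually present, independent of how an editor chooses to display them:

```python
def dump_bytes(data: bytes, width: int = 16) -> str:
    """Return a hex dump: offset, hex bytes, printable-ASCII column."""
    hex_width = width * 3 - 1  # two hex digits plus a space per byte
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable ASCII as-is, everything else as a dot.
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{hex_width}}  {ascii_part}")
    return "\n".join(lines)


# In-memory sample so the sketch runs without any file on disk.
sample = "\u00a9 1999\u20132004".encode("utf-8")
print(dump_bytes(sample))
```

To inspect a real file, pass `open("test1.html", "rb").read()` instead of `sample`.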
Now, I think I'm finally on the correct path to the solution. Below, see the
XSL that I think does its work the smart way. Some notes:
- The real "trick" is telling the browser how to render the content of the
file. This is done with the <meta> tag.
- The entity declarations (DOCTYPE section) merely allow me to refer to
non-ASCII characters in my XSL by using convenient names (instead of having
to wonder what "&#8211;" is every time I see it in my XSL source). The entity
stuff isn't magic; it's just syntactic sugar that makes it easier to read
the XSL.
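
The "syntactic sugar" point can be checked with any XML parser. A small Python sketch, assuming entities declared in an internal DTD subset with the usual numeric values for `copy` and `ndash` (the document strings here are made up for illustration, not taken from the stylesheet):

```python
import xml.etree.ElementTree as ET

# Named entities declared in the internal subset of the DOCTYPE.
with_entities = """<!DOCTYPE p [
  <!ENTITY copy "&#169;">
  <!ENTITY ndash "&#8211;">
]>
<p>Copyright &copy; 1999&ndash;2004</p>"""

# The same content written with numeric character references only.
with_references = "<p>Copyright &#169; 1999&#8211;2004</p>"

text_a = ET.fromstring(with_entities).text
text_b = ET.fromstring(with_references).text

# The parser hands back identical text either way.
print(text_a == text_b)
```

The entity names never reach the output; they are expanded by the parser before the XSLT processor sees the text.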
Cary
~~~~~test1.xsl
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xsl [
<!ENTITY nbsp "&#160;">
<!ENTITY copy "&#169;">
<!ENTITY ndash "&#8211;">
]>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:saxon="http://icl.com/saxon">
<xsl:output method="saxon:xhtml" version="1.0" encoding="UTF-8"
omit-xml-declaration="no"
doctype-public="-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN"
doctype-system="http://www.w3.org/2002/04/xhtml-math-svg/xhtml-math-svg.dtd"
indent="yes" media-type="text/xml"
saxon:character-representation="decimal"
/>
<xsl:template match="/">
<html>
<head>
<meta http-equiv="content-type" content="text/html;charset=utf-8"/>
<title>Testing</title>
</head>
<body>
Copyright &copy; 1999&ndash;2004 by ABC. All rights reserved.
</body>
</html>
</xsl:template>
</xsl:stylesheet>
~~~~~test1.html
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN"
"http://www.w3.org/2002/04/xhtml-math-svg/xhtml-math-svg.dtd">
<html xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:saxon="http://icl.com/saxon">
<head>
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
<title>Testing</title>
</head>
<body>
Copyright © 1999-2004 by ABC. All rights reserved.
</body>
</html>
Post by Chris Barber
I stand corrected.
http://www1.tip.nl/~t876506/utf8tbl.html
It's my first time trying to comprehend encoding, so sorry for the
misunderstanding about UTF-8.
Have you opened the output text in a hex editor to confirm whether Saxon is
outputting the two-byte characters? What text editor / browsers did you use
to view the output text? I used the Xselerator internal text editor and IE
to view the output. Maybe your text editor is not capable of understanding
the encoding declaration and as such is mistakenly displaying the first byte
as a character instead of using it to process the next character?
Have you tried outputting to a .htm file on disk and then loading that into the browser?
Chris.
Chris,
Post by Chris Barber
The output stream is patently *not* unicode or else the entire file would
have to be two bytes per output character. The extraneous characters that
you are seeing are just that - extra and unwanted.
I don't think this is true. The definition of UTF-8 is that it's a
variable-length encoding [Harold and Means, XML in a Nutshell, O'Reilly]:
"Characters 0 through 127, that is, the ASCII character set, are encoded in
one byte each, exactly as they would be in ASCII. ...There is a one-to-one
identity mapping from ASCII characters to UTF-8 bytes. Thus, pure ASCII
files are also acceptable UTF-8 files. UTF-8 represents characters 128 to
2047, a range that covers the most common nonideographic scripts, in two
bytes each."
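
The ranges in that quote are easy to confirm. A short Python sketch (the helper name `utf8_length` and the chosen sample code points are just for illustration):

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses for a single code point."""
    return len(chr(code_point).encode("utf-8"))


# Boundaries of the one-, two-, and three-byte ranges, plus the two
# characters from the test file (copyright sign and en dash).
samples = [0x41, 0x7F, 0x80, 0xA9, 0x7FF, 0x800, 0x2013]

for cp in samples:
    encoded = chr(cp).encode("utf-8")
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s): {hex_bytes}")
```

So a file that is mostly ASCII stays mostly one byte per character, with two or three bytes only where a non-ASCII character occurs.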
I'm going to try a couple of things this morning in response to Julian's
extra comments, and then I'll summarize here what I've learned.
Cary
Post by Chris Barber
I don't think Unicode has anything to do with this. I reckon Saxon has a
bug in its output stream.
Your XSLT was fine (apart from the couple of character references that
won't render in UTF-8 and appear as a square).
The output stream is patently *not* unicode or else the entire file would
have to be two bytes per output character. The extraneous characters that
you are seeing are just that - extra and unwanted.
Saxon must be unwittingly outputting the extra characters for those
characters that are not within the UTF-8 range as opposed to encoding them
as character references in the output (which the browser would then
understand).
Can anyone shed any more light on this?
I read Julian's posts but feel no more enlightened than before - mostly
because I just do *not* understand the implications of specifying an
encoding in the XSLT.
Chris.
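
For comparison, this is what "encoding them as character references" looks like when the target encoding really cannot represent a character. The sketch below uses Python's codec error handler, not Saxon, and ASCII as the assumed output encoding; it is only meant to illustrate the idea:

```python
text = "Copyright \u00a9 1999\u20132004 by ABC."

# ASCII has no copyright sign or en dash, so the error handler substitutes
# numeric character references for them.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes.decode("ascii"))

# UTF-8 can represent every Unicode character, so nothing needs replacing
# and the characters are written as multi-byte sequences instead.
utf8_bytes = text.encode("utf-8")
print(len(text), len(utf8_bytes))
```

With a UTF-8 output encoding there is no character "outside the range", which is why a serializer writes the bytes directly rather than falling back to references.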
I feel confused now. I'm doing the transform with InstantSaxon from within
saxon.exe -o %2 %1 %3
The only other reconciling theory I can think of is that if you're using a
Unicode-aware text editor that can render the dual-byte characters as a
single Unicode char, AND if your META tag helps your browser understand the
same idea, then perhaps your output and mine are identical, it's just that
we're viewing that output differently. ???
Cary
Post by Chris Barber
You are using Saxon?
This doesn't happen with MSXML as far as I can see and I can't conceive of
a reason why a non-Unicode string would suddenly find itself with a unicode
two-byte character embedded in it unless something else is going on.
I tested this in Xselerator using your initial XML and XSL exactly as posted
and the output in both the text viewer and IE doesn't show these spurious
characters so I can only conclude that your method or XSLT engine is
exhibiting different behaviour to MSXML v3.0 and 4.0 (not in itself
unexpected of course).
How are you doing the transform? Can you show the code?
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN"
"http://www.w3.org/2002/04/xhtml-math-svg/xhtml-math-svg.dtd"><html
xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:saxon="http://icl.com/saxon">
<head>
<META http-equiv="Content-Type" content="text/xml; charset=UTF-8">
<title>Testing</title>
</head>
<body>
124 - |<br>
125 - }<br>
126 - ~<br>
127 - <br>
128 - ?<br>
129 - <br>
130 - ,<br>
...<br>
254 - þ<br>
255 - ÿ<br>
256 - A<br>
257 - a<br>
258 - A<br>
...<br>
511 - ?<br>
512 - ?<br>
Copyright © 1999-2004 by ABC. All rights reserved.<br>
Copyright &#169;1999&#150;2004 by ABC. All rights reserved.<br>
Copyright © 1999–2004 by ABC. All rights reserved.<br></body>
</html>
Which when viewed in IE doesn't show the extra characters that you are
seeing with your transform?
Chris.
<snipped>
~~~~~test1.xsl
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:saxon="http://icl.com/saxon">
<xsl:output method="saxon:xhtml" version="1.0" encoding="UTF-8"
omit-xml-declaration="no"
doctype-public="-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN"
doctype-system="http://www.w3.org/2002/04/xhtml-math-svg/xhtml-math-svg.dtd"
indent="yes" media-type="text/xml"
saxon:character-representation="decimal"
/>
<xsl:template match="/">
<html>
<head>
<title>Testing</title>
</head>
<body>
124 - |<br />
125 - }<br />
126 - ~<br />
127 - <br />
128 - €<br />
129 - <br />
130 - ‚<br />
...<br />
254 - þ<br />
255 - ÿ<br />
256 - Ā<br />
257 - ā<br />
258 - Ă<br />
...<br />
511 - ǿ<br />
512 - Ȁ<br />
Copyright © 1999–2004 by ABC. All rights reserved.<br />
Copyright &#169;1999&#150;2004 by ABC. All rights reserved.<br />
Copyright <xsl:text disable-output-escaping="yes">&#169;</xsl:text>
1999<xsl:text disable-output-escaping="yes">&#150;</xsl:text>2004 by
ABC. All rights reserved.<br />
</body>
</html>
</xsl:template>
</xsl:stylesheet>
~~~~~The output
124 - |
125 - }
126 - ~
127 -
128 - Â?
129 - Â
130 - Â,
...
254 - ß
255 - ÿ
256 - Ä?
257 - Ä
258 - Ä,
...
511 - Ç¿
512 - È?
Copyright © 1999Â-2004 by ABC. All rights reserved.
Copyright ©1999–2004 by ABC. All rights reserved.
Copyright © 1999-2004 by ABC. All rights reserved.
</snipped>
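
A note on the `&#150;` reference used in the snipped stylesheet: in Unicode, code point 150 (U+0096) is a C1 control character, not a dash. Byte 0x96 is an en dash only in Windows-1252. The Python sketch below checks both facts; reading the UTF-8 bytes as `cp1252` is an assumption about how the output was viewed, made to illustrate where an "Â" followed by a dash could come from:

```python
import unicodedata

# What the numeric character reference &#150; actually denotes.
ref_150 = chr(150)
print(unicodedata.category(ref_150))  # "Cc" means control character

# Its UTF-8 encoding is a two-byte sequence.
utf8 = ref_150.encode("utf-8")
print(utf8)

# Viewed through Windows-1252 (assumed), those two bytes become two
# characters: a capital A with circumflex followed by an en dash.
seen = utf8.decode("cp1252")
print(seen)

# The en dash proper is U+2013, which is what byte 0x96 means in cp1252.
print(bytes([0x96]).decode("cp1252") == "\u2013")
```

Using `&#8211;` (U+2013) for the en dash avoids depending on a Windows-specific code page.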