Discussion:
Character entity references emit extra Â character
Cary Millsap
2004-04-29 18:20:51 UTC
With the stylesheet and XML input listed below, I get the following output
in my XHTML body:

Copyright © 1999Â-2004 by ABC. All rights reserved.

Why does the 'Â' character (&Acirc;) prefix every character emitted in
response to an entity reference? More to the point, how can I get rid of
it!?

Thank you in advance,

Cary

~~~~~XSL Stylesheet
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:saxon="http://icl.com/saxon">
<xsl:output method="saxon:xhtml" version="1.0" encoding="UTF-8"
omit-xml-declaration="no"
doctype-public="-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN"

doctype-system="http://www.w3.org/2002/04/xhtml-math-svg/xhtml-math-svg.dtd"
indent="yes" media-type="text/xml"
/>
<xsl:template match="/">
<html>
<head>
<title>Testing</title>
</head>
<body>Copyright&#xa0;&#169; 1999&#150;2004 by ABC. All rights
reserved.</body>
</html>
</xsl:template>
</xsl:stylesheet>


~~~~~XML Input
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="./test1.xsl" ?>
<a />


~~~~~XHTML Output
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN"
"http://www.w3.org/2002/04/xhtml-math-svg/xhtml-math-svg.dtd">
<html xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:saxon="http://icl.com/saxon">
<head>
<title>Testing</title>
</head>
<body>Copyright © 1999Â-2004 by ABC. All rights reserved.</body>
</html>
Julian F. Reschke
2004-04-29 19:33:38 UTC
Post by Cary Millsap
With the stylesheet and XML input listed below, I get the following output
Copyright © 1999Â-2004 by ABC. All rights reserved.
Why does the 'Â' character (&Acirc;) prefix every character emitted in
response to an entity reference? More to the point, how can I get rid of
it!?
Why would you want to? You asked for UTF-8-encoded output, and that's
what the transformation produced.

Julian
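
A minimal Python sketch of the effect Julian describes, assuming the output is being viewed by a tool that treats each byte as one Latin-1 character. The variable names are illustrative and not from the thread.

```python
# U+00A9 COPYRIGHT SIGN, the character that &#169; refers to.
copyright_sign = "\u00a9"

# UTF-8 needs two bytes for this code point.
utf8_bytes = copyright_sign.encode("utf-8")
print(utf8_bytes.hex(" "))  # c2 a9

# A viewer that wrongly assumes Latin-1 decodes each byte separately.
misread = utf8_bytes.decode("latin-1")
print(misread)  # Â©

# Decoding with the correct encoding recovers the single character.
print(utf8_bytes.decode("utf-8"))  # ©
```

The bytes are the same in both cases; only the decoding assumption of the viewer differs.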
Cary Millsap
2004-04-29 20:43:00 UTC
Ok. Perhaps my failure to understand is caused by my not asking the correct
question in the first place.

I want to emit XHTML strings that render in the same manner that the
following strings would render in a browser:

Copyright &copy; 1999&ndash;2004 by ABC.
&nbsp;&nbsp;?&nbsp;&nbsp;

The approaches I've tried haven't worked. (For example, when I emit
"&#169;", it renders as "Â©" instead of "©".) What is an elegant way to emit
the strings that I'm trying to emit?

Cary
Chris Barber
2004-04-29 21:23:50 UTC
You have to ask it to not escape the '&' characters. What is happening (I think) is that the already
escaped character entities are being escaped again?

Use:

<xsl:value-of select="something" disable-output-escaping="yes"/>

disable-output-escaping is an optional feature of XSLT 1.0 that a processor is not required to
support, so it should be avoided if you need cross platform / XSLT engine compatibility.

Chris.
Cary Millsap
2004-04-29 23:19:59 UTC
I think I understand better what Julian was telling me. UTF-8 emits 2 bytes
for chars 128-2047. Apparently, for chars 128-191 the first byte happens to
be decimal 194, which browsers render the same as &Acirc;. Ok.
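
The two-byte rule can be sketched in Python as follows. This is an illustration only: `encode_two_byte` is a hypothetical helper, not part of any XSLT processor, and it handles only code points 128 to 2047.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in the range 128..2047 as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("two-byte UTF-8 covers code points 128..2047 only")
    lead = 0xC0 | (code_point >> 6)            # 110xxxxx: top 5 bits
    continuation = 0x80 | (code_point & 0x3F)  # 10xxxxxx: low 6 bits
    return bytes([lead, continuation])

for cp in (160, 169, 191, 192, 255):
    encoded = encode_two_byte(cp)
    assert encoded == chr(cp).encode("utf-8")  # matches Python's own encoder
    print(cp, list(encoded))
```

The lead byte is 194 only for code points 128 to 191. From 192 to 255 it becomes 195, which a Latin-1 viewer shows as "Ã" rather than "Â".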

What I really wanted to do is this in my XSL:

<p class="copyright">Copyright<xsl:text
disable-output-escaping="yes">&amp;copy;</xsl:text> 1999...</p>

Thanks all!


Cary
Chris Barber
2004-04-30 01:27:10 UTC
I'm lost now - doesn't this work:

<body>Copyright &#169;1999-2004 by ABC. All rights reserved.</body>

It seems to work fine with MSXML v4.0 and 3.0 in Xselerator.

UTF-8 doesn't emit unicode unless you ask the output to do so? The encoding is simply a marker for
the output text stream that embeds the character set reference that the text stream was intended to
be viewed with?

I'm not really familiar with encoding so I may be completely wrong in my understanding of it.

Chris.
Cary Millsap
2004-04-30 02:43:23 UTC
Yes, this is what I want in my XHTML. But I'm not going to get it by
specifying the emission of that string in my XSL transform. When I cut and
paste your body text into my test1.xsl file, the output file gets this:

Copyright Â©1999-2004 by ABC. All rights reserved.

At the bottom of this note is an interesting test that has taught me a lot
today. In it, I emitted a few values of "&#n;" for interesting values of n.
Note that for values of 128 and above, the UTF-8 two-byte character rule
kicks in. You can see both the characters in the output. Notice the symmetry
in the values 128 and 256, 129 and 257, etc. as the second byte cycles
through the character set for a different first-byte value. And finally,
notice the final three lines in the body, which build up to the right string
to emit.
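
The symmetry between 128 and 256, 129 and 257, and so on can be reproduced with a short Python sketch. It assumes nothing about Saxon and only inspects the UTF-8 bytes of those code points.

```python
# Compare the UTF-8 bytes of code points that are 128 apart.
pairs = [(128, 256), (129, 257), (130, 258)]

for low, high in pairs:
    low_bytes = chr(low).encode("utf-8")
    high_bytes = chr(high).encode("utf-8")
    # The lead byte differs (0xC2 vs 0xC4) but the continuation byte is equal.
    print(low, low_bytes.hex(" "), "|", high, high_bytes.hex(" "))
```

A Latin-1 viewer shows the lead byte 0xC2 as "Â" and 0xC4 as "Ä", which matches the first characters in the output listing.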

This is Julian's point (but unfortunately it took a while for me to figure
out what he was telling me). I was instructing my XSL to put two bytes into
my output when I said for my XSL to emit the Unicode character &#169;. It
did. When you render those two bytes in IE (or, I suspect, any other
browser), it simply shows you what those two bytes look like: "Â©".

What I really want in my XSL output (my XHTML file) is exactly the string
you've asked about, which includes the substring "&#169;" in it instead of
the two-byte sequence "Â©". To emit what I want, I need to emit an ampersand
(by using "&amp;") and the string "#169;", which was easy enough, once I had
the right perspective.
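
The kind of output described here, pure ASCII text in which every non-ASCII character appears as a numeric character reference, can be illustrated with Python's "xmlcharrefreplace" error handler. This is a sketch of the target text only, not of how Saxon or any other XSLT processor produces it. It also assumes the en dash is meant, which is U+2013; strictly, &#150; names a C1 control character that browsers leniently display as an en dash.

```python
# Text containing a no-break space (U+00A0), a copyright sign (U+00A9)
# and an en dash (U+2013).
text = "Copyright\u00a0\u00a9 1999\u20132004 by ABC."

# Encode to ASCII, replacing every non-ASCII character with a numeric
# character reference instead of a raw multi-byte sequence.
ascii_markup = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(ascii_markup)  # Copyright&#160;&#169; 1999&#8211;2004 by ABC.
```

Because the result contains only ASCII bytes, it displays the same whether a viewer assumes UTF-8 or Latin-1.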

Cary

~~~~~test1.xsl
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:saxon="http://icl.com/saxon">
<xsl:output method="saxon:xhtml" version="1.0" encoding="UTF-8"
omit-xml-declaration="no"
doctype-public="-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN"

doctype-system="http://www.w3.org/2002/04/xhtml-math-svg/xhtml-math-svg.dtd"
indent="yes" media-type="text/xml"
saxon:character-representation="decimal"
/>
<xsl:template match="/">
<html>
<head>
<title>Testing</title>
</head>
<body>
124 - &#124;<br />
125 - &#125;<br />
126 - &#126;<br />
127 - &#127;<br />
128 - &#128;<br />
129 - &#129;<br />
130 - &#130;<br />
...<br />
254 - &#254;<br />
255 - &#255;<br />
256 - &#256;<br />
257 - &#257;<br />
258 - &#258;<br />
...<br />
511 - &#511;<br />
512 - &#512;<br />
Copyright&#160;&#169; 1999&#150;2004 by ABC. All rights reserved.<br />
Copyright &amp;#169;1999&amp;#150;2004 by ABC. All rights reserved.<br />
Copyright <xsl:text disable-output-escaping="yes">&amp;#169;</xsl:text>
1999<xsl:text disable-output-escaping="yes">&amp;#150;</xsl:text>2004 by
ABC. All rights reserved.<br />
</body>
</html>
</xsl:template>
</xsl:stylesheet>


~~~~~The output
124 - |
125 - }
126 - ~
127 - 
128 - Â?
129 - 
130 - Â,
...
254 - ß
255 - ÿ
256 - Ä?
257 - ā
258 - Ä,
...
511 - Ç¿
512 - È?
Copyright © 1999Â-2004 by ABC. All rights reserved.
Copyright &#169;1999&#150;2004 by ABC. All rights reserved.
Copyright © 1999-2004 by ABC. All rights reserved.
Chris Barber
2004-04-30 04:17:54 UTC
You are using Saxon?
This doesn't happen with MSXML as far as I can see and I can't conceive of a reason why a
non-Unicode string would suddenly find itself with a unicode two-byte character embedded in it
unless something else is going on.
I tested this in Xselerator using your initial XML and XSL exactly as posted and the output in both
the text viewer and IE doesn't show these spurious characters so I can only conclude that your
method or XSLT engine is exhibiting different behaviour to MSXML v3.0 and 4.0. (not in itself
unexpected of course).
How are you doing the transform? Can you show the code?

Your test1.xsl, run in Xselerator using MSXML v4.0, produced:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN"
"http://www.w3.org/2002/04/xhtml-math-svg/xhtml-math-svg.dtd"><html
xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:saxon="http://icl.com/saxon">
<head>
<META http-equiv="Content-Type" content="text/xml; charset=UTF-8">
<title>Testing</title>
</head>
<body>
124 - |<br>
125 - }<br>
126 - ~<br>
127 - <br>
128 - €<br>
129 - <br>
130 - ‚<br>
...<br>
254 - þ<br>
255 - ÿ<br>
256 - A<br>
257 - a<br>
258 - A<br>
...<br>
511 - ?<br>
512 - ?<br>
Copyright © 1999–2004 by ABC. All rights reserved.<br>
Copyright &amp;#169;1999&amp;#150;2004 by ABC. All rights reserved.<br>
Copyright &#169;
1999&#150;2004 by
ABC. All rights reserved.<br></body>
</html>

Which when viewed in IE doesn't show the extra characters that you are seeing with your transform?

Chris.

Cary Millsap
2004-04-30 05:17:24 UTC
I feel confused now. I'm doing the transform with InstantSaxon from within
xmlspy on my WinXP laptop, using the simple command line:

saxon.exe -o %2 %1 %3

The only other reconciling theory I can think of is that if you're using a
Unicode-aware text editor that can render the dual-byte characters as a
single Unicode char, AND if your META tag helps your browser understand the
same idea, then perhaps your output and mine are identical, it's just that
we're viewing that output differently. ???


Cary
Chris Barber
2004-04-30 10:56:39 UTC
I don't think Unicode has anything to do with this. I reckon Saxon has a bug in its output stream.
Your XSLT was fine (apart from the couple of character references that won't render in UTF-8 and
appear as a square).
The output stream is patently *not* unicode or else the entire file would have to be two bytes per
output character. The extraneous characters that you are seeing are just that - extra and unwanted.
Saxon must be unwittingly outputting the extra characters for those characters that are not within
the UTF-8 range as opposed to encoding them as character references in the output (which the browser
would then understand).

Can anyone shed any more light on this?
I read Julian's posts but feel no more enlightened than before - mostly because I just do *not*
understand the implications of specifying an encoding in the XSLT.

Chris.
Cary Millsap
2004-04-30 14:31:58 UTC
Chris,
Post by Chris Barber
The output stream is patently *not* unicode or else the entire file would
have to be two bytes per output character. The extraneous characters that
you are seeing are just that - extra and unwanted.

I don't think this is true. The definition of UTF-8 is that it's a
variable-length encoding [Harold and Means, XML in a Nutshell, O'Reilly,
p73]:

"Characters 0 through 127, that is, the ASCII character set, are encoded in
one byte each, exactly as they would be in ASCII. ...There is a one-to-one
identity mapping from ASCII characters to UTF-8 bytes. Thus, pure ASCII
files are also acceptable UTF-8 files. UTF-8 represents characters 128 to
2047, a range that covers the most common nonideographic scripts, in two
bytes each."
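
The variable-length rule in that quotation can be checked with a short Python sketch. `utf8_length` is a hypothetical helper written for this illustration; the three-byte and four-byte ranges come from the UTF-8 definition rather than from the passage quoted above.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for one code point."""
    if code_point < 0x80:       # 0..127: ASCII, one byte
        return 1
    if code_point < 0x800:      # 128..2047: two bytes
        return 2
    if code_point < 0x10000:    # 2048..65535: three bytes
        return 3
    return 4                    # 65536..1114111: four bytes

for cp in (0x41, 0x7F, 0x80, 0xA9, 0x7FF, 0x800, 0x20AC):
    # Cross-check the rule against Python's own UTF-8 encoder.
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")
```

So a UTF-8 file is mostly one byte per character for English text, with extra bytes only where non-ASCII characters occur.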

I'm going to try a couple of things this morning in response to Julian's
extra comments, and then I'll summarize here what I've learned.

Cary
Chris Barber
2004-04-30 14:50:20 UTC
I stand corrected.
http://www1.tip.nl/~t876506/utf8tbl.html

It's my first time trying to comprehend encoding so sorry for the misunderstanding about UTF-8.

Have you opened the output text in a hex editor to confirm if Saxon is outputting the two-byte
characters? What text editor / browsers did you use to view the output text? I used the Xselerator
internal text editor and IE to view the output. Maybe your text editor is not capable of
understanding the encoding declaration and as such is mistakenly displaying the first byte as a
character instead of using it to process the next character?
Have you tried outputting to a .htm file on disk and then loading that into the browser?
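
If no hex editor is at hand, a few lines of Python can do the same check. This is a sketch only: `find_non_ascii` and the file name `output.htm` are made up for the illustration, and the demo writes its own sample file rather than reading real Saxon output.

```python
import tempfile
from pathlib import Path

def find_non_ascii(path: Path) -> list[tuple[int, int]]:
    """Return (offset, byte value) for every byte above 127 in the file."""
    data = path.read_bytes()
    return [(offset, value) for offset, value in enumerate(data) if value > 0x7F]

# Demo on a throwaway file standing in for the transform output.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "output.htm"
    sample.write_bytes("<body>\u00a9 1999</body>".encode("utf-8"))
    hits = find_non_ascii(sample)
    for offset, value in hits:
        print(f"offset {offset}: 0x{value:02X}")
```

Seeing 0xC2 immediately followed by 0xA9 would confirm that the file holds a correctly encoded UTF-8 copyright sign and that the "Â" comes from the viewer.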

Chris.

"Cary Millsap" <***@hotsos.com> wrote in message news:***@corp.supernews.com...
Chris,
Post by Chris Barber
The output stream is patently *not* unicode or else the entire file would
have to be two bytes per output character. The extraneous characters that
you are seeing are just that - extra and unwanted.

I don't think this is true. The definition of UTF-8 is that it's a
variable-length encoding [Harold and Means, XML in a Nutshell, O'Reilly,
p73]:

"Characters 0 through 127, that is, the ASCII character set, are encoded in
one byte each, exactly as they would be in ASCII. ...There is a one-to-one
identity mapping from ASCII characters to UTF-8 bytes. Thus, pure ASCII
files are also acceptable UTF-8 files. UTF-8 represents characters 128 to
2047, a range that covers the most common nonideographic scripts, in two
bytes each."
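
To make the quoted rule concrete, here is a minimal Python sketch (Python is my choice for illustration; nothing in this thread uses it, and the helper name utf8_bytes is made up). It prints how many bytes UTF-8 needs for a few of the code points tested in this thread.

```python
def utf8_bytes(code_point: int) -> bytes:
    """Return the UTF-8 encoding of a single Unicode code point."""
    return chr(code_point).encode("utf-8")


# Code points on either side of the 1-byte/2-byte and 2-byte/3-byte boundaries,
# plus &#169; (copyright sign) and &#8211; (en dash).
for cp in (126, 127, 128, 169, 255, 256, 2047, 2048, 8211):
    encoded = utf8_bytes(cp)
    print(f"&#{cp}; -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```

Code points up to 127 come out as one byte, 128 through 2047 as two, and 2048 and above (within the range tried here) as three, which matches the passage above.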

I'm going to try a couple of things this morning in response to Julian's
extra comments, and then I'll summarize here what I've learned.

Cary
Post by Chris Barber
I don't think Unicode has anything to do with this. I reckon Saxon has a
bug in its output stream.
Your XSLT was fine (apart from the couple of character references that
won't render in UTF-8 and appear as a square).
The output stream is patently *not* unicode or else the entire file would
have to be two bytes per output character. The extraneous characters that
you are seeing are just that - extra and unwanted.
Saxon must be unwittingly outputting the extra characters for those
characters that are not within the UTF-8 range as opposed to encoding them
as character references in the output (which the browser would then
understand).
Can anyone shed any more light on this?
I read Julian's posts but feel no more enlightened than before - mostly
because I just do *not* understand the implications of specifying an
encoding in the XSLT.
Chris.
I feel confused now. I'm doing the transform with InstantSaxon from within
saxon.exe -o %2 %1 %3
The only other reconciling theory I can think of is that if you're using a
Unicode-aware text editor that can render the dual-byte characters as a
single Unicode char, AND if your META tag helps your browser understand the
same idea, then perhaps your output and mine are identical, it's just that
we're viewing that output differently. ???
Cary
Post by Chris Barber
You are using Saxon?
This doesn't happen with MSXML as far as I can see and I can't conceive of
a reason why a non-Unicode string would suddenly find itself with a unicode
two-byte character embedded in it unless something else is going on.
I tested this in Xselerator using your initial XML and XSL exactly as posted
and the output in both the text viewer and IE doesn't show these spurious
characters so I can only conclude that your method or XSLT engine is
exhibiting different behaviour to MSXML v3.0 and 4.0. (not in itself
unexpected of course).
How are you doing the transform? Can you show the code?
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN"
"http://www.w3.org/2002/04/xhtml-math-svg/xhtml-math-svg.dtd"><html
xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:saxon="http://icl.com/saxon">
<head>
<META http-equiv="Content-Type" content="text/xml; charset=UTF-8">
<title>Testing</title>
</head>
<body>
124 - |<br>
125 - }<br>
126 - ~<br>
127 - <br>
128 - ?<br>
129 - <br>
130 - ,<br>
...<br>
254 - þ<br>
255 - ÿ<br>
256 - A<br>
257 - a<br>
258 - A<br>
...<br>
511 - ?<br>
512 - ?<br>
Copyright © 1999-2004 by ABC. All rights reserved.<br>
Copyright &amp;#169;1999&amp;#150;2004 by ABC. All rights reserved.<br>
Copyright &#169;
1999&#150;2004 by
ABC. All rights reserved.<br></body>
</html>
Which when viewed in IE doesn't show the extra characters that you are
seeing with your transform?
Chris.
<snipped>
~~~~~test1.xsl
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:saxon="http://icl.com/saxon">
<xsl:output method="saxon:xhtml" version="1.0" encoding="UTF-8"
omit-xml-declaration="no"
doctype-public="-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN"
doctype-system="http://www.w3.org/2002/04/xhtml-math-svg/xhtml-math-svg.dtd"
indent="yes" media-type="text/xml"
saxon:character-representation="decimal"
/>
<xsl:template match="/">
<html>
<head>
<title>Testing</title>
</head>
<body>
124 - &#124;<br />
125 - &#125;<br />
126 - &#126;<br />
127 - &#127;<br />
128 - &#128;<br />
129 - &#129;<br />
130 - &#130;<br />
...<br />
254 - &#254;<br />
255 - &#255;<br />
256 - &#256;<br />
257 - &#257;<br />
258 - &#258;<br />
...<br />
511 - &#511;<br />
512 - &#512;<br />
Copyright&#160;&#169; 1999&#150;2004 by ABC. All rights reserved.<br />
Copyright &amp;#169;1999&amp;#150;2004 by ABC. All rights reserved.<br />
Copyright <xsl:text disable-output-escaping="yes">&amp;#169;</xsl:text>
1999<xsl:text disable-output-escaping="yes">&amp;#150;</xsl:text>2004 by
ABC. All rights reserved.<br />
</body>
</html>
</xsl:template>
</xsl:stylesheet>
~~~~~The output
124 - |
125 - }
126 - ~
127 - 
128 - Â?
129 - 
130 - Â,
...
254 - ß
255 - ÿ
256 - Ä?
257 - ā
258 - Ä,
...
511 - Ç¿
512 - È?
Copyright © 1999Â-2004 by ABC. All rights reserved.
Copyright &#169;1999&#150;2004 by ABC. All rights reserved.
Copyright © 1999-2004 by ABC. All rights reserved.
</snipped>
Cary Millsap
2004-04-30 15:31:51 UTC
Permalink
Chris,

I think I've "got it." Finally. :-) Here's the story:

Thank you, Julian. The missing element was the <meta ../> tag in the output
XHTML. Without being told that the encoding was UTF-8, my browser assumed a
single-byte encoding. When it saw a two-byte UTF-8 sequence, it simply
rendered each byte as a separate character. With the right meta tag (shown
in my XSL below), everything renders fine on IE 6 and Netscape 7.
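
A rough way to reproduce the symptom outside the browser is the Python sketch below. It assumes the browser fell back to a Latin-1-style single-byte encoding, which is a guess on my part, and the function name is only illustrative.

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as if they were Latin-1.

    This imitates a viewer that was never told the bytes are UTF-8.
    """
    return text.encode("utf-8").decode("latin-1")


# No-break space (&#160;) and copyright sign (&#169;), as in the stylesheet.
original = "Copyright\u00a0\u00a9 1999"
garbled = misread_as_latin1(original)

# repr() makes the invisible no-break space show up as \xa0.
print(repr(garbled))

# Reversing the mistake recovers the original text, so no data was lost.
print(garbled.encode("latin-1").decode("utf-8") == original)
```

Each character in the 128-191 range picks up a leading 'Â' because its UTF-8 form starts with the byte 194 (0xC2), which is 'Â' in Latin-1.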

I use Vim. I haven't taken the time to figure out how to make Vim
Unicode-aware. In this case, it's a good thing, because it allows me to see
the two-byte sequences (and even three-byte sequences for chars like
&#8211;). It's part of what allowed me to diagnose the problem. In fact, if
I hadn't seen anything funny in the "raw" HTML output, I would have used
'od -c' or something to see what was actually in there.

Now, I think I'm finally on the correct path to the solution. Below, see the
XSL that I think does its work the smart way. Some notes:

- The real "trick" is telling the browser how to render the content of the
file. This is done with the <meta> tag.
- The entity declarations (DOCTYPE section) merely allow me to refer to
non-ASCII characters in my XSL by using convenient names (instead of having
to wonder what "&#8211;" is every time I see it in my XSL source). The entity
stuff isn't magic; it's just syntactic sugar that makes it easier to read
the XSL.
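
For anyone who wants to check which characters those numbers actually refer to, here is a small Python sketch (the ENTITIES table and the describe helper are just illustration, not part of any tool used here). It also looks at &#150;, which the earlier stylesheet used for the dash.

```python
import unicodedata

# The three entities declared in the DOCTYPE section, with their code points.
ENTITIES = {"nbsp": 160, "copy": 169, "ndash": 8211}


def describe(code_point: int) -> tuple[str, bytes]:
    """Return the Unicode name and the UTF-8 bytes of a code point."""
    ch = chr(code_point)
    name = unicodedata.name(ch, "<unnamed control character>")
    return name, ch.encode("utf-8")


for entity, cp in ENTITIES.items():
    name, encoded = describe(cp)
    print(f"&{entity}; = &#{cp}; = {name}, UTF-8 bytes {encoded.hex(' ')}")

# &#150; is a C1 control character in Unicode, not a dash.
print(describe(150)[0], unicodedata.category(chr(150)))

# Byte 150 is an en dash only in the Windows-1252 code page.
print(bytes([150]).decode("cp1252") == chr(8211))
```

In other words, &#8211; is the en dash in Unicode, while 150 is only a dash in the Windows-1252 code page, so the numeric reference in an XML file should be 8211.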


Cary

~~~~~test1.xsl
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xsl [
<!ENTITY nbsp "&#160;">
<!ENTITY copy "&#169;">
<!ENTITY ndash "&#8211;">
]>

<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:saxon="http://icl.com/saxon">
<xsl:output method="saxon:xhtml" version="1.0" encoding="UTF-8"
omit-xml-declaration="no"
doctype-public="-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN"

doctype-system="http://www.w3.org/2002/04/xhtml-math-svg/xhtml-math-svg.dtd"
indent="yes" media-type="text/xml"
saxon:character-representation="decimal"
/>

<xsl:template match="/">
<html>
<head>
<meta http-equiv="content-type" content="text/html;charset=utf-8"/>
<title>Testing</title>
</head>
<body>
Copyright &copy;&nbsp;1999&ndash;2004 by ABC. All rights reserved.
</body>
</html>
</xsl:template>

</xsl:stylesheet>


~~~~~test1.html
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN"
"http://www.w3.org/2002/04/xhtml-math-svg/xhtml-math-svg.dtd">
<html xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:saxon="http://icl.com/saxon">
<head>
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
<title>Testing</title>
</head>
<body>
Copyright © 1999-2004 by ABC. All rights reserved.
</body>
</html>
Julian F. Reschke
2004-04-30 07:49:07 UTC
Permalink
Post by Cary Millsap
Yes, this is what I want in my XHTML. But I'm not going to get it by
specifying the emission of that string in my XSL transform. When I cut and
Copyright ©1999-2004 by ABC. All rights reserved.
At the bottom of this note is an interesting test that has taught me a lot
today. In it, I emitted a few values of "&#n;" for interesting values of n.
Note that for values of 128 and above, the UTF-8 two-byte character rule
kicks in. You can see both the characters in the output. Notice the symmetry
in the values 128 and 256, 129 and 257, etc. as the second byte cycles
through the character set for a different first-byte value. And finally,
notice the final three lines in the body, which build up to the right string
to emit.
Right so far.
Post by Cary Millsap
This is Julian's point (but unfortunately it took a while for me to figure
out what he was telling me). I was instructing my XSL to put two bytes into
my output when I said for my XSL to emit the Unicode character &#169;. It
did. When you render those two bytes in IE (or, I suspect, any other
browser), it simply shows you what those two bytes look like: "Â©".
Nope. This only occurs if you manage to confuse the browser about the
encoding. This will happen if the following three declarations do not agree:

- HTTP content-type header
- XML declaration (if present)
- HTML META tag http-equiv content-type
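
As a rough illustration of checking those three declarations against each other, here is a Python sketch. The function names are made up for this example, the regular expressions are deliberately simple rather than a real HTML or XML parser, and a declaration that is missing is simply skipped.

```python
import re
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type value, lowercased."""
    m = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)", value, re.IGNORECASE)
    return m.group(1).lower() if m else None


def encoding_from_xml_declaration(document: str) -> Optional[str]:
    """Extract the encoding from an XML declaration at the start of the text."""
    m = re.match(r"<\?xml[^>]*?encoding\s*=\s*[\"']([\w.:-]+)[\"']", document)
    return m.group(1).lower() if m else None


def charset_from_meta(document: str) -> Optional[str]:
    """Extract the charset from a META http-equiv content-type tag."""
    tag = re.search(
        r"<meta[^>]*http-equiv\s*=\s*[\"']content-type[\"'][^>]*>",
        document,
        re.IGNORECASE,
    )
    if not tag:
        return None
    content = re.search(
        r"\bcontent\s*=\s*[\"']([^\"']*)[\"']", tag.group(0), re.IGNORECASE
    )
    return charset_from_content_type(content.group(1)) if content else None


def declarations_agree(http_content_type: str, document: str) -> bool:
    """True when every encoding declaration that is present names the same charset."""
    found = {
        charset_from_content_type(http_content_type),
        encoding_from_xml_declaration(document),
        charset_from_meta(document),
    }
    found.discard(None)  # simplification: absent declarations are ignored
    return len(found) <= 1


page = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<html><head>'
    '<meta http-equiv="content-type" content="text/html;charset=utf-8" />'
    '</head><body></body></html>'
)
print(declarations_agree("text/html; charset=utf-8", page))
print(declarations_agree("text/html; charset=ISO-8859-1", page))
```

The second call reports a disagreement because the header names ISO-8859-1 while the document itself declares UTF-8, which is the kind of mismatch described above.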
Post by Cary Millsap
What I really want in my XSL output (my XHTML file) is exactly the string
you've asked about, which includes the substring "&#169" in it instead of
the two-byte sequence "Â©". To emit what I want, I need to emit an ampersand
(by using "&amp;") and the string "#169;", which was easy enough, once I had
the right perspective.
Again: why would you want that? Your "fix" will get you the copyright
symbol, but as you probably have a software problem somewhere inside
your transformation process, *any* non-ASCII character that may appear
in content will still be broken.

Solve the software problem, and you never will have to think about this
again.

Regards, Julian
Julian F. Reschke
2004-04-30 07:45:14 UTC
Permalink
Post by Cary Millsap
I think I understand better what Julian was telling me. UTF-8 emits 2 bytes
for chars 128-2047. Apparently, the first byte happens to be decimal 194,
which browsers render the same as &Acirc;. Ok.
<p class="copyright">Copyright<xsl:text
disable-output-escaping="yes">&amp;copy;</xsl:text> 1999...</p>
Why would you ever want to do that?

Stop thinking in terms of named entities. Use Unicode as it was designed
to be used. The browsers will work just fine. Which encoding is used doesn't
matter at all, as long as it's declared properly. This applies both to
XML and HTML.


Julian
Julian F. Reschke
2004-04-30 07:43:22 UTC
Permalink
Post by Cary Millsap
Ok. Perhaps my failure to understand is caused by my not asking the correct
question in the first place.
I want to emit XHTML strings that render in the same manner that the
Copyright &copy; 1999&ndash;2004 by ABC.
&nbsp;&nbsp;?&nbsp;&nbsp;
The approaches I've tried haven't worked. (For example, when I emit
"&#169;", it renders as "Â©" instead of "©".) What is an elegant way to emit
the strings that I'm trying to emit?
The UTF-8 encoded output should render just fine in the browser, *as
long* as it is served with the correct content type. If it doesn't,
you'll have to check your server-side code that does the transform and
sends the result back to the browser.

Julian