Causes of and Fixes for Garbled Chinese Text in JSP Pages on Tomcat

Date: 2005-09-19 Source: wlst

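This article discusses why Chinese text in JSP pages comes out garbled on Tomcat and how to fix it. It is compiled from material found online.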

The JSP 2.0 page directive has two attributes, contentType and pageEncoding.
The SCWCD Exam Study Kit describes them as follows:
The contentType attribute specifies the MIME type and character encoding of the
output. The default value of the MIME type is text/html; the default value of the
character encoding is ISO-8859-1. The MIME type and character encoding are
separated by a semicolon, as shown here:

<%@ page contentType="text/html;charset=ISO-8859-1"%>

This is equivalent to writing the following line in a servlet:

response.setContentType("text/html;charset=ISO-8859-1");

The pageEncoding attribute specifies the character encoding of the JSP page. The
default value is ISO-8859-1. The following line illustrates the syntax:

<%@ page pageEncoding="ISO-8859-1" %>

The JSP 2.0 specification describes contentType and pageEncoding as follows:
contentType
Defines the MIME type and the character encoding for the
response of the JSP page, and is also used in determining the
character encoding of the JSP page.
Values are either of the form “TYPE” or “TYPE;charset=CHARSET”
with an optional white space after the “;”.
“TYPE” is a MIME type, see the IANA registry at
http://www.iana.org/assignments/media-types/index.html
for useful values. “CHARSET”, if present, must be the IANA name for
a character encoding.
The default value for “TYPE” is “text/html” for JSP pages in
standard syntax, or “text/xml” for JSP documents in XML
syntax. If “CHARSET” is not specified, the response
character encoding is determined as described in
Section JSP.4.2, “Response Character Encoding”.
See Chapter JSP.4 for complete details on character
encodings.

pageEncoding

Describes the character encoding for the JSP page. The value
is of the form “CHARSET”, which must be the IANA name
for a character encoding. For JSP pages in standard syntax,
the character encoding for the JSP page is the charset given
by the pageEncoding attribute if it is present, otherwise the
charset given by the contentType attribute if it is present,
otherwise “ISO-8859-1”.
For JSP documents in XML syntax, the character encoding
for the JSP page is determined as described in section 4.3.3
and appendix F.1 of the XML specification. The pageEncoding
attribute is not needed for such documents. It is a
translation-time error if a document names different
encodings in its XML prolog / text declaration and in the
pageEncoding attribute. The corresponding JSP
configuration element is page-encoding (see
Section JSP.3.3.4, “Declaring Page Encodings”).
See Chapter JSP.4 for complete details on character
encodings.
For JSP pages in standard syntax, the page character encoding is determined
from the following sources:

A JSP configuration element page-encoding value whose URL pattern matches
the page.

The pageEncoding attribute of the page directive of the page. It is a translation-
time error to name different encodings in the pageEncoding attribute of
the page directive of a JSP page and in a JSP configuration element whose
URL pattern matches the page.

The charset value of the contentType attribute of the page directive. This is
used to determine the page character encoding if neither a JSP configuration
element page-encoding nor the pageEncoding attribute are provided.

If none of the above is provided, ISO-8859-1 is used as the default character
encoding.
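
The precedence order quoted above can be sketched in a few lines of Java. This is only an illustration of the rules as quoted, not Tomcat's actual implementation; the class name PageEncodingResolver and its methods are hypothetical, and comparing encoding names case-insensitively is a simplifying assumption.

```java
/** Illustrative only: follows the precedence order quoted from the spec; not Tomcat's code. */
final class PageEncodingResolver {

    static final String DEFAULT_ENCODING = "ISO-8859-1";

    /**
     * @param configEncoding   page-encoding of a matching JSP configuration element, or null
     * @param pageEncodingAttr pageEncoding attribute of the page directive, or null
     * @param contentTypeAttr  contentType attribute of the page directive, or null
     */
    static String resolve(String configEncoding, String pageEncodingAttr, String contentTypeAttr) {
        if (configEncoding != null && pageEncodingAttr != null
                && !configEncoding.equalsIgnoreCase(pageEncodingAttr)) {
            // The spec treats conflicting declarations as a translation-time error.
            throw new IllegalStateException("page-encoding and pageEncoding disagree");
        }
        if (configEncoding != null) {
            return configEncoding;
        }
        if (pageEncodingAttr != null) {
            return pageEncodingAttr;
        }
        String charset = charsetOf(contentTypeAttr);
        return charset != null ? charset : DEFAULT_ENCODING;
    }

    /** Returns the CHARSET part of "TYPE;charset=CHARSET", or null when there is none. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String key = "charset=";
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, key, 0, key.length())) {
                return p.substring(key.length()).trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, null, "text/html; charset=GB2312"));
        System.out.println(resolve(null, "Big5", "text/html; charset=UTF-8"));
        System.out.println(resolve(null, null, "text/html"));
    }
}
```

The three calls in main cover a page that only declares contentType, a page that declares both attributes, and a page that declares no encoding at all.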

Differences between contentType and pageEncoding, and tips for setting them on Chinese JSP pages:

contentType -- specifies the encoding of the page content that the browser (the client) finally sees.
This is what Mozilla shows under Character Encoding and IE6 under Encoding. For example, the JSPtw Forum uses Big5 as its contentType.

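pageEncoding -- specifies the encoding used when the JSP file was written.
If you write JSPs in Notepad on Windows 98 or ME, the file is almost certainly in Big5 or GB2312. With Notepad on Windows 2000 or XP,
you can choose the encoding when saving: ANSI (Big5/GB2312), UTF-8, or Unicode (presumably UTF-16).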

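A JSP goes through two rounds of "encoding", spread over three stages:
stage one uses pageEncoding, stage two goes from UTF-8 to UTF-8, and stage three is the page that Tomcat sends out, which uses contentType.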

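Stage one is the translation of the JSP into Java source (.java) by JSPC. It reads the JSP according to the pageEncoding setting,
so a JSP written in the declared pageEncoding (UTF-8, Big5, GB2312) is translated into Java source (.java) that is uniformly UTF-8.
If pageEncoding is set wrongly, or not set at all (the default is ISO-8859-1), the Chinese text is already garbled at this stage.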
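
A minimal sketch of why stage one fails when the encoding is missing: the same bytes are decoded once with the encoding the file was really saved in and once with the ISO-8859-1 default. The class StageOneDecoding is hypothetical and only imitates the decoding step, and the example assumes the JDK in use ships the GB2312 charset.

```java
import java.nio.charset.Charset;

/** Hypothetical helper that imitates how a JSP translator decodes the page file. */
final class StageOneDecoding {

    /** Decodes the raw bytes of a JSP file using the encoding the translator assumes. */
    static String decode(byte[] jspBytes, String assumedEncoding) {
        return new String(jspBytes, Charset.forName(assumedEncoding));
    }

    public static void main(String[] args) {
        // Two Chinese characters (U+4E2D U+6587), written as escapes so that
        // the encoding of this source file itself does not matter.
        String original = "\u4e2d\u6587";

        // Bytes as an editor would store them when saving the page in GB2312.
        byte[] saved = original.getBytes(Charset.forName("GB2312"));

        String right = decode(saved, "GB2312");     // pageEncoding declared correctly
        String wrong = decode(saved, "ISO-8859-1"); // pageEncoding missing, default applies

        System.out.println("decoded with GB2312 matches original: " + right.equals(original));
        System.out.println("decoded with ISO-8859-1 matches original: " + wrong.equals(original));
        System.out.println("characters after the wrong decode: " + wrong.length());
    }
}
```

With the wrong encoding every byte becomes a separate Latin-1 character, so the damage is done before the Java source is even written.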

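Stage two is the compilation of the Java source into Java bytecode by javac. Whether the JSP was written in UTF-8, Big5, or GB2312,
the output of stage one is always UTF-8 encoded Java source.
javac reads the Java source as UTF-8 and compiles it into binary code (.class) in which strings are UTF-8 encoded (strictly, the JVM's modified UTF-8).
This is how the Java Virtual Machine specification represents constant strings in binary code (Java bytecode).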
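
The byte form of such a constant can be approximated with DataOutputStream.writeUTF, which writes a 2-byte length followed by modified UTF-8. This is a sketch under the assumption that writeUTF is a fair stand-in for the class-file string encoding; it does not read a real .class file, and the class name StageTwoConstantBytes is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper: shows the bytes writeUTF produces for a string constant. */
final class StageTwoConstantBytes {

    /** Encodes text in modified UTF-8 and drops the 2-byte length prefix writeUTF adds. */
    static byte[] modifiedUtf8(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters, U+4E2D U+6587
        byte[] stored = modifiedUtf8(text);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("bytes stored: " + stored.length);
        System.out.println("identical to plain UTF-8: " + Arrays.equals(stored, utf8));
    }
}
```

For ordinary Chinese characters the two encodings agree byte for byte; they differ only for U+0000 and for characters outside the Basic Multilingual Plane.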

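Stage three is when Tomcat (or another application container) loads and runs the Java bytecode from stage two; its output is what the browser (the client) sees.
This is where the contentType parameter, carried along unnoticed through stages one and two, takes effect (see stage one).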

response.setContentType("text/html; charset=utf-8");

The output can be UTF-8, Big5, or GB2312; it depends on the contentType setting in the JSP.

<%@ page session="false" pageEncoding="big5" contentType="text/html; charset=utf-8" %>
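
What a directive like this amounts to can be sketched as a plain transcoding step: bytes saved in the page encoding are decoded into a Java string, and the string is encoded again in the response charset. The class StageThreeTranscoding is a hypothetical illustration, not container code, and it assumes the JDK in use ships the Big5 charset.

```java
import java.nio.charset.Charset;

/** Hypothetical helper: page bytes in, response bytes out. */
final class StageThreeTranscoding {

    /** Decodes with the page encoding (stage one), then encodes with the response charset (stage three). */
    static byte[] transcode(byte[] sourceBytes, String pageEncoding, String responseCharset) {
        String text = new String(sourceBytes, Charset.forName(pageEncoding));
        return text.getBytes(Charset.forName(responseCharset));
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters, U+4E2D U+6587

        // The page file as saved in Big5.
        byte[] big5Bytes = original.getBytes(Charset.forName("Big5"));

        // pageEncoding="big5", contentType charset=utf-8
        byte[] utf8Bytes = transcode(big5Bytes, "Big5", "UTF-8");

        System.out.println("bytes in the Big5 page file: " + big5Bytes.length);
        System.out.println("bytes in the UTF-8 response: " + utf8Bytes.length);
    }
}
```

The text is the same at both ends; only its byte representation changes, which is why the two attributes can safely name different encodings.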

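Also, pageEncoding and contentType both default to ISO-8859-1, and once you set either one, the other follows it (this is the case in Tomcat 4.1.27).
This is not absolute, though; it depends on how each JSPC handles it. Allowing pageEncoding to differ from contentType makes it easier to develop and serve JSP pages in Asian CJKV scripts
(for example, pageEncoding=Big5 with contentType=utf-8).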

A simple fix is to add the following line at the top of both the including file and each included file:

<%@ page contentType="text/html;charset=GB2312" language="java" %>

An example follows: main.jsp

<%@ page contentType="text/html;charset=GB2312" language="java" %>
<html>
<head><title>测试页</title></head>
<body>
<%@ include file="hello.jsp" %>
<b><p align="center"><font color="#ff0000">主页中的表：</font></p></b>
<br>
<table width="98%" height="20" border="0" cellpadding="0" align="center" bgcolor="#99CCCC" cellspacing="0">
<tr>
<td align="center" valign="middle"><font color="white">
首页 | 
产品介绍 | 
留言板 | 
技术论坛 | 
库存拍卖 | 
系统管理 | 
关于本站 | 
联系我们</font></td>
</tr>
</table>
</body>
</html>

hello.jsp

<%@page contentType="text/html;charset=GB2312" %>
<b><p align="center"><font color="#ff0000">被包含页中的表：</font></p></b>
<br>
<table width="98%" height="20" border="0" cellpadding="0" align="center" bgcolor="#99CCCC" cellspacing="0">
<tr>
<td align="center" valign="middle"><font color="white">
首页 | 
产品介绍 | 
留言板 | 
技术论坛 | 
库存拍卖 | 
系统管理 | 
关于本站 | 
联系我们</font></td>
</tr>
</table>

This runs correctly on Tomcat 5.5.7.