For more info on how loadHTML/loadHTMLFile handle encodings, please visit http://www.onphp5.com/article/57
DOMDocument::loadHTML
(No version information available, might be only in CVS)
DOMDocument::loadHTML - Load HTML from a string
Description
bool DOMDocument::loadHTML ( string $source )
This function parses the HTML contained in the string source. Unlike loading XML, the HTML does not have to be valid in order to load. When called statically, this function creates a DOMDocument object from the loaded content. The static call may be used when no DOMDocument properties need to be set before loading.
Parameters
- source
  The HTML string.
Return Values
Returns TRUE on success or FALSE on failure.
Errors / Exceptions
Passing an empty string as source generates a warning. This warning is not raised by libxml, so it cannot be handled with libxml's error handling functions.
Examples
Example 1 Creating a document
<?php
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br></body></html>");
echo $doc->saveHTML();
?>
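The error behaviour described above suggests two separate precautions: check for an empty string yourself, and let libxml collect parse errors for markup that is not valid. The following is only a sketch of one way to combine them. The helper name loadHtmlOrNull is hypothetical and not part of the DOM extension.

```php
<?php
// Hypothetical helper (not part of the DOM extension): returns a DOMDocument,
// or null when the input is empty or loadHTML() reports failure.
function loadHtmlOrNull(string $source): ?DOMDocument
{
    // Guard first: the empty-string warning does not come from libxml,
    // so libxml's error handling functions cannot intercept it.
    if ($source === '') {
        return null;
    }

    // Collect libxml parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $ok = $doc->loadHTML($source);

    // Discard the collected errors and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok ? $doc : null;
}

$doc = loadHtmlOrNull("<html><body>Test<br></body></html>");
echo $doc !== null ? $doc->saveHTML() : "nothing to parse\n";
```

Call libxml_get_errors() before libxml_clear_errors() if the parse messages should be inspected rather than discarded.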
DOMDocument::loadHTML
onPHP5.com
19-Nov-2007 03:51
xuanbn at yahoo dot com
04-Oct-2007 10:38
If you use loadHTML() to process a UTF-8 HTML string (e.g. in Vietnamese), the result may come out as garbage text, even though some files load fine. This can happen even when your HTML already has a meta charset tag like
<meta http-equiv="content-type" content="text/html; charset=utf-8">
I have discovered that, for loadHTML() to process a UTF-8 file correctly, the meta tag must come first, before any UTF-8 text appears. For example, this HTML file
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<title> Vietnamese - Tiếng Việt</title>
</head>
<body></body>
</html>
will be OK with loadHTML() because the <meta> tag appears before the <title> tag.
But the file below will not be recognized correctly by loadHTML(), because the <title> tag, which contains UTF-8 text, appears before the <meta> tag.
<html>
<head>
<title> Vietnamese - Tiếng Việt</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body></body>
</html>
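When you cannot control where the <meta> tag sits in the input, one possible workaround is to prepend the tag to the string before loading, so that the charset declaration always precedes any UTF-8 text. This is a minimal sketch that assumes the input is known to be UTF-8. The helper name loadUtf8Html is hypothetical.

```php
<?php
// Hypothetical helper: assumes $html is UTF-8 encoded.
function loadUtf8Html(string $html): DOMDocument
{
    $meta = '<meta http-equiv="content-type" content="text/html; charset=utf-8">';

    // Keep warnings about the not-quite-valid combined markup out of the output.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // The declaration now comes before the <title>, wherever the original put it.
    $doc->loadHTML($meta . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

$html = '<html><head><title>Vietnamese - Tiếng Việt</title></head><body></body></html>';
$doc = loadUtf8Html($html);
echo $doc->getElementsByTagName('title')->item(0)->textContent, "\n";
```

Note that the prepended <meta> element stays in the parsed document, so remove it afterwards if the tree is going to be serialized again.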
Ard
18-Jun-2007 12:55
The comment from bigtree at DONTSPAM dot 29a dot nl
26-Apr-2005 11:15 was helpful.
In addition I noted that if your doctype declaration is not valid, DomDocument::loadHtml won't respect your charset=utf-8. It made me crazy. Beware!
hanhvansu at yahoo dot com
27-Apr-2007 05:50
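When using loadHTML() to process UTF-8 pages, you may find that the output of the DOM functions does not match the input. For example, where you expect "Cạnh tranh", you may get mojibake such as "Cáº¡nh tranh" instead. I suggest using mb_convert_encoding before loading a UTF-8 page:
<?php
// $htmlUTF8Page is assumed to already hold the UTF-8 encoded page source.
$pageDom = new DomDocument();
// Convert non-ASCII characters to HTML entities so the parser cannot misread them.
$searchPage = mb_convert_encoding($htmlUTF8Page, 'HTML-ENTITIES', "UTF-8");
// Load the converted string, not the original one.
@$pageDom->loadHTML($searchPage);
?>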
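The 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2. The same idea can be sketched with mb_encode_numericentity(), which turns every non-ASCII character into a numeric entity before the string reaches the parser. The helper name encodeNonAscii and the sample markup are made up for illustration, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper: assumes $html is valid UTF-8 and mbstring is loaded.
function encodeNonAscii(string $html): string
{
    // Conversion map: every code point from U+0080 to U+10FFFF,
    // no offset, mask wide enough to keep the full code point.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($html, $map, 'UTF-8');
}

$htmlUTF8Page = '<html><body><p>Cạnh tranh</p></body></html>';

$pageDom = new DOMDocument();
// The parser now sees only ASCII, so no charset guessing is involved.
$pageDom->loadHTML(encodeNonAscii($htmlUTF8Page));

echo $pageDom->getElementsByTagName('p')->item(0)->textContent, "\n";
```

Because the markup itself is ASCII, only the text content is affected. Tags and attributes pass through unchanged.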
romain dot lalaut at laposte dot net
15-Feb-2007 05:31
Note that the elements of a document loaded this way will have no namespace, even with <html xmlns="http://www.w3.org/1999/xhtml">.
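A quick way to check this, and to see what it means for XPath queries, is sketched below. The sample markup is made up for illustration.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Test</p></body></html>');

// The xmlns attribute does not place the elements in a namespace.
var_dump($doc->documentElement->namespaceURI);

// XPath queries can therefore use plain, unprefixed element names.
$xpath = new DOMXPath($doc);
echo $xpath->query('//p')->length, "\n";
```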
bigtree at DONTSPAM dot 29a dot nl
26-Apr-2005 11:15
Pay attention when loading html that has a different charset than iso-8859-1. Since this method does not actively try to figure out what the html you are trying to load is encoded in (like most browsers do), you have to specify it in the html head. If, for instance, your html is in utf-8, make sure you have a meta tag in the html's head section:
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>
If you do not specify the charset like this, all high-ASCII bytes will be HTML-encoded. Setting the encoding of the DOM document you are loading the HTML into to UTF-8 is not enough.
