dracoblue.net

Gotchas when parsing XML/HTML with PHP

Since Craur is able to parse XML/HTML easily (by using DOMDocument and XPath under the hood), you might want to know what variety of headaches it saves you from.

DOMDocument::loadXML/loadHTML and UTF-8: DOMDocument does not like non-UTF-8 strings.

You have to work around this by converting the string with iconv/mbstring first:

<?php
$xml_string = iconv($encoding, 'utf-8', $xml_string);
$node = new DOMDocument('1.0', 'utf-8');
$is_loaded = $node->loadXML($xml_string, LIBXML_NOCDATA | LIBXML_NOWARNING | LIBXML_NOERROR);
if (!$is_loaded)
{
    throw new Exception('Invalid xml: ' . $xml_string);
}
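The snippet above assumes that $encoding is already known. As a minimal sketch of how the conversion could be guarded, the hypothetical helper to_utf8() below (not part of Craur) only converts when the input is not valid UTF-8 yet. The ISO-8859-1 fallback is an assumption about where the data comes from, and mb_check_encoding requires the mbstring extension.

```php
<?php
// Hypothetical helper (not part of Craur): returns a UTF-8 version of $string.
// $fallback_encoding is an assumption; adjust it to your data source.
function to_utf8($string, $fallback_encoding = 'ISO-8859-1')
{
    // Already valid UTF-8? Then leave the string untouched.
    if (mb_check_encoding($string, 'UTF-8'))
    {
        return $string;
    }

    $converted = iconv($fallback_encoding, 'UTF-8', $string);
    if ($converted === false)
    {
        throw new Exception('Cannot convert from ' . $fallback_encoding . ' to utf-8');
    }
    return $converted;
}

// "ü" as a single ISO-8859-1 byte, which is not valid UTF-8
$latin1_xml = "<name>J\xFCrgen</name>";

$node = new DOMDocument('1.0', 'utf-8');
$is_loaded = $node->loadXML(to_utf8($latin1_xml), LIBXML_NOWARNING | LIBXML_NOERROR);
```

Checking first avoids converting a string twice, which would garble text that is already UTF-8.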

DOMDocument::loadXML and warnings/errors: By default DOMDocument::loadXML uses PHP's warnings and errors to tell you that something is wrong with the given XML. If you prefer exceptions (and you should!), you have to set two options. In my opinion this is preferable, because you can catch the exception and handle warnings/errors on a per-case basis.

The LIBXML_NOWARNING and LIBXML_NOERROR options disable the warnings and errors, so you can check the return value of DOMDocument::loadXML to see whether it worked or not.

<?php
$is_loaded = $node->loadXML($xml_string, LIBXML_NOCDATA | LIBXML_NOWARNING | LIBXML_NOERROR);
if (!$is_loaded)
{
    throw new Exception('Invalid xml: ' . $xml_string);
}
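With both flags set, the exception only says that the XML is invalid, not why. If you want the parser messages in the exception, one option is to let libxml collect them with libxml_use_internal_errors instead of silencing them. This is a sketch of that alternative, not what the snippet above does; the helper name load_xml_or_throw() is made up for this example.

```php
<?php
// Hypothetical helper: load XML and turn libxml's collected errors into one exception.
function load_xml_or_throw($xml_string)
{
    // Collect errors internally instead of raising PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $node = new DOMDocument('1.0', 'utf-8');
    $is_loaded = $node->loadXML($xml_string, LIBXML_NOCDATA);
    $errors = libxml_get_errors();
    libxml_clear_errors();
    // Restore whatever setting was active before.
    libxml_use_internal_errors($previous);

    if (!$is_loaded)
    {
        $messages = array();
        foreach ($errors as $error)
        {
            $messages[] = trim($error->message) . ' (line ' . $error->line . ')';
        }
        throw new Exception('Invalid xml: ' . implode('; ', $messages));
    }
    return $node;
}

try
{
    load_xml_or_throw('<root><unclosed></root>');
}
catch (Exception $exception)
{
    echo $exception->getMessage() . "\n";
}
```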

DOMDocument::loadHTML and errors/warnings: loadHTML does not use the same technique to suppress warnings/errors as loadXML does. So you have to enable internal errors at the libxml level, run your logic, disable them again and finally handle the error.

<?php
libxml_use_internal_errors(true);
$node->loadHTML($html_string);
$error = libxml_get_last_error();
libxml_use_internal_errors(false);

if ($error)
{
    throw new Exception('Invalid html (' . trim($error->message) . ', line: ' . $error->line . ', col: ' . $error->column . '): ' . $html_string);
}
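The snippet above switches internal errors off unconditionally and only looks at the last error. A slightly more defensive sketch restores the previous setting and returns every collected error, so the caller can decide which ones matter. The helper name load_html_collecting_errors() is hypothetical.

```php
<?php
// Hypothetical helper: load HTML and return the document plus all collected libxml errors.
function load_html_collecting_errors($html_string)
{
    // Remember the previous setting so that we do not change global state for others.
    $previous = libxml_use_internal_errors(true);
    $node = new DOMDocument('1.0', 'utf-8');
    $node->loadHTML($html_string);
    $errors = libxml_get_errors();
    // Empty the error buffer, otherwise old errors show up in later calls.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return array($node, $errors);
}

list($node, $errors) = load_html_collecting_errors('<html><body><p>Hello</p></body></html>');
foreach ($errors as $error)
{
    echo trim($error->message) . ', line: ' . $error->line . ', col: ' . $error->column . "\n";
}
```

Each entry is a LibXMLError with a level property (LIBXML_ERR_WARNING, LIBXML_ERR_ERROR or LIBXML_ERR_FATAL), which can be used to throw only on the severe ones.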

DOMDocument and namespaces: Even though namespace declarations in XML look like attributes, they do not show up among the normal attributes of an element. There is a workaround: query the document with XPath.

<?php
$xpath = new DOMXPath($node);
$namespaces = array();
$attributes = array();
// query the namespace axis of the root element
foreach ($xpath->query('namespace::*', $node->documentElement) as $namespace_node)
{
    $namespace_name = $namespace_node->nodeName;
    if ($namespace_name !== 'xmlns:xml')
    {
        $namespaces[$namespace_name] = $namespace_node->nodeValue;
    }
}

$namespaces = array_reverse($namespaces, true);
foreach ($namespaces as $namespace_name => $namespace_uri)
{
    $attributes[$namespace_name] = $namespace_uri;
}
var_dump($attributes);
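To see the workaround end to end, here is a small self-contained sketch. The Atom/Media RSS document is only sample input chosen for this example.

```php
<?php
$node = new DOMDocument('1.0', 'utf-8');
$node->loadXML('<feed xmlns="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/"><title>Example</title></feed>');

// The namespace declarations are not part of the attribute list.
$attribute_count = $node->documentElement->attributes->length;

$xpath = new DOMXPath($node);
$namespaces = array();
foreach ($xpath->query('namespace::*', $node->documentElement) as $namespace_node)
{
    // Skip the implicit xml namespace, which is always in scope.
    if ($namespace_node->nodeName !== 'xmlns:xml')
    {
        $namespaces[$namespace_node->nodeName] = $namespace_node->nodeValue;
    }
}

var_dump($attribute_count);
var_dump($namespaces);
```

The default namespace is reported under the key xmlns, prefixed ones under xmlns:prefix.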

DOMDocument and vertical tabs (or other control characters): Some word processors (like Microsoft Word) inject control characters such as the vertical tab (character number 11) into documents. Then you'll end up with errors like "non SGML character number 11".

There are lots of discussions about this issue, but I ended up solving it with a regular expression that strips these characters before calling DOMDocument::loadXML or DOMDocument::loadHTML.

<?php
$xml_string = preg_replace('/[\x01-\x08\x0B\x0C\x0E-\x1F]/', '', $xml_string);
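A short sketch of the effect, using a made-up input string with a vertical tab in it. Tab, line feed and carriage return are valid in XML and are deliberately kept.

```php
<?php
// Sample input: "\x0B" is the vertical tab, "\t" and "\n" are valid in XML.
$xml_string = "<text>first\x0Bsecond\tthird\n</text>";
$clean_xml_string = preg_replace('/[\x01-\x08\x0B\x0C\x0E-\x1F]/', '', $xml_string);

$raw_node = new DOMDocument('1.0', 'utf-8');
$is_raw_loaded = $raw_node->loadXML($xml_string, LIBXML_NOWARNING | LIBXML_NOERROR);

$clean_node = new DOMDocument('1.0', 'utf-8');
$is_clean_loaded = $clean_node->loadXML($clean_xml_string, LIBXML_NOWARNING | LIBXML_NOERROR);

var_dump($is_raw_loaded);   // the vertical tab makes the document invalid
var_dump($is_clean_loaded); // the cleaned string loads fine
```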

DOMDocument::loadHTML and fragments: If you only have a snippet of HTML available, but want to load it with DOMDocument, you have to wrap it in an <html> tag first. Otherwise loadHTML cannot load it properly, and if you are brave enough to try loadXML, it will fail because of, for instance, unclosed tags.

Another issue is that DOMDocument cannot determine the encoding of fragments. So you have to add <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> to your fragment to keep the special characters working.

<?php
$is_just_a_fragment = (stripos($html_string, '<html') === false);

if ($is_just_a_fragment)
{
    $html_string = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=' . $encoding . '"/></head><body>' . $html_string . '</body></html>';
}
// code to load the html and so on
// finally:
if ($is_just_a_fragment)
{
    $data = $data['html']['body'];
}
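The snippet above unwraps the result from Craur's array structure. If you work with DOMDocument directly, the same idea could look like the following sketch. The helper name load_html_fragment() is hypothetical, and returning the body element is just one possible way to hand back the fragment.

```php
<?php
// Hypothetical helper: parse an HTML fragment and return the wrapping <body> element.
function load_html_fragment($html_string, $encoding = 'utf-8')
{
    // Wrap the fragment and announce its encoding, so special characters survive.
    $wrapped_html = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=' . $encoding . '"/></head><body>' . $html_string . '</body></html>';

    $previous = libxml_use_internal_errors(true);
    $node = new DOMDocument('1.0', 'utf-8');
    $node->loadHTML($wrapped_html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $node->getElementsByTagName('body')->item(0);
}

// "Grüße" written with explicit UTF-8 bytes
$body = load_html_fragment("<p>Gr\xC3\xBC\xC3\x9Fe</p><p>second</p>");
foreach ($body->childNodes as $child_node)
{
    echo $child_node->nodeName . ': ' . $child_node->textContent . "\n";
}
```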

CDATA Tags: Usually, you don't want to know if the data was escaped as CDATA in the XML. You only want to know the contents of the tag.

Otherwise checks like $child_node->nodeType === XML_TEXT_NODE become cumbersome, because you also have to check for XML_CDATA_SECTION_NODE, with no benefit (so far!).

You can disable CDATA sections with the LIBXML_NOCDATA flag, which merges them into regular text nodes:

<?php
$is_loaded = $node->loadXML($xml_string, LIBXML_NOCDATA | LIBXML_NOWARNING | LIBXML_NOERROR);
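The difference is easy to see when the same string is loaded twice, once without and once with the flag. The sample XML is made up for this sketch.

```php
<?php
$xml_string = '<title><![CDATA[Tom & Jerry]]></title>';

$with_cdata_node = new DOMDocument('1.0', 'utf-8');
$with_cdata_node->loadXML($xml_string);

$without_cdata_node = new DOMDocument('1.0', 'utf-8');
$without_cdata_node->loadXML($xml_string, LIBXML_NOCDATA);

// Without the flag the child is a CDATA section node ...
var_dump($with_cdata_node->documentElement->firstChild->nodeType === XML_CDATA_SECTION_NODE);
// ... with the flag it is a plain text node.
var_dump($without_cdata_node->documentElement->firstChild->nodeType === XML_TEXT_NODE);
// The content is the same in both cases.
var_dump($without_cdata_node->documentElement->textContent);
```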

Child node DOMDocumentType: When using $node->childNodes, you should call $node->hasChildNodes() first, to see if there are any child nodes at all. But there might also be a child node of type DOMDocumentType.

This child node represents <!DOCTYPE html> and should be ignored (unless you really need it!).
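A minimal sketch of such a loop, assuming you only care about the actual content nodes of the document. The sample HTML is made up for this example.

```php
<?php
$node = new DOMDocument('1.0', 'utf-8');
$node->loadHTML('<!DOCTYPE html><html><head><title>Example</title></head><body><p>Hello</p></body></html>');

$child_names = array();
if ($node->hasChildNodes())
{
    foreach ($node->childNodes as $child_node)
    {
        // Skip the node that represents <!DOCTYPE html>.
        if ($child_node instanceof DOMDocumentType)
        {
            continue;
        }
        $child_names[] = $child_node->nodeName;
    }
}
var_dump($child_names);
```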

Conclusion: Use Craur, and contribute to it, to avoid headaches/gotchas like these.

In craur, html, open source, php, xml by
@ 10 Dec 2013, Comments at Reddit & Hackernews
