Traversing the DOM tree

Problem description

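As most PHP libraries that perform HTML sanitization (HTML Purifier, for example) are heavily dependent on regular expressions, I thought that writing an HTML sanitizer built on DOMDocument and its related classes would be a worthwhile experiment. I'm at a very early stage, but so far the project shows some promise.

My idea revolves around a class that uses DOMDocument to traverse all nodes in the supplied markup, compares them against a whitelist, and removes anything that is not on it. (The first implementation is very basic and removes nodes based only on their type, but I hope to make it more sophisticated later and analyze the node's attributes, whether links point to items in a different domain, and so on.)

My question is: how do I traverse the DOM tree? As I understand it, DOM* objects have a childNodes property, so do I need to recurse over the whole tree? Also, early experiments with DOMNodeLists have shown that you need to be very careful about the order in which you remove things, otherwise you might leave items behind or trigger exceptions.

If anyone has experience with manipulating a DOM tree in PHP, I'd appreciate any feedback on the topic.

Edit: I've built the following method for my HTML cleaning class. It recursively walks the DOM tree and checks whether each element it finds is on the whitelist. If it isn't, the element is removed.

The problem I ran into was that removing a node changes the indexes of all subsequent nodes in the DOMNodeList. Simply working from the bottom up avoids this. It is still a very basic approach, but I think it shows promise. It certainly runs a lot faster than HTML Purifier, though admittedly Purifier does a lot more.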

/**
 * Recursively remove elements from the DOM that aren't whitelisted
 * @param DOMNode $elem
 * @return array List of elements removed from the DOM
 * @throws Exception If removal of a node fails
 */
private function cleanNodes (DOMNode $elem)
{
    $removed    = array ();
    if (in_array ($elem -> nodeName,$this -> whiteList))
    {
        if ($elem -> hasChildNodes ())
        {
            /*
             * Iterate over the element's children. The reason we go backwards is because
             * going forwards will cause indexes to change when elements get removed
             */
            $children   = $elem -> childNodes;
            $index      = $children -> length;
            while (--$index >= 0)
            {
                $removed = array_merge ($removed,$this -> cleanNodes ($children -> item ($index)));
            }
        }
    }
    else
    {
        // The element is not on the whitelist, so remove it
        if ($elem -> parentNode -> removeChild ($elem))
        {
            $removed [] = $elem;
        }
        else
        {
            throw new Exception ('Failed to remove node from DOM');
        }
    }
    return ($removed);
}
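
The index shift described in the edit above can be reproduced in isolation. The following is a minimal sketch, not part of the original class: the helper names `removeForward` and `removeBackward` and the sample markup are made up for illustration. It relies on `childNodes` being a live list, so every removal is reflected in it immediately.

```php
<?php
// Hypothetical helpers for illustration; names and markup are not from the question.

/** Remove children front to back; returns how many children are left. */
function removeForward(string $xml): int
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $root = $doc->documentElement;
    $children = $root->childNodes; // live list: reflects every removal

    for ($i = 0; $i < $children->length; $i++) {
        // After a removal the next sibling slides into index $i,
        // and the following $i++ steps over it.
        $root->removeChild($children->item($i));
    }
    return $children->length;
}

/** Remove children back to front; returns how many children are left. */
function removeBackward(string $xml): int
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $root = $doc->documentElement;
    $children = $root->childNodes;

    for ($i = $children->length - 1; $i >= 0; $i--) {
        // Indexes below $i are unaffected by removing item $i.
        $root->removeChild($children->item($i));
    }
    return $children->length;
}

$xml = '<root><a/><b/><c/><d/></root>';
echo removeForward($xml), "\n";  // prints 2: <b/> and <d/> were skipped
echo removeBackward($xml), "\n"; // prints 0
```

Walking backwards works because removing the item at index `$i` never changes the position of the items before it.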

Solution

To start with, you could have a look at this custom RecursiveDOMIterator: https://github.com/salathe/spl-examples/wiki/RecursiveDOMIterator Code:

class RecursiveDOMIterator implements RecursiveIterator
{
    /**
     * Current Position in DOMNodeList
     * @var Integer
     */
    protected $_position;

    /**
     * The DOMNodeList with all children to iterate over
     * @var DOMNodeList
     */
    protected $_nodeList;

    /**
     * @param DOMNode $domNode
     * @return void
     */
    public function __construct(DOMNode $domNode)
    {
        $this->_position = 0;
        $this->_nodeList = $domNode->childNodes;
    }

    /**
     * Returns the current DOMNode
     * @return DOMNode
     */
    public function current()
    {
        return $this->_nodeList->item($this->_position);
    }

    /**
     * Returns an iterator for the current iterator entry
     * @return RecursiveDOMIterator
     */
    public function getChildren()
    {
        return new self($this->current());
    }

    /**
     * Returns if an iterator can be created for the current entry.
     * @return Boolean
     */
    public function hasChildren()
    {
        return $this->current()->hasChildNodes();
    }

    /**
     * Returns the current position
     * @return Integer
     */
    public function key()
    {
        return $this->_position;
    }

    /**
     * Moves the current position to the next element.
     * @return void
     */
    public function next()
    {
        $this->_position++;
    }

    /**
     * Rewind the Iterator to the first element
     * @return void
     */
    public function rewind()
    {
        $this->_position = 0;
    }

    /**
     * Checks if current position is valid
     * @return Boolean
     */
    public function valid()
    {
        return $this->_position < $this->_nodeList->length;
    }
}

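You can use it in combination with a RecursiveIteratorIterator. Usage examples are on the page. In general, though, it is easier to use XPath to search for blacklisted nodes than to traverse the DOM tree. Also keep in mind that DOM is already quite good at preventing XSS, because it automatically escapes XML entities in nodeValues. The other thing you have to be aware of is that any manipulation of a DOMDocument will immediately affect any DOMNodeList you might have obtained from XPath queries, which can result in skipped nodes while you are processing them. For an example, see "DOMNode replacement with PHP's DOM classes".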
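
As a rough sketch of the XPath approach, the function below selects every element, copies the result into a plain array, and only then removes the elements that are not whitelisted. The function name `removeNonWhitelisted`, the whitelist, and the sample markup are assumptions made for this example. Copying the list first is a defensive step: the loop then cannot skip nodes, whether or not the underlying DOMNodeList reacts to changes in the document.

```php
<?php
// Hypothetical helper for illustration; not taken from the original answer.

/**
 * Remove every element whose name is not in $whiteList.
 * Returns the names of the removed elements in document order.
 */
function removeNonWhitelisted(DOMDocument $doc, array $whiteList): array
{
    $xpath = new DOMXPath($doc);

    // Snapshot the query result into an array before touching the document.
    $elements = iterator_to_array($xpath->query('//*'));

    $removed = [];
    foreach ($elements as $element) {
        if (in_array($element->nodeName, $whiteList, true)) {
            continue; // whitelisted, keep it
        }
        if ($element->parentNode !== null) {
            $element->parentNode->removeChild($element);
            $removed[] = $element->nodeName;
        }
    }
    return $removed;
}

$doc = new DOMDocument();
$doc->loadXML('<div><p>ok</p><script>alert(1)</script><p><blink>x</blink></p></div>');

$removed = removeNonWhitelisted($doc, ['div', 'p']);
echo implode(',', $removed), "\n"; // prints script,blink
```

Note that removing an element also discards its whole subtree, including any whitelisted descendants. Whether that is the desired behaviour for a sanitizer is a design decision this sketch does not try to settle.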
