Simple HTML DOM Parser - scraping a Wikipedia table: empty rows and &nbsp;

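Problem description

As an exercise, I am trying to parse a table on Wikipedia. As the title says, I am using Simple HTML DOM Parser for this. I have almost everything I need, but there are two things I cannot get rid of. Here is what I did: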

<?php 

require('simple_html_dom.php');

$html = file_get_html('https://de.wikipedia.org/wiki/ISO-3166-1-Kodierliste');

$table = $html->find('table',0);
$rowData = array();

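// Strip every attribute (href, title, ...) from the links in the table.
foreach ($table->find('a') as $a) {
    foreach ($a->getAllAttributes() as $attr => $val) {
        $a->removeAttribute($attr);
    }
}

// Blank out footnote markers, images and spans so their content
// does not end up in the cell text.
foreach ($table->find('sup') as $sup) {
    $sup->innertext = '';
}

foreach ($table->find('img') as $img) {
    $img->innertext = '';
}

foreach ($table->find('span') as $span) {
    $span->innertext = '';
}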

// Collect the <td> text of each row, skipping rows whose markup
// contains 'hintergrundfarbe8'.
foreach ($table->find('tr') as $row) {
    $uselessrow = 'hintergrundfarbe8';
    $strpostest = strpos($row, $uselessrow);

    if ($strpostest === false) {
        $state = array();
        foreach ($row->find('td') as $cell) {
            $state[] = $cell->plaintext;
        }
        $rowData[] = $state;
    }
}

echo '<table>';

foreach ($rowData as $row => $tr) {
    echo '<tr>';
    foreach ($tr as $td) {
        if ($td !== '') {
            // Strip parenthesised remarks, then leading and trailing whitespace.
            $td = preg_replace('@\(.*@', '', $td);
            $td = preg_replace('@.*?\)@', '', $td);
            $td = preg_replace('@\(.*?\)@', '', $td);
            $td = preg_replace('/^\s+|\s+$/', '', $td);
        }
        echo '<td>' . $td . '</td>';
    }
    echo '</tr>';
}
echo '</table>';
?>

What I need to get rid of:

<html>
<head>
</head>
<body>
    <table>
        <tbody>
            <tr></tr> //this
            <tr><td>&nbsp;Afghanistan</td> //and the &nbsp; here

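The empty <tr> tags, and the &nbsp; at the start of the first <td> in every row. I have tried almost everything I can think of, but I am new to all of this and just want to cry. Any hints on how to solve this?

Plus: any tips on what I need to learn to parse "properly"? Not necessarily about this specific parser. I have very, very basic PHP skills and do not know where to start, because there is so much to learn.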
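
One possible way to handle the two leftovers described above, as a minimal sketch rather than a confirmed fix. It assumes that the empty <tr> comes from rows that contain no <td> cells (for example header rows built from <th>), and that the cell text still carries either the literal entity &nbsp; or a decoded no-break space, which a plain trim() would not remove. The helper names cleanCell and cleanRows and the sample data are hypothetical, and the sketch works on plain arrays so it runs without the parser.

```php
<?php
// Hypothetical helper: normalise the text of a single cell.
function cleanCell(string $text): string
{
    // Turn the literal entity and the decoded no-break space (U+00A0)
    // into an ordinary space so that trim() can remove them.
    $text = str_replace(['&nbsp;', "\u{00A0}"], ' ', $text);
    return trim($text);
}

// Hypothetical helper: clean every cell and drop rows without content.
function cleanRows(array $rows): array
{
    $cleaned = [];
    foreach ($rows as $cells) {
        $cells = array_map('cleanCell', $cells);
        // A row with no <td> cells gives an empty array; skip it,
        // along with rows whose cells are all empty after cleaning.
        if (implode('', $cells) === '') {
            continue;
        }
        $cleaned[] = $cells;
    }
    return $cleaned;
}

// Sample data standing in for $rowData (assumed shape: array of rows,
// each row an array of cell strings).
$sample = [
    [],
    ['&nbsp;Afghanistan', 'AF', 'AFG'],
];
print_r(cleanRows($sample));
```

In the script from the question, the same idea would mean passing $cell->plaintext through cleanCell() and appending $state to $rowData only when it is not empty.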
