Get innerText and split on <br>

Problem

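Below is a minimal example of some HTML I'm trying to extract text content from. The result I want is the array ['keep1','keep2','keep3','keep4','keep5'], so I want to drop anything that belongs to a child element of the div and then split the div's text into an array on the <br /> tags.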

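Normally I would use .innerText on the div, which helps with getting all the text and removing the child elements, but as far as I know it isn't suitable in this case because I would lose the <br /> tags that I need in order to split the text into an array. Below is the best I could come up with, but it doesn't handle the case where a child element isn't surrounded by <br />. Is there a better way to do this?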

const text = document
  .querySelector("div")
  .innerHTML.split("<br>")
  .map(e => e.trim())
  .filter(e => e[0] != "<" && e != "");
console.log(text);
<div>
  <br /> keep1 <br /> keep2
  <span>drop</span> keep3
  <br /> keep4
  <br />
  <h4>drop2</h4>
  <br />keep5
</div>

Solution

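In terms of order of operations, it is easier to first replace the line breaks (/\n/g) with <br> tags and then split the result. Once we have dealt with the only HTML element we care about (<br>), we can use the regular expression /\<(.*)\>/g to remove the remaining elements. Note that .* is greedy, so within each segment it removes everything from the first < to the last >, which is what discards the text inside the child elements here.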

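It actually surprised me that <br /> is "normalized" to <br> when the markup is parsed, but as this S.O. post explains, <br /> is XHTML syntax and browsers parse everything as HTML, so innerHTML returns <br>.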

const text = document
  .querySelector("div")
  .innerHTML.replace(/\n/g,"<br>") // replace all line breaks with `<br>`
  .split("<br>")
  .map(e => e.replace(/\<(.*)\>/g,'').trim()) // we clean and trim the element from any html tags
  .filter(e=>e) // this cleans out the empty array elements
console.log(text);
<div>
  <br /> keep1 <br /> keep2
  <span>drop</span> keep3
  <br /> keep4
  <br />
  <h4>drop2</h4>
  <br />keep5
</div>
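
Because the greedy .* in the regular expression above only works when each segment holds at most one child element, a variant of the same string-based idea may be more forgiving. The following is a minimal sketch, not the answer's own code: splitOnBr is a hypothetical helper name, it takes an HTML string (for example the value of innerHTML), and it assumes child elements of the same tag name are not nested inside each other. Regular expressions are generally fragile for HTML, so treat it as an illustration only.

```javascript
// Hypothetical helper: splits an HTML string into trimmed text segments,
// dropping the contents of every child element other than <br>.
function splitOnBr(html) {
  return html
    // Replace each non-<br> element (opening tag, contents, closing tag)
    // with a <br>, so the text on either side ends up in separate segments.
    // The backreference \1 pairs an opening tag with its closing tag.
    .replace(/<(?!br\b)(\w+)\b[^>]*>[\s\S]*?<\/\1>/gi, "<br>")
    // Split on <br>, <br/> and <br />, and also on line breaks.
    .split(/<br\s*\/?>|\n/i)
    // Strip any leftover tags (e.g. void elements) and trim white-space.
    .map((segment) => segment.replace(/<[^>]+>/g, "").trim())
    // Discard the empty segments.
    .filter(Boolean);
}

// Sample input; here the <span> shares a line with the text around it.
const sampleHtml = `
  <br /> keep1 <br> keep2 <span>drop</span> keep3
  <br/> keep4
  <br>
  <h4>drop2</h4>
  <br>keep5
`;

console.log(splitOnBr(sampleHtml));
// ["keep1", "keep2", "keep3", "keep4", "keep5"]
```

In a browser this could be called as splitOnBr(document.querySelector("div").innerHTML).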


One possible approach is the following:

// we use the spread syntax inside of an Array-literal to convert the
// iterable result of document.querySelector().childNodes into an
// Array:
const text = [...
  // here we retrieve the first/only <div> element from the document
  // and return the live NodeList of all its child-nodes:
  document.querySelector('div').childNodes
  // we then use Array.prototype.filter() to filter the returned collection:
].filter(
  // we use an Arrow function to test each node passed to the
  // Array.prototype.filter() method ('node' is a reference to the current
  // node of the Array of nodes):
  // node.nodeType: we first test that the node has a nodeType,
  // we then assess if the node is a textNode (the nodeType of a text-node
  // is 3),
  // finally - to prevent empty array-element-values - we check that
  // the nodeValue (the text-content of the text-node), once
  // leading and trailing white-space is removed, has a length greater
  // than zero:
  (node) => node.nodeType && node.nodeType === 3 && node.nodeValue.trim().length > 0
  // we then use Array.prototype.map() to return a new Array based on the existing
  // Array of text-nodes:
).map(
  // again we pass the array-element into the function,
  // and here we trim the leading/trailing white-space of the node's value,
  // by passing the string to String.prototype.trim():
  (node) => node.nodeValue.trim()
);

console.log(text); // ["keep1","keep2","keep3","keep4","keep5"]
<div>
  <br /> keep1 <br /> keep2
  <span>drop</span> keep3
  <br /> keep4
  <br />
  <h4>drop2</h4>
  <br />keep5
</div>
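
To see why this approach leaves out "drop" and "drop2": that text sits in text nodes that are children of the <span> and <h4>, not direct children of the <div>, so filtering the div's childNodes for text nodes never reaches it. Below is a small sketch of the same idea wrapped in a reusable function. The name directTextSegments and the plain-object stand-ins for DOM nodes are assumptions made for illustration, so that the snippet also runs outside a browser.

```javascript
// Node-type values as defined by the DOM standard.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical helper: collects the trimmed, non-empty text of the
// text nodes that are direct children of `parent`.
function directTextSegments(parent) {
  return Array.from(parent.childNodes)
    .filter((node) => node.nodeType === TEXT_NODE)
    .map((node) => node.nodeValue.trim())
    .filter((text) => text.length > 0);
}

// Plain-object stand-ins for the nodes of
// "<br> keep1 <span>drop</span> keep2" (illustrative only).
const mockDiv = {
  childNodes: [
    { nodeType: ELEMENT_NODE, nodeValue: null, childNodes: [] }, // <br>
    { nodeType: TEXT_NODE, nodeValue: " keep1 " },
    {
      // <span>drop</span>: "drop" is a child of the span, not of the div
      nodeType: ELEMENT_NODE,
      nodeValue: null,
      childNodes: [{ nodeType: TEXT_NODE, nodeValue: "drop" }],
    },
    { nodeType: TEXT_NODE, nodeValue: " keep2" },
  ],
};

console.log(directTextSegments(mockDiv)); // ["keep1", "keep2"]
```

In a browser the same function could be called with a real element, for example directTextSegments(document.querySelector("div")).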

