How to get clean JSON from the Wikipedia API

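Problem

I want to get the content of the Wikipedia page https://en.wikipedia.org/wiki/February_2 as JSON.

I tried their API: https://en.wikipedia.org/w/api.php?action=parse&page=February_19&prop=text&formatversion=2&format=json

Although the response is JSON, the content inside it is HTML. I only want the text content.

I need a way to get a clean result.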

Solution

If you want plain text without markup, you first have to parse the JSON object and then extract the text from the HTML:

// Load the HTML into a detached element and read back only its text.
function htmlToText(html) {
   const tempDiv = document.createElement("div");
   tempDiv.innerHTML = html;
   return tempDiv.textContent || tempDiv.innerText || "";
}

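// origin=* allows anonymous cross-origin (CORS) requests to the API.
const url = 'https://en.wikipedia.org/w/api.php?action=parse&page=February_19&prop=text&format=json&formatversion=2&origin=*';

$.getJSON(url, function(data) {
  // With formatversion=2, data.parse.text is the page HTML as a string.
  const html = data['parse']['text'];
  const plainText = htmlToText(html);
  // Keep only lines that start with a four-digit year followed by an en dash.
  const array = [...plainText.matchAll(/^\d{4} *–.*/gm)].map(x => x[0]);
  console.log(array);
});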

<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>

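Update: I edited the code above based on the comments below. It now extracts all list items that start with a four-digit year and puts them into an array.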
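If the goal is structured JSON rather than an array of raw lines, each matched line could be split into a year and a description. The following is only a sketch: `parseYearEntries` is a name made up for illustration, the sample text is invented, and it relies on the same assumption as the regex above, namely that entries look like a four-digit year followed by an en dash. Entries with shorter years would be skipped.

```javascript
// Hypothetical helper: turn lines such as "1942 – Some event." into
// { year, text } objects. The en dash separator mirrors the regex
// used in the answer; other line shapes are ignored.
function parseYearEntries(plainText) {
  const entries = [];
  for (const match of plainText.matchAll(/^(\d{4}) *– *(.*)$/gm)) {
    entries.push({ year: Number(match[1]), text: match[2].trim() });
  }
  return entries;
}

// Invented sample input, standing in for the output of htmlToText().
const sample = [
  "Events",
  "1942 – First sample event.",
  "not an entry",
  "2001 – Second sample event."
].join("\n");

console.log(JSON.stringify(parseYearEntries(sample), null, 2));
```

In the answer's code, `parseYearEntries(plainText)` would take the place of the `matchAll(...).map(...)` line.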

I assume that by "clean" you mean the source wikitext. In that case, you can use the revisions module:

https://en.wikipedia.org/w/api.php?action=query&titles=February_2&prop=revisions&rvprop=content&formatversion=2&format=json

For details, see API:Get the contents of a page and API:Revisions.
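A minimal sketch of reading the wikitext out of that response follows. It assumes the request uses `formatversion=2` without `rvslots`, in which case `query.pages` is an array and the text sits under `revisions[0].content`. The function names `extractWikitext` and `fetchWikitext` are made up for illustration, and `origin=*` is added here only so the request also works from a browser.

```javascript
// Hypothetical helper: pull the wikitext out of a parsed
// action=query&prop=revisions response (formatversion=2, no rvslots).
// Returns null if the page is missing or has no revision content.
function extractWikitext(data) {
  const page = data && data.query && data.query.pages && data.query.pages[0];
  if (!page || page.missing || !page.revisions || !page.revisions.length) {
    return null;
  }
  return page.revisions[0].content;
}

// Hypothetical helper: request the latest revision of a page and
// return its wikitext. Uses fetch(), available in browsers and Node 18+.
async function fetchWikitext(title) {
  const params = new URLSearchParams({
    action: "query",
    titles: title,
    prop: "revisions",
    rvprop: "content",
    formatversion: "2",
    format: "json",
    origin: "*" // anonymous cross-origin access
  });
  const response = await fetch("https://en.wikipedia.org/w/api.php?" + params);
  if (!response.ok) {
    throw new Error("Request failed with status " + response.status);
  }
  return extractWikitext(await response.json());
}
```

It could be called as `fetchWikitext("February_2").then(console.log)`. The result is raw wikitext, so templates and link markup are still present.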
