How can I use multiple Promises in recursion?

Problem description

I'm trying to solve the following problem: a script enters a website, takes the first 10 links from it, then visits those 10 links, and then follows the next 10 links found on each of those pages, until the number of visited pages reaches 1000. So the crawl fans out roughly as 1 page → 10 pages → 100 pages → 1000 pages.

I tried to do this with a for loop inside a promise combined with recursion. Here is my code:

const rp = require('request-promise');
const url = 'http://somewebsite.com/';

const websites = []
const promises = []

const getonSite = (url, count = 0) => {
    console.log(count, websites.length)
    promises.push(new Promise((resolve, reject) => {
        rp(url)
            .then(async function (html) {
                // pull all absolute http(s) URLs out of the page
                let links = html.match(/https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)/g)
                if (links !== null) {
                    // keep only the first 10 links
                    links = links.splice(0, 10)
                }
                websites.push({ url, links })
                if (links !== null) {
                    for (let i = 0; i < links.length; i++) {
                        if (count < 3) {
                            // recurse into each link, up to 3 levels deep
                            resolve(getonSite(links[i], count + 1))
                        } else {
                            resolve()
                        }
                    }
                } else {
                    resolve()
                }

            }).catch(err => {
                resolve()
            })
    }))

}

getonSite(url)

Solution

I think you probably need a recursive function that takes three arguments:

  1. an array of urls to extract links from
  2. an accumulated array of links collected so far
  3. a limit that decides when to stop crawling

You just call it with the root url and await the promise it returns:

const allLinks = await crawl([rootUrl]);
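
Note that await is only valid inside an async function (or at the top level of an ES module), so in a plain script you might wrap that call in an async IIFE, for example:

(async () => {
  const rootUrl = 'http://somewebsite.com/';
  const allLinks = await crawl([rootUrl]);
  console.log(allLinks.length, 'links collected');
})().catch(console.error);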

On the first call, the second and third arguments can take their default values:

async function crawl (urls, accumulated = [], limit = 1000) {
  ...
}

The function will fetch each url, extract its links, and recurse until it hits the limit. I haven't tested any of this, but I'm thinking of something along these lines:

// limit the number of links per page to 10
const perPageLimit = 10;

async function crawl (urls, accumulated = [], limit = 1000) {

  // if the limit has been depleted or we don't have any urls,
  // return the accumulated result
  if (limit <= 0 || urls.length === 0) {
    return accumulated;
  }

  // process this set of urls in parallel, extracting links from each page,
  // then flatten the per-page results into a single array
  const links = (await Promise.all(
    urls
      .splice(0, perPageLimit)       // limit to 10
      .map(url => fetchHtml(url)     // fetch the url
        .then(extractUrls))          // and extract its links
  )).flat();

  // then recurse
  return crawl(
    links,                        // newly extracted array of links from this call
    [...accumulated, ...links],   // appended to the accumulated list
    limit - links.length          // reduce the limit and recurse
  );
}

async function fetchHtml (url) {
   // fetch the page and return its HTML as a string
}

const extractUrls = (html) => html.match( ... )
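
The answer leaves fetchHtml and extractUrls as stubs. As a rough sketch only, they could be filled in by reusing request-promise and the URL regex from the question (the cap of 10 links per page mirrors perPageLimit above):

const rp = require('request-promise');

// same absolute-URL pattern as in the question
const urlPattern = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)/g;

// fetch the raw HTML of a page; any promise-returning HTTP client would do
const fetchHtml = (url) => rp(url);

// return the first 10 links found in the HTML, or an empty array when nothing matches
const extractUrls = (html) => (html.match(urlPattern) || []).slice(0, 10);

With these two in place, crawl([rootUrl]) should resolve to the accumulated list of links; failed requests could also be caught inside fetchHtml so that a single bad link does not reject the whole Promise.all.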