How can I use multiple Promises in recursion?

Problem description

I'm trying to solve the following problem: a script enters a website, takes the first 10 links from it, then visits those 10 links, and then follows the next 10 links found on each of those pages, until the number of visited pages reaches 1000. So the crawl fans out roughly as 1 page → 10 pages → 100 pages → 1000 pages.

I tried to do this with a for loop inside a promise combined with recursion. Here is my code:

const rp = require('request-promise');
const url = 'http://somewebsite.com/';

const websites = []
const promises = []

const getonSite = (url, count = 0) => {
    console.log(count, websites.length)
    promises.push(new Promise((resolve, reject) => {
        rp(url)
            .then(async function (html) {
                // pull all absolute http(s) URLs out of the page
                let links = html.match(/https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)/g)
                if (links !== null) {
                    // keep only the first 10 links
                    links = links.splice(0, 10)
                }
                websites.push({ url, links })
                if (links !== null) {
                    for (let i = 0; i < links.length; i++) {
                        if (count < 3) {
                            // recurse into each link, up to 3 levels deep
                            resolve(getonSite(links[i], count + 1))
                        } else {
                            resolve()
                        }
                    }
                } else {
                    resolve()
                }

            }).catch(err => {
                resolve()
            })
    }))

}

getonSite(url)

Solution

I think you probably need a recursive function that takes three arguments:

  1. an array of urls to extract links from
  2. an accumulated array of links collected so far
  3. a limit that decides when to stop crawling

You just call it with the root url and await the promise it returns:

const allLinks = await crawl([rootUrl]);
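
Note that await is only valid inside an async function (or at the top level of an ES module), so in a plain script you might wrap that call in an async IIFE, for example:

(async () => {
  const rootUrl = 'http://somewebsite.com/';
  const allLinks = await crawl([rootUrl]);
  console.log(allLinks.length, 'links collected');
})().catch(console.error);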

On the first call, the second and third arguments can take their default values:

async function crawl (urls, accumulated = [], limit = 1000) {
  ...
}

The function will fetch each url, extract its links, and recurse until it hits the limit. I haven't tested any of this, but I'm thinking of something along these lines:

// limit the number of links per page to 10
const perPageLimit = 10;

async function crawl (urls, accumulated = [], limit = 1000) {

  // if the limit has been depleted or we don't have any urls,
  // return the accumulated result
  if (limit <= 0 || urls.length === 0) {
    return accumulated;
  }

  // process this set of urls in parallel, extracting links from each page,
  // then flatten the per-page results into a single array
  const links = (await Promise.all(
    urls
      .splice(0, perPageLimit)       // limit to 10
      .map(url => fetchHtml(url)     // fetch the url
        .then(extractUrls))          // and extract its links
  )).flat();

  // then recurse
  return crawl(
    links,                        // newly extracted array of links from this call
    [...accumulated, ...links],   // appended to the accumulated list
    limit - links.length          // reduce the limit and recurse
  );
}

async function fetchHtml (url) {
   // fetch the page and return its HTML as a string
}

const extractUrls = (html) => html.match( ... )
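
The answer leaves fetchHtml and extractUrls as stubs. As a rough sketch only, they could be filled in by reusing request-promise and the URL regex from the question (the cap of 10 links per page mirrors perPageLimit above):

const rp = require('request-promise');

// same absolute-URL pattern as in the question
const urlPattern = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)/g;

// fetch the raw HTML of a page; any promise-returning HTTP client would do
const fetchHtml = (url) => rp(url);

// return the first 10 links found in the HTML, or an empty array when nothing matches
const extractUrls = (html) => (html.match(urlPattern) || []).slice(0, 10);

With these two in place, crawl([rootUrl]) should resolve to the accumulated list of links; failed requests could also be caught inside fetchHtml so that a single bad link does not reject the whole Promise.all.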