Parallel API requests with Spark and PySpark

Problem description

I am doing the following to make API requests from an EMR cluster:

def get_normal_objects(self, object_name, get_id, chunk_size=35, **params):
    contacts_pages = []
    batch = 0
    while True:
        # Build the paginated URLs for this batch of pages.
        urls = ["{}/{}".format(self.base_url,
                               "{}?page={}&api_key={}".format(object_name, page_number, self.api_keys))
                for page_number in range(batch * chunk_size + 1, chunk_size * (1 + batch) + 1)]

        responses_raw = self.get_responses(urls, self.office_token, chunk_size=chunk_size)
        LOGGER.info("Collecting data for {} for batch {}".format(object_name, batch))

        try:
            responses_json = [json.loads(response_raw['output']) for response_raw in responses_raw]

The code works fine when I am pulling simple objects that do not need an id, but it takes a very long time when it has to pull complex relational objects that first need an id to build the API URL:

"https://integrations.mydesktop.com.au/api/v1.2/properties/22028014/sales?api_key"

def get_complex_objects(self, object_name_1, object_name_2, ids, spark, chunk_size=30, **params):
    results = []
    batch = 0

    while True:
        # Take the next chunk of ids for this batch.
        ids_start = batch * chunk_size + 1
        ids_end = chunk_size * (1 + batch) + 1
        chunk_ids = [ids[i] for i in range(ids_start, ids_end) if i < len(ids)]

        # One URL per id, e.g. {base_url}/properties/{id}/sales?api_key=...
        urls = [
            "{}/{}".format(self.base_url,
                           "{}/{}/{}?api_key={}".format(object_name_1, contactId, object_name_2, self.api_keys))
            for contactId in chunk_ids]

        LOGGER.info("Collecting data for {}:{} for batch {}".format(object_name_1, object_name_2, batch))
        responses_raw = self.get_responses(urls, self.office_token, chunk_size=chunk_size)

I am fetching the responses with the following get_responses function:

def get_responses(self, urls, office_token, chunk_size=30, **params):
    """Calls all the urls in parallel in batches of {chunk_size}

    Arguments:
        urls {List} -- list of all urls to call
        office_token {String} -- Office token

    Keyword Arguments:
        chunk_size {int} -- number of parallel api calls (default: {30})

    Returns:
        List -- one {"url": ..., "output": ...} dict per url
    """
    responses = []
    loop = asyncio.get_event_loop()
    for chunk in mydesktop.chunks(urls, chunk_size):
        future = asyncio.ensure_future(self.__run(office_token, chunk))
        # Accumulate each chunk's results instead of overwriting them.
        responses.extend(loop.run_until_complete(future))

    return responses

async def __fetch(self, url, params, session):
    try:
        async with session.get(url, params=params) as response:
            #print('X-RateLimit-Remaining:{0}'.format(response.headers['X-RateLimit-Remaining']))
            output = await response.read()
            return output
    except asyncio.TimeoutError as e:
        print(str(e))
        return None

async def __bound_fetch(self, sem, url, session):
    # Getter function with semaphore; limits how many requests run at once.
    async with sem:
        output = await self.__fetch(url, None, session)
        return {"url": url, "output": output}

async def __run(self, auth_user, urls):
    tasks = []
    # Cap the number of in-flight requests.
    sem = asyncio.Semaphore(400)
    async with ClientSession(auth=BasicAuth(auth_user, password=''), connector=TCPConnector(ssl=False)) as session:
        for url in urls:
            task = asyncio.ensure_future(self.__bound_fetch(sem, url, session))
            tasks.append(task)
        responses = await asyncio.gather(*tasks)
    return responses

My question is: how can I take advantage of Spark's parallelism and distribute the URLs across the executors to reduce the extraction time?

I am thinking of using urls = spark.sparkContext.parallelize(urls) to ship the URLs to the executors and then making the GET requests with a map lambda.
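
A minimal sketch of that idea follows, assuming requests is installed on the executors; fetch_partition, the example URL list, and numSlices=32 are illustrative placeholders rather than part of the original code:

import json
import requests
from pyspark.sql import SparkSession

def fetch_partition(urls):
    # Runs on an executor: one requests.Session per partition, one GET per URL.
    session = requests.Session()
    for url in urls:
        resp = session.get(url, timeout=30)
        yield {"url": url, "output": resp.text}

spark = SparkSession.builder.appName("api-extract").getOrCreate()

# Stand-in for the URL list that get_normal_objects / get_complex_objects already builds.
urls = ["https://example.com/api/v1/items?page={}&api_key=KEY".format(p) for p in range(1, 101)]

# numSlices controls how many partitions the URL list is split into,
# i.e. how many groups of URLs can be fetched concurrently across executors.
rdd = spark.sparkContext.parallelize(urls, numSlices=32)
responses = rdd.mapPartitions(fetch_partition).collect()

responses_json = [json.loads(r["output"]) for r in responses]

Using mapPartitions rather than a per-URL map lets each partition reuse a single HTTP session; the existing asyncio fetcher could equally be called inside fetch_partition to keep per-chunk concurrency on each executor.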

Solution

No working solution for this problem has been posted yet.
