Iterating faster through an xarray and a dataframe

Problem description

I am new to Python and do not know all of its ins and outs.

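I want to iterate through a dataframe (2D) and assign some of its values to an xarray (3D). The coordinates of my xarray are company tickers (1), financial variables (2), and daily dates (3). The columns of each company's dataframe are some of the same financial variables as in the xarray, and its index consists of quarterly dates.

My goal is to take an already generated dataframe for each company, look up the value in the column of a given variable and the row of a given date, and assign it to the corresponding position in the xarray.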

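Since some dates will not be present in the index of the dataframe (there are only 4 dates per calendar year), I want to assign to the xarray either a 0 or the value of the previous date in the xarray, if that value is itself not 0. I have tried doing this with nested for loops, but going through all the dates for just one variable takes about 20 seconds.

My date list consists of about 8000 dates, the variable list has about 30 variables, and the company list has about 800 companies. If I were to loop over all of them, the nested for loops would take days to finish. Is there a faster way to assign these values to the xarray? My guess is something like iterrows() or iteritems(), but for xarray. Here is sample code from my program, with shorter lists of companies and variables: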

import pandas as pd
from datetime import datetime,date,timedelta
import numpy as np
import xarray as xr
import time

start_time = time.time()

# We create the df. This is an auxiliary, made-up df. It's a shorter version of the real df.
# The real df I want to use is much larger and comes from an external method.
cols = ['cashAndCashEquivalents','shortTermInvestments','cashAndShortTermInvestments','totalAssets','totalLiabilities','totalStockholdersEquity','netIncome','freeCashFlow']
rows = []
for year in range(1989,2020):
    for month,day in zip([3,6,9,12],[31,30,30,31]):
        rows.append(date(year,month,day))
a = np.random.randint(100,size=(len(rows),len(cols)))
df = pd.DataFrame(data=a,columns=cols)
df.insert(column='date',value=rows,loc=0)
# This is just to set the date format so that I can later look up the values
for item,i in zip(df.iloc[:,0],range(len(df.iloc[:,0]))):
    df.iloc[i,0] = datetime.strptime(str(item),'%Y-%m-%d')
df.set_index('date',inplace=True)

# Coordinates for the xarray:
companies = ['AAPL']  # This is actually longer (around 800 companies),but for the sake of the question,it is limited to just one company.
variables = ['totalAssets','totalStockholdersEquity']  # Same as with the companies (around 30 variables).
first_date = date(1998,3,25)
last_date = date.today() + timedelta(-300)
dates = pd.date_range(start=first_date,end=last_date).tolist()

# We create a zero xarray,so that we can later fill it up with values:
z = np.zeros((len(companies),len(variables),len(dates)))
ds = xr.DataArray(z,coords=[companies,variables,dates],dims=['companies','variables','dates'])

# We assign values from the df to the ds
for company in companies:
    for variable in variables:
        first_value_found = False
        for date in dates:
            # Dates in the df are quarterly dates and dates in the ds are daily dates.
            # We start off by looking for a certain date in the df. If we don't find it, we give it the value 0 in the ds.
            # If we do find it, we assign it the value found in the df and record that the first value has been found.
            # Once the first value has been found, when we don't find a value in the df, instead of giving it a value of 0, we give it the value of the previous date.
            if first_value_found == False:
                try:
                    ds.loc[company,variable,date] = df.loc[date,variable]
                    first_value_found = True
                except:
                    ds.loc[company,variable,date] = 0
            else:
                try:
                    ds.loc[company,variable,date] = df.loc[date,variable]
                except:
                    ds.loc[company,variable,date] = ds.loc[company,variable,date + timedelta(-1)]

print("My program took",time.time() - start_time,"to run")

The main problem lies in the for loops: I have tested them in a separate file, and they appear to be the most time-consuming part.

Solution

One possible strategy is to loop over the actual index of the DataFrame rather than over all possible dates:

avail_dates = df.index
for date in avail_dates:
    # Copy the data

This should already cut the number of iterations considerably. You still have to make sure that all the gaps are filled, so you would do something like

da.loc[company,variables,date:] = df.loc[date,variables]

That's right, you can index a DataArray and a DataFrame with lists. (Also, I would not use ds as the variable name for something that is not an xarray DataSet.)
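To make the loop above concrete, here is a minimal end-to-end sketch on toy data. The three quarterly rows, the short daily range, and the names `in_range`, `quarter_end`, and `row` are made up for illustration. It also reshapes each row into a column vector before assigning, on the assumption that a plain 1-D row would otherwise not broadcast along the dates axis; treat it as one way the idea could look, not as a tested drop-in.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy data: three quarterly rows, two variables, one company.
variables = ["totalAssets", "totalStockholdersEquity"]
quarters = pd.to_datetime(["2019-03-31", "2019-06-30", "2019-09-30"])
df = pd.DataFrame([[10, 1], [20, 2], [30, 3]], index=quarters, columns=variables)

companies = ["AAPL"]
dates = pd.date_range("2019-03-25", "2019-10-05")  # daily axis
da = xr.DataArray(
    np.zeros((len(companies), len(variables), len(dates))),
    coords=[companies, variables, dates],
    dims=["companies", "variables", "dates"],
)

for company in companies:
    # Visit only the quarter ends that fall inside the daily range.
    in_range = df.index[(df.index >= dates[0]) & (df.index <= dates[-1])]
    for quarter_end in in_range:
        # Column vector of shape (n_variables, 1) so it broadcasts along dates.
        row = df.loc[quarter_end, variables].to_numpy()[:, None]
        # Overwrite everything from this quarter end onward; later quarter
        # ends overwrite the tail again, which amounts to a forward fill.
        da.loc[company, variables, quarter_end:] = row
```

With this layout the inner loop runs once per quarterly row instead of once per day, and days before the first quarterly row in range keep their initial 0.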

What you probably want to use, though, is pandas.DataFrame.reindex().

If I understand what you are trying to do, this should more or less do the trick (untested)
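A minimal sketch of the reindex() idea on toy data follows. The data and the name `complete_df` are assumptions for illustration, as is the choice to forward-fill with `method="ffill"` and then replace the remaining leading gaps with 0 through `fillna(0)`.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy data: three quarterly rows, two variables, one company.
variables = ["totalAssets", "totalStockholdersEquity"]
quarters = pd.to_datetime(["2019-03-31", "2019-06-30", "2019-09-30"])
df = pd.DataFrame([[10, 1], [20, 2], [30, 3]], index=quarters, columns=variables)

companies = ["AAPL"]
dates = pd.date_range("2019-03-25", "2019-10-05")  # daily axis
da = xr.DataArray(
    np.zeros((len(companies), len(variables), len(dates))),
    coords=[companies, variables, dates],
    dims=["companies", "variables", "dates"],
)

for company in companies:
    # Forward-fill the quarterly rows onto the daily axis. Days before the
    # first quarterly row have nothing to carry forward and are set to 0.
    complete_df = df[variables].reindex(dates, method="ffill").fillna(0)
    # Transpose to (variables, dates) to match the DataArray layout, then
    # write the whole company slab in a single assignment.
    da.loc[{"companies": company}] = complete_df.to_numpy().T
```

One difference from the nested loops in the question is worth checking against your data: if the dataframe has rows dated before the first daily date, the forward fill carries the last of those values into the start of the range instead of leaving 0 there.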

