Detecting regions in a Python dataset

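Problem description

I'm trying to solve a problem that shouldn't be too hard, but I'm struggling to find a way to do it.

Basically, I have a set of OHLC data: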

>>> print(df)

                       Open    High     Low   Close       Volume                Date
Date
2020-11-02 00:00:00  396.68  401.01  396.44  400.70  41468.48318 2020-11-02 00:00:00
2020-11-02 00:30:00  400.68  404.50  400.61  402.45  35209.25068 2020-11-02 00:30:00
2020-11-02 01:00:00  402.48  403.14  400.62  401.89  18107.53656 2020-11-02 01:00:00
2020-11-02 01:30:00  401.88  402.88  401.26  402.48  13852.17215 2020-11-02 01:30:00
2020-11-02 02:00:00  402.49  403.85  398.82  401.17  21853.35028 2020-11-02 02:00:00
...                     ...     ...     ...     ...          ...                 ...
2020-11-04 19:30:00  401.88  403.88  401.88  402.46  17944.49509 2020-11-04 19:30:00
2020-11-04 20:00:00  402.50  404.23  397.72  399.59  41674.44864 2020-11-04 20:00:00
2020-11-04 20:30:00  399.60  402.26  399.40  401.21  18606.38545 2020-11-04 20:30:00
2020-11-04 21:00:00  401.20  403.15  400.79  402.70  14408.66482 2020-11-04 21:00:00
2020-11-04 21:30:00  402.69  403.01  401.74  402.71   8873.15569 2020-11-04 21:30:00

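Given a fixed price range (for example a width of 10: from 350 to 360, from 351 to 361, and so on), I want to detect when more than N candles closed within that range. So basically the range needs to "slide" across the whole chart and find the areas where that condition holds (more than N candles closed inside the range).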

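Here is a visual example:

In this case, 6 candles closed inside the white box, so that is what I'm looking for. Note that a candle does not have to lie entirely inside the box; it only needs to "close" in there.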

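I tried to make this as clear and detailed as possible. I would like to post more code, but I'm really struggling to find an approach, even though I'm fairly sure this should be easy with Pandas, NumPy or SciPy. Can someone point me in a direction? Any advice is welcome.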

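Solutions

Your description is a bit vague, but maybe this will help:

Say you have the starting points in a NumPy array named start. You can find where those points fall between 350 and 360 with: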

np.where((start > 350) & (start < 360))

To see how many such points there are:

len(np.where((start > 350) & (start < 360))[0])
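
As a small self-contained check of the two expressions above, here is a sketch on made-up values (the start array below is hypothetical, not taken from the question). It uses np.count_nonzero and np.flatnonzero, which give the count and the positions directly from the boolean mask:

```python
import numpy as np

# Hypothetical start values, made up for illustration
start = np.array([348.2, 351.0, 355.5, 359.9, 360.0, 362.3])

in_range = (start > 350) & (start < 360)   # boolean mask, strict bounds
count = np.count_nonzero(in_range)         # how many points fall in the range
positions = np.flatnonzero(in_range)       # indices of those points

print(count, positions)
```

Note that the strict comparisons leave out a value sitting exactly on 350 or 360.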

I suggest adding a loop to your code. It would look something like this:

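import math

# Integer price bounds so that range() works (the closes are floats)
mini = math.floor(df['Close'].min())
maxi = math.ceil(df['Close'].max())

candles = []
for i in range(mini, maxi - 10 + 1):
    # number of candles whose close lies in [i, i + 10]
    n = int(df['Close'].between(i, i + 10).sum())
    if n >= 6:
        candles.append((i, i + 10, n))  # (range low, range high, count)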

Please try it on your DataFrame and see if it works!
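
If the loop turns out to be slow on a long history, the same counts can probably be obtained without filtering the DataFrame once per window. The sketch below is one possible way, not the only one: it sorts the closes once and uses np.searchsorted to count how many fall inside each window. The function name count_closes_per_window, the window width of 2 and the threshold of 5 are assumptions chosen for illustration; the closes are the ten rows visible in the question.

```python
import numpy as np

def count_closes_per_window(closes, lows, width):
    """For each window start in `lows`, count closes inside [low, low + width]."""
    sorted_closes = np.sort(np.asarray(closes, dtype=float))
    lows = np.asarray(lows)
    # Number of closes strictly below each lower bound
    left = np.searchsorted(sorted_closes, lows, side="left")
    # Number of closes less than or equal to each upper bound
    right = np.searchsorted(sorted_closes, lows + width, side="right")
    return right - left

# The ten closes visible in the question's printout
closes = np.array([400.70, 402.45, 401.89, 402.48, 401.17,
                   402.46, 399.59, 401.21, 402.70, 402.71])

width = 2                  # assumed window width for this small sample
min_count = 5              # assumed threshold
lows = np.arange(398, 403) # window starts 398, 399, ..., 402

counts = count_closes_per_window(closes, lows, width)
hits = [(int(lo), int(lo + width), int(n))
        for lo, n in zip(lows, counts) if n >= min_count]

print(counts)
print(hits)
```

Like the loop, this counts closes anywhere in the series, regardless of when they happened.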

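You can find regions in numpy by: 1) making an integer array from the True/False values that mark the points inside the region; 2) subtracting adjacent points to find where the steps are (into and out of the region); 3) using np.nonzero to find the boundaries from step 2.

Here's an example (the green band in the last plot shows the region identified by just the two indices returned from nonzero):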

import matplotlib.pyplot as plt
import numpy as np

# make some data
dmin,dmax = 0.3,0.7
x = np.linspace(0,100,300)
data = 1 - 1/(1+np.exp(-(x-70)/2))

# do the three steps above:
region = ((data>dmin) & (data<dmax)).astype(int)  # mark region with 1s and 0s
boundaries = region[1:] - region[:-1]  # 1 marks a step "into" the region, -1 a step "out of" it (same as np.diff(region))
indices = np.nonzero(boundaries)   # find the indices of the boundary points

fig,axs = plt.subplots(3,1)
axs[0].plot(x,data)
axs[1].plot(x,region)
axs[2].plot(x[1:],boundaries)
axs[2].axvspan(x[indices[0][0]],x[indices[0][1]],facecolor='g',alpha=0.2)
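
The same three steps can be traced by hand on a small array. This is only a sketch with made-up values, and np.diff is used in place of the explicit subtraction:

```python
import numpy as np

# Made-up data, small enough to check by eye
data = np.array([0.1, 0.2, 0.5, 0.6, 0.4, 0.9, 0.8, 0.5, 0.1])
dmin, dmax = 0.3, 0.7

# Step 1: 1 where the point is inside (dmin, dmax), 0 elsewhere
region = ((data > dmin) & (data < dmax)).astype(int)

# Step 2: +1 where the data steps into the region, -1 where it steps out
steps = np.diff(region)

# Step 3: positions of the nonzero steps
edges = np.nonzero(steps)[0]

print(region)
print(steps)
print(edges)
```

An edge at index i means the change happens between data[i] and data[i + 1].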

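To find multiple regions that are longer than a certain length, iterate through the list of boundary indices to build up a list of boundary pairs. This is mostly a matter of bookkeeping and endpoint issues (e.g., what happens if the data starts inside a region).

Here is an example that does this. The two main changes are: 1) I split boundaries to produce the starts and stops indices; and 2) I calculate large_rois.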

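# NOTE: the new dmin value is missing from this line in the source; dmin from the previous example (0.3) is reused here as an assumption
dmax = 1000  # just look for being above a min: for multiple regions, make some data that oscillates and this is easier to visualize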
minL = 10

# make up  some data
x = np.linspace(0,98.5,600)  # 98.5 so data ends in a region of interest,which is a case I wanted to check for
data0 = 1-np.exp(-(x-50)**2/400.)
data = 0.5 + 0.5*np.sin((1+1*(data0+1))*x)

rois = ((data>dmin) & (data<dmax)).astype(int) # roi = "region of interest"
boundaries = rois[1:] - rois[:-1]
starts = list(np.nonzero(boundaries>0)[0])  # starting points of roi,and make a list for easy insertion
stops = list(np.nonzero(boundaries<0)[0])   # stopping points of roi,and make a list for easy appending

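if stops and (not starts or stops[0] < starts[0]):  # if data starts in a roi, fix it
    starts.insert(0, 0)

if starts and (not stops or starts[-1] > stops[-1]):  # if data stops in a roi, fix it
    stops.append(len(data) - 1)  # last valid index, so x[stop] below stays in bounds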

large_rois = [(start,stop) for (start,stop) in zip(starts,stops) if stop-start > minL]

print(large_rois)

fig,axs = plt.subplots(3,1)
axs[0].plot(x,data)
axs[1].plot(x,rois)
axs[2].plot(x[1:],boundaries)
for (start,stop) in large_rois:
    axs[2].axvspan(x[start],x[stop],facecolor='r',alpha=0.4)

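Also, note that I have a loop over a list here. When using pandas and numpy it is usually best to avoid such loops where possible, but in this case the loop does not run over all the data, only over the list of endpoints, which is much shorter than the original data.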
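
If you would rather avoid both the loop and the special cases for the endpoints, one option is to pad the mask with a False on each side before taking the difference, so a run that touches either end still produces a start and a stop. This is a sketch of that idea rather than part of the code above; the name find_runs and the sample mask are made up for illustration.

```python
import numpy as np

def find_runs(mask, min_len=1):
    """Return (start, stop) pairs, stop exclusive, for runs of True of at least min_len."""
    mask = np.asarray(mask, dtype=bool)
    # Pad with False on both sides so runs at the ends still get both edges
    padded = np.concatenate(([False], mask, [False])).astype(int)
    steps = np.diff(padded)
    starts = np.flatnonzero(steps == 1)   # index of the first True in each run
    stops = np.flatnonzero(steps == -1)   # index one past the last True
    keep = (stops - starts) >= min_len
    return list(zip(starts[keep].tolist(), stops[keep].tolist()))

# Made-up mask that both starts and ends inside a run
mask = [True, True, False, True, True, True, False, False, True]

all_runs = find_runs(mask)
long_runs = find_runs(mask, min_len=2)

print(all_runs)
print(long_runs)
```

Because stop is exclusive here, stop - start is the run length and mask[start:stop] selects exactly the run.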

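Finally, note that with any problem of finding regions in discrete data there is also the question of how to handle the boundaries, so if that matters for you, make sure it is handled the way you need.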
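
One concrete form of the boundary question is whether a value sitting exactly on an edge of the range counts as inside. The sketch below, on made-up closes, shows how the count changes with the choice of comparison; which one is right depends on your definition of "closed in the range".

```python
import numpy as np

# Made-up closes that sit exactly on the edges of a 350-360 band
closes = np.array([350.0, 355.0, 360.0])
low, high = 350, 360

strict = np.count_nonzero((closes > low) & (closes < high))       # edges excluded
inclusive = np.count_nonzero((closes >= low) & (closes <= high))  # edges included
half_open = np.count_nonzero((closes >= low) & (closes < high))   # [low, high)

print(strict, inclusive, half_open)
```

A half-open range is one way to make sure that a close on a shared edge is counted in only one of two adjacent, non-overlapping bands.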
