How to find common sequences between two lists in Java

Problem description

I'm trying to find the common sequences between two lists. I can do it when the lists contain only unique values. For example:

list one: [1,8,3,13,14,6,11]
list two: [8,9,10,11,12,13,14,15]

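As we can see, the sequence [13,14] is common to both lists. My algorithm is as follows: using the retainAll function I get the common values, which in this example are [8,11,13,14]. Since retainAll modifies list one, I create a copy of list one first. Then I look up the positions of these common values in their original lists (list one and list two). After that, I compute the position differences between consecutive common values, like this: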

       list1   list2   difList1     difList2
[8]     1      0     -1  (0-1)   -1  (0-1)
[11]    6      3     -5  (1-6)   -3  (0-3)
[13]    3      5      3  (6-3)   -2  (3-5)
[14]    4      6     -1  (3-4)   -1  (5-6)

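If the values in difList1 and difList2 are both -1, that value and the previous one are adjacent in both lists and therefore form a sequence. Since [14] satisfies the condition in this example, the sequence is [13][14].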

For this case, my code is:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CommonSequence {
    public static void main(String[] args) {
        List<Integer> list1 = new ArrayList<>(Arrays.asList(1, 8, 3, 13, 14, 6, 11));
        List<Integer> list2 = new ArrayList<>(Arrays.asList(8, 9, 10, 11, 12, 13, 14, 15));
        list1.retainAll(list2); // list1 now holds only the common elements
        List<Integer> ori_list1 = new ArrayList<>(Arrays.asList(1, 8, 3, 13, 14, 6, 11));
        List<Integer> difList1 = new ArrayList<>();
        List<Integer> diffList2 = new ArrayList<>();
        // The first common element has no previous common element, so put -1 at index 0.
        difList1.add(-1);
        diffList2.add(-1);
        System.out.println(list1); // common elements are [8, 13, 14, 11]

        for (int k = 1; k < list1.size(); k++) { // Let's say k = 2 ...
            int index1_1 = ori_list1.indexOf(list1.get(k));     // index of the value 14 in the original list -> 4
            int index1_2 = ori_list1.indexOf(list1.get(k - 1)); // index of the value 13 in the original list -> 3
            int diff_list1 = index1_2 - index1_1; // 3 - 4 = -1, which means they are consecutive
            difList1.add(diff_list1);
            int index2_1 = list2.indexOf(list1.get(k));     // same lookup in list2 -> 6
            int index2_2 = list2.indexOf(list1.get(k - 1)); // same lookup in list2 -> 5
            int diff_doc2 = index2_2 - index2_1; // 5 - 6 = -1
            diffList2.add(diff_doc2);
        }
        for (int y = 1; y < difList1.size(); y++) {
            // Both differences are -1 for the value 14, so 13 and 14 are adjacent in both lists.
            if (difList1.get(y) == -1 && diffList2.get(y) == -1) {
                System.out.println("The common sequence is:" + list1.get(y - 1) + " " + list1.get(y));
            }
        }
    }
}

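But I need a solution for the case with duplicate elements. Suppose we have a list like this:

list one: [1,8,10,15]

Now we have another common sequence, [8,10]. In the output I want to see both [13,14] and [8,10], but I only see [13,14]. That is because, when computing the indexes of 8 and 10, the program takes the indexes of the first 8 and the first 10. For list1, it takes index 1 for the value 8 and index 3 for the value 10. But I need to skip those, because I have already used them; I need indexes like 5 and 6, not 1 and 3.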

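I also don't know how to find sequences with more than two values: for example, not only [13,14] but also [13,14,15] or longer, if the values are consecutive. I know this is a bit tricky, but I need your help.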
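The indexOf behaviour described above can be illustrated with a short sketch. List.indexOf always returns the first occurrence, so a repeated value is only ever found at its first position. One possible workaround, shown here as an assumption rather than the asker's code, is to search from a given offset through subList. The list contents, the class name IndexOfDemo and the helper indexOfFrom are all made up for illustration, and List.of requires Java 9 or later.

```java
import java.util.List;

class IndexOfDemo {

    // Hypothetical helper: index of the first occurrence of value at or after
    // fromIndex, or -1 if there is none.
    static int indexOfFrom(List<Integer> list, Integer value, int fromIndex) {
        if (fromIndex >= list.size()) {
            return -1;
        }
        // subList is a view, so no elements are copied here.
        int offset = list.subList(fromIndex, list.size()).indexOf(value);
        return offset < 0 ? -1 : fromIndex + offset;
    }

    public static void main(String[] args) {
        // Made-up list with repeated values, for illustration only.
        List<Integer> list = List.of(1, 8, 3, 10, 13, 8, 10);
        System.out.println(list.indexOf(8));          // prints 1 (first occurrence)
        System.out.println(indexOfFrom(list, 8, 2));  // prints 5 (second occurrence)
        System.out.println(indexOfFrom(list, 10, 5)); // prints 6
    }
}
```

Remembering the last index used for each value and passing it as fromIndex would let the lookup move past positions that have already been consumed.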

Solution

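I'm not quite sure what you are trying to do, but if I were looking for common sequences, I would do it by creating sublists and comparing them: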

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

    public static Set<List<Integer>> findCommonSequence(List<Integer> source, List<Integer> target, int startLength) {
        Set<List<Integer>> sequences = new LinkedHashSet<>();

        // The algorithm works this way:
        // we prepare all possible sublists of the source list that are at least startLength long,
        // and then check each of those sublists against the target list to see if it contains any of them.

        // length runs from startLength to source.size(), to check all sublists of that length,
        // i.e. if startLength is 2 and source has 10 elements, it runs from 2 to 10 and so covers all sublist sizes
        for (int length = startLength; length <= source.size(); length++) {
            // startIndex moves from 0 to source.size() - length, so if length is 2 it generates sublists
            // with indexes 0,1; 1,2; 2,3 ... 8,9
            for (int startIndex = 0; startIndex + length <= source.size(); startIndex++) {
                // creates a lightweight sublist view that shares the data
                List<Integer> sublist = source.subList(startIndex, startIndex + length);
                // add all found subsequences to the set
                sequences.addAll(findSequenceIn(target, sublist));
            }
        }

        return sequences;
    }

    // Returns all subsequences that are inside the target list
    private static Set<List<Integer>> findSequenceIn(List<Integer> target, List<Integer> sublist) {
        Set<List<Integer>> subsequences = new LinkedHashSet<>();

        // same process as in the first method, but with the length fixed to the length of sublist
        for (int i = 0; i <= target.size() - sublist.size(); i++) {
            // create another sublist view, this time from target (again, sharing data)
            List<Integer> testSublist = target.subList(i, i + sublist.size());

            // compare the two sublists; if they are equal, the target list contains the sublist from the source list
            if (testSublist.equals(sublist)) {
                // add a copy to the set
                subsequences.add(new ArrayList<>(sublist));
            }
        }

        return subsequences;
    }

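You could then optimize the algorithm by passing only indexes instead of sublists and doing the comparison manually. The complexity of this algorithm should be somewhere between O(n^3) and O(n^4). It is probably O(n^4), because we create up to n^2 sublists and compare each of them, at a cost of up to n operations, against up to n sublists of list 2. It may behave more like n^3, though, because the comparisons are short; I don't know mathematically how close it is to n^3 or n^4.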
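The estimate above can be narrowed with a rough count. As a sketch, assume both lists have n elements, startLength is 1, and comparing two sublists of length l costs at most l element checks. Summing over all sublist lengths then gives:

```latex
\begin{aligned}
T_{\text{worst}}(n) &\le \sum_{l=1}^{n}
  \underbrace{(n-l+1)}_{\text{sublists of source}}\,
  \underbrace{(n-l+1)}_{\text{positions in target}}\,
  \underbrace{l}_{\text{one comparison}}
  = \frac{n(n+1)^2(n+2)}{12} = \Theta(n^4), \\
T_{\text{early exit}}(n) &\approx \sum_{l=1}^{n} (n-l+1)^2
  = \frac{n(n+1)(2n+1)}{6} = \Theta(n^3).
\end{aligned}
```

Under these assumptions the worst case (for example, lists filled with one repeated value, where every comparison runs to the end) grows like n^4, while inputs whose comparisons usually fail on the first element behave closer to n^3.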

Of course, copying the sublists adds another factor of n, but you can optimize that away.
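The index-based comparison suggested above might look like the following sketch. It compares elements in place and copies a sublist only when a match is recorded. The names CommonRuns and commonRuns are hypothetical. Note one deliberate difference from the sublist version: this sketch reports only maximal runs, so a common run of three elements is returned once rather than together with its two-element pieces. List.of requires Java 9 or later.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class CommonRuns {

    // Hypothetical helper: collects every maximal run of at least minLength
    // consecutive elements that appears in both lists.
    static Set<List<Integer>> commonRuns(List<Integer> a, List<Integer> b, int minLength) {
        Set<List<Integer>> runs = new LinkedHashSet<>();
        for (int i = 0; i < a.size(); i++) {
            for (int j = 0; j < b.size(); j++) {
                // Skip pairs that only continue a run started one step earlier,
                // so that each run is reported once, at its full length.
                if (i > 0 && j > 0 && a.get(i - 1).equals(b.get(j - 1))) {
                    continue;
                }
                // Extend the match by index; no intermediate sublists are built.
                int len = 0;
                while (i + len < a.size() && j + len < b.size()
                        && a.get(i + len).equals(b.get(j + len))) {
                    len++;
                }
                if (len >= minLength) {
                    runs.add(new ArrayList<>(a.subList(i, i + len)));
                }
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        List<Integer> a = List.of(1, 8, 3, 13, 14, 6, 11);
        List<Integer> b = List.of(8, 9, 10, 11, 12, 13, 14, 15);
        System.out.println(commonRuns(a, b, 2)); // prints [[13, 14]]
    }
}
```

Because every start position in both lists is tried, repeated values are handled without any bookkeeping about which occurrence has already been used.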

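Treat the integer values as code points:
[1] convert list2 to str2
[2] convert list1 to str1, dropping from the left all int values that are not in list2
[3] slide str1 over str2 and record the sequences where str1 lines up with str2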

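import java.util.ArrayList;
import java.util.Arrays;
import java.util.stream.IntStream;

// dropWhile/takeWhile require Java 9+; the statements below belong in a method body
ArrayList<int[]> results = new ArrayList<>();

int[] arr2 = new int[] { 1,8,3,10,13,14,6,11 };
String str2 = new String( arr2, 0, arr2.length );  //[1]
int[] tmp = new int[] { 8,9,11,12,15 };
int[] arr1 = IntStream.of( tmp ).dropWhile(
    c -> str2.indexOf( c ) < 0 ).toArray();        //[2]
String str1 = new String( arr1, 0, arr1.length );
for( int i = str1.length() - 2; i >= 0; i-- ) {    //[3]
  for( int j = 0; j < str2.length() - 1; j++ ) {
    int[] idx2 = new int[] { j };  // mutable index into str2, usable from the lambda
    // take code points of str1 while they match str2, stopping at the end of str2
    int[] rslt = str1.substring( i ).codePoints().takeWhile(
        c -> idx2[0] < str2.length() && c == (int)str2.charAt( idx2[0]++ ) ).toArray();
    if( rslt.length >= 2 ) {
      results.add( rslt );
    }
  }
}

results.forEach( a -> System.out.println( Arrays.toString( a ) ) );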

This yields: [13,14], [10,14], [8,10]
