Java: get/skip all comment lines in a Python file

Problem description

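How can I programmatically parse a Python file to find all comment lines, both those wrapped in triple quotes (''') and regular # comments, so that I can skip them and speed up parsing?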

Solution

This was written in Java 8 to parse Python 3, but it should also work with other Java and Python versions (possibly with some adjustments).

--- JAVA CODE ---

At the top of the file:

import java.io.*;
import java.util.*;

In your main method (it does not have to go in main, but if this is a standalone Java file, i.e. no other .java file calls it, it will need a main method):

String PathToPythonFileAsString="C:\\Users\\myUsername\\thisIsAnExamplePath\\pythonfile.py";
File pyFile = new File(PathToPythonFileAsString);

List<Integer> LineNumsThatAreInTripleQuotes = getLinesInTripleQuotes(pyFile);

Scanner scan = new Scanner(pyFile);
int CurrentLineNumber = 0;
while (scan.hasNextLine()) {
    String lineInCurrentPythonFile = scan.nextLine();
    CurrentLineNumber = CurrentLineNumber + 1;
    
    //skip print calls right away to speed up parsing
    //(this check also covers lines that contain "print(")
    if (lineInCurrentPythonFile.contains("print") && lineInCurrentPythonFile.contains("(")) {
        continue;
    }

    //skip these lines right away to speed up parsing
    if (lineInCurrentPythonFile.contains("import ")) {
        continue;
    }

    //skip these lines right away to speed up parsing
    if (LineNumsThatAreInTripleQuotes.contains(CurrentLineNumber)) {
        continue;
    }


    //skip or trim '#' comments
    //note: indexOf("#") also matches a '#' that sits inside a string literal
    if (lineInCurrentPythonFile.contains("#")) {
        if (lineInCurrentPythonFile.trim().startsWith("#")) {
            //the whole line is a # comment
            continue;
        }
        //the line contains code followed by a comment: keep only the code
        int PoundIdx = lineInCurrentPythonFile.indexOf("#");
        lineInCurrentPythonFile = lineInCurrentPythonFile.substring(0, PoundIdx);
    }



    //now that comments and triple-quoted lines have been skipped, do stuff with the remaining lines



}
scan.close();
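
One caveat with the loop above: indexOf("#") cuts the line at the first '#', even when that character sits inside a string literal such as "http://a/#frag". Below is a minimal sketch of a quote-aware alternative. The class name InlineCommentStripper and the method stripInlineComment are made up for this illustration and are not part of the original answer. It assumes the line is not inside a multi-line triple-quoted string.

```java
class InlineCommentStripper {

    // Returns the part of a Python source line that precedes the first '#'
    // that is NOT inside a single- or double-quoted string literal.
    // Assumption: the line is not part of a multi-line triple-quoted string.
    static String stripInlineComment(String line) {
        char quote = 0; // 0 means "currently outside any string literal"
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (quote != 0) {
                if (c == '\\') {
                    i++;        // skip the character after a backslash
                } else if (c == quote) {
                    quote = 0;  // closing quote found
                }
            } else if (c == '\'' || c == '"') {
                quote = c;      // opening quote found
            } else if (c == '#') {
                return line.substring(0, i); // a real comment starts here
            }
        }
        return line; // no comment on this line
    }

    public static void main(String[] args) {
        System.out.println(stripInlineComment("x = 1  # set x"));
        System.out.println(stripInlineComment("url = \"http://a/#frag\"  # note"));
    }
}
```

In the loop above, the two lines that compute PoundIdx and call substring could be replaced by a single call to this helper, if the extra per-character scan is acceptable for your files.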

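The getLinesInTripleQuotes method returns the numbers of the lines that are enclosed in triple quotes. If you would rather have it return the lines themselves instead of the line numbers, change the occurrences of List<Integer> below to List<String>, and change LinesThatAreInTripleQuotes.add to add CurrentLine instead of lineNum. I find line numbers more reliable, because a file sometimes contains duplicate lines.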

public static List<Integer> getLinesInTripleQuotes (File pyFile) throws FileNotFoundException {

    List<Integer> LinesThatAreInTripleQuotes = new ArrayList<>();
    boolean insideTripleQuotes = false;

    //try-with-resources closes the Scanner even if an exception is thrown
    try (Scanner scan = new Scanner(pyFile)) {
        int lineNum = 0;
        while (scan.hasNextLine()) {
            String CurrentLine = scan.nextLine();
            lineNum = lineNum + 1;

            //count the ''' delimiters on this line
            int delimiterCount = 0;
            int idx = CurrentLine.indexOf("'''");
            while (idx >= 0) {
                delimiterCount = delimiterCount + 1;
                idx = CurrentLine.indexOf("'''", idx + 3);
            }

            //the line belongs to a block if we were already inside one,
            //or if it opens one (this also covers '''one-line blocks''')
            if (insideTripleQuotes || delimiterCount > 0) {
                LinesThatAreInTripleQuotes.add(lineNum);
            }

            //an odd number of delimiters switches between inside and outside
            if (delimiterCount % 2 == 1) {
                insideTripleQuotes = !insideTripleQuotes;
            }
        }
    }
    return LinesThatAreInTripleQuotes;
}
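
The method above only looks for '''. Python also accepts """ as a triple-quote delimiter, so here is a rough sketch of a variant that tracks both. It works on a List<String> of lines instead of a File so it can be tried without touching the disk. The class name TripleQuoteScanner and the method linesInTripleQuotes are hypothetical names for this illustration. It assumes the delimiters never appear inside # comments or ordinary strings, and, like the method above, it treats every triple-quoted string as a comment, even one assigned to a variable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TripleQuoteScanner {

    // Returns the 1-based numbers of all lines that lie in a triple-quoted
    // block, including the lines holding the opening and closing delimiters.
    static List<Integer> linesInTripleQuotes(List<String> lines) {
        List<Integer> result = new ArrayList<>();
        String open = null; // delimiter of the block we are inside, or null
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            boolean touched = (open != null); // already inside a block?
            int pos = 0;
            while (pos < line.length()) {
                if (open == null) {
                    // look for whichever delimiter comes first on the line
                    int a = line.indexOf("'''", pos);
                    int b = line.indexOf("\"\"\"", pos);
                    if (a < 0 && b < 0) break;
                    boolean single = b < 0 || (a >= 0 && a < b);
                    open = single ? "'''" : "\"\"\"";
                    pos = (single ? a : b) + 3;
                    touched = true;
                } else {
                    // inside a block: only the same delimiter can close it
                    int end = line.indexOf(open, pos);
                    if (end < 0) break;
                    open = null;
                    pos = end + 3;
                }
            }
            if (touched) result.add(i + 1);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "x = 1", "'''one-liner'''", "y = 2",
            "\"\"\"", "doc text", "\"\"\"", "z = 3");
        System.out.println(linesInTripleQuotes(lines)); // prints [2, 4, 5, 6]
    }
}
```

To use it with a file, the lines could be read first, for example with java.nio.file.Files.readAllLines, and the resulting list passed in.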