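Problem description
How can I programmatically parse a Python file and find all comment lines, both the lines enclosed in triple quotes (''') and ordinary # comment lines, so that I can skip them and speed up parsing?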
Solution
This is written in Java 8 and parses Python 3, but it should also work with other Java and Python versions (possibly with some adjustments).
--- JAVA CODE ---
At the top of the file:
import java.io.*;
import java.util.*;
In your main method (the code does not have to go in a main method, but if this is a standalone Java file, i.e. no other .java file calls it, it will need one):
String PathToPythonFileAsString = "C:\\Users\\myUsername\\thisIsAnExamplePath\\pythonfile.py";
File pyFile = new File(PathToPythonFileAsString);
//the enclosing method must declare "throws FileNotFoundException" (or catch it)
List<Integer> LineNumsThatAreInTripleQuotes = getLinesInTripleQuotes(pyFile);
Scanner scan = new Scanner(pyFile);
int CurrentLineNumber = 0;
//hasNextLine() pairs with nextLine(); hasNext() would stop early at trailing blank lines
while (scan.hasNextLine()) {
    String lineInCurrentPythonFile = scan.nextLine();
    CurrentLineNumber = CurrentLineNumber + 1;
    //skip print calls right away to speed up parsing (this check also covers "print(")
    if (lineInCurrentPythonFile.contains("print") && lineInCurrentPythonFile.contains("(")) {
        continue;
    }
    //skip import lines right away to speed up parsing
    if (lineInCurrentPythonFile.contains("import ")) {
        continue;
    }
    //skip lines that are inside triple quotes
    if (LineNumsThatAreInTripleQuotes.contains(CurrentLineNumber)) {
        continue;
    }
    //handle # comments
    if (lineInCurrentPythonFile.contains("#")) {
        if (lineInCurrentPythonFile.trim().startsWith("#")) {
            //the whole line is a # comment
            continue;
        }
        //the line CONTAINS a comment, but PART of the line is NOT a comment:
        //keep only the part before the #
        //(note: this also cuts at a # that appears inside a string literal)
        int PoundIdx = lineInCurrentPythonFile.indexOf("#");
        lineInCurrentPythonFile = lineInCurrentPythonFile.substring(0, PoundIdx);
    }
    //now that comments and lines in triple quotes have been skipped, do stuff with the actual lines
}
scan.close();
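The loop above cuts a line at the first # it finds, which also truncates code such as s = "a#b". A minimal sketch of a quote-aware alternative follows. The class and method names (CommentStripper, stripTrailingComment) are hypothetical and not part of the original answer, and the helper assumes ordinary single-line string literals only: it does not understand triple-quoted strings or raw strings.

```java
final class CommentStripper {
    private CommentStripper() {}

    /**
     * Returns the line with a trailing # comment removed, ignoring any #
     * that sits inside a single- or double-quoted string on the same line.
     * Assumption: strings do not span lines and are not raw strings.
     */
    static String stripTrailingComment(String line) {
        char quote = 0; // 0 means "not inside a string literal"
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (quote != 0) {
                if (c == '\\') {
                    i++; // skip the escaped character
                } else if (c == quote) {
                    quote = 0; // closing quote found
                }
            } else if (c == '\'' || c == '"') {
                quote = c; // opening quote found
            } else if (c == '#') {
                return line.substring(0, i); // comment starts here
            }
        }
        return line; // no comment on this line
    }

    public static void main(String[] args) {
        System.out.println(stripTrailingComment("x = 1  # set x"));
        System.out.println(stripTrailingComment("s = \"a#b\"  # real comment"));
    }
}
```

In the loop, the indexOf/substring pair could be replaced by a call to this helper. Whitespace before the # is kept, so trim the result if that matters.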
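The getLinesInTripleQuotes method (returns the line numbers of the lines enclosed in triple quotes). If you want it to return the lines themselves instead of the line numbers, change every List<Integer> below to List<String>, and change LinesThatAreInTripleQuotes.add to add CurrentLine instead of lineNum. I found line numbers more reliable, because a file sometimes contains duplicate lines.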
public static List<Integer> getLinesInTripleQuotes(File pyFile) throws FileNotFoundException {
    List<Integer> LinesThatAreInTripleQuotes = new ArrayList<>();
    //true while we are between an opening ''' and its closing '''
    boolean insideTripleQuotes = false;
    //try-with-resources closes the Scanner when we are done
    try (Scanner scan = new Scanner(pyFile)) {
        int lineNum = 0;
        while (scan.hasNextLine()) {
            String CurrentLine = scan.nextLine();
            lineNum = lineNum + 1;
            boolean startedInside = insideTripleQuotes;
            boolean sawDelimiter = false;
            //every ''' on the line toggles the state, so '''text''' on one line opens and closes
            int idx = CurrentLine.indexOf("'''");
            while (idx >= 0) {
                sawDelimiter = true;
                insideTripleQuotes = !insideTripleQuotes;
                idx = CurrentLine.indexOf("'''", idx + 3);
            }
            //record the opening line, the lines in between, and the closing line
            if (startedInside || sawDelimiter) {
                LinesThatAreInTripleQuotes.add(lineNum);
            }
        }
    }
    //note: only ''' is recognized here, not """
    return LinesThatAreInTripleQuotes;
}
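Two limitations of the method above are that it only recognizes ''' (not """), and that the caller's List.contains lookup scans the whole list for every line. The following is a rough sketch of a variant that handles both delimiters and returns a Set for faster lookups. The class name TripleQuoteLines and the method find are hypothetical. It takes the lines as a List<String> rather than a File so it can be tried without touching disk, and it still treats every triple-quoted string as a comment. It can also be fooled by a triple quote that appears inside a # comment or an ordinary string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TripleQuoteLines {
    private TripleQuoteLines() {}

    /** Returns the 1-based numbers of all lines that open, close, or lie inside a triple-quoted block. */
    static Set<Integer> find(List<String> lines) {
        Set<Integer> result = new HashSet<>();
        String open = null; // delimiter of the block we are inside, or null if outside
        for (int n = 0; n < lines.size(); n++) {
            String line = lines.get(n);
            boolean touched = (open != null); // line starts inside a block
            int i = 0;
            while (i < line.length()) {
                if (open == null) {
                    // look for whichever opening delimiter comes first
                    int sq = line.indexOf("'''", i);
                    int dq = line.indexOf("\"\"\"", i);
                    if (sq < 0 && dq < 0) {
                        break;
                    }
                    boolean single = dq < 0 || (sq >= 0 && sq < dq);
                    open = single ? "'''" : "\"\"\"";
                    i = (single ? sq : dq) + 3;
                    touched = true;
                } else {
                    // look for the matching closing delimiter
                    int close = line.indexOf(open, i);
                    if (close < 0) {
                        break;
                    }
                    open = null;
                    i = close + 3;
                }
            }
            if (touched) {
                result.add(n + 1);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>(Arrays.asList(
            "x = 1", "'''", "doc text", "'''", "y = 2", "\"\"\"one line\"\"\"", "z = 3"));
        System.out.println(find(lines));
    }
}
```

With a real file, the lines could be obtained with java.nio.file.Files.readAllLines and the returned set used in place of LineNumsThatAreInTripleQuotes in the main loop.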