Problem description
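I have been trying to find a way to convert PDF documents to text.
The solution below works best so far, but it does not work for every PDF.
All of the files have the following in common: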
PDF-1.4
5 0 obj
Length 6 0 R/Filter /FlateDecode
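I need to do this on the server side and cannot install modules. I have no trouble formatting the string that the code outputs. My brain hurts from all the searching.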
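One way to narrow down why a particular file fails is to list the /Filter entries declared in front of each stream, since gzuncompress() can only help with FlateDecode data. The following is a minimal sketch, assuming each stream dictionary sits directly before the `stream` keyword and contains no nested `<< >>` pairs; `listStreamFilters` is a hypothetical helper name, not part of the code in the question.

```php
<?php
// Hypothetical helper: collect the /Filter names declared in the
// dictionary that immediately precedes each "stream" keyword.
function listStreamFilters(string $content): array
{
    $filters = [];
    // Match a flat dictionary "<< ... >>" followed by "stream".
    // Assumption: the dictionary holds no nested "<< >>" pairs.
    if (preg_match_all('/<<((?:(?!<<|>>).)*)>>\s*stream\b/s', $content, $dicts)) {
        foreach ($dicts[1] as $dict) {
            // /Filter is either a single name or an array of names.
            if (preg_match('/\/Filter\s*(\[[^\]]*\]|\/[A-Za-z0-9]+)/', $dict, $m)) {
                preg_match_all('/\/([A-Za-z0-9]+)/', $m[1], $names);
                $filters[] = $names[1];
            } else {
                // No /Filter entry: the stream is stored without a filter.
                $filters[] = [];
            }
        }
    }
    return $filters;
}
```

Calling `print_r(listStreamFilters(file_get_contents($sourcefile)));` on a failing file would show whether any stream uses something other than FlateDecode. On very large files the pattern may hit PCRE limits, in which case the function simply returns an empty list.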
function pdf2string($sourcefile) {
    // Read the whole file as binary data.
    $content = file_get_contents($sourcefile);
    if ($content === false) {
        return '';
    }
    $searchstart = 'stream';
    $searchend = 'endstream';
    $pdfText = '';
    $startpos = 0;
    while (($pos = strpos($content, $searchstart, $startpos)) !== false) {
        // The stream data starts after the keyword and its end-of-line marker.
        $pos += strlen($searchstart);
        if (substr($content, $pos, 2) === "\r\n") {
            $pos += 2;
        } elseif (substr($content, $pos, 1) === "\n") {
            $pos++;
        }
        $pos2 = strpos($content, $searchend, $pos);
        if ($pos2 === false) {
            break;
        }
        // Exclude the end-of-line marker that precedes "endstream".
        $end = $pos2;
        if (substr($content, $end - 2, 2) === "\r\n") {
            $end -= 2;
        } elseif (substr($content, $end - 1, 1) === "\n") {
            $end--;
        }
        $textsection = substr($content, $pos, max(0, $end - $pos));
        // gzuncompress() returns false for data it cannot inflate;
        // pdfExtractText() then contributes an empty string.
        $data = @gzuncompress($textsection);
        $pdfText .= "<br>" . pdfExtractText($data);
        $startpos = $pos2 + strlen($searchend);
    }
    return preg_replace('/(\s)+/', ' ', $pdfText);
}
function pdfExtractText($psData) {
    if (!is_string($psData)) {
        return '';
    }
    $text = '';
    // Handle brackets in the text stream that could be mistaken for
    // the end of a text field. I'm sure you can do this as part of the
    // regular expression, but my skills aren't good enough yet.
    $psData = str_replace('\)', '##ENDBRACKET##', $psData);
    $psData = str_replace('\]', '##ENDSBRACKET##', $psData);
    preg_match_all(
        '/(T[wdcm*])[\s]*(\[([^\]]*)\]|\(([^\)]*)\))[\s]*Tj/si',
        $psData,
        $matches
    );
    for ($i = 0; $i < count($matches[0]); $i++) {
        if ($matches[3][$i] != '') {
            // Run another match over the contents.
            preg_match_all('/\(([^)]*)\)/si', $matches[3][$i], $subMatches);
            foreach ($subMatches[1] as $subMatch) {
                $text .= $subMatch;
            }
        } elseif ($matches[4][$i] != '') {
            $text .= ($matches[1][$i] == 'Tc' ? ' ' : '') . $matches[4][$i];
        }
    }
    // Translate special characters and put back brackets.
    $trans = array(
        '...' => '…',
        '\205' => '…',
        '\221' => chr(145),
        '\222' => chr(146),
        '\223' => chr(147),
        '\224' => chr(148),
        '\226' => '-',
        '\267' => '•',
        '\(' => '(',
        '\[' => '[',
        '##ENDBRACKET##' => ')',
        '##ENDSBRACKET##' => ']',
        chr(133) => '-',
        chr(141) => chr(147),
        chr(142) => chr(148),
        chr(143) => chr(145),
        chr(144) => chr(146),
    );
    $text = strtr($text, $trans);
    return $text;
}
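As the comment in pdfExtractText() suspects, the placeholder substitution for escaped brackets can probably be folded into the regular expression itself. The sketch below is one possible way to do that, assuming literal strings without nested unescaped parentheses; `extractPdfStrings` is a hypothetical name and it is not a drop-in replacement for the function above, because it returns every literal string it finds without looking at the surrounding text operators.

```php
<?php
// Hypothetical helper: return all "( ... )" literal strings in a
// decompressed content stream, with backslash escapes decoded.
function extractPdfStrings(string $psData): array
{
    // "(" + any run of plain characters or backslash escapes + ")".
    // Assumption: no nested, unescaped parentheses inside a string.
    preg_match_all('/\(((?:[^()\\\\]|\\\\.)*)\)/s', $psData, $matches);

    $named = ['n' => "\n", 'r' => "\r", 't' => "\t", 'b' => "\x08", 'f' => "\f"];
    $strings = [];
    foreach ($matches[1] as $raw) {
        // Decode each escape in a single pass so "\\" is not re-read.
        $strings[] = preg_replace_callback(
            '/\\\\(?:([0-7]{1,3})|(.))/s',
            function ($m) use ($named) {
                if ($m[1] !== '') {
                    // Octal escape such as \222.
                    return chr(octdec($m[1]) & 0xFF);
                }
                // Named escape, or else keep the escaped character itself.
                return $named[$m[2]] ?? $m[2];
            },
            $raw
        );
    }
    return $strings;
}
```

For example, `extractPdfStrings('(lo \(x\)) Tj')` yields a single string with the parentheses restored, so no `##ENDBRACKET##` markers are needed.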
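Solution
Check whether "pdftotext" is installed on your server (the usage text may go to stderr, so it is redirected to stdout here):
echo shell_exec('pdftotext --help 2>&1');
If it is, use it to convert the PDF to text easily.
If it is not, try downloading its source code to see how it works (pdftotext is open source).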
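If pdftotext turns out to be available, the conversion itself could look roughly like the sketch below. It assumes shell_exec() is enabled on the host and that pdftotext is on the PATH; `buildPdftotextCommand` and `pdftotextToString` are hypothetical helper names. Passing `-` as the output file asks pdftotext to write the text to stdout.

```php
<?php
// Hypothetical helper: build the command line for one PDF file.
// escapeshellarg() protects against spaces and shell metacharacters.
function buildPdftotextCommand(string $sourcefile): string
{
    // "-" as the text-file argument sends the result to stdout.
    return 'pdftotext ' . escapeshellarg($sourcefile) . ' -';
}

// Hypothetical helper: run pdftotext and return its output,
// or null when the command produced nothing or could not be run.
function pdftotextToString(string $sourcefile): ?string
{
    $output = shell_exec(buildPdftotextCommand($sourcefile));
    return is_string($output) ? $output : null;
}
```

A call such as `$text = pdftotextToString('/path/to/file.pdf');` would then take the place of pdf2string() for files the hand-written parser cannot handle.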