PhantomJS does not save the page as HTML for some websites, but works for others

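Problem description

I am using a GitHub project to download web pages as HTML and save them as PDF.

I am calling a JavaScript file that uses PhantomJS to do this. Other pages work fine, but my target URL on Atlassian does not. I get a blank PDF with no data, and when I save the page as HTML, it is a blank web page.

Does anyone know why this happens or how to fix it?

The target URL is https://blackdogs.atlassian.net/wiki/spaces/DOR/overview/

The repo link is https://github.com/morninj/web2pdf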

screenshot.js

var page = require('webpage').create();
var system = require('system');
var fs = require('fs');

var address = system.args[1]; // URL to load
var output = system.args[2];  // Output file, e.g. archive.pdf

page.viewportSize = { width: 1280, height: 702 }; // Default on my 13" MacBook
page.customHeaders = {
    "Connection": "keep-alive"
};

page.open(address, function (status) {
    if (status !== 'success') {
        console.log('Unable to load the address!');
        // Dump whatever was received so the failure can be inspected
        fs.write('1.html', page.content, 'w');
        phantom.exit(1);
    } else {
        // Give client-side scripts 10 seconds to run before capturing
        window.setTimeout(function () {
            page.render(output);
            // Save the rendered DOM next to the PDF for debugging
            fs.write('1.html', page.content, 'w');
            phantom.exit();
        }, 10000);
    }
});
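
One way to narrow down the blank output is to look at what the script saved to 1.html. If the file contains script tags but almost no visible text, the page is probably a JavaScript shell that was never filled in; PhantomJS is no longer maintained and its older WebKit engine may fail to run modern client-side code. Whether that is what happens on the Atlassian page is an assumption, not something the question confirms. The sketch below uses only the Python standard library; `VisibleTextCounter` and `summarize_html` are hypothetical helper names, not part of web2pdf.

```python
from html.parser import HTMLParser

# Tags whose text content is not rendered in the page body
SKIPPED_TAGS = ("script", "style", "title")


class VisibleTextCounter(HTMLParser):
    """Counts visible text characters and <script> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.script_tags = 0
        self.visible_chars = 0
        self._skip_depth = 0  # > 0 while inside one of SKIPPED_TAGS

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text that sits outside script/style/title elements
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def summarize_html(html):
    """Return (visible_chars, script_tags) for an HTML string."""
    counter = VisibleTextCounter()
    counter.feed(html)
    counter.close()
    return counter.visible_chars, counter.script_tags
```

To use it, read 1.html into a string and pass it to `summarize_html`. A result with zero visible characters and one or more script tags points at client-side rendering that did not complete, while a completely empty file points at a network or TLS failure instead.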

web2pdf.py

#!/usr/bin/env python
from subprocess import call
import argparse
import sys
import os
import time
parser = argparse.ArgumentParser()
parser.add_argument('-u','--url',help='The URL of a single page ' \
    + 'to download.')
parser.add_argument('-f','--filename',help='The name of a file containing ' \
    + 'multiple URLs to download. Put each URL on a new line.')
parser.add_argument('-o','--output',help='If you\'re archiving just one ' \
    + 'page,this is the name of the output file. The default is ' \
    + 'archive.pdf.')
parser.add_argument('-d','--directory',help='The name of the directory ' \
    + 'to store multiple archives. The default is "archives."')
args = parser.parse_args()

def web2pdf():
    if args.url and not args.filename:
        # Save a single URL
        output_filename = 'archive.pdf'
        if args.output: output_filename = args.output
        time.sleep(4)
        make_screenshot(args.url,output_filename)
    elif args.filename and not args.url:
        # Save multiple URLs
        # Create the archives directory
        archives_directory = 'archives'
        if args.directory: archives_directory = args.directory
        call(['mkdir',archives_directory])
        # Process each line in the input file
        with open(args.filename) as f:
            counter = 0
            for line in f:
                print ('Archiving %s...' % line.strip())
                # Generate filenames: 01.pdf,02.pdf,...,99.pdf
                counter = counter + 1
                output_filename = str(counter)
                if counter < 10: output_filename = '0' + output_filename
                output_filename = archives_directory + '/' \
                    + output_filename + '.pdf'
                make_screenshot(line.strip(),output_filename)
    else:
        # No URL or list of URLs provided
        print ('Please give either a URL or a filename containing a list of URLs.')
        sys.exit()
    print ('Done.')

def make_screenshot(url,filename):
    # Call PhantomJS and suppress its output
    with open(os.devnull,"w") as fnull: 
        call(
            ['phantomjs',os.path.dirname(os.path.abspath(__file__)) + '/screenshot.js',url,filename],stdout=fnull,stde
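
Because make_screenshot sends PhantomJS's output to os.devnull, any message printed by screenshot.js (including 'Unable to load the address!') is lost. A minimal sketch of a variant that returns the output instead is shown here. It also passes `--ignore-ssl-errors=true` and `--ssl-protocol=any`, which are existing PhantomJS command-line options; that a TLS problem is the cause for this particular site is only an assumption. `build_phantomjs_command` and `make_screenshot_verbose` are hypothetical names, not functions from web2pdf.

```python
import subprocess


def build_phantomjs_command(url, filename, script_path, relax_ssl=True):
    """Build the PhantomJS argument list; options must come before the script."""
    command = ['phantomjs']
    if relax_ssl:
        # Real PhantomJS CLI options; whether they help here is an assumption
        command += ['--ignore-ssl-errors=true', '--ssl-protocol=any']
    command += [script_path, url, filename]
    return command


def make_screenshot_verbose(url, filename, script_path):
    """Run PhantomJS and return (exit_code, output) instead of discarding output."""
    result = subprocess.run(
        build_phantomjs_command(url, filename, script_path),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the captured output
        universal_newlines=True,   # decode bytes to str
    )
    return result.returncode, result.stdout
```

Printing the returned output for the failing URL shows whether page.open reported a failed load or whether the load succeeded and the page simply rendered blank.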


javascript phantomjs python web-scraping