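How to get the full HTML code from a website using jsoup or any other way

Problem description

I am trying to get the HTML code from a website. If the site's code is small, like this one (https://abdelftahzowail.github.io/WriteUpsideDown/), I get the full code, but if it is large, like this one ({{ 3}}), I do not get the full code.

I tried Jsoup and HttpURLConnection, but neither gave me the full code.

Here is my code: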

        // Network calls must run off the main thread on Android.
        Thread thread = new Thread(() -> {
            try {
                Document doc = Jsoup.connect(editText.getText().toString())
                        .header("Accept-Encoding", "gzip,deflate")
                        .userAgent("Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/30.0.1599.69 Safari/537.36")
                        .maxBodySize(0) // 0 = no limit on the response body size
                        .timeout(0)     // 0 = no timeout
                        .get();
                Log.i("IMPORTANT !!!!", "doc ( " + editText.getText().toString() + " )\n" + doc);
            } catch (Exception e) {
                Log.i("IMPORTANT !!!!", "error : " + e);
            }
        });
        thread.start();

Here is the code I get from this website (https://www.pixel4k.com/page/1?s=deadpool):

    <!doctype html>
    <html class="no-js" lang="en-US" prefix="og: http://ogp.me/ns#">
     <head>
      <meta charset="UTF-8">
      <title>You searched for deadpool - 4k Wallpapers,Hd Wallpapers,Desktop Wallpapers,Free Backgrounds Download,Widescreen Wallpapers</title>
      <link rel="icon" href="https://www.pixel4k.com/wp-content/uploads/2018/09/favicon.ico" type="image/x-icon">
      <link rel="apple-touch-icon" href="apple-touch-icon.png">
      <meta name="viewport" content="width=device-width,initial-scale=1.0">
      <meta name="apple-mobile-web-app-capable" content="yes">
      <meta name="apple-mobile-web-app-status-bar-style" content="black">
      <link rel="stylesheet" type="text/css" media="all" href="https://www.pixel4k.com/wp-content/themes/pxxx/style.css">
      <link rel="pingback" href="https://www.pixel4k.com/xmlrpc.PHP">
      <meta name="google-site-verification" content="xHAo1q6wJG7bz-iw00VylrwaMabFjK_xSyU1jakgwaQ">
      <meta name="wot-verification" content="317f71c46e1fb6060ce1">
      <script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js" type="f8f50ad6803275492fa5ce1d-text/javascript"></script>
      <script type="f8f50ad6803275492fa5ce1d-text/javascript">(adsbygoogle=window.adsbygoogle||[]).push({google_ad_client:"ca-pub-2555268506534283",enable_page_level_ads:true});</script> <!--[if lt IE 9]>
        <script src="https://html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
      <meta name="robots" content="noindex,follow">
      <link rel="next" href="https://www.pixel4k.com/search/deadpool/page/2">
      <meta property="og:locale" content="en_US">
      <meta property="og:type" content="object">
      <meta property="og:title" content="You searched for deadpool - 4k Wallpapers,Widescreen Wallpapers">
      <meta property="og:url" content="https://www.pixel4k.com/search/deadpool">
      <meta property="og:site_name" content="4k Wallpapers,Widescreen Wallpapers">
      <meta name="twitter:card" content="summary_large_image">
      <meta name="twitter:title" content="You searched for deadpool - 4k Wallpapers,Widescreen Wallpapers">
      <script type="application/ld+json">{"@context":"https:\/\/schema.org","@type":"Person","url":"https:\/\/www.pixel4k.com\/","sameAs":[],"@id":"#person","name":"Mika"}</script>
      <link rel="dns-prefetch" href="//ajax.googleapis.com">
      <link rel="dns-prefetch" href="//www.pixel4k.com">
      <link rel="alternate" type="application/rss+xml" title="4k Wallpapers,Widescreen Wallpapers » Feed" href="https://www.pixel4k.com/Feed">
      <link rel="alternate" type="application/rss+xml" title="4k Wallpapers,Widescreen Wallpapers » Comments Feed" href="https://www.pixel4k.com/comments/Feed">
      <link rel="alternate" type="application/rss+xml" title="4k Wallpapers,Widescreen Wallpapers » Search Results for “deadpool” Feed" href="https://www.pixel4k.com/search/deadpool/Feed/RSS2/">
      <style type="text/css">img.wp-smiley,img.emoji{display:inline!important;border:none!important;box-shadow:none!important;height:1em!important;width:1em!important;margin:0 .07em!important;vertical-align:-.1em!important;background:none!important;padding:0!important}</style>
      <link rel="stylesheet" id="wp-block-library-css" href="https://www.pixel4k.com/wp-includes/css/dist/block-library/style.min.css?ver=5.3.8" type="text/css" media="all">
      <style id="rocket-lazyload-inline-css" type="text/css">.rll-youtube-player{position:relative;padding-bottom:56.23%;height:0;overflow:hidden;max-width:100%;background:#000;margin:5px}.rll-youtube-player iframe{position:absolute;top:0;left:0;width:100%;height:100%;z-index:100;background:0 0}.rll-youtube-player img{bottom:0;display:block;left:

But this app (https://www.pixel4k.com/page/1?s=deadpool) gets the full code.

What should I do?

Solution

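You are getting all of the data (both of your URLs, fetched with your code, produce the full HTML), but the Android logger does not output everything when you call it.

If you write the result to a file instead of a log statement, you will most likely find that all of your data is there.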
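
As a rough illustration of writing the HTML to a file rather than logging it, here is a minimal sketch. The class name `HtmlDump` and the method `saveHtml` are hypothetical, and the target file is assumed to be somewhere your app can write to.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of jsoup or Android.
final class HtmlDump {

    /**
     * Writes the whole string to the target file as UTF-8 and
     * returns the number of characters written.
     */
    static int saveHtml(File target, String html) throws IOException {
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(target), StandardCharsets.UTF_8)) {
            out.write(html);
        }
        return html.length();
    }
}
```

On Android you could pass something like `new File(getFilesDir(), "page.html")` as the target and `doc.outerHtml()` as the content, then compare the returned length with what you see in Logcat.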

See "What is the size limit for Logcat and how to change its capacity?"
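
If you still want to see the whole document in Logcat, one common workaround is to log it in pieces. The sketch below only does the splitting, so it runs as plain Java; `LogChunker`, `split`, and the chunk size of 3000 are hypothetical choices, and the per-entry limit of roughly 4 KB is an assumption that may differ between Android versions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that cuts a long string into log-sized pieces.
final class LogChunker {

    // Assumed to stay below Logcat's per-entry limit (about 4 KB).
    // The limit is in bytes, so text with many non-ASCII characters
    // may need a smaller value.
    static final int CHUNK_SIZE = 3000;

    /** Splits text into consecutive pieces of at most chunkSize characters. */
    static List<String> split(String text, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < text.length(); start += chunkSize) {
            int end = Math.min(text.length(), start + chunkSize);
            chunks.add(text.substring(start, end));
        }
        return chunks;
    }
}
```

On Android, each piece can then be logged separately, for example `for (String part : LogChunker.split(doc.outerHtml(), LogChunker.CHUNK_SIZE)) Log.i("HTML", part);`.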

,

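I looked up the maximum length of a String in Java. According to Takahiko Kawasaki in this question, the maximum length is 65536 characters. As far as I can tell, though, that figure describes string constants stored in a compiled class file; a String built at runtime is limited by Integer.MAX_VALUE characters and by available memory.

The method you are using writes the page's HTML into a String. If the 65,536 limit applied at runtime, your code would work as expected only for pages smaller than 65,536 bytes; since it does not apply there, the String itself is unlikely to be what truncates your output.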
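
To check the 65,536 figure directly, here is a minimal sketch that builds a String of 100,000 characters at runtime and reports its length. The class and method names are hypothetical; the point is only that a runtime String can be longer than 65,536 characters.

```java
// Hypothetical check, independent of jsoup and Android.
final class StringLengthCheck {

    /** Builds a string made of count copies of the character c. */
    static String repeat(char c, int count) {
        StringBuilder sb = new StringBuilder(count);
        for (int i = 0; i < count; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String big = repeat('x', 100_000);
        // The length is well above 65,536 and nothing is cut off.
        System.out.println(big.length()); // prints 100000
    }
}
```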

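I don't know what you need to do with the HTML once you have it, so the following suggestion may not be enough for your needs, but: have you tried storing the HTML in a {{1} } instead of a StringBuffer?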
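
If you do want to accumulate the HTML in a buffer while reading it, a minimal sketch might look like the following. It uses `StringBuilder`, the unsynchronized counterpart of `StringBuffer`, which is my substitution rather than something named above; `HtmlCollector` and `readAll` are hypothetical names.

```java
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper that drains a Reader into a single String.
final class HtmlCollector {

    /** Reads everything from the reader in 8 KB blocks and returns it. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[8192];
        int n;
        // read() returns -1 once the end of the stream is reached.
        while ((n = reader.read(buffer)) != -1) {
            sb.append(buffer, 0, n);
        }
        return sb.toString();
    }
}
```

With `HttpURLConnection`, the reader could be an `InputStreamReader` wrapped around `connection.getInputStream()` with the page's charset.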