Unable to fetch a public LinkedIn company page with cURL / PHP

Problem description

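I am using the code below to try to load a public LinkedIn company page into a variable, but it always returns LinkedIn's 404 "page not found" page. Do you know where I am going wrong?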

$html = get_web_page('https://www.linkedin.com/company/google/');
echo stripos( $html['content'],'occludable-update' );
echo $html['content'];

function get_web_page( $url )
{
        $user_agent='Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0';

        $options = array(

                CURLOPT_CUSTOMREQUEST  =>"GET",//set request type post or get
                CURLOPT_POST           =>false,//set to GET
                CURLOPT_USERAGENT      => $user_agent,//set user agent
                CURLOPT_COOKIEFILE     =>"cookie.txt",//set cookie file
                CURLOPT_COOKIEJAR      =>"cookie.txt",//set cookie jar
                CURLOPT_RETURNTRANSFER => true,// return web page
                CURLOPT_HEADER         => false,// don't return headers
                CURLOPT_FOLLOWLOCATION => true,// follow redirects
                CURLOPT_ENCODING       => "",// handle all encodings
                CURLOPT_AUTOREFERER    => true,// set referer on redirect
                CURLOPT_CONNECTTIMEOUT => 120,// timeout on connect
                CURLOPT_TIMEOUT        => 120,// timeout on response
                CURLOPT_MAXREDIRS      => 10,// stop after 10 redirects
        );

        $ch      = curl_init( $url );
        curl_setopt_array( $ch,$options );
        $content = curl_exec( $ch );
        $err     = curl_errno( $ch );
        $errmsg  = curl_error( $ch );
        $header  = curl_getinfo( $ch );
        curl_close( $ch );

        $header['errno']   = $err;
        $header['errmsg']  = $errmsg;
        $header['content'] = $content;
        return $header;
}

Solution

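They must have some kind of scraping protection in place. If you fetch the page with curl from the CLI, you will see that it returns only some JavaScript code: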

$ curl https://www.linkedin.com/company/google/
<html><head>
<script type="text/javascript">
window.onload = function() {
  // Parse the tracking code from cookies.
  var trk = "bf";
  var trkInfo = "bf";
  var cookies = document.cookie.split("; ");
  for (var i = 0; i < cookies.length; ++i) {
    if ((cookies[i].indexOf("trkCode=") == 0) && (cookies[i].length > 8)) {
      trk = cookies[i].substring(8);
    }
    else if ((cookies[i].indexOf("trkInfo=") == 0) && (cookies[i].length > 8)) {
      trkInfo = cookies[i].substring(8);
    }
  }

  if (window.location.protocol == "http:") {
    // If "sl" cookie is set, redirect to https.
    for (var i = 0; i < cookies.length; ++i) {
      if ((cookies[i].indexOf("sl=") == 0) && (cookies[i].length > 3)) {
        window.location.href = "https:" + window.location.href.substring(window.location.protocol.length);
        return;
      }
    }
  }

  // Get the new domain. For international domains such as
  // fr.linkedin.com, we convert it to www.linkedin.com
  var domain = "www.linkedin.com";
  if (domain != location.host) {
    var subdomainIndex = location.host.indexOf(".linkedin");
    if (subdomainIndex != -1) {
      domain = "www" + location.host.substring(subdomainIndex);
    }
  }

  window.location.href = "https://" + domain + "/authwall?trk=" + trk + "&trkInfo=" + trkInfo +
      "&originalReferer=" + document.referrer.substr(0,200) +
      "&sessionRedirect=" + encodeURIComponent(window.location.href);
}
</script>
</head></html>
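
The script above ends by sending the browser to a URL containing "/authwall", so an anonymous request receives a redirect stub rather than the company page itself. As a rough sketch, the PHP code could check for that stub before searching the content for 'occludable-update'. The function name looks_like_authwall and the two sample strings below are hypothetical and only for illustration; the check assumes the stub keeps containing the "/authwall" path seen in the output above, which LinkedIn may change at any time.

```php
<?php
// Hypothetical helper, not part of the original question or answer:
// returns true when a response body looks like the JavaScript redirect
// stub shown above, i.e. it mentions the "/authwall" path.
function looks_like_authwall(string $body): bool
{
    // stripos() returns false when the needle is absent, and may return 0
    // for a match at the start, so compare strictly against false.
    return stripos($body, '/authwall') !== false;
}

// Sample inputs made up for this sketch: one mimics the redirect line from
// the curl output, the other mimics a page that contains real content.
$stub = '<script>window.location.href = "https://" + domain + "/authwall?trk=" + trk;</script>';
$page = '<html><body><div class="occludable-update">post</div></body></html>';

var_dump(looks_like_authwall($stub)); // bool(true)
var_dump(looks_like_authwall($page)); // bool(false)
```

With the code from the question, this would be called as looks_like_authwall($html['content']) right after get_web_page() returns. It only detects the redirect stub; it does not get around the login requirement.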