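Problem description
I am using the following code to try to load LinkedIn's public company page into a variable, but it always returns LinkedIn's 404 "page not found" page. Do you know where I am going wrong?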
$html = get_web_page('https://www.linkedin.com/company/google/');
echo stripos( $html['content'],'occludable-update' );
echo $html['content'];
function get_web_page( $url )
{
    $user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0';
    $options = array(
        CURLOPT_CUSTOMREQUEST  => "GET",        // set request type post or get
        CURLOPT_POST           => false,        // set to GET
        CURLOPT_USERAGENT      => $user_agent,  // set user agent
        CURLOPT_COOKIEFILE     => "cookie.txt", // set cookie file
        CURLOPT_COOKIEJAR      => "cookie.txt", // set cookie jar
        CURLOPT_RETURNTRANSFER => true,         // return web page
        CURLOPT_HEADER         => false,        // don't return headers
        CURLOPT_FOLLOWLOCATION => true,         // follow redirects
        CURLOPT_ENCODING       => "",           // handle all encodings
        CURLOPT_AUTOREFERER    => true,         // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120,          // timeout on connect
        CURLOPT_TIMEOUT        => 120,          // timeout on response
        CURLOPT_MAXREDIRS      => 10,           // stop after 10 redirects
    );
    $ch = curl_init( $url );
    curl_setopt_array( $ch, $options );
    $content = curl_exec( $ch );
    $err     = curl_errno( $ch );
    $errmsg  = curl_error( $ch );
    $header  = curl_getinfo( $ch );
    curl_close( $ch );
    // Attach the error details and the body to the transfer info array.
    $header['errno']   = $err;
    $header['errmsg']  = $errmsg;
    $header['content'] = $content;
    return $header;
}
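Solution
They must have some kind of scraping protection in place. If you fetch the page with curl from the CLI, you will see that it returns only some JavaScript code: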
$ curl https://www.linkedin.com/company/google/
<html><head>
<script type="text/javascript">
window.onload = function() {
  // Parse the tracking code from cookies.
  var trk = "bf";
  var trkInfo = "bf";
  var cookies = document.cookie.split("; ");
  for (var i = 0; i < cookies.length; ++i) {
    if ((cookies[i].indexOf("trkCode=") == 0) && (cookies[i].length > 8)) {
      trk = cookies[i].substring(8);
    }
    else if ((cookies[i].indexOf("trkInfo=") == 0) && (cookies[i].length > 8)) {
      trkInfo = cookies[i].substring(8);
    }
  }
  if (window.location.protocol == "http:") {
    // If "sl" cookie is set, redirect to https.
    for (var i = 0; i < cookies.length; ++i) {
      if ((cookies[i].indexOf("sl=") == 0) && (cookies[i].length > 3)) {
        window.location.href = "https:" + window.location.href.substring(window.location.protocol.length);
        return;
      }
    }
  }
  // Get the new domain. For international domains such as
  // fr.linkedin.com, we convert it to www.linkedin.com
  var domain = "www.linkedin.com";
  if (domain != location.host) {
    var subdomainIndex = location.host.indexOf(".linkedin");
    if (subdomainIndex != -1) {
      domain = "www" + location.host.substring(subdomainIndex);
    }
  }
  window.location.href = "https://" + domain + "/authwall?trk=" + trk + "&trkInfo=" + trkInfo +
    "&originalReferer=" + document.referrer.substr(0, 200) +
    "&sessionRedirect=" + encodeURIComponent(window.location.href);
}
</script>
</head></html>
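Because the returned script only redirects the browser to /authwall, a PHP script can at least detect this case instead of searching the stub for page content. The following is a minimal sketch, not a way around the protection: the helper name looks_like_authwall is hypothetical, and it assumes the stub keeps building its redirect from the literal string "/authwall?trk=", as in the output above.

```php
<?php
// Hypothetical helper (not part of the original code): returns true when the
// fetched HTML looks like the JavaScript stub that redirects to /authwall.
// Assumption: the stub contains the literal string "/authwall?trk=".
function looks_like_authwall(string $html): bool
{
    return stripos($html, '/authwall?trk=') !== false;
}

// Abbreviated stand-in for the stub shown in the curl output above.
$stub = '<html><head><script type="text/javascript">'
      . 'window.location.href = "https://" + domain + "/authwall?trk=" + trk;'
      . '</script></head></html>';

var_dump(looks_like_authwall($stub));                           // bool(true)
var_dump(looks_like_authwall('<html><body>ok</body></html>'));  // bool(false)
```

In the question's code, this check could be applied to $html['content'] before looking for 'occludable-update', so that the script reports the login wall rather than silently printing nothing.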