title: All About Proxy IPs
copyright: true
top: 0
date: 2019-11-13 14:20:39
tags: Proxy IP
categories: Crawler Notes
permalink:
password:
keywords:
description: How proxy IPs work under the hood
He was fated to attract countless admirers, yet he was oblivious to an extraordinary degree. He was like one of those stone statues on Easter Island gazing out to sea: the affection drifting his way was simply wasted on him.
Simply put, a proxy IP turns the direct path A -> C into A -> B -> C: your request goes to the proxy B, which forwards it to the target C on your behalf, so C sees B's address instead of yours.
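This A -> B -> C routing can be sketched with the `requests` library. The proxy address below is a made-up placeholder, not a working proxy, and the actual request is left commented out since it needs a live proxy:

```python
def build_proxies(ip_port):
    """Build the proxies dict that requests expects from an 'ip:port' string."""
    return {'http': 'http://' + ip_port, 'https': 'http://' + ip_port}

# hypothetical proxy address (B) -- substitute a real one
proxies = build_proxies('127.0.0.1:8080')

# With requests installed, this would route the request through B to the target C:
# import requests
# r = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=5)
# print(r.text)  # the target sees the proxy's IP, not yours
```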
Types of proxy IPs
By anonymity level
From most to least anonymous, proxies fall into four classes:
- High-anonymity (elite) proxies
- Distorting proxies
- Anonymous proxies
- Transparent proxies
By proxy protocol
By protocol, proxies fall into the following seven classes:
- FTP proxy: used to access FTP servers; usually supports upload, download, and caching. Common ports: 21, 2121.
- HTTP proxy: used to access web pages; usually supports content filtering and caching. Common ports: 80, 8080, 3128.
- SSL/TLS proxy: used to access encrypted sites; provides SSL or TLS encryption (up to 128-bit strength). Common port: 443.
- RTSP proxy: used to access Real streaming-media servers; usually supports caching. Common port: 554.
- Telnet proxy: used for telnet remote control (often used by attackers to hide their identity). Common port: 23.
- POP3/SMTP proxy: used to send and receive mail over POP3/SMTP; usually supports caching. Common ports: 110/25.
- SOCKS proxy: simply relays data packets without caring about the application protocol, so it is much faster; usually supports caching. Common port: 1080. SOCKS comes in two versions, SOCKS4 and SOCKS5: the former supports only TCP, while the latter supports both TCP and UDP and adds authentication mechanisms and server-side domain-name resolution. In short, anything SOCKS4 can do SOCKS5 can also do, but not the other way around.
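Because SOCKS merely relays bytes, its handshake is tiny. As an illustration, this sketch builds a SOCKS4 CONNECT request by hand, following the SOCKS4 wire format (version byte 0x04, command 0x01 for CONNECT, 2-byte port, 4-byte IPv4 address, null-terminated user ID). Note the IPv4-only address field, which is one reason SOCKS4 is limited to TCP/IPv4 while SOCKS5 is not:

```python
import socket
import struct

def socks4_connect_request(dst_ip, dst_port, user_id=b''):
    """Build the raw bytes of a SOCKS4 CONNECT request."""
    # VN=4 (protocol version), CD=1 (CONNECT), then port and IP in network byte order
    return (struct.pack('!BBH', 4, 1, dst_port)
            + socket.inet_aton(dst_ip)
            + user_id + b'\x00')

# 9 bytes total with an empty user ID
pkt = socks4_connect_request('1.2.3.4', 80)
```

Sending these bytes over a TCP connection to a SOCKS4 proxy (and reading its 8-byte reply) is all it takes to ask the proxy to open a tunnel to the destination.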
How proxy IPs work
The four anonymity types above differ only in how the proxy server is configured; different configurations produce different proxy types.
Three server-side variables are decisive: REMOTE_ADDR, HTTP_VIA, and HTTP_X_FORWARDED_FOR.
REMOTE_ADDR
If you visit my blog without a proxy, my server records your IP address in REMOTE_ADDR; if you go through a proxy, it records the proxy's IP instead.
HTTP_VIA
Via is an HTTP header that records the proxies and gateways a request passed through: each proxy along the way appends its own entry, so one proxy adds one entry and two proxies add two.
X-Forwarded-For
X-Forwarded-For is an HTTP extension header that carries the requesting client's real IP. When a client uses a proxy, the web server no longer sees the client's real IP directly; to preserve it, the proxy typically adds an X-Forwarded-For header with the client's IP appended.
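Putting the three variables together, the server-side view of each anonymity level can be sketched as a small classifier. This is a simplification I'm adding for illustration: real Via/X-Forwarded-For values are comma-separated chains rather than single IPs, and the `real_ip` parameter is only knowable here because it's our own test scenario:

```python
def classify_proxy(remote_addr, via, x_forwarded_for, real_ip):
    """Infer a proxy's anonymity level from what the target server sees.

    remote_addr          -- the connecting IP the server records
    via, x_forwarded_for -- header values, '' if the header is absent
    real_ip              -- the client's actual IP (known here for illustration)
    """
    if remote_addr == real_ip:
        return 'no proxy'        # direct connection: REMOTE_ADDR is the real IP
    if not via and not x_forwarded_for:
        return 'high-anonymity'  # proxy leaves no trace of itself or the client
    if x_forwarded_for == real_ip:
        return 'transparent'     # real IP leaked through X-Forwarded-For
    if x_forwarded_for == remote_addr:
        return 'anonymous'       # headers reveal a proxy, but only the proxy's IP
    return 'distorting'          # X-Forwarded-For carries a fabricated IP
```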
A simple proxy pool
# -*- coding:utf-8 -*-
import datetime
import queue
import re
import threading

import requests
IP_66_headers = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'zh,zh-CN;q=0.9,en-US;q=0.8,en;q=0.7',
'Connection': 'keep-alive',
'Cookie': '__jsluid_h=d26e11a062ae566f576fd73c1cd582be; __jsl_clearance=1563459072.346|0|lMwNkWbcOEZhV8NGTNIpXgDvE8U%3D',
'Host': 'www.66ip.cn',
'Referer': 'http://www.66ip.cn/mo.PHP?sxb=&tqsl=30&port=&export=&ktip=&sxa=&submit=%CC%E1++%C8%A1&textarea=2',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
IP_XC_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'zh,zh-CN;q=0.9,en-US;q=0.8,en;q=0.7',
'Cache-Control': 'max-age=0',
'Connection': 'keep-alive',
'Cookie': '_free_proxy_session=BAh7B0kiD3Nlc3Npb25faWQGOgZFVEkiJTBmOWM5NDc1OWY4NjljM2ZjMzU3OTM1MGMxOTEwMjNhBjsAVEkiEF9jc3JmX3Rva2VuBjsARkkiMWVGT0Z1dVpKUXdTMVFEN1JHTnJ3VVhYS05WWlIzUlFEcncvM1daVER2blk9BjsARg%3D%3D--66057a30315f0a34734318d2e6963e608017f79e; Hm_lvt_0cf76c77469e965d2957f0553e6ecf59=1563458856; Hm_lpvt_0cf76c77469e965d2957f0553e6ecf59=1563460669',
'Host': 'www.xicidaili.com',
'if-none-match': 'W/"b7acf7140e4247040788777914f600e1"',
'Referer': 'http://www.66ip.cn/mo.PHP?sxb=&tqsl=30&port=&export=&ktip=&sxa=&submit=%CC%E1++%C8%A1&textarea=2',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
IP_89_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'zh,zh-CN;q=0.9,en-US;q=0.8,en;q=0.7',
'Cache-Control': 'max-age=0',
'Connection': 'keep-alive',
'Cookie': 'yd_cookie=325275f9-21df-4b82a1658307a42df71b5943b40f8aa57b86; Hm_lvt_f9e56acddd5155c92b9b5499ff966848=1572920966; Hm_lpvt_f9e56acddd5155c92b9b5499ff966848=1572922039',
'Host': 'www.89ip.cn',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
IP_KD_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'zh,zh-CN;q=0.9,en-US;q=0.8,en;q=0.7',
'Connection': 'keep-alive',
'Cookie': 'channelid=0; sid=1572920890672670; _ga=GA1.2.1056826151.1572920918; _gid=GA1.2.265678962.1573616616; Hm_lvt_7ed65b1cc4b810e9fd37959c9bb51b31=1572920918,1573616632; Hm_lpvt_7ed65b1cc4b810e9fd37959c9bb51b31=1573616638',
'Host': 'www.kuaidaili.com',
'Referer': 'https://www.kuaidaili.com/free/inha/1/',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
IP_66_URL = 'http://www.66ip.cn/mo.PHP?sxb=&tqsl=30&port=&export=&ktip=&sxa=&submit=%CC%E1++%C8%A1&textarea='
IP_XC_URL = 'http://www.xicidaili.com/nn/'
IP_89_URL = 'http://www.89ip.cn/index_{}.html'
IP_KD_URL = 'https://www.kuaidaili.com/free/inha/{}/'
ProducerIp = queue.Queue()  # raw, unverified proxies scraped from the sites
ConsumerIp = queue.Queue()  # verified, live proxies
filename = (str(datetime.datetime.now()).replace(' ', '-').replace(':', '-').split('.')[0]) + 'AliveProxyIP.txt'
def GetUrlContent(url, headers):
    """Fetch a URL and return the raw response body, or None on failure."""
    try:
        r = requests.get(url, headers=headers, timeout=10)
        return r.content
    except Exception:
        return None
def GetProxyIp():
    """Producer: scrape the four free-proxy sites and queue 'ip:port' strings."""
    while 1:
        for i in range(1, 50):
            # kuaidaili
            content = GetUrlContent(IP_KD_URL.format(i), IP_KD_HEADERS)
            if content is not None:
                try:
                    content = content.decode()
                    results = re.findall(r'<td data-title="IP">(\d.*?)</td.*?-title="PORT">(\d.*?)</td>', content, re.S)
                    ips = [':'.join(x) for x in results]
                    for ip in ips:
                        ProducerIp.put(ip)
                except Exception:
                    pass
            # 89ip
            content = GetUrlContent(IP_89_URL.format(i), IP_89_HEADERS)
            if content is not None:
                try:
                    content = content.decode()
                    ips = [':'.join(x) for x in re.findall(r'<td>\n\t\t\t(\d.*?)\t\t</td>\n\t\t<td>\n\t\t\t(\d.*?)\t\t</td>', content)]
                    for ip in ips:
                        ProducerIp.put(ip)
                except Exception:
                    pass
            # 66ip
            content = GetUrlContent(IP_66_URL + str(i), IP_66_headers)
            if content is not None:
                try:
                    ips = re.findall(rb'\t(\d.*?:\d.*\d)<br />', content)
                    for ip in ips:
                        ProducerIp.put(ip.decode())
                except Exception:
                    pass
            # xicidaili
            content = GetUrlContent(IP_XC_URL + str(i), IP_XC_HEADERS)
            if content:
                try:
                    pairs = re.findall(rb'<td>(\d.*\.\d.*)</td>\n.*?<td>(\d.*)</td>\n', content)
                    for pair in pairs:
                        ProducerIp.put(pair[0].decode() + ':' + pair[1].decode())
                except Exception:
                    pass
def CheckProxyIp():
    """Consumer: verify each scraped proxy against Baidu and keep the live ones."""
    while 1:
        # take the next scraped proxy off the queue
        ip = ProducerIp.get()
        proxies = {'http': 'http://' + str(ip)}
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
        try:
            url = 'https://www.baidu.com'
            req2 = requests.get(url=url, proxies=proxies, headers=headers, timeout=5)
            if req2.status_code == 200 and '百度一下'.encode() in req2.content:
                now = str(datetime.datetime.now()).replace(' ', '-').replace(':', '-').split('.')[0]
                print('[ {} ] Found live proxy IP: {}'.format(now, ip))
                with open(filename, 'a+', encoding='utf-8') as a:
                    a.write(str(ip) + '\n')
                ConsumerIp.put(str(ip))
        except Exception:
            pass
if __name__ == '__main__':
    # one producer thread scraping, ten consumer threads verifying
    threading.Thread(target=GetProxyIp).start()
    for i in range(10):
        threading.Thread(target=CheckProxyIp).start()