Problem description
In the code below, I scrape Google search result links with the help of Newspaper3k. However, the code fails as soon as it encounters a link that cannot be scraped (or fails for some other reason). How can I skip the sites that cannot be scraped, and keep the same code mining results from the links that can be?
Once an error is hit, I can manually insert code to remove the offending link and its site elements, but repeating that manual process is tedious. Please help me find a way to keep the loop going whenever an unscrapable site link appears, so the remaining results are collected as the code intends.
import pandas as pd
import time
!pip3 install newspaper3k
from googlesearch import search
import nltk
from newspaper import Article

newslist = []
query = input("enter your query")
try:
    for i in search(query, tld="com", num=70, stop=70, pause=2, lang='en'):
        print(i)
        newslist.append(i)
    list_dataframe = pd.DataFrame(newslist)
    list_dataframe.reset_index(drop=True)
    df = list_dataframe
    df.rename(columns={df.columns[0]: "Links"}, inplace=True)
    df = df.reset_index(drop=True)
    len = df.shape[0]
    date = []
    image = []
    Text = []
    Summary = []
    Keywords = []
    url_links = []
    i = 0
    nltk.download('punkt')
    try:
        for i in range(0, len):
            print(i)
            url = df['Links'][i]
            print(url)
            url_links.append(url)
            article = Article(url)
            article.download()
            article.parse()
            article.nlp()
            imag = article.top_image
            image.append(imag)
            Texxt = article.text
            Text.append(Texxt)
            Sumary = article.summary
            Summary.append(Sumary)
            Kewords = article.keywords
            Keywords.append(Kewords)
            i += 1
    except:
        print("error")
    data = {'Links': url_links, 'image': image, 'Text': Text, 'Summary': Summary, 'Keywords': Keywords}
    df1 = pd.DataFrame(data)
    df1
    df1.to_csv('Table.csv', index=False)
except:
    print("error")
Solution
So I found a way to bypass the pages that cannot be scraped, letting the results continue for the rest of the web links.
import pandas as pd
import time
!pip3 install newspaper3k
from googlesearch import search
import nltk
from newspaper import Article

newslist = []
query = input("enter your query")
try:
    for link in search(query, tld="com", num=100, stop=100, pause=2, lang='en'):
        print(link)
        newslist.append(link)
    df = pd.DataFrame(newslist)
    df.rename(columns={df.columns[0]: "Links"}, inplace=True)
    df = df.reset_index(drop=True)
    image = []
    Text = []
    Summary = []
    Keywords = []
    url_links = []
    nltk.download('punkt')
    for i in range(df.shape[0]):
        url = df['Links'][i]
        print(i, url)
        try:
            article = Article(url)
            article.download()
            article.parse()
            article.nlp()
        except Exception:
            # Skip the unscrapable link and continue with the next one.
            print("This link cannot be scraped. Trying next")
            continue
        url_links.append(url)
        image.append(article.top_image)
        Text.append(article.text)
        Summary.append(article.summary)
        Keywords.append(article.keywords)
    data = {'Links': url_links, 'image': image, 'Text': Text, 'Summary': Summary, 'Keywords': Keywords}
    df1 = pd.DataFrame(data)
    df1.to_csv('Table.csv', index=False)
except Exception:
    print("error")
This works for any number of web queries.
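The same skip-and-continue idea can be sketched independently of newspaper3k with plain Python: wrap only the per-URL work in `try/except`, skip failures, and collect the rest. The `scrape_all` and `fake_fetch` names here are hypothetical stand-ins for the `Article` download/parse steps, not part of any library:

```python
def scrape_all(urls, fetch):
    """Collect (url, result) for every url fetch() can handle, skipping failures."""
    results = []
    for url in urls:
        try:
            results.append((url, fetch(url)))
        except Exception as exc:
            # An unscrapable link: report it and move on instead of aborting the loop.
            print(f"Skipping {url}: {exc}")
    return results

# Hypothetical fetcher standing in for article.download()/parse(): fails on one link.
def fake_fetch(url):
    if "bad" in url:
        raise ValueError("download failed")
    return f"text of {url}"

rows = scrape_all(["https://a.com", "https://bad.com", "https://b.com"], fake_fetch)
# rows holds the two scrapable links; https://bad.com was skipped.
```

Because the `except` only covers one URL's work, one bad link can never abort the whole run, which is exactly what the manual link-removal step was compensating for.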