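I created a looping function that uses the Search API to pull tweets at a fixed interval (say, every 5 minutes). The function connects to Twitter, extracts tweets containing a specific keyword, and saves them to a CSV file. Occasionally (2-3 times a day), however, the loop stops because of one of the following two errors: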
> Error in htmlTreeParse(URL, useInternal = TRUE) :
  error in creating parser for http://search.twitter.com/search.atom?q=
  6.95322e-310tst&rpp=100&page=10
> Error in UseMethod("xmlNamespaceDefinitions") :
  no applicable method for 'xmlNamespaceDefinitions' applied to an object of
  class "NULL"
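> What causes these errors?
> How can I adjust my code to avoid them?
> How can I "force" the loop to keep running if it hits an error (for example, by using the try function)?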
My function (based on several scripts found online) is as follows:
    library(XML)  # htmlTreeParse

    twitter.search <- "Keyword"
    QUERY <- URLencode(twitter.search)

    # Set time loop (in seconds)
    d_time <- 300
    number_of_times <- 3000

    for (i in 1:number_of_times) {
      tweets <- NULL
      tweet.count <- 0
      page <- 1
      read.more <- TRUE

      while (read.more) {
        # construct Twitter search URL
        URL <- paste('http://search.twitter.com/search.atom?q=', QUERY, '&rpp=100&page=', page, sep='')
        # fetch remote URL and parse
        XML <- htmlTreeParse(URL, useInternal=TRUE, error = function(...){})

        # Extract list of "entry" nodes
        entry <- getNodeSet(XML, "//entry")
        read.more <- (length(entry) > 0)
        if (read.more) {
          # inner index renamed to j so it does not reuse the outer loop's i
          for (j in 1:length(entry)) {
            subdoc <- xmlDoc(entry[[j]])  # put entry in separate object to manipulate
            published <- unlist(xpathApply(subdoc, "//published", xmlValue))
            published <- gsub("Z", " ", gsub("T", " ", published))

            # Convert from GMT to central time
            time.gmt <- as.POSIXct(published, "GMT")
            local.time <- format(time.gmt, tz="Europe/Amsterdam")

            title <- unlist(xpathApply(subdoc, "//title", xmlValue))
            author <- unlist(xpathApply(subdoc, "//author/name", xmlValue))

            tweet <- paste(local.time, " @", author, ": ", title, sep="")
            entry.frame <- data.frame(tweet, local.time, stringsAsFactors=FALSE)
            tweet.count <- tweet.count + 1
            rownames(entry.frame) <- tweet.count
            tweets <- rbind(tweets, entry.frame)
          }
          page <- page + 1
          read.more <- (page <= 15)  # Seems to be 15 page limit
        }
      }

      names(tweets)
      # top 15 tweeters
      # sort(table(tweets$author), decreasing=TRUE)[1:15]

      write.table(tweets, file=paste("Twitts - ", format(Sys.time(), "%a %b %d %H_%M_%S %Y"), ".csv"), sep = ";")
      Sys.sleep(d_time)
    }  # end for
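Solution
Here is my solution, taken from a similar problem I had with the Twitter API.
I was querying the Twitter API for the follower count of each person in a long list of Twitter users. Whenever a user's account was protected I got an error, and the loop would break, until I added the try function. Using try lets the loop keep running by skipping to the next person on the list.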
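The pattern described above can be tried out without any network access. The following is a minimal sketch, assuming a hypothetical stand-in function `get_followers` (not part of twitteR) that fails for one "protected" user; the user names and follower numbers are made up for illustration.

```r
# Hypothetical stand-in for an API call: it fails for a "protected" account.
get_followers <- function(user) {
  if (user == "bob") stop("account is protected")
  nchar(user) * 10  # made-up follower count, for illustration only
}

users <- c("alice", "bob", "carol")
followers <- rep(NA_real_, length(users))

for (i in seq_along(users)) {
  # try() captures the error instead of letting it stop the loop
  result <- try(get_followers(users[i]), silent = TRUE)
  if (inherits(result, "try-error")) next  # skip this user, keep looping
  followers[i] <- result
}

print(data.frame(users, followers))
```

The failing user is simply left as `NA`, and the loop reaches the users after it.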
Here is the setup:
    # load library
    library(twitteR)
    #
    # Search Twitter for your term
    s <- searchTwitter('#rstats', n=1500)
    # convert search results to a data frame
    df <- do.call("rbind", lapply(s, as.data.frame))
    # extract the usernames
    users <- unique(df$screenName)
    users <- sapply(users, as.character)
    # make a data frame for the loop to work with
    users.df <- data.frame(users = users, followers = "", stringsAsFactors = FALSE)
Here is the loop, which uses try to handle errors while populating users.df$followers with the follower counts obtained from the Twitter API:
    for (i in 1:nrow(users.df)) {
      # tell the loop to skip a user if their account is protected
      # or some other error occurs
      result <- try(getUser(users.df$users[i])$followersCount, silent = TRUE)
      if (inherits(result, "try-error")) next
      # store the number of followers for this user, reusing the value
      # already fetched inside try() rather than calling the API again
      users.df$followers[i] <- result
      # tell the loop to pause for 60 s between iterations to
      # avoid exceeding the Twitter API request limit
      print('Sleeping for 60 seconds...')
      Sys.sleep(60)
    }
    #
    # Now inspect users.df to see the follower data
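For errors that come and go, as in the question, one possible variation is to retry the failing step a few times before skipping it. This is only a sketch under stated assumptions: `fetch_with_retry` and `flaky_fetch` are hypothetical names introduced here, and the fetch is simulated so the example runs offline.

```r
# Hypothetical helper: call `fetch()` up to `attempts` times and return NULL
# if every attempt fails, instead of stopping the caller's loop.
fetch_with_retry <- function(fetch, attempts = 3, wait = 0) {
  for (k in seq_len(attempts)) {
    result <- tryCatch(fetch(), error = function(e) {
      message("attempt ", k, " failed: ", conditionMessage(e))
      NULL
    })
    if (!is.null(result)) return(result)
    if (k < attempts) Sys.sleep(wait)  # pause before the next attempt
  }
  NULL
}

# Simulated flaky fetch: fails on the first two calls, then succeeds.
calls <- 0
flaky_fetch <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("error in creating parser")
  "parsed document"
}

doc <- fetch_with_retry(flaky_fetch)
if (is.null(doc)) {
  message("giving up on this page; the outer loop can move on")
} else {
  print(doc)
}
```

In the question's loop, `fetch` could be something like `function() htmlTreeParse(URL, useInternal = TRUE)`. Checking the result for `NULL` before passing it to `getNodeSet` may also avoid the second error, whose message says a method was applied to an object of class "NULL".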