Getting the visible text words with Nokogiri

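Question

I'd like to open a web page with Nokogiri, extract all the words a user sees when visiting that page in a browser, and analyze the word frequency. What is the easiest way to get all the readable words out of an HTML document with Nokogiri? The ideal snippet would take an HTML page (as a file, say) and return an array of the individual words from every kind of readable element. (There is no need to worry about JavaScript or CSS that hides elements and thereby hides words; all the words designed for display would be fine.)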

Solution

You want the Nokogiri::XML::Node#inner_text method:

require 'nokogiri'
require 'open-uri'
# URI.open comes from open-uri; a bare Kernel#open no longer fetches URLs on Ruby 3.0+
html = Nokogiri::HTML(URI.open('http://stackoverflow.com/questions/6129357'))

# Alternatively
html = Nokogiri::HTML(File.read('myfile.html'))

text = html.at('body').inner_text

# Pretend that all words we care about contain only a-z, 0-9, or underscores
words = text.scan(/\w+/)
p words.length, words.uniq.length, words.uniq.sort[0..8]
#=> 907
#=> 428
#=> ["0", "1", "100", "15px", "2", "20", "2011", "220px", "24158nokogiri"]

# How about words that are only letters?
words = text.scan(/[a-z]+/i)
p words.length, words.uniq.length, words.uniq.sort[0..5]
#=> 872
#=> 406
#=> ["Answer", "Ask", "Badges", "Browse", "DocumentFragment", "Email"]

# Find the most frequent words
require 'pp'
def frequencies(words)
  # Group case-insensitively, count each group, and order by descending count
  Hash[
    words.group_by(&:downcase).map{ |word,instances|
      [word,instances.length]
    }.sort_by(&:last).reverse
  ]
end
pp frequencies(words)
#=> {"nokogiri"=>34,
#=>  "a"=>27,
#=>  "html"=>18,
#=>  "function"=>17,
#=>  "s"=>13,
#=>  "var"=>13,
#=>  "b"=>12,
#=>  "c"=>11,
#=>  ...

# Hrm...let's drop the javascript code out of our words
html.css('script').remove
words = html.at('body').inner_text.scan(/\w+/)
pp frequencies(words)
#=> {"nokogiri"=>36,
#=>  "words"=>18,
#=>  "html"=>17,
#=>  "text"=>13,
#=>  "with"=>12,
#=>  "a"=>12,
#=>  "the"=>11,
#=>  "and"=>11,
#=>  ...
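
On Ruby 2.7 or later, the counting step in the frequencies helper above can probably be written more compactly with Enumerable#tally. The following is only a sketch under that assumption: top_words is a hypothetical helper name, it works on a plain string rather than a Nokogiri document, and breaking ties alphabetically is a choice made here for stable output, not something the original answer does.

```ruby
# Count case-insensitive word frequencies, most frequent first.
# `top_words` is a hypothetical helper name, not part of Nokogiri.
def top_words(text, limit = 10)
  text.scan(/[a-z]+/i)                          # letter-only words, as in the answer above
      .map(&:downcase)
      .tally                                    # Ruby 2.7+: { "word" => count }
      .sort_by { |word, count| [-count, word] } # highest count first, ties alphabetical
      .first(limit)
end

sample = "Nokogiri parses HTML. HTML is text, and Nokogiri reads text."
top_words(sample, 3).each { |word, count| puts "#{word}: #{count}" }
```

Passing html.at('body').inner_text as the text argument would give the same kind of ranking for a parsed page.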

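If you really want to do this with Nokogiri (otherwise you could just use a regex to strip the tags), then you should:

1. Parse the page: doc = Nokogiri::HTML(URI.open('url').read) # requires open-uri
2. Strip out all javascript and style tags with something like doc.search('script').each {|el| el.unlink}
3. Read the remaining text: doc.text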
