How do I count the number of distinct words in a string?

Posted by ih99xse1 on 2022-10-15 in Ruby

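I want to write a method that takes a string and returns the number of distinct words in it.
I started out using split and then map on the resulting array, but after looking at many methods I am completely lost as to how to compare one value in an array with another value in the same array.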

ruby

Source: https://stackoverflow.com/questions/73963478/how-to-count-number-of-distinct-words-within-a-string

1 answer

agxfikkp1#

Let's consider various ways of doing this.

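All of the following approaches share one limitation: they do not handle contractions ("don't") or hyphenated words ("so-so" or "mother-in-law").
A string may contain punctuation or substrings that are not words ("$79.21"). As an initial step we could try to remove those characters, or we could simply ignore them. I have chosen the latter.
We want "Bugaboo" and "bugaboo" to be treated as the same word, so the first step is to downcase (or upcase) the string.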

Let's use the following string (the beginning of a rather long sentence written by Chuck Dickens).

str = "It was the best of times, it was the worst of times."

def find_em(str)
  str.downcase.scan(/[a-z]+/).uniq.size
end

find_em(str) #=> 7

Note that

str.downcase.scan(/[a-z]+/).uniq
  #=> ["it", "was", "the", "best", "of", "times", "worst"]

The regular expression /[a-z]+/ matches one or more lowercase letters, as many as possible.

See String#scan and Array#uniq.
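As a quick check of the decision to ignore non-word characters, here is a minimal sketch that runs the same downcase/scan/uniq chain on a made-up sentence. The string `sample` is my own illustration, not part of the original answer.

```ruby
# Hypothetical sample string, chosen only to illustrate the behaviour.
sample = "Bugaboo paid $79.21; the bugaboo was gone."

# Digits and punctuation never match /[a-z]+/, so "$79.21" contributes nothing.
words = sample.downcase.scan(/[a-z]+/)

p words           #=> ["bugaboo", "paid", "the", "bugaboo", "was", "gone"]
p words.uniq      #=> ["bugaboo", "paid", "the", "was", "gone"]
p words.uniq.size #=> 5
```

Because the string is downcased first, "Bugaboo" and "bugaboo" collapse into a single entry.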
Note that the first version of find_em creates an intermediate array, str.downcase.scan(/[a-z]+/). We can avoid that as follows.

def find_em(str)
  str.downcase.gsub(/[a-z]+/).with_object([]) do |s,a|
    a << s unless a.include?(s)
  end.size
end

find_em(str) #=> 7

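This uses the (seldom-used) form of String#gsub that takes one argument and no block, returning an enumerator. It merely enumerates the matches of the regular expression /[a-z]+/, so it has nothing to do with string replacement.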
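To make the one-argument form of gsub more concrete, the following small sketch inspects what it returns. The variable names `text` and `enum` are only illustrative.

```ruby
text = "it was the best"

# One argument (the pattern), no replacement string and no block.
enum = text.gsub(/[a-z]+/)

p enum.class #=> Enumerator
p enum.to_a  #=> ["it", "was", "the", "best"]
p text       #=> "it was the best" (the receiver is left unchanged)
```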
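The with_object([]) version works, but it has the drawback of doing a linear search (a.include?(s)) for every word s that is found. We can get around that by collecting the words in a set (which by definition has unique elements) and taking its size at the end.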

require 'set' # not needed on Ruby 3.2+, where Set is available without it

def find_em(str)
  str.downcase.gsub(/[a-z]+/).with_object(Set.new) { |s,st| st << s }.size
end

find_em str
  #=> 7

See Set::new, Set#<< and Set#include?.
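The reason the set version needs no explicit duplicate check can be seen in a tiny sketch like the one below. The word list is made up for illustration.

```ruby
require 'set'

seen = Set.new

# Adding an element that is already present leaves the set unchanged.
%w[it was it was the].each { |word| seen << word }

p seen.size            #=> 3
p seen.include?("was") #=> true
p seen.include?("of")  #=> false
```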
Another variant is to use scan with a more complex regular expression. I will write the expression in free-spacing mode to make it self-documenting.

RGX = /
      (         # begin capture group 1
        \b      # match a word boundary
        [a-z]+  # match one or more lowercase letters
        \b      # match a word boundary
      )         # end capture group 1
      (?!       # begin negative lookahead
        .*      # match zero or more characters
        \b      # match a word boundary
        \1      # match the contents of capture group 1
        \b      # match a word boundary
      )         # end negative lookahead
      /x        # use free-spacing regex definition mode

def find_em(str)
  str.downcase.scan(RGX).size
end

find_em(str) #=> 7

Note that

str.downcase.scan(RGX)    
  #=> [["best"], ["it"], ["was"], ["the"], ["worst"], ["of"], ["times"]]

This regular expression is conventionally written

/(\b[a-z]+\b)(?!.*\b\1\b)/
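A short sketch may help show what the negative lookahead does: a word is matched only where no later copy of it follows, so each word is reported at its last occurrence. The sample strings and the names `rgx` and `rgx_m` are my own. The second half illustrates a caveat I am adding as an assumption about intended use: in Ruby `.` does not match a newline unless the /m option is given, so on multi-line input the lookahead only sees the rest of the current line.

```ruby
rgx = /(\b[a-z]+\b)(?!.*\b\1\b)/

# Each word is matched at its last occurrence only.
p "a b a c b".scan(rgx).flatten #=> ["a", "c", "b"]

# Without /m the lookahead stops at the newline, so the repeated
# word on the second line is not seen and "a" is counted twice.
p "a\na".scan(rgx).size #=> 2

# With /m, "." also matches newlines and the duplicate is detected.
rgx_m = /(\b[a-z]+\b)(?!.*\b\1\b)/m
p "a\na".scan(rgx_m).size #=> 1
```

For single-line strings such as the Dickens sentence the two forms behave the same.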

Alternatively, we could again use that odd form of gsub, which avoids constructing an array altogether.

def find_em(str)
  str.downcase.gsub(RGX).count
end

find_em(str) #=> 7

See Enumerable#count.
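One possible way to relax the limitation noted at the start of the answer is to let apostrophes and hyphens join runs of letters. This is only a sketch under my own assumptions: the constant `WORD`, the method name `count_distinct_words` and the sample sentence are hypothetical, and only the ASCII apostrophe and hyphen are recognised.

```ruby
# Hypothetical variant, not from the original answer: letters, optionally
# followed by groups of (apostrophe or hyphen + more letters).
WORD = /[a-z]+(?:['-][a-z]+)*/

def count_distinct_words(str)
  str.downcase.scan(WORD).uniq.size
end

sample = "Don't stop, my mother-in-law said; so-so, but don't."

p sample.downcase.scan(WORD).uniq
  #=> ["don't", "stop", "my", "mother-in-law", "said", "so-so", "but"]
p count_distinct_words(sample) #=> 7
```

The group is non-capturing so that scan returns whole matches rather than the contents of a capture group.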
