


Date: 2021-11-26 22:41     Source/Author: deephub

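I have been looking for an effective algorithm for keyword extraction. The goal is an algorithm that extracts keywords efficiently and balances extraction quality against execution time, because my corpus is growing quickly and has already reached millions of rows. One of my main requirements is that the extracted keywords are always meaningful on their own, conveying something even when taken out of context.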


This article tests and experiments with several well-known keyword extraction algorithms on a corpus of 2000 documents.




  • RAKE
  • YAKE
  • PKE
  • KeyBERT
  • Spacy

Pandas, Matplotlib and other general-purpose libraries are used as well.




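We will first import the dataset that contains our text data. Then, for each algorithm, we will create a separate function that holds its extraction logic:

algorithm_name(text: str) → [keyword1, keyword2, ..., keywordn]

A second function runs one algorithm over the whole corpus:

extract_keywords_from_corpus(algorithm, corpus) → {algorithm, corpus_keywords, elapsed_time}

Next, Spacy helps us define a matcher object that judges whether a keyword is meaningful for our task; the check returns true or false.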




['To follow up from my previous questions. . Here is the result!\n',
'European mead competitions?\nI’d love some feedback on my mead, but entering the Mazer Cup isn’t an option for me, since shipping alcohol to the USA from Europe is illegal. (I know I probably wouldn’t get caught/prosecuted, but any kind of official record of an issue could screw up my upcoming citizenship application and I’m not willing to risk that).\n\nAre there any European mead comps out there? Or at least large beer comps that accept entries in the mead categories and are likely to have experienced mead judges?', 'Orange Rosemary Booch\n', 'Well folks, finally happened. Went on vacation and came home to mold.\n', 'I’m opening a gelato shop in London on Friday so we’ve been up non-stop practicing flavors - here’s one of our most recent attempts!\n', "Does anyone have resources for creating shelf stable hot sauce? Ferment and then water or pressure can?\nI have dozens of fresh peppers I want to use to make hot sauce, but the eventual goal is to customize a recipe and send it to my buddies across the States. I believe canning would be the best way to do this, but I'm not finding a lot of details on it. Any advice?", 'what is the practical difference between a wine filter and a water filter?\nwondering if you could use either', 'What is the best custard base?\nDoes someone have a recipe that tastes similar to Culver’s frozen custard?', 'Mold?\n'





    import pke
    import yake
    from keybert import KeyBERT
    # assumption: Rake comes from rake_nltk, whose API matches the calls below
    from rake_nltk import Rake

    # initiate BERT outside of functions
    bert = KeyBERT()

    # 1. RAKE
    def rake_extractor(text):
        """
        Uses Rake to extract the top 5 keywords from a text
        Arguments: text (str)
        Returns: list of keywords (list)
        """
        r = Rake()
        r.extract_keywords_from_text(text)
        return r.get_ranked_phrases()[:5]

    # 2. YAKE
    def yake_extractor(text):
        """
        Uses YAKE to extract the top 5 keywords from a text
        Arguments: text (str)
        Returns: list of keywords (list)
        """
        keywords = yake.KeywordExtractor(lan="en", n=3, windowsSize=3, top=5).extract_keywords(text)
        # each result is a pair of keyword and score; keep only the string
        results = []
        for scored_keywords in keywords:
            for keyword in scored_keywords:
                if isinstance(keyword, str):
                    results.append(keyword)
        return results

    # 3. PositionRank
    def position_rank_extractor(text):
        """
        Uses PositionRank to extract the top 5 keywords from a text
        Arguments: text (str)
        Returns: list of keywords (list)
        """
        # define the valid parts of speech to occur in the graph
        pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
        extractor = pke.unsupervised.PositionRank()
        extractor.load_document(text, language='en')
        extractor.candidate_selection(pos=pos, maximum_word_number=5)
        # weight the candidates using the sum of their words' scores, which are
        # computed using a random walk biased with the position of the words
        # in the document. In the graph, nodes are words with the POS tags
        # defined above, connected if they occur in a window of 3 words.
        extractor.candidate_weighting(window=3, pos=pos)
        # get the 5 highest-scored candidates as keyphrases
        keyphrases = extractor.get_n_best(n=5)
        results = []
        for scored_keywords in keyphrases:
            for keyword in scored_keywords:
                if isinstance(keyword, str):
                    results.append(keyword)
        return results

    # 4. SingleRank
    def single_rank_extractor(text):
        """
        Uses SingleRank to extract the top 5 keywords from a text
        Arguments: text (str)
        Returns: list of keywords (list)
        """
        pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
        extractor = pke.unsupervised.SingleRank()
        extractor.load_document(text, language='en')
        extractor.candidate_selection(pos=pos)
        extractor.candidate_weighting(window=3, pos=pos)
        keyphrases = extractor.get_n_best(n=5)
        results = []
        for scored_keywords in keyphrases:
            for keyword in scored_keywords:
                if isinstance(keyword, str):
                    results.append(keyword)
        return results

    # 5. MultipartiteRank
    def multipartite_rank_extractor(text):
        """
        Uses MultipartiteRank to extract the top 5 keywords from a text
        Arguments: text (str)
        Returns: list of keywords (list)
        """
        extractor = pke.unsupervised.MultipartiteRank()
        extractor.load_document(text, language='en')
        pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
        extractor.candidate_selection(pos=pos)
        # build the Multipartite graph and rank candidates using random walk,
        # alpha controls the weight adjustment mechanism, see TopicRank for
        # threshold/method parameters.
        extractor.candidate_weighting(alpha=1.1, threshold=0.74, method='average')
        keyphrases = extractor.get_n_best(n=5)
        results = []
        for scored_keywords in keyphrases:
            for keyword in scored_keywords:
                if isinstance(keyword, str):
                    results.append(keyword)
        return results

    # 6. TopicRank
    def topic_rank_extractor(text):
        """
        Uses TopicRank to extract the top 5 keywords from a text
        Arguments: text (str)
        Returns: list of keywords (list)
        """
        extractor = pke.unsupervised.TopicRank()
        extractor.load_document(text, language='en')
        pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
        extractor.candidate_selection(pos=pos)
        extractor.candidate_weighting()
        keyphrases = extractor.get_n_best(n=5)
        results = []
        for scored_keywords in keyphrases:
            for keyword in scored_keywords:
                if isinstance(keyword, str):
                    results.append(keyword)
        return results

    # 7. KeyBERT
    def keybert_extractor(text):
        """
        Uses KeyBERT to extract the top 5 keywords from a text
        Arguments: text (str)
        Returns: list of keywords (list)
        """
        keywords = bert.extract_keywords(text, keyphrase_ngram_range=(3, 5), stop_words="english", top_n=5)
        results = []
        for scored_keywords in keywords:
            for keyword in scored_keywords:
                if isinstance(keyword, str):
                    results.append(keyword)
        return results
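YAKE, pke and KeyBERT return each result as a pair of keyword and score, which is why the extractors above loop over every pair and keep only the string. A minimal sketch of that step, assuming results of that shape; `keywords_only` is a hypothetical helper and the sample pairs are made up for illustration, not output from any of the libraries:

```python
from typing import Iterable, List, Tuple, Union

Scored = Tuple[Union[str, float], Union[str, float]]

def keywords_only(scored: Iterable[Scored]) -> List[str]:
    """Keep the string element of each pair, whichever position it is in."""
    return [item for pair in scored for item in pair if isinstance(item, str)]

# Hypothetical scored results, for illustration only.
sample = [("shelf stable hot sauce", 0.71), (0.55, "fresh peppers")]
print(keywords_only(sample))
```

Filtering on the type rather than on the position means the same helper works whether a library puts the keyword first or the score first.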




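By passing pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'} we have already restricted the accepted grammatical patterns; together with Spacy, this ensures that nearly all keywords are ones a human reader would choose. We also want keywords to contain three words, so that they are more specific and not overly general.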



    import logging
    import time

    from tqdm import tqdm

    def extract_keywords_from_corpus(extractor, corpus):
        """This function uses an extractor to retrieve keywords from a list of documents"""
        extractor_name = extractor.__name__.replace("_extractor", "")
        logging.info(f"Starting keyword extraction with {extractor_name}")
        corpus_kws = {}
        start = time.time()
        # logging.info("Timer initiated.") <-- uncomment this if you want to output start of timer
        for idx, text in tqdm(enumerate(corpus), desc="Extracting keywords from corpus..."):
            corpus_kws[idx] = extractor(text)
        end = time.time()
        # logging.info("Timer stopped.") <-- uncomment this if you want to output end of timer
        # format the elapsed time as HH:MM:SS (whole seconds)
        elapsed = time.strftime("%H:%M:%S", time.gmtime(end - start))
        logging.info(f"Time elapsed: {elapsed}")
        return {"algorithm": extractor.__name__,
                "corpus_kws": corpus_kws,
                "elapsed_time": elapsed}
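Formatting the elapsed time as %H:%M:%S keeps whole seconds only, so a run that finishes in under a second is reported as 00:00:00. When the fastest extractors finish within a couple of seconds, that rounding can matter. The following is a hedged sketch of a variant that keeps sub-second precision with time.perf_counter; `timed_extraction` and `first_words_extractor` are hypothetical names that are not part of the original script:

```python
import time
from typing import Callable, Dict, List, Sequence

def timed_extraction(extractor: Callable[[str], List[str]],
                     corpus: Sequence[str]) -> Dict[str, object]:
    """Run an extractor over a corpus and record elapsed seconds as a float."""
    start = time.perf_counter()
    corpus_kws = {idx: extractor(text) for idx, text in enumerate(corpus)}
    elapsed_seconds = time.perf_counter() - start
    return {
        "algorithm": extractor.__name__,
        "corpus_kws": corpus_kws,
        "elapsed_seconds": elapsed_seconds,
    }

def first_words_extractor(text: str) -> List[str]:
    """Toy stand-in for a real extractor: returns the first two words."""
    return text.split()[:2]

result = timed_extraction(first_words_extractor, ["Orange Rosemary Booch", "Mold?"])
print(result["corpus_kws"], result["elapsed_seconds"])
```

Storing the duration as a float also removes the need to parse the formatted string back into seconds later.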



The next step is a check that the keywords returned by an extractor are always (almost?) meaningful.



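Spacy and its Matcher object can help us do this. We will define a matching function that takes a keyword and returns True if any of the defined patterns matches, and False otherwise.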

    import spacy
    from spacy.matcher import Matcher

    # assumption: the original does not show how nlp is created;
    # any English pipeline with a POS tagger should work here
    nlp = spacy.load("en_core_web_sm")

    def match(keyword):
        """This function checks if a keyword matches a certain POS pattern"""
        patterns = [
            [{'POS': 'PROPN'}, {'POS': 'VERB'}, {'POS': 'VERB'}],
            [{'POS': 'NOUN'}, {'POS': 'VERB'}, {'POS': 'NOUN'}],
            [{'POS': 'VERB'}, {'POS': 'NOUN'}],
            [{'POS': 'ADJ'}, {'POS': 'ADJ'}, {'POS': 'NOUN'}],
            [{'POS': 'NOUN'}, {'POS': 'VERB'}],
            [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}],
            [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'NOUN'}],
            [{'POS': 'ADJ'}, {'POS': 'NOUN'}],
            [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN'}, {'POS': 'NOUN'}],
            [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'ADV'}, {'POS': 'PROPN'}],
            [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'VERB'}],
            [{'POS': 'PROPN'}, {'POS': 'PROPN'}],
            [{'POS': 'NOUN'}, {'POS': 'NOUN'}],
            [{'POS': 'ADJ'}, {'POS': 'PROPN'}],
            [{'POS': 'PROPN'}, {'POS': 'ADP'}, {'POS': 'PROPN'}],
            [{'POS': 'PROPN'}, {'POS': 'ADJ'}, {'POS': 'NOUN'}],
            [{'POS': 'PROPN'}, {'POS': 'VERB'}, {'POS': 'NOUN'}],
            [{'POS': 'NOUN'}, {'POS': 'ADP'}, {'POS': 'NOUN'}],
            [{'POS': 'PROPN'}, {'POS': 'NOUN'}, {'POS': 'PROPN'}],
            [{'POS': 'VERB'}, {'POS': 'ADV'}],
            [{'POS': 'PROPN'}, {'POS': 'NOUN'}],
        ]
        matcher = Matcher(nlp.vocab)
        matcher.add("pos-matcher", patterns)
        # create spacy object
        doc = nlp(keyword)
        # collect all pattern matches in the keyword
        matches = matcher(doc)
        # a non-empty result means at least one pattern matched
        return len(matches) > 0
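Because the Matcher looks for each pattern anywhere in the parsed keyword, a keyword passes as soon as any contiguous run of its tokens fits one pattern; the keyword does not have to match as a whole. The idea can be sketched in plain Python, assuming the POS tags are already known. `contains_pattern`, the two sample patterns and the hand-written tags are illustrative and do not come from Spacy:

```python
from typing import Sequence

def contains_pattern(tags: Sequence[str], patterns: Sequence[Sequence[str]]) -> bool:
    """Return True if any pattern occurs as a contiguous run inside tags."""
    for pattern in patterns:
        width = len(pattern)
        # slide a window of the pattern's width over the tag sequence
        for start in range(len(tags) - width + 1):
            if list(tags[start:start + width]) == list(pattern):
                return True
    return False

patterns = [["ADJ", "NOUN"], ["NOUN", "NOUN"]]

# Hand-assigned tags for illustration; a real tagger may label words differently.
print(contains_pattern(["ADJ", "NOUN"], patterns))         # True
print(contains_pattern(["DET", "ADJ", "NOUN"], patterns))  # True
print(contains_pattern(["VERB", "ADP"], patterns))         # False
```

One consequence is that longer keywords have more chances to pass, since they contain more sub-sequences that can fit a short pattern.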


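We are almost done. This is the last step before launching the script and collecting the results.

We will define a benchmark function that takes our corpus and a boolean that controls whether the data is shuffled. For each extractor, it calls the extract_keywords_from_corpus function, which returns a dictionary with that extractor's results. We store these values in a list.

For each algorithm in the list, we then compute:

  • the average number of extracted keywords
  • the average number of matched keywords
  • a score equal to the average number of matches found divided by the time taken to perform the operation

We store all the data in a Pandas DataFrame, which is then exported to .csv.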
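The three quantities above can be worked through on a toy example. The sketch below assumes each document's keywords have already been reduced to True/False match flags; the flags and the elapsed time are made-up values for illustration, not benchmark results:

```python
from statistics import mean

# Hypothetical match flags per document: one boolean per extracted keyword.
match_flags = {
    0: [True, True, False, False, True],
    1: [True, False, False, False, False],
}
elapsed_seconds = 2.0  # made-up timing, for illustration only

# average number of extracted keywords per document
avg_keywords = mean(len(flags) for flags in match_flags.values())
# average number of keywords per document that matched a pattern
avg_matched = mean(sum(flags) for flags in match_flags.values())
# score: average matches divided by the time taken
score = round(avg_matched / elapsed_seconds, 2)

print(avg_keywords, avg_matched, score)
```

Because the score divides by the elapsed time, an extractor that is slightly less accurate but much faster can still come out ahead.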

    import logging
    import random

    import numpy as np
    import pandas as pd

    def get_sec(time_str):
        """Get seconds from time."""
        h, m, s = time_str.split(':')
        return int(h) * 3600 + int(m) * 60 + int(s)

    def benchmark(corpus, shuffle=True):
        """This function runs the benchmark for the keyword extraction algorithms"""
        logging.info("Starting benchmark...\n")
        # Shuffle the corpus (in place)
        if shuffle:
            random.shuffle(corpus)
        # extract keywords from corpus
        results = []
        extractors = [
            rake_extractor,
            yake_extractor,
            topic_rank_extractor,
            position_rank_extractor,
            single_rank_extractor,
            multipartite_rank_extractor,
            keybert_extractor,
        ]
        for extractor in extractors:
            result = extract_keywords_from_corpus(extractor, corpus)
            results.append(result)
        # compute average number of extracted keywords
        for result in results:
            len_of_kw_list = []
            for kws in result["corpus_kws"].values():
                len_of_kw_list.append(len(kws))
            result["avg_keywords_per_document"] = np.mean(len_of_kw_list)
        # match keywords: replace each keyword with its True/False match result
        for result in results:
            for idx, kws in result["corpus_kws"].items():
                match_results = []
                for kw in kws:
                    match_results.append(match(kw))
                result["corpus_kws"][idx] = match_results
        # compute average number of matched keywords
        for result in results:
            len_of_matching_kws_list = []
            for idx, kws in result["corpus_kws"].items():
                len_of_matching_kws_list.append(len([kw for kw in kws if kw]))
            result["avg_matched_keywords_per_document"] = np.mean(len_of_matching_kws_list)
            # compute average percentage of matching keywords, round 2 decimals
            result["avg_percentage_matched_keywords"] = round(result["avg_matched_keywords_per_document"] / result["avg_keywords_per_document"], 2)
        # create score based on the avg number of matched keywords divided by time elapsed (in seconds)
        for result in results:
            # add 0.1 s so that a run reported as 00:00:00 does not divide by zero
            elapsed_seconds = get_sec(result["elapsed_time"]) + 0.1
            # weigh the score based on the time elapsed
            result["performance_score"] = round(result["avg_matched_keywords_per_document"] / elapsed_seconds, 2)
        # delete corpus_kws
        for result in results:
            del result["corpus_kws"]
        # create results dataframe
        df = pd.DataFrame(results)
        df.to_csv("results.csv", index=False)
        logging.info("Benchmark finished. Results saved to results.csv")
        return df


    results = benchmark(texts[:2000], shuffle=True)






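Measured by the performance score (avg_matched_keywords_per_document / time_elapsed_in_seconds), Rake wins: it processes 2000 documents in 2 seconds, and although it is less accurate than KeyBERT, the time factor puts it ahead.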


If we look instead at the ratio between avg_matched_keywords_per_document and avg_keywords_per_document, we get the following results.


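From the accuracy standpoint, Rake also performs quite well. If time is left out of consideration, KeyBERT is clearly the algorithm that extracts the most accurate and most meaningful keywords. Rake ranks second in accuracy, but trails by a wide margin.

If accuracy is what you need, KeyBERT is definitely the first choice. If speed is required, Rake is the first choice, because it is fast and its accuracy is acceptable.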



