python - How to use the Google Cloud Natural Language API on a large BigQuery table, or any other topic modeling resource?

rjee0c15  posted on 2021-09-08  in  Java

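As the title says, I have a BigQuery table with 18 million rows, nearly half of which are useless. I need to assign a topic/niche to each row based on one important column (it contains details about products and websites). I have tested the NLP API on a sample of 10,000 rows, and that worked, oddly enough. My standard approach, though, is to iterate over newarr (the important-details column, which I get by querying the BigQuery table): I send one cell at a time, wait for the API's response, and append it to the results array.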

    results = []
    i = 0  # running count of rows sent so far
    for x in newarr:
        i += 1
        # One blocking API call per row
        results.append(sample_classify_text(x))


    # This function returns the category for the text.
    from google.cloud import language_v1

    def sample_classify_text(text_content):
        """
        Classifying Content in a String

        Args:
          text_content The text content to analyze. Must include at least 20 words.
        """
        client = language_v1.LanguageServiceClient()

        # text_content = 'That actor on TV makes movies in Hollywood and also stars in a variety of popular new TV shows.'

        # Available types: PLAIN_TEXT, HTML
        type_ = language_v1.Document.Type.PLAIN_TEXT

        # Optional. If not specified, the language is automatically detected.
        # For list of supported languages:
        #
        language = "en"
        document = {"content": text_content, "type_": type_, "language": language}

        response = client.classify_text(request={"document": document})
        # return response.categories
        # Loop through classified categories returned from the API
        for category in response.categories:
            # Get the name of the category representing the document.
            # See the predefined taxonomy of categories:
            #
            # The source line is cut off after `format(`; category.name is assumed here.
            x = format(category.name)
            # category.confidence is the confidence: a number representing how certain
            # the classifier is that this category represents the provided text.
            # Returning inside the loop means only the first category is returned
            # (and None if the API returns no categories).
            return x
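
The loop above waits for each response before sending the next cell, so the total time grows with the number of rows. Below is a minimal sketch of one possible alternative, not a definitive solution: it sends several requests at once from a thread pool and skips cells that are too short to classify (the docstring above says the text must include at least 20 words). The names `has_enough_words` and `classify_many`, the `max_workers=8` default, and the choice to record `None` for skipped or failed rows are all assumptions made for illustration; the right concurrency level depends on the API quota of your project.

```python
from concurrent.futures import ThreadPoolExecutor


def has_enough_words(text, min_words=20):
    """Return True if text is a string with at least min_words whitespace-separated words."""
    return isinstance(text, str) and len(text.split()) >= min_words


def classify_many(texts, classify_fn, max_workers=8):
    """Call classify_fn on each text concurrently; results keep the input order.

    Texts that are too short, or whose call raises, yield None.
    (Hypothetical helper; the None-on-failure policy is an assumption.)
    """
    def safe_classify(text):
        if not has_enough_words(text):
            return None  # skip the API call entirely for short or missing cells
        try:
            return classify_fn(text)
        except Exception:
            return None  # assumption: treat a failed call as "no topic"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(safe_classify, texts))
```

With the function from the question, this would be called as `results = classify_many(newarr, sample_classify_text)`. Passing the classifier in as a parameter keeps the helper independent of the API client, so it can be tried out with a stub function before spending quota on real calls.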


