Hive runtime error while closing operators when running a Python UDF on a Hive Twitter table

Posted by okxuctiv on 2021-06-26 in Hive

I am trying to run a Python UDF in Hive to perform sentiment analysis on Twitter data captured by Flume.
My Twitter table definition:

    CREATE EXTERNAL TABLE tweets (
      id bigint,
      created_at string,
      source STRING,
      favorited BOOLEAN,
      retweeted_status STRUCT<
        text:STRING,
        user:STRUCT<screen_name:STRING,name:STRING>,
        retweet_count:INT>,
      entities STRUCT<
        urls:ARRAY<STRUCT<expanded_url:STRING>>,
        user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
        hashtags:ARRAY<STRUCT<text:STRING>>>,
      lang string,
      retweet_count int,
      text string,
      user STRUCT<
        screen_name:STRING,
        name:STRING,
        friends_count:INT,
        followers_count:INT,
        statuses_count:INT,
        verified:BOOLEAN,
        utc_offset:INT,
        time_zone:STRING>
    )
    PARTITIONED BY (datehour int)
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 'hdfs://192.168.0.73:8020/user/flume/tweets'

My Python code:

    import hashlib
    import sys

    for line in sys.stdin:
        line = line.strip()
        (lang, text) = line.split('\t')
        positive = set(["love", "good", "great", "happy", "cool", "best", "awesome", "nice", "helpful", "enjoyed"])
        negative = set(["hate", "bad", "stupid", "terrible", "unhappy"])
        words = text.split()
        word_count = len(words)
        positive_matches = [1 for word in words if word in positive]
        negative_matches = [-1 for word in words if word in negative]
        st = sum(positive_matches) + sum(negative_matches)
        if st > 0:
            print('\t'.join([lang, text, 'positive', str(word_count)]))
        elif st < 0:
            print('\t'.join([lang, text, 'negative', str(word_count)]))
        else:
            print('\t'.join([lang, text, 'neutral', str(word_count)]))

And finally my Hive query:

    ADD JAR /tmp/json-serde-1.3.9-SNAPSHOT-jar-with-dependencies.jar;
    ADD FILE /tmp/my_py_udf.py;
    SELECT
      TRANSFORM (lang, text)
      USING 'python my_py_udf.py'
      AS (lang, text, sentiment, word_count)
    FROM tweets

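With this query I get a "Hive Runtime Error while closing operators" error.
The query does run successfully if the Python UDF uses only one variable, like this: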

    text = line.replace('\n',' ')

Could the error be coming from the tuple unpacking of split('\t')?
Can anyone help? I have been stuck on this for the last 10 days...
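
One way to test the split('\t') suspicion is sketched below. It has not been confirmed against the actual data. `(lang, text) = line.split('\t')` raises ValueError whenever a line does not split into exactly two fields. That can happen if a tweet's text contains a tab or a newline, or if the text is empty, because `line.strip()` also removes the trailing tab. When the script dies with an uncaught exception, Hive typically reports it as an error while closing operators. The sketch assumes Python 3. The names `classify` and `run` and the sample rows are hypothetical, and skipping malformed rows (rather than repairing them) is a choice made here for illustration.

```python
import io

POSITIVE = {"love", "good", "great", "happy", "cool", "best",
            "awesome", "nice", "helpful", "enjoyed"}
NEGATIVE = {"hate", "bad", "stupid", "terrible", "unhappy"}


def classify(line):
    """Return one tab-separated output row, or None if the input row is malformed."""
    # Only drop the newline; strip() would also eat a trailing tab (empty text).
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        # Assumption: a row with a missing or extra field is skipped, not fatal.
        return None
    lang, text = fields
    words = text.split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    sentiment = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return "\t".join([lang, text, sentiment, str(len(words))])


def run(stdin, stdout, stderr):
    """Process every input line; report bad rows on stderr instead of crashing."""
    for line in stdin:
        row = classify(line)
        if row is None:
            stderr.write("skipped malformed row: %r\n" % line)
            continue
        stdout.write(row + "\n")


# Demo on in-memory sample rows (made-up data); the middle row has no tab.
sample = io.StringIO("en\tI love this\nbroken row without a tab\nen\tbad stupid day\n")
out, err = io.StringIO(), io.StringIO()
run(sample, out, err)
```

In the real UDF the demo lines would be replaced by `run(sys.stdin, sys.stdout, sys.stderr)` after `import sys`. Whatever the script writes to stderr should appear in the task logs, which would show whether malformed rows are actually present in the table.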
