当阅读逐行时，PostgreSQL在某些JSON文档上抛出错误

我尝试将jeopardy json文件从Kaggle读入postgresql。数据来自一个json文档数组。我使用以下代码将其转换为python中的单行json文件（请注意，在尝试所有200 k之前，我只获取了前50个文档用于测试）：

f = open('200k_questions.json')
data = json.load(f) # this is an array of dictionaries
file = open('temp_ut.json','w') # open file to write to
max=50
for i in range(0,max):
    temp = data[i]
    json.dump(temp, file)
    file.write('\n')
file.close()
f.close()

字符串
结果文件的前4行是：

{"category": "HISTORY", "air_date": "2004-12-31", "question": "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'", "value": "$200", "answer": "Copernicus", "round": "Jeopardy!", "show_number": "4680"}
{"category": "ESPN's TOP 10 ALL-TIME ATHLETES", "air_date": "2004-12-31", "question": "'No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves'", "value": "$200", "answer": "Jim Thorpe", "round": "Jeopardy!", "show_number": "4680"}
{"category": "EVERYBODY TALKS ABOUT IT...", "air_date": "2004-12-31", "question": "'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'", "value": "$200", "answer": "Arizona", "round": "Jeopardy!", "show_number": "4680"}
{"category": "THE COMPANY LINE", "air_date": "2004-12-31", "question": "'In 1963, live on \"The Art Linkletter Show\", this company served its billionth burger'", "value": "$200", "answer": "McDonald\\'s", "round": "Jeopardy!", "show_number": "4680"}

型
在PostgreSQL中，我成功地从命令行执行了以下命令：

CREATE DATABASE jeopardy;
CREATE TABLE questions(id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY, question jsonb NOT NULL);

型
然后，为了将json文档复制到PostgreSQL中，我执行以下操作：

\COPY questions (question) FROM 'C:\Users\malawley\Documents\Northwestern MS\Course 6 MSDS 420 Database\TermPaper\data\jeopardy\temp_ut.json';

型
PostgreSQL生成此错误：
错误代码：json类型的输入语法无效请注意：Token“The”无效。CONTEXT：JSON data，line 1：...“2004-12-31”，“question”：“'In 1963，live on“The... COPY questions，line 4，column question：“{“category”：“THE COMPANY LINE”，“air_date”：“2004-12-31”，“question”：“1963年，在“艺术L”的现场...”
问题出在这个文档上（前3个将成功加载）：
{“category”：“THE COMPANY LINE”，“air_date”：“2004-12-31”，“question”：“'In 1963，live on“The Art链接字母Show”，this company served its billionth burger'“，“value”：“$200”，“answer”：“McDonald's”，“round”：“Jeopardy！"，“show_number”：“4680”}
其他几个有问题的文件是：
{“category”：“EPITAPHS & TIBUTES”，“air_date”：“2004-12-31”，“question”：“'”And away we go“”，“value”：“$400”，“answer”：“Jackie Gleason”，“round”：“Jeopardy！"，“show_number”：“4680”}
{“category”：“ESPN's Top 10 All-Time Athletes”，“air_date”：“2004-12-31”，“question”：“'No. 1：Lettered in hoops，football &长曲棍球at锡拉丘兹兹& if you think he couldn't act，ask his 11“injure”buddies'“，“value”：“$600”，“answer”：“Jim Brown”，“round”：“Jeopardy！"，“show_number”：“4680”}
有人知道这是怎么回事吗？我已经用下面的函数检查了所有200 k的python文档，它们都检查出来了。所以，我卡住了。

def is_json(myjson):
  try:
    json.loads(myjson)
  except ValueError as e:
    return False
  return True

型

你被双重逃跑咬了。
默认情况下，copy假设输入为“文本格式”。这使用\作为转义字符。
在COPY数据中可以使用反斜杠字符（）来引用数据字符，否则这些数据字符可能被用作行或列分隔符。
您的\首先由COPY解释，然后将剩余的内容传递给JSON。
例如。这个输入...

{ "question": "'In 1963, live on \"The Art Linkletter Show\" ..." }

字符串
.被copy读取并变成这个.

"question": "'In 1963, live on "The Art Linkletter Show" ..."

型
这是一个无效的JSON。
你得加倍逃跑。

{ "question": "'In 1963, live on \\"The Art Linkletter Show\\" ..." }

型
或者将数据转换为单列CSV，并将format csv添加到copy中。这听起来可能很傻，但您可以让Python csv library为您处理转义，并且可以扩展到任何列数。

当阅读逐行时，PostgreSQL在某些JSON文档上抛出错误

1条答案

相关问题

热门标签

最新问答