我尝试将jeopardy json文件从Kaggle读入postgresql。数据来自一个json文档数组。我使用以下代码将其转换为python中的单行json文件(请注意,在尝试所有200 k之前,我只获取了前50个文档用于测试):
f = open('200k_questions.json')
data = json.load(f) # this is an array of dictionaries
file = open('temp_ut.json','w') # open file to write to
max=50
for i in range(0,max):
temp = data[i]
json.dump(temp, file)
file.write('\n')
file.close()
f.close()
字符串
结果文件的前4行是:
{"category": "HISTORY", "air_date": "2004-12-31", "question": "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'", "value": "$200", "answer": "Copernicus", "round": "Jeopardy!", "show_number": "4680"}
{"category": "ESPN's TOP 10 ALL-TIME ATHLETES", "air_date": "2004-12-31", "question": "'No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves'", "value": "$200", "answer": "Jim Thorpe", "round": "Jeopardy!", "show_number": "4680"}
{"category": "EVERYBODY TALKS ABOUT IT...", "air_date": "2004-12-31", "question": "'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'", "value": "$200", "answer": "Arizona", "round": "Jeopardy!", "show_number": "4680"}
{"category": "THE COMPANY LINE", "air_date": "2004-12-31", "question": "'In 1963, live on \"The Art Linkletter Show\", this company served its billionth burger'", "value": "$200", "answer": "McDonald\\'s", "round": "Jeopardy!", "show_number": "4680"}
型
在PostgreSQL中,我成功地从命令行执行了以下命令:
CREATE DATABASE jeopardy;
CREATE TABLE questions(id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY, question jsonb NOT NULL);
型
然后,为了将json文档复制到PostgreSQL中,我执行以下操作:
\COPY questions (question) FROM 'C:\Users\malawley\Documents\Northwestern MS\Course 6 MSDS 420 Database\TermPaper\data\jeopardy\temp_ut.json';
型
PostgreSQL生成此错误:
错误代码:json类型的输入语法无效请注意:Token“The”无效。CONTEXT:JSON data,line 1:...“2004-12-31”,“question”:“'In 1963,live on“The... COPY questions,line 4,column question:“{“category”:“THE COMPANY LINE”,“air_date”:“2004-12-31”,“question”:“1963年,在“艺术L”的现场...”
问题出在这个文档上(前3个将成功加载):
{“category”:“THE COMPANY LINE”,“air_date”:“2004-12-31”,“question”:“'In 1963,live on“The Art链接字母Show”,this company served its billionth burger'“,“value”:“$200”,“answer”:“McDonald's”,“round”:“Jeopardy!",“show_number”:“4680”}
其他几个有问题的文件是:
{“category”:“EPITAPHS & TIBUTES”,“air_date”:“2004-12-31”,“question”:“'”And away we go“”,“value”:“$400”,“answer”:“Jackie Gleason”,“round”:“Jeopardy!",“show_number”:“4680”}
{“category”:“ESPN's Top 10 All-Time Athletes”,“air_date”:“2004-12-31”,“question”:“'No. 1:Lettered in hoops,football &长曲棍球at锡拉丘兹兹& if you think he couldn't act,ask his 11“injure”buddies'“,“value”:“$600”,“answer”:“Jim Brown”,“round”:“Jeopardy!",“show_number”:“4680”}
有人知道这是怎么回事吗?我已经用下面的函数检查了所有200 k的python文档,它们都检查出来了。所以,我卡住了。
def is_json(myjson):
try:
json.loads(myjson)
except ValueError as e:
return False
return True
型
1条答案
按热度按时间kiz8lqtg1#
你被双重逃跑咬了。
默认情况下,
copy
假设输入为“文本格式”。这使用\
作为转义字符。在COPY数据中可以使用反斜杠字符()来引用数据字符,否则这些数据字符可能被用作行或列分隔符。
您的
\
首先由COPY解释,然后将剩余的内容传递给JSON。例如。这个输入...
字符串
.被
copy
读取并变成这个.型
这是一个无效的JSON。
你得加倍逃跑。
型
或者将数据转换为单列CSV,并将
format csv
添加到copy
中。这听起来可能很傻,但您可以让Python csv library为您处理转义,并且可以扩展到任何列数。