postgresql SQL中的反向索引

ar5n3qh5  于 2023-11-18  发布在  PostgreSQL
关注(0)|答案(1)|浏览(179)

在本作业中,您将创建一个文档表,然后使用SQL为这些文档生成一个反向索引,以标识包含特定单词的每个文档。
仅供参考:与所提供的示例SQL不同,您将把倒排索引中的所有单词都Map为小写(即Python、PYTHON和python在倒排索引中的结尾都应该是“python”)。

CREATE TABLE docs01 (id SERIAL, doc TEXT, PRIMARY KEY(id));

CREATE TABLE invert01 (
  keyword TEXT,
  doc_id INTEGER REFERENCES docs01(id) ON DELETE CASCADE
);

字符串
以下是要插入docs 01的单行文档:

INSERT INTO docs01 (doc) VALUES
('The building blocks of programs'),
('In the next few chapters we will learn more about the vocabulary'),
('sentence structure paragraph structure and story structure of Python'),
('We will learn about the powerful capabilities of Python and how to'),
('compose those capabilities together to create useful programs'),
('There are some lowlevel conceptual patterns that we use to construct'),
('programs These constructs are not just for Python programs they are'),
('part of every programming language from machine language up to the'),
('file or even some kind of sensor like a microphone or GPS In our'),
('initial programs our input will come from the user typing data on');


下面是反向索引的前几个预期行的示例:

SELECT keyword, doc_id FROM invert01 ORDER BY keyword, doc_id LIMIT 10;


| 关键字|文档标识|
| --|--|
| 一| 9 |
| 约| 2 |
| 约| 4 |
| 和| 3 |
| 和| 4 |
| 是| 6 |
| 是| 7 |
| 块| 1 |
| 建筑| 1 |
| 能力| 4 |

INSERT INTO invert01 (keyword, doc_id)
SELECT
    lower(keyword) AS keyword,
    doc_id
FROM (
    SELECT
        id AS doc_id,
        unnest(string_to_array(doc, ' ')) AS keyword
    FROM docs01
) AS words;


我试过了,但错误显示“关键字”are“应该在2个文档中,但只在3个文档中”。请帮助我查找错误。

tvz2xvvm

tvz2xvvm1#

如前所述,每个文档都有一个关键字的多个示例,比如数字7:
('programs这些构造 * are * not just for Python programs they * are* are'),
导致查询多次产生(keyword,doc_id)配对:
| 关键字|文档标识|
| --|--|
| 是| 6 |
| 是| 7 |
| 是| 7 |
因为默认情况下SELECT是接受重复的SELECT ALL。您可以通过切换到丢弃重复的SELECT DISTINCT来防止这种情况。
您也不需要使用子查询来应用lower(),并且您可以使用string_to_table()将文档直接拆分为关键字行,而不是先拆分为数组,然后再调用unnest()Demo

INSERT INTO invert01 (keyword, doc_id)
SELECT DISTINCT lower(string_to_table(doc, ' ')) AS keyword,
                id AS doc_id
FROM docs01;

字符串
结果

SELECT keyword, doc_id FROM invert01 WHERE keyword='are' 
ORDER BY keyword, doc_id LIMIT 10;


| 关键字|文档标识|
| --|--|
| 是| 6 |
| 是| 7 |

相关问题