PostgreSQL full-text search for Japanese, Chinese, and Arabic

Posted by e0bqpujr on 2023-05-06 in PostgreSQL

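I am designing full-text search for my current project on PostgreSQL. So far it works well with ispell/myspell dictionaries. Now I need to add support for searching Chinese, Japanese, and Arabic. Where should I start? As far as I know, there are no templates or dictionaries available for these languages. Would the pg_catalog.simple configuration work?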

1bqhqjot  #1

A hint from the manual: a large list of dictionaries is available on the OpenOffice Wiki.

bz4sfanl  #2

A dictionary will not help you much with Chinese. You need to look into n-gram tokenization instead.
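As a rough illustration of the n-gram idea mentioned in this answer, the sketch below splits unsegmented text into overlapping character bigrams. It is written in Python only to show the principle; the function name `char_ngrams` is made up for this example and does not correspond to any PostgreSQL parser or extension.

```python
from typing import List


def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Split each whitespace-delimited run into overlapping character n-grams."""
    grams: List[str] = []
    for run in text.split():
        if len(run) < n:
            # A run shorter than n is kept whole so that it stays searchable.
            grams.append(run)
            continue
        grams.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return grams


print(char_ngrams("全文搜索"))  # ['全文', '文搜', '搜索']
```

If documents and queries are both tokenized this way, a query matches wherever its n-grams occur in the text, without any dictionary. The trade-off is a larger index and some false positives compared with real word segmentation.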

guykilcj  #3

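A similar question on stackoverflow.com is "How do I implement full text search in Chinese on PostgreSQL?"
Nevertheless, I will give a detailed solution below, based on my own experience and on solutions found on the Internet. I use two tools, SCWS and zhparser, for Chinese full-text search in Postgres.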

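Update 2016-01-31:

  • Make sure postgresql-server-devel-{number version} is installed, because its PGXS build infrastructure is used to build the extension for PostgreSQL.

Step 1: Install SCWS.

Note that --prefix=/usr/local/scws must follow ./configure, as on the fourth line below. Do not run ./configure on its own.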

wget http://www.xunsearch.com/scws/down/scws-1.2.2.tar.bz2
tar xvjf scws-1.2.2.tar.bz2
cd scws-1.2.2
./configure --prefix=/usr/local/scws 
make
make install

To check that the installation succeeded, run the following command:

ls -al /usr/local/scws/lib/libscws.la

Step 2: Install zhparser

git clone https://github.com/amutu/zhparser.git
cd zhparser
SCWS_HOME=/usr/local/scws/include make && make install

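Update 2016-01-31: On Mac OS X Yosemite, use the SCWS_HOME value shown above. On Ubuntu 14.04 LTS, change SCWS_HOME to /usr/local/scws.

Step 3: Configure the new extension in Postgres using zhparser

Step 3.1: Log in to your Postgres database from the terminal/command line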

psql yourdatabasename

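Step 3.2: Create the extension in Postgres. You can choose any name you like for the text search configuration (dictionarynameyouwant below).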

CREATE EXTENSION zhparser;
CREATE TEXT SEARCH CONFIGURATION dictionarynameyouwant (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION dictionarynameyouwant ADD MAPPING FOR n,v,a,i,e,l WITH simple;

After following the steps above, you can use Postgres full-text search for Chinese/Mandarin.
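Once the configuration exists, it can be passed to the standard to_tsvector and plainto_tsquery functions. The sketch below builds such a query in Python. The table `articles` with columns `id` and `body`, and the helper name `build_search_query`, are assumptions made for this example; they do not come from the answer.

```python
from typing import Tuple


def build_search_query(config: str, terms: str) -> Tuple[str, Tuple[str, str, str]]:
    """Build a parameterized full-text query for a hypothetical articles(id, body) table."""
    sql = (
        "SELECT id, body FROM articles "
        "WHERE to_tsvector(%s::regconfig, body) "
        "@@ plainto_tsquery(%s::regconfig, %s)"
    )
    # The configuration name is passed twice: once for the document, once for the query.
    return sql, (config, config, terms)


sql, params = build_search_query("dictionarynameyouwant", "全文检索")
print(sql)
print(params)
# With a driver such as psycopg2, this would be run as: cursor.execute(sql, params)
```

Using the same configuration on both sides matters, because the document and the query must be segmented by the same parser for their lexemes to match.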

Extra step for using the pg_search gem in Rails (optional):
Step 4: In app/models/yourmodel.rb, set the :dictionary option of :tsearch to the configuration name.

class YourOwnClass < ActiveRecord::Base
  # ...
  include PgSearch

  # :dictionary must be the text search configuration name chosen in Step 3.2.
  # Further :tsearch options can be added to the same hash.
  pg_search_scope :functionnameyoulike,
                  :against => [:columnsyoulike1, :columnsyoulike2],
                  :using => { :tsearch => { :dictionary => "dictionarynameyouwant" } }
end

References:

  1. SCWS install tutorial
  2. Zhparser@github.com
  3. Francs' Post - Postgres full-text search in Chinese with zhparser and SCWS
  4. Rails365.net's Post - Postgres full-text search in Chinese with pg_search gem with zhparser
  5. My Post at xuite.net - Make Postgres support full text search in Mandarin/Chinese

bakd9h0s  #4

For those landing here looking for Japanese full-text search in PostgreSQL, here is how to set it up on Ubuntu.
Install MeCab and the development environment:

apt-get install libmecab-dev libmecab2 mecab-ipadic-utf8 mecab-utils libmecab-perl libtext-mecab-perl mecab mecab-jumandic-utf8

Download textsearch_ja from https://www.postgresql.org/ftp/projects/pgFoundry/textsearch-ja/textsearch_ja/9.0.0/
The following commands are for PostgreSQL version 12:

cd textsearch_ja-9.0.0
make USE_PGXS=1 PG_CONFIG=/usr/lib/postgresql/12/bin/pg_config
sudo make USE_PGXS=1 PG_CONFIG=/usr/lib/postgresql/12/bin/pg_config install

This produces output similar to:

/bin/mkdir -p '/usr/lib/postgresql/12/lib'
/bin/mkdir -p '/usr/share/postgresql/12/contrib'
/usr/bin/install -c -m 755  textsearch_ja.so '/usr/lib/postgresql/12/lib/textsearch_ja.so'
/usr/bin/install -c -m 644 .//uninstall_textsearch_ja.sql textsearch_ja.sql '/usr/share/postgresql/12/contrib/'
/bin/mkdir -p '/usr/lib/postgresql/12/lib/bitcode/textsearch_ja'
/bin/mkdir -p '/usr/lib/postgresql/12/lib/bitcode'/textsearch_ja/ '/usr/lib/postgresql/12/lib/bitcode'/textsearch_ja/pgut/
/usr/bin/install -c -m 644 textsearch_ja.bc '/usr/lib/postgresql/12/lib/bitcode'/textsearch_ja/./
/usr/bin/install -c -m 644 encoding_eucjp.bc '/usr/lib/postgresql/12/lib/bitcode'/textsearch_ja/./
/usr/bin/install -c -m 644 encoding_utf8.bc '/usr/lib/postgresql/12/lib/bitcode'/textsearch_ja/./
/usr/bin/install -c -m 644 pgut/pgut-be.bc '/usr/lib/postgresql/12/lib/bitcode'/textsearch_ja/pgut/
cd '/usr/lib/postgresql/12/lib/bitcode' && /usr/lib/llvm-10/bin/llvm-lto -thinlto -thinlto-action=thinlink -o textsearch_ja.index.bc textsearch_ja/textsearch_ja.bc textsearch_ja/encoding_eucjp.bc textsearch_ja/encoding_utf8.bc textsearch_ja/pgut/pgut-be.bc

Then, in the generated textsearch_ja.sql, change LANGUAGE 'C' to LANGUAGE 'c' (lowercase):

perl -pi -E "s/LANGUAGE 'C'/LANGUAGE 'c'/" textsearch_ja.sql

Then load it into PostgreSQL as a superuser:

sudo -u postgres psql -f textsearch_ja.sql

To add it to an existing database instead:

sudo -u postgres psql -d my_database -f textsearch_ja.sql

This produces output similar to:

SET
BEGIN
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE TEXT SEARCH PARSER
COMMENT
CREATE FUNCTION
CREATE TEXT SEARCH TEMPLATE
CREATE TEXT SEARCH DICTIONARY
CREATE TEXT SEARCH CONFIGURATION
COMMENT
ALTER TEXT SEARCH CONFIGURATION
ALTER TEXT SEARCH CONFIGURATION
ALTER TEXT SEARCH CONFIGURATION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
COMMIT

To test it, run the examples from https://github.com/HiraokaHyperTools/textsearch_ja:

SELECT ja_wakachi('分かち書きを行います。');

produces:

         ja_wakachi
----------------------------
 分かち書き を 行い ます 。
(1 row)

SELECT furigana('漢字の読みをカタカナで返します。');

produces:

               furigana
--------------------------------------
 カンジノヨミヲカタカナデカエシマス。
(1 row)

SELECT * FROM ts_debug('japanese', E'日\n本\n語\n文\n字\n中\nの\n改\n行\nは\n除\n去\n');

produces:

 alias |    description    | token  |  dictionaries   |  dictionary   | lexemes
-------+-------------------+--------+-----------------+---------------+----------
 word  | Word, all letters | 日本語 | {japanese_stem} | japanese_stem | {日本語}
 word  | Word, all letters | 文字   | {japanese_stem} | japanese_stem | {文字}
 word  | Word, all letters | 中     | {japanese_stem} | japanese_stem | {中}
 blank | Space symbols     | の     | {}              |               |
 word  | Word, all letters | 改行   | {japanese_stem} | japanese_stem | {改行}
 blank | Space symbols     | は     | {}              |               |
 word  | Word, all letters | 除去   | {japanese_stem} | japanese_stem | {除去}
(7 rows)

SELECT * FROM ts_debug('japanese', E'Line\nbreaks\nin\nEnglish\ntext\nare\nreserved.');

produces:

   alias   |   description   |  token   |  dictionaries  |  dictionary  |  lexemes
-----------+-----------------+----------+----------------+--------------+-----------
 asciiword | Word, all ASCII | Line     | {english_stem} | english_stem | {line}
 blank     | Space symbols   |          | {}             |              |
 asciiword | Word, all ASCII | breaks   | {english_stem} | english_stem | {break}
 blank     | Space symbols   |          | {}             |              |
 asciiword | Word, all ASCII | in       | {english_stem} | english_stem | {}
 blank     | Space symbols   |          | {}             |              |
 asciiword | Word, all ASCII | English  | {english_stem} | english_stem | {english}
 blank     | Space symbols   |          | {}             |              |
 asciiword | Word, all ASCII | text     | {english_stem} | english_stem | {text}
 blank     | Space symbols   |          | {}             |              |
 asciiword | Word, all ASCII | are      | {english_stem} | english_stem | {}
 blank     | Space symbols   |          | {}             |              |
 asciiword | Word, all ASCII | reserved | {english_stem} | english_stem | {reserv}
 blank     | Space symbols   | .        | {}             |              |
(14 rows)

SELECT * FROM ts_debug('japanese', '日本語とEnglishがmixedな文も解析OKです。');

produces:

   alias   |    description    |  token  |  dictionaries   |  dictionary   |  lexemes
-----------+-------------------+---------+-----------------+---------------+-----------
 word      | Word, all letters | 日本語  | {japanese_stem} | japanese_stem | {日本語}
 blank     | Space symbols     | と      | {}              |               |
 asciiword | Word, all ASCII   | English | {english_stem}  | english_stem  | {english}
 blank     | Space symbols     | が      | {}              |               |
 asciiword | Word, all ASCII   | mixed   | {english_stem}  | english_stem  | {mix}
 blank     | Space symbols     | な      | {}              |               |
 word      | Word, all letters | 文      | {japanese_stem} | japanese_stem | {文}
 blank     | Space symbols     | も      | {}              |               |
 word      | Word, all letters | 解析    | {japanese_stem} | japanese_stem | {解析}
 asciiword | Word, all ASCII   | OK      | {english_stem}  | english_stem  | {ok}
 blank     | Space symbols     | です    | {}              |               |
 blank     | Space symbols     | 。      | {}              |               |
(12 rows)

SELECT s
  FROM regexp_split_to_table(to_tsvector('japanese',
 '語尾は基本形に戻されます。')::text, ' ') AS t(s)
  ORDER BY s;

produces:

     s
------------
 'れる':4
 '基本形':2
 '戻す':3
 '語尾':1
(4 rows)

SELECT s
  FROM regexp_split_to_table(to_tsvector('japanese',
 'ユーザとユーザーは正規化されます。ミラーとミラは別扱い。')::text, ' ') AS t(s)
  ORDER BY s;

produces:

      s
--------------
 'する':5
 'ミラー':7
 'ミラ':8
 'ユーザ':1,2
 'れる':6
 '別':9
 '化':4
 '扱い':10
 '正規':3
(9 rows)
