Your similar query takes only about 7 ms, while mine has a very large number of shared hits. With my own Chinese documents (100,000 of them), a similar query also takes several thousand milliseconds. Is something misconfigured on my side?
set smlar.type = 'tfidf';
set smlar.stattable = 'documents_body_stats';
set smlar.persistent_cache = true;
set smlar.threshold = 0.4;
set smlar.tf_method = 'n';
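For reference, smlar.stattable is expected to follow the layout described in the smlar README: one row per distinct token with its document frequency (ndoc), plus a single row with a NULL value holding the total document count. A minimal sketch, assuming text tokens:

```sql
-- Shape of the TF-IDF statistics table that smlar.stattable points at
-- (per the smlar README); the table/column names here match the setting above.
create table documents_body_stats (
    value text unique,                      -- a distinct token; NULL row = corpus size
    ndoc  int4 not null check (ndoc > 0)    -- number of documents containing the token
);
```

If this table is missing, stale, or lacks the NULL-value total row, the TF-IDF weights smlar computes will be off even though the query still runs.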
I followed the steps on your page one by one. The data came from https://www.mockaroo.com/c5418bd0, downloaded with curl "https://api.mockaroo.com/api/c5418bd0?count=100000&key=b21830e0" > "tfidf.csv". A single request returns at most 5,000 rows, so I ran the download 20 times to get 100,000 rows.
Here is my execution plan:
postgres=# explain (analyze,verbose,timing,costs,buffers) select
postgres-# *,
postgres-# smlar( tokenize(body), '{Etiam,pretium,iaculis}'::text[] )
postgres-# from
postgres-# documents
postgres-# where
postgres-# tokenize(body) %% '{Etiam,pretium,iaculis}'::text[] -- where TFIDF similarity >= smlar.threshold
postgres-# order by
postgres-# smlar( tokenize(body), '{Etiam,pretium,iaculis}'::text[] ) desc
postgres-# limit 10;
                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Limit  (cost=368.05..368.07 rows=10 width=258) (actual time=4957.830..4957.832 rows=10 loops=1)
   Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{Etiam,pretium,iaculis}'::text[]))
   Buffers: shared hit=4870
   ->  Sort  (cost=368.05..368.30 rows=100 width=258) (actual time=4957.829..4957.829 rows=10 loops=1)
         Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{Etiam,pretium,iaculis}'::text[]))
         Sort Key: (smlar(regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text), '{Etiam,pretium,iaculis}'::text[])) DESC
         Sort Method: top-N heapsort  Memory: 26kB
         Buffers: shared hit=4870
         ->  Bitmap Heap Scan on public.documents  (cost=17.05..365.89 rows=100 width=258) (actual time=111.757..4957.313 rows=666 loops=1)
               Output: document_id, body, smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{Etiam,pretium,iaculis}'::text[])
               Recheck Cond: (regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text) %% '{Etiam,pretium,iaculis}'::text[])
               Rows Removed by Index Recheck: 17737
               Heap Blocks: exact=3525
               Buffers: shared hit=4870
               ->  Bitmap Index Scan on documents_tokenize_idx  (cost=0.00..17.03 rows=100 width=0) (actual time=95.413..95.413 rows=18403 loops=1)
                     Index Cond: (regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text) %% '{Etiam,pretium,iaculis}'::text[])
                     Buffers: shared hit=1345
 Planning time: 0.456 ms
 Execution time: 4957.952 ms
(19 rows)
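Nearly all of the 4.9 s is spent in the Bitmap Heap Scan (actual time 111.757..4957.313), where regexp_split_to_array(lower(body), ...) is re-evaluated for every candidate row during the recheck, and again inside each smlar() call. One common workaround, sketched here as an assumption (the body_tokens column and index name are hypothetical; _text_sml_ops is smlar's GIN operator class for text[]), is to materialize the token array once:

```sql
-- Hypothetical sketch: store the tokenized body so the recheck and smlar()
-- compare precomputed arrays instead of re-running the regexp per row.
alter table documents add column body_tokens text[];
update documents set body_tokens = tokenize(body);
create index documents_body_tokens_idx
    on documents using gin (body_tokens _text_sml_ops);

-- The query then filters and ranks on the stored column:
select *,
       smlar(body_tokens, '{Etiam,pretium,iaculis}'::text[])
from documents
where body_tokens %% '{Etiam,pretium,iaculis}'::text[]
order by smlar(body_tokens, '{Etiam,pretium,iaculis}'::text[]) desc
limit 10;
```

If the expression index must stay, the same effect can come from making the per-row expression cheaper, since the repeated tokenization is what drives both the high shared-hit count and the multi-second recheck.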