postmodern-text-searching

Postmodern Examples Table of Contents Arrays

Searching

Introductions to indexing and searching

http://www.sai.msu.su/~megera/postgres/fts/doc/fts-query.html

http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html

Text Indexing

There are a couple of different ways to index a posgresql database, You could do the following to create an index, but the whole thing must be reindexed or it goes out of date. (And reindexing takes a long time)

CREATE INDEX idx_countries_text ON countries USING gin  (to_tsvector('english'::regconfig, text)) 

A different way of indexing is to add a ts_vector column to the table as discussed in http://linuxgazette.net/164/sephton.html. Then you search against the ts_vector column, not the index.

ALTER TABLE countries ADD COLUMN tsv tsvector; UPDATE countries SET tsv =      to_tsvector('english', text); CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE   ON countries FOR EACH ROW EXECUTE PROCEDURE   tsvector_update_trigger(tsv, 'pg_catalog.english', text);

Using this method, the search query is simpler, using the tsv column.:

(query "select id,name,text from countries where tsv @@ to_tsquery('english', $1)"  "search-term1 & searchterm2")

Or using the :raw keyword

(postmodern:query (:select 'id 'name :from 'countries :where (:raw "to_tsvector('english' , text ) @@ to_tsquery('english' , $1)")) final-search-string)

Searching

Postgres provides 2 functions for matching against tsvectors: plainto_tsquery and to_tsquery. plainto_tsquery takes every word in the search string and assumes that they should all be connected by an "and" operator ('&').

to_tsquery() provides boolean abilities, but each word in the search string must be separated by a boolean operator. ('&' '|' or '!') (and, or, not). Unfortunately, it will throw up on text which doesn't have the separators (no phrases). It will not accept text which is not separated by '&', '|' or '!'. It does accept braces, which are used to indicate operator precidence, but two tokens which translate to different lexemes directly after one another will cause totsquery() to fail.

select to_tsquery('wonderful text'); ERROR: syntax error in tsquery: "wonderful text" select to_tsquery('wonderful | text'); to_tsquery ---------------------- 'wonderful' | 'text'

Simple search

(query "select id,name from countries where to_tsvector('english', text) @@ to_tsquery('english', 'oid')")(query (:select 'id 'name :from 'countries :where (:raw "to_tsvector('english', text) @@ to_tsquery('english', 'oid')")))

After indexing, this seems to work. The following postmodern query will return the obvious records quickly.

(query "select id,name from countries where         to_tsvector('english', text) @@ to_tsquery('english', $1)"  "technical         & India")

The string "fat & (rat | cat) & ! dog" will look for records that have the word "fat" and either the word "rat" or the word "cat" but cannot also have the word "dog".

The string "super:*" will look for records with a words that begins with "super".

Here is a function searching against a code database:

(defun search-code (search-string &optional (limit 20))
  "Tries to run a text search on a string through source-code-functions2. By using plainto_tsquery any words passed 
   to search code are treated as connected by 'and'. So a search-string like 'graphviz edge ignore' will return the 
   source code for functions which have all three words."
  (postmodern:query (:limit (:order-by (:select 'id 'name 'package-name 'code
                                         :from (:as (:select 'id 'name 'package-name 'code 'tsv
                                                             :from 'source-code-functions2
                                                             (:as (:plainto_tsquery '$1) 'q)
                                                             :where (:@@ 'tsv  'q))
                                                    't1))
                                       (:desc (:ts-rank-cd 't1.tsv (:plainto-tsquery '$2))))
                            '$3)
                    search-string search-string limit))

Ranking

select id,name,ts_rank_cd(to_tsvector(countries.text), to_tsquery('stamp')) as rank  from countries where to_tsvector(countries.text) @@ to_tsquery('stamp')

Snippets with the search text in bold

select id,name,ts_headline(countries.text, to_tsquery('stamp')) as snippet  from countries  where to_tsvector(countries.text) @@ to_tsquery('stamp')

More information at Datatype text search