Picky

Documentation

Single Page Help Index

This is the one page help document for Picky.

Search for things using your browser (use ⌘F).

Fix typos directly on the GitHub page of a section using the edit button.

Getting started

It's All Ruby. You'll never feel powerless. Look at your index data anytime.

Generating an app

Creating an example app to get you up and running fast, Servers or Clients.

Generating them:

More information on the applications:

Integration in Rails/Sinatra etc.

How to integrate Picky in:

Tokenizing

How data is cut into little pieces for the index and when searching.

Indexes

How the data is stored and what you can do with Indexes.

Configuring an index:

How does data get into an index?

How is the data categorized?

How is the data prepared?

Getting at the data:

There are four different store types:

Advanced topics:

Searching

How to configure a search interface over an index (or multiple).

What options does a user have when searching?

Advanced topics:

Facets

When you need a slice over a category's data.

Results

What a picky search returns.

JavaScript

We include a JavaScript library to make writing snazzy interfaces easier – see the options.

Thanks

A bit of thanks!

All Ruby

Never forget this: Picky is all Ruby, all the time!

Even though we only describe examples of classic and Sinatra style servers, Picky can be included directly in Rails, as a client or server. Or in DRb. Or in your simple script without HTTP. Anywhere you like, as long as it's Ruby, really.

To drive the point home, remember that Picky is mainly two pieces working together: An index, and a search interface on indexes.

The index normally has a source, knows how to tokenize data, and has a few data categories. And the search interface normally knows how to tokenize incoming queries. That's it (copy and run in a script):

require 'picky'

Person = Struct.new :id, :first, :last

index = Picky::Index.new :people do
  source { People.all }
  indexing splits_text_on: /[\s-]/
  category :first
  category :last
end
index.add Person.new(1, 'Florian', 'Hanke')
index.add Person.new(2, 'Peter', 'Mayer-Miller')

people = Picky::Search.new index do
  searching splits_text_on: /[\s,-]/
end

results = people.search 'Miller'
p results.ids # => [2]

You can put these pieces anywhere, independently.

Transparency

Picky tries its best to be transparent so you can go have a look if something goes wrong. It wants you to never feel powerless.

All the indexes can be viewed in the /index directory of the project. They are waiting for you to inspect their JSONy goodness. Should anything not work with your search, you can investigate how it is indexed by viewing the actual index files (remember, they are in readable JSON) and change your indexing parameters accordingly.

You can also log as much data as you want to help you improve your search application until it's working perfectly.

Generators

Picky offers a few generators to have a running server and client up in 5 minutes. So you can either get started right away

or, run gem install

gem install picky-generators

and simply enter

picky generate

This will raise a Picky::Generators::NotFoundException and show you the possibilities.

The "All In One" Client/Server might be interesting for Heroku projects, as it is a bit complicated to set up two servers that interact with each other.

Servers

Currently, Picky offers two generated example projects that you can adapt to your project: Separate Client and Server (recommended) and All In One.

If this is your first time using Picky, we suggest starting out with these, even if you already have a project where you want to integrate Picky.

Sinatra

The server is generated with

picky generate server target_directory

and generates a full Sinatra server that you can try immediately. Just follow the instructions.

All In One

All In One is actually a single Sinatra server containing the Server AND the client. This server is generated with

picky generate all_in_one target_directory

and generates a full Sinatra Picky server and client in one that you can try immediately. Just follow the instructions.

Clients

Picky currently offers an example Sinatra client that you can adapt to your project (or look at it to get a feeling for how to use Picky in Rails).

Sinatra

This client is generated with

picky generate client target_directory

and generates a full Sinatra Picky client (including JavaScript etc.) that you can try immediately. Just follow the instructions.

Servers / Applications

Picky, from version 3.0 onwards, is designed to run anywhere, in anything. An octopus has eight legs, remember?

This means you can have a Picky server running in a DRb instance if you want to. Or in irb, for example.

We do run and test the Picky server in two styles, Classic and Sinatra.

But don't let that stop you from just using it in a class or just a script. This is a perfectly ok way to use Picky:

require 'picky'

include Picky # So we don't have to type Picky:: everywhere.

books_index = Index.new(:books) do
  source Sources::CSV.new(:title, :author, file: 'library.csv')
  category :title
  category :author
end

books_index.index
books_index.reload

books = Search.new books_index do
  boost [:title, :author] => +2
end

results = books.search "test"
results = books.search "alan turing"

require 'pp'
pp results.to_hash

More Ruby, more power to you!

Sinatra Style

A Sinatra server is usually just a single file. In Picky, it is a top-level file named

app.rb

We recommend using the modular Sinatra style as opposed to the classic style. It's possible to write a Picky server in the classic style, but using the modular style offers more options.

require 'sinatra/base'
require 'picky'

class BookSearch < Sinatra::Application

  # Picky:: is spelled out since this file does not `include Picky`.
  books_index = Picky::Index.new(:books) do
    source { Book.order("isbn ASC") }
    category :title
    category :author
  end

  books = Picky::Search.new books_index do
    boost [:title, :author] => +2
  end

  get '/books' do
    results = books.search params[:query],
                           params[:ids]    || 20,
                           params[:offset] ||  0
    results.to_json
  end

end

This is already a complete Sinatra server.

Routing

The Sinatra Picky server uses the same routing as Sinatra (of course). More information on Sinatra routing.

If you use the server with the picky client software (provided with the picky-client gem), you should return JSON from the Sinatra get. Just call to_json on the returned results to get the results in JSON format.

get '/books' do
  results = books.search params[:query], params[:ids] || 20, params[:offset] ||  0
  results.to_json
end

The above example search can be called using for example curl:

curl 'localhost:8080/books?query=test'

Logging

TODO Update this section.

This is one way to do it:

require 'logger'

MyLogger = Logger.new "log/search.log"

# ...

get '/books' do
  results = books.search "test"
  MyLogger.info results
  results.to_json
end

or set it up in separate files for different environments:

require "logging/#{PICKY_ENVIRONMENT}"

Note that this is not Rack logging, but Picky search engine logging. The resulting file can be used with the picky-statistics gem.

All In One (Client + Server)

The All In One server is a Sinatra server and a Sinatra client rolled in one.

It's best to just generate one and look at it:

picky generate all_in_one all_in_one_test

and then follow the instructions.

When would you use an All In One server? One place is Heroku, since it is a bit more complicated to set up two servers that interact with each other.

It's nice for small convenient searches. For production setups we recommend using a separate server to make everything separately cacheable etc.

Integration

How do you integrate Picky in…?

Rails

There are two basic ways to integrate Picky in Rails: inside your Rails app, or with an external Picky server.

The advantage of the first setup is that you don't need to manage an external server. However, having a separate search server is much cleaner: You don't need to load the indexes on Rails startup as you just leave the search server running separately.

Inside your Rails app

If you just want a small search engine inside your Rails app, this is the way to go.

In config/initializers/picky.rb, add the following: (lots of comments to help you)

# Set the Picky logger.
#
Picky.logger = Picky::Loggers::Silent.new
# Picky.logger = Picky::Loggers::Concise.new
# Picky.logger = Picky::Loggers::Verbose.new

# Set up an index and store it in a constant.
#
BooksIndex = Picky::Index.new :books do
  # Our keys are usually integers.
  #
  key_format :to_i
  # key_format :to_s # From eg. Redis they are strings.
  # key_format ... (whatever method needs to be called on
  # the id of what you are indexing)

  # Some indexing options to start with.
  # Please see: http://florianhanke.com/picky/documentation.html#tokenizing
  # on what the options are.
  #
  indexing removes_characters: /[^a-z0-9\s\/\-\_\:\"\&\.]/i,
           stopwords:          /\b(and|the|of|it|in|for)\b/i,
           splits_text_on:     /[\s\/\-\_\:\"\&\/]/,
           rejects_token_if:   lambda { |token| token.size < 2 }

  # Define categories on your data.
  #
  # They have a lot of options, see:
  # http://florianhanke.com/picky/documentation.html#indexes-categories
  #
  category :title
  category :subtitle
  category :author
  category :isbn,
           :partial => Picky::Partial::None.new # Only full matches
end

# BookSearch is the search interface
# on the books index. More info here:
# http://florianhanke.com/picky/documentation.html#search
#
BookSearch = Picky::Search.new BooksIndex

# We are explicitly indexing the book data.
#
Book.all.each { |book| BooksIndex.add book }

That's already a nice setup. Whenever Rails starts up, this will add all books to the index.

From anywhere (if you have multiple, call Picky::Indexes.index to index all).

Ok, this sets up the index and the indexing. What about the model?

In the model, here app/models/book.rb add this:

# Two callbacks.
#
after_save    :picky_index
after_destroy :picky_index

# Updates the Picky index.
#
def picky_index
  if destroyed?
    BooksIndex.remove id
  else
    BooksIndex.replace self
  end
end

I actually recommend using after_commit, but it did not work at the time of writing.

Now, in the controller, you need to return some results to the user.

# GET /books/search
#
def search
  results = BookSearch.search params[:query], params[:ids] || 20, params[:offset] || 0

  # Render nicely as a partial.
  #
  results = results.to_hash
  results.extend Picky::Convenience
  results.populate_with Book do |book|
    render_to_string :partial => "book", :object => book
  end

  respond_to do |format|
    format.html do
      render :text => "Book result ids: #{results.ids.to_s}"
    end
    format.json do
      render :text => results.to_json
    end
  end
end

The first line executes the search using query params. You can try this using curl:

curl 'http://127.0.0.1:4567/books/search?query=test'

The next few lines use the results as a hash, and populate the results with data loaded from the database, rendering a book partial.

Then, we respond to HTML requests with a simple web page, or respond to JSON requests with the results rendered in JSON.

As you can see, you can do whatever you want with the results. You could use this in an API, or send simple text to the user, or...

TODO Using the Picky client JavaScript.

External Picky server

TODO

Advanced Ideas

TODO Reloading indexes live

TODO Prepending the current user to filter

# Prepends the current user filter to
# the current query.
#
query = "user:#{current_user.id} #{params[:query]}"

Sinatra

TODO

TODO Also mention Padrino.

DRb

TODO

Ruby Script

TODO

Tokenizing

The indexing method in an Index describes how index data is handled.

The searching method in a Search describes how queries are handled.

This is where you use these options:

Picky::Index.new :books do
  indexing options_hash_or_tokenizer
end

Search.new *indexes do
  searching options_hash_or_tokenizer
end

Both take either an options hash, your hand-rolled tokenizer, or a Picky::Tokenizer instance initialized with the options hash.

Options

Picky by default goes through the following list, in order:

  1. substitutes_characters_with: A character substituter that responds to #substitute(text) #=> substituted text
  2. removes_characters: Regexp of characters to remove.
  3. stopwords: Regexp of stopwords to remove.
  4. splits_text_on: Regexp on where to split the query text, including category qualifiers.
  5. removes_characters_after_splitting: Regexp on which characters to remove after the splitting.
  6. normalizes_words: [[/matching_regexp/, 'replace match \1']]
  7. max_words: How many words will be passed into the core engine. Default: Infinity (Don't go there, ok?).
  8. rejects_token_if: ->(token){ token == 'hello' }
  9. case_sensitive: true or false, false is default.
  10. stems_with: A stemmer, i.e. an object that responds to stem(text) and returns stemmed text.
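
To make the ordering concrete, here is a minimal plain-Ruby sketch of a few of these steps applied in the listed order. It is an illustration only, not Picky's implementation: the method name sketch_tokenize is made up for this example, and only five of the ten options are modelled.

```ruby
# Illustration only: applies some of the options above in the listed order.
# sketch_tokenize is a hypothetical helper, not part of Picky.
def sketch_tokenize(text, removes_characters: nil, stopwords: nil,
                    splits_text_on: /\s/, rejects_token_if: nil,
                    case_sensitive: false)
  text   = text.gsub(removes_characters, '') if removes_characters # step 2
  text   = text.gsub(stopwords, '')          if stopwords          # step 3
  tokens = text.split(splits_text_on).reject(&:empty?)             # step 4
  tokens = tokens.reject(&rejects_token_if)  if rejects_token_if   # step 8
  tokens = tokens.map(&:downcase)            unless case_sensitive # step 9
  tokens
end

tokens = sketch_tokenize "The Lord of the Rings: Part-1",
                         removes_characters: /[^a-z0-9\s\-]/i,
                         stopwords:          /\b(and|the|of|it|in|for)\b/i,
                         splits_text_on:     /[\s\-]/,
                         rejects_token_if:   ->(token) { token.size < 2 }

p tokens # => ["lord", "rings", "part"]
```

Note how the order matters: the stopwords are removed before splitting, so the stopword regexp still sees the whole text.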

You pass the above options into

Search.new *indexes do
  searching options_hash
end

You can provide your own tokenizer:

Search.new books_index do
  searching MyTokenizer.new
end

TODO Update what the tokenizer needs to return.

The tokenizer needs to respond to the method #tokenize(text), returning a Picky::Query::Tokens object. If you have an array of tokens, e.g. [:my, :nice, :tokens], you can pass it into Picky::Query::Tokens.process(my_tokens) to get the tokens and return these.

rake 'try[text,some_index,some_category]' (some_index, some_category optional) tells you how a given text is indexed.

A custom tokenizer needs to be programmed efficiently if you want your search engine to be fast.

Tokenizer

Even though you usually provide options (see above), you can provide your own tokenizer:

Picky::Index.new :books do
  indexing MyTokenizer.new
end

The tokenizer must respond to tokenize(text) and return [tokens, words], where tokens is an Array of processed tokens and words is an Array of words that represent the original words in the query (or as close as possible to the original words).

It is also possible to return [tokens], where tokens is the Array of processed query words. (Picky will then just use the tokens as words)
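
As a sketch of the [tokens, words] form described above, the following hypothetical tokenizer keeps the original words alongside the processed tokens. The class name DowncasingCommaTokenizer is invented for this example, and the code runs without Picky.

```ruby
# Hypothetical tokenizer: splits on commas, keeps the original words,
# and uses downcased copies as the processed tokens.
class DowncasingCommaTokenizer
  def tokenize(text)
    words  = text.split(',').map(&:strip) # as close to the input as possible
    tokens = words.map(&:downcase)        # what would be looked up
    [tokens, words]
  end
end

tokens, words = DowncasingCommaTokenizer.new.tokenize "Hello, World!"
p tokens # => ["hello", "world!"]
p words  # => ["Hello", "World!"]
```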

Examples

A very simple tokenizer that just splits the input on commas:

class MyTokenizer
  def tokenize text
    tokens = text.split ','
    [tokens]
  end
end

MyTokenizer.new.tokenize "Hello, world!" # => [["Hello", " world!"]]

Picky::Index.new :books do
  indexing MyTokenizer.new
end

The same can be achieved with this:

Picky::Index.new :books do
  indexing splits_text_on: ','
end

Notes

Usually, you use the same options for indexing and searching:

tokenizer_options = { ... }

index = Picky::Index.new :example do
  indexing tokenizer_options
end

Search.new index do
  searching tokenizer_options
end

However, consider this example. Let's say your data contains lots of words that look like this: all-data-are-tokenized-by-dashes. And people would search for them using spaces to keep words apart: searching for data. In this case it's a good idea to split the data and the query differently. Split the data on dashes, and queries on \s:

index = Picky::Index.new :example do
  indexing splits_text_on: /-/
end

Search.new index do
  searching splits_text_on: /\s/
end

The rule number one to remember when tokenizing is: Tokenized query text needs to match the text that is in the index.
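
The rule can be checked by hand with plain String#split. This sketch uses no Picky code, and the variable names are chosen for the example; it only shows why the two split patterns in the dashes example produce matching tokens.

```ruby
indexed_text = "all-data-are-tokenized-by-dashes"
query_text   = "tokenized data"

index_tokens = indexed_text.split(/-/)  # how the data is split
query_tokens = query_text.split(/\s/)   # how the query is split

p index_tokens # => ["all", "data", "are", "tokenized", "by", "dashes"]
p query_tokens # => ["tokenized", "data"]

# Every query token has a counterpart in the indexed tokens.
matches = query_tokens.all? { |token| index_tokens.include?(token) }
p matches # => true

# Splitting the data on whitespace instead leaves one big token,
# so none of the query tokens can match.
mismatched = indexed_text.split(/\s/)
no_match   = query_tokens.none? { |token| mismatched.include?(token) }
p no_match # => true
```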

So both the index and the query need to tokenize to the same string.

To check this, either look in the /index directory (the "prepared" files contain the tokenized data), or use Picky's try rake task:

$ rake try[test]
"test" is saved in the Picky::Indexes index as ["test"]
"test" as a search will be tokenized as ["test"]

You can tell Picky which index, or even category to use:

$ rake try[test,books]
$ rake try[test,books,title]

Indexes

Indexes do three things:

Types

Picky offers a choice of four index types: in-memory, Redis, SQLite, and file.

This is how they look in code:

books_memory_index = Index.new :books do
  # Configuration goes here.
end

books_redis_index = Index.new :books do
  backend Backends::Redis.new
  # Configuration goes here.
end

Both save the preprocessed data from the data source in the /index directory so you can go look if the data is preprocessed correctly.

Indexes are then used in a Search interface.

Searching over one index:

books = Search.new books_index

Searching over multiple indexes:

media = Search.new books_index, dvd_index, mp3_index

The resulting ids should be from the same id space to be useful – or the ids should be exclusive, such that e.g. a book id does not collide with a dvd id.

In-Memory / File-based

The in-memory index transparently saves its indexes as JSON files that reside in the /index directory.

When the server is started, they are loaded into memory. As soon as the server is stopped, the indexes are deleted from memory.

Indexing regenerates the JSON index files and can be reloaded into memory, even in the running server (see below).

Redis

The Redis index saves its indexes in the Redis server on the default port, using database 15.

When the server is started, it connects to the Redis server and uses the indexes in the key-value store.

Indexing regenerates the indexes in the Redis server – you do not have to restart the server running Picky.

SQLite

TODO

File

TODO

Accessing

If you don't have access to your indexes directly, like so

books_index = Index.new(:books) do
  # ...
end

books_index.do_something_with_the_index

and for example you'd like to access the index from a rake task, you can use

Picky::Indexes

to get all indexes.

To get a single index use

Picky::Indexes[:index_name]

and to get a single category of an index, use

Picky::Indexes[:index_name][:category_name]

That's it.

Configuration

This is all you can do to configure an index:

books_index = Index.new :books do
  source   { Book.order("isbn ASC") }

  indexing removes_characters:                 /[^a-z0-9\s\:\"\&\.\|]/i,                       # Default: nil
           stopwords:                          /\b(and|the|or|on|of|in)\b/i,                   # Default: nil
           splits_text_on:                     /[\s\/\-\_\:\"\&\/]/,                           # Default: /\s/
           removes_characters_after_splitting: /[\.]/,                                         # Default: nil
           normalizes_words:                   [[/\$(\w+)/i, '\1 dollars']],                   # Default: nil
           rejects_token_if:                   lambda { |token| token == :blurf },             # Default: nil
           case_sensitive:                     true,                                           # Default: false
           substitutes_characters_with:        Picky::CharacterSubstituters::WestEuropean.new, # Default: nil
           stems_with:                         Lingua::Stemmer.new                             # Default: nil

  category :id
  category :title,
           partial:    Partial::Substring.new(:from => 1),
           similarity: Similarity::DoubleMetaphone.new(2),
           qualifiers: [:t, :title, :titre]
  category :author,
           partial: Partial::Substring.new(:from => -2)
  category :year,
           partial: Partial::None.new,
           qualifiers: [:y, :year, :annee]

  result_identifier 'boooookies'
end

Usually you won't need to configure all that.

But if your boss comes in the door and asks why X is not found… you know. And you can improve the search engine relatively quickly and painlessly.

More power to you.

Data Sources

Data sources define where the data for an index comes from. There are explicit data sources and implicit data sources.

Explicit Data Sources

Explicit data sources are mentioned in the index definition using the #source method.

You define them on an index:

Index.new :books do
  source Book.all # Loads the data instantly.
end

Index.new :books do
  source { Book.all } # Loads on indexing. Preferred.
end

Or even on a single category:

Index.new :books do
  category :title,
           source: lambda { Book.all }
end

TODO more explanation how index sources and single category sources might work together.

Explicit data sources must respond to #each, for example, an Array.

Responding to #each

Picky supports any data source as long as it supports #each.

See under Flexible Sources how you can use this.
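
As a small sketch of what "supports #each" can mean beyond an Array, here is a hypothetical source class that turns comma-separated lines into objects with an id and readers. The names Record and LineSource are made up for this example, and the code does not depend on Picky.

```ruby
# A record with an id and the readers a category would call.
Record = Struct.new(:id, :name, :color)

# Hypothetical source: yields one Record per comma-separated line.
class LineSource
  def initialize(lines)
    @lines = lines
  end

  def each
    @lines.each do |line|
      id, name, color = line.chomp.split(',')
      yield Record.new(id.to_i, name, color)
    end
  end
end

source = LineSource.new(["1,pete,red\n", "2,joey,green\n"])

names = []
source.each { |record| names << record.name }
p names # => ["pete", "joey"]
```

An object like this could then be handed to an index in a source block, in the same way as the array in the example that follows.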

In short. Model:

class Monkey
  attr_reader :id, :name, :color
  def initialize id, name, color
    @id, @name, @color = id, name, color
  end
end

The data:

monkeys = [
  Monkey.new(1, 'pete', 'red'),
  Monkey.new(2, 'joey', 'green'),
  Monkey.new(3, 'hans', 'blue')
]

Setting the array as a source

Index::Memory.new :monkeys do
  source   { monkeys }
  category :name
  category :couleur, :from => :color # The couleur category will take its data from the #color method.
end

Delayed

If you define the source directly in the index block, it will be evaluated instantly:

Index::Memory.new :books do
  source Book.order('title ASC')
end

This works with ActiveRecord and other similar ORMs since Book.order returns a proxy object that will only be evaluated when the server is indexing.

For example, this would instantly get the records, since #all is a kicker method:

Index::Memory.new :books do
  source Book.all # Not the best idea.
end

In this case, it is better to give the source method a block:

Index::Memory.new :books do
  source { Book.all }
end

This block will be executed as soon as the indexing is running, but not earlier.
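
The difference between the two forms is ordinary Ruby block semantics. The following sketch uses a stand-in class, SketchIndex, which is not Picky's Index and only mimics the source method, to show when the data is actually loaded.

```ruby
# Stand-in for an index: remembers either the data itself or a block.
class SketchIndex
  def initialize(&config)
    instance_eval(&config)
  end

  def source(data = nil, &block)
    @source = block || data
  end

  # Called at "indexing time": only now is a block source evaluated.
  def records
    @source.respond_to?(:call) ? @source.call : @source
  end
end

calls = []
load_books = lambda do |label|
  calls << label # record when the "database" is hit
  [:book1, :book2]
end

eager = SketchIndex.new { source load_books.call(:eager) }  # loads right now
lazy  = SketchIndex.new { source { load_books.call(:lazy) } } # loads later

p calls # => [:eager]
lazy.records
p calls # => [:eager, :lazy]
```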

Implicit Data Sources

Implicit data sources are not mentioned in the index definition, but rather, the data is added (or removed) via realtime methods on an index, like #add, #<<, #unshift, #remove, #replace, and a special form, #replace_from.

So, you don't define them on an index or category as in the explicit data source, but instead add to either like so:

index = Index.new :books do
  category :example
end

Book = Struct.new :id, :example
index.add Book.new(1, "Hello!")
index.add Book.new(2, "World!")

Or to a specific category:

index[:example].add Book.new(3, "Only add to a single category")

Methods to change index or category data

Currently, there are 7 methods to change an index:

Indexing / Tokenizing

See Tokenizing for tokenizer options.

Categories

Categories – usually what other search engines call fields – define categorized data. For example, book data might have a title, an author and an isbn.

So you define that:

Index.new :books do
  source { Book.order('author DESC') }

  category :title
  category :author
  category :isbn
end

(The example assumes that a Book has readers for title, author, and isbn)

This already works and a search will return categorized results. For example, a search for "Alan Tur" might categorize both words as author, but it might also at the same time categorize both as title. Or one as title and the other as author.

That's a great starting point. So how can I customize the categories?

Option partial

The partial option defines if a word is also found when it is only partially entered. So, Picky will be found when typing Pic.

Partial Marker *

The default partial marker is *, so entering Pic* will force Pic to be looked for in the partial index.

The last word in a query is always partial, by default. If you want to force a non-partial search on the last query word, use ", as in last query word would be "partial". Here, partial would not be searched in the partial index.

Setting the markers

By default, the partial marker is * and the non-partial marker is ". You change the markers by setting

Default

You define a partial option like this:

category :some, partial: (some generator which generates partial words)

The Picky default is

category :some, partial: Picky::Partial::Substring.new(from: -3)

You get this one by defining no partial option:

category :some

The option Partial::Substring.new(from: 1) will make a word completely partially findable.

So the word Picky would be findable by entering Picky, Pick, Pic, Pi, or P.
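
To see which entries this produces, here is a plain-Ruby sketch that lists the prefixes of a token down to a given length. The helper prefixes_from is invented for this example and only covers a positive from value; it is not Picky's Substring implementation.

```ruby
# Hypothetical helper: all prefixes of token, from the full word
# down to a prefix of `from` characters.
def prefixes_from(token, from)
  token.length.downto(from).map { |length| token[0, length] }
end

p prefixes_from("Picky", 1) # => ["Picky", "Pick", "Pic", "Pi", "P"]
```

Each prefix is an extra index entry, so a five-letter word indexed this way takes five entries instead of one.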

No partials

If you don't want any partial finds to occur, use:

category :some, partial: Partial::None.new

Other partials

There are four built-in partial options. All examples use "hello" as the token.

The general rule is: The more tokens are generated from a token, the larger your index will be. Ask yourself whether you really need an infix partial index.

Your own partials

You can also pass in your own partial generators. How?

Implement an object which has a single method #each_partial(token, &block). That method should yield all partials for a given token. Want to implement a (probably useless) random partial search? No problem.

Example:

You need an alphabetic index search. If somebody searches for a name, it should only be found if typed as a whole. But you'd also like to find it when just entering a, for Andy, Albert, etc.

class AlphabeticIndexPartial
  def each_partial token, &block
    [token[0], token].each &block
  end
end

This will result in "A" and "Andy" being in the index for "Andy".

Pretty straightforward, right?
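
You can try such a generator on its own before handing it to a category. This sketch repeats the class from above and collects what it yields; no Picky code is involved.

```ruby
class AlphabeticIndexPartial
  def each_partial(token, &block)
    [token[0], token].each(&block)
  end
end

partials = []
AlphabeticIndexPartial.new.each_partial("Andy") { |partial| partials << partial }
p partials # => ["A", "Andy"]

# A single-letter token yields the same string twice.
single = []
AlphabeticIndexPartial.new.each_partial("A") { |partial| single << partial }
p single # => ["A", "A"]
```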

Option weight

The weight option defines how strongly a word is weighted. By default, Picky rates a word according to the logarithm of its occurrence. This means that a word that occurs more often will be weighted slightly higher.

You define a weight option like this:

category :some, weight: MyWeights.new

The default is Weights::Logarithmic.new.

You can also pass in your own weight generators. See this article to learn more.

If you don't want Picky to calculate weights for your indexed entries, you can use constant or dynamic weights.

With 0.0 as a constant weight:

category :some, weight: Weights::Constant.new # Returns 0.0 for all results.

With 3.14 as a constant weight:

category :some, weight: Weights::Constant.new(3.14) # Returns 3.14 for all results.

Or with a dynamically calculated weight:

Weights::Dynamic.new do |str_or_sym|
  str_or_sym.length # Uses the length of the string or symbol as weight.
end

You almost never need to define weights. More often than not, you can fiddle with boosting combinations of categories, via the boost method in searches.

Why choose fiddling with weight rather than boosts?

Usually it is preferable to boost specific search results, say "florian hanke" mapped to [:first_name, :last_name], but sometimes you want a specific category boosted wherever it occurs.

For example, the title in a movie search engine would need to be boosted in all searches in which it occurs. Do this:

category :title, weight: Weights::Logarithmic.new(+1)

This adds +1 to all weights. Why the logarithmic? By default, Picky weighs categories using the logarithm of occurrences. So the default would be:

category :title, weight: Weights::Logarithmic.new # The default.

The Logarithmic initializer accepts a constant to be added to the result. Since log(x) + 1 = log(e * x), adding the constant +1 is like multiplying the number of occurrences by Math::E (e is Euler's number). If you don't understand, don't worry, just know that by adding a constant you multiply by a certain value.
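
The relationship between adding a constant and multiplying can be checked numerically. This sketch assumes the weight is the natural logarithm of the number of occurrences, as the reference to Math::E suggests; the variable names are chosen for the example.

```ruby
occurrences = 10

boosted = Math.log(occurrences) + 1        # logarithmic weight plus constant +1
scaled  = Math.log(occurrences * Math::E)  # weight of e times as many occurrences

# Both are the same value, up to floating point rounding.
same = (boosted - scaled).abs < 1e-12
puts same # => true
```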

In short:

  * Use weight on the index if you need a category to be boosted everywhere, wherever it occurs.
  * Use boosting if you need to boost specific combinations of categories only for a specific search.

Option similarity

The similarity option defines if a word is also found when it is typed wrong, or close to another word. So, "Picky" might be already found when typing "Pocky~" (Picky will search for similar words when you use the tilde, ~).

You define a similarity option like this:

category :some, similarity: Similarity::None.new

(This is also the default)

There are several built-in similarity options, like

category :some, similarity: Similarity::Soundex.new
category :this, similarity: Similarity::Metaphone.new
category :that, similarity: Similarity::DoubleMetaphone.new

You can also pass in your own similarity generators. See this article to learn more.

Option qualifier/qualifiers (categorizing)

Usually, when you search for title:wizard you will only find books with "wizard" in their title.

Maybe your client would like to be able to only enter t:wizard. In that case you would use this option:

category :some, qualifier: "t"

Or if you'd like more to match:

category :some,
         qualifiers: ["t", "title", "titulo"]

(This matches "t", "title", and also the Italian "titulo")

Picky will warn you if on one index the qualifiers are ambiguous (Picky will assume that the last "t" for example is the one you want to use).

This means that if you define:

category :some,  qualifier: "t"
category :other, qualifier: "t"

Picky will assume that if you enter t:bla, you want to search in the other category.

Searching in multiple categories can also be done. If you have:

category :some,  :qualifier => 's'
category :other, :qualifier => 'o'

Then searching with s,o:bla will search for bla in both :some and :other. Neat, eh?
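
To illustrate how a qualified query word maps to categories, here is a plain-Ruby sketch. It is not Picky's parser: split_qualifiers and the mapping hashes are made up for this example. It also shows the "last one wins" behaviour for ambiguous qualifiers described above.

```ruby
# Qualifier => category, as in the :some / :other example.
QUALIFIERS = { 's' => :some, 'o' => :other }

# Hypothetical helper: splits "s,o:bla" into categories and text.
def split_qualifiers(token, mapping)
  qualifiers, text = token.split(':', 2)
  return [nil, token] if text.nil? # no qualifier present
  categories = qualifiers.split(',').map { |q| mapping[q] }.compact
  [categories, text]
end

p split_qualifiers('s,o:bla', QUALIFIERS) # => [[:some, :other], "bla"]
p split_qualifiers('bla', QUALIFIERS)     # => [nil, "bla"]

# Two categories sharing the qualifier "t": the one defined last wins.
ambiguous = {}
[[:some, 't'], [:other, 't']].each { |category, q| ambiguous[q] = category }
p split_qualifiers('t:bla', ambiguous)    # => [[:other], "bla"]
```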

Option from

Usually, the categories will take their data from the reader or field that is the same as their name.

Sometimes, though, the model does not have the right names. Say you have an Italian book model, Libro, but you still want to use English category names.

Index.new :books do
  source { Libro.order('autore DESC') }

  category :title,  :from => :titulo
  category :author, :from => :autore
  category :isbn
end

You can also populate the index at runtime (eg. with index.add) using a lambda. The required argument inside the lambda is the object being added to the index.

Index.new :books do
  category :authors, :from => lambda { |book| book.authors.map(&:name) }
end
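
Both forms of the from option boil down to "how do I get the value out of the object". The following sketch shows that idea in plain Ruby; value_for is a hypothetical helper, not a Picky method, and the Libro struct stands in for a real model.

```ruby
Libro = Struct.new(:titulo, :autore, :isbn)

# Hypothetical helper: `from` is either a method name or something callable.
# Without `from`, the category name itself is used as the method name.
def value_for(object, category_name, from: category_name)
  from.respond_to?(:call) ? from.call(object) : object.public_send(from)
end

libro = Libro.new('Il nome della rosa', 'Umberto Eco', '123')

p value_for(libro, :title, from: :titulo)                    # => "Il nome della rosa"
p value_for(libro, :isbn)                                    # => "123"
p value_for(libro, :author, from: ->(l) { l.autore.upcase }) # => "UMBERTO ECO"
```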

Option key_format

You will almost never need to use this, as the key format will usually be the same for all categories, which is when you would define it on the index, like so.

But if you need to, use it as you would on the index.

Index.new "books" do
  category :title,
           :key_format => :to_s
end

Option source

You will almost never need to use this, as the source will usually be the same for all categories, which is when you would define it on the index, like so.

But if you need to, use it as you would on the index.

Index.new :books do
  category :title,
           source: some_source
end

Option tokenize

Set this option to false when you give Picky already tokenized data (an Array, or generally an Enumerator).

Index.new :people do
  category :names, tokenize: false
end

And Person has a method #names which returns this array:

class Person

  def names
    ['estaban', 'julio', 'ricardo', 'montoya', 'larosa', 'ramirez']
  end

end

Then Picky will simply use the tokens in that array without (pre-)processing them. Of course, this means you need to do all the tokenizing work. If you leave the tokens in uppercase formatting, then nothing will be found, unless you set the Search to be case-sensitive, for example.

User Search Options

Users can use some special features when searching. They are:

These options can be combined (e.g. title,author:funky~"): This will try to find similar words to funky (like "fonky"), but no partials of them (like "fonk"), in both title and author.

Non-partial will win over partial, if you use both, as in test*".

Also note that these options need to make it through the tokenizing, so don't remove any of *":,-. TODO unclear

Key Format (Format of the indexed Ids)

By default, the indexed data points to keys that are integers, or differently said, are formatted using to_i.

If you are indexing keys that are strings, use to_s – a good example are MongoDB BSON keys, or UUID keys.

The key_format method lets you define the format:

Index.new :books do
  key_format :to_s
end

The Picky::Sources already set this correctly. However, if you use an #each source that supplies Picky with symbol ids, you should tell it what format the keys are in, e.g. key_format :to_s.
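
Why the format matters becomes clear when you call the method on a non-integer id. This sketch assumes the key format is simply a method name sent to the id, as the description above suggests; format_key is a made-up helper for this example.

```ruby
# Hypothetical helper: applies a key format (a method name) to a raw id.
def format_key(raw_id, key_format = :to_i)
  raw_id.public_send(key_format)
end

p format_key("42")                             # => 42
p format_key("4d3ed089fb60ab534684b7e9", :to_s) # => "4d3ed089fb60ab534684b7e9"

# With the default :to_i, a BSON-style string id is cut off at the
# first non-digit, so different ids could end up as the same key.
p format_key("4d3ed089fb60ab534684b7e9")        # => 4
```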

Identifying in Results

By default, an index is identified by its name in the results. This index is identified by :books:

Index.new :books do
  # ...
end

This index is identified by media in the results:

Index.new :books do
  # ...
  result_identifier 'media'
end

You still refer to it as :books in e.g. Rake tasks, Picky::Indexes[:books].reload. The result_identifier option is just for the results.

Indexing

Indexing can be done programmatically, at any time. Even while the server is running.

Indexing all indexes is done with

Picky::Indexes.index

Indexing a single index can be done either with

Picky::Indexes[:index_name].index

or

index_instance.index

Indexing a single category of an index can be done either with

Picky::Indexes[:index_name][:category_name].index

or

category_instance.index

Loading

Loading (or reloading) your indexes in a running application is possible.

Loading all indexes is done with

Picky::Indexes.load

Loading a single index can be done either with

Picky::Indexes[:index_name].load

or

index_instance.load

Loading a single category of an index can be done either with

Picky::Indexes[:index_name][:category_name].load

or

category_instance.load

Using signals

To communicate with your server using signals:

books_index = Index.new(:books) do
  # ...
end

Signal.trap("USR1") do
  books_index.reindex
end

This reindexes the books_index when you call

kill -USR1 <server_process_id>

You can refer to the index like so if you want to define the trap somewhere else:

Signal.trap("USR1") do
  Picky::Indexes[:books].reindex
end
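Ruby restricts what may run inside a trap handler (locking a Mutex there raises a ThreadError, for example), so a common pattern is to only set a flag in the handler and do the heavy work outside of it. The following is a sketch of that pattern in plain Ruby, not something Picky prescribes: the reindex lambda is a stand-in for Picky::Indexes[:books].reindex, and the script sends the signal to itself to simulate the kill command.

```ruby
reindex_requested = false
reindex_count     = 0

# Stand-in for Picky::Indexes[:books].reindex (assumption for this sketch).
reindex = -> { reindex_count += 1 }

# The handler only records that a reindex was requested.
Signal.trap("USR1") { reindex_requested = true }

# Simulates `kill -USR1 <server_process_id>` from inside the script.
Process.kill("USR1", Process.pid)

# Wait (at most two seconds) until the handler has run.
deadline = Time.now + 2
sleep 0.01 until reindex_requested || Time.now > deadline

# The actual work happens outside of the trap context.
if reindex_requested
  reindex_requested = false
  reindex.call
end

p reindex_count # => 1
```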

Reindexing

Reindexing your indexes is just indexing followed by reloading (see above).

Reindexing all indexes is done with

Picky::Indexes.reindex

Reindexing a single index can be done either with

Picky::Indexes[:index_name].reindex

or

index_instance.reindex

Reindexing a single category of an index can be done either with

Picky::Indexes[:index_name][:category_name].reindex

or

category_instance.reindex

Searching

Picky offers a Search interface for the indexes. You instantiate it as follows:

Just searching over one index:

books = Search.new books_index # searching over one index

Searching over multiple indexes:

media = Search.new books_index, dvd_index, mp3_index

Such an instance then searches over all its indexes and returns a Picky::Results object:

results = media.search "query", # the query text
                            20, # number of ids
                             0  # offset (for pagination)

Please see the part about Results to know more about that.
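If your interface works with page numbers, the two numeric arguments can be derived from the page. This is a small sketch in plain Ruby; paging_arguments is a hypothetical helper, not part of Picky.

```ruby
# Hypothetical helper: turns a 1-based page number into the
# [number of ids, offset] pair that the search call above takes.
def paging_arguments(page, per_page = 20)
  [per_page, (page - 1) * per_page]
end

p paging_arguments(1) # => [20, 0]
p paging_arguments(3) # => [20, 40]
```

You could then splat the pair into the call, as in media.search "query", *paging_arguments(3).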

Options

You use a block to set search options:

media = Search.new books_index, dvd_index, mp3_index do
  searching tokenizer_options_or_tokenizer
  boost [:title, :author] => +2,
        [:author, :title] => -1
end

Searching / Tokenizing

See Tokenizing for tokenizer options.

Boost

The boost option defines what combinations to boost.

This is unlike boosting in most other search engines, where you can only boost a given field. I've found it much more useful to boost combinations.

For example, you have an index of addresses. The usual case is that someone is looking for a street and a number. So if Picky encounters that combination (in that order), it should promote the results containing that combination to a more prominent spot. On the other hand, if Picky encounters a street number followed by a street name, which is unlikely to be a search for an address (where I come from), you might want to demote that result.

So let's boost street, streetnumber, while at the same time deboosting streetnumber, street:

addresses = Picky::Search.new address_index do
  boost [:street, :streetnumber] => +2,
        [:streetnumber, :street] => -1
end

If you still want to boost a single category, check out the category weight option. For example:

Picky::Index.new :addresses do
  category :street, weight: Picky::Weights::Logarithmic.new(+4)
  category :streetnumber
end

This boosts the weight of the street category for all searches using the index with this category. So whenever the street category is found in results, it will boost these.

Note on Boosting

Picky combines consecutive categories in searches for boosting. So if you search for "star wars empire strikes back" and have defined [:title] => +1, then that boost is applied, even though the query contains five title words.

Why? In earlier versions of Picky we found that boosting specific combinations is less useful than boosting a specific order of categories.

Let me give you an example from a movie search engine. Instead of having to say boost [:title] => +1, [:title, :title] => +1, [:title, :title, :title] => +1, it is far more useful to say "If you find any number of title words in a row, boost it". So, when searching for "star wars empire strikes back 1980", it is less important that the query contains 5 title words than that it contains a title followed by a release year. So in this particular case, a boost defined by [:title, :release_year] => +3 would be applied.
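The combining of consecutive categories can be modelled in a few lines of plain Ruby. This is only a sketch of the behaviour described here, not Picky's actual implementation; the collapse helper and the boosts hash are made up for the example.

```ruby
# Model only: collapses runs of the same category into one.
def collapse(categories)
  categories.chunk_while { |a, b| a == b }.map(&:first)
end

boosts = { [:title] => +1, [:title, :release_year] => +3 }

# "star wars empire strikes back 1980": five title words, one year.
found = [:title, :title, :title, :title, :title, :release_year]

p collapse(found)                  # => [:title, :release_year]
p boosts.fetch(collapse(found), 0) # => 3
```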

Ignoring Categories

There's a full blog post devoted to this topic.

In short, an ignore :name option makes that search throw away (ignore) any tokens (words) that map to the category name.

Let's say we have a search defined:

names = Picky::Search.new name_index do
  ignore :first_name
end

Now, if Picky finds the tokens "florian hanke" in both [:first_name, :last_name] and [:last_name, :last_name], then it will throw away the solutions for :first_name ("florian" will be thrown away), leaving only "hanke", since that is a last name. The [:last_name, :last_name] combinations will be left alone, i.e. if "florian" and "hanke" are both found in last_name.

Ignoring Combinations of Categories

The ignore option also takes arrays. If you give it an array, it will throw away all solutions where that order of categories occurs.

Let's say you want to throw away results where last name is found before first name, because your search form is in order: [first_name last_name].

names = Picky::Search.new name_index do
  ignore [:last_name, :first_name]
end

So if somebody searches for "peter paul han" (each a last name as well as a first name), and Picky finds the following combinations:

[:first_name, :first_name, :first_name]
[:last_name, :first_name, :last_name]
[:first_name, :last_name, :first_name]
[:last_name, :first_name, :first_name]
[:last_name, :last_name, :first_name]

then the combinations

[:last_name, :first_name, :first_name]
[:last_name, :last_name, :first_name]

will be thrown away, since they are in the order [:last_name, :first_name]. Note that [:last_name, :first_name, :last_name] is not thrown away since it is last-first-last.

Keeping Combinations of Categories

This is almost the opposite of the ignore option above: the only option takes only arrays. If you give it an array, it will keep only solutions where that order of categories occurs.

Let's say you want to keep only results where first name is found before last name, because your search form is in order: [first_name last_name].

names = Picky::Search.new name_index do
  only [:first_name, :last_name]
end

So if somebody searches for "peter paul han" (each a last name as well as a first name), and Picky finds the following combinations:

[:first_name, :first_name, :last_name]
[:last_name, :first_name, :last_name]
[:first_name, :last_name, :first_name]
[:last_name, :first_name, :first_name]
[:last_name, :last_name, :first_name]

then only the combination

[:first_name, :first_name, :last_name]

will be kept, since it is the only one where first comes before last, in that order.
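Both only and the array form of ignore can be modelled in plain Ruby, assuming that runs of the same category are combined first (as described in the note on boosting). This is a sketch that reproduces the examples of this section, not Picky's implementation; collapse is a made-up helper.

```ruby
# Model only: collapse runs of the same category before comparing.
def collapse(categories)
  categories.chunk_while { |a, b| a == b }.map(&:first)
end

first, last = :first_name, :last_name
combinations = [
  [first, first, last],
  [last,  first, last],
  [first, last,  first],
  [last,  first, first],
  [last,  last,  first]
]

# only [:first_name, :last_name] keeps the matching combinations ...
kept = combinations.select { |c| collapse(c) == [first, last] }
p kept # => [[:first_name, :first_name, :last_name]]

# ... while ignore [:last_name, :first_name] removes the matching ones.
remaining = combinations.reject { |c| collapse(c) == [last, first] }
p remaining.size # => 3
```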

Ignore Unassigned Tokens

There's a full blog post devoted to this topic.

In short, the ignore_unassigned_tokens true/false option makes Picky be very lenient with your queries. Usually, if one of the search words is not found, say in a query "aston martin cockadoodledoo", Picky will return an empty result set, because "cockadoodledoo" is not in any index, in a car search, for example.

By ignoring the "cockadoodledoo" that can't be assigned sensibly, you will still get results.

This could be used in a search for advertisements that are shown next to the results.

If you've defined an ads search like so:

ads_search = Search.new cars_index do
  ignore_unassigned_tokens true
end

then even if Picky does not find anything for "aston martin cockadoodledoo", it will find an ad, simply ignoring the unassigned token.
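What "ignoring the unassigned token" means can be modelled like this in plain Ruby. The car_index hash and the assignable_tokens helper are made up for this sketch and say nothing about how Picky stores its data.

```ruby
# Hypothetical data: the tokens each category of a car index knows.
car_index = {
  brand: %w[aston martin],
  model: %w[db9 vantage]
}

# Model only: drop the tokens that cannot be assigned to any category.
def assignable_tokens(tokens, index)
  tokens.select { |token| index.values.any? { |known| known.include?(token) } }
end

query = %w[aston martin cockadoodledoo]
p assignable_tokens(query, car_index) # => ["aston", "martin"]
```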

Maximum Allocations

The max_allocations(integer) option cuts off calculation of allocations.

What does this mean? Say you have code like:

phone_search = Search.new phonebook do
  max_allocations 1
end

And someone searches for "peter thomas".

Picky then generates all possible allocations and sorts them.

It might get several allocations, with the first allocation being the most probable one.

So, with max_allocations 1 it will only use the topmost one and throw away all the others.

It will only go through the first one and calculate only results for that one. This can be used to speed up Picky in case of exploding amounts of allocations.
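As a rough model in plain Ruby: sort the allocations by score, then keep only the first few. The category names and scores below are invented for this sketch, and Allocation is a stand-in struct, not Picky's class.

```ruby
# Hypothetical allocations for "peter thomas" with made-up scores.
Allocation = Struct.new(:categories, :score)

allocations = [
  Allocation.new([:last_name,  :first_name], 4.1),
  Allocation.new([:first_name, :last_name],  9.2),
  Allocation.new([:first_name, :first_name], 2.7)
]

max_allocations = 1

# Most probable allocation first, then cut off the rest.
used = allocations.sort_by { |allocation| -allocation.score }.first(max_allocations)

p used.map(&:categories) # => [[:first_name, :last_name]]
```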

Early Termination

The terminate_early(integer) or terminate_early(with_extra_allocations: integer) option stops Picky from calculating all ids of all allocations.

However, this will also return a wrong total.

So, important note: Only use when you don't display a total. Or you want to fool your users (not recommended).

Examples:

Stop as soon as you have calculated enough ids for the allocation.

phone_search = Search.new phonebook do
  terminate_early # The default uses 0.
end

Stop as soon as you have calculated enough ids for the allocation, and then calculate 3 allocations more (for example, to show to the user).

phone_search = Search.new phonebook do
  terminate_early 3
end

There's also a hash form to be more explicit. So the next coder knows what it does. (However, us cool Picky hackers know ;) )

phone_search = Search.new phonebook do
  terminate_early with_extra_allocations: 5
end

This option speeds up Picky if you don't need a correct total.
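Why the total ends up wrong can be seen in a small plain Ruby model. It is a sketch under the assumption that the total is summed only over the allocations that were actually calculated; collect_ids and its extra_allocations parameter are invented here and do not mirror Picky's internals.

```ruby
# Model only: each inner array stands for the ids of one allocation.
# extra_allocations: nil means "no early termination".
def collect_ids(allocations, amount, extra_allocations: nil)
  ids        = []
  total      = 0
  extra_left = extra_allocations
  allocations.each do |allocation_ids|
    # Once enough ids are there, only go on for the extra allocations.
    if extra_left && ids.size >= amount
      break if extra_left <= 0
      extra_left -= 1
    end
    total += allocation_ids.size
    ids.concat allocation_ids.first(amount - ids.size) if ids.size < amount
  end
  [ids, total]
end

allocations = [[1, 2, 3], [4, 5], [6], [7, 8]]

p collect_ids(allocations, 2)                       # => [[1, 2], 8]
p collect_ids(allocations, 2, extra_allocations: 0) # => [[1, 2], 3]
p collect_ids(allocations, 2, extra_allocations: 1) # => [[1, 2], 5]
```

The ids are the same in all three cases, only the total differs.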

Results

Results are returned by the Search instance.

books = Search.new books_index do
  searching splits_text_on: /[\s,]/
  boost [:title, :author] => +2
end

results = books.search "test"

p results         # Returns results in log form.
p results.to_hash # Returns results as a hash.
p results.to_json # Returns results as JSON.

Sorting

If no sorting is defined, Picky results will be sorted in the order of the data provided by the data source.

However, you can sort the results any way you want.

Arbitrary Sorting

You can define an arbitrary sorting on results by calling Results#sort_by. It takes a block with a single parameter: The stored id of a result item.

This example looks up a result item via id and then takes the priority of the item to sort the results.

results.sort_by { |id| MyResultItemsHash[id].priority }

The results are only sorted within their allocation. If you, for example, searched for Peter, and Picky allocated results in first_name and last_name, then each allocation's results would be sorted separately.

Picky is optimized: it only sorts results which are actually visible. So if Picky looks for the first 20 results, and the first allocation already has more than 20 results in it (say, 100), then it will only sort the 100 results of the first allocation. It will still calculate all other allocations, but not sort them.
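Sorting within allocations can be pictured like this in plain Ruby. The sketch ignores the visibility optimization and simply sorts every allocation; the allocations hash and the priorities are made up for the example.

```ruby
# Made-up priorities per id (lower sorts further to the front).
priorities = { 1 => 10, 2 => 100, 3 => 1, 4 => 50 }

# Hypothetical allocations, each holding the ids found for it.
allocations = {
  first_name: [2, 1],
  last_name:  [4, 3]
}

# Each allocation is sorted on its own, the allocation order stays.
sorted = allocations.transform_values do |ids|
  ids.sort_by { |id| priorities[id] }
end

p sorted.keys         # => [:first_name, :last_name]
p sorted[:first_name] # => [1, 2]
p sorted[:last_name]  # => [3, 4]
```

Note that id 3 has the best priority overall, but still comes after the ids of the first allocation.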

Sorting Costs

sort_hash = {
  1 => 10, # important
  2 => 100 # not so important
}
results.sort_by { |id| sort_hash[id] }

Note that Ruby's sort_by sorts in ascending order, so a lower value means further to the front (that is, higher up in Picky's results).

Logging

TODO Update with latest logging style and ideas on how to separately log searches.

Picky results can be logged wherever you want.

A Picky Sinatra server logs whatever to wherever you want:

MyLogger = Logger.new "log/search.log"

# ...

get '/books' do
  results = books.search "test"
  MyLogger.info results
  results.to_json
end

or set it up in separate files for different environments:

require "logging/#{PICKY_ENVIRONMENT}"

A Picky classic server logs to the logger defined with the Picky.logger= writer.

Set it up in a separate logging.rb file (or directly in the app/application.rb file).

Picky.logger = Picky::Loggers::Concise.new STDOUT

and the Picky classic server will log the results into it, if it is defined.

Why in a separate file? So that you can have different logging for different environments.

More power to you.

Facets

Here's the Wikipedia entry on facets. I fell asleep after about 5 words. Twice.

In Picky, categories are explicit slices over your index data. Picky facets are implicit slices over your category data.

What does "implicit" mean here?

It means that you didn't explicitly say, "My data is shoes, and I have these four brands: Nike, Adidas, Puma, and Vibram".

No, instead you told Picky that your data is shoes, and there is a category "brand". Let's make this simple:

index = Picky::Index.new :shoes do
  category :brand
  category :name
  category :type
end

Shoe = Struct.new :id, :brand, :name, :type

index.add Shoe.new(1, 'nike', 'zoom', 'sports')
index.add Shoe.new(2, 'adidas', 'speed', 'sports')
index.add Shoe.new(3, 'nike', 'barefoot', 'casual')

With this data in mind, let's look at the possibilities:

Index facets

Index facets are very straightforward.

You ask the index for facets and it will give you all the facets it has and how many results there are within:

index.facets :brand # => { 'nike' => 2, 'adidas' => 1 }

The category type is a good candidate for facets, too:

index.facets :type # => { 'sports' => 2, 'casual' => 1 }

What are the options?

at_least only gives you facets which occur at least n times and counts tells the facets method whether you want counts with the facets or not. If counts are omitted, you'll get an Array of facets instead of a Hash.
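To make the two options concrete, here is a plain Ruby model of what they do to the shoe data from above. It is a sketch of the described behaviour only; the facets method below is a stand-in written for this example, not Picky's.

```ruby
Shoe = Struct.new(:id, :brand, :name, :type)

shoes = [
  Shoe.new(1, 'nike',   'zoom',     'sports'),
  Shoe.new(2, 'adidas', 'speed',    'sports'),
  Shoe.new(3, 'nike',   'barefoot', 'casual')
]

# Model only: count the values of a category, then apply the options.
def facets(items, category, at_least: 1, counts: true)
  tally = items.each_with_object(Hash.new(0)) { |item, hash| hash[item[category]] += 1 }
  tally = tally.select { |_, count| count >= at_least }
  counts ? tally : tally.keys
end

p facets(shoes, :brand)                              # => { 'nike' => 2, 'adidas' => 1 }
p facets(shoes, :brand, at_least: 2)                 # => { 'nike' => 2 }
p facets(shoes, :brand, counts: false)               # => ['nike', 'adidas']
p facets(shoes, :brand, at_least: 2, counts: false)  # => ['nike']
```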

Pretty straightforward, right?

Search facets are quite similar:

Search facets

Search facets work similarly to index facets. In fact, you can use them in the same way:

search_interface.facets :brand # => { 'nike' => 2, 'adidas' => 1 }
search_interface.facets :type # => { 'sports' => 2, 'casual' => 1 }
search_interface.facets :brand, at_least: 2 # => { 'nike' => 2 }
search_interface.facets :brand, counts: false # => ['nike', 'adidas']
search_interface.facets :brand, at_least: 2, counts: false # => ['nike']

However, search facets are more powerful, as you can also filter the facets with a filter query option:

shoes.facets :brand, filter: 'some filter query'

What does that mean?

Usually you want to use multiple facets in your interface. For example, a customer might already have filtered results by type "sports" because they are only interested in sports shoes. Now you'd like to show them the remaining brands, so that they can filter on the remaining facets.

How do you do this?

Let's say we have an index as above, and a search interface to the index:

shoes = Picky::Search.new index

If the customer has already filtered for sports, you simply pass the query to the filter option:

shoes.facets :brand, filter: 'type:sports' # => { 'nike' => 1, 'adidas' => 1 }

This will give you only 1 "nike" facet. If the customer filtered for "casual":

shoes.facets :brand, filter: 'type:casual' # => { 'nike' => 1 }

then we'd only get the casual nike facet (from that one "barefoot" shoe Picky loves so much).

As said, filtering works like the query string passed to Picky. So if the customer has filtered for brand "nike" and type "sports", you'd get:

shoes.facets :brand, filter: 'brand:nike type:sports' # => { 'nike' => 1 }
shoes.facets :name, filter: 'brand:nike type:sports' # => { 'zoom' => 1 }
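If you want to convince yourself of these numbers without running Picky, the filtering can be modelled in plain Ruby. In this sketch the filter is given as keyword arguments instead of a query string, and filtered_facets is a made-up helper, so it only mirrors the results, not the real query parsing.

```ruby
Shoe = Struct.new(:id, :brand, :name, :type)

SHOES = [
  Shoe.new(1, 'nike',   'zoom',     'sports'),
  Shoe.new(2, 'adidas', 'speed',    'sports'),
  Shoe.new(3, 'nike',   'barefoot', 'casual')
]

# Model only: narrow the items down first, then count the category.
def filtered_facets(items, category, **filter)
  matching = items.select { |item| filter.all? { |key, value| item[key] == value } }
  matching.group_by { |item| item[category] }.transform_values(&:size)
end

p filtered_facets(SHOES, :brand, type: 'sports')                # => { 'nike' => 1, 'adidas' => 1 }
p filtered_facets(SHOES, :brand, type: 'casual')                # => { 'nike' => 1 }
p filtered_facets(SHOES, :name, brand: 'nike', type: 'sports')  # => { 'zoom' => 1 }
```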

Playing with it is fun :)

See below for testing and performance tips.

Testing How To

Let's say we have an index with some data:

index = Picky::Index.new :people do
  category :name
  category :surname
end

person = Struct.new :id, :name, :surname
index.add person.new(1, 'tom', 'hanke')
index.add person.new(2, 'kaspar', 'schiess')
index.add person.new(3, 'florian', 'hanke')

This is how you test facets:

Index Facets

# We should find two surname facets.
#
index.facets(:surname).should == {
  'hanke' => 2,  # hanke occurs twice
  'schiess' => 1 # schiess occurs once
}

# Only one occurs at least twice.
#
index.facets(:surname, at_least: 2).should == {
  'hanke' => 2
}

Search Facets

# The search interface over the index above.
#
finder = Picky::Search.new index

# Passing in no filter query just returns the facets.
#
finder.facets(:surname).should == {
  'hanke' => 2,
  'schiess' => 1
}

# A filter query narrows the facets down.
#
finder.facets(:name, filter: 'surname:hanke').should == {
  'tom' => 1,
  'florian' => 1
}

# It allows explicit partial matches.
#
finder.facets(:name, filter: 'surname:hank*').should == {
  'tom' => 1,
  'florian' => 1
}

Performance

Two rules:

  1. Index facets are faster than filtered search facets. If you don't filter though, search facets are as fast as index facets.
  2. Only use facets on data which are a good fit for facets – where there aren't many facets to the data.

A good example of a meaningful use of facets would be brands of shoes. There aren't many different brands (usually fewer than 100).

So this facet query

finder.facets(:brand, filter: 'type:sports')

does not return thousands of facets.

Should you find yourself in a position where you have to use a facet query on uncontrolled data, eg. user entered data, you might want to cache the results:

category = :name
filter   = 'age_bracket:40'

some_cache[[category, filter]] ||= finder.facets(category, filter: filter)
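That the cache really saves calls can be checked with a stand-in for the search interface. CountingFinder is a hypothetical class written for this sketch (it returns a made-up result), and some_cache is a plain Hash here.

```ruby
# Hypothetical stand-in for a search interface that counts its calls.
class CountingFinder
  attr_reader :calls

  def initialize
    @calls = 0
  end

  def facets(category, filter: nil)
    @calls += 1
    { 'peter' => 3 } # made-up result
  end
end

finder     = CountingFinder.new
some_cache = {}

lookup = lambda do |category, filter|
  some_cache[[category, filter]] ||= finder.facets(category, filter: filter)
end

2.times { lookup.call(:name, 'age_bracket:40') }
p finder.calls # => 1

lookup.call(:name, 'age_bracket:50')
p finder.calls # => 2
```

A cache like this would presumably need to be emptied whenever the index data changes, for example after reindexing.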

JavaScript

Picky offers a standard HTML interface that works well with its JavaScript. Render this into your HTML (needs the picky-client gem):

Picky::Helper.cached_interface

Adding a JS interface (written in jQuery for brevity):

$(document).ready(function() {
  pickyClient = new PickyClient({
    // A full query displays the rendered results.
    //
    full: '/search/full',

    // More options...

  });
});

See the options described and listed below.

The variable pickyClient has the following functions:

// Params are params for the controller action. Full is either true or false.
//
pickyClient.insert(query, params, full);

// Resends the last query.
//
pickyClient.resend();

// If not given a query, will use query from the URL (needs history.js).
//
pickyClient.insertFromURL(overrideQuery);

When creating the client itself, you have many more options, as described here:

Javascript Options

Search options

Search options are about configuring the search itself.

There are four different callbacks that you can use. The part after the || describes the default, which is an empty function.

The beforeInsert callback is executed before a call to pickyClient.insert. Use this to sanitize queries coming from URLs:

var beforeInsertCallback = config.beforeInsert || function(query) { };

The before is executed before a call to the server. Use this to add any filters you might have from radio buttons or other interface elements:

var beforeCallback = config.before || function(query, params) { };

The success is executed just after a successful response. Use this to modify returned results before Picky renders them:

var successCallback = config.success || function(data, query) { };

The after callback is called just after Picky has finished rendering results – use it to make any changes to the interface (like update an advertisement or similar).

var afterCallback = config.after || function(data, query) { };

This will cause the interface to search even if the input field is empty:

var searchOnEmpty = config.searchOnEmpty || false;

If you want to tell the server you need more than 0 live search results, use liveResults:

var liveResults = config.liveResults || 0;

If the live results need to be rendered, set this to be true. Usually used when full results need to be rendered even for live searches (search as you type):

var liveRendered = config.liveRendered || false;

After each keystroke, Picky waits for a designated interval (default is 180ms) for the next keystroke. If no key is hit, it will send a "live" query to the search server. This option lets you change that interval time:

var liveSearchTimerInterval = config.liveSearchInterval || 180;

You can completely exchange the backend used to make calls to the server – in this case I trust you to read the JS code of Picky yourself:

var backends = config.backends;

Text options

With these options, you can change the text that is displayed in the interface.

These options can be locale dependent.

Qualifiers are used when a category in the index is named differently from its qualifiers, e.g. category :application, qualifiers: ['app']. You'd then have to tell the Picky interface to map the category correctly to a qualifier.

qualifiers: {
  en:{
    application: 'app'
  }
},

Remember that you only need this if you do funky stuff. Keep to the defaults and you'll be fine.

Explanations are the small headings over allocations (grouped results). Picky just writes "with author soandso" – if you want a better explanation, use the explanations option:

explanations: {
  en:{
    title:     'titled',
    author:    'written by',
    year:      'published in',
    publisher: 'published by',
    subjects:  'with subjects'
  }
}

Picky would now write "written by soandso", making it much nicer to read.

Choices describe the choices that are given to a user when Picky would like to know what the user was searching for. This is done when Picky gets too many results in too many allocations, i.e. when it is very unclear what the user was looking for.

An example for choices would be:

choices: {
  en:{
    'title': {
      format: "Called <strong>%1$s</strong>",
      filter: function(text) { return text.toUpperCase(); },
      ignoreSingle: true
    },
    'author': 'Written by %1$s',
    'subjects': 'Being about %1$s',
    'publisher': 'Published by %1$s',
    'author,title':    'Called %1$s, written by %2$s',
    'title,author':    'Called %2$s, written by %1$s',
    'title,subjects':  'Called %1$s, about %2$s',
    'author,subjects': '%1$s who wrote about %2$s'
  }
},

Was the user just looking for a title? (Displayed as e.g. "ULYSSES", because of the filter and format.) Or were they looking for an author? (Displayed as "Written by Ulysses".)

Multicategory combinations are possible. If the user searches for Ulysses Joyce, then Picky will most likely ask if this is a title and an author: "Called Ulysses, written by Joyce".

This is a much nicer way to ask the user, don't you think?

The nonPartial option describes which categories should not show an ellipsis behind the text (the trailing "…") if the user searched for it in a partial way. Use this when the categories are not partially findable on the server.

nonPartial: ['year', 'id']

When searching for "1977", this will result in the text being "written in 1977" instead of "written in 1977…", where the ellipses don't make much sense.

The groups option describes how to group the choices in a text. Play with this to see the effects (I know, am tired ;) ).

groups: ['title', 'author']

Modifying the interface itself: Selectors

There are quite a few selector options – you only need those if you heavily customise the interface. You tell Picky where to find the div containing the results or the search form etc.

The selector that contains the search input and the result:

var enclosingSelector = config['enclosingSelector'] || '.picky';

The selector that describes the form the input field is in:

var formSelector = config['formSelector'] || (enclosingSelector + ' form');

The formSelector (short fs) is used to find the input etc.:

config['input']   = $(config['inputSelector']   || (fs + ' input[type=search]'));
config['reset']   = $(config['resetSelector']   || (fs + ' div.reset'));
config['button']  = $(config['buttonSelector']  || (fs + ' input[type=button]'));
config['counter'] = $(config['counterSelector'] || (fs + ' div.status'));

The enclosingSelector (short es) is used to find the results:

config['results']      = $(config['resultsSelector']   || (es + ' div.results'));
config['noResults']    = $(config['noResultsSelector'] || (es + ' div.no_results'));
config['moreSelector'] =   config['moreSelector'] ||
  es + ' div.results div.addination:last';

The moreSelector refers to the clickable "more results" pagination/addination.

The result allocations are selected by these options:

config['allocations']         = $(config['allocationsSelector'] ||
  (es + ' .allocations'));
config['shownAllocations']    = config['allocations'].find('.shown');
config['showMoreAllocations'] = config['allocations'].find('.more');
config['hiddenAllocations']   = config['allocations'].find('.hidden');
config['maxSuggestions']      = config['maxSuggestions'] || 3;

Results rendering is controlled by:

config['results']        = $(config['resultsSelector'] ||
  (enclosingSelector + ' div.results'));
config['resultsDivider'] = config['resultsDivider']    || '';
config['nonPartial']     = config['nonPartial']        || [];
  // e.g. ['category1', 'category2']
config['wrapResults']    = config['wrapResults']       || '<ol></ol>';

The option wrapResults refers to what the results are wrapped in, by default <ol></ol>.

Thanks!

Thanks to whoever made the Sinatra README page for the inspiration.

Logos and all images are CC Attribution licensed to Florian Hanke.