Introducing SPSearchStore: an Objective-C wrapper for SearchKit

For the inaugural post of the phildow.net relaunch I’m pleased to present the first bit of open sourced material from the Journler code base. Back in February I announced that I would be releasing Journler to the public, and this is an important first step and a significant bit of code in that direction.

This class has been factored directly out of Journler and brought up to date with the help of modern Objective-C 2.0 language syntax, Mac OS concurrency APIs and blocks programming. A download link is available at the bottom of this article, or you may go to the Source Code page for a link.

Summary

SPSearchStore is an Objective-C wrapper for the SearchKit API. SearchKit offers document indexing and “Google-like” querying to Core Foundation apps. SearchKit is a stupid powerful Core Services API that supports threaded access, document to term and term to document mapping, text summarizing, and similarity, boolean, phrasal and wildcard prefix/suffix queries.

SPSearchStore makes most of the API accessible to Cocoa applications by way of a simple public interface. The class supports direct access to the complete two-way to-many documents / terms graph contained in a SearchKit index, establishing the foundation for more complex analysis based on the semantic relationships among an arbitrary collection of documents.

Like SearchKit, SPSearchStore is thread safe. You may use multiple threads to read from and write to the search store. In fact, the class will itself use threads to import documents and query the index if you enable its concurrency option. SPSearchStore uses locks to manage access to the underlying data, and it uses Grand Central Dispatch, by way of operation queue blocks, to thread its processing.

Workflow

Employing SPSearchStore is a two step process, with an optional third step to take advantage of the document / term graph. Each step can be accomplished with a minimal amount of code, as I’d like to show you here. Generally, you will 1. Establish a store, 2. Perform searches and 3. Perform document / term analysis.

1. Establishing an index

In the first step you create a store and add content to it. SearchKit supports both in-memory and disk based indices. The APIs manage persistence for the latter option, but you can also persist in-memory stores with just a little extra effort. For either store type you may specify a wide range of text analysis options including minimum term length, stop words, synonyms, starting term characters, and so on.

You can import two kinds of content into the store: file based content or free standing text that you supply. For file based content you may optionally have SearchKit use Spotlight importers for text extraction. In both cases SearchKit uses URIs to identify the content, and it is primarily with URIs that you’ll be interfacing with SPSearchStore.

A. Set up default text analysis options prior to store creation:

[SPSearchStore setDefaultTextAnalysisOption:[NSNumber numberWithInteger:2]
forKey:(NSString *)kSKMinTermLength];

B. Create a memory or disk based store with a single call:

searchStore = [[SPSearchStore alloc] initStoreWithMemory:nil 
type:kSKIndexInvertedVector];

C. You can then set store behavior:

searchStore.usesSpotlightImporters = YES;
searchStore.usesConcurrentIndexing = YES;

D. And add content to the store:

[searchStore addDocument:(NSURL*)obj typeHint:nil];

2. Performing a search

SearchKit uses a two stage pattern to perform searching, and that process is mirrored by SPSearchStore. The first stage initializes the search from a query / options combination and starts the search running on a separate thread. The second stage consists of potentially multiple calls to the index asking for chunks of the search results.

SearchKit is extremely flexible and quite powerful in its handling of queries. It supports similarity, phrasal, wildcard and boolean searching. Documentation describes the syntax as “Google-like”. Often, however, it will not be enough to simply pass a user generated string directly to SearchKit. Cocoa developers accustomed to the simplicity of predicate syntax such as contains[cd] will wonder why SearchKit doesn’t provide this kind of searching directly.

While diacritical and case sensitivity issues are automatically handled, SearchKit otherwise looks for exact matches unless you surround your strings with the wildcard character. For example, predicate syntax such as “contains[cd] tune” must look like “*tune*” to SearchKit. If your user expects this kind of functionality but doesn’t know how to format the string correctly, or shouldn’t have to, it will be up to you to make the necessary adjustments.
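
As a rough illustration of that kind of adjustment, here is a minimal sketch in plain C (which also compiles as Objective-C). The function example_wrap_terms is hypothetical and is not part of SPSearchStore or SearchKit. It assumes the input is a list of plain words separated by whitespace and wraps each one in asterisks, leaving alone any asterisk the user already typed. It does not understand boolean operators, so a string such as "foo && bar" would need smarter handling.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: wraps each whitespace-separated word of `input`
   in '*' so that "tune" becomes "*tune*". A word keeps any leading or
   trailing '*' it already has. Returns 0 on success, or -1 if `out`
   (of size `outSize`) is too small to hold the result. */
int example_wrap_terms(const char *input, char *out, size_t outSize)
{
    size_t used = 0;
    const char *p = input;

    if (outSize == 0) return -1;
    out[0] = '\0';

    while (*p != '\0') {
        /* Skip whitespace between words. */
        while (*p == ' ' || *p == '\t' || *p == '\n') p++;
        if (*p == '\0') break;

        /* Find the end of the current word. */
        const char *start = p;
        while (*p != '\0' && *p != ' ' && *p != '\t' && *p != '\n') p++;
        size_t len = (size_t)(p - start);

        int needLead = (start[0] != '*');
        int needTrail = (start[len - 1] != '*');
        size_t needed = (used > 0 ? 1 : 0) + (size_t)needLead + len + (size_t)needTrail;

        /* Leave room for the terminating NUL. */
        if (used + needed + 1 > outSize) return -1;

        if (used > 0) out[used++] = ' ';
        if (needLead) out[used++] = '*';
        memcpy(out + used, start, len);
        used += len;
        if (needTrail) out[used++] = '*';
        out[used] = '\0';
    }
    return 0;
}
```

You would convert the resulting buffer to an NSString before handing it to the search store.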

Refer to the developer documentation for more information on SearchKit query syntax.

A. Initialize a store search from a query string and query options:

NSString *searchString = @"foo* && *bar";
[searchStore prepareSearch:searchString options:kSKSearchOptionDefault];

B. Fetch the search results either at one time or with multiple calls:

NSArray * results = nil;
NSArray * ranks = nil;
[searchStore fetchResults:&results ranksArray:&ranks untilFinished:YES];

C. Normalize the relevancy results:

NSArray * normalizedRanks = [searchStore normalizedRankingsArray:ranks];
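
For readers wondering what normalizing means here, the following is a minimal sketch in plain C of one common approach: divide every score by the largest score so the results fall between 0 and 1. The function name example_normalize_scores is hypothetical, the sketch assumes the raw scores are non-negative floats, and it is meant as an illustration of the idea rather than a copy of what normalizedRankingsArray: does internally.

```c
#include <stddef.h>

/* Hypothetical helper: scales `count` non-negative scores so that the
   largest becomes 1.0. If every score is zero (or the array is empty),
   the output is all zeros to avoid dividing by zero. */
void example_normalize_scores(const float *scores, float *normalized, size_t count)
{
    float max = 0.0f;

    /* First pass: find the largest score. */
    for (size_t i = 0; i < count; i++) {
        if (scores[i] > max) max = scores[i];
    }

    /* Second pass: scale each score relative to the largest. */
    for (size_t i = 0; i < count; i++) {
        normalized[i] = (max > 0.0f) ? scores[i] / max : 0.0f;
    }
}
```

Scaling relative to the best match keeps the ordering of the results intact and gives you values that are easy to feed into a relevance indicator in the interface.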

3. Performing document / term analysis

SearchKit offers complete access to the two-way term / document associations in an inverted vector index. The SearchKit documentation only mentions vector indices in the context of similarity searching, but more generally, vector indices map documents to terms. With a vector index you can acquire a complete list of the unique terms contained within a document.

Normally, searching employs an inverted index approach which maps terms to documents. You specify a term and the search returns the documents in which the term appears. A combined inverted vector index maps both the term to document and document to term associations, allowing you to go back and forth among document / term relationships.

With an inverted vector index you can acquire all the terms in the index, all the documents in the index, all the documents associated with a term, all the terms in a document, and the number of times a term appears in the index or in a document.
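
To make the two directions concrete, here is a toy model in plain C. It is only a sketch: the data and the function names are invented for illustration, and a real SearchKit index is certainly not stored as a small matrix. The point is simply that reading a row answers "which terms are in this document?" (the vector direction) while reading a column answers "which documents contain this term?" (the inverted direction).

```c
#include <stddef.h>

#define TERM_COUNT 4

/* Toy data, invented for illustration: rows are documents, columns are
   terms. A nonzero cell means the document contains the term. */
static const unsigned char toy_index[3][TERM_COUNT] = {
    /* apple banana cherry date */
    { 1, 1, 0, 0 },   /* document 0 */
    { 0, 1, 1, 0 },   /* document 1 */
    { 1, 0, 1, 1 },   /* document 2 */
};

/* Vector direction (document to terms): scan one row. Writes the matching
   term numbers into `outTerms` (room for TERM_COUNT entries) and returns
   how many were found. */
size_t toy_terms_for_document(const unsigned char index[][TERM_COUNT],
                              size_t doc, size_t *outTerms)
{
    size_t found = 0;
    for (size_t t = 0; t < TERM_COUNT; t++) {
        if (index[doc][t]) outTerms[found++] = t;
    }
    return found;
}

/* Inverted direction (term to documents): scan one column. Writes the
   matching document numbers into `outDocs` (room for `docCount` entries)
   and returns how many were found. */
size_t toy_documents_for_term(const unsigned char index[][TERM_COUNT],
                              size_t docCount, size_t term, size_t *outDocs)
{
    size_t found = 0;
    for (size_t d = 0; d < docCount; d++) {
        if (index[d][term]) outDocs[found++] = d;
    }
    return found;
}
```

Alternating between the two functions, feeding the output of one into the other, is the back and forth navigation described above.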

Imagine: the user queries for a term and you return a list of documents. The user selects a document and you display a list of the terms it contains. The user selects a term, and again you show the documents which match it, and so on. This is the principle behind Journler’s lexical capabilities and is, in a word, awesome. SearchKit rocks, and SPSearchStore secures straightforward access to these capabilities.

A. Get all the terms or documents in the search index:

NSArray *allDocs = [searchStore allDocuments];
NSArray *terms = [searchStore allTerms];

B. Get all the unique terms contained in a specific document:

NSURL *docURI = ...;
NSArray *docTerms = [searchStore termsForDocument:docURI];

C. Get all the documents which contain a specific term:

NSString *term = @"term";
NSArray *docs = [searchStore documentsForTerm:term];

Limitations

SPSearchStore provides access to most of SearchKit’s functionality, but there are a couple of noticeable limitations.

1. No support for document hierarchies: SearchKit supports the hierarchical indexing of document content, whether file based or free-standing text. SPSearchStore does not provide an interface to this mechanism.

2. No support for text summarization: SearchKit includes a set of APIs for summarizing documents. SPSearchStore does not support this functionality, choosing instead to focus on query and document/term capabilities. It should, however, be trivial to add a summarization category to NSString.

Concerns

In the past there have been bugs reported with SearchKit’s use of 3rd party Spotlight importers. There have also been more recent reports of memory issues with SearchKit. SPSearchStore could benefit from a comprehensive set of unit tests for different document types and indexing conditions.

Perhaps most annoyingly, SearchKit only indexes the textual content of files. It does not index any other metadata, nor does it index the file’s name. Users will probably expect searches to match file names, but this is something you must accomplish separately, probably using NSPredicate.
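
If you would rather not reach for NSPredicate, the check itself is small. The following is a minimal sketch in plain C with hypothetical function names. It assumes POSIX style paths separated by '/' and folds case for ASCII characters only, so unlike contains[cd] it does nothing about diacritics.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns the part of `path` after the last '/', or `path` itself when
   there is no '/'. */
static const char *example_last_path_component(const char *path)
{
    const char *slash = strrchr(path, '/');
    return slash ? slash + 1 : path;
}

/* Hypothetical helper: returns 1 if the file name at the end of `path`
   contains `needle`, ignoring ASCII case, and 0 otherwise. An empty
   needle matches every name. */
int example_file_name_contains(const char *path, const char *needle)
{
    const char *name = example_last_path_component(path);
    size_t nameLen = strlen(name);
    size_t needleLen = strlen(needle);

    if (needleLen == 0) return 1;
    if (needleLen > nameLen) return 0;

    /* Try every position in the name where the needle could start. */
    for (size_t i = 0; i + needleLen <= nameLen; i++) {
        size_t j = 0;
        while (j < needleLen &&
               tolower((unsigned char)name[i + j]) ==
               tolower((unsigned char)needle[j])) {
            j++;
        }
        if (j == needleLen) return 1;
    }
    return 0;
}
```

You could run a check like this over the same URIs you added to the store and merge the matches with the SearchKit results.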

Conclusion

SearchKit is a powerful, flexible Core Foundation API that can be difficult for Cocoa programmers to wrap their heads around. SPSearchStore is an Objective-C class which captures much of the SearchKit functionality. With a minimal amount of coding Cocoa programmers now have access to this incredible API.

SPSearchStore is available free of charge under the terms of a BSD license. If the terms of this license do not meet your needs you may optionally purchase a non-attribution license instead. For more information, refer to the Licensing page.

Download SPSearchStore
