KWIC (Keywords in Context) Output
(1Q18)
Keywords In Context (KWIC) helps users quickly scan through search results by listing hits surrounded by their context. eXist provides a KWIC module that is not bound to a specific index or query operation, but can be applied to query results from all indexes that support match highlighting. This includes the Lucene-based index and the ngram index.
The documentation search function on eXist's home page is a good example. It queries documents written in DocBook format. However, the KWIC module has also been successfully used with different schemas (e.g. TEI) and languages (e.g. Chinese).
Using the Module
The KWIC module is entirely written in XQuery. To use the module, import its namespace into your query (you don't need to specify a location):
import module namespace kwic="http://exist-db.org/xquery/kwic";
The easiest way to get KWIC output is to call the kwic:summarize function on an element node returned from a full text or ngram query:
import module namespace kwic="http://exist-db.org/xquery/kwic";

for $hit in doc("/db/shakespeare/plays/hamlet.xml")//SPEECH[ft:query(., "'nature'")]
order by ft:score($hit) descending
return
    kwic:summarize($hit, <config width="40"/>)
Every call to kwic:summarize will return an HTML paragraph containing 3 span elements with the text before and after each match, as well as the match text itself:
<p>
<span class="previous">
... s effect, sir; after what flourish your
</span>
<span class="hi">
nature
</span>
<span class="following">
will.
</span>
</p>
The <config> element, passed to kwic:summarize as the second parameter, determines the appearance of the generated HTML. It recognizes 3 attributes:
width
-
The maximum number of characters to be printed before and after the match.
table
-
By default, kwic:summarize returns an HTML paragraph with spans. If table="yes", it will return an HTML table row (<tr>) element instead, with each text chunk enclosed in a table cell (<td>) element.
link
-
If present, each match will be enclosed within a link, using the URI in the link attribute as target.
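As a rough sketch of how the table and link attributes might be combined, the query below collects the rows returned by kwic:summarize inside a <table> element. The link target view.xql and the class name kwic are placeholders chosen for illustration, not something provided by eXist.

```xquery
import module namespace kwic="http://exist-db.org/xquery/kwic";

(: Sketch only: with table="yes" each match comes back as a <tr>,
   so the rows need an enclosing <table>.
   "view.xql" is a hypothetical link target; substitute your own URI. :)
<table class="kwic">{
    for $hit in doc("/db/shakespeare/plays/hamlet.xml")//SPEECH[ft:query(., "nature")]
    order by ft:score($hit) descending
    return
        kwic:summarize($hit, <config width="40" table="yes" link="view.xql"/>)
}</table>
```

The same width setting applies in table mode, so switching between paragraph and table output should only require changing the table attribute and the wrapper element.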
Using a callback function for more fine-grained control
If you look at the output of the query above, you may notice that a space is missing between words if the previous or following chunk extends to a different <LINE> element. It would also be nicer to display text from <LINE> elements only and to ignore <SPEAKER> or <STAGEDIR> elements. This can be achieved with the help of a callback function:
import module namespace kwic="http://exist-db.org/xquery/kwic";

declare function local:filter($node as node(), $mode as xs:string) as xs:string? {
    if ($node/parent::SPEAKER or $node/parent::STAGEDIR) then
        ()
    else if ($mode eq 'before') then
        concat($node, ' ')
    else
        concat(' ', $node)
};

for $hit in doc("/db/shakespeare/plays/hamlet.xml")//SPEECH[ft:query(., "'nature'")]
order by ft:score($hit) descending
return
    kwic:summarize($hit, <config width="40"/>, local:filter#2)
The third parameter to kwic:summarize here is a reference to a function accepting 2 arguments:
-
A single text node which should be appended or prepended to the current text chunk
-
A string indicating the current direction in which text is appended: before or after.
The function can return the empty sequence if the current node should be ignored (for instance if it belongs to a footnote which should not be displayed). Otherwise it must return a single string.
The local:filter function above first checks if the passed node has a SPEAKER or STAGEDIR parent. If so, it ignores that node by returning the empty sequence. If not, the function adds a single whitespace before or after the string, so adjacent lines will be properly separated.
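The footnote case mentioned above could be handled in much the same way. The following is a minimal sketch for TEI documents, assuming that footnotes are encoded as tei:note elements, that the documents live in a hypothetical collection /db/tei, and that a Lucene index is defined on tei:p. The function name local:skip-notes is also just an illustration.

```xquery
import module namespace kwic="http://exist-db.org/xquery/kwic";

declare namespace tei="http://www.tei-c.org/ns/1.0";

(: Hypothetical callback: ignore any text inside a tei:note,
   and pad all other text with a space on the appropriate side. :)
declare function local:skip-notes($node as node(), $mode as xs:string) as xs:string? {
    if ($node/ancestor::tei:note) then
        ()
    else if ($mode eq 'before') then
        concat($node, ' ')
    else
        concat(' ', $node)
};

(: "/db/tei" is an assumed collection path; adjust it to your data. :)
for $hit in collection("/db/tei")//tei:p[ft:query(., "nature")]
order by ft:score($hit) descending
return
    kwic:summarize($hit, <config width="40"/>, local:skip-notes#2)
```

Testing ancestor:: rather than parent:: means text nested deeper inside a note (for example within emphasis markup) is skipped as well.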
Advanced Use
Using kwic:summarize, you will get one KWIC-formatted item for every match, even if the matches are in the same paragraph. Also, the context from which the text is taken is always the same: the element you queried. To get more control over the output, you can directly call kwic:get-summary, which is the module's core function.
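kwic:get-summary expects 3 or 4 parameters:
-
The current context root
-
The match object to process
-
Parameters 3 and 4 are the configuration element and the optional callback function, with the same meaning as the second and third parameters of kwic:summarize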
Before passing nodes to kwic:get-summary you have to expand them, which basically means to create an in-memory copy in which all matches are properly marked up with <exist:match> tags. The main part of the query should look as follows:
for $hit in doc("/db/shakespeare/plays/hamlet.xml")//SPEECH[ft:query(., "'nature'")]
let $expanded := kwic:expand($hit)
order by ft:score($hit) descending
return
    kwic:get-summary($expanded, ($expanded//exist:match)[1], <config width="40"/>,
        local:filter#2)
In this example, we select the first <exist:match> only, thus ignoring all other matches within $expanded.
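Selecting the match yourself also makes it possible to choose something between "first match only" and "every match". As a sketch, the query below keeps at most two matches per speech by using the standard subsequence function. The limit of two is arbitrary, and the three-parameter form of kwic:get-summary (without a callback) is used to keep the example short.

```xquery
import module namespace kwic="http://exist-db.org/xquery/kwic";

for $hit in doc("/db/shakespeare/plays/hamlet.xml")//SPEECH[ft:query(., "nature")]
let $expanded := kwic:expand($hit)
order by ft:score($hit) descending
return
    (: Keep at most the first two matches of each speech (arbitrary limit). :)
    for $match in subsequence($expanded//exist:match, 1, 2)
    return
        kwic:get-summary($expanded, $match, <config width="40"/>)
```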
Sometimes you may also want to change the context to restrict the KWIC display to certain elements within the larger query context, for instance paragraphs within sections. The following example still queries <SPEECH> but displays a KWIC entry for each <LINE> with a match, grouped by speech:
for $hit in doc("/db/shakespeare/plays/hamlet.xml")//SPEECH[ft:query(., "nature")]
let $expanded := kwic:expand($hit)
order by ft:score($hit) descending
return
    <div class="speech">{
        for $line in $expanded//LINE[.//exist:match]
        return
            kwic:get-summary($line, ($line/exist:match)[1], <config width="40"/>,
                local:filter#2)
    }</div>
You might wonder why we don't query <LINE> directly to get a different context, as in:
//SPEECH[ft:query(LINE, "nature")]
This is because Lucene computes the relevance of each match with respect to the SPEECH context, not LINE. If we queried LINE, each single line would get a match score and the matches would end up in a completely different order.
Marking up Matches without using KWIC
Sometimes you don't want to use the KWIC module, but still would like an indication of where matches were found in the text. eXist's XML serializer can automatically highlight matches when it writes out a piece of XML. All the matches will be surrounded by an <exist:match> tag.
You can achieve the same within an XQuery by calling the extension function util:expand:
let $expanded := util:expand($hit, "expand-xincludes=no")
return $expanded//exist:match
util:expand returns a copy of the XML fragment it received in its first parameter, which, unless configured otherwise, has all matches wrapped into <exist:match> tags.
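One possible use of the expanded copy is to report how many matches each hit contains without generating any KWIC output. The sketch below assumes the same Hamlet document as the earlier examples. The <hit> element and its attribute names are made up for illustration.

```xquery
for $hit in doc("/db/shakespeare/plays/hamlet.xml")//SPEECH[ft:query(., "nature")]
(: In-memory copy of the speech with matches wrapped in exist:match :)
let $expanded := util:expand($hit)
order by ft:score($hit) descending
return
    (: <hit> is a hypothetical result element, not an eXist format :)
    <hit speaker="{$hit/SPEAKER[1]}" matches="{count($expanded//exist:match)}"/>
```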
Please note that util:expand will not expand matches in Lucene fields. Use ft:highlight-field-matches instead. For more information, see lucene.xml#expand-fields.