java.lang.Object
- com.gengoai.hermes.lexicon.Lexicon

All Implemented Interfaces:

Extractor, PrefixSearchable, WordList, Serializable, Iterable<String>, Predicate<HString>

Direct Known Subclasses:

PersistentLexicon, TrieLexicon
```
public abstract class Lexicon
extends Object
implements Predicate<HString>, WordList, Extractor, PrefixSearchable, Serializable
```
A traditional approach to information extraction incorporates the use of lexicons, also called gazetteers, for finding specific lexical items in text. Hermes's Lexicon classes provide the ability to match lexical items using a greedy longest match first or maximum span probability strategy. Both matching strategies allow for case-sensitive or case-insensitive matching and the use of constraints (using the Lyre expression language), such as part-of-speech, on the match.
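
To make the greedy longest match first strategy concrete, the following plain-Java sketch scans a token list and, at each position, prefers the longest span found in a word set. It does not use the Hermes API: the class `GreedyLongestMatchSketch`, its `match` method, and the representation of the lexicon as a `Set<String>` of space-joined phrases are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only: a plain-Java stand-in, not the Hermes implementation.
final class GreedyLongestMatchSketch {

    /**
     * Scans the tokens left to right. At each position the longest candidate span
     * (up to maxTokens tokens) is tried first; the first span found in the lexicon
     * is accepted and scanning resumes after it.
     * For case-insensitive matching the lexicon is assumed to hold lower-cased phrases.
     */
    static List<String> match(List<String> tokens, Set<String> lexicon,
                              int maxTokens, boolean caseSensitive) {
        List<String> matches = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int matchedLength = 0;
            int longest = Math.min(maxTokens, tokens.size() - i);
            for (int len = longest; len >= 1; len--) {
                String span = String.join(" ", tokens.subList(i, i + len));
                String key = caseSensitive ? span : span.toLowerCase(Locale.ROOT);
                if (lexicon.contains(key)) {
                    matches.add(span);
                    matchedLength = len;
                    break;
                }
            }
            // Skip past the match, or advance one token if nothing matched here.
            i += Math.max(1, matchedLength);
        }
        return matches;
    }

    public static void main(String[] args) {
        Set<String> lexicon = Set.of("new york", "new york city", "york");
        List<String> tokens = List.of("I", "love", "New", "York", "City");
        System.out.println(match(tokens, lexicon, 3, false));
    }
}
```

With the sample data in `main`, the three-token span is preferred over the shorter entries that start at the same position, which is the defining property of the greedy strategy.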

Lexicons are managed using the LexiconManager, which acts as a cache associating lexicons with a name and a language. This allows for lexicons to be defined via configuration and then to be loaded and retrieved by their name (this is particularly useful for annotators that use lexicons).
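
The caching behaviour described above can be pictured with a small stand-alone sketch. This is not the LexiconManager API: the class `LexiconCacheSketch`, its `get` method, the string key built from language and name, and the `Supplier` used as a loader are all hypothetical names chosen for this example.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Illustrative sketch only: a hypothetical cache, not the Hermes LexiconManager.
final class LexiconCacheSketch {
    // Cached lexicons (modelled here as plain word sets), keyed by language and name.
    private final Map<String, Set<String>> cache = new ConcurrentHashMap<>();
    // Counts how often a loader was actually invoked.
    final AtomicInteger loads = new AtomicInteger();

    /** Returns the cached lexicon for (name, language), loading it on first use. */
    Set<String> get(String name, String language, Supplier<Set<String>> loader) {
        String key = language + "::" + name;
        return cache.computeIfAbsent(key, k -> {
            loads.incrementAndGet();
            return loader.get();
        });
    }
}
```

Requesting the same name and language twice invokes the loader once; the same name under a different language is treated as a separate lexicon.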

Lexicons are defined using a LexiconSpecification in the following format:
```
lexicon:(mem|disk):name(:(csv|json))*::RESOURCE(;ARG=VALUE)*
```
The scheme of the specification is "lexicon" and the currently supported protocols are:
- mem: An in-memory Trie-based lexicon.
- disk: A persistent on-disk lexicon.

The name of the lexicon is used during annotation to mark the provider. Additionally, a format (csv or json) can be specified to indicate the lexicon format when creating in-memory lexicons; json is the default if none is provided. Finally, a number of query parameters (ARG=VALUE) can be given from the following choices:
- caseSensitive=(true|false): Is the lexicon case-sensitive (true) or case-insensitive (false) (default false).
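
As a rough illustration of how a specification string in the format above breaks down into its parts, the sketch below parses one with a regular expression. It is not the LexiconSpecification parser: the class `LexiconSpecSketch`, the regular expression, the map keys, and the resource paths in the usage note are assumptions. The format is treated as optional, with the documented defaults (json, caseSensitive=false) filled in.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: a hypothetical parser, not Hermes's LexiconSpecification.
final class LexiconSpecSketch {
    // Groups: 1 = protocol, 2 = name, 3 = optional format, 4 = resource, 5 = ";ARG=VALUE" pairs.
    private static final Pattern SPEC = Pattern.compile(
            "^lexicon:(mem|disk):([^:;]+)(?::(csv|json))?::([^;]+)((?:;[^=;]+=[^;]*)*)$");

    /** Splits a specification into its parts, applying the documented defaults. */
    static Map<String, String> parse(String spec) {
        Matcher m = SPEC.matcher(spec);
        if (!m.matches()) {
            throw new IllegalArgumentException("Invalid lexicon specification: " + spec);
        }
        Map<String, String> parts = new LinkedHashMap<>();
        parts.put("protocol", m.group(1));
        parts.put("name", m.group(2));
        parts.put("format", m.group(3) == null ? "json" : m.group(3)); // json is the default
        parts.put("resource", m.group(4));
        parts.put("caseSensitive", "false"); // documented default
        String args = m.group(5);
        if (!args.isEmpty()) {
            // Drop the leading ';' and split the remaining ARG=VALUE pairs.
            for (String arg : args.substring(1).split(";")) {
                int eq = arg.indexOf('=');
                parts.put(arg.substring(0, eq), arg.substring(eq + 1));
            }
        }
        return parts;
    }
}
```

For example, `parse("lexicon:mem:person:csv::/data/people.csv;caseSensitive=true")` yields the protocol `mem`, the name `person`, the format `csv`, the resource `/data/people.csv`, and `caseSensitive` set to `true`; the path is a made-up placeholder.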
Author:

David B. Bracewell

See Also:

Serialized Form

Constructor Summary

Constructors
Constructor Description

Lexicon()

Method Summary

Modifier and Type Method Description

abstract void add(LexiconEntry lexiconEntry)
Adds an entry to the lexicon

void addAll(@NonNull Iterable<LexiconEntry> lexiconEntries)
Adds all lexicon entries in the given iterable to the lexicon

abstract Set<LexiconEntry> entries()

Extraction extract(@NonNull HString source)
Generate an Extraction from the given HString.

abstract Set<LexiconEntry> get(@NonNull String word)
Returns the lexicon entries associated with a given word in the Lexicon or an empty set if there are none.

abstract int getMaxLemmaLength()

abstract int getMaxTokenLength()

abstract String getName()

double getProbability(@NonNull HString hString)
Gets the maximum probability for matching the given HString

double getProbability(@NonNull HString hString, @NonNull Tag tag)
Gets the maximum probability for matching the given HString with the given Tag

double getProbability(@NonNull String lemma)
Gets the maximum probability for matching the given String

double getProbability(@NonNull String string, @NonNull Tag tag)
Gets the maximum probability for matching the given String with the given tag

Optional<String> getTag(@NonNull HString hString)
Gets the first matched tag, if any, for the given HString

Optional<String> getTag(@NonNull String lemma)
Gets the first matched tag, if any, for the given String

abstract boolean isCaseSensitive()
Is the Lexicon case sensitive or not

abstract boolean isProbabilistic()
Is the Lexicon probabilistic or not

abstract List<LexiconEntry> match(@NonNull HString string)
Gets the matched entries for a given HString

abstract List<LexiconEntry> match(@NonNull String term)
Returns the lexicon entries matching the given term, or an empty list if there are none.

protected String normalize(CharSequence sequence)
Normalizes the string based on whether the lexicon is case sensitive or not.

abstract int size()
The number of lexical items in the lexicon

boolean test(@NonNull HString hString)

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface java.lang.Iterable
forEach, iterator, spliterator

Methods inherited from interface java.util.function.Predicate
and, negate, or

Methods inherited from interface com.gengoai.hermes.lexicon.PrefixSearchable
isPrefixMatch, isPrefixMatch, prefixes

Methods inherited from interface com.gengoai.hermes.lexicon.WordList
contains, contains

Constructor Detail

Lexicon

public Lexicon()

Method Detail

add

public abstract void add(LexiconEntry lexiconEntry)

Adds an entry to the lexicon

Parameters:

lexiconEntry - the lexicon entry to add

addAll

public void addAll(@NonNull Iterable<LexiconEntry> lexiconEntries)

Adds all lexicon entries in the given iterable to the lexicon

Parameters:

lexiconEntries - the lexicon entries to add

entries

public abstract Set<LexiconEntry> entries()

Returns:

the set of lexicon entries in the lexicon

extract

public Extraction extract(@NonNull HString source)

Description copied from interface: Extractor

Generate an Extraction from the given HString.

Specified by:

extract in interface Extractor

Parameters:

source - the source text from which we will generate an Extraction

Returns:

the Extraction

get

public abstract Set<LexiconEntry> get(@NonNull String word)

Returns the lexicon entries associated with a given word in the Lexicon or an empty set if there are none.

Parameters:

word - the word in the lexicon whose entries we want

Returns:

the lexicon entries associated with the given word, or an empty set if there are none

getMaxLemmaLength

public abstract int getMaxLemmaLength()

Returns:

the max lemma length

getMaxTokenLength

public abstract int getMaxTokenLength()

Returns:

the max token length

getName

public abstract String getName()

Returns:

the name of the lexicon

getProbability

public final double getProbability(@NonNull HString hString)

Gets the maximum probability for matching the given HString

Parameters:

hString - the HString to match against

Returns:

the maximum probability for the HString

getProbability

public final double getProbability(@NonNull String lemma)

Gets the maximum probability for matching the given String

Parameters:

lemma - the String to match against

Returns:

the maximum probability for the String

getProbability

public final double getProbability(@NonNull HString hString, @NonNull Tag tag)

Gets the maximum probability for matching the given HString with the given Tag

Parameters:

hString - the HString to match against

tag - the tag that must be present for the match

Returns:

the maximum probability for the HString with the given tag

getProbability

public final double getProbability(@NonNull String string, @NonNull Tag tag)

Gets the maximum probability for matching the given String with the given tag

Parameters:

string - the String to match against

tag - the tag that must be present for the match

Returns:

the maximum probability for the String with the given tag
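
The idea behind the getProbability overloads, taking the maximum probability over the matching entries and optionally restricting to a tag, can be sketched as follows. The `Entry` class is a stand-in for illustration, not the Hermes `LexiconEntry`; tags are modelled as plain strings rather than `Tag` objects, and returning 0 when nothing matches is an assumption of this example rather than documented behaviour.

```java
import java.util.List;

// Illustrative sketch only: stand-in types, not the Hermes LexiconEntry or Tag classes.
final class MaxProbabilitySketch {

    static final class Entry {
        final String lemma;
        final String tag;
        final double probability;

        Entry(String lemma, String tag, double probability) {
            this.lemma = lemma;
            this.tag = tag;
            this.probability = probability;
        }
    }

    /**
     * Returns the highest probability among the matched entries. When requiredTag
     * is non-null, only entries carrying that tag are considered. Falls back to 0
     * if no entry qualifies (an assumed fallback for this sketch).
     */
    static double maxProbability(List<Entry> matched, String requiredTag) {
        return matched.stream()
                .filter(e -> requiredTag == null || requiredTag.equals(e.tag))
                .mapToDouble(e -> e.probability)
                .max()
                .orElse(0.0);
    }
}
```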

getTag

public final Optional<String> getTag(@NonNull String lemma)

Gets the first matched tag, if any, for the given String

Parameters:

lemma - the String to match against

Returns:

the first matched tag for the String

getTag

public final Optional<String> getTag(@NonNull HString hString)

Gets the first matched tag, if any, for the given HString

Parameters:

hString - the HString to match against

Returns:

the first matched tag for the HString
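
Because getTag returns an `Optional<String>`, callers handle the "no tag" case explicitly instead of checking for null. The sketch below shows one way "first matched tag, if any" could be computed from the tags of the matched entries; the class `FirstTagSketch` and the assumption that untagged entries are represented by null are hypothetical.

```java
import java.util.List;
import java.util.Objects;
import java.util.Optional;

// Illustrative sketch only: not the Hermes implementation of getTag.
final class FirstTagSketch {

    /** Returns the first non-null tag in match order, or an empty Optional if there is none. */
    static Optional<String> firstTag(List<String> tagsOfMatchedEntries) {
        return tagsOfMatchedEntries.stream()
                .filter(Objects::nonNull)
                .findFirst();
    }
}
```

A caller can then write `firstTag(tags).orElse("UNKNOWN")`, where the fallback value is again only an example.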

isCaseSensitive

public abstract boolean isCaseSensitive()

Is the Lexicon case sensitive or not

Returns:

True if the lexicon is case sensitive, False if not

isProbabilistic

public abstract boolean isProbabilistic()

Is the Lexicon probabilistic or not

Returns:

True if the lexicon is probabilistic, False if not

match

public abstract List<LexiconEntry> match(@NonNull HString string)

Gets the matched entries for a given HString

Parameters:

string - the HString to match against

Returns:

the entries matching the HString

match

public abstract List<LexiconEntry> match(@NonNull String term)

Returns the lexicon entries matching the given term, or an empty list if there are none.

Parameters:

term - the term to match against the lexicon

Returns:

the lexicon entries matching the given term, or an empty list if there are none

normalize

protected String normalize(CharSequence sequence)

Normalizes the string based on whether the lexicon is case sensitive or not.

Parameters:

sequence - the character sequence to normalize

Returns:

the normalized string
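
One plausible reading of this normalization is that a case-sensitive lexicon keeps the text unchanged while a case-insensitive one lower-cases it before lookup. The sketch below shows that reading only; it is an assumption, not the Hermes source, and the class `NormalizeSketch` and the explicit `caseSensitive` parameter are introduced for illustration.

```java
import java.util.Locale;

// Illustrative sketch only: an assumed normalization, not the Hermes implementation.
final class NormalizeSketch {

    /** Keeps the text as-is for case-sensitive lexicons; lower-cases it otherwise. */
    static String normalize(CharSequence sequence, boolean caseSensitive) {
        String text = sequence.toString();
        return caseSensitive ? text : text.toLowerCase(Locale.ROOT);
    }
}
```

Using `Locale.ROOT` keeps the lower-casing independent of the JVM's default locale, which avoids surprises such as the Turkish dotless i.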

size

public abstract int size()

The number of lexical items in the lexicon

Specified by:

size in interface WordList

Returns:

the number of lexical items in the lexicon

test

public final boolean test(@NonNull HString hString)

Specified by:

test in interface Predicate<HString>
