Class Lexicon

  • All Implemented Interfaces:
    Extractor, PrefixSearchable, WordList, Serializable, Iterable<String>, Predicate<HString>
    Direct Known Subclasses:
    PersistentLexicon, TrieLexicon

    public abstract class Lexicon
    extends Object
    implements Predicate<HString>, WordList, Extractor, PrefixSearchable, Serializable

    A traditional approach to information extraction uses lexicons, also called gazetteers, to find specific lexical items in text. Hermes's Lexicon classes can match lexical items using either a greedy longest-match-first strategy or a maximum-span-probability strategy. Both strategies support case-sensitive or case-insensitive matching and constraints on the match (written in the Lyre expression language), such as part-of-speech.
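
    The greedy longest-match-first idea can be illustrated with a minimal, self-contained sketch. It is not Hermes's implementation: the class GreedyLongestMatch, the whitespace-joined token phrases, and the plain Map used as a lexicon are all assumptions made for illustration, and constraints and probabilities are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration class; not part of Hermes.
final class GreedyLongestMatch {

    /** Scans tokens left to right, taking the longest lexicon phrase at each position. */
    static List<String> match(List<String> tokens, Map<String, String> lexicon,
                              int maxTokenLength, boolean caseSensitive) {
        List<String> matches = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int matchedLength = 0;
            // Try the longest candidate span first and shrink until a phrase is found.
            for (int len = Math.min(maxTokenLength, tokens.size() - i); len >= 1; len--) {
                String phrase = String.join(" ", tokens.subList(i, i + len));
                String key = caseSensitive ? phrase : phrase.toLowerCase(Locale.ROOT);
                String tag = lexicon.get(key);
                if (tag != null) {
                    matches.add(phrase + "/" + tag);
                    matchedLength = len;
                    break;
                }
            }
            // Skip past the match, or advance one token when nothing matched.
            i += Math.max(matchedLength, 1);
        }
        return matches;
    }

    public static void main(String[] args) {
        // Keys are lower-cased because the lookup below is case-insensitive.
        Map<String, String> lexicon = Map.of(
                "new york", "CITY",
                "new york city", "CITY",
                "york", "CITY");
        List<String> tokens = List.of("I", "love", "New", "York", "City", "!");
        System.out.println(match(tokens, lexicon, 3, false));
    }
}
```

    Bounding the candidate span by a maximum token length (compare getMaxTokenLength() below) keeps the inner loop short, and consuming the whole matched span prevents shorter entries such as "york" from matching inside a longer one.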

    Lexicons are managed using the LexiconManager, which acts as a cache associating lexicons with a name and a language. This allows for lexicons to be defined via configuration and then to be loaded and retrieved by their name (this is particularly useful for annotators that use lexicons).

    Lexicons are defined using a LexiconSpecification in the following format:

     
     lexicon:(mem|disk):name(:(csv|json))*::RESOURCE(;ARG=VALUE)*
     

    The scheme of the specification is "lexicon" and the currently supported protocols are:

    • mem: An in-memory Trie-based lexicon.
    • disk: A persistent on-disk lexicon.

    The name of the lexicon is used during annotation to mark the provider. A format (csv or json) can also be specified for in-memory lexicons; json is the default when none is given. Finally, any number of query parameters (ARG=VALUE) can be given from the following choices:

    • caseSensitive=(true|false): Is the lexicon case-sensitive (true) or case-insensitive (false) (default false).
    • defaultTag=TAG: The default tag value for an entry when one is not defined (default null).
    • language=LANGUAGE: The default language of entries in the lexicon (default Hermes.defaultLanguage()).
    • and the following for CSV lexicons:
      • lemma=INDEX: The index in the csv row containing the lemma (default 0).
      • tag=INDEX: The index in the csv row containing the tag (default 1).
      • probability=INDEX: The index in the csv row containing the probability (default 2).
      • constraint=INDEX: The index in the csv row containing the constraint (default 3).
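
    To make the specification format concrete, the following sketch parses a specification string with a regular expression and fills in the documented defaults. It is an illustration only and does not use Hermes's LexiconSpecification: the class LexiconSpecSketch and its fields are hypothetical, at most one format segment is accepted, and the defaultTag and language defaults are omitted because they depend on Hermes itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; this is not Hermes's LexiconSpecification class.
final class LexiconSpecSketch {
    // lexicon:(mem|disk):name[:(csv|json)]::RESOURCE[;ARG=VALUE]...
    private static final Pattern SPEC = Pattern.compile(
            "^lexicon:(mem|disk):([^:;]+)(?::(csv|json))?::([^;]+)((?:;[^=;]+=[^;]*)*)$");

    final String protocol;
    final String name;
    final String format;
    final String resource;
    final Map<String, String> args;

    private LexiconSpecSketch(String protocol, String name, String format,
                              String resource, Map<String, String> args) {
        this.protocol = protocol;
        this.name = name;
        this.format = format;
        this.resource = resource;
        this.args = args;
    }

    static LexiconSpecSketch parse(String spec) {
        Matcher m = SPEC.matcher(spec);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a lexicon specification: " + spec);
        }
        // json is the documented default format.
        String format = m.group(3) == null ? "json" : m.group(3);

        // Start from the documented defaults, then apply explicit ARG=VALUE pairs.
        Map<String, String> args = new LinkedHashMap<>();
        args.put("caseSensitive", "false");
        if (format.equals("csv")) {
            args.put("lemma", "0");
            args.put("tag", "1");
            args.put("probability", "2");
            args.put("constraint", "3");
        }
        for (String pair : m.group(5).split(";")) {
            if (pair.isEmpty()) {
                continue; // the argument list starts with ';', so the first piece is empty
            }
            int eq = pair.indexOf('=');
            args.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return new LexiconSpecSketch(m.group(1), m.group(2), format, m.group(4), args);
    }

    public static void main(String[] unused) {
        // The resource path below is a made-up example.
        LexiconSpecSketch spec =
                parse("lexicon:mem:person:csv::/data/people.csv;caseSensitive=true;tag=2");
        System.out.println(spec.protocol + " " + spec.name + " " + spec.format
                + " " + spec.resource + " " + spec.args);
    }
}
```

    In this sketch a CSV specification that overrides only tag keeps the default column indices for lemma, probability, and constraint.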

    Author:
    David B. Bracewell
    See Also:
    Serialized Form
    • Constructor Detail

      • Lexicon

        public Lexicon()
    • Method Detail

      • add

        public abstract void add(LexiconEntry lexiconEntry)
        Adds an entry to the lexicon
        Parameters:
        lexiconEntry - the lexicon entry to add
      • addAll

        public void addAll(@NonNull Iterable<LexiconEntry> lexiconEntries)
        Adds all lexicon entries in the given iterable to the lexicon
        Parameters:
        lexiconEntries - the lexicon entries to add
      • entries

        public abstract Set<LexiconEntry> entries()
        Returns:
        the set of lexicon entries in the lexicon
      • extract

        public Extraction extract(@NonNull HString source)
        Description copied from interface: Extractor
        Generate an Extraction from the given HString.
        Specified by:
        extract in interface Extractor
        Parameters:
        source - the source text from which we will generate an Extraction
        Returns:
        the Extraction
      • get

        public abstract Set<LexiconEntry> get(@NonNull String word)
        Returns the set of lexicon entries associated with a given word in the Lexicon, or an empty set if there are none.
        Parameters:
        word - the word in the lexicon whose entries we want
        Returns:
        the set of lexicon entries associated with the given word, or an empty set if there are none
      • getMaxLemmaLength

        public abstract int getMaxLemmaLength()
        Returns:
        the max lemma length
      • getMaxTokenLength

        public abstract int getMaxTokenLength()
        Returns:
        the max token length
      • getName

        public abstract String getName()
        Returns:
        the name of the lexicon
      • getProbability

        public final double getProbability(@NonNull HString hString)
        Gets the maximum probability for matching the given HString
        Parameters:
        hString - the HString to match against
        Returns:
        the maximum probability for the HString
      • getProbability

        public final double getProbability(@NonNull String lemma)
        Gets the maximum probability for matching the given String
        Parameters:
        lemma - the String to match against
        Returns:
        the maximum probability for the String
      • getProbability

        public final double getProbability(@NonNull HString hString,
                                           @NonNull Tag tag)
        Gets the maximum probability for matching the given HString with the given Tag
        Parameters:
        hString - the HString to match against
        tag - the tag that must be present for the match
        Returns:
        the maximum probability for the HString with the given tag
      • getProbability

        public final double getProbability(@NonNull String string,
                                           @NonNull Tag tag)
        Gets the maximum probability for matching the given String with the given tag
        Parameters:
        string - the String to match against
        tag - the tag that must be present for the match
        Returns:
        the maximum probability for the String with the given tag
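
        The "maximum probability" wording of the getProbability overloads can be read as taking the highest probability among the entries that match, optionally restricted to a tag. The sketch below shows that reading on plain Java objects. The Entry class, exact-string tag comparison, and the 0.0 result when nothing matches are assumptions for illustration, not documented Hermes behavior.

```java
import java.util.List;

// Hypothetical illustration class; not part of Hermes.
final class MaxProbabilitySketch {

    /** Minimal stand-in for a lexicon entry: lemma, tag, and probability. */
    static final class Entry {
        final String lemma;
        final String tag;
        final double probability;

        Entry(String lemma, String tag, double probability) {
            this.lemma = lemma;
            this.tag = tag;
            this.probability = probability;
        }
    }

    /**
     * Returns the highest probability among entries for the lemma.
     * When tag is non-null only entries carrying that tag are considered.
     * Returns 0.0 when no entry qualifies (an assumption of this sketch).
     */
    static double maxProbability(List<Entry> entries, String lemma, String tag) {
        return entries.stream()
                .filter(e -> e.lemma.equals(lemma))
                .filter(e -> tag == null || tag.equals(e.tag))
                .mapToDouble(e -> e.probability)
                .max()
                .orElse(0.0);
    }

    public static void main(String[] args) {
        List<Entry> entries = List.of(
                new Entry("apple", "FRUIT", 0.4),
                new Entry("apple", "COMPANY", 0.9),
                new Entry("pear", "FRUIT", 0.7));
        System.out.println(maxProbability(entries, "apple", null));
        System.out.println(maxProbability(entries, "apple", "FRUIT"));
    }
}
```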
      • getTag

        public final Optional<String> getTag(@NonNull String lemma)
        Gets the first matched tag, if any, for the given String
        Parameters:
        lemma - the String to match against
        Returns:
        the first matched tag for the String, if any
      • getTag

        public final Optional<String> getTag(@NonNull HString hString)
        Gets the first matched tag, if any, for the given HString
        Parameters:
        hString - the HString to match against
        Returns:
        the first matched tag for the HString, if any
      • isCaseSensitive

        public abstract boolean isCaseSensitive()
        Is the Lexicon case sensitive or not
        Returns:
        True if the lexicon is case sensitive, False if not
      • isProbabilistic

        public abstract boolean isProbabilistic()
        Is the Lexicon probabilistic or not
        Returns:
        True if the lexicon is probabilistic, False if not
      • match

        public abstract List<LexiconEntry> match(@NonNull HString string)
        Gets the matched entries for a given HString
        Parameters:
        string - the HString to match against
        Returns:
        the entries matching the HString
      • match

        public abstract List<LexiconEntry> match(@NonNull String term)
        Returns the lexicon entries matching a given term in the Lexicon, or an empty list if there are none.
        Parameters:
        term - the term in the lexicon whose entries we want
        Returns:
        the lexicon entries matching the given term, or an empty list if there are none
      • normalize

        protected String normalize(CharSequence sequence)
        Normalizes the string based on whether the lexicon is case sensitive or not.
        Parameters:
        sequence - the character sequence to normalize
        Returns:
        the normalized string
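
        One plausible reading of this normalization is sketched below: a case-insensitive lexicon lower-cases its input before lookup, while a case-sensitive one leaves it unchanged. This is an assumption for illustration; the class NormalizeSketch is hypothetical and the actual method may apply further normalization.

```java
import java.util.Locale;

// Hypothetical illustration class; not part of Hermes.
final class NormalizeSketch {

    /** Lower-cases the input unless the lexicon is case sensitive. */
    static String normalize(CharSequence sequence, boolean caseSensitive) {
        String s = sequence.toString();
        // Locale.ROOT avoids locale-specific surprises such as the Turkish dotless i.
        return caseSensitive ? s : s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("New York", false));
        System.out.println(normalize("New York", true));
    }
}
```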
      • size

        public abstract int size()
        The number of lexical items in the lexicon
        Specified by:
        size in interface WordList
        Returns:
        the number of lexical items in the lexicon