Class TokenRegex

  • All Implemented Interfaces:
    Extractor, Serializable

    public final class TokenRegex
    extends Object
    implements Serializable, Extractor

    Hermes provides a token-based regular expression engine that matches on arbitrary annotation types, relation types, and attributes, while providing many of the operators available in standard Java regular expressions. As with Java regular expressions, a token regular expression is specified as a string and compiled into an instance of TokenRegex. The TokenRegex class has many of the same methods as Java's regular expression class, but returns a TokenMatcher instead of a Matcher. The TokenMatcher class allows for iterating over the matches, extracting the match or the named groups within it, retrieving the starting and ending offsets of the match, and converting the match into a TokenMatch object that records the current state of the match. Token regular expressions can also act as extractors, where the extraction yields the HStrings matched by the default group. An example of compiling a regular expression, creating a matcher, and iterating over the matches is as follows:

     
        TokenRegex regex = TokenRegex.compile(pattern);
        TokenMatcher matcher = regex.matcher(document);
        while (matcher.find()) {
                System.out.println(matcher.group());
        }
     
     

    The syntax for token-based regular expressions borrows from the Lyre Expression Language where possible. Token-based regular expressions differ from Lyre in that they operate over sequences of HStrings, whereas Lyre operates on single HString units. As a result, the two syntaxes differ in places. Details on the syntax can be found in the Hermes User Guide.

    Author:
    David B. Bracewell
    See Also:
    Serialized Form
    • Method Detail

      • compile

        public static TokenRegex compile(@NonNull String pattern)
                                  throws ParseException
        Compiles the given pattern into a TokenRegex object
        Parameters:
        pattern - The token regex pattern
        Returns:
        A compiled TokenRegex
        Throws:
        ParseException - The given pattern has a syntax error
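
        Because a malformed pattern is reported at compile time, callers usually compile once and handle the failure up front. The sketch below illustrates that idea with the standard library's java.util.regex.Pattern as a stand-in, not with Hermes itself. The class name CompileOnceSketch and the helper tryCompile are hypothetical. Note that Pattern.compile signals errors with the unchecked PatternSyntaxException, whereas TokenRegex.compile is declared to throw ParseException.

```java
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper; java.util.regex stands in for the token regex engine.
class CompileOnceSketch {

    // Returns the compiled pattern, or an empty Optional if the syntax is invalid.
    static Optional<Pattern> tryCompile(String regex) {
        try {
            return Optional.of(Pattern.compile(regex));
        } catch (PatternSyntaxException e) {
            // Report the problem once, at compile time, instead of at match time.
            System.err.println("Invalid pattern: " + e.getDescription());
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryCompile("[a-z]+").isPresent());
        System.out.println(tryCompile("(unclosed").isPresent());
    }
}
```

        With TokenRegex the same shape would presumably apply, with the catch clause handling ParseException instead.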
      • extract

        public Extraction extract(@NonNull HString hString)
        Description copied from interface: Extractor
        Generate an Extraction from the given HString.
        Specified by:
        extract in interface Extractor
        Parameters:
        hString - the source text from which we will generate an Extraction
        Returns:
        the Extraction
      • matchFirst

        public Optional<HString> matchFirst​(HString text)
        Runs the pattern over the given input text returning the first match if one exists.
        Parameters:
        text - the text to run the pattern over
        Returns:
        an optional of the match
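
        The "first match as an Optional" contract can be sketched with the standard library's java.util.regex classes, which TokenRegex is modeled on. This is an analogy over plain strings, not the Hermes implementation. The class name FirstMatchSketch is hypothetical, and the helper below returns Optional<String> where TokenRegex returns Optional<HString>.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical analog of TokenRegex.matchFirst built on java.util.regex.
class FirstMatchSketch {

    // Returns the first substring matched by the pattern, or empty if none exists.
    static Optional<String> matchFirst(Pattern pattern, CharSequence text) {
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        Pattern capitalized = Pattern.compile("[A-Z][a-z]+");
        // Only the first match is returned, even if later matches exist.
        System.out.println(matchFirst(capitalized, "see Alice and Bob").orElse("<none>"));
        System.out.println(matchFirst(capitalized, "no names here").orElse("<none>"));
    }
}
```

        Returning an Optional lets the caller handle the no-match case explicitly instead of checking for null.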
      • matcher

        public TokenMatcher matcher​(HString text,
                                    int start)
        Creates a TokenMatcher to match against the given text.
        Parameters:
        text - The text to run the TokenRegex against
        start - Which token to start the TokenRegex on
        Returns:
        A TokenMatcher
      • matcher

        public TokenMatcher matcher​(HString text)
        Creates a TokenMatcher to match against the given text.
        Parameters:
        text - The text to run the TokenRegex against
        Returns:
        A TokenMatcher
      • matches

        public boolean matches​(HString text)
        Determines if the regex matches the entire region of the given input text.
        Parameters:
        text - the text to match
        Returns:
        True if the pattern matches the entire region of the input text, False otherwise
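
        The difference between matching the entire region and finding a match somewhere inside it mirrors java.util.regex. The sketch below uses the standard library's Pattern and Matcher over plain strings as a stand-in for TokenRegex and HString. The class name WholeVersusPartial and both helper methods are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo; java.util.regex stands in for the token regex engine.
class WholeVersusPartial {

    // True only if the pattern consumes the whole input (analogous to matches).
    static boolean matchesEntire(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // True if the pattern matches any subsequence of the input (analogous to find).
    static boolean findsAnywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesEntire("[0-9]+", "2024"));      // prints "true"
        System.out.println(matchesEntire("[0-9]+", "year 2024")); // prints "false"
        System.out.println(findsAnywhere("[0-9]+", "year 2024")); // prints "true"
    }
}
```

        When a match anywhere in the text is sufficient, iterating with find() on a matcher, as in the class-level example, is the appropriate choice.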
      • pattern

        public String pattern()
        Returns:
        The token regex pattern as a string