Class |
Description |
Action |
An action defines a processing step to perform on a Corpus with a given Context which results in
either modifying the corpus or the context.
AffixFeaturizer |
The type Affix featurizer.
AnnotatableType |
AnnotatableType.Deserializer |
AnnotatableType.KeyDeserializer |
AnnotatableType.Serializer |
AnnotatableTypeConverter |
Mango converter that automatically converts other objects (JSON and Strings) into AnnotatableTypes.
Annotate |
The type Annotate processor.
Annotation |
AnnotationPipeline |
Helper class for determining the correct sequence of annotators to apply on a Document in order to satisfy the
given AnnotatableType.
AnnotationSet |
An AnnotationSet acts as the storage mechanism for annotations associated with a document.
AnnotationType |
An AnnotationType defines an Annotation, which is a typed (e.g.
Annotator |
AttributeMap |
Specialized HashMap for storing AttributeTypes and their values that correctly handles JSON serialization /
deserialization and allows for checked type gets.
AttributeType<T> |
An AttributeType defines a named Attribute that can be added to an HString.
BackreferenceTransition |
BaseHStringMLModel |
The type Base h string ml model.
BaseWorkflowIO |
BasicCategories |
A basic set of categories to describe words which is useful for inferring higher level concepts.
BasicCategoryFeature |
BreakIteratorTokenizer |
A tokenizer implementation based on Java's BreakIterator class
CaduceusProgram |
Caduceus, pronounced ca·du·ceus, is a rule-based information extraction system.
CategoryProcessor |
CoNLLColumnProcessor |
Interface defining how to process a column from a CoNLL formatted document.
CoNLLEvaluation |
Evaluation used in CoNLL for Named Entity Recognition.
CoNLLFormat |
Format Name: conll
CoNLLFormat.CoNLLParameters |
The type CoNLL parameters.
CoNLLFormat.Provider |
The type Provider.
CoNLLRow |
A Row (token) in a CoNLL formatted file
Context |
Contexts are a specialized map that act as a shared memory for a Workflow.
ContextualizedEmbedding |
Corpus |
A persistent collection of documents each having a unique document ID.
CsvFormat |
Format Name: csv
CsvFormat.CSVParameters |
The type Csv parameters.
CsvFormat.Provider |
The type Provider.
DefaultCategoryAnnotator |
Default annotator for basic categories, which is limited to nouns.
DefaultDependencyAnnotator |
Default dependency annotator that uses MaltParser.
DefaultEntityAnnotator |
Default Entity Annotator that realizes the Entity annotation through sub-annotators defined using the configuration
setting: .
DefaultLemmaAnnotator |
Default Lemmatization annotator that uses the Lemmatizer registered with the
token's language to perform lemmatization.
DefaultMlEntityAnnotator |
Default Machine-Learning based Entity annotator.
DefaultPartOfSpeechAnnotator |
Default Part-of-Speech annotator that uses a POSTagger machine learning model.
DefaultPhraseChunkAnnotator |
Default Phrase Chunk annotator that uses an IOBTagger.
DefaultSentenceAnnotator |
Default Sentence Annotator that works reasonably well on tokenized text.
DefaultStemAnnotator |
Default Stem annotator that uses the Stemmer registered with the
token's language to perform stemming.
DefaultTokenAnnotator |
Default token annotator that uses the Tokenizer registered with the
token's language to perform tokenization.
DefaultTokenTypeEntityAnnotator |
DefaultTransliterationAnnotator |
Annotates tokens with their transliteration using ICU4J's Transliterator class.
DependencyLinkProcessor |
Processes dependency governor (parent) information in CoNLL Files
DependencyRelationProcessor |
Processes dependency relation information in CoNLL Files
DiacriticalMarkNormalizer |
Removes diacritics
DiskLexicon |
DistributionalLexiconGenerator<T extends Tag> |
Generates a lexicon based on similarity in an embedding space where positive and negative examples can be given per
tag category.
DocFormat |
A DocFormat defines how to read and write documents in a given format.
DocFormatParameters |
The type Doc format parameters.
DocFormatProvider |
A provider for DocFormat for use within Java's service loader framework.
DocFormatService |
Document |
A document represents text content with an accompanying set of metadata (Attributes), linguistic overlays
(Annotations), and relations between elements in the document.
Document.AnnotationBuilder |
Annotation builder for creating annotations associated with a document
DocumentCollection |
A document collection represents a temporary collection of documents often used for ad-hoc analytics or to import
documents into a corpus
DocumentFactory |
A document factory facilitates the creation of document objects performing any predefined preprocessing, e.g.
DocumentFactory.DocumentFactoryBuilder |
Downloader |
ElmoNERModel |
ElmoSeq2SeqModel |
ElmoTokenEmbedding |
EmbeddingSimilarity |
Implementation of a HStringSimilarity that calculates similarity based on the similarity between the
HStrings in embedding space.
ENEntityAnnotator |
Default Entity annotator for English
ENLemmatizer |
English language lemmatizer based on WordNet's Morphy
ENLexicons |
Lexicons used by the English Tokenizer.
ENPOSTagger |
Default English language Part-of-Speech Annotator that uses a combination of machine learning and post-ml corrective
ENPOSValidator |
English language sequence labeling validator for part-of-speech tags.
ENStemmer |
Default English language stemmer using Porter Stemmer.
ENStopWords |
English StopWords
Entities |
Predefined set of common entities.
EntityTagger |
EntityType |
Tag type associated with Entity annotations.
EntityType.Converter |
The type Converter.
ENTokenizer |
English language tokenizer
Extraction |
An extraction is the output generated by an Extractor.
Extractor |
Fundamental to text mining in Hermes is the concept of an Extractor and the Extraction it
ExtractorBasedSimilarity |
An implementation of an HStringSimilarity that uses an Apollo Similarity measure to determine the
similarity between two HStrings based on the extraction from a given Extractor.
Features |
The type Features.
FeaturizingExtractor |
Combines an Extractor with an Apollo Featurizer allowing for the output of the extractor to be
directly used as features for machine learning.
Fragments |
Convenience methods for constructing orphaned and empty fragments.
FuzzyLexiconAnnotator |
A lexicon annotator that allows gaps to occur in multi-word expressions.
Hermes |
Convenience methods for getting common configuration options.
HermesJsonFormat |
Format Name: hjson
HermesJsonFormat.Provider |
The type Provider.
HString |
An HString (Hermes String) is a Java String on steroids.
HStringDataSetGenerator |
An extension to a DataSetGenerator that allows for the incoming documents to be broken up into multiple Datum based
on a given AnnotationType.
HStringDataSetGenerator.Builder |
Builder Class for HStringDataSetGenerator
HStringMLModel |
The interface H string ml model.
HStringSimilarity |
Interface defining a methodology for computing the similarity between two HStrings.
HtmlEntityNormalizer |
Normalizes xml and html entities, such as &
ImportDocuments |
IndexProcessor |
Processes token index information in CoNLL Files
IOBFieldProcessor |
Base processor for IOB (Inside, Outside, Beginning) annotations in CoNLL Files
IOBTagger |
Creates annotations based on the IOB tag output of an underlying model.
IOBValidator |
Sequence validator ensuring correct IOB tag output
KeywordExtraction |
KeywordExtractor |
A keyword extractor determines the important words, phrases, or concepts in an HString, returning a counter
of keywords and their corresponding scores.
LemmaProcessor |
Processes lemma information in CoNLL Files
Lemmatizer |
Defines the interface for lemmatizing tokens.
Lemmatizers |
Factory class for creating/retrieving lemmatizers for a given language
LexicalFeatures |
Lexicon |
A traditional approach to information extraction incorporates the use of lexicons, also called gazetteers, for
finding specific lexical items in text.
LexiconAnnotator |
Annotator that provides annotations based on a lexicon.
LexiconEntry |
An entry in a lexicon defining the lemma, probability, tag, and any constraints on matching
LexiconGenerator<T extends Tag> |
Defines a methodology for constructing a lexicon for a set of tags.
LexiconIO |
Utility methods for reading and writing Lexicons
LexiconIO.CSVParameters |
The type Csv parameters.
LexiconManager |
Manages the creation of and access to Lexicons
LexiconMatch |
Value class for matches made by lexicons
LexiconSpecification |
LyreDSL |
Static functions allowing for a functional style DSL for constructing LyreExpressions.
LyreExpression |
A LyreExpression represents a series of steps to perform over an input HString which can be used for
querying (i.e.
LyreExpressionType |
Enumeration of the different types of Lyre expressions
MorphologicalFeatureProcessor |
MultiPhaseExtractor |
A FeaturizingExtractor that breaks the extraction process into the following parts:
Extracts annotations of the given types.
Trims the extractions, if a trim method is defined.
Filters the extractions, if a filter is defined.
MultiPhaseExtractor.MultiPhaseExtractorBuilder<T extends MultiPhaseExtractor,V extends MultiPhaseExtractor.MultiPhaseExtractorBuilder<T,V>> |
NamedEntityProcessor |
Processes Named Entities in CoNLL Format.
NeuralNERModel |
Implementation of a non-deterministic finite state automata that works on a Text
NGramExtractor |
NGramExtractor.Builder |
NoOptProcessor |
No Operation Processor
NPClusteringKeywordExtractor |
Implementation of the NP Clustering Keyword Extractor presented in:
OneDocPerFileFormat |
Defines a format in which only one document is written per file.
PartOfSpeech |
Interface defining a part-of-speech.
PartOfSpeechConverter |
PennTreeBank |
Part-of-speech tags defined by Penn Treebank
PennTreebankFormat |
Format Name: ptb
PennTreebankFormat.Provider |
PersistentLexicon |
Base class for lexicon implementations that are persistent, meaning added entries are persisted between runs.
PhraseChunkProcessor |
Processes Shallow Parse information (Phrase Chunks) in CoNLL Format
PhraseChunkTagger |
PorterStemmer |
Stemmer implementing the Porter Stemming Algorithm. The Stemmer class transforms a word into its root form.
POSCorrection |
Corrects POS tags to conform to HERMES format
POSFieldProcessor |
Processes part-of-speech fields
POSFormat |
Format Name: pos
POSFormat.Provider |
The type Provider.
POSTagger |
PredefinedFeatures |
The type Predefined features.
PredefinedFeatures.PredefinedFeaturizer |
The type Predefined featurizer.
PrefixSearchable |
Interface defining a lexicon or word list that can be searched using prefixes
ProgressLogger |
Defines a logger that keeps track of the number of documents and words processed and reports processing statistics on
a given interval.
Query |
Defines the methodology for matching documents based on simple boolean logic over term and document level
QueryParser |
Simple query to predicate constructor for basic keyword queries over corpora.
RakeKeywordExtractor |
Implementation of the RAKE keyword extraction algorithm as presented in:
RegexAnnotator |
Annotator that constructs annotations based on regular expression matches.
RegexExtractor |
An Extractor implementation that searches for a given regular expression pattern in the document.
Relation |
Relations provide a mechanism to link two Annotations.
RelationDirection |
Directionality of a relation.
RelationEdge |
A specialized annotation graph edge that stores relation type and value.
RelationEdgeFactory |
RelationGraph |
A graph where vertices are annotations and edges represent relations.
RelationType |
Dynamic enumeration of known types of relations that can exist between annotations.
ResourceType |
Defines common resources used by Hermes and methods for finding configuration values and resources for them.
SearchExtractor |
An Extractor implementation that searches for a given search text in the document.
SearchResults |
SentenceLevelAnnotator |
Base for annotators that work at the sentence level.
SequentialWorkflow |
Entry point to sequentially processing a corpus via one or more Actions.
SimpleWordList |
Simple implementation of a WordList backed by a HashSet
SpellChecker |
The type Spellchecker module.
StandardTokenizer |
This class is a scanner generated by JFlex 1.5.0-SNAPSHOT from the specification file /home/ik/prj/gengoai/hermes-pom/core/src/main/jflex/StandardTokenizer.jflex
State |
Defines an action state which can be LOADED where the action has loaded its state or NOT_LOADED meaning the action
has no state to load.
Stemmer |
Defines the interface for stemming tokens.
Stemmers |
Factory class for creating/retrieving stemmers for a given language
StopWords |
Defines a methodology for determining if an HString or String is a stopword for a given language.
StopWords.NoOptStopWords |
StopWords implementation that treats everything as a content word.
SubTypeAnnotator |
An annotator that provides its annotation by annotating for sub-types.
Summarizer |
Interface defining an Extractor that generates summaries for a given HString,
specifically documents.
SuperSenseProcessor |
TagDecoder |
TaggedFormat |
Format Name: tagged
TaggedFormat.Provider |
The type Provider.
TaggedFormat.TaggedParameters |
The type Tagged parameters.
TensorFlowSequenceLabeler |
TermCounts |
The type Term extraction processor.
TermExtractor |
Implementation of the MultiPhaseExtractor for extracting terms where a term is a single annotation (TOKEN by
TermExtractor.Builder |
TermKeywordExtractor |
TextNormalization |
Class that takes care of normalizing text using a number of TextNormalizers.
TextNormalizer |
Defines a methodology for normalizing a string.
TextRank |
Implementation of the TextRank algorithm for keyword extraction as defined in:
Mihalcea, R., Tarau, P.: "Textrank: Bringing order into texts".
TextRankSummarizer |
Implementation of the TextRank algorithm for summarization as defined in:
Mihalcea, R., Tarau, P.: "Textrank: Bringing order into texts".
TFIDFKeywordExtractor |
Keyword extractor that scores words based on their TFIDF value.
Tokenizer |
Low level tokenization of strings
Tokenizer.Token |
An internal token
Tokenizers |
TokenMatch |
A match from a TokenRegex pattern on an input HString.
TokenMatcher |
The TokenMatcher class allows for iterating over the matches, extracting the match or named-groups within the match,
the starting and ending offset of the match, and conversion into a TokenMatch object which records the current state
of the match.
TokenRegex |
Hermes provides a token-based regular expression engine that allows for matches on arbitrary annotation types,
relation types, and attributes, while providing many of the operators that are possible using standard Java regular
TokenType |
Defines the type for a given token.
TraditionalToSimplified |
Preprocessor that converts traditional characters into simplified characters.
TrieLexicon |
Implementation of Lexicon using a Trie data structure.
TrieWordList |
Implementation of a WordList backed by a Trie
TwitterSearchFormat |
Format Name: twitter_search
TwitterSearchFormat.Provider |
The type Provider.
TxtFormat |
Format Name: text
TxtFormat.Provider |
The type Provider.
Types |
Common Annotatable Types.
UnicodeNormalizer |
Converts unicode to canonical form and removes smart quotes.
UniversalFeature |
UniversalFeatureSet |
UniversalFeatureValue |
UniversalSentenceEncoder |
UPOSProcessor |
Processes universal part-of-speech information
ValueCalculator |
The enum Value calculator.
ViterbiAnnotator |
An abstract base annotator that uses the Viterbi algorithm to find text items in a document.
WhitespaceNormalizer |
Handles normalizing whitespace.
WholeFileTextFormat |
Defines a format in which files need to be completely read in order to generate documents.
WordList |
Word lists provide a set-like interface to a set of vocabulary items.
WordProcessor |
Processes words
Workflow |
A workflow represents a set of actions to perform on a document collection.
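
The Action, Context, SequentialWorkflow, and Workflow entries in this summary describe a pattern in which processing steps run in order and communicate through a shared map. The following is a minimal sketch of that pattern only, not the Hermes API: SketchContext, SketchAction, and runSequentially are hypothetical names invented for illustration, and a plain list of strings stands in for a corpus.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WorkflowSketch {

    /** Hypothetical stand-in for a shared-memory context: a string-keyed map. */
    static final class SketchContext {
        private final Map<String, Object> values = new HashMap<>();

        void put(String key, Object value) {
            values.put(key, value);
        }

        Object get(String key) {
            return values.get(key);
        }
    }

    /** Hypothetical action: one step that may modify the documents or the context. */
    interface SketchAction {
        void process(List<String> documents, SketchContext context);
    }

    /** Runs each action in order over the same documents and the same context. */
    static SketchContext runSequentially(List<String> documents, List<SketchAction> actions) {
        SketchContext context = new SketchContext();
        for (SketchAction action : actions) {
            action.process(documents, context);
        }
        return context;
    }

    public static void main(String[] args) {
        // A mutable list of raw strings stands in for a corpus (assumption).
        List<String> documents = new ArrayList<>(List.of("  Hello World ", "Hermes  "));

        // Step 1 modifies the documents by trimming surrounding whitespace.
        SketchAction trim = (docs, ctx) -> docs.replaceAll(String::trim);

        // Step 2 modifies the context by recording how many documents were seen.
        SketchAction count = (docs, ctx) -> ctx.put("documentCount", docs.size());

        SketchContext context = runSequentially(documents, List.of(trim, count));
        System.out.println(documents);
        System.out.println(context.get("documentCount"));
    }
}
```

Each step in the sketch changes either the documents or the context, mirroring the Action description above. The actual Hermes classes may differ in how they load state and handle persistence, so consult their individual pages for real signatures.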