Apache UIMA Ruta™ Guide and Reference
Version 2.7.0
Copyright © 2011, 2019 The Apache Software Foundation
License and Disclaimer. The ASF licenses this documentation to you under the Apache
License, Version 2.0 (the "License"); you may not use this documentation except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, this documentation and its contents
are distributed under the License on an "AS IS" BASIS, WITHOUT WARRANTIES OR
CONDITIONS OF ANY KIND, either express or implied. See the License for the specific
language governing permissions and limitations under the License.
Trademarks. All terms mentioned in the text that are known to be trademarks or service marks
have been appropriately capitalized. Use of such terms in this book should not be regarded as
affecting the validity of the trademark or service mark.
Chapter 1. Apache UIMA Ruta Overview
1.1. What is Apache UIMA Ruta?
Apache UIMA Ruta™ is a rule-based script language supported by Eclipse-based tooling.
The language is designed to enable rapid development of text processing applications within
Apache UIMA™. A special focus lies on the intuitive and flexible domain specific language for
defining patterns of annotations. Writing rules for information extraction or other text processing
applications is a tedious process. The Eclipse-based tooling for UIMA Ruta, called the Apache
UIMA Ruta Workbench, was created to support the user and to facilitate every step when writing
UIMA Ruta rules. Both the Ruta rule language and the UIMA Ruta Workbench integrate smoothly
with Apache UIMA.
UIMA Ruta rules define patterns of annotations; if such a pattern matches on a text passage,
then the actions of the rule are performed on the matched annotations. A rule is composed of
a sequence of rule elements, and a rule element essentially consists of four parts: a matching
condition, an optional quantifier, a list of conditions and a list of actions. The matching condition
is typically an annotation type by which the rule element matches on the covered text of
one of those annotations. The quantifier specifies whether it is necessary that the rule element
successfully matches and how often the rule element may match. The list of conditions specifies
additional constraints that the matched text or annotations need to fulfill. The list of actions defines
the consequences of the rule and often creates new annotations or modifies existing annotations.
The actions are only applied if all rule elements of the rule have successfully matched. Examples of
UIMA Ruta rules can be found in Section 1.4, “Learning by Example” [3].
When UIMA Ruta rules are applied to a document, or more precisely to a CAS, they are always
grouped in a script file. However, a UIMA Ruta script file does not only contain rules, but also
other statements. First of all, each script file starts with a package declaration followed by a list
of optional imports. Then, common statements like rules, type declarations or blocks build the
body and functionality of a script. Section 4.1, “Apply UIMA Ruta Analysis Engine in plain
Java” [113] gives an example of how UIMA Ruta scripts can be applied in plain Java. UIMA
Ruta script files are naturally organized in UIMA Ruta projects, which is a concept of the UIMA
Ruta Workbench. The structure of a UIMA Ruta project is described in Section 3.3, “UIMA Ruta
Projects” [86].
The inference of UIMA Ruta rules, that is, the manner in which the rules are applied, can be described
as imperative depth-first matching. In contrast to similar rule-based systems, UIMA Ruta rules
are applied in the order they are defined in the script. The imperative execution of the matching
rules may have disadvantages, but also many advantages, such as a faster development process or an
easier explanation of the results. The second main property of the UIMA Ruta inference is the depth-first
matching. When a rule matches on a pattern of annotations, then an alternative is always tracked
until it has matched or failed before the next alternative is considered. The behavior of a rule may
change if it has already matched on an early alternative and thus has performed an action that
influences some constraints of the rule. Examples of how UIMA Ruta rules are applied are given in
Section 1.4, “Learning by Example” [3].
The UIMA Ruta language provides the possibility to approach an annotation problem in different
ways. Let us distinguish some approaches as an example. It is common in the UIMA Ruta language
to create many annotations of different types. These annotations are probably not the targeted
annotation of the domain, but can be helpful to incrementally approximate the annotation of
interest. This enables the user to work “bottom-up” and “top-down”. In the former approach, the
rules add incrementally more complex annotations using simple ones until the target annotation
can be created. In the latter approach, the rules get more specific while partitioning the document
in smaller segments, which eventually result in the targeted annotation. By using many “helper”
annotations, the engineering task becomes easier and more comprehensible. The UIMA Ruta
language provides distinctive language elements for different tasks. There are, for example, actions
that are able to create new annotations, actions that are able to remove annotations and actions that
are able to modify the offsets of annotations. This enables, amongst other things, a transformation-
based approach. The user starts by creating general rules that are able to annotate most of the text
fragments of interest. Then, instead of making these rules more complex by adding more conditions
for situations where they fail, additional rules are defined that correct the mistakes of the general
rules, e.g., by deleting false positive annotations. Section 1.4, “Learning by Example” [3]
provides some examples of how UIMA Ruta rules can be engineered.
To write rules manually is a tedious and error-prone process. The UIMA Ruta Workbench was
developed to facilitate writing rules by providing as much tooling support as possible. This
includes, for example, syntax checking and auto completion, which make the development less
error-prone. The user can annotate documents and use these documents as unit tests for test-driven
development or quality maintenance. Sometimes, it is necessary to debug the rules because they do
not match as expected. In this case, the explanation perspective provides views that explain every
detail of the matching process. Finally, the UIMA Ruta language can also be used by the tooling,
for example, by the “Query” view. Here, UIMA Ruta rules can be used as query statements in order
to investigate annotated documents.
UIMA Ruta smoothly integrates with Apache UIMA. First of all, the UIMA Ruta rules are applied
using a generic Analysis Engine and thus UIMA Ruta scripts can easily be added to Apache UIMA
pipelines. UIMA Ruta also provides the functionality to import and use other UIMA components
like Analysis Engines and Type Systems. UIMA Ruta rules can refer to every type defined in an
imported type system, and the UIMA Ruta Workbench generates a type system descriptor file
containing all types that were defined in a script file. Any Analysis Engine can be executed by rules
as long as their implementation is available in the classpath. Therefore, functionality outsourced in
an arbitrary Analysis Engine can be added and used within UIMA Ruta.
The first example consists of a declaration of a type followed by a simple rule. Type declarations
always start with the keyword “DECLARE” followed by the short name of the new type. The
namespace of the type is equal to the package declaration of the script file. If there is no package
declaration, then the types declared in the script file have no namespace. There is also the
possibility to create more complex types with features or specific parent types, but this will be
neglected for now. In the example, a simple annotation type with the short name “Animal” is
defined. After the declaration of the type, a rule with one rule element is given. UIMA Ruta rules
in general can consist of a sequence of rule elements. Simple rule elements themselves consist
of four parts: A matching condition, an optional quantifier, an optional list of conditions and an
optional list of actions. The rule element in the following example has a matching condition “W”,
an annotation type standing for normal words. Statements like declarations and rules always end
with a semicolon.
DECLARE Animal;
W{REGEXP("dog") -> MARK(Animal)};
The rule element also contains one condition and one action, both surrounded by curly braces.
In order to distinguish conditions from actions they are separated by “->”. The condition
“REGEXP("dog")” indicates that the matched word must match the regular expression “dog”. If the
matching condition and the additional regular expression are fulfilled, then the action is executed,
which creates a new annotation of the type “Animal” with the same offsets as the matched token.
The default seeder actually does not add annotations of the type “W”, but annotations of the types
“SW” and “CW” for small written words and capitalized words, which both have the parent type
“W”.
There is also the possibility to add implicit actions and conditions, which have no explicit name,
but consist only of an expression. In the part of the conditions, boolean expressions and feature
match expressions can be applied, and in the part of the actions, type expressions and feature
assignment expressions can be added. The following example contains one implicit condition and
one implicit action. The additional condition is a boolean expression (a boolean variable), which is
set to “true” and therefore always fulfills the condition. The “MARK” action was replaced by a
type expression, which refers to the type “Animal”. The following rule shows, therefore, the same
behavior as the rule in the last example.
DECLARE Animal;
BOOLEAN active = true;
W{REGEXP("dog"), active -> Animal};
There is also a special kind of rule, which follows a different syntax and semantics and enables a
simplified creation of annotations based on regular expressions. The following rule, for example,
creates an “Animal” annotation for each occurrence of “dog” or “cat”.
DECLARE Animal;
"dog|cat" -> Animal;
The matching condition of the rule element refers to the complete document, or more specifically to
the annotation of the type “DocumentAnnotation”, which covers the whole document. The action
“MARKFAST” of this rule element creates an annotation of the type “Animal” for each found
entry of the dictionary “AnimalsList”.
The next example introduces rules with more than one rule element, whereby one of them is a
composed rule element. The following rule tries to annotate occurrences of animals separated by
commas, e.g., “dog, cat, bird”.
DECLARE AnimalEnum;
(Animal COMMA)+{-> MARK(AnimalEnum,1,2)} Animal;
The rule consists of two rule elements, with “(Animal COMMA)+{-> MARK(AnimalEnum,1,2)}”
being the first rule element and “Animal” the second one. Let us take a closer look at the first rule
element. This rule element is actually composed of two normal rule elements, namely “Animal”
and “COMMA”, and contains a greedy quantifier and one action. This rule element, therefore,
matches on one Animal annotation and a following comma. This is repeated until one of the inner
rule elements does not match anymore. Then, there has to be another Animal annotation afterwards,
specified by the second rule element of the rule. In this case, the rule matches and its action is
executed: The MARK action creates a new annotation of the type “AnimalEnum”. However, in
contrast to the previous examples, this action also contains two numbers. These numbers refer to
the rule elements that should be used to calculate the span of the created annotation. The numbers
“1, 2” state that the new annotation should start with the first rule element, the composed one, and
should end with the second rule element.
Let us make the composed rule element more complex. The following rule also matches on lists of
animals, which are separated by semicolons. A disjunctive rule element is therefore added, indicated
by the symbol “|”, which matches on annotations of the type “COMMA” or “SEMICOLON”.
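A sketch of such an extended rule could be:
(Animal (COMMA | SEMICOLON))+{-> MARK(AnimalEnum,1,2)} Animal;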
There are two more special symbols that can be used to link rule elements. If the symbol “|” is
replaced by the symbol “&” in the last example, then the token after the animal needs to be a comma
and a semicolon, which is of course not possible. Another symbol with a special meaning is “%”,
which is not restricted to composed rule elements (parentheses), but can also link complete rules. This
symbol can be interpreted as a global “and”: it links several rules, which only fire if all of them have
successfully matched. In the following example, an annotation of the type “FoundIt” is created if the document contains two
periods in a row and two commas in a row:
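A sketch of such a pair of linked rules could look like this:
DECLARE FoundIt;
(PERIOD PERIOD){-> FoundIt} % COMMA COMMA;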
There is a “wild card” (“#”) rule element, which can be used to skip some text or annotations until
the next rule element is able to match.
DECLARE Sentence;
PERIOD #{-> MARK(Sentence)} PERIOD;
This rule annotates everything between two “PERIOD” annotations with the type “Sentence”.
Please note that the resulting annotation is automatically trimmed using the current filtering
settings. Conditions at wild card rule elements should be avoided and only be used by advanced
users.
Another special rule element is called “optional” (“_”). Sometimes, an annotation should be created
on a text position if it is not followed by an annotation with a specific property. In contrast to normal
rule elements with optional quantifier, the optional rule element does not need to match at all.
W ANY{-PARTOF(NUM)};
W _{-PARTOF(NUM)};
The two rules in this example specify the same pattern: A word that is not followed by a number.
The difference between the rules shows itself at the border of the matching window, e.g., at the
end of the document. If the document contains only a single word, the first rule will not match
successfully because the second rule element already fails at its matching condition. The second
rule, however, will successfully match due to the optional rule element.
Rule elements can contain more than one condition. The rule in the next example tries to identify
headlines, which are bold, underlined and end with a colon.
DECLARE Headline;
Paragraph{CONTAINS(Bold, 90, 100, true),
CONTAINS(Underlined, 90, 100, true), ENDSWITH(COLON)
-> MARK(Headline)};
The matching condition of this rule element is given with the type “Paragraph”, thus the rule takes
a look at all Paragraph annotations. The rule matches only if the three conditions, separated by
commas, are fulfilled. The first condition “CONTAINS(Bold, 90, 100, true)” states that 90%-100%
of the matched paragraph annotation should also be annotated with annotations of the type “Bold”.
The boolean parameter “true” indicates that the amount of Bold annotations should be calculated
relative to the matched annotation. The two numbers “90,100” are, therefore, interpreted as
percent amounts. The exact calculation of the coverage is dependent on the tokenization of the
document and is neglected for now. The second condition “CONTAINS(Underlined, 90, 100,
true)” consequently states that the paragraph should also be covered to at least 90% by annotations
of the type “Underlined”. The third condition “ENDSWITH(COLON)” finally forces the Paragraph
annotation to end with a colon. It is only fulfilled, if there is an annotation of the type “COLON”,
which has an end offset equal to the end offset of the matched Paragraph annotation.
The readability and maintainability of rules do not improve if more and more conditions are added. One
of the strengths of the UIMA Ruta language is that it provides different approaches to solve an
annotation task. The next two examples introduce actions for transformation-based rules.
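A sketch of the first of these rules could be:
Headline{-CONTAINS(W) -> UNMARK(Headline)};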
This rule consists of one condition and one action. The condition “-CONTAINS(W)” is negated
(indicated by the character “-”), and is therefore only fulfilled, if there are no annotations of the
type “W” within the bounds of the matched Headline annotation. The action “UNMARK(Headline)”
removes the matched Headline annotation. Put into simple words, headlines that contain no words
at all are not headlines.
The next rule does not remove an annotation, but changes its offsets dependent on the context.
Here, the action “SHIFT(Headline, 1, 2)” expands the matched Headline annotation to the next
colon, if that Headline annotation is followed by a COLON annotation.
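A sketch of such a rule could be:
Headline{-> SHIFT(Headline, 1, 2)} COLON;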
UIMA Ruta rules can contain arbitrary conditions and actions, which is illustrated by the next
example.
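A sketch of such a rule, assuming a word list file named “Months.txt”, could look like this:
DECLARE Month, Year, Date;
WORDLIST MonthsList = 'Months.txt';
ANY{INLIST(MonthsList) -> MARK(Month), MARK(Date, 1, 3)} PERIOD?
    NUM{REGEXP(".{2,4}") -> MARK(Year)};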
This rule consists of three rule elements. The first one matches on every token whose
covered text occurs in a word list named “MonthsList”. The second rule element is optional
and does not need to be fulfilled, which is indicated by the quantifier “?”. The last rule element
matches on numbers that fulfill the regular expression “.{2,4}” and are therefore at
least two characters to a maximum of four characters long. If this rule successfully matches on
a text passage, then its three actions are executed: An annotation of the type “Month” is created
for the first rule element, an annotation of the type “Year” is created for the last rule element and
an annotation of the type “Date” is created for the span of all three rule elements. If the word
list contains the correct entries, then this rule matches on strings like “Dec. 2004”, “July 85” or
“11.2008” and creates the corresponding annotations.
After introducing the composition of rule elements, the default matching strategy is examined. The
two rules in the next example create an annotation for a sequence of arbitrary tokens with the only
difference of one condition.
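A sketch of these two rules could be:
DECLARE Text1, Text2;
ANY+{-> MARK(Text1)};
ANY+{-PARTOF(Text2) -> MARK(Text2)};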
The first rule matches on each occurrence of an arbitrary token and continues this until the end of
the document is reached. This is caused by the greedy quantifier “+”. Note that this rule considers
each occurrence of a token and is therefore executed for each token, resulting in many overlapping
annotations. This behavior is illustrated with an example: When applied on the document “Peter
works for Frank”, the rule creates four annotations with the covered texts “Peter works for Frank”,
“works for Frank”, “for Frank” and “Frank”. The rule first tries to match on the token “Peter” and
continues its matching. Then, it tries to match on the token “works” and continues its matching, and
so on.
In this example, the second rule only returns one annotation, which covers the complete document.
This is caused by the additional condition “-PARTOF(Text2)”. The PARTOF condition is fulfilled,
if the matched annotation is located within an annotation of the given type, or put in simple
words, if the matched annotation is part of an annotation of the type “Text2”. When applied on the
document “Peter works for Frank”, the rule matches on the first token “Peter”, continues its match
and creates an annotation of the type “Text2” for the complete document. Then it tries to match on
the second token “works”, but fails, because this token is already part of a Text2 annotation.
UIMA Ruta rules can not only be used to create or modify annotations, but also to create features
for annotations. The next example defines and assigns a relation of employment, by storing the
given annotations as feature values.
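A sketch of this example, assuming that the types “Employee”, “Employer”, “EmploymentIndicator” and “Sentence” are already present, could look like this:
DECLARE Annotation EmplRelation (Employee employeeRef, Employer employerRef);
Sentence{CONTAINS(EmploymentIndicator) -> CREATE(EmplRelation,
    "employeeRef" = Employee, "employerRef" = Employer)};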
The first statement of this example is a declaration that defines a new type of annotation named
“EmplRelation”. This annotation has two features: One feature with the name “employeeRef” of
the type “Employee” and one feature with the name “employerRef” of the type “Employer”. If the
parent type is Annotation, then it can be omitted resulting in the following declaration:
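A sketch of the shortened declaration:
DECLARE EmplRelation (Employee employeeRef, Employer employerRef);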
The second statement of the example, which is a simple rule, creates one annotation of the type
“EmplRelation” for each Sentence annotation that contains at least one annotation of the type
“EmploymentIndicator”. In addition to creating an annotation, the CREATE action also assigns
an annotation of the type “Employee”, which needs to be located within the span of the matched
sentence, to the feature “employeeRef” and an Employer annotation to the feature “employerRef”.
The annotations mentioned in this example need to be present in advance.
In order to refer to annotations and, for example, assigning them to some features, special kinds
of local and global variables can be utilized. Local variables for annotations do not need to be
declared but are specified by a label at a rule element. This label can be utilized for referring to the
matched annotation of this rule element within the current rule match alone. The following example
illustrates a simple use case using a local label variable:
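A minimal sketch, reusing the types of the employment example (the label “e” and the concrete rule are assumptions):
e:Employee{-> CREATE(EmplRelation, "employeeRef" = e)};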
Global variables for annotations are declared like other variables and are able to store annotations
across rules as illustrated by the next example:
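A sketch following this description could look like this; the concrete matching contexts of the two rules are assumptions:
DECLARE MentionedAfter (Annotation first);
ANNOTATION firstPerson;
Document{-> firstPerson = Person};
Person{-> CREATE(MentionedAfter, "first" = firstPerson)};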
The first line declares a new type that is utilized afterwards. The second line defines a variable
named firstPerson which can store one annotation. A variable able to hold several annotations
is defined with ANNOTATIONLIST. The next line assigns the first occurrence of a Person
annotation to the annotation variable firstPerson. The last line creates an annotation of the type
MentionedAfter and assigns the value of the variable firstPerson to the feature first of the
created annotation.
Expressions for annotations can be extended by a feature match and also by conditions. This also
applies to type expressions that represent annotations. This functionality is illustrated with a simple
example:
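A sketch of such a rule, reusing the types of the employment example (the exact combination of feature match and conditions is an assumption):
Sentence{-> CREATE(EmplRelation,
    "employeeRef" = Employee{REGEXP("Peter"), ENDSWITH(Sentence)})};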
Here, an annotation of the type EmplRelation is created for each sentence. The feature
employeeRef is filled with one Employee annotation. This annotation is specified by its type
Employee. The first annotation of this type within the matched sentence, which covers the text
“Peter” and also ends with a Sentence annotation, is selected.
Sometimes, an annotation which was just created by an action should be assigned to a feature. This
can be achieved by referring to the annotation given its type like it was shown in the first example
with “EmplRelation”. However, this can cause problems in situations where, e.g., several annotations
of a type are present at a specific span. Local variables using labels can also be used directly at
actions which create or modify annotations. The action will assign the new annotation to the label
variable, which can then be utilized by following actions, as shown in the following example:
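A minimal sketch, reusing the types of the employment example (the label “er” refers to the EmplRelation annotation created by the first action):
Sentence{-> er:EmplRelation, er.employeeRef = Employee};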
In the last examples, the values of features were defined as annotation types. However, also
primitive types can be used, as will be shown in the next example, together with a short
introduction of variables.
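A sketch of this example could look like this (the PARSE condition stores the parsed number in the variable, and the MATCHEDTEXT action stores the covered text):
DECLARE Annotation MoneyAmount (STRING currency, INT amount);
INT moneyAmount;
STRING moneyCurrency;
NUM{PARSE(moneyAmount)} SPECIAL{REGEXP("€") -> MATCHEDTEXT(moneyCurrency),
    CREATE(MoneyAmount, 1, 2, "amount" = moneyAmount, "currency" = moneyCurrency)};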
First, a new annotation with the name “MoneyAmount” and two features are defined, one string
feature and one integer feature. Then, two UIMA Ruta variables are declared, one integer variable
and one string variable. The rule matches on a number, whose value is stored in the variable
“moneyAmount”, followed by a special token that needs to be equal to the string “€”. Then,
the covered text of the special annotation is stored in the string variable “moneyCurrency”, and an
annotation of the type “MoneyAmount” spanning over both rule elements is created. Additionally,
the variables are assigned as feature values.
Using feature expressions for conditions and actions can reduce the complexity of a rule. The first
rule in the following example sets the value of the feature “currency” of the annotation of the type
“MoneyAmount” to “Euro”, if it was “€” before. The second rule creates an annotation of the type
“LessThan” for all annotations of the type “MoneyAmount”, if their amount is less than 100 and
the currency is “Euro”.
DECLARE LessThan;
MoneyAmount.currency=="€"{-> MoneyAmount.currency="Euro"};
MoneyAmount{(MoneyAmount.amount<=100),
    (MoneyAmount.currency=="Euro") -> LessThan};
UIMA Ruta script files with many rules can quickly confuse the reader. The UIMA Ruta language,
therefore, allows importing other script files in order to increase the modularity of a project or to
create rule libraries. The next example imports the rules together with all known types of another
script file and executes that script file.
SCRIPT uima.ruta.example.SecondaryScript;
Document{-> CALL(SecondaryScript)};
The script file with the name “SecondaryScript.ruta”, which is located in the package “uima/ruta/
example”, is imported and executed by the CALL action on the complete document. The script
needs to be located in the folder specified by the parameter scriptPaths, or in a corresponding
package in the classpath. It is also possible to import script files of other UIMA Ruta projects, e.g.,
by adapting the configuration parameters of the UIMA Ruta Analysis Engine or by setting a project
reference in the project properties of a UIMA Ruta project.
For simple rules that match on the complete document and only specify actions, a simplified syntax
exists that omits the matching parts:
SCRIPT uima.ruta.example.SecondaryScript;
CALL(SecondaryScript);
The types of important annotations of the application are often defined in a separate type system.
The next example shows how to import those types.
TYPESYSTEM my.package.NamedEntityTypeSystem;
Person{PARTOF(Organization) -> UNMARK(Person)};
The type system descriptor file with the name “NamedEntityTypeSystem.xml” located in the
package “my/package” is imported. The descriptor needs to be located in a folder specified by the
parameter descriptorPaths.
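Rules can also be grouped in blocks that are only applied if a condition holds for the surrounding context, for example the document language. The block names and the placeholder rules in the following sketch are assumptions:
BLOCK(EnglishRules) Document{FEATURE("language", "en")} {
    Document{-> LOG("applying rules for English documents")};
}
BLOCK(GermanRules) Document{FEATURE("language", "de")} {
    Document{-> LOG("applying rules for German documents")};
}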
This example contains two simple BLOCK statements. The rules defined within the block are only
executed, if the condition in the head of the block is fulfilled. The rules of the first block are only
considered if the feature “language” of the document annotation has the value “en”. Accordingly,
the rules of the second block are only considered for German documents.
The rule element of the block definition can also refer to other annotation types than “Document”.
While the last example implemented something similar to an if-statement, the next example
provides a show case for something similar to a for-each-statement.
DECLARE SentenceWithNoLeadingNP;
BLOCK(ForEach) Sentence{} {
Document{-STARTSWITH(NP) -> MARK(SentenceWithNoLeadingNP)};
}
Here, the rule in the block statement is performed for each occurrence of an annotation of the
type “Sentence”. The rule within the block matches on the complete document, which is the
current sentence in the context of the block statement. As a consequence, this example creates an
annotation of the type “SentenceWithNoLeadingNP” for each sentence that does not start with an
NP annotation.
There are two more language constructs (“->” and “<-”) that allow applying rules within a certain
context. These rules are added to an arbitrary rule element and are called inlined rules. The first
example interprets the inlined rules as actions. They are executed if the surrounding rule was able
to match, which makes this one very similar to the block statement.
DECLARE SentenceWithNoLeadingNP;
Sentence{}->{
Document{-STARTSWITH(NP) -> SentenceWithNoLeadingNP};
};
The second one (“<-”) interprets the inlined rules as conditions. The surrounding rule can only
match if at least one inlined rule was successfully applied. In the following example, a sentence is
annotated with the type SentenceWithNPNP, if there are two successive NP annotations within this
sentence.
DECLARE SentenceWithNPNP;
Sentence{-> SentenceWithNPNP}<-{
NP NP;
};
A rule element may be extended with several inlined rule blocks as condition or action. If there are
more than one inlined rule blocks as condition, each needs to contain at least one rule that was
successfully applied. In the following example, the rule will only match if the sentence contains a
number followed by another number and a period followed by a comma, independently of their
location within the sentence:
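A sketch of such a rule, assuming a declared target type, could be:
DECLARE SentenceWithSomeStructure;
Sentence{-> SentenceWithSomeStructure}
    <-{NUM NUM;}
    <-{PERIOD COMMA;};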
Let us take a closer look at what exactly the UIMA Ruta rules match. The following rule matches
on a word followed by another word:
W W;
To be more precise, this rule matches on all documents like “Apache UIMA”, “Apache UIMA” (with
additional whitespace), “ApacheUIMA” and “Apache <b>UIMA</b>”. There are two main reasons for this: First of all, it
depends on how the available annotations are defined. The default seeder for the initial annotations
creates an annotation for all characters until an upper case character occurs. Thus, the string
“ApacheUIMA” consists of two tokens. However, more important, the UIMA Ruta language
provides a concept of visibility of the annotations. By default, all annotations of the types
“SPACE”, “NBSP”, “BREAK” and “MARKUP” (whitespace and XML elements) are filtered and
not visible. This holds of course for their covered text, too. The rule elements skip all positions of
the document where those annotations occur. The rule in the last example, therefore, matches on all of these documents.
Without the default filtering settings, with all annotations set to visible, the rule matches only on
the document “ApacheUIMA” since it is the only one that contains two word annotations without
any whitespace between them.
The filtering setting can also be modified by the UIMA Ruta rules themselves. The next example
provides rules that extend and limit the amount of visible text of the document.
Sentence;
Document{-> RETAINTYPE(SPACE)};
Sentence;
Document{-> FILTERTYPE(CW)};
Sentence;
Document{-> RETAINTYPE, FILTERTYPE};
The first rule matches on sentences, which do not start with any filtered type. Sentences that start
with whitespace or markup, for example, are not considered. The next rule retains all text that is
covered by annotations of the type “SPACE”, meaning that the rule elements are now sensitive to
whitespaces. The following rule will, therefore, match on sentences that start with whitespaces. The
third rule now filters the type “CW” with the consequence that all capitalized words are invisible.
If the following rule now wants to match on sentences, then this is only possible for Sentence
annotations that do not start with a capitalized word. The last rule finally resets the filtering setting
to the default configuration in the UIMA Ruta Analysis Engine.
The next example gives a showcase for importing external Analysis Engines and for modifying the
documents by creating a new view called “modified”. Additional Analysis Engines can be imported
with the keyword “ENGINE” followed by the name of the descriptor. These imported Analysis
Engines can be executed with the actions “CALL” or “EXEC”. If the executed Analysis Engine
adds, removes or modifies annotations, then their types need to be mentioned when calling the
descriptor, or else these annotations will not be correctly processed by the following UIMA Ruta
rules.
ENGINE utils.Modifier;
Date{-> DEL};
MoneyAmount{-> REPLACE("<MoneyAmount/>")};
Document{-> COLOR(Headline, "green")};
Document{-> EXEC(Modifier)};
In this example, we first import an Analysis Engine defined by the descriptor “Modifier.xml”
located in the folder “utils”. The descriptor needs to be located in the folder specified by the
parameter descriptorPaths. The first rule deletes all text covered by annotations of the type “Date”.
The second rule replaces the text of all annotations of the type “MoneyAmount” with the string
“<MoneyAmount/>”. The third rule assigns a green background color to the text covered by Headline
annotations. The last rule finally performs all of these changes in an additional view called
“modified”, which is specified in the configuration parameters of the analysis engine. Section 1.5.4,
“Modifier” [20] and Section 2.17, “Modification” [73] provide a more detailed description.
In the last example, a descriptor file was loaded in order to import and apply an external analysis
engine. Analysis engines can also be loaded using uimaFIT, in which case the given class name has
to be present in the classpath. In the UIMA Ruta Workbench, you can add a dependency to a Java
project, which contains the implementation, to the UIMA Ruta project. The following example
loads an analysis engine without a descriptor and applies it to the document. The additional list of
types states that the annotations of those types created by the analysis engine should be available to
the following Ruta rules.
UIMAFIT my.package.impl.MyAnalysisEngine;
Document{-> EXEC(MyAnalysisEngine, {MyType1, MyType2})};
To change the value of any configuration parameter within a UIMA Ruta script, the
CONFIGURE action (see Section 2.8.8, “CONFIGURE” [51]) can be used. For changing
behavior of dynamicAnchoring the DYNAMICANCHORING action (see Section 2.8.11,
“DYNAMICANCHORING” [52]) is recommended.
mainScript
This parameter specifies the rule file that will be executed by the analysis engine and is, therefore,
one of the most important ones. The exact name of the script is given by the complete namespace
of the file, which corresponds to its location relative to the given parameter scriptPaths. The single
names of packages (or folders) are separated by periods. An exemplary value for this parameter
could be "org.apache.uima.Main", whereas "Main" specifies the file containing the rules and
"org.apache.uima" its package. In this case, the analysis engine loads the script file "Main.ruta",
which is located in the folder structure "org/apache/uima/". This parameter has no default value and
has to be provided, although it is not specified as mandatory.
rules
A string parameter representing the rules that should be applied by the analysis engine. If set, it
replaces the content of the file specified by the mainScript parameter.
rulesScriptName
This parameter specifies the name of the non-existing script if the rules parameter is used. The
default value is 'Anonymous'.
scriptEncoding
This parameter specifies the encoding of the rule files. Its default value is "UTF-8".
scriptPaths
The parameter scriptPaths refers to a list of String values, which specify the possible locations
of script files. The given locations are absolute paths. A typical value for this parameter is, for
example, "C:/Ruta/MyProject/script/". If the parameter mainScript is set to org.apache.uima.Main,
then the absolute path of the script file has to be "C:/Ruta/MyProject/script/org/apache/uima/
Main.ruta". This parameter can contain multiple values, as the main script can refer to multiple
projects similar to a class path in Java.
descriptorPaths
This parameter specifies the possible locations for descriptors like analysis engines or type systems,
similar to the parameter scriptPaths for the script files. A typical value for this parameter is, for
example, "C:/Ruta/MyProject/descriptor/". The relative values of the parameter additionalEngines
are resolved to these absolute locations. This parameter can contain multiple values, as the main
script can refer to multiple projects similar to a class path in Java.
resourcePaths
This parameter specifies the possible locations of additional resources like word lists or CSV
tables. The string values have to contain absolute locations, for example, "C:/Ruta/MyProject/
resources/".
additionalScripts
The optional parameter additionalScripts is defined as a list of string values and contains script
files, which are additionally loaded by the analysis engine. These script files are specified by their
complete namespace, exactly like the value of the parameter mainScript, and can be referred to by
language elements, e.g., by executing the containing rules. An exemplary value of this parameter
is "org.apache.uima.SecondaryScript". In this example, the main script could import this script file
by the declaration "SCRIPT org.apache.uima.SecondaryScript;" and then could execute it with the
rule "Document{-> CALL(SecondaryScript)};". This optional list can be used as a replacement of
global imports in the script file.
additionalEngines
This optional parameter contains a list of additional analysis engines, which can be executed by
the UIMA Ruta rules. The single values are given by the name of the analysis engine with their
complete namespace and have to be located relative to one value of the parameter descriptorPaths,
the location where the analysis engine searches for the descriptor file. An example for one value of
the parameter is "utils.HtmlAnnotator", which points to the descriptor "HtmlAnnotator.xml" in the
folder "utils". This optional list can be used as a replacement of global imports in the script file.
additionalUimafitEngines
This optional parameter contains a list of additional analysis engines, which can be executed by
the UIMA Ruta rules. The single values are given by the name of the implementation with the
complete namespace and have to be present in the classpath of the application. An example for
one value of the parameter is "org.apache.uima.ruta.engine.HtmlAnnotator", which points to the
"HtmlAnnotator" class. This optional list can be used as a replacement of global imports in the
script file.
additionalExtensions
This parameter specifies optional extensions of the UIMA Ruta language. The elements of the
string list have to implement the interface "org.apache.uima.ruta.extensions.IRutaExtension". With
these extensions, application-specific conditions and actions can be added to the set of provided
ones.
reloadScript
This boolean parameter indicates whether the script or resource files should be reloaded when
processing a CAS. The default value is set to false. In this case, the script files are loaded when the
analysis engine is initialized. If script files or resource files are extended, e.g., a dictionary is filled
while a collection of documents is processed, then this parameter needs to be set to true in order to
include the changes.
seeders
This list of string values refers to implementations of the interface
"org.apache.uima.ruta.seed.RutaAnnotationSeeder", which can be used to automatically
add annotations to the CAS. The default value of the parameter is a single seeder, namely
"org.apache.uima.ruta.seed.DefaultSeeder" that adds annotations for token classes like CW,
MARKUP or SEMICOLON. Remember that additional annotations can also be added with an
additional engine that is executed by a UIMA Ruta rule.
defaultFilteredTypes
This parameter specifies a list of types, which are filtered by default when executing a script file.
Using the default values of this parameter, whitespaces, line breaks and markup elements are not
visible to Ruta rules. The visibility of annotations and, therefore, the covered text can be changed
using the actions FILTERTYPE and RETAINTYPE.
removeBasics
This parameter specifies whether the inference annotations created by the analysis engine should be
removed after processing the CAS. The default value is set to false.
reindexOnly
This parameter specifies the annotation types that should be reindexed for Ruta's internal
annotations. All annotation types that changed since the last call of a Ruta script need to be listed
here. The value of this parameter only needs to be adapted for performance optimization in pipelines
that contain several Ruta analysis engines. The default value is uima.tcas.Annotation.
reindexOnlyMentionedTypes
If this parameter is activated, then only annotations of types that are mentioned in the rules are
internally reindexed at the beginning. This parameter overrides the values of the parameter
'reindexOnly' with the types that are mentioned in the rules. The default value is false.
indexOnlyMentionedTypes
If this parameter is activated, then only annotations of types that are mentioned in the rules are
internally indexed. This optimization of the internal indexing can improve the speed and
reduce the memory footprint. However, several features of the rule matching require the indexing
of types that are not mentioned in the rules, e.g., literal rule matches, wildcards and actions like
MARKFAST, MARKTABLE and TRIE. The default value is false.
indexAdditionally
This parameter specifies annotation types (resolvable mentions are also supported) that should be
indexed in addition to the types mentioned in the rules. This parameter is only used if the parameter
'indexOnlyMentionedTypes' is activated.
emptyIsInvisible
If this parameter is activated, then positions are treated as invisible if the internal indexing of the
corresponding RutaBasic annotation is empty. The default value is true.
modifyDataPath
This parameter specifies whether the datapath of the ResourceManager is extended by the values of
the configuration parameter descriptorPaths. The default value is set to false.
strictImports
This parameter specifies whether short type names should be resolved against the typesystems
declared in the script (true) or at runtime in the CAS typesystem (false). The default value is set to
false.
dynamicAnchoring
If this parameter is set to true, then the Ruta rules are not forced to start to match with the first
rule element. Rather, the rule element referring to the most rare type is chosen. This option can be
utilized to optimize the performance. Please mind that the matching result can vary in some cases
when greedy rule elements are applied. The default value is set to false.
lowMemoryProfile
This parameter specifies whether the memory consumption should be reduced. This parameter
should be set to true for very large CAS documents (e.g., > 500k tokens), but it also reduces the
performance. The default value is set to false.
simpleGreedyForComposed
This parameter specifies whether a different inference strategy for composed rule elements should
be applied. This option is only necessary when the composed rule element is expected to match
very often, e.g., a rule element like (ANY ANY)+. The default value of this parameter is set to
false.
debug
If this parameter is set to true, then additional information about the execution of a rule script is
added to the CAS. The actual information is specified by the following parameters. The default
value of this parameter is set to false.
debugWithMatches
This parameter specifies whether the match information (covered text) of the rules should be stored
in the CAS. The default value of this parameter is set to false.
debugOnlyFor
This parameter specifies a list of rule ids that enumerate the rules for which debug information
should be created. No specific ids are given by default.
profile
If this parameter is set to true, then additional information about the runtime of applied rules is
added to the CAS. The default value of this parameter is set to false.
statistics
If this parameter is set to true, then additional information about the runtime of UIMA Ruta
language elements like conditions and actions is added to the CAS. The default value of this
parameter is set to false.
createdBy
If this parameter is set to true, then additional information about what annotation was created by
which rule is added to the CAS. The default value of this parameter is set to false.
varNames
This parameter specifies the names of variables and is used in combination with the parameter
varValues, which contains the values of the corresponding variables. The n-th entry of this string
array specifies the variable of the n-th entry of the string array of the parameter varValues. If the
variable is defined in the root of a script, then the name of the variable suffices. If the variable
is defined in a BLOCK or imported script, then the name must contain the namespaces of the
blocks as a prefix, e.g., InnerBlock.varName or OtherScript.SomeBlock.varName.
varValues
This parameter specifies the values of variables as string values in an string array. It is used
in combination with the parameter varNames, which contains the names of the corresponding
variables. The n-th entry of this string array specifies the value of the n-th entry of the string array
of the parameter varNames. The values for list variables are separated by the character “,”. Thus,
the usage of commas is not allowed if the variable is a list.
dictRemoveWS
If this parameter is set to true, then whitespaces are removed when dictionaries are loaded.
csvSeparator
If this parameter is set to any String value then this String/token is used to split columns in CSV
tables. The default is set to ';'.
inferenceVisitors
This parameter specifies optional class names implementing the interface
org.apache.uima.ruta.visitor.RutaInferenceVisitor, which will be notified during
applying the rules.
maxRuleMatches
The maximum number of allowed matches of a single rule.
maxRuleElementMatches
The maximum number of allowed matches of a single rule element.
Output
This string parameter specifies the absolute path of the resulting file named “output.txt”. However,
if an annotation of the type “org.apache.uima.examples.SourceDocumentInformation” is given,
then the value of this parameter is interpreted to be relative to the URI stored in the annotation and
the name of the file will be adapted to the name of the source file. If this functionality is activated
in the preferences, then the UIMA Ruta Workbench adds the SourceDocumentInformation
annotation when the user launches a script file. The default value of this parameter is “/../output/”.
Encoding
This string parameter specifies the encoding of the resulting file. The default value of this
parameter is “UTF-8”.
Type
Only the covered texts of annotations of the type specified with this parameter are stored in the
resulting file. The default value of this parameter is “uima.tcas.DocumentAnnotation”, which will
store the complete document in a new file.
1.5.4. Modifier
The Modifier Analysis Engine can be used to create an additional view, which contains all textual
modifications and HTML highlightings that were specified by the executed rules. This Analysis
Engine can be applied, e.g., for anonymization where all annotations of persons are replaced by the
string “Person”. Furthermore, the content of the new view can optionally be stored in a new HTML
file. A descriptor file for this Analysis Engine is located in the folder “descriptor/utils” of a UIMA
Ruta project.
styleMap
This string parameter specifies the name of the style map file created by the Style Map Creator
Analysis Engine, which stores the colors for additional highlightings in the modified view.
descriptorPaths
This parameter can contain multiple string values and specifies the absolute paths where the style
map file can be found.
outputLocation
This optional string parameter specifies the absolute path of the resulting
file named “output.modified.html”. However, if an annotation of the type
“org.apache.uima.examples.SourceDocumentInformation” is given, then the value of this
parameter is interpreted to be relative to the URI stored in the annotation and the name of the file
will be adapted to the name of the source file. If this functionality is activated in the preferences,
then the UIMA Ruta Workbench adds the SourceDocumentInformation annotation when the user
launches a script file. The default value of this parameter is empty. In this case no additional html
file will be created.
outputView
This string parameter specifies the name of the view, which will contain the modified document. A
view of this name must not yet exist. The default value of this parameter is “modified”.
onlyContent
This parameter specifies whether created annotations should cover only the content of the HTML
elements or also their start and end elements. The default value is “true”.
outputView
This string parameter specifies the name of the new view. The default value is “plaintext”.
inputView
This string parameter can optionally be set to specify the name of the input view.
newlineInducingTags
This string array parameter sets the names of the HTML tags that create linebreaks in the output view.
The default is “br, p, div, ul, ol, dl, li, h1, ..., h6, blockquote”.
replaceLinebreaks
This boolean parameter determines if linebreaks inside the text nodes are kept or removed. The
default value is “true”.
linebreakReplacement
This string parameter determines the character sequence that replaces a linebreak. The default
value is the empty string.
conversionPolicy
This string parameter determines the conversion policy used, either "heuristic", "explicit",
or "none". When the value is "explicit", the parameters “conversionPatterns” and optionally
“conversionReplacements” are considered. The "heuristic" conversion policy uses simple regular
expressions to decode HTML 4 entities such as "&nbsp;". The default value is "heuristic".
conversionPatterns
This string array parameter can be used to apply custom conversions. It defaults to a list of
commonly used codes, e.g., "&nbsp;", which are converted using HTML 4 entity unescaping. However,
explicit conversion strings can also be passed via the parameter “conversionReplacements”.
Remember to enable explicit conversion via “conversionPolicy” first.
conversionReplacements
This string array parameter corresponds to “conversionPatterns” such that “conversionPatterns[i]”
will be replaced by “conversionReplacements[i]”; replacements should be shorter than the source
pattern. Per default, the replacement strings are computed using Html4 decoding. Remember to
enable explicit conversion via “conversionPolicy” first.
skipWhitespaces
This boolean parameter determines if the converter should skip whitespaces. HTML documents
often contain whitespaces for indentation and formatting, which should not be reproduced in the
converted plain text document. If the parameter is set to false, then the whitespaces are not removed.
This behavior is useful if not HTML documents but XML files are converted. The default value is
true.
processAll
If this boolean parameter is set to true, then the tags of the complete document are processed, and
not only those within the body tag.
newlineInducingTagRegExp
This string parameter contains a regular expression for HTML/XML elements. If the pattern
matches, then the element will introduce a line break similar to the elements listed in the parameter
“newlineInducingTags”.
gapInducingTags
This string array parameter sets the names of the HTML tags that create additional text in the output
view. The actual string of the gap is defined by the parameter “gapText”.
gapText
This string parameter determines the character sequence that is introduced by the HTML tags
specified in the parameter “gapInducingTags”.
useSpaceGap
This boolean parameter sets the value of the parameter “gapText” to a single space.
styleMap
This string parameter specifies the name of the style map file created by the Style Map Creator
Analysis Engine, which stores the colors for additional highlightings in the modified view.
descriptorPaths
This parameter can contain multiple string values and specifies the absolute paths where the style
map can be found.
1.5.8. Cutter
This Analysis Engine is able to cut the document of the CAS. Only the text covered by annotations
of the specified type will be retained and all other parts of the document will be removed. The
offsets of annotations in the index will be updated, but not feature structures nested as feature
values.
keep
This string parameter specifies the complete name of a type. Only the text covered by annotations
of this type will be retained and all other parts of the document will be removed.
inputView
The name of the view that should be processed.
outputView
The name of the view, which will contain the modified CAS.
output
This string parameter specifies the absolute path of the resulting file named “output.xmi”.
However, if an annotation of the type “org.apache.uima.examples.SourceDocumentInformation”
is given, then the value of this parameter is interpreted to be relative to the URI stored in
the annotation and the name of the file will be adapted to the name of the source file. If
this functionality is activated in the preferences, then the UIMA Ruta Workbench adds the
SourceDocumentInformation annotation when the user launches a script file.
inputView
The name of the view that should be stored in a file.
outputView
The name that should be used to store the view in the file.
Output
This string parameter specifies the absolute path of the resulting file named “output.xmi”.
However, if an annotation of the type “org.apache.uima.examples.SourceDocumentInformation”
is given, then the value of this parameter is interpreted to be relative to the URI stored in
the annotation and the name of the file will be adapted to the name of the source file. If
this functionality is activated in the preferences, then the UIMA Ruta Workbench adds the
SourceDocumentInformation annotation when the user launches a script file. The default value is
“/../output/”
2.1. Syntax
UIMA Ruta defines its own language for writing rules and rule scripts. This section gives a formal
overview of its syntax.
Structure: The overall structure of a UIMA Ruta script is defined by the following syntax.
Comments are excluded from the syntax definition. Comments start with "//" and always go to the
end of the line.
PACKAGE uima.ruta.example;
SCRIPT uima.ruta.example.Author;
SCRIPT uima.ruta.example.Title;
SCRIPT uima.ruta.example.Year;
Syntax of declarations:
Since each condition and each action has its own syntax, conditions and actions are described
in their own section. For conditions see Section 2.7, “Conditions” [39] , for actions
see Section 2.8, “Actions” [49]. The syntax of expressions is explained in Section 2.6,
“Expressions” [35].
It is also possible to use specific expressions as implicit conditions or actions in addition to the set
of available conditions and actions.
Identifier:
The starting rule element can also be manually specified by adding “@” directly in front of the
matching condition. In the following example, the rule first searches for capitalized words (CW)
and then checks whether there is a period in front of the matched word.
PERIOD @CW;
This functionality can also be used for rules that start with an optional rule element by manually
specifying a later rule element to start the matching process.
The choice of the starting rule element can greatly influence the performance speed of the
rule execution. This circumstance is illustrated with the following example that contains two
rules, where an annotation of the type “LastToken” has already been added to the last token of the
document:
ANY LastToken;
ANY @LastToken;
The first rule matches on each token of the document and checks whether the next annotation is
the last token of the document. This will result in many index operations because all tokens of the
document are considered. The second rule, however, matches on the last token and then checks if
there is any token in front of it. This rule, therefore, considers only one token.
The UIMA Ruta language also provides a concept for automatically selecting the starting rule
element, called dynamic anchoring. Here, a simple heuristic concerning the position of the rule
element and the involved types is applied in order to identify the favorable rule element. This
functionality can be activated in the configuration parameters of the analysis engine or directly in
the script file with the DYNAMICANCHORING action.
A list of rule elements normally specifies a sequential pattern. The rule is able to match if the first
rule element successfully matches and then the following rule element at the position after the
match of the first rule element, and so on. There are three language constructs that break up that
sequential matching: “&”, “|” and “%”. A composed rule element where all inner rule elements are
linked by the symbol “&” matches only if all inner rule elements successfully match at the given
position. A composed rule element with inner rule elements linked by the symbol “|” matches if
one of the inner rule elements successfully matches. These composed rule elements therefore specify
a conjunction (“and”) and a disjunction (“or”) of their rule elements at the given position. The symbol
“%” specifies a different use case. Here, rules themselves are linked and they are only able to fire if
each one of the linked rules successfully matched. In contrast to “&”, this linkage of rule elements
does not introduce constraints for the matched positions. In the following, a few examples of these
three language constructs are given.
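The first example uses the symbol “&” to link two rule elements with feature matches; a sketch, assuming a Token type with a feature “posTag” and a Lemma type with a feature “value”, could be:
(Token.posTag=="DET" & Lemma.value=="the");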
This rule is fulfilled if there is a token whose feature “posTag” has the value “DET” and an
annotation of the type “Lemma” whose feature “value” has the value “the”. Both rule elements
need to be fulfilled at the same position.
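The next sketch links rule elements of different lengths with “&”; the concrete structure is an assumption based on the following description, and the type “Name” is assumed to be partially present already:
NUM (W{REGEXP("Peter") -> MARK(Name)} & (ANY CW{PARTOF(Name)}));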
This rule matches on a number and then validates if the next word is “Peter” and if next but one
token is capitalized and part of an annotation of the type “Name”. If all rule elements successfully
matched, then a new annotation of the type “Name” will be created covering the largest match of
the linked rule elements. In this example, the new annotation also covers the token after the word
“Peter”, even though the action was specified at the rule element with the smaller match.
In this example, an annotation of the type “Name” will be created for the token “Peter” followed by
a capitalized word or the word “Mr” followed by a period and a capitalized word.
This rule annotates enumerations of animal annotations whereas each animal annotation is
separated by either a comma or the word “and”.
BLOCK(forEach) Sentence{}{
CW NUM % SW NUM{-> MARK(Found, 1, 2)};
}
Here, annotations of the type “Found” are created if a sentence contains a capitalized word
followed by a number and a small written word followed by a number regardless of where these
annotations occur in the sentence.
2.4. Quantifiers
The last match “Big” can be problematic using different types if the rule starts matching with the
first rule element.
2.5. Declarations
There are three different kinds of declarations in the UIMA Ruta system: Declarations of types
with optional feature definitions, declarations of variables and declarations for importing external
resources, further UIMA Ruta scripts and UIMA components such as type systems and analysis
engines.
2.5.1. Types
Type declarations define new kinds of annotation types and optionally their features.
2.5.1.1. Example:
DECLARE SimpleType1, SimpleType2; // <- two new types with the parent
// type "Annotation"
DECLARE ParentType NewType (SomeType feature1, INT feature2);
// a new type "NewType" with parent type "ParentType" and two features
Attention: Types with features need a parent type in their declarations. If no special parent type is
requested, just use the type Annotation as the default parent type.
2.5.2. Variables
Variable declarations define new variables. The following kinds of variables are available (cf. the
example below):
• Type variable: A variable that represents an annotation type.
• Type list variable: A variable that represents a list of annotation types.
• Integer variable: A variable that represents an integer number.
• Integer list variable: A variable that represents a list of integer numbers.
• Float variable: A variable that represents a floating-point number in single precision.
• Float list variable: A variable that represents a list of floating-point numbers in single
precision.
• Double variable: A variable that represents a floating-point number in double precision.
• Double list variable: A variable that represents a list of floating-point numbers in double
precision.
• String variable: A variable that represents a character sequence.
• String list variable: A variable that represents a list of character sequences.
• Boolean variable: A variable that represents a boolean value.
• Boolean list variable: A variable that represents a list of boolean values.
• Annotation variable: A variable that represents a single annotation.
• Annotation list variable: A variable that represents a list of annotations.
2.5.2.1. Example:
TYPE newTypeVariable;
TYPELIST newTypeList;
INT newIntegerVariable;
INTLIST newIntList;
FLOAT newFloatVariable;
FLOATLIST newFloatList;
DOUBLE newDoubleVariable;
DOUBLELIST newDoubleList;
STRING newStringVariable;
STRINGLIST newStringList;
BOOLEAN newBooleanVariable;
BOOLEANLIST newBooleanList;
ANNOTATION newAnnotationVariable;
ANNOTATIONLIST newAnnotationList;
2.5.3. Resources
There are two kinds of resource declarations that make external resources available in the UIMA
Ruta system:
• List: A list represents a normal text file with an entry per line or a compiled tree of a word
list.
• Table: A table represents a comma-separated values (CSV) file.
2.5.3.1. Example:
WORDLIST listName = 'someWordList.txt';
WORDTABLE tableName = 'someTable.csv';
2.5.4. Scripts
Additional scripts can be imported and reused with the CALL action. The types of the imported
rules are also available so that it is not necessary to import the Type System of the additional rule
script.
2.5.4.1. Example:
SCRIPT my.package.AnotherScript; // "AnotherScript.ruta" in the
//package "my.package"
Document{->CALL(AnotherScript)}; // <- rule executes "AnotherScript.ruta"
2.5.5. Components
There are three kinds of UIMA components that can be imported in a UIMA Ruta script:
• Type System (IMPORT or TYPESYSTEM): includes the types defined in an external type
system. You can select which types or packages to import from a type system and how to
alias them. If you use IMPORT statements, consider enabling the strictImports option.
• Analysis Engine (ENGINE): loads the given descriptor and creates an external analysis
engine. The descriptor must be located in the descriptor paths. The type system needed for
the analysis engine has to be imported separately. Please mind the filtering setting when
calling an external analysis engine.
• Analysis Engine (UIMAFIT): loads the given class and creates an external analysis engine.
Please mind that the implementation of the analysis engine needs to be available. The
type system needed for the analysis engine has to be imported separately. Please mind the
filtering setting when calling an external analysis engine.
IMPORT my.package.SomeType;
Document{->RETAINTYPE(SPACE,BREAK),CALL(ExternalEngine)};
// calls ExternalEngine, but retains white spaces
Document{-> EXEC(AnotherEngine, {SomeType})};
2.6. Expressions
UIMA Ruta provides five different kinds of expressions. These are type expressions, number
expressions, string expressions, boolean expressions and list expressions.
2.6.2. Number Expressions
2.6.2.1. Definition:
NumberExpression -> AdditiveExpression
AdditiveExpression -> MultiplicativeExpression ( ( "+" | "-" )
MultiplicativeExpression )*
MultiplicativeExpression -> SimpleNumberExpression ( ( "*" | "/" | "%" )
SimpleNumberExpression )*
| ( "EXP" | "LOGN" | "SIN" | "COS" | "TAN" )
"(" NumberExpression ")"
SimpleNumberExpression -> "-"? ( DecimalLiteral | FloatingPointLiteral
| NumberVariable) | "(" NumberExpression ")"
DecimalLiteral -> ('0' | '1'..'9' Digit*) IntegerTypeSuffix?
IntegerTypeSuffix -> ('l'|'L')
FloatingPointLiteral -> Digit+ '.' Digit* Exponent? FloatTypeSuffix?
| '.' Digit+ Exponent? FloatTypeSuffix?
| Digit+ Exponent FloatTypeSuffix?
| Digit+ Exponent? FloatTypeSuffix
FloatTypeSuffix -> ('f'|'F'|'d'|'D')
Exponent -> ('e'|'E') ('+'|'-')? Digit+
Digit -> ('0'..'9')
For more information on number variables, see Section 2.5.2, “Variables” [33] .
2.6.2.2. Examples:
98 // an integer number literal
104 // an integer number literal
170.02 // a floating-point number literal
1.0845 // a floating-point number literal
INT intVar1;
INT intVar2;
...
Document{->ASSIGN(intVar1, 12 * intVar1 - SIN(intVar2))};
2.6.3. String Expressions
1. String literals: String literals are defined by any sequence of characters within quotation
marks.
2. String variables: Variables that represent a string value (see Section 2.5.2, “Variables” [33]).
2.6.3.1. Definition:
StringExpression -> SimpleStringExpression
SimpleStringExpression -> StringLiteral ("+" StringExpression)*
| StringVariable
2.6.3.2. Example:
STRING strVar; // define string variable
// add prefix "strLiteral" to variable strVar
Document{->ASSIGN(strVar, "strLiteral" + strVar)};
2.6.4. Boolean Expressions
2.6.4.1. Definition:
BooleanExpression ->
ComposedBooleanExpression
| SimpleBooleanExpression
ComposedBooleanExpression -> BooleanCompare | BooleanTypeExpression
| BooleanNumberExpression | BooleanFunction
SimpleBooleanExpression -> BooleanLiteral | BooleanVariable
BooleanCompare -> SimpleBooleanExpression ( "==" | "!=" )
BooleanExpression
BooleanTypeExpression -> TypeExpression ( "==" | "!=" ) TypeExpression
BooleanNumberExpression -> "(" NumberExpression ( "<" | "<=" | ">"
| ">=" | "==" | "!=" ) NumberExpression ")"
BooleanFunction -> XOR "(" BooleanExpression "," BooleanExpression ")"
BooleanLiteral -> "true" | "false"
2.6.4.2. Examples:
Document{->ASSIGN(boolVar, false)};
If the type variable typeVar represents annotation type Author, the boolean type expression
evaluates to true, otherwise it evaluates to false. The result is assigned to boolean variable boolVar.
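A corresponding rule could look like the following sketch, assuming a type variable typeVar, a
boolean variable boolVar and an annotation type Author defined elsewhere:
Document{->ASSIGN(boolVar, typeVar == Author)};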
This rule shows a boolean number expression. If the value in variable intVar is equal to 10, the
boolean number expression evaluates to true, otherwise it evaluates to false. The result is assigned
to boolean variable boolVar. The brackets surrounding the number expression are necessary.
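A sketch of such a rule, assuming an integer variable intVar and a boolean variable boolVar:
Document{->ASSIGN(boolVar, (intVar == 10))};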
This rule shows a more complex boolean expression. If the value in variable intVar is equal to 10,
the boolean number expression evaluates to true, otherwise it evaluates to false. The result of this
evaluation is compared to booleanVar2. The end result is assigned to boolean variable boolVar1.
Realize that the syntax definition defines exactly this order. It is not possible to have the boolean
number expression on the left side of the comparison.
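A sketch of such a rule, assuming the boolean variables boolVar1 and booleanVar2 and the integer
variable intVar:
Document{->ASSIGN(boolVar1, booleanVar2 == (intVar == 10))};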
2.6.5. List Expressions
2.6.5.1. Definition:
ListExpression ->
WordListExpression | WordTableExpression |
TypeListExpression | NumberListExpression |
StringListExpression | BooleanListExpression
WordListExpression -> RessourceLiteral | WordListVariable
WordTableExpression -> RessourceLiteral | WordTableVariable
TypeListExpression -> TypeListVariable
| "{" TypeExpression ("," TypeExpression)* "}"
NumberListExpression -> IntListVariable | FloatListVariable
| DoubleListVariable
| "{" NumberExpression
("," NumberExpression)* "}"
StringListExpression -> StringListVariable
| "{" StringExpression
("," StringExpression)* "}"
BooleanListExpression -> BooleanListVariable
| "{" BooleanExpression
("," BooleanExpression)* "}"
AnnotationListExpression -> AnnotationListVariable
| "{" AnnotationExpression
("," AnnotationExpression)* "}"
The covered text of an annotation can be referred to with "coveredText" or "ct". The latter one is
an abbreviation and returns the covered text of an annotation only if the type of the annotation does
not define a feature with the name "ct". The following example creates an annotation of the type
TypeA for each word with the covered text "A".
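A sketch of such a rule (the type TypeA is declared here only for illustration):
DECLARE TypeA;
W.ct == "A"{-> TypeA};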
2.7. Conditions
2.7.1. AFTER
The AFTER condition evaluates true, if the matched annotation starts after the beginning of an
arbitrary annotation of the passed type. If a list of types is passed, this has to be true for at least one
of them.
2.7.1.1. Definition:
AFTER(Type|TypeListExpression)
2.7.1.2. Example:
CW{AFTER(SW)};
Here, the rule matches on a capitalized word, if there is any small written word previously.
2.7.2. AND
The AND condition is a composed condition and evaluates true, if all contained conditions evaluate
true.
2.7.2.1. Definition:
AND(Condition1,...,ConditionN)
2.7.2.2. Example:
Paragraph{AND(PARTOF(Headline),CONTAINS(Keyword))
->MARK(ImportantHeadline)};
2.7.3. BEFORE
The BEFORE condition evaluates true, if the matched annotation starts before the beginning of an
arbitrary annotation of the passed type. If a list of types is passed, this has to be true for at least one
of them.
2.7.3.1. Definition:
BEFORE(Type|TypeListExpression)
2.7.3.2. Example:
CW{BEFORE(SW)};
Here, the rule matches on a capitalized word, if there is any small written word afterwards.
2.7.4. CONTAINS
The CONTAINS condition evaluates true on a matched annotation, if the frequency of the passed
type lies within an optionally passed interval. The limits of the passed interval are by default
interpreted as absolute numerical values. By passing a further boolean parameter set to true, the limits
are interpreted as percentage values. If no interval parameters are passed at all, then the condition
checks whether the matched annotation contains at least one occurrence of the passed type.
2.7.4.1. Definition:
CONTAINS(Type(,NumberExpression,NumberExpression(,BooleanExpression)?)?)
2.7.4.2. Example:
Paragraph{CONTAINS(Keyword)->MARK(KeywordParagraph)};
Paragraph{CONTAINS(Keyword,2,4)->MARK(KeywordParagraph)};
A Paragraph is annotated with a KeywordParagraph annotation, if it contains between two and four
Keyword annotations.
Paragraph{CONTAINS(Keyword,50,100,true)->MARK(KeywordParagraph)};
Here, the boolean parameter causes the interval limits to be interpreted as percentages: the Paragraph
is annotated with a KeywordParagraph annotation, if 50% to 100% of it is covered by Keyword
annotations.
2.7.5. CONTEXTCOUNT
The CONTEXTCOUNT condition numbers all occurrences of the matched type within the
context of a passed type's annotation consecutively, thus assigning an index to each occurrence.
Additionally it stores the index of the matched annotation in a numerical variable if one is passed.
The condition evaluates true if the index of the matched annotation is within a passed interval. If no
interval is passed, the condition always evaluates true.
2.7.5.1. Definition:
CONTEXTCOUNT(Type(,NumberExpression,NumberExpression)?(,Variable)?)
2.7.5.2. Example:
Keyword{CONTEXTCOUNT(Paragraph,2,3,var)
->MARK(SecondOrThirdKeywordInParagraph)};
Here, the position of the matched Keyword annotation within a Paragraph annotation is calculated
and stored in the variable 'var'. If the counted value lies within the interval [2,3], then the matched
Keyword is annotated with the SecondOrThirdKeywordInParagraph annotation.
2.7.6. COUNT
The COUNT condition can be used in two different ways. In the first case (see first definition),
it counts the number of annotations of the passed type within the window of the matched
annotation and stores the amount in a numerical variable, if such a variable is passed. The condition
evaluates true if the counted amount is within a specified interval. If no interval is passed, the
condition always evaluates true. In the second case (see second definition), it counts the number
of occurrences of the passed VariableExpression (second parameter) within the passed list (first
parameter) and stores the amount in a numerical variable, if such a variable is passed. Again, the
condition evaluates true if the counted amount is within a specified interval. If no interval is passed,
the condition always evaluates true.
2.7.6.1. Definition:
COUNT(Type(,NumberExpression,NumberExpression)?(,NumberVariable)?)
COUNT(ListExpression,VariableExpression
(,NumberExpression,NumberExpression)?(,NumberVariable)?)
2.7.6.2. Example:
Paragraph{COUNT(Keyword,1,10,var)->MARK(KeywordParagraph)};
Here, the amount of Keyword annotations within a Paragraph is calculated and stored in
the variable 'var'. If one to ten Keywords were counted, the paragraph is marked with a
KeywordParagraph annotation.
Paragraph{COUNT(list,"author",5,7,var)};
Here, the number of occurrences of STRING "author" within the STRINGLIST 'list' is counted
and stored in the variable 'var'. If "author" occurs five to seven times within 'list', the condition
evaluates true.
2.7.7. CURRENTCOUNT
The CURRENTCOUNT condition numbers all occurrences of the matched type within the whole
document consecutively, thus assigning an index to each occurrence. Additionally, it stores the
index of the matched annotation in a numerical variable, if one is passed. The condition evaluates
true if the index of the matched annotation is within a specified interval. If no interval is passed, the
condition always evaluates true.
2.7.7.1. Definition:
CURRENTCOUNT(Type(,NumberExpression,NumberExpression)?(,Variable)?)
2.7.7.2. Example:
Paragraph{CURRENTCOUNT(Keyword,3,3,var)->MARK(ParagraphWithThirdKeyword)};
Here, the Paragraph, which contains the third Keyword of the whole document, is annotated with
the ParagraphWithThirdKeyword annotation. The index is stored in the variable 'var'.
2.7.8. ENDSWITH
The ENDSWITH condition evaluates true, if an annotation of the given type ends exactly at the
same position as the matched annotation. If a list of types is passed, this has to be true for at least
one of them.
2.7.8.1. Definition:
ENDSWITH(Type|TypeListExpression)
2.7.8.2. Example:
Paragraph{ENDSWITH(SW)};
Here, the rule matches on a Paragraph annotation, if it ends with a small written word.
2.7.9. FEATURE
The FEATURE condition compares a feature of the matched annotation with the second argument.
2.7.9.1. Definition:
FEATURE(StringExpression,Expression)
2.7.9.2. Example:
Document{FEATURE("language",targetLanguage)}
This rule matches, if the feature named 'language' of the document annotation equals the value of
the variable 'targetLanguage'.
2.7.10. IF
The IF condition evaluates true, if the contained boolean expression evaluates true.
2.7.10.1. Definition:
IF(BooleanExpression)
2.7.10.2. Example:
Paragraph{IF(keywordAmount > 5)->MARK(KeywordParagraph)};
2.7.11. INLIST
The INLIST condition is fulfilled, if the matched annotation is listed in a given word or string list.
If an optional argument is given, then the value of the argument is used instead of the covered text
of the matched annotation.
2.7.11.1. Definition:
INLIST(WordList(,StringExpression)?)
INLIST(StringList(,StringExpression)?)
2.7.11.2. Example:
Keyword{INLIST(SpecialKeywordList)->MARK(SpecialKeyword)};
A Keyword is annotated with the type SpecialKeyword, if the text of the Keyword annotation is
listed in the word list or string list SpecialKeywordList.
Token{INLIST(MyLemmaList, Token.lemma)->MARK(SpecialLemma)};
This rule creates an annotation of the type SpecialLemma for each token that provides a feature
value of the feature "lemma" that is present in the string list or word list MyLemmaList.
2.7.12. IS
The IS condition evaluates true, if there is an annotation of the given type with the same beginning
and ending offsets as the matched annotation. If a list of types is given, the condition evaluates true,
if at least one of them fulfills the former condition.
2.7.12.1. Definition:
IS(Type|TypeListExpression)
2.7.12.2. Example:
Author{IS(Englishman)->MARK(EnglishAuthor)};
2.7.13. LAST
The LAST condition evaluates true, if the type of the last token within the window of the matched
annotation is of the given type.
2.7.13.1. Definition:
LAST(TypeExpression)
2.7.13.2. Example:
Document{LAST(CW)};
This rule fires, if the last token of the document is a capitalized word.
2.7.14. MOFN
The MOFN condition is a composed condition. It evaluates true if the number of containing
conditions evaluating true is within a given interval.
2.7.14.1. Definition:
MOFN(NumberExpression,NumberExpression,Condition1,...,ConditionN)
2.7.14.2. Example:
Paragraph{MOFN(1,1,PARTOF(Headline),CONTAINS(Keyword))
->MARK(HeadlineXORKeywords)};
2.7.15. NEAR
The NEAR condition is fulfilled, if the distance of the matched annotation to an annotation of
the given type is within a given interval. The direction is defined by a boolean parameter, whose
default value is set to true, therefore searching forward. By default this condition works on an
unfiltered index. An optional fifth boolean parameter can be set to true to get the condition being
evaluated on a filtered index.
2.7.15.1. Definition:
NEAR(TypeExpression,NumberExpression,NumberExpression
(,BooleanExpression(,BooleanExpression)?)?)
2.7.15.2. Example:
Paragraph{NEAR(Headline,0,10,false)->MARK(NoHeadline)};
A Paragraph that starts at most ten tokens after a Headline annotation is annotated with the
NoHeadline annotation.
2.7.16. NOT
The NOT condition negates the result of its contained condition.
2.7.16.1. Definition:
"-"Condition
2.7.16.2. Example:
Paragraph{-PARTOF(Headline)->MARK(Headline)};
A Paragraph that is not part of a Headline annotation so far is annotated with a Headline annotation.
2.7.17. OR
The OR Condition is a composed condition and evaluates true, if at least one contained condition is
evaluated true.
2.7.17.1. Definition:
OR(Condition1,...,ConditionN)
2.7.17.2. Example:
Paragraph{OR(PARTOF(Headline),CONTAINS(Keyword))
->MARK(ImportantParagraph)};
2.7.18. PARSE
The PARSE condition is fulfilled, if the text covered by the matched annotation can be transformed
into a value of the given variable's type. If this is possible, the parsed value is additionally assigned
to the passed variable. For numeric values, this condition delegates to the NumberFormat of the
locale given by the optional second argument. Therefore, this condition parses the string “2,3” for
the locale “en” to the value 23.
2.7.18.1. Definition:
PARSE(variable(, stringExpression)?)
2.7.18.2. Example:
NUM{PARSE(var,"de")};
If the variable 'var' is of an appropriate numeric type for the locale "de", the value of NUM is
parsed and subsequently stored in 'var'.
2.7.19. PARTOF
The PARTOF condition is fulfilled, if the matched annotation is part of an annotation of the given
type. However, it is not necessary that the matched annotation is smaller than the annotation of the
given type. Use the (much slower) PARTOFNEQ condition instead, if this is needed. If a type list
is given, the condition evaluates true, if the former described condition for a single type is fulfilled
for at least one of the types in the list.
2.7.19.1. Definition:
PARTOF(Type|TypeListExpression)
2.7.19.2. Example:
Paragraph{PARTOF(Headline) -> MARK(ImportantParagraph)};
2.7.20. PARTOFNEQ
The PARTOFNEQ condition is fulfilled if the matched annotation is part of (smaller than and
inside of) an annotation of the given type. If also annotations of the same size should be acceptable,
use the PARTOF condition. If a type list is given, the condition evaluates true if the former
described condition is fulfilled for at least one of the types in the list.
2.7.20.1. Definition:
PARTOFNEQ(Type|TypeListExpression)
2.7.20.2. Example:
W{PARTOFNEQ(Headline) -> MARK(ImportantWord)};
2.7.21. POSITION
The POSITION condition is fulfilled, if the matched type is the k-th occurrence of this type within
the window of an annotation of the passed type, whereby k is defined by the value of the passed
NumberExpression. If the additional boolean parameter is set to false, then k counts the occurrences
of the minimal annotations.
2.7.21.1. Definition:
POSITION(Type,NumberExpression(,BooleanExpression)?)
2.7.21.2. Example:
Keyword{POSITION(Paragraph,2)->MARK(SecondKeyword)};
Keyword{POSITION(Paragraph,2,false)->MARK(SecondKeyword)};
A Keyword in a Paragraph is annotated with the type SecondKeyword, if it starts at the same offset
as the second (visible) RutaBasic annotation, which normally corresponds to the tokens.
2.7.22. REGEXP
The REGEXP condition is fulfilled, if the given pattern matches on the matched annotation.
However, if a string variable is given as the first argument, then the pattern is evaluated on the
value of the variable. For more details on the syntax of regular expressions, take a look at the Java
API1 . By default the REGEXP condition is case-sensitive. To change this, add an optional boolean
parameter, which is set to true. The regular expression is initialized with the flags DOTALL
and MULTILINE, and if the optional parameter is set to true, then additionally with the flags
CASE_INSENSITIVE and UNICODE_CASE.
2.7.22.1. Definition:
REGEXP((StringVariable,)? StringExpression(,BooleanExpression)?)
2.7.22.2. Example:
Keyword{REGEXP("..")->MARK(SmallKeyword)};
A Keyword that only consists of two chars is annotated with a SmallKeyword annotation.
2.7.23. SCORE
The SCORE condition evaluates the heuristic score of the matched annotation. This score is set or
changed by the MARK action. The condition is fulfilled, if the score of the matched annotation is
in a given interval. Optionally, the score can be stored in a variable.
2.7.23.1. Definition:
SCORE(NumberExpression,NumberExpression(,Variable)?)
2.7.23.2. Example:
MaybeHeadline{SCORE(40,100)->MARK(Headline)};
An annotation of the type MaybeHeadline is annotated with Headline, if its score is between 40 and
100.
2.7.24. SIZE
The SIZE condition counts the number of elements in the given list. By default, this condition
always evaluates true. When an interval is passed, it evaluates true, if the counted number of list
elements is within the interval. The counted number can be stored in an optionally passed numeral
variable.
1
http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
2.7.24.1. Definition:
SIZE(ListExpression(,NumberExpression,NumberExpression)?(,Variable)?)
2.7.24.2. Example:
Document{SIZE(list,4,10,var)};
This rule fires, if the given list contains between 4 and 10 elements. Additionally, the exact amount
is stored in the variable “var”.
2.7.25. STARTSWITH
The STARTSWITH condition evaluates true, if an annotation of the given type starts exactly at the
same position as the matched annotation. If a type list is given, the condition evaluates true, if the
former is true for at least one of the given types in the list.
2.7.25.1. Definition:
STARTSWITH(Type|TypeListExpression)
2.7.25.2. Example:
Paragraph{STARTSWITH(SW)};
Here, the rule matches on a Paragraph annotation, if it starts with a small written word.
2.7.26. TOTALCOUNT
The TOTALCOUNT condition counts the annotations of the passed type within the whole
document and stores the amount in an optionally passed numerical variable. The condition
evaluates true, if the amount is within the passed interval. If no interval is passed, the condition
always evaluates true.
2.7.26.1. Definition:
TOTALCOUNT(Type(,NumberExpression,NumberExpression(,Variable)?)?)
2.7.26.2. Example:
Paragraph{TOTALCOUNT(Keyword,1,10,var)->MARK(KeywordParagraph)};
Here, the amount of Keyword annotations within the whole document is calculated and stored
in the variable 'var'. If one to ten Keywords were counted, the Paragraph is marked with a
KeywordParagraph annotation.
2.7.27. VOTE
The VOTE condition counts the annotations of the given two types within the window of the
matched annotation and evaluates true, if it finds more annotations of the first type.
2.7.27.1. Definition:
VOTE(TypeExpression,TypeExpression)
2.7.27.2. Example:
Paragraph{VOTE(FirstName,LastName)};
Here, this rule fires, if a paragraph contains more first names than last names.
2.8. Actions
2.8.1. ADD
The ADD action adds all the elements of the passed RutaExpressions to a given list. For example,
these expressions could be a string, an integer variable or a list. For a complete overview of UIMA
Ruta expressions see Section 2.6, “Expressions” [35].
2.8.1.1. Definition:
ADD(ListVariable,(RutaExpression)+)
2.8.1.2. Example:
Document{->ADD(list, var)};
2.8.2. ADDFILTERTYPE
The ADDFILTERTYPE action adds its arguments to the list of filtered types, which restrict the
visibility of the rules.
2.8.2.1. Definition:
ADDFILTERTYPE(TypeExpression(,TypeExpression)*)
2.8.2.2. Example:
Document{->ADDFILTERTYPE(CW)};
After applying this rule, capitalized words are invisible in addition to the previously filtered types.
2.8.3. ADDRETAINTYPE
The ADDRETAINTYPE action adds its arguments to the list of retained types, which extend the
visibility of the rules.
2.8.3.1. Definition:
ADDRETAINTYPE(TypeExpression(,TypeExpression)*)
2.8.3.2. Example:
Document{->ADDRETAINTYPE(MARKUP)};
After applying this rule, markup is visible in addition to the previously retained types.
2.8.4. ASSIGN
The ASSIGN action assigns the value of the passed expression to a variable of the same type.
2.8.4.1. Definition:
ASSIGN(BooleanVariable,BooleanExpression)
ASSIGN(NumberVariable,NumberExpression)
ASSIGN(StringVariable,StringExpression)
ASSIGN(TypeVariable,TypeExpression)
2.8.4.2. Example:
Document{->ASSIGN(amount, (amount/2))};
2.8.5. CALL
The CALL action initiates the execution of a different script file or script block. Currently, only
complete script files are supported.
2.8.5.1. Definition:
CALL(DifferentFile)
CALL(Block)
2.8.5.2. Example:
Document{->CALL(NamedEntities)};
2.8.6. CLEAR
The CLEAR action removes all elements of the given list. If the list was initialized as it was
declared, then it is reset to its initial value.
2.8.6.1. Definition:
CLEAR(ListVariable)
2.8.6.2. Example:
Document{->CLEAR(SomeList)};
2.8.7. COLOR
The COLOR action sets the color of an annotation type in the modified view, if the rule has fired.
The background color is passed as the second parameter. The font color can be changed by passing
a further color as a third parameter. The supported colors are: black, silver, gray, white, maroon,
red, purple, fuchsia, green, lime, olive, yellow, navy, blue, aqua, lightblue, lightgreen, orange, pink,
salmon, cyan, violet, tan, brown, white and mediumpurple.
2.8.7.1. Definition:
COLOR(TypeExpression,StringExpression(, StringExpression
(, BooleanExpression)?)?)
2.8.7.2. Example:
Document{->COLOR(Headline, "red", "green", true)};
This rule colors all Headline annotations in the modified view. Thereby, the background color is
set to red, font color is set to green and all 'Headline' annotations are selected when opening the
modified view.
2.8.8. CONFIGURE
The CONFIGURE action can be used to configure the analysis engine of the given namespace
(first parameter). The parameters that should be configured with corresponding values are passed as
name-value pairs.
2.8.8.1. Definition:
CONFIGURE(AnalysisEngine(,StringExpression = Expression)+)
2.8.8.2. Example:
ENGINE utils.HtmlAnnotator;
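A rule along the following lines performs the reconfiguration described below (a sketch; the
parameter name onlyContent is taken from the description):
Document{->CONFIGURE(HtmlAnnotator, "onlyContent" = false)};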
The former rule changes the value of the configuration parameter “onlyContent” to false and
reconfigures the analysis engine.
2.8.9. CREATE
The CREATE action is similar to the MARK action. It also annotates the matched text fragments
with a type annotation, but additionally assigns values to a chosen subset of the type's feature
elements.
2.8.9.1. Definition:
CREATE(TypeExpression(,NumberExpression)*
(,StringExpression = Expression)+)
2.8.9.2. Example:
Paragraph{COUNT(ANY,0,10000,cnt)->CREATE(Headline,"size" = cnt)};
This rule counts the number of tokens of type ANY in a Paragraph annotation and assigns the
counted value to the int variable 'cnt'. If the counted number is between 0 and 10000, a Headline
annotation is created for this Paragraph. Moreover, the feature named 'size' of Headline is set to the
value of 'cnt'.
2.8.10. DEL
The DEL action deletes the matched text fragments in the modified view. For removing annotations
see UNMARK.
2.8.10.1. Definition:
DEL
2.8.10.2. Example:
Name{->DEL};
This rule deletes all text fragments that are annotated with a Name annotation.
2.8.11. DYNAMICANCHORING
The DYNAMICANCHORING action turns dynamic anchoring on or off (first parameter) and
assigns the anchoring parameters penalty (second parameter) and factor (third parameter).
2.8.11.1. Definition:
DYNAMICANCHORING(BooleanExpression
(,NumberExpression(,NumberExpression)?)?)
2.8.11.2. Example:
Document{->DYNAMICANCHORING(true)};
2.8.12. EXEC
The EXEC action initiates the execution of a different script file or analysis engine on the
complete input document, independent from the matched text and the current filtering settings.
If the imported component (DifferentFile) refers to another script file, it is applied on a new
representation of the document: the complete text of the original CAS with the default filtering
settings of the UIMA Ruta analysis engine. If it refers to an external analysis engine, then it is
applied on the complete document. The optional, first argument is a string expression, which
specifies the view the component should be applied on. The optional, third argument is a list of
types, which should be reindexed by Ruta (not UIMA itself).
Note: Annotations created by the external analysis engine are not accessible for UIMA
Ruta rules in the same script. The types of these annotations need to be provided in the
second argument in order to be visible to the Ruta rules.
2.8.12.1. Definition:
EXEC((StringExpression,)? DifferentFile(, TypeListExpression)?)
2.8.12.2. Example:
ENGINE NamedEntities;
Document{->EXEC(NamedEntities, {Person, Location})};
Here, an analysis engine for named entity recognition is executed once on the complete document
and the annotations of the types Person and Location (and all subtypes) are reindexed in UIMA
Ruta. Without this list of types, the annotations are added to the CAS, but cannot be accessed by
Ruta rules.
2.8.13. FILL
The FILL action fills a chosen subset of the given type's feature elements.
2.8.13.1. Definition:
FILL(TypeExpression(,StringExpression = Expression)+)
2.8.13.2. Example:
Headline{COUNT(ANY,0,10000,tokenCount)
->FILL(Headline,"size" = tokenCount)};
Here, the number of tokens within an Headline annotation is counted and stored in variable
'tokenCount'. If the number of tokens is within the interval [0;10000], the FILL action fills the
Headline's feature 'size' with the value of 'tokenCount'.
2.8.14. FILTERTYPE
This action filters the given types of annotations. They are now ignored by rules. Expressions
are not yet supported. This action is related to RETAINTYPE (see Section 2.8.35,
“RETAINTYPE” [62]).
Note: The visibility of types is calculated using three lists: A list “default” for the initially
filtered types, which is specified in the configuration parameters of the analysis engine,
the list “filtered”, which is specified by the FILTERTYPE action, and the list “retained”,
which is specified by the RETAINTYPE action. For determining the actual visibility of
types, list “filtered” is added to list “default” and then all elements of list “retained” are
removed. The annotations of the types in the resulting list are not visible. Please note that
the actions FILTERTYPE and RETAINTYPE replace all elements of the respective lists
and that RETAINTYPE overrides FILTERTYPE.
2.8.14.1. Definition:
FILTERTYPE((TypeExpression(,TypeExpression)*))?
2.8.14.2. Example:
Document{->FILTERTYPE(SW)};
This rule filters all small written words in the input document. They are further ignored by every
rule.
Document{->FILTERTYPE};
Here, the action (without parentheses) specifies that no additional types should be filtered.
2.8.15. GATHER
This action creates a complex structure: an annotation with features. The optionally passed indexes
(NumberExpressions after the TypeExpression) can be used to create an annotation that spans the
matched information of several rule elements. The features are collected using the indexes of the
rule elements of the complete rule.
2.8.15.1. Definition:
GATHER(TypeExpression(,NumberExpression)*
(,StringExpression = NumberExpression)+)
2.8.15.2. Example:
DECLARE Annotation A;
DECLARE Annotation B;
DECLARE Annotation C(Annotation a, Annotation b);
W{REGEXP("A")->MARK(A)};
W{REGEXP("B")->MARK(B)};
A B{-> GATHER(C, 1, 2, "a" = 1, "b" = 2)};
Two annotations A and B are declared and annotated. The last rule creates an annotation C
spanning the elements A (index 1 since it is the first rule element) and B (index 2) with its features
'a' set to annotation A (again index 1) and 'b' set to annotation B (again index 2).
2.8.16. GET
The GET action retrieves an element of the given list dependent on a given strategy.
The following strategies are supported:
• "dominant": finds the most frequently occurring element
2.8.16.1. Definition:
GET(ListExpression, Variable, StringExpression)
2.8.16.2. Example:
Document{->GET(list, var, "dominant")};
In this example, the element of the list 'list' that occurs most is stored in the variable 'var'.
2.8.17. GETFEATURE
The GETFEATURE action stores the value of the matched annotation's feature (first parameter) in
the given variable (second parameter).
2.8.17.1. Definition:
GETFEATURE(StringExpression, Variable)
2.8.17.2. Example:
Document{->GETFEATURE("language", stringVar)};
In this example, variable 'stringVar' will contain the value of the feature 'language'.
2.8.18. GETLIST
This action retrieves a list of types dependent on a given strategy.
The following strategies are supported:
• "Types": get all types within the matched annotation
• "Types:End": get all types that end at the same offset as the matched annotation
• "Types:Begin": get all types that start at the same offset as the matched annotation
2.8.18.1. Definition:
GETLIST(ListVariable, StringExpression)
2.8.18.2. Example:
Document{->GETLIST(list, "Types")};
Here, a list of all types within the document is created and assigned to list variable 'list'.
2.8.19. GREEDYANCHORING
The GREEDYANCHORING action turns greedy anchoring on or off. If the first parameter is set to
true, then start positions already matched by the same rule element will be ignored. This situation
occurs mostly for rules that start with a quantifier. The second optional parameter activates greedy
anchoring for the complete rule. Later rule matches are only possible after previous matches.
2.8.19.1. Definition:
GREEDYANCHORING(BooleanExpression(,BooleanExpression)?)
2.8.19.2. Example:
Document{->GREEDYANCHORING(true, true)};
ANY+;
CW CW;
The above mentioned example activates greedy anchoring and the second rule will then only
match once since the next positions, e.g., the second token, are already covered by the first attempt.
The third rule will not match on capitalized words that have already been considered by previous
matches of the rule.
2.8.20. LOG
The LOG action writes a log message.
2.8.20.1. Definition:
LOG(StringExpression)
2.8.20.2. Example:
Document{->LOG("processed")};
2.8.21. MARK
The MARK action is the most important action in the UIMA Ruta system. It creates a new
annotation of the given type. The optionally passed indexes (NumberExpressions after the
TypeExpression) can be used to create an annotation that spans the matched information of
several rule elements.
2.8.21.1. Definition:
MARK(TypeExpression(,NumberExpression)*)
2.8.21.2. Example:
Freeline Paragraph{->MARK(ParagraphAfterFreeline,1,2)};
This rule matches on a free line followed by a Paragraph annotation and annotates both in a
single ParagraphAfterFreeline annotation. The two numerical expressions at the end of the mark
action state that the matched text of the first and the second rule elements are joined to create the
boundaries of the new annotation.
2.8.22. MARKFAST
The MARKFAST action creates annotations of the given type (first parameter), if an element of
the passed list (second parameter) occurs within the window of the matched annotation. Thereby,
the created annotation does not cover the whole matched annotation. Instead, it only covers the
text of the found occurrence. The third parameter is optional. It defines, whether the MARKFAST
action should ignore the case, whereby its default value is false. The optional fourth parameter
specifies a character threshold for ignoring the case. It is only relevant, if the ignore-case
value is set to true. The last parameter is set to true by default and specifies whether whitespaces
in the entries of the dictionary should be ignored. For more information on lists see Section 2.5.3,
“Resources” [33]. In addition to external word lists, string list variables can be used.
2.8.22.1. Definition:
MARKFAST(TypeExpression,ListExpression(,BooleanExpression
(,NumberExpression,(BooleanExpression)?)?)?)
MARKFAST(TypeExpression,StringListExpression(,BooleanExpression
(,NumberExpression,(BooleanExpression)?)?)?)
2.8.22.2. Example:
WORDLIST FirstNameList = 'FirstNames.txt';
DECLARE FirstName;
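A rule matching the description below could look like the following sketch, using the word list and
the type declared above:
Document{->MARKFAST(FirstName, FirstNameList, true, 2)};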
This rule annotates all first names listed in the list 'FirstNameList' within the document and ignores
the case, if the length of the word is greater than 2.
2.8.23. MARKFIRST
The MARKFIRST action annotates the first token (basic annotation) of the matched annotation
with the given type.
2.8.23.1. Definition:
MARKFIRST(TypeExpression)
2.8.23.2. Example:
Document{->MARKFIRST(First)};
This rule annotates the first token of the document with the annotation First.
2.8.24. MARKLAST
The MARKLAST action annotates the last token of the matched annotation with the given type.
2.8.24.1. Definition:
MARKLAST(TypeExpression)
2.8.24.2. Example:
Document{->MARKLAST(Last)};
This rule annotates the last token of the document with the annotation Last.
2.8.25. MARKONCE
The MARKONCE action has the same functionality as the MARK action, but creates a new
annotation only, if each part of the matched annotation is not yet part of the given type.
2.8.25.1. Definition:
MARKONCE(NumberExpression,TypeExpression(,NumberExpression)*)
2.8.25.2. Example:
Freeline Paragraph{->MARKONCE(ParagraphAfterFreeline,1,2)};
This rule matches on a free line followed by a Paragraph and annotates both in a single
ParagraphAfterFreeline annotation, if no part is already annotated with a ParagraphAfterFreeline
annotation. The two numerical expressions at the end of the MARKONCE action state that the
matched text of the first and the second rule elements are joined to create the boundaries of the new
annotation.
2.8.26. MARKSCORE
The MARKSCORE action is similar to the MARK action. It also creates a new annotation of
the given type, but only if it is not yet existing. The optionally passed indexes (parameters after
the TypeExpression) can be used to create an annotation that spans the matched information of
several rule elements. Additionally, a score value (first parameter) is added to the heuristic score
value of the annotation. For more information on heuristic scores see Section 2.16, “Heuristic
extraction using scoring rules” [73] .
2.8.26.1. Definition:
MARKSCORE(NumberExpression,TypeExpression(,NumberExpression)*)
2.8.26.2. Example:
Freeline Paragraph{->MARKSCORE(10,ParagraphAfterFreeline,1,2)};
This rule matches on a free line followed by a paragraph and annotates both in a single
ParagraphAfterFreeline annotation. The two number expressions at the end of the mark action
indicate that the matched text of the first and the second rule elements are joined to create the
boundaries of the new annotation. Additionally, the score '10' is added to the heuristic threshold of
this annotation.
2.8.27. MARKTABLE
The MARKTABLE action creates annotations of the given type (first parameter), if an element of
the given column (second parameter) of a passed table (third parameter) occurs within the window
of the matched annotation. Thereby, the created annotation does not cover the whole matched
annotation. Instead, it only covers the text of the found occurrence. Optionally the MARKTABLE
action is able to assign entries of the given table to features of the created annotation. For more
information on tables see Section 2.5.3, “Resources” [33]. Additionally, several configuration
parameters are possible. (See example.)
2.8.27.1. Definition:
MARKTABLE(TypeExpression, NumberExpression, TableExpression
(,BooleanExpression, NumberExpression,
StringExpression, NumberExpression)?
(,StringExpression = NumberExpression)+)
2.8.27.2. Example:
WORDTABLE TestTable = 'TestTable.csv';
DECLARE Annotation Struct(STRING first);
Document{-> MARKTABLE(Struct, 1, TestTable,
true, 4, ".,-", 2, "first" = 2)};
In this example, the whole document is searched for all occurrences of the entries of the first
column of the given table 'TestTable'. For each occurrence, an annotation of the type Struct is
created and its feature 'first' is filled with the entry of the second column. Moreover, the case of the
word is ignored if the length of the word exceeds 4. Additionally, the chars '.', ',' and '-' are ignored,
but maximally two of them.
2.8.28. MATCHEDTEXT
The MATCHEDTEXT action saves the text of the matched annotation in a passed String variable.
The optionally passed indexes can be used to match the text of several rule elements.
2.8.28.1. Definition:
MATCHEDTEXT(StringVariable(,NumberExpression)*)
2.8.28.2. Example:
Headline Paragraph{->MATCHEDTEXT(stringVariable,1,2)};
The text covered by the Headline (rule element 1) and the Paragraph (rule element 2) annotation is
saved in variable 'stringVariable'.
2.8.29. MERGE
The MERGE action merges a number of given lists. The first parameter defines, if the merge is
done as intersection (false) or as union (true). The second parameter is the list variable that will
contain the result.
2.8.29.1. Definition:
MERGE(BooleanExpression, ListVariable, ListExpression, (ListExpression)+)
2.8.29.2. Example:
Document{->MERGE(false, listVar, list1, list2, list3)};
The elements that occur in all three lists will be placed in the list 'listVar'.
2.8.30. REMOVE
The REMOVE action removes lists or single values from a given list.
2.8.30.1. Definition:
REMOVE(ListVariable,(Argument)+)
2.8.30.2. Example:
Document{->REMOVE(list, var)};
In this example, the variable 'var' is removed from the list 'list'.
2.8.31. REMOVEDUPLICATE
This action removes all duplicates within a given list.
2.8.31.1. Definition:
REMOVEDUPLICATE(ListVariable)
2.8.31.2. Example:
Document{->REMOVEDUPLICATE(list)};
2.8.32. REMOVEFILTERTYPE
The REMOVEFILTERTYPE action removes its arguments from the list of filtered types, which
restrict the visibility of the rules.
2.8.32.1. Definition:
REMOVEFILTERTYPE(TypeExpression(,TypeExpression)*)
2.8.32.2. Example:
Document{->REMOVEFILTERTYPE(W)};
After applying this rule, words are possibly visible again depending on the current filtering settings.
2.8.33. REMOVERETAINTYPE
The REMOVEFILTERTYPE action removes its arguments from the list of retained types, which
extend the visibility of the rules.
2.8.33.1. Definition:
REMOVERETAINTYPE(TypeExpression(,TypeExpression)*)
2.8.33.2. Example:
Document{->REMOVERETAINTYPE(W)};
After applying this rule, words are possibly not visible anymore depending on the current filtering
settings.
2.8.34. REPLACE
The REPLACE action replaces the text of all matched annotations with the given StringExpression.
It remembers the modification for the matched annotations and shows them in the modified view
(see Section 2.17, “Modification” [73]).
2.8.34.1. Definition:
REPLACE(StringExpression)
2.8.34.2. Example:
FirstName{->REPLACE("first name")};
This rule replaces all first names with the string 'first name'.
2.8.35. RETAINTYPE
The RETAINTYPE action retains the given types. This means that they are now not ignored by
rules. This action is related to FILTERTYPE (see Section 2.8.14, “FILTERTYPE” [54]).
Note: The visibility of types is calculated using three lists: A list “default” for the initially
filtered types, which is specified in the configuration parameters of the analysis engine,
the list “filtered”, which is specified by the FILTERTYPE action, and the list “retained”,
which is specified by the RETAINTYPE action. For determining the actual visibility of
types, list “filtered” is added to list “default” and then all elements of list “retained” are
removed. The annotations of the types in the resulting list are not visible. Please note that
the actions FILTERTYPE and RETAINTYPE replace all elements of the respective lists
and that RETAINTYPE overrides FILTERTYPE.
2.8.35.1. Definition:
RETAINTYPE((TypeExpression(,TypeExpression)*))?
2.8.35.2. Example:
Document{->RETAINTYPE(SPACE)};
Document{->RETAINTYPE};
Here, the the action (without parentheses) specifies that no types should be retained.
2.8.36. SETFEATURE
The SETFEATURE action sets the value of a feature of the matched complex structure.
2.8.36.1. Definition:
SETFEATURE(StringExpression,Expression)
2.8.36.2. Example:
Document{->SETFEATURE("language","en")};
2.8.37. SHIFT
The SHIFT action can be used to change the offsets of an annotation. The two number expressions,
which refer to the rule elements of the rule, specify the new offsets of the annotation. The annotations
that will be modified have to start or end at the match of the rule element of the action if the
boolean option is set to true. By default, only the matched annotation of the given type will be
modified. In either way, this means that the action has to be placed at a matching condition, which
will be used to specify the annotations to be changed.
2.8.37.1. Definition:
SHIFT(TypeExpression,NumberExpression,NumberExpression,BooleanExpression?)
2.8.37.2. Example:
Author{-> SHIFT(Author,1,2)} PM;
In this example, an annotation of the type “Author” is expanded in order to cover the following
punctuation mark.
In this example, an annotation of the type “FS” that consists mostly of words is shrunk by
removing the last MARKUP annotation.
2.8.38. SPLIT
The SPLIT action is able to split the matched annotation for each occurrence of an annotation of the
given type. There are three additional parameters: The first one specifies if complete annotations
of the given type should be used to split the matched annotations. If set to false, then even the
boundary of an annotation will cause splitting. The third (addToBegin) and fourth (addToEnd)
argument specify if the complete annotation (for splitting) will be added to the begin or end of the
split annotation. The latter two are only utilized if the first one is set to true. If omitted, the first
argument is true and the other two arguments are false by default.
2.8.38.1. Definition:
SPLIT(TypeExpression(,BooleanExpression,
(BooleanExpression, BooleanExpression)?)?)
2.8.38.2. Example:
Sentence{-> SPLIT(PERIOD, true, false, true)};
In this example, an annotation of the type “Sentence” is split for each occurrence of a period, which
is added to the end of the new sentence.
2.8.39. TRANSFER
The TRANSFER action creates a new feature structure and adds all compatible features of the
matched annotation.
2.8.39.1. Definition:
TRANSFER(TypeExpression)
2.8.39.2. Example:
Document{->TRANSFER(LanguageStorage)};
Here, a new feature structure “LanguageStorage” is created and the compatible features of the
Document annotation are copied. E.g., if LanguageStorage defined a feature named 'language', then
the feature value of the Document annotation is copied.
2.8.40. TRIE
The TRIE action uses an external multi tree word list to annotate the matched annotation and
provides several configuration parameters.
2.8.40.1. Definition:
TRIE((String = (TypeExpression|{TypeExpression,StringExpression,
Expression}))+,ListExpression,BooleanExpression,NumberExpression,
BooleanExpression,NumberExpression,StringExpression)
2.8.40.2. Example:
Document{->TRIE("FirstNames.txt" = FirstName, "Companies.txt" = Company,
'Dictionary.mtwl', true, 4, false, 0, ".,-/")};
Here, the dictionary 'Dictionary.mtwl' that contains word lists for first names and companies is used
to annotate the document. The words previously contained in the file 'FirstNames.txt' are annotated
with the type FirstName and the words in the file 'Companies.txt' with the type Company. The
case of the word is ignored, if the length of the word exceeds 4. The edit distance is deactivated.
The cost of an edit operation can currently not be configured by an argument. The last argument
additionally defines several chars that will be ignored.
Here, the dictionary 'list1' is applied on the document. Matches originating in the dictionary
'FirstNames.txt' result in annotations of type A, whereas their feature 'a' is set to 'first'. The other
two dictionaries create annotations of type 'B' and 'C' for the corresponding dictionaries with a
boolean feature value and an integer feature value.
2.8.41. TRIM
The TRIM action changes the offsets on the matched annotations by removing annotations, whose
types are specified by the given parameters.
2.8.41.1. Definition:
TRIM(TypeExpression ( , TypeExpression)*)
TRIM(TypeListExpression)
2.8.41.2. Example:
Keyword{-> TRIM(SPACE)};
This rule removes all spaces at the beginning and at the end of Keyword annotations and thus
changes the offsets of the matched annotations.
2.8.42. UNMARK
The UNMARK action removes the annotation of the given type overlapping the matched
annotation. There are two additional configurations: If additional indexes are given, then the span
of the specified rule elements is applied, similar to the MARK action. If instead a boolean is
given as an additional argument, then all annotations of the given type are removed that start at the
matched position.
2.8.42.1. Definition:
UNMARK(AnnotationExpression)
UNMARK(TypeExpression)
UNMARK(TypeExpression (,NumberExpression)*)
UNMARK(TypeExpression, BooleanExpression)
2.8.42.2. Example:
Headline{->UNMARK(Headline)};
CW ANY+? QUESTION{->UNMARK(Headline,1,3)};
Here, all Headline annotations are removed that start with a capitalized word and end with a
question mark.
CW{->UNMARK(Headline,true)};
Here, all Headline annotations are removed that start with a capitalized word.
Complex{->UNMARK(Complex.inner)};
2.8.43. UNMARKALL
The UNMARKALL action removes all the annotations of the given type and all of its descendants
overlapping the matched annotation, unless the annotation is of at least one type in the passed list.
2.8.43.1. Definition:
UNMARKALL(TypeExpression, TypeListExpression)
2.8.43.2. Example:
Annotation{->UNMARKALL(Annotation, {Headline})};
Here, all annotations overlapping the matched annotation are removed, except for Headline
annotations.
Note: The visibility of types is calculated using three lists: A list “default” for the initially
filtered types, which is specified in the configuration parameters of the analysis engine,
the list “filtered” , which is specified by the FILTERTYPE action, and the list “retained”
, which is specified by the RETAINTYPE action. For determining the actual visibility of
types, list “filtered” is added to list “default” and then all elements of list “retained” are
removed. The annotations of the types in the resulting list are not visible. Please note that
the actions FILTERTYPE and RETAINTYPE replace all elements of the respective lists
and that RETAINTYPE overrides FILTERTYPE.
2.9. Robust extraction using filtering
If no rule action changed the configuration of the filtering settings, then the default filtering
configuration ignores whitespaces and markup. Look at the following rule:
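It is the same rule that is used, with modified filtering settings, in the examples further below:
"Dr" PERIOD CW CW;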
Using the default setting, this rule matches on all four lines of this input document:
To change the default setting, use the “FILTERTYPE” or “RETAINTYPE” action. For example,
if markup should no longer be ignored, try the following example on the above mentioned input
document:
Document{->RETAINTYPE(MARKUP)};
"Dr" PERIOD CW CW;
You will see that the third line of the previous input example will no longer be matched.
Document{->FILTERTYPE(PERIOD)};
"Dr" CW CW;
Since periods are ignored here, the rule will match on all four lines of the example.
Notice that using a filtered annotation type within a rule prevents this rule from being executed. Try
the following:
Document{->FILTERTYPE(PERIOD)};
"Dr" PERIOD CW CW;
You will see that this matches on no line of the input document since the second rule uses the
filtered type PERIOD and is therefore not executed.
2.10. Wildcard #
The wildcard # is a special matching condition of a rule element, which does not match by itself but
uses the next rule element to determine its match. Its behavior is similar to a generic rule element
with a reluctant, unrestricted quantifier like ANY+?, but it is much more efficient since no additional
annotations have to be matched. The functionality of the wildcard is illustrated with the following
examples:
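A sketch of such a rule (the type Sentence is assumed to be declared or imported):
PERIOD #{-> Sentence} PERIOD;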
In this example, everything in between two periods is annotated with an annotation of the type
Sentence. This rule is much more efficient than a rule like PERIOD ANY+{-PARTOF(PERIOD)}
PERIOD; since it only navigates in the index of PERIOD annotations and does not match on
all tokens. The wildcard is a normal matching condition and can be used as any other matching
condition. If the sentence should include the period, the rule would look like:
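A sketch of such a rule:
PERIOD (# PERIOD){-> Sentence};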
This rule creates only annotations after a period. If the wildcard is used as an anchor of the rule,
e.g., is the first rule element and no manual anchor is specified, then it starts to match at the
beginning of the document or current window.
(# PERIOD){-> Sentence};
This rule creates a Sentence annotation starting at the beginning of the document and ending with the first
period. If the rule elements are switched, the result is quite different because of the starting anchor
of the rule:
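A sketch of the switched rule:
(PERIOD #){-> Sentence};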
Here, one annotation of the type Sentence is created for each PERIOD annotation, starting with the
period and ending at the end of the document. Note that optional rule elements after wildcards are
currently treated as non-optional.
In this example, an annotation of the type SentenceEnd is created for each PERIOD annotation,
if it is followed by something that is not part of a CW. This is also fulfilled for the last PERIOD
annotation in a document that ends with a period.
sw1:SW sw2:SW{sw1.end == sw2.begin};
This rule matches on two consecutive small written words, but only if there is no space in
between them. Label expressions can also be used across inlined rules, see Section 2.14, “Inlined rules” [71].
2.13. Blocks
There are different types of blocks in UIMA Ruta. Blocks aggregate rules or even other blocks and
may serve as more complex control structures. They are even able to change the rule behavior of
the contained rules.
2.13.1. BLOCK
BLOCK provides a simple control structure in the UIMA Ruta language:
1. Conditioned statements
2. Loops
3. Procedures
Declaration of a block:
A block declaration always starts with the keyword “BLOCK” , followed by the identifier of the
block within parentheses. The “RuleElementType” -element is a UIMA Ruta rule that consists of
exactly one rule element. The rule element has to be a declared annotation type.
Note: The rule element in the definition of a block has to define a condition/action part,
even if that part is empty ( “{}” ).
Through the rule element a new local document is defined, whose scope is the related block. So if
you use Document within a block, this always refers to the locally limited document.
BLOCK(ForEach) Paragraph{} {
Document{COUNT(CW)}; // Here "Document" is limited to a Paragraph;
// therefore the rule only counts the CW annotations
// within the Paragraph
}
A block is always executed when the UIMA Ruta interpreter reaches its declaration. But a block
may also be called from another position of the script, see Section 2.13.1.3, “Procedures” [70].
Examples:
DECLARE Month;
Examples:
DECLARE SentenceWithNoLeadingNP;
BLOCK(ForEach) Sentence{} {
Document{-STARTSWITH(NP) -> MARK(SentenceWithNoLeadingNP)};
}
This construction is especially useful, if you have a set of rules, which has to be executed
continuously on the same part of an input document. Let us assume that you have already annotated
your document with Paragraph annotations. Now you want to count the number of words within
each paragraph and, if the number of words exceeds 500, annotate it as BigParagraph. Therefore,
you wrote the following rules:
DECLARE BigParagraph;
INT numberOfWords;
Paragraph{COUNT(W,numberOfWords)};
Paragraph{IF(numberOfWords > 500) -> MARK(BigParagraph)};
This will not work. The reason is that the rule which counts the number of words within
a Paragraph is executed on all Paragraphs before the last rule, which marks a Paragraph as
BigParagraph, is even executed once. When the last rule is reached, the variable
numberOfWords holds the number of words of the last Paragraph in the input document; as a
result, either all Paragraphs are annotated as BigParagraph or none of them.
To solve this problem, use a block to tie the execution of these rules together for each Paragraph:
DECLARE BigParagraph;
INT numberOfWords;
BLOCK(IsBig) Paragraph{} {
Document{COUNT(W,numberOfWords)};
Document{IF(numberOfWords > 500) -> MARK(BigParagraph)};
}
Since the scope of the Document is limited to a Paragraph within the block, the rule which counts
the words is executed exactly once before the second rule decides whether the Paragraph is a
BigParagraph. Of course, this is done for every Paragraph in the whole document.
2.13.1.3. Procedures
Blocks can be used to introduce procedures to UIMA Ruta scripts. To do this, declare a block as
before. Let us assume you want to simulate a procedure
that counts the number of annotations of a passed type within the document and returns the counted
number. This can be done in the following way:
// variables used by the "procedure"; they need to be declared in the script
BOOLEAN executeProcedure;
TYPE type;
INT amount;
BLOCK(countNumberOfTypesInDocument) Document{IF(executeProcedure)} {
Document{COUNT(type, amount)};
}
Document{->ASSIGN(executeProcedure, true)};
Document{->ASSIGN(type, Paragraph)};
Document{->CALL(MyScript.countNumberOfTypesInDocument)};
The boolean variable executeProcedure is used to prevent the execution of the block when
the interpreter first reaches the block, since this is not a procedure call. The block can be called by
referring to it with its name, preceded by the name of the script the block is defined in. In this
example, the script is called MyScript.ruta.
2.13.2. FOREACH
The syntax of the FOREACH block is very similar to the common BLOCK construct, but the
execution of the contained rules differs and can lead to other results. Here, all contained rules
are applied on each matched annotation consecutively. In a BLOCK construct, in contrast, each
rule is applied within the window of each matched annotation. The differences can be summarized
as follows:
1. The FOREACH does not restrict the window for the contained rules. The rules are able
to match on the complete document, or at least within the window defined by previous
BLOCK definitions.
2. The identifier of the FOREACH block (the part within the parentheses) declares a new local
annotation variable. The matched annotations of the head rule are assigned to this variable,
one in each iteration of the loop.
3. It is expected that the local variable is part of each rule within the FOREACH block. The
start anchor of each rule is set to the rule element that contains the annotation variable as its
matching condition, unless another start anchor is defined before the variable.
4. An additional optional boolean parameter specifies the direction of the matching process.
With the default value true, the loop will start with the first annotation continuing with the
following annotations. If set to false, the loop will start with the last annotation continuing
with the previous annotations.
The following example illustrates the syntax and semantic of the FOREACH block:
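A sketch of such a construct, assuming a declared type SpecialNum, could look like this:
FOREACH(num) NUM{}{
    num{-> SpecialNum} CW;
    SW num{-> SpecialNum};
}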
The first line specifies that the FOREACH block iterates over all annotations of the type NUM
and assigns each matched annotation to a new local variable named num. The block contains two
rules. Both rules start their matching process with the rule element with the matching condition
num, meaning that they match directly on the annotation matched by the head rule. While the first
rule validates that there is a capitalized word following the number, the second rule validates that
there is a small-written word before the number. Thus, this construct efficiently annotates numbers
with annotations of the type SpecialNum depending on their surrounding.
cannot be called by other rules. If the curly brackets start with the symbol “<-” , then the inlined
rules are interpreted as some sort of condition. The surrounding rules will only match, if one of
the inlined rules was successfully applied. A rule element may be extended with several inlined
rule blocks of the same type. The functionality introduced by inlined rules is illustrated with a few
examples:
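For instance (a sketch, with the required type declarations added):
DECLARE NumBeforeWord, SentenceWithNumBeforeWord;
Sentence{}->{NUM{-> NumBeforeWord} W;};
Sentence{-> SentenceWithNumBeforeWord}<-{NUM W;};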
The first rule in this example matches on each “Sentence” annotation and applies the inlined
rule within each matched sentence. The inlined rule matches on numbers followed by a word
and annotates the number with an annotation of the type “NumBeforeWord” . The second rule
matches on each sentence and applies the inlined rule within each sentence. Note that the inlined
rule contains no actions. The rule matches only successfully on a sentence if one of the inlined rules
was successfully applied. In this case, the sentence is only annotated with an annotation of the type
“SentenceWithNumBeforeWord” , if the sentence contains a number followed by a word.
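A sketch of a rule combining both kinds of inlined rule blocks (the exact formulation is an assumption):
DECLARE SpecialPeriod;
Document{FEATURE("language", "en")}->{
    PERIOD #{}<-{COLON COLON; COMMA COMMA;} PERIOD{-> SpecialPeriod};
};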
This example combines both types of inlined rules. First, the rule matches on document
annotations with the language feature set to “en”. Only for those documents, the first inner rule
is applied. The inner rule matches on everything between two periods, but only if the text span
between the periods fulfills two conditions: there must be two successive colons and two successive
commas within the window of the matched part of the wildcard. Only if these constraints are
fulfilled, then the last period is annotated with the type “SpecialPeriod” .
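Such macros could be defined and used as in the following sketch (the concrete conditions OR and IS are assumptions):
CONDITION CWorPERIODor(TYPE t) = OR(IS(CW), IS(PERIOD), IS(t));
ACTION INC(VAR INT i, INT inc) = ASSIGN(i, i + inc);
INT counter;
ANY{CWorPERIODor(Bold) -> INC(counter, 1)};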
The first line in this example declares a new macro condition with the name “CWorPERIODor”
with one annotation type argument named “t” . The condition is fulfilled if the matched text is
either a CW annotation, a PERIOD annotation or an annotation of the given type t. The second
line declares a new macro action with the name “INC” and two integer arguments “i” and “inc”
. The keyword “VAR” indicates that the first argument should be treated as a variable, meaning
that the actions of the macro can assign new values to the given argument. Otherwise, only the value of
the argument would be accessible to the actions. The action itself just contains an ASSIGN action,
which adds the second argument to the variable given in the first argument. The rule in line 4 finally
matches on each annotation of the type ANY and validates if the matched position is either a CW,
a PERIOD or an annotation of the type Bold. If this is the case, then the value of the variable counter
defined in line 3 is incremented by 1.
Paragraph{CONTAINS(W,1,5)->MARKSCORE(5,Headline)};
Paragraph{CONTAINS(W,6,10)->MARKSCORE(2,Headline)};
Paragraph{CONTAINS(Emph,80,100,true)->MARKSCORE(7,Headline)};
Paragraph{CONTAINS(Emph,30,80,true)->MARKSCORE(3,Headline)};
Paragraph{CONTAINS(CW,50,100,true)->MARKSCORE(7,Headline)};
Paragraph{CONTAINS(W,0,0)->MARKSCORE(-50,Headline)};
Headline{SCORE(10)->MARK(Realhl)};
Headline{SCORE(5,10)->LOG("Maybe a headline")};
In the first part of this rule set, annotations of the type Paragraph receive scoring points for
a Headline annotation, if they fulfill certain CONTAINS conditions. The first condition, for
example, evaluates to true if the paragraph contains one up to five words, whereas the fourth
condition is fulfilled if the paragraph contains thirty up to eighty percent of Emph annotations.
The last two rules finally execute their actions, if the score of a headline annotation exceeds ten
points, or lies in the interval of five to ten points, respectively.
2.17. Modification
There are different actions that can modify the input document, like DEL, COLOR and REPLACE.
However, the input document itself cannot be modified directly. A separate engine, the
Modifier.xml, has to be called in order to create another CAS view with the (default) name
"modified". In that document, all modifications are executed.
The following example shows how to import and call the Modifier.xml engine. The example is
explained in detail in Section 1.4, “Learning by Example” [3] .
ENGINE utils.Modifier;
Date{-> DEL};
MoneyAmount{-> REPLACE("<MoneyAmount/>")};
Document{-> COLOR(Headline, "green")};
Document{-> EXEC(Modifier)};
recognized first names, you have to change the rule itself every time. Moreover, writing rules with
possibly hundreds of first names is not really practical and definitely not efficient,
if you already have the list of first names as a simple text file. Using this text file directly would
reduce the effort.
UIMA Ruta provides, therefore, two kinds of external resources to solve such tasks more easily:
WORDLISTs and WORDTABLEs.
2.18.1. WORDLISTs
A WORDLIST is a list of text items. There are three different possibilities of how to provide a
WORDLIST to the UIMA Ruta system.
The first possibility is the use of simple text files, which contain exactly one list item per line. For
example, a list "FirstNames.txt" of first names could look like this:
Frank
Peter
Jochen
Martin
First names within a document containing any number of these listed names could be annotated by
using Document{->MARKFAST(FirstName, 'FirstNames.txt')};, assuming an already
declared type FirstName. To make this rule recognize more first names, simply add them to the external
list. You could also use a WORDLIST variable to do the same thing as follows, which is preferable:
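A sketch of this variant:
WORDLIST FirstNameList = 'FirstNames.txt';
DECLARE FirstName;
Document{-> MARKFAST(FirstName, FirstNameList)};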
Another possibility to provide WORDLISTs, besides plain text files, is the use of compiled
“tree word lists”. The file ending for this is “.twl”. A tree word list is similar to a trie. It is an
XML file that contains a tree-like structure with a node for each character. The nodes themselves
refer to child nodes that represent all characters that succeed the character of the parent node.
For single word entries the resulting complexity is O(m*log(n)) instead of O(m*n) for simple
text files. Here m is the amount of basic annotations in the document and n is the amount of
entries in the dictionary. To generate a tree word list, see Section 3.11, “Creation of Tree Word
Lists” [110] . A tree word list is used in the same way as simple word lists, for example
Document{->MARKFAST(FirstName, 'FirstNames.twl')}; .
A third kind of usable WORDLIST is the “multi tree word list”. The file ending for this is “.mtwl”.
It is generated from several ordinary WORDLISTs given as simple text files. It contains special
nodes that provide additional information about the original file. This kind of WORDLIST is
useful if several different WORDLISTs are used within a UIMA Ruta script. Using five different
lists results in five rules using the MARKFAST action. The documents to annotate are thus
searched five times, resulting in a complexity of 5*O(m*log(n)). With a multi tree word list this can
be reduced to about O(m*log(5*n)). To generate a multi tree word list, see Section 3.11, “Creation
of Tree Word Lists” [110]. To use a multi tree word list, UIMA Ruta provides the action TRIE.
If for example two word lists “FirstNames.txt” and “LastNames.txt” have been merged in the multi
tree word list “Names.mtwl” , then the following rule annotates all first names and last names in the
whole document:
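A sketch using the TRIE action (the trailing parameters, which configure case sensitivity, length filtering, edit distance and ignored characters, are assumptions):
Document{-> TRIE("FirstNames.txt" = FirstName, "LastNames.txt" = LastName,
    'Names.mtwl', false, 0, false, 0, "")};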
Only if the wordlist is explicitly declared with WORDLIST, a StringExpression including
variables can also be used to specify the file:
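For example (a sketch with a hypothetical string variable, assuming the FirstName type from above):
STRING firstNameFile = "FirstNames.txt";
WORDLIST FirstNameList = firstNameFile;
Document{-> MARKFAST(FirstName, FirstNameList)};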
2.18.2. WORDTABLEs
WORDLISTs have been used to annotate all occurrences of any list item in a document with
a certain type. Imagine now that each annotation has features that should be filled with values
dependent on the list item that matched. This can be achieved with WORDTABLEs. Let us,
for example, assume we want to annotate all US presidents within a document. Moreover, each
annotation should contain the party of the president as well as the year of his inauguration.
Therefore we use an annotation type DECLARE Annotation PresidentOfUSA(STRING
party, INT yearOfInauguration) . To achieve this, it is recommended to use
WORDTABLEs.
A WORDTABLE is simply a comma-separated file (.csv), which actually uses semicolons for
separation of the entries. For our example, such a file named “presidentsOfUSA.csv” could look
like this:
Bill Clinton;democrats;1993
George W. Bush;republicans;2001
Barack Obama;democrats;2009
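A sketch of how such a table can be used with the MARKTABLE action, where the first column is matched and the remaining columns fill the features (the exact parameter list is an assumption):
WORDTABLE presidentsTable = 'presidentsOfUSA.csv';
DECLARE Annotation PresidentOfUSA(STRING party, INT yearOfInauguration);
Document{-> MARKTABLE(PresidentOfUSA, 1, presidentsTable,
    "party" = 2, "yearOfInauguration" = 3)};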
Only if the wordtable is explicitly declared with WORDTABLE, a StringExpression
including variables can also be used to specify the file:
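For example (a sketch with a hypothetical string variable):
STRING tableFile = "presidentsOfUSA.csv";
WORDTABLE presidentsTable = tableFile;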
rules can be restricted to match only within certain annotations using the BLOCK construct, and
ignore all filtering settings.
The following example contains a simple rule, which is able to create annotations of two different
types. It creates an annotation of the type “T1” for each match of the complete regular expression
and an annotation of the type “T2” for each match of the first capturing group.
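A sketch of such a rule, with a made-up pattern and assuming declared types T1 and T2:
"([A-Z][a-z]+) [0-9]+" -> T1, 1 = T2;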
2.20.1.1. DOCUMENTBLOCK
This additional block construct applies the contained statements/rules on the complete document
independent of previous windows and restrictions. It resets the matching context, but otherwise
behaves like a normal BLOCK.
BLOCK(ex) NUM{}{
DOCUMENTBLOCK W{}{
// do something with the words
}
}
The example contains two blocks. The first block iterates over all numbers (NUM). The second
block resets the match context and matches on all words (W), for every previously matched
number.
2.20.1.2. ONLYFIRST
This additional block construct applies the contained statements/rules only until the first one was
successfully applied. The following example provides an overview of the syntax:
ONLYFIRST Document{}{
Document{CONTAINS(Keyword1) -> Doc1};
Document{CONTAINS(Keyword2) -> Doc2};
Document{CONTAINS(Keyword3) -> Doc3};
}
The block contains three rules each evaluating if the document contains a specific annotation of the
type Keyword1/2/3. If the first rule is able to match, then the other two rules will not be applied.
Accordingly, if the first rule fails to match and the second rule is able to match, then the
third rule will not be applied.
2.20.1.3. ONLYONCE
Rules within this block construct will stop after the first successful match. The following example
provides an overview of the syntax:
ONLYONCE Document{}{
CW{-> FirstCW};
NUM+{-> FirstNumList};
}
The block contains two rules. The first rule will annotate the first capitalized word of the document
with the type FirstCW. All further possible matches will be skipped. The second rule will annotate
the first sequence of numbers with the type FirstNumList. The greedy behavior of the quantifiers is
not changed by the ONLYONCE block.
2.20.1.4. Stringfunctions
In order to manipulate strings in variables, a number of string functions have been added. They will
all be presented with a short example demonstrating their use.
firstCharToUpperCase(IStringExpression expr)
STRING s;
STRINGLIST sl;
SW{-> MATCHEDTEXT(s), ADD(sl, firstCharToUpperCase(s))};
CW{INLIST(sl) -> Test};
This example declares a STRING and a STRINGLIST. Afterwards, for every small-written word,
the corresponding word with a capitalized first character is added to the STRINGLIST. This might be
helpful in German named entity recognition, where you will encounter "der blonde Junge..." and
"der Blonde", which both map to the same entity. Applied to the word "blonde", you can then also track
the second appearance of that person. In the last line, a rule marks all words in the STRINGLIST as
a Test annotation.
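A sketch of the corresponding function and its usage (the signature is assumed by analogy with the other string functions):
replaceFirst(IStringExpression expr, IStringExpression search, IStringExpression replacement)
STRING s;
STRINGLIST sl;
CW{-> MATCHEDTEXT(s), ADD(sl, replaceFirst(s, "e", "o"))};
CW{INLIST(sl) -> Test};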
This example declares a STRING and a STRINGLIST. Next, every capitalized word (CW) is added to
the STRINGLIST, however with the first "e" replaced by "o". Afterwards, all entries of
the STRINGLIST are matched against all present CWs, which are annotated as a Test annotation if a match
occurs.
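A sketch of the corresponding function and its usage (the signature is assumed by analogy with the other string functions):
replaceAll(IStringExpression expr, IStringExpression search, IStringExpression replacement)
STRING s;
STRINGLIST sl;
CW{-> MATCHEDTEXT(s), ADD(sl, replaceAll(s, "e", "o"))};
CW{INLIST(sl) -> Test};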
This example declares a STRING and a STRINGLIST. Next, every capitalized word (CW) is added to
the STRINGLIST; however, similar to the above example, a replacement is applied first.
This time all "e"s are going to be replaced by "o"s. Afterwards, all entries of the STRINGLIST
are matched against all present CWs, which are annotated as a Test annotation if a match occurs.
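A sketch of the corresponding function and its usage (the signature is assumed by analogy with the other string functions):
substring(IStringExpression expr, INumberExpression from, INumberExpression to)
STRING s;
STRINGLIST sl;
CW{-> MATCHEDTEXT(s), ADD(sl, substring(s, 0, 9))};
CW{INLIST(sl) -> Test};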
This example declares a STRING and a STRINGLIST. Imagine you found the word
"Alexanderplatz" but only want to continue with the word "Alexander". This snippet shows
how this can be done by using the string functions in Ruta. If a word has fewer characters than
specified in the arguments, nothing will be executed.
toLowerCase(IStringExpression expr)
STRING s;
STRINGLIST sl;
CW{-> MATCHEDTEXT(s), ADD(sl, toLowerCase(s))};
SW{INLIST(sl) -> Test};
This example declares a STRING and a STRINGLIST. A problem you might encounter is that
you want to know whether the first word of a sentence is really a noun (again more or less German
related). By using this function, you could add all words that start a sentence (which usually means a
capitalized word) to a list, as in this example, and then test whether a word also appears within the text,
but this time in lowercase. As a result, you could change its POS tag.
toUpperCase(IStringExpression expr)
STRING s;
STRINGLIST sl;
CW{-> MATCHEDTEXT(s), ADD(sl, toUpperCase(s))};
SW{INLIST(sl) -> T1};
This example declares a STRING and a STRINGLIST. A typical scenario for its use might be
named entity recognition. This time you want to find all organizations in a given input document.
At first, you might track down all fully capitalized words. As a second step, you can use this
function to iterate over all CW instances and compare each found instance with all the uppercase
organizations that were found before.
This function is useful if you want to find all words that contain a given character sequence. Assume
again you are in an NER task and found the token "Alexanderplatz"; using this function, you can track
down the names that are part of a given token. This example uses a BLOCK to iterate over each word
and then checks whether the text of that word contains the given character sequence. If so, it is
annotated as a Test annotation.
Assume you found the suffix "str" to be a strong indicator of whether a given token represents a
location (a street); by using this function, you can now easily identify all of those words, given a
valid suffix.
Given the stem of a word, you may want to mark every instance that was possibly derived from that
stem. If you decide to use this function, you can detect all those words in one line and, in a next step,
mark all of them as an annotation type of choice.
These functions check whether both arguments are equal in terms of the text of the token that they
contain.
A function equivalent to the one in the Java String library. It checks whether or not a given variable
contains the empty string literal "".
2.20.1.5. typeFromString
This function takes a string expression and tries to find the corresponding type. Short names are
supported but need to be unambiguous.
CW{-> typeFromString("Person")};
In this example, each CW annotation is annotated with an annotation of the type Person.
Three classes need to be implemented for adding a new condition that also is resolved in the UIMA
Ruta Workbench:
The exemplary project provides implementations of all possible language elements. This project
contains the implementations for the analysis engine and also the implementation for the UIMA
Ruta Workbench, and is therefore an Eclipse plugin (mind the pom file).
Concerning the ExampleCondition condition extension, there are four important spots/classes:
3. ExampleConditionIDEExtension provides the syntax check for the editor and the keyword
for syntax coloring.
<extension point="org.apache.uima.ruta.ide.conditionExtension">
<condition
class="org.apache.uima.ruta.example.extensions.
ExampleConditionIDEExtension"
engine="org.apache.uima.ruta.example.extensions.
ExampleConditionExtension">
</condition>
</extension>
If the UIMA Ruta Workbench is not used or the rules are only applied in UIMA
pipelines, only the ExampleCondition and ExampleConditionExtension classes are needed.
Adding new conditions using Java projects in the same workspace has not been tested yet, but at
least the Workbench support will be missing due to the inclusion of extensions using the extension
point mechanism of Eclipse.
3.1. Installation
Install the UIMA Ruta Workbench as follows:
1. Download, install and start Eclipse. The Eclipse version currently supported by UIMA Ruta
is given on the webpage of the project1. This is normally the latest version of Eclipse, which
can be obtained from the eclipse.org2 download site.
4. Select “Apache UIMA Ruta” and (if not yet installed) “Apache UIMA Eclipse tooling and
runtime support” by clicking into the related checkbox.
5. Also select “Contact all update sites during install to find required software ” and click on
“Next”.
6. On the next page, click “Next” again. Now, the license agreement site is displayed. To
install UIMA Ruta read the license and choose “I accept the ...” if you agree to it. Then,
click on “Finish”.
1. https://uima.apache.org/ruta.html
2. https://eclipse.org/
3. http://www.apache.org/dist/uima/eclipse-update-site/
4. http://www.apache.org/dist/uima/eclipse-update-site/
Now, UIMA Ruta is going to be installed. After the successful installation, switch to the
UIMA Ruta perspective. To get an overview, see Section 3.2, “UIMA Ruta Workbench
Overview” [84].
Several times within this chapter we use a UIMA Ruta example project to illustrate the use of the
UIMA Ruta Workbench. The “ExampleProject” project is part of the source release of UIMA Ruta
(example-projects folder).
To import this project into the workbench do “File → Import... ” . Select “Existing Projects into
Workspace” under “General” . Select the “ExampleProject” directory in your file system as root
directory and click on “Finish” . The example project is now available in your workspace.
1. The “UIMA Ruta perspective”, which provides the main functionality for working on
UIMA Ruta projects. See Section 3.4, “UIMA Ruta Perspective” [89].
2. The “Explain perspective”, which provides functionality primarily used to explain how
a set of rules are executed on input documents. See Section 3.5, “UIMA Ruta Explain
Perspective” [91].
The following Table 3.2, “UIMA Ruta wizards” [86] lists all UIMA Ruta wizards:
Folder Description
script Source folder for UIMA Ruta scripts and packages.
descriptor Build folder for UIMA components. Analysis engines and type systems are created automatically from the related script files.
input Folder that contains the files that will be processed when launching a UIMA Ruta script. Such input files could be plain text, HTML or xmiCAS files.
output Folder that contains the resulting xmiCAS files. One xmiCAS file is generated for each associated document in the input folder.
resources Default folder for word lists, dictionaries and tables.
test Folder for test-driven development.
Figure 3.3, “A newly created UIMA Ruta project” [87] shows a project, newly created with the
wizard.
Figure 3.4, “Wizard start page” [88] shows the start page of the wizard.
To create a simple UIMA Ruta project, enter a project name for your project and click “Finish”.
This will create everything you need to start.
Other possible settings on this page are the desired location of the project, the interpreter to use and
the working set you wish to work on; all of them are self-explanatory.
Figure 3.5, “Wizard second page” [89] shows the second page of the wizard.
To make it possible to reproduce all of the examples used below, switch to the UIMA Ruta Explain
perspective within your Eclipse workbench. Import the UIMA Ruta example project and open the
main UIMA Ruta script file 'Main.ruta'. Now press the 'Run' button (green arrow) and wait for the
end of execution. Open the resulting xmiCAS file 'Test1.txt.xmi', which you can find in the output
folder.
The result of the execution of the UIMA Ruta example project is shown in Figure 3.6, “ Annotation
Browser view ” [90]. You can see that there are 5 annotations of 5 different types in the
document. Highlighting of certain types can be controlled by the checkboxes in the tree view. The
names of the types are abbreviated by their package constituents. Full type names are provided by
tooltips and can be copied into the clipboard by hitting 'ctrl + c'. This can be especially useful to
paste a type name into the Query View.
Moreover, this view has two possible filters. Using the “Only types with...”-filter leads to a list
containing only those types that contain the entered text. The “Only annotations with...”-filter leads
to an analogous list. Both list filters can be quickly activated by hitting the return key.
Type highlighting can be reset by the action represented by a switched-off light bulb at the top
of the view. The light bulb that is turned on sets all types visible in the tree view of the page
highlighted in the CAS editor.
The offsets of selected annotations can be modified with ctrl + u (reduce begin), ctrl + i (increase
begin), ctrl + o (reduce end), and ctrl + p (increase end).
The user can choose whether parent types are displayed using the preference page “UIMA Cas
Editor -> Cas Editor Views”.
3.4.2. Selection
The Selection view is very similar to the Annotation Browser view, but only shows annotations
that affect a specific text passage. To get such a list, click on any position in the opened xmiCAS
document or select a certain text passage.
If you select the text passage 2008, the Selection view will be generated as shown in Figure 3.6, “
Annotation Browser view ” [90].
The Selection view has the same filtering and modification options as described in Annotation
Browser view.
To make it possible to reproduce all of the examples used below, switch to the UIMA Ruta Explain
perspective within your Eclipse workbench. Import the UIMA Ruta example project and open the
main Ruta script file 'Main.ruta'. Now press the 'Debug' button and wait for the end of execution.
Open the resulting xmiCAS file 'Test1.txt.xmi', which you can find in the output folder.
The structure is as follows: if BLOCK constructs were used in the executed Ruta file, the rules
contained in that block will be represented as child node in the tree of the view. Each Ruta file is
a BLOCK construct itself and named after the file. The root node of the view is, therefore, always
a BLOCK containing the rules of the executed UIMA Ruta script. Additionally, if a rule calls a
different Ruta file, then the root block of that file is the child of the calling rule.
If you double-click on one of the rules, the related script file is opened within the editor and the rule
itself is selected.
Section 3.5, “UIMA Ruta Explain Perspective” [91] shows the whole rule hierarchy resulting
from the UIMA Ruta example project. The root of the whole hierarchy is the BLOCK associated
to the 'Main.ruta' script. On the next level, the rules called by the 'Main.ruta' script are listed. Since
there is a call to each of the script files 'Year.ruta', 'Author.ruta' and 'Title.ruta', these are included
into the hierarchy, each forming their own block.
The following image shows the UIMA Ruta Applied Rules view.
The selection (single-click) of one of the text passages in either Matched Rules view or Failed
Rules view will directly change the information visualized in the Rule Elements view.
Within the Rule Elements view, each rule element generates its own explanation hierarchy. On
the root level, the rule element itself is given. An apostrophe at the beginning of the rule element
indicates that this rule was the anchor for the rule execution. On the next level, the text passage on
which the rule element tried to match is given. The last level explains why the rule element did
or did not match. The first entry on this level tells, if the text passage is of the requested annotation
type. If it is, a green hook is shown in front of the requested type. Otherwise, a red cross is shown.
In the following the rule conditions and their evaluation on the given text passage are shown.
In the previous example, select the listed instance Bethard, S.. The Rule Elements view shows
the related explanation displayed in Figure 3.10, “ The views Matched Rules and Failed Rules
” [94].
The following image shows the UIMA Ruta Rule Elements view.
As you can see, the first rule element Name{-PARTOF(NameListPart)} matched on the text
passage Bethard, S. since it is firstly annotated with a “Name” annotation and secondly it is
not part of an annotation “NameListPart”. However, as this first text passage is not followed by a
“NameLinker” annotation the whole rule fails.
3.5.6. Created By
The Created By view tells you which rule created a specific annotation. To get this information,
select an annotation in the Annotation Browser. After doing this, the Created By view shows the
related information.
To see how this works, use the example project and go to the Annotation view. Select the
“d.u.e.Year” annotation “(2008)”. The Created By view displays the information, shown in
Figure 3.11, “ The Created By view ” [95]. You can double-click on the shown rule to jump to
the related document “Year.ruta”.
3.5.7. Statistics
The Statistics view displays profiling information for the used conditions and actions of the UIMA
Ruta language. Three numbers are given for each element: The total time of execution, the amount
of executions and the average time per execution.
The following image shows the UIMA Ruta Statistics view generated from the UIMA Ruta
example project.
The table at the bottom of the view contains all documents that are inspected. The intervals
specifying the color of the icon of each document, depending on the CDE result, can be set in the
preference page. If the evaluation has finished, then the CDE value is displayed and additionally
the F1 score, if available. The table is sortable. A double-click on a document opens it in the CAS
Editor.
toolbar of the view export and import the constraints using a simple XML format. The buttons in the
right part of the view can modify the list of the constraints. Currently, three types of constraints
are supported: Simple UIMA Ruta rules, a list of simple UIMA Ruta rules and word distribution
constraints. All constraints return a value between 0 and 1. The constraints based on UIMA Ruta
rules return the ratio of how often the rule was applied to how often the rule tried to apply. The rule
“Author{STARTSWITH(Reference)};”, for example, returns 1, if all author annotations start with
a reference annotation. The word distribution constraints refer to a txt file in which each line has
the format “"Proceedings":Booktitle 0.95, Journal 0.05” specifying the distribution of the word
proceedings concerning the interesting annotations. If no quotes are given for the first term, then
the term is interpreted as an annotation type. The result value is currently calculated by using the
cosine similarity of the expected and observed frequencies.
Figure 3.14. The Query View. (1) Start Button; (2) Export Button
1. The field “Query Data” specifies the folder containing the documents on which the query
should be executed. You can either click on the button next to the field to specify the folder
by browsing through the file system or you can drag and drop a folder directly into the field.
If the checkbox is activated, all subfolders are included.
2. The field “Type System” has to contain a type system or a UIMA Ruta script that specifies
all types that are used in the query. You can either click on the button next to the field to
specify the type system by browsing through the file system or you can drag and drop a type
system directly into the field.
3. The query in form of one or more UIMA Ruta rules is specified in the text field in the
middle of the view.
4. After pressing the start button, the query is started. The results are subsequently displayed in
the bottom text field.
The resulting list consists of all text passages the query applied to. Above the text field, information
about the total number of matches and the number of different documents the query applied to
is given. Each item in the list shows both the matched text passage and in brackets the document
related to the text passage. By double-clicking on one of the listed items, the related document is
opened in the editor and the matched text passage is selected. If the related document is already
open, you can jump to another matched text passage within the same document with one click
on the listed item. Of course, this text passage is selected. By clicking on the export button, a list
of all matched text passages is shown in a separate window. For further usage, e.g., as a list of
authors in another UIMA Ruta project, copy the content of this window to another text file.
The screenshot shows an example where a rule is used to find occurrences of years within brackets
in the input file of the UIMA Ruta example. After pressing the run button the result list contains all
occurrences. Note that the rule does not create any annotations. The list contains all rule matches,
not the created annotations.
3.8. Testing
The UIMA Ruta Workbench comes bundled with its own testing environment that allows you to
test and evaluate UIMA Ruta scripts. It provides full back-end testing capabilities and allows you to
examine test results in detail.
To test the quality of a written UIMA Ruta script, the testing procedure compares a previously
annotated gold standard file with the resulting xmiCAS file created by the selected UIMA Ruta
script. As a product of the testing operation a new xmiCAS file will be created, containing detailed
information about the test results. The evaluators compare the offsets of annotations and, depending
on the selected evaluator, add true positive, false positive or false negative annotations for each
tested annotation to the resulting xmiCAS file. Afterwards precision, recall and f1-score are
calculated for each test file and each type in the test file. The f1-score is also calculated for the
whole test set. The testing environment consists of four views: Annotation Test, True Positive,
False Positive and False Negative. The Annotation Test view is by default associated with the
UIMA Ruta perspective.
Note: There are two options for choosing the types that should be evaluated, which is
specified by the preference “Use all types”. If this preference is activated (by default),
then the user has to select the types using the toolbar in the view. There are buttons for
selecting the included and excluded types. If this preference is deactivated, then only the
types present in the current test document are evaluated. This can result in missing false
positives, if an annotation of a specific type was created by the rules and no annotation
of this type is present in the test document.
Figure 3.15, “Test folder structure. ” [99] shows the script explorer. Every UIMA Ruta project
contains a folder called “test”. This folder is the default location for the test-files. In the folder each
script file has its own subfolder with a relative path equal to the scripts package path in the “script”
folder. This folder contains the test files. In every scripts test folder, you will also find a result
folder where the results of the tests are saved. If you like to use test files from another location in
the file system, the results will be saved in the “temp” subfolder of the project's test folder. All files
in the temp folder will be deleted once Eclipse is closed.
3.8.1. Usage
This section describes the general proceeding when using the testing environment.
Currently, the testing environment has no own perspective associated to it. It is recommended to
start within the UIMA Ruta perspective. There, the Annotation Test view is open by default. The
True Positive, False Positive and False Negative views have to be opened manually: “Window ->
Show View -> True Positive/False Positive/False Negative ”.
To explain the usage of the UIMA Ruta testing environment, the UIMA Ruta example project is
used again. Open this project. Firstly, one has to select a script for testing: UIMA Ruta will always
test the script that is currently open and active in the script editor. So, open the “Main.ruta” script
file of the UIMA Ruta example project. The next figure shows the Annotation Test view after
doing this.
Figure 3.16. The Annotation Test view. Buttons from left to right: Start Test; Select excluded
type; Select included type; Select evaluator/preferences; Export to CSV; Extend Classpath
All control elements that are needed for the interaction with the testing environment are located
here. At the top right, there is the button bar. At the top left of the view, the name of the script
that is going to be tested is shown. It is always equal to the script active in the editor. Below this,
the test list is located. This list contains the different files for testing. Right next to the name of
the script file, you can select the desired view. Right to this, you get statistics over all test runs: the
number of all true positives (TP), false positives (FP) and false negatives (FN). In the field below,
you will find a table with statistical information for a single selected test file. To change this view,
select a file in the test list field. The table shows the total TP, FP and FN information, as well as
precision, recall and f1-score for every type as well as for the whole file.
There is also an experimental feature to extend the classpath during testing, which allows
evaluating scripts that call analysis engines in the same workspace. To use it, you have to toggle the
button in the toolbar of the view.
Next, you have to add test files to your project. A test file is a previously annotated xmiCAS file
that can be used as a gold standard for the test. You can use any xmiCAS file. The UIMA Ruta
example project already contains such test files. These files are listed in the Annotation Test view.
Try to delete these files by selecting them and clicking on Del. Add these files again by simply
dragging them from the Script Explorer into the test file list. A different way to add test-files is
to use the “Load all test files from selected folder” button (green plus). It can be used to add all
xmiCAS files from a selected folder.
The testing environment supports different evaluators that allow a sophisticated analysis of
the behavior of a UIMA Ruta script. The evaluator can be chosen in the testing environment's
preference page. The preference page can be opened either through the menu or by clicking on
the “Select evaluator” button (blue gear wheels) in the testing view's toolbar. Clicking the button
will open a filtered version of the UIMA Ruta preference page. The default evaluator is the "Exact
CAS Evaluator", which compares the offsets of the annotations between the test file and the file
annotated by the tested script. To get an overview of all available evaluators, see Section 3.8.2,
“Evaluators” [104]
This preference page (see Figure 3.17, “The testing preference page view ” [101]) offers a few
options that will modify the plug-ins general behavior. For example, the preloading of previously
collected result data can be turned off. An important option in the preference page is the evaluator
you can select. By default, the "exact evaluator" is selected, which compares the offsets of the
annotations that are contained in the file produced by the selected script with the annotations in the
test file. Other evaluators will compare annotations in a different way.
During a test-run it might be convenient to disable testing for specific types like punctuation or
tags. The “Select excluded types” button (white exclamation in a red disk) will open a dialog (see
Figure 3.18, “Excluded types window ” [102]) where all types can be selected that should not be
considered in the test.
A test-run can be started by clicking on the start button. Do this for the UIMA Ruta example
project. Figure 3.19, “The Annotation Test view. ” [102] shows the results.
The testing main view displays some information on how well the script did after every test run. It
will display the overall number of true positive, false positive and false negative annotations of all
result files as well as an overall f1-score. Furthermore, a table will be displayed that contains the
overall statistics of the selected test file as well as statistics for every single type in the test file. The
displayed information comprises true positives, false positives, false negatives, precision, recall
and f1-measure.
The testing environment also supports the export of the overall data in form of a comma-separated
table. Clicking the “export data” button will open a dialog window that contains this table. The text
in this table can be copied and easily imported into other applications.
When running a test, the evaluator will create a new result xmiCAS file and will add new true
positive, false positive and false negative annotations. By clicking on a file in the test-file list, you
can open the corresponding result xmiCAS file in the CAS Editor. While displaying the result
xmiCAS file in the CAS Editor, the True Positive, False Positive and False Negative views allow
easy navigation through the new tp, fp and fn annotations. The corresponding annotations are
displayed in a hierarchic tree structure. This allows an easy tracing of the results within the testing
document. Clicking on one of the annotations in those views will highlight the annotation in the
CAS Editor. Opening “test1.result.xmi” in the UIMA Ruta example project changes the True
Positive view as shown in Figure 3.20, “The True Positive view. ” [104]. Notice that the type
system, which will be used by the CAS Editor to open the evaluated file, can only be resolved for
the tested script, if the test files are located in the associated folder structure that is the folder with
the name of the script. If the files are located in the temp folder, for example by adding the files
to the list of test cases by drag and drop, other strategies to find the correct type system will be
applied. For UIMA Ruta projects, for example, this will be the type system of the last launched
script in this project.
3.8.2. Evaluators
When testing a CAS file, the system compares the offsets of the annotations of a previously
annotated gold standard file with the offsets of the annotations of the result file the script produced.
Responsible for comparing annotations in the two CAS files are evaluators. These evaluators have
different methods and strategies implemented for comparing the annotations. Also, an extension
point is provided that allows easy implementation of new evaluators.
Exact Match Evaluator: The Exact Match Evaluator compares the offsets of the annotations in the
result and the gold standard file. Any difference will be marked with either a false positive or a
false negative annotation.
Partial Match Evaluator: The Partial Match Evaluator compares the offsets of the annotations
in the result and gold standard file. It will allow differences in the beginning or the end of an
annotation. For example, "corresponding" and "corresponding " will not be annotated as an error.
Core Match Evaluator: The Core Match Evaluator accepts annotations that share a core expression.
In this context, a core expression is at least four digits long and starts with a capitalized letter.
For example, the two annotations "L404-123-421" and "L404-321-412" would be considered
a true positive match, because "L404" is considered a core expression that is contained in both
annotations.
Word Accuracy Evaluator: Compares the labels of all words/numbers in an annotation, whereas the
label equals the type of the annotation. This has the consequence, for example, that each word or
number that is not part of the annotation is counted as a single false negative. For example in the
sentence: "Christmas is on the 24.12 every year." The script labels "Christmas is on the 12" as a
single sentence, while the test file labels the sentence correctly with a single sentence annotation.
While, for example, the Exact CAS Evaluator is only assigning a single False Negative annotation,
Word Accuracy Evaluator will mark every word or number as a single false negative.
Template Only Evaluator: This evaluator compares the offsets of the annotations and the features
that have been created by the script. For example, the text "Alan Mathison Turing" is marked with
the author annotation and "author" contains 2 features: "FirstName" and "LastName". If the script
now creates an author annotation with only one feature, the annotation will be marked as a false
positive.
Template on Word Level Evaluator: The Template On Word Evaluator compares the offsets of
the annotations. In addition, it also compares the features and feature structures and the values
stored in the features. For example, the annotation "author" might have features like "FirstName"
and "LastName". The authors name is "Alan Mathison Turing" and the script correctly assigns
the author annotation. The features assigned by the script are "FirstName : Alan" and "LastName :
Mathison", while the correct feature values are "FirstName : Alan" and "LastName : Turing". In this
case, the evaluator will mark the annotation as a false positive, since the feature
values differ.
3.9. TextRuler
Apache UIMA Ruta TextRuler is a framework for supervised rule induction included in the
UIMA Ruta Workbench. It provides several configurable algorithms, which are able to learn new
rules based on given labeled data. The framework was created in order to support the user by
suggesting new rules for the given task. The user selects a suitable learning algorithm and adapts
its configuration parameters. Furthermore, the user engineers a set of annotation-based features,
which enable the algorithms to form efficient, effective and comprehensive rules. The rule learning
algorithms present their suggested rules in a new view, in which the user can either copy the
complete script or single rules to a new script file, where the rules can be further refined.
This section gives a short introduction about the included features and learners, and how to use the
framework to learn UIMA Ruta rules. First, the available rule learning algorithms are introduced in
Section 3.9.1, “Included rule learning algorithms” [105]. Then, the user interface and the usage
is explained in Section 3.9.2, “The TextRuler view” [107] and Section 4.5, “Induce rules with
the TextRuler framework” [120] illustrates the usage with an exemplary UIMA Ruta project.
3.9.1.1. LP2
Note: This rule learner is an experimental implementation of the ideas and algorithms
published in: F. Ciravegna. (LP)2, Rule Induction for Information Extraction Using
Linguistic Constraints. Technical Report CS-03-07, Department of Computer Science,
University of Sheffield, Sheffield, 2003.
This algorithm learns separate rules for the beginning and the end of a single slot, which are later
combined in order to identify the targeted annotation. The learning strategy is a bottom-up covering
algorithm. It starts by creating a specific seed instance with a window of w tokens to the left and
right of the target boundary and searches for the best generalization. Additional context rules are
induced in order to identify missing boundaries. The current implementation does not support
correction rules. The TextRuler framework provides two versions of this algorithm: LP2 (naive) is
a straightforward implementation with limited expressiveness concerning the resulting Ruta rules.
LP2 (optimized) is an improved version with a dynamic programming approach and is providing
better results in general. The following parameters are available. For a more detailed description of
the parameters, please refer to the implementation and the publication.
• Context Window Size (to the left and right)
• Best Rules List Size
• Minimum Covered Positives per Rule
• Maximum Error Threshold
• Contextual Rules List Size
3.9.1.2. WHISK
Note: This rule learner is an experimental implementation of the ideas and algorithms
published in: Stephen Soderland, Claire Cardie, and Raymond Mooney. Learning
Information Extraction Rules for Semi-Structured and Free Text. In Machine Learning,
volume 34, pages 233-272, 1999.
WHISK is a multi-slot method that operates on all three kinds of documents and learns single-
or multi-slot rules looking similar to regular expressions. However, the current implementation
only supports single-slot rules. The top-down covering algorithm begins with the most general
rule and specializes it by adding single rule terms until the rule does not make errors anymore
on the training set. The TextRuler framework provides two versions of this algorithm: WHISK
(token) is a naive token-based implementation. WHISK (generic) is an optimized and improved
implementation, which is able to refer to arbitrary annotations and also supports primitive features.
The following parameters are available. For a more detailed description of the parameters, please
refer to the implementation and the publication.
• Parameters Window Size
• Maximum Error Threshold
• PosTag Root Type
• Considered Features (comma-separated) - only WHISK (generic)
3.9.1.3. TraBaL
Note: This rule learner is an implementation of the ideas and algorithms published in:
Benjamin Eckstein, Peter Kluegl, and Frank Puppe. Towards Learning Error-Driven
The TraBal rule learner induces rules that try to correct annotation errors and relies on two sets of
documents: a set of documents with gold standard annotations and an additional set of annotated
documents with the same text that possibly contain erroneous annotations, for which correction
rules should be learnt. First, the algorithm compares the two sets of documents and identifies the
present errors. Then, rules for each error are induced and extended. This process can be iterated
in order to incrementally remove all errors. The following parameters are available. For a more
detailed description of the parameters, please refer to the implementation and the publication.
3.9.1.4. KEP
The name of the rule learner KEP (knowledge engineering patterns) is derived from the idea that
humans use different engineering patterns to write annotation rules. This algorithm implements
simple rule induction methods for some patterns, such as boundary detection or annotation-
based restriction of the window. The results are then combined in order to take advantage of
the combination of the different kinds of induced rules. Since the single rules are constructed
according to how humans engineer the annotations rules, the resulting rule set should resemble
more a handcrafted rule set. Furthermore, by exploiting the synergy of the patterns, solutions for
some annotations are much simpler. The following parameters are available. For a more detailed
description of the parameters, please refer to the implementation.
be applied on the documents before the algorithms start in order to add additional annotations as
learning features. The preprocessing can be skipped. All text fields support drag and drop: the user
can drag a file from the script explorer and drop it in the respective text field. In the center of the view,
the target types, for which rules should be induced, can be specified in the “Information Types” list.
The list “Featured Feature Types” specifies the filtering settings, but it is discouraged to change these
settings. The user is able to drop a simple text file, which contains a type with complete namespace
in each line, to the “Information Types” list in order to add all those types. The lower part of the
view contains the list of available algorithms. All checked algorithms will be started, if the start
button in the toolbar of the view is pressed. When the algorithms are started, they display their
current action after their name, and a result view with the currently induced rules is displayed in the
right part of the perspective.
the selected types can easily be accepted or rejected. Figure 3.23, “Check Annotations view (right
part) ” [109] provides a screenshot of the view. Its parts are described in the following.
The view provides three text fields: the absolute location of the folder with the source documents,
which contain the annotations to be validated, the absolute location of the gold folder, where the
accepted annotations will be stored, and the absolute location of the type system that contains all
necessary types. The toolbar of the view provides seven buttons: the first one updates the set of
documents and their annotations in the main part of the view. This is necessary, e.g., if the selected
types change or if the annotations in the documents change. The second button opens a dialog
for choosing the types that need to be checked. Only annotations of those types will be displayed
and can be accepted or rejected. If features need to be checked together with the annotations of
a type, these features have to be selected in this dialog too. Features are shown as sub-nodes of
the annotation nodes. By default, only annotations that have been checked are transferred from
the original document to the according gold document. To also transfer annotations of some
type unchecked, these types have to be chosen in another dialog, which is opened with the third
button. The fourth and fifth button accept/reject the currently selected annotation. Only accepted
annotations will be stored in the gold folder. An annotation can also be accepted with the key
binding “ctrl+4” and rejected with the key binding “ctrl+5”. If an annotation is processed, then
the next annotation is automatically selected and a new CAS Editor is opened if necessary. The
sixth button adds the currently accepted annotations, as well as the annotations of a type selected
in the unchecked dialog, to the corresponding file in the gold folder and additionally extends a
file “data.xml”, which remembers which types have already been checked in each document.
Annotations of these types will not show up again in the main part of the view. With the last button,
the user can select the annotation mode of the CAS editor. The choice is restricted to the currently
selected types. If an annotation is missing in the source documents, then the user can manually
add this annotation in the CAS Editor. The new annotation will be added as accepted to the list of
annotations in the main part of the view. By right-clicking on an annotation node in the view's tree
viewer, a dialog opens to change the type of an annotation. Right-clicking on a feature node opens
another dialog to change the feature's value.
Frank
Peter
Jochen
Martin
To compile a simple tree word list from a text file, right-click on the text file in UIMA Ruta script
explorer. The resulting menu is shown in Figure 3.24, “Create a simple tree word list ” [110].
You can also generate several tree word lists at once. To do so, just select multiple files and then
right-click and do the same like for a single list. You will get one tree word list for every selected
file.
To generate a multi tree word list, select all files, which should be generated into the multi tree
word list. Again right-click and select “Convert to Multi TWL” under item UIMA Ruta. A multi
tree word list named “generated.mtwl” will be created.
the menu entry UIMA Ruta. There are three options to apply a UIMA Ruta script to the files of the
selected folder, cf. Figure 3.25, “Remove Ruta basics ” [111].
1. Quick Ruta applies the UIMA Ruta script that is currently opened and focused in the
UIMA Ruta editor to all suitable files in the selected folder. Files of the type “xmi” will be
adapted and a new xmi-file will be created for other files like txt-files.
2. Quick Ruta (remove basics) is very similar to the previous menu entry, but removes the
annotations of the type “RutaBasic” after processing a CAS.
3. Quick Ruta (no xmi) applies the UIMA Ruta script, but does not change nor create an xmi-
file. This menu entry can, for example, be used in combination with an imported XMIWriter
Analysis Engine, which stores the result of the script in a different folder depending on the
execution of the rules.
Note: The UIMA Ruta Analysis Engine utilizes type priorities. If the CAS object is not
created using the UIMA Ruta Analysis Engine descriptor, but by other means, then please
provide the necessary type priorities for a valid execution of the UIMA Ruta rules.
If the UIMA Ruta script was written, for example, with a common text editor and no configured
descriptors are yet available, then the following java code can be used, which, however, is only
applicable for executing single script files that do not import additional components or scripts.
Otherwise, the other parameters, e.g., “additionalScripts”, need to be configured correctly.
// Create the analysis engine description from the basic UIMA Ruta engine
// descriptor (BasicEngine.xml) shipped with ruta-core.
URL url = RutaEngine.class.getResource("BasicEngine.xml");
XMLInputSource in = new XMLInputSource(url);
AnalysisEngineDescription aed = (AnalysisEngineDescription)
    UIMAFramework.getXMLParser().parseResourceSpecifier(in);
ResourceManager resMgr = UIMAFramework.newDefaultResourceManager();
TypeSystemDescription basicTypeSystem =
    aed.getAnalysisEngineMetaData().getTypeSystem();

Collection<TypeSystemDescription> tsds = new ArrayList<TypeSystemDescription>();
tsds.add(basicTypeSystem);
// add some other type system descriptors
// that are needed by your script file
TypeSystemDescription mergedTypeSystem = CasCreationUtils.mergeTypeSystems(tsds);
aed.getAnalysisEngineMetaData().setTypeSystem(mergedTypeSystem);
aed.resolveImports(resMgr);
AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(aed, resMgr, null);

// Point the engine to the folder containing the script and set the main
// script name (the file name without the ".ruta" extension).
File scriptFile = new File("path/to/file/MyScript.ruta");
ae.setConfigParameterValue(RutaEngine.PARAM_SCRIPT_PATHS,
    new String[] { scriptFile.getParentFile().getAbsolutePath() });
String name = scriptFile.getName().substring(0,
    scriptFile.getName().length() - 5);
ae.setConfigParameterValue(RutaEngine.PARAM_MAIN_SCRIPT, name);
ae.reconfigure();

// Create a CAS, set the document text and process it with the Ruta rules.
CAS cas = ae.newCAS();
cas.setDocumentText("This is my document.");
ae.process(cas);
There is also a convenience implementation for applying simple scripts that do not introduce
new types. The following Java code applies a simple rule “T1 SW{-> MARK(T2)};” to the given
CAS. Note that the types need to be already defined in the type system of the CAS.
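A minimal sketch of such a call, assuming the CAS has already been created with a suitable type
system (e.g., as in the previous example); the convenience class is org.apache.uima.ruta.engine.Ruta:

Ruta.apply(cas, "T1 SW{-> MARK(T2)};");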
...it assigns the posTag JJ (adjective) to the token "brown", the posTag NN (common noun) to the
token "fox", and the tag VBZ (verb, 3rd person singular present) to the token "receives" in the first
sentence.
We have noticed that the tagger sometimes fails to disambiguate NNS (common noun plural) and
VBZ tags, as in the second sentence. The word "up" also seems to confuse the tagger, which always
assigns it an RB (adverb) tag, even when it is a particle (RP) following a verb, as in the third and
fourth sentences:
Let's imagine that after applying every possible approach available in the POS tagging literature,
our tagger still generates these and some other errors. We decide to write a few Ruta rules to post-
process the output of the tagger. To make Ruta available in our project, we add the ruta-core
dependency to the pom.xml:
<dependency>
<groupId>org.apache.uima</groupId>
<artifactId>ruta-core</artifactId>
<version>[2.0.2,)</version>
</dependency>
We also take care that the Ruta basic typesystem is loaded when our annotator is initialized. The
Ruta typesystem descriptors are available from ruta-core/src/main/resources/org/apache/uima/ruta/
engine/.
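One possible way to do this, sketched here with uimaFIT's TypeSystemDescriptionFactory (this
assumes uimafit-core is on the classpath; the original example may have loaded the descriptor
differently), is to load the basic type system by its classpath name and merge it with the
annotator's own type system:

// org.apache.uima.fit.factory.TypeSystemDescriptionFactory, org.apache.uima.util.CasCreationUtils
// load the Ruta basic type system shipped in the ruta-core jar by its classpath name
TypeSystemDescription rutaBasicTs = TypeSystemDescriptionFactory
    .createTypeSystemDescription("org.apache.uima.ruta.engine.BasicTypeSystem");
// merge it with any other type system descriptions the annotator needs
TypeSystemDescription merged = CasCreationUtils
    .mergeTypeSystems(Arrays.asList(rutaBasicTs /*, other descriptions */));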
In our rules, we change a Token's NNS tag to VBZ, if it is surrounded by a Token tagged as NN and a
Token tagged as JJ. We also change an RB tag for an "up" token to RP, if "up" is preceded by any
verbal tag (VB, VBZ, etc.), matched with the help of the REGEXP condition.
We test our rules in the Ruta Workbench and see that they indeed fix most of our problems. We
save those and some more rules in a text file src/main/resources/ruta.txt.
We declare the file with our rules as an external resource and we load it during initialization. Here's
a way to do it using uimaFIT:
/**
* Ruta rules for post-processing the tagger's output
*/
public static final String RUTA_RULES_PARA = "RutaRules";
@ExternalResource(key = RUTA_RULES_PARA, mandatory = false)
...
File rutaRulesF = new File((String)
aContext.getConfigParameterValue(RUTA_RULES_PARA));
After our CAS has been populated with posTag annotations from the main algorithm, we post-
process the CAS using Ruta.apply():
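A minimal sketch of that step, assuming rutaRulesF is the file resolved above and cas is the CAS of
the current document (FileUtils refers to org.apache.uima.util.FileUtils):

// read the post-processing rules and let Ruta apply them to the CAS
String rutaScript = FileUtils.file2String(rutaRulesF);
Ruta.apply(cas, rutaScript);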
We are now happy to see that the final output of our annotator looks much better.
<plugin>
<groupId>org.apache.uima</groupId>
<artifactId>ruta-maven-plugin</artifactId>
<version>2.3.0</version>
<configuration>
<!--
The following parameter is optional and should only be specified
if the structure (e.g., classpath/resources) of the project requires it.
A FileSet specifying the UIMA Ruta script files that should be built.
If this parameter is not specified, then all UIMA Ruta script files
in the output directory (e.g., target/classes) of the project will
be built.
-->
<!-- The directory where the generated type system descriptors will
be stored. -->
<!-- default value: ${project.build.directory}/generated-sources/ruta/descriptor -->
<typeSystemOutputDirectory>${project.build.directory}/generated-sources/ruta/descriptor</typeSystemOutputDirectory>
<!-- The directory where the generated analysis engine descriptors will
be stored. -->
<!-- default value: ${project.build.directory}/generated-sources/ruta/descriptor -->
<analysisEngineOutputDirectory>${project.build.directory}/generated-sources/ruta/descriptor</analysisEngineOutputDirectory>
<scriptPaths>
<scriptPath>${basedir}/src/main/ruta/</scriptPath>
</scriptPaths>
<!-- Suffix used for the generated analysis engine descriptors. -->
<!-- default value: Engine -->
<analysisEngineSuffix>Engine</analysisEngineSuffix>
<!-- Suffix used for the generated type system descriptors. -->
<!-- default value: TypeSystem -->
<typeSystemSuffix>TypeSystem</typeSystemSuffix>
<!-- Buildpath of the UIMA Ruta Workbench (IDE) for this project -->
<!-- default value: none -->
<buildPaths>
<buildPath>script:src/main/ruta/</buildPath>
<buildPath>descriptor:target/generated-sources/ruta/descriptor/
</buildPath>
<buildPath>resources:src/main/resources/</buildPath>
</buildPaths>
</configuration>
<executions>
<execution>
<id>default</id>
<phase>process-classes</phase>
<goals>
<goal>generate</goal>
</goals>
</execution>
</executions>
</plugin>
The configuration parameters for this goal either define the build behavior, e.g., where the
generated descriptors should be placed or which suffix the files should get, or the configuration of
the generated analysis engine descriptor, e.g., the values of the configuration parameter scriptPaths.
However, there are also two other parameters: addRutaNature and buildPaths. Both can be utilized to
configure the current Eclipse project (due to the missing m2e connector). This is required if the
functionality of the UIMA Ruta Workbench, e.g., syntax checking or auto-completion, should be
available in the Maven project. If the parameter addRutaNature is set to true, then the UIMA Ruta
Workbench will recognize the project as a script project. Only then can the build path of the UIMA
Ruta project be configured using the buildPaths parameter, which specifies the three important
source folders of the UIMA Ruta project. In normal UIMA Ruta Workbench projects, these are
script, descriptor and resources.
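For illustration, enabling the nature amounts to one additional element in the configuration of the
generate goal shown above; the placement below is only a sketch of where the element would go:

<configuration>
<!-- let the UIMA Ruta Workbench recognize this Maven project as a script project -->
<addRutaNature>true</addRutaNature>
...
</configuration>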
<plugin>
<groupId>org.apache.uima</groupId>
<artifactId>ruta-maven-plugin</artifactId>
<version>2.3.0</version>
<configuration></configuration>
<executions>
<execution>
<id>default</id>
<phase>process-classes</phase>
<goals>
<goal>twl</goal>
</goals>
<configuration>
<!-- This is an exemplary configuration, which explicitly specifies
the default configuration values if not mentioned otherwise. -->
<!-- The source files for the tree word list. -->
<!-- default value: none -->
<inputFiles>
<directory>${basedir}/src/main/resources</directory>
<includes>
<include>*.txt</include>
</includes>
</inputFiles>
<!-- The directory where the generated tree word lists will be
written to. -->
<!-- default value: ${project.build.directory}/generated-sources/ruta/resources/ -->
<outputDirectory>${project.build.directory}/generated-sources/ruta/resources/</outputDirectory>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.uima</groupId>
<artifactId>ruta-maven-plugin</artifactId>
<version>2.3.0</version>
<configuration></configuration>
<executions>
<execution>
<id>default</id>
<phase>process-classes</phase>
<goals>
<goal>mtwl</goal>
</goals>
<configuration>
<!-- This is an exemplary configuration, which explicitly specifies
the default configuration values if not mentioned otherwise. -->
<!-- The source files for the multi tree word list. -->
<!-- default value: none -->
<inputFiles>
<directory>${basedir}/src/main/resources</directory>
<includes>
<include>*.txt</include>
</includes>
</inputFiles>
<!-- The file where the generated multi tree word list will be
written to. -->
<!-- default value: ${project.build.directory}/generated-sources/ruta/resources/generated.mtwl -->
<outputFile>${project.build.directory}/generated-sources/ruta/resources/generated.mtwl</outputFile>
</configuration>
</execution>
</executions>
</plugin>
A UIMA Ruta project can be created with the following command using the archetype (in one line):
mvn archetype:generate
-DarchetypeGroupId=org.apache.uima
-DarchetypeArtifactId=ruta-maven-archetype
-DarchetypeVersion=<ruta-version>
-DgroupId=<package>
-DartifactId=<project-name>
The placeholders need to be replaced with the corresponding values. This could look like:
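(the groupId and artifactId below are only illustrative values, and the archetype version should
match the UIMA Ruta version in use)

mvn archetype:generate
 -DarchetypeGroupId=org.apache.uima
 -DarchetypeArtifactId=ruta-maven-archetype
 -DarchetypeVersion=2.7.0
 -DgroupId=my.organization
 -DartifactId=MyRutaProject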
Using the archetype in Eclipse to create a project may result in some missing replacements of
variables and thus in broken projects. Using the archetype on the command line is recommended.
In the creation process, several properties need to be defined. Their default values can be accepted
by simply pressing the return key. After the project was created successfully, switch to the new
folder and enter 'mvn install'. Now, the UIMA Ruta project is built: the descriptors for the UIMA
Ruta script are created, the word list is compiled to an MTWL file, and the unit test verifies the
overall functionality.
In this example, we are using the “KEP” algorithm for learning annotation rules for identifying
BibTeX entries in the reference section of scientific publications:
1. Select the folder “single” and drag and drop it to the “Training Data” text field. This folder
contains one file with correct annotations and serves as gold standard data in our example.
2. Select the file “Feature.ruta” and drag and drop it to the “Preprocess Script” text field. This
UIMA Ruta script knows all necessary types, especially the types of the annotations we try
to learn rules for, and additionally it contains rules that create useful annotations, which
can be used by the algorithm in order to learn better rules.
3. Select the file “InfoTypes.txt” and drag and drop it to the “Information Types” list. This
specifies the goal of the learning process, i.e., which types of annotations should be created
by the induced rules.
4. Check the checkbox of the “KEP” algorithm and press the start button in the toolbar of the
view.
5. The algorithm now tries to induce rules for the targeted types. The current result is
displayed in the view “KEP Results” in the right part of the perspective.
6. After the algorithm has finished the learning process, create a new UIMA Ruta file in the
“uima.ruta.example” package and copy the content of the result view to the new file. Now,
the induced rules can be applied as a normal UIMA Ruta script file.
PACKAGE uima.ruta.example;
ENGINE utils.HtmlAnnotator;
ENGINE utils.HtmlConverter;
ENGINE HtmlViewWriter;
TYPESYSTEM utils.HtmlTypeSystem;
TYPESYSTEM utils.SourceDocumentInformation;
Document{-> RETAINTYPE(SPACE,BREAK)};
Document{-> EXEC(HtmlAnnotator)};
ENGINE utils.XMIWriter;
TYPESYSTEM utils.SourceDocumentInformation;
DECLARE Pattern;
Document{CONTAINS(Pattern)->CONFIGURE(XMIWriter,
"Output" = "../with/"), EXEC(XMIWriter)};
Document{-CONTAINS(Pattern)->CONFIGURE(XMIWriter,
"Output" = "../without/"), EXEC(XMIWriter)};
ENGINE utils.HtmlAnnotator;
TYPESYSTEM utils.HtmlTypeSystem;
ENGINE utils.HtmlConverter;
ENGINE TEIViewWriter;
TYPESYSTEM utils.SourceDocumentInformation;
DECLARE PersName, LastName, FirstName, AddName;
Document{->EXEC(HtmlAnnotator, {TAG})};
Document{-> RETAINTYPE(MARKUP,SPACE)};
TAG.name=="PERSNAME"{-> PersName};
TAG.name=="SURNAME"{-> LastName};
TAG.name=="FORENAME"{-> FirstName};
TAG.name=="ADDNAME"{-> AddName};
Document{-> RETAINTYPE};