This vignette is considered deprecated! Its content has been moved to the EMU-SDMS manual (and expanded and updated there). Specifically, see the chapters on the query system and “EQL: further examples”.
This document introduces and defines version 2 of the Emu Query Language (EQL) and shows what it is capable of through numerous examples. The EQL is a query language aimed at speech and language researchers; it is meant to be easy to understand and learn, yet expressive and powerful. It enables researchers to easily query the annotation structures of databases stored in the emuDB format. The emuR package provides a query() function to query emuDBs that are loaded into the current R session (for more information see the emuR_intro and emuDB vignettes). The main argument of the query() function is the query argument (query(..., query = "XXX", ...), where XXX is the query string). In this document we focus solely on these query strings and how to construct them.
To recap what was already mentioned in the emuR_intro and emuDB vignettes: the annotation structure of an emuDB can be thought of as a graph. Each annotation consists of annotational units (called ITEMs) that are grouped together in an ordered array. Each ITEM can be linked to ITEMs of other levels if a corresponding linkDefinition is present in the emuDB. An exemplary excerpt of such an annotation can be seen below.
Example of simple hierarchy
Although it is not the focus of this vignette, two parameters of the query() function are worth noting: sessionPattern and bundlePattern. Both expect a regular expression string and restrict the sessions and bundles that the query is run against.
We will now jump right in with a number of example query strings that were adapted from (Harrington and Cassidy 2002; Cassidy and Harrington 2001). To have some data to play with, let us create a demo database and load it into our current R session:
# load the package
library(emuR)
# create demo data in folder provided by the tempdir() function
create_emuRdemoData(dir = tempdir())
# get the path to emuDB called 'ae' that is part of the demo data
path2folder = file.path(tempdir(), "emuR_demoData", "ae_emuDB")
# load emuDB into current R session
ae = load_emuDB(path2folder)
The syntax of a simple equality / inequality / matching / non-matching query is "[L OPERATOR A]", where “L” specifies a level (or alternatively the name of a parallel attributeDefinition), “OPERATOR” is one of the operators “==” (equality), “!=” (inequality), “=~” (matching) or “!~” (non-matching), and “A” is an expression specifying the labels of the ITEMs of “L”.
Example Q & A’s:
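As a minimal sketch of the four operators, the queries below run against the ae demo database that ships with emuR. The level name Phonetic and the attribute name Text are taken from that demo data; the variable names are only illustrative, and no result values are shown because they depend on the demo data.

```r
library(emuR)

# set up the demo emuDB "ae" (the demo data is created only once per R session)
demo_dir <- file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae <- load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

# equality: all ITEMs on the Phonetic level labeled "m"
sl_eq <- query(ae, "[Phonetic == m]")
# inequality: all Phonetic ITEMs that are not labeled "m"
sl_neq <- query(ae, "[Phonetic != m]")
# alternatives: Phonetic ITEMs labeled "m" or "n"
sl_alt <- query(ae, "[Phonetic == m | n]")
# matching / non-matching with a regular expression on the Text attribute
sl_match <- query(ae, "[Text =~ a.*]")
sl_nomatch <- query(ae, "[Text !~ a.*]")
```

Each call returns a segment list whose labels column holds the labels of the matched ITEMs.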
INFO: The above examples use three operators that are new to the EQL as of version 2. The first is the “==” equality operator, which has the same meaning as the “=” operator of the EQL1 (which is still available) while providing a cleaner, more precise syntax. The other two are “=~” and “!~”, the new matching and non-matching regular expression operators.
If you wish to use parentheses, blanks or characters that represent operators used by the EQL (see EBNF) as part of a label matching string (the string on the right hand side of one of the operators mentioned above), you must place this string in additional single quotation marks to escape these characters. For example, searching for the ITEMs containing the label O_" on the Phonetic level could not be written as "[Phonetic == O_"]" but would have to be written as "[Phonetic == 'O_"']". Note that reversing the single vs. double quotation mark order is currently not supported, i.e. ‘[Phonetic == “O_"”]’ won’t lead to the desired behavior. Only use double quotation marks for the outer wrapping of the query string to avoid this issue.
The syntax of a query string using the “->” sequence operator is "[L == A -> L == B]", where ITEM “A” on level “L” precedes ITEM “B” on level “L”. For a sequential query to work, both arguments must be on the same level (alternatively, parallel attributeDefinitions of the same level may also be chosen).
Example Q & A’s:
# NOTE: all row entries in the resulting segment list have the start time of "@", the end time of "n" and their labels will be "@->n"
query(ae, "[Phonetic == @ -> Phonetic == n]")
# NOTE: all row entries in the resulting segment list have the start time of "@", the end time of "@" and their labels will also be "@"
query(ae, "[#Phonetic == @ -> Phonetic == n]")
# NOTE: all row entries in the resulting segment list have the start time of "n", the end time of "n" and their labels will also be "n"
query(ae, "[Phonetic == @ -> #Phonetic == n]")
The general strategy for constructing a query string that retrieves longer sequences of labels is to nest multiple sequences while paying close attention to the correct placement of the brackets. An abstracted version of such a query string for the sequence of arguments A1, A2, A3, A4, A5 would be "[[[[A1 -> A2] -> A3] -> A4] -> A5]", where each argument (e.g. “A1”) represents an equality / inequality / matching / non-matching expression on the same level (alternatively, parallel attributeDefinitions of the same level may also be chosen).
Example Q & A’s:
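The nesting strategy can be sketched on the ae demo database as follows. The concrete labels (@, n, s) are only illustrative choices from the Phonetic level of the demo data, and the second query uses the regular expression .* as a wildcard for an arbitrary label in the middle position.

```r
library(emuR)

# set up the demo emuDB "ae" (the demo data is created only once per R session)
demo_dir <- file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae <- load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

# three-element sequence: "@" followed by "n" followed by "s"
# (the inner bracket is evaluated first, then extended to the right)
sl_seq <- query(ae, "[[Phonetic == @ -> Phonetic == n] -> Phonetic == s]")

# same pattern with an arbitrary label in the middle position
sl_wild <- query(ae, "[[Phonetic == @ -> Phonetic =~ .*] -> Phonetic == s]")
```

Each row of the result spans the whole sequence, from the start of the first ITEM to the end of the last one.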
INFO: As the EQL1 didn’t have a regular expression operator, users often resorted to queries such as “[Phonetic != XXX]” (where XXX is a label that was not part of the label set of the “Phonetic” level) to match every label on the “Phonetic” level. Although this is still possible in the EQL2, we strongly recommend using regular expressions, as they provide a much clearer and more precise syntax and are less error prone.
The syntax of a query string using the conjunction operator can schematically be written as "[L == A & L_a2 == B & L_a3 == C & L_a4 == D & ... & L_an == N]", where ITEMs on level “L” that have the label “A” (technically belonging to the first attribute of that level, i.e. L_a1, which by default has the same name as its level) also have the attribute labels “B”, “C”, “D”, …, “N”. As with the sequence operator, all expressions must be on the same level (i.e. parallel attributeDefinitions of the same level, indicated by a2 - an, may be chosen).
The conjunction operator is used to combine query conditions on the same level. This makes sense in two cases:
- when combining different attributes of the same level: "[phonetic == l & sonorant == T]" when ‘sonorant’ is an additional attribute of level ‘phonetic’;
- when combining a label condition with a function: "[phonetic == l & Start(word, phonetic) == 1]".
Example Q & A’s:
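A small sketch of the conjunction operator on the ae demo database, assuming (as in the demo data) that Text and Accent are parallel attributes of the same level. The second query differs only in the position of the result modifier “#”, which selects the attribute whose labels are returned.

```r
library(emuR)

# set up the demo emuDB "ae" (the demo data is created only once per R session)
demo_dir <- file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae <- load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

# all ITEMs with an arbitrary Text label that also carry the Accent label "S";
# the labels of the Text attribute are returned
sl_text <- query(ae, "[Text =~ .* & Accent == S]")

# same ITEMs, but the labels of the hashed Accent attribute are returned
sl_accent <- query(ae, "[Text =~ .* & #Accent == S]")
```

Both segment lists refer to the same ITEMs; only the labels column differs.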
A schematic representation of a simple domination query string that retrieves all ITEMs “A” of level “L1” that are dominated by, i.e. are directly or indirectly linked to, ITEMs “B” in level “L2” would be "[L1 == A ^ L2 == B]". The dominates operator is not directional, meaning that either ITEMs in “L1” dominate ITEMs in “L2” or ITEMs in “L2” dominate ITEMs in “L1”. Note that linkDefinitions that specify the validity of the domination have to be present in the emuDB for this to work (see the emuDB vignette for details).
Example Q & A’s:
query(ae, "[Syllable =~ .* ^ Phoneme != p | t | k]")
# or
query(ae, "[Phoneme != p | t | k ^ #Syllable =~ .*]")
INFO: Even though the domination operator is not directional, what you place to the left and to the right of the operator does have an impact on the result. If no result modifier (the hash tag “#”) is used, the query engine will automatically assume that the expression to the left of the operator specifies what is to be returned. This means that the schematic query string "[L1 == A ^ L2 == B]" is semantically equal to the query string "[#L1 == A ^ L2 == B]". As it is more explicit to mark the desired result, we recommend you always use the result modifier where possible.
The general strategy for constructing a query string that specifies multiple domination relations of ITEMs is to nest multiple domination expressions while paying close attention to the correct placement of the brackets. A dominance relationship sequence of the arguments “A1”, “A2”, “A3”, “A4”, “A5” can therefore be noted as "[[[[A1 ^ A2] ^ A3] ^ A4] ^ A5]", where “A1” is dominated by “A2” and “A3” and so on.
Example Q & A’s:
"[[Pitch_Accent == H* ^ Phoneme == p] ^ #Text == price | space]"
The EQL has three function terms to specify where in a dominance relationship a child level ITEM is allowed to occur. The three function terms are “Start()”, “End()” and “Medial()”.
A schematic representation of a query string representing a simple usage of the Start(), End() and Medial() functions would be "POSFCT(L1, L2) == 1" or "POSFCT(L1, L2) == TRUE". In this representation “POSFCT” is a placeholder for one of the three functions, and the level “L1” has to dominate the level “L2”. The “== 1” / “== TRUE” part of the query string indicates that the ITEMs of level “L2” that match the position condition are returned. If it is set to “== 0” / “== FALSE” instead, all the ITEMs of “L2” that do not match the condition are returned. For a visualization of what is returned by the various options of the three functions see the illustration below.
Illustration of what is returned by the Start(), Medial() and End() functions
INFO: As using 1 and 0 for TRUE and FALSE is not that intuitive to most R users, the EQL version 2 optionally allows the values TRUE / T and FALSE / F to be used instead of 1 and 0. This syntax should be more familiar to most R users.
Example Q & A’s:
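The position functions can be tried out on the ae demo database as sketched below. The level names Word and Phoneme come from the demo data, where Word dominates Phoneme; the variable names are only illustrative.

```r
library(emuR)

# set up the demo emuDB "ae" (the demo data is created only once per R session)
demo_dir <- file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae <- load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

# word-initial phonemes
sl_start <- query(ae, "[Start(Word, Phoneme) == 1]")
# word-final phonemes (TRUE is an alternative spelling of 1)
sl_end <- query(ae, "[End(Word, Phoneme) == TRUE]")
# word-medial phonemes
sl_medial <- query(ae, "[Medial(Word, Phoneme) == 1]")
# all phonemes that are NOT word-initial
sl_not_start <- query(ae, "[Start(Word, Phoneme) == 0]")
```

In all four cases the returned ITEMs belong to the second level passed to the function, here Phoneme.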
The syntax for combining a position function with the boolean operator “&” is "[L2 == E & Start(L1, L2) == 1]", where ITEM “E” on level “L2” occurs at the beginning of an ITEM of level “L1”. Once again “L1” has to dominate “L2” (optionally, parallel attributeDefinitions of the same level may also be chosen).
Example Q & A’s:
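As a sketch on the ae demo database, the following query combines a label condition with a position function on the same level. The label m and the level names Word and Phoneme are illustrative choices from the demo data.

```r
library(emuR)

# set up the demo emuDB "ae" (the demo data is created only once per R session)
demo_dir <- file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae <- load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

# all "m" phonemes that occur at the beginning of a word:
# both conditions of the conjunction refer to ITEMs of the Phoneme level
sl_initial_m <- query(ae, "[Phoneme == m & Start(Word, Phoneme) == 1]")
```

Note that the label condition and the ITEMs returned by Start() are on the same level, as required by the conjunction operator.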
The syntax for combining a position function with the boolean hierarchical operator is "[L == E ^ Start(L1, L2) == 1]", where level “L” and level “L2” refer to different levels and either “L” dominates “L2” or “L2” dominates “L”.
Example Q & A’s:
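A sketch of this combination on the ae demo database: the label condition is on the Phoneme level, while the position function returns ITEMs of the Syllable level, so the two are related by dominance rather than by conjunction. The label p is only an illustrative choice.

```r
library(emuR)

# set up the demo emuDB "ae" (the demo data is created only once per R session)
demo_dir <- file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae <- load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

# all "p" phonemes that are dominated by the first syllable of a word
sl_p <- query(ae, "[Phoneme == p ^ Start(Word, Syllable) == 1]")
```

Because no result modifier is given, the expression to the left of “^” determines what is returned, i.e. ITEMs of the Phoneme level.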
A schematic representation of a query string utilizing the count mechanism would be "[Num(L1, L2) == N]", where “L1” contains “N” ITEMs of “L2”. For this type of query to work, “L1” has to dominate “L2”. As the query matches a number (“N”), it is also possible to use the operators > (more than), < (less than) and != (not equal). The resulting segment list contains ITEMs of “L1”.
Example Q & A’s:
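The count mechanism can be sketched on the ae demo database as follows. The numbers 4 and 6 are arbitrary illustrative thresholds, and the level names are taken from the demo data.

```r
library(emuR)

# set up the demo emuDB "ae" (the demo data is created only once per R session)
demo_dir <- file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae <- load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

# all words that consist of exactly four syllables
sl_words <- query(ae, "[Num(Word, Syllable) == 4]")

# all syllables that contain more than six phonemes
sl_sylls <- query(ae, "[Num(Syllable, Phoneme) > 6]")
```

The returned ITEMs belong to the first level passed to Num(), i.e. Word and Syllable respectively.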
A schematic representation of a query string combining the count and the boolean operators would be "[L == E & Num(L1, L2) == N]", where ITEMs “E” on level “L” are dominated by “L1” and “L1” contains “N” ITEMs of “L2”. Further, “L1” dominates “L2” under the condition that “L” and “L1” (not “L2”) refer to the same level (parallel attributeDefinitions of the same level may also be chosen).
Example Q & A’s:
query(ae, "[Text =~ .* & Num(Text, Phoneme) > 5 ]")
# or
query(ae, "[Text =~ .* & Num(Word, Phoneme) > 5]")
A schematic representation of a query string combining the count function and the dominance operator would be "[L == E ^ Num(L1, L2) == N]", where ITEMs “E” on level “L” are dominated by “L1” and “L1” contains “N” ITEMs of “L2”. Further, “L1” dominates “L2” under the condition that “L” and “L1” do not refer to the same level.
Example Q & A’s:
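A sketch of this combination on the ae demo database: the label condition is on the Syllable level, while Num() returns ITEMs of the Word level, so the two are related by dominance. The label S and the count 4 are illustrative choices.

```r
library(emuR)

# set up the demo emuDB "ae" (the demo data is created only once per R session)
demo_dir <- file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae <- load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

# all strong ("S") syllables that belong to words with exactly four syllables
sl_strong <- query(ae, "[Syllable == S ^ Num(Word, Syllable) == 4]")
```

As no result modifier is used, the Syllable ITEMs on the left of “^” are returned.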
A schematic representation of a query string combining the domination and the sequence operators would be "[[A1 ^ A2] -> A3]", where “A1” and “A3” refer to the same level (parallel attributeDefinitions of the same level may also be chosen).
Example Q & A’s:
"[[Phoneme == s ^ Syllable == S] -> Syllable == S]"
this will cause an error, as Phoneme == s and Syllable == S are not on the same level. Therefore the correct answer places the Syllable expression to the left of the dominance operator, so that the nested query returns ITEMs of the Syllable level:
"[[Syllable == S ^ Phoneme == s] -> Syllable == S]"
Example Q & A’s:
- 1.) The word consists of three syllables: "[Text =~ .* & Num(Text, Syllable) == 3]"
- 2.) A schwa occurs in the first syllable: "[Phoneme == @ ^ Start(Word, Syllable) == 1]"
- 3.) The text is “the”: "[Text == the]"
- Let’s now combine all three by saying "[1. ^ 2.]" and requiring that these are followed by 3. ("[[1. ^ 2.] -> 3.]"):
"[[[Text =~ .* & Num(Text, Syllable) == 3] ^ [Phoneme == @ ^ Start(Word, Syllable) == 1]] -> Text == the]"
In this section we give a quick overview of the major changes in the query mechanics of emuR compared to the legacy R package emu (version 4.2). This section is mainly meant for people transitioning to emuR from the legacy system.
In emuR an emuDB has to be loaded into your current R session before the query() function can be used. This is achieved using the load_emuDB() function (see the emuR_intro vignette for details). This was not necessary with the legacy emu.query() function.
Example calls to the query() function (prerequisite: a loaded emuDB called “andosl”):
# query all "p" ITEMs on the "Phoneme" level that are dominated by "S" (strong) syllables
query(emuDBhandle = andosl,
      query = "[Phoneme == p ^ Syllable == S]")
# same query as before but this time using
# the sessionPattern and bundlePattern arguments
# to only select specific sessions / bundles
# using regular expressions (RegEx)
query(emuDBhandle = andosl,
      query = "[Phoneme == p ^ Syllable == S]",
      sessionPattern = "000.", # RegEx that matches session 0000; 000a; 0001; ...
      bundlePattern = "msajc0[1-2].") # RegEx that matches bundles msajc01a; msajc02a; msajc021; ...
The new default result type of a query is an object of the S3 class “emuRsegs”. This class inherits from the legacy EMU class “emusegs” and the well-known “data.frame” class. This means it is fully compatible with the legacy “emusegs” class while containing some additional data, for example the IDs of the start and end ITEMs of each segment list row. Each row of this “data.frame” is a sequence of one or more annotational units (i.e. ITEMs) on a single level. For more information about this object see help(emuRsegs).
The query() function of emuR returns an empty segment list (row count is zero) if the query does not match any ITEM. If the legacy EMU function emu.query() didn’t find any matches, it would throw an error with the message: "Can't find the query results in emu.query: there may have been a problem with the query command.".
The emuDB format used by the emuR package introduces the concept of bundles that are grouped together in sessions (see the emuDB vignette for further details). As legacy EMU databases did not have the concept of a session, all the utterances of a legacy database are placed in a single default session called “0000”. Therefore the “utts” column of a segment list is prefixed by the session name, for example “0000:msajc003” instead of just “msajc003” as in the legacy system.
Compared to the legacy EMU system, which allowed multiple occurrences of the hash tag “#” to be present in a query string, the query() function only allows a single result modifier. This ensures that only consistent result sets are returned. If, however, you want multiple result sets in one segment list, we recommend you simply concatenate the result sets of separate queries using the rbind() function.
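The rbind() approach can be sketched as follows, here on the ae demo database rather than on andosl. The two queries are identical except for the position of the single result modifier; the attribute names Text and Accent come from the demo data and the variable names are only illustrative.

```r
library(emuR)

# set up the demo emuDB "ae" (the demo data is created only once per R session)
demo_dir <- file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae <- load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

# one query per desired result set, each with a single "#"
sl_text <- query(ae, "[#Text =~ .* & Accent == S]")
sl_accent <- query(ae, "[Text =~ .* & #Accent == S]")

# concatenate both result sets into a single segment list
sl_both <- rbind(sl_text, sl_accent)
```

The combined segment list simply contains the rows of the first query followed by the rows of the second.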
moving data from Tcl to R
Read 1 records
segment list from database: andosl
query was: [Text=spring & #Accent=S]
labels start end utts
1 spring 2288.959 2704.466 msajc094
moving data from Tcl to R
Read 1 records
segment list from database: andosl
query was: [#Text=spring & #Accent=S]
labels start end utts
1 spring 2288.959 2704.466 msajc094
The hash tag “#” had no effect.
segment list from database: andosl
query was: [Text=spring & #Accent=S]
labels start end utts
1 S 2288.975 2704.475 0000:msajc094
Returns the same ITEM, but with the label of the hashed attributeDefinition name. The second legacy example is not a valid emuR query (two hash tags).
Error in query.database.eql.KONJA(dbConfig, qTrim) :
Only one hashtag allowed in linear query term: #Text=spring & #Accent=S
The query() function throws an error, as it would be necessary to return each ITEM twice to get both the “Text” and “Accent” labels.
Example:
moving data from Tcl to R
Read 4 records
segment list from database: ae
query was: [Text!=beautiful|futile ^ Phoneme=u:]
labels start end utts
1 new 475.802 666.743 msajc057
2 futile 571.999 1091.000 msajc010
3 to 1091.000 1222.389 msajc010
4 beautiful 2033.739 2604.489 msajc003
We assume that the OR operator “|” was simply ignored when used in conjunction with the inequality operator “!=”.
query(emuDBhandle = ae,
      query = "[Text != beautiful | futile ^ Phoneme == u:]",
      resultType = "emusegs")
segment list from database: ae
query was: [Text!=beautiful|futile ^ Phoneme=u:]
labels start end utts
1 to 1091.025 1222.375 0000:msajc010
2 new 475.825 666.725 0000:msajc057
Certain queries in the legacy EMU system required blanks around certain operators to be present or absent, as well as certain parentheses to be present or absent. If this was not the case, the legacy query engine sometimes threw cryptic errors, sometimes crashed and, in the worst cases, took the entire R session with it. The query engine of the emuR package is much more robust against missing or superfluous blanks / parentheses.
For the legacy EMU query it was never explicitly defined, at least to our knowledge, if and how the resulting segment list was ordered. If the result type of the query() function is set to "emuRsegs", the resulting list is ordered by UUID, session, bundle and sample start position. If it is set to "emusegs", the resulting list is ordered by the fields utts and start.
emuR accepts the double equal character string “==” (recommended) as well as the single “=” equal character string as an equal operator.
query(andosl, "Text =~ .*tz.*")
EBNF adapted from (John 2012). As the original EBNF was formulated in German, a few of the abbreviation terms (e.g. “DOMA” is the abbreviation for the German term “Dominanzabfrage”) were translated into English abbreviations (e.g. “DOMQ” is the abbreviation for the English term “dominance query”).
The terminal symbols described below are listed in descending order of their binding priority.
Symbol | Meaning |
---|---|
# | Result modifier (projection) |
, | Parameter list separator |
== | Equality (new in version 2 of the EQL; added for cleaner syntax) |
= | Equality (optional; for backwards compatibility) |
!= | Inequality |
=~ | Regular expression matching |
!~ | Regular expression non-matching |
> | Greater than |
>= | Equal or greater than |
< | Less than |
<= | Equal or less than |
\| | Alternatives separator |
& | Conjunction of equal rank |
^ | Dominance conjunction |
-> | Sequence operator |
Symbol | Meaning |
---|---|
' | Quotes literal string |
( | Function parameter list begin |
) | Function parameter list end |
[ | Sequence or dominance enclosing begin bracket |
] | Sequence or dominance enclosing end bracket |
Symbol | Meaning |
---|---|
Start | Start |
Medial | Medial |
End | Final |
Num | Count |
EBNF term | Abbreviation | Conditions |
---|---|---|
EQL = CONJQ \| SEQQ \| DOMQ; | EMU Query Language | |
DOMQ = "[", ( CONJQ \| DOMQ \| SEQQ ), "^", ( CONJQ \| DOMQ \| SEQQ ), "]"; | dominance query | levels must be hierarchically associated |
SEQQ = "[", ( CONJQ \| SEQQ \| DOMQ ), "->", ( CONJQ \| SEQQ \| DOMQ ), "]"; | sequential query | levels must be linearly associated |
CONJQ = { "[" }, SQ, { "&", SQ }, { "]" }; | conjunction query | levels must be linearly associated |
SQ = LABELQ \| FUNCQ; | simple query | |
LABELQ = [ "#" ], LEVEL, ( "=" \| "==" \| "!=" \| "=~" \| "!~" ), LABELALTERNATIVES; | label query | |
FUNCQ = POSQ \| NUMQ; | function query | |
POSQ = POSFCT, "(", LEVEL, ",", LEVEL, ")", "=", "0" \| "1" \| "TRUE" \| "FALSE"; | position query | levels must be hierarchically associated; second level determines semantic |
NUMQ = "Num", "(", LEVEL, ",", LEVEL, ")", COP, INTPN; | number query | levels must be hierarchically associated; first level determines semantic |
LABELALTERNATIVES = LABEL , { "\|", LABEL }; | label alternatives | |
LABEL = LABELING \| ( "'", LABELING, "'" ); | label | levels must be part of the database structure; LABELING is an arbitrary character string or a label group class configured in the emuDB; result modifier ‘#’ may only occur once |
POSFCT = "Start" \| "Medial" \| "End"; | position function | |
COP = "=" \| "==" \| "!=" \| ">" \| "<" \| "<=" \| ">="; | comparison operator | |
INTPN = "0" \| INTP; | integer positive with null | |
INTP = DIGIT-"0", { DIGIT }; | integer positive | |
DIGIT = "0" \| "1" \| "2" \| "3" \| "4" \| "5" \| "6" \| "7" \| "8" \| "9"; | digit | |
INFO: The LABELING term used in the LABEL EBNF term can represent any character string that is present in the annotation. As this can be any combination of Unicode characters we chose not to explicitly list them as part of the EBNF.
A query may only contain a single result modifier “#” (hash tag).
Cassidy, Steve, and Jonathan Harrington. 2001. “Multi-Level Annotation in the Emu Speech Database Management System.” Speech Communication 33: 61–78.
Harrington, Jonathan, and Steve Cassidy. 2002. “The Emu-Query Language (Anhang).” IPDS Kiel.
John, Tina. 2012. “Emu Speech Database System.” PhD thesis, LMU-Munich.