
German Decomposition

1. Composition and Decomposition#

1.1. Why experiment on German documents?#

German has some language features that are not present in English and that, when taken into account, can make a difference in search results, which is why we decided to work on German texts for this experiment. Since there are not many freely available German datasets, we used the DeepL API to translate the same corpora we had already used in previous experiments. For our first comparison on the German corpora, we looked into splitting compound words using decomposition. To understand why decomposition may improve search results, it is necessary to first know what composition means.

1.2. What is Composition?#

In German, as in some other languages (e.g. Finnish, Dutch), it is possible to create a new word from two or more independent words through composition. Two or more words are combined into a single coherent word, often joined by so-called “Fugenelemente” or linking elements (e.g. -(e)s-, -e-, -(e)n-, -er-). English does not have this kind of composition; in most cases, such word combinations are separated with a space or hyphen.

Examples of composition in German include:

  • "Kapitän" + "Mütze" => "Kapitänsmütze" (captain's cap)
  • "Apfel" + "Baum" => "Apfelbaum" (apple tree)
  • "Blumen" + "Topf" => "Blumentopf" (flower pot)

With this technique, an infinite number of words can be strung together. A classic example would be the "Donaudampfschifffahrtskapitänsmütze" (Danube steamboat captain's cap). The last word is always the carrier of meaning and determines both the grammatical gender and the primary meaning of the newly compounded word.

Composition must be distinguished from derivation. In a derivation, a root word changes its part of speech by a grammatical affix (e.g. "Licht-ung"). In a composition, the word retains its original part of speech.

1.3. What is the benefit of Decomposition?#

Composition in languages with rich morphology, like German, makes it possible to expand the vocabulary to an infinite level. This can cause problems in information retrieval.

For example, if you search for "Atomkraftwerk" (atomic power plant), you are also likely interested in articles that deal with the more general topic of "Atomkraft" (atomic power). If you do not break down this composite in a search, it can quickly happen that relevant results are not found. On the other hand, too much decomposition can also result in too much irrelevant information being returned. In this example, articles that are only about "Kraft" (power) would be too generic in most cases and would not lead to the desired search result.

To improve the IR performance it can be helpful to use linguistically motivated decomposition of compounded words. Unlike in German, compounds in English are typically written as separate words and can therefore be easily found.

2. Approaches and Hypotheses#

We compared two different decomposition approaches in Elasticsearch:

  • Dictionary Decompounder Token Filter:
    This token filter uses a brute force approach to look up all possible subwords in a compound word dictionary. If subwords are found, they are included in the scoring of the documents. For example, if our dictionary includes "Donau" and "Dampfschiff", the search for "Donaudampfschiff" would search for "Donaudampfschiff", "Donau", and "dampfschiff". This should in theory perform better than a search using only the Standard Analyzer.

  • Hyphenation Decompounder Token Filter:
    This approach uses XML-based hyphenation patterns to search for subwords in a compound dictionary. This token filter is faster than the dictionary decompounder and the recommended filter for decompounding by Elastic.

To get a baseline to compare the results against, we also searched with two more basic approaches:

  • Standard Analyzer:
    This approach only provides grammar-based tokenization while searching. No stemmer is used and stop words are included in the search.

  • German Analyzer:
    In addition to grammar-based tokenization, this approach performs basic stemming while searching and excludes language-specific stop words from the search.

Hypothesis: Decomposition using a dictionary to look up possible compound words should increase the number of relevant documents retrieved, but it could also return more irrelevant documents. The Hyphenation Decompounder should perform better than the Dictionary Decompounder because it should search with fewer irrelevant compounds.

3. Experiment#

Before we could start experimenting, we needed all the documents and queries translated into German so we could index each corpus with the matching settings for each approach. Afterward, we ran the Evaluation API of Elasticsearch as explained in our first experiment and visualized the results.
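For illustration, a query evaluation with Elasticsearch's Ranking Evaluation API could look roughly like the following Python sketch; the index name, query text, and ratings are placeholders rather than our actual setup:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Hypothetical example: evaluate one translated query against a German index
# with the Ranking Evaluation API, using recall at 20 as the metric.
eval_body = {
    "requests": [
        {
            "id": "query_1",
            "request": {"query": {"match": {"text_german": "Atomkraftwerk"}}},
            # Relevance judgements come from the corpus annotations.
            "ratings": [
                {"_index": "german-corpus", "_id": "42", "rating": 1},
            ],
        }
    ],
    "metric": {"recall": {"k": 20, "relevant_rating_threshold": 1}},
}

response = es.rank_eval(index="german-corpus", body=eval_body)
print(response["metric_score"])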

3.1. Translation via DeepL API#

As previously mentioned, we used the DeepL API to translate all our previously used data into German. Generally, this worked well, except for a few adjustments we needed to make to the corpora. One important issue we came across is that the translator gets a bit confused if the text contains additional or leading white spaces, which can happen if the text has previously been tokenized by preprocessing. This is not a major issue on its own; however, if a sentence looks like this example from the Cranfield corpus:

  the integrated remaining liftincrement, after subtracting this destalling lift, was found to agreewell with a potential flow theory .  an empirical evaluation of the destalling effects was made forthe specific configuration of the experiment .

DeepL has some problems with getting the upper- and lower-case distinction right:

  der integrierte RestauftriebDer integrierte Restauftrieb nach Abzug des Destillationsauftriebs stimmtgut mit einer Potentialströmungstheorie übereinstimmt.  Eine empirische Auswertung der Destalling-Effekte wurde vorgenommen fürdie spezifische Konfiguration des Experiments .

This does not represent a major problem; nevertheless, we parsed our data sets once again, removed unnecessary spaces and tabs, saved them as TSVs, and ran the translation on those newly created files. As we can see in the example above, the resulting translation is good enough to work with.
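As a rough sketch of this step, the translation can be done with the official deepl Python package; the file names and column layout here are simplified assumptions, not our exact pipeline:

import csv

import deepl

# Hypothetical authentication key and file names, for illustration only.
translator = deepl.Translator("YOUR_DEEPL_AUTH_KEY")

with open("cranfield_cleaned.tsv", encoding="utf-8") as src, \
        open("cranfield_german.tsv", "w", encoding="utf-8", newline="") as dst:
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t")
    for doc_id, text in reader:
        # Whitespace was already normalized in the cleaned TSV to avoid the
        # casing problems described above.
        result = translator.translate_text(text, source_lang="EN", target_lang="DE")
        writer.writerow([doc_id, result.text])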

3.2. Baseline Setup#

The first thing we set up was the baseline for this experiment. For the Standard Analyzer, we simply used Elasticsearch's default settings to index our documents. For the German Analyzer, we configured it according to the Language Analyzer documentation:

german_analyzer = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "german"
                }
            }
        }
    }
}
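Creating a corpus index with these settings could then look like this minimal sketch with the Python Elasticsearch client (the index name is a placeholder):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Create a corpus index that uses the German Analyzer as its default analyzer.
es.indices.create(index="german-corpus", body=german_analyzer)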

Here you can see a comparison of both approaches:

Standard Analyzer | ADI | CACM | CISI | Cranfield | LISA | Medline | NPL | Time
Recall | 0.361 | 0.154 | 0.089 | 0.323 | 0.264 | 0.271 | 0.117 | 0.689
Precision | 0.088 | 0.066 | 0.129 | 0.112 | 0.113 | 0.318 | 0.118 | 0.137
F1-Score | 0.142 | 0.092 | 0.105 | 0.167 | 0.158 | 0.293 | 0.118 | 0.229

German Analyzer | ADI | CACM | CISI | Cranfield | LISA | Medline | NPL | Time
Recall | 0.488 | 0.214 | 0.097 | 0.389 | 0.384 | 0.289 | 0.16 | 0.744
Precision | 0.173 | 0.11 | 0.136 | 0.136 | 0.157 | 0.338 | 0.147 | 0.151
F1-Score | 0.255 | 0.145 | 0.114 | 0.201 | 0.223 | 0.312 | 0.153 | 0.25

3.3. Dictionary Decompounder#

We configured the Dictionary Decompounder for the settings of each corpus index following the Elasticsearch documentation:

german_decompounding = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": {
            "filter": {
                "german_stop": {
                    "type": "stop",
                    "stopwords": "_german_"
                },
                "german_keywords": {
                    "type": "keyword_marker",
                    "keywords": ["Beispiel"]
                },
                "german_stemmer": {
                    "type": "stemmer",
                    "language": "light_german"
                },
                "german_decompounder": {
                    "only_longest_match": "true",
                    "word_list_path": "analysis/dictionary-de.txt",
                    "max_subword_size": "22",
                    "type": "dictionary_decompounder",
                    "min_subword_size": "4"
                }
            },
            "analyzer": {
                "default": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "german_stop",
                        "german_decompounder",
                        "german_normalization",
                        "german_stemmer",
                        "remove_duplicates"
                    ]
                }
            }
        }
    }
}
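To sanity-check the decompounding, the analyzer can be inspected with the Analyze API on an index created from these settings; a small sketch (the index name and the expected output are assumptions and depend on the dictionary file):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Inspect how the dictionary decompounder splits a compound word.
tokens = es.indices.analyze(
    index="german-dictionary-corpus",
    body={"analyzer": "default", "text": "Donaudampfschiff"},
)
print([t["token"] for t in tokens["tokens"]])
# With "Donau" and "Dampfschiff" in the dictionary, the output should contain
# the subwords next to the full compound, e.g. ['donaudampfschiff', 'donau', 'dampfschiff'].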

These were the results of the *Dictionary Decompounder*:
Dictionary Decompounder | ADI | CACM | CISI | Cranfield | LISA | Medline | NPL | Time
Recall | 0.595 | 0.239 | 0.103 | 0.352 | 0.347 | 0.301 | 0.183 | 0.746
Precision | 0.126 | 0.112 | 0.147 | 0.127 | 0.149 | 0.342 | 0.178 | 0.144
F1-Score | 0.208 | 0.152 | 0.121 | 0.187 | 0.208 | 0.32 | 0.181 | 0.241

3.4. Hyphenation Decompounder#

We configured the Hyphenation Decompounder for the settings of each corpus index following the Elasticsearch documentation:

german_hyphenation_decompounding = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": {
            "filter": {
                "german_stop": {
                    "type": "stop",
                    "stopwords": "_german_"
                },
                "german_keywords": {
                    "type": "keyword_marker",
                    "keywords": ["Beispiel"]
                },
                "german_stemmer": {
                    "type": "stemmer",
                    "language": "light_german"
                },
                "german_hyphenation_decompounder": {
                    "type": "hyphenation_decompounder",
                    "word_list_path": "analysis/dictionary-de.txt",
                    "hyphenation_patterns_path": "analysis/de_DR.xml",
                    "only_longest_match": "true",
                    "min_subword_size": "4"
                }
            },
            "analyzer": {
                "default": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "german_stop",
                        "german_hyphenation_decompounder",
                        "german_normalization",
                        "german_stemmer",
                        "remove_duplicates"
                    ]
                }
            }
        }
    }
}
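Both word_list_path and hyphenation_patterns_path refer to files inside the Elasticsearch config directory. After creating an index with one of these settings objects, the translated documents still need to be indexed; a rough sketch of how this could be done with the bulk helper (the field names follow the examples shown later, everything else is illustrative):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

def index_corpus(index_name, settings, docs):
    """Create an index with the given analysis settings and bulk-index its documents."""
    es.indices.create(index=index_name, body=settings)
    actions = (
        {
            "_index": index_name,
            "_id": doc["id"],
            "title_german": doc.get("title_german", ""),
            "text_german": doc["text_german"],
        }
        for doc in docs
    )
    helpers.bulk(es, actions)

# e.g. index_corpus("german-hyphenation-corpus", german_hyphenation_decompounding, docs)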


These were the results of the *Hyphenation Decompounder*:
Hyphenation Decompounder | ADI | CACM | CISI | Cranfield | LISA | Medline | NPL | Time
Recall | 0.626 | 0.24 | 0.106 | 0.379 | 0.363 | 0.298 | 0.19 | 0.729
Precision | 0.131 | 0.12 | 0.145 | 0.133 | 0.149 | 0.338 | 0.194 | 0.146
F1-Score | 0.217 | 0.16 | 0.122 | 0.197 | 0.211 | 0.317 | 0.192 | 0.243

4. Results#

To compare our results properly we plotted Recall, Precision, and F1-Score for each method on every corpus.

Recall

What is "Recall"?
Recall measures the probability that relevant documents are retrieved. Therefore, the number of all retrieved relevant documents is divided by the number of all documents that are labeled as relevant. For example, if 8 documents in the corpus are relevant to our search and 4 of them are retrieved, then the Recall score would be 4/8 = 0.5.

To measure Recall it is necessary to have the relevant documents properly labeled. Recall only looks at relevant documents that were retrieved and does not take into account any irrelevant documents which may have been retrieved.

[Figure: Recall scores of each approach on every corpus]

Precision

What is "Precision"?
Precision measures the probability that retrieved documents are relevant to the search query. Therefore, the number of all retrieved relevant documents is divided by the number of all retrieved documents. For example, if we retrieve 10 search results and only 5 are relevant for our search, the Precision score would be: 5/10 = 0.5.

To measure the Precision it is necessary to have the relevant documents properly labeled. Precision only looks at the documents that are retrieved and does not account for relevant documents which were not retrieved.

[Figure: Precision scores of each approach on every corpus]

F1-Score

What is an "F1-Score"?
The F1-Score is the harmonic mean of Precision and Recall. Therefore, we multiply the product of Precision and Recall by two and divide it by the sum of Precision and Recall:

F1-Score=(2*Precision*Recall)/(Precision+Recall)
This is the simplest way to balance Precision and Recall; there are also other common options that weight them differently.
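As a small illustration of how the three measures relate for a single query, here is a plain Python sketch (not part of the evaluation pipeline itself):

def f1_score(relevant, retrieved):
    """Compute Precision, Recall and F1 for one query from two sets of document ids."""
    true_positives = len(relevant & retrieved)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Example: 8 relevant documents, 10 retrieved, 4 of the retrieved ones relevant.
precision, recall, f1 = f1_score(relevant={1, 2, 3, 4, 5, 6, 7, 8},
                                 retrieved={1, 2, 3, 4, 11, 12, 13, 14, 15, 16})
print(precision, recall, f1)  # 0.4 0.5 ~0.44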

[Figure: F1-Scores of each approach on every corpus]

For a better overview of how much each approach improved, we also measured the gains, using the Standard Analyzer values as a baseline:

[Figure: Gains of each approach over the Standard Analyzer baseline]

Conclusion

We can see an increase in the scores in every corpus comparing the baseline Standard Analyzer to the Hyphenation Decompounder.
In all corpora except the Medline corpus, the Hyphenation Decompounder seems to perform better than the Dictionary Decompounder. There are a few cases where regular stemming and stop word removal with the German Analyzer might be better than decomposition. Overall, it looks like using any approach other than the baseline Standard Analyzer improves the results. For a more detailed explanation and some examples, please look through the following discussion section.

5. Discussion#

In this section, we want to discuss some of our results in more detail. First, we want to show some positive examples of how decompounding can improve search results, to underline our hypothesis. After that, we will try to analyze some unexpected variations in the results, such as why the German Analyzer is vastly superior to the Standard Analyzer and sometimes even to the Dictionary Decompounder, or why the Dictionary Decompounder is slightly better than the Hyphenation Decompounder on the Medline corpus.

5.1. German Analyzer vs. Dictionary & Hyphenation Decompounder#

As suggested in the hypothesis section, decomposition did increase the search scores. The best examples of this can be seen in the NPL corpus: starting from the Standard Analyzer baseline, the German Analyzer improved the scores, the Dictionary Decompounder increased them further, and the Hyphenation Decompounder achieved the highest scores.

To illustrate what exactly caused this increase, we selected a search query that seemed the most informative to us.


German Analyzer vs. Hyphenation Decompounder#

To show how the German Analyzer is surpassed by the Hyphenation Decompounder, Query 23 seemed like the best choice. The F-Score difference between both is 0.256.

Example Query 23: "BEOBACHTUNGEN DER SONNE WÄHREND FINSTERNISSEN, DIE DIE VERTEILUNG DER QUELLEN AUF DER SCHEIBE IM MIKROWELLENBEREICH ERGEBEN"

There are 20 relevant documents for this query, and with 20 results returned, not all true positives are found. We can see that the Hyphenation Decompounder finds 5 more relevant documents than the German Analyzer:

As an example document to compare the scores and searched terms in detail, we chose document 81 because the Hyphenation Decompounder rightfully ranked it as the top document, while the German Analyzer does not rank it in the top 20.

         | German Analyzer | Hyphenation Decompounder
position | -               | 1
score    | 3.3543134       | 28.156767

To explain why the Hyphenation Decompounder works better than the German Analyzer, we have to look at the search results in detail.

Explain German Analyzer

{
  "_id": "81",
  "_score": 3.3543134,
  "_source": {
    "text_german": "finsternisbeobachtungen von mikrowellenradioquellen auf der sonnenscheibe im april ergebnisse von beobachtungen der flussdichtepolarisation und helligkeitsverteilung, die in japan gemacht wurden, sind vier frequenzen im bereich"
  },
  "highlight": {
    "text_german": [
      "finsternisbeobachtungen von mikrowellenradioquellen auf der sonnenscheibe im april ergebnisse von <em>beobachtungen</em>"
    ]
  }
}

Explain Hyphenation Decompounder

{
  "doc": {
    "id": 81,
    "text_german": "finsternisbeobachtungen von mikrowellenradioquellen auf der sonnenscheibe im april ergebnisse von beobachtungen der flussdichtepolarisation und helligkeitsverteilung, die in japan gemacht wurden, sind vier frequenzen im bereich"
  },
  "highlight": {
    "text_german": [
      "<em>finsternisbeobachtungen</em> von <em>mikrowellenradioquellen</em> auf der <em>sonnenscheibe</em> im april ergebnisse von <em>beobachtungen</em>",
      "flussdichtepolarisation und helligkeitsverteilung, die in japan gemacht wurden, sind vier frequenzen im <em>bereich</em>"
    ]
  }
}

As we can see in this table the matched terms and, therefore, the scores diverge widely:

We can clearly see that, for example, the term FINSTERNISSEN does not match within the document while using the German Analyzer. It gets stemmed by Elasticsearch to the term finsternis but, since the searched text isn't decompounded with this approach, it won't return finsternisbeobachtungen as relevant to the query. The same goes for the term MIKROWELLENBEREICH, which is only found by the Hyphenation Decompounder within the composition mikrowellenradioquellen.

The German Analyzer seems to find far fewer matches than the Hyphenation Decompounder. Even though some of the terms the Hyphenation approach extracts are not very useful for this query, like well and reich, it still finds enough relevant terms that were not found without decompounding.

This shows that the Decompounder approach can clearly improve search results.


Dictionary Decompounder vs. Hyphenation Decompounder#

Second, we compared the Dictionary Decompounder with the Hyphenation Decompounder to see how, in some cases, the hyphenation approach improves over brute force decompounding. We used Query 42 for this comparison since the F-Score difference is 0.25.

Example Query 42: "LÖSUNG VON DIFFERENTIALGLEICHUNGEN PER COMPUTER"

The Hyphenation Decompounder returns far more true positives than the Dictionary Decompounder: out of 36 relevant documents, only 1 is found by the brute force approach of the Dictionary Decompounder, compared to 8 by the Hyphenation Decompounder.

Once again, it is helpful to look more closely at one relevant document for the query. Document 5444, which we chose as an example, is not ranked in the top 20 retrieved documents by the Dictionary Decompounder, but is in second position for the Hyphenation Decompounder.

         | Dictionary Decompounder | Hyphenation Decompounder
position | -                       | 2
score    | 7.703416                | 14.950291

When we look closely at the explain function returned by Elasticsearch we can see which terms were matched:

Explain Dictionary Decompounder

{"doc":{"_id" : 5444,        "text_german" : "die Lösung partieller Differentialgleichungen durch Differenzverfahren mit dem elektronischen Differentialanalysator"        },        "highlight" : {          "text_german" : [            "die <em>Lösung</em> partieller <em>Differentialgleichungen</em> durch Differenzverfahren mit dem elektronischen <em>Differentialanalysator</em>"          ]        }}

Explain Hyphenation Decompounder

{"doc": {"id": 5444,    "text_german": "die Lösung partieller Differentialgleichungen durch Differenzverfahren mit dem elektronischen Differentialanalysator"},   "highlight": {"text_german": ["die <em>Lösung</em> partieller <em>Differentialgleichungen</em> durch Differenzverfahren mit dem elektronischen <em>Differentialanalysator</em>"]}   }

We can see that the score of the Dictionary approach is half of the Hyphenation score, although both seem to match the same words in the text of the document. This can be explained by looking at the term scores more closely:

As we can see, the Dictionary Decompounder seems to search on more sub-terms for differentialgleichung than the Hyphenation Decompounder. This leads to a higher document frequency which reduces the term score. So the Dictionary approach searches more terms that aren't relevant for the query and, therefore, is more likely to score relevant documents lower than irrelevant ones.

Conclusion
On the NPL corpus, the Hyphenation Decompounder seems to work best. It seems to differentiate between necessary decompounding and words that only raise the document frequency. These examples underline our hypothesis that taking decomposition into account can positively affect the search scores.

5.2. Why is the German Analyzer significantly better than the Standard Analyzer?#

While experimenting, the scores for the German Analyzer always exceeded the scores of the Standard Analyzer, and on some corpora, like CACM or LISA, there was an increase of nearly 50%, and up to 80% on the ADI corpus (see the gains graphic in section 4. Results).

At first, this seemed strange, so we looked into it a bit more. While analyzing the ADI and CACM datasets, we discovered that the Standard Analyzer matches more documents than the German Analyzer, which increases the document frequency and returns significantly more irrelevant documents. The reason for that is the stop word list: the Standard Analyzer does not consider any German stop words, while the German Analyzer works with a very rich stop word list. Below, you can find two more detailed examples, one from the ADI and one from the CACM corpus, which illustrate this phenomenon.


ADI#

A good example of the decrease in irrelevant retrievals in the ADI corpus is Query 21; the F1-Score difference between the Standard Analyzer and the German Analyzer is 0.416.

Example Query 21: " Die Notwendigkeit, Personal für den Informationsbereich bereitzustellen."

It is a relatively short query with only 5 relevant documents, all of which are found by both approaches. The main difference can be seen when looking at the false positives:

As an example document to compare the scores in detail, we chose document 21 because the Standard Analyzer ranked it lower than the German Analyzer even though the scores were nearly the same.

         | Standard Analyzer | German Analyzer
position | 6                 | 7
score    | 2.9156718         | 2.9190629

It seems that other documents, like document 38, were scored higher because matching stop words like die and für appear in them:

Explain Standard Analyzer

[  {    "doc": {      "id": 38,      "text_german": " Mikrodruck hat sich in einem Experiment der Wildlife Disease Association als akzeptables Publikationsmedium erwiesen . mit Autorenkomposition kostet etwas mehr als sieben Cent pro 3 x 5-Zoll-Karte, bis zu 47 Seiten . die Notwendigkeit für die Entwicklung von Standards und die Verbesserung von Zubehör-Abrufgeräten wird erkannt.",      "title_german": " Mikrodruck hat sich in einem Experiment der Wildlife Disease Association als akzeptables Publikationsmedium erwiesen . mit Autorenkomposition kostet etwas mehr als sieben Cent pro 3 x 5-Zoll-Karte, bis zu 47 Seiten . die Notwendigkeit für die Entwicklung von Standards und die Verbesserung von Zubehör-Abrufgeräten wird erkannt."    },    "highlight": {      "text_german": [        ". mit Autorenkomposition kostet etwas mehr als sieben Cent pro 3 x 5-Zoll-Karte, bis zu 47 Seiten . <em>die</em>",        "<em>Notwendigkeit</em> <em>für</em> <em>die</em> Entwicklung von Standards und <em>die</em> Verbesserung von Zubehör-Abrufgeräten wird erkannt"      ],      "title_german": [        ". mit Autorenkomposition kostet etwas mehr als sieben Cent pro 3 x 5-Zoll-Karte, bis zu 47 Seiten . <em>die</em>",        "<em>Notwendigkeit</em> <em>für</em> <em>die</em> Entwicklung von Standards und <em>die</em> Verbesserung von Zubehör-Abrufgeräten wird erkannt"      ]    },    "position": 5,    "score": 3.8836367  }

The scores returned by the Standard Analyzer are on average higher than those returned by the German Analyzer, but since the Standard Analyzer also returns a lot of noise, it produces worse results when you look at the averages across all queries.

But the Standard Analyzer does not always return high scores. Since it only searches on the given terms without stemming, it sometimes scores significantly worse. A good example of that can be found in Query 20:

Example Query 20: "Testen von automatisierten Informationssystemen."

There are 3 relevant documents; only 1 is found by the Standard Analyzer, while 2 are found by the German Analyzer. Since the false positive rate is also higher with the first approach, the difference weighs even more heavily.

If you look at document 65, which only the German Analyzer retrieves among its results, you can see that the German Analyzer scores it almost ten times higher than the Standard Analyzer:

The position and the score of this document in both approaches vary greatly:

         | Standard Analyzer | German Analyzer
position | -                 | 3
score    | 0.3869483         | 3.3748484

That is because stemming lets it match words that cannot be found otherwise.

Explain Standard Analyzer

{    "score": 0.3869483,    "text_german": {        "total_value": 0.3869483,        "details": [            {                "function": {                    "value": 0.3869483,                    "description": "weight(text_german:von in 21) [PerFieldSimilarity], result of:",                    "n, number of documents containing term": 57,                    "freq, occurrences of term within document": 1.0                }            }        ]    },    "title_german": {        "total_value": 0.3869483,        "details": [            {                "function": {                    "value": 0.3869483,                    "description": "weight(title_german:von in 21) [PerFieldSimilarity], result of:",                    "n, number of documents containing term": 57,                    "freq, occurrences of term within document": 1.0                }            }        ]    }}

Explain German Analyzer

{    "score": 3.3748484,    "text_german": {        "total_value": 3.3748484,        "details": [            {                "function": {                    "value": 3.3748484,                    "description": "weight(text_german:test in 21) [PerFieldSimilarity], result of:",                    "n, number of documents containing term": 3,                    "freq, occurrences of term within document": 1.0                }            }        ]    },    "title_german": {        "total_value": 3.3748484,        "details": [            {                "function": {                    "value": 3.3748484,                    "description": "weight(title_german:test in 21) [PerFieldSimilarity], result of:",                    "n, number of documents containing term": 3,                    "freq, occurrences of term within document": 1.0                }            }        ]    }}

In this example, we can see that the Standard Analyzer only matches von, which is filtered out by the stop word list in the second approach, and does not match test because the word Testen is not stemmed. It seems that the effects of stemming and stop word removal are very relevant in German information retrieval.
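The two explain outputs above show this directly: von occurs in 57 of the documents, while test occurs in only 3. Elasticsearch's default BM25 similarity computes the idf part of a term's weight as log(1 + (N - n + 0.5) / (n + 0.5)), so the difference in document frequency alone already explains most of the score gap. A small sketch, where the total number of documents N is an assumed value:

import math

def bm25_idf(n, N):
    """idf component of Elasticsearch's default BM25 similarity."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

# n values taken from the explain outputs above; N = 82 is an assumption for the small ADI index.
print(bm25_idf(57, 82))  # "von"  -> ~0.37 (very common term, low idf)
print(bm25_idf(3, 82))   # "test" -> ~3.17 (rare term, high idf)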


CACM#

To support this conclusion let us look at one more example from the CACM dataset.

Example Query 10: "Parallele Sprachen; Sprachen für paralleles Rechnen .N 10. Alec Grimison, Comp Serv, Uris Hall (parallel lang)"

Overall, we can see that the German Analyzer does find significantly more true positives than the Standard Analyzer:

The first approach only returns 1 out of 35 relevant documents, while the second approach returns 15. When we look in more detail at the search terms for document 2851, the one document found by both approaches, we can see that once again stemming does the trick:

Although the German Analyzer's score is higher than the Standard Analyzer's, the document has a lower position in its ranking. This is because other relevant documents in the second approach had even better term scores. For example, document 1262 was ranked at position 3 by the second approach, but is outside the top 20 for the first:

When we look at the highlights of both approaches and at which terms are matched, we can see that the main difference between them is that the second approach matches sprach instead of für. Since für is a very common stop word in German, the high document frequency of this term decreases the score significantly.

Standard Analyzer

{"_id": "1262",  "_index": "pragmalingu-cacm-german-corpus",  "_score": 0.3728282,  "_source": {"text_german": "Es werden zwei Anweisungen vorgeschlagen, die es einem Programmierer erlauben der in einer prozedurorientierten Sprache schreibt Programmabschnitte anzugeben, die parallel ausgeführt werden sollen. parallel ausgeführt werden sollen.  Die Anweisungen sind DO TOGETHER und HOLD.  Sie dienen einerseits als Klammern zur Festlegung einen Bereich für den Parallelbetrieb festzulegen und teilweise jeden parallelen Pfad innerhalb dieses Bereichs zu definieren.  DO TOGETHERs können verschachtelt werden.  Die Anweisungen sollten besonders besonders effektiv für die Verwendung mit Rechengeräten sein, die in der Lage sind, ein gewisses Maß an Überlappung von Rechenoperationen zu erreichen. ",   "title_german": "Prozedur-orientierte Sprachanweisungen zur Erleichterung der Parallelverarbeitung"},  "_type": "_doc",  "highlight": {"text_german": ["Sie dienen einerseits als Klammern zur Festlegung einen Bereich <em>für</em> den Parallelbetrieb festzulegen und",    "Die Anweisungen sollten besonders besonders effektiv <em>für</em> die Verwendung mit Rechengeräten sein, die in"]}}

German Analyzer

{"doc": {"id": 1262,   "text_german": "Es werden zwei Anweisungen vorgeschlagen, die es einem Programmierer erlauben der in einer prozedurorientierten Sprache schreibt Programmabschnitte anzugeben, die parallel ausgeführt werden sollen. parallel ausgeführt werden sollen.  Die Anweisungen sind DO TOGETHER und HOLD.  Sie dienen einerseits als Klammern zur Festlegung einen Bereich für den Parallelbetrieb festzulegen und teilweise jeden parallelen Pfad innerhalb dieses Bereichs zu definieren.  DO TOGETHERs können verschachtelt werden.  Die Anweisungen sollten besonders besonders effektiv für die Verwendung mit Rechengeräten sein, die in der Lage sind, ein gewisses Maß an Überlappung von Rechenoperationen zu erreichen. ",   "title_german": "Prozedur-orientierte Sprachanweisungen zur Erleichterung der Parallelverarbeitung"},  "highlight": {"text_german": ["zwei Anweisungen vorgeschlagen, die es einem Programmierer erlauben der in einer prozedurorientierten <em>Sprache</em>",    "schreibt Programmabschnitte anzugeben, die <em>parallel</em> ausgeführt werden sollen. <em>parallel</em> ausgeführt werden",    "einerseits als Klammern zur Festlegung einen Bereich für den Parallelbetrieb festzulegen und teilweise jeden <em>parallelen</em>"]},  "position": 3,  "score": 16.73008}

Conclusion

Our results suggest that, since German contains significantly more semantically empty words that do not contribute any content value to the search, it is important to account for stop words in the search settings for German. This way, it may be possible to avoid returning too many irrelevant documents, which would otherwise suppress relevant search results.
This would also explain why the Standard Analyzer performs worse on German texts than on English texts. You can look at the results of the Standard Analyzer on English texts in our stemming experiment.

5.3. Why does the German Analyzer work better than the Dictionary Decompounder?#

The decompounders seemed to work better on most of the corpora we compared, but on the ADI, Cranfield, LISA, and Time corpora the German Analyzer scores exceeded both decompounding approaches. We looked into this more closely for the Cranfield and LISA corpora. We decided to leave ADI out of this part of the discussion since it is a rather small corpus, and Time as well, since the other two provide better examples.


Cranfield#

To analyze why the results of the German Analyzer seem to be better than the decompounding approaches, we chose the Dictionary Decompounder for this comparison.

A good example to see exactly what works better with the German Analyzer would be Query 3 with an F-Score difference of 0.428.

Example Query 3: " welche Probleme der Wärmeleitung in Verbundplatten bisher gelöst worden sind ."

There are 9 relevant documents for this query, of which only 1 is returned by the Dictionary Decompounder and 6 by the German Analyzer:

We chose document 181, the one relevant document that was returned by both approaches. We can see from the scores and positions that the Dictionary Decompounder apparently matches far more documents before matching document 181, although its score is not that different from the German Analyzer's:

         | German Analyzer | Dictionary Decompounder
position | 3               | 12
score    | 11.539416       | 10.477009

While looking at the term scores, we can see that the German Analyzer does get better scores for every matched term group:

Although the decompounding matches far more terms inside the document's text_german field, both approaches seem to take the title_german field score for the document because it was higher:

Explain German Analyzer

{"doc": {"id": 181,    "text_german": " Einige Probleme zur Wärmeleitung in schichtförmigen Körpern.  Probleme zur Wärmeleitung in mehrschichtigen Körpern führen meist zu komplizierten Berechnungen. Die vorliegende Arbeit gibt einen Einblick in die besonderen Schwierigkeiten, die bei unendlichen Verbundkörpern auftreten, wobei allgemeine Ableitungen auf eine spezielle Klasse von Fragen angewendet werden.",    "title_german": " Einige Probleme zur Wärmeleitung in schichtförmigen Körpern."},   "highlight": {"text_german": ["Einige <em>Probleme</em> zur <em>Wärmeleitung</em> in schichtförmigen Körpern.",     "<em>Probleme</em> zur <em>Wärmeleitung</em> in mehrschichtigen Körpern führen meist zu komplizierten Berechnungen."],    "title_german": ["Einige <em>Probleme</em> zur <em>Wärmeleitung</em> in schichtförmigen Körpern."]},}

Explain Dictionary Decompounder

{"doc": {"id": 181,    "text_german":     " Einige Probleme zur Wärmeleitung in schichtförmigen Körpern.  \n    Probleme zur Wärmeleitung in mehrschichtigen Körpern führen meist zu komplizierten Berechnungen. \n    Die vorliegende Arbeit gibt einen Einblick in die besonderen Schwierigkeiten, die bei unendlichen Verbundkörpern auftreten, \n    wobei allgemeine Ableitungen auf eine spezielle Klasse von Fragen angewendet werden.",    "title_german": " Einige Probleme zur Wärmeleitung in schichtförmigen Körpern."},   "highlight": {"text_german": ["Einige <em>Probleme</em> zur <em>Wärmeleitung</em> in schichtförmigen Körpern.",     "<em>Probleme</em> zur <em>Wärmeleitung</em> in mehrschichtigen Körpern führen meist zu komplizierten Berechnungen.",     "Die vorliegende Arbeit gibt einen Einblick in die besonderen Schwierigkeiten, die bei unendlichen <em>Verbundkörpern</em>",     "auftreten, wobei allgemeine <em>Ableitungen</em> auf eine spezielle Klasse von Fragen angewendet werden."],    "title_german": ["Einige <em>Probleme</em> zur <em>Wärmeleitung</em> in schichtförmigen Körpern."]}}

Due to decompounding, more documents are found, which decreases the idf and matches more irrelevant documents. The score for a term like "warmeleitung" is combined with the score of its sub-concept "warm", which generalizes the term "warmeleitung" too much. To check whether the same happens in the other corpora, let us look at an example from the LISA corpus as well.

What is "idf" and why is it relevant?

In information retrieval, the tf-idf measure is often consulted when analyzing search results.

  • tf stands for "term frequency" and describes how often a term occurs in the considered document, therefore this value only depends on the current document.
  • idf stands for "inverse document frequency" and describes how frequently a term occurs in all available documents, therefore this value depends on the entire corpus.

In the tf-idf measure, these two values are weighted against each other. If a term is very rare in the entire corpus, but quite frequent in the document under consideration, it is classified as more relevant.


LISA#

Similar to the Cranfield corpus, the difference between the German Analyzer and the Dictionary Decompounder is especially visible in the recall scores:

As an example for this corpus, we chose Query 10 with the highest F-Score difference of 0.30.

Example Query 10: "ICH INTERESSIERE MICH FÜR INFORMATIONEN ÜBER DIE BEREITSTELLUNG AKTUELLER AWARENESS BULLETINS, INSBESONDERE SDI-DIENSTE IN BELIEBIGEN INSTITUTIONEN, Z.B. IN WISSENSCHAFTLICHEN BIBLIOTHEKEN, IN DER INDUSTRIE UND IN BELIEBIGEN THEMENBEREICHEN. SDI, SELEKTIVE INFORMATIONSVERBREITUNG, CURRENT AWARENESS BULLETINS, INFORMATION BULLETINS."

This query has 14 relevant documents; the German Analyzer returns 10, while the Dictionary Decompounder only returns 5:

To see why the German Analyzer finds twice as many true positives, we looked at one of the overlapping documents, document 5801.

         | German Analyzer | Dictionary Decompounder
position | 2               | 10
score    | 39.544914       | 40.788223

The score is not that different, but the position is. The Dictionary Decompounder seems to find far more irrelevant documents before retrieving relevant ones. When we look at the terms, we can see why:

The Dictionary Decompounder finds far more terms, also including some irrelevant and nonsense words like formation (which has nothing to do with the searched term information).

Explain German Analyzer

{"doc": {"id": 5801,   "text_german": "PRÄSENTIERT EINEN HISTORISCHEN RÜCKBLICK AUF DIE SELEKTIVE VERBREITUNG VON INFORMATIONEN ALS EIN AKTUELLES BEWUSSTSEINS-SYSTEM, DIE NOTWENDIGKEIT DAFÜR UND EINIGE PRAKTISCHE VORSCHLÄGE FÜR SEINE EINFÜHRUNG. BEISPIELE WERDEN SOWOHL AUS SPEZIAL- ALS AUCH AUS WISSENSCHAFTLICHEN BIBLIOTHEKEN HERANGEZOGEN.VERSUCHE, ENTWÜRFE FÜR EINEN CURRENT-AWARENESS-SERVICE IN DER KASHIMIBRAHIM-BIBLIOTHEK, AHMADU-BELLO-UNIVERSITÄT, ZU ERSTELLEN.",   "title_german": "SELEKTIVE VERBREITUNG VON INFORMATIONEN IN WISSENSCHAFTLICHEN BIBLIOTHEKEN."},  "highlight": {"text_german": ["PRÄSENTIERT EINEN HISTORISCHEN RÜCKBLICK AUF DIE <em>SELEKTIVE</em> VERBREITUNG VON <em>INFORMATIONEN</em> ALS EIN <em>AKTUELLES</em>",    "BEISPIELE WERDEN SOWOHL AUS SPEZIAL- ALS AUCH AUS <em>WISSENSCHAFTLICHEN</em> <em>BIBLIOTHEKEN</em> HERANGEZOGEN.VERSUCHE",    ", ENTWÜRFE FÜR EINEN <em>CURRENT</em>-<em>AWARENESS</em>-SERVICE IN DER KASHIMIBRAHIM-<em>BIBLIOTHEK</em>, AHMADU-BELLO-UNIVERSITÄT"],   "title_german": ["<em>SELEKTIVE</em> VERBREITUNG VON <em>INFORMATIONEN</em> IN <em>WISSENSCHAFTLICHEN</em> <em>BIBLIOTHEKEN</em>."]},  "position": 2,  "score": 39.544914}

Explain Dictionary Decompounder

{"doc": {"id": 5801,   "text_german": "PRÄSENTIERT EINEN HISTORISCHEN RÜCKBLICK AUF DIE SELEKTIVE VERBREITUNG VON INFORMATIONEN ALS EIN AKTUELLES BEWUSSTSEINS-SYSTEM, DIE NOTWENDIGKEIT DAFÜR UND EINIGE PRAKTISCHE VORSCHLÄGE FÜR SEINE EINFÜHRUNG. BEISPIELE WERDEN SOWOHL AUS SPEZIAL- ALS AUCH AUS WISSENSCHAFTLICHEN BIBLIOTHEKEN HERANGEZOGEN.VERSUCHE, ENTWÜRFE FÜR EINEN CURRENT-AWARENESS-SERVICE IN DER KASHIMIBRAHIM-BIBLIOTHEK, AHMADU-BELLO-UNIVERSITÄT, ZU ERSTELLEN.",   "title_german": "SELEKTIVE VERBREITUNG VON INFORMATIONEN IN WISSENSCHAFTLICHEN BIBLIOTHEKEN."},  "highlight": {"text_german": ["PRÄSENTIERT EINEN HISTORISCHEN RÜCKBLICK AUF DIE <em>SELEKTIVE</em> <em>VERBREITUNG</em> VON <em>INFORMATIONEN</em> ALS EIN <em>AKTUELLES</em>",    "BEISPIELE WERDEN SOWOHL AUS SPEZIAL- ALS AUCH AUS <em>WISSENSCHAFTLICHEN</em> <em>BIBLIOTHEKEN</em> HERANGEZOGEN.VERSUCHE",    ", ENTWÜRFE FÜR EINEN <em>CURRENT</em>-<em>AWARENESS</em>-SERVICE IN DER KASHIMIBRAHIM-<em>BIBLIOTHEK</em>, AHMADU-BELLO-UNIVERSITÄT",    ", ZU <em>ERSTELLEN</em>."],   "title_german": ["<em>SELEKTIVE</em> <em>VERBREITUNG</em> VON <em>INFORMATIONEN</em> IN <em>WISSENSCHAFTLICHEN</em> <em>BIBLIOTHEKEN</em>."]},  "position": 10,  "score": 40.788223}

This suggests that, with all those irrelevant terms, the Dictionary Decompounder has the same problem on the LISA corpus as on Cranfield: the decomposition increases the number of returned documents, which decreases the idf and therefore scores irrelevant documents higher than relevant ones.

What is "idf" and why is it relevant?

In information retrieval, the tf-idf measure is often consulted when analyzing search results.

  • tf stands for "term frequency" and describes how often a term occurs in the considered document, therefore this value only depends on the current document.
  • idf stands for "inverse document frequency" and describes how frequently a term occurs in all available documents, therefore this value depends on the entire corpus.

In the tf-idf measure, these two values are weighted against each other. If a term is very rare in the entire corpus, but quite frequent in the document under consideration, it is classified as more relevant.


Conclusion
These examples show that decomposition can sometimes lead to a decrease in the scores, because over-decompounding the search terms returns more noise.

5.4. Why does the Dictionary Decompounder work better than the Hyphenation Decompounder?#

So far, we have mostly compared decompounding with the standard analyzers. For most of the datasets the Hyphenation Decompounder performed better than the Dictionary Decompounder, but for the Medline and CACM corpora there were minor deviations in the scores where the Dictionary Decompounder performed better.

To investigate what exactly caused this difference, we picked out a search query that seemed the most informative to us. We looked deeper into Query 27, with an F-Score difference of 0.432.

Example Query 27: "Interesse an den parasitären Krankheiten. Filarienparasiten bei Primaten, die Insektenvektoren der Filarien, die verwandten Dipteren, d. h. Culicoides, Mücken usw., die als Vektoren dieser Infektionskrankheit dienen können; auch die Lebenszyklen und die Übertragung der Filarien. Parasiten und Ökologie des Taiwan-Affen, Macaca cyclopis, mit Schwerpunkt auf dem Filarienparasiten, Macacanema formosana."

There are 18 relevant documents for this query, the Dictionary Decompounder returns 12 true positives and the Hyphenation Decompounder three times fewer, only 4:

As a good example document to compare the scores and searched terms in detail, we chose document 732 because the Hyphenation Decompounder does not rank it in the top 20 documents, while the Dictionary Decompounder does.

         | Dictionary Decompounder | Hyphenation Decompounder
position | 5                       | -
score    | 42.184074               | 3.2657375

When we look at the term scores, it is clear that the Dictionary approach finds far more terms:

In contrast to the previous examples, where additional matched terms increased the document frequencies and thereby lowered the term scores, this effect seems to be different in the Medline corpus. Since it is filled with medical documents, we think there are fewer irrelevant terms in general.

What is "idf" and why is it relevant?

In information retrieval, the tf-idf measure is often consulted when analyzing search results.

  • tf stands for "term frequency" and describes how often a term occurs in the considered document, therefore this value only depends on the current document.
  • idf stands for "inverse document frequency" and describes how frequently a term occurs in all available documents, therefore this value depends on the entire corpus.

In the tf-idf measure, these two values are weighted against each other. If a term is very rare in the entire corpus, but quite frequent in the document under consideration, it is classified as more relevant.


Here we can see what exactly was matched inside the documents:

Explain Dictionary Decompounder

{"doc": {"id": 732,    "text_german": "ein pilotprojekt zur kontrollierung der filariose in thailand in einem dorf im bezirk kanjanadit in der provinz surat-thani, südthailand, wo eine feldstation für filariasestudien von der bangkok school of tropical medicine eingerichtet worden war, wurden blutfilme von 977 personen (95,5 prozent der gesamtbevölkerung von 1.023 personen) untersucht. von jeder person wurden zwei dicke filme (je 20 c.mm.) präpariert und mit giemsa angefärbt. es wurde festgestellt, dass 21,1 prozent. der Personen beherbergten Mikrofilarien (alle Brugia malayi). elephantiasis wurde bei 5,3 Prozent der Bevölkerung gefunden. die Mikrofilarien-Periodizität wurde bei 25 Personen untersucht; in jedem Fall wurde festgestellt, dass sie ausgeprägt nächtlich ist. das blut von 98 katzen, 52 hunden und zwei affen wurde ebenfalls untersucht. es wurden keine b. malayi larven gefunden. stechmücken wurden gefangen und identifiziert. in einer ersten Untersuchung wurden 4.557 stechmücken untersucht, von denen 568 mansonia spp. waren. in 4.136 sektionen wurden b. malayi-Larven im stadium ii wurden in einem m. uniformis und im stadium iii in einem anderen gefunden; die Infektionsrate für m. uniformis lag bei 0,6 Prozent. in der letzten Phase der untersuchung wurden alle Häuser mit ddt besprüht. dies führte zu einem leichten rückgang der anzahl und des prozentsatzes der gefangenen mansonia-Mücken. diethylcarbamazin wurde so vielen Dorfbewohnern wie möglich in einer Dosis von 5 mgm. des Citratsalzes pro kgm. körpergewicht einmal wöchentlich über sechs Wochen verabreicht.  Die Blutuntersuchungen wurden einen Monat und ein Jahr nach Absetzen des Medikaments wiederholt. Es zeigte sich, dass der Anteil der Mikrofilaria-Träger von 21,1 Prozent auf 2,2 bzw. 2,2 Prozent, die Filariose-Infektionsrate von 26,1 Prozent auf 8,6 bzw. 8,5 Prozent zurückgegangen war, und die mittlere Mikrofilariendichte aller Filme von 4,8 pro 20 c.mm. Blut auf 0,48 und 0,12. Larven von b. malayi wurden in Mücken, die einen Monat und ein Jahr nach der Massentherapie seziert wurden, nicht gefunden."},   "highlight": {"text_german": ["Fall wurde festgestellt, dass sie ausgeprägt nächtlich ist. das blut von 98 katzen, 52 hunden und zwei <em>affen</em>",     "spp. waren. in 4.136 sektionen wurden b. malayi-Larven im stadium ii wurden in einem m. <em>uniformis</em> und",     "im stadium iii in einem anderen gefunden; die <em>Infektionsrate</em> für m. <em>uniformis</em> lag bei 0,6 Prozent. in",     "Es zeigte sich, dass der Anteil der Mikrofilaria-<em>Träger</em> von 21,1 Prozent auf 2,2 bzw. 2,2 Prozent, die",     "<em>Mikrofilariendichte</em> aller Filme von 4,8 pro 20 c.mm."]},   "position": 5,   "score": 42.184074}

Explain Hyphenation Decompounder

{"doc":{"_id" : "732",          "text_german" : "ein pilotprojekt zur kontrollierung der filariose in thailand in einem dorf im bezirk kanjanadit in der provinz surat-thani, südthailand, wo eine feldstation für filariasestudien von der bangkok school of tropical medicine eingerichtet worden war, wurden blutfilme von 977 personen (95,5 prozent der gesamtbevölkerung von 1.023 personen) untersucht. von jeder person wurden zwei dicke filme (je 20 c.mm.) präpariert und mit giemsa angefärbt. es wurde festgestellt, dass 21,1 prozent. der Personen beherbergten Mikrofilarien (alle Brugia malayi). elephantiasis wurde bei 5,3 Prozent der Bevölkerung gefunden. die Mikrofilarien-Periodizität wurde bei 25 Personen untersucht; in jedem Fall wurde festgestellt, dass sie ausgeprägt nächtlich ist. das blut von 98 katzen, 52 hunden und zwei affen wurde ebenfalls untersucht. es wurden keine b. malayi larven gefunden. stechmücken wurden gefangen und identifiziert. in einer ersten Untersuchung wurden 4.557 stechmücken untersucht, von denen 568 mansonia spp. waren. in 4.136 sektionen wurden b. malayi-Larven im stadium ii wurden in einem m. uniformis und im stadium iii in einem anderen gefunden; die Infektionsrate für m. uniformis lag bei 0,6 Prozent. in der letzten Phase der untersuchung wurden alle Häuser mit ddt besprüht. dies führte zu einem leichten rückgang der anzahl und des prozentsatzes der gefangenen mansonia-Mücken. diethylcarbamazin wurde so vielen Dorfbewohnern wie möglich in einer Dosis von 5 mgm. des Citratsalzes pro kgm. körpergewicht einmal wöchentlich über sechs Wochen verabreicht.  Die Blutuntersuchungen wurden einen Monat und ein Jahr nach Absetzen des Medikaments wiederholt. Es zeigte sich, dass der Anteil der Mikrofilaria-Träger von 21,1 Prozent auf 2,2 bzw. 2,2 Prozent, die Filariose-Infektionsrate von 26,1 Prozent auf 8,6 bzw. 8,5 Prozent zurückgegangen war, und die mittlere Mikrofilariendichte aller Filme von 4,8 pro 20 c.mm. Blut auf 0,48 und 0,12. Larven von b. malayi wurden in Mücken, die einen Monat und ein Jahr nach der Massentherapie seziert wurden, nicht gefunden."        },        "highlight" : {          "text_german" : [            "Fall wurde festgestellt, dass sie ausgeprägt nächtlich ist. das blut von 98 katzen, 52 hunden und zwei <em>affen</em>",            "wurde ebenfalls untersucht. es wurden keine b. malayi larven gefunden. <em>stechmücken</em> wurden gefangen und",            "identifiziert. in einer ersten Untersuchung wurden 4.557 <em>stechmücken</em> untersucht, von denen 568 mansonia",            "malayi-Larven im stadium ii wurden in einem m. uniformis und im stadium iii in einem anderen gefunden; die <em>Infektionsrate</em>",            "sich, dass der Anteil der Mikrofilaria-Träger von 21,1 Prozent auf 2,2 bzw. 2,2 Prozent, die Filariose-<em>Infektionsrate</em>"          ]},       "_score" : 15.94014}

Conclusion
On the one hand, the queries of the Medline corpus are in general very long and not representative of a standard search query. On the other hand, the overall score difference between these two approaches is not that big. We therefore concluded that the Dictionary Decompounder scoring better than the Hyphenation Decompounder will not happen often or to a notable extent, and we think it is okay to ignore this occurrence for our overall conclusion.

Acknowledgements:
Thanks to Kenny Hall for proofreading this article.

Written by Miriam Rupprecht, May 2021