In the last few months, I’ve been giving some classes on Natural Language Processing (NLP). I decided to share some of my experiments here. I will not give you rocket science. I will cover the basics. And I will share my vision. You are warned!

One of the typical tasks in NLP is to process a text and understand which sequences of words comprise a multi-word expression. There are different definitions of what a multi-word expression (MWE) is. Here I will simply consider an MWE to be a sequence of words that are used together to refer to something, and that may or may not have compositional meaning. Sequences like “European Commission” are not just MWEs but also what NLP calls Named Entities (something we can refer to). But there are other sequences, like “short circuit”, whose meaning is not the sum of the senses of its words, or “cardiovascular disease”, whose meaning is exactly the sum of each word sense.

We want to find these kinds of expressions in a text. For now, we will focus only on bigrams (pairs of words), and we will experiment with a few different association measures.

DISCLAIMER: The code presented below is written in a way I expect to be readable and easy to understand, but it does not scale. First, as discussed in the next section, the library used for tokenization is not efficient. Second, my code loads the whole text into memory, which is a very bad approach for processing large corpora.

Computing Tokens

To start with, we need to obtain the text tokens. That means we need to split the text into words, but also into punctuation and the other things we usually find in text. I will not implement that here and will just use a Python library: Stanza. Stanza is a machine-learning-based module that can do much more than tokenization. And to be honest, if you only need tokenization, a library that is not based on machine learning will probably be more efficient.

import stanza

def tokenize(filename):
    # Read the whole file into memory (fine for small corpora only).
    with open(filename, "r") as fh:
        content = fh.read()
    # Build a Stanza pipeline with only the tokenizer, and run it over the text.
    pipeline = stanza.Pipeline(lang='en', processors='tokenize', use_gpu=True)
    return pipeline(content)
Python

This code defines a function that reads a file and passes its content through the Stanza pipeline, with only the tokenize processor enabled. The result behaves like a list of lists: one list per sentence, containing that sentence’s tokens. Each token is, itself, an object with several properties besides its text. For this post, only the text field will be used.

You can change the language and/or configure whether to use GPU acceleration.
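For example, assuming you have some plain-text file at hand (the file name below is just an illustration), you can inspect the tokenized sentences like this:

doc = tokenize("corpus.txt")   # hypothetical file name
for sentence in doc.sentences[:3]:
    print([word.text for word in sentence.words])
Python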

Preparing the Text

We will need to count the number of times each word occurs, and also the number of times each pair of words occurs. These counts will be used by the metrics below. To reduce the amount of memory used, we will keep frequencies only for proper words (in lowercase), that is, tokens that are sequences of letters. Any other token will be replaced by a generic <OTHER> token.

Note: for languages with richer morphology (like Portuguese), using the lemma of each word would be more useful, and even for English it might produce more interesting results. I decided not to include that in this post, for simplicity.

It will also be relevant to detect words occurring at the beginning or at the end of a sentence. For that, each sentence will start with a <S> tag and end with an </S> tag.

import re

def check_word(w):
    # Keep tokens made only of word characters, and discard any token containing a digit.
    return w if re.fullmatch(r'\w+', w) and not re.search(r'\d', w) else "<OTHER>"
Python

The function above checks whether a string is a sequence of word characters (the \w metacharacter, which covers letters, digits and the underscore). As I want to ignore any sequence containing a digit, I added the second regular expression to guarantee there is none. If you want to keep other kinds of tokens, like H2O, you will probably want to drop that second check.
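A few quick examples of its behaviour:

print(check_word("bank"))   # bank
print(check_word("1998"))   # <OTHER>  (contains digits)
print(check_word("..."))    # <OTHER>  (not a word)
Python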

def preprocess(text):
    # Lowercase and normalize each token, wrapping every sentence with <S> and </S> tags.
    return [["<S>"] + [check_word(w.text.lower()) for w in s.words] + ["</S>"]
            for s in text.sentences]
Python

Finally, the preprocess function takes the tokenized text and performs the required transformations. For all the Pythonistas out there: if this is not the best approach, please advise!
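For example, a sentence like “The ECB was established in 1998.” would be turned into something like:

["<S>", "the", "ecb", "was", "established", "in", "<OTHER>", "<OTHER>", "</S>"]
Python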

Computing Frequencies

We will require frequencies for words and bigrams. To make the process faster, we will iterate over the corpus only once and compute both frequency tables at the same time.

def count_frequencies(text):
    total_tokens = 0
    total_bigrams = 0
    tokens = {}
    bigrams = {}
    for sentence in text:
        # Do not count the <S> and </S> tags as tokens.
        total_tokens += len(sentence) - 2
        total_bigrams += len(sentence) - 1
        for pos in range(0, len(sentence) - 1):
            w1, w2 = sentence[pos], sentence[pos+1]
            # Skip the <S> tag when counting single tokens (</S> never appears as w1).
            if pos > 0:
                tokens[w1] = tokens.get(w1, 0) + 1
            bigrams[(w1, w2)] = bigrams.get((w1, w2), 0) + 1
    return tokens, total_tokens, bigrams, total_bigrams
Python

Some notes on this code:

  • we are computing the frequency tables and the total counts at the same time
  • we are not counting the start/end of sentence tags as tokens (hence the len(sentence) - 2 and the if pos > 0)

Top Occurring Bigrams

A first option for choosing probable MWEs could be looking at the occurrence count of each bigram. To do that, let’s define an auxiliary function that sorts a frequency dictionary by occurrence count, in descending order, and returns the top n entries:

def top_occurring(freqs, top_n=20):
    sorted_list = sorted(freqs.items(), key=lambda p: p[1], reverse=True)
    return sorted_list[:top_n]
Python

In this function we receive the dictionary, take each (key, value) pair, and sort by the second element of the pair, in reverse order. Then, we return the first top_n elements.
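Putting the pieces together, the pipeline so far might look like this (the corpus file name is just an example):

doc = tokenize("ecb_corpus.txt")   # hypothetical file name
sentences = preprocess(doc)
tokens, total_tokens, bigrams, total_bigrams = count_frequencies(sentences)

for bigram, count in top_occurring(bigrams):
    print(f"{bigram} => {count}")
Python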

The results will be something like this:

('<OTHER>', '<OTHER>') => 179611
('<OTHER>', '</S>') => 66316
('of', 'the') => 42825
('<OTHER>', 'the') => 19887
('<S>', 'the') => 17319
('in', 'the') => 16538
('to', 'the') => 13735
('article', '<OTHER>') => 11082
('<OTHER>', 'and') => 10951
('the', 'ecb') => 10638
('the', 'european') => 9586
('on', 'the') => 8975
('<OTHER>', 'of') => 8331
('by', 'the') => 8255
('for', 'the') => 7799
('and', 'the') => 7772
('ecb', '<OTHER>') => 7589
('of', '<OTHER>') => 6535
('<OTHER>', 'in') => 6509
('and', '<OTHER>') => 6298

The table above was computed from a corpus about the European Central Bank, which explains why “the ecb” occurs so often. That is also why “article <OTHER>” is present in the table: it comes from the legislation included in this corpus. Overall, though, these results are not useful.

For this specific analysis, we will create a function that clones the bigram table but removes any bigram that includes <OTHER>, <S>, </S> or any stop word. A stop word is a highly frequent word; the exact list varies a lot from author to author. We will use a hand-made stop word list, obtained from one of the many websites that claim to have a list of the most common English words. To make the code simpler, I also added the three special tags to the list.

def stop_words():
    return {"the", "be", "to", "of", "and", "a", "in", "that", "have", "I", "it", "for", 
            "not", "on", "with", "he", "as", "you", "do", "at", "this", "but", "his", "by", 
            "from", "they", "we", "say", "her", "she", "or", "an", "will", "my", "one", 
            "all", "would", "there", "their", "what", "so", "up", "out", "if", "about",
            "who", "get", "which", "go", "me", "when", "make", "can", "like", "time",
            "no", "just", "him", "know", "take", "person", "into", "year", "your", "good",
            "some", "could", "them", "see", "other", "than", "then", "now", "look", "only",
            "come", "its", "over", "think", "also", "back", "after", "use", "two", "how",
            "our", "work", "first", "well", "way", "even", "new", "want", "because", "any",
            "these", "give", "day", "most", "us", "<S>", "</S>", "<OTHER>"}
Python

Now for the function that creates the cleaned version of the bigram counts.
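A minimal sketch of what it might look like, reusing the stop_words set defined above (the name clean_bigrams is mine):

def clean_bigrams(bi_freqs):
    # Drop every bigram that contains a stop word or one of the special tags.
    stops = stop_words()
    return {bigram: count for bigram, count in bi_freqs.items()
            if bigram[0] not in stops and bigram[1] not in stops}
Python

Running top_occurring over this cleaned table, the top bigrams look much more like proper MWEs: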

('euro', 'area') => 5669
('central', 'bank') => 5337
('member', 'states') => 5257
('european', 'central') => 3456
('central', 'banks') => 3189
('monetary', 'policy') => 2756
('governing', 'council') => 2731
('national', 'central') => 2102
('european', 'union') => 2071
('member', 'state') => 2054
('credit', 'institutions') => 1648
('european', 'parliament') => 1549
('interest', 'rates') => 1483
('balance', 'sheet') => 1480
('participating', 'member') => 1347
('general', 'government') => 1335
('interest', 'rate') => 1331
('navigation', 'path') => 1300
('legal', 'framework') => 1243
('has', 'been') => 1231

The Jaccard Index

The Jaccard Index is used to measure the similarity between finite sample sets. It can be used to understand if the distribution of two words is similar. It is defined mathematically as:

$$J(A,B) = \frac{\left| A\cap B\right|}{\left| A\cup B\right|}=\frac{\left|A\cap B\right|}{\left|A\right| + \left|B\right| - \left|A\cap B\right|}$$

In this formula, we have:

  • \(\left|A \cap B\right| = C(w_1, w_2)\) is the number of times the two words \(w_1\) and \(w_2\) appeared together in a bigram (in this specific order)
  • \(\left|A\right| = C(w_1, ?)\) is the number of times the word \(w_1\) appeared as the first word of any bigram
  • \(\left|B\right| = C(?, w_2)\) is the number of times the word \(w_2\) appeared as the second word of any bigram

NOTE: it might look strange that I compute \(\left|A\right|\) as \(C(w_1, ?)\) instead of just \(C(w_1)\). The reason is simple: \(A\) is the observation of \(w_1\) being the first word of a bigram; \(B\) is the observation of \(w_2\) being the second word of a bigram; and \(A\cap B\) the observation of finding the bigram \((w_1,w_2)\).
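As a toy example with made-up counts: if \(C(w_1,w_2)=5\), \(C(w_1,?)=10\) and \(C(?,w_2)=8\), then \(J(w_1,w_2)=\frac{5}{10+8-5}=\frac{5}{13}\approx 0.38\).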

To make it possible to compute this metric for every bigram, we need to count the number of times each word is used in a bigram in the first or the second position. For that, we will use the following functions:

def marginal_counts(freqs):
    # For each word, count how often it appears as the first (left) or second (right) element of a bigram.
    left_set = {}
    right_set = {}
    for w1, w2 in freqs.keys():
        left_set[w1] = left_set.get(w1, 0) + freqs[(w1, w2)]
        right_set[w2] = right_set.get(w2, 0) + freqs[(w1, w2)]
    return left_set, right_set

def jaccard(freqs):
    # J(w1, w2) = C(w1, w2) / (C(w1, ?) + C(?, w2) - C(w1, w2))
    left_set, right_set = marginal_counts(freqs)
    return {
        (w1, w2): freqs[(w1, w2)] / (left_set[w1] + right_set[w2] - freqs[(w1, w2)])
        for w1, w2 in freqs.keys()
    }
Python

The first function computes \(C(w_1,?)\) and \(C(?,w_2)\). The second uses it and then computes the Jaccard coefficient for each bigram. If we run this code and print the top-scoring pairs, we get a disappointing result:

('tue', 'wed') => 1.0
('wed', 'thu') => 1.0
('thu', 'fri') => 1.0
('fri', 'sat') => 1.0
('sat', 'sun') => 1.0
('sulle', 'assicurazioni') => 1.0
('interesse', 'collettivo') => 1.0
('fondi', 'pensione') => 1.0
('ufficio', 'italiano') => 1.0
('comitato', 'interministeriale') => 1.0
('suomen', 'pankki') => 1.0
('gospodarstwa', 'krajowego') => 1.0
('magyar', 'nemzeti') => 1.0
('naţional', 'ă') => 1.0
('lietuvos', 'bankas') => 1.0
('българска', 'народна') => 1.0
('народна', 'банка') => 1.0
('regolamento', 'lordo') => 1.0
('grandes', 'transacções') => 1.0
('règlement', 'intérieur') => 1.0

This list shows weird pairs, especially considering the corpus is (or should be) in English. These are low-occurring bigrams, where each word occurs only in this specific expression. Thus, \(C(w_1,w_2) = C(w_1,?) = C(?,w_2)\), and therefore \(J(w_1,w_2)=1\).

Better results can be obtained by mixing this information with the raw frequencies obtained in the previous section, and it is also possible to combine this method with the ones described below. Nevertheless, a simple and interesting tweak to the jaccard function is to discard situations where the occurrence counts of \(w_1\) and \(w_2\) are very close:

def jaccard(u_freqs, bi_freqs, threshold=5):
    left_set, right_set = marginal_counts(bi_freqs)
    return {
        (w1, w2): bi_freqs[(w1, w2)] / (left_set[w1] + right_set[w2] - bi_freqs[(w1, w2)])
        for w1, w2 in bi_freqs.keys()
        # Keep only bigrams whose words have sufficiently different unigram counts.
        if w1 in u_freqs and w2 in u_freqs and abs(u_freqs[w1] - u_freqs[w2]) > threshold
    }
Python

Note that now we also receive the frequencies of single words, so we can compare them. The results of the code above on my corpus are:

('navigation', 'path') => 0.9565857247976454
('inter', 'alia') => 0.8411552346570397
('sveriges', 'riksbank') => 0.8132530120481928
('wise', 'men') => 0.75
('otmar', 'issing') => 0.7391304347826086
('van', 'belgië') => 0.6842105263157895
('official', 'journal') => 0.6669341894060995
('frankfurt', 'am') => 0.6591760299625468
('big', 'bang') => 0.6521739130434783
('finnish', 'markka') => 0.6511627906976745
('centrale', 'du') => 0.6335877862595419
('member', 'states') => 0.6205878880887734
('laid', 'down') => 0.5927995034140285
('decimal', 'places') => 0.58
('cashtester', 'ct') => 0.5714285714285714
('pro', 'rata') => 0.5686274509803921
('oesterreichische', 'nationalbank') => 0.5664739884393064
('ad', 'hoc') => 0.5641025641025641
('united', 'kingdom') => 0.5612244897959183
('ser', 'vices') => 0.5555555555555556

In these examples, some MWEs appear, like navigation path, wise men (OK, arguable), member states, decimal places, united kingdom, and the Latin expressions pro rata and ad hoc.

Sørensen–Dice Coefficient

The Sørensen–Dice coefficient is another measure based on the set sizes used above for the Jaccard index. It is also very similar to the F1 score you might know.

$$SDC(A,B) = \frac{2 \left| A \cap B \right|}{\left|A\right| + \left|B\right|}$$

Compared with the Jaccard index, instead of subtracting the number of co-occurrences from the sum of the individual occurrences in the denominator, the co-occurrences in the numerator are doubled. Thus, the two formulas are related, but the results are a little different.
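In fact, the two coefficients can be converted into one another, so a higher Jaccard index always corresponds to a higher Sørensen–Dice coefficient:

$$SDC = \frac{2J}{1+J} \qquad\qquad J = \frac{SDC}{2-SDC}$$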

Again, if the three sets have the same cardinality, we get \(SDC = 1\). Therefore, I will implement the function with the same tweak used above, considering only bigrams where the occurrence counts of the two components are somewhat different.

def sdc(u_freqs, bi_freqs, threshold=5):
    left_set, right_set = marginal_counts(bi_freqs)
    return {
        (w1, w2): 2 * bi_freqs[(w1, w2)] / (left_set[w1] + right_set[w2])
        for w1, w2 in bi_freqs.keys()
        if w1 in u_freqs and w2 in u_freqs and abs(u_freqs[w1] - u_freqs[w2]) > threshold
    }
Python

For my corpus, the top bigrams are the same, although with different scores; given the conversion formulas above, the ranking could not change. Note also that the Sørensen–Dice distance does not satisfy the triangle inequality, while the Jaccard distance does.

('navigation', 'path') => 0.9778112072207596
('inter', 'alia') => 0.9137254901960784
('sveriges', 'riksbank') => 0.8970099667774086
('wise', 'men') => 0.8571428571428571
('otmar', 'issing') => 0.85
('van', 'belgië') => 0.8125
('official', 'journal') => 0.8001925854597978
('frankfurt', 'am') => 0.7945823927765236
('big', 'bang') => 0.7894736842105263
('finnish', 'markka') => 0.7887323943661971
('centrale', 'du') => 0.7757009345794392
('member', 'states') => 0.7658799533799534
('laid', 'down') => 0.7443491816056118
('decimal', 'places') => 0.7341772151898734
('cashtester', 'ct') => 0.7272727272727273
('pro', 'rata') => 0.725
('oesterreichische', 'nationalbank') => 0.7232472324723247
('ad', 'hoc') => 0.7213114754098361
('united', 'kingdom') => 0.7189542483660131
('ser', 'vices') => 0.7142857142857143

There are other similar metrics based on event cardinalities, but the results are comparable.

Pointwise Mutual Information

The final metric I would like to talk about is Pointwise Mutual Information (PMI), a measure used in Information Theory that quantifies the association between two specific events (in contrast with Mutual Information, which averages over all the possible events of two random variables). PMI is defined mathematically as:

$$PMI(x; y) = \log\frac{p(x,y)}{p(x)p(y)} = \log \frac{p(x|y)}{p(x)}$$

While we will use the first equality, the second is interesting to be aware of, as conditional probabilities are pervasive in NLP. Computing this metric with the code we developed so far is quite easy:

import math

def pmi(u_freqs, u_total, bi_freqs, bi_total, threshold=5):
    # Probability estimates for bigrams and single words.
    bi_prob = lambda x: bi_freqs[x] / bi_total
    u_prob = lambda x: u_freqs[x] / u_total
    return {
        (w1, w2): math.log2(bi_prob((w1, w2)) / (u_prob(w1) * u_prob(w2)))
        for w1, w2 in bi_freqs.keys()
        if w1 in u_freqs and w2 in u_freqs and abs(u_freqs[w1] - u_freqs[w2]) > threshold
    }
Python

To make this code more legible, I created two small helper functions (as lambdas): one computes the probability of a bigram, the other the probability of a single word. This way, the returned value is a simple logarithm of the ratio between the desired probabilities.
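As before, the scores can be ranked with the top_occurring helper, using the counts returned by count_frequencies:

scores = pmi(tokens, total_tokens, bigrams, total_bigrams)
for bigram, score in top_occurring(scores):
    print(f"{bigram} => {score}")
Python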

The results for this method are very different from the two above. PMI is known to give very high scores to rare events, and the top of the list is again dominated by bigrams that occur only a handful of times:

('founding', 'fathers') => 18.557323666461535
('por', 'tfolio') => 18.557323666461535
('michael', 'wilford') => 18.557323666461535
('nitra', 'castle') => 18.557323666461535
('magnifying', 'glasses') => 18.557323666461535
('slanted', 'roof') => 18.557323666461535
('albert', 'schweitzer') => 18.557323666461535
('keith', 'strasse') => 18.557323666461535
('c19', 'musashi') => 18.557323666461535
('jetscan', 'sentinel') => 18.557323666461535
('zum', 'fruchthof') => 18.557323666461535
('cashcontrol', 'industriestr') => 18.557323666461535
('eu55', 'cashcontrol') => 18.557323666461535
('bd', 'arago') => 18.557323666461535
('ministrstvo', 'za') => 18.557323666461535
('za', 'pravosodje') => 18.557323666461535
('sektor', 'za') => 18.557323666461535
('za', 'izvrsevanje') => 18.557323666461535
('pontifical', 'guard') => 18.557323666461535
('fortified', 'castle') => 18.557323666461535

Conclusion

While the results presented above do not seem the most interesting, both the Jaccard index and Pointwise Mutual Information are widely used in Natural Language Processing, and not just to compute association between words. Nevertheless, they can be used for this, and the results are not to be discarded.

The examples above were computed with a small and noisy corpus. A simple example of this poor quality is “por tfolio” appearing in the last block as two distinct words, when it should be a single word, “portfolio”.

In future posts, I might come back to this and enrich the methods we discussed with more engineering.

The code for this (and other) posts will be available here: https://github.com/ambs/blog-code

If you think this is useful, let me know. It might make me write more of this. If you have a suggestion to make it better (either the code or the post), please share your positive ideas. If you want to criticize, just write your own blog! Thanks!
