Semantically Distinct Key Phrase Extraction


Working with the Hilbert space-filling curve is fascinating. Used as a hash function, it has prefix-matching properties: two hashes can be compared just like ZIP codes or PIN codes, where a longer shared prefix means the two points fall in the same, smaller region. This implies that any vector with non-negative coordinates can be hashed. Combining this with a typical vector transformation in NLP (say, word2vec) can generate one-dimensional embeddings.
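To make the prefix property concrete, here is a minimal sketch of a Hilbert hash for two-dimensional points. It assumes the coordinates are already scaled to non-negative integers on a 2^order by 2^order grid; real word vectors have many more dimensions, and the package may compute its hash differently. The function names `hilbert_index` and `hilbert_hash` are made up for this illustration.

```python
def hilbert_index(x, y, order):
    """Map grid cell (x, y) to its position along a 2D Hilbert curve.

    The grid has 2**order cells per side; x and y must be in [0, 2**order).
    """
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        # Each level contributes one base-4 digit (the quadrant visited).
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip so the sub-quadrant is traversed in the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def hilbert_hash(x, y, order):
    """Return the Hilbert index as a fixed-length base-4 string.

    Each digit names a quadrant at one level of the grid, so two points
    that share a prefix lie in the same nested quadrant.
    """
    d = hilbert_index(x, y, order)
    digits = []
    for _ in range(order):
        digits.append(str(d % 4))
        d //= 4
    return "".join(reversed(digits))


if __name__ == "__main__":
    print(hilbert_hash(0, 0, 3))  # "000"
    print(hilbert_hash(1, 1, 3))  # "002": shares the prefix "00" with (0, 0)
    print(hilbert_hash(7, 7, 3))  # "222": far corner, different first digit
```

Points that share a hash prefix always lie in the same quadrant, which is what makes the ZIP-code style comparison work. The converse is not guaranteed: two neighbouring points on opposite sides of a quadrant boundary can have different prefixes.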

To give an example, consider words for months (april, july, january) and words related to mathematics (mathematics, deep learning, machine learning). If we transform these words with a word vector embedding and compute a Hilbert hash for each, the result looks like the following figure.

Figure: Hilbert hash for various words from word2vec

This implies the following:

  • If the topics are different, the hash prefixes differ.
  • We can reduce all the words in an article to a set of hashes and find the most common prefixes they fall under.
  • The lookup table can be saved as a dictionary for easy reuse. The original word vectors are no longer needed.
  • A trie allows fast lookup under each subtree and can summarize or rank keywords.
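The lookup-table and trie ideas in the list above can be sketched as follows. This is an illustrative toy, not the package's code: the hash strings are invented for the example (they are not real word2vec hashes), and `TrieNode`, `build_trie`, and `lookup` are hypothetical names.

```python
class TrieNode:
    """One node per hash digit; counts how many words pass through it."""

    def __init__(self):
        self.children = {}
        self.count = 0
        self.words = []


def build_trie(word_hashes):
    """Build a trie from a {word: hash_string} lookup table."""
    root = TrieNode()
    for word, digits in word_hashes.items():
        node = root
        for digit in digits:
            node = node.children.setdefault(digit, TrieNode())
            node.count += 1
        node.words.append(word)
    return root


def words_under(node):
    """Collect every word stored in the subtree rooted at node."""
    found = list(node.words)
    for child in node.children.values():
        found.extend(words_under(child))
    return found


def lookup(root, prefix):
    """Return all words whose hash starts with the given prefix."""
    node = root
    for digit in prefix:
        node = node.children.get(digit)
        if node is None:
            return []
    return words_under(node)


if __name__ == "__main__":
    # Invented hashes: months share the prefix "03", maths terms share "21".
    table = {
        "april": "0312",
        "july": "0313",
        "january": "0310",
        "mathematics": "2101",
        "deep learning": "2103",
        "machine learning": "2110",
    }
    root = build_trie(table)
    print(sorted(lookup(root, "03")))  # the three month words
    print(root.children["2"].count)    # number of words under prefix "2"
```

Because only the word-to-hash table is consulted, the original word vectors are not needed once the table is built.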

Using this approach, the first thing we can do is generate distinct key phrases from a given article text. This problem is called key phrase extraction. What makes this approach unique is that the generated key phrases are semantically distinct.

Figure: Hashing process
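One way to picture how hash prefixes lead to semantically distinct key phrases is to group an article's words by prefix and keep a single representative per group. The sketch below does exactly that. The ranking rule (largest group first, most frequent word within each group) is an assumption made for illustration and may not match what the package does; the hashes are invented and `distinct_keywords` is a hypothetical name.

```python
from collections import Counter, defaultdict


def distinct_keywords(hashed_words, prefix_len=2, top_n=3):
    """Pick one keyword per hash-prefix group.

    hashed_words: list of (word, hash_string) pairs in article order,
    with repeats for repeated words.
    """
    groups = defaultdict(Counter)
    for word, digits in hashed_words:
        groups[digits[:prefix_len]][word] += 1

    # Rank groups by how many word occurrences they contain.
    ranked = sorted(groups.values(), key=lambda c: sum(c.values()), reverse=True)

    # One representative per group keeps the keywords semantically apart.
    return [counts.most_common(1)[0][0] for counts in ranked[:top_n]]


if __name__ == "__main__":
    article = [
        ("april", "0312"), ("july", "0313"), ("april", "0312"),
        ("machine learning", "2110"), ("deep learning", "2103"),
        ("machine learning", "2110"), ("machine learning", "2110"),
        ("river", "1200"),
    ]
    print(distinct_keywords(article, prefix_len=2, top_n=2))
    # ['machine learning', 'april']
```

Choosing a shorter `prefix_len` merges more words into each group and yields fewer, broader keywords; a longer one splits topics more finely.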

Output from the package

Figure: Distinct keywords sample output

Generalization and benchmarks

The approach can be generalized to any vector embedding technique and can be used for semantic sentence comparison or document comparison in an unsupervised setting. The current implementation uses a trie and a SortedDict, which makes it one of the fastest implementations. The approach does not require any training and achieved a 31% recall score in a benchmark on the KPTimes test data set (20,000 articles) against the manually assigned keywords. The same preprocessing and comparison were applied to KeyBERT with top_n set to 16.

Figure: KPTimes test data recall score
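For readers unfamiliar with the metric, recall here means the share of the manually assigned keywords that the extractor also produced. The sketch below shows one plausible way to compute it; the exact matching rules used in the benchmark (case folding, stemming, how scores are averaged over articles) are not stated above, so the choices here are assumptions, and `recall` and `mean_recall` are hypothetical names.

```python
def recall(predicted, gold):
    """Fraction of gold keywords that appear among the predicted ones.

    Matching is exact after lowercasing (an assumption for this sketch).
    """
    gold_set = {k.lower() for k in gold}
    if not gold_set:
        return 0.0
    predicted_set = {k.lower() for k in predicted}
    return len(predicted_set & gold_set) / len(gold_set)


def mean_recall(pairs):
    """Average per-article recall over (predicted, gold) pairs."""
    scores = [recall(predicted, gold) for predicted, gold in pairs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    predicted = ["Apple", "tax"]
    gold = ["apple", "ireland", "tax", "eu"]
    print(recall(predicted, gold))  # 0.5
```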

Supported languages

  1. English (default), using a custom word2vec model trained on simplewiki.
  2. German (in testing; needs support from native speakers).
  3. French (in testing; needs support from native speakers).
  4. Italian (in testing; needs support from native speakers).
  5. Portuguese (in testing; needs support from native speakers).
  6. Spanish (in testing; needs support from native speakers).



