Better Vector Clustering With Head Noun Extraction

Let’s do a mental exercise.

Glance over the following list and group them in your mind:

  1. blue thermal socks
  2. cheap diesel bulldozer
  3. cheap gaming laptops
  4. blue rental bulldozer
  5. cheap ankle socks
  6. used cushioned socks
  7. blue lightweight laptops
  8. cheap striped socks
  9. used touchscreen laptops
  10. blue compact bulldozer
  11. cheap business laptops
  12. blue ultraportable laptops
  13. used electric bulldozer
  14. cheap mini bulldozer
  15. blue compression socks

Most people arrive at the following clustering schema:

| Socks | Laptops | Bulldozers |
|---|---|---|
| blue thermal socks | cheap gaming laptops | cheap diesel bulldozer |
| cheap ankle socks | blue lightweight laptops | blue rental bulldozer |
| used cushioned socks | used touchscreen laptops | blue compact bulldozer |
| cheap striped socks | cheap business laptops | used electric bulldozer |
| blue compression socks | blue ultraportable laptops | cheap mini bulldozer |

What would a machine do?

Let’s find out.

We’ll vectorise these search queries using Embedding Gemma:

| dim | 0 | 1 | ... | 255 |
|---|---|---|---|---|
| query 1 | 0.01809046 | 0.014781968 | ... | -0.09089249 |
| query 2 | 0.036337394 | 0.06969773 | ... | 0.0038870324 |
| ...etc | | | | |

Note: In the above example we’re using MRL (Matryoshka Representation Learning) truncation to 256 dimensions to reduce dimensionality.
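
As a rough sketch of this step (assuming the sentence-transformers library and the google/embeddinggemma-300m checkpoint; the exact model id and the `truncate_dim` setting are assumptions, not something stated above), the vectorisation might look like this:

```python
from sentence_transformers import SentenceTransformer

queries = [
    "blue thermal socks", "cheap diesel bulldozer", "cheap gaming laptops",
    "blue rental bulldozer", "cheap ankle socks", "used cushioned socks",
    "blue lightweight laptops", "cheap striped socks", "used touchscreen laptops",
    "blue compact bulldozer", "cheap business laptops", "blue ultraportable laptops",
    "used electric bulldozer", "cheap mini bulldozer", "blue compression socks",
]

# truncate_dim keeps only the first 256 Matryoshka (MRL) dimensions.
# (Assumed model id; any MRL-trained embedding model behaves the same way.)
model = SentenceTransformer("google/embeddinggemma-300m", truncate_dim=256)

embeddings = model.encode(queries, normalize_embeddings=True)
print(embeddings.shape)  # (15, 256)
```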

After that we’ll cluster them by the similarity of their embeddings. In this specific example we’ll use a FAISS IVF index, which builds implicit clusters represented as Voronoi cells, each with its own “topical centroid”.
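
A minimal sketch of that step, assuming a FAISS IVF index (`IndexIVFFlat`) and reusing the `queries` list and `embeddings` array from the previous snippet; three Voronoi cells is an assumption for this toy set, not a value from the article:

```python
import faiss
import numpy as np

d = embeddings.shape[1]           # 256 after MRL truncation
nlist = 3                         # number of Voronoi cells (assumed for this toy example)

quantizer = faiss.IndexFlatIP(d)  # inner product == cosine on normalized vectors
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)

xb = np.asarray(embeddings, dtype="float32")
index.train(xb)                   # k-means over the vectors learns the cell centroids
index.add(xb)

# Each query's implicit cluster is the Voronoi cell (centroid) it is assigned to.
_, cell_ids = quantizer.search(xb, 1)
for query, cell in zip(queries, cell_ids.ravel()):
    print(cell, query)
```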

And you end up with a grouping like this:

| ??? | ??? | ??? |
|---|---|---|
| cheap ankle socks | blue thermal socks | used cushioned socks |
| cheap striped socks | blue compression socks | used touchscreen laptops |
| cheap gaming laptops | blue lightweight laptops | used electric bulldozer |
| cheap business laptops | blue ultraportable laptops | |
| cheap diesel bulldozer | blue rental bulldozer | |
| cheap mini bulldozer | blue compact bulldozer | |

What happened?

Instead of grouping by head noun (socks, laptops, bulldozers), the queries were grouped by their adjectives: cheap, blue, and used.

Standard embeddings create a “semantic soup.” The vector for “cheap laptop” is a mathematical average of “cheap” and “laptop.” Because “cheap” is a very strong concept, it pulls the vector towards other “cheap” things, ignoring the physical object.
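
A quick way to see this effect is to compare similarities directly. This is a hedged sketch reusing the same (assumed) embedding model as above; the specific query pairs are illustrative, not taken from the article:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("google/embeddinggemma-300m", truncate_dim=256)  # assumed model id

# If the "cheap" signal dominates, the first pair may score surprisingly close
# to (or even above) the second, despite having different head nouns.
pairs = [
    ("cheap gaming laptops", "cheap ankle socks"),         # same adjective, different noun
    ("cheap gaming laptops", "used touchscreen laptops"),  # different adjective, same noun
]
for a, b in pairs:
    sim = util.cos_sim(model.encode(a), model.encode(b)).item()
    print(f"{a!r} vs {b!r}: {sim:.3f}")
```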

Obviously it’s not all as simple as the above example. Our large-scale NLP analysis of search queries reveals a wide variety of part-of-speech patterns (a sketch of this kind of POS tagging follows the table):

| pattern | freq |
|---|---|
| ADJ NOUN NOUN | 45154 |
| NOUN NOUN NOUN | 28902 |
| NOUN NOUN | 25469 |
| ADJ NOUN NOUN NOUN | 25036 |
| ADJ NOUN | 14539 |
| NOUN NOUN NOUN NOUN | 11848 |
| NOUN | 6732 |
| ADJ NOUN NOUN NOUN NOUN | 5403 |
| ADJ ADJ NOUN NOUN | 4033 |
| NOUN ADJ NOUN NOUN | 3684 |
| NOUN VERB NOUN | 3492 |
| NOUN ADJ NOUN | 3367 |
| ADJ ADJ NOUN | 3304 |
| ADJ NOUN VERB NOUN | 2968 |
| ADJ NOUN ADJ NOUN | 2726 |
| NOUN NOUN VERB | 2137 |
| ADV NOUN | 2063 |
| ADJ NOUN VERB | 2037 |
| NOUN NOUN VERB NOUN | 2001 |
| NOUN VERB | 1898 |
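
As a minimal sketch of how pattern frequencies like those above can be counted (assuming spaCy and its small English model; this is not necessarily the pipeline used for the numbers in the table):

```python
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

# In practice this would be the full query corpus, not three examples.
queries = ["blue thermal socks", "cheap gaming laptops", "used electric bulldozer"]

pattern_counts = Counter()
for doc in nlp.pipe(queries):
    # Join the coarse POS tags of each token into one pattern string, e.g. "ADJ NOUN NOUN".
    pattern = " ".join(tok.pos_ for tok in doc)
    pattern_counts[pattern] += 1

for pattern, freq in pattern_counts.most_common():
    print(pattern, freq)
```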

So what do we do?

To be continued…

