Let’s do a mental exercise.
Glance over the following list and group the items in your mind:
- blue thermal socks
- cheap diesel bulldozer
- cheap gaming laptops
- blue rental bulldozer
- cheap ankle socks
- used cushioned socks
- blue lightweight laptops
- cheap striped socks
- used touchscreen laptops
- blue compact bulldozer
- cheap business laptops
- blue ultraportable laptops
- used electric bulldozer
- cheap mini bulldozer
- blue compression socks
Most people arrive at the following clustering schema:
| Socks | Laptops | Bulldozers |
|---|---|---|
| blue thermal socks | cheap gaming laptops | cheap diesel bulldozer |
| cheap ankle socks | blue lightweight laptops | blue rental bulldozer |
| used cushioned socks | used touchscreen laptops | blue compact bulldozer |
| cheap striped socks | cheap business laptops | used electric bulldozer |
| blue compression socks | blue ultraportable laptops | cheap mini bulldozer |
What would a machine do?
Let’s find out.
We’ll vectorise these search queries using EmbeddingGemma. The output is one row of numbers per query, with columns 0 to 255:
0,1,...,255
0.01809046,0.014781968,...,-0.09089249
0.036337394,0.06969773,...,0.0038870324
...etc
Note: in the above example we’re using Matryoshka Representation Learning (MRL) to truncate each embedding to 256 dimensions.
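As a rough illustration of what MRL truncation means in practice, the sketch below keeps the first 256 components of a vector and rescales the result to unit length. The function name `truncate_mrl`, the 768-value input size, and the synthetic vector are assumptions made for this example; they are not taken from the pipeline described here.

```python
import math

def truncate_mrl(vector, dims=256):
    """Keep the first `dims` components and rescale to unit length."""
    head = list(vector[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]

# Synthetic stand-in for a full-size embedding (assumed to have 768 values).
full = [math.sin(i + 1) for i in range(768)]

short = truncate_mrl(full, dims=256)
print(len(short))  # prints 256
```

Rescaling after truncation keeps cosine similarity and dot product interchangeable in later steps.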
After that we’ll cluster them by the similarity of their embeddings. In this specific example we’ll use a FAISS index, which builds implicit clusters represented as Voronoi cells, each with its own “topical centroid”.
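To make the idea of Voronoi cells and centroids concrete, here is a minimal sketch in plain Python. It is a small spherical k-means, not FAISS itself: every vector is assigned to its most similar centroid, and each centroid is then moved to the mean of its cell. The helper names, the two-dimensional toy points, and the choice of seeds are all assumptions made for illustration.

```python
import math

def normalise(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nearest(v, centroids):
    """Index of the centroid with the highest cosine similarity to v."""
    return max(range(len(centroids)), key=lambda i: dot(v, centroids[i]))

def kmeans(vectors, seeds, iterations=10):
    """Spherical k-means: each centroid owns one Voronoi cell."""
    centroids = [normalise(s) for s in seeds]
    for _ in range(iterations):
        cells = [[] for _ in centroids]
        for v in vectors:
            cells[nearest(v, centroids)].append(v)
        for i, cell in enumerate(cells):
            if cell:  # keep the old centroid if its cell is empty
                mean = [sum(col) / len(cell) for col in zip(*cell)]
                centroids[i] = normalise(mean)
    return centroids, [nearest(v, centroids) for v in vectors]

# Toy 2-D "embeddings": two points near the x axis, two near the y axis.
points = [normalise(p) for p in [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]]
centroids, labels = kmeans(points, seeds=[points[0], points[2]])
print(labels)  # prints [0, 0, 1, 1]
```

With real query embeddings the input would be the 256-dimensional vectors, and the number of seeds would set how many groups come out.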

And you end up with a grouping like this:
| ? | ? | ? |
|---|---|---|
| cheap ankle socks | blue thermal socks | used cushioned socks |
| cheap striped socks | blue compression socks | used touchscreen laptops |
| cheap gaming laptops | blue lightweight laptops | used electric bulldozer |
| cheap business laptops | blue ultraportable laptops | |
| cheap diesel bulldozer | blue rental bulldozer | |
| cheap mini bulldozer | blue compact bulldozer | |
What happened?
We ended up with head nouns grouped by adjectives.
Standard embeddings create a “semantic soup.” The vector for “cheap laptop” behaves roughly like an average of “cheap” and “laptop.” Because “cheap” is a very strong concept, it pulls the vector towards other “cheap” things, largely ignoring the physical object.
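The averaging intuition can be shown with a toy calculation. The four-dimensional word vectors below are invented for this illustration and do not come from any model: the adjectives are simply given a larger magnitude than the nouns, which is the assumption being tested.

```python
import math

def average(a, b):
    return [(x + y) / 2 for x, y in zip(a, b)]

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

# Invented toy vectors: adjectives are "stronger" (length 2) than nouns (length 1).
WORDS = {
    "cheap":  [2.0, 0.0, 0.0, 0.0],
    "blue":   [0.0, 2.0, 0.0, 0.0],
    "laptop": [0.0, 0.0, 1.0, 0.0],
    "socks":  [0.0, 0.0, 0.0, 1.0],
}

cheap_laptop = average(WORDS["cheap"], WORDS["laptop"])
cheap_socks = average(WORDS["cheap"], WORDS["socks"])
blue_laptop = average(WORDS["blue"], WORDS["laptop"])

same_adjective = cosine(cheap_laptop, cheap_socks)
same_noun = cosine(cheap_laptop, blue_laptop)
print(round(same_adjective, 2), round(same_noun, 2))  # prints 0.8 0.2
```

Under these assumed magnitudes, “cheap laptop” sits much closer to “cheap socks” than to “blue laptop”, which mirrors the grouping in the table above.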
Obviously, real queries are not all as simple as the above example. Our large-scale NLP analysis of search queries reveals a wide variety of part-of-speech patterns:
| pattern | freq |
|---|---|
| ADJ NOUN NOUN | 45154 |
| NOUN NOUN NOUN | 28902 |
| NOUN NOUN | 25469 |
| ADJ NOUN NOUN NOUN | 25036 |
| ADJ NOUN | 14539 |
| NOUN NOUN NOUN NOUN | 11848 |
| NOUN | 6732 |
| ADJ NOUN NOUN NOUN NOUN | 5403 |
| ADJ ADJ NOUN NOUN | 4033 |
| NOUN ADJ NOUN NOUN | 3684 |
| NOUN VERB NOUN | 3492 |
| NOUN ADJ NOUN | 3367 |
| ADJ ADJ NOUN | 3304 |
| ADJ NOUN VERB NOUN | 2968 |
| ADJ NOUN ADJ NOUN | 2726 |
| NOUN NOUN VERB | 2137 |
| ADV NOUN | 2063 |
| ADJ NOUN VERB | 2037 |
| NOUN NOUN VERB NOUN | 2001 |
| NOUN VERB | 1898 |
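A frequency table like the one above could be produced roughly as follows: tag every token in a query, join the tags into a pattern string, and count the patterns. In this sketch a tiny hand-written lookup (`TOY_TAGS`, hypothetical) stands in for a real part-of-speech tagger, and any unknown word is assumed to be a noun, so the counts are illustrative only.

```python
from collections import Counter

# Hypothetical mini-lexicon standing in for a real POS tagger.
TOY_TAGS = {"cheap": "ADJ", "blue": "ADJ", "used": "ADJ"}

def pattern(query, tags=TOY_TAGS):
    """Map a query to its tag sequence, e.g. 'ADJ NOUN NOUN'."""
    # Assumption: any word missing from the lexicon is treated as a noun.
    return " ".join(tags.get(token, "NOUN") for token in query.lower().split())

queries = [
    "cheap gaming laptops",
    "blue rental bulldozer",
    "cheap ankle socks",
    "gaming laptops",
    "laptops",
]

counts = Counter(pattern(q) for q in queries)
for name, freq in counts.most_common():
    print(f"| {name} | {freq} |")
```

A real analysis would replace the lookup with a trained tagger, since words such as “gaming” or “used” can take different tags depending on context.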
So what do we do?
To be continued…
