VecZip is a novel compression method by DEJAN AI that reduces embedding dimensionality by retaining unique dimensions to improve AI performance and storage.
Machine learning models rely on embeddings to understand complex data like language and images. But these embeddings can be massive, creating huge bottlenecks for storage, processing, and speed. Traditional compression often strips away vital context. That is why DEJAN AI developed VecZip, a new approach designed to shrink embeddings without losing their meaning.
While standard techniques like Principal Component Analysis, or PCA, focus on dimensions with the highest variance, VecZip takes the opposite approach. It analyzes the data to find and keep the dimensions with the least commonality, preserving the most unique features. In practice, it can compress embeddings down to just sixteen dimensions.
This aggressive reduction shrinks file sizes by about fifty to one, drastically cutting storage and compute costs. But the real surprise is the performance. Tests show that VecZip actually improves accuracy on downstream tasks, like measuring sentence similarity. It also enhances real-world applications, from classifying search intent and clustering data to optimizing link recommendations.
By optimizing the essential features of embeddings, VecZip makes AI systems faster, cheaper, and more scalable.
Embeddings are vital for representing complex data in machine learning, enabling models to perform tasks such as natural language understanding and image recognition. However, these embeddings can be massive, creating challenges for storage, processing, and transmission. At DEJAN AI, we’ve developed VecZip, a novel approach that reduces file size without compromising data quality, with the goal of improving the quality of AI processes.
The Challenge of Large Embeddings
While traditional compression techniques can help reduce file size, they are not always optimized for the unique structure of embeddings, and they may not preserve essential semantic or contextual information. This is where VecZip excels.

VecZip Approach
VecZip is a compression method designed to reduce the dimensionality of embeddings while retaining the most salient information. It works by identifying and removing the less informative dimensions and keeping the most unique ones, that is, the dimensions with the least commonality.

This not only reduces embedding size but also improves the performance of AI models in downstream tasks.
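The article does not spell out how commonality is measured, so the following is only a minimal sketch under one assumption: a dimension's commonality is the share of rows that hold its most frequent value after rounding. The function names (`commonality`, `select_unique_dims`, `prune`) and the `decimals` parameter are hypothetical and are not the `dejan` package API.

```python
from collections import Counter

def commonality(column, decimals=1):
    """Share of rows holding the most frequent rounded value (assumed metric)."""
    counts = Counter(round(v, decimals) for v in column)
    return counts.most_common(1)[0][1] / len(column)

def select_unique_dims(embeddings, keep=16, decimals=1):
    """Return the indices of the `keep` dimensions with the lowest commonality."""
    columns = list(zip(*embeddings))  # transpose rows into columns
    scores = [commonality(col, decimals) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda i: scores[i])
    return sorted(ranked[:keep])

def prune(embeddings, dims):
    """Keep only the selected original dimensions of every embedding."""
    return [[row[i] for i in dims] for row in embeddings]

# Toy data: dimensions 0 and 2 are (nearly) constant, 1 and 3 vary.
embeddings = [
    [0.10, 0.91, 0.50, -0.30],
    [0.10, 0.12, 0.50, 0.45],
    [0.10, -0.55, 0.50, 0.45],
    [0.10, 0.33, 0.52, -0.80],
]
dims = select_unique_dims(embeddings, keep=2)
pruned = prune(embeddings, dims)
print(dims)    # indices of the retained dimensions
print(pruned)  # embeddings reduced to those dimensions
```

In this toy example the two near-constant dimensions are dropped and the two that differ most from row to row are kept. Because the retained values are original coordinates rather than recombined ones, the output stays directly comparable to the input columns.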
VecZip vs. PCA
In dimensionality reduction, PCA (Principal Component Analysis) is a commonly used technique. However, unlike PCA, which projects the data onto the directions of greatest variance across the entire dataset, VecZip emphasizes the least common dimensions.
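For contrast, here is a short sketch of standard PCA using NumPy's SVD on synthetic data. It illustrates the PCA side of the comparison only; the data, the helper name `pca_project`, and the choice of `k=2` are illustrative assumptions, not part of VecZip.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the k directions of greatest variance."""
    Xc = X - X.mean(axis=0)  # center each dimension
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    return Xc @ components.T, components

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
X[:, 0] *= 5.0  # give dimension 0 a much larger variance than the rest

reduced, components = pca_project(X, k=2)
print(reduced.shape)     # (200, 2)
print(components.shape)  # (2, 8): each new axis is a mix of all 8 inputs
```

Each PCA output axis is a weighted combination of every original dimension, chosen for variance. A selection-based method instead keeps a subset of the original dimensions unchanged, which is the distinction the comparison above relies on.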

Mode     LastWriteTime          Length      Name
----     -------------          ------      ----
-a----   9/12/2024 12:52 AM     246830957   embeddings.csv (235MB)
-a----   12/12/2024 9:15 PM     4584099     zipped-embeddings.csv (4.37MB)
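The compression ratio implied by this listing can be checked with a few lines of arithmetic. The byte counts are taken from the listing; the variable names are just for illustration.

```python
original_bytes = 246_830_957   # embeddings.csv
compressed_bytes = 4_584_099   # zipped-embeddings.csv
MIB = 1024 ** 2

ratio = original_bytes / compressed_bytes
saved = 1 - compressed_bytes / original_bytes

print(f"{original_bytes / MIB:.1f} MiB -> {compressed_bytes / MIB:.2f} MiB")
print(f"ratio {ratio:.1f}:1, {saved:.1%} smaller")
```

This works out to a ratio of roughly 54 to 1, in line with the "about fifty to one" figure quoted for VecZip.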
Test Results and Key Findings
To evaluate the effectiveness of VecZip, we conducted tests using the sentence-transformers/stsb dataset. We compared the results of using both original embeddings and compressed embeddings across a variety of tasks. Here are the most prominent results:

The top two rows show the VecZip-pruned embeddings for two sentences, with the originals below. This gives an intuitive sense of the impact the method has on file size.
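The article does not state which metric was used for the sentence-similarity comparison. A common way to score embeddings on a semantic textual similarity dataset is to compute the cosine similarity of each sentence pair and then take the Spearman rank correlation against the human-annotated scores. The sketch below assumes that setup; the helper names and the toy vectors are hypothetical and are not the reported results.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def sts_score(pairs, gold):
    """Spearman correlation between pairwise cosine similarity and gold scores."""
    predicted = [cosine(a, b) for a, b in pairs]
    return pearson(ranks(predicted), ranks(gold))

# Toy sentence-pair embeddings with made-up gold similarity scores.
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),  # nearly identical direction
    ([1.0, 0.0], [0.5, 0.5]),  # partly related
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
]
gold = [4.8, 2.5, 0.2]
score = sts_score(pairs, gold)
print(f"Spearman correlation: {score:.3f}")
```

Running `sts_score` once on the original embeddings and once on the compressed ones, with the same gold scores, gives a like-for-like comparison of how well each preserves the similarity ranking.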
Broader Applications
At DEJAN AI, we apply dimensionality reduction techniques to improve many aspects of our clients’ work.
VecZip is an important step in developing efficient AI tools. By optimizing the feature space of embeddings while improving downstream task performance, it paves the way for more scalable and performant AI systems.
We encourage the research and development community to explore the potential of VecZip, and we hope this approach enables further innovation in the field of machine learning.
pip install dejan
dejan veczip embeddings.csv zipped-embeddings.csv
I messed up the repo and took it down until I fix it up. The wheel-based install should be enough to take it for a spin. If you need any details, feel free to ping me.
Possible to do pip install from Git repository?
E.g : pip install git+https://github.com/….
At the moment the two installation options are:
pip install dejan
https://pypi.org/project/dejan/
or download the wheels:
https://pypi.org/project/dejan/#dejan-1.2-py3-none-any.whl
https://files.pythonhosted.org/packages/61/9f/bab08d11b175065fa24dbc0053b477280da9891fceb2f7751c921b4d79a1/dejan-1.2-py3-none-any.whl
What is the GitHub repository for ‘dejan’? I can’t find it on PyPI.