Embeddings are vital for representing complex data in machine learning, enabling models to perform tasks such as natural language understanding and image recognition. However, these embeddings can be massive, creating challenges for storage, processing, and transmission. At DEJAN AI, we’ve developed VecZip, a novel approach that reduces embedding file size without compromising data quality, and in doing so can even improve the quality of downstream AI processes.
The Challenge of Large Embeddings
While traditional compression techniques can help reduce file size, they are not always suited to the unique structure of embeddings, and they may fail to preserve essential semantic or contextual information. This is where VecZip excels.
VecZip Approach
VecZip is a compression method designed to reduce the dimensionality of embeddings while retaining the most salient information. It works by identifying and removing dimensions that carry little distinguishing information and keeping those that are the most unique, i.e. the dimensions with the least commonality across samples.
This not only reduces embedding sizes but can also improve performance when the embeddings are used in downstream tasks.
- Dimensionality Analysis: VecZip analyzes the distribution of values in each dimension across all samples. Dimensions with high commonality are considered less informative.
- Feature Selection: VecZip retains the dimensions with the least commonality, effectively keeping the most unique aspects of the embeddings. In our current implementation, we target a reduction to just 16 dimensions (see the sketch after this list).
- Compressed Representation: The result is a compact representation of the original data, with minimal loss of critical information and an overall reduced file size.
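To make the procedure concrete, here is a minimal NumPy sketch of the idea. It is not the dejan library’s actual implementation: it assumes that “commonality” can be approximated by per-dimension variance across samples, and the prune_embeddings function, the 768-dimensional toy data, and the random inputs are illustrative only.

```python
import numpy as np

def prune_embeddings(embeddings: np.ndarray, keep_dims: int = 16) -> np.ndarray:
    """Keep the `keep_dims` dimensions with the least commonality.

    "Commonality" is approximated here as low variance: a dimension whose
    values barely change from sample to sample carries little distinguishing
    information, so it is dropped.
    """
    variances = embeddings.var(axis=0)                   # per-dimension spread across samples
    keep = np.sort(np.argsort(variances)[-keep_dims:])   # most varied dimensions, in original order
    return embeddings[:, keep]

# Toy example: 1,000 samples of 768-dimensional embeddings reduced to 16 dimensions.
original = np.random.randn(1000, 768).astype(np.float32)
compressed = prune_embeddings(original, keep_dims=16)
print(original.shape, "->", compressed.shape)            # (1000, 768) -> (1000, 16)
```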
VecZip vs. PCA
In the context of dimensionality reduction, PCA (Principal Component Analysis) is a commonly used technique. Unlike PCA, which projects the data onto new axes that capture the most variance across the entire dataset, VecZip selects the original dimensions with the least commonality (a minimal code comparison is sketched below).
- PCA (Left): Performs better at light to moderate dimensionality reduction.
- VecZip (Right): Performs better at aggressive reduction.
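For intuition, here is a small side-by-side sketch of the two approaches on synthetic data. It uses scikit-learn’s PCA alongside the same variance-based selection from the earlier sketch as a stand-in for VecZip; it is not a benchmark, and the synthetic data and dimension counts are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768)).astype(np.float32)

# PCA: projects onto 16 new, rotated axes that capture the most overall variance.
pca_16 = PCA(n_components=16).fit_transform(embeddings)

# VecZip-style selection (approximated): keeps 16 of the *original* dimensions,
# chosen because their values vary the most (are least common) across samples.
keep = np.sort(np.argsort(embeddings.var(axis=0))[-16:])
selected_16 = embeddings[:, keep]

print(pca_16.shape, selected_16.shape)   # (1000, 16) (1000, 16)
```

The structural difference is that PCA outputs linear combinations of every original dimension, while the selection approach keeps a small subset of the original dimensions untouched, which is why the two behave differently under aggressive reduction.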
A directory listing of the original and compressed embedding files illustrates the reduction in file size:

Mode | LastWriteTime | Length | Name
---- | ------------- | ------ | ----
-a---- | 9/12/2024 12:52 AM | 246830957 | embeddings.csv (235MB)
-a---- | 12/12/2024 9:15 PM | 4584099 | zipped-embeddings.csv (4.37MB)
Test Results and Key Findings
To evaluate the effectiveness of VecZip, we conducted tests using the sentence-transformers/stsb dataset, comparing the original and compressed embeddings across a variety of tasks. The most prominent results:
- Enhanced Similarity Scores: On a sentence similarity task, the VecZip-compressed embeddings produced a lower mean absolute difference from the “true” similarity scores than the original, higher-dimensional embeddings (a reproduction sketch follows this list).
- Significant Compression: The data was also compressed by approximately 50:1, which greatly reduces the required storage space and can improve the speed of processing embeddings.
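A hedged sketch of how such a similarity comparison can be reproduced is shown below. It assumes the Hugging Face datasets and sentence-transformers packages, the all-MiniLM-L6-v2 model, and gold scores normalized to the 0–1 range in sentence-transformers/stsb; the model choice and the variance-based pruning are illustrative stand-ins, not the exact setup used in our tests.

```python
import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

# STS-B sentence pairs with human similarity scores (assumed normalized to 0-1).
data = load_dataset("sentence-transformers/stsb", split="validation")
model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

emb1 = model.encode(data["sentence1"])
emb2 = model.encode(data["sentence2"])
gold = np.asarray(data["score"], dtype=np.float32)

def cosine(a, b):
    """Row-wise cosine similarity between two matrices of embeddings."""
    return np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

# Variance-based pruning to 16 dimensions (same idea as the earlier sketch),
# computed on the pooled embeddings so both sides keep the same dimensions.
keep = np.sort(np.argsort(np.vstack([emb1, emb2]).var(axis=0))[-16:])

mae_full = np.mean(np.abs(cosine(emb1, emb2) - gold))
mae_pruned = np.mean(np.abs(cosine(emb1[:, keep], emb2[:, keep]) - gold))
print(f"MAE vs. gold, full: {mae_full:.3f}  pruned to 16 dims: {mae_pruned:.3f}")
```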
In the figure, the top two rows are the VecZip-pruned embeddings for two sentences, with the original embeddings shown below them; the comparison is helpful for an intuitive understanding of this method’s impact on file size.
Broader Applications
At DEJAN AI, we apply dimensionality reduction techniques to improve many aspects of our clients’ work.
- Link Recommendations: Reduced embeddings aid in improving the quality of internal link recommendations.
- Anchor Text Selection: We see enhanced performance on anchor text selection tasks when using VecZip.
- Query Intent Classification: These techniques also improve our ability to classify user query intent.
- Clustering: The improved clustering behavior of the compressed embeddings gives us a better overview of the data as a whole.
- CTR Optimization: We apply compressed embeddings to help optimize click-through rates.
- General NLP Tasks: VecZip can improve performance of many other NLP tasks.
- Reduced Costs: Additionally, by greatly reducing the number of dimensions, we see lower storage requirements and reduced compute overhead.
VecZip is an important step toward more efficient AI tools. By optimizing the feature space of embeddings while improving downstream task performance, it paves the way for more scalable and performant AI systems.
We encourage the research and development community to explore the potential of VecZip, and we hope this approach enables further innovation in the field of machine learning.
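To try VecZip on your own data, install the dejan package and run the veczip command on a CSV of embeddings: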
pip install dejan
dejan veczip embeddings.csv zipped-embeddings.csv