Only a few bits of storage are needed per hash value compared to the approximately 64 bits needed in other methodologies, leading to an 8x reduction in storage space.
About
Challenge Weighted Minwise Sampling or hashing is used extensively in commercial big data systems for search engines and learning. Hashing is the process of computing shorter fixed-length values that represent a longer original string of data. Hashing is commonly used for indexing, comparing, and retrieving items in databases, as it is more efficient than working with the original data. For large datasets there exists a bottleneck in terms of the time required to compute hashes of all the data. To remedy this, more efficient hashing techniques are needed. Solution The best known Weighted Minwise hashing (WMH) technique requires O(dk) computation time, per k independent and unbiased hashes, where d is the number of non-zero values in the data vector, which is especially complex for large data sets where k can range into the thousands. A new technique discovered at Rice University uses a novel red-green map approach to reduce this to a constant amount of computation per hash, O(k), making it scalable for massive datasets. In addition, only a few bits of storage are needed per hash value compared to the approximately 64 bits needed in other methodologies, leading to an 8x reduction in storage space. Benefits and Features Faster and more efficient method for weighted minwise hashing Demonstrated 60,000-fold speedup in computing 500 hashes as compared to current leading method Eight-fold reduction in data storage space needs Market Potential / Applications This improved Weighted Minwise Hashing technique with reductions in computation time and storage space will be useful in search engines and learning approaches for complex big datasets.