ReversingLabs Hashing Algorithm

Predictive Malware Detection

Traditional hashing algorithms (e.g., MD5, SHA-1) are an essential tool for security applications. Although commonly used for allowlisting and blocklisting, traditional hashes have significant drawbacks for detecting malware. First, a malicious file must be seen before its hash can be created, so polymorphic attacks are not detectable. Second, hashes are fragile: malware authors can make inconsequential changes to a file to avoid detection.
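To make the fragility concrete, here is a minimal sketch using Python's standard hashlib module. The byte strings are made-up stand-ins for file contents, not real samples. Changing a single byte yields an unrelated SHA-1 digest, so a blocklist keyed on the first digest misses the second file.

```python
import hashlib

def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of the given bytes as a hex string."""
    return hashlib.sha1(data).hexdigest()

# Made-up stand-in for a file's contents; any byte string works.
original = b"MZ" + b"\x00" * 62 + b"example payload"
# Flip one bit in the last byte, as a trivial padding change might.
modified = original[:-1] + bytes([original[-1] ^ 0x01])

print(sha1_hex(original))
print(sha1_hex(modified))
print(sha1_hex(original) == sha1_hex(modified))  # False
```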

ReversingLabs Hashing Algorithm (“RHA”) addresses these issues by intelligently hashing a file’s features rather than its bits. Files have the same RHA hash when they are functionally similar. This makes RHA orders of magnitude better than traditional hashes for malware detection. One RHA hash can potentially identify thousands of functionally similar malware files, even though each has a unique SHA-1 hash. Further, RHA will detect a new and unknown malware variant because it is functionally similar to known malware.

RHA improves on traditional similarity algorithms, such as imphash, ssdeep, and tlsh, by providing identification automatically through pre-defined threat matching. With traditional similarity algorithms, users have to create the filesets they want to match against; with RHA, users get threat matching directly because ReversingLabs has already done that work. Keep reading to see the results of our proprietary algorithm in a real-world example.

How RHA Works

RHA enables correlation of files based on functional features. These features include format-specific header information, file layout, and functional file information (e.g., code and data relationships). RHA calculates functional similarity at four “precision levels,” 25%, 50%, 75%, and 100%, each based on an increasing number of attributes. The precision level represents the degree to which a file is functionally similar to another file. A higher precision level will match fewer files, but the files will have more functional similarity.

RHA can be applied to any executable file format. First, format-specific features are abstracted into categories such as structure, layout, content, symbols, functionality, and relationships. Then, algorithms are implemented to evaluate the attributes of each category for similarity at each precision level. Algorithms will vary for each format but usually entail data sorting and simplification. The algorithms calculate a hash for each precision level so that functionally related files fall into the same hash group.
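The general idea of hashing features at several precision levels can be illustrated with a toy sketch. This is not the RHA implementation, which is proprietary and format-specific. The category names per level, the feature dictionary, the sort-and-deduplicate simplification step, and the use of SHA-1 over a canonical JSON string are all assumptions made for illustration.

```python
import hashlib
import json

# Hypothetical feature categories per precision level, coarse to fine.
# The real RHA attribute sets are not public; these names are invented.
LEVEL_CATEGORIES = {
    25: ["structure"],
    50: ["structure", "layout"],
    75: ["structure", "layout", "symbols"],
    100: ["structure", "layout", "symbols", "content"],
}

def simplify(value):
    """Sort and de-duplicate list-valued attributes so ordering does not matter."""
    if isinstance(value, list):
        return sorted(set(value))
    return value

def functional_hashes(features: dict) -> dict:
    """Return one hash per precision level from a dict of extracted features."""
    hashes = {}
    for level, categories in LEVEL_CATEGORIES.items():
        selected = {name: simplify(features.get(name)) for name in categories}
        canonical = json.dumps(selected, sort_keys=True).encode("utf-8")
        hashes[level] = hashlib.sha1(canonical).hexdigest()
    return hashes

sample_a = {
    "structure": "pe32",
    "layout": [".text", ".data", ".rsrc"],
    "symbols": ["CreateFileW", "WriteFile"],
    "content": "build-1",
}
# Same features with sections reordered and different content.
sample_b = dict(sample_a, layout=[".rsrc", ".text", ".data"], content="build-2")

a, b = functional_hashes(sample_a), functional_hashes(sample_b)
for level in LEVEL_CATEGORIES:
    print(level, a[level] == b[level])
```

In this toy setup the two samples share a hash at the 25, 50, and 75 levels and differ only at 100, which mirrors the statement that higher precision levels match fewer files.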

Each precision level’s hash is deterministic and tied to the file’s functional configuration. Precision levels are therefore distinct, with no overlap between them in hash lookup. Because the hashes are deterministic, matching is an exact hash lookup, which keeps lookup times fast.
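One way to picture distinct, non-overlapping precision levels is an index keyed by the pair (precision level, hash). The sketch below is an assumption about how such a lookup could be modeled, not a description of the ReversingLabs backend. The index contents, labels, and the classify helper are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical index: (precision level, hash) -> classification label.
ThreatIndex = Dict[Tuple[int, str], str]

def classify(index: ThreatIndex, hashes: Dict[int, str]) -> Optional[Tuple[int, str]]:
    """Return (level, label) for the most precise matching level, or None."""
    for level in sorted(hashes, reverse=True):
        label = index.get((level, hashes[level]))  # exact-match dictionary lookup
        if label is not None:
            return level, label
    return None

# Placeholder hashes and label, for illustration only.
index: ThreatIndex = {(25, "aaaa"): "example-family"}
file_hashes = {25: "aaaa", 50: "bbbb", 75: "cccc", 100: "dddd"}

print(classify(index, file_hashes))   # (25, 'example-family')
print(classify(index, {25: "ffff"}))  # None
```

Because the level is part of the key, a hash computed at one level can never match an entry stored for another level.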

Validation

The effectiveness of RHA was tested using 7.75M unique malware samples that were detected as part of the Zeus malware family by at least one antivirus vendor. The samples were processed with the algorithm at the lowest precision level, resulting in 475K unique RHA1 hashes. This effectively reduced the working malware set size by 93%.
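The reduction can be checked from the rounded figures above. The exact sample counts are not given, so this is only an approximation; with the rounded inputs the result lands close to the quoted 93%.

```python
total_samples = 7_750_000   # rounded figure from the text
unique_hashes = 475_000     # rounded figure from the text

reduction = 1 - unique_hashes / total_samples
avg_per_hash = total_samples / unique_hashes

print(f"reduction: {reduction:.1%}")                       # reduction: 93.9%
print(f"average samples per hash: {avg_per_hash:.1f}")     # average samples per hash: 16.3
```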

We expected a reduction in sample uniqueness for members of the same malware family but didn’t expect a reduction of this magnitude. We analyzed the sample data to better understand why the effectiveness was so high, starting with the hashes that yielded the most matches. The following plot shows the number of unique binaries that map to a single RHA1 hash at the lowest precision level.

Number of files that are assigned to a single RHA hash
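A distribution like the one plotted can be produced by counting how many unique files share each hash. The sketch below uses Python's collections.Counter on invented (SHA-1, RHA1) pairs; the identifiers are placeholders, not real data.

```python
from collections import Counter

# Hypothetical (sha1, rha1) pairs; real data would come from sample processing.
samples = [
    ("sha1-a", "rha-1"), ("sha1-b", "rha-1"), ("sha1-c", "rha-1"),
    ("sha1-d", "rha-2"), ("sha1-e", "rha-2"),
    ("sha1-f", "rha-3"),
]

# Count unique files per RHA1 hash and list the largest groups first.
files_per_hash = Counter(rha for _sha1, rha in samples)
for rha, count in files_per_hash.most_common():
    print(rha, count)
```

Sorting by group size puts the hashes with the most matches first, which is the order used in the analysis that follows.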

The top-matching RHA hash showed that our best match wasn’t a particular malware family but a packing wrapper used to mask the true attack. This was not a common off-the-shelf packer, such as UPX, but a custom packing solution developed exclusively to hide the presence of malware.

Since packing can obscure detections and their malware family groupings, we turned to antivirus solutions to see how they classified the top match. The following graph shows the normalized threat names for the 100K files of the most prevalent RHA hash. There was no consensus on the threat name; only one antivirus vendor classified these samples as Zeus. Since it’s clear that the packing layer interferes with proper detections, we updated the Spectra Core backend to support this custom packing solution, which we call cpFlush.

Threat name breakdown for the best RHA1 hash

Unpacking the files showed that the top match was also using multiple packing layers. The number of corrupted and incorrectly packed files was low, so we could successfully unpack 95% of the samples. Comparing the RHA of files at each layer of packing showed they remained within the same functional hash buckets. This indicates that the differences between these files were indeed minor.

RHA, even at the lowest precision level, showed no collisions with allowlisted files and was therefore safely applied to our automatic classification. The custom packer was blocklisted using its format signature, and RHA enabled us to detect multiple malware families that use it.

Conclusion

ReversingLabs’ proprietary hashing algorithm, RHA, is a critical component of RL’s Spectra Core capabilities, giving security teams a powerful new way to detect present and future malware.