How RHA Works
RHA enables correlation of files based on functional features. These attributes include format-specific header information, file layout, and functional file information (e.g., code and data relationships). RHA calculates functional similarity at four “Precision Levels” (25%, 50%, 75% and 100%), each based on an increasing number of attributes. A Precision Level represents the degree to which a file is functionally similar to another file. A higher Precision Level matches fewer files, but those files are more functionally similar.
RHA can be applied to any executable file format. First, format-specific features are abstracted into categories such as structure, layout, content, symbols, functionality and relationships. Then, algorithms are implemented to evaluate the attributes of each category for similarity at each Precision Level. The algorithms vary by format but usually entail data sorting and simplification. They calculate a hash for each Precision Level so that functionally related files fall into the same hash group.
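As a rough illustration of the steps above, the following Python sketch computes one hash per Precision Level from a hypothetical feature dictionary. The category names come from the text, but the assignment of categories to levels, the canonicalization step, the sample attribute values, and the use of SHA-1 are all assumptions made for the example, not the actual RHA implementation.

```python
import hashlib
import json

# Hypothetical mapping: each Precision Level evaluates more feature categories.
# The real assignment of attributes to levels is not described in this article.
LEVEL_CATEGORIES = {
    25: ["structure", "layout"],
    50: ["structure", "layout", "content"],
    75: ["structure", "layout", "content", "symbols"],
    100: ["structure", "layout", "content", "symbols",
          "functionality", "relationships"],
}


def canonicalize(values):
    """Sort and de-duplicate attribute values so ordering differences vanish."""
    return sorted(set(str(v).lower() for v in values))


def functional_hashes(features):
    """Return {precision_level: hex digest} for one file's abstracted features."""
    hashes = {}
    for level, categories in LEVEL_CATEGORIES.items():
        simplified = {c: canonicalize(features.get(c, [])) for c in categories}
        # Include the level in the hashed payload so levels produce distinct hashes.
        payload = json.dumps([level, simplified], sort_keys=True).encode("utf-8")
        hashes[level] = hashlib.sha1(payload).hexdigest()
    return hashes


# Two made-up files: same structure and layout (in different order),
# different content attributes.
file_a = {"structure": ["PE32", "3 sections"],
          "layout": [".text", ".data", ".rsrc"],
          "content": ["strings:120"]}
file_b = {"structure": ["3 sections", "PE32"],
          "layout": [".data", ".text", ".rsrc"],
          "content": ["strings:987"]}

hashes_a = functional_hashes(file_a)
hashes_b = functional_hashes(file_b)
print(hashes_a[25] == hashes_b[25])  # compares the lowest-level hash group
print(hashes_a[50] == hashes_b[50])  # compares once content is included
```

In this toy setup the two files share a hash at the 25% level and diverge at 50%, which mirrors the idea that higher Precision Levels match fewer files.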
The hash for each Precision Level is deterministic and tied to the file’s functional configuration. As a result, Precision Levels are distinct, with no overlaps in hash lookup, and a lookup is an exact match on the hash value, which keeps lookup times as short as possible.
The effectiveness of RHA was tested using 7.75M unique malware samples that were detected as part of the Zeus malware family by at least one antivirus vendor. The samples were processed with the algorithm at the lowest Precision Level, resulting in 475k unique RHA1 hashes. This effectively reduced the working malware set size by 93%.
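The reduction figure can be reproduced with simple arithmetic. The sketch below groups files by their RHA1 hash and computes how much the working set shrinks; the function names and the toy (file hash, RHA1 hash) pairs are hypothetical, and only the 7.75M and 475k figures come from the text.

```python
from collections import defaultdict


def group_by_hash(samples):
    """Map each RHA1 hash to the set of unique file hashes that share it."""
    groups = defaultdict(set)
    for file_hash, rha1 in samples:
        groups[rha1].add(file_hash)
    return groups


def reduction(unique_files, unique_hashes):
    """Fraction by which the working set shrinks when files collapse to hashes."""
    return 1 - unique_hashes / unique_files


# Toy data: (file hash, RHA1 hash) pairs; all values are made up.
samples = [("f1", "h1"), ("f2", "h1"), ("f3", "h1"), ("f4", "h2"), ("f5", "h2")]
groups = group_by_hash(samples)
toy = reduction(len(samples), len(groups))  # 5 unique files collapse to 2 hashes

# Rounded figures quoted in the text: 7.75M samples, 475k RHA1 hashes.
reported = reduction(7_750_000, 475_000)
print(f"toy: {toy:.0%}, figures from the text: {reported:.1%}")
```

With the rounded figures from the text, the reduction works out to roughly 93.9%, in line with the 93% quoted above.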
We expected a reduction in sample uniqueness for members of the same malware family, but we didn’t expect the magnitude of the reduction. We analyzed the sample data to better understand why the effectiveness was so high, starting with the hashes that yielded the most matches. The following plot shows the number of unique binaries that map to a single RHA1 hash at the lowest Precision Level.
Number of files that are assigned to a single RHA hash
Examining the files behind the top-matching RHA hash showed that our best match wasn’t on a particular malware family but on a packing wrapper used to mask the true attack. This was not a common off-the-shelf packer, such as UPX, but a custom packing solution developed exclusively to hide the presence of malware.
Since packing can obscure detections and their malware family groupings, we turned to antivirus solutions to see how they classified the top match. The following graph shows the normalized threat names for the 100k files that map to the most prevalent RHA hash. There was no consensus on the threat name, and only one antivirus vendor classified these samples as Zeus. Since it’s clear that the packing layer interferes with proper detections, we’ve upgraded our TitaniumCore solution to support this custom packing solution, which we call cpFlush.
Threat name breakdown for the best RHA1 hash
Unpacking the files showed that the top match was also using multiple packing layers. The number of corrupted and incorrectly packed files was low, so we could successfully unpack 95% of the samples. Comparing the RHA hashes of files at each layer of packing showed they remained within the same functional hash buckets. This indicates that the differences between these files were indeed minor.
Even at the lowest Precision Level, RHA showed no collisions with whitelisted files, so it could be safely applied to our automatic RHA cloud classification. The custom packer was blacklisted using its format signature. As a result, RHA enables us to detect the multiple malware families that use this packer.
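A collision check of this kind can be thought of as a set intersection between the functional hashes of known-bad files and those of whitelisted files. The Python sketch below is only an illustration under that assumption; the function names and hash values are hypothetical and do not describe how the cloud classification is actually implemented.

```python
def find_collisions(malware_hashes, whitelist_hashes):
    """Return functional hashes present in both the malicious and whitelisted sets."""
    return set(malware_hashes) & set(whitelist_hashes)


def safe_to_automate(malware_hashes, whitelist_hashes):
    """Treat a hash set as safe for automatic classification only if nothing collides."""
    return not find_collisions(malware_hashes, whitelist_hashes)


# Made-up hash values for illustration only.
malware = {"rha1:aaa", "rha1:bbb"}
whitelist = {"rha1:ccc"}

collisions = find_collisions(malware, whitelist)
print(sorted(collisions), safe_to_automate(malware, whitelist))
```

Any hash that did appear in both sets would need to be excluded or moved to a higher Precision Level before being used for automatic classification; that follow-up policy is likewise an assumption of this sketch.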
RHA provides a new security tool for effectively detecting present and future malware. The power of this tool is multiplied when used with an extensive file reputation database like TitaniumCloud. This combination enables large-scale detection of new malware variants through functional similarity to known malware.