100 Days of YARA – Day 2: Identifying PE files and Measuring Speed of Rules

Windows software is typically distributed as Portable Executable (PE) files, including:

  • exe files
  • dlls
  • drivers

A PE file can be identified by its first two bytes, “MZ” (0x4d 0x5a), the initials of Mark Zbikowski, who designed the MS-DOS executable format:

% xxd ~/Downloads/putty.exe | head
00000000: 4d5a 7800 0100 0000 0400 0000 0000 0000  MZx.............
00000010: 0000 0000 0000 0000 4000 0000 0000 0000  ........@.......
00000020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000030: 0000 0000 0000 0000 0000 0000 7800 0000  ............x...
00000040: 0e1f ba0e 00b4 09cd 21b8 014c cd21 5468  ........!..L.!Th
00000050: 6973 2070 726f 6772 616d 2063 616e 6e6f  is program canno
00000060: 7420 6265 2072 756e 2069 6e20 444f 5320  t be run in DOS 
00000070: 6d6f 6465 2e24 0000 5045 0000 4c01 0700  mode.$..PE..L...
00000080: f69c ef5e 0000 0000 0000 0000 e000 0201  ...^............
00000090: 0b01 0e00 00ce 0800 00aa 0700 0000 0000  ................

Simply checking whether a file begins with “MZ” can produce false positives, since any file may happen to start with those two bytes. In practice, however, this method yields very few.
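
One way to tighten the check is visible in the hex dump above: the 4-byte value at offset 0x3c (0x78 in putty.exe) points to the “PE\0\0” signature. The following is a minimal Python sketch of that idea, not a full PE parser. The function names and the 4096-byte read limit are illustrative assumptions.

```python
import struct


def looks_like_pe(data: bytes) -> bool:
    """Return True if data starts with MZ and e_lfanew points at the PE signature."""
    # The DOS header is 64 bytes long; e_lfanew is the 4-byte field at offset 0x3c.
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    # A slice past the end of the buffer is simply shorter, so this cannot raise.
    return data[e_lfanew:e_lfanew + 4] == b"PE\x00\x00"


def file_looks_like_pe(path: str, max_header: int = 4096) -> bool:
    """Check only the first max_header bytes of a file (an assumed limit)."""
    with open(path, "rb") as f:
        return looks_like_pe(f.read(max_header))
```

A file whose PE signature lies beyond the first 4096 bytes would be missed by `file_looks_like_pe`, so the limit is a trade-off rather than a rule. The same two-step check can be written as a YARA condition: `uint16(0) == 0x5a4d and uint32(uint32(0x3c)) == 0x00004550`.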

Malware can evade these types of rules by replacing “MZ” with something else. One such example is observed in the REvil ransomware:

“Malware removes some unused magic constants from the header to evade it. Magic constants such as 0x4D5A (MZ) 0x5045 (PE). This method requires loading and executing a payload just like a shellcode.”


Just because there is a potential for false positives or false negatives does not make these rules “bad”, but it is something to be aware of and highlights the importance of developing multiple rules to scan for the same thing.

Here are three ways to determine if a file is a PE file. There are more ways to do this which may be faster or yield better results within your datasets, but these examples are easy to understand and generally good enough.

Method 1

rule pe_file_method1
{
	meta:
		description = "PE file 'MZ' header as string"
		author = "Daniel Roberson"

	strings:
		$pe = "MZ"

	condition:
		$pe at 0
}

Method 2

rule pe_file_method2
{
	meta:
		description = "PE file 'MZ' header as uint16"
		author = "Daniel Roberson"

	condition:
		uint16(0) == 0x5a4d
}
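
The constant in the uint16 check is 0x5a4d rather than 0x4d5a because YARA's uint16() reads the two bytes as a little-endian integer. A short Python illustration of the byte order (the variable names are arbitrary):

```python
import struct

mz = b"MZ"                       # bytes 0x4d 0x5a, in the order they appear on disk
(le,) = struct.unpack("<H", mz)  # little-endian, the order uint16() uses
(be,) = struct.unpack(">H", mz)  # big-endian, the order uint16be() uses

print(hex(le), hex(be))          # 0x5a4d 0x4d5a
```

Writing the condition as `uint16be(0) == 0x4d5a` should match the same files, and reads closer to the hex dump.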

Method 3

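import "pe"

rule pe_file_method3
{
	meta:
		description = "PE file using 'pe' module"
		author = "Daniel Roberson"

	condition:
		// condition is missing from this copy of the post; pe.is_pe is assumed
		pe.is_pe
}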

As I intend to run several YARA rules against large, multi-terabyte datasets, I want them to be as fast as reasonably possible. Performance at this level usually does not matter much, but for some use cases it can be a concern.

To measure performance, I scanned the same 2.5 GB dataset 10 times with each rule and averaged the run times, using this one-liner:

for i in $(seq 10); do (time yara -r pe1.yar /tmp/malware_data_science 2>&1 >/dev/null) 2>&1 | awk '{print $7}'; done
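
The one-liner depends on the column layout of the shell's `time` output, which varies between shells. As an alternative, here is a rough Python sketch that repeats a command and averages the wall-clock time. The function name `average_runtime` is hypothetical, and the sketch assumes `yara` is on the PATH.

```python
import statistics
import subprocess
import time


def average_runtime(cmd, runs=10):
    """Run cmd `runs` times; return (mean seconds, list of per-run seconds)."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard scanner output so terminal I/O does not skew the timing.
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), timings


# Example call, using the rule file and dataset path from the one-liner:
# mean, timings = average_runtime(
#     ["yara", "-r", "pe1.yar", "/tmp/malware_data_science"])
# print(f"mean {mean:.3f}s over {len(timings)} runs")
```

This measures elapsed wall-clock time, including process start-up, so the figures are only comparable across rules when they are run on the same machine under similar load.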

I also included a dummy rule that always evaluates to true, to show the best-case time for scanning the dataset on the same hardware:

rule null { condition: true }

[Table: average scan time per rule. Columns: Method 1 (string), Method 2 (uint16), Method 3 (pe module), Dummy rule. All measurements in seconds.]

The pe module is clearly the slowest of the three. Method 2, which compares the integer representation of “MZ”, was the fastest, roughly 4% faster than searching for “MZ” as a string.

