Archive for October, 2020

Shock Result<>?: Rust faster than Python in one test of file parsing

I don’t think I’d surprise many people if I told them a Rust program was often faster than a Python program but today I wrote a small “script” that scanned a large log file for a regular expression and was surprised to find the Rust version ran >400x faster than the Python. Given the amount of log data I’m going to need to analyse over the next few days – this “script” is going to be a Rust program (though it would be possible for me to use an optimised Python version instead).

In my day job, I work on a complex C application. When we are hunting certain classes of bugs, a common technique to try to tune the logging/tracing of the application, wait for a while until we think the problem has occurred then comb through the trace to figure out what happened.

Even with tuning the tracing settings, when looking for a rare timing condition, this can result in having to search through many gigabytes of logs files. I can often use grep – but sometimes I want to say something like “find this line or this line between (a line that looks like a line of type A and one of type B)” and writing a short program is the easiest thing to do. I’ve used C and Go for parsing the logs in the past but my default language for this sort of task is Python – it’s easy to write and often programmer time is more important than CPU time.

This morning I started work on a new script, scanning 200MiB of logs (850,000 lines). The first version of my Python script just tried to match a regular expression against each line. It took over 4 minutes on my sample data, so I made a few tweaks (e.g. I noticed I was compiling the regular expression for each line). It still took over 4 minutes when I fixed that. The Python code in question:

import re

#Takes about 4m23 on my ThinkPad P50

def parseTraceFile(filepath):
    connectregex = re.compile("([^T]*)T([^Z]*)Z.*User is authenticated and authorized.*connect=([^\s]*)\s*client=([^\s]*)\s")

    with open(filepath) as fp:
        line = fp.readline()
        count = 1
        
        while line:
            #print("Line {}: {}".format(count, line.strip()))

            #Occasional proof of life so we don't think it's hung
            if count % 20000 == 0:
                print("Processing line %d" % count)

            count += 1

            connectline = connectregex.search(line)

            if connectline:
                print("date = %s" % connectline.group(1))
                print("time = %s" % connectline.group(2))
                print("connect = %s " % connectline.group(3))
                print("client= %s" % connectline.group(4))

            line = fp.readline()



if __name__ == "__main__":

    success = parseTraceFile('/var/tmp/sampletrace.log')
    
    if not success:
        exit(10)
    
    exit(0)

This code takes ~4m23s on my ThinkPad P50. This worried me as the eventual script will have a lot more regular expressions and run on a lot more data – so to cut a long story short – I thought that given I was learning Rust (for other reasons), I’d see how long a Rust version took. A simple version without clever optimisations runs in about 0.6 seconds. Here’s the Rust version:

use std::fs::File;
use std::io::{prelude::*, BufReader};
use regex::Regex;

fn parse_trace_file(filepath: String) -> std::io::Result<()> {
    let connect_regex: Regex = Regex::new(r"([^T]*)T([^Z]*)Z.*User is authenticated and authorized.*connect=([^\s]*)\s*client=([^\s]*)\s").unwrap();
    
    let file = File::open(filepath)?;
    let reader = BufReader::new(file);
    
    let mut count = 1;

    for lineresult in reader.lines() {
        if let Ok(line) = lineresult {
            //println!("{}", line);
            
            //Proof of life - so we can tell we haven't hung
            if (count % 20000) == 0 {
                println!("Processing line {}", count); 
            }
        
            count += 1;
        
            if let Some(caps) = connect_regex.captures(&line) {
                println!("Found a match.");
                println!("Date    = {}", caps.get(1).map_or("PARSE ERROR", |m| m.as_str()));
                println!("Time    = {}", caps.get(2).map_or("PARSE ERROR", |m| m.as_str()));
                println!("Connect = {}", caps.get(3).map_or("PARSE ERROR", |m| m.as_str()));
                println!("Client  = {}", caps.get(4).map_or("PARSE ERROR", |m| m.as_str()));
            }
        }
    }

    Ok(())
}

fn main() {
    let result = parse_trace_file(String::from("/var/tmp/sampletrace.log"));
    
    if result.is_ok() {
        println!("Yay");
    }
}

Is the Rust version harder to write? Yes, at least for a beginner like me – it is but given the saving in CPU time the trade off is worth it – especially given I’ll get faster at writing Rust the more I do it.

The difference in execution time surprised me – I naively assumed that because the regular expression library in Python is going to be mature, compiled C code I’d see an insignificant difference.

This is, at the moment about a single regular expression – if I alter the Python version to search for a string literal – and only run the regular expression on lines that are likely candidates:

import re

def parseTraceFile(filepath):
    connectregex = re.compile("([^T]*)T([^Z]*)Z.*User is authenticated and authorized.*connect=([^\s]*)\s*client=([^\s]*)\s")

    with open(filepath) as fp:
        line = fp.readline()
        count = 1
        
        while line:
            #print("Line {}: {}".format(count, line.strip()))

            #Occasional proof of life so we don't think it's hung
            if count % 20000 == 0:
                print("Processing line %d" % count)

            count += 1
            findpos = line.find("User is authenticated and authorized", 40)

            if findpos > -1:
                print("Found match")
                connectline = connectregex.search(line)

                if connectline:
                    print("date = %s" % connectline.group(1))
                    print("time = %s" % connectline.group(2))
                    print("connect = %s " % connectline.group(3))
                    print("client= %s" % connectline.group(4))

            line = fp.readline()


if __name__ == "__main__":

    success = parseTraceFile('/var/tmp/sampletrace.log')
    
    if not success:
        exit(10)
    
    exit(0)

Then this Python version runs at a similar speed to the Rust version.

What conclusions can we draw from such a single case? Not many – I’m going to experiment with doing this particular analysis with Rust and, if it is not too painful I’ll probably compare some other cases as well. If I wasn’t learning Rust for other reasons though – I could still get my Python script to run at a similar speed at the cost of thinking more carefully about optimising the analysis code. That more careful thought does start to eat away at the advantage Python has for me though – that it’s very fast for writing a quick and dirty scripts.