Shock Result!?: Rust faster than Python in one test of file parsing
I don’t think I’d surprise many people if I told them a Rust program was often faster than a Python program, but today I wrote a small “script” that scanned a large log file for a regular expression, and I was surprised to find the Rust version ran >400x faster than the Python one. Given the amount of log data I’m going to need to analyse over the next few days, this “script” is going to be a Rust program (though it would be possible for me to use an optimised Python version instead).
In my day job, I work on a complex C application. When we are hunting certain classes of bugs, a common technique is to tune the logging/tracing of the application, wait a while until we think the problem has occurred, then comb through the trace to figure out what happened.
Even with tuned tracing settings, looking for a rare timing condition can mean searching through many gigabytes of log files. I can often use grep – but sometimes I want to say something like “find this line or this line between (a line that looks like a line of type A and one of type B)”, and writing a short program is the easiest way to do that. I’ve used C and Go for parsing the logs in the past, but my default language for this sort of task is Python – it’s easy to write, and often programmer time is more important than CPU time.
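A minimal sketch of that kind of stateful scan in Python – the marker patterns and the “interesting” lines here are made up for the example, not real trace lines:

import re

# Hypothetical markers - real trace lines would obviously look different.
START_MARKER = re.compile(r"type A: transaction started")    # "line of type A"
END_MARKER = re.compile(r"type B: transaction completed")    # "line of type B"
INTERESTING = re.compile(r"lock (acquired|released)")        # "this line or this line"

def scan_between(filepath):
    in_window = False
    with open(filepath) as fp:
        for number, line in enumerate(fp, start=1):
            if START_MARKER.search(line):
                in_window = True
            elif END_MARKER.search(line):
                in_window = False
            elif in_window and INTERESTING.search(line):
                print("Line %d: %s" % (number, line.strip()))

scan_between('/var/tmp/sampletrace.log')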
This morning I started work on a new script, scanning 200MiB of logs (850,000 lines). The first version of my Python script simply tried to match a regular expression against each line. It took over 4 minutes on my sample data, so I made a few tweaks (e.g. I noticed I was compiling the regular expression for each line), but it still took over 4 minutes once that was fixed. The Python code in question:
import re

# Takes about 4m23s on my ThinkPad P50
def parseTraceFile(filepath):
    connectregex = re.compile(r"([^T]*)T([^Z]*)Z.*User is authenticated and authorized.*connect=([^\s]*)\s*client=([^\s]*)\s")
    with open(filepath) as fp:
        line = fp.readline()
        count = 1
        while line:
            #print("Line {}: {}".format(count, line.strip()))
            # Occasional proof of life so we don't think it's hung
            if count % 20000 == 0:
                print("Processing line %d" % count)
            count += 1
            connectline = connectregex.search(line)
            if connectline:
                print("date = %s" % connectline.group(1))
                print("time = %s" % connectline.group(2))
                print("connect = %s" % connectline.group(3))
                print("client = %s" % connectline.group(4))
            line = fp.readline()
    return True

if __name__ == "__main__":
    success = parseTraceFile('/var/tmp/sampletrace.log')
    if not success:
        exit(10)
    exit(0)
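As an aside, the compile-per-line pitfall I mentioned looks roughly like this (a contrived sketch, not the real script):

import re

PATTERN_TEXT = r"User is authenticated and authorized"

def slow_scan(fp):
    # Pitfall: the pattern is (re)compiled for every single line.
    for line in fp:
        if re.compile(PATTERN_TEXT).search(line):
            yield line

def fast_scan(fp):
    # Fix: compile once, outside the loop.
    compiled = re.compile(PATTERN_TEXT)
    for line in fp:
        if compiled.search(line):
            yield line

In fairness to Python, the re module caches recently compiled patterns internally, which is probably why fixing this made so little difference to the run time.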
The script above takes ~4m23s on my ThinkPad P50. That worried me, as the eventual script will have a lot more regular expressions and run over a lot more data – so, to cut a long story short, since I was already learning Rust (for other reasons) I thought I’d see how long a Rust version took. A simple version without clever optimisations runs in about 0.6 seconds. Here’s the Rust version:
use std::fs::File;
use std::io::{prelude::*, BufReader};
use regex::Regex;

fn parse_trace_file(filepath: String) -> std::io::Result<()> {
    let connect_regex: Regex = Regex::new(r"([^T]*)T([^Z]*)Z.*User is authenticated and authorized.*connect=([^\s]*)\s*client=([^\s]*)\s").unwrap();
    let file = File::open(filepath)?;
    let reader = BufReader::new(file);
    let mut count = 1;
    for lineresult in reader.lines() {
        if let Ok(line) = lineresult {
            //println!("{}", line);
            // Proof of life - so we can tell we haven't hung
            if (count % 20000) == 0 {
                println!("Processing line {}", count);
            }
            count += 1;
            if let Some(caps) = connect_regex.captures(&line) {
                println!("Found a match.");
                println!("Date = {}", caps.get(1).map_or("PARSE ERROR", |m| m.as_str()));
                println!("Time = {}", caps.get(2).map_or("PARSE ERROR", |m| m.as_str()));
                println!("Connect = {}", caps.get(3).map_or("PARSE ERROR", |m| m.as_str()));
                println!("Client = {}", caps.get(4).map_or("PARSE ERROR", |m| m.as_str()));
            }
        }
    }
    Ok(())
}

fn main() {
    let result = parse_trace_file(String::from("/var/tmp/sampletrace.log"));
    if result.is_ok() {
        println!("Yay");
    }
}
Is the Rust version harder to write? Yes, at least for a beginner like me – but given the saving in CPU time the trade-off is worth it, especially as I’ll get faster at writing Rust the more I do it.
The difference in execution time surprised me – I naively assumed that because Python’s regular expression library is mature, compiled C code, I’d see an insignificant difference.
This is, at the moment, about a single regular expression. If I alter the Python version to search for a string literal first – and only run the regular expression on lines that are likely candidates:
import re

def parseTraceFile(filepath):
    connectregex = re.compile(r"([^T]*)T([^Z]*)Z.*User is authenticated and authorized.*connect=([^\s]*)\s*client=([^\s]*)\s")
    with open(filepath) as fp:
        line = fp.readline()
        count = 1
        while line:
            #print("Line {}: {}".format(count, line.strip()))
            # Occasional proof of life so we don't think it's hung
            if count % 20000 == 0:
                print("Processing line %d" % count)
            count += 1
            findpos = line.find("User is authenticated and authorized", 40)
            if findpos > -1:
                print("Found match")
                connectline = connectregex.search(line)
                if connectline:
                    print("date = %s" % connectline.group(1))
                    print("time = %s" % connectline.group(2))
                    print("connect = %s" % connectline.group(3))
                    print("client = %s" % connectline.group(4))
            line = fp.readline()
    return True

if __name__ == "__main__":
    success = parseTraceFile('/var/tmp/sampletrace.log')
    if not success:
        exit(10)
    exit(0)
Then this Python version runs at a similar speed to the Rust version.
What conclusions can we draw from a single case like this? Not many – I’m going to experiment with doing this particular analysis in Rust and, if it’s not too painful, I’ll probably compare some other cases as well. If I wasn’t learning Rust for other reasons, though, I could still get my Python script to run at a similar speed at the cost of thinking more carefully about optimising the analysis code. That more careful thought does start to eat away at the advantage Python has for me, though – that it’s very fast for writing quick and dirty scripts.
You don’t describe how you ran the code, and that’s a big deal for Rust. What command did you use?
I suspect you ran a build without optimizations. There are other possibilities too, but that’s the most obvious question.
Hi Steve – I ran a release build – I’m not complaining about the speed of the Rust – it ran 400x quicker than the Python! (It’s an honour to have you comment on my blog)
This sounds like an ideal use case for Perl too – I wonder what sort of performance you’d get from that type of solution by contrast.
Graham – It’s been more than a decade since I wrote much Perl and I don’t plan on ever going back – it’s my least favourite programming language of any I’ve used, and I’ve used asm and Fortran :)
The issue is that regular expressions can be divided into ‘limited’ regular expressions, which are fast but have limited expressivity (this is the original, academic meaning, I believe), and ‘extended’ regular expressions, which are much slower but have more expressivity, such as backreferences, etc. The Rust regex crate only implements limited regular expressions; i.e., from its documentation:
“Its syntax is similar to Perl-style regular expressions, but lacks a few features like look around and backreferences. In exchange, all searches execute in linear time with respect to the size of the regular expression and search text.”
Python regular expressions appear to be ‘extended’ expressions, which support such things as backreferences, etc., but I think in this case, since you don’t appear to need them, you are perhaps still paying the price. That would explain why the simple text match in Python is similar to Rust – it is the type of regular expression library that is making the difference.
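To make that concrete, here is a contrived pattern (nothing like the one in the post – purely an illustration) that a backtracking engine handles exponentially badly, while a linear-time engine such as the Rust regex crate is unaffected:

import re
import time

# Contrived: a backtracking engine tries exponentially many ways of splitting
# the run of 'a's between the two '+' quantifiers before it can report failure.
pattern = re.compile(r"(a+)+b")
text = "a" * 25   # no 'b', so the match can never succeed

start = time.time()
pattern.search(text)
print("search took %.1f seconds" % (time.time() - start))

The exact timing will vary, but it should be seconds rather than microseconds, and it roughly doubles with every extra ‘a’. That worst case isn’t what’s happening with the log regex above – it just illustrates the price of the backtracking approach.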
May I interest you in https://github.com/DanielKeep/cargo-script#scripts ?
Essentially, you can put it into the #! at the start of your file and start running ./foo.rs like you would have run your ./foo.py before.
(It’ll save you from having to have a cargo project structure for every little thing.)
Hi Julius – I hadn’t seen that – looks really handy, thanks!
You might also get better performance using an alternative regex library (like re2) in Python. I think the standard library is still quite bad: https://swtch.com/~rsc/regexp/regexp1.html.
In your specific use case I suppose it won’t really matter, as I/O appears to be your bottleneck.
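For example, the pyre2 bindings aim to be a drop-in replacement for the standard re module, so the change can be as small as the import. A sketch – it assumes the bindings are installed (e.g. via “pip install pyre2”) and that the features you use are supported; I haven’t tested it against your logs:

# Assumes the pyre2 bindings, which aim to mirror the standard re API.
import re2 as re

# Simplified version of the pattern from the post, just for illustration.
connectregex = re.compile(r"connect=(\S+)\s+client=(\S+)")
match = connectregex.search("... connect=host1:1414 client=appsrv2 ...")
if match:
    print("connect = %s" % match.group(1))
    print("client = %s" % match.group(2))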
Hi Remi, I/O is not the bottleneck – see the second Python program in the post: if I use a string search rather than a regex, it goes ~400x faster.