Shock Result?: Rust faster than Python in one test of file parsing

I don’t think I’d surprise many people by saying a Rust program is often faster than a Python program, but today I wrote a small “script” that scanned a large log file for a regular expression and was surprised to find the Rust version ran >400x faster than the Python one. Given the amount of log data I’m going to need to analyse over the next few days, this “script” is going to be a Rust program (though I could have used an optimised Python version instead).

In my day job, I work on a complex C application. When we are hunting certain classes of bugs, a common technique is to tune the logging/tracing of the application, wait a while until we think the problem has occurred, then comb through the trace to figure out what happened.

Even with tuning the tracing settings, when looking for a rare timing condition this can mean searching through many gigabytes of log files. I can often use grep – but sometimes I want to say something like “find this line or this line between (a line that looks like a line of type A and one of type B)”, and writing a short program is the easiest way to do that. I’ve used C and Go for parsing logs in the past, but my default language for this sort of task is Python – it’s easy to write, and often programmer time is more important than CPU time.
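For a concrete (if simplified) idea of what that kind of query looks like, here’s a Python sketch of a small state machine that only reports interesting lines seen between a “type A” marker and the next “type B” marker – the patterns here are made up for illustration, not taken from a real trace:

```python
import re

# Hypothetical marker patterns -- the real ones depend on the trace format.
START_RE = re.compile(r"Transaction \d+ started")    # a line of "type A"
END_RE = re.compile(r"Transaction \d+ committed")    # a line of "type B"
TARGET_RE = re.compile(r"lock (acquired|released)")  # the lines we care about

def lines_between(lines):
    """Yield lines matching TARGET_RE that fall between a type A line
    and the next type B line."""
    inside = False
    for line in lines:
        if not inside:
            # Not in a region yet: look for the opening marker.
            if START_RE.search(line):
                inside = True
        elif END_RE.search(line):
            # Closing marker ends the region.
            inside = False
        elif TARGET_RE.search(line):
            yield line
```

Called with an open file (or any iterable of lines), it yields only the matching lines inside a marked region – the sort of query that is awkward in grep but trivial in a short script.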

This morning I started work on a new script, scanning 200MiB of logs (850,000 lines). The first version of my Python script just tried to match a regular expression against each line. It took over 4 minutes on my sample data, so I made a few tweaks (e.g. I noticed I was compiling the regular expression for each line) – but it still took over 4 minutes after I fixed that. The Python code in question:

import re

#Takes about 4m23 on my ThinkPad P50

def parseTraceFile(filepath):
    connectregex = re.compile(r"([^T]*)T([^Z]*)Z.*User is authenticated and authorized.*connect=([^\s]*)\s*client=([^\s]*)\s")

    with open(filepath) as fp:
        line = fp.readline()
        count = 1
        
        while line:
            #print("Line {}: {}".format(count, line.strip()))

            #Occasional proof of life so we don't think it's hung
            if count % 20000 == 0:
                print("Processing line %d" % count)

            count += 1

            connectline = connectregex.search(line)

            if connectline:
                print("date = %s" % connectline.group(1))
                print("time = %s" % connectline.group(2))
                print("connect = %s " % connectline.group(3))
                print("client= %s" % connectline.group(4))

            line = fp.readline()

    return True


if __name__ == "__main__":

    success = parseTraceFile('/var/tmp/sampletrace.log')
    
    if not success:
        exit(10)
    
    exit(0)

This code takes ~4m23s on my ThinkPad P50. This worried me as the eventual script will have a lot more regular expressions and run on a lot more data – so to cut a long story short – I thought that given I was learning Rust (for other reasons), I’d see how long a Rust version took. A simple version without clever optimisations runs in about 0.6 seconds. Here’s the Rust version:

// Needs the `regex` crate as a dependency in Cargo.toml
use std::fs::File;
use std::io::{prelude::*, BufReader};
use regex::Regex;

fn parse_trace_file(filepath: String) -> std::io::Result<()> {
    let connect_regex: Regex = Regex::new(r"([^T]*)T([^Z]*)Z.*User is authenticated and authorized.*connect=([^\s]*)\s*client=([^\s]*)\s").unwrap();
    
    let file = File::open(filepath)?;
    let reader = BufReader::new(file);
    
    let mut count = 1;

    for lineresult in reader.lines() {
        if let Ok(line) = lineresult {
            //println!("{}", line);
            
            //Proof of life - so we can tell we haven't hung
            if (count % 20000) == 0 {
                println!("Processing line {}", count); 
            }
        
            count += 1;
        
            if let Some(caps) = connect_regex.captures(&line) {
                println!("Found a match.");
                println!("Date    = {}", caps.get(1).map_or("PARSE ERROR", |m| m.as_str()));
                println!("Time    = {}", caps.get(2).map_or("PARSE ERROR", |m| m.as_str()));
                println!("Connect = {}", caps.get(3).map_or("PARSE ERROR", |m| m.as_str()));
                println!("Client  = {}", caps.get(4).map_or("PARSE ERROR", |m| m.as_str()));
            }
        }
    }

    Ok(())
}

fn main() {
    let result = parse_trace_file(String::from("/var/tmp/sampletrace.log"));
    
    if result.is_ok() {
        println!("Yay");
    }
}

Is the Rust version harder to write? Yes, at least for a beginner like me – but given the saving in CPU time the trade-off is worth it, especially as I’ll get faster at writing Rust the more I do it.

The difference in execution time surprised me – I naively assumed that because the regular expression library in Python is mature, compiled C code, I’d see an insignificant difference.
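One way to see where the time goes is a quick micro-benchmark. On a line that doesn’t match (the common case in a large log), a plain str.find is much cheaper than attempting the full regex, which is why pre-filtering on a literal helps. This is only a sketch – the sample line and iteration count are made up:

```python
import re
import time

# A regex in the same style as the one in the script, and a line that
# does NOT match it (the common case when scanning a big log).
pattern = re.compile(r"([^T]*)T([^Z]*)Z.*User is authenticated and authorized")
line = "2019-01-01T00:00:00Z some other, uninteresting trace line"

def time_it(fn, n=50_000):
    """Return the wall-clock time to call fn() n times."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return time.perf_counter() - start

regex_time = time_it(lambda: pattern.search(line))
find_time = time_it(lambda: line.find("User is authenticated and authorized"))
print("regex: {:.3f}s, find: {:.3f}s".format(regex_time, find_time))
```

The absolute numbers will vary by machine, but the gap between the two calls on non-matching lines is the overhead the pre-filter version below avoids paying on every line.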

This is, at the moment, about a single regular expression. If I alter the Python version to search for a string literal first – and only run the regular expression on lines that are likely candidates:

import re

def parseTraceFile(filepath):
    connectregex = re.compile(r"([^T]*)T([^Z]*)Z.*User is authenticated and authorized.*connect=([^\s]*)\s*client=([^\s]*)\s")

    with open(filepath) as fp:
        line = fp.readline()
        count = 1
        
        while line:
            #print("Line {}: {}".format(count, line.strip()))

            #Occasional proof of life so we don't think it's hung
            if count % 20000 == 0:
                print("Processing line %d" % count)

            count += 1
            findpos = line.find("User is authenticated and authorized", 40)

            if findpos > -1:
                print("Found match")
                connectline = connectregex.search(line)

                if connectline:
                    print("date = %s" % connectline.group(1))
                    print("time = %s" % connectline.group(2))
                    print("connect = %s " % connectline.group(3))
                    print("client= %s" % connectline.group(4))

            line = fp.readline()

    return True

if __name__ == "__main__":

    success = parseTraceFile('/var/tmp/sampletrace.log')
    
    if not success:
        exit(10)
    
    exit(0)

Then this Python version runs at a similar speed to the Rust version.

What conclusions can we draw from a single case like this? Not many – I’m going to experiment with doing this particular analysis in Rust and, if it is not too painful, I’ll probably compare some other cases as well. Even if I wasn’t learning Rust for other reasons, I could still get my Python script to run at a similar speed at the cost of thinking more carefully about optimising the analysis code. That more careful thought does start to eat away at the advantage Python has for me, though – that it’s very fast for writing quick and dirty scripts.

CentOS 7 on an HP MicroServer N36L with an AMD Mobility Radeon HD 5430

I have an aging N36L box I use at home as a file server on our internal network.

Today a yum update in CentOS 7 (it had not been updated for a long time, so there were a lot of packages) caused the graphics to become corrupted and crash (the machine was fine in text-only mode).

I tried a few different things:

  • the latest mainline kernel from elrepo.org
  • Fedora 29 (which rebooted immediately after logging into GNOME and had lots of graphical corruption in XFCE)
  • Eventually I downloaded CentOS 7.3 from the archive and reinstalled using that. It’s a terrible work-around (and I didn’t keep logs or error messages). I used software RAID 1 with LVM on top.

Really this post is just to remind me what my dirty hack was if I need to do it again. I ought to collect logs and diagnostics of the problem and report it upstream.

TEDx Southampton 2016

TEDx Southampton 2016 happened today – the subjects were diverse and the speakers varied in style. I enjoyed most of them but there were two, in particular, that I thought might change my behaviour in future…

Tony Curran was speaking about the carbon footprint of our food. He linked to the XKCD representation of the relative weights of the population of land mammals. If you’ve not seen it before, the small fraction of wild animals might surprise you. Look at the fraction that is cows! Reducing the amount of beef we eat (even if you swap it for chicken) could have a big impact on the planet. He also talked about the importance of eating things in season and of choosing foods that come by land/sea rather than being flown by air or grown in a hot house. He has a handy PDF guide to which foods are in season.

Megan Streb talked about the importance of livable spaces. She extolled the virtues of knowing your neighbours, kids playing outside and communal outdoor spaces with seating. She talked about the possibility of closing streets for parties (or more regularly for kids to play). It was definitely food for thought and it gave me some ideas to think about which is what you hope for when you go to a TED event.

I’ll definitely be keen to go to a 2017 event if one happens.

Today I learned

Day 2 of Winchester Science Festival 2016. It’s an awesome event. If it happens next year: go!

A few random facts that I learned today:

  • The average time between a rabbit being bought in a pet shop and it being taken to a rescue centre is 6 weeks but they live 8-10 years. 
  • Rabbits make two types of poo – one (soft) designed to be eaten – they digest their food twice.
  • I saw an amazing video of a rabbit attacking a snake.
  • Quite how quickly atmospheric CO2 is rising – see ‘the pump handle graph’.
  • In 1800 the UK had a life expectancy of about 40. It was even lower in almost all other countries.
  • At this point we’ll need to do carbon capture (BECCS) to keep climate change to 1.5C.
  • Deadly Nightshade is called Belladonna (‘beautiful woman’) because the berry juice was applied with a feather to dilate ladies’ pupils and make them look more attractive to a partner. The active ingredient, atropine, causes hallucinations, and a larger dose (~1 berry) causes death.
  • Botulinum toxin (botox) is one of the most toxic things known to mankind. A pint could kill all living humans. The lethal dose is about 1 nanogram.
  • The placenta has viral DNA; some of the genes that make us mammals were captured from viruses: mammals were made by viruses.
  • ‘Unqualified’ volunteers have analysed images on the GalaxyZoo website and discovered type Ia supernovae, which are a ‘standard candle’ that helps us measure the distance to other galaxies. You can volunteer time to help a number of science projects at the Zooniverse.

The Zooniverse talk was really inspiring, I’ll definitely be having a play!

A QuickStart guide to PubSub with IBM Internet of Things Foundation (sample C client)

At work, I’ve been working with IBM’s recently launched Internet of Things Foundation (IoTF), both on MessageSight (the underlying MQTT server) and other odds and ends. IoTF allows you to quickly start doing simple pub/sub via the MQTT protocol without even having to register yourself or the apps/devices doing the publishing or the subscribing: you configure apps to connect and messages can immediately start to flow, for free.

As MQTT is used as the protocol, you can use pretty much any programming language (including, but definitely not limited to: JavaScript, Python, Go, C or Java). As an old-skool C hacker I often default to C, and I forget the format for the id that publishers and subscribers need to use. To get started with a sample C app you:

  1. Download the Paho MQTT C client
  2. Extract the libraries (or “make” if you used the source, Luke)
  3. Build the samples (unless the library build also built the samples – see README.md):
    1. gcc stdinpub.c -o stdinpub -I../include -L ../lib -lpaho-mqtt3c -Wl,-rpath,../lib
    2. gcc stdoutsub.c -o stdoutsub -I../include -L ../lib -lpaho-mqtt3c -Wl,-rpath,../lib
  4. Setup a subscriber:

    ./stdoutsub 'iot-2/type/jonexampletype/id/device1/evt/+/fmt/+' --host quickstart.messaging.internetofthings.ibmcloud.com --clientid 'a:quickstart:jonexamplesubber' --verbose
  5. In another terminal start a publishing app:

    ./stdinpub iot-2/evt/status/fmt/json --host quickstart.messaging.internetofthings.ibmcloud.com --clientid 'd:quickstart:jonexampletype:device1' --verbose
  6. Send some messages by typing them into your publisher (as we are publishing on a topic that claims to be JSON, a simple example message might be):
    {"d":{"somenumber":42}}

In this example, the samples are linked against the classic, simpler, synchronous C library; the same Paho bundle contains other (more powerful) variants – see the README.md for more details.

A key to the parts of the client ids and topics used above:

  • Device Type: The type of device publishing, you can make it up and use it to group your devices
  • Device Id: The id of the device publishing, you can make it up. You need the combination of devicetype:deviceid to be unique across all devices connecting to quickstart – when a “duplicate” device connects, any existing device with the same type:id will be disconnected.
  • Event Type: The category of messages being published, again you can make it up. In this case, the subscriber used ‘+’ which means any event type.
  • Message Format: The format of the messages published. You can make it up, but JSON is special: in quickstart you can visualise data from devices published in JSON format, and for registered devices, JSON messages can be recorded by the IoTF for future use.
  • Application Identifier: Identifies the application. It needs to be unique, as duplicates are disconnected in the same way as mentioned for Device Id above.
  • Organisation: See below

In this example we were using the quickstart organisation, which allows you to get going quickly. You know that you won’t be charged (or contacted by a sales person) as you haven’t given any contact details. After the initial thrill of getting going quickly, there are downsides – the device/application client identifiers have to be unique across everyone using quickstart and there is no security for your data. Once you’ve taken your first steps you probably want to register – depending on your needs it can be as cheap as free, and it gives you e.g. access controls, security and the ability to log published messages in the cloud.
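The client id and topic formats used in the commands above follow a fixed pattern, and they are the bit I always forget, so here they are captured as small helpers (a Python sketch – the jonexampletype/device1 names are the same made-up examples used above):

```python
# Quickstart client id and topic formats, matching the sample commands above.

def device_client_id(device_type, device_id, org="quickstart"):
    """Client id for a publishing device: d:<org>:<devicetype>:<deviceid>."""
    return "d:{}:{}:{}".format(org, device_type, device_id)

def app_client_id(app_id, org="quickstart"):
    """Client id for a subscribing application: a:<org>:<appid>."""
    return "a:{}:{}".format(org, app_id)

def device_publish_topic(event_type, fmt="json"):
    """Topic a device publishes on: iot-2/evt/<event>/fmt/<format>."""
    return "iot-2/evt/{}/fmt/{}".format(event_type, fmt)

def app_subscribe_topic(device_type, device_id="+", event_type="+", fmt="+"):
    """Topic an application subscribes on; '+' wildcards a single level."""
    return "iot-2/type/{}/id/{}/evt/{}/fmt/{}".format(
        device_type, device_id, event_type, fmt)
```

For example, device_client_id("jonexampletype", "device1") produces the d:quickstart:jonexampletype:device1 id passed to stdinpub above, and app_subscribe_topic("jonexampletype", "device1") produces the iot-2/type/…/evt/+/fmt/+ topic the subscriber used.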

I plan to eventually do equivalents of this post for other languages (and possibly with a registered account) – let me know in the comments if there’s something you want covered.