Archive for the 'Technology' Category

Shock Result?: Rust faster than Python in one test of file parsing

I don’t think I’d surprise many people if I told them a Rust program was often faster than a Python program, but today I wrote a small “script” that scanned a large log file for a regular expression and was surprised to find the Rust version ran >400x faster than the Python one. Given the amount of log data I’m going to need to analyse over the next few days, this “script” is going to be a Rust program (though it would be possible for me to use an optimised Python version instead).

In my day job, I work on a complex C application. When we are hunting certain classes of bugs, a common technique is to tune the logging/tracing of the application, wait a while until we think the problem has occurred, then comb through the trace to figure out what happened.

Even with tuning the tracing settings, when looking for a rare timing condition, this can mean searching through many gigabytes of log files. I can often use grep, but sometimes I want to say something like “find this line or this line between (a line that looks like a line of type A and one of type B)”, and writing a short program is the easiest thing to do. I’ve used C and Go for parsing the logs in the past, but my default language for this sort of task is Python – it’s easy to write and often programmer time is more important than CPU time.
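To illustrate the sort of thing I mean (with made-up marker patterns rather than anything from the real trace format), that kind of stateful scan might look something like this in Python:

import re

# Hypothetical markers - the real patterns depend on the trace format.
START_RE = re.compile(r"Transaction \d+ started")        # a line "of type A"
END_RE = re.compile(r"Transaction \d+ committed")        # a line "of type B"
INTERESTING_RE = re.compile(r"lock (acquired|released)") # the lines we care about

def scan(filepath):
    inside = False
    with open(filepath) as fp:
        for count, line in enumerate(fp, 1):
            if not inside and START_RE.search(line):
                inside = True
            elif inside and END_RE.search(line):
                inside = False
            elif inside and INTERESTING_RE.search(line):
                print("Line %d: %s" % (count, line.strip()))

if __name__ == "__main__":
    scan('/var/tmp/sampletrace.log')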

This morning I started work on a new script, scanning 200MiB of logs (850,000 lines). The first version of my Python script just tried to match a regular expression against each line. It took over 4 minutes on my sample data, so I made a few tweaks (e.g. I noticed I was compiling the regular expression for each line). It still took over 4 minutes when I fixed that. The Python code in question:

import re

#Takes about 4m23 on my ThinkPad P50

def parseTraceFile(filepath):
    connectregex = re.compile(r"([^T]*)T([^Z]*)Z.*User is authenticated and authorized.*connect=([^\s]*)\s*client=([^\s]*)\s")

    with open(filepath) as fp:
        line = fp.readline()
        count = 1
        
        while line:
            #print("Line {}: {}".format(count, line.strip()))

            #Occasional proof of life so we don't think it's hung
            if count % 20000 == 0:
                print("Processing line %d" % count)

            count += 1

            connectline = connectregex.search(line)

            if connectline:
                print("date = %s" % connectline.group(1))
                print("time = %s" % connectline.group(2))
                print("connect = %s " % connectline.group(3))
                print("client= %s" % connectline.group(4))

            line = fp.readline()

    # Signal success to the caller
    return True

if __name__ == "__main__":

    success = parseTraceFile('/var/tmp/sampletrace.log')
    
    if not success:
        exit(10)
    
    exit(0)

This code takes ~4m23s on my ThinkPad P50. That worried me, as the eventual script will have a lot more regular expressions and will run on a lot more data. To cut a long story short: since I was learning Rust anyway (for other reasons), I thought I’d see how long a Rust version took. A simple version without clever optimisations runs in about 0.6 seconds. Here’s the Rust version (it uses the regex crate, so that needs listing under [dependencies] in Cargo.toml):

use std::fs::File;
use std::io::{prelude::*, BufReader};
use regex::Regex;

fn parse_trace_file(filepath: String) -> std::io::Result<()> {
    let connect_regex: Regex = Regex::new(r"([^T]*)T([^Z]*)Z.*User is authenticated and authorized.*connect=([^\s]*)\s*client=([^\s]*)\s").unwrap();
    
    let file = File::open(filepath)?;
    let reader = BufReader::new(file);
    
    let mut count = 1;

    for lineresult in reader.lines() {
        if let Ok(line) = lineresult {
            //println!("{}", line);
            
            //Proof of life - so we can tell we haven't hung
            if (count % 20000) == 0 {
                println!("Processing line {}", count); 
            }
        
            count += 1;
        
            if let Some(caps) = connect_regex.captures(&line) {
                println!("Found a match.");
                println!("Date    = {}", caps.get(1).map_or("PARSE ERROR", |m| m.as_str()));
                println!("Time    = {}", caps.get(2).map_or("PARSE ERROR", |m| m.as_str()));
                println!("Connect = {}", caps.get(3).map_or("PARSE ERROR", |m| m.as_str()));
                println!("Client  = {}", caps.get(4).map_or("PARSE ERROR", |m| m.as_str()));
            }
        }
    }

    Ok(())
}

fn main() {
    let result = parse_trace_file(String::from("/var/tmp/sampletrace.log"));
    
    if result.is_ok() {
        println!("Yay");
    }
}

Is the Rust version harder to write? Yes, at least for a beginner like me it is, but given the saving in CPU time the trade-off is worth it – especially as I’ll get faster at writing Rust the more I do it.

The difference in execution time surprised me – I naively assumed that because Python’s regular expression library is mature, compiled C code, I’d see an insignificant difference.

This is, at the moment, about a single regular expression. My guess is that Rust’s regex crate spots the literal “User is authenticated and authorized” inside the pattern and uses a fast substring search to reject non-matching lines early; if I make the Python version do something similar by hand – searching for the string literal first and only running the regular expression on lines that are likely candidates:

import re

def parseTraceFile(filepath):
    connectregex = re.compile(r"([^T]*)T([^Z]*)Z.*User is authenticated and authorized.*connect=([^\s]*)\s*client=([^\s]*)\s")

    with open(filepath) as fp:
        line = fp.readline()
        count = 1
        
        while line:
            #print("Line {}: {}".format(count, line.strip()))

            #Occasional proof of life so we don't think it's hung
            if count % 20000 == 0:
                print("Processing line %d" % count)

            count += 1
            findpos = line.find("User is authenticated and authorized", 40)

            if findpos > -1:
                print("Found match")
                connectline = connectregex.search(line)

                if connectline:
                    print("date = %s" % connectline.group(1))
                    print("time = %s" % connectline.group(2))
                    print("connect = %s " % connectline.group(3))
                    print("client= %s" % connectline.group(4))

            line = fp.readline()

    # Signal success to the caller
    return True

if __name__ == "__main__":

    success = parseTraceFile('/var/tmp/sampletrace.log')
    
    if not success:
        exit(10)
    
    exit(0)

Then this Python version runs at a similar speed to the Rust version.

What conclusions can we draw from a single case? Not many – I’m going to experiment with doing this particular analysis in Rust and, if it’s not too painful, I’ll probably compare some other cases as well. If I wasn’t learning Rust for other reasons, though, I could still get my Python script to run at a similar speed at the cost of thinking more carefully about optimising the analysis code. That more careful thought does start to eat away at the advantage Python has for me, though – that it’s very fast for writing quick and dirty scripts.

A QuickStart guide to PubSub with IBM Internet of Things Foundation (sample C client)

At work, I’ve been working with IBM’s recently launched Internet of Things Foundation (IoTF), both on MessageSight (the underlying MQTT server) and other odds and ends. IoTF allows you to quickly start doing simple pub/sub via the MQTT protocol without even having to register yourself or the apps/devices doing the publishing or the subscribing: you configure apps to connect and messages can immediately start to flow, for free.

As MQTT is used as the protocol, you can use pretty much any programming language (including, but definitely not limited to: JavaScript, Python, Go, C or Java). As an old-skool C hacker I often default to C, but I keep forgetting the format of the id that publishers and subscribers need to use. To get started with a sample C app you:

  1. Download the Paho MQTT C client
  2. Extract the libraries (or “make” if you used the source, Luke)
  3. Build the samples (unless building the libraries already built them for you – see README.md):
    1. gcc stdinpub.c -o stdinpub -I../include -L ../lib -lpaho-mqtt3c -Wl,-rpath,../lib
    2. gcc stdoutsub.c -o stdoutsub -I../include -L ../lib -lpaho-mqtt3c -Wl,-rpath,../lib
  4. Set up a subscriber:

    ./stdoutsub 'iot-2/type/jonexampletype/id/device1/evt/+/fmt/+' --host quickstart.messaging.internetofthings.ibmcloud.com --clientid 'a:quickstart:jonexamplesubber' --verbose
  5. In another terminal start a publishing app:

    ./stdinpub iot-2/evt/status/fmt/json --host quickstart.messaging.internetofthings.ibmcloud.com --clientid 'd:quickstart:jonexampletype:device1' --verbose
  6. Send some messages by typing them into your publisher. As we are publishing on a topic that claims to be JSON, a simple example message might be:
    {"d":{"somenumber":42}}

In this example, the samples are linked against the classic, simpler, synchronous C library; the same Paho bundle includes other (more powerful) variants – see the README.md for more details.

A key to the identifiers used in the commands above:

  • Device Type: The type of device publishing; you can make it up and use it to group your devices.
  • Device Id: The id of the device publishing; you can make it up, but the combination devicetype:deviceid needs to be unique across all devices connecting to quickstart… when a “duplicate” device connects, any existing device with the same type:id will be disconnected.
  • Event Type: The category of messages being published; again, you can make it up. In this case the subscriber used ‘+’, which matches any event type.
  • Message Format: The format of the messages published. You can make it up, but JSON is special: in quickstart you can visualise data from devices publishing in JSON format, and for registered devices JSON messages can be recorded by the IoTF for future use.
  • Application Identifier: Identifies the application; it needs to be unique, as duplicates are disconnected in the same way as described under Device Id above.
  • Organisation: See below.

In this example we were using the quickstart organisation, which allows you to get going quickly. You know that you won’t be charged (or contacted by a sales person) as you haven’t given any contact details. After the initial thrill of getting going quickly, there are downsides – the device/application client identifiers have to be unique across everyone using quickstart and there is no security for your data. Once you’ve taken your first steps you probably want to register – depending on your needs it can be as cheap as free, but it gives you e.g. access controls, security and the ability to log published messages in the cloud.
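As a taster of how little code this needs outside C too, here’s a minimal, untested sketch of the same quickstart publish using the Paho Python client (assuming the paho-mqtt package and its 1.x Client/connect/publish API), reusing the device type, device id and topic from the C example above:

import json
import paho.mqtt.client as mqtt

# Assumes the paho-mqtt 1.x API.
# Client id format is the same as the C sample: d:quickstart:<devicetype>:<deviceid>
client = mqtt.Client(client_id="d:quickstart:jonexampletype:device1")
client.connect("quickstart.messaging.internetofthings.ibmcloud.com", 1883)
client.loop_start()

# Publish a JSON event of type "status" on iot-2/evt/<eventtype>/fmt/<format>
msginfo = client.publish("iot-2/evt/status/fmt/json",
                         json.dumps({"d": {"somenumber": 42}}))
msginfo.wait_for_publish()

client.loop_stop()
client.disconnect()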

I plan to eventually do equivalents of this post for other languages (and possibly with a registered account) – let me know in the comments if there’s something you want covered.

Logitech C300 webcam on Linux

The Logitech C300 webcam worked fine with Fedora 15 as it shipped, but a kernel update caused the audio to come out really squeaky/high-pitched/”chipmunk”ed.

I’m not the only one having issues with similar webcams:

  • https://bugs.launchpad.net/ubuntu/+source/linux/+bug/858412
  • https://bugzilla.redhat.com/show_bug.cgi?id=729269
  • https://bbs.archlinux.org/viewtopic.php?id=121607&p=3

Based on the patch in the third link in that list (which covers a few different webcams), I found that the following trivial patch fixed the problem:

diff -uNrp kernel-2.6.40.fc15.orig/drivers/usb/core/quirks.c kernel-2.6.40.fc15.new/drivers/usb/core/quirks.c
--- kernel-2.6.40.fc15.orig/drivers/usb/core/quirks.c 2011-09-27 21:23:58.801051233 +0100
+++ kernel-2.6.40.fc15.new/drivers/usb/core/quirks.c 2011-09-27 21:30:35.184686232 +0100
@@ -44,6 +44,9 @@ static const struct usb_device_id usb_qu
 /* Logitech Webcam C250 */
 { USB_DEVICE(0x046d, 0x0804), .driver_info = USB_QUIRK_RESET_RESUME },
 
+ /* Logitech Webcam C300 */
+ { USB_DEVICE(0x046d, 0x0805), .driver_info = USB_QUIRK_RESET_RESUME },
+
 /* Logitech Webcam C310 */
 { USB_DEVICE(0x046d, 0x081b), .driver_info = USB_QUIRK_RESET_RESUME },
 
If you want to try that before Fedora makes a fix available, it’s easy enough to build your own kernel.

Fedora 13 and old NFS servers

On Fedora 13, the NFS client tries to use NFSv4 by default. When talking to the old (AIX-based) NFS servers at work the mount command failed with:

mount.nfs: Remote I/O error

In order to be able to mount the NFS shares I had to add -o vers=3 to my mount command.
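For example (the server name and export path here are just placeholders), the command ends up looking something like:

mount -t nfs -o vers=3 oldserver:/export/home /mnt/home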

Frustrated by the Digital Economy Bill

Days after the Digital Economy Bill was made law, I’m still incredibly angry and frustrated about the stitch-up that saw both the Tories and Labour rush an extremely bad bill into law, forcing their MPs to vote for it under a three-line whip after a derisory debate.

I understand that with the state of the economy, the deficit and Afghanistan, the Digital Economy Bill might not be the top priority in this election. Apparently, however, it’s important enough to railroad onto the statute book with controversial amendments being added days before becoming law. Watching the passage of the bill was an eye-opening experience….

Everyone should watch a parliamentary debate on something they know about, finance, shipbuilding whatever. See how gvt “works”. #debill

@tonywhitmore on Twitter

This law was not made the way I expected law to be passed. What I expected was neatly summed up by Cameron Neylon:

Representative democracy bases its existence on the assumption that the full community can not be effectively involved in an informed and considered criticism of proposed bills and that it is therefore of value to both place some buffer between raw, and probably ill informed public opinion, and actual decision making. This presumes that MPs, particularly party spokespersons take the time to become expert on the matter of bills they represent.

Instead, what I witnessed in debates in the House of Commons was a scarily ill-informed, rushed mess, typified when it emerged that the “Minister for Digital Britain”, who is in charge of this travesty, thought that the IP in “IP address” stands for “Intellectual Property”.

There were MPs who “got it” from all parties, exemplified by the Labour back-bencher Tom Watson, but only the Lib Dems voted against the bill as a party – and even then, many did not turn up to vote!

I am not angry because I file-share illegal files. I do not. Copyright violation is already an offence. As that well-known bastion of communist hippies, the Telegraph, says:

In the past the lawyers had to go after the infringers, with actual proof. Remember being innocent until proven guilty? That’s out now. Now, the holder of the internet account (Mum, Dad, Granny and the small business that can’t afford the legal fees) will be held to account for what happens over their connection.

Aside from the eye-watering insight into how law is made, I’m cross for a number of reasons:

  • It won’t do what it tries to do. The Telegraph again:

    “In April last year, Sweden’s internet traffic took a dramatic 30 per cent dip as the country’s new anti-file sharing law came into effect. Before this, Sweden, the home of the Pirate Bay, had been a hotbed of illegal trade in movies and music.

    But several months later traffic levels started to surpass the old levels. Consultancy firm Mediavision found that the accessing of illegally shared movies, TV shows and music simply recovered. But there was one crucial difference. Much of the internet traffic was now encrypted.”

  • It will make running a wifi hotspot a massive headache. Small cafes have to decide whether to try to be ISPs, or to be held responsible for copyright infringement on their hotspots and use technological measures to prevent abuse. Good luck finding workable “technological measures” that don’t massively restrict legal use of the internet!
  • It will have a chilling effect on free speech. The Bill has a clause that will allow the secretary of state for business to order the blocking of “a location on the internet which the court is satisfied has been, is being or is likely to be used for or in connection with an activity that infringes copyright” (via The Guardian). It has some safeguards, but there is concern that sites like WikiLeaks (which recently broadcast leaked video footage of American pilots firing on innocent people, including Reuters journalists) could be banned. ISPs will be worried about court costs, which means even the threat of a ban will often get pages removed.

If you’ve made it this far, I’m hoping you agree with some of what I’m saying. So what can we do? A number of things:

  1. Join the Open Rights Group who campaigned against this fiasco.
  2. Find out about your local candidates. For example, which way did they vote on the Digital Economy Bill or did they not even bother to show up for work?
  3. Share info about the candidates with other people who care about internet freedom.
  4. What else? I’m not sure – I’m half tempted to wander down my local high street (Winchester) handing out leaflets summarising this blog entry and pointing out that one of the local candidates (Martin Tod) is a founder member of ORG.

If you have better ideas, let me know in the comments….