This update adds some new options to my tool sortcanon.py, so that it can handle more kinds of input.
A bit of context: sorting a list of IPv4 addresses as text does not give numerical order. Take this list:
Just sorting this gives this result:
The IPv4 address starting with 185 comes first because, by default, sorting is string-based and the digit 1 comes before the digit 3.
With sortcanon, one can provide a Python function that is used to interpret the input and achieve the desired sorting. There are a couple of built-in functions, like ipv4. This is the result:
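The effect is easy to reproduce in plain Python. This is only a small sketch; the addresses are made up for illustration and are not the list from this post:

```python
# Hypothetical sample addresses, chosen only to show the effect.
addresses = ["37.20.0.1", "185.10.0.1", "8.30.0.1"]

# sorted() compares strings character by character,
# so "1..." sorts before "3...", which sorts before "8...".
text_sorted = sorted(addresses)
print(text_sorted)
```

The address starting with 185 ends up first, although it is numerically the largest of the three.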
This time, the IPv4 address starting with 185 comes last, because it has the highest most significant byte.
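As a rough sketch of the idea (this is not sortcanon's actual code), a numeric sort key can be built with Python's standard ipaddress module. The function name ipv4_key and the addresses are assumptions made up for illustration:

```python
import ipaddress

def ipv4_key(line):
    # Hypothetical helper: turn dotted-quad text into an integer,
    # so that comparison is numeric instead of character based.
    return int(ipaddress.IPv4Address(line.strip()))

# Made-up sample addresses.
addresses = ["185.10.0.1", "37.20.0.1", "8.30.0.1"]

numeric_sorted = sorted(addresses, key=ipv4_key)
print(numeric_sorted)
```

With this key, the address starting with 185 sorts last, because its first byte is the largest.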
Recently, I had to sort some files with extra data, like IPv4 addresses with port numbers. Something like this list:
But this did not work:
That is because the function that parses IPv4 addresses does not expect a port number.
I could create a custom function to handle this, but I pursued another solution. I added an option that uses a regular expression to select the part of the line that will be used for sorting. This is done with option -s (select). Like this:
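The same kind of failure can be seen with Python's standard ipaddress module, used here as a stand-in for the tool's parser (that is an assumption, and the sample line is invented):

```python
import ipaddress

# Hypothetical line: an IPv4 address, a space, then a port number.
line = "192.0.2.10 443"

try:
    ipaddress.IPv4Address(line)
    parsed = True
except ValueError:
    # ipaddress.AddressValueError is a subclass of ValueError.
    parsed = False

print(parsed)
```

The parser rejects the whole line, because the last octet is followed by text it does not understand.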
Regular expression “^([^ ]+) ” selects all characters from the beginning of the line (^) until the first space character (excluded). This selection is stored in a capture group (), and the ipv4 sorting function takes this capture group as input, instead of the complete line.
The list I selected as an example has some duplicate IPv4 addresses:
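To illustrate how such a selection works, here is a minimal sketch with Python's re module. It is not the tool's implementation: the helper name selected_ipv4_key, the fallback to the complete line when nothing matches, and the sample lines are all assumptions.

```python
import ipaddress
import re

# Same regular expression as above: capture everything up to the first space.
SELECT = re.compile(r"^([^ ]+) ")

def selected_ipv4_key(line):
    # Hypothetical helper: sort on capture group 1 instead of the whole line.
    match = SELECT.search(line)
    # Assumption: fall back to the complete line when the regex does not match.
    selected = match.group(1) if match else line
    return int(ipaddress.IPv4Address(selected))

# Made-up "address port" lines.
lines = ["185.10.0.1 80", "37.20.0.1 443", "8.30.0.1 22"]

sorted_lines = sorted(lines, key=selected_ipv4_key)
print(sorted_lines)
```

The complete lines are kept in the output; only the comparison uses the selected part.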
If I use option -u (unique), duplicate lines are removed:
But of course the lines with identical IPv4 address 53… remain, because the lines themselves are different (different port number).
This is the desired result most of the time. But I had an exceptional case where I had to drop duplicate IPv4 addresses but still keep one port number. This can be done with option --selectoptions u:
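The difference between the two kinds of uniqueness can be sketched in plain Python. This is an illustration only, not the tool's code: the helper names and sample lines are invented, and keeping the first line seen for each address is an assumption about which port number survives.

```python
import ipaddress
import re

SELECT = re.compile(r"^([^ ]+) ")

def selected(line):
    # Part of the line used for sorting and for uniqueness (capture group 1).
    return SELECT.search(line).group(1)

def sort_key(line):
    return int(ipaddress.IPv4Address(selected(line)))

def unique_lines(lines):
    # Uniqueness on the complete line: only exact duplicates are dropped.
    return list(dict.fromkeys(lines))

def unique_by_selection(lines):
    # Uniqueness on the selected part: keep the first line for each address.
    seen = set()
    kept = []
    for line in lines:
        key = selected(line)
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept

# Made-up "address port" lines with duplicates.
lines = [
    "198.51.100.7 80",
    "198.51.100.7 443",
    "192.0.2.10 22",
    "192.0.2.10 22",
]

by_line = sorted(unique_lines(lines), key=sort_key)
by_selection = sorted(unique_by_selection(lines), key=sort_key)
print(by_line)
print(by_selection)
```

In the first result, both ports of 198.51.100.7 remain; in the second, each address appears once with a single port number.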
sortcanon_V0_0_3.zip (http)