Sort blocks of text in files

Ever had to sort a file alphabetically, only to realize that you’d have to do it manually because every item that needs to be sorted is spread over more than one line? This just happened when I exported my Gmail contacts to vCard, which it turned out were sorted by formatted name (FN) instead of name (N). The result was the following script, which takes two pattern and some input, and returns the sorted output. The example returned by ./sort_blocks.py --help is exactly the code to re-sort Gmail contacts. I’d love to know if you find any bugs or possible improvements to this script. Enjoy:

#! /usr/bin/env python
# -*- coding: utf-8 -*-
## Copyright (C) 2009 CERN.
##
## Sort any multi-line block text
##
## This file is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## CDS Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with CDS Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.

"""sort_blocks.py - Multiline sort of standard input

Default syntax:

./sort_blocks.py -b 'pattern' -s 'pattern' < input_file > result_file

Options:
-v,--verbose    Verbose mode
-h,--help       Print this message
-b,--bp         Block pattern (dotall multiline); used to extract blocks
-s,--sp         Sort pattern (dotall multiline); extracted to sort blocks

Example:

./sort_blocks.py -b 'BEGIN:VCARD.*?END:VCARD\\r\\n' -s '^N:(.*)$' \
< contacts.vcf > contacts2.vcf

Orders vCards in contacts.vcf by name, and puts the results in contacts2.vcf."""

import getopt
import re
import sys

class Usage(Exception):
    """Raise in case of invalid parameters"""
    def __init__(self, msg):
        self.msg = msg

def _compare_pattern(sort_pattern, text1, text2):
    """Function to sort by regex"""
    matches = [
        re.search(sort_pattern, text, re.DOTALL | re.MULTILINE)
        for text in [text1, text2]]
    text_matches = []
    for match in matches:
        if match is None:
            text_matches.append('')
        else:
            text_matches.append(match.group(1))

    return cmp(text_matches[0], text_matches[1])

def split_and_sort(text, block_pattern, sort_pattern):
    """Split into blocks, sort them, and join them up again
    @param text: String of blocks to sort
    @param block_pattern: Regular expression corresponding to the border between
    the blocks
    @param sort_pattern: Gets a subset of each block to sort by"""

    text_blocks = re.findall(block_pattern, text, re.DOTALL | re.MULTILINE)
    #print text_blocks

    text_blocks.sort(lambda x, y: _compare_pattern(sort_pattern, x, y))

    return ''.join(text_blocks)

def main(argv = None):
    """Argument handling"""

    if argv is None:
        argv = sys.argv

    # Defaults
    block_pattern = ''
    sort_pattern = ''

    try:
        try:
            opts, args = getopt.getopt(
                argv[1:],
                'hb:s:',
                ['help', 'bp=', 'sp='])
        except getopt.GetoptError, err:
            raise Usage(err.msg)

        for option, value in opts:
            if option in ('-h', '--help'):
                print(__doc__)
                return 0
            elif option in ('-b', '--bp'):
                block_pattern = value
            elif option in ('-s', '--sp'):
                sort_pattern = value
            else:
                raise Usage('Unhandled option ' % option)

        if block_pattern == '' or sort_pattern == '' or args:
            raise Usage(__doc__)

        text = sys.stdin.read()

        print split_and_sort(text, block_pattern, sort_pattern)

    except Usage, err:
        sys.stderr.write(err.msg + '\n')
        return 2

if __name__ == '__main__':
    sys.exit(main())
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s