vCard 3.0 validator and parser

Did you know that even the vCards listed in the official RFC are not valid? It clearly says “The vCard object MUST contain the FN, N and VERSION types.” Still, both example vCards are missing the N type. As somebody else remarked, releasing a format spec without some reference validator is bound to result in all sorts of invalid implementations.

After searching for a vCard validator without success, I started my own vCard module in Python. It tries to create an object with all the information from a vCard string, and returns what I hope are useful error and warning messages if there’s anything wrong.
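To illustrate the kind of check involved, here’s a minimal sketch that only looks for the three mandatory types. It is not the module’s actual code or API: the function name `missing_mandatory_types` and the sample card are made up for this example, and real validation has to check far more than this.

```python
import re

MANDATORY_TYPES = ('FN', 'N', 'VERSION')

def missing_mandatory_types(vcard_text):
    """Return the mandatory types that never start a content line
    (hypothetical helper, not part of the vcard module)"""
    # Unfold continuation lines: a line break followed by a space or tab
    unfolded = re.sub(r'\r?\n[ \t]', '', vcard_text)
    names = set()
    for line in unfolded.splitlines():
        # Optional "group." prefix, then the type name, which ends at the
        # first ';' (parameters) or ':' (value)
        match = re.match(r'^(?:[A-Za-z0-9-]+\.)?([A-Za-z0-9-]+)[;:]', line)
        if match:
            names.add(match.group(1).upper())
    return [name for name in MANDATORY_TYPES if name not in names]

# Made-up card with FN and VERSION, but no N
sample = 'BEGIN:VCARD\r\nVERSION:3.0\r\nFN:Jane Doe\r\nEND:VCARD\r\n'
print(missing_mandatory_types(sample))
```

For the sample card this reports that only N is missing.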

Update: Added file validation – Now you can validate files with several vCards from the command line.

Install / upgrade:
sudo pip install --upgrade vcard

Validate vCard files:
vcard *.vcf

Sort blocks of text in files

Ever had to sort a file alphabetically, only to realize that you’d have to do it manually because every item that needs to be sorted is spread over more than one line? This just happened when I exported my Gmail contacts to vCard, which turned out to be sorted by formatted name (FN) instead of name (N). The result was the following script, which takes two patterns and some input, and returns the sorted output. The example returned by ./sort_blocks.py --help is exactly the code to re-sort Gmail contacts. I’d love to know if you find any bugs or possible improvements to this script. Enjoy:

#! /usr/bin/env python
# -*- coding: utf-8 -*-
## Copyright (C) 2009 CERN.
##
## Sort any multi-line block text
##
## This file is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## CDS Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with CDS Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.

"""sort_blocks.py - Multiline sort of standard input

Default syntax:

./sort_blocks.py -b 'pattern' -s 'pattern' < input_file > result_file

Options:
-h,--help       Print this message
-b,--bp         Block pattern (dotall multiline); used to extract blocks
-s,--sp         Sort pattern (dotall multiline); extracted to sort blocks

Example:

./sort_blocks.py -b 'BEGIN:VCARD.*?END:VCARD\\r\\n' -s '^N:(.*)$' \
< contacts.vcf > contacts2.vcf

Orders vCards in contacts.vcf by name, and puts the results in contacts2.vcf."""

import getopt
import re
import sys

class Usage(Exception):
    """Raise in case of invalid parameters"""
    def __init__(self, msg):
        self.msg = msg

def _sort_key(sort_pattern, text):
    """Extract the part of a block to sort by; '' if there is no match"""
    match = re.search(sort_pattern, text, re.DOTALL | re.MULTILINE)
    if match is None:
        return ''
    return match.group(1)

def split_and_sort(text, block_pattern, sort_pattern):
    """Split into blocks, sort them, and join them up again
    @param text: String of blocks to sort
    @param block_pattern: Regular expression matching each whole block
    @param sort_pattern: Gets a subset of each block to sort by"""

    text_blocks = re.findall(block_pattern, text, re.DOTALL | re.MULTILINE)

    # A key function works in both Python 2 and 3, unlike cmp-based sorting
    text_blocks.sort(key=lambda block: _sort_key(sort_pattern, block))

    return ''.join(text_blocks)

def main(argv = None):
    """Argument handling"""

    if argv is None:
        argv = sys.argv

    # Defaults
    block_pattern = ''
    sort_pattern = ''

    try:
        try:
            opts, args = getopt.getopt(
                argv[1:],
                'hb:s:',
                ['help', 'bp=', 'sp='])
        except getopt.GetoptError as err:
            raise Usage(err.msg)

        for option, value in opts:
            if option in ('-h', '--help'):
                print(__doc__)
                return 0
            elif option in ('-b', '--bp'):
                block_pattern = value
            elif option in ('-s', '--sp'):
                sort_pattern = value
            else:
                raise Usage('Unhandled option %s' % option)

        if block_pattern == '' or sort_pattern == '' or args:
            raise Usage(__doc__)

        text = sys.stdin.read()

        print(split_and_sort(text, block_pattern, sort_pattern))

    except Usage as err:
        sys.stderr.write(err.msg + '\n')
        return 2

if __name__ == '__main__':
    sys.exit(main())
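To try the idea without the command line, here’s a small self-contained sketch of the same split, sort and join approach, written for Python 3. The function name `sort_blocks` and the two contacts are made up for illustration.

```python
import re

FLAGS = re.DOTALL | re.MULTILINE

def sort_blocks(text, block_pattern, sort_pattern):
    """Extract blocks, sort them by the first group of sort_pattern,
    and join them up again (hypothetical condensed version)"""
    blocks = re.findall(block_pattern, text, FLAGS)

    def sort_key(block):
        match = re.search(sort_pattern, block, FLAGS)
        # Blocks without a match sort first, as an empty string
        return match.group(1) if match else ''

    return ''.join(sorted(blocks, key=sort_key))

# Two made-up contacts, ordered by FN rather than N
cards = (
    'BEGIN:VCARD\r\nFN:Ann Young\r\nN:Young;Ann\r\nEND:VCARD\r\n'
    'BEGIN:VCARD\r\nFN:Zoe Adams\r\nN:Adams;Zoe\r\nEND:VCARD\r\n')
result = sort_blocks(cards, r'BEGIN:VCARD.*?END:VCARD\r\n', r'^N:(.*)$')
print(result)
```

After sorting by N, the card for Adams comes before the card for Young. Note that the block pattern must not contain groups, since re.findall would then return the groups instead of the whole blocks.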

Range arithmetic in Python

The XML 1.0 and 1.1 standards define some ranges of Unicode code points which are valid, and some “compatibility characters” which should not be used. CDS Invenio (a FOSS CMS) already has some code to clean up text to remove invalid characters, but it doesn’t remove the compatibility characters. Using the existing code for HTML 4.01 made the W3C Markup Validation Service complain, so I wanted to exclude the compatibility character ranges from the valid ranges, and get the most concise hexadecimal ranges corresponding to the resulting set to plug into a Python regular expression. Here’s the resulting sloppy and ugly code (I’ll post updated code and/or a link to the source repository if this is included at some point):

# -*- coding: utf-8 -*-
## Copyright (C) 2009 CERN.
##
## This file is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## CDS Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with CDS Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.

"""Creates the minimal set of Unicode character ranges for valid XML 1.0 and 1.1
characters minus the compatibility characters"""

INCLUDE_XML10 = "#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] \
| [#x10000-#x10FFFF]"
EXCLUDE_XML10 = "[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF], \
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], \
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], \
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], \
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], \
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], \
[#x10FFFE-#x10FFFF]"

INCLUDE_XML11 = "[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]"
EXCLUDE_XML11 = "[#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F], \
[#xFDD0-#xFDDF], \
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], \
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], \
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], \
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], \
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], \
[#x10FFFE-#x10FFFF]"

def cleanup(value):
    """Prepare string for conversion to hex ranges
    @param value: String with ranges
    @return: String with ranges"""
    return value.replace('#', '0').replace('[', '').replace(']', '')

def list_to_range(value):
    """Convert a list of strings (ranges and not)
    @param value: List of strings corresponding to hexadecimal numbers and
    ranges
    @return: List of numbers"""
    result = []
    for item in value:
        if item.find('-') == -1:
            result.append(int(item, 16))
        else:
            numbers = [int(hex_str, 16) for hex_str in item.split('-')]
            result.extend(range(numbers[0], numbers[1] + 1))
    return result

def range_minus(include_range, exclude_range):
    """Subtract one range from another
    @param include_range: String from http://www.w3.org/TR/xml/#charsets or
    http://www.w3.org/TR/xml11/#charsets
    @param exclude_range: Ditto
    @return: String with hex numbers and ranges"""
    include_range = cleanup(include_range)
    includes = include_range.split(' | ')

    exclude_range = cleanup(exclude_range)
    excludes = exclude_range.split(', ')

    include_numbers = list_to_range(includes)
    exclude_numbers = list_to_range(excludes)

    # Set difference; lows and highs are sorted so that they pair up in order
    numbers = set(include_numbers) - set(exclude_numbers)
    lows = sorted(
        number for number
        in numbers
        if number - 1 not in numbers)
    highs = sorted(
        number for number
        in numbers
        if number + 1 not in numbers)

    result = zip(lows, highs)

    result_hex = [
        '\\U%0*X-\\U%0*X' % (8, pair[0], 8, pair[1])
        for pair in result]
    result_hex = [
        text.replace('-' + text[:10], '')
        for text in result_hex] # Single ranges

    result_hex = [
        text.replace('\\U0000', '\\u')
        for text in result_hex] # Shorten where possible

    return '\n'.join(result_hex)

print('XML 1.0:\n' + range_minus(INCLUDE_XML10, EXCLUDE_XML10) + '\n')

print('XML 1.1:\n' + range_minus(INCLUDE_XML11, EXCLUDE_XML11))
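The core trick is easier to see on a small input. Here’s a minimal sketch of just the subtract-and-collapse step, with a made-up helper name `to_ranges`: a number starts a range if its predecessor is missing from the set, and ends one if its successor is missing.

```python
def to_ranges(numbers):
    """Collapse a set of integers into sorted (low, high) pairs
    (hypothetical helper for illustration)"""
    numbers = set(numbers)
    lows = sorted(n for n in numbers if n - 1 not in numbers)
    highs = sorted(n for n in numbers if n + 1 not in numbers)
    # There are as many range starts as range ends, so they pair up
    return list(zip(lows, highs))

# A small slice of the real problem: 0x20-0xFF minus two excluded ranges
included = set(range(0x20, 0x100))
excluded = set(range(0x7F, 0x85)) | set(range(0x86, 0xA0))
print(to_ranges(included - excluded))
# [(32, 126), (133, 133), (160, 255)]
```

In hexadecimal that is 20-7E, 85 and A0-FF.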

In case you just want the results, here you go:

XML 1.0:
\u0009-\u000A
\u000D
\u0020-\u007E
\u0085
\u00A0-\uD7FF
\uE000-\uFDCF
\uFDF0-\uFFFD
\U00010000-\U0001FFFD
\U00020000-\U0002FFFD
\U00030000-\U0003FFFD
\U00040000-\U0004FFFD
\U00050000-\U0005FFFD
\U00060000-\U0006FFFD
\U00070000-\U0007FFFD
\U00080000-\U0008FFFD
\U00090000-\U0009FFFD
\U000A0000-\U000AFFFD
\U000B0000-\U000BFFFD
\U000C0000-\U000CFFFD
\U000D0000-\U000DFFFD
\U000E0000-\U000EFFFD
\U000F0000-\U000FFFFD
\U00100000-\U0010FFFD

XML 1.1:
\u0009-\u000A
\u000D
\u0020-\u007E
\u0085
\u00A0-\uD7FF
\uE000-\uFDCF
\uFDE0-\uFFFD
\U00010000-\U0001FFFD
\U00020000-\U0002FFFD
\U00030000-\U0003FFFD
\U00040000-\U0004FFFD
\U00050000-\U0005FFFD
\U00060000-\U0006FFFD
\U00070000-\U0007FFFD
\U00080000-\U0008FFFD
\U00090000-\U0009FFFD
\U000A0000-\U000AFFFD
\U000B0000-\U000BFFFD
\U000C0000-\U000CFFFD
\U000D0000-\U000DFFFD
\U000E0000-\U000EFFFD
\U000F0000-\U000FFFFD
\U00100000-\U0010FFFD
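Here’s one way the ranges could be plugged into a regular expression. This is only a sketch, not the CDS Invenio code: it assumes Python 3.3 or later (where the re module understands \U escapes in patterns), it uses the XML 1.0 results, and the names `build_invalid_pattern` and `strip_invalid` are made up.

```python
import re

# (low, high) pairs copied from the XML 1.0 results above (BMP part)
BMP_RANGES = [
    (0x9, 0xA), (0xD, 0xD), (0x20, 0x7E), (0x85, 0x85),
    (0xA0, 0xD7FF), (0xE000, 0xFDCF), (0xFDF0, 0xFFFD)]
# Planes 1 to 16: everything except the last two code points of each plane
PLANE_RANGES = [
    (plane << 16, (plane << 16) + 0xFFFD) for plane in range(1, 0x11)]

def build_invalid_pattern(ranges):
    """Compile a negated character class matching anything outside ranges"""
    parts = []
    for low, high in ranges:
        if low == high:
            parts.append('\\U%08X' % low)
        else:
            parts.append('\\U%08X-\\U%08X' % (low, high))
    return re.compile('[^' + ''.join(parts) + ']')

INVALID_XML10 = build_invalid_pattern(BMP_RANGES + PLANE_RANGES)

def strip_invalid(text):
    """Remove every character outside the allowed ranges"""
    return INVALID_XML10.sub('', text)

print(repr(strip_invalid('a\x00b\x7fc\ufdd0d')))
```

The NUL, DEL and U+FDD0 characters are removed, leaving only the four letters.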

Unit testing Python + MySQLdb warnings

There seem to be several methods out there, based on elevating warnings to errors using warnings.simplefilter. Here’s another method, based on recording warnings in a variable, and checking that the last one is a MySQLdb.Warning. Hopefully to be integrated in INSPIRE.

import MySQLdb
import unittest
import warnings
[...]
class TestTagInsert(unittest.TestCase):
    def test_too_long_tags(self):
        with warnings.catch_warnings(record=True) as warn:
            warnings.simplefilter('always') # Record every warning, even repeated ones
            # [Run SQL statement]
            self.assertEqual(1, len(warn)) # Ensures that the next statement won't break the testing
            self.assertEqual(
                MySQLdb.Warning,
                warn[-1].category
                )
            #If you also want to check the text of the warning:
            self.assertIn(
                'truncated', str(warn[-1].message))
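To compare the two approaches without a database, here’s a self-contained sketch. Everything MySQL-related is replaced by stand-ins: `FakeDbWarning` and `insert_tag` are made up for this example and only imitate a truncation warning.

```python
import unittest
import warnings

class FakeDbWarning(Warning):
    """Hypothetical stand-in for MySQLdb.Warning"""

def insert_tag(tag):
    """Hypothetical stand-in for an SQL insert that truncates long values"""
    if len(tag) > 10:
        warnings.warn("Data truncated for column 'tag'", FakeDbWarning)
    return tag[:10]

class TestRecordedWarnings(unittest.TestCase):
    def test_too_long_tags(self):
        # Method from this post: record warnings and inspect the last one
        with warnings.catch_warnings(record=True) as warn:
            warnings.simplefilter('always')
            insert_tag('x' * 20)
        self.assertEqual(1, len(warn))
        self.assertIs(warn[-1].category, FakeDbWarning)
        self.assertIn('truncated', str(warn[-1].message))

    def test_elevated_to_error(self):
        # Alternative: turn the warning into an exception
        with warnings.catch_warnings():
            warnings.simplefilter('error', FakeDbWarning)
            self.assertRaises(FakeDbWarning, insert_tag, 'x' * 20)
```

Recording lets the statement run to completion, so you can also check its result. Elevating to an error aborts the statement at the first warning.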