Range arithmetic in Python

The XML 1.0 and 1.1 standards define some ranges of Unicode code points which are valid, and some “compatibility characters” which should not be used. CDS Invenio (a FOSS CMS) already has some code to clean up text to remove invalid characters, but it doesn’t remove the compatibility characters. Using the existing code for HTML 4.01 made the W3C Markup Validation Service complain, so I wanted to exclude the compatibility character ranges from the valid ranges, and get the most concise hexadecimal ranges corresponding to the resulting set to plug into a Python regular expression. Here’s the resulting sloppy and ugly code (I’ll post updated code and/or a link to the source repository if this is included at some point):

# -*- coding: utf-8 -*-
## Copyright (C) 2009 CERN.
##
## This file is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## CDS Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with CDS Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.

"""Creates the minimal set of Unicode character ranges for valid XML 1.0 and 1.1
characters minus the compatibility characters"""

INCLUDE_XML10 = "#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] \
| [#x10000-#x10FFFF]"
EXCLUDE_XML10 = "[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF], \
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], \
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], \
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], \
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], \
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], \
[#x10FFFE-#x10FFFF]"

INCLUDE_XML11 = "[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]"
EXCLUDE_XML11 = "[#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F], \
[#xFDD0-#xFDDF], \
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], \
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], \
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], \
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], \
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], \
[#x10FFFE-#x10FFFF]"

def cleanup(value):
    """Prepare string for conversion to hex ranges
    @param value: String with ranges
    @return: String with ranges"""
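    # str.translate(None, ...) only exists in Python 2; chained replace()
    # calls behave the same way in Python 2 and 3
    return value.replace('#', '0').replace('[', '').replace(']', '')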

def list_to_range(value):
    """Convert a list of strings (ranges and not)
    @param value: List of strings corresponding to hexadecimal numbers and
    ranges
    @return: List of numbers"""
    result = []
    for item in value:
        if item.find('-') == -1:
            result.append(int(item, 16))
        else:
            numbers = [int(hex_str, 16) for hex_str in item.split('-')]
            result.extend(range(numbers[0], numbers[1] + 1))
    return result

def range_minus(include_range, exclude_range):
    """Subtract one range from another
    @param include_range: String from http://www.w3.org/TR/xml/#charsets or
    http://www.w3.org/TR/xml11/#charsets
    @param exclude_range: Ditto
    @return: String with hex numbers and ranges"""
    include_range = cleanup(include_range)
    includes = include_range.split(' | ')

    exclude_range = cleanup(exclude_range)
    excludes = exclude_range.split(', ')

    include_numbers = list_to_range(includes)
    exclude_numbers = list_to_range(excludes)

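    # Membership tests against a set are much faster than against a list
    exclude_set = set(exclude_numbers)
    numbers = set(
        number for number
        in include_numbers
        if number not in exclude_set)
    # Sets are unordered, so sort the boundaries before pairing them up
    lows = sorted(
        number for number
        in numbers
        if number - 1 not in numbers)
    highs = sorted(
        number for number
        in numbers
        if number + 1 not in numbers)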

    result = zip(lows, highs)

    result_hex = [
        '\\U%0*X-\\U%0*X' % (8, pair[0], 8, pair[1])
        for pair in result]
    result_hex = [
        text.replace('-' + text[:10], '')
        for text in result_hex] # Single ranges

    result_hex = [
        text.replace('\\U0000', '\\u')
        for text in result_hex] # Shorten where possible

    return '\n'.join(result_hex)

# Parenthesized single-argument print works in both Python 2 and 3
print('XML 1.0:\n' + range_minus(INCLUDE_XML10, EXCLUDE_XML10) + '\n')

print('XML 1.1:\n' + range_minus(INCLUDE_XML11, EXCLUDE_XML11))
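
The code above enumerates every single code point, which is simple but wasteful. As a rough alternative sketch (not the code used for the results below), the same subtraction can be done directly on (low, high) pairs. It assumes Python 3 and that the ranges have already been parsed into inclusive integer tuples; the names merge_ranges, subtract_ranges, XML10_INCLUDES and XML10_EXCLUDES are made up for this illustration.

```python
def merge_ranges(ranges):
    """Sort inclusive (low, high) pairs and merge overlapping or adjacent ones."""
    merged = []
    for low, high in sorted(ranges):
        if merged and low <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], high))
        else:
            merged.append((low, high))
    return merged


def subtract_ranges(includes, excludes):
    """Return the parts of `includes` that are not covered by `excludes`."""
    excludes = merge_ranges(excludes)
    result = []
    for low, high in merge_ranges(includes):
        start = low  # first code point of this range not yet handled
        for ex_low, ex_high in excludes:
            if ex_high < start:
                continue  # exclusion ends before the remaining part
            if ex_low > high:
                break  # exclusion starts after this range
            if ex_low > start:
                result.append((start, ex_low - 1))
            start = ex_high + 1
            if start > high:
                break
        if start <= high:
            result.append((start, high))
    return result


# XML 1.0 ranges from the strings above, written out as integer pairs
XML10_INCLUDES = [(0x9, 0x9), (0xA, 0xA), (0xD, 0xD), (0x20, 0xD7FF),
                  (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]
XML10_EXCLUDES = [(0x7F, 0x84), (0x86, 0x9F), (0xFDD0, 0xFDEF)] + [
    (plane * 0x10000 + 0xFFFE, plane * 0x10000 + 0xFFFF)
    for plane in range(1, 17)]

xml10 = subtract_ranges(XML10_INCLUDES, XML10_EXCLUDES)
for low, high in xml10:
    print('%X' % low if low == high else '%X-%X' % (low, high))
```

Since this only touches the range boundaries, the work depends on the number of ranges rather than on the million or so code points they cover.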

In case you just want the results, here you go:

XML 1.0:
\u0009-\u000A
\u000D
\u0020-\u007E
\u0085
\u00A0-\uD7FF
\uE000-\uFDCF
\uFDF0-\uFFFD
\U00010000-\U0001FFFD
\U00020000-\U0002FFFD
\U00030000-\U0003FFFD
\U00040000-\U0004FFFD
\U00050000-\U0005FFFD
\U00060000-\U0006FFFD
\U00070000-\U0007FFFD
\U00080000-\U0008FFFD
\U00090000-\U0009FFFD
\U000A0000-\U000AFFFD
\U000B0000-\U000BFFFD
\U000C0000-\U000CFFFD
\U000D0000-\U000DFFFD
\U000E0000-\U000EFFFD
\U000F0000-\U000FFFFD
\U00100000-\U0010FFFD

XML 1.1:
\u0009-\u000A
\u000D
\u0020-\u007E
\u0085
\u00A0-\uD7FF
\uE000-\uFDCF
\uFDE0-\uFFFD
\U00010000-\U0001FFFD
\U00020000-\U0002FFFD
\U00030000-\U0003FFFD
\U00040000-\U0004FFFD
\U00050000-\U0005FFFD
\U00060000-\U0006FFFD
\U00070000-\U0007FFFD
\U00080000-\U0008FFFD
\U00090000-\U0009FFFD
\U000A0000-\U000AFFFD
\U000B0000-\U000BFFFD
\U000C0000-\U000CFFFD
\U000D0000-\U000DFFFD
\U000E0000-\U000EFFFD
\U000F0000-\U000FFFFD
\U00100000-\U0010FFFD
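
To show how such ranges might be plugged into a regular expression, here is a minimal sketch using the XML 1.0 results above. It assumes Python 3.3 or later, where the re module understands \U escapes in string patterns; the names XML10_VALID, char_class and strip_invalid are invented for this example and are not part of CDS Invenio.

```python
import re

# The XML 1.0 results above as inclusive (low, high) code point pairs
XML10_VALID = [
    (0x9, 0xA), (0xD, 0xD), (0x20, 0x7E), (0x85, 0x85), (0xA0, 0xD7FF),
    (0xE000, 0xFDCF), (0xFDF0, 0xFFFD),
] + [(plane * 0x10000, plane * 0x10000 + 0xFFFD) for plane in range(1, 17)]


def char_class(ranges, negate=False):
    """Build a regex character class from (low, high) code point pairs."""
    parts = []
    for low, high in ranges:
        if low == high:
            parts.append('\\U%08X' % low)
        else:
            parts.append('\\U%08X-\\U%08X' % (low, high))
    return '[%s%s]' % ('^' if negate else '', ''.join(parts))


# Matches any character that is NOT in the valid ranges
INVALID_XML10 = re.compile(char_class(XML10_VALID, negate=True))


def strip_invalid(text):
    """Remove characters outside the valid XML 1.0 ranges."""
    return INVALID_XML10.sub('', text)


print(repr(strip_invalid('a\x00b\x7fc\ufdd0d\x85')))
```

Negating the class of valid characters keeps the pattern short: anything the class does not list gets removed.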

Convert XHTML to HTML with XSLT

After fiddling a bit with the “copy-no-ns” XSLT template, I’ve ended up with a style sheet which converts XHTML to HTML 4.01, so you can use it as a post-processing step when serving to Internet Explorer. Note that this has not been tested with alternative namespaces such as SVG or MathML.

Edit: After moving to PHP 5 and libxslt, it was necessary to trim the xmlns declarations down a bit. The new version is online now.

Edit 2: I got a bit of a surprise when reading the W3C recommendations for declaring encoding and MIME type. The new version is online, but you must provide the content type at run time (using XSLTProcessor::setParameter in PHP 5). Of course, you can just ignore that and specify your own if it’s static.

TED.com bloat

If you’re a TED.com user, I’m pretty sure you’ve noticed the slow page loads compared to … Well, just about any other site out there. I’ve sent some feedback (below), and I’m hoping you’ll help out as well by suggesting general and specific improvements.

Hello,

While your web site is one of the best content collections I’ve ever come across, the style sheets / scripts are so huge as to require the full attention of a Pentium IV 3 GHz CPU for several seconds for every page displayed. 122 KB of CSS and 259 KB of JavaScript is massive, even today.

As a first fix, I’d suggest using some of the online tools to compress CSS and JavaScript. Also, with 8 years of web development behind me (3 professionally), I’m confident that you can reduce the amount by an order of magnitude without losing the overall look and feel of the site.

Thank you for your time and magnificent content!

PS: I’ve asked for feedback, and I’ll post it here if I receive any.

Confessions of an ex(?) newbie

Months ago it hit me that I should properly ask forgiveness for my crimes committed against the IT community. I have, in no particular order:

  • Asked for help before searching.
  • Filed bugs with too little information.
  • Been dead sure of the source of the bug and completely wrong.
  • Used
    noob
    text
    “techniques”
    in
    chats
    At least I never used FUCKING COLORED CAPS.
  • Participated in newsgroup flame wars.
  • Used frames on my website. *Shiver*
  • Vented frustration in bug reports.
  • Sent emails without reviewing content and formatting.

Job trends in web development

The job search service Indeed has an interesting “trends” search engine: It visualizes the number of job postings matching your keywords over the last year. Let’s see if there is some interesting information for modern web technologies there…

XHTML vs. HTML

The relation between XHTML and HTML [Graph: Relative popularity of XHTML and HTML in job offers] could be attributed to a number of factors:

  • XHTML is just not popular yet (1 Google result for every 19 on HTML).
  • The transition from HTML to XHTML is so simple as to be ignored.
  • The terms are confused, and HTML is the most familiar one.
  • XHTML is thought to be the same as HTML, or a subset of it.

The XHTML graph alone [Graph: Popularity of XHTML in job offers] could give us a hint as to where we stand: At about 1/100 of the “popularity” of HTML, it’s increasing linearly. At the same time, HTML has had an insignificant increase, with a spike in the summer months (it is interesting to note that this spike did not occur for XHTML). XHTML could be poised for exponential growth, taking over from HTML, but only time will tell.

AJAX

This is an interesting graph [Graph: Popularity of AJAX in job offers]: It grows exponentially, which is likely to be a result of all the buzz created by Google getting on the Web 2.0 bandwagon. Curiously, the growth rate doesn’t match that of the term “web 2.0” [Graph: Relative popularity of AJAX and "Web 2.0" in job offers]. Attempting to match it with other Web 2.0 terms such as “RSS” [Graph: Relative popularity of AJAX and RSS in job offers], “JavaScript” [Graph: Relative popularity of AJAX and JavaScript in job offers], and “DOM” [Graph: Relative popularity of AJAX and DOM in job offers] also failed. The fact that AJAX popularity seems to be irrelevant to Web 2.0 and even JavaScript popularity is interesting, but I’ll leave the creation of predictions from this as an exercise for the readers. :)

CSS

While insignificant when compared to HTML [Graph: Relative popularity of HTML and CSS in job offers], the popularity of CSS closely follows that of XHTML [Graph: Relative popularity of XHTML and CSS in job offers]. Based on that and the oodles of best practices out there cheering CSS and XHTML on, I predict the following: When CSS is recognized for its power to reduce bandwidth use and web design costs, it’ll drag XHTML up with it as a means to create semantic markup which can be used with other XML technologies, such as XSLT and RSS / Atom.

Discussion of conclusions

The job search seems to be only in the U.S., so the international numbers may be very different. I doubt that, however, based on how irrelevant borders are on the Web.

The occurrence of these terms will be slowed by such factors as how long it takes for the people in charge to notice them, understand their value / potential, and finally find areas of the business which need those skills.

Naturally, results will be skewed by buzz, large scale market swings, implicit knowledge (if you know XHTML, you also know HTML), and probably another 101 factors I haven’t thought of. So please take the conclusions with a grain of salt.

My conclusions are often based on a bell-shaped curve of lifetime popularity, according to an article / book I read years ago. I can’t find the source, but it goes something like this:

  1. Approximately linear growth as early adopters are checking it out.
  2. Exponential growth as less tech savvy people catch on; buzz from tech news sources.
  3. Stabilization because of market saturation and / or buzz wearing off.
  4. Exponential decline when made obsolete by other technology.
  5. Approximately linear decline as the technology falls into obscurity.

PS: For some proof that any web service such as Indeed should be taken with a grain of salt, try checking out the result for George Carlin’s seven dirty words [Graph: Relative popularity of George Carlin's seven dirty words in job offers] ;)