Range arithmetic in Python

The XML 1.0 and 1.1 standards define some ranges of Unicode code points which are valid, and some “compatibility characters” which should not be used. CDS Invenio (a FOSS CMS) already has some code to clean up text by removing invalid characters, but it doesn’t remove the compatibility characters. Using the existing code for HTML 4.01 made the W3C Markup Validation Service complain, so I wanted to exclude the compatibility character ranges from the valid ranges, and get the most concise hexadecimal ranges corresponding to the resulting set to plug into a Python regular expression. Here’s the resulting sloppy and ugly code (I’ll post updated code and/or a link to the source repository if this is included at some point):

# -*- coding: utf-8 -*-
## Copyright (C) 2009 CERN.
##
## This file is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## CDS Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with CDS Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.

"""Creates the minimal set of Unicode character ranges for valid XML 1.0 and 1.1
characters minus the compatibility changes"""

INCLUDE_XML10 = "#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] \
| [#x10000-#x10FFFF]"
EXCLUDE_XML10 = "[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF], \
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], \
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], \
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], \
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], \
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], \
[#x10FFFE-#x10FFFF]"

INCLUDE_XML11 = "[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]"
EXCLUDE_XML11 = "[#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F], \
[#xFDD0-#xFDDF], \
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], \
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], \
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], \
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], \
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], \
[#x10FFFE-#x10FFFF]"

def cleanup(value):
    """Prepare string for conversion to hex ranges
    @param value: String with ranges
    @return: String with ranges"""
    return value.replace('#', '0').translate(None, '[]')

def list_to_range(value):
    """Convert a list of strings (ranges and not)
    @param value: List of strings corresponding to hexadecimal numbers and
    ranges
    @return: List of numbers"""
    result = []
    for item in value:
        if item.find('-') == -1:
            result.append(int(item, 16))
        else:
            numbers = [int(hex_str, 16) for hex_str in item.split('-')]
            result.extend(range(numbers[0], numbers[1] + 1))
    return result

def range_minus(include_range, exclude_range):
    """Subtract one range from another
    @param include_range: String from http://www.w3.org/TR/xml/#charsets or
    http://www.w3.org/TR/xml11/#charsets
    @param exclude_range: Ditto
    @return: String with hex numbers and ranges"""
    include_range = cleanup(include_range)
    includes = include_range.split(' | ')

    exclude_range = cleanup(exclude_range)
    excludes = exclude_range.split(', ')

    include_numbers = list_to_range(includes)
    exclude_numbers = list_to_range(excludes)

    # Set difference is far faster than testing membership in a list
    numbers = set(include_numbers) - set(exclude_numbers)
    # Sort both lists so that the nth low pairs with the nth high when
    # zipped below; set iteration order is not guaranteed
    lows = sorted(
        number for number
        in numbers
        if number - 1 not in numbers)
    highs = sorted(
        number for number
        in numbers
        if number + 1 not in numbers)

    result = zip(lows, highs)

    result_hex = [
        '\\U%08X-\\U%08X' % (low, high)
        for low, high in result]
    result_hex = [
        text.replace('-' + text[:10], '')
        for text in result_hex] # Single ranges

    result_hex = [
        text.replace('\\U0000', '\\u')
        for text in result_hex] # Shorten where possible

    return '\n'.join(result_hex)

print 'XML 1.0:\n' + range_minus(INCLUDE_XML10, EXCLUDE_XML10) + '\n'

print 'XML 1.1:\n' + range_minus(INCLUDE_XML11, EXCLUDE_XML11)

In case you just want the results, here you go:

XML 1.0:
\u0009-\u000A
\u000D
\u0020-\u007E
\u0085
\u00A0-\uD7FF
\uE000-\uFDCF
\uFDF0-\uFFFD
\U00010000-\U0001FFFD
\U00020000-\U0002FFFD
\U00030000-\U0003FFFD
\U00040000-\U0004FFFD
\U00050000-\U0005FFFD
\U00060000-\U0006FFFD
\U00070000-\U0007FFFD
\U00080000-\U0008FFFD
\U00090000-\U0009FFFD
\U000A0000-\U000AFFFD
\U000B0000-\U000BFFFD
\U000C0000-\U000CFFFD
\U000D0000-\U000DFFFD
\U000E0000-\U000EFFFD
\U000F0000-\U000FFFFD
\U00100000-\U0010FFFD

XML 1.1:
\u0009-\u000A
\u000D
\u0020-\u007E
\u0085
\u00A0-\uD7FF
\uE000-\uFDCF
\uFDE0-\uFFFD
\U00010000-\U0001FFFD
\U00020000-\U0002FFFD
\U00030000-\U0003FFFD
\U00040000-\U0004FFFD
\U00050000-\U0005FFFD
\U00060000-\U0006FFFD
\U00070000-\U0007FFFD
\U00080000-\U0008FFFD
\U00090000-\U0009FFFD
\U000A0000-\U000AFFFD
\U000B0000-\U000BFFFD
\U000C0000-\U000CFFFD
\U000D0000-\U000DFFFD
\U000E0000-\U000EFFFD
\U000F0000-\U000FFFFD
\U00100000-\U0010FFFD
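To actually use these ranges, they go into a negated character class. Here’s a minimal sketch of the idea (mine, not part of the script above; Python 3 syntax, and the supplementary planes between 2 and F are elided for brevity, so characters in those planes would be wrongly stripped):

```python
import re

# The XML 1.0 valid-minus-compatibility ranges printed above, compiled
# into a negated character class that matches anything *outside* them.
# Planes 2 through F are elided here to keep the sketch short.
INVALID_XML10 = re.compile(
    '[^\u0009\u000A\u000D\u0020-\u007E\u0085\u00A0-\uD7FF'
    '\uE000-\uFDCF\uFDF0-\uFFFD'
    '\U00010000-\U0001FFFD'
    '\U00100000-\U0010FFFD]')

def strip_invalid_xml10(text):
    """Remove characters outside the XML 1.0 ranges listed above."""
    return INVALID_XML10.sub('', text)
```

A single `sub` call then cleans a whole document in one pass, which is exactly why I wanted the ranges in their most concise form.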

Convert XHTML to HTML with XSLT

After fiddling a bit with the “copy-no-ns” XSLT template, I’ve ended up with a style sheet which converts XHTML to HTML 4.01, so you can use it as a post-processing step when serving to Internet Explorer. Note that this has not been tested with alternative namespaces such as SVG or MathML.
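The core of such a conversion can be sketched roughly like this (my illustration, not the actual stylesheet linked above; attribute and head handling are simplified):

```xml
<?xml version="1.0"?>
<!-- Sketch of the "copy-no-ns" idea: recreate every XHTML element
     without its namespace, and let xsl:output emit an HTML 4.01
     doctype and HTML serialization rules. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xsl:output method="html"
      doctype-public="-//W3C//DTD HTML 4.01//EN"
      doctype-system="http://www.w3.org/TR/html4/strict.dtd"/>
  <xsl:template match="xhtml:*">
    <xsl:element name="{local-name()}">
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
```

The `method="html"` output takes care of HTML quirks like unclosed `<br>` tags, which is most of what Internet Explorer chokes on.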

Edit: After moving to PHP 5 and libxslt, it was necessary to trim the xmlns declarations down a bit. The new version is online now.

Edit 2: I got a bit of a surprise when reading the W3C recommendations for declaring encoding and MIME type. The new version is online, but you must provide the content type at run time (using XSLTProcessor::setParameter in PHP 5). Of course, you can just ignore that and specify your own if it’s static.

TED.com bloat

If you’re a TED.com user, I’m pretty sure you’ve noticed the slow page loads compared to … Well, just about any other site out there. I’ve sent some feedback (below), and I’m hoping you’ll help out as well by suggesting general and specific improvements.

Hello,

While your web site hosts one of the best content collections I’ve ever come across, the style sheets / scripts are so huge as to require the full attention of a Pentium IV 3 GHz CPU for several seconds for every page displayed. 122 KB of CSS and 259 KB of JavaScript is massive, even today.

As a first fix, I’d suggest using some of the online tools to compress CSS and JavaScript. Also, with 8 years of web development behind me (3 professionally), I’m confident that you can reduce the amount by an order of magnitude without losing the overall look and feel of the site.
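The kind of compression those tools do is not rocket science, by the way. A toy sketch of my own (real minifiers such as the YUI Compressor do much more, and more safely):

```python
import re

def minify_css(css):
    """Naive CSS minifier: strip comments, collapse whitespace,
    and drop spaces around punctuation."""
    # Remove /* ... */ comments
    css = re.sub(r'/\*.*?\*/', '', css, flags=re.DOTALL)
    # Collapse all runs of whitespace to a single space
    css = re.sub(r'\s+', ' ', css)
    # Drop spaces around braces, semicolons, colons and commas
    css = re.sub(r'\s*([{};:,])\s*', r'\1', css)
    return css.strip()
```

Even something this crude typically shaves a good fraction off hand-written style sheets.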

Thank you for your time and magnificent content!

PS: I’ve asked for feedback, and I’ll post it here if I receive any.

Confessions of an ex(?) newbie

Months ago it hit me that I should ask proper forgiveness for my crimes committed against the IT community. I have, in no particular order:

  • Asked for help before searching.
  • Filed bugs with too little information.
  • Been dead sure of the source of the bug and completely wrong.
  • Used
    noob
    text
    “techniques”
    in
    chats
    At least I never used FUCKING COLORED CAPS.
  • Participated in newsgroup flame wars.
  • Used frames on my website. *Shiver*
  • Vented frustration in bug reports.
  • Sent emails without reviewing content and formatting.

Job trends in web development

The job search service Indeed has an interesting “trends” search engine: It visualizes the amount of job postings matching your keywords the last year. Let’s see if there is some interesting information for modern web technologies there…

XHTML vs. HTML

The relative popularity of XHTML and HTML in job offers could be attributed to a number of factors:

  • XHTML is just not popular yet (1 Google result for every 19 on HTML).
  • The transition from HTML to XHTML is so simple as to be ignored.
  • The terms are confused, and HTML is the most familiar one.
  • XHTML is thought to be the same as HTML, or a subset of it.

The XHTML graph alone could give us a hint as to where we stand: at about 1/100 of the “popularity” of HTML, it’s increasing linearly. At the same time, HTML has had an insignificant increase, with a spike in the summer months (interestingly, this spike did not occur for XHTML). XHTML could be poised for exponential growth, taking over for HTML, but only time will tell.

AJAX

The AJAX graph is an interesting one: it grows exponentially, which is likely a result of all the buzz created by Google getting on the Web 2.0 bandwagon. Curiously, the growth rate doesn’t match that of the term “Web 2.0”. Attempting to match it with other Web 2.0 terms such as “RSS”, “JavaScript”, and “DOM” also failed. The fact that AJAX popularity seems unrelated to Web 2.0 and even JavaScript popularity is interesting, but I’ll leave the creation of predictions from this as an exercise for the readers. :)

CSS

While insignificant when compared to HTML, the popularity of CSS closely follows that of XHTML. Based on that and the oodles of best practices out there cheering CSS and XHTML on, I predict the following: when CSS is recognized for its power to reduce bandwidth use and web design costs, it’ll drag XHTML up with it as a means to create semantic markup which can be used with other XML technologies, such as XSLT and RSS / Atom.

Discussion of conclusions

The job search seems to cover only the U.S., so the international numbers may be very different. I doubt that, however, given how irrelevant borders are on the Web.

The occurrence of these terms will be slowed by factors such as how long it takes for the people in charge to notice them, understand their value / potential, and finally find the areas of the business which need those skills.

Naturally, results will be skewed by buzz, large-scale market swings, implicit knowledge (if you know XHTML, you also know HTML), and probably another 101 factors I haven’t thought of. So please take the conclusions with a grain of salt.

My conclusions are often based on a bell-shaped curve of lifetime popularity, according to an article / book I read years ago. I can’t find the source, but it goes something like this:

  1. Approximately linear growth as early adopters are checking it out.
  2. Exponential growth as less tech savvy people catch on; buzz from tech news sources.
  3. Stabilization because of market saturation and / or buzz wearing off.
  4. Exponential decline when made obsolete by other technology.
  5. Approximately linear decline as the technology falls into obscurity.
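The five phases above can be illustrated with a toy model of my own (not from the article I half-remember): logistic adoption multiplied by logistic obsolescence gives exactly that bell-ish lifetime curve. All parameter names and values are made up for illustration.

```python
import math

def popularity(t, t_adopt=3.0, t_obsolete=12.0, rate=1.0):
    """Toy lifetime-popularity curve: a logistic rise (adoption)
    damped by a logistic fall (obsolescence)."""
    adoption = 1.0 / (1.0 + math.exp(-rate * (t - t_adopt)))
    survival = 1.0 / (1.0 + math.exp(rate * (t - t_obsolete)))
    return adoption * survival
```

Early on the curve is nearly linear, it steepens through the buzz phase, plateaus near saturation, and then declines symmetrically.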

PS: For some proof that any web service such as Indeed should be taken with a grain of salt, try checking out the results for George Carlin’s seven dirty words ;)

Re: KittenAuth Test

The KittenAuth Test, a very cute and brilliant Turing test which could possibly replace CAPTCHA, deserves attention for two reasons: It is much more user friendly than CAPTCHAs, and it can easily be extended with a textual variant, for anyone using a non-graphical browser.

Consider the textual equivalent to the KittenAuth Test: Click on three of the sentences / words which exhibit some property that machines cannot understand without massive manual learning. The text could simply be put as the images’ alt text, and hey presto! An accessible Turing test!

Which language properties could you use? “Hard” words, for example. Which of these would you consider “hard”?

  • Kitten
  • Hammer
  • Stool
  • Wall
  • Synonym
  • Paper
  • Ice
  • Fantastic
  • Sugar

“Hammer”, “Wall”, and “Ice”, right? So all you’ll need to do as a programmer is make two lists for each property you’d like to use, and sprinkle the pictures with alt texts from the lists. Naturally, the best place to put the “good” words would be at the same place as the kittens. Then just add a new (CSS hidden) heading to the KittenAuth “gallery” with something like this: “If you cannot see the kittens, please click on three “$property” words instead”. Accessible in a flash…
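The mechanics are simple enough to sketch in a few lines (my own illustration; the word lists are the toy ones above, and a real deployment would obviously need far bigger tagged vocabularies):

```python
import random

# The "kitten" words (correct answers) and the decoys, per the list above
HARD = ['hammer', 'wall', 'ice']
EASY = ['kitten', 'stool', 'synonym', 'paper', 'fantastic', 'sugar']

def make_challenge(rng=random):
    """Shuffle the hard words in among the decoys."""
    words = HARD + rng.sample(EASY, len(EASY))
    rng.shuffle(words)
    return words

def check_answer(picked):
    """The user passes by picking exactly the three hard words."""
    return sorted(picked) == sorted(HARD)
```

The same challenge then works whether the client renders kitten pictures or just their alt texts.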

Update 2006-04-19: Of course, for this to work, every user will need a separate, large collection of words tagged with their meanings or categories. In the case of brute-forcing bots, this may be prohibitive. A follow-up post will take up a couple of other suggestions.

Re: The Future of Tagging

Vik Singh has an interesting blog post on the future of tagging. IMO not so much because of the idea itself, since it looks quite similar to the directory structure we’re all familiar with, with the added possibility for objects to be inside several directories at the same time. But it got me thinking about what tagging lacks: an easy-to-use relation to more structured data about what the tag means.

Anyone who’s used the del.icio.us tagging interface provided by their bookmarklets or Firefox extension knows how easy they are to use. Just click any tag, and it’s added or removed, based on whether it’s already used. Click the “save” button when done. Dead easy.

Creating RDF, when compared with del.icio.us, is quantum theory. But it’s already being used, and will probably be one of the biggest players in the semantic web, where things have meanings which can be interpreted by computers. Using RDF, you can distinguish between the tag “read” as an imperative (read it!) and an assertion (has been read). You can also make the computer understand that “examples” is the plural of “example”, and that curling is a sport (though some may disagree :)).

How could we combine the two? Here’s an idea: when clicking any of the tags in the del.icio.us tagging interface, you’d be asked what you mean, by being able to select any number of meanings from a list of one-sentence definitions. E.g., when selecting “work”, you could get these choices:

  • Item on my todo list
  • Something you’ve worked on
  • Something somebody else has worked on
  • A job/position
  • None of the above
  • Show all meanings
  • Define new meaning…

The list would normally only contain the most popular definitions, to harness the power of the “best” meanings, as defined by the number of users. The “Show all meanings” link could be used to show the whole range of meanings people have defined.

“Define new meaning…” could give you a nice interface to define the meaning of the word in the context of the link you’re tagging at the moment. This is where the designers really have to get their minds cooking to get something usable by at least mid-range computer literates.
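The data model behind such a picker could be as simple as counting, per tag, how many users picked each meaning (a sketch of mine; the class and method names are made up):

```python
from collections import Counter, defaultdict

class TagMeanings:
    """Per-tag candidate meanings, ranked by how many users chose them."""

    def __init__(self):
        self._meanings = defaultdict(Counter)

    def define(self, tag, meaning):
        """Record one user choosing (or defining) a meaning for a tag."""
        self._meanings[tag][meaning] += 1

    def popular(self, tag, n=4):
        """The most popular meanings, shown first in the picker."""
        return [m for m, _ in self._meanings[tag].most_common(n)]

    def all_meanings(self, tag):
        """The full list, behind the 'Show all meanings' link."""
        return sorted(self._meanings[tag])
```

Popularity counts are exactly what harnesses the “best” meanings, as defined by the number of users, while still letting anyone define a new one.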