TagSoup - Just Keep On Truckin'

Introduction

This is the home page of TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: [1]poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.

This is also the README file packaged with TagSoup.

TagSoup is free and Open Source software. As of version 1.2, it is licensed under the [2]Apache License, Version 2.0, which allows proprietary re-use as well as use with GPL 3.0 or GPL 2.0-or-later projects. (If anyone needs a GPL 2.0 license for a GPL 2.0-only project, feel free to ask.)

Warning: TagSoup will not build on stock Java 5.x or 6.x!

Due to a bug in the versions of Xalan shipped with Java 5.x and 6.x, TagSoup will not build out of the box. You need to retrieve [3]Saxon 6.5.5, which does not have the bug. Unpack the zipfile in an empty directory and copy the saxon.jar and saxon-xml-apis.jar files to $ANT_HOME/lib. The Ant build process for TagSoup will then notice that Saxon is available and use it instead.

TagSoup 1.2 released

There are a great many changes, most of them fixes for long-standing bugs, in this release. Only the most important are listed here; for the rest, see the CHANGES file in the source distribution. Very special thanks to Jojo Dijamco, whose intensive efforts at debugging made this release a usable upgrade rather than a useless mass of undetected bugs.

* As noted above, I have changed the license to Apache 2.0.
* The default content model for bogons (unknown elements) is now ANY rather than EMPTY.
  This is a breaking change, which I have done only because there was so much demand for it. It can be undone on the command line with the --emptybogons switch, or programmatically with parser.setFeature(Parser.emptyBogonsFeature, true).
* The processing of entity references in attribute values has finally been fixed to do what browsers do. That is, a reference is only recognized if it is properly terminated by a semicolon; otherwise it is treated as plain text. This means that URIs like foo?cdown=32&cup=42 are no longer seen as containing an instance of the ∪ character (whose name happens to be cup).
* Several new switches have been added:
  + --doctype-system and --doctype-public force a DOCTYPE declaration to be output and allow setting the system and public identifiers.
  + --standalone and --version allow control of the XML declaration that is output. (Note that TagSoup's XML output is always version 1.0, even if you use --version=1.1.)
  + --norootbogons causes unknown elements not to be allowed as the document root element. Instead, they are made children of the default root element (the html element for HTML).
* The TagSoup core now supports character entities with values above U+FFFF. As a consequence, the HTML schema now supports all 2,210 standard character entities from the [4]2007-12-14 draft of XML Entity Definitions for Characters, except the 94 which require more than one Unicode character to represent.
* The SAX events startPrefixMapping and endPrefixMapping are now being reported for all cases of foreign elements and attributes.
* All bugs around newline processing on Windows should now be gone.
* A number of content models have been loosened to allow elements to appear in new and non-standard (but commonly found) places. In particular, tables are now allowed inside paragraphs, against the letter of the W3C specification.
* Since the span element is intended for fine control of appearance using CSS, it should never have been a restartable element.
  This very long-standing bug has now been fixed.
* The following non-standard elements are now at least partly supported: bgsound, blink, canvas, comment, listing, marquee, nobr, rbc, rb, rp, rtc, rt, ruby, wbr, xmp.
* In HTML output mode, boolean attributes like checked are now output as such, rather than in XML style as checked="checked".
* Runs of < characters such as << and <<< are now handled correctly in text rather than being transformed into extremely bogus start-tags.

[5]Download the TagSoup 1.2 jar file here. It's about 87K long.

[6]Download the full TagSoup 1.2 source here. If you don't have zip, you can use jar to unpack it.

[7]Download the current CHANGES file here.

TagSoup 1.1 released

TagSoup 1.1 adds Tatu Saloranta's JAXP support for TagSoup. To use TagSoup within the JAXP framework (which is not something I necessarily recommend, but it is part of the Java XML platform), you can create a SAXParser by calling org.ccil.cowan.tagsoup.jaxp.SAXParserImpl.newInstance(). You can also set the system property javax.xml.parsers.SAXParserFactory to org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl, but be aware that doing this will cause all JAXP-based XML parsing to go through TagSoup, which is a Bad Thing if your application also reads XML documents.

What TagSoup does

TagSoup is designed as a parser, not a whole application; it isn't intended to permanently clean up bad HTML, as [8]HTML Tidy does, only to parse it on the fly. Therefore, it does not convert presentation HTML to CSS or anything similar. It does guarantee well-structured results: tags will wind up properly nested, default attributes will appear appropriately, and so on.

The semantics of TagSoup are as far as practical those of actual HTML browsers. In particular, never, never will it throw any sort of syntax error: the TagSoup motto is [9]"Just Keep On Truckin'". But there's much, much more. For example, if the first tag is LI, it will supply the application with enclosing HTML, BODY, and UL tags.
Why UL? Because that's what browsers assume in this situation. For the same reason, overlapping tags are correctly restarted whenever possible: text like:

    This is <B>bold, <I>bold italic, </B>italic, </I>normal text

gets correctly rewritten as:

    This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.

By intention, TagSoup is small and fast. It does not depend on the existence of any framework other than SAX, and should be able to work with any framework that can accept SAX parsers. In particular, [10]XOM is known to work.

You can replace the low-level HTML scanner with one based on Sean McGrath's [11]PYX format (very close to James Clark's ESIS format). You can also supply an AutoDetector that peeks at the incoming byte stream and guesses a character encoding for it. Otherwise, the platform default is used. If you need an autodetector of character sets, consider trying to adapt the [12]Mozilla one; if you succeed, let me know.

Note: TagSoup in Java 1.1

If you go through the TagSoup source and replace all references to HashMap with Hashtable and recompile, TagSoup will work fine in Java 1.1 VMs. Thanks to Thorbjørn Vinne for this discovery.

The TSaxon XSLT-for-HTML processor

[13]I am also distributing [14]TSaxon, a repackaging of version 6.5.5 of Michael Kay's Saxon XSLT version 1.0 implementation that includes TagSoup. TSaxon is a drop-in replacement for Saxon, and can be used to process either HTML or XML documents with XSLT stylesheets.

TagSoup as a stand-alone program

It is possible to run TagSoup as a program by saying java -jar tagsoup-1.2.jar [option ...] [file ...]. Files mentioned on the command line will be parsed individually. If no files are specified, the standard input is read.

The following options are understood:

--files
    Output into individual files, with html extensions changed to xhtml. Otherwise, all output is sent to the standard output.
--html
    Output is in clean HTML: the XML declaration is suppressed, as are end-tags for the known empty elements.
--omit-xml-declaration
    The XML declaration is suppressed.
--method=html
    End-tags for the known empty HTML elements are suppressed.
--doctype-system=systemid
    Forces the output of a DOCTYPE declaration with the specified systemid.
--doctype-public=publicid
    Forces the output of a DOCTYPE declaration with the specified publicid.
--version=version
    Sets the version string in the XML declaration.
--standalone=[yes|no]
    Sets the standalone declaration to yes or no.
--pyx
    Output is in PYX format.
--pyxin
    Input is in PYXoid format (need not be well-formed).
--nons
    Namespaces are suppressed. Normally, all elements are in the XHTML 1.x namespace, and all attributes are in no namespace.
--nobogons
    Bogons (unknown elements) are suppressed.
--nodefaults
    Suppress default attribute values.
--nocolons
    Change explicit colons in element and attribute names to underscores.
--norestart
    Don't restart any normally restartable elements.
--ignorable
    Output whitespace in elements with element-only content.
--emptybogons
    Bogons are given a content model of EMPTY rather than ANY.
--any
    Bogons are given a content model of ANY rather than EMPTY (default).
--norootbogons
    Don't allow bogons to be root elements; make them subordinate to the root.
--lexical
    Pass through HTML comments and DOCTYPE declarations. Has no effect when output is in PYX format.
--reuse
    Reuse a single instance of TagSoup parser throughout. Normally, a new one is instantiated for each input file.
--nocdata
    Change the content models of the script and style elements to treat them as ordinary #PCDATA (text-only) elements, as in XHTML, rather than with the special CDATA content model.
--encoding=encoding
    Specify the input encoding. The default is the Java platform default.
--output-encoding=encoding
    Specify the output encoding. The default is the Java platform default.
--help
    Print help.
--version
    Print the version number.
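
Because the default output of the command-line processor is well-formed XML, it can be handed to any standard XML tool. The following is only a minimal sketch of that idea, not part of TagSoup: the class name ElementLister and the sample markup are made up for illustration, and the markup is assumed to resemble what TagSoup emits for a small list. It uses the stock SAX parser from the JDK to collect element names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of TagSoup: lists the elements of a
// well-formed XML document using only the JDK's own SAX parser.
class ElementLister {

    // Returns the local names of all elements, in document order.
    static List<String> elementNames(String xml) {
        final List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                names.add(localName);
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Namespace awareness is needed for localName to be filled in.
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return names;
    }

    public static void main(String[] args) {
        // Assumed sample: XHTML-namespaced markup of the kind described above.
        String xml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                   + "<ul><li>one</li><li>two</li></ul></body></html>";
        System.out.println(elementNames(xml)); // prints [html, body, ul, li, li]
    }
}
```

The handler only sees SAX events, so it does not care which parser produced them. With the TagSoup jar on the classpath, the same handler should work unchanged when the XMLReader is a TagSoup parser reading raw HTML instead.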

SAX features and properties

TagSoup supports the following SAX features in addition to the standard ones:

http://www.ccil.org/~cowan/tagsoup/features/ignore-bogons
    A value of "true" indicates that the parser will ignore unknown elements.
http://www.ccil.org/~cowan/tagsoup/features/bogons-empty
    A value of "true" indicates that the parser will give unknown elements a content model of EMPTY; a value of "false", a content model of ANY.
http://www.ccil.org/~cowan/tagsoup/features/root-bogons
    A value of "true" indicates that the parser will allow unknown elements to be the root of the output document.
http://www.ccil.org/~cowan/tagsoup/features/default-attributes
    A value of "true" indicates that the parser will return default attribute values for missing attributes that have default values.
http://www.ccil.org/~cowan/tagsoup/features/translate-colons
    A value of "true" indicates that the parser will translate colons into underscores in names.
http://www.ccil.org/~cowan/tagsoup/features/restart-elements
    A value of "true" indicates that the parser will attempt to restart the restartable elements.
http://www.ccil.org/~cowan/tagsoup/features/ignorable-whitespace
    A value of "true" indicates that the parser will transmit whitespace in element-only content via the SAX ignorableWhitespace callback. Normally this is not done, because HTML is an SGML application and SGML suppresses such whitespace.
http://www.ccil.org/~cowan/tagsoup/features/cdata-elements
    A value of "true" indicates that the parser will process the script and style elements (or any elements with type='cdata' in the TSSL schema) as SGML CDATA elements (that is, no markup is recognized except the matching end-tag).

TagSoup supports the following SAX properties in addition to the standard ones:

http://www.ccil.org/~cowan/tagsoup/properties/scanner
    Specifies the Scanner object this parser uses.
http://www.ccil.org/~cowan/tagsoup/properties/schema
    Specifies the Schema object this parser uses.
http://www.ccil.org/~cowan/tagsoup/properties/auto-detector
    Specifies the AutoDetector (for encoding detection) this parser uses.

More information

I gave a presentation (a nocturne, so it's not on the schedule) at [15]Extreme Markup Languages 2004 about TagSoup, updated from the one presented in 2002 at the New York City XML SIG and at XML 2002. This is the main high-level documentation about how TagSoup works. Formats: [16]OpenDocument [17]Powerpoint [18]PDF.

I also had people add [19]"evil" HTML to a large poster so that I could [20]clean it up; View Source is probably more useful than ordinary browsing. The original instructions were:

    SOUPE DE BALISES (BE EVIL)! Ecritez une balise ouvrante (sans attributs) ou fermante HTML ici, s.v.p.

(In English: "TAG SOUP (BE EVIL)! Please write an HTML start-tag (without attributes) or end-tag here.")

There is a [21]tagsoup-friends mailing list hosted at [22]Yahoo Groups. You can [23]join via the Web, or by sending a blank email to [24]tagsoup-friends-subscribe@yahoogroups.com. The [25]archives are open to all.

Online TagSoup processing for publicly accessible HTML documents is now [26]available courtesy of Leigh Dodds.

References

   1. http://oregonstate.edu/instruct/phl302/texts/hobbes/leviathan-c.html
   2. http://opensource.org/licenses/apache2.0.php
   3. http://prdownloads.sourceforge.net/saxon/saxon6-5-5.zip
   4. http://www.w3.org/TR/2007/WD-xml-entity-names-20071214
   5. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup-1.2.jar
   6. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup-1.2-src.zip
   7. http://home.ccil.org/~cowan/XML/tagsoup/CHANGES
   8. http://tidy.sf.net/
   9. http://www.crumbmuseum.com/truckin.html
  10. http://www.cafeconleche.org/XOM
  11. http://gnosis.cx/publish/programming/xml_matters_17.html
  12. http://jchardet.sourceforge.net/
  13. http://www.ccil.org/~cowan
  14. http://home.ccil.org/~cowan/XML/tagsoup/tsaxon
  15. http://www.extrememarkup.com/extreme/2004
  16. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.odp
  17. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.ppt
  18. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.pdf
  19. http://home.ccil.org/~cowan/XML/tagsoup/extreme.html
  20. http://home.ccil.org/~cowan/XML/tagsoup/extreme.xhtml
  21. http://groups.yahoo.com/group/tagsoup-friends
  22. http://groups.yahoo.com/
  23. http://groups.yahoo.com/group/tagsoup-friends/join
  24. mailto:tagsoup-friends-subscribe@yahoogroups.com
  25. http://groups.yahoo.com/group/tagsoup-friends/messages
  26. http://xmlarmyknife.org/docs/xhtml/tagsoup/