author | Upstream <upstream-import@none> | 1970-01-12 13:46:40 +0000 |
---|---|---|
committer | Upstream <upstream-import@none> | 1970-01-12 13:46:40 +0000 |
commit | 70e83658cac1d0d766e93853e3698921af269a37 (patch) | |
tree | f2dbc24614858517bc61f8811143d878002f800d /README | |
external/tagsoup 1.2 (upstream/1.2, nougat-mr1-arc)
Diffstat (limited to 'README')
-rw-r--r-- | README | 357 |
1 files changed, 357 insertions, 0 deletions
@@ -0,0 +1,357 @@
+ TagSoup - Just Keep On Truckin'
+
+ Introduction
+
+   This is the home page of TagSoup, a SAX-compliant parser written in
+   Java that, instead of parsing well-formed or valid XML, parses HTML as
+   it is found in the wild: [1]poor, nasty and brutish, though quite often
+   far from short. TagSoup is designed for people who have to process this
+   stuff using some semblance of a rational application design. By
+   providing a SAX interface, it allows standard XML tools to be applied
+   to even the worst HTML. TagSoup also includes a command-line processor
+   that reads HTML files and can generate either clean HTML or well-formed
+   XML that is a close approximation to XHTML.
+
+   This is also the README file packaged with TagSoup.
+
+   TagSoup is free and Open Source software. As of version 1.2, it is
+   licensed under the [2]Apache License, Version 2.0, which allows
+   proprietary re-use as well as use with GPL 3.0 or GPL 2.0-or-later
+   projects. (If anyone needs a GPL 2.0 license for a GPL 2.0-only
+   project, feel free to ask.)
+
+ Warning: TagSoup will not build on stock Java 5.x or 6.x!
+
+   Due to a bug in the versions of Xalan shipped with Java 5.x and 6.x,
+   TagSoup will not build out of the box. You need to retrieve [3]Saxon
+   6.5.5, which does not have the bug. Unpack the zipfile in an empty
+   directory and copy the saxon.jar and saxon-xml-apis.jar files to
+   $ANT_HOME/lib. The Ant build process for TagSoup will then notice that
+   Saxon is available and use it instead.
+
+ TagSoup 1.2 released
+
+   There are a great many changes, most of them fixes for long-standing
+   bugs, in this release. Only the most important are listed here; for the
+   rest, see the CHANGES file in the source distribution. Very special
+   thanks to Jojo Dijamco, whose intensive efforts at debugging made this
+   release a usable upgrade rather than a useless mass of undetected bugs.
+     * As noted above, I have changed the license to Apache 2.0.
+     * The default content model for bogons (unknown elements) is now ANY
+       rather than EMPTY. This is a breaking change, which I have done
+       only because there was so much demand for it. It can be undone on
+       the command line with the --emptybogons switch, or programmatically
+       with parser.setFeature(Parser.emptyBogonsFeature, true).
+     * The processing of entity references in attribute values has finally
+       been fixed to do what browsers do. That is, a reference is only
+       recognized if it is properly terminated by a semicolon; otherwise
+       it is treated as plain text. This means that URIs like
+       foo?cdown=32&cup=42 are no longer seen as containing an instance of
+       the ∪ character (whose name happens to be cup).
+     * Several new switches have been added:
+          + --doctype-system and --doctype-public force a DOCTYPE
+            declaration to be output and allow setting the system and
+            public identifiers.
+          + --standalone and --version allow control of the XML
+            declaration that is output. (Note that TagSoup's XML output is
+            always version 1.0, even if you use --version=1.1.)
+          + --norootbogons causes unknown elements not to be allowed as
+            the document root element. Instead, they are made children of
+            the default root element (the html element for HTML).
+     * The TagSoup core now supports character entities with values above
+       U+FFFF. As a consequence, the HTML schema now supports all 2,210
+       standard character entities from the [4]2007-12-14 draft of XML
+       Entity Definitions for Characters, except the 94 which require more
+       than one Unicode character to represent.
+     * The SAX events startPrefixMapping and endPrefixMapping are now
+       being reported for all cases of foreign elements and attributes.
+     * All bugs around newline processing on Windows should now be gone.
+     * A number of content models have been loosened to allow elements to
+       appear in new and non-standard (but commonly found) places. In
+       particular, tables are now allowed inside paragraphs, against the
+       letter of the W3C specification.
+     * Since the span element is intended for fine control of appearance
+       using CSS, it should never have been a restartable element. This
+       very long-standing bug has now been fixed.
+     * The following non-standard elements are now at least partly
+       supported: bgsound, blink, canvas, comment, listing, marquee, nobr,
+       rbc, rb, rp, rtc, rt, ruby, wbr, xmp.
+     * In HTML output mode, boolean attributes like checked are now output
+       as such, rather than in XML style as checked="checked".
+     * Runs of < characters such as << and <<< are now handled correctly
+       in text rather than being transformed into extremely bogus
+       start-tags.
+
+   [5]Download the TagSoup 1.2 jar file here. It's about 87K long.
+   [6]Download the full TagSoup 1.2 source here. If you don't have zip,
+   you can use jar to unpack it.
+   [7]Download the current CHANGES file here.
+
+ TagSoup 1.1 released
+
+   TagSoup 1.1 adds Tatu Saloranta's JAXP support for TagSoup. To use
+   TagSoup within the JAXP framework (which is not something I necessarily
+   recommend, but it is part of the Java XML platform), you can create a
+   SAXParser by calling
+   org.ccil.cowan.tagsoup.jaxp.SAXParserImpl.newInstance(). You can also
+   set the system property javax.xml.parsers.SAXParserFactory to
+   org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl, but be aware that doing
+   this will cause all JAXP-based XML parsing to go through TagSoup, which
+   is a Bad Thing if your application also reads XML documents.
+
+ What TagSoup does
+
+   TagSoup is designed as a parser, not a whole application; it isn't
+   intended to permanently clean up bad HTML, as [8]HTML Tidy does, only
+   to parse it on the fly. Therefore, it does not convert presentation
+   HTML to CSS or anything similar. It does guarantee well-structured
+   results: tags will wind up properly nested, default attributes will
+   appear appropriately, and so on.
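
Editor's illustration (not part of the upstream README): the 1.2 release notes above say that an entity reference in an attribute value is recognized only when a semicolon terminates it. The following is a minimal sketch of that rule under stated assumptions. The class name EntitySketch, the method expand, and the four-entry entity table are hypothetical and are not TagSoup's actual code or schema.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntitySketch {
    // Tiny illustrative table (assumption); the real HTML schema is far larger.
    private static final Map<String, String> ENTITIES = Map.of(
        "amp", "&", "lt", "<", "gt", ">", "cup", "\u222A");

    // A reference only counts if a semicolon terminates the name.
    private static final Pattern REF =
        Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    static String expand(String attrValue) {
        Matcher m = REF.matcher(attrValue);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = ENTITIES.get(m.group(1));
            // Names missing from the table are left untouched as plain text.
            String replacement = (value != null) ? value : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // "&cup" is followed by "=", not ";", so the URI passes through as-is.
        System.out.println(expand("foo?cdown=32&cup=42"));
        // A properly terminated reference is expanded.
        System.out.println(expand("a &lt; b"));
    }
}
```

With this rule the query string from the release notes is left alone, while a reference such as &lt; that ends in a semicolon is still decoded.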
+
+   The semantics of TagSoup are as far as practical those of actual HTML
+   browsers. In particular, never, never will it throw any sort of syntax
+   error: the TagSoup motto is [9]"Just Keep On Truckin'". But there's
+   much, much more. For example, if the first tag is LI, it will supply
+   the application with enclosing HTML, BODY, and UL tags. Why UL? Because
+   that's what browsers assume in this situation. For the same reason,
+   overlapping tags are correctly restarted whenever possible: text like:
+This is <B>bold, <I>bold italic, </b>italic, </i>normal text
+
+   gets correctly rewritten as:
+This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
+
+   By intention, TagSoup is small and fast. It does not depend on the
+   existence of any framework other than SAX, and should be able to work
+   with any framework that can accept SAX parsers. In particular, [10]XOM
+   is known to work.
+
+   You can replace the low-level HTML scanner with one based on Sean
+   McGrath's [11]PYX format (very close to James Clark's ESIS format). You
+   can also supply an AutoDetector that peeks at the incoming byte stream
+   and guesses a character encoding for it. Otherwise, the platform
+   default is used. If you need an autodetector of character sets,
+   consider trying to adapt the [12]Mozilla one; if you succeed, let me
+   know.
+
+ Note: TagSoup in Java 1.1
+
+   If you go through the TagSoup source and replace all references to
+   HashMap with Hashtable and recompile, TagSoup will work fine in Java
+   1.1 VMs. Thanks to Thorbjørn Vinne for this discovery.
+
+ The TSaxon XSLT-for-HTML processor
+
+   [13]I am also distributing [14]TSaxon, a repackaging of version 6.5.5
+   of Michael Kay's Saxon XSLT version 1.0 implementation that includes
+   TagSoup. TSaxon is a drop-in replacement for Saxon, and can be used to
+   process either HTML or XML documents with XSLT stylesheets.
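
Editor's illustration (not part of the upstream README): the restarting of overlapping tags described under "What TagSoup does" can be pictured with a toy stack-based rewriter. This is a sketch of the general idea only, not TagSoup's implementation. The class RestartSketch and its rewrite method are hypothetical, attributes are not handled, and every element is treated as restartable, whereas TagSoup restarts only the elements its schema marks as restartable.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RestartSketch {
    // Matches attribute-free start-tags and end-tags only (simplification).
    private static final Pattern TAG =
        Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)>");

    static String rewrite(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // innermost element at the head
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            out.append(html, pos, m.start());    // copy text before the tag
            pos = m.end();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (m.group(1).isEmpty()) {          // start-tag: open the element
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (open.contains(name)) {    // end-tag of an open element
                List<String> restart = new ArrayList<>();
                // Close everything nested inside the element being ended.
                while (!open.peek().equals(name)) {
                    String inner = open.pop();
                    out.append("</").append(inner).append('>');
                    restart.add(0, inner);       // remember, outermost first
                }
                open.pop();
                out.append("</").append(name).append('>');
                // Restart the elements that were closed early.
                for (String inner : restart) {
                    open.push(inner);
                    out.append('<').append(inner).append('>');
                }
            }
            // End-tags with no matching open element are silently dropped.
        }
        out.append(html.substring(pos));
        while (!open.isEmpty()) {                // close anything left open
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite(
            "This is <B>bold, <I>bold italic, </b>italic, </i>normal text"));
    }
}
```

Run on the README's own example, the sketch closes i before b and then reopens i, which is the nesting shown in the rewritten text above.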
+
+ TagSoup as a stand-alone program
+
+   It is possible to run TagSoup as a program by saying java -jar
+   tagsoup-1.2.jar [option ...] [file ...]. Files mentioned on the command
+   line will be parsed individually. If no files are specified, the
+   standard input is read.
+
+   The following options are understood:
+
+   --files
+          Output into individual files, with html extensions changed to
+          xhtml. Otherwise, all output is sent to the standard output.
+
+   --html
+          Output is in clean HTML: the XML declaration is suppressed, as
+          are end-tags for the known empty elements.
+
+   --omit-xml-declaration
+          The XML declaration is suppressed.
+
+   --method=html
+          End-tags for the known empty HTML elements are suppressed.
+
+   --doctype-system=systemid
+          Forces the output of a DOCTYPE declaration with the specified
+          systemid.
+
+   --doctype-public=publicid
+          Forces the output of a DOCTYPE declaration with the specified
+          publicid.
+
+   --version=version
+          Sets the version string in the XML declaration.
+
+   --standalone=[yes|no]
+          Sets the standalone declaration to yes or no.
+
+   --pyx
+          Output is in PYX format.
+
+   --pyxin
+          Input is in PYXoid format (need not be well-formed).
+
+   --nons
+          Namespaces are suppressed. Normally, all elements are in the
+          XHTML 1.x namespace, and all attributes are in no namespace.
+
+   --nobogons
+          Bogons (unknown elements) are suppressed.
+
+   --nodefaults
+          suppress default attribute values
+
+   --nocolons
+          change explicit colons in element and attribute names to
+          underscores
+
+   --norestart
+          don't restart any normally restartable elements
+
+   --ignorable
+          output whitespace in elements with element-only content
+
+   --emptybogons
+          Bogons are given a content model of EMPTY rather than ANY.
+
+   --any
+          Bogons are given a content model of ANY rather than EMPTY
+          (default).
+
+   --norootbogons
+          Don't allow bogons to be root elements; make them subordinate to
+          the root.
+
+   --lexical
+          Pass through HTML comments and DOCTYPE declarations. Has no
+          effect when output is in PYX format.
+
+   --reuse
+          Reuse a single instance of TagSoup parser throughout. Normally,
+          a new one is instantiated for each input file.
+
+   --nocdata
+          Change the content models of the script and style elements to
+          treat them as ordinary #PCDATA (text-only) elements, as in
+          XHTML, rather than with the special CDATA content model.
+
+   --encoding=encoding
+          Specify the input encoding. The default is the Java platform
+          default.
+
+   --output-encoding=encoding
+          Specify the output encoding. The default is the Java platform
+          default.
+
+   --help
+          Print help.
+
+   --version
+          Print the version number.
+
+ SAX features and properties
+
+   TagSoup supports the following SAX features in addition to the standard
+   ones:
+
+   http://www.ccil.org/~cowan/tagsoup/features/ignore-bogons
+          A value of "true" indicates that the parser will ignore unknown
+          elements.
+
+   http://www.ccil.org/~cowan/tagsoup/features/bogons-empty
+          A value of "true" indicates that the parser will give unknown
+          elements a content model of EMPTY; a value of "false", a content
+          model of ANY.
+
+   http://www.ccil.org/~cowan/tagsoup/features/root-bogons
+          A value of "true" indicates that the parser will allow unknown
+          elements to be the root of the output document.
+
+   http://www.ccil.org/~cowan/tagsoup/features/default-attributes
+          A value of "true" indicates that the parser will return default
+          attribute values for missing attributes that have default
+          values.
+
+   http://www.ccil.org/~cowan/tagsoup/features/translate-colons
+          A value of "true" indicates that the parser will translate
+          colons into underscores in names.
+
+   http://www.ccil.org/~cowan/tagsoup/features/restart-elements
+          A value of "true" indicates that the parser will attempt to
+          restart the restartable elements.
+
+   http://www.ccil.org/~cowan/tagsoup/features/ignorable-whitespace
+          A value of "true" indicates that the parser will transmit
+          whitespace in element-only content via the SAX
+          ignorableWhitespace callback. Normally this is not done, because
+          HTML is an SGML application and SGML suppresses such whitespace.
+
+   http://www.ccil.org/~cowan/tagsoup/features/cdata-elements
+          A value of "true" indicates that the parser will process the
+          script and style elements (or any elements with type='cdata' in
+          the TSSL schema) as SGML CDATA elements (that is, no markup is
+          recognized except the matching end-tag).
+
+   TagSoup supports the following SAX properties in addition to the
+   standard ones:
+
+   http://www.ccil.org/~cowan/tagsoup/properties/scanner
+          Specifies the Scanner object this parser uses.
+
+   http://www.ccil.org/~cowan/tagsoup/properties/schema
+          Specifies the Schema object this parser uses.
+
+   http://www.ccil.org/~cowan/tagsoup/properties/auto-detector
+          Specifies the AutoDetector (for encoding detection) this parser
+          uses.
+
+ More information
+
+   I gave a presentation (a nocturne, so it's not on the schedule) at
+   [15]Extreme Markup Languages 2004 about TagSoup, updated from the one
+   presented in 2002 at the New York City XML SIG and at XML 2002. This is
+   the main high-level documentation about how TagSoup works. Formats:
+   [16]OpenDocument [17]Powerpoint [18]PDF.
+
+   I also had people add [19]"evil" HTML to a large poster so that I could
+   [20]clean it up; View Source is probably more useful than ordinary
+   browsing. The original instructions were:
+
+   SOUPE DE BALISES (BE EVIL)!
+   Ecritez une balise ouvrante (sans attributs)
+   ou fermante HTML ici, s.v.p.
+
+   There is a [21]tagsoup-friends mailing list hosted at [22]Yahoo Groups.
+   You can [23]join via the Web, or by sending a blank email to
+   [24]tagsoup-friends-subscribe@yahoogroups.com. The [25]archives are
+   open to all.
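
Editor's illustration (not part of the upstream README): the feature URIs listed under "SAX features and properties" are set through the standard SAX XMLReader.setFeature call. The sketch below assumes the TagSoup jar is on the classpath at run time and loads org.ccil.cowan.tagsoup.Parser reflectively so that the file compiles against the JDK alone. The class FeatureSketch and its helper methods are hypothetical names, and the two features chosen are arbitrary examples.

```java
import org.xml.sax.XMLReader;

public class FeatureSketch {
    // Common prefix of the TagSoup-specific feature URIs listed in the README.
    static final String FEATURES =
        "http://www.ccil.org/~cowan/tagsoup/features/";

    // Builds a full feature URI from its short name, e.g. "bogons-empty".
    static String featureUri(String shortName) {
        return FEATURES + shortName;
    }

    // Assumption: the TagSoup jar is on the classpath when this is called.
    static XMLReader newConfiguredParser() throws Exception {
        XMLReader reader = (XMLReader) Class
            .forName("org.ccil.cowan.tagsoup.Parser")
            .getDeclaredConstructor()
            .newInstance();
        // Give unknown elements a content model of EMPTY (the pre-1.2 default).
        reader.setFeature(featureUri("bogons-empty"), true);
        // Report whitespace in element-only content via ignorableWhitespace.
        reader.setFeature(featureUri("ignorable-whitespace"), true);
        return reader;
    }

    public static void main(String[] args) {
        System.out.println(featureUri("bogons-empty"));
    }
}
```

A caller would then attach a ContentHandler to the returned reader and call parse on an InputSource, as with any other SAX parser.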
+
+   Online TagSoup processing for publicly accessible HTML documents is now
+   [26]available courtesy of Leigh Dodds.
+
+References
+
+   1. http://oregonstate.edu/instruct/phl302/texts/hobbes/leviathan-c.html
+   2. http://opensource.org/licenses/apache2.0.php
+   3. http://prdownloads.sourceforge.net/saxon/saxon6-5-5.zip
+   4. http://www.w3.org/TR/2007/WD-xml-entity-names-20071214
+   5. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup-1.2.jar
+   6. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup-1.2-src.zip
+   7. http://home.ccil.org/~cowan/XML/tagsoup/CHANGES
+   8. http://tidy.sf.net/
+   9. http://www.crumbmuseum.com/truckin.html
+  10. http://www.cafeconleche.org/XOM
+  11. http://gnosis.cx/publish/programming/xml_matters_17.html
+  12. http://jchardet.sourceforge.net/
+  13. http://www.ccil.org/~cowan
+  14. http://home.ccil.org/~cowan/XML/tagsoup/tsaxon
+  15. http://www.extrememarkup.com/extreme/2004
+  16. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.odp
+  17. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.ppt
+  18. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.pdf
+  19. http://home.ccil.org/~cowan/XML/tagsoup/extreme.html
+  20. http://home.ccil.org/~cowan/XML/tagsoup/extreme.xhtml
+  21. http://groups.yahoo.com/group/tagsoup-friends
+  22. http://groups.yahoo.com/
+  23. http://groups.yahoo.com/group/tagsoup-friends/join
+  24. mailto:tagsoup-friends-subscribe@yahoogroups.com
+  25. http://groups.yahoo.com/group/tagsoup-friends/messages
+  26. http://xmlarmyknife.org/docs/xhtml/tagsoup/ |