CHANGES


Changes from 1.2 to 1.2.1
=========================
Match DOCTYPE case-blind
Extend PushbackReader's size for oddball cases like & followed by CR
Leo Sutic's 2x-4x speedup by precompiling HTMLScanner table
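
As an illustration of the "Match DOCTYPE case-blind" entry above, here is a minimal Java sketch of a case-insensitive keyword check. It is not TagSoup's scanner code; the class and method names are hypothetical, and the input is assumed to be the text that follows "<!" in a declaration.

```java
final class DoctypeDemo {
    // Hypothetical helper: true if the declaration text begins with
    // "DOCTYPE" in any mixture of upper and lower case.
    public static boolean startsWithDoctype(String decl) {
        String keyword = "DOCTYPE";
        // regionMatches with ignoreCase=true compares without allocating
        // a lower-cased copy; it returns false if decl is too short.
        return decl.regionMatches(true, 0, keyword, 0, keyword.length());
    }

    public static void main(String[] args) {
        System.out.println(startsWithDoctype("DOCTYPE html"));  // true
        System.out.println(startsWithDoctype("doctype html"));  // true
        System.out.println(startsWithDoctype("ELEMENT p"));     // false
    }
}
```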

Changes from 1.1.3 to 1.2
=========================
Changed license to Apache 2.0
Bogon default model is now ANY, not EMPTY
Support new DOCTYPE output switches --doctype-system and --doctype-public
Support new XML declaration output switches --standalone and --version
New --norootbogons switch makes bogons children of the root
Don't resolve entity references in attribute values unless semicolon-terminated
Support character entities above U+FFFF
Add character entities from the 2007-12-14 draft of xml-entity-names
Call SAX events startPrefixMapping and endPrefixMapping to report prefixes
Clean up newline processing, shrinking html.stml considerably
Allow link elements in the body as well as the head, to avoid excess bodies
Allow tables inside paragraphs
Allow cells and forms in thead and tfoot elements without intervening tr element
The span element is no longer restartable
Support non-standard elements bgsound, blink, canvas, comment, listing,
	marquee, nobr, ruby, rbc, rtc, rb, rt, rp, wbr, xmp
In HTML mode, boolean attributes like checked are output in minimized form
Correctly handle runs of less-than characters
Suppress all but the first DOCTYPE declaration
Modify PI targets containing colons to have underscores instead
The case of element tags is now canonicalized to the schema
PI targets are no longer forced to lower case
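
The "Support character entities above U+FFFF" entry above concerns code points outside the Basic Multilingual Plane, which Java strings store as surrogate pairs. The sketch below shows the standard-library way to expand such a code point; it is an illustration only, and the class and method names are hypothetical rather than taken from TagSoup.

```java
final class SupplementaryDemo {
    // Hypothetical helper: turn a Unicode code point into a Java String.
    // Code points above U+FFFF become two UTF-16 chars (a surrogate pair).
    public static String expand(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = expand(0x1D504); // MATHEMATICAL FRAKTUR CAPITAL A
        System.out.println(s.length());                       // 2
        System.out.println(s.codePointCount(0, s.length()));  // 1
    }
}
```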

Changes from 1.1.2 to 1.1.3
===========================
Allow Parser.set* methods to accept null
Allow setting the LexicalHandler feature to be null
	(in both cases, null means "use default behavior")

Changes from 1.1.1 to 1.1.2
===========================
Setting CDATAElementsFeature didn't really set CDATAElements instance variable

Changes from 1.1 to 1.1.1
=========================
Removed lexical handler calls to startCDATA/endCDATA from CDATA element handling
Added lexical handler calls to startCDATA/endCDATA from CDATA section handling
Added CDATAElementsFeature, the programmatic equivalent of the --nocdata switch

Changes from 1.0.5 to 1.1
=========================
Add Tatu Saloranta's JAXP support package

Changes from 1.0.4 to 1.0.5
===========================
Major repairs to comment scanning
Skip leading BOM
Comment out debugging code in PYXWriter
Allow &#X as well as &#x
Add net.sf.saxon to list of supported XSLT engines
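
To illustrate the "Allow &#X as well as &#x" entry above, here is a small sketch that decodes the body of a numeric character reference with either hex marker. It is not TagSoup's implementation; the names are hypothetical, and because it leans on Integer.parseInt it is more lenient than a real scanner would be (for example, it accepts a leading "+").

```java
final class CharRefDemo {
    // Hypothetical helper: decode the text between "&#" and ";".
    // Returns the code point, or -1 if the digits cannot be parsed.
    public static int decode(String body) {
        try {
            int value;
            if (body.startsWith("x") || body.startsWith("X")) {
                value = Integer.parseInt(body.substring(1), 16); // hex form
            } else {
                value = Integer.parseInt(body, 10);              // decimal form
            }
            return value >= 0 ? value : -1;
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("x41")); // 65
        System.out.println(decode("X41")); // 65
        System.out.println(decode("65"));  // 65
    }
}
```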

Changes from 1.0.3 to 1.0.4
===========================
Certain options were mutually exclusive that should not have been
Blocked XML declaration from specifying an encoding of ""
--method=html was not doing the right thing

Changes from 1.0.2 to 1.0.3
===========================
Fixed build file to use Java target version 1.4
Fixed --version switch to print the right thing

Changes from 1.0.1 to 1.0.2
===========================
Version attribute default value removed from html element
Leading and trailing hyphens now trimmed properly from comments
Added --output-encoding switch to control encoding
If output encoding is Unicode, don't generate character references
Whitespace compressed and junk stripped from public identifiers

Changes from 1.0 to 1.0.1
=========================
Added ignorableWhitespaceFeature and --ignorable to report ignorable whitespace
	Patch due to David Pashley
Insert spaces to break up -- in comments
Change bogus chars in publicids to spaces
--lexical switch now outputs DOCTYPE if there is one
Remove unnecessary blank line after XML declaration
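
The "Insert spaces to break up -- in comments" entry above exists because XML comments may not contain two consecutive hyphens. A minimal sketch of that repair follows; the class and method names are hypothetical, and it does not handle a comment that ends in a single hyphen.

```java
final class CommentDemo {
    // Hypothetical helper: insert a space between adjacent hyphens so the
    // text can be written inside <!-- ... --> without containing "--".
    public static String breakHyphens(String comment) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < comment.length(); i++) {
            char c = comment.charAt(i);
            // If this hyphen directly follows another hyphen, separate them.
            if (c == '-' && sb.length() > 0 && sb.charAt(sb.length() - 1) == '-') {
                sb.append(' ');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(breakHyphens("a--b"));  // a- -b
        System.out.println(breakHyphens("a---b")); // a- - -b
    }
}
```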

Changes from 1.0rc9 to 1.0
==========================
Added feature to control restartability
	Patch due to Nikita Zhuk
Added corresponding --norestart switch in CommandLine
Made translate-colons feature actually work

Changes from 1.0rc8 to 1.0rc9
=============================
If there is a publicid but no systemid, set systemid to ""

Changes from 1.0rc7 to 1.0rc8
=============================
Fixed paper-bag bug (source didn't match binary in release)

Changes from 1.0rc6 to 1.0rc7
=============================
LexicalHandler now gets DOCTYPE information (publicid and systemid)
	Patch due to Mike Bremford
HTMLScanner now reports more useful debug output when not commented out
	Patch due to Mike Bremford
Change "<memberOfAny>" to exclude "<root>" pseudo-element
	This prevents "script" from being output as a root
The shared HTMLParser object has been eliminated

Changes from 1.0rc5 to 1.0rc6
=============================
If namespaceFeature is false, uri and localname are passed as empty strings
The namespacePrefixesFeature is now always false
Command line switch --nons no longer affects namespacePrefixesFeature
Command line switch --html now implies --nons
XMLWriter is now told directly to use the schema's URI as default namespace
XMLWriter now takes the element name from the qname if localname is empty

Changes from 1.0rc4 to 1.0rc5
=============================
The --nodefaults switch now removes only default attributes, not all of them
Added --nocolons switch and translate-colons feature to convert ":"
	in names to "_" (thus suppressing namespaces other than the basic one)
The root element can be unknown without problem
Empty <script/> and <style/> tags now work
Added all standard SAX2 features to feature hashtable
Reimplemented namespacePrefixes feature (broken since 1.0rc3)

Changes from 1.0rc3 to 1.0rc4
=============================
Remove trailing ? from processing instructions (in case the input is XHTML)
Added Javadocs for all SAX standard and TagSoup-specific features and properties
Fixed termination conditions for entity/character references
Fixed EOF-pushback bug that was generating bogus &#x65535; references
Added Parser feature and --nodefaults switch to ignore default attribute values
Added support for SAX Locator
Updated AFL license to version 3.0
Scanner buffer size increases as needed, allowing large attribute values
Look for various XSLT implementations as available (still fails in raw 5.0)
Clean up handling of XML empty tags and SGML minimized end-tags
Support proper options and help message internally
Use Hashtable in CommandLine class instead of HashMap
Do proper buffering of InputStream and Reader
Clean up content model of noframes element
Removed htmlMode in XMLWriter
Added support for XSLT output options METHOD=html and OMIT_XML_DECLARATION=yes
Command line option --html sets both of these
Wrote simple validator for TSSL schemas (tssl/tssl-validator.xslt)
Removed various validity problems in html.tssl
When processing a start-tag, don't restart elements that aren't in the new
	element's content model
Remove bogus double param in tssl.xslt

Changes from 1.0rc2 to 1.0rc3
=============================
Convert CR and CRLF to LF in comments and PIs
Force empty elements to close immediately
Match close tags of CDATA elements more precisely (but case-blind)
Process switches on the command line
Man page available
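
The "Convert CR and CRLF to LF" entry above describes ordinary XML-style line-end normalization. One way to sketch it in Java, with hypothetical names and no claim to match TagSoup's scanner, is:

```java
final class NewlineDemo {
    // Hypothetical helper: map CRLF pairs and lone CRs to a single LF.
    // CRLF must be handled first so that it does not become two LFs.
    public static String normalize(String s) {
        return s.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String out = normalize("a\r\nb\rc\nd");
        System.out.println(out.equals("a\nb\nc\nd")); // true
    }
}
```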

Changes from 1.0rc1 to 1.0rc2
=============================
Isolated & and &# now don't crash parser
TagSoup no longer depends on /dev/stdin existing
Refactored Parser class, removing main method to new CommandLine class
Changes to content models of form, button, table, and tr elements in html.tssl
'</scr' + 'ipt>' in a script element no longer terminates it
Introduced "uncloseability" of form and table elements
"pyxin" property specifies that input is in PYX format
Correctly cope with unexpected characters around colons, also with multiple colons
Correctly output comments with "--" in them (by adding a space)

Changes from 0.10.2 to 1.0rc1
=============================
Script can now appear anywhere
Switch -nocdata correctly implemented
Eliminated useless M_n constants in Schema
Introduced <memberOfAny> and <isRoot> as alternatives to
	<memberOf> in TSSL
Allow prefixes in element names
Attributes are now normalized
Expanded public API for Element and ElementType
Javadoc improved

Changes from 0.10.1 to 0.10.2
=============================
Removed misfeature whereby > terminated a tag even inside quotes
Added licensing language to XSLT scripts, RELAX NG schemas
Removed long-standing mishandling of entity references in attributes
Cleaned up logic for converting junky strings to proper XML Names
Correctly handle empty tag that has no whitespace or attributes
Restore correct 0.9.3 handling of an apparent end-tag in a CDATA element
Added script element to content model of head element

Changes from 0.9.7 to 0.10.1 (there is no 0.10.0):
==================================================
Convert to XSLT configuration exclusively;
	Perl code and tab-separated tables are gone
Remove xmlns:* attributes
Append "_" to attribute names ending in ":"
Don't prepend "_" to an attribute name starting in "_"
Handle namespace prefixes in attributes:
	"xml" prefix is handled correctly
	other prefixes are mapped to "urn:x-prefix:foo"
Ignore XML declarations
-Dnocdata=true turns off F_CDATA on script and style elements
Fixed off-by-one errors in character references that made them uninterpreted
Start-tags ending in a minimized attribute are no longer being dropped
XML empty tags are now supported (though slashes are still allowed in
	unquoted attribute values)
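
The attribute-prefix entries above ("xml" handled correctly, other prefixes mapped to "urn:x-prefix:foo") can be pictured with the following sketch. The class and method names are hypothetical, and treating unprefixed attributes as being in no namespace is an assumption of this example rather than something the entries state.

```java
final class PrefixDemo {
    // The namespace URI bound to the reserved "xml" prefix.
    static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    // Hypothetical helper: pick a namespace URI for an attribute name.
    public static String uriFor(String attrName) {
        int colon = attrName.indexOf(':');
        if (colon == -1) {
            return ""; // assumption: no prefix means no namespace
        }
        String prefix = attrName.substring(0, colon);
        if (prefix.equals("xml")) {
            return XML_NS;
        }
        // Any other prefix gets a synthetic URN built from the prefix itself.
        return "urn:x-prefix:" + prefix;
    }

    public static void main(String[] args) {
        System.out.println(uriFor("xml:lang")); // http://www.w3.org/XML/1998/namespace
        System.out.println(uriFor("foo:bar"));  // urn:x-prefix:foo
        System.out.println(uriFor("href").isEmpty()); // true
    }
}
```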

Changes from 0.9.6 to 0.9.7:
============================
Upgraded AFL to version 2.1
Passed through newlines in character content (very old bug)

Changes from 0.9.5 to 0.9.6:
============================
Script element can appear directly in body
">" terminates a start-tag even inside a quoted attribute,
	to protect against unbalanced quotes
"_" is prepended to attributes that don't begin with a letter
Remove "xmlns" attributes from the input
All standard features can now be set
	(although there is no effect from doing so)
New "bogons-empty" feature can be set to false to give bogons
	content model of ANY rather than EMPTY;
	-Dany switch sets this feature to false
TSSL now has an explicit group element to declare an element group
STML is a new XML format for modeling state-table changes
License updated to AFL 2.1

Changes from 0.9.4 to 0.9.5:
============================
S in the statetable now means \r and \n and \t as well as space
	(as was always intended; brain fart!)
Ins and del elements are now allowed everywhere
TSSL now correctly supports attributes that are legal on all elements

Changes from 0.9.3 to 0.9.4:
============================
Fixed paper-bag bug that revealed attribute type BOOLEAN to applications.
Obsolete ABSTRACT removed in favor of README.
Improved implementation of CDATA restart after bogus end-tag.
Allowed hyphen, underscore, and period in names as well as colon.
First cut at TagSoup Schema Language -- doesn't do anything yet.
Support CDATA sections on input.
Don't generate built-in entities within CDATA elements.

Changes from 0.9.2 to 0.9.3:
============================
Convenience main program "tagsoup" in bin directory.
Begin to integrate tests.
Introduced BOOLEAN type (currently just converted to NMTOKEN).
Features that actually work are now named constants in Parser.
Double root elements are really gone now.
ID attributes weren't being removed from restarted elements.
Fixed a bug that made unknown elements disappear in some cases.
Parser is now safely reusable.
PYXWriter and XMLWriter now implement LexicalHandler.
Parser reports comments, startCDATA, and endCDATA events to a LexicalHandler.
ScanHandler methods now throw only SAXException, not also IOException.
-Dlexical=true switch sets the ContentHandler as a LexicalHandler as well
	(XMLWriter prints comments, ignores CDATA sections; PYXWriter ignores all).
-Dreuse=true switch reuses a single Parser object (no great speed gain).
We now disallow an a element as the child of another a element.
An empty input is now treated as zero-length character content.
HTMLWriter is gone in favor of an extended XMLWriter with get/setHTMLMode methods.
CDATA elements only terminate with matching end-tags (thanks to Sebastien Bardoux).

Changes from 0.9.1 to 0.9.2:
============================
No longer inserts bogus ; after unknown entity reference without ;.
Consecutive entity references now work correctly.
Setting namespaces and namespace-prefixes methods now works.
-Dnons=true option turns off namespace and prefix.
New feature "http://www.ccil.org/~cowan/tagsoup/features/ignore-bogons"
	suppresses unknown start-tags (any end-tag will be automatically ignored).
-Dnobogons=true option turns ignore-bogons on.
Suppress unknown and/or empty initial start-tag always
	(prevents double root element).
Schema now allows style as an inline element, like script.
Schema now allows tr as a child of table to avoid problems with embedded tables.
Clear Parser instance variables to make Parsers properly reusable.

Changes from 0.9 to 0.9.1:
==========================
Incorporated patch for -jar support by Joseph Walton.
Incorporated patch for Megginson XMLWriter support by Joseph Walton.
Changed existing XMLWriter to HTMLWriter.
Rewrote Parser.main for better features, removed Tester class.
-Dnewline=true removed, now implied by -DHTML=true.
-Dfiles=true now used to generate separate outputs (old Tester behavior)
	with extension xhtml (removing any old extension).
Fixed nasty bug in HTMLScanner that was failing to fix unusual entities.
Don't attempt to smash whitespace to spaces any more.

Changes from 0.8 to 0.9:
========================
Ant-ified by Martin Rademacher.
Don't suppress colons in element names.
Entity problems fixed (I hope).
Can now set namespace and namespace-prefixes features (without effect).
Properly templatize HTMLModels.java.
Attributes are no longer in the HTML namespace.