xml.minidom is stripping out my CRLF's in attrib values!!

Duncan Booth duncan at NOSPAMrcp.co.uk
Mon Sep 9 10:36:27 EDT 2002


Ahmad Baitalmal <ahmad at NOSPAMbitbuilder.com> wrote in
news:3D7C8C76.8060700 at NOSPAMbitbuilder.com: 

> That's not my problem; that one I know about (CRLFs between
> nodes).
> 
> Here is the deal:
><cow>
>     <hide value="spotted
> with black and white"></hide>
></cow>
> 
> After the word "spotted" there is a CRLF, -inside- the attribute
> value. Shouldn't it be treated as part of the value?
> 
> The value now comes stripped of that crlf.

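Section 3.3.3 of the XML specification requires attribute values to be
normalised before they reach the application. Normalisation converts your
literal CRLF to a single space, and for attributes declared with a type
other than CDATA it also strips leading and trailing spaces and collapses
runs of spaces: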

> 3.3.3 Attribute-Value Normalization
> Before the value of an attribute is passed to the application or
> checked for validity, the XML processor must normalize the attribute
> value by applying the algorithm below, or by using some other method
> such that the value passed to the application is the same as that
> produced by the algorithm. 
> 
> 1. All line breaks must have been normalized on input to #xA as
>    described in 2.11 End-of-Line Handling, so the rest of this
>    algorithm operates on text normalized in this way.
> 
> 2. Begin with a normalized value consisting of the empty string.
> 
> 3. For each character, entity reference, or character reference in
>    the unnormalized attribute value, beginning with the first and
>    continuing to the last, do the following:
> 
>    * For a character reference, append the referenced character to
>      the normalized value.
> 
>    * For an entity reference, recursively apply step 3 of this
>      algorithm to the replacement text of the entity.
> 
>    * For a white space character (#x20, #xD, #xA, #x9), append a
>      space character (#x20) to the normalized value.
> 
>    * For another character, append the character to the normalized
>      value.
> 
> If the attribute type is not CDATA, then the XML processor must
> further process the normalized attribute value by discarding any
> leading and trailing space (#x20) characters, and by replacing
> sequences of space (#x20) characters by a single space (#x20)
> character. 
> 
> Note that if the unnormalized attribute value contains a character
> reference to a white space character other than space (#x20), the
> normalized value contains the referenced character itself (#xD, #xA or
> #x9). This contrasts with the case where the unnormalized value
> contains a white space character (not a reference), which is replaced
> with a space character (#x20) in the normalized value and also
> contrasts with the case where the unnormalized value contains an
> entity reference whose replacement text contains a white space
> character; being recursively processed, the white space character is
> replaced with a space character (#x20) in the normalized value. 
> 
> All attributes for which no declaration has been read should be
> treated by a non-validating processor as if declared CDATA. 
> 
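
If it helps to see the algorithm spelled out, here is a rough Python sketch
of it. This is only my illustration of the quoted text, not what expat or
minidom actually run: the names normalize_attribute, _expand, PREDEFINED
and TOKEN are made up for the example, and it does no well-formedness
checking at all.

```python
import re

# Replacement text of the five predefined entities (XML 1.0, section 4.6).
# Hypothetical table for this sketch; a real parser also reads the DTD.
PREDEFINED = {"lt": "&#60;", "gt": ">", "amp": "&#38;", "apos": "'", "quot": '"'}

# One token per match: hex char ref, decimal char ref, entity ref, or any char.
TOKEN = re.compile(r"&#x([0-9A-Fa-f]+);|&#([0-9]+);|&([^;&\s]+);|(.)", re.DOTALL)


def _expand(text, entities):
    """Step 3: walk the value and build the normalized string."""
    parts = []
    for hex_ref, dec_ref, name, char in TOKEN.findall(text):
        if hex_ref:
            # Character reference: the referenced character is kept as-is.
            parts.append(chr(int(hex_ref, 16)))
        elif dec_ref:
            parts.append(chr(int(dec_ref)))
        elif name:
            # Entity reference: recurse into the replacement text.
            parts.append(_expand(entities[name], entities))
        elif char in " \r\n\t":
            # Literal white space always becomes a single space.
            parts.append(" ")
        else:
            parts.append(char)
    return "".join(parts)


def normalize_attribute(raw, entities=PREDEFINED, is_cdata=True):
    # Step 1: end-of-line handling turns CRLF and lone CR into #xA.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    value = _expand(text, entities)
    if not is_cdata:
        # Non-CDATA types: collapse runs of spaces and trim both ends.
        value = re.sub(" +", " ", value).strip(" ")
    return value


print(repr(normalize_attribute("spotted\r\nwith black and white")))
print(repr(normalize_attribute("spotted&#10;with black and white")))
```

The first call loses the line break, the second keeps it, which is exactly
the distinction the note in the spec is making.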

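So the only way to get a newline into an attribute value is to write it as
a character reference, e.g. &#10; (or &#13;&#10; if you really want a
CRLF). An entity reference will not do: as the note above says, literal
white space in an entity's replacement text is normalised to a space as
well.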
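
A quick way to check this with minidom is something like the following. It
is just a sketch using your example document; the helper name hide_value is
mine, and it is written for a current Python rather than anything specific
to your setup.

```python
from xml.dom import minidom


def hide_value(xml_text):
    """Parse the document and return the 'value' attribute of <hide>."""
    doc = minidom.parseString(xml_text)
    hide = doc.getElementsByTagName("hide")[0]
    return hide.getAttribute("value")


# A literal CRLF inside the attribute value: normalised to one space.
literal = hide_value(
    '<cow><hide value="spotted\r\nwith black and white"></hide></cow>'
)

# The same line break written as character references: it survives parsing.
referenced = hide_value(
    '<cow><hide value="spotted&#13;&#10;with black and white"></hide></cow>'
)

print(repr(literal))     # 'spotted with black and white'
print(repr(referenced))  # 'spotted\r\nwith black and white'
```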

-- 
Duncan Booth                                            
duncan at rcp.co.uk int month(char
*p){return(124864/((p[0]+p[1]-p[2]&0x1f)+1)%12)["\5\x8\3" 
"\6\7\xb\1\x9\xa\2\0\4"];} // Who said my code was obscure? 


