Package org.cyberneko.html.filters
Class Purifier
java.lang.Object
org.cyberneko.html.filters.DefaultFilter
org.cyberneko.html.filters.Purifier
- All Implemented Interfaces:
org.apache.xerces.xni.parser.XMLComponent, org.apache.xerces.xni.parser.XMLDocumentFilter, org.apache.xerces.xni.parser.XMLDocumentSource, org.apache.xerces.xni.XMLDocumentHandler, HTMLComponent
This filter purifies the HTML input to ensure XML well-formedness.
The purification process includes:
- fixing illegal characters in the document, including
  - element and attribute names,
  - processing instruction target and data,
  - document text;
- ensuring the string "--" does not appear in the content of a comment;
- ensuring the string "]]>" does not appear in the content of a CDATA section;
- ensuring that the XML declaration has required pseudo-attributes and that the values are correct; and
- synthesizing missing namespace bindings.
Illegal characters in XML names are converted to the character sequence "_u####_", where "####" is the hexadecimal value of the Unicode character. Illegal characters appearing in document content are converted to the character sequence "\\u####".
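The two escaping rules above can be sketched in plain Java. This is a minimal illustration, not the Purifier source: the class IllegalCharEscapeSketch and its methods are hypothetical, the character tests are simplified stand-ins for the XML name and content rules, the hex digits are assumed to be upper case, and the content escape is assumed to be a single backslash followed by "u" and four hex digits.

```java
final class IllegalCharEscapeSketch {

    // Simplified stand-in for the XML name-character rules (assumption of this sketch;
    // the real rules also restrict which characters may start a name).
    static boolean isNameChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_' || c == '-' || c == '.';
    }

    // Simplified stand-in for the XML 1.0 character range, limited to single UTF-16 units.
    static boolean isContentChar(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD);
    }

    // Four-digit hexadecimal code of the character (upper case is an assumption).
    static String hex4(char c) {
        return String.format("%04X", (int) c);
    }

    // Replaces each illegal name character with "_u", its hex code, and "_".
    static String escapeName(String name) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (isNameChar(c)) {
                out.append(c);
            } else {
                out.append("_u").append(hex4(c)).append('_');
            }
        }
        return out.toString();
    }

    // Replaces each illegal content character with a backslash, "u", and its hex code.
    static String escapeContent(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isContentChar(c)) {
                out.append(c);
            } else {
                out.append("\\u").append(hex4(c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeName("my name"));           // prints my_u0020_name
        System.out.println(escapeName("title"));             // prints title
        System.out.println(escapeContent("a" + (char) 1 + "b"));
    }
}
```

The name form ends with an underscore so that the escape stays unambiguous when the next character is itself a hex digit.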
In comments, the character '-' is replaced by the character sequence "- " to prevent "--" from ever appearing in the comment content. For CDATA sections, the character ']' is replaced by the character sequence "] " to prevent "]]" from appearing.
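As a rough illustration of the two replacements just described, the following sketch applies them to plain strings. The class DelimiterBreakSketch and its methods are hypothetical names chosen for this example; the actual filter works on XNI XMLString buffers rather than java.lang.String.

```java
final class DelimiterBreakSketch {

    // Follows every '-' with a space so that "--" cannot occur in comment content.
    static String breakComment(String text) {
        return text.replace("-", "- ");
    }

    // Follows every ']' with a space so that "]]" cannot occur in CDATA content.
    static String breakCData(String text) {
        return text.replace("]", "] ");
    }

    public static void main(String[] args) {
        System.out.println(breakComment("a--b"));  // prints a- - b
        System.out.println(breakCData("x]]>y"));   // prints x] ] >y
    }
}
```

Because every occurrence of the character is rewritten, not only adjacent pairs, the output never needs a second pass.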
The URI used for synthesized namespace bindings is "http://cyberneko.org/html/ns/synthesized/number" where number is generated to ensure uniqueness.
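One way to picture the URI scheme is a fixed prefix plus a per-document counter. The sketch below assumes exactly that; the class SynthesizedNamespaceSketch is hypothetical, and the starting value of the counter is an assumption rather than documented behavior.

```java
final class SynthesizedNamespaceSketch {

    // Prefix taken from the URI pattern given in the class description.
    static final String PREFIX = "http://cyberneko.org/html/ns/synthesized/";

    // Number of bindings handed out so far (starting at 0 is an assumption).
    private int count = 0;

    // Returns a URI that differs from every URI returned earlier by this instance.
    String nextUri() {
        return PREFIX + count++;
    }

    public static void main(String[] args) {
        SynthesizedNamespaceSketch ns = new SynthesizedNamespaceSketch();
        System.out.println(ns.nextUri());
        System.out.println(ns.nextUri());
    }
}
```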
- Version:
- $Id: Purifier.java,v 1.5 2005/02/14 03:56:54 andyc Exp $
- Author:
- Andy Clark
-
Field Summary
Fields (modifier and type, field, description):
- protected static final String AUGMENTATIONS: Include infoset augmentations.
- protected boolean fAugmentations: Augmentations.
- protected boolean fInCDATASection: True if inside a CDATA section.
- protected org.apache.xerces.xni.NamespaceContext fNamespaceContext: Namespace information.
- protected boolean fNamespaces: Namespaces.
- protected String fPublicId: Public identifier of doctype declaration.
- protected boolean fSeenDoctype: True if the doctype declaration was seen.
- protected boolean fSeenRootElement: True if root element was seen.
- protected int fSynthesizedNamespaceCount: Synthesized namespace binding count.
- protected String fSystemId: System identifier of doctype declaration.
- protected static final String NAMESPACES: Namespaces.
- protected static final HTMLEventInfo SYNTHESIZED_ITEM: Synthesized event info item.
- static final String SYNTHESIZED_NAMESPACE_PREFX: Synthesized namespace binding prefix.
Fields inherited from class org.cyberneko.html.filters.DefaultFilter
fDocumentHandler, fDocumentSource
-
Constructor Summary
Constructors:
- Purifier()
-
Method Summary
Methods (modifier and type, method, description):
- void characters(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs): Characters.
- void comment(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs): Comment.
- void doctypeDecl(String root, String pubid, String sysid, org.apache.xerces.xni.Augmentations augs): Doctype declaration.
- void emptyElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs, org.apache.xerces.xni.Augmentations augs): Empty element.
- void endCDATA(org.apache.xerces.xni.Augmentations augs): End CDATA section.
- void endElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.Augmentations augs): End element.
- protected void handleStartDocument(): Handle start document.
- protected void handleStartElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs): Handle start element.
- void processingInstruction(String target, org.apache.xerces.xni.XMLString data, org.apache.xerces.xni.Augmentations augs): Processing instruction.
- protected String purifyName(String name, boolean localpart): Purify name.
- protected org.apache.xerces.xni.QName purifyQName(org.apache.xerces.xni.QName qname): Purify qualified name.
- protected org.apache.xerces.xni.XMLString purifyText(org.apache.xerces.xni.XMLString text): Purify content.
- void reset(org.apache.xerces.xni.parser.XMLComponentManager manager): Resets the component.
- void startCDATA(org.apache.xerces.xni.Augmentations augs): Start CDATA section.
- void startDocument(org.apache.xerces.xni.XMLLocator locator, String encoding, org.apache.xerces.xni.Augmentations augs): Start document.
- void startDocument(org.apache.xerces.xni.XMLLocator locator, String encoding, org.apache.xerces.xni.NamespaceContext nscontext, org.apache.xerces.xni.Augmentations augs): Start document.
- void startElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs, org.apache.xerces.xni.Augmentations augs): Start element.
- protected void synthesizeBinding(org.apache.xerces.xni.XMLAttributes attrs, String ns): Synthesize namespace binding.
- protected final org.apache.xerces.xni.Augmentations synthesizedAugs(): Returns an augmentations object with a synthesized item added.
- protected static String toHexString(int c, int padlen): Returns a padded hexadecimal string for the given value.
- void xmlDecl(String version, String encoding, String standalone, org.apache.xerces.xni.Augmentations augs): XML declaration.
Methods inherited from class org.cyberneko.html.filters.DefaultFilter
endDocument, endGeneralEntity, endPrefixMapping, getDocumentHandler, getDocumentSource, getFeatureDefault, getPropertyDefault, getRecognizedFeatures, getRecognizedProperties, ignorableWhitespace, merge, setDocumentHandler, setDocumentSource, setFeature, setProperty, startGeneralEntity, startPrefixMapping, textDecl
-
Field Details
-
SYNTHESIZED_NAMESPACE_PREFX
public static final String SYNTHESIZED_NAMESPACE_PREFX
Synthesized namespace binding prefix.
-
NAMESPACES
protected static final String NAMESPACES
Namespaces.
-
AUGMENTATIONS
protected static final String AUGMENTATIONS
Include infoset augmentations.
-
SYNTHESIZED_ITEM
protected static final HTMLEventInfo SYNTHESIZED_ITEM
Synthesized event info item.
-
fNamespaces
protected boolean fNamespaces
Namespaces.
-
fAugmentations
protected boolean fAugmentations
Augmentations.
-
fSeenDoctype
protected boolean fSeenDoctype
True if the doctype declaration was seen.
-
fSeenRootElement
protected boolean fSeenRootElement
True if root element was seen.
-
fInCDATASection
protected boolean fInCDATASection
True if inside a CDATA section.
-
fPublicId
protected String fPublicId
Public identifier of doctype declaration.
-
fSystemId
protected String fSystemId
System identifier of doctype declaration.
-
fNamespaceContext
protected org.apache.xerces.xni.NamespaceContext fNamespaceContext
Namespace information.
-
fSynthesizedNamespaceCount
protected int fSynthesizedNamespaceCount
Synthesized namespace binding count.
-
-
Constructor Details
-
Purifier
public Purifier()
-
-
Method Details
-
reset
public void reset(org.apache.xerces.xni.parser.XMLComponentManager manager) throws org.apache.xerces.xni.parser.XMLConfigurationException
Description copied from class: DefaultFilter
Resets the component. The component can query the component manager about any features and properties that affect the operation of the component.
- Specified by:
reset in interface org.apache.xerces.xni.parser.XMLComponent
- Overrides:
reset in class DefaultFilter
- Parameters:
manager - The component manager.
- Throws:
org.apache.xerces.xni.parser.XMLConfigurationException
-
startDocument
public void startDocument(org.apache.xerces.xni.XMLLocator locator, String encoding, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Start document.
- Overrides:
startDocument in class DefaultFilter
- Throws:
org.apache.xerces.xni.XNIException
-
startDocument
public void startDocument(org.apache.xerces.xni.XMLLocator locator, String encoding, org.apache.xerces.xni.NamespaceContext nscontext, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Start document.
- Specified by:
startDocument in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
startDocument in class DefaultFilter
- Throws:
org.apache.xerces.xni.XNIException
-
xmlDecl
public void xmlDecl(String version, String encoding, String standalone, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
XML declaration.
- Specified by:
xmlDecl in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
xmlDecl in class DefaultFilter
- Throws:
org.apache.xerces.xni.XNIException
-
comment
public void comment(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Comment.
- Specified by:
comment in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
comment in class DefaultFilter
- Throws:
org.apache.xerces.xni.XNIException
-
processingInstruction
public void processingInstruction(String target, org.apache.xerces.xni.XMLString data, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Processing instruction.
- Specified by:
processingInstruction in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
processingInstruction in class DefaultFilter
- Throws:
org.apache.xerces.xni.XNIException
-
doctypeDecl
public void doctypeDecl(String root, String pubid, String sysid, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Doctype declaration.
- Specified by:
doctypeDecl in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
doctypeDecl in class DefaultFilter
- Throws:
org.apache.xerces.xni.XNIException
-
startElement
public void startElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Start element.
- Specified by:
startElement in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
startElement in class DefaultFilter
- Throws:
org.apache.xerces.xni.XNIException
-
emptyElement
public void emptyElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Empty element.
- Specified by:
emptyElement in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
emptyElement in class DefaultFilter
- Throws:
org.apache.xerces.xni.XNIException
-
startCDATA
public void startCDATA(org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Start CDATA section.
- Specified by:
startCDATA in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
startCDATA in class DefaultFilter
- Throws:
org.apache.xerces.xni.XNIException
-
endCDATA
public void endCDATA(org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
End CDATA section.
- Specified by:
endCDATA in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
endCDATA in class DefaultFilter
- Throws:
org.apache.xerces.xni.XNIException
-
characters
public void characters(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Characters.
- Specified by:
characters in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
characters in class DefaultFilter
- Throws:
org.apache.xerces.xni.XNIException
-
endElement
public void endElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
End element.
- Specified by:
endElement in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
endElement in class DefaultFilter
- Throws:
org.apache.xerces.xni.XNIException
-
handleStartDocument
protected void handleStartDocument()
Handle start document.
-
handleStartElement
protected void handleStartElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs)
Handle start element.
-
synthesizeBinding
protected void synthesizeBinding(org.apache.xerces.xni.XMLAttributes attrs, String ns)
Synthesize namespace binding.
-
synthesizedAugs
protected final org.apache.xerces.xni.Augmentations synthesizedAugs()
Returns an augmentations object with a synthesized item added.
-
purifyQName
protected org.apache.xerces.xni.QName purifyQName(org.apache.xerces.xni.QName qname)
Purify qualified name.
-
purifyName
protected String purifyName(String name, boolean localpart)
Purify name.
-
purifyText
protected org.apache.xerces.xni.XMLString purifyText(org.apache.xerces.xni.XMLString text)
Purify content.
-
toHexString
protected static String toHexString(int c, int padlen)
Returns a padded hexadecimal string for the given value.
-
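For readers unfamiliar with padded hexadecimal output, the following sketch shows one way a helper with the same parameters as toHexString(int c, int padlen) could behave. It is not the library's code: the class PaddedHexSketch and the method toPaddedHex are hypothetical, and the lower-case digits and the choice to leave longer values untruncated are assumptions.

```java
final class PaddedHexSketch {

    // Left-pads the hexadecimal form of value with zeros up to padlen digits.
    // Values that already need more digits than padlen are returned in full.
    static String toPaddedHex(int value, int padlen) {
        StringBuilder sb = new StringBuilder(Integer.toHexString(value));
        while (sb.length() < padlen) {
            sb.insert(0, '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPaddedHex(0x20, 4));     // prints 0020
        System.out.println(toPaddedHex(0x1F600, 4));  // prints 1f600
    }
}
```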