diff --git a/ChangeLog.md b/ChangeLog.md new file mode 100644 index 0000000..0ddb1e3 --- /dev/null +++ b/ChangeLog.md @@ -0,0 +1,102 @@ +## 1.7.0 + +* `psDecodeEntities` is no longer passed numeric character references (e.g., ` `, `A`) and the predefined XML entities (`&`, `<`, etc). They are now handled by the parser. Both of these construct classes only have one spec-compliant interpretation and this behaviour must always be present, so it makes no sense to force user code to re-implement the parsing logic. +* In prior versions of xml-conduit, hexadecimal character references with a leading `0x` or `0X` like `&0x20;` were accepted. This was not in compliance with the XML specification and it has been corrected. +* xml-conduit now rejects some (but not all) invalid-according-to-spec entities during parsing: specifically, entities with a leading `#` that are not character references are no longer allowed and will be parse errors. + +## 1.6.0 + +* Dropped the dependency on `data-default` for `data-default-class`, reducing the transitive dependency load. For most users, this will not be a breaking change, but it does mean that importing `Text.XML.Conduit` will no longer bring various instances for `Default` into scope. This will break code that relies on those instances and does not otherwise see them. To fix this, import `Data.Default` from `data-default` or one of the more specific instance-providing packages directly (e.g., `data-default-dlist` for the `DList` instance). + +## 1.5.1 + +* New render setting, `rsXMLDeclaration`; setting it to `False` omits the XML declaration. + +## 1.5.0 + +* `tag` function no longer throws an exception when attributes don't match [#93](https://github.com/snoyberg/xml/pull/93) +* Add `many_` combinator to avoid building results in memory [#94](https://github.com/snoyberg/xml/pull/94) +* Turn some functions from `Consumer Event m a` to `ConduitM Event o m a` to allow yielding values +* Replace `takeAllTreesContent` with `takeAnyTreeContent`, that only consumes one tree +* Introduce `NameMatcher` type to refactor tag parsers +* Add a couple of `take*` functions to stream events rather than parse them +* Rename `ignore*` functions to comply with naming convention + +## 1.4.0.3 + +* Compatibility with blaze-markup-0.8.0.0 [#95](https://github.com/snoyberg/xml/issues/95) + +## 1.4.0.2 + +* Parse XML encoding case-insensitively +* Remove extra EOL when printing XmlException + +## 1.4.0.1 + +* Handle CDATA in takeAllTreesContent [#88](https://github.com/snoyberg/xml/pull/88) + +## 1.4.0 + +* Improve XmlException definition and usage +* Add 'takeAllTreesContent' function + +## 1.3.5 + +* Improvements for using xml-conduit for streaming XML protocols [#85](https://github.com/snoyberg/xml/pull/85) + +## 1.3.4.2 + +* transformers dep bump + +## 1.3.4.1 + +* Remove unneeded ImpredicativeTypes + +## 1.3.4 + +* dropWS retains consumed whitespace values [#74](https://github.com/snoyberg/xml/issues/74) [#75](https://github.com/snoyberg/xml/pull/75) [#76](https://github.com/snoyberg/xml/pull/76) + +## 1.3.3.1 + +* Generalize signature of choose (Fixes [#72](https://github.com/snoyberg/xml/issues/72)) [#73](https://github.com/snoyberg/xml/pull/73) + +## 1.3.3 + +* New render setting to control when to use CDATA [#68](https://github.com/snoyberg/xml/pull/68) +* Escaping CDATA closing tag in CDATA [#69](https://github.com/snoyberg/xml/pull/69) + +## 1.3.2 + +* Support for iso-8859-1 [#63](https://github.com/snoyberg/xml/issues/63) + +## 1.3.1 + +* Add functions to ignore subtrees & result-streaming (yield) parsers [#58](https://github.com/snoyberg/xml/pull/58) + +## 1.3.0 + +* Drop system-filepath + +## 1.2.6 + +* Reuse 'MonadThrow' and 'force' for 'AttrParser' [#52](https://github.com/snoyberg/xml/pull/52) + +## 1.2.5 + +* Added helper functions to render XML elements [#48](https://github.com/snoyberg/xml/pull/48) + +## 1.2.4 + +* 'parseText' becomes 'parseText'/'parseTextPos', depending on the output type [#47](https://github.com/snoyberg/xml/pull/47) + +## 1.2.3.3 + +* Allow blaze-builder 0.4 + +## 1.2.3.2 + +* Doc fix [#44](https://github.com/snoyberg/xml/pull/44) + +## 1.2.3.1 + +Support monad-control 1.0 diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..aa73fd2 --- /dev/null +++ b/LICENSE @@ -0,0 +1,20 @@ +Copyright 2010, Suite Solutions. All rights reserved. + +Permission is hereby granted, free of charge, to any person obtaining +a copy of this software and associated documentation files (the +"Software"), to deal in the Software without restriction, including +without limitation the rights to use, copy, modify, merge, publish, +distribute, sublicense, and/or sell copies of the Software, and to +permit persons to whom the Software is furnished to do so, subject to +the following conditions: + +The above copyright notice and this permission notice shall be +included in all copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE +LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION +OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION +WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. diff --git a/README.md b/README.md new file mode 100644 index 0000000..8f43e20 --- /dev/null +++ b/README.md @@ -0,0 +1,19 @@ +## xml-conduit + +This package provides parsing and rendering functions for XML. It is based on the datatypes found in the xml-types package. This package is broken up into the following modules: + +* Text.XML: DOM-based parsing and rendering. This is the most commonly used module. + +* Text.XML.Cursor: A wrapper around `Text.XML` which allows bidirectional traversing of the DOM, similar to XPath. (Note: Text.XML.Cursor.Generic is the same concept, but will work with any node representation.) + +* Text.XML.Unresolved: A slight modification to `Text.XML` which does not require all entities to be resolved at parsing. The datatypes are slightly more complicated here, and therefore this module is only recommended when you need to deal directly with raw entities. + +* Text.XML.Stream.Parse: Streaming parser, including some streaming parser combinators. + +* Text.XML.Stream.Render: Streaming renderer. + +Additionally, the [xml-hamlet +package](http://www.stackage.org/package/xml-hamlet) provides a more convenient +syntax for creating XML documents. For a more thorough tutorial on this +library, please see +[http://www.yesodweb.com/book/xml](http://www.yesodweb.com/book/xml). diff --git a/Setup.lhs b/Setup.lhs new file mode 100644 index 0000000..06e2708 --- /dev/null +++ b/Setup.lhs @@ -0,0 +1,7 @@ +#!/usr/bin/env runhaskell + +> module Main where +> import Distribution.Simple + +> main :: IO () +> main = defaultMain diff --git a/Text/XML.hs b/Text/XML.hs new file mode 100644 index 0000000..acb4962 --- /dev/null +++ b/Text/XML.hs @@ -0,0 +1,353 @@ +{-# LANGUAGE CPP #-} +{-# LANGUAGE DeriveDataTypeable #-} +{-# LANGUAGE FlexibleContexts #-} +{-# LANGUAGE OverloadedStrings #-} +{-# LANGUAGE PatternGuards #-} +{-# LANGUAGE RankNTypes #-} +-- | DOM-based parsing and rendering. +-- +-- This module requires that all entities be resolved at parsing. If you need +-- to interact with unresolved entities, please use "Text.XML.Unresolved". This +-- is the recommended module for most uses cases. +-- +-- While many of the datatypes in this module are simply re-exported from +-- @Data.XML.Types@, 'Document', 'Node' and 'Element' are all redefined here to +-- disallow the possibility of unresolved entities. Conversion functions are +-- provided to switch between the two sets of datatypes. +-- +-- For simpler, bidirectional traversal of the DOM tree, see the +-- "Text.XML.Cursor" module. +module Text.XML + ( -- * Data types + Document (..) + , Prologue (..) + , Instruction (..) + , Miscellaneous (..) + , Node (..) + , Element (..) + , Name (..) + , Doctype (..) + , ExternalID (..) + -- * Parsing + -- ** Files + , readFile + -- ** Bytes + , parseLBS + , parseLBS_ + , sinkDoc + -- ** Text + , parseText + , parseText_ + , sinkTextDoc + -- ** Other + , fromEvents + , UnresolvedEntityException (..) + , XMLException (..) + -- * Rendering + , writeFile + , renderLBS + , renderText + , renderBytes + -- * Settings + , def + -- ** Parsing + , ParseSettings + , psDecodeEntities + , P.psRetainNamespaces + -- *** Entity decoding + , P.decodeXmlEntities + , P.decodeHtmlEntities + -- ** Rendering + , R.RenderSettings + , R.rsPretty + , R.rsNamespaces + , R.rsAttrOrder + , R.rsUseCDATA + , R.rsXMLDeclaration + , R.orderAttrs + -- * Conversion + , toXMLDocument + , fromXMLDocument + , toXMLNode + , fromXMLNode + , toXMLElement + , fromXMLElement + ) where + +import Control.Applicative ((<$>)) +import Control.DeepSeq (NFData (rnf)) +import Control.Exception (Exception, SomeException, handle, + throw, throwIO) +import Control.Monad.ST (runST) +import Control.Monad.Trans.Resource (MonadThrow, monadThrow, + runExceptionT, runResourceT) +import Data.ByteString (ByteString) +import qualified Data.ByteString.Lazy as L +import Data.Data (Data) +import Data.Either (partitionEithers) +import qualified Data.Map as Map +import Data.Set (Set) +import qualified Data.Set as Set +import Data.Text (Text) +import qualified Data.Text as T +import Data.Typeable (Typeable) +import Data.XML.Types (Doctype (..), ExternalID (..), + Instruction (..), + Miscellaneous (..), Name (..), + Prologue (..)) +import qualified Data.XML.Types as X +import Prelude hiding (readFile, writeFile) +import Text.XML.Stream.Parse (ParseSettings, def, + psDecodeEntities) +import qualified Text.XML.Stream.Parse as P +import qualified Text.XML.Stream.Render as R +import qualified Text.XML.Unresolved as D + +import Control.Monad.Trans.Class (lift) +import Data.Conduit +import qualified Data.Conduit.Binary as CB +import Data.Conduit.Lazy (lazyConsume) +import qualified Data.Conduit.List as CL +import qualified Data.Text.Lazy as TL +import qualified Data.Text.Lazy.Encoding as TLE +import System.IO.Unsafe (unsafePerformIO) + +import Control.Arrow (first) +import Data.List (foldl') +import Data.Monoid (mappend, mempty) +import Data.String (fromString) +import qualified Text.Blaze as B +import qualified Text.Blaze.Html as B +import qualified Text.Blaze.Html5 as B5 +import qualified Text.Blaze.Internal as BI + +data Document = Document + { documentPrologue :: Prologue + , documentRoot :: Element + , documentEpilogue :: [Miscellaneous] + } + deriving (Show, Eq, Typeable, Data) + +#if MIN_VERSION_containers(0, 4, 2) +instance NFData Document where + rnf (Document a b c) = rnf a `seq` rnf b `seq` rnf c `seq` () +#endif + +data Node + = NodeElement Element + | NodeInstruction Instruction + | NodeContent Text + | NodeComment Text + deriving (Show, Eq, Ord, Typeable, Data) + +#if MIN_VERSION_containers(0, 4, 2) +instance NFData Node where + rnf (NodeElement e) = rnf e `seq` () + rnf (NodeInstruction i) = rnf i `seq` () + rnf (NodeContent t) = rnf t `seq` () + rnf (NodeComment t) = rnf t `seq` () +#endif + +data Element = Element + { elementName :: Name + , elementAttributes :: Map.Map Name Text + , elementNodes :: [Node] + } + deriving (Show, Eq, Ord, Typeable, Data) + +#if MIN_VERSION_containers(0, 4, 2) +instance NFData Element where + rnf (Element a b c) = rnf a `seq` rnf b `seq` rnf c `seq` () +#endif + +{- +readFile :: FilePath -> ParseSettings -> IO (Either SomeException Document) +readFile_ :: FIlePath -> ParseSettings -> IO Document +-} + +toXMLDocument :: Document -> X.Document +toXMLDocument = toXMLDocument' def + +toXMLDocument' :: R.RenderSettings -> Document -> X.Document +toXMLDocument' rs (Document a b c) = X.Document a (toXMLElement' rs b) c + +toXMLElement :: Element -> X.Element +toXMLElement = toXMLElement' def + +toXMLElement' :: R.RenderSettings -> Element -> X.Element +toXMLElement' rs (Element name as nodes) = + X.Element name as' nodes' + where + as' = map (\(x, y) -> (x, [X.ContentText y])) $ R.rsAttrOrder rs name as + nodes' = map (toXMLNode' rs) nodes + +toXMLNode :: Node -> X.Node +toXMLNode = toXMLNode' def + +toXMLNode' :: R.RenderSettings -> Node -> X.Node +toXMLNode' rs (NodeElement e) = X.NodeElement $ toXMLElement' rs e +toXMLNode' _ (NodeContent t) = X.NodeContent $ X.ContentText t +toXMLNode' _ (NodeComment c) = X.NodeComment c +toXMLNode' _ (NodeInstruction i) = X.NodeInstruction i + +fromXMLDocument :: X.Document -> Either (Set Text) Document +fromXMLDocument (X.Document a b c) = + case fromXMLElement b of + Left es -> Left es + Right b' -> Right $ Document a b' c + +fromXMLElement :: X.Element -> Either (Set Text) Element +fromXMLElement (X.Element name as nodes) = + case (lnodes, las) of + ([], []) -> Right $ Element name ras rnodes + (x, []) -> Left $ Set.unions x + ([], y) -> Left $ Set.unions y + (x, y) -> Left $ Set.unions x `Set.union` Set.unions y + where + enodes = map fromXMLNode nodes + (lnodes, rnodes) = partitionEithers enodes + eas = map go as + (las, ras') = partitionEithers eas + ras = Map.fromList ras' + go (x, y) = + case go' [] id y of + Left es -> Left es + Right y' -> Right (x, y') + go' [] front [] = Right $ T.concat $ front [] + go' errs _ [] = Left $ Set.fromList errs + go' errs front (X.ContentText t:ys) = go' errs (front . (:) t) ys + go' errs front (X.ContentEntity t:ys) = go' (t : errs) front ys + +fromXMLNode :: X.Node -> Either (Set Text) Node +fromXMLNode (X.NodeElement e) = NodeElement <$> fromXMLElement e +fromXMLNode (X.NodeContent (X.ContentText t)) = Right $ NodeContent t +fromXMLNode (X.NodeContent (X.ContentEntity t)) = Left $ Set.singleton t +fromXMLNode (X.NodeComment c) = Right $ NodeComment c +fromXMLNode (X.NodeInstruction i) = Right $ NodeInstruction i + +readFile :: ParseSettings -> FilePath -> IO Document +readFile ps fp = handle + (throwIO . InvalidXMLFile fp) + (runResourceT $ CB.sourceFile fp $$ sinkDoc ps) + +data XMLException = InvalidXMLFile FilePath SomeException + deriving Typeable + +instance Show XMLException where + show (InvalidXMLFile fp e) = concat + [ "Error parsing XML file " + , fp + , ": " + , show e + ] +instance Exception XMLException + +parseLBS :: ParseSettings -> L.ByteString -> Either SomeException Document +parseLBS ps lbs = runST + $ runExceptionT + $ CL.sourceList (L.toChunks lbs) + $$ sinkDoc ps + +parseLBS_ :: ParseSettings -> L.ByteString -> Document +parseLBS_ ps = either throw id . parseLBS ps + +sinkDoc :: MonadThrow m + => ParseSettings + -> Consumer ByteString m Document +sinkDoc ps = P.parseBytesPos ps =$= fromEvents + +parseText :: ParseSettings -> TL.Text -> Either SomeException Document +parseText ps tl = runST + $ runExceptionT + $ CL.sourceList (TL.toChunks tl) + $$ sinkTextDoc ps + +parseText_ :: ParseSettings -> TL.Text -> Document +parseText_ ps = either throw id . parseText ps + +sinkTextDoc :: MonadThrow m + => ParseSettings + -> Consumer Text m Document +sinkTextDoc ps = P.parseTextPos ps =$= fromEvents + +fromEvents :: MonadThrow m => Consumer P.EventPos m Document +fromEvents = do + d <- D.fromEvents + either (lift . monadThrow . UnresolvedEntityException) return $ fromXMLDocument d + +data UnresolvedEntityException = UnresolvedEntityException (Set Text) + deriving (Show, Typeable) +instance Exception UnresolvedEntityException + +--renderBytes :: MonadUnsafeIO m => R.RenderSettings -> Document -> Producer m ByteString +renderBytes rs doc = D.renderBytes rs $ toXMLDocument' rs doc + +writeFile :: R.RenderSettings -> FilePath -> Document -> IO () +writeFile rs fp doc = + runResourceT $ renderBytes rs doc $$ CB.sinkFile fp + +renderLBS :: R.RenderSettings -> Document -> L.ByteString +renderLBS rs doc = + L.fromChunks $ unsafePerformIO + -- not generally safe, but we know that runResourceT + -- will not deallocate any of the resources being used + -- by the process + $ lazyConsume + $ renderBytes rs doc + +renderText :: R.RenderSettings -> Document -> TL.Text +renderText rs = TLE.decodeUtf8 . renderLBS rs + +instance B.ToMarkup Document where + toMarkup (Document _ root _) = B5.docType >> B.toMarkup root + +-- | Note that the special element name +-- @{http://www.snoyman.com/xml2html}ie-cond@ with the single attribute @cond@ +-- is used to indicate an IE conditional comment. +instance B.ToMarkup Element where + toMarkup (Element "{http://www.snoyman.com/xml2html}ie-cond" attrs children) + | [("cond", cond)] <- Map.toList attrs = + B.preEscapedToMarkup ("" :: T.Text) + + toMarkup (Element name' attrs children) = + if isVoid + then foldl' (B.!) leaf attrs' + else foldl' (B.!) parent attrs' childrenHtml + where + childrenHtml :: B.Html + childrenHtml = + case (name `elem` ["style", "script"], children) of + (True, [NodeContent t]) -> B.preEscapedToMarkup t + _ -> mapM_ B.toMarkup children + + isVoid = nameLocalName name' `Set.member` voidElems + + parent :: B.Html -> B.Html + parent = BI.Parent tag open close + leaf :: B.Html +#if MIN_VERSION_blaze_markup(0,8,0) + leaf = BI.Leaf tag open (fromString " />") () +#else + leaf = BI.Leaf tag open (fromString " />") +#endif + + name = T.unpack $ nameLocalName name' + tag = fromString name + open = fromString $ '<' : name + close = fromString $ concat [""] + + attrs' :: [B.Attribute] + attrs' = map (goAttr . first nameLocalName) $ Map.toList attrs + goAttr (key, value) = B.customAttribute (B.textTag key) $ B.toValue value + +instance B.ToMarkup Node where + toMarkup (NodeElement e) = B.toMarkup e + toMarkup (NodeContent t) = B.toMarkup t + toMarkup _ = mempty + +voidElems :: Set.Set T.Text +voidElems = Set.fromAscList $ T.words $ T.pack "area base br col command embed hr img input keygen link meta param source track wbr" diff --git a/Text/XML/Cursor.hs b/Text/XML/Cursor.hs new file mode 100644 index 0000000..1669c13 --- /dev/null +++ b/Text/XML/Cursor.hs @@ -0,0 +1,222 @@ +-- | This module provides for simple DOM traversal. It is inspired by XPath. There are two central concepts here: +-- +-- * A 'Cursor' represents a node in the DOM. It also contains information on the node's /location/. While the 'Node' datatype will only know of its children, a @Cursor@ knows about its parent and siblings as well. (The underlying mechanism allowing this is called a zipper, see and .) +-- +-- * An 'Axis', in its simplest form, takes a @Cursor@ and returns a list of @Cursor@s. It is used for selections, such as finding children, ancestors, etc. Axes can be chained together to express complex rules, such as all children named /foo/. +-- +-- The terminology used in this module is taken directly from the XPath +-- specification: . For those familiar with XPath, +-- the one major difference is that attributes are not considered nodes in this +-- module. +module Text.XML.Cursor + ( + -- * Data types + Cursor + , Axis + -- * Production + , fromDocument + , fromNode + , cut + -- * Axes + , parent + , CG.precedingSibling + , CG.followingSibling + , child + , node + , CG.preceding + , CG.following + , CG.ancestor + , descendant + , orSelf + -- ** Filters + , check + , checkNode + , checkElement + , checkName + , anyElement + , element + , laxElement + , content + , attribute + , laxAttribute + , hasAttribute + , attributeIs + -- * Operators + , (CG.&|) + , (CG.&/) + , (CG.&//) + , (CG.&.//) + , (CG.$|) + , (CG.$/) + , (CG.$//) + , (CG.$.//) + , (CG.>=>) + -- * Type classes + , Boolean(..) + -- * Error handling + , force + , forceM + ) where + +import Control.Exception (Exception) +import Control.Monad +import Control.Monad.Trans.Resource (MonadThrow, monadThrow) +import Data.Function (on) +import qualified Data.Map as Map +import Data.Maybe (maybeToList) +import qualified Data.Text as T +import Text.XML +import Text.XML.Cursor.Generic (child, descendant, node, orSelf, + parent) +import qualified Text.XML.Cursor.Generic as CG + +-- TODO: Consider [Cursor] -> [Cursor]? +-- | The type of an Axis that returns a list of Cursors. +-- They are roughly modeled after . +-- +-- Axes can be composed with '>=>', where e.g. @f >=> g@ means that on all results of +-- the @f@ axis, the @g@ axis will be applied, and all results joined together. +-- Because Axis is just a type synonym for @Cursor -> [Cursor]@, it is possible to use +-- other standard functions like '>>=' or 'concatMap' similarly. +-- +-- The operators '&|', '&/', '&//' and '&.//' can be used to combine axes so that the second +-- axis works on the context nodes, children, descendants, respectively the context node as +-- well as its descendants of the results of the first axis. +-- +-- The operators '$|', '$/', '$//' and '$.//' can be used to apply an axis (right-hand side) +-- to a cursor so that it is applied on the cursor itself, its children, its descendants, +-- respectively itself and its descendants. +-- +-- Note that many of these operators also work on /generalised Axes/ that can return +-- lists of something other than Cursors, for example Content elements. +type Axis = Cursor -> [Cursor] + +-- XPath axes as in http://www.w3.org/TR/xpath/#axes + +-- TODO: Decide whether to use an existing package for this +-- | Something that can be used in a predicate check as a boolean. +class Boolean a where + bool :: a -> Bool + +instance Boolean Bool where + bool = id +instance Boolean [a] where + bool = not . null +instance Boolean (Maybe a) where + bool (Just _) = True + bool _ = False +instance Boolean (Either a b) where + bool (Left _) = False + bool (Right _) = True + +-- | A cursor: contains an XML 'Node' and pointers to its children, ancestors and siblings. +type Cursor = CG.Cursor Node + +-- | Cut a cursor off from its parent. The idea is to allow restricting the scope of queries on it. +cut :: Cursor -> Cursor +cut = fromNode . CG.node + +-- | Convert a 'Document' to a 'Cursor'. It will point to the document root. +fromDocument :: Document -> Cursor +fromDocument = fromNode . NodeElement . documentRoot + +-- | Convert a 'Node' to a 'Cursor' (without parents). +fromNode :: Node -> Cursor +fromNode = + CG.toCursor cs + where + cs (NodeElement (Element _ _ x)) = x + cs _ = [] + +-- | Filter cursors that don't pass a check. +check :: Boolean b => (Cursor -> b) -> Axis +check f c = [c | bool $ f c] + +-- | Filter nodes that don't pass a check. +checkNode :: Boolean b => (Node -> b) -> Axis +checkNode f = check (f . node) + +-- | Filter elements that don't pass a check, and remove all non-elements. +checkElement :: Boolean b => (Element -> b) -> Axis +checkElement f c = case node c of + NodeElement e -> [c | bool $ f e] + _ -> [] + +-- | Filter elements that don't pass a name check, and remove all non-elements. +checkName :: Boolean b => (Name -> b) -> Axis +checkName f = checkElement (f . elementName) + +-- | Remove all non-elements. Compare roughly to XPath: +-- /A node test * is true for any node of the principal node type. For example, child::* will select all element children of the context node [...]/. +anyElement :: Axis +anyElement = checkElement (const True) + +-- | Select only those elements with a matching tag name. XPath: +-- /A node test that is a QName is true if and only if the type of the node (see [5 Data Model]) is the principal node type and has an expanded-name equal to the expanded-name specified by the QName./ +element :: Name -> Axis +element n = checkName (== n) + +-- | Select only those elements with a loosely matching tag name. Namespace and case are ignored. XPath: +-- /A node test that is a QName is true if and only if the type of the node (see [5 Data Model]) is the principal node type and has an expanded-name equal to the expanded-name specified by the QName./ +laxElement :: T.Text -> Axis +laxElement n = checkName (on (==) T.toCaseFold n . nameLocalName) + +-- | Select only text nodes, and directly give the 'Content' values. XPath: +-- /The node test text() is true for any text node./ +-- +-- Note that this is not strictly an 'Axis', but will work with most combinators. +content :: Cursor -> [T.Text] +content c = case node c of + (NodeContent v) -> [v] + _ -> [] + +-- | Select attributes on the current element (or nothing if it is not an element). XPath: +-- /the attribute axis contains the attributes of the context node; the axis will be empty unless the context node is an element/ +-- +-- Note that this is not strictly an 'Axis', but will work with most combinators. +-- +-- The return list of the generalised axis contains as elements lists of 'Content' +-- elements, each full list representing an attribute value. +attribute :: Name -> Cursor -> [T.Text] +attribute n c = + case node c of + NodeElement e -> maybeToList $ Map.lookup n $ elementAttributes e + _ -> [] + +-- | Select attributes on the current element (or nothing if it is not an element). Namespace and case are ignored. XPath: +-- /the attribute axis contains the attributes of the context node; the axis will be empty unless the context node is an element/ +-- +-- Note that this is not strictly an 'Axis', but will work with most combinators. +-- +-- The return list of the generalised axis contains as elements lists of 'Content' +-- elements, each full list representing an attribute value. +laxAttribute :: T.Text -> Cursor -> [T.Text] +laxAttribute n c = + case node c of + NodeElement e -> do + (n', v) <- Map.toList $ elementAttributes e + guard $ (on (==) T.toCaseFold) n (nameLocalName n') + return v + _ -> [] + +-- | Select only those element nodes with the given attribute. +hasAttribute :: Name -> Axis +hasAttribute n c = + case node c of + NodeElement (Element _ as _) -> maybe [] (const [c]) $ Map.lookup n as + _ -> [] + +-- | Select only those element nodes containing the given attribute key/value pair. +attributeIs :: Name -> T.Text -> Axis +attributeIs n v c = + case node c of + NodeElement (Element _ as _) -> [ c | Just v == Map.lookup n as] + _ -> [] + +force :: (Exception e, MonadThrow f) => e -> [a] -> f a +force e [] = monadThrow e +force _ (x:_) = return x + +forceM :: (Exception e, MonadThrow f) => e -> [f a] -> f a +forceM e [] = monadThrow e +forceM _ (x:_) = x diff --git a/Text/XML/Cursor/Generic.hs b/Text/XML/Cursor/Generic.hs new file mode 100644 index 0000000..af442db --- /dev/null +++ b/Text/XML/Cursor/Generic.hs @@ -0,0 +1,164 @@ +-- | Generalized cursors to be applied to different nodes. +module Text.XML.Cursor.Generic + ( -- * Core + Cursor + , Axis + , toCursor + , node + -- * Axes + , child + , parent + , precedingSibling + , followingSibling + , ancestor + , descendant + , orSelf + , preceding + , following + -- * Operators + , (&|) + , (&/) + , (&//) + , (&.//) + , ($|) + , ($/) + , ($//) + , ($.//) + , (>=>) + ) where + +import Data.Maybe (maybeToList) +import Data.List (foldl') +import Control.Monad ((>=>)) + +type DiffCursor node = [Cursor node] -> [Cursor node] +type Axis node = Cursor node -> [Cursor node] + +-- | A cursor: contains an XML 'Node' and pointers to its children, ancestors and siblings. +data Cursor node = Cursor + { parent' :: Maybe (Cursor node) + , precedingSibling' :: DiffCursor node + , followingSibling' :: DiffCursor node + -- | The child axis. XPath: + -- /the child axis contains the children of the context node/. + , child :: [Cursor node] + -- | The current node. + , node :: node + } + +instance Show node => Show (Cursor node) where + show Cursor { node = n } = "Cursor @ " ++ show n + +toCursor :: (node -> [node]) -- ^ get children + -> node + -> Cursor node +toCursor cs = toCursor' cs Nothing id id + +toCursor' :: (node -> [node]) + -> Maybe (Cursor node) -> DiffCursor node -> DiffCursor node -> node -> Cursor node +toCursor' cs par pre fol n = + me + where + me = Cursor par pre fol chi n + chi' = cs n + chi = go id chi' [] + go _ [] = id + go pre' (n':ns') = + (:) me' . fol' + where + me' = toCursor' cs (Just me) pre' fol' n' + fol' = go (pre' . (:) me') ns' + +-- | The parent axis. As described in XPath: +-- /the parent axis contains the parent of the context node, if there is one/. +-- +-- Every node but the root element of the document has a parent. Parent nodes +-- will always be 'NodeElement's. +parent :: Axis node +parent = maybeToList . parent' + +-- | The preceding-sibling axis. XPath: +-- /the preceding-sibling axis contains all the preceding siblings of the context node [...]/. +precedingSibling :: Axis node +precedingSibling = ($ []) . precedingSibling' + +-- | The following-sibling axis. XPath: +-- /the following-sibling axis contains all the following siblings of the context node [...]/. +followingSibling :: Axis node +followingSibling = ($ []) . followingSibling' + +-- | The preceding axis. XPath: +-- /the preceding axis contains all nodes in the same document as the context node that are before the context node in document order, excluding any ancestors and excluding attribute nodes and namespace nodes/. +preceding :: Axis node +preceding c = + go (precedingSibling' c []) (parent c >>= preceding) + where + go x y = foldl' (flip go') y x + go' x rest = foldl' (flip go') (x : rest) (child x) + +-- | The following axis. XPath: +-- /the following axis contains all nodes in the same document as the context node that are after the context node in document order, excluding any descendants and excluding attribute nodes and namespace nodes/. +following :: Axis node +following c = + go (followingSibling' c) (parent c >>= following) + where + go x z = foldr go' z (x []) + go' x rest = x : foldr go' rest (child x) + +-- | The ancestor axis. XPath: +-- /the ancestor axis contains the ancestors of the context node; the ancestors of the context node consist of the parent of context node and the parent's parent and so on; thus, the ancestor axis will always include the root node, unless the context node is the root node/. +ancestor :: Axis node +ancestor = parent >=> (\p -> p : ancestor p) + +-- | The descendant axis. XPath: +-- /the descendant axis contains the descendants of the context node; a descendant is a child or a child of a child and so on; thus the descendant axis never contains attribute or namespace nodes/. +descendant :: Axis node +descendant = child >=> (\c -> c : descendant c) + +-- | Modify an axis by adding the context node itself as the first element of the result list. +orSelf :: Axis node -> Axis node +orSelf ax c = c : ax c + +infixr 1 &| +infixr 1 &/ +infixr 1 &// +infixr 1 &.// +infixr 1 $| +infixr 1 $/ +infixr 1 $// +infixr 1 $.// + +-- | Apply a function to the result of an axis. +(&|) :: (Cursor node -> [a]) -> (a -> b) -> (Cursor node -> [b]) +f &| g = map g . f + +-- | Combine two axes so that the second works on the children of the results +-- of the first. +(&/) :: Axis node -> (Cursor node -> [a]) -> (Cursor node -> [a]) +f &/ g = f >=> child >=> g + +-- | Combine two axes so that the second works on the descendants of the results +-- of the first. +(&//) :: Axis node -> (Cursor node -> [a]) -> (Cursor node -> [a]) +f &// g = f >=> descendant >=> g + +-- | Combine two axes so that the second works on both the result nodes, and their +-- descendants. +(&.//) :: Axis node -> (Cursor node -> [a]) -> (Cursor node -> [a]) +f &.// g = f >=> orSelf descendant >=> g + +-- | Apply an axis to a 'Cursor node'. +($|) :: Cursor node -> (Cursor node -> a) -> a +v $| f = f v + +-- | Apply an axis to the children of a 'Cursor node'. +($/) :: Cursor node -> (Cursor node -> [a]) -> [a] +v $/ f = child v >>= f + +-- | Apply an axis to the descendants of a 'Cursor node'. +($//) :: Cursor node -> (Cursor node -> [a]) -> [a] +v $// f = descendant v >>= f + +-- | Apply an axis to a 'Cursor node' as well as its descendants. +($.//) :: Cursor node -> (Cursor node -> [a]) -> [a] +v $.// f = orSelf descendant v >>= f diff --git a/Text/XML/Stream/Parse.hs b/Text/XML/Stream/Parse.hs new file mode 100644 index 0000000..ba9b9ec --- /dev/null +++ b/Text/XML/Stream/Parse.hs @@ -0,0 +1,1409 @@ +{-# LANGUAGE BangPatterns #-} +{-# LANGUAGE CPP #-} +{-# LANGUAGE DeriveDataTypeable #-} +{-# LANGUAGE DeriveFunctor #-} +{-# LANGUAGE FlexibleContexts #-} +{-# LANGUAGE FlexibleInstances #-} +{-# LANGUAGE GeneralizedNewtypeDeriving #-} +{-# LANGUAGE OverloadedStrings #-} +{-# LANGUAGE PatternGuards #-} +{-# LANGUAGE RankNTypes #-} +{-# LANGUAGE StandaloneDeriving #-} +{-# LANGUAGE TupleSections #-} +{-# LANGUAGE TypeFamilies #-} +-- | This module provides both a native Haskell solution for parsing XML +-- documents into a stream of events, and a set of parser combinators for +-- dealing with a stream of events. +-- +-- As a simple example, if you have the following XML file: +-- +-- > +-- > +-- > Michael +-- > Eliezer +-- > +-- +-- Then this code: +-- +-- > {-# LANGUAGE OverloadedStrings #-} +-- > import Control.Monad.Trans.Resource +-- > import Data.Conduit (Consumer, Sink, ($$)) +-- > import Data.Text (Text, unpack) +-- > import Text.XML.Stream.Parse +-- > import Data.XML.Types (Event) +-- > +-- > data Person = Person Int Text +-- > deriving Show +-- > +-- > parsePerson :: MonadThrow m => Consumer Event m (Maybe Person) +-- > parsePerson = tag' "person" (requireAttr "age") $ \age -> do +-- > name <- content +-- > return $ Person (read $ unpack age) name +-- > +-- > parsePeople :: MonadThrow m => Sink Event m (Maybe [Person]) +-- > parsePeople = tagNoAttr "people" $ many parsePerson +-- > +-- > main = do +-- > people <- runResourceT $ +-- > parseFile def "people.xml" $$ force "people required" parsePeople +-- > print people +-- +-- will produce: +-- +-- > [Person 25 "Michael",Person 2 "Eliezer"] +-- +-- This module also supports streaming results using 'yield'. +-- This allows parser results to be processed using conduits +-- while a particular parser (e.g. 'many') is still running. +-- Without using streaming results, you have to wait until the parser finished +-- before you can process the result list. Large XML files might be easier +-- to process by using streaming results. +-- See http://stackoverflow.com/q/21367423/2597135 for a related discussion. +-- +-- > {-# LANGUAGE OverloadedStrings #-} +-- > import Control.Monad (void) +-- > import Control.Monad.Trans.Class (lift) +-- > import Control.Monad.Trans.Resource +-- > import Data.Conduit +-- > import qualified Data.Conduit.List as CL +-- > import Data.Text (Text, unpack) +-- > import Data.XML.Types (Event) +-- > import Text.XML.Stream.Parse +-- > +-- > data Person = Person Int Text deriving Show +-- > +-- > parsePerson :: MonadThrow m => Consumer Event m (Maybe Person) +-- > parsePerson = tag' "person" (requireAttr "age") $ \age -> do +-- > name <- content +-- > return $ Person (read $ unpack age) name +-- > +-- > parsePeople :: MonadThrow m => Conduit Event m Person +-- > parsePeople = void $ tagNoAttr "people" $ manyYield parsePerson +-- > +-- > main = runResourceT $ +-- > parseFile def "people.xml" $$ parsePeople =$ CL.mapM_ (lift . print) +-- +-- Previous versions of this module contained a number of more sophisticated +-- functions written by Aristid Breitkreuz and Dmitry Olshansky. To keep this +-- package simpler, those functions are being moved to a separate package. This +-- note will be updated with the name of the package(s) when available. +module Text.XML.Stream.Parse + ( -- * Parsing XML files + parseBytes + , parseBytesPos + , parseText' + , parseText + , parseTextPos + , detectUtf + , parseFile + , parseLBS + -- ** Parser settings + , ParseSettings + , def + , DecodeEntities + , psDecodeEntities + , psRetainNamespaces + -- *** Entity decoding + , decodeXmlEntities + , decodeHtmlEntities + -- * Event parsing + , tag + , tag' + , tagNoAttr + , tagIgnoreAttrs + , content + , contentMaybe + -- * Ignoring tags/trees + , ignoreTag + , ignoreEmptyTag + , ignoreTree + , ignoreTreeContent + , ignoreAnyTreeContent + , ignoreAllTreesContent + -- * Streaming events + , takeContent + , takeTree + , takeTreeContent + , takeAnyTreeContent + , takeAllTreesContent + -- * Tag name matching + , NameMatcher(..) + , matching + , anyOf + , anyName + -- * Attribute parsing + , AttrParser + , attr + , requireAttr + , optionalAttr + , requireAttrRaw + , optionalAttrRaw + , ignoreAttrs + -- * Combinators + , orE + , choose + , many + , many_ + , manyIgnore + , many' + , force + -- * Streaming combinators + , manyYield + , manyYield' + , manyIgnoreYield + -- * Exceptions + , XmlException (..) + -- * Other types + , PositionRange + , EventPos + ) where +import Control.Applicative (Alternative (empty, (<|>)), + Applicative (..), (<$>)) +import qualified Control.Applicative as A +import Control.Arrow ((***)) +import Control.Exception (Exception (..), SomeException) +import Control.Monad (ap, liftM, void) +import Control.Monad.Fix (fix) +import Control.Monad.Trans.Class (lift) +import Control.Monad.Trans.Maybe (MaybeT (..)) +import Control.Monad.Trans.Resource (MonadResource, MonadThrow (..), + monadThrow) +import Data.Attoparsec.Text (Parser, anyChar, char, manyTill, + skipWhile, string, takeWhile, + takeWhile1, try) +import qualified Data.Attoparsec.Text as AT +import qualified Data.ByteString as S +import qualified Data.ByteString.Lazy as L +import Data.Char (isSpace) +import Data.Conduit +import Data.Conduit.Attoparsec (PositionRange, conduitParser) +import Data.Conduit.Binary (sourceFile) +import qualified Data.Conduit.List as CL +import qualified Data.Conduit.Text as CT +import Data.Default.Class (Default (..)) +import Data.List (intercalate) +import Data.List (foldl') +import qualified Data.Map as Map +import Data.Maybe (fromMaybe, isNothing) +import Data.String (IsString (..)) +import Data.Text (Text, pack) +import qualified Data.Text as T +import Data.Text.Encoding (decodeUtf8With) +import Data.Text.Encoding.Error (lenientDecode) +import Data.Typeable (Typeable) +import Data.XML.Types (Content (..), Event (..), + ExternalID (..), + Instruction (..), Name (..)) +import Prelude hiding (takeWhile) +import Text.XML.Stream.Token + +type Ents = [(Text, Text)] + +tokenToEvent :: ParseSettings -> Ents -> [NSLevel] -> Token -> (Ents, [NSLevel], [Event]) +tokenToEvent _ es n (TokenXMLDeclaration _) = (es, n, []) +tokenToEvent _ es n (TokenInstruction i) = (es, n, [EventInstruction i]) +tokenToEvent ps es n (TokenBeginElement name as isClosed _) = + (es, n', if isClosed then [begin, end] else [begin]) + where + l0 = case n of + [] -> NSLevel Nothing Map.empty + x:_ -> x + (as', l') = foldl' go (id, l0) as + go (front, l) (TName kpref kname, val) = + (addNS front, l'') + where + isPrefixed = kpref == Just "xmlns" + isUnprefixed = isNothing kpref && kname == "xmlns" + + addNS + | not (psRetainNamespaces ps) && (isPrefixed || isUnprefixed) = id + | otherwise = (((tname, map resolve val):) .) + where + tname + | isPrefixed = TName Nothing ("xmlns:" `T.append` kname) + | otherwise = TName kpref kname + + l'' + | isPrefixed = + l { prefixes = Map.insert kname (contentsToText val) + $ prefixes l } + | isUnprefixed = + l { defaultNS = if T.null $ contentsToText val + then Nothing + else Just $ contentsToText val } + | otherwise = l + + resolve (ContentEntity e) + | Just t <- lookup e es = ContentText t + resolve c = c + n' = if isClosed then n else l' : n + fixAttName (name', val) = (tnameToName True l' name', val) + elementName = tnameToName False l' name + begin = EventBeginElement elementName $ map fixAttName $ as' [] + end = EventEndElement elementName +tokenToEvent _ es n (TokenEndElement name) = + (es, n', [EventEndElement $ tnameToName False l name]) + where + (l, n') = + case n of + [] -> (NSLevel Nothing Map.empty, []) + x:xs -> (x, xs) +tokenToEvent _ es n (TokenContent (ContentEntity e)) + | Just t <- lookup e es = (es, n, [EventContent $ ContentText t]) +tokenToEvent _ es n (TokenContent c) = (es, n, [EventContent c]) +tokenToEvent _ es n (TokenComment c) = (es, n, [EventComment c]) +tokenToEvent _ es n (TokenDoctype t eid es') = (es ++ es', n, [EventBeginDoctype t eid, EventEndDoctype]) +tokenToEvent _ es n (TokenCDATA t) = (es, n, [EventCDATA t]) + +tnameToName :: Bool -> NSLevel -> TName -> Name +tnameToName _ _ (TName (Just "xml") name) = + Name name (Just "http://www.w3.org/XML/1998/namespace") (Just "xml") +tnameToName isAttr (NSLevel def' _) (TName Nothing name) = + Name name (if isAttr then Nothing else def') Nothing +tnameToName _ (NSLevel _ m) (TName (Just pref) name) = + case Map.lookup pref m of + Just ns -> Name name (Just ns) (Just pref) + Nothing -> Name name Nothing (Just pref) -- FIXME is this correct? + +-- | Automatically determine which UTF variant is being used. This function +-- first checks for BOMs, removing them as necessary, and then check for the +-- equivalent of Conduit S.ByteString m T.Text +detectUtf = + conduit id + where + conduit front = await >>= maybe (return ()) (push front) + + push front bss = + either conduit + (uncurry checkXMLDecl) + (getEncoding front bss) + + getEncoding front bs' + | S.length bs < 4 = + Left (bs `S.append`) + | otherwise = + Right (bsOut, mcodec) + where + bs = front bs' + bsOut = S.append (S.drop toDrop x) y + (x, y) = S.splitAt 4 bs + (toDrop, mcodec) = + case S.unpack x of + [0x00, 0x00, 0xFE, 0xFF] -> (4, Just CT.utf32_be) + [0xFF, 0xFE, 0x00, 0x00] -> (4, Just CT.utf32_le) + 0xFE : 0xFF: _ -> (2, Just CT.utf16_be) + 0xFF : 0xFE: _ -> (2, Just CT.utf16_le) + 0xEF : 0xBB: 0xBF : _ -> (3, Just CT.utf8) + [0x00, 0x00, 0x00, 0x3C] -> (0, Just CT.utf32_be) + [0x3C, 0x00, 0x00, 0x00] -> (0, Just CT.utf32_le) + [0x00, 0x3C, 0x00, 0x3F] -> (0, Just CT.utf16_be) + [0x3C, 0x00, 0x3F, 0x00] -> (0, Just CT.utf16_le) + _ -> (0, Nothing) -- Assuming UTF-8 + +checkXMLDecl :: MonadThrow m + => S.ByteString + -> Maybe CT.Codec + -> Conduit S.ByteString m T.Text +checkXMLDecl bs (Just codec) = leftover bs >> CT.decode codec +checkXMLDecl bs0 Nothing = + loop [] (AT.parse (parseToken decodeXmlEntities)) bs0 + where + loop chunks0 parser nextChunk = + case parser $ decodeUtf8With lenientDecode nextChunk of + AT.Fail{} -> fallback + AT.Partial f -> await >>= maybe fallback (loop chunks f) + AT.Done _ (TokenXMLDeclaration attrs) -> findEncoding attrs + AT.Done{} -> fallback + where + chunks = nextChunk : chunks0 + fallback = complete CT.utf8 + complete codec = mapM_ leftover chunks >> CT.decode codec + + findEncoding [] = fallback + findEncoding ((TName _ "encoding", [ContentText enc]):_) = + case T.toLower enc of + "iso-8859-1" -> complete CT.iso8859_1 + "utf-8" -> complete CT.utf8 + _ -> complete CT.utf8 + findEncoding (_:xs) = findEncoding xs + +type EventPos = (Maybe PositionRange, Event) + +-- | Parses a byte stream into 'Event's. This function is implemented fully in +-- Haskell using attoparsec-text for parsing. The produced error messages do +-- not give line/column information, so you may prefer to stick with the parser +-- provided by libxml-enumerator. However, this has the advantage of not +-- relying on any C libraries. +-- +-- This relies on 'detectUtf' to determine character encoding, and 'parseText'' +-- to do the actual parsing. +parseBytes :: MonadThrow m + => ParseSettings + -> Conduit S.ByteString m Event +parseBytes = mapOutput snd . parseBytesPos + +parseBytesPos :: MonadThrow m + => ParseSettings + -> Conduit S.ByteString m EventPos +parseBytesPos ps = detectUtf =$= parseTextPos ps + +dropBOM :: Monad m => Conduit T.Text m T.Text +dropBOM = + await >>= maybe (return ()) push + where + push t = + case T.uncons t of + Nothing -> dropBOM + Just (c, cs) -> + let output + | c == '\xfeef' = cs + | otherwise = t + in yield output >> idConduit + idConduit = await >>= maybe (return ()) (\x -> yield x >> idConduit) + +-- | Parses a character stream into 'Event's. This function is implemented +-- fully in Haskell using attoparsec-text for parsing. The produced error +-- messages do not give line/column information, so you may prefer to stick +-- with the parser provided by libxml-enumerator. However, this has the +-- advantage of not relying on any C libraries. +-- +-- Since 1.2.4 +parseText' :: MonadThrow m + => ParseSettings + -> Conduit T.Text m Event +parseText' = mapOutput snd . parseTextPos + +{-# DEPRECATED parseText "Please use 'parseText'' or 'parseTextPos'." #-} +parseText :: MonadThrow m + => ParseSettings + -> Conduit T.Text m EventPos +parseText = parseTextPos + +-- | Same as 'parseText'', but includes the position of each event. +-- +-- Since 1.2.4 +parseTextPos :: MonadThrow m + => ParseSettings + -> Conduit T.Text m EventPos +parseTextPos de = + dropBOM + =$= tokenize + =$= toEventC de + =$= addBeginEnd + where + tokenize = conduitToken de + addBeginEnd = yield (Nothing, EventBeginDocument) >> addEnd + addEnd = await >>= maybe + (yield (Nothing, EventEndDocument)) + (\e -> yield e >> addEnd) + +toEventC :: Monad m => ParseSettings -> Conduit (PositionRange, Token) m EventPos +toEventC ps = + go [] [] + where + go !es !levels = + await >>= maybe (return ()) push + where + push (position, token) = + mapM_ (yield . (,) (Just position)) events >> go es' levels' + where + (es', levels', events) = tokenToEvent ps es levels token + +data ParseSettings = ParseSettings + { psDecodeEntities :: DecodeEntities + , psRetainNamespaces :: Bool + -- ^ Whether the original xmlns attributes should be retained in the parsed + -- values. For more information on motivation, see: + -- + -- + -- + -- Default: False + -- + -- Since 1.2.1 + } + +instance Default ParseSettings where + def = ParseSettings + { psDecodeEntities = decodeXmlEntities + , psRetainNamespaces = False + } + +conduitToken :: MonadThrow m => ParseSettings -> Conduit T.Text m (PositionRange, Token) +conduitToken = conduitParser . parseToken . psDecodeEntities + +parseToken :: DecodeEntities -> Parser Token +parseToken de = (char '<' >> parseLt) <|> TokenContent <$> parseContent de False False + where + parseLt = + (char '?' >> parseInstr) <|> + (char '!' >> (parseComment <|> parseCdata <|> parseDoctype)) <|> + parseBegin <|> + (char '/' >> parseEnd) + parseInstr = do + name <- parseIdent + if name == "xml" + then do + as <- A.many $ parseAttribute de + skipSpace + char' '?' + char' '>' + newline <|> return () + return $ TokenXMLDeclaration as + else do + skipSpace + x <- T.pack <$> manyTill anyChar (try $ string "?>") + return $ TokenInstruction $ Instruction name x + parseComment = do + char' '-' + char' '-' + c <- T.pack <$> manyTill anyChar (string "-->") -- FIXME use takeWhile instead + return $ TokenComment c + parseCdata = do + _ <- string "[CDATA[" + t <- T.pack <$> manyTill anyChar (string "]]>") -- FIXME use takeWhile instead + return $ TokenCDATA t + parseDoctype = do + _ <- string "DOCTYPE" + skipSpace + name <- parseName + let i = + case name of + TName Nothing x -> x + TName (Just x) y -> T.concat [x, ":", y] + skipSpace + eid <- fmap Just parsePublicID <|> + fmap Just parseSystemID <|> + return Nothing + skipSpace + ents <- (do + char' '[' + ents <- parseEntities id + skipSpace + return ents) <|> return [] + char' '>' + newline <|> return () + return $ TokenDoctype i eid ents + parseEntities front = + (char ']' >> return (front [])) <|> + (parseEntity >>= \e -> parseEntities (front . (e:))) <|> + (char '<' >> parseEntities front) <|> + (skipWhile (\t -> t /= ']' && t /= '<') >> parseEntities front) + parseEntity = try $ do + _ <- string "' + return (i, t) + parsePublicID = do + _ <- string "PUBLIC" + x <- quotedText + y <- quotedText + return $ PublicID x y + parseSystemID = do + _ <- string "SYSTEM" + x <- quotedText + return $ SystemID x + quotedText = do + skipSpace + between '"' <|> between '\'' + between c = do + char' c + x <- takeWhile (/=c) + char' c + return x + parseEnd = do + skipSpace + n <- parseName + skipSpace + char' '>' + return $ TokenEndElement n + parseBegin = do + skipSpace + n <- parseName + as <- A.many $ parseAttribute de + skipSpace + isClose <- (char '/' >> skipSpace >> return True) <|> return False + char' '>' + return $ TokenBeginElement n as isClose 0 + +parseAttribute :: DecodeEntities -> Parser TAttribute +parseAttribute de = do + skipSpace + key <- parseName + skipSpace + char' '=' + skipSpace + val <- squoted <|> dquoted + return (key, val) + where + squoted = char '\'' *> manyTill (parseContent de False True) (char '\'') + dquoted = char '"' *> manyTill (parseContent de True False) (char '"') + +parseName :: Parser TName +parseName = + name <$> parseIdent <*> A.optional (char ':' >> parseIdent) + where + name i1 Nothing = TName Nothing i1 + name i1 (Just i2) = TName (Just i1) i2 + +parseIdent :: Parser Text +parseIdent = + takeWhile1 valid + where + valid '&' = False + valid '<' = False + valid '>' = False + valid ':' = False + valid '?' = False + valid '=' = False + valid '"' = False + valid '\'' = False + valid '/' = False + valid ';' = False + valid '#' = False + valid c = not $ isXMLSpace c + +parseContent :: DecodeEntities + -> Bool -- break on double quote + -> Bool -- break on single quote + -> Parser Content +parseContent de breakDouble breakSingle = parseReference <|> parseTextContent where + parseReference = do + char' '&' + t <- parseEntityRef <|> parseHexCharRef <|> parseDecCharRef + char' ';' + return t + parseEntityRef = do + TName ma b <- parseName + let name = maybe "" (`T.append` ":") ma `T.append` b + return $ case name of + "lt" -> ContentText "<" + "gt" -> ContentText ">" + "amp" -> ContentText "&" + "quot" -> ContentText "\"" + "apos" -> ContentText "'" + _ -> de name + parseHexCharRef = do + void $ string "#x" + n <- AT.hexadecimal + case toValidXmlChar n of + Nothing -> fail "Invalid character from hexadecimal character reference." + Just c -> return $ ContentText $ T.singleton c + parseDecCharRef = do + void $ string "#" + n <- AT.decimal + case toValidXmlChar n of + Nothing -> fail "Invalid character from decimal character reference." + Just c -> return $ ContentText $ T.singleton c + parseTextContent = ContentText <$> takeWhile1 valid + valid '"' = not breakDouble + valid '\'' = not breakSingle + valid '&' = False -- amp + valid '<' = False -- lt + valid _ = True + +-- | Is this codepoint a valid XML character? See +-- . This is proudly XML 1.0 only. +toValidXmlChar :: Int -> Maybe Char +toValidXmlChar n + | any checkRange ranges = Just (toEnum n) + | otherwise = Nothing + where + --Inclusive lower bound, inclusive upper bound. + ranges :: [(Int, Int)] + ranges = + [ (0x9, 0xA) + , (0xD, 0xD) + , (0x20, 0xD7FF) + , (0xE000, 0xFFFD) + , (0x10000, 0x10FFFF) + ] + checkRange (lb, ub) = lb <= n && n <= ub + +skipSpace :: Parser () +skipSpace = skipWhile isXMLSpace + +-- | Determines whether a character is an XML white space. The list of +-- white spaces is given by +-- +-- > S ::= (#x20 | #x9 | #xD | #xA)+ +-- +-- in . +isXMLSpace :: Char -> Bool +isXMLSpace ' ' = True +isXMLSpace '\t' = True +isXMLSpace '\r' = True +isXMLSpace '\n' = True +isXMLSpace _ = False + +newline :: Parser () +newline = void $ (char '\r' >> char '\n') <|> char '\n' + +char' :: Char -> Parser () +char' = void . char + +data ContentType = Ignore | IsContent Text | IsError String | NotContent + +-- | Grabs the next piece of content if available. This function skips over any +-- comments and instructions and concatenates all content until the next start +-- or end tag. +contentMaybe :: MonadThrow m => Consumer Event m (Maybe Text) +contentMaybe = do + x <- CL.peek + case pc' x of + Ignore -> CL.drop 1 >> contentMaybe + IsContent t -> CL.drop 1 >> fmap Just (takeContents (t:)) + IsError e -> lift $ monadThrow $ InvalidEntity e x + NotContent -> return Nothing + where + pc' Nothing = NotContent + pc' (Just x) = pc x + pc (EventContent (ContentText t)) = IsContent t + pc (EventContent (ContentEntity e)) = IsError $ "Unknown entity: " ++ show e + pc (EventCDATA t) = IsContent t + pc EventBeginElement{} = NotContent + pc EventEndElement{} = NotContent + pc EventBeginDocument{} = Ignore + pc EventEndDocument = Ignore + pc EventBeginDoctype{} = Ignore + pc EventEndDoctype = Ignore + pc EventInstruction{} = Ignore + pc EventComment{} = Ignore + takeContents front = do + x <- CL.peek + case pc' x of + Ignore -> CL.drop 1 >> takeContents front + IsContent t -> CL.drop 1 >> takeContents (front . (:) t) + IsError e -> lift $ monadThrow $ InvalidEntity e x + NotContent -> return $ T.concat $ front [] + +-- | Grabs the next piece of content. If none if available, returns 'T.empty'. +-- This is simply a wrapper around 'contentMaybe'. +content :: MonadThrow m => Consumer Event m Text +content = fromMaybe T.empty <$> contentMaybe + + +isWhitespace :: Event -> Bool +isWhitespace EventBeginDocument = True +isWhitespace EventEndDocument = True +isWhitespace EventBeginDoctype{} = True +isWhitespace EventEndDoctype = True +isWhitespace EventInstruction{} = True +isWhitespace (EventContent (ContentText t)) = T.all isSpace t +isWhitespace EventComment{} = True +isWhitespace (EventCDATA t) = T.all isSpace t +isWhitespace _ = False + + +-- | The most generic way to parse a tag. It takes a 'NameMatcher' to check whether +-- this is a correct tag name, an 'AttrParser' to handle attributes, and +-- then a parser to deal with content. +-- +-- 'Events' are consumed if and only if the tag name and its attributes match. +-- +-- This function automatically absorbs its balancing closing tag, and will +-- throw an exception if not all of the attributes or child elements are +-- consumed. If you want to allow extra attributes, see 'ignoreAttrs'. +-- +-- This function automatically ignores comments, instructions and whitespace. +tag :: MonadThrow m + => NameMatcher a -- ^ Check if this is a correct tag name + -- and return a value that can be used to get an @AttrParser@. + -- If this fails, the function will return @Nothing@ + -> (a -> AttrParser b) -- ^ Given the value returned by the name checker, this function will + -- be used to get an @AttrParser@ appropriate for the specific tag. + -- If the @AttrParser@ fails, the function will also return @Nothing@ + -> (b -> ConduitM Event o m c) -- ^ Handler function to handle the attributes and children + -- of a tag, given the value return from the @AttrParser@ + -> ConduitM Event o m (Maybe c) +tag nameMatcher attrParser f = do + (x, leftovers) <- dropWS [] + res <- case x of + Just (EventBeginElement name as) -> case runNameMatcher nameMatcher name of + Just y -> case runAttrParser' (attrParser y) as of + Left _ -> return Nothing + Right z -> do + z' <- f z + (a, _leftovers') <- dropWS [] + case a of + Just (EventEndElement name') + | name == name' -> return (Just z') + _ -> lift $ monadThrow $ InvalidEndElement name a + Nothing -> return Nothing + _ -> return Nothing + + case res of + -- Did not parse, put back all of the leading whitespace events and the + -- final observed event generated by dropWS + Nothing -> mapM_ leftover leftovers + -- Parse succeeded, discard all of those whitespace events and the + -- first parsed event + Just _ -> return () + + return res + where + -- Drop Events until we encounter a non-whitespace element. Return all of + -- the events consumed here (including the first non-whitespace event) so + -- that the calling function can treat them as leftovers if the parse fails + dropWS leftovers = do + x <- await + let leftovers' = maybe id (:) x leftovers + + case isWhitespace <$> x of + Just True -> dropWS leftovers' + _ -> return (x, leftovers') + runAttrParser' p as = + case runAttrParser p as of + Left e -> Left e + Right ([], x) -> Right x + Right (attr, _) -> Left $ toException $ UnparsedAttributes attr + +-- | A simplified version of 'tag' where the 'NameMatcher' result isn't forwarded to the attributes parser. +-- +-- Since 1.5.0 +tag' :: MonadThrow m + => NameMatcher a -> AttrParser b -> (b -> ConduitM Event o m c) + -> ConduitM Event o m (Maybe c) +tag' a b = tag a (const b) + +-- | A further simplified tag parser, which requires that no attributes exist. +tagNoAttr :: MonadThrow m + => NameMatcher a -- ^ Check if this is a correct tag name + -> ConduitM Event o m b -- ^ Handler function to handle the children of the matched tag + -> ConduitM Event o m (Maybe b) +tagNoAttr name f = tag' name (return ()) $ const f + + +-- | A further simplified tag parser, which ignores all attributes, if any exist +tagIgnoreAttrs :: MonadThrow m + => NameMatcher a -- ^ Check if this is a correct tag name + -> ConduitM Event o m b -- ^ Handler function to handle the children of the matched tag + -> ConduitM Event o m (Maybe b) +tagIgnoreAttrs name f = tag' name ignoreAttrs $ const f + + +-- | Ignore an empty tag and all of its attributes. +-- This does not ignore the tag recursively +-- (i.e. it assumes there are no child elements). +-- This function returns @Just ()@ if the tag matched. +-- +-- Since 1.5.0 +ignoreEmptyTag :: MonadThrow m + => NameMatcher a -- ^ Check if this is a correct tag name + -> ConduitM Event o m (Maybe ()) +ignoreEmptyTag nameMatcher = tagIgnoreAttrs nameMatcher (return ()) + + +{-# DEPRECATED ignoreTag "Please use 'ignoreEmptyTag'." #-} +ignoreTag :: MonadThrow m + => NameMatcher a -- ^ Check if this is a correct tag name + -> ConduitM Event o m (Maybe ()) +ignoreTag = ignoreEmptyTag + + +-- | Ignore a tag, its attributes and its children subtrees recursively. +-- Both content and text events are ignored. +-- This function returns @Just ()@ if the tag matched. +-- +-- Since 1.5.0 +ignoreTreeContent :: MonadThrow m + => NameMatcher a -- ^ Check if this is a correct tag name + -> ConduitM Event o m (Maybe ()) +ignoreTreeContent namePred = tagIgnoreAttrs namePred (void $ many ignoreAnyTreeContent) + +{-# DEPRECATED ignoreTree "Please use 'ignoreTreeContent'." #-} +ignoreTree :: MonadThrow m + => NameMatcher a -- ^ Check if this is a correct tag name + -> ConduitM Event o m (Maybe ()) +ignoreTree = ignoreTreeContent + +-- | Like 'ignoreTreeContent', but matches any name and also ignores content events. +ignoreAnyTreeContent :: MonadThrow m => ConduitM Event o m (Maybe ()) +ignoreAnyTreeContent = (void <$> contentMaybe) `orE` ignoreTreeContent anyName + +{-# DEPRECATED ignoreAllTreesContent "Please use 'ignoreAnyTreeContent'." #-} +ignoreAllTreesContent :: MonadThrow m => ConduitM Event o m (Maybe ()) +ignoreAllTreesContent = ignoreAnyTreeContent + +-- | Get the value of the first parser which returns 'Just'. If no parsers +-- succeed (i.e., return @Just@), this function returns 'Nothing'. +-- +-- > orE a b = choose [a, b] +orE :: Monad m + => Consumer Event m (Maybe a) -- ^ The first (preferred) parser + -> Consumer Event m (Maybe a) -- ^ The second parser, only executed if the first parser fails + -> Consumer Event m (Maybe a) +orE a b = a >>= \x -> maybe b (const $ return x) x + +-- | Get the value of the first parser which returns 'Just'. If no parsers +-- succeed (i.e., return 'Just'), this function returns 'Nothing'. +choose :: Monad m + => [ConduitM Event o m (Maybe a)] -- ^ List of parsers that will be tried in order. + -> ConduitM Event o m (Maybe a) -- ^ Result of the first parser to succeed, or @Nothing@ + -- if no parser succeeded +choose [] = return Nothing +choose (i:is) = i >>= maybe (choose is) (return . Just) + +-- | Force an optional parser into a required parser. All of the 'tag' +-- functions, 'attr', 'choose' and 'many' deal with 'Maybe' parsers. Use this when you +-- want to finally force something to happen. +force :: MonadThrow m + => String -- ^ Error message + -> m (Maybe a) -- ^ Optional parser to be forced + -> m a +force msg i = i >>= maybe (throwM $ XmlException msg Nothing) return + +-- | A helper function which reads a file from disk using 'enumFile', detects +-- character encoding using 'detectUtf', parses the XML using 'parseBytes', and +-- then hands off control to your supplied parser. +parseFile :: MonadResource m + => ParseSettings + -> FilePath + -> Producer m Event +parseFile ps fp = sourceFile fp =$= parseBytes ps + +-- | Parse an event stream from a lazy 'L.ByteString'. +parseLBS :: MonadThrow m + => ParseSettings + -> L.ByteString + -> Producer m Event +parseLBS ps lbs = CL.sourceList (L.toChunks lbs) =$= parseBytes ps + +data XmlException = XmlException + { xmlErrorMessage :: String + , xmlBadInput :: Maybe Event + } + | InvalidEndElement Name (Maybe Event) + | InvalidEntity String (Maybe Event) + | MissingAttribute String + | UnparsedAttributes [(Name, [Content])] + deriving (Show, Typeable) + +instance Exception XmlException where +#if MIN_VERSION_base(4, 8, 0) + displayException (XmlException msg (Just event)) = "Error while parsing XML event " ++ show event ++ ": " ++ msg + displayException (XmlException msg _) = "Error while parsing XML: " ++ msg + displayException (InvalidEndElement name (Just event)) = "Error while parsing XML event: expected , got " ++ show event + displayException (InvalidEndElement name _) = "Error while parsing XML event: expected , got nothing" + displayException (InvalidEntity msg (Just event)) = "Error while parsing XML entity " ++ show event ++ ": " ++ msg + displayException (InvalidEntity msg _) = "Error while parsing XML entity: " ++ msg + displayException (MissingAttribute msg) = "Missing required attribute: " ++ msg + displayException (UnparsedAttributes attrs) = show (length attrs) ++ " remaining unparsed attributes: \n" ++ intercalate "\n" (show <$> attrs) +#endif + + +-- | A @NameMatcher@ describes which names a tag parser is allowed to match. +-- +-- Since 1.5.0 +newtype NameMatcher a = NameMatcher { runNameMatcher :: Name -> Maybe a } + +deriving instance Functor NameMatcher + +instance Applicative NameMatcher where + pure a = NameMatcher $ const $ pure a + NameMatcher f <*> NameMatcher a = NameMatcher $ \name -> f name <*> a name + +-- | 'NameMatcher's can be combined with @\<|\>@ +instance Alternative NameMatcher where + empty = NameMatcher $ const Nothing + NameMatcher f <|> NameMatcher g = NameMatcher (\a -> f a <|> g a) + +-- | Match a single 'Name' in a concise way. +-- Note that 'Name' is namespace sensitive: when using the 'IsString' instance, +-- use @"{http:\/\/a\/b}c"@ to match the tag @c@ in the XML namespace @http://a/b@ +instance (a ~ Name) => IsString (NameMatcher a) where + fromString s = matching (== fromString s) + +-- | @matching f@ matches @name@ iff @f name@ is true. Returns the matched 'Name'. +-- +-- Since 1.5.0 +matching :: (Name -> Bool) -> NameMatcher Name +matching f = NameMatcher $ \name -> if f name then Just name else Nothing + +-- | Matches any 'Name'. Returns the matched 'Name'. +-- +-- Since 1.5.0 +anyName :: NameMatcher Name +anyName = matching (const True) + +-- | Matches any 'Name' from the given list. Returns the matched 'Name'. +-- +-- Since 1.5.0 +anyOf :: [Name] -> NameMatcher Name +anyOf values = matching (`elem` values) + + +-- | A monad for parsing attributes. By default, it requires you to deal with +-- all attributes present on an element, and will throw an exception if there +-- are unhandled attributes. Use the 'requireAttr', 'attr' et al +-- functions for handling an attribute, and 'ignoreAttrs' if you would like to +-- skip the rest of the attributes on an element. +-- +-- 'Alternative' instance behaves like 'First' monoid: it chooses first +-- parser which doesn't fail. +newtype AttrParser a = AttrParser { runAttrParser :: [(Name, [Content])] -> Either SomeException ([(Name, [Content])], a) } + +instance Monad AttrParser where + return a = AttrParser $ \as -> Right (as, a) + (AttrParser f) >>= g = AttrParser $ \as -> + either Left (\(as', f') -> runAttrParser (g f') as') (f as) +instance Functor AttrParser where + fmap = liftM +instance Applicative AttrParser where + pure = return + (<*>) = ap +instance Alternative AttrParser where + empty = AttrParser $ const $ Left $ toException $ XmlException "AttrParser.empty" Nothing + AttrParser f <|> AttrParser g = AttrParser $ \x -> + either (const $ g x) Right (f x) +instance MonadThrow AttrParser where + throwM = AttrParser . const . throwM + +optionalAttrRaw :: ((Name, [Content]) -> Maybe b) -> AttrParser (Maybe b) +optionalAttrRaw f = + AttrParser $ go id + where + go front [] = Right (front [], Nothing) + go front (a:as) = + maybe (go (front . (:) a) as) + (\b -> Right (front as, Just b)) + (f a) + +requireAttrRaw :: String -> ((Name, [Content]) -> Maybe b) -> AttrParser b +requireAttrRaw msg f = optionalAttrRaw f >>= + maybe (AttrParser $ const $ Left $ toException $ MissingAttribute msg) + return + +-- | Return the value for an attribute if present. +attr :: Name -> AttrParser (Maybe Text) +attr n = optionalAttrRaw + (\(x, y) -> if x == n then Just (contentsToText y) else Nothing) + +-- | Shortcut composition of 'force' and 'attr'. +requireAttr :: Name -> AttrParser Text +requireAttr n = force ("Missing attribute: " ++ show n) $ attr n + + +{-# DEPRECATED optionalAttr "Please use 'attr'." #-} +optionalAttr :: Name -> AttrParser (Maybe Text) +optionalAttr = attr + +contentsToText :: [Content] -> Text +contentsToText = T.concat . map toText where + toText (ContentText t) = t + toText (ContentEntity e) = T.concat ["&", e, ";"] + +-- | Skip the remaining attributes on an element. Since this will clear the +-- list of attributes, you must call this /after/ any calls to 'requireAttr', +-- 'optionalAttr', etc. +ignoreAttrs :: AttrParser () +ignoreAttrs = AttrParser $ const $ Right ([], ()) + +-- | Keep parsing elements as long as the parser returns 'Just'. +many :: Monad m + => ConduitM Event o m (Maybe a) + -> ConduitM Event o m [a] +many i = manyIgnore i $ return Nothing + +-- | Like 'many' but discards the results without building an intermediate list. +-- +-- Since 1.5.0 +many_ :: MonadThrow m + => ConduitM Event o m (Maybe a) + -> ConduitM Event o m () +many_ consumer = manyIgnoreYield (return Nothing) (void <$> consumer) + +-- | Keep parsing elements as long as the parser returns 'Just' +-- or the ignore parser returns 'Just'. +manyIgnore :: Monad m + => ConduitM Event o m (Maybe a) + -> ConduitM Event o m (Maybe b) + -> ConduitM Event o m [a] +manyIgnore i ignored = go id where + go front = i >>= maybe (onFail front) (\y -> go $ front . (:) y) + -- onFail is called if the main parser fails + onFail front = ignored >>= maybe (return $ front []) (const $ go front) + +-- | Like @many@, but any tags and content the consumer doesn't match on +-- are silently ignored. +many' :: MonadThrow m + => ConduitM Event o m (Maybe a) + -> ConduitM Event o m [a] +many' consumer = manyIgnore consumer ignoreAllTreesContent + + +-- | Like 'many', but uses 'yield' so the result list can be streamed +-- to downstream conduits without waiting for 'manyYield' to finish +manyYield :: Monad m + => ConduitM a b m (Maybe b) + -> Conduit a m b +manyYield consumer = fix $ \loop -> + consumer >>= maybe (return ()) (\x -> yield x >> loop) + +-- | Like 'manyIgnore', but uses 'yield' so the result list can be streamed +-- to downstream conduits without waiting for 'manyIgnoreYield' to finish +manyIgnoreYield :: MonadThrow m + => ConduitM Event b m (Maybe b) -- ^ Consuming parser that generates the result stream + -> ConduitM Event b m (Maybe ()) -- ^ Ignore parser that consumes elements to be ignored + -> Conduit Event m b +manyIgnoreYield consumer ignoreParser = fix $ \loop -> + consumer >>= maybe (onFail loop) (\x -> yield x >> loop) + where onFail loop = ignoreParser >>= maybe (return ()) (const loop) + +-- | Like 'many'', but uses 'yield' so the result list can be streamed +-- to downstream conduits without waiting for 'manyYield'' to finish +manyYield' :: MonadThrow m + => ConduitM Event b m (Maybe b) + -> Conduit Event m b +manyYield' consumer = manyIgnoreYield consumer ignoreAllTreesContent + + +-- | Stream a content 'Event'. If next event isn't a content, nothing is consumed. +-- +-- Returns @Just ()@ if a content 'Event' was consumed, @Nothing@ otherwise. +-- +-- Since 1.5.0 +takeContent :: MonadThrow m => ConduitM Event Event m (Maybe ()) +takeContent = do + event <- await + case event of + Just e@(EventContent ContentText{}) -> yield e >> return (Just ()) + Just e@EventCDATA{} -> yield e >> return (Just ()) + Just e -> if isWhitespace e then yield e >> takeContent else leftover e >> return Nothing + _ -> return Nothing + +-- | Stream 'Event's corresponding to a single element that matches given 'NameMatcher' and 'AttrParser', from the opening- to the closing-tag. +-- +-- If next 'Event' isn't an element, nothing is consumed. +-- +-- If an opening-tag is consumed but no matching closing-tag is found, an 'XmlException' is thrown. +-- +-- This function automatically ignores comments, instructions and whitespace. +-- +-- Returns @Just ()@ if an element was consumed, 'Nothing' otherwise. +-- +-- Since 1.5.0 +takeTree :: MonadThrow m => NameMatcher a -> AttrParser b -> ConduitM Event Event m (Maybe ()) +takeTree nameMatcher attrParser = do + event <- await + case event of + Just e@(EventBeginElement name as) -> case runNameMatcher nameMatcher name of + Just _ -> case runAttrParser attrParser as of + Right _ -> do + yield e + whileJust takeAnyTreeContent + endEvent <- await + case endEvent of + Just e'@(EventEndElement name') | name == name' -> yield e' >> return (Just ()) + _ -> lift $ monadThrow $ InvalidEndElement name endEvent + _ -> leftover e >> return Nothing + _ -> leftover e >> return Nothing + + Just e -> if isWhitespace e then yield e >> takeTree nameMatcher attrParser else leftover e >> return Nothing + _ -> return Nothing + where + whileJust f = fix $ \loop -> f >>= maybe (return ()) (const loop) + +-- | Like 'takeTree', but can also stream a content 'Event'. +-- +-- Since 1.5.0 +takeTreeContent :: MonadThrow m + => NameMatcher a + -> AttrParser b + -> ConduitM Event Event m (Maybe ()) +takeTreeContent nameMatcher attrParser = runMaybeT $ MaybeT (takeTree nameMatcher attrParser) <|> MaybeT takeContent + +-- | Like 'takeTreeContent', without checking for tag name or attributes. +-- +-- >>> runResourceT $ parseLBS def "text" $$ takeAnyTreeContent =$= consume +-- Just [ EventContent (ContentText "text") ] +-- +-- >>> runResourceT $ parseLBS def "" $$ takeAnyTreeContent =$= consume +-- Just [ ] +-- +-- >>> runResourceT $ parseLBS def "text" $$ takeAnyTreeContent =$= consume +-- Just [ EventBeginElement "b" [], EventBeginElement "c" [], EventEndElement "c", EventEndElement "b" ] +-- +-- Since 1.5.0 +takeAnyTreeContent :: MonadThrow m + => ConduitM Event Event m (Maybe ()) +takeAnyTreeContent = takeTreeContent anyName ignoreAttrs + +{-# DEPRECATED takeAllTreesContent "Please use 'takeAnyTreeContent'." #-} +takeAllTreesContent :: MonadThrow m => ConduitM Event Event m (Maybe ()) +takeAllTreesContent = takeAnyTreeContent + +type DecodeEntities = Text -> Content + +-- | Default implementation of 'DecodeEntities', which leaves the +-- entity as-is. Numeric character references and the five standard +-- entities (lt, gt, amp, quot, pos) are handled internally by the +-- parser. +decodeXmlEntities :: DecodeEntities +decodeXmlEntities = ContentEntity + +-- | HTML4-compliant entity decoder. Handles the additional 248 +-- entities defined by HTML 4 and XHTML 1. +-- +-- Note that HTML 5 introduces a drastically larger number of entities, and +-- this code does not recognize most of them. +decodeHtmlEntities :: DecodeEntities +decodeHtmlEntities t = + maybe (ContentEntity t) ContentText $ Map.lookup t htmlEntities + +htmlEntities :: Map.Map T.Text T.Text +htmlEntities = Map.fromList + $ map (pack *** pack) -- Work around the long-compile-time bug + [ ("nbsp", "\160") + , ("iexcl", "\161") + , ("cent", "\162") + , ("pound", "\163") + , ("curren", "\164") + , ("yen", "\165") + , ("brvbar", "\166") + , ("sect", "\167") + , ("uml", "\168") + , ("copy", "\169") + , ("ordf", "\170") + , ("laquo", "\171") + , ("not", "\172") + , ("shy", "\173") + , ("reg", "\174") + , ("macr", "\175") + , ("deg", "\176") + , ("plusmn", "\177") + , ("sup2", "\178") + , ("sup3", "\179") + , ("acute", "\180") + , ("micro", "\181") + , ("para", "\182") + , ("middot", "\183") + , ("cedil", "\184") + , ("sup1", "\185") + , ("ordm", "\186") + , ("raquo", "\187") + , ("frac14", "\188") + , ("frac12", "\189") + , ("frac34", "\190") + , ("iquest", "\191") + , ("Agrave", "\192") + , ("Aacute", "\193") + , ("Acirc", "\194") + , ("Atilde", "\195") + , ("Auml", "\196") + , ("Aring", "\197") + , ("AElig", "\198") + , ("Ccedil", "\199") + , ("Egrave", "\200") + , ("Eacute", "\201") + , ("Ecirc", "\202") + , ("Euml", "\203") + , ("Igrave", "\204") + , ("Iacute", "\205") + , ("Icirc", "\206") + , ("Iuml", "\207") + , ("ETH", "\208") + , ("Ntilde", "\209") + , ("Ograve", "\210") + , ("Oacute", "\211") + , ("Ocirc", "\212") + , ("Otilde", "\213") + , ("Ouml", "\214") + , ("times", "\215") + , ("Oslash", "\216") + , ("Ugrave", "\217") + , ("Uacute", "\218") + , ("Ucirc", "\219") + , ("Uuml", "\220") + , ("Yacute", "\221") + , ("THORN", "\222") + , ("szlig", "\223") + , ("agrave", "\224") + , ("aacute", "\225") + , ("acirc", "\226") + , ("atilde", "\227") + , ("auml", "\228") + , ("aring", "\229") + , ("aelig", "\230") + , ("ccedil", "\231") + , ("egrave", "\232") + , ("eacute", "\233") + , ("ecirc", "\234") + , ("euml", "\235") + , ("igrave", "\236") + , ("iacute", "\237") + , ("icirc", "\238") + , ("iuml", "\239") + , ("eth", "\240") + , ("ntilde", "\241") + , ("ograve", "\242") + , ("oacute", "\243") + , ("ocirc", "\244") + , ("otilde", "\245") + , ("ouml", "\246") + , ("divide", "\247") + , ("oslash", "\248") + , ("ugrave", "\249") + , ("uacute", "\250") + , ("ucirc", "\251") + , ("uuml", "\252") + , ("yacute", "\253") + , ("thorn", "\254") + , ("yuml", "\255") + , ("OElig", "\338") + , ("oelig", "\339") + , ("Scaron", "\352") + , ("scaron", "\353") + , ("Yuml", "\376") + , ("fnof", "\402") + , ("circ", "\710") + , ("tilde", "\732") + , ("Alpha", "\913") + , ("Beta", "\914") + , ("Gamma", "\915") + , ("Delta", "\916") + , ("Epsilon", "\917") + , ("Zeta", "\918") + , ("Eta", "\919") + , ("Theta", "\920") + , ("Iota", "\921") + , ("Kappa", "\922") + , ("Lambda", "\923") + , ("Mu", "\924") + , ("Nu", "\925") + , ("Xi", "\926") + , ("Omicron", "\927") + , ("Pi", "\928") + , ("Rho", "\929") + , ("Sigma", "\931") + , ("Tau", "\932") + , ("Upsilon", "\933") + , ("Phi", "\934") + , ("Chi", "\935") + , ("Psi", "\936") + , ("Omega", "\937") + , ("alpha", "\945") + , ("beta", "\946") + , ("gamma", "\947") + , ("delta", "\948") + , ("epsilon", "\949") + , ("zeta", "\950") + , ("eta", "\951") + , ("theta", "\952") + , ("iota", "\953") + , ("kappa", "\954") + , ("lambda", "\955") + , ("mu", "\956") + , ("nu", "\957") + , ("xi", "\958") + , ("omicron", "\959") + , ("pi", "\960") + , ("rho", "\961") + , ("sigmaf", "\962") + , ("sigma", "\963") + , ("tau", "\964") + , ("upsilon", "\965") + , ("phi", "\966") + , ("chi", "\967") + , ("psi", "\968") + , ("omega", "\969") + , ("thetasym", "\977") + , ("upsih", "\978") + , ("piv", "\982") + , ("ensp", "\8194") + , ("emsp", "\8195") + , ("thinsp", "\8201") + , ("zwnj", "\8204") + , ("zwj", "\8205") + , ("lrm", "\8206") + , ("rlm", "\8207") + , ("ndash", "\8211") + , ("mdash", "\8212") + , ("lsquo", "\8216") + , ("rsquo", "\8217") + , ("sbquo", "\8218") + , ("ldquo", "\8220") + , ("rdquo", "\8221") + , ("bdquo", "\8222") + , ("dagger", "\8224") + , ("Dagger", "\8225") + , ("bull", "\8226") + , ("hellip", "\8230") + , ("permil", "\8240") + , ("prime", "\8242") + , ("Prime", "\8243") + , ("lsaquo", "\8249") + , ("rsaquo", "\8250") + , ("oline", "\8254") + , ("frasl", "\8260") + , ("euro", "\8364") + , ("image", "\8465") + , ("weierp", "\8472") + , ("real", "\8476") + , ("trade", "\8482") + , ("alefsym", "\8501") + , ("larr", "\8592") + , ("uarr", "\8593") + , ("rarr", "\8594") + , ("darr", "\8595") + , ("harr", "\8596") + , ("crarr", "\8629") + , ("lArr", "\8656") + , ("uArr", "\8657") + , ("rArr", "\8658") + , ("dArr", "\8659") + , ("hArr", "\8660") + , ("forall", "\8704") + , ("part", "\8706") + , ("exist", "\8707") + , ("empty", "\8709") + , ("nabla", "\8711") + , ("isin", "\8712") + , ("notin", "\8713") + , ("ni", "\8715") + , ("prod", "\8719") + , ("sum", "\8721") + , ("minus", "\8722") + , ("lowast", "\8727") + , ("radic", "\8730") + , ("prop", "\8733") + , ("infin", "\8734") + , ("ang", "\8736") + , ("and", "\8743") + , ("or", "\8744") + , ("cap", "\8745") + , ("cup", "\8746") + , ("int", "\8747") + , ("there4", "\8756") + , ("sim", "\8764") + , ("cong", "\8773") + , ("asymp", "\8776") + , ("ne", "\8800") + , ("equiv", "\8801") + , ("le", "\8804") + , ("ge", "\8805") + , ("sub", "\8834") + , ("sup", "\8835") + , ("nsub", "\8836") + , ("sube", "\8838") + , ("supe", "\8839") + , ("oplus", "\8853") + , ("otimes", "\8855") + , ("perp", "\8869") + , ("sdot", "\8901") + , ("lceil", "\8968") + , ("rceil", "\8969") + , ("lfloor", "\8970") + , ("rfloor", "\8971") + , ("lang", "\9001") + , ("rang", "\9002") + , ("loz", "\9674") + , ("spades", "\9824") + , ("clubs", "\9827") + , ("hearts", "\9829") + , ("diams", "\9830") + ] diff --git a/Text/XML/Stream/Render.hs b/Text/XML/Stream/Render.hs new file mode 100644 index 0000000..17cead5 --- /dev/null +++ b/Text/XML/Stream/Render.hs @@ -0,0 +1,417 @@ +{-# LANGUAGE FlexibleContexts #-} +{-# LANGUAGE OverloadedStrings #-} +{-# LANGUAGE RankNTypes #-} +-- | 'Enumeratee's to render XML 'Event's. Unlike libxml-enumerator and +-- expat-enumerator, this module does not provide IO and ST variants, since the +-- underlying rendering operations are pure functions. +module Text.XML.Stream.Render + ( -- * Rendering XML files + renderBuilder + , renderBuilderFlush + , renderBytes + , renderText + , prettify + -- * Renderer settings + , RenderSettings + , def + , rsPretty + , rsNamespaces + , rsAttrOrder + , rsUseCDATA + , rsXMLDeclaration + , orderAttrs + -- * Event rendering + , tag + , content + -- * Attribute rendering + , Attributes + , attr + , optionalAttr + ) where + +import Blaze.ByteString.Builder +import Control.Applicative ((<$>)) +import Control.Monad.Trans.Resource (MonadThrow) +import Data.ByteString (ByteString) +import Data.Conduit +import Data.Conduit.Blaze (builderToByteString) +import qualified Data.Conduit.List as CL +import qualified Data.Conduit.Text as CT +import Data.Default.Class (Default (def)) +import Data.List (foldl') +import Data.Map (Map) +import qualified Data.Map as Map +import Data.Maybe (fromMaybe, mapMaybe) +import Data.Monoid (Monoid, mappend, mempty) +import qualified Data.Set as Set +import Data.Text (Text) +import qualified Data.Text as T +import Data.XML.Types (Content (..), Event (..), + Name (..)) +import Text.XML.Stream.Token + +-- | Render a stream of 'Event's into a stream of 'ByteString's. This function +-- wraps around 'renderBuilder' and 'builderToByteString', so it produces +-- optimally sized 'ByteString's with minimal buffer copying. +-- +-- The output is UTF8 encoded. +--renderBytes :: Monad m => RenderSettings -> Conduit Event m ByteString +renderBytes rs = renderBuilder rs =$= builderToByteString + +-- | Render a stream of 'Event's into a stream of 'Text's. This function +-- wraps around 'renderBuilder', 'builderToByteString' and 'renderBytes', so it +-- produces optimally sized 'Text's with minimal buffer copying. +{- +renderText :: (MonadThrow m) + => RenderSettings -> Conduit Event m Text +-} +renderText rs = renderBytes rs =$= CT.decode CT.utf8 + +data RenderSettings = RenderSettings + { rsPretty :: Bool + , rsNamespaces :: [(Text, Text)] + -- ^ Defines some top level namespace definitions to be used, in the form + -- of (prefix, namespace). This has absolutely no impact on the meaning + -- of your documents, but can increase readability by moving commonly + -- used namespace declarations to the top level. + , rsAttrOrder :: Name -> Map.Map Name Text -> [(Name, Text)] + -- ^ Specify how to turn the unordered attributes used by the "Text.XML" + -- module into an ordered list. + , rsUseCDATA :: Content -> Bool + -- ^ Determines if for a given text content the renderer should use a + -- CDATA node. + -- + -- Default: @False@ + -- + -- @since 1.3.3 + , rsXMLDeclaration :: Bool + -- ^ Determines whether the XML declaration will be output. + -- + -- Default: @True@ + -- + -- @since 1.5.1 + } + +instance Default RenderSettings where + def = RenderSettings + { rsPretty = False + , rsNamespaces = [] + , rsAttrOrder = const Map.toList + , rsUseCDATA = const False + , rsXMLDeclaration = True + } + +-- | Convenience function to create an ordering function suitable for +-- use as the value of 'rsAttrOrder'. The ordering function is created +-- from an explicit ordering of the attributes, specified as a list of +-- tuples, as follows: In each tuple, the first component is the +-- 'Name' of an element, and the second component is a list of +-- attributes names. When the given element is rendered, the +-- attributes listed, when present, appear first in the given order, +-- followed by any other attributes in arbitrary order. If an element +-- does not appear, all of its attributes are rendered in arbitrary +-- order. +orderAttrs :: [(Name, [Name])] -> + Name -> Map Name Text -> [(Name, Text)] +orderAttrs orderSpec = order + where + order elt attrMap = + let initialAttrs = fromMaybe [] $ lookup elt orderSpec + mkPair attr = (,) attr <$> Map.lookup attr attrMap + otherAttrMap = + Map.filterWithKey (const . not . (`elem` initialAttrs)) attrMap + in mapMaybe mkPair initialAttrs ++ Map.toAscList otherAttrMap + +-- | Render a stream of 'Event's into a stream of 'Builder's. Builders are from +-- the blaze-builder package, and allow the create of optimally sized +-- 'ByteString's with minimal buffer copying. +renderBuilder :: Monad m => RenderSettings -> Conduit Event m Builder +renderBuilder settings = CL.map Chunk =$= renderBuilder' yield' settings + where + yield' Flush = return () + yield' (Chunk bs) = yield bs + +-- | Same as 'renderBuilder' but allows you to flush XML stream to ensure that all +-- events at needed point are rendered. +-- +-- @since 1.3.5 +renderBuilderFlush :: Monad m => RenderSettings -> Conduit (Flush Event) m (Flush Builder) +renderBuilderFlush = renderBuilder' yield + +renderBuilder' :: Monad m => (Flush Builder -> Producer m o) -> RenderSettings -> Conduit (Flush Event) m o +renderBuilder' yield' settings = + if rsPretty settings + then prettify =$= renderEvent' + else renderEvent' + where + renderEvent' = renderEvent yield' settings + +renderEvent :: Monad m => (Flush Builder -> Producer m o) -> RenderSettings -> Conduit (Flush Event) m o +renderEvent yield' RenderSettings { rsPretty = isPretty, rsNamespaces = namespaces0, rsUseCDATA = useCDATA, rsXMLDeclaration = useXMLDecl } = + loop [] + where + loop nslevels = await >>= maybe (return ()) (go nslevels) + + go nslevels Flush = yield' Flush >> loop nslevels + go nslevels (Chunk e) = + case e of + EventBeginElement n1 as -> do + mnext <- CL.peek + isClosed <- + case mnext of + Just (Chunk (EventEndElement n2)) | n1 == n2 -> do + CL.drop 1 + return True + _ -> return False + let (token, nslevels') = mkBeginToken isPretty isClosed namespaces0 nslevels n1 as + yield' $ Chunk token + loop nslevels' + _ -> do + let (token, nslevels') = eventToToken nslevels useCDATA useXMLDecl e + yield' $ Chunk token + loop nslevels' + +eventToToken :: Stack -> (Content -> Bool) -> Bool -> Event -> (Builder, [NSLevel]) +eventToToken s _ True EventBeginDocument = + (tokenToBuilder $ TokenXMLDeclaration + [ ("version", [ContentText "1.0"]) + , ("encoding", [ContentText "UTF-8"]) + ] + , s) +eventToToken s _ False EventBeginDocument = (mempty, s) +eventToToken s _ _ EventEndDocument = (mempty, s) +eventToToken s _ _ (EventInstruction i) = (tokenToBuilder $ TokenInstruction i, s) +eventToToken s _ _ (EventBeginDoctype n meid) = (tokenToBuilder $ TokenDoctype n meid [], s) +eventToToken s _ _ EventEndDoctype = (mempty, s) +eventToToken s _ _ (EventCDATA t) = (tokenToBuilder $ TokenCDATA t, s) +eventToToken s _ _ (EventEndElement name) = + (tokenToBuilder $ TokenEndElement $ nameToTName sl name, s') + where + (sl:s') = s +eventToToken s useCDATA _ (EventContent c) + | useCDATA c = + case c of + ContentText txt -> (tokenToBuilder $ TokenCDATA txt, s) + ContentEntity txt -> (tokenToBuilder $ TokenCDATA txt, s) + | otherwise = (tokenToBuilder $ TokenContent c, s) +eventToToken s _ _ (EventComment t) = (tokenToBuilder $ TokenComment t, s) +eventToToken _ _ _ EventBeginElement{} = error "eventToToken on EventBeginElement" -- mkBeginToken False s name attrs + +type Stack = [NSLevel] + +nameToTName :: NSLevel -> Name -> TName +nameToTName _ (Name name _ (Just pref)) + | pref == "xml" = TName (Just "xml") name +nameToTName _ (Name name Nothing _) = TName Nothing name -- invariant that this is true +nameToTName (NSLevel def' sl) (Name name (Just ns) _) + | def' == Just ns = TName Nothing name + | otherwise = + case Map.lookup ns sl of + Nothing -> error "nameToTName" + Just pref -> TName (Just pref) name + +mkBeginToken :: Bool -- ^ pretty print attributes? + -> Bool -- ^ self closing? + -> [(Text, Text)] -- ^ namespaces to apply to top-level + -> Stack + -> Name + -> [(Name, [Content])] + -> (Builder, Stack) +mkBeginToken isPretty isClosed namespaces0 s name attrs = + (tokenToBuilder $ TokenBeginElement tname tattrs3 isClosed indent, + if isClosed then s else sl3 : s) + where + indent = if isPretty then 2 + 4 * length s else 0 + prevsl = case s of + [] -> NSLevel Nothing Map.empty + sl':_ -> sl' + (sl1, tname, tattrs1) = newElemStack prevsl name + (sl2, tattrs2) = foldr newAttrStack (sl1, tattrs1) $ nubAttrs attrs + (sl3, tattrs3) = + case s of + [] -> (sl2 { prefixes = Map.union (prefixes sl2) $ Map.fromList namespaceSL }, namespaceAttrs ++ tattrs2) + _ -> (sl2, tattrs2) + + (namespaceSL, namespaceAttrs) = unzip $ mapMaybe unused namespaces0 + unused (k, v) = + case lookup k' tattrs2 of + Just{} -> Nothing + Nothing -> Just ((v, k), (k', v')) + where + k' = TName (Just "xmlns") k + v' = [ContentText v] + +newElemStack :: NSLevel -> Name -> (NSLevel, TName, [TAttribute]) +newElemStack nsl@(NSLevel def' _) (Name local ns _) + | def' == ns = (nsl, TName Nothing local, []) +newElemStack (NSLevel _ nsmap) (Name local Nothing _) = + (NSLevel Nothing nsmap, TName Nothing local, [(TName Nothing "xmlns", [])]) +newElemStack (NSLevel _ nsmap) (Name local (Just ns) Nothing) = + (NSLevel (Just ns) nsmap, TName Nothing local, [(TName Nothing "xmlns", [ContentText ns])]) +newElemStack (NSLevel def' nsmap) (Name local (Just ns) (Just pref)) = + case Map.lookup ns nsmap of + Just pref' + | pref == pref' -> + ( NSLevel def' nsmap + , TName (Just pref) local + , [] + ) + _ -> ( NSLevel def' nsmap' + , TName (Just pref) local + , [(TName (Just "xmlns") pref, [ContentText ns])] + ) + where + nsmap' = Map.insert ns pref nsmap + +newAttrStack :: (Name, [Content]) -> (NSLevel, [TAttribute]) -> (NSLevel, [TAttribute]) +newAttrStack (name, value) (NSLevel def' nsmap, attrs) = + (NSLevel def' nsmap', addNS $ (tname, value) : attrs) + where + (nsmap', tname, addNS) = + case name of + Name local Nothing _ -> (nsmap, TName Nothing local, id) + Name local (Just ns) mpref -> + let ppref = fromMaybe "ns" mpref + (pref, addNS') = getPrefix ppref nsmap ns + in (Map.insert ns pref nsmap, TName (Just pref) local, addNS') + +getPrefix :: Text -> Map Text Text -> Text -> (Text, [TAttribute] -> [TAttribute]) +getPrefix _ _ "http://www.w3.org/XML/1998/namespace" = ("xml", id) +getPrefix ppref nsmap ns = + case Map.lookup ns nsmap of + Just pref -> (pref, id) + Nothing -> + let pref = findUnused ppref $ Map.elems nsmap + in (pref, (:) (TName (Just "xmlns") pref, [ContentText ns])) + where + findUnused x xs + | x `elem` xs = findUnused (x `T.snoc` '_') xs + | otherwise = x + +-- | Convert a stream of 'Event's into a prettified one, adding extra +-- whitespace. Note that this can change the meaning of your XML. +prettify :: Monad m => Conduit (Flush Event) m (Flush Event) +prettify = prettify' 0 + +prettify' :: Monad m => Int -> Conduit (Flush Event) m (Flush Event) +prettify' level = + await >>= maybe (return ()) goC + where + yield' = yield . Chunk + + goC Flush = yield Flush >> prettify' level + goC (Chunk e) = go e + + go e@EventBeginDocument = do + yield' e + yield' $ EventContent $ ContentText "\n" + prettify' level + go e@EventBeginElement{} = do + yield' before + yield' e + mnext <- CL.peek + case mnext of + Just (Chunk next@EventEndElement{}) -> do + CL.drop 1 + yield' next + yield' after + prettify' level + _ -> do + yield' after + prettify' $ level + 1 + go e@EventEndElement{} = do + let level' = max 0 $ level - 1 + yield' $ before' level' + yield' e + yield' after + prettify' level' + go (EventContent c) = do + cs <- takeContents (c:) + let cs' = mapMaybe normalize cs + case cs' of + [] -> return () + _ -> do + yield' before + mapM_ (yield' . EventContent) cs' + yield' after + prettify' level + go (EventCDATA t) = go $ EventContent $ ContentText t + go e@EventInstruction{} = do + yield' before + yield' e + yield' after + prettify' level + go (EventComment t) = do + yield' before + yield' $ EventComment $ T.concat + [ " " + , T.unwords $ T.words t + , " " + ] + yield' after + prettify' level + + go e@EventEndDocument = yield' e >> prettify' level + go e@EventBeginDoctype{} = yield' e >> prettify' level + go e@EventEndDoctype{} = yield' e >> yield' after >> prettify' level + + takeContents front = do + me <- CL.peek + case me of + Just (Chunk (EventContent c)) -> do + CL.drop 1 + takeContents $ front . (c:) + Just (Chunk (EventCDATA t)) -> do + CL.drop 1 + takeContents $ front . (ContentText t:) + _ -> return $ front [] + + normalize (ContentText t) + | T.null t' = Nothing + | otherwise = Just $ ContentText t' + where + t' = T.unwords $ T.words t + normalize c = Just c + + before = EventContent $ ContentText $ T.replicate level " " + before' l = EventContent $ ContentText $ T.replicate l " " + after = EventContent $ ContentText "\n" + +nubAttrs :: [(Name, v)] -> [(Name, v)] +nubAttrs orig = + front [] + where + (front, _) = foldl' go (id, Set.empty) orig + go (dlist, used) (k, v) + | k `Set.member` used = (dlist, used) + | otherwise = (dlist . ((k, v):), Set.insert k used) + + +-- | Generate a complete XML 'Element'. +tag :: (Monad m) => Name -> Attributes -> Source m Event -- ^ 'Element''s subnodes. + -> Source m Event +tag name (Attributes a) content = do + yield $ EventBeginElement name a + content + yield $ EventEndElement name + +-- | Generate a textual 'EventContent'. +content :: (Monad m) => Text -> Source m Event +content = yield . EventContent . ContentText + +-- | A list of attributes. +data Attributes = Attributes [(Name, [Content])] + +instance Monoid Attributes where + mempty = Attributes mempty + (Attributes a) `mappend` (Attributes b) = Attributes (a `mappend` b) + +-- | Generate a single attribute. +attr :: Name -- ^ Attribute's name + -> Text -- ^ Attribute's value + -> Attributes +attr name value = Attributes [(name, [ContentText value])] + +-- | Helper function that generates a valid attribute if input isn't 'Nothing', or 'mempty' otherwise. +optionalAttr :: Name -> Maybe Text -> Attributes +optionalAttr name = maybe mempty (attr name) diff --git a/Text/XML/Stream/Token.hs b/Text/XML/Stream/Token.hs new file mode 100644 index 0000000..826ef4a --- /dev/null +++ b/Text/XML/Stream/Token.hs @@ -0,0 +1,176 @@ +{-# LANGUAGE OverloadedStrings #-} +{-# LANGUAGE FlexibleContexts #-} +module Text.XML.Stream.Token + ( tokenToBuilder + , TName (..) + , Token (..) + , TAttribute + , NSLevel (..) + ) where + +import Data.XML.Types (Instruction (..), Content (..), ExternalID (..)) +import qualified Data.Text as T +import Data.Text (Text) +import Data.String (IsString (fromString)) +import Blaze.ByteString.Builder + (Builder, fromByteString, writeByteString, copyByteString) +import Blaze.ByteString.Builder.Internal.Write (fromWriteList) +import Blaze.ByteString.Builder.Char.Utf8 (writeChar, fromText) +import Data.Monoid (mconcat, mempty, mappend) +import Data.ByteString.Char8 () +import Data.Map (Map) +import qualified Blaze.ByteString.Builder.Char8 as BC8 +import qualified Data.Set as Set +import Data.List (foldl') +import Control.Arrow (first) + +oneSpace :: Builder +oneSpace = copyByteString " " + +data Token = TokenXMLDeclaration [TAttribute] + | TokenInstruction Instruction + | TokenBeginElement TName [TAttribute] Bool Int -- ^ indent + | TokenEndElement TName + | TokenContent Content + | TokenComment Text + | TokenDoctype Text (Maybe ExternalID) [(Text, Text)] + | TokenCDATA Text + deriving Show +tokenToBuilder :: Token -> Builder +tokenToBuilder (TokenXMLDeclaration attrs) = + fromByteString "") +tokenToBuilder (TokenInstruction (Instruction target data_)) = mconcat + [ fromByteString "" + ] +tokenToBuilder (TokenBeginElement name attrs' isEmpty indent) = + copyByteString "<" + `mappend` tnameToText name + `mappend` foldAttrs + (if indent == 0 || lessThan3 attrs + then oneSpace + else BC8.fromString ('\n' : replicate indent ' ')) + attrs + (if isEmpty then fromByteString "/>" else fromByteString ">") + where + attrs = nubAttrs $ map (first splitTName) attrs' + lessThan3 [] = True + lessThan3 [_] = True + lessThan3 [_, _] = True + lessThan3 _ = False +tokenToBuilder (TokenEndElement name) = mconcat + [ fromByteString "" + ] +tokenToBuilder (TokenContent c) = contentToText c +tokenToBuilder (TokenCDATA t) = + copyByteString "" +tokenToBuilder (TokenComment t) = mconcat [fromByteString ""] +tokenToBuilder (TokenDoctype name eid _) = mconcat + [ fromByteString "" + ] + where + go Nothing = mempty + go (Just (SystemID uri)) = mconcat + [ fromByteString " SYSTEM \"" + , fromText uri + , fromByteString "\"" + ] + go (Just (PublicID pid uri)) = mconcat + [ fromByteString " PUBLIC \"" + , fromText pid + , fromByteString "\" \"" + , fromText uri + , fromByteString "\"" + ] + +data TName = TName (Maybe Text) Text + deriving (Show, Eq, Ord) + +tnameToText :: TName -> Builder +tnameToText (TName Nothing name) = fromText name +tnameToText (TName (Just prefix) name) = mconcat [fromText prefix, fromByteString ":", fromText name] + +contentToText :: Content -> Builder +contentToText (ContentText t) = + fromWriteList go $ T.unpack t + where + go '<' = writeByteString "<" + go '>' = writeByteString ">" + go '&' = writeByteString "&" + -- Not escaping quotes, since this is only called outside of attributes + go c = writeChar c +contentToText (ContentEntity e) = mconcat + [ fromByteString "&" + , fromText e + , fromByteString ";" + ] + +type TAttribute = (TName, [Content]) + +foldAttrs :: Builder -- ^ before + -> [TAttribute] + -> Builder + -> Builder +foldAttrs before attrs rest' = + foldr go rest' attrs + where + go (key, val) rest = + before + `mappend` tnameToText key + `mappend` copyByteString "=\"" + `mappend` foldr go' (fromByteString "\"" `mappend` rest) val + go' (ContentText t) rest = + fromWriteList h (T.unpack t) `mappend` rest + where + h '<' = writeByteString "<" + h '>' = writeByteString ">" + h '&' = writeByteString "&" + h '"' = writeByteString """ + -- Not escaping single quotes, since our attributes are always double + -- quoted + h c = writeChar c + go' (ContentEntity t) rest = + fromByteString "&" + `mappend` fromText t + `mappend` fromByteString ";" + `mappend` rest + +instance IsString TName where + fromString = TName Nothing . T.pack + +data NSLevel = NSLevel + { defaultNS :: Maybe Text + , prefixes :: Map Text Text + } + deriving Show + +nubAttrs :: [TAttribute] -> [TAttribute] +nubAttrs orig = + front [] + where + (front, _) = foldl' go (id, Set.empty) orig + go (dlist, used) (k, v) + | k `Set.member` used = (dlist, used) + | otherwise = (dlist . ((k, v):), Set.insert k used) + +splitTName :: TName -> TName +splitTName x@(TName Just{} _) = x +splitTName x@(TName Nothing t) + | T.null b = x + | otherwise = TName (Just a) $ T.drop 1 b + where + (a, b) = T.break (== ':') t + +escCDATA :: Text -> Builder +escCDATA s = fromText (T.replace "]]>" "]]]]>" s) diff --git a/Text/XML/Unresolved.hs b/Text/XML/Unresolved.hs new file mode 100644 index 0000000..4aee7af --- /dev/null +++ b/Text/XML/Unresolved.hs @@ -0,0 +1,297 @@ +{-# LANGUAGE DeriveDataTypeable #-} +{-# LANGUAGE FlexibleContexts #-} +{-# LANGUAGE RankNTypes #-} +-- | DOM-based XML parsing and rendering. +-- +-- In this module, attribute values and content nodes can contain either raw +-- text or entities. In most cases, these can be fully resolved at parsing. If +-- that is the case for your documents, the "Text.XML" module provides +-- simplified datatypes that only contain raw text. +module Text.XML.Unresolved + ( -- * Non-streaming functions + writeFile + , readFile + -- * Lazy bytestrings + , renderLBS + , parseLBS + , parseLBS_ + -- * Text + , parseText + , parseText_ + , sinkTextDoc + -- * Byte streams + , sinkDoc + -- * Streaming functions + , toEvents + , elementToEvents + , fromEvents + , elementFromEvents + , renderBuilder + , renderBytes + , renderText + -- * Exceptions + , InvalidEventStream (..) + -- * Settings + , P.def + -- ** Parse + , P.ParseSettings + , P.psDecodeEntities + , P.psRetainNamespaces + -- ** Render + , R.RenderSettings + , R.rsPretty + , R.rsNamespaces + ) where + +import Blaze.ByteString.Builder (Builder) +import Control.Applicative ((<$>), (<*>)) +import Control.Exception (Exception, SomeException, throw) +import Control.Monad (when) +import Control.Monad.ST (runST) +import Control.Monad.Trans.Class (lift) +import Control.Monad.Trans.Resource (MonadThrow, monadThrow, + runExceptionT, runResourceT) +import Data.ByteString (ByteString) +import qualified Data.ByteString.Lazy as L +import Data.Char (isSpace) +import Data.Conduit +import qualified Data.Conduit.Binary as CB +import Data.Conduit.Lazy (lazyConsume) +import qualified Data.Conduit.List as CL +import Data.Maybe (isJust, mapMaybe) +import Data.Monoid (mconcat) +import Data.Text (Text) +import qualified Data.Text as T +import qualified Data.Text.Lazy as TL +import Data.Typeable (Typeable) +import Data.XML.Types +import Prelude hiding (readFile, writeFile) +import System.IO.Unsafe (unsafePerformIO) +import Text.XML.Stream.Parse (ParseSettings) +import qualified Text.XML.Stream.Parse as P +import qualified Text.XML.Stream.Render as R + +readFile :: P.ParseSettings -> FilePath -> IO Document +readFile ps fp = runResourceT $ CB.sourceFile fp $$ sinkDoc ps + +sinkDoc :: MonadThrow m + => P.ParseSettings + -> Consumer ByteString m Document +sinkDoc ps = P.parseBytesPos ps =$= fromEvents + +writeFile :: R.RenderSettings -> FilePath -> Document -> IO () +writeFile rs fp doc = + runResourceT $ renderBytes rs doc $$ CB.sinkFile fp + +renderLBS :: R.RenderSettings -> Document -> L.ByteString +renderLBS rs doc = + L.fromChunks $ unsafePerformIO + -- not generally safe, but we know that runResourceT + -- will not deallocate any of the resources being used + -- by the process + $ lazyConsume + $ renderBytes rs doc + +parseLBS :: P.ParseSettings -> L.ByteString -> Either SomeException Document +parseLBS ps lbs = + runST $ runExceptionT + $ CL.sourceList (L.toChunks lbs) $$ sinkDoc ps + +parseLBS_ :: P.ParseSettings -> L.ByteString -> Document +parseLBS_ ps lbs = either throw id $ parseLBS ps lbs + +data InvalidEventStream = ContentAfterRoot P.EventPos + | MissingRootElement + | InvalidInlineDoctype P.EventPos + | MissingEndElement Name (Maybe P.EventPos) + | UnterminatedInlineDoctype + deriving Typeable +instance Exception InvalidEventStream +instance Show InvalidEventStream where + show (ContentAfterRoot (pos, e)) = mShowPos pos ++ "Found content after root element: " ++ prettyShowE e + show MissingRootElement = "Missing root element" + show (InvalidInlineDoctype (pos, e)) = mShowPos pos ++ "Invalid content inside doctype: " ++ prettyShowE e + show (MissingEndElement name Nothing) = "Documented ended while expected end element for: " ++ prettyShowName name + show (MissingEndElement name (Just (pos, e))) = mShowPos pos ++ "Expected end element for: " ++ prettyShowName name ++ ", but received: " ++ prettyShowE e + show UnterminatedInlineDoctype = "Unterminated doctype declaration" + +mShowPos :: Maybe P.PositionRange -> String +mShowPos Nothing = "" +mShowPos (Just pos) = show pos ++ ": " + +prettyShowE :: Event -> String +prettyShowE = show -- FIXME + +prettyShowName :: Name -> String +prettyShowName = show -- FIXME + +renderBuilder :: Monad m => R.RenderSettings -> Document -> Producer m Builder +renderBuilder rs doc = CL.sourceList (toEvents doc) =$= R.renderBuilder rs + +--renderBytes :: MonadUnsafeIO m => R.RenderSettings -> Document -> Producer m ByteString +renderBytes rs doc = CL.sourceList (toEvents doc) =$= R.renderBytes rs + +--renderText :: (MonadThrow m, MonadUnsafeIO m) => R.RenderSettings -> Document -> Producer m Text +renderText rs doc = CL.sourceList (toEvents doc) =$= R.renderText rs + +manyTries :: Monad m => m (Maybe a) -> m [a] +manyTries f = + go id + where + go front = do + x <- f + case x of + Nothing -> return $ front [] + Just y -> go (front . (:) y) + +dropReturn :: Monad m => a -> ConduitM i o m a +dropReturn x = CL.drop 1 >> return x + +-- | Parse a document from a stream of events. +fromEvents :: MonadThrow m => Consumer P.EventPos m Document +fromEvents = do + skip EventBeginDocument + d <- Document <$> goP <*> require elementFromEvents <*> goM + skip EventEndDocument + y <- CL.head + case y of + Nothing -> return d + Just (_, EventEndDocument) -> lift $ monadThrow MissingRootElement + Just z -> + lift $ monadThrow $ ContentAfterRoot z + where + skip e = do + x <- CL.peek + when (fmap snd x == Just e) (CL.drop 1) + require f = do + x <- f + case x of + Just y -> return y + Nothing -> do + my <- CL.head + case my of + Nothing -> error "Text.XML.Unresolved:impossible" + Just (_, EventEndDocument) -> lift $ monadThrow MissingRootElement + Just y -> lift $ monadThrow $ ContentAfterRoot y + goP = Prologue <$> goM <*> goD <*> goM + goM = manyTries goM' + goM' = do + x <- CL.peek + case x of + Just (_, EventInstruction i) -> dropReturn $ Just $ MiscInstruction i + Just (_, EventComment t) -> dropReturn $ Just $ MiscComment t + Just (_, EventContent (ContentText t)) + | T.all isSpace t -> CL.drop 1 >> goM' + _ -> return Nothing + goD = do + x <- CL.peek + case x of + Just (_, EventBeginDoctype name meid) -> do + CL.drop 1 + dropTillDoctype + return (Just $ Doctype name meid) + _ -> return Nothing + dropTillDoctype = do + x <- CL.head + case x of + -- Leaving the following line commented so that the intention of + -- this function stays clear. I figure in the future xml-types will + -- be expanded again to support some form of EventDeclaration + -- + -- Just (EventDeclaration _) -> dropTillDoctype + Just (_, EventEndDoctype) -> return () + Just epos -> lift $ monadThrow $ InvalidInlineDoctype epos + Nothing -> lift $ monadThrow UnterminatedInlineDoctype + +-- | Try to parse a document element (as defined in XML) from a stream of events. +-- +-- @since 1.3.5 +elementFromEvents :: MonadThrow m => Consumer P.EventPos m (Maybe Element) +elementFromEvents = goE + where + goE = do + x <- CL.peek + case x of + Just (_, EventBeginElement n as) -> Just <$> goE' n as + _ -> return Nothing + goE' n as = do + CL.drop 1 + ns <- manyTries goN + y <- CL.head + if fmap snd y == Just (EventEndElement n) + then return $ Element n as $ compressNodes ns + else lift $ monadThrow $ MissingEndElement n y + goN = do + x <- CL.peek + case x of + Just (_, EventBeginElement n as) -> (Just . NodeElement) <$> goE' n as + Just (_, EventInstruction i) -> dropReturn $ Just $ NodeInstruction i + Just (_, EventContent c) -> dropReturn $ Just $ NodeContent c + Just (_, EventComment t) -> dropReturn $ Just $ NodeComment t + Just (_, EventCDATA t) -> dropReturn $ Just $ NodeContent $ ContentText t + _ -> return Nothing + +-- | Render a document into events. +toEvents :: Document -> [Event] +toEvents (Document prol root epi) = + (EventBeginDocument :) + . goP prol . elementToEvents' root . goM epi $ [EventEndDocument] + where + goP (Prologue before doctype after) = + goM before . maybe id goD doctype . goM after + goM [] = id + goM [x] = (goM' x :) + goM (x:xs) = (goM' x :) . goM xs + goM' (MiscInstruction i) = EventInstruction i + goM' (MiscComment t) = EventComment t + goD (Doctype name meid) = + (:) (EventBeginDoctype name meid) + . (:) EventEndDoctype + +-- | Render a document element into events. +-- +-- @since 1.3.5 +elementToEvents :: Element -> [Event] +elementToEvents e = elementToEvents' e [] + +elementToEvents' :: Element -> [Event] -> [Event] +elementToEvents' = goE + where + goE (Element name as ns) = + (EventBeginElement name as :) + . goN ns + . (EventEndElement name :) + goN [] = id + goN [x] = goN' x + goN (x:xs) = goN' x . goN xs + goN' (NodeElement e) = goE e + goN' (NodeInstruction i) = (EventInstruction i :) + goN' (NodeContent c) = (EventContent c :) + goN' (NodeComment t) = (EventComment t :) + +compressNodes :: [Node] -> [Node] +compressNodes [] = [] +compressNodes [x] = [x] +compressNodes (x@(NodeContent (ContentText _)) : y@(NodeContent (ContentText _)) : z) = + let (textNodes, remainder) = span (isJust . unContent) (x:y:z) + texts = mapMaybe unContent textNodes + in + compressNodes $ NodeContent (ContentText $ mconcat texts) : remainder + where + unContent (NodeContent (ContentText text)) = Just text + unContent _ = Nothing +compressNodes (x:xs) = x : compressNodes xs + +parseText :: ParseSettings -> TL.Text -> Either SomeException Document +parseText ps tl = runST + $ runExceptionT + $ CL.sourceList (TL.toChunks tl) + $$ sinkTextDoc ps + +parseText_ :: ParseSettings -> TL.Text -> Document +parseText_ ps = either throw id . parseText ps + +sinkTextDoc :: MonadThrow m + => ParseSettings + -> Consumer Text m Document +sinkTextDoc ps = P.parseTextPos ps =$= fromEvents diff --git a/test/main.hs b/test/main.hs new file mode 100644 index 0000000..f3521df --- /dev/null +++ b/test/main.hs @@ -0,0 +1,955 @@ +{-# LANGUAGE DeriveDataTypeable #-} +{-# LANGUAGE OverloadedStrings #-} + +import Control.Exception (Exception, toException) +import Control.Monad.IO.Class (liftIO) +import qualified Data.ByteString.Char8 as S +import qualified Data.ByteString.Lazy.Char8 as L +import Data.Typeable (Typeable) +import Data.XML.Types +import Test.Hspec +import Test.HUnit hiding (Test) +import qualified Text.XML as Res +import qualified Text.XML.Cursor as Cu +import Text.XML.Stream.Parse (def) +import qualified Text.XML.Stream.Parse as P +import qualified Text.XML.Unresolved as D + +import Control.Applicative ((<$>)) +import Control.Monad +import Control.Monad.Trans.Class (lift) +import qualified Data.Set as Set +import Data.Text (Text) +import qualified Data.Text as T +import Text.XML.Cursor (($.//), ($/), ($//), ($|), + (&.//), (&/), (&//)) + +import Control.Monad.Trans.Resource (runResourceT) +import qualified Control.Monad.Trans.Resource as C +import Data.Conduit ((=$=)) +import qualified Data.Conduit as C +import qualified Data.Conduit.List as CL +import qualified Data.Map as Map +import Text.Blaze (toMarkup) +import Text.Blaze.Renderer.String (renderMarkup) + +main :: IO () +main = hspec $ do + describe "XML parsing and rendering" $ do + it "is idempotent to parse and render a document" documentParseRender + it "has valid parser combinators" combinators + context "has working choose function" testChoose + it "has working many function" testMany + it "has working many' function" testMany' + it "has working manyYield function" testManyYield + it "has working takeContent function" testTakeContent + it "has working takeTree function" testTakeTree + it "has working takeAnyTreeContent function" testTakeAnyTreeContent + it "has working orE" testOrE + it "is idempotent to parse and pretty render a document" documentParsePrettyRender + it "ignores the BOM" parseIgnoreBOM + it "strips duplicated attributes" stripDuplicateAttributes + it "displays comments" testRenderComments + it "conduit parser" testConduitParser + it "can omit the XML declaration" omitXMLDeclaration + context "correctly parses hexadecimal entities" hexEntityParsing + describe "XML Cursors" $ do + it "has correct parent" cursorParent + it "has correct ancestor" cursorAncestor + it "has correct orSelf" cursorOrSelf + it "has correct preceding" cursorPreceding + it "has correct following" cursorFollowing + it "has correct precedingSibling" cursorPrecedingSib + it "has correct followingSibling" cursorFollowingSib + it "has correct descendant" cursorDescendant + it "has correct check" cursorCheck + it "has correct check with lists" cursorPredicate + it "has correct checkNode" cursorCheckNode + it "has correct checkElement" cursorCheckElement + it "has correct checkName" cursorCheckName + it "has correct anyElement" cursorAnyElement + it "has correct element" cursorElement + it "has correct laxElement" cursorLaxElement + it "has correct content" cursorContent + it "has correct attribute" cursorAttribute + it "has correct laxAttribute" cursorLaxAttribute + it "has correct &* and $* operators" cursorDeep + it "has correct force" cursorForce + it "has correct forceM" cursorForceM + it "has correct hasAttribute" cursorHasAttribute + it "has correct attributeIs" cursorAttributeIs + describe "resolved" $ do + it "identifies unresolved entities" resolvedIdentifies + it "decodeHtmlEntities" testHtmlEntities + it "works for resolvable entities" resolvedAllGood + it "merges adjacent content nodes" resolvedMergeContent + it "understands inline entity declarations" resolvedInline + describe "pretty" $ do + it "works" casePretty + describe "top level namespaces" $ do + it "works" caseTopLevelNamespace + it "works with prefix" caseTopLevelNamespacePrefix + it "handles conflicts" caseTLNConflict + describe "blaze-html instances" $ do + it "works" caseBlazeHtml + describe "attribute reordering" $ do + it "works" caseAttrReorder + describe "ordering attributes explicitly" $ do + it "works" caseOrderAttrs + it "parsing CDATA" caseParseCdata + it "retains namespaces when asked" caseRetainNamespaces + it "handles iso-8859-1" caseIso8859_1 + it "renders CDATA when asked" caseRenderCDATA + it "escapes CDATA closing tag in CDATA" caseEscapesCDATA + +documentParseRender :: IO () +documentParseRender = + mapM_ go docs + where + go x = x @=? D.parseLBS_ def (D.renderLBS def x) + docs = + [ Document (Prologue [] Nothing []) + (Element "foo" [] []) + [] + , D.parseLBS_ def + "\n" + , D.parseLBS_ def + "\n&ignore;" + , D.parseLBS_ def + "]]>" + , D.parseLBS_ def + "" + , D.parseLBS_ def + "" + , D.parseLBS_ def + "" + ] + +documentParsePrettyRender :: IO () +documentParsePrettyRender = + L.unpack (D.renderLBS def { D.rsPretty = True } (D.parseLBS_ def doc)) @?= L.unpack doc + where + doc = L.unlines + [ "" + , "" + , " " + , " text" + , " " + , "" + ] + +combinators :: Assertion +combinators = runResourceT $ P.parseLBS def input C.$$ do + P.force "need hello" $ P.tag' "hello" (P.requireAttr "world") $ \world -> do + liftIO $ world @?= "true" + P.force "need child1" $ P.tagNoAttr "{mynamespace}child1" $ return () + P.force "need child2" $ P.tagNoAttr "child2" $ return () + P.force "need child3" $ P.tagNoAttr "child3" $ do + x <- P.contentMaybe + liftIO $ x @?= Just "combine &content" + where + input = L.concat + [ "" + , "\n" + , "" + , "" + , "" + , "" + , " " + , "combine <all> \n" + , "" + ] + +testChoose :: Spec +testChoose = do + it "can choose between elements" + testChooseEitherElem + it "can choose between elements and text, returning text" + testChooseElemOrTextIsText + it "can choose between elements and text, returning elements" + testChooseElemOrTextIsElem + it "can choose between text and elements, returning text" + testChooseTextOrElemIsText + it "can choose between text and elements, returning elements" + testChooseTextOrElemIsElem + it "can choose between text and elements, when the text is encoded" + testChooseElemOrTextIsEncoded + it "can choose between text and elements, when the text is encoded, NBSP" + testChooseElemOrTextIsEncodedNBSP + it "can choose between elements and text, when the text is whitespace" + testChooseElemOrTextIsWhiteSpace + it "can choose between text and elements, when the text is whitespace" + testChooseTextOrElemIsWhiteSpace + it "can choose between text and elements, when the whitespace is both literal and encoded" + testChooseElemOrTextIsChunkedText + it "can choose between text and elements, when the text is chunked the other way" + testChooseElemOrTextIsChunkedText2 + +testChooseElemOrTextIsText :: Assertion +testChooseElemOrTextIsText = runResourceT $ P.parseLBS def input C.$$ do + P.force "need hello" $ P.tagNoAttr "hello" $ do + x <- P.choose + [ P.tagNoAttr "failure" $ return "boom" + , P.contentMaybe + ] + liftIO $ x @?= Just " something " + where + input = L.concat + [ "" + , "\n" + , "" + , " something " + , "" + ] + +testChooseElemOrTextIsEncoded :: Assertion +testChooseElemOrTextIsEncoded = runResourceT $ P.parseLBS def input C.$$ do + P.force "need hello" $ P.tagNoAttr "hello" $ do + x <- P.choose + [ P.tagNoAttr "failure" $ return "boom" + , P.contentMaybe + ] + liftIO $ x @?= Just "\x20something\x20" + where + input = L.concat + [ "" + , "\n" + , "" + , " something " + , "" + ] + +testChooseElemOrTextIsEncodedNBSP :: Assertion +testChooseElemOrTextIsEncodedNBSP = runResourceT $ P.parseLBS def input C.$$ do + P.force "need hello" $ P.tagNoAttr "hello" $ do + x <- P.choose + [ P.tagNoAttr "failure" $ return "boom" + , P.contentMaybe + ] + liftIO $ x @?= Just "\160something\160" + where + input = L.concat + [ "" + , "\n" + , "" + , " something " + , "" + ] + + +testChooseElemOrTextIsWhiteSpace :: Assertion +testChooseElemOrTextIsWhiteSpace = runResourceT $ P.parseLBS def input C.$$ do + P.force "need hello" $ P.tagNoAttr "hello" $ do + x <- P.choose + [ P.tagNoAttr "failure" $ return "boom" + , P.contentMaybe + ] + liftIO $ x @?= Just "\x20\x20\x20" + where + input = L.concat + [ "" + , "\n" + , " " + ] + +testChooseTextOrElemIsWhiteSpace :: Assertion +testChooseTextOrElemIsWhiteSpace = runResourceT $ P.parseLBS def input C.$$ do + P.force "need hello" $ P.tagNoAttr "hello" $ do + x <- P.choose + [ P.contentMaybe + , P.tagNoAttr "failure" $ return "boom" + ] + liftIO $ x @?= Just "\x20\x20\x20" + where + input = L.concat + [ "" + , "\n" + , " " + ] + +testChooseElemOrTextIsChunkedText :: Assertion +testChooseElemOrTextIsChunkedText = runResourceT $ P.parseLBS def input C.$$ do + P.force "need hello" $ P.tagNoAttr "hello" $ do + x <- P.choose + [ P.tagNoAttr "failure" $ return "boom" + , P.contentMaybe + ] + liftIO $ x @?= Just "\x20\x20\x20" + where + input = L.concat + [ "" + , "\n" + , " " + ] + +testChooseElemOrTextIsChunkedText2 :: Assertion +testChooseElemOrTextIsChunkedText2 = runResourceT $ P.parseLBS def input C.$$ do + P.force "need hello" $ P.tagNoAttr "hello" $ do + x <- P.choose + [ P.tagNoAttr "failure" $ return "boom" + , P.contentMaybe + ] + liftIO $ x @?= Just "\x20\x20\x20" + where + input = L.concat + [ "" + , "\n" + , " " + ] + +testChooseElemOrTextIsElem :: Assertion +testChooseElemOrTextIsElem = runResourceT $ P.parseLBS def input C.$$ do + P.force "need hello" $ P.tagNoAttr "hello" $ do + x <- P.choose + [ P.tagNoAttr "success" $ return "success" + , P.contentMaybe + ] + liftIO $ x @?= Just "success" + where + input = L.concat + [ "" + , "\n" + , "" + , "" + , "" + ] + +testChooseTextOrElemIsText :: Assertion +testChooseTextOrElemIsText = runResourceT $ P.parseLBS def input C.$$ do + P.force "need hello" $ P.tagNoAttr "hello" $ do + x <- P.choose + [ P.contentMaybe + , P.tagNoAttr "failure" $ return "boom" + ] + liftIO $ x @?= Just " something " + where + input = L.concat + [ "" + , "\n" + , "" + , " something " + , "" + ] + +testChooseTextOrElemIsElem :: Assertion +testChooseTextOrElemIsElem = runResourceT $ P.parseLBS def input C.$$ do + P.force "need hello" $ P.tagNoAttr "hello" $ do + x <- P.choose + [ P.contentMaybe + , P.tagNoAttr "success" $ return "success" + ] + liftIO $ x @?= Just "success" + where + input = L.concat + [ "" + , "\n" + , "" + , "" + , "" + ] + +testChooseEitherElem :: Assertion +testChooseEitherElem = runResourceT $ P.parseLBS def input C.$$ do + P.force "need hello" $ P.tagNoAttr "hello" $ do + x <- P.choose + [ P.tagNoAttr "failure" $ return 1 + , P.tagNoAttr "success" $ return 2 + ] + liftIO $ x @?= Just (2 :: Int) + where + input = L.concat + [ "" + , "\n" + , "" + , "" + , "" + ] + +testManyYield :: Assertion +testManyYield = do + -- Basically the same as testMany, but consume the streamed result + result <- runResourceT $ + P.parseLBS def input C.$$ helloParser + =$= CL.consume + length result @?= 5 + where + helloParser = void $ P.tagNoAttr "hello" $ P.manyYield successParser + successParser = P.tagNoAttr "success" $ return () + input = L.concat + [ "" + , "\n" + , "" + , "" + , "" + , "" + , "" + , "" + , "" + ] + +testTakeContent :: Assertion +testTakeContent = do + result <- runResourceT $ P.parseLBS def input C.$$ rootParser + result @?= Just + [ EventContent (ContentText "Hello world !") + ] + where + rootParser = P.tagNoAttr "root" $ void (P.takeContent >> P.takeContent) =$= CL.consume + input = L.concat + [ "" + , "\n" + , "" + , "Hello world !" + , "" + ] + +testTakeTree :: Assertion +testTakeTree = do + result <- runResourceT $ P.parseLBS def input C.$$ rootParser + result @?= + [ EventBeginDocument + , EventBeginDoctype "foo" Nothing + , EventEndDoctype + , EventBeginElement "a" [] + , EventBeginElement "em" [] + , EventContent (ContentText "Hello world !") + , EventEndElement "em" + , EventEndElement "a" + ] + where + rootParser = void (P.takeTree "a" P.ignoreAttrs) =$= CL.consume + input = L.concat + [ "" + , "\n" + , "" + , "Hello world !" + , "" + , "" + , "" + ] + +testTakeAnyTreeContent :: Assertion +testTakeAnyTreeContent = do + result <- runResourceT $ P.parseLBS def input C.$$ rootParser + result @?= Just + [ EventBeginElement "b" [] + , EventContent (ContentText "Hello ") + , EventBeginElement "em" [] + , EventContent (ContentText "world") + , EventEndElement "em" + , EventContent (ContentText " !") + , EventEndElement "b" + ] + where + rootParser = P.tagNoAttr "root" $ (P.takeAnyTreeContent >> void P.ignoreAnyTreeContent) =$= CL.consume + input = L.concat + [ "" + , "\n" + , "" + , "Hello world ! Welcome !" + , "" + ] + + +testMany :: Assertion +testMany = runResourceT $ P.parseLBS def input C.$$ do + P.force "need hello" $ P.tagNoAttr "hello" $ do + x <- P.many $ P.tagNoAttr "success" $ return () + liftIO $ length x @?= 5 + where + input = L.concat + [ "" + , "\n" + , "" + , "" + , "" + , "" + , "" + , "" + , "" + ] + +testMany' :: Assertion +testMany' = runResourceT $ P.parseLBS def input C.$$ do + P.force "need hello" $ P.tagNoAttr "hello" $ do + x <- P.many' $ P.tagNoAttr "success" $ return () + liftIO $ length x @?= 5 + where + input = L.concat + [ "" + , "\n" + , "" + , "" + , "" + , "" + , "" + , "" + , "some content" + , "" + , "" + ] + +testOrE :: IO () +testOrE = runResourceT $ P.parseLBS def input C.$$ do + P.force "need hello" $ P.tagNoAttr "hello" $ do + x <- P.tagNoAttr "failure" (return 1) `P.orE` + P.tagNoAttr "success" (return 2) + y <- P.tag' "success" (P.requireAttr "failure") (const $ return 1) `P.orE` + P.tag' "success" (P.requireAttr "success") (const $ return 2) + liftIO $ x @?= Just (2 :: Int) + liftIO $ y @?= Just (2 :: Int) + where + input = L.concat + [ "" + , "\n" + , "" + , "" + , "" + , "" + ] + +testConduitParser :: Assertion +testConduitParser = runResourceT $ do + x <- P.parseLBS def input + C.$$ (P.force "need hello" $ P.tagNoAttr "hello" f) + =$= CL.consume + liftIO $ x @?= [1, 1, 1] + where + input = L.concat + [ "" + , "\n" + , "" + , "" + , "" + , "" + , "" + ] + f :: C.MonadThrow m => C.Conduit Event m Int + f = do + ma <- P.tagNoAttr "item" (return 1) + maybe (return ()) (\a -> C.yield a >> f) ma + +omitXMLDeclaration :: Assertion +omitXMLDeclaration = Res.renderLBS settings input @?= spec + where + settings = def { Res.rsXMLDeclaration = False } + input = Res.Document (Prologue [] Nothing []) + (Res.Element "foo" Map.empty [Res.NodeContent "bar"]) + [] + spec = "bar" + +hexEntityParsing :: Spec +hexEntityParsing = do + it "rejects leading 0x" $ + go "�xff;" @?= Nothing + it "rejects leading 0X" $ + go "�Xff;" @?= Nothing + it "accepts lowercase hex digits" $ + go "ÿ" @?= Just spec + it "accepts uppercase hex digits" $ + go "ÿ" @?= Just spec + --Note: this must be rejected, because, according to the XML spec, a + --legal EntityRef's entity matches Name, which can't start with a + --hash. + it "rejects trailing junk" $ + go "ÿhello;" @?= Nothing + --Some of these next tests are XML 1.0 specific (i.e., they would + --differ for XML 1.1), but approximately no-one uses XML 1.1. + it "rejects illegal character #x0" $ + go "" @?= Nothing + it "rejects illegal character #xFFFE" $ + go "" @?= Nothing + it "rejects illegal character #xFFFF" $ + go "￿" @?= Nothing + it "rejects illegal character #xD900" $ + go "" @?= Nothing + it "rejects illegal character #xC" $ + go " " @?= Nothing + it "rejects illegal character #x1F" $ + go "" @?= Nothing + it "accepts astral plane character" $ + go "􀛿" @?= Just astralSpec + where + spec = Document (Prologue [] Nothing []) + (Element "foo" [] [NodeContent (ContentText "\xff")]) + [] + + astralSpec = Document (Prologue [] Nothing []) + (Element "foo" [] [NodeContent (ContentText "\x1006ff")]) + [] + + go = either (const Nothing) Just . D.parseLBS def + +name :: [Cu.Cursor] -> [Text] +name [] = [] +name (c:cs) = ($ name cs) $ case Cu.node c of + Res.NodeElement e -> ((Res.nameLocalName $ Res.elementName e) :) + _ -> id + +cursor :: Cu.Cursor +cursor = + Cu.fromDocument $ Res.parseLBS_ def input + where + input = L.concat + [ "" + , "" + , "" + , "" + , "" + , "a" + , "" + , "" + , "" + , "b" + , "" + , "" + , "" + , "" + , "" + ] + +bar2, baz2, bar3, bin2 :: Cu.Cursor +bar2 = Cu.child cursor !! 1 +baz2 = Cu.child bar2 !! 1 + +bar3 = Cu.child cursor !! 2 +bin2 = Cu.child bar3 !! 1 + +cursorParent, cursorAncestor, cursorOrSelf, cursorPreceding, cursorFollowing, + cursorPrecedingSib, cursorFollowingSib, cursorDescendant, cursorCheck, + cursorPredicate, cursorCheckNode, cursorCheckElement, cursorCheckName, + cursorAnyElement, cursorElement, cursorLaxElement, cursorContent, + cursorAttribute, cursorLaxAttribute, cursorHasAttribute, + cursorAttributeIs, cursorDeep, cursorForce, cursorForceM, + resolvedIdentifies, resolvedAllGood, resolvedMergeContent, + testHtmlEntities + :: Assertion +cursorParent = name (Cu.parent bar2) @?= ["foo"] +cursorAncestor = name (Cu.ancestor baz2) @?= ["bar2", "foo"] +cursorOrSelf = name (Cu.orSelf Cu.ancestor baz2) @?= ["baz2", "bar2", "foo"] +cursorPreceding = do + name (Cu.preceding baz2) @?= ["baz1", "bar1"] + name (Cu.preceding bin2) @?= ["bin1", "baz3", "baz2", "baz1", "bar2", "bar1"] +cursorFollowing = do + name (Cu.following baz2) @?= ["baz3", "bar3", "bin1", "bin2", "bin3", "Bar1"] + name (Cu.following bar2) @?= ["bar3", "bin1", "bin2", "bin3", "Bar1"] +cursorPrecedingSib = name (Cu.precedingSibling baz2) @?= ["baz1"] +cursorFollowingSib = name (Cu.followingSibling baz2) @?= ["baz3"] +cursorDescendant = (name $ Cu.descendant cursor) @?= T.words "bar1 bar2 baz1 baz2 baz3 bar3 bin1 bin2 bin3 Bar1" +cursorCheck = null (cursor $.// Cu.check (const False)) @?= True +cursorPredicate = (name $ cursor $.// Cu.check Cu.descendant) @?= T.words "foo bar2 baz3 bar3" +cursorCheckNode = (name $ cursor $// Cu.checkNode f) @?= T.words "bar1 bar2 bar3" + where f (Res.NodeElement e) = "bar" `T.isPrefixOf` Res.nameLocalName (Res.elementName e) + f _ = False +cursorCheckElement = (name $ cursor $// Cu.checkElement f) @?= T.words "bar1 bar2 bar3" + where f e = "bar" `T.isPrefixOf` Res.nameLocalName (Res.elementName e) +cursorCheckName = (name $ cursor $// Cu.checkName f) @?= T.words "bar1 bar2 bar3" + where f n = "bar" `T.isPrefixOf` nameLocalName n +cursorAnyElement = (name $ cursor $// Cu.anyElement) @?= T.words "bar1 bar2 baz1 baz2 baz3 bar3 bin1 bin2 bin3 Bar1" +cursorElement = (name $ cursor $// Cu.element "bar1") @?= ["bar1"] +cursorLaxElement = (name $ cursor $// Cu.laxElement "bar1") @?= ["bar1", "Bar1"] +cursorContent = do + Cu.content cursor @?= [] + (cursor $.// Cu.content) @?= ["a", "b"] +cursorAttribute = Cu.attribute "attr" cursor @?= ["x"] +cursorLaxAttribute = (cursor $.// Cu.laxAttribute "Attr") @?= ["x", "y", "q"] + +cursorHasAttribute = (length $ cursor $.// Cu.hasAttribute "attr") @?= 2 +cursorAttributeIs = (length $ cursor $.// Cu.attributeIs "attr" "y") @?= 1 + +cursorDeep = do + (Cu.element "foo" &/ Cu.element "bar2" &// Cu.attribute "attr") cursor @?= ["y"] + (return &.// Cu.attribute "attr") cursor @?= ["x", "y"] + (cursor $.// Cu.attribute "attr") @?= ["x", "y"] + (cursor $/ Cu.element "bar2" &// Cu.attribute "attr") @?= ["y"] + (cursor $/ Cu.element "bar2" &/ Cu.element "baz2" >=> Cu.attribute "attr") @?= ["y"] + null (cursor $| Cu.element "foo") @?= False +cursorForce = do + Cu.force DummyEx [] @?= (Nothing :: Maybe Integer) + Cu.force DummyEx [1] @?= Just (1 :: Int) + Cu.force DummyEx [1,2] @?= Just (1 :: Int) +cursorForceM = do + Cu.forceM DummyEx [] @?= (Nothing :: Maybe Integer) + Cu.forceM DummyEx [Just 1, Nothing] @?= Just (1 :: Int) + Cu.forceM DummyEx [Nothing, Just (1 :: Int)] @?= Nothing + +data DummyEx = DummyEx + deriving (Show, Typeable) +instance Exception DummyEx + +showEq :: (Show a, Show b) => Either a b -> Either a b -> Assertion +showEq x y = show x @=? show y + +resolvedIdentifies = + Left (toException $ Res.UnresolvedEntityException $ Set.fromList ["foo", "bar", "baz"]) `showEq` + Res.parseLBS def + "&foo; --- &baz; &foo;" + +testHtmlEntities = + Res.parseLBS_ def + { P.psDecodeEntities = P.decodeHtmlEntities + } xml1 @=? Res.parseLBS_ def xml2 + where + xml1 = " " + xml2 = " " + +resolvedAllGood = + D.parseLBS_ def xml @=? + Res.toXMLDocument (Res.parseLBS_ def xml) + where + xml = "" + +resolvedMergeContent = + Res.documentRoot (Res.parseLBS_ def xml) @=? + Res.Element "foo" Map.empty [Res.NodeContent "bar&baz"] + where + xml = "bar&baz" + +parseIgnoreBOM :: Assertion +parseIgnoreBOM = do + either (const $ Left (1 :: Int)) Right (Res.parseText Res.def "\xfeef") @?= + either (const $ Left (2 :: Int)) Right (Res.parseText Res.def "") + +stripDuplicateAttributes :: Assertion +stripDuplicateAttributes = do + "" @=? + D.renderLBS def (Document (Prologue [] Nothing []) (Element "foo" [("bar", [ContentText "baz"]), ("bar", [ContentText "bin"])] []) []) + "" @=? + D.renderLBS def (Document (Prologue [] Nothing []) (Element "foo" + [ ("x:bar", [ContentText "baz"]) + , (Name "bar" (Just "namespace") (Just "x"), [ContentText "bin"]) + ] []) []) + +testRenderComments :: Assertion +testRenderComments =do + "" + @=? D.renderLBS def (Document (Prologue [] Nothing []) + (Element "foo" [] [NodeComment "comment"]) []) + +resolvedInline :: Assertion +resolvedInline = do + Res.Document _ root _ <- return $ Res.parseLBS_ Res.def "]>&bar;" + root @?= Res.Element "foo" Map.empty [Res.NodeContent "baz"] + Res.Document _ root2 _ <- return $ Res.parseLBS_ Res.def "]>" + root2 @?= Res.Element "foo" (Map.singleton "bar" "baz") [] + +casePretty :: Assertion +casePretty = do + let pretty = S.unlines + [ "" + , "" + , "" + , " " + , " Hello World" + , " " + , " " + , " " + , " " + , " " + , " bar content" + , " " + , "" + ] + doctype = Res.Doctype "foo" Nothing + doc = Res.Document (Res.Prologue [] (Just doctype) []) root [] + root = Res.Element "foo" (Map.fromList [("bar", "bar"), ("baz", "baz")]) + [ Res.NodeElement $ Res.Element "foo" (Map.fromList [("bar", "bar"), ("baz", "baz"), ("bin", "bin")]) + [ Res.NodeContent " Hello World\n\n" + , Res.NodeContent " " + ] + , Res.NodeElement $ Res.Element "foo" Map.empty [] + , Res.NodeInstruction $ Res.Instruction "foo" "bar" + , Res.NodeComment "foo bar\n\r\nbaz \tbin " + , Res.NodeElement $ Res.Element "bar" Map.empty [Res.NodeContent "bar content"] + ] + pretty @=? S.concat (L.toChunks $ Res.renderLBS def { D.rsPretty = True } doc) + +caseTopLevelNamespace :: Assertion +caseTopLevelNamespace = do + let lbs = S.concat + [ "" + , "" + , "" + , "" + ] + rs = def { D.rsNamespaces = [("bar", "baz")] } + doc = Res.Document (Res.Prologue [] Nothing []) + (Res.Element "foo" Map.empty + [ Res.NodeElement + $ Res.Element "subfoo" (Map.singleton "{baz}bin" "") [] + ]) + [] + lbs @=? S.concat (L.toChunks $ Res.renderLBS rs doc) + +caseTopLevelNamespacePrefix :: Assertion +caseTopLevelNamespacePrefix = do + let lbs = S.concat + [ "" + , "" + , "" + , "" + ] + rs = def { D.rsNamespaces = [("bar", "baz")] } + doc = Res.Document (Res.Prologue [] Nothing []) + (Res.Element "foo" Map.empty + [ Res.NodeElement + $ Res.Element "subfoo" (Map.fromList [(Name "bin" (Just "baz") (Just "bar"), "")]) [] + ]) + [] + lbs @=? S.concat (L.toChunks $ Res.renderLBS rs doc) + +caseTLNConflict :: Assertion +caseTLNConflict = do + let lbs = S.concat + [ "" + , "" + , "" + , "" + ] + rs = def { D.rsNamespaces = [("bar", "baz")] } + doc = Res.Document (Res.Prologue [] Nothing []) + (Res.Element "foo" (Map.fromList [(Name "x" (Just "something") (Just "bar"), "y")]) + [ Res.NodeElement + $ Res.Element "subfoo" (Map.fromList [(Name "bin" (Just "baz") (Just "bar"), "")]) [] + ]) + [] + lbs @=? S.concat (L.toChunks $ Res.renderLBS rs doc) + +caseBlazeHtml :: Assertion +caseBlazeHtml = + expected @=? str + where + str = renderMarkup $ toMarkup $ Res.Document (Res.Prologue [] Nothing []) root [] + root :: Res.Element + root = Res.Element "html" Map.empty + [ Res.NodeElement $ Res.Element "head" Map.empty + [ Res.NodeElement $ Res.Element "title" Map.empty [Res.NodeContent "Test"] + , Res.NodeElement $ Res.Element "script" Map.empty + [Res.NodeContent "if (5 < 6 || 8 > 9) alert('Hello World!');"] + , Res.NodeElement $ Res.Element "{http://www.snoyman.com/xml2html}ie-cond" (Map.singleton "cond" "lt IE 7") + [Res.NodeElement $ Res.Element "link" (Map.singleton "href" "ie6.css") []] + , Res.NodeElement $ Res.Element "style" Map.empty + [Res.NodeContent "body > h1 { color: red }"] + ] + , Res.NodeElement $ Res.Element "body" Map.empty + [ Res.NodeElement $ Res.Element "h1" Map.empty [Res.NodeContent "Hello World!"] + ] + ] + expected :: String + expected = concat + [ "\n" + , "Test" + , "" + , "" + , "" + , "

Hello World!

" + ] + +caseAttrReorder :: Assertion +caseAttrReorder = do + let lbs = S.concat + [ "" + , "" + , "" + , "" + ] + rs = def { Res.rsAttrOrder = \name m -> + case name of + "foo" -> reverse $ Map.toAscList m + _ -> Map.toAscList m + } + attrs = Map.fromList [("a", "a"), ("b", "b"), ("c", "c")] + doc = Res.Document (Res.Prologue [] Nothing []) + (Res.Element "foo" attrs + [ Res.NodeElement + $ Res.Element "bar" attrs [] + ]) + [] + lbs @=? S.concat (L.toChunks $ Res.renderLBS rs doc) + +caseOrderAttrs :: Assertion +caseOrderAttrs = do + let lbs = S.concat + [ "" + , "" + , "" + , "" + ] + rs = def { Res.rsAttrOrder = Res.orderAttrs + [("foo", ["c", "b"])] + } + attrs = Map.fromList [("a", "a"), ("b", "b"), ("c", "c")] + doc = Res.Document (Res.Prologue [] Nothing []) + (Res.Element "foo" attrs + [ Res.NodeElement + $ Res.Element "bar" attrs [] + ]) + [] + lbs @=? S.concat (L.toChunks $ Res.renderLBS rs doc) + +caseParseCdata :: Assertion +caseParseCdata = do + let lbs = "" + doc = Res.Document (Res.Prologue [] Nothing []) + (Res.Element "a" Map.empty + [ Res.NodeContent "www.google.com" + ]) + [] + Res.parseLBS_ def lbs @?= doc + +caseRetainNamespaces :: Assertion +caseRetainNamespaces = do + let lbs = "" + doc = Res.parseLBS_ def { Res.psRetainNamespaces = True } lbs + doc `shouldBe` Res.Document + (Res.Prologue [] Nothing []) + (Res.Element + "foo" + (Map.singleton "xmlns:bar" "baz") + [ Res.NodeElement $ Res.Element + "{baz}bin" + Map.empty + [] + , Res.NodeElement $ Res.Element + "{bin4}bin3" + (Map.singleton "xmlns" "bin4") + [] + ]) + [] + +caseIso8859_1 :: Assertion +caseIso8859_1 = do + let lbs = "\232" + doc = Res.parseLBS_ def lbs + doc `shouldBe` Res.Document + (Res.Prologue [] Nothing []) + (Res.Element + "foo" + Map.empty + [Res.NodeContent "\232"]) + [] + +caseRenderCDATA :: Assertion +caseRenderCDATA = do + let doc = Res.Document (Res.Prologue [] Nothing []) + (Res.Element "a" Map.empty + [ Res.NodeContent "www.google.com" + ]) + [] + withoutCDATA = Res.renderLBS def doc + withCDATA = Res.renderLBS (def { Res.rsUseCDATA = const True }) doc + withCDATA `shouldBe` "" + withoutCDATA `shouldBe` "www.google.com" + +caseEscapesCDATA :: Assertion +caseEscapesCDATA = do + let doc = Res.Document (Res.Prologue [] Nothing []) + (Res.Element "a" Map.empty + [ Res.NodeContent "]]>" + ]) + [] + result = Res.renderLBS (def { Res.rsUseCDATA = const True }) doc + result `shouldBe` "]]>" diff --git a/xml-conduit.cabal b/xml-conduit.cabal new file mode 100644 index 0000000..d3637d5 --- /dev/null +++ b/xml-conduit.cabal @@ -0,0 +1,63 @@ +name: xml-conduit +version: 1.7.0.1 +license: MIT +license-file: LICENSE +author: Michael Snoyman , Aristid Breitkreuz +maintainer: Michael Snoyman +synopsis: Pure-Haskell utilities for dealing with XML with the conduit package. +description: Hackage documentation generation is not reliable. For up to date documentation, please see: . +category: XML, Conduit +stability: Stable +cabal-version: >= 1.8 +build-type: Simple +homepage: http://github.com/snoyberg/xml +extra-source-files: test/main.hs + README.md + ChangeLog.md + +library + build-depends: base >= 4 && < 5 + , conduit >= 1.0 && < 1.3 + , conduit-extra >= 1.1 + , resourcet >= 0.3 && < 1.2 + , bytestring >= 0.9 + , text >= 0.7 + , containers >= 0.2 + , xml-types >= 0.3.4 && < 0.4 + , attoparsec >= 0.10 + , blaze-builder >= 0.2 && < 0.5 + , transformers >= 0.2 && < 0.6 + , data-default-class + , monad-control >= 0.3 && < 1.1 + , blaze-markup >= 0.5 + , blaze-html >= 0.5 + , deepseq >= 1.1.0.0 + exposed-modules: Text.XML.Stream.Parse + Text.XML.Stream.Render + Text.XML.Unresolved + Text.XML.Cursor + Text.XML.Cursor.Generic + Text.XML + other-modules: Text.XML.Stream.Token + ghc-options: -Wall + +test-suite test + type: exitcode-stdio-1.0 + main-is: main.hs + hs-source-dirs: test + build-depends: base + , containers + , text + , transformers + , bytestring + , xml-conduit + , hspec >= 1.3 + , HUnit + , xml-types >= 0.3.1 + , conduit + , blaze-markup + , resourcet + +source-repository head + type: git + location: git://github.com/snoyberg/xml.git