# TagSoup [![Hackage version](https://img.shields.io/hackage/v/tagsoup.svg?label=Hackage)](https://hackage.haskell.org/package/tagsoup) [![Stackage version](https://www.stackage.org/package/tagsoup/badge/lts?label=Stackage)](https://www.stackage.org/package/tagsoup) [![Linux Build Status](https://img.shields.io/travis/ndmitchell/tagsoup.svg?label=Linux%20build)](https://travis-ci.org/ndmitchell/tagsoup) [![Windows Build Status](https://img.shields.io/appveyor/ci/ndmitchell/tagsoup.svg?label=Windows%20build)](https://ci.appveyor.com/project/ndmitchell/tagsoup)
TagSoup is a library for parsing HTML/XML. It supports the HTML 5 specification, and can be used to parse either well-formed XML, or unstructured and malformed HTML from the web. The library also provides useful functions to extract information from an HTML document, making it ideal for screen-scraping.

The library provides a basic data type for a list of unstructured tags (illustrated below), a parser to convert HTML into this tag type, and useful functions and combinators for finding and extracting information. This document gives two particular examples of scraping information from the web, while a few more may be found in the [Sample](https://github.com/ndmitchell/tagsoup/blob/master/TagSoup/Sample.hs) file from the source repository. The examples we give are:

* Obtaining the last modified date of the Haskell wiki
* Obtaining a list of Simon Peyton Jones' latest papers
* A brief overview of some other examples

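As a quick taste of the tag type, here is what `parseTags` produces for a small made-up fragment (a GHCi sketch, assuming the `tagsoup` package is installed):

```haskell
ghci> import Text.HTML.TagSoup
ghci> parseTags "<p class=intro>Hello <b>world</b></p>"
[TagOpen "p" [("class","intro")],TagText "Hello ",TagOpen "b" [],TagText "world",TagClose "b",TagClose "p"]
```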
The initial version of this library was written in Javascript and has been used for various commercial projects involving screen scraping. The examples include general hints on screen scraping, learnt from bitter experience. It should be noted that if you depend on data which someone else may change at any time, you may be in for a shock!

This library was written without knowledge of the Java version of [TagSoup](http://home.ccil.org/~cowan/XML/tagsoup/). They have made a very different design decision: to ensure default attributes are present and to properly nest parsed tags. We do not do this - tags are merely a list devoid of nesting information.

#### Acknowledgements

Thanks to Mike Dodds for persuading me to write this up as a library. Thanks to many people for debugging and code contributions, including: Gleb Alexeev, Ketil Malde, Conrad Parker, Henning Thielemann, Dino Morelli, Emily Mitchell, Gwern Branwen.
## Potential Bugs
There are two things that may go wrong with these examples:

* _The websites being scraped may change._ There is nothing I can do about this, but if you suspect this is the case let me know, and I'll update the examples and tutorials. I have already done so several times; it's only a few minutes' work.

* _The `openURL` method may not work._ This happens quite regularly: depending on your server, proxies and the direction of the wind, it may not work. The solution is to use `wget` to download the page locally, then use `readFile` instead. Hopefully a decent Haskell HTTP library will emerge, and that can be used instead - one possible alternative is sketched just after this list.

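As one such alternative, here is a minimal sketch of `openURL` written against the `http-conduit` package's `Network.HTTP.Simple` module (an assumption of this sketch; it is not part of TagSoup or of the examples below):

```haskell
module Main where

import qualified Data.ByteString.Lazy.Char8 as L
import Network.HTTP.Simple (getResponseBody, httpLBS, parseRequest)

-- Download a page using http-conduit rather than the HTTP package.
openURL :: String -> IO String
openURL url = do
    request <- parseRequest url
    response <- httpLBS request
    return $ L.unpack $ getResponseBody response

main :: IO ()
main = do
    src <- openURL "http://wiki.haskell.org/Haskell"
    writeFile "temp.htm" src
```

The examples below continue to use the `HTTP` package version of `openURL`.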
## Last modified date of Haskell wiki

Our goal is to develop a program that displays the date on which the wiki at [`wiki.haskell.org`](http://wiki.haskell.org/Haskell) was last modified. This example covers all the basics of designing a basic web-scraping application.

### Finding the Page

We first need to find where the information is displayed and in what format. Taking a look at the [front web page](http://wiki.haskell.org/Haskell), when not logged in, we see:
```html
<ul id="f-list">
  <li id="lastmod"> This page was last modified on 9 September 2013, at 22:38.</li>
  <li>Recent content is available under a simple permissive license.</li>
  <li>Privacy policy</li>
  <li>About HaskellWiki</li>
  <li>Disclaimers</li>
</ul>
```
So, we see that the last modified date is available. This leads us to rule 1:

**Rule 1:** Scrape from what the page returns, not what a browser renders, or what view-source gives.

Some web servers will serve different content depending on the user agent, some browsers will have scripting modify their displayed HTML, and some pages will display differently depending on your cookies. Before you can figure out how to start scraping, first decide what the input to your program will be. There are two ways to get the page as it will appear to your program.

#### Using the HTTP package

We can write a simple HTTP downloader using the [HTTP package](http://hackage.haskell.org/package/HTTP):
```haskell
module Main where

import Network.HTTP

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

main :: IO ()
main = do
    src <- openURL "http://wiki.haskell.org/Haskell"
    writeFile "temp.htm" src
```
Now open `temp.htm`, find the fragment of HTML containing the last modified date, and examine it.

#### Using the `tagsoup` Program

TagSoup installs both as a library and a program. The program contains all the examples mentioned on this page, along with a few other useful functions. To download a URL to a file:

```bash
$ tagsoup grab http://wiki.haskell.org/Haskell > temp.htm
```
### Finding the Information

Now we examine both the fragment that contains our snippet of information, and the wider page. What does the fragment have that nothing else has? What algorithm would we use to obtain that particular element? How can we still return the element as the content changes? What if the design changes? But wait, before going any further:

**Rule 2:** Do not be robust to design changes, do not even consider the possibility when writing the code.

If the site owner changes their website, they will do so in unpredictable ways. They may move the page, they may put the information somewhere else, or they may remove the information entirely. If you want something robust, talk to the site owner, or buy the data from someone. If you try to anticipate design changes, you will complicate your design, and it still won't work. It is better to write an extraction method quickly, and happily rewrite it when things change.

So now, let's consider the fragment from above. It is useful to find a tag which is unique just above your snippet - something with a nice `id` or `class` attribute - something which is unlikely to occur multiple times. In the above example, an `id` with value `lastmod` seems perfect.
```haskell
module Main where

import Data.Char
import Network.HTTP
import Text.HTML.TagSoup

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

haskellLastModifiedDateTime :: IO ()
haskellLastModifiedDateTime = do
    src <- openURL "http://wiki.haskell.org/Haskell"
    let lastModifiedDateTime = fromFooter $ parseTags src
    putStrLn $ "wiki.haskell.org was last modified on " ++ lastModifiedDateTime
    where fromFooter = unwords . drop 6 . words . innerText . take 2 . dropWhile (~/= "<li id=lastmod>")

main :: IO ()
main = haskellLastModifiedDateTime
```
Now we start writing the code! The first thing to do is open the required URL, then we parse the code into a list of `Tag`s with `parseTags`. The `fromFooter` function does the interesting work, and can be read right to left:

* First we throw away everything (`dropWhile`) until we get to an `li` tag containing `id=lastmod`. The `(~==)` and `(~/=)` operators are different from standard equality and inequality since they allow additional attributes to be present. We write `"<li id=lastmod>"` as syntactic sugar for `TagOpen "li" [("id","lastmod")]`. If we just wanted any open tag with the given `id` attribute we could have written `(~== TagOpen "" [("id","lastmod")])` and this would have matched. Any empty strings in the second element of the match are considered as wildcards (see the sketch just after this list).
* Next we take two elements: the `<li>` tag and the text node immediately following.
* We call the `innerText` function to get all the text values from inside, which will just be the text node following the `lastmod`.
* We split the string into a series of words and drop the first six, i.e. the words `This`, `page`, `was`, `last`, `modified` and `on`.
* We reassemble the remaining words into the resulting string `9 September 2013, at 22:38.`

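To make the inexact matching concrete, here is a small self-contained sketch (the `footerTag` value and its extra `class` attribute are made up for illustration):

```haskell
import Text.HTML.TagSoup

-- An open tag as it might appear in the parsed page (illustrative only):
footerTag :: Tag String
footerTag = TagOpen "li" [("id","lastmod"),("class","printfooter")]

-- The textual pattern matches even though the real tag has an extra attribute:
matchesSugar :: Bool
matchesSugar = footerTag ~== "<li id=lastmod>"               -- True

-- An empty tag name is a wildcard: any open tag with id=lastmod matches:
matchesAnyName :: Bool
matchesAnyName = footerTag ~== TagOpen "" [("id","lastmod")] -- True

main :: IO ()
main = print (matchesSugar, matchesAnyName)                  -- prints (True,True)
```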
This code may seem slightly messy, and indeed it is - often that is the nature of extracting information from a tag soup.

**Rule 3:** TagSoup is for extracting information where structure has been lost; use more structured information if it is available.
## Simon's Papers

Our next very important task is to extract a list of all Simon Peyton Jones' recent research papers from his [home page](http://research.microsoft.com/en-us/people/simonpj/). The largest change from the previous example is that now we desire a list of papers, rather than just a single result.

As before, we start by writing a simple program that downloads the appropriate page, and we look for common patterns. This time we want to look for patterns which occur every time a paper is mentioned, but nowhere else. The other difference from last time is that previously we grabbed an automatically generated piece of information - this time the information is entered in a more freeform way by a human.

First we spot that the page helpfully has named anchors: there is a current work anchor, and after that is one for Haskell. We can extract all the information between them with a simple `take`/`drop` pair:
```haskell
takeWhile (~/= "<a name=haskell>") $
drop 5 $ dropWhile (~/= "<a name=current>") tags
```

This code drops until you get to the "current" section, then takes until you get to the "haskell" section, ensuring we only look at the important bit of the page. Next we want to find all hyperlinks within this section:
```haskell
map f $ sections (~== "<A>") $ ...
```

Remember that the function to select all tags with name "A" could have been written as `(~== TagOpen "A" [])`, or alternatively `isTagOpenName "A"`.

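As a small sketch of that equivalence (the `linkTags` names are made up for illustration):

```haskell
import Text.HTML.TagSoup

-- Three equivalent ways of splitting a tag list at each opening <A> tag:
linkTags1, linkTags2, linkTags3 :: [Tag String] -> [[Tag String]]
linkTags1 = sections (~== "<A>")
linkTags2 = sections (~== TagOpen "A" [])
linkTags3 = sections (isTagOpenName "A")

main :: IO ()
main = print (length (linkTags1 tags), length (linkTags2 tags), length (linkTags3 tags))
  where tags = parseTags "<A HREF=one>first</A> <A HREF=two>second</A>"  -- prints (2,2,2)
```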
Afterwards we map each item with an `f` function. This function needs to take the tags starting just after the link, and find the text inside the link.

```haskell
f = dequote . unwords . words . fromTagText . head . filter isTagText
```
Here the complexity of interfacing to human-written markup comes through. Some of the links are in italic, some are not - the `filter` drops all the tags that are not text, so that `head` finds the first pure text node. The `unwords . words` deletes all multiple spaces, replaces tabs and newlines with spaces, and trims the front and back - a neat trick when dealing with text which has spacing in the source code but not when displayed. The final thing to take account of is that some papers are given with quotes around the name and some are not - `dequote` will remove the quotes if they exist.

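As a tiny illustration of that whitespace trick (the `normalise` name is made up; it is not part of the example):

```haskell
-- Collapse runs of spaces, tabs and newlines to single spaces, and trim both ends:
normalise :: String -> String
normalise = unwords . words

main :: IO ()
main = putStrLn (normalise "  some\n   paper \t title  ")  -- prints "some paper title"
```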
For completeness, we now present the entire example:

```haskell
module Main where

import Network.HTTP
import Text.HTML.TagSoup

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

spjPapers :: IO ()
spjPapers = do
        tags <- parseTags <$> openURL "http://research.microsoft.com/en-us/people/simonpj/"
        let links = map f $ sections (~== "<A>") $
                    takeWhile (~/= "<a name=haskell>") $
                    drop 5 $ dropWhile (~/= "<a name=current>") tags
        putStr $ unlines links
    where
        f :: [Tag String] -> String
        f = dequote . unwords . words . fromTagText . head . filter isTagText

        dequote ('\"':xs) | last xs == '\"' = init xs
        dequote x = x

main :: IO ()
main = spjPapers
```
## Other Examples

Several more examples are given in the Sample file, including obtaining the (short) list of papers from my site, getting the current time, and a basic XML validator. All can be invoked using the `tagsoup` executable program. All use very much the same style as presented here - writing screen scrapers follows a standard pattern. We present the code from two of them for enjoyment only.
### My Papers

```haskell
module Main where

import Network.HTTP
import Text.HTML.TagSoup

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

ndmPapers :: IO ()
ndmPapers = do
        tags <- parseTags <$> openURL "http://community.haskell.org/~ndm/downloads/"
        let papers = map f $ sections (~== "<li class=paper>") tags
        putStr $ unlines papers
    where
        f :: [Tag String] -> String
        f xs = fromTagText (xs !! 2)

main :: IO ()
main = ndmPapers
```
### UK Time

```haskell
module Main where

import Network.HTTP
import Text.HTML.TagSoup

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

currentTime :: IO ()
currentTime = do
    tags <- parseTags <$> openURL "http://www.timeanddate.com/worldclock/uk/london"
    let time = fromTagText (dropWhile (~/= "<span id=ct>") tags !! 1)
    putStrLn time

main :: IO ()
main = currentTime
```
## Related Projects

* [TagSoup for Java](http://tagsoup.info/) - an independently written malformed HTML parser for Java, including [links to other](http://tagsoup.info/#other) HTML parsers.
* [HXT: Haskell XML Toolbox](http://www.fh-wedel.de/~si/HXmlToolbox/) - a more comprehensive XML parser, giving the option of using TagSoup as a lexer.
* [Other Related Work](http://www.fh-wedel.de/~si/HXmlToolbox/#rel) - as described on the HXT pages.
* [Using TagSoup with Parsec](http://therning.org/magnus/posts/2008-08-08-367-tagsoup-meet-parsec.html) - a nice combination of Haskell libraries.
* [tagsoup-parsec](http://hackage.haskell.org/package/tagsoup-parsec) - a library for easily using TagSoup as a token type in Parsec.
* [tagsoup-megaparsec](http://hackage.haskell.org/package/tagsoup-megaparsec) - a library for easily using TagSoup as a token type in Megaparsec.
* [WraXML](http://hackage.haskell.org/packages/archive/wraxml/latest/doc/html/Text-XML-WraXML-Tree-TagSoup.html) - construct a lazy tree from TagSoup lexemes.