Copyright | (c) Robbin C. |
---|---|
License | GPLv3 |
Maintainer | Robbin C. |
Stability | unstable |
Portability | portable |
Safe Haskell | None |
Language | Haskell2010 |
Codec.Archive.Zim.Parser
Description
This is a library for parsing ZIM (http://openzim.org) files. ZIM files contain offline web content (eg, Wikipedia) which can be browsed locally without an Internet connection.
The API is meant to be intuitive for normal use-cases.
To get content for "A/index.htm" from ZIM file "file.zim":
> mimeContent <- "file.zim" `getContent` Url "A/index.htm" > :t mimeContent mimeContent :: Maybe (B8.ByteString, BL.ByteString) > print mimeContent Just ("text/html", "<html><head>...</html>")
The above will open the file, parse the ZIM header, lookup the MIME type and content of the URL, close the file and return the MIME type and content as a pair. Note that content is a lazy bytestring.
The above operation should suffice for a simple webserver serving a ZIM file. For finer control, it is possible to cache and reuse the file handle and the ZIM header.
> hdl <- openBinaryFile "file.zim" ReadMode > hdr <- getHeader hdl > :t hdr hdr :: ZimHeader > (hdl, hdr) `getContent` Url "A/index.htm" Just ("text/html", "<html><head>...</html>")
ZIM files of Wikimedia Foundation (Wikipedia, Wikibooks, etc) can be found at http://ftpmirror.your.org/pub/kiwix/zim.
Below is a full example of a Scotty web server that serves a ZIM file (specified on command line) on localhost port 3000:
{-# LANGUAGE OverloadedStrings #-} import Control.Monad.IO.Class (liftIO) import Data.Text.Lazy (toStrict, fromStrict) import Data.Text.Encoding (decodeUtf8, encodeUtf8) import System.Environment (getArgs) import Network.HTTP.Types.Status (status404) import Web.Scotty import Codec.Archive.Zim.Parser (getMainPageUrl, getContent, Url(..)) main :: IO () main = do [fp] <- getArgs scotty 3000 $ do get "/" (redirectToZimMainPage fp) get (regex "^/(./.*)$") (serveZimUrl fp) notFound $ text "Invalid URL!" redirectToZimMainPage :: FilePath -> ActionM () redirectToZimMainPage fp = do res <- liftIO $ getMainPageUrl fp case res of Nothing -> do status status404 text "This ZIM file has no main page specified!" Just (Url url) -> redirect . fromStrict $ decodeUtf8 url serveZimUrl :: FilePath -> ActionM () serveZimUrl fp = do url <- (encodeUtf8 . toStrict) <$> param "1" res <- liftIO $ fp `getContent` Url url case res of Nothing -> do liftIO . putStrLn $ "Invalid URL: " ++ show url status status404 text $ "Invalid URL!" Just (mimeType, content) -> do liftIO . putStrLn $ "Serving: " ++ show url setHeader "Content-Type" (fromStrict $ decodeUtf8 mimeType) raw content
Feedback and contributions are welcome on http://github.com/robbinch/zim-parser.
Synopsis
- getHeader :: RunZim h => h -> IO ZimHeader
- getMimeList :: RunZim h => h -> IO MimeList
- getDE :: (ZimGetDE k, RunZim h) => h -> k -> IO ZimDirEnt
- getMainPageUrl :: RunZim h => h -> IO (Maybe Url)
- getCluster :: RunZim h => h -> ClusterNumber -> IO Cluster
- getBlob :: RunZim h => h -> (ClusterNumber, BlobNumber) -> IO Blob
- getContent :: (ZimGetContent k, RunZim h) => h -> k -> IO (Maybe (ByteString, ByteString))
- searchDE :: (ZimSearchDE k, RunZim h) => h -> k -> IO [(Int, ZimDirEnt)]
- type MimeList = Array Int ByteString
- mkNsTitle :: Char -> ByteString -> Title
- mkNsTitlePrefix :: Char -> ByteString -> TitlePrefix
- mkNsUrl :: Char -> ByteString -> Url
- class RunZim h
- class ZimGetDE k
- class ZimSearchDE k
- class ZimGetContent k
- data ZimException
- data ZimHeader = ZimHeader {}
- data ZimDirEntType
- data ZimDirEnt = ZimDirEnt {}
- newtype UrlIndex = UrlIndex Int
- newtype TitleIndex = TitleIndex Int
- newtype ClusterNumber = ClusterNumber Int
- newtype BlobNumber = BlobNumber Int
- newtype Cluster = Cluster {}
- newtype Blob = Blob {
- unBlob :: ByteString
- newtype Url = Url ByteString
- data Title
- data TitlePrefix
Functions
getMainPageUrl :: RunZim h => h -> IO (Maybe Url) Source #
Returns URL of main page in ZIM. This URL can be used for redirecting to the actual page.
getCluster :: RunZim h => h -> ClusterNumber -> IO Cluster Source #
getBlob :: RunZim h => h -> (ClusterNumber, BlobNumber) -> IO Blob Source #
getContent :: (ZimGetContent k, RunZim h) => h -> k -> IO (Maybe (ByteString, ByteString)) Source #
Get (MIME type, Content). Note that Content is lazy.
searchDE :: (ZimSearchDE k, RunZim h) => h -> k -> IO [(Int, ZimDirEnt)] Source #
Search for a Directory Entry on a RunZim. When searching for a:
Url
- Returns either 0 (not found) or 1 element.
Title
- Returns either 0 (not found) or 1 element.
TitlePrefix
- Returns either 0 (not found) or 2 elements corresponding to lower and upper bound of titles containing the prefix.
mkNsTitlePrefix :: Char -> ByteString -> TitlePrefix Source #
Construct a TitlePrefix with a Namespace.
Instances of this class represent a Zim File and are able to perform ZIM operations (getMimeList, getContent, etc). Valid instances include a Handle to a ZIM file, a FilePath to a ZIM file, or a (Handle, ZimHeader) where ZimHeader is parsed previously (so it does not need to be reparsed).
Minimal complete definition
runZim
Minimal complete definition
class ZimSearchDE k Source #
Minimal complete definition
Instances
class ZimGetContent k Source #
Minimal complete definition
Instances
Exceptions
data ZimException Source #
Other than the below, ErrorCall can be thrown by LZMA library if there is a problem with decompression.
Constructors
ZimInvalidMagic | ZIM file has invalid magic number (anything other than 72173914). |
ZimParseError String | There is an error in parsing. |
ZimIncompleteInput | There is insufficient bytes required to parse. |
ZimInvalidIndex Int | The given index (URL, title or cluster) is out of bounds for this ZIM file. |
Instances
Exception ZimException Source # | |
Defined in Codec.Archive.Zim.Parser Methods toException :: ZimException -> SomeException # fromException :: SomeException -> Maybe ZimException # displayException :: ZimException -> String # backtraceDesired :: ZimException -> Bool # | |
Show ZimException Source # | |
Defined in Codec.Archive.Zim.Parser Methods showsPrec :: Int -> ZimException -> ShowS # show :: ZimException -> String # showList :: [ZimException] -> ShowS # |
ZIM Header
See http://www.openzim.org/wiki/ZIM_file_format#Header for more details.
Constructors
ZimHeader | |
Fields
|
ZIM Directory Entry
data ZimDirEntType Source #
There are 4 types of directory entries. Most content in a ZIM file are
usually ZimArticleEntry
or ZimRedirectEntry
.
Constructors
ZimArticleEntry | |
ZimRedirectEntry | |
ZimLinkTarget | |
ZimDeletedEntry |
Instances
Show ZimDirEntType Source # | |
Defined in Codec.Archive.Zim.Parser Methods showsPrec :: Int -> ZimDirEntType -> ShowS # show :: ZimDirEntType -> String # showList :: [ZimDirEntType] -> ShowS # | |
Eq ZimDirEntType Source # | |
Defined in Codec.Archive.Zim.Parser Methods (==) :: ZimDirEntType -> ZimDirEntType -> Bool # (/=) :: ZimDirEntType -> ZimDirEntType -> Bool # |
See http://www.openzim.org/wiki/ZIM_file_format#Directory_Entries for more details.
Constructors
ZimDirEnt | |
Fields
|
Instances
Show ZimDirEnt Source # | |
Eq ZimDirEnt Source # | |
ZimGetContent ZimDirEnt Source # | |
Defined in Codec.Archive.Zim.Parser Methods getContent :: RunZim h => h -> ZimDirEnt -> IO (Maybe (ByteString, ByteString)) Source # | |
ZimGetContent (MimeList, ZimDirEnt) Source # | |
Defined in Codec.Archive.Zim.Parser Methods getContent :: RunZim h => h -> (MimeList, ZimDirEnt) -> IO (Maybe (ByteString, ByteString)) Source # |
Wrapper for URL index
Instances
Show UrlIndex Source # | |
Eq UrlIndex Source # | |
Ord UrlIndex Source # | |
Defined in Codec.Archive.Zim.Parser | |
ZimGetContent UrlIndex Source # | |
Defined in Codec.Archive.Zim.Parser Methods getContent :: RunZim h => h -> UrlIndex -> IO (Maybe (ByteString, ByteString)) Source # | |
ZimGetDE UrlIndex Source # | |
ZimGetContent (MimeList, UrlIndex) Source # | |
Defined in Codec.Archive.Zim.Parser Methods getContent :: RunZim h => h -> (MimeList, UrlIndex) -> IO (Maybe (ByteString, ByteString)) Source # |
newtype TitleIndex Source #
Wrapper for Title index
Constructors
TitleIndex Int |
Instances
newtype ClusterNumber Source #
Wrapper for Cluster number
Constructors
ClusterNumber Int |
Instances
Show ClusterNumber Source # | |
Defined in Codec.Archive.Zim.Parser Methods showsPrec :: Int -> ClusterNumber -> ShowS # show :: ClusterNumber -> String # showList :: [ClusterNumber] -> ShowS # | |
Eq ClusterNumber Source # | |
Defined in Codec.Archive.Zim.Parser Methods (==) :: ClusterNumber -> ClusterNumber -> Bool # (/=) :: ClusterNumber -> ClusterNumber -> Bool # | |
Ord ClusterNumber Source # | |
Defined in Codec.Archive.Zim.Parser Methods compare :: ClusterNumber -> ClusterNumber -> Ordering # (<) :: ClusterNumber -> ClusterNumber -> Bool # (<=) :: ClusterNumber -> ClusterNumber -> Bool # (>) :: ClusterNumber -> ClusterNumber -> Bool # (>=) :: ClusterNumber -> ClusterNumber -> Bool # max :: ClusterNumber -> ClusterNumber -> ClusterNumber # min :: ClusterNumber -> ClusterNumber -> ClusterNumber # |
newtype BlobNumber Source #
Wrapper for Blob number
Constructors
BlobNumber Int |
Instances
Show BlobNumber Source # | |
Defined in Codec.Archive.Zim.Parser Methods showsPrec :: Int -> BlobNumber -> ShowS # show :: BlobNumber -> String # showList :: [BlobNumber] -> ShowS # | |
Eq BlobNumber Source # | |
Defined in Codec.Archive.Zim.Parser | |
Ord BlobNumber Source # | |
Defined in Codec.Archive.Zim.Parser Methods compare :: BlobNumber -> BlobNumber -> Ordering # (<) :: BlobNumber -> BlobNumber -> Bool # (<=) :: BlobNumber -> BlobNumber -> Bool # (>) :: BlobNumber -> BlobNumber -> Bool # (>=) :: BlobNumber -> BlobNumber -> Bool # max :: BlobNumber -> BlobNumber -> BlobNumber # min :: BlobNumber -> BlobNumber -> BlobNumber # |
Wrapper for Url
Constructors
Url ByteString |
Instances
Show Url Source # | |
Eq Url Source # | |
Ord Url Source # | |
ZimGetContent Url Source # | |
Defined in Codec.Archive.Zim.Parser Methods getContent :: RunZim h => h -> Url -> IO (Maybe (ByteString, ByteString)) Source # | |
ZimSearchDE Url Source # | |
ZimGetContent (MimeList, Url) Source # | |
Defined in Codec.Archive.Zim.Parser Methods getContent :: RunZim h => h -> (MimeList, Url) -> IO (Maybe (ByteString, ByteString)) Source # |
Wrapper for Title
Instances
Show Title Source # | |
Eq Title Source # | |
Ord Title Source # | |
ZimGetContent Title Source # | |
Defined in Codec.Archive.Zim.Parser Methods getContent :: RunZim h => h -> Title -> IO (Maybe (ByteString, ByteString)) Source # | |
ZimSearchDE Title Source # | |
ZimGetContent (MimeList, Title) Source # | |
Defined in Codec.Archive.Zim.Parser Methods getContent :: RunZim h => h -> (MimeList, Title) -> IO (Maybe (ByteString, ByteString)) Source # |
data TitlePrefix Source #
Wrapper for Title Prefix
Instances
Show TitlePrefix Source # | |
Defined in Codec.Archive.Zim.Parser Methods showsPrec :: Int -> TitlePrefix -> ShowS # show :: TitlePrefix -> String # showList :: [TitlePrefix] -> ShowS # | |
Eq TitlePrefix Source # | |
Defined in Codec.Archive.Zim.Parser | |
Ord TitlePrefix Source # | |
Defined in Codec.Archive.Zim.Parser Methods compare :: TitlePrefix -> TitlePrefix -> Ordering # (<) :: TitlePrefix -> TitlePrefix -> Bool # (<=) :: TitlePrefix -> TitlePrefix -> Bool # (>) :: TitlePrefix -> TitlePrefix -> Bool # (>=) :: TitlePrefix -> TitlePrefix -> Bool # max :: TitlePrefix -> TitlePrefix -> TitlePrefix # min :: TitlePrefix -> TitlePrefix -> TitlePrefix # | |
ZimSearchDE TitlePrefix Source # | |
Defined in Codec.Archive.Zim.Parser |
ZIM file format
Following is a short summary of the ZIM file format. The authoritative reference is at http://www.openzim.org/wiki/ZIM_file_format.
1. ZIM header
This is an 80-byte header (see ZimHeader
). Among other things, it contains
file offsets to the below.
2. List of MIME types
This is a sequence of null-terminated strings (eg. text/html
,
text/javascript
). The last string is zero length, so the end always
consists of 2 consecutive null bytes.
3. List of URLs
This is a sequence of 8-byte file offsets, each pointing to a directory entry. This list is sorted by the directory entries' URL.
4. List of Titles
This is a sequence of 4-byte indices, each pointing to a URL above (which in turn point to a directory entry). This list is sorted by the directory entries' Title.
5. Directory Entries
This is a sequence of Directory Entries (see ZimDirEnt
).
The first 2 bytes determine the type of this entry, which also determine the
length.
Contents include:
a. MIME type
This 2-byte field means:
0xffff
- This directory entry is a
ZimRedirectEntry
. 0xfffe
- This directory entry is a
ZimLinkTarget
. 0xfffd
- This directory entry is a
ZimDeletedEntry
. any other value
- This directory entry is a
ZimArticleEntry
and this index into the MIME list from above determines its MIME type.
b. Namespace
This single character determines the directory entry's namespace. (eg. A for articles, I for images, etc.) The comprehensive list is at http://www.openzim.org/wiki/ZIM_file_format#Namespaces.
c. Cluster and Blob number
Only for ZimArticleEntry
, this is the directory entry's Cluster and
Blob number. The Cluster number is a 4-byte index into the list of Clusters
below. The Blob number refers to a block inside the (decompressed) cluster.
Together, they provide the content of this directory entry.
d. URL and Title
These 2 null-terminated strings represent the URL and Title of this directory entry respectively. If the Title is empty, it is taken to be the same as the URL.
6. List of Clusters
This is a list of 8-byte file offsets, each pointing to a cluster in the file. -- The end of a cluster is also the start of the next cluster. Therefore, the length of a cluster is the difference between the adjacent offsets. For the last cluster, the end is the Checksum file offset, as the Checksum is always the last 16 bytes of a ZIM file.
a. Compression Type
The first byte of the cluster determines if it is uncompressed (eg. PNG image) or compressed with LZMA (eg. HTML).
0 or 1
- No compression
4
- Compressed with LZMA
b. List of Blobs
This is a list of 4-byte offsets, each pointing inside this cluster. The end of a blob is also the start of the next blob. Therefore, the length of a blob is the difference between the adjacent offsets. The last offset points to the end of the data area so there is always one more offset than blobs.