Word-based compression of XML documents

Contents:


  • 3.6.2007: Version 0.1alpha released
  • Overview

    The aim of this project is to create software that serves as a platform for testing various lossless compression methods and their applications. We have focused on the compression of large XML files. Compression ratio is the most important factor, so we used the best known approaches and combined methods for XML compression with methods for text compression.

    We have implemented a parser that can parse even non-well-formed XML files. On the parser's output we tried several standard methods: block compression, dictionary methods and statistical methods. We also tried combinations of methods that had not been presented before, such as block compression followed by statistical methods.
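
    The combination mentioned above, block compression followed by a statistical stage, can be illustrated with a minimal Python sketch. It is not the XBW implementation: the function names (`bwt`, `inverse_bwt`, `move_to_front`, `entropy_bits`) are hypothetical, the transform works on bytes rather than on words, the rotation sort is naive, and a zero-order entropy estimate stands in for a real arithmetic or Huffman coder.

```python
import math
from collections import Counter


def bwt(data: bytes) -> tuple[bytes, int]:
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column.

    Returns the last column and the row index of the original string.
    Assumption: quadratic-memory rotation sort, fine only for small inputs.
    """
    n = len(data)
    if n == 0:
        return b"", 0
    rows = sorted(range(n), key=lambda i: data[i:] + data[:i])
    last = bytes(data[(i - 1) % n] for i in rows)
    return last, rows.index(0)


def inverse_bwt(last: bytes, primary: int) -> bytes:
    """Invert bwt() by following the mapping from first column to last column."""
    n = len(last)
    if n == 0:
        return b""
    # A stable sort of positions by byte value links each row of the
    # first column to the matching position in the last column.
    order = sorted(range(n), key=lambda i: last[i])
    out = bytearray()
    idx = primary
    for _ in range(n):
        idx = order[idx]
        out.append(last[idx])
    return bytes(out)


def move_to_front(data: bytes) -> list[int]:
    """Replace each byte by its rank in a recency list, then move it to the front."""
    alphabet = list(range(256))
    ranks = []
    for b in data:
        rank = alphabet.index(b)
        ranks.append(rank)
        alphabet.insert(0, alphabet.pop(rank))
    return ranks


def entropy_bits(symbols) -> float:
    """Zero-order entropy of a sequence in bits: a rough stand-in for the
    output size of a statistical coder (an assumption, not a real coder)."""
    total = len(symbols)
    if total == 0:
        return 0.0
    counts = Counter(symbols)
    return -sum(c * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    sample = b"<book><title>a</title></book>" * 20
    last, primary = bwt(sample)
    ranks = move_to_front(last)
    assert inverse_bwt(last, primary) == sample
    print("plain  :", round(entropy_bits(sample)), "bits (estimate)")
    print("BWT+MTF:", round(entropy_bits(ranks)), "bits (estimate)")
```

    The block transform only reorders the input so that equal symbols cluster together; the size reduction comes from the statistical stage that follows, which is why the two are chained in this order.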

    We are happy to announce that, in our measurements, XBW achieves the best compression ratio on the files it was designed for: large XML files. On other types of files, however, XBW does not perform as well.

    The biggest strength of XBW seems to be large XML files in languages that use non-ASCII characters. On our test corpus, which contains several Czech XML files of about 20 MB each, we have beaten bzip2 by 81% and rar by 45%. To our knowledge, our program is the best choice for such files.

  • Note: detailed performance measurements can be found in the documentation.
  • Download

    The source code of XBW is available at the SourceForge download page.


    XBW user guide (PDF)
    XBW user guide (HTML)
    The user guide contains program requirements, installation instructions, simple usage examples and a full description of the program arguments.

    XBW documentation (PDF)
    The documentation contains the following sections:

  • XBW overview
  • Performance measurement
  • Description of algorithms
  • Note: we apologize for the quality of our documentation; we are not native English speakers. We plan to improve it. Any corrections are very welcome.
  • Authors

    The XBW project team has the following members:

  • Jan Lansky - Team leader
  • Radovan Sestak - Burrows-Wheeler transform
  • Petr Uzel - Arithmetic and Huffman coder
  • Stanislav Kovalcin - LZC, LZSS, Move to front, Run-length encoding
  • Tomas Urban - Dictionary encoding
  • Pavol Kumicak - Prediction by partial matching
  • Maria Szabo - XML Parser
  • Please contact one of the authors directly by email if you want to submit bug reports, request new features, etc. We will be very happy to receive your comments.


  • Jan Lánský, Katsiaryna Chernik, Zuzana Vlčková: Syllable-Based Burrows-Wheeler Transform. In: Pokorný, J., Snášel, V., and Richta, K. (Eds.): Proceedings of the Dateso 2007 Annual International Workshop on DAtabases, TExts, Specifications and Objects. CEUR-WS, Vol. 235, pp. 1-10, ISBN 80-7378-002-X, ISSN 1613-0073.
  • Jan Lánský, Katsiaryna Chernik, Zuzana Vlčková: Comparison of Text Models for BWT. In: Storer, J. A., Marcellin, M. W. (Eds.): Proceedings of 2007 Data Compression Conference, IEEE Computer Society Press, Los Alamitos, California, USA, 2007, p. 389, ISBN 0-7695-2791-4, ISSN 1068-0314.
  • Leo Galamboš, Jan Lánský, Michal Žemlička, Katsiaryna Chernik: Compression of Semistructured Documents. In: International Journal of Information Technology (IJIT), Volume 4, Number 1, 2007, pp. 11-17, ISSN 1305-2403.
