We introduce a combinatorial optimization framework that naturally induces a class of optimal word permutations with respect to a suitably defined cost function taking into account various measures of relatedness between words. The Burrows and Wheeler transform (bwt) (cf. [M. Burrows, D. Wheeler, A block sorting lossless data compression algorithm, Technical Report 124, Digital Equipment Corporation, 1994]), and its analog for labelled trees (cf. [P. Ferragina, F. Luccio, G. Manzini, S. Muthukrishnan, Structuring labeled trees for optimal succinctness, and beyond, in: Proc. of the 45th Annual IEEE Symposium on Foundations of Computer Science, 2005, pp. 198–207]), are special cases in the class. We also show that the class of optimal word permutations defined here is identical to the one identified by Ferragina et al. for compression boosting [P. Ferragina, R. Giancarlo, G. Manzini, M. Sciortino, Boosting textual compression in optimal linear time, Journal of the ACM 52 (2005) 688–713]. Therefore, they are all highly compressible. We also provide, by using techniques from Combinatorics on Words, a fast method to compute bwt without using any end-of-string symbol. We also investigate more general classes of optimal word permutations, where relatedness of symbols may be measured by functions more complex than context length. For this general problem we provide an instance that is MAX-SNP hard, and therefore unlikely to be solved or approximated efficiently. The results presented here indicate that a key feature of the Burrows and Wheeler transform seems to be, besides compressibility, the existence of efficient algorithms for its computation and inversion.
|Rivista||Theoretical Computer Science|
|Stato di pubblicazione||Published - 2007|
All Science Journal Classification (ASJC) codes