Wednesday, December 15, 2021

Academic paper: Analysis of the Clustering Properties of the Hilbert Space-Filling Curve

 Analysis of the Clustering Properties of the Hilbert Space-Filling Curve

Here is the link.

Abstract 

Several schemes for linear mapping of a multidimensional space have been proposed for various applications such as access methods for spatio-temporal databases and image compression. In these applications, one of the most desired properties from such linear mappings is clustering, which means the locality between objects in the multidimensional space being preserved in the linear space. It is widely believed that the Hilbert space-filling curve achieves the best clustering [1, 14]. In this paper, we analyze the clustering property of the Hilbert space-filling curve by deriving closed-form formulas for the number of clusters in a given query region of an arbitrary shape (e.g., polygons and polyhedra). Both the asymptotic solution for the general case and the exact solution for a special case generalize previous work [14]. They agree with the empirical results that the number of clusters depends on the hyper-surface area of the query region and not on its hyper-volume. We also show that the Hilbert curve achieves better clustering than the z curve. From a practical point of view, the formulas given in this paper provide a simple measure that can be used to predict the required disk access behaviors and hence the total access time.

Index Terms: 

locality-preserving linear mapping, range queries, multi-attribute access methods, data clustering, Hilbert curve, space-filling curves, fractals.  

1 Introduction 

The design of multidimensional access methods is difficult compared to one-dimensional cases because there is no total ordering that preserves spatial locality. Once a total ordering is found for a given spatial or multidimensional database, one can use any one-dimensional access method (e.g., B+ -tree), which may yield good performance for multidimensional queries. An interesting application of the ordering arises in a multidimensional indexing technique proposed by Orenstein [19]. The idea is to develop a single numeric index on a one-dimensional space for each point in a multidimensional space, such that for any given object, the range of indices, from the smallest index to the largest, includes few points not in the object itself.

Consider a linear traversal or a typical range query for a database where record signatures are mapped with multi-attribute hashing [24] to buckets stored on disk. The linear traversal specifies the order in which the objects are fetched from disk as well as the number of blocks fetched. The number of nonconsecutive disk accesses will be determined by the order of blocks fetched. Although the order of blocks fetched is not explicitly specified in the range query, it is reasonable to assume that the set of blocks fetched can be rearranged into a number of groups of consecutive blocks by a database server or disk controller mechanism [25]. Since it is more efficient to fetch a set of consecutive disk blocks rather than a randomly scattered set in order to reduce additional seek time, it is desirable that objects close together in a multidimensional attribute space also be close together in the one-dimensional disk space. A good clustering of multidimensional data points on the one-dimensional sequence of disk blocks may also reduce the number of disk accesses that are required for a range query.

In addition to the applications described above, several other applications also benefit from a linear mapping that preserves locality: 

  1. In traditional databases, a multi-attribute data space must be mapped into a one-dimensional disk space to allow efficient handling of partial-match queries [22]; in numerical analysis, large multidimensional arrays [6] have to be stored on disk, which is a linear structure. 
  2. In image compression, a family of methods use a linear mapping to transform an image into a bit string; subsequently, any standard compression method can be applied [18]. A good clustering of pixels will result in a fewer number of long runs of similar pixel values, thereby improving the compression ratio. 
  3. In geographic information systems (GIS), run-encoded forms of image representations are ordering sensitive, as they are based on representations of the image as sets of runs [1]. 
  4. Heuristics in computational geometry problems use a linear mapping. For example, for the traveling salesman problem, the cities are linearly ordered and visited accordingly [2]. 
  5. Locality-preserving mappings are used for bandwidth reduction of digitally sampled signals [4] and for graphics display generation [20]. 
  6. In scientific parallel processing, locality-preserving linearization techniques are widely used for dynamic unstructured mesh partitioning [17]. 
Sophisticated mapping functions have been proposed in the literature. One based on interleaving bits from the coordinates, which is called z-ordering, was proposed [19]. Its improvement was suggested by Faloutsos [8], using Gray coding on the interleaved bits. A third method, based on the Hilbert curve [13], was proposed for secondary key retrieval [11]. In the mathematical context, these three mapping functions are based on different space-filling curves: the z curve, the Gray-coded curve and the Hilbert curve, respectively. Figure 1 illustrates the linear orderings yielded by the space-filling curves for a 4 x 4 grid.

To be continued | Will read later


It is very interesting paper. I will read more carefully later if I have time. I need to go back to watch again the bigTable video. 


No comments:

Post a Comment