NatureServe Bird Range Maps in Disk Resident and Main-Memory Databases

Disk Resident Databases (PostgreSQL)

Species distribution data play an important role in biodiversity-related research, especially in exploring relationships with the environment. In recent years, both the number of species being studied and the spatial resolution of species distribution data have been increasing rapidly. It is thus imperative to develop database systems that allow users to efficiently query such large-scale data based on spatial and non-spatial (e.g., taxonomic and phylogenetic) criteria.

In this paper, we present our approach to building such a system by integrating several components, including a quadtree representation of binary raster data, tree path indexing and query processing in PostgreSQL, and window decomposition techniques for spatial queries. Our unique contributions are associating species identifiers with intermediate quadtree nodes and optimizing the multiple independent queries that result from window query decomposition. Our system enables PostgreSQL to support binary raster data without requiring any changes to the database backend and is suitable for managing large-scale species distribution data.
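To make the idea of attaching species identifiers to intermediate quadtree nodes concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes each species range is a small square binary raster whose side is a power of two, encodes each maximal all-ones block as a path string (digits 0-3 for the NW, NE, SW and SE quadrants, an ordering chosen here for illustration), and keeps the index in a plain dictionary standing in for a database table. The names quadtree_paths, build_index and species_at are hypothetical.

```python
from collections import defaultdict

def quadtree_paths(raster, x0=0, y0=0, size=None, path=""):
    """Yield the path of every maximal all-ones block in a square binary raster."""
    if size is None:
        size = len(raster)
    cells = [raster[y][x]
             for y in range(y0, y0 + size)
             for x in range(x0, x0 + size)]
    if all(cells):
        # The whole block is covered: record it at this (possibly intermediate) node.
        yield path
        return
    if not any(cells):
        return
    half = size // 2
    # Assumed quadrant order: 0 = NW, 1 = NE, 2 = SW, 3 = SE.
    for digit, (dx, dy) in enumerate([(0, 0), (half, 0), (0, half), (half, half)]):
        yield from quadtree_paths(raster, x0 + dx, y0 + dy, half, path + str(digit))

def build_index(species_rasters):
    """Map each quadtree path to the set of species IDs covering that block."""
    index = defaultdict(set)
    for sid, raster in species_rasters.items():
        for p in quadtree_paths(raster):
            index[p].add(sid)
    return index

def species_at(index, cell_path):
    """Collect species on the root-to-cell path (every prefix of the cell's path)."""
    found = set()
    for i in range(len(cell_path) + 1):
        found |= index.get(cell_path[:i], set())
    return found

# Toy 4x4 example: species 1 covers everything, species 2 the NW quadrant,
# species 3 a single cell in the SE corner.
full = [[1] * 4 for _ in range(4)]
nw = [[1 if x < 2 and y < 2 else 0 for x in range(4)] for y in range(4)]
corner = [[1 if (x, y) == (3, 3) else 0 for x in range(4)] for y in range(4)]
index = build_index({1: full, 2: nw, 3: corner})
```

In this toy index, species 1 is stored once at the root path, species 2 once at path "0", and species 3 at the leaf path "33", so a widespread species costs one entry rather than one per cell. A point lookup such as species_at(index, "00") then gathers identifiers from all ancestors of the cell.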

Our experiments using distribution data for 4000+ bird species in the Western Hemisphere show that the proposed approach of associating species identifiers with quadtree nodes reduces the number of database tuples by more than one third and the average number of identifiers associated with each tuple from 110.6 to 4.8, a significant improvement over classic quadtree-based approaches. With respect to query optimization, optimized queries are 6 to 9.5 times faster than the baseline queries in average query response time and 5.5 to 8.3 times faster in maximum query response time, for four query window sizes ranging from 0.1 to 5.0 degrees. Our query optimization techniques thus make the system suitable for many interactive applications for querying and exploring species distribution data.
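The window decomposition step behind these query experiments can be illustrated with a small Python sketch. It is written under several assumptions: the window is given in integer cell coordinates with exclusive upper bounds, the grid is 8x8, and quadrants are numbered 0-3 in NW, NE, SW, SE order. The function names are hypothetical. The second function only shows one plausible reason why treating the resulting sub-queries together can save work (they share ancestor paths); it is not claimed to be the optimization used in the paper.

```python
def decompose_window(win, x0=0, y0=0, size=8, path=""):
    """Return paths of the maximal quadtree blocks lying fully inside the window.

    win = (xmin, ymin, xmax, ymax) in integer cell coordinates, max exclusive.
    """
    xmin, ymin, xmax, ymax = win
    # Block and window do not overlap: nothing to report.
    if x0 >= xmax or y0 >= ymax or x0 + size <= xmin or y0 + size <= ymin:
        return []
    # Block lies completely inside the window: report it as one unit.
    if xmin <= x0 and ymin <= y0 and x0 + size <= xmax and y0 + size <= ymax:
        return [path]
    # Partial overlap: split into four quadrants and recurse.
    half = size // 2
    blocks = []
    for digit, (dx, dy) in enumerate([(0, 0), (half, 0), (0, half), (half, half)]):
        blocks += decompose_window(win, x0 + dx, y0 + dy, half, path + str(digit))
    return blocks

def distinct_ancestor_paths(blocks):
    """All distinct prefixes (ancestors and the blocks themselves) to look up."""
    return {b[:i] for b in blocks for i in range(len(b) + 1)}

blocks = decompose_window((1, 1, 4, 4))          # a 3x3-cell window
independent_lookups = sum(len(b) + 1 for b in blocks)
shared_lookups = len(distinct_ancestor_paths(blocks))
```

For this 3x3 window the decomposition yields six blocks (one 2x2 block and five single cells). Looking up ancestors independently for each block touches 23 paths, while deduplicating them first touches 11, which hints at why batching the decomposed queries can pay off.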

Related Publications:

Jianting Zhang, Michael Gertz, Le Gruenwald, Efficiently Managing Large-Scale Raster Species Distribution Data in PostgreSQL. Proceedings of ACM-GIS09, Nov. 4-6, Seattle, WA. (doi: 10.1145/1653771.1653815).

Main-Memory Databases

Functionality, performance and scalability are critical to Web-based information systems for publishing and disseminating large-scale species distribution data. Existing systems do not support dynamic spatial window queries on large-scale species range maps, which are important for computing alpha and beta diversities in biodiversity analysis and modeling. In this study, we have developed a novel main-memory quadtree data structure that represents large-scale species range maps and supports dynamic spatial window queries to efficiently retrieve a list of species and their area sizes within a query window. Using NatureServe's 4000+ bird species range maps, experimental results show that the memory footprint of the proposed quadtree data structure representing the range maps of all the species is about 1/6 of that of the quadtree derived by combining individual quadtrees, each representing a single species range map. The experimental results also demonstrate that the query response times of our main-memory spatial database are a fraction of a second for query windows as large as 10° × 10°, which is 2–3 orders of magnitude better than a typical disk-resident spatial database system.
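A window query that returns species together with their area inside the window can be sketched in Python as follows. This is an illustrative toy, not the data structure from the study: it assumes a pointer-based quadtree over a small power-of-two grid, stores at each node the set of species covering that whole block, and measures area in grid cells. The class name Node and its methods are hypothetical.

```python
class Node:
    """A quadtree block; `species` holds IDs whose range covers the whole block."""

    def __init__(self, x0, y0, size):
        self.x0, self.y0, self.size = x0, y0, size
        self.species = set()
        self.children = None  # four child Nodes once the block is split

    def insert(self, sid, raster):
        """Add a species whose range is a square binary raster (list of lists)."""
        cells = [raster[y][x]
                 for y in range(self.y0, self.y0 + self.size)
                 for x in range(self.x0, self.x0 + self.size)]
        if all(cells):
            self.species.add(sid)          # store once at this node, do not descend
        elif any(cells):
            if self.children is None:
                h = self.size // 2
                self.children = [Node(self.x0 + dx, self.y0 + dy, h)
                                 for dx, dy in ((0, 0), (h, 0), (0, h), (h, h))]
            for child in self.children:
                child.insert(sid, raster)

    def window_query(self, win, areas=None):
        """Return {species_id: cells inside win}; win = (xmin, ymin, xmax, ymax)."""
        areas = {} if areas is None else areas
        xmin, ymin, xmax, ymax = win
        # Width and height of the overlap between this block and the window.
        w = min(xmax, self.x0 + self.size) - max(xmin, self.x0)
        h = min(ymax, self.y0 + self.size) - max(ymin, self.y0)
        if w <= 0 or h <= 0:
            return areas
        for sid in self.species:
            areas[sid] = areas.get(sid, 0) + w * h
        for child in self.children or ():
            child.window_query(win, areas)
        return areas

# Toy 4x4 grid: species 1 everywhere, species 2 in the NW quadrant,
# species 3 in the single SE corner cell.
root = Node(0, 0, 4)
root.insert(1, [[1] * 4 for _ in range(4)])
root.insert(2, [[1 if x < 2 and y < 2 else 0 for x in range(4)] for y in range(4)])
root.insert(3, [[1 if (x, y) == (3, 3) else 0 for x in range(4)] for y in range(4)])
result = root.window_query((1, 1, 4, 4))
```

Because a species is recorded only at the highest node it fully covers, each species contributes to a given cell at most once along any root-to-leaf path, so the accumulated areas are not double counted. In the toy example the 3x3 window yields 9 cells for species 1 and 1 cell each for species 2 and 3.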

Related Publications:

Jianting Zhang (2012). A high-performance web-based information system for publishing large-scale species range maps in support of biodiversity studies. Ecological Informatics 8: 68-77.