Zonal Summations of GBIF Species Occurrence
Data On GPUs
Summary
Analyzing how species are
distributed on the
Earth has been one of the fundamental questions in the
intersections of
environmental sciences, geosciences and biological sciences. With
world-wide
data contributions, more than 375 million species occurrence
records for nearly
1.5 million species have been deposited to the Global Biodiversity
Information
Facility (GBIF) data portal. The sheer amounts of point and
polygon data and
the computation-intensive point-in-polygon tests for zonal
statistics for
biodiversity studies have imposed significant technical
challenges. In this
study, we have significantly extended our previous work on
parallel primitives
based spatial joins on commodity Graphics Processing Units (GPUs)
and have developed
new efficient and scalable techniques to enable parallel zonal
statistics on
the GBIF data completely on GPUs with limited memory capacity.Experiment results have
shown that an impressive
end-to-end response time under 100 seconds can be achieved for
zonal statistics
on the 375+ million species records over 15+ thousand global
eco-regions with
4+ million vertices on a single Nvidia Quadro 6000 GPU device. The
achieved
high performance, which is several orders of magnitude faster than
reference
serial implementations using traditional open source geospatial
techniques, not
only demonstrates the potential of GPU computing for large scale
geospatial
processing, but also makes interactive query driven visual
exploration of
global biodiversity data possible.
Baseline Serial Algorithms and Implementations
Descriptions of baseline implementations using traditional open
source software stack and running on single processors:
1) Serial Implementation 1 (SI1) builds an R-Tree on WWF eco-region polygons
(
14,458 polygons, 16,838 rings and
4,028,622 points
) by using libspatialindex.
For each species occurrence, a a point is created to query against
the R-Tree index structure using the
ISpatialIndex:intersectsWithQuery provided by libspatialindex . If
the point intersect with any of the MBRs stored in the leaf
nodes of the R-Tree, the point is tested with the
polyogns that are represented by the MBRs using the Contains
function in the OGRGeometry class provided by the GDAL/OGR library. If the point is
within the polygon, the polygon FeatureID will be assigned to the
point; otherwise -1 will be assigned instead.
2) Serial Implementation 2 (SI2) adds an optimization technique we
have used for the GPU design and it uses the same input as the GPU
implementation. Recall that in the GPU implementation, in addition
to the .xy file that stores the long/lat pairs of the point
coordinates, the raster cell identifiers and the number of points in
the raster cells are also part of the input. For each of the grid
cell, implementation 2 first queries its bounding box against the
R-Tree index structure and locate all polygons whose MBRs intersect
with the grid ell bounding box. For each of the point in the grid
cell, the Contains
function in the OGRGeometry class is applied for
point-in-polygon test.
One might expect SI2 will always be faster than SI1. However, our
experiments have shown differently. This is because, will SI2
reduces R-Tree query costs, it might increases the number of
point-in-polygon tests which are more expensive than R-Tree queries,
given that the R-Tree index structure is likely to be completely
cached in memory (the R-Tree structure is only about 2.4 MB) . As a
matter of fact, there are points that, although their cell
boundary intersect with some polygon MBRs, they may not intersect
with these MBRs. As such, SI2 is likely to introduce false positives
that can be avoided by query the points against the polygon R-Tree
directly.
Code, Data and Steps to Repeat Baseline Experiments
The following three programs (C++ code) allows interested readers to
repeat our experiments using a sample dataset.
1) The code for generating the R-Tree indexing structure (used for
both BI1 and BI2), the two baseline implementations can be
downloaded as gzipped tar ball by following this link.
2) The sample dataset with 746,302 species occurrences for 25
species with numbers of occurrences between 10,000 and 100,000 can
be downloaded as a gzipped tar ball by following this
link (G45_G01). FYI: not all files in the tar ball are used.
Please contact me if you are curious what the files are used for.
Please visit GBIF data portal
for more detailed info.
3) A copy of the WWF ecoregion shapefile can be downloaded by
following this link
(Warning: 51 MB). Please acknowledge WWF by following the details at
WWF eco-region
website.
4) Two additional point datasets, one is smaller (
279,808 records
) and one is bigger (
9,397,443) are now provided
for additional evaluations. They can be accessed by following the
these two links: G45_G02
and G10M
(warning: 8.7 MB).
To repeat our baseline experiments:
1) download and unzip the above three tar balls to your directory.
2) install libspatialindex and GDAL/OGR library and make the shared
library files (.so file under linux) accessible by modifying
LD_LIBRARY_PATH environment
variable (or specify their paths in g++ command line using -L).
3) change the paths to the WWF eco-region shapefile in the three
source C++ programs (changing to command line is in progress).
3) compile the three C++ programs by following the example command
line for compilation at the beginning of each program.
4) run GenPolyRTree first to generate the polygon index (newrtp.*
under the current directory).
5) run PIPGDALTest1 and/or PIPGDALTest2 by following the example
command line for execution at the beginning of each program. Please
note the differences between the two.
Related Publications
Jianting Zhang and Simin You (2014). Efficient and Scalable
Parallel Zonal Statistics on Large-Scale Species Occurrence Data
on GPUs. Technical Report [PDF]
Jianting Zhang and Simin You (2013) . Parallel Zonal Summations of
Large-Scale Species Occurrence Data on Hybrid CPU-GPU Systems.
Technical Report [PDF]