May 13, 2014

A guide to the rasterization of vector coverages in PostGIS

A frequent question since the beginning of the PostGIS Raster adventure has been "How can I convert my geometry table to a raster coverage?" (here also). People often describe the way to rasterize a single geometry using ST_AsRaster() but I don’t know any text or tutorial explaining how to convert a complete vector coverage to a single raster or to a tiled raster coverage.

In this article I will describe two methods to rasterize a vector coverage: one very fast but not very flexible and one very flexible but unfortunately a bit slow! I will rasterize a vector forest cover so that it aligns perfectly with an already loaded raster elevation layer. I will use the "height" column of the forest cover to assign values to pixels. I will assume that both layers are in the same SRID and that the forest only partially covers the elevation layer. The elevation layer is divided in 625 tiles of 100x100 pixels for a total of 6 250 000 pixels.


"Intelligent" vector VS "stupid" raster!

Why would one go from a human parsed, precise, object oriented, "intelligent" vector table to an unparsed, jagged, single variable, "stupid" raster layer? Over the years I could identify only two good reasons to do so. The first is to get all your data layers into a single analysis paradigm (in this case the raster one) to simplify the geoprocessing chain. The second is because this "stupid" raster paradigm is still also the fastest when it’s time to analyse large volume of data. Processing a raster stack relies on very simple matrix operations generally available through what geo people call map algebra operations. This is still generally faster than the linear algebra necessary to analyse vector coverages. Anyway, our task here is not to decide whether doing it in vector mode is better than in raster mode. Our task is just to get the conversion done properly…


"Rasterizing" vs "rendering"

Another interesting matter is the difference between rasterizing and rendering. Rendering is the conversion of a vector model to an image for display purpose. Rasterizing is the conversion of a vector model to a raster for data analysis purpose. Renderers, like MapServer, GeoServer and Mapnik, besides blending many layers together, symbolizing geometries and labeling them, will generally produce anti-aliasing along the edge of polygons and translate values into symbolic colors. Any original data value is lost and there is no way to use the product of a renderer for GIS analysis. On the other hand, a rasterizer function, like the ones provided by GDAL, QGIS, ArcGIS or PostGIS, will produce raw pixels trying to represent the original vector surface as crisply and accurately as possible, assigning to each pixel one of the value (and only those values) associated with the original geometry. The raster resulting from a rasterization process is a valuable input into a geoprocessing analysis chain. The result of a rendering process is not. We must also say that although there are some methods to rasterize a vector coverage in PostGIS, there are still no ways to render one (unless you consider ST_AsJPEG(), ST_AsTIFF() and ST_AsPNG() as very basic renderers).


Method 1.1 – Basic ST_Union() and ST_Tile()

The first rasterization method I present is the one that was planned from the beginning of the PostGIS Raster project: simply convert all the geometries to small rasters and union them into a single raster like we union many geometries together using ST_Union(geometry). A complete query, aligning the resulting raster on an existing elevation raster coverage, should looks like this:

CREATE TABLE forestheight_rast AS
SELECT ST_Union(ST_AsRaster(geom, rast, '32BF', height, -9999)) rast
FROM forestcover, (SELECT rast FROM elevation LIMIT 1) rast;

Close-up on the unioned rasterized
version of the forest cover resulting
in "jagged" areas of same height.
This results is a unique "big" raster, aggregating all the "small" rasterizations of the geometries.

Note the arguments to ST_AsRaster(). The first one is the geometry to rasterize. The second one is the reference raster from which the alignment parameters are borrowed (a pixel corner x and y, the scale (or pixel size) and the skew (or rotation)). The third argument is the pixel type of the expected raster (32 bit float). The fourth parameter is the value to assign to the pixel. Here is it the 'height' column from the forest cover table. The last parameter ('-9999') is the nodata value to assign to pixels padding the area surrounding each rasterized geometry. This is the very first variant of the ST_AsRaster() function in the PostGIS Raster reference.

Like its geometric counterpartST_Union() simply aggregate (merge) all those small rasters together into a unique raster. Without ST_Union(), the query would have resulted in 2755 small rasters. One for each geometry.

Note also that we picked only one raster from the elevation layer as the reference raster to ST_AsRaster(). This is because all the rasters in this table are well aligned together and we need only one set of pixel corners coordinates to align everything. We could also have put this query selecting a single raster directly into the SELECT part:

CREATE TABLE forestheight_rast AS
SELECT ST_Union(ST_AsRaster(geom, (SELECT rast FROM elevation LIMIT 1), '32BF', height, -9999)) rast
FROM forestcover;

We normally don’t want to keep big rasters into PostGIS so we can split the merged raster into smaller tiles having the same dimensions as the elevation coverage with the ST_Tile() function:

CREATE TABLE tiled_forestheight_rast AS
SELECT ST_Tile(rast, 100, 100) rast
FROM forestheight_rast;

The tiling of the rasterized forest cover, in red, is not well
aligned with the tiling of the the elevation raster, in grey.
Note that the squares are 100x100 pixels tiles, not pixels...
This results in 48 tiles, most being 100x100 pixels. Some are only 18x100 pixels and some are only 100x46 pixels. The lower right one is only 18x46 pixels...

There are 625 tiles of 100x100 pixels in the elevation coverage and 2755 geometries in the forest cover. These geometries cover only 31 of those tiles and the resulting big raster covers 56 of them. The query for creating the "big" raster takes 5 seconds (thanks to Bborie Park who speeded up ST_Union() by a huge factor in PostGIS 2.1) and the query to resample/retile it takes only 2 seconds.

One problem with this result is that even though the pixels are well aligned with the elevation layer and most of the tiles have the same dimensions, the tiles themselves are not aligned with the elevation coverage and they do not cover the same area. Method 1.2 will fix that.


Method 1.2 – ST_Union() and ST_MapAlgebra()

In order to produce a raster coverage having a footprint identical to the elevation coverage we must use this coverage not only for alignment but also as the base for our final tiles. The trick is to "burn" the big raster into empty tiles based on each of the elevation tiles. For that we will use a quite simple (!) two rasters ST_MapAlgebra() call:

CREATE TABLE welltiled_forestheight_rast AS
SELECT ST_MapAlgebra(fc.rast, ST_AddBand(ST_MakeEmptyRaster(e.rast), '32BF'::text, -9999, -9999), '[rast1]', '32BF', 'SECOND') rast
FROM forestheight_rast fc, elevation e;

The two rasters ST_MapAlgebra() function overlays the two rasters and compute an expression for each aligned set of two pixels. The result of this expression becomes the value assigned to each pixel of a new raster.

As the two rasters are not necessarily well aligned, the new raster extent can be equal to 1) the extent of the union of both raster, 2) the extent of the intersecting area 3) the extent of the first raster or 4) the extent of the second raster. Here, no two rasters involved in the map algebra have the same extent. Since we want to "burn" the big raster into tiles having the same extent as the elevation tiles, which are second in the list of arguments, we have to specify that the new raster will have the same extent as the "SECOND" raster.

The first argument is the "big" forest height raster. It is the only row of the forestheight_rast table.

The second argument is a new tile generated with ST_MakeEmptyRaster() and ST_AddBand() from each of the 626 elevation tiles. We could have used the original tiles directly, but since we are specifying the resulting extent to be "SECOND" and that the nodata value of the result from ST_MapAlgebra() is picked from the "FIRST" or the "SECOND" raster when those values are specified, then the resulting raster would have had 32767 as nodata value. This is the nodata value of the elevation raster coverage. We want the nodata value of the result to be the same as the one of the forest height coverage, which is -9999. So the trick is to "build" a new tile from scratch with ST_MakeEmptyRaster() with the original elevation tiles as alignment reference and add them a band with ST_AddBand() specifying a pixel type ('32BF'), an initial value ('-9999') and a nodata value ('-9999') identical to the forest height raster.

The third argument is the expression computing the value to assign to each pixel. Pixel values from the first raster are refered as "[rast1]" and pixel values from the second raster are referred as "[rast1]". You can also reference the x coordinate of the pixel with "[rast.x]" and the y coordinate with "[rast.y]". The expression is any normal PostgreSQL expression like for example "([rast1] + [rast2]) / 2".

In our case the expression is simply the value of the first raster, the forest height one, that we want to "burn" into the tile. It does not involve values from the elevation raster. In other word the elevation tiles are merely used as well aligned blank pieces of paper on which we "print" the value of the forest height raster...

The fourth argument is the pixel type ('32BF') expected for the resulting raster. It is the same as the forest height raster.

The last argument ('SECOND') tells ST_MapAlgebra() to use the extent of the second raster (the elevation tiles) to build the resulting raster.

ST_MapAlgebra() will hence copy each value from the forest height raster to new empty tiles which dimensions are based on the elevation ones. The query results in 625 tiles. When a tile does not intersect the area covered by the forest cover, all its pixel are simply set to nodata.

If your big raster does not cover much of the reference raster coverage, you might speed up the process by adding a spatial index to the reference coverage (it should already be there) and restraining the ST_MapAlgebra() computation to tiles intersecting with the vector coverage:

CREATE INDEX elevation_rast_gist ON elevation USING gist (st_convexhull(rast));

CREATE TABLE welltiled_forestcover_rast AS
SELECT ST_MapAlgebra(fc.rast, ST_AddBand(ST_MakeEmptyRaster(e.rast), '32BF'::text, -9999, -9999), '[rast1]', '32BF', 'SECOND') rast
FROM forestcover_rast fc, elevation e
WHERE ST_Intersects(fc.rast, e.rast);

Keep in mind however that this query is faster because it does not process and hence does not return tiles not touching the forest cover area. If you want to produce a complete set of tile, identical in number and extent to the elevation coverage, keep with the previous query.


Some drawbacks to the methods using ST_AsRaster() and ST_Union()

All those queries involving ST_AsRaster() and ST_Union() are quite fast but they present two drawbacks:
  • ST_Union limited by RAM - Any query involving ST_Union() is limited by the RAM available to PostgreSQL. If the big raster resulting from the union of all the rasterized geometries is bigger than the available RAM, the query will fail. The trick to avoid this is to aggregate the small rasters not into a huge unique raster, but into smaller groups of rasters intersecting with tiles from the reference raster. Method 1.3 will show how to do that.
  • Only one extractable metric - A second limitation is that ST_AsRaster() has only one method to determine if the geometry value must be assigned to the intersecting pixel or not. When the geometry is a polygon, for example, ST_AsRaster() assigns the value of the geometry to the pixel if the center of the pixel is inside this polygon whatever what proportion of this polygon covers the pixel area. In the worst case the value of a very small polygon encircling the center of the pixel would be assigned when all the rest of the pixel area would be covered by a polygon with a different value. You can think of many cases where the value at the center of the pixel is not necessarily representative of the area covered by the pixel.

    This is because ST_AsRaster() rasterize only one geometry at a time without consideration for other geometries intersecting with the pixel. If you consider all the geometries of a layer when assigning a value, you might want to assign other metrics to the pixels like the count of geometry intersecting with the pixel, the total length of all the polylines intersecting with the pixel, the value associated with the smallest polygon intersecting with the pixel, etc...

    Method 2 will show how to use the PostGIS Addons ST_ExtractToRaster() function to extract some of those metrics from a whole vector coverage and how to easily implement new ones.

Method 1.3 – Memory efficient ST_Union() and ST_MapAlgebra()

This method is RAM safe in that it never produces a "big" raster bigger than the area covered by one tile from the reference raster. It is a little bit slower than the first query of method 1.2 because it union some rasters many times when they touches many tiles. It also produces 625 tiles like the elevation coverage.

CREATE TABLE ramsafe_welltiled_forestcover_rast AS
WITH forestrast AS (
  SELECT rid, ST_MapAlgebra(
                ST_Union(ST_AsRaster(geom, rast, '32BF', height, -9999)),
                ST_AddBand(ST_MakeEmptyRaster(rast), '32BF'::text, -9999, -9999), 
                '[rast1]', '32BF', 'SECOND') rast
  FROM a_forestcover2, elevation
  WHERE ST_Intersects(geom, rast)
GROUP BY rid, rast
)
SELECT a.rid,
       CASE
         WHEN b.rid IS NULL THEN ST_AddBand(ST_MakeEmptyRaster(a.rast), '32BF'::text, -9999, -9999)
         ELSE b.rast
       END rast
FROM elevation a LEFT OUTER JOIN forestrast b 
ON a.rid = b.rid;

The first query in the WITH statement (forestrast) rasterize the geometries, union them by group intersecting the reference raster tiles and burn each of them to a new tile as seen before. This is done by the addition of a WHERE clause limiting the operation to geometries intersecting with tiles and a GROUP BY clause insuring that ST_Union() only aggregate small rasters "belonging" to the same tile.

The WHERE clause however cause the extent to be limited to the area covered by the forest cover. This first part alone would return only 32 tiles. The second part of the query make sure we get the 593 remaining tiles. It uses a LEFT OUTER JOIN to make sure a row is returned for every 625 tiles from the reference raster coverage. When the rid of the reference raster coverage match one of the 32 tiles, these tiles are returned. Otherwise a new empty tile is returned. You might skip this part if your source vector layer covers the full area of the reference raster or if you wish to limit the rasterized area to the one of the vector layer.


Method 2 – PostGIS Add-ons ST_ExtractToRaster()

As said before, ST_AsRaster() is quite limited in terms of the kind of value it can assign to the pixels. Not so many values can be extracted from a single polygon: the value at the centroid, the proportion of the pixel area covered, the value weighted by this proportion, the length of the intersecting part of a polyline being rasterized... ST_AsRaster() however, because it is based on GDAL, only implements the first one.

Furthermore, many more metrics can be imagined if the value assigned to each pixel comes not only from one geometry but from the aggregation of all the geometries (or the values associated with them) of a vector coverage intersecting the centroid or the surface of the pixel. We might, for example, want to compute the number of geometries intersecting with the pixel. We might be interested by the value of the polygon covering the biggest part of the pixel or of the polyline having the longest length inside the pixel are. We might want to merge together geometries having the same value before computing the metric. I could identify nearly 76 different possible metrics extractable from point, polyline and polygon coverages.

We hence need a function that is able to extract metrics from a vector coverage as a whole and which allows us to implement new extraction methods easily. The PostGIS Add-ons ST_ExtractToRaster() function was written with exactly this goal in mind.

The PostGIS Add-ons is a set of about 15 pure PL/pgSQL community written functions helping to write advanced PostGIS queries. One of the most useful of these functions is ST_ExtractToRaster().

A query using ST_ExtractToRaster(), returning the same set of tiles as method 1.3 and also not suffering from RAM limitations would look like this:

CREATE INDEX forestcover_geom_gist ON forestcover USING gist (geom);

CREATE TABLE extracttoraster_forestcover AS
SELECT ST_ExtractToRaster(
         ST_AddBand(ST_MakeEmptyRaster(rast), '32BF'::text, -9999, -9999), 
         'public', 
         'forestcover', 
         'geom', 
         'height', 
         'MEAN_OF_VALUES_AT_PIXEL_CENTROID') rast
FROM elevation;

First and very important remark: this query takes around 200 seconds! This is about 16 times more than the equivalent RAM safe query of method 1.3! This is because ST_ExtractToRaster() does much more than ST_AsRaster(). It not only considers the whole coverage, it can compute very different things from this coverage. It is definitely not worth using ST_ExtractToRaster() if what you want is the value of the geometry intersecting the pixel centroid. The interest for this function is the other methods available...

Note the last argument to ST_ExtractToRaster(): 'MEAN_OF_VALUES_AT_PIXEL_ CENTROID'. This is the method used to extract a value for the pixel. Fifteen (15) of these methods have been implemented with ST_ExtractToRaster() up to now. They are listed below with a short description.

The 'MEAN_OF_VALUES_AT_PIXEL_CENTROID' method returns a raster coverage which values are identical to the raster coverage produced with the previous methods. Most other methods, however, will return quite different rasters…

The other arguments to ST_ExtractToRaster() are quite trivial: the first is the raster which values are extracted to. It’s a good practice to pass a new empty raster band having the right pixel type, initial value and nodata value and which alignment is based on tiles from the reference raster. This is identical as what we did in 1.2 and 1.3. The second argument is the schema of the vector table to rasterize ('public'). The third and the fourth are the name of the table ('forestcover') and the name of the column  ('geom') containing the geometry to rasterize. The fifth argument is the column containing the value to be used in the computation of a new value when it is necessary ('height'). When the computation involve only geometries (e.g. COUNT_OF_VALUES_AT_PIXEL_ CENTROID or SUM_OF_LENGTHS), this argument is not necessary.

The query will return a new raster for each raster in the elevation table, in this case 625.

Close-up on four pixel taking different values depending on the method used to assign them a value. The green and blue pixels were extracted using the MEAN_OF_VALUES_AT_PIXEL_CENTROID method. The pixels delimited by the red lines were extracted using the VALUE_OF_MERGED_BIGGEST method. The values displayed are the values associated with the polygons which a delineated in black. Note how the pixels delineated by the red lines take the values of the polygons occupying the biggest proportion of the pixel.
Here is a list of methods already implemented for ST_ExtractToRaster(). All methods using the pixel centroid are postfixed with "_AT_PIXEL_CENTROID". All other methods involve the pixel as a rectangular surface.
  • COUNT_OF_VALUES_AT_PIXEL_CENTROID: Number of geometries intersecting with the pixel centroid.
  • MEAN_OF_VALUES_AT_PIXEL_CENTROID: Average of all values intersecting with the pixel centroid.
  • COUNT_OF_POLYGONS: Number of polygons or multipolygons intersecting with the pixel surface.
  • COUNT_OF_LINESTRINGS: Number of linestrings or multilinestrings intersecting with the pixel surface.
  • COUNT_OF_POINTS: Number of points or multipoints intersecting with the pixel surface.
  • COUNT_OF_GEOMETRIES: Number of geometries (whatever they are) intersecting with the pixel surface.
  • VALUE_OF_BIGGEST: Value associated with the polygon covering the biggest area intersecting with the pixel surface.
  • VALUE_OF_MERGED_BIGGEST: Value associated with the polygon covering the biggest area intersecting with the pixel surface. Polygons with identical values are merged before computing the area of the surface.
  • VALUE_OF_MERGED_SMALLEST: Value associated with the polygon covering the smallest area in the pixel. Polygons with identical values are merged before computing the area of the surface.
  • MIN_AREA: Area of the geometry occupying the smallest portion of the pixel surface.
  • SUM_OF_AREAS: Sum of the areas of all polygons intersecting with the pixel surface.
  • SUM_OF_LENGTHS: Sum of the lengths of all linestrings intersecting with the pixel surface.
  • PROPORTION_OF_COVERED_AREA: Proportion, between 0.0 and 1.0, of the pixel surface covered by the union of all the polygons intersecting with the pixel surface.
  • AREA_WEIGHTED_MEAN_OF_VALUES: Mean of all polygon values weighted by the area they occupy relative to the pixel being processed. The weighted sum is divided by the maximum between the intersecting area of the geometry and the sum of all the weighted geometry areas. That means if the pixel being processed is not completely covered by polygons, the value is multiplied by the proportion of the covering area. For example, if a polygon covers only 25% of the pixel surface and there are no other polygons covering the surface, only 25% of the value associated with the covering polygon is assigned to the pixel. This is generally the favorite behavior.
  • AREA_WEIGHTED_MEAN_OF_VALUES_2: Mean of all polygon values weighted by the area they occupy relative to the pixel being processed. The weighted sum is divided by the sum of all the weighted geometry areas. That means if the pixel being processed is not completely covered by polygons, the full weighted value is assigned to the pixel. For example, even if a polygon covers only 25% of the pixel surface, 100% of its value is assigned to the pixel.

Adding methods to ST_ExtractToRaster()

Want to compute some other esoteric metrics with ST_ExtracToRaster()? Internally the function use the callback version of ST_MapAlgebra() which call one of the two predefined callbacks depending on if the value is computed for the pixel centroid or if the value is computed using the full surface of the pixel. Those functions are named ST_ExtractPixelCentroidValue4ma() and ST_ExtractPixelValue4ma(). They fulfil to the criteria for ST_MapAlgebra() callback functions. They take three arguments defining the value of the input pixel (there can be many), the position of the pixel in the raster and some extra parameters when necessary. In order to build a query for each pixel, both functions expect extra parameters: the alignment parameters of the raster for which values are being computed and the parameters of the geometry column for which values are being extracted. The callbacks then define a series of extraction methods.

To add a new method to these callbacks, you can simply copy the query associated with an existing method that extract a value similar to what you want to extract, give it a proper name, modify it to compute the expected value and make sure it uses the callback extra arguments properly. It is a good idea to test the query directly on the vector coverage prior to test it in one of the two callback. Most of the methods do a ST_Intersects() between the shape of the pixel and the vector coverage (this is why it is important that the vector coverage be spatially indexed) and extract some metrics from the list of intersecting geometries.

If you add a new method which might be useful to others, let me know. I will add it to the PostGIS Add-ons code…


Conclusion

Rasterizing a complete vector coverage can be done very quickly with ST_AsRaster() and ST_Union() but some projects require more sophisticated methods to assign values to pixels considering the coverage as a whole. PostGIS Add-ons ST_ExtractToRaster() is slow but provides much more control on the metric that is extracted from the vector coverage.

In both case it is possible to extract values following a preloaded reference tiled coverage. That should be the case for most applications. It is also easy to add more extraction methods to ST_ExtractToRaster().