Earth Engine by Example
While Earth Engine is a powerful system, sometimes you just need to download some data. But if you want to download a lot of little pieces of data that are fast to compute (like image chips or random samples), an Export task using the Earth Engine batch system might not be the best way to do it. It can take a couple of minutes just to schedule, start and stop each task and you can only run a handful in parallel, so running a lot of small tasks is very inefficient. Additionally, since the batch system only allows you to queue up 3000 tasks, trying to extract 100,000 small items can be painful just to orchestrate.
In this example, I’m going to demonstrate how to efficiently download many small pieces of data. This particular example uses ee.Image.getThumbURL() to download a bunch of RGB image chips for offline validation; however, the same technique can work for other types of data (like CSVs or NumPy arrays) using
getDownloadURL() or even
The basic structure of this tool is a Python program that uses the multiprocessing module to make many requests simultaneously. The program’s skeleton looks like this:
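A minimal sketch of such a skeleton follows. It is not the original listing: the bodies of getRequests() and getResult() are placeholders, and the init_earth_engine() helper and main() wrapper are names invented here. Passing the helper as the pool’s initializer is an assumption, made so that every worker process is initialized no matter how the platform starts processes. The only Earth Engine specifics used are ee.Initialize() with its opt_url argument and the high-volume endpoint URL.

```python
import multiprocessing

# The Earth Engine high-volume endpoint, intended for automated requests.
HIGH_VOLUME_URL = "https://earthengine-highvolume.googleapis.com"


def init_earth_engine():
    """Initialize the client against the high-volume endpoint.

    Hypothetical helper. Needs the earthengine-api package and stored
    credentials; imported lazily so the rest of the sketch loads without it.
    """
    import ee
    ee.Initialize(opt_url=HIGH_VOLUME_URL)


def getRequests():
    """Placeholder: return the list of work items (three dummy points here)."""
    return [[-122.1, 37.5], [-122.2, 37.6], [-122.3, 37.7]]


def getResult(index, item):
    """Placeholder: download one item. Here it only builds the output name."""
    return "tile_%05d.png" % index


def main(processes=25):
    """Fetch the work list once, then fan the items out to a process pool."""
    init_earth_engine()  # the parent needs it for getRequests()
    items = getRequests()
    # Assumption: initializer= makes each worker initialize itself as well.
    with multiprocessing.Pool(processes, initializer=init_earth_engine) as pool:
        return pool.starmap(getResult, enumerate(items))
```

To run it as a script, call main() under an `if __name__ == "__main__":` guard, which multiprocessing requires on platforms that spawn rather than fork worker processes.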
First, the Earth Engine library needs to be initialized to use the high-volume endpoint. You should use this URL whenever you’re making automated requests. The bulk of the work is then done by two functions (code in the next section):
- getRequests() figures out which items to download, and
- getResult() does the downloading and save-to-file for one item.
The getRequests() function’s job is to do enough setup to retrieve a list of work items that need to be downloaded (requests). This typically involves using getInfo() to get a list of features or geometries, or maybe the IDs from an image collection. The trick is to get the smallest amount of data possible, but get all of it in one go (i.e., get a list of IDs instead of the whole collection). Each of those items is then sent to the getResult() function (in parallel) to do the actual downloading.
All of the parallel processing is handled by the Pool class, which I’ve initialized with 25 processes. Pool has a map() function and a starmap() function, both of which will apply my getResult() function in parallel to every item in the list of requests. However, map() only passes 1 argument and I want getResult() to also get the item’s index so it can use it when generating a filename for each result. So I’m using the Python built-in enumerate() to turn each work item into a tuple that includes the item’s index, and starmap() will unpack those tuples into arguments for each call to getResult().
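The difference between map() and starmap() is easy to see on a toy worker. The sketch below uses ThreadPool, which exposes the same map()/starmap() interface as Pool, purely so that it runs without the `__main__` guard a process pool needs. The name_for() and name_for_pair() functions are made up for illustration.

```python
from multiprocessing.pool import ThreadPool


def name_for(index, item):
    """Two-argument worker, shaped like getResult(index, item)."""
    return "tile_%05d_%s.png" % (index, item)


def name_for_pair(pair):
    """One-argument wrapper that map() would need instead."""
    index, item = pair
    return name_for(index, item)


items = ["a", "b", "c"]

with ThreadPool(3) as pool:
    # enumerate() yields (0, "a"), (1, "b"), ...; starmap() unpacks each tuple
    # into positional arguments, so name_for() receives index and item.
    via_starmap = pool.starmap(name_for, enumerate(items))
    # map() hands over each tuple as a single argument.
    via_map = pool.map(name_for_pair, enumerate(items))
```

Both calls return results in input order, so the index passed to the worker always matches the item’s position in the list.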
As I previously mentioned, I want to download a bunch of image chips. Specifically, I want 1,000 randomly located, 256x256 pixel, RGB images from the USDA’s National Agriculture Imagery Program dataset, in each of the RESOLVE ecoregions that intersect my ROI (for a total of 4,000 images). To minimize the amount of data that getRequests() and getResult() need to share, I’m going to have getRequests() just generate the random sample centroids as a list of points.
The getResult() function then takes one of those points and generates an image centered on that location, which is then downloaded as a PNG and saved to a file. This function uses image.getThumbURL() to select the pixels; however, you could also use image.getDownloadURL() if you wanted the output to be in GeoTIFF or NumPy format.
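As a rough sketch of what such a getResult() could look like (not the original code), the version below assumes each work item is a GeoJSON-style dict with a "coordinates" entry and that ee.Initialize() has already run in the worker. The 2019 date range, the 127 m buffer, the filename pattern, and the chip_filename() helper are all illustrative choices. It uses urllib from the standard library for the download, and error handling and retries are left out.

```python
import shutil
import urllib.request


def chip_filename(index):
    """Hypothetical naming scheme: one zero-padded PNG per work item."""
    return "tile_%05d.png" % index


def getResult(index, point):
    """Download one 256x256 RGB chip centered on `point` and save it."""
    import ee  # assumes ee.Initialize(...) has already run in this process

    center = ee.Geometry.Point(point["coordinates"])
    # Assumed footprint: roughly 256 m across, a guess for ~1 m NAIP pixels.
    region = center.buffer(127).bounds()
    image = (ee.ImageCollection("USDA/NAIP/DOQQ")
             .filterBounds(region)
             .filterDate("2019", "2020")  # illustrative date range
             .mosaic()
             .select(["R", "G", "B"]))
    url = image.getThumbURL({
        "region": region,
        "dimensions": "256x256",
        "format": "png",
    })
    # urlopen() raises on HTTP errors, which a retry wrapper can catch.
    with urllib.request.urlopen(url) as response, \
            open(chip_filename(index), "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return chip_filename(index)
```

Everything server-side is built from the point’s coordinates alone, so no collection has to be shared between getRequests() and getResult().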
When you’re making lots of requests in parallel, it’s common for some of them to fail due to ‘too many requests’ errors. So it’s very important for getResult() to handle query errors and retry with a backoff. The retry module (also available from pip) makes this very easy with a function decorator.
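To show what retrying with a backoff does, here is a small standard-library stand-in for that kind of decorator. It is not the retry package itself, which offers a comparable `@retry(tries=..., delay=..., backoff=...)` decorator. The name retry_with_backoff and its `sleep` parameter (there so the waits can be observed or skipped) are inventions of this sketch.

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)


def retry_with_backoff(tries=5, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Retry the wrapped function, multiplying the wait by `backoff` each time."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as err:
                    if attempt == tries:
                        raise  # out of attempts: surface the last error
                    # Log every failure so real bugs are not silently retried.
                    logger.warning("%s failed (%s); retrying in %.2fs",
                                   func.__name__, err, wait)
                    sleep(wait)
                    wait *= backoff
        return wrapper
    return decorator


# Usage: a function that fails twice with a throttling-style error.
observed_waits = []
calls = {"count": 0}


@retry_with_backoff(tries=4, delay=1.0, backoff=2.0, sleep=observed_waits.append)
def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("too many requests")
    return "ok"


result = flaky_download()
```

With these settings the waits before the second and third attempts are 1 and 2 seconds; each further failure would double the wait again.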
Here’s the whole script. Using 25 parallel processes, it took just under 4 minutes to download and save my 4,000 images.
And here’s another version for downloading a spatially-aggregated time-series for multiple regions (specifically, it computes the time-series of maximum land surface temperature from the MODIS MOD11A2 dataset for all GAUL level-2 regions in South America). The getRequests() function just serves up an ID for each of the regions, and the getResult() function does a spatial aggregation, mapped over the time-series of images. The output is a separate CSV per region.
- If you’ve never used Earth Engine from Python, you’ll need to install the earthengine-api package (I use the pip installation method). That will also install the earthengine command line tool. You’ll also need to install the retry package (also from pip).
- Make sure you’ve got logging turned on or retry will end up retrying (and hiding) any other errors your script might have.
- The final results of getThumbURL() are limited to 32MB of data per request. For more than that, you should use the batch system.
- This script is likely to generate some errors on startup that you can probably ignore (the very first request from each worker might fail), but if you’re getting more than a few errors per second after that, try using a smaller number of pool workers.
- The division of labor between getRequests() and getResult() is tricky to get right, and it can be tough to decide what to use as a work unit. For instance, you might try to generate a bunch of points with stratifiedSample(), but if you then refer to one of those points in getResult() by its position or its system:index instead of by its geometry, then each of the download requests would end up having to regenerate the entire collection of random points (from scratch), just to get to the one you’ve specified. The good news is that getting this balance wrong isn’t catastrophic; it’ll just cause the workers to take longer.
- If you’re just trying to download a bunch of images clipped to an ROI (maybe every Sentinel-2 image over a city), then your getRequests() function might just extract the system:index of every image in the original collection that meets your requirements. Then your getResult() function can filter the original collection down to an ID with: collection.filter(ee.Filter.eq("system:index", id)).first(). Unlike the point above, in which this was potentially expensive for a programmatically generated FeatureCollection, this is usually quite cheap for filtered/clipped/transformed image collections. (Note: this is probably not true for any image collection generated by mapping over a list, like a collection of temporal composites. In that case, the best work unit might be a date range.)
- If you’re just trying to sample bands from one image (or composite), it can be a lot faster to chop your ROI into a grid and use the grid cells as work units, because a query can handle the points near each other in a single request (using, for instance, reduceRegions()). This would potentially be faster because any input tile might only need to be fetched once for multiple points. (Of course, this only helps if you have more than 1 point per 256x256 input tile.)
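The grid idea in the last note can be sketched in plain Python. The helper names, the square cells in coordinate units, and the choice to group points by cell before building the work list are all assumptions made for illustration. Each resulting group would become one work item, handled by a single request.

```python
import math
from collections import defaultdict


def grid_cells(west, south, east, north, cell_size):
    """Yield ((col, row), (x0, y0, x1, y1)) for cells covering a bounding box."""
    cols = math.ceil((east - west) / cell_size)
    rows = math.ceil((north - south) / cell_size)
    for row in range(rows):
        for col in range(cols):
            x0 = west + col * cell_size
            y0 = south + row * cell_size
            # Clip the last row and column to the edge of the bounding box.
            yield (col, row), (x0, y0,
                               min(x0 + cell_size, east),
                               min(y0 + cell_size, north))


def group_points_by_cell(points, west, south, cell_size):
    """Bucket (x, y) points by the grid cell that contains them.

    Points lying exactly on the east or north edge fall into the next
    cell index; callers may want to clamp them.
    """
    groups = defaultdict(list)
    for x, y in points:
        key = (int((x - west) // cell_size), int((y - south) // cell_size))
        groups[key].append((x, y))
    return dict(groups)


# Usage: a 2x1 box split into unit cells, with three sample points.
cells = dict(grid_cells(0, 0, 2, 1, 1))
work_items = group_points_by_cell(
    [(0.5, 0.5), (0.25, 0.75), (1.5, 0.5)], 0, 0, 1)
```

Only cells that actually contain points end up in work_items, so empty parts of the ROI generate no requests.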