Fast(er) Downloads

Noel Gorelick
5 min read · Nov 2, 2021


Earth Engine by Example

Downloading lots of small images using the Earth Engine task system isn’t always the best fit.

While Earth Engine is a powerful system, sometimes you just need to download some data. But if you want to download a lot of little pieces of data that are fast to compute (like image chips or random samples), an Export task using the Earth Engine batch system might not be the best way to do it. It can take a couple of minutes just to schedule, start, and stop each task, and you can only run a handful of tasks in parallel, so running a lot of small tasks is very inefficient. Additionally, since the batch system only allows you to queue up 3,000 tasks, trying to extract 100,000 small items can be painful just to orchestrate.

In this example, I’m going to demonstrate how to efficiently download many small pieces of data. This particular example uses ee.Image.getThumbURL() to download a bunch of RGB image chips for offline validation; however, the same technique can work for other types of data (like CSVs or NumPy arrays) using getDownloadURL() or even getInfo().

The basic structure of this tool is a Python program that uses the multiprocessing module to make many requests simultaneously. The program’s skeleton looks like this:
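What follows is a minimal sketch of that skeleton rather than the exact original script: the high-volume endpoint URL and the 25-worker pool come straight from the description below, and the two functions are filled in as the post goes on.

    import logging
    import multiprocessing

    import ee

    # Use the high-volume endpoint for automated, parallel requests.
    ee.Initialize(opt_url='https://earthengine-highvolume.googleapis.com')


    def getRequests():
      """Figures out which items need to be downloaded."""
      ...  # filled in below


    def getResult(index, item):
      """Downloads one item and saves it to a file."""
      ...  # filled in below


    if __name__ == '__main__':
      logging.basicConfig()
      items = getRequests()

      pool = multiprocessing.Pool(25)
      pool.starmap(getResult, enumerate(items))

      pool.close()
      pool.join()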

First, the Earth Engine library needs to be initialized to use the high-volume endpoint. You should use this URL whenever you’re making automated requests. The bulk of the work is then done by two functions (code in the next section):

  • getRequests() figures out which items to download, and
  • getResult() does the downloading and save-to-file for one item

The getRequests() function’s job is to do just enough setup to retrieve a list of work items that need to be downloaded (requests). This typically involves using getInfo() to get a list of features or geometries, or maybe the IDs from an image collection. The trick is to get the smallest amount of data possible, but get all of it in one go (i.e., get a list of IDs instead of the whole collection). Each of those items is then sent to the getResult() function (in parallel) to do the actual downloading.
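For instance, a generic sketch of that idea (the Sentinel-2 collection and filters here are placeholders, not part of the example below):

    import ee

    ee.Initialize(opt_url='https://earthengine-highvolume.googleapis.com')

    # One small getInfo() call that returns only the image IDs of a filtered
    # collection, rather than pulling the whole collection to the client.
    collection = (ee.ImageCollection('COPERNICUS/S2_SR')
                  .filterBounds(ee.Geometry.Point(8.54, 47.37))
                  .filterDate('2021-01-01', '2021-07-01'))
    ids = collection.aggregate_array('system:index').getInfo()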

All of the parallel processing is handled by the multiprocessing module’s Pool class, which I’ve initialized with 25 processes. Pool has a map() function and a starmap() function, both of which will apply my getResult() function in parallel to every item in items. However, map() only passes one argument, and I want getResult() to also receive the item’s index so it can use it when generating a filename for each result. So I’m using the Python built-in enumerate() to turn each work item into a tuple that includes the item’s index, and starmap() will unpack those tuples into arguments for each call to getResult().
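As a toy illustration of that mechanic (not part of the download script itself):

    from multiprocessing import Pool


    def getResult(index, item):
      print(index, item)


    if __name__ == '__main__':
      items = ['a', 'b', 'c']
      # enumerate() yields (0, 'a'), (1, 'b'), (2, 'c'); starmap() unpacks each
      # tuple into the (index, item) arguments of getResult().
      with Pool(3) as pool:
        pool.starmap(getResult, enumerate(items))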

As I previously mentioned, I want to download a bunch of image chips. Specifically, I want 1000 randomly located, 256x256 pixel, RGB images from the USDA’s National Agriculture Imagery Program dataset, in each of the RESOLVE ecoregions that intersect my ROI (for a total of 4,000 images). To minimize the amount of data that getRequests() and getResult() need to share, I’m going to have getRequests() just generate the random sample centroids as a list of points.
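A sketch of what getRequests() might look like for this example (the RESOLVE dataset ID is real; the ROI and the use of randomPoints() are stand-ins for the original sampling code):

    import ee

    ee.Initialize(opt_url='https://earthengine-highvolume.googleapis.com')

    # Hypothetical region of interest.
    ROI = ee.Geometry.Rectangle([-124.5, 35.0, -120.5, 42.0])


    def getRequests():
      """Returns a list of [lon, lat] centroids: 1000 per ecoregion in the ROI."""
      ecoregions = (ee.FeatureCollection('RESOLVE/ECOREGIONS/2017')
                    .filterBounds(ROI))

      def sample(ecoregion):
        # 1000 random points inside the part of each ecoregion that overlaps ROI.
        region = ecoregion.geometry().intersection(ROI, 1)
        return ee.FeatureCollection.randomPoints(region, 1000)

      points = ecoregions.map(sample).flatten()
      # Pull back only the point coordinates, in a single getInfo() call.
      return [f['geometry']['coordinates'] for f in points.getInfo()['features']]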

The getResult() function then takes one of those points and generates an image centered on that location, which is then downloaded as a PNG and saved to a file. This function uses image.getThumbURL() to select the pixels; however, you could also use image.getDownloadURL() if you wanted the output in GeoTIFF or NumPy format.
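A sketch of getResult() along those lines (the NAIP dataset ID and the getThumbURL() call are real; the buffer size, date range, visualization stretch, and file naming are illustrative assumptions):

    import shutil

    import ee
    import requests


    def getResult(index, point):
      """Downloads one 256x256 RGB NAIP chip centered on point ([lon, lat])."""
      point = ee.Geometry.Point(point)
      region = point.buffer(128).bounds()

      image = (ee.ImageCollection('USDA/NAIP/DOQQ')
               .filterBounds(point)
               .filterDate('2018-01-01', '2020-01-01')
               .mosaic()
               .visualize(bands=['R', 'G', 'B'], min=0, max=255))

      # getThumbURL() returns a URL that serves the rendered PNG; swap in
      # getDownloadURL() here instead for GeoTIFF or NumPy output.
      url = image.getThumbURL({
          'region': region,
          'dimensions': '256x256',
          'format': 'png',
      })

      r = requests.get(url, stream=True)
      r.raise_for_status()
      with open('tile_%05d.png' % index, 'wb') as out:
        shutil.copyfileobj(r.raw, out)
      print('Done:', index, flush=True)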

When you’re making lots of requests in parallel, it’s common for some of them to fail due to ‘too many requests’ errors. So it’s very important for getResult() to handle query errors and retries with a backoff. The retry module (also available from pip) makes this very easy with a function decorator.
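Applied to the function above, it looks like this (these particular tries/delay/backoff values are a reasonable choice, not a requirement):

    from retry import retry


    # Retry up to 10 times, starting with a 1-second delay and doubling it on
    # each failure, so transient 'too many requests' errors get absorbed.
    @retry(tries=10, delay=1, backoff=2)
    def getResult(index, point):
      ...  # body as sketched above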

Putting those pieces together gives the whole script. Using 25 parallel processes, it took just under 4 minutes to download and save my 4,000 images.

And here’s another version for downloading a spatially-aggregated time-series for multiple regions (specifically, it computes the time-series of maximum land surface temperature from the MODIS MOD11A2 dataset for all GAUL level-2 regions in South America). The getRequests() function just serves up an ID for each of the regions, and the getResult() function does a spatial aggregation, mapped over the time-series of images. The output is a separate CSV per region.
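A sketch of that variant (the GAUL and MOD11A2 dataset IDs are real; the South America bounding box, the collection version, and the CSV layout are assumptions):

    import csv

    import ee
    from retry import retry

    ee.Initialize(opt_url='https://earthengine-highvolume.googleapis.com')

    # Hypothetical bounding box covering South America.
    SOUTH_AMERICA = ee.Geometry.Rectangle([-85, -56, -33, 13])
    REGIONS = ee.FeatureCollection('FAO/GAUL/2015/level2').filterBounds(SOUTH_AMERICA)


    def getRequests():
      """Returns the list of GAUL level-2 region codes to process."""
      return REGIONS.aggregate_array('ADM2_CODE').getInfo()


    @retry(tries=10, delay=1, backoff=2)
    def getResult(index, regionID):
      """Computes the max-LST time series for one region and writes it to a CSV."""
      region = REGIONS.filter(ee.Filter.eq('ADM2_CODE', regionID)).first()
      collection = ee.ImageCollection('MODIS/061/MOD11A2').select('LST_Day_1km')

      def maxLST(image):
        # Spatially aggregate one 8-day image over the region.
        value = image.reduceRegion(ee.Reducer.max(), region.geometry(), 1000)
        return ee.Feature(None, {
            'date': image.date().format('YYYY-MM-dd'),
            'max_lst': value.get('LST_Day_1km'),
        })

      series = ee.FeatureCollection(collection.map(maxLST)).getInfo()['features']
      with open('lst_%d.csv' % regionID, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['date', 'max_lst'])
        for feature in series:
          props = feature['properties']
          writer.writerow([props['date'], props.get('max_lst')])
      print('Done:', index, flush=True)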

Caveats

  • If you’ve never used Earth Engine from Python, you’ll need to install the earthengine-api package (I use the pip installation method). That will also install the earthengine command line tool. You’ll also need to install the retry package (also from pip).
  • Make sure you’ve got logging turned on, or retry will end up retrying (and hiding) any other errors your script might have.
  • The final results of getDownloadURL() and getThumbURL() are limited to 32MB of data per request. For more than that, you should use the batch system.
  • This script is likely to generate some errors on startup that you can probably ignore (the very first request from each worker might fail), but if you’re getting more than a few errors per second after that, try using a smaller number of pool workers.
  • The division of labor between getRequests() and getResult() is tricky to get right, and it can be tough to decide what to use as a work unit. For instance, you might try to generate a bunch of points with randomPoints() or stratifiedSample(), but if you then refer to one of those points in getResult() by its position or its system:index instead of by its geometry, then each of the download requests would end up having to regenerate the entire collection of random points (from scratch), just to get to the one you’ve specified. The good news is that getting this balance wrong isn’t catastrophic; it’ll just cause the workers to take longer.
  • If you’re just trying to download a bunch of images clipped to an ROI (maybe every Sentinel-2 image over a city), then your getRequests() function might just extract the system:index of every image in the original collection that meets your requirements. Then your getResult() function can filter the original collection down to a single image with collection.filter(ee.Filter.eq("system:index", id)).first(). Unlike the point above, in which this was potentially expensive for a programmatically generated FeatureCollection, this is usually quite cheap for filtered/clipped/transformed image collections. (Note: this is probably not true for an image collection generated by mapping over a list, like a collection of temporal composites. In that case, the best work unit might be a date range.)
  • If you’re just trying to sample bands from one image (or composite), it can be a lot faster to chop your ROI into a grid and use the grid cells as work units, because a single request can then handle all of the points near each other (using, for instance, reduceRegions()); see the sketch after this list. This is potentially faster because any given input tile might only need to be fetched once to serve multiple points. (Of course, this only helps if you have more than 1 point per 256x256 input tile.)
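A sketch of that grid-of-cells approach (the SRTM image, the ROI, the 0.25-degree cell size, and the random sample points are all hypothetical placeholders):

    import csv

    import ee

    ee.Initialize(opt_url='https://earthengine-highvolume.googleapis.com')

    # Hypothetical inputs: an image to sample and a set of sample points in an ROI.
    IMAGE = ee.Image('USGS/SRTMGL1_003')
    ROI = ee.Geometry.Rectangle([-123.0, 37.0, -121.0, 39.0])
    POINTS = ee.FeatureCollection.randomPoints(ROI, 5000, 42)


    def getRequests():
      """Chops the ROI into a simple lon/lat grid; each cell is one work unit."""
      xmin, ymin, xmax, ymax = -123.0, 37.0, -121.0, 39.0
      step = 0.25
      cells = []
      x = xmin
      while x < xmax:
        y = ymin
        while y < ymax:
          cells.append([x, y, x + step, y + step])
          y += step
        x += step
      return cells


    def getResult(index, cell):
      """Samples the image at every point inside one grid cell, in one request."""
      rect = ee.Geometry.Rectangle(cell)
      samples = IMAGE.reduceRegions(
          collection=POINTS.filterBounds(rect),
          reducer=ee.Reducer.first(),
          scale=30).getInfo()['features']

      with open('samples_%04d.csv' % index, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['lon', 'lat', 'value'])
        for s in samples:
          lon, lat = s['geometry']['coordinates']
          writer.writerow([lon, lat, s['properties'].get('first')])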

Lots of thanks to Justin Braaten and Nicholas Clinton for helping to refine this posting and these scripts. I’m just one part of a great team.

Written by Noel Gorelick

I’m a scientist and engineer working at the intersection of technology and nature. I helped send spacecraft to Mars, and co-founded Google Earth Engine.
