Landsat & Sentinel-2 Data on AWS
This post will show you how to leverage Cloud-Optimized GeoTIFFs and the AWS free tier to access satellite imagery using Python.

Before the recent development of Cloud-Optimized GeoTIFFs (COGs), working with high-resolution satellite imagery required downloading massive datasets that, most of the time, included a lot of data you didn’t need. More often than not, a user wants to access only a few bands in a small area at a time, instead of a full multispectral scene.
With Cloud-Optimized GeoTIFFs, we can now query a web server just for the information we require. This has made working with high-resolution imagery substantially easier (and cheaper, as well!).
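To see why this works, it helps to know that a COG is internally tiled and keeps its index near the start of the file, so a client can read that index first and then request only the byte ranges of the tiles it needs. As a rough, hedged illustration (the URL below is just a placeholder), this boils down to ordinary HTTP range requests:
import requests

# Placeholder URL, for illustration only; any COG served over HTTP works the same way
url = "https://example.com/some-scene-B04.tif"

# Fetch only the first 16 KB, which typically contains the GeoTIFF header and tile index.
# A COG-aware reader (like rasterio/GDAL, used later in this post) then requests
# just the byte ranges of the tiles covering the area of interest.
resp = requests.get(url, headers={"Range": "bytes=0-16383"})
print(resp.status_code)  # 206 Partial Content if the server honors range requests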
You can get the complete code for this tutorial in this repository.
Cost of satellite imagery on AWS
Public datasets on AWS like Landsat and Sentinel-2 are stored in requester-pays buckets. This means you pay for data transfer and processing above the free-tier quotas, but those quotas should suffice for prototyping and many other use cases, making access to this data virtually free in practice.
In particular, the AWS always-free tier was recently expanded to 100 GB per month of network egress.
What’s important about this is that you can prototype your code for free, downloading and processing data locally, and then deploy it into production using AWS services like AWS Lambda, which also have a convenient free tier.
Other reasons why AWS can be a good fit for many geospatial use cases:
- It leaves you free to choose the programming languages and libraries you use.
- It allows you to rapidly transition from prototyping to production.
- It is based on open-source tools, which means there is less vendor lock-in.
At least at the time of writing, none of this seems to be possible with Google Earth Engine (GEE). As I’ve discussed in this post, GEE imposes many restrictions in return for a totally free service, which is understandable.
Downloading Landsat and Sentinel-2 data from AWS
OK, now that we have covered why relying on COGs on AWS is convenient, let’s see how to actually do it.
Be aware that cloud-native geospatial development is still a rapidly changing field. Not surprisingly, up-to-date documentation is scarce, and I had to wade through lots of outdated tutorials before I could start working with this technology.
Step 1: Create and configure an AWS account and the AWS CLI
The first thing you need is an AWS account. Remember that Amazon wants everybody’s credit card information, even from users of free-tier services.
This step has been covered over and over in other places, like here on the AWS site.
You will also need to set up the AWS Command Line Interface. To do so, you can follow these instructions.
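Once the CLI is configured with your credentials, a quick sanity check (sketched here with boto3, which we will install and use later in this post anyway) is to ask AWS who you are:
import boto3

# If your credentials are picked up correctly, this prints your 12-digit account id
sts = boto3.client("sts")
print(sts.get_caller_identity()["Account"])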
Step 2: Find a STAC server and explore collections
Now we need to find which Landsat or Sentinel scenes we are interested in. Here another recent development helps us: the SpatioTemporal Asset Catalog (STAC) standard.
As positive as this development is, the standardization process regrettably seems to be taking some time: some servers implement only a portion of the standard, and there are certain client/server incompatibilities.
The first important thing is to obtain the URLs of actual, functioning STAC servers. So far, I’ve found the following:
- For Sentinel-2, the Element84 team maintains the earth-search catalog on AWS at https://earth-search.aws.element84.com/v0.
- For Landsat, we can use the official USGS STAC server at https://landsatlook.usgs.gov/stac-server, which includes s3 file locations in its metadata.
- For many other collections, the Microsoft Planetary Computer has a STAC server at https://planetarycomputer-staging.microsoft.com/api/stac/v1.
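Before pointing a client at any of these, a quick way to confirm that a server is alive (and see which STAC version it reports) is to fetch its landing page directly. Here is a minimal sketch with the requests package; the exact fields returned can vary between servers, hence the .get() calls:
import requests

stac_roots = [
    "https://earth-search.aws.element84.com/v0",
    "https://landsatlook.usgs.gov/stac-server",
    "https://planetarycomputer-staging.microsoft.com/api/stac/v1",
]

for url in stac_roots:
    landing = requests.get(url).json()
    # Most landing pages report an id/title and the STAC version they implement
    print(url, "->", landing.get("stac_version"), landing.get("title") or landing.get("id"))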
To query a STAC server, we are going to have to use two Python packages:
- pystac-client for the USGS STAC server (official docs)
- sat-search for Sentinel data on AWS’s earth-search
To install these packages:
pip install pystac-client sat-search
The reason to use two clients is that, at the time of writing, the earth-search API endpoint we need for the Sentinel data doesn’t seem to be running a server version compatible with the latest STAC standard, which the more actively developed pystac-client requires. The sat-search package, on the other hand, continues to work with the older server version.
Using two different API clients is far from optimal, in particular as sat-search no longer seems to be actively developed. However, it seems like the reasonable thing to do while we wait for an update of the earth-search server.
Let’s start working with pystac-client. The first step is to connect to a STAC server and query the available collections, for which we do:
from pystac_client import Client

LandsatSTAC = Client.open("https://landsatlook.usgs.gov/stac-server", headers=[])

for collection in LandsatSTAC.get_collections():
    print(collection)
Choose the collection of your interest. In my case, I selected landsat-c2l2-sr, as that data is atmospherically corrected.
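If you want a closer look at a collection before committing to it, you can fetch it by id. This sketch assumes a reasonably recent pystac-client, which exposes get_collection, and reuses the LandsatSTAC client from above:
# Inspect the chosen collection's description and temporal extent
collection = LandsatSTAC.get_collection("landsat-c2l2-sr")
print(collection.title)
print(collection.description)
print(collection.extent.temporal.intervals)  # time range covered by the collection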
As for satsearch, we can check that the connection to the server is working with the following simple command:
import satsearch

SentinelSTAC = satsearch.Search.search(url="https://earth-search.aws.element84.com/v0")
print("Found " + str(SentinelSTAC.found()) + " items")
Step 3: Look for scenes which intersect an area of interest
Now that we know which collection we are interested in, the next thing is to look for scenes which intersect a given area and a time range.
Importantly, at least at the time of writing, not all STAC servers support the search functionality; some are static-only servers. With those, we would have to fetch their entire catalog and search for the scenes of interest locally. Fortunately, the STAC server at USGS does support this dynamic feature.
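For reference, working with a static-only catalog looks roughly like the sketch below: the catalog URL is a placeholder, and since there is no search endpoint, all filtering happens on our side with pystac.
from pystac import Catalog

# Placeholder URL; a static catalog is just a tree of JSON files with no /search endpoint
catalog = Catalog.from_file("https://example.com/catalog/catalog.json")

# Walk the whole tree and filter items locally (here, by year only, as an example)
for item in catalog.get_all_items():
    if item.datetime is not None and 2019 <= item.datetime.year <= 2021:
        print(item.id, item.datetime.date())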
We first define our area of interest with a geojson file (which can be produced interactively at geojson.io):
from json import load
file_path = "my-area.geojson"
file_content = load(open(file_path))
geometry = file_content["features"][0]["geometry"]
timeRange = '2019-06-01/2021-06-01'
In case you don’t want to work with geojson files, and want to define a polygon within Python, you can also do that. For example, when I’m interested in creating a small square around a given point, I use this function:
def BuildSquare(lon, lat, delta):
    c1 = [lon + delta, lat + delta]
    c2 = [lon + delta, lat - delta]
    c3 = [lon - delta, lat - delta]
    c4 = [lon - delta, lat + delta]
    geometry = {"type": "Polygon", "coordinates": [[c1, c2, c3, c4, c1]]}
    return geometry

geometry = BuildSquare(-59.346271, -34.233076, 0.04)
Once we have defined our geometry and timeRange, we prepare the parameters and query the STAC server:
LandsatSearch = LandsatSTAC.search(
    intersects=geometry,
    datetime=timeRange,
    query=['eo:cloud_cover<95'],  # keep scenes with less than 95% cloud cover
    collections=["landsat-c2l2-sr"])

Landsat_items = [i.to_dict() for i in LandsatSearch.get_items()]
print(f"{len(Landsat_items)} Landsat scenes fetched")
Finally, we can now process the metadata to extract the location of the images.
This code will print the URLs of the red band of the Landsat collection we just found:
for item in Landsat_items:
    red_href = item['assets']['red']['href']
    red_s3 = item['assets']['red']['alternate']['s3']['href']
    print(red_href)
    print(red_s3)
Notice that, for Landsat data from the USGS STAC server, the default URL points to the USGS site, so you could actually get the data directly from them without needing an AWS account at all.
However, if you later want to do cloud-based data processing on AWS (like we will do in subsequent posts!), it will be convenient to rely on the AWS copy of the data. So, for the remainder of this post, we’ll just stick with the s3 location.
For sat-search, the syntax is slightly different.
SentinelSearch = satsearch.Search.search(
    url="https://earth-search.aws.element84.com/v0",
    intersects=geometry,
    datetime=timeRange,
    collections=['sentinel-s2-l2a-cogs'])

Sentinel_items = SentinelSearch.items()
print(Sentinel_items.summary())

for item in Sentinel_items:
    red_s3 = item.assets['B04']['href']
    print(red_s3)
If you want to explore what’s in each item, you can do:
item = Sentinel_items[0]
print(item.assets.keys())
dict_keys(['thumbnail', 'overview', 'info', 'metadata', 'visual', 'visual_20m', 'visual_60m', 'B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B8A', 'B09', 'B11', 'B12', 'AOT', 'WVP', 'SCL'])
Interesting things to look at include, of course, the metadata, and the Scene Classification Layer (SCL), which in particular includes cloud detection (see also the algorithm documentation).
Step 4: Open the desired GeoTIFF with RasterIO
The next step is to open just the area of interest of the Cloud-Optimized GeoTIFF. We will resort to RasterIO, which will need to access an s3 resource on AWS. It is necessary to install the optional s3 component of RasterIO (see docs) and, on some systems, configure SSL certificates.
pip install rasterio[s3]
import os
import boto3
import rasterio as rio
os.environ['CURL_CA_BUNDLE'] = '/etc/ssl/certs/ca-certificates.crt'
print("Creating AWS Session")
aws_session = rio.session.AWSSession(boto3.Session(), requester_pays=True)
An additional thing to solve is the coordinates-to-pixels transformation for Landsat and Sentinel imagery. To do this, we can resort to the pyproj package, which handles geographical coordinates and transformations.
We summarize what we need in the following function:
from pyproj import Transformer

def getSubset(geotiff_file, bbox):
    with rio.Env(aws_session):
        with rio.open(geotiff_file) as geo_fp:
            # Calculate pixels with PyProj
            Transf = Transformer.from_crs("epsg:4326", geo_fp.crs)
            lat_north, lon_west = Transf.transform(bbox[3], bbox[0])
            lat_south, lon_east = Transf.transform(bbox[1], bbox[2])
            x_top, y_top = geo_fp.index(lat_north, lon_west)
            x_bottom, y_bottom = geo_fp.index(lat_south, lon_east)
            # Define window in RasterIO
            window = rio.windows.Window.from_slices((x_top, x_bottom), (y_top, y_bottom))
            # Actual HTTP range request
            subset = geo_fp.read(1, window=window)
    return subset
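Before looping over the full catalog, it can be worth testing the function on a single scene. This sketch assumes the Landsat_items list from Step 3 and uses rasterio’s bounds helper (also used in the next step) to turn our GeoJSON geometry into a bounding box:
from rasterio.features import bounds

bbox = bounds(geometry)  # (west, south, east, north)

# Quick test on a single Landsat red band before processing everything
test_href = Landsat_items[0]['assets']['red']['alternate']['s3']['href']
subset = getSubset(test_href, bbox)
print(subset.shape, subset.dtype)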
Step 5: Process Your Data Locally
As a final step, we will go through the image catalog we found, download our area of interest, and perform some local processing, which in this example amounts to computing the NDVI and saving an image.
def plotNDVI(nir, red, filename):
    # Cast to float to avoid unsigned-integer wraparound when red > nir
    ndvi = (nir.astype(float) - red) / (nir + red)
    ndvi[ndvi > 1] = 1
    plt.imshow(ndvi)
    plt.savefig(filename)
    plt.close()
Of course, in a real application we would be using much more advanced techniques, like image segmentation or machine learning. What I want to stress in the present guide is that we can prototype those algorithms locally and later deploy them to the cloud (a topic we will cover in a subsequent post).
Now the only thing left is to loop over the catalog and process all the files, which we do as follows:
from rasterio.features import bounds
import matplotlib.pyplot as plt

bbox = bounds(geometry)

for i, item in enumerate(Sentinel_items):
    red_s3 = item.assets['B04']['href']
    nir_s3 = item.assets['B08']['href']
    date = item.properties['datetime'][0:10]
    print("Sentinel item number " + str(i) + "/" + str(len(Sentinel_items)) + " " + date)
    red = getSubset(red_s3, bbox)
    nir = getSubset(nir_s3, bbox)
    plotNDVI(nir, red, "sentinel/" + date + "_ndvi.png")
for i, item in enumerate(Landsat_items):
    red_s3 = item['assets']['red']['alternate']['s3']['href']
    nir_s3 = item['assets']['nir08']['alternate']['s3']['href']
    date = item['properties']['datetime'][0:10]
    print("Landsat item number " + str(i) + "/" + str(len(Landsat_items)) + " " + date)
    red = getSubset(red_s3, bbox)
    nir = getSubset(nir_s3, bbox)
    plotNDVI(nir, red, "landsat/" + date + "_ndvi.png")
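As an aside, the SCL asset mentioned earlier can be used to discard cloudy acquisitions before computing the NDVI. The sketch below assumes the standard Sentinel-2 scene classification codes (3 cloud shadow, 8 and 9 medium- and high-probability cloud, 10 thin cirrus); note that SCL is a 20 m product, so its subset has half the resolution of B04/B08 and would need resampling for a pixel-exact mask.
import numpy as np

# Estimate the cloudy fraction of the area of interest from the SCL band
item = Sentinel_items[0]
scl = getSubset(item.assets['SCL']['href'], bbox)
cloudy = np.isin(scl, [3, 8, 9, 10])  # cloud shadow, clouds, thin cirrus
print(f"{100 * cloudy.mean():.1f}% of the area flagged as cloud or shadow")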
Step 6: Deploy on an EC2 instance
After you have finished testing, you might want to run your code directly on AWS. This will substantially reduce latency and also give you access to more computing power.
After launching an EC2 instance and logging in, you will need to install the required dependencies as before. Assuming you are using an Amazon Linux 2 instance, it should be straightforward:
pip3 install sat-search pyproj rasterio[s3]
Now, an extremely simple way to deploy a Python script to an EC2 instance is to upload it using scp:
scp -i security-key-file.pem sentinel-script.py ec2-user@ec2_elastic_ip:/home/ec2-user/
Running the script could result in the following error:
rasterio.errors.RasterioIOError: Illegal characters found in URL
This can happen because we haven’t configured our EC2 instance to be able to access s3 buckets. See this video tutorial about how to do it the proper way (establishing suitable Roles in the IAM console). Note: you shouldn’t use aws configure and enter your credentials on the instance, because that is terrible from a security point of view.
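With an instance role attached, boto3 (and therefore rasterio’s AWSSession) picks up temporary credentials from the instance metadata automatically, so no keys ever need to be stored on the instance. A quick check you could run from the instance:
import boto3

# The ARN should show the IAM role attached to the instance, not a stored access key
print(boto3.client("sts").get_caller_identity()["Arn"])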
Also, on Amazon Linux 2, the CA certificates were stored in a different location than on my local computer, so after digging around a little, I found them:
os.environ['CURL_CA_BUNDLE'] = '/etc/ssl/certs/ca-bundle.crt'
After this, your Python code should run successfully.
Conclusion
Hopefully this tutorial has cleared up some basic questions and helped you get started with Cloud-Optimized GeoTIFFs on AWS S3, using Python.
Remember that you can get the complete code for this tutorial here.
I plan to write more tutorials on Cloud-Native Geospatial Development. One way to get a notification whenever new content comes out is to subscribe to my mailing list (I only send one e-mail per month, at most).