Land cover classification at the Upper Santa Maria Creek, CA¶

About the Upper Santa Maria Creek Subwatershed¶

The Upper Santa Maria Creek Sub-watershed (hydrologic unit code 180703040203) is an inland headwater sub-watershed in California, (technically) within San Diego County, and is part of the larger San Dieguito River watershed system that runs from the eastern mountain range all the way to the ocean. I say that the Santa Maria Creek Sub-watershed is "technically" within San Diego County, because the town that occupies all the wastershed boundaries is an unincorporated community of San Diego county.

This area is typically dry and we are often subject to flashflooding due to compaction and drought conditions. So often the watershed doesn't appear to be much of a watershed. If you check out the USGS station for this (11028500), we have very little stream flow and occasional larger storm events (like the winter of 2026 where we got 4 inches of rain in a day!).

This sub-watershed lies in north-central Ramona, CA and adjacent uplands, draining generally northwesterly toward the San Pasqual Valley area where Santa Maria Creek meets Santa Ysabel Creek to form the San Dieguito River.

As one might expect, this subwatershed like many others in California is under climate stress. Ramona relies on water pumped up from the lower city of Poway, CA (about 1,000 feet in elevation difference). We have one main resevoir (Lake Ramona), but in general all of or other water is from groundwater. This particular sub-watershed is typically very dry and much of its recharge capabilities has been negatively impacted by the construction of the town in the major of the sub-watershed basin area. This will be very obvious in the maps we will later create in this notebook.

Using this notebook¶

There are 5 steps in this notebook with little "what is this code doing" signposting throughout. There are some notes about clusters and bands later in the notebook as well as additional analysis on the RGB map and kmeans map of this subwatershed. This project requires caching data. As a warning if you run this notebook once and then need to change the site or other data "upstream" you will need to either clear out the cache, or skip that. If you skip caching the notebook will still work as I had a cache bug when computing.

Step 1a: Libraries and GDAL¶

  • We step up libraries and GDAL so we don't have to worry about timeout issues

Step 1b: Run the caching decorator¶

  • In this step, we define a caching decorator that saves the output of computationally intensive functions to disk.

Step 2: Plotting the study site¶

  • In this step, we pull the data for the subwatershed via URL to create a geodataframe(gdf), filter to our particular watershed's shapefile to create a site map.

Step 3: Multispectral data array¶

  • In this step, we use earthaccess to pull spectral band data for our watershed, process the granules from earthaccess, and then open, crop, and mask the data.
  • We then merge and composite our data to make an array.

Step 4: Kmeans clustering¶

  • In this step, we take the multi-band raster data from step 3 and do a kmeans clustering.

Step 5: Plot¶

  • In this step, we plot both a RBG map of the site and a kmeans map of the site side by side.

Analysis¶

  • At the end of the notebook is some general analysis from the plots and other data.
In [1]:
#+++Libraries for caching decorator and debugging+++#
import os
import pickle
import re
import warnings
from pathlib import Path

#+++Libraries for geospatial work and clustering+++#
#++Note++#
#You will need to install earthaccess in terminal. Below is how to do this#
# conda activate [your environment]
# conda install -c conda-forge earthaccess
# python -c "import earthaccess; print(earthaccess.__version__)"
# my kernel also is missing earthpy so that also was installed in terminal
# do the same steps as above but replace with earthpy

import cartopy.crs as ccrs
import earthaccess
import earthpy as et
import geopandas as gpd
import geoviews as gv
import hvplot.pandas
import hvplot.xarray
import numpy as np
import pandas as pd
import rioxarray as rxr
import rioxarray.merge as rxrmerge
#Note#
# This library specifically gives us spiffy progress bars that you will see later
from tqdm.notebook import tqdm
import xarray as xr
from shapely.geometry import Polygon
from sklearn.cluster import KMeans

### set GDAL parameters
os.environ["GDAL_HTTP_MAX_RETRY"] = "5"
os.environ["GDAL_HTTP_RETRY_DELAY"] = "1"

# Half way through I got tired of debugging warnings. The following will keep warning messages 
# from making our output messy or harder to read
### don't show non-critical warnings
warnings.simplefilter('ignore')
c:\Users\kayle\miniconda3\envs\geog\Lib\site-packages\earthpy\__init__.py:7: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  from pkg_resources import resource_string

Step 1b: Run the caching decorator¶

What my code is doing¶

In this step, I define a caching decorator that saves the output of computationally intensive functions to disk. This allows me to avoid re-running expensive operations (such as downloading and processing satellite data) every time the notebook is executed. Despite this you may run into some cache issues like I did later in the notebook. Keep in mind that when caching, if you cache an empty array or dataframe you will keep calling the empty data unless you fix (or delete) the cache.

When the function is called again with the same inputs, the cached result is loaded instead of recomputing everything. This is especially useful when working with remote sensing data, which can take several minutes to process. Depending on how big your watershed is, the time to complete the functions later should ideally be no longer than 10 minutes.

In [2]:
### make the caching decorator
## some of the items here are to insure when we rerun cache cells we don't redo our work
def cached(func_key, override=False):
    """
    A decorator to cache function results
    
    Parameters
    ==========
    key: str
      File basename used to save pickled results
    override: bool
      When True, re-compute even if the results are already stored
    """
    def compute_and_cache_decorator(compute_function):
        """
        Wrap the caching function
        
        Parameters
        ==========
        compute_function: function
          The function to run and cache results
        """
        def compute_and_cache(*args, **kwargs):
            """
            Perform a computation and cache, or load cached result.
            
            Parameters
            ==========
            args
              Positional arguments for the compute function
            kwargs
              Keyword arguments for the compute function
            """
            ### Add an identifier from the particular function call
            if 'cache_key' in kwargs:
                key = '_'.join((func_key, kwargs['cache_key']))
            else:
                key = func_key

            ### define a file path based on the directory structure in earthpy
            path = os.path.join(
                
                ### earthpy directory
                et.io.HOME, 
                
                ### earthpy dataset
                et.io.DATA_NAME, 
                
                ### make a subdirectory called "jars"
                'jars', 
                
                ### use f-string (formatted string) to create a string by embedding the value
                ### of the variable "key" into the string 
                ### use .pickle file extension (a pickle file is a serialized python objecT)
                f'{key}.pickle')
            
            ### Check if the cache exists already or if we should override caching
            if not os.path.exists(path) or override:
                
                ### Make jars directory if needed
                os.makedirs(os.path.dirname(path), exist_ok=True)
                
                ### Run the compute function as the user did
                result = compute_function(*args, **kwargs)
                
                ### Pickle the object (save to file)
                ### open the file at filename
                with open(path, 'wb') as file:
                    
                    ### save the result without needing to recompute when loading
                    ### it back into Python
                    pickle.dump(result, file)
            
            ### if the file already exists/we are not overriding the cache
            else:
               
                ### Unpickle the object (load the cached result)
                with open(path, 'rb') as file:
                    
                    ### use pickle.load to unserialize the file back into a python object
                    result = pickle.load(file)
                    
            return result
        
        return compute_and_cache
    
    return compute_and_cache_decorator

STEP 2: Study site map¶

For this portfolio I want to focus on California, but especially a subwatershed in Southern California.

I looked at mapped boundaries from californianature.cal.gov (https://www.californianature.ca.gov/datasets/CAnature::watersheds-huc12-cdfw/explore?location=33.045728%2C-116.655078%2C10)

From the map, I chose the Upper Santa Maria Creek (HU12 180703040203)

For California the region we need to use is 18.

I chose this subwatershed because its one of the many that overlap with my hometown.

What my code is doing¶

Here, I define the assigned study area using a watershed boundary. This boundary determines the spatial extent of my analysis and is used to clip all satellite imagery.

In [3]:
### assign the hydrologic unit code
huc12 = "180703040203"
In [4]:
### download, unzip, and read the shapefile, and use the caching decorator to store it:

### use the cached decorator to wrap the function we'll make (read_wbd_file) 
@cached("wbd_region18_hu12")

### define a function (read_wbd_file()) to download the watershed boundary data
def read_wbd_file(region="18"):

    ### define the URL to download data
    wbd_url = (
        "https://prd-tnm.s3.amazonaws.com/StagedProducts/Hydrography/WBD/HU2/Shape/"

        ### insert the name of the specific file we want using an f-string
        f"WBD_{region}_HU2_Shape.zip"
    )

    ### download the data and unzip it into the directory we defined earlier
    wbd_dir = Path(et.data.get_data(url=wbd_url, verbose=True))

    ### make path to the shapefile in the directory
    shp_path = next(
        wbd_dir.rglob(

            ### make the shapefile name in any subfolder
            "WBDHU12.shp"
        )
    )

    ### read the shapefile as a gdf
    try:
        wbd_gdf = gpd.read_file(
            shp_path,

            ### use pyogrio library to read the shapefile 
            ### (better performance with large data)
            engine="pyogrio",
        )
    except Exception:
        wbd_gdf = gpd.read_file(shp_path)

    ### return the gdf of the watershed boundaries
    return wbd_gdf
In [5]:
### open the shapefile using the read_wbd_file function that we created
wbd_gdf = read_wbd_file()
Downloading from https://prd-tnm.s3.amazonaws.com/StagedProducts/Hydrography/WBD/HU2/Shape/WBD_18_HU2_Shape.zip
Extracted output to C:\Users\kayle\earth-analytics\data\earthpy-downloads\WBD_18_HU2_Shape
In [6]:
### filter the shapefile to the specific watershed we're using

### define the gdf for the watershed by subsetting the gdf of the whole watershed dataset
huc_col = "huc12" if "huc12" in wbd_gdf.columns else "HUC12"
gdf = (

    ### filter the gdf to the row(s) with the watershed we want
    ### use "dissolve" to merge the geometries of all the rows matching the target watershed
    wbd_gdf.loc[wbd_gdf[huc_col] == huc12]
    .dissolve()
    .reset_index(drop=True)
)
gdf[huc_col] = huc12

### check it out
gdf
Out[6]:
geometry tnmid metasource sourcedata sourceorig sourcefeat loaddate referenceg areaacres areasqkm ... huc12 name hutype humod tohuc noncontrib noncontr_1 shape_Leng shape_Area ObjectID
0 POLYGON ((-116.79668 33.10525, -116.79622 33.1... {3B709BAE-C613-40FE-896C-E661983FFCDD} {511D2AC8-11BA-45FC-AB98-F69D693D4C44} Watershed Boundary Dataset (WBD) Natural Resources and Conservation Service and... None 2024-08-15 249084 29387.89 118.93 ... 180703040203 Upper Santa Maria Creek S AW 180703040204 0.0 0.0 NaN NaN 3803

1 rows × 21 columns

In [7]:
### Make a site map with satellite imagery in the background
gv.extension("bokeh")
site_map = (
    gv.tile_sources.EsriImagery.opts(width=700, height=450)
    * gdf.to_crs(epsg=4326).hvplot(
        geo=True,
        crs=ccrs.PlateCarree(),
        fill_alpha=0,
        line_color="yellow",
        line_width=2,
        title=f"HU12 Watershed {huc12}",
    )
)
site_map
No description has been provided for this image No description has been provided for this image ⓘ
Out[7]:
In [48]:
import holoviews as hv

upper_santa_maria_creek_plot = site_map

hv.save(upper_santa_maria_creek_plot, "upper_santa_maria_creek_watershed.html")

STEP 3: Multispectral data¶

Step 3a: Search for data¶

In this step, I query NASA Earthdata using the earthaccess library to find satellite images (granules) that intersect the assigned study area and time range.

We then extract metadata for each granule, including:

  • acquisition date and time
  • tile ID
  • file links for each spectral band

This creates a structured dataset that I will use to access and process the imagery.

In [8]:
### Log in to earthaccess
auth = earthaccess.login(persist=True)
In [10]:
### Search for HLS granules we want
results = earthaccess.search_data(

    ### specify which dataset and spatial resolution we want 
    short_name="HLSL30",

    ### specify that we're using cloud data
    cloud_hosted=True,

    ### use the bounding box from our watershed boundary
    bounding_box=tuple(gdf.to_crs(epsg=4326).total_bounds),

    ### set the temporal range of the data
    temporal=("2023-05-01", "2023-10-31"),
)
len(results)
Out[10]:
46

Why we retrieve multiple image granules¶

Satellite imagery is typically stored in tiles (granules), each covering a portion of the Earth’s surface. Because the watershed may not align perfectly with a single tile, we retrieve multiple granules that intersect the study area.

These granules are later merged into a single composite image so that we can analyze the watershed as a continuous landscape.

Step 3b: Compile information about each granule¶

In [11]:
### make a function to process all the granules from the earthaccess search
### and extract information for each granule

def process_granules(search_results):

    ### make and display a progress bar
    progress_bar = tqdm(total=len(search_results), desc="Compiling granule metadata")

    ### regex to extract tile id and band from HLS filenames
    tif_regex = re.compile(
        r"\.(?P<tile_id>\w+)\.\d+T\d+\.v\d\.\d\.(?P<band>[A-Za-z0-9]+)\.tif$",
        flags=re.IGNORECASE,
    )

    ### accumulate rows
    link_rows = []

    ### loop over granules
    for result in search_results:

        ### locate metadata
        umm = result["umm"]

        ### pull out granule id
        granule_id = umm.get("GranuleUR")

        ### extract datetime
        granule_datetime = pd.to_datetime(
            umm.get("TemporalExtent", {})
            .get("RangeDateTime", {})
            .get("BeginningDateTime")
        )

        ### extract boundary coordinates
        points = []
        try:
            gpolygons = (
                umm.get("SpatialExtent", {})
                .get("HorizontalSpatialDomain", {})
                .get("Geometry", {})
                .get("GPolygons", [])
            )

            if isinstance(gpolygons, dict):
                gpolygons = [gpolygons]

            if gpolygons:
                boundary = gpolygons[0].get("Boundary", [])
                if isinstance(boundary, dict):
                    boundary = [boundary]
                if boundary:
                    points = boundary[0].get("Points", [])

        except Exception as e:
            print(f"Could not extract polygon for {granule_id}: {e}")
            points = []

        ### make polygon
        if len(points) >= 3:
            geometry = Polygon([(pt["Longitude"], pt["Latitude"]) for pt in points])
        else:
            geometry = None

        ### THIS is the important part:
        ### open through earthaccess and keep the returned file-like objects
        try:
            granule_files = earthaccess.open([result])
        except Exception as e:
            print(f"Could not open granule {granule_id}: {e}")
            progress_bar.update(1)
            continue

        ### loop through each file object in the granule
        for file_obj in granule_files:
            file_match = tif_regex.search(file_obj.full_name)

            if file_match:
                link_rows.append(
                    {
                        "granule_id": granule_id,
                        "datetime": granule_datetime,
                        "geometry": geometry,
                        "url": file_obj,   # keep file-like object, not plain string
                        "tile_id": file_match.group("tile_id"),
                        "band": file_match.group("band"),
                    }
                )

        progress_bar.update(1)

    progress_bar.close()

    granule_gdf = gpd.GeoDataFrame(link_rows, geometry="geometry", crs="EPSG:4326")
    return granule_gdf

Step 3c: Open, crop, and mask data¶

Converting raw imagery to reflectance values¶

Satellite bands represent reflected energy across different portions of the electromagnetic spectrum. These values allow us to distinguish between surface types such as vegetation, water, and soil.

By working with multiple spectral bands, we can capture differences in how land cover types reflect light, which is essential for clustering and classification.

Here, I open each satellite image band, clip it to the watershed, and apply a cloud mask using the Fmask layer.

These steps ensures that:

  • all data are spatially aligned to the study area
  • cloudy or low-quality pixels are removed
  • reflectance values are scaled properly

This produces a clean set of spectral data ready for analysis.

In this portion of the notebook I had to do quite a lot of debugging as I had an issue with an empty cache. You will see the @cached("masked_reflectance) commented out as the workflow functioned without it. This is also why rasterio.errors was imported here. You will also see a print summary block section in the code where I was trying to determine if rasters were being skipped or not. Even if you do not have bugs, this is a nice set of code to keep as it is a easy way to see if everything is working.

In [12]:
from rasterio.errors import RasterioIOError

# @cached("masked_reflectance")   # leave off to avoid odd cache bug
def compute_reflectance_data(search_results, watershed_gdf):

    ### rebuild metadata inside function using earthaccess.open() file objects
    granule_gdf = process_granules(search_results)

    def open_crop_raster(file_obj, watershed_gdf, scale_factor=1.0, masked=True):
        try:
            da = rxr.open_rasterio(file_obj, masked=masked).squeeze(drop=True)
        except RasterioIOError as e:
            print(f"Open failed:\n{file_obj}\n{e}")
            return None
        except Exception as e:
            print(f"Unexpected open failure:\n{file_obj}\n{e}")
            return None

        try:
            watershed_on_da = watershed_gdf.to_crs(da.rio.crs)
            da = da.rio.clip_box(*watershed_on_da.total_bounds)
        except Exception as e:
            print(f"Clip failed:\n{file_obj}\n{e}")
            return None

        return da.astype("float32") * scale_factor

    def get_cloud_mask(fmask_da, mask_bits=(1, 2, 3)):
        bits = np.unpackbits(
            fmask_da.fillna(0).astype(np.uint8).values[..., np.newaxis],
            axis=-1,
            bitorder="little",
        )
        valid = np.prod(bits[..., list(mask_bits)] == 0, axis=-1).astype(bool)
        return xr.DataArray(valid, coords=fmask_da.coords, dims=fmask_da.dims)

    granule_da_rows = []

    total_groups = 0
    missing_fmask = 0
    failed_fmask_open = 0
    attempted_bands = 0
    failed_band_open = 0
    successful_bands = 0

    group_iter = granule_gdf.groupby(["datetime", "tile_id"])

    for (image_datetime, tile_id), tile_df in tqdm(group_iter, desc="Processing granules"):
        total_groups += 1

        fmask_df = tile_df.loc[tile_df.band.str.lower() == "fmask"]
        if fmask_df.empty:
            missing_fmask += 1
            continue

        fmask_da = open_crop_raster(
            fmask_df.iloc[0].url,
            watershed_gdf,
            scale_factor=1.0,
            masked=False,
        )

        if fmask_da is None:
            failed_fmask_open += 1
            continue

        clear_mask = get_cloud_mask(fmask_da)

        spectral_df = tile_df.loc[tile_df.band.str.upper().str.startswith("B")]

        for _, band_row in spectral_df.iterrows():
            attempted_bands += 1

            band_da = open_crop_raster(
                band_row.url,
                watershed_gdf,
                scale_factor=0.0001,
                masked=True,
            )

            if band_da is None:
                failed_band_open += 1
                continue

            band_da.name = band_row.band
            band_da = band_da.where(clear_mask)

            row_data = band_row.to_dict()
            row_data["da"] = band_da
            granule_da_rows.append(row_data)
            successful_bands += 1

    print("\nSummary")
    print("-------")
    print("Total datetime/tile groups:", total_groups)
    print("Groups missing Fmask row:", missing_fmask)
    print("Groups where Fmask failed to open:", failed_fmask_open)
    print("Band rasters attempted:", attempted_bands)
    print("Band rasters failed to open:", failed_band_open)
    print("Successful band rasters:", successful_bands)
    print("Rows appended:", len(granule_da_rows))

    if len(granule_da_rows) == 0:
        print("No valid rasters were processed; returning an empty GeoDataFrame.")
        empty_cols = list(granule_gdf.columns) + ["da"]
        return gpd.GeoDataFrame(
            {col: [] for col in empty_cols},
            geometry="geometry",
            crs=granule_gdf.crs
        )

    granule_da_df = pd.DataFrame(granule_da_rows)

    granule_da_gdf = gpd.GeoDataFrame(
        granule_da_df,
        geometry="geometry",
        crs=granule_gdf.crs
    )

    return granule_da_gdf
In [13]:
### apply the function
granule_da_gdf = compute_reflectance_data(results, gdf)
Compiling granule metadata:   0%|          | 0/46 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
QUEUEING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
PROCESSING TASKS | :   0%|          | 0/15 [00:00<?, ?it/s]
COLLECTING RESULTS | :   0%|          | 0/15 [00:00<?, ?it/s]
Processing granules:   0%|          | 0/46 [00:00<?, ?it/s]
Summary
-------
Total datetime/tile groups: 46
Groups missing Fmask row: 0
Groups where Fmask failed to open: 0
Band rasters attempted: 460
Band rasters failed to open: 0
Successful band rasters: 460
Rows appended: 460
In [ ]:
# Here I am just checking the granules
print("Shape:", granule_da_gdf.shape)
print("Unique tiles:", granule_da_gdf["tile_id"].nunique())
print("Unique dates:", granule_da_gdf["datetime"].nunique())
Shape: (460, 7)
Unique tiles: 2
Unique dates: 24
In [ ]:
# Call the gdf to make sure we have all 46 granules across all bands
granule_da_gdf["band"].value_counts()
Out[ ]:
band
B06    46
B03    46
B11    46
B05    46
B02    46
B09    46
B01    46
B10    46
B07    46
B04    46
Name: count, dtype: int64
In [ ]:
### check out the dataframe for time band and our array too
granule_da_gdf[["datetime", "tile_id", "band", "da"]].head()
Out[ ]:
datetime tile_id band da
0 2023-05-02 18:22:00.594000+00:00 T11SMS B06 [[<xarray.DataArray 'B06' ()> Size: 4B\narray(...
1 2023-05-02 18:22:00.594000+00:00 T11SMS B03 [[<xarray.DataArray 'B03' ()> Size: 4B\narray(...
2 2023-05-02 18:22:00.594000+00:00 T11SMS B11 [[<xarray.DataArray 'B11' ()> Size: 4B\narray(...
3 2023-05-02 18:22:00.594000+00:00 T11SMS B05 [[<xarray.DataArray 'B05' ()> Size: 4B\narray(...
4 2023-05-02 18:22:00.594000+00:00 T11SMS B02 [[<xarray.DataArray 'B02' ()> Size: 4B\narray(...

Step 3d: Merge and Composite Data¶

Creating a composite image¶

Each granule contains only part of the watershed. To analyze the full study area, we merge all intersecting granules into a single continuous raster.

This step ensures that all pixels within the watershed are represented in a consistent spatial grid, allowing for downstream analysis such as clustering.

So in this step, I combine all spectral bands into a single data structure. Each pixel now has a vector of reflectance values across multiple wavelengths.

In [26]:
### apply cache decorator
@cached(f"composite_reflectance_{huc12}", override=True)

### create a function to merge and composite reflectance data from multiple granules
### end result: single, composite reflectance image for each spectral band
def merge_and_composite(granule_da_gdf):

    ### initialize a list to store dfs
    band_composites = []

    ### initialize a list to store composites after procesing
    spectral_bands = sorted(granule_da_gdf.band.unique())

    ### loop over each spectral band
    for band in spectral_bands:
        merged_das = []
        merged_dates = []
        band_df = granule_da_gdf.loc[granule_da_gdf.band == band]

        ### loop over date/time of image acquisition and merge granules for each data
        for image_datetime, date_df in band_df.groupby("datetime"):
            merged_da = rxrmerge.merge_arrays(list(date_df.da))

            ### mask negative values (could be no data or invalid data)
            merged_da = merged_da.where(merged_da >= 0)

            ### append to merged_das list we initialized
            merged_das.append(merged_da)
            merged_dates.append(pd.to_datetime(image_datetime))

        if not merged_das:
            continue

        ### composite images across dates
        merged_stack = xr.concat(merged_das, dim=pd.Index(merged_dates, name="datetime"))
        composite_da = merged_stack.mean(dim="datetime", skipna=True)
        composite_da = composite_da.expand_dims(band=[band])

        ### add processed and composite data array to lsit
        band_composites.append(composite_da)

    ### concatenates composite data arrays for each band along band dimension
    spectral_da = xr.concat(band_composites, dim="band").sortby("band")
    spectral_da.name = "reflectance"
    return spectral_da

Note on the cells below¶

I added a extra print cell to better inspect the array to make sure everything was working properly. I then inspected the min and max values for each band to make sure everything was good there, and to see if (and how many) null values exist across the spectral data array.

In [ ]:
### call function to get final composite reflectance data 
spectral_da = merge_and_composite(granule_da_gdf)
spectral_da
Out[ ]:
<xarray.DataArray 'reflectance' (band: 10, y: 411, x: 723)> Size: 12MB
array([[[0.02323   , 0.02051   , 0.02068182, ..., 0.02800667,
         0.02893334, 0.02663333],
        [0.0272    , 0.02226363, 0.02182727, ..., 0.03052667,
         0.02136   , 0.02165333],
        [0.02383   , 0.02519091, 0.02255454, ..., 0.02534667,
         0.02004   , 0.02010667],
        ...,
        [0.0182    , 0.024575  , 0.022325  , ..., 0.0297375 ,
         0.0295125 , 0.03119375],
        [0.02154167, 0.02566666, 0.0228    , ..., 0.02696875,
         0.02768125, 0.03118125],
        [0.02564167, 0.021125  , 0.01755   , ..., 0.0282875 ,
         0.029525  , 0.030075  ]],

       [[0.02739091, 0.0237    , 0.02630909, ..., 0.03753333,
         0.03829334, 0.03575333],
        [0.03229091, 0.02842727, 0.02795455, ..., 0.04112667,
         0.02968   , 0.02982667],
        [0.02806364, 0.03233636, 0.02918182, ..., 0.03398   ,
         0.02794667, 0.02822   ],
...
        [0.29019165, 0.2897    , 0.28825834, ..., 0.34480622,
         0.34674996, 0.34965625],
        [0.2914    , 0.29025832, 0.28791663, ..., 0.3422937 ,
         0.344625  , 0.3476125 ],
        [0.2916917 , 0.289775  , 0.28661668, ..., 0.33952498,
         0.34243748, 0.34556875]],

       [[0.2868091 , 0.28707272, 0.28841817, ..., 0.31446669,
         0.31547335, 0.31639332],
        [0.28775454, 0.28784546, 0.28925455, ..., 0.31435332,
         0.31426668, 0.31457335],
        [0.28948182, 0.2891909 , 0.29031816, ..., 0.3147733 ,
         0.31355998, 0.3128733 ],
        ...,
        [0.27674165, 0.27655   , 0.2756    , ..., 0.32446876,
         0.326175  , 0.32840624],
        [0.27785   , 0.27693334, 0.27515   , ..., 0.32219994,
         0.324125  , 0.3264625 ],
        [0.27839997, 0.27651668, 0.27376667, ..., 0.3194438 ,
         0.32189375, 0.32453123]]], shape=(10, 411, 723), dtype=float32)
Coordinates:
  * band         (band) object 80B 'B01' 'B02' 'B03' 'B04' ... 'B09' 'B10' 'B11'
  * y            (y) float64 3kB 3.663e+06 3.663e+06 ... 3.651e+06 3.651e+06
  * x            (x) float64 6kB 5.044e+05 5.044e+05 ... 5.26e+05 5.26e+05
    spatial_ref  int64 8B 0
Attributes: (12/35)
    ACCODE:                    Lasrc; Lasrc
    add_offset:                0.0
    arop_ave_xshift(meters):   0, 0
    arop_ave_yshift(meters):   0, 0
    arop_ncp:                  0, 0
    arop_rmse(meters):         0, 0
    ...                        ...
    TIRS_SSM_MODEL:            PRELIMINARY; PRELIMINARY
    TIRS_SSM_POSITION_STATUS:  ESTIMATED; ESTIMATED
    ULX:                       399960
    ULY:                       3700020
    USGS_SOFTWARE:             LPGS_16.2.0
    AREA_OR_POINT:             Area
xarray.DataArray
'reflectance'
  • band: 10
  • y: 411
  • x: 723
  • 0.02323 0.02051 0.02068 0.02136 0.02199 ... 0.317 0.3194 0.3219 0.3245
    array([[[0.02323   , 0.02051   , 0.02068182, ..., 0.02800667,
             0.02893334, 0.02663333],
            [0.0272    , 0.02226363, 0.02182727, ..., 0.03052667,
             0.02136   , 0.02165333],
            [0.02383   , 0.02519091, 0.02255454, ..., 0.02534667,
             0.02004   , 0.02010667],
            ...,
            [0.0182    , 0.024575  , 0.022325  , ..., 0.0297375 ,
             0.0295125 , 0.03119375],
            [0.02154167, 0.02566666, 0.0228    , ..., 0.02696875,
             0.02768125, 0.03118125],
            [0.02564167, 0.021125  , 0.01755   , ..., 0.0282875 ,
             0.029525  , 0.030075  ]],
    
           [[0.02739091, 0.0237    , 0.02630909, ..., 0.03753333,
             0.03829334, 0.03575333],
            [0.03229091, 0.02842727, 0.02795455, ..., 0.04112667,
             0.02968   , 0.02982667],
            [0.02806364, 0.03233636, 0.02918182, ..., 0.03398   ,
             0.02794667, 0.02822   ],
    ...
            [0.29019165, 0.2897    , 0.28825834, ..., 0.34480622,
             0.34674996, 0.34965625],
            [0.2914    , 0.29025832, 0.28791663, ..., 0.3422937 ,
             0.344625  , 0.3476125 ],
            [0.2916917 , 0.289775  , 0.28661668, ..., 0.33952498,
             0.34243748, 0.34556875]],
    
           [[0.2868091 , 0.28707272, 0.28841817, ..., 0.31446669,
             0.31547335, 0.31639332],
            [0.28775454, 0.28784546, 0.28925455, ..., 0.31435332,
             0.31426668, 0.31457335],
            [0.28948182, 0.2891909 , 0.29031816, ..., 0.3147733 ,
             0.31355998, 0.3128733 ],
            ...,
            [0.27674165, 0.27655   , 0.2756    , ..., 0.32446876,
             0.326175  , 0.32840624],
            [0.27785   , 0.27693334, 0.27515   , ..., 0.32219994,
             0.324125  , 0.3264625 ],
            [0.27839997, 0.27651668, 0.27376667, ..., 0.3194438 ,
             0.32189375, 0.32453123]]], shape=(10, 411, 723), dtype=float32)
    • band
      (band)
      object
      'B01' 'B02' 'B03' ... 'B10' 'B11'
      array(['B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B09', 'B10', 'B11'],
            dtype=object)
    • y
      (y)
      float64
      3.663e+06 3.663e+06 ... 3.651e+06
      array([3663195., 3663165., 3663135., ..., 3650955., 3650925., 3650895.],
            shape=(411,))
    • x
      (x)
      float64
      5.044e+05 5.044e+05 ... 5.26e+05
      array([504375., 504405., 504435., ..., 525975., 526005., 526035.], shape=(723,))
    • spatial_ref
      ()
      int64
      0
      crs_wkt :
      PROJCS["WGS 84 / UTM zone 11N",GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4326"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-117],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","32611"]]
      semi_major_axis :
      6378137.0
      semi_minor_axis :
      6356752.314245179
      inverse_flattening :
      298.257223563
      reference_ellipsoid_name :
      WGS 84
      longitude_of_prime_meridian :
      0.0
      prime_meridian_name :
      Greenwich
      geographic_crs_name :
      WGS 84
      horizontal_datum_name :
      World Geodetic System 1984
      projected_crs_name :
      WGS 84 / UTM zone 11N
      grid_mapping_name :
      transverse_mercator
      latitude_of_projection_origin :
      0.0
      longitude_of_central_meridian :
      -117.0
      false_easting :
      500000.0
      false_northing :
      0.0
      scale_factor_at_central_meridian :
      0.9996
      spatial_ref :
      PROJCS["WGS 84 / UTM zone 11N",GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4326"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-117],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","32611"]]
      GeoTransform :
      504360.0 30.0 0.0 3663210.0 0.0 -30.0
      array(0)
  • ACCODE :
    Lasrc; Lasrc
    add_offset :
    0.0
    arop_ave_xshift(meters) :
    0, 0
    arop_ave_yshift(meters) :
    0, 0
    arop_ncp :
    0, 0
    arop_rmse(meters) :
    0, 0
    arop_s2_refimg :
    NONE
    cloud_coverage :
    56
    HLS_PROCESSING_TIME :
    2023-05-04T07:38:45Z
    HORIZONTAL_CS_NAME :
    UTM, WGS84, UTM ZONE 11; UTM, WGS84, UTM ZONE 11
    L1_PROCESSING_TIME :
    2023-05-02T23:18:31Z; 2023-05-02T23:15:06Z
    LANDSAT_PRODUCT_ID :
    LC08_L1TP_040037_20230502_20230502_02_RT; LC08_L1TP_040038_20230502_20230502_02_RT
    LANDSAT_SCENE_ID :
    LC80400372023122LGN00; LC80400382023122LGN00
    long_name :
    Coastal_Aerosol
    MEAN_SUN_AZIMUTH_ANGLE :
    127.223345599132
    MEAN_SUN_ZENITH_ANGLE :
    25.9675732681091
    MEAN_VIEW_AZIMUTH_ANGLE :
    153.774418990156
    MEAN_VIEW_ZENITH_ANGLE :
    2.87200769504312
    NBAR_SOLAR_ZENITH :
    24.3433464290722
    NCOLS :
    3660
    NROWS :
    3660
    OVR_RESAMPLING_ALG :
    NEAREST
    PROCESSING_LEVEL :
    L1TP; L1TP
    scale_factor :
    0.0001
    SENSING_TIME :
    2023-05-02T18:22:00.5947289Z; 2023-05-02T18:22:24.4815329Z
    SENSOR :
    OLI_TIRS; OLI_TIRS
    SENTINEL2_TILEID :
    11SMS
    spatial_coverage :
    100
    SPATIAL_RESOLUTION :
    30
    TIRS_SSM_MODEL :
    PRELIMINARY; PRELIMINARY
    TIRS_SSM_POSITION_STATUS :
    ESTIMATED; ESTIMATED
    ULX :
    399960
    ULY :
    3700020
    USGS_SOFTWARE :
    LPGS_16.2.0
    AREA_OR_POINT :
    Area
In [ ]:
# Just like before we print here to make sure everything looks okay. We have some null values here but we will handles those later 
# before plotting
print(spectral_da.shape)
print("Any nonzero?", bool((spectral_da != 0).any()))
print("Min:", float(spectral_da.min()))
print("Max:", float(spectral_da.max()))
print("Null count:", int(spectral_da.isnull().sum()))
(10, 411, 723)
Any nonzero? True
Min: 0.0
Max: 0.9353333115577698
Null count: 16
In [ ]:
# Here I want to the see the min and max of the values for each band, we will explore this further with clustering
# I just want to make sure there is nothing really odd here either
for b in spectral_da.band.values:
    band = spectral_da.sel(band=b)
    print(
        b,
        "min=", float(band.min()),
        "max=", float(band.max()),
        "nonzero=", int((band != 0).sum()),
        "null=", int(band.isnull().sum()),
    )
B01 min= 0.0 max= 0.49246424436569214 nonzero= 297152 null= 0
B02 min= 0.0033400002866983414 max= 0.5289143323898315 nonzero= 297153 null= 0
B03 min= 0.0080933328717947 max= 0.6199153065681458 nonzero= 297153 null= 0
B04 min= 0.004414285533130169 max= 0.682315468788147 nonzero= 297153 null= 0
B05 min= 0.00029999998514540493 max= 0.7021153569221497 nonzero= 297153 null= 4
B06 min= 0.0 max= 0.8451332449913025 nonzero= 297152 null= 7
B07 min= 0.00044999999227002263 max= 0.9353333115577698 nonzero= 297153 null= 5
B09 min= 0.0008166665793396533 max= 0.004364286083728075 nonzero= 297153 null= 0
B10 min= 0.24478568136692047 max= 0.42134544253349304 nonzero= 297153 null= 0
B11 min= 0.23344282805919647 max= 0.3924363851547241 nonzero= 297153 null= 0

STEP 4: K-means clustering¶

Applying KMeans clustering¶

KMeans groups pixels into clusters based on similarity in their spectral values. Pixels within the same cluster have similar reflectance patterns, which often correspond to similar land cover types.

The number of clusters is user-defined, allowing us to explore different levels of landscape complexity.

I convert the multi-band raster data into a tabular format where:

  • each row represents a pixel
  • each column represents a spectral band

I also remove pixels with missing or zero values to ensure that the clustering algorithm receives clean input data.

In [30]:
### Convert spectral DataArray to a tidy DataFrame
model_df = (
    spectral_da.to_dataframe(name="reflectance")
    .reset_index()
    .pivot(index=["x", "y"], columns="band", values="reflectance")
)

### remove missing values
model_df = model_df.dropna()

### remove pixels where all bands are zero
model_df = model_df.loc[(model_df.sum(axis=1) > 0)]

model_df.head()
Out[30]:
band B01 B02 B03 B04 B05 B06 B07 B09 B10 B11
x y
504375.0 3650895.0 0.025642 0.033417 0.057108 0.059558 0.213975 0.186283 0.119000 0.001158 0.291692 0.278400
3650925.0 0.021542 0.028300 0.049100 0.050400 0.198392 0.172817 0.108717 0.001150 0.291400 0.277850
3650955.0 0.018200 0.023858 0.042358 0.040517 0.195050 0.145400 0.085942 0.001192 0.290192 0.276742
3650985.0 0.016983 0.022633 0.041750 0.038917 0.199042 0.139700 0.079883 0.001042 0.289808 0.276750
3651015.0 0.012230 0.014967 0.033875 0.030000 0.200625 0.138375 0.074400 0.001008 0.291342 0.278542

Now we're reading to fit the k-means clustering model. We can run the fit and prediction functions at the same time because we don't have target data.

What my code is doing¶

Here, I apply k-means clustering to group pixels based on their spectral similarity.

The algorithm assigns each pixel to one of several clusters, where each cluster represents a distinct spectral pattern. These patterns often correspond to different land cover types (e.g., vegetation, water, built environments).

It is important to note that for this subwatershed there will be a lot of heterogeneity towards the edges. I know this because I am familiar with the local area. The center is almost all built environment with grasslands, low shrubs, and oaks towards the edges. Due to this, the most "optimal" cluster structure for the kmeans will likely be between 5-7 clusters. I play around with 5, 6 and 7 clusters to see what fits best. Regardless the outputted maps will be messier than a more homogenous region that is not disturbed by human activities.

In [ ]:
### features only (ignore the clusters column)
X = model_df.copy()

### initialize k-means model 
kmeans_model = KMeans(n_clusters=6, random_state=42, n_init="auto")

### fit model and predict
cluster_labels = kmeans_model.fit_predict(X)

### add clusters back
model_df["clusters"] = cluster_labels

model_df.head()
Out[ ]:
band B01 B02 B03 B04 B05 B06 B07 B09 B10 B11 clusters
x y
504375.0 3650895.0 0.025642 0.033417 0.057108 0.059558 0.213975 0.186283 0.119000 0.001158 0.291692 0.278400 1
3650925.0 0.021542 0.028300 0.049100 0.050400 0.198392 0.172817 0.108717 0.001150 0.291400 0.277850 1
3650955.0 0.018200 0.023858 0.042358 0.040517 0.195050 0.145400 0.085942 0.001192 0.290192 0.276742 1
3650985.0 0.016983 0.022633 0.041750 0.038917 0.199042 0.139700 0.079883 0.001042 0.289808 0.276750 1
3651015.0 0.012230 0.014967 0.033875 0.030000 0.200625 0.138375 0.074400 0.001008 0.291342 0.278542 1

Clarifying Bands vs Clusters¶

The bands (e.g., B01, B02, B03) represent different wavelengths of light measured by the satellite sensor. Each band captures a specific part of the electromagnetic spectrum, such as visible light or near-infrared.

The cluster numbers (e.g., 0, 1, 2, 3, 4) are not physical measurements. Instead, they are labels created by the k-means algorithm to group pixels with similar spectral characteristics. When we change the number of clusters in the kmeans section, the number of clusters under each band will change. Right now we have 6 clusters (0, 1, 2, 3, 4, 5) but you could change it and rerun these cells to see what differences might pop up.

In [ ]:
# In the above cell I specified 6 clusters. I want to actually inspect them furhter to see what potential land cover features
# we can infer from them (e.g., water, vegetation, built environment) before we get to the RBG and Kmeans maps
cluster_summary = (
    model_df.groupby("clusters")
    .mean()
    .reset_index()
)

cluster_summary
Out[ ]:
band clusters B01 B02 B03 B04 B05 B06 B07 B09 B10 B11
0 0 0.032548 0.043107 0.069339 0.090922 0.189569 0.265964 0.188226 0.001647 0.355648 0.334234
1 1 0.016472 0.022893 0.042723 0.045080 0.176095 0.152530 0.094621 0.001419 0.306055 0.290713
2 2 0.024953 0.033707 0.057382 0.068760 0.184276 0.208346 0.142976 0.001534 0.333516 0.314948
3 3 0.053037 0.070494 0.111710 0.141235 0.279038 0.321384 0.239069 0.001618 0.370657 0.347376
4 4 0.043205 0.057961 0.092269 0.121504 0.236178 0.311877 0.224460 0.001666 0.370873 0.347318
5 5 0.035028 0.049063 0.082814 0.100561 0.236223 0.237609 0.170229 0.001519 0.355898 0.334728
In [ ]:
# Here instead of a table, I plot the values we got above for each the clusters so there are 6 lines plotted over the bands
# You will immediately notice nearly no reflectance for band 9 and band 10 is the highest of the bunch
import matplotlib.pyplot as plt

cluster_summary.set_index("clusters").T.plot(figsize=(8, 5))
plt.title("Spectral Profiles by Cluster")
plt.ylabel("Reflectance")
plt.xlabel("Band")
plt.legend(title="Cluster")
plt.show()
No description has been provided for this image

Basic insights from kmeans clustering before plotting¶

The KMeans clustering applied to the multispectral dataset reveals distinct groupings of pixels based on their spectral reflectance, allowing us to make educated guesses about the land cover types and variability within the subwatershed. For my particular choice of 6 clusters, clustering is driven primarily by differences across red (B04), near-infrared (B05), and shortwave infrared (B06–B07), which are known to capture vegetation health and moisture conditions.

B10 is very interesting as it tells us for this time of year (summer), as we would expect, there is moisture and heat stress as this band (thermal IR) captures surface temperature better. So a high value means that the vegetation is under stress of some amount.

STEP 5: Plot¶

Mapping cluster results¶

After clustering, we convert the results back into a spatial format so they can be visualized as a map.

This allows us to examine how different spectral groups are distributed across the watershed and identify spatial patterns such as vegetation zones, water bodies, or disturbed areas.

So in this final step, I visualize the clustering results alongside an RGB image of the same area.

Comparing the k-means clusters to the RGB image helps us interpret what each cluster represents in real-world terms.

In [45]:
### make data array with bands to use for rgb: red, green, and blue
rgb_da = spectral_da.sel(band=["B04", "B03", "B02"]).sortby(["x", "y"]) * 10
rgb_plot = rgb_da.clip(min=0, max=1).hvplot.rgb(
    x="x",
    y="y",
    xlabel="Easting",
    ylabel="Northing",
    bands="band",
    aspect="equal",
    width=500,
    height=350,
)
rgb_plot
Out[45]:
In [46]:
cluster_xr = (
    model_df["clusters"]
    .to_xarray()
    .transpose("y", "x")
    .sortby("x")
    .sortby("y")
)

kmeans_plot = cluster_xr.hvplot(
    x="x",
    y="y",
    cmap="Colorblind",
    aspect="equal",
    xlabel="Easting",
    ylabel="Northing",
    width=500,
    height=350,
).opts(
    xrotation=45
)

(rgb_plot + kmeans_plot)
Out[46]:
In [47]:
import holoviews as hv

final_plot = rgb_plot + kmeans_plot

hv.save(final_plot, "kmeans_rgb_comparison.html")

What do our RBG and KMeans plots tell us?¶

We have six (mostly discrete) clusters on our KMeans map these include:

  • Cluster 0-1 (deep blue/orange)
  • Cluster 1-2 (orange/yellow)
  • Cluster 2-3 (green/light blue)
  • Cluster 3-4 (light blue/red)
  • Cluster 4-5 (pink/black)
  • Cluster 5 (black)

In order we see the following vegetation (or non-vegetation) patterns. Since we know there is a lot of built environment in the center of the sub-watershed we will see some stark differences between some clusters more than others.

Cluster 0/1 – Likely Vegetation (High NIR, Low Red)¶

Cluster 0/1 likely represents areas of vegetation. This interpretation is supported by higher reflectance in the near-infrared band (B05), which is characteristic of chlorophyll-rich vegetation, and lower reflectance in the red band (B04), where vegetation absorbs light for photosynthesis.

Spatially, this cluster dominates the upland areas and mountainsides of our maps. As I know the area, this cluster is actually grouping some water features (small rivers) but this is because most of these small rivers are in deep mountain ravines that are overshadowed with oaks and other dense trees at the banks. So on imagery it is actually very difficult for spectral imagery to capture the river as "water."

Cluster 1/2 – Likely Moderate Vegetation / Mixed Cover¶

Cluster 1 appears to also represent moderately vegetated or mixed land cover areas. We have no yellow on this map despite the suggested gradient between clusters 1 and 2. So the orange is what we predominantly see on the KMeans map. We can see that this cluster is also mostly on the upland areas and mountain sides.

Cluster 2/3 – Likely Sparse Vegetation / Disturbed Areas¶

Cluster 2 likely captures areas with sparse vegetation or disturbance. Lower NIR reflectance combined with higher visible-band reflectance suggests reduced vegetation cover and greater exposure of soil or non-vegetated surfaces. We can see this on the RGB map where these areas do appear to be bare soils.

Cluster 3/4 – Built environments¶

As is apparent on the RBG map, we have a lot of white/bright tan in the center of the map. For the KMeans map most of this area ranges between light blue to pink. This plotted area actually matching up with the footprint of Ramona, CA quite well.

Cluster 4/5 – Dry Soil / Bare Ground (High SWIR, Higher Red)¶

This cluster range is still capturing the built environment and also more bare soil.

Cluster 5 – Mixed or Transitional¶

Cluster 5 appears to capture more ambiguous or transitional spectral signatures. I am actually not sure what these areas are as we do not have tall buildings (higher than 2 stories) so I don't think it is shadow effects. Because KMeans forces all pixels into discrete groups, some clusters inevitably represent intermediate or mixed conditions. This cluster likely reflects that complexity rather than a single dominant land cover type.

Takeaway¶

Taken together, the six clusters reveal that the watershed is characterized by a continuum of vegetation density and built environments. The clustering highlights:

  • A gradient from some somewhat denser vegetation on mountain sides to sparse or bare ground or built areas
  • Clear differentiation between wetter and drier areas (clusters 0, 1, 2 v. 3, 4, 5)
  • Transitional zones that suggest landscape heterogeneity

Importantly, these results demonstrate the limitations of unsupervised classification: because cluster labels are algorithmically assigned, interpretation remains inferential and should be validated with additional data (e.g., land cover maps or field observations).

Overall, the KMeans clustering provides a meaningful first step in understanding spatial variability within the sub-watershed and offers a strong foundation for further ecological or land cover analysis. I can see KMeans being useful to do before doing any groundtruthing, but in countries that have very good data (Global North countries), I don't see KMeans being as useful. For countries lacking good data, KMeans may actually be more helpful.