Embedding with traj2vec

AIS data is messy. It arrives as a stream of latitude, longitude, timestamps, and a few metadata fields. In that raw form, the records are hard to compare or feed into downstream ML models. What we really want is a compact, fixed-length vector representation of each trajectory: an embedding that captures where and how a vessel moves.

Inspired by word2vec, traj2vec applies the same logic to movement data: instead of predicting the next word, it predicts the next location in a sequence. Just like words gain meaning from context, vessel positions gain meaning from their trajectory history.

The result: trajectories that "look alike" end up close together in embedding space. For instance, two ferries running parallel routes will embed similarly, while a cargo vessel crossing the Gulf of Mexico will sit far away from a fishing boat looping off the coast.

Imports

import os
import h3
import json
import aisdb
import cartopy.feature as cfeature
import cartopy.crs as ccrs
from aisdb.database.dbconn import PostgresDBConn
from aisdb.denoising_encoder import encode_greatcircledistance, InlandDenoising
from aisdb.track_gen import min_speed_filter, min_track_length_filter
from aisdb.database import sqlfcn
from datetime import datetime, timedelta
from collections import defaultdict
from tqdm import tqdm
import pprint
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt

import nest_asyncio
nest_asyncio.apply()

Processing AIS Tracks into Clean Segments

This function pulls raw AIS data from a database, denoises it, splits tracks into time-consistent segments, filters outliers, and interpolates them at fixed time steps. The result is a set of clean, continuous vessel trajectories ready for embedding.
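A condensed sketch of what that function might look like, assuming a local PostgreSQL AIS database; the connection details, date range, and filter thresholds are placeholders, not the original configuration:

from aisdb.database import sqlfcn_callbacks

def get_clean_tracks(dbconn, start, end):
    # Query raw positions, assemble per-vessel tracks, then clean them up.
    qry = aisdb.DBQuery(dbconn=dbconn, start=start, end=end,
                        callback=sqlfcn_callbacks.in_timerange_validmmsi)
    tracks = aisdb.track_gen.TrackGen(qry.gen_qry(fcn=sqlfcn.crawl_dynamic_static),
                                      decimate=False)
    tracks = aisdb.track_gen.split_timedelta(tracks, maxdelta=timedelta(hours=24))  # time-consistent segments
    tracks = encode_greatcircledistance(tracks, distance_threshold=50000,  # metres; drops impossible jumps
                                        speed_threshold=50, minscore=1e-6)
    tracks = min_speed_filter(tracks, minspeed=1)   # knots; removes stationary noise
    tracks = min_track_length_filter(tracks, 10)    # assumed minimum-length threshold
    return aisdb.interp.interp_time(tracks, step=timedelta(minutes=10))  # fixed time steps

dbconn = PostgresDBConn(hostaddr='127.0.0.1', port=5432, user='postgres',
                        password='<password>', dbname='aisdb')  # placeholder credentials
tracks = get_clean_tracks(dbconn, datetime(2021, 1, 1), datetime(2021, 2, 1))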

Loading Region and Grid Shapefiles

We load the study region (Gulf shapefile) and a hexagonal grid (H3 resolution 6). These will be used to map vessel positions into discrete spatial cells: the "tokens" for our trajectory embedding model.

Each trajectory is converted from lat/lon coordinates into H3 hexagon IDs at resolution 6. To avoid redundant entries, we deduplicate consecutive identical cells while keeping the timestamp of first entry. The result is a sequence of discrete spatial tokens with time information โ€” the input format for traj2vec.
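A sketch of that conversion, assuming the h3 v4 Python API (on v3, h3.geo_to_h3 replaces h3.latlng_to_cell); the helper name track_to_h3_seq is ours:

def track_to_h3_seq(track, resolution=6):
    # Map each fix to an H3 cell; collapse consecutive duplicates,
    # keeping the timestamp of first entry into each cell.
    seq = []
    for lat, lon, t in zip(track['lat'], track['lon'], track['time']):
        cell = h3.latlng_to_cell(lat, lon, resolution)
        if not seq or seq[-1][0] != cell:
            seq.append((cell, t))
    return seq

h3_seqs = [track_to_h3_seq(track) for track in tracks]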

Before training embeddings, itโ€™s useful to check how long our AIS trajectories are. The function below computes summary statistics (min, max, mean, percentiles) and plots the distribution of track lengths in terms of H3 cells.
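A minimal version of that check (the helper name and bin count are ours):

def track_length_stats(h3_seqs):
    lengths = np.array([len(seq) for seq in h3_seqs])
    print(f'min={lengths.min()}  max={lengths.max()}  mean={lengths.mean():.1f}')
    print('percentiles (25/50/75/95):', np.percentile(lengths, [25, 50, 75, 95]))
    plt.hist(lengths, bins=50)              # distribution of track lengths
    plt.xlabel('track length (H3 cells)')
    plt.ylabel('number of tracks')
    plt.show()

track_length_stats(h3_seqs)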

In our dataset, the distribution is skewed to the right: most vessel tracks are relatively short, and only a few trajectories are very long.

For a simpler visualization, we also plot trajectories in raw lat/lon space without cartographic features. This is handy for debugging and checking if preprocessing (deduplication, interpolation) worked correctly.
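Something like the following is enough for that sanity check (h3.cell_to_latlng is the v4 accessor for a cell's centroid):

for seq in h3_seqs:
    centroids = [h3.cell_to_latlng(cell) for cell, _ in seq]  # (lat, lng) pairs
    plt.plot([c[1] for c in centroids], [c[0] for c in centroids], linewidth=0.5)
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.show()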

Filtering Out Some Values

We collect all unique H3 IDs from the trajectories and assign each one an integer index. Just like in NLP, we also reserve special tokens for padding, start, and end of sequence. This turns spatial cells into a vocabulary that our embedding model can work with.
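A sketch of the vocabulary build, assuming the usual pad/start/end token conventions:

PAD, BOS, EOS = 0, 1, 2                     # reserved special tokens
vocab = {'<pad>': PAD, '<bos>': BOS, '<eos>': EOS}
for seq in h3_seqs:
    for cell, _ in seq:
        if cell not in vocab:
            vocab[cell] = len(vocab)        # next free integer index
print('vocabulary size:', len(vocab))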

Each vessel track is then mapped from its H3 sequence into an integer sequence (int_seq). We also convert the H3 cells back into lat/lon pairs for later visualization. At this point, the data is ready to be fed into a traj2vec-style model.
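The mapping itself might look like this; whether the start/end tokens are attached here or at batch time is a detail of the original code we are assuming:

def encode_track(seq, vocab):
    int_seq = [BOS] + [vocab[cell] for cell, _ in seq] + [EOS]
    latlons = [h3.cell_to_latlng(cell) for cell, _ in seq]  # cell centroids for visualization
    return int_seq, latlons

encoded = [encode_track(seq, vocab) for seq in h3_seqs]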

Train Test Split + Data Saving

We split the cleaned trajectories into train, validation, and test sets. This ensures our model can be trained, tuned, and evaluated fairly without data leakage.

Each trajectory is written out in multiple aligned formats:

  • .src โ†’ input sequence (all tokens except last)

  • .trg โ†’ target sequence (all tokens except first)

  • .lat / .lon โ†’ raw geographic coordinates (for visualization)

  • .t โ†’ the complete trajectory sequence

This setup mirrors NLP datasets, where models learn to predict the "next token" in a sequence.
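A sketch of the writer, assuming whitespace-separated tokens and one trajectory per line (the helper name write_split and the output directory are ours):

def write_split(name, samples, out_dir='data'):
    os.makedirs(out_dir, exist_ok=True)
    files = {ext: open(os.path.join(out_dir, f'{name}.{ext}'), 'w')
             for ext in ('src', 'trg', 'lat', 'lon', 't')}
    for int_seq, latlons in samples:
        files['src'].write(' '.join(map(str, int_seq[:-1])) + '\n')  # all tokens except last
        files['trg'].write(' '.join(map(str, int_seq[1:])) + '\n')   # all tokens except first
        files['lat'].write(' '.join(f'{lat:.5f}' for lat, _ in latlons) + '\n')
        files['lon'].write(' '.join(f'{lon:.5f}' for _, lon in latlons) + '\n')
        files['t'].write(' '.join(map(str, int_seq)) + '\n')         # complete sequence
    for f in files.values():
        f.close()

write_split('train', train_samples)  # train_samples from the split above; likewise 'val' and 'test'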

Some Other Imports for Training

Training Loop Setup

We set up utility functions to initialize model weights, save checkpoints during training, and run validation. These ensure training is reproducible and models can be restored later. The train() function loads train/val datasets, defines the loss functions (negative log-likelihood or KL-divergence), and builds the encoder-decoder model with its optimizer and scheduler. If a checkpoint exists, training resumes from where it left off; otherwise, parameters are freshly initialized.
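The checkpointing helpers reduce to something like this (the checkpoint keys and file path are assumptions):

import torch

def save_checkpoint(model, optimizer, iteration, path='checkpoint.pt'):
    torch.save({'model': model.state_dict(),
                'optimizer': optimizer.state_dict(),
                'iteration': iteration}, path)

def load_checkpoint(model, optimizer, path='checkpoint.pt'):
    # Resume from a saved checkpoint if one exists; otherwise start fresh.
    if os.path.exists(path):
        ckpt = torch.load(path)
        model.load_state_dict(ckpt['model'])
        optimizer.load_state_dict(ckpt['optimizer'])
        return ckpt['iteration']
    return 0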

Generative + Discriminative Losses

Training uses two objectives:

  • Generative loss (predicting the next trajectory cell, like word prediction in NLP).

  • Discriminative loss (triplet margin loss, ensuring embeddings of similar trajectories are close while different ones are far apart).

These combined losses help the model learn not only to generate realistic trajectories but also to embed them in a useful vector space.
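In PyTorch terms, the combination reduces to something like this; the weighting factor disc_weight is an assumption:

import torch.nn as nn

PAD = 0                                          # padding token index from the vocabulary
nll_loss = nn.NLLLoss(ignore_index=PAD)          # generative: next-cell prediction
triplet_loss = nn.TripletMarginLoss(margin=1.0)  # discriminative: embedding structure

def combined_loss(log_probs, targets, anchor, positive, negative, disc_weight=1.0):
    # log_probs: (batch*seq_len, vocab); targets: (batch*seq_len,)
    # anchor/positive/negative: trajectory embeddings of shape (batch, dim)
    gen = nll_loss(log_probs, targets)
    disc = triplet_loss(anchor, positive, negative)
    return gen + disc_weight * disc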

The loop runs over iterations, logging training progress, validating periodically, and saving checkpoints. A learning rate scheduler adjusts the optimizer based on validation loss, and early stopping prevents wasted computation when no improvements occur.
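The scheduling and early-stopping logic can be sketched as follows; the stand-in model, patience, and decay factor are placeholders:

import torch

model = torch.nn.Linear(8, 8)                    # stand-in model for illustration
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',
                                                       factor=0.5, patience=5)
best_val, bad_checks, patience = float('inf'), 0, 10
for val_loss in [0.9, 0.8, 0.8, 0.81]:           # placeholder validation losses
    scheduler.step(val_loss)                     # lower LR when validation loss plateaus
    if val_loss < best_val:
        best_val, bad_checks = val_loss, 0       # improvement: reset the counter
    else:
        bad_checks += 1
        if bad_checks >= patience:
            break                                # early stopping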

The test() function loads the best checkpoint, evaluates it on the test set, and reports average loss and perplexity. Perplexity is borrowed from NLP โ€” lower values mean the model is more confident in predicting the next trajectory cell.

ARGS

Test

Results

The generative model's performance on the test dataset demonstrates strong accuracy in predicting vessel trajectories. Over the course of the evaluation, the cumulative generative loss grew steadily across batches, as expected for a running sum of per-sequence prediction errors. Aggregated and normalized per token, the average loss was 0.2309, corresponding to a perplexity of approximately 1.26.

Perplexity is a standard measure in sequence modeling that quantifies how well a probabilistic model predicts a sequence of tokens. A perplexity close to 1 indicates near-deterministic prediction, meaning that the model assigns very high probability to the correct next token in the sequence. In the context of vessel trajectories, this result implies that the model is extremely confident and precise in forecasting the next H3 cell in a track, capturing the underlying spatial and temporal patterns in the data.
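The relationship between the two reported numbers is just an exponential:

import math

avg_loss = 0.2309                  # average negative log-likelihood per token (from above)
perplexity = math.exp(avg_loss)    # perplexity = exp(mean NLL)
print(f'{perplexity:.2f}')         # -> 1.26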

These results are particularly noteworthy because vessel movements are constrained by both geography and navigational behavior. The model effectively learns these patterns, predicting transitions between cells with minimal uncertainty. Achieving such a low perplexity confirms that the preprocessing pipeline, H3 cell encoding, and the sequence modeling architecture are all functioning harmoniously, enabling highly accurate trajectory modeling.

Overall, the evaluation demonstrates that the model not only generalizes well to unseen tracks but also reliably captures the deterministic structure of vessel movement, providing a robust foundation for downstream tasks such as trajectory prediction, anomaly detection, or maritime route analysis.
