When loading data into the database, messages are sorted into SQL tables determined by the message type and month. The names of these tables follow the format below, where {YYYYMM} indicates the table's year and month in the format YYYYMM.
ais_{YYYYMM}_static # table with static AIS messages
ais_{YYYYMM}_dynamic # table with dynamic AIS messages
Some additional tables containing computed data may be created depending on the indexes used, for example, an aggregate of vessel static data by month, or a virtual table used as a covering index.
static_{YYYYMM}_aggregate # table of aggregated static vessel data
Additional tables are also included for storing data not directly derived from AIS message reports.
coarsetype_ref # a reference table that maps numeric ship type codes to their descriptions
hashmap # a table of file checksums recorded during decoding
For quick reference to data types and detailed explanations of these table entries, please see the Detailed Table Description.
In addition to querying the database using the DBQuery module, you can customize queries with your own SQL code.
Example of listing all the tables in your database:
import sqlite3
dbpath='YOUR_DATABASE.db' # Define the path to your database
# Connect to the database
connection = sqlite3.connect(dbpath)
# Create a cursor object
cursor = connection.cursor()
# Query to list all tables
query = "SELECT name FROM sqlite_master WHERE type='table';"
cursor.execute(query)
# Fetch the results
tables = cursor.fetchall()
# Print the names of the tables
print("Tables in the database:")
for table in tables:
    print(table[0])
# Close the connection
connection.close()
As messages are separated into tables by message type and month, queries spanning multiple message types or months should use UNIONs and JOINs to combine results as appropriate.
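For instance, here is a minimal sketch of combining two monthly dynamic tables with UNION ALL; it assumes tables for January and February 2018 exist in your database, so adjust the table names to match your data:
import sqlite3

connection = sqlite3.connect('YOUR_DATABASE.db')
cursor = connection.cursor()
# Combine position reports from two consecutive months into one result set
query = """
SELECT mmsi, time, longitude, latitude FROM ais_201801_dynamic
UNION ALL
SELECT mmsi, time, longitude, latitude FROM ais_201802_dynamic;
"""
cursor.execute(query)
rows = cursor.fetchall()
print(f"Total rows across both months: {len(rows)}")
connection.close()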
Example of querying tables with `JOIN`:
import sqlite3
# Connect to the database
connection = sqlite3.connect('YOUR_DATABASE.db')
# Create a cursor object
cursor = connection.cursor()
# Define the year and month of the tables to query (placeholder; '201801' is just an example value)
YYYYMM = '201801'
# Define the JOIN SQL query
query = f"""
SELECT
d.mmsi,
d.time,
d.longitude,
d.latitude,
d.sog,
d.cog,
s.vessel_name,
s.ship_type
FROM ais_{YYYYMM}_dynamic d
LEFT JOIN ais_{YYYYMM}_static s ON d.mmsi = s.mmsi
WHERE d.time BETWEEN 1707033659 AND 1708176856 -- Filter by time range
AND d.longitude BETWEEN -68 AND -56 -- Filter by geographical area
AND d.latitude BETWEEN 45 AND 51.5;
"""
# Execute the query
cursor.execute(query)
# Fetch the results
results = cursor.fetchall()
# Print the results
for row in results:
print(row)
# Close the connection
connection.close()
More information about SQL queries can be found in online tutorials.
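Ship-type codes in the static tables can be resolved to human-readable descriptions by joining against coarsetype_ref. The coarsetype_ref column names used below are assumptions for illustration; verify them first, for example with PRAGMA table_info(coarsetype_ref):
import sqlite3

connection = sqlite3.connect('YOUR_DATABASE.db')
cursor = connection.cursor()
# NOTE: coarse_type and coarse_type_txt are assumed column names; check your schema before running
query = """
SELECT s.mmsi, s.vessel_name, c.coarse_type_txt
FROM ais_201801_static s
LEFT JOIN coarsetype_ref c ON s.ship_type = c.coarse_type
LIMIT 10;
"""
cursor.execute(query)
for row in cursor.fetchall():
    print(row)
connection.close()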
The R* tree virtual tables should be queried for AIS position reports instead of the default tables. Query performance can be significantly improved using the R* tree index when restricting output to a narrow range of MMSIs, timestamps, longitudes, and latitudes. However, querying a wide range will not yield much benefit. If custom indexes are required for specific manual queries, these should be defined on message tables 1_2_3, 5, 18, and 24 directly rather than on the virtual tables.
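As a rough illustration only: the sketch below restricts the query to a narrow time and coordinate window. The virtual-table name and its paired min/max column names are assumptions and vary between AISdb versions, so list your tables (as shown earlier) and inspect their schema with .schema before adapting it:
import sqlite3

connection = sqlite3.connect('YOUR_DATABASE.db')
cursor = connection.cursor()
# NOTE: "rtree_201801_dynamic" and its column names are hypothetical placeholders
query = """
SELECT mmsi0, t0, x0, y0
FROM rtree_201801_dynamic
WHERE t0 >= 1514764800 AND t1 <= 1514851200  -- narrow time window (epoch seconds)
  AND x0 >= -64.0 AND x1 <= -63.0            -- narrow longitude range
  AND y0 >= 44.0  AND y1 <= 45.0;            -- narrow latitude range
"""
cursor.execute(query)
print(cursor.fetchall())
connection.close()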
Timestamps are stored as epoch seconds in the database. To facilitate querying the database manually, use the dt_2_epoch() function to convert datetime values to epoch time, and the epoch_2_dt() function to convert epoch values back to datetimes. Here is how you can use dt_2_epoch() with the example above:
import sqlite3
from datetime import datetime
from aisdb.gis import dt_2_epoch
# Define the datetime range
start_datetime = datetime(2018, 1, 1, 0, 0, 0)
end_datetime = datetime(2018, 1, 1, 1, 59, 59)
# Convert datetime to epoch time
start_epoch = dt_2_epoch(start_datetime)
end_epoch = dt_2_epoch(end_datetime)
# Connect to the database
connection = sqlite3.connect('YOUR_DATABASE.db')
# Create a cursor object
cursor = connection.cursor()
# Define the JOIN SQL query using an epoch time range
query = f"""
SELECT
d.mmsi,
d.time,
d.longitude,
d.latitude,
d.sog,
d.cog,
s.vessel_name,
s.ship_type
FROM ais_201801_dynamic d
LEFT JOIN ais_201801_static s ON d.mmsi = s.mmsi
WHERE d.time BETWEEN {start_epoch} AND {end_epoch} -- Filter by time range
AND d.longitude BETWEEN -68 AND -56 -- Filter by geographical area
AND d.latitude BETWEEN 45 AND 51.5;
"""
# Execute the query
cursor.execute(query)
# Fetch the results
results = cursor.fetchall()
# Print the results
for row in results:
print(row)
# Close the connection
connection.close()
For more examples, please see the SQL code in aisdb_sql/ that is used to create database tables and associated queries.
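To convert the epoch values returned by these queries back into datetimes, the epoch_2_dt() function can be used. A minimal sketch, assuming the results list from the example above is still available and that epoch_2_dt() accepts a single epoch value:
from aisdb.gis import epoch_2_dt

# row[1] is d.time in the SELECT above; row[0] is d.mmsi
for row in results:
    print(epoch_2_dt(row[1]), row[0])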
ais_{YYYYMM}_dynamic tables

mmsi (INTEGER): Maritime Mobile Service Identity, a unique identifier for vessels.
time (INTEGER): Timestamp of the AIS message, in epoch seconds.
longitude (REAL): Longitude of the vessel in decimal degrees.
latitude (REAL): Latitude of the vessel in decimal degrees.
rot (REAL): Rate of turn, indicating how fast the vessel is turning.
sog (REAL): Speed over ground, in knots.
cog (REAL): Course over ground, in degrees.
heading (REAL): Heading of the vessel, in degrees.
maneuver (BOOLEAN): Indicator for whether the vessel is performing a special maneuver.
utc_second (INTEGER): Second of the UTC timestamp when the message was generated.
source (TEXT): Source of the AIS data.
ais_{YYYYMM}_static tables

mmsi (INTEGER): Maritime Mobile Service Identity, a unique identifier for vessels.
time (INTEGER): Timestamp of the AIS message, in epoch seconds.
vessel_name (TEXT): Name of the vessel.
ship_type (INTEGER): Numeric code representing the type of ship.
call_sign (TEXT): International radio call sign of the vessel.
imo (INTEGER): International Maritime Organization number, another unique vessel identifier.
dim_bow (INTEGER): Distance from the AIS transmitter to the bow (front) of the vessel.
dim_stern (INTEGER): Distance from the AIS transmitter to the stern (back) of the vessel.
dim_port (INTEGER): Distance from the AIS transmitter to the port (left) side of the vessel.
dim_star (INTEGER): Distance from the AIS transmitter to the starboard (right) side of the vessel.
draught (REAL): Maximum depth of the vessel's hull below the waterline, in meters.
destination (TEXT): Destination port or location where the vessel is heading.
ais_version (INTEGER): AIS protocol version used by the vessel.
fixing_device (TEXT): Type of device used for fixing the vessel's position (e.g., GPS).
eta_month (INTEGER): Estimated time of arrival month.
eta_day (INTEGER): Estimated time of arrival day.
eta_hour (INTEGER): Estimated time of arrival hour.
eta_minute (INTEGER): Estimated time of arrival minute.
source (TEXT): Source of the AIS data (e.g., specific AIS receiver or data provider).
static_{YYYYMM}_aggregate tables

mmsi (INTEGER): Maritime Mobile Service Identity, a unique identifier for vessels.
imo (INTEGER): International Maritime Organization number, another unique vessel identifier.
vessel_name (TEXT): Name of the vessel.
ship_type (INTEGER): Numeric code representing the type of ship.
call_sign (TEXT): International radio call sign of the vessel.
dim_bow (INTEGER): Distance from the AIS transmitter to the bow (front) of the vessel.
dim_stern (INTEGER): Distance from the AIS transmitter to the stern (back) of the vessel.
dim_port (INTEGER): Distance from the AIS transmitter to the port (left) side of the vessel.
dim_star (INTEGER): Distance from the AIS transmitter to the starboard (right) side of the vessel.
draught (REAL): Maximum depth of the vessel's hull below the waterline, in meters.
destination (TEXT): Destination port or location where the vessel is heading.
eta_month (INTEGER): Estimated time of arrival month.
eta_day (INTEGER): Estimated time of arrival day.
eta_hour (INTEGER): Estimated time of arrival hour.
eta_minute (INTEGER): Estimated time of arrival minute.
In addition to accessing data stored on the AISdb server, you can download open-source AIS data or import your datasets for processing and analysis using AISdb. This tutorial guides you through downloading AIS data from popular websites, creating SQLite and PostgreSQL databases compatible with AISdb, and establishing database connections. We provide two examples: Downloading and Processing Individual Files, which demonstrates working with small data samples and creating an SQLite database, and Pipeline for Bulk File Downloads and Database Integration, which outlines our approach to handling multiple data file downloads and creating a PostgreSQL database.
U.S. vessel traffic data across user-defined geographies and periods is available from MarineCadastre. This resource offers comprehensive AIS data that can be accessed for various maritime analysis purposes. We can tailor the dataset based on research needs by selecting specific regions and timeframes.
In the following example, we will show how to download and process a single data file and import the data to a newly created SQLite database.
First, download the AIS data of the day using the curl command:
curl -o ./data/AIS_2020_01_01.zip https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2020/AIS_2020_01_01.zip
Then, extract the downloaded ZIP file to a specific path:
unzip ./data/AIS_2020_01_01.zip -d ./data/
Next, let's inspect the columns in the downloaded CSV file.
import pandas as pd
# Read CSV file in pandas dataframe
df_ = pd.read_csv("./data/AIS_2020_01_01.csv", parse_dates=["BaseDateTime"])
print(df_.columns)
Index(['MMSI', 'BaseDateTime', 'LAT', 'LON', 'SOG', 'COG', 'Heading',
'VesselName', 'IMO', 'CallSign', 'VesselType', 'Status',
'Length', 'Width', 'Draft', 'Cargo', 'TransceiverClass'],
dtype='object')
The columns required by AISdb have specific names that may differ from those in the imported dataset, so let's define the exact list of columns needed.
list_of_headers_ = ["MMSI","Message_ID","Repeat_indicator","Time","Millisecond","Region","Country","Base_station","Online_data","Group_code","Sequence_ID","Channel","Data_length","Vessel_Name","Call_sign","IMO","Ship_Type","Dimension_to_Bow","Dimension_to_stern","Dimension_to_port","Dimension_to_starboard","Draught","Destination","AIS_version","Navigational_status","ROT","SOG","Accuracy","Longitude","Latitude","COG","Heading","Regional","Maneuver","RAIM_flag","Communication_flag","Communication_state","UTC_year","UTC_month","UTC_day","UTC_hour","UTC_minute","UTC_second","Fixing_device","Transmission_control","ETA_month","ETA_day","ETA_hour","ETA_minute","Sequence","Destination_ID","Retransmit_flag","Country_code","Functional_ID","Data","Destination_ID_1","Sequence_1","Destination_ID_2","Sequence_2","Destination_ID_3","Sequence_3","Destination_ID_4","Sequence_4","Altitude","Altitude_sensor","Data_terminal","Mode","Safety_text","Non-standard_bits","Name_extension","Name_extension_padding","Message_ID_1_1","Offset_1_1","Message_ID_1_2","Offset_1_2","Message_ID_2_1","Offset_2_1","Destination_ID_A","Offset_A","Increment_A","Destination_ID_B","offsetB","incrementB","data_msg_type","station_ID","Z_count","num_data_words","health","unit_flag","display","DSC","band","msg22","offset1","num_slots1","timeout1","Increment_1","Offset_2","Number_slots_2","Timeout_2","Increment_2","Offset_3","Number_slots_3","Timeout_3","Increment_3","Offset_4","Number_slots_4","Timeout_4","Increment_4","ATON_type","ATON_name","off_position","ATON_status","Virtual_ATON","Channel_A","Channel_B","Tx_Rx_mode","Power","Message_indicator","Channel_A_bandwidth","Channel_B_bandwidth","Transzone_size","Longitude_1","Latitude_1","Longitude_2","Latitude_2","Station_Type","Report_Interval","Quiet_Time","Part_Number","Vendor_ID","Mother_ship_MMSI","Destination_indicator","Binary_flag","GNSS_status","spare","spare2","spare3","spare4"]Next, we update the name of columns in the existing dataframe df_ and change the time format as required. The timestamp of an AIS message is represented by BaseDateTime in the default format YYYY-MM-DDTHH:MM:SS. For AISdb, however, the time is represented in UNIX format. We now read the CSV and apply the necessary changes to the date format:
# Take the first 40,000 records from the original dataframe
df = df_.iloc[0:40000]
# Create a new dataframe with the specified headers
df_new = pd.DataFrame(columns=list_of_headers_)
# Populate the new dataframe with formatted data from the original dataframe
df_new['Time'] = pd.to_datetime(df['BaseDateTime']).dt.strftime('%Y%m%d_%H%M%S')
df_new['Latitude'] = df['LAT']
df_new['Longitude'] = df['LON']
df_new['Vessel_Name'] = df['VesselName']
df_new['Call_sign'] = df['CallSign']
df_new['Ship_Type'] = df['VesselType'].fillna(0).astype(int)
df_new['Navigational_status'] = df['Status']
df_new['Draught'] = df['Draft']
df_new['Message_ID'] = 1 # Mark all messages as dynamic by default
df_new['Millisecond'] = 0
# Transfer additional columns from the original dataframe, if they exist
for col_n in df_new:
    if col_n in df.columns:
        df_new[col_n] = df[col_n]
# Extract static messages for each unique vessel
filtered_df = df_new[df_new['Ship_Type'].notnull() & (df_new['Ship_Type'] != 0)]
filtered_df = filtered_df.drop_duplicates(subset='MMSI', keep='first')
filtered_df = filtered_df.reset_index(drop=True)
filtered_df['Message_ID'] = 5 # Mark these as static messages
# Merge dynamic and static messages into a single dataframe
df_new = pd.concat([filtered_df, df_new])
# Save the final dataframe to a CSV file
# The quoting parameter is necessary because the csvreader reads each column value as a string by default
df_new.to_csv("./data/AIS_2020_01_01_aisdb.csv", index=False, quoting=1)In the code, we can see that we have mapped the column named accordingly. Additionally, the data type of some columns has also been changed. Additionally, the nm4 file usually contains raw messages, separating static messages from dynamic ones. However, the MarineCadastre Data does not have such a Message_ID to indicate the type. Thus, adding static messages is necessary for database creation so that a table related to metadata is created.
Let's process the CSV to create an SQLite database using the aisdb package.
import aisdb
# Establish a connection to the SQLite database and decode messages from the CSV file
with aisdb.SQLiteDBConn('./data/test_decode_msgs.db') as dbconn:
    aisdb.decode_msgs(filepaths=["./data/AIS_2020_01_01_aisdb.csv"],
                      dbconn=dbconn, source='Testing', verbose=True)
generating file checksums...
checking file dates...
creating tables and dropping table indexes...
Memory: 20.65GB remaining. CPUs: 12. Average file size: 49.12MB Spawning 4 workers
saving checksums...
processing ./data/AIS_2020_01_01_aisdb.csv
AIS_2020_01_01_aisdb.csv count: 49323 elapsed: 0.27s rate: 183129 msgs/s
cleaning temporary data...
aggregating static reports into static_202001_aggregate...
An SQLite database has now been created.
sqlite3 ./data/test_decode_msgs.db
sqlite> .tables
ais_202001_dynamic coarsetype_ref static_202001_aggregate
ais_202001_static        hashmap
If you prefer to use a PostgreSQL database instead, define a PostgreSQL connection string and proceed with the database connection:
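A minimal sketch, assuming a running PostgreSQL/TimescaleDB server and that decode_msgs() accepts a PostgresDBConn in the same way as an SQLiteDBConn (replace the placeholder credentials with your own):
import aisdb
from aisdb.database.dbconn import PostgresDBConn

# Hypothetical connection string: postgresql://USERNAME:PASSWORD@HOST:PORT/DATABASE
dbconn = PostgresDBConn('postgresql://USERNAME:PASSWORD@localhost:5432/DATABASE')
# Assumption: decode_msgs() works with PostgresDBConn just as it does with SQLiteDBConn
aisdb.decode_msgs(filepaths=["./data/AIS_2020_01_01_aisdb.csv"],
                  dbconn=dbconn, source='Testing', verbose=True)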
This section provides an example of downloading and processing multiple files, creating a PostgreSQL database, and loading data into tables. The steps are outlined in a series of pipeline scripts available in this GitHub repository, which should be executed in the order indicated by their numbers.
The first script, 0-download-ais.py, allows you to download AIS data from MarineCadastre by specifying your needed years. If no years are specified, the script will default to downloading data for 2023. The downloaded ZIP files will be stored in a /data folder created in your current working directory. The second script, 1-zip2csv.py, extracts the CSV files from the downloaded ZIP files in /data and saves them in a new directory named /zip.
To download and extract the data, simply run the two scripts in sequence:
python 0-download-ais.py
python 1-zip2csv.py
After downloading and extracting the AIS data, the 2-merge.py script consolidates the daily CSV files into monthly files, while the 3-deduplicate.py script removes duplicate rows, retaining only unique AIS messages. To run them, execute:
python 2-merge.py
python 3-deduplicate.py
The output of these two scripts is a set of cleaned CSV files stored in a new folder named /merged in your working directory.
The final script, 4-postgresql-database.py, creates a PostgreSQL database with a specified name. To do this, the script connects to a PostgreSQL server, requiring you to provide your username and password to establish the connection. After creating the database, the script verifies that the number of columns in the CSV files matches the headers. It then creates a corresponding table in the database for each CSV file and loads the data into it. To run this script, you need to provide three command-line arguments: -dbname for the new database name, -user for your PostgreSQL username, and -password for your PostgreSQL password. There are also two optional arguments, -host (default localhost) and -port (default 5432), which you can adjust if your PostgreSQL server is running on a different host or port.
python 4-postgresql-database.py -dbname DBNAME -user USERNAME -password PASSWORD [-host HOST] [-port PORT]
When the program reports that the task is finished, you can check the created database and loaded tables by connecting to the PostgreSQL server with the psql command-line interface:
psql -U USERNAME -d DBNAME -h localhost -p 5432
Once connected, you can list all tables in the database by running the \dt command. In our example using 2023 AIS data (default download), the tables will appear as follows:
ais_pgdb=# \dt
List of relations
Schema | Name | Type | Owner
--------+-------------+-------+----------
public | ais_2023_01 | table | postgres
public | ais_2023_02 | table | postgres
public | ais_2023_03 | table | postgres
public | ais_2023_04 | table | postgres
public | ais_2023_05 | table | postgres
public | ais_2023_06 | table | postgres
public | ais_2023_07 | table | postgres
public | ais_2023_08 | table | postgres
public | ais_2023_09 | table | postgres
public | ais_2023_10 | table | postgres
public | ais_2023_11 | table | postgres
public | ais_2023_12 | table | postgres
(12 rows)
This tutorial demonstrates how to access vessel metadata using MMSI and SQLite databases. In many cases, AIS messages do not contain metadata. Therefore, this tutorial introduces the built-in functions in AISdb and external APIs to extract detailed vessel information associated with a specific MMSI from web sources.
We introduce two methods implemented in AISdb for scraping metadata: using session requests for direct access, and employing web drivers with browsers to handle modern websites with dynamic content. Additionally, we provide an example of using a third-party API to access vessel information.
The session request method in Python is a straightforward and efficient approach for retrieving metadata from websites. In AISdb, the aisdb.webdata._scraper.search_metadata_vesselfinder function leverages this method to scrape detailed information about vessels based on their MMSI numbers. This function efficiently gathers a range of data, including vessel name, type, flag, tonnage, and navigation status.
This is an example of how to use the search_metadata_vesselfinder function in AISdb to scrape data from the VesselFinder website:
from aisdb.webdata._scraper import search_metadata_vesselfinder
MMSI = 228386800
dict_ = search_metadata_vesselfinder(MMSI)
print(dict_)
{'IMO number': '9839131',
'Vessel Name': 'CMA CGM CHAMPS ELYSEES',
'Ship type': 'Container Ship',
'Flag': 'France',
'Homeport': '-',
'Gross Tonnage': '236583',
'Summer Deadweight (t)': '220766',
'Length Overall (m)': '400',
'Beam (m)': '61',
'Draught (m)': '',
'Year of Build': '2020',
'Builder': '',
'Place of Build': '',
'Yard': '',
'TEU': '',
'Crude Oil (bbl)': '-',
'Gas (m3)': '-',
'Grain': '-',
'Bale': '-',
'Classification Society': '',
'Registered Owner': '',
'Owner Address': '',
'Owner Website': '-',
'Owner Email': '-',
'Manager': '',
'Manager Address': '',
'Manager Website': '',
'Manager Email': '',
'Predicted ETA': '',
'Distance / Time': '',
'Course / Speed': '\xa0',
'Current draught': '16.0 m',
'Navigation Status': '\nUnder way\n',
'Position received': '\n22 mins ago \n\n\n',
'IMO / MMSI': '9839131 / 228386800',
'Callsign': 'FLZF',
'Length / Beam': '399 / 61 m'}
In addition to metadata scraping, we can also use APIs offered by data providers. MarineTraffic offers a subscription API for accessing vessel data, voyage forecasts, vessel positions, and more. Here is an example of retrieving vessel data with it:
import requests
# Your MarineTraffic API key
api_key = 'your_marine_traffic_api_key'
# List of MMSI numbers you want to query
mmsi_list = [228386800,
372351000,
373416000,
477003800,
477282400
]
# Base URL for the MarineTraffic API endpoint
url = f'https://services.marinetraffic.com/api/exportvessels/{api_key}'
# Prepare the API request
params = {
'shipid': ','.join(map(str, mmsi_list)), # Join MMSI list with commas (as strings)
'protocol': 'jsono', # Specify the response format
'msgtype': 'extended' # Specify the level of details
}
# Make the API request
response = requests.get(url, params=params)
# Check if the request was successful
if response.status_code == 200:
    vessel_data = response.json()
    for vessel in vessel_data:
        print(f"Vessel Name: {vessel.get('NAME')}")
        print(f"MMSI: {vessel.get('MMSI')}")
        print(f"IMO: {vessel.get('IMO')}")
        print(f"Call Sign: {vessel.get('CALLSIGN')}")
        print(f"Type: {vessel.get('TYPE_NAME')}")
        print(f"Flag: {vessel.get('COUNTRY')}")
        print(f"Length: {vessel.get('LENGTH')}")
        print(f"Breadth: {vessel.get('BREADTH')}")
        print(f"Year Built: {vessel.get('YEAR_BUILT')}")
        print(f"Status: {vessel.get('STATUS_NAME')}")
        print('-' * 40)
else:
print(f"Failed to retrieve data: {response.status_code}")If you already have a database containing AIS track data, then vessel metadata can be downloaded and stored in a separate database.
from aisdb import track_gen, DBConn, DBQuery, sqlfcn_callbacks, Domain
from datetime import datetime
dbpath = "/home/database.db"
# A new database will be created if it does not exist to save the info downloaded from MarineTraffic
traffic_database_path = "/home/traffic_info.db"
with DBConn(dbpath=dbpath) as dbconn:
    qry = DBQuery(
        dbconn=dbconn, callback=sqlfcn_callbacks.in_timerange,
        start=datetime(2021, 11, 1),
        end=datetime(2021, 11, 2)
    )
    # A custom boundary can be selected for the query using aisdb.Domain
    qry.check_marinetraffic(trafficDBpath=traffic_database_path,
                            boundary={"xmin": -180, "xmax": 180, "ymin": -90, "ymax": 90})
    rowgen = qry.gen_qry(verbose=True)
    trackgen = track_gen.TrackGen(rowgen, decimate=True)
In AISdb, the speed of a vessel is calculated using the aisdb.gis.delta_knots function, which computes the speed over ground (SOG) in knots between consecutive positions within a given track. This calculation is important for the denoising encoder, as it compares the vessel's speed against a set threshold to aid in the data cleaning process.
Vessel speed calculation requires the distance the vessel has traveled between two consecutive positions and the time interval. This distance is computed using the haversine distance function, and the time interval is simply the difference in timestamps between the two consecutive AIS position reports. The speed is then computed using the formula:
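Reconstructed from the description above, with the distance in meters and the time difference in seconds:
speed (knots) = haversine_distance(p1, p2) / (t2 - t1) × 1.9438445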
The factor 1.9438445 converts the speed from meters per second to knots (1 knot = 1852 m per hour, so 1 m/s = 3600/1852 ≈ 1.9438 knots), the standard speed unit used in maritime contexts.
With the example track we created in Haversine Distance, we can calculate the vessel speed between each two consecutive positions:
import aisdb
import numpy as np
from datetime import datetime
from aisdb.gis import dt_2_epoch
# Generate example track
y1, x1 = 44.57039426840729, -63.52931373766157
y2, x2 = 44.51304767533133, -63.494075674952555
y3, x3 = 44.458038982492134, -63.535634138077945
y4, x4 = 44.393941339104074, -63.53826396955358
y5, x5 = 44.14245580737021, -64.16608964280064
t1 = dt_2_epoch( datetime(2021, 1, 1, 1) )
t2 = dt_2_epoch( datetime(2021, 1, 1, 2) )
t3 = dt_2_epoch( datetime(2021, 1, 1, 3) )
t4 = dt_2_epoch( datetime(2021, 1, 1, 4) )
t5 = dt_2_epoch( datetime(2021, 1, 1, 7) )
# Create a sample track
tracks_short = [
dict(
mmsi=123456789,
lon=np.array([x1, x2, x3, x4, x5]),
lat=np.array([y1, y2, y3, y4, y5]),
time=np.array([t1, t2, t3, t4, t5]),
dynamic=set(['lon', 'lat', 'time']),
static=set(['mmsi'])
)
]
# Calculate the vessel speed in knots
for track in tracks_short:
    print(aisdb.gis.delta_knots(track))
[3.7588560005768947 3.7519408684140214 3.8501088005116215 10.309565520121597]
Building on the previous section, where we used AIS data to create AISdb databases, users can export AIS data from these databases into CSV format. In this section, we provide examples of exporting data from SQLite or PostgreSQL databases into CSV files. While we demonstrate these operations using internal data, you can apply the same techniques to your databases.
In the first example, we connected to a SQLite database, queried data in a specific time range and area of interest, and then exported the queried data to a CSV file:
import csv
import aisdb
import nest_asyncio
from aisdb import DBConn, DBQuery, DomainFromPoints
from aisdb.database.dbconn import SQLiteDBConn
from datetime import datetime
nest_asyncio.apply()
dbpath = 'YOUR_DATABASE.db' # Path to your database
end_time = datetime.strptime("2018-01-02 00:00:00", '%Y-%m-%d %H:%M:%S')
start_time = datetime.strptime("2018-01-01 00:00:00", '%Y-%m-%d %H:%M:%S')
domain = DomainFromPoints(points=[(-63.6, 44.6)], radial_distances=[50000])
# Connect to SQLite database
with SQLiteDBConn(dbpath=dbpath) as dbconn:
    qry = DBQuery(
        dbconn=dbconn, start=start_time, end=end_time,
        xmin=domain.boundary['xmin'], xmax=domain.boundary['xmax'],
        ymin=domain.boundary['ymin'], ymax=domain.boundary['ymax'],
        callback=aisdb.database.sqlfcn_callbacks.in_time_bbox_validmmsi,
    )
    tracks = aisdb.track_gen.TrackGen(qry.gen_qry(), decimate=False)
    # Define the headers for the CSV file
    headers = ['mmsi', 'time', 'lon', 'lat', 'cog', 'sog',
               'utc_second', 'heading', 'rot', 'maneuver']
    # Open the CSV file for writing
    csv_filename = 'output_sqlite.csv'
    with open(csv_filename, mode='w', newline='') as file:
        writer = csv.DictWriter(file, fieldnames=headers)
        writer.writeheader()  # Write the header once
        for track in tracks:
            for i in range(len(track['time'])):
                row = {
                    'rot': track['rot'],
                    'mmsi': track['mmsi'],
                    'lon': track['lon'][i],
                    'lat': track['lat'][i],
                    'cog': track['cog'][i],
                    'sog': track['sog'][i],
                    'time': track['time'][i],
                    'heading': track['heading'],
                    'maneuver': track['maneuver'],
                    'utc_second': track['utc_second'][i],
                }
                writer.writerow(row)  # Write the row to the CSV file
print(f"All tracks have been combined and written to {csv_filename}")
Now we can check the data in the exported CSV file:
mmsi time lon lat cog sog utc_second heading rot maneuver
0 219014000 1514767484 -63.537167 44.635834 322 0.0 44 295.0 0.0 0
1 219014000 1514814284 -63.537167 44.635834 119 0.0 45 295.0 0.0 0
2 219014000 1514829783 -63.537167 44.635834 143 0.0 15 295.0 0.0 0
3 219014000 1514829843 -63.537167 44.635834 171 0.0 15 295.0 0.0 0
4 219014000 1514830042 -63.537167 44.635834 3 0.0 35 295.0 0.0 0
Exporting from a PostgreSQL database works the same way; the only difference is that you connect to your PostgreSQL database before querying the data you want to export. A full example follows:
import csv
import aisdb
import nest_asyncio
from datetime import datetime
from aisdb.database.dbconn import PostgresDBConn
from aisdb import DBConn, DBQuery, DomainFromPoints
nest_asyncio.apply()
dbconn = PostgresDBConn(
host='localhost', # PostgreSQL address
port=5432, # PostgreSQL port
user='your_username', # PostgreSQL username
password='your_password', # PostgreSQL password
dbname='database_name' # Database name
)
# Define the area of interest; here a 50 km radius around a point, as in the SQLite example above
domain = DomainFromPoints(points=[(-63.6, 44.6)], radial_distances=[50000])
qry = DBQuery(
    dbconn=dbconn,
    start=datetime(2023, 1, 1), end=datetime(2023, 1, 3),
    xmin=domain.boundary['xmin'], xmax=domain.boundary['xmax'],
    ymin=domain.boundary['ymin'], ymax=domain.boundary['ymax'],
    callback=aisdb.database.sqlfcn_callbacks.in_time_bbox_validmmsi
)
tracks = aisdb.track_gen.TrackGen(qry.gen_qry(), decimate=False)
# Define the headers for the CSV file
headers = ['mmsi', 'time', 'lon', 'lat', 'cog', 'sog',
'utc_second', 'heading', 'rot', 'maneuver']
# Open the CSV file for writing
csv_filename = 'output_postgresql.csv'
with open(csv_filename, mode='w', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=headers)
    writer.writeheader()  # Write the header once
    for track in tracks:
        for i in range(len(track['time'])):
            row = {
                'rot': track['rot'],
                'mmsi': track['mmsi'],
                'lon': track['lon'][i],
                'lat': track['lat'][i],
                'cog': track['cog'][i],
                'sog': track['sog'][i],
                'time': track['time'][i],
                'heading': track['heading'],
                'maneuver': track['maneuver'],
                'utc_second': track['utc_second'][i],
            }
            writer.writerow(row)  # Write the row to the CSV file
print(f"All tracks have been combined and written to {csv_filename}")We can check the output CSV file now:
mmsi time lon lat cog sog utc_second heading rot maneuver
0 210108000 1672545711 -63.645 44.68833 173 0.0 0 0.0 0.0 False
1 210108000 1672545892 -63.645 44.68833 208 0.0 0 0.0 0.0 False
2 210108000 1672546071 -63.645 44.68833 176 0.0 0 0.0 0.0 False
3 210108000 1672546250 -63.645 44.68833 50 0.0 0 0.0 0.0 False
4 210108000 1672546251 -63.645 44.68833 50 0.0 0 0.0 0.0 False
A hands-on quick start guide for using AISdb.
Note: If you are starting from scratch, download the data ".db" file in our AISdb Tutorial GitHub repository so that you can follow this guide properly.
To work with the AISdb Python package, please ensure you have Python version 3.8 or higher. If you plan to use SQLite, no additional installation is required, as it is included with Python by default. However, those who prefer using a PostgreSQL server must install it separately and enable the TimescaleDB extension to function correctly.
The AISdb Python package can be conveniently installed using pip. It's highly recommended that a virtual Python environment be created and the package installed within it.
python -m venv AISdb # create a python virtual environment
source ./AISdb/bin/activate # activate the virtual environment
pip install aisdb # from https://pypi.org/project/aisdb/
On Windows, use:
python -m venv AISdb
./AISdb/Scripts/activate
pip install aisdb
You can test your installation by running the following commands:
python
>>> import aisdb
>>> aisdb.__version__ # should return '1.7.3' or newer
Note that if you are running Jupyter, ensure it is installed in the same environment as AISdb:
source ./AISdb/bin/activate
pip install jupyter
jupyter notebook
The Python code in the rest of this document can be run in the Python environment you created.
To use nightly builds (optional), you can install AISdb from source:
source AISdb/bin/activate # On Windows use `AISdb\Scripts\activate`
# Cloning the Repository and installing the package
git clone https://github.com/AISViz/AISdb.git && cd AISdb
# Windows users can instead download the installer:
# - https://forge.rust-lang.org/infra/other-installation-methods.html#rustup
# - https://static.rust-lang.org/rustup/dist/i686-pc-windows-gnu/rustup-init.exe
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs > install-rust.sh
# Installing Rust and Maturin
/bin/bash install-rust.sh -q -y
pip install --upgrade maturin[patchelf]
# Building AISdb package with Maturin
maturin develop --release --extras=test,docs
Alternatively, you can use nightly builds (not mandatory) on Google Colab as follows:
import os
# Clone the AISdb repository from GitHub
!git clone https://github.com/AISViz/AISdb.git
# Install Rust using the official Rustup script
!curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
# Install Maturin to build the packages
!pip install --upgrade maturin[patchelf]
# Set up environment variables
os.environ["PATH"] += os.pathsep + "/root/.cargo/bin"
# Install wasm-pack for building WebAssembly packages
!curl https://rustwasm.github.io/wasm-pack/installer/init.sh -sSf | sh
# Install wasm-pack as a Cargo dependency
!cargo install wasm-pack
# Setting environment variable for the virtual environment
os.environ["VIRTUAL_ENV"] = "/usr/local"
# Change directory to AISdb for building the package
%cd AISdb
# Build and install the AISdb package using Maturin
!maturin develop --release --extras=test,docs
AISdb supports SQLite and PostgreSQL databases. Since version 1.7.3, AISdb requires TimescaleDB over PostgreSQL to function properly. To install TimescaleDB, follow these steps:
Install TimescaleDB (PostgreSQL Extension)
$ sudo apt install -y timescaledb-postgresql-XX # XX is the PostgreSQL version
Enable the Extension in PostgreSQL
> CREATE EXTENSION IF NOT EXISTS timescaledb;
Verify the Installation
> SELECT * FROM timescaledb_information.version;
Restart PostgreSQL
$ sudo systemctl restart postgresql
Connecting to PostgreSQL requires the optional dependency psycopg for interfacing with Postgres databases. The PostgresDBConn class accepts the keyword arguments shown below; alternatively, a connection string may be used. Information on connection strings and the Postgres URI format can be found here.
from aisdb.database.dbconn import PostgresDBConn
# [OPTION 1]
dbconn = PostgresDBConn(
hostaddr='127.0.0.1', # Replace this with the Postgres address (supports IPv6)
port=5432, # Replace this with the Postgres running port (if not the default)
user='USERNAME', # Replace this with the Postgres username
password='PASSWORD', # Replace this with your password
dbname='DATABASE', # Replace this with your database name
)
# [OPTION 2]
dbconn = PostgresDBConn('postgresql://USERNAME:PASSWORD@HOST:PORT/DATABASE')
Querying SQLite is as easy as providing the path to a ".db" file with the same entity-relationship model as the databases supported by AISdb, which is detailed in the SQL Database section. We prepared an example SQLite database, example_data.db, based on AIS data from a small region near Maine, United States, in January 2022 from Marine Cadastre; it is available in the AISdb Tutorial GitHub repository.
from aisdb.database.dbconn import SQLiteDBConn
dbpath='example_data.db'
dbconn = SQLiteDBConn(dbpath=dbpath)
If you want to create your own database from your own data, we have a tutorial with examples showing how to create an SQLite database from open-source data.
Parameters for the database query can be defined using aisdb.database.dbqry.DBQuery. Iterate over rows returned from the database for each vessel with aisdb.database.dbqry.DBQuery.gen_qry(). Convert the results into a generator yielding dictionaries with NumPy arrays describing position vectors, e.g., lon, lat, and time, using aisdb.track_gen.TrackGen().
The following query will return vessel trajectories from a given 1-hour time window:
import aisdb
import pandas as pd
from datetime import datetime
from collections import defaultdict
dbpath = 'example_data.db'
start_time = datetime.strptime("2022-01-01 00:00:00", '%Y-%m-%d %H:%M:%S')
end_time = datetime.strptime("2022-01-01 0:59:59", '%Y-%m-%d %H:%M:%S')
def data2frame(tracks):
    # Dictionary mapping each MMSI (key) to a DataFrame of derived features (value)
    ais_data = defaultdict(lambda: pd.DataFrame(
        columns=['time', 'lat', 'lon', 'cog', 'rocog', 'sog', 'delta_sog']))
    for track in tracks:
        mmsi = track['mmsi']
        df = pd.DataFrame({
            'time': pd.to_datetime(track['time'], unit='s'),
            'lat': track['lat'], 'lon': track['lon'],
            'cog': track['cog'], 'sog': track['sog']
        })
        # Sort by time in descending order
        df = df.sort_values(by='time', ascending=False).reset_index(drop=True)
        # Compute the time difference in seconds
        df['time_diff'] = df['time'].diff().dt.total_seconds()
        # Compute RoCOG (Rate of Change of Course Over Ground)
        delta_cog = (df['cog'].diff() + 180) % 360 - 180
        df['rocog'] = delta_cog / df['time_diff']
        # Compute Delta SOG (Rate of Change of Speed Over Ground)
        df['delta_sog'] = df['sog'].diff() / df['time_diff']
        # Fill NaN values (first row) and infinite values (division by zero cases)
        df[['rocog', 'delta_sog']] = df[['rocog', 'delta_sog']].replace([float('inf'), float('-inf')], 0).fillna(0)
        # Drop unnecessary column
        df.drop(columns=['time_diff'], inplace=True)
        # Store in the dictionary
        ais_data[mmsi] = df
    return ais_data
with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
    qry = aisdb.DBQuery(
        dbconn=dbconn, start=start_time, end=end_time,
        callback=aisdb.database.sqlfcn_callbacks.in_timerange_validmmsi,
    )
    rowgen = qry.gen_qry()
    tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
    ais_data = data2frame(tracks)  # re-use previous function
# Display DataFrames
for key in ais_data.keys():
    print(ais_data[key])
A specific region can be queried for AIS data using aisdb.gis.Domain or one of its sub-classes to define a collection of shapely polygon features. For this example, the domain contains a single bounding box polygon derived from a longitude/latitude coordinate pair and radial distance specified in meters. If multiple features are included in the domain object, the domain boundaries will encompass the convex hull of all features.
# a circle with a 100km radius around the location point
domain = aisdb.DomainFromPoints(points=[(-69.34, 41.55)], radial_distances=[100000])
with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
    qry = aisdb.DBQuery(
        dbconn=dbconn, start=start_time, end=end_time,
        xmin=domain.boundary['xmin'], xmax=domain.boundary['xmax'],
        ymin=domain.boundary['ymin'], ymax=domain.boundary['ymax'],
        callback=aisdb.database.sqlfcn_callbacks.in_validmmsi_bbox,
    )
    rowgen = qry.gen_qry()
    tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
    ais_data = data2frame(tracks)  # re-use previous function
# Display DataFrames
for key in ais_data.keys():
    print(ais_data[key])
Additional query callbacks for filtering by region, timeframe, identifier, etc. can be found in aisdb.database.sql_query_strings and aisdb.database.sqlfcn_callbacks.
The above generator can be input into a processing function, yielding modified results. For example, to model the activity of vessels on a per-voyage or per-transit basis, each voyage is defined as a continuous vector of positions where the time between observed timestamps never exceeds 24 hours.
from datetime import timedelta
# Define a maximum time interval
maxdelta = timedelta(hours=24)
with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
    qry = aisdb.DBQuery(
        dbconn=dbconn, start=start_time, end=end_time,
        xmin=domain.boundary['xmin'], xmax=domain.boundary['xmax'],
        ymin=domain.boundary['ymin'], ymax=domain.boundary['ymax'],
        callback=aisdb.database.sqlfcn_callbacks.in_validmmsi_bbox,
    )
    rowgen = qry.gen_qry()
    tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
    # Split the generated tracks into segments
    track_segments = aisdb.split_timedelta(tracks, maxdelta)
    ais_data = data2frame(track_segments)  # re-use previous function
# Display DataFrames
for key in ais_data.keys():
    print(ais_data[key])
A common problem with AIS data is noise, where multiple vessels might broadcast using the same identifier (sometimes simultaneously). In such cases, AISdb can denoise the data:
(1) Denoising with Encoder: The aisdb.denoising_encoder.encode_greatcircledistance() function checks the approximate distance between each vessel’s position. It separates vectors where a vessel couldn’t reasonably travel using the most direct path, such as speeds over 50 knots.
(2) Distance and Speed Thresholds: A distance and speed threshold limits the maximum distance or time between messages that can be considered continuous.
(3) Scoring and Segment Concatenation: A score is computed for each position delta, with sequential messages nearby at shorter intervals given a higher score. This score is calculated by dividing the Haversine distance by elapsed time. Any deltas with a score not reaching the minimum threshold are considered the start of a new segment. New segments are compared to the end of existing segments with the same vessel identifier; if the score exceeds the minimum, they are concatenated. If multiple segments meet the minimum score, the new segment is concatenated to the existing segment with the highest score.
Notice that processing functions may be executed in sequence as a chain or pipeline, so after segmenting the individual voyages as shown above, results can be input into the encoder to remove noise and correct for vessels with duplicate identifiers.
distance_threshold = 20000 # the maximum allowed distance (meters) between consecutive AIS messages
speed_threshold = 50 # the maximum allowed vessel speed in consecutive AIS messages
minscore = 1e-6 # the minimum score threshold for track segment validation
with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
    qry = aisdb.DBQuery(
        dbconn=dbconn, start=start_time, end=end_time,
        callback=aisdb.database.sqlfcn_callbacks.in_timerange,
    )
    rowgen = qry.gen_qry()
    tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
    # Encode the track segments to clean and validate the track data
    tracks_encoded = aisdb.encode_greatcircledistance(
        tracks,
        distance_threshold=distance_threshold,
        speed_threshold=speed_threshold,
        minscore=minscore)
    ais_data = data2frame(tracks_encoded)  # re-use previous function
# Display DataFrames
for key in ais_data.keys():
    print(ais_data[key])
Building on the above processing pipeline, the resulting cleaned trajectories can be geofenced and filtered for results contained by at least one domain polygon and interpolated for uniformity.
# Define a domain with a central point and corresponding radial distances
domain = aisdb.DomainFromPoints(points=[(-69.34, 41.55),], radial_distances=[100000,])
# Filter the encoded tracks to include only those within the specified domain
tracks_filtered = aisdb.track_gen.zone_mask(tracks_encoded, domain)
# Interpolate the filtered tracks with a specified time interval
tracks_interp = aisdb.interp_time(tracks_filtered, step=timedelta(minutes=15))
Additional processing functions can be found in the aisdb.track_gen module.
The resulting processed voyage data can be exported in CSV format instead of being printed:
aisdb.write_csv(tracks_interp, 'ais_processed.csv')
AISdb supports integrating external data sources such as bathymetric charts and other raster grids.
To determine the approximate ocean depth at each vessel position, the aisdb.webdata.bathymetry module can be used.
import aisdb
# Set the data storage directory
data_dir = './testdata/'
# Download bathymetry grid from the internet
bathy = aisdb.webdata.bathymetry.Gebco(data_dir=data_dir)
bathy.fetch_bathymetry_grid()
Once the data has been downloaded, the Gebco() class may be used to append bathymetric data to tracks in the context of a TrackGen() processing pipeline, like the processing functions described above.
tracks = aisdb.TrackGen(qry.gen_qry(), decimate=False)
tracks_bathymetry = bathy.merge_tracks(tracks)  # merge tracks with bathymetry data
Also, see aisdb.webdata.shore_dist.ShoreDist for determining the approximate nearest distance to shore from vessel positions.
Similarly, arbitrary raster coordinate-gridded data may be appended to vessel tracks:
tracks = aisdb.TrackGen(qry.gen_qry())
raster_path = './GMT_intermediate_coast_distance_01d.tif'
# Load the raster file
raster = aisdb.webdata.load_raster.RasterFile(raster_path)
# Merge the generated tracks with the raster data
tracks = raster.merge_tracks(tracks, new_track_key="coast_distance")
AIS data from the database may be overlaid on a map, such as the one shown above, using the aisdb.web_interface.visualize() function. This function accepts a generator of track dictionaries such as those output by aisdb.track_gen.TrackGen().
from datetime import datetime, timedelta
import aisdb
from aisdb import DomainFromPoints
dbpath='example_data.db'
def color_tracks(tracks):
    ''' set the color of each vessel track using a color name or RGB value '''
    for track in tracks:
        track['color'] = 'blue'  # or an RGB string such as 'rgb(0,0,255)'
        yield track
# Set the start and end times for the query
start_time = datetime.strptime("2022-01-01 00:00:00", '%Y-%m-%d %H:%M:%S')
end_time = datetime.strptime("2022-01-31 00:00:00", '%Y-%m-%d %H:%M:%S')
with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
    qry = aisdb.DBQuery(
        dbconn=dbconn,
        start=start_time,
        end=end_time,
        callback=aisdb.database.sqlfcn_callbacks.in_timerange_validmmsi,
    )
    rowgen = qry.gen_qry()
    # Convert queried rows to vessel trajectories
    tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
    # Visualization
    aisdb.web_interface.visualize(
        tracks,
        visualearth=False,
        open_browser=True,
    )
For a complete plug-and-play solution, you may clone our Google Colab Notebook.
How to deploy your own Automatic Identification System (AIS) receiver.
In addition to utilizing AIS data provided by Spire for the Canadian coasts, you can install AIS receiver hardware to capture AIS data directly. The received data can be processed and stored in databases, which can then be used with AISdb. This approach offers additional data sources and allows users to collect and process their data (as illustrated in the pipeline below). Doing so allows you to customize your data collection efforts to meet specific needs and seamlessly integrate the data with AISdb for enhanced analysis and application. At the same time, you can share the data you collect with others.
Raspberry Pi or other computers with internet working capability
162MHz receiver, such as the Wegmatt dAISy 2 Channel Receiver
An antenna in the VHF frequency band (30MHz - 300MHz) e.g. Shakespeare QC-4 VHF Antenna
Optionally, you may want
Antenna mount
A filtered preamp, such as this one sold by Uputronics, to improve signal range and quality
An additional option is a free AIS receiver from MarineTraffic. This option may require you to share the data with the organization to help expand its AIS-receiving network.
When setting up your antenna, place it as high as possible and as far away from obstructions and other equipment as is practical.
Connect the antenna to the receiver. If using a preamp filter, connect it between the antenna and the receiver.
Connect the receiver to your Linux device via a USB cable. If using a preamp filter, power it with a USB cable.
Validate the hardware configuration
When connected via USB, the AIS receiver is typically found under /dev/ with a name beginning with ttyACM, for example /dev/ttyACM0. Ensure the device is listed in this directory.
To test the receiver, use the command sudo cat /dev/ttyACM0 to display its output. If all works as intended, you will see streams of bytes appearing on the screen.
$ sudo cat /dev/ttyACM0
!AIVDM,1,1,,A,B4eIh>@0<voAFw6HKAi7swf1lH@s,0*61
!AIVDM,1,1,,A,14eH4HwvP0sLsMFISQQ@09Vr2<0f,0*7B
!AIVDM,1,1,,A,14eGGT0301sM630IS2hUUavt2HAI,0*4A
!AIVDM,1,1,,B,14eGdb0001sM5sjIS3C5:qpt0L0G,0*0C
!AIVDM,1,1,,A,14eI3ihP14sM1PHIS0a<d?vt2L0R,0*4D
!AIVDM,1,1,,B,14eI@F@000sLtgjISe<W9S4p0D0f,0*24
!AIVDM,1,1,,B,B4eHt=@0:voCah6HRP1;?wg5oP06,0*7B
!AIVDM,1,1,,A,B4eHWD009>oAeDVHIfm87wh7kP06,0*20
A visual example of the antenna hardware setup that MERIDIAN has available is as follows:
Connect the receiver to the Raspberry Pi via a USB port, and then run the configure_rpi.sh script. This will install the Rust toolchain, AISdb dispatcher, and AISdb system service (described below), allowing the receiver to start at boot.
curl --proto '=https' --tlsv1.2 https://git-dev.cs.dal.ca/meridian/aisdb/-/raw/master/configure_rpi.sh | bash
Install Raspberry Pi OS with SSH enabled: Visit https://www.raspberrypi.com/software/ to download and install the Raspberry Pi OS. If using the RPi imager, please ensure you run it as an administrator.
Connect the receiver: Attach the receiver to the Raspberry Pi using a USB cable. Then log in to the Raspberry Pi and update the system with the following command: sudo apt-get update
Install the Rust toolchain: Install the Rust toolchain on the Raspberry Pi using the following command: curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh Afterward, log out and log back in to add Rust and Cargo to the system path.
Install the network client and dispatcher: (a) from crates.io, using cargo install mproxy-client, or (b) from source, using the local path instead, e.g., cargo install --path ./dispatcher/client
Install systemd services: Set up new systemd services to run the AIS receiver and dispatcher. First, create a new text file ./ais_rcv.service with the contents in the block below, replacing User=ais and /home/ais with the username and home directory chosen in step 1.
[Unit]
Description="AISDB Receiver"
After=network-online.target
Documentation=https://aisdb.meridian.cs.dal.ca/doc/receiver.html
[Service]
Type=simple
User=ais
ExecStart=/home/ais/.cargo/bin/mproxy-client --path /dev/ttyACM0 --server-addr 'aisdb.meridian.cs.dal.ca:9921'
Restart=always
RestartSec=30
[Install]
WantedBy=default.target
This service will broadcast receiver input downstream to aisdb.meridian.cs.dal.ca via UDP. You can add additional endpoints at this stage; for more information, see mproxy-client --help. Additional AIS networking tools, such as mproxy-forward, mproxy-server, and mproxy-reverse, are available in the ./dispatcher source directory.
Next, link and enable the service on the Raspberry Pi to ensure the receiver starts at boot:
sudo systemctl enable systemd-networkd-wait-online.service
sudo systemctl link ./ais_rcv.service
sudo systemctl daemon-reload
sudo systemctl enable ais_rcv
sudo systemctl start ais_rcv
See more examples in docker-compose.yml.
On some Raspberry Pi hardware (such as the author's Raspberry Pi 4 Model B Rev 1.5), the Linux device file representing the serial interface of a connected dAISy AIS receiver is not always /dev/ttyACM0, as assumed in our ./ais_rcv.service.
You can check the actual device file in use by running:
ls -l /dev
For example, the author found that serial0 was linked to ttyS0.
Simply changing /dev/ttyACM0 to /dev/ttyS0 may result in receiving garbled AIS signals. This is because the default baud rate settings are different. You can modify the default baud rate for ttyS0 using the following command:
stty -F /dev/ttyS0 38400 cs8 -cstopb -parenb
Data querying with AISdb involves setting up a connection to the database, defining query parameters, creating and executing the query, and processing the results. Following the previous tutorial, Database Loading, we set up a database connection and made simple queries and visualizations. This tutorial digs into the data query functions and parameters and shows the kinds of queries you can make with AISdb.
Data querying with AISdb includes two components: DBQuery and TrackGen. In this section, we will introduce each component with examples. Before starting data querying, please ensure you have connected to the database. If you have not done so, please follow the instructions and examples in Database Loading or Quick Start.
The DBQuery class is used to create a query object that specifies the parameters for data retrieval, including the time range, spatial domain, and any filtering callbacks. Here is an example to create a DBQuery object and use parameters to specify the time range and geographical locations:
from aisdb.database.dbqry import DBQuery
# Specify database path
dbpath = ...
# Specify constraints (optional)
start_time = ...
end_time = ...
domain = ...
# Create a query object to fetch data within time and geographical range
qry = DBQuery(
dbconn=dbconn, # Database connection object
start=start_time, # Start time for the query
end=end_time, # End time for the query
xmin=domain.boundary['xmin'], # Minimum longitude of the domain
xmax=domain.boundary['xmax'], # Maximum longitude of the domain
ymin=domain.boundary['ymin'], # Minimum latitude of the domain
ymax=domain.boundary['ymax'], # Maximum latitude of the domain
callback=aisdb.database.sqlfcn_callbacks.in_time_bbox_validmmsi # Callback function to filter data
)
Callback functions are used in the DBQuery class to filter data based on specific criteria. Some common callbacks include: in_bbox, in_time_bbox, valid_mmsi, and in_time_bbox_validmmsi. These callbacks ensure that the data retrieved matches the specific criteria defined in the query. Please find examples of using different callbacks with other parameters in Query types with practical examples.
For more callback functions, refer to the API documentation here: API-Doc
gen_qry
The function gen_qry is a method of the DBQuery class in AISdb. It is responsible for generating rows of data that match the query criteria specified when creating the DBQuery object. It acts as a generator, yielding one row at a time and handling large datasets efficiently.
Two callback functions can be passed to gen_qry. They are:
crawl_dynamic: iterates only over the position reports table; this is the default.
crawl_dynamic_static: iterates over both the position reports and static messages tables.
After creating the DBQuery object, we can generate rows with gen_qry:
# Generate rows from the query
from aisdb.database import sqlfcn  # needed for the crawl_dynamic_static callback
rowgen = qry.gen_qry(fcn=sqlfcn.crawl_dynamic_static)  # the fcn parameter is optional
# Process the generated rows as needed
for row in rowgen:
    print(row)
Each row from gen_qry is a tuple or dictionary representing a record in the database.
The TrackGen class converts the generated rows from gen_qry into tracks (trajectories). It takes the row generator and, optionally, a decimate parameter to control point reduction. This conversion is essential for analyzing vessel movements, identifying patterns, and visualizing trajectories in later steps.
Following the generated rows above, here is how to use the TrackGen class:
from aisdb.track_gen import TrackGen
# Convert the generated rows into tracks
tracks = TrackGen(rowgen, decimate=False)
The TrackGen class yields "tracks", a generator object. While iterating over tracks, each element is a dictionary representing the track of a specific vessel:
for track in tracks:
    mmsi = track['mmsi']
    lons = track['lon']
    lats = track['lat']
    speeds = track['sog']
    print(f"Track for vessel MMSI {mmsi}:")
    for lon, lat, speed in zip(lons[:3], lats[:3], speeds[:3]):
        print(f" - Lon: {lon}, Lat: {lat}, Speed: {speed}")
    break  # Exit after the first track
This is the output with our sample data:
Track for vessel MMSI 316004240:
- Lon: -63.54868698120117, Lat: 44.61691665649414, Speed: 7.199999809265137
- Lon: -63.54880905151367, Lat: 44.61708450317383, Speed: 7.099999904632568
- Lon: -63.55659866333008, Lat: 44.626953125, Speed: 1.5
In this section, we will provide practical examples of the most common query types you can make using the DBQuery class, including querying within a time range, within geographical areas, and tracking vessels by MMSI. Different queries can be achieved by changing the callback and other parameters defined in the DBQuery class. Then, we will use TrackGen to convert these query results into structured tracks for further analysis and visualization.
First, we need to import the necessary packages and prepare data:
import os
import aisdb
from datetime import datetime, timedelta
from aisdb import DBConn, DBQuery, DomainFromPoints
dbpath='YOUR_DATABASE.db' # Define the path to your database
Querying data within a specified time range can be done using the in_timerange_validmmsi callback in the DBQuery class:
start_time = datetime.strptime("2018-01-01 00:00:00", '%Y-%m-%d %H:%M:%S')
end_time = datetime.strptime("2018-01-02 00:00:00", '%Y-%m-%d %H:%M:%S')
with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
    qry = aisdb.DBQuery(
        dbconn=dbconn, start=start_time, end=end_time,
        callback=aisdb.database.sqlfcn_callbacks.in_timerange_validmmsi,
    )
    rowgen = qry.gen_qry()
    # Convert queried rows to vessel trajectories
    tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
    # Visualization
    aisdb.web_interface.visualize(
        tracks,
        visualearth=True,
        open_browser=True,
    )
This will display the queried vessel tracks (within the time range and with a valid MMSI) on the map:
You may find noise in some of the track data. In Data Cleaning, we introduced the de-noising methods in AISdb that can effectively remove unreasonable or error data points, ensuring more accurate and reliable vessel trajectories.
In practical scenarios, people may have specific points/areas of interest. DBQuery includes parameters to define a bounding box and has relevant callbacks. Let's look at an example:
domain = DomainFromPoints(points=[(-63.6, 44.6)], radial_distances=[50000]) # a circle with a 50 km radius around the location point
with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
    qry = aisdb.DBQuery(
        dbconn=dbconn, start=start_time, end=end_time,
        xmin=domain.boundary['xmin'], xmax=domain.boundary['xmax'],
        ymin=domain.boundary['ymin'], ymax=domain.boundary['ymax'],
        callback=aisdb.database.sqlfcn_callbacks.in_validmmsi_bbox,
    )
    rowgen = qry.gen_qry()
    tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
    aisdb.web_interface.visualize(
        tracks,
        domain=domain,
        visualearth=True,
        open_browser=True,
    )
This will show all the vessel tracks with valid MMSI in the defined bounding box:
In the above examples, we queried data within a time range and within a geographical area. If you want to combine multiple query criteria, check out the available callback types in the API Docs. In the last example above, we can simply change the callback to obtain vessel tracks within both the time range and the geographical area:
callback=aisdb.database.sqlfcn_callbacks.in_time_bbox_validmmsiThe displayed vessel tracks:
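For reference, here is the bounding-box query from the previous example with the combined callback swapped in (a sketch that reuses start_time, end_time, domain, and dbpath as defined above):
with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
    qry = aisdb.DBQuery(
        dbconn=dbconn, start=start_time, end=end_time,
        xmin=domain.boundary['xmin'], xmax=domain.boundary['xmax'],
        ymin=domain.boundary['ymin'], ymax=domain.boundary['ymax'],
        # Combined filter: time range, bounding box, and valid MMSIs
        callback=aisdb.database.sqlfcn_callbacks.in_time_bbox_validmmsi,
    )
    rowgen = qry.gen_qry()
    tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)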
In addition to time and location ranges, you can track one or more vessels of interest by specifying their MMSIs in the query. Here is an example of tracking several vessels within a time range:
import random
def assign_colors(mmsi_list):
colors = {}
for mmsi in mmsi_list:
colors[mmsi] = "#{:06x}".format(random.randint(0, 0xFFFFFF)) # Random color in hex
return colors
# Create a function to color tracks
def color_tracks(tracks, colors):
colored_tracks = []
for track in tracks:
mmsi = track['mmsi']
color = colors.get(mmsi, "#000000") # Default to black if no color assigned
track['color'] = color
colored_tracks.append(track)
return colored_tracks
# Set the start and end times for the query
start_time = datetime.strptime("2018-01-01 00:00:00", '%Y-%m-%d %H:%M:%S')
end_time = datetime.strptime("2018-12-31 00:00:00", '%Y-%m-%d %H:%M:%S')
# Create a list of vessel MMSIs you want to track
MMSI = [636017611,636018124,636018253]
# Assign colors to each MMSI
colors = assign_colors(MMSI)
with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
qry = aisdb.DBQuery(
dbconn=dbconn, start=start_time, end=end_time, mmsis = MMSI,
callback=aisdb.database.sqlfcn_callbacks.in_timerange_inmmsi,
)
rowgen = qry.gen_qry()
tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
colored_tracks = color_tracks(tracks, colors)
# Visualizing the tracks
aisdb.web_interface.visualize(
colored_tracks,
visualearth=True,
open_browser=True,
)
This tutorial will guide you in using the AISdb package to load AIS data into a database and perform queries. We will begin with AISdb installation and environment setup, then proceed to examples of querying the loaded data and creating simple visualizations.
Preparing a Python virtual environment for AISdb is a safe practice. It allows you to manage dependencies and prevent conflicts with other projects, ensuring a clean and isolated setup for your work with AISdb. Run these commands in your terminal based on the operating system you are using:
# Linux / macOS
python -m venv AISdb # create a python virtual environment
source ./AISdb/bin/activate # activate the virtual environment
pip install aisdb # install AISdb from https://pypi.org/project/aisdb/

# Windows
python -m venv AISdb # create a virtual environment
./AISdb/Scripts/activate # activate the virtual environment
pip install aisdb # install the AISdb package using pip
Now you can check your installation by running:
$ python
>>> import aisdb
>>> aisdb.__version__ # should return '1.7.0' or newerIf you're using AISdb in Jupyter Notebook, please include the following commands in your notebook cells:
# install nest-asyncio for enabling asyncio.run() in Jupyter Notebook
%pip install nest-asyncio
# Some of the systems may show the following error when running the user interface:
# urllib3 v2.0 only supports OpenSSL 1.1.1+; currently, the 'SSL' module is compiled with 'LibreSSL 2.8.3'.
# install urllib3 v1.26.6 to avoid this error
%pip install urllib3==1.26.6Then, import the required packages:
from datetime import datetime, timedelta
import os
import aisdb
import nest_asyncio
nest_asyncio.apply()This section will show you how to efficiently load AIS data into a database.
AISdb includes two database connection approaches:
SQLite database connection; and,
PostgreSQL database connection.
We work with the SQLite database in most usage scenarios. Here is an example of loading the sample data included in the AISdb package:
# List the test data files included in the package
print(os.listdir(os.path.join(aisdb.sqlpath, '..', 'tests', 'testdata')))
# You will see the print result:
# ['test_data_20210701.csv', 'test_data_20211101.nm4', 'test_data_20211101.nm4.gz']
# Set the path for the SQLite database file to be used
dbpath = './test_database.db'
# Use test_data_20210701.csv as the test data
filepaths = [os.path.join(aisdb.sqlpath, '..', 'tests', 'testdata', 'test_data_20210701.csv')]
with aisdb.DBConn(dbpath = dbpath) as dbconn:
aisdb.decode_msgs(filepaths=filepaths, dbconn=dbconn, source='TESTING')The code above decodes the AIS messages from the CSV file specified in filepaths and inserts them into the SQLite database connected via dbconn.
Following is a quick example of a query and visualization of the data we just loaded with AISdb:
start_time = datetime.strptime("2021-07-01 00:00:00", '%Y-%m-%d %H:%M:%S')
end_time = datetime.strptime("2021-07-02 00:00:00", '%Y-%m-%d %H:%M:%S')
with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
qry = aisdb.DBQuery(
dbconn=dbconn,
callback=aisdb.database.sql_query_strings.in_timerange,
start=start_time,
end=end_time,
)
rowgen = qry.gen_qry()
tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
if __name__ == '__main__':
aisdb.web_interface.visualize(
tracks,
visualearth=True,
open_browser=True,
)In addition to the SQLite connection, AISdb supports PostgreSQL for its superior concurrency handling and data-sharing capabilities, making it suitable for collaborative environments and for handling larger datasets efficiently. The structure and interactions with PostgreSQL are designed to provide robust and scalable solutions for AIS data storage and querying. To use PostgreSQL, you need the psycopg2 library:
pip install psycopg2To connect to a PostgreSQL database, AISdb uses the PostgresDBConn class:
from aisdb.database.dbconn import PostgresDBConn
# Option 1: Using keyword arguments
dbconn = PostgresDBConn(
hostaddr='127.0.0.1', # Replace with the PostgreSQL address
port=5432, # Replace with the PostgreSQL running port
user='USERNAME', # Replace with the PostgreSQL username
password='PASSWORD', # Replace with your password
dbname='aisviz' # Replace with your database name
)
# Option 2: Using a connection string
dbconn = PostgresDBConn('postgresql://USERNAME:PASSWORD@HOST:PORT/DATABASE')After establishing a connection to the PostgreSQL database, specifying the paths of the data files, and calling the aisdb.decode_msgs function, the following operations are performed in order: data file processing, table creation, data insertion, and index rebuilding.
Please pay close attention to the flags in aisdb.decode_msgs, as recent updates provide more flexibility for database configurations. These updates include support for ingesting NOAA data into the aisdb format and the option to structure tables using either the original B-Tree indexes or TimescaleDB's structure when the extension is enabled. In particular, pay attention to the following parameters:
source (str, optional)
Specifies the data source to be processed and loaded into the database.
Options: "Spire", "NOAA"/"noaa", or leave empty.
Default: empty; processing proceeds as with the Spire source.
raw_insertion (bool, optional)
If False, the function will drop and rebuild indexes to speed up data loading.
Default: True.
timescaledb (bool, optional)
Set to True only if using the TimescaleDB extension in your PostgreSQL database.
Refer to the TimescaleDB documentation for proper setup and usage.
The following example demonstrates how to process and load Spire data for the entire year 2024 into an aisdb database with the TimescaleDB extension installed:
import time
import aisdb

# psql_conn_string should hold your PostgreSQL connection string (see the PostgresDBConn examples above)
start_year = 2024
end_year = 2024
start_month = 1
end_month = 12
overall_start_time = time.time()
for year in range(start_year, end_year + 1):
for month in range(start_month, end_month + 1):
print(f'Loading {year}{month:02d}')
month_start_time = time.time()
filepaths = aisdb.glob_files(f'/slow-array/Spire/{year}{month:02d}/','.zip')
filepaths = sorted([f for f in filepaths if f'{year}{month:02d}' in f])
print(f'Number of files: {len(filepaths)}')
with aisdb.PostgresDBConn(libpq_connstring=psql_conn_string) as dbconn:
try:
aisdb.decode_msgs(filepaths,
dbconn=dbconn,
source='Spire',
verbose=True,
skip_checksum=True,
raw_insertion=True,
workers=6,
timescaledb=True,
)
except Exception as e:
print(f'Error loading {year}{month:02d}: {e}')
continueExample of performing queries and visualizations with PostgreSQL database:
from aisdb.gis import DomainFromPoints
from aisdb.database.dbqry import DBQuery
from datetime import datetime
# Define a spatial domain centered around the point (-63.6, 44.6) with a radial distance of 50000 meters.
domain = DomainFromPoints(points=[(-63.6, 44.6)], radial_distances=[50000])
# Create a query object to fetch AIS data within the specified time range and spatial domain.
qry = DBQuery(
dbconn=dbconn,
start=datetime(2023, 1, 1), end=datetime(2023, 2, 1),
xmin=domain.boundary['xmin'], xmax=domain.boundary['xmax'],
ymin=domain.boundary['ymin'], ymax=domain.boundary['ymax'],
callback=aisdb.database.sqlfcn_callbacks.in_time_bbox_validmmsi
)
# Generate rows from the query
rowgen = qry.gen_qry()
# Convert the generated rows into tracks
tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
# Visualize the tracks on a map
aisdb.web_interface.visualize(
tracks, # The tracks (trajectories) to visualize.
domain=domain, # The spatial domain to use for the visualization.
visualearth=True, # If True, use Visual Earth for the map background.
open_browser=True # If True, automatically open the visualization in a web browser.
)Moreover, if you wish to use your own AIS data to create and process a database with AISdb, please check out our instructional guide on data processing and database creation: Using Your AIS Data.
A little bit about where we stand.
Welcome to AISdb - a comprehensive gateway for Automatic Identification System (AIS) data uses and applications. AISdb is part of the Making Vessels Tracking Data Available to Everyone (AISViz) project within the Marine Environmental Research Infrastructure for Data Integration and Application Network (MERIDIAN) initiative at Dalhousie University, designed to streamline the collection, processing, and analysis of AIS data, both in live-streaming scenarios and through historical records.
The primary features AISdb provides include:
SQL database for storing AIS position reports and vessel metadata: At the heart of AISdb is a database built on SQLite, giving users a friendly Python interface with which to interact. This interface simplifies tasks like database creation, data querying, processing, visualization, and even exporting data to CSV format for diverse uses. To cater to advanced needs, AISdb supports using Postgres, offering superior concurrency handling and data-sharing capabilities for collaborative environments.
Vessel data cleaning and trajectory modeling: AISdb includes vessel position cleaning and trajectory modeling features. This ensures that the data used for analyses is accurate and reliable, providing a solid foundation for further studies and applications.
Integration with environmental context and external metadata: One of AISdb's unique features is its ability to enrich AIS datasets with environmental context. Users can seamlessly integrate oceanographic and bathymetric data in raster formats to bring depth to their analyses — quite literally, as the tool allows for incorporating seafloor depth data underneath vessel positions. Such versatility ensures that AISdb users can merge various environmental data points with AIS information, resulting in richer, multi-faceted maritime studies.
Advanced features for maritime studies: AISdb offers network graph analysis, MMSI deduplication, interpolation, and other processing utilities. These features enable advanced data processing and analysis, supporting complex maritime studies and applications.
Python interface and machine learning for vessel behavior modeling: AISdb includes a Python interface backed by a Rust core, which paves the way for incorporating machine learning and deep learning techniques into vessel behavior modeling in an optimized way. This aspect of AISdb enhances the reproducibility and scalability of research, be it for academic exploration or practical industry applications.
Research support: AISdb is more than just a storage and processing tool; it is a comprehensive platform designed to support research. Through a formal partnership with our research initiative (contact us for more information), academics, industry experts, and researchers can access extensive Canadian AIS data up to 100 km from the Canadian coastline. This dataset spans from January 2012 to the present and is updated monthly. AISdb offers raw and parsed data formats, eliminating preprocessing needs and streamlining AIS-related research.
The AISViz team is based in the Modeling and Analytics on Predictive Systems (MAPS) lab, in collaboration with the Maritime Risk and Safety (MARS) research group at Dalhousie University. Funded by the Department of Fisheries and Oceans Canada (DFO), our mission revolves around democratizing AIS data use, making it accessible and understandable across multiple sectors, from government and academia to NGOs and the broader public. In addition, AISViz aims to introduce machine learning applications into AISdb's AIS data handling, streamlining user interactions with AIS data and enhancing the user experience by simplifying data access.
Our commitment goes beyond just providing tools. Through AISViz, we're opening doors to innovative research and policy development, targeting environmental conservation, maritime traffic management, and much more. Whether you're a professional in the field, an educator, or a maritime enthusiast, AISViz and its components, including AISdb, offer the knowledge and technology to deepen your understanding and significantly impact marine vessel tracking and the well-being of our oceans.
Ruixin Song is a research assistant in the Computer Science Department at Dalhousie University. She has an M.Sc. in Computer Science and a B.Eng. in Spatial Information and Digital Technology. Her recent work focuses on marine traffic data analysis and physics-inspired models, particularly in relation to biological invasions in the ocean. Her research interests include mobility data mining, graph neural networks, and network flow and optimization problems.
Contact: rsong@dal.ca
Gabriel Spadon is an Assistant Professor at the Faculty of Computer Science at Dalhousie University, Halifax - NS, Canada. He holds a Ph.D. and an MSc in Computer Science from the University of Sao Paulo, Sao Carlos - SP, Brazil. His research focuses on spatio-temporal analytics, time-series forecasting, and complex network mining, with a deep involvement in data science and engineering, as well as geoinformatics.
Contact: spadon@dal.ca
Ron Pelot has a Ph.D. in Management Sciences and is a Professor of Industrial Engineering at Dalhousie University. For the last 30 years, he and his team have been working on developing new software tools and analysis methods for maritime traffic safety, coastal zone security, and marine spills. Their research methods include spatial risk analysis, vessel traffic modeling, data processing, pattern analysis, location models for response resource allocation, safety analyses, and cumulative shipping impact studies.
Contact: ronald.pelot@dal.ca
Adjunct Members
Vaishnav Vaidheeswaran is a Master's student in Computer Science at Dalhousie University. He holds a B.Tech in Computer Science and Engineering and has three years of experience as a software engineer in India, working at cutting-edge startups. His ongoing work addresses incorporating spatial knowledge into trajectory forecasting models to reduce aleatoric uncertainty coming from stochastic interactions of the vessel with the environment. His research interests include large language models, graph neural networks, and reinforcement learning.
Contact: vaishnav@dal.ca
Jinkun Chen is a Ph.D. student in Computer Science at Dalhousie University, specializing in Explainable AI, Natural Language Processing (NLP), and Visualization. He earned a bachelor's degree in Computer Science with First-Class Honours from Dalhousie University. Jinkun is actively involved in research, working on advancing fairness, responsibility, trustworthiness, and explainability within Large Language Models (LLMs) and AI.
Jay Kumar has a Ph.D. in Computer Science and Technology and was a postdoctoral fellow at the Department of Industrial Engineering at Dalhousie University. He has researched AI models for time-series data for over five years, focusing on Recurrent Neural models, probabilistic modeling, and feature engineering data analytics applied to ocean traffic. His research interests include Spatio-temporal Data Mining, Stochastic Modeling, Machine Learning, and Deep Learning.
Matthew Smith has a BSc degree in Applied Computer Science from Dalhousie University and specializes in managing and analyzing vessel tracking data. He is currently a Software Engineer at Radformation in Toronto, ON. Matt served as the AIS data manager on the MERIDIAN project, where he supported research groups across Canada in accessing and utilizing AIS data. The data was used to answer a range of scientific queries, including the impact of shipping on underwater noise pollution and the danger posed to endangered marine mammals by vessel collisions.
Casey Hilliard has a BSc degree in Computer Science from Dalhousie University and was a Senior Data Manager at the Institute for Big Data Analytics. He is currently a Chief Architect at GSTS (Global Spatial Technology Solutions) in Dartmouth, NS. Casey was a long-time research support staff member at the Institute and an expert in managing and using AIS vessel-tracking data. During his time, he assisted in advancing the Institute's research projects by managing and organizing large datasets, ensuring data integrity, and facilitating data usage in research.
Stan Matwin was the director of the Institute for Big Data Analytics, Dalhousie University, Halifax, Nova Scotia; he is a professor and Canada Research Chair (Tier 1) in Interpretability for Machine Learning. He is also a distinguished professor (Emeritus) at the University of Ottawa and a full professor with the Institute of Computer Science, Polish Academy of Sciences. His main research interests include big data, text mining, machine learning, and data privacy. He is a member of the Editorial Boards of IEEE Transactions on Knowledge and Data Engineering and the Journal of Intelligent Information Systems. He received the Lifetime Achievement Award of the Canadian AI Association (CAIAC).
We are passionate about fostering a collaborative and engaged community. We welcome your questions, insights, and feedback as vital components of our continuous improvement and innovation. Should you have any inquiries about AISdb, desire further information on our research, or wish to explore potential collaborations, please don't hesitate to contact us. Staying connected with users and researchers plays a crucial role in shaping the tool's development and ensuring it meets the diverse needs of our growing user base. You can easily contact our team via email or our GitHub team platform. In addition to addressing individual queries, we are committed to organizing webinars and workshops and presenting at conferences to share knowledge, gather feedback, and widen our outreach (stay tuned for more information about these). Together, let's advance the understanding and utilization of marine data for a brighter, more informed future in ocean research and preservation.
AISdb includes a function called aisdb.gis.delta_meters that calculates the Haversine distance in meters between consecutive positions within a vessel track. This function is essential for analyzing vessel movement patterns and ensuring accurate distance calculations on the Earth's curved surface. It is also integrated into the denoising encoder, which compares distances against a threshold to aid in the data-cleaning process.
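For intuition, the Haversine formula itself can be sketched in a few lines of NumPy. This is an illustrative implementation only, not the internal code of aisdb.gis.delta_meters:
import numpy as np

def haversine_meters(lat1, lon1, lat2, lon2, r=6371000.0):
    # Great-circle distance in meters between two points given in decimal degrees
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dlmb = np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dlmb / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

# Distance between the first two sample positions used in the example below (~6.96 km)
print(haversine_meters(44.5704, -63.5293, 44.5130, -63.4941))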
Here is an example of calculating the Haversine distance between each pair of consecutive points on a track:
import aisdb
import numpy as np
from aisdb.gis import dt_2_epoch
from datetime import datetime
y1, x1 = 44.57039426840729, -63.52931373766157
y2, x2 = 44.51304767533133, -63.494075674952555
y3, x3 = 44.458038982492134, -63.535634138077945
y4, x4 = 44.393941339104074, -63.53826396955358
y5, x5 = 44.14245580737021, -64.16608964280064
t1 = dt_2_epoch( datetime(2021, 1, 1, 1) )
t2 = dt_2_epoch( datetime(2021, 1, 1, 2) )
t3 = dt_2_epoch( datetime(2021, 1, 1, 3) )
t4 = dt_2_epoch( datetime(2021, 1, 1, 4) )
t5 = dt_2_epoch( datetime(2021, 1, 1, 7) )
# Create a sample track
tracks_short = [
dict(
lon=np.array([x1, x2, x3, x4, x5]),
lat=np.array([y1, y2, y3, y4, y5]),
time=np.array([t1, t2, t3, t4, t5]),
mmsi=123456789,
dynamic=set(['lon', 'lat', 'time']),
static=set(['mmsi'])
)
]
# Calculate the Haversine distance
for track in tracks_short:
print(aisdb.gis.delta_meters(track))[ 6961.401286 6948.59446128 7130.40147082 57279.94580704]If we visualize this track on the map, we can observe:
Extracting distance features from and to points-of-interest using raster files.
The distances of a vessel from the nearest shore, coast, and port are essential to perform particular tasks such as vessel behavior analysis, environmental monitoring, and maritime safety assessments. AISdb offers functions to acquire these distances for specific vessel positions. In this tutorial, we provide examples of calculating the distance in kilometers from shore and from the nearest port for a given point.
First, we create a sample track:
import aisdb
import numpy as np
from datetime import datetime
from aisdb.gis import dt_2_epoch
y1, x1 = 44.57039426840729, -63.52931373766157
y2, x2 = 44.51304767533133, -63.494075674952555
y3, x3 = 44.458038982492134, -63.535634138077945
y4, x4 = 44.393941339104074, -63.53826396955358
y5, x5 = 44.14245580737021, -64.16608964280064
t1 = dt_2_epoch( datetime(2021, 1, 1, 1) )
t2 = dt_2_epoch( datetime(2021, 1, 1, 2) )
t3 = dt_2_epoch( datetime(2021, 1, 1, 3) )
t4 = dt_2_epoch( datetime(2021, 1, 1, 4) )
t5 = dt_2_epoch( datetime(2021, 1, 1, 7) )
# creating a sample track
tracks_short = [
dict(
mmsi=123456789,
lon=np.array([x1, x2, x3, x4, x5]),
lat=np.array([y1, y2, y3, y4, y5]),
time=np.array([t1, t2, t3, t4, t5]),
dynamic=set(['lon', 'lat', 'time']),
static=set(['mmsi'])
)
]Here is what the sample track looks like:
The aisdb.webdata.shore_dist.ShoreDist class calculates the nearest distance to shore using a raster file containing shore-distance data. Currently, calling the get_distance function of ShoreDist automatically downloads the shore-distance raster file from our server. The function then merges the shore-distance values into the tracks in the provided track list, creating a new key, "km_from_shore", that stores the distance for each position.
from aisdb.webdata.shore_dist import ShoreDist
with ShoreDist(data_dir="./testdata/") as sdist:
# Getting distance from shore for each point in the track
for track in sdist.get_distance(tracks_short):
assert 'km_from_shore' in track['dynamic']
assert 'km_from_shore' in track.keys()
print(track['km_from_shore'])[ 1 3 2 9 14]Similar to acquiring the distance from shore, CoastDist is implemented to obtain the distance between the given track positions and the coastline.
from aisdb.webdata.shore_dist import CoastDist
with CoastDist(data_dir="./testdata/") as cdist:
# Getting distance from the coast for each point in the track
for track in cdist.get_distance(tracks_short):
assert 'km_from_coast' in track['dynamic']
assert 'km_from_coast' in track.keys()
print(track['km_from_coast'])[ 1 3 2 8 13]Like the distances from the coast and shore, the aisdb.webdata.shore_dist.PortDist class determines the distance between the track positions and the nearest ports.
from aisdb.webdata.shore_dist import PortDist
with PortDist(data_dir="./testdata/") as pdist:
# Getting distance from the port for each point in the track
for track in pdist.get_distance(tracks_short):
assert 'km_from_port' in track['dynamic']
assert 'km_from_port' in track.keys()
print(track['km_from_port'])[ 4.72144175 7.47747231 4.60478449 11.5642271 28.62511253]This tutorial introduces visualization options for vessel trajectories processed using AISdb, including AISdb's integrated web interface and alternative approaches with popular Python visualization packages. Practical examples are provided for each tool, illustrating how to process and visualize AISdb tracks effectively.
AISdb provides an integrated data visualization feature through the aisdb.web_interface.visualize module, which allows users to generate interactive maps displaying vessel tracks. This built-in tool is designed for simplicity and ease of use, offering customizable visualizations directly from AIS data without requiring extensive setup.
Here is an example of using the web interface module to show queried data with colors. To display vessel tracks in a single color:
import aisdb
from datetime import datetime
from aisdb.database.dbconn import SQLiteDBConn
from aisdb import DBConn, DBQuery, DomainFromPoints
import nest_asyncio
nest_asyncio.apply()
dbpath='YOUR_DATABASE.db' # Define the path to your database
# Set the start and end times for the query
start_time = datetime.strptime("2018-01-01 00:00:00", '%Y-%m-%d %H:%M:%S')
end_time = datetime.strptime("2018-01-03 00:00:00", '%Y-%m-%d %H:%M:%S')
# Define a circle with a 100km radius around the location point
domain = DomainFromPoints(points=[(-63.6, 44.6)], radial_distances=[100000])
def color_tracks(tracks):
""" Set the color of each vessel track using a color name or RGB value. """
for track in tracks:
track['color'] = 'yellow'
yield track
with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
qry = aisdb.DBQuery(
dbconn=dbconn, start=start_time, end=end_time,
xmin=domain.boundary['xmin'], xmax=domain.boundary['xmax'],
ymin=domain.boundary['ymin'], ymax=domain.boundary['ymax'],
callback=aisdb.database.sqlfcn_callbacks.in_time_bbox_validmmsi,
)
rowgen = qry.gen_qry()
tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
colored_tracks = color_tracks(tracks)
# Visualization
aisdb.web_interface.visualize(
colored_tracks,
domain=domain,
visualearth=True,
open_browser=True,
)If you want to visualize vessel tracks in different colors based on MMSI, here's an example that demonstrates how to color-code tracks for easy identification:
import random
def color_tracks2(tracks):
colors = {}
for track in tracks:
mmsi = track.get('mmsi')
if mmsi not in colors:
# Assign a random color to this MMSI if not already assigned
colors[mmsi] = "#{:06x}".format(random.randint(0, 0xFFFFFF))
track['color'] = colors[mmsi] # Set the color for the current track
yield track
with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
qry = aisdb.DBQuery(
dbconn=dbconn, start=start_time, end=end_time,
xmin=domain.boundary['xmin'], xmax=domain.boundary['xmax'],
ymin=domain.boundary['ymin'], ymax=domain.boundary['ymax'],
callback=aisdb.database.sqlfcn_callbacks.in_time_bbox_validmmsi,
)
rowgen = qry.gen_qry()
tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
colored_tracks = list(color_tracks2(tracks))
# Visualization
aisdb.web_interface.visualize(
colored_tracks,
domain=domain,
visualearth=True,
open_browser=True,
)Users seeking more advanced or specialized visualization capabilities can leverage several alternative Python packages. For instance, Contextily, Basemap, and Cartopy are excellent for creating detailed 2D plots, while Plotly offers powerful interactive graphs. Additionally, Kepler.gl caters to users needing dynamic, large-scale visualizations or 3D mapping. These alternatives allow for a deeper exploration of AIS data, offering flexibility in how data is presented and analyzed beyond the default capabilities of AISdb.
import aisdb
from datetime import datetime
from aisdb.database.dbconn import SQLiteDBConn
from aisdb import DBConn, DBQuery, DomainFromPoints
import contextily as cx
import matplotlib.pyplot as plt
import random
import nest_asyncio
nest_asyncio.apply()
dbpath='YOUR_DATABASE.db' # Define the path to your database
# Set the start and end times for the query
start_time = datetime.strptime("2018-01-01 00:00:00", '%Y-%m-%d %H:%M:%S')
end_time = datetime.strptime("2018-01-03 00:00:00", '%Y-%m-%d %H:%M:%S')
# Define a circle with a 100km radius around the location point
domain = DomainFromPoints(points=[(-63.6, 44.6)], radial_distances=[100000])
def color_tracks2(tracks):
colors = {}
for track in tracks:
mmsi = track.get('mmsi')
if mmsi not in colors:
# Assign a random color to this MMSI if not already assigned
colors[mmsi] = "#{:06x}".format(random.randint(0, 0xFFFFFF))
track['color'] = colors[mmsi] # Set the color for the current track
yield track
def plot_tracks_with_contextily(tracks):
plt.figure(figsize=(12, 8))
for track in tracks:
plt.plot(track['lon'], track['lat'], color=track['color'], linewidth=2)
# Add basemap
cx.add_basemap(plt.gca(), crs='EPSG:4326', source=cx.providers.CartoDB.Positron)
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Vessel Tracks with Basemap')
plt.show()
with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
qry = aisdb.DBQuery(
dbconn=dbconn, start=start_time, end=end_time,
xmin=domain.boundary['xmin'], xmax=domain.boundary['xmax'],
ymin=domain.boundary['ymin'], ymax=domain.boundary['ymax'],
callback=aisdb.database.sqlfcn_callbacks.in_time_bbox_validmmsi,
)
rowgen = qry.gen_qry()
tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
colored_tracks = list(color_tracks2(tracks))
plot_tracks_with_contextily(colored_tracks)Note: mpl_toolkits.basemap requires NumPy v1; downgrade NumPy to v1.26.4 to use Basemap, or use one of the other alternatives mentioned, such as Contextily.
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
def plot_tracks_with_basemap(tracks):
plt.figure(figsize=(12, 8))
# Define the geofence boundaries
llcrnrlat = 42.854329883666175 # Latitude of the southwest corner
urcrnrlat = 47.13666808816243 # Latitude of the northeast corner
llcrnrlon = -68.73998377599209 # Longitude of the southwest corner
urcrnrlon = -56.92378296577808 # Longitude of the northeast corner
# Create the Basemap object with the geofence
m = Basemap(projection='merc',
llcrnrlat=llcrnrlat, urcrnrlat=urcrnrlat,
llcrnrlon=llcrnrlon, urcrnrlon=urcrnrlon, resolution='i')
m.drawcoastlines()
m.drawcountries()
m.drawmapboundary(fill_color='aqua')
m.fillcontinents(color='lightgreen', lake_color='aqua')
for track in tracks:
lons, lats = track['lon'], track['lat']
x, y = m(lons, lats)
m.plot(x, y, color=track['color'], linewidth=2)
plt.show()
with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
qry = aisdb.DBQuery(
dbconn=dbconn, start=start_time, end=end_time,
xmin=domain.boundary['xmin'], xmax=domain.boundary['xmax'],
ymin=domain.boundary['ymin'], ymax=domain.boundary['ymax'],
callback=aisdb.database.sqlfcn_callbacks.in_time_bbox_validmmsi,
)
rowgen = qry.gen_qry()
tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
colored_tracks = list(color_tracks2(tracks))
plot_tracks_with_basemap(colored_tracks)
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
def plot_tracks_with_cartopy(tracks):
plt.figure(figsize=(12, 8))
ax = plt.axes(projection=ccrs.Mercator())
ax.coastlines()
for track in tracks:
lons, lats = track['lon'], track['lat']
ax.plot(lons, lats, transform=ccrs.PlateCarree(), color=track['color'], linewidth=2)
plt.title('AIS Tracks Visualization with Cartopy')
plt.show()
with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
qry = aisdb.DBQuery(
dbconn=dbconn, start=start_time, end=end_time,
xmin=domain.boundary['xmin'], xmax=domain.boundary['xmax'],
ymin=domain.boundary['ymin'], ymax=domain.boundary['ymax'],
callback=aisdb.database.sqlfcn_callbacks.in_time_bbox_validmmsi,
)
rowgen = qry.gen_qry()
tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
colored_tracks = list(color_tracks2(tracks))
plot_tracks_with_cartopy(colored_tracks)
import plotly.graph_objects as go
import plotly.express as px
import pandas as pd
def track2dataframe(tracks):
data = []
# Iterate over each track in the vessels_generator
for track in tracks:
# Unpack static information
mmsi = track['mmsi']
rot = track['rot']
maneuver = track['maneuver']
heading = track['heading']
# Unpack dynamic information
times = track['time']
lons = track['lon']
lats = track['lat']
cogs = track['cog']
sogs = track['sog']
utc_seconds = track['utc_second']
# Iterate over the dynamic arrays and create a row for each time point
for i in range(len(times)):
data.append({
'mmsi': mmsi,
'rot': rot,
'maneuver': maneuver,
'heading': heading,
'time': times[i],
'longitude': lons[i],
'latitude': lats[i],
'cog': cogs[i],
'sog': sogs[i],
'utc_second': utc_seconds[i],
})
# Convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(data)
return df
def plotly_visualize(data, visual_type='lines'):
if (visual_type=='scatter'):
# Create a scatter plot for the vessel data points using scatter_geo
fig = px.scatter_geo(
data,
lat="latitude",
lon="longitude",
color="mmsi", # Color by vessel identifier
hover_name="mmsi",
hover_data={"time": True},
title="Vessel Data Points"
)
else:
# Create a line plot for the vessel trajectory using scatter_geo
fig = px.line_geo(
data,
lat="latitude",
lon="longitude",
color="mmsi", # Color by vessel identifier
hover_name="mmsi",
hover_data={"time": True},
)
# Set the map style and projection
fig.update_geos(
projection_type="azimuthal equal area", # Change this to 'natural earth', 'azimuthal equal area', etc.
showland=True,
landcolor="rgb(243, 243, 243)",
countrycolor="rgb(204, 204, 204)",
lonaxis=dict(range=[-68.73998377599209, -56.92378296577808]), # Longitude range (geofence)
lataxis=dict(range=[42.854329883666175, 47.13666808816243]) # Latitude range (geofence)
)
# Set the layout to focus on a specific area or zoom level
fig.update_layout(
geo=dict(
projection_type="mercator",
center={"lat": 44.5, "lon": -63.5},
),
width=900, # Increase the width of the plot
height=700, # Increase the height of the plot
)
fig.show()
with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
qry = aisdb.DBQuery(
dbconn=dbconn, start=start_time, end=end_time,
xmin=domain.boundary['xmin'], xmax=domain.boundary['xmax'],
ymin=domain.boundary['ymin'], ymax=domain.boundary['ymax'],
callback=aisdb.database.sqlfcn_callbacks.in_time_bbox_validmmsi,
)
rowgen = qry.gen_qry()
tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
df = track2dataframe(tracks)
plotly_visualize(df, 'lines')
import pandas as pd
from keplergl import KeplerGl
def visualize_with_kepler(data, config=None):
map_1 = KeplerGl(height=600)
map_1.add_data(data=data, name="AIS Data")
map_1.save_to_html(file_name='./figure/kepler_map.html')
return map_1
with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
qry = aisdb.DBQuery(
dbconn=dbconn, start=start_time, end=end_time,
xmin=domain.boundary['xmin'], xmax=domain.boundary['xmax'],
ymin=domain.boundary['ymin'], ymax=domain.boundary['ymax'],
callback=aisdb.database.sqlfcn_callbacks.in_time_bbox_validmmsi,
)
rowgen = qry.gen_qry()
tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
df = track2dataframe(tracks)
map_1 = visualize_with_kepler(df)A common issue with AIS data is noise, where multiple vessels may broadcast using the same identifier simultaneously. AISdb incorporates data cleaning techniques to remove noise from vessel track data. For more details:
Denoising with Encoder: The aisdb.denoising_encoder.encode_greatcircledistance() function checks the approximate distance between each of a vessel's consecutive positions. It splits tracks where a vessel could not reasonably have travelled along the most direct path, for example where the implied speed exceeds 50 knots.
Distance and Speed Thresholds: Distance and speed thresholds limit the maximum distance or time between messages that can be considered continuous.
Scoring and Segment Concatenation: A score is computed for each position delta; sequential messages that are close together and received at shorter intervals are given a higher score. This score is calculated by dividing the Haversine distance by the elapsed time. Any delta with a score below the minimum threshold is considered the start of a new segment. New segments are compared to the ends of existing segments with the same vessel identifier; if the score exceeds the minimum, they are concatenated. If multiple segments meet the minimum score, the new segment is concatenated to the existing segment with the highest score.
Processing functions may be executed in sequence as a processing chain or pipeline, so after segmenting the individual voyages, results can be input into the encoder to remove noise and correct for vessels with duplicate identifiers effectively.
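As a simplified illustration of the distance and speed thresholding described above (not the encoder's internal scoring function), consecutive positions implying an unrealistic speed can be flagged like this; the sketch assumes the track's time values are in seconds:
import numpy as np
import aisdb

def flag_unreasonable_jumps(track, max_speed_knots=50):
    # Distance in meters between consecutive positions (see aisdb.gis.delta_meters above)
    meters = aisdb.gis.delta_meters(track)
    # Elapsed time between consecutive messages (assumed to be in seconds here)
    seconds = np.diff(np.asarray(track['time'], dtype=float))
    knots = (meters / seconds) * 1.943844  # convert m/s to knots
    # True where the implied speed is faster than a vessel could reasonably travel
    return knots > max_speed_knots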
import aisdb
from datetime import datetime, timedelta
from aisdb import DBConn, DBQuery, DomainFromPoints
dbpath='YOUR_DATABASE.db' # Define the path to your database
# Set the start and end times for the query
start_time = datetime.strptime("2018-01-01 00:00:00", '%Y-%m-%d %H:%M:%S')
end_time = datetime.strptime("2018-01-02 00:00:00", '%Y-%m-%d %H:%M:%S')
# A circle with a 50 km radius around the location point
domain = DomainFromPoints(points=[(-63.6, 44.6)], radial_distances=[50000])
maxdelta = timedelta(hours=24) # the maximum time interval
distance_threshold = 20000 # the maximum allowed distance (meters) between consecutive AIS messages
speed_threshold = 50 # the maximum allowed vessel speed (in knots) between consecutive AIS messages
minscore = 1e-6 # the minimum score threshold for track segment validation
with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
qry = aisdb.DBQuery(
dbconn=dbconn, start=start_time, end=end_time,
callback=aisdb.database.sqlfcn_callbacks.in_timerange_validmmsi,
)
rowgen = qry.gen_qry()
tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
# Split the tracks into segments based on the maximum time interval
track_segments = aisdb.split_timedelta(tracks, maxdelta)
# Encode the track segments to clean and validate the track data
tracks_encoded = aisdb.encode_greatcircledistance(track_segments,
distance_threshold=distance_threshold,
speed_threshold=speed_threshold,
minscore=minscore)
tracks_colored = color_tracks(tracks_encoded) # color_tracks() as defined in the web interface examples above
aisdb.web_interface.visualize(
tracks_colored,
domain=domain,
visualearth=True,
open_browser=True,
)After segmentation and encoding, the tracks are shown as:
For comparison, here is a snapshot of the tracks before cleaning:
Track interpolation with AISdb involves generating estimated positions of vessels at specific intervals when actual AIS data points are unavailable. This process is important for filling in gaps in the vessel's trajectory, which can occur due to signal loss, data filtering, or other disruptions.
In this tutorial, we introduce different types of track interpolation implemented in AISdb with usage examples.
First, we define functions to transform and visualize the track data (a generator object), with options to view either the data points or the tracks:
import plotly.graph_objects as go
import plotly.express as px
import pandas as pd
def track2dataframe(tracks):
data = []
for track in tracks:
times = track['time']
mmsi = track['mmsi']
lons = track['lon']
lats = track['lat']
# Iterate over the dynamic arrays and create a row for each time point
for i in range(len(times)):
data.append({
'mmsi': mmsi,
'time': times[i],
'longitude': lons[i],
'latitude': lats[i],
})
return pd.DataFrame(data)
def plotly_visualize(data, visual_type='lines'):
if (visual_type=='scatter'):
# Create a scatter plot for the vessel data points using scatter_geo
fig = px.scatter_geo(
data,
lat="latitude",
lon="longitude",
color="mmsi", # Color by vessel identifier
hover_name="mmsi",
hover_data={"time": True},
title="Vessel Data Points"
)
else:
# Create a line plot for the vessel trajectory using line_geo
fig = px.line_geo(
data,
lat="latitude",
lon="longitude",
color="mmsi", # Color by vessel identifier
hover_name="mmsi",
hover_data={"time": True},
title="Vessel Trajectory"
)
# Set the map style and projection
fig.update_geos(
projection_type="azimuthal equal area", # Change this to 'natural earth', 'azimuthal equal area', etc.
showland=True,
landcolor="rgb(243, 243, 243)",
countrycolor="rgb(204, 204, 204)",
)
# Set the layout to focus on a specific area or zoom level
fig.update_layout(
geo=dict(
projection_type="azimuthal equal area",
),
width=1200, # Increase the width of the plot
height=800, # Increase the height of the plot
)
fig.show()We will use an actual track retrieved from the database for the examples in this tutorial and interpolate additional data points based on this track. The visualization will show the original track data points:
import aisdb
import numpy as np
import nest_asyncio
from aisdb import DBConn, DBQuery
from datetime import timedelta, datetime
nest_asyncio.apply()
dbpath='YOUR_DATABASE.db' # Define the path to your database
MMSI = 636017611 # MMSI of the vessel
# Set the start and end times for the query
start_time = datetime.strptime("2018-03-10 00:00:00", '%Y-%m-%d %H:%M:%S')
end_time = datetime.strptime("2018-03-31 00:00:00", '%Y-%m-%d %H:%M:%S')
with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
qry = aisdb.DBQuery(
dbconn=dbconn, start=start_time, end=end_time, mmsi = MMSI,
callback=aisdb.database.sqlfcn_callbacks.in_timerange_hasmmsi,
)
rowgen = qry.gen_qry()
tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
# Visualize the original track data points
df = track2dataframe(tracks)
plotly_visualize(df, 'scatter')Linear interpolation estimates the vessel's position by drawing a straight line between two known points and calculating the positions at intermediate times. It is simple, fast, and straightforward but may not accurately represent complex movements.
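As a quick illustration of the underlying formula (independent of AISdb), positions at intermediate times between two known fixes can be computed with NumPy; the coordinates and timestamps below are made-up values:
import numpy as np

# Two known fixes: time in seconds and position in decimal degrees (illustrative values)
t1, lon1, lat1 = 0, -63.54, 44.61
t2, lon2, lat2 = 600, -63.55, 44.62

# Estimate positions every 100 seconds between the two fixes
t = np.arange(t1, t2 + 1, 100)
lon = np.interp(t, [t1, t2], [lon1, lon2])
lat = np.interp(t, [t1, t2], [lat1, lat2])
print(list(zip(t, lon, lat)))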
This method estimates the position of a vessel at regular time intervals (e.g., every 10 minutes). To perform linear interpolation with an equal time window on the track defined above:
with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
qry = aisdb.DBQuery(
dbconn=dbconn, start=start_time, end=end_time, mmsi = MMSI,
callback=aisdb.database.sqlfcn_callbacks.in_timerange_hasmmsi,
)
rowgen = qry.gen_qry()
tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
tracks__ = aisdb.interp.interp_time(tracks, timedelta(minutes=10))
df = track2dataframe(tracks__)
plotly_visualize(df)This method estimates the position of a vessel at regular spatial intervals (e.g., every 1 km along its path). To perform linear interpolation with equal distance intervals on the track defined above:
with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
qry = aisdb.DBQuery(
dbconn=dbconn, start=start_time, end=end_time, mmsi = MMSI,
callback=aisdb.database.sqlfcn_callbacks.in_timerange_hasmmsi,
)
rowgen = qry.gen_qry()
tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
tracks__ = aisdb.interp.interp_spacing(spacing=500, tracks=tracks)
# Visualizing the tracks
df = track2dataframe(tracks__)
plotly_visualize(df)This method estimates the positions of a vessel along a curved path using the principles of geometry, particularly involving great-circle routes.
with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
qry = aisdb.DBQuery(
dbconn=dbconn, start=start_time, end=end_time, mmsi = MMSI,
callback=aisdb.database.sqlfcn_callbacks.in_timerange_hasmmsi,
)
rowgen = qry.gen_qry()
tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
tracks__ = aisdb.interp.geo_interp_time(tracks, timedelta(minutes=10))
df = track2dataframe(tracks__)
plotly_visualize(df)Given a set of data points, cubic spline interpolation fits a smooth curve through these points. The curve is represented as a series of cubic polynomials between each pair of data points. Each polynomial ensures a smooth curve at the data points (i.e., the first and second derivatives are continuous).
with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
qry = aisdb.DBQuery(
dbconn=dbconn, start=start_time, end=end_time, mmsi = MMSI,
callback=aisdb.database.sqlfcn_callbacks.in_timerange_hasmmsi,
)
rowgen = qry.gen_qry()
tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
tracks__ = aisdb.interp.interp_cubic_spline(tracks, timedelta(minutes=10))
# Visualizing the tracks
df = track2dataframe(tracks__)
plotly_visualize(df)In addition to the standard interpolation methods provided by AISdb, users can implement other interpolation techniques tailored to their specific analytical needs. For instance, B-spline (Basis Spline) interpolation is a mathematical technique that creates a smooth curve through data points. This smoothness is important in trajectory analysis as it avoids sharp, unrealistic turns and maintains a natural flow.
Here is an implementation and example of using B-splines interpolation:
import warnings
import numpy as np
from scipy.interpolate import splrep, splev
def bspline_interpolation(track, key, intervals):
"""
Perform B-Spline interpolation for a specific key on the track data.
Parameters:
- track: Dictionary containing vessel track data (time, lat, lon, etc.).
- key: The dynamic key (e.g., 'lat', 'lon') for which interpolation is performed.
- intervals: The equal time or distance intervals at which interpolation is required.
Returns:
- Interpolated values for the specified key.
"""
# Get time and the key values (e.g., lat/lon) for interpolation
times = track['time']
values = track[key]
# Create the B-Spline representation of the curve
tck = splrep(times, values, s=0) # s=0 means no smoothing, exact fit to data
# Interpolate the values at the given intervals
interpolated_values = splev(intervals, tck)
return interpolated_values
def interp_bspline(tracks, step=1000):
"""
Perform B-Spline interpolation on vessel trajectory data at equal time intervals.
Parameters:
- tracks: List of vessel track dictionaries.
- step: Step for interpolation (can be time or distance-based).
Yields:
- Dictionary containing interpolated lat and lon values for each track.
"""
for track in tracks:
if len(track['time']) <= 1:
warnings.warn('Cannot interpolate track of length 1, skipping...')
continue
# Generate equal time intervals based on the first and last time points
intervals = np.arange(track['time'][0], track['time'][-1], step)
# Perform B-Spline interpolation for lat and lon
interpolated_lat = bspline_interpolation(track, 'lat', intervals)
interpolated_lon = bspline_interpolation(track, 'lon', intervals)
# Yield interpolated track
itr = dict(
mmsi=track['mmsi'],
lat=interpolated_lat,
lon=interpolated_lon,
time=intervals # Including interpolated time intervals for reference
)
yield itrThen, we can apply the function we just implemented to the vessel tracks generator:
with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
qry = aisdb.DBQuery(
dbconn=dbconn, start=start_time, end=end_time, mmsi = MMSI,
callback=aisdb.database.sqlfcn_callbacks.in_timerange_hasmmsi,
)
rowgen = qry.gen_qry()
tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
tracks__ = interp_bspline(tracks)
# Visualizing the tracks
df = track2dataframe(tracks__)
plotly_visualize(df)The visualization of the interpolated track is shown below:
Automatic Identification System (AIS) data provides a wealth of insights into maritime activities, including vessel movements and traffic patterns. However, the massive volume of AIS data, often consisting of millions or even billions of GPS position points, can be overwhelming. Processing and visualizing this raw data directly can be computationally expensive, slow, and difficult to interpret.
This is where AISdb's decimation comes into play: it helps users efficiently reduce data clutter, making it easier to extract and focus on the most relevant information.
Decimation, in simple terms, means reducing the number of data points. When applied to AIS tracks, it involves selectively removing GPS points from a vessel’s trajectory while preserving its overall shape and key characteristics. Rather than processing every recorded position, decimation algorithms identify and retain the most relevant points, optimizing data efficiency without significant loss of accuracy.
Think of it like simplifying a drawing: instead of using thousands of tiny dots to represent a complex image, you can use fewer, strategically chosen points to capture its essence. Similarly, decimation represents a vessel's path with fewer points while maintaining its core trajectory, making analysis and visualization more efficient.
There are several key benefits for using decimation techniques when working with AIS data:
Improved Performance and Efficiency: Reducing the number of data points can dramatically decrease the computational load, enabling faster analyses, quicker visualizations, and more effective workflow, especially when dealing with large datasets.
Clearer Visualizations: Dense tracks can clutter visualizations and make it difficult to interpret the data. Decimation simplifies the tracks, emphasizing significant movements and patterns for more intuitive analysis.
Noise Reduction: While decimation is not designed as a noise removal technique, it can help smooth out minor inaccuracies and high-frequency fluctuations from raw GPS data. This can be useful for focusing on broader trends and vessel movements.
simplify_linestring_idx()
In AISdb, the TrackGen() method includes a decimate parameter that, when set to True, triggers the simplify_linestring_idx(x, y, precision) function. This function uses the Visvalingam-Whyatt algorithm to simplify vessel tracks while preserving key trajectory details.
The Visvalingam-Whyatt algorithm is an approach to line simplification. It works by removing points that contribute the least to the overall shape of the line. Here’s how it works:
The algorithm measures the importance of a point by calculating the area of the triangle formed by that point and its adjacent points.
Points on relatively straight segments form smaller triangles, meaning they’re less important in defining the shape.
Points at curves and corners form larger triangles, signaling that they’re crucial for maintaining the line’s characteristic form.
The algorithm iteratively removes the points with the smallest triangle areas until the desired level of simplification is achieved. In AISdb, this process is controlled by the decimate parameter in the TrackGen() method.
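As a conceptual illustration of the Visvalingam-Whyatt idea, here is a naive pure-Python sketch (not AISdb's internal implementation):
import numpy as np

def triangle_area(p0, p1, p2):
    # Area of the triangle formed by three (x, y) points (shoelace formula)
    return 0.5 * abs((p1[0] - p0[0]) * (p2[1] - p0[1]) - (p2[0] - p0[0]) * (p1[1] - p0[1]))

def visvalingam(points, n_keep):
    # Repeatedly drop the interior point whose triangle with its neighbours has the smallest area
    pts = list(points)
    while len(pts) > n_keep:
        areas = [triangle_area(pts[i - 1], pts[i], pts[i + 1]) for i in range(1, len(pts) - 1)]
        pts.pop(int(np.argmin(areas)) + 1)  # +1 because areas are indexed over interior points only
    return pts

# Simplify a small zigzag line from 6 points down to 4
line = [(0, 0), (1, 0.1), (2, -0.1), (3, 0.05), (4, 0), (5, 0)]
print(visvalingam(line, 4))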
TrackGen(..., decimate=True) with AISdb Tracks
Below is a conceptual Python example that demonstrates how to apply decimation to AIS tracks:
import aisdb
import numpy as np
# Assuming you have a database connection and domain set up as described
with aisdb.SQLiteDBConn(dbpath='your_ais_database.db') as dbconn:
qry = aisdb.DBQuery(
dbconn=dbconn,
start='2023-01-01', end='2023-01-02', # Example time range
xmin=-10, xmax=0, ymin=40, ymax=50, # Example bounding box
callback=aisdb.database.sqlfcn_callbacks.in_validmmsi_bbox,
)
simplified_tracks = aisdb.TrackGen(qry.gen_qry(), decimate=True) # Generate initial tracks
for segment in simplified_tracks:
print(f"Simplified track for MMSI: {segment['mmsi']}, Points: {segment['lon'].size}")
simplify_linestring_idx() with AISdb Tracks
To get finer control over the decimation precision, use the simplify_linestring_idx function in AISdb directly.
import aisdb
import numpy as np
# Assuming you have a database connection and domain set up as described
with aisdb.SQLiteDBConn(dbpath='your_ais_database.db') as dbconn:
qry = aisdb.DBQuery(
dbconn=dbconn,
start='2023-01-01', end='2023-01-02', # Example time range
xmin=-10, xmax=0, ymin=40, ymax=50, # Example bounding box
callback=aisdb.database.sqlfcn_callbacks.in_validmmsi_bbox,
)
tracks = aisdb.TrackGen(qry.gen_qry(), decimate=False) # Generate initial tracks
simplified_tracks = []
for track in tracks:
if track['lon'].size > 2: # Ensure track has enough points
# Apply decimation using simplify_linestring_idx
simplified_indices = aisdb.track_gen.simplify_linestring_idx(
track['lon'], track['lat'], precision=0.01 # Example precision
)
# Extract simplified track points
simplified_track = {
'mmsi': track['mmsi'],
'time': track['time'][simplified_indices],
'lon': track['lon'][simplified_indices],
'lat': track['lat'][simplified_indices],
# Carry over other relevant track attributes as needed
}
simplified_tracks.append(simplified_track)
else:
simplified_tracks.append(track) # Keep tracks with few points as is
# Now 'simplified_tracks' contains decimated tracks ready for further analysis
for segment in simplified_tracks:
print(f"Simplified track for MMSI: {segment['mmsi']}, Points: {segment['lon'].size}")
Precision: The precision parameter controls the level of simplification. A smaller value (e.g., 0.001) results in more retained points and higher fidelity, while a larger value (e.g., 0.1) simplifies the track further with fewer points.
x, y: These are NumPy arrays representing the longitude and latitude coordinates of the track points.
TrackGen Integration: In this workflow, tracks are first generated with aisdb.TrackGen (with decimate=False), and simplify_linestring_idx() is then applied to each track individually.
Iterative Refinement: Decimation is often an iterative process. You may need to visualize the decimated tracks, assess the level of simplification, and adjust the precision to balance simplification with data fidelity.
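For example, the effect of different precision values can be compared by counting how many points each setting retains. This sketch reuses the tracks generator from the example above (regenerate it first, since generators can only be consumed once):
# Compare retained point counts across precision values for the first track
for track in tracks:
    if track['lon'].size <= 2:
        continue
    for precision in (0.001, 0.01, 0.1):
        idx = aisdb.track_gen.simplify_linestring_idx(track['lon'], track['lat'], precision=precision)
        print(f"MMSI {track['mmsi']}: precision={precision} -> kept {len(idx)} of {track['lon'].size} points")
    break  # inspect only the first track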
Decimation is a powerful tool for simplifying and decluttering AIS data. By intelligently reducing the data's complexity, AISdb's simplify_linestring_idx() and TrackGen() allow you to process data more efficiently, create clearer visualizations, and gain deeper insights from your maritime data. Experiment with different precision values, and discover how "less" data can lead to "more" meaningful results in your AIS analysis workflows!
Amigo D, Sánchez Pedroche D, García J, Molina JM. Review and classification of trajectory summarisation algorithms: From compression to segmentation. International Journal of Distributed Sensor Networks. 2021;17(10). doi:10.1177/15501477211050729
This section demonstrates integrating AIS data with external bathymetric data to enrich our analysis. In the following example, we identified all vessels within a 500-kilometer radius around the central area of Halifax, Canada, on January 1, 2018.
First, we imported the necessary packages and prepared the bathymetry data. It’s important to note that the downloaded bathymetric data is divided into eight segments, organized by latitude and longitude. In a later step, you will need to select the appropriate bathymetric raster file based on the geographical region covered by your vessel track data.
import os
import aisdb
import nest_asyncio
from datetime import datetime
from aisdb.database.dbconn import SQLiteDBConn
from aisdb import DBConn, DBQuery, DomainFromPoints
nest_asyncio.apply()
# set the path to the data storage directory
bathymetry_data_dir = "./bathymetry_data/"
# check if the directory exists
if not os.path.exists(bathymetry_data_dir):
os.makedirs(bathymetry_data_dir)
# check if the directory is empty
if os.listdir(bathymetry_data_dir) == []:
# download the bathymetry data
bathy = aisdb.webdata.bathymetry.Gebco(data_dir=bathymetry_data_dir)
bathy.fetch_bathymetry_grid()
else:
print("Bathymetry data already exists.")We defined a coloring criterion to classify tracks based on their average depths relative to the bathymetry. Tracks that traverse shallow waters with an average depth of less than 100 meters are colored in yellow. Those spanning depths between 100 and 1,000 meters are represented in orange, indicating a transition to deeper waters. As the depth increases, tracks reaching up to 20 kilometers are marked pink. The deepest tracks, descending beyond 20 kilometers, are distinctly colored in red.
def add_color(tracks):
for track in tracks:
# Calculate the average coastal distance
avg_coast_distance = sum(abs(dist) for dist in track['coast_distance']) / len(track['coast_distance'])
# Determine the color based on the average coastal distance
if avg_coast_distance <= 100:
track['color'] = "yellow"
elif avg_coast_distance <= 1000:
track['color'] = "orange"
elif avg_coast_distance <= 20000:
track['color'] = "pink"
else:
track['color'] = "red"
yield trackNext, we query the AIS data to be integrated with the bathymetric raster file and apply the coloring function to mark the tracks based on their average depths relative to the bathymetry.
dbpath = 'YOUR_DATABASE.db' # define the path to your database
end_time = datetime.strptime("2018-01-02 00:00:00", '%Y-%m-%d %H:%M:%S')
start_time = datetime.strptime("2018-01-01 00:00:00", '%Y-%m-%d %H:%M:%S')
domain = DomainFromPoints(points=[(-63.6, 44.6)], radial_distances=[500000])
with SQLiteDBConn(dbpath=dbpath) as dbconn:
qry = DBQuery(
dbconn=dbconn, start=start_time, end=end_time,
xmin=domain.boundary['xmin'], xmax=domain.boundary['xmax'],
ymin=domain.boundary['ymin'], ymax=domain.boundary['ymax'],
callback=aisdb.database.sqlfcn_callbacks.in_time_bbox_validmmsi,
)
tracks = aisdb.track_gen.TrackGen(qry.gen_qry(), decimate=False)
# Merge the tracks with the raster data
raster_path = "./bathymetry_data/gebco_2022_n90.0_s0.0_w-90.0_e0.0.tif"
raster = aisdb.webdata.load_raster.RasterFile(raster_path)
tracks_raster = raster.merge_tracks(tracks, new_track_key="coast_distance")
# Add color to the tracks
tracks_colored = add_color(tracks_raster)
if __name__ == '__main__':
aisdb.web_interface.visualize(
tracks_colored,
visualearth=True,
open_browser=True,
)
The integrated results are color-coded and can be visualized as shown below:
Example of using bathymetry data to color-code vessel tracks based on their average depth:
Yellow: Tracks with an average depth of less than 100 meters (shallow waters).
Orange: Tracks with an average depth between 100 and 1,000 meters (transition to deeper waters).
Pink: Tracks with an average depth between 1,000 and 20,000 meters (deeper waters).
Red: Tracks with an average depth greater than 20,000 meters (deepest waters).
This tutorial introduces integrating weather data from GRIB files with AIS data for enhanced vessel tracking analysis. Practical examples are provided below illustrating how to integrate AISdb tracks with the weather data in GRIB files.
To directly work with the jupyter notebook, click here: https://github.com/AISViz/AISdb/blob/master/examples/weather.ipynb
Before diving in, users are expected to have the following set-up: a Copernicus CDS account (free) to access ERA5 data, which can be obtained through the ECMWF-Signup, and AISdb set up either locally or remotely. Refer to the AISDB-Installation documentation for detailed instructions and configuration options.
Once you have a Copernicus CDS account and AISdb installed, you can download weather data in GRIB format directly from the CDS and use AISdb to extract specific variables from those files.
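If the CDS download relies on the standard cdsapi client under the hood (an assumption; check your AISdb version), the client reads its credentials from a ~/.cdsapirc file. The snippet below writes a placeholder configuration; replace the token with the personal access token from your CDS profile page and confirm the expected url/key format in the CDS API documentation.
from pathlib import Path

# Write a placeholder ~/.cdsapirc for the cdsapi client (values are placeholders)
cdsapirc = Path.home() / ".cdsapirc"
if not cdsapirc.exists():
    cdsapirc.write_text(
        "url: https://cds.climate.copernicus.eu/api\n"
        "key: <YOUR-CDS-API-TOKEN>\n"
    )
    print(f"Wrote placeholder CDS credentials to {cdsapirc}")
else:
    print(f"Found existing CDS credentials at {cdsapirc}")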
AISdb supports both zipped (.zip) and uncompressed GRIB (.grib) files. These files should be named using the yyyy-mm format (e.g., 2023-03.grib or 2023-03.zip) and placed in a folder such as /home/CanadaV2.
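As a quick sanity check of this naming convention, a small helper (hypothetical, not part of AISdb) can flag any files in the folder that do not follow the yyyy-mm pattern:
import os
import re

def check_grib_names(folder="/home/CanadaV2"):
    # Flag files that do not match the yyyy-mm.grib / yyyy-mm.zip naming scheme
    pattern = re.compile(r"^\d{4}-\d{2}\.(grib|zip)$")
    for name in sorted(os.listdir(folder)):
        status = "ok" if pattern.match(name) else "unexpected name"
        print(f"{name}: {status}")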
With the WeatherDataStore class in AISdb, you can specify:
The desired weather variables (e.g., '10u', '10v', 'tp'),
A date range (e.g., August 1 to August 30, 2023),
And the directory where the GRIB files are stored.
To automatically download GRIB files from the Copernicus Climate Data Store (CDS), AISdb provides a convenient option using the WeatherDataStore class.
Simply set the parameter download_from_cds=True, and specify the required weather variable short names, date range, target area, and output path.
To extract weather data for specific latitude, longitude, and timestamp values, call the yield_tracks_with_weather() method. This returns a dictionary containing the requested weather variables for each location-time pair.
from aisdb.weather.data_store import WeatherDataStore # for weather
# ...some code before...
with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
qry = aisdb.DBQuery(
dbconn=dbconn,
start=start_time,
end=end_time,
callback=aisdb.database.sqlfcn_callbacks.in_timerange_validmmsi,
)
rowgen = qry.gen_qry()
# Convert queried rows to vessel trajectories
tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
# Mention the short_names for the required weather data from the grib file
weather_data_store = WeatherDataStore(short_names=['10u', '10v', 'tp'], start=start_time, end=end_time, weather_data_path=".", download_from_cds=False, area=[-70, 45, -58, 53])
tracks = weather_data_store.yield_tracks_with_weather(tracks)
for t in tracks:
print(f"'u-component' 10m wind for:\nlat: {t['lat'][0]} \nlon: {t['lon'][0]} \ntime: {t['time'][0]} \nis {t['weather_data']['u10'][0]} m/s")
break
weather_data_store.close()
Output:
'u-component' 10m wind for:
lat: 50.003334045410156
lon: -66.76000213623047
time: 1690858823
is 1.9680767059326172 m/s
What are short_names? In ECMWF (European Centre for Medium-Range Weather Forecasts) terminology, a "short name" is a concise, often abbreviated identifier that represents a specific meteorological parameter or variable within data files such as GRIB files. For example, "t2m" is the short name for "2-meter temperature".
For a list of short names for different weather components, refer to: https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation#heading-Parameterlistings
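For convenience, a few short names that appear in this tutorial (plus one common addition, 't2m') are listed below; the mapping is illustrative, not exhaustive:
# Illustrative subset of ERA5 short names and the variables they refer to
ERA5_SHORT_NAMES = {
    "10u": "10 metre U wind component",
    "10v": "10 metre V wind component",
    "tp": "total precipitation",
    "t2m": "2 metre temperature",
}
for short_name, description in ERA5_SHORT_NAMES.items():
    print(f"{short_name}: {description}")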
Let's work through an example where we retrieve AIS tracks from AISdb and call WeatherDataStore to add weather data to the tracks.
import aisdb
import nest_asyncio
from aisdb import DBQuery
from aisdb.database.dbconn import PostgresDBConn
from datetime import datetime
from PIL import ImageFile
from aisdb.weather.data_store import WeatherDataStore # for weather
nest_asyncio.apply()
ImageFile.LOAD_TRUNCATED_IMAGES = True
# >>> PostgreSQL Information <<<
db_user='' # DB User
db_dbname='aisviz' # DB Schema
db_password='' # DB Password
db_hostaddr='127.0.0.1' # DB Host address
dbconn = PostgresDBConn(
port=5555, # PostgreSQL port
user=db_user, # PostgreSQL username
dbname=db_dbname, # PostgreSQL database
host=db_hostaddr, # PostgreSQL address
password=db_password, # PostgreSQL password
)
Specify the region and duration for which you wish the tracks to be generated. TrackGen returns a generator containing all the dynamic and static column values of the AIS data.
xmin, ymin, xmax, ymax = -70, 45, -58, 53
gulf_bbox = [xmin, xmax, ymin, ymax]
start_time = datetime(2023, 8, 1)
end_time = datetime(2023, 8, 30)
qry = DBQuery(
dbconn=dbconn,
start=start_time, end=end_time,
xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax,
callback=aisdb.database.sqlfcn_callbacks.in_time_bbox_validmmsi
)
ais_tracks = []
rowgen = qry.gen_qry()
tracks = aisdb.track_gen.TrackGen(rowgen, decimate=True)
Next, to merge the tracks with the weather data, we use the WeatherDataStore class from aisdb.weather.data_store. By calling WeatherDataStore with the required weather components (using their short names) and providing the path to your GRIB files, it opens the files and returns an object through which you can perform further operations.
weather_data_store = WeatherDataStore(
short_names=['10u', '10v', 'tp'], # U & V wind components, total precipitation
start=start_time, # e.g., datetime(2023, 8, 1)
end=end_time, # e.g., datetime(2023, 8, 30)
weather_data_path=".", # Local folder to store downloaded GRIBs
download_from_cds=True, # Enable download from CDS
area=[-70, 45, -58, 53] # [west, south, east, north] in degrees
)
Here, 10u and 10v are the 10 metre U and V wind components, respectively.
By using the method weather_data_store.yield_tracks_with_weather(tracks), the tracks are concatenated with the weather data.
Example usage:
tracks_with_weather = weather_data_store.yield_tracks_with_weather(tracks)
for track in tracks_with_weather:
print(f"'u-component' 10m wind for:\nlat: {track['lat'][0]} \nlon: {track['lon'][0]} \ntime: {track['time'][0]} \nis {track['weather_data']['u10'][0]} m/s")
break
weather_data_store.close() # gracefully closes the opened GRIB file
By integrating weather data with AIS data, we can study the patterns of ship movement in relation to weather conditions. This integration allows us to analyze how factors such as wind, sea surface temperature, and atmospheric pressure influence vessel trajectories. By merging dynamic AIS data with detailed climate variables, we gain deeper insights into the environmental factors affecting shipping routes and can optimize vessel navigation for better efficiency and safety.
The Automatic Identification System (AIS) is a standardized and unencrypted self-reporting maritime surveillance system.
The protocol operates by transmitting one or more of 27 message types from an AIS transponder onboard a vessel at fixed time intervals. These intervals depend on the vessel’s status—stationary vessels (anchored or moored) transmit every 3 minutes, while fast-moving vessels transmit every 2 seconds.
These VHF radio messages are sent from the vessel’s transponder and received by either satellite or ground-based stations, enabling more detailed monitoring and analysis of maritime traffic.
Dynamic messages convey the vessel's real-time status, which can vary between transmissions. These include data such as Speed Over Ground (SOG), Course Over Ground (COG), Rate of Turn (ROT), and the vessel’s current position (latitude and longitude).
Static messages, on the other hand, provide information that remains constant over time. This includes details like the Maritime Mobile Service Identity (MMSI), International Maritime Organization (IMO) number, vessel name, call sign, type, dimensions, and intended destination.
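As a simplified illustration of the two categories (the field names below are for explanation only and are not the exact AISdb column names):
# Simplified, illustrative grouping of AIS fields by message category
dynamic_fields = ["mmsi", "time", "latitude", "longitude", "sog", "cog", "rot"]
static_fields = ["mmsi", "imo", "vessel_name", "call_sign", "ship_type",
                 "dimensions", "destination"]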
Signals from vessels can be lost: terrestrial base stations are limited by their physical range, while satellite AIS receivers are limited by their orbital position and global coverage.
To learn more about AIS, refer to: https://www.marinelink.com/news/definitive-guide-ais418266
Brousseau, M. (2022). A comprehensive analysis and novel methods for on-purpose AIS switch-off detection [Master’s thesis, Dalhousie University]. DalSpace. http://hdl.handle.net/10222/81160
Kazim, T. (2016, November 14). A definitive guide to AIS. MarineLink. Retrieved May 14, 2025, from https://www.marinelink.com/news/definitive-guide-ais418266
Trajectory Forecasting with Gated Recurrent Unit AutoEncoders
By the end of this tutorial, you will understand the benefits of using teacher forcing to improve model accuracy, as well as other tweaks to enhance forecasting capabilities. We'll use AutoEncoders, neural networks that learn compressed data representations, to achieve this.
We will guide you through preparing AIS data for training an AutoEncoder, setting up layers, compiling the model, and defining the training process with teacher forcing.
Given the complexity of this task, we will revisit it to explore the benefits of teacher forcing, a technique that can improve sequence-to-sequence learning in neural networks.
This tutorial focuses on Trajectory Forecasting, which predicts an object's future path based on past positions. We will work with AIS messages, a type of temporal data that provides information about vessels' location, speed, and heading over time.
Automatic Identification System (AIS) messages broadcast essential ship information such as position, speed, and course. The temporal nature of these messages is pivotal for our tutorial, where we'll train an auto-encoder neural network for trajectory forecasting. This task involves predicting a ship's future path based on its past AIS messages, making it ideal for auto-encoders, which are optimized for learning patterns in sequential data.
def qry_database(dbname, start_time, stop_time):
d_threshold = 200000 # max distance (in meters) between two messages before assuming separate tracks
s_threshold = 50 # max speed (in knots) between two AIS messages before splitting tracks
t_threshold = timedelta(hours=24) # max time (in hours) between messages of a track
try:
with aisdb.DBConn() as dbconn:
tracks = aisdb.TrackGen(
aisdb.DBQuery(
dbconn=dbconn, dbpath=os.path.join(ROOT, PATH, dbname),
callback=aisdb.database.sql_query_strings.in_timerange,
start=start_time, end=stop_time).gen_qry(),
decimate=False) # trajectory compression
tracks = aisdb.split_timedelta(tracks, t_threshold) # split trajectories by time without AIS message transmission
tracks = aisdb.encode_greatcircledistance(tracks, distance_threshold=d_threshold, speed_threshold=s_threshold)
tracks = aisdb.interp_time(tracks, step=timedelta(minutes=5)) # interpolate every n-minutes
# tracks = vessel_info(tracks, dbconn=dbconn) # scrapes vessel metadata
return list(tracks) # list of segmented pre-processed tracks
except SyntaxError as e: return [] # no results for query
For querying the entire database at once, use the following code:
def get_tracks(dbname, start_ddmmyyyy, stop_ddmmyyyy):
stop_time = datetime.strptime(stop_ddmmyyyy, "%d%m%Y")
start_time = datetime.strptime(start_ddmmyyyy, "%d%m%Y")
# returns a list with all tracks from AISdb
return qry_database(dbname, start_time, stop_time)
For querying the database in batches of hours, use the following code:
def batch_tracks(dbname, start_ddmmyyyy, stop_ddmmyyyy, hours2batch):
stop_time = datetime.strptime(stop_ddmmyyyy, "%d%m%Y")
start_time = datetime.strptime(start_ddmmyyyy, "%d%m%Y")
# yields a list of results every delta_time iterations
delta_time = timedelta(hours=hours2batch)
anchor_time, next_time = start_time, start_time + delta_time
while next_time < stop_time:
yield qry_database(dbname, anchor_time, next_time)
anchor_time = next_time
next_time += delta_time
# yields a list of final results (if any)
yield qry_database(dbname, anchor_time, stop_time)
Several functions were defined using AISdb, an AIS framework developed by MERIDIAN at Dalhousie University, to efficiently extract AIS messages from SQLite databases. AISdb is designed for effective data storage, retrieval, and preparation for AIS-related tasks. It provides comprehensive tools for interacting with AIS data, including APIs for data reading and writing, parsing AIS messages, and performing various data transformations.
Our next step is to create a coverage map of Atlantic Canada to visualize our dataset. We will include 100 km-radius circles on the map to show the areas of the ocean where AIS messages from vessels can be received. Although overlapping circles may contain duplicate data from the same MMSI, we have already eliminated those from our dataset. However, messages might still appear incorrectly in inland areas.
# Create the map with specific latitude and longitude limits and a Mercator projection
m = Basemap(llcrnrlat=42, urcrnrlat=52, llcrnrlon=-70, urcrnrlon=-50, projection="merc", resolution="h")
# Draw state, country, coastline borders, and counties
m.drawstates(0.5)
m.drawcountries(0.5)
m.drawcoastlines(0.5)
m.drawcounties(color="gray", linewidth=0.5)
# Fill continents and oceans
m.fillcontinents(color="tan", lake_color="#91A3B0")
m.drawmapboundary(fill_color="#91A3B0")
coordinates = [
(51.26912, -57.53759), (48.92733, -58.87786),
(47.49307, -59.41325), (42.54760, -62.17624),
(43.21702, -60.49943), (44.14955, -60.59600),
(45.42599, -59.76398), (46.99134, -60.02403)]
# Draw 100km-radius circles
for lat, lon in coordinates:
radius_in_degrees = 100 / (111.32 * np.cos(np.deg2rad(lat)))
m.tissot(lon, lat, radius_in_degrees, 100, facecolor="r", edgecolor="k", alpha=0.5)
# Add text annotation with an arrow pointing to the circle
plt.annotate("AIS Coverage", xy=m(lon, lat), xytext=(40, -40),
textcoords="offset points", ha="left", va="bottom", fontweight="bold",
arrowprops=dict(arrowstyle="->", color="k", alpha=0.7, lw=2.5))
# Add labels
ocean_labels = {
"Atlantic Ocean": [(-59, 44), 16],
"Gulf of\nMaine": [(-67, 44.5), 12],
"Gulf of St. Lawrence": [(-64.5, 48.5), 11],
}
for label, (coords, fontsize) in ocean_labels.items():
plt.annotate(label, xy=m(*coords), xytext=(6, 6), textcoords="offset points",
fontsize=fontsize, color="#DBE2E9", fontweight="bold")
# Add a scale in kilometers
m.drawmapscale(-67.5, 42.7, -67.5, 42.7, 500, barstyle="fancy", fontsize=8, units="km", labelstyle="simple")
# Set the map title
_ = plt.title("100km-AIS radius-coverage on Atlantic Canada", fontweight="bold")
# The circle diameter is 200km, and it does not match the km scale (approximation)
Loading a shapefile to help us define whether a vessel is on land or in water during the trajectory:
land_polygons = gpd.read_file(os.path.join(ROOT, SHAPES, "ne_50m_land.shp"))
Check if a given coordinate (latitude, longitude) is on land:
def is_on_land(lat, lon, land_polygons):
return land_polygons.contains(Point(lon, lat)).any()
Check if any coordinate of a track is on land:
def is_track_on_land(track, land_polygons):
for lat, lon in zip(track["lat"], track["lon"]):
if is_on_land(lat, lon, land_polygons):
return True
return False
Filter out tracks with any point on land for a given MMSI:
def process_mmsi(item, polygons):
mmsi, tracks = item
filtered_tracks = [t for t in tracks if not is_track_on_land(t, polygons)]
return mmsi, filtered_tracks, len(tracks)
Use a ThreadPoolExecutor to parallelize the processing of MMSIs:
def process_voyages(voyages, land_polygons):
# Tracking progress with TQDM
def process_mmsi_callback(result, progress_bar):
mmsi, filtered_tracks, _ = result
voyages[mmsi] = filtered_tracks
progress_bar.update(1)
# Initialize the progress bar with the total number of MMSIs
progress_bar = tqdm(total=len(voyages), desc="MMSIs processed")
with ThreadPoolExecutor(max_workers=multiprocessing.cpu_count()) as executor:
# Submit all MMSIs for processing
futures = {executor.submit(process_mmsi, item, land_polygons): item for item in voyages.items()}
# Retrieve the results as they become available and update the Voyages dictionary
for future in as_completed(futures):
result = future.result()
process_mmsi_callback(result, progress_bar)
# Close the progress bar after processing is complete
progress_bar.close()
return voyages
file_name = "curated-ais.pkl"
full_path = os.path.join(ROOT, ESRF, file_name)
if not os.path.exists(full_path):
voyages = process_voyages(voyages, land_polygons)
pkl.dump(voyages, open(full_path, "wb"))
else: voyages = pkl.load(open(full_path, "rb"))
Count the number of segments per MMSI after removing duplicates and inaccurate track segments:
voyages_counts = {k: len(voyages[k]) for k in voyages.keys()}
In this analysis, we observe that most MMSIs in the dataset exhibit between 1 and 49 segments during the search period within AISdb. However, a minor fraction of vessels have significantly more segments, with some reaching up to 176. Efficient processing involves categorizing the data by MMSI instead of merely considering its volume. This method allows us to better evaluate the model's ability to discern various movement behaviors from both the same vessel and different ones.
def plot_voyage_segments_distribution(voyages_counts, bar_color="#ba1644"):
data = pd.DataFrame({"Segments": list(voyages_counts.values())})
return alt.Chart(data).mark_bar(color=bar_color).encode(
alt.X("Segments:Q", bin=alt.Bin(maxbins=90), title="Segments"),
alt.Y("count(Segments):Q", title="Count", scale=alt.Scale(type="log")))\
.properties(title="Distribution of Voyage Segments", width=600, height=400)\
.configure_axisX(titleFontSize=16).configure_axisY(titleFontSize=16)\
.configure_title(fontSize=18).configure_view(strokeOpacity=0)
alt.data_transformers.enable("default", max_rows=None)
plot_voyage_segments_distribution(voyages_counts).display()
To prevent our model from favoring shorter trajectories, we need a balanced mix of short-term and long-term voyages in the training and test sets. We'll categorize trajectories with 30 or more segments as long-term and those with fewer segments as short-term. Implement an 80-20 split strategy to ensure an equitable distribution of both types in the datasets.
long_term_voyages, short_term_voyages = [], []
# Separating voyages
for k in voyages_counts:
if voyages_counts[k] < 30:
short_term_voyages.append(k)
else: long_term_voyages.append(k)
# Shuffling for random distribution
random.shuffle(short_term_voyages)
random.shuffle(long_term_voyages)
Splitting the data respecting the voyage length distribution:
train_voyage, test_voyage = {}, {}
# Iterate over short-term voyages:
for i, k in enumerate(short_term_voyages):
if i < int(0.8 * len(short_term_voyages)):
train_voyage[k] = voyages[k]
else: test_voyage[k] = voyages[k]
# Iterate over long-term voyages:
for i, k in enumerate(long_term_voyages):
if i < int(0.8 * len(long_term_voyages)):
train_voyage[k] = voyages[k]
else: test_voyage[k] = voyages[k]
Visualizing the distribution of the dataset:
def plot_voyage_length_distribution(data, title, bar_color, min_time=144, force_print=True):
total_time = []
for key in data.keys():
for track in data[key]:
if len(track["time"]) > min_time or force_print:
total_time.append(len(track["time"]))
plot_data = pd.DataFrame({'Length': total_time})
chart = alt.Chart(plot_data).mark_bar(color=bar_color).encode(
alt.Y("count(Length):Q", title="Count", scale=alt.Scale(type="symlog")),
alt.X("Length:Q", bin=alt.Bin(maxbins=90), title="Length")
).properties(title=title, width=600, height=400)\
.configure_axisX(titleFontSize=16).configure_axisY(titleFontSize=16)\
.configure_title(fontSize=18).configure_view(strokeOpacity=0)
print("\n\n")
return chart
display(plot_voyage_length_distribution(train_voyage, "TRAINING: Distribution of Voyage Length", "#287561"))
display(plot_voyage_length_distribution(test_voyage, "TEST: Distribution of Voyage Length", "#3e57ab"))
Understanding input and output timesteps and variables is crucial in trajectory forecasting tasks. Trajectory data comprises spatial coordinates and related features that depict an object's movement over time. The aim is to predict future positions of the object based on its historical data and associated features.
INPUT_TIMESTEPS: This parameter determines the consecutive observations used to predict future trajectories. Its selection impacts the model's ability to capture temporal dependencies and patterns. Too few time steps may prevent the model from capturing all movement dynamics, resulting in inaccurate predictions. Conversely, too many time steps can add noise and complexity, increasing the risk of overfitting.
INPUT_VARIABLES: Features describe each timestep in the input sequence for trajectory forecasting. These variables can include spatial coordinates, velocities, accelerations, object types, and relevant features that aid in predicting system dynamics. Choosing the right input variables is crucial; irrelevant or redundant ones may confuse the model while missing important variables can result in poor predictions.
OUTPUT_TIMESTEPS: This parameter sets the number of future time steps the model should predict, known as the prediction horizon. Choosing the right horizon size is critical. Predicting too few timesteps may not serve the application's needs while predicting too many can increase uncertainty and degrade performance. Select a value based on your application's specific requirements and data quality.
OUTPUT_VARIABLES: In trajectory forecasting, output variables include predicted spatial coordinates and sometimes other relevant features. Reducing the number of output variables can simplify prediction tasks and decrease model complexity. However, this approach might also lead to a less effective model.
Understanding the roles of input and output timesteps and variables is key to developing accurate trajectory forecasting models. By carefully selecting these elements, we can create models that effectively capture object movement dynamics, resulting in more accurate and meaningful predictions across various applications.
For this tutorial, we'll input 4 hours of data into the model to forecast the next 8 hours of vessel movement. Consequently, we'll filter out all voyages with less than 12 hours of AIS messages. By interpolating the messages every 5 minutes, we require a minimum of 144 sequential messages (12 hours at 12 messages/hour).
With data provided by AISdb, we have AIS information, including Longitude, Latitude, Course Over Ground (COG), and Speed Over Ground (SOG), representing a ship's position and movement. Longitude and Latitude specify the ship's location, while COG and SOG indicate its heading and speed. By using all features for training the neural network, our output will be the Longitude and Latitude pair. This methodology allows the model to predict the ship's future positions based on historical data.
INPUT_TIMESTEPS = 48 # 4 hours * 12 AIS messages/h
INPUT_VARIABLES = 4 # Longitude, Latitude, COG, and SOG
OUTPUT_TIMESTEPS = 96 # 8 hours * 12 AIS messages/h
OUTPUT_VARIABLES = 2 # Longitude and Latitude
NUM_WORKERS = multiprocessing.cpu_count()
In this tutorial, we'll include AIS data deltas as features, which were excluded in the previous tutorial. Incorporating deltas can help the model capture temporal changes and patterns, enhancing its effectiveness in sequence-to-sequence modeling. Deltas provide information on the rate of change in features, improving the model's accuracy, especially in predicting outcomes that depend on temporal dynamics.
INPUT_VARIABLES *= 2 # Double the features with deltas
def filter_and_transform_voyages(voyages):
filtered_voyages = {}
for k, v in voyages.items():
voyages_track = []
for voyage in v:
if len(voyage["time"]) > (INPUT_TIMESTEPS + OUTPUT_TIMESTEPS):
mtx = np.vstack([voyage["lon"], voyage["lat"],
voyage["cog"], voyage["sog"]]).T
# Compute deltas
deltas = np.diff(mtx, axis=0)
# Add zeros at the first row for deltas
deltas = np.vstack([np.zeros(deltas.shape[1]), deltas])
# Concatenate the original matrix with the deltas matrix
mtx = np.hstack([mtx, deltas])
voyages_track.append(mtx)
if len(voyages_track) > 0:
filtered_voyages[k] = voyages_track
return filtered_voyages
# Checking how the data behaves for the previously set hyperparameters
display(plot_voyage_length_distribution(train_voyage, "TRAINING: Distribution of Voyage Length", "#287561",
min_time=INPUT_TIMESTEPS + OUTPUT_TIMESTEPS, force_print=False))
display(plot_voyage_length_distribution(test_voyage, "TEST: Distribution of Voyage Length", "#3e57ab",
min_time=INPUT_TIMESTEPS + OUTPUT_TIMESTEPS, force_print=False))
# Filter and transform train and test voyages and prepare for training
train_voyage = filter_and_transform_voyages(train_voyage)
test_voyage = filter_and_transform_voyages(test_voyage)
def print_voyage_statistics(header, voyage_dict):
total_time = 0
for mmsi, trajectories in voyage_dict.items():
for trajectory in trajectories:
total_time += trajectory.shape[0]
print(f"{header}")
print(f"Hours of sequential data: {total_time // 12}.")
print(f"Number of unique MMSIs: {len(voyage_dict)}.", end=" \n\n")
return total_time
time_test = print_voyage_statistics("[TEST DATA]", test_voyage)
time_train = print_voyage_statistics("[TRAINING DATA]", train_voyage)
# We remained with a distribution of data that still resembles the 80-20 ratio
print(f"Training hourly-rate: {(time_train * 100) / (time_train + time_test)}%")
print(f"Test hourly-rate: {(time_test * 100) / (time_train + time_test)}%")To improve our model, we'll prioritize training samples based on trajectory straightness. We'll compute the geographical distance between a segment's start and end points using the Haversine formula. Comparing this to the total distance of all consecutive points will give a straightness metric. Our model will focus on complex trajectories with multiple direction changes, leading to better generalization and more accurate predictions.
def haversine_distance(lon_1, lat_1, lon_2, lat_2):
lon_1, lat_1, lon_2, lat_2 = map(np.radians, [lon_1, lat_1, lon_2, lat_2]) # convert latitude and longitude to radians
a = np.sin((lat_2 - lat_1) / 2) ** 2 + np.cos(lat_1) * np.cos(lat_2) * np.sin((lon_2 - lon_1) / 2) ** 2
return (2 * np.arcsin(np.sqrt(a))) * 6371000 # R: 6,371,000 meters
Trajectory straightness calculation using the Haversine:
def trajectory_straightness(x):
start_point, end_point = x[0, :2], x[-1, :2]
x_coordinates, y_coordinates = x[:-1, 0], x[:-1, 1]
x_coordinates_next, y_coordinates_next = x[1:, 0], x[1:, 1]
consecutive_distances = np.array(haversine_distance(x_coordinates, y_coordinates, x_coordinates_next, y_coordinates_next))
straight_line_distance = np.array(haversine_distance(start_point[0], start_point[1], end_point[0], end_point[1]))
result = straight_line_distance / np.sum(consecutive_distances)
return result if not np.isnan(result) else 1
To predict 96 data points (output) using the preceding 48 data points (input) in a trajectory time series, we create a sliding window. First, we select the initial 48 data points as the input sequence and the subsequent 96 as the output sequence. We then slide the window forward by one step and repeat the process. This continues until the end of the sequence, helping our model capture temporal dependencies and patterns in the data.
Our training strategy uses the sliding window technique, requiring unique weights for each sample. Sliding Windows (SW) transforms time series data into an appropriate format for machine learning. They generate overlapping windows with a fixed number of consecutive points by sliding the window one step at a time through the series.
def process_voyage(voyage, mmsi, max_size, overlap_size=1):
straightness_ratios, mmsis, x, y = [], [], [], []
for j in range(0, voyage.shape[0] - max_size, 1):
x_sample = voyage[(0 + j):(INPUT_TIMESTEPS + j)]
y_sample = voyage[(INPUT_TIMESTEPS + j - overlap_size):(max_size + j), 0:OUTPUT_VARIABLES]
straightness = trajectory_straightness(x_sample)
straightness_ratios.append(straightness)
x.append(x_sample.T)
y.append(y_sample.T)
mmsis.append(mmsi)
return straightness_ratios, mmsis, x, y
def process_data(voyages):
max_size = INPUT_TIMESTEPS + OUTPUT_TIMESTEPS
# Callback function to update tqdm progress bar
def process_voyage_callback(result, pbar):
pbar.update(1)
return result
with Pool(NUM_WORKERS) as pool, tqdm(total=sum(len(v) for v in voyages.values()), desc="Voyages") as pbar:
results = []
# Submit tasks to the pool and store the results
for mmsi in voyages:
for voyage in voyages[mmsi]:
callback = partial(process_voyage_callback, pbar=pbar)
results.append(pool.apply_async(process_voyage, (voyage, mmsi, max_size), callback=callback))
pool.close()
pool.join()
# Gather the results
straightness_ratios, mmsis, x, y = [], [], [], []
for result in results:
s_ratios, s_mmsis, s_x, s_y = result.get()
straightness_ratios.extend(s_ratios)
mmsis.extend(s_mmsis)
x.extend(s_x)
y.extend(s_y)
# Process the results
x, y = np.stack(x), np.stack(y)
x, y = np.transpose(x, (0, 2, 1)), np.transpose(y, (0, 2, 1))
straightness_ratios = np.array(straightness_ratios)
min_straightness, max_straightness = np.min(straightness_ratios), np.max(straightness_ratios)
scaled_straightness_ratios = (straightness_ratios - min_straightness) / (max_straightness - min_straightness)
scaled_straightness_ratios = 1. - scaled_straightness_ratios
print(f"Final number of samples = {len(x)}", end="\n\n")
return mmsis, x, y, scaled_straightness_ratios
mmsi_train, x_train, y_train, straightness_ratios = process_data(train_voyage)
mmsi_test, x_test, y_test, _ = process_data(test_voyage)
In this project, the input data includes four features: Longitude, Latitude, COG (Course over Ground), and SOG (Speed over Ground), while the output data includes only Longitude and Latitude. To enhance the model's learning, we need to normalize the data through three main steps.
First, normalize Longitude, Latitude, COG, and SOG to the [0, 1] range using domain-specific parameters. This ensures the model performs well in Atlantic Canada waters by restricting the geographical scope of the AIS data and maintaining a similar scale for all features.
Second, the input and output data are standardized by subtracting the mean and dividing by the standard deviation. This centers the data around zero and scales it by its variance, preventing vanishing gradients during training.
Finally, another zero-one normalization is applied to scale the data to the [0, 1] range, aligning it with the expected range for many neural network activation functions.
def normalize_dataset(x_train, x_test, y_train,
lat_min=42, lat_max=52, lon_min=-70, lon_max=-50, max_sog=50):
def normalize(arr, min_val, max_val):
return (arr - min_val) / (max_val - min_val)
# Initial normalization
x_train[:, :, :2] = normalize(x_train[:, :, :2], np.array([lon_min, lat_min]), np.array([lon_max, lat_max]))
y_train[:, :, :2] = normalize(y_train[:, :, :2], np.array([lon_min, lat_min]), np.array([lon_max, lat_max]))
x_test[:, :, :2] = normalize(x_test[:, :, :2], np.array([lon_min, lat_min]), np.array([lon_max, lat_max]))
x_train[:, :, 2:4] = x_train[:, :, 2:4] / np.array([360, max_sog])
x_test[:, :, 2:4] = x_test[:, :, 2:4] / np.array([360, max_sog])
# Standardize X and Y
x_mean, x_std = np.mean(x_train, axis=(0, 1)), np.std(x_train, axis=(0, 1))
y_mean, y_std = np.mean(y_train, axis=(0, 1)), np.std(y_train, axis=(0, 1))
x_train = (x_train - x_mean) / x_std
y_train = (y_train - y_mean) / y_std
x_test = (x_test - x_mean) / x_std
# Final zero-one normalization
x_min, x_max = np.min(x_train, axis=(0, 1)), np.max(x_train, axis=(0, 1))
y_min, y_max = np.min(y_train, axis=(0, 1)), np.max(y_train, axis=(0, 1))
x_train = (x_train - x_min) / (x_max - x_min)
y_train = (y_train - y_min) / (y_max - y_min)
x_test = (x_test - x_min) / (x_max - x_min)
return x_train, x_test, y_train, y_mean, y_std, y_min, y_max, x_mean, x_std, x_min, x_max
x_train, x_test, y_train, y_mean, y_std, y_min, y_max, x_mean, x_std, x_min, x_max = normalize_dataset(x_train, x_test, y_train)
Denormalizing Y output to the original scale of the data:
def denormalize_y(y_data, y_mean, y_std, y_min, y_max,
lat_min=42, lat_max=52, lon_min=-70, lon_max=-50):
y_data = y_data * (y_max - y_min) + y_min # reverse zero-one normalization
y_data = y_data * y_std + y_mean # reverse standardization
# Reverse initial normalization for longitude and latitude
y_data[:, :, 0] = y_data[:, :, 0] * (lon_max - lon_min) + lon_min
y_data[:, :, 1] = y_data[:, :, 1] * (lat_max - lat_min) + lat_min
return y_data
Denormalizing X output to the original scale of the data:
def denormalize_x(x_data, x_mean, x_std, x_min, x_max,
lat_min=42, lat_max=52, lon_min=-70, lon_max=-50):
x_data = x_data * (x_max - x_min) + x_min # reverse zero-one normalization
x_data = x_data * x_std + x_mean # reverse standardization
# Reverse initial normalization for longitude and latitude
x_data[:, :, 0] = x_data[:, :, 0] * (lon_max - lon_min) + lon_min
x_data[:, :, 1] = x_data[:, :, 1] * (lat_max - lat_min) + lat_min
return x_data
We have successfully prepared the data for our machine-learning task. With the data ready, it's time for the modeling phase. Next, we will create, train, and evaluate a machine-learning model to forecast vessel trajectories using the processed dataset. Let's explore how our model performs in Atlantic Canada!
tf.keras.backend.clear_session() # Clear the Keras session to prevent potential conflicts
_ = wandb.login(force=True) # Log in to Weights & Biases
A GRU Autoencoder is a neural network that compresses and reconstructs sequential data utilizing Gated Recurrent Units. GRUs are highly effective at handling time-series data, which are sequential data points captured over time, as they can model intricate temporal dependencies and patterns. To perform time-series forecasting, a GRU Autoencoder can be trained on a historical time-series dataset to discern patterns and trends, compressing an input sequence into a lower-dimensional representation that can be decoded to generate a forecast of the upcoming data points. With this in mind, we will begin by constructing a model architecture composed of two GRU layers with 64 units each, an encoder over the (48, 8) input sequence and a decoder over the (95, 2) output sequence, followed by a time-distributed dense layer with 2 units.
class ProbabilisticTeacherForcing(Layer):
def __init__(self, **kwargs):
super(ProbabilisticTeacherForcing, self).__init__(**kwargs)
def call(self, inputs):
decoder_gt_input, decoder_output, mixing_prob = inputs
mixing_prob = tf.expand_dims(mixing_prob, axis=-1) # Add an extra dimension for broadcasting
mixing_prob = tf.broadcast_to(mixing_prob, tf.shape(decoder_gt_input)) # Broadcast to match the shape
return tf.where(tf.random.uniform(tf.shape(decoder_gt_input)) < mixing_prob, decoder_gt_input, decoder_output)
def build_model(rnn_unit="GRU", hidden_size=64):
encoder_input = Input(shape=(INPUT_TIMESTEPS, INPUT_VARIABLES), name="Encoder_Input")
decoder_gt_input = Input(shape=((OUTPUT_TIMESTEPS - 1), OUTPUT_VARIABLES), name="Decoder-GT-Input")
mixing_prob_input = Input(shape=(1,), name="Mixing_Probability")
# Encoder
encoder_gru = eval(rnn_unit)(hidden_size, activation="relu", name="Encoder")(encoder_input)
repeat_vector = RepeatVector((OUTPUT_TIMESTEPS - 1), name="Repeater")(encoder_gru)
# Inference Decoder
decoder_gru = eval(rnn_unit)(hidden_size, activation="relu", return_sequences=True, name="Decoder")
decoder_output = decoder_gru(repeat_vector, initial_state=encoder_gru)
# Adjust decoder_output shape
dense_output_adjust = TimeDistributed(Dense(OUTPUT_VARIABLES), name="Output_Adjust")
adjusted_decoder_output = dense_output_adjust(decoder_output)
# Training Decoder
decoder_gru_tf = eval(rnn_unit)(hidden_size, activation="relu", return_sequences=True, name="Decoder-TF")
probabilistic_tf_layer = ProbabilisticTeacherForcing(name="Probabilistic_Teacher_Forcing")
mixed_input = probabilistic_tf_layer([decoder_gt_input, adjusted_decoder_output, mixing_prob_input])
tf_output = decoder_gru_tf(mixed_input, initial_state=encoder_gru)
tf_output = dense_output_adjust(tf_output) # Use dense_output_adjust layer for training output
training_model = Model(inputs=[encoder_input, decoder_gt_input, mixing_prob_input], outputs=tf_output, name="Training")
inference_model = Model(inputs=encoder_input, outputs=adjusted_decoder_output, name="Inference")
return training_model, inference_model
training_model, model = build_model()
def denormalize_y(y_data, y_mean, y_std, y_min, y_max, lat_min=42, lat_max=52, lon_min=-70, lon_max=-50):
scales = tf.constant([lon_max - lon_min, lat_max - lat_min], dtype=tf.float32)
biases = tf.constant([lon_min, lat_min], dtype=tf.float32)
# Reverse zero-one normalization and standardization
y_data = y_data * (y_max - y_min) + y_min
y_data = y_data * y_std + y_mean
# Reverse initial normalization for longitude and latitude
return y_data * scales + biases
def haversine_distance(lon1, lat1, lon2, lat2):
lon1, lat1, lon2, lat2 = [tf.math.multiply(x, tf.divide(tf.constant(np.pi), 180.)) for x in [lon1, lat1, lon2, lat2]] # lat and lon to radians
a = tf.math.square(tf.math.sin((lat2 - lat1) / 2.)) + tf.math.cos(lat1) * tf.math.cos(lat2) * tf.math.square(tf.math.sin((lon2 - lon1) / 2.))
return 2 * 6371000 * tf.math.asin(tf.math.sqrt(a)) # The Earth radius is 6,371,000 meters
def custom_loss(y_true, y_pred):
tf.debugging.check_numerics(y_true, "y_true contains NaNs")
tf.debugging.check_numerics(y_pred, "y_pred contains NaNs")
# Denormalize true and predicted y
y_true_denorm = denormalize_y(y_true, y_mean, y_std, y_min, y_max)
y_pred_denorm = denormalize_y(y_pred, y_mean, y_std, y_min, y_max)
# Compute haversine distance for true and predicted y from the second time-step
true_dist = haversine_distance(y_true_denorm[:, 1:, 0], y_true_denorm[:, 1:, 1], y_true_denorm[:, :-1, 0], y_true_denorm[:, :-1, 1])
pred_dist = haversine_distance(y_pred_denorm[:, 1:, 0], y_pred_denorm[:, 1:, 1], y_pred_denorm[:, :-1, 0], y_pred_denorm[:, :-1, 1])
# Convert maximum speed from knots to meters per 5 minutes
max_speed_m_per_5min = 50 * 1.852 * 1000 * 5 / 60
# Compute the difference in distances
dist_diff = tf.abs(true_dist - pred_dist)
# Apply penalty if the predicted distance is greater than the maximum possible distance
dist_diff = tf.where(pred_dist > max_speed_m_per_5min, pred_dist - max_speed_m_per_5min, dist_diff)
# Penalty for the first output coordinate not being the same as the last input
input_output_diff = haversine_distance(y_true_denorm[:, 0, 0], y_true_denorm[:, 0, 1], y_pred_denorm[:, 0, 0], y_pred_denorm[:, 0, 1])
# Compute RMSE excluding the first element
rmse = K.sqrt(K.mean(K.square(y_true_denorm[:, 1:, :] - y_pred_denorm[:, 1:, :]), axis=1))
tf.debugging.check_numerics(y_true_denorm, "y_true_denorm contains NaNs")
tf.debugging.check_numerics(y_pred_denorm, "y_pred_denorm contains NaNs")
tf.debugging.check_numerics(true_dist, "true_dist contains NaNs")
tf.debugging.check_numerics(pred_dist, "pred_dist contains NaNs")
tf.debugging.check_numerics(dist_diff, "dist_diff contains NaNs")
tf.debugging.check_numerics(input_output_diff, "input_output_diff contains NaNs")
tf.debugging.check_numerics(rmse, "rmse contains NaNs")
# Final loss with weights
# return 0.25 * K.mean(input_output_diff) + 0.35 * K.mean(dist_diff) + 0.40 * K.mean(rmse)
return K.mean(rmse)
def compile_model(model, learning_rate, clipnorm, jit_compile, skip_summary=False):
optimizer = AdamW(learning_rate=learning_rate, clipnorm=clipnorm, jit_compile=jit_compile)
model.compile(optimizer=optimizer, loss=custom_loss, metrics=["mae", "mape"], weighted_metrics=[], jit_compile=jit_compile)
if not skip_summary: model.summary() # print a summary of the model architecture
compile_model(training_model, learning_rate=0.001, clipnorm=1, jit_compile=True)
compile_model(model, learning_rate=0.001, clipnorm=1, jit_compile=True)
Model: "Training"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
Encoder_Input (InputLayer) [(None, 48, 8)] 0 []
Encoder (GRU) (None, 64) 14208 ['Encoder_Input[0][0]']
Repeater (RepeatVector) (None, 95, 64) 0 ['Encoder[0][0]']
Decoder (GRU) (None, 95, 64) 24960 ['Repeater[0][0]',
'Encoder[0][0]']
Output_Adjust (TimeDistributed (None, 95, 2) 130 ['Decoder[0][0]',
) 'Decoder-TF[0][0]']
Decoder-GT-Input (InputLayer) [(None, 95, 2)] 0 []
Mixing_Probability (InputLayer [(None, 1)] 0 []
)
Probabilistic_Teacher_Forcing (None, 95, 2) 0 ['Decoder-GT-Input[0][0]',
(ProbabilisticTeacherForcing) 'Output_Adjust[0][0]',
'Mixing_Probability[0][0]']
Decoder-TF (GRU) (None, 95, 64) 13056 ['Probabilistic_Teacher_Forcing[0
][0]',
'Encoder[0][0]']
==================================================================================================
Total params: 52,354
Trainable params: 52,354
Non-trainable params: 0
__________________________________________________________________________________________________
Model: "Inference"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
Encoder_Input (InputLayer) [(None, 48, 8)] 0 []
Encoder (GRU) (None, 64) 14208 ['Encoder_Input[0][0]']
Repeater (RepeatVector) (None, 95, 64) 0 ['Encoder[0][0]']
Decoder (GRU) (None, 95, 64) 24960 ['Repeater[0][0]',
'Encoder[0][0]']
Output_Adjust (TimeDistributed (None, 95, 2) 130 ['Decoder[0][0]']
)
==================================================================================================
Total params: 39,298
Trainable params: 39,298
Non-trainable params: 0
__________________________________________________________________________________________________
The following function lists the callbacks used during the model training process. Callbacks are utilities invoked at specific points during training to monitor progress or take actions based on the model's performance. The function pre-defines the parameters and behavior of these callbacks:
WandbMetricsLogger: This callback logs the training and validation metrics for visualization and monitoring on the Weights & Biases (W&B) platform. This can be useful for tracking the training progress but may introduce additional overhead due to the logging process. You can remove this callback if you don't need to use W&B or want to reduce the overhead.
TerminateOnNaN: This callback terminates training if the loss becomes NaN (Not a Number) during the training process. It helps to stop the training process early when the model diverges and encounters an unstable state.
ReduceLROnPlateau: This callback reduces the learning rate by a specified factor when the monitored metric has stopped improving for several epochs. It helps fine-tune the model using a lower learning rate when it no longer improves significantly.
EarlyStopping: This callback stops the training process early when the monitored metric has not improved for a specified number of epochs. It restores the model's best weights when the training is terminated, preventing overfitting and reducing the training time.
ModelCheckpoint: This callback saves the best model (based on the monitored metric) to a file during training.
WandbMetricsLogger is the most computationally costly among these callbacks due to the logging process. You can remove this callback if you don't need Weights & Biases for monitoring or want to reduce overhead. The other callbacks help optimize the training process and are less computationally demanding. Note that the Weights & Biases (W&B) platform is also used in other parts of the code; if you decide to remove the WandbMetricsLogger callback, please ensure that you also remove any other references to W&B to avoid potential issues. If you choose to use W&B for monitoring and logging, you must register and log in on the W&B website. During execution, you'll be prompted for an authentication key to connect your script to your W&B account; this key can be obtained from your W&B account settings.
def create_callbacks(model_name, monitor="val_loss", factor=0.2, lr_patience=3, ep_patience=12, min_lr=0, verbose=0, restore_best_weights=True, skip_wandb=False):
return ([wandb.keras.WandbMetricsLogger()] if not skip_wandb else []) + [#tf.keras.callbacks.TerminateOnNaN(),
ReduceLROnPlateau(monitor=monitor, factor=factor, patience=lr_patience, min_lr=min_lr, verbose=verbose),
EarlyStopping(monitor=monitor, patience=ep_patience, verbose=verbose, restore_best_weights=restore_best_weights),
tf.keras.callbacks.ModelCheckpoint(os.path.join(ROOT, MODELS, model_name), monitor="val_loss", mode="min", save_best_only=True, verbose=verbose)]
def train_model(model, x_train, y_train, batch_size, epochs, validation_split, model_name):
run = wandb.init(project="kAISdb", anonymous="allow") # start the wandb run
# Set the initial mixing probability
mixing_prob = 0.5
# Update y_train to have the same dimensions as the output
y_train = y_train[:, :(OUTPUT_TIMESTEPS - 1), :]
# Create the ground truth input for the decoder by appending a padding at the beginning of the sequence
decoder_ground_truth_input_data = (np.zeros((y_train.shape[0], 1, y_train.shape[2])), y_train[:, :-1, :])
decoder_ground_truth_input_data = np.concatenate(decoder_ground_truth_input_data, axis=1)
try:
# Train the model with Teacher Forcing
with tf.device(tf.test.gpu_device_name()):
training_model.fit([x_train, decoder_ground_truth_input_data, np.full((x_train.shape[0], 1), mixing_prob)], y_train, batch_size=batch_size, epochs=epochs,
verbose=2, validation_split=validation_split, callbacks=create_callbacks(model_name))
# , sample_weight=straightness_ratios)
except KeyboardInterrupt as e:
print("\nRestoring best weights [...]")
# Load the weights of the teacher-forcing model
training_model.load_weights(model_name)
# Transfering the weights to the inference model
for layer in model.layers:
if layer.name in [l.name for l in training_model.layers]:
layer.set_weights(training_model.get_layer(layer.name).get_weights())
run.finish() # finish the wandb run
model_name = "TF-GRU-AE.h5"
full_path = os.path.join(ROOT, MODELS, model_name)
if True:#not os.path.exists(full_path):
train_model(model, x_train, y_train, batch_size=1024,
epochs=250, validation_split=0.2,
model_name=model_name)
else:
training_model.load_weights(full_path)
for layer in model.layers: # inference model initialization
if layer.name in [l.name for l in training_model.layers]:
layer.set_weights(training_model.get_layer(layer.name).get_weights())
def evaluate_model(model, x_test, y_test, y_mean, y_std, y_min, y_max, y_pred=None):
def single_trajectory_error(y_test, y_pred, index):
distances = haversine_distance(y_test[index, :, 0], y_test[index, :, 1], y_pred[index, :, 0], y_pred[index, :, 1])
return np.min(distances), np.max(distances), np.mean(distances), np.median(distances)
# Modify this function to handle teacher-forced models with 95 output variables instead of 96
def all_trajectory_error(y_test, y_pred):
errors = [single_trajectory_error(y_test[:, 1:], y_pred, i) for i in range(y_test.shape[0])]
min_errors, max_errors, mean_errors, median_errors = zip(*errors)
return min(min_errors), max(max_errors), np.mean(mean_errors), np.median(median_errors)
def plot_trajectory(x_test, y_test, y_pred, sample_index):
min_error, max_error, mean_error, median_error = single_trajectory_error(y_test, y_pred, sample_index)
fig = go.Figure()
fig.add_trace(go.Scatter(x=x_test[sample_index, :, 0], y=x_test[sample_index, :, 1], mode="lines", name="Input Data", line=dict(color="green")))
fig.add_trace(go.Scatter(x=y_test[sample_index, :, 0], y=y_test[sample_index, :, 1], mode="lines", name="Ground Truth", line=dict(color="blue")))
fig.add_trace(go.Scatter(x=y_pred[sample_index, :, 0], y=y_pred[sample_index, :, 1], mode="lines", name="Forecasted Trajectory", line=dict(color="red")))
fig.update_layout(title=f"Sample Index: {sample_index} | Distance Errors (in meteres):<br>Min: {min_error:.2f}m, Max: {max_error:.2f}m, "
f"Mean: {mean_error:.2f}m, Median: {median_error:.2f}m", xaxis_title="Longitude", yaxis_title="Latitude",
plot_bgcolor="#e4eaf0", paper_bgcolor="#fcfcfc", width=700, height=600)
max_lon, max_lat = -58.705587131108196, 47.89066160591873
min_lon, min_lat = -61.34247286889181, 46.09201839408127
fig.update_xaxes(range=[min_lon, max_lon])
fig.update_yaxes(range=[min_lat, max_lat])
return fig
if y_pred is None:
with tf.device(tf.test.gpu_device_name()):
y_pred = model.predict(x_test, verbose=0)
y_pred_o = y_pred # preserve the result
x_test = denormalize_x(x_test, x_mean, x_std, x_min, x_max)
y_pred = denormalize_y(y_pred_o, y_mean, y_std, y_min, y_max)
# Modify this line to handle teacher-forced models with 95 output variables instead of 96
for sample_index in [1000, 2500, 5000, 7500]:
display(plot_trajectory(x_test, y_test[:, 1:], y_pred, sample_index))
# The metrics require a lower dimension (no impact on the results)
y_test_reshaped = np.reshape(y_test[:, 1:], (-1, y_test.shape[2]))
y_pred_reshaped = np.reshape(y_pred, (-1, y_pred.shape[2]))
# Physical Distance Error given in meters
all_min_error, all_max_error, all_mean_error, all_median_error = all_trajectory_error(y_test, y_pred)
print("\nAll Trajectories Min DE: {:.4f}m".format(all_min_error))
print("All Trajectories Max DE: {:.4f}m".format(all_max_error))
print("All Trajectories Mean DE: {:.4f}m".format(all_mean_error))
print("All Trajectories Median DE: {:.4f}m".format(all_median_error))
# Calculate evaluation metrics on the test data
r2 = r2_score(y_test_reshaped, y_pred_reshaped)
mse = mean_squared_error(y_test_reshaped, y_pred_reshaped)
mae = mean_absolute_error(y_test_reshaped, y_pred_reshaped)
evs = explained_variance_score(y_test_reshaped, y_pred_reshaped)
mape = mean_absolute_percentage_error(y_test_reshaped, y_pred_reshaped)
rmse = np.sqrt(mse)
print(f"\nTest R^2: {r2:.4f}")
print(f"Test MAE: {mae:.4f}")
print(f"Test MSE: {mse:.4f}")
print(f"Test RMSE: {rmse:.4f}")
print(f"Test MAPE: {mape:.4f}")
print(f"Test Explained Variance Score: {evs:.4f}")
return y_pred_o
_ = evaluate_model(model, x_test, y_test, y_mean, y_std, y_min, y_max)
In this step, we define the functions used to tune the model with Hyperopt. The objective function takes a hyperparameter dictionary as input, drawn from a search space that defines the hyperparameters of interest. Specifically, we search for the best number of units in the encoder and decoder GRU layers, the teacher-forcing mixing probability, and the learning rate for the AdamW optimizer. The build_model function constructs the GRU AutoEncoder with these tunable hyperparameters, and the model is compiled with the same custom loss used above. Hyperopt then searches for the combination of hyperparameters that minimizes the validation loss, at the expense of longer computing time.
Helper for saving the training history:
def save_history(history, model_name):
history_name = model_name.replace('.h5', '.pkl')
history_name = os.path.join(ROOT, MODELS, history_name)
with open(history_name, 'wb') as f:
pkl.dump(history, f)
Helper for restoring the training history:
def load_history(model_name):
history_name = model_name.replace('.h5', '.pkl')
history_name = os.path.join(ROOT, MODELS, history_name)
with open(history_name, 'rb') as f:
history = pkl.load(f)
return history
Defining the model to be optimized:
def build_model(rnn_unit="GRU", enc_units_1=64, dec_units_1=64):
encoder_input = Input(shape=(INPUT_TIMESTEPS, INPUT_VARIABLES), name="Encoder_Input")
decoder_gt_input = Input(shape=((OUTPUT_TIMESTEPS - 1), OUTPUT_VARIABLES), name="Decoder-GT-Input")
mixing_prob_input = Input(shape=(1,), name="Mixing_Probability")
# Encoder
encoder_gru = eval(rnn_unit)(enc_units_1, activation="relu", name="Encoder")(encoder_input)
repeat_vector = RepeatVector((OUTPUT_TIMESTEPS - 1), name="Repeater")(encoder_gru)
# Inference Decoder
decoder_gru = eval(rnn_unit)(dec_units_1, activation="relu", return_sequences=True, name="Decoder")
decoder_output = decoder_gru(repeat_vector, initial_state=encoder_gru)
# Adjust decoder_output shape
dense_output_adjust = TimeDistributed(Dense(OUTPUT_VARIABLES), name="Output_Adjust")
adjusted_decoder_output = dense_output_adjust(decoder_output)
# Training Decoder
decoder_gru_tf = eval(rnn_unit)(dec_units_1, activation="relu", return_sequences=True, name="Decoder-TF")
probabilistic_tf_layer = ProbabilisticTeacherForcing(name="Probabilistic_Teacher_Forcing")
mixed_input = probabilistic_tf_layer([decoder_gt_input, adjusted_decoder_output, mixing_prob_input])
tf_output = decoder_gru_tf(mixed_input, initial_state=encoder_gru)
tf_output = dense_output_adjust(tf_output) # Use dense_output_adjust layer for training output
training_model = Model(inputs=[encoder_input, decoder_gt_input, mixing_prob_input], outputs=tf_output, name="Training")
inference_model = Model(inputs=encoder_input, outputs=adjusted_decoder_output, name="Inference")
return training_model, inference_model
HyperOpt Objective Function:
def objective(hyperparams, x_train, y_train, straightness_ratios, model_prefix):
# Get the best hyperparameters from the optimization results
enc_units_1 = hyperparams["enc_units_1"]
dec_units_1 = hyperparams["dec_units_1"]
mixing_prob = hyperparams["mixing_prob"]
lr = hyperparams["learning_rate"]
# Create the model name using the best hyperparameters
model_name = f"{model_prefix}-{enc_units_1}-{dec_units_1}-{mixing_prob}-{lr}.h5"
full_path = os.path.join(ROOT, MODELS, model_name) # best model full path
# Check if the model results file with this name already exists
if not os.path.exists(full_path.replace(".h5", ".pkl")):
print(f"Saving under {model_name}.")
# Define the model architecture
training_model, _ = build_model(enc_units_1=enc_units_1, dec_units_1=dec_units_1)
compile_model(training_model, learning_rate=lr, clipnorm=1, jit_compile=True, skip_summary=True)
# Update y_train to have the same dimensions as the output
y_train = y_train[:, :(OUTPUT_TIMESTEPS - 1), :]
# Create the ground truth input for the decoder by appending a padding at the beginning of the sequence
decoder_ground_truth_input_data = (np.zeros((y_train.shape[0], 1, y_train.shape[2])), y_train[:, :-1, :])
decoder_ground_truth_input_data = np.concatenate(decoder_ground_truth_input_data, axis=1)
# Train the model on the data, using GPU if available
with tf.device(tf.test.gpu_device_name()):
history = training_model.fit([x_train, decoder_ground_truth_input_data, np.full((x_train.shape[0], 1), mixing_prob)], y_train,
batch_size=10240, epochs=250, validation_split=.2, verbose=0,
workers=multiprocessing.cpu_count(), use_multiprocessing=True,
callbacks=create_callbacks(model_name, skip_wandb=True))
#, sample_weight=straightness_ratios)
# Save the training history
save_history(history.history, model_name)
# Clear the session to release resources
del training_model; tf.keras.backend.clear_session()
else:
print("Loading pre-trained weights.")
history = load_history(model_name)
if type(history) == dict: # validation loss of the model
return {"loss": history["val_loss"][-1], "status": STATUS_OK}
else: return {"loss": history.history["val_loss"][-1], "status": STATUS_OK}def optimize_hyperparameters(max_evals, model_prefix, x_train, y_train, sample_size=5000):
def build_space(n_min=2, n_steps=9):
# Defining a custom 2^N range function
n_range = lambda n_min, n_steps: np.array(
[2**n for n in range(n_min, n_steps) if 2**n >= n_min])
# Defining the unconstrained search space
encoder_1_range = n_range(n_min, n_steps)
decoder_1_range = n_range(n_min, n_steps)
learning_rate_range = [.01, .001, .0001]
mixing_prob_range = [.25, .5, .75]
# Enforcing constraints on the search space
enc_units_1 = np.random.choice(encoder_1_range)
dec_units_1 = np.random.choice(decoder_1_range[np.where(decoder_1_range == enc_units_1)])
learning_rate = np.random.choice(learning_rate_range)
mixing_prob = np.random.choice(mixing_prob_range)
# Returns a single element of the search space
return dict(enc_units_1=enc_units_1, dec_units_1=dec_units_1, learning_rate=learning_rate, mixing_prob=mixing_prob)
# Select the search space based on a pre-set sampled random space
search_space = hp.choice("hyperparams", [build_space() for _ in range(sample_size)])
trials = Trials() # initialize Hyperopt trials
# Define the objective function for Hyperopt
fn = lambda hyperparams: objective(hyperparams, x_train, y_train, straightness_ratios, model_prefix)
# Perform Hyperopt optimization and find the best hyperparameters
best = fmin(fn=fn, space=search_space, algo=tpe.suggest, max_evals=max_evals, trials=trials)
best_hyperparams = space_eval(search_space, best)
# Get the best hyperparameters from the optimization results
enc_units_1 = best_hyperparams["enc_units_1"]
dec_units_1 = best_hyperparams["dec_units_1"]
mixing_prob = best_hyperparams["mixing_prob"]
lr = best_hyperparams["learning_rate"]
# Create the model name using the best hyperparameters
model_name = f"{model_prefix}-{enc_units_1}-{dec_units_1}-{mixing_prob}-{lr}.h5"
full_path = os.path.join(ROOT, MODELS, model_name) # best model full path
t_model, i_model = build_model(enc_units_1=enc_units_1, dec_units_1=dec_units_1)
t_model = tf.keras.models.load_model(full_path, custom_objects={"ProbabilisticTeacherForcing": ProbabilisticTeacherForcing})
for layer in i_model.layers: # inference model initialization
if layer.name in [l.name for l in t_model.layers]:
layer.set_weights(t_model.get_layer(layer.name).get_weights())
print(f"Best hyperparameters:")
print(f" Encoder units 1: {enc_units_1}")
print(f" Decoder units 1: {dec_units_1}")
print(f" Mixing proba.: {mixing_prob}")
print(f" Learning rate: {lr}")
return i_model
max_evals, model_prefix = 100, "TF-GRU"
# best_model = optimize_hyperparameters(max_evals, model_prefix, x_train, y_train)
# [NOTE] YOU CAN SKIP THIS STEP BY LOADING THE PRE-TRAINED WEIGHTS ON THE NEXT CELL.
Sweeping the project folder for other pre-trained weights shared with this tutorial:
def find_best_model(root_folder, model_prefix):
best_model_name, best_val_loss = None, float('inf')
for f in os.listdir(root_folder):
if (f.endswith(".h5") and f.startswith(model_prefix)):
try:
history = load_history(f)
# Get the validation loss
if type(history) == dict:
val_loss = history["val_loss"][-1]
else: val_loss = history.history["val_loss"][-1]
# Storing the best model
if val_loss < best_val_loss:
best_val_loss = val_loss
best_model_name = f
except: pass
# Load the best model
full_path = os.path.join(ROOT, MODELS, best_model_name)
t_model, i_model = build_model(enc_units_1=int(best_model_name.split("-")[2]), dec_units_1=int(best_model_name.split("-")[3]))
t_model = tf.keras.models.load_model(full_path, custom_objects={"ProbabilisticTeacherForcing": ProbabilisticTeacherForcing})
for layer in i_model.layers: # inference model initialization
if layer.name in [l.name for l in t_model.layers]:
layer.set_weights(t_model.get_layer(layer.name).get_weights())
# Print summary of the best model
print(f"Loss: {best_val_loss}")
i_model.summary()
return i_model
best_model = find_best_model(os.path.join(ROOT, MODELS), model_prefix)
_ = evaluate_model(best_model, x_test, y_test, y_mean, y_std, y_min, y_max)
Deep learning models, although powerful, are often criticized for their lack of explainability, making it difficult to comprehend their decision-making process and raising concerns about trust and reliability. To address this issue, we can use techniques like the Permutation Feature Importance (PFI) method, a simple, model-agnostic approach that helps visualize the importance of features in deep learning models. This method works by shuffling individual feature values in the dataset and observing the impact on the model's performance. By measuring the change in a designated metric when each feature's values are randomly permuted, we can infer the importance of that specific feature. The idea is that if a feature is crucial for the model's performance, shuffling its values should lead to a significant shift in performance; conversely, if a feature has little impact, permuting its values should result in only a minor change. Applying the permutation feature importance method to the best model, obtained after hyperparameter tuning, gives us a more transparent understanding of how the model makes its decisions.
def permutation_feature_importance(model, x_test, y_test, metric):
# Function to calculate permutation feature importance
def PFI(model, x, y_true, metric):
# Reshape the true values for easier comparison with predictions
y_true = np.reshape(y_true, (-1, y_true.shape[2]))
# Predict using the model and reshape the predicted values
with tf.device(tf.test.gpu_device_name()):
y_pred = model.predict(x, verbose=0)
y_pred = np.reshape(y_pred, (-1, y_pred.shape[2]))
# Calculate the baseline score using the given metric
baseline_score = metric(y_true, y_pred)
# Initialize an array for feature importances
feature_importances = np.zeros(x.shape[2])
# Calculate the importance for each feature
for feature_idx in range(x.shape[2]):
x_permuted = x.copy()
x_permuted[:, :, feature_idx] = np.random.permutation(x[:, :, feature_idx])
# Predict using the permuted input and reshape the predicted values
with tf.device(tf.test.gpu_device_name()):
y_pred_permuted = model.predict(x_permuted, verbose=0)
y_pred_permuted = np.reshape(y_pred_permuted, (-1, y_pred_permuted.shape[2]))
# Calculate the score with permuted input
permuted_score = metric(y_true, y_pred_permuted)
# Compute the feature importance as the difference between permuted and baseline scores
feature_importances[feature_idx] = permuted_score - baseline_score
return feature_importances
feature_importances = PFI(model, x_test, y_test, metric)
# Prepare the data for plotting (Altair requires a DataFrame)
feature_names = ["Longitude", "Latitude", "COG", "SOG"]
feature_importance_df = pd.DataFrame({"features": feature_names, "importance": feature_importances})
# Create the bar plot with Altair
bar_plot = alt.Chart(feature_importance_df).mark_bar(size=40, color="mediumblue", opacity=0.8).encode(
x=alt.X("features:N", title="Features", axis=alt.Axis(labelFontSize=12, titleFontSize=14)),
y=alt.Y("importance:Q", title="Permutation Importance", axis=alt.Axis(labelFontSize=12, titleFontSize=14)),
).properties(title=alt.TitleParams(text="Feature Importance", fontSize=16, fontWeight="bold"), width=400, height=300)
return bar_plot, feature_importances
permutation_feature_importance(best_model, x_test, y_test, mean_absolute_error)[0].display()
Permutation feature importance has some limitations, such as assuming features are independent and producing biased results when features are highly correlated. It also does not provide detailed explanations for individual data points. An alternative is sensitivity analysis, which studies how input features affect model predictions. By perturbing each input feature individually and observing the resulting prediction changes, we can understand which features significantly impact the model's output. This approach offers insights into the model's decision-making process and helps identify influential features. However, it does not account for feature interactions and can be computationally expensive for many features or perturbation steps.
def sensitivity_analysis(model, x_sample, perturbation_range=(-0.1, 0.1), num_steps=10, plot_nrows=4):
# Get the number of features and outputs
num_features = x_sample.shape[1]
num_outputs = model.output_shape[-1] * model.output_shape[-2]
# Create an array of perturbations
perturbations = np.linspace(perturbation_range[0], perturbation_range[1], num_steps)
# Initialize sensitivity array
sensitivity = np.zeros((num_features, num_outputs, num_steps))
# Get the original prediction for the input sample
original_prediction = model.predict(x_sample.reshape(1, -1, 4), verbose=0).reshape(-1)
# Iterate over input features and perturbations
for feature_idx in range(num_features):
for i, perturbation in enumerate(perturbations):
# Create a perturbed version of the input sample
perturbed_sample = x_sample.copy()
perturbed_sample[:, feature_idx] += perturbation
# Get the prediction for the perturbed input sample
perturbed_prediction = model.predict(perturbed_sample.reshape(1, -1, 4), verbose=0).reshape(-1)
# Calculate the absolute prediction change and store it in the sensitivity array
sensitivity[feature_idx, :, i] = np.abs(perturbed_prediction - original_prediction)
# Determine the number of rows and columns in the plot
ncols = 6
nrows = max(min(plot_nrows, math.ceil(num_outputs / ncols)), 1)
# Define feature names
feature_names = ["Longitude", "Latitude", "COG", "SOG"]
# Create the sensitivity plot
fig, axs = plt.subplots(nrows, ncols, figsize=(18, 3 * nrows), sharex=True, sharey=True)
axs = axs.ravel()
output_idx = 0
for row in range(nrows):
for col in range(ncols):
if output_idx < num_outputs:
# Plot sensitivity curves for each feature
for feature_idx in range(num_features):
axs[output_idx].plot(perturbations, sensitivity[feature_idx, output_idx], label=f'{feature_names[feature_idx]}')
# Set the title for each subplot
axs[output_idx].set_title(f'Output {output_idx // 2 + 1}, {"Longitude" if output_idx % 2 == 0 else "Latitude"}')
output_idx += 1
# Set common labels and legend
fig.text(0.5, 0.04, 'Perturbation', ha='center', va='center')
fig.text(0.06, 0.5, 'Absolute Prediction Change', ha='center', va='center', rotation='vertical')
handles, labels = axs[0].get_legend_handles_labels()
fig.legend(handles, labels, loc='upper center', ncol=num_features, bbox_to_anchor=(.5, .87))
plt.tight_layout()
plt.subplots_adjust(top=0.8, bottom=0.1, left=0.1, right=0.9)
plt.show()
return sensitivity
x_sample = x_test[100] # Select a sample from the test set
sensitivity = sensitivity_analysis(best_model, x_sample)
UMAP is a nonlinear dimensionality reduction technique that visualizes high-dimensional data in a lower-dimensional space, preserving the local and global structure. In trajectory forecasting, UMAP can project high-dimensional model representations into 2D or 3D to clarify the relationships between input features and outputs. Unlike sensitivity analysis, which measures prediction changes due to input feature perturbations, UMAP reveals data structure without perturbations. It also differs from feature permutation, which evaluates feature importance by shuffling values and assessing model performance changes. UMAP focuses on visualizing intrinsic data structures and relationships.
def visualize_intermediate_representations(model, x_test_subset, y_test_subset, n_neighbors=15, min_dist=0.1, n_components=2):
# Extract intermediate representations from your model
intermediate_layer_model = keras.Model(inputs=model.input, outputs=model.layers[-2].output)
intermediate_output = intermediate_layer_model.predict(x_test_subset, verbose=0)
# Flatten the last two dimensions of the intermediate_output
flat_intermediate_output = intermediate_output.reshape(intermediate_output.shape[0], -1)
# UMAP
reducer = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist, n_components=n_components, random_state=seed_value)
umap_output = reducer.fit_transform(flat_intermediate_output)
# Convert y_test_subset to strings
y_test_str = np.array([str(label) for label in y_test_subset])
# Map string labels to colors
unique_labels = np.unique(y_test_str)
colormap = plt.cm.get_cmap('viridis', len(unique_labels))
label_to_color = {label: colormap(i) for i, label in enumerate(unique_labels)}
colors = np.array([label_to_color[label] for label in y_test_str])
# Create plot with Matplotlib
fig, ax = plt.subplots(figsize=(10, 8))
sc = ax.scatter(umap_output[:, 0], umap_output[:, 1], c=colors, s=5)
ax.set_title("UMAP Visualization", fontsize=14, fontweight="bold")
ax.set_xlabel("X Dimension", fontsize=12)
ax.set_ylabel("Y Dimension", fontsize=12)
ax.grid(True, linestyle='--', alpha=0.5)
# Add a colorbar to the plot
sm = plt.cm.ScalarMappable(cmap=colormap, norm=plt.Normalize(vmin=0, vmax=len(unique_labels)-1))
sm.set_array([])
cbar = plt.colorbar(sm, ticks=range(len(unique_labels)), ax=ax)
cbar.ax.set_yticklabels(unique_labels)
cbar.set_label("MMSIs")
plt.show()
visualize_intermediate_representations(best_model, x_test[:10000], mmsi_test[:10000], n_neighbors=10, min_dist=0.5)
GRUs can effectively forecast vessel trajectories but have notable downsides. A primary limitation is their struggle with long-term dependencies due to the vanishing gradient problem, causing the loss of relevant information from earlier time steps. This makes capturing long-term patterns in vessel trajectories challenging. Additionally, GRUs are computationally expensive with large datasets and long sequences, resulting in longer training times and higher memory use. While outperforming basic RNNs, they may not always surpass advanced architectures like LSTMs or Transformer models. Furthermore, the interpretability of GRU-based models is a challenge, which can hinder their adoption in safety-critical applications like vessel trajectory forecasting.
AIS data is messy. It comes as a stream of latitude, longitude, timestamps, and a few metadata fields. By themselves, these raw records are hard to compare or feed into downstream ML models. What we really want is a compact, fixed-length representation of each trajectory that can be compared, clustered, and fed into downstream models.
Inspired by word2vec, traj2vec applies the same logic to movement data: instead of predicting the next word, it predicts the next location in a sequence. Just like words gain meaning from context, vessel positions gain meaning from their trajectory history.
The result: trajectories that “look alike” end up close together in embedding space. For instance, two ferries running parallel routes will embed similarly, while a cargo vessel crossing the Gulf of Mexico will sit far away from a fishing boat looping off the coast.
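To make "closeness in embedding space" concrete, here is a minimal, self-contained sketch that is not part of the tutorial pipeline: assuming two trajectories have already been embedded as fixed-length vectors, their similarity can be measured with cosine similarity. The vectors below are made-up placeholders for illustration only.
import numpy as np
def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors (1 = identical direction)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
emb_ferry_a = np.array([0.9, 0.1, 0.3])    # hypothetical embedding of one ferry route
emb_ferry_b = np.array([0.8, 0.2, 0.35])   # hypothetical embedding of a parallel ferry route
emb_cargo = np.array([-0.5, 0.7, -0.2])    # hypothetical embedding of a cargo crossing
print(cosine_similarity(emb_ferry_a, emb_ferry_b))  # high: similar routes embed close together
print(cosine_similarity(emb_ferry_a, emb_cargo))    # low: dissimilar behavior embeds far apart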
import os
import h3
import json
import aisdb
import cartopy.feature as cfeature
import cartopy.crs as ccrs
from aisdb.database.dbconn import PostgresDBConn
from aisdb.denoising_encoder import encode_greatcircledistance, InlandDenoising
from aisdb.track_gen import min_speed_filter, min_track_length_filter
from aisdb.database import sqlfcn
from datetime import datetime, timedelta
from collections import defaultdict
from tqdm import tqdm
import pprint
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
import nest_asyncio
nest_asyncio.apply()
This function pulls raw AIS data from a database, denoises it, splits tracks into time-consistent segments, filters outliers, and interpolates them at fixed time steps. The result is a set of clean, continuous vessel trajectories ready for embedding.
def process_interval(dbconn, start, end):
# Open a new connection with the database
qry = aisdb.DBQuery(dbconn=dbconn, start=start, end=end,
xmin=xmin, ymin=ymin, xmax=xmax, ymax=ymax,
callback=aisdb.database.sqlfcn_callbacks.in_bbox_time_validmmsi)
# Decimate is for removing unnecessary points in the trajectory
rowgen = qry.gen_qry(fcn=sqlfcn.crawl_dynamic_static)
tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
with InlandDenoising(data_dir='./data/tmp/') as remover:
cleaned_tracks = remover.filter_noisy_points(tracks)
# Split the tracks based on the time between transmissions
track_segments = aisdb.track_gen.split_timedelta(cleaned_tracks, time_split)
# Filter out segments that are below the min-score threshold
tracks_encoded = encode_greatcircledistance(track_segments, distance_threshold=distance_split, speed_threshold=speed_split)
tracks_encoded = min_speed_filter(tracks_encoded, minspeed=1)
# Interpolate the segments at a fixed one-minute step to enforce continuity
tracks_interpolated = aisdb.interp.interp_time(tracks_encoded, step=timedelta(minutes=1))
# Materialize the interpolated tracks into a list before returning
return list(tracks_interpolated)
We load the study region (Gulf shapefile) and a hexagonal grid (H3 resolution 6). These will be used to map vessel positions into discrete spatial cells: the "tokens" for our trajectory embedding model.
# Load the shapefile
gulf_shapefile = './data/region/gulf.shp'
print(f"Loading shapefile from {gulf_shapefile}...")
gdf_gulf = gpd.read_file(gulf_shapefile)
gdf_hexagons = gpd.read_file('./data/cell/Hexagons_6.shp')
gdf_hexagons = gdf_hexagons.to_crs(epsg=4326)  # Reproject to a consistent CRS (WGS84)
Each trajectory is converted from lat/lon coordinates into H3 hexagon IDs at resolution 6. To avoid redundant entries, we deduplicate consecutive identical cells while keeping the timestamp of the first entry. The result is a sequence of discrete spatial tokens with time information, which is the input format for traj2vec.
# valid_h3_ids = set(gdf_hexagons['hex_id'])
bounding_box = gdf_hexagons.total_bounds # Extract the bounding box
# bounding_box = gdf_gulf.total_bounds # Extract the bounding box
xmin, ymin, xmax, ymax = bounding_box # Split the bounding box
start_date = datetime(2023, 1, 1)
end_date = datetime(2023, 1, 30)
print(f"Processing trajectories from {start_date} to {end_date}")
# Define pre-processing parameters
time_split = timedelta(hours=3)
distance_split = 10000 # meters
speed_split = 40 # knots
cell_visits = defaultdict(lambda: defaultdict(list))
filtered_visits = defaultdict(lambda: defaultdict(list))
g2h3_vec = np.vectorize(h3.latlng_to_cell)
pp = pprint.PrettyPrinter(indent=4)
track_info_list = []
track_list = process_interval(dbconn, start_date, end_date)
for track in tqdm(track_list, total=len(track_list), desc="Vessels", leave=False):
h3_ids = g2h3_vec(track['lat'], track['lon'], 6)
timestamps = track['time']
# Identify the entry points of cells on a track
# Deduplicate consecutive identical h3_ids while preserving the entry timestamp
dedup_h3_ids = [h3_ids[0]]
dedup_timestamps = [timestamps[0]]
for i in range(1, len(h3_ids)):
if h3_ids[i] != dedup_h3_ids[-1]:
dedup_h3_ids.append(h3_ids[i])
dedup_timestamps.append(timestamps[i])
track_info = {
"mmsi": track['mmsi'],
"h3_seq": dedup_h3_ids,
"timestamp_seq": dedup_timestamps
}
track_info_list.append(track_info)
Before training embeddings, it is useful to check how long our AIS trajectories are. The function below computes summary statistics (min, max, mean, percentiles) and plots the distribution of track lengths in terms of H3 cells.
In our dataset, the distribution is skewed to the right: most vessel tracks are relatively short, with only a few very long trajectories.
For a simpler visualization, we also plot trajectories in raw lat/lon space without cartographic features. This is handy for debugging and checking if preprocessing (deduplication, interpolation) worked correctly.
import seaborn as sns
def plot_length_distribution(track_lengths):
# Compute summary stats
length_stats = {
"min": int(np.min(track_lengths)),
"max": int(np.max(track_lengths)),
"mean": float(np.mean(track_lengths)),
"median": float(np.median(track_lengths)),
"percentiles": {
"10%": int(np.percentile(track_lengths, 10)),
"25%": int(np.percentile(track_lengths, 25)),
"50%": int(np.percentile(track_lengths, 50)),
"75%": int(np.percentile(track_lengths, 75)),
"90%": int(np.percentile(track_lengths, 90)),
"95%": int(np.percentile(track_lengths, 95)),
}
}
print(length_stats)
# Plot distribution
plt.figure(figsize=(10, 6))
sns.histplot(track_lengths, bins=100, kde=True)
plt.title("Distribution of Track Lengths")
plt.xlabel("Track Length (number of H3 cells)")
plt.ylabel("Frequency")
plt.grid(True)
plt.tight_layout()
plt.show()
def map_view(tracks, dot_size=3, color=None, save=False, path=None, bbox=None, line=False, line_width=0.5, line_opacity=0.3):
fig = plt.figure(figsize=(16, 9))
ax = plt.axes(projection=ccrs.PlateCarree())
# Add cartographic features
ax.add_feature(cfeature.OCEAN.with_scale('10m'), facecolor='#E0E0E0')
ax.add_feature(cfeature.LAND.with_scale('10m'), facecolor='#FFE5CC')
ax.add_feature(cfeature.BORDERS, linestyle=':')
ax.add_feature(cfeature.LAKES, alpha=0.5)
ax.add_feature(cfeature.RIVERS)
ax.coastlines(resolution='10m')
if line:
for track in tqdm(tracks):
ax.plot(track['lon'], track['lat'], color=color, linewidth=line_width, alpha=line_opacity, transform=ccrs.PlateCarree())
else:
for track in tqdm(tracks):
ax.scatter(track['lon'], track['lat'], c=color, s=dot_size, transform=ccrs.PlateCarree())
if bbox:
# Set the map extent based on a bounding box
ax.set_extent(bbox, crs=ccrs.PlateCarree())
ax.gridlines(draw_labels=True)
if save:
plt.savefig(path, dpi=300, transparent=True)
plt.show()
def hex_view(lats, lons, save=True):
plt.figure(figsize=(8,8))
for traj_lat, traj_lon in zip(lats, lons):
plt.plot(traj_lon, traj_lat, alpha=0.3, linewidth=1)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Test Trajectories")
plt.axis("equal")
if save:
plt.savefig("img/test_track.png", dpi=300)
Filtering out tracks that are too short or too long, keeping only sequences with 10 to 300 H3 cells:
track_info_list = [t for t in track_info_list if (len(t['h3_seq']) >= 10)&(len(t['h3_seq']) <= 300)]
We collect all unique H3 IDs from the trajectories and assign each one an integer index. Just like in NLP, we also reserve special tokens for padding, start, and end of sequence. This turns spatial cells into a vocabulary that our embedding model can work with.
Each vessel track is then mapped from its H3 sequence into an integer sequence (int_seq). We also convert the H3 cells back into lat/lon pairs for later visualization. At this point, the data is ready to be fed into a traj2vec-style model.
vec_cell_to_latlng = np.vectorize(h3.cell_to_latlng)
# Extract hex ids from all tracks
all_h3_ids = set()
for track in track_info_list:
all_h3_ids.update(track['h3_seq']) # or t['int_seq'] if already mapped
# Build vocab: reserve 0,1,2 for BOS, EOS, PAD
h3_vocab = {h: i+3 for i, h in enumerate(sorted(all_h3_ids))}
# h3_vocab = {h: i+3 for i, h in enumerate(sorted(valid_h3_ids))}
special_tokens = {"<PAD>": 0, "<BOS>": 1, "<EOS>": 2}
h3_vocab.update(special_tokens)
for t in track_info_list:
t["int_seq"] = [h3_vocab[h] for h in t["h3_seq"] if h in h3_vocab]
t["lat"], t["lon"] = vec_cell_to_latlng(t.get('h3_seq'))
We split the cleaned trajectories into train, validation, and test sets. This ensures our model can be trained, tuned, and evaluated fairly without data leakage.
Each trajectory is written out in multiple aligned formats:
.src → input sequence (all tokens except last)
.trg → target sequence (all tokens except first)
.lat / .lon → raw geographic coordinates (for visualization)
.t → the complete trajectory sequence
This setup mirrors NLP datasets, where models learn to predict the “next token” in a sequence.
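As a toy walk-through (with hypothetical cell IDs and vocabulary indices), this is how one short track would end up in those files:
# Toy example only: three made-up H3 cells and a tiny vocabulary with the reserved special tokens
toy_vocab = {"<PAD>": 0, "<BOS>": 1, "<EOS>": 2,
             "860e4d31fffffff": 3, "860e4d317ffffff": 4, "860e4d303ffffff": 5}
h3_seq = ["860e4d31fffffff", "860e4d317ffffff", "860e4d303ffffff"]
int_seq = [toy_vocab[h] for h in h3_seq]   # [3, 4, 5]
src, trg = int_seq[:-1], int_seq[1:]       # the model sees [3, 4] and learns to predict [4, 5]
print(" ".join(map(str, int_seq)))         # line written to the _trj.t file
print(" ".join(map(str, src)))             # line written to the .src file
print(" ".join(map(str, trg)))             # line written to the .trg file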
from sklearn.model_selection import train_test_split
# Initial split: train vs temp
train_tracks, temp_tracks = train_test_split(track_info_list, test_size=0.4, random_state=42)
# Second split: validation vs test
val_tracks, test_tracks = train_test_split(temp_tracks, test_size=0.5, random_state=42)
def save_data(tracks, prefix, output_dir="data"):
os.makedirs(output_dir, exist_ok=True)
with open(os.path.join(output_dir, f"{prefix}.src"), "w") as f_src, \
open(os.path.join(output_dir, f"{prefix}.trg"), "w") as f_trg, \
open(os.path.join(output_dir, f"{prefix}.lat"), "w") as f_lat, \
open(os.path.join(output_dir, f"{prefix}.lon"), "w") as f_lon, \
open(os.path.join(output_dir, f"{prefix}_trj.t"), "w") as f_t:
for idx, t in enumerate(tracks):
ids = t["int_seq"]
src = ids[:-1]
trg = ids[1:]
f_t.write(" ".join(map(str, ids)) + "\n") # the whole track, t = src U trg
f_src.write(" ".join(map(str, src)) + "\n")
f_trg.write(" ".join(map(str, trg)) + "\n")
f_lat.write(" ".join(map(str, t.get('lat'))) + "\n")
f_lon.write(" ".join(map(str, t.get('lon'))) + "\n")
def save_h3_vocab(h3_vocab, output_dir="data", filename="vocab.json"):
os.makedirs(output_dir, exist_ok=True)
with open(os.path.join(output_dir, filename), "w") as f:
json.dump(h3_vocab, f, indent=2)
save_data(train_tracks, "train")
save_data(val_tracks, "val")
save_data(test_tracks, "test")
save_h3_vocab(h3_vocab) # save the INT index mapping to H3 index
lats = [np.fromstring(line, sep=' ') for line in open("data/train.lat")]
lons = [np.fromstring(line, sep=' ') for line in open("data/train.lon")]
hex_view(lats, lons)
import os
import numpy as np
import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_norm_
# from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from torch.utils.tensorboard import SummaryWriter
# from funcy import merge
import time, os, shutil, logging, h5py
# from collections import namedtuple
from model.t2vec import EncoderDecoder
from data_loader import DataLoader
from utils import *
from model.loss import *
writer = SummaryWriter()
PAD = 0
BOS = 1
EOS = 2
We set up utility functions to initialize model weights, save checkpoints during training, and run validation. These ensure training is reproducible and models can be restored later.
The train() function loads train/val datasets, defines the loss functions (negative log-likelihood or KL-divergence), and builds the encoder-decoder model with its optimizer and scheduler. If a checkpoint exists, training resumes from where it left off; otherwise, parameters are freshly initialized.
Training uses two objectives:
Generative loss (predicting the next trajectory cell, like word prediction in NLP).
Discriminative loss (triplet margin loss, ensuring embeddings of similar trajectories are close while different ones are far apart).
These combined losses help the model learn not only to generate realistic trajectories but also to embed them in a useful vector space.
The loop runs over iterations, logging training progress, validating periodically, and saving checkpoints. A learning rate scheduler adjusts the optimizer based on validation loss, and early stopping prevents wasted computation when no improvements occur.
The test() function loads the best checkpoint, evaluates it on the test set, and reports average loss and perplexity. Perplexity is borrowed from NLP — lower values mean the model is more confident in predicting the next trajectory cell.
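For intuition, perplexity is simply the exponential of the average per-token loss. A quick sanity check, using the average loss reported in the results further down this page, looks like this:
import math
avg_nll = 0.2309           # average negative log-likelihood per token (see the test results below)
print(math.exp(avg_nll))   # ~1.26: the model assigns very high probability to the correct next cell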
def init_parameters(model):
for p in model.parameters():
p.data.uniform_(-0.1, 0.1)
def savecheckpoint(state, is_best, args):
torch.save(state, args.checkpoint)
if is_best:
shutil.copyfile(args.checkpoint, os.path.join(args.data, 'best_model.pt'))
def validate(valData, model, lossF, args):
"""
valData (DataLoader)
"""
m0, m1 = model
## switch to evaluation mode
m0.eval()
m1.eval()
num_iteration = valData.size // args.batch
if valData.size % args.batch > 0: num_iteration += 1
total_genloss = 0
for iteration in range(num_iteration):
gendata = valData.getbatch_generative()
with torch.no_grad():
genloss = genLoss(gendata, m0, m1, lossF, args)
total_genloss += genloss.item() * gendata.trg.size(1)
## switch back to training mode
m0.train()
m1.train()
return total_genloss / valData.size
def train(args):
logging.basicConfig(filename=os.path.join(args.data, "training.log"), level=logging.INFO)
trainsrc = os.path.join(args.data, "train.src") # data path
traintrg = os.path.join(args.data, "train.trg")
trainlat = os.path.join(args.data, "train.lat")
trainlon = os.path.join(args.data, "train.lon")
# trainmta = os.path.join(args.data, "train.mta")
trainData = DataLoader(trainsrc, traintrg, trainlat, trainlon, args.batch, args.bucketsize)
print("Reading training data...")
trainData.load(args.max_num_line)
print("Allocation: {}".format(trainData.allocation))
print("Percent: {}".format(trainData.p))
valsrc = os.path.join(args.data, "val.src")
valtrg = os.path.join(args.data, "val.trg")
vallat = os.path.join(args.data, "val.lat")
vallon = os.path.join(args.data, "val.lon")
if os.path.isfile(valsrc) and os.path.isfile(valtrg):
valData = DataLoader(valsrc, valtrg, vallat, vallon, args.batch, args.bucketsize, validate=True)
print("Reading validation data...")
valData.load()
assert valData.size > 0, "Validation data size must be greater than 0"
print("Loaded validation data size {}".format(valData.size))
else:
print("No validation data found, training without validating...")
## create criterion, model, optimizer
if args.criterion_name == "NLL":
criterion = NLLcriterion(args.vocab_size)
lossF = lambda o, t: criterion(o, t)
else:
assert os.path.isfile(args.knearestvocabs),\
"{} does not exist".format(args.knearestvocabs)
print("Loading vocab distance file {}...".format(args.knearestvocabs))
with h5py.File(args.knearestvocabs, "r") as f:
V, D = f["V"][...], f["D"][...]
V, D = torch.LongTensor(V), torch.FloatTensor(D)
D = dist2weight(D, args.dist_decay_speed)
if args.cuda and torch.cuda.is_available():
V, D = V.cuda(), D.cuda()
criterion = KLDIVcriterion(args.vocab_size)
lossF = lambda o, t: KLDIVloss(o, t, criterion, V, D)
triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)
m0 = EncoderDecoder(args.vocab_size,
args.embedding_size,
args.hidden_size,
args.num_layers,
args.dropout,
args.bidirectional)
m1 = nn.Sequential(nn.Linear(args.hidden_size, args.vocab_size),
nn.LogSoftmax(dim=1))
if args.cuda and torch.cuda.is_available():
print("=> training with GPU")
m0.cuda()
m1.cuda()
criterion.cuda()
#m0 = nn.DataParallel(m0, dim=1)
else:
print("=> training with CPU")
m0_optimizer = torch.optim.Adam(m0.parameters(), lr=args.learning_rate)
m1_optimizer = torch.optim.Adam(m1.parameters(), lr=args.learning_rate)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
m0_optimizer,
mode='min',
patience=args.lr_decay_patience,
min_lr=0,
verbose=True
)
## load model state and optmizer state
if os.path.isfile(args.checkpoint):
print("=> loading checkpoint '{}'".format(args.checkpoint))
logging.info("Restore training @ {}".format(time.ctime()))
checkpoint = torch.load(args.checkpoint)
args.start_iteration = checkpoint["iteration"]
best_prec_loss = checkpoint["best_prec_loss"]
m0.load_state_dict(checkpoint["m0"])
m1.load_state_dict(checkpoint["m1"])
m0_optimizer.load_state_dict(checkpoint["m0_optimizer"])
m1_optimizer.load_state_dict(checkpoint["m1_optimizer"])
else:
print("=> no checkpoint found at '{}'".format(args.checkpoint))
logging.info("Start training @ {}".format(time.ctime()))
best_prec_loss = float('inf')
args.start_iteration = 0
print("=> initializing the parameters...")
init_parameters(m0)
init_parameters(m1)
## here: optionally load a pretrained word (cell) embedding
num_iteration = 6700*128 // args.batch
print("Iteration starts at {} "
"and will end at {}".format(args.start_iteration, num_iteration-1))
no_improvement_count = 0
## training
for iteration in range(args.start_iteration, num_iteration):
try:
m0_optimizer.zero_grad()
m1_optimizer.zero_grad()
## generative loss
gendata = trainData.getbatch_generative()
genloss = genLoss(gendata, m0, m1, lossF, args)
## discriminative loss
disloss_cross, disloss_inner = 0, 0
if args.use_discriminative and iteration % 5 == 0:
a, p, n = trainData.getbatch_discriminative_cross()
disloss_cross = disLoss(a, p, n, m0, triplet_loss, args)
a, p, n = trainData.getbatch_discriminative_inner()
disloss_inner = disLoss(a, p, n, m0, triplet_loss, args)
loss = genloss + args.discriminative_w * (disloss_cross + disloss_inner)
# Add to tensorboard
writer.add_scalar('Loss/train', loss, iteration)
## compute the gradients
loss.backward()
## clip the gradients
clip_grad_norm_(m0.parameters(), args.max_grad_norm)
clip_grad_norm_(m1.parameters(), args.max_grad_norm)
## one step optimization
m0_optimizer.step()
m1_optimizer.step()
## average loss for one word
avg_genloss = genloss.item() / gendata.trg.size(0)
if iteration % args.print_freq == 0:
print("Iteration: {0:}\tGenerative Loss: {1:.3f}\t"\
"Discriminative Cross Loss: {2:.3f}\tDiscriminative Inner Loss: {3:.3f}"\
.format(iteration, avg_genloss, disloss_cross, disloss_inner))
if iteration % args.save_freq == 0 and iteration > 0:
prec_loss = validate(valData, (m0, m1), lossF, args)
# Add to tensorboard
writer.add_scalar('Loss/validation', prec_loss, iteration)
scheduler.step(prec_loss)
if prec_loss < best_prec_loss:
best_prec_loss = prec_loss
logging.info("Best model with loss {} at iteration {} @ {}"\
.format(best_prec_loss, iteration, time.ctime()))
is_best = True
no_improvement_count = 0
else:
is_best = False
no_improvement_count += 1
print("Saving the model at iteration {} validation loss {}"\
.format(iteration, prec_loss))
savecheckpoint({
"iteration": iteration,
"best_prec_loss": best_prec_loss,
"m0": m0.state_dict(),
"m1": m1.state_dict(),
"m0_optimizer": m0_optimizer.state_dict(),
"m1_optimizer": m1_optimizer.state_dict()
}, is_best, args)
# Early stopping if there is no improvement after a certain number of epochs
if no_improvement_count >= args.early_stopping_patience:
print('No improvement after {} iterations, early stopping triggered.'.format(args.early_stopping_patience))
break
except KeyboardInterrupt:
break
def test(args):
# load testing data
testsrc = os.path.join(args.data, "test.src")
testtrg = os.path.join(args.data, "test.trg")
testlat = os.path.join(args.data, "test.lat")
testlon = os.path.join(args.data, "test.lon")
if os.path.isfile(testsrc) and os.path.isfile(testtrg):
testData = DataLoader(testsrc, testtrg, testlat, testlon, args.batch, args.bucketsize, validate=True)
print("Reading testing data...")
testData.load()
assert testData.size > 0, "Testing data size must be greater than 0"
print("Loaded testing data size {}".format(testData.size))
else:
print("No testing data found, aborting test.")
return
# set up model
m0 = EncoderDecoder(args.vocab_size,
args.embedding_size,
args.hidden_size,
args.num_layers,
args.dropout,
args.bidirectional)
m1 = nn.Sequential(nn.Linear(args.hidden_size, args.vocab_size),
nn.LogSoftmax(dim=1))
# load best model state
best_model_path = 'data/best_model.pt'
if os.path.isfile(best_model_path):
print("=> loading checkpoint '{}'".format(args.checkpoint))
best_model = torch.load(best_model_path)
m0.load_state_dict(best_model["m0"])
m1.load_state_dict(best_model["m1"])
else:
print("Best model not found. Aborting test.")
return
m0.eval()
m1.eval()
# loss function
criterion = NLLcriterion(args.vocab_size)
lossF = lambda o, t: criterion(o, t)
# check device
if args.cuda and torch.cuda.is_available():
print("=> test with GPU")
m0.cuda()
m1.cuda()
criterion.cuda()
#m0 = nn.DataParallel(m0, dim=1)
else:
print("=> test with CPU")
num_iteration = (testData.size + args.batch - 1) // args.batch
total_genloss = 0
total_tokens = 0
with torch.no_grad():
for iter in range(num_iteration):
gendata = testData.getbatch_generative()
genloss = genLoss(gendata, m0, m1, lossF, args)
total_genloss += genloss.item()  # accumulate batch loss (no sequence-length multiplication here)
total_tokens += (gendata.trg != PAD).sum().item() # count non-pad tokens
print("Testing genloss at {} iteration is {}".format(iter, total_genloss))
avg_loss = total_genloss / total_tokens
perplexity = torch.exp(torch.tensor(avg_loss))
print(f"[Test] Avg Loss: {avg_loss:.4f} | Perplexity: {perplexity:.2f}")
ARGS
class Args:
data = 'data/'
checkpoint = 'data/checkpoint.pt'
vocab_size = len(h3_vocab)
embedding_size = 128
hidden_size = 128
num_layers = 1
dropout = 0.1
max_grad_norm = 1.0
learning_rate = 1e-2
lr_decay_patience = 20
early_stopping_patience = 50
cuda = torch.cuda.is_available()
bidirectional = True
batch = 16
num_epochs = 100
bucketsize = [(20,30),(30,30),(30,50),(50,50),(50,70),(70,70),(70,100),(100,100)]
criterion_name = "NLL"
use_discriminative = True
discriminative_w = 0.1
max_num_line = 200000
start_iteration = 0
generator_batch = 16
print_freq = 10
save_freq = 10
args = Args()
train(args)
Test
test(args)
Results
The generative model’s performance on the test dataset demonstrates remarkable accuracy in predicting vessel trajectories. Over the course of the evaluation, the cumulative generative loss across batches increased steadily, reflecting the accumulation of prediction errors over the sequences. When aggregated and normalized per token, the average loss was 0.2309, corresponding to a perplexity of approximately 1.26.
Perplexity is a standard measure in sequence modeling that quantifies how well a probabilistic model predicts a sequence of tokens. A perplexity close to 1 indicates near-deterministic prediction, meaning that the model assigns very high probability to the correct next token in the sequence. In the context of vessel trajectories, this result implies that the model is extremely confident and precise in forecasting the next H3 cell in a track, capturing the underlying spatial and temporal patterns in the data.
These results are particularly noteworthy because vessel movements are constrained by both geography and navigational behavior. The model effectively learns these patterns, predicting transitions between cells with minimal uncertainty. Achieving such a low perplexity confirms that the preprocessing pipeline, H3 cell encoding, and the sequence modeling architecture are all functioning harmoniously, enabling highly accurate trajectory modeling.
Overall, the evaluation demonstrates that the model not only generalizes well to unseen tracks but also reliably captures the deterministic structure of vessel movement, providing a robust foundation for downstream tasks such as trajectory prediction, anomaly detection, or maritime route analysis.
Testing genloss at 0 iteration is 46.40993881225586
Testing genloss at 1 iteration is 83.17555618286133
Testing genloss at 2 iteration is 122.76013565063477
Testing genloss at 3 iteration is 167.81907272338867
Testing genloss at 4 iteration is 223.75146102905273
Testing genloss at 5 iteration is 287.765926361084
Testing genloss at 6 iteration is 328.6252250671387
Testing genloss at 7 iteration is 394.95031356811523
Testing genloss at 8 iteration is 459.411678314209
Testing genloss at 9 iteration is 557.8198432922363
Testing genloss at 10 iteration is 724.5464973449707
Testing genloss at 11 iteration is 876.1395149230957
Testing genloss at 12 iteration is 1020.6461372375488
Testing genloss at 13 iteration is 1277.3499336242676
Testing genloss at 14 iteration is 1416.0101203918457
Testing genloss at 15 iteration is 1742.3399543762207
Testing genloss at 16 iteration is 2101.4984016418457
Testing genloss at 17 iteration is 2319.603458404541
[Test] Avg Loss: 0.2309 | Perplexity: 1.26
On this page, we will see how to use AISDb to discretize AIS tracks into hexagons.
Hexagonal Geospatial Indexing (H3): Uber's hierarchical hexagonal geospatial indexing system partitions the Earth into a multi-resolution hexagonal grid. Its key advantage over square grids is the "one-distance rule": all neighbors of a hexagon lie at comparable step distances.
As illustrated in the figure above, this uniformity removes the diagonal-versus-edge ambiguity present in square lattices. For maritime work, hexagons are great because they reduce directional bias and make neighborhood queries and aggregation intuitive.
Note: H3 indexes are 64-bit IDs typically shown as hex strings like “860e4d31fffffff.”
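The short sketch below illustrates the one-distance rule directly with the h3 package, assuming the same v4 h3-py API used elsewhere on this page (latlng_to_cell, grid_disk, grid_distance); the coordinates are arbitrary.
import h3
cell = h3.latlng_to_cell(50.0, -66.76, 6)   # index an arbitrary point at resolution 6
print(cell)                                  # a 64-bit index rendered as a hex string
for neighbor in h3.grid_disk(cell, 1):       # the cell itself plus its first ring of neighbors
    print(neighbor, h3.grid_distance(cell, neighbor))  # every true neighbor is exactly one step away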
The code below provides a complete example of how to connect to a database of AIS data using AISDb and generate the corresponding H3 index for each data point.
import aisdb
from aisdb import DBQuery
from aisdb.database.dbconn import PostgresDBConn
from datetime import datetime, timedelta
from aisdb.discretize.h3 import Discretizer # main import to convert lat/lon to H3 indexes
# >>> PostgreSQL connection details (replace placeholders or use environment variables) <<<
db_user = '<>' # PostgreSQL username
db_dbname = '<>' # PostgreSQL database/schema name
db_password = '<>' # PostgreSQL password
db_hostaddr = '127.0.0.1' # PostgreSQL host (localhost shown)
# Create a database connection handle for AISDB to use
dbconn = PostgresDBConn(
port=5555, # PostgreSQL port (5432 is default; 5555 here is just an example)
user=db_user, # username for authentication
dbname=db_dbname, # database/schema to connect to
host=db_hostaddr, # host address or DNS name
password=db_password, # password for authentication
)
# ------------------------------
# Define the spatial and temporal query window
# Note: bbox is [xmin, ymin, xmax, ymax] in lon/lat; variables below help readability
xmin, ymin, xmax, ymax = -70, 45, -58, 53
gulf_bbox = [xmin, ymin, xmax, ymax]  # optional helper; not used directly below
start_time = datetime(2023, 8, 1) # query start (inclusive)
end_time = datetime(2023, 8, 2) # query end (exclusive or inclusive per DB settings)
# Build a query that streams AIS rows in the time window and bounding box
qry = DBQuery(
dbconn=dbconn,
start=start_time, end=end_time,
xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax,
# Callback filters rows by time, bbox, and ensures MMSI validity (helps remove junk)
callback=aisdb.database.sqlfcn_callbacks.in_time_bbox_validmmsi
)
# Prepare containers/generators for streamed processing (memory-efficient)
ais_tracks = [] # placeholder list if you want to collect tracks (unused below)
rowgen = qry.gen_qry() # generator that yields raw AIS rows from the database
# ------------------------------
# Instantiate the H3 Discretizer at a chosen resolution
# Resolution 6 ≈ regional scale hexagons; increase for finer grids, decrease for coarser grids
discretizer = Discretizer(resolution=6)
# Build tracks from rows; decimate=True reduces oversampling and speeds up processing
tracks = aisdb.track_gen.TrackGen(rowgen, decimate=True)
# Optionally split long tracks into time-bounded segments (e.g., 4-week chunks)
# Useful for chunked processing or time-based aggregation; not used further in this snippet
tracks_segment = aisdb.track_gen.split_timedelta(
tracks,
timedelta(weeks=4)
)
# Discretize each track: adds an H3 index array aligned with lat/lon points for that track
# Each yielded track will have keys like 'lat', 'lon', and 'h3_index'
tracks_with_indexes = discretizer.yield_tracks_discretized_by_indexes(tracks)
# Example (optional) usage:
# for t in tracks_with_indexes:
# # Access the first point's H3 index for this track
# print(t['mmsi'], t['timestamp'][0], t['lat'][0], t['lon'][0], t['h3_index'][0])
# break
# Output: H3 Index for lat 50.003334045410156, lon -66.76000213623047: 860e4d31fffffff
Refer to the example notebook here: https://github.com/AISViz/AISdb/blob/master/examples/discretize.ipynb
Traditional Seq2Seq LSTM models have long been the workhorse for trajectory forecasting. They excel at learning temporal dependencies in AIS data, and with attention mechanisms they can capture complex nonlinear patterns over long histories. However, they remain purely statistical. This means the model can generate plausible-looking trajectories from a data perspective but with no guarantee that those predictions respect the underlying physics of vessel motion. In practice, this often manifests as sharp turns, unrealistic accelerations, or trajectories that deviate significantly when the model faces sparse or noisy data.
The NPINN-based approach directly addresses these shortcomings. By embedding smoothness and kinematic penalties into training, it enforces constraints on velocity and acceleration while still benefiting from the representational power of deep sequence models. Instead of simply fitting residuals between past and future positions, NPINN ensures that predictions evolve in ways consistent with how vessels actually move in the physical world. This leads to more reliable extrapolation, especially in data-scarce regions or unusual navigation scenarios.
The first step in building a trajectory learning pipeline is preprocessing AIS tracks into a model-friendly format. Raw AIS messages are noisy, irregularly sampled, and inconsistent across vessels, so we need to enforce structure before feeding them into neural networks. The function below does several things in sequence:
Data cleaning – removes spurious pings based on unrealistic speeds, encodes great-circle distances, and interpolates trajectories at fixed 5-minute intervals.
Track filtering – groups data by vessel (MMSI) and keeps only sufficiently long tracks to ensure stable training samples.
Feature extraction – converts lat/lon into projected coordinates (x, y), adds speed over ground (sog), and represents course over ground (cog) as sine/cosine to avoid angular discontinuities (a short numeric illustration follows this list).
Delta computation – calculates dx and dy between consecutive timestamps, capturing local motion dynamics.
Scaling – applies RobustScaler to normalize features and deltas while being resilient to outliers (common in AIS data).
The result is a clean, scaled DataFrame where each row represents a vessel state at a timestamp, enriched with both absolute position features and relative motion features.
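Before diving into the full function, here is a small numeric illustration (not part of the pipeline) of why course-over-ground is encoded as sine/cosine: the raw angle jumps from 359° to 1°, while the sin/cos pair changes smoothly across north.
import numpy as np
for cog_deg in (358.0, 359.0, 0.0, 1.0, 2.0):
    rad = np.radians(cog_deg)
    print(f"cog={cog_deg:5.1f}  sin={np.sin(rad):+.4f}  cos={np.cos(rad):+.4f}")
# The gap between 359 and 1 is 358 in raw degrees but only ~0.035 in (sin, cos) space,
# which matches the true 2-degree change of course.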
import pandas as pd
from sklearn.preprocessing import RobustScaler
def preprocess_aisdb_tracks(tracks_gen, proj,
sog_scaler=None,
feature_scaler=None,
delta_scaler=None,
fit_scaler=False):
# --- AISdb cleaning ---
tracks_gen = aisdb.remove_pings_wrt_speed(tracks_gen, 0.1)
tracks_gen = aisdb.encode_greatcircledistance(
tracks_gen,
distance_threshold=50000,
minscore=1e-5,
speed_threshold=50
)
tracks_gen = aisdb.interp_time(tracks_gen, step=timedelta(minutes=5))
# --- collect tracks ---
tracks = list(tracks_gen)
# --- group by MMSI ---
tracks_by_mmsi = defaultdict(list)
for track in tracks:
tracks_by_mmsi[track['mmsi']].append(track)
# --- keep only long-enough tracks ---
valid_tracks = []
for mmsi, mmsi_tracks in tracks_by_mmsi.items():
if all(len(t['time']) >= 100 for t in mmsi_tracks):
valid_tracks.extend(mmsi_tracks)
# --- flatten into dataframe ---
rows = []
for track in valid_tracks:
mmsi = track['mmsi']
sog = track.get('sog', [np.nan]*len(track['time']))
cog = track.get('cog', [np.nan]*len(track['time']))
for i in range(len(track['time'])):
x, y = proj(track['lon'][i], track['lat'][i])
cog_rad = np.radians(cog[i]) if cog[i] is not None else np.nan
rows.append({
'mmsi': mmsi,
'x': x,
'y': y,
'sog': sog[i],
'cog_sin': np.sin(cog_rad) if not np.isnan(cog_rad) else np.nan,
'cog_cos': np.cos(cog_rad) if not np.isnan(cog_rad) else np.nan,
'timestamp': pd.to_datetime(track['time'][i], errors='coerce')
})
df = pd.DataFrame(rows)
# --- clean NaNs ---
df = df.replace([np.inf, -np.inf], np.nan)
df = df.dropna(subset=['x', 'y', 'sog', 'cog_sin', 'cog_cos'])
# --- compute deltas per MMSI ---
df = df.sort_values(["mmsi", "timestamp"])
df["dx"] = df.groupby("mmsi")["x"].diff().fillna(0)
df["dy"] = df.groupby("mmsi")["y"].diff().fillna(0)
# --- scale features ---
feature_cols = ['x', 'y', 'sog', 'cog_sin', 'cog_cos']
delta_cols = ['dx', 'dy']
if fit_scaler:
# sog
sog_scaler = RobustScaler()
df['sog_scaled'] = sog_scaler.fit_transform(df[['sog']])
# absolute features
feature_scaler = RobustScaler()
df[feature_cols] = feature_scaler.fit_transform(df[feature_cols])
# deltas
delta_scaler = RobustScaler()
df[delta_cols] = delta_scaler.fit_transform(df[delta_cols])
else:
df['sog_scaled'] = sog_scaler.transform(df[['sog']])
df[feature_cols] = feature_scaler.transform(df[feature_cols])
df[delta_cols] = delta_scaler.transform(df[delta_cols])
return df, sog_scaler, feature_scaler, delta_scaler
We first query the AIS database for the training and testing periods and geographic bounds, producing generators of raw vessel tracks. These raw tracks are then preprocessed using preprocess_aisdb_tracks, which cleans the data, computes relative motion (dx, dy), scales features, and outputs a ready-to-use DataFrame. Training data fits new scalers, while test data is transformed using the same scalers to ensure consistency.
train_qry = aisdb.DBQuery(dbconn=dbconn, callback=in_timerange,
start=START_DATE, end=END_DATE,
xmin=XMIN, xmax=XMAX, ymin=YMIN, ymax=YMAX)
train_gen = TrackGen(train_qry.gen_qry(verbose=True), decimate=False)
test_qry = aisdb.DBQuery(dbconn=dbconn, callback=in_timerange,
start=TEST_START_DATE, end=TEST_END_DATE,
xmin=XMIN, xmax=XMAX, ymin=YMIN, ymax=YMAX)
test_gen = TrackGen(test_qry.gen_qry(verbose=True), decimate=False)
# --- Preprocess ---
train_df, sog_scaler, feature_scaler, delta_scaler = preprocess_aisdb_tracks(
train_gen, proj, fit_scaler=True
)
test_df, _, _, _ = preprocess_aisdb_tracks(
test_gen,
proj,
sog_scaler=sog_scaler,
feature_scaler=feature_scaler,
delta_scaler=delta_scaler,
fit_scaler=False
)
The create_sequences function transforms the preprocessed track data into supervised sequences suitable for model training. For each vessel, it slides a fixed-size window over the time series, building input sequences of past absolute features (x, y, dx, dy, cog_sin, cog_cos, sog_scaled) and target sequences of future residual movements (dx, dy). Using this, the dataset is split into training, validation, and test sets, with each set containing sequences ready for direct input into a trajectory prediction model.
def create_sequences(df, features, input_size=80, output_size=2, step=1):
"""
Build sequences:
X: past window of absolute features (x, y, dx, dy, cog_sin, cog_cos, sog_scaled)
Y: future residuals (dx, dy)
"""
X_list, Y_list = [], []
for mmsi in df['mmsi'].unique():
sub = df[df['mmsi'] == mmsi].sort_values('timestamp').copy()
# build numpy arrays
feat_arr = sub[features].to_numpy()
dxdy_arr = sub[['dx', 'dy']].to_numpy() # residuals already scaled
for i in range(0, len(sub) - input_size - output_size + 1, step):
# input sequence is absolute features
X_list.append(feat_arr[i : i + input_size])
# output sequence is residuals immediately after
Y_list.append(dxdy_arr[i + input_size : i + input_size + output_size])
return torch.tensor(X_list, dtype=torch.float32), torch.tensor(Y_list, dtype=torch.float32)
features = ['x', 'y', 'dx', 'dy', 'cog_sin', 'cog_cos', 'sog_scaled']
mmsis = train_df['mmsi'].unique()
train_mmsi, val_mmsi = train_test_split(mmsis, test_size=0.2, random_state=42, shuffle=True)
train_X, train_Y = create_sequences(train_df[train_df['mmsi'].isin(train_mmsi)], features)
val_X, val_Y = create_sequences(train_df[train_df['mmsi'].isin(val_mmsi)], features)
test_X, test_Y = create_sequences(test_df, features)
This block saves all processed data and supporting objects needed for training and evaluation. The preprocessed input and target sequences for training, validation, and testing are serialized using PyTorch (datasets_npin.pt). The fitted scalers for features, speed, and residuals are saved with joblib to ensure consistent scaling during inference. Finally, the projection parameters used to convert geographic coordinates to UTM are stored in JSON, allowing consistent coordinate transformations later.
import torch
import joblib
import json
# --- save datasets ---
torch.save({
'train_X': train_X, 'train_Y': train_Y,
'val_X': val_X, 'val_Y': val_Y,
'test_X': test_X, 'test_Y': test_Y
}, 'datasets_npin.pt')
# --- save scalers ---
joblib.dump(feature_scaler, "npinn_feature_scaler.pkl")
joblib.dump(sog_scaler, "npinn_sog_scaler.pkl")
joblib.dump(delta_scaler, "npinn_delta_scaler.pkl") # NEW
# --- save projection parameters ---
proj_params = {'proj': 'utm', 'zone': 20, 'ellps': 'WGS84'}
with open("npinn_proj_params.json", "w") as f:
json.dump(proj_params, f)
data = torch.load('datasets_npin.pt')
import torch
import joblib
import json
import pyproj
from torch.utils.data import TensorDataset, DataLoader
# scalers
feature_scaler = joblib.load("npinn_feature_scaler.pkl")
sog_scaler = joblib.load("npinn_sog_scaler.pkl")
delta_scaler = joblib.load("npinn_delta_scaler.pkl") # NEW
# projection
with open("npinn_proj_params.json", "r") as f:
proj_params = json.load(f)
proj = pyproj.Proj(**proj_params)
train_ds = TensorDataset(data['train_X'], data['train_Y'])
val_ds = TensorDataset(data['val_X'], data['val_Y'])
test_ds = TensorDataset(data['test_X'], data['test_Y'])
batch_size = 64
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
val_dl = DataLoader(val_ds, batch_size=batch_size, shuffle=False)
test_dl = DataLoader(test_ds, batch_size=batch_size)
Seq2SeqLSTM is the sequence-to-sequence backbone used in the NPINN-based trajectory prediction framework. Within NPINN, the encoder LSTM processes past vessel observations (positions, speed, and course) to produce a hidden representation that captures motion dynamics. The decoder LSTMCell predicts future residuals in x and y, with an attention mechanism that selectively focuses on relevant past information at each step. Predicted residuals are added to the last observed position to reconstruct absolute trajectories.
This setup enables NPINN to generate smooth, physically consistent multi-step vessel trajectories, leveraging both historical motion patterns and learned dynamics constraints.
class Seq2SeqLSTM(nn.Module):
def __init__(self, input_size, hidden_size, input_steps, output_steps):
super().__init__()
self.input_steps = input_steps
self.output_steps = output_steps
# Encoder
self.encoder = nn.LSTM(input_size, hidden_size, num_layers=2, dropout=0.3, batch_first=True)
# Decoder
self.decoder = nn.LSTMCell(input_size, hidden_size)
self.attn = nn.Linear(hidden_size * 2, input_steps)
self.attn_combine = nn.Linear(hidden_size + input_size, input_size)
# Output only x,y residuals (added to last observed pos)
self.output_layer = nn.Sequential(
nn.Linear(hidden_size, hidden_size // 2),
nn.ReLU(),
nn.Linear(hidden_size // 2, 2)
)
def forward(self, x, target_seq=None, teacher_forcing_ratio=0.5):
batch_size = x.size(0)
encoder_outputs, (h, c) = self.encoder(x)
h, c = h[-1], c[-1]
last_obs = x[:, -1, :2] # last observed absolute x,y
decoder_input = x[:, -1, :] # full feature vector
outputs = []
for t in range(self.output_steps):
attn_weights = torch.softmax(self.attn(torch.cat((h, c), dim=1)), dim=1)
context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs).squeeze(1)
dec_in = torch.cat((decoder_input, context), dim=1)
dec_in = self.attn_combine(dec_in)
h, c = self.decoder(dec_in, (h, c))
residual_xy = self.output_layer(h)
# accumulate into absolute xy
out_xy = residual_xy + last_obs
outputs.append(out_xy.unsqueeze(1))
# teacher forcing
if self.training and target_seq is not None and t < target_seq.size(1) and random.random() < teacher_forcing_ratio:
decoder_input = torch.cat([target_seq[:, t, :2], decoder_input[:, 2:]], dim=1)
last_obs = target_seq[:, t, :2]
else:
decoder_input = torch.cat([out_xy, decoder_input[:, 2:]], dim=1)
last_obs = out_xy
return torch.cat(outputs, dim=1)  # (batch, output_steps, 2)
This training loop implements NPINN-based trajectory learning, combining data fidelity with physics-inspired smoothness constraints. The weighted_coord_loss enforces accurate prediction of future x, y positions, while xy_npinn_smoothness_loss encourages smooth velocity and acceleration profiles, reflecting realistic vessel motion.
By integrating these two objectives, NPINN learns trajectories that are both close to observed data and physically plausible, with the smoothness weight gradually decaying during training to balance learning accuracy with dynamic consistency. Validation is performed each epoch to ensure generalization, and the best-performing model is saved. This approach differentiates NPINN from standard Seq2Seq training by explicitly incorporating motion dynamics into the loss, rather than relying purely on sequence prediction.
import torch.nn.functional as F
def weighted_coord_loss(pred, target, coord_weight=5.0, reduction='mean'):
return F.smooth_l1_loss(pred, target, reduction=reduction)
def xy_npinn_smoothness_loss(seq_full, coord_min=None, coord_max=None):
"""
NPINN-inspired smoothness penalty on xy coordinates
seq_full: [B, T, 2]
"""
xy = seq_full[..., :2]
if coord_min is not None and coord_max is not None:
xy_norm = (xy - coord_min) / (coord_max - coord_min + 1e-8)
xy_norm = 2 * (xy_norm - 0.5) # [-1,1]
else:
xy_norm = xy
v = xy_norm[:, 1:, :] - xy_norm[:, :-1, :]
a = v[:, 1:, :] - v[:, :-1, :]
return (v**2).mean() * 0.05 + (a**2).mean() * 0.5
def train_model(model, loader, val_dl, optimizer, device, epochs=50,
smooth_w_init=1e-3, coord_min=None, coord_max=None):
best_loss = float('inf')
best_state = None
for epoch in range(epochs):
model.train()
total_loss = total_data_loss = total_smooth_loss = 0.0
for batch_x, batch_y in loader:
batch_x = batch_x.to(device) # [B, T_in, F]
batch_y = batch_y.to(device) # [B, T_out, 2] absolute x,y
optimizer.zero_grad()
pred_xy = model(batch_x, target_seq=batch_y, teacher_forcing_ratio=0.5)
# Data loss: directly on absolute x,y
loss_data = weighted_coord_loss(pred_xy, batch_y)
# Smoothness loss: encourage smooth xy trajectories
y_start = batch_x[:, :, :2]
full_seq = torch.cat([y_start, pred_xy], dim=1) # observed + predicted
loss_smooth = xy_npinn_smoothness_loss(full_seq, coord_min, coord_max)
smooth_weight = smooth_w_init * max(0.1, 1.0 - epoch / 30.0)
loss = loss_data + smooth_weight * loss_smooth
loss.backward()
optimizer.step()
total_loss += loss.item()
total_data_loss += loss_data.item()
total_smooth_loss += loss_smooth.item()
avg_loss = total_loss / len(loader)
print(f"Epoch {epoch+1} | Total: {avg_loss:.6f} | Data: {total_data_loss/len(loader):.6f} | Smooth: {total_smooth_loss/len(loader):.6f}")
# Validation
model.eval()
val_loss = 0.0
with torch.no_grad():
for xb, yb in val_dl:
xb, yb = xb.to(device), yb.to(device)
pred_xy = model(xb, target_seq=yb, teacher_forcing_ratio=0.0)
data_loss = weighted_coord_loss(pred_xy, yb)
full_seq = torch.cat([xb[..., :2], pred_xy], dim=1)
loss_smooth = xy_npinn_smoothness_loss(full_seq, coord_min, coord_max)
val_loss += (data_loss + smooth_weight * loss_smooth).item()
val_loss /= len(val_dl)
print(f" Val Loss: {val_loss:.6f}")
if val_loss < best_loss:
best_loss = val_loss
best_state = model.state_dict()
if best_state is not None:
torch.save(best_state, "best_model_NPINN.pth")
print("Best model saved")This training sets a fixed random seed for reproducibility (torch, numpy, random) and enables deterministic CuDNN behavior. It initializes a Seq2SeqLSTM NPINN model to predict future vessel trajectory residuals from past sequences of absolute features. Global min/max of xy coordinates are computed from the training set for NPINN’s smoothness loss normalization. The model is trained on GPU if available using Adam, combining data loss on predicted xy positions with a physics-inspired smoothness penalty to enforce realistic, physically plausible trajectories.
import torch
import numpy as np
import random
from torch import nn
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)
torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train_X = data['train_X']
input_steps = 80
output_steps = 2
# Collapse batch and time dims to compute global min/max for each feature
flat_train_X = train_X.view(-1, train_X.shape[-1]) # shape: [N*T, F]
x_min, x_max = flat_train_X[:, 0].min().item(), flat_train_X[:, 0].max().item()
y_min, y_max = flat_train_X[:, 1].min().item(), flat_train_X[:, 1].max().item()
cog_sin_min, cog_sin_max = flat_train_X[:, 2].min().item(), flat_train_X[:, 2].max().item()
cog_cos_min, cog_cos_max = flat_train_X[:, 3].min().item(), flat_train_X[:, 3].max().item()
sog_min, sog_max = flat_train_X[:, 4].min().item(), flat_train_X[:, 4].max().item()
coord_min = torch.tensor([x_min, y_min], device=device)
coord_max = torch.tensor([x_max, y_max], device=device)
# Model setup
input_size = 7
hidden_size = 64
model = Seq2SeqLSTM(input_size, hidden_size, input_steps, output_steps).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# coord_min/max only for xy
coord_min = torch.tensor([x_min, y_min], device=device)
coord_max = torch.tensor([x_max, y_max], device=device)
train_model(model, train_dl, val_dl, optimizer, device,
            coord_min=coord_min, coord_max=coord_max)
This snippet sets up the environment for inference or evaluation of the NPINN Seq2Seq model:
Chooses GPU if available.
Loads preprocessed datasets (train_X/Y, val_X/Y, test_X/Y).
Loads the saved RobustScalers for features, SOG, and deltas to match preprocessing during training.
Loads projection parameters to convert lon/lat to projected coordinates consistently.
Rebuilds the same Seq2SeqLSTM NPINN model used during training and loads the best saved weights.
Puts the model in evaluation mode, ready for predicting future vessel trajectories.
Essentially, this is the full recovery pipeline for NPINN inference, ensuring consistency with training preprocessing, scaling, and projection.
import torch
import joblib
import json
import pyproj
import numpy as np
from torch.utils.data import DataLoader, TensorDataset
import cartopy.crs as ccrs
import cartopy.feature as cfeature
# --- device ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# recover arrays (assumes you already loaded `data`)
train_X, train_Y = data['train_X'], data['train_Y']
val_X, val_Y = data['val_X'], data['val_Y']
test_X, test_Y = data['test_X'], data['test_Y']
# --- load scalers (must match what you saved during preprocessing) ---
feature_scaler = joblib.load("npinn_feature_scaler.pkl")
sog_scaler = joblib.load("npinn_sog_scaler.pkl")
delta_scaler = joblib.load("npinn_delta_scaler.pkl") # for dx,dy
# if you named them differently, change the filenames above
# --- load projection params and build proj ---
with open("proj_params.json", "r") as f:
proj_params = json.load(f)
proj = pyproj.Proj(**proj_params)
# --- rebuild & load model ---
input_size = train_X.shape[2]
input_steps = train_X.shape[1]
output_steps = train_Y.shape[1]
hidden_size = 64
num_layers = 2
best_model = Seq2SeqLSTM(
input_size=input_size,
hidden_size=hidden_size,
input_steps=input_steps,
output_steps=output_steps,
).to(device)
best_model.load_state_dict(torch.load("best_model_NPINN_absXY.pth", map_location=device))
best_model.eval()
This code provides essential postprocessing and geometric utilities for working with NPINN outputs and AIS trajectory data. The inverse_dxdy_np function is designed to convert scaled residuals (dx, dy) back into real-world units (meters) using a previously fitted scaler. It handles both 1D and 2D inputs, making it suitable for batch or single-step predictions. This is particularly useful for interpreting NPINN predictions in absolute physical units rather than in the normalized or scaled feature space, allowing for meaningful evaluation of the model’s accuracy in real-world terms. Using this, the code also computes the standard deviation of the residuals across the training dataset, providing a quantitative measure of typical displacement magnitudes along the x and y axes.
The code also includes geometry-related helpers to analyze trajectories in geospatial terms. The haversine function calculates the geodesic distance between longitude/latitude points in meters using the haversine formula, with safeguards for numerical stability and invalid inputs. Building on this, the trajectory_length function computes the total length of a vessel’s trajectory, summing distances between consecutive points while handling incomplete or non-finite data gracefully. Together, these utilities allow NPINN outputs to be mapped back to real-world coordinates, facilitate evaluation of trajectory smoothness and accuracy, and provide interpretable metrics for model validation and downstream analysis.
# ---------------- helper inverse/scaling utilities ----------------
def inverse_dxdy_np(dxdy_scaled, scaler):
"""
Invert scaled residuals (dx, dy) back to meters.
dxdy_scaled: (..., 2) scaled
scaler: RobustScaler/StandardScaler/MinMaxScaler fitted on residuals
"""
dxdy_scaled = np.asarray(dxdy_scaled, dtype=float)
if dxdy_scaled.ndim == 1:
dxdy_scaled = dxdy_scaled[None, :]
n_samples = dxdy_scaled.shape[0]
n_features = scaler.scale_.shape[0] if hasattr(scaler, "scale_") else scaler.center_.shape[0]
full_scaled = np.zeros((n_samples, n_features))
full_scaled[:, :2] = dxdy_scaled
if hasattr(scaler, "mean_"):
center = scaler.mean_
scale = scaler.scale_
elif hasattr(scaler, "center_"):
center = scaler.center_
scale = scaler.scale_
else:
raise ValueError(f"Scaler type {type(scaler)} not supported")
full = full_scaled * scale + center
return full[:, :2] if dxdy_scaled.shape[0] > 1 else full[0, :2]
# ---------------- compute residual std (meters) correctly ----------------
# train_Y contains scaled residuals (dx,dy) per your preprocessing.
all_resids_scaled = train_Y.reshape(-1, 2) # [sum_T, 2]
all_resids_m = inverse_dxdy_np(all_resids_scaled, delta_scaler) # meters
residual_std = np.std(all_resids_m, axis=0)
print("Computed residual_std (meters):", residual_std)
# ---------------- geometry helpers ----------------
def haversine(lon1, lat1, lon2, lat2):
"""Distance (m) between lon/lat points using haversine formula; handles arrays."""
R = 6371000.0
lon1 = np.asarray(lon1, dtype=float)
lat1 = np.asarray(lat1, dtype=float)
lon2 = np.asarray(lon2, dtype=float)
lat2 = np.asarray(lat2, dtype=float)
# if any entry is non-finite, result will be nan — we'll guard upstream
lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
dlon = lon2 - lon1
dlat = lat2 - lat1
a = np.sin(dlat/2.0)**2 + np.cos(lat1)*np.cos(lat2)*np.sin(dlon/2.0)**2
# numerical stability: clip inside sqrt
a = np.clip(a, 0.0, 1.0)
return 2 * R * np.arcsin(np.sqrt(a))
def trajectory_length(lons, lats):
lons = np.asarray(lons, dtype=float)
lats = np.asarray(lats, dtype=float)
if lons.size < 2:
return 0.0
# guard non-finite
if not (np.isfinite(lons).all() and np.isfinite(lats).all()):
return float("nan")
    return np.sum(haversine(lons[:-1], lats[:-1], lons[1:], lats[1:]))
This function evaluate_with_errors is designed to evaluate NPINN trajectory predictions in a geospatial context and optionally visualize them. It takes a trained model, a test DataLoader, coordinate projection, scalers, and device information. For each batch, it reconstructs the predicted trajectories from residuals (dx, dy), inverts the scaling back to meters, and converts them to absolute positions starting from the last observed point. Different decoding modes (cumsum, independent, stdonly) allow flexibility in how residuals are integrated into absolute trajectories, and it handles cases where the first residual is effectively a duplicate of the last input.
The evaluation computes per-timestep errors in meters using the haversine formula and tracks differences in trajectory lengths. All errors are summarized with mean and median statistics across the prediction horizon. When plot_map=True, the function generates separate maps for each trajectory, overlaying the true (green) and predicted (red dashed) paths, giving a clear visual inspection of the model’s performance. This approach is directly aligned with NPINN, as it evaluates predictions in physical units and emphasizes smooth, physically plausible trajectory reconstructions.
def evaluate_with_errors(
model,
test_dl,
proj,
feature_scaler,
delta_scaler,
device,
num_batches=None, # None = use full dataset
dup_tol: float = 1e-4,
outputs_are_residual_xy: bool = True,
residual_decode_mode: str = "cumsum", # "cumsum", "independent", "stdonly"
residual_std: np.ndarray = None,
plot_map: bool = True # <--- PLOT ALL TRAJECTORIES
):
"""
Evaluate model trajectory predictions and report errors in meters.
Optionally plot all trajectories on a map.
"""
model.eval()
errors_all = []
length_diffs = []
bad_count = 0
# store all trajectories
all_real = []
all_pred = []
with torch.no_grad():
batches = 0
for xb, yb in test_dl:
xb, yb = xb.to(device), yb.to(device)
pred = model(xb, teacher_forcing_ratio=0.0) # [B, T_out, F]
# first sample of the batch
input_seq = xb[0].cpu().numpy()
real_seq = yb[0].cpu().numpy()
pred_seq = pred[0].cpu().numpy()
# Extract dx, dy residuals
pred_resid_s = pred_seq[:, :2]
real_resid_s = real_seq[:, :2]
# Invert residuals to meters
pred_resid_m = inverse_dxdy_np(pred_resid_s, delta_scaler)
real_resid_m = inverse_dxdy_np(real_resid_s, delta_scaler)
# Use last observed absolute position as starting point (meters)
last_obs_xy_m = inverse_xy_only_np(input_seq[-1, :2], feature_scaler)
# Reconstruct absolute positions
if residual_decode_mode == "cumsum":
pred_xy_m = np.cumsum(pred_resid_m, axis=0) + last_obs_xy_m
real_xy_m = np.cumsum(real_resid_m, axis=0) + last_obs_xy_m
elif residual_decode_mode == "independent":
pred_xy_m = pred_resid_m + last_obs_xy_m
real_xy_m = real_resid_m + last_obs_xy_m
elif residual_decode_mode == "stdonly":
if residual_std is None:
raise ValueError("residual_std must be provided for 'stdonly' mode")
noise = np.random.randn(*pred_resid_m.shape) * residual_std
pred_xy_m = np.cumsum(noise, axis=0) + last_obs_xy_m
real_xy_m = np.cumsum(real_resid_m, axis=0) + last_obs_xy_m
else:
raise ValueError(f"Unknown residual_decode_mode: {residual_decode_mode}")
# Remove first target if duplicates last input
if np.allclose(real_resid_m[0], 0, atol=dup_tol):
real_xy_m = real_xy_m[1:]
pred_xy_m = pred_xy_m[1:]
# align horizon
min_len = min(len(pred_xy_m), len(real_xy_m))
if min_len == 0:
bad_count += 1
continue
pred_xy_m = pred_xy_m[:min_len]
real_xy_m = real_xy_m[:min_len]
# project to lon/lat
lon_real, lat_real = proj(real_xy_m[:,0], real_xy_m[:,1], inverse=True)
lon_pred, lat_pred = proj(pred_xy_m[:,0], pred_xy_m[:,1], inverse=True)
all_real.append((lon_real, lat_real))
all_pred.append((lon_pred, lat_pred))
# compute per-timestep errors
errors = haversine(lon_real, lat_real, lon_pred, lat_pred)
errors_all.append(errors)
# trajectory length diff
real_len = trajectory_length(lon_real, lat_real)
pred_len = trajectory_length(lon_pred, lat_pred)
length_diffs.append(abs(real_len - pred_len))
print(f"Trajectory length (true): {real_len:.2f} m | pred: {pred_len:.2f} m | diff: {abs(real_len - pred_len):.2f} m")
batches += 1
if num_batches is not None and batches >= num_batches:
break
# summary
if len(errors_all) == 0:
print("No valid samples evaluated. Bad count:", bad_count)
return
max_len = max(len(e) for e in errors_all)
errors_padded = np.full((len(errors_all), max_len), np.nan)
for i, e in enumerate(errors_all):
errors_padded[i, :len(e)] = e
mean_per_t = np.nanmean(errors_padded, axis=0)
print("\n=== Summary (meters) ===")
for t, v in enumerate(mean_per_t):
if not np.isnan(v):
print(f"t={t} mean error: {v:.2f} m")
print(f"Mean over horizon: {np.nanmean(errors_padded):.2f} m | Median: {np.nanmedian(errors_padded):.2f} m")
print(f"Mean trajectory length diff: {np.mean(length_diffs):.2f} m | Median: {np.median(length_diffs):.2f} m")
print("Bad / skipped samples:", bad_count)
# --- plot all trajectories ---
# --- plot each trajectory separately ---
if plot_map and len(all_real) > 0:
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature
for idx, ((lon_r, lat_r), (lon_p, lat_p)) in enumerate(zip(all_real, all_pred)):
fig = plt.figure(figsize=(10, 8))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.add_feature(cfeature.LAND)
ax.add_feature(cfeature.COASTLINE)
ax.add_feature(cfeature.BORDERS, linestyle=':')
ax.plot(lon_r, lat_r, color='green', linewidth=2, label="True")
ax.plot(lon_p, lat_p, color='red', linestyle='--', linewidth=2, label="Predicted")
ax.legend()
ax.set_title(f"Trajectory {idx+1}: True vs Predicted")
plt.show()
Trajectory 2
True length: 521.39 m
Predicted length: 508.66 m
Difference: 12.73 m
The predicted path is roughly parallel to the true path but slightly offset. The predicted trajectory underestimates the total path length by ~2.4%, which is a small but noticeable error. The smoothness of the red dashed line indicates that the model is generating physically plausible, consistent trajectories.
Trajectory 5
True length: 188.76 m
Predicted length: 206.01 m
Difference: 17.25 m
Here, the predicted trajectory slightly overestimates the total distance (~9%), again with a smooth but slightly offset path relative to the true trajectory. The model captures the overall direction but has some scaling error in step lengths or residuals.
Trajectory length (true): 521.39 m | pred: 508.66 m | diff: 12.73 m
Trajectory length (true): 188.76 m | pred: 206.01 m | diff: 17.25 m
t=0 mean error: 45.03 m. The first prediction step already has a ~45 m average discrepancy from the true position, which is common since the model starts accumulating error immediately after the last observed point.
t=1 mean error: 80.80 m. Error grows with the horizon, reflecting the cumulative effect of residual inaccuracies.
Mean over horizon: 62.92 m | Median: 61.72 m. On average, predictions are within ~60–63 m of the true trajectory at any given time step. The median being close to the mean suggests a fairly symmetric error distribution without extreme outliers.
Mean trajectory length difference: 11.82 m | Median: 12.73 m. Overall, the predicted trajectories' total lengths are very close to the true lengths, typically within ~12 m, which is less than 3% relative error for most trajectories.
The model captures trajectory trends well but shows small offsets in absolute positions.
Errors grow with horizon, which is typical for sequence prediction models using residuals.
Smoothness is maintained (no erratic jumps), indicating that the NPINN smoothness regularization is effective.
Overall, this is a solid performance for maritime AIS trajectory prediction, especially given the scale of trajectories (hundreds of meters).
Sequence to Sequence using Torch
Vessel trajectories are a type of geospatial temporal data derived from AIS (Automatic Identification System) signals. In this tutorial, we will go over the most common Machine Learning Library to process and model AIS trajectory data.
We will begin with PyTorch, a widely used deep learning library designed for building and training neural networks. Specifically, we will implement a recurrent neural network using LSTM (Long Short-Term Memory) to model sequential patterns in vessel movements.
We will utilize AISdb, a dedicated framework for querying, filtering, and preprocessing vessel trajectory data, to streamline data preparation for machine learning workflows.
First, let's import the libraries we'll be using throughout this tutorial. Our main tools will be NumPy and PyTorch, along with a few other standard libraries for data handling, model building, and visualization.
pandas, numpy: for handling tables and arrays
torch: for building and training deep learning models
sklearn: for data splitting and evaluation utilities
matplotlib: for visualizing model performance and outputs
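A minimal import block consistent with this list might look as follows; the complete imports appear in the code listing at the end of this tutorial:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split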
Assuming you have the database ready, you can replace the file path and establish a connection.
We have processed a database containing open-source AIS data from Marine Cadastre, covering January to March near Maine, United States.
To generate the query using AISdb, we use the DBQuery function. All you have to change here is DB_CONNECTION, START_DATE, END_DATE, and the bounding coordinates.
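For reference, a condensed version of the connection and query setup, taken from the full code listing at the end of this tutorial (the database path and dates are placeholders you should replace):
from datetime import datetime
import aisdb
from aisdb import DBConn
from aisdb.database.sqlfcn_callbacks import in_timerange
from aisdb.track_gen import TrackGen

DB_CONNECTION = "/home/sqlite_database_file.db"  # replace with your data path
START_DATE = datetime(2018, 8, 1, hour=0)
END_DATE = datetime(2018, 8, 4, hour=2)
XMIN, YMIN, XMAX, YMAX = -64.828126, 46.113933, -58.500001, 49.619290

dbconn = DBConn(dbpath=DB_CONNECTION)
qry = aisdb.DBQuery(dbconn=dbconn, callback=in_timerange,
                    start=START_DATE, end=END_DATE,
                    xmin=XMIN, xmax=XMAX, ymin=YMIN, ymax=YMAX)
tracks = TrackGen(qry.gen_qry(verbose=True), decimate=False)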
Sample coordinates look like this on the map:
We use pyproj for the metric projection of the latitude and longitude values. You can learn more in the pyproj documentation.
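A minimal sketch of the projection step (the UTM zone matches the sample region used later; the test point is arbitrary):
import pyproj

proj = pyproj.Proj(proj='utm', zone=20, ellps='WGS84')
x, y = proj(-61.7, 47.2)              # lon/lat -> meters
lon, lat = proj(x, y, inverse=True)   # meters -> lon/lat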
We follow these steps to preprocess the queried trajectory data:
Remove pings with respect to speed
Encode tracks given a distance threshold
Interpolate according to time (5 minutes here)
Group data by MMSI
Filter out MMSIs with fewer than 100 points
Convert latitude/longitude to x and y on a Cartesian plane using pyproj
Use the sine and cosine of COG, since it is a 360-degree value
Drop NaN values
Apply scaling to ensure values are normalized
The steps above are wrapped into the preprocess_aisdb_tracks function, shown in full in the code listing at the end of this tutorial.
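Its core AISdb cleaning chain, condensed here for orientation (the thresholds match the full definition; the shortened function name is only illustrative):
from datetime import timedelta
import aisdb

def clean_tracks(tracks_gen):
    # AISdb cleaning steps applied lazily on the track generator
    tracks_gen = aisdb.remove_pings_wrt_speed(tracks_gen, 0.1)
    tracks_gen = aisdb.encode_greatcircledistance(tracks_gen,
                                                  distance_threshold=50000,
                                                  minscore=1e-5,
                                                  speed_threshold=50)
    tracks_gen = aisdb.interp_time(tracks_gen, step=timedelta(minutes=5))
    return list(tracks_gen)  # grouping, filtering, projection, and scaling follow in the full function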
Next, we process all vessel tracks and split them into training and test sets, which are used for model training and evaluation.
For geospatial-temporal data, we typically use a sliding window approach, where each trajectory is segmented into input sequences of length X to predict the next Y steps. In this tutorial, we set X = 80 and Y = 2.
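A sliding-window helper along these lines does the segmentation; the create_sequences function in the full listing applies the same idea per MMSI, and the name make_windows here is illustrative:
import numpy as np
import torch

def make_windows(seq, input_size=80, output_size=2, step=1):
    """seq: [T, F] feature array for a single vessel, sorted by time."""
    X, Y = [], []
    for i in range(0, len(seq) - input_size - output_size + 1, step):
        X.append(seq[i:i + input_size])
        Y.append(seq[i + input_size:i + input_size + output_size])
    return (torch.tensor(np.stack(X), dtype=torch.float32),
            torch.tensor(np.stack(Y), dtype=torch.float32))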
We then save all this data as well as the scalers (we'll use this towards the end in evaluation)
Now we can load the data and start experimenting with it. The same data can also be reused across different models we want to explore.
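Reloading the saved tensors and wrapping them in DataLoaders mirrors the full listing at the end of this tutorial (batch size 64 is the value used there):
import torch
from torch.utils.data import DataLoader, TensorDataset

data = torch.load('datasets_seq2seq_cartesian5.pt')
train_dl = DataLoader(TensorDataset(data['train_X'], data['train_Y']), batch_size=64, shuffle=True)
val_dl = DataLoader(TensorDataset(data['val_X'], data['val_Y']), batch_size=64, shuffle=False)
test_dl = DataLoader(TensorDataset(data['test_X'], data['test_Y']), batch_size=64)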
We use an attention-based encoder–decoder LSTM model for trajectory prediction. The model has two layers and incorporates teacher forcing, a strategy where the decoder is occasionally fed the ground-truth values during training. This helps stabilize learning and prevents the model from drifting too far when making multi-step predictions.
Two auxiliary functions are introduced to augment the original MSE loss. These additional terms are designed to better preserve the physical consistency and structural shape of the predicted trajectory.
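To illustrate the smoothness term on its own, here is a minimal standalone version (the 0.05 and 0.5 weights match the xy_smoothness_loss used later); a constant-velocity trajectory incurs no acceleration penalty:
import torch

def xy_smoothness(seq_xy):
    """seq_xy: [B, T, 2] absolute x, y positions."""
    v = seq_xy[:, 1:, :] - seq_xy[:, :-1, :]   # first differences ~ velocity
    a = v[:, 1:, :] - v[:, :-1, :]             # second differences ~ acceleration
    return (v ** 2).mean() * 0.05 + (a ** 2).mean() * 0.5

# straight-line motion: the acceleration term vanishes
straight = torch.arange(10, dtype=torch.float32).view(1, 10, 1).repeat(1, 1, 2)
print(xy_smoothness(straight))  # tensor(0.0500)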
Once the model is defined, the next step is to train it on our prepared dataset. Training involves iteratively feeding input sequences to the model, comparing its predictions against the ground truth, and updating the weights to reduce the error.
In our case, the loss function combines:
a data term (based on weighted coordinate errors and auxiliary features), and
a smoothness penalty (to encourage realistic vessel movement and reduce jitter in the predicted trajectory).
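The weight applied to the smoothness penalty decays over training; a short sketch matching the schedule used in the training code:
def smooth_weight(epoch, smooth_w_init=1e-3):
    # linear decay over roughly the first 30 epochs, floored at 10% of the initial value
    return smooth_w_init * max(0.1, 1.0 - epoch / 30.0)

for epoch in (0, 15, 30, 45):
    print(epoch, smooth_weight(epoch))  # 1e-3, 5e-4, 1e-4, 1e-4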
Finally, now that our model has been trained, we use an evaluation function to check it on the held-out dataset we stored earlier, and plot the predictions on a map to see how the trajectories look. Note: we don't rely only on the accuracy or the training/testing numbers. The loss can look small in scaled units while the predicted coordinates are far off, which is why we also plot the predictions on a map.
There are also some debugging statements to verify whether the scaling is correct, report the distance error, and so on. With this model we obtain a metric distance error of only about 800 m.
Predicted vs True (lat/lon)
Summary (meters): t=0 mean error: 833.31 m | mean over horizon: 833.31 m | median: 833.31 m
One of the most potent applications enabled by LLMs is the development of question-answering chatbots. These are applications that can answer questions about specific source information. This tutorial demonstrates how to build a chatbot that can answer questions about AISViz documentation using a technique known as Retrieval-Augmented Generation (RAG).
A typical RAG application has two main components:
Indexing: a pipeline for scraping data from documentation and indexing it.
This usually happens offline.
Retrieval and generation: the actual RAG chain, which takes the user query at runtime and retrieves the relevant data from the index, then passes it to the model.
The most common complete sequence from raw docs to answer looks like:
Scrape: First, we need to scrape all documentation pages. This includes the GitBook documentation and related pages.
Split: Text splitters break large documents into smaller chunks. This is useful both for indexing data and passing it into a model, since large chunks are harder to search over and won't fit in a model's finite context window.
Store: We need somewhere to store and index our splits, so that they can be searched over later. This is done using the Chroma vector database and embeddings.
Retrieve: Given a user input, relevant splits are retrieved from Chroma using similarity search.
Generate: An LLM produces an answer in response to a system prompt that combines both the question and the retrieved context.
This tutorial requires these dependencies:
You'll need a Google Gemini API key (or a key for another LLM provider). Set it as an environment variable:
We need to select three main components:
LLM: We'll use Google's Gemini models through LangChain
Embeddings: Hugging Face SentenceTransformers for creating document embeddings
Vector Store: Chroma for storing and searching document embeddings
We can create a simple indexing pipeline and RAG chain to do this in about 100 lines of code.
Scraping Documentation
We need to first scrape all the AISViz documentation pages.
Splitting documents
Our scraped documents can be quite long, so we need to split them into smaller chunks. We'll use a simple text splitter that breaks documents into chunks of specified size with some overlap.
Storing documents with SentenceTransformers
Now we need to create embeddings for our chunks using Hugging Face SentenceTransformers and store them in the Chroma vector database.
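A compact sketch of this storing step, using toy strings in place of real scraped chunks (the full scraper, splitter, and store appear in the code section at the end of this tutorial):
from sentence_transformers import SentenceTransformer
import chromadb

# Toy "chunks" standing in for scraped and split documentation pages
chunks = ["AISdb stores AIS messages in monthly SQLite tables.",
          "TrackGen converts query rows into per-vessel tracks."]

embedder = SentenceTransformer('all-MiniLM-L6-v2')
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="aisviz_docs")
collection.add(
    embeddings=embedder.encode(chunks).tolist(),
    documents=chunks,
    ids=[f"chunk_{i}" for i in range(len(chunks))],
)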
This completes the Indexing portion of the pipeline. At this point, we have a queryable vector store containing the chunked contents of all the documentation with embeddings created by SentenceTransformers. Given a user question, we should be able to return the most relevant snippets.
Now let's write the actual application logic. We aim to create a simple function that takes a user question, searches for relevant documents using SentenceTransformers embeddings, and generates an answer using Google Gemini. A high-level breakdown:
API Key Setup – Load LLM (Gemini in this example) API key from environment or prompt user.
Model Initialization – Wrap Google Gemini (gemini-2.5-flash) with LangChain.
Embedding – Convert the user's question into a vector with SentenceTransformers.
Retrieval – Query Chroma to fetch the top-k most relevant document chunks.
Context Building – Assemble retrieved docs + metadata into a context string.
Prompting LLM – Combine system + user prompts and send to LLM.
Answer Generation – Return a concise response along with sources and context.
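For example, the embedding and retrieval steps above can be sketched on their own, reusing the Chroma collection and SentenceTransformer model built during indexing:
from sentence_transformers import SentenceTransformer
import chromadb

embeddings_model = SentenceTransformer('all-MiniLM-L6-v2')
collection = chromadb.PersistentClient(path="./chroma_db").get_or_create_collection(name="aisviz_docs")

def retrieve(question, k=4):
    """Return the k most similar documentation chunks and their metadata."""
    q_emb = embeddings_model.encode([question])
    results = collection.query(query_embeddings=q_emb.tolist(), n_results=k)
    return results['documents'][0], results['metadatas'][0]

docs, metas = retrieve("How do I create a database with AISdb?")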
Now let's create a simple web interface using Gradio so others can interact with our chatbot:
The full code can be found here:
import io
import json
import random
from collections import defaultdict
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd  # used below when building the preprocessed DataFrame
import pyproj
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import joblib
import aisdb
from aisdb import DBConn, database
from aisdb.database import sqlfcn_callbacks
from aisdb.database.sqlfcn_callbacks import in_timerange
from aisdb.database.dbconn import PostgresDBConn
from aisdb.track_gen import TrackGen
import cartopy.crs as ccrs
import cartopy.feature as cfeature
DB_CONNECTION = "/home/sqlite_database_file.db"  # replace with your data path
START_DATE = datetime(2018, 8, 1, hour=0) # starting at 12 midnight on 1st August 2018
END_DATE = datetime(2018, 8, 4, hour=2)  # Ending at 2:00 am on 4th August 2018
XMIN, YMIN, XMAX, YMAX = -64.828126, 46.113933, -58.500001, 49.619290  # Sample coordinates - x refers to longitude and y to latitude
# database connection
dbconn = DBConn(dbpath=DB_CONNECTION)
# Generating query to extract data between a given time range
qry = aisdb.DBQuery(dbconn=dbconn, callback=in_timerange,
start = START_DATE,
end=END_DATE,
xmin=XMIN, xmax=XMAX, ymin=YMIN, ymax=YMAX,
)
rowgen = qry.gen_qry(verbose=True) # generating query
tracks = TrackGen(rowgen, decimate=False) # Convert rows into tracks
# rowgen_ = qry.gen_qry(reaggregate_static=True, verbose=True) if you want metadata
# To avoid an overfitted model, let's choose data from a completely different date range for testing
TEST_START_DATE = datetime(2018, 8, 5, hour=0)
TEST_END_DATE = datetime(2018,8,6, hour= 8)
test_qry = aisdb.DBQuery(dbconn=dbconn, callback=in_timerange,
start = TEST_START_DATE,
end = TEST_END_DATE,
xmin=XMIN, xmax=XMAX, ymin=YMIN, ymax=YMAX
)
test_tracks = TrackGen(test_qry.gen_qry(verbose=True), decimate=False)
# --- Projection: Lat/Lon -> Cartesian (meters) ---
proj = pyproj.Proj(proj='utm', zone=20, ellps='WGS84')
def preprocess_aisdb_tracks(tracks_gen, proj, sog_scaler=None, feature_scaler=None, fit_scaler=False):
# Keep as generator for AISdb functions
tracks_gen = aisdb.remove_pings_wrt_speed(tracks_gen, 0.1)
tracks_gen = aisdb.encode_greatcircledistance(tracks_gen,
distance_threshold=50000,
minscore=1e-5,
speed_threshold=50)
tracks_gen = aisdb.interp_time(tracks_gen, step=timedelta(minutes=5))
# Convert generator to list AFTER all AISdb steps
tracks = list(tracks_gen)
# Group by MMSI
tracks_by_mmsi = defaultdict(list)
for track in tracks:
tracks_by_mmsi[track['mmsi']].append(track)
# Keep only MMSIs with tracks >= 100 points
valid_tracks = []
for mmsi, mmsi_tracks in tracks_by_mmsi.items():
if all(len(t['time']) >= 100 for t in mmsi_tracks):
valid_tracks.extend(mmsi_tracks)
# Convert to DataFrame
rows = []
for track in valid_tracks:
mmsi = track['mmsi']
sog = track.get('sog', [np.nan]*len(track['time']))
cog = track.get('cog', [np.nan]*len(track['time']))
for i in range(len(track['time'])):
x, y = proj(track['lon'][i], track['lat'][i])
cog_rad = np.radians(cog[i]) if cog[i] is not None else np.nan
rows.append({
'mmsi': mmsi,
'x': x,
'y': y,
'sog': sog[i],
'cog_sin': np.sin(cog_rad) if not np.isnan(cog_rad) else np.nan,
'cog_cos': np.cos(cog_rad) if not np.isnan(cog_rad) else np.nan,
'timestamp': pd.to_datetime(track['time'][i], errors='coerce')
})
df = pd.DataFrame(rows)
# Drop rows with NaNs
df = df.replace([np.inf, -np.inf], np.nan)
df = df.dropna(subset=['x', 'y', 'sog', 'cog_sin', 'cog_cos'])
# Scale features
    feature_cols = ['x', 'y', 'sog', 'cog_sin', 'cog_cos']  # feature columns to scale
if fit_scaler:
sog_scaler = RobustScaler()
df['sog_scaled'] = sog_scaler.fit_transform(df[['sog']])
feature_scaler = RobustScaler()
df[feature_cols] = feature_scaler.fit_transform(df[feature_cols])
else:
df['sog_scaled'] = sog_scaler.transform(df[['sog']])
df[feature_cols] = feature_scaler.transform(df[feature_cols])
    return df, sog_scaler, feature_scaler
train_qry = aisdb.DBQuery(dbconn=dbconn, callback=in_timerange,
start=START_DATE, end=END_DATE,
xmin=XMIN, xmax=XMAX, ymin=YMIN, ymax=YMAX)
train_gen = TrackGen(train_qry.gen_qry(verbose=True), decimate=False)
test_qry = aisdb.DBQuery(dbconn=dbconn, callback=in_timerange,
start=TEST_START_DATE, end=TEST_END_DATE,
xmin=XMIN, xmax=XMAX, ymin=YMIN, ymax=YMAX)
test_gen = TrackGen(test_qry.gen_qry(verbose=True), decimate=False)
# --- Preprocess ---
train_df, sog_scaler, feature_scaler = preprocess_aisdb_tracks(train_gen, proj, fit_scaler=True)
test_df, _, _ = preprocess_aisdb_tracks(test_gen, proj, sog_scaler=sog_scaler,
                                    feature_scaler=feature_scaler, fit_scaler=False)
def create_sequences(df, features, input_size=80, output_size=2, step=1):
X_list, Y_list = [], []
for mmsi in df['mmsi'].unique():
sub = df[df['mmsi']==mmsi].sort_values('timestamp')[features].to_numpy()
for i in range(0, len(sub)-input_size-output_size+1, step):
X_list.append(sub[i:i+input_size])
Y_list.append(sub[i+input_size:i+input_size+output_size])
return torch.tensor(X_list, dtype=torch.float32), torch.tensor(Y_list, dtype=torch.float32)
features = ['x','y','cog_sin','cog_cos','sog_scaled']
mmsis = train_df['mmsi'].unique()
train_mmsi, val_mmsi = train_test_split(mmsis, test_size=0.2, random_state=42, shuffle=True)
train_X, train_Y = create_sequences(train_df[train_df['mmsi'].isin(train_mmsi)], features)
val_X, val_Y = create_sequences(train_df[train_df['mmsi'].isin(val_mmsi)], features)
test_X, test_Y = create_sequences(test_df, features)
# --- save datasets ---
torch.save({
'train_X': train_X, 'train_Y': train_Y,
'val_X': val_X, 'val_Y': val_Y,
'test_X': test_X, 'test_Y': test_Y
}, 'datasets_seq2seq_cartesian5.pt')
# --- save scalers ---
joblib.dump(feature_scaler, "feature_scaler.pkl")
joblib.dump(sog_scaler, "sog_scaler.pkl")
# --- save projection parameters ---
proj_params = {'proj': 'utm', 'zone': 20, 'ellps': 'WGS84'}
with open("proj_params.json", "w") as f:
json.dump(proj_params, f)
data = torch.load('datasets_seq2seq_cartesian5.pt')
# scalers
feature_scaler = joblib.load("feature_scaler.pkl")
sog_scaler = joblib.load("sog_scaler.pkl")
# projection
with open("proj_params.json", "r") as f:
proj_params = json.load(f)
proj = pyproj.Proj(**proj_params)
train_ds = TensorDataset(data['train_X'], data['train_Y'])
val_ds = TensorDataset(data['val_X'], data['val_Y'])
test_ds = TensorDataset(data['test_X'], data['test_Y'])
batch_size = 64
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
val_dl = DataLoader(val_ds, batch_size=batch_size, shuffle=False)
test_dl = DataLoader(test_ds, batch_size=batch_size)
class Seq2SeqLSTM(nn.Module):
def __init__(self, input_size, hidden_size, input_steps, output_steps):
super().__init__()
self.input_steps = input_steps
self.output_steps = output_steps
# Encoder
self.encoder = nn.LSTM(input_size, hidden_size, num_layers=2, dropout=0.3, batch_first=True)
# Decoder
self.decoder = nn.LSTMCell(input_size, hidden_size)
self.attn = nn.Linear(hidden_size + hidden_size, input_steps) # basic Bahdanau-ish
self.attn_combine = nn.Linear(hidden_size + input_size, input_size)
# Output projection (predict residual proposal for all features)
self.output_layer = nn.Sequential(
nn.Linear(hidden_size, hidden_size // 2),
nn.ReLU(),
nn.Linear(hidden_size // 2, input_size)
)
def forward(self, x, target_seq=None, teacher_forcing_ratio=0.5):
"""
x: [B, T_in, F]
target_seq: [B, T_out, F] (optional, for teacher forcing)
returns: [B, T_out, F] predicted absolute features (with x,y = last_obs + residuals)
"""
batch_size = x.size(0)
encoder_outputs, (h, c) = self.encoder(x) # encoder_outputs: [B, T_in, H]
h, c = h[-1], c[-1] # take final layer states -> [B, H]
last_obs = x[:, -1, :] # [B, F] last observed features (scaled)
last_xy = last_obs[:, :2] # [B, 2]
decoder_input = last_obs # start input is last observed feature vector
outputs = []
for t in range(self.output_steps):
# attention weights over encoder outputs: combine h and c to get context
attn_weights = torch.softmax(self.attn(torch.cat((h, c), dim=1)), dim=1) # [B, T_in]
context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs).squeeze(1) # [B, H]
dec_in = torch.cat((decoder_input, context), dim=1) # [B, F+H]
dec_in = self.attn_combine(dec_in) # [B, F]
h, c = self.decoder(dec_in, (h, c)) # h: [B, H]
residual = self.output_layer(h) # [B, F] -- residual proposal for all features
# Apply residual only to x,y (first two dims); other features are predicted directly
out_xy = residual[:, :2] + last_xy # absolute x,y = last_xy + residual
out_rest = residual[:, 2:] # predicted auxiliary features
out = torch.cat([out_xy, out_rest], dim=1) # [B, F]
outputs.append(out.unsqueeze(1)) # accumulate
# Teacher forcing or autoregressive input
if self.training and (target_seq is not None) and (t < target_seq.size(1)) and (random.random() < teacher_forcing_ratio):
decoder_input = target_seq[:, t, :] # teacher forcing uses ground truth step (scaled)
# update last_xy to ground truth (so if teacher forced, we align residual base)
last_xy = decoder_input[:, :2]
else:
decoder_input = out
last_xy = out[:, :2] # update last_xy to predicted absolute xy
return torch.cat(outputs, dim=1) # [B, T_out, F]
def weighted_coord_loss(pred, target, coord_weight=5.0, reduction='mean'):
"""
Give higher weight to coordinate errors (first two dims).
pred/target: [B, T, F]
"""
coord_loss = F.smooth_l1_loss(pred[..., :2], target[..., :2], reduction=reduction)
aux_loss = F.smooth_l1_loss(pred[..., 2:], target[..., 2:], reduction=reduction)
return coord_weight * coord_loss + aux_loss
def xy_smoothness_loss(seq_full):
"""
Smoothness penalty only on x,y channels of a full sequence.
seq_full: [B, T, F] (F >= 2)
Returns scalar L2 of second differences on x,y.
"""
xy = seq_full[..., :2] # [B, T, 2]
v = xy[:, 1:, :] - xy[:, :-1, :] # velocity [B, T-1, 2]
a = v[:, 1:, :] - v[:, :-1, :] # acceleration [B, T-2, 2]
return (v**2).mean() * 0.05 + (a**2).mean() * 0.5
def train_model(model, loader, val_dl, optimizer, device, epochs=50,
coord_weight=5.0, smooth_w_init=1e-3):
best_loss = float('inf')
best_state = None
for epoch in range(epochs):
model.train()
total_loss = total_data_loss = total_smooth_loss = 0.0
for batch_x, batch_y in loader:
batch_x = batch_x.to(device) # [B, T_in, F]
batch_y = batch_y.to(device) # [B, T_out, F]
# --- Convert targets to residuals ---
# residual = y[t] - y[t-1], relative to last obs (x[:,-1])
y_start = batch_x[:, -1:, :2] # last observed xy [B,1,2]
residual_y = batch_y.clone()
residual_y[..., :2] = batch_y[..., :2] - torch.cat([y_start, batch_y[:, :-1, :2]], dim=1)
optimizer.zero_grad()
# Model predicts residuals
pred_residuals = model(batch_x, target_seq=residual_y, teacher_forcing_ratio=0.5) # [B, T_out, F]
# Data loss on residuals (only xy are residualized, other features can stay as-is)
loss_data = weighted_coord_loss(pred_residuals, residual_y, coord_weight=coord_weight)
# Smoothness: reconstruct absolute xy sequence first
pred_xy = torch.cumsum(pred_residuals[..., :2], dim=1) + y_start
full_seq = torch.cat([batch_x[..., :2], pred_xy], dim=1) # [B, T_in+T_out, 2]
loss_smooth = xy_smoothness_loss(full_seq)
# Decay smooth weight slightly over epochs
smooth_weight = smooth_w_init * max(0.1, 1.0 - epoch / 30.0)
loss = loss_data + smooth_weight * loss_smooth
# Stability guard
if torch.isnan(loss) or torch.isinf(loss):
print("Skipping batch due to NaN/inf loss.")
continue
loss.backward()
optimizer.step()
total_loss += loss.item()
total_data_loss += loss_data.item()
total_smooth_loss += loss_smooth.item()
avg_loss = total_loss / len(loader)
print(f"Epoch {epoch+1:02d} | Total: {avg_loss:.6f} | Data: {total_data_loss/len(loader):.6f} | Smooth: {total_smooth_loss/len(loader):.6f}")
# Validation
model.eval()
val_loss = 0.0
with torch.no_grad():
for xb, yb in val_dl:
xb, yb = xb.to(device), yb.to(device)
# Compute residuals for validation
y_start = xb[:, -1:, :2]
residual_y = yb.clone()
residual_y[..., :2] = yb[..., :2] - torch.cat([y_start, yb[:, :-1, :2]], dim=1)
pred_residuals = model(xb, target_seq=residual_y, teacher_forcing_ratio=0.0)
data_loss = weighted_coord_loss(pred_residuals, residual_y, coord_weight=coord_weight)
pred_xy = torch.cumsum(pred_residuals[..., :2], dim=1) + y_start
full_seq = torch.cat([xb[..., :2], pred_xy], dim=1)
loss_smooth = xy_smoothness_loss(full_seq)
val_loss += (data_loss + smooth_weight * loss_smooth).item()
val_loss /= len(val_dl)
print(f" Val Loss: {val_loss:.6f}")
if val_loss < best_loss:
best_loss = val_loss
best_state = model.state_dict()
if best_state is not None:
torch.save(best_state, "best_model_seq2seq_residual_xy_08302.pth")
print("✅ Best model saved")
# Setting seeds is extremely important for research purposes to make sure the results are reproducible
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)
torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train_X = data['train_X']
input_steps = 80
output_steps = 2
# Collapse batch and time dims to compute global min for each feature
flat_train_X = train_X.view(-1, train_X.shape[-1])  # shape: [N*T, F]
input_size = 5
hidden_size = 64
model = Seq2SeqLSTM(input_size, hidden_size, input_steps, output_steps).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.HuberLoss()
train_model(model, train_dl,val_dl, optimizer, device)
def evaluate_with_errors(
model,
test_dl,
proj,
feature_scaler,
device,
num_batches=1,
outputs_are_residual_xy: bool = True,
dup_tol: float = 1e-4,
residual_decode_mode: str = "cumsum", # "cumsum", "independent", "stdonly"
residual_std: np.ndarray = None, # needed if residual_decode_mode="stdonly"
):
"""
Evaluates model and prints/plots errors.
Parameters
----------
outputs_are_residual_xy : bool
If True, model outputs residuals (scaled). Else, absolute coords (scaled).
residual_decode_mode : str
- "cumsum": cumulative sum of residuals (default).
- "independent": treat each residual as absolute offset from last obs.
- "stdonly": multiply by provided residual_std instead of scaler.scale_.
residual_std : np.ndarray
Std dev for decoding (meters), used only if residual_decode_mode="stdonly".
"""
model.eval()
errors_all = []
def inverse_xy_only(xy_scaled, scaler):
"""Inverse transform x,y only (works with StandardScaler/RobustScaler)."""
if hasattr(scaler, "mean_"):
center = scaler.mean_[:2]
scale = scaler.scale_[:2]
elif hasattr(scaler, "center_"):
center = scaler.center_[:2]
scale = scaler.scale_[:2]
else:
raise ValueError("Scaler type not supported for XY inversion")
return xy_scaled * scale + center
def haversine(lon1, lat1, lon2, lat2):
"""Distance (m) between two lon/lat points using haversine formula."""
R = 6371000.0
lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
dlon = lon2 - lon1
dlat = lat2 - lat1
a = np.sin(dlat/2.0)**2 + np.cos(lat1)*np.cos(lat2)*np.sin(dlon/2.0)**2
return 2 * R * np.arcsin(np.sqrt(a))
with torch.no_grad():
batches = 0
for xb, yb in test_dl:
xb = xb.to(device); yb = yb.to(device)
pred = model(xb, teacher_forcing_ratio=0.0) # [B, T_out, F]
# first sample for visualization/diagnostics
input_seq = xb[0].cpu().numpy()
real_seq = yb[0].cpu().numpy()
pred_seq = pred[0].cpu().numpy()
input_xy_s = input_seq[:, :2]
real_xy_s = real_seq[:, :2]
pred_xy_s = pred_seq[:, :2]
# reconstruct predicted absolute trajectory
if outputs_are_residual_xy:
last_obs_xy_s = input_xy_s[-1]
last_obs_xy_m = inverse_xy_only(last_obs_xy_s, feature_scaler)
# choose decoding mode
if residual_decode_mode == "stdonly":
if residual_std is None:
raise ValueError("residual_std must be provided for 'stdonly' mode")
pred_resid_m = pred_xy_s * residual_std
pred_xy_m = np.cumsum(pred_resid_m, axis=0) + last_obs_xy_m
elif residual_decode_mode == "independent":
if hasattr(feature_scaler, "scale_"):
scale = feature_scaler.scale_[:2]
else:
raise ValueError("Scaler missing scale_ for residual decoding")
pred_resid_m = pred_xy_s * scale
pred_xy_m = pred_resid_m + last_obs_xy_m # no cumsum
else: # default: "cumsum"
if hasattr(feature_scaler, "scale_"):
scale = feature_scaler.scale_[:2]
else:
raise ValueError("Scaler missing scale_ for residual decoding")
pred_resid_m = pred_xy_s * scale
pred_xy_m = np.cumsum(pred_resid_m, axis=0) + last_obs_xy_m
else:
pred_xy_m = inverse_xy_only(pred_xy_s, feature_scaler)
# Handle duplicate first target
trimmed_real_xy_s = real_xy_s.copy()
dropped_duplicate = False
if trimmed_real_xy_s.shape[0] >= 1:
if np.allclose(trimmed_real_xy_s[0], input_xy_s[-1], atol=dup_tol):
trimmed_real_xy_s = trimmed_real_xy_s[1:]
dropped_duplicate = True
# Align lengths
len_pred = pred_xy_m.shape[0]
len_real = trimmed_real_xy_s.shape[0]
min_len = min(len_pred, len_real)
if min_len == 0:
print("No overlapping horizon — skipping batch.")
batches += 1
if batches >= num_batches:
break
else:
continue
pred_xy_m = pred_xy_m[:min_len]
real_xy_m = inverse_xy_only(trimmed_real_xy_s[:min_len], feature_scaler)
input_xy_m = inverse_xy_only(input_xy_s, feature_scaler)
# debug
print("\nDEBUG (scaled -> unscaled):")
print("last_obs_scaled:", input_xy_s[-1])
print("real_first_scaled:", real_xy_s[0])
print("pred_first_scaled_delta:", pred_xy_s[0])
print("dropped_duplicate_target:", dropped_duplicate)
print("pred length:", pred_xy_m.shape[0], "real length:", real_xy_m.shape[0])
print("last_obs_unscaled:", input_xy_m[-1])
print("pred_first_unscaled:", pred_xy_m[0])
# lon/lat conversion
lon_in, lat_in = proj(input_xy_m[:,0], input_xy_m[:,1], inverse=True)
lon_real, lat_real= proj(real_xy_m[:,0], real_xy_m[:,1], inverse=True)
lon_pred, lat_pred= proj(pred_xy_m[:,0], pred_xy_m[:,1], inverse=True)
# comparison table
print("\n=== Predicted vs True (lat/lon) ===")
print(f"{'t':>3} | {'lon_true':>9} | {'lat_true':>9} | {'lon_pred':>9} | {'lat_pred':>9} | {'err_m':>9}")
errors = []
for t in range(len(lon_real)):
err_m = haversine(lon_real[t], lat_real[t], lon_pred[t], lat_pred[t])
errors.append(err_m)
print(f"{t:3d} | {lon_real[t]:9.5f} | {lat_real[t]:9.5f} | {lon_pred[t]:9.5f} | {lat_pred[t]:9.5f} | {err_m:9.2f}")
errors_all.append(errors)
# plot
fig = plt.figure(figsize=(8, 6))
ax = plt.axes(projection=ccrs.PlateCarree())
# set extent dynamically around trajectory
all_lons = np.concatenate([lon_in, lon_real, lon_pred])
all_lats = np.concatenate([lat_in, lat_real, lat_pred])
lon_min, lon_max = all_lons.min() - 0.01, all_lons.max() + 0.01
lat_min, lat_max = all_lats.min() - 0.01, all_lats.max() + 0.01
ax.set_extent([lon_min, lon_max, lat_min, lat_max], crs=ccrs.PlateCarree())
# add map features
ax.add_feature(cfeature.COASTLINE)
ax.add_feature(cfeature.LAND, facecolor="lightgray")
ax.add_feature(cfeature.OCEAN, facecolor="lightblue")
# plot trajectories
ax.plot(lon_in, lat_in, "o-", label="history", transform=ccrs.PlateCarree(),
markersize=6, linewidth=2)
ax.plot(lon_real, lat_real, "o-", label="true", transform=ccrs.PlateCarree(),
markersize=6, linewidth=2)
ax.plot(lon_pred, lat_pred, "x--", label="pred", transform=ccrs.PlateCarree(),
markersize=8, linewidth=2)
ax.legend()
plt.show()
batches += 1
if batches >= num_batches:
break
# summary
if errors_all:
errors_all = np.array(errors_all)
mean_per_t = errors_all.mean(axis=0)
print("\n=== Summary (meters) ===")
for t, v in enumerate(mean_per_t):
print(f"t={t} mean error: {v:.2f} m")
print(f"mean over horizon: {errors_all.mean():.2f} m, median: {np.median(errors_all):.2f} m")
# --- device ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
data = torch.load("datasets_seq2seq_cartesian5.pt")
train_X, train_Y = data['train_X'], data['train_Y']
val_X, val_Y = data['val_X'], data['val_Y']
test_X, test_Y = data['test_X'], data['test_Y']
# --- load scalers ---
feature_scaler = joblib.load("feature_scaler.pkl") # fitted on x,y,sog,cog_sin,cog_cos
sog_scaler = joblib.load("sog_scaler.pkl")
# --- load projection ---
with open("proj_params.json", "r") as f:
proj_params = json.load(f)
proj = pyproj.Proj(**proj_params)
# --- rebuild & load model ---
input_size = train_X.shape[2] # features per timestep
input_steps = train_X.shape[1] # seq length in
output_steps = train_Y.shape[1] # seq length out
hidden_size = 64
num_layers = 2
best_model = Seq2SeqLSTM(
input_size=5,
hidden_size=64,
input_steps=80,
output_steps=2,
).to(device)
best_model.load_state_dict(torch.load(
"best_model_seq2seq_residual_xy_08302.pth", map_location=device
))
best_model.eval()
import numpy as np
# helper: inverse xy using your loaded feature_scaler (same as in evaluate)
def inverse_xy_only_np(xy_scaled, scaler):
if hasattr(scaler, "mean_"):
center = scaler.mean_[:2]
scale = scaler.scale_[:2]
elif hasattr(scaler, "center_"):
center = scaler.center_[:2]
scale = scaler.scale_[:2]
else:
raise ValueError("Scaler type not supported for XY inversion")
return xy_scaled * scale + center
N = train_X.shape[0]
all_resids = []
for i in range(N):
last_obs_s = train_X[i, -1, :2] # scaled
true_s = train_Y[i, :, :2] # scaled (T_out,2)
last_obs_m = inverse_xy_only_np(last_obs_s, feature_scaler)
true_m = inverse_xy_only_np(true_s, feature_scaler)
if true_m.shape[0] == 0:
continue
resid0 = true_m[0] - last_obs_m
if true_m.shape[0] > 1:
rest = np.diff(true_m, axis=0)
resids = np.vstack([resid0[None, :], rest]) # shape [T_out, 2]
else:
resids = resid0[None, :]
all_resids.append(resids)
# stack to compute std per axis across all time steps and samples
all_resids_flat = np.vstack(all_resids) # [sum_T, 2]
residual_std = np.std(all_resids_flat, axis=0) # [std_dx_m, std_dy_m]
print("Computed residual_std (meters):", residual_std)
evaluate_with_errors(
best_model, test_dl, proj, feature_scaler, device,
num_batches=1,
outputs_are_residual_xy=True,
residual_decode_mode="stdonly",
residual_std=residual_std
)
0 | -61.69744 | -61.70585 | 43.22816 | 43.22385 | 833.31 m
pip install langchain langchain-community langchain-google-genai chromadb sentence-transformers gradio beautifulsoup4 requests
export GOOGLE_API_KEY="your-api-key-here"
import getpass
import os
from sentence_transformers import SentenceTransformer
from langchain.chat_models import init_chat_model
if not os.environ.get("GOOGLE_API_KEY"):
os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter API key for Google Gemini: ")
model = init_chat_model("gemini-2.5-flash", model_provider="google_genai")
embeddings_model = SentenceTransformer('all-MiniLM-L6-v2')
import os
import getpass
from bs4 import BeautifulSoup
import requests
from langchain.chat_models import init_chat_model
from langchain_community.vectorstores import Chroma
from langchain.docstore.document import Document
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from sentence_transformers import SentenceTransformer
from langchain.embeddings import HuggingFaceEmbeddings
# ----------------------
# Setup API keys
# ----------------------
if not os.environ.get("GOOGLE_API_KEY"):
os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter API key for Google Gemini: ")
# ----------------------
# Initialize models
# ----------------------
# Use HuggingFace wrapper so LangChain understands SentenceTransformer
embeddings_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# Gemini model (via LangChain)
llm = init_chat_model("gemini-2.5-flash", model_provider="google_genai")
# ----------------------
# Initialize Chroma vector store
# ----------------------
persist_directory = "./chroma_db"
vectorstore = Chroma(
collection_name="aisviz_docs",
embedding_function=embeddings_model,
persist_directory=persist_directory,
)
# ----------------------
# Scraper (example)
# ----------------------
def scrape_aisviz_docs(base_url="https://aisviz.example.com/docs"):
"""
Scrape AISViz docs and return list of LangChain Documents.
Adjust selectors based on actual site structure.
"""
response = requests.get(base_url)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")
docs = []
for section in soup.find_all("div", class_="doc-section"):
text = section.get_text(strip=True)
docs.append(Document(page_content=text, metadata={"source": base_url}))
return docs
# Example: index docs once
def build_index():
docs = scrape_aisviz_docs()
vectorstore.add_documents(docs)
vectorstore.persist()
# ----------------------
# Retrieval QA
# ----------------------
QA_PROMPT = PromptTemplate(
input_variables=["context", "question"],
template="""
You are an assistant for question-answering tasks about AISViz documentation.
Use the following context to answer the user’s question.
If the answer is not contained in the context, say you don't know.
Context:
{context}
Question:
{question}
Answer concisely:
""",
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
retriever=retriever,
chain_type="stuff",
chain_type_kwargs={"prompt": QA_PROMPT},
return_source_documents=True,
)
# ----------------------
# Usage
# ----------------------
def answer_question(question: str):
result = qa_chain({"query": question})
return result["result"], result["source_documents"]
if __name__ == "__main__":
# Uncomment below line if first time indexing
# build_index()
ans, sources = answer_question("What is AISViz used for?")
print("Answer:", ans)
print("Sources:", [s.metadata["source"] for s in sources])
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import time
def scrape_aisviz_docs():
"""Scrape all AISViz documentation pages"""
base_urls = [
"https://aisviz.gitbook.io/documentation",
"https://aisviz.cs.dal.ca",
]
scraped_content = []
visited_urls = set()
for base_url in base_urls:
# Get the main page
response = requests.get(base_url)
soup = BeautifulSoup(response.content, 'html.parser')
# Extract main content
main_content = soup.find('div', class_='page-content') or soup.find('main')
if main_content:
text = main_content.get_text(strip=True)
scraped_content.append({
'url': base_url,
'title': soup.find('title').get_text() if soup.find('title') else 'AISViz Documentation',
'content': text
})
# Find all links to other documentation pages
links = soup.find_all('a', href=True)
for link in links:
href = link['href']
full_url = urljoin(base_url, href)
# Only follow links within the same domain
if urlparse(full_url).netloc == urlparse(base_url).netloc and full_url not in visited_urls:
visited_urls.add(full_url)
try:
time.sleep(0.5) # Be respectful to the server
page_response = requests.get(full_url)
page_soup = BeautifulSoup(page_response.content, 'html.parser')
page_content = page_soup.find('div', class_='page-content') or page_soup.find('main')
if page_content:
text = page_content.get_text(strip=True)
scraped_content.append({
'url': full_url,
'title': page_soup.find('title').get_text() if page_soup.find('title') else 'AISViz Page',
'content': text
})
except Exception as e:
print(f"Error scraping {full_url}: {e}")
continue
return scraped_content
# Scrape all documentation
docs_data = scrape_aisviz_docs()
print(f"Scraped {len(docs_data)} pages from AISViz documentation")def split_text(text, chunk_size=1000, chunk_overlap=200):
"""Split text into overlapping chunks"""
chunks = []
start = 0
while start < len(text):
# Find the end of this chunk
end = start + chunk_size
# If this isn't the last chunk, try to break at a sentence or word boundary
if end < len(text):
# Look for sentence boundary
last_period = text.rfind('.', start, end)
last_newline = text.rfind('\n', start, end)
last_space = text.rfind(' ', start, end)
# Use the best boundary we can find
if last_period > start + chunk_size // 2:
end = last_period + 1
elif last_newline > start + chunk_size // 2:
end = last_newline
elif last_space > start + chunk_size // 2:
end = last_space
chunk = text[start:end].strip()
if chunk:
chunks.append(chunk)
# Move start position for next chunk (with overlap)
start = end - chunk_overlap if end < len(text) else end
return chunks
# Split all documents into chunks
all_chunks = []
chunk_metadata = []
for doc_data in docs_data:
chunks = split_text(doc_data['content'])
for i, chunk in enumerate(chunks):
all_chunks.append(chunk)
chunk_metadata.append({
'source': doc_data['url'],
'title': doc_data['title'],
'chunk_id': i
})
print(f"Split documentation into {len(all_chunks)} chunks")from sentence_transformers import SentenceTransformer
import chromadb
# Initialize SentenceTransformer model
embeddings_model = SentenceTransformer('all-MiniLM-L6-v2')
# Create Chroma client and collection
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection(name="aisviz_docs")
# Create embeddings for all chunks (this may take a few minutes)
print("Creating embeddings for all chunks...")
chunk_embeddings = embeddings_model.encode(all_chunks, show_progress_bar=True)
# Prepare data for Chroma
chunk_ids = [f"chunk_{i}" for i in range(len(all_chunks))]
metadatas = chunk_metadata
# Add everything to Chroma collection
collection.add(
embeddings=chunk_embeddings.tolist(),
documents=all_chunks,
metadatas=metadatas,
ids=chunk_ids
)
print("Documents indexed and stored in Chroma database")import getpass
import os
# ----------------------
# 1. API Key Setup
# ----------------------
# Check if GOOGLE_API_KEY is already set in the environment.
# If not, securely prompt the user to enter it (won’t show in terminal).
if not os.environ.get("GOOGLE_API_KEY"):
os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter API key for Google Gemini: ")
from langchain.chat_models import init_chat_model
# ----------------------
# 2. Initialize Gemini Model (via LangChain)
# ----------------------
# We create a chat model wrapper around Gemini.
# "gemini-2.5-flash" is a lightweight, fast generation model.
# LangChain handles the API calls under the hood.
model = init_chat_model("gemini-2.5-flash", model_provider="google_genai")
# ----------------------
# 3. Define RAG Function
# ----------------------
def answer_question(question, k=4):
"""
Answer a question using Retrieval-Augmented Generation (RAG)
over AISViz documentation stored in ChromaDB.
Args:
question (str): User question.
k (int): Number of top documents to retrieve.
Returns:
dict with keys: 'answer', 'sources', 'context'.
"""
# Step 1: Create an embedding for the question
# Encode the question into a dense vector using SentenceTransformers.
# This embedding will be used to search for semantically similar docs.
question_embedding = embeddings_model.encode([question])
# Step 2: Retrieve relevant documents from Chroma
# Query the Chroma vector store for the top-k most relevant docs.
results = collection.query(
query_embeddings=question_embedding.tolist(),
n_results=k
)
# Step 3: Build context from retrieved documents
# Extract both content and metadata for each retrieved doc.
retrieved_docs = results['documents'][0]
retrieved_metadata = results['metadatas'][0]
# Join all documents into one context string.
# Each doc is prefixed with its source/title for attribution.
context = "\n\n".join([
f"Source: {meta.get('title', 'AISViz Documentation')}\n{doc}"
for doc, meta in zip(retrieved_docs, retrieved_metadata)
])
# Step 4: Construct prompts
# System prompt: instructs Gemini to behave like a doc-based assistant.
system_prompt = """You are an assistant for question-answering tasks about AISViz documentation and maritime vessel tracking.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer based on the context, just say that you don't know.
Keep the answer concise and helpful.
Always mention which sources you're referencing when possible."""
# User prompt: combines the retrieved context with the actual user question.
user_prompt = f"""Context:
{context}
Question: {question}
Answer:"""
# Step 5: Generate answer using Gemini
# Combine system + user prompts and send to Gemini.
full_prompt = f"{system_prompt}\n\n{user_prompt}"
try:
        response = model.invoke(full_prompt)
        answer = response.content  # Extract plain text from the LangChain chat model response
except Exception as e:
# If Gemini API call fails, catch error and return message
answer = f"Sorry, I encountered an error generating the response: {str(e)}"
# Return a structured result: answer text, sources, and raw context.
return {
'answer': answer,
'sources': [meta.get('source', '') for meta in retrieved_metadata],
'context': context
}
# ----------------------
# 4. Test the Function
# ----------------------
result = answer_question("What is AISViz?")
print("Answer:", result['answer'])
print("Sources:", result['sources'])
import gradio as gr
def chatbot_interface(message, history):
"""Interface function for Gradio chatbot"""
try:
result = answer_question(message)
response = result['answer']
# Add source information to the response
if result['sources']:
unique_sources = list(set(result['sources'][:3]))
sources_text = "\n\n**Sources:**\n" + "\n".join([f"- {source}" for source in unique_sources])
response += sources_text
return response
except Exception as e:
return f"Sorry, I encountered an error: {str(e)}"
# Create Gradio interface
demo = gr.ChatInterface(
fn=chatbot_interface,
title="🚢 AISViz Documentation Chatbot",
description="Ask questions about AISViz, AISdb, and maritime vessel tracking!",
examples=[
"What is AISViz?",
"How do I get started with AISdb?",
"What kind of data does AISViz work with?",
"How can I analyze vessel trajectories?"
],
retry_btn=None,
undo_btn="Delete Previous",
clear_btn="Clear History",
)
if __name__ == "__main__":
demo.launch()