Data Cleaning

Last updated 7 months ago

A common source of noise in AIS data is multiple vessels broadcasting under the same identifier at the same time. AISdb incorporates data cleaning techniques to remove this noise from vessel track data. For more details, see the API reference for encode_greatcircledistance(). Its main steps are:

Denoising with Encoder: The function checks the approximate distance between consecutive positions of each vessel. It splits the track wherever the vessel could not reasonably have covered that distance along the most direct path, for example where the implied speed exceeds 50 knots.

Distance and Speed Thresholds: Distance and speed thresholds limit the maximum distance or time between messages that can be considered continuous.
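
The threshold check described above can be illustrated with a small standalone sketch. It is not AISdb's implementation: the helper names `haversine_m` and `is_continuous`, the `(lon, lat, epoch_seconds)` tuple layout, and the mean Earth radius constant are assumptions made for illustration. The default thresholds mirror the values used later on this page (20 km and 50 knots).

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (approximation)
MPS_TO_KNOTS = 1.943844     # one meter per second expressed in knots


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_continuous(p1, p2, distance_threshold=20000, speed_threshold=50):
    """Hypothetical check: can p1 and p2 belong to the same track segment?

    Each point is a (lon, lat, epoch_seconds) tuple. The pair is rejected
    when the distance exceeds distance_threshold (meters) or the implied
    speed exceeds speed_threshold (knots).
    """
    dist = haversine_m(p1[0], p1[1], p2[0], p2[1])
    dt = abs(p2[2] - p1[2])
    if dist == 0:
        speed = 0.0
    elif dt == 0:
        speed = math.inf  # two different places at the same instant
    else:
        speed = dist / dt * MPS_TO_KNOTS
    return dist <= distance_threshold and speed <= speed_threshold


a = (-63.60, 44.6, 0)
print(is_continuous(a, (-63.59, 44.6, 600)))  # short hop at a few knots
print(is_continuous(a, (-63.59, 44.6, 10)))   # same hop in 10 s: too fast
print(is_continuous(a, (-63.00, 44.6, 600)))  # roughly 47 km away: too far
```

Checking distance and speed separately means that a slow drift across a very long gap and a short but impossibly fast jump are both rejected.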

Scoring and Segment Concatenation: A score is computed for each position delta, with sequential messages that are close together and separated by shorter intervals given a higher score. This score is calculated by dividing the Haversine distance by elapsed time. Any delta whose score does not reach the minimum threshold is treated as the start of a new segment. New segments are compared to the end of existing segments with the same vessel identifier; if the score exceeds the minimum, they are concatenated. If multiple segments meet the minimum score, the new segment is concatenated to the existing segment with the highest score.
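
The selection rule in the paragraph above (join the best-scoring existing segment, or start a new one when no score exceeds the minimum) can be sketched as follows. This is a simplified illustration that assigns single points rather than whole segments, and it is not AISdb's code: `pick_segment`, `assign_points`, and `toy_score` are hypothetical names, and `toy_score` is a stand-in on one-dimensional positions, not the library's scoring formula.

```python
def pick_segment(scores, minscore):
    """Return the index of the highest score above minscore, or None."""
    best_index = None
    best_score = minscore
    for index, score in enumerate(scores):
        if score > best_score:
            best_index, best_score = index, score
    return best_index


def assign_points(points, score_fn, minscore):
    """Greedily assign each point to the best-matching existing segment."""
    segments = []
    for point in points:
        # Score the new point against the last point of every segment.
        scores = [score_fn(segment[-1], point) for segment in segments]
        index = pick_segment(scores, minscore)
        if index is None:
            segments.append([point])  # nothing plausible: start a new segment
        else:
            segments[index].append(point)
    return segments


def toy_score(prev, new):
    """Stand-in score on (time, position) pairs: closer means higher."""
    return 1.0 / (1.0 + abs(new[1] - prev[1]))


# Two transmitters sharing one identifier, interleaved in time:
# one near position 0, the other near position 100.
points = [(0, 0.0), (1, 100.0), (2, 1.0), (3, 101.0)]
for segment in assign_points(points, toy_score, minscore=0.1):
    print(segment)
```

With these inputs the interleaved messages are separated into two segments, one per transmitter, which is the effect the encoder aims for when vessels share an identifier.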

Processing functions can be chained into a pipeline: after the individual voyages are segmented, the results can be passed to the encoder to remove noise and to separate vessels that share an identifier.
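
The segmentation step splits a track wherever the gap between consecutive messages exceeds a maximum interval. A minimal sketch of that idea, written as a generator so it can sit inside a pipeline, might look like this. It works on a plain sorted list of timestamps rather than on AISdb track dictionaries, and `split_by_gap` is a hypothetical helper, not the library's `split_timedelta`.

```python
from datetime import datetime, timedelta


def split_by_gap(timestamps, maxdelta):
    """Yield lists of consecutive timestamps whose gaps do not exceed maxdelta.

    Assumes the timestamps are already sorted in ascending order.
    """
    segment = []
    for ts in timestamps:
        if segment and ts - segment[-1] > maxdelta:
            yield segment  # gap too large: close the current segment
            segment = []
        segment.append(ts)
    if segment:
        yield segment


t0 = datetime(2018, 1, 1)
stamps = [t0, t0 + timedelta(hours=1), t0 + timedelta(hours=30), t0 + timedelta(hours=31)]
for part in split_by_gap(stamps, timedelta(hours=24)):
    print(len(part), part[0], part[-1])
```

Because the function yields segments lazily, its output can be fed straight into the next processing step without holding every track in memory.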

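import aisdb
from datetime import datetime, timedelta
from aisdb import DomainFromPoints


def color_tracks(tracks):
    """Set the display color of each vessel track (a color name or RGB value)."""
    for track in tracks:
        track['color'] = 'blue'
        yield track


dbpath = 'YOUR_DATABASE.db'  # Define the path to your database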

# Set the start and end times for the query
start_time = datetime.strptime("2018-01-01 00:00:00", '%Y-%m-%d %H:%M:%S')
end_time = datetime.strptime("2018-01-02 00:00:00", '%Y-%m-%d %H:%M:%S')

# A region extending 50 km from the location point (radial distance in meters)
domain = DomainFromPoints(points=[(-63.6, 44.6)], radial_distances=[50000])

maxdelta = timedelta(hours=24)  # maximum time gap between messages in one segment
distance_threshold = 20000      # maximum allowed distance (meters) between consecutive AIS messages
speed_threshold = 50            # maximum allowed vessel speed (knots) between consecutive AIS messages
minscore = 1e-6                 # minimum score threshold for track segment validation

with aisdb.SQLiteDBConn(dbpath=dbpath) as dbconn:
    qry = aisdb.DBQuery(
        dbconn=dbconn, start=start_time, end=end_time,
        callback=aisdb.database.sqlfcn_callbacks.in_timerange_validmmsi,
    )
    rowgen = qry.gen_qry()
    tracks = aisdb.track_gen.TrackGen(rowgen, decimate=False)
    
    # Split the tracks into segments based on the maximum time interval
    track_segments = aisdb.split_timedelta(tracks, maxdelta)
    
    # Encode the track segments to clean and validate the track data
    tracks_encoded = aisdb.encode_greatcircledistance(track_segments, 
                                                      distance_threshold=distance_threshold, 
                                                      speed_threshold=speed_threshold, 
                                                      minscore=minscore)
    tracks_colored = color_tracks(tracks_encoded)
    
    aisdb.web_interface.visualize(
        tracks_colored,
        domain=domain,
        visualearth=True,
        open_browser=True,
    )

After segmentation and encoding, the tracks are shown as:

Figure: Queried vessel tracks after applying track segmentation and encoder (distance threshold = 20 km, speed threshold = 50 knots)

For comparison, this is a view of the tracks before cleaning:

Figure: Queried vessel tracks before cleaning

API reference: aisdb.denoising_encoder.encode_greatcircledistance()