Funded by the Moore-Sloan Foundation NYU Data Science Seed Grant
In collaboration with Esteban Tabak
According to the Federal Highway Administration, the United States has 8.66 million lane-miles for highway travel and a further 5.23 million lane-miles of local, paved roads.[1] The pavement, signage, and lighting of these roadways pose a major inventory-management challenge for the local, state, and federal governments tasked with their provision and maintenance. The current state of practice for inspection and inventorying is a "windshield survey" every 1-2 years, in which a driver and an inspector slowly drive every mile and note major defects as observed, sometimes photographing particularly egregious areas.
Commercial alternatives for pavement inspection exist in the form of car- or truck-mounted multiple or wide-format laser scanners, but most communities have found this option too expensive or are ill-equipped to handle the output data without an enormous investment in manual evaluation. This leaves the majority of communities continuing to rely heavily on subjective and poorly documented road condition records. Furthermore, such scanning does not address the inventorying and inspection of the associated signage and lighting; management of those assets remains highly dependent on driver-based reporting.
However, the development of driverless cars offers an opportunity to harvest the petabytes of mobile laser scanning data already being collected by test vehicles (and eventually fleet vehicles) and to repurpose that data for roadway, utility, and signage management in the following areas:
- Pavement management (cracks, ruts, potholes, etc.)
- Downed or deformed utility poles and traffic signage
To do this, we propose modifying and extending the concept of the Wasserstein distance as a means to check large quantities of mobile laser scanning data for changes in pavement condition and in the status of nearby utility poles and signage.[2] In such a case, we are given two sample sets {x_i} and {y_j}, where each sample is a vector (with three components for spatial location, more if extra attributes such as color are present). The two sets correspond to two different times, and our goal is to determine whether something has changed in between and, if so, where and how. To this end, we propose to develop a new data-analysis methodology: the "sample-based, modulated Wasserstein distance" between the two distributions.
The squared 2-Wasserstein distance W(rho,mu)^2 between two distributions rho(x) and mu(y) is defined through an optimal transport problem. It is the minimal average squared distance between x and y over all possible couplings between rho and mu, i.e., joint distributions pi(x,y) having rho and mu as marginals. Tabak and collaborators have developed a sample-based version of the optimal transport problem, in which rho and mu are known only through the two sample sets {x_i} and {y_j}.[3] In the context of this proposal, this methodology will be further extended through the introduction of a spatially modulated distance W(rho,mu,z), which quantifies how much of the distance between the two distributions is attributable to each spatial location z. This procedure will be made adaptive: wherever significant changes have been located, the spatial scale of the modulation will be further refined, so as to accurately determine the location and nature of the change. (This "nature" is provided by the map y(x) to which the optimal coupling reduces under convex costs, representing, for instance, a change in position, with possible consideration of normalized intensity data or co-registered color data from imagery.)
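As a concrete illustration of the sample-based formulation, the sketch below (our own hypothetical code, not the proposed implementation; the function name and test data are assumptions) computes the squared 2-Wasserstein distance between two equal-sized point samples. With uniform weights on the samples, the optimal coupling reduces to an optimal assignment, which can be solved with `scipy.optimize.linear_sum_assignment` on the pairwise squared-distance matrix:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w2_squared(x, y):
    """Squared sample-based 2-Wasserstein distance between equal-sized
    sample sets x and y (arrays of shape (n, d)).

    With uniform weights, the discrete optimal transport problem reduces
    to a linear sum assignment over the squared-distance cost matrix.
    Returns the mean squared transport cost and the per-sample costs.
    """
    # Pairwise squared Euclidean costs: cost[i, j] = |x_i - y_j|^2
    cost = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    rows, cols = linear_sum_assignment(cost)   # optimal permutation
    per_sample = cost[rows, cols]              # cost attributed to each x_i
    return per_sample.mean(), per_sample

# Two synthetic "scans" of the same scene: identical except that one
# region of 50 points has shifted between the two acquisition times.
rng = np.random.default_rng(0)
before = rng.normal(size=(200, 3))
after = before.copy()
after[:50] += np.array([2.0, 0.0, 0.0])        # localized change

w2, per_sample = w2_squared(before, after)
print(f"W^2 = {w2:.3f}")                       # nonzero: a change occurred
```

Because the per-sample costs record where the transport effort concentrates, they provide the raw material for a spatially modulated view of the distance: unchanged regions contribute essentially nothing, while the displaced region carries nearly all of the cost.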
Achieving this with mobile laser scanning data is challenging, as the Wasserstein distance approach has traditionally assumed two things: a known underlying data distribution and a high level of uniformity in that distribution. Neither is true for mobile laser scanning datasets. Additionally, there are major questions regarding how small a feature can be and still be detectable, and how large a change must be for such an approach to detect it. There are also questions as to the minimum data density required for the reference and comparison datasets, how large a data subset should be processed at once, and how to address noise in the data, whether from artefacts of the acquisition process or from irrelevant changes in the scene (e.g., debris in the road).
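To make the adaptive spatial modulation concrete, the sketch below (again our own illustration under stated assumptions: the 1-D binning, grid sizes, and function names are ours, not part of the proposal) attributes each sample's transport cost to a coarse spatial bin and then refines only the bin where the cost concentrates:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def transport_costs(x, y):
    """Per-sample squared transport costs under the optimal assignment."""
    cost = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols]

def modulated_cost(x, per_sample, edges):
    """Sum each sample's transport cost into bins of the first coordinate."""
    idx = np.clip(np.digitize(x[:, 0], edges) - 1, 0, len(edges) - 2)
    out = np.zeros(len(edges) - 1)
    np.add.at(out, idx, per_sample)
    return out

# Synthetic "before" and "after" scans: a change confined to the strip x < 1.
rng = np.random.default_rng(1)
before = rng.uniform(0.0, 8.0, size=(400, 2))
after = before.copy()
after[before[:, 0] < 1.0, 1] += 0.5

per_sample = transport_costs(before, after)
edges = np.linspace(0.0, 8.0, 5)        # four coarse 2-unit bins
coarse = modulated_cost(before, per_sample, edges)
hot = int(np.argmax(coarse))            # bin carrying the most cost

# Adaptive step: refine only the flagged bin to localize the change.
in_hot = (before[:, 0] >= edges[hot]) & (before[:, 0] < edges[hot + 1])
fine_edges = np.linspace(edges[hot], edges[hot + 1], 5)
fine = modulated_cost(before[in_hot], per_sample[in_hot], fine_edges)
print("coarse cost per bin:", np.round(coarse, 2))
print("cost within flagged bin:", np.round(fine, 2))
```

A production version would face exactly the open questions above: the grid would be spatial rather than 1-D, the refinement threshold would have to account for sampling noise and non-uniform point density, and the per-sample costs would come from the sample-based transport solver rather than an exact assignment.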
[1] American Road & Transportation Builders Association, "Frequently Asked Questions." www.artba.org/about/faq/
[2] Vallender, S. S. “Calculation of the Wasserstein distance between probability distributions on the line.” Theory of Probability & Its Applications 18.4 (1974): 784-786.
[3] Trigila, Giulio, and Esteban G. Tabak. “Data‐Driven Optimal Transport.” Communications on Pure and Applied Mathematics 69.4 (2016): 613-648.