Research | Olivier Marin @ NYU Shanghai

Current Projects

My research lies at the intersection of distributed algorithms and systems implementation, with a focus on building resilient and scalable infrastructures. I am particularly interested in how emerging technologies such as data processing units in modern datacenters, reinforcement learning in multi-agent systems, and quantum resources for distributed coordination reshape the classical challenges of fault tolerance, consensus, and performance. Across these projects, my goal is to develop theoretically grounded algorithms that remain practical and impactful in real-world environments.

Frugal Distributed Algorithms at the Network Layer (FrugalDiNet)

Funding: French National Research Agency (ANR, 2024–2027)
Collaborators: Three French partner institutions

This project investigates how intelligent hardware, such as programmable switches and Data Processing Units (DPUs), can be leveraged for high performance services in large-scale datacenters. My focus is on identifying exploitable telemetry and designing distributed algorithms that enhance scalability and resilience at low cost. The work aims to establish a foundation for next-generation network devices attuned to the performance demands of modern infrastructure.

Scalable Collective AI: Learning and Decision-Making in Multi-Agent Systems

Funding: National Natural Science Foundation of China (NSFC, 2025–2028)
Collaborators: Dr. Mathieu Lauriere, NYU Shanghai

This project investigates the intersection of artificial intelligence, reinforcement learning, and mean-field game theory to understand how large populations of agents learn, adapt, and cooperate in complex environments. The research combines rigorous mathematical modeling with computational experimentation: mean-field games provide a framework for analyzing equilibrium behavior, while scalable algorithms are designed and tested to handle thousands of interacting agents. My contribution focuses on distributed simulation and algorithmic implementation, developing architectures that enable large-scale experiments, validate theoretical results, and extend them into tractable methods with applications in economics, finance, and network optimization.

QUANTum Improved COnsensus (QUANTICO)

Status: Exploratory project
Collaborators: Dr. Chandrashekar Radhakrishnan, NYU Shanghai

This collaboration examines the potential of entanglement to accelerate coordination in distributed quantum systems. We have introduced formal models that capture the interplay of quantum resources with communication delays and used them to design new approaches for leader election and consensus. Our findings suggest that quantum resources can significantly reduce complexity, but that delays remain fundamental, reframing classical impossibility results in the emerging context of hybrid quantum–classical computing.

Past Projects

Platform for HYper-Scale Interdisciplinary Computations and Simulations (PHYSICS)

The goal of the PHYSICS project is to design a reliable management system that guides the process of coding concurrent scientific applications and automatically deploys embarrassingly parallel workflows (EPWs) on an extreme scale. In order to model scientific applications, computational science commonly uses workflows. A workflow describes a set of computations meant to analyze data in a structured and concurrent manner. A workflow model therefore consists in a composition which defines a set of tasks and assigns dependencies between them. A Scientific Workflow Management System (SWMS) combines a workflow model with a system-level implementation to support scientific applications. Its objective is to make it easy for scientists to break up an abstract task into distinct computing jobs. There is a particular category of scientific workflows called embarrassingly parallel, that can inherently scale up to very high orders of magnitude. Every scientific discipline has at least one field that generates embarrassingly parallel workflows: problems that naturally produce huge numbers of highly parallel computations. Usually, these fields study the correlation between simple mechanisms at the microscopic level and the behavior that results from their interactions at the macroscopic level. Our aim is to develop a computing environment for the deployment of embarrassingly parallel scientific workflows (EPWs) on an extreme scale.
There are two main research objectives towards achieving our goal, and they are complementary: to design a specialized middleware that combines both a computing language and a deployment architecture for scientific purposes, and to build a fault tolerance support layer for the middleware.

Flexible Failure Detection (FlexFD)

Distributed systems should provide reliable and continuous service despite the failures of some of their components. A classical way for a distributed system to tolerate failures is to detect them and then to recover. It is well recognized that the dominant factor in system unavailability lies in the failure detection phase, and that a good solution is to introduce failure detectors (FDs). An FD is an unreliable oracle which provides information about process crashes: its most common output is the set of processes currently suspected of having crashed. It is unreliable in the sense that it might provide incorrect information over limited periods of time; for instance, live nodes can be temporarily considered as having crashed. FDs are used in a wide variety of settings, such as network communication and group membership protocols, computer cluster management and distributed storage systems. They usually perform best in well-circumscribed distributed environments such as clusters and grids, but lack efficiency in loosely-controlled distributed systems because they are based on a binary model in which monitored processes are either trusted or suspected. In environments such as WSNs, system reliability is not a binary property as they may tolerate some margin of failure, depending on the relevance and on the number of components in the system. In this project, we intend to develop FlexFD, a flexible failure detector based on dynamic impact factors. The value of the impact factor of a node can vary during execution, depending on the current degree of reliability of the nodes or their current behavior, as well as on the past history of stable/unstable network periods, etc. For instance, if a node is suspected of failure due to an unstable network communication period, the other nodes can reduce its impact value. When such a period ends, the impact factor of the node will gradually increase. Furthermore, FlexFD will be able to automatically adapt to the needs of the application or system requirements.

Scalable Data Structure for the Discovery of Frequent Itemsets

Monitoring and analyzing transactions is a common practice in e-commerce. It helps distribution companies understand how their products are related to each other, and how to increase their profits by improving product recommendation to customers. One of the methods used is to establish association rules when products are either purchased or looked up during the same session. An association rule is of the form “if product X is purchased, then product Y will be purchased in α% of cases”. It can also concern product groups: “if products X and Y are bought together, then product Z will be purchased in β% of cases”. This approach is generally referred to as the frequent-itemsets problem. It differs from the similarity search in that it requires to deal with the absolute number of lookups/purchases that contain a particular set of items. All transactions and activities on an online commercial site can be collected continuously. Considering that on average a customer looks up 15 products before buying one, storing transactions and data mining for association rules can quickly become a Big Data issue. The goal of this project is to design, implement, and deploy a tree structure that stores the most frequent associations between transactions and allows them to be analyzed efficiently. The first step consists in designing and implementing a structure that is well-suited for deployment in the cloud without degrading performance, and that scales with respect to the number of analyzed transactions. The second step is to propose different strategies of distribution of the nodes of the structure, in order to maximize response times and throughput while distributing the load and preserving the locality for the most frequent node paths.

hARnessing MAssive DAtaflows (ARMADA)

Managing and processing Big Data is usually handled in a static manner. Static cluster- or grid-based solutions are ill-suited for tackling Dynamic Big Data workflows, where new data is produced continuously. Our final objective in this project is to design and implement a large scale distributed framework for the management and processing of Dynamic Big Data. We focus our research on two scientific challenges: placement and processing. One of the most challenging problems is to place new data coming from a huge workflow. We explore new techniques for the mapping of huge dynamic flows of data in a large scale distributed system. In particular, these techniques ought to promote the locality of distributed computations. With respect to processing, our objective is to propose innovative solutions for the processing of a continuous flow of big data in a large scale distributed system. To this effect, we identify properties that are common to distributed programming paradigms, and then integrate these properties in the design of a framework that takes into account the locality of the data flow and ensures a reliable convergence of the data processing.
Up until 2015, I was leader of the Inria Associate Team that embodies this project in the context of a cooperation between Inria/LIP6 Project-Team REGAL, the Universidad Tecnica Santa Maria in Valparaiso (Chile), and the Universidad de Santiago de Chile. This project also obtained a joint Inria/CONYCIT funding for a 3-year PhD scholarship (2013-2016)

Online Games REfereeing System (OGRES)

Despite their name, the current batch of Massively Multiplayer Online Games (MMOGs) do not scale well. They rely on centralised client/server architectures which impose a limit on the maximum number of players (avatars) and resources that can coexist in any given virtual world. One of the main reasons for such a limitation is the common belief that fully decentralised solutions inhibit cheating prevention. The purpose of the OGRES project is to break this belief by designing and implementing a P2P gaming architecture that monitors the game at least as efficiently and as securely as a centralised architecture. Our approach delegates game refereeing to the player nodes. A reputation system assesses node honesty in order to discard corrupt referees and malicious players, whilst an independent multi-agent system challenges nodes with fake game requests so as to accelerate the reputation-building process. I was leader of this project, which obtained internal funding for the year 2012 by the Laboratoire d’Informatique de Paris 6 and a 3-year PhD scholarship (2012-2015) from the french Ministry of Higher Education and Research.

GEocentric Mobile Syndication

The GEMS project proposes a new approach towards filtering and processing the ever growing quantity of data published from mobile devices before it even reaches the Internet. We intend to tackle this issue by syndicating geocentric data on the fly as they get published by mobile device owners. By circumscribing data to the zone where it is published, we believe it is possible to extract information that is both trustworthy and relevant for a majority of users.
This project obtained internal funding for the year 2012 by the Laboratoire d’Informatique de Paris 6

Cooperative Certification in Peer to Peer Networks

In peer to peer networks, trusted third parties (TTPs) are useful for certification purposes, for preventing malicious behaviours and for monitoring processes, among other things. In these environments, traditional centralised approaches towards TTPs generate bottlenecks and single points of failure/attack; distributed solutions must address issues of dynamism, heterogeneity, and scalability. The purpose of this work is to design a system that builds and maintains a community of the most trustworthy nodes in a DHT-based peer to peer network. The resulting system must be scalable and decentralised, where nodes can build sets of reputable peers efficiently in order to constitute collaborative TTPs, and must be able to cope well with malicious behaviours.
This project (2008-2011) got funded by the INRIA and the CONYCIT (Chilean national research agency) as a joint research effort between REGAL and the University Frederico Santa Maria, Valparaiso, Chile.

Dynamic Agent Replication eXtension (DARX)

Distributed applications are very sensitive to host or process failures. This is all the more true for multi-agent systems, which are likely to deploy multitudes of agents on a great number of locations. However, fault tolerance involves costly mechanisms; it is thus advisable to apply it wisely. The DARX project investigated the dynamic adaptation of fault tolerance within multi-agents platforms. The aim of this research was double: (a) to provide effective methods for ensuring fail-proof multi-agent computations, and (b) to develop a framework for the design of scalable applications, in terms of the number of hosts as well as the number of processes/agents.
This project (2007-2011) got funded by the french national research agency ANR (ACI SETIN) as a joint research effort between INRIA, LIP6 and LIRMM. I was assistant coordinator of this project, and leader on the INRIA side.

Dependable DEployment oF Code in Open eNvironments (DDEFCON)

The primary objective of the DDEFCON project was to design a middleware for the dependable deployment of massively parallel, cooperative applications over open environments such as the Internet. In practical terms, we developed a middleware prototype which allows to run massively parallel computations in a fully decentralised manner, and hence takes into account the obstacles mentioned above as obstacles. Our work comprised three interdependent research efforts: (a) the study of the replication of cooperative software tasks within a P2P overlay, (b) the development of a secure runtime environment over a heterogeneous network, and (c) the design of a language for the dependable deployment of code.
I was leader of this project, which obtained internal funding during the year 2008 by the Laboratoire d’Informatique de Paris 6.

Fault Tolerant and Hierarchic Grid platform (FTH-GRID)

The MapReduce (MR) programming model/architecture allows to process huge data sets efficiently. The most popular platform, Hadoop, adopts a master/slave architecture that fits very well on top of computing grids and clouds. While there are many solutions that introduce crash recovery for Hadoop, to the best of our knowledge the issue of malicious nodes remains to be addressed. FTH-GRID injects a simple task replication scheme along with a results comparison mechanism into Hadoop. Voting out inconsistent results allows to detect corrupt outputs as well as potentially malicious nodes.
This project (2008-2010), was a joint research effort between the INRIA/LIP6 REGAL team and the LASIGE, and obtained an allocation from the EGIDE european fund. I was leader of this project on the REGAL team side.

PhD Students

Former

Rudyar Cortes – Data Engineer

Maxime Veron – R&D Engineer in Enterprise Data Management

Erika Rosas – Assistant Professor at the University of Santiago, Chile