Research


Current Projects

 

The research projects below are open to NYU Shanghai students, including study away students. Applying is a good opportunity for students interested in pursuing experimental CS research: it provides preliminary groundwork for a meaningful capstone project and can easily turn into a summer research internship. Students interested in working on these topics should contact me at olivier.marin@nyu.edu.

Frugal Distributed Algorithms at the Network Layer (FrugalDiNet)

A DPU is a system-on-chip that combines three key elements: programmable multi-core CPUs, a high-performance NIC, and a rich set of acceleration engines. DPUs can significantly improve data center computations at low cost. They can harvest and consolidate network data, such as bandwidth usage and link status, delve into granular traffic metrics, and, because they can run code, support quality of service (QoS) analysis. Beyond the network, DPUs offer node-specific insights, such as buffer allocation and CPU task load, to refine performance and avert processing bottlenecks. On a broader scale, DPUs can share and propagate data along message routes at a low cost. Nodes can also cooperate to evaluate network performance through time-stamped probes for bandwidth and latency, and exchange vital information on system health and load, enhancing network resilience and efficiency.
This landscape presents two major scientific hurdles. The first is to identify the crucial data that transits via DPUs and can be leveraged to boost overall performance. The second is to design a new framework of algorithms that implements a cooperative system on top of DPUs. This project aspires to harness as-yet unused data, guiding future technological advancements and industry standards. The endeavor aims not only to enrich the toolset for network analysis, but also to lay a foundation for next-generation network devices that are more attuned to the intricate demands of modern data centers.
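The excerpt below is a toy sketch of the kind of cooperative, low-cost link monitoring the project envisions: peers exchange time-stamped probes and keep smoothed latency and bandwidth estimates. All names (Probe, LinkMonitor) and the EWMA smoothing are illustrative assumptions; a real DPU-resident agent would rely on vendor SDKs and hardware timestamps rather than Python's clock.

    # Toy sketch of cooperative link monitoring with time-stamped probes.
    # Everything here is illustrative: real DPU telemetry would use vendor
    # SDKs and hardware timestamps, not time.time().
    import time
    from dataclasses import dataclass

    @dataclass
    class Probe:
        sent_at: float      # sender timestamp (seconds)
        payload_bytes: int  # probe payload size

    class LinkMonitor:
        """Keeps smoothed latency/bandwidth estimates for one peer link."""
        def __init__(self, alpha=0.2):
            self.alpha = alpha          # EWMA smoothing factor
            self.latency_s = None       # smoothed delay estimate
            self.bandwidth_bps = None   # smoothed throughput estimate

        def observe(self, probe: Probe, received_at: float):
            delay = max(received_at - probe.sent_at, 1e-9)
            bw = probe.payload_bytes * 8 / delay
            self.latency_s = delay if self.latency_s is None else \
                (1 - self.alpha) * self.latency_s + self.alpha * delay
            self.bandwidth_bps = bw if self.bandwidth_bps is None else \
                (1 - self.alpha) * self.bandwidth_bps + self.alpha * bw

    # Simulated exchange between two nodes sharing a synchronized clock.
    monitor = LinkMonitor()
    for size in (1500, 1500, 9000):
        p = Probe(sent_at=time.time(), payload_bytes=size)
        time.sleep(0.01)                       # stand-in for network transit
        monitor.observe(p, received_at=time.time())
    print(f"latency ~{monitor.latency_s*1e3:.1f} ms, "
          f"bandwidth ~{monitor.bandwidth_bps/1e6:.1f} Mbit/s")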

QUANTum Improved COnsensus (QUANTICO)

Distributed quantum computing is an emerging area of research that combines principles of quantum mechanics with distributed systems to enhance performance, scalability, and fault tolerance. The core idea is to exploit quantum entanglement for seamless coordination across distributed nodes, reducing both algorithmic complexity and communication overhead. This project addresses a key challenge: the delays introduced by data exchanges between processors, a well-known constraint in distributed computing but largely overlooked in quantum contexts. Such delays complicate achieving consensus—a critical requirement for coordination among nodes. Our work focuses on developing a formal model and novel approaches to address the interplay between time, delays, and distributed quantum algorithms. The ultimate goal is to establish a theoretical foundation for designing algorithms that operate effectively in hybrid quantum-classical systems. Initial results include two key contributions: leveraging entangled (Greenberger–Horne–Zeilinger) states to accelerate leader election, a fundamental operation in distributed systems, and introducing the notion that while quantum resources can enhance consensus mechanisms, they cannot entirely eliminate delays.
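As a purely illustrative aside, the toy simulation below exercises one property behind the GHZ-based contribution: measuring a GHZ state in the computational basis gives every node the same random bit without any message exchange, which can serve as a shared coin for randomized coordination. The simulation is classical and idealized (no noise, no decoherence, no cost for distributing the entangled state) and is not the project's actual protocol.

    # Idealized, classical simulation of the "shared coin" property of a GHZ
    # state: all nodes observe the same random bit. Illustrative only; this
    # is not the QUANTICO protocol.
    import random

    def ghz_measure(n_nodes: int) -> list[int]:
        """All nodes observe the same uniformly random bit."""
        bit = random.randint(0, 1)
        return [bit] * n_nodes

    def classical_local_coins(n_nodes: int) -> list[int]:
        """Independent local coins: nodes usually disagree."""
        return [random.randint(0, 1) for _ in range(n_nodes)]

    def rounds_until_agreement(coin_source, n_nodes=7) -> int:
        """Count coin rounds until every node holds the same bit."""
        rounds = 0
        while True:
            rounds += 1
            bits = coin_source(n_nodes)
            if len(set(bits)) == 1:
                return rounds

    trials = 1000
    avg_ghz = sum(rounds_until_agreement(ghz_measure) for _ in range(trials)) / trials
    avg_cls = sum(rounds_until_agreement(classical_local_coins) for _ in range(trials)) / trials
    print(f"GHZ shared coin: {avg_ghz:.1f} round(s) on average")
    print(f"independent local coins: {avg_cls:.1f} round(s) on average")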


Past Projects

Platform for HYper-Scale Interdisciplinary Computations and Simulations (PHYSICS)

The goal of the PHYSICS project is to design a reliable management system that guides the process of coding concurrent scientific applications and automatically deploys embarrassingly parallel workflows (EPWs) on an extreme scale. Computational science commonly uses workflows to model scientific applications. A workflow describes a set of computations meant to analyze data in a structured and concurrent manner; a workflow model therefore consists of a composition that defines a set of tasks and assigns dependencies between them. A Scientific Workflow Management System (SWMS) combines a workflow model with a system-level implementation to support scientific applications. Its objective is to make it easy for scientists to break up an abstract task into distinct computing jobs.
A particular category of scientific workflows, called embarrassingly parallel, can inherently scale up to very high orders of magnitude. Every scientific discipline has at least one field that generates such workflows: problems that naturally produce huge numbers of highly parallel computations. Usually, these fields study the correlation between simple mechanisms at the microscopic level and the behavior that results from their interactions at the macroscopic level.
We pursue two complementary research objectives towards this goal: designing a specialized middleware that combines a computing language with a deployment architecture for scientific purposes, and building a fault tolerance support layer for that middleware.
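For illustration only, the sketch below shows what makes a workflow embarrassingly parallel: its tasks share no dependencies, so a scheduler can fan them out freely. The simulate() kernel and its parameters are placeholders for a real scientific computation, and a single-machine process pool obviously stands in for the extreme-scale deployment the project targets.

    # Minimal sketch of an embarrassingly parallel workflow: a parameter
    # sweep whose tasks have no dependencies on one another.
    from multiprocessing import Pool

    def simulate(params):
        """Stand-in for one independent microscopic-level computation."""
        seed, steps = params
        x = seed
        for _ in range(steps):
            x = (1103515245 * x + 12345) % (2**31)   # toy pseudo-random walk
        return seed, x

    if __name__ == "__main__":
        tasks = [(seed, 10_000) for seed in range(64)]   # 64 independent jobs
        with Pool() as pool:
            results = pool.map(simulate, tasks)          # no inter-task ordering
        print(f"completed {len(results)} independent tasks")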

Flexible Failure Detection (FlexFD)

Distributed systems should provide reliable and continuous service despite the failures of some of their components. A classical way for a distributed system to tolerate failures is to detect them and then to recover. It is well recognized that the dominant factor in system unavailability lies in the failure detection phase, and that a good solution is to introduce failure detectors (FDs). An FD is an unreliable oracle which provides information about process crashes: its most common output is the set of processes currently suspected of having crashed. It is unreliable in the sense that it might provide incorrect information over limited periods of time; for instance, live nodes can be temporarily considered as having crashed. FDs are used in a wide variety of settings, such as network communication and group membership protocols, computer cluster management, and distributed storage systems. They usually perform best in well-circumscribed distributed environments such as clusters and grids, but lack efficiency in loosely-controlled distributed systems because they are based on a binary model in which monitored processes are either trusted or suspected. In environments such as wireless sensor networks (WSNs), system reliability is not a binary property: such systems may tolerate some margin of failure, depending on the relevance and on the number of components in the system.
In this project, we intend to develop FlexFD, a flexible failure detector based on dynamic impact factors. The impact factor of a node can vary during execution, depending on the current degree of reliability of the nodes, their current behavior, the past history of stable/unstable network periods, and so on. For instance, if a node is suspected of failure due to an unstable network communication period, the other nodes can reduce its impact value; when such a period ends, the impact factor of the node gradually increases again. Furthermore, FlexFD will be able to adapt automatically to the needs of the application or to system requirements.
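The fragment below is a minimal sketch of the impact-factor idea, under assumptions that are not spelled out in the description: heartbeat-based monitoring, a fixed timeout, and a simple decrease/recovery rule for impact values. FlexFD's actual adaptation policy may differ substantially.

    # Minimal sketch of a failure detector with dynamic impact factors.
    # The timeout, margin, and update constants are illustrative assumptions.
    import time

    class FlexibleFD:
        def __init__(self, nodes, timeout=2.0, margin=0.7):
            self.timeout = timeout
            self.margin = margin                    # acceptable fraction of total impact
            self.last_heartbeat = {n: time.time() for n in nodes}
            self.impact = {n: 1.0 for n in nodes}   # per-node impact factor

        def heartbeat(self, node):
            self.last_heartbeat[node] = time.time()
            # node behaves well again: let its impact recover gradually
            self.impact[node] = min(1.0, self.impact[node] + 0.1)

        def suspected(self, node):
            return time.time() - self.last_heartbeat[node] > self.timeout

        def refresh(self):
            # nodes suspected during unstable periods lose impact gradually
            for node in self.impact:
                if self.suspected(node):
                    self.impact[node] *= 0.5

        def system_trusted(self):
            """Trusted while the non-suspected impact stays above the margin."""
            self.refresh()
            total = sum(self.impact.values())
            alive = sum(v for n, v in self.impact.items() if not self.suspected(n))
            return total > 0 and alive / total >= self.margin

    fd = FlexibleFD(["n1", "n2", "n3"])
    fd.heartbeat("n1")
    print(fd.system_trusted())   # True while enough impact remains unsuspected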

Scalable Data Structure for the Discovery of Frequent Itemsets

Monitoring and analyzing transactions is a common practice in e-commerce. It helps distribution companies understand how their products are related to each other, and how to increase their profits by improving product recommendations to customers. One of the methods used is to establish association rules when products are either purchased or looked up during the same session. An association rule is of the form “if product X is purchased, then product Y will be purchased in α% of cases”. It can also concern product groups: “if products X and Y are bought together, then product Z will be purchased in β% of cases”. This approach is generally referred to as the frequent-itemsets problem. It differs from similarity search in that it requires dealing with the absolute number of lookups/purchases that contain a particular set of items. All transactions and activities on an online commercial site can be collected continuously. Considering that on average a customer looks up 15 products before buying one, storing transactions and mining them for association rules can quickly become a Big Data issue.
The goal of this project is to design, implement, and deploy a tree structure that stores the most frequent associations between transactions and allows them to be analyzed efficiently. The first step consists of designing and implementing a structure that is well-suited for deployment in the cloud without degrading performance, and that scales with respect to the number of analyzed transactions. The second step is to propose different strategies for distributing the nodes of the structure, in order to minimize response times and maximize throughput while balancing the load and preserving locality for the most frequent node paths.
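As a small illustration of the rule format above, the snippet below counts pairwise item co-occurrences in a handful of transactions and prints support-filtered rules with their confidence. The distributed tree structure the project targets is not shown; the sample transactions, the minimum support, and the restriction to item pairs are all illustrative choices.

    # Toy pairwise association-rule mining with support/confidence counting.
    from collections import Counter
    from itertools import combinations

    transactions = [
        {"laptop", "mouse", "bag"},
        {"laptop", "mouse"},
        {"phone", "case"},
        {"laptop", "bag"},
        {"phone", "case", "charger"},
    ]

    item_count = Counter()
    pair_count = Counter()
    for t in transactions:
        item_count.update(t)
        pair_count.update(frozenset(p) for p in combinations(sorted(t), 2))

    MIN_SUPPORT = 2   # keep only pairs seen at least twice
    for pair, n in pair_count.items():
        if n < MIN_SUPPORT:
            continue
        x, y = tuple(pair)
        print(f"if {x} then {y}: confidence {n / item_count[x]:.0%}")
        print(f"if {y} then {x}: confidence {n / item_count[y]:.0%}")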

hARnessing MAssive DAtaflows (ARMADA)

Big Data management and processing are usually handled in a static manner. Static cluster- or grid-based solutions are ill-suited for tackling Dynamic Big Data workflows, where new data is produced continuously. Our objective in this project is to design and implement a large scale distributed framework for the management and processing of Dynamic Big Data. We focus our research on two scientific challenges: placement and processing. One of the most challenging problems is to place new data coming from a huge workflow; we explore new techniques for mapping huge dynamic flows of data onto a large scale distributed system, and these techniques ought in particular to promote the locality of distributed computations. With respect to processing, our objective is to propose innovative solutions for processing a continuous flow of big data in a large scale distributed system. To this effect, we identify properties that are common to distributed programming paradigms, and then integrate these properties into the design of a framework that takes into account the locality of the data flow and ensures a reliable convergence of the data processing.
Until 2015, I was leader of the Inria Associate Team that embodied this project, in the context of a cooperation between the Inria/LIP6 Project-Team REGAL, the Universidad Tecnica Santa Maria in Valparaiso (Chile), and the Universidad de Santiago de Chile. The project also obtained joint Inria/CONICYT funding for a 3-year PhD scholarship (2013-2016).
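As a toy illustration of the locality concern described above, the snippet below deterministically places every item of a given data source on the same node, so that computations over one flow stay local. The hashing scheme, node count, and source identifiers are assumptions made for the example, not the ARMADA design.

    # Locality-aware placement sketch: all items from one source co-locate.
    import hashlib

    NODES = [f"node-{i}" for i in range(8)]

    def place(source_id: str) -> str:
        """Deterministically map all items of one source to the same node."""
        digest = hashlib.sha256(source_id.encode()).hexdigest()
        return NODES[int(digest, 16) % len(NODES)]

    stream = [("sensor-42", 3.7), ("sensor-17", 1.2), ("sensor-42", 3.9)]
    for source, value in stream:
        print(f"{source}: {value} -> {place(source)}")   # sensor-42 items co-locate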

Online Games REfereeing System (OGRES)

Despite their name, current Massively Multiplayer Online Games (MMOGs) do not scale well. They rely on centralised client/server architectures which impose a limit on the maximum number of players (avatars) and resources that can coexist in any given virtual world. One of the main reasons for this limitation is the common belief that fully decentralised solutions inhibit cheating prevention. The purpose of the OGRES project is to dispel this belief by designing and implementing a P2P gaming architecture that monitors the game at least as efficiently and as securely as a centralised architecture. Our approach delegates game refereeing to the player nodes. A reputation system assesses node honesty in order to discard corrupt referees and malicious players, whilst an independent multi-agent system challenges nodes with fake game requests so as to accelerate the reputation-building process. I was leader of this project, which obtained internal funding for the year 2012 from the Laboratoire d’Informatique de Paris 6 and a 3-year PhD scholarship (2012-2015) from the French Ministry of Higher Education and Research.
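The toy fragment below illustrates the refereeing-plus-reputation idea: several player nodes referee the same action, the majority verdict among trusted referees is kept, and a referee that contradicts it loses reputation until it is no longer trusted. The update constants and the trust threshold are illustrative assumptions.

    # Toy reputation-based refereeing: majority verdict wins, dissenters
    # lose reputation. Constants are illustrative assumptions.
    from collections import Counter

    reputation = {"alice": 1.0, "bob": 1.0, "mallory": 1.0}
    TRUST_THRESHOLD = 0.5

    def adjudicate(verdicts: dict[str, bool]) -> bool:
        """Keep the majority verdict among trusted referees, then update reputations."""
        trusted = {r: v for r, v in verdicts.items() if reputation[r] >= TRUST_THRESHOLD}
        outcome = Counter(trusted.values()).most_common(1)[0][0]
        for referee, verdict in verdicts.items():
            if verdict == outcome:
                reputation[referee] = min(1.0, reputation[referee] + 0.05)
            else:
                reputation[referee] = max(0.0, reputation[referee] - 0.25)
        return outcome

    # mallory keeps approving an illegal move while the others reject it
    for _ in range(4):
        adjudicate({"alice": False, "bob": False, "mallory": True})
    print(reputation)   # mallory's reputation drops below the trust threshold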

GEocentric Mobile Syndication (GEMS)

The GEMS project proposes a new approach towards filtering and processing the ever-growing quantity of data published from mobile devices before it even reaches the Internet. We intend to tackle this issue by syndicating geocentric data on the fly as it gets published by mobile device owners. By circumscribing data to the zone where it is published, we believe it is possible to extract information that is both trustworthy and relevant for a majority of users.
This project obtained internal funding for the year 2012 from the Laboratoire d’Informatique de Paris 6.
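As a minimal illustration of circumscribing data to its zone of publication, the snippet below keeps only the items posted within a given radius of a zone centre. The coordinates, radius, and haversine cut-off are assumptions made for the example, not the GEMS design.

    # Toy geocentric filter: keep only data published inside a zone.
    from math import radians, sin, cos, asin, sqrt

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance between two points, in kilometres."""
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371 * asin(sqrt(a))

    zone_center = (48.8566, 2.3522)      # hypothetical zone: central Paris
    ZONE_RADIUS_KM = 5.0

    posts = [
        {"text": "traffic jam on the ring road", "lat": 48.83, "lon": 2.37},
        {"text": "storm over Marseille",         "lat": 43.30, "lon": 5.37},
    ]
    local = [p for p in posts
             if haversine_km(p["lat"], p["lon"], *zone_center) <= ZONE_RADIUS_KM]
    print([p["text"] for p in local])    # only the geocentrically relevant post remains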

Cooperative Certification in Peer to Peer Networks

In peer-to-peer networks, trusted third parties (TTPs) are useful for certification purposes, for preventing malicious behaviours, and for monitoring processes, among other things. In these environments, traditional centralised approaches towards TTPs generate bottlenecks and single points of failure/attack; distributed solutions must address issues of dynamism, heterogeneity, and scalability. The purpose of this work is to design a system that builds and maintains a community of the most trustworthy nodes in a DHT-based peer-to-peer network. The resulting system must be scalable and decentralised, must allow nodes to build sets of reputable peers efficiently in order to constitute collaborative TTPs, and must cope well with malicious behaviours.
This project (2008-2011) was funded by INRIA and CONICYT (the Chilean national research agency) as a joint research effort between REGAL and the University Federico Santa Maria, Valparaiso, Chile.
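The sketch below illustrates, in a very reduced form, how the most reputable peers could form a collaborative TTP whose quorum certifies claims. The reputation values, the community size, and the strict-majority rule are assumptions made for the example; a real deployment would gather reputations through the DHT and rely on cryptographic signatures.

    # Toy collaborative TTP: top-reputation peers certify claims by quorum.
    def trusted_community(reputations: dict[str, float], k: int) -> list[str]:
        """Keep the k most trustworthy peers as collaborative certifiers."""
        return sorted(reputations, key=reputations.get, reverse=True)[:k]

    def certify(claim: str, community: list[str], votes: dict[str, bool]) -> bool:
        """Certify the claim only if a strict majority of the community approves."""
        approvals = sum(1 for peer in community if votes.get(peer, False))
        return approvals > len(community) // 2

    reputations = {"p1": 0.95, "p2": 0.90, "p3": 0.40, "p4": 0.88, "p5": 0.10}
    community = trusted_community(reputations, k=3)           # ['p1', 'p2', 'p4']
    print(certify("node p7 owns key K", community,
                  {"p1": True, "p2": True, "p4": False}))     # True: 2 of 3 approve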

Dynamic Agent Replication eXtension (DARX)

Distributed applications are very sensitive to host or process failures. This is all the more true for multi-agent systems, which are likely to deploy multitudes of agents on a great number of locations. However, fault tolerance involves costly mechanisms; it is thus advisable to apply it wisely. The DARX project investigated the dynamic adaptation of fault tolerance within multi-agent platforms. The aim of this research was twofold: (a) to provide effective methods for ensuring fail-proof multi-agent computations, and (b) to develop a framework for the design of scalable applications, in terms of the number of hosts as well as the number of processes/agents.
This project (2007-2011) was funded by the French national research agency ANR (ACI SETIN) as a joint research effort between INRIA, LIP6, and LIRMM. I was assistant coordinator of this project, and leader on the INRIA side.

Dependable DEployment oF Code in Open eNvironments (DDEFCON)

The primary objective of the DDEFCON project was to design a middleware for the dependable deployment of massively parallel, cooperative applications over open environments such as the Internet. In practical terms, we developed a middleware prototype which makes it possible to run massively parallel computations in a fully decentralised manner, and hence copes with the obstacles inherent to such open environments. Our work comprised three interdependent research efforts: (a) the study of the replication of cooperative software tasks within a P2P overlay, (b) the development of a secure runtime environment over a heterogeneous network, and (c) the design of a language for the dependable deployment of code.
I was leader of this project, which obtained internal funding for the year 2008 from the Laboratoire d’Informatique de Paris 6.

Fault Tolerant and Hierarchic Grid platform (FTH-GRID)

The MapReduce (MR) programming model/architecture makes it possible to process huge data sets efficiently. The most popular platform, Hadoop, adopts a master/slave architecture that fits very well on top of computing grids and clouds. While there are many solutions that introduce crash recovery for Hadoop, to the best of our knowledge the issue of malicious nodes remains to be addressed. FTH-GRID injects a simple task replication scheme, along with a results comparison mechanism, into Hadoop. Voting out inconsistent results makes it possible to detect corrupt outputs as well as potentially malicious nodes.
This project (2008-2010) was a joint research effort between the INRIA/LIP6 REGAL team and LASIGE, and obtained an allocation from the EGIDE European fund. I was leader of this project on the REGAL team side.
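The fragment below sketches the replication-and-voting principle: the same task runs on several workers, the majority output is accepted, and any worker returning a divergent result is flagged as potentially malicious. The replication degree and flagging rule are illustrative assumptions, not Hadoop internals.

    # Toy result voting for replicated tasks: accept the majority output,
    # flag workers whose output diverges from it.
    from collections import Counter

    def vote(replica_outputs: dict[str, str]):
        """Return (accepted_output, suspicious_workers) for one replicated task."""
        counts = Counter(replica_outputs.values())
        accepted, _ = counts.most_common(1)[0]
        suspicious = [w for w, out in replica_outputs.items() if out != accepted]
        return accepted, suspicious

    # the same map task executed on three workers; worker-3 tampers with its output
    outputs = {"worker-1": "wordcount:42", "worker-2": "wordcount:42",
               "worker-3": "wordcount:13"}
    result, flagged = vote(outputs)
    print(result)    # 'wordcount:42'
    print(flagged)   # ['worker-3']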

 

PhD Students

Former

Rudyar Cortes – Data Engineer

Maxime Veron – R&D Engineer in Enterprise Data Management

Erika Rosas – Assistant Professor at the University of Santiago, Chile