Research into the engineering of infrastructure systems is increasingly data intensive. Researchers build computational models to explore scenarios such as investigating the merits of infrastructure plans, analysing historical data to inform system operations or assessing the impacts of infrastructure on the environment. Models are more complex, at higher resolution and with larger coverage. Researchers also require a ‘multi-systems’ approach to explore interactions between systems, such as energy and water with urban development, and across scales, from buildings and streets to regions or nations. Consequently, researchers need enhanced computational resources to support cross-institutional collaboration and sharing at scale. The Data and Analytics Facility for National Infrastructure (DAFNI) is an emerging computational platform for infrastructure systems research. It provides high-throughput compute resources so larger data sets can be used, with a data repository to upload data and share these with collaborators. Users’ models can also be uploaded and executed using modern containerisation techniques, giving platform independence, scaling and sharing. Further, models can be combined into workflows, supporting multi-systems modelling and generating visualisations to present results. DAFNI forms a central resource accessible to all infrastructure systems researchers in the UK, supporting collaboration and providing a legacy, keeping data and models available beyond the lifetime of a project.
The infrastructure systems of a country or region, including energy supplies, water systems, transport networks, digital communications, land use and the built environment, are key investments for economic, social and environmental wellbeing (Thacker et al., 2019), and one estimate suggests that US$94 trillion of investments will be required by 2040 for new and replacement infrastructure (GI Hub, 2017). However, the impact of these investments is hard to predict, as infrastructure is subject to environmental, social and economic pressures. Researchers across disciplines, including environmental sciences, geography, civil engineering, urban planning and economics, use computational modelling and analysis to explain and predict the effects of change on infrastructure systems, while policymakers use the outputs of such models to inform planning decisions. Infrastructure systems are becoming ever more complex, and models are becoming more detailed, combining data from different infrastructures and disciplines and at different scales, from a country or a region down to a locality or building (Hall, 2019). Thus, there is a need for advanced large-scale computing and data infrastructure to manage and analyse data, together with cloud systems for on-demand remote access.
The Data and Analytics Facility for National Infrastructure (DAFNI, 2023) is a major national facility under development in the UK to provide world-leading capability to advance infrastructure systems research. It provides a scalable platform supporting storage and querying of heterogeneous national infrastructure data sets and the execution, creation and visualisation of complex modelling applications. This platform improves the quality and opportunities for national infrastructure systems research while reducing the complexity of using data and models for end users. Thus, DAFNI enables new advances in infrastructure research and improves the readiness of research tools and methods for real-world challenges at scale, nationally and internationally.
This paper presents DAFNI, discussing the motivations, aims and approach behind its development. It goes on to discuss its architecture and give more details on its approach to handling data and supporting user models in multi-systems workflows. Pilot studies are then discussed, demonstrating how DAFNI is being used to support research, including support for systems-of-systems modelling. Finally, the paper discusses emerging themes for new developments. In particular, there is a need for a richer information framework for data integration and exchange using common standards and semantics. Digital twins present additional challenges, with the combination of sensor networks and real-time data analysis adding a further layer of complexity, so the role of DAFNI in supporting an ecosystem of digital twins is considered.
Research undertaken in universities, exploring new models and algorithms, provides the leading edge of innovation in infrastructure systems analysis. Examples include Quantitative Urban ANalytics forecasTing (Batty and Milton, 2021), Synthetic Population Estimation and Scenario Projection Model (Lomax and Smith, 2020), Urban Development Model (Ford et al., 2019) and National Infrastructure Systems MODel (Nismod) (Hall et al., 2016, 2017). This research can be leveraged to exploit modern computing capacity and cloud computing technology (e.g. Microsoft Azure, Amazon Web Services, Kubernetes) coupled with advances in big data analytics, simulation, modelling and visualisation to scale up and integrate such models. This approach provides more detailed, high-quality projections of the impact of infrastructure development decisions on the natural, economic and social environment, so that more effective choices can be made in the provision of new infrastructure and investment can best support human flourishing (Schooling et al., 2020). However, a number of challenges need to be overcome to take advantage of these advances in computing.
The increase in data availability and resolution has enabled new modelling applications with increasing resolution and spatial and temporal coverage, with a corresponding increased demand for computational resources. However, maintaining large-scale resources, such as peta-scale data repositories or compute clusters, is costly and requires specialist skills, and high-performance computing (HPC) systems are technically challenging to access. Thus, the compute resources available to individual research groups may be limited, making iterative development and optimisation processes time consuming. This restricts the ability of modellers to understand impacts of simulations at a national scale while maintaining fine-grained resolution.
Data can be difficult to find and access, while licensing of data and models can be complex, with varied commercial and security conditions presenting a barrier to data sharing between organisations. A common approach to data security is needed, backed by specialised skills and processes so that data can be shared with and accessed by trusted partners.
The need to ensure that results are reliable and repeatable makes it essential to store versioned copies of the underlying data sets, with auditable provenance of results.
Analyses are currently undertaken as isolated activities at disparate institutions, with little collaboration or combining of outputs. However, infrastructure networks and their interactions with each other, people and the environment are inherently complex and heterogeneous, and handling this complexity can exceed the capacity of single teams.
For models to reflect real-world situations more accurately, there is a need for them to capture the interactions between systems in multi-systems models. These multi-systems models can be composed along two axes. Firstly, the components within a system can be aggregated into systems of systems at a higher scale. Thus, equipment items can be aggregated into models of plants, which themselves can be aggregated with other features into models of organisations, or of geographic localities, which in turn can be aggregated into cities, regions or nations. Secondly, the interactions between different infrastructure systems, such as water, transport, energy, waste, communications and the built environment, can be integrated into a common infrastructure model, with interactions with the natural, social and economic environments taken into account. The latter case is becoming increasingly important in, for example, the effects on the power distribution network of the change of transport to electric vehicles (Chaudry et al., 2022) or the effects on water supply of economic activity resulting from new transport links (ITRC Mistral, 2020). The variety and variability of these models present a significant challenge, as extensive domain expertise is required to exploit each model. Further, the models themselves need to be interoperable, through programmatic interfaces and common libraries. Data need to be shared and exchanged across the models and domains and across different scales and semantic representations. Thus, a common data integration framework is needed for a flexible multi-systems modelling system.
In response to these challenges, DAFNI has been developed as a shared platform to provide a dedicated compute resource for the national infrastructure modelling community. DAFNI has been supported by the UK Collaboratorium for Research on Infrastructure and Cities (UKCRIC, 2023) in a 4-year development phase (2017–2021) involving a consortium of 12 UK universities, led by the University of Oxford. The Scientific Computing Department of the UK’s Science and Technology Facilities Council (STFC) was commissioned by the consortium as a development partner and host. STFC’s role is to support national scientific research infrastructure, and it was seen as being well suited to the delivery of the platform.
The objectives of DAFNI are to provide a common platform to support scalable, collaborative research into infrastructure systems, as follows.
A common platform for sharing and combining data and models. The DAFNI platform provides a common computing hub for the infrastructure systems research community to store data and models and make them available to trusted collaborators.
A shared space to support collaborations and build multi-systems models. The shared platform can enable collaborations to build and execute more complex multi-system models at scale, accessing common data and combining shared models into workflows.
A legacy environment. Access to models, data and results in the repository can be made available and usable for the long term, providing a legacy environment persisting beyond the lifetime of individual research projects and traceability of the provenance of results.
DAFNI is intended to improve the opportunities for and quality of research and reduce the complexity of all aspects related to conducting the research in an HPC environment, including data access and processing, model execution, security and visualisation. It enables the combination of these features into a functional platform that addresses the data, licensing and scalability challenges identified earlier.
Within the research infrastructure landscape, there are other facilities that have a role similar to that of DAFNI within their respective domains, including the following.
The Australian Urban Research Infrastructure Network (Aurin, 2022) provides compute infrastructure and expert support for urban, regional and social science researchers across Australia. It develops advanced data and analytic capability for the adoption of high-impact research within the government and industry, holding reference data sets for long-term availability and providing simulation and visualisation capability for decision support. It does not provide a user environment with capability for users to supply their own models and data resources and construct their own workflows.
The Biodiversity and Climate Change Virtual Laboratory (Hallgren et al., 2016) is an Australian government-funded initiative aiming to reduce the barrier to entry into high-resolution climate change and biodiversity impact modelling, making high-end HPC infrastructure usable by researchers without specialist technical expertise. Through the ‘virtual data laboratory’, users can access over 4000 climate data sets and 300 environmental descriptors collocated onto a common geospatial and temporal grid. Further, users can execute pre-validated, managed models and either download results for custom offline post-processing or utilise one of several predefined techniques to analyse their results.
The Urban Center for Computation and Data (UrbanCCD, 2022) is a joint initiative at the University of Chicago and Argonne National Laboratory to support the study of urban science. The UrbanCCD does not provide a dedicated computing facility, but researchers may make use of the Argonne Leadership Computing Facility for batch computing.
Jasmin (Lawrence et al., 2013) is a globally unique data-intensive supercomputer for environmental science and currently supports over 1500 users on over 200 projects. Jasmin users research topics ranging from earthquake detection and oceanography to air pollution and climate science. Jasmin provides the UK and European climate and earth-system science communities with the ability to access very large sets of environmental data, which are typically too big to download and process using their own computers. This reduces the time it takes to test new ideas and obtain results from months or weeks to days or hours.
DAFNI is designed around a number of core components, as shown in Figure 1, and briefly described in the following.
DAFNI is hosted on a dedicated hardware cluster currently providing some 792 cores and ten graphics-processing unit nodes, with 2 PB of storage with a combination of fast and long-term storage available, which can be configured for different performance characteristics. Long-term storage uses the MinIO object-store system (see MinIO, 2023), while the compute cluster is configured using Kubernetes. Kubernetes (2023) is an open-source container orchestration system for automating software deployment, scaling and management; this allows the flexible deployment of user applications. DAFNI has developed a number of components on this foundation to support user applications.
National Infrastructure Database (NID). The NID is a centrally managed access point to national infrastructure and other data sets required to support infrastructure research. This includes a centrally managed data store, a data catalogue and a data access and publication service.
National Infrastructure Modelling Service (NIMS). The NIMS provides support to improve the performance of existing models, reduce the complexity of creating models and facilitate the creation of multi-systems models. It includes a model catalogue and a workflow creation and execution framework based on Argo (2023).
National Infrastructure Cloud Environment (NICE). The NICE provides a scalable cloud environment with a number of platform-as-a-service offerings to users, including Jupyter notebooks (see Jupyter, 2023). Currently, the NICE is used within the internal architecture of DAFNI, to deploy services within the cluster.
National Infrastructure Visualisation Suite (NIVS). The NIVS supports visualisation tools to facilitate understanding of data, models, outputs and translation of findings to decision makers. This includes traditional visualisation as a service (e.g. graph and tabular representations) and user-developed analyses using Jupyter Notebooks.
DAFNI Security Service (DSS). The DSS manages the security of the platform, which allows users to access and use seamlessly those services that they have rights to while at the same time maintaining security and integrity of data. Services include authentication, authorisation, monitoring and group management.
These components have been implemented in a microservice architecture (Jamshidi et al., 2018). This allows the capabilities within DAFNI to be developed independently with an extensible and flexible delivery of the platform in line with the evolving nature of the national infrastructure modelling landscape. Two central components, the NID and the NIMS, are discussed in more detail.
The NID is the foundation of DAFNI, a core service that allows researchers to upload, access and share data sets that are necessary to their research. It then manages the provision of data to models, workflows and visualisations, with outputs from model executions published back to the NID, allowing the research community access to the latest model outputs.
The NID uses a MinIO object storage instance with a capacity of up to 900 TB. The adoption of object storage allows DAFNI to be flexible and store any data in any format required. MinIO provides a cloud-native solution that integrates seamlessly into the underlying Kubernetes environment of DAFNI. This is supported by databases that store and manage the metadata records for each data set, providing data search and data versioning capabilities around the data store itself.
DAFNI researchers interact with the data store through the DAFNI Data Repository, shown in Figure 2, a tailor-made repository service that allows researchers to upload data to the NID and manage the access to those data, allowing others on the platform to access these either globally, individually or through groups. In addition, researchers can update their data sets and create new versions, while all registered users on the DAFNI platform can access and download the open-access data sets.
DAFNI has adopted a rich metadata schema, based on Data Catalog Vocabulary version 2 (W3C, 2020), a World Wide Web Consortium recommendation for interoperability between data catalogues, augmented with additional features supporting geospatial data, such as categories by Infrastructure for Spatial Information in Europe (Inspire, 2022) and GeoNames (2023) for spatial coverage. This provides a search-and-discovery service on the DAFNI platform and positions the platform for interoperability with other data stores. The approach is to encourage users to provide a rich metadata record of data from the start, thus supporting the access and reuse of data according to the findability, accessibility, interoperability and reusability (Fair) data principles (Wilkinson et al., 2016).
The metadata combine top-level contextual and licensing information with more detailed data set attributes, which drill down to the file level. This is combined with a description of the ownership and publication history of the data set to provide traceability and link each data set on DAFNI to its infrastructure research community. The metadata are indexed by the data-search-and-discovery service, built using Elasticsearch (2023), a powerful full-text search and analytics engine. Users can find data sets of interest to their research through a text search or by spatio-temporal filtering. Filters by data source, theme and file format are also supported.
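The combination of text search with spatio-temporal and format filters can be illustrated with a minimal Python sketch. It is an illustration only: the `DatasetRecord` class, its field names, the `search` function and the example catalogue are all hypothetical, and they represent neither the DAFNI metadata schema nor the Elasticsearch query interface, to which a production service would delegate this work.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DatasetRecord:
    """Simplified, hypothetical metadata record (not the DAFNI schema)."""
    title: str
    description: str
    theme: str
    file_format: str
    bbox: tuple  # (min_lon, min_lat, max_lon, max_lat)
    start: date  # start of temporal coverage
    end: date    # end of temporal coverage


def bbox_overlaps(a, b):
    """True when two (min_lon, min_lat, max_lon, max_lat) boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(records, text=None, bbox=None, period=None, file_format=None):
    """Return the records that match every filter that is supplied."""
    hits = []
    for rec in records:
        # Case-insensitive substring match over title and description.
        if text and text.lower() not in f"{rec.title} {rec.description}".lower():
            continue
        # Spatial filter: keep records whose coverage intersects the query box.
        if bbox and not bbox_overlaps(rec.bbox, bbox):
            continue
        # Temporal filter: keep records whose coverage overlaps the period.
        if period and not (rec.start <= period[1] and period[0] <= rec.end):
            continue
        if file_format and rec.file_format != file_format:
            continue
        hits.append(rec)
    return hits


# Invented example records, for illustration only.
CATALOGUE = [
    DatasetRecord("Rail station usage", "Annual entries and exits", "transport",
                  "csv", (-6.0, 50.0, 2.0, 59.0), date(2010, 1, 1), date(2019, 12, 31)),
    DatasetRecord("Reservoir levels", "Monthly water storage", "water",
                  "netcdf", (-3.5, 51.0, 0.5, 53.0), date(2000, 1, 1), date(2015, 12, 31)),
]

if __name__ == "__main__":
    period = (date(2012, 1, 1), date(2013, 1, 1))
    for rec in search(CATALOGUE, text="water", period=period):
        print(rec.title)
```

Each filter is optional and the filters are combined conjunctively, which mirrors the way a user narrows a result set by adding further criteria.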
The NIMS encompasses both the model catalogue and model workflow systems on DAFNI. The purpose of the NIMS is to allow DAFNI users to run user-supplied models through the use of workflows without specialised knowledge of HPC systems or programming.
The execution of user-generated models and their combination into multi-systems models is challenging because of the compatibilities required between models. Each model, developed by independent groups of researchers and software engineers, has a set of dependencies on programming language, packages and libraries. These dependencies make porting models onto a common platform a complex and time-consuming process, a significant barrier to the use of HPC. Further, coupling models together requires the sharing of data in interoperable formats and access to application programming interfaces for models to communicate.
To address these challenges, the DAFNI NIMS utilises containerisation using the Docker packaging system (see Docker, 2023) to encapsulate functionality and dependencies. Docker builds self-contained packages encapsulating the model executable together with its execution environment and also bundling configuration and library files. A model definition file in the YAML Ain’t Markup Language (YAML, 2023) format is also provided to accompany the ‘dockerised’ model, specifying the interfaces, input parameters and data sets and outputs of the model, together with metadata that will be displayed in the model catalogue. Dockerised models can then be uploaded onto the platform and can be deployed and executed through the Kubernetes system. Thus, DAFNI can execute user code independent of its dependencies.
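The role of a model definition can be sketched in Python as follows. The sketch is hedged throughout: the definition is written as a Python dictionary rather than YAML so that only the standard library is needed, and every key name, the example model and the use of environment variables to pass parameters into a container are assumptions made for illustration, not the DAFNI model definition format.

```python
# Hypothetical model definition. Key names are illustrative and do not
# reproduce the DAFNI model definition schema.
MODEL_DEFINITION = {
    "metadata": {"name": "population-growth",
                 "summary": "Projects population by region"},
    "image": "population-growth:1.0",
    "inputs": {
        "parameters": [
            {"name": "growth_rate", "type": "number",
             "default": 0.01, "min": 0.0, "max": 0.1},
            {"name": "years", "type": "integer",
             "default": 10, "min": 1, "max": 50},
        ],
        "datasets": [{"name": "baseline_population",
                      "path": "inputs/population/"}],
    },
    "outputs": {"datasets": [{"name": "projection.csv", "format": "csv"}]},
}

# Mapping from declared parameter types to acceptable Python types.
PYTHON_TYPES = {"number": (int, float), "integer": (int,)}


def resolve_parameters(definition, overrides):
    """Merge user overrides with declared defaults, rejecting invalid values."""
    declared = {p["name"]: p for p in definition["inputs"]["parameters"]}
    unknown = set(overrides) - set(declared)
    if unknown:
        raise ValueError(f"undeclared parameters: {sorted(unknown)}")
    resolved = {}
    for name, spec in declared.items():
        value = overrides.get(name, spec["default"])
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, PYTHON_TYPES[spec["type"]]):
            raise TypeError(f"{name} must be of type {spec['type']}")
        if not spec["min"] <= value <= spec["max"]:
            raise ValueError(f"{name}={value} outside [{spec['min']}, {spec['max']}]")
        resolved[name] = value
    return resolved


def to_environment(resolved):
    """Render parameters as strings; one possible way to hand them to a container."""
    return {name.upper(): str(value) for name, value in resolved.items()}


if __name__ == "__main__":
    params = resolve_parameters(MODEL_DEFINITION, {"years": 25})
    print(to_environment(params))
```

Declaring parameters with types, defaults and bounds lets a platform validate a run request before any container is started, which is one reason a declarative definition is useful alongside the image itself.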
Models are uploaded into a model catalogue, shown in Figure 3(a), a repository of models based on Harbor, an open-source system providing a registry of containers (see Harbor, 2023). Metadata describing the model are supplied by the user on upload, providing a searchable catalogue, subject to the user and group access permissions set within the DSS.
Workflows allow users to create multi-systems models and to output the results of these workflows to share with other users. Each workflow consists of a series of chained containers, one for each operation, with a centralised job manager to handle data collection and data exchange between the containers. On execution, the Kubernetes orchestration engine allocates resources and deploys the workflow into ‘pods’ across a number of nodes in the cluster, where each pod can be executed on its own resources. This flexibility allows for more dynamic allocation of resources within DAFNI and allows any operation that can be containerised to be used within the workflows (e.g. data transformation and visualisation). To build a workflow, users construct a series of interconnected steps, as shown in Figure 3(b). The step types are described in the following.
Model. The model step facilitates the execution of a model. Users can choose the model from the model catalogue and set any input parameters for the model, with data selected from the NID. Models can also be chained together in the workflow, passing output data from a model into the inputs of the next to allow for multi-systems modelling. For example, a model that simulates population growth can be chained to a model that relies on population numbers to predict house prices, thus allowing the exploration of the effect on house prices of different demographic scenarios.
Iterator. Iterators allow the same step in the workflow to be repeated multiple times while changing parameters within a given range, either randomly or with a predefined increment. Multiple executions of the same model can then be completed in parallel where possible, whether as a sweep across a range of values or as a Monte Carlo simulation using randomly sampled parameters.
Publisher. The publisher step takes outputs from a model and ingests them into the NID. The user supplies metadata about the resultant data set, which will be displayed in the data catalogue.
Visualisation. The visualisation step takes the outputs of a model and creates a visualisation builder containing those outputs using the NIVS. This allows the user to go directly from the results of a finished workflow into generating graphs or charts from those results in a visualisation builder or through a user-programmable Jupyter notebook.
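The way in which model, iterator and publisher steps fit together can be sketched in Python. This is a minimal illustration under stated assumptions: plain functions stand in for containerised models, a thread pool stands in for the cluster scheduler and a dictionary stands in for the NID. All function names and the toy population and house price formulae are hypothetical and are not part of DAFNI or Argo.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def population_model(base_population, growth_rate, years):
    """Stand-in for a containerised population model (toy formula)."""
    return {"population": base_population * (1 + growth_rate) ** years}


def house_price_model(population, price_factor):
    """Stand-in for a model that consumes the population output (toy formula)."""
    return {"mean_price": population * price_factor}


def run_chain(growth_rate):
    """Model steps chained: output of the first feeds the input of the second."""
    pop = population_model(base_population=1_000_000,
                           growth_rate=growth_rate, years=10)
    price = house_price_model(pop["population"], price_factor=0.25)
    return {"growth_rate": growth_rate, **pop, **price}


def stepped_values(start, stop, step):
    """Parameter values from start to stop with a predefined increment."""
    count = int(round((stop - start) / step))
    return [start + i * step for i in range(count + 1)]


def random_values(low, high, count, seed=0):
    """Randomly sampled parameter values, as in a Monte Carlo run."""
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(count)]


def iterate(step, values, max_workers=4):
    """Iterator step: repeat `step` for each value, in parallel where possible."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(step, values))  # map preserves input order


def publish(results, repository, title):
    """Publisher step: store results together with user-supplied metadata."""
    repository[title] = {"metadata": {"title": title, "runs": len(results)},
                         "data": results}
    return repository[title]


if __name__ == "__main__":
    repository = {}  # stand-in for the data repository
    results = iterate(run_chain, stepped_values(0.0, 0.02, 0.01))
    published = publish(results, repository, "House price scenarios")
    print(published["metadata"])
```

Because each run of the chain is independent of the others, the iterator can dispatch runs concurrently; on a cluster the same pattern would map each run to its own container rather than to a thread.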
The initial phase of the DAFNI construction programme (2017–2021) was a requirements and design study that developed a detailed architecture. As the DAFNI platform evolved, a series of pilots validated the functionality and refined requirements while demonstrating the benefits of its additional computing power. These pilots included railway station planning and demand prediction (Young and Blainey, 2018; Young et al., 2019), 5G cell tower placement, house demand and pricing and urban and economic development. Further, a programme of DAFNI ‘Champions’ was introduced, looking at case studies in transport, including using the MATSim multi-agent transport simulation framework (Horni et al., 2016) and exploring how DAFNI might support a digital twin of road traffic in conjunction with the Sheffield Urban Observatory.
A significant pilot involved working closely with the Nismod system of the UK Infrastructure Transitions Research Consortium (ITRC, 2023), a key example of a collaborative environment within infrastructure systems research. Before implementation on DAFNI, Nismod access was available only to members of the immediate research group and the model had not been optimised for more general research challenges. The first DAFNI pilot focused on the Nismod-1 system-of-systems modelling application developed as part of the ITRC project and hosted at Newcastle University. Nismod-1 ran on a single machine supporting five models of UK infrastructure: energy supply (Chaudry et al., 2022), water supply (Dobson et al., 2020), solid waste, transport (Blainey and Preston, 2019) and waste water. The models explored the needs of these infrastructure components based on estimates of trends in areas such as population growth, economic growth and climate change. A key need for Nismod-1 was sensitivity analysis: determining whether the uncertainty of a given input parameter changes the ‘preferred’ solution to an infrastructure problem (Pianosi et al., 2016). Without proper understanding of this sensitivity, predictions are of limited use. With a large number of input parameters to each of the Nismod models, a full sensitivity analysis requires running very many simulations while varying each input in turn, a highly compute-intensive process. The first pilot ported the Nismod-1 system onto the DAFNI cluster and provided a batch processing system to submit multiple sensitivity analyses. As a result, the Nismod-1 team successfully ran a number of sensitivity analyses on the water supply models, which ran up to ten times faster than on the original service. This demonstrates the benefits that can be derived by moving existing, proven infrastructure models onto a high-throughput cluster.
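The reason that sensitivity analysis benefits from a high-throughput cluster can be seen in a minimal Python sketch of a one-at-a-time scheme, in which each input is varied in turn around a baseline and the independent runs are executed concurrently. The sketch rests on several assumptions: `toy_water_model`, its parameters and the baseline values are invented for illustration and bear no relation to the Nismod models, a thread pool stands in for the batch processing system, and the output spread used here is a far simpler measure than the methods reviewed by Pianosi et al. (2016).

```python
from concurrent.futures import ThreadPoolExecutor


def toy_water_model(demand, leakage, rainfall):
    """Hypothetical stand-in model: supply margin in arbitrary units."""
    return rainfall * 0.5 - demand * (1 + leakage)


def one_at_a_time_runs(baseline, ranges):
    """Build parameter sets that vary each input in turn around the baseline."""
    runs = []
    for name, values in ranges.items():
        for value in values:
            runs.append((name, value, {**baseline, name: value}))
    return runs


def sensitivity(model, baseline, ranges, max_workers=4):
    """Return, for each input, the spread (max minus min) of the model output."""
    runs = one_at_a_time_runs(baseline, ranges)
    # The runs are independent of one another, so they can execute concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outputs = list(pool.map(lambda run: model(**run[2]), runs))
    extremes = {}
    for (name, _value, _params), out in zip(runs, outputs):
        low, high = extremes.get(name, (out, out))
        extremes[name] = (min(low, out), max(high, out))
    return {name: high - low for name, (low, high) in extremes.items()}


# Invented baseline and ranges, for illustration only.
BASELINE = {"demand": 100.0, "leakage": 0.2, "rainfall": 400.0}
RANGES = {
    "demand": [90.0, 110.0],
    "leakage": [0.1, 0.3],
    "rainfall": [300.0, 500.0],
}

if __name__ == "__main__":
    spreads = sensitivity(toy_water_model, BASELINE, RANGES)
    for name, spread in sorted(spreads.items(), key=lambda item: -item[1]):
        print(f"{name}: {spread:.1f}")
```

The number of runs grows with the number of inputs multiplied by the number of values tested for each, so even this simple scheme becomes expensive for models with many parameters; since no run depends on another, the workload parallelises well.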
Moving the data as well as the software to the DAFNI system is key to obtaining scalable performance. The work on the Nismod pilot has continued through the development of Nismod-2, and its implementation on DAFNI as the platform has evolved. Workflows supporting Nismod scenarios are now available on DAFNI, which provides Nismod users with a long-term execution environment.
Further projects are now using the DAFNI platform. The Open Climate Impact project (see UKRI Gateway, 2023) is developing a modelling framework to explore the impact of future climate change scenarios on infrastructure, exploring factors such as flood events in urban environments, the effect of extreme heat events on the population and the effect on agriculture. The project has particular emphasis on adapting the environment to climate change and the mitigating effects that those adaptations might have. DAFNI is being used in the project as a common modelling framework to connect the different models and to provide a legacy space so that the workflow can be accessed in the long term. The Centre for Greening Finance and Investment (CGFI, 2023) is also planning to use the DAFNI platform similarly to host and develop a shared data and modelling framework to explore how environmental change will impact the risks on investment, insurance and other activities within the finance industry.
Data remain the central driver for the future research and exploitation of computational models of infrastructure systems, and richer handling of data would enhance the power and range of DAFNI for researchers. The following extensions and enhancements to the NID are being explored.
Fair data publication and data curation. DAFNI supports a metadata description for data and models and thus partially satisfies the Fair principles. For DAFNI to support reusable reference data within the research community, this needs to be enhanced to support a data publication pipeline, underpinned by data curation processes to update and maintain data for the long term.
Scaling and querying large data. The handling of large data sets within workflows can be inefficient, as data copying and transfer is a high-latency exercise, and queries are frequently applied early in the workflow to extract only the data slice needed for processing. Large immutable data sets can instead be treated as static objects that can be accessed in a common manner across different processes, with data slicing taking a ‘data-cube’ approach.
Interoperability framework. The data and modelling framework in DAFNI has the advantage of being generic and thus can accept data in any format. However, in linking models into workflows, there remains the need to undertake data manipulation tasks, such as queries, format transformation, projections between scales and other data transformations. By providing enhanced support for particular data formats and providing a suite of ‘data adaptors’ or ‘data transforms’ in an interoperability framework, the process of data manipulation can be simplified.
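One possible shape for such a suite of data adaptors is a registry of small conversion functions that can be composed along a path of formats. The Python sketch below illustrates the idea only; the `adaptor` decorator, the `convert` function and the format names are hypothetical and do not describe an existing or planned DAFNI interface.

```python
import csv
import io
import json

# Registry mapping (source format, target format) to a conversion function.
ADAPTORS = {}


def adaptor(source, target):
    """Register a function converting `source` format into `target` format."""
    def register(func):
        ADAPTORS[(source, target)] = func
        return func
    return register


@adaptor("csv", "records")
def csv_to_records(text):
    """Parse CSV text into a list of row dictionaries (values stay as strings)."""
    return list(csv.DictReader(io.StringIO(text)))


@adaptor("records", "json")
def records_to_json(records):
    """Serialise row dictionaries to a JSON string with stable key order."""
    return json.dumps(records, sort_keys=True)


def convert(data, path):
    """Apply adaptors along a list of formats, e.g. ['csv', 'records', 'json']."""
    for source, target in zip(path, path[1:]):
        try:
            step = ADAPTORS[(source, target)]
        except KeyError:
            raise LookupError(f"no adaptor from {source} to {target}") from None
        data = step(data)
    return data


if __name__ == "__main__":
    table = "region,population\nA,10\nB,20\n"
    print(convert(table, ["csv", "records", "json"]))
```

Routing conversions through a shared intermediate form such as `records` means that each new format needs only adaptors to and from that form, rather than one for every pair of formats.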
Semantic framework. A further extension to the interoperability framework would be to introduce the use of ontologies. By supporting a selected suite of ontologies, rich-data-enhanced mappings can be supported within workflows and the search-and-discovery service can be enhanced. An exploration of the use and availability of suitable ontologies, with recommendations for future development, was undertaken within the DAFNI Champions programme (Varga et al., 2021, 2022).
The concept of digital twins (Batty, 2018; Callcut et al., 2021) has emerged over the last decade as a key technology for the future planning, delivery and operation of infrastructure systems. There has been a high level of interest from the government and industry in investing in digital twins as a tool to predict, optimise and control the outcomes of infrastructure investment. Initiatives such as the Digital Twin Hub (DT Hub, 2023) have been developing frameworks for combining digital twin models into a ‘national digital twin’ (NDT), ‘a digital model of our national infrastructure which will be able both to monitor our infrastructure in real-time, and to simulate the impacts of possible events’ (NIC, 2017: p. 3), through the sharing of data and computational resources into a common digital twin ecosystem. DAFNI can play a role in the development and deployment of digital twins for infrastructure systems by providing support for features that an NDT would require to be effective.
An NDT would require an ecosystem of models from a wide range of sources, which can be combined into large-scale, multi-system digital twins. Running multiple twins at scale will require HPC environments that allow models to be executed rapidly and scaled up in resolution and in geographical and temporal range. The platform-independent approach of DAFNI offers the basis of such an environment. Further, an NDT needs to support the combination of models into new workflows to support connected digital twins and provide visualisations of results for human decision support, again supported within DAFNI.
Digital twins also bring significant challenges for data management and integration, and an NDT requires a wide range of different data sources to be brought together into a shared trusted data space, in a common information-management framework. For the sustainability of the NDT, it needs to be maintained and curated for the long term. Again, DAFNI already offers the NID, which may form the basis of such a data space.
Thus, DAFNI can provide a hub for a digital twin infrastructure, supporting the research and development required to explore the opportunities of deploying digital twins within infrastructure systems development and operation.
DAFNI is working with initiatives such as the UKCRIC Urban Observatory programme (see Urban Observatory, 2023) and interacting with key stakeholders, including the Connected Places Catapult (CPC, 2023). It has developed pilot digital twins, including those on traffic management with Sheffield University. Further development on DAFNI is exploring how to provide additional functionality to support digital twins, including extending the information-management framework in DAFNI, as discussed earlier; interacting with real-time input and streaming data systems; and working with machine learning to form adaptive models for decision making from historic data.
DAFNI is an infrastructure platform to support the development and sharing of multi-systems models of national infrastructure. The data and model sharing allows access to models across infrastructure systems and across collaborations. DAFNI thus offers the infrastructure systems engineering community a space to leverage their research into wider and deeper applications.
Further, DAFNI also supports collaborations between researchers, the government and industry. Data and computation are central to infrastructure engineering practice. The UK National Infrastructure Commission (NIC, 2017: p. 38) observes that ‘[d]ata is part of infrastructure and needs maintenance in the same way that physical infrastructure needs maintenance’. DAFNI provides the basis for a trusted, common, vendor-neutral hub for data sharing and exchange with support for maintaining the value of these assets for the longer term. Further, it is recognised that while large-scale computing is valuable to solve new business and research challenges, it remains hard for non-specialists to access and use (see e.g. GO-Science, 2021). DAFNI provides a user-friendly environment that seeks to overcome some of these technical barriers.
DAFNI has transitioned from a development project to a service platform. This enables DAFNI’s operational growth to increase usage and capability to support research in Engineering and Physical Sciences Research Council’s (EPSRC) Engineering Programme and related fields, so that the UK’s national infrastructure research can remain at the cutting edge. Further, it allows DAFNI to continue with its aim to support changing and sustainable infrastructure through working with the government and industry.
DAFNI has been supported by the EPSRC grants EP/R012202/1, UKCRIC National Infrastructure Database, Modelling, Simulation and Visualisation Facilities (2017–2021), and EP/V054082/1, DAFNI-Research Only Strategic Equipment (DAFNI-ROSE) (2021–2023). The authors would like to thank the members of the DAFNI support and development team in STFC and the collaborators and champions involved in specifying requirements and developing pilot cases within the DAFNI environment.