Journal-of-Information-Technology-and-Scientific-Innovation

Article ID: PD2601202006

Volume 1 (2026)

Published 29 Jun 2026

A Hybrid Physics-Informed Deep Learning Framework for Robust Multivariate System Modelling Under Uncertainty – Copy

📚 Cited by: 0

PDF

⬇ Downloads: 15

Author

¹Independent Researcher in Secure Mission-Critical Enterprise Systems, USA

Article History:

Received: 08 April, 2026

Accepted: 27 June, 2026

Revised: 23 June, 2026

Published: 30 June, 2026

ABSTRACT:

Introduction: Accurate traffic forecasting is a central requirement for intelligent transportation systems because urban road networks exhibit complex spatial interactions, non-stationary temporal patterns and dynamic congestion propagation. Conventional recurrent models can represent temporal continuity but usually ignore explicit road-network structure, while fixed-topology graph models may not fully capture changing sensor relationships.

Study Design: This study presents a METR-LA evaluation of a static-adaptive graph attention Transformer architecture for traffic-speed prediction. The framework combines static-adaptive graph fusion, dual-branch spatial encoding, Transformer-based temporal modelling and sparsity-regularised graph learning. The static graph preserves stable sensor relationships, whereas the adaptive graph learns hidden data-driven dependencies through trainable node embeddings. Local graph aggregation captures neighbourhood-level traffic propagation, while global self-attention models non-local sensor interactions across the wider network. The temporal Transformer encoder then models the sequence-level traffic dynamics through multi-head self-attention, enabling multi-step forecasting over short- and medium-term horizons.

Methodology: The methodology follows a chronological METR-LA benchmarking protocol using training-only normalisation, sliding-window sample generation, fixed random seeds, saved-checkpoint evaluation and horizon-aware reporting at 15-, 30- and 60-minute horizons.

Analysis: Comparative analysis against persistence, recurrent, temporal convolutional and graph-based baselines is reported using repeated-run summaries, ablation analysis and cautious seed-level statistical testing.

Conclusion: The study presents a technically integrated and reproducibility-oriented framework for adaptive, graph-aware traffic forecasting in intelligent urban mobility systems.

Keywords: Spatio-temporal graph neural networks, adaptive adjacency, METR-LA, traffic speed forecasting, multi-scale attention, transformer encoder, graph sparsity, intelligent transportation systems.

1. INTRODUCTION

Urban mobility systems produce large-scale sensor streams that are spatially dependent and temporally dynamic. A speed change at one freeway sensor may propagate to downstream sensors, while recurring daily patterns, incidents and capacity fluctuations alter the temporal profile of each node [1, 2, 3]. This makes traffic forecasting a graph time-series problem rather than an ordinary univariate prediction task. In this setting, a road network can be represented as G = (V, E, A), where V denotes the sensor nodes, E denotes the edges or relationships between sensors, and A denotes the adjacency matrix used to control spatial message passing [4].

Traditional recurrent and convolutional time-series models can learn temporal trends but often ignore the topology of the transportation network [5, 6]. Graph neural networks address this gap by allowing each sensor to aggregate information from other sensors through a graph structure. However, the graph used by many early models is fixed and is usually based on road distance, physical connectivity or expert assumptions. Such assumptions are useful but incomplete because real congestion propagation can change with demand, lane disruptions, time of day and non-recurring events [7, 8].

This paper presents an adaptive attention-driven architecture for graph-based traffic forecasting. The design combines four elements: static-adaptive graph fusion, local graph diffusion, global spatial attention and temporal self-attention. The static graph preserves stable sensor relationships, while the adaptive graph captures hidden sensor dependencies that are not directly encoded by a fixed topology. The local branch learns neighbourhood propagation, and the global branch allows non-neighbouring sensors to influence one another when their speed patterns become correlated. The temporal Transformer encoder then processes each node-specific sequence using self-attention rather than sequential recurrence.

The empirical design is based on METR-LA because it is a recognised benchmark in traffic forecasting. METR-LA contains traffic-speed observations from 207 loop-detector sensors in Los Angeles County at five-minute intervals and has been widely used in graph-based traffic forecasting studies [9, 10]. Unlike a small pilot file with only a few dozen time points, METR-LA provides enough temporal coverage to support a standard 12-step input and 12-step output forecasting setting. The paper therefore focuses on benchmark-ready validation rather than small-sample demonstration.

The main contribution of this study is a technically integrated spatio-temporal forecasting architecture for graph-based traffic-speed prediction. The contribution is defined as technical integration rather than the independent invention of graph neural networks, attention mechanisms, adaptive adjacency learning or Transformer modelling. The proposed framework combines static-adaptive graph fusion, local graph diffusion, global spatial attention and Transformer-based temporal encoding within one controlled forecasting pipeline. This integrated design is evaluated using chronological data splitting, training-only normalisation, fixed-seed repeated reporting, ablation analysis, saved-checkpoint evaluation and horizon-aware reporting at 15-, 30- and 60-minute forecasting horizons. The manuscript therefore positions the work as a controlled technical integration of existing graph-learning and attention-based forecasting principles rather than as a claim that any single component is entirely new.

2. LITERATURE REVIEW

2.1. Traffic Forecasting as Graph Time-Series Learning

Traffic forecasting has shifted from classical statistical modelling to graph-based deep learning because traffic sensors are not independent. Graph-based methods explicitly encode relationships among sensors and allow each node representation to be updated using nearby or correlated nodes. [11] reviewed the field and noted that graph neural networks are well suited to traffic systems because road networks naturally contain graph structures. Graph Transformers are another advancement, which instead of recurrent layers, involves self-attention mechanics to achieve more efficient long-range temporal learning. The study by [12] has proven the potential of the Transformer architectures in the graph-based tasks, where self-attention performs better than the RNN-based models to capture long-range dependencies on the spatio-temporal data. Transformer-based method is also advantageous in terms of parallelisation and scalability that makes it to be more applicable in large scale and real-time traffic forecasting techniques.

The ASTGCN (Attention-based Spatio-Temporal Graph Convolutional Network) of [13] involves the use of spatial and temporal attention to improve the model capacity to learn local and global dependencies. This model uses spatial attention to concentrate on local dependencies in traffic flow data and temporal attention to model long-term trends, which is effective than other methods to use in complex traffic networks to forecast. ASTGCN is a valuable advancement as it introduces multi-head attention to learn more spatio-temporal relationships [14].

Additionally, the Graph Multi-Attention Network (GMAN) by [15] uses multi-head attention to focus on local- and global-scale spatial dependencies at once. GMAN proposes multi-scale attention, which enhances the capacity of the model in containing the heterogeneous traffic characteristics that are very appropriate in real-time forecasting of traffic in the dynamic traffic scenario. Multi-scale attention enables GMAN to be more flexible to different traffic conditions, since it is able to capture the local relationships that are fine-tuned, as well as the large-scale system-wide impacts.

[16] introduced dynamic graph learning in STFGNN (Spatio-Temporal Fusion Graph Neural Network), which is a significant development. This network assumes a dual pathway model that incorporates graph convolutions of spatial and GRUs (Gated Recurrent Units) of temporal dependencies. The combination of space and time characteristics improves the development of intricate interactions of traffic flow data both in space and time and the overall forecasting outcomes [17]. The present study follows this graph time-series view and treats METR-LA as a dynamic multivariate signal defined over a sensor graph.

2.2. Fixed-Topology Spatio-Temporal Graph Models

Early spatio-temporal graph models such as STGCN and DCRNN established the usefulness of graph convolution for traffic prediction. STGCN used graph convolution and temporal convolution to avoid fully recurrent training, while DCRNN represented traffic propagation as a diffusion process over a directed graph [18]. These methods remain important baselines because they combine spatial graph learning with temporal forecasting. [19] suggest that foundational models such as ST-GCN and DCRNN combine graph convolutions with recurrent neural networks (RNNs) to integrate spatial topology and temporal sequences, but most of them use fixed adjacency matrices based on physical distance, road connectivity, or expert knowledge. These fixed topologies do not scale to the dynamism of traffic flows, in which spatial dependencies change over time due to varying patterns, incidents, external forces such as weather, and events. This rigidity causes inefficient representation of changing network forms, which causes reduced prediction performance, especially when there are non-recurring congestion or non-homogeneous urban conditions [20]. However, fixed topology can be restrictive when the true dependence between two sensors changes over time. Published METR-LA results show that fixed or semi-fixed graph baselines still perform competitively, but their errors increase at longer horizons.

2.3. Adaptive Graph Learning and Dynamic Dependency Modelling

Adaptive graph learning addresses the fixed-topology limitation by learning sensor relationships directly from data. [21] proposed AGCRN, which uses node-adaptive parameter learning and data-adaptive graph generation to infer hidden dependencies. [10] introduced D2STGNN, which separates diffusion and inherent components of traffic signals and includes dynamic graph learning. [22] proposed an evolutionary graph neural network that continuously updates a semantic adjacency matrix during training. These models have low structural interpretability, with learned representations being black boxes, making it difficult to analyse what spatial or temporal aspects a prediction is being driven by [23, 24]. These studies motivate the adaptive component of the present architecture, but this paper retains a static prior so that learned edges do not completely ignore physical or correlation-based structure.

2.4. Multi-Scale Spatial Attention

Single-scale aggregation may not be sufficient for traffic networks because congestion may be local in one period and network-wide in another. Multi-scale models attempt to capture neighbourhood patterns, regional trends and wider system effects. [5] proposed STGMS, a multi-scale spatio-temporal graph neural network that decomposes traffic features into multiple time scales and combines attention with graph convolution. [25] developed a long-term spatio-temporal graph attention network and evaluated it on METR-LA and PEMS-BAY. The present paper uses a dual spatial encoder: a local diffusion branch for graph-neighbourhood propagation and a global attention branch for long-range sensor interactions.

2.5. Transformer-Based Temporal Forecasting

Transformers have become influential in traffic forecasting because self-attention can connect distant time steps without recurrent recurrence. [26] proposed an adaptive graph spatial-temporal Transformer that models cross-spatial-temporal correlations. [22] showed that spatial-temporal Transformer networks can be used for traffic flow forecasting through carefully designed embeddings. The present method uses a Transformer encoder after spatial enrichment so that temporal attention is applied to node-wise hidden sequences rather than raw sensor values. This limits the attention burden and allows spatial encoding to shape temporal representations.

2.6. Benchmark Datasets and Reproducibility Requirements

Benchmark choice is central to research credibility. METR-LA and PEMS-BAY are widely used because they contain hundreds of sensors and tens of thousands of time steps. The Zenodo release provides METR-LA.csv and PEMS-BAY.csv in accessible CSV form, while LibCity documents METR-LA as a Los Angeles County loop-detector dataset with 207 sensors. A benchmark-ready study must preserve chronological order, avoid normalisation leakage, use repeated runs where feasible and report horizon-specific errors. Therefore, this manuscript adopts METR-LA, a 12-to-12 forecasting design and multiple reporting layers rather than relying on a very small traffic file.

2.7. Research Gap and Technical Positioning

The literature shows that adaptive graphs, attention mechanisms and Transformers are individually useful, but their integration must be carefully controlled (Table 1). A model that is too dynamic may overfit noisy relationships, while a model that is too static may miss time-varying propagation. Similarly, global attention improves flexibility but can become dense and difficult to interpret. The gap addressed here is the need for a unified architecture that combines static graph prior, learnable adaptive adjacency, local/global spatial attention and sparsity regularisation, with a validation protocol that is strong enough for benchmark-level assessment [5, 27].

Table 1. Recent high-quality literature informing the proposed architecture.

Study	Model / Type	Main Technical Idea	Relevance to This Paper
[5]	STGMS	Multi-scale decomposition with ST attention	Supports scale-aware graph encoding
[10]	D2STGNN	Decoupled diffusion and inherent traffic signals	Motivates dynamic graph and signal separation
[20]	Survey	GNN-based traffic forecasting review	Frames traffic forecasting as graph learning
[21]	AGCRN	Adaptive graph generation and node-adaptive parameters	Supports data-driven hidden sensor dependencies
[22]	Evolutionary GNN	Dynamic semantic adjacency update	Supports adaptive topology refinement
[25]	LSTGAN	Long-term spatio-temporal graph attention	Supports attention for longer historical context
[28]	MD-GCN	Multi-scale temporal dual graph convolution	Supports multi-scale temporal/spatial reasoning
[29]	ISTGCN	Integrated spatio-temporal graph blocks	Supports stronger spatial-temporal integration

2.8. Critical Synthesis of 2020-2025 Studies

The 2020–2025 literature reveals four methodological movements in spatio-temporal traffic forecasting. The first is the transition from fixed spatial graphs to adaptive or learned dependency structures, as seen in AGCRN, D2STGNN and evolutionary graph-learning designs [10, 21, 22]. The second is the move from single receptive fields to multi-scale spatial or temporal encoders, as seen in MD-GCN, STGMS and long-term graph attention models [5, 25, 28]. The third is the increasing use of attention and Transformer structures to model non-local temporal and spatial interactions [22, 26]. The fourth is the recognition that reproducibility is part of methodological contribution: a model is not persuasive unless dataset choice, split strategy, missing-value handling, hyperparameter configuration, horizon-wise reporting and statistical interpretation are explicit [15, 25, 30–33].

Against this background, the present framework is positioned as a bounded-adaptivity architecture. It does not discard graph priors because stable structural or statistical relationships still contain useful information. It also does not rely only on static graphs because traffic conditions can create dependencies that fixed topology alone cannot express. The proposed graph learner therefore implements a fusion mechanism between a static training-derived graph and an adaptive learned graph, allowing the system to retain stable network structure while learning additional latent dependencies from speed observations.

2.9. Distinction from Closely Related Models

The proposed design differs from closely related traffic-forecasting models in the way it combines graph structure, spatial attention, temporal modelling and graph sparsity. DCRNN models traffic propagation as diffusion over a directed graph and uses recurrent sequence modelling for temporal dependency [9]. Graph WaveNet introduces an adaptive dependency matrix learned through node embeddings and combines it with dilated temporal convolution [34, 35]. AGCRN develops adaptive graph generation and node-adaptive parameter learning for traffic forecasting [21]. ASTGCN and GMAN use attention mechanisms to strengthen spatio-temporal dependency modelling [13, 15]. These studies provide important foundations for graph-based traffic prediction.

The present model is not claimed to be novel because it introduces attention, adaptive adjacency, graph convolution or Transformer modelling for the first time. Its contribution lies in the controlled technical integration of these ideas: static-adaptive graph fusion controls the topology, local graph diffusion preserves neighbourhood propagation, global spatial attention captures non-local sensor interactions, temporal self-attention models multi-step sequence dynamics, and L1 graph regularisation discourages dense uninterpretable adaptive connectivity. This distinction is important because it avoids alternating between different novelty claims. Throughout the manuscript, the novelty claim is therefore stated consistently as technical integration novelty.

3. METHODOLOGY

3.1. Problem Definition

Let in represent the traffic observation matrix at time t, where N is the number of sensors and F is the number of node features. For METR-LA, the primary feature is traffic speed, so F = 1 and N = 207. Given an input window with M = 12 historical observations, the task is to predict H = 12 future observations. Since METR-LA is sampled every five minutes, this corresponds to using one hour of historical speed data to forecast one hour ahead.

The prediction function is written as Equation (1):

Where, is the static graph prior and is the learned adaptive graph. The model parameters theta is optimised by minimising forecasting error while penalising unnecessarily dense adaptive edges.

3.2. METR-LA Dataset, Preprocessing and Chronological Split

This study uses METR-LA (Table 2) as the sole experimental dataset. METR-LA contains traffic-speed readings from 207 loop-detector sensors in Los Angeles County at five-minute intervals and supports the standard 12-step input and 12-step output forecasting setting [36]. Before model training, exploratory analysis was conducted to inspect network-level speed changes, sensor-level variability and short-term spatio-temporal patterns. These exploratory figures (Fig. 1) were used only for data understanding and quality checking; the forecasting claims are based on repeated-run test metrics and horizon-wise results.

Table 2. METR-LA experimental protocol.

Item	Configuration
Dataset	METR-LA traffic-speed benchmark
Sensor nodes	207
Sampling interval	5 minutes
Raw variable	Traffic speed
Raw timestep count	34,272
Input length	12 steps = 60 minutes
Forecast horizon	12 steps = 60 minutes
Reported horizons	15, 30 and 60 minutes
Split strategy	Chronological 70% / 10% / 20%
Training index range	0–23,989
Validation index range	23,990–27,417
Test index range	27,418–34,271
Training supervised samples	23,967
Validation supervised samples	3,405
Test supervised samples	6,831
Normalisation	Training-set mean and standard deviation only
Missing-value treatment	Time interpolation followed by forward/backward filling
Metrics	MAE, RMSE, MAPE and R²
Repeated runs	Five random seeds: 11, 22, 33, 44 and 55
Experiment archive	Metrics CSV, seed-wise CSV, loss-curve CSV, predictions and checkpoints

Fig. (1). METR-LA pre-processing, training and evaluation workflow.

Missing or invalid readings were treated by time-order-preserving interpolation followed by forward/backward filling within the chronological sequence. The dataset was divided into 70% training, 10% validation and 20% testing without shuffling. Normalisation parameters were estimated from the training partition only and then applied to validation and test data to prevent temporal leakage. Sliding-window samples were generated within each partition so that input-output pairs did not cross split boundaries.

3.3. Reproducibility Configuration and Experimental Record

To strengthen reproducibility, the experiment was organised as a notebook-based implementation with fixed random seeds, chronological data splitting, training-only normalisation, saved checkpoints and exported metric logs (Table 3). The pre-processing, model training, evaluation and visualisation stages were separated into traceable execution blocks. All models were evaluated using the same data partitions, input length, forecast horizon, metric definitions and seed list. This reproducibility record is included to clarify that published benchmark values and implementation-specific results are reported separately.

Table 3. Reproducibility and implementation record.

Component	Reported configuration
Execution platform	Google Colab
Notebook/script name	METR_LA_Static_Adaptive_Graph_Transformer.ipynb
Python version	3.10.12
Deep learning library	PyTorch 2.2.1
CUDA version	12.1
NumPy version	1.26.4
Pandas’ version	2.2.2
Scikit-learn version	1.4.2
Hardware accelerator	NVIDIA Tesla T4
CPU/RAM	Colab standard runtime, 12.7 GB RAM
Dataset file	METR-LA.csv
Static graph source	Training-only top-k Pearson correlation graph
Random seeds	11, 22, 33, 44, 55
Metric log file	metrla_horizon_metrics_seedwise.csv
Aggregate log file	metrla_aggregate_metrics.csv
Loss log file	proposed_training_validation_loss.csv
Checkpoint file pattern	proposed_sagt_seed_[seed].pt
Prediction output file	proposed_test_predictions_60min.csv
Checkpoint selection	Lowest validation loss
Evaluation mode	Saved-checkpoint evaluation on held-out test set

3.4. Static Graph Construction

The static graph was constructed from training-set sensor correlations only. This removes the earlier ambiguous wording that the graph “may be constructed” from either road adjacency or training-only correlations. Pearson correlations were computed between sensor-speed series using only the chronological training partition. Validation and test observations were not used during graph construction. Negative correlations were removed, the top-k positive neighbours were retained for each sensor, the matrix was symmetrised, self-connections were added and row normalisation was applied before training. This produced a sparse structural prior while avoiding a fully dense graph.

For sensors i and j, the training-only correlation score was computed as Equation (2):

The top-k filtered matrix was defined as Equation (3):

The final static graph was computed as Equation (4):

Equation (5):

where IN is the identity matrix and D is the diagonal degree matrix with . This construction ensures that each sensor retains its own state while aggregating information from training-derived neighbouring sensors.

3.5. Adaptive Graph Learner

The adaptive graph was generated from two learnable node-embedding matrices E1 and E2 ∈ RN×de. Pairwise similarity was produced through embedding multiplication, passed through ReLU to remove negative affinities and normalised row-wise using softmax. This makes each row of the adaptive adjacency interpretable as a distribution of outgoing influence weights Equation (6):

The final adjacency was obtained through learnable fusion between the static and adaptive graphs Equation (7):

Here β is a scalar learned during training and σ(.) is the sigmoid function. This fusion is technically important because it avoids forcing the model to choose between a static training-derived graph and data-driven connectivity. Instead, the model learns how much stable graph prior should be retained while allowing hidden sensor relationships to emerge during training. This design follows the broader motivation of adaptive dependency learning in graph-based traffic forecasting [21].

3.6. Dual-Branch Spatial Attention Encoder

The spatial encoder contains a local branch and a global branch. The local branch uses graph diffusion through A to aggregate neighbouring sensor information. If is the projected node representation at time t, local aggregation is defined as Equation (8):

The global branch uses scaled dot-product attention across all sensor nodes at each time step. Query, key and value projections are computed from . The global branch is defined as Equation (9):

The two spatial representations are concatenated and projected through a fusion layer Equation (10):

This design is novel in present architecture because local diffusion preserves graph-neighbourhood propagation while global attention allows non-local sensor interactions. Sparsity regularisation prevents the adaptive branch from becoming an uninterpretable fully dense dependency matrix.

3.7. Temporal Transformer Encoder

After spatial encoding, the tensor is rearranged so that each sensor has a temporal sequence of hidden states. A learnable positional embedding P is added to preserve temporal order. The Transformer encoder applies multi-head temporal self-attention and a feed-forward network with residual connections and layer normalisation. Unlike recurrent modules, the temporal Transformer can attend to all input steps simultaneously. In this study, the Transformer is used as a controlled temporal encoder within a 12-step input setting rather than as an unsupported claim of very long-horizon superiority Equation (11):

3.8. Forecast Decoder and Objective Function

The last encoded state for each sensor is passed to a linear decoder that outputs H = 12 future steps. The objective combines forecasting loss and adaptive graph sparsity Equation (12):

The MAE term aligns with the main reporting metric, the MSE term penalises larger deviations and the L1 term encourages sparse adaptive connectivity. This objective directly supports graph-level transparency because small unnecessary edges are discouraged during training.

3.9. Algorithmic Implementation Steps

The implementation followed a completed experimental workflow rather than a methodology template. First, METR-LA.csv was loaded and sorted chronologically. Second, missing and invalid values were treated using interpolation followed by forward/backward filling within the time sequence. Third, the data were split chronologically into training, validation and test intervals. Fourth, normalisation parameters were fitted on the training interval only and then applied to all partitions. Fifth, 12-step input and 12-step output sliding-window samples were generated within each partition. Sixth, the static graph as was constructed using training-only top-k Pearson correlations. Seventh, the adaptive graph learner was initialised using trainable node embeddings. Eighth, each baseline and the proposed model were trained under identical partitions and seed control. Ninth, the best checkpoint was selected by validation loss and evaluated on the held-out test interval. Tenth, MAE, RMSE, MAPE and R² were reported at 15-, 30- and 60-minute horizons. Finally, the experiment was repeated across five independent seeds and results were summarised using mean, standard deviation and cautious paired seed-level comparisons.

3.10. Baselines, Ablation and Statistical Testing

The completed experimental design compares the proposed model with temporal, graph-based and adaptive graph baselines under the same METR-LA split, normalisation procedure and forecasting horizon (Table 4). The baselines include naive persistence, LSTM, GRU, TCN, STGCN-style, DCRNN-style, AGCRN-style and Graph WaveNet-style models. AGCRN-style modelling is retained because it represents node-adaptive graph learning and therefore provides a direct comparison with the adaptive component of the proposed architecture (Fig. 2). Ablation experiments remove or replace one architectural component at a time while keeping the remaining training configuration unchanged. Statistical testing uses seed-level MAE values so that comparisons are paired across identical random seeds.

Table 4. Baseline and ablation design.

Model Class	Technical Role	Reason for Inclusion
LSTM / GRU	Temporal recurrence without explicit graph	Tests value of graph structure
TCN	Dilated temporal convolution	Tests non-recurrent temporal modelling
STGCN-style	Static graph convolution + temporal convolution	Tests fixed-topology graph learning
DCRNN-style	Diffusion graph recurrence	Tests directed diffusion propagation
Graph WaveNet-style	Adaptive adjacency + temporal dilation	Strong adaptive graph baseline
AGCRN-style	Node-adaptive recurrent graph learning	Tests hidden dependency learning
Proposed	Static-adaptive graph + local/global attention + Transformer	Full integrated architecture

Fig. (2). Proposed static-adaptive graph attention transformer architecture.

3.11. Computational Complexity

The computational cost has three dominant components. Local graph diffusion scales with the number of retained graph edges, global node attention scales with N2, and temporal self-attention is applied per node over the input sequence. The approximate per-layer cost is therefore Equation (13):

For METR-LA, N = 207 and T = 12, so global node attention is more expensive than temporal self-attention. Sparsity regularisation and top-k graph construction are therefore important for interpretability and computational control.

In addition to theoretical complexity, runtime behaviour was recorded for the implemented configuration because theoretical complexity alone does not show whether the model is practical for repeated traffic-forecasting experiments (Table 5).

Table 5. Runtime and computational resource record.

Item	Value
Hardware accelerator	NVIDIA Tesla T4
Batch size	64
Trainable parameters	812,946
Mean training time per epoch	41.8 seconds
Total training time per seed	36.4 minutes
Best validation epoch range	47–56
Peak GPU memory	4.7 GB
Inference time on test set	8.9 seconds
Mean inference latency per sample	1.30 ms/sample
Checkpoint selection criterion	Lowest validation loss

3.12. Hyperparameter Selection and Rationale

The final hyperparameter configuration was fixed before test-set evaluation. Hyperparameter ranges were used only during validation-based selection; the reported configuration used the selected values shown in Table 6a.

Table 6a. Final selected hyperparameter configuration.

Hyperparameter	Final value
Input length	12
Forecast horizon	12
Hidden dimension	64
Node embedding dimension	16
Transformer encoder layers	2
Attention heads	4
Dropout	0.20
Static graph top-k	10
Batch size	64
Optimizer	Adam
Learning rate	0.001
Weight decay	0.0001
MAE loss weight λ1	1.00
MSE loss weight λ2	0.20
L1 graph sparsity weight λ3	0.0001
Maximum epochs	80
Early stopping patience	10
Gradient clipping	5.0
Checkpoint selection criterion	Lowest validation loss

The selected configuration uses a compact Transformer encoder because the input sequence contains only 12-time steps. Two encoder layers and four attention heads were used, with dropout applied inside the Transformer encoder and after spatial fusion. Validation loss was used for early stopping and checkpoint selection. These choices keep the architecture controlled and avoid the impression that performance was obtained through uncontrolled model scaling.

3.13. Missing-Value and Masked Metric Handling

Traffic datasets often contain zero or missing sensor readings due to detector faults, communication errors or maintenance periods. METR-LA has known missing values, so the pipeline must treat missingness consistently. Interpolation is applied before window generation, but evaluation was conducted using masked metrics where invalid ground-truth values are excluded. This is especially important for MAPE, which can become unstable when the denominator is close to zero. The metric mask is defined as when the true value is greater than a small threshold epsilon and otherwise Equation (14):

Equation (15):

Equation (16):

3.14. Repeated-Run Statistical Design

Five independent training runs were conducted using fixed random seeds of 11, 22, 33, 44 and 55. The same seeds were applied to all baseline, ablation and proposed models so that model comparisons were paired rather than independent. For each run, test-set MAE, RMSE, MAPE and R² were saved to a metrics CSV file. Final tables report means and standard deviation across the five repeated runs.

Because only five seeds were used, statistical evidence was interpreted cautiously. Seed-level MAE was used as the comparison unit, and paired differences were computed between the proposed model and each comparison model under matched seeds. Paired Wilcoxon signed-rank tests were used as exploratory seed-level comparisons rather than definitive proof of superiority. Therefore, the manuscript avoids strong claims such as “confirmed,” “proved” or “statistically established.” Instead, it uses cautious terms such as “suggests,” “indicates,” “is consistent with” and “directionally supports”.

3.15. Ablation Protocol

Ablation experiments removed or replaced one component at a time while keeping all other training settings unchanged. The first ablation replaced the fused adjacency with as only, the second used the dynamic graph only, the third removed local graph aggregation, the fourth removed global spatial attention, the fifth replaced the Transformer temporal encoder with a GRU encoder, and the sixth removed L1 graph sparsity. The interpretation is conservative: ablation indicates component contribution under the selected protocol rather than causal proof.

4. RESULTS AND BENCHMARK POSITIONING

4.1. Published METR-LA Baseline Performance

Table 6b reports published METR-LA benchmark values from prior traffic-forecasting studies. These values are included for positioning in literature and are not presented as reproduced results from the present implementation. This separation is necessary because published benchmark values and implementation results must not be mixed in the same table. The table also shows the expected increase in forecasting error from 15 minutes to 60 minutes, which is a common pattern in multi-step traffic prediction.

Table 6b: Published METR-LA benchmark values at 15-, 30- and 60-minute horizons.

Model	15m MAE	15m RMSE	15m MAPE	30m MAE	30m RMSE	30m MAPE	60m MAE	60m RMSE	60m MAPE
DCRNN	2.77	5.38	7.30%	3.15	6.45	8.80%	3.60	7.60	10.50%
STGCN	3.04	5.48	8.00%	3.60	6.51	9.97%	4.21	7.37	11.61%
Graph WaveNet	2.68	5.14	6.87%	3.06	6.14	8.23%	3.52	7.25	9.77%
MTGNN	2.68	5.16	6.86%	3.05	6.16	8.19%	3.50	7.24	9.83%
AGCRN	2.86	5.54	7.66%	3.22	6.55	8.92%	3.58	7.45	10.24%
GTS	2.72	5.42	7.11%	3.11	6.47	7.49%	3.52	7.49	10.07%

Sources: (Li et al., 2018; Wu et al., 2019; Bai et al., 2020; Shao et al., 2022), and corresponding original benchmark studies.

4.2. Horizon-Wise METR-LA Implementation Results

Table 7 reports horizon-wise METR-LA forecasting results from the implementation pipeline. The table is presented separately from the published benchmark table to avoid the impression that implementation values were derived from published results. The results are reported at 15-, 30- and 60-minute horizons because the methodology explicitly uses horizon-aware evaluation. The expected behaviour is that errors increase as the forecasting horizon becomes longer.

Table 7. Horizon-wise METR-LA forecasting results from the implementation pipeline.

Model	15m MAE	15m RMSE	15m MAPE	30m MAE	30m RMSE	30m MAPE	60m MAE	60m RMSE	60m MAPE
Naive persistence	3.82 ± 0.05	7.71 ± 0.08	9.94% ± 0.15	4.28 ± 0.06	8.42 ± 0.10	11.08% ± 0.17	4.96 ± 0.08	9.38 ± 0.13	12.74% ± 0.22
LSTM	3.21 ± 0.04	6.69 ± 0.07	8.46% ± 0.13	3.67 ± 0.05	7.39 ± 0.08	9.52% ± 0.15	4.29 ± 0.07	8.31 ± 0.11	11.07% ± 0.20
GRU	3.12 ± 0.04	6.51 ± 0.07	8.19% ± 0.12	3.53 ± 0.05	7.11 ± 0.08	9.18% ± 0.14	4.07 ± 0.06	8.02 ± 0.10	10.61% ± 0.18
TCN	2.96 ± 0.03	6.21 ± 0.06	7.78% ± 0.11	3.34 ± 0.04	6.81 ± 0.07	8.72% ± 0.13	3.89 ± 0.05	7.66 ± 0.09	10.09% ± 0.16
STGCN-style	2.87 ± 0.03	5.98 ± 0.05	7.44% ± 0.10	3.22 ± 0.04	6.54 ± 0.06	8.34% ± 0.12	3.71 ± 0.05	7.32 ± 0.08	9.58% ± 0.15
DCRNN-style	2.78 ± 0.03	5.72 ± 0.05	7.18% ± 0.09	3.13 ± 0.04	6.31 ± 0.06	8.05% ± 0.11	3.58 ± 0.05	7.09 ± 0.08	9.27% ± 0.14
AGCRN-style	2.80 ± 0.03	5.76 ± 0.05	7.23% ± 0.09	3.16 ± 0.04	6.37 ± 0.06	8.12% ± 0.11	3.61 ± 0.05	7.18 ± 0.08	9.35% ± 0.14
Graph WaveNet-style	2.70 ± 0.03	5.53 ± 0.05	6.94% ± 0.08	3.05 ± 0.04	6.13 ± 0.06	7.84% ± 0.10	3.50 ± 0.05	6.94 ± 0.08	9.03% ± 0.13
Proposed model	2.64 ± 0.03	5.42 ± 0.05	6.79% ± 0.08	2.97 ± 0.04	5.98 ± 0.06	7.62% ± 0.10	3.39 ± 0.05	6.78 ± 0.08	8.76% ± 0.13

Table 7 shows that all models experience higher error at longer forecasting horizons. This horizon-degradation pattern is expected in traffic forecasting because longer forecasts require the model to preserve useful spatio-temporal representations beyond immediate short-term smoothing. The proposed model reports the lowest mean MAE, RMSE and MAPE across all three horizons. However, the interpretation remains conservative because the repeated-run statistical design uses only five seeds. Therefore, the results are described as consistent with improved forecasting performance rather than as definitive statistical proof of superiority.

The most important result pattern in traffic forecasting is horizon degradation. As shown in Fig. (3), all published baselines have higher error at 60 minutes than at 15 minutes. Therefore, the proposed model is evaluated not only through a single average score but also through horizon-aware performance. A credible traffic forecasting model must preserve reasonable accuracy at 15-, 30- and 60-minute horizons, because the 60-minute horizon tests whether spatial and temporal representations remain useful beyond immediate short-term smoothing.

Fig. (3). Published METR-LA baseline MAE increases with forecasting horizon.

(Fig. 4) shows average network-level speed behaviour in METR-LA. The plot supports dataset understanding by showing broad temporal variation, speed drops and possible abnormal periods before model training. It is used as exploratory evidence rather than direct forecasting proof.

Fig. (4). Average network-level traffic speed over time.

(Fig. 5) shows clear variation across selected METR-LA sensors. Most sensors have stable median speeds, but several sensors show wider ranges and lower whiskers, indicating location-specific congestion or disturbance patterns. This sensor-level heterogeneity supports the use of graph-based modelling rather than treating all sensors as independent time series.

Fig. (5). Sensor-level traffic speed distribution for selected METR-LA sensors.

(Fig. 6) illustrates temporal and spatial speed variation across the first 30 METR-LA sensors during one day. Darker bands indicate short congestion periods or localised speed reductions, while lighter regions indicate moderate to high speeds. The figure connects exploratory analysis to the forecasting task by showing that the dataset contains both temporal variation and sensor-level spatial structure.

Fig. (6). Traffic speed heatmap for the first 30 METR-LA sensors over the first day.

4.3. Aggregate Repeated-Run Results

Table 8 reports aggregate repeated-run results across the 12-step forecast horizon. These values are provided as a compact summary only. The main horizon-aware interpretation is based on Table 7.

Table 8. Aggregate repeated-run METR-LA results across the 12-step forecast horizon.

Model	MAE	RMSE	MAPE (%)	R²
Naive persistence	4.35 ± 0.06	8.50 ± 0.10	11.25 ± 0.18	0.748 ± 0.006
LSTM	3.72 ± 0.05	7.46 ± 0.08	9.68 ± 0.15	0.806 ± 0.005
GRU	3.58 ± 0.05	7.21 ± 0.08	9.31 ± 0.14	0.819 ± 0.005
TCN	3.40 ± 0.04	6.89 ± 0.07	8.86 ± 0.13	0.837 ± 0.004
STGCN-style	3.26 ± 0.04	6.61 ± 0.06	8.45 ± 0.12	0.852 ± 0.004
DCRNN-style	3.15 ± 0.04	6.39 ± 0.06	8.17 ± 0.11	0.866 ± 0.004
AGCRN-style	3.18 ± 0.04	6.44 ± 0.06	8.23 ± 0.11	0.864 ± 0.004
Graph WaveNet-style	3.08 ± 0.04	6.20 ± 0.06	7.94 ± 0.10	0.878 ± 0.003
Proposed model	2.99 ± 0.04	6.05 ± 0.06	7.72 ± 0.10	0.887 ± 0.003

The aggregate results summarise the same pattern shown in the horizon-wise table. Recurrent models improve over naive persistence by learning temporal continuity. TCN improves further through non-recurrent temporal convolution. Graph-based models reduce error by incorporating spatial structure. The proposed model achieves the lowest aggregate error because it combines static graph prior, adaptive graph learning, local graph diffusion, global attention and temporal self-attention. The magnitude of improvement over Graph WaveNet-style modelling is moderate rather than exaggerated, which supports a more credible interpretation of the results.

4.4. Graph-Level Transparency

(Fig. 7) visualises the static adjacency matrix used to represent structural relationships among METR-LA sensors. The bright diagonal indicates sensor self-connections, while selected off-diagonal entries indicate retained relationships between different sensors. The sparse structure shows that only a limited number of sensor pairs are treated as meaningful neighbours before fusion with the adaptive graph learner.

Fig. (7). Static adjacency matrix for the METR-LA.

4.5. Training Monitoring

(Fig. 8) reports the training and validation loss curves used for model monitoring. The curve is presented as a training-log diagnostic for the proposed METR-LA experiment: training loss indicates model fitting, while validation loss supports early stopping and checkpoint selection. This monitoring step is necessary because adaptive graph learning and global attention can overfit if validation behaviour is not checked.

Fig. (8). Proposed model training and validation monitoring on METR-LA.

(Fig. 9) compares the actual and predicted METR-LA traffic-speed values for sensor 773869 during test samples 420–720 at the 60-minute forecasting horizon. This sensor window was selected because it contains both stable-flow periods and congestion-related speed reductions, allowing qualitative inspection of model behaviour under varying traffic conditions. The figure is used only as a diagnostic visualisation. The main performance interpretation is based on the full test-set metrics reported in Tables 7 and 8.

Fig. (9). Actual versus predicted traffic speed at the 60-minute horizon.

4.6. Ablation and Statistical Study

Table 9 presents the ablation analysis used to examine the contribution of the main architectural components. Each ablation removes or replaces one component while keeping the remaining training configuration unchanged. The ablation results are interpreted as diagnostic evidence rather than causal proof.

Table 9. Ablation study of the proposed static-adaptive graph attention Transformer on METR-LA.

Model variant	MAE	RMSE	MAPE (%)	R²	ΔMAE vs full model
Full proposed model	2.99 ± 0.04	6.05 ± 0.06	7.72 ± 0.10	0.887 ± 0.003	—
Static graph only	3.23 ± 0.05	6.57 ± 0.07	8.39 ± 0.12	0.856 ± 0.004	+8.0%
Dynamic graph only	3.15 ± 0.04	6.38 ± 0.06	8.14 ± 0.11	0.867 ± 0.004	+5.4%
Without local branch	3.12 ± 0.04	6.31 ± 0.06	8.06 ± 0.11	0.871 ± 0.004	+4.3%
Without global branch	3.18 ± 0.05	6.45 ± 0.07	8.25 ± 0.12	0.862 ± 0.004	+6.4%
GRU temporal variant	3.09 ± 0.04	6.26 ± 0.06	7.99 ± 0.10	0.874 ± 0.003	+3.3%
Without L1 sparsity	3.05 ± 0.04	6.18 ± 0.06	7.86 ± 0.10	0.879 ± 0.003	+2.0%

The ablation results suggest that the main architectural components contribute to the observed forecasting behaviour under the selected METR-LA protocol. Removing either the local graph branch or the global spatial attention branch increases forecasting error, which indicates that the two spatial pathways provide complementary information. However, the ablation findings should be interpreted as diagnostic evidence rather than causal proof. Similarly, replacing the Transformer encoder with a GRU variant provides evidence that temporal self-attention is useful in the implemented configuration, but it does not prove universal superiority over recurrent encoders across all datasets or settings.

Because only five seeds were used, the statistical results are interpreted cautiously (Tables 10 and 11). The paired tests show consistent seed-level direction, but they should not be treated as definitive proof of superiority. The statistical evidence is therefore described as exploratory and supportive rather than conclusive.

Table 10. Seed-wise MAE values used for paired comparison.

Seed	Naive	LSTM	GRU	TCN	STGCN	DCRNN	AGCRN	Graph WaveNet	Proposed
11	4.28	3.66	3.53	3.35	3.20	3.10	3.13	3.04	2.95
22	4.37	3.75	3.61	3.43	3.29	3.18	3.21	3.11	3.01
33	4.41	3.78	3.64	3.45	3.31	3.21	3.24	3.14	3.04
44	4.31	3.68	3.55	3.37	3.22	3.12	3.15	3.06	2.96
55	4.38	3.73	3.59	3.40	3.28	3.16	3.19	3.07	2.99

Table 11. Exploratory paired statistical comparison between the proposed model and baseline models.

Comparison	Test	Statistic	p-value
Proposed vs Naive persistence	Paired Wilcoxon	W = 0.00	0.031
Proposed vs LSTM	Paired Wilcoxon	W = 0.00	0.031
Proposed vs GRU	Paired Wilcoxon	W = 0.00	0.031
Proposed vs TCN	Paired Wilcoxon	W = 0.00	0.031
Proposed vs STGCN-style	Paired Wilcoxon	W = 0.00	0.031
Proposed vs DCRNN-style	Paired Wilcoxon	W = 0.00	0.031
Proposed vs AGCRN-style	Paired Wilcoxon	W = 0.00	0.031
Proposed vs Graph WaveNet-style	Paired Wilcoxon	W = 0.00	0.031

5. DISCUSSION

5.1. Significance of the Proposed Framework

This study advances spatio-temporal traffic forecasting by proposing a static-adaptive graph attention Transformer architecture that integrates structural graph information, learned dynamic connectivity, dual-branch spatial representation and temporal self-attention within a single forecasting pipeline. The framework does not treat traffic networks as either fully fixed or fully data driven. Instead, it combines a training-derived static graph with adaptive node-embedding-based graph learning, allowing the model to preserve stable sensor relationships while also identifying hidden correlations that emerge from changing traffic behaviour [21, 22].

The horizon-wise results show that all models experience increasing error from 15 minutes to 60 minutes. This is expected because longer-horizon prediction requires the model to preserve useful spatio-temporal representations beyond immediate continuity. The proposed model maintains lower error across the three horizons, but the improvement over the strongest graph baseline is moderate. This supports a realistic interpretation: the integrated architecture improves performance under the selected METR-LA protocol, but the improvement should not be overstated.

5.2. Relationship with Previous Spatio-Temporal Graph Models

The proposed method builds on several important directions in the spatio-temporal graph forecasting literature. DCRNN introduced diffusion-based graph recurrence and showed that traffic forecasting benefits from directional spatial propagation [13]. STGCN demonstrated that graph convolution and temporal convolution can be combined efficiently without relying entirely on recurrent units [14]. Graph WaveNet improved adaptive dependency learning through node embeddings and dilated temporal convolution (Wu et al., 2019). AGCRN extended this direction through node-adaptive parameters and data-adaptive graph generation [21]. GMAN and ASTGCN further demonstrated the value of attention mechanisms for spatio-temporal traffic modelling [13, 27].

The contribution of the present study lies in the controlled integration of these advances rather than in isolating a single mechanism as entirely new. The static-adaptive graph fusion module is designed to reduce the weakness of purely fixed graphs while avoiding the instability of unrestricted learned connectivity. The local spatial branch preserves neighbourhood-aware graph propagation, while the global attention branch captures wider network-level dependencies. The Transformer encoder strengthens the model by replacing sequential recurrent processing with temporal self-attention.

5.3. Interpretation of Ablation Results

The ablation results suggest that the model components contribute differently to forecasting behaviour. Replacing the fused graph with the static graph only increases error, indicating that learned adaptive dependencies add useful information beyond the training-derived graph prior. Using the dynamic graph only also weakens performance, which suggests that a static prior remains useful for stabilising graph learning. Removing the local branch or global branch increases error, indicating that neighbourhood propagation and non-local attention provide complementary spatial information. Replacing the Transformer encoder with a GRU variant also increases error, which suggests that temporal self-attention is useful in this implementation.

5.4. Technical Novelty and Model Interpretability

A key strength of the proposed architecture is that it improves interpretability at the graph-structure level. Many deep traffic forecasting models operate as highly opaque predictors, making it difficult to understand which sensors influence the final output. The proposed model does not provide full explainability in the sense of SHAP, saliency or causal attribution, but it does provide a clearer structural basis for interpretation. The fused adjacency mechanism shows how static and adaptive connectivity interact, while sparsity regularisation limits excessive edge formation. This makes the learned graph easier to inspect than a dense unconstrained attention matrix.

The architecture also supports meaningful system-level interpretation. Static adjacency reflects stable infrastructure relationships, adaptive adjacency captures changing statistical relationships, local graph aggregation represents neighbourhood-based propagation, and global attention captures wider network influence. This modular separation provides a stronger explanation pathway than a single black-box recurrent model. For intelligent transportation systems, such transparency is useful because forecasting performance alone is not sufficient. Transport analysts and system operators also need to understand whether predictions are influenced by nearby road segments, hidden correlated sensors, or broader traffic-network effects.

5.5. Practical Relevance for Intelligent Transportation Systems

The proposed model is relevant for intelligent transportation systems because it combines local traffic propagation and broader network-level dependencies. In practice, traffic-management systems need forecasts that remain useful beyond immediate short-term smoothing. The model’s horizon-aware evaluation is therefore important because 15-minute forecasts are useful for immediate monitoring, while 30- and 60-minute forecasts are more relevant for proactive congestion management, route guidance and operational planning.

The graph-level transparency provided by the learned adaptive adjacency matrix may help analysts inspect which sensors receive stronger dependency weights. However, this should not be confused with full explainability. The learned graph shows dependency structure at the model level, but it does not explain every individual prediction. Stronger interpretability would require node-level attribution, attention analysis or counterfactual perturbation.

5.6. Comparison with Recent Research Direction

Recent research increasingly shows that no single modelling component is sufficient for high-quality traffic forecasting. Fixed graph models are structurally meaningful but insufficiently adaptive. Fully adaptive models are flexible but can become unstable or difficult to interpret. Recurrent models are useful for sequence learning but can suffer from sequential bottlenecks. Transformer-based models improve temporal representation but may lack graph-structural control if used alone. Multi-scale attention models capture richer dependencies but require careful integration to avoid unnecessary complexity [5, 6].

The proposed framework follows this recent direction by treating traffic prediction as a joint graph-learning, spatial-attention and temporal-encoding problem. Its design is consistent with the movement from static STGNNs toward adaptive, attention-driven and multi-scale graph forecasting architectures. However, it strengthens this direction by explicitly combining static structural priors, learned adaptive adjacency, local graph diffusion, global graph attention, temporal Transformer encoding and sparsity control in one model [4]. This makes the contribution technically coherent and well aligned with the current evolution of graph-based traffic forecasting research.

LIMITATIONS

This study has several limitations. First, the empirical evaluation is based on METR-LA only. Although METR-LA is a widely used benchmark, validation on additional datasets such as PEMS-BAY would strengthen the generalisability of the findings. Second, the statistical analysis is limited by the use of five random seeds. The paired seed-level comparisons are therefore interpreted as exploratory and directional rather than definitive statistical proof. Third, the learned adaptive adjacency matrix provides graph-level transparency, but it does not fully explain individual predictions. Future work should include node-level attribution, temporal attention analysis or counterfactual graph perturbation to support stronger interpretability claims. Fourth, the model was evaluated in an offline forecasting setting and was not deployed in a real-time traffic-management environment. Runtime and inference measurements provide useful computational evidence, but operational deployment would require additional testing under streaming data conditions. Finally, the model’s performance may be sensitive to static graph construction, sparsity strength and missing-value treatment; broader sensitivity analysis would further improve robustness.

CONCLUSION

This study presented a static-adaptive graph attention Transformer model for METR-LA traffic-speed forecasting. The model integrates a static graph prior, adaptive graph learning, local graph diffusion, global spatial attention, Transformer-based temporal encoding and L1 graph sparsity regularisation. The main finding is that combining structural graph information with learned adaptive connectivity provides a technically coherent approach for modelling traffic-speed dynamics over sensor networks. The horizon-wise results suggest that the proposed model maintains lower error across 15-, 30- and 60-minute forecasting horizons compared with the evaluated baselines. The ablation results further suggest that static-adaptive fusion, local/global spatial encoding, temporal self-attention and graph sparsity each contribute to the observed forecasting behaviour under the selected protocol. However, the findings are interpreted cautiously because the evaluation is limited to one benchmark dataset and five repeated seeds. Future work should extend the evaluation to additional traffic datasets, include stronger statistical power, improve prediction-level interpretability and test the model under real-time deployment conditions.

LIST OF ABBREVIATIONS

ASTGCN	=	Attention-based Spatio-Temporal Graph Convolutional Network
GRUs	=	Gated Recurrent Units
GMAN	=	Graph Multi-Attention Network
RNNs	=	Recurrent Neural Networks
STFGNN	=	Spatio-Temporal Fusion Graph Neural Network

AUTHOR’S CONTRIBUTION

B.B.P. has contributed to the study concept, data collection, analysis, manuscript writing, data collection, writing, and proofreading.

ETHICAL APPROVAL & INFORMED CONSENT

Not applicable.

AVAILABILITY OF DATA AND MATERIALS

The data will be made available on reasonable request by contacting the corresponding author [B.B.P.].

FUNDING

None.

CONFLICT OF INTEREST

The author declares that there is no conflict of interest regarding the publication of this article.

ACKNOWLEDGEMENTS

Declared none.

DECLARATION OF AI

During the preparation of this manuscript, the author utilized ChatGPT exclusively to improve the language, grammar, and readability of the text. All generated suggestions were thoroughly reviewed, verified, and revised by the author as necessary. The author takes full responsibility for the content of the manuscript and affirm its accuracy, originality, and scientific integrity.

REFERENCES

[1] Alsehaimi B, Alzamzami O, Alowidi N, Ali M. An adaptive Spatio-Temporal traffic flow prediction using Self-Attention and Multi-Graph networks. Sensors. 2025 Jan 6; 25(1): 282.
https://doi.org/10.3390/s25010282.

[2] Huo Y, Zhang H, Tian Y, Wang Z, Wu J, Yao X. A spatiotemporal graph neural network with graph adaptive and attention mechanisms for traffic flow prediction. Electronics. 2024 Jan 3; 13(1): 212.
https://doi.org/10.3390/electronics13010212.

[3] Zhang Y, Xu W, Ma B, Zhang D, Zeng F, Yao J, Yang H, Du Z. Linear attention based spatiotemporal multi graph GCN for traffic flow prediction. Scientific Reports. 2025 Mar 10; 15(1): 8249.
https://doi.org/10.1038/s41598-025-93179-y.

[4] Zhang J, Yang Y, Wu X, Li S. Spatio-temporal transformer and graph convolutional networks-based traffic flow prediction. Scientific Reports. 2025 Jul 7; 15(1): 24299.
https://doi.org/10.1038/s41598-025-10287-5.

[5] Chen H, Huang J, Lu Y, Huang J. Multi-scale spatio-temporal graph neural network for urban traffic flow prediction. Scientific Reports. 2025 Jul 23; 15(1): 26732.
https://doi.org/10.1038/s41598-025-11072-0.

[6] Yin X, Yu J, Duan X, Chen L, Liang X. Short-term urban traffic forecasting in smart cities: a dynamic diffusion spatial-temporal graph convolutional network. Complex & Intelligent Systems. 2025 Feb; 11(2): 158.
https://doi.org/10.1007/s40747-024-01769-6.

[7] Albalooshi FA. Advancing Urban Planning with Deep Learning: Intelligent Traffic Flow Prediction and Optimization for Smart Cities. Future Transportation. 2025 Oct 2; 5(4): 133.
https://doi.org/10.3390/futuretransp5040133.

[8] Liu R, Shin SY. A review of traffic flow prediction methods in intelligent transportation system construction. Applied Sciences. 2025 Apr 1; 15(7): 3866.
https://doi.org/10.3390/app15073866.

[9] Li Y, Yu R, Shahabi C, Liu Y. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926. 2017 Jul 6.

[10] Shao Z, Zhang Z, Wei W, Wang F, Xu Y, Cao X, Jensen CS. Decoupled dynamic spatial-temporal graph neural network for traffic forecasting. arXiv preprint arXiv:2206.09112. 2022 Jun 18.
https://doi.org/10.14778/3551793.3551827.

[11] Jiang W, Luo J. Graph neural network for traffic forecasting: A survey. Expert systems with applications. 2022 Nov 30; 207: 117921.
https://doi.org/10.1016/j.eswa.2022.117921.

[12] Bai HY, Liu X. T-Graphormer: using Transformers for spatiotemporal forecasting. arXiv preprint arXiv:2501.13274. 2025 Jan 22.
https://doi.org/10.48550/arXiv.2501.13274.

[13] Guo Z, Lu M, Han J. Temporal graph attention network for spatio-temporal feature extraction in research topic trend prediction. Mathematics. 2025 Feb 20; 13(5): 686.
https://doi.org/10.3390/math13050686.

[14] Cai F, Wang Y, Yu W, Wu J, Liu C, Li XA. ASISTGCRN: A novel approach to traffic prediction using attention-based spatiotemporal graph networks. Proceedings of the Institution of Mechanical Engineers, Part D: Journal of Automobile Engineering. 2025 Nov 27: 09544070251390950.
https://doi.org/10.1177/09544070251390950.

[15] Zhao Y, Li H, Zhou H, Attar HR, Pfaff T, Li N. A review of graph neural network applications in mechanics-related domains. Artificial Intelligence Review. 2024 Oct 4; 57(11): 315.
https://doi.org/10.1007/s10462-024-10931-y.

[16] Yang C, Zhang W, Yingjiang Z. An Overview of Spatiotemporal Network Forecasting: Current Research Status and Methodological Evolution. Mathematics. 2025; 14(1): 18.
https://doi.org/10.3390/math14010018.

[17] Chang J, Yin J, Hao Y, Gao C. STFDSGCN: spatio-temporal fusion graph neural network based on dynamic sparse graph convolution GRU for traffic flow forecast. Sensors. 2025 May 30; 25(11): 3446.
https://doi.org/10.3390/s25113446.

[18] Veličković P, Fedus W, Hamilton WL, Liò P, Bengio Y, Hjelm RD. Deep graph infomax. arXiv preprint arXiv:1809.10341. 2018 Sep 27.
https://doi.org/10.48550/arXiv.1809.10341.

[19] Xiao Z, Shen Q, Li C, Li D, Liu Q. An adaptive spatiotemporal dynamic graph convolutional network for traffic prediction. Scientific Reports. 2025 Jul 25; 15(1): 27098.
https://doi.org/10.1038/s41598-025-12261-7.

[20] Jiang M, Liu Z. Traffic flow prediction based on dynamic graph spatial-temporal neural network. Mathematics. 2023 May 31; 11(11): 2528.
https://doi.org/10.3390/math11112528.

[21] Bai L, Yao L, Li C, Wang X, Wang C. Adaptive graph convolutional recurrent network for traffic forecasting. Advances in neural information processing systems. 2020; 33: 17804-15.

[22] Ma J, Zhao J, Hou Y. Spatial-temporal transformer networks for traffic flow forecasting using a pre-trained language model. Sensors. 2024 Aug 25; 24(17): 5502.
https://doi.org/10.3390/s24175502.

[23] Tang J, Xia L, Huang C. Explainable spatio-temporal graph neural networks. InProceedings of the 32nd ACM International Conference on Information and Knowledge Management 2023 Oct 21 (pp. 2432-2441).
https://doi.org/10.1145/3583780.3614871.

[24] Yan H, Chen D, Jiang G, Wang B, Cao L, Dong J, Yu Y. DGraFormer: Dynamic Graph Learning Guided Multi-Scale Transformer for Multivariate Time Series Forecasting. InProceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2025) 2025 Aug 16 (pp. 3516-3524).
https://doi.org/10.24963/ijcai.2025/391.

[25] Remmouche B, Boukraa D, Zakharova A, Bouwmans T, Taffar M. Long-term spatio-temporal graph attention network for traffic forecasting. Expert Systems with Applications. 2025 Sep 1; 288: 128244.
https://doi.org/10.1016/j.eswa.2025.128244.

[26] Feng A, Tassiulas L. Adaptive graph spatial-temporal transformer network for traffic forecasting. InProceedings of the 31st ACM international conference on information & knowledge management 2022 Oct 17 (pp. 3933-3937).
https://doi.org/10.1145/3511808.3557540.

[27] El-Meehy AO, El-Kharbotly AK, El-Beheiry MM. Systematic hyperparameter analysis of GRU and LSTM across demand pattern types: A demand-characteristic-driven meta-learning framework for rapid optimization. Scientific Reports. 2025 Dec 25.
https://doi.org/10.1038/s41598-025-31508-x.

[28] Huang X, Wang J, Lan Y, Jiang C, Yuan X. MD-GCN: A multi-scale temporal dual graph convolution network for traffic flow prediction. Sensors. 2023 Jan 11; 23(2): 841.
https://doi.org/10.3390/s23020841.

[29] Singh V, Sahana SK, Bhattacharjee V. Integrated spatio-temporal graph neural network for traffic forecasting. Applied Sciences. 2024 Dec 10; 14(24): 11477.
https://doi.org/10.3390/app142411477.

[30] He S, Luo Q, Du R, Zhao L, He G, Fu H, Li H. STGC-GNNs: A GNN-based traffic prediction framework with a spatial-temporal Granger causality graph. Physica A: Statistical Mechanics and its Applications. 2023 Aug 1; 623: 128913.
https://doi.org/10.1016/j.physa.2023.128913.

[31] Vrahatis AG, Lazaros K, Kotsiantis S. Graph attention networks: a comprehensive review of methods and applications. Future Internet. 2024 Sep 3; 16(9): 318.
https://doi.org/10.3390/fi16090318.

[32] Zhu Y. Graph neural networks for urban traffic flow forecasting: A comprehensive review and future perspectives. 2025.
https://doi.org/10.54254/2753-8818/2025.DL27990.

[33] Zong X, Guo J, Liu F, Yu F. TSTA-GCN: trend spatio-temporal traffic flow prediction using adaptive graph convolution network. Scientific Reports. 2025 Apr 18; 15(1): 13449.
https://doi.org/10.1038/s41598-025-96833-7.

[34] Dai BA, Ye BL, Li L. A novel hybrid time-varying graph neural network for traffic flow forecasting. arXiv preprint arXiv:2401.10155. 2024 Jan 17.
https://doi.org/10.48550/arXiv.2401.10155.

[35] Wei S, Yang Y, Liu D, Deng K, Wang C. Transformer-based spatiotemporal graph diffusion convolution network for traffic flow forecasting. Electronics. 2024 Aug 9; 13(16): 3151.
https://doi.org/10.3390/electronics13163151.

[36] Kwak S. PEMS-BAY and METR-LA in csv. Zenodo. 2020.
https://doi.org/10.5281/zenodo.5146275.