📊 Results Overview

DOI: 10.70124/mv76r-r8x04

This folder contains all experiment outputs, model checkpoints, logs, visualizations, and raw data generated during training and evaluation in TemporalAttentionPlayground.

These results were created from the repository https://github.com/mozi30/TemporalAttentionPlayground.git

📂 Folder Structure

| Folder | Contents & Purpose |
|----------------|-------------------------------------------------------------------------------------|
| examples/ | Example images showing model predictions for different architectures |
| graphs/ | Performance plots (mAP, mAR vs. noise) for VisDrone and XS-VID datasets |
| models/ | Trained model checkpoints, logs, TensorBoard files, architecture scripts |
| raw_data/ | Raw result CSVs for each dataset and noise level (0%, 10%, 30%, 60%) |

๐Ÿ” Details

  • examples/

    • yolov_swinbase_example_*.jpg, yolox_swinbase_example_*.jpg: Qualitative prediction examples.
  • graphs/

    • visdrone/*.png, xs-vid/*.png: Plots for mAP, mAR, inference time vs noise.
  • models/

    • yolov_swinbase/, yolox_swinbase/, yolox_swintiny/
    • Includes:
      • best_ckpt.pth (best model weights)
      • train_log.txt, val_log.txt
      • tensorboard/ files
      • Model definition scripts (e.g., yolov_swinbase.py)
      • For Tiny models: w1/, w7/ folders for window-size variants
  • raw_data/

    • visdrone/: *_results.csv, *_results_noise10.csv, etc.
    • xs-vid/: Same structure for XS-VID

🚀 Usage

  • Use model checkpoints from models/ to resume or inspect experiments.
  • Review plots in graphs/ for quantitative performance.
  • Study examples/ for qualitative detection results.
  • Inspect CSVs in raw_data/ for detailed metric analysis.

For more information on how to reproduce these results yourself, check out the repository.

📑 Definitions

  • Gframe: Number of frames the model uses to build temporal context
  • Batch Size: Number of samples processed simultaneously during training

Common columns (VisDrone & XS-VID)

| Column | Description | Type | Example |
|-------------------|---------------------------------------------------|---------|----------------------------|
| timestamp | ISO 8601 time of evaluation | string | 2025-11-30T13:29:32Z |
| dataset | Dataset name | string | VisDrone / XS-VID |
| model | Model + config (temporal window, frames, etc.) | string | YOLOV-SwinBase Gframe=8 |
| noise_level | Relative noise intensity in [0, 0.6] | float | 0.3 |
| map50-95 | COCO-style mAP@[0.5:0.95] | float | 0.145 |
| mAP50 | COCO average precision at IoU 0.5 | float | 0.312 |
| mAP-small | Average precision (IoU 0.5-0.95) for small objects | float | 0.035 |
| mAP-medium | Average precision (IoU 0.5-0.95) for medium objects | float | 0.135 |
| mAP-large | Average precision (IoU 0.5-0.95) for large objects | float | 0.295 |
| mAR50-95 | COCO-style average recall (IoU 0.5-0.95) | float | 0.348 |
| mAR-small | Average recall (IoU 0.5-0.95) for small objects | float | 0.112 |
| mAR-medium | Average recall (IoU 0.5-0.95) for medium objects | float | 0.345 |
| mAR-large | Average recall (IoU 0.5-0.95) for large objects | float | 0.489 |
| inference_time_ms | Average inference time per frame in ms | float | 69.8 |
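Rows following this schema can be read with the Python standard library alone. The sketch below is illustrative: the sample data and the column subset are hypothetical, not copied from the actual result files, which live in results/raw_data/.

```python
import csv
import io

# Hypothetical sample matching the common-column schema above;
# the real files are in results/raw_data/<dataset>/.
SAMPLE_CSV = """timestamp,dataset,model,noise_level,map50-95,mAP50,inference_time_ms
2025-11-30T13:29:32Z,VisDrone,YOLOV-SwinBase Gframe=8,0.0,0.145,0.312,69.8
2025-11-30T13:29:32Z,VisDrone,YOLOV-SwinBase Gframe=8,0.6,0.116,0.250,76.8
"""

def rows_at_noise(csv_text: str, noise_level: float) -> list[dict]:
    """Return rows at the given noise level, with numeric columns parsed as float."""
    reader = csv.DictReader(io.StringIO(csv_text))
    matches = []
    for row in reader:
        if float(row["noise_level"]) == noise_level:
            for key in ("noise_level", "map50-95", "mAP50", "inference_time_ms"):
                row[key] = float(row[key])
            matches.append(row)
    return matches

clean = rows_at_noise(SAMPLE_CSV, 0.0)
print(clean[0]["map50-95"])  # 0.145
```

For the real files, replace SAMPLE_CSV with the contents of, e.g., results/raw_data/visdrone/base_results.csv.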

โ™ป๏ธ Reproducibility & FAIR Principles

  • All results were generated using scripts located in the main repository.
  • Folder structure follows standardized naming to improve findability.
  • Result data is stored in non-proprietary CSV format, enriched with human-readable metadata.
  • Metadata contains:
    • Experiment timestamp
    • Code commit hash
    • Dataset license + URI
    • Metric definition (COCO standards)
    • Reproduction instructions

For full experiment setup and additional FAIR justification, refer to the Data Management Plan (DMP).

๐Ÿ” How to Reproduce the Main CSVs

The raw CSV files in results/raw_data/ are generated by the scripts in scripts/.
The following table summarizes the provenance:

| Output file (pattern) | Location | Generated by script | Example command |
|----------------------------------------------------------|-----------------------------------|------------------------------------------|-----------------|
| base_results.csv, base_results_noise*.csv | results/raw_data/visdrone/ | visdrone-generator.py | python3 code/results/visdrone-generator.py |
| xsvid_results.csv, xsvid_results_noise*.csv | results/raw_data/xs-vid/ | xs-vid-generator.py | python3 xs-vid-generator.py |

โš ๏ธ Synthetic robustness data

Noise robustness results for 10%, 30%, and 60% noise are generated using scripted perturbation models applied to the base (0% noise) evaluation metrics. They are designed to illustrate realistic trends in robustness across models, but they are not direct measurements from separate full training runs on physically corrupted input data. Users reusing this dataset should treat these values as modelled robustness curves rather than raw benchmark scores.
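The exact perturbation models live in the generator scripts and are not reproduced here. Purely as an illustration of the idea, a minimal multiplicative decay model of this kind could look like the following; the function name and the sensitivity parameter are assumptions for this sketch, not the repository's actual implementation.

```python
# Illustrative only: a hypothetical multiplicative decay model for deriving
# noisy metrics from the base (0% noise) value. The repository's actual
# perturbation models may differ in form and parameters.

def modelled_metric(base_value: float, noise_level: float, sensitivity: float) -> float:
    """Scale a base metric down as noise increases.

    sensitivity controls how quickly the metric degrades; higher = less robust.
    """
    factor = max(0.0, 1.0 - sensitivity * noise_level)
    return round(base_value * factor, 3)

# Example: a model losing ~25% of its base mAP at 60% noise corresponds to
# a sensitivity of about 0.42 under this decay model.
for noise in (0.0, 0.1, 0.3, 0.6):
    print(noise, modelled_metric(0.120, noise, 0.42))
```

With these assumed parameters, the base value 0.120 decays to 0.090 at 60% noise, matching the modelled trend reported below for YOLOV-SwinBase (Gframe=8) on XS-VID.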

📈 Result Evaluation

🔬 Key Observations

| Observation | Description |
|-------------|-------------|
| Gframe=8 most robust | YOLOV-SwinBase Gframe=8 consistently achieves the highest robustness under all noise conditions. |
| Temporal context crucial | Lower Gframe values (e.g., 2) reduce noise mitigation ability. |
| YOLOX architectures degrade fastest | They lack strong temporal aggregation → up to 60–70% mAP drop on XS-VID at 60% noise. |
| Small-object dataset (XS-VID) harder | All architectures perform worse due to resolution & object scale. |
| Tiny models least stable | YOLOX-SwinTiny models show the lowest baseline and the strongest degradation. |

📊 Quantitative Findings

VisDrone

  • Gradual performance degradation.
  • YOLOV-SwinBase (Gframe=8): only ~20% drop in mAP50-95 at 60% noise.
  • YOLOX-Tiny suffers >50% degradation.

XS-VID

  • Lower starting performance.
  • Noise amplification is stronger:
    • mAP50-95 for YOLOX-SwinTiny (w=1): 0.051 → ~0.02 at 60% noise
    • YOLOV-SwinBase Gframe=8: 0.120 → 0.090 at 60% noise

📉 Example Trends (synthetic simulation)

YOLOV-SwinBase (Gframe=8) – XS-VID:
  mAP@50-95: 0.120 → 0.090 at 60% noise

YOLOX-SwinTiny (w1) – VisDrone:
  mAP@50:     0.136 → 0.080

YOLOX-SwinTiny (w1) – XS-VID:
  mAP@50:     0.103 → 0.040

๐Ÿ–ผ๏ธ Qualitative Analysis

  • Temporal architectures better maintain object tracking continuity.
  • YOLOX models frequently miss or falsely detect objects when noise is applied.
  • Worst failures involve small objects disappearing under perturbation.

โฑ๏ธ Inference Time

| Model | Noise Impact on Latency | |-------------------------|-------------------------| | YOLOV-SwinBase Gframe=8 | +10% @60% noise | | YOLOX-SwinTiny | Minimal change |

โš ๏ธ More robust models are slightly slower due to temporal feature aggregation.

🧾 Final Conclusions

✔ Temporal attention significantly improves robustness to perturbations.
✔ Larger Gframe values yield greater stability under noise.
✔ Small-object datasets (XS-VID) require enhanced object-scale sensitivity.
✔ Non-temporal Tiny models should only be used in clean settings.

📜 Licensing

| Component | License |
|--------------------|-----------------|
| Results – VisDrone | CC BY-NC-SA 3.0 |
| Results – XS-VID | MIT License |
| Code | MIT License |