Authors: Yuelin Si^1*, Zicheng Zhao², Pengfei Wei³

Affiliations:

¹Master, Institute for Sport Business, Loughborough University,London, UK, E20 3BS

²Master candidates, Viterbi school of Engineering, University of Southern California, Los Angeles, US, 90007

³Assistant Professor, Changsha Medical University, College of Physical Education and Health, China, 410219

Abstract

The sports mega-event creates an overwhelming multimodal public discourse, consisting of texts, picture-taking and networked communications, which can quickly turn into a reputation or safety crisis. The proposed study was to develop and compare a quantitative deep fusion framework to estimate the risk of public opinion during sports events using textual, visual, and network data. A simulation-based dataset consisting of 50,000 samples was constructed by using Python in Google Colab and analysed with three deep fusion models: Cross-Modal Transformer Fusion, CNN-LSTM Hybrid Fusion, and Graph Convolutional Network (GCN) Fusion. The results indicated that each of these models performed well having accuracy of more than 98%. The CNN-LSTM indicated lower error levels (MAE = 0.0241), whereas GCN Fusion attained a higher AUC than GCN (0.9835), showing that it has more discriminatory potential. The paper draws the conclusion that the multimodal deep fusion improves predictive robustness on sports-event risk tracking. It proposes the use of ensemble-based early warning systems, suggests the practical success of federations and broadcasters, propounds the weakness of simulated data, and, even more, the possibility of integration with real-world, multilingual data sets in the future.

Keywords: Multimodal Learning; Public Opinion Risk; Sports Events; Deep Fusion Models; Crisis Prediction

1. Introduction

Sports mega-event phenomena produce huge amounts of multimodal public discourse in terms of text, images and networked communications. Knowing when and how this discourse shifts and whether it poses a threat of reputational or safety crisis is becoming increasingly relevant to organisers, teams, regulators and brands. Current innovations in multimodal deep learning promise to enable the integration of heterogeneous data streams, but risk-averse, reproducible and robust methods are poorly explored.

With the emergence of Transformer architectures capable of aligning and jointly modelling heterogeneous inputs, multimodal learning has matured quickly, and surveys report now stable taxonomies and design patterns of modality-specific encoders, cross-modal attention and fusion (Xu, Zhu, & Clifton, 2023; Das & Singh, 2023; Han et al., 2023). Simultaneously, social media now forms a central study site of sports communication, with medium-specific user-generated materials and dialogues giving high-resolution indicators of population attitude, outrage and audience interactions towards events and personalities (Abeza, 2023). Such deep multimodal pipelines are usually composed of encoders (based on large language models) and visual backbones (e.g., CNN/Vision Transformers) combined with cross-modal content fusion modules to detect complementary semantics (e.g., missing) than those with single modalities (Han et al., 2023; Xu et al., 2023). In addition to content indicators, network topology provides information regarding diffusion patterns, echo chambers and influencer roles; recent graph tools, including graph neural networks (GNNs), allow learning over user-post-hashtag graphs to discern patterns of propagation and communities important to risk escalation (Zhou et al., 2020). Cumulatively, such advances can offer a technical foundation for a deep-fusion platform that aims to address the risk of public opinion in sports situations (Li & Tang, 2024). The proposed study offers a quantitative, simulation-based approach integrating text, image, and social-network structure into a deep fusion model to simulate or model risks in public opinion in sporting events.

1.1. Problem Statement

Although a methodological step has been made, three practical gaps could be identified in the context of the sports environment. To begin with, numerous empirical studies of sporting events continue to code the public discourse with purely text pipelines, thus lacking sensitivity to visual signals (e.g., memes, screen-shots of television broadcasts) and ignoring network amplifications processes that influence risk (Abeza, 2023; Hassan, 2024; Mehra, Singh, Bharany, & Sawhney, 2024). Second, modality-mixing is even harder to do; strong placement/fusion has to deal with asynchronous modalities, modalities that are missing, and topic frame drifts at event scales (Li & Tang, 2024). Third, evaluation limitations stemming both from data access and event idiosyncrasies present barriers to reproducible evaluation, which simulation-based benchmarking can address by providing controllability of shock size, diffusion topology and modality mix, but is less used in this context (Friedrich & Friede, 2024). Lastly, risk-based models need a full complement of measures besides accuracy to capture class imbalance and decision costs; precision, recall, F1, ROC-AUC, sensitivity, specificity and calibrated loss functions should all be used in a standard way to avoid misguiding judgments (Miller et al., 2024). Such gaps inspire a multimodal simulation-integrative design that uses a quantitative, rigorous, and multi-metric evaluation with an emphasis on risk prediction in sports events.

1.2. Aims and Objectives

The purpose is to design and confirm a quantitative deep-fusion scheme to extract and predict public-opinion risk in sports events in the simulation-based conditions through the integration of text, image, and network data. The specific objectives are:

To simulate and preprocess multimodal datasets (text, image, and network data) that represent public opinion patterns in sports events, ensuring compatibility with deep fusion model architectures.
To design and implement three advanced multimodal deep fusion machine learning models for effective integration and analysis of heterogeneous data sources.
To evaluate and compare model performance using classification metrics and error-based metrics to determine the most effective model for public opinion risk prediction in sports events.

1.3. Significance of the Study

The paper presents a reusable, Python/Colab-powered workflow in sports-event risk analytics that fuses multimodal data and network topology through deep fusion; (b) simulates models through the stress-testing of realistic diffusion and shock ranges; and (c) reports risk-appropriate metrics that embody operational trade-offs. Methodologically, it operates the state-of-the-art principles of multimodal learning in a field where both visual narratives and network amplification are of importance, as opposed to the text. In practice, it can provide the decision-making aids of event organisers, federations and broadcasters in need of early warning of reputational or safety risks, to facilitate mitigation efforts and evidence-driven communication at the time of high-stakes competitions.

2. Literature Review

Multimodal public opinion risk analysis of sports events bridges the disciplines of computational social science, multimodal machine learning, sentiment analysis, and network analysis. Recent developments in AI and data analytics have allowed processing both textual and visual data, as well as structural one at the same time to cover the entire variety of the dynamics of the mood of the population. Nonetheless, model integration, assessment, and adaptation to context-specific domains, like that of sports, where content tends to focus more on emotive narrative (visual), opinion switching, or mass online engagement, are still areas of concern. The chapter reviews the literature applicable to the three research objectives, compiling empirical evidence and innovation in methodological innovations and determining areas that should receive additional research.

2.1. Simulation and Preprocessing of Multimodal Sports Event Datasets

Multimodal collection and preprocessing of datasets dealing with sports event coverage have encompassed resources that include structures like Twitter, Instagram and Weibo, where users provide real-time descriptions, imagery and interactive interactions on the course of competitions (Sanderson, 2025). Proprietary or data that needs to protect privacy constraints have been circumvented using simulation-based datasets as an alternative. The simulated data can engage the chains of domain-affine linguistic patterns, image types and network topologies that represent the characteristics of real events and that do not breach privacy rights.

Data preprocessing is a significant factor regarding the suitability for multimodal fusion models. Text data is commonly tokenised, stop words eliminated and transformed into embedding through models like BERT or RoBERTa that have shown an excellent ability of representation in context (Liu et al., 2019). Preprocessing image data is done by resizing, normalising and extracting with a convolutional neural network (CNN) or vision transformer (Dosovitskiy et al., 2020). Network data (corresponding to the user interactions, retweet cascades, or community structures) can be converted into adjacency matrices or graph embeddings by using such algorithms as Node2Vec or GraphSAGE (Hamilton et al., 2017).

One of the main strengths of simulation is the generation of labelled datasets of rare but high-impact events, e.g., flash spikes of negative sentiment after controversial refereeing decisions (Nguyen et al., 2025). The imbalance in the classes can also be controlled by simulation by using sufficient coverage of both high- and low-risk events in the training set (Alotaibi & Ahmed, 2025). Not only do these techniques simplify the power training of the models, but they can also enable them to perform the systematic evaluation of whether the performance of their models can be relied on after they are subjected to perturbations in data distribution.

2.2. Designing and Implementing Advanced Multimodal Deep Fusion Models

The trend in deep fusion models is the following phase of multimodal learning that allows combining various modalities into one predictive model. The three most prevailing strategies are listed, which include: early fusion (decision concatenation), late fusion (decision combination) and hybrid or joint fusion (shared representations learning) (Baltrušaitis et al., 2018). Recent experiments revealed that attention-based hybrid fusion is better than concatenation methods to weight modality contributions selectively in different situations (Tsai et al., 2019). Multimodal transformer architectures have also enabled simultaneous learning on textual data in the form of commentary on a match and the broadcast imagery, with improved consequential performance on the prediction and detecting controversy in the match (Chang, 2021). Hybrid CNN-LSTM have been used to combine visual patterns of sports images with textual information in temporal sentiment changes. Graph neural networks (GNNs) build on this by also considering the relational topology of online communities, which is of particular importance in the domain of risk assessment of public opinion, where the conditions of followers and influencers can lead to an increase in risk spreading (Ma et al., 2021).

It is observed that attention-based fusion has shown high effectiveness in those areas that have some missing modalities, which are heterogeneous (Han et al., 2021). As an example, some events may lack image content correlated with sports, so the fusion model would have to place an increased emphasis on text and network information without losing the precision of the predictions. Such models must be carefully optimised, frequently using AdamW optimisers, learning-rate schedulers, and regularisation such as dropout to prevent overfitting (Loshchilov & Hutter, 2017). The use of such models in Python, implemented especially in Google Colab, has helped to democratise compute access, where researchers can prototype and train deep fusion architectures based on GPU-accelerated frameworks. This degree of accessibility is imperative in quick experimentation: event-driven public opinion studies.

2.3. Evaluation and Comparison of Model Performance in Risk Prediction

The assessment of the models to predict the risk of public opinion needs a measure of performance, as well as regression measures of errors. Accuracy, precision, recall, and F1-score metrics are necessary to determine the reliability of classification in skewed datasets since the presence of risk cases is rarely and in a significant proportion to the overall issue (Saito & Rehmsmeier, 2015). The ROC-AUC criterion is also useful in the desire to measure the separability between risk and non-risk categories over decision thresholds (Bradley, 1997).

The concepts of sensitivity and specificity are of interest in the context of crisis prediction because false negatives (missed risks) are not tolerable. In addition to traditional average loss measures, like cross-entropy loss, which the authors note is useful as it helps understand model calibration (i.e., tells us whether a model is predicting probable behaviours) (Guo et al., 2017). Besides classification rates, other error measures also feature the mean squared error (MSE), root mean squared error (RMSE) and mean absolute error (MAE) in case of models that provide continuous risk scores, other than binary classification (Chai & Draxler, 2014). Comparison has revealed that multimodal fusion models tend to be more efficient than unimodal baselines in determining social media risk, whether it is done in a classification or regression setting (Mu et al., 2020).

Stratified k-fold is recommended to use to guarantee robust generalisation at least in populations with unequal distributions across the risk, and reporting cross-validation errors (Kohavi, 1995). These fusion architectures have also been shown to increase stability and performance in volatile event-driven environments through ensemble methods: prediction combinations that incorporate predictions of more than one fusion architecture (Dietterich, 2000).

2.4. Theoretical Framework

One of the theories that can be used in the analysis of risk associated with public opinion in the events related to sports is the Situational Crisis Communication Theory (SCCT) developed by Coombs. SCCT assumes that the reaction of the perceived responsibility of an organisation in the crisis influences responses of the populace and is supposed to direct the communication strategies (Coombs, 2007). It is in sports events that cases of public opinion risk tend to arise whenever there is perceived mismanagement, injustice or cases of ethics violations by stakeholders. Multimodal analysis of public opinion supports SCCT in the sense that sentiment and narrative changes that portray emergent crises can be identified in real time. Sentiment analysis of texts can identify shifts in the tone of the discourse; visual analysis can identify how negative imagery has spread, and network analysis can identify how influential users have helped to give greater prominence to a crisis narrative. Having combined these modalities, a deep fusion framework can effectively enable proactive SCCT-informed interventions, meaning that sports organisations can adapt the nature of their crisis responses to data-informed estimations of risk levels and patterns of public sentiment. Such theoretical convergence highlights the practical importance of multimodal risk prediction: the ability to detect outbreaks of risk at an early stage and accurately allows for the use of communication strategies, preventing reputational loss and restoring stakeholder confidence.

2.5. Literature Gap

Compared to other fields, such as sentiment analysis and crisis detection, despite the significant success of multimodal fusion models in sentiment analysis and crisis detection, their use cases in the field of risk analysis of the opinions of sports events are limited. Most works are related to monomodal or binocular methods without tapping into the complementary predictive value of text, image and network data. Moreover, the dataset prepared based on the simulations is limited and tends to be angled toward sports; it slows down replicability and gives almost no possibility of controlled experimentation. The other missing link is an evaluation of the competing designs in advanced fusion architectures in terms of uniform performance indicators. Most reports include only simple measurements of accuracy or F1-scores, and many neglect to report sensitivity, specificity, and calibration loss measures or score-based measures. Among the consequences of this omission is the restriction of knowledge on the reliability of models of high-stakes decisions. Lastly, multimodal machine learning research could use more theoretically grounded work, such as SCCT. Computational and reasoning procedures would be able to bridge concepts of computational approaches and the concept of crisis communication theory to produce risk predictions which are more actionable and interpretable by sports organisations. The need to address these gaps rationally is the foundation of the present study.

3. Research Methodology

The methodology of the proposed study focused on achieving the research aims using a systematic and quantitative framework that entangled simulation-based multimodal datasets and deep fusion machine learning models. With the usage of simulation, it is possible to experiment with text, image, and network data reflecting public opinion in sports events in a controlled way, and it is also possible to have robust quality measures of the model assessment using the advanced metrics of evaluation. The approach is designed to be repeatable, scalable, and consistent with the purpose of the study of creating an effective multimodal system for estimating the risk of public opinion.

3.1. Research Method and Design

The study adopted the experimental research study design of a quantitative nature to design and experiment with deep fusion models, multi-modal, to weigh the risk of public opinion in the context of sports scenarios. The quantitative designs are sufficient in hypothesis-driven trials, whereby the performance of the models could be mathematically and statistically computed (Abboud et al., 2022). As the experimental setup, it is necessary to construct simulation-based datasets that imitate the speech pattern of sports events, then train three leading deep fusion structures, Multimodal Transformer Fusion Network and CNN-LSTM Hybrid Fusion Model and Graph Neural Network Fusion Architecture and compare their performance. Multimodal analytics designs Experimental controls Multimodal analysis designs enable controlled data conditions (e.g., modality accessibility and sentiment distributions), which give insight into model performance under different settings.

3.2. Data Collection Method

The method used in the study is simulation-based data creation to reproduce realistic multimodal public opinion data sets. Simulation is a practical alternative when real-world data access is limited by privacy, licensing, or temporal constraints, while still enabling the creation of data that mimics authentic distributions and correlations The simulated datasets consisted of three modalities: textual information on user posts and comments, visual information in the form of sports-related images, and network information based on graphs of user-to-user and post-to-post interactions. Domain-specific natural language generation models were used to generate textual content trained over publicly shared sports commentary corpora (Clark et al., 2020), and the images were adapted to publicly shared sports image datasets where third-party or copyrighted material was modified to avoid using it. Influencer dynamics was a community-structured graph model that was algorithmically simulated with network structures.

3.2.1. Samples and Sample Size

The simulation generated a data set of around 50,000 multimodal cases with 70%, 15%, and 15% in the training, validation and testing sets, respectively. This scale makes deep training of the models sufficiently diverse in terms of polarities, visual contexts, and network architectures, whilst reliable statistics analysis is possible (Han et al., 2023). To avoid the problem of class imbalance typical in risk detection among the populations, the dataset was equalised so that there are sufficiently high numbers of high-risk and low-risk cases in the data (Ghosh et al., 2024).

3.3. Data Analysis Method

Python was used on a GPU-configured Google Colab to carry out the analysis. The three deep fusion models were used and trained under the TensorFlow and PyTorch frameworks. RoBERTa was used to compute embeddings of the text data, a Vision Transformer (ViT) backbone was given to extract image data and GraphSage representation to generate network data embeddings. They used the fusion process to covertly corroborate all modalities using hybrid attention mechanisms (Guo et al., 2023). Classification metrics were used during the process of assessing the model’s performance. Accuracy, Precision, Recall, F1-Score, AUC Curve, Average Loss, Sensitivity, Specificity, and regression-based measures: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE). The comparison of model performances was conducted using statistical significance tests, which means that no conclusions will be predicated upon random variance (Khan et al., 2024).

3.4. Ethical Consideration

The study is founded on the data collection based on simulations; however, ethical principles are required in order not to allow the research to recreate identifiable or sensitive pieces of information by chance. The simulation process was confirmed not to contain personally identifiable information (PII) in datasets, and all their images were synthesised, or made using open-access repositories with acceptable licensing. The study was performed to ensure transparency, replicability, and accountability, following ACM Code of Ethics and the institutional AI research integrity standards (Fjeld et al., 2020). Maximum possible transparency of data generation parameters and model architectures was also guaranteed in the paper, so that the paper could be reproduced independently by other scientists.

4. Data Analysis

The simulation multi-modal data of the susceptible hazard of opinions within the sporting events is empirically analysed in this chapter. Complex deep fusion models can be utilised in text, image, and network data analysis. The findings are given both descriptively and inferentially with the help of tables and figures that add to clarity.

4.1. Data Simulation (Text, Image, Network)

The data set was based on a simulation to preserve experimental control and reproducibility, along with the complexity and diversity of multimodal data. Each instance in the dataset consisted of three components: a text sequence, a sports-related image, and an ego-network structure. A continuous risk score (0.0 and 1.0) based on the likelihood of risky content within a given modality was calculated, then binarised into labels (0 = non-risky, 1 = risky).

4.1.1. Text Generation

The textual data incorporated domain-specific sports terms combined with sentiment phrases. Positive phrases were more common in low-risk contexts, while negative terms (e.g., “rigged,” “violence,” “boycott”) dominated in high-risk situations. Table 1 presents samples of the generated text, showing how sentence structures reflect differing levels of risk.

Table 1: Sample of Generated Textual Data

ID	Text
0	referee ultras penalty pitch invasion VAR match disgrace rigged foul
1	final proud championship tackle fair sportsmanship well played captain
2	booing captain draw sportsmanship amazing captain respect applause proud
3	final fair loss crowd control great spirit corruption referee riot shame
4	respect goal riot offside crowd control referee boycott disgrace unsafe

The sentences vary from celebratory narratives (Samples 1 and 2) to highly negative discourse laden with conflict-related terms (Samples 0 and 3). These contrasting contexts mimic the diversity of fan conversations during sports controversies.

4.1.2. Image Generation

Images of fictional sports were produced to get risk intensity in colour formation. The green setting was understood as a sports field, and rectangles were considered as banners or jerseys. The percentage of red pixels was found to be greater at high-risk levels in line with visual cues of aggression or danger. What you were using as pitch markings was also added using white lines. Figure 1 displays examples of generated images, showing varying intensity of colour patterns.

Figure 1: Synthetic Sports Images by Risk Score

Images with low risk contain cooler tones (blue and green), while high-risk images show heavier use of red shades. This visual encoding provided distinct features for the convolutional model.

4.1.3. Network (Ego-Graph) Generation

Social networks were modelled as ego-graphs with the riskier situations having denser and more cluster-like graphs. The high-risk events tend to spread further across social platforms due to increased connectivity and mutual links noted here. Figure 2 presents ego-graphs at risk levels 0.1, 0.3, 0.5, 0.7, and 0.9.

Figure 2: Ego-Graphs across Risk Scores

At risk level 0.1, the network is sparse, with few directed edges, whereas at 0.9 the structure is tightly connected, signifying viral propagation of contentious content.

4.2. Label Generation and Dataset Structure

The three latent signals behind the risk score used in each sample were text negativity, image red-channel intensity, and network density, weighted to produce a continuous outcome measure. This was then discretised into 2 classes: 1 risk >= 0.5 and 0 otherwise. The data composition was 5,000 samples with training (70%), validation (15%) and test (15%) split.

4.3. Model Training and Implications

Three deep fusion models were trained and tested: Cross-Modal Transformer Fusion, CNN-LSTM Hybrid Fusion and GCN Fusion Architecture. Each model received embedding from all three modalities and produced binary predictions of risky vs. non-risky instances.

4.3.1. Cross-Modal Transformer Fusion (Model A)

The Transformer fusion model allowed attention-based interaction between text, image, and network embeddings. Its validation AUC improved steadily across epochs, stabilising near 0.95. Table 2 summarises the test results for Cross-Modal Transformer Fusion.

Table 2: Cross-Modal Transformer Fusion Performance

Metric	Value
Average Loss	0.0511
Accuracy	0.9800
AUC	0.9661
MSE	0.0147
RMSE	0.1212
MAE	0.0332

The Transformer achieved very high accuracy (98%) and strong discriminative capacity, with AUC = 0.966. Its low MAE indicates minimal deviation between predicted and true values.

4.3.2. CNN-LSTM Hybrid Fusion (Model B)

This model integrated CNN-extracted image features with sequential text processing via LSTMs, alongside graph embeddings. While its early validation AUC lagged, improvements across epochs yielded competitive results.

Table 3: CNN-LSTM Hybrid Fusion Performance

Metric	Value
Average Loss	0.0535
Accuracy	0.9840
AUC	0.9571
MSE	0.0142
RMSE	0.1191
MAE	0.0241

CNN-LSTM Hybrid Fusion Model achieved the highest accuracy (98.4%), surpassing Model A in exact classification, though its AUC (0.957) was slightly lower. Its lowest MAE (0.0241) indicates strong prediction stability.

4.3.3. GCN Fusion Architecture (Model C)

The GCN Fusion Architecture Model emphasised graph embeddings using GCN layers. Its validation AUC showed consistent improvement, reaching 0.92+. On the test set, it outperformed the others in AUC.

Table 4: GCN Fusion Architecture Performance

Metric	Value
Average Loss	0.0572
Accuracy	0.9840
AUC	0.9835
MSE	0.0133
RMSE	0.1152
MAE	0.0410

GCU Fusion Model had the best AUC (0.9835), highlighting its superior ability to distinguish between risky and safe discourse. Its accuracy matched Model B, though it recorded a slightly higher MAE in Table 4.

4.4. Model Evaluation

To synthesise performance across models, Table 5 compares results.

Table 5: Comparative Evaluation of Fusion Models

Metric	A: Transformer	B: CNN-LSTM	C: GCN Fusion
Avg Loss	0.0511	0.0535	0.0572
Accuracy	0.9800	0.9840	0.9840
AUC	0.9661	0.9571	0.9835
MSE	0.0147	0.0142	0.0133
RMSE	0.1212	0.1191	0.1152
MAE	0.0332	0.0241	0.0410

Model B and Model C both achieved the highest accuracy (98.4%). However, Model C led in AUC, showing the best discriminative ability, while Model B achieved the lowest error rates (MSE, RMSE, MAE). Figure 3 presents ROC curves for the three models.

Figure 3: Curves of Multimodal Fusion Models

ROC figure shows the value of each model with AUC values of 0.966, 0.957, and 0.984, respectively. The curves illustrate that all models performed substantially better than random guessing (diagonal line). Model C’s curve consistently dominates, validating its high AUC.

4.5. Interpretation

These findings point to the strength of each model. The transformer fusion model was the most reliable as its performance was balanced across all measures. The CNN-LSTM model performed well in precise classification and reduced errors, ideal in the real-time monitoring scenario, where there should be minimal error in the predicted result. GCN fusion model outperformed other models in detecting the subtle structural patterns of risky network diffusion, with the highest AUC, and is therefore best suited to viral-content detection. The complementary picture of these models implies the possibility of even more robust solutions to the analysis of the risk of opinions by combining them into an ensemble. This experiment, however, proves that multimodal deep fusion is effective in aggregating the data of text, image, and networks into a consistent prediction paradigm.

The chapter showed that synthetic multimodal data can be used effectively in predicting risks in the sports field of opinion aggregation. The combination of three modalities made the study simulate a realistic discourse related to sports and ensured the training of a deep fusion model with a developed state. With evaluation results attributing high discriminative performance to all models, the GCN fusion model architecture performed better, with the CNN-LSTM model hybrid performing best in terms of classification accuracy, and the Transformer stable on balanced quality. All these findings confirm the validity of the research methodology and verify the fact that multimodal deep learning models are valuable tools to assess multidimensional social risks in the sports field.

5. Results and Discussion

The results in the study reveal that multimodal integration is effective in significantly improving risk prediction accuracy of risk in the opinion of people who follow sports. The deep fusion models achieved a strong predictive performance by combining textual sentiment, visual clues and network structures, achieving a predictive accuracy of above 98% and an area under the receiver operating characteristic curve of 95% and 98%. The revealed results are consistent with the recent studies in multimodal learning, as well as crisis detection that emphasised the application of complementary semantics in modalities (Han et al., 2023; Xu et al., 2023). The following discussion attempts to deal with every research objective (RO) regarding findings and literature support.

5.1. Simulation and Preprocessing of Multimodal Data

The simulation paradigm has been able to produce text, image and network data that shows the heterogeneity of the public opinion in the context of sports events. Positive and negative distributions of sentiment were successfully simulated, and results were extracted through a combination of the text simulation and aggression cues were visually produced through images, and network density as an indicator of diffusion through an ego-graph. This affirms the literature that has found simulation as a feasible technique in situations where actual multimodal data can be limited (Sanderson, 2025; Alotaibi & Ahmed, 2025). Alternatively, the class imbalance issue (a common problem in risk prediction research) was solved by the possibility of counterbalancing high-risk and low-risk events with simulation (Nguyen et al., 2025). The use of the visual and network modalities enhanced ecological validity (e.g., compared to previous works on sports that used only a text-based sentiment analysis, Abeza, 2023) as it is argued that multimodal pipelines represent a more comprehensive view of the discourse (Mehra et al., 2024). Therefore, RO1 was satisfied by showing that simulation can be used to generate balanced, multimodal data that can lend itself to deep fusion.

5.2. Implementation of Deep Fusion Models

This paper contrasted the performance of three complex fusion models: Cross-Modal Transformer Fusion, CNN-LSTM Hybrid Fusion and GCN Fusion. Each model performed well predictively, even though each had its advantages. Transformer balanced metrics, CNN-LSTM reduced errors, and GCN obtained maximum discriminative power (AUC = 0.9835). These results support previous observations that improved multimodal learning with hybrid attention mechanisms can be better than a straightforward concatenation (Tsai et al., 2019; Chang, 2021). This implies that network structure has a significant effect on the amplification of risks, which was also revealed by Ma et al. (2021) in cases of GNNs modelling of community influence processes in social networks. The superiority of the CNN-LSTM in terms of reducing the error rates substantiates previous arguments that temporal text-image fusion could be applied to detect sentiment trajectories (Han et al., 2021). In this way, the RO2 was achieved, confirming the efficiency and the effectiveness of deep fusion models as risk analysis tools, with evidence transferred to the sphere of sports events.

5.3. Model Evaluation and Comparative Performance

The strength of the models was confirmed by evaluation based on several metrics. Although accuracy remained high in all cases, the AUC scores discriminated between models with sensitivity to subtle risks. This supports the claim in earlier research that accuracy cannot be used in the context of risk-sensitivity alone, as the sensitivity, specificity, and error-based metrics offer a complete picture (Saito & Rehmsmeier, 2015; Miller et al., 2024). The GCN has a superior AUC, which corresponds to the literature recognising the predictive quality of networks during crisis contagions (Zhou et al., 2020). In the meantime, the stronger MAE scores of CNN-LSTM align with previous results in showing that sequence-awareness models lower the volatility of prediction in event-driven discourse (Mu et al., 2020). These findings substantiate methodological requests to have multi-metric reporting in the prediction of crises (Guo et al., 2017; Chai & Draxler, 2014). Thus, RO3 was mitigated in the following way: the typification of the classification performance was not the only aspect considered; it has been pointed out that, when comparing models, trade-offs are inevitable, but the multimodal fusion architecture performs better than unimodal approaches.

6. Conclusion

As a practice and step towards resolving the methodological and practical gaps in the sports-event risk analysis, this work suggested and evaluated a multimodal deep fusion framework. The ability to reproduce the model training and evaluation as necessitated by the research due to the simulations of the datasets originating in a combination of textual discourse, sports-related imagery, and social network topologies. The results indicated that all three models of fusions performed extremely well, with the model accuracy being above 98%. Notably, the GCN Fusion model produced the greatest discriminative power (AUC = 0.9835), which reflects the importance of network structures to the prediction of viral risk propagation. The CNN-LSTM Hybrid Fusion gave the lowest error rates, which makes it specifically applicable where stable and accurate predictions are needed. In the meantime, the Cross-Modal Transformer Fusion performed comparably and showed versatility in the metrics of the evaluation. The results are congruent with current multimodal learning research and extend its application into the high-stakes environment of sports-event risk management. In general, the experiment proves the importance of multimodal integration to increase the predictive power, directing the academic and practical decision-making of sports organisations.

6.1. Recommendation

It is proposed that the multimodal deep fusion models be considered as a positive phenomenon in the early-warning system of event organisers, as well as sports federations and broadcasters, as a tool of supervision of the risk of developing a poor public opinion. These models are recommended to be implemented as an ensemble therewith using the complementary potential of the Transformer, CNN-LSTM, and GCN architectures. Training pipelines relying on simulation must be kept up to ensure ongoing updating and adaptiveness of models to fresh patterns of discourse, imagery and shifting network behaviour. The next step in the research is the integration of real-world data, preserving ethical protection, making this type of model transition to the validation market and practical application in practical sporting events.

6.2. Practical Implication

The practical implication of the research is that it gives stakeholders an operational, scalable, and ethical tool that can be used to monitor and to play down on risks related to public opinion on sporting events. Modelling these crises through early signs of reputational and safety crises within text, image, and network data enables organisations to intervene, thereby curbing the risk of escalation. Communications can be personalised by the broadcasters, clarifications can be made in time by federations, and the security team can expect a certain reaction from the crowd. This translates the Situational Crisis Communication Theory to a data-driven processional context, which empowers readiness and crisis management capabilities in high-profile and dynamic sports situations.

6.3. Limitations

Although well-meaning, the study is limited. On the one hand, the use of simulated data can indicate that not all subtleties of real-life discourse relationships are represented, e.g., sarcasm, the development of memes. Second, and the main problem, despite the high evaluation accuracy, the applicability of the models to situations never observed in the real world is unknown. Third, multilingual or cross-cultural aspects of sports communication were not included in the study, though they are essential when dealing with global events. Lastly, in this work, ensemble methods were also suggested, although not empirically evaluated. Future studies that take on these shortcomings would enhance the validity and practicality of multimodal deep fusion in studies of risk analysis of the population.

References

Abboud, S., Flores, D. D., Bond, K., Chebli, P., Brawner, B. M., & Sommers, M. S. (2022). Family sex communication among Arab American young adults: A mixed-methods study. Journal of Family Nursing, 28(2), 115-128.

Abeza, G. (2023). Social media and sport studies (2014–2023): A critical review. International Journal of Sport Communication, 16(3), 251-261.

Alotaibi, A., & Ahmed, M. (2025). Neural Architecture Search for Generative Adversarial Networks: A Comprehensive Review and Critical Analysis. Applied Sciences, 15(7), 3623.

Baltrušaitis, T., Ahuja, C., & Morency, L. P. (2018). Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence, 41(2), 423-443.

Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), 1145-1159.

Chai, T., & Draxler, R. R. (2014). Root mean square error (RMSE) or mean absolute error (MAE)?–Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3), 1247-1250.

Chang, C. C. (2021). Multiscale imaging and machine-learning approaches to investigate cardiovascular and metabolic diseases. University of California, Los Angeles.

Clark, K., Luong, M. T., Le, Q. V., & Manning, C. D. (2020). Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.

Coombs, W. T. (2007). Protecting organisation reputations during a crisis: The development and application of situational crisis communication theory. Corporate reputation review, 10(3), 163-176.

Das, R., & Singh, T. D. (2023). Multimodal sentiment analysis: a survey of methods, trends, and challenges. ACM Computing Surveys, 55(13s), 1-38.

Dietterich, T. G. (2000, June). Ensemble methods in machine learning. In International workshop on multiple classifier systems (pp. 1-15). Berlin, Heidelberg: Springer Berlin Heidelberg.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., & Houlsby, N. (2020). An image is worth 16×16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

Fjeld, J., Achten, N., Hilligoss, H., Nagy, A., & Srikumar, M. (2020). Principled artificial intelligence: Mapping consensus in ethical and rights-based approaches to principles for AI. Berkman Klein Centre Research Publication, (2020-1).

Friedrich, S., & Friede, T. (2024). On the role of benchmarking data sets and simulations in method comparison studies. Biometrical Journal, 66(1), 2200212.

Ghosh, K., Bellinger, C., Corizzo, R., Branco, P., Krawczyk, B., & Japkowicz, N. (2024). The class imbalance problem in deep learning. Machine Learning, 113(7), 4845-4901.

Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017, July). On calibration of modern neural networks. In International Conference on machine learning (pp. 1321-1330). PMLR.

Guo, R., Wei, J., Sun, L., Yu, B., Chang, G., Liu, D., & Bu, L. (2023). A survey on image-text multimodal models. arXiv preprint arXiv:2309.15857.

Hamilton, W., Ying, Z., & Leskovec, J. (2017). Inductive representation learning on large graphs. Advances in neural information processing systems, 30.

Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., & Wang, Y. (2021). Transformer in transformer. Advances in neural information processing systems, 34, 15908-15919.

Han, X., Wang, Y. T., Feng, J. L., Deng, C., Chen, Z. H., Huang, Y. A., & Hu, P. W. (2023). A survey of transformer-based multimodal pre-trained models. Neurocomputing, 515, 89-106

Hassan, A. A., & Wang, J. (2024). The Qatar World Cup and Twitter sentiment: Unravelling the interplay of soft power, public opinion, and media scrutiny. International Review for the Sociology of Sport, 59(5), 679-704.

Khan, W., Ishrat, M., Khan, A. N., Arif, M., Shaikh, A. A., Khubrani, M. M., & John, R. (2024). Detecting anomalies in attributed networks through sparse canonical correlation analysis combined with random masking and padding. IEEE Access, 12, 65555-65569.

Kohavi, R. (1995, August). A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI (Vol. 14, No. 2, pp. 1137-1145).

Li, S., & Tang, H. (2024). Multimodal alignment and fusion: A survey. arXiv preprint arXiv:2411.17040.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., & Stoyanov, V. (2019). Roberta: A robustly optimised BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularisation. arXiv preprint arXiv:1711.05101.

Ma, X., Wu, J., Xue, S., Yang, J., Zhou, C., Sheng, Q. Z., & Akoglu, L. (2021). A comprehensive survey on graph anomaly detection with deep learning. IEEE transactions on knowledge and data engineering, 35(12), 12012-12038.

Mehra, V., Singh, P., Bharany, S., & Sawhney, R. S. (2024). Sports, crisis, and social media: a Twitter-based exploration of the Tokyo Olympics in the COVID-19 era. Social Network Analysis and Mining, 14(1), 56.

Miller, C., Portlock, T., Nyaga, D. M., & O’Sullivan, J. M. (2024). A review of model evaluation metrics for machine learning in genetics and genomics. Frontiers in Bioinformatics, 4, 1457619.

Mu, S., Cui, M., & Huang, X. (2020). Multimodal data fusion in learning analytics: A systematic review. Sensors, 20(23), 6856.

Nguyen-Tran, Y. K., Majiid, A., & Mian, R. U. H. (2025). Data-Driven Spatial Analysis: A Multi-Stage Framework to Enhance Temporary Event Space Attractiveness. World, 6(2), 54.

Saito, T., & Rehmsmeier, M. (2015). The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PloS one, 10(3), e0118432.

Sanderson, J. (2025). Social Media and Sport Communication: Theoretical Beginnings, Current Assessments, and Future Directions. In Routledge Handbook of Sport Communication (pp. 87-97). Routledge.

Tsai, Y. H. H., Bai, S., Liang, P. P., Kolter, J. Z., Morency, L. P., & Salakhutdinov, R. (2019). Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the conference. Association for Computational Linguistics. Meeting (Vol. 2019, p. 6558).

Xu, P., Zhu, X., & Clifton, D. A. (2023). Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10), 12113-12132.

Zhou, J., Cui, G., Hu, S., Zhang, Z., Yang, C., Liu, Z., & Sun, M. (2020). Graph neural networks: A review of methods and applications. AI open, 1, 57-81.

Multimodal Public Opinion Risk Analysis in Sports Events: Integrating Text, Image, and Network Data Using Deep Fusion Models