A Real-Time Multi-Scale Spatiotemporal Transformer System for Echocardiography Video Segmentation

 
 Zeru Zhou

 Affiliation:

School of Automation, Central South University, Changsha 410083, Hunan, China

 Email: zzrcsu@csu.edu.cn 

 

Abstract: In the current clinical diagnosis and treatment of cardiovascular disease, echocardiography has become the core imaging tool for cardiac structural assessment owing to its non-invasive, real-time, and efficient nature. However, because of the noise interference, low contrast, and inter-individual variability inherent in ultrasound imaging, traditional automated segmentation methods have significant limitations in spatiotemporal modeling and real-time processing. To address this challenge, this paper proposes a real-time segmentation method for echocardiography videos based on a multi-scale spatiotemporal Transformer (MST-Transformer) system. By introducing an axial attention mechanism, an anatomy-aware multi-scale fusion strategy, and hardware co-optimization, the system achieves efficient real-time processing of echocardiography videos while maintaining high segmentation accuracy. Experimental results show that the MST-Transformer system achieves an excellent balance between segmentation accuracy and real-time performance, with significant advantages over traditional methods especially in the calculation of key clinical parameters. This study provides a practical, innovative solution for automated echocardiography segmentation and strong data support for precision medicine.

Keywords: Echocardiography; Multi-Scale Spatiotemporal Transformer; Video Segmentation; Clinical Diagnosis and Treatment

 

I. Introduction

In the modern cardiovascular disease diagnosis and treatment system, echocardiography has become the core imaging modality for evaluating cardiac structure due to its non-invasive, real-time, and cost-effective advantages (Favot et al., 2016). It is widely used in the entire diagnosis and treatment chain from primary medical institutions to tertiary comprehensive hospitals, and can dynamically capture the motion characteristics of the heart during the complete cardiac cycle, providing an irreplaceable visual basis for early disease screening, pathological mechanism interpretation, and treatment response monitoring (Popp, 1976).

Compared with CT, MRI, and other modalities, the dynamic information provided by echocardiography is particularly valuable: it can directly reflect instantaneous physiological states such as the coordination of myocardial contraction and the opening and closing of valves, making it an important cornerstone of accurate cardiovascular diagnosis and treatment (Krishnamurthy, 2009). Automated segmentation technology is of key value in echocardiography analysis; its core is the precise delineation of cardiac chamber structures such as the left ventricle and atrium, as well as myocardial tissue, providing an objective basis for quantitative analysis (Zhang et al., 2018; Tromp et al., 2022).

In clinical practice, accurate measurement of left ventricular ejection fraction relies on precise segmentation of the end-diastolic and end-systolic chamber volumes; it is the gold standard for evaluating the degree of cardiac dysfunction and guiding treatment strategies for heart failure (Tromp et al., 2022; Mele et al., 2020; Paulus et al., 2007). At the same time, automated segmentation results support quantitative analysis of parameters such as changes in myocardial thickness and abnormal wall motion, providing a quantitative scale for the graded evaluation of diseases such as cardiomyopathy and myocardial infarction (Slomka et al., 2017). In surgical planning, automated segmentation of cardiac structures before surgery assists in developing personalized intervention plans, such as precise matching of valve size during transcatheter aortic valve replacement, significantly improving surgical safety and effectiveness (Ferrari et al., 2012).

Therefore, the construction of real-time segmentation systems has become an urgent clinical need (Chen, 2023). In interventional settings, real-time segmentation results can dynamically guide the path of catheters, guide wires, and other devices, avoiding the surgical risk caused by imaging delay, such as the precise localization of target myocardium in radiofrequency ablation. At the bedside, real-time segmentation can quickly provide core cardiac function parameters for critically ill patients, supporting emergency decision-making in shock and acute heart failure, shortening the window from diagnosis to treatment, and improving prognosis. This demand for real-time performance is driving a paradigm shift in echocardiography segmentation from offline analysis to online dynamic processing (Zhai et al., 2018).

However, the inherent characteristics of ultrasound imaging pose a serious challenge to automated segmentation (Wells, 2006). The speckle noise formed by sound wave scattering during the imaging process significantly reduces the signal-to-noise ratio of the image, resulting in blurred tissue structure boundaries, insufficient grayscale contrast between myocardium and blood pool, and increased difficulty in feature extraction (Damerjian et al., 2014). At the same time, the differences in cardiac morphology, anatomical variations, and ultrasound probe angles among individuals result in highly heterogeneous imaging. Traditional segmentation methods based on prior models are difficult to adapt to these individual differences, and their generalization ability is limited (Perdios et al., 2021). These characteristics collectively constitute the fundamental obstacles to improving segmentation accuracy.

In addition, the multi-scale nature of the heart structure is one of the core challenges that constrain segmentation performance. During a complete cardiac cycle, the spatial scale of the heart chamber can reach several centimeters, while the thickness of the valve structure is only a few millimeters, and the thickness of the myocardial wall is between the two, with differences in scale of up to an order of magnitude between different structures (Huang et al., 2024). This significant scale span requires the segmentation model to have multi-scale feature perception ability, which can capture the overall shape of the heart chamber and distinguish the boundaries of small valve structures and thin myocardial walls. Traditional single scale modeling methods are difficult to balance the segmentation accuracy of different hierarchical structures.

Existing mainstream segmentation methods have obvious limitations. Static or frame-by-frame segmentation models such as U-Net and its variants focus only on the spatial features of single frames, cutting off inter-frame motion correlation; the resulting lack of temporal consistency makes it difficult to reflect the physiological laws of cardiac motion (Azad et al., 2024). Among traditional temporal modeling methods, ConvLSTM captures local temporal information through recursive structures, but its receptive field is limited to short temporal windows, making it difficult to model long-range dependencies across cardiac cycles (Mukhametkaly et al., 2023); 3D CNNs can integrate spatiotemporal information, but their parameter counts and computational cost grow rapidly with the temporal extent, making real-time processing on ordinary clinical devices difficult (Perdios et al., 2021).

Transformer technology that has emerged in recent years has demonstrated strong global modeling capability in vision, but its direct application to echocardiography video segmentation still faces bottlenecks (Deng et al., 2021). Standard Vision Transformers and Video Transformers such as TimeSformer use self-attention to capture global dependencies, and their computational complexity is quadratic in the number of spatial tokens and temporal frames. When processing high-resolution ultrasound videos, memory consumption increases sharply, making it difficult to meet the frame-rate requirements for real-time ultrasound video processing (Adhyapak & Menon, 2024). Achieving a breakthrough in computational efficiency while preserving spatiotemporal modeling capability has therefore become the core challenge in this field.

Based on this, this study aims to break through the technical bottleneck in echocardiography video segmentation with an innovative MST-Transformer framework. By introducing an axial attention mechanism and an anatomy-driven multi-scale fusion strategy, it overcomes the inherent challenges of ultrasound imaging and tightly couples spatiotemporal modeling with accurate segmentation. The experimental results further validate the method's performance in temporal consistency, segmentation accuracy, and real-time processing capability, demonstrating broad application potential in medical image analysis. This research not only provides technical support for automated echocardiography segmentation in cardiovascular diagnosis and treatment, but also opens new avenues for spatiotemporal modeling methods in medical image processing. By accurately capturing the spatiotemporal evolution of cardiac structures in dynamic imaging, the method provides more precise and timely data support for real-time clinical decision-making, with particular value in the rapid assessment of cardiac function and intervention decisions for critically ill patients. In addition, this study has academic value for deepening multidimensional data processing in medical imaging and promoting the integration of medical artificial intelligence with clinical decision-making.

 

II. Technological Base

Segmentation methods for echocardiography. Early research in echocardiography segmentation relied mainly on traditional image processing techniques: the active contour model fits boundaries through iterative evolution of an energy function, and the level set method captures structural morphology using the zero level set of a higher-dimensional function. These methods showed some effectiveness when image quality was ideal (Mazaheri et al., 2013). However, faced with the inherently low signal-to-noise ratio and blurred structural boundaries of ultrasound images, they are susceptible to speckle noise, tend to converge to local optima, and are markedly less robust in regions where myocardium and blood pool grayscales mix. In recent years, with the rise of deep learning, convolutional neural networks (CNNs) have become the mainstream method in this field owing to their powerful feature learning capability (Perdios et al., 2021). U-Net and its variants fuse semantic information and fine detail through an encoder-decoder architecture with skip connections; Attention U-Net introduces a spatial attention mechanism to reweight key regions (Wang et al., 2024), while ResUNet alleviates the gradient vanishing problem of deep networks through residual modules, achieving breakthrough progress in static-frame segmentation tasks. However, such models focus only on the spatial features of single frames and ignore the temporal dynamics of cardiac motion, producing noticeably inconsistent cross-frame segmentations, typically manifested as misidentification of key frames such as end-diastole and end-systole, which directly degrades the accuracy of computed cardiac function parameters (Azad et al., 2024; Xu et al., 2022).
To compensate for this deficiency, temporal extension methods have emerged. ConvLSTM/GRU embeds recurrent units after the CNN encoder to model short-term temporal dependencies through memory mechanisms, but its inherently sequential computation limits parallelism, and memory decay makes it difficult to cover long-range dependencies across a complete cardiac cycle (Zhang, 2024). 3D CNNs integrate spatiotemporal information into a three-dimensional volume for end-to-end processing, but their computational cost grows rapidly with the temporal extent, spatial resolution, and channel count; under hardware constraints they cannot balance segmentation accuracy against processing efficiency (Perdios et al., 2021). Overall, existing methods face a fundamental conflict between spatiotemporal modeling efficiency and long-range dependency capture, which has become the core bottleneck limiting echocardiography video segmentation performance.

Video Transformer and spatiotemporal modeling. The Vision Transformer splits an image into non-overlapping patches, converts them into a token sequence, and uses self-attention to achieve global feature interaction; its breakthrough results in natural image recognition opened a new route for video understanding (Deng et al., 2021). Native Video Transformers extend this to the spatiotemporal dimension: ViViT models video sequences with factorized spatial and temporal attention, and TimeSformer designs multiple spatiotemporal attention schemes to capture dynamic features. These models decompose video frames into spatiotemporal patches treated as token sequences, and through global self-attention build dependencies across frames and regions, demonstrating excellent performance in general video tasks such as action recognition (Adhyapak & Menon, 2024). However, their computational complexity scales with the square of the total number of spatiotemporal tokens; applied to echocardiography videos, this far exceeds available hardware capacity and is impractical for deployment. Efficient Transformer variants mitigate the problem by optimizing the attention mechanism to lower complexity, but the dimension-separation strategy inevitably loses cross-dimensional correlations, making it difficult to capture the coupling between spatial morphology and temporal phase in cardiac motion (Adhyapak & Menon, 2024). Current efficient Transformer methods have not yet established a mechanism for jointly modeling multi-scale anatomical structures and long-term dynamics in medical images, which points to a clear direction for the technical innovation of this study.

Real-time medical image processing systems. Clinical scenarios impose strict real-time requirements on medical image processing, driving rapid development of model-efficiency techniques. In model compression, network pruning reduces parameter counts by removing redundant convolution kernels and connections, while quantization converts floating-point operations to integer operations to lower computational intensity; these methods have balanced accuracy and efficiency in static image segmentation tasks (Fernandes & Yen, 2020). However, when applied to ultrasound video segmentation models with complex spatiotemporal modeling, naive compression easily impairs the extraction of key features, and segmentation accuracy drops markedly in low-contrast myocardial boundary regions. At the hardware co-optimization level, inference engines improve throughput through layer fusion, automatic kernel tuning, and efficient memory management, which can speed up pre-trained model inference several-fold; but the gains depend heavily on a lightweight underlying architecture, and for spatiotemporal models with large parameter counts and complex computation paths the marginal benefit of hardware acceleration drops sharply, falling short of clinical real-time requirements. Overall, existing real-time processing research mostly targets static image analysis; an end-to-end optimization paradigm for long, high-resolution medical video such as echocardiography sequences has not yet formed. Achieving efficient inference while preserving multi-scale spatiotemporal modeling capability remains an urgent technical challenge.

The applicability improvements of the proposed MST-Transformer architecture over traditional methods are summarized in Table 1.

Table 1 Comparison of Technical Applicability

| Method Category | Spatio-temporal Modeling Capability | Multi-scale Adaptability | Real-time Feasibility | Applicability to Ultrasound Segmentation |
| --- | --- | --- | --- | --- |
| CNN (U-Net et al.) | | | ✓✓ | Partially Applicable |
| ConvLSTM/3D CNN | | | | Moderately to Lowly Applicable |
| Standard Video Transformer | ✓✓✓ | | | Theoretically Feasible |
| Efficient Transformer | ✓✓ | | | Partially Applicable |
| Proposed MST-Transformer | ✓✓✓ | ✓✓ | ✓✓ | Highly Optimized |

 

III. Proposed Methodology: MST-Transformer System

  1. Overall system architecture

The MST-Transformer system proposed in this paper is an end-to-end processing pipeline for real-time segmentation of echocardiography videos. As shown in Figure 1, it balances accuracy and speed through the deep fusion of multi-scale feature extraction, efficient spatiotemporal modeling, and hardware co-optimization.

 

Figure 1 Schematic diagram of the overall system architecture

This architecture mainly consists of five core modules, as shown in Table 2.

Table 2 MST-Transformer Architecture Modules

| Module | Function |
| --- | --- |
| Input Preprocessing | Frame scaling, normalization, temporal sampling |
| Multi-scale Encoder | Lightweight CNN backbone (MobileNetV3) + Feature Pyramid Network |
| MST-Transformer Module | Multi-scale spatio-temporal modeling based on axial decomposition |
| Lightweight Decoder | Upsampling + encoder feature skip connections |
| Real-time Optimization Engine | TensorRT inference engine + INT8 quantization |
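The five-module pipeline above can be viewed as a simple function composition. Below is a minimal sketch with placeholder stages; the function names and stand-in operations are illustrative only, not the paper's implementation (the real encoder is MobileNetV3 + FPN, the real core is the axial Transformer):

```python
import numpy as np

# Placeholder stages mirroring Table 2; real modules would be neural networks.
def input_preprocessing(video):          # frame scaling + normalization
    v = video.astype(np.float32)
    return (v - v.min()) / (v.max() - v.min() + 1e-8)

def multiscale_encoder(video):           # stand-in for MobileNetV3 + FPN
    return {"P2": video, "P3": video[:, ::2, ::2], "P4": video[:, ::4, ::4]}

def mst_transformer(pyramid):            # stand-in for axial spatiotemporal modeling
    return pyramid["P2"]

def lightweight_decoder(features):       # stand-in for the upsampling decoder
    return (features > features.mean()).astype(np.uint8)

def run_pipeline(video):
    """Chain the modules of Table 2 end to end on a (T, H, W) clip."""
    x = input_preprocessing(video)
    x = mst_transformer(multiscale_encoder(x))
    return lightweight_decoder(x)

video = np.random.rand(16, 64, 64)       # dummy echocardiography clip
mask = run_pipeline(video)
print(mask.shape)                        # (16, 64, 64) binary mask sequence
```

The point of the sketch is the data flow: one mask per input frame, produced by a single forward pass rather than per-frame independent models.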

2. Data processing

The input preprocessing module is responsible for the standardized conversion of multi-scale features in ultrasound videos, achieving real-time processing under a ≤8 ms latency constraint while fully preserving the key spatiotemporal information of cardiac motion. Because ultrasound videos often contain redundant frames, this module balances computational efficiency through a dynamic spatiotemporal sampling mechanism: processing every frame wastes computing resources, while blind downsampling may discard key frames such as end-diastole and end-systole, degrading the accuracy of cardiac function quantification.

The core of dynamic spatiotemporal sampling is a phase-aware sampling algorithm that detects cardiac motion trajectories via optical flow to accurately identify end-diastolic (ED) and end-systolic (ES) keyframes. For videos covering complete cardiac cycles, a uniform sampling strategy maintains temporal integrity. Real-time operation relies on a lightweight RNN classifier for millisecond-level cardiac-phase judgment.
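To illustrate the phase-aware idea, the sketch below uses mean absolute frame difference as a cheap stand-in for optical-flow magnitude (the paper's actual detector uses optical flow). ED and ES are near-standstill turning points of the cycle, so keyframe candidates appear as local minima of motion energy; here this is demonstrated on a synthetic oscillating clip:

```python
import numpy as np

def select_keyframes(frames):
    """Pick ED/ES candidate frames from inter-frame motion energy.

    Assumption: mean absolute frame difference approximates optical-flow
    magnitude; ED/ES correspond to local minima of this motion signal.
    """
    motion = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))  # (T-1,)
    return [t for t in range(1, len(motion) - 1)
            if motion[t] <= motion[t - 1] and motion[t] <= motion[t + 1]]

# Synthetic clip: brightness oscillates over one "cardiac cycle" of 32 frames
t = np.arange(32)
frames = np.sin(2 * np.pi * t / 32)[:, None, None] * np.ones((32, 8, 8))
print(select_keyframes(frames))   # minima near the two motion standstills
```

On real data the detected minima would then be protected from downsampling, while intermediate frames can be thinned uniformly.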

The spatial standardization process adapts to the characteristics of ultrasound images: adaptive cropping of the cardiac ROI locates the core region, preserving boundary detail while compressing data volume, and a medical-imaging-specific processing chain further improves image quality. The anatomy-driven design of the multi-scale feature pyramid is shown in Table 3.

Table 3 Anatomical Structure Design

| Feature Scale | Resolution | Corresponding Medical Structures | Feature Enhancement Mechanism |
| --- | --- | --- | --- |
| P2 | 64×64 | Valves / endocardial boundaries | Spatial attention |
| P3 | 32×32 | Mid-myocardial layer | Channel attention |
| P4 | 16×16 | Ventricular cavity regions | Hybrid attention |
| P5 | 8×8 | Global structures | Global pooling |
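The P2-P5 scales of Table 3 can be produced by repeated 2× pooling of the finest map. A minimal sketch, assuming plain average pooling (the actual encoder uses MobileNetV3 with a Feature Pyramid Network, and each level additionally carries the attention mechanism listed in Table 3):

```python
import numpy as np

def avg_pool2(x):
    """2x2 average pooling with stride 2 on a (H, W) map."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def build_pyramid(feat64):
    """Build the P2-P5 pyramid of Table 3 from a 64x64 feature map."""
    pyramid = {"P2": feat64}
    for level in ("P3", "P4", "P5"):
        feat64 = avg_pool2(feat64)
        pyramid[level] = feat64
    return pyramid

pyr = build_pyramid(np.random.rand(64, 64))
print({k: v.shape for k, v in pyr.items()})
# P2 (64, 64), P3 (32, 32), P4 (16, 16), P5 (8, 8), matching Table 3
```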

3. Multi-scale spatiotemporal Transformer module

This module is the core innovation of the system, which achieves efficient segmentation of ultrasound videos through hierarchical spatiotemporal modeling and anatomical structure perception mechanism.

Spatial correlation forms the basis of single-frame image analysis; its core is the morphological dependency among the cardiac chamber boundaries, myocardial tissue, and valve structures within the two-dimensional plane. In a single ultrasound frame, the endocardial and epicardial boundaries of the left ventricular wall follow a specific anatomical topology, the morphological changes of the apical and basal segments are continuous, and the spatial positions of valve attachment points relative to adjacent myocardium are strictly constrained. This spatial correlation means the segmentation of each structure is not an isolated event; for example, there is an inherent mapping between the circular short-axis contour of the left ventricle and its pear-shaped long-axis profile. Ignoring such correlations easily leads to morphological distortions in the segmentation results, such as abnormal myocardial wall thickness or biased cardiac volume measurements.

Motion continuity reflects the dynamic evolution of cardiac structures across consecutive frames, with valve motion as a typical example. The opening and closing of the mitral and aortic valves follow smooth trajectories, their velocity and direction varying regularly with the cardiac phase, and the inter-frame displacement is usually confined within a physiological range. This continuity imposes strict requirements on the temporal consistency of segmentation results: insufficient inter-frame motion modeling can produce valve position jumps or myocardial motion artifacts, directly compromising the diagnosis of valve dysfunction and related lesions.

Phase periodicity appears as the rhythmic repetition of cardiac morphology within the complete cardiac cycle; the periodic evolution from end-diastole to end-systole and back constitutes the basic physiological unit. During this process, chamber volume, myocardial thickness, and motion velocity exhibit characteristic patterns of change, which provide an important basis for keyframe recognition and cross-cycle feature comparison. Accurately capturing phase periodicity improves the model's determination of cycle start and end times, laying a reliable foundation for the calculation of quantitative cardiac function parameters.

This paper decomposes global spatiotemporal attention, whose complexity is O((HWT)²), into three orthogonal axial attentions applied in sequence:

A(X) = A_T(A_W(A_H(X)))    (1)

which reduces the complexity to O(HWT·(H+W+T)). Here A_H and A_W capture the spatial variation of myocardial wall thickness, while A_T tracks the periodic motion of valve opening and closing.
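A minimal numerical sketch of the axial decomposition in Eq. (1), using plain dot-product attention along one axis at a time. This is single-head attention without the learned query/key/value projections a real module would have; it only demonstrates that attending per axis keeps the tensor shape while each score matrix is H×H, W×W, or T×T instead of (HWT)×(HWT):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(x, axis):
    """Self-attention restricted to one axis of a (T, H, W, C) tensor."""
    x = np.moveaxis(x, axis, -2)                       # (..., L, C)
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(x.shape[-1])  # (..., L, L)
    out = softmax(scores) @ x
    return np.moveaxis(out, -2, axis)

def axial_attention(x):
    """Eq. (1): sequential attention along H, W, then T."""
    for axis in (1, 2, 0):                             # A_H, A_W, A_T
        x = axis_attention(x, axis)
    return x

x = np.random.rand(8, 16, 16, 4)                       # (T, H, W, C)
print(axial_attention(x).shape)                        # (8, 16, 16, 4)
```

Each axial pass costs O(L²) per position along the other axes, which is where the O(HWT·(H+W+T)) total comes from.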

The multi-scale parallel processing and anatomy-oriented scale division are shown in Table 4.

Table 4 Scale Division Oriented by Anatomical Structure

| Scale Level | Axial Attention Configuration | Focused Medical Structures |
| --- | --- | --- |
| P₂ (64×64) | A_H enhanced | Valve micromovement |
| P₃ (32×32) | A_W enhanced | Myocardial texture changes |
| P₄ (16×16) | A_T dominant | Ventricular volume changes |
| P₅ (8×8) | Triaxial equilibrium | Global cardiac motion |

The cross scale gating fusion architecture is shown in Figure 2.

 

Figure 2 Schematic diagram of fusion architecture
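The cross-scale gated fusion of Figure 2 can be sketched as a per-pixel sigmoid gate blending upsampled coarse features with fine features. The scalar gate weights `w_g` below are illustrative stand-ins for learned parameters, and nearest-neighbor upsampling stands in for a learned upsampler:

```python
import numpy as np

def upsample2(x):
    """Nearest-neighbor 2x upsampling of a (H, W) map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def gated_fuse(fine, coarse, w_g):
    """Cross-scale gated fusion: a per-pixel gate decides how much fine
    detail vs. coarse context to keep (w_g stands in for learned weights)."""
    up = upsample2(coarse)
    gate = 1.0 / (1.0 + np.exp(-(w_g[0] * fine + w_g[1] * up)))  # sigmoid
    return gate * fine + (1.0 - gate) * up

fine = np.random.rand(32, 32)      # e.g. P3-level features
coarse = np.random.rand(16, 16)    # e.g. P4-level features
fused = gated_fuse(fine, coarse, w_g=(0.5, 0.5))
print(fused.shape)                 # (32, 32)
```

Because the gate is in (0, 1), the fused value at each pixel is a convex blend of the two scales, which is what lets the network emphasize boundary detail where needed without discarding global context.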

4. Real-time processing system optimization

This module builds a three-level co-optimization system spanning model, compilation, and hardware, keeping end-to-end latency strictly controlled while preserving segmentation accuracy. At the model level, we propose an axial-attention-specific compression technique: based on the characteristics of cardiac motion, head-dimension distillation compresses 32 attention heads to 4, preserving the four key directions of cardiac motion (longitudinal contraction, lateral expansion, rotational motion, and anatomical position maintenance); spatial-axis optimization generates dynamic computation masks from the cardiac ROI, skipping computation in non-cardiac background regions. The compilation level achieves an architectural breakthrough through an ultrasound-specialized TensorRT optimization chain, whose core innovation is axial attention fusion: a customized CUDA kernel merges the traditional three-stage computation (height axis, width axis, time axis) into a single fused operator. At the hardware level, a clinical-grade real-time support architecture is established, in which the resource scheduling system guarantees GPU utilization through CUDA stream priority control.
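The ROI-based mask skipping described above can be sketched as follows, assuming a precomputed binary cardiac ROI; the lambda passed as `fn` is a hypothetical stand-in for an expensive per-pixel computation (the real system skips attention work for background tokens inside fused CUDA kernels):

```python
import numpy as np

def masked_compute(frame, roi_mask, fn):
    """Apply an expensive per-pixel fn only inside the cardiac ROI;
    background pixels are skipped and left at zero."""
    out = np.zeros_like(frame)
    idx = roi_mask.astype(bool)
    out[idx] = fn(frame[idx])          # computed on ROI pixels only
    return out

frame = np.random.rand(64, 64)
roi = np.zeros((64, 64))
roi[16:48, 16:48] = 1                  # hypothetical cardiac ROI box
result = masked_compute(frame, roi, lambda v: v ** 2)
print(int(roi.sum()), "of", frame.size, "pixels computed")  # 1024 of 4096
```

With the heart typically occupying only part of the field of view, the fraction of skipped pixels translates directly into saved computation.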

 

IV. Experiments and Results

1. Experimental Setup

This study used internationally recognized public echocardiography datasets for systematic validation. The CAMUS dataset serves as the core validation benchmark, containing 2D echocardiography sequences of 500 patients from the University Hospital of Saint-Etienne, France, covering apical two-chamber and four-chamber views. The EchoNet-Dynamic dataset is used to validate long-term modeling capability, providing 10,030 four-chamber-view video sequences from Stanford Medical Center, with frame-level segmentation labels for the left ventricular cavity and clinical indicators such as EDV/ESV/EF.
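Since EchoNet-Dynamic labels include EDV, ESV, and EF, the ejection fraction follows directly from the two volumes; a one-line reference computation (the example volumes are illustrative):

```python
def ejection_fraction(edv_ml, esv_ml):
    """Left-ventricular ejection fraction (%) from end-diastolic and
    end-systolic volumes, as labeled in EchoNet-Dynamic."""
    return 100.0 * (edv_ml - esv_ml) / edv_ml

print(ejection_fraction(120.0, 50.0))   # ~58.3%, within the normal range
```

This is the quantity whose absolute error (EFerr) is reported in the experiments: segmented ED/ES volumes feed this formula, and the result is compared against the clinical label.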

Five key evaluation metrics are used, as defined in Table 5.

Table 5 Definition of Evaluation Metrics

| Metric Symbol | Metric Name | Definition and Role |
| --- | --- | --- |
| Dice | Dice Similarity Coefficient | Measures the overlap between the segmented region and the gold standard (core segmentation accuracy metric) |
| HD | Hausdorff Distance | Evaluates the maximum error of the segmentation boundary (sensitively reflects edge segmentation quality) |
| EFerr | Absolute Error of Ejection Fraction | Accuracy of clinically critical functional parameters (directly determines diagnostic reliability) |
| FPS | Frames Per Second | Core metric for real-time performance (≥30 fps meets clinical real-time requirements) |
| Latency | Per-frame Processing Latency | System response speed (determines interaction fluency) |
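Of these metrics, the Dice coefficient has a simple closed form; a minimal reference implementation on toy binary masks (the masks are synthetic, for illustration only):

```python
import numpy as np

def dice(pred, gt):
    """Dice similarity coefficient between two binary masks:
    2*|P ∩ G| / (|P| + |G|)."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

pred = np.zeros((8, 8), dtype=bool); pred[2:6, 2:6] = True   # 16 px
gt = np.zeros((8, 8), dtype=bool);   gt[3:7, 3:7] = True     # 16 px, overlap 9
print(dice(pred, gt))   # 2*9 / (16+16) = 0.5625
```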

Five representative methods are selected for comparison, as shown in Table 6.

Table 6 Comparison Methods

| Method Name | Core Architecture | Key Innovations |
| --- | --- | --- |
| ST-UNet | 3D CNN + U-Net | Spatio-temporal convolution for fusing sequence features |
| EchoNet-Dynamic | ResNet-50 + LSTM | Joint training for clinical EF value prediction |
| Cardiac-Transformer | Standard ViT + temporal extension | First application of Transformer in cardiac segmentation |
| LightSeg-3D | Lightweight 3D CNN | Channel pruning + quantization for real-time processing |
| MVF-Net | Multi-view fusion network | Dual-plane feature collaboration |
| MST-Transformer | Multi-scale axial Transformer | Axial decomposition + multi-scale gated fusion + real-time optimization engine |

2. Experimental environment

The specific experimental environment of this article is shown in Table 7.

Table 7 Experimental Environment Configuration

| Category | Configuration Item | Parameter/Version | Description |
| --- | --- | --- | --- |
| Hardware Platform | GPU | NVIDIA Tesla T4 (×2) | 16GB GDDR6, 320 Tensor Cores, FP16 performance 65 TFLOPS |
| | CPU | Intel Xeon Gold 6248R | 24 cores / 48 threads, 3.0GHz base frequency |
| | Memory | 256GB DDR4 ECC | 2933MHz, 8 channels |
| | Storage | Samsung PM1733 NVMe SSD | 3.2TB capacity, sequential read 6.5GB/s |
| Software Environment | Operating System | Ubuntu 20.04 LTS | Kernel 5.8.0 |
| | CUDA | 11.7 | cuDNN 8.5.0 |
| | Deep Learning Framework | PyTorch 1.13.1 | TorchVision 0.14.1 |
| | Inference Acceleration Engine | TensorRT 8.5.1.7 | FP16/INT8 quantization support |
| | Medical Imaging Library | MONAI 1.1.0 | Ultrasound-specific data augmentation |
| Test Equipment | Ultrasound Simulator | SonoSim® Ultrasound Trainer | Generates standard test video streams |
| | Clinical Acquisition Device | GE Vivid E95 | Probe frequency 1.5-4.6MHz, output 1080p@30fps |
| | Edge Device | NVIDIA Jetson AGX Orin | 64GB memory, 2048 CUDA cores |
| Measurement Tools | Latency Analysis | NVIDIA Nsight Systems 2022.4 | End-to-end pipeline analysis |
| | Memory Monitoring | DCGM 2.4.4 | Real-time memory/power monitoring |
| | Frame Rate Acquisition | FFmpeg 5.1.2 | Video stream timestamp analysis |

3. Experimental results

The specific experimental results of this article are shown in Table 8.

Table 8 Experimental Results

| Method | Dice↑ | HD(mm)↓ | EFerr(%)↓ | FPS↑ | Latency(ms)↓ |
| --- | --- | --- | --- | --- | --- |
| ST-UNet | 0.841 | 4.32 | 6.8 | 9.2 | 108.7 |
| EchoNet-Dynamic | 0.853 | 3.95 | 5.2 | 18.1 | 55.2 |
| Cardiac-Transformer | 0.872 | 3.41 | 4.7 | 7.5 | 133.3 |
| LightSeg-3D | 0.826 | 4.78 | 7.3 | 31.0 | 32.3 |
| MVF-Net | 0.885 | 2.98 | 4.1 | 6.8 | 147.1 |
| MST-Transformer | 0.915 | 2.15 | 3.2 | 33.5 | 29.8 |

Intuitively, as shown in Figure 3, the proposed MST-Transformer system exhibits significant advantages. The Dice coefficient reaches 0.915, 3.4% higher than the best comparison method, MVF-Net, demonstrating the ability of axial attention to model thin-layer structures. The Hausdorff distance is reduced to 2.15 mm, mainly due to the synergy of the boundary-enhancement loss and multi-scale gated fusion, which effectively suppresses edge blurring caused by ultrasound speckle noise. EFerr is only 3.2%, meeting ASE diagnostic guideline requirements. Real-time performance leads all solutions at 33.5 fps, 8% higher than LightSeg-3D, proving the efficiency of the real-time optimization engine.

 

Figure 3 Comparison of experimental results

In the specialized testing of pathological scenarios, the results of this scheme are shown in Table 9. Across four typical cardiac pathologies, the MST-Transformer exhibits differentiated performance. It performs best on dilated cardiomyopathy (Dice 0.902, EF error 3.5%), demonstrating the robustness of the multi-scale pyramid to structural deformation: the axial attention mechanism effectively captures the diffuse motion of the thinned myocardial wall, overcoming the boundary-segmentation difficulties that abnormal chamber morphology causes for traditional methods. The Dice of 0.886 and EF error of 4.1% on hypertrophic cardiomyopathy are attributed to the P3-scale myocardial texture enhancement mechanism. Apical ballooning exposes the model's limitations, with Dice dropping to 0.823 and a failure rate of 7%; analysis shows that the lack of motion in the apical region degrades temporal-axis attention, while spatial-axis attention is disturbed by ultrasound near-field artifacts. Prosthetic valve replacement cases pose the greatest challenge: the comet-tail artifact generated by the metal valve is misidentified as a valid signal by the attention mechanism, leading to over-segmentation of the valve region.

Table 9 Pathological Scene Special Testing

| Pathological Type | Dice↑ | EFerr↓ | Failure Rate↓ |
|---|---|---|---|
| Dilated Cardiomyopathy | 0.902 | 3.5% | 0% |
| Hypertrophic Cardiomyopathy | 0.886 | 4.1% | 0% |
| Apical Ballooning | 0.823 | 5.8% | 7% |
| Prosthetic Valve Replacement | 0.801 | 6.2% | 12% |
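The EFerr column can be read as the absolute difference in ejection fraction between predicted and ground-truth segmentations. A minimal sketch of that computation is below; the volumes are hypothetical and the definition of EFerr as an absolute difference in percentage points is an assumption (the paper's actual volume estimation, e.g. Simpson's biplane method, is not reproduced here).

```python
def ejection_fraction(edv_ml, esv_ml):
    """EF (%) from end-diastolic and end-systolic volumes (ml)."""
    return 100.0 * (edv_ml - esv_ml) / edv_ml

def ef_error(pred_edv, pred_esv, gt_edv, gt_esv):
    """Absolute EF error in percentage points, as tabulated above."""
    return abs(ejection_fraction(pred_edv, pred_esv)
               - ejection_fraction(gt_edv, gt_esv))

# hypothetical volumes: ground-truth EF = 60%, predicted EF ≈ 56.5%
print(round(ef_error(115.0, 50.0, 120.0, 48.0), 1))   # → 3.5
```

Because EF divides two segmentation-derived volumes, small boundary errors at end-systole are amplified, which explains why EFerr degrades faster than Dice in the pathological cases.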

The results of the ablation experiments are shown in Table 10. The ablation study systematically validates the contribution of each of the three core modules of MST Transformer and reveals the key mechanisms behind the model's performance gains.

Table 10 Results of the ablation experiment

| Model Variants | Dice↑ | Hausdorff Distance (mm)↓ | FPS↑ |
|---|---|---|---|
| w/o Axial Attention | 0.879 | 3.27 | 41.5 |
| w/o Multi-scale Fusion | 0.892 | 2.84 | 36.2 |
| w/o Real-time Optimization Engine | 0.907 | 2.31 | 18.7 |
| Full Model | 0.915 | 2.15 | 33.5 |

As shown in Figure 4, removing the axial attention module causes a sharp drop of 0.036 in the Dice coefficient and an increase of 1.12 mm in Hausdorff distance. This indicates that axial decomposition is irreplaceable for long-range spatiotemporal modeling: by separating the spatial and temporal dimensions, it reduces computational complexity by two orders of magnitude while maintaining a global receptive field. After disabling multi-scale gated fusion, the Hausdorff distance deteriorates to 2.84 mm, but Dice decreases by only 0.023. This shows that the core value of multi-scale fusion lies in precise boundary localization: the gating mechanism enhances key anatomical regions through dynamic weighting, effectively suppressing the edge blurring caused by ultrasound speckle noise. Notably, even with this module removed, axial attention still maintains the basic segmentation accuracy, indicating that the multi-scale design is the "last mile" of precision optimization. Turning off the real-time optimization engine causes FPS to plummet from 33.5 to 18.7, an 80% increase in latency, proving that this module is the key to breaking the speed bottleneck.
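The complexity argument above can be made concrete with a minimal NumPy sketch of axial attention: instead of one attention over all T·H·W tokens (score matrix of size (THW)²), attention is applied separately along the height, width, and time axes, so the largest score matrix is only H², W², or T². This is an illustrative decomposition under stated assumptions; the paper's actual module (learned projections, heads, positional encodings, gating) is not specified here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention over the second-to-last axis."""
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def axial_attention(x):
    """Factorized attention along H, then W, then T.

    x: (T, H, W, C) feature volume. Each call to attend() builds a
    score matrix of at most max(H, W, T)^2 entries, versus (T*H*W)^2
    for full spatiotemporal attention.
    """
    # height axis: sequences of length H
    xh = x.transpose(0, 2, 1, 3)              # (T, W, H, C)
    x = attend(xh, xh, xh).transpose(0, 2, 1, 3)
    # width axis: W is already the sequence axis
    x = attend(x, x, x)
    # temporal axis: sequences of length T
    xt = x.transpose(1, 2, 0, 3)              # (H, W, T, C)
    return attend(xt, xt, xt).transpose(2, 0, 1, 3)

x = np.random.randn(8, 16, 16, 32)            # 8 frames of 16x16 tokens
print(axial_attention(x).shape)               # (8, 16, 16, 32)
```

With T=8 and 16×16 tokens, full attention scores 2048² ≈ 4.2M pairs per channel head, while the axial version never exceeds 16² = 256, consistent with the "two orders of magnitude" reduction claimed above.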


Figure 4 Results of the ablation experiment

 

Conclusion

This study proposes and validates the MST Transformer system, an innovative segmentation method that addresses the challenges of real-time performance and accuracy in echocardiography video segmentation. The experimental results show that the proposed system significantly outperforms existing mainstream methods in segmentation accuracy, temporal consistency, and real-time processing capability. Specifically, MST Transformer exhibits excellent performance in Dice coefficient, Hausdorff distance, and the calculation accuracy of key cardiac function parameters, and in the detailed modeling of cardiac structure boundaries in particular it successfully overcomes challenges such as ultrasound image noise and boundary blurring. At the same time, the system meets clinical standards for real-time processing, substantially supporting the timeliness of clinical decision-making.

Further analysis shows that the deep fusion of multi-scale spatiotemporal modeling and the axial attention mechanism provides a new technological path for automated segmentation of echocardiography videos, while the real-time inference engine based on hardware co-optimization effectively balances segmentation accuracy against real-time performance, advancing the clinical adoption of cardiovascular imaging analysis technology. In practical terms, the MST Transformer system proposed in this paper can be widely applied to the precise diagnosis and treatment of heart disease, and offers significant advantages especially in the real-time assessment and surgical planning of critically ill patients.

 

References

  • Favot, M., Courage, C., Ehrman, R., Khait, L., & Levy, P. (2016). Strain echocardiography in acute cardiovascular diseases. Western Journal of Emergency Medicine, 17(1), 54.
  • Popp, R. L. (1976). Echocardiographic assessment of cardiac disease. Circulation, 54(4), 538-552.
  • Krishnamurthy, R. (2009). The role of MRI and CT in congenital heart disease. Pediatric radiology, 39(Suppl 2), 196-204.
  • Zhang, J., Gajjala, S., Agrawal, P., Tison, G. H., Hallock, L. A., Beussink-Nelson, L., … & Deo, R. C. (2018). Fully automated echocardiogram interpretation in clinical practice: feasibility and diagnostic accuracy. Circulation, 138(16), 1623-1635.
  • Tromp, J., Seekings, P. J., Hung, C. L., Iversen, M. B., Frost, M. J., Ouwerkerk, W., … & Ezekowitz, J. A. (2022). Automated interpretation of systolic and diastolic function on the echocardiogram: a multicohort study. The Lancet Digital Health, 4(1), e46-e54.
  • Mele, D., Andrade, A., Bettencourt, P., Moura, B., Pestelli, G., & Ferrari, R. (2020). From left ventricular ejection fraction to cardiac hemodynamics: role of echocardiography in evaluating patients with heart failure. Heart failure reviews, 25(2), 217-230.
  • Paulus, W. J., Tschöpe, C., Sanderson, J. E., Rusconi, C., Flachskampf, F. A., Rademakers, F. E., … & Brutsaert, D. L. (2007). How to diagnose diastolic heart failure: a consensus statement on the diagnosis of heart failure with normal left ventricular ejection fraction by the Heart Failure and Echocardiography Associations of the European Society of Cardiology. European heart journal, 28(20), 2539-2550.
  • Slomka, P. J., Dey, D., Sitek, A., Motwani, M., Berman, D. S., & Germano, G. (2017). Cardiac imaging: working towards fully-automated machine analysis & interpretation. Expert review of medical devices, 14(3), 197-212.
  • Ferrari, V., Carbone, M., Cappelli, C., Boni, L., Melfi, F., Ferrari, M., … & Pietrabissa, A. (2012). Value of multidetector computed tomography image segmentation for preoperative planning in general surgery. Surgical endoscopy, 26(3), 616-626.
  • Chen, X. (2023). Real-Time Semantic Segmentation Algorithms for Enhanced Augmented Reality. Journal of Computational Innovation, 3(1).
  • Zhai, X., Eslami, M., Hussein, E. S., Filali, M. S., Shalaby, S. T., Amira, A., … & Ahmed, A. Z. (2018). Real-time automated image segmentation technique for cerebral aneurysm on reconfigurable system-on-chip. Journal of computational science, 27, 35-45.
  • Wells, P. N. (2006). Ultrasound imaging. Physics in medicine & biology, 51(13), R83.
  • Damerjian, V., Tankyevych, O., Souag, N., & Petit, E. (2014). Speckle characterization methods in ultrasound images–A review. Irbm, 35(4), 202-213.
  • Perdios, D., Vonlanthen, M., Martinez, F., Arditi, M., & Thiran, J. P. (2021). CNN-based image reconstruction method for ultrafast ultrasound imaging. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 69(4), 1154-1168.
  • Huang, Y., Yang, J., Sun, Q., Yuan, Y., Li, H., & Hou, Y. (2024). Multi-residual 2D network integrating spatial correlation for whole heart segmentation. Computers in Biology and Medicine, 172, 108261.
  • Azad, R., Aghdam, E. K., Rauland, A., Jia, Y., Avval, A. H., Bozorgpour, A., … & Merhof, D. (2024). Medical image segmentation review: The success of u-net. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Mukhametkaly, A., Momynkulov, Z., Kurmanbekkyzy, N., & Omarov, B. (2023). Deep Conv-lstm network for arrhythmia detection using ECG data. International Journal of Advanced Computer Science and Applications, 14(9).
  • Deng, K., Meng, Y., Gao, D., Bridge, J., Shen, Y., Lip, G., … & Zheng, Y. (2021, September). Transbridge: A lightweight transformer for left ventricle segmentation in echocardiography. In International workshop on advances in simplifying medical ultrasound (pp. 63-72). Cham: Springer International Publishing.
  • Adhyapak, S., & Menon, P. (2024). Classification of Echocardiography Videos Using TimeSformer for Detecting Incipient Heart Failure in Asymptomatic Patients with Normal Ejection Fraction and Patients with Heart Failure. Circulation, 150(Suppl_1), A4120990-A4120990.
  • Mazaheri, S., Sulaiman, P. S. B., Wirza, R., Khalid, F., Kadiman, S., Dimon, M. Z., & Tayebi, R. M. (2013, December). Echocardiography image segmentation: A survey. In 2013 international conference on advanced computer science applications and technologies (pp. 327-332). IEEE.
  • Wang, K., Hachiya, H., & Wu, H. (2024). A Multi‐Fusion Residual Attention U‐Net Using Temporal Information for Segmentation of Left Ventricular Structures in 2D Echocardiographic Videos. International Journal of Imaging Systems and Technology, 34(4), e23141.
  • Xu, Z., Yu, F., Zhang, B., & Zhang, Q. (2022). Intelligent diagnosis of left ventricular hypertrophy using transthoracic echocardiography videos. Computer Methods and Programs in Biomedicine, 226, 107182.
  • Zhang, Y. (2024). Utilizing Hybrid LSTM and GRU Models for Urban Hydrological Prediction (Doctoral dissertation, University of Guelph).
  • Fernandes, F. E., & Yen, G. G. (2020). Automatic searching and pruning of deep neural networks for medical imaging diagnostic. IEEE Transactions on Neural Networks and Learning Systems, 32(12), 5664-5674.