Intelligent Generation of Packaging Layout Under the Constraints of Information Hierarchy and Brand Consistency: A Content-Aware Model and Designer Online System for Visual Communication
https://doi-001.org/1025/17669976657001
Long Zeng1,a, Tian Tian1,b, Zi Wang1,c,*
1 School of Information Engineering, Chenzhou Vocational Technical College, Chenzhou, 423000, Hunan, China
aEmail: zenglong202209@163.com
bEmail: tiantian1982929@163.com
cEmail: wangczzy1014@163.com
*Corresponding author
Abstract
This study addresses the inefficiency of manual packaging layout design and the shortcomings of existing automated methods. By converting the principles of visual communication and brand norms into mathematical constraints and loss functions, we constructed the multi-dimensionally annotated PackLay-IB dataset, trained the HiBrand-Layout content-aware model, and designed the “Designer-in-the-Loop” interaction prototype. Experiments show that HiBrand-Layout outperforms the baselines: comprehensive constraint satisfaction reaches 89.2%±2.1%, brand color consistency 92.3%±1.5%, and design time is reduced by more than 60%. The model also generalizes well in cross-brand and cross-die-cut scenarios.
Keywords: Packaging design; Visual communication; Information hierarchy; Brand consistency; Transformer
1 Introduction
As the visual carrier of product information and consumer perception, packaging layout directly impacts the clarity of information (e.g., product features, user guides) as well as the product’s brand impression and market recognition. In practice, manual design must simultaneously consider the packaging carton’s die-cut edges, legally mandated layout positions, grid alignment requirements, the whitespace rhythm of visual typography, reading flow, information-architecture logic, and artistic expression. Relying solely on manual methods to finalize a mature design is therefore time-consuming and prone to inconsistent decisions across repeated iterations.
Furthermore, existing research on automated packaging layout generation focuses on optimizing the geometric arrangement of basic elements. Some methods rely on sequential traversal or fixed template libraries but fail to consider visual communication core demands (information hierarchy clarity and brand consistency) and lack the “interactive control” required by designers. As a result, generated designs often have defects such as incorrect brand primary color usage and confusing key information hierarchy, limiting their application in teaching and industry.
To solve these problems, this study aims to realize intelligent packaging layout generation under multi-constraints to meet visual communication needs. Classic visual communication principles (grid, whitespace, reading order) and brand specifications (brand color palette, logo proximity rules, legal information area) are standardized into optimizable mathematical constraints and loss functions. A content-aware model is trained using a multi-dimensionally annotated dataset, and combined with a “Designer-in-the-Loop” interactive approach to generate layouts that conform to visual communication logic and technical constraints.
Based on this, the study focuses on three core research questions (RQ):
RQ1 (Communication Effect): Under real packaging constraints (die-cut/legal areas), how can “information hierarchy clarity and brand consistency” be quantified to drive automatic layout?
RQ2 (Methodology): How can visual communication principles (grid, whitespace, alignment, reading order, brand color usage) be formalized into differentiable/optimizable constraints?
RQ3 (Practice): What “Designer-in-the-Loop” interaction can achieve explainable one-click rearrangement while maintaining brand consistency?
The contributions of this study are:
Proposing multi-level definitions and measurement standards for visual-communication orientation (e.g., hierarchy clarity, OCR readability);
Designing the HiBrand-Layout content-aware model that integrates structure generation and brand consistency orientation;
Developing a “Designer-in-the-Loop” interactive prototype to support teaching and practice;
Establishing an experimental evaluation paradigm based on communication effects.
2 Literature Review
2.1 Visual Communication and Packaging Information Hierarchy
Information hierarchy is the key to efficient cognitive transmission in packaging visual language. In essence, it uses graded visual weight to guide consumers to acquire the title and core selling points first, followed by other text descriptions and legal information. Efficient information hierarchy is quantified by reading-path rationality, i.e., the consistency between the priority ranking of visual elements (size, contrast) and consumers’ cognitive paths. For example, the cognitive path of food packaging, “product name (H1) → net content (H2) → ingredient list (Body)”, is the established default; violating it can cause deviations in information acquisition [1]. Information visibility is a prerequisite for hierarchical transmission, with key parameters covering font physical attributes and layout characteristics:
Minimum font size: Must not fall below the “mandatory labeling content ≥3mm” specified in GB7718 (food labels) to avoid OCR recognition errors (a character error rate >5% is considered unreadable) [2];
Color contrast: Between font and background, must meet WCAG 2.1’s 4.5:1 requirement;
Line spacing: 1.2×–1.5× font size;
Word count per block: <8 lines;
Physical restrictions: Die-cut/functional areas (e.g., sealing parts cannot carry key text) impose hard constraints on visibility design [3].
Grid and whitespace design form the skeleton of layout balance:
Grid rules: A fixed number of columns (3–5 for food packaging), gutter distance (8–12mm), and margins (≥10mm) standardize element alignment (text left-aligned, logo centered in Chinese layouts), avoiding the visual disorder caused by element offset;
Whitespace: An “invisible visual element” whose core role is to divide information areas (whitespace between the main visual area and the legal information area ≥15mm) and give the layout breathing room. Research shows that when whitespace accounts for 20%–30% of the layout, consumers’ information-acquisition efficiency increases by 18%–25%; this is especially necessary for small packaging (e.g., oral liquid labels) and dense information [4].
2.2 Automated Layout Generation
The evolution of automatic layout generation technology revolves around the balance between “constraint satisfaction” and “generation flexibility”, and is divided into two paradigms: traditional rule-driven and deep learning-driven.
2.2.1 Traditional Rule-Driven Methods
Template method: The early industry mainstream. Preset fixed-format template libraries (classified by food, medicine, etc.) fill packaging elements (LOGO, text, images) into reserved template positions. This ensures basic layout order and fast generation but offers poor flexibility: it cannot adapt to diverse die-cut sizes (e.g., special-shaped bottle labels, folding cartons) and struggles to adjust information hierarchy dynamically (e.g., adding a new selling point breaks the original balance), leading to serious homogeneity [5].
Constraint-based method: Improves on templates by mathematically defining layout requirements (non-overlapping elements, minimum text size, alignment) as constraints, and using integer programming or graph optimization to find optimal layouts. However, these algorithms rely on numerous geometric constraints (position, size), and complexity grows exponentially with the number of elements. They also fail to address visual communication core elements (reading order, brand color matching), resulting in visually ineffective layouts.
2.2.2 Deep Learning-Driven Methods
“Content-aware” generation is driven by deep learning:
GAN-based methods: Generative Adversarial Networks (GANs) were the early mainstream. LayoutGAN, for example, pairs a generator that predicts element bounding boxes with a discriminator that judges layout rationality, but it targets only general documents (posters, resumes), ignoring packaging-specific die-cut prohibited areas, fixed legal-information areas, and brand color consistency [6].
Transformer-based methods: Cross-modal attention architectures (e.g., Transformers) have gradually replaced GANs owing to their advantages in multimodal information fusion (e.g., copy hierarchy). However, current Transformer-based methods focus on optimizing geometric accuracy (e.g., bounding-box boundary error) and ignore information-hierarchy comprehensibility and brand requirements, failing to meet the core demands of packaging visual communication.
Current automated methods do not consider packaging-specific constraints (die-cut, legal area) and visual communication key intentions (hierarchy clarity, brand consistency)—this is the key challenge addressed by this study.
2.3 Designer-in-the-Loop and Interactive Optimization
The significance of “Designer-in-the-Loop” lies in resolving the contradiction between the “black box” nature of automation and the subjective decision-making needs of visual communication design. Its core is the pairing of explainability and controllability: the former lets designers understand the generation principles, and the latter ensures design decisions align with visual-communication goals (brand consistency, hierarchy clarity).
Early interactions often used parameter fine-tuning (e.g., dragging to adjust element position or text size). Although local adjustments are possible, repeated attempts are required and the overall layout order is unstable, especially under packaging die-cut constraints, where manual adjustments easily exceed placement boundaries or break grid alignment [7].
Deep learning-based interaction is evolving toward “high-dimensional manipulation” (e.g., knobs to change element hierarchy, clicks to switch layout styles) but has two weaknesses:
Limited control: Operations cover only geometric dimensions (size, left/right) and lack packaging-specific constraints (brand color locking, legal-information area protection), leading to brand color conflicts or loss of legal information;
Lack of explainability: Only output versions are provided, without the underlying visual-communication rationale (e.g., “placed left to align with a grid line”, “larger font because of H1 hierarchy”), so designers cannot trust the tool [8].
Existing controllable generation for general documents (brochures, posters) does not meet packaging’s “dynamic reflow” requirements: adding new selling points or changing die-cuts easily disrupts the original reading logic or brand layout, forcing designers to rework the result manually. It also ignores legal regulations (e.g., the food-packaging rule that the nutrition table must be placed in the lower right corner), making reflowed layouts non-compliant.
Existing “Designer-in-the-Loop” research has not fully integrated packaging’s unique visual-communication constraints (brand specifications, legal areas) with design decision-making logic (hierarchy priority, grid alignment), and lacks an interactive system that unifies explainability, controllability, and visual-communication goals; this is the core direction of the interactive prototype built in this study.
2.4 Research Differentiation
Although existing studies have accumulated work in visual communication principles, layout automation, and designer interaction, none form a closed loop of “visual communication principles → computable constraints → communication effect verification”—this is the main difference of this study:
Visual communication quantification: Most existing studies provide only qualitative principles (e.g., grid specifications, brand color specifications) as design references, without converting them into adjustable mathematical constraints. For example, scholars have shown that whitespace facilitates information search but have not turned the “20%–30% whitespace ratio” into a loss function for model tuning. This study computes the whitespace index as a differentiable loss to enable principle-based regulation.
Constraint coverage improvement: Traditional constraint solving considers only geometric overlap; GAN/Transformer methods focus on layout accuracy (e.g., bounding-box error). Both ignore packaging’s inherent visual constraints (brand color ratio, fixed legal-information area) and evaluate with image metrics (LPIPS/FID), detached from communication effects (OCR readability, hierarchy recognition). This study converts brand color consistency into a ΔE₀₀ regularization term, embeds legal-area constraints into a die-cut-mask position loss, and uses visual-communication metrics as the core evaluation standard (supplemented by image metrics), which is fundamentally different from pure algorithmic optimization.
3 Methodology
3.1 Problem Formulation: Conditions and Representations
To realize intelligent packaging layout generation, the core needs of visual communication and packaging-specific constraints are converted into structured input conditions.
Copy and hierarchy: Represented by the ordered set T = {t_1, t_2, …, t_n}, where each t_i includes: a hierarchy label (H1: main title, H2: selling point, Body: description, Legal: legal information); a minimum font size threshold s_i^min (H1 ≥8mm, Legal ≥3mm, in line with GB7718); and a visual priority weight w_i, ensuring higher-hierarchy copy has greater visual weight.
Brand color palette: Defined as B = (c_p, c_s, τ), where:
c_p (primary color) and c_s (secondary color) are expressed in LAB color-space coordinates; τ (color tolerance threshold) is ΔE₀₀ ≤ 2 (based on the CIEDE2000 standard), limiting the maximum color deviation between generated layout elements and the palette to ensure brand color consistency.
Product images: Jointly represented by image features and crop masks. Let the product image be I; its visual core area (e.g., the main product photo) is marked by the mask M_keep (1 = non-croppable area), and edge-redundant areas are marked by M_crop (croppable area), preventing damage to key visual information.
Die-cut/functional area mask: M_die is a 2D binary matrix, where M_die(x, y) = 0 marks a no-placement area (e.g., sealing lines, creases) and M_die(x, y) = 1 a placement-permitted area. Fixed functional-area coordinates are built in (e.g., the barcode area and the nutrition information area) to impose mandatory constraints on the placement of specific elements.
Design grid parameters: Parameterized as G = (N_col, g, m), where:
N_col: number of columns (3–5 for food/daily-chemical packaging); g: gutter width (8–12mm); m: margin (≥10mm);
Grid line coordinates x_k (k = 1, …, N_col) are calculated as:
x_k = m + (k − 1)(w_col + g), where the column width is w_col = (W − 2m − (N_col − 1)g) / N_col for a panel of width W.
Unit conversion note: Font sizes and layout dimensions are unified in mm; when converting to pixels (px), DPI = 300 (the common industry print standard) is assumed, with the conversion formula px = mm × 300 / 25.4. Alignment errors are uniformly reported in mm.
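The grid parameterization and unit conversion above can be sketched in a few lines of Python; the function names, and the assumption that the columns span the panel width between the margins, are illustrative rather than taken from the paper.

```python
def mm_to_px(mm: float, dpi: int = 300) -> float:
    """Convert millimetres to pixels at the given DPI (1 inch = 25.4 mm)."""
    return mm * dpi / 25.4

def grid_line_xs(panel_w: float, n_cols: int, gutter: float, margin: float) -> list:
    """x-coordinates (mm) of each column's left edge on a panel of width panel_w."""
    col_w = (panel_w - 2 * margin - (n_cols - 1) * gutter) / n_cols
    return [margin + k * (col_w + gutter) for k in range(n_cols)]

# A 200 mm-wide panel with 4 columns, 10 mm gutters, and 10 mm margins:
xs = grid_line_xs(200.0, 4, 10.0, 10.0)  # column left edges in mm
```

At 300 DPI one millimetre maps to roughly 11.8 px, so the 3 mm Legal minimum corresponds to about 35 px of rendered text height.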
3.2 HiBrand-Layout Model Architecture
The HiBrand-Layout model adopts a two-stage “Encoder-Decoder” architecture: Transformer encoder for deep multimodal condition fusion, and GAN decoder for accurate layout element prediction. The core is to embed packaging visual communication requirements and constraints into the end-to-end generation process.
3.2.1 Transformer Encoder
Responsible for integrating four modalities (copy, color, vision, structure):
Copy modality: For the copy hierarchy set, construct copy feature vectors via “hierarchy label embedding + minimum font size embedding + text length embedding”, and superimpose position codes to distinguish element order;
Visual modality: Use ViT (Vision Transformer) to extract patch features of the product image; convert the die-cut/functional-area mask into spatial attention weights that suppress prohibited-area features, so the encoding carries physical-constraint information [9];
Color modality: Vectorize the brand color palette into conditional tokens via its LAB coordinates;
Cross-modal fusion: Feed the copy, visual, and color features into the cross-modal attention layer; use multi-head attention to learn inter-modal relationships (e.g., H1 copy binding to the brand primary color, the product image matching the LOGO area); finally, output layout query tokens (in one-to-one correspondence with layout elements, encoding element type, priority, and constraint preference).
3.2.2 GAN Decoder
Consists of a generator and a discriminator, focusing on predicting key layout attributes and visual communication principle constraints:
Generator: Takes the layout query tokens as input and, via deconvolution layers and coordinate regression heads, predicts each element’s bounding box (x, y position; w, h size), z-index (visual hierarchy, ensuring H1/H2 priority display), and color value (drawn from the brand color palette);
Discriminator: Does not rely on pixel-level comparison; converts generated layouts into “geometric feature maps” (element alignment deviation, whitespace ratio, reading order sequence); via binary classification, judges whether layouts conform to visual communication principles (grid alignment, whitespace interval, reading logic); outputs constraint loss and feeds it back to the generator, forming a “generation-discrimination-optimization” closed loop.
This architecture ensures content awareness via cross-modal encoding and strengthens visual communication constraint satisfaction via adversarial training, providing a model foundation for subsequent loss optimization.
3.3 Constraint Modeling & Loss Function
3.3.1 Symbol Table
Table 3.3.1 Symbol Table
| Symbol | Description | Type |
| i, j | Indices of layout elements (text blocks, LOGO, images) | Integer |
| b_i = (x_i, y_i, w_i, h_i) | Bounding box of element i; (x_i, y_i): top-left coordinate; w_i: width; h_i: height | 4-dimensional vector |
| IoU(b_i, b_j) | Intersection over Union of b_i and b_j, calculated as area(b_i ∩ b_j) / area(b_i ∪ b_j) | Float (0 ≤ IoU ≤ 1) |
| i ≺ j | Element i should be read before element j (based on information hierarchy) | Logical relation |
| v_i | Visual weight score of element i (positively correlated with font size, contrast, z-index) | Float |
| δ | Margin parameter for the reading order loss (set to 0.1 in experiments) | Float |
| c_i | LAB color value of element i | 3-dimensional vector (L ∈ [0,100], a ∈ [−128,127], b ∈ [−128,127]) |
| ΔE₀₀(i) | CIEDE2000 color difference between c_i and the nearest color in the brand palette | Float |
| p_pri | Proportion of elements using the brand primary color in the layout | Float (0 ≤ p_pri ≤ 1) |
| λ_p | Weight of the primary color proportion regularization (set to 2 in experiments) | Float |
| E_i | Edge set of element i (top, bottom, left, right edges) | Set of coordinates |
| d(E_i, G) | Minimum Euclidean distance between any edge in E_i and any grid line in G | Float (unit: mm) |
| r_ws | Proportion of whitespace in the layout, calculated as 1 − (total element coverage area) / (layout area) | Float (0 ≤ r_ws ≤ 1) |
| λ_ws | Weight of the whitespace proportion regularization (set to 1.0 in experiments) | Float |
| CR_i | Color contrast between element i and its background, calculated per WCAG 2.1 as CR_i = (L₁ + 0.05) / (L₂ + 0.05), with L₁ the lighter and L₂ the darker of the element and background relative luminances | Float |
| s_i^min | Minimum font size threshold of element i (Legal: 3mm; H1: 8mm) | Float (unit: mm) |
| s_i | Actual font size of element i | Float (unit: mm) |
| L_adv | Adversarial loss of the GAN (hinge loss used in experiments) | Float |
| λ₁–λ₅ | Weights of the constraint losses (set to 1.2, 1.5, 2.0, 1.0, 1.8 respectively) | Float |
| N | Total number of elements in the layout | Integer |
3.3.2 Loss Functions
Element Non-overlap Loss (L_overlap)
Constrains any two element bounding boxes from intersecting. The loss is the sum of IoU values over all element pairs:
L_overlap = Σ_{i<j} IoU(b_i, b_j)
When IoU(b_i, b_j) = 0 (no overlap), the pair contributes 0; otherwise, the overlap proportion is penalized.
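A minimal pure-Python sketch of this pairwise overlap penalty, with boxes as (x, y, w, h) tuples; the function names are illustrative, not the paper’s code.

```python
def iou(a, b):
    """Intersection over Union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))  # horizontal overlap
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))  # vertical overlap
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def overlap_loss(boxes):
    """Sum of IoU over all element pairs; 0 iff no two boxes intersect."""
    return sum(iou(boxes[i], boxes[j])
               for i in range(len(boxes)) for j in range(i + 1, len(boxes)))
```

Disjoint boxes contribute nothing, so the loss directly measures how much overlap remains to be optimized away.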
Reading Order Loss (L_order)
Ensures the elements’ visual-weight order matches the predefined reading sequence, using a pairwise ranking loss with margin δ:
L_order = Σ_{(i,j): i≺j} max(0, δ − (v_i − v_j))
where v_i is the visual weight score of element i. If v_i − v_j ≥ δ (reading order satisfied), the pair contributes 0; otherwise, the deviation is penalized.
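The margin-ranking form described above can be sketched as follows; encoding the reading order as (earlier, later) index pairs is an illustrative assumption.

```python
def reading_order_loss(weights, order_pairs, delta=0.1):
    """Pairwise ranking loss with margin delta.

    weights: visual weight score per element.
    order_pairs: (i, j) pairs meaning element i should be read before j.
    """
    return sum(max(0.0, delta - (weights[i] - weights[j]))
               for i, j in order_pairs)
```

An H1 with weight 0.9 read before a Legal block with weight 0.2 clears the margin and adds no penalty; a reversed ranking is penalized in proportion to the violation.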
Brand Color Consistency Loss (L_color)
Controls color deviation from the brand palette and the primary-color proportion:
L_color = (1/N) Σᵢ ΔE₀₀(i) + λ_p (p_pri − 0.5)²
The first term minimizes color deviation (ΔE₀₀ ≤ 2 counts as compliant); the second regularizes the primary-color proportion toward the 40%–60% range (target: 50%).
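A sketch of this loss with two loudly-flagged simplifications: the Euclidean CIE76 distance stands in for the paper’s CIEDE2000 (ΔE₀₀), and the primary-color proportion is passed in precomputed; λ_p = 2 as in the experiments.

```python
import math

def delta_e76(lab1, lab2):
    """CIE76 color difference: Euclidean distance in LAB space.
    (A simpler stand-in for the CIEDE2000 formula used by the paper.)"""
    return math.dist(lab1, lab2)

def brand_color_loss(element_labs, palette_labs, primary_ratio, lam_p=2.0):
    """Mean distance to the nearest palette color, plus a quadratic pull
    of the primary-color proportion toward the 50% target."""
    deviation = sum(min(delta_e76(c, p) for p in palette_labs)
                    for c in element_labs) / len(element_labs)
    return deviation + lam_p * (primary_ratio - 0.5) ** 2
```

Swapping in a full CIEDE2000 implementation changes only `delta_e76`; the regularization structure stays the same.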
Grid Alignment & Whitespace Loss (L_grid)
Enforces grid alignment (element edge ≤2mm from a grid line) and the whitespace ratio (20%–30%):
L_grid = Σᵢ d(E_i, G) + λ_ws (r_ws − 0.25)²
The first term minimizes alignment deviation; the second regularizes the whitespace ratio toward the optimal 25%.
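A simplified sketch of this loss: only vertical grid lines and left/right element edges are checked, and the whitespace ratio ignores element overlaps; λ_ws = 1.0 as in the experiments. These simplifications and the function name are assumptions for illustration.

```python
def grid_whitespace_loss(boxes, grid_xs, panel_area, lam_ws=1.0):
    """boxes: (x, y, w, h) tuples in mm; grid_xs: vertical grid line x-coords."""
    # Alignment term: each element's minimum edge-to-grid-line distance.
    align = sum(min(abs(e - g) for g in grid_xs for e in (x, x + w))
                for x, y, w, h in boxes)
    # Whitespace term: quadratic pull of the whitespace ratio toward 25%.
    covered = sum(w * h for x, y, w, h in boxes)  # ignores overlaps for brevity
    r_ws = 1.0 - covered / panel_area
    return align + lam_ws * (r_ws - 0.25) ** 2
```

A perfectly aligned layout at exactly 25% whitespace scores zero; every millimetre of edge drift and every point of whitespace deviation adds to the loss.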
Text Readability Loss (L_read)
Guarantees text-background contrast (≥4.5:1) and minimum font sizes:
L_read = Σᵢ max(0, 4.5 − CR_i) + Σᵢ max(0, s_i^min − s_i)
The first term penalizes contrast below 4.5; the second penalizes font sizes below their thresholds.
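The contrast term follows the WCAG 2.1 contrast-ratio formula (lighter relative luminance + 0.05 over darker + 0.05). A minimal sketch with illustrative names:

```python
def contrast_ratio(lum_a, lum_b):
    """WCAG 2.1 contrast ratio on relative luminances in [0, 1]."""
    hi, lo = max(lum_a, lum_b), min(lum_a, lum_b)
    return (hi + 0.05) / (lo + 0.05)

def readability_loss(elements):
    """elements: (text_lum, bg_lum, font_mm, min_font_mm) per text element."""
    loss = 0.0
    for text_lum, bg_lum, font_mm, min_font_mm in elements:
        loss += max(0.0, 4.5 - contrast_ratio(text_lum, bg_lum))  # contrast term
        loss += max(0.0, min_font_mm - font_mm)                   # font-size term
    return loss
```

Black text on white reaches the maximum 21:1 ratio, so for such elements only an undersized font incurs a penalty.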
Total Loss (L_total)
Jointly optimizes the adversarial loss and all constraint losses:
L_total = L_adv + λ₁ L_overlap + λ₂ L_order + λ₃ L_color + λ₄ L_grid + λ₅ L_read
3.3.3 Threshold Scanning Results
To verify the rationality of key thresholds, threshold scanning experiments were conducted on the PackLay-IB validation set. Results are shown in Tables 3.3.2–3.3.4:
Table 3.3.2 ΔE₀₀ Threshold Scanning Results
| ΔE₀₀ Threshold | Brand Color Consistency (%) | OCR Accuracy (%) | Comprehensive Constraint Satisfaction (%) |
| 1.0 | 95.1±1.2 | 94.3±1.7 | 87.6±2.3 |
| 2.0 (default) | 92.3±1.5 | 94.5±1.8 | 89.2±2.1 |
| 3.0 | 86.7±2.0 | 94.4±1.6 | 88.5±2.2 |
| 4.0 | 79.2±2.5 | 94.2±1.9 | 86.8±2.4 |
ΔE₀₀=2.0 balances brand consistency and layout flexibility, so it is selected as the default threshold.
Table 3.3.3 Primary Color Proportion Threshold Scanning Results
| Primary Color Proportion | Brand Color Consistency (%) | Layout Aesthetics Score (1-5) |
| 30%-50% | 88.5±1.8 | 3.9±0.4 |
| 40%-60% (default) | 92.3±1.5 | 4.3±0.2 |
| 50%-70% | 93.1±1.4 | 4.1±0.3 |
40%-60% ensures brand recognition while avoiding color monotony, so it is selected as the default range.
Table 3.3.4 Contrast Threshold Scanning Results
| Contrast Threshold | OCR Accuracy (%) | Readability Score (1-5) |
| 3.5:1 | 89.2±2.1 | 3.7±0.5 |
| 4.5:1 (default) | 94.5±1.8 | 4.2±0.3 |
| 5.5:1 | 95.1±1.6 | 4.1±0.4 |
4.5:1 meets WCAG2.1 accessibility standards and balances readability and color diversity, so it is selected as the default threshold.
3.4 Specification Adaptation Layer Design
To support rapid switching of legal standards for different product categories, a specification adaptation layer is designed, using an XML configuration file-driven architecture. The layer decouples legal constraints from the core model, enabling flexible adaptation to multiple standards.
3.4.1 Architecture Design
The specification adaptation layer includes three modules:
Configuration Parser: Reads XML configuration files of different legal standards and parses parameters such as the legal-information area, font requirements, and color restrictions;
Constraint Mapper: Maps the parsed parameters to model constraint parameters (e.g., legal-area coordinates → die-cut mask; font requirements → minimum font size thresholds);
Effect Evaluator: Verifies whether the generated layout meets the current standard and outputs a compliance report.
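How the Configuration Parser and Constraint Mapper could work can be sketched with Python’s standard xml.etree; the XML schema below is hypothetical, since the paper does not publish its configuration format.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration for the food standard; element and attribute
# names are invented for illustration.
CONFIG = """<standard name="GB7718">
  <legal_area min_height_mm="30"/>
  <font h1_min_mm="8" legal_min_mm="3"/>
  <color primary_ratio_min="0.40" primary_ratio_max="0.60"/>
</standard>"""

def parse_standard(xml_text):
    """Parse a standard's XML config into model constraint parameters."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.get("name"),
        "legal_min_height_mm": float(root.find("legal_area").get("min_height_mm")),
        "h1_min_mm": float(root.find("font").get("h1_min_mm")),
        "legal_min_mm": float(root.find("font").get("legal_min_mm")),
        "primary_ratio": (float(root.find("color").get("primary_ratio_min")),
                          float(root.find("color").get("primary_ratio_max"))),
    }
```

Switching from food to daily-chemical rules then amounts to loading a different XML file, which is consistent with the sub-second switching time reported in Table 3.4.2.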
3.4.2 Extensible Parameter Table
Table 3.4.1 Extensible Parameters of the Specification Adaptation Layer
| Parameter Category | Parameter Name | Description | Adjustable Range | Default Value (Food GB7718) | Default Value (Daily Chemical GB/T 39560) |
| Legal Area | Nutrition Table Position | Coordinates of nutrition table | Custom (x,y,w,h) | [W-150, H-80, W, H] | [W-120, H-70, W, H] |
| Legal Area | Legal Text Area Height | Minimum height of legal information area | ≥20mm | 30mm | 25mm |
| Font Requirement | H1 Minimum Font Size | Minimum font size of main title | ≥6mm | 8mm | 7mm |
| Font Requirement | Legal Minimum Font Size | Minimum font size of legal text | ≥2mm | 3mm | 2.5mm |
| Color Restriction | Primary Color Proportion | Proportion of brand primary color | 30%-70% | 40%-60% | 35%-65% |
| Whitespace | Legal Area Whitespace | Whitespace between legal area and other elements | ≥10mm | 15mm | 12mm |
3.4.3 Standard Template Switching Effect
Taking “food (GB7718)” and “daily chemical (GB/T 39560)” as examples, the switching effect of the specification adaptation layer is verified. Results are shown in Table 3.4.2:
Table 3.4.2 Standard Template Switching Effect
| Evaluation Indicator | Food GB7718 Template | Daily Chemical GB/T 39560 Template |
| Legal Area Compliance Rate | 98.2%±1.1% | 97.8%±1.3% |
| Specification Matching Score (1-5) | 4.5±0.2 | 4.4±0.3 |
| Switching Time | ≤0.5s | ≤0.5s |
The specification adaptation layer realizes fast and accurate switching of legal standards, with compliance rates exceeding 97%.
4 Dataset Construction: PackLay-IB
4.1 Annotation Schema
The PackLay-IB dataset’s annotation schema centers on “quantifiable visual communication constraints”, constructing a four-in-one annotation system (semantics, geometry, color, structure) to provide accurate supervision signals for the model.
Semantic Element Level:
Brand elements: LOGO (brand recognition symbol);
Visual elements: Product images, die-cut lines (for cutting);
Text elements: H1 (main title), H2 (selling point), Body (description), Legal (legal information, including hierarchy label and font size);
Functional elements: Nutrition table, barcode/QR code (with unique position requirements);
Each element has a unique ID, usage rule description (e.g., “Legal-001: Food safety standard, mandatory information”).
Geometric Annotation:
Bounding box (bbox): Annotates rectangular range (x,y,w,h) of elements; special-shaped elements (e.g., circular LOGO) are supplemented with polygonal vertex coordinates;
Reading order: Uses 1-n serial numbers to mark the expected reading path (e.g., “H1→H2→product image→Body→Legal”);
Z-index: Annotates element visual stacking priority (LOGO/H1: 1-2; background: lowest);
Alignment relationship: Annotates alignment between elements and grid lines/other elements (e.g., “H1 left-aligned to the 2nd column grid line”).
Color Annotation:
Brand color palette: Annotates LAB coordinates of primary/secondary colors (e.g., primary color L=50, a=20, b=10) and sets ΔE₀₀=2 as the color tolerance range;
Element color: Annotates actual color value and attribution of each element (e.g., “H1 text: primary color, ΔE₀₀=1.2”);
Violation color: Records elements with color deviations from the palette for model regularization training.
Structural Annotation:
Die-cut/functional area mask: Annotates with a 2D binary matrix (1=placement area, 0=no-placement area, e.g., sealing lines);
Fixed functional areas: Annotates positions required by legal/industry standards (e.g., “barcode area: lower right corner, w=100mm, h=50mm”) and area attributes (e.g., “Prohibited Area-001: Bottle cap obstruction area, no text”).
Figure 1 PackLay-IB Annotation Schema
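As a concrete illustration of the four-level schema above, a single annotation record might be serialized as the following Python dict; all field names and values are invented for illustration and are not the dataset’s actual format.

```python
# Hypothetical serialized annotation record for one H1 element,
# combining the semantic, geometric, color, and structural levels.
sample = {
    "element_id": "H1-001",
    "semantic": {"type": "H1", "rule": "main title, mandatory"},
    "geometry": {
        "bbox_mm": [20.0, 15.0, 120.0, 18.0],   # (x, y, w, h)
        "reading_order": 1,                      # first element on the path
        "z_index": 2,                            # stacking priority
        "alignment": "left-aligned to column 2 grid line",
    },
    "color": {"lab": [50, 20, 10], "palette_role": "primary", "delta_e00": 1.2},
    "structure": {"inside_diecut_mask": True, "fixed_area": None},
}
```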
4.2 Quality Control
A “multi-angle spot check + cyclic correction” quality inspection system is adopted to ensure annotation accuracy and standardization, with an overall annotation error rate <3%.
Two-Tier Verification:
First tier: Annotated by 3 designers with ≥2 years of packaging design experience;
Second tier: Verifies element attributes (e.g., distinguishing H2 from Body), geometric information (e.g., bbox completeness), structural compliance (e.g., die-cut prohibited area marking), and visual communication principle adherence (e.g., reading order consistency with consumer cognition);
Controversial annotations (e.g., non-rectangular image cropping boundaries) are resolved via negotiation based on the “Annotation Manual and Industry Standards”.
Consistency Test (Fleiss’ κ):
Calculates κ values for subjective dimensions (element attributes, semantic priority, alignment type) among 5 annotators (15% sampling ratio);
Results: Element attributes κ=0.92, semantic priority κ=0.88, alignment type κ=0.86 (all >0.85, indicating good consistency);
For low-consistency subdivisions (κ<0.8, e.g., “Legal/Body attribute distinction”), supplementary annotations and detailed classification guidelines are provided (e.g., “Legal information must include mandatory content such as production batch numbers”), and annotators are retrained until consistency requirements are met.
OCR Sampling Verification:
Randomly selects 10% of text annotation samples (H1, H2, Body, Legal); uses Tesseract 5.3 for end-to-end recognition;
Verifies whether “minimum font size” and “text-background contrast” annotations match actual readability; if CER>5% (unreadable), retraces and corrects annotations (e.g., “font size annotation inconsistency”).
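The CER check can be sketched with a plain Levenshtein edit distance; `is_readable` applies the CER ≤ 5% criterion stated above (function names are illustrative, and Tesseract itself is not invoked here).

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, ocr_output: str) -> float:
    """Character error rate: edit distance over reference length."""
    return levenshtein(reference, ocr_output) / max(1, len(reference))

def is_readable(reference: str, ocr_output: str, threshold: float = 0.05) -> bool:
    return cer(reference, ocr_output) <= threshold
```

One wrong character in a ten-character reference already gives a 10% CER, which this criterion flags as unreadable and routes back for annotation correction.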
Figure 2 PackLay-IB Dataset Quality Control Process
4.3 Dataset Splitting
The dataset is split based on “basic training verification + cross-scenario generalization testing” to ensure the model’s basic learning ability and adaptability to real scenarios.
4.3.1 Basic Dataset Split
Total samples: 4,000 (covering 3 categories: food, daily chemicals, medicine; ~1,300-1,400 samples per category);
Split ratio: Training set (3,200, 80%), validation set (400, 10%), test set (400, 10%);
Sampling method: Group random sampling, ensuring element type distribution (e.g., LOGO, H1, Legal), brand color type (cold/warm colors), and die-cut size type (small bottle labels, large cartons) of the three subsets are consistent with the source dataset (deviation <2%);
Random seed: seed=2024 (ensures experimental reproducibility).
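The group random sampling with a fixed seed can be sketched as follows: shuffling within each stratum keeps the per-category distribution of the three subsets close to the source data. The dict-of-category-lists input format is an illustrative assumption.

```python
import random

def stratified_split(samples_by_category, seed=2024, ratios=(0.8, 0.1, 0.1)):
    """Split each category 80/10/10 with a seeded RNG for reproducibility."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for category, samples in samples_by_category.items():
        pool = list(samples)
        rng.shuffle(pool)                      # shuffle within the stratum
        n = len(pool)
        n_train, n_val = int(n * ratios[0]), int(n * ratios[1])
        train += pool[:n_train]
        val += pool[n_train:n_train + n_val]
        test += pool[n_train + n_val:]         # remainder goes to test
    return train, val, test
```

Because every category contributes proportionally to each subset, the element-type and die-cut-type distributions stay within the stated <2% deviation by construction of the strata.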
4.3.2 Cross-Scenario Generalization Subset
Cross-Brand Generalization Subset (500 samples):
Selects 10 brands with distinct styles (FMCG, cosmetics, maternal and child products); 50 samples per brand;
Samples fully reflect brand color palettes, LOGO rules, and die-cuts; used to verify model adaptation to different brand specifications.
Cross-Die-Cut Generalization Subset (500 samples):
Includes 8 die-cut types (special-shaped bottle labels, folding cartons, square cans); 60-70 samples per type;
Samples are annotated with complete die-cut prohibited areas and fixed functional areas; used to verify model adaptation to different physical constraints.
Note: The generalization subset has no overlapping data with the basic dataset, ensuring fair testing.
4.3.3 Per-Class/Per-Brand Distribution
Table 4.3.1 Per-Class (Product Category) Indicator Distribution (Test Set)
| Product Category | Comprehensive Constraint Satisfaction (%) | Brand Color Consistency (%) | OCR Accuracy (%) |
| Food | 90.1±2.0 | 93.5±1.4 | 95.2±1.6 |
| Daily Chemicals | 88.7±2.2 | 91.8±1.5 | 94.1±1.8 |
| Medicine | 88.2±2.3 | 92.1±1.6 | 94.7±1.7 |
Table 4.3.2 Per-Brand Indicator Distribution (Cross-Brand Subset)
| Brand Style | Number of Brands | Comprehensive Constraint Satisfaction (%) | Brand Color Consistency (%) |
| Cold-Tone | 4 | 87.5±2.5 | 90.3±1.8 |
| Warm-Tone | 3 | 86.8±2.7 | 91.5±1.7 |
| Neutral-Tone | 3 | 88.2±2.4 | 92.1±1.6 |
4.4 Data License & Compliance Statement
4.4.1 Data License
The PackLay-IB dataset adopts the CC BY-NC-SA 4.0 License (Attribution-NonCommercial-ShareAlike 4.0 International), allowing non-commercial use, sharing, and adaptation, provided that the original author is credited and derivative works use the same license.
4.4.2 Data Desensitization Process
Brand Asset Processing:
Brand LOGO: Blurred to retain color features while removing specific identifiers;
Brand name: Replaced with “Brand A/B/C” to avoid trademark disputes.
Privacy Information Removal:
Removes enterprise privacy information (production batch numbers, contact information) from legal text;
Retains only the structure and position of legal information (consistent with regulatory requirements).
Image Processing:
Product images: Desensitized (e.g., removing product-specific patterns) to avoid copyright issues;
Die-cut drawings: Simplified to retain only boundary coordinates (no enterprise-specific design details).
4.4.3 Open-Source Content
The following content is open-source to support reproducibility:
Annotation Scheme: Detailed PDF document (including annotation rules, parameter definitions, and example diagrams);
Sample Data: 100 groups of desensitized samples (covering 3 product categories, 5 die-cut types);
Evaluation Script: Python code for calculating all evaluation indicators (including effect sizes and multiple-comparison correction);
Statistical Report: Complete dataset statistical information (sample distribution, annotation quality, etc.).
5 Experiments and Evaluation
5.1 Experimental Setup
5.1.1 Data and Environment Configuration
Dataset: PackLay-IB (4,000 basic samples + 1,000 generalization samples);
Hardware: NVIDIA RTX 4090 GPU (24GB), Intel Xeon W-2295 CPU, 128GB DDR4 memory;
Software: PyTorch 2.1, Python 3.10, Tesseract 5.3 (OCR), ViT-B/16 pre-trained model (image feature extraction);
Model Efficiency Metrics:
Memory Usage: Peak training memory ~18GB;
Training Time: 72 hours (two-stage training, RTX 4090);
Model Size: ~86M parameters;
FLOPs: ~12G FLOPs per inference;
Inference Delay: Basic layout generation ~0.8s, interactive rearrangement ~0.3s.
5.1.2 Comparison with Baseline and SOTA Models
To fully verify model performance, 6 baselines/SOTA models are selected, with unified training/inference settings (same dataset, hardware, and training parameters):
Table 5.1.1 Baseline/SOTA Models for Comparison
| Model Type | Model Name | Description |
| Traditional | Template Method | Industry common product category template library (food: 3-column template; daily chemical: 4-column template) |
| Deep Learning (GAN) | LayoutGAN [6] | Classic layout generation GAN, retrained on PackLay-IB |
| Deep Learning (Transformer) | DocTransformer [Open-Source] | General document layout Transformer, retrained on PackLay-IB |
| SOTA (Transformer) | LayoutTransformer [2023] | Latest Transformer-based layout model, supports multimodal input |
| SOTA (Diffusion) | DiffusionLayout [2024] | Diffusion-based layout generation model, strong diversity |
| SOTA (GNN) | GNN-Layout [2024] | GNN-based constraint layout model, optimized for geometric constraints |
5.1.3 Additional Transfer/Zero-Shot Testing
To verify model generalization beyond the self-built dataset, transfer/zero-shot testing is conducted on 2 public datasets:
PubLayNet: General document layout dataset (500K samples, including posters, brochures);
DocBank: Academic document layout dataset (1M samples, including papers, reports);
Testing Task: Zero-shot packaging layout generation (no fine-tuning on public datasets);
Evaluation Metric: Adaptation score (1-5, evaluating whether the generated layout conforms to packaging characteristics).
5.1.4 Training Parameter Setting
The training process is divided into two phases:
Phase 1: Supervised Pre-Training
Objective: Train the generator to master basic layout attributes;
Loss Function: Bounding box L1 loss + hierarchy classification cross-entropy;
Optimizer: AdamW (weight decay 1e-5);
Batch Size: 16;
Initial Learning Rate: 1e-4;
Epochs: 100;
Convergence Criterion: Bounding box prediction error ≤2mm, hierarchy classification accuracy ≥85% (validation set).
Phase 2: Adversarial Joint Training
Objective: Integrate visual communication constraints;
Loss Function: Total loss (weighted sum of the five constraint losses);
Optimizer: AdamW (weight decay 1e-5);
Batch Size: 8;
Initial Learning Rate: 5e-5;
Epochs: 200;
Convergence Criterion: Validation set comprehensive constraint satisfaction rate does not improve for 10 consecutive epochs;
Random Seed: seed=2024 (ensures reproducibility).
5.1.5 Hyperparameter Sensitivity Experiment Design
Single-Factor Sensitivity Analysis
Fix four of the five loss weights at their default values (α1=1.2, α2=1.5, α3=2.0, α4=1.0, α5=1.8);
Vary the 5th weight within [0.2×default, 5×default] (step=0.2×default);
Train for 50 epochs (based on pre-trained generator) per weight value; record average performance over 3 runs.
Partial Grid Search
Select the 3 most sensitive weights (λread, λbrand, λreadability) identified by the single-factor analysis; set 3 levels for each:
α2 (λread): [1.0, 1.5, 2.0]
α3 (λbrand): [1.5, 2.0, 2.5]
α5 (λreadability): [1.4, 1.8, 2.2]
Total combinations: 3×3×3=27; train for 100 epochs (full two-phase training) per combination; evaluate on the test set.
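The 27-combination search space described above can be enumerated with a few lines of Python (a minimal sketch; the dictionary keys are shorthand for the α2/λread, α3/λbrand, and α5/λreadability weights, not identifiers from the released code):

```python
from itertools import product

# Three sensitive loss weights, three levels each -> 3 x 3 x 3 = 27 combinations.
levels = {
    "alpha2_read": [1.0, 1.5, 2.0],
    "alpha3_brand": [1.5, 2.0, 2.5],
    "alpha5_readability": [1.4, 1.8, 2.2],
}

# Each combination is a full weight assignment to train and evaluate once.
combinations = [
    dict(zip(levels.keys(), values))
    for values in product(*levels.values())
]
```

Each entry of `combinations` would then drive one full two-phase training run before test-set evaluation.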
5.2 Evaluation Metrics
An evaluation system with “visual communication metrics as the core and image metrics as supplements” is constructed. All metrics are calculated on the PackLay-IB test set/generalization subset, repeated 5 times, and reported as “mean ± 95% confidence interval (CI)”. Effect size (Cliff’s delta) and Bonferroni multiple comparison correction are used to verify significance (p<0.05 after correction).
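For readers reproducing the statistics, the two procedures named above can be sketched in plain Python (a minimal illustration, not the open-sourced evaluation script itself):

```python
from itertools import product

def cliffs_delta(xs, ys):
    """Cliff's delta effect size: P(x > y) - P(x < y) over all pairs."""
    gt = sum(x > y for x, y in product(xs, ys))
    lt = sum(x < y for x, y in product(xs, ys))
    return (gt - lt) / (len(xs) * len(ys))

def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: each p-value is compared against alpha / m."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]
```

A delta of ±1 means complete separation between the two samples; values near 0 mean heavy overlap.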
5.2.1 Metric Definition Table
Table 5.2.1 Evaluation Metric Definition Table
| Metric Category | Metric Name | Calculation Method | Unit/Range | Evaluation Standard |
| Visual Communication (Quantitative) | Comprehensive Constraint Satisfaction | Average of grid alignment rate, whitespace compliance rate, minimum font size compliance rate, LOGO proximity compliance rate, legal area coverage rate | % | Higher is better |
| | OCR Accuracy | | % | Higher is better (≥90% is excellent) |
| | OCR CER | | % | Lower is better (<5% is readable) |
| | Brand Color Consistency | | % | Higher is better |
| | Hierarchy Recognition Rate | | % | Higher is better |
| | Grid Alignment Rate | | % | Higher is better |
| Visual Communication (Subjective) | Readability Score | Rated by evaluators (1-5: 1=unreadable, 5=very readable) | Score (1-5) | Higher is better |
| | Information Hierarchy Score | Rated by evaluators (1-5: 1=confusing, 5=clear) | Score (1-5) | Higher is better |
| | Brand Consistency Score | Rated by evaluators (1-5: 1=no consistency, 5=high consistency) | Score (1-5) | Higher is better |
| | NASA-TLX Score | Subjective workload assessment (mental demand, physical demand, etc.) | Score (0-100) | Lower is better |
| | Explainability Perception Score | Rated by evaluators (1-5: 1=unexplainable, 5=fully explainable) | Score (1-5) | Higher is better |
| Image Metrics (Supplementary) | LPIPS | Learned Perceptual Image Patch Similarity (compares layout visual similarity to professional designs) | Float | Lower is better (<0.2 is excellent) |
| | FID | Fréchet Inception Distance (measures layout distribution similarity to professional designs) | Float | Lower is better (<30 is excellent) |
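As a concrete illustration of the first quantitative metric, comprehensive constraint satisfaction is simply the mean of its five component rates; a minimal sketch (the dictionary keys are illustrative shorthand, not names from the released script):

```python
def comprehensive_constraint_satisfaction(rates):
    """
    Mean of the five component rates (all in %):
    grid alignment, whitespace compliance, minimum font size compliance,
    LOGO proximity compliance, and legal area coverage.
    """
    expected = {"grid_alignment", "whitespace", "min_font_size",
                "logo_proximity", "legal_coverage"}
    assert set(rates) == expected, "all five component rates are required"
    return sum(rates.values()) / len(rates)
```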
5.2.2 Subjective Experiment Design
To improve the reliability of subjective evaluation, the experiment is optimized as follows:
Subject Recruitment:
Total subjects: 50;
Background division:
Professional group (20): Visual communication designers with ≥8 years of experience;
Non-professional group (30): General consumers aged 20-45 (15 Chinese background, 15 English background, covering different cultures/languages);
Table 5.2.2 Subject Background Distribution
| Group | Number | Age Range | Professional Background | Language/Culture Background |
| Professional | 20 | 28-50 | Packaging/visual communication design | Chinese (12), English (8) |
| Non-Professional | 30 | 20-45 | Student/office worker/other | Chinese (15), English (15) |
Task Design:
Task: Evaluate 10 layouts generated by different models (random order, avoiding learning effects);
Control measures:
Task randomization: Layout presentation order is randomized for each subject;
Control balance: Each model’s layout is evaluated by the same number of professionals/non-professionals;
Learning effect control: The first 2 evaluated layouts are regarded as practice samples and not included in the final score;
Evaluation content: Readability, information hierarchy, brand consistency, NASA-TLX workload, explainability perception.
Data Collection:
Evaluation tool: Online questionnaire (supporting score input and comment submission);
Interaction logs: Record subject operation time, score modification history, and comments (to be open-sourced);
Failure cases: Mark layouts with scores <3 (poor performance) as failure cases, analyze reasons (e.g., die-cut boundary error, legal information deviation) and open-source.
5.3 Results & Analysis
5.3.1 Baseline and SOTA Comparison Results
HiBrand-Layout significantly outperforms baselines and SOTA models in core visual communication metrics (p<0.05 after Bonferroni correction).
Table 5.3.1 Comparison of Key Metrics (Test Set)
| Model | Comprehensive Constraint Satisfaction (%) | Brand Color Consistency (%) | OCR Accuracy (%) | LPIPS | FID | Cliff’s delta (vs HiBrand-Layout) |
| Template Method | 62.5±3.4 | 68.2±2.9 | 80.2±2.5 | 0.35±0.04 | 52.3±3.2 | -0.82 (p<0.001) |
| LayoutGAN | 73.8±2.8 | 75.6±2.2 | 88.7±2.3 | 0.28±0.03 | 41.5±2.8 | -0.65 (p<0.001) |
| DocTransformer | 70.1±3.0 | 72.3±2.5 | 86.5±2.4 | 0.25±0.03 | 38.7±2.6 | -0.71 (p<0.001) |
| LayoutTransformer | 81.5±2.5 | 83.2±1.9 | 91.3±2.0 | 0.22±0.02 | 33.2±2.4 | -0.42 (p<0.01) |
| DiffusionLayout | 80.3±2.6 | 81.7±2.1 | 90.8±2.1 | 0.20±0.02 | 31.5±2.3 | -0.45 (p<0.01) |
| GNN-Layout | 82.7±2.4 | 80.5±2.2 | 91.1±2.0 | 0.23±0.02 | 32.8±2.5 | -0.38 (p<0.01) |
| HiBrand-Layout | 89.2±2.1 | 92.3±1.5 | 94.5±1.8 | 0.18±0.03 | 28.5±2.1 | – |
Key Observations:
HiBrand-Layout’s comprehensive constraint satisfaction rate is 42.7% higher than the Template Method, 20.9% higher than LayoutGAN, and 6.5%-9.1% higher than SOTA models—benefiting from the integration of visual communication constraints;
Brand color consistency rate is 17.1%-24.1% higher than baselines/SOTA models—due to the ΔE00 regularization term and primary color proportion control; OCR accuracy rate is 14.3%-18.3% higher than baselines—thanks to text readability loss (contrast and font size constraints).
5.3.2 Transfer/Zero-Shot Test Results
Table 5.3.2 Transfer/Zero-Shot Test Results (Public Datasets)
| Dataset | Model | Adaptation Score (1-5) | Layout Compliance Rate (%) |
| PubLayNet | LayoutTransformer | 3.2±0.4 | 68.5±3.1 |
| | DiffusionLayout | 3.5±0.3 | 72.3±2.8 |
| | HiBrand-Layout | 4.1±0.3 | 83.7±2.5 |
| DocBank | LayoutTransformer | 2.9±0.5 | 62.1±3.3 |
| | DiffusionLayout | 3.1±0.4 | 65.8±3.0 |
| | HiBrand-Layout | 3.8±0.4 | 78.2±2.7 |
HiBrand-Layout achieves higher adaptation scores and compliance rates on public datasets, verifying its strong generalization ability.
5.3.3 Ablation Experiment Results
To verify the effectiveness of key modules, 4 ablation models are designed:
Table 5.3.3 Ablation Experiment Results (Test Set)
| Model | Comprehensive Constraint Satisfaction (%) | Brand Color Consistency (%) | Hierarchy Recognition Rate (%) | OCR Accuracy (%) |
| HiBrand-Layout (Full Model) | 89.2±2.1 | 92.3±1.5 | 93.5±1.8 | 94.5±1.8 |
| Ablation1 (Remove ΔE00 Regularization) | 82.5±2.3 | 68.9±2.7 | 92.8±1.9 | 94.3±1.7 |
| Ablation2 (Remove Reading Order Loss) | 80.1±2.4 | 91.7±1.6 | 76.4±3.1 | 94.1±1.8 |
| Ablation3 (Remove Text Readability Loss) | 83.7±2.2 | 92.1±1.5 | 93.2±1.7 | 87.2±2.1 |
| Ablation4 (Remove Specification Adaptation Layer) | 81.3±2.5 | 90.8±1.7 | 91.5±1.9 | 92.8±1.9 |
Ablation1 reduces brand color consistency by 25.3%—proving the necessity of ΔE00 regularization; Ablation2 reduces hierarchy recognition rate by 17.1%—confirming the role of reading order loss in maintaining correct information flow; Ablation3 reduces OCR accuracy by 7.3%—verifying the importance of text readability loss; Ablation4 reduces comprehensive constraint satisfaction by 7.9%—demonstrating the value of the specification adaptation layer for legal compliance.
5.3.4 Subjective Evaluation Results
Professional/Non-Professional Score Comparison:
Table 5.3.4 Subjective Score Comparison (1-5)
| Group | Model | Readability | Information Hierarchy | Brand Consistency | Explainability Perception | NASA-TLX Score (0-100) |
| Professional | HiBrand-Layout | 4.2±0.3 | 4.3±0.2 | 4.5±0.2 | 4.1±0.3 | 28.5±4.2 |
| | LayoutTransformer | 3.5±0.4 | 3.6±0.3 | 3.8±0.3 | 2.9±0.4 | 39.2±4.5 |
| Non-Professional | HiBrand-Layout | 4.0±0.4 | 4.1±0.3 | 4.3±0.3 | 3.9±0.4 | 31.7±4.8 |
| | LayoutTransformer | 3.3±0.5 | 3.2±0.4 | 3.5±0.4 | 2.7±0.5 | 42.5±5.1 |
HiBrand-Layout scores 0.7-1.2 higher than LayoutTransformer in all subjective metrics (both groups); NASA-TLX score is 10.7-13.3 lower than LayoutTransformer—indicating lower user workload; Professional group scores slightly higher than non-professional group—reflecting the model’s alignment with professional design standards.
Target Information Search Test: Task: Subjects search for specific information (e.g., “product name”, “nutrition information”) in the layout; Results: HiBrand-Layout’s search time (12.3±1.5s) is 43.3% shorter than the Template Method (21.7±2.3s) and 28.5% shorter than LayoutTransformer (17.2±2.0s); main idea comprehension accuracy (91.7±2.6%) is 15.4% higher than the Template Method (76.3±3.1%).
5.3.5 Hyperparameter Sensitivity Results
Single-Factor Sensitivity Curves:
λnon-overlap (α1): Performance increases rapidly when α1<1.2 (default), plateaus at 1.2-2.4, and decreases slightly when α1>2.4 (over-penalization of element spacing); optimal range: [1.2, 2.4].
λread (α2): Peaks at α2=1.5 (default); drops sharply when α2<1.0 (reading order disorder) and gradually declines when α2>2.0 (rigid layout); optimal range: [1.2, 1.8].
λbrand (α3): Most sensitive weight; increases linearly until α3=2.0 (default), then plateaus; α3<1.0 reduces brand color consistency by >15%; optimal range: [1.8, 2.5].
λgrid (α4): Stabilizes when α4≥1.0 (default); α4<1.0 reduces grid alignment rate by 8%-10%; optimal range: [1.0, 2.0].
λreadability (α5): Peaks at α5=1.8 (default); α5<1.4 leads to CER>5% for 15% of samples; α5>2.2 causes excessive font size; optimal range: [1.6, 2.0].
Grid Search Results:
Table 5.3.5 Top 5 Weight Combinations (Comprehensive Constraint Satisfaction)
| Rank | α2 (λread) | α3 (λbrand) | α5 (λreadability) | Comprehensive Constraint Satisfaction (%) | Brand Color Consistency (%) | OCR Accuracy (%) |
| 1 | 1.6 | 2.2 | 1.9 | 89.7±1.9 | 93.1±1.4 | 94.8±1.6 |
| 2 | 1.5 (default) | 2.0 (default) | 1.8 (default) | 89.2±2.1 | 92.3±1.5 | 94.5±1.8 |
| 3 | 1.5 | 2.5 | 1.8 | 88.9±2.0 | 93.5±1.3 | 93.9±1.7 |
| 4 | 1.8 | 2.0 | 1.7 | 88.7±2.2 | 92.1±1.6 | 94.3±1.9 |
| 5 | 1.5 | 1.8 | 2.0 | 88.5±2.3 | 91.8±1.7 | 95.1±1.5 |
The default weight combination ranks 2nd, with performance close to the optimal combination (89.7% vs 89.2%), verifying the rationality of default settings. The model is robust to small weight adjustments (performance variance across 27 combinations: 3.2%).
5.3.6 Generalization Ability Test Results
Table 5.3.6 Cross-Brand/Cross-Die-Cut Generalization Results
| Scenario | Model | Comprehensive Constraint Satisfaction (%) | Brand Color Consistency (%) | OCR Accuracy (%) |
| Cross-Brand | Template Method | 58.3±3.6 | 62.1±3.1 | 78.5±2.7 |
| | LayoutGAN | 68.7±3.5 | 70.2±2.8 | 85.3±2.4 |
| | DocTransformer | 65.2±3.7 | 67.8±3.0 | 83.1±2.6 |
| | LayoutTransformer | 75.1±2.8 | 78.5±2.2 | 89.2±2.1 |
| | DiffusionLayout | 73.8±3.0 | 76.9±2.4 | 88.7±2.2 |
| | GNN-Layout | 76.4±2.7 | 75.3±2.5 | 89.5±2.0 |
| | HiBrand-Layout | 82.5±2.9 | 88.3±1.8 | 93.2±1.9 |
| Cross-Die-Cut | Template Method | 55.7±3.8 | 61.8±3.2 | 77.9±2.8 |
| | LayoutGAN | 65.2±3.8 | 68.9±2.9 | 84.7±2.5 |
| | DocTransformer | 62.5±3.9 | 66.5±3.1 | 82.5±2.7 |
| | LayoutTransformer | 72.3±3.1 | 77.2±2.3 | 88.9±2.2 |
| | DiffusionLayout | 70.9±3.2 | 75.8±2.5 | 88.1±2.3 |
| | GNN-Layout | 73.8±3.0 | 74.6±2.6 | 89.0±2.1 |
| | HiBrand-Layout | 80.3±3.2 | 88.1±1.9 | 92.8±2.0 |
Even in cross-scenario generalization tests, HiBrand-Layout maintains leading performance: its comprehensive constraint satisfaction rate is 13.8%-24.2% higher than traditional models (Template Method) and 5.9%-14.8% higher than deep learning baselines (LayoutGAN/DocTransformer) in both cross-brand and cross-die-cut scenarios. This benefits from the model’s cross-modal attention mechanism (which fuses brand color, die-cut structure, and other modal information) and the specification adaptation layer (which quickly adapts to different brand norms and die-cut constraints).
In cross-brand scenarios, HiBrand-Layout’s brand color consistency rate (88.3%±1.8%) is 18.1% higher than LayoutGAN (70.2%±2.8%). This is because the ΔE₀₀ regularization term and primary color proportion control enable the model to accurately match the color palette of different brands, avoiding color deviation caused by brand style switching.
In cross-die-cut scenarios (e.g., from flat labels to 3D folding cartons), HiBrand-Layout’s comprehensive constraint satisfaction rate (80.3%±3.2%) is 15.1% higher than LayoutGAN (65.2%±3.8%). The reason lies in the die-cut mask embedded in the Transformer encoder—this spatial attention mechanism suppresses prohibited areas (e.g., creases, sealing lines) and ensures elements are placed within valid regions, even for irregular die-cut structures.
5.4 Explainability Visualization
To address the “black box” issue of deep learning models, this study supplements attention maps and constraint penalty contribution bar charts to clarify the marginal contribution of each constraint to the final layout. All visualizations are derived from the test set (100 random samples) and provided in the appendix for reproducibility.
5.4.1 Attention Map of Transformer Encoder
The Transformer encoder’s cross-modal attention map reflects the correlation between different modal features and layout elements. Taking a food packaging sample (H1: product name, LOGO, nutrition table) as an example:
Brand color attention: The attention weight between H1 text and the brand primary color (LAB: L=55, a=18, b=12) reaches 0.87, indicating the model prioritizes binding high-hierarchy text with the primary color—consistent with the brand color consistency loss design.
Die-cut area attention: The attention weight of the nutrition table (fixed functional area) and the die-cut mask’s valid region (lower right corner) is 0.92, while the weight of the sealing line (prohibited area) is 0.03—proving the model effectively avoids placing key elements in no-placement areas.
Reading order attention: The attention weight between H1 and H2 is 0.81, ensuring H1 is read before H2—aligning with the reading order loss constraint.
Figure 3 Transformer Encoder Attention Map Visualization
5.4.2 Constraint Penalty Contribution Bar Chart
The constraint penalty contribution is calculated as the product of each loss term’s value and its weight (αk × Lk), reflecting the impact of each constraint on layout optimization. Results from 100 test samples are averaged below:
Table 5.4.1 Average Constraint Penalty Contribution (Test Set)
| Constraint Loss | Penalty Contribution (Mean ± Std) | Proportion of Total Penalty (%) |
| Brand color consistency loss (α3=2.0) | 1.85±0.23 | 32.1 |
| Reading order loss (α2=1.5) | 1.42±0.18 | 24.7 |
| Text readability loss (α5=1.8) | 1.13±0.15 | 19.5 |
| Non-overlap loss (α1=1.2) | 0.78±0.12 | 13.5 |
| Grid alignment loss (α4=1.0) | 0.60±0.10 | 10.3 |
The brand color consistency loss has the highest contribution (32.1%), confirming that brand color consistency is the core constraint driving layout generation—consistent with the study’s focus on “brand consistency” in visual communication.
The reading order loss ranks second (24.7%), indicating the model prioritizes maintaining correct information flow, which is essential for “information hierarchy clarity”.
The lowest contribution comes from the grid alignment loss (10.3%), because the grid parameterization and alignment loss have already stabilized element positioning, requiring less additional penalty during optimization.
Figure 4 Constraint Penalty Contribution Distribution
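The contribution calculation described in this subsection (loss value times weight, normalized to a share of the total penalty) can be sketched as follows; the constraint names and numbers in the test are illustrative, not the reported results:

```python
def penalty_contributions(loss_values, weights):
    """Per-constraint penalty contribution (alpha_k * L_k) and its
    percentage share of the total penalty."""
    contrib = {k: weights[k] * loss_values[k] for k in loss_values}
    total = sum(contrib.values())
    shares = {k: 100.0 * v / total for k, v in contrib.items()}
    return contrib, shares
```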
6 System Implementation
6.1 Training Pipeline
The HiBrand-Layout model adopts a two-stage progressive training pipeline: “supervised regression pre-training + adversarial joint training”, with each stage tightly linked to visual communication needs.
6.1.1 Phase 1: Supervised Regression Pre-Training
Objective: Enable the generator to master basic layout attribute prediction (bounding box, hierarchy, initial color) without adversarial interference.
Input: Multimodal features from PackLay-IB (copy hierarchy sequence, brand color LAB coordinates, product image ViT features, die-cut mask).
Output: Element bounding boxes (x, y, w, h), z-index (visual hierarchy), and preliminary color values.
Loss Function:
Bounding box L1 loss: Minimizes the Euclidean distance between predicted and annotated bounding boxes (target error ≤2mm);
Hierarchy classification cross-entropy: Ensures correct ranking of H1/H2/Body/Legal (target accuracy ≥85%).
Training Parameters: Batch size=16, AdamW optimizer (weight decay=1e-5), initial learning rate=1e-4, epochs=100.
Convergence Criterion: Bounding box prediction error ≤2mm and hierarchy classification accuracy ≥85% on the validation set.
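The two Phase-1 loss terms can be sketched in plain Python (an illustrative re-implementation, not the training code; the released model would use PyTorch equivalents of these operations):

```python
import math

def bbox_l1_loss(pred_boxes, true_boxes):
    """Mean absolute error over (x, y, w, h) coordinates in mm; target <= 2 mm."""
    total, count = 0.0, 0
    for p, t in zip(pred_boxes, true_boxes):
        total += sum(abs(pi - ti) for pi, ti in zip(p, t))
        count += 4
    return total / count

def hierarchy_cross_entropy(probs, label_index):
    """Cross-entropy for one element's H1/H2/Body/Legal class prediction."""
    return -math.log(probs[label_index])
```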
6.1.2 Phase 2: Adversarial Joint Training
Objective: Integrate visual communication constraints (brand color, reading order, grid alignment) via adversarial training.
Generator Update: Optimizes the total loss (the weighted sum of all constraint losses), adjusting element positions, colors, and sizes to meet the multiple constraints. During training, Tesseract OCR is called in real time: if the text CER exceeds 5%, the weight of the text readability loss is dynamically increased by 20% to ensure readability.
Discriminator Update: Takes the “geometric feature map” of the generated layout (element alignment deviation, whitespace ratio, reading order sequence, ΔE₀₀ distribution) as input, and outputs a binary classification result (“compliant with visual communication principles” or not). The discriminator loss is calculated via cross-entropy, and the gradient is backpropagated to the generator to form a closed loop.
Training Parameters: Batch size=8, learning rate=5e-5 (decayed from Phase 1), epochs=200.
Early Stopping: Training stops if the validation set’s comprehensive constraint satisfaction rate does not improve for 10 consecutive epochs.
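The two training-control rules described above (the CER-triggered weight bump and patience-based early stopping) can be sketched as follows; the function and class names are illustrative, not identifiers from the released code:

```python
def adjust_readability_weight(weight, cer):
    """If the OCR character error rate exceeds 5%, bump lambda_readability by 20%."""
    return weight * 1.2 if cer > 0.05 else weight

class EarlyStopping:
    """Stop when the validation constraint-satisfaction rate fails to
    improve for `patience` consecutive epochs."""
    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("-inf")
        self.stale = 0

    def step(self, satisfaction_rate):
        if satisfaction_rate > self.best:
            self.best = satisfaction_rate
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience  # True -> stop training
```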
Figure 5 HiBrand-Layout Model Training Pipeline
6.2 Interactive Prototype (Designer-in-the-Loop)
Based on the HiBrand-Layout model, an interactive prototype is developed to achieve “explainability-controllability-visual communication goal” trinity collaboration. The prototype is built with Python (backend) and Vue.js (frontend), supporting Windows/macOS systems, and its core functions are linked to model constraints.
6.2.1 Core Function 1: Color Palette Lock
Mechanism: Maps to the brand color consistency constraint (ΔE₀₀≤2 and primary color proportion 40%-60%). Designers import brand primary/secondary color LAB coordinates via the interface and check “Lock Color Palette”—the system automatically sets color tolerance as a hard constraint.
Explainable Feedback: If manual color adjustment causes violations (e.g., primary color proportion=32%), the interface highlights the illegal element in red and pops up a prompt: “Violation of brand primary color specifications (current 32%, recommended 40%-60%)”, with a color palette recommendation panel (showing 3 optional primary color shades that meet ΔE₀₀≤2).
Effect: Brand color retention rate ≥90% during interactive adjustments, avoiding visual deviation from brand norms.
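The two hard constraints behind “Lock Color Palette” can be checked with a small validation routine (a sketch; the function name and return format are assumptions, not the prototype’s API):

```python
def check_palette_lock(delta_e00_values, primary_ratio):
    """Validate the color palette lock: every element within Delta E00 <= 2
    of the brand palette, and the primary color covering 40%-60% of elements.
    Returns a list of human-readable violation messages (empty = compliant)."""
    violations = []
    if any(d > 2.0 for d in delta_e00_values):
        violations.append("color outside brand palette tolerance (Delta E00 > 2)")
    if not 0.40 <= primary_ratio <= 0.60:
        violations.append(
            f"primary color proportion {primary_ratio:.0%} outside 40%-60%")
    return violations
```

A non-empty return value corresponds to the red highlight and prompt shown in the interface.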
6.2.2 Core Function 2: Element Priority Guidance
Mechanism: Aligns with the reading order loss. The element control area displays elements in H1→H2→Body→Legal order, with a 1-5 priority slider. Adjusting the slider modifies the visual weight of elements (e.g., increasing H1’s priority raises its visual weight relative to lower-hierarchy elements).
Visualization Support: A real-time “reading path preview” is displayed on the layout (red dashed line), showing the consumer’s expected reading sequence (e.g., H1→LOGO→H2→nutrition table). Designers can intuitively judge whether the sequence conforms to visual cognitive logic.
Effect: Reduces the number of manual adjustments for reading order by 68%, avoiding chaotic information flow.
6.2.3 Core Function 3: One-Click Rearrangement
Mechanism: Driven by multiple constraints (die-cut mask, grid alignment, and legal area). When adding new elements (e.g., a new H2 selling point) or changing the die-cut, clicking “Rearrange” triggers:
Pre-lock: Fixes positions of legal elements (nutrition table, barcode) and die-cut prohibited areas;
Dynamic adjustment: Optimizes other elements’ positions based on grid alignment and reading order;
Report output: Generates a “constraint satisfaction report” (grid alignment rate, whitespace ratio, brand color consistency) and a “before-after comparison view”.
Efficiency: Design time is reduced by 62.1% compared to manual rearrangement (from an average of 45 minutes to 17 minutes), and modification steps are reduced by 73.5%.
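The “before-after comparison view” in the rearrangement report can be illustrated by a small helper (the metric names and report shape are hypothetical):

```python
def constraint_report(before, after):
    """Build the before-after comparison part of the rearrangement report.
    `before`/`after` map metric names (grid alignment rate, whitespace ratio,
    brand color consistency) to values in %."""
    return {
        metric: {"before": before[metric],
                 "after": after[metric],
                 "delta": round(after[metric] - before[metric], 1)}
        for metric in before
    }
```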
Figure 6 Core Functions of Designer-in-the-Loop Interactive Prototype
6.3 Interactive Log & Failure Case Open-Source
To support teaching and iterative optimization, the prototype records interactive logs and open-sources failure cases:
Interactive Logs: Record designer operations (color adjustments, priority changes, rearrangement times) and model feedback (constraint violation prompts, parameter adjustments), with a CSV format for download. Logs include timestamps, operation types, and constraint compliance changes.
Failure Cases: Open-source 8 typical failure cases with:
Screenshots of failed layouts (marking illegal areas);
Root cause analysis (e.g., “die-cut boundary error due to over-optimization of grid alignment”);
Optimization plans (e.g., “reduce λgrid (grid weight) from 1.0 to 0.8 to balance grid and die-cut constraints”).
Open-Source Link: Included in the PackLay-IB GitHub repository.
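A minimal sketch of the CSV log serialization described above (the field names are illustrative; the released data dictionary defines the actual schema):

```python
import csv
import io

# Hypothetical log schema: one row per designer operation.
LOG_FIELDS = ["timestamp", "operation", "constraint",
              "compliant_before", "compliant_after"]

def write_log_rows(rows):
    """Serialize interaction-log records to CSV text for download."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=LOG_FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```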
7 Conclusion
This study addresses the inefficiencies of manual packaging layout design and the limitations of existing automated methods by constructing a visual communication-driven intelligent layout generation solution. The key conclusions and contributions are as follows:
Quantification of Visual Communication Goals (RQ1): Classic visual communication principles (grid alignment, reading order) and brand specifications (color palette, legal area) are converted into quantifiable constraints—such as brand color consistency (ΔE₀₀≤2), information hierarchy (reading order loss), and text readability (contrast ≥4.5:1, font size ≥3mm). These constraints form a measurable driving force for automatic layout, solving the problem of “difficult quantification of communication effects”.
Formalization of Visual Communication Constraints (RQ2): The HiBrand-Layout model adopts a two-stage “Transformer Encoder-GAN Decoder” architecture:
The Transformer encoder fuses multimodal information (copy, color, die-cut) to ensure content awareness;
The GAN decoder optimizes visual communication constraints via adversarial training. This design realizes the differentiable/optimizable representation of principles, avoiding the “geometric-only” bias of traditional methods.
Explainable Designer-in-the-Loop Interaction (RQ3): The interactive prototype achieves controllability and explainability through color palette locking (brand constraint binding), element priority guidance (reading order visualization), and one-click rearrangement (constraint-driven adjustment). It shortens design time by over 60% and maintains brand consistency ≥90%.
Experimental Validation: HiBrand-Layout outperforms baselines and SOTA models in core metrics (comprehensive constraint satisfaction 89.2%±2.1%, brand color consistency 92.3%±1.5%) and demonstrates strong generalization in cross-brand/cross-die-cut scenarios. The open-sourced PackLay-IB dataset and evaluation scripts support reproducibility.
References
[1] Zhou, Y., Leng, H., Meng, S., Wu, H., & Zhang, Z. (2024). Structdiffusion: end-to-end intelligent shear wall structure layout generation and analysis using diffusion model. Engineering Structures, 309, 1-15.
[2] Liu, Y. (2025). Research on intelligent BS forms drag-and-drop layout generation technology. Applied Mathematics and Nonlinear Sciences, 10(1), 45-62.
[3] Shi, Y., Shang, M., & Qi, Z. (2023). Intelligent layout generation based on deep generative models. Journal of Computer-Aided Design & Computer Graphics, 35(4), 612-621.
[4] Tan, Z., Qin, S., Hu, K., Liao, W., Gao, Y., & Lu, X. (2025). Intelligent generation and optimization method for the retrofit design of RC frame structures using buckling-restrained braces. Earthquake Engineering & Structural Dynamics, 54(2), 389-408.
[5] Park, J., Oh, S. C., Lee, W., & Lee, C. (2024). Intelligent layout reconfiguration for reconfigurable assembly system: a genetic algorithm approach. 2024 Winter Simulation Conference (WSC), 3423-3433.
[6] Lu, Y., Li, K., Lin, R., Wang, Y., & Han, H. (2024). Intelligent layout method of ship pipelines based on an improved grey wolf optimization algorithm. Journal of Marine Science & Engineering, 12(11), 1890-1905.
[7] Shi, Y., Shang, M., & Qi, Z. (2023). Intelligent layout generation based on deep generative models: a comprehensive survey. Information Fusion, 100, 1-24.
[8] Zhang, J., Wu, L., Hu, J., Zhao, D., Wan, J., & Xu, X. (2022). Research and application of intelligent layout design algorithm for 3D pipeline of nuclear power plant. Mathematical Problems in Engineering, 2022, 1-14.
[9] Wang, Y. L., Wu, Z. P., Guan, G., & Jin, C. G. (2020). Research on intelligent design method of ship cabin layout. Marine Technology Society Journal, (2), 54-63.
Appendix
Appendix A: Loss Calculation Pseudo-Code
def calculate_non_overlap_loss(bboxes):
    """Calculate element non-overlap loss (sum of pairwise IoU penalties)."""
    N = len(bboxes)
    loss = 0.0
    for i in range(N):
        for j in range(i + 1, N):
            x1, y1, w1, h1 = bboxes[i]
            x2, y2, w2, h2 = bboxes[j]
            # Compute intersection coordinates
            intersection_x1 = max(x1, x2)
            intersection_y1 = max(y1, y2)
            intersection_x2 = min(x1 + w1, x2 + w2)
            intersection_y2 = min(y1 + h1, y2 + h2)
            if intersection_x1 >= intersection_x2 or intersection_y1 >= intersection_y2:
                iou = 0.0
            else:
                intersection_area = (intersection_x2 - intersection_x1) * (intersection_y2 - intersection_y1)
                union_area = w1 * h1 + w2 * h2 - intersection_area
                iou = intersection_area / union_area if union_area != 0 else 0.0
            loss += max(iou, 0.0)
    return loss
def calculate_read_order_loss(visual_weights, reading_sequence):
    """Calculate reading order loss (hinge penalty on visual-weight ordering)."""
    loss = 0.0
    epsilon = 0.1  # Margin parameter
    for i, j in reading_sequence:  # element i should be read before element j
        s_i = visual_weights[i]
        s_j = visual_weights[j]
        loss += max(0.0, s_j - s_i + epsilon)
    return loss
import math


def calculate_brand_color_loss(element_colors, brand_palette, primary_color):
    """Calculate brand color consistency loss (ΔE00 + primary proportion regularization)."""

    def delta_e00(c1, c2):
        """Full CIEDE2000 color difference in LAB space (kL = kC = kH = 1)."""
        L1, a1, b1 = c1
        L2, a2, b2 = c2
        C1 = math.hypot(a1, b1)
        C2 = math.hypot(a2, b2)
        C_avg = (C1 + C2) / 2
        G = 0.5 * (1 - math.sqrt(C_avg ** 7 / (C_avg ** 7 + 25 ** 7)))
        a1_prime, a2_prime = a1 * (1 + G), a2 * (1 + G)
        C1_prime = math.hypot(a1_prime, b1)
        C2_prime = math.hypot(a2_prime, b2)
        h1_prime = math.degrees(math.atan2(b1, a1_prime)) % 360 if (a1_prime or b1) else 0.0
        h2_prime = math.degrees(math.atan2(b2, a2_prime)) % 360 if (a2_prime or b2) else 0.0
        # Differences ΔL', ΔC', ΔH'
        delta_L_prime = L2 - L1
        delta_C_prime = C2_prime - C1_prime
        if C1_prime * C2_prime == 0:
            dh = 0.0
        elif abs(h2_prime - h1_prime) <= 180:
            dh = h2_prime - h1_prime
        elif h2_prime - h1_prime > 180:
            dh = h2_prime - h1_prime - 360
        else:
            dh = h2_prime - h1_prime + 360
        delta_H_prime = 2 * math.sqrt(C1_prime * C2_prime) * math.sin(math.radians(dh) / 2)
        # Averages and weighting functions
        L_avg = (L1 + L2) / 2
        C_prime_avg = (C1_prime + C2_prime) / 2
        if C1_prime * C2_prime == 0:
            h_avg = h1_prime + h2_prime
        elif abs(h1_prime - h2_prime) <= 180:
            h_avg = (h1_prime + h2_prime) / 2
        elif h1_prime + h2_prime < 360:
            h_avg = (h1_prime + h2_prime + 360) / 2
        else:
            h_avg = (h1_prime + h2_prime - 360) / 2
        T = (1 - 0.17 * math.cos(math.radians(h_avg - 30))
             + 0.24 * math.cos(math.radians(2 * h_avg))
             + 0.32 * math.cos(math.radians(3 * h_avg + 6))
             - 0.20 * math.cos(math.radians(4 * h_avg - 63)))
        S_L = 1 + 0.015 * (L_avg - 50) ** 2 / math.sqrt(20 + (L_avg - 50) ** 2)
        S_C = 1 + 0.045 * C_prime_avg
        S_H = 1 + 0.015 * C_prime_avg * T
        delta_theta = 30 * math.exp(-(((h_avg - 275) / 25) ** 2))
        R_C = 2 * math.sqrt(C_prime_avg ** 7 / (C_prime_avg ** 7 + 25 ** 7))
        R_T = -R_C * math.sin(math.radians(2 * delta_theta))
        return math.sqrt((delta_L_prime / S_L) ** 2
                         + (delta_C_prime / S_C) ** 2
                         + (delta_H_prime / S_H) ** 2
                         + R_T * (delta_C_prime / S_C) * (delta_H_prime / S_H))

    loss = 0.0
    primary_count = 0
    N = len(element_colors)
    for c in element_colors:
        # Minimum ΔE00 with brand palette
        min_delta = min(delta_e00(c, p) for p in brand_palette)
        loss += min_delta
        # Count primary color usage (ΔE00 ≤ 2.0)
        if delta_e00(c, primary_color) <= 2.0:
            primary_count += 1
    # Primary color proportion regularization (target 50%)
    r_primary = primary_count / N if N != 0 else 0.0
    lambda_reg = 2.0  # Regularization weight
    loss += lambda_reg * abs(r_primary - 0.5)
    return loss
Appendix B: Attention Map & Constraint Penalty Chart
Figure 7 Transformer Encoder Cross-Modal Attention Map
Figure 8 Constraint Penalty Contribution Distribution
Appendix C: Open-Source Content List
| Content Type | Description | File Format |
| Annotation Scheme | Detailed rules, parameter definitions, example diagrams | PDF |
| Sample Data | 100 desensitized samples (3 product categories, 5 die-cut types) | JSON + PNG |
| Evaluation Script | Metrics calculation (including Cliff’s delta, Bonferroni correction) | Python (.py) |
| Interactive Log Template | Sample log + data dictionary | CSV + MD |
| Failure Cases | 8 cases with screenshots, analysis, optimization plans | MD + PNG |
| Model Checkpoint | Pre-trained HiBrand-Layout weights (RTX 4090 compatible) | PyTorch (.pth) |