Introduction
Product attributes such as brand, color, and size are essential for distinguishing products online. Extracting these attributes accurately is crucial for enhancing the customer experience and improving product search and recommendation systems. This article explores a novel approach to attribute extraction that leverages product images in addition to textual descriptions. This method, known as the PAM (Product Attribute Multimodal) model, integrates visual features, Optical Character Recognition (OCR) tokens, and text to extract attributes effectively across multiple product categories. Many Amazon sellers have already latched onto Amazon Rekognition, the image and video analysis service within AWS; those who have been experimenting with it will see how this research crosses over with that work and the impact it can have on ranking.
Key Points (TL;DR)
- Multimodal Attribute Extraction: The PAM model combines text, images, and OCR data using advanced techniques to identify product attributes more effectively.
- Improved Accuracy: By analyzing both visual elements and OCR content, the model outperforms traditional text-only approaches with higher recall and F1 scores.
- Category-Specific Adaptability: Tailoring its approach for different product types, the model fine-tunes its output to better match each category’s unique characteristics.
- Cross-Validation and Additional Information: Using product images as extra validation, the model ensures more accurate and complete attribute extraction.
- Comprehensive Evaluation: Rigorous testing across a range of products confirms the model’s superior performance compared to other methods.
Unlocking Insights: Mastering Automated Product Image Analysis
This research enables computers to extract product attributes more accurately than approaches that use only text. It provides a framework for unified reasoning across product photos, descriptions, and the text that appears within images, combining transformers, category-specific knowledge, and multimodal inputs. Consider a typical product listing, such as the sunscreen example described below:
- Product Category: The top section indicates the hierarchical navigation path, showing that this product belongs to the ‘Beauty & Personal Care > Skin Care > Sunscreens & Tanning Products > Sunscreens’ category, helping customers easily locate it within a broader shopping context.
- Text: The middle section provides the product title and detailed description, which includes key features such as ‘Alba Botanica Hawaiian Sunscreen Clear Spray, SPF 50, Coconut, 6 Oz,’ offering essential information to potential buyers.
- Attributes: The right section lists specific product attributes like ‘Item Form’ (Spray), ‘Brand’ (Alba Botanica), ‘Sun Protection Index in SPF’ (50 SPF), ‘Item Volume’ (9.6 Fluid Ounces), and ‘Age Range’ (Adult), helping customers quickly identify the product’s key characteristics.
- Text: The bottom section provides additional detailed descriptions and benefits, such as ingredients and usage recommendations, enhancing the customer’s understanding and trust in the product.
Diving Deeper into Attribute Recognition
- Title: The textual features extracted from the product title provide initial product information but might not always include all critical details, such as the exact brand name.
- Image: The visual features from the product image, such as packaging design and colors, offer additional context that can help identify the product more accurately.
- OCR Tokens: Optical Character Recognition (OCR) extracts text directly from the image, such as the brand name “Lava,” which might not be explicitly mentioned in the title.
- Results: Compared to previous methods that might misidentify the brand attribute as “Bar Soap,” this approach correctly identifies the brand as “Lava” by integrating textual, visual, and OCR information.
Seamless Data Validation Through Dynamic Cross-Referencing
- Title: The text pulled from the product title suggests various forms the item could take, such as “Stick,” “Cream,” or “Powder.”
- Image: The product image clearly displays the item’s shape and design, indicating that it is a “Stick.”
- OCR Tokens: The word “Stick” extracted from the product image via Optical Character Recognition (OCR) reinforces the visual evidence, helping resolve any ambiguity among the multiple values mentioned in the title.
- Results: Past techniques that rely on text alone could mistakenly label the item as “Powder.” By cross-checking text, image, and OCR data against each other, this method correctly classifies it as “Stick.”
Don’t Be Scared by This Image – We’ve Got You!
This diagram presents an advanced framework designed to improve product attribute generation using multi-modal data, combining product text and images. The approach utilizes various AI technologies to enhance the accuracy and relevance of the generated attributes, crucial for better search and recommendation.
Key Components and Process Flow
- Input Modalities:
  - Product Text: Textual information such as product titles.
  - Product Images: Visual representations of the products.
- Token Selection:
  - BERT for Text: The BERT model processes the product title, breaking it into word tokens.
  - OCR Engine: Extracts text from the product images.
  - Faster R-CNN: Detects objects within the product images.
- Multi-Modal Transformer:
  - Combines information from text tokens, OCR tokens, and object tokens.
  - Utilizes a dynamic vocabulary, adapting based on the product category and existing attribute values.
- Iterative Attribute Generation:
  - Sequence Output: Generates a sequence of product attributes.
  - Token Scores: Scores are assigned to each token (text, OCR, vocabulary) to determine the most relevant attributes.
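To make the flow above concrete, here is a minimal sketch of the token-gathering stage built from off-the-shelf components. The title, image path, and library choices (Hugging Face's BERT tokenizer, pytesseract for OCR, torchvision's Faster R-CNN) are placeholder assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch: gather candidate tokens from the title, the image text (OCR),
# and an object detector, assuming off-the-shelf open-source components.
import torch
import torchvision
import pytesseract
from PIL import Image
from transformers import BertTokenizer
from torchvision.transforms.functional import to_tensor

title = "Pull-Ups Girls' Potty Training Pants Training Underwear Size 4, 2T-3T, 94 Ct"
image = Image.open("product.jpg").convert("RGB")   # hypothetical product image file

# 1. Text tokens: BERT's WordPiece tokenizer splits the title into sub-word tokens.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_tokens = tokenizer.tokenize(title)

# 2. OCR tokens: an off-the-shelf OCR engine reads text printed on the packaging.
ocr_tokens = pytesseract.image_to_string(image).split()

# 3. Object tokens: a pretrained Faster R-CNN proposes object regions whose
#    visual features would later be fed to the multi-modal transformer.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()
with torch.no_grad():
    detections = detector([to_tensor(image)])[0]   # dict with boxes, labels, scores

print(text_tokens[:8], ocr_tokens[:5], detections["boxes"].shape)
```

In the described architecture, the detected regions and the two token streams would then be projected into a shared representation before entering the multi-modal transformer.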
Detailed Breakdown
- Product Text Input:
  - The product title “Pull-Ups Girls’ Potty Training Pants Training Underwear Size 4, 2T-3T, 94 Ct” is input into the BERT model.
  - BERT breaks this down into individual tokens (words) like “Pull-Ups,” “Girls’,” “Potty,” “Training,” etc.
- Product Image Input:
  - The product image of Pull-Ups is processed by the OCR engine to identify any text present on the packaging, such as “PullUps,” “2T-3T,” “94 Ct.”
  - Faster R-CNN analyzes the image to detect specific objects and locations, enhancing the understanding of the visual context.
- Multi-Modal Transformer Processing:
  - The tokens from BERT, OCR, and Faster R-CNN are fed into a multi-modal transformer.
  - The transformer uses this combined data to understand the product comprehensively.
  - A dynamic vocabulary, conditioned on the product category, ensures relevant tokens are selected for attribute generation.
- Token Selection and Scoring:
  - Token Selection: The system selects tokens based on their relevance and scores.
  - Token Scores: Each token (text, OCR, vocabulary) is assigned a score indicating its importance.
  - The process iteratively generates the final product attribute sequence.
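The iterative selection loop can be pictured with a toy example. Below is a heavily simplified, hypothetical sketch: candidate tokens pooled from the title, OCR output, and dynamic vocabulary are scored against a decoder state that updates after each pick. The candidate list, embeddings, and GRU cell are invented for illustration and randomly initialized rather than trained.

```python
# Toy illustration of iterative token scoring and selection (not the paper's model).
import torch
import torch.nn as nn

# Candidate tokens pooled from the title, OCR output, and the dynamic vocabulary.
candidates = ["pull-ups", "2t-3t", "94", "training", "pants", "<end>"]
dim = 32

embed = nn.Embedding(len(candidates), dim)   # stand-in candidate embeddings
cell = nn.GRUCell(dim, dim)                  # stand-in decoder state update

def decode(max_steps=6):
    state = torch.zeros(1, dim)              # decoder starts from an empty state
    output = []
    for _ in range(max_steps):
        # Score every candidate against the current decoder state (dot product).
        scores = embed.weight @ state.squeeze(0)
        idx = int(scores.argmax())
        if candidates[idx] == "<end>":
            break
        output.append(candidates[idx])
        # Feed the chosen token back in so the next step sees what was generated.
        state = cell(embed(torch.tensor([idx])), state)
    return output

print(decode())
```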
Example Output Sequence
The output sequence might include attributes like “Product Category: Training Pants,” “Size: 2T-3T,” “Count: 94,” accurately describing the product based on combined textual and visual data.
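If the generated sequence interleaves attribute names and values, a small amount of post-processing can turn it into structured key-value pairs. The delimiter format below is purely an assumption made for illustration.

```python
# Hypothetical post-processing of a generated attribute sequence into a dict.
generated = ["Product Category", ":", "Training Pants", "|",
             "Size", ":", "2T-3T", "|", "Count", ":", "94"]

attributes = {}
for field in " ".join(generated).split("|"):
    key, _, value = field.partition(":")
    attributes[key.strip()] = value.strip()

print(attributes)  # {'Product Category': 'Training Pants', 'Size': '2T-3T', 'Count': '94'}
```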
Key Findings from Testing
- Comparison of Model Architectures: The PAM framework consistently outperforms baseline methods, showing significant improvements in recall and F1 scores for attributes such as “Item Form” and “Brand”.
- Impact of Different Components: Ablation studies reveal that each component (text, image, OCR) contributes uniquely to the model’s performance, with text being the most critical, followed by OCR and then images.
- Usefulness of Image, Text, and OCR Inputs: The combined use of text, image, and OCR inputs significantly enhances the model’s ability to accurately extract product attributes, demonstrating the necessity of a multimodal approach.
- Comparison with Baseline Methods: PAM surpasses various baseline models, including BiLSTM-CRF, OpenTag, and M4C, particularly in the extraction of attributes from product images.
- Generalization Across Categories: The PAM model is capable of handling multiple product categories effectively, showcasing its generalization ability and robustness in attribute extraction tasks.
Impact and Implications
The PAM framework’s capacity to utilize multimodal data (text, images, OCR) markedly enhances the precision of product attribute extraction. This progress is especially advantageous for Amazon, as it helps ensure that product listings are comprehensive and precise, improving the customer experience and potentially leading to increased sales.
Challenges in Multi-Category Attribute Extraction
A key challenge is extracting attributes across diverse categories. For example, vocabularies for size in shirts (“small”, “large”) versus diapers (“newborn”, numbers) are totally different. Methods focused on single categories don’t transfer well.
The Key Challenges
The researchers identified three big challenges that make this problem really hard:
- Connecting the different types of input – photos, text, and text in images.
- Special vocabulary in product descriptions.
- Totally different attribute values across product types.
Together, these three challenges (multi-modal inputs, unique vocabulary, and diverse attribute values across categories) are what make automatic product understanding so difficult, especially for approaches built around a single product type and text alone. The researchers aimed to address these issues with their model design and training techniques.
The Proposed Model
The researchers designed a new AI model to tackle these challenges. Their model uses something called a transformer. What’s a transformer? It’s a type of machine learning model that’s good at understanding relationships between different types of input data. In this case, the transformer encodes, or represents, three kinds of information:
- The product description text
- Objects detected in the product photo
- Text extracted from the photo using OCR (optical character recognition)
The transformer figures out connections between the visual inputs from the photo and the textual inputs from the description and OCR. Next, the decoder makes predictions: it generates the attribute values step by step, focusing on different parts of the encoded inputs at each step and outputting a probability distribution over possible tokens (words). To avoid generating nonsense, the decoder can only choose from a restricted set of useful tokens rather than any word in the language. By leveraging a transformer and constraining its outputs, the model can combine multimodal inputs and generate coherent attribute values tailored to the product category and description.
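One common way to implement that kind of output constraint, shown here as a hedged sketch rather than the paper's exact mechanism, is to mask the decoder's logits so that any token outside the allowed set (title words, OCR words, and the category vocabulary) receives zero probability. The vocabulary and logits below are toy placeholders.

```python
# Sketch: constrain decoder outputs by masking disallowed tokens before softmax.
import torch

full_vocab = ["powder", "stick", "cream", "banana", "laptop", "soap"]
allowed = {"powder", "stick", "cream", "soap"}        # tokens seen in the inputs/vocabulary

logits = torch.randn(len(full_vocab))                 # stand-in decoder logits
mask = torch.tensor([w in allowed for w in full_vocab])
constrained = logits.masked_fill(~mask, float("-inf"))

probs = torch.softmax(constrained, dim=-1)            # disallowed tokens get probability 0
print({w: round(float(p), 3) for w, p in zip(full_vocab, probs)})
```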
Using Category-Specific Word Lists
To deal with unique words and phrases for each product type, the model looks up category-specific vocabularies when making predictions. For example, it uses one word list with common size terms for shirts, and a totally different list with size words for diapers. Having these tailored word lists for each category gives the model useful information about which attribute values are likely.
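As a toy illustration of the idea (the word lists below are made up, not taken from the paper), the lookup can be as simple as a dictionary keyed by product category:

```python
# Illustrative category-specific vocabularies for the "size" attribute.
SIZE_VOCAB = {
    "shirt":  ["x-small", "small", "medium", "large", "x-large"],
    "diaper": ["newborn", "1", "2", "3", "4", "5", "6"],
}

def candidate_sizes(category: str) -> list[str]:
    # Fall back to an empty list if the category has no tailored word list.
    return SIZE_VOCAB.get(category, [])

print(candidate_sizes("shirt"))   # shirt size terms
print(candidate_sizes("diaper"))  # diaper size terms
```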
Techniques to Handle Different Categories
The researchers used two key techniques to help the model extract attributes across varied product categories:
- Pick vocabulary based on predicted category – The model first guesses the general product category, then uses the word list for that category to predict attributes.
- Multi-task learning – The model is trained to jointly predict the product category and attribute value. Doing both tasks together improves the learning.
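A minimal sketch of the multi-task setup, assuming a shared encoder representation feeding two classification heads (all dimensions, heads, and targets below are invented for illustration), looks like this:

```python
# Sketch: joint training on category prediction and attribute prediction.
import torch
import torch.nn as nn

hidden = torch.randn(4, 64)                      # stand-in shared encoder outputs (batch of 4)
category_head = nn.Linear(64, 14)                # e.g. 14 product categories
attribute_head = nn.Linear(64, 100)              # e.g. 100 candidate attribute tokens

category_targets = torch.randint(0, 14, (4,))    # dummy labels
attribute_targets = torch.randint(0, 100, (4,))

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(category_head(hidden), category_targets) \
     + loss_fn(attribute_head(hidden), attribute_targets)
loss.backward()                                  # gradients flow into both heads
print(float(loss))
```

Training the two heads against a shared representation is what lets the category signal inform the attribute predictions, which is the intuition behind the joint learning described above.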
Results
When tested on 14 different Amazon product categories, the model achieved much higher accuracy than approaches using only text. The image, text, and OCR inputs all provided useful signals. Using category-specific vocabularies worked better than generic word lists. This demonstrates the benefits of using multimodal inputs and tailoring predictions to each product category.
For Amazon sellers, the key takeaway is the importance of providing comprehensive product images and detailed descriptions. Ensuring that key attributes are visually present and easily recognizable through text can enhance product discoverability and accuracy in search results.
Tips for Amazon Sellers
- High-Quality Images: Ensure product images are clear and include all relevant textual information directly on the product packaging.
- Detailed Descriptions: Provide thorough product descriptions that cover all key attributes, as text data significantly enhances attribute extraction accuracy.
- Use of OCR-Friendly Fonts: When possible, use clear, OCR-friendly fonts on product packaging to facilitate accurate attribute extraction (see the quick check after this list).
- Consistent Attribute Labels: Maintain consistency in how attributes are labeled and presented across different products to improve model accuracy.
- Leverage Rich Media: Utilize a combination of images, videos, and detailed text to provide a comprehensive view of the product, making it easier for models to extract accurate attributes.
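As a practical aside on the OCR tip above, a seller (or their tooling) could run a quick machine-readability check on a product image with an off-the-shelf OCR engine: if the key attribute strings come back, automated extraction systems are more likely to pick them up too. The file name and expected strings below are placeholders.

```python
# Quick check: can an off-the-shelf OCR engine read the text on your packshot?
from PIL import Image
import pytesseract

image = Image.open("main_product_image.jpg")            # placeholder file name
recovered = pytesseract.image_to_string(image)

for expected in ("SPF 50", "6 Oz", "Alba Botanica"):    # attributes you expect on the pack
    status = "found" if expected.lower() in recovered.lower() else "MISSING"
    print(f"{expected}: {status}")
```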
By integrating these practices, Amazon sellers can improve their product listings’ accuracy and visibility, ultimately leading to a better customer experience and increased sales.
Citation
This summary is based on the research paper titled “Understanding Product Images in Cross-Product-Category Attribute Extraction,” presented at the KDD 2021 conference.