How to Create Custom Checks and Rules in ASIATOOLS Crawler

When you need to extract data from complex websites that standard crawlers cannot handle, custom checks and rules become essential. ASIATOOLS provides a crawler framework that lets you define your own validation logic, data extraction patterns, and processing workflows. In this guide, I will walk you through building custom checks and rules from scratch, from basic configuration to the advanced scenarios you will encounter in production.

Understanding the Architecture Behind Custom Rules in ASIATOOLS Crawler

Before diving into the implementation details, you need to grasp how the ASIATOOLS crawler processes requests and applies validation logic. The system operates on a three-layer architecture that determines how your custom rules integrate with the crawling pipeline.

The first layer handles request initialization and URL queuing. When you submit a starting URL, the crawler adds it to an internal queue with metadata including depth level, priority score, and custom tags that you can assign. The second layer performs the actual HTTP fetching and response parsing. Here is where your custom extraction rules come into play, determining which elements get captured and how they get processed. The third layer applies validation checks before data gets stored or forwarded to your output destination.

Custom rules in ASIATOOLS are executed in a specific sequence: Pre-fetch checks → Request modification → Content extraction → Post-extraction validation → Data storage. Understanding this flow helps you place your custom logic at the optimal point in the pipeline.

This architecture matters because it tells you where your custom code will execute. If you need to modify a URL before it gets requested, you hook into the request modification stage. If you need to validate that extracted data meets certain criteria, you use the post-extraction validation stage. Logic placed in the wrong stage either runs before the data it needs exists or runs more often than necessary, which shows up as unexpected behavior or slow crawls.
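
The ordering described above can be sketched in a few lines of plain Python. This is only an illustration of the sequence, not the ASIATOOLS implementation: the stage names come from the list above, while run_pipeline, the hooks dictionary, and the two example hooks are hypothetical names invented for this sketch.

```python
from typing import Any, Callable, Dict, List

# Stage names follow the sequence described in the text; the runner is hypothetical.
STAGES = [
    "pre_fetch_check",
    "request_modification",
    "content_extraction",
    "post_extraction_validation",
    "data_storage",
]

Hook = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_pipeline(item: Dict[str, Any], hooks: Dict[str, List[Hook]]) -> Dict[str, Any]:
    """Apply the hooks registered for each stage, in pipeline order."""
    for stage in STAGES:
        for hook in hooks.get(stage, []):
            item = hook(item)
        # Record the stage so the execution order can be inspected afterwards.
        item.setdefault("trace", []).append(stage)
    return item


def strip_query_string(item: Dict[str, Any]) -> Dict[str, Any]:
    # Request modification: rewrite the URL before it would be fetched.
    item["url"] = item["url"].split("?")[0]
    return item


def require_title(item: Dict[str, Any]) -> Dict[str, Any]:
    # Post-extraction validation: flag items that have no title.
    item["valid"] = bool(item.get("title"))
    return item


result = run_pipeline(
    {"url": "https://example.com/product/1?utm_source=x", "title": "Headphones"},
    {
        "request_modification": [strip_query_string],
        "post_extraction_validation": [require_title],
    },
)
```

Because the URL rewrite is registered under request_modification, it runs before anything is extracted, while the title check only runs once extraction would have produced data.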

Setting Up Your Development Environment for Custom Rule Development

You will need a properly configured development environment before writing your first custom rule. The setup process involves three main components that must work together seamlessly.

First, ensure you have Python 3.8 or newer installed. ASIATOOLS crawler core modules rely on language features that are all available by version 3.8, including dataclasses (added in 3.7) and the typing improvements that arrived in 3.8. Run python --version in your terminal to verify your current installation. If you see an older version, download a current Python installer from python.org and complete the installation process.
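
If you prefer to enforce the version requirement from inside a script, for example at the top of run_crawler.py, a minimal sketch looks like this. The helper name meets_minimum and the MINIMUM constant are illustrative and not part of the ASIATOOLS SDK.

```python
import sys
from typing import Sequence, Tuple

# Minimum interpreter version stated in this guide.
MINIMUM: Tuple[int, int] = (3, 8)


def meets_minimum(version: Sequence[int], minimum: Tuple[int, int] = MINIMUM) -> bool:
    """Compare only the major and minor components of a version."""
    return tuple(version[:2]) >= minimum


# Stop early with a readable message instead of failing later on an import.
if not meets_minimum(sys.version_info):
    found = sys.version.split()[0]
    raise SystemExit(f"Python {MINIMUM[0]}.{MINIMUM[1]}+ is required, found {found}")
```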

Second, install the ASIATOOLS SDK using pip. The command pip install asiatools-crawler fetches the latest stable release along with all dependencies. For development work, also install the testing extras with pip install "asiatools-crawler[dev]" (the quotes stop shells such as zsh from interpreting the square brackets). This installs pytest for unit testing, black for code formatting, and mypy for type checking.

Third, configure your IDE to work with the ASIATOOLS project structure. If you use VS Code, create a settings.json file in your .vscode folder that sets the Python path and enables type checking. PyCharm users should mark the src directory as Sources Root and set the type checking inspection level to strict.

Here is a typical project structure you should follow:

  • project_root/
    • config/
      • rules.yaml
      • crawler.yaml
    • src/
      • custom_checks/
        • __init__.py
        • validators.py
        • extractors.py
        • processors.py
      • run_crawler.py
    • tests/
      • test_validators.py
      • test_extractors.py
    • requirements.txt

Creating Your First Custom Validation Check

Validation checks ensure that the data your crawler extracts meets quality standards before storage. Without proper validation, you risk collecting malformed data that creates problems downstream in your data pipeline.

Open your validators.py file and create a class that inherits from the base validator. The base class provides the interface that ASIATOOLS expects, including methods for initialization, validation execution, and error reporting.

Here is a practical example of a validator that checks whether extracted product prices fall within an expected range:

from asiatools.crawler.validators import BaseValidator
from typing import Dict, Any, List

class PriceRangeValidator(BaseValidator):
    def __init__(self, min_price: float = 0.0, max_price: float = 99999.99):
        self.min_price = min_price
        self.max_price = max_price
        # URLs whose data failed validation, for inspection after the crawl
        self.failed_urls: List[str] = []

    def validate(self, extracted_data: Dict[str, Any], context: Dict[str, Any]) -> bool:
        url = context.get('url', 'unknown')

        if 'price' not in extracted_data:
            self.failed_urls.append(url)
            self.log_error(f"Missing price field for {url}")
            return False

        try:
            # Strip the currency symbol and thousands separators before parsing
            price_value = float(str(extracted_data['price']).replace('$', '').replace(',', ''))
        except ValueError:
            self.log_error(f"Invalid price format: {extracted_data['price']} for {url}")
            self.failed_urls.append(url)
            return False

        if price_value < self.min_price:
            self.log_error(f"Price ${price_value} below minimum ${self.min_price} for {url}")
            self.failed_urls.append(url)
            return False

        if price_value > self.max_price:
            self.log_error(f"Price ${price_value} above maximum ${self.max_price} for {url}")
            self.failed_urls.append(url)
            return False

        return True

This validator demonstrates several important patterns. First, it uses type hints throughout, which helps catch bugs early and makes your code more maintainable. Second, it tracks failed URLs in a list that you can inspect after the crawl completes. Third, it provides detailed error messages that help you diagnose problems quickly.
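
The parsing and range logic is easier to unit test when it lives in plain functions that do not depend on the crawler. The sketch below shows one way to factor it out. The names parse_price and in_range are hypothetical helpers, not part of the ASIATOOLS SDK, and the parsing assumes US-style prices with a dollar sign and comma thousands separators.

```python
from typing import Optional


def parse_price(raw: object) -> Optional[float]:
    """Return the numeric price, or None when the value cannot be parsed."""
    # Assumes "$" as the currency symbol and "," as the thousands separator.
    text = str(raw).strip().replace("$", "").replace(",", "")
    try:
        return float(text)
    except ValueError:
        return None


def in_range(price: Optional[float], low: float, high: float) -> bool:
    """True only for a parsed price inside the inclusive [low, high] range."""
    return price is not None and low <= price <= high


# Example inputs a crawler might see on a product page.
parsed_ok = parse_price("$1,299.00")
parsed_bad = parse_price("N/A")
too_cheap = in_range(parse_price("$0.50"), 1.99, 9999.99)
```

A price written in European style, such as 1.299,00, would be misread by this approach, so sites using that format need a locale-aware parser instead.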

To use this validator in your crawler configuration, add it to your rules.yaml file like this:

post_extraction_validators:
  - class: custom_checks.validators.PriceRangeValidator
    params:
      min_price: 1.99
      max_price: 9999.99
    run_on_all_fields: false
    target_fields:
      - price

Building Advanced Extraction Rules with Pattern Matching

Sometimes the data you need is embedded in complex HTML structures that simple CSS selectors cannot handle. ASIATOOLS crawler supports XPath expressions and regex-based extraction, which give you fine-grained control over data capture.

Consider a scenario where you need to extract product information from a website that embeds its structured data as JSON-LD inside a script tag rather than in visible markup. The HTML structure might look something like this:

<div class="product-container" data-product-id="SKU12345">
    <script type="application/ld+json">
    {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": "Wireless Bluetooth Headphones",
        "sku": "SKU12345",
        "offers": {
            "price": "79.99",
            "priceCurrency": "USD"
        },
        "aggregateRating": {
            "ratingValue": "4.5",
            "reviewCount": "1284"
        }
    }
    </script>
    <div class="description">
        Premium noise-canceling headphones with 30-hour battery life.
    </div>
</div>

Extracting this data requires a multi-step approach. First, you target the JSON-LD script element. Second, you parse the JSON content. Third, you map the JSON fields to your output schema. Here is how you implement this in your extractors.py file:

from asiatools.crawler.extractors import BaseExtractor
from typing import Dict, Any, Optional
import json
import re

class JsonLdProductExtractor(BaseExtractor):
    def __init__(self, target_schema: Dict[str, str]):
        self.target_schema = target_schema
        self.json_pattern = re.compile(r'<script type="application/ld\+json">(.*?)</script>', re.DOTALL)
    
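    def extract(self, html_content: str, metadata: Dict[str, Any]) -> Dict[str, Any]:
        result: Dict[str, Any] = {}

        json_matches = self.json_pattern.findall(html_content)

        if not json_matches:
            # setdefault avoids a KeyError when the caller has not created the list yet
            metadata.setdefault('extraction_warnings', []).append('No JSON-LD structured data found')
            return result

        for json_string in json_matches:
            try:
                data = json.loads(json_string)
            except json.JSONDecodeError:
                metadata.setdefault('extraction_errors', []).append('Invalid JSON in script tag')
                continue

            # A JSON-LD block may hold a single object or a list of objects
            candidates = data if isinstance(data, list) else [data]
            for item in candidates:
                if not isinstance(item, dict) or item.get('@type') != 'Product':
                    continue
                for output_field, json_path in self.target_schema.items():
                    value = self._navigate_json_path(item, json_path)
                    # Compare against None so falsy values such as 0 are kept
                    if value is not None:
                        result[output_field] = value

        return result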
    
    def _navigate_json_path(self, data: Dict, path: str) -> Optional[Any]:
        keys = path.split('.')
        current = data
        
        for key in keys:
            if isinstance(current, dict) and key in current:
                current = current[key]
            else:
                return None
        
        return current

The configuration for this extractor in rules.yaml would be:

extractors:
  - name: json_ld_product
    class: custom_checks.extractors.JsonLdProductExtractor
    params:
      target_schema:
        product_name: name
        price: offers.price
        currency: offers.priceCurrency
        rating: aggregateRating.ratingValue
        review_count: aggregateRating.reviewCount
        sku: sku
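
To see how the dotted paths in target_schema resolve, here is a standalone sketch of the same three steps (find the script element, parse the JSON, map the fields) using only the standard library. It does not import the ASIATOOLS SDK. The function names extract_product and navigate are illustrative, and the sample document is a shortened version of the HTML above with an "@type" of "Product" assumed.

```python
import json
import re
from typing import Any, Dict, Optional

# Tolerates extra attributes on the script tag; still a regex, not a full HTML parser.
SCRIPT_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def navigate(data: Any, path: str) -> Optional[Any]:
    """Follow a dotted path such as 'offers.price' through nested dicts."""
    current = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def extract_product(html: str, schema: Dict[str, str]) -> Dict[str, Any]:
    """Map JSON-LD Product fields to output fields according to schema."""
    result: Dict[str, Any] = {}
    for block in SCRIPT_RE.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed blocks
        if not isinstance(data, dict) or data.get("@type") != "Product":
            continue
        for field, path in schema.items():
            value = navigate(data, path)
            if value is not None:
                result[field] = value
    return result


SAMPLE = """
<script type="application/ld+json">
{"@type": "Product", "name": "Wireless Bluetooth Headphones",
 "offers": {"price": "79.99", "priceCurrency": "USD"}}
</script>
"""

fields = extract_product(
    SAMPLE,
    {"product_name": "name", "price": "offers.price", "rating": "aggregateRating.ratingValue"},
)
```

The rating path does not exist in this shortened sample, so the field is simply left out of the result rather than stored as an empty value.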

Implementing Conditional Rule Execution Based on URL Patterns

Different pages on the same website often require different extraction strategies. A product listing page needs to capture multiple items, while a product detail page needs to capture single-item details. ASIATOOLS lets you define conditional rules that activate based on URL patterns.

The conditional execution system uses URL matching with support for wildcards and regular expressions. You define conditions in your crawler configuration, and the system evaluates them in priority order until it finds a match.

conditional_rules:
  - name: category_listing_rule
    url_pattern: "*/category/*"
    priority: 10
    extractors:
      - class: custom_checks.extractors.ListingPageExtractor
    validators:
      - class: custom_checks.validators.MinItemsValidator
        params:
          min_items: 5
    post_processors:
      - class: custom_checks.processors.CategoryFormatter
  
  - name: product_detail_rule
    url_pattern: "*/product/*"
    priority: 20
    extractors:
      - class: custom_checks.extractors.ProductDetailExtractor
    validators:
      - class: custom_checks.validators.PriceRangeValidator
        params:
          min_price: 0.01
          max_price: 50000.00
      - class: custom_checks.validators.RequiredFieldsValidator
        params:
          required:
            - name
            - price
            - description
  
  - name: search_results_rule
    url_pattern: "*/search*"
    priority: 15
    extractors:
      - class: custom_checks.extractors.SearchResultsExtractor
    validators:
      - class: custom_checks.validators.SearchResultsValidator

Each rule specifies a priority value. Higher priority rules execute first. This matters when multiple patterns could match the same URL. For example, a URL like /category/product/123 matches both the category pattern and the product pattern. Giving product_detail_rule a priority of 20 and category_listing_rule a priority of 10 ensures the product rule takes precedence.
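
The precedence behavior can be approximated with the standard library. This sketch assumes the wildcard patterns behave like shell-style globs, which is what fnmatch implements. How ASIATOOLS matches patterns internally is not documented here, and select_rule is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from typing import List, Optional, Tuple

# (name, url_pattern, priority) copied from the configuration above.
RULES: List[Tuple[str, str, int]] = [
    ("category_listing_rule", "*/category/*", 10),
    ("product_detail_rule", "*/product/*", 20),
    ("search_results_rule", "*/search*", 15),
]


def select_rule(url: str, rules: List[Tuple[str, str, int]] = RULES) -> Optional[str]:
    """Return the name of the highest-priority rule whose pattern matches url."""
    matches = [
        (priority, name)
        for name, pattern, priority in rules
        if fnmatchcase(url, pattern)
    ]
    # max() compares priority first, so the highest-priority match wins.
    return max(matches)[1] if matches else None


ambiguous = select_rule("https://shop.example.com/category/product/123")
listing = select_rule("https://shop.example.com/category/audio/")
unmatched = select_rule("https://shop.example.com/about")
```

The first URL matches two patterns and resolves to product_detail_rule because of its higher priority. The last URL matches nothing, so a real configuration would need a fallback rule or a decision to skip such pages.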

Handling Rate Limiting and Request Throttling Through Custom Rules

Websites implement rate limiting for various reasons, and your crawler needs to respect these limits while maximizing throughput. ASIATOOLS provides built-in rate limiting capabilities, but custom rules give you additional control over how your crawler responds to different scenarios.

Create a custom throttling rule that adjusts request frequency based on server responses:

from asiatools.crawler.middleware import BaseThrottleMiddleware
import time
from collections import deque

class AdaptiveThrottleMiddleware(BaseThrottleMiddleware):
    def __init__(self, initial_delay: float = 1.0, max_delay: float = 60.0):
        self.initial_delay = initial_delay
        self.current_delay = initial_delay
        self.max_delay = max_delay
        self.response_times = deque(maxlen=100)
        self.error_count = 0
        self.success_count = 0
    
    def on_request(self, request, spider):
        time.sleep(self.current_delay)
        return request
    
    def on_response(self, request, response):
        self.response_times.append(response.elapsed.total_seconds())

        if response.status == 429:
            # Rate-limited: count it as an error and back off
            self.error_count += 1
            self.increase_delay()
        elif response.status == 200:
            self.success_count += 1
            # A healthy response clears the error streak
            self.error_count = 0
            self.decrease_delay()

        return response

    def on_error(self, request, exception):
        self.error_count += 1
        self.increase_delay()

        if self.error_count > 10:
            self.log_critical(f"High error rate detected: {self.error_count} errors in recent requests")

    def increase_delay(self):
        # Multiplicative backoff, capped at max_delay
        self.current_delay = min(self.current_delay * 1.5, self.max_delay)
        self.log_info(f"Increased throttle delay to {self.current_delay:.2f} seconds")

    def decrease_delay(self):
        avg_response_time = sum(self.response_times) / len(self.response_times) if self.response_times else 1.0
        # Aim for twice the average response time, but never below the initial delay
        target = max(self.initial_delay, avg_response_time * 2)
        # Only ever lower the delay here; increases are handled by increase_delay
        self.current_delay = min(self.current_delay, target)

This middleware monitors your crawler performance and automatically adjusts delays. When it encounters 429 Too Many Requests responses, it backs off. When responses are fast and successful, it reduces delays to maximize throughput. The deque with maxlen=100 ensures you only track recent performance, preventing old data from skewing your calculations.
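
The backoff arithmetic is worth checking by hand. The sketch below replays only the increase path in isolation, using the same 1.5 multiplier, 1.0 second initial delay, and 60 second cap as the middleware defaults. The function name simulate_backoff is illustrative.

```python
from typing import Iterable, List


def simulate_backoff(
    statuses: Iterable[int],
    initial: float = 1.0,
    maximum: float = 60.0,
    factor: float = 1.5,
) -> List[float]:
    """Return the delay in effect after each response status."""
    delay = initial
    history: List[float] = []
    for status in statuses:
        if status == 429:
            # Same rule as increase_delay: multiply, then clamp to the cap.
            delay = min(delay * factor, maximum)
        history.append(delay)
    return history


first_three = simulate_backoff([429, 429, 429])
after_ten = simulate_backoff([429] * 10)[-1]
after_eleven = simulate_backoff([429] * 11)[-1]
```

Three consecutive 429 responses raise the delay to 1.5, 2.25, and then 3.375 seconds. With the default settings the delay is still below the cap after ten consecutive 429 responses and reaches the 60 second cap on the eleventh.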

Data Transformation Rules for Output Formatting

Raw extracted data often needs transformation before it can be useful. Maybe you need to normalize date formats, clean up text encoding issues, or convert units of measurement. ASIATOOLS post-processors handle these transformations efficiently.

Here is a comprehensive data transformation processor that handles common cleaning scenarios:

from asiatools.crawler.processors import BaseProcessor
from typing import Dict, Any, List, Optional
import re
from datetime import datetime

class DataCleaningProcessor(BaseProcessor):
    def __init__(self, transformations: List[Dict[str, Any]]):
        self.transformations = transformations

    def process(self, data: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
        # Work on a shallow copy so the caller's dictionary is not modified
        cleaned_data = data.copy()

        for transform in self.transformations:
            field = transform['field']
            if field not in cleaned_data:
                continue

            transform_type = transform['type']

            if transform_type == 'strip_html':
                cleaned_data[field] = self._strip_html_tags(cleaned_data[field])

            elif transform_type == 'normalize_whitespace':
                cleaned_data[field] = self._normalize_whitespace(cleaned_data[field])

            elif transform_type == 'extract_numbers':
                cleaned_data[field] = self._extract_numbers(cleaned_data[field])

            elif transform_type == 'parse_date':
                cleaned_data[field] = self._parse_date_string(cleaned_data[field], transform.get('format'))

            elif transform_type == 'lowercase':
                cleaned_data[field] = cleaned_data[field].lower()

            elif transform_type == 'remove_special_chars':
                cleaned_data[field] = self._remove_special_characters(cleaned_data[field], transform.get('keep', ''))

        return cleaned_data

    def _strip_html_tags(self, text: str) -> str:
        clean = re.sub(r'<[^>]+>', '', text)
        clean = re.sub(r'\s+', ' ', clean)
        return clean.strip()

    def _normalize_whitespace(self, text: str) -> str:
        return ' '.join(text.split())

    def _extract_numbers(self, text: str) -> float:
        # Match one well-formed number so inputs like "1.2.3" or "." cannot crash float()
        match = re.search(r'\d+(?:\.\d+)?', text.replace(',', ''))
        return float(match.group()) if match else 0.0

    def _parse_date_string(self, date_str: str, format_str: Optional[str] = None) -> str:
        formats = ['%Y-%m-%d', '%m/%d/%Y', '%d-%m-%Y', '%B %d, %Y'] if not format_str else [format_str]

        for fmt in formats:
            try:
                dt = datetime.strptime(date_str, fmt)
                return dt.isoformat()
            except ValueError:
                continue

        # No format matched: return the input unchanged
        return date_str

    def _remove_special_characters(self, text: str, keep: Optional[str] = '') -> str:
        # keep lists extra characters to preserve; escape them for use inside a character class
        pattern = f'[^a-zA-Z0-9\\s{re.escape(keep or "")}]'
        return re.sub(pattern, '', text)
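
The date parsing step depends on the order in which formats are tried, which is easy to overlook. The standalone sketch below mirrors the fallback loop with the three numeric formats from the processor and shows how an ambiguous date is resolved. The function name to_iso is illustrative and not part of the SDK.

```python
from datetime import datetime
from typing import Sequence

# Numeric formats from the processor, tried in this order.
DEFAULT_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y")


def to_iso(date_str: str, formats: Sequence[str] = DEFAULT_FORMATS) -> str:
    """Return an ISO 8601 string for the first format that parses date_str."""
    for fmt in formats:
        try:
            return datetime.strptime(date_str, fmt).isoformat()
        except ValueError:
            continue
    # Nothing matched: hand the original string back unchanged.
    return date_str


iso_input = to_iso("2024-03-04")
us_style = to_iso("03/04/2024")      # slashes are read as month/day/year
dashed = to_iso("04-03-2024")        # dashes are read as day-month-year
unparsed = to_iso("next Tuesday")
```

All three date inputs normalize to March 4, 2024, but only because the slash form is treated as month first and the dash form as day first. If a target site writes day/month/year with slashes, pass an explicit format through the transformation's format setting rather than relying on the defaults.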
