File upload vulnerabilities represent one of the most critical attack vectors in web applications. Beyond simple extension checking, robust security testing requires deep content validation, malware detection, and comprehensive threat modeling.
This guide provides enterprise-grade security testing strategies for file validation, covering everything from content-based detection to compliance requirements.
Understanding File Security Threats
Common Attack Vectors
File Type Spoofing
Attackers disguise malicious files by manipulating extensions and MIME types:
// Vulnerable approach - only checking the extension
const isImageFile = (filename) => {
  return /\.(jpg|jpeg|png|gif)$/i.test(filename);
};
// Security issue: malicious.php.jpg passes this check
Polyglot Files
Files that are valid in multiple formats at once, often hiding malicious content (a detection sketch follows this list):
- Image files with embedded JavaScript
- PDF files containing executable payloads
- Archive files with path traversal vulnerabilities
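A cheap heuristic against the appended-payload variant of polyglots is to look for bytes after the format's end-of-file marker. A minimal sketch in Python; the marker table covers only JPEG and PNG, and the 16-byte slack threshold is an assumption, not a standard:

# Sketch: flag data appended after an image's end-of-file marker, a common
# polyglot trick. Marker coverage and the slack threshold are illustrative.
TRAILERS = {
    b'\xFF\xD8\xFF': b'\xFF\xD9',                # JPEG: EOI marker
    b'\x89PNG\r\n\x1a\n': b'IEND\xaeB`\x82',     # PNG: IEND chunk + CRC
}

def has_trailing_payload(path):
    with open(path, 'rb') as f:
        data = f.read()
    for header, trailer in TRAILERS.items():
        if data.startswith(header):
            end = data.rfind(trailer)
            if end == -1:
                return True  # Truncated or malformed: treat as suspicious
            # Anything substantial after the trailer is a red flag
            return len(data) - (end + len(trailer)) > 16
    return False  # Unknown format: defer to other validators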
Content Injection Attacks
- SVG files with embedded XSS scripts
- Office documents with malicious macros
- Media files with metadata exploits (see the metadata-scan sketch after this list)
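For the metadata vector, one option is to sweep EXIF text fields for markup or script fragments. A hedged sketch using Pillow's getexif; the fragment list is illustrative, not exhaustive:

# Sketch: scan EXIF string fields for script/markup fragments with Pillow.
from PIL import Image

SUSPICIOUS_FRAGMENTS = ('<script', '<?php', 'javascript:', 'eval(')

def exif_looks_suspicious(image_path):
    with Image.open(image_path) as img:
        exif = img.getexif()
    for tag_id, value in exif.items():
        text = value.decode('utf-8', 'ignore') if isinstance(value, bytes) else str(value)
        if any(frag in text.lower() for frag in SUSPICIOUS_FRAGMENTS):
            return True
    return False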
Security Testing Categories
Content-Based Validation Testing
- Magic number verification
- File structure analysis
- Metadata examination
- Embedded content detection
Behavioral Security Testing
- Malware scanning integration
- Sandbox execution testing
- Network communication monitoring
- System resource abuse detection
Content-Based File Validation
Magic Number Detection
Implement robust file type detection using file signatures:
import magic  # python-magic; wraps libmagic for content-based type detection

class SecureFileValidator:
    def __init__(self):
        self.allowed_types = {
            'image/jpeg': [b'\xFF\xD8\xFF'],
            'image/png': [b'\x89\x50\x4E\x47\x0D\x0A\x1A\x0A'],
            'image/gif': [b'GIF87a', b'GIF89a'],
            'application/pdf': [b'%PDF-'],
            'text/plain': []  # No fixed signature
        }

    def validate_file_content(self, file_path, expected_type):
        """Validate that the file content matches the expected type."""
        try:
            # Get the actual MIME type using python-magic
            actual_type = magic.from_file(file_path, mime=True)
            # Check whether the actual type matches the expected one
            if actual_type != expected_type:
                return {
                    'valid': False,
                    'reason': f'Content mismatch: expected {expected_type}, got {actual_type}'
                }
            # Verify the magic numbers
            with open(file_path, 'rb') as f:
                file_header = f.read(32)
            expected_signatures = self.allowed_types.get(expected_type, [])
            if expected_signatures:
                signature_match = any(
                    file_header.startswith(sig) for sig in expected_signatures
                )
                if not signature_match:
                    return {
                        'valid': False,
                        'reason': 'File signature does not match expected type'
                    }
            return {'valid': True, 'actual_type': actual_type}
        except Exception as e:
            return {'valid': False, 'reason': f'Validation error: {e}'}

# Usage in security tests
def test_file_type_spoofing():
    validator = SecureFileValidator()
    # Test: PHP file disguised as an image
    malicious_file = create_test_file_with_content(
        'malicious.jpg',
        '<?php system($_GET["cmd"]); ?>'
    )
    result = validator.validate_file_content(malicious_file, 'image/jpeg')
    assert not result['valid'], "Should detect content mismatch"
    assert 'text/x-php' in result['reason'], "Should identify PHP content"
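The test above leans on a create_test_file_with_content helper that is never defined here; a minimal sketch that writes the fixture into a scratch directory:

import os
import tempfile

def create_test_file_with_content(filename, content):
    """Write `content` to `filename` inside a fresh scratch directory; return the path."""
    directory = tempfile.mkdtemp(prefix='sec-tests-')
    path = os.path.join(directory, filename)
    with open(path, 'w') as f:
        f.write(content)
    return path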
Advanced Content Analysis
SVG Security Testing
import xml.etree.ElementTree as ET

class SVGSecurityValidator:
    def __init__(self):
        self.dangerous_elements = [
            'script', 'object', 'embed', 'iframe', 'link',
            'meta', 'style', 'foreignObject'
        ]
        self.dangerous_attributes = [
            'onload', 'onclick', 'onmouseover', 'onerror',
            'href', 'xlink:href'
        ]

    def validate_svg_content(self, svg_content):
        """Validate SVG content for security threats."""
        try:
            root = ET.fromstring(svg_content)
            for elem in root.iter():
                # Strip the namespace, e.g. '{http://...}script' -> 'script'
                tag_name = elem.tag.split('}')[-1] if '}' in elem.tag else elem.tag
                if tag_name in self.dangerous_elements:
                    return {
                        'valid': False,
                        'threat': f'Dangerous element found: {tag_name}'
                    }
                for attr, value in elem.attrib.items():
                    if any(d in attr.lower() for d in self.dangerous_attributes):
                        return {
                            'valid': False,
                            'threat': f'Dangerous attribute found: {attr}'
                        }
                    # Check for javascript: URLs in attribute values
                    if 'javascript:' in value.lower():
                        return {
                            'valid': False,
                            'threat': f'JavaScript found in {attr} attribute'
                        }
            return {'valid': True}
        except ET.ParseError as e:
            # ElementTree raises ParseError (not expat's ExpatError) here
            return {'valid': False, 'threat': f'Malformed XML: {e}'}

# Test case for an SVG with an embedded script
def test_svg_xss_detection():
    malicious_svg = '''
    <svg xmlns="http://www.w3.org/2000/svg">
        <script>alert('XSS')</script>
        <circle cx="50" cy="50" r="40"/>
    </svg>
    '''
    validator = SVGSecurityValidator()
    result = validator.validate_svg_content(malicious_svg)
    assert not result['valid']
    assert 'script' in result['threat']
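Note that the stdlib parser used above offers no protection against entity-expansion attacks such as the billion-laughs bomb. The third-party defusedxml package (an extra dependency, not part of the original setup) is a drop-in replacement for the parsing step; a hardening sketch:

# Hardening sketch: defusedxml rejects entity-expansion and external-entity
# tricks that xml.etree happily processes. Requires `pip install defusedxml`.
import xml.etree.ElementTree as ET
import defusedxml.ElementTree as SafeET
from defusedxml import DefusedXmlException

def parse_svg_safely(svg_content):
    try:
        return SafeET.fromstring(svg_content)  # Same signature as ET.fromstring
    except (ET.ParseError, DefusedXmlException):
        return None  # Malformed or actively hostile XML: reject upstream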
PDF Security Analysis
import PyPDF2

class PDFSecurityValidator:
    def __init__(self):
        self.suspicious_keywords = [
            '/JavaScript', '/JS', '/URI', '/GoTo', '/Launch',
            '/EmbeddedFile', '/FileAttachment', '/Annot'
        ]

    def analyze_pdf_security(self, pdf_path):
        """Analyze a PDF for potential security threats."""
        threats = []
        try:
            # Action keywords live in the document catalog, annotations, and
            # object dictionaries rather than in page content streams, so scan
            # the raw bytes. This is a coarse heuristic and will produce false
            # positives (e.g. any annotated PDF contains /Annot).
            with open(pdf_path, 'rb') as f:
                raw_bytes = f.read()
            for keyword in self.suspicious_keywords:
                if keyword.encode() in raw_bytes:
                    threats.append(f'Suspicious keyword found: {keyword}')
            pdf_reader = PyPDF2.PdfReader(pdf_path)
            # Check the document info for anomalies
            doc_info = pdf_reader.metadata
            if doc_info:
                for key, value in doc_info.items():
                    if any(keyword in str(value) for keyword in self.suspicious_keywords):
                        threats.append(f'Suspicious content in metadata: {key}')
            return {
                'safe': len(threats) == 0,
                'threats': threats,
                'page_count': len(pdf_reader.pages)
            }
        except Exception as e:
            return {
                'safe': False,
                'threats': [f'Analysis failed: {e}'],
                'page_count': 0
            }

# Security test for PDF files
def test_pdf_malware_detection():
    # Create a test PDF carrying JavaScript
    malicious_pdf = create_pdf_with_javascript()
    validator = PDFSecurityValidator()
    result = validator.analyze_pdf_security(malicious_pdf)
    assert not result['safe']
    assert any('JavaScript' in threat for threat in result['threats'])
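The create_pdf_with_javascript fixture is likewise left undefined; a sketch using PyPDF2's PdfWriter, whose add_js call attaches a document-level JavaScript action (available in recent PyPDF2 releases):

import tempfile
import PyPDF2

def create_pdf_with_javascript():
    """Build a one-page PDF carrying a document-level /JavaScript action."""
    writer = PyPDF2.PdfWriter()
    writer.add_blank_page(width=200, height=200)
    writer.add_js("app.alert('security-test');")  # Document-level JavaScript
    with tempfile.NamedTemporaryFile(suffix='.pdf', delete=False) as f:
        writer.write(f)
        return f.name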
Malware Scanning Integration
ClamAV Integration
import os
import subprocess
import tempfile

class MalwareScanner:
    def __init__(self):
        self.clamav_available = self._check_clamav_availability()

    def _check_clamav_availability(self):
        try:
            subprocess.run(['clamscan', '--version'],
                           capture_output=True, check=True)
            return True
        except (subprocess.CalledProcessError, FileNotFoundError):
            return False

    def scan_file(self, file_path):
        """Scan a file for malware using ClamAV."""
        if not self.clamav_available:
            return {'error': 'ClamAV not available'}
        try:
            result = subprocess.run([
                'clamscan',
                '--no-summary',
                '--infected',
                file_path
            ], capture_output=True, text=True, timeout=30)
            # clamscan exit codes: 0 = clean, 1 = infected, 2+ = error
            if result.returncode == 0:
                return {'clean': True, 'output': result.stdout}
            elif result.returncode == 1:
                return {
                    'clean': False,
                    'infected': True,
                    'threat': result.stdout.strip()
                }
            else:
                return {
                    'clean': False,
                    'error': result.stderr.strip()
                }
        except subprocess.TimeoutExpired:
            return {'clean': False, 'error': 'Scan timeout'}
        except Exception as e:
            return {'clean': False, 'error': str(e)}

    def scan_memory_content(self, content):
        """Scan in-memory content via a temporary file."""
        # Close the file before scanning so clamscan sees the full content
        with tempfile.NamedTemporaryFile(delete=False) as temp_file:
            temp_file.write(content)
            temp_path = temp_file.name
        try:
            return self.scan_file(temp_path)
        finally:
            os.unlink(temp_path)

# Integration test
def test_malware_scanning():
    scanner = MalwareScanner()
    # Test with a known signature (the EICAR antivirus test string)
    eicar_signature = (
        'X5O!P%@AP[4\\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*'
    )
    result = scanner.scan_memory_content(eicar_signature.encode())
    if scanner.clamav_available:
        assert not result['clean']
        assert result.get('infected', False)
    else:
        assert 'error' in result
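Spawning clamscan per file reloads the entire signature database on every run. If the clamd daemon is available, the third-party clamd package can stream bytes over a socket instead, which is far cheaper at scale. A hedged sketch; the socket class and return shape follow the python-clamd documentation and should be verified against your installed version:

# Sketch: stream bytes to a running clamd daemon instead of spawning clamscan.
# Requires `pip install clamd` and a clamd service on its default Unix socket.
import io
import clamd

def scan_bytes_with_clamd(content):
    cd = clamd.ClamdUnixSocket()  # Or ClamdNetworkSocket(host, port)
    status, signature = cd.instream(io.BytesIO(content))['stream']
    return {'clean': status == 'OK', 'threat': signature}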
Sandbox Testing
import io
import os
import tarfile
import time

import docker

class SandboxTester:
    def __init__(self):
        self.client = docker.from_env()
        self.sandbox_image = 'ubuntu:20.04'

    def test_file_in_sandbox(self, file_path, timeout=30):
        """Observe a file's impact inside an isolated sandbox container."""
        try:
            # Create the sandbox container
            container = self.client.containers.run(
                self.sandbox_image,
                command='sleep 60',
                detach=True,
                network_disabled=True,
                mem_limit='512m',
                cpu_quota=50000,  # 50% CPU limit
                security_opt=['no-new-privileges:true']
            )
            # put_archive expects a tar stream, so wrap the file in one
            tar_stream = io.BytesIO()
            with tarfile.open(fileobj=tar_stream, mode='w') as tar:
                tar.add(file_path, arcname=os.path.basename(file_path))
            tar_stream.seek(0)
            container.put_archive('/tmp/', tar_stream.read())
            # Monitor container behavior
            start_time = time.time()
            stats = []
            while time.time() - start_time < timeout:
                try:
                    stat = container.stats(stream=False)
                    stats.append({
                        'timestamp': time.time(),
                        'cpu_usage': stat['cpu_stats']['cpu_usage']['total_usage'],
                        'memory_usage': stat['memory_stats']['usage']
                    })
                    time.sleep(1)
                except Exception:
                    break
            # Analyze behavior
            analysis = self._analyze_sandbox_behavior(stats)
            # Cleanup
            container.stop()
            container.remove()
            return analysis
        except Exception as e:
            return {'error': f'Sandbox test failed: {e}'}

    def _analyze_sandbox_behavior(self, stats):
        """Analyze container behavior for suspicious activity."""
        if not stats:
            return {'suspicious': False, 'reason': 'No stats collected'}
        # cpu_usage is a cumulative counter, so compare per-interval deltas
        cpu_deltas = [
            b['cpu_usage'] - a['cpu_usage']
            for a, b in zip(stats, stats[1:])
        ]
        if cpu_deltas:
            avg_cpu = sum(cpu_deltas) / len(cpu_deltas)
            max_cpu = max(cpu_deltas)
            if avg_cpu and max_cpu > avg_cpu * 5:  # CPU spike detection
                return {
                    'suspicious': True,
                    'reason': 'Unusual CPU activity detected'
                }
        # Check for memory growth
        memory_values = [s['memory_usage'] for s in stats]
        if len(memory_values) > 5:
            memory_growth = memory_values[-1] - memory_values[0]
            if memory_growth > 100 * 1024 * 1024:  # 100 MB growth
                return {
                    'suspicious': True,
                    'reason': 'Excessive memory usage detected'
                }
        return {'suspicious': False, 'stats': stats}
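A usage sketch; it requires a reachable Docker daemon, and the sample path is a placeholder:

# Usage sketch: needs a local Docker daemon and a sample file to inspect.
tester = SandboxTester()
report = tester.test_file_in_sandbox('suspicious_sample.bin', timeout=15)
if report.get('suspicious'):
    print('Quarantine recommended:', report['reason'])
elif 'error' in report:
    print('Sandbox unavailable:', report['error'])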
Path Traversal Protection
Filename Validation
import os
import re

class PathSecurityValidator:
    def __init__(self):
        self.dangerous_patterns = [
            r'\.\./',        # Parent directory traversal
            r'\.\.\\',       # Windows parent directory traversal
            r'/etc/',        # Unix system files
            r'\\windows\\',  # Windows system files
            r'%00',          # Null byte injection
            r'%2e%2e%2f',    # URL-encoded traversal
            r'%c0%af',       # UTF-8 overlong encoding
        ]
        self.reserved_names = {
            'windows': [
                'CON', 'PRN', 'AUX', 'NUL',
                'COM1', 'COM2', 'COM3', 'COM4', 'COM5',
                'COM6', 'COM7', 'COM8', 'COM9',
                'LPT1', 'LPT2', 'LPT3', 'LPT4', 'LPT5',
                'LPT6', 'LPT7', 'LPT8', 'LPT9'
            ]
        }

    def validate_filename(self, filename):
        """Validate a filename for security issues."""
        if not filename or filename.strip() == '':
            return {'valid': False, 'reason': 'Empty filename'}
        # Check for dangerous patterns
        for pattern in self.dangerous_patterns:
            if re.search(pattern, filename, re.IGNORECASE):
                return {
                    'valid': False,
                    'reason': f'Dangerous pattern detected: {pattern}'
                }
        # Check for reserved names
        name_without_ext = os.path.splitext(filename)[0].upper()
        if name_without_ext in self.reserved_names['windows']:
            return {
                'valid': False,
                'reason': f'Reserved system name: {name_without_ext}'
            }
        # Check filename length
        if len(filename) > 255:
            return {
                'valid': False,
                'reason': 'Filename too long (max 255 characters)'
            }
        # Check for control characters
        if any(ord(c) < 32 for c in filename if c != '\t'):
            return {
                'valid': False,
                'reason': 'Control characters not allowed'
            }
        return {'valid': True}

    def sanitize_filename(self, filename):
        """Sanitize a filename by removing dangerous elements."""
        if not filename:
            return 'unnamed_file'
        # Keep only the basename; directory components are never trusted
        sanitized = os.path.basename(filename.replace('\\', '/'))
        # Remove dangerous patterns
        for pattern in self.dangerous_patterns:
            sanitized = re.sub(pattern, '', sanitized, flags=re.IGNORECASE)
        # Remove control characters
        sanitized = ''.join(c for c in sanitized if ord(c) >= 32 or c == '\t')
        # Limit length
        if len(sanitized) > 255:
            name, ext = os.path.splitext(sanitized)
            max_name_length = 255 - len(ext)
            sanitized = name[:max_name_length] + ext
        # Ensure the result is not empty
        if not sanitized.strip():
            sanitized = 'sanitized_file'
        return sanitized

# Security tests for path traversal
def test_path_traversal_protection():
    validator = PathSecurityValidator()
    dangerous_filenames = [
        '../../../etc/passwd',
        '..\\..\\windows\\system32\\config',
        'test%00.exe',
        'CON.txt',
        '../../uploads/../admin.php',
        '%2e%2e%2f%2e%2e%2f%65%74%63%2f%70%61%73%73%77%64'
    ]
    for filename in dangerous_filenames:
        result = validator.validate_filename(filename)
        assert not result['valid'], f"Should reject dangerous filename: {filename}"
    # Test sanitization: only the untrusted path's basename survives
    sanitized = validator.sanitize_filename('../../../etc/passwd')
    assert '../' not in sanitized
    assert 'etc/passwd' not in sanitized
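Validation and sanitization guard the filename itself; as defense in depth, also confirm that the final resolved path cannot escape the upload root. A minimal sketch, assuming all uploads land under a single directory:

from pathlib import Path

def resolve_upload_path(upload_root, filename):
    """Resolve the destination path and refuse anything outside upload_root."""
    root = Path(upload_root).resolve()
    candidate = (root / filename).resolve()
    if root not in candidate.parents:
        raise ValueError(f'Refusing path outside upload root: {filename}')
    return candidate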
Compliance Requirements Testing
GDPR Compliance for File Handling
from datetime import datetime, timedelta

class GDPRFileCompliance:
    def __init__(self):
        self.data_retention_periods = {
            'profile_images': timedelta(days=1095),  # 3 years
            'documents': timedelta(days=2555),       # 7 years
            'temporary_files': timedelta(days=30)    # 30 days
        }

    def check_file_compliance(self, file_metadata):
        """Check whether file handling complies with GDPR."""
        compliance_issues = []
        # Check data retention
        if 'upload_date' in file_metadata and 'category' in file_metadata:
            upload_date = datetime.fromisoformat(file_metadata['upload_date'])
            category = file_metadata['category']
            if category in self.data_retention_periods:
                retention_period = self.data_retention_periods[category]
                if datetime.now() - upload_date > retention_period:
                    compliance_issues.append(
                        f'File exceeds retention period for {category}'
                    )
        # Check for personal data in metadata
        personal_data_fields = ['user_email', 'full_name', 'phone_number']
        for field in personal_data_fields:
            if field in file_metadata:
                compliance_issues.append(
                    f'Personal data found in metadata: {field}'
                )
        # Check anonymization
        if 'anonymized' not in file_metadata or not file_metadata['anonymized']:
            if any(field in file_metadata for field in personal_data_fields):
                compliance_issues.append('File contains non-anonymized personal data')
        return {
            'compliant': len(compliance_issues) == 0,
            'issues': compliance_issues,
            'recommendations': self._get_compliance_recommendations(compliance_issues)
        }

    def _get_compliance_recommendations(self, issues):
        """Provide recommendations for compliance issues."""
        recommendations = []
        for issue in issues:
            if 'retention period' in issue:
                recommendations.append('Implement automated file deletion')
            elif 'personal data' in issue:
                recommendations.append('Remove or anonymize personal data')
            elif 'anonymized' in issue:
                recommendations.append('Implement data anonymization process')
        return recommendations

# GDPR compliance tests
def test_gdpr_compliance():
    compliance_checker = GDPRFileCompliance()
    # Test a file that exceeds its retention period
    old_file_metadata = {
        'upload_date': '2020-01-01T00:00:00',
        'category': 'temporary_files',
        'user_id': 'hashed_user_id',
        'anonymized': True
    }
    result = compliance_checker.check_file_compliance(old_file_metadata)
    assert not result['compliant']
    assert any('retention period' in issue for issue in result['issues'])
    # Test a compliant file
    compliant_metadata = {
        'upload_date': datetime.now().isoformat(),
        'category': 'profile_images',
        'user_id': 'hashed_user_id',
        'anonymized': True
    }
    result = compliance_checker.check_file_compliance(compliant_metadata)
    assert result['compliant']
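The "implement automated file deletion" recommendation can be wired to a periodic sweep; a sketch, assuming a mapping from stored file paths to metadata records like those above:

import os

def sweep_expired_files(file_records, checker):
    """Delete files whose compliance check reports a retention violation.

    `file_records` maps file paths to metadata dicts; `checker` is a
    GDPRFileCompliance instance.
    """
    deleted = []
    for path, metadata in file_records.items():
        result = checker.check_file_compliance(metadata)
        if any('retention period' in issue for issue in result['issues']):
            os.remove(path)  # In production, prefer quarantine plus an audit log
            deleted.append(path)
    return deleted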
HIPAA Compliance for Healthcare Files
import hashlib
import math
from collections import Counter

class HIPAAFileCompliance:
    def __init__(self):
        self.required_encryption = True
        self.access_log_required = True
        self.audit_trail_required = True

    def validate_hipaa_compliance(self, file_path, metadata):
        """Validate HIPAA compliance for healthcare files."""
        compliance_checks = {
            'encryption': self._check_encryption(file_path),
            'access_control': self._check_access_control(metadata),
            'audit_trail': self._check_audit_trail(metadata),
            'data_integrity': self._check_data_integrity(file_path, metadata)
        }
        is_compliant = all(check['compliant'] for check in compliance_checks.values())
        return {
            'hipaa_compliant': is_compliant,
            'checks': compliance_checks,
            'recommendations': self._get_hipaa_recommendations(compliance_checks)
        }

    def _check_encryption(self, file_path):
        """Check whether the file appears to be encrypted."""
        try:
            with open(file_path, 'rb') as f:
                header = f.read(100)
            # Heuristic: encrypted data has a near-uniform byte distribution
            entropy = self._calculate_entropy(header)
            return {
                'compliant': entropy > 7.5,  # High entropy suggests encryption
                'entropy': entropy,
                'detail': 'File encryption check based on entropy analysis'
            }
        except Exception as e:
            return {
                'compliant': False,
                'error': str(e),
                'detail': 'Failed to analyze file encryption'
            }

    def _calculate_entropy(self, data):
        """Calculate the Shannon entropy of data in bits per byte."""
        if not data:
            return 0
        counts = Counter(data)
        entropy = 0
        for count in counts.values():
            p = count / len(data)
            entropy -= p * math.log2(p)
        return entropy

    def _check_access_control(self, metadata):
        """Check the access control implementation."""
        required_fields = ['access_level', 'authorized_users', 'department']
        missing_fields = [
            field for field in required_fields
            if field not in metadata
        ]
        return {
            'compliant': len(missing_fields) == 0,
            'missing_fields': missing_fields,
            'detail': 'Access control metadata validation'
        }

    def _check_audit_trail(self, metadata):
        """Check audit trail completeness."""
        required_audit_fields = [
            'created_by', 'created_at', 'last_accessed',
            'access_log', 'modification_log'
        ]
        missing_audit_fields = [
            field for field in required_audit_fields
            if field not in metadata
        ]
        return {
            'compliant': len(missing_audit_fields) == 0,
            'missing_fields': missing_audit_fields,
            'detail': 'Audit trail completeness check'
        }

    def _check_data_integrity(self, file_path, metadata):
        """Verify data integrity against the stored checksum."""
        if 'checksum' not in metadata:
            return {
                'compliant': False,
                'detail': 'No integrity checksum found'
            }
        try:
            # Calculate the current file checksum
            with open(file_path, 'rb') as f:
                file_content = f.read()
            current_checksum = hashlib.sha256(file_content).hexdigest()
            stored_checksum = metadata['checksum']
            return {
                'compliant': current_checksum == stored_checksum,
                'current_checksum': current_checksum,
                'stored_checksum': stored_checksum,
                'detail': 'File integrity verification'
            }
        except Exception as e:
            return {
                'compliant': False,
                'error': str(e),
                'detail': 'Failed to verify file integrity'
            }

    def _get_hipaa_recommendations(self, checks):
        """Generate HIPAA compliance recommendations."""
        recommendations = []
        for check_name, check_result in checks.items():
            if not check_result['compliant']:
                if check_name == 'encryption':
                    recommendations.append('Implement end-to-end file encryption')
                elif check_name == 'access_control':
                    recommendations.append('Add comprehensive access control metadata')
                elif check_name == 'audit_trail':
                    recommendations.append('Implement complete audit logging')
                elif check_name == 'data_integrity':
                    recommendations.append('Add file integrity verification')
        return recommendations
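A pytest sketch exercising the checks above; the metadata shape mirrors the required fields, and the sample content is arbitrary plaintext:

def test_hipaa_flags_missing_checksum(tmp_path):
    # pytest's tmp_path fixture supplies a scratch directory
    sample = tmp_path / 'record.bin'
    sample.write_bytes(b'plaintext health record')
    metadata = {
        'access_level': 'restricted',
        'authorized_users': ['dr_smith'],
        'department': 'cardiology',
        # No audit fields and no checksum: several checks should fail
    }
    result = HIPAAFileCompliance().validate_hipaa_compliance(str(sample), metadata)
    assert not result['hipaa_compliant']
    assert not result['checks']['data_integrity']['compliant']
    assert not result['checks']['audit_trail']['compliant']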
Automated Security Testing Pipeline
CI/CD Security Integration
# Security testing pipeline (GitHub Actions)
name: File Security Tests

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  security-tests:
    runs-on: ubuntu-latest
    services:
      clamav:
        image: clamav/clamav:latest
        options: --health-cmd "clamscan --version" --health-interval 30s
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v3
        with:
          python-version: "3.9"
      - name: Install security testing dependencies
        run: |
          pip install python-magic PyPDF2 python-docx
          sudo apt-get update && sudo apt-get install -y clamav
      - name: Update malware signatures
        run: sudo freshclam
      - name: Generate test files
        run: python scripts/generate_security_test_files.py
      - name: Run content validation tests
        run: python -m pytest tests/security/test_content_validation.py -v
      - name: Run malware scanning tests
        run: python -m pytest tests/security/test_malware_scanning.py -v
      - name: Run path traversal tests
        run: python -m pytest tests/security/test_path_traversal.py -v
      - name: Run compliance tests
        run: python -m pytest tests/security/test_compliance.py -v
      - name: Generate security report
        run: python scripts/generate_security_report.py
      - name: Upload security results
        uses: actions/upload-artifact@v3
        with:
          name: security-test-results
          path: security-report.json
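The pipeline calls scripts/generate_security_test_files.py without showing it; a minimal sketch covering two fixtures used in this guide (the EICAR string is the standard antivirus test pattern; output paths are assumptions):

# scripts/generate_security_test_files.py -- sketch; paths are assumptions
import os

EICAR = (
    'X5O!P%@AP[4\\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*'
)

def main(out_dir='tests/fixtures'):
    os.makedirs(out_dir, exist_ok=True)
    # Standard antivirus test file: every scanner should flag it
    with open(os.path.join(out_dir, 'eicar.txt'), 'w') as f:
        f.write(EICAR)
    # PHP payload disguised with an image extension (type-spoofing fixture)
    with open(os.path.join(out_dir, 'spoofed.jpg'), 'w') as f:
        f.write('<?php system($_GET["cmd"]); ?>')

if __name__ == '__main__':
    main()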
Conclusion
Robust file security testing requires a multi-layered approach combining content validation, malware detection, and compliance verification. Key strategies include:
- Content-Based Validation: Never trust file extensions; always validate content
- Malware Scanning: Integrate real-time scanning with multiple engines
- Path Security: Implement comprehensive filename validation
- Compliance Testing: Automate regulatory requirement verification
- Behavioral Analysis: Use sandbox testing for suspicious files
Regular security testing ensures your file handling systems remain resilient against evolving threats while maintaining compliance with industry regulations.
Next Steps:
- Implement automated security scanning in your CI/CD pipeline
- Set up real-time threat monitoring for file uploads
- Create incident response procedures for security violations
- Regularly update malware signatures and threat databases
