Duc Phat Nguyen

A student looking for opportunities in Data Science & AI


Research Automation

Streamlining clinic data collection with Gemini-powered web scraping

View on GitHub

Built With

Python · Pandas · BeautifulSoup4 · Requests · Google Gemini API · Openpyxl · Pydantic

Project Overview

The Psychology Researcher Automation project is a specialized tool designed to streamline the process of researching psychology clinics in Australia. By combining web scraping techniques with AI-powered data extraction, the system automatically gathers and structures information about psychology clinics, including contact details, practitioner information, and pricing data.

Key Capabilities

This automation pipeline extracts critical information from clinic websites:

  • Email addresses for clinic contacts
  • Doctor/Team page URLs
  • Lists of psychologists with their specialization types (Clinical or General)
  • Pricing information for initial and follow-up consultations
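The stack above lists Pydantic for data modelling. As a self-contained sketch of what the extracted record might look like, here is a plain-dataclass approximation (field names are hypothetical; the project's actual models may differ):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Psychologist:
    name: str
    specialization: str  # "Clinical" or "General"

@dataclass
class ClinicRecord:
    """One clinic's extracted data; optional fields stay None when
    the scraper or LLM cannot find the information."""
    email: Optional[str] = None
    team_page_url: Optional[str] = None
    psychologists: List[Psychologist] = field(default_factory=list)
    initial_consult_price: Optional[float] = None
    followup_consult_price: Optional[float] = None
```

In the real pipeline, Pydantic models additionally validate types when the LLM's JSON output is loaded.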

The Problem

Researching psychology clinics manually is an extremely time-consuming process. For each clinic, a researcher needs to:

  • Visit the website and locate a working contact email
  • Find the doctor/team page and list every psychologist, noting whether each is clinical or general
  • Track down pricing for initial and follow-up consultations

This process becomes especially challenging when dealing with hundreds of clinics, each with different website structures and information organization. Our automation system reduces what could take days of manual effort into a streamlined pipeline that delivers consistent results in a fraction of the time.

The Pipeline Architecture

Our solution is divided into five distinct stages that form a complete data processing pipeline:

Stage 1: Excel Parsing and Initial Validation

The first stage processes the input Excel file containing clinic information. It:

  • Identifies rows highlighted in green (indicating they should be processed)
  • Validates address formats according to Australian standards
  • Checks for duplicate phone numbers and missing critical data
  • Verifies website URL formats

This initial validation ensures we begin with clean, standardized data before web scraping.
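Two of the checks above can be sketched with the standard library alone. Openpyxl exposes cell fills as ARGB hex strings, so "is this row green?" reduces to a colour-code lookup (the specific green codes below are illustrative, not the project's actual palette):

```python
from urllib.parse import urlparse

# Common ARGB codes for "green" fills in Excel (hypothetical values;
# the real pipeline may match a different palette).
GREEN_ARGB = {"FF00FF00", "FF92D050", "FF00B050"}

def is_green_row(fill_argb: str) -> bool:
    """Check whether a cell's fill colour marks its row for processing."""
    return fill_argb.upper() in GREEN_ARGB

def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs with a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

With openpyxl, `cell.fill.start_color.rgb` would supply the ARGB string fed into `is_green_row`.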

Stage 2: Website Scraping and Content Extraction

The second stage handles the web scraping process for each clinic website:

  • Fetches and parses the main website content
  • Identifies and follows links to relevant pages (team/staff pages, services, pricing)
  • Extracts text content while preserving structural information (headings, paragraphs, lists)
  • Implements rate limiting and exponential backoff for respectful scraping
  • Saves extracted content in structured text files for further processing

The scraper includes sophisticated error handling and retries to handle various website configurations.
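The project uses BeautifulSoup4 for parsing; the core idea of the link-following step — collecting every anchor's URL together with its visible text — can be shown with the stdlib parser alone:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from a page -- a stdlib
    stand-in for the BeautifulSoup-based extraction in the project."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None
```

Feeding a fetched page into `LinkCollector` yields the candidate links that later stages score for relevance.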

Stage 3: LLM-based Information Extraction

The third stage leverages Google's Gemini API to extract specific information from the scraped text:

  • Uses specialized prompts to identify psychologist names and their types (Clinical vs. General)
  • Extracts contact emails with priority for primary clinic contacts
  • Finds URLs for doctor/team pages
  • Identifies pricing information for initial and follow-up consultations
  • Structures the extracted data in a consistent JSON format

The AI-powered extraction significantly improves accuracy compared to regex-only approaches, especially for diverse website layouts.
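One practical detail in this stage is that LLM replies are not always bare JSON — Gemini may wrap its answer in Markdown code fences. A small, tolerant parser (illustrative helper; the project's actual parsing code may differ) handles both cases:

```python
import json
import re

def parse_llm_json(response_text: str) -> dict:
    """Extract the JSON object from a model reply, tolerating the
    ```json ... ``` fences that models sometimes wrap around output."""
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model response")
    return json.loads(match.group(0))
```

The greedy `\{.*\}` spans from the first opening brace to the last closing brace, so nested objects inside the reply survive intact.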

Stage 4: Validation and Structural Formatting

The fourth stage validates and formats the extracted information:

  • Cleanses and standardizes extracted data (emails, URLs, pricing, etc.)
  • Validates information against expected formats
  • Identifies and flags discrepancies between existing and newly extracted data
  • Prepares data for Excel output, including creating new rows for additional psychologists

This stage ensures the final output meets quality standards and is ready for analysis.
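The cleansing step above can be sketched for two of the fields — emails and prices (regexes and function names are illustrative, not the project's exact rules):

```python
import re
from typing import Optional

EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")

def clean_email(raw: str) -> Optional[str]:
    """Strip mailto: prefixes and whitespace, lowercase, and
    return None if the result is not a plausible email."""
    email = raw.strip().removeprefix("mailto:").lower()
    return email if EMAIL_RE.match(email) else None

def clean_price(raw: str) -> Optional[float]:
    """Normalise price strings like '$230.50' or 'AUD 1,050' to a float."""
    match = re.search(r"(\d+(?:\.\d{1,2})?)", raw.replace(",", ""))
    return float(match.group(1)) if match else None
```

Fields that fail these checks come back as `None`, which downstream formatting flags as a discrepancy rather than silently writing bad data.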

Stage 5: Excel Output Generation

The final stage generates a professionally formatted Excel file containing all the extracted information:

  • Creates a structured Excel file with all extracted data
  • Applies color coding to highlight different data categories
  • Formats phone numbers and other fields for better readability
  • Generates an invoice file for submission
  • Preserves original green row highlighting while adding new extracted information

The result is a comprehensive, ready-to-use Excel document that saves hours of manual research.
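The phone-number formatting mentioned above can be illustrated with a small stdlib helper for Australian numbers (the formatting conventions chosen here are an assumption, not the project's exact output):

```python
import re

def format_au_phone(raw: str) -> str:
    """Format Australian numbers for readability: mobiles as
    'XXXX XXX XXX', landlines as '(0X) XXXX XXXX'. Anything that is
    not a recognisable 10-digit number is returned unchanged."""
    digits = re.sub(r"\D", "", raw)
    if digits.startswith("61") and len(digits) == 11:
        digits = "0" + digits[2:]  # +61 international form -> national
    if len(digits) != 10 or not digits.startswith("0"):
        return raw
    if digits.startswith("04"):  # mobile
        return f"{digits[:4]} {digits[4:7]} {digits[7:]}"
    return f"({digits[:2]}) {digits[2:6]} {digits[6:]}"
```

In the real pipeline this kind of helper runs just before openpyxl writes each row, alongside the colour-coding of cells.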

Implementation Highlights

🔍 Intelligent Web Scraping

Our scraper goes beyond basic content extraction by intelligently identifying the most valuable pages on a clinic's website. It looks for patterns in link text and URL structures to find pages likely to contain practitioner information, such as "Our Team," "Meet Our Psychologists," or "Staff." The system also preserves the structure of the content, maintaining headings, paragraphs, and lists for more accurate AI processing.
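The page-identification idea described above can be sketched as keyword scoring over each link's URL and text (the keyword list below is hypothetical; the project may use a richer heuristic):

```python
from typing import List, Optional, Tuple

TEAM_KEYWORDS = ("team", "staff", "psychologist", "our-people",
                 "about-us", "practitioner")

def score_link(href: str, text: str) -> int:
    """Count keyword hits across the URL and visible link text."""
    haystack = f"{href} {text}".lower()
    return sum(keyword in haystack for keyword in TEAM_KEYWORDS)

def best_team_link(links: List[Tuple[str, str]]) -> Optional[Tuple[str, str]]:
    """Pick the highest-scoring candidate, or None if nothing matches."""
    best = max(links, key=lambda link: score_link(*link), default=None)
    return best if best and score_link(*best) > 0 else None
```

A link titled "Meet Our Psychologists" scores on both its text and (typically) its URL, so it beats generic pages like "Contact" without any site-specific configuration.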

🤖 AI-Powered Information Extraction

Using Google's Gemini API, we've developed specialized prompts that enable accurate extraction of complex information. The AI can distinguish between clinical and general psychologists based on context clues, identify the most relevant contact information, and extract pricing details even when they're presented in various formats across different websites.

🛠️ Robust Error Handling and Rate Limiting

The system implements comprehensive error handling with exponential backoff for failed requests. It respects website servers by implementing rate limiting between requests and uses batch processing to manage resources efficiently. This ensures the tool can run reliably even when processing hundreds of websites.
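The exponential-backoff pattern is simple enough to show in full (parameter names and defaults here are illustrative; the `sleep` argument is injectable so the delay schedule can be tested without waiting):

```python
import time

def fetch_with_backoff(fetch, retries=4, base_delay=1.0, sleep=time.sleep):
    """Retry a fetch callable, waiting base_delay * 2**attempt between
    failures; re-raise the last error once retries are exhausted."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Combined with a fixed delay between successive clinics, this keeps the request rate polite while still recovering from transient failures.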

Challenges and Solutions

🌐 Website Diversity

Clinic websites vary widely in structure and technology stack.

Solution: Our scraper adapts using multiple content detection techniques.

⚖️ Rate Limiting and Respectful Scraping

High request rates risk IP blocks and put unnecessary load on clinic servers.

Solution: Configurable delays & exponential backoff prevent server overload.

📄 Extracting Unstructured Information

Key details appear in inconsistent formats across sites.

Solution: AI-driven text parsing (rather than regex alone) improves extraction accuracy.

Visit the GitHub Repository

For the full code and documentation, visit the GitHub repository.