Hi! I'm Ethan

Hard learning. Hardworking. Hard playing. Ready for what's next.

Research Automation

Streamlining clinic data collection with Gemini-powered web scraping

Built With

Python, Pandas, BeautifulSoup4, Requests, Google Gemini API, Openpyxl, Pydantic

Project Overview

The Psychology Researcher Automation project is a specialized tool that streamlines research on psychology clinics in Australia. It combines web scraping with AI-powered data extraction to automatically gather and structure information about each clinic, including contact details, practitioner information, and pricing data.

Key Capabilities

This automation pipeline extracts critical information from clinic websites:

  • Email addresses for clinic contacts
  • Doctor/Team page URLs
  • Lists of psychologists with their specialization types (Clinical or General)
  • Pricing information for initial and follow-up consultations

The Problem

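Researching psychology clinics manually is extremely time-consuming. For each clinic, a researcher needs to open the website, find a contact email, locate the doctor/team page, list each psychologist and their type (Clinical or General), and record pricing for initial and follow-up consultations.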

This process becomes especially challenging when dealing with hundreds of clinics, each with a different website structure and a different way of organizing information. Our automation system reduces what could take days of manual effort to a streamlined pipeline that delivers consistent results in a fraction of the time.

The Pipeline Architecture

Our solution is divided into five distinct stages that form a complete data processing pipeline:

Stage 1: Excel Parsing and Initial Validation

The first stage processes the input Excel file containing clinic information. It:

  • Identifies rows highlighted in green (indicating they should be processed)
  • Validates address formats according to Australian standards
  • Checks for duplicate phone numbers and missing critical data
  • Verifies website URL formats

This initial validation ensures we begin with clean, standardized data before web scraping.
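The duplicate-phone and URL-format checks can be sketched with the standard library alone. This is a minimal illustration, not the project's code: the row layout (dicts with "phone" and "website" keys) and the function names are assumptions, and reading the green highlighting from the workbook is left out.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse


def normalize_phone(raw: str) -> str:
    """Keep digits only and rewrite a leading 61 country code as 0."""
    digits = re.sub(r"\D", "", raw)
    if digits.startswith("61") and len(digits) == 11:
        digits = "0" + digits[2:]
    return digits


def is_valid_url(url: str) -> bool:
    """Accept only http(s) URLs whose host contains a dot."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and "." in parsed.netloc


def validate_rows(rows: list[dict]) -> dict[int, list[str]]:
    """Return {row index: [problems]} for every row that fails a check."""
    problems: dict[int, list[str]] = defaultdict(list)
    seen_phones: dict[str, int] = {}  # normalized phone -> first row index
    for index, row in enumerate(rows):
        phone = normalize_phone(row.get("phone", ""))
        if not phone:
            problems[index].append("missing phone")
        elif phone in seen_phones:
            problems[index].append(
                f"duplicate phone (first seen in row {seen_phones[phone]})"
            )
        else:
            seen_phones[phone] = index
        if not is_valid_url(row.get("website", "")):
            problems[index].append("invalid website URL")
    return dict(problems)
```

Normalizing before comparing means "(02) 9876 5432" and "+61 2 9876 5432" are treated as the same number, which a plain string comparison would miss.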

Stage 2: Website Scraping and Content Extraction

The second stage handles the web scraping process for each clinic website:

  • Fetches and parses the main website content
  • Identifies and follows links to relevant pages (team/staff pages, services, pricing)
  • Extracts text content while preserving structural information (headings, paragraphs, lists)
  • Implements rate limiting and exponential backoff for respectful scraping
  • Saves extracted content in structured text files for further processing

The scraper handles errors and retries failed requests to cope with varied website configurations.
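Retrying with exponential backoff can be sketched as below. The function name, the default attempt count, and the delays are assumptions chosen for illustration; the fetch function is passed in so the sketch does not depend on any particular HTTP library.

```python
import time
from typing import Callable, Optional


def fetch_with_backoff(
    fetch: Callable[[str], str],
    url: str,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(url), doubling the wait after each failed attempt.

    Returns the fetched text, or None once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:  # a real scraper would catch narrower errors
            if attempt == max_attempts - 1:
                return None
            # Wait base_delay, then 2x, then 4x, ... before retrying.
            sleep(base_delay * (2 ** attempt))
    return None
```

With Requests, `fetch` could be a small wrapper around `requests.get` that calls `raise_for_status()` and returns the response text, so HTTP errors also trigger a retry. Injecting `sleep` keeps the function easy to test without real delays.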

Stage 3: LLM-based Information Extraction

The third stage leverages Google's Gemini API to extract specific information from the scraped text:

  • Uses specialized prompts to identify psychologist names and their types (Clinical vs. General)
  • Extracts contact emails with priority for primary clinic contacts
  • Finds URLs for doctor/team pages
  • Identifies pricing information for initial and follow-up consultations
  • Structures the extracted data in a consistent JSON format

The AI-powered extraction significantly improves accuracy compared to regex-only approaches, especially for diverse website layouts.
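Turning a model reply into the consistent JSON structure might look like the sketch below. The field names and the reply format are assumptions, and the call to the Gemini API itself is omitted. The project lists Pydantic, which would be a natural fit for this step; plain dataclasses are used here only to keep the example dependency-free.

```python
import json
from dataclasses import dataclass, field
from typing import Optional

VALID_TYPES = ("Clinical", "General")


@dataclass
class Psychologist:
    name: str
    type: str  # "Clinical" or "General"


@dataclass
class ClinicExtraction:
    email: Optional[str] = None
    team_page_url: Optional[str] = None
    initial_price: Optional[str] = None
    followup_price: Optional[str] = None
    psychologists: list[Psychologist] = field(default_factory=list)


def parse_reply(reply: str) -> ClinicExtraction:
    """Parse the first JSON object found in a model reply."""
    # Models sometimes wrap JSON in extra text, so slice from the
    # first "{" to the last "}" before decoding.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start:end + 1])

    people = []
    for item in data.get("psychologists") or []:
        name = (item.get("name") or "").strip()
        kind = (item.get("type") or "").strip().title()
        if name and kind in VALID_TYPES:  # drop incomplete or unknown entries
            people.append(Psychologist(name=name, type=kind))

    return ClinicExtraction(
        email=data.get("email"),
        team_page_url=data.get("team_page_url"),
        initial_price=data.get("initial_price"),
        followup_price=data.get("followup_price"),
        psychologists=people,
    )
```

Rejecting entries whose type is neither Clinical nor General keeps unexpected model output from flowing into later stages unnoticed.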

Stage 4: Validation and Structural Formatting

The fourth stage validates and formats the extracted information:

  • Cleanses and standardizes extracted data (emails, URLs, pricing, etc.)
  • Validates information against expected formats
  • Identifies and flags discrepancies between existing and newly extracted data
  • Prepares data for Excel output, including creating new rows for additional psychologists

This stage ensures the final output meets quality standards and is ready for analysis.
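The cleansing and discrepancy checks could be sketched as follows. The regular expressions, function names, and field names are illustrative assumptions rather than the project's actual rules.

```python
import re
from typing import Optional

# Deliberately loose pattern: something@something.tld with no spaces.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def clean_email(raw: Optional[str]) -> Optional[str]:
    """Lowercase, trim, and drop a mailto: prefix; None if not email-like."""
    if not raw:
        return None
    email = raw.strip().lower()
    if email.startswith("mailto:"):
        email = email[len("mailto:"):]
    return email if EMAIL_RE.match(email) else None


def clean_price(raw: Optional[str]) -> Optional[float]:
    """Pull the first amount out of text such as '$220.00 per session'."""
    if not raw:
        return None
    match = re.search(r"\d+(?:,\d{3})*(?:\.\d{1,2})?", raw)
    return float(match.group().replace(",", "")) if match else None


def find_discrepancies(existing: dict, extracted: dict) -> dict:
    """Return {field: (existing, extracted)} where both are set and differ."""
    return {
        key: (existing[key], value)
        for key, value in extracted.items()
        if value is not None and existing.get(key) not in (None, "", value)
    }
```

Only fields that already hold a value and disagree with the new extraction are flagged, so empty cells can be filled in without raising a warning.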

Stage 5: Excel Output Generation

The final stage writes the results to a formatted Excel file:

  • Creates a structured Excel file with all extracted data
  • Applies color coding to highlight different data categories
  • Formats phone numbers and other fields for better readability
  • Generates an invoice file for submission
  • Preserves original green row highlighting while adding new extracted information

The result is a comprehensive, ready-to-use Excel document that saves hours of manual research.
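Phone number formatting for readability can be sketched as below, assuming the common Australian layouts of "04xx xxx xxx" for mobiles and "(0x) xxxx xxxx" for landlines. The function name and the choice to leave other numbers untouched are assumptions.

```python
import re


def format_au_phone(raw: str) -> str:
    """Format a 10-digit Australian number; return other input unchanged."""
    digits = re.sub(r"\D", "", raw)
    if digits.startswith("61") and len(digits) == 11:
        digits = "0" + digits[2:]  # +61 2 ... becomes 02 ...
    if len(digits) != 10 or not digits.startswith("0"):
        return raw  # e.g. 1300 numbers are left as entered
    if digits.startswith("04"):
        # Mobile: 04xx xxx xxx
        return f"{digits[:4]} {digits[4:7]} {digits[7:]}"
    # Landline: (0x) xxxx xxxx
    return f"({digits[:2]}) {digits[2:6]} {digits[6:]}"
```

Returning unrecognized input unchanged avoids corrupting numbers that do not fit either layout.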

Implementation Highlights

🔍 Smart Web Scraping

Our scraper identifies key pages like "Our Team" or "Staff" and preserves content structure for accurate AI processing.
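Spotting team or staff pages from a site's links can be sketched as below. The project uses BeautifulSoup4; this illustration swaps in the standard library's html.parser so it runs without dependencies, and the keyword list is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical keywords that suggest a team or staff page.
TEAM_KEYWORDS = ("team", "staff", "psychologist", "practitioner")


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[tuple[str, str]] = []
        self._href = None
        self._text: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_team_links(html: str, base_url: str) -> list[str]:
    """Return absolute URLs whose href or link text mentions a keyword."""
    collector = LinkCollector()
    collector.feed(html)
    found: list[str] = []
    for href, text in collector.links:
        haystack = f"{href} {text}".lower()
        if any(word in haystack for word in TEAM_KEYWORDS):
            url = urljoin(base_url, href)  # resolve relative links
            if url not in found:  # keep first occurrence only
                found.append(url)
    return found
```

Checking both the href and the visible link text catches pages such as "/about-us" that are labeled "Our Team" in the navigation.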

🤖 AI-Driven Data Extraction

Using Google's Gemini API, the system extracts psychologist details, contact info, and pricing with high accuracy.

🛠️ Reliable and Efficient

Features robust error handling, rate limiting, and batch processing to ensure smooth operation across numerous websites.

Challenges and Solutions

🌐 Website Diversity

Websites vary in structure & tech stacks.

Solution: Our scraper adapts using multiple content detection techniques.

⚖️ Rate Limiting and Respectful Scraping

High request rates may cause IP blocks.

Solution: Configurable delays & exponential backoff prevent server overload.

📄 Extracting Unstructured Information

Info appears in inconsistent formats.

Solution: AI-driven text parsing (vs. regex) improves accuracy.

Visit the GitHub Repository

For the full code and documentation, visit the GitHub repository.