Streamlining clinic data collection with Gemini-powered web scraping
The Psychology Researcher Automation project is a specialised tool that streamlines research into psychology clinics in Australia. The system uses web scraping and AI-powered data extraction to automatically gather and structure information about each clinic, including contact details, practitioner information, and pricing data.
This automation pipeline extracts critical information from clinic websites:
Researching psychology clinics manually is an extremely time-consuming process. For each clinic, a researcher needs to:
This process becomes especially challenging when dealing with hundreds of clinics, each with a different website structure and way of organising information. Our automation system turns what could take days of manual effort into a streamlined pipeline that delivers consistent results in a fraction of the time.
Our solution is divided into five distinct stages that form a complete data processing pipeline:
The first stage processes the input Excel file containing clinic information. It:
This initial validation ensures we begin with clean, standardized data before web scraping.
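As a rough illustration of this kind of input cleaning, the sketch below normalises website URLs and drops duplicate or unusable rows. It works on plain dictionaries rather than reading the Excel file, and the column name `website`, the helper names, and the choice to default to `https` are all assumptions made for the example, not details taken from the project.

```python
from urllib.parse import urlparse


def normalise_url(raw):
    """Return a cleaned http(s) URL, or None if the cell is unusable."""
    if raw is None:
        return None
    text = str(raw).strip()
    if not text:
        return None
    # Assumption: spreadsheets often omit the scheme, so default to https.
    if "://" not in text:
        text = "https://" + text
    parsed = urlparse(text)
    if parsed.scheme not in ("http", "https") or "." not in parsed.netloc:
        return None
    path = parsed.path.rstrip("/")
    return f"{parsed.scheme}://{parsed.netloc.lower()}{path}"


def clean_rows(rows):
    """Drop rows without a usable URL and skip sites already seen."""
    seen = set()
    cleaned = []
    for row in rows:
        url = normalise_url(row.get("website"))  # "website" is a hypothetical column
        if url is None or url in seen:
            continue
        seen.add(url)
        cleaned.append({**row, "website": url})
    return cleaned
```

Deduplicating on the normalised URL rather than the raw cell means that `example.com.au` and `https://example.com.au/` count as the same clinic.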
The second stage handles the web scraping process for each clinic website:
The scraper combines error handling with retries so that it can cope with a wide range of website configurations.
The third stage leverages Google's Gemini API to extract specific information from the scraped text:
The AI-powered extraction significantly improves accuracy compared to regex-only approaches, especially for diverse website layouts.
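One way to structure this stage is sketched below: build a prompt that asks for JSON, then parse the reply defensively. The Gemini call itself is left as a caller-supplied function, `ask_model`, because the write-up does not say which client library or model the project uses. The field names and prompt wording are likewise assumptions for illustration.

```python
import json

# Hypothetical output fields; the project's real schema may differ.
FIELDS = ("psychologists", "phone", "email", "fees")
FENCE = "`" * 3  # Markdown code-fence marker that models often wrap JSON in


def build_prompt(page_text, max_chars=12000):
    """Ask for a single JSON object with exactly the keys in FIELDS."""
    return (
        "Extract details about this psychology clinic from the text below. "
        "Reply with a single JSON object using the keys "
        + ", ".join(FIELDS)
        + ". Use null for anything the text does not state.\n\n"
        + page_text[:max_chars]  # truncate very long pages
    )


def parse_reply(reply):
    """Parse the model reply, tolerating a code fence around the JSON."""
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line and everything from the closing fence on.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    # Always return every expected field so later stages see a stable shape.
    return {field: data.get(field) for field in FIELDS}


def extract_clinic_info(page_text, ask_model):
    """ask_model is any callable that takes a prompt and returns reply text."""
    return parse_reply(ask_model(build_prompt(page_text)))
```

Keeping the model call behind a plain callable also makes the parsing logic testable without network access, for example `extract_clinic_info(text, lambda prompt: '{"email": null}')`.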
The fourth stage validates and formats the extracted information:
This stage ensures the final output meets quality standards and is ready for analysis.
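The exact checks are not listed here, so the following is only a sketch of what validation and formatting might look like for two fields. It assumes Australian landline and mobile numbers only (13, 1300, and 1800 numbers are not handled), uses a deliberately loose email pattern, and the function names are hypothetical.

```python
import re

# Loose sanity check, not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def format_au_phone(raw):
    """Return a national-format number, or None if it does not look valid."""
    digits = re.sub(r"\D", "", raw or "")
    # Convert +61 international form (11 digits) to national form with a leading 0.
    if len(digits) == 11 and digits.startswith("61"):
        digits = "0" + digits[2:]
    if len(digits) != 10 or not digits.startswith("0"):
        return None
    if digits.startswith("04"):
        # Mobile: 04xx xxx xxx
        return f"{digits[:4]} {digits[4:7]} {digits[7:]}"
    # Landline: (0x) xxxx xxxx
    return f"({digits[:2]}) {digits[2:6]} {digits[6:]}"


def clean_record(record):
    """Normalise phone and email; invalid values become None."""
    email = (record.get("email") or "").strip().lower()
    return {
        "phone": format_au_phone(record.get("phone")),
        "email": email if EMAIL_RE.match(email) else None,
    }
```

Returning `None` for anything that fails a check, rather than passing the raw value through, keeps doubtful data out of the final spreadsheet and makes gaps easy to spot.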
The final stage generates a professionally formatted Excel file containing all the extracted information:
The result is a comprehensive, ready-to-use Excel document that saves hours of manual research.
Our scraper identifies key pages like "Our Team" or "Staff" and preserves content structure for accurate AI processing.
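A minimal sketch of key-page detection, using only the Python standard library, might look like the following. It collects links from a homepage and keeps those whose URL or link text mentions "team" or "staff". The keyword list, class name, and function name are assumptions for the example, and the project's own detection logic may be more elaborate.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical keyword list based on the "Our Team" / "Staff" examples.
KEYWORDS = ("team", "staff")


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only record text while inside an anchor that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def find_key_pages(html, base_url):
    """Return absolute URLs of links that look like team or staff pages."""
    collector = LinkCollector()
    collector.feed(html)
    pages = []
    for href, text in collector.links:
        haystack = f"{href} {text}".lower()
        if any(word in haystack for word in KEYWORDS):
            url = urljoin(base_url, href)
            if url not in pages:  # keep first occurrence, preserve order
                pages.append(url)
    return pages
```

Matching on both the URL and the visible link text catches pages such as `/about-us` that are labelled "Meet the team" in the navigation.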
Using Google's Gemini API, the system extracts psychologist details, contact info, and pricing with high accuracy.
The pipeline features robust error handling, rate limiting, and batch processing to ensure smooth operation across numerous websites.
Websites vary in structure and technology stack.
Solution: Our scraper adapts using multiple content detection techniques.
High request rates may cause IP blocks.
Solution: Configurable delays and exponential backoff prevent server overload.
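The idea can be sketched as a small retry wrapper. The function below retries a failed request, doubling the wait after each attempt up to a cap and adding a little random jitter. The parameter names and default values are assumptions for illustration, and `fetch` stands in for whatever HTTP call the scraper actually makes.

```python
import random
import time


def fetch_with_backoff(fetch, url, retries=4, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call fetch(url), retrying on any exception with exponential backoff.

    Waits base_delay, 2*base_delay, 4*base_delay, ... (capped at max_delay)
    plus up to 10% jitter between attempts. Re-raises the last error once
    all retries are used up.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # a real scraper would catch narrower errors
            last_error = exc
            if attempt == retries:
                break
            delay = min(base_delay * 2 ** attempt, max_delay)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise last_error
```

Passing `sleep` as a parameter is a convenience for testing: a stub can record the requested delays instead of actually waiting.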
Information appears in inconsistent formats.
Solution: AI-driven text parsing, rather than regex alone, improves accuracy.
For the full code and documentation, visit the GitHub repository.