Streamlining clinic data collection with Gemini-powered web scraping
The Psychology Researcher Automation project is a specialised tool designed to streamline the process of researching psychology clinics in Australia. By combining web scraping techniques with AI-powered data extraction, the system can automatically gather and structure information about psychology clinics, including contact details, practitioner information, and pricing data.
This automation pipeline extracts critical information from clinic websites:
Researching psychology clinics manually is an extremely time-consuming process. For each clinic, a researcher needs to:
This process becomes especially challenging when dealing with hundreds of clinics, each with a different website structure and way of organising information. Our automation system turns what could take days of manual effort into a streamlined pipeline that delivers consistent results in a fraction of the time.
Our solution is divided into five distinct stages that form a complete data processing pipeline:
The first stage processes the input Excel file containing clinic information. It:
This initial validation ensures we begin with clean, standardized data before web scraping.
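As an illustration of the kind of cleaning this stage might perform, here is a minimal sketch that standardises a website URL read from a spreadsheet cell. The function name `normalize_url` and its rules (default to https, lower-case the host, drop a trailing slash and any query string) are assumptions made for this example, not the project's actual logic.

```python
from urllib.parse import urlparse


def normalize_url(raw):
    """Return a cleaned http(s) URL, or None if the cell value is unusable."""
    if raw is None:
        return None
    text = str(raw).strip()
    if not text:
        return None
    # Spreadsheet cells often omit the scheme; assume https in that case.
    if "://" not in text:
        text = "https://" + text
    parsed = urlparse(text)
    # Require a web scheme and a host that looks like a domain name.
    if parsed.scheme not in ("http", "https") or "." not in parsed.netloc:
        return None
    host = parsed.netloc.lower()
    path = parsed.path.rstrip("/")
    # Query strings and fragments are dropped in this sketch.
    return f"{parsed.scheme}://{host}{path}"
```

Rows for which the function returns `None` could be flagged for manual review rather than passed on to the scraper.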
The second stage handles the web scraping process for each clinic website:
The scraper retries failed requests and handles errors gracefully, so that differently configured websites do not halt the run.
The third stage leverages Google's Gemini API to extract specific information from the scraped text:
The AI-powered extraction significantly improves accuracy compared to regex-only approaches, especially for diverse website layouts.
The fourth stage validates and formats the extracted information:
This stage ensures the final output meets quality standards and is ready for analysis.
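A validation step of this kind could be sketched as follows. The field names (`email`, `phone`), the loose email pattern, and the phone rules are assumptions chosen for illustration; the only domain fact relied on is that Australian national numbers have ten digits and that the +61 country code replaces the leading 0.

```python
import re

# Deliberately loose pattern: something@something.tld with no spaces.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def clean_phone(raw):
    """Return a 10-digit Australian national number, or None."""
    digits = re.sub(r"\D", "", str(raw or ""))
    # +61 2 9876 5432 -> 0298765432 (country code replaces the leading 0)
    if len(digits) == 11 and digits.startswith("61"):
        digits = "0" + digits[2:]
    # Assumption: accept numbers starting with 0 (landline, mobile)
    # or 1 (such as 1300 and 1800 numbers).
    if len(digits) == 10 and digits[0] in "01":
        return digits
    return None


def validate_record(record):
    """Return (cleaned_record, problems) without mutating the input."""
    cleaned = dict(record)
    problems = []

    email = (record.get("email") or "").strip().lower()
    if EMAIL_RE.match(email):
        cleaned["email"] = email
    else:
        cleaned["email"] = None
        problems.append("email missing or malformed")

    phone = clean_phone(record.get("phone"))
    cleaned["phone"] = phone
    if phone is None:
        problems.append("phone missing or malformed")

    return cleaned, problems
```

Returning the list of problems alongside the cleaned record keeps questionable rows visible in the final output instead of silently discarding them.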
The final stage generates a professionally formatted Excel file containing all the extracted information:
The result is a comprehensive, ready-to-use Excel document that saves hours of manual research.
Our scraper goes beyond basic content extraction by intelligently identifying the most valuable pages on a clinic's website. It looks for patterns in link text and URL structures to find pages likely to contain practitioner information, such as "Our Team," "Meet Our Psychologists," or "Staff." The system also preserves the structure of the content, maintaining headings, paragraphs, and lists for more accurate AI processing.
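The link-pattern idea can be sketched with the standard library alone. The keyword list is taken from the page names mentioned above; the class `LinkCollector`, the function `find_team_pages`, and the scoring weights (link text counts double) are hypothetical choices for this example rather than the project's actual implementation.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Keywords derived from "Our Team", "Meet Our Psychologists", "Staff".
KEYWORDS = ("team", "psychologist", "staff")


class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def find_team_pages(html, base_url):
    """Return absolute URLs of likely practitioner pages, best match first."""
    collector = LinkCollector()
    collector.feed(html)
    scored = []
    for href, text in collector.links:
        score = 0
        for keyword in KEYWORDS:
            if keyword in text.lower():
                score += 2  # visible link text is the stronger signal
            if keyword in href.lower():
                score += 1
        if score:
            scored.append((score, urljoin(base_url, href)))
    scored.sort(key=lambda item: -item[0])  # stable sort keeps page order on ties
    ordered = []
    for _, url in scored:
        if url not in ordered:
            ordered.append(url)
    return ordered
```

For example, a navigation bar linking to `/our-team` with the text "Our Team" would rank ahead of a link whose URL merely contains `staff`.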
Using Google's Gemini API, we've developed specialized prompts that enable accurate extraction of complex information. The AI can distinguish between clinical and general psychologists based on context clues, identify the most relevant contact information, and extract pricing details even when they're presented in various formats across different websites.
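The sketch below shows one way a prompt and its response handling could be structured. It deliberately hides the actual Gemini call behind a hypothetical `generate` callable (prompt string in, response text out), so no SDK details are assumed. The field names and prompt wording are likewise illustrative, not the project's real prompts.

```python
import json

# Hypothetical output fields for this example.
FIELDS = ("email", "phone", "clinical_psychologists",
          "general_psychologists", "fees")

PROMPT_TEMPLATE = """You are extracting facts about a psychology clinic from website text.
Return only a JSON object with these keys: {keys}.
Use null for anything the text does not state.

Website text:
{text}
"""


def build_prompt(page_text, max_chars=20000):
    """Fill the template, truncating very long pages to bound prompt size."""
    return PROMPT_TEMPLATE.format(keys=", ".join(FIELDS),
                                  text=page_text[:max_chars])


def parse_response(raw):
    """Pull the JSON object out of a model reply; missing fields become None."""
    empty = {field: None for field in FIELDS}
    # Replies may wrap the JSON in extra prose, so take the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return empty
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return empty
    # Keep only the expected keys so unexpected output cannot leak through.
    return {field: data.get(field) for field in FIELDS}


def extract_clinic_info(page_text, generate):
    """`generate` is any callable that sends a prompt to the model."""
    return parse_response(generate(build_prompt(page_text)))
```

Keeping the model call injectable also makes the extraction logic testable offline with a stub in place of the API.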
The system implements comprehensive error handling with exponential backoff for failed requests. It respects website servers by implementing rate limiting between requests and uses batch processing to manage resources efficiently. This ensures the tool can run reliably even when processing hundreds of websites.
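A minimal sketch of exponential backoff combined with a fixed pause between sites might look like this. The function names, default delays, and the injected `fetch` and `sleep` callables are assumptions for illustration; the project's real parameters may differ.

```python
import time


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=1.0,
                       sleep=time.sleep):
    """Call fetch(url), doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            # A real scraper would catch only network-related errors here.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...


def scrape_all(urls, fetch, pause=2.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between sites to limit request rate."""
    results = {}
    for index, url in enumerate(urls):
        if index:
            sleep(pause)
        try:
            results[url] = fetch_with_backoff(fetch, url, sleep=sleep)
        except Exception:
            # Record the failure and move on so one bad site cannot stop the run.
            results[url] = None
    return results
```

Passing `sleep` as a parameter is a design choice for this sketch: it lets the delays be observed or skipped in tests without actually waiting.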
Websites vary in structure and technology stack.
Solution: Our scraper adapts by using multiple content detection techniques.
High request rates may cause IP blocks.
Solution: Configurable delays and exponential backoff prevent server overload.
Information appears in inconsistent formats.
Solution: AI-driven text parsing, rather than regex alone, improves accuracy.
For the full code and documentation, visit the GitHub repository.