Want to Find Keywords Across a Website? Here’s How to Build a Python Crawler Fast!
Have you ever needed to find a specific word or phrase across multiple pages of a website, but doing it manually seemed overwhelming? Well, you’re in the right place! In this post, I’ll walk you through a simple, yet powerful Python script that can help you automate this process. By the end, you’ll know how to crawl any website and search for specific keywords across its pages — all without needing to sift through them one by one.
Why You Should Use a Web Crawler
A web crawler can save you hours of manual work by automatically visiting pages, scanning their content, and identifying exactly where your keyword appears. Whether you’re working on SEO research, content analysis, or simply trying to gather data from a website, this tool can streamline the process.
Together, we’ll go through how to set up a Python-based web crawler that respects each website’s robots.txt file, searches for your keyword, and exports the results into a neat text file for easy access.
Let’s get started!
Step 1: Getting Your Python Web Crawler Ready
First, let’s take a look at the Python script that will form the backbone of your web crawler. Don’t worry if you’re not a Python expert — I’ll break down each part so you can follow along easily.
Here’s what the script does in a nutshell:
- It prompts you for a website and the keyword you’re looking for.
- It goes through all the pages on that website.
- It checks if the keyword appears on each page.
- It collects all the URLs where the keyword is found and saves them to a text file.
Here’s the Code
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse, urldefrag
from urllib.robotparser import RobotFileParser
from collections import deque


def crawl(domain, keyword, output_file):
    # Normalize the starting URL and remember the domain so the crawler stays on it.
    base_url = domain if domain.startswith('http') else 'http://' + domain
    parsed_base = urlparse(base_url)
    netloc = parsed_base.netloc

    # Load the site's robots.txt so we can honor its crawl rules.
    rp = RobotFileParser()
    rp.set_url(urljoin(base_url, '/robots.txt'))
    rp.read()

    queue = deque([base_url])   # pages waiting to be visited
    visited = set()             # pages already processed
    matched_urls = []           # pages where the keyword was found
    headers = {'User-Agent': 'KeywordWebScraperBot/1.0'}

    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)

        # Skip anything robots.txt disallows for our user agent.
        if not rp.can_fetch(headers['User-Agent'], url):
            continue

        try:
            response = requests.get(url, headers=headers, timeout=10)
        except requests.exceptions.RequestException:
            continue
        if response.status_code != 200:
            continue

        # Case-insensitive keyword match against the page content.
        content = response.text
        if keyword.lower() in content.lower():
            matched_urls.append(url)

        # Collect internal links and queue any page we haven't seen yet.
        soup = BeautifulSoup(content, 'html.parser')
        for link in soup.find_all('a', href=True):
            # Drop #fragments so page and page#section aren't fetched twice.
            next_url, _ = urldefrag(urljoin(url, link['href']))
            if urlparse(next_url).netloc != netloc:
                continue
            if next_url not in visited:
                queue.append(next_url)

    # Write one matching URL per line.
    with open(output_file, 'w') as f:
        for matched_url in matched_urls:
            f.write(matched_url + '\n')


if __name__ == '__main__':
    domain = input('Enter the domain to crawl (e.g., https://example.com): ')
    keyword = input('Enter the keyword to search for: ')
    output_file = 'matched_urls.txt'
    crawl(domain, keyword, output_file)
    print(f'Done. URLs containing the keyword have been saved to {output_file}')
Now let’s break this down into manageable steps so you know exactly what’s happening.
Step 2: Setting Up Your Environment
Before running the script, you’ll need a few things installed on your computer. Don’t worry — it’s just a few quick steps, and you’ll be ready to go.
Install Python 3
If you don’t have Python 3 installed yet, you can download it from the official Python website. Once installed, open your terminal or command prompt and type python --version to confirm it’s working.
Install the Required Libraries
This script uses two external libraries: requests for handling web requests and BeautifulSoup (from the beautifulsoup4 package) for parsing HTML content. You can install them by running the following command in your terminal:
pip install requests beautifulsoup4
That’s it! Your environment is ready.
Step 3: Running the Web Crawler
Now that everything’s set up, let’s run the crawler and see it in action!
How to Run the Script
- Save the script: Copy the Python code above into a file and save it, for example, as keyword_scraper.py.
- Run the script: Open your terminal or command prompt, navigate to the folder where you saved the script, and run python keyword_scraper.py.
- Enter the domain and keyword: The script will prompt you to enter the domain (for example, https://example.com) and the keyword you’re searching for (for instance, privacy policy).
- Get your results: The crawler will scan the website and collect all URLs where the keyword is found. Once it’s done, it will create a file called matched_urls.txt, which contains all the matching URLs.
Step 4: Understanding How It Works
Let’s take a closer look at how the script helps you achieve your goal of finding keywords across a website.
1. It Starts with a Queue
The script uses a queue to keep track of all the pages it needs to visit. It starts with the homepage (or whichever domain you enter), then follows internal links to visit other pages. It only stays within the same domain, so you won’t have to worry about it wandering off to other websites.
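If the queue part feels abstract, here’s a tiny, self-contained sketch that crawls a made-up “site” (just a dictionary of links, no real HTTP requests) using the same breadth-first pattern:

from collections import deque

# Toy illustration only: the page names and links below are invented.
fake_site = {
    '/': ['/about', '/blog'],
    '/about': ['/'],
    '/blog': ['/blog/post-1', '/about'],
    '/blog/post-1': ['/'],
}

queue = deque(['/'])
visited = set()
while queue:
    page = queue.popleft()
    if page in visited:
        continue
    visited.add(page)
    print('visiting', page)
    for link in fake_site[page]:
        if link not in visited:
            queue.append(link)

Running this prints the pages in breadth-first order, which is exactly how the real crawler works its way outward from the homepage.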
2. It Follows the Rules
Before visiting each page, the crawler checks the website’s robots.txt file to make sure it’s allowed to crawl that page. This is important because some websites restrict crawlers from accessing certain sections, often to avoid overloading their servers.
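If you want to see that check in isolation, here’s a small sketch using Python’s built-in RobotFileParser (the domain and paths are just placeholders):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

user_agent = 'KeywordWebScraperBot/1.0'
for page in ('https://example.com/', 'https://example.com/private/'):
    allowed = rp.can_fetch(user_agent, page)
    print(page, '->', 'allowed' if allowed else 'disallowed')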
3. It Searches for Your Keyword
Once the script fetches a page, it looks for your keyword in the page content. It does this in a case-insensitive way, so it doesn’t matter if the keyword is written in uppercase, lowercase, or a mix of both. If it finds the keyword, it saves the URL for you.
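Here’s a quick sketch of that matching logic on a sample snippet of HTML, plus an optional tweak (not in the script above) that searches only the page’s visible text instead of the raw markup:

from bs4 import BeautifulSoup

html = '<html><body><p>Read our Privacy Policy here.</p></body></html>'  # sample page
keyword = 'privacy policy'

# The script's approach: lowercase both sides and check the raw HTML.
print(keyword.lower() in html.lower())  # True

# Optional tweak: search only visible text, so matches hidden inside tags,
# scripts, or attributes don't count.
text = BeautifulSoup(html, 'html.parser').get_text()
print(keyword.lower() in text.lower())  # True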
4. It Keeps Things Organized
The crawler parses each page’s HTML to find all the internal links. It then adds any new pages to the queue, ensuring it doesn’t visit the same page twice. This makes sure your crawler covers as much of the website as possible without unnecessary repeats.
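To make the link handling concrete, this small sketch (with made-up URLs) shows how urljoin turns relative links into absolute ones, and how comparing domains filters out external sites:

from urllib.parse import urljoin, urlparse

page_url = 'https://example.com/blog/'  # the page we just fetched (placeholder)
hrefs = ['/about', 'post-1.html', 'https://othersite.com/page', '#top']

netloc = urlparse(page_url).netloc
for href in hrefs:
    absolute = urljoin(page_url, href)
    same_site = urlparse(absolute).netloc == netloc
    print(f'{href!r:30} -> {absolute} (internal: {same_site})')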
Step 5: How to Make the Most of Your Crawler
Now that you have the basics down, there are a few ways you can extend or tweak the script to suit your needs. Here are some ideas to get you started:
1. Adjust the Output
By default, the script outputs the URLs to a text file. If you want to include additional information — like how many times the keyword appears on each page — you can modify the script to count keyword occurrences and add that data to the output file.
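One way to do that (a sketch, not part of the script above) is a small helper that counts case-insensitive occurrences; you would then write the URL and the count to the output file together:

def keyword_count(content, keyword):
    """Count case-insensitive occurrences of the keyword in a page's content."""
    return content.lower().count(keyword.lower())

# Example with a made-up snippet of page text:
sample = 'Our Privacy Policy explains the policy in detail.'
print(keyword_count(sample, 'policy'))  # 2

# In the crawler, you'd call keyword_count(content, keyword) where the script
# currently checks "if keyword.lower() in content.lower()", store (url, count)
# pairs, and write lines like f'{url}\t{count}\n' to the output file.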
2. Set Crawl Limits
Depending on the size of the website you’re crawling, you might want to limit how many pages the crawler visits. You can add a page limit to the script or filter out certain types of URLs if you’re only interested in specific sections of the website.
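A simple cap on the number of visited pages only takes one extra condition on the crawl loop. Here’s a toy sketch of the idea (the page names and the MAX_PAGES value are made up; in the real script you’d check len(visited) against your own limit):

from collections import deque

MAX_PAGES = 3  # hypothetical cap; pick whatever fits the site you're crawling

queue = deque(['page-1', 'page-2', 'page-3', 'page-4', 'page-5'])
visited = set()

# Same loop shape as the crawler, but it stops once the cap is reached.
while queue and len(visited) < MAX_PAGES:
    page = queue.popleft()
    if page in visited:
        continue
    visited.add(page)
    print('visiting', page)

print('stopped after', len(visited), 'pages')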
3. Customize the User-Agent
Websites track requests based on the User-Agent string, which tells them which browser or bot is accessing their pages. The script identifies itself as KeywordWebScraperBot/1.0, but you can change this to better fit your needs or to match the guidelines of the website you’re crawling.
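For instance, some crawlers include a contact URL in the User-Agent string so site owners know who is behind the traffic. A minimal sketch (the contact URL and target site are placeholders):

import requests

headers = {'User-Agent': 'KeywordWebScraperBot/1.0 (+https://example.com/contact)'}
response = requests.get('https://example.com', headers=headers, timeout=10)

# Confirm the header that was actually sent with the request.
print(response.request.headers['User-Agent'])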
Best Practices for Responsible Crawling
As with any tool, it’s important to use web crawlers responsibly. Here are some key practices to keep in mind:
- Respect robots.txt: Always follow the rules set by the website in its robots.txt file.
- Don’t overload the server: Avoid sending too many requests in a short period, especially on smaller websites. Consider adding a short delay between requests, as shown in the sketch below.
- Be transparent: Use a clear User-Agent that identifies your crawler to website administrators.
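For the delay mentioned above, time.sleep is usually all you need. Here’s a minimal sketch (the URLs are placeholders, and the one-second pause is an arbitrary choice — adjust it to the site you’re crawling):

import time
import requests

urls = ['https://example.com/', 'https://example.com/about']  # placeholder pages
headers = {'User-Agent': 'KeywordWebScraperBot/1.0'}

for url in urls:
    response = requests.get(url, headers=headers, timeout=10)
    print(url, response.status_code)
    time.sleep(1)  # pause between requests to be gentle on the server

In the full crawler, you’d put the same time.sleep call at the end of the while loop, right after each page is processed.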
Wrapping Up
And there you have it! With this web crawler, you can easily search for keywords across any website, automate data collection, and save a ton of time. Whether you’re doing research, analyzing content, or tracking changes on a site, this Python script has you covered.
If you have any questions or need help tweaking the script, feel free to reach out. Happy crawling!