Python Web Scraping Tutorial



Lab Objectives

  • Create a Python script to scrape job postings from a forum post.
  • Analyse the data to see how popular different technologies are.


> [!IMPORTANT]
> Web Scraping Ethics

  • Be mindful about how you scrape!
  • Don’t overdo it. Request data at a reasonable rate. Respect the owners of the data.

Make an HTTP request

  • Create a Python file scraper.py containing the following code:
def main():
    print('Hello world!')

if __name__ == "__main__":
    main()
  • Create a virtual environment:
python -m venv .venv
  • Activate the virtual environment (on Windows):
.\.venv\Scripts\activate
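  • On macOS or Linux, the activation command is different (assuming a POSIX shell):
source .venv/bin/activate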
  • Launch the program:
python scraper.py
  • Next, update main() to define the URL of the page we want to scrape:
def main():
  url = "https://news.ycombinator.com/item?id=42919502"
  print(f"Scraping: {url}")

if __name__ == "__main__":
  main()
  • To retrieve the contents of this link, we have to install the requests library and import it:
pip install requests

Requests is an elegant and simple HTTP library for Python, built for human beings (docs)

import requests

def main():
  url = "https://news.ycombinator.com/item?id=42919502"
  response = requests.get(url)
  print(f"Scraping: {url}")
  print(response)

if __name__ == "__main__":
  main()
  • Execute the program and you will get this result:
Scraping: https://news.ycombinator.com/item?id=42919502
<Response [200]>
  • You can also print the response content to see all the raw data:
print(response.content)
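  • Optionally, check that the request succeeded before going further; requests raises an exception for error status codes:
response.raise_for_status()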

Parse the HTTP response with “Beautiful Soup”

  • response.content gives us all the raw data; we have to parse it to make it cleaner and easier to work with.
  • For that we will use another library called Beautiful Soup:
pip install beautifulsoup4

Beautiful Soup is a Python library for pulling data out of HTML and XML files (docs).

To scrape the website, we have to know the structure of the page. Inspect the page in the browser to find out if there is any class or id we can use to identify a particular job post.

  • To use Beautiful Soup, give it the response content and a parser (in this case an HTML parser):
import requests
from bs4 import BeautifulSoup

def main():
  url = "https://news.ycombinator.com/item?id=42919502"
  response = requests.get(url)

  soup = BeautifulSoup(response.content, "html.parser")
  # find all elements with class="comment"
  elements = soup.find_all(class_="comment")

  # Show the number of elements found
  print(f"Elements: {len(elements)}")

if __name__ == "__main__":
  main()
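  • Optionally, right after the find_all call, you can inspect the markup of the first match to confirm the page structure (prettify() re-indents the HTML so it is easier to read):
# Optional: print the HTML of the first matched element
if elements:
    print(elements[0].prettify())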

Extract individual comments

  • After analysing the website to scrape, you will find that what interests us are the elements with indentation level 0 and, for each of them, the next element containing the comment text:
import requests
from bs4 import BeautifulSoup

def main():
  url = "https://news.ycombinator.com/item?id=42919502"
  response = requests.get(url)

  soup = BeautifulSoup(response.content, "html.parser")
  # find all elements with class="ind" and indent level = 0
  elements = soup.find_all(class_="ind", indent=0)
  # for each of these elements, find the next comment element
  comments = [e.find_next(class_="comment") for e in elements]

  # Show the number of comments found
  print(f"Comments: {len(comments)}")

if __name__ == "__main__":
  main()
  • Execute the new code.
  • Replace:
print(f"Comments: {len(comments)}")

with

# show each comment (job post)
for comment in comments:
  print(comment)

Clean up the response text

  • After executing the last modification, you will notice that the output still contains a lot of HTML tags that we want to get rid of.
  • We can use comment.get_text():
import requests
from bs4 import BeautifulSoup

def main():
  url = "https://news.ycombinator.com/item?id=42919502"
  response = requests.get(url)

  soup = BeautifulSoup(response.content, "html.parser")
  # find all elements with class="ind" and indent level = 0
  elements = soup.find_all(class_="ind", indent=0)
  # for each of these elements, find the next comment element
  comments = [e.find_next(class_="comment") for e in elements]

  # show each comment (job post)
  for comment in comments:
    comment_text = comment.get_text()
    print(comment_text)

if __name__ == "__main__":
  main()
  • Execute the file again and it will look better!
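  • If the text still runs together, get_text() also accepts optional arguments to join the pieces with a separator and strip extra whitespace, for example:
comment_text = comment.get_text(separator=" ", strip=True)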

Process the scraped content for useful data

  • We want to find out how often each technology is mentioned. We will limit ourselves to programming languages: we will look at each of these comments and see how many times a language such as Python, JavaScript, or Java is mentioned.
  • We will create a map of the different languages we are interested in, then for each comment we will scan its words for occurrences of each language and count them. We will count each keyword at most once per post, because we don’t want a post that mentions, for example, JavaScript three times to count as three separate mentions.
  • Create the map of languages:
import requests
from bs4 import BeautifulSoup

def main():
  url = "https://news.ycombinator.com/item?id=42919502"
  response = requests.get(url)

  soup = BeautifulSoup(response.content, "html.parser")
  # find all elements with class="ind" and indent level = 0
  elements = soup.find_all(class_="ind", indent=0)
  # for each of these elements, find the next comment element
  comments = [e.find_next(class_="comment") for e in elements]

  # Map of technology keywords to search for,
  # with their occurrence counts initialized to 0
  keywords = {"python": 0, "javascript": 0, "typescript": 0, "go": 0, "c#": 0, "java": 0, "rust": 0}

  # process each comment (job post)
  for comment in comments:
    # get the comment text and lower case it
    comment_text = comment.get_text().lower()

    # split the comment by spaces, which creates a list of words
    words = comment_text.split(" ")

    print(words)

if __name__ == "__main__":
  main()
  • Execute the code. You will notice that some of the words in the list contain punctuation, and some words repeat; we only want to count each word once.
  • We definitely have to clean this up; we can do that by calling the strip() function on each word to remove punctuation.
  • After the split instruction, add:
# Use the string strip() function
# with all the characters we want to strip away
words = [w.strip(".,/:;!@") for w in words]
  • Execute the code.
  • Convert the list of words to a set to make the words unique and get rid of duplicates; to do that, just replace the previous code with:
# Use the string strip() function
# with all the characters we want to strip away
# Use a set to get unique words
words = {w.strip(".,/:;!@") for w in words}
  • Now that we have processed and cleaned up the scraped data, we can count our keywords inside the set. Replace the print(words) instruction with the following code:
# iterate over keywords; k gives you the dictionary key
# if the key is in the words set, add 1 to that keyword's count
for k in keywords:
  if k in words:
    keywords[k] += 1
  • Just after the for comment in comments: loop, print out the keywords:
print(keywords)
  • Execute the code.
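  • Optionally, if you prefer the summary sorted from most to least mentioned, you can sort the dictionary items before printing (a small sketch using the keywords dictionary built above):
# print the counts sorted by number of mentions, highest first
for language, count in sorted(keywords.items(), key=lambda kv: kv[1], reverse=True):
  print(f"{language}: {count}")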

Visualizing the data with “matplotlib”

  • Now we will visualize the results in a graph. For that we will install the matplotlib dependency (the most popular plotting library in Python - (docs)):
pip install matplotlib
  • To use the matplotlib library, we have to import it:
import matplotlib.pyplot as plt
  • After the print(keywords) instruction, add the following code:
# plot a bar graph
plt.bar(keywords.keys(), keywords.values())
# Add labels
plt.xlabel("Language")
plt.ylabel("# of Mentions")
plt.show()
  • Execute the code.
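  • If you are running in an environment without a display (for example a remote server), you can save the chart to an image file instead of showing it; the file name here is just an example:
# save the chart to a PNG file instead of opening a window
plt.savefig("languages.png")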
  • Generate requirements.txt:
pip freeze > requirements.txt
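  • Anyone can then recreate the same environment from that file:
pip install -r requirements.txt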

What Next?

  • Adapt the scraper to different websites.
  • Turn the scraper into a serverless API.
  • Use Selenium for complex browser interaction.
  • Store the scraped data in a CSV file or database (see the sketch below).
  • Create a web board interface.
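
Below is a minimal sketch of the CSV idea, assuming keywords is the dictionary built in scraper.py; the function name and file name are just examples:

import csv

def save_counts(keywords, path="keywords.csv"):
    # write one row per language with its mention count
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["language", "mentions"])
        for language, count in keywords.items():
            writer.writerow([language, count])

# example usage, e.g. at the end of main():
# save_counts(keywords)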

Mini Project to do in pairs

Here are 4 interesting web scraping ideas; choose one of them:

  1. Product Price Tracker:
  • Scrape product prices from e-commerce websites to monitor price changes.
  • Send notifications when prices drop or hit a target price.
  • Useful for deal-hunters or businesses analyzing market trends.

Sample websites:


  2. News Aggregator:
  • Collect news articles from multiple sources and compile them into a single feed.
  • Filter articles by keywords or topics of interest.
  • Summarize articles to get the gist quickly.

Sample websites:


  3. Job Listings Scraper:
  • Gather job listings from various job boards.
  • Filter jobs by location, industry, or company.
  • Useful for job seekers looking for specific opportunities.

Sample websites:


  4. Social Media Sentiment Analysis:
  • Scrape social media platforms (e.g., Twitter) for posts mentioning a particular topic or hashtag.
  • Perform sentiment analysis on the collected data to gauge public opinion.
  • Useful for businesses or individuals monitoring brand reputation or trends.

Sample websites:


By Wahid Hamdi