A glimpse at web-scraping with Python

Welcome to episode 1 of our Web-scraping Series. This series is created by our AVG team to help YOU, the readers of our blog, regardless of your background, automate a very common yet time-consuming and manual task in our daily work:

Navigate to a website, manually copy and paste information from its subpages, and save it into a file of your choice!

… Ready to embark on the journey? Let’s go.

I. What is Web Scraping?

Web scraping is the process of collecting structured web data in an automated fashion. It’s also called web data extraction. Some of the main use cases of web scraping include price monitoring, price intelligence, news monitoring, lead generation, and market research, among many others.

In general, web data extraction is used by people and businesses who want to make use of the vast amount of publicly available web data to make smarter decisions.

If you’ve ever copied and pasted information from a website, you’ve performed the same function as any web scraper, only on a microscopic, manual scale. Unlike the mundane, mind-numbing process of manually extracting data, web scraping uses intelligent automation to retrieve hundreds, millions, or even billions of data points from the internet’s seemingly endless frontier.

Enough talking, how about the technology?

II. Common libraries for web scraping in Python: Selenium and Beautiful Soup

Are you ready for some code? We will start with a common and generic structure of a web-scraping project for your own understanding. In the subsequent episodes, specific use cases with code snippets will be prepared for you to plug and play in your own project.
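
Before the Beautiful Soup walkthrough below, here is a minimal, hypothetical sketch of the Selenium side of the toolbox, just so you can see the difference. It assumes the selenium package and a Chrome browser are installed; Selenium drives a real browser, which makes it the tool of choice when a page builds its content with JavaScript, while Beautiful Soup (used in the rest of this episode) is enough for static HTML.

from selenium import webdriver
from selenium.webdriver.common.by import By

# Launch a real browser (Selenium 4 can locate a Chrome driver on its own).
driver = webdriver.Chrome()
driver.get('http://example.com')   # navigate just like a human would
print('Title:', driver.title)      # the title of the fully rendered page
print(len(driver.find_elements(By.TAG_NAME, 'a')), 'anchor tags found')
driver.quit()                      # always release the browser when done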

III. A glimpse at web scraping with Python

Step 1: Import the necessary libraries into Python

from urllib.request import urlopen   # opens URLs over HTTP/HTTPS
from bs4 import BeautifulSoup        # parses (possibly broken) HTML
import ssl                           # lets us customize certificate handling

In this case, bs4 is the import name of the Beautiful Soup library. It will help us process the HTML tags retrieved from the webpage. We need this library because real-world HTML is often broken: tags are left unclosed or nested incorrectly, which makes the raw markup difficult to process directly. Beautiful Soup repairs such markup into a clean tree we can query.
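
As a quick, self-contained illustration (no network needed), here is how Beautiful Soup copes with a deliberately broken snippet; the markup below is made up for the demo:

from bs4 import BeautifulSoup

# A snippet with an unclosed <b> and a missing closing </body> tag --
# Beautiful Soup repairs it instead of failing.
broken_html = "<html><body><p>First <b>bold paragraph</p><p>Second</body>"
soup = BeautifulSoup(broken_html, "html.parser")
for p in soup.find_all('p'):
    print(p.get_text())   # prints the text of both paragraphs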

Step 2: Configure your program so that it ignores SSL certificate errors

# Create a default SSL context, then relax it so that invalid or
# self-signed certificates do not stop the request.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

An SSL certificate is a digital certificate that authenticates a website’s identity and enables an encrypted connection. SSL stands for Secure Sockets Layer, a security protocol that creates an encrypted link between a web server and a web browser.

Ignoring SSL certificate errors allows us to scrape URLs whose certificates are missing, self-signed, or expired, which would otherwise make urlopen() throw an error. Keep in mind that this also disables protection against man-in-the-middle attacks, so use it only for harmless scraping, never for sending sensitive data.

The helper function ssl.create_default_context() returns a new context with secure default settings. The settings are chosen by the ssl module and usually represent a higher security level than when calling the SSLContext constructor directly.

The check_hostname attribute determines whether to match the peer certificate’s hostname when making the connection.

The verify_mode attribute determines whether to try to verify other peers’ certificates and how to behave if verification fails.
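
To see the difference in action, here is a small sketch contrasting the default (strict) context with the relaxed one built above. It assumes the public test host self-signed.badssl.com, which deliberately serves a self-signed certificate:

import ssl
from urllib.request import urlopen

# The default context enforces certificate and hostname checks.
strict = ssl.create_default_context()

# The relaxed context from Step 2 skips both checks.
relaxed = ssl.create_default_context()
relaxed.check_hostname = False
relaxed.verify_mode = ssl.CERT_NONE

url = 'https://self-signed.badssl.com/'   # test host with a self-signed cert
try:
    urlopen(url, context=strict)
except Exception as exc:
    print('Strict context rejected it:', exc)

print('Relaxed context status:', urlopen(url, context=relaxed).status)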

Step 3: Ask the user for the URL

url = input('Enter - ')
html = urlopen(url, context=ctx).read()    # fetch the raw HTML bytes
soup = BeautifulSoup(html, "html.parser")  # parse them into a soup object

urlopen() is a function that opens the URL specified in the call and returns a response object; calling read() on it gives us the page’s raw HTML.

BeautifulSoup() is the constructor to which you pass the markup you want to parse, as well as the parser you want to use. In this case, we use Python’s built-in html.parser to parse the HTML code.
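
User-supplied URLs fail in many ways (typos, unreachable hosts, 404 responses), so in practice you may want to wrap the fetch in error handling. Here is a minimal, self-contained sketch of Step 3 with that added; the exception handling is our own addition, not part of the walkthrough above:

import ssl
from urllib.error import HTTPError, URLError
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Same relaxed context as in Step 2.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ').strip()
try:
    html = urlopen(url, context=ctx).read()
except HTTPError as exc:    # the server answered, but with an error status
    print('HTTP error:', exc.code)
except URLError as exc:     # DNS failure, refused connection, bad scheme, ...
    print('Failed to reach the server:', exc.reason)
else:
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.title.string if soup.title else 'No <title> found')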

Step 4: Retrieve all of the anchor tags from the parsed result

tags = soup('a')   # shorthand for soup.find_all('a')
for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    print('Contents:', tag.contents[0] if tag.contents else None)  # guard against empty <a></a>
    print('Attrs:', tag.attrs)

This piece of code looks for all of the anchor (<a></a>) tags in our retrieved code. After that, each component inside the tag is extracted and printed to the screen.
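
A natural next step, sketched below, is to keep only the href values and resolve relative links against the page URL. This continues the script above (it reuses the url and soup variables from Steps 3 and 4); urljoin() and the variable names are our own additions:

from urllib.parse import urljoin

# Keep only anchors that actually carry an href, and make every link absolute.
links = [urljoin(url, tag.get('href')) for tag in soup('a') if tag.get('href')]
print('Found', len(links), 'links')
for link in links[:10]:   # preview the first ten
    print(link)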

And that’s what web scraping is all about. For in-depth knowledge of how this code works, or of the specific uses of web scraping, stay tuned for the upcoming episodes!