Web Scraping With Beautifulsoup4



In the last couple of weeks I have been residing within a thinking space occupied by data stories. Many of these data stories have been inspired by my frequent meanderings on the web. I was particularly ecstatic over a story on dissertation length by discipline on the blog ‘R is my friend’. This data post has much going for it: nifty web scraping, beautiful ggplot2 visualisations, and a data story that will have all procrastinating thesis students with their eyes glued to their screen and tongues sticking out in concentration as they figure out where their thesis will fall. Naturally, as an ex-procrastinating thesis student, I figured out where my thesis would fall (Sociology, as it happens – not Physics, the subject I actually majored in). But since I no longer have a thesis to procrastinate over, I realised that I could have some guilt-free fun and use this project to improve my coding: getting better at web scraping and ggplot2.

Scraping With Beautiful Soup

Web scraping (also known as web harvesting or web data extraction) is a software technique for extracting information from websites, and it is an essential skill to have now that companies work with such huge amounts of data. HTML parsing is easy in Python, especially with the help of the Beautiful Soup library. Beautiful Soup is a Python package for parsing HTML and XML documents, including those with malformed markup such as non-closed tags (the library is named after ‘tag soup’). It creates a parse tree for parsed pages that can be used to extract data from HTML, which is exactly what we need for web scraping. Beautiful Soup was started by Leonard Richardson, who continues to contribute to the project. In this post we will put it to work scraping thesis data and, later, tables from web pages.

A while ago, I dabbled briefly in a bit of web scraping by following a tutorial on Flowing Data. I blogged this here. It was cool but I never got inspired beyond that. So, here is a fabulous way to indulge in some data monkeying.

Web scraping, while simple, is by no means robust. Because the scraper relies on a specifically formatted web page, it can break whenever the website changes. The Python code shows a fairly simple scraping script. The Victoria University of Wellington (the university where I did my degree) research archive is fairly easy to scrape: all the theses hosted on the archive come up as search hits, and the unique URL for a given thesis can be scraped from those hits. Doing this is trivial, though a little time consuming. I had to use the ‘Inspect Element’ feature in Chrome to figure out where in the HTML the information is kept. Once I had ascertained this, I needed a way of parsing the code with the BeautifulSoup4 tools to extract the required information. Not difficult, but my relatively inexpert background meant that I simply played around until I had a piece of code that worked. So, fair warning to those who want to use this code: there may be a better way of getting the information!
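A sketch of what this first step can look like. The archive URL and the CSS class names here are made up for illustration – the real ones have to be discovered with the browser's inspector, exactly as described above:

```python
from bs4 import BeautifulSoup

# Hypothetical base URL; the real archive address and markup must be
# found with Chrome's 'Inspect Element' feature.
BASE = "http://researcharchive.example.ac.nz"

def thesis_urls(search_page_html):
    """Pull the unique thesis URLs out of one page of search hits."""
    soup = BeautifulSoup(search_page_html, "html.parser")
    urls = []
    # Assumes each search hit is a <div class="search-hit"> wrapping a link.
    for hit in soup.find_all("div", class_="search-hit"):
        link = hit.find("a")
        if link and link.get("href"):
            urls.append(BASE + link["href"])
    return urls
```

The search pages themselves would be fetched with something like requests, and the function called once per page of hits.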

To follow along, the only packages needed are requests and beautifulsoup4, both installable with pip:

  $ pip install requests
  $ pip install beautifulsoup4

Python Bs4 Web Scraping

Anyway, once I had the individual URLs, I needed to get the thesis-related data (e.g. degree type, author name, faculty name, etc.). Once again, this involved a bit of playing around with ‘Inspect Element’ and the BeautifulSoup4 tools. The scraped information was then stored as a CSV file.
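A sketch of the metadata-and-CSV step. The field names and the `meta-*` classes are invented for illustration, not the archive's real markup:

```python
import csv
from bs4 import BeautifulSoup

# Hypothetical field names; the real ones depend on the archive's pages.
FIELDS = ["title", "author", "degree", "school"]

def thesis_record(page_html):
    """Extract one thesis's metadata from its archive page."""
    soup = BeautifulSoup(page_html, "html.parser")
    record = {}
    for field in FIELDS:
        # Assumes each value sits in an element like <span class="meta-degree">.
        cell = soup.find("span", class_="meta-" + field)
        record[field] = cell.get_text(strip=True) if cell else ""
    return record

def write_csv(records, path):
    """Store a list of record dicts as a CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)
```

Missing fields come back as empty strings, which makes inconsistent pages (like the working papers mentioned below) easy to spot and filter out.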

All of this may seem easy but it is worth noting that I did spend some time getting the data into the right shape. For this dataset, I had to do a few trial runs to see if the data was consistent; it wasn’t. For some reason, one of the schools (Information Management, I believe) published a lot of working papers with multiple authors and no degree information. Since this sort of academic publication was not relevant to my project (which only concerns theses), I discarded this set of results.

Another funny feature to note was the time taken for data retrieval. Since I was recording how long the whole script took to run, I could see where it was stalling. The script frequently stalled while getting some piece of data from the research archive. I am not sure if this was because of a server timeout or some type of blocker that slowed down the requests from my IP address (since I was making so many!). Whatever it was, I did manage to get the script running to the end after a couple of failures.
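One way to cope with stalls like these is to pause and retry. A sketch of that idea – the fetch function is deliberately left as a parameter (e.g. a small wrapper around requests.get), so the retry logic itself stays simple:

```python
import time

def fetch_with_retries(fetch, url, attempts=3, delay=2.0):
    """Call fetch(url), pausing and retrying if it fails or stalls.

    `fetch` is whatever does the actual request; keeping it as a
    parameter also makes the retry logic testable without a network.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as err:
            last_error = err
            time.sleep(delay)  # be polite: wait before hitting the server again
    raise last_error
```

A longer delay (or an increasing one) is kinder to the server when you are making very many requests.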

In the next post, we will see what insights we can get from this set of thesis data! Oh, I might also put up a cartoon of the web scraping script.

APIs are not always available. Sometimes you have to scrape data from a webpage yourself. Luckily the modules Pandas and Beautifulsoup can help!



Web scraping

Pandas has a neat concept known as a DataFrame. A DataFrame can hold data and be easily manipulated. We can combine Pandas with Beautifulsoup to quickly get data from a webpage.

Suppose you find a data table on a web page that you would like to work with.

We can convert it to JSON using Pandas.
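A minimal sketch of that conversion. The countries-and-capitals table here is made up as a stand-in for the page; read_html also accepts a URL directly:

```python
from io import StringIO
import pandas as pd

# A stand-in for a table found on the web.
html = """
<table>
  <tr><th>Country</th><th>Capital</th></tr>
  <tr><td>France</td><td>Paris</td></tr>
  <tr><td>Japan</td><td>Tokyo</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> it finds.
df = pd.read_html(StringIO(html))[0]

print(df.to_json(orient="records"))
# prints: [{"Country":"France","Capital":"Paris"},{"Country":"Japan","Capital":"Tokyo"}]
```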

This gives us nicely formatted JSON output that can be viewed in a browser or passed to any tool that consumes JSON.

Web Scraping Using Beautifulsoup

Converting to lists

Rows can also be converted to plain Python lists, and from those lists we can build a dataframe in just a few lines.
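A sketch of the rows-to-lists route, this time parsing the table ourselves with BeautifulSoup (the same made-up countries table stands in for the web page):

```python
from bs4 import BeautifulSoup
import pandas as pd

html = """
<table>
  <tr><th>Country</th><th>Capital</th></tr>
  <tr><td>France</td><td>Paris</td></tr>
  <tr><td>Japan</td><td>Tokyo</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Each <tr> becomes one Python list of cell texts.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
    for tr in soup.find_all("tr")
]

# The first row holds the headers; the rest are data rows.
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
```

Doing it by hand like this gives more control than read_html when the table's markup is messy.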

Pretty print pandas dataframe

You can convert the dataframe to an ASCII table with the module tabulate. This will instantly render the table scraped from the web as an ASCII table in the terminal.