The Princess and the Goblin
Scraping up Goblins
It's time to put Python to some more "applied", hands-on use.
Imagine, for a moment, that the internet were nothing more than one huge dungeon containing countless smaller dungeons, each with several rooms holding various treasure chests. Each treasure chest is analogous to a single web page, each smaller dungeon to a web site or network, and the one big dungeon (perhaps the one to rule them all) to the internet as a whole.
Now, we dungeon crawlers like to explore dungeons for the purpose of finding treasure and such. This is similar to "surfing the web." Just as we want to scour a dungeon for treasure, we want to scour the internet for data (text, video, audio, and so on). The process of picking up treasure on the internet is sometimes referred to as "scraping."
Let's look at a brief, simple Python script that does a bit of scraping.
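Here is a minimal sketch of such a script, using only Python's standard library. Two things in it are assumptions made for illustration: the URL (check Project Gutenberg for the current plain-text link to the book) and the function name scrape, which wraps the steps so that nothing reaches out to the internet until you call it.

```python
from urllib import request

# Assumed link to the plain-text edition on Project Gutenberg;
# swap in the current link if this one has moved.
url = "https://www.gutenberg.org/files/708/708-0.txt"


def scrape(url):
    # Ask the internet for whatever lives at the given URL.
    response = request.urlopen(url)
    # Read the bytes that came back and decode them into text.
    raw = response.read().decode('utf8')
    # Show the first 75 characters of that text.
    print(raw[:75])
    return raw
```

To run the scrape, add scrape(url) as the last line of the script.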
That is a very rudimentary scrape. As you will see in future lessons, there are much more sophisticated ways to scrape. For now, let's just figure out what's going on in the above script.
First, we import the urllib Python module (in Python 3, the piece we need lives in urllib.request). This module has several functions designed to help us reach out and interact with the internet. We then create a variable called "url" which contains (in our case) the URL of Project Gutenberg's copy of The Princess and the Goblin by George MacDonald (one of my favorite books by one of my favorite authors). Then, we create a variable called "response" which will hold what comes back from the location indicated in our url variable. Finally, we create a variable called "raw" which will contain the decoded data from the "response" variable, and then we print the first 75 characters of that decoded data.
To get more information on the urllib functions, head on over to the official Python Docs. But first...
- Change the final line in the script so that the first 100 characters of The Princess and the Goblin are printed. Note: the ":" you see in the print statement is used for slicing. Head over to the Python Docs to find more info on slicing.
- Look around over at Project Gutenberg and find another book you'd like to scrape up off the internet. Go ahead and modify the script to scrape that book.
- What happens if you get rid of the 'utf8' inside the creation of the "raw" variable and run your script?
- What happens to your output if you change the print line to just print(raw)?
- Research more on slicing. How would you change your print line so that it prints only the last 100 characters of whatever document you're scraping?
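For the slicing exercises above, it can help to poke at a short string before turning to a whole book. The sketch below uses a made-up stand-in for raw (the variable name sample and its contents are assumptions for illustration) and only slices counting from the front; slicing from the end is left for you to research.

```python
# A made-up stand-in for the much longer text held in "raw".
sample = "The Princess and the Goblin"

print(sample[:3])     # the first 3 characters: The
print(sample[4:12])   # characters 4 up to (but not including) 12: Princess
print(len(sample))    # how many characters there are in total: 27
```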