
File Conversion

This guide helps you convert files to and from different formats for your project needs.

HTML Parsing Using Python

What is HTML? 

  • HyperText Markup Language
  • Allows website developers to create and structure sections, paragraphs, and links on a webpage using elements, tags, and attributes
  • Every website we browse is coded using HTML; the HTML contains the data on the website 
  • We can use Python to get, or “scrape,” information from a website

Example of HTML for a Jeopardy Website: 

Download jeopardy.html


Parsing an HTML File in Python

To scrape information from the HTML of a webpage, we can use BeautifulSoup, a Python library for pulling data out of HTML files. It makes it easy to walk the HTML tree and extract information from specific tags.

1. Import BeautifulSoup from the bs4 package (installed as beautifulsoup4).

2. Create a BeautifulSoup object and tell it to parse the HTML in the file passed in. In this case, let’s use “jeopardy.html”, the HTML file of the Jeopardy website in the example above. (Website: https://thejeopardyfan.com/statistics/the-300-club)

3. There are many methods and attributes we can now use to pull data from certain tags in the HTML file.

Method | What the Method Does
-------|---------------------
soup.head | Returns the head tag as a tag object
soup.body | Returns a tag object representing the first body tag
soup.find("tag") or soup.find("tagName", {"tag-attribute": "attribute-value"}) | Returns the first instance of the given tag as a tag object. Commonly used to pull the first tag of a given class in this format: soup.find("tagName", {"class": "value"})
soup.find_all("tag") or soup.find_all("tagName", {"tag-attribute": "attribute-value"}) | Returns a list of tag objects containing every instance of the given tag. Commonly used to pull all tags of the same class in this format: soup.find_all("tagName", {"class": "value"}). findAll() is an older alias for find_all().
tag.text | Returns a string containing the text in a given tag
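
Steps 1 to 3 can be sketched as follows. This is a minimal illustration, not the guide's original listing: the HTML string is a small hypothetical stand-in for jeopardy.html (the player names, scores, and page title are invented), and "html.parser" is assumed as the parser because it ships with Python.

```python
from bs4 import BeautifulSoup  # Step 1: import BeautifulSoup

# Hypothetical stand-in for jeopardy.html; the real file is much larger.
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <table>
      <tr><td class="column-1">Player One*</td><td class="column-2">10</td></tr>
      <tr><td class="column-1"> Player Two </td><td class="column-2">7</td></tr>
    </table>
  </body>
</html>
"""

# Step 2: build the soup. To parse the downloaded file instead, one option is:
#     with open("jeopardy.html", encoding="utf-8") as f:
#         soup = BeautifulSoup(f, "html.parser")
soup = BeautifulSoup(sample_html, "html.parser")

# Step 3: pull data out of specific tags.
head_tag = soup.head                                   # the <head> tag object
body_tag = soup.body                                   # the <body> tag object
first_cell = soup.find("td")                           # first <td> in the document
first_score = soup.find("td", {"class": "column-2"})   # first <td> with class "column-2"
all_cells = soup.find_all("td")                        # list of every <td>

print(head_tag.title.text)
print(first_cell.text)
print(first_score.text)
print(len(all_cells))
```

Passing a string and passing an open file handle both work with the BeautifulSoup constructor, so the same calls apply once the real file is loaded.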

 

For our example, let’s say we want to pull the names of all the Jeopardy players on the website. We can use soup.findAll() to accomplish this, but first we have to figure out what to pass to it. We do this by looking at the page’s HTML.

In the HTML, each player name sits in a “td” tag whose class is “column-1”. Using this information, we call findAll() with “td” as the tag name and {“class” : “column-1”} as the attribute filter.
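
A minimal sketch of that call, assuming a hypothetical two-row table in place of the real jeopardy.html (the names and the “column-2” cells are invented for illustration). It uses find_all(), the current spelling of findAll().

```python
from bs4 import BeautifulSoup

# Hypothetical two-row stand-in for the table in jeopardy.html.
sample_html = """<table>
<tr><td class="column-1">Player One*</td><td class="column-2">10</td></tr>
<tr><td class="column-1"> Player Two </td><td class="column-2">7</td></tr>
</table>"""

soup = BeautifulSoup(sample_html, "html.parser")

# Every <td> whose class is "column-1" (the player-name cells).
name_cells = soup.find_all("td", {"class": "column-1"})

# Prints the matching tags themselves, markup included, not bare strings.
print(name_cells)
```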

Printing the result shows that it is a list of tag objects. The list contains the names of the Jeopardy players, but since findAll() populates the list with tag objects and not just the name strings themselves, we must clean it up ourselves:

  1. tag.text gives us just the string text in each tag (the names)
  2. The strip() method removes leading and trailing whitespace, and strip("*") removes the asterisks present after some of the names
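
The two clean-up steps above can be sketched as a list comprehension. The sample HTML and player names are hypothetical stand-ins for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the name cells in jeopardy.html.
sample_html = """<table>
<tr><td class="column-1">Player One*</td></tr>
<tr><td class="column-1">  Player Two  </td></tr>
</table>"""

soup = BeautifulSoup(sample_html, "html.parser")
name_cells = soup.find_all("td", {"class": "column-1"})

# .text extracts the string, strip() trims surrounding whitespace,
# and strip("*") then trims leading and trailing asterisks.
names = [cell.text.strip().strip("*") for cell in name_cells]

print(names)  # ['Player One', 'Player Two']
```

Whitespace is stripped first so that an asterisk followed by trailing spaces still ends up at the edge of the string, where strip("*") can remove it.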

Once these changes are made, the result is a list of the names from the HTML file as plain strings.

 

Full Code
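
A minimal sketch that combines the steps above into one function. The function name extract_player_names is hypothetical, and the file is assumed to be UTF-8 encoded and parsed with Python's built-in "html.parser".

```python
from pathlib import Path

from bs4 import BeautifulSoup


def extract_player_names(html_path):
    """Return the cleaned text of every <td class="column-1"> cell in the file."""
    # Assumption: the HTML file is UTF-8 encoded.
    html = Path(html_path).read_text(encoding="utf-8")
    soup = BeautifulSoup(html, "html.parser")

    # The player names sit in <td> tags with class "column-1".
    cells = soup.find_all("td", {"class": "column-1"})

    # Keep only the text, trimmed of whitespace and surrounding asterisks.
    return [cell.text.strip().strip("*") for cell in cells]


# Example call, assuming jeopardy.html is in the current directory:
#     for name in extract_player_names("jeopardy.html"):
#         print(name)
```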