Getting data from the internet using Python libraries: urllib & requests

The web is a rich source of data from which we can extract various types of insights and findings. In this article, we will learn the basics of scraping and parsing web data. We'll also learn how to get data from the web, whether it is stored in files or HTML.

Importing data files from the web

Much of the time as a data scientist, importing data from a local directory won't be enough, because we won't always have the data we need on disk. Often we will need to import it from the web.

How can we import data files from the web?

Suppose, for example, we want to import the Wine Quality dataset from the Machine Learning Repository hosted by the University of California, Irvine. How do we get this file from the web? We could use our web browser of choice to navigate to the relevant URL and click the appropriate hyperlinks to download the file, but this poses a few problems.

Reproducibility and Scalability

  • Firstly, it isn't written in code and so poses reproducibility problems. If another data scientist wanted to reproduce our workflow, they would have to do so outside Python.

  • Secondly, it is not scalable. If we wanted to download one hundred or one thousand such files, it would take one hundred or one thousand times as long, respectively. If we wrote it in code, on the other hand, our workflow could scale.

As reproducibility and scalability are situated at the very heart of Data Science, we're going to learn in this article how to use Python code to import and locally save datasets from the world wide web. We'll also learn how to load such datasets into pandas dataframes directly from the web, whether they are flat files or otherwise. Then we'll place these skills in the wider context of making HTTP requests. In particular, we'll make HTTP GET requests, which means getting data from the web. We'll learn the basics of scraping HTML from the internet and we'll use the wonderful Python package BeautifulSoup to parse the HTML and turn it into data.

Importing data using the urllib and requests packages

There are a number of great packages to help us import web data; here we'll use the urllib and requests packages.

Saving data files from the web using urlretrieve()

The urllib package provides a high-level interface for fetching data across the World Wide Web. Let's now dive directly into importing data from the web with an example, importing the Wine Quality dataset for red wine. We will import 'winequality-red.csv' from the University of California, Irvine's Machine Learning Repository. The data file contains tabular data, the physicochemical properties of red wine, such as pH, alcohol content and citric acid content, along with a wine quality rating. We then use the urlretrieve() function to write the contents of the URL to the local file 'winequality-red.csv'.

The URL of the file is https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv

After we import it, we'll load it into a pandas DataFrame.

  • Assign the URL of the file to the variable url.

  • Use the function urlretrieve() to save the file in the working directory as 'winequality-red.csv'.

  • Using pd.read_csv(), load 'winequality-red.csv' into a pandas DataFrame and display its head. Note that we need to tell pd.read_csv that the data is separated by ;.

from urllib.request import urlretrieve
# Import pandas
import pandas as pd
print('Using pandas version', pd.__version__)
output:
Using pandas version 1.3.5
# Assign url of file: url
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'

# Save file locally
urlretrieve(url, 'winequality-red.csv')
output:
('winequality-red.csv', <http.client.HTTPMessage at 0x7f9a982df3d0>)
# Read file into a DataFrame and print its head
# Note that we need to tell pd.read_csv that data is separated by ;
df = pd.read_csv('winequality-red.csv', sep=';')
print(df.shape)
display(df.head())
output:
(1599, 12)
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5

Opening and reading data files from the web directly using pd.read_csv()

We have just imported a CSV file from the web, saved it locally and loaded it into a DataFrame. If we just want to load a file from the web into a DataFrame without first saving it locally, we can do that easily using pandas. In particular, we can use the function pd.read_csv() with the URL as the first argument and the separator sep as the second argument.

  • Read file into a DataFrame df using pd.read_csv(), recalling that the separator in the file is ';'.

  • Display the head of the DataFrame df.

# Assign url of file: url
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'

# Read file into a DataFrame: df
df = pd.read_csv(url, sep=';')

# display the head of the DataFrame
display(df.head())
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5

We've just loaded a flat data file from the web into a DataFrame without first saving it locally using the pandas function pd.read_csv(). This function is super cool because it has close relatives that allow us to load all types of files, not only CSV files.
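
For example, pd.read_table() is one such relative: it reads general delimited files and accepts a URL in the same way. Here is a minimal sketch using the same wine-quality URL, an equivalent way to load the same data:

# pd.read_table() is a close relative of pd.read_csv() that also accepts URLs;
# passing the same separator reads the wine-quality file directly from the web
df_table = pd.read_table(url, sep=';')
print(df_table.shape)   # expected to match (1599, 12), as before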

Importing excel files from the web using read_excel()

Next, we'll use pd.read_excel() to import an Excel spreadsheet. We will read in all of its sheets, print the sheet names and then print the head of the first sheet using its name, not its index. Note that when we pass None to the sheet_name argument, the output of pd.read_excel() is a Python dictionary with sheet names as keys and the corresponding DataFrames as values.

# Assign the URL of the file to the variable `file_path`.
file_path = 'https://github.com/PyProDev-official/datasets/blob/master/excel/population_data_2019_2020.xlsx?raw=true'

# Read the file in `file_path` into a dictionary `population_data_2019_2020` using `pd.read_excel()` 
# Note that, in order to import all sheets we need to pass `None` to the argument `sheet_name`.
# Read in all sheets of Excel file: population_data_2019_2020
population_data_2019_2020 = pd.read_excel(file_path, sheet_name = None)
print(type(population_data_2019_2020))
print(population_data_2019_2020)
output:
<class 'dict'>
{'2019':      country  population
0      India  1383112050
1    Myanmar    53040210
2   Thailand    71307760
3  Singapore     5703570
4      China  1407745000, '2020':      country  population
0      India  1396387130
1    Myanmar    53423200
2   Thailand    71475660
3  Singapore     5685810
4      China  1411100000}
# Print the names of the sheets in the Excel spreadsheet. 
# These will be the keys of the dictionary `population_data_2019_2020`.
print(population_data_2019_2020.keys())
output:
dict_keys(['2019', '2020'])
print(type(population_data_2019_2020['2019']))
output:
<class 'pandas.core.frame.DataFrame'>
# Print the head of the first sheet (using its name, NOT its index)
print(population_data_2019_2020['2019'].head())
output:
     country  population
0      India  1383112050
1    Myanmar    53040210
2   Thailand    71307760
3  Singapore     5703570
4      China  1407745000
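
Since the result is just a dictionary of DataFrames, we can also loop over the sheets or combine them into a single table. Here is a minimal sketch, assuming the population_data_2019_2020 dictionary from above:

# Combine all sheets into one DataFrame, labelling each row with its sheet name (the year)
frames = []
for sheet_name, sheet_df in population_data_2019_2020.items():
    sheet_df = sheet_df.copy()
    sheet_df['year'] = sheet_name
    frames.append(sheet_df)

combined = pd.concat(frames, ignore_index=True)
print(combined.head())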

HTTP requests to import files from the web

To import files from the web, we used the urlretrieve() function from urllib.request. Let's now understand a few things about how the internet works. URL stands for Uniform (or Universal) Resource Locator; URLs are references to web resources. The vast majority of URLs are web addresses, but they can also refer to a few other things, such as

  • the File Transfer Protocol (FTP) and

  • database access.

We'll currently focus on those URLs that are web addresses, that is, the locations of websites. Such a URL consists of two parts:

  • a protocol identifier, such as http or https, and

  • a resource name, such as wikipedia.org.

The combination of protocol identifier and resource name uniquely specifies the web address.
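
To see these two parts programmatically, we can split a URL with urlparse() from Python's standard urllib.parse module; a minimal sketch:

from urllib.parse import urlparse

parsed = urlparse('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv')
print(parsed.scheme)   # protocol identifier: 'https'
print(parsed.netloc)   # host part of the resource name: 'archive.ics.uci.edu'
print(parsed.path)     # path to the resource on that host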

HTTP

HTTP stands for HyperText Transfer Protocol. Wikipedia provides a great description of it:

"The Hypertext Transfer Protocol (HTTP) is an application layer protocol in the Internet protocol suite model for distributed, collaborative, hypermedia information systems. HTTP is the foundation of data communication for the World Wide Web".

HTTPS

Note that HTTPS is a more secure form of HTTP.

"Hypertext Transfer Protocol Secure (HTTPS) is an extension of the Hypertext Transfer Protocol (HTTP). It uses cryptography for secure communication over a computer network, and is widely used on the Internet. In HTTPS, the communication protocol is encrypted using Transport Layer Security (TLS) or, formerly, Secure Sockets Layer (SSL)."

Each time we go to a website, we are sending an HTTP/HTTPS request to a server. This request is usually a GET request, by far the most common type of HTTP request. We are performing a GET request when using the function urlretrieve(). The convenience of urlretrieve() lies in the fact that it not only makes a GET request but also saves the relevant data locally. In what follows, we'll learn other ways to make GET requests and store web data in our environment. In particular, we'll figure out how to get the HTML data from a webpage. HTML stands for HyperText Markup Language and is the standard markup language for the web.

GET requests using Request()

To extract the HTML from the Wikipedia home page, we

  • import the necessary functions,

  • specify the URL,

  • create a request object using the function Request(),

  • send the request and catch the response using the function urlopen().

This returns an HTTPResponse object, which has an associated read() method. We then apply this read() method to the response, which returns the HTML as a bytes object that we store in the variable html. Finally, we close the response. The urlopen() function is similar to Python's built-in open() function.

from urllib.request import urlopen, Request
url = 'https://www.wikipedia.org/'

request = Request(url)
print(request)

response = urlopen(request)
print(response)
output:
<urllib.request.Request object at 0x7f8d1f5af8e0>
<http.client.HTTPResponse object at 0x7f8d20668040>
html = response.read()
print(html)
# close the response
response.close()
output:
b'<!DOCTYPE html>\n<html lang="en" class="no-js">\n<head>\n<meta charset="utf-8">\n<title>Wikipedia</title>\n<meta name="description" content="Wikipedia is a free online encyclopedia, created and edited by volunteers around the world...
...
...
...
html codes of the web page...
...
...
...
</body>\n</html>\n'
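
Note that read() returns the HTML as bytes (hence the leading b in the output above). If we want an ordinary string, we can decode it; a minimal sketch:

# read() returned bytes; decode them to get a regular Python string
html_str = html.decode('utf-8')
print(type(html_str))   # <class 'str'>
print(html_str[:100])   # first 100 characters of the page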

GET requests using requests.get()

Now we are going to do the same, but this time we'll use the requests package, which provides a wonderful API for making requests. According to the requests package website:

"Requests allows you to send organic, grass-fed HTTP requests, without the need for manual labor."

Moreover, the requests library is one of the most downloaded Python packages of all time. Here, we import the package requests, specify the URL, then package the request, send it and catch the response with the single function requests.get(). Finally, we access the text attribute of the response, which contains the HTML as a string.

Note that, unlike in the previous example using urllib, we don't have to close the connection when using requests.

import requests
url = 'https://www.wikipedia.org/'
r = requests.get(url)
print(r)

text = r.text
print(text)
output:
<Response [200]>
<!DOCTYPE html>
<html lang="en" class="no-js">
<head>
<meta charset="utf-8">
<title>Wikipedia</title>
<meta name="description" content="Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.">...
...
...
...
html codes of the web page...
...
...
...
</style>
<!--<![endif]-->
<!--[if lte IE 9]>
<style>
    .langlist > ul {
        text-align: center;
    }
    .langlist > ul > li {
        display: inline;
        padding: 0 0.5em;
    }
</style>
<![endif]-->
</body>
</html>
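
Before working with r.text, it is often worth checking that the request actually succeeded. A minimal sketch using the response's status code:

# Check that the GET request succeeded before using the response body
r = requests.get('https://www.wikipedia.org/')
print(r.status_code)    # 200 means OK
r.raise_for_status()    # raises an HTTPError for 4xx/5xx responses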

Scraping the web in Python

We have just scraped HTML data from the web, and we've done so using two different packages, urllib and requests. We also saw that requests provided a higher-level interface, in that we needed to write fewer lines of code to retrieve the relevant HTML as a string.

HTML

We've got the HTML of our page of interest but, generally, HTML is a mix of both unstructured and structured data. Structured data is data that has a pre-defined data model or that is organized in a defined manner. Unstructured data is data that does not possess either of these properties. HTML is interesting because, although much of it is unstructured text, it does contain tags that determine where headings and hyperlinks can be found.

Extracting data using BeautifulSoup

In general, to turn HTML that we have scraped from the world wide web into useful data, we'll need to parse it and extract structured data from it. We will see how we can perform such tasks using the Python package BeautifulSoup. Let's check out the BeautifulSoup package's website.

The first words at the top are:

"You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects."

Firstly, why BeautifulSoup? In web development, the term "tag soup" refers to structurally or syntactically incorrect HTML code written for a web page. What Beautiful Soup does best is to make tag soup beautiful again and to extract information from it with ease! The main object created and queried when using this package is the BeautifulSoup object, and it has a very important associated method called prettify().

Now, we will use requests to scrape the HTML from the web, create a BeautifulSoup object from the resulting HTML, and prettify it.

# import BeautifulSoup
from bs4 import BeautifulSoup

url = 'https://www.crummy.com/software/BeautifulSoup/'

r = requests.get(url)
html_doc = r.text
print(type(html_doc))

soup = BeautifulSoup(html_doc, 'html.parser')
print(type(soup))
print(soup)
output:
<class 'str'>
<class 'bs4.BeautifulSoup'>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/transitional.dtd">
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Beautiful Soup: We called him Tortoise because he taught us.</title>
<link href="leonardr@segfault.org" rev="made"/>
<link href="/nb/themes/Default/nb.css" rel="stylesheet" type="text/css"/>
<meta content="Beautiful Soup: a library designed for screen-scraping HTML and XML." name="Description"/>
<meta content="Markov Approximation 1.4 (module: leonardr)" name="generator"/>
<meta content="Leonard Richardson" name="author"/>
</head>...
...
...
...
html codes of the web page...
...
...
...
Site Search:

<form action="/search/" method="get">
<input maxlength="255" name="q" type="text" value=""/>
</form>
</td>
</tr>
</table>
</body>
</html>

Prettify the BeautifulSoup

Comparing the prettified soup with the original HTML, we can see that the prettified soup is indented in the way we would expect properly written HTML to be.

# print the prettified soup
print(soup.prettify())
output:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/transitional.dtd">
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <title>
   Beautiful Soup: We called him Tortoise because he taught us.
  </title>
  <link href="leonardr@segfault.org" rev="made"/>
  <link href="/nb/themes/Default/nb.css" rel="stylesheet" type="text/css"/>
  <meta content="Beautiful Soup: a library designed for screen-scraping HTML and XML." name="Description"/>
  <meta content="Markov Approximation 1.4 (module: leonardr)" name="generator"/>
  <meta content="Leonard Richardson" name="author"/>
 </head>...
...
...
...
html codes of the web pages...
...
...
...
     Site Search:
     <form action="/search/" method="get">
      <input maxlength="255" name="q" type="text" value=""/>
     </form>
    </td>
   </tr>
  </table>
 </body>
</html>

Extracting title using title

We'll explore a few of the attributes and methods that we can apply to our soupified HTML, such as title, which gives the page's title tag.

print(soup.title)
output:
<title>Beautiful Soup: We called him Tortoise because he taught us.</title>
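
If we only want the text inside the tag, rather than the tag itself, we can use the tag's string attribute; a minimal sketch:

# Extract just the text inside the <title> tag
print(soup.title.string)   # Beautiful Soup: We called him Tortoise because he taught us.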

Extracting text using get_text()

The get_text() method extracts the text from the HTML.

print(soup.get_text())
output:



Beautiful Soup: We called him Tortoise because he taught us.








#tidelift { }

#tidelift a {
 border: 1px solid #666666;
 margin-left: auto;
 padding: 10px;
 text-decoration: none;
}

#tidelift .cta {
 background: url("tidelift.svg") no-repeat;
 padding-left: 30px;
}


[ Download | Documentation | Hall of Fame | For enterprise | Source | Changelog | Discussion group  | Zine ]

Beautiful Soup

You didn't write that awful page. You're just trying to get some
data out of it. Beautiful Soup is here to help. Since 2004, it's been
saving programmers hours or days of work on quick-turnaround
screen scraping projects.
Beautiful Soup is a Python library designed for...
...
...
...
texts in the web page
...
...
...
Development
Development happens at Launchpad. You can get the source
code or file
bugs.
This document (source) is part of Crummy, the webspace of Leonard Richardson (contact information). It was last modified on Monday, June 27 2022, 15:36:35 Nowhere Standard Time and last built on Monday, January 02 2023, 17:00:01 Nowhere Standard Time.Crummy is © 1996-2023 Leonard Richardson. Unless otherwise noted, all text licensed under a Creative Commons License.Document tree:
http://www.crummy.com/software/BeautifulSoup/




Site Search:
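
The output above contains many blank lines. get_text() also accepts separator and strip arguments that tidy this up; a minimal sketch:

# Put each piece of text on its own line and strip surrounding whitespace
text = soup.get_text(separator='\n', strip=True)
print(text[:200])   # first 200 characters of the cleaned-up text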

Extracting URLs using find_all()

We can also use the method find_all() to extract the URLs of all of the hyperlinks in the HTML.

for link in soup.find_all('a'):
    print(link.get('href'))
output:
#Download
bs4/doc/
#HallOfFame
enterprise.html
https://code.launchpad.net/beautifulsoup
https://bazaar.launchpad.net/%7Eleonardr/beautifulsoup/bs4/view/head:/CHANGELOG
https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup
zine/
bs4/download/
http://lxml.de/
http://code.google.com/p/html5lib/
bs4/doc/
https://tidelift.com/subscription/pkg/pypi-beautifulsoup4?utm_source=pypi-beautifulsoup4&utm_medium=referral&utm_campaign=enterprise
https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup
https://bugs.launchpad.net/beautifulsoup/
https://tidelift.com/security
https://tidelift.com/subscription/pkg/pypi-beautifulsoup4?utm_source=pypi-beautifulsoup4&utm_medium=referral&utm_campaign=website
zine/
None
bs4/download/
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
download/3.x/BeautifulSoup-3.2.2.tar.gz
https://tidelift.com/subscription/pkg/pypi-beautifulsoup?utm_source=pypi-beautifulsoup&utm_medium=referral&utm_campaign=website
None
http://www.nytimes.com/2007/10/25/arts/design/25vide.html
https://github.com/BlankerL/DXY-COVID-19-Crawler
https://blog.tidelift.com/how-open-source-software-is-fighting-covid-19
https://github.com/reddit/reddit/blob/85f9cff3e2ab9bb8f19b96acd8da4ebacc079f04/r2/r2/lib/media.py
http://www.harrowell.org.uk/viktormap.html
http://svn.python.org/view/tracker/importer/
http://www2.ljworld.com/
http://www.b-list.org/weblog/2010/nov/02/news-done-broke/
http://esrl.noaa.gov/gsd/fab/
http://laps.noaa.gov/topograbber/
http://groups.google.com/group/beautifulsoup/
https://launchpad.net/beautifulsoup
https://code.launchpad.net/beautifulsoup/
https://bugs.launchpad.net/beautifulsoup/
/source/software/BeautifulSoup/index.bhtml
/self/
/self/contact.html
http://creativecommons.org/licenses/by-sa/2.0/
http://creativecommons.org/licenses/by-sa/2.0/
http://www.crummy.com/
http://www.crummy.com/software/
http://www.crummy.com/software/BeautifulSoup/

Awesome. We can now see that BeautifulSoup is handy for getting data out of web pages. Check the BeautifulSoup documentation for other methods.
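
For instance, find_all() accepts tag names other than 'a', as well as attribute filters; a minimal sketch (the filters here are just illustrations):

import re

# Print the text of every <p> tag on the page
for paragraph in soup.find_all('p'):
    print(paragraph.get_text())

# Print only the hyperlinks whose href is an absolute http/https URL
for link in soup.find_all('a', href=re.compile('^https?://')):
    print(link['href'])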

Conclusion

In this article, we learned how to get and import data from the web, whether it is stored in files or HTML. We also learned the basics of scraping and parsing web data.


#python #pandas #urllib #requests #web-scraping