Web scraping example using Python and Beautiful Soup

July 2019

Load in packages


#Packages
#--Web scraping packages
from bs4 import BeautifulSoup
import requests
#Pandas/numpy for data manipulation
import pandas as pd
import numpy as np

Load the URLs we want to scrape into a list

#load the URLs we want to scrape into a list
BASE_URL = [
'http://www.reuters.com/finance/stocks/company-officers/GOOG.O',
'http://www.reuters.com/finance/stocks/company-officers/AMZN',
'http://www.reuters.com/finance/stocks/company-officers/AAPL'
]
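
Each URL above ends in the company's Reuters symbol, so a short label can be derived from it instead of carrying the full URL around with every row. A minimal sketch, where `ticker_from_url` is a hypothetical helper name that is not part of the original script:

```python
def ticker_from_url(url):
    """Return the last path segment of a company-officers URL."""
    # rstrip guards against a trailing slash; rsplit keeps only the final segment
    return url.rstrip("/").rsplit("/", 1)[-1]

urls = [
    "http://www.reuters.com/finance/stocks/company-officers/GOOG.O",
    "http://www.reuters.com/finance/stocks/company-officers/AMZN",
    "http://www.reuters.com/finance/stocks/company-officers/AAPL",
]
tickers = [ticker_from_url(u) for u in urls]
```

Storing the symbol alongside (or instead of) the URL would make the final table easier to read, at the cost of one extra step in the loop.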

Loop through our URLs, scrape each table, and collect the rows in a list

#empty list to hold board members
board_members = []
#Loop through our URLs we loaded above
for b in BASE_URL:
    html = requests.get(b).text
    soup = BeautifulSoup(html, "html.parser")
    #identify table we want to scrape
    officer_table = soup.find('table', {"class": "dataTable"})

    #try clause to skip any companies with missing/empty board member tables
    try:
        #loop through table, grab each of the 4 columns shown (try one of the links yourself to see the layout)
        for row in officer_table.find_all('tr'):
            cols = row.find_all('td')
            if len(cols) == 4:
                board_members.append((
                    b,
                    cols[0].text.strip(),
                    cols[1].text.strip(),
                    cols[2].text.strip(),
                    cols[3].text.strip(),
                ))
    #officer_table is None when the page has no matching table
    except AttributeError:
        pass
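
To see why the loop keeps only rows with four cells, and what happens when a page has no matching table, here is a minimal sketch that runs the same extraction on a small inline HTML string instead of a live page. The sample markup, the names in it, and the helper `extract_rows` are assumptions for illustration; the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a scraped page
SAMPLE_HTML = """
<table class="dataTable">
  <tr><th>Name</th><th>Age</th><th>Since</th><th>Current Position</th></tr>
  <tr><td> Jane Doe </td><td>50</td><td>2010</td><td>Director</td></tr>
  <tr><td>John Roe</td><td>61</td><td>2015</td><td>Chairman</td></tr>
</table>
"""

def extract_rows(html, source):
    """Return one tuple per 4-cell row of the first 'dataTable' table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "dataTable"})
    if table is None:  # find() returns None when nothing matches
        return []
    rows = []
    for row in table.find_all("tr"):
        cols = row.find_all("td")
        # Header rows use <th>, so they yield zero <td> cells and are skipped
        if len(cols) == 4:
            rows.append((source, *(c.text.strip() for c in cols)))
    return rows

rows = extract_rows(SAMPLE_HTML, "sample")
no_table = extract_rows("<p>No table here</p>", "sample")
```

Checking for `None` explicitly is one alternative to the try clause: a page with no table simply contributes no rows, and any other error still surfaces.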

Create new array, check length to ensure things pulled in correctly

#convert output to new array, check length
board_array = np.asarray(board_members)
len(board_array)

Convert new array to dataframe

#convert new array to dataframe
df = pd.DataFrame(board_array)

Rename columns, preview output

#rename columns, check output
df.columns = ['URL', 'Name', 'Age', 'Year_Joined', 'Title']
df.head(10)

	URL	Name	Age	Year_Joined	Title
0	http://www.reuters.com/finance/stocks/company-...	Eric Schmidt	61	2015	Executive Chairman of the Board of Director
1	http://www.reuters.com/finance/stocks/company-...	Sergey Brin	43	2015	President, Director
2	http://www.reuters.com/finance/stocks/company-...	Lawrence Page	44	2015	Chief Executive Officer, Director
3	http://www.reuters.com/finance/stocks/company-...	Ruth Porat	59	2015	Chief Financial Officer, Senior Vice President
4	http://www.reuters.com/finance/stocks/company-...	Sundar Pichai	45	2017	Director, Chief Executive Officer, Google Inc.
5	http://www.reuters.com/finance/stocks/company-...	David Drummond	54	2015	Senior Vice President - Corporate Development,...
6	http://www.reuters.com/finance/stocks/company-...	John Hennessy	64	2007	Lead Independent Director
7	http://www.reuters.com/finance/stocks/company-...	Diane Greene	61	2015	Director
8	http://www.reuters.com/finance/stocks/company-...	L. John Doerr	65	2016	Independent Director
9	http://www.reuters.com/finance/stocks/company-...	Roger Ferguson	65	2016	Independent Director
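
Every scraped value is a string, so the Age and Year_Joined columns in the frame above hold text rather than numbers. A minimal sketch, using two made-up rows in place of scraped data, of building a frame straight from the list of tuples and converting those two columns with `pd.to_numeric`. The names `sample_rows` and `sample_df` are hypothetical.

```python
import pandas as pd

# Made-up rows shaped like the scraped tuples (all values are strings)
sample_rows = [
    ("sample-url", "Jane Doe", "50", "2010", "Director"),
    ("sample-url", "John Roe", "--", "2015", "Chairman"),
]

columns = ["URL", "Name", "Age", "Year_Joined", "Title"]
# A list of tuples can be passed to DataFrame directly, with column names
sample_df = pd.DataFrame(sample_rows, columns=columns)

# errors="coerce" turns unparseable text such as "--" into NaN
for col in ["Age", "Year_Joined"]:
    sample_df[col] = pd.to_numeric(sample_df[col], errors="coerce")
```

With numeric columns, operations such as sorting by age or averaging tenure behave as expected instead of comparing strings.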

Export data to CSV

#export data
df.to_csv('/Users/yourname/desktop/board_members.csv')
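
By default `to_csv` also writes the row index as an unnamed first column. A small sketch, assuming you do not need that index, that writes to a temporary directory (a stand-in for your own output path) and reads the file back to check the columns. The frame contents are made up.

```python
import os
import tempfile

import pandas as pd

sample_df = pd.DataFrame(
    [("sample-url", "Jane Doe", 50, 2010, "Director")],
    columns=["URL", "Name", "Age", "Year_Joined", "Title"],
)

# Temporary directory used as a stand-in for a real output folder
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "board_members.csv")

# index=False leaves the 0, 1, 2, ... row labels out of the file
sample_df.to_csv(out_path, index=False)

# Read the file back to confirm what was written
round_trip = pd.read_csv(out_path)
```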

That's it! If you're interested in seeing how I used this data, check out my visualization of the interconnectedness of companies through shared board members here.