Web scraping with Python using Beautiful Soup


Load in packages


# Web scraping packages
from bs4 import BeautifulSoup
import requests
# pandas/NumPy for data manipulation
import pandas as pd
import numpy as np

Load the URLs we want to scrape into a list

# Load the URLs we want to scrape into a list
BASE_URL = [
'http://www.reuters.com/finance/stocks/company-officers/GOOG.O',
'http://www.reuters.com/finance/stocks/company-officers/AMZN',
'http://www.reuters.com/finance/stocks/company-officers/AAPL'
]
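
If the list of companies grows, the URLs can be built from ticker symbols instead of typed out by hand. Here is a minimal sketch, assuming every page follows the same company-officers pattern as the three URLs above. The names `OFFICERS_URL` and `TICKERS` are my own placeholders, not part of the original code.

```python
# Hypothetical names for illustration; the URL pattern is copied from the list above
OFFICERS_URL = 'http://www.reuters.com/finance/stocks/company-officers/{}'
TICKERS = ['GOOG.O', 'AMZN', 'AAPL']

# Build one URL per ticker symbol
BASE_URL = [OFFICERS_URL.format(ticker) for ticker in TICKERS]
print(BASE_URL)
```

Adding a company then only means adding its ticker to `TICKERS`.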

Loop through our URLs, scrape each table, and append the rows to a list

# Empty list to collect the board members
board_members = []
# Loop through the URLs we loaded above
for b in BASE_URL:
    html = requests.get(b).text
    soup = BeautifulSoup(html, "html.parser")
    # Identify the table we want to scrape
    officer_table = soup.find('table', {"class": "dataTable"})
    # Skip any companies with a missing/empty board member table
    # (soup.find returns None when no matching table is found)
    if officer_table is None:
        continue
    # Loop through the table and grab each of the 4 columns shown
    # (try one of the links yourself to see the layout)
    for row in officer_table.find_all('tr'):
        cols = row.find_all('td')
        if len(cols) == 4:
            board_members.append((b, cols[0].text.strip(), cols[1].text.strip(), cols[2].text.strip(), cols[3].text.strip()))
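
To see what the row filter is doing without hitting the network, the same parsing steps can be run on a small HTML string. This is only a sketch: `SAMPLE_HTML` is made-up markup (not the real Reuters page), and `parse_officer_rows` is a hypothetical helper name I'm using for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page (not real Reuters markup)
SAMPLE_HTML = """
<table class="dataTable">
  <tr><th>Name</th><th>Age</th><th>Since</th><th>Current Position</th></tr>
  <tr><td> Jane Doe </td><td>50</td><td>2010</td><td>Director</td></tr>
  <tr><td>John Roe</td><td>61</td><td>2015</td><td>Chairman</td></tr>
  <tr><td colspan="4">Footnote row that should be skipped</td></tr>
</table>
"""

def parse_officer_rows(html, url):
    """Return one (url, name, age, year, title) tuple per 4-cell row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find('table', {"class": "dataTable"})
    if table is None:  # no matching table on the page
        return []
    rows = []
    for row in table.find_all('tr'):
        cols = row.find_all('td')
        # Header rows use <th>, so they contain zero <td> cells and are skipped
        if len(cols) == 4:
            rows.append((url,) + tuple(col.text.strip() for col in cols))
    return rows

sample_rows = parse_officer_rows(SAMPLE_HTML, 'http://example.com/XYZ')
print(sample_rows)

# A page without the table simply yields no rows
empty_rows = parse_officer_rows('<p>No table here</p>', 'http://example.com/ABC')
print(empty_rows)
```

Only the two rows with exactly four `<td>` cells come back; the header row and the single-cell footnote row are dropped.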

Create new array, check length to ensure things pulled in correctly

#convert output to new array, check length
board_array = np.asarray(board_members)
len(board_array)

49
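
The total length alone won't tell you whether one company was silently skipped. A quick per-URL count can help; this is a sketch on made-up rows in the same shape as `board_members`, and `sample_members`, `sample_urls`, `rows_per_url`, and `missing` are names I've invented for the example.

```python
from collections import Counter

# Made-up rows in the same (URL, Name, Age, Year_Joined, Title) shape as board_members
sample_members = [
    ('http://example.com/AAA', 'Jane Doe', '50', '2010', 'Director'),
    ('http://example.com/AAA', 'John Roe', '61', '2015', 'Chairman'),
    ('http://example.com/BBB', 'Ann Poe', '47', '2012', 'Director'),
]
sample_urls = [
    'http://example.com/AAA',
    'http://example.com/BBB',
    'http://example.com/CCC',
]

# Count the scraped rows per URL
rows_per_url = Counter(row[0] for row in sample_members)
# URLs that contributed no rows at all (missing or empty tables)
missing = [url for url in sample_urls if rows_per_url[url] == 0]
print(rows_per_url)
print(missing)
```

With the real data you would pass `board_members` and `BASE_URL` in place of the sample lists.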


Convert new array to dataframe

#convert new array to dataframe
df = pd.DataFrame(board_array)

Rename columns, preview output

#rename columns, check output
df.columns = ['URL', 'Name', 'Age','Year_Joined', 'Title']
df.head(10)
   URL                                                Name            Age  Year_Joined  Title
0  http://www.reuters.com/finance/stocks/company-...  Eric Schmidt    61   2015         Executive Chairman of the Board of Director
1  http://www.reuters.com/finance/stocks/company-...  Sergey Brin     43   2015         President, Director
2  http://www.reuters.com/finance/stocks/company-...  Lawrence Page   44   2015         Chief Executive Officer, Director
3  http://www.reuters.com/finance/stocks/company-...  Ruth Porat      59   2015         Chief Financial Officer, Senior Vice President
4  http://www.reuters.com/finance/stocks/company-...  Sundar Pichai   45   2017         Director, Chief Executive Officer, Google Inc.
5  http://www.reuters.com/finance/stocks/company-...  David Drummond  54   2015         Senior Vice President - Corporate Development,...
6  http://www.reuters.com/finance/stocks/company-...  John Hennessy   64   2007         Lead Independent Director
7  http://www.reuters.com/finance/stocks/company-...  Diane Greene    61   2015         Director
8  http://www.reuters.com/finance/stocks/company-...  L. John Doerr   65   2016         Independent Director
9  http://www.reuters.com/finance/stocks/company-...  Roger Ferguson  65   2016         Independent Director
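
Everything scraped this way comes in as text, including Age and Year_Joined. If you want to do math on those columns, one option is to build the dataframe straight from the list of tuples and convert them afterwards. A rough sketch on made-up rows follows; `sample_members` and `sample_df` are placeholder names, and the blank age is there only to show how missing values behave.

```python
import pandas as pd

# Made-up rows in the same shape as the scraped tuples; one age is blank on purpose
sample_members = [
    ('http://example.com/AAA', 'Jane Doe', '50', '2010', 'Director'),
    ('http://example.com/AAA', 'John Roe', '', '2015', 'Chairman'),
    ('http://example.com/BBB', 'Ann Poe', '47', '2012', 'Director'),
]
columns = ['URL', 'Name', 'Age', 'Year_Joined', 'Title']

# Build the dataframe and name the columns in one step
sample_df = pd.DataFrame(sample_members, columns=columns)

# Convert the numeric columns; errors='coerce' turns blanks or other
# non-numeric text into NaN instead of raising an error
for col in ['Age', 'Year_Joined']:
    sample_df[col] = pd.to_numeric(sample_df[col], errors='coerce')

print(sample_df.dtypes)
```

Passing `columns=` to `pd.DataFrame` does the same job as assigning `df.columns` after the fact.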

Export data to CSV

#export data
df.to_csv('/Users/yourname/desktop/board_members.csv')
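
By default `to_csv` also writes the row index (0, 1, 2, ...) as the first column. If you don't want that in the file, `index=False` leaves it out. Here is a small sketch that writes a made-up one-row dataframe to a temporary folder and reads it back; `sample_df`, `out_path`, and `round_trip` are names I've picked for the example.

```python
import os
import tempfile

import pandas as pd

# Made-up one-row dataframe with the same columns as df
sample_df = pd.DataFrame(
    [('http://example.com/AAA', 'Jane Doe', 50, 2010, 'Director')],
    columns=['URL', 'Name', 'Age', 'Year_Joined', 'Title'],
)

# Write to a temporary folder so the sketch runs on any machine
out_path = os.path.join(tempfile.mkdtemp(), 'board_members.csv')
# index=False leaves out the 0, 1, 2, ... row labels
sample_df.to_csv(out_path, index=False)

# Read the file back to confirm the columns survived the round trip
round_trip = pd.read_csv(out_path)
print(round_trip.columns.tolist())
```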


That's it! If you're interested in seeing how I used this data, check out my visualization on the interconnectedness of companies through shared board members here.





