Web Scraping Hacker News Using VBA
Overview
See how to use VBA to pull stories from Hacker News into Excel worksheet.
The source code is available on GitHub.
High-level plan:
- Check which HTML tag contains stories,
- Get website contents,
- Traverse through it,
- Display results.
1. Inspect HTML
Open Chrome DevTools by pressing F12
or Ctrl + Shift + I
.
After inspection we found out that:
- third table on the website contains stories,
a.storylink
element contains title and link of a story.
2. Set references
In a new Excel worksheet, open a VBA editor (ALT+F11
), go to Tools -> References
and select:
- Microsoft HTML Object Library
- Microsoft XML, v6.0
Objects can be early bound or late bound. In early binding, objects are created at compile time. In late binding at run time.
Therefore early binding is considerably faster than late binding during run-time.
If you want to use late binding, you need to declare a variable of Object type, then invoke CreateObject(“YOUR_LIBRARY”)
method to create an instance of that object.
It is also worth noting that early binding enables Intellisense meaning that VB Editor starts displaying the object methods and properties after pressing dot.
3. Main procedure
The Main method is the entry point that invokes all other procedures.
Public Sub Main()
Dim htmlDocument As htmlDocument
Dim tableWithStories As HTMLTable
Dim rowsWithStories As IHTMLElementCollection
Dim storylinkAnchors As Collection
Dim output As Variant
' Get website
Set htmlDocument = GetHtmlDocument("https://news.ycombinator.com/news")
' Extract data
Set tableWithStories = GetTableWithStories(htmlDocument)
Set rowsWithStories = GetRowsWithStories(tableWithStories)
Set storylinkAnchors = GetStorylinkAnchors(rowsWithStories)
' Display
Let output = BuildOutput(storylinkAnchors)
Call Display(output)
End Sub
4. Get entire HTML document
To send an HTTP request, we create an ServerXMLHTTP60
object, open
the URL, and send
the request.
Next, we check the HTTP status code of the HTTP response - if it’s 200, it’s successful.
After transaction completes, responseText
property will have a string containing the raw HTML. This could be difficult to manipulate.
That’s why, we inject the responseText
content into the body of a html document object.
Private Function GetHtmlDocument(url As String) As htmlDocument
Dim http As ServerXMLHTTP60
Dim htmlDoc As htmlDocument
Set http = New ServerXMLHTTP60
Set htmlDoc = New htmlDocument
With http
' Send an HTTP request
.Open bstrMethod:="GET", bstrUrl:=url, varAsync:=False
.send
' Check if the request has succeeded
If (.Status = 200) Then
' Inject raw HTML string into HTML document object
htmlDoc.body.innerHTML = .responseText
End If
End With
' Return
Set GetHtmlDocument = htmlDoc
End Function
I use ServerXMLHTTP and not XMLHTTP despite the latter being designed for client applications.
The XMLHTTP provides caching that makes extracting data with frequent changes problematic.
If you want to know more, see these links:
5. Get table
Get table with stories. It’s the third <table>
out of four available on the website.
Private Function GetTableWithStories(htmlDoc As htmlDocument) As HTMLTable
Dim allTables As IHTMLElementCollection
' Get all available HTML tables
Set allTables = htmlDoc.getElementsByTagName("table")
' Return the 3rd table containing Hacker News stories
Set GetTableWithStories = allTables.Item(2)
End Function
6. Get rows
Having the right table, get all tr.athing
rows containing a.storylink
elements with stories.
Private Function GetRowsWithStories(tableWithStories As HTMLTable) As IHTMLElementCollection
' <tr> rows with stories on Hacker News have class "athing"
Set GetRowsWithStories = tableWithStories.getElementsByClassName("athing")
End Function
7. Get stories
Create a collection of a.storylink
elements with titles and links.
Performance-wise, always use the For Each loop when looping through a collection.
For more details, see this excellent post: VBA For loop vs For Each loop.
Private Function GetStorylinkAnchors(tableRowsWithStories As IHTMLElementCollection) As Collection
Dim storyRow As HTMLTableRow
Dim storylink As HTMLAnchorElement
Dim storylinksCollection As Collection
' Create empty collection to hold <a> elements with title and link
Set storylinksCollection = New Collection
' Extract <a> elements with class "storylink" from <tr> rows
For Each storyRow In tableRowsWithStories
Set storylink = storyRow.getElementsByClassName("storylink").Item(0)
storylinksCollection.Add Item:=storylink
Next storyRow
' Return
Set GetStorylinkAnchors = storylinksCollection
End Function
8. Build output
To display collection from the previous step in the worksheet, it must be converted to an array first.
It is far more efficient to transfer one array of 100 values to the worksheet than access Range
100 times for the read and writes (see: VBA Arrays And Worksheet Ranges by Chip Pearson).
To put it simpler, avoid using Range
wherever possible to achieve best performance.
Private Function BuildOutput(storylinks As Collection) As Variant
Dim output As Variant
Dim storylink As HTMLAnchorElement
Dim rowIndex As Long
ReDim output(1 To storylinks.Count, 1 To 2)
' Convert a collection to array
For Each storylink In storylinks
Let rowIndex = rowIndex + 1
Let output(rowIndex, 1) = storylink.textContent
Let output(rowIndex, 2) = storylink.href
Next storylink
' Return
Let BuildOutput = output
End Function
9. Display
Write array back to worksheet.
Public Sub Display(output As Variant)
Dim targetWorksheet As Worksheet
Dim rowsQuantity As Long
Dim columnsQuantity As Long
Set targetWorksheet = ThisWorkbook.Worksheets(1)
Let rowsQuantity = UBound(output, 1)
Let columnsQuantity = 2
' Clear and return the array to the Range
With targetWorksheet
.UsedRange.Delete
.UsedRange.Clear
.Range(.Cells(1, 1), .Cells(rowsQuantity, columnsQuantity)) = output
End With
End Sub
End result
Run the Main method.
Your worksheet should now contain Hacker News stories titles in one column and links in the other.