See how to use VBA to pull stories from Hacker News into Excel worksheet.
The source code is available on GitHub.
- Check which HTML tag contains stories,
- Get website contents,
- Traverse through it,
- Display results.
1. Inspect HTML
Open Chrome DevTools by pressing
Ctrl + Shift + I.
After inspection we found out that:
- third table on the website contains stories,
a.storylinkelement contains title and link of a story.
2. Set references
In a new Excel worksheet, open a VBA editor (
ALT+F11), go to
Tools -> References and select:
- Microsoft HTML Object Library
- Microsoft XML, v6.0
Objects can be early bound or late bound. In early binding, objects are created at compile time. In late binding at run time.
Therefore early binding is considerably faster than late binding during run-time.
If you want to use late binding, you need to declare a variable of Object type, then invoke
CreateObject(“YOUR_LIBRARY”) method to create an instance of that object.
It is also worth noting that early binding enables Intellisense meaning that VB Editor starts displaying the object methods and properties after pressing dot.
3. Main procedure
The Main method is the entry point that invokes all other procedures.
Public Sub Main() Dim htmlDocument As htmlDocument Dim tableWithStories As HTMLTable Dim rowsWithStories As IHTMLElementCollection Dim storylinkAnchors As Collection Dim output As Variant ' Get website Set htmlDocument = GetHtmlDocument("https://news.ycombinator.com/news") ' Extract data Set tableWithStories = GetTableWithStories(htmlDocument) Set rowsWithStories = GetRowsWithStories(tableWithStories) Set storylinkAnchors = GetStorylinkAnchors(rowsWithStories) ' Display Let output = BuildOutput(storylinkAnchors) Call Display(output) End Sub
4. Get entire HTML document
Next, we check the HTTP status code of the HTTP response - if it’s 200, it’s successful.
After transaction completes,
responseText property will have a string containing the raw HTML. This could be difficult to manipulate.
That’s why, we inject the
responseText content into the body of a html document object.
Private Function GetHtmlDocument(url As String) As htmlDocument Dim http As ServerXMLHTTP60 Dim htmlDoc As htmlDocument Set http = New ServerXMLHTTP60 Set htmlDoc = New htmlDocument With http ' Send an HTTP request .Open bstrMethod:="GET", bstrUrl:=url, varAsync:=False .send ' Check if the request has succeeded If (.Status = 200) Then ' Inject raw HTML string into HTML document object htmlDoc.body.innerHTML = .responseText End If End With ' Return Set GetHtmlDocument = htmlDoc End Function
I use ServerXMLHTTP and not XMLHTTP despite the latter being designed for client applications.
The XMLHTTP provides caching that makes extracting data with frequent changes problematic.
If you want to know more, see these links:
5. Get table
Get table with stories. It’s the third
<table> out of four available on the website.
Private Function GetTableWithStories(htmlDoc As htmlDocument) As HTMLTable Dim allTables As IHTMLElementCollection ' Get all available HTML tables Set allTables = htmlDoc.getElementsByTagName("table") ' Return the 3rd table containing Hacker News stories Set GetTableWithStories = allTables.Item(2) End Function
6. Get rows
Having the right table, get all
tr.athing rows containing
a.storylink elements with stories.
Private Function GetRowsWithStories(tableWithStories As HTMLTable) As IHTMLElementCollection ' <tr> rows with stories on Hacker News have class "athing" Set GetRowsWithStories = tableWithStories.getElementsByClassName("athing") End Function
7. Get stories
Create a collection of
a.storylink elements with titles and links.
Performance-wise, always use the For Each loop when looping through a collection.
For more details, see this excellent post: VBA For loop vs For Each loop.
Private Function GetStorylinkAnchors(tableRowsWithStories As IHTMLElementCollection) As Collection Dim storyRow As HTMLTableRow Dim storylink As HTMLAnchorElement Dim storylinksCollection As Collection ' Create empty collection to hold <a> elements with title and link Set storylinksCollection = New Collection ' Extract <a> elements with class "storylink" from <tr> rows For Each storyRow In tableRowsWithStories Set storylink = storyRow.getElementsByClassName("storylink").Item(0) storylinksCollection.Add Item:=storylink Next storyRow ' Return Set GetStorylinkAnchors = storylinksCollection End Function
8. Build output
To display collection from the previous step in the worksheet, it must be converted to an array first.
It is far more efficient to transfer one array of 100 values to the worksheet than access
Range 100 times for the read and writes (see: VBA Arrays And Worksheet Ranges by Chip Pearson).
To put it simpler, avoid using
Range wherever possible to achieve best performance.
Private Function BuildOutput(storylinks As Collection) As Variant Dim output As Variant Dim storylink As HTMLAnchorElement Dim rowIndex As Long ReDim output(1 To storylinks.Count, 1 To 2) ' Convert a collection to array For Each storylink In storylinks Let rowIndex = rowIndex + 1 Let output(rowIndex, 1) = storylink.textContent Let output(rowIndex, 2) = storylink.href Next storylink ' Return Let BuildOutput = output End Function
Write array back to worksheet.
Public Sub Display(output As Variant) Dim targetWorksheet As Worksheet Dim rowsQuantity As Long Dim columnsQuantity As Long Set targetWorksheet = ThisWorkbook.Worksheets(1) Let rowsQuantity = UBound(output, 1) Let columnsQuantity = 2 ' Clear and return the array to the Range With targetWorksheet .UsedRange.Delete .UsedRange.Clear .Range(.Cells(1, 1), .Cells(rowsQuantity, columnsQuantity)) = output End With End Sub
Run the Main method.
Your worksheet should now contain Hacker News stories titles in one column and links in the other.