Web Scraping Hacker News Using VBA

Overview

See how to use VBA to pull stories from Hacker News into Excel worksheet.

The source code is available on GitHub.

High-level plan:

  1. Check which HTML tag contains stories,
  2. Get website contents,
  3. Traverse through it,
  4. Display results.

1. Inspect HTML

Open Chrome DevTools by pressing F12 or Ctrl + Shift + I.

After inspection we found out that:

  • third table on the website contains stories,
  • a.storylink element contains title and link of a story.

Inspecting Hacker News website

2. Set references

In a new Excel worksheet, open a VBA editor (ALT+F11), go to Tools -> References and select:

  • Microsoft HTML Object Library
  • Microsoft XML, v6.0

Adding references to VBA project

Objects can be early bound or late bound. In early binding, objects are created at compile time. In late binding at run time.

Therefore early binding is considerably faster than late binding during run-time. If you want to use late binding, you need to declare a variable of Object type, then invoke CreateObject(“YOUR_LIBRARY”) method to create an instance of that object.

It is also worth noting that early binding enables Intellisense meaning that VB Editor starts displaying the object methods and properties after pressing dot.

3. Main procedure

The Main method is the entry point that invokes all other procedures.

Public Sub Main()

    Dim htmlDocument As htmlDocument
    Dim tableWithStories As HTMLTable
    Dim rowsWithStories As IHTMLElementCollection
    Dim storylinkAnchors As Collection
    Dim output As Variant

    ' Get website
    Set htmlDocument = GetHtmlDocument("https://news.ycombinator.com/news")

    ' Extract data
    Set tableWithStories = GetTableWithStories(htmlDocument)
    Set rowsWithStories = GetRowsWithStories(tableWithStories)
    Set storylinkAnchors = GetStorylinkAnchors(rowsWithStories)

    ' Display
    Let output = BuildOutput(storylinkAnchors)
    Call Display(output)

End Sub

4. Get entire HTML document

To send an HTTP request, we create an ServerXMLHTTP60 object, open the URL, and send the request.

Next, we check the HTTP status code of the HTTP response - if it’s 200, it’s successful.

After transaction completes, responseText property will have a string containing the raw HTML. This could be difficult to manipulate.

That’s why, we inject the responseText content into the body of a html document object.

Private Function GetHtmlDocument(url As String) As htmlDocument

    Dim http As ServerXMLHTTP60
    Dim htmlDoc As htmlDocument

    Set http = New ServerXMLHTTP60
    Set htmlDoc = New htmlDocument

    With http
        ' Send an HTTP request
        .Open bstrMethod:="GET", bstrUrl:=url, varAsync:=False
        .send

        ' Check if the request has succeeded
        If (.Status = 200) Then

            ' Inject raw HTML string into HTML document object
            htmlDoc.body.innerHTML = .responseText
        End If
    End With

    ' Return
    Set GetHtmlDocument = htmlDoc

End Function

I use ServerXMLHTTP and not XMLHTTP despite the latter being designed for client applications.

The XMLHTTP provides caching that makes extracting data with frequent changes problematic.

If you want to know more, see these links:

5. Get table

Get table with stories. It’s the third <table> out of four available on the website.

Private Function GetTableWithStories(htmlDoc As htmlDocument) As HTMLTable

    Dim allTables As IHTMLElementCollection

    ' Get all available HTML tables
    Set allTables = htmlDoc.getElementsByTagName("table")

    ' Return the 3rd table containing Hacker News stories
    Set GetTableWithStories = allTables.Item(2)

End Function

6. Get rows

Having the right table, get all tr.athing rows containing a.storylink elements with stories.

Private Function GetRowsWithStories(tableWithStories As HTMLTable) As IHTMLElementCollection

    ' <tr> rows with stories on Hacker News have class "athing"
    Set GetRowsWithStories = tableWithStories.getElementsByClassName("athing")

End Function

7. Get stories

Create a collection of a.storylink elements with titles and links.

Performance-wise, always use the For Each loop when looping through a collection.
For more details, see this excellent post: VBA For loop vs For Each loop.

Private Function GetStorylinkAnchors(tableRowsWithStories As IHTMLElementCollection) As Collection

    Dim storyRow As HTMLTableRow
    Dim storylink As HTMLAnchorElement
    Dim storylinksCollection As Collection

    ' Create empty collection to hold <a> elements with title and link
    Set storylinksCollection = New Collection

    ' Extract <a> elements with class "storylink" from <tr> rows
    For Each storyRow In tableRowsWithStories

        Set storylink = storyRow.getElementsByClassName("storylink").Item(0)
        storylinksCollection.Add Item:=storylink

    Next storyRow

    ' Return
    Set GetStorylinkAnchors = storylinksCollection

End Function

8. Build output

To display collection from the previous step in the worksheet, it must be converted to an array first.

It is far more efficient to transfer one array of 100 values to the worksheet than access Range 100 times for the read and writes (see: VBA Arrays And Worksheet Ranges by Chip Pearson).

To put it simpler, avoid using Range wherever possible to achieve best performance.

Private Function BuildOutput(storylinks As Collection) As Variant

    Dim output As Variant
    Dim storylink As HTMLAnchorElement
    Dim rowIndex As Long

    ReDim output(1 To storylinks.Count, 1 To 2)

    ' Convert a collection to array
    For Each storylink In storylinks

        Let rowIndex = rowIndex + 1

        Let output(rowIndex, 1) = storylink.textContent
        Let output(rowIndex, 2) = storylink.href

    Next storylink

    ' Return
    Let BuildOutput = output

End Function

9. Display

Write array back to worksheet.

Public Sub Display(output As Variant)

    Dim targetWorksheet As Worksheet
    Dim rowsQuantity As Long
    Dim columnsQuantity As Long

    Set targetWorksheet = ThisWorkbook.Worksheets(1)

    Let rowsQuantity = UBound(output, 1)
    Let columnsQuantity = 2

    ' Clear and return the array to the Range
    With targetWorksheet
        .UsedRange.Delete
        .UsedRange.Clear

        .Range(.Cells(1, 1), .Cells(rowsQuantity, columnsQuantity)) = output
    End With

End Sub

End result

Run the Main method.

Your worksheet should now contain Hacker News stories titles in one column and links in the other.

Hacker News stories in Excel worksheet