Please note javascript is required for full website functionality.
MVP

Referencing Multi-Page Data from a Web Site with a Single URL

Referencing Multi-Page Data from a Web Site with a Single URL

Query

How can I link to data on a website where it is on more than one page of a table, all with the same URL?

 

Advice 

At the risk of it looking like all we do is Power Query (we don’t!), we thought we would share with you our solution to a rather awkward – yet common – problem.  Consider the data from the website http://www.quanthockey.com/khl/seasons/2017-18-khl-players-stats.html:

The webpage is nicely set out and contains a table of hockey player statistics.  The thing is, the embedded table actually has 17 pages of data and let’s say we wish to extract all of this data for analysis elsewhere.

There’s a problem though.  When you click on the second or subsequent page of data, the URL for the website does not change.  This seemingly defeats Power Query (or Power BI) as URLs for each page of table data are required.

So how may we extract all of the data?  To answer this, let’s get there in five steps.

 

Step 1: Manual Retrieval of Data

Before we move on to our proposed solution, we should first cover how to manually import data from the hockey statistics site.  You should note that this method does not yield the complete list; this will be detailed later on.  First, let’s focus on manual extraction.

To manually import the data from this hockey statistics site http://www.quanthockey.com/khl/seasons/2017-18-khl-players-stats.html using Power Query, first open Excel then navigate to the ‘Data’ tab and click on the ‘New Query’ option, then select the ‘Other Sources’ option followed by ‘Web’.

A dialog box will appear, allowing us to insert the URL. Next click ‘OK’.

The ‘Navigator’ dialog box will appear, allowing us to select exactly which table to pull data from.  At this point all looks good, however we should name the table, so let’s click on ‘Edit’. 

In the ‘Query Editor’ dialog box, we should give our Query a friendly name, say ‘Hockey Data’, then select ‘Close & Load’. 

All seems to have worked well, however upon closer inspection, we can see that Power Query was only able to retrieve the first 50 entries, i.e. page 1 of the table data.

This is because Power Query retrieves data based on the URL and in this case our Power Query friendly hockey statistics website displays data using JavaScript.  Essentially, the website uses JavaScript code to dynamically refresh the list of players on one page, thus enabling the webpage to dynamically refresh the player list in one page, without changing the webpage’s URL.  Additionally, we also note that we cannot be certain how many pages of data this website has either.

In summary, we have three issues: 

  1. We are unable to manually pull all of the data from the website
  2. We do not know how many pages of data the website has
  3. The webpage does not change its URL when a new page of data is displayed. 

Let’s deal with one issue at a time.

 

Step 2: Custom Functions

Let’s address the first issue, then, i.e. the inability to manually retrieve all of the data just by importing it using Power Query. 

One proposed solution, proposed by MVP Reza Rad, is to utilise Custom Functions in Power Query.  A custom function is a query that is run by other queries, for those of you who know Java from coffee beans, and is similar to what is known as an Object Method.  The benefit of having a custom function is that we can repeat the same steps to a refreshed dataset if need be.

Let’s work through a simple example to illustrate Custom Function’s utility.  For example, we wish to retrieve the gross earnings of all of the movies released in a particular year, along with their current rank and their studio. It does not matter which year we wish to begin with; for this example, we shall begin with 2017 (it’s a little too early in the year for 2018!):

http://www.boxofficemojo.com/yearly/chart/?yr=2017&p=.htm 

To launch Power Query, we’ll use Excel 2016 and select ‘New Query’ from the ‘Data’ tab:

Then, we’ll adopt the default options and click ‘OK’. 

A dialog box will appear, prompting the selection of the table.  We select ‘Table 1’ then click on ‘Edit’.

Now that we have the Query Editor window open we can define our parameter.  Parameters are needed for custom functions to work. 

We create a simple parameter, set the name to ‘Year’ type to ‘text’ and the initial value to 2017.

We can now add a custom column to ‘Table 1’, click on ‘Table 1’, then on the ‘Add Column’ tab, then ‘Custom Column’. 

We give the custom column a name ‘Year’ and make it equal to the parameter ‘Year’. 

Be sure to change the custom column’s data type to ‘Text’.

Next, we need to integrate our Parameter into the URL.  If we achieve this, we’ll be able to dynamically change the URL, ultimately altering the source of the database by the desired year.

With ‘Table 1’ selected, we can click on the setting icon for the ‘Source’ in ‘Applied Steps’, viz

Selecting the ‘Advanced’ option, we can Identify the part of the URL that has the date, and enter the parameter in its place instead.  We should also include the last bit of the URL after the Year parameter. We do this by ‘Adding Parts’ to the URL.

Once that is done click ‘OK’.

We now have to convert the query into a function.  To do this, we right click on the ‘Table 1’ query and then select ‘Create Function…’.

Here, we’ll name the function ‘GetMovies’ then click ‘OK’.

There is now a group folder containing the original ‘Table 1’ query, the Year 2017 parameter and the GetMovies function. 

We have created a copy of the ‘Table 1’ query and called it ‘GetMovies’, from now on every time we call on GetMovies, Power Query will perform the same tasks in that order.

For simplicity then, we will create a simple generator and use the List.Numbers function to create our generator.  To do this, you can create a new query by navigating to the ‘Home’ tab in the Query Editor, then ‘New Source’, ‘Other Sources’ and choosing Blank Query.  Then, enter the following formula in the formula bar:

=List.Numbers(2002,16)

The list may then be converted into a table using the ‘To Table’ option located in the ‘Convert’ group. 

The default conversion settings will suffice.  Lastly, change the data type to ‘Text’. 

With the ‘Query1’ query selected, we can invoke a custom function by going to the ‘Add Column’ tab and select the ‘Invoke Custom Function’ in the ‘General’ group. 

We will name the new column to ‘GetMovieData’, select the ‘GetMovies’ function and click ‘OK’.

A new column will be added. 

Clicking on each individual Table will reveal the movie data for its corresponding year.  For example, here’s the 2006 data: 

There are some limitations however:

  • Editing the M script of the function will cause the function and query to fail
  • Custom functions cannot be scheduled to update in Power BI.

It’s still better than nothing. 

If we expand our dataset,

this reveals this compiled table with the top 100 movies for each year:

The data still needs some cleaning up however, you can learn how to do that by keeping up with our Power Query Pointers series!

The above deals with manual importation of the data, but what about the page number issue?

 

Step 3: Unknown Number of Pages

The solution to this problem was produced by a combined effort of Matt Mason, his blog post can be found here and Miguel Escobar, whose video explains his method well.

Matt Mason’s method adopts a brute force method where it instructs Power Query to run through pages 1 to 10,000 and stop when Power Query runs into an error or a ‘null’ value.  He points out that if this method is used together with third party software such as Fiddler (more on Fiddler later), Power Query will be found trying to evaluate all 10,000 pages.  Furthermore, if you try Matt’s method now with the newest version of Power Query, you may receive an error claiming that you do not have access to the database.

This is where Miguel comes in and adjusts the code a little so that it does not adopt the brute force method anymore as well as fix the permissions bug that has developed in Power Query.  Building off Matt Mason’s ideal, we will only use his ‘GetData’ function and build off that.  

To begin with, let’s open Power Query from Excel and convert Matt’s GetData query into a function:

Now we create a whole new query: go to ‘New Source’ then ‘Other Sources’ and ‘Blank Query’.  The first line of code to be entered uses the List.Generate Function:

=List.Generate ( ()=>

The ()=> function nomenclature essentially says that we will find a function with no parameter. 

The next line is:

[Result= try GetData(1) otherwise null, Page = 1],

This line uses Matt’s original function; however, it throws in an error check.   Essentially, it states ‘try’ to GetData, if it returns with an error, return with ‘Null’ in Page 1:

The next line:

each [Result] <> null,

specifics a condition, where the result cannot be null. Or perform this function as long as the Result does not equal to null. 

The next line increments the Page to page 2:

each [Result = try GetData([Page]+1) otherwise null, Page = [Page]+1],

The last line in this function instructs Power Query to display the Result field:

each [Result])

Once we hit ENTER we will see the list of tables:

This is all of the different pages pulled of the domestic gross of 2016 from the Box Office Mojo website.  Notice that Power Query does not try to evaluate 10,000 pages.

Now we go through the table and define each column’s data type.  While this is still a list we can transform this into a table and expand the data:

Once the table has been transformed we can expand the table.

The expanded table should look something like this:

Closing and loading will not result an error but instead all of the movie data from the year of 2016 from Box Office Mojo.

Now that we have dealt with the page number issue, we can now move on to the final bit…

 

Step 4: Fiddling with the URL, Fiddle(r) with the URL

Power Query needs some assistance: a software program called Fiddler is one such assistant.  To download this software, please go to Telerik’s software page to download Fiddler.  When your PC has finished installing Fiddler, open it and the page should look something like this:

As the prompt “Please select a single Web Session to impact” will require a web browser; we have used Chrome for this example, just because that was what we felt like when we took the screenshots .

So we go back to where we started and navigate to the Hockey statistics website again (URL: http://www.quanthockey.com/khl/seasons/2017-18-khl-players-stats.html) and we will start to see some interesting things appearing on Fiddler:

Fiddler takes the source of the URL and displays is it here, so let’s see what happens when we try page 2 of the Hockey Stats.  Fiddler now returns with an alternate URL:

It seems to have been broken down into Seasons, and potentially pages too.  Let’s copy it and save it on to an Excel spread sheet to aid us in discovering any patterns.  Right click the line of URL and select ‘Copy just URL’.

After repeating the process, a couple of times, we spot a pattern. Fiddler is able to retrieve the URL and break it down into pages.  This is great news, we can finally use this to work with Power Query.

Step 5: A Solution

Now for the final part where we combine everything together.

The first step is to create a new Query in Power Query and create a new parameter:

Name the parameter ‘PageNumber’, set it to a Decimal Number type, and give it a current value of 1. 

Now create a new Blank Query and paste the original code from Matt Mason into the formula bar:

Source = Web.Page(Web.Contents("http://boxofficemojo.com/yearly/chart/?page=" & Number.ToText(page) & "&view=releasedate&view2=domestic&yr=2013&p=.htm")),

Then modify it using the new URL provided from Fiddler:

=Web.Page(Web.Contents("http://www.quanthockey.com/scripts/AjaxPaginate.php?cat=Season&pos=Players&SS=2017-18&af=0&nat=2017-18&st=reg&sort=P&so=DESC&page=2&league=KHL&lang=en&rnd=167379793&dt=1"))

We also have to include the PageNumber parameter and the Text.From Power Query function to ensure that it is inserted into the URL as a text format.  The following code should replace the page number (in case you were wondering, the ampersand symbols mean concatenate):

="&Text.From(PageNumber)&"

yielding this:

=Web.Page(Web.Contents("http://www.quanthockey.com/scripts/AjaxPaginate.php?cat=Season&pos=Players&SS=2017-18&af=0&nat=2017-18&st=reg&sort=P&so=DESC&page="&Text.From(PageNumber)&"&league=KHL&lang=en&rnd=276273473&dt=1"))

As you can see, the PageNumber parameter has been linked into the URL.  Power Query will return with a condensed table.  The next step is to select the top right ‘Table’ option:

This will expand the table resulting in a table that only imports data from the first page, or the first 50 records.

Now, create a new blank query and copy this code in, it is a modified version of Matt’s GetData function for our purpose:

= (PageNumber as number) => let

        Source = Web.Page(Web.Contents("http://www.quanthockey.com/scripts/AjaxPaginate.php?cat=Season&pos=Players&SS=2017-18&af=0&nat=2017-18&st=reg&sort=P&so=DESC&page="&Text.From(PageNumber)&"&league=KHL&lang=en&rnd=276273473&dt=1")),

        Data0 = Source{0}[Data],

        #"Changed Type" = Table.TransformColumnTypes(Data0,{{"Rk", Int64.Type}, {"", type text}, {"Name", type text}, {"Age", Int64.Type}, {"Pos", type text}, {"GP", Int64.Type}, {"G", Int64.Type}, {"A", Int64.Type}, {"P", Int64.Type}, {"PIM", Int64.Type}, {"+/-", Int64.Type}, {"PPG", Int64.Type}, {"SHG", Int64.Type}, {"GWG", Int64.Type}, {"G/GP", type number}, {"A/GP", type number}, {"P/GP", type number}})

    in

        #"Changed Type"

The second section of code simply changes the data types accordingly for each column so that you don’t have to do it!  Once entered, rename the function to ‘PageData’.

Now to incorporate Miguel’s method, create another blank query and copy this code in:

= List.Generate( ()=>

[Result= try PageData(1) otherwise null, Page = 1],

each [Result] <> null,

each [Result=  try PageData(Page) otherwise null, Page = [Page] +1],

 each [Result])

Once entered, change the name of the query to ‘AllData’.

This time there are no modifications needed in the code, other than to convert this list into a table.

Once Power Query has converted it into a table, we can expand the table:

Expanding the table should yield this result, where Power Query is able to compile the entire list of Hockey players, way beyond 50:

We can now proceed to ‘Close & Load’.

There you have it, all 829 Hockey Player statistics (as at the time of writing) in one worksheet!

Until Microsoft introduces a new built in feature to circumvent all this nasty coding, there does appear to be little alternative to this common issue. 

We would like to thank Reza Rad, Miguel Escobar, Matt Mason, and Simon Sabin for all contributing to this solution.

Newsletter