SQL based web scraper language. Tutorial #1

by breeve 23. July 2015 12:30

Few things are as tedious as writing a web scraper. I wanted something simple, so I created a SQL based web scraper language—I always wanted to create a DSL that is useful. All you need to know is a little SQL and how CSS selectors work.

Download the zip file below to play with it yourself. It's a Windows Forms .NET application. Unzip it and double click on Scrape.ScrapeStudio.exe. It requires .NET framework 4.0.

Scrape.zip (288.52 kb)

Below is a screen shot of the application when it opens up.

Now we need to write some SQL like code to extract some data. For this first tutorial, we will be scraping data from a forum that I wrote located here http://vtss.brockreeve.com/.

If we go to http://vtss.brockreeve.com/ and click on the “Recent” tab then the “All Time” tab below it, we can view all the posts in a paged format. Scroll to the bottom and we see the pages listed out as 1 2 3 4 5 … 49. We need to write some SQL to get those numbers out. All I need to capture is the number 1 and 49, so I can generate the urls to scrape.

The first thing we need to learn is the “download page” statement.

select *
from download page 'http://vtss.brockreeve.com/?t=All'

Copy the above code into the editor and click the green arrow run button. You will see the below.

The "download page" statement takes a url and actually downloads the page from the web. Notice that the “download page” statement returns a table with columns. The most important one is the nodes column. This represents all the nodes in the html. We can write a where clause against it using CSS selectors and use a “pick” statement to pick out the individual text by using CSS selectors as well.

select
    pick 'li.current a' take text,
    pick 'li:nth-child(7) a' take text
from download page 'http://vtss.brockreeve.com/?t=All'
where nodes = 'ol.page-nav'

If you run the above statement in the editor you see the below.

We see 1 and 49, but how did we get that from the statement above? The where clause returns all nodes that match the css selector ‘ol.page-nav’, then the pick statement picks nodes under the matched where statement also using css selectors.

To fully understand it, you have to look at the html for the page http://vtss.brockreeve.com/?t=All. I like to use firebug—a firefox plugin—to browse the html. Below is a view of it in firebug.

From this you see that the where clause is limiting the page to all nodes below the "ol.page-nav" node. From there the first pick statement "pick 'li.current a' take text" is picking the highlighted anchor node above—notice it is below the where statement node. The "take text" part of the statement will take the innertext of the anchor element which is 1.

The next pick statement "pick 'li:nth-child(7) a' take text" is taking the 7th list item element and taking the anchor text like the first pick statement. The important thing to note about the pick statements is they are limited by the where statement and a pick statement returns the first match—it cannot return multiple results because it is part of the SQL select.

Now we need to store these results in a table so we can write the rest of the script. This scrape language introduces two ways to do this. One writes to a file and the other is just a memory table.

create buffer postpages (startPage int, endPage int)

insert into postpages
select
    pick 'li.current a' take text,
    pick 'li:nth-child(7) a' take text
from download page 'http://vtss.brockreeve.com/?t=All'
where nodes = 'ol.page-nav'

The first statement above declares a buffer table which means when data is inserted into it, the results are stored in memory only. When the program exits the results are lost. Cut and past the code into the editor and run it. It will produce no output because we are inserting the results into the postpages table.

If we do

select *
from postpages

after it you will see the results.

The second way to store information is with a file backed table like this.

create file postpages (startPage int, endPage int)
with (
    fieldterminator = '|',
    rowterminator = '\r\n'
)
location 'C:\Windows\temp\data.txt' --change this location

The fieldterminator determines how the columns are separated and the rowterminator is how the rows are separated. If we take the same code above and just replace the table definition, the results will be written to the location specified in the file table statement. Make sure you change the location to a location that exists on your file system before running the code below.

create file postpages (startPage int, endPage int)
with (
    fieldterminator = '|',
    rowterminator = '\r\n'
)
location 'C:\Windows\temp\data.txt'--change this location

insert overwrite postpages
select
    pick 'li.current a' take text,
    pick 'li:nth-child(7) a' take text
from download page 'http://vtss.brockreeve.com/?t=All'
where nodes = 'ol.page-nav'

select *
from postpages

After the code runs, go to the file location and open it. You see 1|49 in the file.

The careful student will notice we used “insert overwrite” instead of “insert into”. There are two ways to insert into a table. “insert overwrite” will overwrite the file with the new contents where “insert into” will append the new contents. If you run the program multiple times, this obviously plays an important role.

File tables also load in the contents of the file when the program runs, so if you change the above to “insert into” and run the program multiple times you will see that not only does the file grow but the file table as well.

The pick statements can also have a match statement to specify a regular expression to match. It is often the case that you only want part of the text in a node. In our case it was exactly what we needed but see the example below to match just digits. If you run this, you get the same results as the previous runs but the match statement is running the regular expression to only match 1 or more digits.

select
    pick 'li.current a' take text match '\d+',
    pick 'li:nth-child(7) a' take text match '\d+'
from download page 'http://vtss.brockreeve.com/?t=All'
where nodes = 'ol.page-nav'

Next we will show how to generate the urls for all the pages. Meet the expand statement.

select *
from expand (1 to 5)

If you run the above you get a table with 1 column called value and 5 rows 1 through 5. If we combine this with the each statement to iterate over our previous results we will be able to generate the urls. Below is the full code so far.


create buffer postpages (startPage int, endPage int)

insert into postpages
select
    pick 'li.current a' take text,
    pick 'li:nth-child(7) a' take text
from download page 'http://vtss.brockreeve.com/?t=All'
where nodes = 'ol.page-nav'

create buffer pageurls (url string)

each(row in postpages){
   
    insert into pageurls
    select
        'http://vtss.brockreeve.com/Home/Index/' + value + '?t=All'
    from expand (row.startPage to row.endPage)
}

select *
from pageurls

Running the complete example code above produces the urls we need. Notice we create another buffer to hold the urls called “pageurls” and insert the expanded urls into it. The “each” loop allows us to loop through the “postpages” table. In this example, there is only 1 row. Since the scrapped url is relative we need to put a string before and after the “value” in the select statement.

Now we are getting somewhere. In the next tutorial we will finish this example out and introduce other language features.

Tags:

Software

Powered by BlogEngine.NET 1.5.0.7
Theme by Mads Kristensen

About Me

I am a Principal Engineer with 13 years experience developing and releasing software products. I started developing in C/C++ then moved into .NET and C# for the last 10 years and have tech lead multiple projects. I have developed products in Windows Forms, ASP.NET/MVC, Silverlight, and WPF. I currently reside in Austin, Texas.

Own Projects

Created ASP.NET MVC forum originally targeting home owner associations but now in use by an investor group.

http://vtss.brockreeve.com

A language for querying PGATour golf strokes.

http://pga.brockreeve.com/

Real time bidder for car ads demo

http://cargawk.com/

Simple front end tic tac toe

http://tictac.brockreeve.com/

Presentations

ADNUG: Hadoop / Hive

Currently Reading