Pickaxe - SQL based web scraper language. Tutorial #2

by breeve 31. July 2015 13:11

The last tutorial left us off with all the main post urls of http://vtss.brockreeve.com/. See Tutorial #1 for further explanation. Below is the code so far. Visit the download page to run the program. 

create buffer postpages (startPage int, endPage int)

insert into postpages
select
    pick 'li.current a' take text,
    pick 'li:nth-child(7) a' take text
from download page 'http://vtss.brockreeve.com/?t=All'
where nodes = 'ol.page-nav'

create buffer pageurls (url string)

each(row in postpages){
   
    insert into pageurls
    select
        'http://vtss.brockreeve.com/Home/Index/' + value + '?t=All'
    from expand (row.startPage to row.endPage)
}

What we want now is the urls to the detailed post pages. We need to go through each main page url and pick out the urls to the detail pages.

create buffer detailurls (url string)

each(row in pageurls){
   
    insert into detailurls
    select
        'http://vtss.brockreeve.com' + pick 'h3 a' take attribute 'href'
    from download page row.url
    where nodes = 'div.topic'

}

From Tutorial #1, we learned about how the where and pick statements work. It works the same way here. The only noted difference is the pick statement selects the attribute value rather than the text of the node. You can either ‘take text’ or ‘take attribute’. Adding the above to the completed code so far and selecting out the detailurls we get 485 detailurls to visit. So the forum has 485 total posts.

Now we write more SQL code to download each of these and get out the post details and replies.

create buffer topicdetails (id identity, title string, post string, user string)
create buffer topicreplies (id int, url string, post string, user string)

each(row in detailurls) {
   
    downloadPage = download page row.url

    insert into topicdetails
    select
        pick 'h3' take text, --title
        pick 'p:nth-child(3)' take text, --post
        pick 'p.author a' take text --user
    from downloadPage
    where nodes = 'div.topic'
   
    insert into topicreplies
    select
        @@identity,
        row.url,
        pick 'p:nth-child(2)' take text,
        pick 'p.author a' take text
    from downloadPage
    where nodes = 'div.reply'

}

The download page statement can be assigned to a variable, so we don’t download the same page twice. The code loops through all the detailurls rows, downloads the page from the web and then selects out the information.

A couple things to note. The first memory table has the first column id as an identity. The value is auto incremented on each insert giving each row a unique value. Then when we insert the replies the @@identity value fetches the last identity value (this is the same as Microsoft T-SQL). This allows us to link replies to topics in a normalized fashion. Completed scrape code listing below:


create buffer postpages (startPage int, endPage int)

insert into postpages
select
    pick 'li.current a' take text,
    pick 'li:nth-child(7) a' take text
from download page 'http://vtss.brockreeve.com/?t=All'
where nodes = 'ol.page-nav'

create buffer pageurls (url string)

each(row in postpages){
   
    insert into pageurls
    select
        'http://vtss.brockreeve.com/Home/Index/' + value + '?t=All'
    from expand (row.startPage to row.endPage)
}

create buffer detailurls (url string)

each(row in pageurls){
   
    insert into detailurls
    select
        'http://vtss.brockreeve.com' + pick 'h3 a' take attribute 'href'
    from download page row.url
    where nodes = 'div.topic'

}

create buffer topicdetails (id identity, title string, post string, user string)
create buffer topicreplies (id int, url string, post string, user string)

each(row in detailurls) {
   
    downloadPage = download page row.url

    insert into topicdetails
    select
        pick 'h3' take text, --title
        pick 'p:nth-child(3)' take text, --post
        pick 'p.author a' take text --user
    from downloadPage
    where nodes = 'div.topic'
   
    insert into topicreplies
    select
        @@identity,
        row.url,
        pick 'p:nth-child(2)' take text,
        pick 'p.author a' take text
    from downloadPage
    where nodes = 'div.reply'

}

Now if we add

select *
from topicreplies

to the bottom we can print out all the topic replies.

Any of the memory tables can be changed to file tables to persist data to the disk. Tutorial #1 shows how to do this.

Tags:

Software

Powered by BlogEngine.NET 1.5.0.7
Theme by Mads Kristensen

About Me

I am a Principal Engineer with 13 years experience developing and releasing software products. I started developing in C/C++ then moved into .NET and C# and have tech lead multiple projects. I have developed products in Windows Forms, ASP.NET/MVC, Silverlight, and WPF. I currently reside in Austin, Texas.

Own Projects

Pickaxe - An easy to use web page scraper. If you know a little SQL and how basic CSS selectors work, no web scraping product will be easier to use.

Download Page

Source Code

Created ASP.NET MVC forum originally targeting home owner associations but now in use by an investor group.

http://vtss.brockreeve.com

A language for querying PGATour golf strokes.

http://pga.brockreeve.com/

Real time bidder for car ads demo

http://cargawk.com/

Simple front end tic tac toe

http://tictac.brockreeve.com/

Currently Reading