Pickaxe - SQL Screen Scraper

by breeve 12. April 2016 11:55

Pickaxe has come a long way in a year. I open sourced it and hosted it on github.

The language is powerful enough now for real use cases. To illustrate, below is an example of fetching airport weather data and writing it to a file using SQL like syntax. The (thread(threadcount)) hint next to the download page statement allows parallel downloads resulting in fast code.

Never has scraping the web been so easy. To learn more, there are tutorials and docs on github. Visit the Pickaxe download page to run the code below.

create buffer states(state string)

insert into states
select 'TX'

insert into states
select 'AR'

insert into states
select 'OR'

create buffer station(url string, st string, city string, state string)

insert into station
select
    pick 'td:nth-child(1) a' take attribute 'href', --link to details
    pick 'td:nth-child(1) a', --station
    pick 'td:nth-child(2)', --city
    pick 'td:nth-child(4)' --state
from download page (select
    'https://www.faa.gov/air_traffic/weather/asos/?state=' + state
    from states)
where nodes = 'table.asos tbody tr'

create buffer stationReadings(url string, stamp string, time string, wind string, visibility string, weather string, temp string, humidity string)

insert into stationReadings
select
    url,
    pick 'td:nth-child(1)', --date
    pick 'td:nth-child(2)', --time
    pick 'td:nth-child(3)', --wind
    pick 'td:nth-child(4)', --visibility
    pick 'td:nth-child(5)', --weather
    pick 'td:nth-child(7)', --temp
    pick 'td:nth-child(11)' --humidity
from download page (select url from station) with (thread(10))
where nodes = 'table[cellspacing="3"] tr'

create file airportWeather(
    city string,
    state string,
    stamp string,
    time string,
    wind string,
    visibility string,
    weather string,
    humidity string,
    temp string
)
with (
    fieldterminator = '|',
    rowterminator = '\r\n'
)
location 'C:\windows\temp\weather.txt'

insert overwrite airportWeather
select city, state, stamp, time, wind, visibility, weather, humidity, temp
from station s
join stationReadings r on r.url = s.url

Tags:

Software

Powered by BlogEngine.NET 1.5.0.7
Theme by Mads Kristensen

About Me

I am a Principal Engineer with 13 years experience developing and releasing software products. I started developing in C/C++ then moved into .NET and C# and have tech lead multiple projects. I have developed products in Windows Forms, ASP.NET/MVC, Silverlight, and WPF. I currently reside in Austin, Texas.

Own Projects

Pickaxe - An easy to use web page scraper. If you know a little SQL and how basic CSS selectors work, no web scraping product will be easier to use.

Download Page

Source Code

Created ASP.NET MVC forum originally targeting home owner associations but now in use by an investor group.

http://vtss.brockreeve.com

A language for querying PGATour golf strokes.

http://pga.brockreeve.com/

Real time bidder for car ads demo

http://cargawk.com/

Simple front end tic tac toe

http://tictac.brockreeve.com/

Currently Reading