Archiving Your Pocket List With Ruby
I’ve been seeking a more powerful and extensible alternative to Bash, so I’ve recently begun experimenting with Ruby. For my first “real” test of the language, I decided to solve a problem I had been puzzling over for some time: since the web is constantly changing, how could I go through my entire reading list and make sure I had backup copies of the articles I’ve saved? As it turns out, there is a fairly simple solution: only 35 lines of Ruby!
The script uses the Curb library to follow URL shorteners and Nokogiri to parse HTML, so that the third main component, wkhtmltopdf (a personal favorite of mine), gets accurate data for each link. To get your Pocket data into the script, use Pocket’s nifty HTML export tool, which produces a webpage full of links to all of your saved articles.
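To make the redirect-following idea concrete, here is a minimal sketch that uses Ruby's standard-library Net::HTTP rather than Curb, so it is a different technique from the one the script uses. The `head_request` and `resolve_url` helpers, the `fetch:` parameter, and the limit of 10 requests are assumptions made for this illustration.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: send a HEAD request and return the Net::HTTPResponse.
def head_request(uri)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri)
  end
end

# Hypothetical helper: follow redirects (e.g. from a URL shortener) and
# return the final URL as a string. Gives up after `limit` requests.
def resolve_url(url, limit: 10, fetch: method(:head_request))
  uri = URI.parse(url)
  limit.times do
    response = fetch.call(uri)
    location = response['location']
    # Stop as soon as the response is not a redirect with a target.
    return uri.to_s unless response.is_a?(Net::HTTPRedirection) && location
    # Location may be relative, so resolve it against the current URL.
    uri = URI.join(uri.to_s, location)
  end
  raise "Too many redirects for #{url}"
end
```

Calling `resolve_url` on a shortened link performs real network requests. The `fetch:` parameter is only there so the redirect logic can be exercised without a network connection.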
Using the script is extraordinarily simple: Once dependencies are installed (see the top of the script for more information on that), you simply run
ruby pocket_export.rb ~/Downloads/ril_export.html
and you’re off! The script creates the directory pocket_export_data to store the PDFs it generates, along with pocket_export_errors.log inside that directory to keep track of any links it has trouble with.
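The output layout can be sketched with the standard library alone. This is an illustration under assumptions rather than the script's own code: `prepare_output_dir` and `log_failure` are hypothetical helper names, and the log is assumed to live inside the output directory, which is where the script's `File.open` call writes it.

```ruby
require 'fileutils'

# Hypothetical helper: create the output directory if it is missing
# and return its path. Safe to call more than once.
def prepare_output_dir(dir = './pocket_export_data')
  FileUtils.mkdir_p(dir)
  dir
end

# Hypothetical helper: append one line per failed link to the error log.
def log_failure(dir, link, error)
  File.open(File.join(dir, 'pocket_export_errors.log'), 'a') do |log|
    log.puts("Error: #{link} (#{error.class}: #{error.message})")
  end
end
```

Opening the log in append mode (`'a'`) means earlier entries survive if the export is run again.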
Enjoy!
=begin
Pocket Export.rb
My first 'real' Ruby script (hello world)!
More info here: http://blog.ctis.me/2015/12/archiving-your-pocket-articles-with-ruby.html
DEPENDENCIES:
  pocket_export requires the following gems:
    curb
    nokogiri
  pocket_export requires the following packages (install them with your system's package manager):
    wkhtmltopdf - wkhtmltopdf.org
USAGE:
**make sure that dependencies are installed first**
Go to https://getpocket.com/export/ and download the HTML file with your Pocket data. Then run this script with
the full path to the HTML file supplied as an argument (e.g. ~/Downloads/ril_export.html). The script will begin downloading
items immediately and will save the downloaded files in ./pocket_export_data. Errors, if encountered, are logged in ./pocket_export_data/pocket_export_errors.log
NOTE:
This process can potentially be fairly CPU-intensive, as all pages are downloaded and rendered as PDFs. If you have many items in your list, the process is
going to take a while.
=end
require 'curb'
require 'open-uri'
require 'nokogiri'

# Require the path to the Pocket export file as the first argument.
abort("Usage: ruby pocket_export.rb /path/to/ril_export.html") if ARGV.length < 1

pocket_data = ARGV[0]
Dir.mkdir("./pocket_export_data/") unless File.exist?("./pocket_export_data/")

Nokogiri::HTML(File.read(pocket_data)).css('a').each do |anchor|
  begin
    # Set link to the value of the <a> tag's href attribute.
    link = anchor['href'].gsub("\n", '')
    # Follow any redirects until the final destination is found (URL shorteners etc.).
    curl = Curl::Easy.perform(link) do |c|
      c.head = true
      c.follow_location = true
    end
    # Fetch the webpage title for use in the filename.
    title = Nokogiri::HTML(URI.open(curl.last_effective_url)).at('title').text.gsub("'", "").gsub('"', '')
    puts "\n\n\n***Downloading #{title} (#{link})..."
    # Run wkhtmltopdf. Passing arguments separately bypasses the shell, so no quoting is needed.
    system("wkhtmltopdf", link, "./pocket_export_data/#{title}.pdf")
  rescue => e
    # Catch and log any exceptions.
    puts "\n\n\n!!!Downloading #{link} FAILED!!\n\n\n"
    File.open('./pocket_export_data/pocket_export_errors.log', 'a') do |errorlog|
      errorlog.write("Error: #{link}: #{e.message}\n")
    end
  end
end
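Page titles can contain characters that are awkward in filenames, such as slashes, colons, and quotes. The sketch below shows one possible way to handle that; it is not the script's own approach. `safe_filename` and `wkhtmltopdf_command` are hypothetical helpers, and the `_` replacement character and 100-character limit are arbitrary choices made for this example.

```ruby
# Hypothetical helper: turn a page title into a filesystem-friendly name.
def safe_filename(title, max_length: 100)
  # Collapse runs of whitespace (including newlines) into single spaces.
  name = title.gsub(/\s+/, ' ').strip
  # Replace characters that are problematic in paths on common filesystems.
  name = name.gsub(%r{[/\\:*?"'<>|]}, '_')
  name = 'untitled' if name.empty?
  name[0, max_length]
end

# Hypothetical helper: build the wkhtmltopdf invocation as an argument array.
def wkhtmltopdf_command(url, title, dir: './pocket_export_data')
  ['wkhtmltopdf', url, File.join(dir, "#{safe_filename(title)}.pdf")]
end
```

With wkhtmltopdf installed, `system(*wkhtmltopdf_command(link, title))` would run the conversion. Because the arguments are passed as separate strings, Ruby does not hand the command to a shell, so the URL and title need no extra quoting.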
Originally published at blog.ctis.me on December 3, 2015.