Charlton's Blog

Archiving Your Pocket List With Ruby


Published: Dec 3, 2015
Category: Programming, Projects

I’ve been seeking a more powerful and extensible alternative to Bash, so I’ve recently begun experimenting with Ruby. For my first “real” test of the language, I decided to solve a problem I had been puzzling over for some time: since the web is constantly changing, how could I go through my entire reading list and make sure I had backup copies of the articles I’ve saved? As it turns out, there was a fairly simple solution: only 35 lines of Ruby!

The script uses Curb to follow URL shorteners and Nokogiri to parse HTML, so that the third main component, wkhtmltopdf (a personal favorite of mine), gets accurate data for each link. To get your Pocket data into the script, you simply use Pocket’s nifty HTML export tool to download a webpage full of links to all of your saved articles.
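If you can't install Curb, the redirect-following step can be approximated with Ruby's standard library. The sketch below is not what the script does (it swaps Curb for Net::HTTP), and the names `head_location` and `resolve_final_url` are made up for illustration. The `head:` parameter is an assumption added so the loop can be tried out without touching the network.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: send a HEAD request and return the Location header
# if the response is a redirect, or nil if this URL is the final stop.
def head_location(uri)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    response = http.head(uri.request_uri)
    response.is_a?(Net::HTTPRedirection) ? response['location'] : nil
  end
end

# Hypothetical helper: follow redirects from +url+ and return the final URL.
# +head+ is any callable that maps a URI to its redirect target (or nil).
def resolve_final_url(url, max_redirects: 10, head: method(:head_location))
  uri = URI.parse(url.strip)
  max_redirects.times do
    location = head.call(uri)
    return uri.to_s if location.nil?
    # The Location header may be relative, so resolve it against the current URL.
    uri = URI.join(uri.to_s, location)
  end
  raise "Too many redirects for #{url}"
end
```

Calling `resolve_final_url(short_link)` on a shortened link should return wherever it finally lands, and the redirect cap keeps a misbehaving shortener from looping forever.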

Using the script is extraordinarily simple: Once dependencies are installed (see the top of the script for more information on that), you simply run

ruby pocket_export.rb ~/Downloads/ril_export.html

and you’re off! The script creates the directory pocket_export_data to store the PDFs it generates, and writes pocket_export_errors.log inside that directory to keep track of any links it has trouble with.
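For the curious, the directory-and-log bookkeeping boils down to something like the sketch below. The helper names `prepare_output_dir` and `log_failure` are hypothetical (the script does this inline), and the `reason` field is an added assumption. The directory and log file names are the ones the script uses.

```ruby
require 'fileutils'

# Hypothetical helper: create the output directory if it is missing.
# mkdir_p does nothing when the directory already exists.
def prepare_output_dir(dir = './pocket_export_data')
  FileUtils.mkdir_p(dir)
  dir
end

# Hypothetical helper: append one line per failed link. Opening the file
# in 'a' (append) mode keeps entries from earlier runs.
def log_failure(url, reason, dir = './pocket_export_data')
  File.open(File.join(dir, 'pocket_export_errors.log'), 'a') do |log|
    log.puts("Error: #{url} (#{reason})")
  end
end
```

Because the log is opened in append mode, re-running the export adds to the existing log rather than replacing it, so you may want to delete the file between runs.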

Enjoy!

=begin
  Pocket Export.rb
  My first 'real' Ruby script (hello world)!
  More info here: http://blog.ctis.me/2015/12/archiving-your-pocket-articles-with-ruby.html

  DEPENDENCIES:
    pocket_export requires the following gems:
    curb
    nokogiri

    pocket_export requires the following packages (install them with your system's package manager):
    wkhtmltopdf - wkhtmltopdf.org

  USAGE:
    **Make sure that dependencies are installed first.**

    Go to https://getpocket.com/export/ and download the HTML file with your Pocket data. Then run this script with
    the full path to the HTML file supplied as an argument (e.g. ~/Downloads/ril_export.html). The script will begin downloading
    items immediately and will save the generated PDFs in ./pocket_export_data. Errors, if encountered, are logged in
    ./pocket_export_data/pocket_export_errors.log

  NOTE:
    This process can potentially be fairly CPU-intensive, as all pages are downloaded and rendered as PDFs. If you have many items in your list, the process is
    going to take a while.
=end

require 'curb'
require 'open-uri'
require 'nokogiri'

if ARGV.empty?
  abort("Usage: ruby pocket_export.rb /path/to/ril_export.html")
else
  pocket_data = ARGV[0]
  # Dir.exist? replaces File.exists?, which was removed in Ruby 3.2.
  Dir.mkdir("./pocket_export_data/") unless Dir.exist?("./pocket_export_data/")
end

Nokogiri::HTML(File.read(pocket_data)).css('a').each { |anchor|
  # Set link to the value of the <a> tag's href attribute.
  link = anchor['href'].to_s.gsub("\n", '')
  begin
    # Follow any redirects until the final destination is found (URL shorteners etc).
    curl = Curl::Easy.perform(link) do |c|
      c.head = true
      c.follow_location = true
    end

    # Fetch the webpage title for use in the filename, dropping quotes and path separators.
    # URI.open is required here: Kernel#open no longer fetches URLs as of Ruby 3.0.
    title = Nokogiri::HTML(URI.open(curl.last_effective_url)).at('title').text.strip.gsub(/['"\/]/, '')
    puts "\n\n\n***Downloading #{title} (#{link})..."

    # Run wkhtmltopdf. Passing arguments separately bypasses the shell, so no quoting is needed.
    system('wkhtmltopdf', link, "./pocket_export_data/#{title}.pdf")
  rescue StandardError => e
    # Catch and log any exceptions.
    puts "\n\n\n!!!Downloading #{link} FAILED!!\n\n\n"
    File.open('./pocket_export_data/pocket_export_errors.log', 'a') { |errorlog|
      errorlog.write("Error: #{link} (#{e.message})\n")
    }
  end
}
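
The script builds each PDF's file name from the page's title, and real-world titles can be empty, very long, or full of characters that file systems dislike. A more defensive approach might look like the sketch below. `safe_filename` is a hypothetical helper, not part of the script, and the `untitled` fallback and 100-character cap are arbitrary assumptions.

```ruby
# Hypothetical helper: turn a page title into a string that is safe to use
# as a file name. The fallback name and length cap are arbitrary choices.
def safe_filename(title, fallback: 'untitled', max_length: 100)
  name = title.to_s
              .gsub(/[\/\\:*?"<>|']/, ' ') # quotes, path separators, other awkward characters
              .gsub(/[[:cntrl:]]/, ' ')    # control characters, including newlines
              .gsub(/\s+/, ' ')            # collapse runs of whitespace
              .strip
  name = name[0, max_length].strip
  name.empty? ? fallback : name
end

puts safe_filename(%q{AC/DC: "Back in Black"}) # prints AC DC Back in Black
```

Note that two different pages can still end up with the same name, in which case the later PDF would overwrite the earlier one. Appending a counter or a short hash of the URL is one way around that.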

Originally published at blog.ctis.me on December 3, 2015.