InstaPaper

Center of my online reading Highlighting And Annotating process. I wrote scripts to pull down my highlights to auto-post to the WikiLog... ReadItLater app/service to scrape (typically longer-form) web articles and save clean-text format for reading later, typically with Data Synch to a Mobile device for Off-Line reading.

Created by Marco Arment. Launched 2008.

sold to BetaWorks 2013
Brian Donohue joined 2013 to build iOS app, then promoted to General Manager, later CEO.
acquired by PInterest 2016. (Brian went along.)
spun back out mid-2018

My Highlighting And Annotating process of online reading since 2017

2021 update, including other media: My Digital Gardening Process

2020 update: video of my process

I save all articles to InstaPaper.

I read everything on my phone via the app
I Highlight as I read

Periodically I process on my laptop

I run a script that hits the InstaPaper API to grab a bunch of articles' highlights. I save each to a separate file. (Details in other section below.)
Manually adjust each file
- decide whether I'm going to post it as standalone, or maybe combine multiple to-post articles, or maybe cut/paste to a non-Blogbit-page. If post-as-standalone.....
- fix filename-timestamp (typically requires opening original source, because nobody uses metadata), plus adjust filename to make URL I like
- add author name if I feel like it
- adjust any Markdown issues (usually code bugs)
- put in WikiWords as appropriate - sometimes it's just bracketing existing phrases, other times it's adding a parenthetical, etc.
- sometimes I want something for my PrivateWiki instead, so I copy/paste by hand over there, save an empty file
- put each edited file into a separate folder as PipeLine
run script2 which auto-posts the edited files to my WikiLog
then move all those files to another folder
when I run script1 it deletes all the posted articles before grabbing new ones
then (less often) I have a separate process to update my WikiGraph.

old notes

For over a year I had the free service, and used the free/lite InstaFetch client on my Archos70 Tablet.

Sept'2012: started paying for full service, and bought the official Android client for my Archos70. (via Amazon AppStore since Archos didn't get Google store)

The bookmark to save doesn't work well with GitHub Mark Down pages - use http://instapaper.com/hello2?url=

actually, that seemed to work in the web UI, but didn't synch anything useful to my Android app

Dec'2013: Bought Nexus Seven. Everything migrated over nicely.

Nov'2014: frustrated that Amazon-store version is still stuck at v2.9.2, while official version is up to v4.2. They say if I just install the new app, it will synch up so I won't lose any of my inventory. Update: wow, it was so fast, I think the old local data file must have been left behind for re-use?

insta_repost

Sept'2015: would like to use their API to suck down items I've Liked/Starred/Hearted, so I can post them to my WikiLog. I want to scrape my highlights, the original piece title, and maybe even the first sentence/paragraph to quote.

The Python instapaperlib package that comes up most obviously uses the "simple" API which offers little except adding a URL to your account.

The instacache code uses the "full" API, so I'm going to just steal pieces of that to create my own stuff. I want to store my stuff in files so I can browse/edit them more easily before posting.

Oct07: applying for API OAuth token.

Oct09: stepping through code one line at a time

when I get to client.request("%s/oauth/access_token" I get an error which seems to be related to linking in an old version of OpenSSL.
python -c "import ssl; print ssl.OPENSSL_VERSION" from here gives me OpenSSL 0.9.7l 28 Sep 2006 which is bad
- hmm in the VirtualEnv for Flask I get OpenSSL 0.9.8y 5 Feb 2013 - that might be recent enough
going to try updating OpenSSL
- but brew update gives

xcrun: error: active developer path ("/Applications/Xcode.app/Contents/Developer") does not exist, use xcode-select to change
Error: Failure while executing: git init

update: ended up re-installing Xcode, which fixed some issues

Nov09: nudging forward with VirtualEnv for Flask

typing piece at a time, now get past resp, token = client.request(...) successfully, so have token value
but everything I try to do after that gives me 403 response
ah, looks like need to create new Client instance with the token
and need to create the token object rather than just passing in the token value?
yep that works
can ask for folder_id='starred' in payload to get the starred/liked/hearted items
results look like:

'[
{"type": "meta"}, 

{"username": "fluxent@gmail.com", "user_id": 1761795, "type": "user", "subscription_is_active": "1"}, 

{"hash": "pzvgEV0W", "description": "", "bookmark_id": 653717723, "private_source": "", "title": "Open issues: lessons learned building an open source business", "url": "http://werd.io/2015/open-issues-lessons-learned-building-an-open-source-business", "progress_timestamp": 1447170795, "time": 1447162969, "progress": 0.966090679169, "starred": "1", "type": "bookmark"} ]'
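Since the response is one flat JSON list with a "type" field on each item, a first step is to split it by type. A minimal sketch, assuming Python 3 (the notes here were written against Python 2); the function name `split_by_type` is my own, and `raw` is a trimmed copy of the response above, not a fresh API call:

```python
import json

# Trimmed copy of the response shown above (some fields omitted for brevity).
raw = '''[
 {"type": "meta"},
 {"user_id": 1761795, "type": "user", "subscription_is_active": "1"},
 {"bookmark_id": 653717723,
  "title": "Open issues: lessons learned building an open source business",
  "url": "http://werd.io/2015/open-issues-lessons-learned-building-an-open-source-business",
  "time": 1447162969, "starred": "1", "type": "bookmark"}
]'''


def split_by_type(raw_json):
    """Group the flat list of API items by their "type" field."""
    groups = {}
    for item in json.loads(raw_json):
        groups.setdefault(item.get("type", "unknown"), []).append(item)
    return groups


groups = split_by_type(raw)
bookmarks = groups.get("bookmark", [])
for b in bookmarks:
    print(b["bookmark_id"], b["title"])
```

Only the "bookmark" items matter for posting; the "meta" and "user" entries can be ignored.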

trying to get the first-line of the original text can include lots of junk, and there's no clear end to the line except maybe </p>\n

<figure><a href="https://www.flickr.com/photos/kid_pro_quo/243281786"><img alt="South Park" src="https://farm1.staticflickr.com/92/243281786_d03baeab9d_z.jpg?zz=1"/></a><figcaption><a href="https://www.flickr.com/photos/kid_pro_quo/243281786">South Park</a></figcaption></figure>\n<p><strong><em>Prologue</em>:</strong></p>\n

getting the highlights for a bookmark gives a list for which one entry looks like:

{"highlight_id": 1816475, "text": "we didn\'t know how we were going to pay rent, and growth was linear. For a project, we were doing well. For a company, we weren\'t doing well - and there were still only two of us", "bookmark_id": 653717723, "time": 1446956770, "position": 0, "type": "highlight"},

Jan21'2017: being up-and-running with WikiFlux for a month now, getting back to this.

argh getting the same old-SSL error noted Oct09 above! Suspect VirtualEnv issue... derp yeah have to use my Flask VirtualEnv instead of WebPy one. Then print ssl.OPENSSL_VERSION gives OpenSSL 0.9.8za 5 Jun 2014

Jan22

turn all script bits into working code, as far as getting list
ToDo: handle UniCode content, decide when to turn to ascii (get unidecode package), when to allow
hrm time attribute reflects when I bookmarked it, not when page was written
- accept/embrace that?
- double-check that page-get-text function doesn't have anything - confirmed
- scrape head of original source for metatag? use urllib2.response.info().dict['last-modified'] - but how often will it be there? Will try this first.
- in some cases can parse url, but yuck
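The two date-guessing options above (Last-Modified header, date embedded in the URL) could be sketched roughly like this. This is a Python 3 sketch, not the script I actually ran; `date_from_headers`, `date_from_url` and `fetch_headers` are illustrative names, and the URL pattern assumes a /YYYY/MM/DD/ path, which many sites don't use:

```python
import re
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen


def date_from_headers(headers):
    """Return a datetime parsed from a Last-Modified header, or None."""
    value = headers.get("Last-Modified")
    if not value:
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Malformed header value: treat it as missing.
        return None


def date_from_url(url):
    """Return (year, month, day) if the URL contains /YYYY/MM/DD, else None."""
    m = re.search(r"/((?:19|20)\d{2})/(\d{1,2})/(\d{1,2})(?:/|$)", url)
    if not m:
        return None
    year, month, day = (int(g) for g in m.groups())
    if 1 <= month <= 12 and 1 <= day <= 31:
        return (year, month, day)
    return None


def fetch_headers(url):
    """Fetch only the response headers (needs network; not called here)."""
    with urlopen(Request(url, method="HEAD"), timeout=10) as resp:
        return resp.headers


print(date_from_headers({"Last-Modified": "Wed, 21 Oct 2015 07:28:00 GMT"}))
print(date_from_url("http://werd.io/2015/open-issues-lessons-learned-building-an-open-source-business"))
```

The werd.io URL only carries the year, so `date_from_url` returns None for it, which is the "yuck" case.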

Jan23: dealing with unicode before time

thought if I just stripped unicode from chars going into url then I'd be ok
but can't just use file.write() with unicode either, sigh have to dig in

Feb18

UniCode/UTF8 solution, just f.write(title.encode("utf-8")) - seems to work
now getting HTTP error 403 on one article when grabbing the original to get the date MetaTag. Can grab target URL with browser fine. I wonder if it's blocking unknown agents?
grab it with CUrl and see it's giving a 301 redirect from http to https. Just putting this inside the try for now, let it skip these cases. Now grabbing all successfully. Next: grab the highlights.
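In Python 3 terms the two fixes above look something like the sketch below: open the file with an explicit encoding instead of calling .encode() by hand, and send a browser-like User-Agent in case a site really does block unknown agents (which I only suspected, never confirmed). The helper names and the User-Agent string are made up for illustration:

```python
import os
import tempfile
from urllib.request import Request


def write_text(path, text):
    """Write text as UTF-8 regardless of the platform default encoding."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def browser_like_request(url):
    """Build (but do not send) a request with an explicit User-Agent."""
    return Request(url, headers={"User-Agent": "Mozilla/5.0 (compatible; insta_repost sketch)"})


# A title with non-ASCII characters: e-acute, curly apostrophe, trademark sign.
title = "Caf\u00e9 owner\u2019s notes\u2122"
path = os.path.join(tempfile.mkdtemp(), "title.txt")
write_text(path, title)

with open(path, "rb") as f:
    raw = f.read()
print(len(title), "characters ->", len(raw), "bytes")
```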

Mar04'2017

derp somehow I hadn't updated some variables properly, so I had still been using some old API bits! (And/or they made a recent change.)
So now the highlights come along in the main initial call.
Interestingly, they aren't children of the bookmarked articles, but just a parallel flat list.
Frustratingly, it looks like highlight position is rarely used anymore! Which means if I scroll back in an article to add a highlight, it will be in time-order rather than article-position-order in my excerpts. Ah well.
- more precisely: position is used for when the same highlighted phrase occurs multiple times in the same document - so if the phrase occurs only once, it has pos=0.
Almost never getting date MetaTag on original articles, and noticing how few original URLs have full date in them, so I think I'm going to have to suck it up and use the save-date... later.
Starting to modify my autopost code to handle these posts...
Done! Auto-posted 10 pages.
Next - delete them from InstaPaper. Done!
Status: WikiLog has 17,135 nodes. (hmm don't see a way to get my count of Liked items in InstaPaper)
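Because the highlights arrive as a flat list parallel to the bookmarks, they have to be regrouped by bookmark_id before writing each article's file. A rough Python 3 sketch of that regrouping; the function name is mine, and the second highlight in the sample data is invented to show the time ordering:

```python
from collections import defaultdict


def highlights_by_bookmark(items):
    """Map bookmark_id -> list of its highlights, ordered by (time, position)."""
    grouped = defaultdict(list)
    for item in items:
        if item.get("type") == "highlight":
            grouped[item["bookmark_id"]].append(item)
    for highlights in grouped.values():
        highlights.sort(key=lambda h: (h["time"], h["position"]))
    return dict(grouped)


items = [
    {"type": "bookmark", "bookmark_id": 653717723, "title": "Open issues"},
    # Invented entry, later timestamp, listed first on purpose.
    {"type": "highlight", "highlight_id": 2, "bookmark_id": 653717723,
     "text": "a later highlight (illustrative)", "time": 1446956800, "position": 0},
    # Entry taken from the sample response earlier on this page.
    {"type": "highlight", "highlight_id": 1816475, "bookmark_id": 653717723,
     "text": "we didn't know how we were going to pay rent", "time": 1446956770, "position": 0},
]

grouped = highlights_by_bookmark(items)
for h in grouped[653717723]:
    print(h["time"], h["text"])
```

This gives time-order, not article-position-order, which is exactly the limitation noted above.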

fixing highlight-position issue

In some cases the highlights come through in random order, all with identical timestamps. So I'm going to have to find the position of each highlight in the body.

Sept04'2020

want to
- fix the highlight-position issue
- scrape notes (which I rarely use so far) as well as highlights
position first...
API method /bookmarks/get_text returns the cleaned body of the article, in HTML. But the highlights are plaintext (despite the API doc) (unless my wrapper library is responsible for converting). Also, fairly basic chars like curly apostrophes will be "normal" in the body-HTML, but \u2019 in the highlight!
using body.encode('utf-8') but getting UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2137: ordinal not in range(128) - hmm it's a little trademark symbol
use body.decode('utf-8').encode('utf-8'), but then when I do str.find() I get the same can't decode error
reading Unicode notes to stop flailing
- If you must decode strings manually, you can simply do my_string.decode(encoding), where encoding is the appropriate encoding. Python 2.x supported codecs are given here: Standard Encodings. Again, if you get UnicodeDecodeError then you've probably got the wrong encoding. wtf do you do if you don't know what encoding you're being handed? Trial and error?!?
  - ah, correct value is utf_8 not utf-8
- hmm maybe this: HTTP: Web pages can be encoded in just about any encoding. The Content-type header should contain a charset field to hint at the encoding. The content can then be decoded manually against this value. Alternatively, Python-Requests returns Unicodes in response.text.
  - it says UTF-8
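One way to stop guessing at the encoding is to read the charset out of the Content-Type header and decode the bytes exactly once. A Python 3 sketch using the standard library's header parsing; `charset_of` and `decode_body` are illustrative names, and falling back to UTF-8 when no charset is given is my assumption:

```python
from email.message import Message


def charset_of(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


def decode_body(raw_bytes, content_type):
    """Decode response bytes into text using the declared charset."""
    return raw_bytes.decode(charset_of(content_type), errors="replace")


# 0xe2 0x84 0xa2 is the trademark sign, 0xe2 0x80 0x99 the curly apostrophe.
body = decode_body(b"Acme\xe2\x84\xa2 isn\xe2\x80\x99t", "text/html; charset=UTF-8")
print(charset_of("text/html; charset=UTF-8"), repr(body))
```

After this step everything downstream works on text, so searching for a highlight containing \u2019 compares like with like.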

Sept13

let's clarify
type(body)=<type 'str'> but it contains non-ascii characters. The HTTP response-header says it's UTF-8.
derp should be calling decode() not encode()? This makes it <type 'unicode'>
that seems to be working, or at least the error moved further down. Ah, had to restructure my dictionaries. Now working.
But sometimes highlights still not in the right order. Ah, if a highlight contains a special character like \u2019 then the find fails and the position gets set to -1.
derp I had a normalize() call on the body which turned it from <type 'unicode'> to <type 'str'>. I took out that call, so it stays unicode, and now everything seems like it's working.
found a failure case - when highlight crosses paragraphs, and has \n\n in it, it doesn't get found in the body. I'm just using highlight.splitlines()[0] so I just search for the first line of the highlight.
another failure.... the body has a place where there are 2 spaces in a row, and the highlight does not. ?!?!? I think I'm going to let this ride, and maybe try to count how often it happens or something....
- 50 articles -> 34 misses ugh
check another case - another double-space. Options
- convert all double-spaces to single-spaces. Pro: HTML ignores multi-spaces anyway (except maybe in multiple lines of code, which I never highlight). Going with this.
- only do find with the first sentence. Con: as that link shows, defining sentences can be tricky (and sometimes I start a highlight mid-sentence, which might make it even trickier).
- only do find on the first n characters - 32? Con: risk of non-uniqueness, plus having a short sentence/phrase that results in double-spaces earlier than you'd expect, esp given my occasional habit of starting highlight mid-sentence.
- that solved a bunch of cases. But still 26 fails (not exact match to prior experiment because the set of articles have shifted).
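The search described above (first line of the highlight only, double-spaces collapsed) can be sketched like this. It is a Python 3 approximation rather than my actual code; the names are illustrative, and I collapse spaces on both sides, which is an assumption:

```python
import re


def collapse_spaces(text):
    """Turn any run of spaces/tabs into one space (HTML ignores the extras)."""
    return re.sub(r"[ \t]{2,}", " ", text)


def highlight_position(body, highlight):
    """Offset of the highlight's first line in the collapsed body, or -1."""
    lines = highlight.strip().splitlines()
    if not lines:
        return -1
    return collapse_spaces(body).find(collapse_spaces(lines[0]))


body = "<p>One  two three\u2019s four.</p>\n<p>Five six.</p>"
multi_paragraph = "two three\u2019s four.\n\nFive six."

print(highlight_position(body, multi_paragraph))
print(highlight_position(body, "not in the article"))
```

The offset is relative to the collapsed body, which is fine since it is only used for sorting.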

Sept14

checking first fail-article, which has 4 fails. Many characters like #8212 (that's not the full value, somewhere this page is converting it! - it's like &#___;) - going to handle that one by hand and check the others....
ugh failures are more serious than I thought - since I'm using the 'position' as the dictionary index for sortability, if there are multiple fails within a single piece they all get position= -1 and therefore keep writing over each other, so I lose n-1 of the failed highlights. Do I
- improve my logging of per-article rather than per-batch fails
- skip an article with more than 1 failure - or add an alert as a first highlight!
- solve all the failures....
- use the first-half of the highlight and search again
- handle multiple highlights with the same location - make the dictionary entry a list
- option2 smells promising.... though need to do something to make that situation (a) visible, and (b) transparent/fixable
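For comparison, the "make the dictionary entry a list" option (not the one I went with) would avoid the overwrite problem roughly as below. This Python 3 sketch also unescapes numeric entities like the #8212 dash before searching; `order_highlights` is an illustrative name and the sample body is invented:

```python
import html
from collections import defaultdict


def order_highlights(body, highlights):
    """Return (highlights in body order, highlights that were not found)."""
    text = html.unescape(body)  # turns numeric/named entities into characters
    by_position = defaultdict(list)
    for h in highlights:
        # Several highlights may share a position (notably -1), so keep a list.
        by_position[text.find(h)].append(h)
    misses = by_position.pop(-1, [])
    ordered = [h for pos in sorted(by_position) for h in by_position[pos]]
    return ordered, misses


body = "<p>Alpha &#8212; beta.</p><p>Gamma delta.</p>"
highlights = ["Gamma delta.", "nowhere one", "Alpha \u2014 beta.", "nowhere two"]

ordered, misses = order_highlights(body, highlights)
print(ordered)
print(len(misses), "misses")
```

Both misses survive here, so they can be logged per article or appended at the end instead of being silently lost.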

Sept20 got opt2 working - adding that-article count. Will dig into some specific examples when I hit them.

Sept24

realized that this model is generating more exceptions than the original one. So have reverted for now. Will come up with test to catch when timestamps are wrong/matching, and only go to string-search for those articles...
for now, added support for my own notes (usage pattern: attach note to a highlight).
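The test I have in mind for the timestamp problem could be as simple as checking for duplicate timestamps within one article and only then falling back to string-search. A hypothetical Python 3 sketch (not yet part of the script); `position_of` stands in for whatever search function is used:

```python
def timestamps_are_ambiguous(highlights):
    """True if two or more highlights in the article share a timestamp."""
    times = [h["time"] for h in highlights]
    return len(set(times)) < len(times)


def order_for_article(highlights, position_of):
    """Order by time when it is trustworthy, else by position in the body."""
    if timestamps_are_ambiguous(highlights):
        # Highlights that are not found (-1) sort to the front.
        return sorted(highlights, key=lambda h: position_of(h["text"]))
    return sorted(highlights, key=lambda h: h["time"])


body = "first point. second point."
same_time = [
    {"text": "second point.", "time": 100},
    {"text": "first point.", "time": 100},
]
print([h["text"] for h in order_for_article(same_time, body.find)])
```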

Edited: 2021-09-25 12:07:51.115725

Bill Seitz