InstaPaper

(Fixing my highlight-position code....) The center of my Highlighting And Annotating process for online reading: I wrote scripts to pull down my highlights and auto-post them to the WikiLog. InstaPaper is a ReadItLater app/service that scrapes (typically longer-form) web articles and saves them in a clean-text format for reading later, typically with Data Synch to a Mobile device for Off-Line reading.

Created by Marco Arment. Launched 2008.

  • sold to BetaWorks 2013
  • Brian Donohue joined 2013 to build iOS app, then promoted to General Manager, later CEO.
  • acquired by PInterest 2016. (Brian went along.)
  • spun back out mid-2018

For over a year I had the free service, and used the free/lite InstaFetch client on my Archos70 Tablet.

Sept'2012: started paying for full service, and bought the official Android client for my Archos70. (via Amazon AppStore since Archos didn't get Google store)

The bookmark to save doesn't work well with GitHub Markdown pages - use http://instapaper.com/hello2?url=

  • actually, that seemed to work in the web UI, but didn't synch anything useful to my Android app

Dec'2013: Bought Nexus Seven. Everything migrated over nicely.

Nov'2014: frustrated that Amazon-store version is still stuck at v2.9.2, while official version is up to v4.2. They say if I just install the new app, it will synch up so I won't lose any of my inventory. Update: wow, it was so fast, I think the old local data file must have been left behind for re-use?

My Highlighting And Annotating process of online reading since 2017

2020 update: video of my process

I save everything to InstaPaper.

  • I read everything on my phone via the app
  • I Highlight as I read

Periodically I process on my laptop

  • I run a script that hits the InstaPaper API to grab a bunch of articles' highlights. I save each to a separate file. (Details in other section below.)
  • Manually adjust each file
    • fix filename-timestamp (typically requires opening original source, because nobody uses metadata), plus adjust filename to make URL I like
    • add author name if I feel like it
    • adjust any Markdown issues (usually code bugs)
    • put in WikiWords as appropriate - sometimes it's just bracketing existing phrases, other times it's added a parenthetical, etc.
    • sometimes I want something for my PrivateWiki instead, so I copy/paste by hand over there, save an empty file
    • put each edited file into a separate folder as PipeLine
  • run script2 which auto-posts the edited files to my WikiLog
  • then move all those files to another folder
  • when I run script1 it deletes all the posted articles before grabbing new ones
  • then (less often) I have a separate process to update my WikiGraph.

insta_repost

Sept'2015: would like to use their API to suck down items I've Liked/Starred/Hearted, so I can post them to my WikiLog. I want to scrape my highlights, the original piece title, and maybe even the first sentence/paragraph to quote.

The Python instapaperlib package that comes up most obviously uses the "simple" API which offers little except adding a URL to your account.

The instacache code uses the "full" API, so I'm going to just steal pieces of that to create my own stuff. I want to store my stuff in files so I can browse/edit them more easily before

Oct07: applying for API OAuth token.

Oct09: stepping through code one line at a time

  • when I get to client.request("%s/oauth/access_token" I get an error which seems to be related to linking against an old version of OpenSSL.
  • python -c "import ssl; print ssl.OPENSSL_VERSION" from here gives me OpenSSL 0.9.7l 28 Sep 2006 which is bad
  • going to try updating OpenSSL
    • but brew update gives
xcrun: error: active developer path ("/Applications/Xcode.app/Contents/Developer") does not exist, use xcode-select to change
Error: Failure while executing: git init 
  • update: ended up re-installing Xcode, which fixed some issues

Nov09: nudging forward with VirtualEnv for Flask

  • typing piece at a time, now get past resp, token = client.request(...) successfully, so have token value
  • but everything I try to do after that gives me 403 response
  • ah, looks like need to create new Client instance with the token
  • and need to create the token object rather than just passing in the token value?
  • yep that works
  • can ask for folder_id='starred' in payload to get the starred/liked/hearted items
  • results look like:
'[
{"type": "meta"}, 

{"username": "fluxent@gmail.com", "user_id": 1761795, "type": "user", "subscription_is_active": "1"}, 

{"hash": "pzvgEV0W", "description": "", "bookmark_id": 653717723, "private_source": "", "title": "Open issues: lessons learned building an open source business", "url": "http://werd.io/2015/open-issues-lessons-learned-building-an-open-source-business", "progress_timestamp": 1447170795, "time": 1447162969, "progress": 0.966090679169, "starred": "1", "type": "bookmark"} ]'
  • trying to get the first-line of the original text can include lots of junk, and there's no clear end to the line except maybe </p>\n
<figure><a href="https://www.flickr.com/photos/kid_pro_quo/243281786"><img alt="South Park" src="https://farm1.staticflickr.com/92/243281786_d03baeab9d_z.jpg?zz=1"/></a><figcaption><a href="https://www.flickr.com/photos/kid_pro_quo/243281786">South Park</a></figcaption></figure>\n<p><strong><em>Prologue</em>:</strong></p>\n
  • getting the highlights for a bookmark gives a list for which one entry looks like:
{"highlight_id": 1816475, "text": "we didn\'t know how we were going to pay rent, and growth was linear. For a project, we were doing well. For a company, we weren\'t doing well - and there were still only two of us", "bookmark_id": 653717723, "time": 1446956770, "position": 0, "type": "highlight"}, 
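The bookmark-list response above is a flat list in which every item carries a "type" field. A minimal sketch of splitting such a response by type, written for Python 3 rather than the Python 2 used in this log. The function name split_by_type is hypothetical, and the sample is a trimmed copy of the response shown above, not a live API call.

```python
import json
from collections import defaultdict

def split_by_type(raw):
    """Group the items of a list-style response by their "type" field."""
    groups = defaultdict(list)
    for item in json.loads(raw):
        groups[item.get("type", "unknown")].append(item)
    return dict(groups)

# Trimmed sample shaped like the response above (assumption: same field names)
sample = '''[
 {"type": "meta"},
 {"user_id": 1761795, "type": "user", "subscription_is_active": "1"},
 {"bookmark_id": 653717723,
  "title": "Open issues: lessons learned building an open source business",
  "time": 1447162969, "starred": "1", "type": "bookmark"}
]'''

groups = split_by_type(sample)
# "starred" arrives as the string "1", not a boolean
starred_titles = [b["title"] for b in groups["bookmark"] if b.get("starred") == "1"]
print(starred_titles)
```

Grouping by type first means the meta and user entries never have to be special-cased when looping over bookmarks.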

Jan21'2017: being up-and-running with WikiFlux for a month now, getting back to this.

  • argh getting the same old-SSL error noted Oct09 above! Suspect VirtualEnv issue... derp yeah have to use my Flask VirtualEnv instead of WebPy one. Then print ssl.OPENSSL_VERSION gives OpenSSL 0.9.8za 5 Jun 2014

Jan22

  • turn all script bits into working code, as far as getting list
  • ToDo: handle UniCode content, decide when to turn to ascii (get unidecode package), when to allow
  • hrm time attribute reflects when I bookmarked it, not when page was written
    • accept/embrace that?
    • double-check that page-get-text function doesn't have anything - confirmed
    • scrape head of original source for metatag? use urllib2.response.info().dict['last-modified'] - but how often will it be there? Will try this first.
    • in some cases can parse url, but yuck
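A rough sketch of the "try Last-Modified first, otherwise accept the bookmark time" idea, in Python 3 with only the standard library. The function name best_date and the plain-dict headers argument are assumptions for illustration; it does not fetch anything, it only interprets headers that have already been retrieved.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def best_date(headers, saved_time):
    """Return (datetime, source), preferring Last-Modified over the save time.

    headers: dict of HTTP response headers (hypothetical input shape)
    saved_time: the bookmark's "time" value, in epoch seconds
    """
    # HTTP header names are case-insensitive, so normalize the keys first
    lowered = {k.lower(): v for k, v in headers.items()}
    value = lowered.get("last-modified")
    if value:
        try:
            return parsedate_to_datetime(value), "last-modified"
        except (TypeError, ValueError):
            pass  # malformed header: fall through to the save time
    return datetime.fromtimestamp(saved_time, tz=timezone.utc), "saved"

when, source = best_date({"Last-Modified": "Tue, 10 Nov 2015 13:42:49 GMT"}, 1447162969)
print(when.date(), source)
```

Returning the source alongside the date makes it easy to see later which files still need their timestamp fixed by hand.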

Jan23: dealing with unicode before time

  • thought if I just stripped unicode from chars going into url then I'd be ok
  • but can't just use file.write() with unicode either, sigh have to dig in

Feb18

  • UniCode/UTF8 solution, just f.write(title.encode("utf-8")) - seems to work
  • now getting HTTP error 403 on one article when grabbing the original to get the date MetaTag. Can grab target URL with browser fine. I wonder if it's blocking unknown agents?
  • grab it with CUrl and see it's giving a 301 redirect from http to https. Just putting this inside the try for now, let it skip these cases. Now grabbing all successfully. Next: grab the highlights.

Mar04'2017

  • derp somehow I hadn't updated some variables properly, so I had still been using some old API bits! (And/or they made a recent change.)
  • So now the highlights come along in the main initial call.
  • Interestingly, they aren't children of the bookmarked articles, but just a parallel flat list.
  • Frustratingly, it looks like highlight position is rarely used anymore! Which means if I scroll back in an article to add a highlight, it will be in time-order rather than article-position-order in my excerpts. Ah well.
    • more precisely: position is used for when the same highlighted phrase occurs multiple times in the same document - so if the phrase occurs only once, it has pos=0.
  • Almost never getting date MetaTag on original articles, and noticing how few original URLs have full date in them, so I think I'm going to have to suck it up and use the save-date... later.
  • Starting to modify my autopost code to handle these posts...
  • Done! Auto-posted 10 pages.
  • Next - delete them from InstaPaper. Done!
  • Status: WikiLog has 17,135 nodes. (hmm don't see a way to get my count of Liked items in InstaPaper)
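Since the highlights arrive as a flat list parallel to the bookmarks, they have to be matched up by bookmark_id. A minimal Python 3 sketch of that regrouping follows. The function name attach_highlights is hypothetical, and the second highlight in the sample is made up to show the ordering. It sorts by time because position is usually 0.

```python
from collections import defaultdict

def attach_highlights(items):
    """Return {bookmark_id: {"bookmark": item, "highlights": [...]}} from a flat list."""
    by_id = {}
    pending = defaultdict(list)
    for item in items:
        if item.get("type") == "bookmark":
            by_id[item["bookmark_id"]] = {"bookmark": item, "highlights": []}
        elif item.get("type") == "highlight":
            pending[item["bookmark_id"]].append(item)
    for bookmark_id, highlights in pending.items():
        if bookmark_id in by_id:  # ignore highlights whose bookmark is not in this batch
            by_id[bookmark_id]["highlights"] = sorted(highlights, key=lambda h: h["time"])
    return by_id

# Sample data: the first highlight is from the log, the second is invented
items = [
    {"type": "meta"},
    {"type": "bookmark", "bookmark_id": 653717723, "title": "Open issues"},
    {"type": "highlight", "highlight_id": 2, "bookmark_id": 653717723,
     "text": "a later highlight (made up)", "time": 1446957000, "position": 0},
    {"type": "highlight", "highlight_id": 1816475, "bookmark_id": 653717723,
     "text": "growth was linear", "time": 1446956770, "position": 0},
]

grouped = attach_highlights(items)
ordered_ids = [h["highlight_id"] for h in grouped[653717723]["highlights"]]
print(ordered_ids)
```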

fixing highlight-position issue

In some cases the highlights come through in random order, all with identical timestamps. So I'm going to have to find the position of each highlight in the body.

Sept04'2020

  • want to
    • fix the highlight-position issue
    • scrape notes (which I rarely use so far) as well as highlights
  • position first...
  • API method /bookmarks/get_text returns the cleaned body of the article, in HTML. But the highlights are plaintext (despite the API doc) (unless my wrapper library is responsible for converting). Also, fairly basic chars like curly apostrophes will be "normal" in the body-HTML, but \u2019 in the highlight!
  • using body.encode('utf-8') but getting UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2137: ordinal not in range(128) - hmm it's a little trademark symbol
  • use body.decode('utf-8').encode('utf-8'), but then when I do str.find() I get the same can't decode error
  • reading Unicode notes to stop flailing
    • If you must decode strings manually, you can simply do my_string.decode(encoding), where encoding is the appropriate encoding. Python 2.x supported codecs are given here: Standard Encodings. Again, if you get UnicodeDecodeError then you've probably got the wrong encoding. wtf do you do if you don't know what encoding you're being handed? Trial and error?!?
      • ah, correct value is utf_8 not utf-8
    • hmm maybe this: HTTP: Web pages can be encoded in just about any encoding. The Content-type header should contain a charset field to hint at the encoding. The content can then be decoded manually against this value. Alternatively, Python-Requests returns Unicodes in response.text.
      • it says UTF-8
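A small sketch of decoding the body with the charset named in the Content-Type header, as the quoted note suggests. This is Python 3, where the raw body is bytes and the decoded result is str (roughly the str vs unicode split in Python 2). The function name decode_body and the UTF-8 default are assumptions.

```python
from email.message import Message

def decode_body(raw_bytes, content_type, default="utf-8"):
    """Decode raw response bytes using the charset from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset() or default
    try:
        return raw_bytes.decode(charset, errors="replace")
    except LookupError:
        # The header named a codec Python does not know; fall back to the default
        return raw_bytes.decode(default, errors="replace")

raw = "it\u2019s a Thing\u2122".encode("utf-8")
text = decode_body(raw, "text/html; charset=UTF-8")
print(text == "it\u2019s a Thing\u2122")
```

Decoding once at the boundary keeps everything downstream (find, comparisons, file writes) working on text rather than bytes.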

Sept13

  • let's clarify
  • type(body)=<type 'str'> but it contains non-ascii characters. The HTTP response-header says it's UTF-8.
  • derp should be calling decode() not encode()? This makes it <type 'unicode'>
  • that seems to be working, or at least the error moved further down. Ah, had to restructure my dictionaries. Now working.
  • But sometimes highlights still not in the right order. Ah, if a highlight contains a special character like \u2019 then the find fails and the position gets set to -1.
  • derp I had a normalize() call on the body which turned it from <type 'unicode'> to <type 'str'>. I took out that call, so it stays unicode, and now everything seems like it's working.
  • found a failure case - when highlight crosses paragraphs, and has \n\n in it, it doesn't get found in the body. I'm just using highlight.splitlines()[0] so I just search for the first line of the highlight.
  • another failure.... the body has a place where there are 2 spaces in a row, and the highlight does not. ?!?!? I think I'm going to let this ride, and maybe try to count how often it happens or something....
    • 50 articles -> 34 misses ugh
  • check another case - another double-space. Options
    • convert all double-spaces to single-spaces. Pro: HTML ignores multi-spaces anyway (except maybe in multiple lines of code, which I never highlight). Going with this.
    • only do find with the first sentence. Con: as that link shows, defining sentences can be tricky (and sometimes I start a highlight mid-sentence, which might make it even trickier).
    • only do find on the first n characters - 32? Con: risk of non-uniqueness, plus having a short sentence/phrase that results in double-spaces earlier than you'd expect, esp given my occasional habit of starting highlight mid-sentence.
    • that solved a bunch of cases. But still 26 fails (not an exact match to the prior experiment because the set of articles has shifted).
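The two workarounds above (search only for the first line of the highlight, and collapse runs of spaces) can be sketched together like this, in Python 3. The names normalize_spaces and highlight_position are hypothetical. Both strings are assumed to be already-decoded text, and the sample body is invented plain text rather than real get_text HTML.

```python
import re

def normalize_spaces(text):
    """Collapse runs of two or more spaces into a single space."""
    return re.sub(r" {2,}", " ", text)

def highlight_position(body, highlight):
    """Return the offset of the highlight's first line in the body, or -1 if not found."""
    lines = highlight.splitlines()
    first_line = lines[0] if lines else ""
    if not first_line:
        return -1  # an empty needle would otherwise "match" at offset 0
    return normalize_spaces(body).find(normalize_spaces(first_line))

body = "Intro.  We didn\u2019t know how.\n\nSecond para here."
highlight = "We didn\u2019t know how.\n\nSecond para"
print(highlight_position(body, highlight))
```

The offsets are into the space-normalized body, which is fine for ordering highlights within one article but not for slicing the original text.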

Sept14

  • checking first fail-article, which has 4 fails. Many characters like #8212 (that's not the full value, somewhere this page is converting it! - it's like &#___;) - going to handle that one by hand and check the others....
  • ugh failures are more serious than I thought - since I'm using the 'position' as the dictionary index for sortability, if there are multiple fails within a single piece they all get position= -1 and therefore keep writing over each other, so I lose n-1 of the failed highlights. Do I
    • improve my logging of per-article rather than per-batch fails
    • skip an article with more than 1 failure
    • solve all the failures....
    • option2 smells promising.... though need to do something to make that situation (a) visible, and (b) transparent/fixable
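One way to avoid the overwriting problem is to keep (position, text) pairs in a list instead of a dictionary keyed by position, so that several -1 results can coexist. A minimal Python 3 sketch, with hypothetical names order_highlights and should_skip; the threshold of 1 mirrors option 2 above.

```python
def order_highlights(found):
    """Split (position, text) pairs into ordered hits and a list of misses.

    found: list of (position, text), with position == -1 when the search failed.
    """
    located = sorted((pair for pair in found if pair[0] >= 0), key=lambda pair: pair[0])
    missed = [text for position, text in found if position < 0]
    return [text for _, text in located], missed

def should_skip(missed, max_failures=1):
    """Option 2: skip an article with more than max_failures unlocated highlights."""
    return len(missed) > max_failures

found = [(120, "b"), (-1, "x"), (15, "a"), (-1, "y")]
ordered, missed = order_highlights(found)
if should_skip(missed):
    # Make the situation visible so the article can be fixed by hand
    print("skipping article: %d highlights not located: %r" % (len(missed), missed))
```

Keeping the missed texts around, rather than just counting them, covers both the "visible" and the "fixable" parts.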
