Converting Moinmoin Text To Markdown

Whichever WikiEngine I work with next after MoinMoin, I'll want the SmartAscii to be MarkDown, just because it's the winning standard.

What differences in syntax do I have to worry about?

The meta-question is: which Mark Down?

So I guess I'll have to see whether those differences matter...

(Also note I already tweaked MoinMoin during Cloning Zwiki With Moinmoin.)

Will strip beginning z from BlogBit pages.

Italics/bold are fine as-is, no conversion needed.

Bullet-lists (where are double-returns prohibited? before? between items?), esp. nested ones.

Pre-formatted text

  • inline - backticks - no change
  • blocks - Common Mark makes "fenced code blocks" when surrounded by triple-ticks. (Opening set must be on its own line. Does it need double-break before? Test!)

Oct'2015: just decided to switch from Smashed Together Words to Free Link for Automatic Linking.

Links to other sites are different: [label](url)

  • and a raw URL on its own doesn't auto-link?!?! Have to do <url>. Ugh, I have a lot of those, and they're not even always on a line by themselves

Images? Linked images? Do I have any?


Nov15'2015: have run the "scraping" code to make a metadata file and copy the latest MoinMoin version-file of each page to a re-named target, so I can download and convert.

I think the first conversion step is to replace the Smashed Together Words with double-brackets

  • and batch-convert to Expanding Wiki Words
    • both for page titles, and wherever referenced in other pages
  • and deal with BlogBit z-prefix cases
    • just decided to change pretty Title to look like Blogbit With Pretty Long Title Have To Decide How To Render (2015-11-12)
      • nope changed back to (2015-11-12) Blogbit With Pretty Long Title Have To Decide How To Render to keep order similar to URL
    • and eliminate the - right after the end of a WikiWord for plurals (and maybe other cases I'm not remembering)
  • How?
    • already have list of pages; run code over list to make Expanding Wiki Words
    • manually review list, manually override exceptions (mainly single-word WikiWord-s) - should I copy this to a separate list?
    • update my WikiGraph list so I have every WikiWord reference
      • already have a fresh list, but have to manually review weird UniCode cases my code isn't handling right
      • hmm, also have issue with WikiWord-s that are used but don't have a page. Maybe make separate list of those and manually review, etc. (better to review merged list)
    • run my mapping list against that so I have nice list of substitutions specific to every page
    • then run code that steps through every page and replaces each case from the WikiGraph map (see the sketch after this list)
    • then delete any - immediately following ]]
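
A minimal sketch of that substitution step, assuming a tab-separated lookup file mapping each SmashedTogetherWord to its expanded title (the file format and function names here are illustrative, not the actual convert-script code):

```python
import re

def load_word_map(path):
    # Assumed format: one "SmashedTogetherWord<TAB>Expanded Title" pair per line.
    word_map = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if "\t" in line:
                word, expanded = line.rstrip("\n").split("\t", 1)
                word_map[word] = expanded
    return word_map

def expand_wiki_words(text, word_map):
    # Longest names first, so e.g. OSAFStatusReports wins over OSAF.
    for word in sorted(word_map, key=len, reverse=True):
        text = re.sub(r"\b%s\b" % re.escape(word), "[[%s]]" % word_map[word], text)
    # Delete any - immediately following ]] (the old plural style, e.g. "WikiWord-s").
    return text.replace("]]-", "]]")
```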

Then do regular external links - that should just be a regex in a text editor (rough sketch below).
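
A rough Python equivalent of that regex pass for the bare-URL case noted earlier (a sketch only; it skips URLs already inside <...> or [label](...) syntax, and the output would still need eyeballing):

```python
import re

# Wrap bare http(s) URLs in angle brackets so CommonMark autolinks them.
# The lookbehind skips URLs already preceded by "<" or "(", i.e. ones that
# are already in <url> or [label](url) form.
BARE_URL = re.compile(r"(?<![(<])\bhttps?://[^\s<>)\]]+")

def angle_bracket_urls(text):
    return BARE_URL.sub(lambda m: "<" + m.group(0) + ">", text)
```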

Then all the other SmartAscii stuff above...

Converting SmashedTogetherWords: realized my current code has lots of false positives, catching strings that are inside URLs, etc. Discovered this while browsing the graph subset of WikiWords that don't have pages. Options:

  • assume the bad cases aren't SmashedTogetherWords, just cap-strings. Browse the list and delete the non-words (by deleting a line from the look-up file, that string doesn't get substituted when found)
  • add the other bits of MarkDown conversion, run the overall script against the batch, then run the output against an HTML validator.

Doing links

Fix the way that saving an Edit redirects to FrontPage rather than to the page just edited. Done Mar03'2016.

Doing lists

  • to avoid getting p tags around list items, keep it "tight", so get rid of the empty lines (double-returns): but how to do this?
  • maybe could keep state as stepping through lines. Have to make sure I exit properly, so keep a double-return between the end of a list and the next paragraph.
  • ''or maybe could do regex!''
  • first-level list items don't need a leading space
  • ugh, nesting requirements are confusing, prefer to experiment
    • if the first level has no leading space
      • a child needs 2 leading spaces
      • a grandchild needs 4, etc.
    • if the first level has 1 leading space
      • a child needs 3
      • a grandchild needs 5
  • hmm, do I want to keep the leading space for the top level? I prefer not to. On the other hand, PikiPiki requires it. Conclusion: get rid of it.
  • so, given that MoinMoin pages always have the top level with 1 leading space... if a line in MoinMoin has n leading spaces, then for CommonMark it should become (sketch after this list):
    • if n=1, then nn=0
    • if n>1, then nn=(n-1)*2 (so actually don't need the previous special case) - done Mar05?
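
A sketch of that per-line indentation rule, assuming bullets look like MoinMoin's usual " * item" form (the regex and function name are just for illustration):

```python
import re

# MoinMoin bullet line: one or more leading spaces, "*", a space, then the item.
MOIN_BULLET = re.compile(r"^( +)\* (.*)$")

def convert_bullet_indent(line):
    # n leading spaces in MoinMoin (top level = 1) becomes (n - 1) * 2 in
    # CommonMark, which also makes the n=1 -> 0 case fall out automatically.
    m = MOIN_BULLET.match(line)
    if not m:
        return line
    n = len(m.group(1))
    return " " * ((n - 1) * 2) + "* " + m.group(2)
```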

Also

  • a row of dashes does a horizontal rule fine
  • Italics doesn't cover multiple lines - confirmed. Options? (Found in 92 files out of 16k)
    • regexp - nope: multi-line, leading bullets, etc.
    • write the Python code - ugh, it's not even consistent; sometimes the ital-open is by itself on a line, sometimes in the middle of a line; sometimes the ital-close is by itself, sometimes at the end of a line
    • manually conform the start/end markers, then batch-run code
    • decision: ignore, fix by hand when I run across it (might use Python for that, but manually set up the start/end)

Images

  • CommonMark doesn't give you a way to pass an img width or put a break around it; it actually surrounds the image with p tags!
  • and you can't just have the truly-raw URL, or the raw URL surrounded with angle brackets.
    • hmm, try just putting it in parens
  • ''maybe I should batch-convert to a nicer chunk of HTML?''
    • yikes, some of them already are
    • and some have local URLs
  • put examples from current pages into a test doc (with the name of the source); test possible target outcomes in WikiFlux. Conclusion: need the ![label](url) format.
    • (update: also note the CommonMark way to do a linebreak, which might make sense before an image)
  • plan (sketch after this list)
    • convert raw URLs (jpeg, jpg, png) surrounded by whitespace into the bracket-paren syntax
      • in TextWrangler, the search-grep is (\s)(http[^ ]+(jpg|jpeg|png))(\s)
      • should I just do this in TextWrangler instead of code?
    • between this and the already-HTML cases, the vast majority are handled, so ignore the rest
    • maybe a future feature to turn all CommonMark images into fancier output (right-flush, smaller, link-to-full...)
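
A Python version of roughly the same pass, hedged as a sketch (like the TextWrangler grep above, it assumes the URL ends in the image extension and is surrounded by whitespace):

```python
import re

# Whitespace, an http URL ending in an image extension, whitespace -
# close to the TextWrangler grep (\s)(http[^ ]+(jpg|jpeg|png))(\s).
IMG_URL = re.compile(r"(\s)(http\S+(?:jpg|jpeg|png))(\s)")

def bracket_image_urls(text):
    # Rewrite the bare URL as CommonMark image syntax with an empty label.
    return IMG_URL.sub(r"\1![](\2)\3", text)
```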

fenced code blocks - Mar30

  • start/end with 3 ticks, not 3 curly-brackets (sketch after this list)
  • needs to open at start of line - so have to adjust those
  • my closings are fine
  • again, should I just use TextWrangler?
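
A sketch of the fence conversion under those assumptions (closings already on their own line; openings may be mid-line and get pushed onto a new line):

```python
import re

OPEN_FENCE = re.compile(r"^(.*?)\{\{\{\s*$")

def convert_fences(lines):
    out = []
    for line in lines:
        m = OPEN_FENCE.match(line)
        if m:
            # If the {{{ was mid-line, keep the leading text and put the
            # fence at the start of its own line, as CommonMark requires.
            if m.group(1).strip():
                out.append(m.group(1).rstrip())
            out.append("```")
        elif line.strip() == "}}}":
            out.append("```")
        else:
            out.append(line)
    return out
```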

Next

  • When creating a new page, it isn't starting out by making the title SpaceSeparated - ToDo
  • need to add more InterWiki cases (in code): ISBN, ASIN - '''check ASIN'''
  • check the number of files - seems way low

Pondering auto-post architecture

  • cases for auto-posting
    • conversion/bootup/migration
    • InstaPaper
    • TwittEr
  • should the ongoing bits be combined? should everything be combined?
  • where/how will I run them?
  • probably never hands-off; will scrape/process, then review/tweak by hand, then post
  • could have scraper code separated-by-partner, with central AutoPost code
  • for the ongoing pieces, probably better to post via HTTP, but the up-front might be on-server - nah, just do the HTTP API
    • actually not an API, just an HttpPost like with a form
  • pass created_date, with modified_date being the same. Done Apr05
  • hmm, CSRF? do I even have that working?

Next: AutoPost for converted MoinMoin pages

  • have a working v1 - Apr14'2016
    • using the requests library to do an HttpPost with cookies (sketch after this list)
  • bugs
    • the title field has double-square-brackets around it
    • the body has an asterisk at the end - yikes, the real issue is that only the first line of the body got posted!
    • but the dates are right!
  • fix the brackets and first-line-only bugs: Apr15
  • now posting the whole directory! Apr15
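
The v1 auto-poster is roughly this shape - a hedged sketch, since the login URL, edit URL, and form field names below are stand-ins rather than the real WikiFlux ones:

```python
import requests

session = requests.Session()  # keeps the login cookie across requests

# Hypothetical login endpoint and credentials - stand-ins, not the real ones.
session.post("http://localhost:8082/login",
             data={"username": "me", "password": "..."})

def autopost(page_name, body, created_date):
    # Post the converted page as an ordinary HTML-form-style HttpPost.
    resp = session.post(
        "http://localhost:8082/wiki/%s/edit" % page_name,
        data={
            "body": body,
            "created_date": created_date,
            "modified_date": created_date,  # same as created, per the Apr05 note
        },
    )
    resp.raise_for_status()
    return resp
```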

'''Back to SmashedTogetherWords''' Apr16

Weird cases

  • typical BlogBit pages
    • if my new style is to have the date piece in parens, then a space, what happens to the URL?
      • in naming the page, I give the URL, which doesn't change, then tweak the title. So it's a non-issue.
      • in referring to such a page from a different page - ugh, it doesn't compress the spacing; am I doing that on purpose? Apparently, because it ''does'' compress when referring to a page that doesn't start with a number. That's in wikiweb/url_from_wiki_word(str).
    • where I want to change the spacing
    • what I want (in referring to a BlogBit page from inside another page)
      • typically pasting in the URL-name, so that should work as-is
      • don't see a need to add the parens in this usage
      • but could definitely see adding the spacing - and should do this in the bulk-conversion
      • so the wikiweb code should remove any spaces - but be careful: capwords() might remove dashes
      • fixed wikiweb Apr25
  • early pages that just have a tie-breaking single letter following the date? There are 600 of them!
    • if I rename them, I have to catch any links to them!
    • process I could use
      • manually review each file
      • don't rename the file, but rename its entry in 1 meta file: page_names.txt
      • then in the conversion process
        • catch any link and rename it
        • rename the output file

But I don't even have the regular link-conversion working yet!

  • ugh, realize that in page_names.txt I have the paren-formatted BlogBit title, not the simpler format. So, to make the ref-handle, I'll have to check for that and convert in code. (Considered "fixing" the list file, but that creates redundant text, so changes would have to be edited in both places.)
  • finish the first working cut, but seeing some weird final results. Realize that I should do this ''after'' running the regular link-conversion code. Now working Apr26.

Run a batch, auto-post. Looking pretty good.

  • issue where a WikiWord is inside parens - get an extra space. Have to track that down.
    • it's not in the MarkDown, it's in the rendering of the MarkDown.
    • ah, a line already commented as suspicious in wikiweb/repl_wiki_word()...
    • isolate and comment out that logic. Now things look good. Apr26.
  • then lots of fixing of ugly WikiWord cases in the meta file - done Apr28
  • argh, am I going to also review the PagelessWords meta file? Right now I'm falling back to just using a generic rule on all those cases, but that isn't kosher either. Or maybe it's OK; I can correct as I find them... but I'm nervous about bad cases... should review the file to look for some of them, and go back to the source to see whether each is a real fail or just a weird case. Then decide what to do.
    • did lots of pattern cleaning
    • searched for numbers and lots of spaces to dump suspicious strings
    • then dropped any string that lacked a lower-case letter! Because these aren't pages anyway... done Apr29
  • next - rewrite wikiweb/repl_wiki_word() to use pageless_words, then fall back to just passing through - no rule-based bracketing - Apr29
  • finish all the posting: Apr30
  • derp, realize I never ran all the regex stuff noted at the end of my convert script! But going to review anyway

Review sample of pages:

  • raw links not done! (need to record the regex)
    • actually, weirdly inconsistent!
  • mis-linking a subset WikiWord: AI inside AIML! http://localhost:8082/wiki/2014-07-06-AiAssistedCoaching
  • ISBN links not working (actually the ISBN prefix itself wants to link, but not the full chain)
  • See a working InterWiki case, but it uses brackets. The raw case doesn't work.
  • pages that were renamed have a ## comment at top, which is now getting converted to an H2. Maybe just remove that line?

Working on InterWiki

  • set the regexp (sketch after this list)
  • probably have to run this ''before'' the other scripts so the individual names don't get bracketed
    • grr, doesn't help - it removed the existing brackets!?!?
    • options for getting through here
      • nope, makes more sense to fix afterwards - catch ]]:
  • important to make sure any space-names are defined in page_names so they stay rendered as SmashedTogetherWords
  • need code cases for MUPC, PhilJones
  • hrm, another edge-case: where the page-name piece matches one of my page names, it gets space-separated, so the initial rendering of WikiWikiWeb:WardCunningham becomes WikiWikiWeb:Ward Cunningham - options
    • live with it
    • do a separate regex for it
    • prevent it with a totally different approach
  • another fail - an ISBN where I left the dashes in only gets matched/bracketed up to the first dash (at least if I use \w+ as my regex pattern). Options:
    • replace those cases separately (remove the dashes before starting?)
    • loosen my core InterWiki pattern to include dashes - probably better - yes, that works
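
A sketch of the loosened InterWiki pattern with dashes allowed in the name part; the prefix list and URL templates below are illustrative, not the actual code cases:

```python
import re

# Prefix, colon, then word characters plus dashes, so a dashed ISBN like
# ISBN:0-13-110362-8 matches past the first dash (the \w+ version stopped there).
INTERWIKI = re.compile(r"\b(WikiWikiWeb|ISBN|ASIN):([\w-]+)")

# Illustrative URL templates only.
INTERWIKI_URLS = {
    "WikiWikiWeb": "http://wiki.c2.com/?%s",
    "ISBN": "https://www.amazon.com/s?k=%s",
    "ASIN": "https://www.amazon.com/dp/%s",
}

def link_interwiki(text):
    def repl(m):
        prefix, name = m.group(1), m.group(2)
        return "[%s:%s](%s)" % (prefix, name, INTERWIKI_URLS[prefix] % name)
    return INTERWIKI.sub(repl, text)
```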

Another problem: false-positive matches. Once a WikiWord string is identified in a doc, all instances of it get replaced with its mapped value - even when it's part of a longer word. So, for instance, OSAF gets replaced even when it's inside OSAFStatusReports (sketch below).

  • just live with it, since it's pretty rare?
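
The alternative to living with it would be anchoring the lookup on word boundaries, so a WikiWord only matches when it isn't embedded in a longer one - a small sketch:

```python
import re

text = "See OSAFStatusReports and the main OSAF page."

# Plain substring replacement also hits the OSAF inside OSAFStatusReports:
text.replace("OSAF", "[[OSAF]]")
# -> 'See [[OSAF]]StatusReports and the main [[OSAF]] page.'

# Word-boundary matching only replaces the standalone WikiWord:
re.sub(r"\bOSAF\b", "[[OSAF]]", text)
# -> 'See OSAFStatusReports and the main [[OSAF]] page.'
```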

["z2016-06-17-SettingUpFlaskAtLinodeForWikiFlux"]

Jun30'2016 omg just realized my new BlogBit URL model means all my old URLs will break! Not Cool! So need to redirect them (one possible shape sketched below).

  • fixed
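
The actual fix isn't recorded here; one possible shape, assuming the breaking change was dropping the leading z from BlogBit URLs, would be a catch-all redirect in the Flask app (route and assumption both hypothetical):

```python
from flask import Flask, redirect

app = Flask(__name__)

# Hypothetical: permanently redirect old z-prefixed BlogBit URLs to the
# new un-prefixed names, so old inbound links keep working.
# (Would need a guard if any non-BlogBit page names legitimately start with z.)
@app.route("/wiki/z<path:rest>")
def redirect_old_blogbit(rest):
    return redirect("/wiki/" + rest, code=301)
```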

Nov12'2016 start conversion of content changed since last batch (a year ago!)

  • modify the scrape function to only grab/list pages that were added/changed since last time (sketch after this list)
  • run script that generates page_names.txt and pageless_words.txt
  • manually edit page_names.txt
  • realize pageless_words.txt is wrong because it's working from a subset. So toss it and use the old one (it will be missing a little, but that's OK) (also note it now has some false entries, for cases where I had a WikiWord but no page last year, but have added a page since)
  • run conversion
  • fix InterWiki, then ISBN/ASIN
  • fix images
  • next: scan other pages for WikiWord substitution. Maybe it's using pages-list for full list of names to match on? Need separate file?
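
A sketch of the incremental-scrape idea, keying off the mtime of each page's newest revision file (MoinMoin's data/pages layout assumed; the paths and the cutoff date are illustrative):

```python
import os
import time

PAGES_DIR = "/path/to/moin/data/pages"   # illustrative
LAST_RUN = time.mktime(time.strptime("2015-11-15", "%Y-%m-%d"))

def changed_pages(since=LAST_RUN):
    # Yield (page_name, latest_revision_path) for pages changed since the cutoff.
    for name in os.listdir(PAGES_DIR):
        revdir = os.path.join(PAGES_DIR, name, "revisions")
        if not os.path.isdir(revdir):
            continue
        revs = sorted(os.listdir(revdir))
        if not revs:
            continue
        latest = os.path.join(revdir, revs[-1])
        if os.path.getmtime(latest) > since:
            yield name, latest
```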

Nov25: work with new meta pages, re-run. Haven't reviewed yet.

Flip switch! ["z2016-11-21-PushWikilogMigrationThanksgiving"]

Nov29: review converted pages

  • some bad cases where wikiname is inside phrase. Will catch some by hand, move along.
  • issue with raw https links: fixed manually
  • ISBN, ASIN, InterWiki
  • next: images

(Have been hand-finishing some pages to post manually.)

Dec10

  • fenced blocks
  • hr tags
  • next: variation of auto-post code to test for pre-existence, do edit

Dec24

  • copy over production code to replace local, so can start working on changes to do post-edit
  • code not working! issue with CommonMark versions, probably a VirtualEnv issue.
    • Dec26: Nope, just a case difference in the HtmlRenderer() call (sketch below).
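
For reference, the render step looks roughly like this with the commonmark package; the renderer class's capitalization differs across CommonMark-py releases, which is what bit here (this sketch uses the newer lowercase module name):

```python
import commonmark

# Parse the converted page text and render it to HTML.
parser = commonmark.Parser()
ast = parser.parse("Some **converted** MoinMoin page text.")
html = commonmark.HtmlRenderer().render(ast)
```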

Dec28

  • finish new autopost handler
  • actually post 2016 backlog of 500+ pages!

Dec29

  • 17102 records in 'nodes'
  • grr, the wikigraph db has 17191, and that isn't even current
  • the newest raw/full graph file, from 2012, had 15664
  • had 16395 in May'2014 https://twitter.com/BillSeitz/status/466002169930723328
  • should export a list from WikiGraph and see what's missing: copy (select url from pages where space_name='WebSeitzWiki') to '/etc/postgresql/9.1/pages_wikigraph.txt'
  • ugh, the difference is even worse because there are a bunch of cases where there are pages in wikiflux but ''not'' in wikigraph, typically newer pages that didn't get added to the graph. So that means there are ''more'' MoinMoin pages missing than mere subtraction would imply.
  • plan: script to (sketch after this list)
    • local: list pages only missing from MoinMoin
    • at BlueHost: scrape that list of pages
    • then have to convert, autopost, etc.
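
A sketch of the local diff-list step, comparing the WikiGraph export against a dump of the wikiflux node names (pages_wikigraph.txt is the export mentioned above; the other file names are stand-ins):

```python
def load_names(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

# pages_wikiflux.txt is a stand-in for however the 'nodes' list gets dumped.
wikigraph = load_names("pages_wikigraph.txt")
wikiflux = load_names("pages_wikiflux.txt")

missing = sorted(wikigraph - wikiflux)
with open("pages_to_scrape.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(missing) + "\n")
```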

Jan11'2017

  • finally generate diff-list. Oh, not even 20! (Think there was a dupes issue somewhere)
