Converting Moinmoin Text To Markdown
Whatever WikiEngine I work with next after MoinMoin, I'll want the SmartAscii to be MarkDown, just because it's the winning standard.
What differences in syntax do I have to worry about?
The meta-question is: which Mark Down?
- Common Mark
- whichever flavor is used in Flask https://pythonhosted.org/Flask-Markdown/
- whichever flavor is used in WebPy
- actually for either Flask or WebPy I could probably swap in a Common Mark library without much trouble.
So I guess I'll have to see whether those differences matter...
(Also note I already tweaked MoinMoin during Cloning Zwiki With Moinmoin.)
Will strip the beginning `z` from BlogBit pages.
Italics/bold are fine as-is, no conversion needed.
Bullet-lists (where are double-returns prohibited? before? between?), esp nested.
Pre-formatted text
- inline - backticks - no change
- blocks - Common Mark makes "fenced code blocks" when surrounded by triple-ticks. (Opening set must be on its own line. Does it need double-break before? Test!)
Oct'2015: just decided to switch from Smashed Together Words to Free Link for Automatic Linking.
Links to other sites are different: [label](url)
- and just a raw URL doesn't auto-link?!?! Have to do `<url>`. Ugh, I have a lot of those, and they're not even always on a line by themselves.
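Something like this could batch-wrap the bare URLs (rough sketch; the pattern and the skip-checks are guesses, would need testing against real pages):

```python
import re

# Wrap bare http(s) URLs in angle brackets so CommonMark auto-links them.
# The lookbehind is a crude way to skip URLs already inside <...> or
# [label](...) syntax; trailing punctuation would still need hand-checking.
BARE_URL = re.compile(r'(?<![<(])\b(https?://[^\s<>()]+)')

def angle_bracket_urls(text):
    return BARE_URL.sub(r'<\1>', text)

# angle_bracket_urls("see http://webseitz.fluxent.com/wiki for details")
# -> "see <http://webseitz.fluxent.com/wiki> for details"
```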
Images? Linked images? Do I have any?
Nov15'2015: have run the "scraping" code to make a metadata file and copy the latest MoinMoin version-file of each page to a re-named target, so I can download and convert.
I think the first conversion step is to replace the Smashed Together Words with double-brackets
- and batch-convert to Expanding Wiki Words
- both for page titles, and wherever referenced in other pages
- and deal with BlogBit z-prefix cases
- just decided to change the pretty Title to look like `Blogbit With Pretty Long Title Have To Decide How To Render (2015-11-12)`
- nope, changed back to `(2015-11-12) Blogbit With Pretty Long Title Have To Decide How To Render` to keep the order similar to the URL
- nope, changed back again
- and eliminate `-` right after the end of a WikiWord for plurals (and maybe other cases I'm not remembering)
- How?
- already have list of pages; run code over list to make Expanding Wiki Words
- manually review list, manually override exceptions (mainly single-word WikiWord-s) - should I copy this to a separate list?
- update my WikiGraph list so I have every WikiWord reference
- run my mapping list against that so I have nice list of substitutions specific to every page
- then run code that steps through every page and does manual replacement of each case from the WikiGraph map
- check out the cases where a BlogBit is referenced on a page, since those already have double brackets
- then delete any `-` immediately following `]]`
Then do regular external links - that should just be a regex in a text editor.
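Rough sketch of that regex in Python, assuming the old MoinMoin bracket style `[http://example.com some label]` (newer MoinMoin's `[[url|label]]` form would need a second pattern):

```python
import re

# [http://example.com some label] -> [some label](http://example.com)
OLD_EXTERNAL = re.compile(r'\[(https?://[^\s\]]+)\s+([^\]]+)\]')

def convert_external_links(text):
    return OLD_EXTERNAL.sub(r'[\2](\1)', text)
```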
Then all the other SmartAscii stuff above...
Converting SmashedTogetherWords: realized my current code has lots of false positives, catching strings that are inside URLs, etc. Discovered this while browsing the graph subset of wikiwords that don't have pages. Options:
- assume bad cases aren't SmashedTogetherWords, just cap-strings. Browse list, delete non-words (by deleting the line from the look-up file, it doesn't get substituted when found)
- add other bits of MarkDown conversion, run overall script against batch, then run against HTML validator.
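One way to cut the false positives might be to blank out URLs on a line before scanning for WikiWords - rough sketch, the patterns here are guesses:

```python
import re

URL = re.compile(r'https?://\S+')
WIKI_WORD = re.compile(r'\b(?:[A-Z][a-z0-9]+){2,}\b')   # e.g. WikiGraph, BlogBit

def find_wiki_words(line):
    cleaned = URL.sub(' ', line)   # drop URLs so cap-strings inside them are ignored
    return WIKI_WORD.findall(cleaned)

# find_wiki_words("see WikiGraph at http://example.com/SomePage")
# -> ['WikiGraph']   (SomePage inside the URL is skipped)
```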
Doing links
Fix the way that saving an Edit redirects to FrontPage rather than the page just edited. Done Mar03'2016.
Doing lists
- to avoid getting `p` tags around list items, keep each list "tight", so get rid of the empty lines (double-returns): but how to do this? - maybe could keep state while stepping through lines. Have to make sure I exit properly, so keep the double-return between the end of a list and the next paragraph.
- ''or maybe could do regex!''
- first-level list items don't need leading space
- ugh nesting requirements are confusing, prefer to experiment
- if first level has no leading space
- child needs 2 leading spaces
- grandchild needs 4, etc.
- if first level has 1 leading space
- child needs 3
- grandchild needs 5
- hmm do I want to keep leading space for top-level? I prefer not-to. On the other hand, PikiPiki requires it. Conclusion: get rid of it.
- so, given that MoinMoin pages always have the top level with 1 leading space... if a line in MoinMoin has n leading spaces, then for CommonMark it should become:
- if n=1, then nn=0
- if n>1, then `nn = (n-1)*2` (so actually don't need the previous special case) - done Mar05?
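A minimal sketch of that indentation rule, assuming MoinMoin bullet lines look like ` * item` with n leading spaces:

```python
import re

BULLET = re.compile(r'^( +)\* (.*)$')

def convert_list_line(line):
    m = BULLET.match(line)
    if not m:
        return line
    n = len(m.group(1))        # MoinMoin leading spaces (top level = 1)
    nn = (n - 1) * 2           # CommonMark leading spaces
    return ' ' * nn + '* ' + m.group(2)

# " * top"        -> "* top"
# "  * child"     -> "  * child"
# "   * grandkid" -> "    * grandkid"
```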
Also
- row of dashes does horizontal rule fine
- Italics doesn't cover multiple lines - confirmed. Options? (Found in 92 files from 16k)
- regexp - nope: multi-line, leading bullets, etc.
- write the Python code - ugh not even consistent; sometimes ital-open is by itself on line, sometimes in middle of line; sometimes ital-close is by itself, sometimes at end of line
- manually conform start/end, then batch-run code
- decision: ignore, fix by hand when I run across it (might use Python for that, but manually set up the start/end)
Images
- CommonMark doesn't give you a way to pass img width or put break around it; it actually surrounds with p tags!
- and you can't just have the truly-raw URL, or raw-surrounded-with-angle-brackets.
- hmm try just putting in parens
- ''maybe I should batch convert to nicer chunk of HTML?''
- yikes some of them already are
- and some have local URLs
- put examples from current pages into a test doc (with the name of the source); test possible target outcomes in WikiFlux. Conclusion: need the `![label](url)` format. (Update: also note the CommonMark way to do a linebreak, which might make sense before an image.)
- plan
- convert raw URLs (jpeg, jpg, png) surrounded by whitespace into bracket-paren syntax
- in TextWrangler, the search-grep is `(\s)(http[^ ]+(jpg|jpeg|png))(\s)` - see the Python sketch after this list
- should I just do this in TextWrangler instead of code?
- between this and already-HTML cases, vast majority handled, so ignore rest
- maybe future feature to turn all CommonMark images into fancier output (right-flush, smaller, link-to-full...)
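The same grep applied in Python, using the filename as the label - rough sketch:

```python
import re

# Same pattern as the TextWrangler grep above: raw image URLs surrounded by
# whitespace become ![filename](url). URLs at the very start/end of the file
# are missed, just like in the grep.
IMG_URL = re.compile(r'(\s)(http[^ ]+(jpg|jpeg|png))(\s)')

def convert_raw_images(text):
    def repl(m):
        url = m.group(2)
        label = url.rsplit('/', 1)[-1]     # e.g. pic.jpg
        return '%s![%s](%s)%s' % (m.group(1), label, url, m.group(4))
    return IMG_URL.sub(repl, text)
```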
fenced code blocks - Mar30
- start/end with 3 ticks not 3 curly-brackets
- needs to open at start of line - so have to adjust those
- my closings are fine
- again, should I just use TextWrangler?
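A sketch of doing it in code instead of TextWrangler (only handles block openers/closers, not inline {{{...}}} spans):

```python
import re

FENCE = '`' * 3   # the triple-tick fence

def convert_fences(text):
    # move an opening {{{ that sits at the end of a content line onto its own
    # line, since the opening fence has to start the line
    text = re.sub(r'(?m)^(.+?)\s*\{\{\{\s*$', r'\1' + '\n' + FENCE, text)
    # plain openers and closers just become triple-ticks
    text = re.sub(r'(?m)^\{\{\{\s*$', FENCE, text)
    text = re.sub(r'(?m)^\}\}\}\s*$', FENCE, text)
    return text
```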
Next
- When creating a new page, it isn't starting out by making the title SpaceSeparated - ToDo
- need to add more InterWiki cases (in code): ISBN, ASIN - '''check ASIN'''
- check the number of files - seems way low
Pondering auto-post architecture
- cases for auto-posting
- conversion/bootup/migration
- InstaPaper
- should the ongoing bits be combined? should everything be combined?
- where/how will I run them?
- probably never hands-off, will scrape/process, then review/tweak by hand, then post
- could have scraper code separated-by-partner, with central AutoPost code
- for ongoing probably better to post via HTTP, but up-front might be on-server - nah just do HTTP API
- actually not API, just HttpPost like with form
- pass created_date, with modified_date being same. Done Apr05
- hmm, CSRF? do I even have that working?
Next: AutoPost for converted MoinMoin pages
- have working v1 - Apr14'2016
- using the `requests` library to do HttpPost with cookies (rough sketch below) - bugs:
- title field has dbl-square-brackets around it
- body has asterisk at the end - yikes the real issue is that only the first line of the body got posted!
- but dates are right!
- fix brackets and 1st-line-only bugs: Apr15
- now posting whole directory! Apr15
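The rough shape of that v1 (the URLs and form-field names here are placeholders, not the real WikiFlux ones):

```python
import requests

def post_pages(pages, base_url='http://localhost:8082'):
    s = requests.Session()     # keeps the login cookies across posts
    s.post(base_url + '/login', data={'username': 'me', 'password': 'secret'})
    for name, body, created in pages:
        s.post(base_url + '/wiki/' + name,
               data={'body': body,               # the whole body, not just line 1
                     'created_date': created,
                     'modified_date': created})  # same as created for migrated pages
```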
'''Back to SmashedTogetherWords''' Apr16
Weird cases
- typical BlogBit pages
- if my new style is to have the date piece in parens, then a space, what happens to the URL?
- in naming the page, I give URL which doesn't change, then tweak the title. So non-issue.
- in referring to such a page from a different page - ugh, it doesn't compress the spacing; am I doing that on purpose? Apparently, because it ''does'' compress when referring to a page that doesn't start with a number. That's in `wikiweb/url_from_wiki_word(str)` - which is where I want to change the spacing.
- what I want (in referring to a BlogBit page from inside another page)
- typically pasting in URL-name, so that should work as-is
- don't see need to add the parens in this usage
- but could definitely see adding the spacing - and should do this in bulk-conversion
- so wikiweb code should remove any spaces - but be careful capwords() might remove dashes
- fixed wikiweb Apr25
- early pages that just have a tie-breaking single letter following the date? There are 600 of them!
- if I rename them, have to catch any links to them!
- process I could use
- manually review each file
- don't rename the file, but rename its entry in 1 meta file: `page_names.txt`
- then in conversion process
- catch any link and rename it
- rename the output file
But I don't even have the regular link-conversion working yet!
- ugh, realize in page_names.txt I have the paren-formatted BlogBit title, not the simpler format. So, to make the ref-handle, I'll have to check for that and convert in code. (Considered "fixing" the list file, but that creates redundant text so that changes have to be edited in both places.)
- finish first working cut, but seeing some weird final results. Realize that should do this ''after'' running the regular link conversion code. Now working Apr26.
Run a batch, auto-post. Looking pretty good.
- issue where a WikiWord is inside parens - get an extra space. Have to track that down.
- it's not in the MarkDown, it's in the rendering of MarkDown.
- ah, a line already commented as suspicious in `wikiweb/repl_wiki_word()`...
- isolate and comment out that logic. Now things look good. Apr26.
- then lots of fixing of ugly WikiWord cases in the meta file - done Apr28
- argh am I going to also review the PagelessWords meta file? Right now falling back to just use generic rule on all those cases, but that isn't kosher either. Or maybe it's ok, can correct as I find them.... but nervous about bad cases... should review file to look for some of them, go back to source to see if fail, or weird case. Then decide what to do.
- did lots of pattern cleaning
- searched for numbers and lots of spaces to dump suspicious strings
- then dropped any string that lacked a lower-case letter! Because these aren't pages anyway... done Apr29
- next - rewrite `wikiweb/repl_wiki_word()` to use pageless_words, then fall back to just passing through - no rule-based bracketing - Apr29 - finish all the posting: Apr30
- derp realize I never ran all the regex stuff noted at the end of my convert script! But going to review anyway
Review sample of pages:
- raw links not done! (need to record regex)
- actually, weirdly inconsistent!
- mis-linking subset-wikiword: AI inside AIML! http://localhost:8082/wiki/2014-07-06-AiAssistedCoaching
- ISBN links not working (actually ISBN itself wants to link, but not chain)
- See a working InterWiki case but it uses brackets. Raw case doesn't work.
- pages that were renamed have a `##` comment at the top, which is now getting converted to an H2. Maybe just remove that line?
Working on InterWiki
- set regexp
- probably have to run this ''before'' scripts so individual names don't get bracketed
- grr doesn't help - it removed the existing brackets!?!?
- options for getting through here
- nope, makes more sense to fix it afterwards - catch `]]:`
- important to make sure any space-names are defined in page_names to stay rendered as SmashedTogetherWords
- need code cases for MUPC, PhilJones
- hrm, another edge-case: when the page-name piece matches one of my page names, it gets space-separated, so the initial rendering of `WikiWikiWeb:WardCunningham` becomes `WikiWikiWeb:Ward Cunningham`
- options:
- live with it
- do separate regex for it
- prevent it with totally different approach
- another fail - an ISBN where I left the dashes in only gets matched/bracketed up to the first dash (at least if I use `\w+` as my regex pattern). Options:
- replace those cases separately (remove the dashes before starting?)
- loosen my core InterWiki to include dashes - probably better - yes that works
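For the record, roughly what the widening looks like (the prefix list here is just an example, not my full InterWiki set):

```python
import re

# \w+ stops at the first dash, so only 'ISBN:0' gets bracketed;
# [\w-]+ takes the whole dashed number.
NARROW = re.compile(r'\b(ISBN|ASIN):(\w+)')
WIDE   = re.compile(r'\b(ISBN|ASIN):([\w-]+)')

# WIDE.findall('ISBN:0-123-45678-9')   -> [('ISBN', '0-123-45678-9')]
# NARROW.findall('ISBN:0-123-45678-9') -> [('ISBN', '0')]
```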
Another problem: false-positive matches: once a WikiWord string is identified in a doc, all cases of it get replaced with its mapped value. But this is true even when it's part of a longer word. So, for instance, OSAF gets replaced even when it's inside OSAFStatusReports.
- just live with it, since it's pretty rare?
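If it turns out not to be rare, one fix would be to only replace whole words - sketch:

```python
import re

def repl_whole_word(text, word, replacement):
    # \b keeps OSAF from matching inside OSAFStatusReports
    return re.sub(r'\b' + re.escape(word) + r'\b', replacement, text)

# repl_whole_word('OSAF and OSAFStatusReports', 'OSAF', '[[OSAF]]')
# -> '[[OSAF]] and OSAFStatusReports'
```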
["z2016-06-17-SettingUpFlaskAtLinodeForWikiFlux"]
Jun30'2016 omg just realized my new BlogBit URL model means all my old URLs will break! Not Cool! So need to redirect them.
- fixed
Nov12'2016 start conversion of content changed since last batch (a year ago!)
- modify scrape function to only grab/list pages that were added/changed since last time
- run script that generates `page_names.txt` and `pageless_words.txt`
- manually edit `page_names.txt`
- realize `pageless_words.txt` is wrong because it's working from a subset. So toss it and use the old one (will be missing a little, but that's ok) (also note it now has some false entries, for cases where I had a WikiWord but no page last year but have added a page since)
- run conversion
- fix InterWiki, then ISBN/ASIN
- fix images
- next: scan other pages for WikiWord substitution. Maybe it's using pages-list for full list of names to match on? Need separate file?
Nov25: work with new meta pages, re-run. Haven't reviewed yet.
Flip switch! ["z2016-11-21-PushWikilogMigrationThanksgiving"]
Nov29: review converted pages
- some bad cases where wikiname is inside phrase. Will catch some by hand, move along.
- issue with raw https links: fixed manually
- ISBN, ASIN, InterWiki
- next: images
(Have been hand-finishing some pages to post manually.)
Dec10
- fenced blocks
- hr tags
- next: variation of auto-post code to test for pre-existence, do edit
Dec24
- copy over production code to replace local, so can start working on changes to do post-edit
- code not working! issue with CommonMark versions, VirtualEnv issue prob.
- Dec26: Nope, just a case difference in the `HtmlRenderer()` call.
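For reference, the newer spelling of the calls (the class names moved around between CommonMark-py releases, so this depends on the installed version):

```python
import commonmark

def render(md_text):
    parser = commonmark.Parser()
    ast = parser.parse(md_text)
    return commonmark.HtmlRenderer().render(ast)
```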
Dec28
- finish new autopost handler
- actually post 2016 backlog of 500+ pages!
Dec29
- 17102 records in 'nodes'
- grr wikigraph db has 17191, and that isn't current
- newest raw/full graph file from 2012 had 15664
- had 16395 in May'2014 https://twitter.com/BillSeitz/status/466002169930723328
- should export list from WikiGraph and see what's missing
`copy (select url from pages where space_name='WebSeitzWiki') to '/etc/postgresql/9.1/pages_wikigraph.txt'`
- ugh the difference is even worse because there are a bunch of cases where there are pages in wikiflux but ''not'' wikigraph, typically because of newer pages that didn't get added to the graph. So that means there are ''more'' MoinMoin pages missing than mere subtraction would imply.
- plan: script to
- local: list pages only missing from MoinMoin
- at BlueHost: scrape that list of pages
- then have to convert, autopost, etc.
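Roughly what the local diff-list step could look like (the second file name is a placeholder for whatever list the earlier scrape produced):

```python
def load_names(path):
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

wikigraph_pages = load_names('pages_wikigraph.txt')   # the psql export above
scraped_pages = load_names('scraped_pages.txt')       # placeholder: pages already scraped/converted

missing = sorted(wikigraph_pages - scraped_pages)
for name in missing:
    print(name)
```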
Jan11'2017
- finally generate diff-list. Oh, not even 20! (Think there was a dupes issue somewhere)