Converting WordPress Posts to Pelican MarkDown

A few weeks ago, I posted about my plans to convert my self-hosted WordPress blog to a static site using a static blog generator. Since then, I’ve decided to stop working on my exhaustive static blog generator comparison spreadsheet and just try Pelican - mainly because it’s written in Python which is my current strongest language, comes highly recommended, seems to have most of the features I want, and seems to be easily extensible.

So, I walked through the documentation for the latest version (3.3.0), started a GitHub repo, and tweaked a bunch of settings. The repo is public, so if you want to take a look behind the scenes, see my fabfile, etc. feel free.

Initial WordPress Import Attempt

I used the WordPress XML Export tool, as instructed in the Pelican Importer documentation. At first, I attempted to do a more-or-less default import from WordPress using the pelican-import tool, which writes rST, and then build the blog. What I ended up with was thousands of errors complaining about “Inline interpreted text or phrase reference start-string without end-string”, “Explicit markup ends without a blank line; unexpected uninden”, “malformed hyperlink target”, “Unknown target name” on all of my links, and a bevy of other Docutils errors. It was so utterly awful that I gave up.

WordPress Import as MarkDown

Next I tried importing as MarkDown instead of rST, using:

pelican-import --markup markdown --wpfile -o content/ --dir-page jasonantman039sblog.wordpress.2014-01-11.xml

That built without errors, and the posts looked somewhat right out of the box, without any of the previous thousands of errors. And the links looked mostly right - even the captions for images. Though I’m working at a Python shop and writing a lot of Python these days, my knowledge of MarkDown is still much better than rST, so this is fine for me. (I even wrote a fab post task that prompts for a title, generates all of the post metadata, writes it to the right file, and opens up an editor on it.)

The first problem was that the import script gave me one “content” directory with 346 “.md” files in it - not exactly easy to work with. Luckily the metadata was right, so a quick little bash script moved the posts into a YYYY/MM directory hierarchy.

Obvious Problems with Imported Posts

After getting the MarkDown import working, and the posts moved to the proper paths, I was still having some issues…

Syntax Hilighting Gone

In WordPress, I was using the WP-Syntax plugin to perform syntax hilighting via GeSHi. The plugin uses pre tags with a lang= attribute to specify the language, like:

<pre lang="bash">

Unfortunately, these translated to some really ugly MarkDown fenced blocks like:

~~~~ {lang="bash"}
cp /boot/efi/EFI/fedora/grub.cfg /boot/efi/EFI/fedora/grub.cfg.bak
echo 'GRUB_DISABLE_OS_PROBER="true"' >> /etc/default/grub
grub2-mkconfig > /boot/efi/EFI/fedora/grub.cfg
~~~~

that seem to be just a bit off from what MarkDown/Pygments can handle. The places where I just used bare <pre> blocks translated fine.

http://blog.gastove.com/2013-09-17_enabling_line_numbers_for_pygments.html

Fixed this by using fenced blocks with the ‘lang=’ stuff removed, and in class syntax like the MarkDown docs suggest. Some four-tab-indents with :::identifier work.

Broken Links

It seems that something in the conversion process introduced line wraps (could it really be Pandoc itself???) Unfortunately, this wreaks havoc with any explicit reference links that use long (long enough to break across lines) titles, depending on where they are in the line. It seems that in some places they end up breaking differently in the link in the text and in the link definition, which MarkDown misses, and then renders broken links and plain text of the link table at the bottom of the page. Manually removing the line breaks and any extraneous spaces seems to fix it.

So, yes, Pandoc was doing this because of the --reference-links parameter that pelican-import was calling it with. There was an issue and pull request to fix this, but when I started with Pelican the last release was 3.3.0 (4 months ago) and the PR was merged after that. So, if you’re having the same problem and the latest release of Pelican is still 3.3.0, you might as well just apply the patch yourself - it’s just a very simple removal of a parameter in pelican_import.py.

Overall Results

I’m quite happy with the overall results. I also spent a lot of time manually fixing markup issues that didn’t translate well through Pandoc, but I suppose that’s to be expected given that many of my older blog posts had HTML issues.