qrdn

quite random domain name

How to save a whole MediaWiki into a git repo

Recently I tried to archive the contents of our old MediaWiki instance in a git repository. Somebody else had already done that using some scripts from MediaWiki, but those ignored the page histories, saving only the most recent version of each page, and offered no way to also save uploaded files, especially images.

So I decided to see if I could do better, and found Git-Mediawiki. I had to fiddle a bit, because Arch Linux's git package only ships the Git-Mediawiki scripts as files without installing them as git commands, and because our TLS certificate is broken, but eventually I got the import to work:

sudo pacman -S perl-mediawiki-api perl-datetime-format-iso8601 perl-lwp-protocol-https
sudo ln -s /usr/share/git/mw-to-git/git-mw.perl /usr/lib/git-core/git-mw
sudo ln -s /usr/share/git/mw-to-git/git-remote-mediawiki.perl /usr/lib/git-core/git-remote-mediawiki
export PERL5LIB=/usr/share/git/mw-to-git/
export PERL_LWP_SSL_VERIFY_HOSTNAME=0  # this makes the whole TLS encryption insecure -- I use it because we don't have a valid certificate, and I don't intend to write back to the wiki
git clone mediawiki::https://wiki.chaos-darmstadt.de/w
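
The URL given to git clone has to be the directory that contains api.php (here /w), not the site root and not the /wiki/ article path. Here is a minimal sketch for probing candidate URLs before cloning. The helper names api_url and probe_endpoint are my own invention, not part of Git-Mediawiki, and unlike Git-Mediawiki the sketch strips trailing slashes before appending /api.php. It assumes curl is installed and uses -k only because of our broken certificate.

```shell
#!/bin/sh
# Build the api.php URL for a candidate base URL, dropping trailing
# slashes first so the result has no double slash.
api_url() {
    base=$1
    while [ "${base%/}" != "$base" ]; do
        base=${base%/}
    done
    printf '%s/api.php\n' "$base"
}

# Succeeds if the candidate answers a siteinfo query with JSON.
# -k skips certificate verification (only acceptable here because the
# certificate is known to be broken and nothing is written back).
probe_endpoint() {
    curl -ksf "$(api_url "$1")?action=query&meta=siteinfo&format=json" \
        | grep -q '"query"'
}

# Usage (needs network access):
#   for base in https://wiki.chaos-darmstadt.de/ \
#               https://wiki.chaos-darmstadt.de/wiki/ \
#               https://wiki.chaos-darmstadt.de/w; do
#       probe_endpoint "$base" && echo "usable: $base"
#   done
```

Only the base URL that prints "usable" is worth passing to git clone with the mediawiki:: prefix.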

The result is a linear history with one commit for each saved revision of any page. There seem to be some bugs, though:

- Subpages are not exported, such as our main page's subpages "Hauptseite/Header" etc.
- Some page histories occur twice in the git history, e.g. for the page "Mate-Basteln".
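
To check these observations on the clone, something like the following sketch may help. It assumes that Git-Mediawiki attaches a git note of the form "mediawiki_revision: <id>" to every imported commit under refs/notes/origin/mediawiki. I believe that is how it tracks revisions, but treat the ref name and note format as assumptions. duplicate_revisions is a hypothetical helper name.

```shell
#!/bin/sh
# Print every MediaWiki revision id that occurs more than once in the
# notes text read from stdin.
duplicate_revisions() {
    sed -n 's/^mediawiki_revision: *//p' | sort -n | uniq -d
}

# Usage inside the cloned repository:
#   git rev-list --count HEAD            # number of imported revisions
#   git rev-list --count --merges HEAD   # 0 if the history is linear
#   git log --format=%N --notes=refs/notes/origin/mediawiki | duplicate_revisions
```

If the last command prints any ids, those revisions were imported more than once.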


Some of the things I did wrong when getting it to work:

Errors

  1. Wrong endpoint https://wiki.chaos-darmstadt.de/

    fatal: could not get the list of wiki pages.
    fatal: 'https://wiki.chaos-darmstadt.de/' does not appear to be a mediawiki
    fatal: make sure 'https://wiki.chaos-darmstadt.de//api.php' is a valid page
    fatal: and the SSL certificate is correct.
    fatal: (error 2: 404 Not Found : error occurred when accessing https://wiki.chaos-darmstadt.de//api.php after 1 attempt(s))
    fatal: Could not read ref refs/mediawiki/origin/master
    
  2. Wrong endpoint https://wiki.chaos-darmstadt.de/wiki/

    Searching revisions...
    No previous mediawiki revision found, fetching from beginning.
    Fetching & writing export data by pages...
    Listing pages on remote wiki...
    fatal: could not get the list of wiki pages.
    fatal: 'https://wiki.chaos-darmstadt.de/wiki/' does not appear to be a mediawiki
    fatal: make sure 'https://wiki.chaos-darmstadt.de/wiki//api.php' is a valid page
    fatal: and the SSL certificate is correct.
    fatal: (error 2: Failed to decode JSON returned by https://wiki.chaos-darmstadt.de/wiki//api.php
    Decoding Error:
    malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "<!DOCTYPE html>\n<ht...") at /usr/share/perl5/vendor_perl/MediaWiki/API.pm line 400.
    
    Returned Data:
    <!DOCTYPE html>
    <html lang="de" dir="ltr" class="client-nojs">
    <head>
    <meta charset="UTF-8" />
    <title>Diese Aktion gibt es nicht – Chaos-Darmstadt Wiki</title>
    
    ... (all the HTML from the page) ...
    
    fatal: Could not read ref refs/mediawiki/origin/master
    
  3. PERL5LIB not set

    Cloning into 'wiki'...
    Can't locate Git/Mediawiki.pm in @INC (you may need to install the Git::Mediawiki module) (@INC contains: /usr/lib/perl5/site_perl /usr/share/perl5/site_perl /usr/lib/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib/perl5/core_perl /usr/share/perl5/core_perl .) at /usr/lib/git-core/git-remote-mediawiki line 18.
    BEGIN failed--compilation aborted at /usr/lib/git-core/git-remote-mediawiki line 18.
    
  4. SSL/TLS certificate not accepted. I worked around this by disabling the check, because I know the certificate is broken and I don't intend to write back to the wiki, so in the worst case my exported copy has been tampered with. In general, always verify certificates properly and treat this as a severe error!

    Searching revisions...
    No previous mediawiki revision found, fetching from beginning.
    Fetching & writing export data by pages...
    Listing pages on remote wiki...
    fatal: could not get the list of wiki pages.
    fatal: 'https://wiki.chaos-darmstadt.de/wiki/' does not appear to be a mediawiki
    fatal: make sure 'https://wiki.chaos-darmstadt.de/wiki//api.php' is a valid page
    fatal: and the SSL certificate is correct.
    fatal: (error 2: 500 Can't connect to wiki.chaos-darmstadt.de:443 (certificate verify failed) : error occurred when accessing https://wiki.chaos-darmstadt.de/wiki//api.php after 1 attempt(s))
    fatal: Could not read ref refs/mediawiki/origin/master
    
  5. Git remote helper for mediawiki not installed (fixed by the ln -s commands above):

    fatal: Unable to find remote helper for 'mediawiki'
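
Errors 3 and 5 can be caught before running git clone. This is a rough sketch with two hypothetical helper functions. It only looks in the directory you pass in (normally the output of git --exec-path), although git would also find a helper elsewhere on PATH.

```shell
#!/bin/sh
# Succeeds if a git-remote-mediawiki helper is executable in the given
# directory (normally the output of `git --exec-path`).
helper_installed() {
    [ -x "$1/git-remote-mediawiki" ]
}

# Succeeds if Perl can load Git::Mediawiki with the current PERL5LIB.
module_loadable() {
    perl -MGit::Mediawiki -e 1 2>/dev/null
}

# Usage:
#   helper_installed "$(git --exec-path)" || echo "helper missing: create the symlinks"
#   module_loadable || echo "Git::Mediawiki not found: set PERL5LIB"
```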