Chapter 14 Protecting content

Handling spam

The web 2.0 revolution of user-generated content, for all its positive impact, has also been a godsend for spammers. Never before in human history has it been so easy to sell dubious merchandise or services, whether it's prescription drugs, life insurance, prescription drugs, college essay writing, or prescription drugs. So spam has infiltrated countless blog comments, Twitter feeds, and wiki pages. If your wiki is not public, or is public but has closed registration, then you have nothing to fear and you can skip ahead to the next chapter. If, however, your wiki allows contributions from the general public, then chances are good that, at some point, one or more groups of wiki-spammers will find it and will start trying to turn it into a free advertising platform.

MediaWiki already handles one important anti-spam task by default: it adds a rel="nofollow" attribute to the HTML of every external link. This tells search engines not to follow the link or count it toward the target site's ranking, which greatly reduces the benefit that adding links can provide to spammers. You can undo that behavior, by the way, by adding the following to LocalSettings.php, though you really shouldn't:

$wgNoFollowLinks = false;
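If the goal is only to give a few trusted sites the benefit of real links, a gentler alternative may be the core setting $wgNoFollowDomainExceptions, which exempts the listed domains from "nofollow" while leaving it on for everything else. The following is a minimal sketch; the domain names are placeholders, not recommendations:

```php
// Keep "nofollow" on external links in general...
$wgNoFollowLinks = true;

// ...but exempt a short list of trusted domains (placeholder values).
$wgNoFollowDomainExceptions = [ 'mediawiki.org', 'example.org' ];
```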

Still, for whatever reason, some spammers really like to hit MediaWiki sites. Thankfully, there are a number of effective extensions that let you fight back against spam. The three most important ones, which are recommended for every publicly-editable wiki, are ConfirmEdit, SpamBlacklist and SmiteSpam; we'll get to those in the next sections.

ConfirmEdit

The ConfirmEdit extension comes bundled in with every MediaWiki install. Its documentation can be found here:

https://www.mediawiki.org/wiki/Extension:ConfirmEdit

It sets up a CAPTCHA tool for page edits, user registration and user login. A CAPTCHA (which stands for "Completely Automated Public Turing test to tell Computers and Humans Apart") is any challenge designed so that a human can answer it but a software program can't. Its most common variety is the now-ubiquitous test in online forms that asks you to look at an image of distorted numbers and letters and type them in. ConfirmEdit provides, at the moment, five different CAPTCHA options. They are:

SimpleCaptcha – the default option. Displays a simple math problem.
FancyCaptcha – displays an image of a stylized set of letters that users have to decipher (this option is most like the standard CAPTCHAs).
MathCaptcha – like SimpleCaptcha, but the math problem is displayed as an image.
QuestyCaptcha – asks a question, out of a pre-defined set (the administrator has to create the questions, and their allowed answers).
ReCaptchaNoCaptcha – uses Google's “No CAPTCHA reCAPTCHA” service, which uses various techniques to differentiate humans from robots. (For some reason, the ConfirmEdit option's name switches the order around.)

All of these options, including SimpleCaptcha, are better than nothing, though they vary widely in effectiveness. Currently, the two most effective ConfirmEdit options appear to be QuestyCaptcha and ReCaptchaNoCaptcha. With QuestyCaptcha, there's no automated software that can figure out the right answer to questions you have written yourself, so even simple questions are generally effective. (It's helpful, though, to replace the set of questions every once in a while, especially if spam starts getting through.)
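As a rough illustration, a QuestyCaptcha setup in LocalSettings.php might look like the following. This is a sketch that assumes a reasonably recent MediaWiki version, where ConfirmEdit modules are loaded with wfLoadExtensions(); older releases used a different mechanism, so check the documentation for your version. The questions and answers are made-up placeholders:

```php
// Load ConfirmEdit together with its QuestyCaptcha module.
wfLoadExtensions( [ 'ConfirmEdit', 'ConfirmEdit/QuestyCaptcha' ] );

// Each question maps to one accepted answer, or to an array of accepted answers.
// These entries are placeholders; write questions that fit your own wiki.
$wgCaptchaQuestions = [
    'What is the name of this wiki?' => 'Example Wiki',
    'How many legs does a cat have?' => [ '4', 'four' ],
];
```

Questions that are specific to your wiki's subject matter are probably harder for a generic spam tool to answer than general-knowledge ones.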

Whichever CAPTCHA module you go with, ConfirmEdit offers the same standard set of additional options. First, it lets you customize which user groups will see CAPTCHAs, using the 'skipcaptcha' permission type. By default, only the 'bot' and 'sysop' user groups are exempted from CAPTCHAs (in other words, they have 'skipcaptcha' set to true). If you want to, say, exempt registered users as well, you could add the following to LocalSettings.php:

$wgGroupPermissions['user']['skipcaptcha'] = true;

That may seem like a reasonable change, but actually it's not necessary or recommended, as we'll see soon.

ConfirmEdit also lets you configure which actions result in a CAPTCHA test. The relevant actions are:

'edit' – any attempted page edit
'create' – the creation of a new page
'addurl' – any edit which results in a new URL being added to the page
'createaccount' – user registration
'badlogin' – when a user tries to log in after already having given an incorrect password (this is useful to guard against bots that try to guess passwords)

By default, 'addurl', 'createaccount' and 'badlogin' are checked, while 'edit' and 'create' are not. Why is that – surely every edit is worth checking? Actually, it's not usually necessary, because of the presence of the 'addurl' action. Spam almost always involves the addition of one or more URLs. (Not always, though, because, bizarrely, some "pseudo-spammers" like to just add random text to pages.) Meanwhile, in regular wiki editing, new external URLs get added to pages only occasionally. So checking the addition of URLs works to ward off most automated spammers, while being only a minor inconvenience to real users.
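These actions are controlled through the $wgCaptchaTriggers array. A minimal sketch, to be placed after the line that loads ConfirmEdit, which spells out the defaults and then optionally tightens them:

```php
// The defaults: only these three actions trigger a CAPTCHA.
$wgCaptchaTriggers['addurl'] = true;
$wgCaptchaTriggers['createaccount'] = true;
$wgCaptchaTriggers['badlogin'] = true;

// Optional and stricter: also challenge the creation of any new page.
$wgCaptchaTriggers['create'] = true;
```

Turning on 'create' may be worth trying if spammers are creating pages of plain text without URLs; turning on 'edit' is usually more inconvenience than it's worth.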

In an ideal world, a CAPTCHA system would block all spam. But spammers have figured out how to bypass CAPTCHAs, most likely by hiring humans to enter the inputs (the going rate, according to Wikipedia, is an absurd 0.1 cents for every completed CAPTCHA). Still, ConfirmEdit does seem to cut down significantly on spam: it stops the waves of automated spam that spammers sometimes like to unleash, where hundreds of spam pages can be created in a few hours.

SpamBlacklist

Another, complementary tool is the SpamBlacklist extension, which can block edits based on two criteria: what URLs they add to the page, and what IP address they originate from. The URLs that spammers add tend to belong to a very large, but finite, set of known websites. The SpamBlacklist extension lets you use as many URL "blacklists" as you want, each containing a set of domains (actually, it's a set of regular expressions for domains, so that, for instance, every domain containing the string "casino-online" can get blocked with one line). By default, SpamBlacklist uses a single blacklist – the Wikimedia Meta-Wiki blacklist, located at:

https://meta.wikimedia.org/wiki/Spam_blacklist

It's an impressively long list, and it seems to be fairly effective at blocking the spam edits that ConfirmEdit doesn't. The set of websites that spammers use is always growing, but thankfully there's no shortage of additional blacklists available – the SpamBlacklist documentation lists a number of these.
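Additional blacklists are set in LocalSettings.php. The sketch below follows the $wgBlacklistSettings format given in the extension's documentation as we understand it; the setting's name and structure may differ between versions, so treat it as an assumption to verify. The second URL is a placeholder for a list you trust:

```php
wfLoadExtension( 'SpamBlacklist' );

$wgBlacklistSettings = [
    'spam' => [
        'files' => [
            // The default Wikimedia Meta-Wiki list, fetched in raw form.
            'https://meta.wikimedia.org/w/index.php?title=Spam_blacklist&action=raw&sb_ver=1',
            // Placeholder: an additional list maintained on another wiki.
            'https://wiki.example.org/w/index.php?title=Spam_blacklist&action=raw&sb_ver=1',
        ],
    ],
];
```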

There's more functionality in SpamBlacklist, including the ability to create "whitelists" of domains that match some anti-spam criteria but are actually fine. You can read more on the extension's web page:

https://www.mediawiki.org/wiki/Extension:SpamBlacklist

AbuseFilter

Even ConfirmEdit and SpamBlacklist, as helpful as they both are, don't block all spam. Most perniciously, some spammers simply link to the URLs of pages they've created on other wikis, which themselves contain spam. There's no real way to block such URLs, since they point to innocent domains. There's a third way of guarding against spam, though, which is to check attributes like the IP address and email address (assuming they've registered) of the user making the edit. The AbuseFilter extension (which is bundled in with MediaWiki) can be used for this purpose. It is a complex extension – designed for use on the large Wikimedia sites – and offers a significant amount of fine-grained control. You can read about this extension and how to use it here:

https://www.mediawiki.org/wiki/Extension:AbuseFilter

Deleting spam

If you do get hit with spam, there are three useful tools for getting rid of it quickly and easily: SmiteSpam, Nuke and DeleteBatch.

Of the three, SmiteSpam is the most powerful. It is an extension that analyzes all the pages in your wiki, looking for those that appear to be spam. (Thankfully, the average wiki page created by a spammer looks significantly different from legitimate wiki pages.) It then presents a graphical interface to let an administrator check its findings, and do a mass deletion, and user block, for pages that are indeed spam. You can read more about this extension here:

https://www.mediawiki.org/wiki/Extension:SmiteSpam

Figure 14.1: The main SmiteSpam interface, at Special:SmiteSpam

Nuke is an extension that's bundled in with MediaWiki that lets you delete all the pages created by a single user or IP address. If a spammer sticks to just a few user accounts or IP addresses, and they only create new pages, Nuke will work very well. You just need to enter the username or IP address, and it does the rest. You can get the extension here:

https://www.mediawiki.org/wiki/Extension:Nuke

Unfortunately, there's no current extension that does something similar with bad edits to pages that already existed – in other words, one that does a mass revert instead of a mass deletion. The closest thing is the following JavaScript code, which you can add to MediaWiki:Common.js to do a mass rollback from the browser. It's not nearly as efficient, but it's certainly better than nothing:

https://en.wikipedia.org/wiki/User:John254/mass_rollback.js

MediaWiki has a script, deleteBatch.php, that provides a different approach to undoing spam and vandalism – it lets administrators delete a large group of pages at once, by supplying a text file containing all the page names:

https://www.mediawiki.org/wiki/Manual:DeleteBatch.php

Additionally, there's the DeleteBatch extension, which lets you do essentially the same thing from the browser interface:

https://www.mediawiki.org/wiki/Extension:DeleteBatch

mediawiki.org has an entire “Combating spam” page that lists these and other extensions, as well as other, more involved ways of avoiding spam. You can see it here:

https://www.mediawiki.org/wiki/Manual:Combating_spam

Restricting registration

Finally, there's an alternate approach to preventing spam, which is to control users' ability to register, and then to restrict editing to just logged-in users. It makes registration more difficult for users, but it may well be the most foolproof approach against spam.

There are two approaches that can work for this: authentication extensions, and ConfirmAccount. With authentication extensions (covered elsewhere in this book), users can (and, if you configure it that way, must) use accounts from an outside system, like Google, to log in, which makes potential spammers' job quite a bit harder.

The other approach is the ConfirmAccount extension, with which every user registration has to be approved by an administrator; this also works quite well against spam. ConfirmAccount, too, is covered elsewhere in this book.
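Whichever approach you choose, the "restrict editing to just logged-in users" part is done with core permission settings. A minimal sketch; whether you also need the second line depends on the extension you're using, so check its documentation first:

```php
// Only logged-in users may edit pages.
$wgGroupPermissions['*']['edit'] = false;

// Anonymous visitors can't create accounts directly through the standard form.
// (Assumption: accounts are instead created via an outside login system or an approval queue.)
$wgGroupPermissions['*']['createaccount'] = false;
```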

Access control and protecting pages

There are two kinds of access control: restricting the ability of certain users to read certain content, and restricting their ability to edit certain content. In MediaWiki, these are very different from one another, and we'll handle them in two separate sections.

Controlling read access

MediaWiki was never designed, or modified, to allow for restricting read-access. If you go to the page on mediawiki.org for almost any access-control extension, you will see the following friendly message:

If you need per-page or partial page access restrictions, you are advised to install an appropriate content management package. MediaWiki was not written to provide per-page access restrictions, and almost all hacks or patches promising to add them will likely have flaws somewhere, which could lead to exposure of confidential data. We are not responsible for anything being leaked, leading to loss of funds or one's job.

In reality, there are hooks in the MediaWiki code to allow extensions to restrict viewing of pages – using any of the access-control extensions, a user who is not allowed to view a page will most likely only see an error message if they go directly to the page. However, the warning is still appropriate, because, for whatever reason, there are places in the MediaWiki code that ignore these read restrictions. Currently there are two known ones: the search page, and the "Recent changes" page. If a user does a search on text contained in a restricted page, they will be able to see the name of the page, the fact that it contains the search text, and the text fragment around the search text. And any changes to restricted pages will show up in the "Recent changes" page, where at least the edit summary will be viewable by everyone.

In addition, the Cargo and Semantic MediaWiki extensions pose a third security gap for wikis that use them, because they, too, ignore read restrictions – so any such data stored within a restricted page will be viewable by everyone.

It could be that all of the current issues will be fixed in future versions of the software. Nevertheless, even then, trying to restrict people's ability to access content in MediaWiki still seems like it would be a bad idea. This being a wiki, anyone who can read a certain protected page can easily copy over its contents to another, unprotected, page; or make some mistake in editing the page that leads to it no longer being in a protected category; etc. Even if the mistake lasts for no more than five minutes, that's still enough time for someone to see the material and have a permanent copy of it. And you might never find out if such a breach happens.

The other big issue is that every extension you use has to restrict read-access permissions. If even one doesn't, like Cargo or Semantic MediaWiki, then all your restriction work may be in vain.

So what do you do if you want to store confidential information in your wiki? Probably the most foolproof solution for that case is to simply have a second wiki, one which is restricted to only the small group of people with preferred access (most likely, top-level managers or the like), which will hold all the confidential data. Then you can have an additional element of "security through obscurity" – people who don't have access to the wiki may not even know about it, or may not know its web address; so there's less chance of any sort of breach. It's much safer to prevent someone from reading a wiki entirely, than reading only certain parts of it.

How do you prevent people from getting to a wiki? If you're on an internal network, and there's already some server that only the intended readers of the wiki can reach, then the easiest solution is to put the wiki on that server. Otherwise, the best way to restrict viewing of the wiki is via LocalSettings.php settings. That's covered elsewhere in this book, but here are the relevant lines again:

$wgGroupPermissions['*']['read'] = false;

$wgGroupPermissions['user']['read'] = true;

The variable $wgWhitelistRead can also be useful in this case, because it lets you define certain pages that non-logged-in users can see, even if they can't view the rest of the wiki. If you want everyone to be able to see the front page, for instance, you could add the following:

$wgWhitelistRead = ['Main Page'];

And if you're using the ConfirmAccount extension, which lets people request a user account, then at least the following would be necessary if the wiki is private:

$wgWhitelistRead = ['Special:RequestAccount'];
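Note that each assignment to $wgWhitelistRead replaces any earlier one, so the two examples above shouldn't simply be pasted one after the other. If you want both pages to be readable, a single combined assignment is probably what you want:

```php
// Pages that anonymous visitors can read on an otherwise private wiki.
$wgWhitelistRead = [ 'Main Page', 'Special:RequestAccount' ];
```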

In addition to using $wgGroupPermissions, there are also general web-based solutions, like using an .htaccess file.

What about more complex requirements – like, for instance, if you want to implement some system where regular users can only read and edit their own contributions, while administrators can read and edit everything? There may be extensions intended to support specific configurations, but good general advice is to echo the warning message: "you are advised to install an appropriate content management package."

Controlling write access

Thankfully, all the many issues related to restricting reading ability don't apply to restricting writing ability. Unlike read restrictions, write restrictions in MediaWiki work quite well, and even if a security breach occurs, it can be easily undone. If you're an administrator, you can restrict the writing of any particular page just via the "Protect" tab (or dropdown selection). In Figure 14.2, you can see an example of the interface shown after clicking on that tab/selection. As you can see, an administrator can set different protection levels for editing and moving pages, and they can set expirations on that protection.

Figure 14.2: “Protect page” interface

All of the access-control extensions also let you restrict write access. These generally provide a way to restrict all pages in a category and/or namespace to editing by one or more user groups. Of these, the safest choice seems to be the extension Lockdown:

https://www.mediawiki.org/wiki/Extension:Lockdown
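As a rough idea of what a namespace-based write restriction looks like with Lockdown, here is a minimal sketch based on that extension's documented $wgNamespacePermissionLockdown setting. The choice of namespace and group is only an example:

```php
wfLoadExtension( 'Lockdown' );

// Only members of the "sysop" group may edit pages in the Project namespace.
$wgNamespacePermissionLockdown[NS_PROJECT]['edit'] = [ 'sysop' ];
```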

Another potentially useful extension is LockAuthor, which lets you prevent certain users from modifying any pages other than those that they themselves have created:

https://www.mediawiki.org/wiki/Extension:LockAuthor

In addition, you can practice a "nicer" form of write-restriction, by using one of the extensions that let you mark a certain revision of the page as approved; anyone can then modify the page further, but the newer edits won't be displayed to users until they, too, are approved. We'll cover the two extensions that allow that in the next section.

FlaggedRevs and Approved Revs

Running a wiki can be a frightening experience: you're in charge of a set of documents that are meant to reflect some sort of official view of things, but sometimes many people, and sometimes everyone, can change anything on the wiki at any time. Which means that you can check the wiki in the morning and find out that a page about some software utility has, for the last four hours, held nothing but a string of obscenities, or some incorrect information about the software, or a nice recipe for chocolate mousse. Or perhaps you find that that bad edit has been in place for a week.

This fear tends to be overblown, because (a) with the (major) exception of spam, edits that are malicious or obviously incorrect are usually pretty rare, and (b) to the extent that users are reading the wiki and have editing power, they can usually be trusted to find and revert such changes on their own. Still, the threat is there; and beyond that, some administrators simply want to have final control over what gets displayed at all times. Editorial control, in many cases, can be a nice thing to have.

The natural solution to this problem is one that has its roots in software version control: having one or more versions of a wiki page that are certified as "stable" or approved, and then having the most recent stable version be the one shown by default to users. That way you don't interfere with the process of wiki-editing, while at the same time ensuring a certain level of security for the content.

FlaggedRevs

This view of things has reached Wikipedia, where vandalism has always been a problem. The FlaggedRevs extension (referred to on Wikipedia as PendingChanges) was developed for that purpose. It is used in some of the languages of Wikipedia, including (very sparingly) on the English-language Wikipedia.

FlaggedRevs can be used on any MediaWiki-based wiki, though it takes effort to install and use it because it's a substantial piece of software (as its homepage puts it rather bluntly, it is “very clunky, complex and not recommended for production use”). It does more than simply enable setting a stable revision of a page: it provides a whole framework for users to review different revisions of a page according to different criteria, so that the decision about which revision(s) to go with can be made by consensus.

https://www.mediawiki.org/wiki/Extension:FlaggedRevs

Approved Revs

FlaggedRevs makes sense for Wikipedia, although it is probably overkill for small-to-medium-sized wikis, where decisions can just be made by one or a few people without the need for a full, open discussion. In such a case, the Approved Revs extension seems to be the better solution.

Approved Revs is an extension that essentially was created to be a much simpler alternative to FlaggedRevs. It basically lets administrators do two things: select a single revision of a page as the approved one, and select a single revision of a file as the approved one. When a user goes to a page, or file, that has an approved revision, that approved revision is what they will see by default (though they can still see any other revision if they go to the "history" page).

Administrators can select which revision of a page will be the approved one by going to the history page; see Figure 14.3 for an example of this interface.

Figure 14.3: History page (as viewed by an administrator) where one revision has been approved via the Approved Revs extension

By default, if a page has no approved revision, users will just see the latest revision – Approved Revs will have no impact. However, the wiki can be set to instead show a blank page if there's no approved revision – this can be done by adding the following to LocalSettings.php:

$egApprovedRevsBlankIfUnapproved = true;

If normal users edit a page, or re-upload a file, that already has an approved revision, their changes won't show up on the page until another approval happens. But by default, if anyone who has revision-approval permission edits a page, their edit (and thus, the latest revision of the page) will automatically be marked as approved. That usually makes sense, since such editors presumably wouldn't make changes that they wouldn't themselves authorize. However, you can change that default behavior by adding the following to LocalSettings.php:

$egApprovedRevsAutomaticApprovals = false;
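The revision-approval permission itself is, according to the extension's documentation as we understand it, named 'approverevisions', and it is assigned like any other MediaWiki permission. A minimal sketch, in which the "reviewer" group is a hypothetical name used only for illustration:

```php
// Let a hypothetical "reviewer" group approve revisions, alongside administrators.
$wgGroupPermissions['reviewer']['approverevisions'] = true;
```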

Besides protecting content, Approved Revs can also be used to turn MediaWiki into more of a publishing platform, or a traditional CMS, where “draft” versions of a wiki page exist before the page gets “published”. For this case, the $egApprovedRevsBlankIfUnapproved setting becomes quite useful. It's different from standard publishing schemes because readers can still see all the draft versions through the history page (although those can be hidden if necessary), but the basic concept of a page that's kept hidden until it's reviewed and approved is still implemented.

You can also set the group of namespaces for which Approved Revs is applied, via the $egApprovedRevsEnabledNamespaces variable. By default it comprises five namespaces: NS_MAIN (the main namespace), NS_HELP (help pages), NS_FILE (files), NS_TEMPLATE (templates) and NS_PROJECT (the project namespace). And you can set Approved Revs to apply to specific individual pages, using the “__APPROVEDREVS__” behavior switch. This is best done via a template.

As an example, let's say you only wanted approval to apply to the set of pages in the category “Essays”. You would first turn off Approved Revs in general, by adding the following to LocalSettings.php:

$egApprovedRevsEnabledNamespaces = [];

You would then create an infobox template, to be used for every "essay" page, which defines pages as being in the “Essays” category; and you would add to that template the “__APPROVEDREVS__” behavior switch, so that it was added automatically to every such page.

Approved Revs also defines a new special page, Special:ApprovedRevs, which provides eight lists: all pages that have an approved revision, all the pages that don't, all the pages whose approved revision is not their latest, and all the pages with invalid approvals – and then the same four lists for files.

You can read more about Approved Revs on its homepage:

https://www.mediawiki.org/wiki/Extension:Approved_Revs