Extension:SpamBlacklist

This extension comes bundled with MediaWiki 1.21 and later. Thus you do not have to download it again. However, you still need to follow the other instructions provided.
A proposal to rename this extension is being discussed at T254649.

The SpamBlacklist extension prevents edits that contain URLs whose domains match regular expression patterns defined in specified files or wiki pages, and it prevents registration by users who use specified email addresses.

When someone tries to save a page, this extension checks the text against a (potentially very large) list of illegal host names. If a match is found, the extension displays an error message to the user and refuses to save the page.

Installation

Download the appropriate version from the MediaWiki extension distribution page and extract it into the SpamBlacklist directory inside your wiki's extensions directory.

Alternatively, developers and code contributors should instead install the extension from Git, using:

cd extensions/
git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/SpamBlacklist

Add the following to your LocalSettings.php settings file:

wfLoadExtension( 'SpamBlacklist' );

Configure the blacklists as desired.

Navigate to Special:Version on your wiki to verify that the extension has been installed successfully.

Setting the blacklist

The following local pages are always used, even when additional sources are listed:

  • MediaWiki:Spam-blacklist
  • MediaWiki:Spam-whitelist

For the list of forbidden URLs, the default additional source is the Wikimedia spam blacklist on Meta-Wiki, m:Spam blacklist. By default, the extension uses this list and reloads it once every 10-15 minutes. For many wikis, using this list will be sufficient to block most spam attempts. However, because the Wikimedia blacklist is used by a wide variety of large wikis with hundreds of thousands of external links, it is relatively conservative in the links it blocks.

The Wikimedia spam blacklist can only be edited by administrators, but you can propose modifications to it at m:Talk:Spam blacklist.

You can add other bad URLs on your own wiki. List them in the global variable $wgBlacklistSettings in LocalSettings.php; see the examples below.

$wgBlacklistSettings is a two-level array. The top-level keys are spam and email. Each takes an array whose values contain URLs, file names, or database locations.
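
As a minimal sketch of that structure, mirroring the examples in the next section (the file paths here are placeholders, not real lists), a configuration with both top-level keys could look like this:

$wgBlacklistSettings = [
	'spam' => [
		'files' => [
			// sources for the URL blacklist (placeholder path)
			"$IP/extensions/SpamBlacklist/my_url_blacklist",
		],
	],
	'email' => [
		'files' => [
			// sources for the email blacklist (placeholder path)
			"$IP/extensions/SpamBlacklist/my_email_blacklist",
		],
	],
];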

만약 "LocalSettings.php"에서 $wgBlacklistSettings를 사용하면 "[[m:Spam blacklist]]"의 기본값은 더 이상 사용되지 않을 것입니니다 - 만약 해당 블랙리스트에 접근하기를 원하면, 수동으로 추가해야만 할 것이며, 아래의 예제를 참조하십시오.

Specifying a database location allows you to draw the blacklist from a page on your wiki.

The format of a database location specifier is "DB: [db name] [title]". [db name] must exactly match the value of $wgDBname in LocalSettings.php. You must create the page named [title] in the wiki's main namespace. If you do this, it is strongly recommended that you protect the page from general editing. Besides the obvious danger that someone could add a regex that matches everything, note that an attacker able to enter arbitrary regular expressions may be able to generate a segfault in the PCRE library.

Examples

If you want to, for instance, use the English-language Wikipedia's spam blacklist in addition to the standard Meta-Wiki one, you could add the following to LocalSettings.php, after the wfLoadExtension( 'SpamBlacklist' ); call:

$wgBlacklistSettings = [
	'spam' => [
		'files' => [
			"https://meta.wikimedia.org/w/index.php?title=Spam_blacklist&action=raw&sb_ver=1",
			"https://en.wikipedia.org/w/index.php?title=MediaWiki:Spam-blacklist&action=raw&sb_ver=1"
		],
	],
];

Here is an example of an entirely local set of blacklists: the administrator uses the update script to generate a local file called "wikimedia_blacklist" that holds a copy of the Meta-Wiki blacklist, and keeps an additional blacklist on the wiki page "My spam block list":

$wgBlacklistSettings = [
	'spam' => [
		'files' => [
			"$IP/extensions/SpamBlacklist/wikimedia_blacklist", // Wikimedia's list
			// database, title
			'DB: wikidb My_spam_block_list',    
		],
	],
];

Logging

By default, the extension does not log hits to a spam blacklist log. To enable logging, set $wgLogSpamBlacklistHits = true;. You can use the spamblacklist user right to control access to the logs; by default, every signed-in user can view them.
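
As a sketch, the following LocalSettings.php snippet turns on logging and, assuming you want to override the default grant to all signed-in users, restricts viewing of the log to administrators via the spamblacklist right:

// Record blacklist hits in Special:Log/spamblacklist
$wgLogSpamBlacklistHits = true;

// Optional: hide the log from ordinary users and show it to sysops only
$wgGroupPermissions['user']['spamblacklist'] = false;
$wgGroupPermissions['sysop']['spamblacklist'] = true;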

Issues

Backtrack Limit

If you encounter issues with the blacklist, you may want to increase the backtrack limit. However, this can reduce your security against DoS attacks [1], as the backtrack limit is a performance limit:

// Bump the Perl Compatible Regular Expressions backtrack memory limit
// (PHP 5.3.x default, 1000K, is too low for SpamBlacklist)
ini_set( 'pcre.backtrack_limit', '8M' );

Hardened Wikis

The SpamBlacklist will not allow editing if the wiki is hardened. Hardening includes limiting open_basedir so that curl is not on-path, and setting allow_url_fopen=Off in php.ini.

In the hardened case, SpamBlacklist will cause an exception when Guzzle attempts to make a network request. The Guzzle exception message is: "GuzzleHttp requires cURL, the allow_url_fopen ini setting, or a custom HTTP handler."
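
If you control the PHP configuration, one way to give Guzzle a working transport again is to enable one of the options named in that message. The exact remedy depends on how your host hardened PHP; as a sketch, in php.ini:

; php.ini – either load the cURL extension...
extension=curl
; ...or re-enable URL fopen wrappers
allow_url_fopen = On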

Safe list

A corresponding safe list can be maintained by editing the MediaWiki:Spam-whitelist page. This is useful if you would like to override certain entries from another wiki's blacklist that you are using. Wikimedia wikis, for instance, sometimes use the spam blacklist for purposes other than combating spam.
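
The safe list page uses the same regex-fragment syntax as the blacklist (see Syntax below). As an illustrative sketch only, a MediaWiki:Spam-whitelist page that overrides a single blocked domain (the domain here is just a placeholder) could contain:

# Allow links to example.org even though a blacklist pattern matches it
example\.org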

It is questionable how effective Wikimedia spam blacklists are at keeping spam off of third-party wikis. Some spam might be targeted only at Wikimedia wikis or only at third-party wikis, which would make Wikimedia's blacklist of little help to said third-party wikis in those cases. Some third-party wikis might prefer that users be allowed to cite sources that Wikipedia does not allow. Sometimes, what one wiki considers useless spam, another might consider useful.

Users may not always realize that when a link is rejected as spammy, it does not necessarily mean that the individual wiki they are editing has specifically chosen to ban that URL. Therefore, wiki system administrators may want to edit the system messages MediaWiki:Spamprotectiontext and/or MediaWiki:Spamprotectionmatch on their wiki to invite users to make suggestions at MediaWiki talk:Spam-whitelist for pages that an administrator should add to the safe list. For example, for MediaWiki:Spamprotectiontext you could put:

The spam filter blocked the text you wanted to save. This is probably caused by a link to a blacklisted external site. {{SITENAME}} maintains [[MediaWiki:Spam-blacklist|its own blacklist]]; however, most blocking is done by means of [[metawikimedia:Spam-blacklist|Meta-Wiki's blacklist]], so this block should not necessarily be construed as an indication that {{SITENAME}} made a decision to block this particular text (or URL). If you would like this text (or URL) to be added to [[MediaWiki:Spam-whitelist|the local spam safe list]], so that {{SITENAME}} users will not be blocked from adding it to pages, please make a request at [[MediaWiki talk:Spam-whitelist]]. A [[Project:Sysops|sysop]] will then respond on that page with a decision as to whether it should be listed as safe.

Notes

  • This extension examines only new external links added by wiki editors. To check user agents, add Akismet. As the various tools for combating spam on MediaWiki use different methods to spot abuse, the safeguards are best used in combination.
  • Users with the sboverride right can override the blacklist and add blocked links to pages. By default, this right is given only to bots; it can be granted to other groups as sketched below.
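
To grant the sboverride right to another group, a one-line LocalSettings.php sketch (here giving it to sysops, purely as an example) would be:

$wgGroupPermissions['sysop']['sboverride'] = true;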

Usage

Syntax

If you would like to create a blacklist of your own or modify an existing one, here is the syntax:

Everything on a line after a '#' character is ignored (for comments). All other strings are regex fragments which will only match inside URLs.

Notes
  • Do not add "http://"; this would fail, since the regex will match after "http://" (or "https://") inside URLs.
  • Furthermore, "www" is unneeded since the regex will match any subdomains. By giving "www\." explicitly one can match specific subdomains.
  • The (?<=//|\.) and $ anchors match the beginning and end of the domain name, not the beginning and end of the URL. The regular anchor ^ won't be of any use.
  • Slashes don't need to be escaped by backslashes. The script will do this automatically.
  • The spam blacklist functions before abuse filters, so blacklisted domains will not show up in entries in the abuse filter log (Special:AbuseLog); they will only appear in the spam blacklist log (Special:Log/spamblacklist).
Example

The following line will block all URLs that contain the string "example.com", except where it is immediately preceded or followed by a letter, a number, or an underscore (i.e. another word character).

\bexample\.com\b

These are blocked:

  • http://www.example.com
  • http://www.this-example.com
  • http://www.google.de/search?q=example.com

These are not blocked:

  • http://www.goodexample.com
  • http://www.google.de/search?q=example.commodity

Performance

The extension creates a single regex statement which looks like /https?:\/\/[a-z0-9\-.]*(line 1|line 2|line 3|....)/Si (where all slashes within the lines are escaped automatically). It saves this in a small "loader" file to avoid loading all the code on every page view. Page view performance will not be affected even if you're not using a bytecode cache, although using a cache is strongly recommended for any MediaWiki installation.

The regex match itself generally adds an insignificant overhead to page saves (on the order of 100ms in our experience). However, loading the spam file from the disk or the database and constructing the regex may take a significant amount of time, depending on your hardware. If you find that enabling this extension slows down saves excessively, try installing a supported bytecode cache. This extension will cache the constructed regex if such a system exists.
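
For example, a commonly paired setting is to point MediaWiki's main object cache at a PHP accelerator, so that computed data such as this regex can be cached. A minimal sketch for LocalSettings.php, assuming APCu or a similar accelerator is installed:

// Use the PHP accelerator (e.g. APCu) as the main object cache
$wgMainCacheType = CACHE_ACCEL;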

If you're sharing a server and cache with several wikis, you may improve your cache performance by modifying getSharedBlacklists and clearCache in SpamBlacklist_body.php to use $wgSharedUploadDBname (or a specific DB if you do not have a shared upload DB) rather than $wgDBname. Be sure to get all references! The regexes from the separate MediaWiki:Spam-blacklist and MediaWiki:Spam-whitelist pages on each wiki will still be applied.

External blacklist servers (RBLs)

This extension requires that the blacklist be constructed manually in its standard form. While regular expression wildcards are permitted, and a blacklist originated on one wiki may be re-used by many others, some effort is still required to add new patterns in response to spam or remove patterns that generate false positives.

Much of this effort may be reduced by supplementing the spam regex with lists of known domains advertised in spam emails. The regex will catch common patterns (like "casino-" or "-viagra") while the external blacklist server will automatically update with names of specific sites being promoted through spam.

In the filter() function in includes/SpamBlacklist.php, approximately halfway between the file start and end, are the lines:

       # Do the match
       wfDebugLog( 'SpamBlacklist', "Checking text against " . count( $blacklists ) .
           " regexes: " . implode( ', ', $blacklists ) . "\n" );

Directly above this section (which does the actual regex test on the extracted links), one could add additional code to check the external RBL servers [2]:

	# Do RBL checks
	$retVal = false;
	// DNS-based blacklist (RBL) zones to query
	$wgAreBelongToUs = [ 'l1.apews.org.', 'multi.surbl.org.', 'multi.uribl.com.' ];
	foreach ( $addedLinks as $link ) {
		// Reduce each added link to its host name
		$parts = parse_url( $link );
		$link_url = $parts['host'] ?? '';
		if ( $link_url ) {
			foreach ( $wgAreBelongToUs as $base ) {
				// An RBL lists a host by answering DNS lookups for "<host>.<rbl zone>"
				$host = "$link_url.$base";
				$ipList = gethostbynamel( $host );
				if ( $ipList ) {
					wfDebug( "RBL match: Hostname $host is {$ipList[0]}, it's spam says $base!\n" );
					$ip = wfGetIP(); // on recent MediaWiki, use RequestContext::getMain()->getRequest()->getIP()
					wfDebugLog( 'SpamBlacklistHit', "$ip caught submitting spam: {$link_url} per RBL {$base}\n" );
					$retVal = $link_url . ' (blacklisted by ' . $base . ')';
					return $retVal;
				}
			}
		}
	}

	# If no match is found on the RBL servers, continue normally with the regex tests...

This ensures that if an edit contains URLs from already blocked spam domains, an error is returned to the user indicating which link cannot be saved due to its appearance on an external spam blacklist. If nothing is found, the remaining regex tests can run normally, so any manually specified 'suspicious pattern' in the URL may be identified and blocked.

Note that the RBL servers list just the base domain names - not the full URL path - so http://example.com/casino-viagra-lottery.html will trigger RBL only if "example.com" itself was blocked by name by the external server. The regex, however, would be able to block on any of the text in the URL and path, from "example" to "lottery" and everything in between. Both approaches carry some risk of false positives - the regex because of the use of wildcard expressions, and the external RBL as these servers are often created for other purposes - such as control of abusive spam email - and may include domains which are not engaged in forum, wiki, blog or guestbook comment spam per se.

Other spam-fighting tools

There are various helpful manuals on mediawiki.org on combating spam and other vandalism:

Other anti-spam, anti-vandalism extensions include:

References