Misplaced Pages:Bots/Requests for approval/Cyberbot II 4

Revision as of 18:51, 28 August 2013

Cyberbot II 4

Operator: Cyberpower678 (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 02:04, Thursday June 27, 2013 (UTC)

Automatic, Supervised, or Manual: Automatic

Programming language(s): PHP

Source code available: No

Function overview: Tag all pages containing links blacklisted on MediaWiki:Spam-blacklist or meta:Spam blacklist with {{Spam-links}}

Links to relevant discussions (where appropriate): Misplaced Pages:Bot_requests#Unreliable_source_bot

Edit period(s): Daily

Estimated number of pages affected: Unknown. Probably hundreds or thousands at first

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): Yes

Function details: This bot scans the above-mentioned lists and tags any article-namespace page containing a blacklisted link with {{Spam-links}}.
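The check described above can be sketched roughly as follows. This is a minimal illustration, not the bot's actual code (the source was not published); function and variable names are assumptions.

```php
<?php
// Sketch of the core check: test each external link on a page against the
// blacklist regexes, skipping links that also match the whitelist.
// Illustrative only; the bot's real implementation is not shown in this BRFA.
function findBlacklistedLinks(array $links, array $blacklistRegex, array $whitelistRegex): array {
    $hits = [];
    foreach ($links as $link) {
        foreach ($blacklistRegex as $regex) {
            if (!preg_match($regex, $link)) continue;
            // Link matches a blacklist rule; check the whitelist before flagging.
            $whitelisted = false;
            foreach ($whitelistRegex as $wregex) {
                if (preg_match($wregex, $link)) { $whitelisted = true; break; }
            }
            if (!$whitelisted) $hits[] = $link;
            break; // no need to test further blacklist rules for this link
        }
    }
    return $hits;
}
```

Pages for which this returns a non-empty list would then be tagged with {{Spam-links}} listing the offending URLs.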

Discussion

Since the sites on those lists have been determined to be spam, would it be better to simply remove those links? Would your bot only consider external links, or also references? Thanks! GoingBatty (talk) 02:33, 27 June 2013 (UTC)

I believe it would be better to simply tag them instead of removing them; removing them risks breaking the page. I can have my bot remove them instead if that is what is preferred, or if the MediaWiki software turns out to inhibit the bot. As for your questions, it would handle any matched link in article space.—cyberpower Online 02:40, 27 June 2013 (UTC)
External links are not {{unreliable source}}, they're external links. Also, you should probably skip links listed at MediaWiki:Spam-whitelist. And note there are also links on the blacklist that aren't there because anything using that link is unreliable, e.g. any url shortener is there because the target should be linked directly rather than via a shortener. Anomie 11:24, 27 June 2013 (UTC)
Thanks for the input. I could remove external links while tagging refs with {{unreliable source}}.—cyberpower Offline 12:32, 27 June 2013 (UTC)

Wait, you're tagging external links that are listed on the spam blacklist with {{unreliable source}}? Unless I'm missing something here this won't work. When the bot tries to save the page, it will hit the blacklist and won't save. --Chris 13:55, 27 June 2013 (UTC)

I have considered that possibility, which is why my alternative is to simply remove the link and refs altogether.—cyberpower Online 14:51, 27 June 2013 (UTC)
In that case, I think this is something better dealt with by a human. Simply removing external links will probably lead to a bit of "brokenness" in the article where the link was, and would need human intervention to clean up after the bot. Also, if the article does have blacklisted links it in, chances are it probably has other problems (e.g. the entire article could be spam), so it would be preferable to have a human view the article and take action. I think if you want to continue with this task, the best thing to do would be for the bot to create a list of pages that contain blacklisted links, and post that for users to manually review. --Chris 15:18, 27 June 2013 (UTC)
I'm not certain the software will block the bot's edits if the spam link is already there. I was thinking more along the lines that the tag places the page in a category that humans can then review. If it can't tag next to the link, maybe it can tag the page instead and place it in the same category. What do you think?—cyberpower Online 15:30, 27 June 2013 (UTC)
As I understand it, if the spam link is already on the page, the software will block the edit anyway. --Chris 16:00, 27 June 2013 (UTC)
Hmmm. I'm looking at the extension that is responsible. If the software blocked any edit to a page that already contains the link, that would likely cause a lot of problems on-wiki. But I'll have more info later tonight.—cyberpower Offline 16:57, 27 June 2013 (UTC)
{{BAGAssistanceNeeded}} I have tested the spam filter extensively on the Peachy wiki. Tagging blacklisted links will not trip the filter, nor will removing one or re-adding a link that already exists on the page. Modifying the link, or adding it to a page where it is not yet present, will trip the filter.—cyberpower Offline
Ok, I stand corrected. I'd like to review the source code for this bot. --Chris 12:40, 3 July 2013 (UTC)
Also can you give a bit more detail on exactly how the bot will operate? Will it only be tagging references, or will it remove external links as mentioned above? How will the bot deal with any false positives? Will it skip links listed on MediaWiki:Spam-whitelist? Will it be possible to whitelist other links (e.g. url shorteners as mentioned by Anomie), that shouldn't be tagged as unreliable? --Chris 12:47, 3 July 2013 (UTC)
The bot code is not yet fully complete as of this writing. I seem to be hitting resource barriers: because it processes an enormous number of external links, I am working on reducing memory usage, and the regex scan is quite a resource hog whose efficiency I am also trying to improve. Yes, it will obey the whitelist. Because removing a link risks breaking things, and tagging references can lead to false positives, I thought about placing a tag at the top of the page listing the links it found. False positives can be reported to me or an admin, who will modify a .js page in my userspace with an exception to be added or removed, which the bot will read before it edits pages.—cyberpower Online 14:00, 3 July 2013 (UTC)
  • The script is now finished. Chris G. has the code and is reviewing it. The task will seek out blacklisted external links and tag the pages containing them. Exceptions can be added for specific cases and it reads the whitelist too.—cyberpower Online 15:33, 24 July 2013 (UTC)
Although this page states that blacklisted external links will be tagged with {{unreliable source}}, Misplaced Pages:Village_pump_(miscellaneous)#New_Bot states that they will be tagged with {{spam-links}}. Could you please clarify? Thanks! GoingBatty (talk) 23:38, 24 July 2013 (UTC)
Some changes were made since the filing of this BRFA. I have now amended the above.—cyberpower Offline 05:43, 25 July 2013 (UTC)
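The exception mechanism described above (a .js page in the operator's userspace that the bot reads before editing) might look roughly like this. The entry format shown is an assumption, not documented in this BRFA; names are illustrative.

```php
<?php
// Hypothetical sketch of the exception-list check. Assumes the .js exception
// page holds one entry per line, e.g. "page=Example" or "namespace=4"; the
// actual format used by the bot is not shown in this BRFA.
function isExcepted(string $title, int $ns, array $exceptionLines): bool {
    foreach ($exceptionLines as $exception) {
        $exception = trim($exception);
        if (strpos($exception, "page=") === 0 &&
            substr($exception, strlen("page=")) === $title) return true;
        if (strpos($exception, "namespace=") === 0 &&
            (int)substr($exception, strlen("namespace=")) === $ns) return true;
    }
    return false;
}
```

The bot would skip any page for which this returns true, so false positives can be silenced without code changes.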

Review:

  • Try to avoid using gotos wherever possible. They make code hard to read, and often lead to strange bugs. E.g. at line 86, instead of:
    if( empty($blacklistregexarray) ) goto theeasystuff;
    else $blacklistregex = buildSafeRegexes($blacklistregexarray);

You could have written:

    if( !empty($blacklistregexarray) ) {
           $blacklistregex = buildSafeRegexes($blacklistregexarray);
           <LINES 89 - 112>
    }

 Done all labels removed.—cyberpower Online 12:24, 25 July 2013 (UTC)

  • Line 13 - Why the while loop? Unless there is a continue I am missing somewhere it seems to just run once, and break at line #156

 Already done The break command was a remnant from the debugging period. It's removed now.—cyberpower Online 12:24, 25 July 2013 (UTC)

  • Line 36 - while str_replace should work 99% of the time, it would be best practice to use substr instead, e.g.:
substr($exception,strlen("page="))

 Done Missed this one.—cyberpower Online 12:44, 25 July 2013 (UTC)
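The difference matters whenever the prefix string can also occur inside the value. A small sketch (the exception value here is hypothetical, chosen only to show the failure mode):

```php
<?php
// Hypothetical exception entry whose value happens to contain the "page="
// prefix again. str_replace strips every occurrence, mangling the value;
// substr strips only the leading prefix.
$exception = "page=Talk:page=Example";

$byReplace = str_replace("page=", "", $exception);            // "Talk:Example"  (wrong)
$bySubstr  = substr($exception, strlen("page="));             // "Talk:page=Example" (right)
```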

  • Lines 127 - 131: you seem to be checking that the API hasn't returned a blank page? This should really be done at the framework level, not in the bot code. Basically, check that the HTTP code == 200; if it doesn't match, sleep for 1 second and try again. If it fails again, sleep for 2 seconds, and so on. But this should be done at the framework level, so you don't have to worry about it each time you use "$pageobject->get_text();" (in fact, it should be checked on all API queries)

 Already done You reminded me that I programmed that safeguard into the Peachy framework already. :p—cyberpower Online 12:24, 25 July 2013 (UTC)
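The framework-level retry being described could be sketched like this. This is a hypothetical helper with illustrative names; the actual safeguard lives inside the Peachy framework, whose code is not shown here.

```php
<?php
// Sketch of framework-level retry: re-issue a request whenever the HTTP
// status is not 200, backing off 1s, then 2s, and so on, as suggested above.
// $doRequest returns [httpCode, body]; names are illustrative.
function fetchWithRetry(callable $doRequest, int $maxAttempts = 5) {
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        list($httpCode, $body) = $doRequest();
        if ($httpCode == 200) return $body;
        sleep($attempt); // back off: 1s, 2s, 3s, ...
    }
    return false; // persistent failure; caller decides how to handle it
}
```

With this in the framework, every `$pageobject->get_text();` call (and any other API query) gets the safeguard for free.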

  • Bug at line 165 - "else return true;" I think you want "return true;" after the foreach loop. Otherwise it only checks one of the whitelisted links.
        if( preg_match($regex, $link) ) {
            foreach( $whitelistregex as $wregex ) {
                if( preg_match($wregex, $link) ) return false; 
                else return true;
            }
        }

vs.

        if( preg_match($regex, $link) ) {
            foreach( $whitelistregex as $wregex ) {
                if( preg_match($wregex, $link) ) 
                     return false; 
            }
            return true;
        }

 Fixed —cyberpower Online 12:24, 25 July 2013 (UTC)

  • General comment. Considering how many edits your bot is going to make, you should put a sleep(); somewhere in the code to make sure you don't hammer the servers. At the very least after each edit, if not every http request.

 Already done Framework has throttle.—cyberpower Online 12:24, 25 July 2013 (UTC)

  • lines 145ish - is it possible to get the page id in the same API request as you get the transclusions? That way instead of making 165,000+ API calls (for each page), you only make about 33 calls.

 Done —cyberpower Online 12:24, 25 July 2013 (UTC) --Chris 09:29, 25 July 2013 (UTC)
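The arithmetic behind this suggestion: the MediaWiki API returns transclusion results in large batches (up to 5,000 per request for bot accounts, which is consistent with the "about 33 calls" figure above), so fetching page ids alongside the transclusion listing replaces per-page lookups with per-batch ones. A small sketch of the batch count, using a stand-in title list:

```php
<?php
// Illustration of the batching arithmetic: chunking ~165,000 pages into
// batches of 5,000 turns one-call-per-page into one-call-per-batch.
// range() is a stand-in for the real transclusion list.
$titles  = range(1, 165000);
$batches = array_chunk($titles, 5000);
$apiCalls = count($batches); // 33 requests instead of 165,000
```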

  • AHHH. How did I not see that regex scan bug? D: Thanks for the input. I'll make the appropriate modifications now. I completely forgot that the framework was already designed to handle errors. :D
    Modifications finished.—cyberpower Online 12:44, 25 July 2013 (UTC)

Trial

Ok, we'll start with a small trial to make sure everything runs smoothly, and then we can move onto a much wider trial. Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. --Chris 10:57, 2 August 2013 (UTC)

It started out ok, but then something went horribly wrong and it started tagging pages with empty tags. I have terminated the bot at the moment and will be looking into what caused the problems.—cyberpower Online 12:46, 10 August 2013 (UTC)
Bug found. Bot restarted.—cyberpower Online 19:58, 10 August 2013 (UTC)
Trial complete. I haven't looked at the edits yet as it's currently the middle of the night.—cyberpower Offline 00:32, 12 August 2013 (UTC)

Even after the restart, 2 pages had blank tags added (1, 2). Also, maybe non-article pages should be skipped (3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17), unless there is some reason that these should have the links removed. —Jay8g 01:18, 13 August 2013 (UTC)

Thank you. I am already looking into the bug, and am already working on excluding namespaces.
The bugs have been fixed. The exceptions list now supports entire namespaces.—cyberpower Online 12:14, 15 August 2013 (UTC)

Approved for extended trial (1000 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Although I would ask that you do them in batches (maybe 100 or 200 edits at a time) --Chris 10:16, 24 August 2013 (UTC)

  • The bugs seem to be fixed. However, upon reactivating the bot, it began tagging spam reports in the Misplaced Pages space. I have added the Misplaced Pages namespace into the exclusions list.—cyberpower Online 16:06, 24 August 2013 (UTC)
  • Comment - You may want to change the edit summary from "Tagging page with Spam-links" to "Tagging page with Template:Spam-links" to make it clear that the bot isn't adding links, but adding a template. Giving users a link to the template documentation may also help reduce the number of comments you get on your bot's talk page. GoingBatty (talk) 14:42, 26 August 2013 (UTC)
    Good idea.—cyberpower Online 15:01, 26 August 2013 (UTC)
  • Comment Frankly, these spam-link tags are massive, too invasive, and even disturbing for readers. They should have a more decent and reasonable size and position. Cavarrone 11:10, 28 August 2013 (UTC)
    The intent is to tackle spam links. They are supposed to be very noticeable.—cyberpower Offline 13:42, 28 August 2013 (UTC)
    I understand the intent very well, but the result is distracting and objectively disturbing for readers, and consequently damaging to the encyclopedia: the main purpose of Misplaced Pages should be serving readers, not punishing spammers (at least not in a way that affects the whole article). With that size and position, the template becomes almost the main topic of the article, which is IMHO unacceptable. I am not even sure it harms the spammers (if their intent is publicizing a website, I doubt they mind having their links flagged at the head of an article). We have filters and whitelists to prevent further spamming of a blacklisted website, but marking or cleaning up previous (possible) spam requires a finer and more accurate approach than this. I predict a lot of similar complaints when/if the bot runs on a regular basis (e.g., have you any idea how many articles use the blacklisted New York Times as a source?). Cavarrone 15:23, 28 August 2013 (UTC)
    This is merely a starter template. If people want it changed, then consensus will change.—cyberpower Limited Access 17:24, 28 August 2013 (UTC)
  • Cyberpower, you have a serious bug. None of the links on Access2Research are blacklisted. It reports a 'petition' rule, but the rule is more specific than just the word 'petition'; the actual rule catches only a couple of domains, not all links with that term. See the local and meta talk pages of the blacklists for requests regarding petitions ... Note that the links there are actually saved, so they are not blacklisted ... --Dirk Beetstra 18:43, 28 August 2013 (UTC)
  • Other point: I would really consider moving the template to a more neutral name, like 'blacklisted-links'. Not all links on the blacklist are spam; they are, however, all blacklisted .. --Dirk Beetstra 18:51, 28 August 2013 (UTC)