Misplaced Pages

:Bots/Requests for approval/Cyberbot II 4: Difference between revisions - Misplaced Pages

Revision as of 18:51, 28 August 2013 by Beetstra (edit summary: "Discussion: rename template?")
Latest revision as of 19:12, 8 May 2022 by Qwerfjkl (bot) (minor edit; edit summary: "Discussion: Replaced deprecated <source> tags with <syntaxhighlight>"; tag: AWB)
(52 intermediate revisions by 11 users not shown)
<noinclude>]</noinclude><div class="boilerplate metadata" style="background-color:#A0FFA0; margin:2em 0 0 0; padding:0 10px 0 10px; border:1px solid #AAAAAA;">
:''The following discussion is an archived debate. <span style="color:red">'''Please do not modify it.'''</span> To request review of this BRFA, please start a new section at ].'' The result of the discussion was ] '''Approved'''{{#ifeq:yes|yes|.}}<!-- from Template:Bot Top-->
==]==
{{Newbot|Cyberbot II|4}}

<!-- A SHORT bot function overview (max 2–3 lines); place the in-depth explanation in the "function details" section below-->
'''Function overview:''' Tag all pages containing blacklisted links in the ] and the ] with {{tlx|Spam-links}}

<!-- Bot tasks require consensus in order to be approved. Please list any relevant discussions here to indicate consensus for the task. If such input is not necessary (for instance, a task that is duplicating or closely matching an existing bot) leave this blank-->
'''Links to relevant discussions (where appropriate):''' ]

<!-- e.g. Continuous, daily, one time run, etc. -->
*:::This is merely a starter template. If people want it changed, then consensus will change.—] ]<sub style="margin-left:-4.4ex;color:\#FF8C00;font-family:arnprior">Limited Access</sub> 17:24, 28 August 2013 (UTC)
* Cyberpower, you have a serious bug. None of the links on ] are blacklisted. It reports on a petitionrule, but the rule is more specific than just the word 'petition', the actual rule catches only a couple of domains, not all links with that term. See the local and meta talkpages of the blacklists for requests regarding petitions ... Note that the links there are actually saved, so they are not blacklisted ... --] <sup>] ]</sup> 18:43, 28 August 2013 (UTC)
*:The petition rule doesn't seem specific to me. I hardly call <code>\bpetition(?:online|s)?\b</code> specific. It's not a bot bug.—] ]<sub style="margin-left:-4.4ex;color:olive;font-family:arnprior">Online</sub> 19:04, 28 August 2013 (UTC)
::those links are not blacklisted, I just managed to save a page with one of the links ... --] <sup>] ]</sup> 21:20, 28 August 2013 (UTC)
:::If they exist already, they won't be blocked. Also, I have found that the filter only partially enforces the regex list. Sometimes it blocks links with petition in it and other times it doesn't. The regex generator is the same as MediaWiki's extension. The validation process of these links is identical. If it's really a bug, then it's a bug with PHP.—] ]<sub style="margin-left:-4.4ex;color:olive;font-family:arnprior">Online</sub> 22:03, 28 August 2013 (UTC)
::See --] <sup>] ]</sup> 21:23, 28 August 2013 (UTC)
* Other point, I would really consider to move the template to a more neutral name, like 'blacklisted-links'. Not all links are spam that are on the blacklist, they are however all blacklisted .. --] <sup>] ]</sup> 18:51, 28 August 2013 (UTC)
*:That can be sorted out afterwards.—] ]<sub style="margin-left:-4.4ex;color:olive;font-family:arnprior">Online</sub> 19:04, 28 August 2013 (UTC)
:::It is also tagging Google Books links whose search strings contain "forbidden words" like "petitions". That's not a spam link and it doesn't trigger the spam filter either, because it isn't forbidden in that context. I'll also second that the box is too large and overwhelming. ] (]) 01:53, 29 August 2013 (UTC)
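
The disagreement above turns on where the quoted fragment <code>\bpetition(?:online|s)?\b</code> is allowed to match. A minimal sketch of the difference, assuming two hypothetical ways of wrapping the fragment: one applied to the whole URL and one restricted to the host part. The host-anchoring prefix is an illustration only and is not claimed to be the pattern the blacklist extension actually builds; both URLs are made-up examples.

```php
<?php
// Blacklist fragment quoted in the discussion.
$fragment = '\bpetition(?:online|s)?\b';

// Hypothetical wrapper 1: the fragment may match anywhere in the URL.
$unanchored = '/' . $fragment . '/i';

// Hypothetical wrapper 2: the fragment may only match inside the host name,
// directly after the "//" separator. Illustrative prefix, not the
// extension's confirmed pattern.
$hostAnchored = '/\/\/+[a-z0-9_\-.]*(?:' . $fragment . ')/i';

// Made-up example URLs.
$searchUrl = 'https://books.google.com/books?id=abc&q=petitions';
$domainUrl = 'http://www.petitiononline.com/example';

function matches(string $regex, string $url): bool {
    return preg_match($regex, $url) === 1;
}

var_dump(matches($unanchored, $searchUrl));   // bool(true): the query string trips it
var_dump(matches($hostAnchored, $searchUrl)); // bool(false): the host is books.google.com
var_dump(matches($hostAnchored, $domainUrl)); // bool(true): the host itself matches
```

Under these assumptions, a scanner using the first wrapper would tag the search link while a filter using the second would let the same link be saved, which is the kind of mismatch reported here.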

Cyberpower678 can you stop the trial until we sort out the above. --] 02:00, 29 August 2013 (UTC)
:Per my talk page discussion, it has been brought to my attention that these issues are indeed a bot issue, not because it's bugged, but because it's running outdated code. I seem to have downloaded an outdated version of the extension. I'll make the modifications in the next few days. My bot has been shut down for quite some time now. I still recommend that petition regex be removed. There's no need for it. As for the spam links template, that can be fixed later as it's not crucial to the bot's operation.—] ]<sub style="margin-left:-4.4ex;color:olive;font-family:arnprior">Online</sub> 02:05, 29 August 2013 (UTC)
::I am glad that you have found the problem, but I can't say I agree that the template is not crucial to the bot's operation or that it can be fixed later. This is an encyclopedia and the template (rather than the code) is what our readers and editors see and use. The prior template was inappropriately large and overwhelming; it talks about "spam links" which is not the case and uses the term "external links" in a way that is not consistent with WP's definition at ]. At least on the ] page, the promised list of "problematic links" , which meant we had to dig to try and figure out what the (non)problem was. Given the fact that the bot is in a trial stage and is making mistakes, it seems to me that it would be better to provide information about where to report errors in tagging, rather than the current formulation which suggest that the bot can do no wrong, that the article and its editors (or the blacklist) are at fault, and that they need to figure out the problem and act on it or the bot will be back. That's not the case, and feedback needs to be given, received and then acted upon with a positive spirit. ] (]) 11:54, 29 August 2013 (UTC)
:::I agree with you completely. The problem was that I was convinced that there was no issue with the code, that the regex is generated by the blacklist extension itself, that the bot simply validates the regex against the blacklist. I was right in every aspect. Where I was wrong was that I was running an out of date version of the extension. The newer version has a more refined regex generator. The template layout should be decided by the community. I merely created something for the bot. I'm not going to force this template on the community if they don't want it.—] ]<sub style="margin-left:-4.4ex;color:red;font-family:arnprior">Offline</sub> 12:47, 29 August 2013 (UTC)
:::To expand on what I said, I meant not crucially in fixing bugs in the bot that may cause mistags. I do agree that the template will need to be fixed and adjusted to reflect consensus, but now I am only concerned in making sure the bot operates correctly.—] ]<sub style="margin-left:-4.4ex;color:olive;font-family:arnprior">Online</sub> 14:21, 29 August 2013 (UTC)
::::Yeah, but the bot operating correctly comprises not just the technical aspects of a bot etc but also the surface and interface aspects of the template, as well as the giving and receiving of feedback between the community and the bot operator. I understand that you did not think that there was a mistake with the bot, but the fact is that there was a problem. It isn't the end of the world and it isn't a question of fault, because everybody makes mistakes, and mistakes are a good opportunity for learning and growth. But what I am a little concerned with is that you might be a little bit too interested in getting the code right and not enough interested in the interactional component of bots with the community. A bunch of people told you it was making mistakes and you just said "no, it is doing what it is supposed to do", when it wasn't. You even said that to one person ''after'' you'd found there was a problem.. A number of articles are still tagged inappropriately. How about cleaning them up? And it doesn't seem like you are taking seriously the comments about the template because you want to do it "later" and are even apparently planning another run without working on this. What happens if there is another bug that you didn't realize was there? I realize you don't think there is, but that's what you thought the last time, wasn't it?. Personally I don't think you should add any more tags until the erroneous ones are removed and the concerns about the template addressed. A bot is a whole package, not just the coding ] (]) 17:43, 29 August 2013 (UTC)
:::::No offense, but I think you are twisting things out of proportion. There are people who will claim a link is not blacklisted when in fact, it is. The bot is supposed to keep tagging pages until it no longer sees them as blacklisted. No, I do not run the bot if I see a bug surfacing. That's why it's been off. The next run is supposed to test the removal of tags containing links not blacklisted. I have indeed fixed the regex scanner. I am aware that the tag is important, but it's not my say on how it's supposed to look, but the community's, so I won't take initiative. Feel free to modify the tag so it still works, but looks "better". If there is a visible bug, the bot gets shut down immediately. The petition problem was not because the code was buggy, but because the regex generator was running on the wrong version. I take comments very seriously, and I find that remark mildly offensive, to say I don't. You have essentially just called me incompetent.—] ]<sub style="margin-left:-4.4ex;color:red;font-family:arnprior">Offline</sub> 18:33, 29 August 2013 (UTC)
::::::Cyberpower, you say that you take comments very seriously. which was apparently specifically designed with this bot in mind. You took the initiative in creating it, decided how it was going to look then, and now it would be great if you would follow up by considering the community's good faith comments above (and on the template talk) and modifying it. That would show that you do indeed take comments seriously, as opposed to passing on the responsibility "to the community". Removing the incorrect tags from articles asap -manually if need be- would also show that you take your responsibilities as a bot operator seriously. I am sure you that you are not incompetent but it seems that you may not understand the degree of frustration and discouragement such bots can cause to good faith (and perhaps technically clueless) content editors when articles are incorrectly tagged (sometimes repeatedly). This is . Anyway, with this I leave you and the rest of the bot experts to your work. ] (]) 19:06, 29 August 2013 (UTC)
*I have updated the regex scanner. It should now mirror Misplaced Pages's blacklist filter when scanning regexes. The change will go into effect on the next run.—] ]<sub style="margin-left:-4.4ex;color:olive;font-family:arnprior">Online</sub> 14:23, 29 August 2013 (UTC)

I am going on Wikibreak for awhile. As I will not be around to monitor the trial, please stop the trial, and wait for another BAGer to approve and monitor it. --] 11:25, 2 September 2013 (UTC)

More blank tags were added (, ), both in namespaces where no tags should have been added''':Jay8g''' <small>]•]•]<nowiki />]</small> 03:04, 3 September 2013 (UTC)
:I saw it, and already fixed the issue. Thank you.—] ]<sub style="margin-left:-4.4ex;color:olive;font-family:arnprior">Online</sub> 03:17, 3 September 2013 (UTC)
::I will commence the last ~50 edits sometime tomorrow. This run will test to see if the good links are no longer tagged. This will verify that the exceptions list, whitelist, and the new regex scanner are working correctly.—] ]<sub style="margin-left:-4.4ex;color:olive;font-family:arnprior">Online</sub> 16:47, 3 September 2013 (UTC)
:::Cyberpower. Did you not see that Chris G has told you to stop the trial for now? See his post of 11:25, 2 September 2013 (UTC). ] (]) 17:27, 3 September 2013 (UTC)
::::Oops. I forgot to mention that addshore has volunteered to take over this trial and has allowed me to resume it.
:::::It is probably best to ask addshore to come here and do so officially, don't you think?] (]) 18:10, 3 September 2013 (UTC)
::::::I do. I'll ask him to post here before proceeding.—] ]<sub style="margin-left:-4.4ex;color:\#FF8C00;font-family:arnprior">Limited Access</sub> 18:17, 3 September 2013 (UTC)
:::::::I see no harm in completing the last 50 edits! :) ''']''' <sup>]</sup> 11:17, 10 September 2013 (UTC)
::::::::{{BotTrialComplete}} I accidentally went a bit over. I will post my results and fixes shortly.—] ]<sub style="margin-left:-4.4ex;color:olive;font-family:arnprior">Online</sub> 14:55, 11 September 2013 (UTC)
If possible, it might be a good idea to un-exclude "Misplaced Pages talk:Articles for creation/" prefix pages, as the intention is that they would become articles''':Jay8g''' <small>]•]•]<nowiki />]</small> 02:11, 12 September 2013 (UTC)
:puuuh. That's going to be tricky. I'd have to I exclude the Misplaced Pages talk namespace and add each Misplaced Pages talk page manually.—] ]<sub style="margin-left:-4.4ex;color:\#FF8C00;font-family:arnprior">Limited Access</sub> 11:17, 12 September 2013 (UTC)
:On second thought, there shouldn't be any blacklisted links on those AfC pages, since the filter should be filtering them out.—] ]<sub style="margin-left:-4.4ex;color:red;font-family:arnprior">Offline</sub> 13:52, 12 September 2013 (UTC)

=== Post-Trial ===
I am currently generating the post trial report for a BAGger and everyone else to review.—] ]<sub style="margin-left:-4.4ex;color:red;font-family:arnprior">Offline</sub> 13:53, 12 September 2013 (UTC)

The following are the bugs and issues brought up during the extended trial, with the status of each:
#{{tlx|Spam-links}} should be renamed to a more neutral name: {{done}} Renamed to {{tlx|Blacklisted-links}}. Bot change will go into effect upon approval or new trial.
#Tag template used incorrect terminology: {{fixed}}
#{{tlx|Blacklisted-links}} is too large in size and disruptive to the readers: {{done|Resolved}} template is collapsed by default with a one line notice.
#{{tlx|Blacklisted-links}} is not showing the list of blacklisted URLs, in some cases, despite it being present in the Wikimarkup: {{fixed}}
#Bot has been tagging links that shouldn't be tagged. {{fixed}} Regex scanner was not up to date.
#Bot is tagging wrong namespaces. {{done|Resolved}} Any administrator can alter the settings on the exceptions page.
If I have missed any error reports that should have been addressed, please let me know.—] ]<sub style="margin-left:-4.4ex;color:olive;font-family:arnprior">Online</sub> 14:50, 12 September 2013 (UTC)

{{BotExtendedTrial|edits=100}} Quick trial to see any more issues. Article namespace. —&nbsp;<small>&nbsp;]&nbsp;&nbsp;▎]</small> 19:55, 17 September 2013 (UTC)
*{{BotTrialComplete}} No further issues found. Report will be posted shortly.—] ]<sub style="margin-left:-4.4ex;color:olive;font-family:arnprior">Online</sub> 00:58, 22 September 2013 (UTC)

=== Second Post-Trial ===
The contributions from the trial can be found .

Number of edits performed: 141

Bugs found: 0

Issues found: 1

#When the bot went through to de-tag invalid tags, because the tag was renamed, several pages did not get returned by the API. This left invalid tags in place.
#:'''Solution:''' I have fixed this issue by writing a quick script to remove all tags under the old name. This should not be an issue for future runs as they will all be transcluded under the new name.

Per ] and , I can confirm the cleanup script did its job and that only valid tags are present, in the article namespace only. The bot successfully detected and removed several petition false-positives since the regex scanner has been updated. I also recommend that the {{tlx|Spam-links}} tag be deleted at this point.—] ]<sub style="margin-left:-4.4ex;color:olive;font-family:arnprior">Online</sub> 01:57, 22 September 2013 (UTC)

Thinking about this and looking at the edits. What happens when the offending link is removed from the page? Would the bot come back and remove the tag if it wasn't removed ? Or if the link is no longer blacklisted? Otherwise, edits look good. —&nbsp;<small>&nbsp;]&nbsp;&nbsp;▎]</small> 14:06, 22 September 2013 (UTC)
:The bot will remove the tag if the link in question is either, white listed, removed from the blacklist, added to the exceptions list, or removed from the page.—] ]<sub style="margin-left:-4.4ex;color:olive;font-family:arnprior">Online</sub> 17:33, 22 September 2013 (UTC)
::Right, I should've looked more carefully. —&nbsp;<small>&nbsp;]&nbsp;&nbsp;▎]</small> 17:38, 22 September 2013 (UTC)
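
The removal conditions described above can be sketched as a single check over the links currently on a page. This is a minimal illustration, not the bot's code: the function names, the exact-match exceptions list, and the sample regexes and URLs are all assumptions.

```php
<?php
// Returns true if $link matches at least one of the given regexes.
function matchesAny(array $regexes, string $link): bool {
    foreach ($regexes as $regex) {
        if (preg_match($regex, $link) === 1) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: returns the links on the page that still justify a tag.
// A link drops out if it is on the exceptions list, is whitelisted, or no
// longer matches the blacklist. A link removed from the page is simply
// absent from $pageLinks.
function stillBlacklistedLinks(
    array $pageLinks,
    array $blacklist,
    array $whitelist,
    array $exceptions
): array {
    $remaining = [];
    foreach ($pageLinks as $link) {
        if (in_array($link, $exceptions, true)) {
            continue;
        }
        if (matchesAny($whitelist, $link)) {
            continue;
        }
        if (matchesAny($blacklist, $link)) {
            $remaining[] = $link;
        }
    }
    return $remaining;
}

// Made-up sample data.
$blacklist = ['/\bexample-spam\.com\b/i'];
$whitelist = ['/example-spam\.com\/allowed/i'];
$links = [
    'http://example-spam.com/x',
    'http://example-spam.com/allowed/page',
    'http://example.org/',
];
$remaining = stillBlacklistedLinks($links, $blacklist, $whitelist, []);
```

In this sketch the tag would be removed whenever the returned array is empty, which covers all four cases named in the reply.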

{{BotApproved}} Trials look good, no problems that I see, all issues seem to be reasonably resolved. BRFA open for quite a while, lots of edits, no more comment received. Would be good if you made a page explaining what to do after a page is tagged and link it from the edit summary. —&nbsp;<small>&nbsp;]&nbsp;&nbsp;▎]</small> 17:38, 22 September 2013 (UTC)

:''The above discussion is preserved as an archive of the debate. <span style="color:red">'''Please do not modify it.'''</span> To request review of this BRFA, please start a new section at ].''<!-- from Template:Bot Bottom --></div>


The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.

Cyberbot II 4

Operator: Cyberpower678 (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 02:04, Thursday June 27, 2013 (UTC)

Automatic, Supervised, or Manual: Automatic

Programming language(s): PHP

Source code available: No

Function overview: Tag all pages containing blacklisted links in the MediaWiki:Spam-blacklist and the meta:Spam blacklist with {{Spam-links}}

Links to relevant discussions (where appropriate): Misplaced Pages:Bot_requests#Unreliable_source_bot

Edit period(s): Daily

Estimated number of pages affected: Unknown. Probably hundreds or thousands at first

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): Yes

Function details: This bot scans the above-mentioned lists and tags any page in the article namespace that contains a blacklisted link with {{Spam-links}}.

Discussion

Since the sites on those lists have been determined to be spam, would it be better to simply remove those links? Would your bot only consider external links, or also references? Thanks! GoingBatty (talk) 02:33, 27 June 2013 (UTC)

I believe it would be better to simply tag them instead of removing them. It is uncertain whether removing them may end up breaking something. I can have my bot remove them instead if that is what is preferred, or if the MediaWiki software turns out to inhibit the bot. As for your questions, it would handle any link matched in article space.—cyberpower Online 02:40, 27 June 2013 (UTC)
External links are not {{unreliable source}}, they're external links. Also, you should probably skip links listed at MediaWiki:Spam-whitelist. And note there are also links on the blacklist that aren't there because anything using that link is unreliable, e.g. any url shortener is there because the target should be linked directly rather than via a shortener. Anomie 11:24, 27 June 2013 (UTC)
Thanks for the input. I could remove external links while tagging refs with {{unreliable source}}.—cyberpower Offline 12:32, 27 June 2013 (UTC)

Wait, you're tagging external links that are listed on the spam blacklist with {{unreliable source}}? Unless I'm missing something here this won't work. When the bot tries to save the page, it will hit the blacklist and won't save. --Chris 13:55, 27 June 2013 (UTC)

I have considered that possibility, which is why my alternative is to simply remove the link and refs altogether.—cyberpower Online 14:51, 27 June 2013 (UTC)
In that case, I think this is something better dealt with by a human. Simply removing external links will probably lead to a bit of "brokenness" in the article where the link was, and would need human intervention to clean up after the bot. Also, if the article does have blacklisted links in it, chances are it probably has other problems (e.g. the entire article could be spam), so it would be preferable to have a human view the article and take action. I think if you want to continue with this task, the best thing to do would be for the bot to create a list of pages that contain blacklisted links, and post that for users to manually review. --Chris 15:18, 27 June 2013 (UTC)
I'm not certain if the software will block the bot's edits, if the spam link is already there. I was thinking more along the lines that the tags place it in a category, that humans can then review. If it can't tag it next to the link, maybe it can tag the page instead and place it in the same category. What do you think?—cyberpower Online 15:30, 27 June 2013 (UTC)
As I understand it, if the spam link is already on the page, the software will block the edit anyway. --Chris 16:00, 27 June 2013 (UTC)
hmmm. I'm looking at the extension that is responsible. If the software blocks any edit that has the link in there already, that would likely cause a lot of problems on wiki. But, I'll have more info later tonight.—cyberpower Offline 16:57, 27 June 2013 (UTC)
{{BAGAssistanceNeeded}} I have tested the spam filter extensively on the peachy wiki. Tagging blacklisted links will not trip the filter, nor will removing it or adding the link if it already exists on the page. Modifying the link, or adding it to a page where the link is not yet present will trip the filter.—cyberpower Offline
Ok, I stand corrected. I'd like to review the source code for this bot. --Chris 12:40, 3 July 2013 (UTC)
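
The test results reported above are consistent with a filter that only inspects links newly introduced by an edit. A minimal sketch of that model follows; it is a reading of the reported behaviour, not the extension's code, and the function names, sample regex and URLs are assumptions.

```php
<?php
// Links present after the edit that were not present before it.
function addedLinks(array $oldLinks, array $newLinks): array {
    return array_values(array_diff($newLinks, $oldLinks));
}

// Hypothetical model of the filter: only newly added links are checked
// against the blacklist regexes.
function editWouldBeBlocked(array $oldLinks, array $newLinks, array $blacklist): bool {
    foreach (addedLinks($oldLinks, $newLinks) as $link) {
        foreach ($blacklist as $regex) {
            if (preg_match($regex, $link) === 1) {
                return true;
            }
        }
    }
    return false;
}

// Made-up sample data.
$blacklist = ['/\bexample-spam\.com\b/i'];
$bad = 'http://example-spam.com/a';

// Tagging the page: the link set is unchanged, so nothing is checked.
$tagging = editWouldBeBlocked([$bad], [$bad], $blacklist);
// Removing the link: nothing is added.
$removing = editWouldBeBlocked([$bad], [], $blacklist);
// Repeating a link that is already on the page: still nothing new.
$repeating = editWouldBeBlocked([$bad], [$bad, $bad], $blacklist);
// Modifying the link produces a URL that was not there before.
$modifying = editWouldBeBlocked([$bad], ['http://example-spam.com/b'], $blacklist);
// Adding the link to a page that did not have it.
$adding = editWouldBeBlocked([], [$bad], $blacklist);
```

Under this model the first three edits pass and the last two are blocked, matching the behaviour described for the test wiki.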
Also can you give a bit more detail on exactly how the bot will operate? Will it only be tagging references, or will it remove external links as mentioned above? How will the bot deal with any false positives? Will it skip links listed on MediaWiki:Spam-whitelist? Will it be possible to whitelist other links (e.g. url shorteners as mentioned by Anomie), that shouldn't be tagged as unreliable? --Chris 12:47, 3 July 2013 (UTC)
The bot code is not yet fully completed as of this writing. I seem to be hitting resource barriers. Because it processes an enormous number of external links, I am working on conserving memory usage. Also, the regex scan is quite a resource hog as well, which I am trying to make more efficient. Yes, it will obey the whitelist. Because there is a risk of breaking things when removing the link, and tagging references can lead to false positives, I thought about placing a tag on the top of the page, listing the links that it found. False positives can be reported to me, or an admin, who will modify a .js page in my userspace with an exception to be added or removed, that the bot will read before it edits the pages.—cyberpower Online 14:00, 3 July 2013 (UTC)
  • The script is now finished. Chris G. has the code and is reviewing it. The task will seek out blacklisted external links and tag the pages containing them. Exceptions can be added for specific cases and it reads the whitelist too.—cyberpower Online 15:33, 24 July 2013 (UTC)
Although this page states that blacklisted external links will be tagged with {{unreliable source}}, Misplaced Pages:Village_pump_(miscellaneous)#New_Bot states that they will be tagged with {{spam-links}}. Could you please clarify? Thanks! GoingBatty (talk) 23:38, 24 July 2013 (UTC)
Some changes were made since the filing of this BRFA. I have now amended the above.—cyberpower Offline 05:43, 25 July 2013 (UTC)

Review:

  • Try and avoid using gotos wherever possible. It makes code hard to read, and often leads to strange bugs. E.g. at line 86 instead of:
    if( empty($blacklistregexarray) ) goto theeasystuff;
    else $blacklistregex = buildSafeRegexes($blacklistregexarray);

You could have written:

    if( !empty($blacklistregexarray) ) {
           $blacklistregex = buildSafeRegexes($blacklistregexarray);
           <LINES 89 - 112>
    }

 Done all labels removed.—cyberpower Online 12:24, 25 July 2013 (UTC)

  • Line 13 - Why the while loop? Unless there is a continue I am missing somewhere it seems to just run once, and break at line #156

 Already done The break command was a remnant from the debugging period. It's removed now.—cyberpower Online 12:24, 25 July 2013 (UTC)

  • Line 36 - while str_replace should work 99% of the time, it would be best practice to use substr instead. e.g.:
substr($exception,strlen("page="))

 Done Missed this one.—cyberpower Online 12:44, 25 July 2013 (UTC)
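
A short illustration of why substr is the safer choice here: str_replace removes every occurrence of "page=", while substr only drops the leading prefix. The sample exception line and the helper name stripPagePrefix are made up for the example.

```php
<?php
// Made-up exception entry whose title happens to contain "page=" again.
$exception = 'page=Talk:page=example';

// str_replace strips every occurrence, mangling the title.
$viaReplace = str_replace('page=', '', $exception); // 'Talk:example'

// substr only skips the first strlen("page=") characters.
$viaSubstr = substr($exception, strlen('page=')); // 'Talk:page=example'

// Hypothetical helper: substr assumes the prefix is present, so check first.
function stripPagePrefix(string $line): ?string {
    $prefix = 'page=';
    if (strncmp($line, $prefix, strlen($prefix)) !== 0) {
        return null; // not a "page=" entry
    }
    return substr($line, strlen($prefix));
}
```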

  • Lines 127 - 131, you seem to be checking that the API hasn't returned a blank page? This should really be done at a framework level, not in the bot code. Basically you should check that the HTTP code == "200"; if it isn't, sleep for 1 second and try again. If it happens again, sleep for 2 seconds. And so on. But this should be done at the framework level, so you don't have to worry about it each time you use "$pageobject->get_text();" (in fact, it should be checked on all API queries)

 Already done You reminded me that I programmed that safeguard into the Peachy framework already. :p—cyberpower Online 12:24, 25 July 2013 (UTC)
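The retry scheme described in the review (wait 1 second, then 2, and so on) could look roughly like the sketch below. This is not the Peachy framework's code: fetchWithBackoff, the $fetch callable and the response array shape are all assumptions made for illustration.

```php
<?php
/**
 * Hypothetical framework-level wrapper.
 * $fetch performs one HTTP request and returns ['code' => int, 'body' => string].
 * $sleep is injectable so the wait can be replaced in tests; it defaults to sleep().
 */
function fetchWithBackoff(callable $fetch, int $maxAttempts = 5, ?callable $sleep = null): string {
    $sleep = $sleep ?? 'sleep';
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $response = $fetch();
        if ((int) $response['code'] === 200) {
            return $response['body'];
        }
        if ($attempt < $maxAttempts) {
            // Wait 1 second after the first failure, 2 after the second, and so on.
            $sleep($attempt);
        }
    }
    throw new RuntimeException("No HTTP 200 after $maxAttempts attempts");
}
```

Keeping the loop in one wrapper means callers such as a page-text fetch never need their own blank-response check.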

  • Bug at line 165 - "else return true;" I think you want "return true;" after the foreach loop. Otherwise it only checks one of the whitelisted links.
        if( preg_match($regex, $link) ) {
            foreach( $whitelistregex as $wregex ) {
                if( preg_match($wregex, $link) ) return false; 
                else return true;
            }
        }

vs.

        if( preg_match($regex, $link) ) {
            foreach( $whitelistregex as $wregex ) {
                if( preg_match($wregex, $link) ) 
                     return false; 
            }
            return true;
        }

 Fixed.—cyberpower Online 12:24, 25 July 2013 (UTC)
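For reference, the corrected fragment can be wrapped into a self-contained function to show the behaviour the fix is after. The function name, the outer loop over blacklist regexes and the sample patterns are assumptions; only the inner whitelist loop mirrors the snippet above.

```php
<?php
// Hypothetical wrapper around the corrected fragment from the review.
function isBlacklisted(string $link, array $blacklistRegexes, array $whitelistRegexes): bool {
    foreach ($blacklistRegexes as $regex) {
        if (preg_match($regex, $link)) {
            foreach ($whitelistRegexes as $wregex) {
                if (preg_match($wregex, $link)) {
                    return false; // any whitelist hit clears the link
                }
            }
            return true; // blacklisted and no whitelist entry matched
        }
    }
    return false; // no blacklist entry matched at all
}

// Invented patterns: the second whitelist entry is the one that matches.
$blacklist = ['/example\.com/i'];
$whitelist = ['/example\.com\/allowed/i', '/example\.com\/docs/i'];
var_dump(isBlacklisted('http://example.com/docs/page', $blacklist, $whitelist)); // bool(false)
```

With the original "else return true;" inside the loop, the same call would have returned true after testing only the first whitelist entry.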

  • General comment. Considering how many edits your bot is going to make, you should put a sleep(); somewhere in the code to make sure you don't hammer the servers. At the very least after each edit, if not every http request.

 Already done Framework has throttle.—cyberpower Online 12:24, 25 July 2013 (UTC)

  • lines 145ish - is it possible to get the page id in the same API request as you get the transclusions? That way instead of making 165,000+ API calls (for each page), you only make about 33 calls.

 Done.—cyberpower Online 12:24, 25 July 2013 (UTC)
--Chris 09:29, 25 July 2013 (UTC)
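The "about 33 calls" figure follows from batching: 165,000 pages at 5,000 results per request. The sketch below assumes the transclusions are fetched with the MediaWiki API's list=embeddedin module, which returns each page's pageid alongside its title, and that the account has the higher bot limit of 5,000 per request. The template name is a placeholder and both function names are hypothetical.

```php
<?php
// Build the query for one batch of transcluding pages. Each result carries
// pageid, ns and title, so no separate per-page ID lookup is needed.
function buildTransclusionQuery(string $template, ?string $continue = null): array {
    $params = [
        'action'  => 'query',
        'list'    => 'embeddedin',
        'eititle' => $template,
        'eilimit' => 'max', // assumed to resolve to 5000 for a bot-flagged account
        'format'  => 'json',
    ];
    if ($continue !== null) {
        // Continuation value returned by the previous batch.
        $params['eicontinue'] = $continue;
    }
    return $params;
}

// Number of requests needed to page through $pages results.
function requestsNeeded(int $pages, int $perRequest = 5000): int {
    return (int) ceil($pages / $perRequest);
}

echo http_build_query(buildTransclusionQuery('Template:Example')), "\n";
echo requestsNeeded(165000), "\n"; // 33
```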

  • AHHH. How did I not see that regex scan bug? D: Thanks for the input. I'll make the appropriate modifications now. I completely forgot that the framework was already designed to handle errors. :D
    Modifications finished.—cyberpower Online 12:44, 25 July 2013 (UTC)

Trial

Ok, we'll start with a small trial to make sure everything runs smoothly, and then we can move onto a much wider trial. Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. --Chris 10:57, 2 August 2013 (UTC)

It started out ok, but then something went horribly wrong and it started tagging pages with empty tags. I have terminated the bot at the moment and will be looking into what caused the problems.—cyberpower Online 12:46, 10 August 2013 (UTC)
Bug found. Bot restarted.—cyberpower Online 19:58, 10 August 2013 (UTC)
Trial complete. I haven't looked at the edits yet as it's currently the middle of the night right now.—cyberpower Offline 00:32, 12 August 2013 (UTC)

Even after the restart, 2 pages had blank tags added (1, 2). Also, maybe non-article pages should be skipped (3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17), unless there is some reason that these should have the links removed:Jay8g 01:18, 13 August 2013 (UTC)

Thank you. I am already looking into the bug. And am already working on excluding namespaces.
The bugs have been fixed. The exceptions list now supports entire namespaces.—cyberpower Online 12:14, 15 August 2013 (UTC)

Approved for extended trial (1000 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Although I would ask that you do them in batches (maybe 100 or 200 edits at a time) --Chris 10:16, 24 August 2013 (UTC)

  • The bugs seem to be fixed. However, upon reactivating the bot, it began tagging spam reports in the Misplaced Pages space. I have added the Misplaced Pages namespace into the exclusions list.—cyberpower Online 16:06, 24 August 2013 (UTC)
  • Comment - You may want to change the edit summary from "Tagging page with Spam-links" to "Tagging page with Template:Spam-links" to make it clear that the bot isn't adding links, but adding a template. Giving users a link to the template documentation may also help reduce the number of comments you get on your bot's talk page. GoingBatty (talk) 14:42, 26 August 2013 (UTC)
    Good idea.—cyberpower Online 15:01, 26 August 2013 (UTC)
  • Comment Frankly, these spamlink tags are massive, too invasive and even disturbing for the readers. They should have a more decent and reasonable size and position. Cavarrone 11:10, 28 August 2013 (UTC)
    The intent is to tackle spam links. They are supposed to be very noticeable.—cyberpower Offline 13:42, 28 August 2013 (UTC)
    I understand the intent very well, but the result is distracting and objectively disturbing for the readers, and consequently damaging to the encyclopedia, as the main intent of Misplaced Pages should be serving the readers, not punishing the spammers (at least not in such a way, affecting the whole article). With that size/position the template becomes almost the main topic of the article, and this is IMHO unacceptable. I am not even sure it harms them (if their intent is publicizing a website, I'm not sure they are disturbed by having their links at the head of an article). We have filters and whitelists to prevent further spamming of a blacklisted website, but marking/working on previous (possible) spam requires a finer and more accurate approach than this stuff here. I predict a lot of similar complaints when/if the bot works on a regular basis (e.g., do you have any idea how many articles use the only blacklisted New York Times as a source?). Cavarrone 15:23, 28 August 2013 (UTC)
    This is merely a starter template. If people want it changed, then consensus will change.—cyberpower Limited Access 17:24, 28 August 2013 (UTC)
  • Cyberpower, you have a serious bug. None of the links on Access2Research are blacklisted. It reports on a petition rule, but the rule is more specific than just the word 'petition'; the actual rule catches only a couple of domains, not all links with that term. See the local and meta talkpages of the blacklists for requests regarding petitions ... Note that the links there are actually saved, so they are not blacklisted ... --Dirk Beetstra 18:43, 28 August 2013 (UTC)
    The petition rule doesn't seem specific to me. I hardly call \bpetition(?:online|s)?\b specific. It's not a bot bug.—cyberpower Online 19:04, 28 August 2013 (UTC)
those links are not blacklisted, I just managed to save a page with one of the links ... --Dirk Beetstra 21:20, 28 August 2013 (UTC)
If they exist already, they won't be blocked. Also, I have found that the filter only partially enforces the regex list. Sometimes it blocks links with petition in it and other times it doesn't. The regex generator is the same as MediaWiki's extension. The validation process of these links is identical. If it's really a bug, then it's a bug with PHP.—cyberpower Online 22:03, 28 August 2013 (UTC)
See diff --Dirk Beetstra 21:23, 28 August 2013 (UTC)
It is also tagging links to Google Books with search strings containing "forbidden words" like "petitions". That's not a spam link and it does not trigger the spam filter either, because it isn't forbidden in that context. I'll also second that the box is too large and overwhelming. Slp1 (talk) 01:53, 29 August 2013 (UTC)

Cyberpower678 can you stop the trial until we sort out the above. --Chris 02:00, 29 August 2013 (UTC)

Per my talk page discussion, it has been brought to my attention that these issues are indeed a bot issue, not because it's bugged, but because it's running outdated code. I seem to have downloaded an outdated version of the extension. I'll make the modifications in the next few days. My bot has been shut down for quite some time now. I still recommend that petition regex be removed. There's no need for it. As for the spam links template, that can be fixed later as it's not crucial to the bot's operation.—cyberpower Online 02:05, 29 August 2013 (UTC)
I am glad that you have found the problem, but I can't say I agree that the template is not crucial to the bot's operation or that it can be fixed later. This is an encyclopedia and the template (rather than the code) is what our readers and editors see and use. The prior template was inappropriately large and overwhelming; it talks about "spam links" which is not the case and uses the term "external links" in a way that is not consistent with WP's definition at external links. At least on the William Wilberforce page, the promised list of "problematic links" was not shown, which meant we had to dig to try and figure out what the (non)problem was. Given the fact that the bot is in a trial stage and is making mistakes, it seems to me that it would be better to provide information about where to report errors in tagging, rather than the current formulation which suggest that the bot can do no wrong, that the article and its editors (or the blacklist) are at fault, and that they need to figure out the problem and act on it or the bot will be back. That's not the case, and feedback needs to be given, received and then acted upon with a positive spirit. Slp1 (talk) 11:54, 29 August 2013 (UTC)
I agree with you completely. The problem was that I was convinced that there was no issue with the code, that the regex is generated by the blacklist extension itself, that the bot simply validates the regex against the blacklist. I was right in every aspect. Where I was wrong was that I was running an out of date version of the extension. The newer version has a more refined regex generator. The template layout should be decided by the community. I merely created something for the bot. I'm not going to force this template on the community if they don't want it.—cyberpower Offline 12:47, 29 August 2013 (UTC)
To expand on what I said, I meant not crucially in fixing bugs in the bot that may cause mistags. I do agree that the template will need to be fixed and adjusted to reflect consensus, but now I am only concerned in making sure the bot operates correctly.—cyberpower Online 14:21, 29 August 2013 (UTC)
Yeah, but the bot operating correctly comprises not just the technical aspects of a bot etc. but also the surface and interface aspects of the template, as well as the giving and receiving of feedback between the community and the bot operator. I understand that you did not think that there was a mistake with the bot, but the fact is that there was a problem. It isn't the end of the world and it isn't a question of fault, because everybody makes mistakes, and mistakes are a good opportunity for learning and growth. But what I am a little concerned with is that you might be a little bit too interested in getting the code right and not interested enough in the interactional component of bots with the community. A bunch of people told you it was making mistakes and you just said "no, it is doing what it is supposed to do", when it wasn't. You even said that to one person after you'd found there was a problem. A number of articles are still tagged inappropriately. How about cleaning them up? And it doesn't seem like you are taking seriously the comments about the template because you want to do it "later" and are even apparently planning another run without working on this. What happens if there is another bug that you didn't realize was there? I realize you don't think there is, but that's what you thought the last time, wasn't it? Personally I don't think you should add any more tags until the erroneous ones are removed and the concerns about the template addressed. A bot is a whole package, not just the coding. Slp1 (talk) 17:43, 29 August 2013 (UTC)
No offense, but I think you are twisting things out of proportion. There are people who will claim a link is not blacklisted when in fact, it is. The bot is supposed to keep tagging pages until it no longer sees them as blacklisted. No, I do not run the bot if I see a bug surfacing. That's why it's been off. The next run is supposed to test the removal of tags containing links not blacklisted. I have indeed fixed the regex scanner. I am aware that the tag is important, but how it's supposed to look is not my say but the community's, so I won't take initiative. Feel free to modify the tag so it still works, but looks "better". If there is a visible bug, the bot gets shut down immediately. The petition problem was not because the code was buggy, but because the regex generator was running on the wrong version. I take comments very seriously, and I find the remark that I don't mildly offensive. You have essentially just called me incompetent.—cyberpower Offline 18:33, 29 August 2013 (UTC)
Cyberpower, you say that you take comments very seriously. You created and are the major contributor for the template which was apparently specifically designed with this bot in mind. You took the initiative in creating it, decided how it was going to look then, and now it would be great if you would follow up by considering the community's good faith comments above (and on the template talk) and modifying it. That would show that you do indeed take comments seriously, as opposed to passing on the responsibility "to the community". Removing the incorrect tags from articles asap, manually if need be, would also show that you take your responsibilities as a bot operator seriously. I am sure that you are not incompetent, but it seems that you may not understand the degree of frustration and discouragement such bots can cause to good faith (and perhaps technically clueless) content editors when articles are incorrectly tagged (sometimes repeatedly). This talk page posting from another editor makes this point well. Anyway, with this I leave you and the rest of the bot experts to your work. Slp1 (talk) 19:06, 29 August 2013 (UTC)
  • I have updated the regex scanner. It should now mirror Misplaced Pages's blacklist filter when scanning regexes. The change will go into effect on the next run.—cyberpower Online 14:23, 29 August 2013 (UTC)
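The false positives discussed in this thread can be reproduced with the quoted rule alone. The sketch below compares the fragment applied on its own with the same fragment inside a host-anchored wrapper. The wrapper is an assumption modelled on the general idea of tying a blacklist entry to the scheme-and-host part of a URL; it is not claimed to be the exact expression generated by the blacklist extension in either version. Both sample URLs are invented.

```php
<?php
// Blacklist fragment as quoted in the discussion.
$fragment = '\bpetition(?:online|s)?\b';

// Fragment applied on its own: matches anywhere in the URL, query string included.
$bare = '/' . $fragment . '/i';

// Assumed host-anchored wrapper: the fragment must start within the host name.
$anchored = '/https?:\/\/+[a-z0-9_\-.]*(' . $fragment . ')/i';

$bookSearch   = 'http://books.google.com/books?id=abc&q=petitions';
$petitionSite = 'http://www.petitiononline.com/example';

var_dump(preg_match($bare, $bookSearch));       // int(1): search term is caught
var_dump(preg_match($anchored, $bookSearch));   // int(0): host does not match
var_dump(preg_match($anchored, $petitionSite)); // int(1): domain is caught
```

Under this assumption, a book search for "petitions" is only flagged by the unanchored form, which is consistent with the reports that such links could still be saved.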

I am going on Wikibreak for awhile. As I will not be around to monitor the trial, please stop the trial, and wait for another BAGer to approve and monitor it. --Chris 11:25, 2 September 2013 (UTC)

More blank tags were added (1, 2), both in namespaces where no tags should have been added:Jay8g 03:04, 3 September 2013 (UTC)

I saw it, and already fixed the issue. Thank you.—cyberpower Online 03:17, 3 September 2013 (UTC)
I will commence the last ~50 edits sometime tomorrow. This run will test to see if the good links are no longer tagged. This will verify that the exceptions list, whitelist, and the new regex scanner are working correctly.—cyberpower Online 16:47, 3 September 2013 (UTC)
Cyberpower. Did you not see that Chris G has told you to stop the trial for now? See his post of 11:25, 2 September 2013 (UTC). Slp1 (talk) 17:27, 3 September 2013 (UTC)
Oops. I forgot to mention that addshore has volunteered to take over this trial and has allowed me to resume it.
It is probably best to ask addshore to come here and do so officially, don't you think?Slp1 (talk) 18:10, 3 September 2013 (UTC)
I do. I'll ask him to post here before proceeding.—cyberpower Limited Access 18:17, 3 September 2013 (UTC)
I see no harm in completing the last 50 edits! :) ·addshore· 11:17, 10 September 2013 (UTC)
Trial complete. I accidentally went a bit over. I will post my results and fixes shortly.—cyberpower Online 14:55, 11 September 2013 (UTC)

If possible, it might be a good idea to un-exclude "Misplaced Pages talk:Articles for creation/" prefix pages, as the intention is that they would become articles:Jay8g 02:11, 12 September 2013 (UTC)

puuuh. That's going to be tricky. I'd have to exclude the Misplaced Pages talk namespace and add each Misplaced Pages talk page manually.—cyberpower Limited Access 11:17, 12 September 2013 (UTC)
On second thought, there shouldn't be any blacklisted links on those AfC pages, since the filter should be filtering them out.—cyberpower Offline 13:52, 12 September 2013 (UTC)

Post-Trial

I am currently generating the post trial report for a BAGger and everyone else to review.—cyberpower Offline 13:53, 12 September 2013 (UTC)

The following are the bugs and issues brought up during the extended trial, with the status of each:

  1. {{Spam-links}} should be renamed to a more neutral name:  Done Renamed to {{Blacklisted-links}}. Bot change will go into effect upon approval or new trial.
  2. Tag template used incorrect terminology:  Fixed
  3. {{Blacklisted-links}} is too large in size and disruptive to the readers:  Resolved template is collapsed by default with a one line notice.
  4. {{Blacklisted-links}} is not showing the list of blacklisted URLs, in some cases, despite it being present in the Wikimarkup:  Fixed
  5. Bot has been tagging links that shouldn't be tagged.  Fixed Regex scanner was not up to date.
  6. Bot is tagging wrong namespaces.  Resolved Any administrator can alter the settings on the exceptions page.

If I have missed any error reports that should have been addressed, please let me know.—cyberpower Online 14:50, 12 September 2013 (UTC)

Approved for extended trial (100 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Quick trial to see any more issues. Article namespace. —  HELLKNOWZ  ▎TALK 19:55, 17 September 2013 (UTC)

Second Post-Trial

The contributions from the trial can be found here.

Number of edits performed: 141

Bugs found: 0

Issues found: 1

  1. When the bot went through to de-tag invalid tags, because the tag was renamed, several pages did not get returned by the API. This left invalid tags in place.
    Solution: I have fixed this issue by writing a quick script to remove all tags under the old name. This should not be an issue for future runs as they will all be transcluded under the new name.

Per this and this, I can confirm the cleanup script did its job and that only valid tags are present, in the article namespace only. The bot successfully detected and removed several petition false-positives since the regex scanner has been updated. I also recommend that the {{Spam-links}} tag be deleted at this point.—cyberpower Online 01:57, 22 September 2013 (UTC)

Thinking about this and looking at the edits. What happens when the offending link is removed from the page? Would the bot come back and remove the tag if it wasn't removed like here? Or if the link is no longer blacklisted? Otherwise, edits look good. —  HELLKNOWZ  ▎TALK 14:06, 22 September 2013 (UTC)

The bot will remove the tag if the link in question is either, white listed, removed from the blacklist, added to the exceptions list, or removed from the page.—cyberpower Online 17:33, 22 September 2013 (UTC)
Right, I should've looked more carefully. —  HELLKNOWZ  ▎TALK 17:38, 22 September 2013 (UTC)
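The four removal conditions listed above can be read as a single filter over the links currently on the page: the tag stays only while at least one link survives it. This is an illustrative sketch, not the bot's implementation. The function names are hypothetical, and treating exceptions as exact link strings is an assumption (the real exceptions list also supports pages and namespaces).

```php
<?php
// True if $link matches at least one of the given regexes.
function matchesAny(string $link, array $regexes): bool {
    foreach ($regexes as $regex) {
        if (preg_match($regex, $link)) {
            return true;
        }
    }
    return false;
}

// Returns the links that still justify the tag; an empty array means "remove the tag".
function linksStillFlagged(array $pageLinks, array $blacklist, array $whitelist, array $exceptions): array {
    $flagged = [];
    foreach ($pageLinks as $link) {
        if (in_array($link, $exceptions, true)) continue; // added to the exceptions list
        if (!matchesAny($link, $blacklist)) continue;     // not, or no longer, blacklisted
        if (matchesAny($link, $whitelist)) continue;      // whitelisted
        $flagged[] = $link;
    }
    return $flagged; // a link removed from the page never appears in $pageLinks
}
```

Recomputing the list on every run covers both the case where the link was removed from the page and the case where the lists themselves changed.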

 Approved. Trials look good, no problems that I see, all issues seem to be reasonably resolved. BRFA open for quite a while, lots of edits, no more comment received. Would be good if you made a page explaining what to do after a page is tagged and link it from the edit summary. —  HELLKNOWZ  ▎TALK 17:38, 22 September 2013 (UTC)

The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.