Twitter’s latest robo-nag will flag “harmful” language before you post

Before you tweet, you might be asked if you meant to be so rude.

Want to know exactly what Twitter’s fleet of text-combing, dictionary-parsing bots defines as “mean”? Starting any day now, you’ll have instant access to that data—at least, whenever a stern auto-moderator says you’re not tweeting politely.

On Wednesday, members of Twitter’s product-design team confirmed that a new automatic prompt will begin rolling out for all Twitter users, regardless of platform and device, that activates when a post’s language crosses Twitter’s threshold of “potentially harmful or offensive language.” This follows a number of limited-user tests of the notices beginning in May of last year. Soon, any robo-moderated tweets will be interrupted with a notice asking, “Want to review this before tweeting?”

Earlier tests of this feature, unsurprisingly, had their share of issues. “The algorithms powering the [warning] prompts struggled to capture the nuance in many conversations and often didn’t differentiate between potentially offensive language, sarcasm, and friendly banter,” Twitter’s announcement states. The news post clarifies that Twitter’s systems now account for, among other things, how often two accounts interact with each other—meaning, I’ll likely get a flag for sending curse words and insults to a celebrity I never talk to on Twitter, but I would likely be in the clear sending those same sentences via Twitter to friends or Ars colleagues.

Read 8 remaining paragraphs | Comments