Usenet stats: User Agent and Sentiment

Take away

  • Most everyone posts with G2 or Mozilla (Thunderbird)
  • Old newsreaders (Xnews, newsSync, VSoup, MicroPlanet-Gravity, 40tude_Dialog) are still somewhat popular – despite development having stopped on many.
  • Russian-language sentiment appears positive when stemmed and scored as English. German sentiment appears negative.
  • Selling things and some support groups (stop-smoking) are on the most positive end
  • Obscenities make `testing` and `dev` groups score negative, as do some individual, prolific, unhappy posters.

Motivation

In researching self-hosting Discourse, I saw there's an old plugin to sync to nntp. That was surprising enough to add usenet to lemmy, mastodon, nostr, and ssb – the list of communities/protocols I have no need for but am still interested in exploring.

After tinkering for a bit, I've two questions

  1. I set up pan, gnus, and slrn to experiment. But what user agent is most popular?
  2. Finding useful groups is challenging. Where are the people relevant to me?

Both of these look like they can be answered with data!

Stats already available online:

  • http://top1000.anthologeek.net/#todo
  • http://www.eternal-september.org/postingstats.php

Download

I used slrnpull, but leafnode might have also worked.

Config

Hierarchy list: slrn

The first run of slrn creates $HOME/.jnewsrc with a list of all hierarchies. This will be useful for slrnpull.

eternal-september requires authentication, so .slrnrc looks like

set force_authentication 1
nnrpaccess "news.eternal-september.org" "xxxx" "yyyyy"

And first pull

slrn -h news.eternal-september.org  -f /home/foranw/.jnewsrc --create

Articles: slrnpull

I initially put the full (27k+) group list in slrnpull.conf, but eternal-september disconnects after the first 300 [1]. Conveniently, .jnewsrc also stores an article count. That's easy to sort on and probably a good way to limit the pull to active groups.

Unfortunately, the decently active and interesting group I've pegged as a reference (alt.fan.usenet) is ranked just after 4000 by post counts.

echo default 200     365 > /var/spool/news/slrnpull/slrnpull.conf
perl -lne 'print if s/^(.*)[!:].*-(\d+)$/$2\t$1/' ~/.jnewsrc|
  sort -nr|grep -v sex |head -n 4100 | cut -f 2 >> /var/spool/news/slrnpull/slrnpull.conf

eternal-september.org requires authentication. I copied slrn's auth to slrnpull's authinfo

sed 's/"//g' .slrnrc |
  awk '(/nnrpaccess/){print $3 "\n" $4}' \
  > /var/spool/news/slrnpull/authinfo

Data

Pulling

slrnpull -h news.eternal-september.org
# 300 groups
  Time: 01:04:23, BPS: 70284
08/19/2023 12:57:18 A total of 271509221 bytes received, 1035863 bytes sent in 3879 seconds.

# 4100 groups
Time: 09:21:24, BPS: 74419
08/20/2023 01:52:36 A total of 2506742895 bytes received, 5781052 bytes sent in 34098 seconds.
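
As a sanity check on those logs, the BPS figure appears to be bytes received divided by the elapsed `Time:` value. A small sketch (the `bps` helper name is mine, not part of slrnpull):

```shell
# bps BYTES HH:MM:SS -> whole bytes per second
# awk is used for the arithmetic so leading zeros like "09" are not read as octal.
bps() {
  awk -v bytes="$1" -v t="$2" 'BEGIN {
    split(t, p, ":")
    secs = p[1] * 3600 + p[2] * 60 + p[3]
    printf "%d\n", int(bytes / secs)
  }'
}

bps 271509221  01:04:23   # 300 groups
bps 2506742895 09:21:24   # 4100 groups
```

This prints 70284 and 74419, matching the logged BPS values. The "in N seconds" totals on the summary lines are slightly longer than the `Time:` values, so dividing by those gives a slightly lower rate.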

Dataset

The directory structure with .minmax files consumes ~100MB!

time find news/ -type f -not -name '.*'  |wc -l
59618
real    6m55.229s
time du -hcs ./
3.2G    total

Parse

I've extracted headers with a perl script, used Lingua::Stem::Snowball to stem the article's text, and used GNU parallel [2] to execute on multiple cores. The resulting tab-separated file has one row per message and compresses well.

time \
  find news/ -mindepth 2 -type f -not -name '.*'  |
  parallel -j 3 --xargs ./article_tsv.pl |
  gzip -c > news/all_articles.tsv.gz
real    122m32.779s
 du -h news/all_articles.tsv.gz
373M    news/all_articles.tsv.gz
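
article_tsv.pl isn't shown here. As a rough idea of the header-extraction step only (no stemming, no body), here is a minimal awk sketch. This is not the actual script: the sample article, the `headers_tsv` name, and the choice of three headers are all assumptions for illustration.

```shell
# Hypothetical article; the header names are standard, the values are made up.
sample_article() {
  cat <<'EOF'
From: Some One <someone@example.invalid>
Organization: Example Org
User-Agent: slrn/1.0.3 (Linux)
Message-ID: <abc@example.invalid>

Body text starts here.
User-Agent: not a header
EOF
}

# Print From, Organization, and User-Agent as one tab-separated row.
# Reading stops at the first blank line, so body text is never taken for a header.
# Folded (multi-line) headers are not handled in this sketch.
headers_tsv() {
  awk '
    /^$/ { exit }
    {
      i = index($0, ": ")
      if (i > 0) h[tolower(substr($0, 1, i - 1))] = substr($0, i + 2)
    }
    END { printf "%s\t%s\t%s\n", h["from"], h["organization"], h["user-agent"] }
  '
}

sample_article | headers_tsv
```

One row per file is what makes the `find | parallel` pipeline easy to concatenate into a single TSV.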

Sharing

ln -s all_articles.tsv.gz usenet_group-41000_messages-200_date-20230819.tsv.gz 
rhash --magnet --btih  usenet_group-41000_messages-200_date-20230819.tsv.gz

Stats

library(dplyr)
d <- data.table::fread('/mnt/ttb/news/all_articles.tsv.gz', quote="",
                       col.names=c("folder","date","org","from",
                                   "agent_full","path","id","body")) |>
     mutate(date=lubridate::ymd_hm(date),
            # remove version number
            agent=gsub('[ /(:].*','',agent_full),
            # extract from within <>: Name <email@host.com>
            #email=stringr::str_extract(from,'(?<=<)[^>]+'),
            # or ( blah@x.com )
            email=stringr::str_extract(from,'[a-zA-Z0-9.!#$%&*+-/=?^_`{|}~]+@[^ )>]+'),
            top=stringr::str_extract(folder,'(?<=news/)[^/]*'))
# only look at 2023 (ending in 20230820)
d2023 <- d |> filter(date >= "2023-01-01")

n_messages_2023 <- nrow(d2023)

We have 166614 articles from 2023.
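
The `agent` column comes from cutting the full user-agent string at the first space, slash, open paren, or colon. The same pattern can be tried from the shell with sed. In this sketch the `agent_name` helper is hypothetical and the first two sample strings are typical-looking examples rather than rows from the data; the VSoup string does appear in the dataset.

```shell
# Drop everything from the first space, "/", "(" or ":" onward,
# mirroring gsub('[ /(:].*','',agent_full) in the R code.
agent_name() {
  sed -E 's#[ /(:].*##'
}

printf '%s\n' \
  'G2/1.0' \
  'Mozilla/5.0 (X11; Linux x86_64)' \
  'VSoup v1.2.9.47Beta [95/NT]' | agent_name
```

This prints G2, Mozilla, and VSoup. The trade-off is that anything identifying itself as "Mozilla/..." is lumped together under one name.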

User Agent

G2 (Google Groups) and Mozilla (Thunderbird) are an order of magnitude above other clients.

Mozilla users post more often than Google users on average (13.5 vs 9.2 posts per user), though median might be a better stat than mean.

agents_allposts <- d2023 |> count(agent, name='n_posts') |> arrange(-n_posts)
agents_from     <- d2023 |> filter(date >= "2023-01-01") |> count(from, agent) |>
  count(agent, name='n_from') |>
  arrange(-n_from)

agents <- inner_join(agents_allposts,agents_from) |>
     mutate(user_posts=round(n_posts/n_from,1)) |> arrange(-n_from)
agents  |> head(n=20)
agent                n_posts  n_from  user_posts
G2                     58739    6406         9.2
(none)                 41694    4630         9
Mozilla                31724    2348        13.5
Xnews                   2754     864         3.2
newsSync                 510     426         1.2
ForteAgent              3606     237        15.2
slrn                    2083     219         9.5
Pan                     1778     161        11
Gnus                    1515     158         9.6
NewsTap                 1751     126        13.9
tin                     1537     107        14.4
Mime                     142      89         1.6
VSoup                    379      87         4.4
Evolution                644      78         8.3
Dolbo                    121      59         2.1
Mutt                     357      50         7.1
MicroPlanet-Gravity     1202      49        24.5
40tude_Dialog           1119      46        24.3
Usenapp                  449      45        10
Unison                   232      41         5.7
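
On the mean vs median point: a handful of prolific posters can pull the mean well above what a typical user does. A small sketch with made-up per-user post counts (the `mean_median` helper and the numbers are hypothetical, not taken from the data):

```shell
# Read one number per line; print the mean (1 decimal) and the median.
mean_median() {
  sort -n | awk '
    { v[NR] = $1; sum += $1 }
    END {
      if (NR == 0) exit
      # middle value for odd counts, average of the two middle values for even
      med = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
      printf "mean=%.1f median=%g\n", sum / NR, med
    }'
}

# Five hypothetical users: four occasional posters and one very busy one.
printf '%s\n' 1 1 2 3 60 | mean_median
```

This prints `mean=13.4 median=2`: one heavy poster is enough to make the group look far chattier than most of its members are.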

The second most popular user-agent is none – missing in the header. These look like they come from lists and scripts.

d2023 |>filter(agent=="") |> count(email,path,org) |> arrange(-n) |> head()
email                                      path                               org                              n
doctor@doctor.nl2k.ab.ca                   .POSTED.doctor.nl2k.ab.ca          NetKnow News                     2086
remailer@domain.invalid                    mail2news                                                           1586
racist_queer_democrat_paedophiles@now.org  mail2news                                                           836
bugzilla-noreply@freebsd.org               .POSTED.news.muc.de                Newsgate at muc.de e.V.          790
disciple@T3WiJ.com                         news.eternal-september.org         A noiseless patient Spider       702
ftpmaster@ftp-master.debian.org            <envelope@ftp-master.debian.org>   linux.* mail to news gateway     666

d2023 |>filter(agent=="",is.na(email)) |> count(folder,name="n_noagent_noemail") |> arrange(-n_noagent_noemail) |> head()
folder                         n_noagent_noemail
news/soc/culture/korean                      188
news/junk                                    129
news/alt/bbs/synchronet                       86
news/alt/online-service/webtv                 46
news/comp/mail/sendmail                       35
news/alt/politics/uk                          27

By top level group

Do different audiences have specific client preferences?

Yes. Or maybe user agents are just a proxy for spam.

Here we're looking at the top 4 user agents (by unique posters) within each top-level hierarchy. slrn and Gnus make the top 4 cut in comp.* and news.*, slrn also sneaks in for sci.*, and Gnus accounts for nearly 1/10 of sfnet posters.

library(tidyr)
agents_top <- d2023 |> filter(agent!="") |>
   count(email, agent, top) |>
   group_by(top, agent) |> summarise(n_user=length(unique(email))) |>
   group_by(top) |> arrange(-n_user) |>
   mutate(rank=1:n(), percent=sprintf("%.0f%%",n_user/sum(n_user)*100))

a_order <- agents_top %>% group_by(agent) %>%
           summarise(srank=sum(n_user)) %>% arrange(-srank) %>%`[[`('agent')
# not strictly the Big 8: adds alt and sfnet, omits rec and humanities
big8 <-  c("comp","alt","sfnet","misc","sci", "news", "soc", "talk")
N_top <- d2023 |> filter(top %in% big8, agent!="") |> count(top, name="TOTAL")

agent_wide <- agents_top %>%
   filter(rank<=4, top %in% big8) %>%
   mutate(agent=factor(agent,levels=a_order)) %>%
   select(-rank, -n_user) %>%
   spread(agent, percent, fill="0")

merge(N_top,agent_wide) %>% arrange(-TOTAL)
top    TOTAL  G2   Mozilla  Xnews  ForteAgent  slrn  Gnus  VSoup
alt    38432  52%  16%      10%    4%          0     0     0
comp   11285  62%  18%      0      0           3%    3%    0
soc     7679  77%  9%       5%     2%          0     0     0
sci     4619  62%  17%      0      4%          2%    0     0
misc    2673  54%  19%      8%     4%          0     0     0
talk    1964  47%  21%      8%     7%          0     0     0
news     665  35%  17%      0      0           9%    7%    0
sfnet    422  41%  34%      0      0           0     9%    6%

VSoup

What is VSoup?! Google isn't any help. It has an OS/2 version!?

d2023 %>% filter(agent=='VSoup') %>% count(top,agent_full) %>% spread(top,n)
agent_fullaltcompmiscnzrecsfnet
VSoup v1.2.9.47Beta [95/NT]2965211110
VSoup v1.2.9.48Beta [OS/2]72
d2023 %>% filter(agent=='VSoup') %>% count(email) %>% summarise(max_vsoup_posts=max(n),med_vsoup_posts=median(n), n_emails=n())
max_vsoup_posts  med_vsoup_posts  n_emails
             44                2        87

sentiment

I score sentiment on individual stemmed words, using the AFINN valence list from Finn Årup Nielsen. AFINN ranks a subset of English words from -5 (negative) to +5 (positive). I average all the scored words within the subject + body of a message for a single value per article.

library(tidytext)
#nnc <- get_sentiments("nrc") # has dimensions, eg. "joy"
afn <- get_sentiments("afinn") # -5 neg to +5 positive

# match stemming from perl
afn_stem <- afn |> mutate(word=SnowballC::wordStem(word,language="en")) |> group_by(word) |> summarise(value=mean(value))



# could do a giant merge afn to body split
# but takes too much RAM
# hash lookup with list name should be fast enough alt to merge
afn_lookup <- as.list(afn_stem$value) |> `names<-`(afn_stem$word)
body_stats <- function(body){
   body <- stringr::str_split(body,' ', simplify=T)
   vals <- unlist(afn_lookup[body])
   adj <- 0
   if(length(vals)==0L) {
       vals <- c("NA"=0)
       adj <-1 
   }
   data.frame(n_words=length(body),
              afn=mean(vals),
              afn_sd=sd(vals),
              words_scored=length(vals)-adj,
              body=paste(names(vals),vals,sep=":",collapse=" "))
}
# replaces body with scored words only
afn_score <- d2023 |> 
  select(top, folder, agent, email, body) |>
   # strip the spool prefix from the folder name
   mutate(folder=gsub('^news/','',folder)) |>
   rowwise() |>
   mutate(body_stats(body))

# gzfile() so the output is actually compressed, not just named .gz
write.csv(afn_score,file=gzfile("afn_score.csv.gz"),row.names=F,quote=F)
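
The per-article score is just a lookup and a mean. Here is a minimal awk sketch of that idea, assuming a tab-separated lexicon file of stem and valence. The `afinn_mean` name and the file layout are my assumptions; the three stems and their values are the ones that show up in the soc.culture.scottish breakdown further down. Like `body_stats`, it falls back to 0 when no word is scored.

```shell
# afinn_mean LEXICON < text
# Mean valence of the whitespace-separated words that appear in LEXICON.
afinn_mean() {
  awk -v lex="$1" '
    BEGIN {
      # load "stem<TAB>valence" pairs into a lookup table
      while ((getline line < lex) > 0) {
        split(line, f, "\t")
        val[f[1]] = f[2]
      }
    }
    { for (i = 1; i <= NF; i++) if ($i in val) { sum += val[$i]; n++ } }
    END {
      if (n > 0) {
        printf "scored=%d mean=%.2f\n", n, sum / n
      } else {
        print "scored=0 mean=0.00"
      }
    }
  '
}

lex=$(mktemp)
printf 'tortur\t-4\nkill\t-3\ndeath\t-2\n' > "$lex"
echo 'the kill and death count' | afinn_mean "$lex"
rm -f "$lex"
```

Only `kill` and `death` are in the lexicon, so the example prints `scored=2 mean=-2.50`. Unscored words don't dilute the mean, which is why the R code tracks `words_scored` separately from `n_words`.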

per group

  • The most positive place on usenet in 2023 looks like fido7/ru/home. Russian-language groups take the top 3 (afn=2.3-2.5). I've since removed them: English valence for stemmed Russian words is likely not meaningful.
  • nice to see a supportive place looking positive: alt.support.stop-smoking
  • posts selling things are positive. What a nice financial incentive for being optimistic (tor/forsale, phl/forsale, chi/forsale, van/forsale, alt/forsale)
  • Windows makes two appearances in the top 20 (edit: it/comp/os/win/windows10 fell out after update, afn=1.5). I guess being held hostage by your OS engenders some fraternal empathy.

    • similar thing for alt.alien.visitors?
  • Groups with non-English articles shouldn't be included.

NB. I capped my pull to 200 articles per group.


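# afn_score strips the leading 'news/' from folder, so do the same here for the join
n_articles <- d2023 |> mutate(folder=gsub('^news/','',folder)) |> count(folder,name="n_articles")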

afn_folder_smry <- 
  afn_score |> group_by(folder) |>
  summarize(
    afn_wt=mean(words_scored/n_words*afn),
    across(c(n_words,words_scored), sum),
    afn=round(mean(afn),2),
    wrd_article=round(n_words/n(),1),
    mean_sd=round(mean(afn_sd,na.rm=T),2),
    n_email=length(unique(email))) |>
  inner_join(n_articles)

afn_folder_smry |>
  filter(n_email>=8, n_articles>10, !grepl('/ru/',folder)) |> arrange(-afn) |>
  select(folder,afn,n_email,n_articles,wrd_article,mean_sd) |>
  head(n=20)
folder                          afn   n_email  n_articles  wrd_article  mean_sd
alt/alien/visitors              1.99        8         200       2055.4     0.68
alt/books/larry-niven           1.86       36         104        125.6     1.23
tor/forsale                     1.71       28          38         35.1     0.85
rec/games/go                    1.67       59          83         79.6     1.83
houston/forsale                 1.66       14          16         49       1.4
phl/forsale                     1.65       27          28         36.6     1.13
chi/forsale                     1.64       67          80         50.4     0.98
it/sport/calcio/fiorentina      1.62       17         200         67.5     1.18
alt/flashback                   1.6        48          81         55       0.88
alt/support/stop-smoking        1.6        13          43         38.7     1.48
alt/fan/fratellibros            1.59        8          28         55.2     1.64
it/discussioni/commercialisti   1.56       44         200         64.7     1.22
soc/culture/occitan             1.54       58          84         67.3     1.32
van/forsale                     1.54       30          33         40.1     0.96
atl/forsale                     1.53       44          55         61.5     1.31
it/sport/formula1               1.52       17         200         66.4     1.39
alt/locksmithing                1.51        8          12         40.9     0.75
it/comp/os/win/windows10        1.5        46         200         65.4     1.9
nyc/forsale                     1.48       91         104         48.2     1.14
aioe/news/assistenza            1.46       23         106         55.9     1.46

negative

A kill file would probably change this a lot. soc.culture.scottish and *.webtv have a few spammy/tortured individuals in groups without many other posters to suppress the noise.

  • I removed "test" groups. Those came out as most negative. I'd hoped 'test' had negative valence, but it's not even in AFINN. Obscenities and racial epithets are, though, and they have the most negative values.
  • huuhaa is a Finnish group
  • Äffle und Pferdle (monkey and horse) is a German cartoon played between commercials? Hopefully a language scoring issue and not an especially negative place.
  • In the opposite of the smoking support above, fat-acceptance is scored negatively.
  • I guess Buffalo Bills fans (all 9 of them) are not a happy bunch
  • alt.crime's no surprise, but not b/c of racist obscenities! The most popular negative words are evil(-3), torture(-4), charge(-3), and crime(-3)
  • Scottish culture? Includes a lot of torture, kill, death
  • webtv in 2023?

    • euthanasia drugs!? Lots of other very upset posts (re: child trafficking?)

library(stringr)
afn_score$body[grepl("soc/culture/scottish$",afn_score$folder)] %>%
  str_split(" ",simplify=T) %>%
  str_split(":") %>%
  Filter(f=\(x) length(x)==2L)  %>%
  lapply(\(x) data.frame(w=x[1],v=as.numeric(x[2]))) %>%
  bind_rows() %>% count(w,v) %>% mutate(score=v*n) %>%
  arrange(score) %>%
  head()
w         v    n    score
tortur   -4  574    -2296
kill     -3  244     -732
death    -2  276     -552
victim   -3  165     -495
abus     -3  137     -411
useless  -2  202     -404

afn_folder_smry |>
  filter(n_email>=8, n_articles>10,
        !grepl('/pl/|/fr/|geschn|tratsch|/de/|/pa/|/dk/|/in/|spanish|ttiili|german|/nl/|/be/|test$|dev$',folder)) |>
  arrange(afn) |>
  select(folder,afn,n_email,n_articles,wrd_article,mean_sd) |>
  head(n=20)
folder                                  afn    n_email  n_articles  wrd_article  mean_sd
alt/aeffle/und/pferdle                  -2.39       12          74        104.1     0.86
sfnet/huuhaa                            -1.39       10         199         83.5     1.17
alt/games/microsoft/flight-sim          -1.23        8         200        212.2     2.18
it/news/net-abuse                       -1.16       21          99        307.4     1.79
alt/lawyers                             -1.05       24          41        465.3     1.97
ca/driving                              -1.05       19          23        219.5     1.66
alt/tv                                  -1.04        9         200         89       1.67
alt/business/accountability             -1.01       10         200         78       1.61
alt/online-service/webtv                -0.95       14          73        186.4     1.38
alt/sports/football/pro/buffalo-bills   -0.95        9         191         71.9     1.86
soc/culture/scottish                    -0.94        8         139        993.1     1.64
alt/crime                               -0.94       51          89        271.1     1.65
control/cancel                          -0.89       27         179         11.3     0.58
soc/culture/african/american            -0.78       56         200        289       1.91
alt/conspiracy/jfk                      -0.77       14          21        262.7     1.87
alt/disney                              -0.77       29         125        969.9     1.86
news/answers                            -0.77       19          90       3376.2     1.44
alt/sports/football/pro/phila-eagles    -0.76       10         199         57.7     2.12
aus/politics                            -0.75       21         187         83.6     1.85
alt/0a/fred-hall/nancy-boy              -0.74       11         200         21       2.4

By user-agent, newsgroup reader client

Sentiment by reader is probably a silly stat.

  • 40tude_Dialog is a Windows GUI client last updated in 2008.
  • K-9 users number fewer than 20 and are all in linux.debian.*

n_articles_agent <- d2023 |> count(agent,name="n_articles") 

afn_agent_smry <- 
  afn_score |> group_by(agent) |>
  summarize(
    afn_wt=mean(words_scored/n_words*afn),
    across(c(n_words,words_scored), sum),
    afn=round(mean(afn),2),
    wrd_article=round(n_words/n(),1),
    mean_sd=round(mean(afn_sd,na.rm=T),2),
    n_groups=length(unique(folder)),
    n_email=length(unique(email))) |>
  inner_join(n_articles_agent)


afn_agent_smry |>
  select(agent, afn,n_groups, n_email,n_articles, wrd_article, mean_sd) |>
  filter(n_email>10, n_articles>10) |>
  arrange(-afn)
agentafnn_groupsn_emailn_articleswrd_articlemean_sd
Rocksolid1.61101210831.41
newsSync1.541942651048.31.15
NewsHound1.0942084178.51.28
Thunderbird1.0516145876.41.78
Cyrus-JMAP0.9110113165.21.39
VSoup0.89237637980.21.57
K-90.82019561971.72
NeoMutt0.762317196315.61.46
Evolution0.636571644185.11.48
Mutt0.6355433572471.56
Turnpike0.61302036978.71.68
Messenger-Pro0.5981611457.31.62
XanaNews0.57452331054.51.55
G20.47178451865873915731.7
Gnus0.472101371515162.11.55
Roundcube0.43141643246.21.73
Usenapp0.41753944966.91.7
Pluto0.36111914154.31.41
MacSOUP0.28652749852.91.75
Unison0.28563323267.11.61
Mozilla0.271132191731724229.11.73
tin0.2623382153797.41.84
Hogwasher0.2429724188291.41.85
Thoth0.21481321855.81.65
40tude_Dialog0.29137111982.91.38
Alpine0.221167280.51.56
slrn0.17288166208381.91.71
0.131381343441694275.91.67
NewsTap0.11213841751116.11.79
ForteAgent0.08359216360682.71.72
MT-NewsWatcher0.0836113259891.91.84
Pan0.0529313817781686.21.86
MicroPlanet-Gravity-0.03112411202131.81.98
Mime-0.328888142218.21.85
Opera-0.474612285441.61
Xnews-0.53694322754233.11.89
Nemo-0.662296551041.64
PhoNews-0.7914118869.31.22
MacCafe-0.8242161389115.61.7

pseudo stats

The average G2-written article is significantly more positive than that from Mozilla! Both means are slightly above neutral.

t.test(afn ~ agent, afn_score %>% filter(agent %in% c("G2","Mozilla")))

And Gnus more positive than slrn

t.test(afn ~ agent, afn_score %>% filter(agent %in% c("Gnus","slrn")))

Despite how the plot may look

library(ggplot2)
agent_subset <- c("G2","Mozilla","Gnus","40tude_Dialog","slrn","ForteAgent", "Xnews")
popular_agents <- afn_score |>
  filter(agent %in% agent_subset) |>
  mutate(interface=ifelse(agent %in% c("Gnus","slrn"), "CLI","GUI")) |>
  ggplot() + aes(x=afn, fill=agent) + geom_density(alpha=.5) + 
  see::theme_modern() + facet_grid(interface~.) +
  labs(x="article afinn score", title="Sentiment by user-agent")

positives <- afn_score |>
  filter(agent %in% agent_subset, afn>0) |>
  ggplot() + aes(x=afn, fill=agent) + geom_density(alpha=.5) + 
  see::theme_modern() +
  labs(x="article afinn score", title="Sentiment by user-agent: positive")

cowplot::plot_grid(popular_agents,positives,nrow=2)
#ggsave('agent_sentiment.png', height=7,width=7)

../images/usenet/agent_sentiment.png


[1] Or actually maybe the problem is one particular group. The log ends with:

alt.autos.toyota.camry: 3 articles available.
***Connection to news.eternal-september.org lost. Performing shutdown.

[2] O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.