Usenet stats: User Agent and Sentiment

Take away

Most everyone posts with G2 or Mozilla (Thunderbird)
Old newsreaders (Xnews, newsSync, VSoup, MicroPlanet-Gravity, 40tude_Dialog) are still somewhat popular – despite development having stopped on many.
Russion langauge sentiment appears positive when stemmed and scored as English. German sentiment appears negative.
Selling things and some support groups (stop-smoking) are on the most postiive end
Obscentities make `testing` and `dev` groups score negative; as do some individual and prolific unhappy posters.

Motivation

In researching self-hosting Discourse, I saw there's an old plugin to sync to nntp. That was surprising enough to add usenet to lemmy, mastodon, nostr, and ssb – the list of communities/protocols I have no need for but am still interested in exploring.

After tinkering for a bit, I've two questions

I setup pan, gnus, and slrn to experiment. But what user agent is most popular?
Finding useful groups is challenging. Where are the people relevant to me?

Both of these look like they can be answered with data!

Stats already available online: http://top1000.anthologeek.net/#todo http://www.eternal-september.org/postingstats.php

Download

I used slrnpull, but leafnode might have also worked.

Config

Hierarchy list: slrn

The first run of slrn creates $HOME/.jnewsrc with a list of all hierarchies. This will be useful for slrnpull

For eternal september, authentication is necessary. .slrnrc looks like

set force_authentication 1
nnrpaccess "news.eternal-september.org" "xxxx" "yyyyy"

And first pull

slrn -h news.eternal-september.org  -f /home/foranw/.jnewsrc --create

articles: slrnpull

I initially put the full (27k+) group list in slrnpull.conf. but eternal-september disconnects after the first 300. ¹ Conveniently, .jnewsrc also stores an articles count. And that's easy to sort on and probably a good way to limit to active groups.

Unfortunately, the decently active and interesting group I've pegged as a reference ( alt.fan.usenet) is ranked just after 4000 by post counts.

echo default 200     365 > /var/spool/news/slrnpull/slrnpull.conf
perl -lne 'print if s/^(.*)[!:].*-(\d+)$/$2\t$1/' ~/.jnewsrc|
  sort -nr|grep -v sex |head -n 4100 | cut -f 2 >> /var/spool/news/slrnpull/slrnpull.conf

eternal-september.org requires authentication. I copied slrn's auth to slrnpull's authinfo

sed 's/"//g' .slrnrc |
  awk '(/nnrpaccess/){print $3 "\n" $4}' \
  > /var/spool/news/slrnpull/authinfo

Data

Pulling

slrnpull -h news.eternal-september.org

# 300 groups
  Time: 01:04:23, BPS: 70284
08/19/2023 12:57:18 A total of 271509221 bytes received, 1035863 bytes sent in 3879 seconds.

# 4100 groups
Time: 09:21:24, BPS: 74419
08/20/2023 01:52:36 A total of 2506742895 bytes received, 5781052 bytes sent in 34098 seconds.

Dataset

The directory structure with .minmax files consumes ~100Mb!

time find news/ -type f -not -name '.*'  |wc -l

59618
real    6m55.229s

time du -hcs ./

3.2G

total

Parse

I've extracted headers with a perl script, used Lingua::Stem::Snowball to stem the article's text, and parallel ² to execute on multiple cores. The resulting tab separated file has one line/row per message and compresses well.

time \
  find news/ -mindepth 2 -type f -not -name '.*'  |
  parallel -j 3 --xargs ./article_tsv.pl |
  gzip -c > news/all_articles.tsv.gz

real    122m32.779s

 du -h news/all_articles.tsv.gz

373M    news/all_articles.tsv.gz

Sharing

ln -s all_articles.tsv.gz usenet_group-41000_messages-200_date-20230819.tsv.gz 
rhash --magnet --btih  usenet_group-41000_messages-200_date-20230819.tsv.gz

Stats

library(dplyr)
d <- data.table::fread('/mnt/ttb/news/all_articles.tsv.gz', quote="",
                       col.names=c("folder","date","org","from",
                                   "agent_full","path","id","body")) |>
     mutate(date=lubridate::ymd_hm(date),
            # remove version number
            agent=gsub('[ /(:].*','',agent_full),
            # extract from within <>: Name <email@host.com>
            #email=stringr::str_extract(from,'(?<=<)[^>]+'),
            # or ( blah@x.com )
            email=stringr::str_extract(from,'[a-zA-Z0-9.!#$%&*+-/=?^_`{|}~]+@[^ )>]+'),
            top=stringr::str_extract(folder,'(?<=news/)[^/]*'))
# only look at 2023 (ending in 20230820)
d2023 <- d |> filter(date >= "2023-01-01")

n_messages_2023 <- nrow(d2023)

We have +src_R[:session *R:WillForan.github.io*]ults(166614)}}} articles from 2023

User Agent

G2 (google groups) and Mozilla (Thunderbird) are an order of magnitude above other clients.

Mozilla users post more often than google users (though a better stat might be median instead of mean).

agents_allposts <- d2023 |> count(agent, name='n_posts') |> arrange(-n_posts)
agents_from     <- d2023 |> filter(date >= "2023-01-01") |> count(from, agent) |>
  count(agent, name='n_from') |>
  arrange(-n_from)

agents <- inner_join(agents_allposts,agents_from) |>
     mutate(user_posts=round(n_posts/n_from,1)) |> arrange(-n_from)
agents  |> head(n=20)

agent	n_posts	n_from	user_posts
G2	58739	6406	9.2
	41694	4630	9
Mozilla	31724	2348	13.5
Xnews	2754	864	3.2
newsSync	510	426	1.2
ForteAgent	3606	237	15.2
slrn	2083	219	9.5
Pan	1778	161	11
Gnus	1515	158	9.6
NewsTap	1751	126	13.9
tin	1537	107	14.4
Mime	142	89	1.6
VSoup	379	87	4.4
Evolution	644	78	8.3
Dolbo	121	59	2.1
Mutt	357	50	7.1
MicroPlanet-Gravity	1202	49	24.5
40tude_Dialog	1119	46	24.3
Usenapp	449	45	10
Unison	232	41	5.7

The second most popular user-agent is none – missing in the header. These look like they come from lists and scripts.

d2023 |>filter(agent=="") |> count(email,path,org) |> arrange(-n) |> head()

email	path	org	n
doctor@doctor.nl2k.ab.ca	.POSTED.doctor.nl2k.ab.ca	NetKnow News	2086
remailer@domain.invalid	mail2news		1586
racist_queer_democrat_paedophiles@now.org	mail2news		836
bugzilla-noreply@freebsd.org	.POSTED.news.muc.de	Newsgate at muc.de e.V.	790
disciple@T3WiJ.com	news.eternal-september.org	A noiseless patient Spider	702
ftpmaster@ftp-master.debian.org	<envelope@ftp-master.debian.org>	linux.* mail to news gateway	666

d2023 |>filter(agent=="",is.na(email)) |> count(folder,name="n_noagent_noemail") |> arrange(-n_noagent_noemail) |> head()

folder	n_noagent_noemail
news/soc/culture/korean	188
news/junk	129
news/alt/bbs/synchronet	86
news/alt/online-service/webtv	46
news/comp/mail/sendmail	35
news/alt/politics/uk	27

By top level group

Do different audiences have specific client preferences?

Yes. Or maybe user agents are just a proxy for spam.

Here we're looking at the top 4 user agents across each top level. slrn and Gnus make the top 4 cut in comp.* and news.*, and slrn also sneaks in for sci.* while gnus writes nearly 1/10 of sfnet messages.

library(tidyr)
agents_top <- d2023 |> filter(agent!="") |>
   count(email, agent, top) |>
   group_by(top, agent) |> summarise(n_user=length(unique(email))) |>
   group_by(top) |> arrange(-n_user) |>
   mutate(rank=1:n(), percent=sprintf("%.0f%%",n_user/sum(n_user)*100))

a_order <- agents_top %>% group_by(agent) %>%
           summarise(srank=sum(n_user)) %>% arrange(-srank) %>%`[[`('agent')
big8 <-  c("comp","alt","sfnet","misc","sci", "news", "misc", "soc", "talk")
N_top <- d2023 |> filter(top %in% big8, agent!="") |> count(top, name="TOTAL")

agent_wide <- agents_top %>%
   filter(rank<=4, top %in% big8) %>%
   mutate(agent=factor(agent,levels=a_order)) %>%
   select(-rank, -n_user) %>%
   spread(agent, percent, fill="0")

merge(N_top,agent_wide) %>% arrange(-TOTAL)

top	TOTAL	G2	Mozilla	Xnews	ForteAgent	slrn	Gnus	VSoup
alt	38432	52%	16%	10%	4%	0	0	0
comp	11285	62%	18%	0	0	3%	3%	0
soc	7679	77%	9%	5%	2%	0	0	0
sci	4619	62%	17%	0	4%	2%	0	0
misc	2673	54%	19%	8%	4%	0	0	0
talk	1964	47%	21%	8%	7%	0	0	0
news	665	35%	17%	0	0	9%	7%	0
sfnet	422	41%	34%	0	0	0	9%	6%

VSoup

What is Vsoup?! Google isn't any help. It has an OS/2 version!?

d2023 %>% filter(agent=='VSoup') %>% count(top,agent_full) %>% spread(top,n)

agent_full	alt	comp	misc	nz	rec	sfnet
VSoup v1.2.9.47Beta [95/NT]	296	52	1	11	10
VSoup v1.2.9.48Beta [OS/2]	7					2

d2023 %>% filter(agent=='VSoup') %>% count(email) %>% summarise(max_vsoup_posts=max(n),med_vsoup_posts=median(n), n_emails=n())

max_vsoup_posts	med_vsoup_posts	n_emails
44	2	87

sentiment

scoring sentiment using stemmed words individual words, valence from Finn Årup Nielsen. AFINN ranks a subset of English words -5 (negative) to +5 (positive). I average all the scored words within the subject + body of a message for a single value per article.

library(tidytext)
#nnc <- get_sentiments("nrc") # has dimensions, eg. "joy"
afn <- get_sentiments("afinn") # -5 neg to +5 positive

# match stemming from perl
afn_stem <- afn |> mutate(word=SnowballC::wordStem(word,language="en")) |> group_by(word) |> summarise(value=mean(value))



# could do a giant merge afn to body split
# but takes too much RAM
# hash lookup with list name should be fast enough alt to merge
afn_lookup <- as.list(afn_stem$value) |> `names<-`(afn_stem$word)
body_stats <- function(body){
   body <- stringr::str_split(body,' ', simplify=T)
   vals <- unlist(afn_lookup[body])
   adj <- 0
   if(length(vals)==0L) {
       vals <- c("NA"=0)
       adj <-1 
   }
   data.frame(n_words=length(body),
              afn=mean(vals),
              afn_sd=sd(vals),
              words_scored=length(vals)-adj,
              body=paste(names(vals),vals,sep=":",collapse=" "))
}
# replaces body with scored words only
afn_score <- d2023 |> 
  select(top, folder, agent, email, body) |>
   mutate(folder=gsub('^news/','',folder) |>
   rowwise() |>
   mutate(body_stats(body))

write.csv(afn_score,file="afn_score.csv.gz",row.names=F,quote=F)

per group

The most positive place on usenet in 2023 looks like fido7/ru/home. Russian lang take the top 3 (afn=2.5-2.3). Since removed (english valance for russian stemmed words are likely not meaningful)
nice to see a supportive place looking positive: alt.support.stop-smoking
post selling things are positive. What a nice financial incentive for being optomistic (tor/forsale, phl/forsal,chi/forsal,van/forsal,alt/forsale)
windows makes two appearances in the top 20 (edit it/comp/os/win/windows10 fell out after update, afn=1.5). I guess being held hostage by your OS endears some fraternal empathy.
- similar thing for alt.alien.visitors?
Groups with non-English articles shouldn't be included.

NB. I capped my pull to 200 articles per group.


n_articles <- d2023 |> count(folder,name="n_articles") 

afn_folder_smry <- 
  afn_score |> group_by(folder) |>
  summarize(
    afn_wt=mean(words_scored/n_words*afn),
    across(c(n_words,words_scored), sum),
    afn=round(mean(afn),2),
    wrd_article=round(n_words/n(),1),
    mean_sd=round(mean(afn_sd,na.rm=T),2),
    n_email=length(unique(email))) |>
  inner_join(n_articles)

afn_folder_smry |>
  filter(n_email>=8, n_articles>10, !grepl('/ru/',folder)) |> arrange(-afn) |>
  select(folder,afn,n_email,n_articles,wrd_article,mean_sd) |>
  head(n=20)

folder	afn	n_email	n_articles	wrd_article	mean_sd
alt/alien/visitors	1.99	8	200	2055.4	0.68
alt/books/larry-niven	1.86	36	104	125.6	1.23
tor/forsale	1.71	28	38	35.1	0.85
rec/games/go	1.67	59	83	79.6	1.83
houston/forsale	1.66	14	16	49	1.4
phl/forsale	1.65	27	28	36.6	1.13
chi/forsale	1.64	67	80	50.4	0.98
it/sport/calcio/fiorentina	1.62	17	200	67.5	1.18
alt/flashback	1.6	48	81	55	0.88
alt/support/stop-smoking	1.6	13	43	38.7	1.48
alt/fan/fratellibros	1.59	8	28	55.2	1.64
it/discussioni/commercialisti	1.56	44	200	64.7	1.22
soc/culture/occitan	1.54	58	84	67.3	1.32
van/forsale	1.54	30	33	40.1	0.96
atl/forsale	1.53	44	55	61.5	1.31
it/sport/formula1	1.52	17	200	66.4	1.39
alt/locksmithing	1.51	8	12	40.9	0.75
it/comp/os/win/windows10	1.5	46	200	65.4	1.9
nyc/forsale	1.48	91	104	48.2	1.14
aioe/news/assistenza	1.46	23	106	55.9	1.46

negative

A kill file would probably change this a lot. soc.culture.scottish and *.webtv have a few spammy/tortured individuals in groups without many other posters to suppress the noise.

I removed "test" groups. those came out as most negative. I'd hoped 'test' had negative valence, but it's not even in afinn. But obscenities/racial epitaphs are and have the most negative values.
huuhaa is a finish group
Äffle und Pferdle (monkey and horse) is a german cartoon played between commercials? hopefully a language scoring issue and not an especially negative place.
In the opposite of the smoking support above, fat-acceptance is scored negatively.
I guess buffalo bills fans (all 9 of them) are not a happy bunch
alt.crime's no surprise, but not b/c of racist obscenities! The most popular negative words are evil(-3), torture(-4), charge(-3), and crime(-3)
scottish culture? includes a lot of torture, kill, death
webtv in 2023?
- euthanasia drugs!? lots of other very upsets (re: child trafficking?) posts

library(stringr)
afn_score$body[grepl("soc/culture/scottish$",afn_score$folder)] %>%
  str_split(" ",simplify=T) %>%
  str_split(":") %>%
  Filter(f=\(x) length(x)==2L)  %>%
  lapply(\(x) data.frame(w=x[1],v=as.numeric(x[2]))) %>%
  bind_rows() %>% count(w,v) %>% mutate(score=v*n) %>%
  arrange(score) %>%
  head()

w	v	n	score
tortur	-4	574	-2296
kill	-3	244	-732
death	-2	276	-552
victim	-3	165	-495
abus	-3	137	-411
useless	-2	202	-404

afn_folder_smry |>
  filter(n_email>=8, n_articles>10,
        !grepl('/pl/|/fr/|geschn|tratsch|/de/|/pa/|/dk/|/in/|spanish|ttiili|german|/nl/|/be/|test$|dev$',folder)) |>
  arrange(afn) |>
  select(folder,afn,n_email,n_articles,wrd_article,mean_sd) |>
  head(n=20)

folder	afn	n_email	n_articles	wrd_article	mean_sd
alt/aeffle/und/pferdle	-2.39	12	74	104.1	0.86
sfnet/huuhaa	-1.39	10	199	83.5	1.17
alt/games/microsoft/flight-sim	-1.23	8	200	212.2	2.18
it/news/net-abuse	-1.16	21	99	307.4	1.79
alt/lawyers	-1.05	24	41	465.3	1.97
ca/driving	-1.05	19	23	219.5	1.66
alt/tv	-1.04	9	200	89	1.67
alt/business/accountability	-1.01	10	200	78	1.61
alt/online-service/webtv	-0.95	14	73	186.4	1.38
alt/sports/football/pro/buffalo-bills	-0.95	9	19	171.9	1.86
soc/culture/scottish	-0.94	8	139	993.1	1.64
alt/crime	-0.9	45	189	271.1	1.65
control/cancel	-0.89	27	179	11.3	0.58
soc/culture/african/american	-0.78	56	200	289	1.91
alt/conspiracy/jfk	-0.77	14	212	62.7	1.87
alt/disney	-0.77	29	125	969.9	1.86
news/answers	-0.77	19	90	3376.2	1.44
alt/sports/football/pro/phila-eagles	-0.76	10	199	57.7	2.12
aus/politics	-0.75	21	187	83.6	1.85
alt/0a/fred-hall/nancy-boy	-0.74	11	200	21	2.4

By user-agent, newsgroup reader client

Sentiment by reader is probably a silly stat.

40tude_Dialog is a windows gui client last updated in 2008.
K-9 users number less than 20 and are all in linux.debian.*


n_articles_agent <- d2023 |> count(agent,name="n_articles") 

afn_agent_smry <- 
  afn_score |> group_by(agent) |>
  summarize(
    afn_wt=mean(words_scored/n_words*afn),
    across(c(n_words,words_scored), sum),
    afn=round(mean(afn),2),
    wrd_article=round(n_words/n(),1),
    mean_sd=round(mean(afn_sd,na.rm=T),2),
    n_groups=length(unique(folder)),
    n_email=length(unique(email))) |>
  inner_join(n_articles_agent)


afn_agent_smry |>
  select(agent, afn,n_groups, n_email,n_articles, wrd_article, mean_sd) |>
  filter(n_email>10, n_articles>10) |>
  arrange(-afn)

agent	afn	n_groups	n_email	n_articles	wrd_article	mean_sd
Rocksolid	1.61	10	12	108	31.4	1
newsSync	1.54	19	426	510	48.3	1.15
NewsHound	1.09	4	20	84	178.5	1.28
Thunderbird	1.05	16	14	58	76.4	1.78
Cyrus-JMAP	0.91	10	11	31	65.2	1.39
VSoup	0.89	23	76	379	80.2	1.57
K-9	0.8	20	19	56	197	1.72
NeoMutt	0.76	23	17	196	315.6	1.46
Evolution	0.63	65	71	644	185.1	1.48
Mutt	0.63	55	43	357	247	1.56
Turnpike	0.61	30	20	369	78.7	1.68
Messenger-Pro	0.59	8	16	114	57.3	1.62
XanaNews	0.57	45	23	310	54.5	1.55
G2	0.47	1784	5186	58739	1573	1.7
Gnus	0.47	210	137	1515	162.1	1.55
Roundcube	0.43	14	16	43	246.2	1.73
Usenapp	0.41	75	39	449	66.9	1.7
Pluto	0.36	11	19	141	54.3	1.41
MacSOUP	0.28	65	27	498	52.9	1.75
Unison	0.28	56	33	232	67.1	1.61
Mozilla	0.27	1132	1917	31724	229.1	1.73
tin	0.26	233	82	1537	97.4	1.84
Hogwasher	0.24	297	24	1882	91.4	1.85
Thoth	0.21	48	13	218	55.8	1.65
40tude_Dialog	0.2	91	37	1119	82.9	1.38
Alpine	0.2	21	16	72	80.5	1.56
slrn	0.17	288	166	2083	81.9	1.71
	0.13	1381	3434	41694	275.9	1.67
NewsTap	0.11	213	84	1751	116.1	1.79
ForteAgent	0.08	359	216	3606	82.7	1.72
MT-NewsWatcher	0.08	361	13	2598	91.9	1.84
Pan	0.05	293	138	1778	1686.2	1.86
MicroPlanet-Gravity	-0.03	112	41	1202	131.8	1.98
Mime	-0.32	88	88	142	218.2	1.85
Opera	-0.47	46	12	285	44	1.61
Xnews	-0.5	369	432	2754	233.1	1.89
Nemo	-0.6	62	29	655	104	1.64
PhoNews	-0.79	14	11	88	69.3	1.22
MacCafe	-0.82	42	16	1389	115.6	1.7

pseudo stats

The average G2 written article is significantly more positive than that from Mozilla! Both means are slightly above to neutral.

t.test(afn ~ agent, afn_score %>% filter(agent %in% c("G2","Mozilla")))

And Gnus more positive than slrn

t.test(afn ~ agent, afn_score %>% filter(agent %in% c("Gnus","slrn")))

Despite how the plot may looking

library(ggplot2)
agent_subset <- c("G2","Mozilla","Gnus","40tude_Dialog","slrn","ForteAgent", "Xnews")
popular_agents <- afn_score |>
  filter(agent %in% agent_subset) |>
  mutate(interface=ifelse(agent %in% c("Gnus","slrn"), "CLI","GUI")) |>
  ggplot() + aes(x=afn, fill=agent) + geom_density(alpha=.5) + 
  see::theme_modern() + facet_grid(interface~.) +
  labs(x="article afinn score", title="Sentiment by user-agent")

positives <- afn_score |>
  filter(agent %in% agent_subset, afn>0) |>
  ggplot() + aes(x=afn, fill=agent) + geom_density(alpha=.5) + 
  see::theme_modern() +
  labs(x="article afinn score", title="Sentiment by user-agent: positive")

cowplot::plot_grid(popular_agents,positives,nrow=2)
#ggsave('agent_sentiment.png', height=7,widht=7)

../images/usenet/agent_sentiment.png

or actually maybe the problem is alt.autos.toyota.camry: 3 articles available.\n***Connection to news.eternal-september.org lost. Performing shutdown.

O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.