Take away
- Most everyone posts with G2 or Mozilla (Thunderbird)
- Old newsreaders (Xnews, newsSync, VSoup, MicroPlanet-Gravity, 40tude_Dialog) are still somewhat popular – despite development having stopped on many.
- Russion langauge sentiment appears positive when stemmed and scored as English. German sentiment appears negative.
- Selling things and some support groups (stop-smoking) are on the most postiive end
- Obscentities make `testing` and `dev` groups score negative; as do some individual and prolific unhappy posters.
Motivation
In researching self-hosting Discourse
, I saw there's an old plugin to sync to nntp.
That was surprising enough to add usenet to lemmy, mastodon, nostr, and ssb
– the list of communities/protocols I have no need for but am still interested in exploring.
After tinkering for a bit, I've two questions
- I setup pan, gnus, and slrn to experiment. But what user agent is most popular?
- Finding useful groups is challenging. Where are the people relevant to me?
Both of these look like they can be answered with data!
Stats already available online: http://top1000.anthologeek.net/#todo http://www.eternal-september.org/postingstats.php
Download
I used slrnpull
, but leafnode
might have also worked.
Config
Hierarchy list: slrn
The first run of slrn
creates $HOME/.jnewsrc
with a list of all hierarchies.
This will be useful for slrnpull
For eternal september, authentication is necessary. .slrnrc
looks like
set force_authentication 1
nnrpaccess "news.eternal-september.org" "xxxx" "yyyyy"
And first pull
slrn -h news.eternal-september.org -f /home/foranw/.jnewsrc --create
articles: slrnpull
I initially put the full (27k+) group list in slrnpull.conf
.
but eternal-september disconnects after the first 300. 1
Conveniently, .jnewsrc also stores an articles count.
And that's easy to sort on and probably a good way to limit to active groups.
Unfortunately, the decently active and interesting group I've pegged as a reference ( alt.fan.usenet
) is ranked just after 4000 by post counts.
echo default 200 365 > /var/spool/news/slrnpull/slrnpull.conf
perl -lne 'print if s/^(.*)[!:].*-(\d+)$/$2\t$1/' ~/.jnewsrc|
sort -nr|grep -v sex |head -n 4100 | cut -f 2 >> /var/spool/news/slrnpull/slrnpull.conf
eternal-september.org requires authentication. I copied slrn's auth to slrnpull's authinfo
sed 's/"//g' .slrnrc |
awk '(/nnrpaccess/){print $3 "\n" $4}' \
> /var/spool/news/slrnpull/authinfo
Data
Pulling
slrnpull -h news.eternal-september.org
# 300 groups Time: 01:04:23, BPS: 70284 08/19/2023 12:57:18 A total of 271509221 bytes received, 1035863 bytes sent in 3879 seconds. # 4100 groups Time: 09:21:24, BPS: 74419 08/20/2023 01:52:36 A total of 2506742895 bytes received, 5781052 bytes sent in 34098 seconds.
Dataset
The directory structure with .minmax files consumes ~100Mb!
time find news/ -type f -not -name '.*' |wc -l
59618 real 6m55.229s
time du -hcs ./
3.2G | total |
Parse
I've extracted headers with a perl script,
used Lingua::Stem::Snowball
to stem the article's text, and parallel
2
to execute on multiple cores. The resulting tab separated file has one line/row per message and compresses well.
time \
find news/ -mindepth 2 -type f -not -name '.*' |
parallel -j 3 --xargs ./article_tsv.pl |
gzip -c > news/all_articles.tsv.gz
real 122m32.779s
du -h news/all_articles.tsv.gz
373M news/all_articles.tsv.gz
Sharing
ln -s all_articles.tsv.gz usenet_group-41000_messages-200_date-20230819.tsv.gz
rhash --magnet --btih usenet_group-41000_messages-200_date-20230819.tsv.gz
Stats
library(dplyr)
d <- data.table::fread('/mnt/ttb/news/all_articles.tsv.gz', quote="",
col.names=c("folder","date","org","from",
"agent_full","path","id","body")) |>
mutate(date=lubridate::ymd_hm(date),
# remove version number
agent=gsub('[ /(:].*','',agent_full),
# extract from within <>: Name <email@host.com>
#email=stringr::str_extract(from,'(?<=<)[^>]+'),
# or ( blah@x.com )
email=stringr::str_extract(from,'[a-zA-Z0-9.!#$%&*+-/=?^_`{|}~]+@[^ )>]+'),
top=stringr::str_extract(folder,'(?<=news/)[^/]*'))
# only look at 2023 (ending in 20230820)
d2023 <- d |> filter(date >= "2023-01-01")
n_messages_2023 <- nrow(d2023)
We have +src_R[:session *R:WillForan.github.io*]ults(166614
)}}} articles from 2023
User Agent
G2 (google groups) and Mozilla (Thunderbird) are an order of magnitude above other clients.
Mozilla users post more often than google users (though a better stat might be median instead of mean).
agents_allposts <- d2023 |> count(agent, name='n_posts') |> arrange(-n_posts)
agents_from <- d2023 |> filter(date >= "2023-01-01") |> count(from, agent) |>
count(agent, name='n_from') |>
arrange(-n_from)
agents <- inner_join(agents_allposts,agents_from) |>
mutate(user_posts=round(n_posts/n_from,1)) |> arrange(-n_from)
agents |> head(n=20)
agent | n_posts | n_from | user_posts |
---|---|---|---|
G2 | 58739 | 6406 | 9.2 |
41694 | 4630 | 9 | |
Mozilla | 31724 | 2348 | 13.5 |
Xnews | 2754 | 864 | 3.2 |
newsSync | 510 | 426 | 1.2 |
ForteAgent | 3606 | 237 | 15.2 |
slrn | 2083 | 219 | 9.5 |
Pan | 1778 | 161 | 11 |
Gnus | 1515 | 158 | 9.6 |
NewsTap | 1751 | 126 | 13.9 |
tin | 1537 | 107 | 14.4 |
Mime | 142 | 89 | 1.6 |
VSoup | 379 | 87 | 4.4 |
Evolution | 644 | 78 | 8.3 |
Dolbo | 121 | 59 | 2.1 |
Mutt | 357 | 50 | 7.1 |
MicroPlanet-Gravity | 1202 | 49 | 24.5 |
40tude_Dialog | 1119 | 46 | 24.3 |
Usenapp | 449 | 45 | 10 |
Unison | 232 | 41 | 5.7 |
The second most popular user-agent is none – missing in the header. These look like they come from lists and scripts.
d2023 |>filter(agent=="") |> count(email,path,org) |> arrange(-n) |> head()
path | org | n | |
---|---|---|---|
doctor@doctor.nl2k.ab.ca | .POSTED.doctor.nl2k.ab.ca | NetKnow News | 2086 |
remailer@domain.invalid | mail2news | 1586 | |
racist_queer_democrat_paedophiles@now.org | mail2news | 836 | |
bugzilla-noreply@freebsd.org | .POSTED.news.muc.de | Newsgate at muc.de e.V. | 790 |
disciple@T3WiJ.com | news.eternal-september.org | A noiseless patient Spider | 702 |
ftpmaster@ftp-master.debian.org | <envelope@ftp-master.debian.org> | linux.* mail to news gateway | 666 |
d2023 |>filter(agent=="",is.na(email)) |> count(folder,name="n_noagent_noemail") |> arrange(-n_noagent_noemail) |> head()
folder | n_noagent_noemail |
---|---|
news/soc/culture/korean | 188 |
news/junk | 129 |
news/alt/bbs/synchronet | 86 |
news/alt/online-service/webtv | 46 |
news/comp/mail/sendmail | 35 |
news/alt/politics/uk | 27 |
By top level group
Do different audiences have specific client preferences?
Yes. Or maybe user agents are just a proxy for spam.
Here we're looking at the top 4 user agents across each top level.
slrn
and Gnus
make the top 4 cut in comp.*
and news.*
, and slrn
also sneaks in for sci.*
while gnus
writes nearly 1/10 of sfnet
messages.
library(tidyr)
agents_top <- d2023 |> filter(agent!="") |>
count(email, agent, top) |>
group_by(top, agent) |> summarise(n_user=length(unique(email))) |>
group_by(top) |> arrange(-n_user) |>
mutate(rank=1:n(), percent=sprintf("%.0f%%",n_user/sum(n_user)*100))
a_order <- agents_top %>% group_by(agent) %>%
summarise(srank=sum(n_user)) %>% arrange(-srank) %>%`[[`('agent')
big8 <- c("comp","alt","sfnet","misc","sci", "news", "misc", "soc", "talk")
N_top <- d2023 |> filter(top %in% big8, agent!="") |> count(top, name="TOTAL")
agent_wide <- agents_top %>%
filter(rank<=4, top %in% big8) %>%
mutate(agent=factor(agent,levels=a_order)) %>%
select(-rank, -n_user) %>%
spread(agent, percent, fill="0")
merge(N_top,agent_wide) %>% arrange(-TOTAL)
top | TOTAL | G2 | Mozilla | Xnews | ForteAgent | slrn | Gnus | VSoup |
---|---|---|---|---|---|---|---|---|
alt | 38432 | 52% | 16% | 10% | 4% | 0 | 0 | 0 |
comp | 11285 | 62% | 18% | 0 | 0 | 3% | 3% | 0 |
soc | 7679 | 77% | 9% | 5% | 2% | 0 | 0 | 0 |
sci | 4619 | 62% | 17% | 0 | 4% | 2% | 0 | 0 |
misc | 2673 | 54% | 19% | 8% | 4% | 0 | 0 | 0 |
talk | 1964 | 47% | 21% | 8% | 7% | 0 | 0 | 0 |
news | 665 | 35% | 17% | 0 | 0 | 9% | 7% | 0 |
sfnet | 422 | 41% | 34% | 0 | 0 | 0 | 9% | 6% |
VSoup
What is Vsoup?! Google isn't any help. It has an OS/2 version!?
d2023 %>% filter(agent=='VSoup') %>% count(top,agent_full) %>% spread(top,n)
agent_full | alt | comp | misc | nz | rec | sfnet |
---|---|---|---|---|---|---|
VSoup v1.2.9.47Beta [95/NT] | 296 | 52 | 1 | 11 | 10 | |
VSoup v1.2.9.48Beta [OS/2] | 7 | 2 |
d2023 %>% filter(agent=='VSoup') %>% count(email) %>% summarise(max_vsoup_posts=max(n),med_vsoup_posts=median(n), n_emails=n())
max_vsoup_posts | med_vsoup_posts | n_emails |
---|---|---|
44 | 2 | 87 |
sentiment
scoring sentiment using stemmed words individual words, valence from Finn Årup Nielsen. AFINN ranks a subset of English words -5 (negative) to +5 (positive). I average all the scored words within the subject + body of a message for a single value per article.
library(tidytext)
#nnc <- get_sentiments("nrc") # has dimensions, eg. "joy"
afn <- get_sentiments("afinn") # -5 neg to +5 positive
# match stemming from perl
afn_stem <- afn |> mutate(word=SnowballC::wordStem(word,language="en")) |> group_by(word) |> summarise(value=mean(value))
# could do a giant merge afn to body split
# but takes too much RAM
# hash lookup with list name should be fast enough alt to merge
afn_lookup <- as.list(afn_stem$value) |> `names<-`(afn_stem$word)
body_stats <- function(body){
body <- stringr::str_split(body,' ', simplify=T)
vals <- unlist(afn_lookup[body])
adj <- 0
if(length(vals)==0L) {
vals <- c("NA"=0)
adj <-1
}
data.frame(n_words=length(body),
afn=mean(vals),
afn_sd=sd(vals),
words_scored=length(vals)-adj,
body=paste(names(vals),vals,sep=":",collapse=" "))
}
# replaces body with scored words only
afn_score <- d2023 |>
select(top, folder, agent, email, body) |>
mutate(folder=gsub('^news/','',folder) |>
rowwise() |>
mutate(body_stats(body))
write.csv(afn_score,file="afn_score.csv.gz",row.names=F,quote=F)
per group
- The most positive place on usenet in 2023 looks like
fido7/ru/home
. Russian lang take the top 3 (afn=2.5-2.3). Since removed (english valance for russian stemmed words are likely not meaningful) - nice to see a supportive place looking positive: alt.support.stop-smoking
- post selling things are positive. What a nice financial incentive for being optomistic (tor/forsale, phl/forsal,chi/forsal,van/forsal,alt/forsale)
windows makes two appearances in the top 20 (edit it/comp/os/win/windows10 fell out after update, afn=1.5). I guess being held hostage by your OS endears some fraternal empathy.
- similar thing for alt.alien.visitors?
- Groups with non-English articles shouldn't be included.
NB. I capped my pull to 200 articles per group.
n_articles <- d2023 |> count(folder,name="n_articles")
afn_folder_smry <-
afn_score |> group_by(folder) |>
summarize(
afn_wt=mean(words_scored/n_words*afn),
across(c(n_words,words_scored), sum),
afn=round(mean(afn),2),
wrd_article=round(n_words/n(),1),
mean_sd=round(mean(afn_sd,na.rm=T),2),
n_email=length(unique(email))) |>
inner_join(n_articles)
afn_folder_smry |>
filter(n_email>=8, n_articles>10, !grepl('/ru/',folder)) |> arrange(-afn) |>
select(folder,afn,n_email,n_articles,wrd_article,mean_sd) |>
head(n=20)
folder | afn | n_email | n_articles | wrd_article | mean_sd |
---|---|---|---|---|---|
alt/alien/visitors | 1.99 | 8 | 200 | 2055.4 | 0.68 |
alt/books/larry-niven | 1.86 | 36 | 104 | 125.6 | 1.23 |
tor/forsale | 1.71 | 28 | 38 | 35.1 | 0.85 |
rec/games/go | 1.67 | 59 | 83 | 79.6 | 1.83 |
houston/forsale | 1.66 | 14 | 16 | 49 | 1.4 |
phl/forsale | 1.65 | 27 | 28 | 36.6 | 1.13 |
chi/forsale | 1.64 | 67 | 80 | 50.4 | 0.98 |
it/sport/calcio/fiorentina | 1.62 | 17 | 200 | 67.5 | 1.18 |
alt/flashback | 1.6 | 48 | 81 | 55 | 0.88 |
alt/support/stop-smoking | 1.6 | 13 | 43 | 38.7 | 1.48 |
alt/fan/fratellibros | 1.59 | 8 | 28 | 55.2 | 1.64 |
it/discussioni/commercialisti | 1.56 | 44 | 200 | 64.7 | 1.22 |
soc/culture/occitan | 1.54 | 58 | 84 | 67.3 | 1.32 |
van/forsale | 1.54 | 30 | 33 | 40.1 | 0.96 |
atl/forsale | 1.53 | 44 | 55 | 61.5 | 1.31 |
it/sport/formula1 | 1.52 | 17 | 200 | 66.4 | 1.39 |
alt/locksmithing | 1.51 | 8 | 12 | 40.9 | 0.75 |
it/comp/os/win/windows10 | 1.5 | 46 | 200 | 65.4 | 1.9 |
nyc/forsale | 1.48 | 91 | 104 | 48.2 | 1.14 |
aioe/news/assistenza | 1.46 | 23 | 106 | 55.9 | 1.46 |
negative
A kill file would probably change this a lot. soc.culture.scottish and *.webtv have a few spammy/tortured individuals in groups without many other posters to suppress the noise.
- I removed "test" groups. those came out as most negative. I'd hoped 'test' had negative valence, but it's not even in afinn. But obscenities/racial epitaphs are and have the most negative values.
- huuhaa is a finish group
- Äffle und Pferdle (monkey and horse) is a german cartoon played between commercials? hopefully a language scoring issue and not an especially negative place.
- In the opposite of the smoking support above,
fat-acceptance
is scored negatively. - I guess buffalo bills fans (all 9 of them) are not a happy bunch
- alt.crime's no surprise, but not b/c of racist obscenities! The most popular negative words are evil(-3), torture(-4), charge(-3), and crime(-3)
- scottish culture? includes a lot of torture, kill, death
webtv in 2023?
- euthanasia drugs!? lots of other very upsets (re: child trafficking?) posts
library(stringr)
afn_score$body[grepl("soc/culture/scottish$",afn_score$folder)] %>%
str_split(" ",simplify=T) %>%
str_split(":") %>%
Filter(f=\(x) length(x)==2L) %>%
lapply(\(x) data.frame(w=x[1],v=as.numeric(x[2]))) %>%
bind_rows() %>% count(w,v) %>% mutate(score=v*n) %>%
arrange(score) %>%
head()
w | v | n | score |
---|---|---|---|
tortur | -4 | 574 | -2296 |
kill | -3 | 244 | -732 |
death | -2 | 276 | -552 |
victim | -3 | 165 | -495 |
abus | -3 | 137 | -411 |
useless | -2 | 202 | -404 |
afn_folder_smry |>
filter(n_email>=8, n_articles>10,
!grepl('/pl/|/fr/|geschn|tratsch|/de/|/pa/|/dk/|/in/|spanish|ttiili|german|/nl/|/be/|test$|dev$',folder)) |>
arrange(afn) |>
select(folder,afn,n_email,n_articles,wrd_article,mean_sd) |>
head(n=20)
folder | afn | n_email | n_articles | wrd_article | mean_sd |
---|---|---|---|---|---|
alt/aeffle/und/pferdle | -2.39 | 12 | 74 | 104.1 | 0.86 |
sfnet/huuhaa | -1.39 | 10 | 199 | 83.5 | 1.17 |
alt/games/microsoft/flight-sim | -1.23 | 8 | 200 | 212.2 | 2.18 |
it/news/net-abuse | -1.16 | 21 | 99 | 307.4 | 1.79 |
alt/lawyers | -1.05 | 24 | 41 | 465.3 | 1.97 |
ca/driving | -1.05 | 19 | 23 | 219.5 | 1.66 |
alt/tv | -1.04 | 9 | 200 | 89 | 1.67 |
alt/business/accountability | -1.01 | 10 | 200 | 78 | 1.61 |
alt/online-service/webtv | -0.95 | 14 | 73 | 186.4 | 1.38 |
alt/sports/football/pro/buffalo-bills | -0.95 | 9 | 19 | 171.9 | 1.86 |
soc/culture/scottish | -0.94 | 8 | 139 | 993.1 | 1.64 |
alt/crime | -0.9 | 45 | 189 | 271.1 | 1.65 |
control/cancel | -0.89 | 27 | 179 | 11.3 | 0.58 |
soc/culture/african/american | -0.78 | 56 | 200 | 289 | 1.91 |
alt/conspiracy/jfk | -0.77 | 14 | 212 | 62.7 | 1.87 |
alt/disney | -0.77 | 29 | 125 | 969.9 | 1.86 |
news/answers | -0.77 | 19 | 90 | 3376.2 | 1.44 |
alt/sports/football/pro/phila-eagles | -0.76 | 10 | 199 | 57.7 | 2.12 |
aus/politics | -0.75 | 21 | 187 | 83.6 | 1.85 |
alt/0a/fred-hall/nancy-boy | -0.74 | 11 | 200 | 21 | 2.4 |
By user-agent, newsgroup reader client
Sentiment by reader is probably a silly stat.
40tude_Dialog
is a windows gui client last updated in 2008.K-9
users number less than 20 and are all in linux.debian.*
n_articles_agent <- d2023 |> count(agent,name="n_articles")
afn_agent_smry <-
afn_score |> group_by(agent) |>
summarize(
afn_wt=mean(words_scored/n_words*afn),
across(c(n_words,words_scored), sum),
afn=round(mean(afn),2),
wrd_article=round(n_words/n(),1),
mean_sd=round(mean(afn_sd,na.rm=T),2),
n_groups=length(unique(folder)),
n_email=length(unique(email))) |>
inner_join(n_articles_agent)
afn_agent_smry |>
select(agent, afn,n_groups, n_email,n_articles, wrd_article, mean_sd) |>
filter(n_email>10, n_articles>10) |>
arrange(-afn)
agent | afn | n_groups | n_email | n_articles | wrd_article | mean_sd |
---|---|---|---|---|---|---|
Rocksolid | 1.61 | 10 | 12 | 108 | 31.4 | 1 |
newsSync | 1.54 | 19 | 426 | 510 | 48.3 | 1.15 |
NewsHound | 1.09 | 4 | 20 | 84 | 178.5 | 1.28 |
Thunderbird | 1.05 | 16 | 14 | 58 | 76.4 | 1.78 |
Cyrus-JMAP | 0.91 | 10 | 11 | 31 | 65.2 | 1.39 |
VSoup | 0.89 | 23 | 76 | 379 | 80.2 | 1.57 |
K-9 | 0.8 | 20 | 19 | 56 | 197 | 1.72 |
NeoMutt | 0.76 | 23 | 17 | 196 | 315.6 | 1.46 |
Evolution | 0.63 | 65 | 71 | 644 | 185.1 | 1.48 |
Mutt | 0.63 | 55 | 43 | 357 | 247 | 1.56 |
Turnpike | 0.61 | 30 | 20 | 369 | 78.7 | 1.68 |
Messenger-Pro | 0.59 | 8 | 16 | 114 | 57.3 | 1.62 |
XanaNews | 0.57 | 45 | 23 | 310 | 54.5 | 1.55 |
G2 | 0.47 | 1784 | 5186 | 58739 | 1573 | 1.7 |
Gnus | 0.47 | 210 | 137 | 1515 | 162.1 | 1.55 |
Roundcube | 0.43 | 14 | 16 | 43 | 246.2 | 1.73 |
Usenapp | 0.41 | 75 | 39 | 449 | 66.9 | 1.7 |
Pluto | 0.36 | 11 | 19 | 141 | 54.3 | 1.41 |
MacSOUP | 0.28 | 65 | 27 | 498 | 52.9 | 1.75 |
Unison | 0.28 | 56 | 33 | 232 | 67.1 | 1.61 |
Mozilla | 0.27 | 1132 | 1917 | 31724 | 229.1 | 1.73 |
tin | 0.26 | 233 | 82 | 1537 | 97.4 | 1.84 |
Hogwasher | 0.24 | 297 | 24 | 1882 | 91.4 | 1.85 |
Thoth | 0.21 | 48 | 13 | 218 | 55.8 | 1.65 |
40tude_Dialog | 0.2 | 91 | 37 | 1119 | 82.9 | 1.38 |
Alpine | 0.2 | 21 | 16 | 72 | 80.5 | 1.56 |
slrn | 0.17 | 288 | 166 | 2083 | 81.9 | 1.71 |
0.13 | 1381 | 3434 | 41694 | 275.9 | 1.67 | |
NewsTap | 0.11 | 213 | 84 | 1751 | 116.1 | 1.79 |
ForteAgent | 0.08 | 359 | 216 | 3606 | 82.7 | 1.72 |
MT-NewsWatcher | 0.08 | 361 | 13 | 2598 | 91.9 | 1.84 |
Pan | 0.05 | 293 | 138 | 1778 | 1686.2 | 1.86 |
MicroPlanet-Gravity | -0.03 | 112 | 41 | 1202 | 131.8 | 1.98 |
Mime | -0.32 | 88 | 88 | 142 | 218.2 | 1.85 |
Opera | -0.47 | 46 | 12 | 285 | 44 | 1.61 |
Xnews | -0.5 | 369 | 432 | 2754 | 233.1 | 1.89 |
Nemo | -0.6 | 62 | 29 | 655 | 104 | 1.64 |
PhoNews | -0.79 | 14 | 11 | 88 | 69.3 | 1.22 |
MacCafe | -0.82 | 42 | 16 | 1389 | 115.6 | 1.7 |
pseudo stats
The average G2 written article is significantly more positive than that from Mozilla! Both means are slightly above to neutral.
t.test(afn ~ agent, afn_score %>% filter(agent %in% c("G2","Mozilla")))
And Gnus more positive than slrn
t.test(afn ~ agent, afn_score %>% filter(agent %in% c("Gnus","slrn")))
Despite how the plot may looking
library(ggplot2)
agent_subset <- c("G2","Mozilla","Gnus","40tude_Dialog","slrn","ForteAgent", "Xnews")
popular_agents <- afn_score |>
filter(agent %in% agent_subset) |>
mutate(interface=ifelse(agent %in% c("Gnus","slrn"), "CLI","GUI")) |>
ggplot() + aes(x=afn, fill=agent) + geom_density(alpha=.5) +
see::theme_modern() + facet_grid(interface~.) +
labs(x="article afinn score", title="Sentiment by user-agent")
positives <- afn_score |>
filter(agent %in% agent_subset, afn>0) |>
ggplot() + aes(x=afn, fill=agent) + geom_density(alpha=.5) +
see::theme_modern() +
labs(x="article afinn score", title="Sentiment by user-agent: positive")
cowplot::plot_grid(popular_agents,positives,nrow=2)
#ggsave('agent_sentiment.png', height=7,widht=7)