Beep Boop Bip

File 16054637432.jpg - (438.18KB , 850x616 , punch.jpg )
No. 2100
I found a scary, but interesting toy.
Java Graphical Authorship Attribution Program
https://github.com/evllabs/JGAAP
>> No. 2101
File 160547130643.jpg - (60.16KB , 600x480 , 23685a619f8e58b436442d3a3bf545a7.jpg )
I fed it samples of my own, tohno's, Lewis Carroll's, and Oscar Wilde's writing, then gave it four unknown samples, one by each author. "Markov Chain Analysis" attributed none of them correctly, "LDA" attributed tohno's and mine correctly, and "Absolute Centroid Driver with metric Alt Intersection Distance" attributed mine, Carroll's, and Wilde's correctly. Why? I have no clue.

Canonicizers:
  Normalize Whitespace
  Strip AlphaNumeric
  Strip Null Characters
  Unify Case
EventDrivers (each culled with Most Common Events, numEvents: 75, except the last two):
  Word stems w/ Irregular Words
  Lexical Frequencies
  Binned Frequencies
  Stanford Part of Speech tagging (model: english-left3words-distsim)
  Word Lengths
  Sentence Length
  Word stems
  First Word In Sentence
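Out of curiosity, here's roughly what that kind of pipeline does in Python terms: canonicize the text, extract word events, cull to the 75 most common, then rank the known authors by distance to the unknown sample. The intersection distance below is my guess at the spirit of "Alt Intersection Distance", not JGAAP's actual implementation.

```python
import re
from collections import Counter

def canonicize(text):
    # Mirror the canonicizers above: strip non-alphanumeric characters,
    # unify case, normalize whitespace.
    text = re.sub(r"[^0-9A-Za-z\s]", "", text).lower()
    return re.sub(r"\s+", " ", text).strip()

def top_event_frequencies(text, num_events=75):
    # Event driver: words; event culler: keep only the most common events.
    words = canonicize(text).split()
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.most_common(num_events)}

def intersection_distance(a, b):
    # One plausible "intersection"-style distance: 1 minus the summed
    # overlap of the two frequency profiles (an assumption on my part).
    overlap = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in set(a) | set(b))
    return 1.0 - overlap

def attribute(unknown, known_samples):
    # known_samples: {author: text}; returns authors ranked by distance,
    # nearest first.
    profile = top_event_frequencies(unknown)
    scored = [(intersection_distance(profile, top_event_frequencies(t)), author)
              for author, t in known_samples.items()]
    return [author for _, author in sorted(scored)]
```

With only four short training samples this kind of nearest-centroid ranking is noisy, which might partly explain the inconsistent results above.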


My Sample

Analysis:
Markov Chain Analysis
1. oscar 1069.331221322057
2. lewis 1064.0639706955424
3. me 1003.2718351175538
4. lewis 953.4430985453695
5. tohno 874.210494638616
6. me 833.7232563444958
7. me 787.8843863772557
8. me 764.7286762552402

Absolute Centroid Driver with metric Alt Intersection Distance
1. me 0.02
2. lewis 0.023809523809523808
3. tohno 0.02564102564102564
4. oscar 0.03125

LDA
1. me 2.0198861851999578E14
2. tohno 2.0198861851999328E14
3. lewis 2.019886185199932E14
4. oscar 2.0198861851999075E14


Lewis Sample

Analysis:
Markov Chain Analysis
1. me 344.64592772385595
2. oscar 326.70974309813477
3. lewis 283.59897650666517
4. lewis 261.6183018407893
5. tohno 202.38131877604445
6. me 174.79604427564163
7. me 164.37704164369205
8. me 143.74712586614893

Absolute Centroid Driver with metric Alt Intersection Distance
1. lewis 0.015873015873015872
2. me 0.018867924528301886
3. oscar 0.0196078431372549
3. tohno 0.0196078431372549


LDA
1. me 1.6400230733690834E14
2. oscar 1.64002307336908E14
3. lewis 1.6400230733690712E14
4. tohno 1.6400230733690378E14


Oscar Sample

Analysis:
Markov Chain Analysis
1. lewis 200.38893005230196
2. oscar 170.0575418388182
3. tohno 146.13035171912483
4. me 145.04149066336706
5. me 138.31719973630894
6. me 132.81698296944657
7. lewis 125.63518710585872
8. me 111.41391522693222

Absolute Centroid Driver with metric Alt Intersection Distance
1. oscar 0.05555555555555555
1. me 0.05555555555555555
1. tohno 0.05555555555555555
1. lewis 0.05555555555555555

LDA
1. me 2.2603990291968966E14
2. tohno 2.260399029196884E14
3. lewis 2.260399029196875E14
4. oscar 2.260399029196863E14


Tohno Sample

Analysis:
Markov Chain Analysis
1. oscar 2116.6700988662474
2. lewis 2113.2694453947047
3. lewis 2038.2435926063758
4. me 1916.607207652884
5. me 1707.4784559694858
6. me 1685.238701009766
7. tohno 1642.094220578221
8. me 1553.8726019149778

Absolute Centroid Driver with metric Alt Intersection Distance
1. me 0.015384615384615385
2. tohno 0.017543859649122806
3. lewis 0.02
4. oscar 0.02631578947368421

LDA
1. tohno 1.9740782667309578E14
2. me 1.9740782667309403E14
3. lewis 1.9740782667309112E14
4. oscar 1.9740782667308762E14
>> No. 2102
>>2100
Yeah, it's fairly well known that stylometric analysis is sufficient to deanonymize users. I think even without known ground truths/labelled samples you could probably do some sort of clustering. The techniques employed by JGAAP also appear to be relatively primitive: they're mostly statistical methods operating at the lexeme level, and don't use any of the newer breakthroughs like transformers that could potentially allow for better matching based on content similarity as well.
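As a toy illustration of the label-free direction: embed each sample as a vector of function-word frequencies (a classic content-independent style feature) and compare samples by cosine similarity, which any clustering algorithm could then run on. This is just the general idea, not JGAAP's method.

```python
import math
from collections import Counter

# A small set of common English function words; real stylometry
# systems use hundreds of features, this is just illustrative.
FUNCTION_WORDS = ["the", "of", "and", "a", "to", "in", "that", "it", "is", "was"]

def style_vector(text):
    # Relative frequency of each function word in the sample.
    words = text.lower().split()
    counts = Counter(words)
    n = max(len(words), 1)
    return [counts[w] / n for w in FUNCTION_WORDS]

def cosine(u, v):
    # Cosine similarity; 0.0 if either vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

Two samples by the same author should land closer under this similarity than samples by different authors, at least once the texts are long enough for the frequencies to stabilize.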

Post edited on 15th Nov 2020, 1:54pm
>> No. 2103
>>2102
>deanonymize users
Stay safe.
Keep posts short.
>> No. 2104
>>2103
Ironically, unless everyone does it, you'll stick out the most. What would be neat is a GAN-like technique that "normalizes" any input you give it, making it impossible to distinguish different users. Maybe this already exists in the literature somewhere, but given that things like word vectors and GPT already perform some sort of dimensionality reduction, there's probably a way to go from "input sentence" -> "sparse representation" -> "normalized input."

Unfortunately there's an inherent limitation/asymmetry in that no amount of normalization can remove distinguishing features based on content. For instance, if poster A talks about a specific technology a lot, and you come across someone mentioning that same technology on another board, there's a very high chance that you've run into poster A. And there's no way to normalize that since the content being talked about is inherently identifying.
>> No. 2105
>>2104
Some Drexel students tried making something that could "anonymize" text input, but it doesn't use machine learning.
https://github.com/psal/anonymouth
>> No. 2106
>>2105
Hm, it doesn't seem to do the anonymization automatically yet. As of now it just highlights the distinguishing features, which the user can then edit manually.
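For a feel of what that highlighting amounts to, here's a toy version: rank words by how over-represented they are in your text relative to a background corpus. This is only the general idea; Anonymouth's actual feature set is richer (sentence lengths, character n-grams, and so on).

```python
from collections import Counter

def distinguishing_words(user_text, background_text, top_n=5):
    # Rank words by the ratio of their frequency in the user's text to
    # their frequency in a background corpus. A toy stand-in for the
    # kind of highlighting Anonymouth does, not its actual algorithm.
    user = Counter(user_text.lower().split())
    back = Counter(background_text.lower().split())
    u_total = sum(user.values())
    b_total = sum(back.values())

    def ratio(w):
        # Add-one smoothing so words unseen in the background corpus
        # don't cause a division by zero.
        return (user[w] / u_total) / ((back[w] + 1) / (b_total + 1))

    return sorted(user, key=ratio, reverse=True)[:top_n]
```

Words that score highest are the ones an editor (human or automatic) would want to rephrase first.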
