Beep Boop Bip

File 16054637432.jpg - (438.18KB , 850x616 , punch.jpg )
No. 2100
I found a scary, but interesting toy.
Java Graphical Authorship Attribution Program
https://github.com/evllabs/JGAAP
>> No. 2101
File 160547130643.jpg - (60.16KB , 600x480 , 23685a619f8e58b436442d3a3bf545a7.jpg )
I fed it samples of my own, tohno's, Lewis Carroll's, and Oscar Wilde's writing, then gave it four unknown samples, one by each author. "Markov Chain Analysis" attributed none of them correctly, "LDA" attributed tohno's and mine correctly, and "Absolute Centroid Driver with metric Alt Intersection Distance" attributed mine, Carroll's, and Wilde's correctly. Why? I have no clue.

Canonicizers:
  Normalize Whitespace
  Strip AlphaNumeric
  Strip Null Characters
  Unify Case
EventDrivers (each culled with Most Common Events, numEvents: 75, except the last two):
  Word stems w/ Irregular Words
  Lexical Frequencies
  Binned Frequencies
  Stanford Part of Speech tagging (model: english-left3words-distsim)
  Word Lengths
  Sentence Length
  Word stems
  First Word In Sentence
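Out of curiosity, here's roughly what that kind of pipeline does in Python terms: canonicize the text, extract word events, cull to the 75 most common, then rank the known authors by distance to the unknown sample. The intersection distance below is my guess at the spirit of "Alt Intersection Distance", not JGAAP's actual implementation.

```python
import re
from collections import Counter

def canonicize(text):
    # Mirror the canonicizers above: strip non-alphanumeric characters,
    # unify case, normalize whitespace.
    text = re.sub(r"[^0-9A-Za-z\s]", "", text).lower()
    return re.sub(r"\s+", " ", text).strip()

def top_event_frequencies(text, num_events=75):
    # Event driver: words; event culler: keep only the most common events.
    words = canonicize(text).split()
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.most_common(num_events)}

def intersection_distance(a, b):
    # One plausible "intersection"-style distance: 1 minus the summed
    # overlap of the two frequency profiles (an assumption on my part).
    overlap = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in set(a) | set(b))
    return 1.0 - overlap

def attribute(unknown, known_samples):
    # known_samples: {author: text}; returns authors ranked by distance,
    # nearest first.
    profile = top_event_frequencies(unknown)
    scored = [(intersection_distance(profile, top_event_frequencies(t)), author)
              for author, t in known_samples.items()]
    return [author for _, author in sorted(scored)]
```

With only four short training samples this kind of nearest-centroid ranking is noisy, which might partly explain the inconsistent results above.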


My Sample

Analysis:
Markov Chain Analysis
1. oscar 1069.331221322057
2. lewis 1064.0639706955424
3. me 1003.2718351175538
4. lewis 953.4430985453695
5. tohno 874.210494638616
6. me 833.7232563444958
7. me 787.8843863772557
8. me 764.7286762552402

Absolute Centroid Driver with metric Alt Intersection Distance
1. me 0.02
2. lewis 0.023809523809523808
3. tohno 0.02564102564102564
4. oscar 0.03125

LDA
1. me 2.0198861851999578E14
2. tohno 2.0198861851999328E14
3. lewis 2.019886185199932E14
4. oscar 2.0198861851999075E14


Lewis Sample

Analysis:
Markov Chain Analysis
1. me 344.64592772385595
2. oscar 326.70974309813477
3. lewis 283.59897650666517
4. lewis 261.6183018407893
5. tohno 202.38131877604445
6. me 174.79604427564163
7. me 164.37704164369205
8. me 143.74712586614893

Absolute Centroid Driver with metric Alt Intersection Distance
1. lewis 0.015873015873015872
2. me 0.018867924528301886
3. oscar 0.0196078431372549
3. tohno 0.0196078431372549


LDA
1. me 1.6400230733690834E14
2. oscar 1.64002307336908E14
3. lewis 1.6400230733690712E14
4. tohno 1.6400230733690378E14


Oscar Sample

Analysis:
Markov Chain Analysis
1. lewis 200.38893005230196
2. oscar 170.0575418388182
3. tohno 146.13035171912483
4. me 145.04149066336706
5. me 138.31719973630894
6. me 132.81698296944657
7. lewis 125.63518710585872
8. me 111.41391522693222

Absolute Centroid Driver with metric Alt Intersection Distance
1. oscar 0.05555555555555555
1. me 0.05555555555555555
1. tohno 0.05555555555555555
1. lewis 0.05555555555555555

LDA
1. me 2.2603990291968966E14
2. tohno 2.260399029196884E14
3. lewis 2.260399029196875E14
4. oscar 2.260399029196863E14


Tohno Sample

Analysis:
Markov Chain Analysis
1. oscar 2116.6700988662474
2. lewis 2113.2694453947047
3. lewis 2038.2435926063758
4. me 1916.607207652884
5. me 1707.4784559694858
6. me 1685.238701009766
7. tohno 1642.094220578221
8. me 1553.8726019149778

Absolute Centroid Driver with metric Alt Intersection Distance
1. me 0.015384615384615385
2. tohno 0.017543859649122806
3. lewis 0.02
4. oscar 0.02631578947368421

LDA
1. tohno 1.9740782667309578E14
2. me 1.9740782667309403E14
3. lewis 1.9740782667309112E14
4. oscar 1.9740782667308762E14
>> No. 2102
>>2100
Yeah, it's fairly well known that stylometric analysis is sufficient to deanonymize users. I think even without known ground truths/labelled samples you could probably do some sort of clustering. The techniques employed by JGAAP also appear to be relatively primitive: they're mostly statistical methods operating at the lexeme level, and don't use any of the newer breakthroughs like transformers that could potentially allow for better matching based on content similarity as well.
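As a toy illustration of the label-free direction: embed each sample as a vector of function-word frequencies (a classic content-independent style feature) and compare samples by cosine similarity, which any clustering algorithm could then run on. This is just the general idea, not JGAAP's method.

```python
import math
from collections import Counter

# A small set of common English function words; real stylometry
# systems use hundreds of features, this is just illustrative.
FUNCTION_WORDS = ["the", "of", "and", "a", "to", "in", "that", "it", "is", "was"]

def style_vector(text):
    # Relative frequency of each function word in the sample.
    words = text.lower().split()
    counts = Counter(words)
    n = max(len(words), 1)
    return [counts[w] / n for w in FUNCTION_WORDS]

def cosine(u, v):
    # Cosine similarity; 0.0 if either vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

Two samples by the same author should land closer under this similarity than samples by different authors, at least once the texts are long enough for the frequencies to stabilize.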

Post edited on 15th Nov 2020, 1:54pm
>> No. 2103
>>2102
>deanonymize users
Stay safe.
Keep posts short.
>> No. 2104
>>2103
Ironically, unless everyone does it, you'll stick out the most. What would be neat is a GAN-like technique that "normalizes" any input you give it, making it impossible to distinguish different users. Maybe this already exists in the literature somewhere, but given that things like word vectors and GPT already perform some sort of dimensionality reduction, there's probably a way to go from "input sentence" -> "sparse representation" -> "normalized input."

Unfortunately there's an inherent limitation/asymmetry in that no amount of normalization can remove distinguishing features based on content. For instance, if poster A talks about a specific technology a lot, and you come across someone mentioning that same technology on another board, there's a very high chance that you've run into poster A. And there's no way to normalize that since the content being talked about is inherently identifying.
>> No. 2105
>>2104
Some Drexel students tried making something that could "anonymize" text input, but it doesn't use machine learning.
https://github.com/psal/anonymouth
>> No. 2106
>>2105
Hm, it doesn't seem to do the anonymization automatically yet. As of now it just highlights the distinguishing features, which the user can then edit manually.
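For a feel of what that highlighting amounts to, here's a toy version: rank words by how over-represented they are in your text relative to a background corpus. This is only the general idea; Anonymouth's actual feature set is richer (sentence lengths, character n-grams, and so on).

```python
from collections import Counter

def distinguishing_words(user_text, background_text, top_n=5):
    # Rank words by the ratio of their frequency in the user's text to
    # their frequency in a background corpus. A toy stand-in for the
    # kind of highlighting Anonymouth does, not its actual algorithm.
    user = Counter(user_text.lower().split())
    back = Counter(background_text.lower().split())
    u_total = sum(user.values())
    b_total = sum(back.values())

    def ratio(w):
        # Add-one smoothing so words unseen in the background corpus
        # don't cause a division by zero.
        return (user[w] / u_total) / ((back[w] + 1) / (b_total + 1))

    return sorted(user, key=ratio, reverse=True)[:top_n]
```

Words that score highest are the ones an editor (human or automatic) would want to rephrase first.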
