Fast.AI Drivetrain ML approach: Find Secret Pseudonym of "Jane Doe"
Alas! I'm working on my first Machine Learning (AI) project with a friend. We're trying to find a small work of fiction that a particular author posted on a fiction site under a pseudonym. There are hints and clues about this work on Reddit, so my friend is using his web scraping experience and I'm trying to apply what I'm learning right now in my Fast.AI course.
The course discusses the Drivetrain Approach for planning your AI setup and goals. From the Fast.AI course: "In 2012 Jeremy [Howard], along with Margit Zwemer and Mike Loukides, introduced a method called the Drivetrain Approach for thinking about this issue" (See Paper Here).
As I read this I got very excited, maybe too excited, because I can use this to plan how we'll try to use AI to find this hidden work of fiction.
(image from Fast.AI)
So here's what I've put together so far. NOTE: Jane Doe is used in place of the author's known name so that we will be the first to find this work of fiction:
Our Drivetrain Approach To Find Jane Doe's Secret Work:
Google Example Objective: Show the most relevant search result
Of all the pieces that are listed on Reddit as "All-Time Greats", which did Jane Doe write under a pseudonym?
Google Example Levers: Ranking search results
1) Rankings of how well the AI identifies which posts contain "All-Time Greats", and maybe how well it has picked out the titles/authors in those posts:
a) this is a list of "All-Time Greats" (% likely)
b) this is a list of NONSENSE! (% likely)
2) Rankings of how well the AI is categorizing different written works:
a) this is written by Jane Doe (% likely)
b) this is not written by Jane Doe (% likely)
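To make the "% likely" idea a bit more concrete, here is a minimal Python sketch of turning scores into a ranking. It assumes some classifier has already given each candidate work a probability of being by Jane Doe. The titles, the scores, and the `rank_candidates` helper are all made up for illustration.

```python
def rank_candidates(scores, threshold=0.5):
    """Return (title, probability) pairs sorted from most to least likely.

    `scores` maps a title to a hypothetical P(written by Jane Doe).
    Titles scoring below `threshold` are dropped.
    """
    kept = [(title, p) for title, p in scores.items() if p >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up scores, purely for illustration.
example_scores = {"Story A": 0.91, "Story B": 0.12, "Story C": 0.67}
ranked = rank_candidates(example_scores)
```

The threshold of 0.5 is an arbitrary starting point. In practice we would probably just read the top of the list by hand rather than trust a hard cutoff.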
Data:
1) We will scrape all the data from the relevant Reddit posts, and the comments on those posts, using Python.
2) We will also scrape all available written works by Jane Doe and several similar writers.
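For the Reddit part, here is a rough Python sketch of pulling the comment text out of a thread once the JSON has been downloaded. It assumes the nested "Listing" shape Reddit returns for a thread's comments (comments have kind `t1`, the text is in `body`, and `replies` is either another Listing or an empty string), which we should double-check against real data. `collect_comment_bodies` and the sample thread are hypothetical, and the downloading itself is left out.

```python
def collect_comment_bodies(listing):
    """Recursively gather comment text from a Reddit-style comment Listing."""
    bodies = []
    for child in listing.get("data", {}).get("children", []):
        if child.get("kind") != "t1":  # "t1" marks a comment; skip "more" stubs
            continue
        data = child["data"]
        bodies.append(data.get("body", ""))
        replies = data.get("replies")
        if isinstance(replies, dict):  # an empty string means no replies
            bodies.extend(collect_comment_bodies(replies))
    return bodies


# Hand-written sample in the assumed shape, not real Reddit data.
sample = {
    "kind": "Listing",
    "data": {"children": [
        {"kind": "t1", "data": {
            "body": "My all-time great is Story A",
            "replies": {
                "kind": "Listing",
                "data": {"children": [
                    {"kind": "t1",
                     "data": {"body": "Seconding Story A", "replies": ""}},
                ]},
            },
        }},
        {"kind": "more", "data": {"children": ["abc123"]}},
    ]},
}
comments = collect_comment_bodies(sample)
```

Each parent comment comes out before its replies, so the thread order is roughly preserved when we later look for titles and authors.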
Models:
1) Perhaps we'll fine-tune pre-trained models that already categorize text. We could fine-tune one of these to categorize which posts/comments contain "All-Time Greats", then spit out the list of posts.
2) I believe a university already has a model that can tell the difference between female and male writers. Perhaps we can train a model to categorize different written works by author, focusing particularly on: IS Jane Doe (% likely) or IS NOT Jane Doe (% likely).
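Before fine-tuning anything, a much simpler baseline might help us sanity-check the idea. The sketch below is not a fine-tuned model. It swaps in a classic stylometry trick: compare character trigram counts using cosine similarity. The function names and the tiny sample texts are made up, and the similarity score is not a true "% likely" probability.

```python
from collections import Counter
import math


def trigram_profile(text):
    """Count overlapping character trigrams in lowercased text."""
    text = " ".join(text.lower().split())  # normalize whitespace
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine_similarity(a, b):
    """Cosine similarity between two trigram count profiles (0.0 to 1.0)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_author(unknown_text, known_texts):
    """Return (author, similarity) for the most similar known author."""
    unknown = trigram_profile(unknown_text)
    scored = {author: cosine_similarity(unknown, trigram_profile(text))
              for author, text in known_texts.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Tiny made-up texts, only to show the call pattern.
known = {
    "Jane Doe": "the quiet river ran past the quiet house",
    "Other": "ZZZ QQQ XXX!!! 12345",
}
author, score = closest_author("the quiet house by the river", known)
```

With real data, each author's entry would be all of their scraped writing joined together. If this baseline cannot separate Jane Doe from the similar writers at all, that would be a useful warning before we spend time on fine-tuning.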