I recommend using the Cloudera VM for these exercises. Alternatively, you can use the Elastic MapReduce (EMR) service in AWS.

Data Mining (Text Mining)

General Strategy

The basic data (text) mining project outline is as follows:

  1. Stack Exchange posts are available as a massive XML dump.
  2. Parse the XML dump and store the parsed records in HDFS.
  3. Run analyses on the parsed posts, such as the samples below:
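
The parsing step in the outline above can be sketched in Python. This is a minimal sketch, not a definitive implementation: it assumes the dump stores each record as a `<row ... />` element whose fields are XML attributes, and it writes one JSON object per line so that the output can later be copied into HDFS (for example with `hdfs dfs -put`). The names `parse_rows`, `to_json_lines` and the sample data are hypothetical.

```python
import io
import json
import xml.etree.ElementTree as ET


def parse_rows(source):
    """Stream the dump and yield one dict of attributes per <row> element."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)
            elem.clear()  # drop the element's contents once they are copied


def to_json_lines(source, sink):
    """Write each parsed row to `sink` as one JSON object per line."""
    count = 0
    for row in parse_rows(source):
        sink.write(json.dumps(row) + "\n")
        count += 1
    return count


# Tiny made-up sample standing in for the real dump (assumed layout).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" ViewCount="10" />
  <row Id="2" PostTypeId="2" ParentId="1" />
</posts>"""

out = io.StringIO()
n = to_json_lines(io.BytesIO(SAMPLE), out)
```

Streaming with `iterparse` avoids loading the whole dump into memory, which matters for a file of this size. Line-oriented JSON is one convenient choice because Hadoop jobs split input by line.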

Posts Mining – Sample labs/exercises on the parsed stackexchange ‘Posts’

  1. Find all questions that do not have any answers.
  2. Find the most viewed questions in each category.
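
Both exercises can be prototyped on a small in-memory sample before moving to a MapReduce job. The sketch below rests on several assumptions: each post is a dict of string attributes, questions have `PostTypeId` equal to `"1"`, the fields `AnswerCount`, `ViewCount` and `Tags` exist, tags are encoded as `<tag1><tag2>`, and a "category" is read as a tag. The function names and sample data are hypothetical.

```python
import re

TAG_PATTERN = re.compile(r"<([^<>]+)>")  # assumed tag encoding: <a><b>


def unanswered_questions(posts):
    """Return the questions whose AnswerCount is zero."""
    return [
        p for p in posts
        if p.get("PostTypeId") == "1" and int(p.get("AnswerCount", "0")) == 0
    ]


def most_viewed_per_tag(posts):
    """Map each tag to the question with the highest ViewCount under it."""
    best = {}
    for p in posts:
        if p.get("PostTypeId") != "1":
            continue  # skip answers and other post types
        views = int(p.get("ViewCount", "0"))
        for tag in TAG_PATTERN.findall(p.get("Tags", "")):
            if tag not in best or views > int(best[tag].get("ViewCount", "0")):
                best[tag] = p
    return best


# Made-up sample records.
posts = [
    {"Id": "1", "PostTypeId": "1", "AnswerCount": "0",
     "ViewCount": "50", "Tags": "<hadoop><hdfs>"},
    {"Id": "2", "PostTypeId": "1", "AnswerCount": "2",
     "ViewCount": "120", "Tags": "<hadoop>"},
    {"Id": "3", "PostTypeId": "2", "ParentId": "2"},
]

unanswered = unanswered_questions(posts)
top = most_viewed_per_tag(posts)
```

In a MapReduce version, the tag would serve as the map output key and the reducer would keep the maximum view count per key.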

Sample exercises on Stack Exchange ‘Users’

  1. Find the user profile with the most views.
  2. Find the top users by reputation points.
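
A minimal sketch of the two user exercises, assuming each user record is a dict of string attributes with `Views` and `Reputation` fields. The function names and sample users are hypothetical.

```python
import heapq


def most_viewed_profile(users):
    """Return the user record with the highest profile view count."""
    return max(users, key=lambda u: int(u.get("Views", "0")))


def top_by_reputation(users, n=10):
    """Return the n users with the highest reputation, best first."""
    return heapq.nlargest(n, users, key=lambda u: int(u.get("Reputation", "0")))


# Made-up sample records.
users = [
    {"Id": "1", "DisplayName": "alice", "Reputation": "1500", "Views": "40"},
    {"Id": "2", "DisplayName": "bob", "Reputation": "300", "Views": "95"},
    {"Id": "3", "DisplayName": "carol", "Reputation": "900", "Views": "10"},
]

most_viewed = most_viewed_profile(users)
top_two = top_by_reputation(users, n=2)
```

`heapq.nlargest` keeps only `n` candidates at a time, so it avoids sorting the full user list when only a short leaderboard is needed.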

Sample mining of Comments

  1. Find the question posts that have the highest number of comments.
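
One way to approach this exercise is to count comments per post and keep only the counts that belong to questions. The sketch assumes each comment record carries a `PostId` attribute pointing at the post it was left on, and that questions have `PostTypeId` equal to `"1"`. The function name and sample data are hypothetical.

```python
from collections import Counter


def most_commented_questions(posts, comments, n=1):
    """Return (question Id, comment count) pairs, most commented first."""
    question_ids = {p["Id"] for p in posts if p.get("PostTypeId") == "1"}
    counts = Counter(
        c["PostId"] for c in comments if c.get("PostId") in question_ids
    )
    return counts.most_common(n)


# Made-up sample records: posts 1 and 3 are questions, post 2 is an answer.
posts = [
    {"Id": "1", "PostTypeId": "1"},
    {"Id": "2", "PostTypeId": "2", "ParentId": "1"},
    {"Id": "3", "PostTypeId": "1"},
]
comments = [
    {"Id": "10", "PostId": "1"},
    {"Id": "11", "PostId": "1"},
    {"Id": "12", "PostId": "2"},
    {"Id": "13", "PostId": "2"},
    {"Id": "14", "PostId": "2"},
    {"Id": "15", "PostId": "3"},
]

winner = most_commented_questions(posts, comments, n=1)
```

Post 2 has the most comments overall, but it is an answer, so it is excluded from the result. On a cluster, this filter corresponds to a join between the comments and the posts on the post Id.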

Predictive Analysis (on StackExchange Posts)

Sample Exercise – Find average time between a question appearing and an answer being posted

General Strategy

  • For each posted question, only the fastest reply is considered: compute the time difference between the question being posted and its first answer.
  • Average this difference over all the questions belonging to a category. The average serves as an estimate of how quickly a new question in that category is likely to receive an answer.
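
The strategy above can be sketched as follows. The assumptions are that answers (`PostTypeId` of `"2"`) reference their question through `ParentId`, that `CreationDate` is an ISO 8601 timestamp, that tags are encoded as `<tag1><tag2>`, and that a "category" is read as a tag. Function names and sample data are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime

TAG_PATTERN = re.compile(r"<([^<>]+)>")  # assumed tag encoding: <a><b>


def first_answer_delays(posts):
    """Map question Id to seconds elapsed until its earliest answer."""
    asked = {
        p["Id"]: datetime.fromisoformat(p["CreationDate"])
        for p in posts if p.get("PostTypeId") == "1"
    }
    first = {}
    for p in posts:
        qid = p.get("ParentId")
        if p.get("PostTypeId") != "2" or qid not in asked:
            continue
        answered = datetime.fromisoformat(p["CreationDate"])
        if qid not in first or answered < first[qid]:
            first[qid] = answered  # keep only the fastest reply
    return {qid: (first[qid] - asked[qid]).total_seconds() for qid in first}


def average_delay_per_tag(posts):
    """Average the first-answer delay over the answered questions of each tag."""
    delays = first_answer_delays(posts)
    per_tag = defaultdict(list)
    for p in posts:
        if p.get("Id") in delays:
            for tag in TAG_PATTERN.findall(p.get("Tags", "")):
                per_tag[tag].append(delays[p["Id"]])
    return {tag: sum(v) / len(v) for tag, v in per_tag.items()}


# Made-up sample: question 1 is answered after 10 minutes, question 2 after 30.
posts = [
    {"Id": "1", "PostTypeId": "1", "CreationDate": "2020-01-01T10:00:00.000",
     "Tags": "<hadoop>"},
    {"Id": "2", "PostTypeId": "1", "CreationDate": "2020-01-01T12:00:00.000",
     "Tags": "<hadoop><hive>"},
    {"Id": "3", "PostTypeId": "1", "CreationDate": "2020-01-01T13:00:00.000",
     "Tags": "<hive>"},
    {"Id": "4", "PostTypeId": "2", "ParentId": "1",
     "CreationDate": "2020-01-01T10:30:00.000"},
    {"Id": "5", "PostTypeId": "2", "ParentId": "1",
     "CreationDate": "2020-01-01T10:10:00.000"},
    {"Id": "6", "PostTypeId": "2", "ParentId": "2",
     "CreationDate": "2020-01-01T12:30:00.000"},
]

avg = average_delay_per_tag(posts)
```

Unanswered questions (question 3 here) are left out of the average. That is a design choice: including them would require deciding what delay to assign to a question that never received an answer.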

Anuj holds professional certifications in Google Cloud, AWS as well as certifications in Docker and App Performance Tools such as New Relic. He specializes in Cloud Security, Data Encryption and Container Technologies.

Anuj Varma, Hands-On Technology Architect, Clean Air Activist.