I would recommend using the Cloudera VM for these exercises. Alternatively, you can use the Elastic MapReduce (EMR) service in AWS.
Data Mining (Text Mining)
The basic data (text) mining project outline is as follows:
- StackExchange posts are available as a massive XML dump.
- Parse the XML dump and store the parsed records in HDFS.
- Analyze the parsed posts, for example with the sample exercises in the sections that follow.
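The parsing step can be sketched in plain Python as follows. This is a minimal illustration, not a full Hadoop job: it assumes the dump stores each post as a `<row .../>` element whose attributes (`Id`, `PostTypeId`, `Tags` and so on) hold the fields, and it turns each row into one JSON line, a format that is easy to split across mappers. The sample XML and the function names are made up for the example.

```python
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical two-row sample in the assumed shape of a Posts.xml dump.
SAMPLE_POSTS_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" CreationDate="2012-01-01T10:00:00.000" ViewCount="120" AnswerCount="1" Tags="&lt;hadoop&gt;&lt;hdfs&gt;" />
  <row Id="2" PostTypeId="2" ParentId="1" CreationDate="2012-01-01T10:30:00.000" />
</posts>
"""


def parse_rows(stream):
    """Yield one dict of attributes per <row> element, reading the input as a stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)  # copy the attributes before clearing
        elem.clear()  # drop the attributes and text of elements already handled


def to_json_lines(stream):
    """Convert every row to one JSON document per line."""
    return [json.dumps(row, sort_keys=True) for row in parse_rows(stream)]


lines = to_json_lines(io.BytesIO(SAMPLE_POSTS_XML))
for line in lines:
    print(line)
```

Written to a file, the resulting lines could then be copied into HDFS, for example with `hdfs dfs -put`.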
Posts Mining – Sample labs/exercises on the parsed StackExchange ‘Posts’
- Find all questions that do not have any answers.
- Find the most viewed questions in each category.
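Both exercises can be sketched on a handful of parsed posts. The sketch rests on some assumptions: each post is a dict of string attributes, `PostTypeId` is "1" for questions and "2" for answers, `AnswerCount` and `ViewCount` are present on questions, and "category" is read as a tag stored in the form `<tag1><tag2>`. The sample records and function names are invented for illustration; on the real dump the same logic would run as a MapReduce, Hive or Pig job.

```python
import re

# Hypothetical parsed posts (all attribute values are strings, as in the XML).
posts = [
    {"Id": "1", "PostTypeId": "1", "AnswerCount": "2", "ViewCount": "500", "Tags": "<hadoop><hdfs>"},
    {"Id": "2", "PostTypeId": "1", "AnswerCount": "0", "ViewCount": "900", "Tags": "<hadoop>"},
    {"Id": "3", "PostTypeId": "2", "ParentId": "1"},
    {"Id": "4", "PostTypeId": "1", "AnswerCount": "0", "ViewCount": "40", "Tags": "<hdfs>"},
]


def is_question(post):
    return post.get("PostTypeId") == "1"


def unanswered_questions(posts):
    """Return the ids of questions whose AnswerCount is zero."""
    return [p["Id"] for p in posts
            if is_question(p) and int(p.get("AnswerCount", 0)) == 0]


def most_viewed_per_tag(posts):
    """Map each tag to the (question id, view count) pair with the most views."""
    best = {}
    for p in posts:
        if not is_question(p):
            continue
        views = int(p.get("ViewCount", 0))
        for tag in re.findall(r"<([^>]+)>", p.get("Tags", "")):
            if tag not in best or views > best[tag][1]:
                best[tag] = (p["Id"], views)
    return best


print(unanswered_questions(posts))
print(most_viewed_per_tag(posts))
```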
Sample exercises on StackExchange ‘Users’
- Find the user profile with the most views.
- Find the top users by reputation points.
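A small sketch of the two user exercises, assuming each parsed user record carries `Views` and `Reputation` attributes as strings. The sample users and function names are hypothetical.

```python
import heapq

# Hypothetical parsed user records.
users = [
    {"Id": "1", "DisplayName": "alice", "Reputation": "1500", "Views": "30"},
    {"Id": "2", "DisplayName": "bob", "Reputation": "9200", "Views": "12"},
    {"Id": "3", "DisplayName": "carol", "Reputation": "4100", "Views": "75"},
]


def most_viewed_user(users):
    """Return the user record with the highest profile view count."""
    return max(users, key=lambda u: int(u.get("Views", 0)))


def top_by_reputation(users, n):
    """Return the n user records with the highest reputation, best first."""
    return heapq.nlargest(n, users, key=lambda u: int(u.get("Reputation", 0)))


print(most_viewed_user(users)["DisplayName"])
print([u["DisplayName"] for u in top_by_reputation(users, 2)])
```

`heapq.nlargest` avoids sorting the whole list, which matters little here but mirrors how a top-N reducer would keep only N candidates in memory.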
Sample mining of Comments
- Find the question post that has the highest number of comments, etc.
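One way to sketch this exercise is to count comments per post and keep only the posts that are questions. It assumes each comment record has a `PostId` attribute pointing at the post it belongs to, and that the set of question ids has already been derived from the parsed posts. The sample data and function name are invented.

```python
from collections import Counter

# Hypothetical inputs: ids of posts that are questions, and parsed comments.
question_ids = {"1", "2"}
comments = [
    {"Id": "10", "PostId": "1"},
    {"Id": "11", "PostId": "1"},
    {"Id": "12", "PostId": "2"},
    {"Id": "13", "PostId": "3"},  # post 3 is an answer, so it is ignored
    {"Id": "14", "PostId": "3"},
    {"Id": "15", "PostId": "3"},
]


def most_commented_question(comments, question_ids):
    """Return (question id, comment count) for the most commented question, or None."""
    counts = Counter(c["PostId"] for c in comments if c["PostId"] in question_ids)
    return counts.most_common(1)[0] if counts else None


print(most_commented_question(comments, question_ids))
```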
Predictive Analysis (on StackExchange Posts)
Sample Exercise – Find average time between a question appearing and an answer being posted
- For each posted question, only the fastest reply is taken into consideration: compute the time difference between the question being posted and its first reply.
- Average this difference over all the posts belonging to a category; the result serves as a prediction of the activity on a post in that category.
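The two steps above can be sketched as follows. The assumptions: answers carry a `ParentId` pointing at their question, `CreationDate` uses the format `2012-01-01T10:00:00.000`, and "category" is again read as a tag. Questions without any answer are left out of the average. Sample data and function names are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime

# Hypothetical parsed posts: three questions ("1") and three answers ("2").
posts = [
    {"Id": "1", "PostTypeId": "1", "CreationDate": "2012-01-01T10:00:00.000", "Tags": "<hadoop>"},
    {"Id": "2", "PostTypeId": "2", "ParentId": "1", "CreationDate": "2012-01-01T10:30:00.000"},
    {"Id": "3", "PostTypeId": "2", "ParentId": "1", "CreationDate": "2012-01-01T10:10:00.000"},
    {"Id": "4", "PostTypeId": "1", "CreationDate": "2012-01-01T12:00:00.000", "Tags": "<hadoop><hdfs>"},
    {"Id": "5", "PostTypeId": "2", "ParentId": "4", "CreationDate": "2012-01-01T13:00:00.000"},
    {"Id": "6", "PostTypeId": "1", "CreationDate": "2012-01-02T09:00:00.000", "Tags": "<hdfs>"},
]


def parse_ts(text):
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f")


def avg_first_answer_delay(posts):
    """Return the mean delay in seconds until the first answer, per tag."""
    # Step 1: earliest answer timestamp for each question id.
    first_answer = {}
    for p in posts:
        if p.get("PostTypeId") == "2":
            ts = parse_ts(p["CreationDate"])
            parent = p["ParentId"]
            if parent not in first_answer or ts < first_answer[parent]:
                first_answer[parent] = ts

    # Step 2: delay per answered question, grouped by tag, then averaged.
    delays = defaultdict(list)
    for p in posts:
        if p.get("PostTypeId") == "1" and p["Id"] in first_answer:
            asked = parse_ts(p["CreationDate"])
            delay = (first_answer[p["Id"]] - asked).total_seconds()
            for tag in re.findall(r"<([^>]+)>", p.get("Tags", "")):
                delays[tag].append(delay)
    return {tag: sum(values) / len(values) for tag, values in delays.items()}


print(avg_first_answer_delay(posts))
```

In a MapReduce setting, step 1 corresponds to a job keyed by question id that keeps the minimum answer timestamp, and step 2 to a second job keyed by category.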