WEB MINING LAB
In this exercise you will create a program that infers user interests from a set of visited web pages derived from a “simulated user”. (This could also be just a link or a list of links that you provide to your program, similar to the last lab, where we provided it a starting point.) While a large number of algorithms are in use to help infer interests and preferences from text and web-based interactions, our approach will be limited to only the simulated pages and information you provide. You may need to research the basics of common information retrieval methods and/or web and text mining in order to develop the best approach. Think about what information you may need to mine from a page in order to accomplish this task!
For this assignment you may work with a partner if you wish, but you do not have to. Develop your own algorithm or set of algorithms (or, if your program is not running properly, a set of ideas/techniques describing how it should work) that determines a user’s interests (in terms of derived keywords) based on a set of “visited” websites from the simulated user. (Hint: It may be helpful to use part of your web crawler from an earlier lab! In some cases you may want to strip HTML tags out of the page source, or you may choose to keep them.) You may use any language of your choice for this assignment. You may want to run your original crawler and then replace all of the links in “urls.txt” with several links that simulate the pages the user has visited.
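As one possible starting point (not the required approach), the idea of stripping HTML tags and deriving keywords could be sketched as below. This is a minimal sketch in Python under several assumptions: the page sources have already been fetched into strings, plain word frequency is used as the measure of interest, and the names `STOP_WORDS`, `TextExtractor`, `strip_html`, `tokenize`, and `infer_interests` are hypothetical and made up for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Small illustrative stop-word list (an assumption; extend it for real use).
STOP_WORDS = {"the", "and", "for", "with", "this", "that", "are", "from",
              "you", "your", "was", "has", "have", "not", "but", "all"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Return the text of a page with tags, scripts, and styles removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text):
    """Lowercase the text and keep alphabetic words longer than two letters."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) > 2 and w not in STOP_WORDS]


def infer_interests(pages, top_n=5):
    """Return the top_n most frequent keywords across all page sources."""
    counts = Counter()
    for html in pages:
        counts.update(tokenize(strip_html(html)))
    return [word for word, _ in counts.most_common(top_n)]


if __name__ == "__main__":
    # Stand-in page sources; in the lab these would come from your crawler.
    sample_pages = [
        "<html><body><h1>Guitar lessons</h1><p>Learn guitar chords.</p></body></html>",
        "<p>Best guitar strings and chords</p><script>var x = 1;</script>",
    ]
    print(infer_interests(sample_pages, top_n=2))
```

Raw frequency favors words that are common on every page, so a weighting scheme that discounts such words, or one that gives extra weight to text in titles and headings, may produce better keywords. That choice is part of what your write-up should justify.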
Turn in:
-All source code
-A file containing the URLs of the websites used
-A brief write-up on how your program works and why you decided on your particular approach. How well do you think your method worked?
-A text file/output file showing the final results of what your program determined to be the user’s interests.
-If you worked with a partner, be sure to include both names on the assignment.
Also answer the following questions and submit with your write-up:
1. What problems do you see related to search and web mining?
2. Why is it so important to figure out what users are interested in?
3. Why is it so hard to figure out what users are interested in?
4. Do you find it difficult to represent users? How should they be represented in a system?
5. Do you view search engines or web mining differently after the last two labs?