Goal: Implement a complete search engine. Milestones Overview
Milestone Goal #1 Produce an initial index for the corpus and a basic retrieval component
#2 Complete Search System
PROJECT: SEARCH ENGINE Corpus: all ICS web pages We will provide you with the crawled data as a zip file (webpages_raw.zip). This contains the downloaded content of the ICS web pages that were crawled by a previous quarter. You are expected to build your search engine index off of this data. Main challenges: Full HTML parsing, File/DB handling, handling user input (either using command line or desktop GUI application or web interface) COMPONENT 1 – INDEX: Create an inverted index for all the corpus given to you. You can either use a database to store your index (MongoDB, Redis, memcached are some examples) or you can store the index in a file. You are free to choose an approach here. The index should store more than just a simple list of documents where the token occurs. At the very least, your index should store the TF-IDF of every term/document. Sample Index:
Note: This is a simplistic example provided for your understanding. Please do not consider this as the expected index format. A good inverted index will store more information than this. Index Structure: token – docId1, tf-idf1 ; docId2, tf-idf2
Example: informatics – doc_1, 5 ; doc_2, 10 ; doc_3, 7 You are encouraged to come up with heuristics that make sense and will help in retrieving relevant search results. For e.g. – words in bold and in heading (h1, h2, h3) could be treated as more important than the other words. These are useful metadata that could be added to your inverted index data. Optional (1 point for each meta data item up to 2 points max):: Extra credit will be given for ideas that improve the quality of the retrieval, so you may add more metadata to your index, if you think it will help improve the quality of the retrieval. For this, instead of storing a simple TF-IDF count for every page, you can store more information related to the page (e.g. position of the words in the page). To store this information, you need to design your index in such a way that it can store and retrieve all this metadata efficiently. Your index lookup during search should not be horribly slow, so pay attention to the structure of your index COMPONENT 2 – SEARCH AND RETRIEVE: Your program should prompt the user for a query. This doesn’t need to be a Web interface, it can be a console prompt. At the time of the query, your program will look up your index, perform some calculations (see ranking below) and give out the ranked list of pages that are relevant for the query.
COMPONENT 3 – RANKING:
At the very least, your ranking formula should include tf-idf scoring, but you should feel free to add additional components to this formula if you think they improve the retrieval. Optional (1 point for each parameter up to 2 points max): Extra credit will be given if your ranking formula includes parameters to improve ranking other than tf-idf from the techniques discussed in class.
Milestone #1 (15 points) Goal: Build an index and a basic retrieval component By basic retrieval component; we mean that at this point you just need to be able to query your index for links (The query can be as simple as single word at this point). These links do not need to be accurate/ranked. We will cover ranking in the next milestone. At least the following queries should be used to test your retrieval: 1 – Informatics 2 – Mondego 3 – Irvine 4 – artificial intelligence 5 – computer science Note: query 4 and 5 are for milestone #2 Deliverables: Submit a report (pdf) in Canvas with the following content:
1. A table with assorted numbers pertaining to your index. It should have, at least the number of documents, the number of [unique] words, and the total size (in KB) of your index on disk.
2. Number of URLs retrieved for each of the queries above and listing the first 20 URLS for each query
● Was the report submitted on time?
● Are the reported numbers plausible?
● Are the reported URLs plausible?
Milestone #2 (45 points and 6 pts Goal: complete search engine Deliverables:
● Submit a zip file containing all the artifacts/programs you wrote for your search ● A live demonstration of your search engine
– Does your program work as expected of search engines? – How general are the heuristics that you employed to improve the retrieval? – How complete is the UI? (e.g. links to the actual pages, snippets, etc.)
– Do you demonstrate in-depth knowledge of how your search engine works? Are you able to answer detailed questions pertaining to any aspect of its implementation?
Additional Information: Understanding the data dump: In Assignment-2, crawlers of all the groups collectively crawled 37,497 URLs. We collected these URLS and are providing them to you as ‘webpages_clean.zip’ file. This zip file contains the following:
1. bookkeeping.json 2. bookkeeping.tsv 3. Folders 0 to 74
Folders: The 37,497 URLs are organized into 75 folders, each folder having 500 files. Every file has the extracted HTML source code of a particular URL. Bookeeping files: bookkeeping.json and bookkeeping.tsv are two different formats of the same file. These files maintain a list of all the URLs that have been crawled. Every URL has an identifier associated with it. This identifier helps locate the HTML code of the URL. The identifier is of the format: “folder_number/file_number” For example, consider the entry on line 13 of bookkeeping.json:
This means that the HTML code extracted for the link “vision.ics.uci.edu/papers/Ramanan BK_ICCV_2007” is located at folder 0, file number 108. Broken HTML: The HTML source code of the URLs may not be well formed. This means that the code may not necessarily have a pair of opening and closing tags. For example, there might be an open <strong> tag but the associated closing tag </strong> might be missing. The HTML parsers that you will use to parse the documents should be able to handle broken HTML. Hence, as mentioned above, while selecting the parser for your project, please ensure that it can handle broken HTML. Use of libraries: It is strictly not allowed to use libraries that perform the entire task of index creation or ranking for you. Hence, libraries such as Lucene or Elastic Search are not allowed. You may use libraries that help you achieve specific tasks. For example, you can use a tokenizer such as NLTK to tokenize your content. The HTML files
• A Zip file that contains crawl-able HTML files which you may parse/process for extracting tokens. • The HTML files have been organized and stored in numbered directories. The file names are
numbers as well. • The bookkeeping.json and bookkeeping.tsv files represent the index of all the HTML files.
• The key value of the json file is essentially the relative file path of the HTML content. The value is the web URL of the HTML content.
• Do not confuse bookkeeping with the inverted index. It simply provides you a means to access the crawl-able HTMLs programmatically. The key values in
• bookkeeping can also be used to uniquely identify the files. This will be useful when you need to retrieve the web page and the content while displaying your search engine results.
Building the inverted index • Now that you have been provided the HTML files to index. You may build your inverted
index off of them. • As most of you may already know, the inverted index is simply a map with the token as a key
and a list of its corresponding postings. • A posting is nothing but the representation of the token’s occurrence in a document. • The posting would typically (not limited to) contain the following info (you are encouraged
to think of other attributes that you could add to the index) : • The document name/id the token was found in. • The word frequency. • Indices of occurrence within the document • Tf-idf score etc
• When designing your inverted index, you will think about the structure of your posting first. • You would normally begin by implementing the code to calculate/fetch the elements which
will constitute your posting. • Modularize. For eg:- If you’re using python, use scripts that will perform a function or a set
of closely related functions. This helps in keeping track of your progress, debugging, and also dividing work amongst teammates if you’re in a group.
• You are free to choose any database system to store your inverted index. • Some possible options – Redis, MongoDB, memcached, MySQL etc. • Pro-tip : If you have a hard time choosing between the database systems. Read about their
performance and learning curves of the libraries available with the language of your choice.
Search and Retrieve • Once you have built the inverted index, you are ready to test document retrieval with
queries. • At the very least, the documents retrieved should be returned based on tf-idf scoring. This
can be done using • the cosine similarity method. Feel free to use a library to compute cosine similarity once you
have the term frequencies and inverse document frequencies. • You may add other weighting/scoring mechanisms to help refine the search results.
Why Work with Us
Top Quality and Well-Researched Papers
We always make sure that writers follow all your instructions precisely. You can choose your academic level: high school, college/university or professional, and we will assign a writer who has a respective degree.
Professional and Experienced Academic Writers
We have a team of professional writers with experience in academic and business writing. Many are native speakers and able to perform any task for which you need help.
Free Unlimited Revisions
If you think we missed something, send your order for a free revision. You have 10 days to submit the order for review after you have received the final document. You can do this yourself after logging into your personal account or by contacting our support.
Prompt Delivery and 100% Money-Back-Guarantee
All papers are always delivered on time. In case we need more time to master your paper, we may contact you regarding the deadline extension. In case you cannot provide us with more time, a 100% refund is guaranteed.
Original & Confidential
We use several writing tools checks to ensure that all documents you receive are free from plagiarism. Our editors carefully review all quotations in the text. We also promise maximum confidentiality in all of our services.
24/7 Customer Support
Our support agents are available 24 hours a day 7 days a week and committed to providing you with the best customer experience. Get in touch whenever you need any assistance.
Try it now!
How it works?
Follow these simple steps to get your paper done
Place your order
Fill in the order form and provide all details of your assignment.
Proceed with the payment
Choose the payment system that suits you most.
Receive the final file
Once your paper is ready, we will email it to you.
No need to work on your paper at night. Sleep tight, we will cover your back. We offer all kinds of writing services.
No matter what kind of academic paper you need and how urgent you need it, you are welcome to choose your academic level and the type of your paper at an affordable price. We take care of all your paper needs and give a 24/7 customer care support system.
Admission Essays & Business Writing Help
An admission essay is an essay or other written statement by a candidate, often a potential student enrolling in a college, university, or graduate school. You can be rest assurred that through our service we will write the best admission essay for you.
Our academic writers and editors make the necessary changes to your paper so that it is polished. We also format your document by correctly quoting the sources and creating reference lists in the formats APA, Harvard, MLA, Chicago / Turabian.
If you think your paper could be improved, you can request a review. In this case, your paper will be checked by the writer or assigned to an editor. You can use this option as many times as you see fit. This is free because we want you to be completely satisfied with the service offered.