What is an internet search engine made of?
How can I search through a lot of information in many documents or web pages?
I can search this information using an information retrieval system.
Information retrieval
An information retrieval system executes queries against unstructured text documents, but a raw document is not ready for searching: the user searches for only a few keywords, which appear in scattered positions in the document, and the document may also contain synonyms of those keywords, and so on. For this reason a preprocessing step is necessary.
The most important preprocessing operations are the following:
- Stemming: Replacing words with their stems. For instance the English word “bikes” is replaced with its stem “bike”; now the query “bike” can find both documents containing “bike” and those containing “bikes”.
- Stop Words Filtering: Common words like “the”, “and” and “a” rarely add any value to a search. Removing them shrinks the index size and increases performance. It may also reduce some “noise” and actually improve search quality.
- Text Normalization: Stripping accents and other character markings can make for better searching.
- Synonym Expansion: Adding in synonyms at the same token position as the current word can mean better matching when users search with words in the synonym set.
- Term Frequency: extracting the most frequent terms of each document.
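As a concrete illustration of stemming and stop-word filtering, here is a minimal sketch using Apache Lucene (introduced in the next section), assuming a recent Lucene version with no-argument analyzer constructors; the sample sentence and field name are my own, and the exact tokens depend on the version:

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class PreprocessingDemo {
        public static void main(String[] args) throws IOException {
            Analyzer analyzer = new EnglishAnalyzer(); // English stop words + stemming
            TokenStream ts = analyzer.tokenStream("contents",
                    "The cyclist rides a bike and bikes daily");
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset(); // the consumer of a TokenStream must call reset() first
            while (ts.incrementToken()) {
                // "The" and "a" are dropped as stop words; "rides" and "bikes"
                // are stemmed, so the output is roughly: cyclist ride bike bike daili
                System.out.println(term.toString());
            }
            ts.end();
            ts.close();
            analyzer.close();
        }
    }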
The information retrieval system executes the preceding operations and saves the processed document, together with a pointer to the original document, in an index file.
When the user searches for a keyword, the keyword is looked up in the index file and the results are shown to the user ordered by ranking (level of relevance); when the user clicks on a link, the corresponding document opens.
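Conceptually, the index file maps each processed term to the documents that contain it (an inverted index). A toy sketch of the idea, leaving out term positions, ranking statistics, and persistence:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;
    import java.util.TreeSet;

    public class ToyInvertedIndex {
        // term -> ids (pointers) of the original documents that contain it
        private final Map<String, Set<Integer>> index = new HashMap<>();

        public void add(int docId, String processedText) {
            for (String term : processedText.split("\\s+")) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }

        public Set<Integer> search(String keyword) {
            return index.getOrDefault(keyword, Collections.emptySet());
        }
    }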
Apache Lucene
Lucene is a framework composed of many libraries for executing queries on text documents, like an information retrieval system.
The principal classes are the following:
- org.apache.lucene.document.Document: allows storing the principal information of a document (for example title, content, or creation date) into the index as fields. When executing a query it is necessary to specify which field of the document is searched.
- org.apache.lucene.analysis.standard.StandardAnalyzer: the analyzer that processes the documents to extract the information to save into the index, for example by removing stop words; stemming and the other preprocessing operations written above are performed by the language-specific analyzers described below.
- org.apache.lucene.search.Query: represents the query to execute (an IndexSearcher actually runs it).
- org.apache.lucene.search.ScoreDoc: represents a single result together with its relevance score, used to rank the results.
- org.apache.poi.extractor.ExtractorFactory: from Apache POI rather than Lucene; it is very useful for extracting text from Office documents such as Word or Excel so that they can be indexed and queried.
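A minimal end-to-end sketch of how these classes fit together; the field names, sample text, and the in-memory ByteBuffersDirectory are my own choices, assuming Lucene 8 or later:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class LuceneDemo {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            Directory dir = new ByteBuffersDirectory(); // in-memory index, for the sketch

            // Indexing: each piece of information becomes a field of the Document.
            IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
            Document doc = new Document();
            doc.add(new TextField("title", "My first document", Field.Store.YES));
            doc.add(new TextField("contents",
                    "Lucene makes it easy to index and search text", Field.Store.YES));
            writer.addDocument(doc);
            writer.close();

            // Searching: the query targets one field; hits arrive ordered by score.
            IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
            Query query = new QueryParser("contents", analyzer).parse("search");
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                Document found = searcher.doc(hit.doc);
                System.out.println(found.get("title") + " (score: " + hit.score + ")");
            }
        }
    }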
My Project
I tested all of the above libraries for searching information in Word and Excel documents, and they work well: Lucene creates the index files, executes the queries, and shows the results ordered by ranking.
There are many analyzers:
- org.apache.lucene.analysis.standard.StandardAnalyzer
- org.apache.lucene.analysis.en.EnglishAnalyzer
- org.apache.lucene.analysis.it.ItalianAnalyzer
Each analyzer performs different preprocessing, because English stop words are different from Italian stop words; the same holds for synonyms and stemming.
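To see the difference concretely, here is a small sketch (the sample sentence and helper are my own; “gli” and “la” are Italian articles that appear in Lucene's Italian stop-word list but not in the English one):

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.it.ItalianAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class AnalyzerComparison {
        static void printTokens(Analyzer analyzer, String text) throws IOException {
            TokenStream ts = analyzer.tokenStream("contents", text);
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) System.out.print(term + " ");
            ts.end();
            ts.close();
            System.out.println();
        }

        public static void main(String[] args) throws IOException {
            String text = "gli amici mangiano la pizza";
            try (Analyzer en = new EnglishAnalyzer(); Analyzer it = new ItalianAnalyzer()) {
                printTokens(en, text); // keeps "gli" and "la": not English stop words
                printTokens(it, text); // drops "gli" and "la", applies Italian stemming
            }
        }
    }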
At first I thought that the preprocessing step was not executed automatically and that it was necessary to call the method analyzer.tokenStream myself.
When I execute the following code:
    for (File f : queue) {
        FileReader fr = new FileReader(f); // opened but never used
        TokenStream ts = analyzer.tokenStream("contents",
                ExtractorFactory.createExtractor(f).getText());
        doc.add(new TextField("contents", ts));
        ts.reset();              // the stream will be consumed again later...
        ts.close();              // ...but it is already closed here
        writer.addDocument(doc); // the IndexWriter now tries to consume ts
    }
the console shows the following error:
java.lang.IllegalStateException: TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.
    at org.apache.lucene.analysis.Tokenizer$1.read(Tokenizer.java:111)
This is not the correct way to pass the preprocessed text to the indexed document. When a TokenStream is given to a field, the IndexWriter itself consumes the stream (calling reset(), incrementToken(), end(), and close()) inside addDocument; calling reset() and close() manually beforehand violates the contract and causes the exception above.
When I add the field with the plain text directly to the indexed document there are no exceptions and the query is executed correctly, but when I pass a token stream as above the previous exception occurs.
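The simplest fix, then, is to hand the extracted text to the field as a plain String and let the IndexWriter's analyzer run the preprocessing itself. A sketch of the corrected loop (reusing queue, writer, and the POI extractor from the snippet above; the “path” field is a hypothetical addition, and Field, StringField, and TextField come from org.apache.lucene.document):

    for (File f : queue) {
        Document doc = new Document();
        String text = ExtractorFactory.createExtractor(f).getText();
        // the analyzer configured on the IndexWriter tokenizes this field
        // inside addDocument(), so no manual tokenStream handling is needed
        doc.add(new TextField("contents", text, Field.Store.NO));
        doc.add(new StringField("path", f.getAbsolutePath(), Field.Store.YES));
        writer.addDocument(doc);
    }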
In my opinion, another way to perform the four preceding preprocessing operations is to do it manually, by building simple dictionaries:
- Stop-word dictionary: a simple XML file with all the words to be removed while documents are processed to build the index; the same words are also removed from the user's queries.
- Stem/synonym dictionary: a simple database table with two columns: a key column containing the base term, and a second column containing the terms to replace.
- For example, BASE TERM: eat; TERMS TO REPLACE: ate, eating, eats, have a meal.
- The table is loaded into a class when the application starts, by executing the query “select * from stem_synonym_dictionary”.
- When I process a document or a query and I find “eating” or “have a meal”, I replace it with “eat” (a simple search and replace on the string).
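A minimal sketch of this manual approach; the column names base_term and term_to_replace are assumptions, as is the comma-separated storage of the variants:

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.HashMap;
    import java.util.Map;

    public class StemSynonymDictionary {
        // variant term -> base term, e.g. "eating" -> "eat"
        private final Map<String, String> replacements = new HashMap<>();

        // Loaded once at application start-up.
        public void load(Connection conn) throws SQLException {
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("select * from stem_synonym_dictionary")) {
                while (rs.next()) {
                    String base = rs.getString("base_term");           // e.g. "eat"
                    String variants = rs.getString("term_to_replace"); // e.g. "ate,eating,eats,have a meal"
                    for (String variant : variants.split(",")) {
                        replacements.put(variant.trim(), base);
                    }
                }
            }
        }

        // Applied to documents while indexing and to user queries before searching.
        public String normalize(String text) {
            for (Map.Entry<String, String> e : replacements.entrySet()) {
                text = text.replace(e.getKey(), e.getValue());
            }
            return text;
        }
    }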