Nowadays almost everyone has internet access and there are a huge number of web pages, so it is quite difficult to search them accurately and find the information you need.
The problem is that information online is not structured like a database:
a web page is a complex document with pictures, videos and markup meta-information that controls graphic effects and the interaction with the user.
One way to index text is to use an information retrieval system such as Lucene, but that is not enough, because Lucene only works on the text extracted from your web page.
Lucene builds a structure called an index, which maps each keyword to the documents that contain it. When you look for some information, Lucene searches for your terms among these keywords and assigns each matching document a score, which is used to rank the results.
The disadvantage of Lucene is that it performs a simple search on the strings that make up the document's keywords. Lucene doesn't know the semantics of the text; it only knows which keywords the document contains.
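The keyword index described above can be illustrated with a minimal sketch in plain Java. This is not Lucene's API: the class `InvertedIndexSketch`, its methods and the sample documents are hypothetical, and the ranking is simply the number of occurrences, which is far cruder than real scoring.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy inverted index: an illustration of the idea only, NOT Lucene's API.
final class InvertedIndexSketch {
    // keyword -> (document id -> number of occurrences in that document)
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();

    // Splits the text into lower-case keywords and records them for the document.
    void add(String docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, k -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Returns the ids of the documents containing the keyword,
    // most occurrences first, ties broken by document id.
    List<String> search(String keyword) {
        Map<String, Integer> hits =
                postings.getOrDefault(keyword.toLowerCase(), Collections.emptyMap());
        List<String> ids = new ArrayList<>(hits.keySet());
        ids.sort((a, b) -> {
            int byCount = Integer.compare(hits.get(b), hits.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return ids;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("d1", "Pizza with mild cheese");
        index.add("d2", "Cheese, cheese and more cheese");
        System.out.println(index.search("cheese"));
        // A purely keyword-based index finds nothing for a related word
        // that does not literally occur in the text.
        System.out.println(index.search("topping"));
    }
}
```

The second search shows the limitation: the index has no idea that cheese is a topping, because that knowledge is not in the strings themselves.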
A better way to search written documents is to enrich them with specific meta-information by adding extra tags to the HTML document.
These tags are not visible to the human reader, but they help an external information retrieval system index the page.
This is called the Semantic Web.
The Semantic Web relies on a special file, often serialized as XML, that contains all the rules and constraints of a domain; the objects described in it must respect these rules.
This file is called an ontology and it is the knowledge base (KB) of the system.
In this way it is possible to convert static, unstructured text into a structured document with many constraints and rules (like a database), and to run queries against it. Ontologies are written in OWL (Web Ontology Language), while the queries are written in a query language such as SPARQL.
There are many schemas and vocabularies for building the Semantic Web, but the most widespread data model underneath them is RDF.
RDF
RDF stands for Resource Description Framework and it is the standard model for data interchange on the Web.
This is an example of RDF written by me using Apache Jena:
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:vcard="http://www.w3.org/2001/vcard-rdf/3.0#">
  <rdf:Description rdf:about="http://www.rizzimichele.it">
    <vcard:N rdf:nodeID="A0"/>
    <vcard:FN>Michele Rizzi</vcard:FN>
  </rdf:Description>
  <rdf:Description rdf:nodeID="A0">
    <vcard:Family>Rizzi</vcard:Family>
    <vcard:Given>Michele</vcard:Given>
  </rdf:Description>
</rdf:RDF>
In the previous example I stated that the address www.rizzimichele.it has a vCard name node (A0), described in the second block, with family name Rizzi and given name Michele.
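To make the link between the two descriptions concrete, here is a minimal sketch that follows it by hand using only the XML parser of the Java standard library. It is not a general RDF parser and it does not use Jena: the class `VcardRdfSketch` and its helper methods are hypothetical, and it assumes the document uses exactly the `rdf:Description` layout of the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hand-rolled navigation of the vCard example; a sketch, not an RDF parser.
final class VcardRdfSketch {
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String VCARD = "http://www.w3.org/2001/vcard-rdf/3.0#";

    // Same content as the RDF/XML example in the text.
    static final String XML =
          "<rdf:RDF xmlns:rdf=\"" + RDF + "\" xmlns:vcard=\"" + VCARD + "\">"
        + "<rdf:Description rdf:about=\"http://www.rizzimichele.it\">"
        + "<vcard:N rdf:nodeID=\"A0\"/><vcard:FN>Michele Rizzi</vcard:FN>"
        + "</rdf:Description>"
        + "<rdf:Description rdf:nodeID=\"A0\">"
        + "<vcard:Family>Rizzi</vcard:Family><vcard:Given>Michele</vcard:Given>"
        + "</rdf:Description>"
        + "</rdf:RDF>";

    // Finds the rdf:Description whose rdf:about or rdf:nodeID equals the given value.
    static Element describe(Document doc, String attribute, String value) {
        NodeList nodes = doc.getElementsByTagNameNS(RDF, "Description");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            if (value.equals(e.getAttributeNS(RDF, attribute))) {
                return e;
            }
        }
        return null;
    }

    // Follows subject -> vcard:N -> blank node -> vcard:Family; null if any step is missing.
    static String familyNameOf(String xml, String subjectUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to match rdf: and vcard: by namespace
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element subject = describe(doc, "about", subjectUri);
            if (subject == null) {
                return null;
            }
            NodeList n = subject.getElementsByTagNameNS(VCARD, "N");
            if (n.getLength() == 0) {
                return null;
            }
            String nodeId = ((Element) n.item(0)).getAttributeNS(RDF, "nodeID");
            Element name = describe(doc, "nodeID", nodeId);
            if (name == null) {
                return null;
            }
            NodeList family = name.getElementsByTagNameNS(VCARD, "Family");
            return family.getLength() == 0 ? null : family.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the RDF/XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(familyNameOf(XML, "http://www.rizzimichele.it"));
    }
}
```

A library such as Jena does this navigation for any valid RDF serialization, so a sketch like this is only useful for seeing how the node ID ties the two blocks together.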
This is a simple RDF example, but an RDF/XML file can become as complex as a WSDL or XSD file, which acts as the identity card of a web service.
A more complex RDF example is linked here.
For example, the following tags define part of the domain.
<owl:ObjectProperty rdf:about="#hasTopping">
  <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#InverseFunctionalProperty"/>
  <rdfs:comment xml:lang="en">Note that hasTopping is inverse functional because isToppingOf is functional</rdfs:comment>
  <rdfs:domain rdf:resource="#Pizza"/>
  <rdfs:subPropertyOf rdf:resource="#hasIngredient"/>
  <rdfs:range rdf:resource="#PizzaTopping"/>
  <owl:inverseOf rdf:resource="#isToppingOf"/>
</owl:ObjectProperty>
The property #hasTopping is a subproperty of #hasIngredient: every topping is an ingredient.
hasTopping is the inverse of isToppingOf, which expresses the same relation from the topping's point of view.
hasTopping has a specified domain (#Pizza) and range (#PizzaTopping).
This is a class that respects all the rules and constraints of the system:
<owl:Class rdf:about="#FourCheesesTopping">
  <rdfs:label xml:lang="pt">CoberturaQuatroQueijos</rdfs:label>
  <rdfs:label xml:lang="it">Quattro Formaggi</rdfs:label>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#hasSpiciness"/>
      <owl:someValuesFrom rdf:resource="#Mild"/>
    </owl:Restriction>
  </rdfs:subClassOf>
  <rdfs:subClassOf rdf:resource="#CheeseTopping"/>
</owl:Class>
This topping is a subclass of CheeseTopping and has mild spiciness.
Apache Jena
Jena is a Java library for reading and writing RDF in several serializations, such as RDF/XML and Turtle (TTL), and for working with OWL ontologies. It can also run queries and perform reasoning over the RDF graph to find the information you need.
This is an example of a query that can be run with Jena:
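PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl:   <http://www.w3.org/2002/07/owl#>
# Assumed namespace of the pizza ontology; adjust it to the file you load
PREFIX pizza: <http://www.co-ode.org/ontologies/pizza/pizza.owl#>

SELECT ?pizza WHERE {
  ?pizza a owl:Class ;
         rdfs:subClassOf ?restriction .
  ?restriction owl:onProperty pizza:hasSpiciness ;
               owl:someValuesFrom pizza:Mild .
}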
In the previous query I am looking for the classes of the pizza ontology whose spiciness is restricted to Mild.
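What the query engine does with those two patterns can be sketched by hand in plain Java. This is only an illustration of triple-pattern matching, not how Jena executes queries: the class `TriplePatternSketch`, the hand-written triples and the blank node label `_:r1` are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Naive evaluation of the two query patterns over a tiny in-memory list of triples.
final class TriplePatternSketch {
    // One RDF statement; subject, predicate and object are plain strings here.
    static final class Triple {
        final String s;
        final String p;
        final String o;

        Triple(String s, String p, String o) {
            this.s = s;
            this.p = p;
            this.o = o;
        }
    }

    // Hand-written triples mirroring the FourCheesesTopping class;
    // "_:r1" stands for the anonymous owl:Restriction node.
    static final List<Triple> DATA = Arrays.asList(
        new Triple("pizza:FourCheesesTopping", "rdf:type", "owl:Class"),
        new Triple("pizza:FourCheesesTopping", "rdfs:subClassOf", "_:r1"),
        new Triple("pizza:FourCheesesTopping", "rdfs:subClassOf", "pizza:CheeseTopping"),
        new Triple("_:r1", "owl:onProperty", "pizza:hasSpiciness"),
        new Triple("_:r1", "owl:someValuesFrom", "pizza:Mild"),
        new Triple("pizza:CheeseTopping", "rdf:type", "owl:Class"));

    // True if the exact triple is present in DATA.
    static boolean has(String s, String p, String o) {
        for (Triple t : DATA) {
            if (t.s.equals(s) && t.p.equals(p) && t.o.equals(o)) {
                return true;
            }
        }
        return false;
    }

    // Classes that are a subclass of a restriction "hasSpiciness some Mild".
    static List<String> mildClasses() {
        List<String> result = new ArrayList<>();
        for (Triple t : DATA) {
            if (!t.p.equals("rdfs:subClassOf")) {
                continue; // pattern: ?pizza rdfs:subClassOf ?restriction
            }
            String candidate = t.s;
            String restriction = t.o;
            if (has(candidate, "rdf:type", "owl:Class")
                    && has(restriction, "owl:onProperty", "pizza:hasSpiciness")
                    && has(restriction, "owl:someValuesFrom", "pizza:Mild")
                    && !result.contains(candidate)) {
                result.add(candidate);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(mildClasses());
    }
}
```

CheeseTopping is not returned because none of its triples links it to a restriction on hasSpiciness; only a reasoner, not this plain pattern match, could derive extra answers from the subclass hierarchy.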
My project
I tested four example applications with Apache Jena and I wrote one application of my own to create a simple RDF file.
When I tried to read a custom RDF file and run custom queries on it, I got the following error:
Exception in thread "main" com.hp.hpl.jena.shared.PropertyNotFoundException: http://www.w3.org/2001/vcard-rdf/3.0#N
    at com.hp.hpl.jena.rdf.model.impl.ModelCom.getRequiredProperty(ModelCom.java:1243)
    at com.hp.hpl.jena.rdf.model.impl.ResourceImpl.getRequiredProperty(ResourceImpl.java:170)
    at org.apache.jena.example.helloworld.Tutorial6.main(Tutorial6.java:52)
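The stack trace says that the resource being read had no vcard:N property when getRequiredProperty was called. One possible cause, which is only an assumption because the file and the code are not shown here, is that the code asks for a resource URI different from the one described in the custom file. The difference between a strict and a lenient lookup can be sketched in plain Java; the class `RequiredPropertySketch` below is a hypothetical stand-in whose method names only echo Jena's, it is not Jena code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Optional;

// Toy stand-in for an RDF model, used only to contrast strict and lenient lookups.
final class RequiredPropertySketch {
    static final String VCARD_N = "http://www.w3.org/2001/vcard-rdf/3.0#N";

    // subject URI -> (property URI -> value)
    private final Map<String, Map<String, String>> statements = new HashMap<>();

    void add(String subject, String property, String value) {
        statements.computeIfAbsent(subject, k -> new HashMap<>()).put(property, value);
    }

    // Lenient lookup: empty when the subject or the property is unknown.
    Optional<String> getProperty(String subject, String property) {
        Map<String, String> properties = statements.get(subject);
        return properties == null
                ? Optional.empty()
                : Optional.ofNullable(properties.get(property));
    }

    // Strict lookup: throws when the subject has no such property.
    String getRequiredProperty(String subject, String property) {
        return getProperty(subject, property)
                .orElseThrow(() -> new NoSuchElementException(property));
    }

    public static void main(String[] args) {
        RequiredPropertySketch model = new RequiredPropertySketch();
        model.add("http://www.rizzimichele.it", VCARD_N, "_:A0");

        // Hypothetical URI that the file does not describe.
        String other = "http://example.org/someone-else";
        System.out.println(model.getProperty(other, VCARD_N).isPresent());
        try {
            model.getRequiredProperty(other, VCARD_N);
        } catch (NoSuchElementException e) {
            System.out.println("missing property: " + e.getMessage());
        }
    }
}
```

Checking that the property exists before reading it, or making sure the URI in the code matches the rdf:about value in the file, would be the first things to try.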
I am not an expert in Lucene or Apache Jena. In my experience these technologies are not widespread, and there are few forums and manuals to learn from. When I need to work with this framework I will spend more time studying it and run many tests to fix these problems, but in my opinion this is enough for now.