Index and Search structured XML documents using Apache Solr

Apache Solr is a highly scalable search engine with many built-in features. In this guide we will learn how structured data in XML can be indexed and searched effectively.

We will cover the following:

  1. Starting up Apache Solr
  2. Importing structured XML documents for indexing in Apache Solr

Tools and libraries used in this project:

  1. Apache Solr 5.3.0
  2. Java 8
  3. Mac OS X

Downloading & Starting Apache Solr

Download Apache Solr Binary Distribution

We can download the latest version of Apache Solr from the official website. Clicking the main or a mirror download link opens a page like this:

Figure: Apache Solr download page with the packages to choose from

Tip: The Apache Solr download is around 130 MB, so make sure your internet connection can handle a download of that size.

Unpack Apache Solr Binary Download Zip

When we unpack the Apache Solr binary zip, we see the following files and folders inside the main folder:

Figure: Apache Solr 5.3 installation folder structure

Starting, Stopping and Restarting Apache Solr

Starting Apache Solr Server

$ cd /Volumes/Drive2/App/solr-5.3.0/

# Start Solr Server

$ bin/solr start

Apache Solr has been started at http://localhost:8983/solr.

Stopping Apache Solr

$ cd /Volumes/Drive2/App/solr-5.3.0/

# Stop Solr

$ bin/solr stop -p 8983

Restarting Apache Solr

$ cd /Volumes/Drive2/App/solr-5.3.0/

# Restart Solr

$ bin/solr restart -p 

Note: Replace the Solr folder path with your installation path, and pass to -p the port your Solr instance runs on (8983 in this guide).

Let’s create a core (or collection) named “xmlhub”:

$ bin/solr create -c xmlhub

Setup new core instance directory:
/Volumes/Drive2/App/solr-5.3.0/server/solr/xmlhub

Creating new core 'xmlhub' using command:
http://localhost:8983/solr/admin/cores?action=CREATE&name=xmlhub&instanceDir=xmlhub

{
 "responseHeader":{
 "status":0,
 "QTime":874},
 "core":"xmlhub"}
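The URL in the output above is a CoreAdmin request that bin/solr issues for us. As a minimal sketch, assuming Solr runs at the default local address used in this guide, the same URL can be assembled in Python. The helper build_create_core_url is a hypothetical name written for illustration; it is not part of Solr.

```python
from urllib.parse import urlencode

# Assumption: Solr is running locally on the default port, as in this guide.
SOLR_BASE = "http://localhost:8983/solr"


def build_create_core_url(name, base=SOLR_BASE):
    """Build the CoreAdmin CREATE URL that `bin/solr create -c <name>` prints."""
    params = {"action": "CREATE", "name": name, "instanceDir": name}
    return f"{base}/admin/cores?{urlencode(params)}"


if __name__ == "__main__":
    print(build_create_core_url("xmlhub"))
```

Calling this URL directly only works once the core's instance directory and configuration exist, which is the setup step bin/solr create performs first, so the command-line tool remains the simpler route.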

Indexing XML files

Sample XML File

We will be indexing XML files kept in a folder (in our application it is at <solr_installation_root_dir>/example-data). An example of the content of one XML file:

File: example1.xml

<?xml version="1.0" encoding="UTF-8"?>
<ele xmlns:dc="http://purl.org/dc/elements/1.1/">
<attr1>
  Attr1 Value 1
</attr1>
<attr2>
  Attr2 Value 1
</attr2>
<meta property="meta1">
  Meta 1 Val 1
</meta>
<meta property="meta2">
  Meta 2 Val 1
</meta>
<meta name="name1">
  Name 1 value 1
</meta>
<meta name="name2">
  Name 2 value 1
</meta>
</ele>

Uploading XML structured data for Indexing using Data Import Handler

Step 1: Configure solrconfig.xml

The solrconfig.xml file is located at <solr_installation_root_dir>/server/solr/<core_or_collection_name>/conf. For our core that is server/solr/xmlhub/conf.

Figure: Location of solrconfig.xml in the Solr 5.3 installation

File: solrconfig.xml

....
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
....
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">xmlhubconfig.xml</str>
    </lst>
</requestHandler>
...

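Add both snippets inside the <config> element of solrconfig.xml. The first loads the Data Import Handler JAR, and the second registers the /dataimport request handler with xmlhubconfig.xml as its configuration file.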

Step 2: Create Data Import configuration

We could provide the data import configuration inline in solrconfig.xml, but we chose to keep it in an external file, xmlhubconfig.xml, placed in the same conf directory as solrconfig.xml.

File: xmlhubconfig.xml

<dataConfig>
  <dataSource type="FileDataSource"/>
  <document>
    <!-- this outer processor generates a list of files satisfying the conditions specified in the attributes -->
    <entity name="f" processor="FileListEntityProcessor" fileName=".*\.xml$" recursive="true" rootEntity="false" dataSource="null" baseDir="/Volumes/Drive2/App/solr-5.3.0/example-data">

      <!-- this processor extracts content using Xpath from each file found -->

      <entity name="nested" processor="XPathEntityProcessor" forEach="/ele | /metadata" url="${f.fileAbsolutePath}" >
              <field column="attr1_s" xpath="/ele/attr1"/>
              <field column="attr2_s" xpath="/ele/attr2"/>
              <field column="meta1_s" xpath="/ele/meta[@property='meta1']"/>
              <field column="meta2_s" xpath="/ele/meta[@property='meta2']"/>
              <field column="name1_s" xpath="/ele/meta[@name='name1']"/>
              <field column="name2_s" xpath="/ele/meta[@name='name2']"/>
      </entity>
    </entity>
  </document>
</dataConfig>

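This configuration is specific to the structure of our XML files. Pay attention to how the XPath expressions select each element: plain child paths such as /ele/attr1 for simple elements, and attribute predicates such as /ele/meta[@property='meta1'] to tell the repeated meta elements apart. The _s suffix on the column names matches the *_s dynamic string field of the default configset, which is why no schema changes are required. You should also replace baseDir with your own path.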
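Before running an import, it can help to check the XPath expressions against a sample file. The following is a rough sketch using Python's standard xml.etree.ElementTree, which is not Solr's XPathEntityProcessor and supports only a subset of XPath; the names FIELD_XPATHS and extract_fields are hypothetical, and stripping whitespace is a choice made here for readability rather than documented DIH behaviour.

```python
import xml.etree.ElementTree as ET

# Copy of the sample document from this guide (example1.xml).
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<ele xmlns:dc="http://purl.org/dc/elements/1.1/">
<attr1>
  Attr1 Value 1
</attr1>
<attr2>
  Attr2 Value 1
</attr2>
<meta property="meta1">
  Meta 1 Val 1
</meta>
<meta property="meta2">
  Meta 2 Val 1
</meta>
<meta name="name1">
  Name 1 value 1
</meta>
<meta name="name2">
  Name 2 value 1
</meta>
</ele>"""

# Same column-to-XPath mapping as the <field> entries in xmlhubconfig.xml.
FIELD_XPATHS = {
    "attr1_s": "/ele/attr1",
    "attr2_s": "/ele/attr2",
    "meta1_s": "/ele/meta[@property='meta1']",
    "meta2_s": "/ele/meta[@property='meta2']",
    "name1_s": "/ele/meta[@name='name1']",
    "name2_s": "/ele/meta[@name='name2']",
}


def extract_fields(xml_bytes):
    """Return {column: text} for every XPath that matches the document."""
    root = ET.fromstring(xml_bytes)
    prefix = "/" + root.tag + "/"
    doc = {}
    for column, xpath in FIELD_XPATHS.items():
        # ElementTree paths are relative to the root element,
        # so drop the leading "/ele/" from the absolute path.
        node = root.find(xpath[len(prefix):])
        if node is not None and node.text:
            doc[column] = node.text.strip()
    return doc


if __name__ == "__main__":
    for column, value in extract_fields(SAMPLE_XML).items():
        print(column, "=", value)
```

If a column is missing from the result, the corresponding XPath did not match, which usually points to a typo in the element or attribute name.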

Step 3: Configure to generate unique id automatically

In solrconfig.xml we use an updateRequestProcessorChain with UUIDUpdateProcessorFactory to generate a UUID for the id field of each document, since our XML files do not contain an id of their own.

File: solrconfig.xml

...
<updateRequestProcessorChain>
      <processor class="solr.UUIDUpdateProcessorFactory">
        <str name="fieldName">id</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
...
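To illustrate what the UUID processor contributes, here is a small Python sketch of the idea: a document that arrives without a value in the id field gets a freshly generated UUID, while a document that already has one is left alone. This only mimics the behaviour for illustration; Solr's actual implementation is in Java, and add_missing_id is a hypothetical name.

```python
import uuid


def add_missing_id(doc, field_name="id"):
    """Return a copy of doc with a random UUID in field_name if it has no value there."""
    result = dict(doc)
    if not result.get(field_name):
        result[field_name] = str(uuid.uuid4())
    return result


if __name__ == "__main__":
    # No id supplied: one is generated.
    print(add_missing_id({"attr1_s": "Attr1 Value 1"}))
    # id already present: the document is unchanged.
    print(add_missing_id({"id": "my-own-id", "attr1_s": "Attr1 Value 1"}))
```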

Index File

Restart Apache Solr so that the configuration changes take effect.

Go to http://localhost:8983/solr/#/xmlhub/dataimport//dataimport and click Execute:

Figure: Execute dataimport handler to index xml
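The import can also be started without the admin UI, because the /dataimport handler registered in solrconfig.xml accepts a command parameter over HTTP. A minimal sketch that builds those URLs, assuming the xmlhub core at the default local address; dataimport_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumption: the xmlhub core on a default local Solr, as in this guide.
SOLR_CORE_URL = "http://localhost:8983/solr/xmlhub"


def dataimport_url(command, core_url=SOLR_CORE_URL, **extra):
    """Build a Data Import Handler URL for commands such as full-import or status."""
    params = {"command": command, "wt": "json"}
    params.update(extra)
    return f"{core_url}/dataimport?{urlencode(params)}"


if __name__ == "__main__":
    # Start a full import, clearing the index first and committing at the end.
    print(dataimport_url("full-import", clean="true", commit="true"))
    # Poll this URL to see whether the import is still running.
    print(dataimport_url("status"))
```

Opening the first URL in a browser or with an HTTP client starts the import, and the second reports its progress.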

The import indexes the XML files and creates documents from them. You can browse the documents at http://localhost:8983/solr/xmlhub/browse.

Figure: Browse indexed documents in the built-in Solr collection browser

Using the built-in collection browser we can search the indexed documents. To learn more about the Solr query syntax, see the official documentation. Apache Solr also provides an HTTP API that exposes search with all the available features.
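As a small illustration of that API, the sketch below builds URLs for the core's standard /select handler. It assumes the xmlhub core at the default local address, and select_url is a hypothetical helper name. The queries stick to match-all and field-exists forms, since exact-value matches on string fields depend on the whitespace stored with each value.

```python
from urllib.parse import urlencode

# Assumption: the xmlhub core on a default local Solr, as in this guide.
SOLR_CORE_URL = "http://localhost:8983/solr/xmlhub"


def select_url(query, core_url=SOLR_CORE_URL, rows=10):
    """Build a /select URL that returns up to `rows` matching documents as JSON."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{core_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    # All indexed documents.
    print(select_url("*:*"))
    # Only documents that have a value in the name1_s field.
    print(select_url("name1_s:*", rows=5))
```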

References

  1. Learn about the Apache Solr Query Syntax
  2. Apache Solr Data Import Handler Documentation
  3. Apache Solr Site

5 Replies to “Index and Search structured XML documents using Apache Solr”

  1. Hi,
    Thanks for a nice article.
    I already have Lucene index data that I created earlier. Where can I specify the path of dataDir?

  2. Thanks for this nice article
    but can i ask you about the file “schema.xml”
    can i choose not to use it .
    if not , how to configure the fields of my files in this schema
    Thank for your help .
    PS : Sorry. English is not my home language

    1. Thanks for reading, Aicha. Your English is pretty good.

      As you can see in this article, we have not modified schema.xml at all. For most needs, you don’t need to change this file.

      Please read this article to learn how to define custom fields in schema.xml:
      https://wiki.apache.org/solr/SchemaXml

  3. Thanks for a nice article. Do i need to mention all the column names in field tag? Is there a way to get all the data of a xml file without mentioning all the column names. Just write a single line(ex. ) and fetch all the data.
