One Media Indexer to Rule Them All

Ismail Mayat

03 Jul 2017 • 3 min read

A reboot of the CogUmbracoExamineMediaIndexer package.

When working with Examine, you will eventually have a customer requirement for indexing non-HTML content, i.e. files in the media section. The default Examine offering is a PDF indexer, which, as the name suggests, can be used to index media files provided they are in PDF format.

However, you may also want to index other file types like Word, PowerPoint, or Excel, so what should you do in this case? Around 2012, I authored the CogUmbracoExamineMediaIndexer, which has had 1600 downloads to date. The package targeted Umbraco v6.

I thought the package could use a refresh, and so Cogworks is proud to present the reboot: Cogworks.ExamineFileIndexer.

This custom Examine indexer utilises Apache Tika. Apache Tika is a library used for document type detection and content extraction from various file formats. Internally, Tika uses existing document parsers and document-type detection techniques to detect and extract data.

Using Tika, one can develop a universal type detector and content extractor to extract structured text and metadata from different types of documents, such as spreadsheets, text documents, images, PDFs, and even multimedia input formats to a certain extent.

Tika provides a single generic API for parsing different file formats. It uses existing specialised parser libraries for each document type.

Inspecting the resulting index with Luke, the Lucene index viewer, shows that Tika extracts more than just the file content; it also extracts quite a lot of useful metadata.

Tika also includes a language detector class, which can determine the language a document is written in. You can even do crazy things like detect phone numbers in the text.

More examples can be found on the Tika samples page (we may incorporate some of this other cool stuff in a later version), but if you want to get into Tika, there is a great book called Tika in Action.

The package can be installed via NuGet:

Install-Package Cogworks.ExamineFileIndexer

The ExamineIndex.config and ExamineSettings.config files are updated on installation. The following entries are added:

ExamineIndex.config

<IndexSet SetName="MediaIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/MediaIndexSet">
   <IndexAttributeFields>
     <add Name="id" />
     <add Name="nodeName" />
     <add Name="updateDate" />
     <add Name="writerName" />
     <add Name="path" />
     <add Name="nodeTypeAlias" />
     <add Name="parentID" />
   </IndexAttributeFields>
   <IncludeNodeTypes>
     <add Name="File" />
   </IncludeNodeTypes>
</IndexSet>

And

ExamineSettings.config

Under ExamineIndexProviders/providers:

<add name="MediaIndexer" type="Cogworks.ExamineFileIndexer.UmbracoMediaFileIndexer, Cogworks.ExamineFileIndexer"
extensions=".pdf,.docx"
umbracoFileProperty="umbracoFile" />

Under ExamineSearchProviders/providers:

<add name="MediaSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" indexSet="MediaIndexSet"
analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />

By default, PDF and DOCX files are indexed. To index other file types, update ExamineSettings.config and add the extension to the comma-separated extensions attribute, shown here with its default value:

<add name="MediaIndexer" type="Cogworks.ExamineFileIndexer.UmbracoMediaFileIndexer, Cogworks.ExamineFileIndexer"
extensions=".pdf,.docx"
umbracoFileProperty="umbracoFile" />
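
As an illustration of that change, the entry below extends the default list so that Excel and PowerPoint files are indexed too. Treat it as a sketch rather than a setting taken from the package: the .xlsx and .pptx entries are examples chosen here, and it assumes extra extensions follow the same dot-prefixed, comma-separated form as the defaults.

```xml
<!-- Assumption: .xlsx and .pptx are illustrative additions, not package defaults. -->
<add name="MediaIndexer" type="Cogworks.ExamineFileIndexer.UmbracoMediaFileIndexer, Cogworks.ExamineFileIndexer"
extensions=".pdf,.docx,.xlsx,.pptx"
umbracoFileProperty="umbracoFile" />
```

Media items uploaded before the change will most likely only appear in the index once it has been rebuilt.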

The package also supports virtual path providers. So, if you are storing media files in the cloud, e.g. in Azure using the UmbracoFileSystemProviders.Azure package, they will still be indexed.

You can install it via NuGet, which is preferred, or install the Umbraco package from our.umbraco.org. The source code is available on GitHub, so feel free to log any issues, or fork and improve it via a pull request. Contributions are welcome; the VPP functionality was contributed by Hendy Racher of Crumpled Dog.

So go forth and index those files!

Did you know that media indexing is covered in the Umbraco Searching and Indexing course? To find an event near you, visit the Umbraco course schedule page.