The following sessions are confirmed; specific times will be announced once the full schedule is finalized:
Enhancing Relevancy through Personalization and Semantic Search
Trey Grainger, Search Technology Development Manager, CareerBuilder
Turning Search Upside Down: Using Lucene for Very Fast Stored Queries
Charlie Hull, Managing Director, Flax
Alan Woodward, Principal Developer, Flax
High Performance JSON Search and Relational Faceted Browsing with Lucene
Renaud Delbru, Co-Founder, SindiceTech
What is in a Lucene index?
Adrien Grand, Software Engineer, Elasticsearch
Querying rich text with XQuery
Michael Sokolov, Senior Architect, Safari Books Online
A Novel Methodology for Handling Document Level Security in Search Based Applications
Rajini Maski, Senior Software Engineer, Happiest Minds Technologies
Syed Abdul Kather, Senior Software Engineer
Solr Indexing and Analysis Tricks
Erik Hatcher, Senior Solutions Architect, LucidWorks
System Teardown: Solr as a Practical Recommendation Engine
Michael Hausenblas, Chief Data Engineer, MapR Technologies
Test Driven Relevancy -- How to Work with Content Experts to Optimize and Maintain Search Relevancy
Doug Turnbull, Search and Big Data Architect, OpenSource Connections
Rena Morse, Director of Semantic Technology, Silverchair Information Systems
Scaling Solr with SolrCloud
Rafal Kuć, Consultant Software Engineer, Sematext Group, Inc.
"Shrinking the Haystack" Using Solr and OpenNLP
Wes Caldwell, Chief Architect, ISS, Inc.
How Lucene Powers the LinkedIn Segmentation and Targeting Platform
Hien Luu, Technical Lead, LinkedIn
Rajasekaran Rangaswamy, LinkedIn
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Majirus Fansi, SOA and Search Engine Developer, Valtech SA
The First Class Integration of Solr with Hadoop
Mark Miller, Software Developer, Cloudera
Integrate Solr with real-time stream processing applications
Timothy Potter, Founder, Text Centrix
OpenStreetMap Geocoder Based on Solr
Ishan Chattopadhyaya, LucidWorks
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecker for Classified Ads
Xavier Sanchez Loro, Ph.D., Trovit Search SL
State of the Art Logging. Kibana4Solr is Here!
Markus Klose, Search + Big Data Consultant, SHI Elektronische Medien GmbH
Query Latency Optimization with Lucene
Stefan Pohl, Senior Research Engineer, HERE, a Nokia Business
The Typed Index
Christoph Goller, Chief Scientist, IntraFind Software AG
Hacking Lucene and Solr for Fun and Profit
Grant Ingersoll, CTO, LucidWorks
Faceted Search with Lucene
Shai Erera, Researcher, IBM
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
John Berryman, Data Architect, Bloom
Real-time Inverted Search in the Cloud Using Lucene and Storm
Joshua Conlin, Associate, Booz Allen Hamilton
Sidecar Index - Solr Components for Parallel Index Management
Andrzej Bialecki, LucidWorks
Solr Relevance Tuning Simplified
Daniel Ling, Senior Architect, Findwise
Large Scale Crawling with Apache Nutch and Friends
Julien Nioche, Director, DigitalPebble
Building Client-side Search Applications with Solr
Daniel Beach, Search Application Developer, OpenSource Connections
Lucene Search Essentials: Scorers, Collectors and Custom Queries
Mikhail Khludnev, Principal Engineer, Grid Dynamics
Relevancy Hacks for eCommerce
Varun Thacker, Search Engineer, Unbxd Inc
Text Classification Powered by Apache Mahout and Lucene
Isabel Drost-Fromm, Software Developer, Apache Software Foundation/Nokia Gate 5 GmbH
Schemaless Solr and the Solr Schema REST API
Steve Rowe, Senior Software Engineer, LucidWorks
Solr for Analytics: Metrics Aggregations at Sematext
Otis Gospodnetić, President, Sematext Group, Inc.
Using Solr to Search and Analyze Logs
Radu Gheorghe, Software Engineer, Sematext Group, Inc.
Administering and Monitoring SolrCloud Clusters
Rafal Kuć, Consultant and Software engineer, Sematext Group, Inc.
Use Case Diagnosis: When is Solr Really the Best Tool?
Michael Hausenblas, Chief Data Engineer, MapR Technologies
Recent Additions to Lucene Arsenal
Shai Erera, Researcher, IBM
Solr's Admin UI - Where does the data come from?
Stefan Matheis, Lucene/Solr PMC
Juris portal - Moving a Complex Application from Using Verity and Oracle to Using Solr
Christine Rüb, Senior Project Manager, Juris GmbH
Ramp Up Your Web Experiences Using Drupal and Apache Solr
Peter Wolanin, Momentum Specialist, Acquia, Inc.
Presented by Trey Grainger, Search Technology Development Manager, CareerBuilder
Matching keywords is just step one in the effort to maximize the relevancy of your search platform. In this talk, you'll learn how to implement advanced relevancy techniques which enable your search platform to "learn" from your content and users' behavior. Topics will include automatic synonym discovery, latent semantic indexing, payload scoring, document-to-document searching, foreground vs. background corpus analysis for interesting term extraction, collaborative filtering, and mining user behavior to drive geographically and conceptually personalized search results. You'll learn how CareerBuilder has enhanced Solr (also utilizing Hadoop) to dynamically discover relationships between data and behavior, and how you can implement similar techniques to greatly enhance the relevancy of your search platform.
Presented by Charlie Hull, Managing Director, Flax, and Alan Woodward, Principal Developer, Flax
As part of their work with large media monitoring companies, Flax has developed a technique for applying tens of thousands of stored Lucene queries to a document in under a second. We'll talk about how we built intelligent filters to reduce the number of queries actually applied, how we extended Lucene to extract the exact hit positions of matches, the challenges of implementation, and how the technique can be used, including in applications that monitor hundreds of thousands of news stories every day.
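The prospective-search idea the session describes can be sketched in a few lines. This is an illustrative toy, not Flax's implementation: a term-based prefilter narrows the set of stored queries before each candidate is fully evaluated with AND semantics.

```python
# Illustrative toy (not Flax's implementation) of prospective search:
# stored queries are applied to each incoming document, with a
# term-based prefilter so only queries sharing a term with the
# document are fully evaluated.
from collections import defaultdict

class StoredQueryMatcher:
    def __init__(self):
        self.queries = {}                          # query_id -> required terms
        self.term_to_queries = defaultdict(set)    # term -> query_ids using it

    def register(self, query_id, required_terms):
        terms = set(required_terms)
        self.queries[query_id] = terms
        for t in terms:
            self.term_to_queries[t].add(query_id)

    def match(self, document_text):
        doc_terms = set(document_text.lower().split())
        # Prefilter: candidates share at least one term with the document.
        candidates = set()
        for t in doc_terms:
            candidates |= self.term_to_queries.get(t, set())
        # Full evaluation: all required terms must be present (AND semantics).
        return sorted(q for q in candidates if self.queries[q] <= doc_terms)

matcher = StoredQueryMatcher()
matcher.register("q1", ["merger", "acquisition"])
matcher.register("q2", ["earnings"])
hits = matcher.match("Quarterly earnings beat expectations after the merger")
```

With tens of thousands of registered queries, the prefilter is what keeps per-document matching fast, since most stored queries share no terms with a given document.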
Presented by Renaud Delbru, Co-Founder, SindiceTech
In this presentation, we will discuss how Lucene and Solr can be used for very efficient search of tree-shaped schemaless documents, e.g. JSON or XML, and can then be made to address both graph and relational data search. We will discuss the capabilities of SIREn, a Lucene/Solr plugin we have developed to deal with huge collections of tree-shaped schemaless documents, and how SIREn is built using Lucene's extensibility capabilities (Analysis, Codec, Flexible Query Parser). We will compare it with Lucene's BlockJoin Query API in nested, schemaless, data-intensive scenarios. We will then go through use cases that show how relational or graph data can be turned into JSON documents using Hadoop and Pig, and how this can be used in conjunction with SIREn to create relational faceting systems with unprecedented performance. Take-aways from this session will be an awareness of how Lucene/Solr and Hadoop can be used for relational and graph data search, and that it is now possible to have relational faceted browsers with sub-second response times on commodity hardware.
Presented by Adrien Grand, Software Engineer, Elasticsearch
Although people usually come to Lucene and related solutions in order to make data searchable, they often realize that it can do much more for them. Indeed, its ability to handle high loads of complex queries make Lucene a perfect fit for analytics applications and, for some use-cases, even a credible replacement for a primary data-store. It is important to understand the design decisions behind Lucene in order to better understand the problems it can solve and the problems it cannot solve. This talk will explain the design decisions behind Lucene, give insights into how Lucene stores data on disk and how it differs from traditional databases. Finally, there will be highlights of recent and future changes in Lucene index file formats.
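As background for the talk's subject, the central structure a Lucene index is built around is an inverted index. A toy sketch, with term positions included since those are what phrase queries rely on (the data is made up for illustration):

```python
# Toy sketch of an inverted index: each term maps to a postings list
# of (doc_id, positions) pairs, the structure Lucene persists on disk
# in its own compressed file formats.
from collections import defaultdict

def build_inverted_index(docs):
    index = defaultdict(list)          # term -> [(doc_id, [positions]), ...]
    for doc_id, text in enumerate(docs):
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)
        for term, plist in sorted(positions.items()):
            index[term].append((doc_id, plist))
    return index

idx = build_inverted_index(["lucene stores an inverted index",
                            "an index maps terms to documents"])
docs_with_index = [doc_id for doc_id, _ in idx["index"]]  # postings for "index"
```

The real index adds codecs, skip lists, doc values, and stored fields on top of this core idea, which is what the session's disk-layout discussion covers.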
Presented by Michael Sokolov, Senior Architect, Safari Books Online
Solr and Lucene provide a powerful, scalable search server. XQuery provides a rich querying and programming model for use with marked-up text. This session will present Lux, a system that combines these into a powerful XML search engine, which is freely available under an open-source license. Query optimizers often mystify database users: sometimes queries run quickly and sometimes they don’t. An intuitive grasp of what will work well in an optimizer is often gained only after trial, error, inductive logic (i.e. educated guessing), and sometimes propitiatory sacrifice. This session will explain some of the mystery by describing work on Lux's optimizer. Lux optimizes queries by rewriting them as equivalent (but usually faster) indexed queries, so its results are easier for a user to understand than the abstract query plans produced by some optimizers. Lucene-based QName and path indexes prove useful in speeding up XQuery execution by Saxon. Finally, this session will describe the mechanisms Lux uses for extending Solr and Lucene, which include Solr UpdateProcessor, ResponseWriter, and QueryComponent plugins, dynamic Solr schema enhancement, custom XML-aware analyzers and tokenizers.
Presented by Rajini Maski, Senior Software Engineer, Happiest Minds Technologies
An important problem with document-search in any content management system (CMS) is the handling of permission-based search requests for each user. In this session, we present an algorithm and framework that allows the Search Engine to plainly index both public and privileged documents without any early binding overhead—thus enforcing document-level security policies only at the time of search. With our late-binding approach for ACL (access control lists) and some custom components, we have achieved reduction in search-time overhead. We will also discuss the order of complexity and execution time for the search overhead.
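The late-binding approach described above can be illustrated with a minimal sketch. The data model here is hypothetical, not the presenters' framework: documents are indexed with no permission logic, and ACLs are checked only against the candidate hits at search time.

```python
# Minimal sketch of late-binding document-level security: indexing
# knows nothing about permissions; the ACL filter runs at query time
# over the candidate hits.
def search(index, query_term, user, acls):
    # index: list of (doc_id, term set); acls: doc_id -> allowed principals
    hits = [doc_id for doc_id, terms in index if query_term in terms]
    # Late binding: the ACL check happens here, not at index time.
    return [d for d in hits
            if user in acls.get(d, set()) or "public" in acls.get(d, set())]

index = [(1, {"contract", "budget"}), (2, {"budget", "memo"})]
acls = {1: {"alice"}, 2: {"public"}}
results = search(index, "budget", "bob", acls)   # bob sees only the public doc
```

The trade-off the session quantifies is exactly this one: no re-indexing when permissions change, at the cost of a per-query filtering overhead.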
Presented by Erik Hatcher, Senior Solutions Architect, LucidWorks
This session will introduce and demonstrate several techniques for enhancing the search experience by augmenting documents during indexing. First we'll survey the analysis components available in Solr, and then we'll delve into using Solr's update processing pipeline to modify documents on the way in. The session will build on Erik's "Poor Man's Entity Extraction" blog at http://www.searchhub.org/2013/06/27/poor-mans-entity-extraction-with-solr/
Presented by Michael Hausenblas, Chief Data Engineer, MapR Technologies
This session will present a detailed tear-down and walk-through of a working soup-to-nuts recommendation engine that uses observations of multiple kinds of behavior to do combined recommendation and cross recommendation. The system is built using Mahout to do off-line analysis and Solr to provide real-time recommendations. The presentation will also include enough theory to provide useful working intuitions for those desiring to adapt this design.
The entire system, including a data generator, off-line analysis scripts, Solr configurations, and sample web pages, will be made available on GitHub for attendees to modify as they like.
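The search-as-recommendation pattern the session describes can be sketched at back-of-envelope scale (a toy stand-in for the Mahout + Solr pipeline, with made-up data): offline, find items that co-occur in user histories; store each item's strongest co-occurring items in an "indicator" field; then recommending becomes an ordinary search against that field.

```python
# Toy sketch of offline co-occurrence analysis feeding a searchable
# "indicator" field, the pattern behind Mahout-offline / Solr-online
# recommenders. Data and threshold are illustrative.
from collections import defaultdict
from itertools import combinations

histories = [["a", "b", "c"], ["a", "b"], ["b", "c"]]

# Offline step (Mahout's role): pairwise co-occurrence counts.
cooc = defaultdict(int)
for h in histories:
    for x, y in combinations(sorted(set(h)), 2):
        cooc[(x, y)] += 1
        cooc[(y, x)] += 1

# Indexing step: indicator field = neighbors that co-occur often enough.
indicators = defaultdict(list)
for (x, y), n in cooc.items():
    if n >= 2:
        indicators[x].append(y)

# Online step (Solr's role): recommend by searching the indicator field.
recs = sorted(indicators["a"])
```

In the real system the thresholding is replaced by a statistical test (e.g. log-likelihood ratio) to keep only anomalously frequent co-occurrences, and the online step is a Solr query over the user's recent items.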
Presented by Doug Turnbull, Search and Big Data Architect, OpenSource Connections, and Rena Morse, Director of Semantic Technology, Silverchair Information Systems
Getting good search results is hard; maintaining good relevancy is even harder. Fixing one problem can easily create many others. Without good tools to measure the impact of relevancy changes, there's no way to know if the "fix" you've developed will cause relevancy problems with other queries. Ideally, much like we have unit tests for code to detect when bugs are introduced, we would like to create ways to measure changes in relevancy. This is exactly what we've done at OpenSource Connections. We've developed a series of tools and practices that allow us to work with content experts to define metrics for search quality. Once defined, we can instantly measure the impact of modifying our relevancy strategy, allowing us to iterate quickly on very difficult relevancy problems. Get an in-depth look at the tools we use when we not only need to solve a relevancy problem, but need to make sure it stays solved over the product's life.
Presented by Rafal Kuć, Consultant Software Engineer, Sematext Group, Inc.
Configure your Solr cluster to handle hundreds of millions of documents without even noticing, handle queries in milliseconds, and use Near Real Time indexing and searching with document versioning. Scale your cluster both horizontally and vertically by using shards and replicas. In this session you'll learn how to make your indexing process blazing fast and make your queries efficient even with large amounts of data in your collections. You'll also see how to optimize your queries to leverage caches as much as your deployment allows and how to observe your cluster with the Solr administration panel, JMX, and third-party tools. Finally, learn how to make changes to already deployed collections: split their shards and alter their schema using the Solr API.
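The horizontal-scaling half of the story rests on document routing: hashing the document id selects the shard, so both indexing and queries spread across the cluster. A toy illustration (Solr actually routes over murmur-hash ranges assigned per shard; this sketch just takes CRC32 modulo the shard count):

```python
# Toy sketch of hash-based document routing across shards, the idea
# that lets a SolrCloud-style cluster scale horizontally. Doc ids and
# shard count are made up for illustration.
import zlib

def route(doc_id, num_shards):
    # crc32 is stable across runs, unlike Python's built-in hash()
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

shards = {s: [] for s in range(3)}
for doc_id in ["doc1", "doc2", "doc3", "doc4", "doc5", "doc6"]:
    shards[route(doc_id, 3)].append(doc_id)
```

Because routing is deterministic, a query router can also hit only the shard that owns a given id, which is what makes real-time get cheap.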
Presented by Rafal Kuć, Consultant and Software Engineer, Sematext Group, Inc.
Even though Solr can run without causing any trouble for long periods of time, it is very important to monitor and understand what is happening in your cluster. In this session you will learn how to use various tools to monitor how Solr is behaving at a high level, but also at the Lucene, JVM, and operating system levels. You'll see how to react to what you see and how to make changes to configuration, index structure, and shard layout using the Solr API. We will also discuss different performance metrics to which you ought to pay extra attention. Finally, you'll learn what to do when things go awry - we will share a few examples of troubleshooting and then dissect what was wrong and what had to be done to make things work again.
Presented by Wes Caldwell, Chief Architect, ISS, Inc.
The customers in the Intelligence Community and Department of Defense that ISS serves have a big data challenge. The sheer volume of data being produced and ultimately consumed by large enterprise systems has grown exponentially in a short amount of time. Providing analysts the ability to interpret meaning and act on time-critical information is a top priority for ISS. In this session, we will explore our journey into building a search and discovery system for our customers that combines Solr, OpenNLP, and other open source technologies to enable analysts to "Shrink the Haystack" into actionable information.
Presented by Hien Luu, Technical Lead, LinkedIn, and Rajasekaran Rangaswamy, LinkedIn
For internet companies, marketing campaigns play an important role in acquiring new customers, retaining and engaging existing customers, and promoting new products. The LinkedIn segmentation and targeting platform helps marketing teams easily and quickly create member segments based on member attributes, using nested predicate expressions ranging from simple to complex. Once segments are created, the qualified members are targeted with marketing campaigns.
Lucene is a key piece of technology in this platform. This session will cover how we leverage Hadoop to efficiently build Lucene indexes for a large and growing member attribute data set of 225 million members, and how Lucene is used to create segments based on complex nested predicate expressions. This presentation will also share some of the lessons we learned and challenges we encountered from using Lucene to search over large data sets.
Presented by Majirus Fansi, SOA and Search Engine Developer, Valtech SA
In this session we will show how to build a text classifier using Apache Lucene/Solr together with the libSVM library. We classify our corpus of job offers into a number of predefined categories; each indexed document (a job offer) then belongs to zero, one, or more categories. Known machine learning techniques for text classification include the naïve Bayes model, logistic regression, neural networks, and support vector machines (SVM). We use Lucene/Solr to construct the feature vectors, then use the libSVM library, known as the reference implementation of the SVM model, to classify the documents. We construct as many one-vs-all SVM classifiers as there are classes in our setting, then reconcile the results of the classifiers using the Hadoop MapReduce framework. The end result is a scalable multi-class classifier. Finally, we outline how the classifier is used to enrich basic Solr keyword search.
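The reconciliation step, assigning zero or more categories from the one-vs-all decision values, can be sketched as follows. The scores below are stand-ins for real SVM decision values, and the threshold is illustrative:

```python
# Sketch of one-vs-all reconciliation: each binary classifier scores
# the document for its own class, and the document receives every
# class whose decision value clears a threshold, so it can end up
# with zero, one, or more categories.
def reconcile(one_vs_all_scores, threshold=0.0):
    # one_vs_all_scores: {category: decision value from that binary SVM}
    return sorted(c for c, score in one_vs_all_scores.items()
                  if score > threshold)

labels = reconcile({"engineering": 1.3, "sales": -0.7, "finance": 0.2})
no_labels = reconcile({"engineering": -0.4, "sales": -0.7})
```

In the described pipeline this per-document merge runs inside a MapReduce reduce step, one reconciliation per document across all classifier outputs.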
Presented by Mark Miller, Software Developer, Cloudera
Apache Lucene/Solr committer Mark Miller talks about how Solr has been integrated into the Hadoop ecosystem to provide full text search at "Big Data" scale. This talk will give an overview of how Cloudera has tackled integrating Solr into the Hadoop ecosystem and highlights some of the design decisions and future plans. Learn how Solr is getting 'cozy' with Hadoop, which contributions are going to what project, and how you can take advantage of these integrations to use Solr efficiently at "Big Data" scale. Learn how you can run Solr directly on HDFS, build indexes with Map/Reduce, load Solr via Flume in 'Near Realtime' and much more.
Presented by Timothy Potter, Founder, Text Centrix
Storm is a real-time distributed computation system used to process massive streams of data. Many organizations are turning to technologies like Storm to complement batch-oriented big data technologies, such as Hadoop, to deliver time-sensitive analytics at scale. This talk introduces an emerging architectural pattern of integrating Solr and Storm to process big data in real time. There are a number of natural integration points between Solr and Storm, such as populating a Solr index or supplying data to Storm using Solr’s real-time get support. In this session, Timothy will cover the basic concepts of Storm, such as spouts and bolts. He’ll then provide examples of how to integrate Solr into Storm to perform large-scale indexing in near real-time. In addition, we'll see how to embed Solr in a Storm bolt to match incoming tuples against pre-configured queries, commonly known as a percolator. Attendees will come away from this presentation with a good introduction to stream processing technologies and several real-world use cases of how to integrate Solr with Storm.
Presented by Ishan Chattopadhyaya, LucidWorks
This talk is on the technical aspects of a new OpenStreetMap geocoder based on Apache Solr and Lucene. Recent releases of Apache Lucene and Apache Solr (4.0 and onwards) have seen a marked improvement in spatial search capabilities, and improved support for distributed storage and search, via the SolrCloud mode, makes applications using Solr scale easily. OpenStreetMap's current geocoder, Nominatim, is based on PostgreSQL/PostGIS. Some benefits of using Solr (as compared to a database system like Postgres) for building a geocoder are robust partial text search, analysis in various languages (stemming, tokenization, stop words, etc.), spell check, faceting, and highlighting. Through this presentation, the author intends to bring out an appreciation for a Solr-based geocoder.
Presented by Xavier Sanchez Loro, Ph.D., Trovit Search SL
This session explains the implementation and use case for spellchecking in the Trovit search engine. Trovit is a classified-ads search engine supporting several different sites, one for each country and vertical. Our search engine supports multiple indexes in multiple languages, each with several million indexed ads. Those indexes are segmented into several different sites depending on the type of ads (homes, cars, rentals, products, jobs, and deals). We have developed a multi-language spellchecking system using Solr and Lucene in order to help our users better find the desired ads and avoid the dreaded zero results as much as possible. As such, our goal is not pure orthographic correction, but also suggestion of correct searches for a certain site.
Presented by Markus Klose, Search + Big Data Consultant, SHI Elektronische Medien GmbH
Kibana4Solr is search-driven, scalable, browser-based, and extremely user friendly (also for non-technical users). Logs are everywhere: any device, system, or human can produce a huge amount of information saved in logs. The amount of available logs and their semi-structured nature make meaningful processing in real time quite a difficult task; thus, valuable business insights stored in logs might never be found. Kibana4Solr is a search-driven approach to handling that challenge. It offers a user-friendly, browser-based dashboard which can be easily customized to particular needs. In this session, Kibana4Solr will be introduced, some light will be shed on its architectural features, some ideas will be given on possible business use cases, and finally a live demo of Kibana4Solr will be shown.
Presented by Stefan Pohl, Senior Research Engineer, HERE, a Nokia Business
Besides the quality of results, the time that it takes from the submission of a query to the display of results is of utmost importance to user satisfaction. Within search engine implementations such as Apache Lucene, significant development efforts are hence directed towards reducing query latency. In this session, I will explain reasons for high query latencies and describe general approaches and recent developments within Lucene to counter them. To make the presented material relevant to a wider audience, I will focus on the actual query processing, as this is at the core of every query and search use-case.
Presented by Christoph Goller, Chief Scientist, IntraFind Software AG
If you want to search in a multilingual environment with high-quality language-specific word normalization, if you want to handle mixed-language documents, if you want to add phonetic search for names, or if you need a semantic search which distinguishes between a search for the color "brown" and a person with the surname "brown" - in all these cases you have to deal with different types of terms. I will show why it makes much more sense to attach types (prefixes) to Lucene terms instead of relying on different fields or even different indexes for different kinds of terms. Furthermore, I will show how queries to such a typed index look and why, for example, SpanQueries are needed to correctly treat compound words and phrases or to realize a reasonable phonetic search. The Analyzers and the QueryParser described are available as plugins for Lucene, Solr, and Elasticsearch.
Presented by Grant Ingersoll, CTO, LucidWorks
Lucene and Solr are the most widely deployed search technology on the planet, powering sites like Twitter, Wikipedia, Zappos and countless applications across a large array of domains. Lucene and Solr also contain a large number of features for solving common information retrieval problems ranging from pluggable posting list compression and scoring algorithms to faceting and spell checking. Increasingly, Lucene and Solr also are being (ab)used to power applications going way beyond the search box. In this session, we'll explore the features and capabilities of the latest Lucene and Solr, as well as look at how to (ab)use your search engine technology for fun and profit.
Presented by Shai Erera, Researcher, IBM
Faceted search is a powerful technique to let users easily navigate the search results. It can also be used to develop rich user interfaces, which give an analyst quick insights about the documents space. In this session I will introduce the Facets module, how to use it, under-the-hood details as well as optimizations and best practices. I will also describe advanced faceted search capabilities with Lucene Facets.
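At its core, faceting computes one thing: for the documents matching a query, how many fall under each value of a facet field. A minimal sketch with made-up documents (the Lucene Facets module does this against the index with dedicated data structures rather than over result dictionaries):

```python
# Minimal sketch of facet counting: tally each value of the facet
# field across the matching documents.
from collections import Counter

def facet_counts(matching_docs, facet_field):
    return Counter(doc[facet_field]
                   for doc in matching_docs if facet_field in doc)

matches = [{"title": "a", "brand": "acme"},
           {"title": "b", "brand": "acme"},
           {"title": "c", "brand": "globex"}]
counts = facet_counts(matches, "brand")
```

The counts drive the familiar navigation sidebar ("brand: acme (2), globex (1)"), and clicking a value adds a filter and recomputes the counts.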
Presented by John Berryman, Data Architect, Bloom
In a recent project with the United States Patent and Trademark Office, Opensource Connections was asked to prototype the next generation of patent search - using Solr and Lucene. An important aspect of this project was the implementation of BRS, a specialized search syntax used by patent examiners during the examination process. In this fast-paced session we will relate our experiences and describe how we used a combination of Parboiled (a Parsing Expression Grammar [PEG] parser), Lucene Queries and SpanQueries, and an extension of Solr's QParserPlugin to build BRS search functionality in Solr. First we will characterize the patent search problem and then define the BRS syntax itself. We will then introduce the Parboiled parser and discuss various considerations that one must make when designing a syntax parser. Following this we will describe the methodology used to implement the search functionality in Lucene/Solr. Finally, we will include an overview of our syntactic and semantic testing strategies. The audience will leave this session with an understanding of how Solr, Lucene, and Parboiled may be used to implement their own custom search parser.
Presented by Isabel Drost-Fromm, Software Developer, Apache Software Foundation/Nokia Gate 5 GmbH
Text classification automates the task of filing documents into pre-defined categories based on a set of example documents. The first step in automating classification is to transform the documents to feature vectors. Though this step is highly domain specific Apache Mahout provides you with a lot of easy to use tooling to help you get started, most of which relies heavily on Apache Lucene for analysis, tokenisation and filtering. This session shows how to use facetting to quickly get an understanding of the fields in your document. It will walk you through the steps necessary to convert your text documents into feature vectors that Mahout classifiers can use including a few anecdotes on drafting domain specific features.
Presented by Steve Rowe, Senior Software Engineer, LucidWorks
Steve will show how and why to use Solr’s new Schemaless Mode, under which document indexing can be performed with no up-front schema configuration. Solr uses content clues to choose among a predefined set of field types and then automatically adds previously unseen fields to the schema.
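The "content clues" idea can be illustrated with a toy type-guesser. This is a simplified sketch, not Solr's actual detection logic, and the type names are illustrative:

```python
# Toy version of schemaless field-type guessing: infer a type for a
# never-before-seen field from its first value, then record it in
# the schema so later documents reuse it.
from datetime import datetime

def guess_field_type(value):
    for caster, type_name in ((int, "long"), (float, "double")):
        try:
            caster(value)
            return type_name
        except ValueError:
            pass
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return "date"
    except ValueError:
        return "text"          # fallback: treat as analyzed text

schema = {}
for field, value in [("price", "19.99"), ("qty", "3"), ("title", "red shoes")]:
    if field not in schema:    # only unseen fields get added
        schema[field] = guess_field_type(value)
```

Note the order matters: integer parsing is tried before float, so "3" becomes a long rather than a double, mirroring the most-specific-type-first approach.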
Presented by Otis Gospodnetić, President, Sematext Group, Inc.
While Solr and Lucene were originally written for full-text search, they are capable and increasingly used for Analytics, as Key Value Stores, NoSQL databases, and more. In this session we'll describe our experience with Solr for Analytics. More specifically, we will describe a couple of different approaches we have taken with SolrCloud for aggregation of massive amounts of performance metrics, we'll share our findings, and compare SolrCloud with HBase for large-scale, write-intensive aggregations. We'll also visit several Solr new features that are in the works that will make Solr even more suitable for Analytics workloads.
Presented by Radu Gheorghe, Software Engineer, Sematext Group, Inc.
Many of us tend to hate or simply ignore logs, and rightfully so: they’re typically hard to find, difficult to handle, and are cryptic to the human eye. But can we make logs more valuable and more usable if we index them in Solr, so we can search and run real-time statistics on them? Indeed we can, and in this session you’ll learn how to make that happen. In the first part of the session we’ll explain why centralized logging is important, what valuable information one can extract from logs, and we’ll introduce the leading tools from the logging ecosystems everyone should be aware of - from syslog and log4j to LogStash and Flume. In the second part we’ll teach you how to use these tools in tandem with Solr. We’ll show how to use Solr in a SolrCloud setup to index large volumes of logs continuously and efficiently. Then, we'll look at how to scale the Solr cluster as your data volume grows. Finally, we'll see how you can parse your unstructured logs and convert them to nicely structured Solr documents suitable for analytical queries.
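The final step mentioned above, parsing unstructured log lines into structured documents, can be as simple as a regular expression. This is a toy syslog-ish format for illustration, not LogStash's actual grok patterns:

```python
# Sketch of log parsing: pull timestamp, level, and message out of a
# raw line so each piece can be indexed as a separate Solr field.
import re

LINE = re.compile(r"^(?P<ts>\S+ \S+) (?P<level>[A-Z]+) (?P<msg>.*)$")

def to_doc(line):
    m = LINE.match(line)
    return m.groupdict() if m else {"msg": line}  # fall back to raw text

doc = to_doc("2013-11-05 12:00:01 ERROR disk quota exceeded")
```

Once lines are structured like this, analytical queries become natural: facet on `level`, range-filter on `ts`, full-text search on `msg`.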
Presented by Michael Hausenblas, Chief Data Engineer, MapR Technologies
This session will present an overview of common big data use cases in the form of a set of questions that can be used to determine what kind of problem you really have. From the answers to these questions, you can quickly find out which technologies are likely to be most productive, useful, and easy to apply. This analysis will also allow you to discern cases where Solr is not a good fit, but where augmentation with other big data systems like HBase leads to feasible architectures. Conversely, you will see cases where Solr can be the hero, filling gaps where big data systems alone are destined to fail.
Presented by Joshua Conlin, Associate, Booz Allen Hamilton
Building real-time notification systems is often limited to basic filtering and pattern matching against incoming records. Allowing users to query incoming documents using Solr's full range of capabilities is much more powerful. In our environment we needed a way to allow for tens of thousands of such query subscriptions, meaning we needed to find a way to distribute the query processing in the cloud. By creating in-memory Lucene indices from our Solr configuration, we were able to parallelize our queries across our cluster. To achieve this distribution, we wrapped the processing in a Storm topology to provide a flexible way to scale and manage our infrastructure. This presentation will describe our experiences creating this distributed, real-time inverted search notification framework.
Presented by Andrzej Bialecki, LucidWorks
This session presents a set of Solr components for easy management of "sidecar indexes" - indexes that extend the main index with additional stored and / or indexed fields. Conceptually this can be viewed as an extension of the ExternalFileField or as a static join between documents from two collections. This functionality is useful in applications that require very different update regimes for the two parts of the index (e.g. main catalogue items combined with clickthroughs).
Presented by Daniel Ling, Senior Architect, Findwise
Relevance is a critical aspect of a successful search solution, and tuning it to get it right is essential to the success of search projects. I'll share some best practices for working with relevance, including three use-case examples of the impact of customized relevance, based on scenarios from the media industry and an intelligence agency, as well as a case of personalized relevance based on role. In addition, during the session a Solr relevance tuning module will be open sourced (on GitHub) and demonstrated. The module contains functionality to adjust the relevance parameters in an appealing UI with sliders and parameters, all providing immediate feedback and impact on search results displayed and re-ranked within the UI.
Presented by Daniel Beach, Search Application Developer, OpenSource Connections
Solr is a powerful search engine, but creating a custom user interface can be daunting. In this fast paced session I will present an overview of how to implement a client-side search application using Solr. Using open-source frameworks like SpyGlass (to be released in September) can be a powerful way to jumpstart your development by giving you out-of-the box results views with support for faceting, autocomplete, and detail views. During this talk I will also demonstrate how we have built and deployed lightweight applications that are able to be performant under large user loads, with minimal server resources.
Presented by Julien Nioche, Director, DigitalPebble
This session will give an overview of Apache Nutch. I will describe its main components and how it fits with other Apache projects such as Hadoop, SOLR, Tika or HBase. The second part of the presentation will be focused on the latest developments in Nutch, the differences between the 1.x and 2.x branch and what we can expect to see in Nutch in the future. This session will cover many practical aspects and should be a good starting point to crawling on a large scale with Apache Nutch and SOLR.
Presented by Stefan Matheis, Freelance Software Engineer, Lucene/Solr PMC
Like many web applications in the past, the Solr Admin UI up until 4.0 was entirely server-based. It used separate code on the server to generate its dashboards, overviews, and statistics. All that code had to be maintained, and still you weren't really able to use that data for the things you needed it for: it was wrapped into HTML, most of the time difficult to extract, and its structure changed from time to time without announcement. After a short look back, we're going to look at the current state of the Solr Admin UI - a client-side application, running completely in your browser. We'll see how it works, where it gets its data from, and how you can get the very same data and wire it into your own custom applications, dashboards, and/or monitoring systems.
Presented by Shai Erera, Researcher, IBM
Lucene's arsenal has recently expanded to include two new modules: index sorting and replication. Index sorting lets you keep an index consistently sorted by some criterion (e.g. modification date), which enables efficient early termination of searches as well as better index compression. Index replication lets you replicate a search index to achieve high availability and fault tolerance, and to take hot index backups. In this talk we will introduce these modules and discuss implementation details, design decisions, and best practices.
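To illustrate why a sorted index enables early termination, here is a simplified, hypothetical Python model (not Lucene's actual API; Lucene implements this inside its sorting merge policy and collectors): because documents are already stored in rank order, a top-N query can stop scanning as soon as N hits are collected.

```python
# Simplified model of early termination on a sorted index (hypothetical
# sketch, not Lucene's implementation): documents are stored pre-sorted
# by the ranking criterion, e.g. modification date descending.

def top_n_matches(sorted_docs, predicate, n):
    """Collect the first n matching docs; index order == rank order."""
    hits = []
    for doc in sorted_docs:
        if predicate(doc):
            hits.append(doc)
            if len(hits) == n:
                # Early termination: every remaining doc ranks lower,
                # so the rest of the index need not be scanned at all.
                break
    return hits

# Example: docs sorted by modification date, newest first.
docs = [
    {"id": 1, "date": "2013-09-05", "tag": "lucene"},
    {"id": 2, "date": "2013-09-04", "tag": "solr"},
    {"id": 3, "date": "2013-09-03", "tag": "lucene"},
    {"id": 4, "date": "2013-09-01", "tag": "lucene"},
]
newest = top_n_matches(docs, lambda d: d["tag"] == "lucene", 2)
```

Without the sort, all four documents would have to be scored and ranked; with it, the scan stops after the second match.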
Presented by Mikhail Khludnev, Principal Engineer, Grid Dynamics
My team is building a next-generation eCommerce search platform for a major online retailer with quite challenging business requirements. It turns out the default Lucene toolbox doesn't ideally fit those challenges, so the team had to hack deep into the Lucene core to achieve our goals. We accumulated a deep understanding of Lucene's search internals and want to share our experience. We will start with an API overview, then look at essential search algorithms and their implementations in Lucene. Finally, we will review a few cases of query customization, pitfalls, and common performance problems.
Presented by Varun Thacker, Search Engineer, Unbxd Inc
This session is aimed at understanding how the ranking of documents works in Solr and at ways to improve the relevancy of your search application. We will cover how a user's query gets parsed in Solr and the default scoring that comes with it. I will show examples of how to customize scoring to work better with your dataset, how to add different relevancy signals to your ranking algorithm, and how to customize results for your top N queries.
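As a rough illustration of the kind of default scoring the session refers to, here is a simplified TF-IDF sketch in Python. It is only an approximation in the spirit of Lucene's classic similarity (length norms, coord, and queryNorm factors are omitted), with a hypothetical per-term boost standing in for the relevancy signals one might add; it is not Solr's actual code.

```python
import math

# Simplified TF-IDF scoring in the spirit of Lucene's classic similarity
# (length norms, coord, and queryNorm omitted for clarity). The "boosts"
# dict is a hypothetical stand-in for extra relevancy signals.

def idf(term, docs):
    df = sum(1 for d in docs if term in d)
    # Classic idf shape: 1 + log(numDocs / (docFreq + 1))
    return 1.0 + math.log(len(docs) / (df + 1))

def score(query_terms, doc, docs, boosts=None):
    boosts = boosts or {}
    s = 0.0
    for term in query_terms:
        tf = math.sqrt(doc.count(term))  # tf = sqrt(term frequency)
        s += tf * idf(term, docs) ** 2 * boosts.get(term, 1.0)
    return s

docs = [
    ["solr", "search", "engine"],
    ["lucene", "search", "library"],
    ["cooking", "recipes"],
]
# Rank all docs for the query [solr, search]: the rarer term "solr"
# carries a higher idf, so the doc containing it ranks first.
ranked = sorted(range(len(docs)),
                key=lambda i: score(["solr", "search"], docs[i], docs),
                reverse=True)
```

Tuning the boosts per term or per field is one simple way to fold dataset-specific signals into a ranking like this.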
Presented by Christine Rüb
The juris portal provides access to legal information (about 6.5 million documents) and information about German companies (about 23 million documents). Access is highly personalized: search, links, and search suggestions are customized according to the documents contained in a user's product collection. There are many search options, system stability and reliability requirements are high, and there are DVD versions of subsets of the complete collection. This session describes how the juris portal moved from Verity and Oracle to Solr: which decisions we made, which architecture we chose, how we handle indexing, and what results we achieved in areas like performance, hardware resources, and stability.
Presented by Peter Wolanin, Momentum Specialist, Acquia, Inc.
Drupal and Apache Solr search are a potent combination in the move toward "digital experiences" online, powering a growing number of customized, personalized enterprise platforms for eCommerce, healthcare, physical retail, and more. Drupal runs a growing portion of the web and has been adopted especially by governments around the world, the music industry, media organizations, and retailers. Whether you have a new web project or an existing Drupal site, the combination of Drupal and Apache Solr is both powerful and easy to set up. The indexing workflow built into the Drupal integration module provides a broad range of automatic facets based on the data fields that site administrators define on Drupal content. Drupal facilitates further customization of the UI, indexing, custom facets, and boosting thanks to an open architecture that offers multiple points where a minimal amount of custom code can alter the behavior. This session will provide a high-level overview of how the Drupal integration works, a tour of the UI configuration options, a few code snippets, and examples of successful Drupal sites using Apache Solr for search.