
gsingers at apache
Jul 12, 2009, 4:46 AM
On Jul 7, 2009, at 9:55 AM, Jukka Zitting wrote:

> Hi,
>
> On Tue, Jul 7, 2009 at 3:09 PM, Uwe Schindler <uwe [at] thetaphi> wrote:
>> I am a little bit confused: I thought the call for papers/talks is
>> over?
>
> Yes, the CFP has ended and we (currently just me and Grant) have all
> the submissions.

FWIW, all of them are already listed on the Lucene Meetup page (http://wiki.apache.org/lucene-java/LuceneAtApacheConUs2009), so I think we can just state them. I will send a follow-up with some scheduling ideas. We need to get this done.

CFPs:

===================================
TRACK: Lucene (two days)
===================================

347: Building Intelligent Search Applications with the Lucene Stack
Presentation, 60 minutes by Grant Ingersoll

Apache Lucene has evolved in recent years beyond a core search library into a top-level project containing a whole suite of tools for working with content. Starting with Solr, which builds on the core Lucene search library, we can add in tools like Tika, Mahout, Droids and other open source libraries to build intelligent search applications. This talk will focus on how to leverage the various components of the Lucene Stack to build out intelligent search applications that better enable users to find what they are looking for in today's sea of content.

366: Apache Solr: Out of the Box
Presentation, 60 minutes by Chris Hostetter

Apache Solr is an HTTP-based enterprise search server built on top of the Lucene Java search library. In this session we will see how quick and easy it can be to install and configure Solr to provide full-text searching of structured data without needing to write any custom code. We will demonstrate various built-in features such as: loading data from CSV files, tolerant parsing of user input, faceted searching, highlighting matched text in results, and retrieving search results in a variety of formats (XML, JSON, etc.).
We will also look at using Solr's Administrative interface to understand how different text analysis configuration options affect our results, and why various results score the way they do against different searches. No previous Solr experience is expected.

367: Apache Solr: Beyond the Box
Presentation, 60 minutes by Chris Hostetter

Apache Solr is an HTTP-based enterprise search server built on top of the Lucene Java search library. In this session we will look at Solr's internal Java APIs and discuss how to write various types of plugins for customizing its behavior, as well as some real-world examples of "When" and "Why" it makes sense to do so.

415: Implementing an Information Retrieval Framework for an Organizational Repository
Presentation, 60 minutes by Sithu D Sudarsan

Successful Information Retrieval (IR) frameworks for large repositories have been reported in recent times. Invariably, all of them have used machine-readable repositories, where plain text availability is the norm. However, organizations with legacy archives need to develop a framework which first converts the non-electronic archive to an electronic archive and then extracts machine-readable text with an acceptable error rate. The Food and Drug Administration (FDA) has electronic images of the documents collected as part of their charter to approve and monitor products related to health care. These documents date back multiple decades and have formats which range from microfiche through early optical character recognition to recent electronic formats. We believe that a large knowledge base hidden in them could be mined. To mine this knowledge base, we are developing a semantic mining framework using open source tools such as Lucene, PDFBox, Solr, POI, and Java. Challenges include determining the quality of text being extracted and the ability to handle documents containing formatted text in part.
The text itself may contain specific vocabularies from medical, legal, engineering and scientific domains and terminology that evolves over time. Careful thought needs to be given to selecting analyzers for indexing and retrieval and implementing a framework for heuristics useful to domain experts as well as novices. An initial prototype is currently being evaluated with a sample size of over 100,000 documents and 70GB of data for different extractors, analyzers and search heuristics, with multiple indices for each document stored in a distributed fashion.

424: Apache Mahout - Going from raw data to information
Presentation, 60 minutes by Isabel Drost

It has become very easy to create, publish, and collect data in digital form. The volume of structured and unstructured data is increasing at a tremendous pace. This has led to a whole new set of applications that can be built if one solves the problem of turning raw data into valuable information. Possible applications include but are not limited to: discovering new trends from a stream of weblog entries, and automatic learning approaches for supplementing market research processes for new products.

Machine learning provides tools for building such applications. A large community of researchers has been working on the topic of learning from data. Although a lot of information on algorithms and solutions to common problems is publicly available, scaling these solutions into the range of terabytes and petabytes is an open issue. To scale algorithms to such dimensions it is indispensable to distribute data as well as computation.

The mission of the Mahout project is to build a suite of scalable machine learning algorithms that can cope with today's amount of data. The project is built on top of Hadoop. This talk provides a beginner-friendly introduction to the topic of machine learning. It presents a broad set of applications that benefit from machine learning.
The presentation gives a high-level overview of the project itself: the types of tasks that can be solved with each algorithm and the pitfalls one needs to look out for when using it.

426: MIME Magic with Apache Tika
Presentation, 60 minutes by Jukka Zitting

Apache Tika is a Lucene subproject whose purpose is to make it easier to extract metadata and structured text content from all kinds of files. Tika leverages libraries like Apache POI and PDFBox to provide a powerful yet simple interface for parsing dozens of document formats. This makes Tika an ideal companion for Apache Lucene or any other search engine that needs to be able to index metadata and content from many different types of files.

This presentation introduces Apache Tika and shows how it's being used in projects like Apache Solr and Apache Jackrabbit. You will learn how to integrate Tika with your application and how to configure and extend Tika to best suit your needs. The presentation also summarizes the key characteristics of the more widely used file formats and metadata standards, and shows how Tika can help deal with that complexity. The audience is expected to have a basic understanding of Java programming and MIME media types.

493: Solr Flair: User Interfaces, powered by Apache Solr
Presentation, 60 minutes by Erik Hatcher

Come see Solr in a new light, with snazzy innovative user interfaces. We'll talk about Solr's flexible capabilities for driving custom user interfaces and how projects like SolrJS and "Solritas" bring Solr to the front-end. We'll experience user interfaces in a variety of front-end technologies, including PHP, Ruby on Rails, Java, Velocity, jQuery, and SIMILE Timeline. We'll have Ajax, clouds, maps, timelines, and set visualizations, oh my!

512: Advanced Indexing Techniques with Apache Lucene
Presentation, 60 minutes by Michael Busch

Just as in 2007 and 2008, we will talk in this presentation about the latest indexing and search innovations in Lucene and how to use them.
The payloads feature that was added in 2007 enabled many new interesting use cases. The Lucene developers continued working on Flexible Indexing, and so far a new flexible TokenStream API, a configurable indexing chain and pluggable indexing consumers have been developed. We are also working on column-stride fields, a feature which will perform better than payloads for many use cases. This talk will give an overview of the latest progress and demonstrate the new features with interesting use cases.