Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Lucene: Java-Dev

[jira] [Commented] (SOLR-3360) Problem with DataImportHandler multi-threaded

 

 

Lucene java-dev RSS feed   Index | Next | Previous | View Threaded


jira at apache

Apr 16, 2012, 8:38 AM

Post #1 of 19 (281 views)
Permalink
[jira] [Commented] (SOLR-3360) Problem with DataImportHandler multi-threaded

[ https://issues.apache.org/jira/browse/SOLR-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13254765#comment-13254765 ]

James Dyer commented on SOLR-3360:
----------------------------------

Claudio,

Thanks for reporting this. Was this working prior with 3.5? (We did some work with the "threads" feature in 3.6, so it'd be helpful to know if this is a new bug).

Also, can you try it (1) without any transformers and (2) with just the parent entity (take out the sub-entities). Do you get 10,000 or 1,000 ? This might help in diagnosing any maybe solving this problem.

Finally, you may want to be aware that 3.6 is the last release that will support the DIH "threads" feature. It simply had too many bugs and was too difficult to maintain to keep it in. But we did try and fix as many bugs for 3.6 as we could. Possibly in "fixing" what we could, we introduced this as a new problem?

> Problem with DataImportHandler multi-threaded
> ---------------------------------------------
>
> Key: SOLR-3360
> URL: https://issues.apache.org/jira/browse/SOLR-3360
> Project: Solr
> Issue Type: Bug
> Affects Versions: 3.6
> Environment: Solr 3.6.0, Apache Tomcat 6.0.20, jdk1.6.0_15, Windows XP
> Reporter: Claudio R
>
> Hi,
> If I use dataimport with 1 thread, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">5001</str>
> <str name="Total Rows Fetched">1000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:21:57</str>
> <str name="">Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:23:19</str>
> <str name="Total Documents Processed">1000</str>
> <str name="Time taken">0:1:22.390</str>
> </lst>
> If I use datamport with 10 threads, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">0</str>
> <str name="Total Rows Fetched">10000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:31:43</str>
> <str name="">Indexing completed. Added/Updated: 10000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:41:50</str>
> <str name="Total Documents Processed">10000</str>
> <str name="Time taken">0:10:7.586</str>
> </lst>
> The configuration of 10 threads consumed 10 times longer than the configuration with 1 thread.
> I have 1000 records in the database.
> My db-data-config.xml is shown below:
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
> <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://200.XXX.XXX.XXX:1433;databaseName=test" user="user" password="pass"/>
> <document>
> <entity name="indice" rootEntity="true" threads="10" transformer="RegexTransformer,TemplateTransformer" query="select top 1000 i.id_indice, i.a, i.b from indice i where i.status = 'I'" deltaImportQuery="i.id_indice, i.a, i.b from indice i where id_indice in ('${dataimporter.delta.id_indice}')" deltaQuery="select id_indice from indice where status='I' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)" deletedPkQuery="select id_indice from indice where status='D' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)">
> <field column="id_indice" name="id_indice" />
> <field column="a" name="a" />
> <field column="b" name="b" />
> <entity name="filtro" transformer="RegexTransformer,TemplateTransformer" query="select categoria, sub_categoria from filtro where indice_id_indice = '${indice.id_indice}'">
> <field name="filtro_categoria" column="categoria" />
> <field name="filtro_sub_categoria" column="sub_categoria" />
> <field name="nv_sub_categoria" column="nv_sub_categoria" template="${filtro.categoria}|${filtro.sub_categoria}" />
> </entity>
> <entity name="pagina_relacionada" query="select url from pagina_relacionada where indice_id_indice = '${indice.id_indice}'">
> <field name="pagina_relacionada_url" column="url" />
> </entity>
> <entity name="veja_mais" query="select chamada, url from veja_mais where indice_id_indice = '${indice.id_indice}'">
> <field name="veja_mais_chamada" column="chamada" />
> <field name="veja_mais_url" column="url" />
> </entity>
> <entity name="video" query="select url from video where indice_id_indice = '${indice.id_indice}'">
> <field name="video_url" column="url" />
> </entity>
> <entity name="galeria" query="select url from galeria where indice_id_indice = '${indice.id_indice}'">
> <field name="galeria_url" column="url" />
> </entity>
> </entity>
> </document>
> </dataConfig>
> Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Apr 16, 2012, 11:06 AM

Post #2 of 19 (356 views)
Permalink
[jira] [Commented] (SOLR-3360) Problem with DataImportHandler multi-threaded [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13254855#comment-13254855 ]

Claudio R commented on SOLR-3360:
---------------------------------

Hi James,

About the version 3.5.0, I got unstable behavior with 10 threads. In first full-import, I got successful import:

<lst name="statusMessages">
<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">1000</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-04-16 14:12:08</str>
<str name="">Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.</str>
<str name="Committed">2012-04-16 14:13:21</str>
<str name="Optimized">2012-04-16 14:13:21</str>
<str name="Total Documents Processed">1000</str>
<str name="Time taken ">0:1:12.875</str>
</lst>

But, in second, third full-import I got Indexing failed. Rolled back all changes.

<lst name="statusMessages">
<str name="Time Elapsed">0:0:6.906</str>
<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">12</str>
<str name="Total Documents Processed">11</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-04-16 14:15:38</str>
<str name="">Indexing failed. Rolled back all changes.</str>
<str name="Rolledback">2012-04-16 14:15:43</str>
</lst>

At catalina.out, I got:

SEVERE: Full Import failed:java.lang.RuntimeException: Error in multi-threaded import
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:265)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: select categoria, sub_categoria from filtro where indice_id_indice = '257346'
at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:253)
at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:210)
at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:39)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
at org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper.nextRow(ThreadedEntityProcessorWrapper.java:84)
at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.runAThread(DocBuilder.java:446)
at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.run(DocBuilder.java:399)
at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.runAThread(DocBuilder.java:466)
at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.access$000(DocBuilder.java:353)
at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner$1.run(DocBuilder.java:406)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: socket closed
at com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:1368)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:1355)
at com.microsoft.sqlserver.jdbc.TDSChannel.read(IOBuffer.java:1532)
at com.microsoft.sqlserver.jdbc.TDSReader.readPacket(IOBuffer.java:3274)
at com.microsoft.sqlserver.jdbc.TDSCommand.startResponse(IOBuffer.java:4433)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:784)
at com.microsoft.sqlserver.jdbc.SQLServerStatement$StmtExecCmd.doExecute(SQLServerStatement.java:685)
at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:4026)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1416)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:185)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:160)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.execute(SQLServerStatement.java:658)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:246)
... 13 more

In version 3.6.0 I did not get unstable behavior as obtained in version 3.5.0 with 10 threads.
In version 3.6.0 I tried without transformers:

<lst name="statusMessages">
<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">10000</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-04-16 14:30:39</str>
<str name="">Indexing completed. Added/Updated: 10000 documents. Deleted 0 documents.</str>
<str name="Committed">2012-04-16 14:39:45</str>
<str name="Total Documents Processed">10000</str>
<str name="Time taken">0:9:5.719</str>
</lst>

In version 3.6.0, with just parent entity:

<lst name="statusMessages">
<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">10000</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-04-16 14:42:49</str>
<str name="">Indexing completed. Added/Updated: 10000 documents. Deleted 0 documents.</str>
<str name="Committed">2012-04-16 14:49:05</str>
<str name="Total Documents Processed">10000</str>
<str name="Time taken">0:6:16.0</str>
</lst>

It's weird obtain 10000 documents processed. I only have 1000 records in the database.
Thanks.

> Problem with DataImportHandler multi-threaded
> ---------------------------------------------
>
> Key: SOLR-3360
> URL: https://issues.apache.org/jira/browse/SOLR-3360
> Project: Solr
> Issue Type: Bug
> Affects Versions: 3.6
> Environment: Solr 3.6.0, Apache Tomcat 6.0.20, jdk1.6.0_15, Windows XP
> Reporter: Claudio R
>
> Hi,
> If I use dataimport with 1 thread, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">5001</str>
> <str name="Total Rows Fetched">1000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:21:57</str>
> <str name="">Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:23:19</str>
> <str name="Total Documents Processed">1000</str>
> <str name="Time taken">0:1:22.390</str>
> </lst>
> If I use datamport with 10 threads, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">0</str>
> <str name="Total Rows Fetched">10000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:31:43</str>
> <str name="">Indexing completed. Added/Updated: 10000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:41:50</str>
> <str name="Total Documents Processed">10000</str>
> <str name="Time taken">0:10:7.586</str>
> </lst>
> The configuration of 10 threads consumed 10 times longer than the configuration with 1 thread.
> I have 1000 records in the database.
> My db-data-config.xml is shown below:
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
> <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://200.XXX.XXX.XXX:1433;databaseName=test" user="user" password="pass"/>
> <document>
> <entity name="indice" rootEntity="true" threads="10" transformer="RegexTransformer,TemplateTransformer" query="select top 1000 i.id_indice, i.a, i.b from indice i where i.status = 'I'" deltaImportQuery="i.id_indice, i.a, i.b from indice i where id_indice in ('${dataimporter.delta.id_indice}')" deltaQuery="select id_indice from indice where status='I' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)" deletedPkQuery="select id_indice from indice where status='D' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)">
> <field column="id_indice" name="id_indice" />
> <field column="a" name="a" />
> <field column="b" name="b" />
> <entity name="filtro" transformer="RegexTransformer,TemplateTransformer" query="select categoria, sub_categoria from filtro where indice_id_indice = '${indice.id_indice}'">
> <field name="filtro_categoria" column="categoria" />
> <field name="filtro_sub_categoria" column="sub_categoria" />
> <field name="nv_sub_categoria" column="nv_sub_categoria" template="${filtro.categoria}|${filtro.sub_categoria}" />
> </entity>
> <entity name="pagina_relacionada" query="select url from pagina_relacionada where indice_id_indice = '${indice.id_indice}'">
> <field name="pagina_relacionada_url" column="url" />
> </entity>
> <entity name="veja_mais" query="select chamada, url from veja_mais where indice_id_indice = '${indice.id_indice}'">
> <field name="veja_mais_chamada" column="chamada" />
> <field name="veja_mais_url" column="url" />
> </entity>
> <entity name="video" query="select url from video where indice_id_indice = '${indice.id_indice}'">
> <field name="video_url" column="url" />
> </entity>
> <entity name="galeria" query="select url from galeria where indice_id_indice = '${indice.id_indice}'">
> <field name="galeria_url" column="url" />
> </entity>
> </entity>
> </document>
> </dataConfig>
> Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Apr 16, 2012, 11:26 AM

Post #3 of 19 (273 views)
Permalink
[jira] [Commented] (SOLR-3360) Problem with DataImportHandler multi-threaded [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13254875#comment-13254875 ]

Claudio R commented on SOLR-3360:
---------------------------------

I ran in version 3.6.0 with 20 threads and got 20000 documents processed:

<lst name="statusMessages">
<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">20000</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-04-16 15:10:22</str>
<str name="">Indexing completed. Added/Updated: 20000 documents. Deleted 0 documents.</str>
<str name="Committed">2012-04-16 15:24:04</str>
<str name="Total Documents Processed">20000</str>
<str name="Time taken">0:13:42.110</str>
</lst>

> Problem with DataImportHandler multi-threaded
> ---------------------------------------------
>
> Key: SOLR-3360
> URL: https://issues.apache.org/jira/browse/SOLR-3360
> Project: Solr
> Issue Type: Bug
> Affects Versions: 3.6
> Environment: Solr 3.6.0, Apache Tomcat 6.0.20, jdk1.6.0_15, Windows XP
> Reporter: Claudio R
>
> Hi,
> If I use dataimport with 1 thread, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">5001</str>
> <str name="Total Rows Fetched">1000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:21:57</str>
> <str name="">Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:23:19</str>
> <str name="Total Documents Processed">1000</str>
> <str name="Time taken">0:1:22.390</str>
> </lst>
> If I use datamport with 10 threads, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">0</str>
> <str name="Total Rows Fetched">10000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:31:43</str>
> <str name="">Indexing completed. Added/Updated: 10000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:41:50</str>
> <str name="Total Documents Processed">10000</str>
> <str name="Time taken">0:10:7.586</str>
> </lst>
> The configuration of 10 threads consumed 10 times longer than the configuration with 1 thread.
> I have 1000 records in the database.
> My db-data-config.xml is shown below:
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
> <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://200.XXX.XXX.XXX:1433;databaseName=test" user="user" password="pass"/>
> <document>
> <entity name="indice" rootEntity="true" threads="10" transformer="RegexTransformer,TemplateTransformer" query="select top 1000 i.id_indice, i.a, i.b from indice i where i.status = 'I'" deltaImportQuery="i.id_indice, i.a, i.b from indice i where id_indice in ('${dataimporter.delta.id_indice}')" deltaQuery="select id_indice from indice where status='I' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)" deletedPkQuery="select id_indice from indice where status='D' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)">
> <field column="id_indice" name="id_indice" />
> <field column="a" name="a" />
> <field column="b" name="b" />
> <entity name="filtro" transformer="RegexTransformer,TemplateTransformer" query="select categoria, sub_categoria from filtro where indice_id_indice = '${indice.id_indice}'">
> <field name="filtro_categoria" column="categoria" />
> <field name="filtro_sub_categoria" column="sub_categoria" />
> <field name="nv_sub_categoria" column="nv_sub_categoria" template="${filtro.categoria}|${filtro.sub_categoria}" />
> </entity>
> <entity name="pagina_relacionada" query="select url from pagina_relacionada where indice_id_indice = '${indice.id_indice}'">
> <field name="pagina_relacionada_url" column="url" />
> </entity>
> <entity name="veja_mais" query="select chamada, url from veja_mais where indice_id_indice = '${indice.id_indice}'">
> <field name="veja_mais_chamada" column="chamada" />
> <field name="veja_mais_url" column="url" />
> </entity>
> <entity name="video" query="select url from video where indice_id_indice = '${indice.id_indice}'">
> <field name="video_url" column="url" />
> </entity>
> <entity name="galeria" query="select url from galeria where indice_id_indice = '${indice.id_indice}'">
> <field name="galeria_url" column="url" />
> </entity>
> </entity>
> </document>
> </dataConfig>
> Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Apr 16, 2012, 11:58 AM

Post #4 of 19 (277 views)
Permalink
[jira] [Commented] (SOLR-3360) Problem with DataImportHandler multi-threaded [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13254919#comment-13254919 ]

Mikhail Khludnev commented on SOLR-3360:
----------------------------------------

Claudio,

Thank you for providing feedback. I have several considerations for your issue:

# to be honest I didn't pay much attention to these counter when fixing threads, I didn't assert it. So, it might be a bug with counters. But the main subject is your index is it correct? Does it has expected number of docs? Are all master entities were properly connected to the details ones? Pls let us know your observations.

# even DIH code would be correct, you add too many threads. The reason of adding threads is get high CPU utilization, if you exceeds your IO limits you waste CPU time for contentions. Could you start from 2?

# I suppose significant time were spend for obtaining JDBC connections, btw how many of them are avalable in parallel? If you are not happy how DIH scales you can check what does it spent time for. Logs with debug level for DIH enabled are appreciated. You also can take sampling by jconsole, or even manually run jstack <JVMPID>

Thanks

> Problem with DataImportHandler multi-threaded
> ---------------------------------------------
>
> Key: SOLR-3360
> URL: https://issues.apache.org/jira/browse/SOLR-3360
> Project: Solr
> Issue Type: Bug
> Affects Versions: 3.6
> Environment: Solr 3.6.0, Apache Tomcat 6.0.20, jdk1.6.0_15, Windows XP
> Reporter: Claudio R
>
> Hi,
> If I use dataimport with 1 thread, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">5001</str>
> <str name="Total Rows Fetched">1000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:21:57</str>
> <str name="">Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:23:19</str>
> <str name="Total Documents Processed">1000</str>
> <str name="Time taken">0:1:22.390</str>
> </lst>
> If I use datamport with 10 threads, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">0</str>
> <str name="Total Rows Fetched">10000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:31:43</str>
> <str name="">Indexing completed. Added/Updated: 10000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:41:50</str>
> <str name="Total Documents Processed">10000</str>
> <str name="Time taken">0:10:7.586</str>
> </lst>
> The configuration of 10 threads consumed 10 times longer than the configuration with 1 thread.
> I have 1000 records in the database.
> My db-data-config.xml is shown below:
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
> <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://200.XXX.XXX.XXX:1433;databaseName=test" user="user" password="pass"/>
> <document>
> <entity name="indice" rootEntity="true" threads="10" transformer="RegexTransformer,TemplateTransformer" query="select top 1000 i.id_indice, i.a, i.b from indice i where i.status = 'I'" deltaImportQuery="i.id_indice, i.a, i.b from indice i where id_indice in ('${dataimporter.delta.id_indice}')" deltaQuery="select id_indice from indice where status='I' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)" deletedPkQuery="select id_indice from indice where status='D' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)">
> <field column="id_indice" name="id_indice" />
> <field column="a" name="a" />
> <field column="b" name="b" />
> <entity name="filtro" transformer="RegexTransformer,TemplateTransformer" query="select categoria, sub_categoria from filtro where indice_id_indice = '${indice.id_indice}'">
> <field name="filtro_categoria" column="categoria" />
> <field name="filtro_sub_categoria" column="sub_categoria" />
> <field name="nv_sub_categoria" column="nv_sub_categoria" template="${filtro.categoria}|${filtro.sub_categoria}" />
> </entity>
> <entity name="pagina_relacionada" query="select url from pagina_relacionada where indice_id_indice = '${indice.id_indice}'">
> <field name="pagina_relacionada_url" column="url" />
> </entity>
> <entity name="veja_mais" query="select chamada, url from veja_mais where indice_id_indice = '${indice.id_indice}'">
> <field name="veja_mais_chamada" column="chamada" />
> <field name="veja_mais_url" column="url" />
> </entity>
> <entity name="video" query="select url from video where indice_id_indice = '${indice.id_indice}'">
> <field name="video_url" column="url" />
> </entity>
> <entity name="galeria" query="select url from galeria where indice_id_indice = '${indice.id_indice}'">
> <field name="galeria_url" column="url" />
> </entity>
> </entity>
> </document>
> </dataConfig>
> Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Apr 16, 2012, 12:18 PM

Post #5 of 19 (273 views)
Permalink
[jira] [Commented] (SOLR-3360) Problem with DataImportHandler multi-threaded [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13254940#comment-13254940 ]

James Dyer commented on SOLR-3360:
----------------------------------

I think we need to verify whether or not it is adding the same 1000 documents 10x, or if its just counting each document 10x. The fact that the successful 10-thread 3.5 run took 1:12 but that same 10-thread run on 3.6 took 14:15 makes me wonder if each thread is actually duplicating the work and not just doing extra counting?

But then again the successful ONE-thread 3.6 run took 1:12 also... hmm...

Probably we need a unit test that does a simple SQL import with 2 threads and counts how many times SolrWriter#upload got called, then compares it both with the # of docs sent and the # docs reported to the user. Then we'd know what is actually broken. It'd be interesting to see what that same test against 3.5 does (if it can be made to run to completion). Possibly this is broken in 3.5 too (except the counters) but nobody noticed because they always got synchronization problems and gave up??

> Problem with DataImportHandler multi-threaded
> ---------------------------------------------
>
> Key: SOLR-3360
> URL: https://issues.apache.org/jira/browse/SOLR-3360
> Project: Solr
> Issue Type: Bug
> Affects Versions: 3.6
> Environment: Solr 3.6.0, Apache Tomcat 6.0.20, jdk1.6.0_15, Windows XP
> Reporter: Claudio R
>
> Hi,
> If I use dataimport with 1 thread, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">5001</str>
> <str name="Total Rows Fetched">1000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:21:57</str>
> <str name="">Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:23:19</str>
> <str name="Total Documents Processed">1000</str>
> <str name="Time taken">0:1:22.390</str>
> </lst>
> If I use datamport with 10 threads, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">0</str>
> <str name="Total Rows Fetched">10000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:31:43</str>
> <str name="">Indexing completed. Added/Updated: 10000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:41:50</str>
> <str name="Total Documents Processed">10000</str>
> <str name="Time taken">0:10:7.586</str>
> </lst>
> The configuration of 10 threads consumed 10 times longer than the configuration with 1 thread.
> I have 1000 records in the database.
> My db-data-config.xml is shown below:
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
> <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://200.XXX.XXX.XXX:1433;databaseName=test" user="user" password="pass"/>
> <document>
> <entity name="indice" rootEntity="true" threads="10" transformer="RegexTransformer,TemplateTransformer" query="select top 1000 i.id_indice, i.a, i.b from indice i where i.status = 'I'" deltaImportQuery="i.id_indice, i.a, i.b from indice i where id_indice in ('${dataimporter.delta.id_indice}')" deltaQuery="select id_indice from indice where status='I' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)" deletedPkQuery="select id_indice from indice where status='D' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)">
> <field column="id_indice" name="id_indice" />
> <field column="a" name="a" />
> <field column="b" name="b" />
> <entity name="filtro" transformer="RegexTransformer,TemplateTransformer" query="select categoria, sub_categoria from filtro where indice_id_indice = '${indice.id_indice}'">
> <field name="filtro_categoria" column="categoria" />
> <field name="filtro_sub_categoria" column="sub_categoria" />
> <field name="nv_sub_categoria" column="nv_sub_categoria" template="${filtro.categoria}|${filtro.sub_categoria}" />
> </entity>
> <entity name="pagina_relacionada" query="select url from pagina_relacionada where indice_id_indice = '${indice.id_indice}'">
> <field name="pagina_relacionada_url" column="url" />
> </entity>
> <entity name="veja_mais" query="select chamada, url from veja_mais where indice_id_indice = '${indice.id_indice}'">
> <field name="veja_mais_chamada" column="chamada" />
> <field name="veja_mais_url" column="url" />
> </entity>
> <entity name="video" query="select url from video where indice_id_indice = '${indice.id_indice}'">
> <field name="video_url" column="url" />
> </entity>
> <entity name="galeria" query="select url from galeria where indice_id_indice = '${indice.id_indice}'">
> <field name="galeria_url" column="url" />
> </entity>
> </entity>
> </document>
> </dataConfig>
> Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Apr 16, 2012, 1:58 PM

Post #6 of 19 (259 views)
Permalink
[jira] [Commented] (SOLR-3360) Problem with DataImportHandler multi-threaded [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13255044#comment-13255044 ]

Claudio R commented on SOLR-3360:
---------------------------------

Hi James Dyer and Mikhail Khludnev,

I added in logging.properties of tomcat the line below:

org.apache.solr.handler.dataimport.JdbcDataSource.level=FINE

And ran again in 3.6.0 with 10 threads.
The select below was performed 10 times

select url from video where indice_id_indice = '257933'

This select of sub-entity should have been executed only one time.



> Problem with DataImportHandler multi-threaded
> ---------------------------------------------
>
> Key: SOLR-3360
> URL: https://issues.apache.org/jira/browse/SOLR-3360
> Project: Solr
> Issue Type: Bug
> Affects Versions: 3.6
> Environment: Solr 3.6.0, Apache Tomcat 6.0.20, jdk1.6.0_15, Windows XP
> Reporter: Claudio R
>
> Hi,
> If I use dataimport with 1 thread, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">5001</str>
> <str name="Total Rows Fetched">1000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:21:57</str>
> <str name="">Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:23:19</str>
> <str name="Total Documents Processed">1000</str>
> <str name="Time taken">0:1:22.390</str>
> </lst>
> If I use datamport with 10 threads, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">0</str>
> <str name="Total Rows Fetched">10000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:31:43</str>
> <str name="">Indexing completed. Added/Updated: 10000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:41:50</str>
> <str name="Total Documents Processed">10000</str>
> <str name="Time taken">0:10:7.586</str>
> </lst>
> The configuration of 10 threads consumed 10 times longer than the configuration with 1 thread.
> I have 1000 records in the database.
> My db-data-config.xml is shown below:
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
> <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://200.XXX.XXX.XXX:1433;databaseName=test" user="user" password="pass"/>
> <document>
> <entity name="indice" rootEntity="true" threads="10" transformer="RegexTransformer,TemplateTransformer" query="select top 1000 i.id_indice, i.a, i.b from indice i where i.status = 'I'" deltaImportQuery="i.id_indice, i.a, i.b from indice i where id_indice in ('${dataimporter.delta.id_indice}')" deltaQuery="select id_indice from indice where status='I' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)" deletedPkQuery="select id_indice from indice where status='D' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)">
> <field column="id_indice" name="id_indice" />
> <field column="a" name="a" />
> <field column="b" name="b" />
> <entity name="filtro" transformer="RegexTransformer,TemplateTransformer" query="select categoria, sub_categoria from filtro where indice_id_indice = '${indice.id_indice}'">
> <field name="filtro_categoria" column="categoria" />
> <field name="filtro_sub_categoria" column="sub_categoria" />
> <field name="nv_sub_categoria" column="nv_sub_categoria" template="${filtro.categoria}|${filtro.sub_categoria}" />
> </entity>
> <entity name="pagina_relacionada" query="select url from pagina_relacionada where indice_id_indice = '${indice.id_indice}'">
> <field name="pagina_relacionada_url" column="url" />
> </entity>
> <entity name="veja_mais" query="select chamada, url from veja_mais where indice_id_indice = '${indice.id_indice}'">
> <field name="veja_mais_chamada" column="chamada" />
> <field name="veja_mais_url" column="url" />
> </entity>
> <entity name="video" query="select url from video where indice_id_indice = '${indice.id_indice}'">
> <field name="video_url" column="url" />
> </entity>
> <entity name="galeria" query="select url from galeria where indice_id_indice = '${indice.id_indice}'">
> <field name="galeria_url" column="url" />
> </entity>
> </entity>
> </document>
> </dataConfig>
> Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Apr 21, 2012, 6:19 AM

Post #7 of 19 (269 views)
Permalink
[jira] [Commented] (SOLR-3360) Problem with DataImportHandler multi-threaded [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258851#comment-13258851 ]

Mikhail Khludnev commented on SOLR-3360:
----------------------------------------

James,
I checked what was committed in SOLR-3011. The problem is that all N+1 cases ( <entity name=\"y\" query=\"select * from y where y.A=${x.id}\">\n") were dropped off from my patch for TestThreaded.java from 16th Mar. After they were not covered anymore, I suppose that "halting" problem (child entities selected again and again) were introduced by fix SOLR-3307 (shame on me I was out of the loop). My plan is bring N+1 cases back in TestThreaded.java, and provide correct fix for SOLR-3307. It's just first feeling. The worst case is SOLR-3307 can conflict with halting problem.

> Problem with DataImportHandler multi-threaded
> ---------------------------------------------
>
> Key: SOLR-3360
> URL: https://issues.apache.org/jira/browse/SOLR-3360
> Project: Solr
> Issue Type: Bug
> Affects Versions: 3.6
> Environment: Solr 3.6.0, Apache Tomcat 6.0.20, jdk1.6.0_15, Windows XP
> Reporter: Claudio R
>
> Hi,
> If I use dataimport with 1 thread, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">5001</str>
> <str name="Total Rows Fetched">1000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:21:57</str>
> <str name="">Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:23:19</str>
> <str name="Total Documents Processed">1000</str>
> <str name="Time taken">0:1:22.390</str>
> </lst>
> If I use datamport with 10 threads, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">0</str>
> <str name="Total Rows Fetched">10000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:31:43</str>
> <str name="">Indexing completed. Added/Updated: 10000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:41:50</str>
> <str name="Total Documents Processed">10000</str>
> <str name="Time taken">0:10:7.586</str>
> </lst>
> The configuration of 10 threads consumed 10 times longer than the configuration with 1 thread.
> I have 1000 records in the database.
> My db-data-config.xml is shown below:
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
> <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://200.XXX.XXX.XXX:1433;databaseName=test" user="user" password="pass"/>
> <document>
> <entity name="indice" rootEntity="true" threads="10" transformer="RegexTransformer,TemplateTransformer" query="select top 1000 i.id_indice, i.a, i.b from indice i where i.status = 'I'" deltaImportQuery="i.id_indice, i.a, i.b from indice i where id_indice in ('${dataimporter.delta.id_indice}')" deltaQuery="select id_indice from indice where status='I' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)" deletedPkQuery="select id_indice from indice where status='D' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)">
> <field column="id_indice" name="id_indice" />
> <field column="a" name="a" />
> <field column="b" name="b" />
> <entity name="filtro" transformer="RegexTransformer,TemplateTransformer" query="select categoria, sub_categoria from filtro where indice_id_indice = '${indice.id_indice}'">
> <field name="filtro_categoria" column="categoria" />
> <field name="filtro_sub_categoria" column="sub_categoria" />
> <field name="nv_sub_categoria" column="nv_sub_categoria" template="${filtro.categoria}|${filtro.sub_categoria}" />
> </entity>
> <entity name="pagina_relacionada" query="select url from pagina_relacionada where indice_id_indice = '${indice.id_indice}'">
> <field name="pagina_relacionada_url" column="url" />
> </entity>
> <entity name="veja_mais" query="select chamada, url from veja_mais where indice_id_indice = '${indice.id_indice}'">
> <field name="veja_mais_chamada" column="chamada" />
> <field name="veja_mais_url" column="url" />
> </entity>
> <entity name="video" query="select url from video where indice_id_indice = '${indice.id_indice}'">
> <field name="video_url" column="url" />
> </entity>
> <entity name="galeria" query="select url from galeria where indice_id_indice = '${indice.id_indice}'">
> <field name="galeria_url" column="url" />
> </entity>
> </entity>
> </document>
> </dataConfig>
> Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Apr 21, 2012, 1:22 PM

Post #8 of 19 (262 views)
Permalink
[jira] [Commented] (SOLR-3360) Problem with DataImportHandler multi-threaded [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258940#comment-13258940 ]

Mikhail Khludnev commented on SOLR-3360:
----------------------------------------

Ok.

I picked up old test https://raw.github.com/m-khl/solr-patches/solr3011/solr/contrib/dataimporthandler/src/test/org/apache/solr/handler/dataimport/TestThreaded.java

for testNPlusOneTenThreads_FullImport I have
21-04-2012 14:55:39 org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {deleteByQuery=*:*,add=[.3, 2, 4, 2, 1, 1, 2, 2, ... (40 adds)],commit=} 0 2086
which is what issue is about.

single thread is fine
testNPlusOneSingleThread_FullImport()
21.04.2012 13:08:21 org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {deleteByQuery=*:*,add=[2, 3, 4, 1],commit=} 0 1289

so, this test can reproduce the problem but not actually test it, I need to make it more restrictive.





> Problem with DataImportHandler multi-threaded
> ---------------------------------------------
>
> Key: SOLR-3360
> URL: https://issues.apache.org/jira/browse/SOLR-3360
> Project: Solr
> Issue Type: Bug
> Affects Versions: 3.6
> Environment: Solr 3.6.0, Apache Tomcat 6.0.20, jdk1.6.0_15, Windows XP
> Reporter: Claudio R
>
> Hi,
> If I use dataimport with 1 thread, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">5001</str>
> <str name="Total Rows Fetched">1000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:21:57</str>
> <str name="">Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:23:19</str>
> <str name="Total Documents Processed">1000</str>
> <str name="Time taken">0:1:22.390</str>
> </lst>
> If I use datamport with 10 threads, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">0</str>
> <str name="Total Rows Fetched">10000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:31:43</str>
> <str name="">Indexing completed. Added/Updated: 10000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:41:50</str>
> <str name="Total Documents Processed">10000</str>
> <str name="Time taken">0:10:7.586</str>
> </lst>
> The configuration of 10 threads consumed 10 times longer than the configuration with 1 thread.
> I have 1000 records in the database.
> My db-data-config.xml is shown below:
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
> <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://200.XXX.XXX.XXX:1433;databaseName=test" user="user" password="pass"/>
> <document>
> <entity name="indice" rootEntity="true" threads="10" transformer="RegexTransformer,TemplateTransformer" query="select top 1000 i.id_indice, i.a, i.b from indice i where i.status = 'I'" deltaImportQuery="i.id_indice, i.a, i.b from indice i where id_indice in ('${dataimporter.delta.id_indice}')" deltaQuery="select id_indice from indice where status='I' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)" deletedPkQuery="select id_indice from indice where status='D' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)">
> <field column="id_indice" name="id_indice" />
> <field column="a" name="a" />
> <field column="b" name="b" />
> <entity name="filtro" transformer="RegexTransformer,TemplateTransformer" query="select categoria, sub_categoria from filtro where indice_id_indice = '${indice.id_indice}'">
> <field name="filtro_categoria" column="categoria" />
> <field name="filtro_sub_categoria" column="sub_categoria" />
> <field name="nv_sub_categoria" column="nv_sub_categoria" template="${filtro.categoria}|${filtro.sub_categoria}" />
> </entity>
> <entity name="pagina_relacionada" query="select url from pagina_relacionada where indice_id_indice = '${indice.id_indice}'">
> <field name="pagina_relacionada_url" column="url" />
> </entity>
> <entity name="veja_mais" query="select chamada, url from veja_mais where indice_id_indice = '${indice.id_indice}'">
> <field name="veja_mais_chamada" column="chamada" />
> <field name="veja_mais_url" column="url" />
> </entity>
> <entity name="video" query="select url from video where indice_id_indice = '${indice.id_indice}'">
> <field name="video_url" column="url" />
> </entity>
> <entity name="galeria" query="select url from galeria where indice_id_indice = '${indice.id_indice}'">
> <field name="galeria_url" column="url" />
> </entity>
> </entity>
> </document>
> </dataConfig>
> Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Apr 22, 2012, 6:29 PM

Post #9 of 19 (259 views)
Permalink
[jira] [Commented] (SOLR-3360) Problem with DataImportHandler multi-threaded [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259333#comment-13259333 ]

Claudio R commented on SOLR-3360:
---------------------------------

Hi Mikhail and James,

When I ran the test with only the root entity (without sub-entities) the problem also occurs. This problem does not appear to be related only to the sub-entities (N+1 case).

> Problem with DataImportHandler multi-threaded
> ---------------------------------------------
>
> Key: SOLR-3360
> URL: https://issues.apache.org/jira/browse/SOLR-3360
> Project: Solr
> Issue Type: Bug
> Affects Versions: 3.6
> Environment: Solr 3.6.0, Apache Tomcat 6.0.20, jdk1.6.0_15, Windows XP
> Reporter: Claudio R
> Attachments: SOLR-3360-test.patch, SOLR-3360-test.patch, SOLR-3360.patch
>
>
> Hi,
> If I use dataimport with 1 thread, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">5001</str>
> <str name="Total Rows Fetched">1000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:21:57</str>
> <str name="">Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:23:19</str>
> <str name="Total Documents Processed">1000</str>
> <str name="Time taken">0:1:22.390</str>
> </lst>
> If I use datamport with 10 threads, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">0</str>
> <str name="Total Rows Fetched">10000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:31:43</str>
> <str name="">Indexing completed. Added/Updated: 10000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:41:50</str>
> <str name="Total Documents Processed">10000</str>
> <str name="Time taken">0:10:7.586</str>
> </lst>
> The configuration of 10 threads consumed 10 times longer than the configuration with 1 thread.
> I have 1000 records in the database.
> My db-data-config.xml is shown below:
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
> <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://200.XXX.XXX.XXX:1433;databaseName=test" user="user" password="pass"/>
> <document>
> <entity name="indice" rootEntity="true" threads="10" transformer="RegexTransformer,TemplateTransformer" query="select top 1000 i.id_indice, i.a, i.b from indice i where i.status = 'I'" deltaImportQuery="i.id_indice, i.a, i.b from indice i where id_indice in ('${dataimporter.delta.id_indice}')" deltaQuery="select id_indice from indice where status='I' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)" deletedPkQuery="select id_indice from indice where status='D' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)">
> <field column="id_indice" name="id_indice" />
> <field column="a" name="a" />
> <field column="b" name="b" />
> <entity name="filtro" transformer="RegexTransformer,TemplateTransformer" query="select categoria, sub_categoria from filtro where indice_id_indice = '${indice.id_indice}'">
> <field name="filtro_categoria" column="categoria" />
> <field name="filtro_sub_categoria" column="sub_categoria" />
> <field name="nv_sub_categoria" column="nv_sub_categoria" template="${filtro.categoria}|${filtro.sub_categoria}" />
> </entity>
> <entity name="pagina_relacionada" query="select url from pagina_relacionada where indice_id_indice = '${indice.id_indice}'">
> <field name="pagina_relacionada_url" column="url" />
> </entity>
> <entity name="veja_mais" query="select chamada, url from veja_mais where indice_id_indice = '${indice.id_indice}'">
> <field name="veja_mais_chamada" column="chamada" />
> <field name="veja_mais_url" column="url" />
> </entity>
> <entity name="video" query="select url from video where indice_id_indice = '${indice.id_indice}'">
> <field name="video_url" column="url" />
> </entity>
> <entity name="galeria" query="select url from galeria where indice_id_indice = '${indice.id_indice}'">
> <field name="galeria_url" column="url" />
> </entity>
> </entity>
> </document>
> </dataConfig>
> Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Apr 22, 2012, 10:27 PM

Post #10 of 19 (249 views)
Permalink
[jira] [Commented] (SOLR-3360) Problem with DataImportHandler multi-threaded [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259388#comment-13259388 ]

Mikhail Khludnev commented on SOLR-3360:
----------------------------------------

Claudio,

It's not clear what you did. Have you applied SOLR-3360.patch attached?

> Problem with DataImportHandler multi-threaded
> ---------------------------------------------
>
> Key: SOLR-3360
> URL: https://issues.apache.org/jira/browse/SOLR-3360
> Project: Solr
> Issue Type: Bug
> Affects Versions: 3.6
> Environment: Solr 3.6.0, Apache Tomcat 6.0.20, jdk1.6.0_15, Windows XP
> Reporter: Claudio R
> Attachments: SOLR-3360-test.patch, SOLR-3360-test.patch, SOLR-3360.patch
>
>
> Hi,
> If I use dataimport with 1 thread, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">5001</str>
> <str name="Total Rows Fetched">1000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:21:57</str>
> <str name="">Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:23:19</str>
> <str name="Total Documents Processed">1000</str>
> <str name="Time taken">0:1:22.390</str>
> </lst>
> If I use datamport with 10 threads, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">0</str>
> <str name="Total Rows Fetched">10000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:31:43</str>
> <str name="">Indexing completed. Added/Updated: 10000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:41:50</str>
> <str name="Total Documents Processed">10000</str>
> <str name="Time taken">0:10:7.586</str>
> </lst>
> The configuration of 10 threads consumed 10 times longer than the configuration with 1 thread.
> I have 1000 records in the database.
> My db-data-config.xml is shown below:
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
> <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://200.XXX.XXX.XXX:1433;databaseName=test" user="user" password="pass"/>
> <document>
> <entity name="indice" rootEntity="true" threads="10" transformer="RegexTransformer,TemplateTransformer" query="select top 1000 i.id_indice, i.a, i.b from indice i where i.status = 'I'" deltaImportQuery="i.id_indice, i.a, i.b from indice i where id_indice in ('${dataimporter.delta.id_indice}')" deltaQuery="select id_indice from indice where status='I' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)" deletedPkQuery="select id_indice from indice where status='D' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)">
> <field column="id_indice" name="id_indice" />
> <field column="a" name="a" />
> <field column="b" name="b" />
> <entity name="filtro" transformer="RegexTransformer,TemplateTransformer" query="select categoria, sub_categoria from filtro where indice_id_indice = '${indice.id_indice}'">
> <field name="filtro_categoria" column="categoria" />
> <field name="filtro_sub_categoria" column="sub_categoria" />
> <field name="nv_sub_categoria" column="nv_sub_categoria" template="${filtro.categoria}|${filtro.sub_categoria}" />
> </entity>
> <entity name="pagina_relacionada" query="select url from pagina_relacionada where indice_id_indice = '${indice.id_indice}'">
> <field name="pagina_relacionada_url" column="url" />
> </entity>
> <entity name="veja_mais" query="select chamada, url from veja_mais where indice_id_indice = '${indice.id_indice}'">
> <field name="veja_mais_chamada" column="chamada" />
> <field name="veja_mais_url" column="url" />
> </entity>
> <entity name="video" query="select url from video where indice_id_indice = '${indice.id_indice}'">
> <field name="video_url" column="url" />
> </entity>
> <entity name="galeria" query="select url from galeria where indice_id_indice = '${indice.id_indice}'">
> <field name="galeria_url" column="url" />
> </entity>
> </entity>
> </document>
> </dataConfig>
> Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Apr 23, 2012, 7:13 AM

Post #11 of 19 (251 views)
Permalink
[jira] [Commented] (SOLR-3360) Problem with DataImportHandler multi-threaded [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259618#comment-13259618 ]

Claudio R commented on SOLR-3360:
---------------------------------

Hi Mikhail,

I didnĀ“t apply the SOLR-3360.path. The my test was over version 3.6.0 final
Which svn revision should I apply the patch?

> Problem with DataImportHandler multi-threaded
> ---------------------------------------------
>
> Key: SOLR-3360
> URL: https://issues.apache.org/jira/browse/SOLR-3360
> Project: Solr
> Issue Type: Bug
> Affects Versions: 3.6
> Environment: Solr 3.6.0, Apache Tomcat 6.0.20, jdk1.6.0_15, Windows XP
> Reporter: Claudio R
> Attachments: SOLR-3360-test.patch, SOLR-3360-test.patch, SOLR-3360.patch
>
>
> Hi,
> If I use dataimport with 1 thread, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">5001</str>
> <str name="Total Rows Fetched">1000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:21:57</str>
> <str name="">Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:23:19</str>
> <str name="Total Documents Processed">1000</str>
> <str name="Time taken">0:1:22.390</str>
> </lst>
> If I use datamport with 10 threads, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">0</str>
> <str name="Total Rows Fetched">10000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:31:43</str>
> <str name="">Indexing completed. Added/Updated: 10000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:41:50</str>
> <str name="Total Documents Processed">10000</str>
> <str name="Time taken">0:10:7.586</str>
> </lst>
> The configuration of 10 threads consumed 10 times longer than the configuration with 1 thread.
> I have 1000 records in the database.
> My db-data-config.xml is shown below:
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
> <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://200.XXX.XXX.XXX:1433;databaseName=test" user="user" password="pass"/>
> <document>
> <entity name="indice" rootEntity="true" threads="10" transformer="RegexTransformer,TemplateTransformer" query="select top 1000 i.id_indice, i.a, i.b from indice i where i.status = 'I'" deltaImportQuery="i.id_indice, i.a, i.b from indice i where id_indice in ('${dataimporter.delta.id_indice}')" deltaQuery="select id_indice from indice where status='I' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)" deletedPkQuery="select id_indice from indice where status='D' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)">
> <field column="id_indice" name="id_indice" />
> <field column="a" name="a" />
> <field column="b" name="b" />
> <entity name="filtro" transformer="RegexTransformer,TemplateTransformer" query="select categoria, sub_categoria from filtro where indice_id_indice = '${indice.id_indice}'">
> <field name="filtro_categoria" column="categoria" />
> <field name="filtro_sub_categoria" column="sub_categoria" />
> <field name="nv_sub_categoria" column="nv_sub_categoria" template="${filtro.categoria}|${filtro.sub_categoria}" />
> </entity>
> <entity name="pagina_relacionada" query="select url from pagina_relacionada where indice_id_indice = '${indice.id_indice}'">
> <field name="pagina_relacionada_url" column="url" />
> </entity>
> <entity name="veja_mais" query="select chamada, url from veja_mais where indice_id_indice = '${indice.id_indice}'">
> <field name="veja_mais_chamada" column="chamada" />
> <field name="veja_mais_url" column="url" />
> </entity>
> <entity name="video" query="select url from video where indice_id_indice = '${indice.id_indice}'">
> <field name="video_url" column="url" />
> </entity>
> <entity name="galeria" query="select url from galeria where indice_id_indice = '${indice.id_indice}'">
> <field name="galeria_url" column="url" />
> </entity>
> </entity>
> </document>
> </dataConfig>
> Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Apr 23, 2012, 12:08 PM

Post #12 of 19 (246 views)
Permalink
[jira] [Commented] (SOLR-3360) Problem with DataImportHandler multi-threaded [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259843#comment-13259843 ]

Mikhail Khludnev commented on SOLR-3360:
----------------------------------------

Claudio,
patched sources are https://github.com/m-khl/solr-patches/tree/solr3360
patched jar is https://github.com/downloads/m-khl/solr-patches/apache-solr-dataimporthandler-3.6.1-SNAPSHOT.jar
I work with http://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_3_6/

> Problem with DataImportHandler multi-threaded
> ---------------------------------------------
>
> Key: SOLR-3360
> URL: https://issues.apache.org/jira/browse/SOLR-3360
> Project: Solr
> Issue Type: Bug
> Affects Versions: 3.6
> Environment: Solr 3.6.0, Apache Tomcat 6.0.20, jdk1.6.0_15, Windows XP
> Reporter: Claudio R
> Attachments: SOLR-3360-test.patch, SOLR-3360-test.patch, SOLR-3360.patch
>
>
> Hi,
> If I use dataimport with 1 thread, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">5001</str>
> <str name="Total Rows Fetched">1000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:21:57</str>
> <str name="">Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:23:19</str>
> <str name="Total Documents Processed">1000</str>
> <str name="Time taken">0:1:22.390</str>
> </lst>
> If I use datamport with 10 threads, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">0</str>
> <str name="Total Rows Fetched">10000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:31:43</str>
> <str name="">Indexing completed. Added/Updated: 10000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:41:50</str>
> <str name="Total Documents Processed">10000</str>
> <str name="Time taken">0:10:7.586</str>
> </lst>
> The configuration of 10 threads consumed 10 times longer than the configuration with 1 thread.
> I have 1000 records in the database.
> My db-data-config.xml is shown below:
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
> <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://200.XXX.XXX.XXX:1433;databaseName=test" user="user" password="pass"/>
> <document>
> <entity name="indice" rootEntity="true" threads="10" transformer="RegexTransformer,TemplateTransformer" query="select top 1000 i.id_indice, i.a, i.b from indice i where i.status = 'I'" deltaImportQuery="i.id_indice, i.a, i.b from indice i where id_indice in ('${dataimporter.delta.id_indice}')" deltaQuery="select id_indice from indice where status='I' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)" deletedPkQuery="select id_indice from indice where status='D' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)">
> <field column="id_indice" name="id_indice" />
> <field column="a" name="a" />
> <field column="b" name="b" />
> <entity name="filtro" transformer="RegexTransformer,TemplateTransformer" query="select categoria, sub_categoria from filtro where indice_id_indice = '${indice.id_indice}'">
> <field name="filtro_categoria" column="categoria" />
> <field name="filtro_sub_categoria" column="sub_categoria" />
> <field name="nv_sub_categoria" column="nv_sub_categoria" template="${filtro.categoria}|${filtro.sub_categoria}" />
> </entity>
> <entity name="pagina_relacionada" query="select url from pagina_relacionada where indice_id_indice = '${indice.id_indice}'">
> <field name="pagina_relacionada_url" column="url" />
> </entity>
> <entity name="veja_mais" query="select chamada, url from veja_mais where indice_id_indice = '${indice.id_indice}'">
> <field name="veja_mais_chamada" column="chamada" />
> <field name="veja_mais_url" column="url" />
> </entity>
> <entity name="video" query="select url from video where indice_id_indice = '${indice.id_indice}'">
> <field name="video_url" column="url" />
> </entity>
> <entity name="galeria" query="select url from galeria where indice_id_indice = '${indice.id_indice}'">
> <field name="galeria_url" column="url" />
> </entity>
> </entity>
> </document>
> </dataConfig>
> Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Apr 23, 2012, 12:36 PM

Post #13 of 19 (255 views)
Permalink
[jira] [Commented] (SOLR-3360) Problem with DataImportHandler multi-threaded [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259868#comment-13259868 ]

Claudio R commented on SOLR-3360:
---------------------------------

Hi Mikhail,

I did checkout from: http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_6_0, applied the SOLR-3360.patch and compiled the dataimporthandler. The code worked correctly. In the log there is no repetition of queries as before.

I got:

<lst name="statusMessages">
<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">1000</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-04-23 15:56:48</str>
<str name="">Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.</str>
<str name="Committed">2012-04-23 15:57:29</str>
<str name="Total Documents Processed">1000</str>
<str name="Time taken">0:0:41.390</str>
</lst>

The time spent decreased.
Before, with 1 thread I had obtained 0:1:22.390
Now, with 10 threads I obtain 0:0:41.390

Great job Mikhail.
Thank you very much.
Will this fix be present in version 3.6.1?



> Problem with DataImportHandler multi-threaded
> ---------------------------------------------
>
> Key: SOLR-3360
> URL: https://issues.apache.org/jira/browse/SOLR-3360
> Project: Solr
> Issue Type: Bug
> Affects Versions: 3.6
> Environment: Solr 3.6.0, Apache Tomcat 6.0.20, jdk1.6.0_15, Windows XP
> Reporter: Claudio R
> Attachments: SOLR-3360-test.patch, SOLR-3360-test.patch, SOLR-3360.patch
>
>
> Hi,
> If I use dataimport with 1 thread, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">5001</str>
> <str name="Total Rows Fetched">1000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:21:57</str>
> <str name="">Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:23:19</str>
> <str name="Total Documents Processed">1000</str>
> <str name="Time taken">0:1:22.390</str>
> </lst>
> If I use datamport with 10 threads, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">0</str>
> <str name="Total Rows Fetched">10000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:31:43</str>
> <str name="">Indexing completed. Added/Updated: 10000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:41:50</str>
> <str name="Total Documents Processed">10000</str>
> <str name="Time taken">0:10:7.586</str>
> </lst>
> The configuration of 10 threads consumed 10 times longer than the configuration with 1 thread.
> I have 1000 records in the database.
> My db-data-config.xml is shown below:
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
> <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://200.XXX.XXX.XXX:1433;databaseName=test" user="user" password="pass"/>
> <document>
> <entity name="indice" rootEntity="true" threads="10" transformer="RegexTransformer,TemplateTransformer" query="select top 1000 i.id_indice, i.a, i.b from indice i where i.status = 'I'" deltaImportQuery="i.id_indice, i.a, i.b from indice i where id_indice in ('${dataimporter.delta.id_indice}')" deltaQuery="select id_indice from indice where status='I' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)" deletedPkQuery="select id_indice from indice where status='D' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)">
> <field column="id_indice" name="id_indice" />
> <field column="a" name="a" />
> <field column="b" name="b" />
> <entity name="filtro" transformer="RegexTransformer,TemplateTransformer" query="select categoria, sub_categoria from filtro where indice_id_indice = '${indice.id_indice}'">
> <field name="filtro_categoria" column="categoria" />
> <field name="filtro_sub_categoria" column="sub_categoria" />
> <field name="nv_sub_categoria" column="nv_sub_categoria" template="${filtro.categoria}|${filtro.sub_categoria}" />
> </entity>
> <entity name="pagina_relacionada" query="select url from pagina_relacionada where indice_id_indice = '${indice.id_indice}'">
> <field name="pagina_relacionada_url" column="url" />
> </entity>
> <entity name="veja_mais" query="select chamada, url from veja_mais where indice_id_indice = '${indice.id_indice}'">
> <field name="veja_mais_chamada" column="chamada" />
> <field name="veja_mais_url" column="url" />
> </entity>
> <entity name="video" query="select url from video where indice_id_indice = '${indice.id_indice}'">
> <field name="video_url" column="url" />
> </entity>
> <entity name="galeria" query="select url from galeria where indice_id_indice = '${indice.id_indice}'">
> <field name="galeria_url" column="url" />
> </entity>
> </entity>
> </document>
> </dataConfig>
> Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Apr 24, 2012, 1:27 AM

Post #14 of 19 (255 views)
Permalink
[jira] [Commented] (SOLR-3360) Problem with DataImportHandler multi-threaded [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260351#comment-13260351 ]

Mikhail Khludnev commented on SOLR-3360:
----------------------------------------

James,

I checked the commit http://svn.apache.org/viewvc?view=revision&revision=1329444 code changes looks ok. But I insist on comitting TestThreaded too. It asserts quite important N+1 cases and Once-Datasource semantic, pls have a look essential but not last test coverage improvements https://github.com/m-khl/solr-patches/commit/0a98a827a2df6373ed7a227a240c822e2c150486#solr/contrib/dataimporthandler/src/test/org/apache/solr/handler/dataimport/TestThreaded.java . Patch already has these changes for TestThreaded.java. Do you like me to raise separate issue for improving test coverage or you want me to polish them somehow?


> Problem with DataImportHandler multi-threaded
> ---------------------------------------------
>
> Key: SOLR-3360
> URL: https://issues.apache.org/jira/browse/SOLR-3360
> Project: Solr
> Issue Type: Bug
> Affects Versions: 3.6
> Environment: Solr 3.6.0, Apache Tomcat 6.0.20, jdk1.6.0_15, Windows XP
> Reporter: Claudio R
> Assignee: James Dyer
> Fix For: 3.6.1
>
> Attachments: SOLR-3360-test.patch, SOLR-3360-test.patch, SOLR-3360.patch
>
>
> Hi,
> If I use dataimport with 1 thread, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">5001</str>
> <str name="Total Rows Fetched">1000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:21:57</str>
> <str name="">Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:23:19</str>
> <str name="Total Documents Processed">1000</str>
> <str name="Time taken">0:1:22.390</str>
> </lst>
> If I use datamport with 10 threads, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">0</str>
> <str name="Total Rows Fetched">10000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:31:43</str>
> <str name="">Indexing completed. Added/Updated: 10000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:41:50</str>
> <str name="Total Documents Processed">10000</str>
> <str name="Time taken">0:10:7.586</str>
> </lst>
> The configuration of 10 threads consumed 10 times longer than the configuration with 1 thread.
> I have 1000 records in the database.
> My db-data-config.xml is shown below:
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
> <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://200.XXX.XXX.XXX:1433;databaseName=test" user="user" password="pass"/>
> <document>
> <entity name="indice" rootEntity="true" threads="10" transformer="RegexTransformer,TemplateTransformer" query="select top 1000 i.id_indice, i.a, i.b from indice i where i.status = 'I'" deltaImportQuery="i.id_indice, i.a, i.b from indice i where id_indice in ('${dataimporter.delta.id_indice}')" deltaQuery="select id_indice from indice where status='I' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)" deletedPkQuery="select id_indice from indice where status='D' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)">
> <field column="id_indice" name="id_indice" />
> <field column="a" name="a" />
> <field column="b" name="b" />
> <entity name="filtro" transformer="RegexTransformer,TemplateTransformer" query="select categoria, sub_categoria from filtro where indice_id_indice = '${indice.id_indice}'">
> <field name="filtro_categoria" column="categoria" />
> <field name="filtro_sub_categoria" column="sub_categoria" />
> <field name="nv_sub_categoria" column="nv_sub_categoria" template="${filtro.categoria}|${filtro.sub_categoria}" />
> </entity>
> <entity name="pagina_relacionada" query="select url from pagina_relacionada where indice_id_indice = '${indice.id_indice}'">
> <field name="pagina_relacionada_url" column="url" />
> </entity>
> <entity name="veja_mais" query="select chamada, url from veja_mais where indice_id_indice = '${indice.id_indice}'">
> <field name="veja_mais_chamada" column="chamada" />
> <field name="veja_mais_url" column="url" />
> </entity>
> <entity name="video" query="select url from video where indice_id_indice = '${indice.id_indice}'">
> <field name="video_url" column="url" />
> </entity>
> <entity name="galeria" query="select url from galeria where indice_id_indice = '${indice.id_indice}'">
> <field name="galeria_url" column="url" />
> </entity>
> </entity>
> </document>
> </dataConfig>
> Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

May 4, 2012, 7:32 AM

Post #15 of 19 (224 views)
Permalink
[jira] [Commented] (SOLR-3360) Problem with DataImportHandler multi-threaded [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13268400#comment-13268400 ]

James Dyer commented on SOLR-3360:
----------------------------------

Mikhail,

I'm looking over your "n+1" test and I don't see how it tests anything that isn't being covered in what I committed. These three tests...

testCachedThreadless_FullImport()
testCachedSingleThread_FullImport()
testCachedThread_FullImport()

...they all randomly select (1) is the parent entity cached?, (2) is the child entity cached (or is it going to do an n+1 select)?, and (3) how many threads (2-10)? (for the third test).

What is your version testing that this doesn't cover?

> Problem with DataImportHandler multi-threaded
> ---------------------------------------------
>
> Key: SOLR-3360
> URL: https://issues.apache.org/jira/browse/SOLR-3360
> Project: Solr
> Issue Type: Bug
> Affects Versions: 3.6
> Environment: Solr 3.6.0, Apache Tomcat 6.0.20, jdk1.6.0_15, Windows XP
> Reporter: Claudio R
> Assignee: James Dyer
> Fix For: 3.6.1
>
> Attachments: SOLR-3360-test.patch, SOLR-3360-test.patch, SOLR-3360.patch
>
>
> Hi,
> If I use dataimport with 1 thread, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">5001</str>
> <str name="Total Rows Fetched">1000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:21:57</str>
> <str name="">Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:23:19</str>
> <str name="Total Documents Processed">1000</str>
> <str name="Time taken">0:1:22.390</str>
> </lst>
> If I use datamport with 10 threads, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">0</str>
> <str name="Total Rows Fetched">10000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:31:43</str>
> <str name="">Indexing completed. Added/Updated: 10000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:41:50</str>
> <str name="Total Documents Processed">10000</str>
> <str name="Time taken">0:10:7.586</str>
> </lst>
> The configuration of 10 threads consumed 10 times longer than the configuration with 1 thread.
> I have 1000 records in the database.
> My db-data-config.xml is shown below:
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
> <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://200.XXX.XXX.XXX:1433;databaseName=test" user="user" password="pass"/>
> <document>
> <entity name="indice" rootEntity="true" threads="10" transformer="RegexTransformer,TemplateTransformer" query="select top 1000 i.id_indice, i.a, i.b from indice i where i.status = 'I'" deltaImportQuery="i.id_indice, i.a, i.b from indice i where id_indice in ('${dataimporter.delta.id_indice}')" deltaQuery="select id_indice from indice where status='I' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)" deletedPkQuery="select id_indice from indice where status='D' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)">
> <field column="id_indice" name="id_indice" />
> <field column="a" name="a" />
> <field column="b" name="b" />
> <entity name="filtro" transformer="RegexTransformer,TemplateTransformer" query="select categoria, sub_categoria from filtro where indice_id_indice = '${indice.id_indice}'">
> <field name="filtro_categoria" column="categoria" />
> <field name="filtro_sub_categoria" column="sub_categoria" />
> <field name="nv_sub_categoria" column="nv_sub_categoria" template="${filtro.categoria}|${filtro.sub_categoria}" />
> </entity>
> <entity name="pagina_relacionada" query="select url from pagina_relacionada where indice_id_indice = '${indice.id_indice}'">
> <field name="pagina_relacionada_url" column="url" />
> </entity>
> <entity name="veja_mais" query="select chamada, url from veja_mais where indice_id_indice = '${indice.id_indice}'">
> <field name="veja_mais_chamada" column="chamada" />
> <field name="veja_mais_url" column="url" />
> </entity>
> <entity name="video" query="select url from video where indice_id_indice = '${indice.id_indice}'">
> <field name="video_url" column="url" />
> </entity>
> <entity name="galeria" query="select url from galeria where indice_id_indice = '${indice.id_indice}'">
> <field name="galeria_url" column="url" />
> </entity>
> </entity>
> </document>
> </dataConfig>
> Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

May 4, 2012, 10:34 AM

Post #16 of 19 (221 views)
Permalink
[jira] [Commented] (SOLR-3360) Problem with DataImportHandler multi-threaded [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13268557#comment-13268557 ]

Mikhail Khludnev commented on SOLR-3360:
----------------------------------------

TestThreaded in a patch has the following significant additions:

testNPlusOneThreadless_FullImport()
testNPlusOneSingleThread_FullImport()
testNPlusOneTenThreads_FullImport()

these guys use separate dataconfig called *dataConfigNPulsOne* which has " where y.A=${x.id}" as wells as the topic starter's config. Current tests covers only where=\"xid=x.id\" in-mem join scenario.

Also all data provided for MockDataSource are wrapped by new Once(parentRow) that enforces the verification of the subject issue - halting or query repeating problem.

Regards

> Problem with DataImportHandler multi-threaded
> ---------------------------------------------
>
> Key: SOLR-3360
> URL: https://issues.apache.org/jira/browse/SOLR-3360
> Project: Solr
> Issue Type: Bug
> Affects Versions: 3.6
> Environment: Solr 3.6.0, Apache Tomcat 6.0.20, jdk1.6.0_15, Windows XP
> Reporter: Claudio R
> Assignee: James Dyer
> Fix For: 3.6.1
>
> Attachments: SOLR-3360-test.patch, SOLR-3360-test.patch, SOLR-3360.patch
>
>
> Hi,
> If I use dataimport with 1 thread, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">5001</str>
> <str name="Total Rows Fetched">1000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:21:57</str>
> <str name="">Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:23:19</str>
> <str name="Total Documents Processed">1000</str>
> <str name="Time taken">0:1:22.390</str>
> </lst>
> If I use datamport with 10 threads, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">0</str>
> <str name="Total Rows Fetched">10000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:31:43</str>
> <str name="">Indexing completed. Added/Updated: 10000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:41:50</str>
> <str name="Total Documents Processed">10000</str>
> <str name="Time taken">0:10:7.586</str>
> </lst>
> The configuration of 10 threads consumed 10 times longer than the configuration with 1 thread.
> I have 1000 records in the database.
> My db-data-config.xml is shown below:
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
> <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://200.XXX.XXX.XXX:1433;databaseName=test" user="user" password="pass"/>
> <document>
> <entity name="indice" rootEntity="true" threads="10" transformer="RegexTransformer,TemplateTransformer" query="select top 1000 i.id_indice, i.a, i.b from indice i where i.status = 'I'" deltaImportQuery="i.id_indice, i.a, i.b from indice i where id_indice in ('${dataimporter.delta.id_indice}')" deltaQuery="select id_indice from indice where status='I' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)" deletedPkQuery="select id_indice from indice where status='D' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)">
> <field column="id_indice" name="id_indice" />
> <field column="a" name="a" />
> <field column="b" name="b" />
> <entity name="filtro" transformer="RegexTransformer,TemplateTransformer" query="select categoria, sub_categoria from filtro where indice_id_indice = '${indice.id_indice}'">
> <field name="filtro_categoria" column="categoria" />
> <field name="filtro_sub_categoria" column="sub_categoria" />
> <field name="nv_sub_categoria" column="nv_sub_categoria" template="${filtro.categoria}|${filtro.sub_categoria}" />
> </entity>
> <entity name="pagina_relacionada" query="select url from pagina_relacionada where indice_id_indice = '${indice.id_indice}'">
> <field name="pagina_relacionada_url" column="url" />
> </entity>
> <entity name="veja_mais" query="select chamada, url from veja_mais where indice_id_indice = '${indice.id_indice}'">
> <field name="veja_mais_chamada" column="chamada" />
> <field name="veja_mais_url" column="url" />
> </entity>
> <entity name="video" query="select url from video where indice_id_indice = '${indice.id_indice}'">
> <field name="video_url" column="url" />
> </entity>
> <entity name="galeria" query="select url from galeria where indice_id_indice = '${indice.id_indice}'">
> <field name="galeria_url" column="url" />
> </entity>
> </entity>
> </document>
> </dataConfig>
> Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

May 4, 2012, 2:02 PM

Post #17 of 19 (227 views)
Permalink
[jira] [Commented] (SOLR-3360) Problem with DataImportHandler multi-threaded [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13268704#comment-13268704 ]

Mikhail Khludnev commented on SOLR-3360:
----------------------------------------

OK. I check it and like your laconic approach.
* I can suggest to rename Once class to something more easily gettable (I'm sorry about that name).
* And also I've found that for N+1 case (where y.A=${x.id}\">\n") CachedSqlEntityProcessor makes no sense i.e. it should be
processor=\""+((childCached && !useWhereParam) ? "Cached":"")+"SqlEntityProcessor\">\n". It didn't work for me at least.
And it can be worth to warn user somehow if he configured CachedSqlEntityProcessor for N+1. Separate jira or YAGNI?

> Problem with DataImportHandler multi-threaded
> ---------------------------------------------
>
> Key: SOLR-3360
> URL: https://issues.apache.org/jira/browse/SOLR-3360
> Project: Solr
> Issue Type: Bug
> Affects Versions: 3.6
> Environment: Solr 3.6.0, Apache Tomcat 6.0.20, jdk1.6.0_15, Windows XP
> Reporter: Claudio R
> Assignee: James Dyer
> Fix For: 3.6.1
>
> Attachments: SOLR-3360-test.patch, SOLR-3360-test.patch, SOLR-3360-test.patch, SOLR-3360.patch
>
>
> Hi,
> If I use dataimport with 1 thread, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">5001</str>
> <str name="Total Rows Fetched">1000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:21:57</str>
> <str name="">Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:23:19</str>
> <str name="Total Documents Processed">1000</str>
> <str name="Time taken">0:1:22.390</str>
> </lst>
> If I use datamport with 10 threads, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">0</str>
> <str name="Total Rows Fetched">10000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:31:43</str>
> <str name="">Indexing completed. Added/Updated: 10000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:41:50</str>
> <str name="Total Documents Processed">10000</str>
> <str name="Time taken">0:10:7.586</str>
> </lst>
> The configuration of 10 threads consumed 10 times longer than the configuration with 1 thread.
> I have 1000 records in the database.
> My db-data-config.xml is shown below:
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
> <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://200.XXX.XXX.XXX:1433;databaseName=test" user="user" password="pass"/>
> <document>
> <entity name="indice" rootEntity="true" threads="10" transformer="RegexTransformer,TemplateTransformer" query="select top 1000 i.id_indice, i.a, i.b from indice i where i.status = 'I'" deltaImportQuery="i.id_indice, i.a, i.b from indice i where id_indice in ('${dataimporter.delta.id_indice}')" deltaQuery="select id_indice from indice where status='I' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)" deletedPkQuery="select id_indice from indice where status='D' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)">
> <field column="id_indice" name="id_indice" />
> <field column="a" name="a" />
> <field column="b" name="b" />
> <entity name="filtro" transformer="RegexTransformer,TemplateTransformer" query="select categoria, sub_categoria from filtro where indice_id_indice = '${indice.id_indice}'">
> <field name="filtro_categoria" column="categoria" />
> <field name="filtro_sub_categoria" column="sub_categoria" />
> <field name="nv_sub_categoria" column="nv_sub_categoria" template="${filtro.categoria}|${filtro.sub_categoria}" />
> </entity>
> <entity name="pagina_relacionada" query="select url from pagina_relacionada where indice_id_indice = '${indice.id_indice}'">
> <field name="pagina_relacionada_url" column="url" />
> </entity>
> <entity name="veja_mais" query="select chamada, url from veja_mais where indice_id_indice = '${indice.id_indice}'">
> <field name="veja_mais_chamada" column="chamada" />
> <field name="veja_mais_url" column="url" />
> </entity>
> <entity name="video" query="select url from video where indice_id_indice = '${indice.id_indice}'">
> <field name="video_url" column="url" />
> </entity>
> <entity name="galeria" query="select url from galeria where indice_id_indice = '${indice.id_indice}'">
> <field name="galeria_url" column="url" />
> </entity>
> </entity>
> </document>
> </dataConfig>
> Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

May 7, 2012, 9:42 AM

Post #18 of 19 (223 views)
Permalink
[jira] [Commented] (SOLR-3360) Problem with DataImportHandler multi-threaded [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269758#comment-13269758 ]

Mikhail Khludnev commented on SOLR-3360:
----------------------------------------

James,

I'm looking into but how /where="xid=x.id"/ is possible with processor="SqlEntityProcessor" ? It seems to me that your version of ternary operator is not so restrictive as mine.

> Problem with DataImportHandler multi-threaded
> ---------------------------------------------
>
> Key: SOLR-3360
> URL: https://issues.apache.org/jira/browse/SOLR-3360
> Project: Solr
> Issue Type: Bug
> Affects Versions: 3.6
> Environment: Solr 3.6.0, Apache Tomcat 6.0.20, jdk1.6.0_15, Windows XP
> Reporter: Claudio R
> Assignee: James Dyer
> Fix For: 3.6.1
>
> Attachments: SOLR-3360-test.patch, SOLR-3360-test.patch, SOLR-3360-test.patch, SOLR-3360-test.patch, SOLR-3360.patch
>
>
> Hi,
> If I use dataimport with 1 thread, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">5001</str>
> <str name="Total Rows Fetched">1000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:21:57</str>
> <str name="">Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:23:19</str>
> <str name="Total Documents Processed">1000</str>
> <str name="Time taken">0:1:22.390</str>
> </lst>
> If I use datamport with 10 threads, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">0</str>
> <str name="Total Rows Fetched">10000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:31:43</str>
> <str name="">Indexing completed. Added/Updated: 10000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:41:50</str>
> <str name="Total Documents Processed">10000</str>
> <str name="Time taken">0:10:7.586</str>
> </lst>
> The configuration of 10 threads consumed 10 times longer than the configuration with 1 thread.
> I have 1000 records in the database.
> My db-data-config.xml is shown below:
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
> <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://200.XXX.XXX.XXX:1433;databaseName=test" user="user" password="pass"/>
> <document>
> <entity name="indice" rootEntity="true" threads="10" transformer="RegexTransformer,TemplateTransformer" query="select top 1000 i.id_indice, i.a, i.b from indice i where i.status = 'I'" deltaImportQuery="i.id_indice, i.a, i.b from indice i where id_indice in ('${dataimporter.delta.id_indice}')" deltaQuery="select id_indice from indice where status='I' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)" deletedPkQuery="select id_indice from indice where status='D' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)">
> <field column="id_indice" name="id_indice" />
> <field column="a" name="a" />
> <field column="b" name="b" />
> <entity name="filtro" transformer="RegexTransformer,TemplateTransformer" query="select categoria, sub_categoria from filtro where indice_id_indice = '${indice.id_indice}'">
> <field name="filtro_categoria" column="categoria" />
> <field name="filtro_sub_categoria" column="sub_categoria" />
> <field name="nv_sub_categoria" column="nv_sub_categoria" template="${filtro.categoria}|${filtro.sub_categoria}" />
> </entity>
> <entity name="pagina_relacionada" query="select url from pagina_relacionada where indice_id_indice = '${indice.id_indice}'">
> <field name="pagina_relacionada_url" column="url" />
> </entity>
> <entity name="veja_mais" query="select chamada, url from veja_mais where indice_id_indice = '${indice.id_indice}'">
> <field name="veja_mais_chamada" column="chamada" />
> <field name="veja_mais_url" column="url" />
> </entity>
> <entity name="video" query="select url from video where indice_id_indice = '${indice.id_indice}'">
> <field name="video_url" column="url" />
> </entity>
> <entity name="galeria" query="select url from galeria where indice_id_indice = '${indice.id_indice}'">
> <field name="galeria_url" column="url" />
> </entity>
> </entity>
> </document>
> </dataConfig>
> Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

May 7, 2012, 12:28 PM

Post #19 of 19 (241 views)
Permalink
[jira] [Commented] (SOLR-3360) Problem with DataImportHandler multi-threaded [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269925#comment-13269925 ]

Mikhail Khludnev commented on SOLR-3360:
----------------------------------------

James,

I can't reproduce the failure.
mkhl$ ant test-contrib -Dtests.seed=-55eeb72d0a16dfec:4e1a59f5738a6b25:4bf3cbf2bd3b659a -Dtestcase=TestThreaded

junit report

{code}
<property name="tests.seed" value="-55eeb72d0a16dfec:4e1a59f5738a6b25:4bf3cbf2bd3b659a" />

<testcase classname="org.apache.solr.handler.dataimport.TestThreaded" name="testCachedThread_FullImport" time="0.965" />
<testcase classname="org.apache.solr.handler.dataimport.TestThreaded" name="testCachedThreadless_FullImport" time="0.055" />
<testcase classname="org.apache.solr.handler.dataimport.TestThreaded" name="testCachedSingleThread_FullImport" time="0.053" />

{code}


Finally I added _Total test which enumerates all test params
https://github.com/m-khl/solr-patches/commit/0532e653a3319247519f90bd8987c84171ac6a56.diff

at core i5 MacBook Pro

:solr mkhl$ ant test-contrib -Dtestcase=TestThreaded
junit-sequential:
[junit] Testsuite: org.apache.solr.handler.dataimport.TestThreaded
[junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 3.899 sec
[junit]

junit-parallel:

{code}
</properties>
<testcase classname="org.apache.solr.handler.dataimport.TestThreaded" name="testCachedThread_FullImport" time="1.042" />
<testcase classname="org.apache.solr.handler.dataimport.TestThreaded" name="testCachedThreadless_FullImport" time="0.047" />
<testcase classname="org.apache.solr.handler.dataimport.TestThreaded" name="testCachedSingleThread_FullImport" time="0.052" />
<testcase classname="org.apache.solr.handler.dataimport.TestThreaded" name="testCachedThread_Total" time="0.898" />
<system-out><![CDATA[]]></system-out>
{code}


Please give me a clue how to reproduce the failure. What do you use IDE or script? Have you clean before test? Pls show me exact command, junit report, log/output, etc

> Problem with DataImportHandler multi-threaded
> ---------------------------------------------
>
> Key: SOLR-3360
> URL: https://issues.apache.org/jira/browse/SOLR-3360
> Project: Solr
> Issue Type: Bug
> Affects Versions: 3.6
> Environment: Solr 3.6.0, Apache Tomcat 6.0.20, jdk1.6.0_15, Windows XP
> Reporter: Claudio R
> Assignee: James Dyer
> Fix For: 3.6.1
>
> Attachments: SOLR-3360-test.patch, SOLR-3360-test.patch, SOLR-3360-test.patch, SOLR-3360-test.patch, SOLR-3360-test.patch, SOLR-3360-test.patch, SOLR-3360.patch
>
>
> Hi,
> If I use dataimport with 1 thread, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">5001</str>
> <str name="Total Rows Fetched">1000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:21:57</str>
> <str name="">Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:23:19</str>
> <str name="Total Documents Processed">1000</str>
> <str name="Time taken">0:1:22.390</str>
> </lst>
> If I use datamport with 10 threads, I got:
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">0</str>
> <str name="Total Rows Fetched">10000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:31:43</str>
> <str name="">Indexing completed. Added/Updated: 10000 documents. Deleted 0 documents.</str>
> <str name="Committed">2012-04-16 11:41:50</str>
> <str name="Total Documents Processed">10000</str>
> <str name="Time taken">0:10:7.586</str>
> </lst>
> The configuration of 10 threads consumed 10 times longer than the configuration with 1 thread.
> I have 1000 records in the database.
> My db-data-config.xml is shown below:
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
> <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://200.XXX.XXX.XXX:1433;databaseName=test" user="user" password="pass"/>
> <document>
> <entity name="indice" rootEntity="true" threads="10" transformer="RegexTransformer,TemplateTransformer" query="select top 1000 i.id_indice, i.a, i.b from indice i where i.status = 'I'" deltaImportQuery="i.id_indice, i.a, i.b from indice i where id_indice in ('${dataimporter.delta.id_indice}')" deltaQuery="select id_indice from indice where status='I' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)" deletedPkQuery="select id_indice from indice where status='D' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)">
> <field column="id_indice" name="id_indice" />
> <field column="a" name="a" />
> <field column="b" name="b" />
> <entity name="filtro" transformer="RegexTransformer,TemplateTransformer" query="select categoria, sub_categoria from filtro where indice_id_indice = '${indice.id_indice}'">
> <field name="filtro_categoria" column="categoria" />
> <field name="filtro_sub_categoria" column="sub_categoria" />
> <field name="nv_sub_categoria" column="nv_sub_categoria" template="${filtro.categoria}|${filtro.sub_categoria}" />
> </entity>
> <entity name="pagina_relacionada" query="select url from pagina_relacionada where indice_id_indice = '${indice.id_indice}'">
> <field name="pagina_relacionada_url" column="url" />
> </entity>
> <entity name="veja_mais" query="select chamada, url from veja_mais where indice_id_indice = '${indice.id_indice}'">
> <field name="veja_mais_chamada" column="chamada" />
> <field name="veja_mais_url" column="url" />
> </entity>
> <entity name="video" query="select url from video where indice_id_indice = '${indice.id_indice}'">
> <field name="video_url" column="url" />
> </entity>
> <entity name="galeria" query="select url from galeria where indice_id_indice = '${indice.id_indice}'">
> <field name="galeria_url" column="url" />
> </entity>
> </entity>
> </document>
> </dataConfig>
> Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene

Lucene java-dev RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.