Best method for sequentially searching a large number of records

Started by scifidude, May 04, 2023, 07:45:10 PM

Previous topic - Next topic

scifidude

Hello all.

I've been writing WinBatch programs for 20+ years and love it as a way to get things done quickly on Windows systems, and for its ability to support some very sophisticated applications.

I'm currently working on a WinBatch application that needs to sequentially search through about 100K to a million records, looking for fields in each record specified by the client using the app.  As I don't know what the client is going to ask for, I need to look at every record.  The records are variable length, from 50 to 200 bytes, and are ANSI characters.

Does anyone have any recommendations on how to efficiently search large record sets?  Currently, I'm developing the code using a WinBatch "list" read from a flat file when the program starts.  I also experimented with putting the data in an array but ran into the "out of string space" error.

My two concerns are the speed at which the client's query is solved and being confident I won't run into memory issues with nearly a million records.

Thanks for any input you care to share

Scifidude

JTaylor

Is this some type of delimited data or is the entire line the field and you are searching for a substring of the line?

Might look at the Binary functions.  Not sure if they would help or not though.

Jim

td

Quote from: scifidude on May 04, 2023, 07:45:10 PM
I'm currently working on a WinBatch application that needs to sequentially search through about 100K to a million records, looking for fields in each record specified by the client using the app. ... The records are variable length, from 50 to 200 bytes, and are ANSI characters.
...
My two concerns are the speed at which the client's query is solved and being confident I won't run into memory issues with nearly a million records.


A few suggestions that may or may not be helpful.

Since speed is important, you want to avoid heavy file IO in a tight loop. To do that, start with a recent version of 64-bit WinBatch because it supports a larger string table than the 32-bit version; the current limit for 64-bit WinBatch is about 350 MB. Assuming you do not consume too much string memory processing the file, you might be able to handle a data set up to about 150 MB. That number is just a rough guess because of the number of factors involved. If you load the whole file, you can treat the contents as a single string and use one of the StrIndex* functions to perform your search.
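
As a rough sketch of that load-once-and-scan idea (the file path and search target below are placeholders, and it assumes the whole file fits in the 64-bit string table):

Code (winbatch) Select
; Load the whole file into one string and count every occurrence of the
; target. FileGet reads the entire file; StrIndex does a forward scan.
strData = FileGet("c:\temp\records.txt")   ; placeholder path
strFind = "T123"                           ; placeholder search target
nHits   = 0
nPos    = StrIndex(strData, strFind, 1, @Fwdscan)
while nPos > 0
   nHits += 1
   nPos = StrIndex(strData, strFind, nPos + 1, @Fwdscan)
endwhile
Message("String scan", "Found ":nHits:" occurrence(s)")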

64-bit WinBatch can allocate binary buffers up to the size of available virtual memory, and your data sets are much smaller than that on most modern systems. One of the BinaryIndex* functions could then be used to search the loaded buffer for your target string.

Whether or not these approaches are worth considering depends a lot on what happens after the search target is found.

Good luck.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

td

Tried this on an over .5 GB file with success.
Code (winbatch) Select
nStart = GetTickCount64()
strBigFile = "f:\temp\combined.csv"
strItem = "bob"

; Load the entire file into a binary buffer.
hBig = BinaryAlloc(FileSize(strBigFile))
BigBytes = BinaryRead(hBig, strBigFile)   ; bytes actually read

; Linear search for the first occurrence of the target.
Offset = BinaryIndexEx(hBig, 0, strItem, @Fwdscan, 0)
BinaryFree(hBig)

nElapse = (GetTickCount64()-nStart)/1000.0
if Offset >= 0 then Message(strItem, "Found at offset: ":Offset:" in ":nElapse:" sec.")
else Message(strItem, "Not found":" in ":nElapse:" sec.")
exit


But the script took about 35 seconds when searching for a nonexistent item.  BinaryIndexEx performs a linear search, so that is to be expected.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

stanl

Quote from: JTaylor on May 04, 2023, 08:47:40 PM
Is this some type of delimited data or is the entire line the field and you are searching for a substring of the line?


Sort of echoing Jim's question: are you looking for a count of the string in the entire file, or output of the line/line number in the file where the string occurs [assuming there are lines, as the original post referenced 'fields']?

td

That is what I meant previously when I mentioned that whether or not these approaches are worth considering depends greatly on what happens after the search target is found. BinaryIndexEx can be used to make repeated searches, each starting just after the previously found item. Accounting for record numbers would require a multiple-level search that would likely be no more efficient than using FileOpen/FileRead in a loop. Perhaps a Regex solution would work in the latter case, but that would require some testing to determine performance.
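
For example, a minimal sketch of that repeated-search idea, reusing the buffer approach from the earlier script (file path and target are placeholders):

Code (winbatch) Select
; Count every occurrence by restarting the search just past the last hit.
strBigFile = "f:\temp\combined.csv"   ; placeholder path
strItem    = "bob"                    ; placeholder target
hBig = BinaryAlloc(FileSize(strBigFile))
BinaryRead(hBig, strBigFile)
nHits  = 0
Offset = BinaryIndexEx(hBig, 0, strItem, @Fwdscan, 0)
while Offset >= 0
   nHits += 1
   Offset = BinaryIndexEx(hBig, Offset + 1, strItem, @Fwdscan, 0)
endwhile
BinaryFree(hBig)
Message(strItem, "Found ":nHits:" time(s)")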

Or maybe Excel COM would be appropriate. Don't really know what size limits Excel has though.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

stanl

Quote from: td on May 05, 2023, 11:07:27 AM
That is what I meant previously when I mentioned that whether or not these approaches are worth considering depends greatly on what happens after the search target is found. BinaryIndexEx can be used to make repeated searches, each starting just after the previously found item. Accounting for record numbers would require a multiple-level search that would likely be no more efficient than using FileOpen/FileRead in a loop. Perhaps a Regex solution would work in the latter case, but that would require some testing to determine performance.

Or maybe Excel COM would be appropriate. Don't really know what size limits Excel has though.


Is an Array out of the question?

JTaylor

OP said he ran out of memory going that route.  Not sure if he was using the 64-bit version or not.

Quote from: stanl on May 05, 2023, 11:54:51 AM
Is an Array out of the question?

stanl

Quote from: JTaylor on May 05, 2023, 12:26:02 PM
OP said he ran out of memory going that route.  Not sure if he was using the 64-bit version or not.

Quote from: stanl on May 05, 2023, 11:54:51 AM
Is an Array out of the question?


Maybe a .NET StreamReader, breaking the data into chunks of arrays.
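
Something along these lines via CLR hosting, though simplified to plain line-by-line reading rather than array chunks (this sketch assumes mscorlib types are available to ObjectClrNew without a "use" call; path and target are placeholders):

Code (winbatch) Select
; Read the file one record at a time with System.IO.StreamReader and test
; each line for the target substring.
strFile = "c:\temp\records.txt"   ; placeholder path
strFind = "T123"                  ; placeholder target
nHits   = 0
objRdr  = ObjectClrNew("System.IO.StreamReader", strFile)
while objRdr.Peek() >= 0
   strLine = objRdr.ReadLine()
   if StrIndex(strLine, strFind, 1, @Fwdscan) > 0 then nHits += 1
endwhile
objRdr.Close()
objRdr = 0
Message("StreamReader scan", "Found ":nHits:" matching record(s)")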

scifidude

First, thanks to all of you for your input.

Think of this app as a way to keep track of items in your collections.  Using baseball cards as an example, the client would first establish what information or attributes they would like to track (and search for) about each card in the collection.  Examples would be the condition of the card, from "mint" to "poor"; the date the card was printed (if that can be determined); what you paid for it; the latest appraised value; team name; location of the card in your storage bin; etc.

As each card is acquired the app is used to create a record which contains the various values of the attributes.  So, a sample record in the file could look like this...

C5D19871206P8A16T123L012R235

with C5 being its "Condition" 5, D1987... being Dec 6, 1987, P8... paid $8 for it, A16... last appraisal was $16, T123... team 123, L012... location 12, and R235... record number 235.  The record number could be used to provide further detailed information that is not searchable but is accessible via the app.

When this file is even partially complete, you now have a way to locate one or more cards in your collection and pull them out to maybe sell, display, or trade.

The search program in the app allows the client to make multiple selections with and/or logic for each attribute - for instance, if you wanted to locate all Jackie Robinson cards AND condition 4 AND appraised at $10 or more.

So, with each initiated search I need to go through all the records, as I don't know how many Jackie Robinson cards I have, where they are in the file, what the condition of each card is, etc.  Besides that, I might need to search through each record multiple times, depending on the number of attributes defined and the complexity of the search.  That's why I'm concerned about the response time of the search when you have a 100K-card collection.

By spending the time to pre-load the data file into memory before allowing the search to occur, I don't have to read the records from disk more than once, so I'm really looking for the fastest way to sequentially go through the data in memory using either a WinBatch list, an array, or the Binary functions.
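
To illustrate the kind of in-memory scan I have in mind, here's a rough sketch that treats the loaded file as a WinBatch list and applies one AND query (the path, tag letters, and thresholds are placeholders, and it assumes every record carries every field):

Code (winbatch) Select
; GetField returns the digits that follow a one-letter attribute tag,
; e.g. GetField("C5D19871206P8A16T123L012R235", "A") returns "16".
#DefineFunction GetField(strRec, strTag)
   nPos = StrIndex(strRec, strTag, 1, @Fwdscan)
   if nPos == 0 then Return("")
   strVal = ""
   nLen = StrLen(strRec)
   for nAt = nPos + 1 to nLen
      strChr = StrSub(strRec, nAt, 1)
      if !IsNumber(strChr) then break
      strVal = strVal : strChr
   next
   Return(strVal)
#EndFunction

; Load once, then scan every record, applying the query logic in memory.
strData = FileGet("c:\temp\cards.dat")   ; placeholder path, one record per line
strHits = ""
nRecs = ItemCount(strData, @LF)
for i = 1 to nRecs
   strRec = StrTrim(ItemExtract(i, strData, @LF))
   if strRec == "" then continue
   ; Example query: condition 4 AND appraised at $10 or more.
   if 0 + GetField(strRec, "C") == 4 && 0 + GetField(strRec, "A") >= 10
      strHits = strHits : GetField(strRec, "R") : @TAB   ; collect record numbers
   endif
next
Message("Matching record numbers", strHits)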

I'm trying to make this as simple and as fast as possible, which is why I haven't considered using a database.

Based on what I've read in your posts, I will likely upgrade to the latest WB+Compiler and compile this as a 64-bit executable, then test the performance of each memory store and search method.  I am working on creating test versions of the record set with 100K, 500K, and 1M records, and I'll post the resulting performance benchmarks.

Thanks again.
Scifidude

stanl

Quote from: scifidude on May 05, 2023, 07:45:18 PM

As each card is acquired the app is used to create a record which contains the various values of the attributes.  So, a sample record in the file could look like this...

C5D19871206P8A16T123L012R235



Based on what I am interpreting from the above portion of your reply, it seems like it would require a search of one or more substrings, and if it were me I would prefer a database, especially if the 'attributes' are fixed 'fields'. But good luck.

JTaylor

I would highly recommend using SQLite.  I can't imagine you will even begin to match the searching performance and flexibility, overall, going a line-by-line text-search route.

Jim

stanl

Quote from: JTaylor on May 06, 2023, 07:31:31 AM
I would highly recommend using SQLite.  I can't imagine you will even begin to match the searching performance and flexibility, overall, going a line-by-line text-search route.

Jim


I was going to mention SQLite and your Extender, had the OP replied. Especially if it plays well with 64-bit.

JTaylor

It doesn't work with 64-bit, but I'm not sure that would matter much, as the main benefit in this case would be the extra memory and speed in certain functions, which would not be an issue when using a DB.

There is so much wrapped up in the Extender that it would be extremely difficult to make a 64-bit version, at least for someone as ignorant as me.  I can't even upgrade the compiler version in VS, as it generates so many errors throughout the various libraries.  The most extraordinary thing about my Extender is that it works at all :)  I occasionally think about splitting it back up so I could do such things for some of them, but the hassle of dealing with 8-10 different Extenders doesn't seem worth it.

Guess this isn't a very good sales pitch  ;)

Jim

stanl

WB can still use SQLite, even without the Extender. Then there are open-source alternatives: https://www.litedb.org

td

SQLite seems like a good choice. One option is a dotNet Framework assembly package that provides WIL scripts access to SQLite functionality via WinBatch CLR hosting. The dotNet overhead should be minimal, as all the heavy lifting is still done in native code.  But given the data format, a straight linear file-content search with a bit of post-hit processing has merit too.

Binary buffers have some ability to create hash tables using the BinaryHashRec function, but that would require that the file creation and appending be under the OP's control. It also requires fixed-length records and requires the hashed part of a record to be specified for each search.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

stanl

Quote from: td on May 06, 2023, 03:31:47 PM
SQLite seems like a good choice. One option is a dotNet Framework assembly package that provides WIL scripts access to SQLite functionality via WinBatch CLR hosting.


There are also several articles in the Tech DB, several of them dealing with the SQLite ODBC driver. Here is an oldie:

https://techsupt.winbatch.com/webcgi/webbatch.exe?techsupt/getarticle.web+WinBatch/OLE~COM~ADO~CDO~ADSI~LDAP/SQLite+Create~SQLite~DB~and~Table.txt

td

Not recommending nor disparaging this approach for the OP. Just passing a very simple example along for the curious.

Code (winbatch) Select
;; Assembly downloaded from here:
;;  https://system.data.sqlite.org/index.html/doc/17fd4c1265/www/downloads.wiki

; Delete previous test DB.
strFile = "f:\Reinstalls\SQLite\sqlite-netFx45-static-binary-bundle-Win32-2012-1.0.88.0\test.db"
if FileExist(strFile) then FileDelete(strFile)

; Set appbase to extracted files location.
ObjectClrOption("appbase", "f:\Reinstalls\SQLite\sqlite-netFx45-static-binary-bundle-Win32-2012-1.0.88.0")
ObjectClrOption("use", "System.Data.SQLite")

objSqlLiteCon = objectClrNew("System.Data.SQLite.SQLiteConnection", "Data Source=%strFile%;Pooling=true")
ObjectClrOption("useany", "System.Data")
objDataState = ObjectClrNew( "System.Data.ConnectionState")
objSqlLiteCon.Open()
if objSqlLiteCon.State == objDataState.Open 

   ; Create a table.
   objSqlLiteCmd = objectClrNew("System.Data.SQLite.SQLiteCommand")
   objSqlLiteCmd.Connection  = objSqlLiteCon
   objSqlLiteCmd.CommandText = "CREATE TABLE employee_data (emp_id INTEGER, emp_name TEXT);"
   objSqlLiteCmd.ExecuteNonQuery()
   objSqlLiteCmd.CommandText = "INSERT into employee_data (emp_id, emp_name) VALUES (001,""Bob Abergast"");"
   objSqlLiteCmd.ExecuteNonQuery()
   objSqlLiteCmd.CommandText = "INSERT into employee_data (emp_id, emp_name) VALUES (002,""Marquess Doe"");"
   objSqlLiteCmd.ExecuteNonQuery()
   objSqlLiteCmd.CommandText = "Select * from employee_data;"
   objReader = objSqlLiteCmd.ExecuteReader()
   
   ; Check the newly created table.
   while objReader.Read
      Message(objReader.GetValue(0), objReader.GetValue(1))
   endwhile
   
   ; Cleanup
   objReader.Close()
   objSqlLiteCon.Close()
   objReader     = 0
   objSqlLiteCmd = 0
endif
objDataState  = 0
objSqlLiteCon = 0


<edit> There is a Nuget package for SQLite assemblies mentioned here:
https://system.data.sqlite.org/index.html/doc/trunk/www/faq.wiki#q5
Nuget package files can be treated as nothing more than glorified zip files. 7-Zip is a good tool to use to extract package content.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

stanl

Cannot find 1.0.88.0 at that URL, only 1.0.117... Oops, it is there.


EDIT:


Regardless of whether you use Jim's Extender, ODBC, or .NET, SQLite is adequate for the job, and data from query results can be sent to a WB ReportView.

td

Performed a quick test on a 2,000,000-record database with 10 columns using the Nuget dotNet SQLite assembly. Using a SELECT command with a LIKE operator for each column and a value that did not exist in any of the columns, the query took 0.766 seconds.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

td

Created a quick and dirty test script using the Nuget package assembly.  The 32-bit version of SQLite can't handle a .5 GB CSV file but can convert a .25 GB CSV file in short order.

Code (winbatch) Select
; The 2,000,000 row DB
strDb = "F:\temp\org2000000.db"

; Initialize a couple of variables.
nStart = GetTickCount64()
nCnt = 0

; Set appbase to extracted Nuget assembly location.
ObjectClrOption("appbase", "F:\Reinstalls\SQLite\Nuget\stub.system.data.sqlite.core.netframework.1.0.117\lib\net46")
ObjectClrOption("use", "System.Data.SQLite")

objSqlLiteCon = objectClrNew("System.Data.SQLite.SQLiteConnection", "Data Source=%strDb%;Pooling=true")
objDataState = ObjectClrNew( "System.Data.ConnectionState")
objSqlLiteCon.Open()
if objSqlLiteCon.State == objDataState.Open 

   objSqlLiteCmd = objectClrNew("System.Data.SQLite.SQLiteCommand")
   objSqlLiteCmd.Connection  = objSqlLiteCon
   
   ; Likely a better way to write this query.
   objSqlLiteCmd.CommandText = `SELECT * FROM org_data WHERE "Index" LIKE "bob" OR "Organization Id" LIKE "bob" OR "Name" LIKE "bob" OR "Website" LIKE "bob" OR "Website" LIKE "bob" OR "Country" LIKE "bob" OR "Description" LIKE "bob" OR "Founded" LIKE "bob" OR "Industry" LIKE "bob" OR "Number of employees" LIKE "bob";`   
   objReader = objSqlLiteCmd.ExecuteReader()
   
   ; Count the found rows.
   while objReader.Read
      nCnt += 1
   endwhile

   ; Cleanup
   objReader.Close()
   objSqlLiteCon.Close()
   objReader     = 0
   objSqlLiteCmd = 0
endif
objDataState  = 0
objSqlLiteCon = 0

nElapse = (GetTickCount64()-nStart)/1000.0
if nCnt > 0 then Message("SQLite Search", "Found: ":nCnt:" rows in ":nElapse:" sec.")
else  Message("SQLite Search", "No rows found in ":nElapse:" sec.")
exit



"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

kdmoyers

Quote from: td on May 09, 2023, 09:51:12 AM
Created a quick and dirty test script using the Nuget package assembly.... 

Wow, I got this to work!  Thanks Tony!!
The mind is everything; What you think, you become.

td

FWIW, 32-bit SQLite can handle .5 GB CSV files once the operator error is removed from the process...
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

stanl

Before the OP feels his thread has been hijacked: if some mocked-up text can be uploaded, similar to the example


C5D19871206P8A16T123L012R235


with parsing instructions, then it would be easy to post code to move the text data into a SQLite db  8)
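
In the meantime, here is a hedged sketch of what that could look like, reusing the Nuget assembly and connection pattern from td's scripts above (the appbase path, file path, and table/column names are placeholders, and GetField is the same tag parser sketched earlier in the thread):

Code (winbatch) Select
; Parse each letter-tagged record and insert it into a SQLite table.
#DefineFunction GetField(strRec, strTag)   ; digits following a one-letter tag
   nPos = StrIndex(strRec, strTag, 1, @Fwdscan)
   if nPos == 0 then Return("")
   strVal = ""
   nLen = StrLen(strRec)
   for nAt = nPos + 1 to nLen
      if !IsNumber(StrSub(strRec, nAt, 1)) then break
      strVal = strVal : StrSub(strRec, nAt, 1)
   next
   Return(strVal)
#EndFunction

strDb  = "f:\temp\cards.db"      ; placeholder DB path
strTxt = "c:\temp\cards.dat"     ; placeholder data file
if FileExist(strDb) then FileDelete(strDb)

; Same assembly location as the earlier scripts.
ObjectClrOption("appbase", "F:\Reinstalls\SQLite\Nuget\stub.system.data.sqlite.core.netframework.1.0.117\lib\net46")
ObjectClrOption("use", "System.Data.SQLite")
objCon = ObjectClrNew("System.Data.SQLite.SQLiteConnection", "Data Source=%strDb%;Pooling=true")
objCon.Open()
objCmd = ObjectClrNew("System.Data.SQLite.SQLiteCommand")
objCmd.Connection  = objCon
objCmd.CommandText = "CREATE TABLE cards (cond INTEGER, printed TEXT, paid INTEGER, appraised INTEGER, team INTEGER, loc INTEGER, recno INTEGER);"
objCmd.ExecuteNonQuery()

; One INSERT per record.
strData = FileGet(strTxt)
nRecs = ItemCount(strData, @LF)
for i = 1 to nRecs
   strRec = StrTrim(ItemExtract(i, strData, @LF))
   if strRec == "" then continue
   nCond = GetField(strRec, "C")
   strDt = GetField(strRec, "D")
   nPaid = GetField(strRec, "P")
   nAppr = GetField(strRec, "A")
   nTeam = GetField(strRec, "T")
   nLoc  = GetField(strRec, "L")
   nRec  = GetField(strRec, "R")
   objCmd.CommandText = "INSERT INTO cards VALUES (%nCond%,'%strDt%',%nPaid%,%nAppr%,%nTeam%,%nLoc%,%nRec%);"
   objCmd.ExecuteNonQuery()
next
objCon.Close()
objCmd = 0
objCon = 0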

td

I suspect the OP can sort it out for himself, given his previous posts. If not, he can always ask here.

A direct link to the Nuget package:

https://system.data.sqlite.org/downloads/1.0.117.0/System.Data.SQLite.1.0.117.0.nupkg
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade