Normally we can use tools like Jet's Text IISAM to import delimited text. But sometimes our delimited text might not be in a file. Perhaps we received it from a web service or a TCP connection or something, and we don't want to take the step of writing the data to disk just to turn around and import it into our database.
The ADO Recordset has a GetString method that can be used to convert its contents to a delimited text String value fairly easily. If only we had an inverse function, a sort of PutString we could used?
PutString
Here is a function that does just that. It takes care of parsing the delimited columns and rows and posts these to a database table using an append-only cursor Recordset.
All of this seems pretty well optimized, though with effort you might squeeze out another millisecond or two. The commonly advocated "split the splits" approach is far slower than this logic:
Demo
PutString is contained in the attached demo within Module1.bas.
This demo creates a new empty database with a single table SOMETABLE on its first run. Once it has an open database connection it first deletes all rows (if any) from SOMETABLE.
Then it creates a big String containing 5000 rows with 8 random data fields (of several types). This String has TAB column delimiters and CR/LF row delimiters.
Then it calls PutString to append the data to SOMETABLE, displays a MsgBox with the elapsed time for the PutString, and ends.
The compiled program takes from 0.12 to 0.16 seconds to do the PutString call here, but the Timer() function isn't very accurate for small intervals.
Issues
I think I have the bugs out of the parsing logic.
This has only been tested with the Jet 4.0 provider, and I'm not sure how well it will do with other DBMSs. With Jet I found no advantage at all to wrapping the appends in a transaction or using batch updating, both whizzy performance gaining techniques according to common wisdom (which often isn't wise at all). Using any form of client Recordset only hurt performance, pretty much as expected.
Of course many variables have been left out, for example other connections could be updating, holding locks, etc. and that could make a huge difference.
Opening the database with exclusive access gains you a little more performance too. When you aren't sharing a database this is always a good bet, since eliminating locking naturally improves performance. The demo just lets this default to shared access.
Nasty Issues
The ADO Recordset's GetString method has a nasty secret. Not quite that big of a secret to classic ASP scripters since it was tripped over quite early. That secret is:
How does this matter?
What about Boolean values? What about fractional numeric values?
It turns out the PutString has the very same limitation (or is that a feature??).
As far as I can determine through testing, the demo should work just fine even in one of the Central European locales (e.g. Germany) with funky number punctuation it different wors for "true" and for "false." That's because it is using the locale-aware CStr() function when building the big test String value.
However the main reasons to work with a delimited text tabular data format are (a.) persisting, and (b.) interchanging data.
So a program running on a German language machine can't use this for talking to a French language machine. A French machine can't talk to an English machine because the number formats may match but the Booleans are goofy.
SetThreadLocale
The clever may think they know the answer, but calling SetThreadLocale passing LOCALE_INVARIANT won't cut it. For that matter the story is more complicated for supported versions of Windows anyway, involving SetThreadUILanguage calls.
But as far as I can tell the Variant parsing/formatting routines within OLE Automation that ADO makes use of lock in the locale pretty early and are not swayed by flippity-flopping locale settings around GetString or (my) PutString calls.
The ADO Recordset has a GetString method that can be used to convert its contents to a delimited text String value fairly easily. If only we had an inverse function, a sort of PutString we could used?
PutString
Here is a function that does just that. It takes care of parsing the delimited columns and rows and posts these to a database table using an append-only cursor Recordset.
All of this seems pretty well optimized, though with effort you might squeeze out another millisecond or two. The commonly advocated "split the splits" approach is far slower than this logic:
Code:
Private Function PutString( _
ByRef StringData As String, _
ByVal Connection As ADODB.Connection, _
ByVal TableName As String, _
ByVal ColumnIds As Variant, _
Optional ByVal ColumnDelimiter As String = vbTab, _
Optional ByVal RowDelimiter As String = vbCr, _
Optional ByVal NullExpr As Variant = vbNullString) As Long
'A sort of "inverse analog" of the ADO Recordset's GetString() method.
'
'Returns count of rows added.
Dim SaveCursorLocation As CursorLocationEnum
Dim RS As ADODB.Recordset
Dim ColumnStart As Long
Dim ColumnLength As Long
Dim ColumnValues() As Variant
Dim Pos As Long
Dim NewPos As Long
Dim RowLimit As Long
Dim I As Long
Dim AtRowEnd As Boolean
If (VarType(ColumnIds) And vbArray) = 0 Then Err.Raise 5 'Invalid procedure call or argument.
SaveCursorLocation = Connection.CursorLocation
Connection.CursorLocation = adUseServer 'Required to create this fast-append Recordset:
With New ADODB.Command
Set .ActiveConnection = Connection
.CommandType = adCmdTable
.CommandText = TableName
.Properties![Append-Only Rowset] = True
.Properties![Own Changes Visible] = False 'Doesn't matter when using exclusive access.
.Properties![Others' Changes Visible] = False 'Doesn't matter when using exclusive access.
Set RS = .Execute()
End With
Connection.CursorLocation = SaveCursorLocation
ReDim ColumnValues(UBound(ColumnIds))
Pos = 1
Do
RowLimit = InStr(Pos, StringData, RowDelimiter)
If RowLimit = 0 Then RowLimit = Len(StringData) + 1
I = 0
AtRowEnd = False
Do
ColumnStart = Pos
NewPos = InStr(Pos, StringData, ColumnDelimiter)
If NewPos = 0 Or NewPos > RowLimit Then
Pos = InStr(Pos, StringData, RowDelimiter)
ColumnLength = RowLimit - ColumnStart
If Pos <> 0 Then
Pos = Pos + Len(RowDelimiter)
'Auto-handle CrLf when RowDelimiter is vbCr. GetString()
'itself defaults to vbCr as the RowDelimiter. Some software
'strangely enough will use a mix of vbCr and vbCrLf:
If RowDelimiter = vbCr Then
If Mid$(StringData, Pos, 1) = vbLf Then Pos = Pos + 1
End If
End If
AtRowEnd = True
Else
Pos = NewPos
ColumnLength = Pos - ColumnStart
Pos = Pos + Len(ColumnDelimiter)
End If
ColumnValues(I) = Trim$(Mid$(StringData, ColumnStart, ColumnLength))
If Not IsMissing(NullExpr) Then
If ColumnValues(I) = NullExpr Then ColumnValues(I) = Null
End If
I = I + 1
Loop Until AtRowEnd
RS.AddNew ColumnIds, ColumnValues
PutString = PutString + 1
Loop Until Pos = 0 Or Pos > Len(StringData)
End Function
Demo
PutString is contained in the attached demo within Module1.bas.
This demo creates a new empty database with a single table SOMETABLE on its first run. Once it has an open database connection it first deletes all rows (if any) from SOMETABLE.
Then it creates a big String containing 5000 rows with 8 random data fields (of several types). This String has TAB column delimiters and CR/LF row delimiters.
Then it calls PutString to append the data to SOMETABLE, displays a MsgBox with the elapsed time for the PutString, and ends.
The compiled program takes from 0.12 to 0.16 seconds to do the PutString call here, but the Timer() function isn't very accurate for small intervals.
Issues
I think I have the bugs out of the parsing logic.
This has only been tested with the Jet 4.0 provider, and I'm not sure how well it will do with other DBMSs. With Jet I found no advantage at all to wrapping the appends in a transaction or using batch updating, both whizzy performance gaining techniques according to common wisdom (which often isn't wise at all). Using any form of client Recordset only hurt performance, pretty much as expected.
Of course many variables have been left out, for example other connections could be updating, holding locks, etc. and that could make a huge difference.
Opening the database with exclusive access gains you a little more performance too. When you aren't sharing a database this is always a good bet, since eliminating locking naturally improves performance. The demo just lets this default to shared access.
Nasty Issues
The ADO Recordset's GetString method has a nasty secret. Not quite that big of a secret to classic ASP scripters since it was tripped over quite early. That secret is:
GetString does not use the invariant locale and you cannot set its locale
How does this matter?
What about Boolean values? What about fractional numeric values?
It turns out the PutString has the very same limitation (or is that a feature??).
As far as I can determine through testing, the demo should work just fine even in one of the Central European locales (e.g. Germany) with funky number punctuation it different wors for "true" and for "false." That's because it is using the locale-aware CStr() function when building the big test String value.
However the main reasons to work with a delimited text tabular data format are (a.) persisting, and (b.) interchanging data.
So a program running on a German language machine can't use this for talking to a French language machine. A French machine can't talk to an English machine because the number formats may match but the Booleans are goofy.
SetThreadLocale
The clever may think they know the answer, but calling SetThreadLocale passing LOCALE_INVARIANT won't cut it. For that matter the story is more complicated for supported versions of Windows anyway, involving SetThreadUILanguage calls.
But as far as I can tell the Variant parsing/formatting routines within OLE Automation that ADO makes use of lock in the locale pretty early and are not swayed by flippity-flopping locale settings around GetString or (my) PutString calls.