Aside from the carriage return, linefeed, tab, and null, the "non-printable" control characters only mean something to ancient terminals and transmission protocols. Nowadays they are avoided in text documents, and therefore should be avoided in XML too. But if for some reason you need to represent them in XML, you might like to use XML's natural numeric reference escaping mechanism. Unfortunately, it is not that simple.
In its infinite wisdom the XML 1.0 standard excluded the control characters in the range 0x01 to 0x1f except whitespace 0x09, 0x0a, 0x0d, even in escaped form. This was reversed in XML 1.1 but it was too late. This mistake is perfectly understandable if you agree that the purpose of the XML standard was to over-engineer the concept of a simple markup format. ;)
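To make the XML 1.0 rule concrete, here is a minimal sketch of a check against the `Char` production in the spec. The function name `IsXml10Char` is made up for illustration and is not part of CMarkup.

```cpp
#include <cassert>
#include <cstdint>

// True if the code point matches the Char production of XML 1.0:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
// Anything that fails this test is not allowed in an XML 1.0 document,
// not even as a numeric character reference.
bool IsXml10Char(std::uint32_t cp)
{
    return cp == 0x09 || cp == 0x0A || cp == 0x0D
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
}
```

So `IsXml10Char(0x0E)` is false while `IsXml10Char(0x09)` is true, which is the distinction the picky parsers are enforcing.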
CMarkup with non-printable characters
Matt 09-Feb-2012
Basically, CMarkup::EscapeText(...) allows non-printable characters like 0xE to be transmitted as-is. But some XML parsers seem to be picky about this, and cite the spec as the reason: http://www.w3.org/TR/xml/#charsets. This is leading to this real-world problem [where we pass text from a third party source application to a third party receiving application via XML. When the text in the XML contains a 0x0e (which only has meaning in the source application) a failure is triggered in the receiving application].
After testing, it looks like a lot of parsers still reject the XML if it has an escaped non-print character like &#xe;. Our plan is to convert non-prints to question marks.
I think in your case you should convert the control characters to question marks (or remove them) as part of scrubbing the source data because the receiving application would never want them anyway.
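Scrubbing the source data before it goes into the document could look something like the following sketch. It uses only the standard library; `ScrubControlChars` is a hypothetical helper, not a CMarkup function, and it assumes the text is ASCII or UTF-8 in a `std::string`.

```cpp
#include <cassert>
#include <string>

// Replace control characters (below 0x20) other than tab, linefeed and
// carriage return with a replacement character. Bytes 0x80 and above are
// left alone, so UTF-8 multibyte sequences pass through untouched.
std::string ScrubControlChars(const std::string& text, char replacement = '?')
{
    std::string out = text;
    for (char& c : out)
    {
        const unsigned char u = static_cast<unsigned char>(c);
        if (u < 0x20 && u != 0x09 && u != 0x0A && u != 0x0D)
            c = replacement;
    }
    return out;
}
```

You would call it on each value before handing it to a method like AddElem or SetData, so the control characters never reach the XML at all.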
But if someone needed to make CMarkup automatically escape control characters, it would require a small modification. If you have 11.5, it is Markup.cpp:2967, but in any recent release in CMarkup::EscapeText before the else { nCharLen = MCD_CLEN( pSource );... add the following else if clause:
else if ( cSource<0x20 && cSource>0 && cSource!=0x0a && cSource!=0x0d && cSource!=0x09 )
{
    // 0x0e becomes &#xe;
    MCD_CHAR szEscaped[10];
    MCD_SPRINTF( MCD_SSZ(szEscaped), MCD_T("&#x%x;"), (int)cSource );
    MCD_BLDAPPEND(strText,szEscaped);
    ++pSource;
}
before this:
else { nCharLen = MCD_CLEN( pSource );
Re: ASCII control characters in XML Yes, the XML spec clearly rules these characters out. We didn't discuss it that much during the process - it seemed like a good idea, and nobody on any of the committees seemed troubled at the prospect of losing them; so I'm afraid this is a hardwired characteristic of XML 1.0, and you're stuck with it. -Tim Bray Tue, 28 Apr 1998
Re: control characters I'm not sure we'd do it the same way if we were doing it again. I don't see that they do any real harm. -Tim Bray Sat, 17 Jun 2000
XML 1.1 allows the use of character references to the control characters #x1 through #x1F, most of which are forbidden in XML 1.0. For reasons of robustness, however, these characters still cannot be used directly in documents. In order to improve the robustness of character encoding detection [and prove the committee consists of incorrigible meddlers], the additional control characters #x7F through #x9F, which were freely allowed in XML 1.0 documents, now must also appear only as character references. (Whitespace characters are of course exempt.) -Extensible Markup Language (XML) 1.1 W3C Candidate Recommendation 15 October 2002
separate, translate, and import back in
Pablo 08-Dec-2011
I work for the translation department at my company. We normally receive monolingual XML files for translation. In such cases, we just make copies of these files and translate each set to the corresponding target languages. In this file, the text in each node <INFO> under <LANGUAGE>EN</LANGUAGE> should be translated into the rest of the languages (DA, DE, ES, FI, NL, NO, SV).
What I need to do is the following:
<?xml version="1.0" encoding="UTF-8" ?>
<INFORMATION>
<FUNDOBJECTIVE>
<FUNDOBJECTIVEDATA ID="1">
<FUNDOBJECTIVEDATAITEM>
<LANGUAGE>EN</LANGUAGE>
<INFO>Text to be translated.</INFO>
</FUNDOBJECTIVEDATAITEM>
<FUNDOBJECTIVEDATAITEM>
<LANGUAGE>DA</LANGUAGE>
<INFO></INFO>
</FUNDOBJECTIVEDATAITEM>
<FUNDOBJECTIVEDATAITEM>
<LANGUAGE>DE</LANGUAGE>
<INFO></INFO>
</FUNDOBJECTIVEDATAITEM>
<FUNDOBJECTIVEDATAITEM>
<LANGUAGE>ES</LANGUAGE>
<INFO></INFO>
</FUNDOBJECTIVEDATAITEM>
<FUNDOBJECTIVEDATAITEM>
<LANGUAGE>FI</LANGUAGE>
<INFO></INFO>
</FUNDOBJECTIVEDATAITEM>
<FUNDOBJECTIVEDATAITEM>
<LANGUAGE>NL</LANGUAGE>
<INFO></INFO>
</FUNDOBJECTIVEDATAITEM>
<FUNDOBJECTIVEDATAITEM>
<LANGUAGE>NO</LANGUAGE>
<INFO></INFO>
</FUNDOBJECTIVEDATAITEM>
<FUNDOBJECTIVEDATAITEM>
<LANGUAGE>SV</LANGUAGE>
<INFO></INFO>
</FUNDOBJECTIVEDATAITEM>
</FUNDOBJECTIVEDATA>
<FUNDOBJECTIVEDATA ID="2">
<FUNDOBJECTIVEDATAITEM>
<LANGUAGE>EN</LANGUAGE>
<INFO>More text to be translated.</INFO>
</FUNDOBJECTIVEDATAITEM>
<FUNDOBJECTIVEDATAITEM>
<LANGUAGE>DA</LANGUAGE>
<INFO></INFO>
</FUNDOBJECTIVEDATAITEM>
<FUNDOBJECTIVEDATAITEM>
<LANGUAGE>DE</LANGUAGE>
<INFO></INFO>
</FUNDOBJECTIVEDATAITEM>
<FUNDOBJECTIVEDATAITEM>
<LANGUAGE>ES</LANGUAGE>
<INFO></INFO>
</FUNDOBJECTIVEDATAITEM>
<FUNDOBJECTIVEDATAITEM>
<LANGUAGE>FI</LANGUAGE>
<INFO></INFO>
</FUNDOBJECTIVEDATAITEM>
<FUNDOBJECTIVEDATAITEM>
<LANGUAGE>NL</LANGUAGE>
<INFO></INFO>
</FUNDOBJECTIVEDATAITEM>
<FUNDOBJECTIVEDATAITEM>
<LANGUAGE>NO</LANGUAGE>
<INFO></INFO>
</FUNDOBJECTIVEDATAITEM>
<FUNDOBJECTIVEDATAITEM>
<LANGUAGE>SV</LANGUAGE>
<INFO></INFO>
</FUNDOBJECTIVEDATAITEM>
</FUNDOBJECTIVEDATA>
</FUNDOBJECTIVE>
</INFORMATION>
Could you please let me know if any of your products would allow me to do the steps 1 and 3 mentioned above? I know step 1 can be done, as I've seen similar examples. What about step 3? How complex would it be to import the translations back? I'm not a programmer -- I have a basic programming knowledge. Does your product require an advanced level of programming?
I wrote two scripts for Pablo (he reported back that they worked perfectly), one for split and one for merge afterwards. To try them out:
split()
{
  str sFolder = [["C:\Temp\"]];
  CMarkup mInput, mOutput;
  mInput.Load(sFolder+"translate_input.xml");
  int nIDCount = 0;
  while (mInput.FindElem("//FUNDOBJECTIVEDATA"))
  {
    // Extract the ID for this data
    str sID = mInput.GetAttrib("ID");

    // The first item must be EN
    mInput.IntoElem();
    mInput.FindElem("FUNDOBJECTIVEDATAITEM");
    mInput.FindChildElem("LANGUAGE");
    if (mInput.GetChildData() != "EN")
      return "unexpected: first data item under ID " + sID + " is not EN";
    mInput.FindChildElem("INFO");
    str sInfo = mInput.GetChildData();

    // Generate data elements for subsequent languages
    ++nIDCount;
    while (mInput.FindElem("FUNDOBJECTIVEDATAITEM"))
    {
      mInput.FindChildElem("LANGUAGE");
      str sLang = mInput.GetChildData();
      if (! mOutput.RestorePos(sLang))
      {
        mOutput.ResetPos();
        mOutput.AddElem("INFORMATION");
        mOutput.IntoElem();
        mOutput.AddElem("FUNDOBJECTIVE");
        mOutput.SavePos(sLang);
      }
      mOutput.AddChildElem("FUNDOBJECTIVEDATA");
      mOutput.IntoElem();
      mOutput.SetAttrib("ID",sID);
      mOutput.AddChildElem("LANGUAGE", sLang);
      mOutput.AddChildElem("INFO", sInfo);
    }
  }

  // Output files
  int nFileCount = 0;
  mOutput.ResetPos();
  while (mOutput.FindElem("INFORMATION"))
  {
    CMarkup mOutputLang = mOutput.GetSubDoc();
    str sLang = mOutputLang.FindGetData("//LANGUAGE");
    mOutputLang.Save(sFolder+"translate_to_" + sLang + ".xml");
    ++nFileCount;
  }
  return "Generated " + nFileCount + " files, " + nIDCount + " data elements per file";
}
merge()
{
  str sFolder = [["C:\Temp\"]];
  CMarkup mInput, mOutput;
  mInput.Load(sFolder + "translate_input.xml");

  // Loop through all translated files
  CMarkup mInputFiles = EnvFindFiles(sFolder+"translate_to_*.xml");
  mInputFiles.ResetPos();
  while (mInputFiles.FindElem())
  {
    CMarkup mTrans;
    mTrans.Load(sFolder + mInputFiles.GetData());
    str sLang = mTrans.FindGetData("//LANGUAGE");
    if (sLang == "")
    {
      return "language not found in " + mInputFiles.GetData();
    }

    // Loop through data of input and bring in items from this language
    mInput.ResetPos();
    while (mInput.FindElem("//FUNDOBJECTIVEDATA"))
    {
      str sID = mInput.GetAttrib("ID");
      mTrans.ResetPos();
      if (mTrans.FindElem("//FUNDOBJECTIVEDATA[@ID='" + sID + "']"))
      {
        str sInfo = mTrans.FindGetData("//INFO");

        // Locate corresponding language item to place translation into
        mInput.IntoElem();
        while (mInput.FindElem())
        {
          mInput.FindChildElem("LANGUAGE");
          if (mInput.GetChildData() == sLang)
          {
            mInput.FindChildElem("INFO");
            mInput.SetChildData(sInfo);
            break;
          }
        }
      }
    }
  }
  mInput.Save(sFolder+"translate_merged.xml");
}
The easiest way to adjust and customize the scripts is to press F10 and run them line by line and see how they do what they do. Then look up any additional functions you need either with F1 or searching firstobject.com. There are also ways to set these scripts up as command line calls, see using the firstobject XML editor from the command line.
Release 11.5 Date: April 23, 2011, download
Fixes for whitespace trimming/collapsing, and file read mode, as well as some changes in compiler #ifdef handling for WIN32.
MCD_BLDLEN
Using CMarkup with Qt on Windows is easier now; you shouldn't have to do any tweaking of CMarkup to add it to your Qt project. The changes involved special cases for GNUC when it is used on Windows. Either the WIN32 (Windows.h) or _WIN32 (Visual Studio) preprocessor defines will let CMarkup know it is compiling for Windows.
In 11.3 and 11.4, MDF_TRIMWHITESPACE and MDF_COLLAPSEWHITESPACE would remove an escaped char at the end of the trimmed data (see 11.3 Bug: trim whitespace removes escaped value).
On Linux and OS X, lines generated by CMarkup will now end with a newline by default instead of a Windows style CRLF (carriage return line feed), and the end-of-line setting can now be directed with preprocessor definitions. If you are on a non-Windows platform and want your CRLFs back, now you must add MARKUP_EOL_CRLF to your preprocessor definitions.
End Of Line Defines
Name | Value | Description |
---|---|---|
MARKUP_EOL_CRLF | MCD_T("\r\n") | Aka 0d 0a, this is the default for Windows builds |
MARKUP_EOL_NEWLINE | MCD_T("\n") | Aka 0a, this is now the default for non-Windows builds |
MARKUP_EOL_RETURN | MCD_T("\r") | Aka 0d, this is rarely used |
MARKUP_EOL_NONE | MCD_T("") | For minimal size, documents will be on one line, but GetDocFormatted will not produce desired results |

The file read mode bug fix only affects CMarkup Developer and the free XML editor FOAL C++ scripting.
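As a rough illustration of how a compile-time end-of-line choice like this can work, here is a standalone sketch. Only the MARKUP_EOL_* names come from the table above; `EXAMPLE_EOL` and `AppendLine` are hypothetical names invented for the example, and this is not how Markup.cpp itself is necessarily written.

```cpp
#include <cassert>
#include <string>

// EXAMPLE_EOL is a made-up macro for this sketch, not a CMarkup macro.
// An explicit MARKUP_EOL_* definition wins; otherwise fall back on the platform.
#if defined(MARKUP_EOL_CRLF)
  #define EXAMPLE_EOL "\r\n"
#elif defined(MARKUP_EOL_NEWLINE)
  #define EXAMPLE_EOL "\n"
#elif defined(MARKUP_EOL_RETURN)
  #define EXAMPLE_EOL "\r"
#elif defined(MARKUP_EOL_NONE)
  #define EXAMPLE_EOL ""
#elif defined(_WIN32) || defined(WIN32)
  #define EXAMPLE_EOL "\r\n"   // Windows default
#else
  #define EXAMPLE_EOL "\n"     // non-Windows default
#endif

// Append one line of text followed by the selected end-of-line sequence.
std::string AppendLine(const std::string& doc, const std::string& line)
{
    return doc + line + EXAMPLE_EOL;
}
```

With GCC or Clang you would pass the definition on the command line, for example `-DMARKUP_EOL_CRLF`; in Visual Studio it goes in the project's preprocessor definitions.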
The file read mode bug occurred with elements over 32k long that did not have child elements. This problem was only in file read mode where the Open method is used with MDF_READFILE. See File read GetSubDoc incomplete.
See also previous CMarkup release notes: 11.4, 11.3, 11.2, 11.1, 11.0, 10.1, 10.0, Archived CMarkup Release Notes
You might work only with text in the ASCII range (below 128) or have some non-ASCII text like the Euro character or Western European characters with accents and umlauts. Here are some examples of how to handle encoding issues as you move beyond ASCII.
euro is unreadable in XML
Davide 02-Mar-2011
I need to insert an amount in a UTF-8 xml file, something like:
CString sFmt;
sFmt.Format( _T("%d €"), nPrice );
xml.AddElem( _T("Price"), sFmt );
But the resulting xml is unreadable. I've found a workaround using:
CString sFmt;
sFmt.Format( _T("%d \xE2\x82\xAC"), nPrice );
xml.AddElem( _T("Price"), sFmt );
Using \xE2\x82\xAC for the euro is correct in your case because your string encoding is UTF-8.
When you specify a non-ASCII character in a source file on Windows it is compiled into your program in the locale charset. So the problem with the euro symbol in sFmt is that it is in your locale's MBCS (in which the euro is represented by one byte) and CMarkup is expecting UTF-8 (which is the case when your project is set to use neither MBCS nor UNICODE). You were able to work around it by putting the UTF-8 encoding directly in the string.
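If you are wondering where \xE2\x82\xAC comes from, this small sketch encodes a Unicode code point as UTF-8 by hand. `EncodeUtf8` is a hypothetical helper written for this example (not a CMarkup method) and it assumes the caller passes a valid code point.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one Unicode code point (up to U+10FFFF) as UTF-8 bytes.
// No validation: surrogates and out-of-range values are the caller's problem.
std::string EncodeUtf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);                          // 1 byte
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

The euro is U+20AC, so `EncodeUtf8(0x20AC)` yields the three bytes E2 82 AC used in the workaround.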
If compiling for MBCS you could have used the euro character directly in your source string, but the result would only be satisfactory as long as the program is running on a machine with your same locale "Language for non-Unicode programs."
This is another opportunity to discuss the internal memory string encoding choices in C++ (also described in ANSI and Unicode files and C++ strings).
The CMarkup class has a string member m_strDoc that holds the XML document (or part of it in the case of file mode). Also, the CMarkup methods accept and return strings. The encoding of these strings depends on platform and compiler options.
You select the UTF-8 option in Windows by turning off the UNICODE (wide char) and MBCS project defines. In Visual Studio 2005+, under Properties > General > Character Set, choose the "Not Set" option. In this case, all the strings going into and out of the CMarkup methods are expected in UTF-8.
Note: the terminology is confusing, but in this Windows context UTF-8 is neither MBCS nor UNICODE. UTF-8 is "multibyte," however Windows uses MBCS to refer only to those character sets that can be selected for the machine locale and used in "A" APIs and Windows messages. And although UTF-8 is Unicode, Windows uses UNICODE to refer only to UTF-16 used in "W" APIs and Windows messages. In Windows you must convert UTF-8 strings to MBCS for "A" APIs like SetWindowTextA or better yet to UTF-16 for "W" APIs like SetWindowTextW (that's what CMarkup's UTF8To16 and UTF16To8 are for).
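For a feel of what a UTF-8 to UTF-16 conversion involves, here is a portable sketch using only the standard library. It is not CMarkup's UTF8To16; `Utf8ToUtf16` is a hypothetical function for illustration, and it deliberately skips strict validation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Decode UTF-8 bytes into UTF-16 code units.
// Malformed or truncated sequences become U+FFFD; overlong forms are not
// rejected, so this is a sketch rather than a validating decoder.
std::u16string Utf8ToUtf16(const std::string& utf8)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp = 0xFFFD;
        std::size_t len = 1;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }

        if (i + len > utf8.size()) { cp = 0xFFFD; len = 1; }   // truncated
        for (std::size_t k = 1; k < len; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80) { cp = 0xFFFD; len = k; break; }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp > 0x10FFFF) cp = 0xFFFD;
        i += len;

        if (cp >= 0x10000)
        {
            cp -= 0x10000;   // split into a surrogate pair
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
        else
        {
            out += static_cast<char16_t>(cp);
        }
    }
    return out;
}
```

On Windows, where wchar_t is 16 bits, the resulting code units are what a "W" API such as SetWindowTextW expects.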
If you have a UTF-8 file, using UTF-8 in memory eliminates the need to convert the text encoding between file and memory.
If you have a UTF-8 file and compile for MBCS in memory, CMarkup converts the XML to the locale code page when it is loaded into memory. This has performance and multi-language disadvantages. It must do the conversion as mentioned when going between memory and file, which adds time (though less than the time to read from disk) and might be a performance consideration depending on your requirements. But also, you will lose any Unicode characters not supported in the locale code page where the program is running. For example, the ö with the umlaut is 246 in the standard Windows U.S. code page Windows-1252 but it is not supported in Greek Windows-1253 (but the Euro is).
GetAttrib result is a question mark
Chen 22-Aug-2011
When the xml data has a node like the following:
<block solution ="～" />
CMarkup's function GetAttrib("solution") has the result "?".
The character in your attribute is U+FF5E (65374, in the Halfwidth and Fullwidth Forms block FF00-FFEF; UTF-8 EF BD 9E). When this character is not supported in the character set in memory, it is replaced by a question mark. You likely have an MBCS build which expects strings to be in the system locale Windows code page (not Unicode). It is best if you can use a Unicode charset in memory -- either UTF-8 or wide string, as explained above.
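The question mark effect is easy to reproduce in miniature. The sketch below converts code points to a single-byte string and substitutes '?' for anything the target set cannot hold. As a simplifying assumption it uses the ISO-8859-1 range (U+0000 to U+00FF) as a stand-in for a real locale code page, and `ToSingleByteLossy` is a hypothetical name, not a CMarkup or Windows function.

```cpp
#include <cassert>
#include <string>

// Convert code points to a single-byte string, keeping only U+0000..U+00FF
// (the ISO-8859-1 range, standing in here for a locale code page).
// Anything outside that range is replaced with '?'.
std::string ToSingleByteLossy(const std::u32string& text)
{
    std::string out;
    out.reserve(text.size());
    for (char32_t cp : text)
    {
        if (cp <= 0xFF)
            out += static_cast<char>(static_cast<unsigned char>(cp));
        else
            out += '?';
    }
    return out;
}
```

The ö (U+00F6) survives as byte 246, while U+FF5E comes out as "?", and once that has happened the original character cannot be recovered from the in-memory string.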
Trouble working with Arabic XML
Greg 16-Jun-2011
I am reading in a bunch of small Arabic XML files and combining them into a larger one. My problem is that I must not be handling something correctly when I read the data into memory because the output is gibberish (examples follow). Here are the relevant details of my environment: CMarkup version 11, Microsoft Visual Studio 2010 Ultimate, Unmanaged C++, MBCS, XML is UTF-8 with no BOM. Here is an example of what I am reading in:
<?xml version="1.0" encoding="UTF-8"?>
<Symbol xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="..\schemas\symbolcanonical.xsd">
<Filename>11.esds</Filename>
<Description>
<LocText>ضع هذاالرمز على رسم المخطط .ثم ابداء الطباعة .</LocText>
</Description>
</Symbol>
And here is an example of what I am writing out:
<?xml version="1.0" encoding="UTF-8"?>
<SymbolCollection>
<SymbolList>
<Symbol>
<Filename>11.esds</Filename>
<Description>?? ???????? ??? ??? ?????? .?? ????? ??????? .</Description>
</Symbol>
</SymbolList>
</SymbolCollection>
I’ve played around with various SetDocFlags and setlocale options and whatnot...
[Selecting "Not Set" for the project character set solved the problem.] I didn't realize I had a viable third choice on that build setting.
When your project is set to use MBCS, CMarkup converts the file to your system locale charset in memory. If you can set your Character Set to the "Not Set" option in your Project Properties it will keep your XML in UTF-8.
encoding of XML with german umlaute
Ahyan 19-Jul-2011
It is possible for developers to save their XML files in an unfortunately inconsistent way where the XML header encoding information does not fit to the file content. That happens when the XML files are modified in a text editor that does not care if it is saving an XML file and if the encoding header matches the content. Within this text editor one has to explicitly specify the encoding of the text file (which is actually XML in this case) with the "Save File As" options. So we end up having XML files with incorrect headers and a given XML file that can contain "german umlaute" (special german characters like ä,ö etc) will be invalid because the XML header states e.g. an encoding ("UTF-8" or "8859-1") that doesn't fit the actual content. This can be improved by training the developers...
Yes, in the real world, you get situations where you need to salvage improperly declared XML documents. Say you have a header (an "XML declaration") at the top of your XML file:
<?xml version="1.0" encoding="UTF-8"?>
But the encoding of the non-ASCII characters is actually Windows-1252. You can get CMarkup to ignore the "UTF-8" specified there by using the ReadTextFile and WriteTextFile functions directly and specifying the desired encoding:
string strDoc, strEncoding="Windows-1252";
CMarkup::ReadTextFile("C:\\file.xml",strDoc,NULL,NULL,&strEncoding);
CMarkup xml;
xml.SetDoc(strDoc);
...
strDoc = xml.GetDoc();
CMarkup::WriteTextFile("C:\\file.xml",strDoc,NULL,NULL,&strEncoding);
This should allow you to leave (and ignore) the incorrect encoding in the XML declaration.
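If you do not know in advance which files are mis-declared, one heuristic (my suggestion here, not something CMarkup does for you) is to test whether the raw bytes are well-formed UTF-8 and fall back to Windows-1252 when they are not. A minimal sketch, with the hypothetical name `IsValidUtf8`:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if the bytes form well-formed UTF-8 (no truncated sequences, overlong
// forms, surrogates, or values above U+10FFFF). Pure ASCII also returns true,
// so this can only tell you that a file is NOT UTF-8, never prove that it is.
bool IsValidUtf8(const std::string& bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n)
    {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        if (b0 < 0x80) { ++i; continue; }

        std::size_t len = 0;
        unsigned long cp = 0, minimum = 0;
        if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; minimum = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; minimum = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; minimum = 0x10000; }
        else return false;                       // stray continuation or bad lead

        if (i + len > n) return false;           // truncated sequence
        for (std::size_t k = 1; k < len; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < minimum || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += len;
    }
    return true;
}
```

An ä saved as Windows-1252 is the lone byte E4, which fails this check, whereas the UTF-8 form C3 A4 passes. When the check fails you could pass "Windows-1252" as the encoding as shown above.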
Release 11.4 Date: February 5, 2011, download
An important fix to the 11.3 whitespace features, improvements in the GetDocFormatted method, and an enhancement to HasAttrib.
Here's the list of 11.4 enhancements:
MDF_TRIMWHITESPACE and MDF_COLLAPSEWHITESPACE were crashing on values that were only whitespace... sorry, a glaring hole in the test cases
<A a/>
See also previous CMarkup release notes: 11.3, 11.2, 11.1, 11.0, 10.1, 10.0, Archived CMarkup Release Notes
Release 11.3 Date: November 20, 2010, download
A performance improvement makes CMarkup significantly faster! Overall parsing speed is up 35%, and attribute methods are twice as fast as 11.2. This release also includes document flags to trim whitespace and collapse whitespace.
Here's the list of 11.3 enhancements:
MDF_TRIMWHITESPACE and MDF_COLLAPSEWHITESPACE to affect retrieved values (see Whitespace and CMarkup)
TextEncoding::IConv *thanks Frank Dering
See also previous CMarkup release notes: 11.2, 11.1, 11.0, 10.1, 10.0, Archived CMarkup Release Notes
The free firstobject XML editor has some command line switches. Here is the summary; details are below.
Switch | Purpose |
---|---|
-new | open in a new instance of the editor (useful when single instance preference is selected) |
-same | open in the existing instance of the editor |
-watch "C:\event.log" | open file in read-only auto-reloading mode to view the tail of log files |
-line 23 | open file at a line |
-offset 451 | UTF-8 offset from beginning of document or from beginning of line if line is specified |
-fromoffset 5 | pre-select text from this offset to specified offset |
-run script.foal:f arg1 arg2 | execute the script without showing the editor |
Examples:
foxe.exe "C:\XML examples\file.xml" -line 6
foxe.exe file.xml -line 6 -offset 5
foxe.exe -offset 240 -fromoffset 235 file.xml
foxe.exe -watch C:\event.log
foxe.exe -run C:\script.foal
foxe.exe -same file.xml
"C:\Program Files\firstobject\foxe.exe" -new file.xml
How to run script automatically
Angela Baines 18-Jan-2010
The foal script works a treat now. When I move this to production it has to run as part of an automated project that will be scheduled to run overnight
As of release 2.4.1 the free firstobject XML editor has a command line switch to run a script:
foxe -run "C:\foal scripts\script.foal"
It generates 2 files, foxe_err.txt and foxe_out.txt. The err file contains marked up information about the run and can help diagnose issues with running. The out file contains the output returned from the script in the return statement.
And with release 2.4.2 you can specify function and arguments, see below.
Run from command line problem
Garth Lancaster 19-Jul-2010
In foxe I can do this [with a foal script]
str NavigateIterativelyXYZ_Generated( CMarkup mDocToNavigate )
{
  mDocToNavigate.ResetPos();
  str sXML = mDocToNavigate.GetDocFormatted(0);
  return sXML;
}
so when I run it, a window pops up and asks which document I wish to convert, shows me the output, all is good... except now, I wish to do this from a command line.
foxe -run align0.foal input.xml output.xml
Where align0.foal contains the foal program, input and output names will likely come from a batch script % parameter or such, and input.xml will contain the streamed xml and output.xml will contain the results of the GetDocFormatted(0). I'm thinking foal can be very powerful, am I missing a point in its implementation?
Update April 23, 2011: With release 2.4.2, it works the way it should (the way Garth described). The -run command line option passes any number of command line arguments into the parameters of the function in the foal script. For example (remember to use quotes if a path or argument contains spaces):
foxe -run "C:\foal scripts\script.foal" C:\in.xml C:\out.xml
Here is a script to format XML that works with the previous command line:
formatxml(str sInPath, str sOutPath)
{
  CMarkup m, r;
  r.AddElem( "load", sInPath );
  m.Load( sInPath );
  r.AddSubDoc( m.GetResult() );
  if ( ! m.SetDoc(m.GetDocFormatted()) )
    r.AddSubDoc( m.GetResult() );
  r.AddElem( "save", sOutPath );
  m.Save(sOutPath);
  r.AddSubDoc( m.GetResult() );
  return r;
}
Basically this script just loads the input file, calls GetDocFormatted and saves to the output file (you could also add an argument to pass format flags). In addition it returns the results, so that if something goes awry you can see what happened in foxe_out.txt. If there is no problem, it might look like this:
<load>C:\in.xml</load>
<read encoding="UTF-8" length="31579"/>
<save>C:\out.xml</save>
<write encoding="UTF-8" length="29958"/>
If there are multiple functions in a script, you can name your entry function main to avoid confusion about which function is being called. You can also specify the entry function explicitly on the command line with :function at the end of the script filename. Specifying the function allows you to invoke multiple functions in one script from the command line. Here is an example of using the same script to perform different operations:
foxe -run C:\script.foal:extract London
foxe -run C:\script.foal:merge London "New York"
If you specify a function name that is not found in the script, there will be an indication at the bottom of foxe_err.txt that also indicates the entry point it would have used if you had not specified one:
<entry_point arg_count="2" not_found="gormatxml">formatxml</entry_point>
If you do not specify the function and there is no main function in the script, the last function with a matching number of arguments is called. It chooses the last matching function because functions earlier in the script tend to be subroutines since in FOAL you can only call functions below where they are defined.
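The default selection rule can be modeled in a few lines. This is only a sketch of the behavior as described here, not foxe's actual code: `ScriptFunction` and `DefaultEntryPoint` are hypothetical names, and it assumes that a function named main is chosen regardless of its argument count.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One function found in a script, in the order it appears in the file.
struct ScriptFunction
{
    std::string name;
    std::size_t argCount;
};

// Pick the entry point when none is named on the command line:
// "main" if the script has one (assumed to win regardless of arguments),
// otherwise the LAST function whose argument count matches.
// Returns an empty string if nothing matches.
std::string DefaultEntryPoint(const std::vector<ScriptFunction>& functions,
                              std::size_t argCount)
{
    std::string lastMatch;
    for (const ScriptFunction& f : functions)
    {
        if (f.name == "main")
            return f.name;
        if (f.argCount == argCount)
            lastMatch = f.name;   // later matches overwrite earlier ones
    }
    return lastMatch;
}
```

Given a script with a two-argument helper followed by a two-argument formatxml, two command line arguments select formatxml, because the helper defined earlier is presumed to be a subroutine.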
Quoted Command Line Param Bug?
Tim Johnston 20-Jun-2011
It would seem there is a problem when running a FOXE script from the command line where the quoted parameter (such as a folder path) has a trailing slash, then the script fails... [this is a command console issue rather than a foxe issue, and he answered his own question in a subsequent email] It seems if you leave off the trailing slash and check for it/add it in the script it works fine. It also seems you can end it with a double back slash and it correctly escapes it - go figure :)
Incidentally, this failure doesn’t affect the return code of the process – I assume it would/should, but doing an echo %errorlevel% returns 0 after the failure... Also, would it be possible to make the FOXE command line not return back to the prompt before it is finished? i.e. make it "modal"? Basically when you run a script, control immediately returns back to the calling process and if the remaining batch file etc. is relying on the output of the FOXE script, then it is not there (as its still processing) – or at least do you think you could add a command line option to not return until processing is complete?
Thanks for the help on escaping the trailing backslash in the command line. I think you should be able to make the foxe.exe call wait using the START /WAIT command to launch it. By default batch files wait, but windowing programs do not so you have to use the START command in that case.
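The reason for the trailing backslash trouble is that Windows command line parsing treats a backslash immediately before a double quote as escaping the quote, so "C:\Temp\" does not end where you expect. Tim's approach of leaving the slash off and adding it back inside the script boils down to a check like the one below. This is a plain C++ sketch with a hypothetical name, `EnsureTrailingBackslash`; an empty folder is returned unchanged on the assumption that it means the current directory.

```cpp
#include <cassert>
#include <string>

// Return the folder path with a trailing separator, adding a backslash only
// if the path does not already end in '\' or '/'.
std::string EnsureTrailingBackslash(const std::string& folder)
{
    if (folder.empty() || folder.back() == '\\' || folder.back() == '/')
        return folder;
    return folder + '\\';
}
```

The script can then be called with "C:\My Folder" (no trailing slash) and still build full pathnames by simple concatenation.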
trimming white space
Marc Dyksterhouse 17-Mar-2010
Is there a way to have GetData or some other call return just the text of an element and not the whitespace around it? For example, can GetData return "text" in the following XML instead of " text\n"?
<item>
text
</item>
I know I can just trim the returned string, but since whitespace isn't supposed to be pertinent in XML, I just thought the library should work this way. In the few cases where I need to preserve whitespace, I can use a CDATA encoding.
With release 11.3 you can set flags to trim whitespace or collapse whitespace when reading values from the document. CMarkup is unusual among XML tools because it simply preserves all whitespace, but now it can also support standard ways that XML and HTML processors alter whitespace.
Whitespace includes spaces, tabs, returns and newline characters. CMarkup has always preserved the whitespace as it appears in the document, and it still will. These new flags give you the option of reading the trimmed or collapsed text values, but the document is not altered, so you can turn off the flags and go back to reading the preserved whitespace.
Document Flag | Purpose |
---|---|
MDF_TRIMWHITESPACE | removes leading and trailing whitespace |
MDF_COLLAPSEWHITESPACE | removes leading and trailing whitespace, but also replaces all segments of whitespace inside the text with a single space; so for example a newline and tab within the text will become a single space |
These flags affect CMarkup methods like GetData and GetAttrib that retrieve element data, text nodes, and attributes (but not methods like GetSubDoc and GetElemContent that return XML i.e. markup text).
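To show exactly what trimming and collapsing mean, here are the two operations written as standalone functions over `std::string`. These are illustrations of the semantics described above, not CMarkup's internal implementation; the names `TrimWhitespace` and `CollapseWhitespace` are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace {
// Whitespace as described above: space, tab, return and newline.
const char* const kWhitespace = " \t\r\n";
}

// Remove leading and trailing whitespace.
std::string TrimWhitespace(const std::string& s)
{
    const std::size_t first = s.find_first_not_of(kWhitespace);
    if (first == std::string::npos)
        return std::string();          // value was only whitespace
    const std::size_t last = s.find_last_not_of(kWhitespace);
    return s.substr(first, last - first + 1);
}

// Trim, and also replace each inner run of whitespace with a single space.
std::string CollapseWhitespace(const std::string& s)
{
    std::string out;
    bool pendingSpace = false;
    for (char c : s)
    {
        const bool isSpace = (c == ' ' || c == '\t' || c == '\r' || c == '\n');
        if (isSpace)
        {
            pendingSpace = !out.empty();   // ignore leading whitespace
        }
        else
        {
            if (pendingSpace) out += ' ';
            out += c;
            pendingSpace = false;
        }
    }
    return out;                            // trailing whitespace never emitted
}
```

Applied to Marc's example, trimming " text\n" gives "text", and collapsing "line one\n\tline two" gives "line one line two".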
These flags have no effect on text retrieved from CDATA Sections. With CMarkup you can create elements to contain CData Section text to protect the whitespace from ever being altered by CMarkup or any other XML tool:
xml.AddElem( "Prose", strProseText, CMarkup::MNF_WITHCDATA );
Turn the whitespace flags on and off anytime without performance penalty if for example you want to trim some values and not others. Use SetDocFlags to set these flags.
CMarkup m; m.SetDocFlags( CMarkup::MDF_TRIMWHITESPACE );
You can OR a flag with GetDocFlags if you don't want to affect other flags:
m.SetDocFlags( m.GetDocFlags() | CMarkup::MDF_COLLAPSEWHITESPACE );
Turn off a flag without affecting others as follows:
m.SetDocFlags( m.GetDocFlags() & ~CMarkup::MDF_COLLAPSEWHITESPACE );
These whitespace flags can affect values returned by GetData, GetAttrib and related methods. They also affect methods like FindElem that search for a path specifying a value in a path attribute predicate (see Paths In CMarkup) because values from the document will be trimmed or collapsed before being compared to the specified value.
See also:
Node Methods in CMarkup
An example of how to use the free firstobject XML editor to split XML and then name the output files based on information in the pieces separated by the XML splitter script. Maybe this will be useful to other NGOs who need to split their XML.
XML Splitter
Dita Ciulacu 01-Jul-2009
I am desperately searching for a xml splitter to generate the file name using values from a child field. Is there any way to have the file named this way:
xmlOutput.Open( "test" + "_" + [Child value from REFERRAL_ID] + "_" + nFileCount + ".xml", MDF_WRITEFILE );
My xml [not real data] is:
<REFERRAL_DISCHARGE>
<FILE_VERSION>1.0</FILE_VERSION>
<REFERRAL_ID>1234</REFERRAL_ID>
<ORGANISATION_ID>ORG-5678</ORGANISATION_ID>
<ORGANISATION_TYPE>005</ORGANISATION_TYPE>
<EXTRACT_FROM_DATE_TIME>2009-06-01T00:00:00</EXTRACT_FROM_DATE_TIME>
<EXTRACTED_DATE_TIME>2009-06-30T15:40:15</EXTRACTED_DATE_TIME>
<TEAM_CODE>5555</TEAM_CODE>
<EVENT_HCU_ID>XXX1234</EVENT_HCU_ID>
<SEX>M</SEX>
<DATE_OF_BIRTH>1900-05-05</DATE_OF_BIRTH>
<REFERRAL_FROM>UN</REFERRAL_FROM>
<START_DATE_TIME>2008-12-24T00:00:00</START_DATE_TIME>
</REFERRAL_DISCHARGE>
The parent is REFERRAL_DISCHARGE, I need the file name exactly how you have it plus the individual value from REFERRAL_ID to make it easy to link to the data included.
We are a not-for-profit organization and we have to report to the [New Zealand] Ministry of Health and our data is to be packed as individual xml files. We are not dealing with huge files (this one was only 316kb) and also they are relatively simple extracts, but I don’t know in the future... it may get more complicated.
For splitting an XML file less than 10MB into a lot of referral discharge files, this is the easiest way to do it:
split()
{
  CMarkup xmlInput, xmlSubDoc;
  xmlInput.Load( "input.xml" );
  int nFileCount = 0;
  while ( xmlInput.FindElem("//REFERRAL_DISCHARGE") )
  {
    ++nFileCount;
    xmlSubDoc.SetDoc( xmlInput.GetSubDoc() );
    str sID = xmlSubDoc.FindGetData( "//REFERRAL_ID" );
    str sFilename = "test_" + sID + "_"+ nFileCount + ".xml";
    WriteTextFile( sFilename, xmlSubDoc.GetDoc() );
  }
  return nFileCount;
}
For others who have really large files (especially over 100MB up to any number of gigabytes) use the XML reader mode which processes the source file on disk very efficiently. The only difference from the above script is opening the input file in read mode rather than loading it all into memory.
split()
{
  CMarkup xmlInput, xmlSubDoc;
  xmlInput.Open( "input.xml", MDF_READFILE );
  int nFileCount = 0;
  while ( xmlInput.FindElem("//REFERRAL_DISCHARGE") )
  {
    ++nFileCount;
    xmlSubDoc.SetDoc( xmlInput.GetSubDoc() );
    str sID = xmlSubDoc.FindGetData( "//REFERRAL_ID" );
    str sFilename = "test_" + sID + "_"+ nFileCount + ".xml";
    WriteTextFile( sFilename, xmlSubDoc.GetDoc() );
  }
  xmlInput.Close();
  return nFileCount;
}
A note about usage of the anywhere path. If you want to grab multiple pieces of data like xmlSubDoc.FindGetData("//REFERRAL_ID") remember that the // anywhere path starts from the current position. So if you're not sure about the order of the data you are grabbing, call xmlSubDoc.ResetPos() in between calls to FindGetData.
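Why the reset matters is easiest to see with a toy model of a forward-only search. The class below is not CMarkup; `ForwardCursor` is a hypothetical stand-in that keeps elements as name/data pairs in document order, and it assumes a failed search leaves the position where it was.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy model: elements in document order plus a "current position".
class ForwardCursor
{
public:
    explicit ForwardCursor(std::vector<std::pair<std::string, std::string>> items)
        : m_items(std::move(items)), m_next(0) {}

    // Go back to the start of the document.
    void ResetPos() { m_next = 0; }

    // Search forward from the current position for an element with this name.
    // On success return its data and move just past it; otherwise return "".
    std::string FindGetData(const std::string& name)
    {
        for (std::size_t i = m_next; i < m_items.size(); ++i)
        {
            if (m_items[i].first == name)
            {
                m_next = i + 1;
                return m_items[i].second;
            }
        }
        return std::string();
    }

private:
    std::vector<std::pair<std::string, std::string>> m_items;
    std::size_t m_next;
};
```

In this model, asking for TEAM_CODE first and REFERRAL_ID second comes back empty for REFERRAL_ID, because it sits earlier in the document than the current position; calling ResetPos() in between finds it.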
See also:
Split XML with XML editor script
Split XML file into smaller pieces
Video of XML splitter script for splitting XML files
C++ XML reader parses a very large XML file
Conventional wisdom has you importing and exporting XML to and from a database in order to run queries and utilize data that is in XML. But with firstobject's free XML editor you can perform all sorts of operations rapidly and efficiently directly on the XML document. This example shows how to export subsets of records, query, tally and modify XML records in a real estate database XML file.
export records with matching childset
Eddie Wrenn 25-Jan-2010
What I have is a list of properties for sale nationwide, contained in a 1.5gb XML file (your program is the only one which seems to handle this with ease!) I'm looking for a way to make the editor export all the records which have a matching childset, in this case 'locality' (in this example, London). There's 100,000 listings so not a manual job!
I've been successful splitting the file into 100,000 separate files, named by the locality (using your tutorials). But patching them all together takes a long time, even if I automate it. A sample record below:
<listing key="1234567" status="active" updated="20090101T010101" type="residence">
<title><![CDATA[Xyz Street, London]]></title>
<supplementary-url><![CDATA[1234567.htm]]></supplementary-url>
<description><![CDATA[AVAILABLE 01/01/2010. This
beautifully decorated place is situated on a quiet back
street of Xyz Garden in the heart of Xyz London.
The owners have refurbished to a particularly high
standard paying exceptional attention to detail to the
overall finish and decoration. As the apartment is
situated on the Nth floor there are great views of
London giving the apartment excellent natural light.
Features available. We highly recommend a viewing.]]></description>
<residence type="flat">
<bedrooms><![CDATA[1]]></bedrooms>
<bathrooms><![CDATA[1]]></bathrooms>
<reception><![CDATA[yes]]></reception>
</residence>
<authority>
<lease currency="GBP" term="private" visible="yes">
<price term="weekly"><![CDATA[450]]></price>
</lease>
</authority>
<address visible="yes">
<country><![CDATA[GB]]></country>
<subdivision><![CDATA[London]]></subdivision>
<locality><![CDATA[London]]></locality>
<postcode><![CDATA[AA1A 1AA]]></postcode>
<road><![CDATA[Xyz Street]]></road>
</address>
<attachments>
<photo title="" updated="20090101T010101" type="image/jpeg">
<uri><![CDATA[1234567_354_255.jpg]]></uri>
</photo>
<photo title="" updated="20091118T201049" type="image/jpeg">
<uri><![CDATA[23456789_354_255.jpg]]></uri>
</photo>
<photo title="" updated="20091118T201049" type="image/jpeg">
<uri><![CDATA[34567890_354_255.jpg]]></uri>
</photo>
</attachments>
<vendor>
<name><![CDATA[Xyz Property Services]]></name>
<phone><![CDATA[020 1234 5678]]></phone>
<email><![CDATA[enquiries@xyz.example]]></email>
</vendor>
</listing>
To find all the matching records on a huge file you do something like this: from the File menu select New Program, paste in the following script, and modify the input file pathname (note that for C++ syntax, use a double backslash for backslashes in the pathname).
pull_by_locality()
{
  str strSearch = "London";
  CMarkup xmlInput, xmlListing, xmlOutput;
  xmlInput.Open( "C:\\huge.xml", MDF_READFILE );
  while ( xmlInput.FindElem("//listing") )
  {
    xmlListing.SetDoc( xmlInput.GetSubDoc() );
    if ( xmlListing.FindGetData("//locality") == strSearch )
      xmlOutput.AddSubDoc( xmlListing.GetDoc() );
  }
  return xmlOutput.GetDoc();
}
To export the result document as London.xml:
xmlOutput.Save( strSearch + ".xml" );
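To delete (or actually skip) records that are no longer required, e.g. we want them if status is "active" but not if it is "inactive" or "sold", test the status before adding the record to the output. The line below treats status as a child element. In the sample listing above, status is an attribute of the listing element, in which case you would call xmlListing.FindElem() after ResetPos and compare xmlListing.GetAttrib("status") instead.
xmlListing.ResetPos(); if ( xmlListing.FindGetData("//status") != "active" ) ...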
To change an element tag name from title to topicname in the output, first add the new element with the same content, then remove the old one (this is the easiest way to make sure the new element goes into the same position as the removed one).
xmlListing.ResetPos();
if (xmlListing.FindElem("//title"))
{
  xmlListing.AddElem("topicname", xmlListing.GetData());
  xmlListing.FindPrevElem(); // title
  xmlListing.RemoveElem();
}
As far as inputting the search string goes, FOAL scripts don't support dialogs yet. However, you can automate the process by putting the search string in a file such as search.txt, which can be retrieved in the FOAL script with:
str s;
if ( ReadTextFile("C:\\search.txt", s) && StrLength(s) > 2 )
  s = StrMid( s, 0, StrLength(s)-2 ); // remove CRLF
str strSearch = s;
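The snippet above assumes the file ends with exactly one CRLF pair. If the file might instead end with a bare LF or a trailing space, trimming all trailing whitespace is more forgiving. Here is a minimal C++ sketch of that idea; TrimTrailing is a hypothetical helper name, not part of FOAL or CMarkup.

```cpp
#include <string>

// Hypothetical helper: strip any trailing CR, LF, space or tab.
std::string TrimTrailing(const std::string& s)
{
    const std::string::size_type last = s.find_last_not_of(" \t\r\n");
    if (last == std::string::npos)
        return std::string();   // empty, or nothing but whitespace
    return s.substr(0, last + 1);
}
```

Whitespace inside the string is left alone, so a search term such as "Little Town" survives intact.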
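In DOS, if you had a script named search.foal, then you could create a search.bat file as follows to let you type search London on the command line. The redirection is written before echo so that no trailing space ends up in the file ahead of the line break, since the script above strips only the CRLF.
>C:\search.txt echo %1
"C:\Program Files\firstobject\foxe.exe" -run C:\search.foal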
Here's an interesting diagnostic to count instances of each locality:
locality_tally()
{
  CMarkup xmlLocalities, xmlInput;
  xmlInput.Open( "huge.xml", MDF_READFILE );
  while ( xmlInput.FindElem("//locality") )
  {
    str sLoc = xmlInput.GetData();
    int n = 1;
    if ( xmlLocalities.RestorePos(sLoc) )
      n = StrToInt(xmlLocalities.GetAttrib("n")) + 1;
    else
    {
      xmlLocalities.ResetPos();
      xmlLocalities.AddElem("locality",sLoc);
      xmlLocalities.SavePos(sLoc);
    }
    xmlLocalities.SetAttrib("n", n);
  }
  return xmlLocalities;
}
Would yield a result like this:
<locality n="890">London</locality>
<locality n="431">Yorkshire</locality>
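In the script above, SavePos and RestorePos play the role of a lookup table keyed by locality name. For comparison, the same tally in plain C++ can be kept in an ordinary associative container. This is only an illustration: TallyLocalities is a hypothetical function name, and the input is assumed to be a list of locality strings already extracted from the document.

```cpp
#include <map>
#include <string>
#include <vector>

// Count how many times each locality appears in the list.
std::map<std::string, int> TallyLocalities(const std::vector<std::string>& localities)
{
    std::map<std::string, int> counts;
    for (const std::string& loc : localities)
        ++counts[loc];   // operator[] starts a missing entry at 0
    return counts;
}
```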
how to clear the XML result
Eddie Wrenn 27-Jan-2010
Now I'm piggybacking "searches" on top of each other, so it will search for London, output them into a London file, then search for Yorkshire, and output that into a Yorkshire file. My problem is that the editor [script] will retain the results for London, and add them to the top of my Yorkshire file - is there a little code that will clear the internal memory before starting the next process?
Set the output document to an empty string before starting the next search:
xmlOutput.SetDoc("");
records that contain the State of Michigan
Grace 03-Feb-2014
I am new to XML and need some instructions. I have the same problem that one of your previous customers inquired about on your website in "Export XML records with matching childset." I am trying to get all the records that contain the State of Michigan into a separate file but it does not appear to be working for me. Below is an example of a record:
<Property>
<Description><![CDATA[Great Location at corner of Xyz.
Large older home, very charming! Note: Tenants pay 1/nth
of gas and electric. Water included.]]></Description>
<MinRent>1300</MinRent>
<MaxRent>1300</MaxRent>
<MarketingName/>
<Address>301 N test St</Address>
<City>Little Town</City>
<State>MI</State>
<Zip>49876</Zip>
<YearBuilt>0</YearBuilt>
<NumberUnits>7</NumberUnits>
<Latitude>12.3456789</Latitude>
<Longitude>-12.3456789</Longitude>
<AcceptsHcv>False</AcceptsHcv>
<PhoneNumber>(123) 456-7890</PhoneNumber>
<LastUpdated>1/30/2014 8:00:00 AM</LastUpdated>
<Amenity AmenityID="101" AmenityName="Parking"/>
<Amenity AmenityID="102" AmenityName="Unfurnished"/>
<Amenity AmenityID="103" AmenityName="Dishwasher"/>
<Amenity AmenityID="109" AmenityName="Garbage Disposal"/>
</Property>
The file is pretty big (225 MB).
Although for 225MB it is probably not necessary, the following script is written to handle extremely large input and output files by opening them in "file mode" (using MDF_READFILE for the input and MDF_WRITEFILE for the output). The "anywhere path" //Property is used to search the input document for Property records, and //State searches anywhere in the xmlRecord subdocument for the State. In the output window it shows you the count of records it searched and the number matched. If it shows 0 searched it is because your input does not contain any Property elements.
pull_by_State()
{
  str strSearch = "MI";
  int s = 0;
  int m = 0;
  CMarkup xmlInput, xmlRecord, xmlOutput;
  if (!xmlOutput.Open("C:\\test_" + strSearch + ".xml", MDF_WRITEFILE))
    return xmlOutput.GetResult();
  xmlOutput.AddElem("Search");
  xmlOutput.SetAttrib("criteria", strSearch);
  xmlOutput.IntoElem();
  if (!xmlInput.Open("C:\\test.xml", MDF_READFILE))
    return xmlInput.GetResult();
  while ( xmlInput.FindElem("//Property") )
  {
    ++s;
    xmlRecord.SetDoc( xmlInput.GetSubDoc() );
    if ( xmlRecord.FindGetData("//State") == strSearch )
    {
      xmlOutput.AddSubDoc(xmlRecord.GetDoc());
      ++m;
    }
  }
  xmlInput.Close();
  xmlOutput.Close();
  return "Searched " + s + " records, matched " + m;
}
See also:
Using the firstobject XML editor from the command line
Counting XML tag names and values with foal
Format XML, indent align beautify clean up XML
Simple XML editor meets memory stick
Split XML with XML editor script
Tree customization in the firstobject XML editor
Video demo of editing RSS XML in the tree view of the free firstobject XML editor
Video of XML Editor format XML, customize treeview, and program
Split XML file into smaller pieces
Video of XML splitter script for splitting XML files
bool CMarkup::GetNthAttrib( int n, MCD_STR& strName, MCD_STR& strValue ) const;
Call GetNthAttrib to get the string name and value of the Nth attribute in the main position element. The first attribute is 0, the second is 1, etc. If there is no current position or the current position node does not have the specified attribute, it returns false.
Similar to GetAttribName, this method lets you iterate through the attributes of an element or processing instruction. However, this is usually better because it provides the attribute value without an additional call to GetAttrib, and it returns a bool which is convenient for looping. For example:
MCD_STR strName, strAttrib;
int n = 0;
while ( xml.GetNthAttrib(n++, strName, strAttrib) )
{
  // do something with strName, strAttrib
}
GetNthAttrib also works when the main position is a processing instruction node with attributes. See Node Methods in CMarkup.
Release 11.3 has made a leap in performance (e.g. from 39MB/s to 53MB/s* excluding file I/O), so it's a good time to post some data on the speed of CMarkup, and to discuss XML parser performance issues. Here is a comparison of 11.3 with the previous release 11.2; raw parsing goes from 40000 to 54000 bytes per millisecond and attribute parsing (the basis for attribute methods) goes from 5000 to 9000 b/ms (see also Attribute Method Performance).
Release | parse doc | parse attrib | create doc | create attrib | Units
---|---|---|---|---|---
CMarkup 11.2 | 40002 | 5175 | 12331 | 4754 | b/ms
CMarkup 11.3 | 54042 | 9195 | 14394 | 6820 | b/ms
Since these measurements do not involve disk I/O, the speeds are measured in character units per millisecond where the character unit is b for byte, w for word (2 bytes), and dw for double word (4 bytes), depending on the build and platform. In the first chart I include 2 parse tests and then 2 corresponding create tests.
Test | Description
---|---
parse document | this is the core indicator of parsing speed; the document string is passed to SetDoc in memory and parsed, it is not loading the document from disk
parse attributes | loops through the document reading all attributes with GetAttribName and GetAttrib (the new GetNthAttrib method is a more efficient way to do this)
create document | builds a document using an AddElem and SetAttrib for each element; the document is not saved to disk, so there is no disk I/O in this measurement
create attributes | creates a document with up to 4 randomly selected attributes and values per element; the SetAttrib call occasionally overwrites an attribute
One of the most intensively used operations in the parser is determining whether a character is one of a set of characters. In 11.3 I replaced MCD_PSZCHR (strchr) with a lookup define which is an order of magnitude faster and yields a roughly 30% speed improvement in overall raw parser speed. The new lookup define only checks the bounds and then returns the entry at that offset in the lookup array, where c is the character, f and l are the bounds (first and last) and s is the lookup array (a string):
#define x_ISONEOF(c,f,l,s) ((c>=f&&c<=l)?(int)(s[c-f]):0)
So, for example, a whitespace check uses x_ISONEOF and passes the bounds 9 and 32, and a lookup string array for the range between those bounds:
// classic whitespace " \t\n\r"
#define x_ISWHITESPACE(c) x_ISONEOF(c,9,32, "\2\3\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1")
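To see the lookup at work, here is a small self-contained sketch. The x_ISONEOF define is taken from the text. The table is written as a named array (kWhitespaceTable, a name made up for this example) so that each entry can be labeled, and CountWhitespace is a hypothetical helper added for illustration. Each nonzero entry is the 1-based position of the character in the classic whitespace set " \t\n\r".

```cpp
// Lookup define from the text: check the bounds, then index into the table.
#define x_ISONEOF(c,f,l,s) ((c>=f&&c<=l)?(int)(s[c-f]):0)

// Table for character codes 9..32, split up so the entries can be labeled.
// (The original passes the same 24 characters as a single string literal.)
static const char kWhitespaceTable[] =
    "\2\3\0\0\4"      // 9 tab, 10 LF, 11, 12, 13 CR
    "\0\0\0\0\0\0"    // 14..19
    "\0\0\0\0\0\0"    // 20..25
    "\0\0\0\0\0\0"    // 26..31
    "\1";             // 32 space
static_assert(sizeof(kWhitespaceTable) == 25, "24 entries plus the terminator");

#define x_ISWHITESPACE(c) x_ISONEOF(c,9,32,kWhitespaceTable)

// Hypothetical helper: count the whitespace characters in a C string.
int CountWhitespace(const char* p)
{
    int n = 0;
    for ( ; *p; ++p )
        if ( x_ISWHITESPACE(*p) )
            ++n;
    return n;
}
```

Any character outside the range 9 to 32 is rejected by the bounds check alone, without touching the table, which is where most of the saving over a strchr scan would come from.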
Another roughly 5% overall improvement was gained by replacing MCD_PSZNCMP (strncmp) with a simple speedy implementation of string compare.
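The replacement comparison is not shown here, so the following is only a guess at what a simple implementation could look like, not CMarkup's actual code. SimpleStrNCmp is a hypothetical name; it follows the strncmp convention of returning zero when the first n characters match.

```cpp
// Hypothetical stand-in for a hand-written strncmp: returns 0 if the first
// n characters of p1 and p2 are equal, otherwise the difference at the
// first mismatch. Comparison also stops at a terminating null.
inline int SimpleStrNCmp(const char* p1, const char* p2, int n)
{
    while (n-- > 0)
    {
        if (*p1 != *p2)
            return (unsigned char)*p1 - (unsigned char)*p2;
        if (*p1 == '\0')
            return 0;   // both strings ended before n characters
        ++p1;
        ++p2;
    }
    return 0;
}
```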
Build configuration makes a big difference in performance. See ANSI and Unicode files and C++ strings and non-Unicode text handling in CMarkup for discussions of string character set options.
Build | parse doc | parse attrib | create doc | create attrib | Units
---|---|---|---|---|---
MFC (UTF-8) | 54042 | 9195 | 14394 | 6820 | b/ms
STL (UTF-8) | 55923 | 9193 | 11583 | 6061 | b/ms
MFC MBCS | 14424 | 3269 | 11084 | 3492 | b/ms
STL MBCS | 14783 | 3137 | 8636 | 3223 | b/ms
MSXML6 MFC MBCS | 3832 | 1762 | 1849 | 1347 | b/ms
MFC WCHAR | 57405 | 8607 | 14530 | 6594 | w/ms
STL WCHAR | 57780 | 8607 | 10744 | 5639 | w/ms
MSXML6 MFC WCHAR | 3950 | 1939 | 1963 | 1428 | w/ms
Using Unicode (either UTF-8 or WCHAR) strings in memory is much more efficient than MBCS which utilizes Windows APIs to determine character boundaries according to the locale character set. MSXML is very slow due to the overhead of COM and is slightly faster in a WCHAR build which avoids conversion to and from COM's WCHAR-based strings.
Unlike the measurements above, the XML reader and XML writer measurements are all in bytes per millisecond regardless of build because they are based on the file I/O rather than the in-memory character unit size. The file is UTF-8, which means the MBCS and wide character builds have the extra penalty of character set conversion. The MBCS conversion can be done using the libc (stdlib.h) function wctomb (not using the Windows API).
Build | XML reader | XML writer | Units
---|---|---|---
MFC | 15086 | 11528 | b/ms
STL | 13858 | 9540 | b/ms
MFC WCHAR | 10854 | 8757 | b/ms
STL WCHAR | 10717 | 7509 | b/ms
MFC MBCS | 11673 | 9846 | b/ms
STL MBCS | 10444 | 8137 | b/ms
MFC MBCS libc | 2231 | 2844 | b/ms
STL MBCS libc | 2155 | 2677 | b/ms
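For readers unfamiliar with wctomb, the sketch below shows how a wide string can be converted one character at a time using only the C library, as in the libc rows of the table. It is an illustration under assumptions, not CMarkup's code: WideToMultibyte is a hypothetical helper, the target encoding is whatever the current C locale specifies, and characters the locale cannot represent are replaced with a question mark.

```cpp
#include <climits>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: convert a wide string to the locale's multibyte
// encoding, one character per wctomb call.
std::string WideToMultibyte(const std::wstring& wide)
{
    std::string out;
    char buf[MB_LEN_MAX];
    std::wctomb(nullptr, 0);   // reset any shift state
    for (wchar_t wc : wide)
    {
        const int len = std::wctomb(buf, wc);
        if (len < 0)
            out += '?';        // not representable in this locale
        else
            out.append(buf, static_cast<std::size_t>(len));
    }
    return out;
}
```

Making one library call per character is a plausible reason for the libc rows being so much slower than the others, though that is not measured here.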
See also:
Archived CMarkup Performance Tests
Attribute Method Performance
* Measurements here are representative of the speed with my own sample data on a 1.7GHz 1GB Vista netbook. Running these tests twice in a row often gets slightly different results because they are affected by variations in CPU.
Attribute parsing performance came up several times this year, and some significant improvements were made in CMarkup release 11.3.
In every call to one of its attribute methods, CMarkup reparses the attributes up to the one that is accessed. This can lead to poorer than expected performance when you have attribute-intensive code, i.e. code that repeatedly accesses or checks for many attributes. This is due to an original design trade-off: CMarkup does not store attribute indexes.
CMarkup - Attribute Query Speed
Cameron Dunn 23-Jun-2010
I've been very impressed with the speed of loading and parsing. However, I've hit one area which is surprisingly slow which I wanted to ask you about - XML attributes.
I'm loading about 3000 XML files, for a total of 99468934 bytes. I'm loading in the files myself and then passing the string to CMarkup. If I do that, and then loop down into every element in every file, it takes about 2 seconds (specifically, 1985ms), which I thought was pretty impressive.
However, if I do the same thing but also loop over every attribute on every element, it takes 8 seconds. I found this a bit surprising - obviously there's no additional file IO time or anything like that, it's all in string processing. The interface which CMarkup provides to access attributes is very string heavy - you need to get the attribute by name and then query the value using this name.
Is there a quicker way to loop over the XML attributes? I need the name and value for each attribute, but they can be in the order in which they occur in the file.
...I iterate the attributes for a single element with GetAttribName() and then call GetAttrib() to get their values.
CMarkup release 11.3 introduces a new method GetNthAttrib which is twice as efficient as GetAttribName combined with GetAttrib, and in addition attribute parsing is about twice as fast (see CMarkup XML Parser Performance). So, iterating the attributes in your case might be reduced from 6 seconds (the 8 second total less the 2 seconds spent traversing elements) to about 1.5 seconds, since the two factors of two combine to roughly a fourfold reduction.
I did design a solution to manage and reuse attribute indexes for the current element, but it was actually slower for a single attribute access and wasn't really fast enough to justify the added complexity. Another option would be to include attributes much like elements in CMarkup indexing, but I think that's too fundamental at this point. So I've chosen to remain with the original reparse design for the time being, and hopefully the 11.3 performance boost and new method will help out enough.
If you have intensive use of attributes, in some cases you might want to extract them with GetNthAttrib to an external map as a more efficient mechanism to access them repeatedly. You can even map them in a separate CMarkup object as elements using SavePos and then RestorePos to do the lookup.
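The external map idea might be sketched as follows. This is an illustration rather than part of CMarkup: CacheAttribs is a hypothetical helper, and it assumes a non-wide STL build in which MCD_STR is std::string. It is written as a template so that it depends only on the GetNthAttrib signature documented above.

```cpp
#include <map>
#include <string>

// Copy every attribute of the current element into a map so that repeated
// lookups by name do not reparse the element's attributes.
// Doc is assumed to provide GetNthAttrib(int, std::string&, std::string&) const.
template <typename Doc>
std::map<std::string, std::string> CacheAttribs(const Doc& xml)
{
    std::map<std::string, std::string> attribs;
    std::string name, value;
    for (int n = 0; xml.GetNthAttrib(n, name, value); ++n)
        attribs[name] = value;
    return attribs;
}
```

After positioning a CMarkup object on an element, you would call CacheAttribs(xml) once and then look values up in the returned map. The map is a snapshot, so it has to be rebuilt if the attributes change or the position moves to another element.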
See how to use the tree view to edit RSS (and any XML or HTML) in this screencast video demonstrating this new feature in the free firstobject XML editor release 2.4.1.
See also:
video of XML splitter script
Format XML, indent align beautify clean up XML
Tree customization in the firstobject XML editor
This screencast video demonstrates the free firstobject XML editor, and how to format XML, customize the treeview, generate and step through a C++ style program.
See also:
video of XML splitter script
Format XML, indent align beautify clean up XML
Tree customization in the firstobject XML editor
firstobject Access Language
Counting XML tag names and values with foal
Convert ANSI file to Unicode