Control Characters in XML 9 Feb 2012 8:12 AM (13 years ago)

Aside from the carriage return, linefeed, tab, and null, the "non-printable" control characters only mean something to ancient terminals and transmission protocols. Nowadays they are avoided in text documents, and therefore should be avoided in XML too. But if for some reason you need to represent them in XML, you might like to use XML's natural numeric reference escaping mechanism. Unfortunately, it is not that simple.

In its infinite wisdom the XML 1.0 standard excluded the control characters in the range 0x01 to 0x1f except whitespace 0x09, 0x0a, 0x0d, even in escaped form. This was reversed in XML 1.1 but it was too late. This mistake is perfectly understandable if you agree that the purpose of the XML standard was to over-engineer the concept of a simple markup format. ;)
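
To make the excluded range concrete, here is a minimal sketch of the XML 1.0 Char production written as a predicate. The function name isXml10Char is hypothetical and is not part of CMarkup; the ranges are the ones listed in the charsets section of the XML 1.0 spec.

```cpp
#include <cstdint>

// Hypothetical helper (not a CMarkup function).
// Returns true if the code point is allowed in an XML 1.0 document,
// whether written literally or as a numeric character reference.
bool isXml10Char(std::uint32_t cp)
{
    return cp == 0x09 || cp == 0x0A || cp == 0x0D   // tab, linefeed, carriage return
        || (cp >= 0x20 && cp <= 0xD7FF)             // excludes the other C0 controls
        || (cp >= 0xE000 && cp <= 0xFFFD)           // excludes surrogates, FFFE, FFFF
        || (cp >= 0x10000 && cp <= 0x10FFFF);
}
```

With this predicate, isXml10Char(0x0E) is false while isXml10Char(0x09) is true, which is exactly the distinction that trips up escaped control characters.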

comment posted CMarkup with non-printable characters

Matt 09-Feb-2012

Basically, CMarkup::EscapeText(...) allows non-printable characters like 0xE to be transmitted as-is. But some XML parsers seem to be picky about this, and cite the spec as the reason: http://www.w3.org/TR/xml/#charsets. This is leading to this real-world problem [where we pass text from a third party source application to a third party receiving application via XML. When the text in the XML contains a 0x0e (which only has meaning in the source application) a failure is triggered in the receiving application].

After testing, it looks like a lot of parsers still reject the XML if it has an escaped non-print character like &#x0E;. Our plan is to convert non-prints to question marks.

I think in your case you should convert the control characters to question marks (or remove them) as part of scrubbing the source data because the receiving application would never want them anyway.
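
A scrubbing step like that could look roughly like the following sketch. It is standalone C++ that runs on the text before it is handed to CMarkup; the function name scrubControlChars and the default replacement character are assumptions for illustration.

```cpp
#include <string>

// Hypothetical pre-processing helper (not a CMarkup function).
// Replaces C0 control characters other than tab, linefeed and carriage
// return with a replacement character. Working byte by byte is safe for
// ASCII and UTF-8 text because bytes below 0x20 never occur inside a
// multi-byte UTF-8 sequence.
std::string scrubControlChars(const std::string& text, char replacement = '?')
{
    std::string out = text;
    for (char& c : out)
    {
        unsigned char u = static_cast<unsigned char>(c);
        if (u < 0x20 && u != 0x09 && u != 0x0A && u != 0x0D)
            c = replacement;
    }
    return out;
}
```

To remove the characters instead of replacing them, the same test can be used to skip bytes while building the output string.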

But if someone needed to make CMarkup automatically escape control characters, it would require a small modification. If you have 11.5, it is Markup.cpp:2967, but in any recent release in CMarkup::EscapeText before the else { nCharLen = MCD_CLEN( pSource );... add the following else if clause:

else if ( cSource<0x20 && cSource>0 && cSource!=0x0a && cSource!=0x0d && cSource!=0x09 )
{
  // 0x0e becomes &#xe;
  MCD_CHAR szEscaped[10];
  MCD_SPRINTF( MCD_SSZ(szEscaped), MCD_T("&#x%x;"), (int)cSource );
  MCD_BLDAPPEND(strText,szEscaped);
  ++pSource;
}

before this:

else
{
  nCharLen = MCD_CLEN( pSource );
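
For trying out the escaping rule outside of CMarkup, here is a standalone sketch of the same condition using std::string instead of the MCD_ macros. The function name escapeControlChars is hypothetical, and the sketch deliberately leaves the usual XML specials (such as < and &) alone, so it is not a replacement for EscapeText.

```cpp
#include <cstdio>
#include <string>

// Hypothetical standalone version of the else-if clause above.
// Control characters other than tab, linefeed and carriage return are
// written as hexadecimal numeric character references.
std::string escapeControlChars(const std::string& text)
{
    std::string out;
    for (unsigned char c : text)
    {
        if (c > 0 && c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D)
        {
            char ref[10];
            std::snprintf(ref, sizeof(ref), "&#x%x;", static_cast<unsigned>(c));
            out += ref;   // e.g. 0x0e becomes &#xe;
        }
        else
        {
            out += static_cast<char>(c);
        }
    }
    return out;
}
```

Keep in mind that, as described above, an XML 1.0 parser is still entitled to reject these references; they are only legal in XML 1.1.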

Re: ASCII control characters in XML Yes, the XML spec clearly rules these characters out. We didn't discuss it that much during the process - it seemed like a good idea, and nobody on any of the committees seemed troubled at the prospect of losing them; so I'm afraid this is a hardwired characteristic of XML 1.0, and you're stuck with it. -Tim Bray Tue, 28 Apr 1998

Re: control characters I'm not sure we'd do it the same way if we were doing it again. I don't see that they do any real harm. -Tim Bray Sat, 17 Jun 2000

XML 1.1 allows the use of character references to the control characters #x1 through #x1F, most of which are forbidden in XML 1.0. For reasons of robustness, however, these characters still cannot be used directly in documents. In order to improve the robustness of character encoding detection [and prove the committee consists of incorrigible meddlers], the additional control characters #x7F through #x9F, which were freely allowed in XML 1.0 documents, now must also appear only as character references. (Whitespace characters are of course exempt.) -Extensible Markup Language (XML) 1.1 W3C Candidate Recommendation 15 October 2002

See also:

EscapeText

UnescapeText

Split and Merge Translation XML 8 Dec 2011 5:39 AM (13 years ago)

comment posted separate, translate, and import back in

Pablo 08-Dec-2011

I work for the translation department at my company. We normally receive monolingual XML files for translation. In such cases, we just make copies of these files and translate each set to the corresponding target languages. In this file, the text in each node <INFO> under <LANGUAGE>EN</LANGUAGE> should be translated into the rest of the languages (DA, DE, ES, FI, NL, NO, SV). What I need to do is the following:

For each target language, extract all the strings in the source language into a separate XML. I would have one XML file per target language.
The XML files are translated into the corresponding languages.
The translated XML files are imported back into the original XML. In this way, I would obtain a multilingual XML, with the source strings and the corresponding translations for the different languages.

<?xml version="1.0" encoding="UTF-8" ?>
<INFORMATION>
  <FUNDOBJECTIVE>
    <FUNDOBJECTIVEDATA ID="1">
      <FUNDOBJECTIVEDATAITEM>
        <LANGUAGE>EN</LANGUAGE>
        <INFO>Text to be translated.</INFO>
      </FUNDOBJECTIVEDATAITEM>
      <FUNDOBJECTIVEDATAITEM>
        <LANGUAGE>DA</LANGUAGE>
        <INFO></INFO>
      </FUNDOBJECTIVEDATAITEM>
      <FUNDOBJECTIVEDATAITEM>
        <LANGUAGE>DE</LANGUAGE>
        <INFO></INFO>
      </FUNDOBJECTIVEDATAITEM>
      <FUNDOBJECTIVEDATAITEM>
        <LANGUAGE>ES</LANGUAGE>
        <INFO></INFO>
      </FUNDOBJECTIVEDATAITEM>
      <FUNDOBJECTIVEDATAITEM>
        <LANGUAGE>FI</LANGUAGE>
        <INFO></INFO>
      </FUNDOBJECTIVEDATAITEM>
      <FUNDOBJECTIVEDATAITEM>
        <LANGUAGE>NL</LANGUAGE>
        <INFO></INFO>
      </FUNDOBJECTIVEDATAITEM>
      <FUNDOBJECTIVEDATAITEM>
        <LANGUAGE>NO</LANGUAGE>
        <INFO></INFO>
      </FUNDOBJECTIVEDATAITEM>
      <FUNDOBJECTIVEDATAITEM>
        <LANGUAGE>SV</LANGUAGE>
        <INFO></INFO>
      </FUNDOBJECTIVEDATAITEM>
    </FUNDOBJECTIVEDATA>
    <FUNDOBJECTIVEDATA ID="2">
      <FUNDOBJECTIVEDATAITEM>
        <LANGUAGE>EN</LANGUAGE>
        <INFO>More text to be translated.</INFO>
      </FUNDOBJECTIVEDATAITEM>
      <FUNDOBJECTIVEDATAITEM>
        <LANGUAGE>DA</LANGUAGE>
        <INFO></INFO>
      </FUNDOBJECTIVEDATAITEM>
      <FUNDOBJECTIVEDATAITEM>
        <LANGUAGE>DE</LANGUAGE>
        <INFO></INFO>
      </FUNDOBJECTIVEDATAITEM>
      <FUNDOBJECTIVEDATAITEM>
        <LANGUAGE>ES</LANGUAGE>
        <INFO></INFO>
      </FUNDOBJECTIVEDATAITEM>
      <FUNDOBJECTIVEDATAITEM>
        <LANGUAGE>FI</LANGUAGE>
        <INFO></INFO>
      </FUNDOBJECTIVEDATAITEM>
      <FUNDOBJECTIVEDATAITEM>
        <LANGUAGE>NL</LANGUAGE>
        <INFO></INFO>
      </FUNDOBJECTIVEDATAITEM>
      <FUNDOBJECTIVEDATAITEM>
        <LANGUAGE>NO</LANGUAGE>
        <INFO></INFO>
      </FUNDOBJECTIVEDATAITEM>
      <FUNDOBJECTIVEDATAITEM>
        <LANGUAGE>SV</LANGUAGE>
        <INFO></INFO>
      </FUNDOBJECTIVEDATAITEM>
    </FUNDOBJECTIVEDATA>
  </FUNDOBJECTIVE>
</INFORMATION>

Could you please let me know if any of your products would allow me to do the steps 1 and 3 mentioned above? I know step 1 can be done, as I've seen similar examples. What about step 3? How complex would it be to import the translations back? I'm not a programmer -- I have a basic programming knowledge. Does your product require an advanced level of programming?

I wrote two scripts for Pablo (he reported back that they worked perfectly), one for split and one for merge afterwards. To try them out:

install the free firstobject XML editor
go to File New Program and paste the split function in and save the script as say "translate_split.foal"
go to File New Program and paste the merge function in and save the script as say "translate_merge.foal"
Adjust the folder value in both scripts to the folder where you wish to do your processing
put your sample.xml in that folder and name it translate_input.xml
Open the split script and press F9, you will see new files named translate_to_AA.xml
Perform translations and gather files back to this folder
Open the merge script and press F9, you will see the file named translate_merged.xml

split()
{
  str sFolder = [["C:\Temp\"]];
  CMarkup mInput, mOutput;
  mInput.Load(sFolder+"translate_input.xml");
  int nIDCount = 0;
  while (mInput.FindElem("//FUNDOBJECTIVEDATA"))
  {
    // Extract the ID for this data
    str sID = mInput.GetAttrib("ID");
    
    // The first item must be EN
    mInput.IntoElem();
    mInput.FindElem("FUNDOBJECTIVEDATAITEM");
    mInput.FindChildElem("LANGUAGE");
    if (mInput.GetChildData() != "EN")
      return "unexpected: first data item under ID " + sID + " is not EN";
    mInput.FindChildElem("INFO");
    str sInfo = mInput.GetChildData();

    // Generate data elements for subsequent languages
    ++nIDCount;
    while (mInput.FindElem("FUNDOBJECTIVEDATAITEM"))
    {
      mInput.FindChildElem("LANGUAGE");
      str sLang = mInput.GetChildData();
      if (! mOutput.RestorePos(sLang))
      {
        mOutput.ResetPos();
        mOutput.AddElem("INFORMATION");
        mOutput.IntoElem();
        mOutput.AddElem("FUNDOBJECTIVE");
        mOutput.SavePos(sLang);
      }
      mOutput.AddChildElem("FUNDOBJECTIVEDATA");
      mOutput.IntoElem();
      mOutput.SetAttrib("ID",sID);
      mOutput.AddChildElem("LANGUAGE", sLang);
      mOutput.AddChildElem("INFO", sInfo);
    }
  }

  // Output files
  int nFileCount = 0;      
  mOutput.ResetPos();
  while (mOutput.FindElem("INFORMATION"))
  {
    CMarkup mOutputLang = mOutput.GetSubDoc();
    str sLang = mOutputLang.FindGetData("//LANGUAGE");        
    mOutputLang.Save(sFolder+"translate_to_" + sLang + ".xml");
    ++nFileCount;
  }
  
  return "Generated " + nFileCount + " files, " + nIDCount + " data elements per file";
}

merge()
{
  str sFolder = [["C:\Temp\"]];
  CMarkup mInput, mOutput;
  mInput.Load(sFolder + "translate_input.xml");
  
  // Loop through all translated files
  CMarkup mInputFiles = EnvFindFiles(sFolder+"translate_to_*.xml");
  mInputFiles.ResetPos();
  while (mInputFiles.FindElem())
  {
    CMarkup mTrans;
    mTrans.Load(sFolder + mInputFiles.GetData());
    str sLang = mTrans.FindGetData("//LANGUAGE");
    if (sLang == "")
    {
      return "language not found in " + mInputFiles.GetData();
    }
    
    // Loop through data of input and bring in items from this language
    mInput.ResetPos();
    while (mInput.FindElem("//FUNDOBJECTIVEDATA"))
    {
      str sID = mInput.GetAttrib("ID");
      mTrans.ResetPos();
      if (mTrans.FindElem("//FUNDOBJECTIVEDATA[@ID='" + sID + "']"))
      {
        str sInfo = mTrans.FindGetData("//INFO");

        // Locate corresponding language item to place translation into        
        mInput.IntoElem();
        while (mInput.FindElem())
        {
          mInput.FindChildElem("LANGUAGE");
          if (mInput.GetChildData() == sLang)
          {
            mInput.FindChildElem("INFO");
            mInput.SetChildData(sInfo);
            break;
          }
        }
      }
    }
  }
  mInput.Save(sFolder+"translate_merged.xml");
}

The easiest way to adjust and customize the scripts is to press F10 and run them line by line and see how they do what they do. Then look up any additional functions you need either with F1 or searching firstobject.com. There are also ways to set these scripts up as command line calls, see using the firstobject XML editor from the command line.

See also:

Split XML file into smaller pieces

Video of XML splitter script for splitting XML files

CMarkup 11.5 Release Notes 23 Apr 2011 9:05 AM (14 years ago)

Release 11.5 Date: April 23, 2011, download

Fixes for whitespace trimming/collapsing, and file read mode, as well as some changes in compiler #ifdef handling for WIN32.

Summary:

Qt Windows compiling
end-of-line options
fix: trim whitespace escaped chars
fix: 64-bit compiler warnings for MCD_BLDLEN
fix: (Dev only) file read mode bug

Details:

Using CMarkup with Qt on Windows is easier now; you shouldn't have to do any tweaking of CMarkup to add it to your Qt project. The changes involved special cases for GNUC when it is used on Windows. Either the WIN32 (Windows.h) or _WIN32 (Visual Studio) preprocessor defines will let CMarkup know it is compiling for Windows.

In 11.3 and 11.4, MDF_TRIMWHITESPACE and MDF_COLLAPSEWHITESPACE would remove an escaped char at the end of the trimmed data (see 11.3 Bug: trim whitespace removes escaped value).

On Linux and OS X, lines generated by CMarkup will now end with a newline by default instead of a Windows style CRLF (carriage return line feed), and the end-of-line setting can now be directed with preprocessor definitions. If you are on a non-Windows platform and want your CRLFs back, now you must add MARKUP_EOL_CRLF to your preprocessor definitions.

End Of Line Defines
Name	Value	Description
`MARKUP_EOL_CRLF`	`MCD_T("\r\n")`	Aka `0d 0a`, this is the default for Windows builds
`MARKUP_EOL_NEWLINE`	`MCD_T("\n")`	Aka `0a`, this is now the default for non-Windows builds
`MARKUP_EOL_RETURN`	`MCD_T("\r")`	Aka `0d`, this is rarely used
`MARKUP_EOL_NONE`	`MCD_T("")`	For minimal size, documents will be on one line, but GetDocFormatted will not produce desired results

The file read mode bug fix only affects CMarkup Developer and the FOAL C++ scripting in the free XML editor.

The file read mode bug occurred with elements over 32k long that did not have child elements. This problem was only in file read mode where the Open method is used with MDF_READFILE. See File read GetSubDoc incomplete.

See also previous CMarkup release notes: 11.4, 11.3, 11.2, 11.1, 11.0, 10.1, 10.0, Archived CMarkup Release Notes

Euro and other non-ASCII chars in XML with CMarkup 2 Mar 2011 3:08 AM (14 years ago)

You might work only with text in the ASCII range (below 128) or have some non-ASCII text like the Euro character or Western European characters with accents and umlauts. Here are some examples of how to handle encoding issues as you move beyond ASCII.

comment posted euro is unreadable in XML

Davide 02-Mar-2011

I need to insert an amount in a UTF-8 xml file, something like:

CString sFmt;
sFmt.Format( _T("%d €"), nPrice );
xml.AddElem( _T("Price"), sFmt );

But the resulting xml is unreadable. I've found a workaround using:

CString sFmt;
sFmt.Format( _T("%d \xE2\x82\xAC"), nPrice );
xml.AddElem( _T ("Price"), sFmt );

Using \xE2\x82\xAC for the euro is correct in your case because your string encoding is UTF-8.

When you specify a non-ASCII character in a source file on Windows it is compiled into your program in the locale charset. So the problem with the euro symbol in sFmt is that it is in your locale's MBCS (in which the euro is represented by one byte) and CMarkup is expecting UTF-8 (which is the case when your project is set to use neither MBCS nor UNICODE). You were able to work around it by putting the UTF-8 encoding directly in the string.

If compiling for MBCS you could have used the euro character directly in your source string, but the result would only be satisfactory as long as the program is running on a machine with your same locale "Language for non-Unicode programs."
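
To see where the bytes E2 82 AC come from, here is a small sketch that encodes a Unicode code point as UTF-8. The function name encodeUtf8 is hypothetical (it is not a CMarkup method), and the sketch does not check for surrogates or values above U+10FFFF.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// No validation of surrogates or out-of-range values is done here.
std::string encodeUtf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);                          // 1 byte: ASCII
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

The euro is U+20AC, so encodeUtf8(0x20AC) yields the three bytes E2 82 AC used in the workaround, whereas in Windows-1252 the same character is the single byte 80.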

C++ string charset build options

This is another opportunity to discuss the internal memory string encoding choices in C++ (also described in ANSI and Unicode files and C++ strings).

The CMarkup class has a string member m_strDoc that holds the XML document (or part of it in the case of file mode). Also, the CMarkup methods accept and return strings. The encoding of these strings depends on platform and compiler options.

Wide char Unicode: UTF-16 on Windows, generally UTF-32 on OS X and Linux

MBCS: Windows only, depends on machine's locale setting e.g. Western European, Korean

UTF-8 Unicode: plain byte-based text, UTF-8 on OS X and Linux, can be UTF-8 on Windows

You select the UTF-8 option in Windows by turning off the UNICODE (wide char) and MBCS project defines. In Visual Studio 2005+ Properties General Character Set choose the "Not Set" option. In this case, all the strings going into and out of the CMarkup methods are expected in UTF-8.

Note: the terminology is confusing, but in this Windows context UTF-8 is neither MBCS nor UNICODE. UTF-8 is "multibyte," however Windows uses MBCS to refer only to those character sets that can be selected for the machine locale and used in "A" APIs and Windows messages. And although UTF-8 is Unicode, Windows uses UNICODE to refer only to UTF-16 used in "W" APIs and Windows messages. In Windows you must convert UTF-8 strings to MBCS for "A" APIs like SetWindowTextA or better yet to UTF-16 for "W" APIs like SetWindowTextW (that's what CMarkup's UTF8To16 and UTF16To8 are for).

If you have a UTF-8 file, using UTF-8 in memory eliminates the need to convert the text encoding between file and memory.

If you have a UTF-8 file and compile for MBCS in memory, CMarkup converts the XML to the locale code page when it is loaded into memory. This has performance and multi-language disadvantages. The conversion happens every time text moves between memory and file, which adds time (though less than the time to read from disk) and might matter depending on your performance requirements. In addition, you will lose any Unicode characters not supported in the locale code page where the program is running. For example, ö (o with an umlaut) is byte 246 in the standard Windows U.S. code page Windows-1252, but it is not supported in Greek Windows-1253 (although the Euro is).

comment posted GetAttrib result is a question mark

Chen 22-Aug-2011

When the xml data has a node like the following:

<block solution ="～" />

CMarkup's function GetAttrib("solution") has the result "?".

The character in your attribute is U+FF5E (65374, in the Halfwidth and Fullwidth Forms block FF00 - FFEF; UTF-8 EF BD 9E). When this character is not supported in the character set in memory, it is replaced by a question mark. You likely have an MBCS build which expects strings to be in the system locale Windows code page (not Unicode). It is best if you can use a Unicode charset in memory -- either UTF-8 or wide string, as explained above.
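
The question mark substitution can be illustrated with a deliberately simplified sketch. It converts Unicode code points to ISO-8859-1, chosen only because its byte values coincide with the first 256 code points; a real Windows code page such as Windows-1252 needs a lookup table, and this is not how CMarkup performs the conversion. The function name toLatin1Lossy is hypothetical.

```cpp
#include <string>

// Hypothetical illustration of a lossy conversion to a single-byte charset.
// ISO-8859-1 is used as a stand-in because its bytes equal code points
// U+0000..U+00FF; anything outside that range cannot be represented.
std::string toLatin1Lossy(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text)
        out += (cp <= 0xFF) ? static_cast<char>(cp) : '?';
    return out;
}
```

Here ö (U+00F6) survives as the single byte F6, but U+FF5E has no slot in the target charset and comes out as "?", which is the symptom reported above.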

comment posted Trouble working with Arabic XML

Greg 16-Jun-2011

I am reading in a bunch of small Arabic XML files and combining them into a larger one. My problem is that I must not be handling something correctly when I read the data into memory because the output is gibberish (examples follow). Here are the relevant details of my environment: CMarkup version 11, Microsoft Visual Studio 2010 Ultimate, Unmanaged C++, MBCS, XML is UTF-8 with no BOM. Here is an example of what I am reading in:

<?xml version="1.0" encoding="UTF-8"?>
<Symbol xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:noNamespaceSchemaLocation="..\schemas\symbolcanonical.xsd">
<Filename>11.esds</Filename>
<Description>
<LocText>ضع هذاالرمز  على رسم المخطط  .ثم ابداء الطباعة  .</LocText>
</Description>
</Symbol>

And here is an example of what I am writing out:

<?xml version="1.0" encoding="UTF-8"?>
<SymbolCollection>
<SymbolList>
<Symbol>
<Filename>11.esds</Filename>
<Description>?? ????????  ??? ??? ??????  .?? ????? ???????  .</Description>
</Symbol>
</SymbolList>
</SymbolCollection>

I’ve played around with various SetDocFlags and setlocale options and whatnot...

[Selecting "Not Set" for the project character set solved the problem.] I didn't realize I had a viable third choice on that build setting.

When your project is set to use MBCS, CMarkup converts the file to your system locale charset in memory. If you can set your Character Set to the "Not Set" option in your Project Properties it will keep your XML in UTF-8.

comment posted encoding of XML with german umlaute

Ahyan 19-Jul-2011

It is possible for developers to save their XML files in an unfortunately inconsistent way where the XML header encoding information does not fit to the file content. That happens when the XML files are modified in a text editor that does not care if it is saving an XML file and if the encoding header matches the content. Within this text editor one has to explicitly specify the encoding of the text file (which is actually XML in this case) with the "Save File As" options. So we end up having XML files with incorrect headers and a given XML file that can contain "german umlaute" (special german characters like ä,ö etc) will be invalid because the XML header states e.g. an encoding ("UTF-8" or "8859-1") that doesn't fit the actual content. This can be improved by training the developers...

Yes, in the real world, you get situations where you need to salvage improperly declared XML documents. Say you have a header (an "XML declaration") at the top of your XML file:

<?xml version="1.0" encoding="UTF-8"?>

But the encoding of the non-ASCII characters is actually Windows-1252. You can get CMarkup to ignore the "UTF-8" specified there by using the ReadTextFile and WriteTextFile functions directly and specifying the desired encoding:

string strDoc, strEncoding="Windows-1252";
CMarkup::ReadTextFile("C:\\file.xml",strDoc,NULL,NULL,&strEncoding);
CMarkup xml;
xml.SetDoc(strDoc);
...
strDoc = xml.GetDoc();
CMarkup::WriteTextFile("C:\\file.xml",strDoc,NULL,NULL,&strEncoding);

This should allow you to leave (and ignore) the incorrect encoding in the XML declaration.
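
If you need to decide per file whether the "UTF-8" declaration can be trusted, one common heuristic is to test whether the raw bytes are well-formed UTF-8. Text in Windows-1252 or ISO-8859-1 that contains umlauts almost never passes such a check. The sketch below is a standalone assumption-laden illustration (the name looksLikeUtf8 is hypothetical and not part of CMarkup); note that pure ASCII passes regardless of the real encoding, so a pass is not proof.

```cpp
#include <cstddef>
#include <string>

// Hypothetical heuristic: returns true if the bytes are well-formed UTF-8.
bool looksLikeUtf8(const std::string& bytes)
{
    std::size_t i = 0, n = bytes.size();
    while (i < n)
    {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        std::size_t extra;   // number of continuation bytes expected
        unsigned long cp;    // code point being assembled
        if (b < 0x80)                    { extra = 0; cp = b; }
        else if (b >= 0xC2 && b <= 0xDF) { extra = 1; cp = b & 0x1F; }
        else if (b >= 0xE0 && b <= 0xEF) { extra = 2; cp = b & 0x0F; }
        else if (b >= 0xF0 && b <= 0xF4) { extra = 3; cp = b & 0x07; }
        else return false;               // stray continuation or invalid lead byte
        if (extra > n - i - 1) return false;   // sequence truncated at end of data
        for (std::size_t k = 1; k <= extra; ++k)
        {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and values beyond U+10FFFF
        if (extra == 2 && cp < 0x800) return false;
        if (extra == 3 && (cp < 0x10000 || cp > 0x10FFFF)) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;
        i += extra + 1;
    }
    return true;
}
```

A file that declares UTF-8 but fails this check is a candidate for the explicit-encoding approach shown above.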

See also:

ANSI and Unicode files and C++ strings

GetDeclaredEncoding

Archived CMarkup 11.4 Release Notes 5 Feb 2011 2:31 PM (14 years ago)

Release 11.4 Date: February 5, 2011, download

An important fix to the 11.3 whitespace features, improvements in the GetDocFormatted method, and an enhancement to HasAttrib.

Here's the list of 11.4 enhancements:

fix: MDF_TRIMWHITESPACE and MDF_COLLAPSEWHITESPACE were crashing on values that were only whitespace... sorry, a glaring hole in the test cases
GetDocFormatted now removes any space between the attribute name and value (and has improvements in speed and memory efficiency)
HasAttrib can now return the attribute value as well, making it an alternative to GetAttrib for convenience and performance
fix: rare HTML parser case for attributes without values in empty start tags, e.g. <A a/>

See also previous CMarkup release notes: 11.3, 11.2, 11.1, 11.0, 10.1, 10.0, Archived CMarkup Release Notes

Archived CMarkup 11.3 Release Notes 20 Nov 2010 2:15 PM (14 years ago)

Release 11.3 Date: November 20, 2010, download

A performance improvement makes CMarkup significantly faster! Overall parsing speed is up 35%, and attribute methods are twice as fast as 11.2. This release also includes document flags to trim whitespace and collapse whitespace.

Here's the list of 11.3 enhancements:

overall parser performance increased about 35%, see CMarkup XML Parser Performance
attribute method performance improvements
New method GetNthAttrib retrieves name and value of attribute 0, 1, 2...
use MDF_TRIMWHITESPACE and MDF_COLLAPSEWHITESPACE to affect retrieved values (see Whitespace and CMarkup)
fix: bug in Linux/OS X TextEncoding::IConv *thanks Frank Dering
performance measures added to the tests and CMarkupTesting.xml output (see CMarkup test dialog)

See also previous CMarkup release notes: 11.2, 11.1, 11.0, 10.1, 10.0, Archived CMarkup Release Notes

Using the firstobject XML editor from the command line 20 Nov 2010 2:12 PM (14 years ago)

The free firstobject XML editor has some command line switches. Here is the summary; details are below.

Switch	Purpose
`-new`	open in a new instance of the editor (useful when single instance preference is selected)
`-same`	open in the existing instance of the editor
`-watch "C:\event.log"`	open file in read-only auto-reloading mode to view the tail of log files
`-line 23`	open file at a line
`-offset 451`	UTF-8 offset from beginning of document or from beginning of line if line is specified
`-fromoffset 5`	pre-select text from this offset to specified offset
`-run script.foal:f arg1 arg2`	execute the script without showing the editor

Examples:

foxe.exe "C:\XML examples\file.xml" -line 6

foxe.exe file.xml -line 6 -offset 5

foxe.exe -offset 240 -fromoffset 235 file.xml

foxe.exe -watch C:\event.log

foxe.exe -run C:\script.foal

foxe.exe -same file.xml

"C:\Program Files\firstobject\foxe.exe" -new file.xml

comment posted How to run script automatically

Angela Baines 18-Jan-2010

The foal script works a treat now. When I move this to production it has to run as part of an automated project that will be scheduled to run overnight

As of release 2.4.1 the free firstobject XML editor has a command line switch to run a script:

foxe -run "C:\foal scripts\script.foal"

It generates 2 files, foxe_err.txt and foxe_out.txt. The err file contains marked up information about the run and can help diagnose issues with running. The out file contains the output returned from the script in the return statement.

And with release 2.4.2 you can specify function and arguments, see below.

comment posted Run from command line problem

Garth Lancaster 19-Jul-2010

In foxe I can do this [with a foal script]

str NavigateIterativelyXYZ_Generated( CMarkup mDocToNavigate )
{
  mDocToNavigate.ResetPos();
  str sXML = mDocToNavigate.GetDocFormatted(0);
  return sXML;
}

so when I run it, a window pops up and asks which document I wish to convert, shows me the output, all is good... except now, I wish to do this from a command line.

foxe -run align0.foal input.xml output.xml

Where align0.foal contains the foal program, input and output names will likely come from a batch script % parameter or such, and input.xml will contain the streamed xml and output.xml will contain the results of the GetDocFormatted(0). I'm thinking foal can be very powerful, am I missing a point in its implementation?

Update April 23, 2011: With release 2.4.2, it works the way it should (the way Garth described). The -run command line option passes any number of command line arguments into the parameters of the function in the foal script. For example (remember to use quotes if a path or argument contains spaces):

foxe -run "C:\foal scripts\script.foal" C:\in.xml C:\out.xml

Here is a script to format XML that works with the previous command line:

formatxml(str sInPath, str sOutPath)
{
  CMarkup m, r;
  r.AddElem( "load", sInPath );
  m.Load( sInPath );
  r.AddSubDoc( m.GetResult() );
  if ( ! m.SetDoc(m.GetDocFormatted()) )
    r.AddSubDoc( m.GetResult() );
  r.AddElem( "save", sOutPath );
  m.Save(sOutPath);
  r.AddSubDoc( m.GetResult() );
  return r;
}

Basically this script just loads the input file, calls GetDocFormatted and saves to the output file (you could also add an argument to pass format flags). In addition, it returns the results so that if something goes awry you can see what happened in foxe_out.txt. If there is no problem, it might look like this:

<load>C:\in.xml</load>
<read encoding="UTF-8" length="31579"/>
<save>C:\out.xml</save>
<write encoding="UTF-8" length="29958"/>

If there are multiple functions in a script, you can name your entry function main to avoid confusion about which function is being called. You can also specify the entry function explicitly on the command line with :function at the end of the script filename. Specifying the function allows you to invoke multiple functions in one script from the command line. Here is an example of using the same script to perform different operations:

foxe -run C:\script.foal:extract London

foxe -run C:\script.foal:merge London "New York"

If you specify a function name that is not found in the script, there will be an indication at the bottom of foxe_err.txt that also indicates the entry point it would have used if you had not specified one:

<entry_point arg_count="2" not_found="gormatxml">formatxml</entry_point>

If you do not specify the function and there is no main function in the script, the last function with a matching number of arguments is called. It chooses the last matching function because functions earlier in the script tend to be subroutines since in FOAL you can only call functions below where they are defined.
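
The default selection rule just described can be restated as a short sketch. This is only a reading of the rule in C++, not foxe's actual code; the ScriptFunction struct and chooseDefaultEntry name are hypothetical, and whether main must also match the argument count is not stated above, so the sketch does not check it.

```cpp
#include <string>
#include <vector>

// Hypothetical description of a function found in a script, in file order.
struct ScriptFunction
{
    std::string name;
    int argCount;
};

// Returns the name of the default entry function when none is given on the
// command line: "main" if present, otherwise the last function whose
// parameter count matches the number of arguments. Empty string if none.
std::string chooseDefaultEntry(const std::vector<ScriptFunction>& functions,
                               int argCount)
{
    std::string lastMatch;
    for (const ScriptFunction& f : functions)
    {
        if (f.name == "main")
            return f.name;
        if (f.argCount == argCount)
            lastMatch = f.name;   // later matches override earlier ones
    }
    return lastMatch;
}
```

For a script containing a two-argument helper followed by formatxml, calling it with two arguments selects formatxml, which matches the entry_point value in the foxe_err.txt example above.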

comment posted Quoted Command Line Param Bug?

Tim Johnston 20-Jun-2011

It would seem there is a problem when running a FOXE script from the command line where the quoted parameter (such as a folder path) has a trailing slash, then the script fails... [this is a command console issue rather than a foxe issue, and he answered his own question in a subsequent email] It seems if you leave off the trailing slash and check for it/add it in the script it works fine. It also seems you can end it with a double back slash and it correctly escapes it - go figure :)

Incidentally, this failure doesn’t affect the return code of the process – I assume it would/should, but doing an echo %errorlevel% returns 0 after the failure... Also, would it be possible to make the FOXE command line not return back to the prompt before it is finished? i.e. make it "modal"? Basically when you run a script, control immediately returns back to the calling process and if the remaining batch file etc. is relying on the output of the FOXE script, then it is not there (as it's still processing) – or at least do you think you could add a command line option to not return until processing is complete?

Thanks for the help on escaping the trailing backslash in the command line. I think you should be able to make the foxe.exe call wait using the START /WAIT command to launch it. By default batch files wait, but windowing programs do not so you have to use the START command in that case.

Whitespace and CMarkup 20 Nov 2010 2:11 PM (14 years ago)

comment posted trimming white space

Marc Dyksterhouse 17-Mar-2010

Is there a way to have GetData or some other call return just the text of an element and not the whitespace around it? For example, can GetData return "text" in the following XML instead of " text\n"?

<item>
 text
</item>

I know I can just trim the returned string, but since whitespace isn't supposed to be pertinent in XML, I just thought the library should work this way. In the few cases where I need to preserve whitespace, I can use a CDATA encoding.

With release 11.3 you can set flags to trim whitespace or collapse whitespace when reading values from the document. CMarkup is unusual among XML tools because it simply preserves all whitespace, but now it can also support standard ways that XML and HTML processors alter whitespace.

Whitespace includes spaces, tabs, returns and newline characters. CMarkup has always preserved the whitespace as it appears in the document, and it still will. These new flags give you the option of reading the trimmed or collapsed text values, but the document is not altered, so you can turn off the flags and go back to reading the preserved whitespace.

Document Flag	Purpose
`MDF_TRIMWHITESPACE`	removes leading and trailing whitespace
`MDF_COLLAPSEWHITESPACE`	removes leading and trailing whitespace, but also replaces all segments of whitespace inside the text with a single space; so for example a newline and tab within the text will become a single space
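
For anyone who wants to see the two behaviors in isolation, here is a standalone sketch of trimming and collapsing as described in the table. The function names are hypothetical and this is not CMarkup's implementation; it only mirrors the described results on a std::string.

```cpp
#include <cstddef>
#include <string>

// Whitespace as described above: space, tab, return and newline.
bool isXmlSpace(char c)
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

// Removes leading and trailing whitespace.
std::string trimWhitespace(const std::string& s)
{
    std::size_t begin = 0, end = s.size();
    while (begin < end && isXmlSpace(s[begin])) ++begin;
    while (end > begin && isXmlSpace(s[end - 1])) --end;
    return s.substr(begin, end - begin);
}

// Trims, and also replaces each inner run of whitespace with one space.
std::string collapseWhitespace(const std::string& s)
{
    std::string out;
    bool pendingSpace = false;
    for (char c : s)
    {
        if (isXmlSpace(c))
        {
            pendingSpace = !out.empty();   // ignore leading whitespace
        }
        else
        {
            if (pendingSpace) out += ' ';
            out += c;
            pendingSpace = false;
        }
    }
    return out;   // a trailing run is never appended
}
```

With the element from the question, trimWhitespace("\n text\n") gives "text", and a value consisting only of whitespace comes back as an empty string from both functions.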

These flags affect CMarkup methods like GetData and GetAttrib that retrieve element data, text nodes, and attributes (but not methods like GetSubDoc and GetElemContent that return XML i.e. markup text).

These flags have no effect on text retrieved from CDATA Sections. With CMarkup you can create elements that contain CDATA Section text to protect the whitespace from ever being altered by CMarkup or any other XML tool:

xml.AddElem( "Prose", strProseText, CMarkup::MNF_WITHCDATA );

Turn the whitespace flags on and off anytime without performance penalty if for example you want to trim some values and not others. Use SetDocFlags to set these flags.

CMarkup m;
m.SetDocFlags( CMarkup::MDF_TRIMWHITESPACE );

You can OR a flag with GetDocFlags if you don't want to affect other flags:

m.SetDocFlags( m.GetDocFlags() | CMarkup::MDF_COLLAPSEWHITESPACE );

Turn off a flag without affecting others as follows:

m.SetDocFlags( m.GetDocFlags() & ~CMarkup::MDF_COLLAPSEWHITESPACE );
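
If you switch a flag on for just a few reads, the save, set and restore steps can be wrapped in a small guard object. The following is only a sketch: ScopedDocFlag and FakeDoc are hypothetical names, FakeDoc stands in for CMarkup so the example compiles on its own, and it assumes GetDocFlags returns an int and SetDocFlags accepts one, as the calls above suggest.

```cpp
// Turns a document flag on for the lifetime of the guard, then restores
// whatever flags were set before. Doc is any type that offers
// GetDocFlags() and SetDocFlags(int).
template <class Doc>
class ScopedDocFlag {
public:
    ScopedDocFlag(Doc& doc, int flag) : doc_(doc), saved_(doc.GetDocFlags()) {
        doc_.SetDocFlags(saved_ | flag);  // OR keeps the other flags intact
    }
    ~ScopedDocFlag() { doc_.SetDocFlags(saved_); }
    ScopedDocFlag(const ScopedDocFlag&) = delete;
    ScopedDocFlag& operator=(const ScopedDocFlag&) = delete;

private:
    Doc& doc_;
    int saved_;
};

// Hypothetical stand-in for CMarkup, used only to keep this self-contained.
struct FakeDoc {
    int flags = 0;
    int GetDocFlags() const { return flags; }
    void SetDocFlags(int f) { flags = f; }
};
```

In real code the guard would presumably be declared as ScopedDocFlag<CMarkup> guard( m, CMarkup::MDF_TRIMWHITESPACE ); inside the block that reads the trimmed values.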

These whitespace flags can affect values returned by GetData, GetAttrib and related methods. They also affect methods like FindElem that search for a path specifying a value in a path attribute predicate (see Paths In CMarkup) because values from the document will be trimmed or collapsed before being compared to the specified value.

How to generate file names with XML splitter script 20 Nov 2010 2:10 PM (14 years ago)

An example of how to use the free firstobject XML editor to split XML and then name the output files based on information in the pieces separated by the XML splitter script. Maybe this will be useful to other NGOs who need to split their XML.

comment posted XML Splitter

Dita Ciulacu 01-Jul-2009

I am desperately searching for an XML splitter that generates the file name using values from a child field. Is there any way to have the file named this way:

xmlOutput.Open( "test" + "_" + [Child value from REFERRAL_ID] + "_" + nFileCount + ".xml", MDF_WRITEFILE );

My xml [not real data] is:

<REFERRAL_DISCHARGE>
  <FILE_VERSION>1.0</FILE_VERSION>
  <REFERRAL_ID>1234</REFERRAL_ID>
  <ORGANISATION_ID>ORG-5678</ORGANISATION_ID>
  <ORGANISATION_TYPE>005</ORGANISATION_TYPE>
  <EXTRACT_FROM_DATE_TIME>2009-06-01T00:00:00</EXTRACT_FROM_DATE_TIME>
  <EXTRACTED_DATE_TIME>2009-06-30T15:40:15</EXTRACTED_DATE_TIME>
  <TEAM_CODE>5555</TEAM_CODE>
  <EVENT_HCU_ID>XXX1234</EVENT_HCU_ID>
  <SEX>M</SEX>
  <DATE_OF_BIRTH>1900-05-05</DATE_OF_BIRTH>
  <REFERRAL_FROM>UN</REFERRAL_FROM>
  <START_DATE_TIME>2008-12-24T00:00:00</START_DATE_TIME>
</REFERRAL_DISCHARGE>

The parent is REFERRAL_DISCHARGE, I need the file name exactly how you have it plus the individual value from REFERRAL_ID to make it easy to link to the data included.

We are a not-for-profit organization and we have to report to the [New Zealand] Ministry of Health and our data is to be packed as individual xml files. We are not dealing with huge files (this one was only 316kb) and also they are relatively simple extracts, but I don’t know in the future... it may get more complicated.

For splitting an XML file less than 10MB into a lot of referral discharge files, this is the easiest way to do it:

split()
{
  CMarkup xmlInput, xmlSubDoc;
  xmlInput.Load( "input.xml" );
  int nFileCount = 0;
  while ( xmlInput.FindElem("//REFERRAL_DISCHARGE") )
  {
    ++nFileCount;
    xmlSubDoc.SetDoc( xmlInput.GetSubDoc() );
    str sID = xmlSubDoc.FindGetData( "//REFERRAL_ID" );
    str sFilename = "test_" + sID + "_"+ nFileCount + ".xml";
    WriteTextFile( sFilename, xmlSubDoc.GetDoc() );
  }
  return nFileCount;
}
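
Note that the script above relies on FOAL letting you append an int to a string with +. If you port it to compiled C++ with STL strings, that expression will not compile, so the count has to be converted first. A minimal sketch, where makeSplitFilename is a hypothetical helper name:

```cpp
#include <string>

// Builds "<prefix>_<id>_<count>.xml", for example "test_1234_1.xml".
std::string makeSplitFilename(const std::string& prefix,
                              const std::string& id, int count) {
    // std::to_string converts the counter; std::string has no operator+ for int.
    return prefix + "_" + id + "_" + std::to_string(count) + ".xml";
}
```

Called as makeSplitFilename("test", sID, nFileCount), it produces the same names as the FOAL expression.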

Splitting a huge file

For others who have really large files (especially over 100MB up to any number of gigabytes) use the XML reader mode which processes the source file on disk very efficiently. The only difference from the above script is opening the input file in read mode rather than loading it all into memory.

split()
{
  CMarkup xmlInput, xmlSubDoc;
  xmlInput.Open( "input.xml", MDF_READFILE );
  int nFileCount = 0;
  while ( xmlInput.FindElem("//REFERRAL_DISCHARGE") )
  {
    ++nFileCount;
    xmlSubDoc.SetDoc( xmlInput.GetSubDoc() );
    str sID = xmlSubDoc.FindGetData( "//REFERRAL_ID" );
    str sFilename = "test_" + sID + "_"+ nFileCount + ".xml";
    WriteTextFile( sFilename, xmlSubDoc.GetDoc() );
  }
  xmlInput.Close();
  return nFileCount;
}

A note about usage of the anywhere path. If you want to grab multiple pieces of data like xmlSubDoc.FindGetData("//REFERRAL_ID") remember that the // anywhere path starts from the current position. So if you're not sure about the order of the data you are grabbing, call xmlSubDoc.ResetPos() in between calls to FindGetData.
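
The effect of searching from the current position can be pictured with a toy model. This is not CMarkup and does not reproduce its path handling; ForwardCursor is a hypothetical class that holds a flat list of (tag, value) pairs and searches forward only, which is enough to show why the reset matters.

```cpp
#include <string>
#include <utility>
#include <vector>

// Toy model of a forward-only search over (tag, value) pairs.
class ForwardCursor {
public:
    explicit ForwardCursor(std::vector<std::pair<std::string, std::string>> f)
        : fields_(std::move(f)) {}

    // Search from the current position onward; return "" if not found.
    std::string findGetData(const std::string& tag) {
        for (std::size_t i = pos_; i < fields_.size(); ++i) {
            if (fields_[i].first == tag) {
                pos_ = i + 1;  // the position moves past the match
                return fields_[i].second;
            }
        }
        return "";
    }

    // Go back to the start so the next search sees every field again.
    void resetPos() { pos_ = 0; }

private:
    std::vector<std::pair<std::string, std::string>> fields_;
    std::size_t pos_ = 0;
};
```

In this model, asking for TEAM_CODE first and REFERRAL_ID second comes back empty for the second call, because REFERRAL_ID lies before the current position; after resetPos() it is found again.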

See also:

Split XML with XML editor script

Split XML file into smaller pieces

Video of XML splitter script for splitting XML files

C++ XML reader parses a very large XML file

CMarkup Open Method - file read mode

Parse huge XML file in C++

Export XML records with matching childset 20 Nov 2010 2:09 PM (14 years ago)

Conventional wisdom has you importing and exporting XML to and from a database in order to run queries and utilize data that is in XML. But with firstobject's free XML editor you can perform all sorts of operations rapidly and efficiently directly on the XML document. This example shows how to export subsets of records, query, tally and modify XML records in a real estate database XML file.

comment posted export records with matching childset

Eddie Wrenn 25-Jan-2010

What I have is a list of properties for sale nationwide, contained in a 1.5gb XML file (your program is the only one which seems to handle this with ease!) I'm looking for a way to make the editor export all the records which have a matching childset, in this case 'locality' (in this example, London). There's 100,000 listings so not a manual job!

I've been successful splitting the file into 100,000 separate files, named by the locality (using your tutorials). But patching them all together takes a long time, even if I automate it. A sample record below:

<listing key="1234567" status="active" updated="20090101T010101" type="residence">
  <title><![CDATA[Xyz Street, London]]></title>
  <supplementary-url><![CDATA[1234567.htm]]></supplementary-url>
  <description><![CDATA[AVAILABLE 01/01/2010. This
beautifully decorated place is situated on a quiet back
street of Xyz Garden in the heart of Xyz London.
The owners have refurbished to a particluarly high
standard paying exceptional attention to detail to the
overall finish and decoration. As the apartment is
situated on the Nth floor there are great views of
London giving the apartment excellent natural light.
Features available. We highly recommend a viewing.]]></description>
  <residence type="flat">
    <bedrooms><![CDATA[1]]></bedrooms>
    <bathrooms><![CDATA[1]]></bathrooms>
    <reception><![CDATA[yes]]></reception>
  </residence>
  <authority>
    <lease currency="GBP" term="private" visible="yes">
      <price term="weekly"><![CDATA[450]]></price>
    </lease>
  </authority>
  <address visible="yes">
    <country><![CDATA[GB]]></country>
    <subdivision><![CDATA[London]]></subdivision>
    <locality><![CDATA[London]]></locality>
    <postcode><![CDATA[AA1A 1AA]]></postcode>
    <road><![CDATA[Xyz Street]]></road>
  </address>
  <attachments>
    <photo title="" updated="20090101T010101" type="image/jpeg">
      <uri><![CDATA[1234567_354_255.jpg]]></uri>
    </photo>
    <photo title="" updated="20091118T201049" type="image/jpeg">
      <uri><![CDATA[23456789_354_255.jpg]]></uri>
    </photo>
    <photo title="" updated="20091118T201049" type="image/jpeg">
      <uri><![CDATA[34567890_354_255.jpg]]></uri>
    </photo>
  </attachments>
  <vendor>
    <name><![CDATA[Xyz Property Services]]></name>
    <phone><![CDATA[020 1234 5678]]></phone>
    <email><![CDATA[enquiries@xyz.example]]></email>
  </vendor>
</listing>

To find all the matching records on a huge file you do something like this: from the File menu select New Program, paste in the following script, and modify the input file pathname (note that for C++ syntax, use a double backslash for backslashes in the pathname).

pull_by_locality()
{
  str strSearch = "London";
  CMarkup xmlInput, xmlListing, xmlOutput;
  xmlInput.Open( "C:\\huge.xml", MDF_READFILE );
  while ( xmlInput.FindElem("//listing") )
  {
    xmlListing.SetDoc( xmlInput.GetSubDoc() );
    if ( xmlListing.FindGetData("//locality") == strSearch )
      xmlOutput.AddSubDoc( xmlListing.GetDoc() );
  }
  return xmlOutput.GetDoc();
}

To export the result document as London.xml:

xmlOutput.Save( strSearch + ".xml" );

To delete (or actually skip) records which are no longer required e.g. we want them if status is "active" but not if it is "inactive" or "sold":

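xmlListing.ResetPos();
// in the sample record, status is an attribute of the listing element
if ( xmlListing.FindElem("//listing") && xmlListing.GetAttrib("status") != "active" )
  ...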

To change an element tag name from title to topicname in the output, first add the new element with the same content, then remove the old one (this is the easiest way to make sure the new element goes into the same position as the removed one).

xmlListing.ResetPos();
if (xmlListing.FindElem("//title"))
{
  xmlListing.AddElem("topicname", xmlListing.GetData());
  xmlListing.FindPrevElem(); // title
  xmlListing.RemoveElem();
}

As for inputting the search string, FOAL scripts don't support dialogs yet. However, you can automate the process if you can put the search string in a file such as search.txt, which could be retrieved in the FOAL script with:

str s;
if ( ReadTextFile("C:\\search.txt", s) && StrLength(s) > 2 )
  s = StrMid( s, 0, StrLength(s)-2 ); // remove CRLF
str strSearch = s;
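
The snippet above assumes the file ends with exactly one CRLF. If the file might end with a bare LF, or with a space before the line break, a more forgiving approach is to strip all trailing whitespace. Here is a sketch in standard C++ (not FOAL); stripTrailingWhitespace is a hypothetical helper name.

```cpp
#include <string>

// Remove any trailing spaces, tabs, CR and LF, however many there are.
std::string stripTrailingWhitespace(const std::string& s) {
    const std::size_t last = s.find_last_not_of(" \t\r\n");
    // npos means the string was empty or consisted of whitespace only.
    return last == std::string::npos ? std::string() : s.substr(0, last + 1);
}
```

Spaces inside the value, as in "New York", are left alone because only the tail of the string is examined.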

In DOS, if you had a script named search.foal, then you could create a search.bat file as follows to let you type search London on the command line.

rem put the redirection first so echo does not write a trailing space
>C:\search.txt echo %1
"C:\Program Files\firstobject\foxe.exe" -run C:\search.foal

Here's an interesting diagnostic to count instances of each locality:

locality_tally()
{
  CMarkup xmlLocalities, xmlInput;
  xmlInput.Open( "huge.xml", MDF_READFILE );
  while ( xmlInput.FindElem("//locality") )
  {
    str sLoc = xmlInput.GetData();
    int n = 1;
    if ( xmlLocalities.RestorePos(sLoc) )
      n = StrToInt(xmlLocalities.GetAttrib("n")) + 1;
    else
    {
      xmlLocalities.ResetPos();
      xmlLocalities.AddElem("locality",sLoc);
      xmlLocalities.SavePos(sLoc);
    }
    xmlLocalities.SetAttrib("n", n);
  }
  return xmlLocalities;
}

Would yield a result like this:

<locality n="890">London</locality>
<locality n="431">Yorkshire</locality>
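
In the script above, SavePos and RestorePos act as a keyed lookup inside the xmlLocalities document. In a compiled C++ program the same tally could be kept in a std::map instead. A minimal sketch, with tallyValues as a hypothetical helper name and the locality strings assumed to have been read from the document already:

```cpp
#include <map>
#include <string>
#include <vector>

// Count how many times each value occurs; std::map keeps the keys sorted.
std::map<std::string, int> tallyValues(const std::vector<std::string>& values) {
    std::map<std::string, int> counts;
    for (const std::string& v : values)
        ++counts[v];  // operator[] creates a missing entry starting at 0
    return counts;
}
```

The map version loses the document order of first appearance that the XML result preserves, so which one to use depends on how you want the tally reported.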

comment posted how to clear the XML result

Eddie Wrenn 27-Jan-2010

Now I'm piggybacking "searches" on top of each other, so it will search for London, output them into a London file, then search for Yorkshire, and output that into a Yorkshire file. My problem is that the editor [script] will retain the results for London, and add them to the top of my Yorkshire file - is there a little code that will clear the internal memory before starting the next process?

xmlOutput.SetDoc("");

comment posted records that contain the State of Michigan

Grace 03-Feb-2014

I am new to XML and need some instructions. I have the same problem that one of your previous customers inquired about on your website to "Export XML records with matching childset." I am trying to get all the records that contain the State of Michigan on a separate file but it does not appear to be working for me. Below is an example of a record:

<Property>
<Description><![CDATA[Great Location at corner of Xyz.
Large older home, very charming! Note: Tenants pay 1/nth
of gas and electric. Water included.]]></Description>
<MinRent>1300</MinRent>
<MaxRent>1300</MaxRent>
<MarketingName/>
<Address>301 N test St</Address>
<City>Little Town</City>
<State>MI</State>
<Zip>49876</Zip>
<YearBuilt>0</YearBuilt>
<NumberUnits>7</NumberUnits>
<Latitude>12.3456789</Latitude>
<Longitude>-12.3456789</Longitude>
<AcceptsHcv>False</AcceptsHcv>
<PhoneNumber>(123) 456-7890</PhoneNumber>
<LastUpdated>1/30/2014 8:00:00 AM</LastUpdated>
<Amenity AmenityID="101" AmenityName="Parking"/>
<Amenity AmenityID="102" AmenityName="Unfurnished"/>
<Amenity AmenityID="103" AmenityName="Dishwasher"/>
<Amenity AmenityID="109" AmenityName="Garbage Disposal"/>
</Property>

The file is pretty big (225 MB).

Although for 225MB it is probably not necessary, the following script is written to handle extremely large input and output files by opening them in "file mode" (using MDF_READFILE for the input and MDF_WRITEFILE for the output). The "anywhere path" //Property is used to search the input document for Property records, and //State searches anywhere in the xmlRecord subdocument for the State. In the output window it shows you the count of records it searched and the number matched. If it shows 0 searched it is because your input does not contain any Property elements.

pull_by_State()
{
  str strSearch = "MI";
  int s = 0;
  int m = 0;
  CMarkup xmlInput, xmlRecord, xmlOutput;
  if (!xmlOutput.Open("C:\\test_" + strSearch + ".xml", MDF_WRITEFILE))
    return xmlOutput.GetResult();
  xmlOutput.AddElem("Search");
  xmlOutput.SetAttrib("criteria", strSearch);
  xmlOutput.IntoElem();
  if (!xmlInput.Open("C:\\test.xml", MDF_READFILE))
    return xmlInput.GetResult();
  while ( xmlInput.FindElem("//Property") )
  {
    ++s;
    xmlRecord.SetDoc( xmlInput.GetSubDoc() );
    if ( xmlRecord.FindGetData("//State") == strSearch )
    {
      xmlOutput.AddSubDoc(xmlRecord.GetDoc());
      ++m;
    }
  }
  xmlInput.Close();
  xmlOutput.Close();
  return "Searched " + s + " records, matched " + m;
}

See also:

Counting XML tag names and values with foal

firstobject Access Language

Format XML, indent align beautify clean up XML

Simple XML editor meets memory stick

Split XML with XML editor script

Tree customization in the firstobject XML editor

Video demo of editing RSS XML in the tree view of the free firstobject XML editor

Video of XML Editor format XML, customize treeview, and program

Split XML file into smaller pieces

Video of XML splitter script for splitting XML files

C++ XML reader parses a very large XML file

Parse huge XML file in C++

CMarkup GetNthAttrib Method 20 Nov 2010 2:07 PM (14 years ago)

bool CMarkup::GetNthAttrib( int n, MCD_STR& strName, MCD_STR& strValue ) const;

Call GetNthAttrib to get the string name and value of the Nth attribute in the main position element. The first attribute is 0, the second is 1, etc. If there is no current position or the current position node does not have the specified attribute, it returns false.

Similar to GetAttribName, this method lets you iterate through the attributes of an element or processing instruction. However, this is usually better because it provides the attribute value without an additional call to GetAttrib, and it returns a bool which is convenient for looping. For example:

MCD_STR strName, strAttrib;
int n = 0;
while ( xml.GetNthAttrib(n++, strName, strAttrib) )
{
  // do something with strName, strAttrib
}

GetNthAttrib also works when the main position is a processing instruction node with attributes. See Node Methods in CMarkup.

CMarkup XML Parser Performance 20 Nov 2010 2:04 PM (14 years ago)

Release 11.3 has made a leap in performance (e.g. from 39MB/s to 53MB/s* excluding file I/O), so it's a good time to post some data on the speed of CMarkup and to discuss XML parser performance issues. Here is a comparison of 11.3 with the previous release 11.2: raw parsing goes from 40000 to 54000 bytes per millisecond, and attribute parsing (the basis for attribute methods) goes from 5000 to 9000 b/ms (see also Attribute Method Performance).

Release	parse doc	parse attrib	create doc	create attrib	Units
CMarkup 11.2	40002	5175	12331	4754	b/ms
CMarkup 11.3	54042	9195	14394	6820	b/ms

Since these measurements do not involve disk I/O, the speeds are measured in character units per millisecond where the character unit is b for byte, w for word (2 bytes), and dw for double word (4 bytes), depending on the build and platform. In the first chart I include 2 parse tests and then 2 corresponding create tests.
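
To put the 11.2 and 11.3 rows in proportion, the relative gain for each test can be worked out directly from the table. The helper below is only illustrative arithmetic; relativeGain is a hypothetical name.

```cpp
// Relative improvement of `after` over `before`, as a fraction
// (0.35 means 35% faster).
double relativeGain(double before, double after) {
    return (after - before) / before;
}
```

For the parse document test, relativeGain(40002, 54042) is about 0.35, and for the parse attributes test, relativeGain(5175, 9195) is about 0.78.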

parse document	the core indicator of parsing speed; the document string is passed to SetDoc in memory and parsed, without loading the document from disk
parse attributes	loops through the document reading all attributes with GetAttribName and GetAttrib (the new GetNthAttrib method is a more efficient way to do this)
create document	builds a document using an AddElem and SetAttrib for each element; the document is not saved to disk, so there is no disk I/O in this measurement
create attributes	creates a document with up to 4 randomly selected attributes and values per element; the SetAttrib call occasionally overwrites an attribute

The reason for release 11.3 performance improvement

One of the most intensively used operations in the parser is determining whether a character is one of a set of characters. In 11.3 I replaced MCD_PSZCHR (strchr) with a lookup define which is an order of magnitude faster and yields a roughly 30% speed improvement in overall raw parser speed. The new lookup define only checks the bounds and then returns the offset in the array, where c is the character, f and l are the bounds (first and last) and s is the lookup array (a string):

#define x_ISONEOF(c,f,l,s) ((c>=f&&c<=l)?(int)(s[c-f]):0)

So, for example, a whitespace check uses x_ISONEOF and passes the bounds 9 and 32, and a lookup string array for the range between those bounds:

// classic whitespace " \t\n\r"
#define x_ISWHITESPACE(c) x_ISONEOF(c,9,32,\
  "\2\3\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1")

Another roughly 5% overall improvement was gained by replacing MCD_PSZNCMP (strncmp) with a simple speedy implementation of string compare.
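
The lookup technique behind x_ISONEOF can also be tried outside CMarkup. The sketch below restates it with functions instead of defines; the table is the same 24-entry string for character codes 9 to 32 shown above, while the names kSpaceTable, isOneOf and isWhitespace are made up for this example.

```cpp
// Lookup table for character codes 9 (tab) through 32 (space).
// A nonzero entry means the character is whitespace.
static const char kSpaceTable[] =
    "\2\3\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1";

// Bounds check first, then a single array read; nothing is scanned.
static int isOneOf(int c, int first, int last, const char* table) {
    return (c >= first && c <= last) ? table[c - first] : 0;
}

static bool isWhitespace(char c) {
    // Go through unsigned char so bytes of 0x80 and above never turn negative.
    return isOneOf(static_cast<unsigned char>(c), 9, 32, kSpaceTable) != 0;
}
```

A strchr-style check has to walk the set string on every call, whereas here the cost is two comparisons and one index, which is consistent with the kind of speedup described above.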

Comparing different builds of release 11.3

Build configuration makes a big difference in performance. See ANSI and Unicode files and C++ strings and non-Unicode text handling in CMarkup for discussions of string character set options.

Build	parse doc	parse attrib	create doc	create attrib	Units
MFC (UTF-8)	54042	9195	14394	6820	b/ms
STL (UTF-8)	55923	9193	11583	6061	b/ms
MFC MBCS	14424	3269	11084	3492	b/ms
STL MBCS	14783	3137	8636	3223	b/ms
MSXML6 MFC MBCS	3832	1762	1849	1347	b/ms
MFC WCHAR	57405	8607	14530	6594	w/ms
STL WCHAR	57780	8607	10744	5639	w/ms
MSXML6 MFC WCHAR	3950	1939	1963	1428	w/ms

Using Unicode (either UTF-8 or WCHAR) strings in memory is much more efficient than MBCS which utilizes Windows APIs to determine character boundaries according to the locale character set. MSXML is very slow due to the overhead of COM and is slightly faster in a WCHAR build which avoids conversion to and from COM's WCHAR-based strings.

File mode performance

Unlike the measurements above, the XML reader and XML writer measurements are all in bytes per millisecond regardless of build because they are based on the file I/O rather than the in-memory character unit size. The file is UTF-8, which means the MBCS and wide character builds have the extra penalty of character set conversion. The MBCS conversion can be done using the libc (stdlib.h) function wctomb (not using the Windows API).

Build	XML reader	XML writer	Units
MFC	15086	11528	b/ms
STL	13858	9540	b/ms
MFC WCHAR	10854	8757	b/ms
STL WCHAR	10717	7509	b/ms
MFC MBCS	11673	9846	b/ms
STL MBCS	10444	8137	b/ms
MFC MBCS libc	2231	2844	b/ms
STL MBCS libc	2155	2677	b/ms

* Measurements here are representative of the speed with my own sample data on a 1.7GHz 1GB Vista netbook. Running these tests twice in a row often gets slightly different results because they are affected by variations in CPU.

Attribute Method Performance 20 Nov 2010 2:01 PM (14 years ago)

Attribute parsing performance came up several times this year, and some significant improvements were made in CMarkup release 11.3.

On every call to an attribute method, CMarkup reparses the element's attributes up to the one being accessed. This can lead to poorer than expected performance in attribute-intensive code, i.e. code that repeatedly accesses or checks for many attributes. This is due to an original design trade-off: CMarkup does not store attribute indexes.

comment posted CMarkup - Attribute Query Speed

Cameron Dunn 23-Jun-2010

I've been very impressed with the speed of loading and parsing. However, I've hit one area which is surprisingly slow which I wanted to ask you about - XML attributes.

I'm loading about 3000 XML files, for a total of 99468934 bytes. I'm loading in the files myself and then passing the string to CMarkup. If I do that, and then loop down into every element in every file, it takes about 2 seconds (specifically, 1985ms), which I thought was pretty impressive.

However, if I do the same thing but also loop over every attribute on every element, it takes 8 seconds. I found this a bit surprising - obviously there's no additional file IO time or anything like that, it's all in string processing. The interface which CMarkup provides to access attributes is very string heavy - you need to get the attribute by name and then query the value using this name.

Is there a quicker way to loop over the XML attributes? I need the name and value for each attribute, but they can be in the order in which they occur in the file.

...I iterate the attributes for a single element with GetAttribName() and then call GetAttrib() to get their values.

CMarkup release 11.3 introduces a new method GetNthAttrib which is twice as efficient as GetAttribName combined with GetAttrib, and in addition attribute parsing is about twice as fast (see CMarkup XML Parser Performance). So, iterating the attributes in your case might be reduced from 6 seconds to 1.5 seconds.

I did design a solution to manage and reuse attribute indexes for the current element, but it was actually slower for a single attribute access and wasn't really fast enough to justify the added complexity. Another option would be to include attributes much like elements in CMarkup indexing, but I think that's too fundamental at this point. So I've chosen to remain with the original reparse design for the time being, and hopefully the 11.3 performance boost and new method will help out enough.

If you have intensive use of attributes, in some cases you might want to extract them with GetNthAttrib to an external map as a more efficient mechanism to access them repeatedly. You can even map them in a separate CMarkup object as elements using SavePos and then RestorePos to do the lookup.
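
As a rough sketch of the external map idea, the helper below copies all attributes of the current element into a std::map in one pass. It assumes an STL build where MCD_STR is std::string; collectAttribs is a hypothetical name, and FakeElem is a stand-in for CMarkup so the example compiles on its own.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Copy every attribute of the current element into a map so that later
// lookups do not reparse the element. Doc only needs a const member
// GetNthAttrib(int, std::string&, std::string&) returning bool.
template <class Doc>
std::map<std::string, std::string> collectAttribs(const Doc& doc) {
    std::map<std::string, std::string> attribs;
    std::string name, value;
    for (int n = 0; doc.GetNthAttrib(n, name, value); ++n)
        attribs[name] = value;
    return attribs;
}

// Hypothetical stand-in for CMarkup, used only to keep this self-contained.
struct FakeElem {
    std::vector<std::pair<std::string, std::string>> attrs;
    bool GetNthAttrib(int n, std::string& name, std::string& value) const {
        if (n < 0 || n >= static_cast<int>(attrs.size())) return false;
        name = attrs[n].first;
        value = attrs[n].second;
        return true;
    }
};
```

The map only pays off when the same attributes are read several times; for a single pass over the attributes, the GetNthAttrib loop on its own is already the cheaper option.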

Video demo of editing RSS XML in the tree view of the free firstobject XML editor 12 Jun 2010 2:02 PM (15 years ago)

See how to use the tree view to edit RSS (and any XML or HTML) in this screencast video demonstrating this new feature in the free firstobject XML editor release 2.4.1.

XML Editor format XML, customize treeview, and program 10 Oct 2009 7:00 PM (15 years ago)

This screencast video demonstrates the free firstobject XML editor, and how to format XML, customize the treeview, generate and step through a C++ style program.
