Say you are like me: you have a strong background as a high-level language programmer. You enjoy the amenities provided by the strong, mature framework libraries that back up these languages, and you are so used to working with strings that to you they are just another primitive data type.
High-level languages like C# or Java provide a rich set of methods that manipulate strings in a large variety of ways. But then, all of a sudden, you find yourself programming in a low/mid-level language like C or C++, where strings are often handled as dynamically allocated arrays (through pointers) or as plain static arrays of char elements.
If you are like me, when programming in C/C++ you are going to look for a counterpart to a familiar function or method among the libraries the language provides, just like you used to do in a high-level language. So when I needed to split a string into sub-strings using a delimiter in C++, I did the obvious thing and checked the std::string documentation for a method that does this, without any success. I know I could have included the C header string.h (cstring in C++) and worked around the issue with the strtok() function, but a lot of people discourage its use, so I decided against it. I also know that third-party libraries like Boost provide this functionality, but that would make the program depend on a complete library just for a simple task that can be achieved with the core features of the language. Not to mention having to learn the other idioms and structures involved in using that function.
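For context, here is a minimal sketch of what the strtok() route could look like, and why it tends to be discouraged. The helper name split_with_strtok is hypothetical, made up for this illustration; only std::strtok itself comes from the standard library.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not part of any library): tokenizes a copy of
// `input` using std::strtok.
std::vector<std::string> split_with_strtok(const std::string& input, const char* delims)
{
    // strtok overwrites each delimiter it finds with '\0', so it needs a
    // writable, null-terminated buffer; we work on a copy of the input.
    std::vector<char> buffer(input.begin(), input.end());
    buffer.push_back('\0');

    std::vector<std::string> tokens;
    // strtok remembers its position in hidden static state between calls,
    // which is why it is not reentrant.
    for (char* tok = std::strtok(buffer.data(), delims); tok != nullptr;
         tok = std::strtok(nullptr, delims))
    {
        tokens.push_back(tok);
    }
    return tokens;
}
```

Besides modifying its input and keeping hidden state, strtok treats a run of consecutive delimiters as a single one, so an empty field such as the middle one in "a,,b" is silently dropped.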
You may ask yourself: is this really an issue? Why on earth would you want to split a string? Well, it is common practice to concatenate different data tokens using a special character as a delimiter when sending data over the network. This way you can pack all the data that makes up a network packet into a single string instead of being forced to send the different data subsets in separate packets. Remember that resources are not infinite, and this way we do more with fewer operations.
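The sending side of that idea can be sketched like this. The function name join is an assumption made for this example; it simply glues the tokens together with the delimiter between them.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: concatenates tokens into one string, placing
// `delim` between consecutive tokens (but not before the first or after the last).
std::string join(const std::vector<std::string>& tokens, const std::string& delim)
{
    std::string result;
    for (std::vector<std::string>::size_type i = 0; i < tokens.size(); ++i)
    {
        if (i > 0)
            result += delim; // separator only between tokens
        result += tokens[i];
    }
    return result;
}
```

Note that this sketch does no escaping: if a token itself contains the delimiter, the receiver cannot tell the difference, so the delimiter has to be a character that never appears in the data.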
A common example of this is CSV (comma-separated values), a widely used format for data interchange. Since the elements of arrays and matrices are commonly separated by commas when written out in programming languages, it is a natural way of packing data, and it makes the content easy to understand from context. Comma-separated values usually look like this:
- Raul,27,Santo Domingo
- 56000,256,This is a text message
Note that there is no specific order or rule for what goes first or last; this is left to the programmer's criteria and may vary for every program that generates this kind of data. There are also other common practices for data representation that use other delimiters, such as the tab character.
So the need to manipulate each data subset individually is what a split function addresses. The basic signature of a split function is: string array split(string of characters, char delimiter). In other words, we invoke a function that returns a data structure made up of the data tokens that were originally separated by a delimiter. So if the original string was "This is my, original string" and the delimiter is a comma, the resulting array would be {"This is my", " original string"}. Note that the space after the comma is kept as part of the second token.
The Code
To achieve this in C++, I wrote a simple function that returns a vector of strings (std::vector is an array-like container). The source code for the function definition goes like this:
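#include <string>
#include <vector>
using std::string;
using std::vector;

// Splits str on every occurrence of delim; empty fields are preserved.
vector<string> split(const string& str, const string& delim)
{
    // An empty delimiter would never advance the search, so return the input whole.
    if (delim.empty()) return vector<string>(1, str);
    // Use size_type, not unsigned: npos may not fit in an unsigned on 64-bit platforms.
    string::size_type start = 0;
    string::size_type end;
    vector<string> v;
    while ((end = str.find(delim, start)) != string::npos)
    {
        v.push_back(str.substr(start, end - start)); // token before the delimiter
        start = end + delim.length();                // skip past the delimiter
    }
    v.push_back(str.substr(start)); // whatever follows the last delimiter
    return v;
}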
The client code is as simple as: vector<string> v = split("string,separated,by,commas", ",");
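To make the behavior at the edges more concrete, here is a self-contained sketch. The split function is repeated only so the snippet compiles on its own, and print_tokens is a hypothetical helper added for this illustration.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Same find/substr approach as above, repeated so this snippet stands alone.
std::vector<std::string> split(const std::string& str, const std::string& delim)
{
    std::vector<std::string> v;
    if (delim.empty()) // guard: an empty delimiter would loop forever
    {
        v.push_back(str);
        return v;
    }
    std::string::size_type start = 0;
    std::string::size_type end;
    while ((end = str.find(delim, start)) != std::string::npos)
    {
        v.push_back(str.substr(start, end - start));
        start = end + delim.length();
    }
    v.push_back(str.substr(start));
    return v;
}

// Hypothetical helper: prints each token in brackets so empty ones are visible.
void print_tokens(const std::vector<std::string>& tokens)
{
    for (std::vector<std::string>::size_type i = 0; i < tokens.size(); ++i)
        std::cout << '[' << tokens[i] << "]\n";
}
```

With this version, split("a,,b", ",") yields three tokens with an empty one in the middle, split("a--b--c", "--") shows that the delimiter may be longer than one character, and split("", ",") yields a single empty token. Whether empty tokens should be kept or dropped depends on the data format, so treat this as one reasonable choice rather than the only one.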
So that's it: a simple way to split strings based on a delimiter. Enjoy!