String Splitting in C

For a project I’ve been working on I needed to split a string into its component parts. There is `strtok`, which I find useless for pretty much any task. It is neither thread-safe nor re-entrant, which makes it impossible to parse two strings at once (for example, in nested loops). Another issue with `strtok` is that the parts are returned one at a time by repeated calls to the function, so the only way to know the number of parts is to loop until you’ve gone through all of them. You also can’t specify a maximum number of parts you want the string split into. Finally, `strtok` modifies the input string, which might not be desirable.
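To make two of these complaints concrete, here is a minimal sketch. The helper name `count_tokens_strtok` is my own invention for illustration; only `strtok` itself comes from the standard library. It counts tokens the only way `strtok` allows, by looping to the end, and the caller's buffer is chopped up as a side effect.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts tokens by calling strtok until it
 * returns NULL. buf must be writable, because strtok overwrites
 * each delimiter it consumes with '\0'. */
size_t count_tokens_strtok(char *buf, const char *delims)
{
    size_t n = 0;
    char  *tok;

    for (tok = strtok(buf, delims); tok != NULL; tok = strtok(NULL, delims))
        n++;
    return n;
}
```

After calling this on a buffer holding `a,b,c`, the buffer reads as just `a` when treated as a C string, because the first comma has been replaced by a terminator.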

To deal with the shortcomings of `strtok` I wrote a simple string splitting function. It duplicates the input string rather than modifying it, and it takes a length so it can split a substring. It also supports a maximum number of parts, in case I want to use it to partition a string.

#include <stdlib.h>
#include <string.h>

char **str_split(const char *in, size_t in_len, char delm, size_t *num_elm, size_t max)
{
    char   *parsestr;
    char   **out;
    size_t  cnt = 1;
    size_t  i;

    if (in == NULL || in_len == 0 || num_elm == NULL)
        return NULL;

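    parsestr = malloc(in_len+1);
    if (parsestr == NULL)
        return NULL;
    /* Copy only in_len bytes: in may be a substring with no terminator,
     * so reading in_len+1 bytes could run past the end of it. */
    memcpy(parsestr, in, in_len);
    parsestr[in_len] = '\0';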

    *num_elm = 1;
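    for (i=0; i<in_len; i++) {
        /* Check the cap first so that max == 1 is honored even when
         * the string starts with the delimiter. */
        if (max > 0 && *num_elm == max)
            break;
        if (parsestr[i] == delm)
            (*num_elm)++;
    }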

    out = malloc(*num_elm * sizeof(*out));
    if (out == NULL) {
        free(parsestr);
        return NULL;
    }
    out[0] = parsestr;
    for (i=0; i<in_len && cnt<*num_elm; i++) {
        if (parsestr[i] != delm)
            continue;

        /* Add the pointer to the array of elements */
        parsestr[i] = '\0';
        out[cnt] = parsestr+i+1;
        cnt++;
    }

    return out;
}

Before we start the actual splitting, we need to determine the number of elements. There will always be at least one element, because if there is no delimiter within the data the entire input is returned as the only element. Counting stops once `max` is reached, unless `max` is 0, in which case we count every element. This has to happen before the actual splitting takes place because we need to know how many elements to allocate in the output array.
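The counting pass can be pulled out and looked at on its own. This is only a sketch; `count_parts` is a hypothetical helper name that does not exist in the function above, and it checks the cap before looking at each character so that a `max` of 1 always yields one element.

```c
#include <stddef.h>

/* Hypothetical helper mirroring the counting pass: returns how many
 * elements splitting in[0..in_len) on delm would produce, capped at
 * max when max is non-zero. */
size_t count_parts(const char *in, size_t in_len, char delm, size_t max)
{
    size_t n = 1;   /* no delimiter still yields one element */
    size_t i;

    for (i = 0; i < in_len; i++) {
        if (max > 0 && n == max)
            break;              /* cap reached, stop counting */
        if (in[i] == delm)
            n++;
    }
    return n;
}
```

For example, `count_parts("a,b,c", 5, ',', 0)` gives 3, while passing a `max` of 2 gives 2.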

The data itself is copied into a new string that we’ll chop up (the code does this first, so the counting pass runs over the copy). We always add a NUL terminator to this string so we don’t have to worry about the last element in a split; since we could be splitting a substring, the data we’re copying may not already be NUL-terminated. Then we loop through the string again to pull out the elements. If there is only one element (no delimiter in the string), this loop will not run. That’s fine: the duplicated string has already been stored as the first element, and the loop only deals with the remaining elements.
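The copy step on its own might look like the sketch below. `dup_span` is a hypothetical name used only for illustration. The point is that it reads exactly `len` bytes from the source and writes the terminator itself, so the source does not need to be terminated at that position.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copies the first len bytes of in into a fresh
 * buffer and appends a terminator. Returns NULL if malloc fails.
 * The caller frees the result. */
char *dup_span(const char *in, size_t len)
{
    char *copy = malloc(len + 1);

    if (copy == NULL)
        return NULL;
    memcpy(copy, in, len);   /* never reads in[len] */
    copy[len] = '\0';
    return copy;
}
```

Calling `dup_span("hello,world", 5)` should hand back a freshly allocated `hello`.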

As we go through the string, each delimiter is changed into a NUL terminator, and a pointer to the character after it is stored as the start of the next element. Once the element count has been reached, any remaining delimiters are left alone, so with `max` set the last element holds the rest of the string. If the last data character in the string is the delimiter, then the next character is the string’s own NUL terminator. In this situation the last element in the array points at that terminator, so we end up with an empty element.
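Here is the splitting pass reduced to a sketch that works on a caller-supplied buffer and pointer array, so the trailing-delimiter case is easy to see. `split_in_place` and its `out_cap` parameter are hypothetical and not part of the function above.

```c
#include <stddef.h>

/* Hypothetical helper: splits the writable buffer buf in place on
 * delm. buf holds len data bytes followed by a terminator at
 * buf[len]. Stores at most out_cap element pointers in out and
 * returns how many were stored. */
size_t split_in_place(char *buf, size_t len, char delm,
                      char **out, size_t out_cap)
{
    size_t cnt = 0;
    size_t i;

    if (out_cap == 0)
        return 0;

    out[cnt++] = buf;               /* first element is the buffer itself */
    for (i = 0; i < len && cnt < out_cap; i++) {
        if (buf[i] != delm)
            continue;
        buf[i] = '\0';              /* end the current element */
        out[cnt++] = buf + i + 1;   /* next element starts after it */
    }
    return cnt;
}
```

Splitting a buffer holding `a,b,` on `,` yields three elements: `a`, `b`, and an empty string that points at the buffer's final terminator.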

What we’ve done is take a string and put NUL terminators throughout it. We’ve also created an array of pointers to the locations within the string after each terminator. The string itself is the first element in the array. This way we need only one `malloc` for all the string data instead of one for each substring. Because of this, the caller can’t free each part individually. Instead we need a separate function to free the memory allocated by the split function.

void str_split_free(char **in, size_t num_elm)
{
    if (in == NULL)
        return;
    if (num_elm != 0)
        free(in[0]);
    free(in);
}

There are two allocations behind the split array, so naturally there are only two deallocations: the first element in the array and the array itself. Don’t forget that the first element is the full string, and the rest of the array contains pointers to specific locations within that string.

The number of elements isn’t strictly needed, since a successful split always produces at least one element, but it’s an additional safety check to prevent mistakes.
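To tie it together, here is a sketch of how the two functions might be used to partition a `key=value` style line. So that it compiles on its own, it carries a condensed copy of `str_split` (with allocation checks added and the cap tested before each character) and of `str_split_free`. The `print_partition` helper is hypothetical and exists only for this example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Condensed copy of str_split so this sketch is self-contained. */
char **str_split(const char *in, size_t in_len, char delm,
                 size_t *num_elm, size_t max)
{
    char   *parsestr;
    char  **out;
    size_t  cnt = 1;
    size_t  i;

    if (in == NULL || in_len == 0 || num_elm == NULL)
        return NULL;
    if ((parsestr = malloc(in_len + 1)) == NULL)
        return NULL;
    memcpy(parsestr, in, in_len);
    parsestr[in_len] = '\0';

    /* Count elements, stopping at max when max is non-zero. */
    *num_elm = 1;
    for (i = 0; i < in_len && (max == 0 || *num_elm < max); i++)
        if (parsestr[i] == delm)
            (*num_elm)++;

    if ((out = malloc(*num_elm * sizeof(*out))) == NULL) {
        free(parsestr);
        return NULL;
    }
    out[0] = parsestr;
    for (i = 0; i < in_len && cnt < *num_elm; i++) {
        if (parsestr[i] != delm)
            continue;
        parsestr[i] = '\0';
        out[cnt++] = parsestr + i + 1;
    }
    return out;
}

void str_split_free(char **in, size_t num_elm)
{
    if (in == NULL)
        return;
    if (num_elm != 0)
        free(in[0]);
    free(in);
}

/* Hypothetical demo: splits line on the first '=' only (max of 2),
 * prints each part, and frees everything with a single call.
 * Returns the number of parts, or 0 if the split failed. */
size_t print_partition(const char *line)
{
    size_t  n = 0;
    size_t  i;
    char  **parts = str_split(line, strlen(line), '=', &n, 2);

    if (parts == NULL)
        return 0;
    for (i = 0; i < n; i++)
        printf("part %zu: \"%s\"\n", i, parts[i]);
    str_split_free(parts, n);
    return n;
}
```

Because `max` is 2, a line like `name=a=b` comes back as `name` and `a=b`: the second `=` is left untouched in the last element.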