Ruby Hacking Guide

Translated by Vincent ISAMBART

Chapter 2: Objects

Structure of Ruby objects

Guideline

From this chapter, we will begin actually exploring the ruby source code. First, as declared at the beginning of this book, we’ll start with the object structure.

What are the necessary conditions for objects to be objects? There are many ways to explain what an object is, but only three conditions are truly indispensable.

The ability to differentiate itself from other objects (an identity)
The ability to respond to messages (methods)
The ability to store internal state (instance variables)

In this chapter, we are going to confirm these three features one by one.

The target file is mainly ruby.h, but we will also briefly look at other files such as object.c, class.c or variable.c.

`VALUE` and object struct

In ruby, the body of an object is expressed by a struct and always handled via a pointer. A different struct type is used for each class, but the pointer type will always be VALUE (figure 1).

Here is the definition of VALUE:

▼ VALUE

typedef unsigned long VALUE;

(ruby.h)

In practice, when using a VALUE, we cast it to a pointer to the appropriate object struct. Therefore, if unsigned long and pointers have different sizes, ruby will not work well. Strictly speaking, it will not work if there is a pointer type bigger than sizeof(unsigned long). Fortunately, systems that fail to meet this requirement are rare nowadays, although it seems there used to be quite a few of them.

The structs, on the other hand, come in several variations: a different struct is used depending on the class of the object.

`struct`	variation
`struct RObject`	all things for which none of the following applies
`struct RClass`	class object
`struct RFloat`	floating-point numbers
`struct RString`	string
`struct RArray`	array
`struct RRegexp`	regular expression
`struct RHash`	hash table
`struct RFile`	`IO`, `File`, `Socket`, etc…
`struct RData`	all the classes defined at C level, except the ones mentioned above
`struct RStruct`	Ruby’s `Struct` class
`struct RBignum`	big integers

For example, for a string object, struct RString is used, so we will have something like the following.

Let’s look at the definition of a few object structs.

▼ Examples of object struct

      /* struct for ordinary objects */
struct RObject {
    struct RBasic basic;
    struct st_table *iv_tbl;
};

      /* struct for strings (instance of String) */
struct RString {
    struct RBasic basic;
    long len;
    char *ptr;
    union {
        long capa;
        VALUE shared;
    } aux;
};

      /* struct for arrays (instance of Array) */
struct RArray {
    struct RBasic basic;
    long len;
    union {
        long capa;
        VALUE shared;
    } aux;
    VALUE *ptr;
};

(ruby.h)

Before looking at every one of them in detail, let’s begin with something more general.

First, as VALUE is defined as unsigned long, it must be cast before it can be used as a pointer. That’s why an Rxxxx() macro has been made for each object struct. For example, for struct RString there is RSTRING(), for struct RArray there is RARRAY(), and so on. These macros are used like this:

VALUE str = ....;
VALUE arr = ....;
RSTRING(str)->len;   /* ((struct RString*)str)->len */
RARRAY(arr)->len;    /* ((struct RArray*)arr)->len */

Another important point to mention is that all object structs start with a member basic of type struct RBasic. As a result, if you cast this VALUE to struct RBasic*, you will be able to access the content of basic, regardless of the type of struct pointed to by VALUE.

Because it is purposefully designed this way, struct RBasic must contain very important information for Ruby objects. Here is the definition for struct RBasic:

▼ struct RBasic

struct RBasic {
    unsigned long flags;
    VALUE klass;
};

(ruby.h)

flags are multipurpose flags, mostly used to register the struct type (for instance struct RObject). The type flags are named T_xxxx, and can be obtained from a VALUE using the macro TYPE(). Here is an example:

VALUE str;
str = rb_str_new("", 0);   /* creates an empty Ruby string (its struct is RString) */
TYPE(str);             /* the return value is T_STRING */

All the type flags are named T_xxxx, like T_STRING for struct RString and T_ARRAY for struct RArray. They correspond very straightforwardly to the struct names.
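The idea that every struct begins with the same header, and that the header alone tells us the struct type, can be sketched outside of ruby. The following is only an illustration: all the toy_ names, the type numbers and the 0x3f mask are assumptions made for this sketch, not definitions quoted from ruby.h.

```c
#include <assert.h>

typedef unsigned long toy_value;          /* stands in for VALUE */

/* made-up type numbers; the real T_xxxx values are not shown in the text */
enum { TOY_T_OBJECT = 1, TOY_T_STRING = 2, TOY_T_ARRAY = 3 };
#define TOY_T_MASK 0x3fUL                 /* assumption: low bits hold the type */

struct toy_basic  { unsigned long flags; toy_value klass; };
struct toy_string { struct toy_basic basic; long len; char *ptr; };
struct toy_array  { struct toy_basic basic; long len; toy_value *ptr; };

/* Hide a struct pointer inside the integer type, as a VALUE does. */
static toy_value toy_wrap(void *obj)
{
    return (toy_value)obj;
}

/* Every struct starts with toy_basic, so the header can be read
   without knowing which struct the value really points to. */
static int toy_type(toy_value v)
{
    assert(v != 0);
    return (int)(((struct toy_basic *)v)->flags & TOY_T_MASK);
}
```

Given a toy_string whose flags were set to TOY_T_STRING, toy_type(toy_wrap(&s)) yields TOY_T_STRING even though toy_type() never mentions struct toy_string. That is the whole point of putting the header first.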

The other member of struct RBasic, klass, contains the class this object belongs to. As the klass member is of type VALUE, what is stored is (a pointer to) a Ruby object. In short, it is a class object.

The relation between an object and its class will be detailed in the “Methods” section of this chapter.

By the way, this member is named klass so as not to conflict with the reserved word class when the file is processed by a C++ compiler.

About struct types

I said that the type of struct is stored in the flags member of struct RBasic. But why do we have to store the type of struct? It’s to be able to handle all different types of struct via VALUE. If you cast a pointer to a struct to VALUE, the type information is lost, so the compiler won’t be able to help. Therefore we have to manage the type ourselves. That’s the consequence of being able to handle all the struct types in a unified way.

OK, but the struct used is determined by the class, so why are the struct type and the class stored separately? Being able to find the struct type from the class should be enough. There are two reasons for not doing this.

The first one is that (I’m sorry for contradicting what I said before) there are in fact structs that do not have a struct RBasic (i.e. they have no klass member). One example is struct RNode, which will appear in the second part of the book. However, even in special structs like this, flags is guaranteed to be among the first members. So if you put the type of struct in flags, all the object structs can be differentiated in one unified way.

The second reason is that there is no one-to-one correspondence between class and struct. For example, all the instances of classes defined at the Ruby level use struct RObject, so finding a struct from a class would require keeping track of the correspondence between each class and struct. That’s why it’s easier and faster to put the information about the type in the struct.

The use of `basic.flags`

Regarding the use of basic.flags, I would feel bad leaving it at “the struct type and such”, so I’ll illustrate all of it here (figure 5). There is no need to understand everything right away; this is meant for the time when you wonder about it later.

Looking at the diagram, it seems that 21 bits are unused on 32-bit machines. On these additional bits, the flags FL_USER0 to FL_USER8 are defined, and they are used for a different purpose in each struct. In the diagram I also put FL_USER0 (FL_SINGLETON) as an example.

Objects embedded in `VALUE`

As I said, VALUE is an unsigned long. Since a VALUE is a pointer, it may look like void* would do just as well, but there is a reason for not using it: a VALUE is not always a pointer. The 6 cases in which a VALUE is not a pointer are the following:

small integers
symbols
true
false
nil
Qundef

I’ll explain them one by one.

Small integers

All data are objects in Ruby, thus integers are also objects. But since there are so many different integer objects, expressing each of them as a struct would risk slowing down execution significantly. For example, when incrementing from 0 to 50000, we would hesitate to create 50000 objects for that purpose alone.

That’s why in ruby, integers that are small enough are treated specially and embedded directly into VALUE. “Small” means signed integers that can be held in sizeof(VALUE)*8-1 bits. In other words, on 32-bit machines, the integers have 1 bit for the sign and 30 bits for the integer part. Integers in this range belong to the Fixnum class, and the other integers belong to the Bignum class.

Let’s see in practice the INT2FIX() macro that converts from a C int to a Fixnum, and confirm that Fixnum are directly embedded in VALUE.

▼ INT2FIX

#define INT2FIX(i) ((VALUE)(((long)(i))<<1 | FIXNUM_FLAG))
#define FIXNUM_FLAG 0x01

(ruby.h)

In brief, shift 1 bit to the left, and bitwise or it with 1.

` 110100001000`	before conversion
`1101000010001`	after conversion

That means a Fixnum as a VALUE is always an odd number. On the other hand, as Ruby object structs are allocated with malloc(), they are generally placed at addresses that are multiples of 4. So they do not overlap with the values of Fixnum as VALUE.

Also, to convert int or long to VALUE, we can use macros like INT2NUM() or LONG2NUM(). Any conversion macro XXXX2XXXX with a name containing NUM can manage both Fixnum and Bignum. For example if INT2NUM() can’t convert an integer into a Fixnum, it will automatically convert it to Bignum. NUM2INT() will convert both Fixnum and Bignum to int. If the number can’t fit in an int, an exception will be raised, so there is no need to check the value range.
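To make the encoding concrete, here is a small stand-alone sketch of the shift-and-tag trick, together with the kind of range test a NUM-style conversion needs before choosing between Fixnum and Bignum. The toy_ names and the two limit macros are hypothetical; the sketch also does its shifts on the unsigned type, which differs from the INT2FIX() listing above but keeps negative inputs well defined in portable C.

```c
#include <assert.h>
#include <limits.h>

typedef unsigned long toy_value;          /* stands in for VALUE */

#define TOY_FIXNUM_FLAG 0x01UL
#define TOY_FIXNUM_MAX  (LONG_MAX >> 1)
#define TOY_FIXNUM_MIN  (-(LONG_MAX >> 1) - 1)

/* Does i fit in the bits left over after the tag bit? */
static int toy_fixable(long i)
{
    return i >= TOY_FIXNUM_MIN && i <= TOY_FIXNUM_MAX;
}

/* Shift left by one and set the low bit. */
static toy_value toy_int2fix(long i)
{
    assert(toy_fixable(i));
    return ((toy_value)i << 1) | TOY_FIXNUM_FLAG;
}

/* An odd value is an embedded integer, never an address. */
static int toy_fixnum_p(toy_value v)
{
    return (v & TOY_FIXNUM_FLAG) != 0;
}

/* Undo the encoding: an arithmetic shift right by one, spelled out
   without relying on how the platform shifts negative numbers. */
static long toy_fix2long(toy_value v)
{
    const toy_value sign = ~(~(toy_value)0 >> 1);   /* top bit only */
    assert(toy_fixnum_p(v));
    if (v & sign)
        return -(long)(~v >> 1) - 1;
    return (long)(v >> 1);
}
```

With the bit pattern from the table above, toy_int2fix(3336) gives 6673: 3336 is 110100001000 in binary and 6673 is 1101000010001. A caller that cannot guarantee the range would test toy_fixable() first and fall back to some big-integer representation otherwise.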

Symbols

What are symbols?

As this question is quite troublesome to answer, let’s start with the reasons why symbols were necessary. In the first place, there’s a type named ID used inside ruby. Here it is.

▼ ID

typedef unsigned long ID;

(ruby.h)

This ID is a number having a one-to-one association with a string. However, it’s not possible to have an association between all strings in this world and numerical values. It is limited to the one to one relationships inside one ruby process. I’ll speak of the method to find an ID in the next chapter “Names and name tables.”

In a language processor, there are a lot of names to handle: method names, variable names, constant names, file names, class names… It’s troublesome to handle all of them as strings (char*), because of memory management and memory management and memory management… Also, lots of comparisons would certainly be necessary, but comparing strings character by character would slow down execution. That’s why strings are not handled directly; something else is associated with them and used instead. And generally that “something” is an integer, as integers are the simplest to handle.
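The name-to-number association described here can be sketched in a few lines. This is not how ruby finds an ID (that is the subject of the next chapter); it is a deliberately naive table with a linear search, and toy_intern(), toy_id2name() and the size limits are all invented for the illustration.

```c
#include <assert.h>
#include <string.h>

typedef unsigned long toy_id;             /* stands in for ID */

#define TOY_MAX_NAMES 64
#define TOY_MAX_LEN   32

static char   toy_names[TOY_MAX_NAMES][TOY_MAX_LEN];
static toy_id toy_name_count = 0;

/* Return the number tied to name, registering it on first use.
   IDs start at 1; 0 means the fixed-size toy table cannot hold it. */
static toy_id toy_intern(const char *name)
{
    toy_id i;
    for (i = 0; i < toy_name_count; i++)
        if (strcmp(toy_names[i], name) == 0)
            return i + 1;
    if (toy_name_count == TOY_MAX_NAMES || strlen(name) >= TOY_MAX_LEN)
        return 0;
    strcpy(toy_names[toy_name_count], name);
    return ++toy_name_count;
}

/* The reverse direction: from number back to name. */
static const char *toy_id2name(toy_id id)
{
    assert(id >= 1 && id <= toy_name_count);
    return toy_names[id - 1];
}
```

The string comparison happens only once, when a name is interned. After that, asking whether two names are the same is a single integer comparison, and the association only holds inside the running process, just as the text says.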

These ID are found as symbols in the Ruby world. Up to ruby 1.4, the values of ID converted to Fixnum were used as symbols. Even today these values can be obtained using Symbol#to_i. However, as real use results came piling up, it was understood that making Fixnum and Symbol the same was not a good idea, so since 1.6 an independent class Symbol has been created.

Symbol objects are used a lot, especially as keys for hash tables. That’s why Symbol, like Fixnum, was made embedded in VALUE. Let’s look at the ID2SYM() macro converting ID to Symbol object.

▼ ID2SYM

#define SYMBOL_FLAG 0x0e
#define ID2SYM(x) ((VALUE)(((long)(x))<<8|SYMBOL_FLAG))

(ruby.h)

Shifting 8 bits to the left makes x a multiple of 256, and therefore a multiple of 4. After the bitwise or with 0x0e (14 in decimal; in this case it is the same as adding), the VALUE expressing the symbol is not a multiple of 4. Nor is it an odd number. So it does not overlap the range of any other VALUE. Quite a clever trick.

Finally, let’s see the reverse conversion of ID2SYM(), SYM2ID().

▼ SYM2ID()

#define SYM2ID(x) RSHIFT((long)x,8)

(ruby.h)

RSHIFT is a bit shift to the right. As a right shift may or may not preserve the sign depending on the platform, it was made a macro.
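The round trip between an ID and its symbol value can be checked with a tiny sketch. The toy_ names are hypothetical, and toy_symbol_p() is simply my own way of testing the low byte. Because the sketch keeps everything unsigned, the sign question that RSHIFT exists to answer does not arise here.

```c
#include <assert.h>

typedef unsigned long toy_value;          /* stands in for VALUE */
typedef unsigned long toy_id;             /* stands in for ID */

#define TOY_SYMBOL_FLAG 0x0eUL

static toy_value toy_id2sym(toy_id id)
{
    /* the top 8 bits of id must be free, or the shift would drop them */
    assert(id == ((id << 8) >> 8));
    return (id << 8) | TOY_SYMBOL_FLAG;
}

/* A symbol is recognized by its low byte alone. */
static int toy_symbol_p(toy_value v)
{
    return (v & 0xffUL) == TOY_SYMBOL_FLAG;
}

static toy_id toy_sym2id(toy_value v)
{
    assert(toy_symbol_p(v));
    return v >> 8;                        /* unsigned shift */
}
```

For instance, toy_id2sym(5) is 5 * 256 + 14 = 1294. That value is even, so it cannot be mistaken for a Fixnum, and 1294 leaves a remainder of 2 when divided by 4, so it cannot be a 4-byte-aligned address either.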

`true false nil`

These three are Ruby special objects. true and false represent the boolean values. nil is an object used to denote that there is no object. Their values at the C level are defined like this:

▼ true false nil

#define Qfalse 0        /* Ruby's false */
#define Qtrue  2        /* Ruby's true */
#define Qnil   4        /* Ruby's nil */

(ruby.h)

This time the values are even numbers, but as addresses as small as 0 or 2 can’t be used by pointers, they can’t overlap with other VALUE. That’s because the first block of virtual memory is usually not allocated, so that programs dereferencing a NULL pointer crash.

And as Qfalse is 0, it can also be used as false at C level. In practice, in ruby, when a function returns a boolean value, it’s often made to return an int or VALUE, and returns Qtrue/Qfalse.

For Qnil, there is a macro dedicated to check if a VALUE is Qnil or not, NIL_P().

▼ NIL_P()

#define NIL_P(v) ((VALUE)(v) == Qnil)

(ruby.h)

The name ending with p is a notation coming from Lisp denoting that it is a function returning a boolean value. In other words, NIL_P means “is the argument nil?” It seems the “p” character comes from “predicate.” This naming rule is used at many different places in ruby.

Also, in Ruby, false and nil are false (in conditional statements) and all the other objects are true. However, in C, nil (Qnil) is true. That’s why there’s the RTEST() macro to do Ruby-style test in C.

▼ RTEST()

#define RTEST(v) (((VALUE)(v) & ~Qnil) != 0)

(ruby.h)

As in Qnil only the third lower bit is 1, in ~Qnil only the third lower bit is 0. Then only Qfalse and Qnil become 0 with a bitwise and.

!=0 has been added to be certain to only have 0 or 1, to satisfy the requirements of the glib library that only wants 0 or 1 ([ruby-dev:11049].)
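The difference between C truth and Ruby truth is easy to try out. The sketch below re-creates the three constants and the two tests under hypothetical toy_ names; the values 0, 2 and 4 are the ones from the listing above.

```c
#include <assert.h>

typedef unsigned long toy_value;          /* stands in for VALUE */

#define TOY_QFALSE ((toy_value)0)
#define TOY_QTRUE  ((toy_value)2)
#define TOY_QNIL   ((toy_value)4)

static int toy_nil_p(toy_value v)
{
    return v == TOY_QNIL;
}

/* Clearing the Qnil bit maps both 0 (false) and 4 (nil) to 0;
   every other value keeps at least one bit set. */
static int toy_rtest(toy_value v)
{
    return (v & ~TOY_QNIL) != 0;
}

/* A plain C test such as `if (v)` would treat nil as true. */
static void toy_rtest_demo(void)
{
    assert(TOY_QNIL != 0);                /* C truth: nil looks true */
    assert(!toy_rtest(TOY_QNIL));         /* Ruby truth: nil is false */
    assert(!toy_rtest(TOY_QFALSE));
    assert(toy_rtest(TOY_QTRUE) == 1);    /* exactly 1, thanks to != 0 */
}
```

Note that the value 1, which is how the integer 0 is embedded, passes toy_rtest(): in Ruby, unlike in C, the number zero is true.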

By the way, what is the ‘Q’ of Qnil? ‘R’ I would have understood but why ‘Q’? When I asked, the answer was “Because it’s like that in Emacs.” I did not have the fun answer I was expecting…

`Qundef`

▼ Qundef

#define Qundef 6                /* undefined value for placeholder */

(ruby.h)

This value is used to express an undefined value in the interpreter. It can’t (must not) be found at all at the Ruby level.
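Putting the six cases together, one can decide what a VALUE holds by looking only at its bits. The following classifier is a sketch with invented names (toy_classify(), enum toy_kind); it is not the code of TYPE() or rb_special_const_p(), only an illustration of why the encodings never collide.

```c
#include <assert.h>

typedef unsigned long toy_value;          /* stands in for VALUE */

enum toy_kind {
    TOY_POINTER, TOY_FIXNUM, TOY_SYMBOL,
    TOY_FALSE, TOY_TRUE, TOY_NIL, TOY_UNDEF
};

/* Decide what a VALUE-like word holds from its bit pattern. */
static enum toy_kind toy_classify(toy_value v)
{
    if (v & 0x01UL)             return TOY_FIXNUM;  /* odd: small integer */
    if ((v & 0xffUL) == 0x0eUL) return TOY_SYMBOL;  /* low byte is 0x0e */
    switch (v) {
      case 0: return TOY_FALSE;
      case 2: return TOY_TRUE;
      case 4: return TOY_NIL;
      case 6: return TOY_UNDEF;
    }
    return TOY_POINTER;         /* anything else is taken to be an address */
}

/* True when the word does not point to a struct. */
static int toy_special_const_p(toy_value v)
{
    return toy_classify(v) != TOY_POINTER;
}

static void toy_classify_demo(void)
{
    assert(toy_classify(201) == TOY_FIXNUM);   /* 201 encodes the integer 100 */
    assert(toy_special_const_p(4));            /* nil is not a pointer */
}
```

The order of the tests does not matter much, because the patterns are disjoint: odd values, values whose low byte is 0x0e, and the four small even constants.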

Methods

I already brought up the three important points of a Ruby object: having an identity, being able to call a method, and keeping data for each instance. In this section, I’ll explain in a simple way the structure linking objects and methods.

`struct RClass`

In Ruby, classes exist as objects during the execution. Of course. So there must be a struct for class objects. That struct is struct RClass. Its struct type flag is T_CLASS.

As classes and modules are very similar, there is no need to differentiate their content. That’s why modules also use the struct RClass struct, and are differentiated by the T_MODULE struct flag.

▼ struct RClass

struct RClass {
    struct RBasic basic;
    struct st_table *iv_tbl;
    struct st_table *m_tbl;
    VALUE super;
};

(ruby.h)

First, let’s focus on the m_tbl (Method TaBLe) member. struct st_table is a hash table used everywhere in ruby. Its details will be explained in the next chapter “Names and name tables”, but basically, it is a table mapping names to objects. In the case of m_tbl, it keeps the correspondence between the names (ID) of the methods possessed by this class and the method entities themselves. As for the structure of the method entity, it will be explained in Part 2 and Part 3.

The fourth member super keeps, like its name suggests, the superclass. As it’s a VALUE, it’s (a pointer to) the class object of the superclass. In Ruby there is only one class that has no superclass (the root class): Object.

However, I already said that all Object methods are defined in the Kernel module, and that Object just includes it. As modules are functionally similar to multiple inheritance, having only super may seem problematic, but ruby performs some clever conversions to make it look like single inheritance. The details of this process will be explained in the fourth chapter “Classes and modules.”

Because of this conversion, the super of Object’s struct points to the struct RClass that is the entity of the Kernel object, and the super of Kernel is NULL. To put it conversely, if super is NULL, that RClass is the entity of Kernel (figure 6).

Methods search

With classes structured like this, you can easily imagine the method call process. The m_tbl of the object’s class is searched, and if the method was not found, the m_tbl of super is searched, and so on. If there is no more super, that is to say the method was not found even in Object, then it must not be defined.

The sequential search process in m_tbl is done by search_method().

▼ search_method()

static NODE*
search_method(klass, id, origin)
    VALUE klass, *origin;
    ID id;
{
    NODE *body;
    if (!klass) return 0;
    while (!st_lookup(RCLASS(klass)->m_tbl, id, &body)) {
        klass = RCLASS(klass)->super;
        if (!klass) return 0;
    }
    if (origin) *origin = klass;
    return body;
}

(eval.c)

This function searches the method named id in the class object klass.

RCLASS(value) is the macro doing:

((struct RClass*)(value))

st_lookup() is a function that searches in st_table the value corresponding to a key. If the value is found, the function returns true and puts the found value at the address given in third parameter (&body).

Nevertheless, doing this search each time whatever the circumstances would be too slow. That’s why in reality, once called, a method is cached. So starting from the second time it will be found without following super one by one. This cache and its search will be seen in the 15th chapter “Methods.”
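The uncached walk up the super chain can be tried with toy types. In this sketch the method table is a plain array instead of an st_table, method bodies are ordinary C functions, and every name (toy_class, toy_search_method(), the sample IDs and classes) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long toy_id;             /* stands in for ID */
typedef int (*toy_body)(void);            /* stands in for a method body */

struct toy_method { toy_id name; toy_body body; };

struct toy_class {
    const struct toy_method *m_tbl;       /* a plain array, not a hash table */
    size_t                   m_count;
    const struct toy_class  *super;       /* NULL at the top of the chain */
};

/* Look in one class only. */
static toy_body toy_lookup(const struct toy_class *klass, toy_id id)
{
    size_t i;
    assert(klass != NULL);
    for (i = 0; i < klass->m_count; i++)
        if (klass->m_tbl[i].name == id)
            return klass->m_tbl[i].body;
    return NULL;
}

/* Walk up the super chain until some class defines id.
   On success *origin (if given) receives the defining class. */
static toy_body toy_search_method(const struct toy_class *klass, toy_id id,
                                  const struct toy_class **origin)
{
    toy_body body = NULL;
    while (klass && !(body = toy_lookup(klass, id)))
        klass = klass->super;
    if (body && origin) *origin = klass;
    return body;
}

/* Sample chain: toy_c -> toy_object -> toy_kernel -> NULL. */
enum { ID_PUTS = 1, ID_TO_S = 2, ID_HELLO = 3, ID_MISSING = 4 };
static int body_puts(void)  { return 10; }
static int body_to_s(void)  { return 20; }
static int body_hello(void) { return 30; }

static const struct toy_method kernel_methods[] = { { ID_PUTS,  body_puts  } };
static const struct toy_method object_methods[] = { { ID_TO_S,  body_to_s  } };
static const struct toy_method c_methods[]      = { { ID_HELLO, body_hello } };

static const struct toy_class toy_kernel = { kernel_methods, 1, NULL };
static const struct toy_class toy_object = { object_methods, 1, &toy_kernel };
static const struct toy_class toy_c      = { c_methods,      1, &toy_object };
```

Searching ID_PUTS from toy_c fails in toy_c and toy_object and succeeds in toy_kernel, which is reported through origin. Searching ID_MISSING runs off the end of the chain and returns NULL, the toy equivalent of “the method is not defined”.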

Instance variables

In this section, I will explain the implementation of the third essential condition, instance variables.

`rb_ivar_set()`

Instance variables are the mechanism that allows each object to hold its own data. Since the data is specific to each object, it seems good to store it in the object itself (i.e. in its object struct), but is it really so? Let’s look at the function rb_ivar_set(), which assigns an object to an instance variable.

▼ rb_ivar_set()

      /* assign val to the id instance variable of obj */
VALUE
rb_ivar_set(obj, id, val)
    VALUE obj;
    ID id;
    VALUE val;
{
    if (!OBJ_TAINTED(obj) && rb_safe_level() >= 4)
        rb_raise(rb_eSecurityError,
                       "Insecure: can't modify instance variable");
    if (OBJ_FROZEN(obj)) rb_error_frozen("object");
    switch (TYPE(obj)) {
      case T_OBJECT:
      case T_CLASS:
      case T_MODULE:
        if (!ROBJECT(obj)->iv_tbl)
                  ROBJECT(obj)->iv_tbl = st_init_numtable();
        st_insert(ROBJECT(obj)->iv_tbl, id, val);
        break;
      default:
        generic_ivar_set(obj, id, val);
        break;
    }
    return val;
}

(variable.c)

rb_raise() and rb_error_frozen() are both error checks. This can always be said hereafter: Error checks are necessary in reality, but it’s not the main part of the process. Therefore, we should wholly ignore them at first read.

After removing the error handling, only the switch remains. Its form,

switch (TYPE(obj)) {
  case T_aaaa:
  case T_bbbb:
     ...
}

is an idiom of ruby. TYPE() is the macro returning the type flag of the object struct (T_OBJECT, T_STRING, etc.). As the type flag is an integer constant, we can branch on it with a switch. Fixnum and Symbol do not have structs, but TYPE() handles them specially so that T_FIXNUM and T_SYMBOL are properly returned, so there’s no need to worry.

Well, let’s go back to rb_ivar_set(). It seems only the treatments of T_OBJECT, T_CLASS and T_MODULE are different. These 3 have been chosen on the basis that their second member is iv_tbl. Let’s confirm it in practice.

▼ Structs whose second member is iv_tbl

      /* TYPE(val) == T_OBJECT */
struct RObject {
    struct RBasic basic;
    struct st_table *iv_tbl;
};

      /* TYPE(val) == T_CLASS or T_MODULE */
struct RClass {
    struct RBasic basic;
    struct st_table *iv_tbl;
    struct st_table *m_tbl;
    VALUE super;
};

(ruby.h)

iv_tbl is the Instance Variable TaBLe. It records the correspondences between the instance variable names and their values.

In rb_ivar_set(), let’s look again at the code for the structs having iv_tbl.

if (!ROBJECT(obj)->iv_tbl)
    ROBJECT(obj)->iv_tbl = st_init_numtable();
st_insert(ROBJECT(obj)->iv_tbl, id, val);
break;

ROBJECT() is a macro that casts a VALUE into a struct RObject*. It’s possible that what obj points to is actually a struct RClass, but when accessing only the second member, no problem will occur.
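Why accessing only the second member is harmless can be verified mechanically: the member sits at the same offset in both structs. The sketch below uses hypothetical toy_ structs with the same member order as the listings above, and reaches the slot through an offset rather than through a cast to the other struct type.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long toy_value;          /* stands in for VALUE */
struct toy_table;                         /* opaque stand-in for st_table */

struct toy_basic  { unsigned long flags; toy_value klass; };
struct toy_object { struct toy_basic basic; struct toy_table *iv_tbl; };
struct toy_class  { struct toy_basic basic; struct toy_table *iv_tbl;
                    struct toy_table *m_tbl; toy_value super; };

/* Reach the iv_tbl slot of either struct through one code path.
   This is only sound because both structs begin with the same
   two members, which the assertion double-checks. */
static struct toy_table **toy_iv_slot(void *obj)
{
    assert(offsetof(struct toy_object, iv_tbl) ==
           offsetof(struct toy_class, iv_tbl));
    return (struct toy_table **)
           ((char *)obj + offsetof(struct toy_object, iv_tbl));
}
```

Whether it is handed a toy_object or a toy_class, toy_iv_slot() returns the address of that struct’s own iv_tbl member. If someone reordered the members of one struct, the assertion would fail instead of silently corrupting memory.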

st_init_numtable() is a function creating a new st_table. st_insert() is a function doing associations in a st_table.

In conclusion, this code does the following: if iv_tbl does not exist, it creates it, then stores the [variable name → object] association.

There’s one thing to be careful about. As struct RClass is the struct of a class object, its instance variable table is for the class object itself. In Ruby programs, it corresponds to something like the following:

class C
  @ivar = "content"
end

`generic_ivar_set()`

What happens when assigning to an instance variable of an object whose struct type is not one of T_OBJECT, T_MODULE or T_CLASS?

▼ rb_ivar_set() in the case there is no iv_tbl

default:
  generic_ivar_set(obj, id, val);
  break;

(variable.c)

This is delegated to generic_ivar_set(). Before looking at this function, let’s first explain its general idea.

Structs that are not T_OBJECT, T_MODULE or T_CLASS do not have an iv_tbl member (the reason why will be explained later). However, even without that member, an object can have instance variables if there is another way of linking it to a struct st_table. In ruby, this link is made through a global st_table, generic_iv_tbl (figure 7).

Let’s see this in practice.

▼ generic_ivar_set()

static st_table *generic_iv_tbl;

static void
generic_ivar_set(obj, id, val)
    VALUE obj;
    ID id;
    VALUE val;
{
    st_table *tbl;
          /* for the time being you can ignore this */
    if (rb_special_const_p(obj)) {
        special_generic_ivar = 1;
    }
          /* initialize generic_iv_tbl if it does not exist */
    if (!generic_iv_tbl) {
        generic_iv_tbl = st_init_numtable();
    }
          /* the process itself */
    if (!st_lookup(generic_iv_tbl, obj, &tbl)) {
        FL_SET(obj, FL_EXIVAR);
        tbl = st_init_numtable();
        st_add_direct(generic_iv_tbl, obj, tbl);
        st_add_direct(tbl, id, val);
        return;
    }
    st_insert(tbl, id, val);
}

(variable.c)

rb_special_const_p() is true when its parameter is not a pointer. However, as this if part requires knowledge of the garbage collector, we’ll skip it for now. I’d like you to check it again after reading the chapter 5 “Garbage collection.”

st_init_numtable() already appeared some time ago. It creates a new hash table.

st_lookup() searches a value corresponding to a key. In this case it searches for what’s attached to obj. If an attached value can be found, the whole function returns true and stores the value at the address (&tbl) given as third parameter. In short, !st_lookup(...) can be read “if a value can’t be found.”

st_insert() was also already explained. It stores a new association in a table.

st_add_direct() is similar to st_insert(), but it does not check whether the key is already stored before adding an association. This means that if st_add_direct() is given a key that is already registered, two associations linked to the same key will be stored. We can use st_add_direct() only when the check for existence has already been done, or when a new table has just been created. This code meets these requirements.

FL_SET(obj, FL_EXIVAR) is the macro that sets the FL_EXIVAR flag in the basic.flags of obj. The basic.flags flags are all named FL_xxxx and can be set using FL_SET(). These flags can be unset with FL_UNSET(). The EXIVAR from FL_EXIVAR seems to be the abbreviation of EXternal Instance VARiable.

This flag is set to speed up the reading of instance variables. If FL_EXIVAR is not set, even without searching in generic_iv_tbl, we can see the object does not have any instance variables. And of course a bit check is way faster than searching a struct st_table.
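The combination of a global side table and a “has external variables” bit can be sketched as follows. To stay short, the sketch flattens ruby’s table of tables into a single array of (object, name, value) entries and uses a linear search. All names are hypothetical, and the bit position chosen for TOY_FL_EXIVAR is arbitrary.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long toy_id;             /* stands in for ID */
typedef long          toy_val;            /* stands in for a stored VALUE */

#define TOY_FL_EXIVAR (1UL << 9)          /* arbitrary bit for this sketch */

struct toy_obj { unsigned long flags; };  /* an object with no iv_tbl member */

struct toy_entry { const struct toy_obj *obj; toy_id id; toy_val val; };

#define TOY_MAX_ENTRIES 32
static struct toy_entry toy_generic_tbl[TOY_MAX_ENTRIES];  /* global side table */
static size_t           toy_generic_len = 0;
static unsigned long    toy_table_searches = 0;            /* for the demo only */

static struct toy_entry *toy_find(const struct toy_obj *obj, toy_id id)
{
    size_t i;
    toy_table_searches++;
    for (i = 0; i < toy_generic_len; i++)
        if (toy_generic_tbl[i].obj == obj && toy_generic_tbl[i].id == id)
            return &toy_generic_tbl[i];
    return NULL;
}

/* Store val under (obj, id) and mark obj as having external variables. */
static void toy_generic_ivar_set(struct toy_obj *obj, toy_id id, toy_val val)
{
    struct toy_entry *e = toy_find(obj, id);
    if (!e) {
        assert(toy_generic_len < TOY_MAX_ENTRIES);
        e = &toy_generic_tbl[toy_generic_len++];
        e->obj = obj;
        e->id  = id;
        obj->flags |= TOY_FL_EXIVAR;
    }
    e->val = val;
}

/* Returns 1 and fills *out if the variable is set. An object whose
   flag is clear is answered without touching the table at all. */
static int toy_generic_ivar_get(const struct toy_obj *obj, toy_id id,
                                toy_val *out)
{
    const struct toy_entry *e;
    if (!(obj->flags & TOY_FL_EXIVAR)) return 0;
    e = toy_find(obj, id);
    if (!e) return 0;
    *out = e->val;
    return 1;
}
```

The counter toy_table_searches makes the benefit visible: reading a variable from an object that never had one set does not increment it, because the flag test answers the question first.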

Gaps in structs

Now you understand how instance variables are stored, but why are there structs without iv_tbl? Why is there no iv_tbl in struct RString or struct RArray? Couldn’t iv_tbl be part of RBasic?

To state the conclusion first, we could do such a thing, but we should not. As a matter of fact, this problem is deeply linked to the way ruby manages objects.

In ruby, the memory used for string data (char[]) and the like is allocated directly with malloc(). However, the object structs are handled in a particular way: ruby allocates them in clusters, and then hands them out from these clusters. If the types (or rather their sizes) were diverse, this would be hard to manage, so RVALUE, the union of all the structs, is defined and an array of these unions is managed.

The size of a union is the same as the size of the biggest member, so for instance, if one of the structs is big, a lot of space would be wasted. Therefore, it’s preferable that each struct size is as similar as possible.

The most used struct is probably struct RString. After that, depending on the program, come struct RArray (array), RHash (hash), RObject (user-defined object), etc. However, struct RObject only uses the space of struct RBasic + 1 pointer. On the other hand, struct RString, RArray and RHash take the space of struct RBasic + 3 pointers. In other words, as the number of struct RObject increases, the space of two pointers is wasted for each of them. Furthermore, if the size of RString were as much as 4 pointers, RObject would use less than half the size of the union, which is too wasteful.
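The size argument can be checked with sizeof. The sketch below rebuilds two of the structs under hypothetical toy_ names, puts them in a union the way RVALUE is described, and also defines a variant of the string struct that carries its own iv_tbl, to show what that would cost.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long toy_value;          /* stands in for VALUE */
struct toy_table;                         /* opaque stand-in for st_table */

struct toy_basic  { unsigned long flags; toy_value klass; };
struct toy_object { struct toy_basic basic; struct toy_table *iv_tbl; };
struct toy_string { struct toy_basic basic; long len; char *ptr;
                    union { long capa; toy_value shared; } aux; };

/* Every slot in a cluster has the size of the biggest member. */
union toy_rvalue {
    struct toy_object object;
    struct toy_string string;
};

/* Bytes left unused when a slot holds a toy_object. */
static size_t toy_slack_for_object(void)
{
    assert(sizeof(union toy_rvalue) >= sizeof(struct toy_string));
    return sizeof(union toy_rvalue) - sizeof(struct toy_object);
}

/* What a string struct would look like if every struct had iv_tbl:
   it grows by one pointer, and the union would grow with it. */
struct toy_string_with_iv { struct toy_basic basic; struct toy_table *iv_tbl;
                            long len; char *ptr;
                            union { long capa; toy_value shared; } aux; };
```

On a typical machine where long and pointers have the same size, toy_slack_for_object() comes out as the size of two pointers, matching the count in the text. The exact byte figures depend on the platform, so none are quoted here.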

The benefit of having iv_tbl in the struct is no more than a slight saving of memory and a slight speedup, and we do not even know whether the feature is used often. In fact, generic_iv_tbl was not introduced before ruby 1.2, so it was not possible to use instance variables in String or Array at that time. Nevertheless, it was not much of a problem. Making large amounts of memory useless just for such functionality looks stupid.

If you take all this into consideration, you can conclude that increasing the size of object structs for iv_tbl does not do any good.

`rb_ivar_get()`

We saw the rb_ivar_set() function that sets variables, so let’s see quickly how to get them.

▼ rb_ivar_get()

VALUE
rb_ivar_get(obj, id)
    VALUE obj;
    ID id;
{
    VALUE val;
    switch (TYPE(obj)) {
      /* (A) */
      case T_OBJECT:
      case T_CLASS:
      case T_MODULE:
        if (ROBJECT(obj)->iv_tbl &&
                  st_lookup(ROBJECT(obj)->iv_tbl, id, &val))
            return val;
        break;
      /* (B) */
      default:
        if (FL_TEST(obj, FL_EXIVAR) || rb_special_const_p(obj))
            return generic_ivar_get(obj, id);
        break;
    }
      /* (C) */
    rb_warning("instance variable %s not initialized", rb_id2name(id));
    return Qnil;
}

(variable.c)

The structure is exactly the same as that of rb_ivar_set().

(A) For struct RObject or RClass, we search the variable in iv_tbl. As iv_tbl can also be NULL, we must check it before using it. Then if st_lookup() finds the relation, it returns true, so the whole if can be read as “If the instance variable has been set, return its value.”

(C) If no correspondence could be found, in other words if we read an instance variable that has not been set, we first leave the if then the switch. rb_warning() will then issue a warning and nil will be returned. That’s because you can read instance variables that have not been set in Ruby.

(B) On the other hand, if the struct is neither struct RObject nor RClass, the instance variable table is searched in generic_iv_tbl. What generic_ivar_get() does can be easily guessed, so I won’t explain it. I’d rather want you to focus on the condition of the if statement.

I already told you that the FL_EXIVAR flag is set on the object on which generic_ivar_set() is used. Here, that flag is used to make the check faster.

And what is rb_special_const_p()? This function returns true when its parameter obj does not point to a struct. As no struct means no basic.flags, no flag can be set in the first place. Thus the FL_xxxx() macros are designed to always return false for such objects. Hence, objects for which rb_special_const_p() is true must be treated specially here.

Object Structs

In this section, we’ll briefly look at the concrete appearance of the most important object structs and how to deal with them.

`struct RString`

struct RString is the struct for the instances of the String class and its subclasses.

▼ struct RString

struct RString {
    struct RBasic basic;
    long len;
    char *ptr;
    union {
        long capa;
        VALUE shared;
    } aux;
};

(ruby.h)

ptr is a pointer to the string, and len the length of that string. Very straightforward.

Ruby’s string is less a string than a byte array, and can contain any byte including NUL. So at the Ruby level, ending the string with NUL does not mean anything. But as C functions require NUL, the terminating NUL is there for convenience. However, it is not counted in len.

When dealing with a string from the interpreter or an extension library, you can access ptr and len by writing RSTRING(str)->ptr or RSTRING(str)->len, and it is allowed. But there are some points to pay attention to.

you have to check if str really points to a struct RString by yourself beforehand
you can read the members, but you must not modify them
you can’t store RSTRING(str)->ptr in something like a local variable and use it later

Why is that? First, there is an important software engineering principle: Don’t arbitrarily tamper with someone’s data. When there are interface functions, we should use them. However, there are also concrete reasons in ruby’s design why you should not refer to or store a pointer, and that’s related to the fourth member aux. However, to explain properly how to use aux, we have to explain first a little more of Ruby’s strings’ characteristics.

Ruby’s strings can be modified (are mutable). By mutable I mean after the following code:

s = "str"        # create a string and assign it to s
s.concat("ing")  # append "ing" to this string object
p(s)             # show "string"

the content of the object pointed to by s will become “string”. This is different from Java or Python string objects. Java’s StringBuffer is closer.

And what’s the relation? First, mutable means the length (len) of the string can change. We have to increase or decrease the allocated memory size each time the length changes. We can of course use realloc() for that, but generally malloc() and realloc() are heavy operations. Having to realloc() each time the string changes is a huge burden.

That’s why the memory pointed to by ptr is allocated with a size a little bigger than len. Because of that, if the added part fits into the remaining memory, it’s taken care of without calling realloc(), so it’s faster. The struct member aux.capa contains the length including this additional memory.
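The effect of keeping spare capacity can be seen in a small growable buffer. This is a sketch under my own assumptions, not ruby’s string code: the struct toy_str, the doubling growth policy and the reallocs counter are all invented for the demonstration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct toy_str {
    size_t len;               /* bytes in use, not counting the trailing NUL */
    size_t capa;              /* bytes available before the trailing NUL */
    char  *ptr;
    unsigned long reallocs;   /* for the demo: how often we had to grow */
};

/* Allocate room for capa bytes plus the NUL. Returns 0 on failure. */
static int toy_str_init(struct toy_str *s, size_t capa)
{
    s->ptr = malloc(capa + 1);
    if (!s->ptr) return 0;
    s->ptr[0] = '\0';
    s->len = 0;
    s->capa = capa;
    s->reallocs = 0;
    return 1;
}

/* Append n bytes. realloc() is called only when the spare room is
   used up, and then we ask for more than strictly needed. */
static int toy_str_cat(struct toy_str *s, const char *src, size_t n)
{
    assert(s->ptr != NULL);
    if (s->len + n > s->capa) {
        size_t new_capa = (s->len + n) * 2;   /* growth policy is arbitrary */
        char *p = realloc(s->ptr, new_capa + 1);
        if (!p) return 0;
        s->ptr = p;
        s->capa = new_capa;
        s->reallocs++;
    }
    memcpy(s->ptr + s->len, src, n);
    s->len += n;
    s->ptr[s->len] = '\0';
    return 1;
}

static void toy_str_free(struct toy_str *s)
{
    free(s->ptr);
    s->ptr = NULL;
    s->len = s->capa = 0;
}
```

Starting with a capacity of 8, appending “str” and then “ing” fits in the spare room, so reallocs stays at 0. Only an append that overflows the capacity triggers a realloc(). As in the text, len never counts the terminating NUL.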

So what is this other aux.shared? It’s to speed up the creation of literal strings. Have a look at the following Ruby program.

while true do  # repeat indefinitely
  a = "str"        # create a string with "str" as content and assign it to a
  a.concat("ing")  # append "ing" to the object pointed by a
  p(a)             # show "string"
end

Whatever the number of times you repeat the loop, the fourth line’s p has to show "string". To do so, the expression "str" must create, every time, an object that holds a distinct char[]. But it is also quite likely that strings are not modified at all, in which case a lot of useless copies of char[] would be created. If possible, we’d like to share one common char[].

The trick to share is aux.shared. Every string object created with a literal uses one shared char[]. And after a change occurs, the object-specific memory is allocated. When using a shared char[], the flag ELTS_SHARED is set in the object struct’s basic.flags, and aux.shared contains the original object. ELTS seems to be the abbreviation of ELemenTS.
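The share-until-modified idea can be sketched on its own. Note the differences from the real struct: this toy keeps a separate shared pointer instead of a union plus the ELTS_SHARED flag, it shares the literal’s bytes directly rather than an original string object, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct toy_str {
    size_t      len;
    char       *ptr;      /* private buffer, or NULL while sharing */
    const char *shared;   /* the literal's bytes, or NULL once unshared */
};

/* The bytes currently visible through this string. */
static const char *toy_bytes(const struct toy_str *s)
{
    assert(s->shared || s->ptr);
    return s->shared ? s->shared : s->ptr;
}

/* Evaluating a literal: no copy is made, the new object just
   points at the literal's bytes. */
static struct toy_str toy_str_literal(const char *lit)
{
    struct toy_str s;
    s.len = strlen(lit);
    s.ptr = NULL;
    s.shared = lit;
    return s;
}

/* Called before any change: take a private copy if still sharing. */
static int toy_str_modify(struct toy_str *s)
{
    char *copy;
    if (!s->shared) return 1;
    copy = malloc(s->len + 1);
    if (!copy) return 0;
    memcpy(copy, s->shared, s->len + 1);
    s->ptr = copy;
    s->shared = NULL;
    return 1;
}

/* Append a C string, unsharing first. Returns 0 on allocation failure. */
static int toy_str_cat(struct toy_str *s, const char *src)
{
    size_t n = strlen(src);
    char *p;
    if (!toy_str_modify(s)) return 0;
    p = realloc(s->ptr, s->len + n + 1);
    if (!p) return 0;
    memcpy(p + s->len, src, n + 1);
    s->ptr = p;
    s->len += n;
    return 1;
}
```

Two strings made from the same literal see the same bytes at first. After toy_str_cat() on one of them, that one owns a private buffer holding “string”, while the other and the literal itself still read “str”. This is also why holding on to a raw ptr is dangerous: a later modification may move the bytes somewhere else.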

Then, let’s return to our talk about RSTRING(str)->ptr. Though referring to the pointer is OK, you must not assign to it. This is first because the value of len or capa would no longer agree with the actual body, and also because, when modifying strings created as literals, aux.shared has to be separated first.

Before ending this section, I’ll write some examples of dealing with RString. I’d like you to regard str as a VALUE that points to RString when reading this.

RSTRING(str)->len;               /* length */
RSTRING(str)->ptr[0];            /* first character */
str = rb_str_new("content", 7);  /* create a string with "content" as its content
                                    the second parameter is the length */
str = rb_str_new2("content");    /* create a string with "content" as its content
                                    its length is calculated with strlen() */
rb_str_cat2(str, "end");         /* Concatenate a C string to a Ruby string */

`struct RArray`

struct RArray is the struct for the instances of Ruby’s array class Array.

▼ struct RArray

struct RArray {
    struct RBasic basic;
    long len;
    union {
        long capa;
        VALUE shared;
    } aux;
    VALUE *ptr;
};

(ruby.h)

Except for the type of ptr, this structure is almost the same as struct RString. ptr points to the content of the array, and len is its length. aux is exactly the same as in struct RString: aux.capa is the “real” length of the memory pointed to by ptr, and if ptr is shared, aux.shared stores the shared original array object.

From this structure, it’s clear that Ruby’s Array is an array and not a list. So when the number of elements changes a lot, a realloc() must be done, and if an element is inserted anywhere other than the end, a memmove() will occur. But even so, it moves so fast that we don’t notice it. Recent machines are really impressive.
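As a rough illustration of what "array and not a list" costs, the following C sketch inserts into a contiguous buffer. It is a toy under stated assumptions: struct toy_ary and its functions are hypothetical, the elements are plain long values rather than VALUE, and doubling the capacity is an arbitrary policy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy array of long; NOT ruby's struct RArray. */
struct toy_ary {
    long len;   /* number of elements in use */
    long capa;  /* number of elements allocated */
    long *ptr;  /* contiguous element storage */
};

void toy_ary_init(struct toy_ary *a, long capa)
{
    assert(capa > 0);
    a->len = 0;
    a->capa = capa;
    a->ptr = malloc((size_t)capa * sizeof(long));
    assert(a->ptr != NULL);
}

/* Insert value at index pos (0 <= pos <= len). */
void toy_ary_insert(struct toy_ary *a, long pos, long value)
{
    assert(pos >= 0 && pos <= a->len);
    if (a->len == a->capa) {
        /* no spare room left: the whole buffer has to grow */
        long new_capa = a->capa * 2;
        long *p = realloc(a->ptr, (size_t)new_capa * sizeof(long));
        assert(p != NULL);
        a->ptr = p;
        a->capa = new_capa;
    }
    /* shift the tail one slot to the right; source and destination
       overlap, which is why memmove() and not memcpy() is needed */
    memmove(a->ptr + pos + 1, a->ptr + pos,
            (size_t)(a->len - pos) * sizeof(long));
    a->ptr[pos] = value;
    a->len++;
}
```

Appending at the end moves zero elements, while inserting at index 0 moves all of them. A linked list would avoid the move but lose constant-time indexing.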

Its members are accessed in much the same way as those of RString. You can refer to the members with RARRAY(arr)->ptr and RARRAY(arr)->len, and that is allowed, but you must not assign to them, and so on. We’ll only look at simple examples:

/* manage an array from C */
VALUE ary;
ary = rb_ary_new();             /* create an empty array */
rb_ary_push(ary, INT2FIX(9));   /* push a Ruby 9 */
RARRAY(ary)->ptr[0];            /* look what's at index 0 */
rb_p(RARRAY(ary)->ptr[0]);      /* do p on ary[0] (the result is 9) */

# manage an array from Ruby
ary = []      # create an empty array
ary.push(9)   # push 9
ary[0]        # look what's at index 0
p(ary[0])     # do p on ary[0] (the result is 9)

`struct RRegexp`

It’s the struct for the instances of the regular expression class Regexp.

▼ struct RRegexp

struct RRegexp {
    struct RBasic basic;
    struct re_pattern_buffer *ptr;
    long len;
    char *str;
};

(ruby.h)

ptr is the compiled regular expression. str is the string before compilation (the source code of the regular expression), and len is this string’s length.

As no code handling Regexp objects appears in this book, we won’t see how to use it. Even if you use it in extension libraries, the interface functions are enough unless you want to use it in a very particular way.

`struct RHash`

struct RHash is the struct for Hash objects, Ruby’s hash tables.

▼ struct RHash

struct RHash {
    struct RBasic basic;
    struct st_table *tbl;
    int iter_lev;
    VALUE ifnone;
};

(ruby.h)

It’s a wrapper for struct st_table. st_table will be detailed in the next chapter “Names and name tables.”

ifnone is the value returned when a key does not have an associated value; its default is nil. iter_lev is there to make the hash table reentrant (multithread safe).
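The two extra members can be pictured with a toy table in C. This is only a sketch under loud assumptions: struct toy_hash, the toy_*() functions, the fixed-size linear table and the "tombstone" handling are all hypothetical, and the way iter_lev is used here is one plausible use of such a counter. It does not claim to reproduce what hash.c does.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX 8  /* fixed capacity keeps the sketch short */

struct toy_entry {
    const char *key;
    int value;
    int deleted;   /* tombstone left by a delete made during iteration */
};

struct toy_hash {
    struct toy_entry entries[TOY_MAX];
    int n;         /* slots in use, tombstones included */
    int iter_lev;  /* how many iterations are currently running */
    int ifnone;    /* value handed back for a missing key */
};

static int toy_find(const struct toy_hash *h, const char *key)
{
    int i;
    for (i = 0; i < h->n; i++) {
        if (!h->entries[i].deleted && strcmp(h->entries[i].key, key) == 0)
            return i;
    }
    return -1;
}

void toy_hash_init(struct toy_hash *h, int ifnone)
{
    h->n = 0;
    h->iter_lev = 0;
    h->ifnone = ifnone;
}

void toy_hash_set(struct toy_hash *h, const char *key, int value)
{
    int i = toy_find(h, key);
    if (i < 0) {
        assert(h->n < TOY_MAX);
        i = h->n++;
        h->entries[i].key = key;
        h->entries[i].deleted = 0;
    }
    h->entries[i].value = value;
}

/* A missing key yields ifnone instead of an error. */
int toy_hash_get(const struct toy_hash *h, const char *key)
{
    int i = toy_find(h, key);
    return i < 0 ? h->ifnone : h->entries[i].value;
}

/* While an iteration is running, only mark the slot; otherwise compact. */
void toy_hash_delete(struct toy_hash *h, const char *key)
{
    int i = toy_find(h, key);
    if (i < 0) return;
    if (h->iter_lev > 0) {
        h->entries[i].deleted = 1;
    } else {
        h->entries[i] = h->entries[--h->n];
    }
}

/* Call fn for each live entry; sweep tombstones once no iteration is left. */
void toy_hash_foreach(struct toy_hash *h,
                      void (*fn)(struct toy_hash *, const char *, int))
{
    int i, j;
    h->iter_lev++;
    for (i = 0; i < h->n; i++) {
        if (!h->entries[i].deleted)
            fn(h, h->entries[i].key, h->entries[i].value);
    }
    if (--h->iter_lev == 0) {
        for (i = j = 0; i < h->n; i++) {
            if (!h->entries[i].deleted) h->entries[j++] = h->entries[i];
        }
        h->n = j;
    }
}

/* Sample callback: delete every entry whose value is odd. */
void toy_delete_odd(struct toy_hash *h, const char *key, int value)
{
    if (value % 2 != 0) toy_hash_delete(h, key);
}
```

Because the counter is incremented on entry and decremented on exit, nested iterations work too: the slots are only compacted when the outermost loop has finished.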

`struct RFile`

struct RFile is a struct for instances of the built-in IO class and its subclasses.

▼ struct RFile

struct RFile {
    struct RBasic basic;
    struct OpenFile *fptr;
};

(ruby.h)

▼ OpenFile

typedef struct OpenFile {
    FILE *f;                    /* stdio ptr for read/write */
    FILE *f2;                   /* additional ptr for rw pipes */
    int mode;                   /* mode flags */
    int pid;                    /* child's pid (for pipes) */
    int lineno;                 /* number of lines read */
    char *path;                 /* pathname for file */
    void (*finalize) _((struct OpenFile*)); /* finalize proc */
} OpenFile;

(rubyio.h)

All members have been moved into struct OpenFile. As there aren’t many instances of IO objects, it’s OK to do it like this. The purpose of each member is written in the comments. Basically, it’s a wrapper around C’s stdio.
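What "a wrapper around stdio" means can be shown with a very small C sketch. It is hypothetical and much reduced: struct toy_file keeps only two of the members listed above, and toy_gets() is an invented helper, not a function from ruby's io.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical cut-down version of OpenFile: a FILE* plus bookkeeping. */
struct toy_file {
    FILE *f;     /* stdio ptr for read/write */
    int lineno;  /* number of lines read */
};

/* Read one line into buf through stdio and keep lineno up to date.
   Returns 1 on success and 0 at end of file. */
int toy_gets(struct toy_file *fp, char *buf, int size)
{
    assert(fp->f != NULL && size > 1);
    if (fgets(buf, size, fp->f) == NULL) return 0;
    /* count the line once it is complete: either a newline was read,
       or the file ended without one */
    if (strchr(buf, '\n') != NULL || feof(fp->f)) fp->lineno++;
    return 1;
}
```

All the real work is done by fgets(). The wrapper only adds the state that stdio itself does not track.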

`struct RData`

struct RData has a different tenor from what we saw before. It is the struct for implementing extension libraries.

Of course structs for classes created in extension libraries are necessary, but as the types of these structs depend on the created class, it’s impossible to know their size or layout in advance. That’s why a “struct for managing a pointer to a user defined struct” has been created on ruby’s side to manage this. This struct is struct RData.

▼ struct RData

struct RData {
    struct RBasic basic;
    void (*dmark) _((void*));
    void (*dfree) _((void*));
    void *data;
};

(ruby.h)

data is a pointer to the user defined struct, dfree is the function used to free that user defined struct, and dmark is the function that does the “mark” of mark and sweep.

Because explaining struct RData is still too complicated, for the time being let’s just look at its representation (figure 8). The detailed explanation of its members will be introduced after we finish chapter 5 “Garbage collection.”

figure 8: Representation of `struct RData`
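The same representation can also be written down as a short C sketch. Everything in it is a hypothetical stand-in: struct toy_data mirrors the three members of struct RData without the RBasic header, and struct point, point_free(), toy_wrap(), toy_release() and the points_freed counter are invented for the illustration. In particular, toy_release() only imitates what a collector might do with dfree. It is not ruby's garbage collector.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct RData (without the RBasic header). */
struct toy_data {
    void (*dmark)(void *);  /* would be called during the mark phase */
    void (*dfree)(void *);  /* called to free the user defined struct */
    void *data;             /* the user defined struct itself */
};

/* A user defined struct that the wrapper knows nothing about. */
struct point {
    int x, y;
};

int points_freed = 0;  /* only here so that the effect of dfree is observable */

void point_free(void *p)
{
    free(p);
    points_freed++;
}

/* Wrap an arbitrary pointer together with the functions that manage it. */
struct toy_data *toy_wrap(void *data,
                          void (*dmark)(void *), void (*dfree)(void *))
{
    struct toy_data *d = malloc(sizeof *d);
    assert(d != NULL);
    d->dmark = dmark;
    d->dfree = dfree;
    d->data = data;
    return d;
}

/* Imitates a collector disposing of a wrapper that became garbage. */
void toy_release(struct toy_data *d)
{
    if (d->dfree != NULL) d->dfree(d->data);
    free(d);
}
```

The wrapper never needs to know the size or layout of struct point. It only stores a void pointer and calls the functions it was handed.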
